- Blog
Practical Implementation of Watermarking in Mule 4
- August 28, 2024
- MuleSoft
Building on the foundational understanding of watermarking, this document provides a step-by-step guide to implementing watermarking in Mule 4. By following the outlined procedure, you can ensure that your data processing is both efficient and reliable. Let’s begin with a simple flow to demonstrate watermarking in action.
Create a Simple Flow:
Create a simple flow with “On New or Updated File” and a logger to print the file name.
Enable Watermark Option:
Enable the watermark option, then deploy the POC without enabling Object Store V2, which is not enabled by default, as shown in the following exhibit.
Place a File:
 After deployment, place a file named test-file1.txt at the SFTP location, which gets processed.
Then, create another file named `test-file2.txt` and process it.
Now, it processes only the test-file2.txt and does not reprocess test-file1.txt, though it was still in the source folder, as the watermark feature is enabled.
Next, restart the same application and observe the behavior.
After a restart, both files are fetched from the source folder, even though they were processed before the restart.
This behavior shows that the watermark stores data only in memory, which gets cleared upon restart. To store watermarks persistently, developers must enable the ‘Use Object Store V2’ feature. This way, processed files will not be picked up after restart or redeployment.
Enable Object Store v2:
To store watermark data persistently, enable the Object Store v2 feature.
Restart the application after making the changes.
Since Object Store V2 was not used before the restart, the system will process both files again.
 After deploying the ‘Use Object Store V2’ change, the Object Store option on the Anypoint Platform Runtime Manager displays the relevant Object Store data in a key-value format.
Detailed Explanation of Object Store Partitions in Runtime Manager
If you select the Object Store, you can see the following partitions, which are not editable:
- _pollingSource_watermarkFlow/ids-on-updated-watermark
- _pollingSource_watermarkFlow/recently-processed-ids
- _pollingSource_watermarkFlow/watermark
_pollingSource_watermarkFlow/ids-on-updated-watermark:
This partition involves handling data records that have been updated since the last time the application checked the watermark.
Function: It identifies which records have changed since the last time the application checked the watermark. Checking a watermark involves querying a data source (such as a database or file system) for records modified or created after the watermark timestamp or identifier.
Usage: When the flow polls data, it retrieves the list of records and compares them against the stored watermark to filter out the ones that need processing. For instance, in an SFTP file processing scenario, the file poller checks the modification timestamps of files against the watermark to find out which files are new or updated.
The connector or the Anypoint Platform does not provide visibility on the IDs that have a watermark.
_pollingSource_watermarkFlow/recently-processed-ids:
This partition tracks the IDs or identifiers of recently processed records.
Function: It records which items have been processed, usually storing this information in a persistent store such as an object store. This helps in ensuring that records are not processed multiple times.
Usage: After processing the data, the IDs or other unique identifiers of the processed records get saved here. This way, when the flow runs again, it can check this component to ensure it does not reprocess the duplicate records. For example, in a file polling scenario, it could store filenames or paths to avoid reprocessing them.
The following image shows the last processed file. To verify, create another file in SFTP and validate if the recently processed IDs are updating.
A new file, test-file3.txt, was created and processed.
Now, recently-processed-id must get updated to test-file3.txt from the earlier value of test-file2.txt
_pollingSource_watermarkFlow/watermark:
Function: It represents the point in time or state to which records have been processed earlier. It can be a timestamp, a file modification date, or any other marker that helps identify new or updated records since the last processing.
Usage: The watermark labels serve as a reference point for data retrieval. When polling for new data, records, or files with a timestamp or identifier greater than the watermark, it is considered new or updated. After processing, the watermark is updated to reflect the new state.
Apprehending Binary Serialization in MuleSoft Object Store
In Runtime Manager, the values stored in the Object Store are displayed as [binary value] BINARY. The value is stored in a binary format that is not directly readable or interpretable through the UI.
Mule 4 wraps data values in an internal Mule object format while storing them. This wrapping allows MuleSoft Integration Solutions to handle different data types consistently and manage serialization and deserialization.
Mule 4 uses binary serialisation to convert Mule objects into a persistently stored format, enabling the data to be converted into a binary stream for storage.
MuleSoft Integration Services uses this procedure to serialize and deserialize objects. It translates Mule objects into a binary format for storage and can convert them back when retrieving.
The Runtime Manager’s user interface cannot deserialize the binary data into its original Mule object format. Therefore, it only displays the value as [binary value] BINARY, indicating that it’s in a binary format that cannot be interpreted directly through the UI.
Check the Behaviour After Restart:
The final step is to restart or stop and start the application to check whether the processed files are being reprocessed.
After the restart, none of the files get reprocessed, even after stopping and starting the application, as persistent watermarking is enabled using Object Store V2.
Final Thoughts
Watermarking with Mule 4 connectors and Object Store V2 ensures efficient and accurate data processing. Object Store V2 offers persistent storage for watermark data, preventing unnecessary reprocessing and preserving data integrity even after restarts or redeployments. This method is beneficial for file or database polling scenarios, where processing only the latest changes is essential for system efficiency and accuracy.Want to dive deeper into MuleSoft insights? Make the most of your MuleSoft integrations with ProwessSoft’s expert help. Reach out to our team today to optimize your data processing with effective watermarking techniques.