Introduction to Watermarking in Mule 4
- August 28, 2024
- MuleSoft
Introduction
Watermarking in Mule 4 is a technique that makes data processing efficient by recording the state of a data source at a specific point in time. It prevents data that has already been handled from being reprocessed, so each run deals only with new or updated information. This article explores how watermarking works in Mule 4 connectors, with practical examples and the advantages of using Object Store v2 for persistent data handling.
How Watermarking Works
1. Initial State: When data processing starts, an initial watermark is set. The watermark can be a timestamp, an identifier, or any other marker that represents the data's state at that moment.
2. Data Retrieval: As data is retrieved or polled, it is compared against the existing watermark. This comparison identifies the records that are new or updated since the last recorded state.
3. Processing: Only files or records newer than the existing watermark are processed, which keeps data handling efficient and up to date.
4. Update Watermark: After processing, the watermark is updated to reflect the current state of the data. The updated watermark serves as the reference for the next retrieval cycle.
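The four steps above can be sketched in a few lines of Python. This is only an illustration of the general cycle, not Mule code: the `Record` type, the `poll` function, and the use of `None` to mean "nothing processed yet" are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    """Hypothetical source record carrying a last-updated timestamp."""
    id: int
    updated_at: datetime


def poll(records, watermark):
    """Return (records newer than the watermark, advanced watermark)."""
    # Data retrieval and comparison: keep only records newer than the watermark.
    fresh = [r for r in records if watermark is None or r.updated_at > watermark]
    # Update watermark: move it to the newest record seen in this cycle.
    if fresh:
        watermark = max(r.updated_at for r in fresh)
    return fresh, watermark


def ts(hour):
    """Helper that builds a fixed UTC timestamp for the demo."""
    return datetime(2024, 8, 28, hour, tzinfo=timezone.utc)


source = [Record(1, ts(9)), Record(2, ts(10))]
first_batch, wm = poll(source, None)   # initial state: everything is new
source.append(Record(3, ts(11)))
second_batch, wm = poll(source, wm)    # only record 3 is newer than the watermark

print([r.id for r in first_batch], [r.id for r in second_batch])  # [1, 2] [3]
```

A third call to `poll(source, wm)` with no new records returns an empty list and leaves the watermark unchanged.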
Demonstrating Watermarking
Here’s a simplified example of how you might set up watermarking with the File Connector:
1. Initial Watermark: Store the timestamp of the last processed file.
2. Polling and Comparison:
- Use the File Connector to list files in a directory.
- Compare file timestamps or names with the stored watermark.
3. Process and Update:
- Process files that are newer than the stored watermark.
- Update the watermark with the latest timestamp or filename.
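As a rough sketch of the same list, compare, and update pattern outside Mule, the Python snippet below filters a directory by modification time. The function name `list_new_files` and the file names are hypothetical, and the watermark is assumed to be a modification time in epoch seconds.

```python
import os
import tempfile
from pathlib import Path


def list_new_files(directory, watermark):
    """Return names of files modified after `watermark` and the new watermark."""
    # Polling: list the regular files in the directory with their timestamps.
    candidates = [(p.stat().st_mtime, p.name)
                  for p in Path(directory).iterdir() if p.is_file()]
    # Comparison: keep only files strictly newer than the stored watermark.
    newer = sorted(item for item in candidates if item[0] > watermark)
    # Update: the newest timestamp seen becomes the next watermark.
    new_watermark = newer[-1][0] if newer else watermark
    return [name for _, name in newer], new_watermark


with tempfile.TemporaryDirectory() as tmp:
    # Create two files with fixed modification times so the demo is deterministic.
    for name, mtime in [("a.csv", 1_000), ("b.csv", 2_000)]:
        path = Path(tmp, name)
        path.write_text("data")
        os.utime(path, (mtime, mtime))
    first, wm = list_new_files(tmp, 0.0)

    # A file that arrives later is the only one picked up on the next poll.
    late = Path(tmp, "c.csv")
    late.write_text("data")
    os.utime(late, (3_000, 3_000))
    second, wm = list_new_files(tmp, wm)

print(first, second)  # ['a.csv', 'b.csv'] ['c.csv']
```

The strict `>` comparison means a file that shares its timestamp with the current watermark is skipped. A real implementation would have to decide how to handle such ties.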
Implementing Watermarking Before Data Retrieval
The approach above works well when fetching files from a folder or rows from a database, but it applies the watermark only after the data has been retrieved. The goal is to apply the watermark before retrieval, so that the connector itself fetches only new or updated files or rows.
This article explains how watermarking can be achieved using various Mule 4 connectors.
Default Watermark Features in Mule 4 Connectors
Mule 4 connectors have built-in watermarking features that can be used for efficient data processing. The following connector sources support it:
- File – On New or Updated File
- SFTP – On New or Updated File
- FTP – On New or Updated File
- Database – On Table Row
The sections below look at how watermarking is achieved in these connectors.
File – On New or Updated File
In the File Connector, the watermark mode determines whether files are filtered by their creation time or by their modification time. There are three watermarking options:
- DISABLED – No watermark is applied.
- CREATED_TIMESTAMP – Picks only newly created files in the file location or directory. It can also identify newly created sub-directories and the files within them when used in recursive mode (which can be enabled by selecting the "recursive" checkbox).
- MODIFIED_TIMESTAMP – Picks both newly created and modified files. Recursive mode works here in the same way as in CREATED_TIMESTAMP mode.
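The difference between the three options can be modelled with a small Python sketch. It illustrates only the selection logic described above and is not the connector's implementation: `FileInfo`, `select_files`, and the integer timestamps are hypothetical, and DISABLED is assumed to mean that no timestamp filtering takes place.

```python
from dataclasses import dataclass
from enum import Enum


class WatermarkMode(Enum):
    DISABLED = "DISABLED"
    CREATED_TIMESTAMP = "CREATED_TIMESTAMP"
    MODIFIED_TIMESTAMP = "MODIFIED_TIMESTAMP"


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical file metadata; timestamps are plain integers for clarity."""
    name: str
    created: int
    modified: int


def select_files(files, mode, watermark):
    """Return the names of the files a poll would pick up under the given mode."""
    if mode is WatermarkMode.DISABLED:
        # Assumption: with no watermark, nothing is filtered out by timestamp.
        return [f.name for f in files]
    if mode is WatermarkMode.CREATED_TIMESTAMP:
        return [f.name for f in files if f.created > watermark]
    return [f.name for f in files if f.modified > watermark]


files = [
    FileInfo("old.txt", created=10, modified=10),      # untouched since before the watermark
    FileInfo("edited.txt", created=20, modified=150),  # old file, edited after the watermark
    FileInfo("new.txt", created=120, modified=120),    # created after the watermark
]

by_created = select_files(files, WatermarkMode.CREATED_TIMESTAMP, watermark=100)
by_modified = select_files(files, WatermarkMode.MODIFIED_TIMESTAMP, watermark=100)
print(by_created, by_modified)  # ['new.txt'] ['edited.txt', 'new.txt']
```

In this model, `edited.txt` is picked up only in MODIFIED_TIMESTAMP mode, because it was created before the watermark and changed after it.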
SFTP – On New or Updated File / FTP – On New or Updated File
The SFTP and FTP connectors in Mule 4 also have built-in watermarking capabilities. By tracking and comparing file metadata, they ensure that only new or updated files are processed, which keeps data handling efficient and avoids reprocessing. Adding Object Store v2 makes this incremental processing persistent.
Note: When watermarking is enabled in the SFTP or FTP connector, the watermark data is held in memory. If the application is restarted or redeployed, the memory is cleared and the watermark data is lost, so all files are processed again. To avoid this, developers must use the Object Store v2 feature, which stores the data in AWS DynamoDB behind the scenes and preserves it even after a restart or redeployment. Let's explore this step by step with a sample POC.
On the Anypoint Platform, the Object Store v2 feature is available in the CloudHub runtime, but it is disabled by default.
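The effect of a restart on an in-memory watermark, compared with a persisted one, can be illustrated with the following Python sketch. It is a simplified model under stated assumptions: both classes are hypothetical, a local JSON file stands in for an external persistent store such as Object Store v2, and a "restart" is simulated by creating a new store object.

```python
import json
import tempfile
from pathlib import Path


class InMemoryWatermark:
    """Watermark kept only in process memory; a new instance starts from zero."""

    def __init__(self):
        self._value = 0

    def get(self):
        return self._value

    def set(self, value):
        self._value = value


class PersistentWatermark:
    """Watermark kept in a JSON file, standing in for an external persistent store."""

    def __init__(self, path):
        self._path = Path(path)

    def get(self):
        if self._path.exists():
            return json.loads(self._path.read_text())["watermark"]
        return 0

    def set(self, value):
        self._path.write_text(json.dumps({"watermark": value}))


def run_once(store, file_times):
    """Process the files newer than the stored watermark, then advance it."""
    watermark = store.get()
    new = sorted(name for name, t in file_times.items() if t > watermark)
    if new:
        store.set(max(file_times[name] for name in new))
    return new


files = {"a.csv": 100, "b.csv": 200}

# In-memory: the simulated restart discards the watermark, so files are reprocessed.
run_once(InMemoryWatermark(), files)
after_restart_memory = run_once(InMemoryWatermark(), files)

# Persistent: a new object reads the saved watermark, so nothing is reprocessed.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp, "watermark.json")
    run_once(PersistentWatermark(state_file), files)
    after_restart_persistent = run_once(PersistentWatermark(state_file), files)

print(after_restart_memory, after_restart_persistent)  # ['a.csv', 'b.csv'] []
```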
Conclusion
Watermarking is an essential technique for ensuring efficient data processing in Mule 4. By marking the state of data, it prevents the reprocessing of already handled data and focuses on processing only new or updated information. Understanding the fundamentals of how watermarking works with various Mule 4 connectors sets the stage for implementing more advanced and practical use cases.