Apache Flink's watermarks are a fundamental concept for managing event time in stream processing, particularly when dealing with out-of-order events and ensuring event time consistency. To understand them deeply, we need to explore their technical aspects and practical applications.
What is a Watermark in Apache Flink?
A watermark is a mechanism that Flink uses to track the progress of event time (the time events actually occurred in the real world, as opposed to the time at which the events are processed by the system) for a data stream. Watermarks are not necessary in every stream processing scenario, but when used they are emitted by Flink into the source data stream. They are not actual data events, but are metadata markers used to track the temporal aspect of the stream as events are processed.
In a data stream, a watermark indicates that there should not be any more events arriving that are older than the watermark's timestamp. It's a declaration saying, "Up to this point in time, we have received all relevant data."
Watermarks serve several critical purposes in Apache Flink:
- Watermarks enable Flink to process data streams based on actual event time rather than the current wall-clock time, and to track the progression of event time for each data source. This is particularly important when events arrive out of order: watermarks tell Flink how far event time has progressed, so time-based operations can produce correct results despite the disorder.
- Watermarks are essential for time-based operations such as tumbling and sliding windows. Flink uses watermarks to determine when to trigger window calculations and emit window results.
- Watermarks help Flink detect late events, which are events whose timestamps are older than the current watermark. By tracking watermarks, Flink lets the application decide whether to wait for late events or to process partial results without them, as sketched below.
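As an illustration of that choice, the DataStream API lets a windowed operator keep window state around briefly for stragglers and divert anything later to a side output. This is a minimal sketch, assuming a readings stream of a hypothetical SensorReading POJO with getSensorId() and getValue() accessors:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Tag for events arriving after watermark + allowed lateness.
final OutputTag<SensorReading> lateTag = new OutputTag<SensorReading>("late-readings") {};

SingleOutputStreamOperator<SensorReading> maxPerMinute = readings
    .keyBy(SensorReading::getSensorId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowedLateness(Time.seconds(30))   // wait briefly for stragglers after the watermark
    .sideOutputLateData(lateTag)         // anything later still goes to the side output
    .reduce((a, b) -> a.getValue() > b.getValue() ? a : b);

// The application decides what to do with the late events.
DataStream<SensorReading> lateReadings = maxPerMinute.getSideOutput(lateTag);
```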
In real-time stream processing, events can arrive out of order due to factors such as network latency and distributed system architecture. Watermarks help Flink manage and process these events correctly according to their event times.
Implementing Watermarks in Flink
Implementing watermarks in Apache Flink involves:
- Defining a Watermark Strategy: Selecting and configuring a watermark generation strategy suitable for your data and use case.
- Custom Watermark Generators: For specific use cases, custom watermark generators can be implemented to control the logic of watermark generation.
Watermark strategies are defined when configuring a Flink source and determine how watermarks are generated for that data stream. The strategy tells Flink how to extract an event's timestamp and how to generate watermarks from those timestamps. While it is possible to implement a custom strategy, Flink comes with pre-implemented watermark strategies to make the process easier.
The forBoundedOutOfOrderness() strategy method can be used in scenarios where the maximum out-of-orderness expected in a stream is known in advance. When using this strategy, you specify the maximum amount of time an element may lag behind the highest timestamp seen so far before it is considered a late event. How late events are handled is up to the implementation of each operator.
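For illustration, here is a minimal sketch of this strategy, assuming a hypothetical ClickEvent POJO that exposes its epoch-millisecond timestamp via getTimestampMillis() and an existing clicks stream of such events:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// Tolerate up to 5 seconds of out-of-orderness: watermarks trail the
// largest timestamp seen so far by 5 seconds.
WatermarkStrategy<ClickEvent> strategy = WatermarkStrategy
    .<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, recordTimestamp) -> event.getTimestampMillis());

DataStream<ClickEvent> withWatermarks = clicks.assignTimestampsAndWatermarks(strategy);
```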
The simplest case for periodic watermark generation is when the data stream’s event timestamps occur in ascending order. In that case, the current timestamp can act as a watermark, because no earlier timestamps will arrive. Although it is rarely used in production environments, the forMonotonousTimestamps() method can be used to implement this strategy. It is only necessary that timestamps are ascending per parallel data source task, for example within each partition when using Kafka. Flink will generate correct watermarks whenever parallel streams are shuffled, unioned, connected, or merged.
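A sketch under the same ClickEvent assumption; the strategy only requires that timestamps ascend within each parallel source instance:

```java
// Suitable when timestamps are strictly ascending per partition, e.g. a
// Kafka topic whose producers attach monotonically increasing timestamps.
WatermarkStrategy<ClickEvent> ascending = WatermarkStrategy
    .<ClickEvent>forMonotonousTimestamps()
    .withTimestampAssigner((event, recordTimestamp) -> event.getTimestampMillis());
```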
Using Watermarks
An Apache Flink job is composed of operators, which are the classes and methods used to convert a source data stream into a sink/destination data stream. Typically there are one or more source operators, a few transformation operators for the actual processing, and one or more sink operators. Watermarks are used throughout this Flink processing pipeline.
Watermarks are generated at source operators and then propagated as part of the data stream to downstream operators. Typically, parallel subtasks of a source generate their watermarks independently, defining the event time at each source. Some operators consume multiple input streams, such as a union or operators following a keyBy() or partition function. For these operators, the current event time is the minimum of their input streams' event times.
Transformation operators make use of watermarks when performing window functions such as Reduce, Aggregate, and Process. The events processed by these functions are determined by window assigners, such as tumbling, sliding, and session windows. As watermarks flow through a transformation operator, the operator advances its own event-time clock and forwards updated watermarks downstream, combining information from multiple input streams where applicable.
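For example, a keyed tumbling event-time window emits its result once the watermark passes the end of the window. A minimal sketch, assuming a hypothetical pageViews stream of PageView objects with a getPageId() accessor:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Counts page views per page in 5-minute event-time windows; each window
// fires when the watermark passes the window's end timestamp.
DataStream<Long> viewsPerPage = pageViews
    .keyBy(PageView::getPageId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new AggregateFunction<PageView, Long, Long>() {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(PageView value, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    });
```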
Finally, sink operators use watermarks to determine when to commit results to external systems, ensuring that results are emitted in event-time order.
Challenges and Considerations
Implementing watermarks in Flink requires careful consideration:
Choosing a Watermark Strategy: The right watermark strategy is crucial for accurate event-time processing and for handling out-of-order events in your streaming application. The choice depends on the nature of the data stream and the application's requirements.
- Bounded out-of-orderness watermarks tolerate events that arrive out of order within a specified time bound; events within the bound are still included in results.
- Ingestion-time watermarks can be used when it is acceptable to timestamp events at the moment Flink ingests them; such timestamps are reliable and consistent, though they do not reflect when the events actually occurred.
- Custom watermark strategies can be implemented to align with the characteristics of your data, such as anticipated patterns or event frequency.
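As a sketch of that last option, the following periodic generator tracks the maximum timestamp seen and emits watermarks that trail it by a fixed bound (essentially what forBoundedOutOfOrderness() provides out of the box), again assuming the hypothetical ClickEvent type:

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

public class BoundedLagGenerator implements WatermarkGenerator<ClickEvent> {
    private static final long MAX_LAG_MS = 5_000L;

    // Highest event timestamp observed so far; initialized so the first
    // watermark emitted is Long.MIN_VALUE.
    private long maxTimestamp = Long.MIN_VALUE + MAX_LAG_MS + 1;

    @Override
    public void onEvent(ClickEvent event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // Called at the configured watermark interval; the watermark trails
        // the maximum seen timestamp by the lag bound.
        output.emitWatermark(new Watermark(maxTimestamp - MAX_LAG_MS - 1));
    }
}

// Plugged in via:
// WatermarkStrategy<ClickEvent> custom =
//     WatermarkStrategy.forGenerator(ctx -> new BoundedLagGenerator());
```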
Watermark Accuracy: Accurate watermarks are crucial for timely and correct processing. Watermarks that advance too aggressively can trigger results prematurely and cause late data to be dropped, while overly conservative watermarks add unnecessary latency.
Event-Time Skew: Watermarks may not be perfectly accurate due to factors such as event-time skew, where clocks across different systems may differ slightly.
Latency vs. Completeness: There is a trade-off between processing latency and result completeness. The choice of watermark strategy should align with the application's latency requirements. A more aggressive watermark strategy (i.e., a shorter duration for bounded time) can lead to faster processing but at the risk of missing late-arriving data.
Tuning Delay: The watermark delay (in periodic watermarks) must be tuned to the degree of out-of-orderness expected in the data. Too much delay leads to higher latency, while too little can produce incomplete results. Care must also be taken to handle idle periods, when a source stream or partition has no incoming events and would otherwise stall watermark progress.
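One built-in mitigation for idle sources is to declare a stream idle after a period of silence. A sketch, reusing the earlier ClickEvent assumption:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// If a source split or partition emits no events for 1 minute, mark it idle
// so it no longer holds back the downstream watermark.
WatermarkStrategy<ClickEvent> strategy = WatermarkStrategy
    .<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withIdleness(Duration.ofMinutes(1));
```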
Summary
Apache Flink watermarks are fundamental to event-time processing and timely decision-making in stream processing applications. By understanding their purpose, generation methods, and usage considerations, developers can leverage watermarks effectively to achieve accurate and timely processing appropriate for their applications.