Structured Query Language (SQL) is one of the most popular languages there is, and for good reason. Chances are, you might be something of a SQL expert yourself. You’ve done a join or two in your day, and you have a pretty good handle on how they work and how to use them effectively. But, have you ever done a join on streaming data, where your data is unbounded and continuous? In this article, I’d like to dive a little deeper into the (relatively) new kid on the block and examine what it means to do joins on streaming data, in contrast to how it works for data at rest.
To understand the differences in how joins are performed in these different worlds, we must delve into the nature of each type of data, understand their specific technical challenges, and explore commonly used approaches to address them.
Data at Rest: Traditional Joins
In traditional databases or data warehouses where data is at rest, joins are performed on static datasets. These datasets are finite, and their complete content and total size are known at the time of the join operation, which dictates how joins are executed and allows for certain optimizations:
1. Join Algorithms: Common algorithms like hash joins, nested loop joins, or sort-merge joins are chosen based on the size and indexing of the data. These algorithms depend on the static nature of the data; any changes prompt a re-execution to obtain updated results.
2. Index Utilization: Databases can leverage indices to speed up join operations, reducing the time complexity significantly (see the sketch after this list). Once created, indices must be updated when data changes, incurring additional time and storage costs.
3. Memory Management: Since the entire dataset is available, databases can optimize memory usage and disk I/O for join operations.
4. Statelessness: Join operations for data at rest do not require the user to actively consider how to maintain state (beyond ensuring the database system has enough resources to execute the join, e.g., memory and disk). They execute across the entire dataset and terminate in a finite amount of time. The result is final unless the join query is run again after the underlying data has changed.
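To ground the comparison, here is what a join on data at rest typically looks like: a plain SQL join over two finite tables, sped up by an index. This is a minimal sketch; the table and column names (customers, orders, customer_id) are illustrative.

```sql
-- Both tables are finite and fully known at query time, so the planner
-- can choose a join algorithm (hash, nested loop, sort-merge) and use
-- any available indices. All names here are illustrative.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;
```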
Streaming Data: New Challenges for Joins
Streaming data, or data in motion, is fundamentally different from data at rest. It is unbounded, continuous, and typically lacks indices. Very often, it is generated in real time or near real time from multiple sources like applications, IoT sensors, log files, and servers performing a wide range of tasks. Several factors make joins across streaming data different from traditional joins:
Windowing
Time-based Windows: Joins in streams often involve windowing, where the join operation is performed over data arriving within a specific time frame. Different approaches can be taken here (one is sketched after this list), including:
- consecutive, equal-sized time intervals (“tumbling windows”), where each input event is processed as part of exactly one window
- “sliding windows” of a specified duration that begin at intervals shorter than the window’s time span, where each input event may be processed in more than one window
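As a concrete illustration, here is a minimal sketch of a window join in Flink SQL using tumbling windows: two events are joined only if they share a key and fall into the same five-minute window. The stream and column names (impressions, clicks, ad_id, row_time) are illustrative.

```sql
-- Join impressions and clicks that land in the same 5-minute tumbling
-- window; window_start and window_end are produced by TUMBLE.
SELECT i.ad_id, i.row_time AS impression_time, c.row_time AS click_time
FROM (
    SELECT * FROM TABLE(
        TUMBLE(TABLE impressions, DESCRIPTOR(row_time), INTERVAL '5' MINUTES))
) i
JOIN (
    SELECT * FROM TABLE(
        TUMBLE(TABLE clicks, DESCRIPTOR(row_time), INTERVAL '5' MINUTES))
) c
ON i.ad_id = c.ad_id
   AND i.window_start = c.window_start
   AND i.window_end = c.window_end;
```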
Handling Late Data: Dealing with out-of-order or late-arriving data is a challenge unique to streaming joins. Apache Flink, for example, uses event time and watermarks to track how far event time has progressed in a given data stream, and therefore when the results for a window can be considered complete. It’s also worth noting that other streaming operators, like aggregations or TopN operations, need similar care with out-of-order events.
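In Flink SQL, the event-time attribute and its watermark are declared as part of the table definition. A minimal sketch, assuming events may arrive up to 30 seconds late; the table name, columns, and connector are illustrative.

```sql
CREATE TABLE clicks (
    user_id    STRING,
    url        STRING,
    click_time TIMESTAMP(3),
    -- Treat click_time as event time; tolerate up to 30 seconds of
    -- out-of-order arrival before considering a window complete.
    WATERMARK FOR click_time AS click_time - INTERVAL '30' SECOND
) WITH (
    'connector' = 'datagen'  -- stand-in source; a real pipeline might read from Kafka
);
```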
Statefulness
Handling Infinite Streams: Since data is continuous and unbounded, streaming systems like Apache Flink or Kafka Streams maintain state to perform joins. This state includes data that has arrived on one stream and is awaiting matching data from another stream.
State Management: Efficient state management is crucial, especially for long window durations or high-throughput streams. A combination of in-memory and disk-based storage approaches may be used depending on the characteristics of the underlying data streams.
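One way Flink SQL keeps the state of a regular (unwindowed) join bounded is a state time-to-live, set via the table.exec.state.ttl option. A minimal sketch for the Flink SQL client, with an illustrative value and illustrative table names; note that expiring state this way can silently drop matches for rows older than the TTL.

```sql
-- Retain join state for one hour; rows that have not matched within
-- that time are evicted and can no longer produce join results.
SET 'table.exec.state.ttl' = '1 h';

SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```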
Join Types
Inner, Outer, and Temporal Joins: Streaming joins come in several varieties, such as temporal joins, where two streams are joined based on the time context of the events. Handling null values in outer joins and late-arriving data are crucial aspects; both are managed effectively by constructs like Flink’s window join, which pairs up elements of two streams that share a key and fall into the same window, behaving like an inner join. Flink provides support for several different join operations (two are sketched after this list):
- Regular: In which any new record on, or change to, either side of the join is visible and affects the entirety of the join result.
- Interval: Returns a simple Cartesian product restricted by the join condition and a time constraint.
- Temporal: Takes an arbitrary table and correlates each row to the corresponding row’s relevant version in a versioned table.
- Lookup: Typically used to enrich a table with data that is queried from an external system.
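Two of these are sketched below in Flink SQL, with illustrative table and column names: an interval join, which lets Flink discard state for rows that can no longer match, and a temporal join, which matches each row against the version of a versioned table that was valid at the row’s event time.

```sql
-- Interval join: each order matches only shipments occurring within
-- four hours after the order, so old state can be expired over time.
SELECT o.order_id, o.order_time, s.ship_time
FROM orders o, shipments s
WHERE o.order_id = s.order_id
  AND s.ship_time BETWEEN o.order_time AND o.order_time + INTERVAL '4' HOUR;

-- Temporal join: pick the exchange rate that was valid at the time
-- each order occurred, from a versioned currency_rates table.
SELECT o.order_id, o.price * r.conversion_rate AS converted_price
FROM orders o
LEFT JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r
  ON o.currency = r.currency;
```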
Resource Management
Handling Skew: Ideally, data in a distributed database is spread evenly across its partitions. Skew describes the degree to which a table’s data is unevenly balanced among those partitions; small amounts of skew are inevitable and generally harmless. When data is at rest, it’s easier to understand and mitigate the causes of skew by slicing and dicing your data (for example, with a query like the sketch below). In stream processing, however, the data is continuously changing, so understanding and managing skew requires the effective use of observability tools like metrics and logs.
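For data at rest, a query as simple as the following sketch can reveal skew; a handful of keys with vastly more rows than the rest will map to hot partitions in a distributed join. The table and key names are illustrative.

```sql
-- Count rows per join key to find hot keys.
SELECT customer_id, COUNT(*) AS rows_per_key
FROM orders
GROUP BY customer_id
ORDER BY rows_per_key DESC
LIMIT 10;
```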
Scalability: Given the unbounded and continuous nature of streaming data, scalability concerns are not limited to managing the sheer size of the join state (although this is certainly a concern); they also include the ability to process high rates and volumes of incoming data. As the number of data sources or their data volume grows, handling peak demand calls for a comprehensive approach to managing resources.
Consistency and Fault Tolerance
Exactly-Once Semantics: When joining data at rest, each row is processed exactly once. For data that is continuously streamed from various sources, however, there are conditions under which the same record may be generated or delivered multiple times. Ensuring that each record affects the result exactly once, despite failures or retries, is crucial for maintaining accuracy in streaming joins. Flink, for example, guarantees that it processes all data from Kafka (or any other replayable, exactly-once source) exactly once with respect to its managed state, despite failures of Flink or Kafka.
Checkpointing: Stream processing platforms often use checkpointing to recover from failures and prevent data loss or duplicated results.
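In Flink, checkpointing is enabled through configuration. A minimal sketch using the Flink SQL client; the interval is illustrative and should be tuned per workload (the equivalent settings can also be placed in the cluster configuration).

```sql
-- Take a consistent snapshot of all operator state every 30 seconds,
-- with exactly-once consistency for that state.
SET 'execution.checkpointing.interval' = '30 s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
```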
Apache Flink’s Handling of Streaming Joins
As a real-time data processing engine, Apache Flink offers sophisticated mechanisms to handle streaming joins:
Event Time vs. Processing Time: Apache Flink can perform joins based on event time (when the event actually occurred) rather than processing time (when the event is processed by the system), which is crucial for accurate joins in streaming data.
Watermarks: Apache Flink uses event time and watermarks to deal with late-arriving data, allowing it to handle out-of-order events effectively in a join.
State Backends: Apache Flink provides different implementations of state backends to efficiently manage the state required for joins over large and continuous data streams.
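For example, the RocksDB state backend spills join state to local disk, allowing state to grow beyond available memory at the cost of some access latency. A minimal sketch for the Flink SQL client; the checkpoint location is illustrative, and the exact option key varies by Flink version (older releases use state.backend instead of state.backend.type).

```sql
-- Keep operator state in RocksDB on local disk rather than on the heap.
SET 'state.backend.type' = 'rocksdb';
-- Durable storage for checkpoints of that state (illustrative location).
SET 'state.checkpoints.dir' = 's3://my-bucket/flink-checkpoints';
```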
Summary
The primary difference between joins on streaming data and on data at rest lies in the handling of continuous, unbounded, and potentially unordered data streams. While joins on data at rest optimize performance using pre-calculated indices and known data sizes, streaming joins require sophisticated state management, windowing techniques, and fault tolerance. Joins over continuously streaming data are also among the more advanced use cases for stream processing, requiring an additional layer of understanding about how to work with data streams in order to join them together efficiently and effectively.
The main benefit of mastering streaming joins is reduced latency. Unlike traditional batch environments, where joins may only be recomputed at fixed intervals, streaming joins offer real-time results, opening up a wide range of new use cases, such as continuously updated search indexes, real-time dashboards, and alerting.