July 23, 2024 · 6 min read

Comparing Apache Flink and Spark for Modern Stream Data Processing

By Eric Xiao

The ability to harness and act upon data in real time has become a critical differentiator, enabling everything from personalized customer experiences to optimized supply chain management. Traditional batch-oriented approaches to ETL (Extract, Transform, Load) and its variants, ELT and Reverse ETL, struggle to keep up, highlighting their limitations and the need for more agile and scalable solutions.

Apache Flink and Spark Structured Streaming are two leading real-time processing frameworks, each with unique strengths and trade-offs. At Decodable, we chose to build our platform on top of Apache Flink as our real-time data processing engine, enabling our customers to react swiftly to market changes, optimize operations, and deliver better customer experiences. While Flink forms the foundation, Decodable enhances it by integrating with Apache Kafka and Debezium to support dynamic ELT, ETL, stream processing, and data movement workflows.

In this blog post, we will talk about why we picked Flink to be the foundation of our platform from 3 different perspectives:

  1. Architectural design.
  2. (Stateful) stream processing capabilities.
  3. Production readiness.

These aspects are core to real-time data processing and management. They are fundamental to providing an intuitive user experience, with automated scaling, advanced schema management, and stringent security and compliance features.

Architecture Design

While Apache Spark and Apache Flink both support processing real-time data, they do so in fundamentally different ways. These differences are due in large part to their origin and primary use cases. Spark was originally developed as a parallel data processing engine for at-rest data, while Flink was developed as a streaming-first parallel data processing engine. While Spark has developed streaming capabilities and Flink has developed batch capabilities, the depths of these capabilities still differ today.

Processing Latency

Spark is foremost a batch processing engine, with only experimental support for real-time data processing - what they call Continuous Processing. Instead, Spark offers Spark Structured Streaming which executes transformations on small batches of data accumulated in memory at a predefined interval. These microbatches are often lower latency than true batch processing because Spark keeps the job running between batches, thereby eliminating external orchestration and startup overhead, but are still much higher latency than true stream processing systems like Flink. These microbatches are often on the order of tens of seconds during which time no data is output from the job.
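To make this concrete, here is a minimal PySpark Structured Streaming sketch; the Kafka brokers, topic, and checkpoint path are placeholders. The processing-time trigger sets the micro-batch interval, and no results are emitted between triggers:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

# Hypothetical Kafka source; bootstrap servers and topic are placeholders.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

query = (
    events.selectExpr("CAST(value AS STRING) AS value")
    .writeStream.format("console")
    .trigger(processingTime="10 seconds")  # one micro-batch every 10 seconds
    .option("checkpointLocation", "/tmp/checkpoints/microbatch-demo")
    .start()
)
query.awaitTermination()
```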

Flink is a streaming-first processing engine and natively processes each record as it arrives. This results in latency often as low as milliseconds making it suitable for online systems such as threat detection and response, fraud detection, location tracking, gaming leaderboards and match making, inventory management, and more. Rather than relying on microbatch sizes and configuration to control back pressure, Flink has robust flow control and fixed size blocking queues to ensure downstream operators in a job are not overwhelmed.

Note: In later sections we will talk about the different APIs for both Spark and Flink in more detail, but it is worth adding here that in the Flink Table and SQL APIs, some transformations benefit from micro-batching multiple input records due to the nature of how Flink processes data. Micro-batching can be enabled in these APIs via configuration.
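As a rough sketch, the relevant options are the <span class="inline-code">table.exec.mini-batch.*</span> settings; the exact setter varies by Flink version (older PyFlink releases use <span class="inline-code">get_configuration().set_string(...)</span>), and the latency and size values below are arbitrary:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Buffer input records and flush them either every 2 seconds or every
# 1,000 records, whichever comes first.
config = t_env.get_config()
config.set("table.exec.mini-batch.enabled", "true")
config.set("table.exec.mini-batch.allow-latency", "2 s")
config.set("table.exec.mini-batch.size", "1000")
```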

Fault Tolerance, Failure Handling and Recovery

<div class="side-note">In stream processing, checkpointing is a crucial technique for ensuring data integrity and fault tolerance. It involves periodically saving the state of the stream processing job to durable storage. This allows the system to recover from failures without losing data. Checkpointing, which frequently consumes computational and memory resources, must be managed efficiently to maximize processing time for data.</div>

Spark creates checkpoints by periodically saving the entire state of the streaming application to a reliable storage system like HDFS. As of Spark 3.2, a RocksDB state backend was introduced, which allows Spark to support incremental checkpoints.
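As a hedged sketch, the RocksDB state store provider is enabled through Spark configuration, and the checkpoint location is supplied per query; the application name and paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("checkpointing-demo")
    # Use the RocksDB state store provider (available since Spark 3.2).
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)

# Each streaming query then persists its state and progress to durable storage:
# query = df.writeStream.option("checkpointLocation", "s3://bucket/checkpoints/q1")...
```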

This approach ensures that Spark can recover the entire processing state from these checkpoints in case of a failure. However, Spark's recovery process is generally coarse-grained; it typically involves restarting the whole application from the last successful checkpoint, rather than recovering individual tasks. This recovery process can take much longer and thus impact use cases where low latency is a requirement.

Flink provides native support for incremental checkpoints, built directly into its framework. In addition, Flink provides several failure recovery strategies offering either fine- or coarse-grained recovery of the streaming application. In fine-grained mode, a set of transformations (operators) in the application is identified and independently recovered, enhancing the robustness and efficiency of the recovery process.

Flink’s recovery processes are designed to minimize downtime by using local state recovery if possible. This capability is particularly beneficial in large-scale streaming applications, where restarting entire applications can be costly and time-consuming. Flink’s approach allows for faster recovery and reduces the impact of failures on application performance.
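A minimal PyFlink sketch of enabling periodic checkpointing is shown below; the RocksDB backend and incremental checkpoints are typically configured cluster-side via <span class="inline-code">state.backend: rocksdb</span> and <span class="inline-code">state.backend.incremental: true</span>, and the interval here is arbitrary:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Take a checkpoint every 10 seconds; with the RocksDB backend and
# incremental checkpoints enabled, only changed state is uploaded.
env.enable_checkpointing(10_000)
```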

(Stateful) Stream Processing Capabilities

Language and API Support

Spark supports SQL, Java, Scala, Python, and R. Its main APIs are the RDD, DataFrame and Dataset APIs. It has a layered API providing different levels where each layer is built on top of the previous and provides different flexibility/convenience tradeoffs at each level. 

The lowest level, RDDs, provides users with an interface that follows the functional programming paradigm, where transformations are written in the programming language selected. The DataFrame layer is a higher-level, pandas-like API where transformations are limited to the functions provided by the API, and where users can write programmatic transformations in UDFs. The Dataset API is similar to DataFrames, but with stronger typing constraints. As higher-level APIs, the DataFrame and Dataset APIs are more restrictive than the RDD API, but can be heavily optimized using standard relational algebra optimizations (e.g. join reordering, predicate and projection pushdown, set logic rewrites) that result in more efficient jobs.
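A small illustrative example of the DataFrame layer combined with a programmatic UDF; the data is made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("/checkout", 200), ("/cart", 500), ("/cart", 404)], ["url", "status"]
)

# Custom logic wrapped in a UDF, used alongside built-in DataFrame functions.
@udf(returnType=StringType())
def status_class(code):
    return "error" if code >= 400 else "ok"

df.withColumn("class", status_class(col("status"))).groupBy("class").count().show()
```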

Flink supports SQL, Java, and Python. Similar to Spark, it also provides a layered API. Its main APIs are the Process Function, DataStream and Table APIs. 

The lowest level consists of the DataStream and Process Function APIs, which provide the most flexibility for managing the lifecycle of state. While similar to Spark’s RDD API in offering a programmatic way of writing transformations via the Process Function, the DataStream API also provides a set of higher-level transformations. The Table API is similar to Spark’s DataFrame API, providing a pandas-like interface.
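For comparison, a minimal Table API sketch with made-up data, registering a table and querying it with SQL:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
requests = t_env.from_elements(
    [("/checkout", 200), ("/cart", 500), ("/cart", 404)], ["url", "status"]
)
t_env.create_temporary_view("requests", requests)

# Count failed requests per URL using the SQL layer on top of the Table API.
t_env.sql_query(
    "SELECT url, COUNT(*) AS errors FROM requests WHERE status >= 400 GROUP BY url"
).execute().print()
```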

Processing Modes

Spark supports multiple output modes for processing. Append mode outputs an append-only data stream designed for simple transformation jobs. Update mode outputs events that can insert, update, or delete records in a target system, which is necessary for values that can change as new data is processed. A simple example of this is a count of all failed HTTP requests by URL. Finally, Complete mode outputs the entire result table on every trigger. Developers must specify the proper output mode for their job.
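A hedged sketch of that failed-request count in Spark, where the output mode must be chosen explicitly; the <span class="inline-code">requests</span> streaming DataFrame and checkpoint path are assumptions:

```python
from pyspark.sql.functions import col

failed_by_url = (
    requests.filter(col("status") >= 400)  # `requests` is a streaming DataFrame
    .groupBy("url")
    .count()
)

query = (
    failed_by_url.writeStream
    .outputMode("update")  # emit only the rows whose counts changed this batch
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/failed-by-url")
    .start()
)
```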

Flink supports the equivalent of Spark’s Append and Update modes, but makes it easy for users by automatically selecting the correct mode based on the job. If, for example, a job simply transforms data, an append-only output is emitted whereas the error count example above would output a change stream (equivalent to Update mode). 

Stateful Processing Expressiveness

<div class="side-note">Stateful stream processing is the ability for a stream processing engine to remember and use data across events. This state can be simple, like the total count of records seen by the job, or very complex, like a table of records to join a stream against. State must be kept consistent with input data for correct, reproducible processing. As a result, it must be saved between job restarts, upgrades, and failures.</div>

Spark supports stateful stream processing; however, there are some limitations. Spark only provides limited stateful streaming capabilities via its arbitrary stateful processing functions, which are still marked as experimental features. This limits the streaming use cases that can be expressed natively in Spark and requires bespoke implementations or even a separate microservice.

Flink supports stateful stream processing with no restriction on the number of stateful operators in a single job. This approach greatly reduces the management complexity and is a lot more cost-effective in resource consumption.

Unlike Spark, Flink also provides many powerful lower-level processing primitives that make stream processing feel more natural and intuitive. These features include expressive windowing and join functionality, timers, and direct state access.
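To give a flavor of those primitives, here is a hedged PyFlink sketch of a keyed process function maintaining a per-key count in managed, checkpointed state; the input data is made up, and exact imports can vary slightly between Flink versions:

```python
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

class CountPerKey(KeyedProcessFunction):
    def open(self, runtime_context: RuntimeContext):
        # Managed, checkpointed state: one counter per key.
        self.count_state = runtime_context.get_state(
            ValueStateDescriptor("count", Types.LONG())
        )

    def process_element(self, value, ctx):
        current = (self.count_state.value() or 0) + 1
        self.count_state.update(current)
        yield value[0], current

env = StreamExecutionEnvironment.get_execution_environment()
events = env.from_collection(
    [("cart", 1), ("checkout", 1), ("cart", 1)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]),
)
events.key_by(lambda e: e[0]).process(
    CountPerKey(), output_type=Types.TUPLE([Types.STRING(), Types.LONG()])
).print()
env.execute("stateful-count")
```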

Watermarking for Handling Out of Order and Late Arriving Data Semantics

<div class="side-note">Watermarks are a mechanism to track the progress of time as data flows through a stream processing application. Watermarks are used by the operations such as the windowing or aggregation functions to handle out-of-order or late arriving data. Watermarks are also used to determine what data should remain or be removed from the state of the application. A simple, yet useful watermarking strategy is allowing a fixed amount of lateness.</div>

Spark’s watermarking uses a global watermark if multiple streams are involved, which is the minimum of the watermarks across all input streams. Watermarks are calculated based on the latest event seen per stream, encompassing all partitions, which can be inefficient in scenarios where streams and partitions vary significantly in their data arrival times.

For handling late arriving data, Spark provides a late threshold defined on the watermark for the given input stream. If the events fall outside the watermark + late threshold for any operation, the events will be dropped. An example of late arriving data would be mobile activities happening offline (event time) that are only sent once the user is back online (processing time).
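A sketch of Spark’s late threshold, assuming a streaming DataFrame named <span class="inline-code">events</span> with <span class="inline-code">event_time</span> and <span class="inline-code">user_id</span> columns; the durations are arbitrary:

```python
from pyspark.sql.functions import col, window

windowed_counts = (
    events
    .withWatermark("event_time", "10 minutes")  # tolerate up to 10 minutes of lateness
    .groupBy(window(col("event_time"), "5 minutes"), col("user_id"))
    .count()
)
```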

Flink offers flexible watermarking options. In addition to global watermarks, Flink allows custom watermark generation at both the source, such as in Kafka partitions, and during stream processing with operators like aggregations and joins.

Flink handles late arriving data differently and provides more flexibility by exposing an allowedLateness function at the operator level. These operators include SQL aggregations and window operations.

Flink also includes a feature for handling input streams that are idle or have a low volume of data, via mechanisms such as unaligned checkpoints and withIdleness (see the sketch after the list below). This is particularly useful for handling two types of input streams:

  1. Following Kimball data modelling concepts, dimensional data that is joined to enrich the main stream holding the fact data.
  2. Handling data skew, where some partitions of a Kafka topic are empty (idle).
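As a hedged sketch, a per-source watermark strategy with bounded out-of-orderness plus idleness handling would look like this; it would then be passed to <span class="inline-code">env.from_source(...)</span> for, say, a Kafka source, and the durations are arbitrary:

```python
from pyflink.common import Duration
from pyflink.common.watermark_strategy import WatermarkStrategy

# Tolerate 30 seconds of out-of-orderness; mark a partition idle after
# 1 minute without data so it does not hold back the watermark.
watermark_strategy = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(30))
    .with_idleness(Duration.of_minutes(1))
)
```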

Examples of where these features in Flink make stream processing feel natural and simplify the application would be:

  1. Enriching credit card transactions (facts) with information about the credit card holder (dimensions) or information about the transactions (dimensions) for providing better insights for a budgeting application.
  2. Enriching web browsing data (events) with user data (dimensions) to power targeted advertisements.

Spark does not offer good mechanisms for handling such input streams. It requires you either to treat them as static datasets that are updated out of band of the main streaming application, which is inefficient and a waste of resources, or to store them in an additional system like Cassandra, which complicates the architecture and its maintenance.

Here is a diagram comparing the high-level architecture of a Spark Streaming vs. a Flink application for enriching credit card transactions with customer and merchant information. For a Spark application, we have much more complexity due to the nature of how Spark handles different types of datasets and its lack of native CDC support, which we will describe in the next section.

Change Data Capture (CDC) Support

<div class="side-note">CDC is a commonly used technique for real-time data integration across different data systems.</div>

Spark Streaming currently does not have a built-in CDC connector framework. To use CDC, Spark can only consume change data captured by external tools (such as Debezium) that publish to Kafka, which Spark then processes. This method requires setting up an additional component to manage the CDC logic before the data gets ingested by Spark.

Flink offers a robust set of native CDC connectors built on top of Debezium. Flink's CDC connectors can directly connect to databases and capture changes without the need for additional CDC tools like Kafka and Kafka Connect. These connectors handle snapshot state and continue to capture row-level changes, allowing for exactly-once processing semantics and efficient backfilling capabilities.
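As a hedged sketch (the connection details are placeholders, and the appropriate Flink CDC connector jar, e.g. for MySQL, must be on the classpath), a CDC table can be declared directly in Flink SQL:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Reads the initial snapshot, then streams row-level changes from the binlog.
t_env.execute_sql("""
    CREATE TABLE customers (
        id BIGINT,
        name STRING,
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'mysql.example.com',
        'port' = '3306',
        'username' = 'flink',
        'password' = '******',
        'database-name' = 'shop',
        'table-name' = 'customers'
    )
""")
```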

Production Readiness

Backpressure

Spark employs a backpressure mechanism that can be manually enabled and tuned through configuration settings such as <span class="inline-code">spark.streaming.backpressure.enabled</span>, <span class="inline-code">spark.streaming.backpressure.initialRate</span>, <span class="inline-code">spark.streaming.receiver.maxRate</span>, and many more. Such a design calls for deep Spark expertise to fine-tune backpressure behavior, which can often become a never-ending exercise in configuration changes.
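For example, a hedged sketch of how those (DStream-era) backpressure settings are supplied when building the session; the rates shown are arbitrary:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("backpressure-demo")
    .config("spark.streaming.backpressure.enabled", "true")
    .config("spark.streaming.backpressure.initialRate", "1000")  # records/sec to start
    .config("spark.streaming.receiver.maxRate", "10000")         # upper bound per receiver
    .getOrCreate()
)
```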

Flink handles backpressure inherently and automatically without the need for explicit configuration. Flink's design includes built-in flow control mechanisms that adjust processing dynamically, maintaining system stability even under high loads. Flink continuously monitors the processing capabilities and adjusts the data flow accordingly to prevent bottlenecks in data processing, thus ensuring smooth operation of real-time data processing tasks​.

Resource Optimization

Spark gives you the option to schedule an application with dynamic resource allocation. In this mode, Spark will automatically scale horizontally, increasing or decreasing the number of worker pods the application uses. This results in a coarse-grained optimization at the application level, and users are required to manually specify operation-level parallelism using the coalesce or repartition functions. Further resource optimization functionality is offered by vendors like AWS and Databricks.

Flink provides two options for managing the resources of an application: running it in reactive mode or with the Flink autoscaler. Reactive mode functions similarly to Spark’s dynamic allocation, where users can horizontally scale resources at the application level by increasing or decreasing the worker pods given to the entire application. On the other hand, the Flink autoscaler is a novel, cost-effective resource optimizer that is part of the Flink Kubernetes Operator. This optimizer performs fine-grained optimization at the task level, automatically adjusting the parallelism of every task based on incoming records and backlog.

Not only does Flink automatically scale your application, it can also automatically adjust the memory allocation of the worker nodes, which is usually a labor-intensive process of trialing many different memory configuration settings.

Conclusion

In conclusion, we think that Spark Streaming provides many of the building blocks for developing stream processing applications. If you’re already familiar with Spark and have a Spark platform/stack set up, Spark may be a good enough choice to get you started on your stream processing journey. But if you’re looking to build low-latency and stateful complex event processing applications, Flink offers compelling advantages, with a much richer library for expressing your streaming applications in a streaming-first approach.

Summary

This table summarizes the key differences between Apache Spark and Apache Flink in various aspects of streaming data processing:

Eric Xiao

Eric is a data enthusiast, currently working on the data platform team at Decodable. In his previous roles, he created data pipelines to drive analytical and e-commerce use cases for companies like Shopify and Walmart, and worked on the platform teams that powered Apache Spark and Flink applications.
