Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Robin Moffatt—feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Flink SQL: the Challenges of Implementing a Streaming SQL Engine As the Flink SQL Team Leader at Alibaba, Jark Wu is well placed to discuss the Challenges of Implementing a streaming SQL engine. These slides from his talk at Flink Forward 2023 cover some useful detail, including handling late data, change log events, and more.
- Defense Against the Dark Art of Rebalancing in Kafka Streams Responsive continue their top-notch blog output with this fantastic article from Sophie Blee-Goldman going into the real nuts and bolts of Kafka Streams rebalancing, how it works, the problems it can cause—and what to do about it when it does.
- How to Write Simple and Efficient Flink SQL A useful post from Xiaolin He—one of the committers to Apache Flink and PMC member—in which they summarise their talk from Flink Forward Asia 2022.
- Apache Flink data enrichment patterns An interesting study of the performance and other characteristics of three enrichment patterns (Synchronous, Asynchronous, and Synchronous Cached) in Apache Flink, by AWS' Luis Morales and Lorenzo Nicora.
- Let’s Flink on EKS: Data Lake Primer An introduction to running Apache Flink on EKS as part of an Apache Iceberg-based data lake.
- Stream Processing Basics — Stateless Operations A useful technology-agnostic primer on the types of stateless operations in stream processing, from Redpanda's Dunith Danushka.
- A Beginner's Guide to Sequence Analytics in SQL A really nice introduction to doing time-based analytics in SQL, such as user session analysis. Starts with cleaning the data up with some simple <span class="inline-code">CASE</span> statements and gets progressively more complex, getting onto some nice use of <span class="inline-code">LAG</span> functions to pivot data.
- Yaroslav Tkachenko returns to the newsletter for a second month running with this article that discusses Considerations for Data Stream Materialization framing it in the context of being able to choose only two from the following: Cost vs Performance vs Reusability.
- Getting Started With PyFlink on Kubernetes An excellent hands-on guide from Gunnar running you through the steps for creating your first PyFlink job and running it on Kubernetes.
- As Confluent increases their focus on Apache Flink, it's interesting to see how they now position both Kafka Streams and ksqlDB. Are one or both for the chop, or all three live in glorious harmony? In this article on his own blog, Confluent's Kai Waehner discusses how he sees KStreams and Flink co-existing.
Event Streaming
- Two lists in the “awesome-*” canon this month:
• awesome-kafka-streams an excellent list, recently published, of companies using Kafka Streams in production.
• awesome-public-real-time-datasets A nice collection of realtime datasets, with plenty of free ones and some paid.
- How To Produce Messages With librdkafka A nice write up from Confluent’s Jakub Korab on the internals of librdkafka (which underlies many of the non-Java clients for Apache Kafka).
- Getting started with tiered storage in Apache Kafka An interesting guide from Kafka's PMC chair, Mickael Maison, on how to use the new tiered storage feature (currently in early access, and not recommended for production use yet).
- Unlocking Idempotency with Retroactive Tombstones WarpStream is a new project that offers a Kafka-compatible API with storage of data directly on S3. This blog details how they implement an idempotent producer, which helps them get one step closer to Exactly Once Semantics as offered in Kafka itself.
Change Data Capture
- “Change Data Capture Breaks Encapsulation”. Does it, though? Gunnar examines what data encapsulation means in the context of CDC, whether it actually breaks it—and how data contracts have an important role to play.
- The Debezium connector for JDBC was added earlier this year in the 2.2 release but many people are still unaware of it. It adds a Kafka Connect sink capability to what had to-date only dealt with sources. As you’d expect it’s fully aware of the data structures that Debezium uses, making it easier to stream from one database to another. It’s also Apache 2.0 licensed.
- Can Debezium Lose Events? Not unless you screw things up on the source database side! Gunnar provides a useful explainer on what could, in theory, lead to events being lost—and how to avoid it.
Data Platforms and Architecture
- The disaggregated write-ahead log An interesting thought experiment from Shikhar Bhushan in which they describe S2 (the Stream Store)—the equivalent of S3 for logs. This would be a truly serverless API, single-digit-milliseconds to tail and append, for a practically unlimited number of them with bottomless storage and elastic throughput.
- Durable Execution: Justifying the Bubble A useful roundup from Chris Riccomini on what he describes as “workflow orchestration with transactional state management”, covering technologies including Temporal, Restate, and more.
Data Ecosystem
- Netflix’s journey to an Apache Iceberg–only data lake Fascinating talk from AWS re:Invent 2023 in which Ashwin Kayyoor and Rakesh Veeramacheneni discuss at the tooling that Netflix have built around Apache Iceberg and their project to migrate over a million Apache Hive tables and 300PB of data to Apache Iceberg, yielding multiple benefits including 25% cost reduction.
- RockDB - Java API Performance Improvements Details of some performance analysis work done on using Java with RocksDB, including recommendations for optimisations when using the API
FLIP Foyer
Flink Improvement Proposals, or FLIP for short, are the process of the Flink community for discussing and working towards major changes in the project, such as new features, redesigns, deprecations, etc. Here are some recent FLIPs worth keeping an eye on:
- FLIP-319: Integrate with Kafka's Support for Proper 2PC Participation Based on Kafka’s work towards supporting two-phase commit transactions (KIP-939, see below), this FLIP aims to improve the Flink Kafka sink in regards to exactly-once guarantees (no more data loss in case of Kafka transaction timeouts) and maintainability (currently, the sink relies on Java reflection to adjust some parts of transaction handling in the Kafka client)
- FLIP-356: Support Nested Fields Filter Pushdown Performance improvements are always welcomed, and in this spirit this FLIP aims to allow Flink Table API sources to push down nested filter expressions such as <span class="inline-code">WHERE user.age < 30</span> to the underlying data source, reducing the amount of data which needs to be ingested into the Flink engine, only to filter out the data there
- FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter Seeking to improve Flink’s observability story, this FLIP suggests to add metrics and trace reporters using the OpenTelemetry standard
- [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink While it’s not a formal FLIP yet, this proposal by Leonard Xu must not be missed. With most of its connectors being based on Debezium, Flink CDC plays a vital role for many real-time data integration use cases, and making it part of Apache Flink itself will be a great step for further growing this project and its community.
KIP Corner
As the counterpart to Flink’s FLIPs, Kafka Improvement Proposals, known as KIPs, are how the Apache Kafka community discusses and agrees on new functionality in Kafka. This month we’ve got three under discussion that caught our attention:
- KIP-1008: ParKa - the Marriage of Parquet and Kafka Imagine writing to your Kafka topics and them being directly queryable in your data lake—that’s what this KIP would give you the potential to do. The idea is to have Kafka producers create log segments in the Apache Parquet file format and propagate those segments as-is to object store using special ingestion consumers. It remains to be seen how good of a fit Parquet is for Kafka logs, but with the continued rise of data lakes and the corresponding query engines, this is a highly interesting development worth keeping an eye on.
- KIP-939: Support Participation in 2PC Two-Phase Commit (2PC) is one of those architectural concepts that get some folk all hot-under-the-collar. Without it, there’s no way to guarantee ACID semantics across multiple components of a distributed system, say Apache Kafka and a relational database. This KIP proposes to allow Kafka to participate in distributed transactions managed by an external coordinator, one use case being the integration into stream processing pipeline based on Apache Flink, with Flink being the transaction coordinator and ensuring exactly-once guarantees.
- The third one on our list is actually three 🙂. It’s a set of KIPs all aimed at improving the ops experience with Kafka Connect:
• KIP-995: Allow users to specify initial offsets while creating connectors
• KIP-980: Allow creating connectors in a stopped state
• KIP-875: First-class offsets support in Kafka Connect
Paper of the Month
📄 A Survey on Transactional Stream Processing
Are transactional semantics the next big thing in stream processing? While the jury still is out on this one, this paper by Shuhao Zhang et al. provides an extensive overview on the current status quo of the space, discussing “key concepts, techniques, and challenges to be overcome, in order to ensure reliable and consistent data stream processing”. With 29 pages overall, it’s a somewhat longer read, but it’s time well spent, providing insight into concepts, implementation approaches, and applications of transactional stream processing systems.
Events & Call for Papers (CfP)
- Kafka Summit Bengaluru May 2 (CfP closes January 10th)
- Databricks DATA + AI Summit June 10–13 (CfP closes January 5th)
- Community over Code EU [formerly ApacheCon EU] June 3–5 (CfP closes January 12th)
- Devoxx UK May 8-10 (CfP closes January 12th)
- Open Source Summit NA April 16–18 (CfP closes January 14th)
Thanks to Floor at Aiven for sharing this useful conference list with us.
New Releases
- Apache Flink 1.17.2
- Apache Flink Kubernetes Operator 1.7.0
- Apache Flink Kubernetes Operator 1.6.1
- Apache Kafka 3.6.1
- Apache Kafka 3.5.2
- Debezium 2.5.0.Beta1
Phew, a jam-packed newsletter this month! 😅 We hope you’ve enjoyed this edition, and are all-ears for any feedback or suggestions you’ve got. And to those who celebrate, Happy Holidays!