Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Hans-Peter Grahsl (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- 2024 has been a pretty busy year for Kafka Streams, which is why everyone interested in the project might want to take a look at this extensive KIPs review post by Sophie Blee-Goldman. It features explanations of a number of impactful Kafka Streams-specific KIPs, categorized into API enhancements, task assignment improvements, and monitoring improvements.
- Ever wondered what it takes to run custom Flink jobs on Kubernetes? That’s exactly what Gunnar explores in depth in a two-part hands-on article series. Part one touches on installation and setup, deploying Flink jobs via custom resources, and creating container images, while part two covers fault tolerance and high availability, savepoint management, observability, and UI access.
Event Streaming
- David Arthur started a Substack called Building Apache Kafka. The first two articles discuss how the build infra for the project has evolved over the years and, in particular, why the switch from Jenkins to GitHub Actions “has been a breath of fresh air for the project”.
- Manu Cupcic wrote an in-depth article which explains Kafka transactions twice. First, revisiting how transactional semantics in Kafka work under the covers, then describing why, where, and how WarpStream’s implementation differs.
- Stéphane Derosiaux dives into the notion of lag when working with Kafka: why lag exists and what contributes to it, how offset lag differs from time lag from a metrics perspective, and how all this relates to Little's law.
- In my quest to explore new CLI tools, I recently stumbled upon Yōzefu, an interactive terminal app written in Rust to inspect data in Kafka topics. It offers a SQL-inspired language for fine-grained filtering of records.
Data Ecosystem
- In this community post on the Alibaba blog, which draws from a recent Flink Forward presentation, you get a good first overview of Apache Paimon, its unique advantages, and the use cases where it really shines.
- With “Databases in 2024: A Year in Review”, Andy Pavlo continues an article series which started back in 2021. Lots of insightful facts about database licences and changes thereof, big vendor fights, the rise of DuckDB, and a bunch of random happenings related to product releases, acquisitions, funding, and deaths of companies in the DB space, all paired with Andy’s own opinions, make this a great and joyful read.
- Apache Flink committer and PMC member Jark Wu recently wrote two articles about the open-sourced FLink Unified Streaming Storage (Fluss) project. In the first article, he addresses the rationale behind it by discussing major challenges when trying to build real-time analytics on top of Kafka. The second post dives deeper into Fluss' architecture and points to the project's roadmap to get a taste of where it's heading.
Data Platforms and Architecture
- Junaid Effendi has written about the tech stacks of companies like Netflix and Stripe in the past. His most recent article provides insights into Pinterest’s Data Stack and references further materials such as blog posts and presentations to dive deeper into their specific use of selected technologies.
- Uber Engineering shared how they adopted Ray (a general compute engine for Python designed for ML, AI, and other algorithmic workloads) to optimize their rides business. They explain their motivation to combine the strengths of Ray and Spark to get the best of both worlds.
- SeungMin Lee wrote about Kakao Tech's "Journey with Apache Flink & Flink CDC". The second half of this extensive article touches upon their customization efforts when building on earlier versions of the upstream Flink CDC connector for MySQL.
RDBMS and Change Data Capture
- In one of his recent posts, Phil Eaton provides a nice, hands-on introduction to the inner workings of logical replication in Postgres. The article not only explains the main mechanisms but also references vital parts of the actual source code bringing publications and subscriptions to life behind the scenes.
- Marc Brooker contrasts snapshot isolation and serializability and shares why he believes snapshot isolation paired with strong consistency should be the default for most apps and dev teams when choosing from the database isolation spectrum.
- Wanna spend a bit of time here and there to polish your RDBMS knowledge by revisiting some fundamentals? Here are two very helpful resources shared by Markus Winand: SQL Indexing and Tuning e-Book and modern-sql.com
- Among the many different choices for doing CDC, kuvasz-streamer is a relatively new open-source tool on the block, focusing exclusively on Postgres-to-Postgres scenarios. It’s a zero-dependency app written in Go. Read more about it in their documentation.
- An often neglected challenge when working with CDC and building on top of change event streams is that of respecting transactions. This is why I wrote "Aggregating Change Data Capture Events based on Transactional Boundaries", which shows one way to go about that in the context of Debezium, Kafka, and Flink.
Paper of the Month
In their paper Streaming SQL Multi-Way Join Method for Long State Streams, Jinlong Hu and Tingfeng Qiu introduce UMJoin, a multi-way stream join operator that addresses memory constraints by using an LSM-Tree backend to handle extensive stateful data streams. Experiments on TPC-DS and TPC-H benchmarks demonstrate UMJoin's effectiveness in managing long-state streams and the Two-Step Convert (TSC) method's capability to improve multi-way join query execution.
Events & Call for Papers (CfP)
- Gartner Data & Analytics Summit (Orlando, FL) March 3-5
- Iceberg Summit 2025 (San Francisco, CA) April 8-9 (CfP open until February 9)
- Data Council (Bay Area, CA) April 22-24
- Current 2025 (London, UK) May 20-21 (CfP open until February 17)
- We are Developers (Berlin, Germany) July 9-11 (CfP open until February 3)
New Releases
- Apache Flink CDC 3.3.0
- Debezium 3.0.7.Final and Debezium 3.1.0.Alpha1
- Strimzi 0.45.0
- Kroxylicious v0.9.0
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.
Gunnar (LinkedIn / Bluesky / X / Mastodon / Email)
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)