January 30, 2025

Checkpoint Chronicle - January 2025

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Hans-Peter Grahsl (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.

Stream Processing, Streaming SQL, and Streaming Databases

  • 2024 was a pretty busy year for Kafka Streams, which is why anyone interested in the project might want to take a look at this extensive KIP review post by Sophie Blee-Goldman. It explains a number of impactful Kafka Streams-specific KIPs, categorized into API enhancements, task assignment improvements, and monitoring improvements.
  • Ever wondered what it takes to run custom Flink jobs on Kubernetes? That’s exactly what Gunnar explores in depth in a two-part hands-on article series. Part one touches on installation and setup, deploying Flink jobs via custom resources, and creating container images, while part two covers fault tolerance and high availability, savepoint management, observability, and UI access.

Event Streaming

  • Manu Cupcic wrote an in-depth article which explains Kafka transactions twice: first revisiting how transactional semantics in Kafka work under the covers, then describing why, where, and how WarpStream’s implementation differs.
  • Stéphane Derosiaux dives into the notion of lag when working with Kafka: why lag exists and what contributes to it, how offset lag and time lag differ from a metrics perspective, and how all this relates to Little's law.
  • In my quest to explore new CLI tools, I recently stumbled upon Yōzefu, an interactive terminal app written in Rust to inspect data in Kafka topics. It offers a SQL-inspired language for fine-grained filtering of records.
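
For the curious, here is a back-of-the-envelope sketch of how Little's law can connect the two lag flavors mentioned above. It is not taken from Stéphane's article. It assumes a steady state with a roughly constant produce rate, and the function name and the numbers are made up for illustration. Little's law says L = λW: the average number of items in a system equals the arrival rate times the average time an item spends in it. Reading L as offset lag and λ as the produce rate gives W, a rough estimate of time lag.

```python
def estimated_time_lag_seconds(offset_lag: int, produce_rate_per_sec: float) -> float:
    """Estimate time lag via Little's law (W = L / lambda).

    Hypothetical helper: assumes a stable system with a roughly constant
    produce rate, which real topics only approximate.
    """
    if produce_rate_per_sec <= 0:
        raise ValueError("produce rate must be positive")
    return offset_lag / produce_rate_per_sec


# Made-up numbers: 12,000 records behind, producers writing 400 records/s.
print(estimated_time_lag_seconds(12_000, 400.0))  # 30.0
```

In other words, under these assumptions a consumer that is 12,000 records behind is roughly 30 seconds behind. With bursty traffic the estimate can be far off, which is one reason to track time lag directly.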

Data Ecosystem

  • In this community post on the Alibaba blog, which draws from a recent Flink Forward presentation, you get a good first overview of Apache Paimon to understand its unique advantages and the use cases where it really shines.
  • With “Databases in 2024: A Year in Review”, Andy Pavlo continues an article series which started back in 2021. Insightful facts about database licences and changes thereof, big vendor fights, the rise of DuckDB, and a bunch of random happenings related to product releases, acquisitions, funding, and deaths of companies in the DB space, all paired with Andy’s own opinions, make this a great and joyful read.
  • Apache Flink committer and PMC member Jark Wu recently wrote two articles about the open-sourced FLink Unified Streaming Storage (Fluss) project. In the first article, he addresses the rationale behind it by discussing major challenges when trying to build real-time analytics on top of Kafka. The second post dives deeper into Fluss' architecture and points to the project's roadmap to get a taste of where it's heading.

Data Platforms and Architecture

  • Junaid Effendi has previously written about the tech stacks of companies like Netflix and Stripe. His most recent article provides insights into Pinterest’s data stack and references further materials, such as blog posts and presentations, for a deeper dive into their specific use of selected technologies.
  • Uber Engineering shared how they adopted Ray—a general compute engine for Python designed for ML, AI, and other algorithmic workloads—to optimize their rides business. They explain their motivation to combine the strengths of Ray and Spark to get the best of both worlds.
  • SeungMin Lee wrote about Kakao Tech's "Journey with Apache Flink & Flink CDC". The second half of this extensive article touches upon their customization efforts when building on earlier versions of the upstream Flink CDC connector for MySQL.

RDBMS and Change Data Capture

  • Marc Brooker contrasts snapshot isolation and serializability and shares why he believes snapshot isolation paired with strong consistency should be the default for most apps and dev teams when choosing from the database isolation spectrum.
  • Among the many different choices for doing CDC, kuvasz-streamer is a relatively new open-source tool on the block, focusing exclusively on Postgres-to-Postgres scenarios. It’s a zero-dependency app written in Go. Read more about it in their documentation.
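
To make the snapshot isolation versus serializability distinction above a bit more concrete, here is a toy sketch of write skew, the classic anomaly that snapshot isolation permits and serializability rules out. This is not from Marc's post and not how any real database is implemented: `SnapshotStore`, `Transaction`, and the on-call scenario are all hypothetical, chosen purely for illustration.

```python
class SnapshotStore:
    """Toy versioned store that only detects write-write conflicts."""

    def __init__(self, initial):
        self.data = dict(initial)
        self.version = {key: 0 for key in initial}  # commit time of last write per key
        self.clock = 0                              # global commit counter

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.data)  # all reads see this frozen copy
        self.start = store.clock
        self.writes = {}

    def read(self, key):
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Snapshot isolation: abort only if a key we wrote was also
        # committed by someone else after our snapshot was taken.
        for key in self.writes:
            if self.store.version[key] > self.start:
                return False
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.clock
        return True


def go_off_call(txn, me, other):
    """Leave the rota only if the other person is still on call."""
    if txn.read(other):
        txn.write(me, False)


store = SnapshotStore({"alice": True, "bob": True})
t1, t2 = store.begin(), store.begin()  # both take the same snapshot
go_off_call(t1, "alice", "bob")
go_off_call(t2, "bob", "alice")
results = (t1.commit(), t2.commit())

print(results)     # (True, True)
print(store.data)  # {'alice': False, 'bob': False}
```

Both transactions commit because their write sets do not overlap, yet the "at least one person on call" rule is broken. A serializable system would have to abort one of them or order them so that the second one sees the first one's write.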

Paper of the Month

In their paper Streaming SQL Multi-Way Join Method for Long State Streams, Jinlong Hu and Tingfeng Qiu introduce UMJoin, a multi-way stream join operator that addresses memory constraints by using an LSM-Tree backend to handle extensive stateful data streams. Experiments on TPC-DS and TPC-H benchmarks demonstrate UMJoin's effectiveness in managing long-state streams and the Two-Step Convert (TSC) method's capability to improve multi-way join query execution.

Events & Call for Papers (CfP)

New Releases

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear any feedback or suggestions you have.

Gunnar (LinkedIn / Bluesky / X / Mastodon / Email)
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

Hans-Peter Grahsl

Hans-Peter Grahsl is a Staff Developer Advocate at Decodable. He is an open-source community enthusiast and in particular passionate about event-driven architectures, distributed stream processing systems and data engineering. For his code contributions, conference talks and blog post writing at the intersection of the Apache Kafka and MongoDB communities, Hans-Peter received multiple community awards. He likes to code and is a regular speaker at developer conferences around the world.
