February 27, 2025 · 5 min read

Checkpoint Chronicle - February 2025

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.

Stream Processing, Streaming SQL, and Streaming Databases

  • Sophie Blee-Goldman put together a comprehensive guide to upgrading topologies of Kafka Streams apps. It emphasizes the importance of proactive planning and adherence to best practices to ensure seamless application evolution.
  • In "Exploring Apache DataFusion as a Foundation for Streaming Framework", Yaroslav Tkachenko highlights the growing trend of adopting Rust for high-performance data processing and examines the feasibility of utilizing Apache DataFusion, a Rust-based query engine leveraging Apache Arrow, as the core of a stream processing framework.
  • In a recent interview, Chris Riccomini spoke with Gilad Kleinman (Epsio) about incremental view maintenance. They chat about how Epsio differentiates itself by updating result tables directly within the original database, and why and how it consolidates changes before writing them back to the OLTP system.
  • Are you a developer who lives and breathes Python? I wrote "A Hands-On Introduction to PyFlink" to help you understand just enough of the underlying building blocks to get started with Python-based stream processing jobs in a practical way; a minimal example follows this list.
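
If you want a quick feel for PyFlink before reading the article, here is a minimal, self-contained DataStream sketch. The pipeline and names are illustrative, not taken from the article:

```python
# A minimal PyFlink DataStream job: read a static collection,
# transform it, and print the results. Requires `pip install apache-flink`.
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# In a real job this would be a Kafka or other streaming source.
stream = env.from_collection(
    ["checkpoint", "chronicle", "february"],
    type_info=Types.STRING(),
)

stream.map(lambda s: s.upper(), output_type=Types.STRING()).print()

env.execute("hello-pyflink")
```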

Event Streaming

  • Jason Taylor's "The Curse of Replication Saturation in Apache Kafka" explains why the replication process can become a bottleneck, leading to increased latency and reduced throughput. Mitigation strategies include monitoring replication metrics closely and optimizing related configurations, such as increasing the number of replication fetcher threads (num.replica.fetchers) and tuning network settings.
  • KIP-1134 introduces the concept of virtual clusters in Apache Kafka, enabling multiple, isolated logical clusters to coexist within a single physical Kafka cluster. This design allows tenants to share underlying hardware and resources while maintaining separation in terms of topics, configurations, and access controls.
  • In "How We Run a 5 GB/s Kafka Workload for Just $50 per Hour", Matteo Meril et al. provide an in-depth analysis to understand cost differences when running Kafka workloads across selected vendors. Their benchmarks show how a leaderless architecture, eliminating most inter-AZ traffic, and using cloud object storage result in significant cost savings.

Data Ecosystem

  • SeungMin Lee from Kakao’s Data Analytics Platform shares an in-depth experience report about moving data from MySQL to Apache Iceberg using Flink CDC. The article explains many design decisions and implementation choices behind their approach and touches on aspects of consistency, performance, and error handling; a sketch of what such a pipeline looks like follows this list.
  • Apache Paimon has officially released stable version 1.0.1 after nearly five months of development. Read the release details to learn about Iceberg compatibility, snapshot transactions, performance improvements, Flink and Spark integrations, and several other aspects.
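
For readers who haven't used Flink CDC before, the following sketch shows the general shape of such a pipeline using PyFlink's Table API with the MySQL CDC connector. The hostnames, credentials, and table names are made up for illustration, and the setup assumes the flink-sql-connector-mysql-cdc jar is on the classpath; it is not Kakao's actual implementation:

```python
# Sketch of a MySQL change-data-capture pipeline with Flink CDC via
# PyFlink's Table API. All connection details are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: stream change events (inserts/updates/deletes) from MySQL.
t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        order_id BIGINT,
        customer_id BIGINT,
        amount DECIMAL(10, 2),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'mysql.example.com',
        'port' = '3306',
        'username' = 'flink_cdc',
        'password' = '...',
        'database-name' = 'shop',
        'table-name' = 'orders'
    )
""")

# Sink: in Kakao's case this would be an Iceberg table; the 'print'
# connector keeps this sketch self-contained.
t_env.execute_sql("""
    CREATE TABLE orders_sink (
        order_id BIGINT,
        customer_id BIGINT,
        amount DECIMAL(10, 2)
    ) WITH ('connector' = 'print')
""")

t_env.execute_sql(
    "INSERT INTO orders_sink SELECT order_id, customer_id, amount FROM orders_cdc"
)
```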

Data Platforms and Architecture

RDBMS and Change Data Capture

  • The talk "7+ million Postgres tables" by Kailash Nadh explores the hacks and decisions that went into what originally seemed like a ridiculous idea, but turned out to be a high performing, cost-effective, and practically zero maintenance solution to a hard problem.
  • Ismail Simsek recently created pydbzengine, an open-source python library to interact with the Debezium embedded engine. This blog post demonstrates how to capture data changes from PostgreSQL using this library and propagate it into DuckDB with dlt (data load tool).
  • Anthony Accomazzo’s article highlights key challenges when implementing a consistent CDC solution. He describes how they implemented a variation of Netflix's DBLog design in Elixir using a chunked capture approach with watermark synchronization and PostgreSQL's pg_logical_emit_message.
  • In a guest post on the Debezium blog, RenĂ© RĂĽtter shares strategies to enhance the performance of initial snapshots when using Debezium with Oracle databases. By adjusting selected configuration properties his team reduced snapshot times by 25%.
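
The low/high watermark trick at the heart of the DBLog approach is easy to sketch. The following is a conceptual illustration in Python with psycopg2 (Accomazzo's implementation is in Elixir, and the table and prefix names here are invented): a marker is written into the WAL before and after each chunk read, so the logical-replication consumer can reconcile chunk rows against change events that interleaved with the read.

```python
# Conceptual sketch of DBLog-style chunked capture with watermarks.
# pg_logical_emit_message() writes a marker into PostgreSQL's WAL, so
# the replication consumer can tell which change events arrived while
# the chunk was being read. Names and SQL are illustrative.
import json
import psycopg2

conn = psycopg2.connect("dbname=shop user=cdc")
conn.autocommit = True

def capture_chunk(last_id, chunk_size=1000):
    with conn.cursor() as cur:
        # Low watermark: change events after this marker may overlap
        # with the rows we are about to read.
        cur.execute(
            "SELECT pg_logical_emit_message(true, 'dblog', %s)",
            (json.dumps({"type": "chunk-low", "after_id": last_id}),),
        )
        cur.execute(
            "SELECT id, customer_id, amount FROM orders "
            "WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, chunk_size),
        )
        rows = cur.fetchall()
        # High watermark: chunk rows seen between the two markers must
        # be deduplicated against change events in the same WAL window.
        cur.execute(
            "SELECT pg_logical_emit_message(true, 'dblog', %s)",
            (json.dumps({"type": "chunk-high", "after_id": last_id}),),
        )
    return rows
```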

Paper of the Month

Pavan Edara et al. dive into Vortex, a storage engine developed within Google BigQuery to facilitate real-time analytics on streaming data. Key features of Vortex include ACID compliance, a unified API for batch and streaming data, and a highly distributed, regionally replicated architecture optimized for append-focused ingestion of structured and semi-structured data. Currently, Vortex supports petabyte-scale data ingestion in BigQuery, achieving sub-second data freshness and query latency.

Events & Call for Papers (CfP)

New Releases


That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you have.


Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

đź‘Ť Got it!
Oops! Something went wrong while submitting the form.
Hans-Peter Grahsl

Hans-Peter Grahsl is a Staff Developer Advocate at Decodable. He is an open-source community enthusiast and is particularly passionate about event-driven architectures, distributed stream processing systems, and data engineering. For his code contributions, conference talks, and blog posts at the intersection of the Apache Kafka and MongoDB communities, Hans-Peter has received multiple community awards. He likes to code and is a regular speaker at developer conferences around the world.
