February 27, 2025

Checkpoint Chronicle - February 2025

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.

Stream Processing, Streaming SQL, and Streaming Databases

Sophie Blee-Goldman put together a comprehensive guide to upgrading topologies of Kafka Streams apps. It emphasizes the importance of proactive planning and adherence to best practices to ensure seamless application evolution.

In "Exploring Apache DataFusion as a Foundation for Streaming Framework", Yaroslav Tkachenko highlights the growing trend of adopting Rust for high-performance data processing and examines the feasibility of utilizing Apache DataFusion, a Rust-based query engine leveraging Apache Arrow, as the core of a stream processing framework.
In a recent interview, Chris Riccomini spoke with Gilad Kleinman (Epsio) about incremental view maintenance. They discuss how Epsio differentiates itself by updating result tables directly within the original database, and why and how it consolidates changes before writing them back to the OLTP system.
Are you a developer who lives and breathes Python? I wrote "A Hands-On Introduction to PyFlink" to help you understand just enough of the underlying building blocks to get started with Python-based stream processing jobs in a practical way.
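
To make the idea of consolidating changes before write-back a bit more concrete, here is a minimal Python sketch. It is not Epsio's implementation; the `consolidate` function and the `(row, weight)` change format (+1 for an insert, -1 for a delete) are assumptions made purely for illustration. The point is that changes which cancel each other out within a batch never need to reach the OLTP system.

```python
from collections import Counter

def consolidate(changes):
    """Net out a batch of (row, weight) changes.

    Hypothetical format: weight +1 means the row was inserted into the
    result, -1 means it was deleted from it.
    """
    net = Counter()
    for row, weight in changes:
        net[row] += weight
    # Rows whose inserts and deletes cancel out are dropped entirely.
    return {row: weight for row, weight in net.items() if weight != 0}

# An update is modeled as a delete of the old row plus an insert of the new one.
changes = [
    (("alice", 10), +1),
    (("alice", 10), -1),
    (("alice", 12), +1),
    (("bob", 5), +1),
]
result = consolidate(changes)
print(result)
```

In this example, four incoming changes collapse into two writes, because the insert and delete of `("alice", 10)` cancel each other.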

Event Streaming

Jason Taylor's "The Curse of Replication Saturation in Apache Kafka" explains why the replication process can become a bottleneck, leading to increased latency and reduced throughput. Mitigation strategies include monitoring replication metrics closely and optimizing related configurations, such as adjusting the number of replication threads and tuning network settings.
KIP-1134 introduces the concept of virtual clusters in Apache Kafka, enabling multiple, isolated logical clusters to coexist within a single physical Kafka cluster. This design allows tenants to share underlying hardware and resources while maintaining separation in terms of topics, configurations, and access controls.
In "How We Run a 5 GB/s Kafka Workload for Just $50 per Hour", Matteo Merli et al. provide an in-depth analysis of cost differences when running Kafka workloads across selected vendors. Their benchmarks show how a leaderless architecture, the elimination of most inter-AZ traffic, and the use of cloud object storage result in significant cost savings.

Data Ecosystem

SeungMin Lee from Kakao’s Data Analytics Platform shares an in-depth experience report about moving data from MySQL to Apache Iceberg using Flink CDC. The article explains many of the design decisions and implementation choices behind their approach and touches on consistency, performance, and error handling.
Apache Paimon has officially released version 1.0.1 stable, after nearly five months of development. Read the release details to learn about Iceberg compatibility, snapshot transactions, performance improvements, Flink / Spark integrations, and several other aspects.

Data Platforms and Architecture

Adam Bellemare critiques the inefficiencies of the traditional medallion architecture and proposes a "shift left" strategy, i.e., producing high-quality, standardized data earlier in pipelines. This approach reduces redundancy, improves accessibility, and optimizes efficiency by eliminating unnecessary downstream transformations and copies.
The talks "Complexity is the Gotcha of Event-driven Architecture" (by David Boyne) and "So You Want to Build An Event Driven System?" (by James Eastham) contain lots of useful insights and practical tips based on real-world experiences with event-driven architectures. Both made it into the 100 most watched software engineering talks of 2024 according to this list.
Ayeshee Patra outlines a step-by-step approach to integrating OLTP databases with Apache Pinot for real-time analytics using Debezium, Amazon MSK, and StarTree Cloud. The guide covers best practices for CDC, partitioning strategies, authentication, and query performance, with the goal of efficient and accurate data ingestion.

RDBMS and Change Data Capture

The talk "7+ million Postgres tables" by Kailash Nadh explores the hacks and decisions that went into what originally seemed like a ridiculous idea but turned out to be a high-performing, cost-effective, and practically zero-maintenance solution to a hard problem.
Ismail Simsek recently created pydbzengine, an open-source Python library to interact with the Debezium embedded engine. This blog post demonstrates how to capture data changes from PostgreSQL using this library and propagate them into DuckDB with dlt (data load tool).
Anthony Accomazzo’s article highlights key challenges when implementing a consistent CDC solution. He describes how his team implemented a variation of Netflix's DBLog design in Elixir using a chunked capture approach with watermark synchronization and PostgreSQL's pg_logical_emit_message.
In a guest post on the Debezium blog, René Rütter shares strategies to enhance the performance of initial snapshots when using Debezium with Oracle databases. By adjusting selected configuration properties, his team reduced snapshot times by 25%.
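
For readers unfamiliar with the watermark-based chunked capture mentioned in Anthony Accomazzo’s article, the following Python sketch illustrates the general DBLog idea. It is not taken from the article: the in-memory event list stands in for the replication stream, and the function name, event tuples, and watermark IDs are all hypothetical. In a real PostgreSQL setup, the low and high watermarks would be written into the log around the chunk query (for example via pg_logical_emit_message), and the reconciliation would run as events arrive.

```python
def reconcile_chunk(log_events, chunk_rows, low, high):
    """Merge one snapshot chunk into an ordered change stream.

    log_events: ordered list of ("watermark", id) or ("change", pk, row)
    chunk_rows: dict of pk -> row, selected between the two watermarks
    """
    output = []
    pending = dict(chunk_rows)
    in_window = False
    for event in log_events:
        if event[0] == "watermark":
            if event[1] == low:
                in_window = True
            elif event[1] == high:
                # Emit the surviving chunk rows before any later log events.
                output.extend(("snapshot", pk, row) for pk, row in pending.items())
                pending = {}
                in_window = False
            continue
        _, pk, row = event
        if in_window:
            # A change seen inside the window supersedes the snapshot row.
            pending.pop(pk, None)
        output.append(("change", pk, row))
    return output

log = [
    ("change", 1, "a1"),
    ("watermark", "lw"),
    ("change", 2, "b2"),
    ("watermark", "hw"),
    ("change", 3, "c2"),
]
chunk = {1: "a1", 2: "b1", 3: "c1"}
merged = reconcile_chunk(log, chunk, low="lw", high="hw")
```

Here the stale snapshot row for key 2 is dropped because a change for that key arrived between the watermarks, while the snapshot row for key 3 is emitted before its later change, so a consumer applying the events in order ends up with the latest state.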

Paper of the Month

Pavan Edara et al. dive into Vortex, a storage engine developed within Google BigQuery to facilitate real-time analytics on streaming data. Key features of Vortex include ACID compliance, a unified API for batch and streaming data, and a highly distributed, regionally replicated architecture optimized for append-focused ingestion of structured and semi-structured data. Currently, Vortex supports petabyte-scale data ingestion in BigQuery, achieving sub-second data freshness and query latency.

Events & Call for Papers (CfP)

Gartner Data & Analytics Summit (Orlando, FL) March 3-5
Current 2025 (Bengaluru, India) March 19
Iceberg Summit 2025 (San Francisco, CA) April 8-9
Data Council (Bay Area, CA) April 22-24
Current 2025 (London, UK) May 20-21

New Releases

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you have.

Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)

Hans-Peter Grahsl

Hans-Peter Grahsl is a Staff Developer Advocate at Decodable. He is an open-source community enthusiast and in particular passionate about event-driven architectures, distributed stream processing systems and data engineering. For his code contributions, conference talks and blog post writing at the intersection of the Apache Kafka and MongoDB communities, Hans-Peter received multiple community awards. He likes to code and is a regular speaker at developer conferences around the world.


