January 30, 2025

Checkpoint Chronicle - January 2025

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Hans-Peter Grahsl (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.

Stream Processing, Streaming SQL, and Streaming Databases

2024 has been a pretty busy year for Kafka Streams, which is why everyone interested in the project might want to take a look at this extensive KIPs review post by Sophie Blee-Goldman. It explains a number of impactful Kafka Streams-specific KIPs, categorized into API enhancements, task assignment improvements, and monitoring improvements.

Ever wondered what it takes to run custom Flink jobs on Kubernetes? That’s exactly what Gunnar explores in depth in a two-part hands-on article series. Part one touches on installation and setup, deploying Flink jobs via custom resources, and creating container images, while part two covers fault tolerance and high availability, savepoint management, observability, and UI access.

Event Streaming

David Arthur started a Substack called Building Apache Kafka. The first two articles discuss how the build infra for the project has evolved over the years and, in particular, why the switch from Jenkins to GitHub Actions “has been a breath of fresh air for the project”.

Manu Cupcic wrote an in-depth article which explains Kafka transactions twice: it first revisits how transactional semantics in Kafka work under the covers, then describes why, where, and how WarpStream’s implementation differs.

Stéphane Derosiaux dives into the notion of lag when working with Kafka: why lag exists and what contributes to it, how offset lag differs from time lag from a metrics perspective, and how all this relates to Little's law.
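
To make the relation between the two notions of lag concrete, here is a minimal sketch that is not taken from the article. It assumes a steady state in which records are produced and consumed at the same constant rate, so Little's law (L = λW) lets you estimate the time lag W from the offset lag L and the throughput λ. The function names and sample numbers are made up for illustration.

```python
def offset_lag(log_end_offset: int, committed_offset: int) -> int:
    """Number of records in one partition the consumer has not processed yet."""
    return max(log_end_offset - committed_offset, 0)


def estimated_time_lag_seconds(lag_records: int, records_per_second: float) -> float:
    """Little's law: L = lambda * W, hence W = L / lambda (steady state assumed)."""
    if records_per_second <= 0:
        raise ValueError("throughput must be positive")
    return lag_records / records_per_second


if __name__ == "__main__":
    lag = offset_lag(log_end_offset=120_000, committed_offset=90_000)
    # Hypothetical throughput of 5,000 records per second.
    print(lag, estimated_time_lag_seconds(lag, records_per_second=5_000.0))
```

The estimate only holds while throughput is roughly stable. With bursty traffic, the same offset lag can correspond to very different time lags.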

In my quest to explore new CLI tools, I recently stumbled upon Yōzefu, an interactive terminal app written in Rust to inspect data in Kafka topics. It offers a SQL-inspired language for fine-grained filtering of records.

Data Ecosystem

This community post on the Alibaba blog, which draws from a recent Flink Forward presentation, gives you a good first overview of Apache Paimon, its unique advantages, and the use cases where it really shines.

With “Databases in 2024: A Year in Review”, Andy Pavlo continues an article series which started back in 2021. Lots of insightful facts about database licences and changes thereof, big vendor fights, the rise of DuckDB, and a bunch of random happenings related to product releases, acquisitions, funding, and deaths of companies in the DB space, all paired with Andy’s own opinions, make this a great and joyful read.

Apache Flink committer and PMC member Jark Wu recently wrote two articles about the open-sourced FLink Unified Streaming Storage (Fluss) project. In the first article, he addresses the rationale behind it by discussing major challenges when trying to build real-time analytics on top of Kafka. The second post dives deeper into Fluss' architecture and points to the project's roadmap to get a taste of where it's heading.

Data Platforms and Architecture

Junaid Effendi wrote about the tech stacks of companies like Netflix and Stripe in the past. The most recent article provides insights into Pinterest’s Data Stack and references further materials such as blog posts and presentations to dive deeper into their specific use of selected technologies.

Uber Engineering shared how they adopted Ray—a general compute engine for Python designed for ML, AI, and other algorithmic workloads—to optimize their rides business. They explain their motivation to combine the strengths of Ray and Spark to get the best of both worlds.

SeungMin Lee wrote about Kakao Tech's "Journey with Apache Flink & Flink CDC". The second half of this extensive article touches upon their customization efforts when building on earlier versions of the upstream Flink CDC connector for MySQL.

RDBMS and Change Data Capture

In one of his recent posts, Phil Eaton provides a nice, hands-on introduction to the inner workings of logical replication in Postgres. The article not only explains the main mechanisms but also references vital parts of the actual source code bringing publications and subscriptions to life behind the scenes.

Marc Brooker contrasts snapshot isolation and serializability and shares why he believes snapshot isolation paired with strong consistency should be the default for most apps and dev teams when choosing from the database isolation spectrum.
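
For readers new to the distinction, the following toy simulation (not from Marc's post) sketches the classic write skew anomaly that snapshot isolation permits and serializability rules out. The `SnapshotStore` class and the on-call scenario are hypothetical and heavily simplified. Real databases implement this quite differently.

```python
class SnapshotStore:
    """Toy store with snapshot isolation: only write-write conflicts abort."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}

    def begin(self):
        # Each transaction reads from a private snapshot taken at start.
        return {"snapshot": dict(self.data), "seen": dict(self.version), "writes": {}}

    def commit(self, tx):
        # First committer wins: abort if a key we wrote changed after our snapshot.
        for key in tx["writes"]:
            if self.version[key] != tx["seen"][key]:
                return False
        for key, value in tx["writes"].items():
            self.data[key] = value
            self.version[key] += 1
        return True


def go_off_call(tx, doctor):
    """Application rule: leave only if at least two doctors are on call."""
    on_call = [name for name, active in tx["snapshot"].items() if active]
    if len(on_call) >= 2:
        tx["writes"][doctor] = False


def demo():
    store = SnapshotStore({"alice": True, "bob": True})
    t1, t2 = store.begin(), store.begin()
    go_off_call(t1, "alice")
    go_off_call(t2, "bob")
    committed = (store.commit(t1), store.commit(t2))
    return committed, store.data


if __name__ == "__main__":
    print(demo())
```

Both transactions commit because their write sets do not overlap, yet nobody is left on call. A serializable execution would have forced one of them to see the other's write.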

Wanna spend a bit of time here and there to polish your RDBMS knowledge by revisiting some fundamentals? Here are two very helpful resources shared by Markus Winand: SQL Indexing and Tuning e-Book and modern-sql.com

Among the many different choices for doing CDC, kuvasz-streamer is a relatively new open-source tool on the block focusing exclusively on Postgres-to-Postgres scenarios. It’s a zero-dependency app written in Go. Read more about it in their documentation.

An often neglected challenge when working with CDC and building on top of change event streams is that of respecting transactions. This is why I wrote "Aggregating Change Data Capture Events based on Transactional Boundaries", which shows one way to go about that in the context of Debezium, Kafka, and Flink.
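
As a rough illustration of the general idea, and not the Flink-based approach from the article, the sketch below buffers change events per transaction and emits them as one batch once the transaction's END marker has arrived and all of its events are present. The event shapes are simplified assumptions loosely modeled on transaction metadata. The field names `kind`, `tx_id`, and `event_count` are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def aggregate_by_transaction(events: Iterable[dict]) -> Iterator[tuple[str, list[dict]]]:
    """Yield (tx_id, change_events) once a transaction is known to be complete."""
    buffered: dict[str, list[dict]] = defaultdict(list)
    expected: dict[str, int] = {}

    for event in events:
        tx_id = event["tx_id"]
        if event["kind"] == "END":
            # The END marker tells us how many change events belong to the transaction.
            expected[tx_id] = event["event_count"]
        elif event["kind"] == "CHANGE":
            buffered[tx_id].append(event)
        # BEGIN markers carry no data in this sketch and are ignored.

        if tx_id in expected and len(buffered[tx_id]) == expected[tx_id]:
            expected.pop(tx_id)
            yield tx_id, buffered.pop(tx_id)


if __name__ == "__main__":
    stream = [
        {"kind": "BEGIN", "tx_id": "tx1"},
        {"kind": "CHANGE", "tx_id": "tx1", "table": "orders", "op": "c"},
        {"kind": "CHANGE", "tx_id": "tx2", "table": "customers", "op": "u"},
        {"kind": "END", "tx_id": "tx1", "event_count": 2},
        # Markers and data may arrive out of order, e.g. from different topics.
        {"kind": "CHANGE", "tx_id": "tx1", "table": "order_lines", "op": "c"},
        {"kind": "END", "tx_id": "tx2", "event_count": 1},
    ]
    for tx_id, changes in aggregate_by_transaction(stream):
        print(tx_id, [c["table"] for c in changes])
```

The sketch keeps all state in memory and never expires incomplete transactions. A production pipeline would need durable state and a timeout policy.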

Paper of the Month

In their paper Streaming SQL Multi-Way Join Method for Long State Streams, Jinlong Hu and Tingfeng Qiu introduce UMJoin, a multi-way stream join operator that addresses memory constraints by using an LSM-Tree backend to handle extensive stateful data streams. Experiments on TPC-DS and TPC-H benchmarks demonstrate UMJoin's effectiveness in managing long-state streams and the Two-Step Convert (TSC) method's capability to improve multi-way join query execution.

Events & Call for Papers (CfP)

Gartner Data & Analytics Summit (Orlando, FL) March 3-5
Iceberg Summit 2025 (San Francisco, CA) April 8-9 (CfP open until February 9)
Data Council (Bay Area, CA) April 22-24
Current 2025 (London, UK) May 20-21 (CfP open until February 17)
We are Developers (Berlin, Germany) July 9-11 (CfP open until February 3)

New Releases

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.

Gunnar (LinkedIn / Bluesky / X / Mastodon / Email)
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)

Hans-Peter Grahsl

Hans-Peter Grahsl is a Staff Developer Advocate at Decodable. He is an open-source community enthusiast and in particular passionate about event-driven architectures, distributed stream processing systems and data engineering. For his code contributions, conference talks and blog post writing at the intersection of the Apache Kafka and MongoDB communities, Hans-Peter received multiple community awards. He likes to code and is a regular speaker at developer conferences around the world.
