Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- I had been checking the Apache Flink website regularly until I finally stumbled upon the long-awaited Flink 2.0.0 announcement. Grab a coffee and take some time to read through the many highlights, including but not limited to disaggregated state management, materialized tables, an optimized batch execution mode, and deeper integration with Apache Paimon.
- A recent Flink Community post on the Alibaba blog provides insights into how Flink helps large retail and e-commerce companies deliver real-time personalization to improve the customer experience.
- The folks at Responsive started a really insightful article series a while ago. This time, Almog Gavra walks us through the lifecycle of a Kafka Streams application. He explains several good practices and why it’s important to wire exception handlers and various types of listeners into your KStreams apps (see the first sketch after this list).
- Giannis Polyzos discusses the concept of custom triggers in Apache Flink and shows how to control windowed computations by going beyond the built-in triggers, which only cover standard behaviour (see the trigger sketch after this list).
- My latest article walks you through the process of creating a real-time, multi-stage data pipeline by combining the flexibility of custom Flink jobs written in Java with the convenience and declarative nature of Flink SQL (a minimal sketch of the idea follows after this list).
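If you want a feel for what that wiring looks like in a Kafka Streams app, here is a minimal sketch; topic names, serdes, and the chosen handler reaction are illustrative assumptions, not taken from Almog’s article:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse;

public class WiredStreamsApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wired-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // trivial pass-through topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Decide explicitly what happens when a stream thread dies,
        // instead of relying on the default behaviour.
        streams.setUncaughtExceptionHandler(exception -> {
            // log the exception here, then replace the failed thread
            return StreamThreadExceptionResponse.REPLACE_THREAD;
        });

        // Observe state transitions, e.g. to drive health checks or alerting.
        streams.setStateListener((newState, oldState) ->
                System.out.printf("State changed from %s to %s%n", oldState, newState));

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```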
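To make the custom-trigger idea more concrete, here is a rough sketch of a trigger that fires early once a configurable number of elements has arrived in a window and then fires-and-purges at the end of the event-time window. The class name, threshold, and state handling are my own illustration and may differ from what Giannis shows:

```java
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Fires early once maxCount elements have arrived in a window,
// and fires-and-purges when the window's event-time boundary is reached.
public class CountOrEndOfWindowTrigger<T> extends Trigger<T, TimeWindow> {

    private final long maxCount;

    // per-key, per-window counter kept in Flink state
    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", (ReduceFunction<Long>) Long::sum, LongSerializer.INSTANCE);

    public CountOrEndOfWindowTrigger(long maxCount) {
        this.maxCount = maxCount;
    }

    @Override
    public TriggerResult onElement(T element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
        // always register the end-of-window timer so the final firing still happens
        ctx.registerEventTimeTimer(window.maxTimestamp());

        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        count.add(1L);
        if (count.get() >= maxCount) {
            count.clear();
            return TriggerResult.FIRE; // early firing, keep the window contents
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp() ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
        ctx.getPartitionedState(countDesc).clear();
    }
}
```

You would attach it to a windowed stream via something like `.window(...).trigger(new CountOrEndOfWindowTrigger<>(100))`.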
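And as a taste of the hybrid Java/SQL style, here is a minimal sketch that hands a DataStream over to Flink SQL and back; the sample data, view name, and query are placeholders rather than the actual pipeline from the article:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HybridPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // stage 1: custom Java logic on a DataStream (here just a static sample source)
        DataStream<String> events = env.fromData("click:home", "click:checkout", "view:home");

        // stage 2: hand the stream over to the Table API and continue declaratively in SQL
        tEnv.createTemporaryView("events", tEnv.fromDataStream(events).as("payload"));
        Table counts = tEnv.sqlQuery(
                "SELECT SPLIT_INDEX(payload, ':', 0) AS kind, COUNT(*) AS cnt " +
                "FROM events GROUP BY SPLIT_INDEX(payload, ':', 0)");

        // stage 3: back to a DataStream for further custom processing or sinks
        tEnv.toChangelogStream(counts).print();

        env.execute("hybrid-pipeline");
    }
}
```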
Event Streaming
- Apache Kafka 4.0 was released earlier in March and shipped tons of good stuff. It’s the first major release to run entirely in KRaft mode (ZooKeeper support is gone), features the new consumer group protocol, and offers early access to queue semantics. Check out all the details in the official release announcement.
- Talking about Queues for Kafka (KIP-932), there have been several articles lately addressing this new feature set. While Andrew Schofield provides a concise overview here, Gunnar Morling dives deep into the fundamental underpinnings in his new “Let's Take a Look at…” series.
- Jack Vanlightly examines how Kafka’s replication protocol embraces disaggregation—a separation of control and data planes—unlike more monolithic consensus protocols such as Raft.
- David Arthur wrote about “Build Timeouts” in the context of CI for the Apache Kafka project and shared how they combine the timeout command with thread dumps to tackle the problem of stuck builds.
Data Ecosystem
- Alireza Sadeghi recently shared a comprehensive overview of the open-source data engineering landscape. It’s a great resource to keep track of what’s happening in this rapidly evolving space. While the review article is only published once per year, this repo provides ongoing updates.
- Interested in how the interplay between a columnar format, a high-performance RPC framework, and a SQL-based interface helps overcome the inefficiencies of older row-based data access protocols? Learn more in Dipankar Mazumdar’s article “What is Apache Arrow Flight, Flight SQL & ADBC?” (and see the small JDBC sketch after this list).
- Vu Trinh put together a really insightful walkthrough after spending 8 hours learning about Parquet. The article distills lots of details in a very approachable manner to help readers understand not only the structure of the Parquet file format but also its read/write protocol.
- In this short video celebrating the first official release of Apache Polaris 0.9, Danica Fine takes a look back at the project’s origins and shares what to expect in the future.
- Jark Wu—creator of Fluss—highlights the most exciting features shipped with the latest release in the Fluss 0.6.0 announcement post. Learn more about the new Merge Engine feature for primary key tables, prefix lookup for Delta Join, and column compression.
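Relatedly, to give a flavour of what the SQL-facing side of that stack looks like from client code: with the Arrow Flight SQL JDBC driver on the classpath, plain JDBC code can query a Flight SQL endpoint while result sets travel as columnar Arrow batches under the hood. The endpoint, credentials, and table in this sketch are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FlightSqlJdbcExample {
    public static void main(String[] args) throws Exception {
        // hypothetical Flight SQL endpoint; the driver speaks Arrow Flight under the hood
        String url = "jdbc:arrow-flight-sql://localhost:32010/?useEncryption=false";

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT order_id, amount FROM orders LIMIT 10")) {
            while (rs.next()) {
                System.out.printf("%s -> %s%n", rs.getString("order_id"), rs.getBigDecimal("amount"));
            }
        }
    }
}
```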
RDBMS and Change Data Capture
- In “Life Altering Postgresql Patterns”, Ethan McCue shares 11 useful tips for working with Postgres, from using UUIDs as primary keys all the way to returning JSON objects from queries.
- Reladiff—a fork of the discontinued data-diff project—is a neat tool for diffing large datasets. It supports cross-database and intra-database diffs across a dozen databases.
- Andrea Peruffo blogged about how to write single message transformations (SMTs) in Go for Debezium. Built on top of TinyGo, Chicory, and WASM, this new integration path allows developers to extend CDC processing capabilities by adding custom filters and routes implemented in Go.
- Agus Mahari put together this beginner-friendly article explaining step by step how to set up a CDC pipeline, powered by Debezium and Kafka Connect, between different relational databases and Redpanda.
- “Detect data mutation patterns with Debezium” by Fiore Mario Vitale explores how to build a monitoring dashboard sourced from database activity metrics exposed by Debezium. The examples repository contains all the bits to get going.
Paper of the Month
In Styx: Transactional Stateful Functions on Streaming Dataflows, Kyriakos Psarakis et al. introduce a dataflow-based Stateful Functions-as-a-Service (SFaaS) runtime which supports multi-partition transactions while providing serializable isolation guarantees. They tested Styx with different workloads to demonstrate that it can outperform alternative solutions in throughput by at least one order of magnitude.
Events & Call for Papers (CfP)
- Iceberg Summit 2025 (San Francisco, CA) April 8-9
- Data Council (Bay Area, CA) April 22-24
- Current 2025 (London, UK) May 20-21
- Flink Forward 2025 (Barcelona, Spain) Oct 13-16, CfP (opening in April)
New Releases
- Apache Flink 2.0
- Apache Flink Kubernetes Operator 1.11.0
- Apache Kafka 4.0
- Debezium 3.0.8.Final and 3.1.0.Beta1
- Apache Iceberg 1.8.1
- Apache Polaris 0.9.0 (incubating)
- Fluss 0.6.0
- Kroxylicious v0.11.0
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.