Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- I had been checking the Apache Flink website regularly until I finally stumbled upon the long-awaited Flink 2.0.0 announcement. Grab a coffee and take some time to read through the many highlights, including but not limited to disaggregated state management, materialized tables, an optimized batch execution mode, and deeper integration with Apache Paimon.
- A recent Flink Community post on the Alibaba blog provides insights into how Flink helps large retail and e-commerce companies deliver real-time personalization to improve the customer experience.
- The folks at Responsive started a really insightful article series a while ago. This time, Almog Gavra walks us through the lifecycle of a Kafka Streams application. He explains several good practices and why it’s important to wire exception handlers and various types of listeners into your KStreams apps (see the first sketch after this list).
- Giannis Polyzos discusses the concept of custom triggers in Apache Flink and shows how to control windowed computations by going beyond the built-in triggers, which only cover standard behaviour (see the trigger sketch after this list).
- My latest article walks you through the process of creating a real-time, multi-stage data pipeline by combining the flexibility of custom Flink jobs written in Java with the convenience and declarative nature of Flink SQL (a minimal sketch of the idea follows after this list).
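If you want a feel for what that wiring looks like in a Kafka Streams app, here is a minimal sketch; topic names, serdes, and the chosen handler reaction are illustrative assumptions, not taken from Almog’s article:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse;

public class WiredStreamsApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wired-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // trivial pass-through topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Decide explicitly what happens when a stream thread dies,
        // instead of relying on the default behaviour.
        streams.setUncaughtExceptionHandler(exception -> {
            // log the exception here, then replace the failed thread
            return StreamThreadExceptionResponse.REPLACE_THREAD;
        });

        // Observe state transitions, e.g. to drive health checks or alerting.
        streams.setStateListener((newState, oldState) ->
                System.out.printf("State changed from %s to %s%n", oldState, newState));

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```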
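To make the custom-trigger idea more concrete, here is a rough sketch of a trigger that fires early once a configurable number of elements has arrived in a window and then fires-and-purges at the end of the event-time window. The class name, threshold, and state handling are my own illustration and may differ from what Giannis shows:

```java
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Fires early once maxCount elements have arrived in a window,
// and fires-and-purges when the window's event-time boundary is reached.
public class CountOrEndOfWindowTrigger<T> extends Trigger<T, TimeWindow> {

    private final long maxCount;

    // per-key, per-window counter kept in Flink state
    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", (ReduceFunction<Long>) Long::sum, LongSerializer.INSTANCE);

    public CountOrEndOfWindowTrigger(long maxCount) {
        this.maxCount = maxCount;
    }

    @Override
    public TriggerResult onElement(T element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
        // always register the end-of-window timer so the final firing still happens
        ctx.registerEventTimeTimer(window.maxTimestamp());

        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        count.add(1L);
        if (count.get() >= maxCount) {
            count.clear();
            return TriggerResult.FIRE; // early firing, keep the window contents
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp() ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
        ctx.getPartitionedState(countDesc).clear();
    }
}
```

You would attach it to a windowed stream via something like `.window(...).trigger(new CountOrEndOfWindowTrigger<>(100))`.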
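And as a taste of the hybrid Java/SQL style, here is a minimal sketch that hands a DataStream over to Flink SQL and back; the sample data, view name, and query are placeholders rather than the actual pipeline from the article:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HybridPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // stage 1: custom Java logic on a DataStream (here just a static sample source)
        DataStream<String> events = env.fromData("click:home", "click:checkout", "view:home");

        // stage 2: hand the stream over to the Table API and continue declaratively in SQL
        tEnv.createTemporaryView("events", tEnv.fromDataStream(events).as("payload"));
        Table counts = tEnv.sqlQuery(
                "SELECT SPLIT_INDEX(payload, ':', 0) AS kind, COUNT(*) AS cnt " +
                "FROM events GROUP BY SPLIT_INDEX(payload, ':', 0)");

        // stage 3: back to a DataStream for further custom processing or sinks
        tEnv.toChangelogStream(counts).print();

        env.execute("hybrid-pipeline");
    }
}
```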
Event Streaming
- Apache Kafka 4.0 was released earlier in March and shipped tons of good stuff. It’s the first major release to run entirely in KRaft mode (ZooKeeper support is gone), features the new consumer group protocol, and offers early access to queue semantics. Check out all the details in the official release announcement.
- Talking about Queues for Kafka (KIP-932), there have been several articles lately addressing this new feature set. While Andrew Schofield provides a concise overview here, Gunnar Morling dives deep into the fundamental underpinnings in his new “Let's Take a Look at…” series.
- Jack Vanlightly examines how Kafka’s replication protocol embraces disaggregation—a separation of control and data planes—unlike more monolithic consensus protocols such as Raft.
- David Arthur wrote about “Build Timeouts” in the context of CI for the Apache Kafka project and shared how they combine the timeout command with thread dumps to tackle the problem of stuck builds.
Data Ecosystem
- Alireza Sadeghi recently shared a comprehensive overview of the open-source data engineering landscape. It’s a great resource to keep track of what’s happening in this rapidly evolving space. While the review article is only published once per year, this repo provides ongoing updates.
- Interested in how the interplay between a columnar format, a high-performance RPC framework, and a SQL-based interface helps overcome the inefficiencies of older row-based data access protocols? Learn more in Dipankar Mazumdar’s article “What is Apache Arrow Flight, Flight SQL & ADBC?” (and see the small JDBC sketch after this list).
- Vu Trinh put together a really insightful walkthrough after spending 8 hours learning about Parquet. The article distills lots of details in a very approachable manner to help readers understand not only the structure of the Parquet file format but also its read/write protocol.
- In this short video celebrating the first official release of Apache Polaris 0.9, Danica Fine takes a look back at the project’s origins and shares what to expect in the future.
- Jark Wu—creator of Fluss—highlights the most exciting features shipped with the latest release in the Fluss 0.6.0 announcement post. Learn more about the new Merge Engine feature for primary key tables, prefix lookup for Delta Join, and column compression.
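Relatedly, to give a flavour of what the SQL-facing side of that stack looks like from client code: with the Arrow Flight SQL JDBC driver on the classpath, plain JDBC code can query a Flight SQL endpoint while result sets travel as columnar Arrow batches under the hood. The endpoint, credentials, and table in this sketch are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FlightSqlJdbcExample {
    public static void main(String[] args) throws Exception {
        // hypothetical Flight SQL endpoint; the driver speaks Arrow Flight under the hood
        String url = "jdbc:arrow-flight-sql://localhost:32010/?useEncryption=false";

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT order_id, amount FROM orders LIMIT 10")) {
            while (rs.next()) {
                System.out.printf("%s -> %s%n", rs.getString("order_id"), rs.getBigDecimal("amount"));
            }
        }
    }
}
```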
RDBMS and Change Data Capture
- In “Life Altering Postgresql Patterns”, Ethan McCue shares 11 useful tips for working with Postgres, from using UUIDs as primary keys all the way to returning JSON objects from queries.
- Reladiff—a fork of the discontinued data-diff project—is a neat tool for diffing large datasets. It supports cross-database and intra-database diffs across a dozen databases.
- Andrea Peruffo blogged about how to write single message transformations (SMTs) in Go for Debezium. Built on top of TinyGo, Chicory, and WASM, this new integration path allows developers to extend CDC processing capabilities by adding custom filters and routes implemented in Go.
- Agus Mahari put together this beginner-friendly article explaining step by step how to set up a CDC pipeline, powered by Debezium and Kafka Connect, between different relational databases and Redpanda.
- “Detect data mutation patterns with Debezium” by Fiore Mario Vitale explores how to build a monitoring dashboard sourced from database activity metrics exposed by Debezium. The examples repository contains all the bits to get going.
Paper of the Month
In Styx: Transactional Stateful Functions on Streaming Dataflows, Kyriakos Psarakis et al. introduce a dataflow-based Stateful Functions-as-a-Service (SFaaS) runtime which supports multi-partition transactions while providing serializable isolation guarantees. They tested Styx with different workloads to demonstrate that it can outperform alternative solutions in throughput by at least one order of magnitude.
Events & Call for Papers (CfP)
- Iceberg Summit 2025 (San Francisco, CA) April 8-9
- Data Council (Bay Area, CA) April 22-24
- Current 2025 (London, UK) May 20-21
- Flink Forward 2025 (Barcelona, Spain) Oct 13-16, CfP (opening in April)
New Releases
- Apache Flink 2.0
- Apache Flink Kubernetes Operator 1.11.0
- Apache Kafka 4.0
- Debezium 3.0.8.Final and 3.1.0.Beta1
- Apache Iceberg 1.8.1
- Apache Polaris 0.9.0 (incubating)
- Fluss 0.6.0
- Kroxylicious v0.11.0
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.