Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling, Robin Moffatt (your editor-in-chief for this edition), and Hans-Peter Grahsl. Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- As featured in October’s Checkpoint Chronicle, Ververica recently announced Fluss—a streaming storage engine. It’s now been released under the Apache 2.0 licence, and Yaroslav Tkachenko has taken it for a spin.
- Interesting details of how the Chinese logistics provider KYE evaluated different stream processing engines and databases before settling on Apache Flink and OceanBase.
- ChunTing Wu writes up their experience trying out Apache Paimon with Flink, Trino, and StarRocks (and helpfully shares the Docker Compose file so you can try it too).
- A useful peek under the covers of Flink from Lenon Rodrigues, looking at how and when data is transferred between its different processes.
Event Streaming
- Excellent research and analysis from RedMonk’s Kate Holterhoff in Why Message Queues Endure: A History.
- Strimzi 0.44 added auto-rebalancing on cluster scaling—this article from Paolo Patierno has a good overview of how to use it. If you want to know more, also check out this KubeCon NA 2024 conference talk from Jakub Scholz.
- KIP-853 added dynamic controller quorums in Kafka 3.9—this article from Federico Valeri and Luke Chen at Red Hat explains how KRaft clusters can now scale controller nodes without downtime.
- An excellent exploration of Apache Kafka Internals from Moncef Abboud. I love this kind of approach to learning more about a technology—writing about it in public is not only a great way to improve one’s own understanding, but also a way to help others.
- Gohlay is a tool for scheduling the delivery of Kafka messages. I was pretty intrigued by it, and its author gave some really useful background about it in this Reddit post (there’s a rough sketch of the scheduling pattern just after this list).
- Apache Kafka 4.0 is scheduled for release in late January/early February, and the feature freeze just passed. One of the most significant changes is the removal of ZooKeeper, completing the move to KRaft (KIP-833). You can find the release plan here, and also see a list of included JIRAs.
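On the Gohlay item above: here’s a minimal Python sketch (using confluent-kafka) of the scheduled-delivery pattern it implements—the producer stamps a message with a delivery-time header, and a separate scheduler process re-produces it once that time has passed. The GOHLAY header name and timestamp format below are assumptions for illustration; check the Gohlay docs for its actual conventions.

```python
from datetime import datetime, timedelta, timezone

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Target delivery time, one hour from now, carried in a message header.
# NOTE: the header name "GOHLAY" and the timestamp format are assumptions
# for illustration only; see the Gohlay docs for the real convention.
deliver_at = datetime.now(timezone.utc) + timedelta(hours=1)
headers = [("GOHLAY", deliver_at.strftime("%a, %d %b %Y %H:%M:%S GMT").encode())]

producer.produce(
    "orders",                            # destination topic
    key=b"order-42",
    value=b'{"status": "reminder"}',
    headers=headers,
)
producer.flush()

# A scheduler process (Gohlay) scans the topic and delivers the message
# once the header timestamp has passed.
```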
Data Ecosystem
- A detailed and thorough exploration and evaluation of the different types of Data Catalog and their implementations from Onehouse’s Kyle Weller. This is a really messy space and I’m looking forward to when we have the kind of relatively stable state that we now have around the open table formats (OTFs).
- For data engineers, one of the biggest bits of news from AWS re:Invent last week was the launch of Amazon S3 Tables, which hold Apache Iceberg data in a dedicated S3 bucket type. There’s been a fair bit written about them, including simple explanations, hands-on guides, and analysis (plus a quick boto3 sketch after the list below):
- Build a managed transactional data lake with Amazon S3 Tables (AWS)
- A First Look at S3 (Iceberg) Tables (Nikhil Benesch, Materialize)
- AWS S3 Tables?! The Iceberg Cometh (Daniel Beach, Rippleshot)
- AWS S3 Tables and the race for managed storage (Roy Hasson, Upsolver)
- Meet your new data lakehouse: S3 Iceberg Tables (Stanislav Kozlovski)
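If you want to poke at S3 Tables yourself, here’s a rough boto3 sketch of creating a table bucket, a namespace, and an Iceberg table. It assumes a recent boto3 release that includes the s3tables client; the exact parameter names and response shape are my best recollection of the launch docs, and the bucket/namespace/table names are made up, so treat it as a starting point rather than gospel.

```python
import boto3

# Assumes a recent boto3 release that includes the "s3tables" client.
s3tables = boto3.client("s3tables", region_name="us-east-1")

# A table bucket is the new dedicated S3 bucket type for Iceberg tables.
bucket = s3tables.create_table_bucket(name="analytics-tables")
bucket_arn = bucket["arn"]

# Namespaces group tables within a table bucket, much like database schemas.
s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["sales"])

# Create an Iceberg table in the namespace; the service handles table
# maintenance (compaction, snapshot expiry) for you.
s3tables.create_table(
    tableBucketARN=bucket_arn,
    namespace="sales",
    name="orders",
    format="ICEBERG",
)
```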
Data Platforms and Architecture
- I enjoyed this honest writeup from Hex of how a data warehouse got into a bit of a mess over time, and how they rectified it with no downtime to end users.
- If you like database internals you’ll definitely want to read this post from Adam Prout in which he categorises the different types of consensus algorithms that distributed databases use.
- I’m a big fan of graph technology and thoroughly enjoyed this account of how Booking.com use it to detect fraud in real time.
- Details from Prabodh Agarwal of how Toplyne are building a data lakehouse using Kafka and Apache Hudi, including why they chose Hudi over Iceberg and Delta Lake.
RDBMS and Change Data Capture
- Analysis of how Postgres’s OLAP performance has improved over time, by Tomas Vondra. It uses TPC-H and shows a 4x improvement between Postgres 8 and 18.
- I had fun exploring Flink CDC—it’s a pretty cool thing! Also check out details of Wikimedia’s plans for using it. (There’s a small PyFlink sketch at the end of this section.)
- A useful look at how Postgres and Debezium work together, from Arijit Mazumdar.
- Gunnar wrote up details of using failover replication slots in Postgres 17, including a nice demo of their use with Decodable’s Postgres CDC connector.
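To give a flavour of the Flink CDC item above, here’s a minimal PyFlink sketch of a Postgres-backed source table using the postgres-cdc connector. It assumes the flink-sql-connector-postgres-cdc jar is on the classpath and that the source database has logical decoding enabled; the host, credentials, and table are placeholders.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming-mode Table API environment; the postgres-cdc connector jar
# (flink-sql-connector-postgres-cdc) must be on the classpath.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Placeholder connection details; the source database needs
# wal_level = logical for the connector's replication slot.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id INT,
        customer_id INT,
        amount DECIMAL(10, 2),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'postgres-cdc',
        'hostname' = 'localhost',
        'port' = '5432',
        'username' = 'postgres',
        'password' = 'postgres',
        'database-name' = 'shop',
        'schema-name' = 'public',
        'table-name' = 'orders',
        'slot.name' = 'flink_cdc_demo',
        'decoding.plugin.name' = 'pgoutput'
    )
""")

# Print each change event (insert/update/delete) as it arrives.
t_env.execute_sql("SELECT * FROM orders").print()
```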
Papers of the Month
A couple of papers from Amazon to look at this month. First up is their 2022 paper describing the improvements made to Redshift since its launch in 2013; the second, from earlier this year, analyses the workloads seen on Amazon Redshift and looks at how they differ from the canonical simulated TPC workloads often used for benchmarking.
Events & Call for Papers (CfP)
- NDC London (London, UK) January 27-31
- Current 2025 (Bangalore, India) March 19 (CfP open until December 19)
- Current 2025 (London, UK) May 20-21
- Current 2025 (New Orleans, LA) October 29-30
New Releases
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you have.
Gunnar (LinkedIn / Bluesky / X / Mastodon / Email)
Robin (LinkedIn / Bluesky / X / Mastodon / Email)
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)