Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling, Robin Moffatt (your editor-in-chief for this edition), and Hans-Peter Grahsl. Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- As featured in October's Checkpoint Chronicle, Ververica recently announced Fluss, a streaming storage engine. It's now been released under the Apache 2.0 licence, and Yaroslav Tkachenko has taken it for a spin.
- Interesting details of how the Chinese logistics provider KYE evaluated different stream processing engines and databases to settle on Apache Flink and OceanBase.
- ChunTing Wu writes up their experience trying out Apache Paimon with Flink, Trino, and StarRocks (and helpfully shares the Docker Compose file so you can try it too).
- A useful peek from Lenon Rodrigues under the covers of Flink looking at how and when data transfer occurs between the different processes.
Event Streaming
- Excellent research and analysis from RedMonk's Kate Holterhoff in Why Message Queues Endure: A History.
- Strimzi 0.44 added auto-rebalancing on cluster scaling; this article from Paolo Patierno has a good overview of how to use it. If you want to know more, also check out this KubeCon NA 2024 conference talk from Jakub Scholz.
- KIP-853 added dynamic controller quorum to Kafka in 3.9; this article from Federico Valeri and Luke Chen at Red Hat explains how KRaft clusters can now scale controller nodes without downtime.
- An excellent exploration of Apache Kafka Internals from Moncef Abboud. I love this kind of approach to learning more about a technology: writing about it in public is not only a great way to improve one's own understanding, but also a way to help others.
- Gohlay is a tool for scheduling messages in Kafka. I was pretty intrigued by it, and its author gave some really useful background about it in this Reddit post.
- Apache Kafka 4.0 is scheduled for release late January/early February, and the feature freeze just passed. One of the most significant changes is the removal of ZooKeeper and marking KRaft as production ready (KIP-833). You can find the release plan here, and also see a list of included JIRAs.
Data Ecosystem
- A detailed and thorough exploration and evaluation of the different types of Data Catalog and their implementations from OneHouse's Kyle Weller. This is a really messy space and I'm looking forward to when we have the kind of relatively stable state that we now have around the open table formats (OTF).
- For data engineers, one of the biggest bits of news from AWS re:Invent last week was the launch of Amazon S3 Tables, which hold Apache Iceberg data in a dedicated S3 bucket type. There's been a fair bit written about them, including simple explanations, hands-on, and analysis:
  - Build a managed transactional data lake with Amazon S3 Tables (AWS)
  - A First Look at S3 (Iceberg) Tables (Nikhil Benesch, Materialize)
  - AWS S3 Tables?! The Iceberg Cometh (Daniel Beach, Rippleshot)
  - AWS S3 Tables and the race for managed storage (Roy Hasson, Upsolver)
  - Meet your new data lakehouse: S3 Iceberg Tables (Stanislav Kozlovski)
Data Platforms and Architecture
- I enjoyed this honest writeup from Hex of how a data warehouse got into a bit of a mess over time, and how they rectified it with no downtime to end users.
- If you like database internals you'll definitely want to read this post from Adam Prout in which he categorises the different types of consensus algorithms that distributed databases use.
- I'm a big fan of graph technology and thoroughly enjoyed this account of how Booking.com use it to detect fraud in real time.
- Details from Prabodh Agarwal of how Toplyne are building a data lakehouse using Kafka and Apache Hudi, including why they chose Hudi over Iceberg and Delta Lake.
RDBMS and Change Data Capture
- Analysis of how Postgres's OLAP performance has improved over time, by Tomas Vondra. It uses TPC-H and shows a 4x improvement between Postgres 8 and 18.
- I had fun exploring Flink CDC; it's a pretty cool thing! Also check out details of WikiMedia's plans for using it.
- A useful look at how Postgres and Debezium work together, from Arijit Mazumdar.
- Gunnar wrote up details of using failover replication slots in Postgres 17, including a nice demo of their use with Decodable's Postgres CDC connector.
Papers of the Month
A couple of papers from Amazon to look at this month. First up, their 2022 paper describing improvements made to Redshift since its launch in 2013, and then their paper from earlier this year analysing the workloads seen on Amazon Redshift and looking at how they differ from the canonical TPC simulated workloads often used for benchmarking.
Events & Call for Papers (CfP)
- NDC London (London, UK) January 27-31
- Current 2025 (Bangalore, India) March 19 (CfP open until December 19)
- Current 2025 (London, UK) May 20-21
- Current 2025 (New Orleans, LA) October 29-30
New Releases
That's all for this month! We hope you've enjoyed the newsletter and would love to hear about any feedback or suggestions you have.
Gunnar (LinkedIn / Bluesky / X / Mastodon / Email)
Robin (LinkedIn / Bluesky / X / Mastodon / Email)
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)