Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling, Robin Moffatt (your editor-in-chief for this edition), and Hans-Peter Grahsl. Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Last week saw the 10th anniversary of Apache Flink celebrated at Flink Forward in Berlin. Ververica announced Fluss, which means “river” in German, and they’ve managed to crowbar it into an acronym too: FLink Unified Streaming Storage. It’s a proprietary project currently but slated for release into open source and donation to the Apache Software Foundation as a standalone project. Fluss is a streaming platform providing both low-latency analytics and Iceberg or Paimon-backed batch analytics. Read more in the announcement blog post as well as my colleague Sharon Xie’s post on LinkedIn.
- The first preview release of Flink 2.0 is out now. Check out the blog post to learn more about what’s new, as well as details of breaking API changes. Also related are the details of changes to Flink’s state management that Yuan Mei shared earlier this year.
- Drasi is an open-source project from Microsoft. It combines CDC with event processing, using Neo4j's Cypher language. In another vindication for open source standards, it supports Debezium's CDC format for output. Read more here and here.
Event Streaming
- New event streaming platforms are ten-a-penny these days, but Iggy.rs stands out for not being Apache Kafka protocol compatible. You can read more about its genesis and philosophy, and its roadmap.
- There’s nothing fishy about this one—Tom Cooper has published a nice tool called Kipper which lists the Apache Kafka KIPs, their age, and relative activity based on mailing list traffic.
- Some useful notes from Jakub Scholz about the forthcoming Apache Kafka 4.0 release, including the work that the Strimzi project is putting into supporting it.
- A write-up from Gil Friedlis of how they migrated their on-premises Kafka, handling 2TB of data a day, to a managed Kafka service, and what re-architecting they did on the way.
- Interesting hands-on exploration from Kir Titievsky of Kafka's KafkaAvroSerializer and KafkaAvroDeserializer—with a bonus discussion about the use of Avro vs other serialisation formats in Kafka.
Data Ecosystem
- Sudhendu Pandey has a useful primer on Polaris catalog and data catalogs in general.
- Instead of drip-feeding the plethora of posts that Jack Vanlightly has been writing about open table formats recently I’m going to list several of them here together for your delectation and enjoyment:
  - A summary of Change Query and CDC Support plus deep-dives covering Iceberg, Paimon, Delta Lake, and Hudi
  - Support for streaming ingest of row-level operations across the formats
  - Table format interoperability, future or fantasy?
  - Plus Jack draws some cool sketches :)
- BigQuery has added support for Iceberg, whilst aspects of the future direction of Iceberg, now that Databricks employs several of the key project members, are playing out on the mailing list.
- An interesting post from Xiangpeng Hao about Parquet pruning in Apache DataFusion. DataFusion is a query engine and worth looking at too if you haven’t already.
Data Platforms and Architecture
- A couple of posts from last year by SmartNews detailing a move from a batch-based, Spark-driven warehouse on Iceberg to a streaming-based, Flink-driven warehouse, the reasoning behind it, and detail around Iceberg housekeeping.
- ngrok is one of my favourite tools to use, and so I was particularly interested to see this article detailing how their data platform is built. The streaming part uses Kafka and Flink, and they hold the data in Iceberg format. There’s a batch component of their platform too, which they’ve also written about.
- Fascinating look at the performance benefits seen by rearchitecting a system to use SQLite instead of Redis. I love the kind of performance analysis and thinking that’s gone into this one, particularly since it’s not a like-for-like switch out, but a different way of architecting reads and writes.
- Discord has become pretty popular in recent years as a chat platform similar to Slack, so you can imagine they have a fair amount of data to deal with. In 2022 they were running 177 Cassandra nodes holding trillions of chat messages. This blog details the troubles they had with this, and their eventual migration to ScyllaDB.
RDBMS and Change Data Capture
- When it comes to CDC I’ve found some people ‘get it’, whilst others are perhaps not even aware of it as a technique and tie themselves up in knots when CDC is the perfect solution. This is what drove my latest blog post: Why Do I Need CDC?
- Yelp migrated from Postgres to MySQL, for the simple reason that Yelp standardises on MySQL and there were fewer and fewer people to support Postgres. This post details the migration process and how they achieved it with zero downtime or data loss.
- At the moment Red Hat are the owners and stewards of the open source Debezium project, and they’ve shared plans to move it to a foundation. This is a positive and healthy move for the community, and it’ll be good to see if it happens.
- The Debezium project is running a survey to find out more about how people use it as well as capture input for future direction.
Paper of the Month
In their paper from June this year, “What Goes Around Comes Around... And Around...”, none other than Michael Stonebraker and Andrew Pavlo look at developments in the data store space in the last 20 years. To give you a flavour of it, the opening page states “Many systems that started out rejecting the RM [relational model] with much fanfare (think NoSQL) now expose a SQL-like interface for RM databases. Such systems are now on a path to convergence with RDBMSs”.
Whether you’ve been working in this area for that long or not (ahem), it’s a useful primer/recap of where we’re at now, what’s worked and what hasn’t—and some things to look for in the future. The authors have clear opinions but ones that are worth giving weight to in my view.
Speaking of Andrew, he and his research group at CMU keep pushing the boundary in terms of where databases are headed. One of the highlights at P99 CONF—an online event on everything performance and low latency—last week was his talk about implementing a key-value store in kernel space using eBPF, yielding very impressive performance gains. We’re eagerly looking forward to the accompanying research paper; in the meantime there’s a summary of his talk here.
Events & Call for Papers (CfP)
- KubeCon / CloudNativeCon NA 2024 (Salt Lake City, UT) November 12-15
- AWS re:Invent (Las Vegas, NV) December 2-6
- NDC London (London, UK) January 27-31, 2025
- Current 2025 (Bangalore, India) Mid-March
- Current 2025 (London, UK) May 20-21
- Current 2025 (New Orleans, LA) October 29-30
New Releases
- Debezium 3.0.0.Final
- Apache Paimon 0.9
- Apache Flink Kubernetes Operator 1.10.0
- Not released at time of writing, but Apache Kafka 3.9.0 is imminent
See you on the Socials?
For those of you mourning the heyday of “Data Twitter”, you might want to check out Bluesky, which erupted over the weekend with a deluge of data folk following figures such as Kelsey Hightower onto the platform. The interface is very similar to Twitter’s (no head scratching over which server to join, sorry Mastodon), and the discourse so far is informative, respectful, and silly.
Bluesky has a concept of “Starter packs”, which are sets of people to follow, and Chris Riccomini has put together a good one of data folk. You’ll find me and my fellow Chronicle editors on Bluesky here: @rmoff.net @gunnar.bsky.social @hpgrahsl.bsky.social
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.
Gunnar (LinkedIn / Bluesky / X / Mastodon / Email)