Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Robin Moffatt (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- A couple of Flink performance-related blogs from last year, with useful details of how to extract Flink Flame Graph data for offline analysis and tips on analysing and tuning Flink SQL queries.
- In From Samza to Flink: A Decade of Stream Processing Chris Riccomini takes us through the history of stream processing and his opinionated take on its direction today. Given that he was there at the time of Samza and the inception of Kafka Connect, he is definitely qualified to be writing about it. I enjoyed reading the article, and whilst I don’t agree with all the points, it’s well written.
- In a similar vein to the previous item, there are some interesting thoughts (if understandably biased ones) about streaming databases and SQL in this interview with Materialize’s Nikhil Benesch that Yaroslav Tkachenko published.
- I’ve been continuing my journey of learning Flink SQL:
- Understanding how to install and troubleshoot JARs for Flink took me a while.
- I also took a look at the REST API of the SQL Gateway.
- Finally, I wrapped all my learnings so far up into a talk for Kafka Summit London which I gave this week. You can find the slides here and the associated code to run Flink locally with Kafka and Iceberg on GitHub.
- My colleague Sharon Xie did an excellent talk, also at Kafka Summit London, on event time and watermarks in Flink.
Event Streaming
- A useful exploration of what Kafka’s at-least-once delivery guarantees look like in practice with some troubleshooting along the way.
- Here’s a really nice performance analysis blog that does a good job tracking down tail latencies in Kafka.
- An interesting writeup of how WarpStream has implemented support for topic compaction.
Change Data Capture
- Gunnar published a blog last week discussing a Taxonomy Of Data Change Events, and presented at Kafka Summit London this week his new talk Data Contracts In Practice With Debezium and Apache Flink.
- A new set of examples from Gordon Murray showing how to use Debezium with WarpStream.
Data Platforms and Architecture
- One of my favourite talks at Kafka Summit London was a lightning talk from Gleb Shipilov in which he described the streaming platform at Exness. You can read the full details over the course of four blogs, including the evaluation that led them to choose Flink over Spark and ksqlDB.
- This article is from last year, but still an interesting account of how (and why) Instacart moved their Flink platform off AWS EMR and onto self-managed Flink using Kubernetes (on AWS EKS).
- I really enjoyed this article on how Figma horizontally scaled their Postgres-based architecture. It includes a really good explanation of why they stuck with Postgres despite the “obvious” alternatives of scaling with NoSQL (turns out relational databases are useful!) or adopting one of the newer technologies such as Spanner (too much new to learn, timescales too tight).
- More Postgres architecture, in this article describing how Cloudflare operates distributed Postgres clusters.
Data Ecosystem
- Two good articles about real-world database administration, with GitHub.com’s migration from MySQL 5.7 to 8.0, and the UK Government Digital Service’s migration of their PostgreSQL database within AWS.
- The ecosystem around open-table formats (OTF) continues to be an interesting one, with the project formerly known as OneTable now becoming Apache XTable (incubating). This move to an ASF project I think is really important as it defines a well-known and respected governance model. It will still be interesting to see how OTFs play out in the coming months and years—will we end up with one format simply winning outright, or will the cross-format support that XTable (and to a slight extent Databricks’ Uniform) provide give the best of all worlds? Personally, I’m sceptical; I think the formats will slug it out and in the end, one or two will survive—perhaps for different use cases—and the others will either pivot or die.
- LinkedIn has quite the pedigree for open-source projects (Apache Kafka, anyone?) and this newly open-sourced one looks interesting. OpenHouse is described as a control plane for managed tables, and includes a catalog and services for managing Iceberg tables.
- I always enjoy reading my former colleague Jack Vanlightly’s writing, and this recent article about Confluent’s new capability called Tableflow—which will support writing from Apache Kafka into Apache Iceberg—is no exception.
Papers of the Month
- Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine
- An Empirical Evaluation of Columnar Storage Formats
Events & Call for Papers (CfP)
- Berlin Buzzwords (Berlin, Germany) June 9-11 (CfP closed)
- JavaZone (Oslo, Norway) September 4-5 (CfP closes Apr 7th)
- Current '24 | The Next Generation of Kafka Summit (Austin, TX) September 17-18 (CfP closed)
- Flink Forward (Berlin, Germany) October 23-24 (CfP not yet announced)
New Releases
A bumper batch of new releases this month:
Plus a couple of useful-looking projects that I noticed:
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.
Gunnar (LinkedIn / X / Mastodon / Email)
Robin (LinkedIn / X / Mastodon / Email)