Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Robin Moffatt (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- My favourite kind of blog post for learning about something is one that actually builds something with it, and Fredrik Meyer does a great job of this in Estimating Pi with Kafka Streams. For a taste of the approach, see the sketch after this list.
- Apache Paimon is a lakehouse storage format (in the same vein as Apache Iceberg et al.). It was originally spun out of the Apache Flink project and has a focus on streaming data. It graduated to a top-level Apache project earlier this year, and recently announced deletion vectors, a feature to improve the efficiency of reading and writing data.
- This is a good write-up of how you can use SQL-based stream processing for SecOps analysis of real-time S3 bucket access. It's built on Feldera, an open-source implementation of DBSP (see Papers of the Month, below).
- One of the things that can sometimes be frustrating about the melee of open-source projects is getting them to work well together, so I was pleased to see this discussion of, and progress on, tighter integration between Flink and Iceberg.
- An interesting account of some benchmarking that Yahoo did to compare the performance, from a cost point of view, of Apache Flink and Google Cloud Dataflow.
- Flink has several sub-projects, including Flink ML. This blog post, based on a talk given at Flink Forward Asia last year, explains how Flink ML works and covers some of its new features.
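If you're curious what the Monte Carlo approach in Fredrik's post looks like in Kafka Streams terms, here's a minimal sketch. It's my own illustration, not his actual code, and the topic names, serdes, and string-encoded accumulator are all assumptions made for brevity. The idea: random (x, y) points in the unit square arrive on an input topic, and the fraction that land inside the unit circle approximates π/4.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class PiEstimator {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Each record value is "x,y", with x and y drawn uniformly from [0, 1).
        builder.stream("random-points", Consumed.with(Serdes.String(), Serdes.String()))
            // Re-key everything to a single constant key so all points
            // aggregate into one running count.
            .groupBy((key, value) -> "pi", Grouped.with(Serdes.String(), Serdes.String()))
            // The accumulator is "inside,total", kept as a String to avoid a custom serde.
            .aggregate(
                () -> "0,0",
                (key, point, agg) -> {
                    String[] p = point.split(",");
                    double x = Double.parseDouble(p[0]);
                    double y = Double.parseDouble(p[1]);
                    String[] a = agg.split(",");
                    long inside = Long.parseLong(a[0]) + (x * x + y * y <= 1.0 ? 1 : 0);
                    long total = Long.parseLong(a[1]) + 1;
                    return inside + "," + total;
                },
                Materialized.with(Serdes.String(), Serdes.String()))
            // The fraction of points inside the circle approximates pi/4.
            .mapValues(agg -> {
                String[] a = agg.split(",");
                return Double.toString(4.0 * Long.parseLong(a[0]) / Long.parseLong(a[1]));
            })
            .toStream()
            .to("pi-estimates", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pi-estimator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Feed the input topic with uniformly random points and tail the output topic, and you should see the estimate converge towards 3.14159… as the count grows.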
Event Streaming
- An excellent article about how Agoda analysed and solved performance problems they observed in their use of Apache Kafka, focussing on topic partition load balancing strategies.
- WarpStream is one of several Kafka-as-an-API providers that have emerged recently, and continues its interesting blog series with notes on Cloud Disk Costs and its opinion that Tiered Storage Won’t Fix Kafka.
- My former colleague Lucia Cerchie has a nice article on how to write Kafka Improvement Proposals (KIPs), and I enjoyed reading Phuc Tran’s account of their experiences getting started contributing to Apache Kafka.
- Adam Warski has written a superb explainer and analysis of why queuing with Kafka can be problematic. He discusses KIP-932, which is intended to address the issue, and offers a solution that’s available today: the “KMQ” pattern (see the sketch after this list).
- The same queuing problem led CloudKitchens to ditch Kafka and develop their own technology for reliable order processing, called KEQ. There’s an interesting Reddit thread that accompanies the article.
- I’m always curious to see the kind of “day 2” challenges that people encounter once they get past the “Hello World” phase of implementation. This blog post from DoorDash is a thorough explanation of how they handle test/prod Kafka workloads and multi-tenancy, including in some cases using the same Kafka cluster for both.
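For the curious, the core of the KMQ pattern Adam describes is a pair of markers written around each message's processing. Here's a minimal sketch of the consumer side; this is my own illustration, not SoftwareMill's actual implementation, and the topic names and configuration are assumptions. The key point is that offsets are committed eagerly, and redelivery is instead driven by a separate service (not shown) that re-publishes any message whose start marker is never matched by an end marker within a timeout.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class KmqStyleConsumer {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "kmq-workers");
        consumerProps.put("enable.auto.commit", "false");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> markers = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("queue"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(100));

                // 1. Write a "start" marker for every message: "processing has begun".
                for (ConsumerRecord<String, String> msg : batch) {
                    markers.send(new ProducerRecord<>("markers", markerKey(msg), "start"));
                }

                // 2. Commit offsets eagerly; redelivery is driven by the markers
                //    topic, not by consumer offsets.
                consumer.commitSync();

                // 3. Process each message, then write an "end" marker so the
                //    redelivery service knows no redelivery is needed.
                for (ConsumerRecord<String, String> msg : batch) {
                    process(msg.value());
                    markers.send(new ProducerRecord<>("markers", markerKey(msg), "end"));
                }
            }
        }
    }

    private static String markerKey(ConsumerRecord<String, String> msg) {
        return msg.topic() + "/" + msg.partition() + "/" + msg.offset();
    }

    private static void process(String message) {
        // Business logic goes here; a crash before the "end" marker triggers redelivery.
    }
}
```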
Data Platforms and Architecture
- Data modelling goes in and out of fashion: as technology enables people to crunch ever more data, modelling purely for optimisation becomes less necessary. But then people start getting wrong or confusing data, and realise that there’s a reason data modelling has been a thing for many decades. Joe Reis is looking to raise its profile once more by writing a book about it, and has shared excerpts from the first and second chapters of Practical Data Modeling. Meanwhile, Adrian Bednarz writes about the One Big Table (OBT) approach and its implications for stream processing.
- An interesting explanation of how BackMarket are taking the opportunity during their migration onto GCP to apply some Data Mesh principles to how data is made available to their users. It covers both the logical design and the specific GCP tools used to implement it.
- It might not have the sparkle and allure of GenAI, but someone’s gotta take the digital trash out. Netflix generates around 2 petabytes of data every week, an estimated 40% of which is never used. This article goes into the detail of the tools and processes used to manage this data, including its deletion when needed.
- Canva describes how they hit scaling problems and ended up migrating from MySQL to Snowflake for an application responsible for counting usage. Two things struck me: there’s no Kafka (or equivalent) for ingestion, and Snowflake does the heavy crunching instead of Spark or Flink. I wonder how much of that is down to their existing familiarity with Snowflake (as mentioned in the article) versus it being genuinely more suitable for the job.
Change Data Capture
- A good primer on Flink CDC 3.0, covering its architecture, APIs, and features including schema evolution.
Data Ecosystem
- Apache DataFusion recently became its own top-level Apache project, graduating out of Apache Arrow. DataFusion is a query engine that can be used for building data systems, and it’s already found in many projects, including Comet, an accelerator for Apache Spark.
- At its peak, Apache HBase held 6PB of data for Pinterest and underpinned many of their systems. This article is a really well-written account of their reasons for deciding to deprecate it in favour of tools including Apache Druid and TiDB.
- There’s a reason Chris Riccomini features so often in Checkpoint Chronicle (I checked: five of the last six editions!): he writes really useful and pragmatic posts 😀. This one, looking at the Nimble and Lance file formats, is no exception. Whilst Parquet is going nowhere anytime soon, it’s interesting to look at the nebulous beginnings of what might one day replace it, and why.
- As well as Chris, I’m a big fanboi of Jack Vanlightly’s writing. He has a knack for making the complex intelligible without dumbing it down. His recent post on Hybrid Transactional/Analytical Storage is an interesting look at both Confluent’s strategy and the broader landscape for data platforms.
- This one caught my eye, and I’ll quote the README directly: “pg_lakehouse is an extension that transforms Postgres into an analytical query engine over object stores like S3 and table formats like Apache Iceberg. Queries are pushed down to Apache DataFusion.”
Papers of the Month
- Vortex: A Stream-oriented Storage Engine For Big Data Analytics
- DBSP: Automatic Incremental View Maintenance for Rich Query Languages
Events & Call for Papers (CfP)
- Berlin Buzzwords (Berlin, Germany), June 9-11 (CfP closed)
- Current '24 | The Next Generation of Kafka Summit (Austin, TX), September 17-18 (CfP closed)
- Flink Forward (Berlin, Germany), October 23-24 (CfP extended; now closes on May 31)
New Releases
A few new releases this month:
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.
Gunnar (LinkedIn / X / Mastodon / Email)
Robin (LinkedIn / X / Mastodon / Email)