Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling, Robin Moffatt (your editor-in-chief for this edition), and Hans-Peter Grahsl. Feel free to send our way any choice nuggets that you think we should feature in future editions.
I’m writing this on my flight back from Current 2024 in Austin. It was a packed couple of days that you can read more about in my recap blog. If you want to catch the keynotes, they are already online: day 1 & day 2. I was excited to present at the conference, along with three of my colleagues, and you can find our talks here:
- Data Contracts In Practice With Debezium and Apache Flink (Gunnar Morling)
- The Joy of JARs (and Other Flink SQL Troubleshooting Tales) (Robin Moffatt)
- So you want to write a User-Defined Function for Flink? (Hans-Peter Grahsl)
- Timing is Everything: Understanding Event-Time Processing in Flink SQL (Sharon Xie)
Stream Processing, Streaming SQL, and Streaming Databases
- Rohan Desai, a co-founder at Responsive, has written about the new Async Processor for Kafka Streams. It includes a useful introduction that sets the scene for why it’s needed and what the limitations of current options are.
- My former colleague Bill Bejeck has written an excellent series all about windowing in stream processing. I featured one of them previously; here’s the full set now:
- I wrote up a guide on how to write Delta Lake tables from Flink, along with some troubleshooting details.
Event Streaming
- In one of the more surprising recent acquisitions, Confluent announced that they have acquired WarpStream to fill the Bring-Your-Own-Cloud (BYOC) gap in their portfolio. WarpStream have shot to prominence in this space for their Kafka-compatible platform which uses S3 to store data directly. WarpStream themselves are talking about the acquisition as an opportunity to grow further within Confluent, rather than their technology simply being absorbed into Confluent’s own. Confluent’s Jack Vanlightly took the opportunity to “clarify” his views on BYOC after being critical of it last year.
- Javier Holguera has published a useful set of blogs around naming conventions in Kafka, covering topics and producers/consumers.
- Following on from Uber’s contribution of tiered storage for Kafka, Pinterest have shared details and the source code (under Apache 2.0 licence) of their own implementation. It differs in several ways, including handling tiered storage separately from the broker. Pinterest have been using it in production since May this year, offloading a staggering 200 TB of data per day.
- A useful writeup from DoorDash on how they run their internal self-service Kafka platform.
Data Ecosystem
- Amazon S3 added support for conditional writes, which Gunnar took a look at in this excellent post on how they could be used for leader election in distributed systems.
- Decades after the first ones were created, new databases are still being written. Learn more in this post about the challenges James Munro found with existing systems that led to him creating ArcticDB.
- Some observations from StarRocks on Unity Catalog and what the Databricks acquisition of Tabular earlier this year might mean for Apache Iceberg.
- I really enjoyed this well-written piece from Expedia detailing how and why they migrated from Apache Cassandra to ScyllaDB. It includes the evaluation of the different migration options, and how they achieved a zero-downtime migration of their 15-node, 1 TB Cassandra cluster.
Data Platforms and Architecture
- Airbnb wrote about their Lambda-architecture-based platform Riverbed last year, and recently published a more detailed look at it. It’s based around SpinalTap, their CDC solution which has been around for a while.
- This blog post from Uber is interesting for the level of detail it goes into on how they organize the data in their object stores to optimize for things such as data ownership, access control, throughput, and platform date limits. It’s part of their move to GCP and their migration from HDFS to GCS.
- If you’re interested in system design you’ll want to check out this detailed two-part series from Agoda that goes into the nuts-and-bolts of how they manage deduplication of bookings across multiple data centers.
- Staying with online travel booking, this post from Booking.com discusses how the ML-based ranking platform behind its search results works.
RDBMS and Change Data Capture
- Two fascinating blogs about different approaches taken to upgrading to MySQL 8.0 from Uber and GitHub.
- At some point the splicing and mixing of different protocols and engines is all going to get too silly, but for now bear with me: Tansu is a Kafka-compatible broker using Postgres as its storage, and pg_duckdb embeds DuckDB as an engine within Postgres.
- Back in the regular world of vanilla Postgres, there’s a nice collection of the things you maybe didn’t realize that it could do, along with a slightly provocative assertion that you should just use Postgres. The latter is actually pretty useful in positioning its strengths (and weaknesses) against other types of data management systems.
Paper of the Month
In Petabyte-Scale Row-Level Operations in Data Lakehouses, Ryan Blue (one of the co-creators of Apache Iceberg) and a team from Apple look at improvements in Iceberg and Apache Spark that bring the lakehouse closer to the data warehouse of old in terms of richer functionality and performance.
Events & Call for Papers (CfP)
- Flink Forward (Berlin, Germany) October 23-24
- ODSC West (Burlingame, CA) October 29-31
- KubeCon / CloudNativeCon NA 2024 (Salt Lake City, UT) November 12-15
- AWS re:Invent (Las Vegas, NV) December 2-6
- NDC London (London, UK) January 27-31, 2025
- Current 2025 (Bangalore, India) mid-March
- Current 2025 (London, UK) May 20-21
- Current 2025 (New Orleans, LA) October 29-30
New Releases
- Apache Flink CDC 3.2.0
- Apache Paimon 0.9.0
- Debezium 3.0.0.CR1
- Kroxylicious 0.8.0
- Strimzi Kafka Operator 0.43
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you have.
Gunnar (LinkedIn / X / Mastodon / Email)
Robin (LinkedIn / X / Mastodon / Email)
Hans-Peter (LinkedIn / X / Mastodon / Email)