Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Robin Moffatt - feel free to send our way any choice nuggets that you think we should feature in future editions.
- Apache Iceberg has some notable big adopters, such as Netflix (where it was created), and Apple. In this fascinating talk from QCon New York 2023, Stephen Wu from Apple talks about using Iceberg as a streaming source into Apache Flink, including the reasons why they included Iceberg in the pipeline and not just Apache Kafka throughout.
- Several long-time developers contributing to the Kafka Streams project within Apache Kafka have launched a startup called Responsive, and with it an absolute barn-stormer of a blog post: A Size for Every Stream: The Expert's Guide to Sizing Kafka Streams
- In this excellent talk from QCon London 2023 and just published on the QCon website (alongside the previous blog on the same subject), Matt Boyle and Andrea Medda from Cloudflare detail their journey to 1 trillion messages in Kafka and the issues and learnings they had along the way. Some of the points included dealing with message compatibility whilst retaining loose-coupling as they moved away from a monolithic implementation, internal tooling and libraries to improve developer velocity, and accurate monitoring and instrumentation.
- From the Decodable stables this month we have a useful set of blogs:
• If you’re new to stream processing then this post from Robert is just what you need to give you an overview of the space.
• Gunnar then goes on to cover what change data capture (CDC) is and the wide range of use cases for it.
• To finish off the introductory blog posts, check out Robin’s view of what Apache Flink is as he sets out on his learning journey.
- A sign of the increasing maturity of the stream processing space is that we are moving beyond solely “Hey I implemented Hello World with Flink!” blog posts (although there’s nothing wrong with these either, and I always applaud learning in public) and onto the gnarlier topics that come with running stream processing applications for real. This post from Yaroslav Tkachenko discusses a zero-downtime technique for deploying new versions of Flink applications based on the blue-green deployment pattern.
- Arroyo are one of the many new companies in the increasingly-crowded streaming space, and earlier this year published this excellent blog that explains streaming SQL clearly along with a discussion of dealing with updates, comparing two different approaches that it calls Dataflow Semantics and Update Semantics.
- Another new kid on the block in streaming is Epsio, who published this useful explainer of their implementation of a streaming SQL engine.
- A year after their first post, the Netflix team returned with an update on the Data Mesh platform, detailing their move from individual Flink operators to adoption of Flink SQL. It’s interesting to see how Flink is used, and to get detail on exactly how companies at this scale implement streaming technologies.
It’s worth noting that Netflix’s Data Mesh != Zhamak Dehghani’s Data Mesh - it’s just a naming collision, and one which Netflix discusses here.
- The Open Table Format (OTF) squabbles continue, with this blog from Iceberg co-creator and Tabular co-founder Ryan Blue discussing flaws in Apache Hudi’s Atomicity, Consistency, and Isolation guarantees, which one of Hudi’s co-creators (and Onehouse founder), Vinoth Chandar, disagrees with in this post. Databricks’ Ali Ghodsi even threw his tuppence worth in too.
It seems there are at least two angles to this argument:
• Arguments over the technical correctness of claimed features (ACID, etc)
• Strategic positioning: both Databricks and Onehouse see the OTF future as being multiple formats with some form of interoperability between them (Uniform and OneTable, respectively). The impression that I’ve got is that Tabular see Iceberg as the format around which others will and should converge.
- A common driver for moving from batch to stream processing is the need to get fresher information in front of users of an application. That’s what happened at Vinted, where they migrated their batch-based load of Elasticsearch to one powered by Flink.
- Thanh Tung Dao has compiled two very useful lists:
• Companies using Apache Kafka in production
• Companies using Apache Flink in production
These are a great reference if you want to understand the level of adoption, use cases, and scale of usage of these tools.
- Replication slots not advancing in certain circumstances used to be a notorious source of headaches for users of change data capture with Postgres. The engineering team at Zalando, who are heavy users of Debezium, took a stab at addressing this using keep-alive messages. They discuss how they patched the Postgres JDBC driver in this in-depth blog post. No more unexpected WAL growth!
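If you're wondering what "unexpected WAL growth" looks like in numbers, here is a minimal sketch of the arithmetic. It is not taken from the Zalando post. It only assumes the standard Postgres LSN text format (two hexadecimal numbers separated by a slash, the high and low 32 bits of a byte position), and the helper names and sample values are made up for illustration.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_wal_lsn: str, slot_restart_lsn: str) -> int:
    """Bytes of WAL the server has to keep because the slot has not advanced."""
    return parse_lsn(current_wal_lsn) - parse_lsn(slot_restart_lsn)


# Hypothetical values, of the kind you would read from pg_current_wal_lsn()
# and from the restart_lsn column of the pg_replication_slots view.
lag = retained_wal_bytes("1/00000000", "0/FF000000")
print(f"{lag / 1024 / 1024:.0f} MiB retained")  # prints "16 MiB retained"
```

A slot whose restart_lsn stays put while the current WAL position keeps moving makes this number grow without bound, which is the headache that keep-alive messages are meant to cure.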
Paper of the Month
Who doesn’t love to read a good paper? In this spirit, we’re going to reference one research or industry paper (either classic, or just hot off the press) in the field of data processing and streaming each month, starting with one of our all-time favourites:
📄 One SQL to Rule Them All: An Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables (arXiv:1905.12133)
In this paper from 2019, Edmon Begoli et al. discuss how the “pervasive use of time-varying relations, robust event-time semantics support, and materialization control can substantially improve the ease-of-use of streaming SQL”. Definitely an inspiring read!
New Releases
- Apache Flink 1.18
- CDC Connectors for Apache Flink 2.4.2
- Apache Kafka 3.6.0
- Debezium 2.5.0.Alpha1 (adds support for MariaDB and Informix)
Events
- Kafka Summit is into its eighth year now and will be in both London (19-20 March 2024) and Bangalore (2 May 2024) next year. The Call for Papers (CfP) is open for both, closing on 27th November (London) and 10th January 2024 (Bangalore).
- Data Council is going to take place in Austin again (26-28 March 2024). The CfP for it is open for just a few more days (closing on 17th November), so hurry to get your proposals in.
- Current 2023 was held in San Jose this year with attendees from across the data streaming space. If you weren’t able to make it but don’t want to feel left out, you can read a variety of write-ups, including from Confluent themselves, the trade press, vendors including Cloudera and Tabular, and independents such as Yaroslav Tkachenko and Bitrock.
That’s all for this month! We hope you’ve enjoyed the newsletter, and are all ears for any feedback or suggestions you’ve got.