Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Robin Moffatt (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Sanchay Javeria from Pinterest describes some lessons from debugging a tricky direct memory leak in Flink with a nice description of how they went about it, and some background on how Flink manages memory.
- Another good troubleshooting story, from Seth Saperstein at Lyft about Flink Streaming’s Kinesis Connector.
- I recently set out on a seemingly-straightforward journey to understand the role of the Catalog in Flink SQL. As they say: sometimes it’s the journey that is the adventure, not the destination ;-) I wrote up my findings in two parts, covering a primer and then a hands-on guide. On the subject of Flink catalogs, Gordon Murray had a nice little hack involving them here.
- Sofia Korableva from Depop describes how they use the Queryable State Store in Kafka Streams.
Event Streaming
- A couple of good Kafka ops blogs from Zendesk. Tim Cuthbertson writes about rotating CA roots for Kafka, whilst Rui Chen describes how they migrated their Kafka Clusters from EC2 VMs to Kubernetes.
- David Mariassy from Shopify has a detailed description of his vision for managing Kafka schemas more effectively, with a useful primer on schema compatibility too.
- Understanding the importance of good data modeling is having a somewhat-overdue resurgence in the data warehousing and analytics world, but we shouldn’t forget that events in a stream also need modeling correctly for flexibility, ease of use, and of course, accuracy. Timo Dechau covers this well in his nice article here.
- An excellent writeup from Thomas Dangleterre describing how Decathlon implemented Kafka Connect on Strimzi, including sinks for S3, MongoDB, and HTTP to make calls out to their telemetry platform driven by messages in a Kafka topic.
Change Data Capture
- The Debezium Roadmap has been updated with plans for version 2.6 and later, including support for InfluxDB source, MongoDB sink, and improvements in the SQL Server connector.
- Ranjit Singh from Macquarie Bank describes the evaluation and adoption of Debezium for Change Data Capture.
- One of the fantastic things about open-source projects is that a solution for a problem encountered by one user might be useful for another, and so it was with Miguel A. Sotomayor’s description of Birdie’s contribution to Debezium for setting replica identity for Postgres automatically.
- Written by some very well-known names in the stream processing space, including Tyler Akidau and Fabian Hueske, this paper describes Incremental Processing with Change Queries in Snowflake.
Data Platforms and Architecture
- If you run your own infrastructure—and even if you don’t—you should read this fascinating account by Fabrice Harbulot and Minh Khoi Nguyen of Grab’s implementation of a self-service streaming data platform which supports both Flink and Kafka.
- An interesting look by Instacart’s Nate Kupp at their tech stack, which they rebuilt last year and which includes Flink, Kafka, and Debezium.
- Hadoop might be on the decline in popularity for building new data platforms but it’s still used in plenty of places—not least LinkedIn who have a mind-boggling 55,000 hosts across their big data infrastructure. Anuj Maurice describes how they implement rolling upgrades to keep versions up to date with minimal downtime or performance impact.
- A nice writeup from our own Sharon Xie on where stream processing fits into your platform.
- Two interesting blogs from Grab and Expedia covering how they solve the requirement of giving users the ability to explore real-time data.
Data Ecosystem
- The Modern Data Stack is a moniker that’s been ubiquitous for several years now and one to which any data tool vendor worth its salt would try to hitch its wagon. That is, until last week, when Tristan Handy at dbt wondered out loud whether the term "Modern Data Stack" [is] Still a Useful Idea? This spawned a series of response articles from names synonymous with the space, including Joe Reis and Benn Stancil.
- DocStore is a distributed database built at Uber, offering strong consistency, caching with Redis, CDC—and the ability to serve over 40 million reads per second.
- Part of my fun with Flink catalogs (that I mention above) was reacquainting myself with the Hive Metastore. My former colleague Oz Katz has a good article exploring the options in this space now and looking at how some of the new ones aren’t entirely open, or have elements of vendor lock-in.
- Real time analytics is a hot space with many active projects and vendors. Whilst both Vimeo and Lyft have embraced ClickHouse (moving from Apache Phoenix on HBase and Apache Druid respectively), Uber uses Apache Pinot at scale.
- Daniel Beach is a data engineer at Rippleshot and a prolific blogger. A few of his articles that I’ve enjoyed recently are Config Driven Pipelines, Are Data Contracts For Real?, and Batch vs Near-Realtime vs Streaming.
Papers of the Month
Murat Demirbas has a fascinating blog in which he analyses papers that have been published. Two papers that caught my eye recently are:
- Scalable OLTP in the Cloud: What's the BIG DEAL? (👉 Murat's analysis and commentary)
- Verifying Transactional Consistency of MongoDB (👉 Murat's analysis and commentary)
Events & Call for Papers (CfP)
- Kafka Summit London is March 19-20 and Decodable will be there. Come and say hi at our booth, and be sure to catch the three talks we’re doing at the conference:
• Sharon Xie: Timing is Everything: Understanding Event-Time Processing in Flink SQL (Wednesday 20th 10am)
• Gunnar Morling: Data Contracts In Practice With Debezium and Apache Flink (Wednesday 20th 1pm)
• Robin Moffatt: 🐲 Here be Dragons^H^H Stacktraces — Flink SQL for Non-Java Developers (Tuesday 19th 3:30pm)
- Berlin Buzzwords (Berlin, Germany) June 9-11 (CfP closes Feb 25th)
- Current '24 | The Next Generation of Kafka Summit (Austin, TX) Sep 17-18 (CfP closes Feb 26th)
- JavaZone (Oslo, Norway) September 4-5 (CfP closes Apr 7th)
New Releases
A couple of releases are almost there but not quite at the time of going to press 🙂
- The flink-connector-jdbc-3.1.2 RC3 vote has passed, so the release is imminent (this will add support for Flink 1.18 to the connector).
- The Apache Kafka 3.7 RC4 vote is underway. This release includes a bunch of new stuff such as a Docker image for Kafka (KIP-975), support in Kafka Connect for creating connectors in a stopped state (KIP-980), and, in Kafka Streams, support for rack-aware task assignment (KIP-925) plus a bunch of improvements to Interactive Queries v2 (KIP-968, KIP-985, KIP-992).
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.
Gunnar (LinkedIn / X / Mastodon / Email)
Robin (LinkedIn / X / Mastodon / Email)