Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling, Robin Moffatt, and Hans-Peter Grahsl (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Apurva Mehta from Responsive wrote this interesting article about several challenges related to using embedded RocksDB for state storage in stream processing frameworks. He highlighted why OLTP databases like MongoDB or ScyllaDB can be rather good alternatives and yet they come with their own trade-offs. This is exactly what a recent follow-up article from Rohan Desai picks up to make the case for SlateDB-backed stream processing state.
Event Streaming
- If you’re interested in the safety properties of distributed systems, you don’t want to miss Jepsen’s analysis of Bufstream, a Kafka-compatible streaming system backed by object storage.
- Javier Holguera wrote a multi-part series about naming “all the things” when working with Kafka. Part 3 focuses on how developers could go about defining proper names for the relevant properties when running Kafka Connect connectors.
- “Provisioning Kafka topics the easy way!” from Matt Venz describes how Zendesk's engineering team streamlined Kafka topic provisioning by integrating it into their existing self-service interface. They developed a custom resource for Kubernetes and an operator to reconcile these definitions with the Kafka cluster.
Data Ecosystem
- This post on the ClickHouse blog discusses a significantly enhanced JSON data type which is purpose-built to deliver high-performance handling of JSON data on top of ClickHouse’s columnar storage.
- Medium's engineering team recently focused on reducing Snowflake costs by optimizing 22 of their most expensive data pipelines. The key areas for improvements were efficient data loading, query optimization, materialized views, and data partitioning.
- OpenLineage is an open platform for the collection and analysis of data lineage. In "The Future of Lineage" Julien Le Dem discusses the evolution and significance of data lineage in modern data processing and how advancements have led to automated extraction methods across batch and stream processing systems.
- Read about how Notion’s data catalog evolved from the initial wild west situation to enhanced user engagement. They adopted TypeScript as an Interface Definition Language for data models and are converting these to JSON Schema for catalog integration including automated metadata descriptions to ensure consistency and accessibility.
- Xebia Data Engineers Daniël Tom et al. share on their blog how they are using dbt and Jupyter notebooks to integrate DuckDB's efficient in-process analytics with centralized data governance concepts found in the Unity Catalog. A local dev environment based on Docker Compose can be found here.
Data Platforms and Architecture
- You may have heard about the huge influx of new users joining Bluesky in recent weeks. Here are two articles which offer a look in the rearview mirror by discussing how the distributed social network platform has been built over the years and what the engineering culture behind it looks like.
- As their diverse Snowflake usage patterns and workloads presented challenges in monitoring and cost attribution, Canva’s engineering team had to implement a comprehensive monitoring strategy. Find the details in Rob Scriva’s blog post “Our Journey to Snowflake Monitoring Mastery”.
- Uber’s advanced settlement accounting system needs to cope with challenges such as diverse PSP file formats, duplicate transactions, and missing or incorrect references all across 50+ payment service providers. Read about their architecture to enhance financial accuracy, prevent fraud, maintain regulatory compliance, and support strategic business decisions through near real-time reporting.
- Jack Vanlightly posted two articles highlighting the conflict between incremental data processing and data quality. Similar to solid software engineering principles, he suggests improving collaboration between software and data teams by means of stable interfaces to reduce system coupling and ensure better data quality and reliability.
RDBMS and Change Data Capture
- In “Transactional schema migration across tenant databases” Gwen Shapira blogs about the architecture and implementation of pg_karnak, Nile’s Postgres extension (soon to be open-sourced) that provides a distributed DDL layer.
- A common thing people wanna know is how to stream data from Postgres to Snowflake, and so Robin wrote this nice blog post showing how to get the job done with Decodable, using either the web UI or a declarative approach with YAML.
- Dominik Durner’s blog post introduces Colibri, CedarDB's hybrid storage engine designed to support HTAP (Hybrid Transactional and Analytical Processing) by combining the strengths of row-based and columnar data storage formats.
- As Gunnar is revisiting the outbox pattern, he explores alternative solutions such as "listen to yourself", the upcoming two-phase commit support for Kafka etc., and compares the involved trade-offs to figure out whether the outbox pattern is still relevant in 2024.
- If you want to share feedback about your Debezium experiences and thereby help shape the future roadmap of the project, take the community survey, which remains open for a few more days until the end of November.
- Last month, the Debezium team announced interest in moving the project to a foundation. After careful consideration with the internal and external communities, the project has now started the transition to the Commonhaus Foundation. Here is a dedicated FAQ on what that means.
- Schema evolution is a crucial aspect, yet not always considered from day one when building streaming ETL pipelines on top of change data capture. My recent blog post dives into a number of common source-side data model changes to understand the consequences for schema compatibility towards downstream consumers.
Paper of the Month
The research paper from Jeff Shute et al. explores an extension to SQL by introducing a "pipe syntax". Inspired by data flow patterns in other modern languages, GoogleSQL aims to simplify complex query structures while maintaining backward compatibility with traditional SQL. Interested? Read all the details in “SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL”.
Events & Call for Papers (CfP)
- AWS re:Invent (Las Vegas, NV) December 2-6
- NDC London (London, UK) January 27-31, 2025
- CfP open for Data Analytics DevRoom @ FOSDEM 2025 (Brussels, Belgium) February 1-2
- CfP open for Current 2025 (Bangalore, India) March 19
- CfP open for Devoxx UK (London, UK) May 7-9
- Current 2025 (London, UK) May 20-21
- Current 2025 (New Orleans, LA) October 29-30
New Releases
- Apache Kafka 3.8.1 and 3.9.0
- librdkafka (the Apache Kafka C/C++ library) 2.6.0
- Apache Iceberg 1.7.0
- Debezium 3.0.2.Final (with 3.0.3.Final imminent)
- Strimzi 0.44.0 (plus video summary)
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.
Gunnar (LinkedIn / Bluesky / X / Mastodon / Email)
Robin (LinkedIn / Bluesky / X / Mastodon / Email)
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)