Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling (your editor-in-chief for this edition) and Robin Moffatt. Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Streaming SQL with Apache Flink: A Gentle Introduction Flink SQL exposes Flink’s powerful stream processing capabilities to a large audience of SQL-savvy data engineers. This post by Giannis Polyzos is an excellent introduction to Flink SQL, discussing different kinds of query operators, checkpointing, and more.
- Stream processing becomes mainstream A high-level overview of the current state of stream processing by Javier Redondo, discussing the differences between stream-to-sink and stream-to-table frameworks as well as between libraries and cluster frameworks.
- Building a Fully Managed Apache Flink Service, Behind the Scenes Decodable software engineer Jared Breeden explores in this post what it takes to build a fully managed platform for stream processing based on Flink, touching on aspects like developer experience, observability, schema management, etc.
- What is stateful stream processing? Insightful post by Arroyo’s Micah Wylde, explaining what the “state” in stateful stream processing is about, how systems like Flink and Arroyo deal with it, and how checkpointing ensures consistency after failures.
- Yes, Change Data Capture Still Breaks Database Encapsulation This one is a reaction by Chris Riccomini to my recent article which in turn was triggered by Chris’ original post, all on the same subject. Overall I feel Chris and I are not really in disagreement that much, it’s more that I described building blocks and implementation techniques, whereas Chris is looking for a ready-made product experience.
- CDC Streaming ELT Framework Flink CDC 3.0, released at the end of last year, introduces a new ELT framework, aiming to simplify the definition of CDC-driven data pipelines. The reference documentation is a great starting point to learn more about this project, which was just adopted into the upstream Apache Flink project earlier this month.
Event Streaming
- Sliding window rate limits in distributed systems Naveen Kumar Jakuva Premkumar and Abdullah Al Mamun of Grab discuss in this in-depth post how they rate limit the number of marketing emails and push notifications sent to their users, using two of my favorite data structures: bloom filters and roaring bitmaps.
- Understanding lag in a streaming pipeline New Relic is known as a large-scale user of Apache Kafka. So it’s always interesting to learn about their experiences from running data streaming pipelines. Amy Boyle explains how they identify lag in pipelines using different techniques, as well as their strategies for automatically and dynamically adapting to it.
- An overview of Cloudflare's logging pipeline Cloudflare’s blog is a great source of insightful posts on the technologies they use. In this article, Colin Douch gives an overview of Cloudflare’s logging pipeline, based on Apache Kafka, connecting clusters in multiple data centers using MirrorMaker, and streaming logs to queryable systems such as ClickHouse and Elasticsearch.
- How Mixpanel Built a “Fast Lane” for Our Modern Data Stack A slightly longer read, but definitely worth the time: Illirik Smirnov shares experiences from his work at Mixpanel on creating a near-real-time data pipeline based on Google Cloud Pub/Sub for use cases where their existing (reverse) ETL system doesn’t provide the required latency SLAs.
Change Data Capture
- Logical Replication From Postgres 16 Stand-By Servers As of version 16, Postgres supports logical replication from read replicas. In this two-part blog series, I am taking a deep dive into this new functionality, showing how to make use of it in general, how to connect Debezium to read replicas, and how to handle replication slots in case of fail-over scenarios.
- PG Slot Notify: Monitor Postgres Slot Growth in Slack Making sure that Postgres replication slots don’t consume too much disk space is a key concern for every Postgres DBA. Kaushik Iska presents a neat solution to this problem in the form of PG Slot Notify, a chat bot which sends alerts to designated Slack channels in case a slot grows beyond a configured threshold.
- Debezium and TimescaleDB Support for TimescaleDB, a time series database based on Postgres, has been on the wishlist for many Debezium users for quite some time. This has finally become a reality in Debezium 2.5.0. Debezium’s project lead Jiri Pechanec discusses this new feature in this post, including the capability to capture changes to continuously updated aggregates.
- Streamlined Performance: Debezium JDBC connector batch support For the longest time, the Debezium project focused on the source side of real-time data pipelines. This has changed with the recent addition of a JDBC sink connector. Fiore Mario Vitale dives into some performance improvements to this connector.
Data Platforms and Architecture
- Our First Netflix Data Engineering Summit Netflix data engineers shared their learnings around building reliable data pipelines at an internal conference last year. The sessions by Holden Karau and others have been published on YouTube now, including gems such as Streaming SQL on Data Mesh using Apache Flink and Psyberg, An Incremental ETL Framework Using Iceberg.
- Designing A Data-Intensive Future: An Unscripted Journey with Martin Kleppmann Jesse Anderson interviews data legend Martin Kleppmann, talking about the evolution of data systems since Martin’s book “Designing Data-Intensive Applications” came out in 2017, his current work on local-first collaboration software, and his thoughts about being in academia.
- Handling Imbalanced Traffic with Kafka Swimlanes Mixing messages from different sources on one Kafka topic can cause a delay in processing when there are sudden spikes in the messages coming in from one producer. Angus Gibbs of HubSpot discusses how they address that issue, for instance routing backfill traffic and real-time traffic to different topics.
- API-First Approach to Kafka Topic Creation Managing Kafka topics in a consistent, secure, and reliable manner, ideally as self-service for developers, is an evergreen topic for data platform teams. Varun Chakravarthy et al. describe in this post how they've solved this problem at DoorDash by means of infrastructure-as-code using Pulumi.
- Kafka on Kubernetes: Reloaded for fault tolerance The Strimzi operator is a very popular way for running Apache Kafka on Kubernetes. Fabrice Harbulot and Thang Le of Grab discuss how they have designed their Kafka cluster deployments with a strong focus on fault tolerance, leveraging Strimzi’s rolling deployment mechanism and EBS volumes for persisting the data from their Kafka topics.
- Can Event-Driven Architecture make Software Design Easier? An episode of Kris Jenkins’ “Developer Voices” podcast, talking about everything events: event systems, Event Sourcing, the CloudEvents specification, and more. And it being a podcast with Kris, of course there was a mention of Clojure too.
Data Ecosystem
- Using Server Sent Events to Simplify Real-time Streaming at Scale Every year, Shopify ships their Black Friday Cyber Monday (BFCM) live map, a real-time visualization of Shopify sales. In this article, Bao Nguyen discusses the latest version of this map, built using Server-Sent Events (SSE) and Apache Flink for processing raw merchant sales data coming in via Apache Kafka.
- 1BRC in SQL with DuckDB Robin takes a stab at the One Billion Row Challenge in this post, and staying true to his SQL personality, he’s using DuckDB for it. Unsurprisingly, everyone’s favorite embeddable analytics database does really well.
- How Apple built iCloud to store billions of databases Leonardo Creed takes a look under the hood of CloudKit, Apple’s backend service for iCloud, and how they leverage FoundationDB and Apache Cassandra, managing hundreds of petabytes of data.
- Kubernetes for Data Engineers While most data engineers don’t need to work directly with Kubernetes in their jobs (you can decide whether that’s good or bad ;), it can still be interesting to learn about the core ideas and concepts. Daniel Beach provides a nice introduction to Kubernetes in this post.
- Are you using Protocol Buffers as serialization format with Kafka? I asked this question on Twit…, mh, X, the other day, and I was really surprised by the large number of folks who shared their experiences from using ProtoBuf. It seems it’s way more popular than I had thought, with some advantages over Avro, including better support for languages other than Java, a less verbose format for defining schemas, support for partial deserialization, and others.
- Super-fast deduplication of large datasets using Splink and DuckDB Deduplicating data is a common task for most data engineers. Splink is an open-source Python tool designed for this task, and Robin Linacre puts it into action in this post together with DuckDB, deduplicating a dataset of seven million rows, comparing runtimes on EC2 instances of different sizes.
Paper of the Month
DBLog: A Watermark Based Change-Data-Capture Framework (arXiv:2010.12597)
In this paper from 2020, Andreas Andreakis and Ioannis Papapanagiotou, back then working on a CDC solution at Netflix, propose an innovative algorithm for running backfills of existing data (“snapshotting”) concurrently with reading changes from the transaction log, leveraging a windowed de-duplication approach. Adopted by Debezium, Flink CDC, and other CDC solutions, this has been a massive improvement for data engineering teams running CDC pipelines.
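To give a rough feel for the windowed de-duplication idea, here is a minimal Python sketch. It is a simplification under stated assumptions, not code from the paper, Debezium, or Flink CDC: the function name, the tuple-based event format, and the `LOW`/`HIGH` marker values are all made up for illustration. The assumption is that a chunk of rows was read from the table between writing a low and a high watermark to the log; any chunk row whose key also shows up in the log inside that window is dropped, since the log event is more recent.

```python
# Hypothetical marker values; a real system would write watermark rows
# to a dedicated table so that they show up in the transaction log.
LOW, HIGH = "LOW_WATERMARK", "HIGH_WATERMARK"


def merge_chunk_with_log(log_events, chunk_rows):
    """Interleave one snapshot chunk with the change log.

    log_events: iterable of (kind, key, value) tuples in log order, where
                kind is "change", LOW, or HIGH (illustrative format).
    chunk_rows: dict of key -> value, selected between the two watermarks.
    Returns the list of events to emit downstream.
    """
    output = []
    remaining = dict(chunk_rows)  # chunk rows that have not been superseded
    in_window = False

    for kind, key, value in log_events:
        if kind == LOW:
            in_window = True
        elif kind == HIGH:
            # Emit surviving chunk rows before any later log event.
            output.extend(("snapshot", k, v) for k, v in remaining.items())
            remaining.clear()
            in_window = False
        else:
            if in_window:
                # The log event is newer than the chunk's copy of this row.
                remaining.pop(key, None)
            output.append(("change", key, value))

    return output


if __name__ == "__main__":
    log = [
        ("change", 1, "a1"),
        (LOW, None, None),
        ("change", 2, "b2"),  # collides with key 2 in the chunk
        (HIGH, None, None),
        ("change", 3, "c2"),
    ]
    chunk = {1: "a1", 2: "b1", 3: "c1"}
    for event in merge_chunk_with_log(log, chunk):
        print(event)
```

Because surviving chunk rows are emitted at the high watermark, ahead of any later log events, a stale snapshot row never overwrites a newer change for the same key.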
Events & Call for Papers (CfP)
- GeeCON (Kraków, Poland) May 15-17 (CfP closes Jan 31st)
- JNation (Coimbra, Portugal) June 4-5 (CfP closes Jan 31st)
- JPrime (Sofia, Bulgaria) May 28-29 (CfP closes Feb 15th)
- NDC Oslo (Oslo, Norway) June 10-14 (CfP closes Feb 18th)
- Berlin Buzzwords (Berlin, Germany) June 9-11 (CfP closes Feb 25th)
- Current '24 | The Next Generation of Kafka Summit (Austin, TX) Sep 17-18 (CfP closes Feb 26th)
New Releases
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.
Gunnar (LinkedIn / X / Mastodon / Email)
Robin (LinkedIn / X / Mastodon / Email)