January 24, 2024
5 min read

Checkpoint Chronicle - January 2024

By Gunnar Morling

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling (your editor-in-chief for this edition) and Robin Moffatt. Feel free to send our way any choice nuggets that you think we should feature in future editions.

Stream Processing, Streaming SQL, and Streaming Databases

  • Streaming SQL with Apache Flink: A Gentle Introduction Flink SQL exposes Flink’s powerful stream processing capabilities to a large audience of SQL-savvy data engineers. This post by Giannis Polyzos is an excellent introduction to Flink SQL, discussing different kinds of query operators, checkpointing, and more. (For a taste of what a Flink SQL job looks like, see the sketch after this list.)
  • Stream processing becomes mainstream A high-level overview of the current state of stream processing by Javier Redondo, discussing the differences between stream-to-sink and stream-to-table frameworks, as well as between libraries and cluster frameworks.
  • Building a Fully Managed Apache Flink Service, Behind the Scenes Decodable software engineer Jared Breeden explores in this post what it takes to build a fully managed platform for stream processing based on Flink, touching on aspects such as developer experience, observability, and schema management.
  • What is stateful stream processing? Insightful post by Arroyo’s Micah Wylde, explaining what the “state” in stateful stream processing is about, how systems like Flink and Arroyo deal with it, and how checkpointing ensures consistency after failures.
  • Yes, Change Data Capture Still Breaks Database Encapsulation This one is a reaction by Chris Riccomini to my recent article, which in turn was triggered by Chris’ original post, all on the same subject. Overall, I feel Chris and I don’t really disagree that much; it’s more that I described building blocks and implementation techniques, whereas Chris is looking for a ready-made product experience.
  • CDC Streaming ELT Framework Flink CDC 3.0, released at the end of last year, introduces a new ELT framework, aiming to simplify the definition of CDC-driven data pipelines. The reference documentation is a great starting point to learn more about this project, which was just adopted into the upstream Apache Flink project earlier this month.
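
To make the Flink SQL discussion above a bit more tangible, here is a minimal, self-contained sketch using PyFlink’s Table API: a one-minute tumbling-window aggregation over a datagen source. Table and column names are made up for illustration and don’t come from Giannis’ post.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming table environment; in a real deployment you would also
# configure checkpointing to make the windowed state fault-tolerant.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical source table, backed by the built-in datagen connector
# so the example runs without any external systems.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id BIGINT,
        price DOUBLE,
        order_time AS LOCALTIMESTAMP,
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '10')
""")

# A stateful windowed aggregation: revenue per one-minute tumbling window.
result = t_env.sql_query("""
    SELECT window_start, window_end, SUM(price) AS revenue
    FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end
""")
result.execute().print()
```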

Event Streaming

  • Sliding window rate limits in distributed systems Naveen Kumar Jakuva Premkumar and Abdullah Al Mamun of Grab discuss in this in-depth post how they rate limit the number of marketing emails and push notifications sent to their users, using two of my favorite data structures: Bloom filters and Roaring bitmaps. (A toy version of the sliding-window idea follows this list.)
  • Understanding lag in a streaming pipeline New Relic is known as a large-scale user of Apache Kafka, so it’s always interesting to learn about their experiences running data streaming pipelines. Amy Boyle explains how they identify lag in pipelines using different techniques, as well as their strategies for automatically and dynamically adapting to it. (See the consumer lag sketch after this list.)
  • An overview of Cloudflare's logging pipeline Cloudflare’s blog is a great source of insightful posts on the technologies they use. In this article, Colin Douch gives an overview of Cloudflare’s logging pipeline, which is based on Apache Kafka, connecting clusters in multiple data centers using MirrorMaker, and streaming logs to queryable systems such as ClickHouse and Elasticsearch.
  • How Mixpanel Built a “Fast Lane” for Our Modern Data Stack A slightly longer read, but definitely worth the time: Illirik Smirnov shares his experiences from building a near-real-time data pipeline at Mixpanel, based on Google Cloud Pub/Sub, for use cases where their existing (reverse) ETL system doesn’t provide the required latency SLAs.
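
To illustrate the core idea behind Grab’s rate limiting post, here is a toy sliding-window rate limiter in plain Python. It is deliberately not their implementation—Grab uses Bloom filters and Roaring bitmaps to keep memory in check at their scale—but it shows the basic mechanics of counting events per user within a moving time window:

```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    """Allow at most `limit` events per user within the last `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.events = defaultdict(deque)  # user_id -> timestamps of recent events

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        q = self.events[user_id]
        # Evict timestamps that have slid out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

# E.g. at most three marketing messages per user per 24 hours:
limiter = SlidingWindowRateLimiter(limit=3, window=86_400)
print(limiter.allow("user-42"))  # True for the first three calls per day
```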
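
And as a companion to the New Relic post, a sketch of how consumer lag—the distance between the log end offset and a consumer’s position—can be measured per partition with the kafka-python client. Broker address, topic, and group id are placeholders:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",  # placeholder
    group_id="pipeline-consumers",       # placeholder
    enable_auto_commit=False,
)
partitions = [
    TopicPartition("events", p)
    for p in consumer.partitions_for_topic("events")
]
consumer.assign(partitions)

end_offsets = consumer.end_offsets(partitions)  # latest offset per partition
for tp in partitions:
    lag = end_offsets[tp] - consumer.position(tp)
    print(f"{tp.topic}-{tp.partition}: lag={lag}")
```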

Change Data Capture

  • Logical Replication From Postgres 16 Stand-By Servers As of version 16, Postgres supports logical replication from read replicas. In this two-part blog series, I take a deep dive into this new functionality, showing how to make use of it in general, how to connect Debezium to read replicas, and how to handle replication slots in case of fail-over scenarios. (A minimal example of creating a slot on a stand-by follows this list.)
  • PG Slot Notify: Monitor Postgres Slot Growth in Slack Making sure that Postgres replication slots don’t consume too much disk space is a key concern for every Postgres DBA. Kaushik Iska presents a neat solution to this problem in the form of PG Slot Notify, a chatbot which sends alerts to designated Slack channels in case a slot grows beyond a configured threshold. (The underlying technique is sketched after this list.)
  • Debezium and TimescaleDB Support for TimescaleDB—a time series database based on Postgres—has been on the wishlist for many Debezium users for quite some time. This has finally become a reality in Debezium 2.5.0. Debezium’s project lead Jiri Pechanec discusses this new feature in this post, including the capability to capture changes to continuously updated aggregates.
  • Streamlined Performance: Debezium JDBC connector batch support For the longest time, the Debezium project focused on the source side of real-time data pipelines. This has changed with the recent addition of a JDBC sink connector. Fiore Mario Vitale dives into some performance improvements to this connector.
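
As a companion to the stand-by replication series, here is the gist of creating a logical replication slot on a Postgres 16 read replica, shown via psycopg2. Connection details are placeholders; note that the stand-by should run with hot_standby_feedback = on so the primary doesn’t remove rows the slot still needs:

```python
import psycopg2

# Connect to the read replica (placeholder connection string).
conn = psycopg2.connect("host=replica.example.com dbname=app user=cdc")
conn.autocommit = True
with conn.cursor() as cur:
    # Works on a stand-by as of Postgres 16; fails on older versions.
    cur.execute(
        "SELECT pg_create_logical_replication_slot(%s, %s)",
        ("cdc_slot", "pgoutput"),
    )
    print(cur.fetchone())  # (slot name, consistent point LSN)
```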
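
And here is a sketch of the technique behind slot-growth alerting à la PG Slot Notify: query the WAL retained by each replication slot and post to a Slack webhook once a threshold is exceeded. This is not the tool’s actual code; names, threshold, and the webhook URL are placeholders:

```python
import psycopg2
import requests

THRESHOLD_BYTES = 5 * 1024**3  # alert when a slot retains > 5 GiB of WAL
SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder

conn = psycopg2.connect("host=db.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    # Retained WAL per slot: distance between current WAL position and
    # the oldest position the slot still holds onto.
    cur.execute("""
        SELECT slot_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
        FROM pg_replication_slots
    """)
    for slot_name, retained in cur.fetchall():
        if retained and retained > THRESHOLD_BYTES:
            requests.post(SLACK_WEBHOOK, json={
                "text": f"Slot {slot_name} retains {retained / 1024**3:.1f} GiB of WAL"
            })
```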

Data Platforms and Architecture

Data Ecosystem

  • Using Server Sent Events to Simplify Real-time Streaming at Scale Every year, Shopify ships their Black Friday Cyber Monday (BFCM) live map, a real-time visualization of Shopify sales. In this article, Bao Nguyen discusses the latest version of this map, built using Server-Sent Events (SSE) and Apache Flink for processing raw merchant sales data coming in via Apache Kafka. (A minimal SSE endpoint is sketched after this list.)
  • 1️⃣🐝🏎️🦆 (1BRC in SQL with DuckDB) Robin takes a stab at the One Billion Row Challenge in this post—and staying true to his SQL personality, he’s using DuckDB for it. Unsurprisingly, everyone’s favorite embeddable analytics database does really well. (See the query sketch after this list.)
  • How Apple built iCloud to store billions of databases Leonardo Creed takes a look under the hood of CloudKit, Apple’s backend service for iCloud, and how they leverage FoundationDB and Apache Cassandra, managing hundreds of petabytes of data.
  • Kubernetes for Data Engineers While most data engineers don’t need to work directly with Kubernetes in their jobs (you can decide whether that’s good or bad ;), it can still be interesting to learn about the core ideas and concepts. Daniel Beach provides a nice introduction to Kubernetes in this post.
  • Are you using Protocol Buffers as serialization format with Kafka? I asked this question on Twit…, mh, X, the other day, and I was really surprised by the large number of folks who shared their experiences using Protobuf. It seems it’s way more popular than I had thought, with some advantages over Avro, including better support for languages other than Java, a less verbose format for defining schemas, and support for partial deserialization.
  • Super-fast deduplication of large datasets using Splink and DuckDB Deduplicating data is a common task for most data engineers. Splink is an open-source Python tool designed for this task, and Robin Linacre puts it into action in this post together with DuckDB, deduplicating a dataset of seven million rows, comparing runtimes on EC2 instances of different sizes.
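
To show what the SSE protocol from the Shopify post boils down to, here is a minimal Server-Sent Events endpoint in Flask: a long-lived HTTP response streaming `data:` frames that browsers can consume via EventSource. The payload is invented for illustration—this is not Shopify’s code:

```python
import json
import time
from flask import Flask, Response

app = Flask(__name__)

@app.route("/sales/stream")
def sales_stream():
    def generate():
        while True:
            event = {"ts": time.time(), "sales_per_minute": 42}  # placeholder payload
            # An SSE frame is a "data: ..." line followed by a blank line.
            yield f"data: {json.dumps(event)}\n\n"
            time.sleep(1)
    return Response(generate(), mimetype="text/event-stream")

# Run with `flask --app app run`; consume with `new EventSource("/sales/stream")`.
```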
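
And for the 1BRC entry, the heart of a DuckDB-based solution fits into a single query—min, mean, and max temperature per station over the challenge’s semicolon-separated measurements file. A sketch using DuckDB’s Python API (this is the general shape, not necessarily Robin’s exact query):

```python
import duckdb

result = duckdb.sql("""
    SELECT station,
           MIN(temperature) AS min_t,
           ROUND(AVG(temperature), 1) AS mean_t,
           MAX(temperature) AS max_t
    FROM read_csv('measurements.txt', delim=';', header=false,
                  columns={'station': 'VARCHAR', 'temperature': 'DOUBLE'})
    GROUP BY station
    ORDER BY station
""")
print(result)
```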

Paper of the Month

📄 DBLog: A Watermark Based Change-Data-Capture Framework (arXiv:2010.12597)

In this paper from 2020, Andreas Andreakis and Ioannis Papapanagiotou, back then working on a CDC solution at Netflix, propose an innovative algorithm for running backfills of existing data (“snapshotting”) concurrently with reading changes from the transaction log, leveraging a windowed de-duplication approach. Adopted by Debezium, Flink CDC, and other CDC solutions, this has been a massive improvement for data engineering teams running CDC pipelines. The core of the de-duplication step is sketched below.
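
Here is a runnable toy sketch of that windowed de-duplication: rows from a snapshot chunk read between a low and a high watermark are emitted only if no log event touched the same key within that window (the log event wins). The data structures are stand-ins for illustration, not the paper’s actual interfaces:

```python
def merge_chunk_with_log(chunk_rows, log_events_in_window):
    """chunk_rows: {key: row} read between the two watermarks;
    log_events_in_window: [(key, event)] captured in the same window."""
    output = []
    changed_keys = set()
    for key, event in log_events_in_window:
        changed_keys.add(key)
        output.append(("log", key, event))
    # Emit only chunk rows not superseded by a log event in the window.
    for key, row in chunk_rows.items():
        if key not in changed_keys:
            output.append(("snapshot", key, row))
    return output

# Key "b" changed while the chunk was being read, so its stale snapshot
# row is discarded in favor of the log event:
print(merge_chunk_with_log(
    {"a": "row-a", "b": "row-b-stale"},
    [("b", "update-b")],
))
```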

Events & Call for Papers (CfP)

New Releases

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.

Gunnar (LinkedIn / X / Mastodon / Email)
Robin (LinkedIn / X / Mastodon / Email)

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

Gunnar Morling

Gunnar is an open-source enthusiast at heart, currently working on Apache Flink-based stream processing. In his prior role as a software engineer at Red Hat, he led the Debezium project, a distributed platform for change data capture. He is a Java Champion and has founded multiple open source projects such as JfrUnit, kcctl, and MapStruct.
