Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling (your editor-in-chief for this edition) and Robin Moffatt. Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- In Designing Kafka Streams Applications, Valérie Servaire, Paul Amar, Damien Fayet, and Sébastien Viale take us through their process of designing Kafka Streams topologies at Michelin.
- Robin continues his quest of learning everything about Flink SQL; in his latest post, aptly named Flink SQL—Misconfiguration, Misunderstanding, and Mishaps, he explores commonly encountered failures when starting with Flink SQL, such as version mismatches, incomplete classpaths, bad catalog configurations, and more.
- On the heels of the Big Three cloud providers, Cloudflare are continuously building out their portfolio of managed services, including object storage, serverless workers, and more. Amongst the latest additions are services for durable execution and stream processing, with a Beta release planned for later this year. The post provides a preview of what the future API may look like.
Event Streaming
- Loving, loving, loving the post Real-Time Twitch Chat Sentiment Analysis with Apache Flink by Volker Janz: it describes how to implement a streaming pipeline for sentiment analysis from the ground up, implementing a custom source function to ingest Twitch chat messages into Flink and using the Stanford CoreNLP library to process them. Excellent write-up!
- “Debezium and Kafka are a solid tandem for replication needs.” I always enjoy reading posts where data engineers describe the problems they’ve encountered and how they’ve overcome them, as there is so much to learn from them. In How we solved RevenueCat’s biggest challenges on data ingestion into Snowflake, Jesús Sánchez provides deep insight into their data ingestion pipeline and explains why the team decided to build it that way.
- Head-of-line blocking is a well-known issue most Kafka users will encounter sooner or later (here’s a Twitter thread I did a while ago on this topic). One way of dealing with this problem is to use parallel Kafka consumers, as nicely illustrated in this post by the engineering team of Halodoc.
- Is the Kafka wire protocol more important than the Apache Kafka project itself? Chris Riccomini makes the case for this way of looking at things in Ce n'est pas un Kafka: Kafka is a Protocol, drawing parallels to similar cases like the protocols of S3, Redis, or Postgres.
- In Modernizing analytics data ingestion pipeline from legacy engine to distributed processing engine, Dineshkumar Shanmugam describes how Freshworks moved a data pipeline processing up to 800,000 events per minute from a bespoke Python-based implementation to Apache Spark.
- Apache Hudi is another popular contender in the data lake space and as such is often used together with Debezium to enable real-time analytics use cases. Shi Kai Ng and Shuguang Xiang discuss how these tools are used at Grab, tightly integrated via Apache Flink.
- Hakampreet Singh Pandher from the Yelp engineering team describes in Building data abstractions with streaming at Yelp how they improved their developer experience by providing a unified data abstraction for online (i.e. REST) and streaming data consumers, based on Apache Flink, Apache Beam, and their in-house tool Joinery.
- Apache Iceberg is becoming more and more popular as an open table format for data lakes. In this session (transcript included), Steven Wu of Apple shares his learnings from building a streaming pipeline from Iceberg, using Apache Flink.
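
To make the head-of-line blocking item above a bit more tangible, here is a minimal, broker-free sketch of the idea behind key-level parallelism: records with different keys are handled concurrently, while records sharing a key stay in order. This is an illustration only, not Halodoc's implementation and not a Kafka client API; the function name process_in_parallel_by_key and its parameters are made up for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def process_in_parallel_by_key(records, handler, max_workers=4):
    """Process (key, value) records concurrently while keeping per-key order.

    Hypothetical helper for illustration; it works on an in-memory list
    rather than on records polled from a Kafka partition.
    """
    # Group values by key, preserving arrival order within each key.
    by_key = defaultdict(list)
    for key, value in records:
        by_key[key].append(value)

    def drain(key):
        # A single worker handles all records of one key sequentially,
        # so a slow key only delays itself, not the other keys.
        return key, [handler(key, value) for value in by_key[key]]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(drain, list(by_key)))
```

With a plain single-threaded consumer, one slow record holds up everything behind it in the same partition; in the sketch, only the records of the same key have to wait.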
Change Data Capture
- Looking for a hands-on example for integrating CDC with Debezium into a stream processing pipeline based on Apache Spark? Then check out this tutorial by Abdelbarre Chafik, running you step by step through setting up a CDC-driven data pipeline, using Docker, Apache Kafka, Debezium, and Apache Spark Streaming.
- Taking data from OLTP to OLAP is one of the data engineering evergreens. In his Data Council session Streaming CDC data from PostgreSQL to Snowflake, challenges and solutions, Alexandru Cristu presents a solution to this problem based on Debezium and Apache Flink.
Data Ecosystem
- The HighGo blog is a great place for learning more about everyone’s favourite RDBMS, aka Postgres. This post by Cary Huang is no exception, providing a nice overview of the stages a query goes through when being executed.
- The Kafka community is bidding farewell to ZooKeeper, not only reducing the operational complexity of running Kafka clusters but also allowing for way more partitions on individual nodes than before. In his latest post, Paolo Patierno of the Strimzi team describes the differences between using Kafka with ZooKeeper and without (“KRaft mode”) and how to migrate existing clusters to the new mode.
- When running a Kafka (or as in this case, Redpanda) cluster in a single region isn’t enough to address your availability requirements, a multi-region set-up is the way to go. In this talk from QCon London, Michał Maślanka discusses the fundamentals of multi-region clusters and dives into details such as leader election, reads from follower replicas, and tiered storage.
- Apache Flink is not the only system with data catalogs (as Robin discussed in these two posts a few weeks ago); other data systems, such as Apache Iceberg, have them too. In A Deep Dive into the Concept and World of Apache Iceberg Catalogs, Dremio’s Alex Merced explores different catalog implementations for the popular open-source table format, including AWS Glue, Hive, Nessie, and others.
- DuckDB is an in-process analytics database that has seen tremendous interest over the last few years. Initially solely an in-memory engine, it has supported spilling over to disk when processing queries against large datasets since version 0.9. If you are curious about how this works under the hood, I highly recommend checking out this post about external aggregation in DuckDB, by Laurens Kuiper.
FLIP Foyer
- FLIP-435: Introduce a New Materialized Table for Simplifying Data Pipelines, proposing to introduce “a new table type [...], called Materialized Table, to simplify streaming and batch ETL pipeline cost-effectively”.
- FLIP-423: Disaggregated State Storage and Management (Umbrella FLIP), proposing to support remote (object) storage for state management purposes in Flink.
Paper of the Month
Events & Call for Papers (CfP)
- Kafka Summit Bengaluru (Bengaluru, India) May 2 (CfP closed)
- StrimziCon (virtual event) May 22 (CfP closed)
- Berlin Buzzwords (Berlin, Germany) June 9-11 (CfP closed)
- Current '24 | The Next Generation of Kafka Summit (Austin, TX) September 17-18 (CfP closed)
- Flink Forward (Berlin, Germany) October 23-24 (CfP closes on May 17)
New Releases
A few new releases this month:
- Apache Flink Kubernetes Operator 1.8.0
- Apache Kafka 3.6.2
- Debezium 2.6.0.Final / 2.6.1.Final / 2.7.0.Alpha1
- Kroxylicious 0.5.1
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.