Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Sophie Blee-Goldman put together a comprehensive guide to upgrading topologies of Kafka Streams apps. It emphasizes the importance of proactive planning and adherence to best practices to ensure seamless application evolution.
- In "Exploring Apache DataFusion as a Foundation for Streaming Framework", Yaroslav Tkachenko highlights the growing trend of adopting Rust for high-performance data processing and examines the feasibility of utilizing Apache DataFusion, a Rust-based query engine leveraging Apache Arrow, as the core of a stream processing framework.
- In a recent interview, Chris Riccomini spoke with Gilad Kleinman (Epsio) about incremental view maintenance. They discuss how Epsio differentiates itself by updating result tables directly within the original database, and why and how it consolidates changes before writing them back to the OLTP system.
- Are you a developer who lives and breathes Python? I wrote "A Hands-On Introduction to PyFlink" to help you understand just about enough of the underlying building blocks to get started with your Python-based stream processing jobs in a practical way.
Event Streaming
- Jason Taylor's "The Curse of Replication Saturation in Apache Kafka" explains why the replication process can become a bottleneck, leading to increased latency and reduced throughput. Mitigation strategies include monitoring replication metrics closely and optimizing related configurations, such as adjusting the number of replication threads and tuning network settings.
- KIP-1134 introduces the concept of virtual clusters in Apache Kafka, enabling multiple, isolated logical clusters to coexist within a single physical Kafka cluster. This design allows tenants to share underlying hardware and resources while maintaining separation in terms of topics, configurations, and access controls.
- In "How We Run a 5 GB/s Kafka Workload for Just $50 per Hour", Matteo Merli et al. provide an in-depth analysis of the cost differences when running Kafka workloads across selected vendors. Their benchmarks show how a leaderless architecture, the elimination of most inter-AZ traffic, and the use of cloud object storage result in significant cost savings.
Data Ecosystem
- SeungMin Lee from Kakao’s Data Analytics Platform shares an in-depth experience report about moving data from MySQL to Apache Iceberg using Flink CDC. The article explains many of the design decisions and implementation choices behind their approach and touches on consistency, performance, and error handling.
- Apache Paimon has officially released the stable version 1.0.1 after nearly five months of development. Read the release details to learn about Iceberg compatibility, snapshot transactions, performance improvements, Flink / Spark integrations, and several other aspects.
Data Platforms and Architecture
- Adam Bellemare critiques the inefficiencies of the traditional medallion architecture and proposes a "shift left" strategy, i.e., producing high-quality, standardized data earlier in pipelines. This approach reduces redundancy, improves accessibility, and optimizes efficiency by eliminating unnecessary downstream transformations and copies.
- The talks "Complexity is the Gotcha of Event-driven Architecture" (by David Boyne) and "So You Want to Build An Event Driven System?" (by James Eastham) contain lots of useful insights and practical tips based on real-world experiences with event-driven architectures. Both made it into the 100 most watched software engineering talks of 2024 according to this list.
- Ayeshee Patra outlines a step-by-step approach to integrating OLTP databases with Apache Pinot for real-time analytics using Debezium, Amazon MSK, and StarTree Cloud. The guide covers best practices for CDC, partitioning strategies, authentication, and query performance to ensure efficient and accurate data ingestion.
RDBMS and Change Data Capture
- The talk "7+ million Postgres tables" by Kailash Nadh explores the hacks and decisions that went into what originally seemed like a ridiculous idea, but turned out to be a high performing, cost-effective, and practically zero maintenance solution to a hard problem.
- Ismail Simsek recently created pydbzengine, an open-source Python library to interact with the Debezium embedded engine. This blog post demonstrates how to capture data changes from PostgreSQL using this library and propagate them into DuckDB with dlt (data load tool).
- Anthony Accomazzo’s article highlights key challenges when implementing a consistent CDC solution. He describes how his team implemented a variation of Netflix's DBLog design in Elixir using a chunked capture approach with watermark synchronization and PostgreSQL's pg_logical_emit_message.
- In a guest post on the Debezium blog, René Rütter shares strategies to enhance the performance of initial snapshots when using Debezium with Oracle databases. By adjusting selected configuration properties, his team reduced snapshot times by 25%.
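For readers curious about the watermark idea behind the DBLog-style chunked capture mentioned in Anthony Accomazzo's article above, here is a minimal in-memory sketch in Python. It is not the Elixir implementation from the article; the event tuples, the `reconcile_chunk` name, and the dictionary-based chunk are assumptions made purely for illustration. The core idea: rows read in a snapshot chunk are discarded if the change log touched the same key between the low and high watermark, so a stale snapshot row never overwrites a newer change.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (kind, key, value), where kind is
# "change", "low" (low watermark), or "high" (high watermark).
Event = Tuple[str, object, object]


def reconcile_chunk(
    log: Iterable[Event], chunk: Dict[object, object]
) -> List[Tuple[object, object]]:
    """Interleave one snapshot chunk with the change log between watermarks."""
    output: List[Tuple[object, object]] = []
    remaining = dict(chunk)  # chunk rows not yet superseded by a log change
    in_window = False
    for kind, key, value in log:
        if kind == "low":
            in_window = True
        elif kind == "high":
            # Emit the surviving chunk rows, then resume normal log processing.
            output.extend(remaining.items())
            remaining.clear()
            in_window = False
        else:
            if in_window:
                # The log has a newer version of this key: drop the chunk row.
                remaining.pop(key, None)
            output.append((key, value))
    return output


log = [
    ("change", 1, "a1"),
    ("low", None, None),
    ("change", 2, "b2"),
    ("high", None, None),
    ("change", 3, "c2"),
]
chunk = {1: "a1", 2: "b1", 3: "c1"}
result = reconcile_chunk(log, chunk)
```

In this toy run, the snapshot row for key 2 (`"b1"`) is dropped because the log changed that key inside the watermark window, while the rows for keys 1 and 3 are emitted at the high watermark and any later changes follow them in order.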
Paper of the Month
Pavan Edara et al. dive into Vortex, a storage engine developed within Google BigQuery to facilitate real-time analytics on streaming data. Key features of Vortex include ACID compliance, a unified API for batch and streaming data, and a highly distributed, regionally replicated architecture optimized for append-focused ingestion of structured and semi-structured data. Currently, Vortex supports petabyte-scale data ingestion in BigQuery, achieving sub-second data freshness and query latency.
Events & Call for Papers (CfP)
- Gartner Data & Analytics Summit (Orlando, FL) March 3-5
- Current 2025 (Bengaluru, India) March 19
- Iceberg Summit 2025 (San Francisco, CA) April 8-9
- Data Council (Bay Area, CA) April 22-24
- Current 2025 (London, UK) May 20-21
New Releases
- Apache Flink 1.19.2 and 1.20.1
- Debezium 3.1.0.Alpha2
- Apache Iceberg 1.8.0
- Apache Paimon 1.0.1
- librdkafka v2.8.0
- Kroxylicious v0.10.0
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you have.