Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. I’m delighted to introduce myself, Hans-Peter Grahsl, as your editor-in-chief for this edition, joining the roster of editors alongside my Decodable colleagues Gunnar Morling and Robin Moffatt. Feel free to send us any choice nuggets that you think we should feature in future editions. Without further ado, let’s see what August has turned up for us…
Stream Processing, Streaming SQL, and Streaming Databases
- Sparkle: Standardizing Modular ETL at Uber explains how the company has standardized its ETL workloads by developing a modular, configuration-driven approach on top of Apache Spark. Sparkle lets developers focus on writing business logic rather than repetitive boilerplate code, which has significantly improved developer productivity as well as data quality, thanks to test-driven ETL capabilities. (For a flavor of the config-driven idea, see the first sketch after this list.)
- Choosing between different stream processing solutions can be a daunting task. The blog post Comparing Apache Flink and Spark for Modern Stream Data Processing, written by Decodable’s Eric Xiao, offers some help in that regard, laying out the similarities as well as the key differences between these two very capable and widely deployed data streaming engines.
- Ran Zhang's Airbnb Tech Blog post describes how they evolved their Apache Flink architecture in three phases. They transitioned away from YARN and Airflow to running Apache Flink directly on Kubernetes, which not only provided several advantages but also introduced a few new challenges along the way.
- While it's typically rather convenient to use Flink SQL to declaratively describe stream processing pipelines, the devil is in the details at times. Troubleshooting Flink SQL S3 problems, authored by Robin, walks you through selected issues you might face when trying to work with S3 connectors. (The second sketch after this list shows the kind of S3-backed table definition in question.)
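Uber's Sparkle itself isn't open source, but the core idea of the post, replacing hand-written driver code with a thin configuration-driven layer, is easy to picture. Here is a minimal, hypothetical sketch in plain Apache Spark; the paths, the SQL, and the class name are all made up, and Sparkle's actual configuration model is considerably more elaborate:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConfigDrivenEtlJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("config-driven-etl")
                .getOrCreate();

        // In a Sparkle-style setup these three values would come from a
        // pipeline config file rather than being hard-coded (hypothetical).
        String sourcePath = "s3://my-bucket/raw/orders/";
        String transformSql = "SELECT order_id, amount FROM src WHERE amount > 0";
        String targetPath = "s3://my-bucket/curated/orders/";

        // Read the source, expose it under a stable name, apply the
        // configured transformation, and write the result.
        Dataset<Row> source = spark.read().parquet(sourcePath);
        source.createOrReplaceTempView("src");
        spark.sql(transformSql).write().mode("overwrite").parquet(targetPath);

        spark.stop();
    }
}
```

The appeal is that the business logic lives entirely in the configured SQL, so it can be reviewed and unit-tested independently of the plumbing around it.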
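And to make Robin's S3 troubleshooting post concrete, here is a minimal sketch of an S3-backed filesystem table written against Flink's Table API. Bucket name and schema are hypothetical, and the s3:// scheme only resolves once one of Flink's S3 filesystem plugins (s3-fs-hadoop or s3-fs-presto) is on the plugin classpath with credentials configured, which is exactly where the issues described in the post tend to arise:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class S3TableExample {
    public static void main(String[] args) throws Exception {
        // Batch mode keeps the example simple; in streaming mode the
        // filesystem sink additionally needs checkpointing to finalize files.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Hypothetical bucket and schema.
        tEnv.executeSql(
                "CREATE TABLE orders_sink (" +
                "  order_id STRING," +
                "  amount   DECIMAL(10, 2)" +
                ") WITH (" +
                "  'connector' = 'filesystem'," +
                "  'path'      = 's3://my-bucket/orders/'," +
                "  'format'    = 'json'" +
                ")");

        tEnv.executeSql("INSERT INTO orders_sink VALUES ('o-1', 9.99)")
                .await(); // block until the job finishes (throws on failure)
    }
}
```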
Event Streaming
- In JSONata: The Missing Declarative Language for Kafka Connect, Robert Yokota explores the potential of JSONata as a flexible alternative for defining transformations within Kafka Connect without the need for custom code. He created a general-purpose JSONata SMT which allows users to apply JSONata expressions directly to Kafka Connect records, potentially simplifying the transformation process and replacing the need for many custom SMTs.
- Users of old Apache Kafka client versions are missing out on significant enhancements introduced in more recent releases. In Why You Need To Upgrade Your Apache Kafka Client Immediately, Gilles Philippart highlights the risks of running outdated clients, such as lack of support, exposure to long-fixed bugs, and missed performance improvements. The post even includes a compiled list of everything that has been introduced since client version 2.0.
- Back in late 2020, Confluent open-sourced a parallel Apache Kafka consumer library for Java which provides an alternative approach to parallelism, subdividing the unit of work from a partition down to a key or even a single message. Italo Nesi recently started a new project, kafka-pyrallel-consumer, to bring similar parallelization techniques to Python applications. (A usage sketch of the Java library follows this list.)
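For context on what subdividing the unit of work looks like in practice, here is a minimal usage sketch of the Java library, loosely following the project's README. Broker address, topic, group, and concurrency settings are hypothetical, and the exact poll signature may differ between library versions, so do check the README for the release you use:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.List;
import java.util.Properties;

import static io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder.KEY;

public class ParallelConsumeExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical cluster
        props.put("group.id", "orders-processor");
        props.put("enable.auto.commit", "false"); // the library manages offsets itself
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ParallelConsumerOptions<String, String> options =
                ParallelConsumerOptions.<String, String>builder()
                        .ordering(KEY)     // parallelize per key instead of per partition
                        .maxConcurrency(16)
                        .consumer(new KafkaConsumer<>(props))
                        .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);
        processor.subscribe(List.of("orders")); // hypothetical topic
        processor.poll(context -> System.out.println(
                "processing: " + context.getSingleConsumerRecord().value()));
    }
}
```

With KEY ordering, records sharing a key are still processed in order, while records with different keys from the same partition can be worked on concurrently.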
Data Ecosystem
- Gwen Shapira's blog post details the creation of an AI Code Assistant, a tool designed to help users explore codebases through interactions with a virtual "senior developer." She describes the technical journey from selecting models and frameworks to integrating the entire system into a functioning SaaS product. Key aspects covered include database schema design, the role of embeddings in facilitating effective query responses (see the toy sketch after this list), and the final user interaction model that mimics ChatGPT's conversational style.
- Jack Vanlightly recently started a blog series comparing table format internals. In How Do The Table Formats Represent The Canonical Set Of Files? he classifies these formats into two approaches: the "log of deltas" used by Hudi and Delta Lake, versus the "log of snapshots" used by Iceberg and Paimon. Append-Only Tables And Incremental Reads explores how the different table formats let data be appended sequentially and read incrementally by compute engines such as Apache Flink, highlighting their strategies for managing data files and snapshots for efficient data retrieval and streaming.
- In their article Maestro: Data/ML Workflow Orchestrator at Netflix, the authors share that their versatile workflow orchestrator is now open source. Designed to address the scalability challenges of its predecessor, Meson, Maestro aims for high scalability and operational excellence, and offers a fully managed workflow-as-a-service (WaaS) platform supporting a diverse user base of data scientists and engineers.
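As a toy illustration of the embedding idea from Gwen's post: retrieval boils down to comparing a query vector against stored vectors and picking the closest match. The sketch below does this with brute-force cosine similarity over made-up three-dimensional vectors; a real assistant would use model-generated embeddings of code snippets and a proper vector index instead of a linear scan:

```java
public class EmbeddingLookup {

    // Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {0.9, 0.1, 0.3};                    // embedding of the user question
        double[][] corpus = {{0.8, 0.2, 0.4}, {0.1, 0.9, 0.2}}; // embeddings of stored snippets

        // Pick the snippet whose embedding is most similar to the query.
        int best = 0;
        for (int i = 1; i < corpus.length; i++) {
            if (cosine(corpus[i], query) > cosine(corpus[best], query)) best = i;
        }
        System.out.println("closest snippet index: " + best);
    }
}
```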
Data Platforms and Architecture
- The Data Engineering Open Forum, held on April 18th, 2024, at Netflix’s Los Gatos office, brought together data engineers from various industries to exchange insights on current and future challenges in data engineering. This post recaps the inaugural forum with session summaries and links to all talk recordings.
- Felix GV, committer on the Venice project, discusses the complexities surrounding AI infrastructure. In his talk Lessons Learned from Building LinkedIn’s AI Data Platform, he explains the evolution of LinkedIn's AI platform from a disjointed set of tools to an integrated, opinionated system designed to streamline workflows for AI researchers and engineers. Felix shares lessons learned and practical advice for AI infrastructure development, and touches upon the scalability and operational strategies of LinkedIn’s systems, especially the Venice data platform.
- The blog post Building and Scaling Notion's Data Lake details how Notion implemented and evolved its data infrastructure to efficiently manage and analyze an ever-growing volume of data. As Notion scaled, they transitioned from a traditional data warehouse to a data lake architecture, which offered better flexibility and performance. Debezium plays a crucial role by enabling real-time change data capture from their primary database; a minimal sketch of this CDC pattern follows this list.
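Notion runs Debezium at considerable scale on Kafka Connect, but the basic capture loop is easy to demonstrate with Debezium's embedded engine. A minimal sketch, assuming a local PostgreSQL instance and hypothetical connection details; this is for illustration only and not how Notion deploys it:

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;

public class CdcSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "engine");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("plugin.name", "pgoutput");
        props.setProperty("database.hostname", "localhost"); // hypothetical connection details
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "postgres");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "app");
        props.setProperty("topic.prefix", "app");

        // run() blocks until the engine is closed; real applications
        // typically submit it to an executor instead.
        try (DebeziumEngine<ChangeEvent<String, String>> engine =
                     DebeziumEngine.create(Json.class)
                             .using(props)
                             .notifying(event -> System.out.println(event.value()))
                             .build()) {
            engine.run();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```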
RDBMS and Change Data Capture
- Ryan Marcus recently wondered how much PostgreSQL had improved over the past decade. In Ten years of improvements in PostgreSQL's optimizer, he presents the results of comparing PostgreSQL versions 8 through 16 by means of the Join Order Benchmark. The analysis revealed that each new major PostgreSQL version reduced latency by about 15% on average; compounded across releases, that adds up to queries running several times faster than a decade ago.
- It's been a while since we heard news about Debezium's UI development. The post Status of Debezium UI explains the motivation for a broader approach and why the initial web UI project tied to Kafka Connect has been paused. The team is now focusing on a new UI designed for multiple deployment models, starting with Debezium Server on Kubernetes. This new UI lets users create data pipelines while abstracting away the specific deployment details. A proof-of-concept prototype is available for community feedback.
- The blog post Real time Data Analytics solution with Spanner Change Streams by Nikhil Challa emphasizes the necessity of change data capture (CDC) for near real-time data replication and analytics use cases. He explains how Google Cloud Spanner’s Change Streams feature works and provides a step-by-step guide on how to get started; a minimal DDL sketch follows this list.
- The first alpha of Debezium 3.0 introduces a new sink connector for MongoDB, the second sink connector after the JDBC sink added back in version 2.2. While Kafka 3.8 is the new baseline for testing and building Debezium 3.0, the project also raises its Java requirements to JDK 17 / 21 and needs a more recent version of Maven for building Debezium from source. Read about a bunch of other notable changes here and here.
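As a taste of the setup Nikhil's guide walks through, enabling a change stream in Spanner is a single DDL statement. Here is a minimal sketch using the Java client, with hypothetical instance and database IDs; the DDL itself is standard Spanner syntax for watching the whole database:

```java
import com.google.cloud.spanner.DatabaseAdminClient;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;

import java.util.Collections;

public class CreateChangeStream {
    public static void main(String[] args) throws Exception {
        Spanner spanner = SpannerOptions.newBuilder().build().getService();
        try {
            DatabaseAdminClient admin = spanner.getDatabaseAdminClient();
            // "my-instance" and "my-database" are hypothetical identifiers.
            admin.updateDatabaseDdl(
                            "my-instance",
                            "my-database",
                            Collections.singletonList(
                                    "CREATE CHANGE STREAM everything FOR ALL"),
                            null)
                    .get(); // wait for the long-running DDL operation to finish
        } finally {
            spanner.close();
        }
    }
}
```

Once the change stream exists, it can be consumed for analytics, for instance via the Dataflow templates that the blog post describes.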
Papers of the Month
In their paper A survey on the evolution of stream processing systems, Marios Fragkoulis et al. provide a thorough discussion of the evolution of stream processing systems over the past 20+ years, categorizing them into three generations: early scale-up systems, distributed data-parallel systems, and emerging trends in cloud and edge computing. The paper highlights advancements in handling out-of-order data, state management, fault tolerance, load management, and elasticity. It also unifies terminology across different systems and outlines key design considerations for modern stream processing. Future directions emphasize the need for adaptive, scalable, and resource-efficient systems, particularly for IoT and real-time analytics.
George Siachamis et al. investigate the effectiveness of current control-based auto scaling solutions for stream processing engines (SPEs) under dynamic workloads. In their research paper Evaluating Stream Processing Autoscalers, they apply benchmarking to compare autoscalers and reveal key issues such as poor handling of stateful queries and high latency during scaling actions. Their experiments demonstrate that existing autoscalers struggle with real-world, highly dynamic workloads, often leading to inefficient resource utilization and performance degradation. Surprisingly, general-purpose autoscalers, particularly those based on CPU usage, can outperform specialized solutions in certain scenarios, highlighting areas for future research and improvement in autoscaler design.
Events & Call for Papers (CfP)
- Current '24 | The Next Generation of Kafka Summit (Austin, TX) September 17-18
- Flink Forward (Berlin, Germany) October 23-24
- OSDC West (Burlingame, CA) October 29-31
- KubeCon / CloudNativeCon NA 2024 (Salt Lake City, UT) November 12-15
- AWS re:Invent (Las Vegas, NV) December 2-6
- NDC London (London, UK) January 27-31, 2025 (CfP closes on September 1)
New Releases
A few recent releases:
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you have.
Hans-Peter (LinkedIn / X / Mastodon / Email)