September 26, 2024

Checkpoint Chronicle - September 2024

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling, Robin Moffatt (your editor-in-chief for this edition), and Hans-Peter Grahsl. Feel free to send our way any choice nuggets that you think we should feature in future editions.

I’m writing this on my flight back from Current 2024 in Austin. It was a packed couple of days that you can read more about in my recap blog. If you want to catch the keynotes, they are already online (day 1 and day 2). I was excited to present at the conference, along with three of my colleagues, and you can find our talks here:

Data Contracts In Practice With Debezium and Apache Flink (Gunnar Morling)
The Joy of JARs (and Other Flink SQL Troubleshooting Tales) (Robin Moffatt)
So you want to write a User-Defined Function for Flink? (Hans-Peter Grahsl)
Timing is Everything: Understanding Event-Time Processing in Flink SQL (Sharon Xie)

Stream Processing, Streaming SQL, and Streaming Databases

Rohan Desai, a co-founder at Responsive, has written about the new Async Processor for Kafka Streams. It includes a useful introduction that sets the scene for why it’s needed and what the limitations of current options are.
My former colleague Bill Bejeck has written an excellent series all about windowing in stream processing. I featured one of them previously; here’s the full set now:
I wrote up a guide on how to write Delta Lake tables from Flink, along with some troubleshooting details.

Event Streaming

In one of the more surprising recent acquisitions, Confluent announced that they have acquired WarpStream to fill the Bring-Your-Own-Cloud (BYOC) gap in their portfolio. WarpStream have shot to prominence in this space for their Kafka-compatible platform, which uses S3 to store data directly. WarpStream themselves are talking about the acquisition as an opportunity to grow further within Confluent, rather than their technology simply being absorbed into Confluent’s own. Confluent’s Jack Vanlightly took the opportunity to “clarify” his views on BYOC after being critical of it last year.

Javier Holguera has published a useful set of blogs around naming conventions in Kafka, covering topics and producers/consumers.
Following on from Uber’s contribution of tiered storage for Kafka, Pinterest have shared details and the source code (under Apache 2.0 licence) of their own implementation. It differs in several ways, including handling tiered storage separately from the broker. Pinterest have been using it in production since May this year, offloading a staggering 200 TB of data per day.
A useful writeup from DoorDash on how they run their internal self-service Kafka platform.

Data Ecosystem

Amazon S3 added support for conditional writes, which Gunnar took a look at in this excellent post on how they could be used for leader election in distributed systems.
Decades after the first ones were created, new databases are still being written. Learn more in this post about the challenges James Munro found with existing systems that led to him creating ArcticDB.
Some observations from StarRocks on the Unity Catalog and what the Databricks acquisition of Tabular earlier this year might mean for Apache Iceberg.
I really enjoyed this well-written piece from Expedia detailing how and why they migrated from Apache Cassandra to ScyllaDB. It includes the evaluation of the different migration options, and how they achieved a zero-downtime migration of their 15-node, 1TB Cassandra cluster.

Data Platforms and Architecture

Airbnb wrote about their Lambda-architecture-based platform Riverbed last year, and recently published a more detailed look at it. It’s based around SpinalTap, their CDC solution, which has been around for a while.
This blog post from Uber is interesting for the level of detail it goes into on how they organize the data in their object stores to optimize for things such as data ownership, access control, throughput optimisation, and platform date limits. It’s part of their move to GCP and their migration from HDFS to GCS.
If you’re interested in system design you’ll want to check out this detailed two-part series from Agoda that goes into the nuts-and-bolts of how they manage deduplication of bookings across multiple data centers.
Staying with online travel booking, this post from Booking.com discusses how the ML-based ranking platform within the search results works.

RDBMS and Change Data Capture

Two fascinating blogs about different approaches taken to upgrading to MySQL 8.0 from Uber and GitHub.
At some point the splicing and mixing of different protocols and engines is all going to get too silly, but for now bear with me: Tansu is a Kafka-compatible broker using Postgres as its storage, and pg_duckdb embeds DuckDB as an engine within Postgres.
Back in the regular world of vanilla Postgres, there’s a nice collection of the things you maybe didn’t realize that it could do, along with a slightly provocative assertion that you should just use Postgres. The latter is actually pretty useful in positioning its strengths (and weaknesses) against other types of data management systems.

Paper of the Month

In Petabyte-Scale Row-Level Operations in Data Lakehouses, Ryan Blue (one of the co-creators of Apache Iceberg) and a team from Apple look at improvements in Iceberg and Apache Spark that bring the lakehouse closer to the data warehouse of old in terms of richer functionality and performance.

Events & Call for Papers (CfP)

Flink Forward (Berlin, Germany) October 23-24
OSDC West (Burlingame, CA) October 29-31
KubeCon / CloudNativeCon NA 2024 (Salt Lake City, UT) November 12-15
AWS re:Invent (Las Vegas, NV) December 2-6
NDC London (London, UK) January 27-31, 2025
Current 2025 (London, UK) May 20-21
Current 2025 (Bangalore, India) Mid-March
Current 2025 (New Orleans, LA) October 29-30

New Releases

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you have.

Gunnar (LinkedIn / X / Mastodon / Email)
Robin (LinkedIn / X / Mastodon / Email)
Hans-Peter (LinkedIn / X / Mastodon / Email)

Robin Moffatt

Robin is a Principal DevEx Engineer at Decodable. He has been speaking at conferences since 2009 including QCon, Devoxx, Strata, Kafka Summit, and Øredev. You can find many of his talks online and his articles on the Decodable blog as well as his own blog.

Outside of work, Robin enjoys running, drinking good beer, and eating fried breakfasts—although generally not at the same time.
