June 25, 2024

Checkpoint Chronicle - June 2024

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Robin Moffatt (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.

I’m going to do something a bit different and open with some coverage of the broader data ecosystem, because it’s been a busy few weeks!

Data Ecosystem

Catalogs

Databricks and Snowflake each announced the intention to open-source their catalog implementations (Unity and Polaris, respectively).
Polaris is not yet available as open source, but there’s a nice write-up from Sem Sinchenko of their initial experiences with Unity.
Chris Riccomini has a good explainer of the importance of catalogs in data architectures.

Table Formats

As well as announcing the open-sourcing of Unity, Databricks announced the acquisition of Tabular. This is the company founded by Ryan Blue, the original creator of Apache Iceberg. What this means for Iceberg is going to be fascinating to see. Will Databricks use the Iceberg expertise that they’ve now got to advance the format in parallel with their own Delta Lake, or perhaps just improve interoperability between the two?
If you’re not familiar with Iceberg, there’s a nice primer from Seattle Data Guy, and you can see it in action in this blog that I wrote about writing Kafka data to Iceberg.
There’s an interesting analysis from Confluent (who are backing Iceberg with their Tableflow product) looking at whether Apache Iceberg will win over Delta Lake (tl;dr: yes). That said, plenty of vendors are hedging their bets, with BigQuery recently announcing first-party support for Delta Lake.
Lastly, there were cries of “don’t forget us!” from the Apache Hudi crowd, telling us that Hudi is the Open Data Lakehouse Platform We Need.

And finally…

Redpanda acquired Benthos and swiftly rebranded it as Redpanda Connect with some licensing changes, prompting competitor WarpStream to fork it.
In all of this, Alex Merced’s article about open source is particularly timely. He explains well what open source means, how it's governed, and its relationship with commercial entities—relevant to the catalog announcements, table formats, and Benthos acquisition.

Stream Processing, Streaming SQL, and Streaming Databases

A solid post from Maciej Maciejko looking at several aspects of optimizing Flink SQL.
Jaehyeon Kim writes some excellent hands-on articles, and this one about PyFlink is well worth a read. It nicely complements Gunnar's article about PyFlink from last year.
I do enjoy a good methodical troubleshooting story, and this one from Matvey Mitnitsky about fixing Kafka Streams uneven tasks distribution is a good read.
This article from Adam Bellemare does a really good job of outlining the kind of problem scenarios that can occur with bad data, and the differences in handling it with batch vs stream processing (plus it has some excellent memes 😉). If you’re interested in this area then take a look at the concept of Write-Audit-Publish too; I wrote about this last year.

Event Streaming

Walmart have been using Kafka for many years, and share some really useful experiences and suggestions in this article about Reliably Processing Trillions of Kafka Messages Per Day.
An interesting look at how Booking.com use events as part of their observability strategy.
Getting metrics out of Kafka clients has previously been limited to what you could do with JMX. KIP-714 proposes improvements in this area, adding support for OpenTelemetry. My friend and former colleague Ricardo Ferreira has done a nice writeup of it in this article.

Data Platforms and Architecture

As an active user of Reddit, and a mod over on the r/apachekafka subreddit, I’m always particularly interested to learn more about the platform itself. This blog from Stephan Weinwurm and colleagues in the engineering team at Reddit explains how they implemented and rolled out a new Python-based microservice built around Kafka for scoring new content as it’s posted, to check for any breach of Reddit’s content rules.
I wince every time I see someone give the hackneyed and generic “fraud detection” as a reason for needing real-time data—so these examples from Lyft and Uber of how they use data are a really refreshing read.
I was fascinated by this teaser from Spotify about more details of their data platform, and delighted when they published this followup going into details of some of the design as well as technologies used, including Apache Flink and BigQuery.
Uber are not mucking about when it comes to data volumes in their batch systems—over an exabyte is held across their estate of Hadoop (HDFS) servers. This blog details their migration from purely on-premises infrastructure to a hybrid deployment with GCP. Whilst adopting Google’s object store (GCS) they are for now continuing to run their own software, with a move to native PaaS on GCP planned in for the future.

RDBMS and Change Data Capture

A very nice deep dive from Jason Fulghum looking at MySQL's Replication Protocol, which pairs nicely with an article looking at database replication concepts in general from earlier this year by Atakan Serbes.
Elad Leev has a good review of the benefits that CDC, and specifically Debezium, provide.

Rant of the Month

A departure from this section’s regular content: this month I came across this article, which so perfectly blended common sense, good analysis, and a sweary rant that I just had to include it 😀

I Will F**king Piledrive You If You Mention AI Again

Events & Call for Papers (CfP)

Beam Summit (Sunnyvale, CA) September 4-5
JavaZone (Oslo, Norway) September 4-5
Current '24 | The Next Generation of Kafka Summit (Austin, TX) September 17-18
BigDataLDN (London, UK) September 18-19
Flink Forward (Berlin, Germany) October 23-24
Big Data Conference Europe (Vilnius, Lithuania & Online) November 19-22

New Releases

A few new releases this month:

Debezium 2.7.0.Beta1
Apache Flink CDC 3.1.1
Apache Flink 1.19.1
DuckDB 1.0.0 (bonus: check out this InfoQ talk by one of DuckDB’s creators, Hannes Mühleisen)

That’s all for this month! We hope you’ve enjoyed the newsletter, and we’d love to hear any feedback or suggestions you have.

Gunnar (LinkedIn / X / Mastodon / Email)

Robin (LinkedIn / X / Mastodon / Email)

Gunnar Morling

Gunnar is an open-source enthusiast at heart, currently working on Apache Flink-based stream processing. In his prior role as a software engineer at Red Hat, he led the Debezium project, a distributed platform for change data capture. He is a Java Champion and has founded multiple open source projects such as JfrUnit, kcctl, and MapStruct.
