June 25, 2024

Checkpoint Chronicle - June 2024

By
Gunnar Morling

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Robin Moffatt (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.

I’m going to do something a bit different and open with some coverage of the broader data ecosystem, because it’s been a busy few weeks!

Data Ecosystem

Catalogs

Table Formats

  • As well as announcing the open-sourcing of Unity Catalog, Databricks announced the acquisition of Tabular. This is the company founded by Ryan Blue, the original creator of Apache Iceberg. What this means for Iceberg is going to be fascinating to see. Will Databricks use the Iceberg expertise that they’ve now got to advance the format in parallel with their own Delta Lake, or perhaps just evolve interoperability between the two formats?
  • If you’re not familiar with Iceberg, there’s a nice primer from Seattle Data Guy, and you can see it in action in this blog that I wrote about writing Kafka data to Iceberg.
  • There’s an interesting analysis from Confluent—who are backing Iceberg with their TableFlow product—looking at whether Apache Iceberg will win over Delta Lake (tl;dr: yes). That said, plenty of vendors are hedging their bets, with BigQuery recently announcing first-party support for Delta Lake.
  • Lastly, there were cries of “don’t forget us!” from the Apache Hudi crowd, telling us that Hudi is the Open Data Lakehouse Platform We Need.

And finally…

  • Redpanda acquired Benthos and swiftly rebranded it as Redpanda Connect with some licensing changes, prompting competitor WarpStream to fork it.
  • In all of this, Alex Merced’s article about open source is particularly timely. He explains well what open source means, how it's governed, and its relationship with commercial entities—relevant to the catalog announcements, table formats, and Benthos acquisition.

Stream Processing, Streaming SQL, and Streaming Databases

Event Streaming

Data Platforms and Architecture

  • As an active user of Reddit, and a mod over on the r/apachekafka subreddit, I’m always particularly interested to learn more about the platform itself. This blog from Stephan Weinwurm and colleagues in the engineering team at Reddit explains how they implemented and rolled out a new Python-based microservice built around Kafka that scores new content as it’s posted, checking for any breach of Reddit’s content rules.
  • I wince every time I see someone give the hackneyed and generic “fraud detection” as a reason for needing real-time data—so these examples from Lyft and Uber of how they use data are a really refreshing read. 
  • I was fascinated by this teaser from Spotify about more details of their data platform, and delighted when they published this follow-up going into details of some of the design as well as the technologies used, including Apache Flink and BigQuery.
  • Uber are not mucking about when it comes to data volumes in their batch systems: over an exabyte is held across their estate of Hadoop (HDFS) servers. This blog details their migration from purely on-premises infrastructure to a hybrid deployment with GCP. Whilst adopting Google’s object store (GCS), they are for now continuing to run their own software, with a move to native PaaS on GCP planned for the future.

RDBMS and Change Data Capture

Rant of the Month

A departure from this regular section’s content: this month I came across this article, which blended common sense, good analysis, and a sweary rant so perfectly that I just had to include it 😀

Events & Call for Papers (CfP)

New Releases

A few new releases this month:

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear any feedback or suggestions you’ve got.

Gunnar (LinkedIn / X / Mastodon / Email)

Robin (LinkedIn / X / Mastodon / Email)

Gunnar Morling

Gunnar is an open-source enthusiast at heart, currently working on Apache Flink-based stream processing. In his prior role as a software engineer at Red Hat, he led the Debezium project, a distributed platform for change data capture. He is a Java Champion and has founded multiple open source projects such as JfrUnit, kcctl, and MapStruct.
