September 26, 2024

Checkpoint Chronicle - September 2024

By Robin Moffatt

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling, Robin Moffatt (your editor-in-chief for this edition), and Hans-Peter Grahsl. Feel free to send our way any choice nuggets that you think we should feature in future editions.

I’m writing this on my flight back from Current 2024 in Austin. It was a packed couple of days that you can read more about in my recap blog. If you want to catch the keynotes, they are already online: day 1 & day 2. I was excited to present at the conference, along with three of my colleagues, and you can find our talks here:

Stream Processing, Streaming SQL, and Streaming Databases

Event Streaming

  • In one of the more surprising recent acquisitions, Confluent announced that they have acquired WarpStream to fill the Bring-Your-Own-Cloud (BYOC) gap in their portfolio. WarpStream have shot to prominence in this space with their Kafka-compatible platform, which stores data directly in S3. WarpStream themselves are talking about the acquisition as an opportunity to grow further within Confluent, rather than their technology simply being absorbed into Confluent’s own. Confluent’s Jack Vanlightly took the opportunity to “clarify” his views on BYOC after being critical of it last year.
  • Javier Holguera has published a useful set of blog posts on naming conventions in Kafka, covering topics and producers/consumers.
  • Following on from Uber’s contribution of tiered storage for Kafka, Pinterest have shared details and the source code (under Apache 2.0 licence) of their own implementation. It differs in several ways, including handling tiered storage separately from the broker. Pinterest have been using it in production since May this year, offloading a staggering 200 TB of data per day.
  • A useful writeup from DoorDash on how they run their internal self-service Kafka platform.

Data Ecosystem

  • Amazon S3 added support for conditional writes. Gunnar took a look and published this excellent post on how they could be used for leader election in distributed systems.
  • Decades after the first ones were created, new databases are still being written. Learn more in this post about the challenges James Munro found with existing systems that led to him creating ArcticDB.
  • Some observations from StarRocks on the Unity Catalog and what the Databricks acquisition of Tabular earlier this year might mean for Apache Iceberg.
  • I really enjoyed this well-written piece from Expedia detailing how and why they migrated from Apache Cassandra to ScyllaDB. It includes the evaluation of the different migration options, and how they achieved a zero-downtime migration of their 15-node, 1TB Cassandra cluster.
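To make the conditional-writes item above a little more concrete, here is a minimal sketch of the general idea: if a write succeeds only when the target object does not yet exist, then the one node whose write succeeds can treat itself as leader. This is an illustration under stated assumptions, not necessarily the scheme described in Gunnar's post. `InMemoryObjectStore`, `put_if_absent`, and the `leader/epoch-…` key layout are all hypothetical names invented for this example; the in-memory class merely stands in for an object store that rejects a write when the key is already present.

```python
import threading


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store with put-if-absent semantics.

    It only mimics the behaviour this sketch relies on: a write to a key
    succeeds if and only if no object exists under that key yet.
    """

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, body):
        """Store body under key unless key already exists; return True on success."""
        with self._lock:
            if key in self._objects:
                return False  # someone else created the object first
            self._objects[key] = body
            return True

    def get(self, key):
        """Return the stored bytes for key, or None if there is no such object."""
        with self._lock:
            return self._objects.get(key)


def try_become_leader(store, epoch, node_id):
    """Try to claim leadership for an epoch by creating that epoch's lock object."""
    key = f"leader/epoch-{epoch:08d}"  # assumed key layout, for illustration only
    return store.put_if_absent(key, node_id.encode())


def elect(store, epoch, node_ids):
    """Let all nodes race for the same epoch and return the ones that won."""
    winners = []

    def campaign(node_id):
        if try_become_leader(store, epoch, node_id):
            winners.append(node_id)

    threads = [threading.Thread(target=campaign, args=(n,)) for n in node_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


if __name__ == "__main__":
    store = InMemoryObjectStore()
    print(elect(store, 1, ["node-a", "node-b", "node-c"]))
```

However many nodes race for the same epoch, exactly one conditional write can succeed, so `elect` always returns a single winner. A real deployment would also need a way to detect a failed leader and move to the next epoch, which this sketch leaves out.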

Data Platforms and Architecture

  • Airbnb wrote about their Lambda-architecture-based platform Riverbed last year, and recently published a more detailed look at it. It’s built around SpinalTap, their CDC solution, which has been around for a while.
  • This blog post from Uber is interesting for the level of detail it goes into on how they organize the data in their object stores to optimize for things such as data ownership, access control, throughput optimisation, and platform date limits. It’s part of their move to GCP and their migration from HDFS to GCS.
  • If you’re interested in system design you’ll want to check out this detailed two-part series from Agoda that goes into the nuts-and-bolts of how they manage deduplication of bookings across multiple data centers.
  • Staying with online travel booking, this post from Booking.com discusses how the ML-based ranking platform behind their search results works.

RDBMS and Change Data Capture

  • Two fascinating blog posts, from Uber and GitHub, about the different approaches they took to upgrading to MySQL 8.0.
  • At some point the splicing and mixing of different protocols and engines is all going to get too silly, but for now bear with me: Tansu is a Kafka-compatible broker using Postgres as its storage, and pg_duckdb embeds DuckDB as an engine within Postgres.
  • Back in the regular world of vanilla Postgres, there’s a nice collection of the things you maybe didn’t realize that it could do, along with a slightly provocative assertion that you should just use Postgres. The latter is actually pretty useful in positioning its strengths (and weaknesses) against other types of data management systems. 

Paper of the Month

In Petabyte-Scale Row-Level Operations in Data Lakehouses, Ryan Blue (one of the co-creators of Apache Iceberg) and a team from Apple look at improvements in Iceberg and Apache Spark that bring the lakehouse closer to the data warehouse of old in terms of richer functionality and performance.

Events & Call for Papers (CfP)

New Releases

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear any feedback or suggestions you have.

Gunnar (LinkedIn / X / Mastodon / Email)
Robin (LinkedIn / X / Mastodon / Email)
Hans-Peter (LinkedIn / X / Mastodon / Email)


Robin Moffatt

Robin is a Principal DevEx Engineer at Decodable. He has been speaking at conferences since 2009 including QCon, Devoxx, Strata, Kafka Summit, and Øredev. You can find many of his talks online and his articles on the Decodable blog as well as his own blog.

Outside of work, Robin enjoys running, drinking good beer, and eating fried breakfasts—although generally not at the same time.
