May 28, 2024
5 min read

Checkpoint Chronicle - May 2024

By Robin Moffatt

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Robin Moffatt (your editor-in-chief for this edition). Feel free to send our way any choice nuggets that you think we should feature in future editions.

Stream Processing, Streaming SQL, and Streaming Databases

Event Streaming

Data Platforms and Architecture

  • Data modelling goes in and out of fashion as technology enables people to crunch ever more data without modelling it purely for optimisation. But then people start getting incorrect or confusing data, and realise that there’s a reason data modelling has been a thing for many decades. Joe Reis is looking to raise its profile once more by writing a book about it, and has shared excerpts from the first and second chapters of Practical Data Modeling. Meanwhile, Adrian Bednarz writes about the One Big Table (OBT) approach and its implications for stream processing.
  • An interesting explanation of how Back Market is using its migration onto GCP as an opportunity to apply Data Mesh principles to how data is made available to its users. It discusses the implementation both from a logical point of view and in terms of the specific GCP tools used.
  • It might not have the sparkle and allure of GenAI, but someone’s gotta take out the digital trash. Netflix generates around 2 petabytes of data every week, an estimated 40% of which is never used. This article explains in detail the tools and processes Netflix uses to manage this data and delete it when needed.
  • Canva describes how they hit scaling problems and ended up migrating from MySQL to Snowflake for an application responsible for counting usage. Two things struck me: there’s no Kafka (or equivalent) for ingestion, and Snowflake does the heavy crunching instead of Spark or Flink. I wonder how much of that is down to their existing familiarity with Snowflake (as mentioned in the article) versus it being the more suitable tool for the job.

Change Data Capture

Data Ecosystem

Apache DataFusion recently became a top-level Apache project in its own right, graduating out of Apache Arrow. DataFusion is an embeddable query engine for building data systems, and it’s already found in many projects, including Comet, an accelerator for Apache Spark.
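If you’ve not tried it, embedding DataFusion takes only a few lines. Here’s a minimal sketch in Rust, assuming the datafusion and tokio crates; the example.csv file and its a and b columns are hypothetical stand-ins.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Create a session and register a local CSV file as a queryable table.
    // ("example.csv" is a hypothetical file with columns a and b.)
    let ctx = SessionContext::new();
    ctx.register_csv("example", "example.csv", CsvReadOptions::new())
        .await?;

    // Run SQL against the registered table and print the result set.
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a").await?;
    df.show().await?;

    Ok(())
}
```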

At its peak, Apache HBase held 6 PB of data for Pinterest and underpinned many of their systems. This article is a well-written account of their reasons for deciding to deprecate it in favour of tools including Apache Druid and TiDB.

There’s a reason Chris Riccomini features so often in Checkpoint Chronicle (I checked: five of the last six editions!): he writes really useful and pragmatic posts 😀. This one, looking at the Nimble and Lance file formats, is no exception. Whilst Parquet is going nowhere anytime soon, it’s interesting to look at the nebulous beginnings of what might one day replace it, and why.

As well as Chris, I’m a big fanboi of Jack Vanlightly’s writing. He has a knack for making the complex intelligible without dumbing it down. His recent post on Hybrid Transactional/Analytical Storage is an interesting look at both Confluent’s strategy as well as the broader landscape for data platforms.

This one caught my eye, and I’ll quote the README directly: “pg_lakehouse is an extension that transforms Postgres into an analytical query engine over object stores like S3 and table formats like Apache Iceberg. Queries are pushed down to Apache DataFusion.”

Papers of the Month

Events & Call for Papers (CfP)

  • Berlin Buzzwords (Berlin, Germany), June 9-11 (CfP closed)
  • Current '24 | The Next Generation of Kafka Summit (Austin, TX), September 17-18 (CfP closed)
  • Flink Forward (Berlin, Germany), October 23-24 (CfP extended; now closes on May 31)

New Releases

A few new releases this month:

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.

Gunnar (LinkedIn / X / Mastodon / Email)

Robin (LinkedIn / X / Mastodon / Email)

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

Robin Moffatt

Robin is a Principal DevEx Engineer at Decodable. He has been speaking at conferences since 2009 including QCon, Devoxx, Strata, Kafka Summit, and Øredev. You can find many of his talks online and his articles on the Decodable blog as well as his own blog.

Outside of work, Robin enjoys running, drinking good beer, and eating fried breakfasts—although generally not at the same time.
