November 26, 2024

Solve Your Debezium Challenges

Debezium is a widely used open-source platform for change data capture (CDC). In this recent AMA tech talk, Gunnar Morling, former lead of the project, answers a wide range of Debezium and CDC-related questions, the highlights of which are summarized in the transcript below. Watch the full tech talk on demand to get all the details.

Setting the Stage

Here at Decodable, we have built a fully managed platform that radically simplifies the development and management of real-time data pipelines. The typical workloads we see our customers run are manifold, ranging from basic data replication use cases all the way to sophisticated stream processing applications using complex joins, aggregations, and more. Doing so requires integrating and connecting with a variety of source and destination systems. Decodable is built on top of powerful, battle-tested open-source technologies, including Apache Flink and Debezium; the latter powers several of our popular fully managed connectors, which is why we are highly committed to change data capture in general, and Debezium in particular. With that bit of introductory context out of the way, let’s switch over to the main topic of today: the AMA (ask me anything) with Gunnar Morling.

Debezium recently joined the Commonhaus Foundation. How do you feel about that? What are your thoughts?

That's a great question, and I think it’s really good to set the scene. To give some context, Debezium was started as an open-source project by Red Hat, where I was the project lead for a couple of years. One of the goals of the project was to have a very active and diverse community. I suppose people are aware of the notion of having open-source foundations like the Cloud Native Computing Foundation, Linux Foundation, or Apache Software Foundation, which are vendor-neutral playing fields. They typically own the intellectual property (IP), they own the trademarks, they may or may not prescribe some processes around those projects, and different companies come together under the umbrella of such a foundation to work on projects like Kubernetes or Apache Kafka.

So that's one way of doing open source. At the other end of the spectrum, there are more vendor-driven projects, which is how it was for Debezium. It was always an open-source project, licensed under Apache v2, but at least initially Red Hat owned the trademark and the website, so at the end of the day it was a project which they controlled. This can give people some doubts about whether or not they want to contribute to such a project, because if somebody can just decide to take the ball and leave the field, do you want to be part of that game? I think the Debezium community is and has been doing an amazing job of not suffering from those doubts.

Even while it was a Red Hat sponsored project, we always had substantial contributions from other companies, individuals, and many different folks. By now, over 500 people have contributed commits, coming from companies like Google, Stripe, and other big names. And they have led the work on some of the connectors—as one example, the Google Cloud Spanner connector was developed not by Red Hat engineers, but by Google engineers, and they are the ones who define the roadmap for that connector. So we had a really diverse project, and in fact it's kind of hilarious to say that many people didn't even know it was a Red Hat backed project—they were sometimes surprised by that.

But still, actually making this move and bringing the project to a foundation, I think it's a great move because it formalizes that diversity. For instance, with a foundation owning the trademark, there is just no way this could ever be taken away. I was hoping for this to happen because I saw that other projects at Red Hat, like Quarkus or Hibernate, had made similar moves. So when I learned about this, I was very excited. I'm very happy to see that Debezium is now also making this move to a foundation, specifically to the Commonhaus Foundation.

I'm excited about that foundation in particular because I think it's kind of the sweet spot. They do all the IP and trademark ownership, which creates a level playing field, but at the same time they are not prescriptive in terms of processes. Projects can decide for themselves how they want to do releases, where they want to run their CI/CD infrastructure, and so forth. Other foundations can be more prescriptive, which may create heavy processes which contributors can get hung up on a little bit.

So big shout out to the team for making this move, and also big shout out of course to Red Hat for allowing it, because in a sense they are relinquishing control over the project. I think it really takes deep understanding and deep commitment to open-source to do that, and I'm super grateful and super happy about it. I think it enables Debezium to grow for the next chapter in its open-source journey.

If you could wave a magic wand, what three features would you add to Debezium?

Okay, well that's a good one. I would say having a first-class UI. There is a Debezium UI that is primarily focused on Debezium Server, but UIs have this characteristic that they tend to create lots of work—there’s a long tail of effort that needs to be put into it. So yeah, magically having a really amazing UI, I think that would be very nice to have. Also, better Oracle support, which comes up all the time. Right now the main mode of how the Oracle connector works is using LogMiner, and having a truly push-based connector would be great. For the third one, having a distributed mode for Debezium Server—I think this would be very worthwhile. I'm sure after the call I will have 10 other things which come to mind!

Debezium connectors require binlogs to be in row format (not mixed) when fetching data from MySQL on RDS. Would there be a performance impact from changing the binlog format on MySQL RDS?

Good question—to give some context, the binlog format describes how change events in the MySQL transaction log look. CDC connectors like Debezium need the row format so they actually get a view of the entire state of a modified row in the database, as opposed to having just an update statement, for instance. So will there be a performance impact? I suppose the binlog might be a bit bigger, but I don't think it's anything to be concerned about. It's what people have been doing for many years using both Debezium and MySQL on RDS.
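To illustrate why the row format matters, here is a hypothetical sketch (real binlog entries are binary; the table, column names, and values below are invented) contrasting the information a CDC reader gets from statement-based versus row-based logging:

```python
# Hypothetical sketch of what the two binlog formats capture for the same
# UPDATE. Real binlog entries are binary; this only contrasts the information
# content available to a CDC reader.

statement_format_entry = {
    # Only the SQL text is logged -- a CDC reader cannot tell which rows
    # changed or what their full state is without re-querying the database.
    "sql": "UPDATE customers SET email = 'new@example.com' WHERE region = 'EU'",
}

row_format_entry = {
    # Each affected row is logged with its complete before/after image,
    # so a connector like Debezium can emit a self-contained change event.
    "table": "customers",
    "before": {"id": 42, "email": "old@example.com", "region": "EU"},
    "after":  {"id": 42, "email": "new@example.com", "region": "EU"},
}

def to_change_event(entry):
    """Build a Debezium-style update event from a row-format entry."""
    return {"op": "u", "before": entry["before"], "after": entry["after"]}

print(to_change_event(row_format_entry)["after"]["email"])  # new@example.com
```

The statement-format entry simply doesn't contain enough information to produce such an event, which is why row format is required.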

We have a Postgres database with bulk updates happening every hour, which results in replication slot lag. Is there a way to solve this with Debezium?

Replication lag is a common theme, and I would say that's not a problem per se. If you have a bulk update and there is a backlog of 20 GB which the connector needs to process, that's fine provided you have the storage capacity to hold those 20 GB. You don't really have to do anything about it—the connector will just process this backlog. It's a matter of sizing the hardware correctly, which will determine how long it takes, but I don't think there is anything which is inherently wrong with that. I mean, if you do a bulk update which updates 50 million records, it is what it is.

What I would say is you should have monitoring in place for that. You should have dashboards and alerts on the replication lag, and in particular the size of your WAL, because if it grows too much then yes, your database machine might run out of disk space which would be pretty bad. So you definitely want to have monitoring to keep an eye on that; if it starts to grow beyond the limit of what your machine can hold, that's something you would want to know about and act on accordingly.
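As a minimal sketch of that kind of alerting (the threshold and the byte figures are made up for illustration; in practice you would read the retained WAL size from `pg_replication_slots` or your monitoring stack):

```python
# Toy WAL-lag alerting check: fire an alert when the WAL retained behind a
# replication slot exceeds a fraction of the disk space available to the
# database machine. Numbers and the lag source are illustrative only.

def wal_alert(retained_wal_bytes: int,
              disk_free_bytes: int,
              threshold: float = 0.5) -> bool:
    """Return True when retained WAL uses more than `threshold` of free disk."""
    return retained_wal_bytes > threshold * disk_free_bytes

# A 20 GB backlog is fine on a machine with 200 GB free...
assert not wal_alert(20 * 1024**3, 200 * 1024**3)
# ...but the same backlog should page someone when only 30 GB remain.
assert wal_alert(20 * 1024**3, 30 * 1024**3)
```

The point is that the absolute backlog size matters less than how close it comes to exhausting the disk, so the alert is relative rather than a fixed number.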

If real-time replication is not the goal, can we still use Debezium?

To address that, it would be interesting to know what goal you are pursuing. If your task is to upload data into your data warehouse once per week, then probably I wouldn't look at a CDC tool like Debezium. It doesn't make the most sense in that case, and you may be better off with a batch job which polls the data. It really comes down to what you want to do with the data. For instance, are you interested in each and every change of your data? Let's consider a purchase order record, which may have a life cycle that goes through multiple states. Thinking about a polling-based approach, maybe you are fine with retrieving the data only once per week, but maybe you also still need to have all those intermediary states. If that is a requirement, then you could look at log-based CDC, since you would be able to get all the updates your data goes through.
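The purchase order example can be sketched like this (the states and data are invented for illustration): a periodic poll only sees whatever the current state happens to be, while log-based CDC observes every transition.

```python
# Sketch contrasting periodic polling with log-based CDC for a purchase
# order that moves through several states between two polls.

order_history = ["PLACED", "PAID", "SHIPPED", "DELIVERED"]  # all transitions

def poll_latest(history):
    """A periodic batch poll only ever sees the row's current state."""
    return [history[-1]]

def cdc_stream(history):
    """Log-based CDC observes every transition as it is written."""
    return list(history)

print(poll_latest(order_history))  # ['DELIVERED'] -- intermediate states lost
print(cdc_stream(order_history))   # all four states preserved
```

If the intermediary states matter to your downstream consumers, polling silently drops them, which is the requirement that tips the decision toward CDC even when latency itself is not a concern.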

If the source database restricts redo log access due to columns containing personally identifiable information (PII), what options do we have?

There is a notion of what's called column filters in Debezium, so for each table you're capturing you can specify which columns you want to expose. If you have sensitive data, then you could skip those columns before you send the data from Debezium to the destination. Depending on the connector, you could do this filtering even further upstream in the data pipeline. For example, if you are using Postgres, you could specify as part of your publication, which defines what gets exported from Postgres via logical replication, that you never want those sensitive columns to leave the database in the first place. This can also be a performance gain, because you're transferring less data.
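As a sketch of both options (property names follow the Debezium documentation, but verify them against your connector version; the hostname, table, and column names are invented, and column lists in publications require PostgreSQL 15 or later):

```python
# Option 1: filter in the connector. A Debezium Postgres connector config
# that excludes a PII column before events leave the connector. Host, table,
# and column names are invented for illustration.
connector_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.internal",
    "table.include.list": "public.customers",
    # Never emit the sensitive column in change events:
    "column.exclude.list": "public.customers.ssn",
}

# Option 2: filter at the source. PostgreSQL 15+ publications can list the
# columns to replicate, so the PII column never leaves the database at all.
publication_sql = """
CREATE PUBLICATION dbz_publication
    FOR TABLE public.customers (id, name, email);
"""
```

With the second approach the sensitive column is simply absent from the logical replication stream, which is both the stronger guarantee and the cheaper one in terms of data transferred.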

In addition to those above, Gunnar tackled many more questions about CDC and Debezium—check out the on-demand tech talk to get the full story, including answers to all of these great questions:

  • Backfills with Debezium can be difficult, with complete backfill signals mostly being the only option. How do we do a partial backfill in the most efficient way?
  • Can you talk about how initial snapshots for large (8 TB+) databases work, and how that can be made feasible and efficient?
  • When using Apache Flink to replicate Debezium CDC changes via Kafka into some target (e.g., Apache Iceberg), what are some ways to reduce the amount of state held in Flink? Are there ways to minimize this?
  • Have you heard of any plans for hyperscalers (GCP, AWS, Azure, etc.) to have Debezium as a managed offering?
  • What are the most common use cases for consuming Debezium CDC events? Do you see people building distributed systems built using Kafka topics? Or is data warehousing more of the primary use case?
  • Any plans to introduce parallelism into initial or incremental snapshotting?
  • Do you know if Snowflake will be added as an input source for Debezium in the future?
  • Are there any suggestions for handling ON CASCADE DELETE and DROP TABLE/DROP DATABASE DDLs to produce tombstone records when replicating CDC records into Kafka and needing to delete the corresponding records from Kafka topics?
  • What advice would you give to developers looking to contribute to the Debezium project? Are there areas that are more accessible to newer developers?
  • How can we monitor Debezium’s latency performance when pushing changes to Kafka?
  • As replication is a must with Debezium, what is the best way to upgrade the source database to minimize downtime, and how would Debezium catch up afterward?
  • Are big companies using the outbox pattern? Is it reliable? Does it introduce latency?

David Fabritius
