September 24, 2024

Understanding CDC with Debezium Server and Debezium Engine

In recent years, Debezium has established itself as the de-facto standard for change data capture (CDC). Historically, Debezium usage has been tied to Kafka and Kafka Connect, but technical constraints and ever-growing demand from enterprises and the community have led to alternative deployment models. This article explores and motivates Debezium Server, a lightweight standalone application that streams CDC events to various systems other than Kafka. It also highlights the Debezium Engine as the most flexible solution under the Debezium umbrella, one that can support very specific CDC needs. The article concludes by contrasting Debezium usage with and without a persistent log between data sources and sinks. Ready to start streaming data changes and going beyond Kafka? Let’s dive in…

A glimpse in the rear view

During its early years, Debezium was exclusively focused on integrating with Apache Kafka Connect, a framework and runtime designed for streaming data exchange between disparate systems using Kafka. Building on top of Kafka Connect was a reasonable choice back then and allowed Debezium to not only efficiently capture and propagate database changes via Kafka topics, but at the same time rely on Kafka Connect's scalability and fault-tolerance capabilities.

The rise in Apache Kafka's popularity and its widespread usage in enterprises, small and large, across various industries clearly aided Debezium adoption, since it was relatively straightforward for organizations to deploy Debezium into their landscapes and existing Kafka (Connect) ecosystems.

Moving beyond Kafka (Connect)

As of today, running Debezium on Kafka Connect is still the predominant deployment model in the wild. On the one hand, this is because it was historically the only way to run Debezium connectors. On the other hand, it’s because over the years Kafka Connect evolved and matured into a well-understood and battle-tested data integration solution in the streaming space.

And yet, data folks out there were eagerly striving for the same powerful CDC functionality in non-Kafka Connect contexts. To better understand why, let’s look at a few potential reasons: 

  • De-facto standard for CDC: Over the years, Debezium became the de-facto standard for change data capture. Besides the connectors developed by the Debezium core team, there are community-led development efforts as well as third-party connectors engineered by vendors, one example of the latter being the Google Cloud Spanner connector. They are all based on the building blocks the Debezium project provides and consistently use Debezium's change event payload format. Even other open-source projects adopted the format, for instance Flink CDC, which can directly support the same CDC payload structure. All this contributes to the fact that users want to work with Debezium, irrespective of their landscape and other technologies in use.
  • Kafka isn’t everywhere: Running Debezium connectors as Kafka Connect connectors means any data integration scenario is inherently tied to Apache Kafka itself. Even if Kafka is deemed to be the de-facto standard event streaming platform, it’s still far from being ubiquitous. Thus, it’s debatable whether introducing Kafka can be reasonably justified only because you plan to do CDC with Debezium.
  • Other messaging infra or direct propagation favored: You might already be heavily and happily invested in other messaging infrastructure that is not Kafka-compatible. Or maybe it would suit your use case even better to propagate CDC events directly into selected target systems rather than always being forced to go through intermediaries. Think about ingestion into databases, caching infrastructure, or HTTP APIs.
  • Kafka Connect isn’t cloud-native: Like various other types of applications, data integration workloads are meant to run on Kubernetes or in the cloud. For Kafka Connect there are technical constraints and limitations in that regard, most of which stem from its architecture and design when it comes to deploying connectors. The major issue is that deployed connectors are not isolated from one another but instead executed in shared JVM processes. This has unfavorable consequences such as competition for resources, scaling complexities, resource leaks, security concerns, missing health checks, and others, which are discussed at length elsewhere.

But what if there was a different choice, an alternative way to run any of the Debezium connectors while at the same time breaking free from Apache Kafka and Kafka Connect in particular? For certain use cases and workloads, it would be beneficial to spin up a lightweight (containerized) application for doing change data capture without the need for any additional CDC-related data infrastructure components running next to it. This is exactly where Debezium Server comes into the picture.

Enter Debezium Server

Debezium Server streams data change events originating from any of Debezium’s supported databases to a number of different target systems, including but not limited to messaging infrastructure other than Kafka. Under the hood, Debezium Server is powered by the Debezium Engine, which is covered in a later section of this article.

Deployment Model

Debezium Server is a turnkey, highly configurable Java application written with Quarkus, a modern Java framework for developing resource-efficient cloud-native applications. What’s convenient about such a standalone application is that you can deploy it in any way and wherever you see fit. It doesn’t matter whether you want to go for bare-metal scenarios, prefer VM instances, or plan to run containers deployed to Kubernetes. One fundamental aspect to keep in mind right from the beginning, though, is that each Debezium Server application instance can only host and run a single source connector-to-sink pairing. In the illustration below we see two concrete examples, namely MySQL to Apache Pulsar and MongoDB to Google Cloud Pub/Sub.

At first sight, this may seem like an unfavorable limitation. But in fact—and as discussed in the previous section—it helps to alleviate several pain points related to the lack of proper isolation for Debezium connectors when deployed to Kafka Connect clusters.
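To make the single source-to-sink pairing more tangible, here is a minimal sketch of what an application.properties file for the MongoDB to Google Cloud Pub/Sub example might look like. The property names are given as I understand them from the Debezium Server documentation and should be verified against the version you run. All values (project id, connection string, topic prefix, file path) are placeholders assumed for illustration.

```properties
# Sink: exactly one per Debezium Server instance (here Google Cloud Pub/Sub)
debezium.sink.type=pubsub
# Placeholder project id, assumed for this sketch
debezium.sink.pubsub.project.id=my-gcp-project

# Source: exactly one connector per Debezium Server instance (here MongoDB)
debezium.source.connector.class=io.debezium.connector.mongodb.MongoDbConnector
# Placeholder connection string and topic prefix, assumed for this sketch
debezium.source.mongodb.connection.string=mongodb://localhost:27017/?replicaSet=rs0
debezium.source.topic.prefix=demo

# Where the connector keeps track of its progress (placeholder path)
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=5000
```

A second pairing, such as MySQL to Apache Pulsar, would live in its own Debezium Server instance with its own configuration file, which is what provides the isolation between connectors.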

Resilience and Scalability

What about the resilience of Debezium Server which, after all, simply runs as a standalone application in a single JVM process? If the source connector and the single task it runs fail, or the JVM crashes for whatever reason, there is no automatic recovery. Debezium Server would need to be restarted manually in order to continue capturing data changes. It’s quite common, though, to run Debezium Server as a containerized application in Kubernetes, based on the available operator. Running in Kubernetes is beneficial because of the fault-tolerance mechanisms that come with container orchestration. For instance, Debezium Server exposes a liveness probe. When that probe fails, Kubernetes automatically takes action and restarts the failed container.

But how does Debezium Server know where to continue its work after a restart? After all, unlike in a Kafka Connect deployment, there are no Kafka topics in place to store information such as offsets. For this reason, Debezium Server provides configuration settings to define the offset storage strategy. For instance, a file or a database can be specified, together with a flush interval, to durably store these offsets.
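The interplay between stored offsets and the flush interval can be illustrated with a small, self-contained sketch. The class below is a hypothetical stand-in written for this article, not Debezium's actual offset store. It tracks a single numeric position, whereas real connector offsets are source-specific structures. The point is only to show why a restart resumes from the last flushed position rather than from the last processed one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical, simplified offset store for illustration; not Debezium's implementation.
class FileOffsetSketch {
    private final Path file;
    private final long flushIntervalMs;
    private long lastFlushMs = 0;

    FileOffsetSketch(Path file, long flushIntervalMs) {
        this.file = file;
        this.flushIntervalMs = flushIntervalMs;
    }

    // Returns the last durably stored position, or -1 if nothing has been stored yet.
    long load() {
        try {
            if (!Files.exists(file)) {
                return -1;
            }
            return Long.parseLong(Files.readString(file).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called after an event was handled; persists the position only
    // once the flush interval has elapsed since the previous flush.
    void record(long position, long nowMs) {
        if (nowMs - lastFlushMs < flushIntervalMs) {
            return;
        }
        try {
            Files.writeString(file, Long.toString(position));
            lastFlushMs = nowMs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Simulates processing four events, a crash, and a restart.
    static long demo() {
        try {
            Path file = Files.createTempDirectory("offsets").resolve("offsets.dat");
            FileOffsetSketch store = new FileOffsetSketch(file, 1000);
            store.record(10, 0);     // not flushed, interval not yet elapsed
            store.record(11, 500);   // not flushed
            store.record(12, 1000);  // flushed to disk
            store.record(13, 1200);  // processed, but lost if the process crashes now
            // "Restart": a fresh instance only sees what was flushed.
            return new FileOffsetSketch(file, 1000).load();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, demo() returns 12 even though position 13 had already been processed, so the event at position 13 would be read again after the restart. This is the reason consumers of change events are usually written to tolerate duplicates, and why a shorter flush interval narrows the window of events that may be re-delivered.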

Another important characteristic is scalability. It sounds like a huge limitation of Debezium Server to only ever run a single connector with one task. Remember that, by contrast, Kafka Connect is able to run several tasks per connector, but for this to work a connector must support multiple tasks in the first place. Currently, however, most Debezium source connectors don't allow for this anyway. Leaving aside the very few that do support it (e.g. SQL Server and MongoDB), the single-task limit is not a showstopper in many cases. In other words, the scalability differences compared to Kafka Connect shouldn't, on their own, immediately rule out a connector deployment with Debezium Server.

Modifications of Change Event Payloads

It’s a common requirement in CDC pipelines to apply basic data modifications on the fly. For this, Kafka Connect offers a concept called single message transforms (SMTs). SMTs are versatile and cover use cases such as filtering or routing, data type conversions, renaming and excluding fields, or masking sensitive data. Concrete hands-on examples featuring 12 different SMTs in action are provided in a separate blog post series. Because manipulating data in flight is such a common need, Debezium Server also supports transformations. The only minor difference is that all related configuration settings must carry the debezium.transforms prefix.
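As a brief illustration of the prefixing, the following fragment sketches how a transformation might be declared for Debezium Server, with the usual Kafka Connect form shown in comments for comparison. The alias unwrap is an arbitrary name chosen for this example. The property layout reflects my understanding of the Debezium Server documentation, so please check it against your version.

```properties
# Kafka Connect connector configuration (for comparison):
#   transforms=unwrap
#   transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState

# Debezium Server: the same settings, prefixed with "debezium."
debezium.transforms=unwrap
debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
```

ExtractNewRecordState is the Debezium-provided SMT that flattens the change event envelope down to the row state after the change, which is often what simple sinks expect.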

Momentum and Community Interest

Even though Debezium Server has a relatively short history, it has received a lot of community interest and has quickly grown into a first-class citizen under the Debezium umbrella. More and more development effort is put into both improving the feature set of Debezium Server and extending the tooling around it to make real-world deployments easier and better suited for enterprise needs. Two concrete examples in that regard are, first, the addition of an operator to smoothly deploy Debezium Server workloads onto Kubernetes clusters, and second, a bigger and longer-term initiative to design and implement a completely new approach towards a dedicated web UI, which will target Debezium Server first.

Debezium Engine for Ultimate Flexibility

Very specific use cases may require full control over the captured data changes. For instance, let’s say you want to perform in-application caching. There are two fundamental challenges regarding caching, namely warm-up and invalidation, both of which can be nicely addressed by means of CDC. But neither deploying Debezium connectors in Kafka Connect nor running Debezium Server is particularly helpful for in-application caching. Luckily, there is a third option to benefit from Debezium’s CDC capabilities, namely the Debezium Engine, which comes in two flavors.

EmbeddedEngine

The initial purpose of the EmbeddedEngine was to make testing of Debezium connectors easier and detached from any Kafka-related infrastructure. Hence, it was originally not meant for high-load production scenarios. The major limitation in that regard is that it can only run a single task, which means CDC records are processed sequentially by a single thread. As a result, the necessary processing steps, including serialization, transformations, and custom handler methods, get executed record by record. Instead of trying to improve the existing engine, efforts were put into a new implementation following the same interface (DebeziumEngine), while the EmbeddedEngine was deprecated with Debezium release 2.7.0, with plans to remove it in Debezium 3.1.0.

AsyncEmbeddedEngine

This new implementation, available since Debezium 2.6.0, supports multiple threads, provided that the connector in question is capable of running multiple tasks (e.g. the SQL Server or MongoDB connectors). Essentially, there are two thread pools, one to parallelize across tasks and another to parallelize the actual processing of CDC records within a task. The exact behavior and the extent to which the processing is parallelized can be configured accordingly. Broadly speaking, CDC records are processed either a) in the same order as present in the original batch of source records or b) fully asynchronously, meaning the original order of the CDC records might not be preserved.
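The difference between the two processing modes can be sketched with plain JDK concurrency utilities. The code below does not use or reproduce the AsyncEmbeddedEngine. It is a hypothetical illustration in which transform stands in for the per-record work (serialization, transformations, handler calls), and it only shows how parallel processing can either keep the batch order or give it up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of ordered vs. unordered parallel batch processing.
class BatchProcessingSketch {

    // Stand-in for the per-record processing work.
    static String transform(String record) {
        return record.toUpperCase();
    }

    // a) Records are processed in parallel, but results are handed on in batch order.
    static List<String> processOrdered(List<String> batch, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String record : batch) {
                futures.add(pool.submit(() -> transform(record)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // waits in submission order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // b) Results are handed on as soon as they complete; batch order may be lost.
    static List<String> processUnordered(List<String> batch, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<String> completion = new ExecutorCompletionService<>(pool);
            for (String record : batch) {
                completion.submit(() -> transform(record));
            }
            List<String> results = new ArrayList<>();
            for (int i = 0; i < batch.size(); i++) {
                results.add(completion.take().get()); // completion order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Giving up ordering is only safe when downstream logic does not depend on the sequence of changes, for example when events for different keys are independent. For anything that applies updates to the same record, the ordered variant is the conservative choice.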

Either of the two engines allows your custom handler method implementations to process data change events at will. Referring back to the in-application caching use case, you can handle snapshot events to warm up your cache and any non-snapshot event to update it in real time, all directly within your custom application.
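The caching idea can be sketched as follows. The Event class is a deliberately simplified, hypothetical stand-in for a change event and not part of the Debezium API. It only carries the operation code, a key, and the row state after the change. The operation codes follow Debezium's change event envelope, where r marks a snapshot read, c a create, u an update, and d a delete. In a real application, a method like handle would be called from the handler registered with the engine.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-application cache kept in sync by change events.
class ChangeEventCacheSketch {

    // Simplified stand-in for a change event; not a Debezium API type.
    static final class Event {
        final String op;     // "r", "c", "u" or "d"
        final String key;    // primary key of the changed row
        final String after;  // row state after the change, null for deletes

        Event(String op, String key, String after) {
            this.op = op;
            this.key = key;
            this.after = after;
        }
    }

    private final Map<String, String> cache = new HashMap<>();

    void handle(Event event) {
        switch (event.op) {
            case "r": // snapshot read: warms up the cache
            case "c": // create
            case "u": // update: replaces the stale entry
                cache.put(event.key, event.after);
                break;
            case "d": // delete: invalidates the entry
                cache.remove(event.key);
                break;
            default:
                break; // other operations are ignored in this sketch
        }
    }

    String get(String key) {
        return cache.get(key);
    }

    int size() {
        return cache.size();
    }
}
```

Because puts and removes are idempotent, this handler also tolerates events being delivered more than once, for example after a restart.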

Probably the most prominent use of the Debezium Engine apart from its usage within the Debezium project itself is Apache Flink’s CDC implementation which, for some of its supported data sources, builds on top of the EmbeddedEngine.

With or without a Persistent Log

An important characteristic and architectural choice enabled by the Debezium Engine is that it allows for streaming CDC events out of databases and directly propagating them to target systems. This means that an explicit and persistent log, such as Apache Kafka, between sources and sinks is not required in such a setup. On the plus side, this means a simpler architecture and, in particular, one less critical stateful service to care about, which could result in lower maintenance effort and potential cost savings.

That said, the lack of a persistent log has disadvantages. First, no CDC event history is persisted anywhere, which means you cannot simply replay change events whenever necessary. Second, without any messaging infrastructure between source and sink, there is no decoupling: if the sink is temporarily down, change events cannot be delivered until it recovers, and a sink that cannot keep up with the source might become overwhelmed and collapse under the load. Third, adding new consumers at any time to process data change events, including those from the past, is not supported. The same holds true for fan-out scenarios in which you would like to feed CDC events from the same change event stream to multiple sinks in parallel.

Clearly, it's a trade-off, and whether to go for direct CDC event propagation from source to sink or for a persistent log in between needs to be evaluated on a case-by-case basis.

Summary

This article explored a few reasons why Debezium has expanded beyond its initial roots and evolved towards supporting deployments without the need for Kafka-related infrastructure. Since it became the de-facto standard for open-source change data capture, more and more people wanted to use Debezium's powerful CDC capabilities with various other messaging systems. Debezium Server has been highlighted as a flexible and lightweight standalone alternative to achieve this and to deploy connectors in a cloud-native way by means of a dedicated Kubernetes operator. For full control of CDC events, Debezium Engine was discussed including the pros and cons which result from propagating data changes directly to target systems instead of relying on a persistent log in-between sources and sinks.

Hans-Peter Grahsl

Hans-Peter Grahsl is a Staff Developer Advocate at Decodable. He is an open-source community enthusiast and in particular passionate about event-driven architectures, distributed stream processing systems and data engineering. For his code contributions, conference talks and blog post writing at the intersection of the Apache Kafka and MongoDB communities, Hans-Peter received multiple community awards. He likes to code and is a regular speaker at developer conferences around the world.
