Whether it's "Understanding CDC" or "CDC Explained" or even "Five things about CDC; number four will shock you!", the internet is awash with articles about Change Data Capture (CDC). CDC is the process of incrementally extracting data change events as they occur within a database. In this post I'd like to take a step back from these and look at the reasons why you might even want to consider CDC in the first place. From here we'll then build on the requirements we identify to come up with a proposed solution.
The starting point for any exploration of CDC is your data in a database, most probably a transactional, or operational, one. You've got an order system, an inventory management system, or any other kind of system built on an OLTP database, where the focus is on getting data in quickly, in a highly performant and resilient way.
With that data in the database, you want to do something else. But before you can do it, you need the data out of the database.
Or do you? Perhaps you could just use the data in place. After all, who doesn't enjoy baiting the friendly neighborhood DBA? What could possibly go wrong with running unbounded queries on a production database that's backing an operational system? 🤡
OK, I jest, I jest. A huge amount could go wrong. OLTP systems are primarily optimized for writing and reading data for specific transactions, such as the creation or update of an individual customer order, rather than for efficiently extracting large volumes of data. Depending on what you want to do with the data, your query access paths could be slow for you and, worse, could slow down the production system.
So let's assume you want the data out of the database. Before we get onto how (and we're not burying the lede here; the answer is in the title of the post), let's take a moment to think about the kind of things that we might want to do with the data.
Common CDC use cases
- Feeding analytical workloads, for example a data lake or data warehouse
- Real-time analytics
- Keeping search indexes up to date
- Populating and invalidating application caches
You can read more about CDC use cases in this blog post from Gunnar Morling.
Why not just query the production database?
You can look at this list another way. There are some things that you want to do with the data but can't do against the operational database, because you'd risk impacting the system it serves. Big number-crunching analytical workloads are a great example. We might well serve the data from the same database platform (e.g. Postgres, Oracle), but one configured, indexed, and designed for analytical (OLAP) queries.
The other aspect is that the database we've built the operational system on is often not the best one for serving the other uses of the data. Analytics, as mentioned, could be served from a data lake (S3 plus Iceberg, for example), and real-time analytics from Pinot et al. We've not mentioned them yet, but search indexes and application caches are two other important places where operational data is needed, each with its own specialized serving technology, for example OpenSearch (indexing) and Redis (caching).
The overarching point is: we need that operational data out, so that we can put it somewhere else. Doing so gives us the best of both worlds; a performant operational system protected from the performance impact of other uses, and those other uses each served by the optimal technology.
My colleague, Gunnar Morling, put it succinctly:
CDC saves the day!
This being a vendor blog, I am of course going to tell you that there is a silver bullet for all your problems, and that it's CDC. The thing is, CDC is one of those patterns that does fit amazingly well into a lot of system architectures.
Before I get too evangelical, let's pause to look at what CDC is.
What CDC is (and isn't)
CDC, or Change Data Capture, does what it says on the tin. It captures changes to data. Specifically, changes made to data in one or more tables in a database. For example, the creation of a new customer record, an update to a purchase order, or the deletion of an inventory item.
So after three changes to the data by the application, we'd have three entries captured by CDC, one for each change.
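To sketch that out (using a hypothetical customers table, with the resulting change events paraphrased in comments; the exact event format depends on the CDC tool you use):

```sql
-- Three changes made by the application to a hypothetical customers table,
-- and (in comments) the change event each one would produce.

INSERT INTO customers (id, name, email)
VALUES (42, 'Jane', 'jane@example.com');
-- event 1: op = create, after = (42, 'Jane', 'jane@example.com')

UPDATE customers SET email = 'jane.smith@example.com' WHERE id = 42;
-- event 2: op = update, before = the old row, after = the new row

DELETE FROM customers WHERE id = 42;
-- event 3: op = delete, before = the row as it last existed, after = nothing
```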
What happens to those changes is then up to the implementation of the CDC tool. CDC is just the extraction of changes from a database. This is where one of the big confusions seems to arise around CDC:
- CDC does not load or ingest data
- CDC is not a method of ETL
- CDC is not real-time analytics
- CDC does not transfer changes to a target database
CDC is often part of an ETL process, and the output from CDC is often written to a target system, perhaps even for real-time analytics. But CDC itself is just getting the data out.
One of the reasons there are misunderstandings around what CDC is stems from the different ways in which CDC can be implemented. Fortunately for you, dear reader, there is one way that is objectively better than the others.
The best way to do CDC is log-based CDC
Most databases are built on a concept known as the transaction log. It goes by different names (binlog, redo log, write-ahead log) in different implementations, but the concept is the same throughout.
When you perform an <span class="inline-code">INSERT</span>, <span class="inline-code">UPDATE</span>, or <span class="inline-code">DELETE</span> against a table, that operation is recorded in the transaction log. This has nothing to do with CDC; this is just how databases work. The transaction log serves as a point of recovery should the server go bang. If that happens, the database takes the last known-good point and replays the transaction log forward, thus applying all the committed changes that were made since.
What log-based CDC does is tap into the transaction log of the source database and expose it in a form that can be used downstream. This is what makes log-based CDC so powerful: the database itself uses the transaction log as its system of record, its absolute source of truth. What better way to get the highest-fidelity view of what's happening in a database than to tap into the very component the database itself relies on?
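If you want to poke at this yourself on Postgres, the built-in logical decoding functions give a flavour of what a log-based CDC tool does under the covers. This is just an exploratory sketch (a tool like Debezium uses a logical decoding plugin such as pgoutput and manages the slot for you), and it assumes wal_level is set to logical and that the customers table from earlier exists:

```sql
-- Create a logical replication slot using the built-in test_decoding plugin
-- (requires wal_level = 'logical' on the source database)
SELECT pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Make a change to a table...
UPDATE customers SET email = 'new@example.com' WHERE id = 42;

-- ...then peek at the change events decoded from the write-ahead log
SELECT lsn, xid, data
FROM   pg_logical_slot_peek_changes('cdc_demo', NULL, NULL);

-- Drop the slot when you're done, or it will force WAL to be retained
SELECT pg_drop_replication_slot('cdc_demo');
```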
If theory is your thing, you'll love Martin Kleppmann's talk "Turning the database inside-out" and Pat Helland's paper "Immutability Changes Everything".
Log-based CDC has the following fantastically powerful features:
- Low impact on the source database, since it reads the transaction log rather than querying the tables themselves
- Every change is captured, including deletes, along with the before and after state of each row
- Changes are captured as they happen, giving low latency
- No changes to the application or data model are required
The rising star in this space is the open-source Debezium project. Founded and led by Red Hat, and often rebadged by vendors as their own CDC tool, Debezium supports numerous databases and has wide adoption.
Okay, okay. Whether it's "Not Invented Here" syndrome or simply skepticism about using the transaction log for CDC, you may well be thinking of other ways of capturing changes to data. Airbnb have a nice run-down from 2018 of the options available and their pros and cons. Here's a very quick look at the gotchas of the other techniques, with reference to the list above of benefits of log-based CDC.
- Query-based CDC, a.k.a. polling, relies on the tables having a column by which differences since the last poll can be determined. It has a higher impact because you're actively querying the database, and it can't capture deletes or the before-state of changed rows (see the sketch after this list). Learn more.
- Friends don't let friends do Dual Writes. If another application needs the same data that's being written to the database, one way to achieve this is to have the source write it to both the application and the database. You don't impact the database, but you run a real risk of data inconsistencies and divergence, plus unintended coupling. Learn more.
- Triggers are a feature of many databases, but using them for CDC requires you to literally code SQL in the source database for each condition you want to capture. You are then tied into needing to test, document, and maintain that code. When the trigger fires you need to do something with the event, which, if you're trying to replicate data outside of the database, could get tricky. That's before you've considered the performance impact on the source database of a trigger firing for every single change. Learn more.
- Table replication. This is the sledgehammer-to-crack-a-nut approach. Instead of doing Change Data Capture you just do Data Capture and dump the full tables each time. It's wasteful (unless every row changed), low fidelity (no change details per row), high latency, and resource intensive on the source database.
- Similar to the above is built-in replication, in which the database replicates tables to another instance of the same database. This is fine if you want an exact replica of the tables served by the same database platform, but not useful if you want the data on a different platform. It's worth noting that built-in replication is usually built on the same transaction log concepts that log-based CDC uses.
- Comparing source and target tables might make sense for a one-off data replication exercise, but is not credible as a way of continually replicating changes from a source database. Each comparison puts a query load on the source database, similar to the table replication approach above.
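To make the first of those gotchas concrete, here's roughly what query-based polling looks like. The updated_at column and the last-poll placeholder are hypothetical; the point is what the query can and cannot see.

```sql
-- Query-based CDC ("polling"): depends on the application reliably
-- maintaining a column such as updated_at on every write
SELECT *
FROM   customers
WHERE  updated_at > :last_poll_timestamp  -- high-water mark from the previous poll
ORDER  BY updated_at;

-- What this misses:
--  * deleted rows simply disappear, so deletes are never captured
--  * only the latest state is returned; intermediate changes between polls are lost
--  * every poll adds query load to the production database
```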
Limitations of log-based CDC
Remember, this is a vendor blog. We're supposed to present these silver bullets as infallible, right? Whilst log-based CDC is pretty darn good, and most of the time the correct answer, there are some challenges to be aware of. Some are more real than others, and some have better mitigations than others.
Is CDC expensive?
No, unless you're in Oracle GoldenGate (OGG) world. The Rolex watch of the CDC world, OGG is well respected for a reason, but comes with a price tag to match. There are other "regular" enterprise offerings, not to mention the open-source Debezium project, which is, well, open source!
Mitigation: Did I mention Debezium yet?
CDC can be more complex to set up
OK, there's a little bit more to this one, I'll admit. Once you start diving into the world of transaction logs, you need to be on your DBA's good side, because you'll need to do some work on the source database so that your CDC tool can access the log. Depending on what you're integrating with, your CDC tool might also need some configuration to write the data where you want it.
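To give a sense of what that configuration work looks like, here's a sketch for Postgres. The names are illustrative, and the details vary by database version and by whether you're on a managed service:

```sql
-- Make the write-ahead log available for logical decoding (requires a restart)
ALTER SYSTEM SET wal_level = 'logical';

-- A dedicated, least-privilege user for the CDC tool to connect as
CREATE ROLE cdc_user WITH LOGIN REPLICATION PASSWORD 'change-me';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;

-- Limit capture to just the tables you actually need
CREATE PUBLICATION cdc_publication FOR TABLE customers, orders;
```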
Mitigation: Log-based CDC is widely adopted and documentation is well-tested and thorough.
Mitigation: If you use a managed provider for Debezium, a lot of the configuration and management is done for you.
Does CDC have an impact on the operational system?
Yes and no. The question is whether that impact actually matters. Whilst log-based CDC is the lightest-touch way to get change data out of the system (see above for other, heavier, options), you are still interfacing with an operational system. You need to watch out for things like a Postgres replication slot filling up and eating all the disk space. You also need to make sure that, whatever the source database, it's configured to retain the transaction log for as long as the CDC tool needs to read from it. If you don't, you'll miss change events.
Mitigation: As above; log-based CDC is widely used at scale in production systems world-wide. Read the friendly manual, and you'll be fine. Just make sure your DBA is on board.
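On Postgres, for example, it's worth keeping an eye on how much WAL each replication slot is holding on to. Something like this (a sketch for PostgreSQL 10 and later) makes a handy monitoring query:

```sql
-- How much WAL is each replication slot preventing from being recycled?
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM   pg_replication_slots;
```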
Log-based CDC in action: an overview
Here's a very simple overview of what you might build with log-based CDC. The requirement is populating a data lake from an operational database. We'll use Postgres in the example, but the concept applies to any database.
- Any <span class="inline-code">INSERT</span>, <span class="inline-code">UPDATE</span>, or <span class="inline-code">DELETE</span> is captured in the write-ahead log (WAL)
- Debezium streams the changes as they happen to an Apache Kafka topic (although it's worth noting that these days Debezium doesn't have to use Kafka). The Kafka topic buffers the changes and makes them available for downstream use
- A tool such as Apache Flink or Kafka Connect reads the data from the Kafka topic and writes it in Apache Iceberg format to S3 (there's a rough Flink SQL sketch of this below)
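If Flink is the tool doing that last step, the job could look something like this in Flink SQL. It's a rough sketch only: the topic name, schema, bucket, and connector options are all illustrative, and the exact settings depend on your Flink, Kafka, and Iceberg versions.

```sql
-- Source: Debezium change events for the orders table, buffered in Kafka
CREATE TABLE orders_changes (
  order_id    BIGINT,
  customer_id BIGINT,
  status      STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'dbserver1.public.orders',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);

-- Sink: an Iceberg table in a catalog backed by S3
CREATE CATALOG lake WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 's3a://my-bucket/warehouse'
);

CREATE DATABASE IF NOT EXISTS lake.cdc;

CREATE TABLE IF NOT EXISTS lake.cdc.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  status      STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'format-version' = '2',
  'write.upsert.enabled' = 'true'
);

-- Continuously apply the change stream to the lake table
INSERT INTO lake.cdc.orders
SELECT * FROM orders_changes;
```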
Okay, okay, I'm sold! Where do I buy a log-based CDC?
(I told you: this is a vendor blog; did you think the CTA really wasn't going to be to sign up for our service?)
Well, since you ask, Decodable provides a fully managed Debezium service, as well as a rich connector library for sending the captured changes to other platforms such as OpenSearch and S3, as well as other databases. We also have a Flink SQL-based stream processing platform, so you can choose to transform the captured changes before sending them onwards.