In the world of data engineering, moving data across systems, whether through data replication or ETL processes, is a complex but essential task that requires tapping into external systems. It can be a tedious, time-consuming, and costly endeavor. As such, we’ve been focused on making it easier and more cost-efficient to make these connections at scale. With that ethos in mind, we’re thrilled to announce the release of our multi-stream connector (MSC) feature at Decodable. With this new feature, you can unify multiple resources in a data system, such as a Postgres cluster or a Snowflake database, under a single connection, yielding an even greater reduction in cost and operational overhead.
In this blog, we’ll take a deeper look at why multi-stream connectors are valuable and how we architected them, and we’ll walk through an example of how you can leverage this new feature in Decodable to more easily and efficiently build streaming applications.
A Unified Approach to Data Movement
Decodable is a real-time data movement platform built on top of Apache Flink and Debezium. It is a cloud service, supporting continuous data ingestion from external data systems, (optional) data processing, and output of data into other external systems. The following terminology will be used throughout this blog:
- Connection - A configured instance of a connector. A source connection reads data from an external data system and writes it to streams; a sink connection reads data from streams and writes it to an external data system.
- Stream - A sequence of durably-stored data records in Decodable. Streams serve as the outputs of source connections, inputs to sink connections, and both inputs and outputs for data-processing pipelines.
What Are Multi-Stream Connectors?
It’s easy to see that as the number of connected resources grows, so too does the cost and complexity of managing a separate connection for each resource. Multi-stream connectors help manage this complexity — reducing load on external systems and curbing costs for users by consolidating multiple connections to the same host into a single connection instance.
Now that you know what a multi-stream connector can do, how do you configure one? When designing this feature, we recognized that streamlining connection configuration is crucial for real, scalable use. We revamped our UX to make setting up multi-stream connections more straightforward:
- On the source side, the system automatically scans all potential inputs for a source connection and translates data types between Decodable and external systems.
- On the sink side, the system automatically sets up necessary output resources, like Snowflake tables, if they don't exist.
This workflow allows for rapid setup of data synchronization across numerous external resources, optimizing both time and resources.
To multiplex the relationship between a connection and multiple external resources, we introduced the concept of stream-mappings on connections. A stream-mapping is made up of the following:
- Stream ID: The ID string of a Decodable stream.
- External Resource Specifier: A key-value map, uniquely identifying a concrete resource in the connection’s external system.
- Allowed keys are specific to each external system.
- For example, schema and table for a MySQL connection or topic for Kafka.
A source connection will read data from the entities identified by the external resource specifiers into the given Decodable streams. A sink connection will write data from the specified streams into the connected external data system.
To see how this works in practice, here’s an example of configuring a multi-stream connection with our new declarative YAML format.
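The sketch below shows the shape of a MySQL source connection with two stream-mappings. The field names are simplified for illustration, and the connection properties and stream IDs are placeholders rather than the authoritative resource schema:

```yaml
# Illustrative multi-stream source connection (simplified field names;
# hostname, credentials, and stream IDs are placeholders).
kind: connection
metadata:
  name: mysql-cdc-source
spec:
  connector: mysql-cdc   # MySQL CDC source connector
  type: source
  properties:
    hostname: mysql.example.com
    port: 3306
    username: decodable
    password: ${MYSQL_PASSWORD}
  stream-mappings:
    # Each mapping pairs a Decodable stream with an external resource
    # specifier; the allowed specifier keys depend on the external
    # system (e.g. schema and table for MySQL, topic for Kafka).
    - stream-id: orders-stream
      external-resource-specifier:
        schema: shop
        table: orders
    - stream-id: customers-stream
      external-resource-specifier:
        schema: shop
        table: customers
```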
Case Study: The Multi-Tenant Customer
We recently helped a customer who needed end-to-end replication from MySQL to Snowflake. Their MySQL system employs a multi-tenant architecture, in which each tenant has its own database with an identical set of tables.
The batch-based technology the customer was previously using for replication put undue stress on their MySQL system, leading to production outages and failures to actually sync their data downstream. Enter Decodable and multi-stream connectors.
The MySQL source connector used at Decodable is built on top of Debezium. Debezium uses a single network connection to consume the transaction log (called “binlog” in MySQL terminology) for all tables stored on a MySQL host. The new multi-stream feature at Decodable unlocks the full power of Debezium, allowing a single MySQL source connection to read the changes of many tables on a host system at once.
The result is a highly efficient connection that not only cuts down on the resource costs needed to ingest multiple tables, but also respects the processing resources of the host system it’s reading from.
More subtly, we were able to further streamline downstream data management through clever configuration of the MySQL stream-mappings. By mapping all like-tables across the tenants into the same Decodable stream, we reduced more than two million unique MySQL tables to just over two hundred total streams and Snowflake tables.
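Conceptually, the stream-mappings for this fan-in simply reuse the same stream ID across tenant databases, along the following lines (an excerpt in the same illustrative format as above, with hypothetical tenant and table names):

```yaml
stream-mappings:
  # Fan-in: the like-tables from every tenant database map into a
  # single Decodable stream.
  - stream-id: orders-stream
    external-resource-specifier:
      schema: tenant_001
      table: orders
  - stream-id: orders-stream
    external-resource-specifier:
      schema: tenant_002
      table: orders
  # ...one mapping per tenant database, repeated for each of the
  # roughly two hundred like-table types
```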
Note that the MySQL connections additionally inject metadata such as the database of origin into each record, for subsequent data processing. All in all, this cleaned up the customer’s data warehouse, eliminated the need for downstream unioning, and allowed us to serve millions of upstream MySQL tables through a single Snowflake sink connection.
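On the output side, a single Snowflake sink connection carries all of the consolidated streams. Here is a hedged sketch, again with simplified field names and with the Snowflake specifier keys (database, schema, table) assumed for illustration:

```yaml
# Illustrative multi-stream sink connection into Snowflake.
kind: connection
metadata:
  name: snowflake-sink
spec:
  connector: snowflake   # Snowflake sink connector
  type: sink
  stream-mappings:
    # One mapping per consolidated stream; the sink sets up the target
    # table automatically if it does not yet exist.
    - stream-id: orders-stream
      external-resource-specifier:
        database: ANALYTICS
        schema: PUBLIC
        table: ORDERS
    - stream-id: customers-stream
      external-resource-specifier:
        database: ANALYTICS
        schema: PUBLIC
        table: CUSTOMERS
```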
In brief, the efficiency of multi-stream connectors resulted in a substantial cost reduction over our customer’s legacy solution, a much cleaner data warehouse, and a group of engineers who can sleep better at night knowing that we won’t burden their production systems.
Outlook
What’s next in the world of multi-stream connectors? We’ve seen the excitement from customers who have used them and the impact they’ve had on their businesses, validating our commitment to upgrading all of our existing connectors to MSC. In addition, all new connectors will support MSC by default.
Currently supported multi-stream connectors include the MySQL CDC source, Postgres CDC source, and Snowflake sink. You can find a complete list of all Decodable connectors here.
Ready to dive deeper into how multi-stream connectors can optimize your streaming workflows?
Join our upcoming tech talk and demo, “Managing Streaming from Multiple External Resources with One Connection,” on May 8 at 9am PST/12pm EST, where Decodable engineers Gunnar Morling and John MacKinnon will illustrate how easy it is to build a multi-stream data flow from a MySQL database to Snowflake.