Back
October 14, 2022
8
min read

Real Time Streaming Joins With SQL

By
Robert Metzger
Share this post

Decodable enables continuous SQL joins across streaming sources including streaming systems like Kafka, Pulsar and Kinesis as well as databases with Change Data Capture (CDC). This is exactly the use-case for real-time joins. In this blog we'll explore how this works, and how it helps bring siloed data into play in low-latency streaming applications. Whenever a table in your database changes, Decodable instantly updates the join result.

Let’s assume you have a database table with all your customer data, and a table mapping the zip-codes to your retail stores, and your goal is to build a report counting the number of customers which are living close by a store. With a traditional database you would periodically run a SQL query to get that report. With modern streaming platforms like decodable, you are able to run such a query continuously. Whenever you get new or updated customer data, a store data changes, the report will update in real-time.

Combining a stream and a table in another example, you have a website clickstream of events coming from Confluent Cloud (a managed Apache Kafka service) and you want to enrich each stream event by looking up & adding customer information in a PostgreSQL table. You'll use this to tell you the last product page the customer viewed. You'll then store the result back into PostgreSQL, which will always be up to date with the latest page viewed. This is exactly the use case for real-time joins.

‍

‍

Currently, decodable supports running real-time joins with data from MySQL and PostgreSQL databases. We are currently working on adding more databases, such as Oracle, Microsoft SQL Server, Snowflake and more.

Benefits Of Real-Time Joins

What are the benefits of real-time joins compared to traditional joins, executed periodically?

The main benefit is the reduced latency. Historically you'd use a batch mode to execute a join, orchestrated by Apache Airflow or AWS Step Functions triggering a query, say once per hour. Even if the query takes only a few seconds to run, the average latency will be 30 minutes, the long tail latency much higher.

We know from our customers that data is most valuable when fresh. In other words, the higher the latency you have in your processing pipeline, the less value you can extract from it. For example a global logistics company which is already using stream processing has a competitive advantage if it can inform customers in real-time about the locations of their goods, and notify them in case of delays. A traditional "bricks and mortar" retailer can also benefit from bringing online experiences into their stores: real-time promotions, post-checkout surveys or upsells like product insurance.

Another benefit is data integration across many systems: It is often difficult to join data located in different databases - the data is locked in. How do you extract and normalize data coming from a traditional Oracle database, Snowflake or MongoDB? With decodable’s connector ecosystem and stream processing capabilities, this can be done in a matter of minutes.

Real-time joins are also resource efficient and fast, because the joins are calculated based on the change / delta of the data, rather than scanning the entire database.

Creating Real-Time Joins In Decodable

Decodable is a platform for real-time data processing. Data is represented in streams. To get data into a stream, you need to set up connectors. A connector can read data from systems like Kafka or Kinesis. Also, we support connecting to databases such as MySQL or PostgreSQL. In case of a database, the connector will read the transaction log from a database into a decodable stream.

Once you have multiple streams of transaction logs (we call them change streams, because they represent change events in a database), you can set up pipelines to process them. A pipeline is defined by its input streams, an output stream and a SQL query describing, in our example, a join operation:

‍

A pipeline output stream can then be used by other pipelines or by a connector writing the data somewhere, such as MySQL, PostgreSQL or Kafka.

Conclusion

Almost all data is produced in real-time, but typically we capture the data in a static database or data warehouse before processing it. Traditionally, this data has been stored and then processed in batches, mostly due to technical limitations and the perceived challenges of stream processing. However, companies like Uber, Lyft, Amazon, Netflix and Stripe have embraced real-time stream processing for a few years now, mostly using internal stream processing platforms. Decodable removes the technical and skills barriers to adopting streaming data by offering a fully managed serverless stream processing platform, using SQL as the transformation language.

Real-time joins are among the more advanced use-cases for stream processing, as they require a lot of different infrastructure components for getting the data out of databases, and for efficiently executing real-time joins. Decodable eliminates this complexity, making streaming joins available to everyone.

Create a free decodable account today to try out streaming joins yourself!


You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.

Learn more:

Join the community Slack

‍

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

đź‘Ť Got it!
Oops! Something went wrong while submitting the form.
Robert Metzger

Robert oversees the core Apache Flink-based data platform at Decodable powering the SaaS stream processing platform. Beyond this role, he’s a committer and the PMC Chair of the Apache Flink project. He has co-created Flink and contributed many core components of the project over the years. He previously co-founded and successfully exited data Artisans (now Ververica), the company that created and commercialized Flink.

Decodable enables continuous SQL joins across streaming sources including streaming systems like Kafka, Pulsar and Kinesis as well as databases with Change Data Capture (CDC). This is exactly the use-case for real-time joins. In this blog we'll explore how this works, and how it helps bring siloed data into play in low-latency streaming applications. Whenever a table in your database changes, Decodable instantly updates the join result.

Let’s assume you have a database table with all your customer data, and a table mapping the zip-codes to your retail stores, and your goal is to build a report counting the number of customers which are living close by a store. With a traditional database you would periodically run a SQL query to get that report. With modern streaming platforms like decodable, you are able to run such a query continuously. Whenever you get new or updated customer data, a store data changes, the report will update in real-time.

Combining a stream and a table in another example, you have a website clickstream of events coming from Confluent Cloud (a managed Apache Kafka service) and you want to enrich each stream event by looking up & adding customer information in a PostgreSQL table. You'll use this to tell you the last product page the customer viewed. You'll then store the result back into PostgreSQL, which will always be up to date with the latest page viewed. This is exactly the use case for real-time joins.

‍

‍

Currently, decodable supports running real-time joins with data from MySQL and PostgreSQL databases. We are currently working on adding more databases, such as Oracle, Microsoft SQL Server, Snowflake and more.

Benefits Of Real-Time Joins

What are the benefits of real-time joins compared to traditional joins, executed periodically?

The main benefit is the reduced latency. Historically you'd use a batch mode to execute a join, orchestrated by Apache Airflow or AWS Step Functions triggering a query, say once per hour. Even if the query takes only a few seconds to run, the average latency will be 30 minutes, the long tail latency much higher.

We know from our customers that data is most valuable when fresh. In other words, the higher the latency you have in your processing pipeline, the less value you can extract from it. For example a global logistics company which is already using stream processing has a competitive advantage if it can inform customers in real-time about the locations of their goods, and notify them in case of delays. A traditional "bricks and mortar" retailer can also benefit from bringing online experiences into their stores: real-time promotions, post-checkout surveys or upsells like product insurance.

Another benefit is data integration across many systems: It is often difficult to join data located in different databases - the data is locked in. How do you extract and normalize data coming from a traditional Oracle database, Snowflake or MongoDB? With decodable’s connector ecosystem and stream processing capabilities, this can be done in a matter of minutes.

Real-time joins are also resource efficient and fast, because the joins are calculated based on the change / delta of the data, rather than scanning the entire database.

Creating Real-Time Joins In Decodable

Decodable is a platform for real-time data processing. Data is represented in streams. To get data into a stream, you need to set up connectors. A connector can read data from systems like Kafka or Kinesis. Also, we support connecting to databases such as MySQL or PostgreSQL. In case of a database, the connector will read the transaction log from a database into a decodable stream.

Once you have multiple streams of transaction logs (we call them change streams, because they represent change events in a database), you can set up pipelines to process them. A pipeline is defined by its input streams, an output stream and a SQL query describing, in our example, a join operation:

‍

A pipeline output stream can then be used by other pipelines or by a connector writing the data somewhere, such as MySQL, PostgreSQL or Kafka.

Conclusion

Almost all data is produced in real-time, but typically we capture the data in a static database or data warehouse before processing it. Traditionally, this data has been stored and then processed in batches, mostly due to technical limitations and the perceived challenges of stream processing. However, companies like Uber, Lyft, Amazon, Netflix and Stripe have embraced real-time stream processing for a few years now, mostly using internal stream processing platforms. Decodable removes the technical and skills barriers to adopting streaming data by offering a fully managed serverless stream processing platform, using SQL as the transformation language.

Real-time joins are among the more advanced use-cases for stream processing, as they require a lot of different infrastructure components for getting the data out of databases, and for efficiently executing real-time joins. Decodable eliminates this complexity, making streaming joins available to everyone.

Create a free decodable account today to try out streaming joins yourself!


You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.

Learn more:

Join the community Slack

‍

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

Robert Metzger

Robert oversees the core Apache Flink-based data platform at Decodable powering the SaaS stream processing platform. Beyond this role, he’s a committer and the PMC Chair of the Apache Flink project. He has co-created Flink and contributed many core components of the project over the years. He previously co-founded and successfully exited data Artisans (now Ververica), the company that created and commercialized Flink.