September 3, 2024
6 min read

Reducing the Barrier to Entry for Building Stream Processing Applications

In this episode of the Data Engineering Podcast, Tobias Macey interviews Eric Sammer, Founder and CEO, about starting your stream processing journey with Decodable, a data movement and stream processing platform based on Apache Flink and Debezium. Eric has been working on database systems, data platforms, and infrastructure for over 25 years. As an early employee at Cloudera, Eric worked on “first generation Big Data stuff” with Hadoop, Spark, and Kafka. Since 2010, he’s been more focused on building platforms as a vendor.

The Stream Processing Challenge

Can you give us an overview of Decodable, the story behind it, and why you decided to spend your time in this area?

The motivation behind Decodable stems from wanting to help organizations to not spend their time and energy working to solve the low-level problem of data movement and stream processing. I hate this problem so much that I am determined to solve it, or die trying.

There are a bunch of different sources of data—for example, think of event streaming platforms like Kafka, Redpanda, Kinesis, and GCP Pub/Sub, and operational databases like Postgres, MySQL, Oracle, and SQL Server. There are also an increasing number of destinations for data—so in addition to the previous examples, there are analytic database systems, cloud data warehouses, and data lakes such as Snowflake and Databricks, and even things like Amazon S3 with Apache Parquet data. And there is an increasingly long tail of more specialized data infrastructure, including real-time OLAP systems like StarTree, Imply, Apache Druid, Apache Pinot, and Rockset, as well as full-text search systems like Elasticsearch, telemetry and metric stores like InfluxDB, and so much more.

Decodable exists to sit in between those source and destination systems, including microservices that are hanging off of Kafka topics, and be able to process data in real time: filter, join, enrich, aggregate—all the verbs that people would use, either in SQL or through Java APIs. Decodable is built on open source projects, primarily Apache Flink on the stream processing side and Debezium on the change data capture side, so that's the set of APIs and capabilities that we're exposing through our platform.

Reducing Stream Processing Complexity

At Decodable, we really wanted to solve this problem and make it as simple as possible to be able to move and process data, to add new source and destination systems as needed. Fundamentally, that's the space in which we play, and this problem is just far more complicated than it should be. Our goal is to drain the complexity out of this problem—it just doesn't need to be as complex as it is.

Within the overall space of complexity in streaming, you hear a lot about things like checkpointing, windowing, at-least-once consistency, exactly-once semantics, and challenges of late arriving data. How much of it are you still dealing with as you build Decodable, and how much of it has been solved through “force of will” or new updates?

That's the spiciest of spicy questions, because that's the complexity that we're talking about—so that's right on the money. To be really honest with you, Decodable has already reached the point where we’ve handled many of those things. There's quite a bit of configurability in these systems, and in talking with Robert Metzger, the PMC chair of Flink, about this problem, he explained that Flink was intentionally built in a way that allows quite a bit of tuning, which assumes you know a lot about the workload. If you compare that to something like Postgres, where you just fire off queries (and yes, there's quite a bit of tuning you can do to the RDBMS as a whole), for the most part it just kind of does what you expect it to do. So that's the right question: how to effectively and efficiently manage the complexity of data movement and stream processing.

For things like checkpointing, memory tuning, buffers, configuring at-least-once versus exactly-once, getting the semantics right, the retention and the timeouts, those kinds of things Decodable has done a pretty good job of addressing. It is actually possible to “paper over” a lot of that type of complexity. Separate from that, some of the challenges around late arriving data and state management, in terms of message event replay and how you compensate for bad data, those things are intrinsic to data engineering—they're not exclusively stream processing problems, although they do tend to rear their heads in stream processing systems in ways that you probably don't encounter as much in batch.

It’s not so much about making issues with the data stream never occur, because those are intrinsic things you have to think about; rather, you can build workflows, processes, and systems to at least guide people down the right path of making sane decisions. So for example, if you care about late arriving data, there's actually a way of presenting the concept of watermarks, with all of its underlying deep complexity, to somebody in a way that is more intuitive. We strive to do that out of the box with Decodable.
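
To make that concrete, here is a minimal sketch in plain Flink SQL (not Decodable-specific syntax, with hypothetical table and column names): the watermark is declared once on the table, and every windowed query on that table then handles late arriving data according to it.

-- Declare how long to wait for late events before a window is considered complete.
CREATE TABLE orders (
  order_id   STRING,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3),
  WATERMARK FOR order_time AS order_time - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka'
  -- remaining connector options omitted in this sketch
);

-- A one-minute tumbling window aggregation that honors the watermark above;
-- events arriving after the watermark has passed are dropped from the window
-- by default rather than silently skewing results.
SELECT window_start, COUNT(*) AS order_count
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end;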

New Use Cases of Stream Processing

Now that we have platforms like Decodable and others focusing on different aspects of this streaming ecosystem, how does that influence the types of problems, the types of businesses, that are actually starting to implement streaming data as a core capability of their business?

Just within the past couple of years, we’ve seen companies who once said that they don’t have any streaming use cases come around to the idea that they definitely do. A good example is logistics and how that can drive customer experience. Until quite recently, nobody really thought they needed to know exactly where their pizza or Chinese food was—that is, until GrubHub and DoorDash and all these other companies started showing you the little tracking icon on your mobile device. Now customers just expect that, so these new use cases have really pushed a bunch of people into thinking about these things, especially in retail, fintech, logistics, and gaming—games like Fortnite, and companies like Epic and EA, have done quite a bit with real-time processing based on things like Flink.

The first step for a lot of people was adopting an event streaming system like Kafka—when you look at the Fortune 500, or even the larger global companies, most of them have done that. Obviously, companies like Confluent and AWS have really popularized the notion of event streaming in general. At Decodable, we look at that in a similar way that people look at S3. Kafka is the primitive that enables a whole bunch of other things to follow. And stream processing is the natural thing that follows event streaming. First you are able to move your data, then you’re able to process that data without actually having to write microservices that listen to or produce directly to those Kafka topics, so it's a natural extension.

Event Streaming (Kafka) vs. Stream Processing (Flink)

The phrases “event streaming” and “stream processing” can be confusing as people begin to explore data streaming. Can we clearly draw a line between these two concepts?

Event streaming, the way we use that phrase at Decodable and as it’s used more broadly, refers to the durable storage and movement of data in real time. Some projects and services, like Redpanda with its Wasm support or Apache Pulsar with its functions, blur the line somewhat by having different capabilities that squish event streaming and stream processing together into one box (although in certain cases they are effectively two different boxes). That can make things mushy for people, but we think about event streaming specifically as durable storage and movement of data.

And then stream processing is the actual processing and connectivity of that data—connectivity of course referring to connecting to source and destination systems to get messages or event data onto and off of a Kafka topic or something similar.

Using the Kafka ecosystem as an example, storage and movement is the Kafka broker, the connectivity would be Kafka Connect, and then the processing aspect is where it starts to get a little bit interesting. There's Kafka Streams, there's ksqlDB, there's obviously Flink—our weapon of choice—and there are other systems that you can pair with those approaches.

Stream Processing Adoption and Learnings

How has the adoption curve of Apache Flink and stream processing over the past several years influenced Decodable’s product focus? How have the overall industry trends factored into the ways that you think about the solution space that you're targeting?

There are a couple of things we've learned, and we can also take a look at some of the stuff the stream processing ecosystem has gotten wrong but has been working to improve—not just in how the Decodable platform has evolved, but also in our shared understanding of this space. For context, in 2021 Decodable was about 8 months old and we had just raised our Series A round of funding. We're now almost three years in and have spent quite a bit of additional time with customers. Initially, we thought the stream processing space was more mainstream than it really was; essentially, we thought it was further along in its adoption and maturity.

SQL versus Java

An assumption that stemmed from that was that the “obviously” right thing to do was to focus on SQL, because everybody knows it. That was probably correct in large part, but at the same time incomplete. One of the things we've learned since then from talking to customers is that somewhere around 70 to 90 percent of workloads can be expressed in SQL. A lot of what people need to do with stream processing is route data, knock out some PII, or filter records for just the successful HTTP events—really simple stuff that is a one-line SQL statement, effectively replacing a whole chunk of Java code. Using SQL means one less service to explicitly build, instrument, and monitor—you can easily push that functionality to a vendor, which has advantages.
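
To give a sense of what that looks like, here is a minimal sketch (with hypothetical stream and field names) of that kind of job as a single SQL statement rather than a bespoke Java consumer service:

-- Keep only successful requests; PII fields are knocked out simply by not selecting them.
INSERT INTO http_events_clean
SELECT event_time, request_path, status_code
FROM http_events_raw
WHERE status_code BETWEEN 200 AND 299;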

But there are also around 10 to 20 percent of use cases that are just different, for a few reasons. One, purely from an expressiveness perspective, SQL is just hard to use in specific cases, typically because they involve more sophisticated state management, really complex sessionization, or data enrichment that can require very specific kinds of time management and window functions. There are also cases where people have to reach into third-party libraries, and especially as LLMs and AI become all the rage, those kinds of things aren't available as SQL functions. Or they have extremely deeply nested data that is complex to think about in the relational model for a variety of reasons.

While some of those things could potentially be done in SQL, it can be more natural for people to think about it as imperative code. As a result, customers told us that they had to be able to write code, so we pivoted to expose full support for the Flink APIs. That allowed at least 50 percent of the people we were talking to who had said, “We love what you're doing, but we can't use you because you're SQL-only,” to be able to use us. In retrospect, of course that makes sense, but the market just wasn't so far along that everything could be SQL. You do need both, and Flink has recognized this from very early on. We have exposed that functionality and built support around it in a way that is super nice to use, but that was one big thing we learned.

Data Sovereignty and Bring Your Own Cloud

Another big trend is the issue of data sovereignty. Especially as more highly regulated businesses, or just more complex businesses, move to the cloud with their data platforms—not just their application stack, but their data platforms—the question arises around shipping all of their event data to a third-party vendor to be processed and then having it returned. Some companies, such as banks, insurance companies, and healthcare providers, recoil when you say, “Just give us all your data.” Things are a little different for the likes of Snowflake or Salesforce; but let's be honest, Decodable has not yet reached the Snowflake stage, and so the level of trust we enjoy with customers is probably not quite as robust as theirs.

Customers rightfully want to be able to maintain control of their data, and so our answer to this—and Redpanda has done something similar—is a “bring your own cloud” model, where we cut the platform into a control plane and a data plane. The control plane handles only command and control messages, while the data plane is the part that actually processes data and touches customer infrastructure. Most significantly, the data plane runs inside the customer’s own cloud account. So the platform is still in the cloud, it still has a lot of the features of a managed offering because we're able to collaborate on the management aspects, but the data is resident within the customer’s account and they have full control over it.

There may be a debate raging online right now about whether or not that's the future of cloud services. Being more pragmatic, we offer both fully-managed and bring your own cloud because, quite frankly, we have customers that ask us for both. Some customers prefer one over the other, and that was an interesting thing we've learned. Companies offering cloud-based data infrastructure services have a high wall to climb until they are as mature and well-developed as a Snowflake.

Adding Support for Custom Code (Java, Scala, etc.)

What did the process look like from the realization of unanticipated customer requirements to the point where you delivered the necessary product changes? What were some of the minefields that you had to traverse on that journey?

The issue with supporting custom code in addition to SQL for processing jobs wasn’t so much about how to implement or support the necessary changes. This is because Decodable is building on the shoulders of giants with Apache Flink, which already has a super robust DataStream API and a Table API. These both offer different levels of abstraction that are really ergonomic and nice to use for most use cases, maybe even for all use cases (hedging a little bit because there's always going to be somebody who hates it). But the fact that Flink already has a lot of this capability meant that the issue for us was to determine how to offer it in a way that is safe for the platform.

The real issue for a cloud provider is untrusted third-party code. Customers are uploading arbitrary code to a cloud service and saying, “Run this for me.” For us to build a service around this, it’s important to get it right. We've even talked to other people building data platforms, who have said they might trust customer code to not do adversarial things, but they still worry about supply chain attacks, exfiltration of sensitive information, or more often about the code just chewing up arbitrary resources in a way that impacts other workloads. So resource isolation and management, and then security and safety, are the biggest challenges.

In the “bring your own cloud” model, it's a little bit easier because that's actually the part that's running on the customer's infrastructure. It's relatively easy for them to have isolated infrastructure because it’s single-tenant by definition. But custom code inside of our fully-managed platform is also essentially single-tenant for that custom job. As a consequence, there is quite a bit of infrastructure which goes into that. Beyond that, there were secondary considerations, such as preventing custom code from touching parts of Flink that we have to control and manage for a variety of reasons, for example observability infrastructure and those kinds of things. Those were probably easier to solve—they're quite a bit of work, but conceptually easier to solve.

Future Support for Additional Programming Languages

Are you looking at supporting Python for Flink jobs?

This is an area of deeper research on our part, and we talk a lot about expanding the language support. Flink is Java, it enjoys a rich ecosystem of tooling and compatibility, and the runtime performance is arguably better than it should be—modern Java is incredible. Sure, there are people who think, “I could hand code it in Rust better,” and yes, there can be some truth to that from the perspective of a C++/Rust person. It’s primarily about striking the right balance.

Opening Flink and its capabilities to languages like Python, Go, and JavaScript/TypeScript is actually really, really important because data platforms at larger organizations are typically not homogeneous in language usage. Polyglot programming is a whole thing these days. Being able to support that reality is something we need to improve—not just within Decodable, but in the larger Flink ecosystem. Other people in the community would probably agree with that. As an example, PyFlink support is okay, but it needs to be better.

Onboarding and Developer Experience

What is the developer experience and the onboarding process, the concepts people need to be aware of, as they start using the Decodable platform?

This is one of those areas where the work is never completely done. Systems like Postgres have existed forever, and the UX/DX is relatively sophisticated and well known—there are patterns surrounding its usage. For stream processing, because it is effectively an integration technology, it's all about connecting to upstream and downstream systems, and it has the overhead of effectively being a query engine. It's not a database per se, but it has all the hallmarks of the query engine portion of a database system: arbitrary workload definition, those kinds of things—there's quite a bit of sophistication and complexity in that.

Inside of Decodable, we've tried to distill this down to as few concepts as possible. We think about the world in these terms: connections to external systems, data streams that are produced or consumed, and what we call pipelines, which actually process data in between those streams. That allows us to compose those three core primitives into sequences, or DAGs, showing for example that this connection goes to this stream, which feeds these five pipelines, which feed maybe these other pipelines, which feed these other connections, and so on and so forth.
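
Conceptually, a pipeline in that model can be pictured as a SQL statement that consumes one stream and produces another (the stream names below are illustrative, not a real Decodable example), with connections on either side moving data to and from the external systems:

-- A pipeline: read from one stream, write to another.
-- Upstream, a connection feeds orders_raw (for example from Kafka or a CDC source);
-- downstream, another connection delivers orders_eu to its destination system.
INSERT INTO orders_eu
SELECT order_id, customer_id, amount
FROM orders_raw
WHERE region = 'EU';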

For example, if you're a SQL person and you're thinking about using Decodable to get data from Postgres to Snowflake with a bunch of transformations in the middle—to cleanse it, or restructure it, so you're not burning Snowflake warehouse credit—you're probably thinking in a dbt and SQL mode, and in that case you want to be able to fit into an existing workflow. We don't want to rip somebody out of that mindset and try to teach them a different way of working.

Customer Interaction with Decodable

Where do you want people to be thinking about how to interact with Decodable? Should it be fully managed by their data orchestration system? Or should Decodable be something that people don't even know exists, and it just plugs away and does its thing?

This is a place where streaming is a little bit different from the batch side, in that we don't necessarily think in terms of orchestration because it doesn't look like an Airflow, it looks a whole lot more like a code deploy. Although I suppose there is a life cycle, in that you start connections and pipelines that run forever, or they run until you have to make an update with a new version of the code.

We have customers today who think about this as being something they do through Terraform, and that's the way they want to manage connections and pipelines. We also have customers who think about it as part of scripting, or inside of a Makefile—at least a couple of our customers run “make pipelines”, and it calls our CLI tool to provision a bunch of things. And there's definitely a “Click Ops” demographic, who could run “decodable create stream” and “decodable create connection” from the command line, but it’s just easier in some cases to go into the UI and plug in their Kafka parameters and click Start—although typically they do go back and replicate it with the CLI.

There are also customers who have intensely sophisticated internal infrastructure that is part of their own development and deployment processes, and they just directly touch our REST APIs. For some customers, their developers are using Decodable and they don't even know it. They define some SQL and some connection information inside of a YAML file, which then gets committed to git or GitHub, and something wakes up and takes that YAML file, parses it, and then makes a bunch of API calls to Decodable—which is essentially what our CLI does, give or take.

Connecting to External Systems

How is Decodable making connections to external systems a manageable problem, without having to spend all your engineering time on writing connectors for all the different source and destination systems?

A good example of this is Splunk, which has an incredibly long tail of connectors for various systems. Anytime you're building connectors, you have to be careful of the long-tail problem. At Decodable, we look for points of leverage, and we think customers should look for points of leverage—there are some natural aggregators. In other words, there are places where things natively integrate and get first-class support.

There are two, maybe three, natural points of gravity in the data platform: the operational database, the event streaming layer, and the data warehouse. Those are your three natural aggregators. There are an increasing number of systems that support Snowflake, Postgres, MySQL, MongoDB, Cassandra, and so on. And there are an increasing number of systems that support Kafka, Pulsar, and Kinesis as natural aggregation points. As an example, you can get the equivalent of network flow data out of AWS on a Kinesis stream, and AWS security audit logs can also very naturally land in a Kinesis stream. We think those things are natural aggregators—also including object stores like S3 and GCS.

To whatever degree we can, we spend most of our time thinking about the natural aggregators and then how to get data into and out of those aggregators. Some of those are natural source systems, while others are natural destination systems—and some of them are both natural source and sink/destination systems. As an example, we don't think the data warehouse is the right source to drive operational systems for a variety of reasons. Typically that team is not on call, they don't have the same uptime SLAs, you just don't think about feeding microservices from your data warehouse—it doesn't make a ton of sense.

Enriching Data with Stream Processing

Given that the information to do data enrichment may need to come from within the data warehouse, how does that fit into the way you think about data movement and stream processing?

This is where it gets interesting—the question could also be asked, “While you can do enrichment in the data warehouse, if the streaming layer is the thing that actually collects data from all of the operational systems and is feeding the data warehouse, isn’t it also capable of performing the enrichment that needs to go to other systems?” It is very unlikely that there is one place that fits the uptime, latency, throughput, and other SLA requirements of all these systems. So it's actually not a bad thing to be able to do enrichment with operational information from a Postgres database in the streaming layer and also to have that same data inside of the data warehouse.

A lot of people will use Decodable, or something like it, to collect data from all these operational systems and put what is effectively the raw data, or mostly raw, slightly massaged data, into the warehouse—batch workloads inside of the data warehouse can of course make use of that data directly. But when customers are feeding transactional messaging systems, for example something like MailChimp or SendGrid, they may also be doing that enrichment in the streaming engine. This is why we support joins and sophisticated stateful processing inside the stream processing layer—so customers can do that enrichment in real time and then directly feed the operational system, effectively bypassing the data warehouse for that use case.
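
As a hypothetical sketch of that pattern (stream and column names invented for illustration), an enrichment join in the streaming layer might look like the following, with the result feeding a transactional messaging connector instead of making a round trip through the warehouse:

-- Enrich signup events with customer attributes replicated from Postgres via CDC.
INSERT INTO welcome_email_requests
SELECT s.user_id, s.signup_time, c.email, c.preferred_language
FROM signups AS s
JOIN customers AS c
  ON s.user_id = c.id;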

Companies don’t need to be thinking about this as an either/or situation. People do ask, “What about data duplication? What about logic duplication?” Our answer is: that's why we support SQL. The same SQL that you can run in the data warehouse, you can also run in the stream processing layer. That's why it’s important to be able to do that, and why we also support things like dbt. Customers can actually push that processing upstream when it makes sense. It may not be required all the time, but it’s always available when they need it to be able to fulfill their use cases.

Surprising Use Cases for Stream Processing

As you have worked with customers, helping them figure out how best to apply streaming to their problems, what are some of the most interesting, or innovative, or unexpected ways that you've seen them use your platform?

Truly, we are frequently surprised by what people are doing with Decodable. We definitely see “bread and butter” use cases: the filter, route, transform, aggregate, and trigger types of use cases that happen in the stream processing layer. One of the things that we have really come to appreciate is the combination of change data capture from an operational database system, through the stream processing engine to transform that data, and then pumping it back into other operational systems like caches and full-text search. These days it’s probably not all that unique, but at the time it surprised me how simple it made things.

We worked on a use case with a company that takes in résumé data from MySQL, does a bunch of parsing, cleansing, and transformation, and then indexes that data into Elasticsearch to make it searchable by different features of a candidate. This process wound up being zero code for them—it's one SQL statement that does very low latency processing of this data, and then indexing back into the full-text search system.
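
The shape of that pipeline might look roughly like the following sketch (the stream, column, and function choices are illustrative assumptions, not the customer’s actual schema):

-- CDC rows in, search-ready documents out.
INSERT INTO candidate_search_index
SELECT candidate_id, full_name, LOWER(TRIM(skills)) AS skills, years_of_experience
FROM resumes_cdc
WHERE status = 'ACTIVE';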

If you extend that to things like caches, Redis and ElastiCache for example, then you actually start to see all this opportunity to be able to create what amounts to materialized views that are optimized for different workloads. Again, that's probably not super surprising these days, but the kinds of applications that you can build and how quickly you can build them—that's the thing that’s surprising.

Interesting Stream Processing Challenges

In your work of building this platform, working in this ecosystem, what are the most interesting, unexpected, or challenging lessons that you've learned?

As an engineer, you look at these open-source projects and they are so sophisticated, so powerful. Everybody in the data engineering space describes their tech stack as Flink plus Kafka plus Debezium—and the “plus” in that sentence, those two pluses, are doing a lot of work.

The thing that has been surprising is just how much time and effort we spend on the glue between these otherwise fantastic Lego bricks. Here's probably the dirty secret about Decodable: our differentiation is not that we're 3 milliseconds faster than the other guy, our differentiation is all the glue we’ve built to go in between those systems. The DevX, the deployment, the APIs, the orchestration, the observability—the observability alone around data pipelines in a streaming context! Data quality looks different in a streaming context, it functions differently in a streaming context, so those things are much harder than I would have expected.

It’s similar to the joke about drawing an owl, where first you draw two circles and then step two is to draw the rest of the owl. For a data movement and stream processing platform, step one is to deploy Flink, Kafka, and Debezium—step two is to build the rest of the platform. The amount of time and effort is daunting, especially given that's our full-time job here at Decodable. We spend all our time thinking about it, and if anything, it shows the need for data platform teams inside of organizations and the value they provide.

With that context, for people who are considering Decodable or considering streaming, what are the cases where building it yourself is the wrong choice?

People shouldn’t think about Decodable or stream processing as implying that they will be shutting down their data warehouse. The idea that companies will take their Airflow, dbt, and Snowflake and shoot them into the moon—that's just not what's going to happen, and we don't make that claim.

Stream processing, at its core, is something that typically sits between two other systems—some source and some destination—and processes data to make it available in that larger system. Stream processing is not meant to be a serving system—we purposely decouple the Decodable platform from Apache Pinot, Imply, Snowflake, Databricks, S3, or Redis. We believe you actually want those as two separate systems so you can cook the data and then serve it in the most appropriate storage and query engine that makes sense for a particular workload.

People shouldn’t be imagining they’re going to replace Snowflake with Decodable, or Databricks with Decodable, or Postgres with Decodable—that’s not how you should think about stream processing. If you're thinking about how to get data from Postgres through the outbox pattern to be available for microservices, or into HubSpot or Snowflake or S3, those are the places where Decodable and stream processing fit.

What’s Coming Next from Decodable?

As you continue to build and iterate and scale the Decodable platform, what are the things you have planned for the near to medium term? Are there any particular projects you're excited to dig into?

There's a lot we can do. There are some areas where we will always be expanding, such as new connectors. We'll always be improving the SQL dialect and enriching the APIs to give people more expressive capabilities. There are some higher-order functions that we want to provide, more sophisticated management of event time, and other things that make it easier to build processing jobs.

Developer experience is always top of mind for us. It is not yet where we want it to be; there are certain things that are probably harder to do than they could be—think operational flows like making a backward-incompatible schema change, which sometimes needs to happen. You should avoid it at all costs, but when it does need to happen, it's currently messier than we’d like it to be. We also want to make it easier to reprocess data for bootstrapping net-new things, like bringing up a new Elasticsearch instance and replaying the last 12 months of data to populate it. You can do all of those things today, but they could be easier.

Are there any other aspects of the work that you're doing at Decodable or the overall ecosystem of stream processing that you'd like to cover before we close out the show?

Honestly, this is one of the better overviews of Decodable and I really appreciate the questions. We probably hit all the critical stuff. I would say that a lot of people are still struggling to understand where stream processing fits into their stack, into the data platform, and which workloads should move to it—we're spending more and more time trying to help people understand that.

At its core, people can think about Decodable as providing low-latency ETL that can power both analytical systems and microservices—it’s not really necessary to stress about it much more than that. Operational-to-analytical and operational-to-operational data infrastructure is probably the mental model that people should have for stream processing. That might be one of the areas where we have some more work to do, and it's probably an evergreen topic to talk about. But otherwise it's been a real pleasure, and I really appreciate the opportunity to be on the show today.

As a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?

The largest portion of the problem is really one around developer experience. Anecdotal evidence for that is that every team at every Fortune 500 company has a company-specific data platform layer that glues a set of things together in a way that makes sense for that particular company and team. There aren’t yet well-developed standards the way we have on the application development side. That creates so much unnecessary complexity; things are different just for the sake of being different. Those development efforts end up burning time and money, and that hurts people. I don't know the solution, but I wish we could come up with better patterns and practices for stream processing and encode them in a way that vendors can then more tightly integrate with.

David Fabritius

In this episode of the Data Engineering Podcast, Tobias Macey interviews Eric Sammer, Founder and CEO, about starting your stream processing journey with Decodable, a data movement and stream processing platform based on Apache Flink and Debezium. Eric has been working on database systems, data platforms, and infrastructure for over 25 years. As an early employee at Cloudera, Eric worked on “first generation Big Data stuff” with Hadoop, Spark, and Kafka. Since 2010, he’s been more focused on building platforms as a vendor.

The Stream Processing Challenge

Can you give us an overview of Decodable, the story behind it, and why you decided to spend your time in this area?

The motivation behind Decodable stems from wanting to help organizations to not spend their time and energy working to solve the low-level problem of data movement and stream processing. I hate this problem so much that I am determined to solve it, or die trying.

There are a bunch of different sources of data—for example, think of event streaming platforms like Kafka, Redpanda, Kinesis, GCP Pub/Sub, and operational databases like Postgres, MySQL, Oracle, and SQL Server. There are also an increasing number of destinations of data—so in addition to the previous examples, there are also analytic database systems, cloud data warehouses, data lakes, Snowflake, DataBricks, and even things like Amazon S3 with Apache Parquet data. And increasingly a longtail of more specialized data infrastructure, including the real-time OLAP systems, StarTree, Imply, Apache Druid, Apache Pinot, Rockset, as well as things like full-text search systems, Elasticsearch, InfluxDB, telemetry and metric stores, and so much more.

Decodable exists to sit in between those source and destination systems, including microservices that are hanging off of Kafka topics, and be able to process data in real time: filter, join, enrich, aggregate—all the verbs that people would use, either in SQL or through Java APIs. Decodable is built on open source projects, primarily Apache Flink on the stream processing side and Debezium on the change data capture side, so that's the set of APIs and capabilities that we're exposing through our platform.

Reducing Stream Processing Complexity

At Decodable, we really wanted to solve this problem and make it as simple as possible to be able to move and process data, to add new source and destination systems as needed. Fundamentally, that's the space in which we play, and this problem is just far more complicated than it should be. Our goal is to drain the complexity out of this problem—it just doesn't need to be as complex as it is.

Within the overall space of complexity in streaming, you hear a lot about things like checkpointing, windowing, at-least-once consistency, exactly-once semantics, and challenges of late arriving data. How much of it are you still dealing with as you build Decodable, and how much of it has been solved through “force of will” or new updates?

That's the spiciest of spicy questions, because that's the complexity that we're talking about—so that's right on the money. To be really honest with you, Decodable has already reached the point where we’ve handled many of those things. There's quite a bit of configurability in these systems, and in talking with Robert Metzger, the PMC chair of Flink, about this problem, he explained that you need to understand that Flink was intentionally built in a way that allows quite a bit of tuning, where a lot about the workload. If you compare that to something like Postgres, where you just fire off queries (and yes, there's quite a bit of tuning you can do to the RDMS as a whole), for the most part it just kind of does what you expect it to do. So that's the right question, how to effectively and efficiently manage the complexity of data movement and stream processing.

For things like checkpointing, memory tuning, buffers, configuring at-least-once versus exactly-once, getting the semantics right, the retention and the timeouts, those kinds of things Decodable has done a pretty good job of addressing. It is actually possible to “paper over” a lot of that type of complexity. Separate from that, some of the challenges around late arriving data and state management, in terms of message event replay and how you compensate for bad data, those things are intrinsic to data engineering—they're not exclusively stream processing problems, although they do tend to rear their heads in stream processing systems in ways that you probably don't encounter as much in batch.

But it’s not so much about making issues with the data stream never occur, because that’s an intrinsic thing that you have to think about, but you can build workflows, processes, and systems to at least guide people down the right path of making sane decisions. So for example, if you care about late arriving data, there's actually a way of presenting the concept of watermarks, with all of its underlying deep complexity, to somebody in a way that is more intuitive. We strive to do that out-of-the-box with Decodable.

New Use Cases of Stream Processing

Now that we have platforms like Decodable and others focusing on different aspects of this streaming ecosystem, how does that influence the types of problems, the types of businesses, that are actually starting to implement streaming data as a core capability of their business?

Just within the past couple of years, we’ve seen companies who once said that they don’t have any streaming use cases come around to the idea that they definitely do. A good example is logistics and how that can drive customer experience. Until quite recently, nobody really thought they needed to know exactly where their pizza or Chinese food was—that is, until GrubHub and DoorDash and all these other companies started showing you the little tracking icon on your mobile device. Now customers just expect that, so these new use cases have really pushed a bunch of people into thinking about these things, especially in retail, fintech, logistics, and gaming—systems like Fortnite or companies like Epic and EA Games have done quite a bit with real-time processing based on things like Flink.

The first step for a lot of people was adopting an event streaming system like Kafka—when you look at the Fortune 500, or even the larger global companies, most of them have done that. Obviously, companies like Confluent and AWS have really popularized the notion of event streaming in general. At Decodable, we look at that in a similar way that people look at S3. Kafka is the primitive that enables a whole bunch of other things to follow. And stream processing is the natural thing that follows event streaming. First you are able to move your data, then you’re able to process that data without actually having to write microservices that listen to or produce directly to those Kafka topics, so it's a natural extension.

Event Streaming (Kafka) vs. Stream Processing (Flink)

The phrases “event streaming” and “stream processing” can be confusing as people begin to explore data streaming. Can we clearly draw a line between these two concepts?

Event streaming, the way we use that phrase at Decodable and as it’s used more broadly, refers to the durable storage and movement of data in real time. Some projects and services, like Redpanda with its Wasm support or Apache Pulsar with its functions, blur the line somewhat by having different capabilities that squish event streaming and stream processing together into one box (although in certain cases they are effectively two different boxes). That can make things mushy for people, but we think about event streaming specifically as durable storage and movement of data.

And then stream processing is the actual processing and connectivity of that data—connectivity of course referring to connecting to source and destination systems to get messages or event data onto and off of a Kafka topic or something similar.

Using the Kafka ecosystem as an example, storage and movement is the Kafka broker, the connectivity would be Kafka connect, and then the processing aspect is where it starts to get a little bit interesting. There's KStreams, there's ksqlDB, there's obviously Flink—our weapon of choice—and there are other systems that you can pair with those approaches.

Stream Processing Adoption and Learnings

How has the adoption curve of Apache Flink and stream processing over the past several years influenced Decodable’s product focus? How have the overall industry trends factored into the ways that you think about the solution space that you're targeting?

There are a couple of things we've learned, and we can also take a look at some of the stuff the stream processing ecosystem has gotten wrong, but has been working to improve—not just examples of how the Decodable platform has evolved, but also in our shared understanding of this space. For context, in 2021 Decodable was about 8 months old and we had just raised our Series A round of funding. We're now almost three years in and have spent quite a bit of additional time with customers. Initially, we thought the stream processing space was more mainstream than it really was, essentially we thought it was further along in its adoption and maturity.

SQL versus Java

An assumption which stemmed from that was thinking that “obviously” the right thing to do is to focus on SQL, because everybody knows it and that was the right way to think about it. That was probably correct in large part, but at the same time incomplete. One of the things we've learned since then from talking to customers is that somewhere around 70 to 90 percent of workloads can be expressed in SQL. A lot of what people need to do with stream processing is route data, or knock out some PII data, or filter records for just the successful HTTP events—really simple stuff that is a one-line SQL statement, effectively replacing a whole chunk of Java code. Using SQL means one less service to explicitly build, instrument, and monitor—you can easily push that functionality to a vendor, which has advantages.

But there are also around 10 to 20 percent of use cases that are just different. They're different because of a few reasons. One, purely from an expressiveness perspective, SQL is just hard to use in specific cases. Typically because there are more sophisticated state management, or really, really complex sessionization use cases, or data enrichment use cases—for instance, these can sometimes require very specific kinds of time management and window functions. There are cases where people have to reach into third party libraries, and especially as LLMs and AI become all the rage, those kinds of things aren't available as SQL functions. Or they have extremely deeply nested data that is complex to think about in the relational model for a variety of reasons.

While some of those things could potentially be done in SQL, it can be more natural for people to think about it as imperative code. As a result, customers told us that they had to be able to write code, so we pivoted to expose full support for the Flink APIs. That allowed at least 50 percent of the people we were talking to who had said, “We love what you're doing, but we can't use you because you're SQL-only,” to be able to use us. In retrospect, of course that makes sense, but the market just wasn't so far along that everything could be SQL. You do need both, and Flink has recognized this from very early on. We have exposed that functionality and built support around it in a way that is super nice to use, but that was one big thing we learned.

Data Sovereignty and Bring Your Own Cloud

Another big trend is the issue of data sovereignty. Especially as more highly regulated businesses, or just more complex businesses, move to the cloud with their data platforms—not just their application stack, but their data platforms—the question arises around shipping all of their event data to a third-party vendor to be processed and then having it returned. Some companies, such as banks, insurance companies, healthcare providers, recoil when you say, “Just give us all your data.” Things are a little different for the likes of Snowflake or Salesforce; but let's be honest, Decodable has not yet reached the Snowflake stage, and so the level of trust we enjoy with customers is probably not quite as robust as theirs.

Customers rightfully want to be able to maintain control of their data, and so our answer to this—and Redpanda has done something similar—is a “bring your own cloud” model, where we cut the platform into a control plane and a data plane. The control plane handles only command and control messages, while the data plane is the part that actually processes data and touches customer infrastructure. Most significantly, the data plane runs inside the customer’s own cloud account. So the platform is still in the cloud, it still has a lot of the features of a managed offering because we're able to collaborate on the management aspects, but the data is resident within the customer’s account and they have full control over it.

There may be a debate raging online right now about whether or not that's the future of cloud services. Being more pragmatic, we offer both fully-managed and bring your own cloud because, quite frankly, we have customers that ask us for both. Some customers prefer one over the other, and that was an interesting thing we've learned. Companies offering cloud-based data infrastructure services have a high wall to climb until they are as mature and well-developed as a Snowflake.

Adding Support for Custom Code (Java, Scala, etc.)

What did the process look like from the realization of unanticipated customer requirements to the point where you delivered the necessary product changes? What were some of the mine fields that you had to traverse on that journey?

The issue for supporting custom code in addition to SQL for processing jobs wasn’t so much about how to implement or support the necessary changes. This is because Decodable is building on the shoulders of giants with Apache Flink, which already has a super robust DataStream API and a Table API. These both offer different levels of abstraction that are really ergonomic and nice to use for most use cases, even for maybe for all use cases (hedging a little bit because there's always going to be somebody who hates it). But the fact that Flink already has a lot of this capability meant that the issue for us was to determine how to offer it in a way that is safe for the platform.

The real issue for a cloud provider is untrusted third-party code. Customers are uploading arbitrary code to a cloud service and saying, “Run this for me.” For us to build a service around this, it’s important to get it right. We've even talked to other people building data platforms, who have said they might trust customer code to not do adversarial things, but they still worry about supply chain attacks, exfiltration of sensitive information, or more often about the code just chewing up arbitrary resources in a way that impacts other workloads. So resource isolation and management, and then security and safety, are the biggest challenges.

In the “bring your own cloud” model, it's a little bit easier because that's actually the part that's running on the customer's infrastructure. It's relatively easy for them to have isolated infrastructure because it’s single-tenant by definition. But essentially custom code inside of our managed fully-managed platform is also single-tenant for that custom job. As a consequence, there is quite a bit of infrastructure which goes into that. Beyond that, there were secondary considerations, such as preventing custom code from touching parts of Flink that we have to control and manage for a variety of reasons, for example observability infrastructure and those kinds of things. Those were probably easier to solve—they're quite a bit of work, but conceptually easier to solve.

Future Support for Additional Programming Languages

Are you looking at supporting Python for Flink jobs?

This is a place of more deep research on our part, and we talk a lot about expanding the language support. Flink is Java, it enjoys a rich ecosystem of tooling and compatibility, and the runtime performance is arguably better than it should be—modern Java is incredible. Sure, there are people who think, “I could hand code it in Rust better,” and yes, there can be some truth to that from the perspective of a C++/Rust person. It’s primarily about striking the right balance.

Opening Flink and its capabilities to languages like Python, Go, JavaScript/TypeScript, is actually really really important because, certainly for data platforms at larger organizations, they're typically not homogeneous on language usage. Polyglot programming is a whole thing these days.  Being able to support that reality is something we need to improve—not just within Decodable, but the larger Flink ecosystem. Other people in the community probably would agree with that. As an example, PyFlink support is okay, but it needs to be better. 

Onboarding and Developer Experience

What is the developer experience and the onboarding process, the concepts people need to be aware of, as they start using the Decodable platform?

This is one of those areas where the work is never completely done. Systems like Postgres have existed forever, and the UX/DX is relatively sophisticated and well known—there are patterns surrounding its usage. For stream processing, because it is effectively an integration technology, it's all about connecting to upstream and downstream systems, and it has the overhead of effectively being a query engine. It's not a database per se, but it has all the hallmarks of the query engine portion of a database system: arbitrary workload, definition, those kinds of things—there's quite a bit of sophistication and complexity in that.

Inside of Decodable, we've tried to distill this down to the fewest number of concepts as possible. We think about the world in these terms: connections to external systems, data streams that are produced or consumed, and what we call pipelines, which actually process data in between those streams. That allows us to compose those three core primitives into sequences, or DAGs, showing for example that this connection goes to this stream, which feeds these five pipelines, which feed maybe these other pipelines, which feed these other connections, and so on and so forth.

For example, if you're a SQL person and you're thinking about using Decodable to get data from Postgres to Snowflake with a bunch of transformations in the middle—to cleanse it, or restructure it, so you're not burning Snowflake warehouse credit—you're probably thinking in a dbt and SQL mode, and in that case you want to be able to fit into an existing workflow. We don't want to rip somebody out of that mindset and try to teach them a different way of working.

Customer Interaction with Decodable

Where do you want people to be thinking about how to interact with Decodable? Should it be fully managed by their data orchestration system? Or should Decodable be something that people don't even know exists, and it just plugs away and does its thing?

This is a place where streaming is a little bit different from the batch side, in that we don't necessarily think in terms of orchestration because it doesn't look like an Airflow, it looks a whole lot more like a code deploy. Although I suppose there is a life cycle, in that you start connections and pipelines that run forever, or they run until you have to make an update with a new version of the code.

We have customers today who think about this as being something they do through Terraform, and that's the way they want to manage connections and pipelines. We also have customers who think about it as part of scripting, or inside of a make file—at least a couple of our customers run “make pipelines”, and it calls our CLI tool to provision a bunch of things. And there's definitely a “Click Ops” demographic, who could run <span class="inline-code">decodable create stream</span> and <span class="inline-code">decodable create connection</span> from the command line, but it’s just easier in some cases to go into the UI and plug in their Kafka parameters and click Start—although typically they do go back and replicate it with the CLI.

There are also customers who have intensely sophisticated internal infrastructure that is part of their own development and deployment processes, and they just directly touch our REST APIs. For some customers, their developers are using Decodable and they don't even know it. They define some SQL and some connection information inside of a YAML file, which then gets committed to git or GitHub, and something wakes up and takes that YAML file, parses it, and then makes a bunch of API calls to Decodable—which is essentially what our CLI does, give or take.

Connecting to External Systems

How is Decodable making connections to external systems a manageable problem, without having to spend all your engineering time on writing connectors for all the different source and destination systems?

Another good example of this is Splunk, which has an incredibly long tail of connectors for various systems. Anytime you're building connectors, you have to be careful of the long-tail problem. At Decodable, we look for points of leverage, and we think customers should look for points of leverage—there are some natural aggregators. In other words, there are places where things natively integrate and get first class support.

There are two, maybe three, natural points of gravity in the data platform: the operational database, the event streaming layer, and the data warehouse. Those are your three natural aggregators. There are an increasing number of systems that support Snowflake, Postgres, MySQL, MongoDB, Cassandra, and so on. And there are an increasing number of systems that support Kafka, Pulsar, and Kinesis as natural aggregation points. As an example, getting the equivalent of net flow data out of AWS, you can get that on a Kinesis stream. Or AWS security audit logs, you can very naturally get those in a Kinesis stream. We think those things are natural aggregators—also including object stores like S3 and TCS.

To whatever degree we can, we spend most of our time thinking about the natural aggregators and then how to get data into and out of those aggregators. Some of those are natural source systems, while others are natural destination systems—and some of them are both natural source and sink/destination systems. As an example, we don't think the data warehouse is the right source to drive operational systems for a variety of reasons. Typically that team is not on call, they don't have the same uptime SLAs, you just don't think about feeding microservices from your data warehouse—it doesn't make a ton of sense.

Enriching Data with Stream Processing

Given that the information to do data enrichment may need to come from within the data warehouse, how does that fit into the way you think about data movement and stream processing?

This is where it gets interesting—the question could also be asked, “While you can do enrichment in the data warehouse, if the streaming layer is the thing that actually collects data from all of the operational systems and is feeding the data warehouse, isn’t it also capable of performing the enrichment that needs to go to other systems?” It is very unlikely that there is one place that fits the uptime, latency, throughput, and other SLA requirements of all these systems. So it's actually not a bad thing to be able to do enrichment with operational information from a Postgres database in the streaming layer and also to have that same data inside of the data warehouse.

A lot of people will use Decodable, or something like it, to collect data from all these operational systems and put what is effectively the raw data, or mostly raw, slightly massaged data, into the warehouse—batch workloads inside of the data warehouse can of course make use of that data directly. But when customers are feeding transactional messaging systems, for example something like MailChimp or SendGrid, they may also be doing that enrichment in the streaming engine. This is why we support joins and sophisticated stateful processing inside the stream processing layer—so customers can do that enrichment in real time and then directly feed the operational system, effectively bypassing the data warehouse for that use case.
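As a rough sketch of what that looks like in practice, a streaming enrichment pipeline of this kind can often be expressed as a single SQL statement. The stream and column names below are hypothetical, and the syntax is Flink-style SQL of the sort Decodable's pipelines are built on—treat it as an illustration, not a verbatim Decodable pipeline.

```sql
-- Hypothetical example: orders arrives from an operational source, customers is
-- change data captured from Postgres, and enriched_orders feeds a sink connector
-- for a transactional messaging system—no data warehouse in the path.
INSERT INTO enriched_orders
SELECT
  o.order_id,
  o.order_total,
  o.ordered_at,
  c.email,          -- enrichment columns joined in from the customers change stream
  c.loyalty_tier
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
WHERE o.order_total > 0;
```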

Companies don’t need to be thinking about this as an either/or situation. People do ask, “What about data duplication? What about logic duplication?” Our answer is: that's why we support SQL. The same SQL that you can run in the data warehouse, you can also run in the stream processing layer. That's why it’s important to be able to do that, and why we also support things like dbt. Customers can actually push that processing upstream when it makes sense. It may not be required all the time, but it’s always available when they need it to be able to fulfill their use cases.

Surprising Use Cases for Stream Processing

As you have worked with customers, helping them figure out how best to apply streaming to their problems, what are some of the most interesting, or innovative, or unexpected ways that you've seen them use your platform?

Truly, we are frequently surprised by what people are doing with Decodable. We definitely see “bread and butter” use cases—the filter, route, transform, aggregate, and trigger type of use cases that happen in the stream processing layer. One of the things that we have really come to appreciate is the combination of change data capture from an operational database system, through the stream processing engine to transform that data, and then pumping it back into other operational systems like caches and full-text search. These days it’s probably not all that unique, but at the time it surprised me how simple it made things.

We worked on a use case with a company that takes in résumé data from MySQL, does a bunch of parsing, cleansing, and transformation, and then indexes that data into Elasticsearch to make it searchable by different features of a candidate. This process wound up being zero code for them—it's one SQL statement that does very low-latency processing of this data and then indexes it back into the full-text search system.
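The customer's actual pipeline isn't shown here, but a minimal sketch of that shape—CDC stream in, light cleansing, search index out—might look something like the following, with hypothetical stream and column names:

```sql
-- Hypothetical example: candidates_cdc is a change stream captured from MySQL via
-- Debezium-style CDC; candidates_search is wired to an Elasticsearch sink connector.
INSERT INTO candidates_search
SELECT
  candidate_id,
  LOWER(TRIM(full_name))      AS full_name,     -- simple cleansing
  SPLIT_INDEX(email, '@', 1)  AS email_domain,  -- simple parsing
  years_of_experience,
  skills
FROM candidates_cdc
WHERE resume_text IS NOT NULL;
```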

If you extend that to things like caches, Redis and ElastiCache for example, then you actually start to see all this opportunity to be able to create what amounts to materialized views that are optimized for different workloads. Again, that's probably not super surprising these days, but the kinds of applications that you can build and how quickly you can build them—that's the thing that’s surprising.

Interesting Stream Processing Challenges

In your work of building this platform, working in this ecosystem, what are the most interesting, unexpected, or challenging lessons that you've learned?

As an engineer, you look at these open-source projects and they are so sophisticated, so powerful. Everybody in the data engineering space describes their tech stack as Flink plus Kafka plus Debezium—and the “plus” in that sentence, those two pluses, are doing a lot of work.

The thing that has been surprising is just how much time and effort we spend on the glue between these otherwise fantastic Lego bricks. Here's probably the dirty secret about Decodable: our differentiation is not that we're 3 milliseconds faster than the other guy, our differentiation is all the glue we’ve built to go in between those systems. The DevX, the deployment, the APIs, the orchestration, the observability—the observability alone around data pipelines in a streaming context! Data quality looks different in a streaming context, it functions differently in a streaming context, so those things are much harder than I would have expected.

It’s similar to the joke about drawing an owl, where first you draw two circles and then step two is to draw the rest of the owl. For a data movement and stream processing platform, step one is to deploy Flink, Kafka, and Debezium—step two is to build the rest of the platform. The amount of time and effort is daunting, especially given that's our full-time job here at Decodable. We spend all our time thinking about it, and if anything, it shows the need for data platform teams inside of organizations and the value they provide.

With that context, for people who are considering Decodable or considering streaming, what are the cases where Decodable, or stream processing in general, is the wrong choice?

People shouldn’t think about Decodable or stream processing as implying that they will be shutting down their data warehouse. The idea that companies will take their Airflow, dbt, and Snowflake and shoot them into the moon—that's just not what's going to happen, and we don't make that claim.

Stream processing, at its core, is something that typically sits between two other systems—some source and some destination—and processes data to make it available in that larger system. Stream processing is not meant to be a serving system—we purposely decouple the Decodable platform from Apache Pinot, Imply, Snowflake, Databricks, S3, or Redis. We believe you actually want those as two separate systems so you can cook the data and then serve it in the most appropriate storage and query engine that makes sense for a particular workload.

People shouldn’t be imagining they’re going to replace Snowflake with Decodable, or Databricks with Decodable, or Postgres with Decodable—that’s not how you should think about stream processing. If you're thinking about how to get data from Postgres through the outbox pattern to be available for microservices, or into HubSpot or Snowflake or S3, those are the places where Decodable and stream processing fit.
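To make the outbox case a bit more concrete, here is a minimal sketch under assumed names: outbox_cdc stands in for a change stream over a Postgres outbox table, and order_events for a stream attached to a Kafka (or SaaS) sink connector. None of these names come from the interview—the point is just the shape of the pipeline.

```sql
-- Hypothetical example: filter and forward outbox rows captured via CDC so that
-- downstream microservices (or HubSpot, Snowflake, S3) can consume clean events.
INSERT INTO order_events
SELECT
  aggregate_id,
  event_type,
  payload,      -- serialized event body written by the originating service
  created_at
FROM outbox_cdc
WHERE aggregate_type = 'order';
```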

What’s Coming Next from Decodable?

As you continue to build and iterate and scale the Decodable platform, what are the things you have planned for the near to medium term? Are there any particular projects you're excited to dig into?

There's a lot we can do. There are some areas where we will always be expanding, new connectors for example. We'll always be improving the SQL dialect, enriching the APIs to allow people more expressive capabilities. There are some higher order functions that we want to provide, more sophisticated management of event time, things that make it easier to build processing jobs.

Developer experience is always top of mind for us. It is not yet where we want it to be; there are certain things that are probably harder to do than they could be—think of operational flows like making a backward-incompatible schema change, which sometimes needs to happen. You should avoid it at all costs, but when it does need to happen, it's currently messier than we’d like it to be. We also want to make it easier to reprocess data for bootstrapping net-new things, like bringing up a new Elasticsearch instance and replaying the last 12 months of data to populate it. You can do all of those things today, but they could be easier.

Are there any other aspects of the work that you're doing at Decodable or the overall ecosystem of stream processing that you'd like to cover before we close out the show?

Honestly, this is one of the better overviews of Decodable and I really appreciate the questions. We probably hit all the critical stuff. I would say that a lot of people are still struggling to understand where stream processing fits into their stack, into the data platform, and which workloads should move to it—we're spending more and more time trying to help people understand that.

At its core, people can think about Decodable as providing low-latency ETL that can power both analytical systems and microservices—it’s not really necessary to think much harder about it than that. Operational-to-analytical and operational-to-operational data infrastructure is probably the mental model that people should have for stream processing. That might be one of the areas where we have some more work to do, and it's probably an evergreen topic to talk about. But otherwise it's been a real pleasure, and I really appreciate the opportunity to be on the show today.

As a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?

The largest portion of the problem is really one of developer experience. Anecdotal evidence for that is that every team at every Fortune 500 company has a company-specific data platform layer that glues a set of things together in a way that makes sense for that particular company, for that particular team. There aren’t yet well-developed standards the way we have on the application development side. That creates so much unnecessary complexity—things are different just for the sake of being different, and development efforts end up burning time and money, which hurts people. I don't know the solution, but I wish we could come up with better patterns and practices for stream processing and encode those in a way that vendors can then more tightly integrate with.


David Fabritius