Data movement is the most stubborn problem in infrastructure
Let’s start with the good news: the building blocks of a data platform have evolved in amazing ways, tackling increasingly specific and difficult problems in data.
Invariably, some or all of the data produced by these systems is required elsewhere. Online transactional data needs to be centralized in a data warehouse for AI modeling and analytics. User profile information needs to be moved to a cache for faster serving. Transaction data needs to be enriched with account information in real time for fraud detection. The connectivity and movement of data between systems is more critical than ever.
With each new system or use case – each new instance of data movement – we approach the problem by first traversing an enormous decision tree. A series of questions drives the selection of tech stack, technique (ELT vs. ETL), language (SQL vs. Java vs. Python), and team (engineering vs. data engineering vs. analytics engineering). Even the simplest of use cases might run through a long list of questions:
- In which systems does the data I need live? Where does it need to go?
- How does that data need to be queried?
- What are the latency characteristics?
- Can this data be updated or is it immutable?
- Do I need to do any transformation before it hits the target system?
- What is the schema and format of the source and destination?
- What kind of guarantees are required on this data?
- How should failures be handled?
The list goes on, and the process is similar to how we used to build and ship services before Kubernetes. There are dozens of different ways of doing the same thing, each with different tradeoffs, behavior, and robustness. The impacts of heterogeneity in data movement are not trivial, either. Most of our engineering time – some estimate as much as 80% – is spent building and maintaining infrastructure between these systems.
Tackling data movement is the most effective way to ship applications, services, and features faster
We need the equivalent of Kubernetes for data movement. A unified approach to these necessary but mundane concerns would allow us to do what’s really important: ship faster, ship cheaper, and stay compliant. The current way we approach data movement operates against those interests.
This is not to say the data movement problem has gone unaddressed. Quite the opposite – we have plenty of existing options: ETL, ELT, event streaming, stream processing, change data capture (CDC), and often some combination of these. None of these is inherently bad – there’s just always a tradeoff:
- Batch systems are reliable – BUT don’t support online use cases like keeping caches and search indexes up to date.
- ETL systems are capable – BUT aren’t quite as powerful as a data warehouse at transformation and have fallen out of favor.
- ELT systems are simple – BUT require target systems to perform all transformations, which is not always possible.
- Event streaming is powerful – BUT requires additional systems and specialized resources to maintain.
- CDC systems are efficient – BUT struggle with connectivity, transformation, and supporting infrastructure.
- Point-to-point is turnkey – BUT requires sourcing the same data multiple times and reimplementing business logic.
Each approach addresses a different part of the process or tech stack, makes different tradeoffs, comes with different features, and ultimately solves just one piece of the puzzle.
We need to use a diverse set of systems and databases to build sophisticated analytics and applications. However, applying that same pick-and-choose approach to data movement strains source systems with multiple scans, complicates compliance, and burdens operations with confusion about what is in use and when.
The data stack needs to be heterogeneous - but data movement across the stack should be unified
It doesn’t need to be this hard. Rather than traversing the decision tree, assembling a tech stack, and building a custom system for each pipeline, we should be able to manage data movement with a unified approach.
Unifying data movement does not mean ‘offer an umbrella of tools under one invoice.’ We should expect a unified solution to simplify architectures, improve performance, decrease latencies, and manage costs. To achieve these things, a unified data movement platform needs to be opinionated about a few things:
Real-time is the default. Support both online and offline data movement use cases in a unified manner. This means support for low-latency data (measured in seconds, not minutes) by default, rather than as a specialized configuration. This also does not mean that all data MUST arrive at its target system with streaming-level latency – just that batch and streaming workloads are both available through the same method and with the same ease of implementation.
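To make that concrete, here is a minimal sketch using the Python Table API of open-source Apache Flink – one engine commonly used for this kind of work, chosen purely for illustration, not as a statement about any particular product. The connector, path, and schema are made up; the point is that the same declarative pipeline definition runs as a stream or as a batch, depending only on how the environment is created.

```python
# Illustrative sketch (PyFlink): one pipeline definition, two execution modes.
# The filesystem path, schema, and 'print' sink are placeholders, not a real setup.
from pyflink.table import EnvironmentSettings, TableEnvironment

def build_pipeline(t_env: TableEnvironment):
    # The same source, sink, and SQL are declared regardless of execution mode.
    t_env.execute_sql("""
        CREATE TABLE orders (
            order_id STRING,
            amount   DECIMAL(10, 2)
        ) WITH (
            'connector' = 'filesystem',
            'path'      = '/data/orders',
            'format'    = 'json'
        )
    """)
    t_env.execute_sql("""
        CREATE TABLE big_orders (
            order_id STRING,
            amount   DECIMAL(10, 2)
        ) WITH ('connector' = 'print')
    """)
    t_env.execute_sql(
        "INSERT INTO big_orders SELECT order_id, amount FROM orders WHERE amount > 100"
    )

# Streaming by default...
build_pipeline(TableEnvironment.create(EnvironmentSettings.in_streaming_mode()))
# ...and batch when latency doesn't matter, without rewriting the pipeline.
build_pipeline(TableEnvironment.create(EnvironmentSettings.in_batch_mode()))
```

The specific engine isn’t the point – the point is that latency becomes an execution detail rather than a fork in the architecture.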
Connectivity should be 1:n. Source data once, and make it available to any number of downstream use cases and systems. This includes robust, tech-specific connectors that bridge specific source and sink systems to a unified streaming record format. This also includes the automatic selection of the right tech for each source and sink (e.g. CDC for RDBMSs or Snowpipe Streaming for cost-effective, low-latency ingest), as well as native support for event streaming platforms like Kafka, Kinesis, and Redpanda.
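To sketch what 1:n connectivity can look like in practice – again using open-source Flink’s Python Table API purely as an illustration – a single change data capture source can feed a search index and an analytics database from one job. Every hostname, credential, topic, and table name below is a placeholder.

```python
# Illustrative sketch (PyFlink): source a table once via CDC, fan out to two sinks.
# Hostnames, credentials, and table names are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source the operational table exactly once, as a stream of changes.
t_env.execute_sql("""
    CREATE TABLE customers_cdc (
        customer_id INT,
        name        STRING,
        email       STRING,
        PRIMARY KEY (customer_id) NOT ENFORCED
    ) WITH (
        'connector'     = 'mysql-cdc',
        'hostname'      = 'mysql.internal',
        'port'          = '3306',
        'username'      = 'replicator',
        'password'      = '******',
        'database-name' = 'shop',
        'table-name'    = 'customers'
    )
""")

# Sink 1: keep a search index continuously up to date for online serving.
t_env.execute_sql("""
    CREATE TABLE customers_search (
        customer_id INT,
        name        STRING,
        email       STRING,
        PRIMARY KEY (customer_id) NOT ENFORCED
    ) WITH (
        'connector' = 'elasticsearch-7',
        'hosts'     = 'http://search.internal:9200',
        'index'     = 'customers'
    )
""")

# Sink 2: land the same change stream in an analytics database.
t_env.execute_sql("""
    CREATE TABLE customers_analytics (
        customer_id INT,
        name        STRING,
        email       STRING,
        PRIMARY KEY (customer_id) NOT ENFORCED
    ) WITH (
        'connector'  = 'jdbc',
        'url'        = 'jdbc:postgresql://warehouse.internal:5432/analytics',
        'table-name' = 'customers'
    )
""")

# One statement set runs as a single job serving both destinations.
stmt_set = t_env.create_statement_set()
stmt_set.add_insert_sql("INSERT INTO customers_search    SELECT * FROM customers_cdc")
stmt_set.add_insert_sql("INSERT INTO customers_analytics SELECT * FROM customers_cdc")
stmt_set.execute()
```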
ETL > ELT. A data movement platform should have the ability to transform data when and where it’s needed – and provide zero additional cost or latency when it’s not. Data warehousing is critically important for ad hoc and exploratory analysis, but recurring workloads and analytics can and should be moved upstream. This not only saves teams on the cost of compute – it is essential for target systems like caches and search indices that do not have internal transformation capabilities. We’ve danced around the point here with overly specific terms like E(T)L, ETLT, and Reverse-ETL. Let’s instead admit that ETL should be the dominant pattern for data movement, even when the T is silent.
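As a concrete (and again purely illustrative) example in the same Flink Python API, a recurring hourly aggregation can run in flight, so the target system – modeled here as a keyed topic standing in for a cache or index with no transformation engine of its own – only ever receives the shaped result. The brokers, topics, and schema are assumptions.

```python
# Illustrative sketch (PyFlink): the "T" happens in flight, before the data lands.
# Brokers, topics, and the schema are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Raw, high-volume events from an event streaming platform.
t_env.execute_sql("""
    CREATE TABLE page_views (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic'     = 'page_views',
        'properties.bootstrap.servers' = 'kafka.internal:9092',
        'properties.group.id'          = 'upstream-etl-demo',
        'scan.startup.mode'            = 'latest-offset',
        'format'    = 'json'
    )
""")

# A keyed sink with no transformation engine of its own: it can only receive
# results that have already been shaped upstream.
t_env.execute_sql("""
    CREATE TABLE hourly_view_counts (
        user_id      STRING,
        window_start TIMESTAMP(3),
        view_count   BIGINT,
        PRIMARY KEY (user_id, window_start) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic'     = 'hourly_view_counts',
        'properties.bootstrap.servers' = 'kafka.internal:9092',
        'key.format'   = 'json',
        'value.format' = 'json'
    )
""")

# The recurring aggregation runs continuously, upstream of the target system.
t_env.execute_sql("""
    INSERT INTO hourly_view_counts
    SELECT user_id,
           TUMBLE_START(ts, INTERVAL '1' HOUR) AS window_start,
           COUNT(*) AS view_count
    FROM page_views
    GROUP BY user_id, TUMBLE(ts, INTERVAL '1' HOUR)
""")
```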
Data movement without stream processing is incomplete. It’s not enough to move data from point to point with low latency – or to manage transformations as batches. Teams should have access to the full power of flexible, stateful stream processing, with support for real-time transformation, filtering, routing, aggregation, pattern matching, rules systems, triggers and alerting, and enrichment.
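Returning to the fraud example from earlier, here is a hedged sketch of what that looks like as a continuously running, stateful job: a join between a transaction stream and an account changelog, with a trivial filter standing in for real fraud rules. It uses the same open-source Flink Python API for illustration; the connectors, fields, and threshold are invented for the example.

```python
# Illustrative sketch (PyFlink): stateful enrichment of a transaction stream with
# account data, plus a trivial rule as a stand-in for real fraud detection logic.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# The transaction stream to be enriched.
t_env.execute_sql("""
    CREATE TABLE transactions (
        account_id STRING,
        amount     DECIMAL(12, 2),
        ts         TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic'     = 'transactions',
        'properties.bootstrap.servers' = 'kafka.internal:9092',
        'properties.group.id'          = 'fraud-demo',
        'scan.startup.mode'            = 'latest-offset',
        'format'    = 'json'
    )
""")

# Account records as a changelog; the join keeps the latest value per account.
t_env.execute_sql("""
    CREATE TABLE accounts (
        account_id STRING,
        risk_tier  STRING,
        country    STRING,
        PRIMARY KEY (account_id) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic'     = 'accounts',
        'properties.bootstrap.servers' = 'kafka.internal:9092',
        'key.format'   = 'json',
        'value.format' = 'json'
    )
""")

# Alerts go to a placeholder sink; in practice this could be a topic or a pager.
t_env.execute_sql("""
    CREATE TABLE fraud_alerts (
        account_id STRING,
        amount     DECIMAL(12, 2),
        risk_tier  STRING,
        ts         TIMESTAMP(3)
    ) WITH ('connector' = 'print')
""")

# Enrichment join + rule, evaluated continuously rather than in nightly batches.
t_env.execute_sql("""
    INSERT INTO fraud_alerts
    SELECT t.account_id, t.amount, a.risk_tier, t.ts
    FROM transactions AS t
    JOIN accounts AS a ON t.account_id = a.account_id
    WHERE a.risk_tier = 'high' AND t.amount > 10000
""")
```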
Unified means Unified. Simply put, this is not a problem that is solved by a data cloud or a bolt-on acquisition. We should be able to scale up from the simplest ELT-style pipelines to rich, real-time stream processing without switching to different infrastructure. One system should support both simple declarative transformations in SQL and sophisticated use cases where general purpose languages and library ecosystems are necessary. And that data movement system should be agnostic to the source and destination formats, with the flexibility to move all data without bias.
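One way to picture “declarative where possible, general-purpose where necessary” is a Python function registered as a UDF and called from SQL in the same pipeline – sketched once more with open-source Flink’s Python API. The masking logic and table definitions are invented for the example.

```python
# Illustrative sketch (PyFlink): declarative SQL calling into arbitrary Python.
# The masking function and table definitions are made up for the example.
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

@udf(result_type=DataTypes.STRING())
def mask_email(email: str) -> str:
    # Arbitrary general-purpose code: regexes, libraries, shared business logic.
    if email is None or "@" not in email:
        return email
    local, domain = email.split("@", 1)
    return local[0] + "***@" + domain

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.create_temporary_function("mask_email", mask_email)

# A generated source and a console sink keep the sketch self-contained.
t_env.execute_sql("""
    CREATE TABLE users (
        user_id STRING,
        email   STRING
    ) WITH ('connector' = 'datagen')
""")
t_env.execute_sql("""
    CREATE TABLE masked_users (
        user_id STRING,
        email   STRING
    ) WITH ('connector' = 'print')
""")

# Simple cases stay in SQL; the hard parts call into the registered function.
t_env.execute_sql(
    "INSERT INTO masked_users SELECT user_id, mask_email(email) FROM users"
)
```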
This vision of a data movement platform starts with a sophisticated core and dials back as needed – not the other way around. It doesn’t require you to rethink your entire data platform. It’s a viewpoint that lets us incrementally tap into more advanced functionality when needed – or use turnkey methods for turnkey problems. Our most stubborn challenge deserves a pragmatic solution.
All your data – when, where, and how you need it
Decodable is the Pragmatic Data Movement Company. It’s admittedly a little gimmicky to capitalize those words, but it’s something we believe at our core. Our mission is to help teams get the data to where it needs to be, in the right format, with as little or as much processing as needed.
Decodable is real-time. All processing and movement is based on a real-time streaming core, which makes it easy to support both online and offline use cases. Connectors for offline systems can group data into batches for efficiency where latency is less important.
Decodable is both ETL and ELT. You shouldn’t have to choose a tech stack based on whether or not you need to process data. Optional transformation between sources and sinks means you can filter or aggregate high-volume data, or even offload processing of obvious transformations from downstream systems. When you’re not processing data, you’re not paying a cost for it.
Decodable is efficient and cost-effective. Decodable is designed to operate efficiently on its own – for example, reducing load on source systems through continuous processing, and scaling resources per job to avoid underutilized clusters. Other savings are unlocked as teams use specific capabilities within Decodable – for example, routing high-value data to high-performance data warehouses and the rest to cost-effective object stores in open table formats.
We’ve made a wide range of other investments in our data movement platform as well.
Expect more from your data movement platform.
Data movement is too hard. Teams spend far too much of their time fighting to get the data they need to do their jobs. We’re ending the status quo of long development cycles for out-of-date, low-quality data from brittle data pipelines built across multiple platforms. It’s time to simplify.
We built Decodable to absorb the complexity inherent in data movement, and to work the way you expect. We want to make it easy to have high data quality, remain compliant, and keep pipelines running. We’d love to provide you with a truly great developer experience, or at least one you hate the least. If you’re building a data platform to get data from one place to another, with or without transformation; fighting out-of-date data; or building jobs to process data in real time – let’s talk today.