April 3, 2024
5 min read

The Pragmatic Approach to Data Movement

By Eric Sammer

Data movement is the most stubborn problem in infrastructure

Let’s start with the good news: the building blocks of a data platform have evolved in amazing ways, tackling increasingly specific and difficult problems in data.

Invariably, some or all of the data produced by these systems is required elsewhere. Online transactional data needs to be centralized in a data warehouse for AI modeling and analytics. User profile information needs to be moved to a cache for faster serving. Transaction data needs to be enriched with account information in real-time for fraud detection. The connectivity and movement of data between systems is more critical than ever.

With each new system or use case – each new instance of data movement – we approach the problem by first traversing an enormous decision tree. A series of questions drives the selection of tech stack, technique (ELT vs. ETL), language (SQL vs. Java vs. Python), and team (engineering vs. data engineering vs. analytics engineering). Even the simplest of use cases might run through a long list of questions:

  • In which systems does the data I need live? Where does it need to go?
  • How does that data need to be queried?
  • What are the latency characteristics?
  • Can this data be updated or is it immutable?
  • Do I need to do any transformation before it hits the target system?
  • What is the schema and format of the source and destination?
  • What kind of guarantees are required on this data?
  • How should failures be handled?

The list goes on, and the process is similar to how we used to build and ship services before Kubernetes: dozens of different ways of doing the same thing, each with different tradeoffs, behavior, and robustness. The impacts of heterogeneity in data movement are not trivial, either. Most of our engineering time – some estimates put it as high as 80% – is spent building and maintaining the infrastructure between these systems.

Tackling data movement is the most effective way to ship applications, services, and features faster

We need the equivalent of Kubernetes for data movement. A unified approach to these necessary but mundane concerns would allow us to do what’s really important: ship faster, ship cheaper, and stay compliant. The current way we approach data movement operates against those interests.

This is not to say the data movement problem has gone unaddressed. Quite the opposite – there are lots of different ways to look at our existing options: ETL, ELT, event streaming, stream processing, change data capture (CDC), and often some combination of these. None of these things are inherently bad - there’s just always a tradeoff:

  • Batch systems are reliable – BUT don’t support online use cases like keeping caches and search indexes up to date.
  • ETL systems are capable – BUT aren’t quite as powerful as a data warehouse at transformation and have fallen out of favor.
  • ELT systems are simple – BUT require target systems to perform all transformations, which is not always possible.
  • Event streaming is powerful – BUT requires additional systems and specialized resources to maintain.
  • CDC systems are efficient – BUT struggle with connectivity, transformation, and supporting infrastructure.
  • Point-to-point is turnkey – BUT requires sourcing the same data multiple times and reimplementing business logic.

Each approach addresses a different part of the process or tech stack, makes different tradeoffs, comes with different features, and ultimately solves just one piece of the puzzle.

We need to use a diverse set of systems and databases to build sophisticated analytics and applications. However, applying that same pick-and-choose approach to data movement strains source systems with repeated scans, complicates compliance, and burdens operations with confusion about what is in use and when.

The data stack needs to be heterogeneous - but data movement across the stack should be unified

It doesn’t need to be this hard. Rather than go through the decision tree, assemble a tech stack, and build a custom system for each pipeline, it should be possible to manage data movement with a unified approach.

Unifying data movement does not mean ‘offer an umbrella of tools under one invoice.’ We should expect a unified solution to simplify architectures, improve performance, decrease latencies, and manage costs. To achieve these things, a unified data movement platform needs to be opinionated about a few things:

Real-time is the default. Support both online and offline data movement use cases in a unified manner. This means support for low-latency data (measured in seconds, not minutes) by default, rather than as a specialized configuration. This also does not mean that all data MUST arrive at its target system with streaming-level latency – just that batch and streaming workloads are both available through the same method and with the same ease of implementation.

Connectivity should be 1:n. Source data once, and make it available to any number of downstream use cases and systems. This includes robust, tech-specific connectors that bridge specific source and sink systems to a unified streaming record format. It also includes automatic selection of the right technology for each source and sink (e.g. CDC for RDBMSs, or Snowpipe Streaming for cost-effective, low-latency ingest), as well as native support for event streaming platforms like Kafka, Kinesis, and Redpanda.
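
To make 1:n concrete, here’s a minimal sketch of the fan-out pattern using plain open-source Flink SQL driven from PyFlink – not Decodable’s own configuration or connectors. The table names are invented, and the datagen/print connectors stand in for real source and sink systems.

```python
# Illustrative only: open-source Flink SQL via PyFlink, not Decodable's API.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical source: a stream of user profile changes, sourced exactly once.
# 'datagen' stands in for a real CDC or Kafka connector.
t_env.execute_sql("""
    CREATE TABLE user_profiles (
        user_id BIGINT,
        email   STRING,
        region  STRING
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
""")

# Two hypothetical destinations; 'print' stands in for a cache, search index,
# or warehouse sink.
t_env.execute_sql(
    "CREATE TABLE cache_sink (user_id BIGINT, email STRING) WITH ('connector' = 'print')"
)
t_env.execute_sql(
    "CREATE TABLE warehouse_sink (user_id BIGINT, email STRING, region STRING) "
    "WITH ('connector' = 'print')"
)

# One read of the source feeds both destinations in a single job.
stmt_set = t_env.create_statement_set()
stmt_set.add_insert_sql("INSERT INTO cache_sink SELECT user_id, email FROM user_profiles")
stmt_set.add_insert_sql("INSERT INTO warehouse_sink SELECT * FROM user_profiles")
stmt_set.execute().wait()  # blocks while the streaming job runs
```

The point is the shape of the pipeline: the source is defined and read once, and every additional destination is just another insert against the same stream.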

ETL > ELT. A data movement platform should have the ability to transform data when and where it’s needed – and add zero cost or latency when it’s not. Data warehousing is critically important for ad hoc and exploratory analysis, but recurring workloads and analytics can and should be moved upstream. This not only saves teams compute costs – it is essential for target systems like caches and search indices that have no internal transformation capabilities. We’ve danced around the point here with overly specific terms like E(T)L, ETLT, and Reverse-ETL. Let’s instead admit that ETL should be the dominant pattern for data movement, even when the T is silent.
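
As a hedged illustration of moving the T upstream, the sketch below (again plain Flink SQL via PyFlink, with made-up table and column names) filters and projects records in flight, so a target with no transformation engine of its own – a cache or a search index – receives data that is ready to serve.

```python
# Illustrative only: a transform applied in flight, before data lands anywhere.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical raw source; 'datagen' stands in for the real system.
t_env.execute_sql("""
    CREATE TABLE raw_transactions (
        txn_id     BIGINT,
        account_id BIGINT,
        amount     DOUBLE,
        status     STRING
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
""")

# Hypothetical target with no transformation engine of its own (think cache
# or search index); 'print' stands in for it here.
t_env.execute_sql("""
    CREATE TABLE settled_transactions (
        txn_id     BIGINT,
        account_id BIGINT,
        amount     DOUBLE
    ) WITH ('connector' = 'print')
""")

# Filter and project before landing: the target only ever sees settled rows
# and the columns it actually needs.
t_env.execute_sql("""
    INSERT INTO settled_transactions
    SELECT txn_id, account_id, amount
    FROM raw_transactions
    WHERE status = 'SETTLED'
""").wait()  # blocks while the streaming job runs
```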

Data movement without stream processing is incomplete. It’s not enough to move data from point to point with low latency – or to manage transformations as batches. Teams should have access to the full power of flexible, stateful stream processing, with support for real-time transformation, filtering, routing, aggregation, pattern matching, rules systems, triggers and alerting, and enrichment.
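
For a flavor of what stateful processing adds on top of simple movement, here’s an illustrative one-minute tumbling-window aggregation in Flink SQL via PyFlink; the payments stream and its fields are invented for the example.

```python
# Illustrative only: stateful, event-time windowed aggregation in Flink SQL.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical payment events with an event-time column and watermark;
# 'datagen' stands in for a real event stream.
t_env.execute_sql("""
    CREATE TABLE payments (
        account_id BIGINT,
        amount     DOUBLE,
        ts         TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
""")

# Per-account totals over one-minute windows. The engine maintains the window
# state, tolerates out-of-order events up to the watermark, and emits a result
# as each window closes.
t_env.execute_sql("""
    SELECT
        account_id,
        TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
        SUM(amount)                           AS total_amount
    FROM payments
    GROUP BY account_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""").print()
```

The same engine and state management also underpin enrichment joins, pattern matching, and alerting – none of which a plain point-to-point pipeline can express.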

Unified means Unified. Simply put, this is not a problem that is solved by a data cloud or a bolt-on acquisition. We should be able to scale up from the simplest ELT-style pipelines to rich, real-time stream processing without switching to different infrastructure. One system should support both simple declarative transformations in SQL and sophisticated use cases where general purpose languages and library ecosystems are necessary. And that data movement system should be agnostic to the source and destination formats, with the flexibility to move all data without bias.

This vision for a data movement platform is one that starts with a sophisticated core and dials back as needed - not the other way around. It doesn’t require you to rethink your entire data platform. It’s a viewpoint that lets us incrementally tap into more advanced functionality when needed – or use turnkey methods for turnkey problems. Our most stubborn challenge deserves a pragmatic solution.

All your data – when, where, and how you need it

Decodable is the Pragmatic Data Movement Company. It’s admittedly a little gimmicky to capitalize those words, but it’s something we believe at our core. Our mission is to help teams get data to where it needs to be, in the right format, with as little or as much processing as needed.

Decodable is real-time. All processing and movement is based on a real-time streaming core, which makes it easy to support both online and offline use cases. Connectors for offline systems can group data into batches for efficiency where latency is less important.

Decodable is both ETL and ELT. You shouldn’t have to choose a tech stack based on whether or not you need to process data. Optional transformation between sources and sinks means you can filter or aggregate high-volume data, or offload obvious transformations from target systems. When you’re not processing data, you’re not paying a cost for it.

Decodable is efficient and cost-effective. Decodable is designed to operate efficiently out of the box: continuous processing reduces load on source systems, and per-job resource scaling avoids underutilized clusters. Other savings are unlocked as teams adopt specific capabilities within Decodable – for example, routing high-value data to high-performance data warehouses and the rest to cost-effective object stores in open table formats.

We’ve made a wide range of other investments in our data movement platform as well:

Broad, Adaptive Connectivity
A catalog of fully supported connectors – including OLTP databases, event streaming platforms, data lakes, data warehouses, caches, search systems, and other analytical database systems.
Decodable connectors adapt to their source and destination systems and will perform the correct operations based on the stream to which they’re connected.
Fully-Configured Stream Processing
You decide a job’s maximum task size and concurrency, and that’s it. Get out of the business of tuning and optimizing connectors and processing. Apache Flink users will never again have to worry about checkpoint intervals, network buffers, or RocksDB tuning. Scale out to handle any load with no clusters to provision or manage.
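
For contrast, here’s a rough sketch of the kind of hand-tuning a self-managed Apache Flink job tends to accumulate (shown with PyFlink; the values are arbitrary examples, not recommendations) – exactly the work a fully configured platform is meant to take off your plate.

```python
# Illustrative only: the sort of low-level tuning a self-managed Flink job
# carries around. Values below are arbitrary examples, not recommendations.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint interval: set too low it burns I/O, set too high recovery lags.
env.enable_checkpointing(60_000)  # milliseconds

# Parallelism sized by hand against today's traffic, revisited when it changes.
env.set_parallelism(8)

# ...plus cluster-level knobs that typically live in flink-conf.yaml, such as
# network buffer memory and RocksDB state-backend settings, all of which have
# to be re-tuned as state size and throughput evolve.
```
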
1:N Publish / Subscribe
Any source of data can be consumed by any number of sinks or processing jobs. You don’t need to pull the same data for each target system. Not only does this reduce load on sources, it also drives down resource consumption and cost.
Fully Managed, or Bring Your Own Cloud
Get up and running fast with a fully managed account, or run the Decodable Data Plane entirely within your cloud account so your data stays private to you. BYOC also allows you to take advantage of cloud provider commitments and discounts.
Multiple Processing Languages
You can write stateful processing logic in the same SQL you know today, or drop down into the Apache Flink APIs for sophisticated processing in Java or Python. You can even bring your existing Flink jobs if you have them. Transform, filter, aggregate, join and enrich, pattern match, mask and anonymize, route, and interact with external APIs and services. Or don’t. It’s your choice.
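
As an example of dropping below SQL when you need to, here’s a hedged sketch of a Python user-defined function in open-source PyFlink that masks an email field before it moves downstream. The function, table, and column names are made up, and Decodable’s packaging of custom code may differ.

```python
# Illustrative only: a PyFlink scalar UDF called from SQL to mask a field.
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

@udf(result_type=DataTypes.STRING())
def mask_email(email: str) -> str:
    # Keep the domain, hide the local part: "alice@example.com" -> "***@example.com"
    if email is None or "@" not in email:
        return "***"
    return "***@" + email.split("@", 1)[1]

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.create_temporary_function("mask_email", mask_email)

# Hypothetical source table; 'datagen' stands in for the real system.
t_env.execute_sql("""
    CREATE TABLE users (
        user_id BIGINT,
        email   STRING
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
""")

# Custom Python logic plugs straight into the declarative SQL pipeline.
t_env.execute_sql(
    "SELECT user_id, mask_email(email) AS masked_email FROM users"
).print()
```
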
Globally Deployable
Run Decodable in the cloud regions where your systems and data live, and stream data between regions when you need to.
Exactly-once and Fault Tolerant
All data movement and processing is exactly-once by default. That means no duplicate or missing records. If a failure occurs, the platform picks right back up where it left off. All data is replicated across multiple availability zones while in transit, and processing state is backed by trusted cloud object stores. The pub/sub nature of the platform means streaming data to multiple regions is simple.
Secure and Compliant
Decodable is SOC 2 Type 2 and GDPR compliant. The platform is regularly pen tested and audited by independent third parties. BYOC deployment allows for zero exposure of your data, credentials, or keys to us while still benefiting from the full platform. Single sign-on with MFA and fine-grained role-based access control are supported. User-defined code is fully isolated from platform code.
Built for Developers
With a UI, CLI, and APIs for everything, you can adapt Decodable to how you work.
Open Core
Decodable is built on Apache Flink, Apache Kafka, and Debezium. If we don’t deliver value, you can run your jobs elsewhere.

Expect more from your data movement platform.

Data movement is too hard. Teams spend far too much of their time fighting to get the data they need to do their jobs. We’re ending the status quo of long development cycles that deliver out-of-date, low-quality data from brittle pipelines built across multiple platforms. It’s time to simplify.

We built Decodable to absorb the complexity inherent in data movement, and to work the way you expect. We want to make it easy to maintain high data quality, remain compliant, and keep pipelines running. We’d love to provide you with a truly great developer experience, or at least one you hate the least. If you’re building a data platform to get data from one place to another, with or without transformation; fighting out-of-date data; or building jobs to process data in real time – let’s talk today.

Eric Sammer

Eric Sammer is a data analytics industry veteran who has started two companies: Rocana (acquired by Splunk in 2017) and Decodable. He is an author, engineer, and leader on a mission to help companies move and transform data to achieve new and useful business results. Eric speaks on topics including data engineering, ML/AI, real-time data processing, entrepreneurship, and open source. He has spoken at events including the RTA Summit and Current, appeared on podcasts with Software Engineering Daily and Sam Ramji, and has been featured in various industry publications.
