This article originally appeared at DZone.
It has become clear that Apache Flink is the core of the modern stream processing platform. While Flink is necessary, it is not sufficient on its own: reliable storage and transport of real-time data, and connectivity to the systems that produce and consume it, remain open issues for platform engineers to resolve. To that end, it’s frequently paired with an event streaming platform such as Apache Kafka, as well as a myriad of connectivity technologies, including Debezium for change data capture (CDC). Together, these cover the foundations of ETL, ELT, and stream processing use cases. They are the right building blocks to power both operational and analytical data infrastructure, and notably bridge the two. They power everything from inventory tracking and management, to product recommendations, to food delivery. But they are not, by themselves, the full picture.
Flink, Kafka, and Debezium: The Tip of the Iceberg
While Flink, Kafka, and Debezium are the major building blocks, the iceberg runs deep. They do not, on their own, form a complete real-time data platform. The last 20% of the platform often winds up taking 80% of the time. Here are some of the things to consider.
Resource management and discovery
A centralized and unified set of APIs to manage all of the connection and job definitions and lifecycle operations, as well as analyze lineage and dependency information, is table stakes. This API is also how developers and tools control the platform. Connections and jobs will likely run on Kubernetes, so you’ll have to pick a Flink deployment mode, build images, set up the Flink Kubernetes operator, establish resource quotas, configure auto-scaling of the cluster, and so on. For many organizations, multi-tenancy and multi-region support are necessary to serve teams and use cases across the company.
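As a concrete illustration, here is a minimal sketch of registering a job definition against such a control-plane API from Java. The endpoint, payload shape, token variable, and job fields are all hypothetical; the point is that every connection and job definition flows through one audited API rather than ad hoc scripts.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJob {
    public static void main(String[] args) throws Exception {
        // Hypothetical job definition: name, image, parallelism, and the
        // connections it reads from and writes to. The shape is illustrative only.
        String jobDefinition = """
            {
              "name": "orders-enrichment",
              "image": "registry.example.com/flink-jobs/orders-enrichment:1.4.2",
              "parallelism": 4,
              "reads": ["kafka.orders"],
              "writes": ["kafka.orders_enriched"]
            }
            """;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                // Hypothetical control-plane endpoint; substitute your platform's API.
                .uri(URI.create("https://platform.example.com/api/v1/jobs"))
                .header("Content-Type", "application/json")
                // Token injected at runtime; variable name is illustrative.
                .header("Authorization", "Bearer " + System.getenv("PLATFORM_TOKEN"))
                .POST(HttpRequest.BodyPublishers.ofString(jobDefinition))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

A single API like this is also what makes lineage and dependency analysis possible, because every definition passes through one place.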
Schema management
A schema registry that captures the structure of data is a good start. You’ll also want to think about how to unify and translate type systems across different connected systems. Schema inference from connection and job output, and automatic propagation of type information, will save users tons of time and reduce errors. Validating schema and type compatibility and migrations early will prevent (ok, maybe just ease) major data quality challenges. Support for well-known schema definition languages like Avro IDL and JSON Schema makes integration with other systems simpler and safer. Finally, pick a serialization format (e.g., JSON, Avro, Protobuf) that the entire system will use for transport and processing, and have it automatically enforced at all data ingress points. Trust me.
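As a sketch of the kind of early validation worth automating, the snippet below uses Avro's Java library to check that a proposed schema version can still read data written with the current one. The schemas themselves are made up for illustration; in practice this check would run in CI or at registration time in your schema registry.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaValidationException;
import org.apache.avro.SchemaValidator;
import org.apache.avro.SchemaValidatorBuilder;

import java.util.List;

public class CompatibilityCheck {
    public static void main(String[] args) {
        // Separate parsers so the two versions of the same named record don't clash.
        // Current schema already registered for this subject (illustrative).
        Schema current = new Schema.Parser().parse("""
            {"type": "record", "name": "Order", "fields": [
              {"name": "id", "type": "string"},
              {"name": "amount", "type": "double"}
            ]}
            """);

        // Proposed new version: adds an optional field with a default,
        // which is a backward-compatible change.
        Schema proposed = new Schema.Parser().parse("""
            {"type": "record", "name": "Order", "fields": [
              {"name": "id", "type": "string"},
              {"name": "amount", "type": "double"},
              {"name": "currency", "type": "string", "default": "USD"}
            ]}
            """);

        // "canRead" = the new (reader) schema must be able to read data written
        // with every existing (writer) schema, i.e. backward compatibility.
        SchemaValidator validator = new SchemaValidatorBuilder().canReadStrategy().validateAll();
        try {
            validator.validate(proposed, List.of(current));
            System.out.println("Proposed schema is backward compatible");
        } catch (SchemaValidationException e) {
            System.out.println("Incompatible schema change: " + e.getMessage());
        }
    }
}
```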
Connectivity
You’ll need to select and maintain connectors for the systems you care about. This often means dealing with version upgrades, quality management, performance tuning, building support for malformed data handling, and dealing with append-only versus CDC data. And maybe the occasional bug fix; it’s not uncommon to patch and upstream changes given how fast source and destination systems change.
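To make the connector work concrete, here is a minimal sketch of wiring the off-the-shelf Kafka connector (flink-connector-kafka's KafkaSource) into a Flink job. The broker addresses, topic, and group id are placeholders; a real deployment also needs deserialization error handling, security settings, and version pinning against your Flink runtime.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OrdersIngest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder broker, topic, and group values; supply your own via configuration.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka-1:9092,kafka-2:9092")
                .setTopics("orders")
                .setGroupId("orders-enrichment")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> orders =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-topic");

        orders.print(); // Replace with real processing and a sink.
        env.execute("orders-ingest");
    }
}
```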
Stream processing state management
Both connections and stream processing jobs need to maintain state. Sometimes this is as simple as keeping track of what data has been processed; other times it’s complex aggregation and join state. You’ll need to configure Flink (and other parts of the system, as appropriate) to use persistent storage, to handle version retention and selective restoration, and to provide durability guarantees. Be careful to control access to this state across different tenants and jobs, especially when operating on sensitive data.
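Below is a minimal sketch of the Flink configuration this implies: durable checkpoint storage, a state backend suited to large state, and retained checkpoints so jobs can be restored later. The object-store path is a placeholder, and these APIs correspond to recent Flink 1.x releases; check them against the version you run.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints, suited to large keyed state.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Exactly-once checkpoints every 60 seconds.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpoints = env.getCheckpointConfig();
        // Placeholder bucket; in practice a per-tenant, access-controlled location.
        checkpoints.setCheckpointStorage("s3://example-flink-state/checkpoints/orders-enrichment");
        // Keep checkpoints when a job is cancelled so it can be restored later.
        checkpoints.setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Minimal placeholder pipeline; replace with real sources, transformations, and sinks.
        env.fromElements("a", "b", "c").print();
        env.execute("stateful-job");
    }
}
```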
Developer tooling and observability
Whether you’re building for yourself or building a platform for internal customers, command-line, UI, CI/CD, dbt support, and other tooling determines how productive engineers will be. You’ll want to make sure the control plane is instrumented for auditing, the data plane is instrumented for health, quality, and performance, and all of that data is centrally available for compliance, debugging, and performance tuning.
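As one small example on the data-plane side, Flink jobs can expose custom metrics that your central observability stack collects alongside the built-in ones. The metric name and placeholder transformation below are illustrative.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Counts how many records this operator has seen; the counter surfaces through
// Flink's metric system and whatever reporter ships it to your backend.
public class CountingMap extends RichMapFunction<String, String> {
    private transient Counter recordsProcessed;

    @Override
    public void open(Configuration parameters) {
        recordsProcessed = getRuntimeContext()
                .getMetricGroup()
                .counter("recordsProcessed");
    }

    @Override
    public String map(String value) {
        recordsProcessed.inc();
        return value.toUpperCase(); // Placeholder transformation.
    }
}
```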
Security engineering
Core data infrastructure touches the most sensitive data that exists within the organization. API authentication and deep resource authorization, auditing, job runtime isolation, and secret management and integration are just the tip of an even bigger iceberg.
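One small but concrete habit worth building in early: never bake credentials into job code or images. The sketch below assembles Kafka client security properties from environment variables, which a secret manager or Kubernetes secret would populate; the variable names are illustrative.

```java
import java.util.Properties;

public class SecureKafkaProps {
    // Builds Kafka client properties with credentials injected at runtime
    // (e.g. from a Kubernetes secret or a vault agent), never hard-coded.
    public static Properties build() {
        String username = System.getenv("KAFKA_SASL_USERNAME"); // illustrative variable names
        String password = System.getenv("KAFKA_SASL_PASSWORD");

        Properties props = new Properties();
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"" + username + "\" password=\"" + password + "\";");
        return props;
    }
}
```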
And more
I’m out of space. Flink runtime versions, support for backfill and reprocessing, malformed data handling policies and enforcement, fine-grained resource access control, configuration tuning of jobs, establishing and maintaining processing guarantees, and dealing with scale should all be top of mind.
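To pick just one item from that list, malformed data handling in Flink is often implemented with side outputs: records that fail parsing are routed to a dead-letter stream instead of failing the job. A minimal sketch, with placeholder parse logic:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class DeadLetterExample {
    // Records that cannot be parsed are emitted to this side output instead of
    // failing the job; a downstream sink can write them to a dead-letter topic.
    static final OutputTag<String> MALFORMED = new OutputTag<String>("malformed") {};

    public static SingleOutputStreamOperator<Long> parseOrRoute(DataStream<String> raw) {
        return raw.process(new ProcessFunction<String, Long>() {
            @Override
            public void processElement(String value, Context ctx, Collector<Long> out) {
                try {
                    out.collect(Long.parseLong(value.trim())); // placeholder parse logic
                } catch (NumberFormatException e) {
                    ctx.output(MALFORMED, value);
                }
            }
        });
    }
    // Usage: parseOrRoute(stream).getSideOutput(MALFORMED) yields the dead-letter stream.
}
```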
Breaking the Ice for Your Real-time Stream Processing Platform
Teams need a complete platform to successfully run stream processing workloads in production. This is even more true when you’re the (data) platform engineering team building that platform for others. Hopefully these pointers prove helpful in mapping out your real-time data platform. For more information about stream processing, data platforms, Apache Flink, and Debezium, visit our product page.
Additional Resources
- Ready to connect to a data stream and create a pipeline? Start free
- Take a guided tour with our Quickstart Guide