Decodable’s platform uses Apache Flink to run our customers’ real-time data processing jobs. In this post, I will explain how we manage the underlying Flink deployments at Decodable in a multi-tenant environment. To see where Apache Flink fits into the overall Decodable platform, check out the Platform Overview blog post, which outlines our high-level platform architecture.
Flink has many traits that we love for real-time data processing. It is a distributed processing engine designed for stateful computations over unbounded streams of data. It is tremendously scalable and is used by Alibaba, Netflix, Uber, Lyft, and a host of other companies of all sizes.
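To make "stateful computations over unbounded data" a little more concrete, here is a minimal Flink job - a generic illustration, not anything from Decodable's codebase - that keeps a running per-word count over a never-ending stream of text lines:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Unbounded input: lines arriving on a local socket (e.g. started with `nc -lk 9999`).
        env.socketTextStream("localhost", 9999)
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.split("\\s+")) {
                    out.collect(Tuple2.of(word, 1));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            // Keyed, stateful aggregation: Flink keeps the running count per word in
            // managed state, which is also what checkpointing and recovery build on.
            .keyBy(value -> value.f0)
            .sum(1)
            .print();

        env.execute("streaming-word-count");
    }
}
```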
Decodable offers real-time data processing as a self-service platform, and - like every multi-tenant SaaS service - we must provide strict tenant isolation, uptime guarantees, and performance SLAs. Let’s break down our requirements and see how we configured and deployed Flink to meet them.
Requirement: Low Cost to Create a Free Account
Our goal is to minimize both the time a new user has to wait for their account to be ready and the cost of the resources behind it. SaaS platform users typically experience delays when creating an initial account because the infrastructure dedicated to that account has to be provisioned first; it can easily take longer than 10 minutes to create a new user account.
As a user, you never want to hear that your account will be deleted because the provider's internal costs for free trial accounts are too high. At Decodable, we believe that our customers will experience the value of our platform and many will want to expand to the pay-as-you-go tier as their usage increases. Obviously, we incur costs from the free tier offering, and want to keep those costs to a minimum.
Solution: On-Demand Flink Resource Creation
Creating a Decodable account only takes seconds. This is largely because we don’t pre-create any Flink resources at account creation time. Flink resources are only created when a data processing job is triggered, either through a preview session or a running instance of a pipeline or connection. When a job is terminated, the resources are cleaned up immediately.
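In terms of Flink's native Kubernetes integration, the lifecycle looks roughly like the sketch below. The class, method, namespace layout, and container image are illustrative placeholders rather than Decodable's actual control plane; the point is that a dedicated cluster configuration is only assembled and submitted once a job is triggered.

```java
import org.apache.flink.configuration.Configuration;

// Illustrative sketch only: names and layout here are hypothetical, not Decodable's API.
public class OnDemandFlinkProvisioner {

    /** Built when a preview, pipeline, or connection job is triggered - not at sign-up. */
    public Configuration configurationFor(String accountId, String jobId) {
        Configuration conf = new Configuration();
        // One dedicated application cluster per job, deployed natively on Kubernetes.
        conf.setString("execution.target", "kubernetes-application");
        // Keep each tenant's Kubernetes resources in their own namespace (illustrative layout).
        conf.setString("kubernetes.namespace", "tenant-" + accountId);
        conf.setString("kubernetes.cluster-id", "job-" + jobId);
        conf.setString("kubernetes.container.image", "example.io/flink-job:latest"); // placeholder image
        return conf;
    }

    // The configuration is then submitted (equivalent to
    // `flink run-application --target kubernetes-application ...`), and the cluster's
    // Kubernetes resources are deleted as soon as the job terminates.
}
```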
Requirement: Strong Security and Isolation for Production Workloads
Once a data processing job is running in production, it should continue running without issues. This is a lot easier said than done! In a multi-tenant environment, we must prevent bad actors and mitigate noisy neighbors, as well as limit the blast radius of any unexpected outages.
Solution: Flink Application Clusters for Pipeline and Connection Jobs
Flink offers several deployment modes for running natively on Kubernetes. Given the requirements here, we use application mode to run our connection and pipeline jobs. Application mode runs one job per cluster, with a dedicated JobManager responsible solely for that job. This isolation avoids noisy neighbors and resource contention. In addition, we configured Kubernetes resource limits for each Flink pod to ensure CPU and memory are shared fairly. We’ve also configured high availability for Flink clusters so they recover automatically if a pod is erroneously killed or during an unexpected incident.
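As a rough sketch, and assuming a recent Flink version (the exact option names vary slightly across releases), the per-job configuration for isolation and high availability might look like this. The concrete values and storage path are illustrative, not our production settings:

```java
import org.apache.flink.configuration.Configuration;

public class ApplicationClusterSettings {

    /** Applied on top of the per-job configuration sketched above; all values are illustrative. */
    public static void applyIsolationAndHighAvailability(Configuration conf) {
        // Application mode: the cluster and its JobManager exist for exactly one job.
        conf.setString("execution.target", "kubernetes-application");

        // Flink's native Kubernetes integration derives each pod's resource
        // requests/limits from these values, capping CPU and memory per job.
        conf.setString("kubernetes.jobmanager.cpu", "1.0");
        conf.setString("kubernetes.taskmanager.cpu", "2.0");
        conf.setString("jobmanager.memory.process.size", "2g");
        conf.setString("taskmanager.memory.process.size", "4g");

        // Kubernetes-based high availability: JobManager metadata is persisted so the job
        // recovers automatically if a pod is killed. (Older Flink versions configure this
        // via the `high-availability` key and the Kubernetes HA services factory class.)
        conf.setString("high-availability.type", "kubernetes");
        conf.setString("high-availability.storageDir", "s3://example-bucket/flink-ha/"); // placeholder path
    }
}
```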
Requirement: Quick Response to Preview Jobs
Preview is a feature that lets a user test a pipeline by running its SQL statement against real production data. Preview is critical to the pipeline authoring experience because it helps users understand whether or not their pipeline will do what they expect on real data. In this iterative process of trying and testing, we want to minimize the cycle time.
Solution: Custom Flink Session Clusters for Preview Jobs
We deploy Flink session clusters to run our preview jobs. The main reason is that Flink session mode has a much lower start-up time because it uses a shared, pre-deployed JobManager. However, TaskManagers (the workers) are designed to be shared, which makes it hard to isolate jobs from different tenants. To ensure job isolation in session mode, we modified Flink so that each TaskManager runs only one job. Preview jobs are short-lived (no more than 2 minutes), and TaskManagers are killed when the job times out. To further reduce preview response time, we also modified Flink to pre-deploy some idle TaskManagers. When a TaskManager is claimed by a job, a new one is created immediately, so there is always a pool of idle TaskManagers.
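Since the one-job-per-TaskManager behavior and the pre-warmed TaskManager pool come from our own modifications to Flink, no stock configuration reproduces them exactly. As a rough approximation using only out-of-the-box options, a preview session cluster could be configured along these lines (values illustrative):

```java
import org.apache.flink.configuration.Configuration;

public class PreviewSessionSettings {

    /** Rough approximation with stock Flink options; Decodable's actual isolation relies on modified Flink. */
    public static Configuration previewSessionConfiguration() {
        Configuration conf = new Configuration();
        // Session mode: a shared, long-running JobManager, so preview jobs skip cluster start-up.
        conf.setString("execution.target", "kubernetes-session");
        conf.setString("kubernetes.cluster-id", "preview-session"); // illustrative name
        // A single slot per TaskManager, so a TaskManager executes tasks of only one job at a time.
        conf.setString("taskmanager.numberOfTaskSlots", "1");
        // Keep a small pool of spare TaskManagers around so new preview jobs can start immediately.
        conf.setString("slotmanager.redundant-taskmanager-num", "2");
        return conf;
    }
}
```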
In summary: Decodable uses Flink session clusters to run fast ephemeral jobs, and application clusters to run production jobs. We provide multi-tenant data isolation guarantees in both scenarios. We also modified Flink to boost its performance and we plan to contribute our changes back to the Flink community.
What’s Next at Decodable?
Running Flink in production has been a long journey but there is more to do in the future. Some of the interesting items in our backlog are:
- Ensure network bandwidth is shared fairly among accounts.
- Optimize performance for jobs with large state while ensuring a fair share of resources (e.g., disk and network usage).
- Auto-adjust Flink resources based on the workload, potentially extending reactive mode to support native Kubernetes deployments.
- Figure out an intuitive and secure way to allow users to run custom code.
Get Started with Decodable
You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.
Learn more:
- Read the docs
- Check out our other blogs
- Subscribe to our YouTube Channel
- Join our Slack community