May 9, 2022 · 7 min read

Flink Deployments At Decodable

By
Sharon Xie

Decodable’s platform uses Apache Flink to run our customers’ real-time data processing jobs. In this post, I will explain how we manage the underlying Flink deployments at Decodable in a multi-tenant environment. To see where Apache Flink fits in the overall Decodable platform, check out the Platform Overview blog, which outlines our high-level platform architecture.

Flink has many traits that we love for real-time data processing. It is a distributed processing engine designed for stream processing: stateful computations over unbounded data. It is tremendously scalable and is used by Alibaba, Netflix, Uber, Lyft, and a host of other companies of all sizes.

Decodable offers real-time data processing as a self-service platform, and - like every multi-tenant SaaS service - we must provide strict tenant isolation, uptime guarantees, and performance SLAs. Let’s break down our requirements and see how we configured and deployed Flink to meet our needs.

Requirement: Low Cost to Create a Free Account

Our goal here is to minimize both the time a user has to wait for a new account to be ready and the cost of the resources behind it. SaaS platform users typically experience delays when creating an initial account, due to provisioning of the underlying infrastructure dedicated to the account - it can easily take longer than 10 minutes to create a new user account.

As a user, you never want to hear that your account will be deleted because the provider's internal costs for free trial accounts are too high. At Decodable, we believe that our customers will experience the value of our platform and many will want to expand to the pay-as-you-go tier as their usage increases. Obviously, we incur costs from the free tier offering, and want to keep those costs to a minimum.

Solution: On-Demand Flink Resource Creation

Creating a Decodable account takes only seconds. This is largely because we don’t pre-create any Flink resources at account creation time. Flink resources are created only when a data processing job is triggered, either through a preview session or a running instance of a pipeline or connection. When a job is terminated, its resources are cleaned up immediately.
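As a rough sketch of this lifecycle, the control-plane logic looks something like the following. All class and resource names here are hypothetical, simplified stand-ins for the actual Kubernetes objects involved:

```python
# Sketch of on-demand Flink resource lifecycle (illustrative names only).
# Nothing is provisioned at account creation; resources appear when a
# job starts and are torn down as soon as it terminates.

class FlinkResourceManager:
    """Tracks per-job Flink resources; a simplified stand-in for the
    real control plane, which manages Kubernetes objects."""

    def __init__(self):
        self.active = {}  # job_id -> list of provisioned resource names

    def start_job(self, job_id: str) -> None:
        # In production this would create Kubernetes resources
        # (JobManager/TaskManager pods, services, config maps).
        self.active[job_id] = [
            f"{job_id}-jobmanager",
            f"{job_id}-taskmanager",
        ]

    def terminate_job(self, job_id: str) -> None:
        # Resources are cleaned up immediately on termination.
        self.active.pop(job_id, None)


mgr = FlinkResourceManager()
assert mgr.active == {}            # account creation provisions nothing
mgr.start_job("pipeline-42")
assert "pipeline-42" in mgr.active  # resources exist only while running
mgr.terminate_job("pipeline-42")
assert mgr.active == {}            # cleaned up right away
```

The key design point is that the account itself is just metadata; everything expensive is keyed to a running job.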

Requirement: Strong Security and Isolation for Production Workloads

Once a data processing job is running in production, it should continue running without issues. This is a lot easier said than done! In a multi-tenant environment, we must prevent bad actors, mitigate noisy neighbors, and limit the blast radius of any unexpected outages.

Solution: Flink Application Clusters for Pipeline and Connection Jobs

Flink offers several deployment modes for running natively on Kubernetes. Given the requirements here, we use application mode to run our connection and pipeline jobs. Application mode runs one job per cluster, with a JobManager solely responsible for that job. This isolation avoids noisy neighbors and resource contention. In addition, we configure Kubernetes resource limits for each Flink pod to ensure CPU and memory are shared fairly. We’ve also configured high availability for Flink clusters so they auto-recover if a pod is killed erroneously or an unexpected incident occurs.
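To make this concrete, here is a hedged sketch of how such a per-job application-mode launch might be assembled. The configuration keys are real Flink native-Kubernetes options, but the cluster ID, resource limits, and storage path shown are illustrative placeholders, not our actual settings:

```python
# Illustrative builder for a per-job application-mode launch command.
# The -D keys are genuine Flink options; all values are placeholders.

def application_mode_args(cluster_id: str, job_jar: str) -> list[str]:
    options = {
        # One cluster per job, identified by its own cluster ID.
        "kubernetes.cluster-id": cluster_id,
        # Per-pod CPU and memory settings for fair sharing.
        "kubernetes.jobmanager.cpu": "1.0",
        "kubernetes.taskmanager.cpu": "2.0",
        "jobmanager.memory.process.size": "2048m",
        "taskmanager.memory.process.size": "4096m",
        # Kubernetes-based HA so the job auto-recovers if a pod dies.
        "high-availability": "org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory",
        "high-availability.storageDir": "s3://flink-ha/recovery",
    }
    args = ["flink", "run-application", "--target", "kubernetes-application"]
    for key, value in sorted(options.items()):
        args.append(f"-D{key}={value}")
    args.append(job_jar)
    return args
```

The resulting argument list corresponds to one `flink run-application --target kubernetes-application` invocation per job, which is what gives each pipeline its own dedicated JobManager.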

Requirement: Quick Response to Preview Jobs

Preview is a feature that lets a user test a pipeline by running a SQL statement against real production data. Preview is critical to the pipeline authoring experience because it helps users understand whether their pipeline will do what they expect on real data. In the iterative process of trying and testing, we want to minimize the cycle time.

Solution: Custom Flink Session Clusters for Preview Jobs

We deploy Flink session clusters to run our preview jobs. The main reason is that Flink session mode has a much lower start-up time, because it uses a shared, pre-deployed JobManager. However, TaskManagers (the workers) are designed to be shared, which makes it hard to isolate jobs from different tenants. To ensure job isolation in session mode, we modified Flink so that each TaskManager runs only one job. Preview jobs are short-lived (no more than 2 minutes), and a TaskManager is killed when its job times out. To further reduce preview response time, we also modified Flink to pre-deploy some idle TaskManagers. When a TaskManager is taken, a new one is created immediately, so there is always a pool of idle TaskManagers.
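The warm-pool behavior can be sketched as follows. This is a simplified model with hypothetical names, not our actual implementation:

```python
from collections import deque

# Sketch of a warm TaskManager pool. Idle TaskManagers are pre-deployed;
# taking one for a preview job immediately triggers creation of a
# replacement, so previews never wait on pod start-up.

class TaskManagerPool:
    def __init__(self, size: int):
        self._counter = 0
        self.idle = deque(self._deploy() for _ in range(size))

    def _deploy(self) -> str:
        # Placeholder for actually deploying a TaskManager pod.
        self._counter += 1
        return f"taskmanager-{self._counter}"

    def acquire(self) -> str:
        tm = self.idle.popleft()
        # Replenish right away so the pool never shrinks.
        self.idle.append(self._deploy())
        return tm

    def retire(self, tm: str) -> None:
        # A preview TaskManager runs exactly one job, then is killed
        # rather than returned to the pool; its replacement already
        # exists, created at acquire time.
        pass
```

Because each acquired TaskManager is killed after its single job instead of being reused, tenant isolation holds even though the JobManager is shared.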

In summary: Decodable uses Flink session clusters to run fast ephemeral jobs, and application clusters to run production jobs. We provide multi-tenant data isolation guarantees in both scenarios. We also modified Flink to boost its performance and we plan to contribute our changes back to the Flink community.

What’s Next at Decodable?

Running Flink in production has been a long journey but there is more to do in the future. Some of the interesting items in our backlog are:

  • Ensure network bandwidth is shared fairly among accounts.
  • Optimize performance for jobs with large state while ensuring a fair share of resources (e.g., disk and network usage).
  • Auto-adjust Flink resources based on the workload, potentially extending reactive mode to support native Kubernetes deployments.
  • Figure out an intuitive and secure way to allow users to run custom code.

Get Started with Decodable

You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.

Sharon Xie

Sharon is a founding engineer at Decodable. Currently she leads product management and development. She has over six years of experience in building and operating streaming data platforms, with extensive expertise in Apache Kafka, Apache Flink, and Debezium. Before joining Decodable, she served as the technical lead for the real-time data platform at Splunk, where her focus was on the streaming query language and developer SDKs.
