🧪 Virtual Hands-On Lab: Introduction to Real-time ETL

November 5, 2024

min read

How to Build a Customer 360 Analytics Pipeline

Share this post

Customer 360 refers to getting a single view of customer engagement across their entire customer journey. It connects apps and data from multiple interactions to give businesses a 360-degree customer view. It includes data from a variety of sources, including customer demographics, customer relationship management (CRM), social media, eCommerce, marketing, sales, customer service, mobile apps, and many other customer touchpoints.

Businesses can leverage insights gained from a comprehensive customer view to improve and deliver exceptional experiences, increase customer loyalty, create reliable customer profiles to improve marketing and sales initiatives, streamline and connect business processes and workflows to improve efficiency and functionality, and reduce time and cost caused by human error in the customer journey.

Customer 360 data comes in many forms from many sources, most often populated into transactional databases like Postgres, MySQL, and others. Taken together, this data can be used to gain a more complete understanding of the customer to optimize marketing and sales funnels. Below we can see examples of raw call log and clickstream data, with one record per activity type and per user, depending on the specific activity that was logged.

Call Log Records

{
  "log_datatime": "2020-03-04 13:19:22",
  "xml": "<call_log><activity_id>701C</activity_id><call_id>367e5d7e-a3e6-4d27-a5c7-35706e9dca9d</call_id><user_id>4433a94b-12c5-4397-8837-3eedf11e78e6</user_id><start_time>2020-03-04 13:15:12</start_time><end_time>2020-03-04 13:19:22</end_time><call_time_seconds>207</call_time_seconds><from_phone_number>+37277774841</from_phone_number><to_phone_number>+37249234343</to_phone_number><outcome>answered</outcome><has_recording>false</has_recording></call_log>"
}

‍Clickstream Log Records

{
  "event_timestamp": "2020-11-16 14:32:19",
  "user_id": "4433a94b-12c5-4397-8837-3eedf11e78e6",
  "site_id": "wj32-gao1-4w1o-iqp4",
  "pages_visited": 8,
  "total_seconds_on_site": 426,
  "avg_percent_viewed": 28.198543
}

Currently, each data source is in a unique data format and uses different field names for similar data, which makes it difficult to process. It would be better to have all the data in a consistent schema which can then be sent to the same sink connection to be used for analysis, regardless of the original source or form of the data. By using one or more Decodable pipelines, which are streaming SQL queries that process data, we can transform the raw data into a form that is best suited for how it will be consumed.

The Standard Customer 360 Pipeline

A standard Customer 360 data processing pipeline uses a series of data extraction processes from multiple online transaction processing (OLTP) databases and other sources, and typical architectures start by implementing Apache Kafka to serve as the backbone for ingesting data from servers or web applications. Once ingested, the data undergoes cleaning and transformation using a processing engine like Apache Flink to improve accuracy and usability—this stage includes deduplication, standardization, and data enrichment by joining together data from multiple sources. This ensures that all customer records are complete, accurate, and standardized across attributes, forming a reliable base for loading into a data warehouse or Customer Data Platform (CDP) optimized for analysis and reporting. Advanced analytics, including customer segmentation, predictive modeling, and real-time personalization, can be applied to create a holistic, actionable view of each customer.

While this pipeline is powerful, setting up and maintaining such a streaming system comes with significant challenges. Flink and Kafka are complex tools requiring deep technical expertise to configure and scale, as well as ongoing investments in monitoring, performance tuning, and scaling the infrastructure. Security and stability are major concerns in these architectures, particularly when handling sensitive user data. Organizations must ensure that their clickstream pipelines are compliant with regulations like SOC 2 Type II and GDPR, which mandate strict data handling and protection procedures. As business requirements change or data volumes grow, the system must be continuously updated, which introduces additional overhead in terms of time and resources.

The complexity of these systems and the need for continuous oversight create significant barriers to entry for many organizations looking to build and maintain real-time Customer 360 analytics pipelines.

Customer 360 with Decodable

Here at Decodable, we’ve built a solution that goes beyond the foundational technologies, addressing the broader requirements of real-time stream processing for ELT, ETL, and data replication. This includes ensuring a solid developer experience, providing extensive and flexible connectivity with built-in change data capture (CDC) support, managing schema, ensuring scalability across different workloads and use cases, providing observability, maintaining security, data governance, compliance, and offering ongoing support.

As a fully managed service, our platform takes care of the stream processing infrastructure and the deployment of Flink jobs so you can focus on the business logic for your data pipelines. That means there are no servers for you to manage, no clusters to create, size, or monitor, and no software dependencies to update or maintain within our platform.

In this example, we'll walk through how the Decodable data service is used to clean, transform, and aggregate data from multiple data sources. The processed data can then be used to inform business decisions.

Pipeline Architecture

For this example, two pipelines are used to process each of the two raw incoming data streams into the desired form. In more complex cases, it can be helpful to break your processing logic down into smaller, more manageable steps. This results in pipelines that are easier to test and maintain. Each stage in the sequence of pipelines is used to bring the data closer to its final desired form using SQL queries.

Decodable uses SQL to process data that should feel familiar to anyone who has used relational database systems. The primary differences you'll notice are that:

You activate a pipeline to start it, and deactivate a pipeline to stop it
All pipeline queries specify a source and a sink
Certain operations, notably joins and aggregations, must include windows

Unlike relational databases, all pipelines write their results into an output data stream (or sink). As a result, all pipelines are a single statement in the form INSERT INTO SELECT ... FROM, where sink and source are streams you've defined.

Transform Call Logs

As with most data services pipelines, the first step is to apply a variety of transformations to clean up and simplify the input data. For this example, an inner SELECT is used to parse the XML object blob using the xpaths function and extract the desired fields. Then the start_time field is converted from a string to a timestamp type and the call_time_seconds field is converted to an integer.

Pipeline: Standardize data stream

insert into transformed
select
  call_log.user_id as user_id,
  to_timestamp(call_log.start_time) as engagement_datetime,
  'sales call' as engagement_type,
  call_log.call_id as engagement_source_id,
  cast(call_log.call_time_seconds as int) as engagement_seconds
from (
  select
    -- parse XML to a DOM and extract fields using XPath expressions
    xpaths(xml,
      'user_id', '//call_log/user_id',
      'start_time', '//call_log/start_time',
      'call_id', '//call_log/call_id',
      'call_time_seconds', '//call_log/call_time_seconds'
    ) as call_log
  from `call_log`
)

After creating a new pipeline and entering the SQL query, clicking the Run Preview button will verify its syntax and then fire up a new executable environment to process the next 10 records coming in from the source stream and display the results. Decodable handles all the heavy lifting on the backend, allowing you to focus on working directly with your data streams to ensure that you are getting the results you need.

Transform Clickstream

For the website clickstream data, the required transformations for this example are fairly minimal. Primarily the field names are changed to match the desired schema for a standardized data stream, and the event_timestamp field is converted to a timestamp.

Pipeline: Standardize data stream

insert into transformed
select
  user_id,
  to_timestamp(event_timestamp) as engagement_datetime,
  'website' as engagement_type,
  site_id as engagement_source_id,
  total_seconds_on_site as engagement_seconds
from `clickstream`

Conclusion

At this point, a sink connection (one that writes a stream to an external system, such as AWS S3, Kafka, Kinesis, Postgres, Pulsar, or Redpanda) can be created to allow the results to be consumed by your own applications and services.

{
  "user_id": "4433a94b-12c5-4397-8837-3eedf11e78e6",
  "engagement_datetime": "2020-11-16 22:59:59",
  "engagement_type": "website",
  "engagement_source_id": "wj32-gao1-4w1o-iqp4",
  "engagement_seconds": 426
}

As we can see from this example, a sophisticated business problem can be addressed in a very straight-forward way using Decodable pipelines. It is not necessary to create docker containers and there is no SQL server infrastructure to set up or maintain—all that is needed are the built-in Decodable connectors and a working familiarity with creating the SQL queries themselves.

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

👍 Got it!

Oops! Something went wrong while submitting the form.

David Fabritius

October 14, 2022

min read

Powered by Apache Flink and Debezium, Decodable is a real-time data platform that unifies ELT, ETL, and stream processing.

Get the Technical Guide Watch Our Tech Talk

Heading 2

Call Log Records

{
  "log_datatime": "2020-03-04 13:19:22",
  "xml": "<call_log><activity_id>701C</activity_id><call_id>367e5d7e-a3e6-4d27-a5c7-35706e9dca9d</call_id><user_id>4433a94b-12c5-4397-8837-3eedf11e78e6</user_id><start_time>2020-03-04 13:15:12</start_time><end_time>2020-03-04 13:19:22</end_time><call_time_seconds>207</call_time_seconds><from_phone_number>+37277774841</from_phone_number><to_phone_number>+37249234343</to_phone_number><outcome>answered</outcome><has_recording>false</has_recording></call_log>"
}

‍Clickstream Log Records

{
  "event_timestamp": "2020-11-16 14:32:19",
  "user_id": "4433a94b-12c5-4397-8837-3eedf11e78e6",
  "site_id": "wj32-gao1-4w1o-iqp4",
  "pages_visited": 8,
  "total_seconds_on_site": 426,
  "avg_percent_viewed": 28.198543
}

The Standard Customer 360 Pipeline

Customer 360 with Decodable

Pipeline Architecture

Decodable uses SQL to process data that should feel familiar to anyone who has used relational database systems. The primary differences you'll notice are that:

You activate a pipeline to start it, and deactivate a pipeline to stop it
All pipeline queries specify a source and a sink
Certain operations, notably joins and aggregations, must include windows

Transform Call Logs

Pipeline: Standardize data stream

insert into transformed
select
  call_log.user_id as user_id,
  to_timestamp(call_log.start_time) as engagement_datetime,
  'sales call' as engagement_type,
  call_log.call_id as engagement_source_id,
  cast(call_log.call_time_seconds as int) as engagement_seconds
from (
  select
    -- parse XML to a DOM and extract fields using XPath expressions
    xpaths(xml,
      'user_id', '//call_log/user_id',
      'start_time', '//call_log/start_time',
      'call_id', '//call_log/call_id',
      'call_time_seconds', '//call_log/call_time_seconds'
    ) as call_log
  from `call_log`
)

Transform Clickstream

Pipeline: Standardize data stream

insert into transformed
select
  user_id,
  to_timestamp(event_timestamp) as engagement_datetime,
  'website' as engagement_type,
  site_id as engagement_source_id,
  total_seconds_on_site as engagement_seconds
from `clickstream`

Conclusion

{
  "user_id": "4433a94b-12c5-4397-8837-3eedf11e78e6",
  "engagement_datetime": "2020-11-16 22:59:59",
  "engagement_type": "website",
  "engagement_source_id": "wj32-gao1-4w1o-iqp4",
  "engagement_seconds": 426
}

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

David Fabritius

Let's get decoding

Decodable is free. No CC required. Never expires.

Start for Free Talk to an Expert Join the Community on Slack

How to Build a Customer 360 Analytics Pipeline

Call Log Records

‍Clickstream Log Records

The Standard Customer 360 Pipeline

Customer 360 with Decodable

Pipeline Architecture

Transform Call Logs

Pipeline: Standardize data stream

Transform Clickstream

Pipeline: Standardize data stream

Conclusion

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Related Posts

Real Time Streaming Joins With SQL

Considerations for Implementing Streaming JOINs

Where does Stream Processing Fit into Your Data Platform?

Table of contents

Call Log Records

‍Clickstream Log Records

The Standard Customer 360 Pipeline

Customer 360 with Decodable

Pipeline Architecture

Transform Call Logs

Pipeline: Standardize data stream

Transform Clickstream

Pipeline: Standardize data stream

Conclusion

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Related Posts

Real Time Streaming Joins With SQL

Considerations for Implementing Streaming JOINs

Where does Stream Processing Fit into Your Data Platform?

Let's get decoding