Building a Clickstream Analytics Pipeline

Clickstream data is collected from user activity on the web and used to provide insights into how visitors get to the website, what they do once there, how long they stay on any given page, the number of page visits visitors make, and the number of unique and repeat visitors. Clickstream analytics have the ability to refine data by processing, cleaning, and transforming the raw data into convenient structures that make analysis of data easier and more accurate. Using web data, businesses can not only identify customer needs but can offer customized solutions to cater to the needs of an evolving customer base. The global clickstream analytics market size was valued at $868.8 million in 2018, and is projected to reach $2.5 billion by 2026, indicating a significant focus for businesses.

Clickstream data is a record of a user's online activity, including every click, page navigation, and interaction with a website or application. It can be used to understand how users engage with a site and optimize marketing and sales funnels. Below we can see a sample of raw clickstream data, with one record per page visit for every user of each website monitored.

{
  "event_datetime": "2020-11-16 22:59:59",
  "event": "view_item",
  "user_id": "f6d4-24d4-4a29-3be1",
  "click_id": "a5cf-179b9-c9d4-83ab",
  "site_id": "wj32-gao1-4w1o-iqp4",
  "page": {
    "id": "b7b1-05fb-bf95-a85a",
    "url": "/product-67890",
    "previous_id": "2905-81e7-be8e-4814",
    "previous_url": "/category-tshirts"
  },
  "engagement": {
    "seconds_on_data": 79,
    "percent_viewed": 39.7
  }
}

Currently, it is not in the best form for analyzing how well the website is performing. For this, it would be better to have statistics aggregated over time. By using one or more Decodable pipelines, which are streaming SQL queries that process data, we can transform the raw data into a form that is best suited for how it will be consumed.

‍The Standard Clickstream Pipeline

A standard clickstream analytics pipeline typically involves several key components and tools that have been integrated to collect, process, and analyze user interactions. A typical architecture starts by implementing Apache Kafka to serve as the backbone for ingesting data from servers or web applications. Data streams are then formatted for downstream use, typically by converting raw logs into structured or semi-structured formats. The stream is then connected to a processing system like Apache Flink, which processes the data in real time, applying analytics such as sessionization, user behavior analysis, or conversion tracking. This architecture enables businesses to act on user data immediately, offering insights that can drive user engagement, marketing decisions, and overall product improvement.

While this pipeline is powerful, setting up and maintaining such a streaming system comes with significant challenges. Flink and Kafka are complex tools requiring deep technical expertise to configure and scale, as well as ongoing investments in monitoring, performance tuning, and scaling the infrastructure. Security and stability are major concerns in these architectures, particularly when handling sensitive user data. Organizations must ensure that their clickstream pipelines are compliant with regulations like SOC 2 Type II and GDPR, which mandate strict data handling and protection procedures. As business requirements change or data volumes grow, the system must be continuously updated, which introduces additional overhead in terms of time and resources.

The complexity of these systems and the need for continuous oversight create significant barriers to entry for many organizations looking to build and maintain real-time clickstream analytics pipelines.

Clickstream with Decodable

Here at Decodable, we’ve built a solution that goes beyond the foundational technologies, addressing the broader requirements of real-time stream processing for ELT, ETL, and data replication. This includes ensuring a solid developer experience, providing extensive and flexible connectivity, managing schema, ensuring scalability across different workloads and use cases, providing observability, maintaining security, data governance, compliance, and offering ongoing support.

As a fully managed service, our platform takes care of the stream processing infrastructure and the deployment of Flink jobs so you can focus on the business logic for your data pipelines. That means there are no servers for you to manage, no clusters to create, size, or monitor, and no software dependencies to update or maintain within our platform.

In this example, we’ll walk through how the Decodable data service is used to clean, transform, and enrich real-time clickstream data. The processed data can then be used to inform business decisions.

Pipeline Architecture

For this example, only a single pipeline is needed to process the raw incoming data into the desired form. But it is also possible to use multiple pipelines in a series of stages, with the output of each one being used as the input for the next. Depending on the complexity of the desired processing, it can be helpful to break it down into smaller, more manageable steps. This results in pipelines that are easier to test and maintain. Each stage in the sequence of pipelines is used to bring the data closer to its final desired form using SQL queries.

Decodable uses SQL to process data that should feel familiar to anyone who has used relational database systems. The primary differences you’ll notice are that:

  • You activate a pipeline to start it, and deactivate a pipeline to stop it
  • All pipeline queries specify a source and a sink
  • Certain operations, notably JOINs and aggregations, must include windows

Unlike relational databases, all pipelines write their results into an output data stream (or sink). As a result, all pipelines are a single statement in the form INSERT INTO SELECT ... FROM , where sink and source are streams you’ve defined.

Aggregate And Enrich Data Stream

For this example, the pipeline leverages the SQL tumble group window function to create a set of records across a non-overlapping, continuous window with a fixed duration of 1 hour. For each interval, the number of pages visited, the total amount of time spent reading or interacting with these pages, and an average of how much of the pages were actually viewed is calculated, grouped by website and user.

As an alternative, the hop window function could be used to create a set of records across a fixed duration that hops (or slides) by a given interval. If the hop interval is smaller than the window duration, the hopping windows overlap, and records from the data stream are assigned to multiple windows. Then a subsequent pipeline could be used to filter the results to one representing the highest level of engagement over a set duration for each user for each website.

Pipeline: Aggregate Clickstream Data

insert into summary
select
  window_start,
  window_end,
  site_id,
  user_id,
  count(1) as pages_visited,
  sum(engagement.seconds_on_data) as total_seconds_on_site,
  avg(engagement.percent_viewed) as avg_percent_viewed,
from table (
  tumble(
    table clickstream,
    descriptor(to_timestamp(event_datetime)),
    interval '1' hour
  )
)
group by
  window_start,
  window_end,
  site_id,
  user_id

After creating a new pipeline and entering the SQL query, clicking the Run Preview button will verify its syntax and then fire up a new executable environment to process the next 10 records coming in from the source stream and display the results. Decodable handles all the heavy lifting on the backend, allowing you to focus on working directly with your data streams to ensure that you are getting the results you need.

Conclusion

At this point, a sink connection (one that writes a stream to an external system, such as AWS S3, Kafka, Kinesis, Postgres, Pulsar, or Redpanda) can be created to allow the results to be consumed by your own applications and services.

{
  "window_start": "2020-11-16 14:00:00",
  "window_end": "2020-11-16 15:00:00",
  "user_id": "f6d4-24d4-4a29-3be1",
  "site_id": "wj32-gao1-4w1o-iqp4",
  "pages_visited": 8,
  "total_seconds_on_site": 426,
  "avg_percent_viewed": 28.198543
}

As we can see from this example, a sophisticated business problem can be addressed in a very straight-forward way using Decodable pipelines. It is not necessary to create docker containers, there is no SQL server infrastructure to set up or maintain, all that is needed is a working familiarity with creating the SQL queries themselves.


You can watch demonstrations of several examples on the Decodable YouTube channel.

Additional documentation for all of Decodable’s services is available here.

Please consider joining us on our community Slack.

Other Solutions

Customer Data

Aggregating customer data streams: call logs, clickstream data, ecommerce activity, geolocation, point-of-sale terminals, social media feeds.

Learn more