All posts by Robert Cepa

How TimescaleDB helped us scale analytics and reporting

2025-07-08 Robert Cepa

Post Syndicated from Robert Cepa original https://blog.cloudflare.com/timescaledb-art/

At Cloudflare, PostgreSQL and ClickHouse are our standard databases for transactional and analytical workloads. If you’re part of a team building products with configuration in our Dashboard, chances are you’re using PostgreSQL. It’s fast, versatile, reliable, and backed by over 30 years of development and real-world use. It has been a foundational part of our infrastructure since the beginning, and today we run hundreds of PostgreSQL instances across a wide range of configurations and replication setups.

ClickHouse is a more recent addition to our stack. We started using it around 2017, and it has enabled us to ingest tens of millions of rows per second while supporting millisecond-level query performance. ClickHouse is a remarkable technology, but like all systems, it involves trade-offs.

In this post, I’ll explain why we chose TimescaleDB — a Postgres extension — over ClickHouse to build the analytics and reporting capabilities in our Zero Trust product suite.

Designing for future growth

After a decade in software development, I’ve grown to appreciate systems that are simple and boring. Over time, I’ve found myself consistently advocating for architectures with the fewest moving parts possible. Whenever I see a system diagram with more than three boxes, I ask: Why are all these components here? Do we really need all of this?

As engineers, it’s easy to fall into the trap of designing for scenarios that might never happen. We imagine future scale, complex failure scenarios, or edge cases, and start building solutions for them upfront. But in reality, systems often don’t grow the way we expect, or don’t have to. Designing for large scale can be deferred by setting the right expectations with customers, and by adding guardrails like product limits and rate limits. Focusing on launching initial versions of products with just a few essential parts, maybe two or three components, gives us something to ship, test, and learn from quickly. We can always add complexity later, but only once it’s clear we need it.

Whether I specifically call it YAGNI, or Keep it simple, stupid, or think about it as minimalism in engineering, the core idea is the same: we’re rarely good at predicting the future, and every additional component we introduce carries a cost. Each box in the system diagram is something that can break itself or other boxes, spiral into outages, and ruin weekend plans of on-call engineers. Each box also requires documentation, tests, observability, and service level objectives (SLOs). Oftentimes, teams need to learn a new programming language just to support a new box.

Making Digital Experience Monitoring simple

Two years ago, I was tasked with building a new product at Cloudflare: Digital Experience Monitoring (DEX). DEX provides visibility into device, network, and application performance across Zero Trust environments. Our initial goal was clear — launch an MVP focused on fleet status monitoring and synthetic tests, giving customers actionable analytics and troubleshooting. From a technical standpoint, fleet status and synthetic tests are two types of structured logs generated by the WARP client. These logs are uploaded to an API, stored in a database, and ultimately visualized in the Cloudflare Dashboard.

As with many new engineering teams at Cloudflare, DEX started as a “tiger team”: a small group of experienced engineers tasked with validating a new product quickly. I worked with the following constraints:

Team of three full-stack engineers.
Daily collaboration with 2-3 other teams.
Can launch in beta, engineering can drive product limits.
Emphasis on shipping fast.

To strike a balance between usefulness and simplicity, we made deliberate design decisions early on:

Fleet status logs would be uploaded from WARP clients at fixed 2-minute intervals.
Synthetic tests required users to preconfigure them by target (HTTP or traceroute) and frequency.
We capped usage: each device could run up to 10 synthetic tests, no more than once every 5 minutes.
Data retention of 7 days.

These guardrails gave us room to ship DEX months earlier and gather early feedback from customers without prematurely investing in scalability and performance.

We knew we needed a basic configuration plane — an interface in the Dashboard for users to create and manage synthetic tests, supported by an API and database to persist this data. That led us to the following setup:

HTTP API for managing test configurations.
PostgreSQL for storing those configurations.
React UI embedded in the Cloudflare Dashboard.

Just three components — simple, focused, and exactly what we needed. Of course, each of these boxes came with real complexity under the hood. PostgreSQL was deployed as a high-availability cluster: one primary, one synchronous replica for failover scenarios, and several asynchronous replicas distributed across two geographies. The API was deployed on horizontally scaled Kubernetes pods across two geographies. The React app was served globally as standard via Cloudflare’s network. Thanks to our platform teams, all of that complexity was abstracted away, allowing us to think in terms of just three essential parts, but it really shows that each box can come with a huge cost behind the scenes.

Next, we needed to build the analytics plane — an ingestion pipeline to collect structured logs from WARP clients, store them, and visualize them for our customers in the Dashboard. I was personally excited to explore ClickHouse for this. I have seen its performance in other projects and was eager to experiment with it. But as I dug into the internal documentation on how to get started with ClickHouse, reality set in:

Writing data to Clickhouse

Your service must generate logs in a clear format, using Cap’n Proto or Protocol Buffers. Logs should be written to a socket for logfwdr to transport to PDX, then to a Kafka topic. Use a Concept:Inserter to read from Kafka, batching data to achieve a write rate of less than one batch per second.

Oh. That’s a lot. Including ClickHouse and the WARP client, we’re looking at five boxes to be added to the system diagram. This architecture exists for good reason, though. The default and most commonly used table engine in ClickHouse, MergeTree, is optimized for high-throughput batch inserts. It writes each insert as a separate partition, then runs background merges to keep data manageable. This makes writes very fast, but not when they arrive in lots of tiny batches, which was exactly our case with millions of individual devices uploading one log event every 2 minutes. Too many small writes can trigger write amplification, resource contention, and throttling.

So it became clear that ClickHouse is a sports car and to get value out of it we had to bring it to a race track, shift into high gear, and drive it at top speed. But we didn’t need a race car — we needed a daily driver for short trips to a grocery store. For our initial launch, we didn’t need millions of inserts per second. We needed something easy to set up, reliable, familiar, and good enough to get us to market. A colleague suggested we just use PostgreSQL, quoting “it can be cranked up” to handle the load we were expecting. So, we took the leap!

First design of configuration and analytics plane for DEX:

Using PostgreSQL for analytics

Structurally, there’s not much difference between configuration data and analytical logs. Logs are simply structured payloads — often in JSON — that can be transformed into a columnar format and persisted in a relational database.

Here’s an example of a device state log:

{
  “timestamp”: “2025-06-16T22:50:12.226Z”,
  “accountId”: “025779fde8cd4ab8a3e5138f870584a7”,
  “deviceId”: “07dfde77-3f8a-4431-89f7-acfcf4ead4fc”,
  “colo”: “SJC”,
  “status”: “connected”,
  “mode”: “warp+doh”,
  “clientVersion”: “2024.3.409.0”,
  “clientPlatform”: “windows”,
}

To store these logs, we created a simple PostgreSQL table:

CREATE TABLE device_state (
	"timestamp" TIMESTAMP WITH TIME ZONE NOT NULL,
	account_id TEXT NOT NULL,
	device_id TEXT NOT NULL,
	colo TEXT,
	status TEXT,
	mode TEXT,
	client_version TEXT,
	client_platform TEXT
);

You might notice that this table doesn’t have a primary key. That’s intentional, because time-series data is almost never queried by a unique ID. Instead, we query by time ranges and filter by various attributes (e.g. account ID or device ID). Still, we needed a way to deduplicate logs in case of client retries.

We created two indexes to optimize for our most common queries:

CREATE UNIQUE INDEX device_state_device_account_time ON device_state USING btree (device_id, account_id, “timestamp”);
CREATE INDEX device_state_account_time ON device_state USING btree (account_id, “timestamp”);

The unique index ensures deduplication: each (device, account, timestamp) tuple represents a single, unique log. The second index supports typical time-window queries at the account level. Since we always query by account_id (represents individual customers) and timestamp, they are always a part of the index.

We inserted data from our API using UPSERT query:

INSERT INTO device_state (…) VALUES (…) ON CONFLICT DO NOTHING;

About order of columns in multicolumn indexes

PostgreSQL’s B-tree indexes support multiple columns, but column order has a major impact on query performance.

From PostgreSQL documentation about multicolumn indexes:

A multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned. Constraints on columns to the right of these columns are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned.

What’s interesting in time series workloads is that the queries usually have inequality constraints on the time column, and then equality constraints on all other columns.

A typical query to build line charts and pie charts visualizing data in a time interval often looks like this:

SELECT 
  DATE_TRUNC(‘hour’, timestamp) as hour, 
  account_id,
  device_id,
  status,
  COUNT(*) as total
FROM device_state 
WHERE 
  account_id = ‘a’ AND
  device_id = ‘b’ AND
  timestamp BETWEEN ‘2025-07-01’ AND ‘2025-07-02’
GROUP BY hour, account_id, device_id, status;

Notice our WHERE clause — it has equality constraints on account_id and device_id, and two inequality constraints on timestamp. If we had built our index in the order of (timestamp, account_id, device_id), only the “timestamp” section of the index could’ve been used to reduce the index section to be scanned, and account_id and device_id would have to be fully scanned, with values that are not ‘a’ or ‘b’ filtered out after scanning.

Additionally, the runtime complexity of search in btree is O(log n) — the search will get slower as the size of your table (and all indexes) grows, so another optimization is to reduce the portion of the index that needs to be scanned. Even for columns with equality constraints, you can greatly reduce query times by ordering columns by cardinality. We’ve seen up to 100% improvement in SELECT query performance when we simply changed the order of account_id and device_id in our multicolumn index.

To get the best performance for time range queries, we follow these rules for order of columns:

The timestamp column is always last.
Other columns are leading columns, ordered by their cardinalities starting with the highest cardinality column.

Launch and improvements

Because we took a step back during system design and avoided optimizing for the future, thanks to our minimal and focused architecture, we went from zero to a working DEX MVP in under four months.

Early metrics were promising, providing reasonable throughput capabilities and latency for API requests:

~200 inserts/sec at launch.
Query latencies in the hundreds of milliseconds for most customers.

Post-launch, we focused on collecting feedback while monitoring system behavior. As adoption grew, we scaled to 1,000 inserts/sec, and our tables grew to billions of rows. That’s when we started to see performance degradation — particularly for large customers querying 7+ day time ranges across tens of thousands of devices.

Optimizing query performance with precomputed aggregates

As DEX grew to billions of device logs, one of the first performance optimizations we explored was precomputing aggregates, also known as downsampling.

The idea is that if you know the shape of your queries ahead of time — say, grouped by status, mode, or geographic location — you can precompute and store those summaries in advance, rather than querying the raw data repeatedly. This dramatically reduces the volume of data scanned and the complexity of the query execution.

To illustrate this in an example, let’s consider DEX Fleet Status:

In our DEX Fleet Status dashboard, we render common visualizations like:

Number of connected devices by data center location (colo)
Device status and connection mode over time

These charts typically group logs by status, mode, or colo, either over a 1-hour window or across the full time range.

Our largest customers may have 30,000+ devices, each reporting logs every 2 minutes. That’s millions of records per day per customer. But the columns we’re visualizing (e.g. status and mode) only have a few distinct values (4–6). By aggregating this data ahead of time, we can collapse millions of rows into a few hundred per interval and query dramatically smaller, narrower tables.

This made a huge impact: we saw up to 1000x query performance improvement and charts that previously took several seconds now render instantly, even for 7-day views across tens of thousands of devices.

Implementing this technique in PostgreSQL is challenging. While PostgreSQL does support materialized views, they didn’t fit our needs out of the box because they don’t refresh automatically and incrementally. Instead, we used a cron job that was periodically running custom aggregation queries for all pre-aggregate tables (we had 6 of them). Our Database platform team had a lightweight framework built for data retention purposes that we plugged into. Still, any schema change required cross-team coordination, and we invested considerable time in optimizing aggregation performance. But the results were worth it: fast, reliable queries for the majority of customer use cases.

Table partitioning

Pre-computed aggregates are great, but they’re not the answer to everything. As we were adding more table columns for new DEX features, we needed to invest time in creating new pre-aggregated tables. Additionally, some features required queries with combined filters, which required querying the raw data that included all the columns. But we didn’t have good enough performance in raw tables.

One technique we considered to improve performance on raw tables was table partitioning. In PostgreSQL, tables are stored in one large file (large tables are split to 1 GB segment files). With partitioning, you can break a large table into smaller child tables, each covering a slice of data (e.g. one day of logs). PostgreSQL then scans only the relevant partitions based on your query’s timestamp filter. This can dramatically improve query performance in some cases.

What was particularly interesting for us was range-partitioning on the timestamp column, because our customers wanted longer data retention, up to one year, and storing one year of data in one large table would have destroyed query performance.

CREATE TABLE device_state (
    …
) PARTITION BY RANGE (timestamp);

CREATE TABLE device_state_20250601 PARTITION OF device_state
    FOR VALUES FROM ('2025-06-01') TO ('2025-06-02');
CREATE TABLE device_state_20250601 PARTITION OF device_state
    FOR VALUES FROM ('2025-06-02') TO ('2025-06-03');
CREATE TABLE device_state_20250601 PARTITION OF device_state
    FOR VALUES FROM ('2025-06-03') TO ('2025-06-04');

Unfortunately, PostgreSQL doesn’t automatically manage partitions — you must manually create each one as shown above, so we would have needed to build a full partition management system to automate this.

We ended up not adopting it because in the end, partitioning didn’t solve our core problem: speeding up frequent dashboard queries on recent raw data up to past 7 days.

TimescaleDB

As our raw PostgreSQL setup began to show its limits, we started exploring other options to improve query performance. That’s when we discovered TimescaleDB. What particularly caught my attention was columnstore and sparse indexes, common techniques in OLAP databases like ClickHouse. It seemed to be the solution for our raw performance problem. On top of that:

It’s Postgres: TimescaleDB is packaged as a PostgreSQL extension and it seamlessly coexists with it, granting access to the entire Postgres ecosystem. We can still use vanilla Postgres tables for transactional workloads, and TimescaleDB hypertables for analytical tasks, offering convenience of one database for everything.
Automatic partition management: Unlike Postgres, which requires manual table partitioning, TimescaleDB’s hypertables are partitioned by default and automatically managed.
Automatic data pre-aggregation/downsampling: Tedious processes in native Postgres, such as creating and managing downsampled tables, are automated in TimescaleDB through continuous aggregates. This feature eliminates the need for custom-built cron jobs and simplifies the development and deployment of pre-computed aggregates.
Realtime data pre-aggregation/downsampling: A common problem with async aggregates is that they can be out-of-date, because aggregation jobs can take a long time to complete. TimescaleDB addresses the issue of outdated async aggregates with its realtime aggregation by seamlessly integrating the most recent raw data into rollup tables during queries.
Compression: Compression is a cornerstone feature of TimescaleDB. Compression can reduce table size by more than 90% while simultaneously enhancing query performance.
Columnstore performance for real-time analytics: TimescaleDB’s hybrid row/columnar engine, Hypercore, enables fast scans and aggregations over large datasets. It’s fully mutable, so we can backfill with UPSERTs. Combined with compression, it delivers strong performance for analytical queries while minimizing storage overhead.
Rich library of analytics tools and functions: TimescaleDB offers a suite of tools and functions tailored for analytical workloads, including percentile approximation, count of unique values approximation, time-weighted averages, etc…

One especially compelling aspect: TimescaleDB made aggregation and data retention automatic, allowing us to simplify our infrastructure and remove a box from the system architecture entirely.

Evaluating TimescaleDB for DEX

We deployed a self-hosted TimescaleDB instance on our canary PostgreSQL cluster to run an apples-to-apples comparison against vanilla Postgres. Our production backend was dual-writing to both systems.

As expected, installing TimescaleDB was trivial. Simply load the library and run the following SQL query:

CREATE EXTENSION IF NOT EXISTS timescaledb;

Then we:

Created raw tables
Converted them to hypertables
Enabled columnstore features
Set up continuous aggregates
Configured automated policies for compression and retention

Here’s a condensed example for device_state logs:

– Create device_state table.
CREATE TABLE device_state (
	…
);

– Convert it to a hypertable.
SELECT create_hypertable ('device_state', by_range ('timestamp', INTERVAL '1 hour'));

– Add columnstore settings
ALTER TABLE device_state SET (
    timescaledb.enable_columnstore,
    timescaledb.segmentby = ‘account_id’
);

– Schedule recurring compression jobs
CALL add_columnstore_policy(‘device_state’, after => INTERVAL '2 hours', schedule_interval => INTERVAL '1 hour');

– Schedule recurring data retention jobs
SELECT add_retention_policy(‘device_state’, INTERVAL '7 days');

– Create device_state_by_status_1h continuous aggregate
CREATE MATERIALIZED VIEW device_state_by_status_1h
WITH (timescaledb.continuous) AS
SELECT
  time_bucket (INTERVAL '1 hour', TIMESTAMP) AS time_bucket,
  Account_id,
  Status,
  COUNT(*) as total
FROM device_state
GROUP BY 1,2,3
WITH no data;

– Enable realtime aggregates
ALTER MATERIALIZED VIEW ‘device_state_by_status_1h’
SET (timescaledb.materialized_only=FALSE);

– Schedule recurring continuous aggregate jobs to refresh past 10 hours every 10 minutes
SELECT add_continuous_aggregate_policy (
  ‘device_state_by_status_1h’,
  start_offset=>INTERVAL '10 hours',
  end_offset=>INTERVAL '1 minute',
  schedule_interval=>INTERVAL '10 minutes',
  buckets_per_batch => 1
);

After a two-week backfill period, we ran side-by-side benchmarks using real production queries from our dashboard. We tested:

3 time windows: past 1 hour, 24 hours, and 7 days
3 columnstore modes: uncompressed, compressed, and compressed with segmenting
Datasets containing 500 million to 1 billion rows

We saw 5x to 35x performance improvements, depending on query type and time range:

For short windows (1–24 hours), even uncompressed hypertables performed well.
For longer windows (7 days), compression and columnstore settings (especially with segmentby) made all the difference.
Sparse indexes were critical. Once PostgreSQL’s btree indexes broke down at scale, Timescale’s minmax sparse indexes and columnar layout outperformed.

On top of query performance, we saw impressive compression ratios, up to 33x:

SELECT 
    pg_size_pretty(before_compression_total_bytes) as before,
    pg_size_pretty(after_compression_total_bytes) as after,
    ROUND(before_compression_total_bytes / after_compression_total_bytes::numeric, 2) as compression_ratio
FROM hypertable_compression_stats('device_state');

before: 1616 GB
after: 49 GB
compression_ratio: 32.83

That meant we could retain 33x more data for the same cost.

What makes columnstore so fast?

Two main things: compression and sparse indexes.

It might seem counterintuitive that querying compressed data, which requires decompression, can be faster than querying raw data. But in practice, input/output (I/O) is the major bottleneck in most analytical workloads. The reduction in disk I/O from compression often outweighs the CPU cost of decompressing. In TimescaleDB, compression transforms a hypertable into a columnar format: values from each column are grouped in chunks (typically 1,000 at a time), stored in arrays, and then compressed into binary form. More detailed explanation in this TimescaleDB blog post.

You might wonder how this is possible in PostgreSQL, which is traditionally row-based. TimescaleDB has a really clever solution for it by utilizing PostgreSQL TOAST pages. The way it works is after tuples of 1000 values are compressed, they’re moved to external TOAST pages. The columnstore table itself then basically becomes a table of pointers to TOAST, where actual data is stored and only retrieved lazily, column-by-column.

The second factor is sparse minmax indexes. The idea behind sparse indexes is that rather than storing every single value in an index, store every N-th value. This makes them much smaller and more efficient to query in very large datasets. TimescaleDB implements minmax sparse indexes, where for each compressed tuple of 1,000 values it creates two additional metadata columns, storing min and max values. The query engine then looks at these columns to determine whether a value could possibly be found in a compressed tuple before attempting to decompress it.

What we found later, unfortunately, after we did our evaluation of TimescaleDB, is that sparse indexes need to be explicitly enabled via timescaledb.orderby option. Otherwise, TimescaleDB sets it to some default value, which may not always be the most efficient for your queries. We added all columns that we filter on to orderby setting:

– Add columnstore settings
ALTER TABLE device_state SET (
    timescaledb.enable_columnstore,
    timescaledb.segmentby = ‘account_id’,
    timescaledb.orderby = ‘timestamp,device_id,colo,mode,status,client_version,client_platform
);

TimescaleDB at Cloudflare

Following the success with DEX, other teams started exploring TimescaleDB for its simplicity and performance. One notable example is the Zero Trust Analytics & Reporting (ART) team.

The ART team is responsible for generating analytics and long-term reports — spanning months or even years — for Zero Trust products such as Access, Gateway, CASB, and DLP. These datasets live in various ClickHouse and PostgreSQL clusters that we wanted to replicate into a singular home that is specifically designed to unify related, but not co-located data points, together and modeled to address our customer’s analytical needs.

We chose to use TimescaleDB as the aggregation layer on top of raw logs stored elsewhere. We built a system of crawlers using cron jobs that periodically query the multitude of clusters for hourly aggregates across all customers. These aggregates are ingested into TimescaleDB, where we use continuous aggregates to further roll them up into daily and monthly summaries for reporting.

Access and Gateway datasets are massive, often ingesting millions of rows per second. To support arbitrary filters in reporting, crawler queries group by all relevant fields, including high-cardinality columns like IP addresses. This means the downsampling ratio is low, and in some cases, we’re inserting ~100,000 aggregated rows per second. TimescaleDB handles this load just fine, but to support it we made some adjustments:

We switched from bulk INSERTS to COPY. This significantly improved ingestion throughput. We didn’t benchmark it ourselves, but plenty of benchmarks show that COPY performs much better with large batches.
We disabled synchronous replication. In our case, temporary data loss is acceptable — our crawlers are idempotent and can reprocess missing data as needed.
We also disabled fsync. Again, durability is less of a concern for this use case, so skipping disk syncs helped with ingest performance.
We dropped most indexes in hypertables, only kept one on (account_id, timestamp), and relied on aggressive compression and sparse indexes. The absence of indexes helped with insert rates and didn’t have a significant impact on query performance, because only a very small part of the table was uncompressed and relied on traditional btree indexes.

You can see this system in action at Cloudflare Zero Trust Analytics.

Conclusion

Prioritizing core value and resisting the urge to prematurely optimize can accelerate time to market—and sometimes take you on an unexpected journey that leads to better solutions than you’d originally planned. In the early days of DEX, taking a step back to focus on what truly mattered helped us discover TimescaleDB, which turned out to be exactly what we needed.

Not every team needs a hyper-specialized race car that requires 100 octane fuel, carbon ceramic brakes, and ultra-performance race tires: while each one of these elements boost performance, there’s a real cost towards having those items in the form of maintenance and uniqueness. For many teams at Cloudflare, TimescaleDB strikes a phenomenal balance between the simplicity of storing your analytical data under the same roof as your configuration data, while also gaining much of the impressive performance of a specialized OLAP system.

Check out TimescaleDB in action by using our robust analytics, reporting, and digital experience monitoring capabilities on our Zero Trust platform. To learn more, reach out to your account team or sign up directly here.

Building Product Intelligence Platform with Cloudflare Workers

2021-02-04 Robert Cepa

Post Syndicated from Robert Cepa original https://blog.cloudflare.com/building-product-intelligence-platform-with-cloudflare-workers/

Building Product Intelligence Platform with Cloudflare Workers

“You can only improve what you can measure.”

We try to make Cloudflare’s onboarding experience as accessible as possible. For this reason, many customers are able to set up Cloudflare, configure their accounts and products, and discover additional products entirely on their own in our dashboard. Our Customer Onboarding team builds the dashboard experiences that make this possible.

The Onboarding team is data-driven, so we use data to validate our ideas. Rather than shipping the implementation of some idea right away, we run A/B tests with a small percentage of our customers. The results of these tests tell us what we should do with our idea next – either ship it to everyone, try to improve it (and run the test again), or discard it. This practice helps us with hedging our efforts so we don’t waste time on an idea that isn’t fruitful, and it provides us a method to reliably gather more information about needs of our customers. We use a third-party analytics tool to produce data for these A/B tests. This tool helps us to collect and analyse data about how our customers interact with the experiences that we build.

The onboarding experience in the dashboard is just one of many places where our customers interact with Cloudflare. Other Product teams, Customer Success team, and Marketing team build their own experiences in the dashboard and beyond, and they use their own analytics tools that best suit their needs. Each of those teams has different goals, but we all have one thing in common – we want to understand our customers.

For example, knowing how our customers interact with campaigns and emails that Marketing teams build can help us on the Onboarding team to build a better, more personalized onboarding experience. Similarly, understanding how our customers interact with the onboarding experience in the dashboard can help our Marketing team to create more personalized emails and campaigns.

Using multiple third-party analytics tools across multiple teams created many challenges related to data integrity, security, privacy and performance. In this blogpost, we are going to talk about how we used Cloudflare Workers to build our product intelligence platform to overcome these challenges, serving hundreds of millions of requests per month from over 200 cities over the world, close to our customers, all without having to configure and maintain infrastructure.

Motivation: Data integrity, security, privacy, and performance

In the past, teams at Cloudflare used third-party scripts provided by analytics platforms like Google Analytics and Heap to measure user behavior. These scripts presented multiple challenges:

Data Integrity

In the product analytics world, an “event” is any user interaction with the product. Because we were using third-party scripts to send event data to varied analytics destinations, it was hard to make sure that these event data are consistent across all these destinations. In our case, our analytics tools categorized event data in different ways, creating confusion for our teams. For example, if a Cloudflare customer purchased our Workers product, Heap would send an event named “Purchase Workers”, while Google Analytics called it “Product Purchase Success” with a data attribute label: “workers”. Nobody trusted this data, so they sought out more reliable sources, such as billing databases.

Security and privacy

Third-party analytics vendors use third-party scripts to track end-user behavior. We take security and data privacy very seriously, and these scripts pose risks to us and our customers. They are hard to audit, and make it hard to ensure they don’t send data we don’t want them to send. They also change over time, and can be buggy, inefficient, and hard to test.

Performance

We want to give Cloudflare dashboard users a highly performant experience, but third-party scripts can cause slowdowns. For example, they can have a significant size because they try to do a lot of things automatically. Having to load and parse too much JavaScript can extend page load and render times, delay user interaction, and drain more battery. They can also fire too many network requests to multiple servers.

Vendor lock-in

Cloudflare’s dashboard codebase is massive, and hardcoded tracking calls tied to a specific analytics vendor makes that vendor difficult to replace. Moreover, adding a new vendor would require significant effort to add that vendor’s tracking calls everywhere, and would have a negative impact on performance on the frontend.

We wanted to solve these problems by creating a system that would decouple what we measure from how we measure it. The requirements were:

Unified API: a single API that all clients adhere to regardless of the vendor they primarily use. Engineers don’t need to understand how analytics vendors work and what data they require.
Secure and compliant: we fully own and control the code, protecting our customers from vulnerabilities in third-party code. We fully control how our data is measured, distributed, and stored.
Performant: lightweight, fast, and non-blocking on the frontend. Move as much logic as we can to the backend.
Flexible: ability to add/replace/remove vendors with relatively small effort on the backend, and no effort on the frontend.

We chose to use Cloudflare Workers, which deploys serverless code on the edge across the globe, as our backend infrastructure. Workers offers the following advantages:

Nimbleness through serverless development: Our team is small, and analytics wasn’t our primary focus at the time, so we wanted to create something quickly without having to worry about setting up and maintaining the infrastructure. With Workers, we never have to look at things like system health status, or load balancing and scaling, or how fast it is across the world. Everything is included in the package, and works really well.
JavaScript support: Since we work on user experiences, we are mostly UI-engineering focused and use React+TypeScript every day. Our team can write frontend and backend code in the same language, which reduces cognitive load.
Dogfooding opportunities: We help to test Workers at scale, which makes the product stronger.

Iteration #1: Sparrow and Trace Worker

Our analytics platform has gone through multiple iterations. The first version had two components – a JavaScript SDK called Sparrow, and a corresponding worker we call Trace.

Sparrow

The Sparrow SDK turns various data about product events into a consistent format, so internal users don’t need to understand API requirements further down the data pipeline.

Sparrow has 2 main features:

// tracks page visits
sparrow.pageview(pathname: string);

// tracks user interaction
sparrow.track(event: string, properties: Record<string, any>);

The pageview function can be run whenever a page loads in an application, which allows us to track where users navigate.

The track function is more generic. We can send any event name with any metadata. For example, the event name can be “purchase product” with properties: { product: “workers” }.

Both functions create a JSON object with the following interface and send it to the Trace Worker, which forwards it along to various analytics platforms:

{
  event: string,
  deviceId: string,
  userId?: string,
  properties: Record<string, any>
}

Trace Worker

The Trace Worker receives event data from Sparrow, checks payload correctness to make sure the request came from valid sources, and fans out the data to all connected analytics providers. The following diagram shows the pipeline.

Any third-party vendor can be added to Trace Worker, as long as that vendor provides a REST API. How the data are parsed, transformed, and sent to those APIs is implemented by us in our custom functions we call trackers. Trackers aim to replicate the behavior of third-party scripts provided by these vendors. Why are we doing this when we can just use third-party scripts? The main reasons are security and data privacy.

We use allowlists to explicitly define event names and event properties that can be sent further upstream. This helps us to prevent sending potentially sensitive information from cookies, URL query parameters, or data payloads. Every event and data property that is not in the allowlist is ignored.
On top of that, all allowed properties are sanitized by our internal data scrubber.
Always HTTPS: Some third-party scripts still use non-secure HTTP protocol. Trace Worker runs on HTTPS, and we make sure that outgoing requests are also using HTTPS.
We fully control the code, which means there are no surprises – the code cannot update without us knowing it.
Because the logic lives in the worker, our customers are not exposed to unnecessary client-side risks from using eval or document.write.

Another benefit is performance – because most of our analytics framework’s logic lives in the worker, there’s less JavaScript we need to send to the client, which means faster load times! SparrowJS on its own is super lightweight.

Here’s the simplified implementation of Trace Worker:

import trackers from ‘./trackers;
import { generateContext } from ‘./utils’;
Import { sanitize } from ‘./sanitizer’;

addEventListener(‘fetch’, event => {
  event.respondWith(handle(event));
}

async function handle(event: FetchEvent) {
  try {
    const payload = sanitize(await event.request.json());

    const context = await generateContext(event);


    // fan out
    event.waitUntil(Promise.allSettled(trackers.map(tracker => tracker[payload.event === “pageview” ? “pageview” : “event”](payload, context))));

    // return new Response(“OK”, { status:”OK”, statusCode: 200  })
  } catch (err) {
    // logging
    return new Response(“Something went wrong”, { 
      status:”Internal Server Error”, 
      statusCode: 500 
    })
  }
}

trackers is an array of tracking functions for each third-party vendor. Under the hood, they transform the incoming requests from Sparrow and send them to each vendor’s REST APIs. For example, this is a simplified implementation of Google Analytics tracker that transforms Sparrow payloads to adhere to Measurement Protocol:

const URL = ‘https://www.google-analytics.com/collect’;

export async function event(event: TrackingEvent, context: Context) {
  return fetch(URL, {
    method: ‘POST’,
    body: new URLSearchParams({
      ...createCommonParams(context),
      t: ‘event’,
      ea: context.data.event,
      ec: context.data.properties.category || ‘Uncategorized Event’,
      el: context.data.properties.label,
    }).toString()
  });
}

export async function pageview(event: TrackingEvent, context: Context) {
  return fetch(URL, {
    method: ‘POST’,
    body: new URLSearchParams({
      ...createCommonParams(context),
      t: ‘pageview’,
      dp: context.data.event
    }).toString()
  });
}

function createCommonParams(context: Context) {
  return {
    tid: context.gaId,
    v: ‘1’,
    cid: context.data.deviceId,
    uid: context.data.userId,
    ...context.data.properties
  }
}

Similarly, Heap tracker implements its own transformation for https://heapanalytics.com/api/track.

As you may have noticed, Trace Worker is not your typical service worker. There is no origin service – Trace Worker is the service, except it runs everywhere in the network.

Problem: Nobody (still) trusts the data

Iteration #1 of our data analytics’ platform worked well for a while, but as more product teams used Sparrow to run their own analyses, we started getting reports of data not looking right. The reports were along the lines of:

“Google Analytics underreports Heap by x percent…”
“Product purchases are not consistent with DB…”
“Signup conversion dropped by x percent, but we don’t think that’s actually happening…”

At the same time, we added another vendor – Amplitude, which made these problems even more complicated, because we now had three systems out of sync.

Due to the distributed nature of our analytics platform, we had a lot of potential breaking points. To find a solution, we needed to answer questions like:

Dashboard/Sparrow problems

Are product teams using Sparrow correctly?
Do we have any hard redirects that cause request cancellation? Should we try Beacon API and see what changes?
How does Trace Worker respond?

Trace Worker problems

Are we not catching some exceptions?
Are we exceeding runtime limits?
Are we hitting firewall/DDoS protection?

Third-party vendors problems

Are they silently rejecting or not storing some payloads sent from Trace Worker? Google Analytics always responds with 200 OK to any request. How are other vendors handling requests?
Are they having internal issues? How can we know, since these systems are blackboxes?
Can we recover any lost data?

If we release Trace Worker and our event volume takes a nosedive, it’s a pretty strong clue that it’s caused by us and we should roll back.

But what if nothing unusual happens for a while, and then page views drop by 20%? 10%? 5%? Is it us, or third-party vendors, or just a nice sunny day in Europe and so people are not sitting behind their desks? As we had no source of truth to compare these data against, this was impossible to answer.

Regardless, we knew we had to get serious about observability before we even begin asking these questions. At the time, there was no wrangler tail or Workers analytics (there are now!). Also, even though we used Sentry, an app monitoring platform, our logger was a very basic wrapper around Sentry’s REST API, because there was no full-blown Sentry SDK for Workers runtime – the current SDKs use globals, causing race-conditions in Workers.

The goals were:

1. Get better at diagnosing our own problems – catch every exception in Trace Worker and every non-200 HTTP response from third-party vendors, and log it to some visible source, with some helpful stack-traces and other metadata like request headers and body.

2. Be able to isolate problems that may be happening outside of our codebase – have a single source of truth for all incoming/outgoing requests which we can query and compare against data in our third-party tools. This would help us discover dropped requests that weren’t represented with an error state.

Iteration #2: New Sentry SDK for Workers

To meet these goals, we implemented and open-sourced a new Sentry SDK for Workers called toucan-js. Toucan adheres to the Sentry unified API guidelines, so the interface is familiar from other SDKs (node/browser). It currently supports capturing errors and messages with stack-traces that can be enhanced with source maps, breadcrumbs, request data/headers/cookies, tags, and extra metadata.

Since we replaced our simple Sentry logger with toucan-js, every single log started having:

Full request payload and some allowed headers
Stack-trace with source maps
Response status code and body (if applicable)

If anything goes wrong, we have all the information we need to reproduce and fix it – request body, headers, stack-trace, and all necessary context.

At the same time, we started sending cloned requests to the /debug/collect endpoint in Google Analytics Tracker that, combined with Sentry alerts, helped us find many dropped requests due to schema adherence problems such as “The value provided for parameter ‘cid’ is invalid.”

Iteration #3: The single source of truth

Better Sentry logs helped us with major drifts, but the data were still slightly off. While we observed some intermittent HTTP errors in Sentry, when we compared the number of these alerts with differences between different analytics platforms the numbers didn’t add up.

Due to this uncertainty, we decided to own the data layer, and create our own database – the single source of truth of all incoming payloads sent from Sparrow to Trace Worker before any transformation.

In order for us to trust the data in this ‘single source of truth’ database, the database needed to receive Sparrow payloads from outside of Trace Worker, preferably from a system that sits right in front of it, with minimal logic, that changes rarely, and that is highly available. Ideally, this system was to do three things – grab the incoming request payload, log it, and forward it to Trace Worker. These payloads should be logged raw, untouched, corresponding to whatever is sent from clients (SparrowJS).

The nice thing about this solution is that even if Trace Worker gets a bad release, we will not lose any data. Another strong case for us owning the data is that incidents in third-party vendors will not affect us anymore, because the solution will open the door for backfilling of dropped requests.

We considered Workers KV — Cloudflare’s low latency key-value store hosted at the network edge — for our storage needs, but being able to query the data was really important for us, because we wanted to diagnose complex problems quickly and select the data based on some property. For this reason, we went in a different direction.

Google BigQuery was our storage solution

We decided to use Google BigQuery for our ‘single source of truth’ database because:

It was designed for big data
It lets us use SQL to query what we need
We can use REST API in our new system to send the logs

Of course, Google BigQuery is a columnar database. How would we use it to store JSON data?

The first option was to write some kind of transformer that would map every object property to a column, but that was against our requirement of a system with minimal logic. The set of allowed characters we could use to name a column was also limited, so we wouldn’t be able to map column names back to original properties.

Due to these limitations we decided to store raw json strings, and use JSON functions to build views on top of these data.

First, we created a partitioned-by-day table called raw with the following schema:

Field name	Type
eventId	STRING
timestamp	TIMESTAMP
data	STRING

Sparrow’s payloads are stored in the data field as stringified JSON.

We don’t run queries against this table directly. Instead, we built a view called raw_normalized that looks something like this:

select 
  json_extract_scalar(data, '$.event') as event, 
  json_extract_scalar(data, '$.deviceId') as deviceId, 
  json_extract_scalar(data, '$.userId') as userId, 
  json_extract_scalar(data, '$.properties.category') as category, 
  json_extract_scalar(data, '$.properties.productName) as productName 
from raw;

With this setup, we can write complex SQL queries while retaining the original JSON values. To demonstrate on a simple example, when we insert a row with data being:

{
  event: “purchase product”,
  deviceId: “desktop1”,
  userId: “michelle1”,
  properties: { category: “billing”, productName: “workers” }
}

and then run:

select * from data_normalized where event = ‘purchase product’;

we get:

event	deviceId	userId	category	productName
purchase product	desktop1	michelle1	billing	workers

We had our data layer prepared. But how to actually push the data into BigQuery?

Dispatcher Worker

We created another worker, the Dispatcher, that sits in front of Trace Worker! As we said earlier, the sole purpose of Dispatcher Worker is to:

Read the incoming request body
Send it to BigQuery
Forward the incoming request to Trace Worker

The architecture changed to:

Here’s a simplified implementation:

import Toucan from 'toucan-js';
import { BigQueryClient } from “./bigquery”;

const bigQuery = new BigQueryClient({
  serviceAccountEmail: SERVICE_ACCOUNT_EMAIL,
  serviceAccountSecret: SERVICE_ACCOUNT_SECRET,
  projectId: PROJECT_ID,
  datasetId: DATASET_ID
});
 
addEventListener('fetch', event => {
  const toucan = new Toucan({dsn: DSN, event});
 
  // do the work without blocking the response
  event.waitUntil(dispatch(event, biqQuery, toucan));

  event.respondWith(return new Response('OK', {
      status: 200,
      statusText: 'OK'
  });
});
 
async function dispatch(event: FetchEvent, bigQuery: BigQueryClient, toucan: Toucan) { 
  try {
     // Original request to be sent to Trace Worker
    const requestOriginal = event.request;
    
    // We clone the request here to allow multiple uses of Body
    const requestClone = requestOriginal.clone();
    
    // read the request payload
    const payload = await requestClone.text();
 
    // create a timestamp
    const timestamp = Date.now();
 
    // send to BQ
    const bigQueryResponse = await bigQuery.insertRow({timestamp, json: payload});
     
    // log failed logs
    if (!bigQueryResponse.ok) {    
       sentry.captureException(await HttpError.fromResponse(bigQueryResponse));
    }
 
    // send to trace worker
    const traceResponse = await fetch(TRACE_WORKER_URL, requestOriginal)

    // log failed logs
    if (!traceResponse.ok) {    
       toucan.captureException(await HttpError.fromResponse(traceResponse));
    }
  } catch (err) {
   toucan.captureException(err);
  }
}

BigQueryClient is a lightweight SDK we implemented to be able to send data to BigQuery. Internally, it builds a request and sends it to Google Cloud Platform using their Stream API. We won’t go into details, but we want to briefly cover how we handle authentication.

Google Cloud APIs use the OAuth 2.0 protocol for authenticating both user accounts and service accounts. In short, the protocol involves building a signed JWT (JSON Web Token), sending it to Google Authorization Server to obtain access token, and sending that access token with all requests to GCP API.

We tried a few libraries to help us build that JWK (such as jsonwebtoken), but they were too slow due to their RSA implementation, and we were hitting runtime limits. So we implemented our own JWT builder using SubtleCrypto, which is a web standard that is also implemented in Cloudflare Workers!

JSON Web Tokens consist of 3 parts:

Header
Body
Signature

First, we build the header:

const tokenHeader = base64UrlEncode(JSON.stringify({
  alg: ‘RS256’,
  typ: ‘JWT’
}));

Then we build the token body, and concatenate with the header to build the token base:

const nowSeconds = Date.now() / 1000;
const tokenTtl = 3600;
const expire = nowSeconds + tokenTtl;

const tokenBody = base64UrlEncode(JSON.stringify({
  iss: SERVICE_ACCOUNT_EMAIL,
  scope : ‘https://www.googleapis.com/auth/bigquery.insertdata’
  aud: ‘https://www.googleapis.com/oauth2/v4/token’,
  exp: expire,
  iat: nowSeconds
}));
 
const tokenBase = `${tokenHeader}.${tokenBody}`;

All that’s left is signing the token base:

const signature = base64UrlEncode(
  arrayBufferToString(
    await crypto.subtle.sign(
      {
        name: ‘RSASSA-PKCS1-v1_5’,
        hash: { name: ‘SHA-256’ }
      },
     SERVICE_ACCOUNT_JWK,
     new TextEncoder.encode(tokenBase)
    )
  )
);
 
const jwt = `${tokenBase}.${signature}`;

Once we have the JWT, our SDK sends a request to the Authorization Server to retrieve the access token:

const token = await (await fetch (‘https://www.googleapis.com/oauth2/v4/token’, {
  method: ‘POST’,
  body: 'grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer&assertion=' + jwt,
  headers: { ‘content-type’: ‘application/x-www-form-urlencoded’ }
})).json();

The access token doesn’t need to be requested with every FetchEvent – it can be reused until it expires. Caching the token helps with performance, because RSA encryption is costly.

BigQueryClient stores the access token in a global variable, so all isolates that share the environment can use it. In fact, that’s the reason we initialize BigQueryClient outside of addEventListener. The SDK manages the token internally and handles the OAuth2 ceremony for the clients – the first call of insertAll generates a JWT to retrieve and store an access token, but subsequent calls of insertAll use the access token from the memory.

With all the pieces put together, this is the state of our analytics pipeline today.

Future work: Risk management

Logging all data gives us great visibility and makes debugging easier. We now have a clear picture of where in the pipeline the problems are, and we have all possible information to fix them. We can react to problems pretty well, but we would like to get better at preventing problems in production before they happen.

We currently have two environments: staging and production. Our staging environment is behind Cloudflare Access, only accessible to Cloudflare employees. When we merge our changes, the CI pipeline automatically deploys them to the staging environment where we can test these changes before they get to production.

While our staging environment helps us with catching catastrophic errors early, it’s not too great for finding errors that may cause partial data drops, because the staging traffic is generally very low, which makes it harder to spot changes in data patterns. From this perspective, our releases are very risky.

To reduce these risks, we need to test our changes with production traffic. We can achieve that with a variant of blue-green deployment approach, ensuring we have two production environments, as identical as possible. We could use another Worker, sitting in front of Blue and Green Dispatcher Workers, receiving live traffic and duplicating all requests to blue/green variants. One of these variants, green for example, would be using live code and live data sources, and the other one, blue, would be in the final stage of testing, with our new code, logging to the staging data sources. Once we validate the data in staging data sources, we can easily cut-over by switching data sources between green and blue.

Future work: Durability

Outages of third-party vendors don’t affect us anymore, since we essentially own the data layer and can backfill dropped events once a vendor is fixed and back online.

This presents us with a question: How can we automate this with Workers?

How can we make sure that everything in our source of truth will eventually get stuffed to the analytics tools?

Can we implement a true distributed and durable streaming platform like Kafka at the edge?

We think we can! More about this next time.

Noise

All posts by Robert Cepa

How TimescaleDB helped us scale analytics and reporting

Designing for future growth

Making Digital Experience Monitoring simple

Using PostgreSQL for analytics

About order of columns in multicolumn indexes

Launch and improvements

Optimizing query performance with precomputed aggregates

Table partitioning

TimescaleDB

Evaluating TimescaleDB for DEX

What makes columnstore so fast?

TimescaleDB at Cloudflare

Conclusion

Building Product Intelligence Platform with Cloudflare Workers

Motivation: Data integrity, security, privacy, and performance

Data Integrity

Security and privacy

Performance

Vendor lock-in

Iteration #1: Sparrow and Trace Worker

Sparrow

Trace Worker

Problem: Nobody (still) trusts the data

Dashboard/Sparrow problems

Trace Worker problems

Third-party vendors problems

Iteration #2: New Sentry SDK for Workers

Iteration #3: The single source of truth

Google BigQuery was our storage solution

Dispatcher Worker

Future work: Risk management

Future work: Durability

The collective thoughts of the interwebz