All posts by Jamie Herre

Cloudflare incident on November 14, 2024, resulting in lost logs

2024-11-26 Jamie Herre

Post Syndicated from Jamie Herre original https://blog.cloudflare.com/cloudflare-incident-on-november-14-2024-resulting-in-lost-logs

On November 14, 2024, Cloudflare experienced an incident which impacted the majority of customers using Cloudflare Logs. During the roughly 3.5 hours that these services were impacted, about 55% of the logs we normally send to customers were not sent and were lost. We’re very sorry this happened, and we are working to ensure that a similar issue doesn’t happen again.

This blog post explains what happened and what we’re doing to prevent recurrences. Also, the systems involved and the particular class of failure we experienced will hopefully be of interest to engineering teams beyond those specifically using these products.

Failures within systems at scale are inevitable, and it’s essential that subsystems protect themselves from failures in other parts of the larger system to prevent cascades. In this case, a misconfiguration in one part of the system caused a cascading overload in another part of the system, which was itself misconfigured. Had it been properly configured, it could have prevented the loss of logs.

Background

Cloudflare’s network is a globally distributed system enabling and supporting a wide variety of services. Every part of this system generates event logs which contain detailed metadata about what’s happening with our systems around the world. For example, an event log is generated for every request to Cloudflare’s CDN. Cloudflare Logs makes these event logs available to customers, who use them in a number of ways, including compliance, observability, and accounting.

On a typical day, Cloudflare sends about 4.5 trillion individual event logs to customers. Although this represents less than 10% of the over 50 trillion total customer event logs processed, it presents unique challenges of scale when building a reliable and fault-tolerant system.

System architecture

Cloudflare’s network is composed of tens of thousands of individual servers, network hardware components, and specialized software programs located in over 330 cities around the world. Although Cloudflare’s Edge Log Delivery product will send customers their event logs directly from each server, most customers opt not to do this because doing so will create significant complication and cost at the receiving end.

By analogy, imagine the postal service ringing your doorbell once for each letter instead of once for each packet of letters. With thousands or millions of letters each second, the number of separate transactions that would entail becomes prohibitive.

Fortunately, we also offer Logpush, which collects and pushes logs to customers in more predictable file sizes and which scales automatically with usage. In order to provide this feature several services work together to collect and push the logs, as illustrated in the diagram below:

Logfwdr

Logfwdr is an internal service written in Golang that accepts event logs from internal services running across Cloudflare’s global network and forwards them in batches to a service called Logreceiver. Logfwdr handles many different types of event logs, and one of its responsibilities is to determine which event logs should be forwarded and where they should be sent based on the type of event log, which customers it represents, and associated rules about where it should be processed. Configuration is provided to Logfwdr to enable it to make these determinations.

Logreceiver

Logreceiver (also written in Golang) accepts the batches of logs from across Cloudflare’s global network and further sorts them depending on the type of event and its purpose. For Cloudflare Logs, Logreceiver demultiplexes the batches into per-customer batches and forwards them to be buffered by Buftee. Currently, Logreceiver is handling about 45 PB (uncompressed) of customer event logs each day.

Buftee

It’s common for data pipelines to include a buffer. Producers and consumers of the data might be operating at different cadences, and parts of the pipeline will experience variances in how quickly they can process information. Using a buffer makes it easier to manage these situations, and helps to prevent data loss if downstream consumers are broken. It’s also convenient to have a buffer that supports multiple downstream consumers with different cadences (like the pipe fitting function of a tee.)

At Cloudflare, we use an internal system called Buftee (written in Golang) to support this combined function. Buftee is a highly distributed system which supports a large number of named “buffers”. It supports operating on named “prefixes” (collections of buffers) as well as multiple representations/resolutions of the same time-indexed dataset. Using Buftee makes it possible for Cloudflare to handle extremely high throughput very efficiently.

For Cloudflare Logs, Buftee provides buffers for each Logpush job, containing 100% of the logs generated by the zone or account referenced by each job. This means that failure to process one customer’s job will not affect progress on another customer’s job. Handling buffers in this way avoids “head of line” blocking and also enables us to encrypt and delete each customer’s data separately if needed.

Buftee typically handles over 1 million buffers globally. The following is a snapshot of the number of buffers managed by Buftee servers in the period just prior to the incident.

Logpush

Logpush is a Golang service which reads logs from Buftee buffers and pushes the results in batches to various destinations configured by customers. A batch could end up, for example, as a file in R2. Each job has a unique configuration, and only jobs that are active and configured will be pushed. Currently, we push over 600 million such batches each day.

What happened

On November 14, 2024, we made a change to support an additional dataset for Logpush. This required adding a new configuration to be provided to Logfwdr in order for it to know which customers’ logs to forward for this new stream. Every few minutes, a separate system re-generates the configuration used by Logfwdr to decide which logs need to be forwarded. A bug in this system resulted in a blank configuration being provided to Logfwdr.

This bug essentially informed Logfwdr that no customers had logs configured to be pushed. The team quickly noticed the mistake and reverted the change in under five minutes.

Unfortunately, this first mistake triggered a second, latent bug in Logfwdr itself. A failsafe introduced in the early days of this feature, when traffic was much lower, was configured to “fail open”. This failsafe was designed to protect against a situation when this specific Logfwdr configuration was unavailable (as in this case) by transmitting events for all customers instead of just those who had configured a Logpush job. This was intended to prevent the loss of logs at the expense of sending more logs than strictly necessary when individual hosts were prevented from getting the configuration due to intermittent networking errors, for example.

When this failsafe was first introduced, the potential list of customers was smaller than it is today. This small window of less than five minutes resulted in a massive spike in the number of customers whose logs were sent by Logfwdr.

Even given this massive overload, our systems would have continued to send logs if not for one additional problem. Remember that Buftee creates a separate buffer for each customer with their logs to be pushed. When Logfwdr began to send event logs for all customers, Buftee began to create buffers for each one as those logs arrived, and each buffer requires resources as well as the bookkeeping to maintain them. This massive increase, resulting in roughly 40 times more buffers, is not something we’ve provisioned Buftee clusters to handle. In the lead-up to impact, Buftee was managing 40 million buffers globally, as shown in the figure below.

A short temporary misconfiguration lasting just five minutes created a massive overload that took us several hours to fix and recover from. Because our backstops were not properly configured, the underlying systems became so overloaded that we could not interact with them normally. A full reset and restart was required.

Root causes

The bug in the Logfwdr configuration system was easy to fix, but it’s the type of bug that was likely to happen at some point. We had planned for it by designing the original “fail open” behavior. However, we neglected to regularly test that the broader system was capable of handling a fail open event.

The bigger failure was that Buftee became unresponsive. Buftee’s purpose is to be a safeguard against bugs like this one. A huge increase in the number of buffers is a failure mode that we had predicted, and had put mechanisms in Buftee to prevent this failure from cascading. Our failure in this case was that we had not configured these mechanisms. Had they been configured correctly, Buftee would not have been overwhelmed.

It’s like having a seatbelt in a car, yet not fastening it. The seatbelt is there to protect you in case of an accident but if you don’t actually buckle it up, it’s not going to do its job when you need it. Similarly, while we had the safeguard of Buftee in place, we hadn’t ‘buckled it up’ by configuring the necessary settings. We’re very sorry this happened and are taking steps to prevent a recurrence as described below.

Going forward

We’re creating alerts to ensure that these particular misconfigurations will be impossible to miss, and we are also addressing the specific bug and the associated tests that triggered this incident.

Just as importantly, we accept that mistakes and misconfigurations are inevitable. All our systems at Cloudflare need to respond to these predictably and gracefully. Currently, we conduct regular “cut tests” to ensure that these systems will cope with the loss of a datacenter or a network failure. In the future, we’ll also conduct regular “overload tests” to simulate the kind of cascade which happened in this incident to ensure that our production systems will handle them gracefully.

Logpush is a robust and flexible platform for customers who need to integrate their own logging and monitoring systems with Cloudflare. Different Logpush jobs can be deployed to support multiple destinations or, with filtering, multiple subsets of logs.

Explaining Cloudflare’s ABR Analytics

2020-09-29 Jamie Herre

Post Syndicated from Jamie Herre original https://blog.cloudflare.com/explaining-cloudflares-abr-analytics/

Explaining Cloudflare's ABR Analytics

Cloudflare’s analytics products help customers answer questions about their traffic by analyzing the mind-boggling, ever-increasing number of events (HTTP requests, Workers requests, Spectrum events) logged by Cloudflare products every day. The answers to these questions depend on the point of view of the question being asked, and we’ve come up with a way to exploit this fact to improve the quality and responsiveness of our analytics.

Useful Accuracy

Consider the following questions and answers:

What is the length of the coastline of Great Britain? 12.4K km
What is the total world population? 7.8B
How many stars are in the Milky Way? 250B
What is the total volume of the Antarctic ice shelf? 25.4M km³
What is the worldwide production of lentils? 6.3M tonnes
How many HTTP requests hit my site in the last week? 22.6M

Useful answers do not benefit from being overly exact. For large quantities, knowing the correct order of magnitude and a few significant digits gives the most useful answer. At Cloudflare, the difference in traffic between different sites or when a single site is under attack can cross nine orders of magnitude and, in general, all our traffic follows a Pareto distribution, meaning that what’s appropriate for one site or one moment in time might not work for another.

Because of this distribution, a query that scans a few hundred records for one customer will need to scan billions for another. A report that needs to load a handful of rows under normal operation might need to load millions when a site is under attack.

To get a sense of the relative difference of each of these numbers, remember “Powers of Ten”, an amazing visualization that Ray and Charles Eames produced in 1977. Notice that the scale of an image determines what resolution is practical for recording and displaying it.

Using ABR to determine resolution

This basic fact informed our design and implementation of ABR for Cloudflare analytics. ABR stands for “Adaptive Bit Rate”. It’s essentially an eponym for the term as used in video streaming such as Cloudflare’s own Stream Delivery. In those cases, the server will select the best resolution for a video stream to match your client and network connection.

In our case, every analytics query that supports ABR will be calculated at a resolution matching the query. For example, if you’re interested to know from which country the most firewall events were generated in the past week, the system might opt to use a lower resolution version of the firewall data than if you had opted to look at the last hour. The lower resolution version will provide the same answer but take less time and fewer resources. By using multiple, different resolutions of the same data, our analytics can provide consistent response times and a better user experience.

You might be aware that we use a columnar store called ClickHouse to store and process our analytics data. When using ABR with ClickHouse, we write the same data at multiple resolutions into separate tables. Usually, we cover seven orders of magnitude – from 100% to 0.0001% of the original events. We wind up using an additional 12% of disk storage but enable very fast ad hoc queries on the reduced resolution tables.

Aggregations and Rollups

The ABR technique facilitates aggregations by making compact estimates of every dimension. Another way to achieve the same ends is with a system that computes “rollups”. Rollups save space by computing either complete or partial aggregations of the data as it arrives.

For example, suppose we wanted to count a total number of lentils. (Lentils are legumes and among the oldest and most widely cultivated crops. They are a staple food in many parts of the world.) We could just count each lentil as it passed through the processing system. Of course because there a lot of lentils, that system is distributed – meaning that there are hundreds of separate machines. Therefore we’ll actually have hundreds of separate counters.

Also, we’ll want to include more information than just the count, so we’ll also include the weight of each lentil and maybe 10 or 20 other attributes. And of course, we don’t want just a total for each attribute, but we’ll want to be able to break it down by color, origin, distributor and many other things, and also we’ll want to break these down by slices of time.

In the end, we’ll have tens of thousands or possibly millions of aggregations to be tabulated and saved every minute. These aggregations are expensive to compute, especially when using aggregations more complicated than simple counters and sums. They also destroy some information. For example, once we’ve processed all the lentils through the rollups, we can’t say for sure that we’ve counted them all, and most importantly, whichever attributes we neglected to aggregate are unavailable.

The number we’re counting, 6.3M tonnes, only includes two significant digits which can easily be achieved by counting a sample. Most of the rollup computations used on each lentil (on the order 1013 to account for 6.3M tonnes) are wasted.

Other forms of aggregations

So far, we’ve discussed ABR and its application to aggregations, but we’ve only given examples involving “counts” and “sums”. There are other, more complex forms of aggregations we use quite heavily. Two examples are “topK” and “count-distinct”.

A “topK” aggregation attempts to show the K most frequent items in a set. For example, the most frequent IP address, or country. To compute topK, just count the frequency of each item in the set and return the K items with the highest frequencies. Under ABR, we compute topK based on the set found in the matching resolution sample. Using a sample makes this computation a lot faster and less complex, but there are problems.

The estimate of topK derived from a sample is biased and dependent on the distribution of the underlying data. This can result in overestimating the significance of elements in the set as compared to their frequency in the full set. In practice this effect can only be noticed when the cardinality of the set is very high and you’re not going to notice this effect on a Cloudflare dashboard. If your site has a lot of traffic and you’re looking at the Top K URLs or browser types, there will be no difference visible at different resolutions. Also keep in mind that as long as we’re estimating the “proportion” of the element in the set and the set is large, the results will be quite accurate.

The other fascinating aggregation we support is known as “count-distinct”, or number of uniques. In this case we want to know the number of unique values in a set. For example, how many unique cache keys have been used. We can safely say that a uniform random sample of the set cannot be used to estimate this number. However, we do have a solution.

We can generate another, alternate sample based on the value in question. For example, instead of taking a random sample of all requests, we take a random sample of IP addresses. This is sometimes called distinct reservoir sampling, and it allows us to estimate the true number of distinct IPs based on the cardinality of the sampled set. Again, there are techniques available to improve these estimates, and we’ll be implementing some of those.

ABR improves resilience and scalability

Using ABR saves us resources. Even better, it allows us to query all the attributes in the original data, not just those included in rollups. And even better, it allows us to check our assumptions against different sample intervals in separate tables as a check that the system is working correctly, because the original events are preserved.

However, the greatest benefits of employing ABR are the ones that aren’t directly visible. Even under ideal conditions, a large distributed system such as Cloudflare’s data pipeline is subject to high tail latency. This occurs when any single part of the system takes longer than usual for any number of a long list of reasons. In these cases, the ABR system will adapt to provide the best results available at that moment in time.

For example, compare this chart showing Cache Performance for a site under attack with the same chart generated a moment later while we simulate a failure of some of the servers in our cluster. In the days before ABR, your Cloudflare dashboard would fail to load in this scenario. Now, with ABR analytics, you won’t see significant degradation.

Stretching the analogy to ABR in video streaming, we want you to be able to enjoy your analytics dashboard without being bothered by issues related to faulty servers, or network latency, or long running queries. With ABR you can get appropriate answers to your questions reliably and within a predictable amount of time.

In the coming months, we’re going to be releasing a variety of new dashboards and analytics products based on this simple but profound technology. Watch your Cloudflare dashboard for increasingly useful and interactive analytics.

Noise