Tag Archives: Post Mortem

1.1.1.1 lookup failures on October 4th, 2023

Post Syndicated from Ólafur Guðmundsson original http://blog.cloudflare.com/1-1-1-1-lookup-failures-on-october-4th-2023/

On 4 October 2023, Cloudflare experienced DNS resolution problems starting at 07:00 UTC and ending at 11:00 UTC. Some users of 1.1.1.1, or of products like WARP, Zero Trust, or third-party DNS resolvers that use 1.1.1.1, may have received SERVFAIL DNS responses to valid queries. We’re very sorry for this outage. The outage was caused by an internal software error and was not the result of an attack. In this blog, we’re going to talk about what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again.

Background

In the Domain Name System (DNS), every domain name exists within a DNS zone. The zone is a collection of domain names and host names that are controlled together. For example, Cloudflare is responsible for the domain name cloudflare.com, which we say is in the “cloudflare.com” zone. The .com top-level domain (TLD) is operated by a third party and is in the “com” zone. It gives directions on how to reach cloudflare.com. Above all of the TLDs is the root zone, which gives directions on how to reach TLDs. This means that the root zone is important in being able to resolve all other domain names. Like other important parts of the DNS, the root zone is signed with DNSSEC, which means the root zone itself contains cryptographic signatures.

The root zone is published on the root servers, but it is also common for DNS operators to retrieve and retain a copy of the root zone automatically so that in the event that the root servers cannot be reached, the information in the root zone is still available. Cloudflare’s recursive DNS infrastructure takes this approach as it also makes the resolution process faster. New versions of the root zone are normally published twice a day. 1.1.1.1 has a WebAssembly app called static_zone running on top of the main DNS logic that serves those new versions when they are available.
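
As an illustration of this retrieve-and-retain approach (a sketch only, not Cloudflare's actual pipeline), an operator could mirror the published root zone and track its version by SOA serial. The URL below is the public copy of the root zone referenced later in this post; the helper names are ours.

import urllib.request

ROOT_ZONE_URL = "https://www.internic.net/domain/root.zone"  # public copy of the root zone

def fetch_root_zone() -> str:
    """Download the current root zone in presentation (text) format."""
    with urllib.request.urlopen(ROOT_ZONE_URL, timeout=30) as resp:
        return resp.read().decode("utf-8")

def soa_serial(zone_text: str) -> int:
    """Return the SOA serial, which identifies the zone version (e.g. 2023100400)."""
    for line in zone_text.splitlines():
        fields = line.split()
        if len(fields) >= 7 and fields[3] == "SOA":
            return int(fields[6])
    raise ValueError("no SOA record found in zone")

if __name__ == "__main__":
    print("root zone serial:", soa_serial(fetch_root_zone()))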

What happened

On 21 September, as part of a known and planned change in root zone management, a new resource record type was included in the root zones for the first time. The new resource record is named ZONEMD, and is in effect a checksum for the contents of the root zone.

The root zone is retrieved by software running in Cloudflare’s core network. It is subsequently redistributed to Cloudflare’s data centers around the world. After the change, the root zone containing the ZONEMD record continued to be retrieved and distributed as normal. However, the 1.1.1.1 resolver systems that make use of that data had problems parsing the ZONEMD record. Because zones must be loaded and served in their entirety, the system’s failure to parse ZONEMD meant the new versions of the root zone were not used in Cloudflare’s resolver systems. Some of the servers hosting Cloudflare's resolver infrastructure failed over to querying the DNS root servers directly on a request-by-request basis when they did not receive the new root zone. However, others continued to rely on the known working version of the root zone still available in their memory cache, which was the version pulled on 21 September before the change.

On 4 October 2023 at 07:00 UTC, the DNSSEC signatures in the version of the root zone from 21 September expired. Because there was no newer version that the Cloudflare resolver systems were able to use, some of Cloudflare’s resolver systems stopped being able to validate DNSSEC signatures and as a result started sending error responses (SERVFAIL). The rate at which Cloudflare resolvers generated SERVFAIL responses grew by roughly 12 percentage points. The diagrams below illustrate the progression of the failure and how it became visible to users.

[Diagrams: progression of the failure and how it became visible to users]

Incident timeline and impact

21 September 6:30 UTC: Last successful pull of the root zone
4 October 7:00 UTC: DNSSEC signatures in the root zone obtained on 21 September expired, causing an increase in SERVFAIL responses to client queries.
7:57: First external reports of unexpected SERVFAILs started coming in.
8:03: Internal Cloudflare incident declared.
8:50: Initial attempt made at stopping 1.1.1.1 from serving responses using the stale root zone file with an override rule.
10:30: Stopped 1.1.1.1 from preloading the root zone file entirely.
10:32: Responses returned to normal.
11:02: Incident closed.

The chart below shows the timeline of impact along with the percentage of DNS queries that returned a SERVFAIL error:

[Chart: percentage of DNS queries returning SERVFAIL during the incident]

We expect a baseline volume of SERVFAIL errors for regular traffic during normal operation. Usually that percentage sits at around 3%. These SERVFAILs can be caused by legitimate issues in the DNSSEC chain, failures to connect to authoritative servers, authoritative servers taking too long to respond, and many others. During the incident the amount of SERVFAILs peaked at 15% of total queries, although the impact was not evenly distributed around the world and was mainly concentrated in our larger data centers like Ashburn, Virginia; Frankfurt, Germany; and Singapore.

Why this incident happened

Why parsing the ZONEMD record failed

DNS has a binary format for storing resource records. In this binary format the type of the resource record (TYPE)  is stored as a 16-bit integer. The type of resource record determines how the resource data (RDATA) is parsed. When the record type is 1, this means it is an A record, and the RDATA can be parsed as an IPv4 address. Record type 28 is an AAAA record, whose RDATA can be parsed as an IPv6 address instead. When a parser runs into an unknown resource type it won’t know how to parse its RDATA, but fortunately it doesn’t have to: the RDLENGTH field indicates how long the RDATA field is, allowing the parser to treat it as an opaque data element.

                                   1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                                               |
    /                                               /
    /                      NAME                     /
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TYPE                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                     CLASS                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TTL                      |
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                   RDLENGTH                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
    /                     RDATA                     /
    /                                               /
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
RFC 1035
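
As a concrete sketch of the point above (Python, illustrative only, and not the resolver's actual code), a wire-format parser can read these fixed fields and, because RDLENGTH says exactly how many RDATA bytes follow, carry an unknown type as opaque bytes instead of failing:

import struct

# Parse the fixed part of a wire-format resource record (RFC 1035, section 3.2.1).
# `offset` is assumed to point just past the already-decoded NAME field.
KNOWN_TYPES = {1: "A", 2: "NS", 6: "SOA", 28: "AAAA"}

def parse_record(buf: bytes, offset: int):
    rtype, rclass, ttl, rdlength = struct.unpack_from("!HHIH", buf, offset)
    offset += 10
    rdata = buf[offset:offset + rdlength]
    offset += rdlength

    if rtype in KNOWN_TYPES:
        name = KNOWN_TYPES[rtype]   # type-specific RDATA parsing would go here
    else:
        name = f"TYPE{rtype}"       # unknown type: keep the RDATA as opaque bytes
    return (name, rclass, ttl, rdata), offset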

The reason static_zone didn’t support the new ZONEMD record is that, up until now, we had chosen to distribute the root zone internally in its presentation (text) format rather than in the binary format. Looking at the text representation of a few resource records, we can see there is a lot more variation in how different records are presented.

.			86400	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2023100400 1800 900 604800 86400
.			86400	IN	RRSIG	SOA 8 0 86400 20231017050000 20231004040000 46780 . J5lVTygIkJHDBt6HHm1QLx7S0EItynbBijgNlcKs/W8FIkPBfCQmw5BsUTZAPVxKj7r2iNLRddwRcM/1sL49jV9Jtctn8OLLc9wtouBmg3LH94M0utW86dKSGEKtzGzWbi5hjVBlkroB8XVQxBphAUqGxNDxdE6AIAvh/eSSb3uSQrarxLnKWvHIHm5PORIOftkIRZ2kcA7Qtou9NqPCSE8fOM5EdXxussKChGthmN5AR5S2EruXIGGRd1vvEYBrRPv55BAWKKRERkaXhgAp7VikYzXesiRLdqVlTQd+fwy2tm/MTw+v3Un48wXPg1lRPlQXmQsuBwqg74Ts5r8w8w==
.			518400	IN	NS	a.root-servers.net.
.			86400	IN	ZONEMD	2023100400 1 241 E375B158DAEE6141E1F784FDB66620CC4412EDE47C8892B975C90C6A102E97443678CCA4115E27195B468E33ABD9F78C
Example records taken from https://www.internic.net/domain/root.zone

When we run into an unknown resource record it’s not always easy to know how to handle it. Because of this, the library we use to parse the root zone at the edge does not make an attempt at doing so, and instead returns a parser error.
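
A presentation-format parser, by contrast, typically dispatches on the type mnemonic, so a mnemonic it has never seen stops the whole zone load. The sketch below is illustrative and is not the library Cloudflare uses; RFC 3597 does define a generic text syntax for unknown types, but a parser still has to implement it.

def parse_a(rdata: str):
    return ("A", rdata)             # e.g. "198.41.0.4"

def parse_ns(rdata: str):
    return ("NS", rdata)            # e.g. "a.root-servers.net."

# Presentation-format parsing needs per-type knowledge of the RDATA syntax.
TEXT_PARSERS = {"A": parse_a, "NS": parse_ns}

def parse_line(line: str):
    owner, ttl, rclass, rtype, rdata = line.split(None, 4)
    if rtype not in TEXT_PARSERS:
        # A mnemonic such as ZONEMD has no parser entry, so loading the zone fails here.
        raise ValueError(f"unknown record type: {rtype}")
    return TEXT_PARSERS[rtype](rdata)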

Why a stale version of the root zone was used

The static_zone app, tasked with loading and parsing the root zone so it can be served locally (RFC 7706), stores the latest version in memory. When a new version is published, the app parses it and, once parsing succeeds, drops the old version. However, because parsing failed, the static_zone app never switched to a newer version and instead continued using the old version indefinitely. When the 1.1.1.1 service is first started, the static_zone app has no existing version in memory. If it then fails to parse the root zone, there is no older version to fall back on, so it instead queries the root servers directly for incoming requests.
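
In outline, that version handling looks something like the following sketch (parse_zone and query_root_servers are hypothetical stand-ins, not real function names):

class StaticZone:
    """Sketch of the version handling described above; not the real static_zone app."""

    def __init__(self):
        self.zone = None                        # last successfully parsed root zone, if any

    def load_new_version(self, zone_text: str):
        try:
            self.zone = parse_zone(zone_text)   # hypothetical parser
        except ValueError:
            # Parsing failed: keep whatever version is already in memory. This is
            # how the 21 September copy stayed in use until its signatures expired.
            pass

    def answer(self, query):
        if self.zone is None:
            # Never had a usable copy (e.g. freshly restarted): query the
            # root servers directly for this request instead.
            return query_root_servers(query)    # hypothetical fallback
        return self.zone.lookup(query)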

Why the initial attempt at disabling static_zone didn’t work

Initially we tried to disable the static_zone app through override rules, a mechanism that allows us to programmatically change some behavior of 1.1.1.1. The rule we deployed was:

phase = pre-cache set-tag rec_disable_static

For any incoming request this rule adds the tag rec_disable_static to the request. Inside the static_zone app we check for this tag and, if it’s set, we do not return a response from the cached, static root zone. However, to improve cache performance queries are sometimes forwarded to another node if the current node can’t find the response in its own cache. Unfortunately, the rec_disable_static tag is not included in the queries being forwarded to other nodes, which caused the static_zone app to continue replying with stale information until we eventually disabled the app entirely.
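
Schematically, the interaction between the override tag and cache forwarding described above looks like this (an illustrative sketch only; every helper here is hypothetical):

def handle_query(query, tags, cache, peer_node):
    if "rec_disable_static" not in tags and static_zone_has(query):   # hypothetical check
        return answer_from_static_zone(query)                         # hypothetical helper

    answer = cache.get(query)
    if answer is None:
        # To improve cache performance, forward to another node that may already
        # have the answer. Bug: the tags were not forwarded, so the peer's
        # static_zone app still answered from its stale copy of the root zone.
        answer = peer_node.resolve(query)                             # tags dropped here
        cache.put(query, answer)
    return answer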

Why the impact was partial

Cloudflare regularly performs rolling reboots of the servers that host our services for tasks like kernel updates that can only take effect after a full system restart. At the time of this outage, resolver server instances that were restarted between the ZONEMD change and the DNSSEC invalidation did not contribute to impact. Because they had restarted during this two-week period, they failed to load the root zone on startup and fell back to resolving by sending DNS queries to the root servers instead. In addition, the resolver uses a technique called serve stale (RFC 8767) in order to continue serving popular records from a potentially stale cache and limit the impact. A record is considered stale once its TTL has elapsed since it was retrieved from upstream. This prevented a total outage; impact was mainly felt in our largest data centers, which had many servers that had not restarted the 1.1.1.1 service in that timeframe.
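
The serve-stale decision is roughly the following (a sketch based on RFC 8767; the maximum stale window and the upstream_is_failing helper are assumptions for illustration, not the resolver's actual configuration):

import time

MAX_STALE_SECONDS = 86_400          # assumed stale window for this sketch

def answer_from_cache(entry, now=None):
    """entry is assumed to look like {"rrset": ..., "ttl": ..., "fetched_at": ...}."""
    now = now if now is not None else time.time()
    age = now - entry["fetched_at"]
    if age <= entry["ttl"]:
        return entry["rrset"]                                   # still fresh
    if age <= entry["ttl"] + MAX_STALE_SECONDS and upstream_is_failing():
        return entry["rrset"]                                   # stale, but better than SERVFAIL
    return None                                                 # caller must refetch upstream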

Remediation and follow-up steps

This incident had widespread impact, and we take the availability of our services very seriously. We have identified several areas of improvement and will continue to work on uncovering any other gaps that could cause a recurrence.

Here is what we are working on immediately:

Visibility: We’re adding alerts to notify when static_zone serves a stale root zone file. It should not have been the case that serving a stale root zone file went unnoticed for as long as it did. If we had been monitoring this better, with the caching that exists, there would have been no impact. It is our goal to protect our customers and their users from upstream changes.

Resilience: We will re-evaluate how we ingest and distribute the root zone internally. Our ingestion and distribution pipelines should handle new RRTYPEs seamlessly, and any brief interruption to the pipeline should be invisible to end users.

Testing: Despite having tests in place around this problem, including tests related to unreleased changes in parsing the new ZONEMD records, we did not adequately test what happens when the root zone fails to parse. We will improve our test coverage and the related processes.

Architecture: We should not use stale copies of the root zone past a certain point. While it’s certainly possible to continue to use stale root zone data for a limited amount of time, past a certain point there are unacceptable operational risks. We will take measures to ensure that the lifetime of cached root zone data is better managed as described in RFC 8806: Running a Root Server Local to a Resolver.

Conclusion

We are deeply sorry that this incident happened. There is one clear message from this incident: do not ever assume that something is not going to change! Many modern systems are built with a long chain of libraries that are pulled into the final executable, and each one of those may have bugs or may not be updated early enough for programs to operate correctly when their inputs change. We understand how important it is to have good testing in place that detects regressions, and to build systems and components that fail gracefully when inputs change. We understand that we need to always assume that “format” changes in the most critical systems of the internet (DNS and BGP) are going to have an impact.

We have a lot to follow up on internally and are working around the clock to make sure something like this does not happen again.

Hardening Workers KV

Post Syndicated from Matt Silverlock original http://blog.cloudflare.com/workers-kv-restoring-reliability/

Over the last couple of months, Workers KV has suffered from a series of incidents, culminating in three back-to-back incidents during the week of July 17th, 2023. These incidents have directly impacted customers that rely on KV — and this isn’t good enough.

We’re going to share the work we have done to understand why KV has had such a spate of incidents and, more importantly, share in depth what we’re doing to dramatically improve how we deploy changes to KV going forward.

Workers KV?

Workers KV — or just “KV” — is a key-value service for storing data: specifically, data with high read throughput requirements. It’s especially useful for user configuration, service routing, small assets and/or authentication data.

We use KV extensively inside Cloudflare too, with Cloudflare Access (part of our Zero Trust suite) and Cloudflare Pages being some of our highest profile internal customers. Both teams benefit from KV’s ability to keep regularly accessed key-value pairs close to where they’re accessed, as well as its ability to scale out horizontally without any need to become an expert in operating KV.

Given Cloudflare’s extensive use of KV, it wasn’t just external customers impacted. Our own internal teams felt the pain of these incidents, too.

The summary of the post-mortem

Back in June 2023, we announced the move to a new architecture for KV, which is designed to address two major points of customer feedback we’ve had around KV: high latency for infrequently accessed keys (or a key accessed in different regions), and working to ensure the upper bound on KV’s eventual consistency model for writes is 60 seconds — not “mostly 60 seconds”.

At the time of the blog, we’d already been testing this internally, including early access with our community champions and running a small % of production traffic to validate stability and performance expectations beyond what we could emulate within a staging environment.

However, in the weeks between mid-June and the series of incidents during the week of July 17th, we continued to increase the volume of new traffic onto the new architecture. Each time we did this, we encountered previously unseen problems (many of them customer-impacting), then immediately rolled back, fixed bugs, and repeated the process. Internally, we’d begun to identify that this pattern was becoming unsustainable — each attempt to cut traffic onto the new architecture would surface errors or behaviors we hadn’t seen before and couldn’t immediately explain, and so we would roll back and assess.

The issues at the root of this series of incidents proved significantly challenging to track down and observe. Once identified, the two causes themselves proved quick to fix, but (1) an observability gap in our error reporting and (2) a mutation to local state that resulted in an unexpected mutation of global state were both hard to observe and reproduce in the days after the customer-facing impact ended.

The detail

One important piece of context to understand before we go into detail on the post-mortem: Workers KV is composed of two separate Workers scripts – internally referred to as the Storage Gateway Worker and SuperCache. SuperCache is an optional path in the Storage Gateway Worker workflow, and is the basis for KV's new (faster) backend (refer to the blog).

Here is a timeline of events:

2023-07-17 21:52 UTC: Cloudflare observes alerts showing 500 HTTP status codes in the MEL01 data center (Melbourne, AU) and begins investigating. We also begin to see a small set of customers reporting HTTP 500s being returned via multiple channels. It is not immediately clear if this is a data-center-wide issue or KV-specific, as there had not been a recent KV deployment, and the issue directly correlated with three data centers being brought back online.
2023-07-18 00:09 UTC: We disable the new backend for KV in MEL01 in an attempt to mitigate the issue (noting that there had not been a recent deployment or change to the percentage of users on the new backend).
2023-07-18 05:42 UTC: We investigate alerts showing 500 HTTP status codes in VIE02 (Vienna, AT) and JNB01 (Johannesburg, SA).
2023-07-18 13:51 UTC: The new backend is disabled globally after seeing issues in the VIE02 and JNB01 data centers similar to those in MEL01. In both cases, the data centers had also recently come back online after maintenance, but it remained unclear why KV was failing.
2023-07-20 19:12 UTC: The new backend is inadvertently re-enabled while deploying an update, due to a misconfiguration in a deployment script.
2023-07-20 19:33 UTC: The new backend is (re-)disabled globally as HTTP 500 errors return.
2023-07-20 23:46 UTC: A broken Workers script pipeline is deployed as part of a gradual rollout, due to an incorrectly defined pipeline configuration in the deployment script. Metrics begin to report that a subset of traffic is being black-holed.
2023-07-20 23:56 UTC: The broken pipeline is rolled back; error rates return to pre-incident (normal) levels.

All timestamps referenced are in Coordinated Universal Time (UTC).

We initially observed alerts showing 500 HTTP status codes in the MEL01 data center (Melbourne, AU) at 21:52 UTC on July 17th, and began investigating. We also received reports from a small set of customers, via multiple channels, of HTTP 500s being returned. This correlated with three data centers being brought back online, and it was not immediately clear whether the issue related to the data centers or was KV-specific, especially given there had not been a recent KV deployment. At 05:42 UTC on July 18th, we began investigating alerts showing 500 HTTP status codes in the VIE02 (Vienna) and JNB02 (Johannesburg) data centers; while both had recently come back online after maintenance, it was still unclear why KV was failing. At 13:51 UTC, we made the decision to disable the new backend globally.

Following the incident on July 18th, we attempted to deploy an allow-list configuration to reduce the scope of impacted accounts. However, while attempting to roll out a change for the Storage Gateway Worker at 19:12 UTC on July 20th, an older configuration was progressed, causing the new backend to be enabled again and leading to the third event. As the team worked to fix this and deploy the intended configuration, they attempted to manually progress the deployment at 23:46 UTC, which resulted in a malformed configuration value being passed and traffic being sent to an invalid Workers script configuration.

After all deployments and the broken Workers configuration (pipeline) had been rolled back at 23:56 UTC on July 20th, we spent the following three days working to identify the root cause of the issue. We lacked observability because KV's Worker script (responsible for much of KV's logic) was throwing an unhandled exception very early in the request handling process. This was further exacerbated by prior work that disabled error reporting in a disabled data center because of the noise it generated, which had previously resulted in logs being rate-limited upstream from our service.

This previous mitigation prevented us from capturing meaningful logs from the Worker, including identifying the exception itself, as an uncaught exception terminates request processing. This has raised the priority of improving how unhandled exceptions are reported and surfaced in a Worker (see Recommendations, below, for further details). The issue was exacerbated by the fact that KV's Worker script would fail to re-enter its "healthy" state when a Cloudflare data center was brought back online: the Worker was mutating an environment variable that appeared to be in request scope but was actually in global scope and persisted across requests. This effectively left the Worker “frozen” with the previous, invalid configuration for the affected locations.
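
The Worker itself is JavaScript, but the scoping mistake is easy to show with a Python analogy of a long-lived process: what looks like per-request state is really a reference to state shared by every request that process (for a Worker, the isolate) handles.

# Python analogy of the global-state pitfall described above; the real code is a
# JavaScript Worker, and this only illustrates the scoping mistake.
SHARED_CONFIG = {"backend": "new"}           # lives for the lifetime of the process

def handle_request(request):
    config = SHARED_CONFIG                   # a reference, not a per-request copy
    if request.get("data_center_disabled"):
        config["backend"] = "disabled"       # mutates state for ALL later requests
    return config["backend"]

def handle_request_fixed(request):
    config = dict(SHARED_CONFIG)             # copy: the change stays request-local
    if request.get("data_center_disabled"):
        config["backend"] = "disabled"
    return config["backend"]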

Further, the introduction of a new progressive release process for Workers KV, designed to de-risk rollouts (as an action from a prior incident), prolonged the incident. We found a bug in the deployment logic that led to a broader outage due to an incorrectly defined configuration.

This configuration effectively caused us to drop a single-digit % of traffic until it was rolled back 10 minutes later. This code is untested at scale, and we need to spend more time hardening it before using it as the default path in production.

Additionally, although the root cause of the incidents was limited to three Cloudflare data centers (Melbourne, Vienna, and Johannesburg), traffic across their wider regions still uses these data centers to route reads and writes to our system of record. Because these three data centers participate in KV’s new backend as regional tiers, a portion of traffic across Oceania, Europe, and Africa was affected. Only a portion of keys from enrolled namespaces use any given data center as a regional tier, in order to limit a single (regional) point of failure, so while traffic across all data centers in these regions was impacted, in no data center was all traffic affected.

We estimated the affected traffic to be 0.2-0.5% of KV's global traffic (based on our error reporting); however, we observed some customers with error rates approaching 20% of their total KV operations. The impact was spread across KV namespaces and keys for customers within the scope of this incident.

Both KV’s high total traffic volume and its role as a critical dependency for many customers amplify the impact of even small error rates. In all cases, once the changes were rolled back, errors returned to normal levels and did not persist.

Thinking about risks in building software

Before we dive into what we’re doing to significantly improve how we build, test, deploy and observe Workers KV going forward, we think there are lessons from the real world that can equally apply to how we improve the safety factor of the software we ship.

In traditional engineering and construction, there is an extremely common procedure known as a   “JSEA”, or Job Safety and Environmental Analysis (sometimes just “JSA”). A JSEA is designed to help you iterate through a list of tasks, the potential hazards, and most importantly, the controls that will be applied to prevent those hazards from damaging equipment, injuring people, or worse.

One of the most critical concepts is the “hierarchy of controls” — that is, what controls should be applied to mitigate these hazards. In most practices, these are elimination, substitution, engineering, administration and personal protective equipment. Elimination and substitution are fairly self-explanatory: is there a different way to achieve this goal? Can we eliminate that task completely? Engineering and administration ask us whether there is additional engineering work, such as changing the placement of a panel, or using a horizontal boring machine to lay an underground pipe vs. opening up a trench that people can fall into.

The last and lowest on the hierarchy is personal protective equipment (PPE). A hard hat can protect you from severe injury from something falling from above, but it’s a last resort, and it certainly isn’t guaranteed. In engineering practice, any hazard that only lists PPE as a mitigating factor is unsatisfactory: there must be additional controls in place. For example, instead of only wearing a hard hat, we should engineer the floor of scaffolding so that large objects (such as a wrench) cannot fall through in the first place. Further, if we require that all tools are attached to the wearer, then it significantly reduces the chance the tool can be dropped in the first place. These controls ensure that there are multiple degrees of mitigation — defense in depth — before your hard hat has to come into play.

Coming back to software, we can draw parallels between these controls: engineering can be likened to improving automation, gradual rollouts, and detailed metrics. Similarly, personal protective equipment can be likened to code review: useful, but code review cannot be the only thing protecting you from shipping bugs or untested code. Automation with linters, more robust testing, and new metrics are all vastly safer ways of shipping software.

As we spent time assessing where to improve our existing controls and how to put new controls in place to mitigate risks and improve the reliability (safety) of Workers KV, we took a similar approach: eliminating unnecessary changes, engineering more resilience into our codebase, automation, deployment tooling, and only then looking at human processes.

How we plan to get better

Cloudflare is undertaking a larger, more structured review of KV's observability tooling, release infrastructure and processes to mitigate not only the contributing factors to the incidents within this report, but recent incidents related to KV. Critically, we see tooling and automation as the most powerful mechanisms for preventing incidents, with process improvements designed to provide an additional layer of protection. Process improvements alone cannot be the only mitigation.

Specifically, we have identified and prioritized the efforts below as the most important next steps towards meeting our own availability SLOs and, above all, making KV a service that customers building on Workers can rely on for storing configuration and service data in the hot path of their traffic:

  • Substantially improve the existing observability tooling for unhandled exceptions, both for internal teams and customers building on Workers. This is especially critical for high-volume services, where traditional logging alone can be too noisy (and not specific enough) to aid in tracking down these cases. The existing ongoing work to land this will be prioritized further. In the meantime, we have directly addressed the specific uncaught exception with KV's primary Worker script.
  • Improve the safety around the mutation of environment variables in a Worker, which currently operate at "global" (per-isolate) scope but can appear to be per-request. Mutating an environment variable in request scope mutates the value for all requests transiting that same isolate (in a given location), which can be unexpected. Changes here will need to keep backwards compatibility in mind.
  • Continue to expand KV’s test coverage to better address the above issues, in parallel with the aforementioned observability and tooling improvements, as an additional layer of defense. This includes allowing our test infrastructure to simulate traffic from any source data-center, which would have allowed us to more quickly reproduce the issue and identify a root cause.
  • Improvements to our release process, including how KV changes and releases are reviewed and approved, going forward. We will enforce a higher level of scrutiny for future changes, and where possible, reduce the number of changes deployed at once. This includes taking on new infrastructure dependencies, which will have a higher bar for both design and testing.
  • Additional logging improvements, including sampling, throughout our request handling process to improve troubleshooting and debugging. A significant amount of the challenge related to these incidents was due to the lack of logging around specific requests (especially non-2xx requests).
  • Review and, where applicable, improve alerting thresholds surrounding error rates. As mentioned previously in this report, sub-% error rates at a global scale can have severe negative impact on specific users and/or locations: ensuring that errors are caught and not lost in the noise is an ongoing effort.
  • Address maturity issues with our progressive deployment tooling for Workers, which is net-new (and will eventually be exposed to customers directly).

This is not an exhaustive list: we're continuing to expand on preventative measures associated with these and other incidents. These changes will not only improve KV's reliability, but also that of other services across Cloudflare that KV relies on, or that rely on KV.

We recognize that KV hasn’t lived up to our customers’ expectations recently. Because we rely on KV so heavily internally, we’ve felt that pain first hand as well. The work to fix the issues that led to this cycle of incidents is already underway. That work will not only improve KV’s reliability but also improve the reliability of any software written on the Cloudflare Workers developer platform, whether by our customers or by ourselves.

Connection errors in Asia Pacific region on July 9, 2023

Post Syndicated from Christian Elmerot original http://blog.cloudflare.com/connection-errors-in-asia-pacific-region-on-july-9-2023/

On Sunday, July 9, 2023, early morning UTC time, we observed a high number of DNS resolution failures — up to 7% of all DNS queries across the Asia Pacific region — caused by invalid DNSSEC signatures from Verisign .com and .net Top Level Domain (TLD) nameservers. This resulted in connection errors for visitors of Internet properties on Cloudflare in the region.

The local instances of Verisign’s nameservers started to respond with expired DNSSEC signatures in the Asia Pacific region. In order to remediate the impact, we have rerouted upstream DNS queries towards Verisign to locations on the US west coast which are returning valid signatures.

We have already reached out to Verisign to get more information on the root cause. Until their issues have been resolved, we will keep our DNS traffic to .com and .net TLD nameservers rerouted, which might cause slightly increased latency for the first visitor to domains under .com and .net in the region.

Background

In order to proxy a domain’s traffic through Cloudflare’s network, there are two components involved with respect to the Domain Name System (DNS) from the perspective of a Cloudflare data center: external DNS resolution, and upstream or origin DNS resolution.

To understand this, let’s look at the domain name blog.cloudflare.com — which is proxied through Cloudflare’s network — and let’s assume it is configured to use origin.example as the origin server:

[Diagram: blog.cloudflare.com proxied through Cloudflare, with origin.example as the origin server]

Here, the external DNS resolution is the part where DNS queries to blog.cloudflare.com sent by public resolvers like 1.1.1.1 or 8.8.8.8 on behalf of a visitor return a set of Cloudflare Anycast IP addresses. This ensures that the visitor’s browser knows where to send an HTTPS request to load the website. Under the hood your browser performs a DNS query that looks something like this (the trailing dot indicates the DNS root zone):

$ dig blog.cloudflare.com. +short
104.18.28.7
104.18.29.7

(Your computer doesn’t actually use the dig command internally; we’ve used it to illustrate the process.) Then, when the next closest Cloudflare data center receives the HTTPS request for blog.cloudflare.com, it needs to fetch the content from the origin server (assuming it is not cached).

There are two ways Cloudflare can reach the origin server. If the DNS settings in Cloudflare contain IP addresses then we can connect directly to the origin. In some cases, our customers use a CNAME which means Cloudflare has to perform another DNS query to get the IP addresses associated with the CNAME. In the example above, blog.cloudflare.com has a CNAME record instructing us to look at origin.example for IP addresses. During the incident, only customers with CNAME records like this going to .com and .net domain names may have been affected.

The domain origin.example needs to be resolved by Cloudflare as part of the upstream or origin DNS resolution. This time, the Cloudflare data center needs to perform an outbound DNS query that looks like this:

$ dig origin.example. +short
192.0.2.1

DNS is a hierarchical protocol, so the recursive DNS resolver, which usually handles DNS resolution for whoever wants to resolve a domain name, needs to talk to several involved nameservers until it finally gets an answer from the authoritative nameservers of the domain (assuming no DNS responses are cached). This is the same process during the external DNS resolution and the origin DNS resolution. Here is an example for the origin DNS resolution:

[Diagram: origin DNS resolution through the DNS hierarchy]

Inherently, DNS is a public system that was initially published without any means to validate the integrity of the DNS traffic. So in order to prevent someone from spoofing DNS responses, DNS Security Extensions (DNSSEC) were introduced as a means to authenticate that DNS responses really come from who they claim to come from. This is achieved by generating cryptographic signatures alongside existing DNS records like A, AAAA, MX, CNAME, etc. By validating a DNS record’s associated signature, it is possible to verify that a requested DNS record really comes from its authoritative nameserver and wasn’t altered en-route. If a signature cannot be validated successfully, recursive resolvers usually return an error indicating the invalid signature. This is exactly what happened on Sunday.

Incident timeline and impact

On Saturday, July 8, 2023, at 21:10 UTC, our logs show the first instances of DNSSEC validation errors during upstream DNS resolution from multiple Cloudflare data centers in the Asia Pacific region. It appeared that DNS responses of type NSEC3 (a DNSSEC mechanism to prove the non-existence of DNS records) from the .com and .net TLD nameservers did not include signatures. About an hour later, at 22:16 UTC, the first internal alerts went off (alerting requires issues to persist over a certain period of time), but error rates were still at around 0.5% of all upstream DNS queries.

After several hours, the error rate had increased to the point where roughly 13% of our upstream DNS queries in affected locations were failing. This percentage continued to fluctuate between 10% and 15% of upstream DNS queries over the duration of the incident, and roughly 5-7% of all DNS queries (external and upstream resolution) to affected Cloudflare data centers in the Asia Pacific region.

Initially it appeared as though only a single upstream nameserver was having issues with DNS resolution, however upon further investigation it was discovered that the issue was more widespread. Ultimately, we were able to verify that the Verisign nameservers for .com and .net were returning expired DNSSEC signatures on a portion of DNS queries in the Asia Pacific region. Based on our tests, other nameserver locations were correctly returning valid DNSSEC signatures.

In response, we rerouted our DNS traffic to the .com and .net TLD nameserver IP addresses to Verisign’s US west locations. After this change was propagated, the issue very quickly subsided and origin resolution error rates returned to normal levels.

All times are in UTC:

2023-07-08 21:10 First instances of DNSSEC validation errors appear in our logs for origin DNS resolution.

2023-07-08 22:16 First internal alerts for Asia Pacific data centers go off indicating origin DNS resolution error on our test domain. Very few errors for other domains at this point.

2023-07-09 02:58 Error rates have increased substantially since the first instance. An incident is declared.

2023-07-09 03:28 DNSSEC validation issues seem to be isolated to a single upstream provider. It is not abnormal that a single upstream has issues that propagate back to us, and in this case our logs were predominantly showing errors from domains that resolve to this specific upstream.

2023-07-09 04:52 We realize that DNSSEC validation issues are more widespread and affect multiple .com and .net domains. Validation issues continue to be isolated to the Asia Pacific region only, and appear to be intermittent.

2023-07-09 05:15 DNS queries via popular recursive resolvers like 8.8.8.8 and 1.1.1.1 do not return invalid DNSSEC signatures at this point. DNS queries using the local stub resolver continue to return DNSSEC errors.

2023-07-09 06:24 Responses from .com and .net Verisign nameservers in Singapore contain expired DNSSEC signatures, but responses from Verisign TLD nameservers in other locations are fine.

2023-07-09 06:41 We contact Verisign and notify them about expired DNSSEC signatures.

2023-07-09 06:50 In order to remediate the impact, we reroute DNS traffic via IPv4 for the .com and .net Verisign nameserver IPs to US west IPs instead. This immediately leads to a substantial drop in the error rate.

2023-07-09 07:06 We also reroute DNS traffic via IPv6 for the .com and .net Verisign nameserver IPs to US west IPs. This leads to the error rate going down to zero.

2023-07-10 09:23 The rerouting is still in place, but the underlying issue with expired signatures in the Asia Pacific region has still not been resolved.

2023-07-10 18:23 Verisign gets back to us confirming that they “were serving stale data” at their local site and have resolved the issues.

Technical description of the error and how it happened

As mentioned in the introduction, the underlying cause for this incident was expired DNSSEC signatures for .net and .com zones. Expired DNSSEC signatures will cause a DNS response to be interpreted as invalid. There are two main scenarios in which this error was observed by a user:

  1. CNAME flattening for external DNS resolution. This is when our authoritative nameservers attempt to return the IP address(es) that a CNAME record target resolves to rather than the CNAME record itself.
  2. CNAME target lookup for origin DNS resolution. This is most commonly used when an HTTPS request is sent to a Cloudflare anycast IP address, and we need to determine what IP address to forward the request to. See How Cloudflare works for more details.

In both cases, behind the scenes the DNS query goes through an in-house recursive DNS resolver in order to look up what a hostname resolves to. The purpose of this resolver is to cache queries, optimize future queries, and provide DNSSEC validation. If the query from this resolver fails for whatever reason, our authoritative DNS system will not be able to perform the two scenarios outlined above.

During the incident, the recursive resolver was failing to validate the DNSSEC signatures in DNS responses for domains ending in .com and .net. These signatures are sent in the form of an RRSIG record alongside the other DNS records they cover. Together they form a Resource Record set (RRset). Each RRSIG has the corresponding fields:

[Figure: RRSIG record fields (type covered, algorithm, labels, original TTL, signature expiration, signature inception, key tag, signer's name, signature)]

As you can see, the main part of the payload is associated with the signature and its corresponding metadata. Each recursive resolver is responsible for not only checking the signature but also the expiration time of the signature. It is important to obey the expiration time in order to avoid returning responses for RRsets that have been signed by old keys, which could have potentially been compromised by that time.
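
A simplified validity-window check looks like this (a sketch only: it uses the YYYYMMDDHHMMSS presentation format for the inception and expiration fields, the timestamps are illustrative, and real validators work on the wire-format timestamps and verify the signature itself as well):

from datetime import datetime, timezone

def rrsig_window_valid(inception: str, expiration: str, now=None) -> bool:
    """Check whether the current time falls inside an RRSIG's validity window."""
    fmt = "%Y%m%d%H%M%S"
    start = datetime.strptime(inception, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(expiration, fmt).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return start <= now <= end

# An expired signature, like the ones served during this incident, fails the
# check and the resolver answers SERVFAIL (illustrative timestamps):
print(rrsig_window_valid("20230601000000", "20230701000000"))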

During this incident, Verisign, the authoritative operator for the .com and .net TLD zones, was occasionally returning expired signatures in its DNS responses in the Asia Pacific region. As a result our recursive resolver was not able to validate the corresponding RRset. Ultimately this meant that a percentage of DNS queries would return SERVFAIL as response code to our authoritative nameserver. This in turn caused connection issues for users trying to connect to a domain on Cloudflare, because we weren't able to resolve the upstream target of affected domain names and thus didn’t know where to send proxied HTTPS requests to upstream servers.

Remediation and follow-up steps

Once we had identified the root cause we started to look at different ways to remedy the issue. We came to the conclusion that the fastest way to work around this very regionalized issue was to stop using the local route to Verisign's nameservers. This means that, at the time of writing this, our outgoing DNS traffic towards Verisign's nameservers in the Asia Pacific region now traverses the Pacific and ends up at the US west coast, rather than being served by closer nameservers. Due to the nature of DNS and the important role of DNS caching, this has less impact than you might initially expect. Frequently looked up names will be cached after the first request, and this only needs to happen once per data center, as we share and pool the local DNS recursor caches to maximize their efficiency.

Ideally, we would have been able to fix the issue right away as it potentially affected others in the region too, not just our customers. We will therefore work diligently to improve our incident communications channels with other providers in order to ensure that the DNS ecosystem remains robust against issues such as this. Being able to quickly get hold of other providers that can take action is vital when urgent issues like these arise.

Conclusion

This incident once again shows how impactful DNS failures are and how crucial this technology is for the Internet. We will investigate how we can improve our systems to detect and resolve issues like this more efficiently and quickly if they occur again in the future.

While Cloudflare was not the cause of this issue, and we are certain that we were not the only ones affected by this, we are still sorry for the disruption to our customers and to all the users who were unable to access Internet properties during this incident.

Cloudflare Incident on January 24th, 2023

Post Syndicated from Kenny Johnson original https://blog.cloudflare.com/cloudflare-incident-on-january-24th-2023/

Several Cloudflare services became unavailable for 121 minutes on January 24th, 2023 due to an error releasing code that manages service tokens. The incident degraded a wide range of Cloudflare products including aspects of our Workers platform, our Zero Trust solution, and control plane functions in our content delivery network (CDN).

Cloudflare provides a service token functionality to allow automated services to authenticate to other services. Customers can use service tokens to secure the interaction between an application running in a data center and a resource in a public cloud provider, for example. As part of the release, we intended to introduce a feature that showed administrators the time that a token was last used, giving users the ability to safely clean up unused tokens. The change inadvertently overwrote other metadata about the service tokens and rendered the tokens of impacted accounts invalid for the duration of the incident.

The reason a single release caused so much damage is because Cloudflare runs on Cloudflare. Service tokens impact the ability for accounts to authenticate, and two of the impacted accounts power multiple Cloudflare services. When these accounts’ service tokens were overwritten, the services that run on these accounts began to experience failed requests and other unexpected errors.

We know this impacted several customers and we know the impact was painful. We’re documenting what went wrong so that you can understand why this happened and the steps we are taking to prevent this from occurring again.

What is a service token?

When users log into an application or identity provider, they typically input a username and a password. The password allows that user to demonstrate that they are in control of the username and that the service should allow them to proceed. Layers of additional authentication can be added, like hard keys or device posture, but the workflow consists of a human proving they are who they say they are to a service.

However, humans are not the only users that need to authenticate to a service. Applications frequently need to talk to other applications. For example, imagine you build an application that shows a user information about their upcoming travel plans.

The airline holds details about the flight and its duration in their own system. They do not want to make the details of every individual trip public on the Internet and they do not want to invite your application into their private network. Likewise, the hotel wants to make sure that they only send details of a room booking to a valid, approved third party service.

Your application needs a trusted way to authenticate with those external systems. Service tokens solve this problem by functioning as a kind of username and password for your service. Like usernames and passwords, service tokens come in two parts: a Client ID and a Client Secret. Both the ID and Secret must be sent with a request for authentication. Tokens are also assigned a duration, after which they become invalid and must be rotated. You can grant your application a service token and, if the upstream systems you need validate it, your service can grab airline and hotel information and present it to the end user in a joint report.

When administrators create Cloudflare service tokens, we generate the Client ID and the Client Secret pair. Customers can then configure their requesting services to send both values as HTTP headers when they need to reach a protected resource. The requesting service can run in any environment, including inside of Cloudflare’s network in the form of a Worker or in a separate location like a public cloud provider. Customers need to deploy the corresponding protected resource behind Cloudflare’s reverse proxy. Our network checks every request bound for a configured service for the HTTP headers. If present, Cloudflare validates their authenticity and either blocks the request or allows it to proceed. We also log the authentication event.
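
For illustration, a requesting service sends the pair as HTTP headers on each request. The header names below are the ones used for Cloudflare Access service tokens; the URL and credential values are placeholders.

import urllib.request

CLIENT_ID = "<client-id>.access"          # placeholder Client ID
CLIENT_SECRET = "<client-secret>"         # placeholder Client Secret

req = urllib.request.Request(
    "https://internal-app.example.com/api/report",   # placeholder protected resource
    headers={
        "CF-Access-Client-Id": CLIENT_ID,
        "CF-Access-Client-Secret": CLIENT_SECRET,
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)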

Incident Timeline

All Timestamps are UTC

At 2023-01-24 16:55 UTC the Access engineering team initiated the release that inadvertently began to overwrite service token metadata, causing the incident.

At 2023-01-24 17:05 UTC a member of the Access engineering team noticed an unrelated issue and rolled back the release which stopped any further overwrites of service token metadata.

Service token values are not updated across Cloudflare’s network until the service token itself is updated (more details below). This caused a staggered impact for the service tokens that had their metadata overwritten.

2023-01-24 17:50 UTC: The first invalid service token for Cloudflare WARP was synced to the edge. Impact began for WARP and Zero Trust users.

[Chart: WARP device posture uploads dropped to zero, which raised an internal alert]

At 2023-01-24 18:12 an incident was declared due to the large drop in successful WARP device posture uploads.

2023-01-24 18:19 UTC: The first invalid service token for the Cloudflare API was synced to the edge. Impact began for Cache Purge, Cache Reserve, Images and R2. Alerts were triggered for these products which identified a larger scope of the incident.

At 2023-01-24 18:21 the overwritten services tokens were discovered during the initial investigation.

At 2023-01-24 18:28 the incident was elevated to include all impacted products.

At 2023-01-24 18:51 An initial solution was identified and implemented to revert the service token to its original value for the Cloudflare WARP account, impacting WARP and Zero Trust. Impact ended for WARP and Zero Trust.

At 2023-01-24 18:56 The same solution was implemented on the Cloudflare API account, impacting Cache Purge, Cache Reserve, Images and R2. Impact ended for Cache Purge, Cache Reserve, Images and R2.

At 2023-01-24 19:00 an update was made to the Cloudflare API account which incorrectly overwrote its service token again. Impact restarted for Cache Purge, Cache Reserve, Images and R2. All internal Cloudflare account changes were then locked until incident resolution.

At 2023-01-24 19:07 the Cloudflare API was updated to include the correct service token value. Impact ended for Cache Purge, Cache Reserve, Images and R2.

At 2023-01-24 19:51 all affected accounts had their service tokens restored from a database backup. Incident Ends.

What was released and how did it break?

The Access team was rolling out a new change to service tokens that added a “Last seen at” field. This was a popular feature request to help identify which service tokens were actively in use.

What went wrong?

The “last seen at” value was derived by scanning all new login events in an account’s login event Kafka queue. If a login event using a service token was detected, an update to the corresponding service token’s last seen value was initiated.

In order to update the service token’s “last seen at” value, a read-write transaction is made to collect the information about the corresponding service token. Service token read requests redact the “client secret” value by default for security reasons. The “last seen at” update then used the information from that read, which did not include the “client secret”, and wrote the service token back with an empty “client secret”.

An example of the correct and incorrect service token values is shown below:

Example Access Service Token values

{
  "1a4ddc9e-a1234-4acc-a623-7e775e579c87": {
    "client_id": "6b12308372690a99277e970a3039343c.access",
    "client_secret": "<hashed-value>", <-- what you would expect
    "expires_at": 1698331351
  },
  "23ade6c6-a123-4747-818a-cd7c20c83d15": {
    "client_id": "1ab44976dbbbdadc6d3e16453c096b00.access",
    "client_secret": "", <--- this is the problem
    "expires_at": 1670621577
  }
}

The service token “client secret” database column did have a “not null” check; however, in this situation an empty text string did not count as a null value, so the check did not catch the overwrite.
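
Put together, the bug looks roughly like this (an illustrative sketch, not the actual Access service code; db_read and db_write are hypothetical helpers):

def get_service_token(token_id):
    token = db_read(token_id)            # hypothetical read helper
    token["client_secret"] = ""          # secrets are redacted on read
    return token

def record_last_seen(token_id, timestamp):
    token = get_service_token(token_id)
    token["last_seen_at"] = timestamp
    # Bug: the redacted (empty) client_secret is written back. An empty string
    # is not NULL, so the column's "not null" check does not reject it.
    db_write(token_id, token)            # hypothetical write helper

# The kind of check the remediation section below describes adding:
def validate_client_secret(value):
    if value is None or value == "":
        raise ValueError("client_secret must be a non-empty string")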

As a result of the bug, any Cloudflare account that used a service token to authenticate during the 10 minutes that the “last seen at” release was out had its “client secret” value set to an empty string. The service token then needed to be modified before the empty “client secret” would be used for authentication. There were a total of four accounts in this state, all of which are internal to Cloudflare.

How did we fix the issue?

As a temporary solution, we were able to manually restore the correct service token values for the accounts with overwritten service tokens. This stopped the immediate impact across the affected Cloudflare services.

The database team was then able to implement a solution to restore the service tokens of all impacted accounts from an older database copy. This concluded any impact from this incident.

Why did this impact other Cloudflare services?

Service tokens impact the ability for accounts to authenticate. Two of the impacted accounts power multiple Cloudflare services. When these accounts’ service tokens were overwritten, the services that run on these accounts began to experience failed requests and other unexpected errors.

Cloudflare WARP Enrollment

Cloudflare provides a mobile and desktop forward proxy, Cloudflare WARP (our “1.1.1.1” app), that any user can install on a device to improve the privacy of their Internet traffic. Any individual can install this service without the need for a Cloudflare account and we do not retain logs that map activity to a user.

When a user connects using WARP, Cloudflare validates the enrollment of a device by relying on a service that receives and validates the keys on the device. In turn, that service communicates with another system that tells our network to provide the newly enrolled device with access to our network.

During the incident, the enrollment service could no longer communicate with systems in our network that would validate the device. As a result, users could no longer register new devices and/or install the app on a new device, and may have experienced issues upgrading to a new version of the app (which also triggers re-registration).

Cloudflare Zero Trust Device Posture and Re-Auth Policies

Cloudflare provides a comprehensive Zero Trust solution that customers can deploy with or without an agent living on the device. Some use cases are only available when using the Cloudflare agent on the device. The agent is an enterprise version of the same Cloudflare WARP solution and experienced similar degradation anytime the agent needed to send or receive device state. This impacted three use cases in Cloudflare Zero Trust.

First, similar to the consumer product, new devices could not be enrolled and existing devices could not be revoked. Administrators were also unable to modify settings of enrolled devices. In all cases, errors would have been presented to the user.

Second, many customers who replace their existing private network with Cloudflare’s Zero Trust solution may add rules that continually validate a user’s identity through the use of session duration policies. The goal of these rules is to enforce users to reauthenticate in order to prevent stale sessions from having ongoing access to internal systems. The agent on the device prompts the user to reauthenticate based on signals from Cloudflare’s control plane. During the incident, the signals were not sent and users could not successfully reauthenticate.

Finally, customers who rely on device posture rules also experienced impact. Device posture rules allow customers who use Access or Gateway policies to rely on the WARP agent to continually enforce that a device meets corporate compliance rules.

The agent communicates these signals to a Cloudflare service responsible for maintaining the state of the device. Cloudflare’s Zero Trust access control product uses a service token to receive this signal and evaluate it along with other rules to determine if a user can access a given resource. During this incident those rules defaulted to a block action, meaning that traffic modified by these policies would appear broken to the user. In some cases this meant that all internet bound traffic from a device was completely blocked leaving users unable to access anything.

Cloudflare Gateway caches the device posture state for users every 5 minutes to apply Gateway policies. The device posture state is cached so Gateway can apply policies without having to verify device state on every request. Depending on which Gateway policy type was matched, the user would experience two different outcomes. If they matched a network policy the user would experience a dropped connection and for an HTTP policy they would see a 5XX error page. We peaked at over 50,000 5XX errors/minute over baseline and had over 10.5 million posture read errors until the incident was resolved.
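
As a sketch of that fail-closed behavior (not Gateway's actual code; the cache shape and fetch_posture_state helper are assumptions), a policy evaluation that cannot read posture state results in a dropped connection or a 5XX depending on the policy type:

import time

POSTURE_CACHE = {}          # device_id -> (state, fetched_at); refreshed roughly every 5 minutes
CACHE_TTL_SECONDS = 300

def device_posture(device_id):
    cached = POSTURE_CACHE.get(device_id)
    if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
        return cached[0]
    state = fetch_posture_state(device_id)   # hypothetical call; this read is what failed
    POSTURE_CACHE[device_id] = (state, time.time())
    return state

def evaluate(policy, device_id):
    try:
        posture = device_posture(device_id)
    except Exception:
        # Posture state unavailable: fail closed.
        return "drop-connection" if policy["type"] == "network" else "http-5xx"
    return "allow" if posture["compliant"] else "block"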

[Chart: Gateway 5XX errors per minute]

[Chart: Total count of Gateway device posture errors]

Cloudflare R2 Storage and Cache Reserve

Cloudflare R2 Storage allows developers to store large amounts of unstructured data without the costly egress bandwidth fees associated with typical cloud storage services.

During the incident, the R2 service was unable to make outbound API requests to other parts of the Cloudflare infrastructure. As a result, R2 users saw elevated request failure rates when making requests to R2.  

Many Cloudflare products also depend on R2 for data storage and were affected as well. For example, Cache Reserve users saw increased origin load during this window for any items not in the primary cache. The majority of reads and writes to the Cache Reserve service failed during the incident, so entries could not move into or out of Cache Reserve. However, when Cache Reserve sees an R2 error, it falls back to the customer origin, so user traffic was still served during this period.

Cloudflare Cache Purge

Cloudflare’s content delivery network (CDN) caches the content of Internet properties on our network in our data centers around the world to reduce the distance that a user’s request needs to travel for a response. In some cases, customers want to purge what we cache and replace it with different data.
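A purge is typically triggered through the Cloudflare API. The following is a minimal sketch of such a request with curl; the zone ID and API token are placeholders, and the body shown purges a specific list of URLs:

# Purge specific URLs from the cache for a zone (zone ID and token are placeholders).
curl -X POST "https://api.cloudflare.com/client/v4/zones/<ZONE_ID>/purge_cache" \
     -H "Authorization: Bearer <API_TOKEN>" \
     -H "Content-Type: application/json" \
     --data '{"files":["https://example.com/styles.css","https://example.com/app.js"]}'

The same endpoint also accepts a body of {"purge_everything":true} to clear an entire zone.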

The Cloudflare control plane, the place where an administrator interacts with our network, uses a service token to authenticate and reach the cache purge service. During the incident, many purge requests failed while the service token was invalid. We saw an average impact of 20 purge requests/second failing and a maximum of 70 requests/second.

What are we doing to prevent this from happening again?

We take incidents like this seriously and recognize the impact this one had. We have identified several steps we can take to address the risk of a similar problem occurring in the future. We are implementing the following remediation plan as a result of this incident:

Test: The Access engineering team will add unit tests that would automatically catch any similar issues with service token overwrites before any new features are launched.

Alert: The Access team will implement an automatic alert for any dramatic increase in failed service token authentication requests to catch issues before they are fully launched.

Process: The Access team has identified process improvements to allow for faster rollbacks for specific database tables.

Implementation: All relevant database fields will be updated to include checks for empty strings in addition to the existing “not null” checks.

We are sorry for the disruption this caused for our customers across a number of Cloudflare services. We are actively making these changes to improve stability going forward and to ensure that this problem does not happen again.

Partial Cloudflare outage on October 25, 2022

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/partial-cloudflare-outage-on-october-25-2022/

Partial Cloudflare outage on October 25, 2022

Partial Cloudflare outage on October 25, 2022

Today, a change to our Tiered Cache system caused some requests to fail for users with status code 530. The impact lasted for almost six hours in total. We estimate that about 5% of all requests failed at peak. Because of the complexity of our system and a blind spot in our tests, we did not spot this when the change was released to our test environment.  

The failures were caused by side effects of how we handle cacheable requests across locations. At first glance, the errors looked like they were caused by a different system that had started a release some time before. It took our teams a number of tries to identify exactly what was causing the problems. Once identified we expedited a rollback which completed in 87 minutes.

We’re sorry, and we’re taking steps to make sure this does not happen again.

Background

One of Cloudflare’s products is our Content Delivery Network, or CDN. This is used to cache assets for websites globally. However, a data center is not guaranteed to have an asset cached. It could be new, expired, or purged. If that happens and a user requests the asset, our CDN needs to retrieve a fresh copy from the website’s origin server, and the data center that the user is accessing might still be quite far away from that origin. Every asset that is not cached in the data center therefore means another trip back to the origin server.

To improve cache hit ratios, we introduced Tiered Cache. With Tiered Cache, we organize our data centers in the CDN into a hierarchy of “lower tiers” which are closer to the end users and “upper tiers” that are closer to the origin. When a cache-miss occurs in a lower tier, the upper tier is checked. If the upper tier has a fresh copy of the asset, we can serve that in response to the request. This improves performance and reduces the number of times that Cloudflare has to reach out to an origin server to retrieve assets that are not cached in lower tier data centers.

Partial Cloudflare outage on October 25, 2022

Incident timeline and impact

At 08:40 UTC, a software release of a CDN component containing a bug began slowly rolling out. The bug was triggered when a user visited a site with either Tiered Cache, Cloudflare Images, or Bandwidth Alliance configured. It caused a subset of requests to those customers’ sites to return HTTP status code 530, an error. Content that could be served directly from a data center’s local cache was unaffected.

We started an investigation after receiving customer reports of an intermittent increase in 530s after the faulty component was released to a subset of data centers.

Once the release started rolling out globally to the remaining data centers, a sharp increase in 530s triggered alerts along with more customer reports, and an incident was declared.

Partial Cloudflare outage on October 25, 2022
Requests resulting in a response with status code 530

We confirmed a bad release was responsible by rolling back the release in a data center at 17:03 UTC. After the rollback, we observed a drop in 530 errors. After this confirmation, an accelerated global rollback began and the 530s started to decrease. Impact ended once the release was reverted in all data centers configured as Tiered Cache upper tiers at 18:04 UTC.

Timeline:

  • 2022-10-25 08:40: The release started to roll out to a small subset of data centers.
  • 2022-10-25 10:35: An individual customer alert fires, indicating an increase in 500 error codes.
  • 2022-10-25 11:20: After an investigation, a single small data center is pinpointed as the source of the issue and removed from production while teams investigate the issue there.
  • 2022-10-25 12:30: Issue begins spreading more broadly as more data centers get the code changes.
  • 2022-10-25 14:22: 530 errors increase as the release starts to slowly roll out to our largest data centers.
  • 2022-10-25 14:39: Multiple teams become involved in the investigation as more customers start reporting increases in errors.
  • 2022-10-25 17:03: CDN Release is rolled back in Atlanta and root cause is confirmed.
  • 2022-10-25 17:28: Peak impact with approximately 5% of all HTTP requests resulting in an error with status code 530.
  • 2022-10-25 17:38: An accelerated rollback continues with large data centers acting as Upper tier for many customers.
  • 2022-10-25 18:04: Rollback is complete in all Upper Tiers.
  • 2022-10-25 18:30: Rollback is complete.

During the early phases of the investigation, the indicators were that this was a problem with our internal DNS system, which also had a release rolling out at the same time. As the following section shows, the DNS errors were a side effect of the outage rather than its cause.

Adding distributed tracing to Tiered Cache introduced the problem

In order to help improve our performance, we routinely add monitoring code to various parts of our services. Monitoring code helps by giving us visibility into how various components are performing, allowing us to determine bottlenecks that we can improve on. Our team recently added additional distributed tracing to our Tiered Cache logic. The tiered cache entrypoint code is as follows:

* Before:

function _M.go()
   -- code to run here
end

* After:

local trace_fn = require("opentracing").trace_fn

local function go()
   -- code to run here
end

function _M.go()
   trace_fn(ngx.ctx, "tiered_cache_rewrite", go)
end

The code above wraps the existing go() function with trace_fn(), which calls go() and then reports its execution time.

However, the control header extractor function registered with the opentracing module clears control headers on every request:

require("opentracing").configure_module(conf,
-- control header extractor
function(ctx)
-- Always clear the headers.
clear_control_headers()

Normally, we extract data from these control headers before clearing them as a routine part of how we process requests.

But internal tiered cache traffic expects the control headers from the lower tier to be passed as-is. The combination of clearing the headers and using an upper tier meant that information critical to the routing of the request was not available. In the affected subset of requests, we were missing the hostname that our internal DNS lookup uses to resolve origin server IP addresses. As a result, a 530 DNS error was returned to the client.

Remediation and follow-up steps

To prevent this from happening again, in addition to fixing the bug, we have identified a set of changes that will help us detect and prevent issues like this in the future:

  • Include a larger data center that is configured as a Tiered Cache upper tier in an earlier stage in the release plan. This will allow us to notice similar issues more quickly, before a global release.
  • Expand our acceptance test coverage to include a broader set of configurations, including various Tiered Cache topologies.
  • Alert more aggressively in situations where we do not have full context on requests, and need the extra host information in the control headers.
  • Ensure that our system correctly fails fast when it encounters an error like this, which would have helped identify the problem during development and test.

Conclusion

We experienced an incident that affected a significant set of customers using Tiered Cache. After identifying the faulty component, we were able to quickly roll back and remediate the issue. We are sorry for any disruption this has caused our customers and end users trying to access services.

Remediations to prevent such an incident from happening in the future will be put in place as soon as possible.

The mechanics of a sophisticated phishing scam and how we stopped it

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/2022-07-sms-phishing-attacks/

The mechanics of a sophisticated phishing scam and how we stopped it

The mechanics of a sophisticated phishing scam and how we stopped it

Yesterday, August 8, 2022, Twilio shared that they’d been compromised by a targeted phishing attack. Around the same time as Twilio was attacked, we saw an attack with very similar characteristics also targeting Cloudflare’s employees. While individual employees did fall for the phishing messages, we were able to thwart the attack through our own use of Cloudflare One products, and physical security keys issued to every employee that are required to access all our applications.

We have confirmed that no Cloudflare systems were compromised. Our Cloudforce One threat intelligence team was able to perform additional analysis to further dissect the mechanism of the attack and gather critical evidence to assist in tracking down the attacker.

This was a sophisticated attack targeting employees and systems in such a way that we believe most organizations would be likely to be breached. Given that the attacker is targeting multiple organizations, we wanted to share here a rundown of exactly what we saw in order to help other companies recognize and mitigate this attack.

Targeted Text Messages

On July 20, 2022, the Cloudflare Security team received reports of employees receiving legitimate-looking text messages pointing to what appeared to be a Cloudflare Okta login page. The messages began at 2022-07-20 22:50 UTC. Over the course of less than 1 minute, at least 76 employees received text messages on their personal and work phones. Some messages were also sent to employees’ family members. We have not yet been able to determine how the attacker assembled the list of employees’ phone numbers but have reviewed access logs to our employee directory services and have found no sign of compromise.

Cloudflare runs a 24×7 Security Incident Response Team (SIRT). Every Cloudflare employee is trained to report anything that is suspicious to the SIRT. More than 90 percent of the reports to SIRT turn out to not be threats. Employees are encouraged to report anything and never discouraged from over-reporting. In this case, however, the reports to SIRT were a real threat.

The text messages received by employees looked like this:

The mechanics of a sophisticated phishing scam and how we stopped it

They came from four phone numbers associated with T-Mobile-issued SIM cards: (754) 268-9387, (205) 946-7573, (754) 364-6683 and (561) 524-5989. They pointed to an official-looking domain: cloudflare-okta.com. That domain had been registered via Porkbun, a domain registrar, at 2022-07-20 22:13:04 UTC — less than 40 minutes before the phishing campaign began.
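As an aside, the registration time of a suspicious domain is easy to check from a shell, which helps confirm how new a phishing domain is. The domain below is just a stand-in, and the exact fields in the output vary by registry and registrar:

# Look up when a .com domain was registered (field names vary by registrar).
whois example.com | grep -i 'creation date'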

Cloudflare built our secure registrar product in part to be able to monitor when domains using the Cloudflare brand were registered and get them shut down. However, because this domain was registered so recently, it had not yet been published as a new .com registration, so our systems did not detect its registration and our team had not yet moved to terminate it.

If you clicked on the link it took you to a phishing page. The phishing page was hosted on DigitalOcean and looked like this:

The mechanics of a sophisticated phishing scam and how we stopped it

Cloudflare uses Okta as our identity provider. The phishing page was designed to look identical to a legitimate Okta login page. The phishing page prompted anyone who visited it for their username and password.

Real-Time Phishing

We were able to analyze the payload of the phishing attack based on what our employees received as well as its content being posted to services like VirusTotal by other companies that had been attacked. When the phishing page was completed by a victim, the credentials were immediately relayed to the attacker via the messaging service Telegram. This real-time relay was important because the phishing page would also prompt for a Time-based One Time Password (TOTP) code.

Presumably, the attacker would receive the credentials in real time and enter them in the victim company’s actual login page. For many organizations, that would generate a code sent to the employee via SMS or displayed on a password generator. The employee would then enter the TOTP code on the phishing site, and it too would be relayed to the attacker. The attacker could then, before the TOTP code expired, use it to access the company’s actual login page, defeating most two-factor authentication implementations.

The mechanics of a sophisticated phishing scam and how we stopped it

Protected Even If Not Perfect

We confirmed that three Cloudflare employees fell for the phishing message and entered their credentials. However, Cloudflare does not use TOTP codes. Instead, every employee at the company is issued a FIDO2-compliant security key from a vendor like YubiKey. Since the hard keys are tied to users and implement origin binding, even a sophisticated, real-time phishing operation like this cannot gather the information necessary to log in to any of our systems. While the attacker attempted to log in to our systems with the compromised username and password credentials, they could not get past the hard key requirement.

But this phishing page was not simply after credentials and TOTP codes. If someone made it past those steps, the phishing page then initiated the download of a phishing payload which included AnyDesk’s remote access software. That software, if installed, would allow an attacker to control the victim’s machine remotely. We confirmed that none of our team members got to this step. If they had, however, our endpoint security would have stopped the installation of the remote access software.

How Did We Respond?

The main response actions we took for this incident were:

1. Block the phishing domain using Cloudflare Gateway

Cloudflare Gateway is a Secure Web Gateway solution providing threat and data protection with DNS / HTTP filtering and natively integrated Zero Trust. We use this solution internally to proactively identify malicious domains and block them. Our team added the malicious domain to Cloudflare Gateway to block all employees from accessing it.

Gateway’s automatic detection of malicious domains also identified the domain and blocked it, but because the domain was registered and the messages were sent within such a short interval, the system had not automatically taken action before some employees clicked on the links. Given this incident, we are working to speed up how quickly malicious domains are identified and blocked. We’re also implementing controls on access to newly registered domains, which we offer to customers but had not implemented ourselves.

2. Identify all impacted Cloudflare employees and reset compromised credentials

We were able to compare recipients of the phishing texts to login activity and identify threat-actor attempts to authenticate to our employee accounts. We identified login attempts that were blocked due to the hard key (U2F) requirement, indicating that the correct password was used but the second factor could not be verified. For the three employees whose credentials were leaked, we reset their credentials, revoked any active sessions, and initiated scans of their devices.

3. Identify and take down threat-actor infrastructure

The threat actor’s phishing domain was newly registered via Porkbun, and hosted on DigitalOcean. The phishing domain used to target Cloudflare was set up less than an hour before the initial phishing wave. The site had a Nuxt.js frontend, and a Django backend. We worked with DigitalOcean to shut down the attacker’s server. We also worked with Porkbun to seize control of the malicious domain.

From the failed sign-in attempts we were able to determine that the threat actor was leveraging Mullvad VPN software and distinctively using the Google Chrome browser on a Windows 10 machine. The VPN IP addresses used by the attacker were 198.54.132.88 and 198.54.135.222. Those IPs are assigned to Tzulo, a US-based dedicated server provider whose website claims they have servers located in Los Angeles and Chicago. It appears, however, that the first was actually running on a server in the Toronto area and the latter on a server in the Washington, DC area. We blocked these IPs from accessing any of our services.

4. Update detections to identify any subsequent attack attempts

With what we were able to uncover about this attack, we incorporated additional signals into our existing detections to specifically identify this threat actor. At the time of writing we have not observed any additional waves targeting our employees. However, intelligence from the server indicated the attacker was targeting other organizations, including Twilio. We reached out to these other organizations and shared intelligence on the attack.

5. Audit service access logs for any additional indications of attack

Following the attack, we screened all our system logs for any additional fingerprints from this particular attacker. Given Cloudflare Access serves as the central control point for all Cloudflare applications, we can search the logs for any indication the attacker may have breached any systems. Given employees’ phones were targeted, we also carefully reviewed the logs of our employee directory providers. We did not find any evidence of compromise.

Lessons Learned and Additional Steps We’re Taking

We learn from every attack. Even though the attacker was not successful, we are making additional adjustments based on what we’ve learned. We’re adjusting the settings for Cloudflare Gateway to restrict or sandbox access to sites running on domains that were registered within the last 24 hours. We will also run any non-whitelisted sites containing terms such as “cloudflare”, “okta”, “sso”, and “2fa” through our browser isolation technology. We are also increasingly using Area 1’s phish-identification technology to scan the web and look for any pages that are designed to target Cloudflare. Finally, we’re tightening up our Access implementation to prevent any logins from unknown VPNs, residential proxies, and infrastructure providers. All of these are standard features of the same products we offer to customers.

The attack also reinforced the importance of three things we’re doing well. First, requiring hard keys for access to all applications. Like Google, we have not seen any successful phishing attacks since rolling hard keys out. Tools like Cloudflare Access made it easy to support hard keys even across legacy applications. If you’re an organization interested in how we rolled out hard keys, reach out to [email protected] and our security team would be happy to share the best practices we learned through this process.

Second, using Cloudflare’s own technology to protect our employees and systems. Cloudflare One’s solutions like Access and Gateway were critical to staying ahead of this attack. We configured our Access implementation to require hard keys for every application. It also creates a central logging location for all application authentications and, if ever necessary, a place from which we can kill the sessions of a potentially compromised employee. Gateway gives us the ability to shut down malicious sites like this one quickly and to understand which employees may have fallen for the attack. These are all functionalities that we make available to Cloudflare customers as part of our Cloudflare One suite, and this attack demonstrates how effective they can be.

Third, having a paranoid but blame-free culture is critical for security. The three employees who fell for the phishing scam were not reprimanded. We’re all human and we make mistakes. It’s critically important that when we do, we report them and don’t cover them up. This incident provided another example of why security is part of every Cloudflare team member’s job.

Detailed Timeline of Events

  • 2022-07-20 22:49 UTC: Attacker sends out 100+ SMS messages to Cloudflare employees and their families.
  • 2022-07-20 22:50 UTC: Employees begin reporting SMS messages to Cloudflare Security team.
  • 2022-07-20 22:52 UTC: Verify that the attacker’s domain is blocked in Cloudflare Gateway for corporate devices.
  • 2022-07-20 22:58 UTC: Warning communication sent to all employees across chat and email.
  • 2022-07-20 22:50 UTC to 23:26 UTC: Monitor telemetry in the Okta System log & Cloudflare Gateway HTTP logs to locate credential compromise. Clear login sessions and suspend accounts on discovery.
  • 2022-07-20 23:26 UTC: Phishing site is taken down by the hosting provider.
  • 2022-07-20 23:37 UTC: Reset leaked employee credentials.
  • 2022-07-21 00:15 UTC: Deep dive into attacker infrastructure and capabilities.

Indicators of compromise

  • Value: cloudflare-okta[.]com hosted on 147[.]182[.]132[.]52. Type: Phishing URL. Context and MITRE Mapping: T1566.002: Phishing: Spear Phishing Link sent to users.
  • Value: 64547b7a4a9de8af79ff0eefadde2aed10c17f9d8f9a2465c0110c848d85317a. Type: SHA-256. Context and MITRE Mapping: T1219: Remote Access Software being distributed by the threat actor.
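As a practical note, a suspicious download can be checked against the SHA-256 indicator above with standard tooling; the filename below is only illustrative:

# Verify a local file against the published SHA-256 indicator (filename is hypothetical).
echo "64547b7a4a9de8af79ff0eefadde2aed10c17f9d8f9a2465c0110c848d85317a  suspicious-payload.bin" | sha256sum --check -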

What You Can Do

If you are seeing similar attacks in your environment, please don’t hesitate to reach out to [email protected]; we’re happy to share best practices on how to keep your business secure. Finally, if you want to work on detecting and mitigating the next attacks with us, we’re hiring on our Detection and Response team. Come join us!

Cloudflare outage on June 21, 2022

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/

Cloudflare outage on June 21, 2022

Introduction

Cloudflare outage on June 21, 2022

Today, June 21, 2022, Cloudflare suffered an outage that affected traffic in 19 of our data centers. Unfortunately, these 19 locations handle a significant proportion of our global traffic. This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations. A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.

Depending on your location in the world you may have been unable to access websites and services that rely on Cloudflare. In other locations, Cloudflare continued to operate normally.

We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.

Background

Over the last 18 months, Cloudflare has been working to convert all of our busiest locations to a more flexible and resilient architecture. In this time, we’ve converted 19 of our data centers to this architecture, internally called Multi-Colo PoP (MCP): Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, Tokyo.

A critical part of this new architecture, which is designed as a Clos network, is an added layer of routing that creates a mesh of connections. This mesh allows us to easily disable and enable parts of the internal network in a data center for maintenance or to deal with a problem. This layer is represented by the spines in the following diagram.

Cloudflare outage on June 21, 2022

This new architecture has provided us with significant reliability improvements, as well as allowing us to run maintenance in these locations without disrupting customer traffic. As these locations also carry a significant proportion of the Cloudflare traffic, any problem here can have a very wide impact, and unfortunately, that’s what happened today.

Incident timeline and impact

In order to be reachable on the Internet, networks like Cloudflare make use of a protocol called BGP. As part of this protocol, operators define policies which decide which prefixes (a collection of adjacent IP addresses) are advertised to peers (the other networks they connect to), or accepted from peers.

These policies have individual components, which are evaluated sequentially. The end result is that any given prefix will either be advertised or not advertised. A change in policy can mean a previously advertised prefix is no longer advertised, known as being “withdrawn”, and those IP addresses will no longer be reachable on the Internet.

Cloudflare outage on June 21, 2022

While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes.

Due to this withdrawal, Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. We have backup procedures for handling such an event and used them to take control of the affected locations.

03:56 UTC: We deploy the change to our first location. None of our locations are impacted by the change, as these are using our older architecture.
06:17: The change is deployed to our busiest locations, but not the locations with the MCP architecture.
06:27: The rollout reached the MCP-enabled locations, and the change is deployed to our spines. This is when the incident started, as this swiftly took these 19 locations offline.
06:32: Internal Cloudflare incident declared.
06:51: First change made on a router to verify the root cause.
06:58: Root cause found and understood. Work begins to revert the problematic change.
07:42: The last of the reverts has been completed. This was delayed as network engineers walked over each other’s changes, reverting the previous reverts, causing the problem to re-appear sporadically.
09:00: Incident closed.

The criticality of these data centers can clearly be seen in the volume of successful HTTP requests we handled globally:

Cloudflare outage on June 21, 2022

Even though these locations are only 4% of our total network, the outage impacted 50% of total requests. The same can be seen in our egress bandwidth:

Cloudflare outage on June 21, 2022

Technical description of the error and how it happened

As part of our continued effort to standardize our infrastructure configuration, we were rolling out a change to standardize the BGP communities we attach to a subset of the prefixes we advertise. Specifically, we were adding informational communities to our site-local prefixes.

These prefixes allow our metals to communicate with each other, as well as connect to customer origins. As part of the change procedure at Cloudflare, a Change Request ticket was created, which includes a dry-run of the change, as well as a stepped rollout procedure. Before it was allowed to go out, it was also peer reviewed by multiple engineers. Unfortunately, in this case, the steps weren’t small enough to catch the error before it hit all of our spines.

The change looked like this on one of the routers:

[edit policy-options policy-statement 4-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 4-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 6-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 6-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;

This was harmless, and just added some additional information to these prefix advertisements. The change on the spines was the following:

[edit policy-options policy-statement AGGREGATES-OUT]
term 6-DISABLED_PREFIXES { ... }
!    term 6-ADV-TRAFFIC-PREDICTOR { ... }
!    term 4-ADV-TRAFFIC-PREDICTOR { ... }
!    term ADV-FREE { ... }
!    term ADV-PRO { ... }
!    term ADV-BIZ { ... }
!    term ADV-ENT { ... }
!    term ADV-DNS { ... }
!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }
[edit policy-options policy-statement AGGREGATES-OUT term 4-ADV-SITE-LOCALS then]
community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;
[edit policy-options policy-statement AGGREGATES-OUT term 6-ADV-SITE-LOCALS then]
community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;

An initial glance at this diff might give the impression that this change is identical to the previous one, but unfortunately, that’s not the case. If we focus on one part of the diff, it might become clear why:

!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }

In this diff format, the exclamation marks in front of the terms indicate a re-ordering of the terms. In this case, multiple terms moved up, and two terms were added to the bottom. Specifically, the 4-ADV-SITE-LOCALS and 6-ADV-SITE-LOCALS terms moved from the top to the bottom. These terms were now behind the REJECT-THE-REST term, and as might be clear from the name, this term is an explicit reject:

term REJECT-THE-REST {
    then reject;
} 

As this term is now before the site-local terms, we immediately stopped advertising our site-local prefixes, removing our direct access to all the impacted locations, as well as removing the ability of our servers to reach origin servers.

On top of the inability to contact origins, the removal of these site-local prefixes also caused our internal load balancing system Multimog (a variation of our Unimog load-balancer) to stop working, as it could no longer forward requests between the servers in our MCPs. This meant that our smaller compute clusters in an MCP received the same amount of traffic as our largest clusters, causing the smaller ones to overload.

Cloudflare outage on June 21, 2022

Remediation and follow-up steps

This incident had widespread impact, and we take availability very seriously. We have identified several areas of improvement and will continue to work on uncovering any other gaps that could cause a recurrence.

Here is what we are working on immediately:

Process: While the MCP program was designed to improve availability, a procedural gap in how we updated these data centers ultimately caused a broader impact in MCP locations specifically. While we did use a stagger procedure for this change, the stagger policy did not include an MCP data center until the final step. Change procedures and automation need to include MCP-specific test and deploy procedures to ensure there are no unintended consequences.

Architecture: The incorrect router configuration prevented the proper routes from being announced, preventing traffic from flowing properly to our infrastructure. Ultimately the policy statement that caused the incorrect routing advertisement will be redesigned to prevent an unintentional incorrect ordering.

Automation: There are several opportunities in our automation suite that would mitigate some or all of the impact seen from this event. Primarily, we will be concentrating on automation improvements that enforce an improved stagger policy for rollouts of network configuration and provide an automated “commit-confirm” rollback. The former enhancement would have significantly lessened the overall impact, and the latter would have greatly reduced the Time-to-Resolve during the incident.

Conclusion

Although Cloudflare has invested significantly in our MCP design to improve service availability, we clearly fell short of our customer expectations with this very painful incident. We are deeply sorry for the disruption to our customers and to all the users who were unable to access Internet properties during the outage. We have already started working on the changes outlined above and will continue our diligence to ensure this cannot happen again.

PIPEFAIL: How a missing shell option slowed Cloudflare down

Post Syndicated from Alex Forster original https://blog.cloudflare.com/pipefail-how-a-missing-shell-option-slowed-cloudflare-down/

PIPEFAIL: How a missing shell option slowed Cloudflare down

PIPEFAIL: How a missing shell option slowed Cloudflare down

At Cloudflare, we’re used to being the fastest in the world. However, for approximately 30 minutes last December, Cloudflare was slow. Between 20:10 and 20:40 UTC on December 16, 2021, web requests served by Cloudflare were artificially delayed by up to five seconds before being processed. This post tells the story of how a missing shell option called “pipefail” slowed Cloudflare down.

Background

Before we can tell this story, we need to introduce you to some of its characters.

PIPEFAIL: How a missing shell option slowed Cloudflare down

Cloudflare’s Front Line protects millions of users from some of the largest attacks ever recorded. This protection is orchestrated by a sidecar service called dosd, which analyzes traffic and looks for attacks. When dosd detects an attack, it provides Front Line with a list of attack fingerprints that describe how Front Line can match and block the attack traffic.

Instances of dosd run on every Cloudflare server, and they communicate with each other using a peer-to-peer mesh to identify malicious traffic patterns. This decentralized design allows dosd to perform analysis with much higher fidelity than is possible with a centralized system, but its scale also imposes some strict performance requirements. To meet these requirements, we need to provide dosd with very fast access to large amounts of configuration data, which naturally means that dosd depends on Quicksilver. Cloudflare developed Quicksilver to manage configuration data and replicate it around the world in milliseconds, allowing it to be accessed by services like dosd in microseconds.

PIPEFAIL: How a missing shell option slowed Cloudflare down

One piece of configuration data that dosd needs comes from the Addressing API, which is our authoritative IP address management service. The addressing data it provides is important because dosd uses it to understand what kind of traffic is expected on particular IPs. Since addressing data doesn’t change very frequently, we use a simple Kubernetes cron job to query it at 10 minutes past each hour and write it into Quicksilver, allowing it to be efficiently accessed by dosd.

With this context, let’s walk through the change we made on December 16 that ultimately led to the slowdown.

The Change

Approximately once a week, all of our bug fixes and performance improvements to the Front Line codebase are released to the network. On December 16, the Front Line team released a fix for a subtle bug in how the code handled compression in the presence of a Cache-Control: no-transform header. Unfortunately, the team realized pretty quickly that this fix actually broke some customers who had started depending on that buggy behavior, so the team decided to roll back the release and work with those customers to correct the issue.

PIPEFAIL: How a missing shell option slowed Cloudflare down

Here’s a graph showing the progression of the rollback. While most releases and rollbacks are fully automated, this particular rollback needed to be performed manually due to its urgency. Since this was a manual rollback, SREs decided to perform it in two batches as a safety measure. The first batch went to our smaller tier 2 and 3 data centers, and the second batch went to our larger tier 1 data centers.

SREs started the first batch at 19:25 UTC, and it completed in about 30 minutes. Then, after verifying that there were no issues, they started the second batch at 20:10. That’s when the slowdown started.

The Slowdown

Within minutes of starting the second batch of rollbacks, alerts started firing. “Traffic levels are dropping.” “CPU utilization is dropping.” “A P0 incident has been automatically declared.” The timing could not be a coincidence. Somehow, a deployment of known-good code, which had been limited to a subset of the network and which had just been successfully performed 40 minutes earlier, appeared to be causing a global problem.

A P0 incident is an “all hands on deck” emergency, so dozens of Cloudflare engineers quickly began to assess impact to their services and test their theories about the root cause. The rollback was paused, but that did not fix the problem. Then, approximately 10 minutes after the start of the incident, my team – the DOS team – received a concerning alert: “dosd is not running on numerous servers.” Before that alert fired we had been investigating whether the slowdown was caused by an unmitigated attack, but this required our immediate attention.

Based on service logs, we were able to see that dosd was panicking because the customer addressing data in Quicksilver was corrupted in some way. Remember: the data in this Quicksilver key is important. Without it, dosd could not make correct choices anymore, so it refused to continue.

Once we realized that the addressing data was corrupted, we had to figure out how it was corrupted so that we could fix it. The answer turned out to be pretty obvious: the Quicksilver key was completely empty.

Following the old adage – “did you try restarting it?” – we decided to manually re-run the Kubernetes cron job that populates this key and see what happened. At 20:40 UTC, the cron job was manually triggered. Seconds after it completed, dosd started running again, and traffic levels began returning to normal. We confirmed that the Quicksilver key was no longer empty, and the incident was over.

The Aftermath

Despite fixing the problem, we still didn’t really understand what had just happened.

Why was the Quicksilver key empty?

It was urgent that we quickly figure out how an empty value was written into that Quicksilver key, because for all we knew, it could happen again at any moment.

We started by looking at the Kubernetes cron job, which turned out to have a bug:

PIPEFAIL: How a missing shell option slowed Cloudflare down

This cron job is implemented using a small Bash script. If you’re unfamiliar with Bash (particularly shell pipelining), here’s what it does:

First, the dos-make-addr-conf executable runs. Its job is to query the Addressing API for various bits of JSON data and serialize it into a Toml document. Afterward, that Toml is “piped” as input into the dosctl executable, whose job is to simply write it into a Quicksilver key called template_vars.

Can you spot the bug? Here’s a hint: what happens if dos-make-addr-conf fails for some reason and exits with a non-zero error code? It turns out that, by default, the shell pipeline ignores the error code and continues executing the next command! This means that the output of dos-make-addr-conf (which could be empty) gets unconditionally piped into dosctl and used as the value of the template_vars key, regardless of whether dos-make-addr-conf succeeded or failed.
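The script itself is internal, but based on the description above it boils down to a single pipeline along these lines. This is only a sketch, and the exact dosctl arguments are hypothetical:

#!/bin/bash
# Runs from a Kubernetes cron job at 10 minutes past each hour.
# Bug: if dos-make-addr-conf exits non-zero, the pipeline's exit status comes
# from dosctl alone, so an empty document is silently written to template_vars.
dos-make-addr-conf | dosctl set template_vars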

30 years ago, when the first users of Bourne shell were burned by this problem, a shell option called “pipefail” was introduced. Enabling this option changes the shell’s behavior so that, when any command in a pipeline series fails, the entire pipeline stops processing. However, this option is not enabled by default, so it’s widely recommended as best practice that all scripts should start by enabling this (and a few other) options.
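The difference is easy to demonstrate in an interactive Bash session:

false | cat; echo "exit status: $?"   # prints 0: the failure of false is ignored
set -o pipefail
false | cat; echo "exit status: $?"   # prints 1: the failure now propagates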

Here’s the fixed version of that cron job:

PIPEFAIL: How a missing shell option slowed Cloudflare down
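In essence, the fix enables strict error handling at the top of the script. Here is a sketch under the same assumptions as before (the dosctl invocation is hypothetical):

#!/bin/bash
# Abort on any command failure, unset variable, or failed pipeline stage.
set -euo pipefail

# If dos-make-addr-conf now fails, the whole job fails and the template_vars
# key keeps its previous, valid value instead of being overwritten.
dos-make-addr-conf | dosctl set template_vars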

This bug was particularly insidious because dosd actually did attempt to gracefully handle the case where this Quicksilver key contained invalid Toml. However, an empty string is a perfectly valid Toml document. If an error message had been accidentally written into this Quicksilver key instead of an empty string, then dosd would have rejected the update and continued to use the previous value.

Why did that cause the Front Line to slow down?

We had figured out how an empty key could be written into Quicksilver, and we were confident that it wouldn’t happen again. However, we still needed to untangle how that empty key caused such a severe incident.

As I mentioned earlier, the Front Line relies on dosd to tell it how to mitigate attacks, but it doesn’t depend on dosd directly to serve requests. Instead, once every few seconds, the Front Line asynchronously asks dosd for new attack fingerprints and stores them in an in-memory cache. This cache is consulted while serving each request, and if dosd ever fails to provide fresh attack fingerprints, then the stale fingerprints will continue to be used instead. So how could this have caused the impact that we saw?

PIPEFAIL: How a missing shell option slowed Cloudflare down

As part of the rollback process, the Front Line’s code needed to be reloaded. Reloading this code implicitly flushed the in-memory caches, including the attack fingerprint data from dosd. The next time that a request tried to consult with the cache, the caching layer realized that it had no attack fingerprints to return and a “cache miss” happened.

To handle a cache miss, the caching layer tried to reach out to dosd, and this is when the slowdown happened. While the caching layer was waiting for dosd to reply, it blocked all pending requests from progressing. Since dosd wasn’t running, the attempt eventually timed out after five seconds when the caching layer gave up. But in the meantime, each pending request was stuck waiting for the timeout to happen. Once it did, all the pending requests that were queued up over the five-second timeout period became unblocked and were finally allowed to progress. This cycle repeated over and over again every five seconds on every server until the dosd failure was resolved.

To trigger this slowdown, not only did dosd have to fail, but the Front Line’s in-memory cache had to also be flushed at the same time. If dosd had failed, but the Front Line’s cache had not been flushed, then the stale attack fingerprints would have remained in the cache and request processing would not have been impacted.

Why didn’t the first rollback cause this problem?

These two batches of rollbacks were performed by forcing servers to run a Salt highstate. When each batch was executed, thousands of servers began running highstates at the same time. The highstate process involves, among other things, contacting the Addressing API in order to retrieve various bits of customer addressing information.

The first rollback started at 19:25 UTC, and the second rollback started 45 minutes later at 20:10. Remember how I mentioned that our Kubernetes cron job only runs on the 10th minute of every hour? At 20:10 – exactly the time that our cron job started executing – thousands of servers also began to highstate, flooding the Addressing API with requests. All of these requests were queued up and eventually served, but it took the Addressing API a few minutes to work through the backlog. This delay was long enough to cause our cron job to time out, and, due to the “pipefail” bug, inadvertently clobber the Quicksilver key that it was responsible for updating.

To trigger the “pipefail” bug, not only did we have to flood the Addressing API with requests, we also had to do it at exactly 10 minutes after the hour. If SREs had started the second batch of rollbacks a few minutes earlier or later, this bug would have continued to lay dormant.

Lessons Learned

This was a unique incident where a chain of small or unlikely failures cascaded into a severe and painful outage that we deeply regret. In response, we have hardened each link in the chain:

  • A manual rollback inadvertently triggered the thundering herd problem, which overwhelmed the Addressing API. We have since significantly scaled out the Addressing API, so that it can handle high request rates if it ever again has to.
  • An error in a Kubernetes cron job caused invalid data to be written to Quicksilver. We have since made sure that, when this cron job fails, it is no longer possible for that failure to clobber the Quicksilver key.
  • dosd did not correctly handle all possible error conditions when loading configuration data from Quicksilver, causing it to fail. We have since taken these additional conditions into account where necessary, so that dosd will gracefully degrade in the face of corrupt Quicksilver data.
  • The Front Line had an unexpected dependency on dosd, which caused it to fail when dosd failed. We have since removed all such dependencies, and the Front Line will now gracefully survive dosd failures.

More broadly, this incident has served as an example to us of why code and systems must always be resilient to failure, no matter how unlikely that failure may seem.

Incorrect proxying of 24 hostnames on January 24, 2022

Post Syndicated from Jeremy Hartman original https://blog.cloudflare.com/incorrect-proxying-of-24-hostnames-on-january-24-2022/

Incorrect proxying of 24 hostnames on January 24, 2022

On January 24, 2022, as a result of an internal Cloudflare product migration, 24 hostnames (including www.cloudflare.com) that were actively proxied through the Cloudflare global network were mistakenly redirected to the wrong origin. During this incident, traffic destined for these hostnames was passed through to the clickfunnels.com origin and may have resulted in a clickfunnels.com page being displayed instead of the intended website. This was our doing and clickfunnels.com was unaware of our error until traffic started to reach their origin.

API calls or other expected responses to and from these hostnames may not have responded properly, or may have failed completely. For example, if you were making an API call to api.example.com, and api.example.com was an impacted hostname, you likely would not have received the response you would have expected.

Here is what happened:

At 2022-01-24 22:24 UTC we started a migration of hundreds of thousands of custom hostnames to the Cloudflare for SaaS product. Cloudflare for SaaS allows SaaS providers to manage their customers’ websites and SSL certificates at scale – more information is available here. This migration was intended to be completely seamless, with the outcome being enhanced features and security for our customers. The migration process was designed to read the custom hostname configuration from a database and migrate it from SaaS v1 (the old system) to SaaS v2 (the current version) automatically.

To better understand what happened next, it’s important to explain a bit more about how custom hostnames are configured.

First, Cloudflare for SaaS customers can configure any hostname; but before we will proxy traffic to them, they must prove (via DNS validation) that they actually are allowed to handle that hostname’s traffic.

When the Cloudflare for SaaS customer first configures the hostname, it is marked as pending until DNS validation has occurred. Pending hostnames are very common for Cloudflare for SaaS customers as the hostname gets provisioned, and then the SaaS provider will typically work with their customer to put in place the appropriate DNS validation that proves ownership.

Once a hostname passes DNS validation, it moves from a pending state to an active state and can be proxied. Except in one case: there’s a special check for whether the hostname is marked as blocked within Cloudflare’s system. A blocked hostname is one that can’t be activated without manual approval by our Trust & Safety team. Some scenarios that could lead to a hostname being blocked include when the hostname is a Cloudflare-owned property, a well known brand, or a hostname in need of additional review for a variety of reasons.

During this incident, a very small number of blocked hostnames were erroneously moved to the active state while migrating clickfunnels.com’s customers. Once that occurred, traffic destined for those previously blocked hostnames was then processed by a configuration belonging to clickfunnels.com, sending traffic to the clickfunnels.com’s origin. One of those hostnames was www.cloudflare.com. Note that it was www.cloudflare.com and not cloudflare.com, so subdomains like dash.cloudflare.com, api.cloudflare.com, cdnjs.cloudflare.com, and so on were unaffected by this problem.

As the migration process continued down the list of hostnames, additional traffic was re-routed to the clickfunnels.com origin. At 23:06 UTC www.cloudflare.com was affected. At 23:15 UTC an incident was declared internally. Since the first alert we received was for www.cloudflare.com, we started our investigation there. In the next 19 minutes, the team restored www.cloudflare.com to its correct origin, determined the breadth of the impact and the root cause of the incident, and began remediation for the remaining affected hostnames.

By 2022-01-25 00:13 UTC, all custom hostnames had been restored to their proper configuration and the incident was closed. We have contacted all the customers who were affected by this error. We have worked with ClickFunnels to delete logs of this event to ensure that no data erroneously sent to the clickfunnels.com origin is retained by them, and we are very grateful for their speedy assistance.

Here is a graph (on a log scale) of requests to clickfunnels.com during the event. Out of a total of 268,430,157 requests redirected, 268,220,296 (99.92%) were for www.cloudflare.com:

Incorrect proxying of 24 hostnames on January 24, 2022

At Cloudflare, we take these types of incidents very seriously, dedicating massive amounts of resources in preventative action and in follow-up engineering. In this case, there are both procedural and technical follow-ups to prevent reoccurrence. Here are our next steps:

  • No more blocked hostname overrides. All blocked hostname changes will route through our verification pipeline as part of the migration process.
  • All migrations will require explicit validation and approval from SaaS customers for a blocked hostname to be considered for activation.
  • Additional monitoring will be added to the hostnames being migrated to spot potential erroneous traffic patterns and alert the migration team.
  • Additional monitoring added for www.cloudflare.com.
  • Staging hostname activations on non-production elements prior to promoting them to production will let us verify that the new hostname state is as expected. This will allow us to catch issues before they hit production traffic.

Conclusion

This event exposed previously unknown gaps in our process and technology that directly impacted our customers. We are truly sorry for the disruption to our customers and any potential visitor to the impacted properties. Our commitment is to provide fully reliable and secure products, and we will continue to make every effort possible to deliver just that for our customers and partners.

A Byzantine failure in the real world

Post Syndicated from Tom Lianza original https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/

A Byzantine failure in the real world

An analysis of the Cloudflare API availability incident on 2020-11-02

When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). Eliminating these is a necessary step in architecting a system you can be confident in. Ironically, when you’re designing a system with built-in redundancy, you spend most of your time thinking about how well it functions when that redundancy is lost.

On November 2, 2020, Cloudflare had an incident that impacted the availability of the API and dashboard for six hours and 33 minutes. During this incident, the success rate for queries to our API periodically dipped as low as 75%, and the dashboard experience was as much as 80 times slower than normal. While Cloudflare’s edge is massively distributed across the world (and kept working without a hitch), Cloudflare’s control plane (API & dashboard) is made up of a large number of microservices that are redundant across two regions. For most services, the databases backing those microservices are only writable in one region at a time.

Each of Cloudflare’s control plane data centers has multiple racks of servers. Each of those racks has two switches that operate as a pair—both are normally active, but either can handle the load if the other fails. Cloudflare survives rack-level failures by spreading the most critical services across racks. Every piece of hardware has two or more power supplies with different power feeds. Every server that stores critical data uses RAID 10 redundant disks or storage systems that replicate data across at least three machines in different racks, or both. Redundancy at each layer is something we review and require. So—how could things go wrong?

In this post we present a timeline of what happened, and how a difficult failure mode known as a Byzantine fault played a role in a cascading series of events.

2020-11-02 14:43 UTC: Partial Switch Failure

At 14:43, a network switch started misbehaving. Alerts began firing about the switch being unreachable to pings. The device was in a partially operating state: network control plane protocols such as LACP and BGP remained operational, while others, such as vPC, were not. The vPC link is used to synchronize ports across multiple switches, so that they appear as one large, aggregated switch to servers connected to them. At the same time, the data plane (or forwarding plane) was not processing and forwarding all the packets received from connected devices.

This failure scenario is completely invisible to the connected nodes, as each server only sees an issue for some of its traffic due to the load-balancing nature of LACP. Had the switch failed fully, all traffic would have failed over to the peer switch, as the connected links would’ve simply gone down, and the ports would’ve dropped out of the forwarding LACP bundles.

Six minutes later, the switch recovered without human intervention. But this odd failure mode led to further problems that lasted long after the switch had returned to normal operation.

2020-11-02 14:44 UTC: etcd Errors begin

The rack with the misbehaving switch included one server in our etcd cluster. We use etcd heavily in our core data centers whenever we need strongly consistent data storage that’s reliable across multiple nodes.

In the event that the cluster leader fails, etcd uses the RAFT protocol to maintain consistency and establish consensus to promote a new leader. In the RAFT protocol, cluster members are assumed to be either available or unavailable, and to provide accurate information or none at all. This works fine when a machine crashes, but is not always able to handle situations where different members of the cluster have conflicting information.

In this particular situation:

  • Network traffic between node 1 (in the affected rack) and node 3 (the leader) was being sent through the switch in the degraded state,
  • Network traffic between node 1 and node 2 was going through the degraded switch’s working peer, and
  • Network traffic between node 2 and node 3 was unaffected.

This caused cluster members to have conflicting views of reality, known in distributed systems theory as a Byzantine fault. As a consequence of this conflicting information, node 1 repeatedly initiated leader elections, voting for itself, while node 2 repeatedly voted for node 3, which it could still connect to. This resulted in ties in which no leader that node 1 could reach was elected. RAFT leader elections are disruptive, blocking all writes until they’re resolved, so this made the cluster read-only until the faulty switch recovered and node 1 could once again reach node 3.
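
To make that failure mode a bit more concrete, here is a minimal sketch, not part of our production tooling, of how you could observe a cluster in this state using the official etcd v3 Go client (go.etcd.io/etcd/client/v3): ask each endpoint who it believes the leader is, and use a short-deadline write as a canary for a read-only cluster. The endpoint addresses and key name are invented for the example.

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // Illustrative endpoints; a real cluster would use its own addresses.
    endpoints := []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"}

    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   endpoints,
        DialTimeout: 2 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    // Ask each member who it believes the leader is. In a healthy cluster
    // every endpoint reports the same leader ID; during a partial network
    // failure like this one, members can disagree.
    for _, ep := range endpoints {
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        st, err := cli.Status(ctx, ep)
        cancel()
        if err != nil {
            fmt.Printf("%s: unreachable: %v\n", ep, err)
            continue
        }
        fmt.Printf("%s: leader=%x raftTerm=%d\n", ep, st.Leader, st.RaftTerm)
    }

    // A write with a short deadline acts as a canary: while elections are
    // churning, RAFT cannot commit and the cluster is effectively read-only.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    if _, err := cli.Put(ctx, "/health/canary", time.Now().Format(time.RFC3339)); err != nil {
        fmt.Println("writes are failing (cluster is effectively read-only):", err)
    } else {
        fmt.Println("writes are succeeding")
    }
}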

A Byzantine failure in the real world

2020-11-02 14:45 UTC: Database system promotes a new primary database

Cloudflare’s control plane services use relational databases hosted across multiple clusters within a data center. Each cluster is configured for high availability. The cluster setup includes a primary database, a synchronous replica, and one or more asynchronous replicas. This setup allows redundancy within a data center. For cross-datacenter redundancy, a similar high availability secondary cluster is set up and replicated in a geographically dispersed data center for disaster recovery. The cluster management system leverages etcd for cluster member discovery and coordination.

When etcd became read-only, two clusters were unable to communicate that they had a healthy primary database. This triggered the automatic promotion of a synchronous database replica to become the new primary. This process happened automatically and without error or data loss.
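
We haven’t described the internals of our cluster management system here, but the general pattern used by etcd-backed high-availability managers (Patroni and similar tools work this way) is a lease-backed health key: the primary keeps refreshing a key with a short TTL, and if those refreshes stop succeeding, the failover machinery assumes the primary is gone. The sketch below illustrates that generic pattern with invented key names and TTLs; it is not our implementation, but it shows why an etcd cluster that cannot commit writes looks exactly like a dead primary.

package main

import (
    "context"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// advertisePrimary keeps a lease-backed key alive that says "this node is
// the healthy primary". If etcd cannot commit writes (for example because
// leader elections are churning), the renewals fail, the lease expires, and
// the failover machinery concludes the primary is gone even though the
// database itself is perfectly healthy.
func advertisePrimary(cli *clientv3.Client, key, nodeID string) error {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    // Short TTL (in seconds) so a genuinely dead primary is noticed quickly.
    lease, err := cli.Grant(ctx, 10)
    if err != nil {
        return err
    }
    if _, err := cli.Put(ctx, key, nodeID, clientv3.WithLease(lease.ID)); err != nil {
        return err
    }

    // Keep renewing the lease for as long as we really are the primary.
    ch, err := cli.KeepAlive(context.Background(), lease.ID)
    if err != nil {
        return err
    }
    go func() {
        for range ch {
            // each message on the channel is a successful renewal
        }
        log.Printf("lost lease for %s; failover may be triggered", key)
    }()
    return nil
}

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"10.0.0.1:2379"},
        DialTimeout: 2 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    if err := advertisePrimary(cli, "/service/clusterX/primary", "db-primary-1"); err != nil {
        log.Fatal(err)
    }
    select {} // block forever in this sketch
}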

There was a defect in our cluster management system that required a rebuild of all database replicas when a new primary database was promoted. So, although the new primary database was available instantly, the replicas would take considerable time to become available, depending on the size of the database. For one of the clusters, service was restored quickly: the synchronous and asynchronous database replicas were rebuilt and started replicating successfully from the new primary, and the impact was minimal.

For the other cluster, however, performant operation of that database required a replica to be online. Because this database handles authentication for API calls and dashboard activities, it takes a lot of reads, and one replica was heavily utilized to spare the primary the load. When this failover happened and no replicas were available, the primary was overloaded, as it had to take all of the load. This is when the main impact started.

Reduce Load, Leverage Redundancy

At this point we saw that our primary authentication database was overwhelmed, and we began shedding load from it. We dialed back the rate at which we push SSL certificates to the edge, send emails, and perform other non-critical work, to give the database room to handle the additional load. Unfortunately, because of its size, we knew it would take several hours for a replica to be fully rebuilt.
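
The exact controls are internal, but mechanically this kind of load shedding often comes down to a token-bucket limiter in front of the background work. The sketch below, with invented rates and a placeholder job, uses golang.org/x/time/rate to show the shape of it.

package main

import (
    "context"
    "log"

    "golang.org/x/time/rate"
)

// pushCertificate stands in for any background job that generates load on
// the overloaded database (certificate pushes, emails, and so on).
func pushCertificate(id int) {
    log.Printf("pushed certificate %d", id)
}

func main() {
    // Normal operation: a healthy rate with a small burst (numbers invented).
    limiter := rate.NewLimiter(100, 10)

    // During an incident, dial the rate down sharply to shed load from the
    // database; callers simply wait longer for a token.
    limiter.SetLimit(5)

    ctx := context.Background()
    for i := 0; i < 20; i++ {
        if err := limiter.Wait(ctx); err != nil {
            log.Fatal(err)
        }
        pushCertificate(i)
    }
}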

A silver lining here is that every database cluster in our primary data center also has online replicas in our secondary data center. Those replicas are not part of the local failover process, and were online and available throughout the incident. The process of steering read-queries to those replicas was not yet automated, so we manually diverted API traffic that could leverage those read replicas to the secondary data center. This substantially improved our API availability.
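
Conceptually, the manual diversion amounted to read/write splitting with the read pool pointed at the replicas in the other data center; in practice it was done at the API-traffic level rather than in a database driver. The generic sketch below shows the idea; the hostnames, driver, and query are placeholders, not a description of our topology.

package main

import (
    "context"
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq" // placeholder Postgres driver
)

// SplitDB routes reads to replicas in the secondary data center and keeps
// writes on the primary in the primary data center.
type SplitDB struct {
    primary *sql.DB
    readers *sql.DB
}

func (s *SplitDB) Query(ctx context.Context, q string, args ...any) (*sql.Rows, error) {
    // Read-only queries go to the remote replicas, sparing the overloaded primary.
    return s.readers.QueryContext(ctx, q, args...)
}

func (s *SplitDB) Exec(ctx context.Context, q string, args ...any) (sql.Result, error) {
    // Writes still have to go to the single writable primary.
    return s.primary.ExecContext(ctx, q, args...)
}

func main() {
    primary, err := sql.Open("postgres", "host=db-primary.dc1 dbname=auth sslmode=require")
    if err != nil {
        log.Fatal(err)
    }
    readers, err := sql.Open("postgres", "host=db-replica.dc2 dbname=auth sslmode=require")
    if err != nil {
        log.Fatal(err)
    }

    db := &SplitDB{primary: primary, readers: readers}

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    rows, err := db.Query(ctx, "SELECT id FROM users WHERE email = $1", "user@example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()
}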

The Dashboard

The Cloudflare dashboard, like most web applications, has the notion of a user session. When user sessions are created (each time a user logs in) we perform some database operations and keep data in a Redis cluster for the duration of that user’s session. Unlike our API calls, our user sessions cannot currently be moved across the ocean without disruption. As we took actions to improve the availability of our API calls, we were unfortunately making the user experience on the dashboard worse.

This is an area of the system that is currently designed to fail over across data centers in the event of a disaster, but has not yet been designed to work in both data centers at the same time. After an initial period in which dashboard users grew increasingly frustrated, we failed the authentication calls fully back to our primary data center and kept working on our primary database to ensure we could provide the best service levels possible in that degraded state.

2020-11-02 21:20 UTC Database Replica Rebuilt

The instant the first database replica was rebuilt, it put itself back into service and performance returned to normal levels. We re-ramped all of the services that had been turned down so that asynchronous processing could catch up, and after a period of monitoring we marked the end of the incident.

Redundant Points of Failure

The cascade of failures in this incident was interesting because each system, on its face, had redundancy. Moreover, no system fully failed—each entered a degraded state. That combination meant the chain of events that transpired was considerably harder to model and anticipate. It was frustrating yet reassuring that some of the possible failure modes were already being addressed.

A team was already working on fixing the limitation that requires a database replica rebuild upon promotion. Our user sessions system was inflexible in scenarios where we’d like to steer traffic around, and redesigning that was already in progress.

This incident also led us to revisit the configuration parameters we put in place for things that auto-remediate. In previous years, promoting a database replica to primary took far longer than we liked, so getting that process automated and able to trigger on a minute’s notice was a point of pride. At the same time, for at least one of our databases, the cure may be worse than the disease, and in fact we may not want to invoke the promotion process so quickly. Immediately after this incident we adjusted that configuration accordingly.

Byzantine Fault Tolerance (BFT) is a hot research topic. Solutions have been known since 1982, but have had to choose between a variety of engineering tradeoffs, including security, performance, and algorithmic simplicity. Most general-purpose cluster management systems choose to forgo BFT entirely and use protocols based on PAXOS, or simplifications of PAXOS such as RAFT, that perform better and are easier to understand than BFT consensus protocols. In many cases, a simple protocol that is known to be vulnerable to a rare failure mode is safer than a complex protocol that is difficult to implement correctly or debug.

The first uses of BFT consensus were in safety-critical systems such as aircraft and spacecraft controls. These systems typically have hard real time latency constraints that require tightly coupling consensus with application logic in ways that make these implementations unsuitable for general-purpose services like etcd. Contemporary research on BFT consensus is mostly focused on applications that cross trust boundaries, which need to protect against malicious cluster members as well as malfunctioning cluster members. These designs are more suitable for implementing general-purpose services such as etcd, and we look forward to collaborating with researchers and the open source community to make them suitable for production cluster management.

We are very sorry for the difficulty the outage caused, and are continuing to improve as our systems grow. We’ve since fixed the bug in our cluster management system, and are continuing to tune each of the systems involved in this incident to be more resilient to failures of their dependencies. If you’re interested in helping solve these problems at scale, please visit cloudflare.com/careers.

Analysis of Today’s CenturyLink/Level(3) Outage

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/

Analysis of Today's CenturyLink/Level(3) Outage

Today CenturyLink/Level(3), a major ISP and Internet bandwidth provider, experienced a significant outage that impacted some of Cloudflare’s customers as well as a significant number of other services and providers across the Internet. While we’re waiting for a post mortem from CenturyLink/Level(3), I wanted to write up the timeline of what we saw, how Cloudflare’s systems routed around the problem, why some of our customers were still impacted in spite of our mitigations, and what appears to be the likely root cause of the issue.

Increase In Errors

At 10:03 UTC our monitoring systems started to observe an increased number of errors reaching our customers’ origin servers. These show up as “522 Errors” and indicate that there is an issue connecting from Cloudflare’s network to wherever our customers’ applications are hosted.

Cloudflare connects to CenturyLink/Level(3) as one of a large and diverse set of network providers. When we see an increase in errors from one network provider, our systems automatically attempt to reach customers’ applications across alternative providers. Given the number of providers we have access to, we are generally able to continue to route traffic even when one provider has an issue.

Analysis of Today's CenturyLink/Level(3) Outage
The diverse set of network providers Cloudflare connects to. Source: https://bgp.he.net/AS13335#_asinfo‌‌

Automatic Mitigations

In this case, beginning within seconds of the increase in 522 errors, our systems automatically rerouted traffic from CenturyLink/Level(3) to alternate network providers we connect to including Cogent, NTT, GTT, Telia, and Tata.

Our Network Operations Center was also alerted, and beginning at 10:09 UTC our team took additional steps to mitigate any issues our automated systems weren’t able to address on their own. We were successful in keeping traffic flowing across our network for most customers and end users even with the loss of CenturyLink/Level(3) as one of our network providers.

Analysis of Today's CenturyLink/Level(3) Outage
Dashboard showing Cloudflare’s automated systems recognizing the damage to the Internet caused by the CenturyLink/Level(3) failure and automatically routing around it.

The graph below shows traffic between Cloudflare’s network and six major tier-1 networks that are among the network providers we connect to. The red portion shows CenturyLink/Level(3) traffic, which dropped to near-zero during the incident. You can also see how we automatically shifted traffic to other network providers during the incident to mitigate the impact and ensure traffic continued to flow.

Analysis of Today's CenturyLink/Level(3) Outage
Traffic across six major tier-1 networks that are among the network providers Cloudflare connects to. CenturyLink/Level(3) in red.

The following graph shows 522 errors (indicating our inability to reach customers’ applications) across our network during the time of the incident.

Analysis of Today's CenturyLink/Level(3) Outage

The sharp spike at 10:03 UTC was the CenturyLink/Level(3) network failing. Our automated systems immediately kicked in to attempt to reroute and rebalance traffic across alternative network providers, causing the errors to drop by half immediately and then fall to approximately 25 percent of the peak as those paths were automatically optimized.

Between 10:03 UTC and 10:11 UTC our systems automatically disabled CenturyLink/Level(3) in the 48 cities where we’re connected to them and rerouted traffic across alternate network providers. Our systems take into account the capacity available on other providers before shifting traffic, in order to prevent cascading failures. This is why the failover, while automatic, isn’t instantaneous in all locations. Our team was able to apply additional manual mitigations to reduce the number of errors by another 5 percent.
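
The real traffic engineering logic is far more involved, but the capacity check it performs can be sketched simply: before disabling a provider in a location, confirm that the remaining providers have enough headroom to absorb the traffic that would move to them. The provider names in the sketch below are networks mentioned above; the traffic and capacity numbers are invented.

package main

import "fmt"

// Provider describes one transit/peering provider in a single city: the
// traffic it currently carries and the capacity we are willing to use on it.
type Provider struct {
    Name         string
    CurrentGbps  float64
    CapacityGbps float64
}

// canShiftAway reports whether the traffic currently on a failing provider
// fits within the spare capacity of the remaining providers. Shifting
// without this check risks overloading the alternatives and turning one
// provider's failure into a cascading one.
func canShiftAway(failing Provider, others []Provider) bool {
    spare := 0.0
    for _, p := range others {
        spare += p.CapacityGbps - p.CurrentGbps
    }
    return spare >= failing.CurrentGbps
}

func main() {
    failing := Provider{Name: "CenturyLink/Level(3)", CurrentGbps: 400, CapacityGbps: 600}
    others := []Provider{
        {Name: "Cogent", CurrentGbps: 300, CapacityGbps: 500},
        {Name: "NTT", CurrentGbps: 250, CapacityGbps: 500},
        {Name: "Telia", CurrentGbps: 200, CapacityGbps: 400},
    }

    if canShiftAway(failing, others) {
        fmt.Println("safe to disable the failing provider in this city")
    } else {
        fmt.Println("not enough headroom; shift traffic partially or gradually")
    }
}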

Why Did the Errors Not Drop to Zero?

Unfortunately, there were still an elevated number of errors indicating we were still unable to reach some customers. CenturyLink/Level(3) is among the largest network providers in the world. As a result, many hosting providers only have single-homed connectivity to the Internet through their network.

To use the old “Internet as a superhighway” analogy, that’s like having only a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. For customers whose only connectivity to the Internet is via CenturyLink/Level(3), or where CenturyLink/Level(3) continued to announce bad routes after they’d been withdrawn, there was no way for us to reach their applications, and they continued to see 522 errors until CenturyLink/Level(3) resolved the issue around 14:30 UTC.

The same problem existed on the other (“eyeball”) side of the network. Individuals need an onramp onto the Internet’s superhighway, and an onramp is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States.

Analysis of Today's CenturyLink/Level(3) Outage
Source: https://broadbandnow.com/CenturyLink

Because this outage appeared to take the entire CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. Globally, we saw a 3.5% drop in traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.

So What Likely Happened Here?

While we will not know exactly what happened until CenturyLink/Level(3) issues a post mortem, we can see clues from BGP announcements and how they propagated across the Internet during the outage. BGP is the Border Gateway Protocol: it is how routers on the Internet announce to each other which IPs sit behind them and therefore which traffic they should receive.

Starting at 10:04 UTC, there was a significant spike in BGP updates. A BGP update is the signal a router sends to say that a route has changed or is no longer available. Under normal conditions, the Internet sees about 1.5MB–2MB of BGP updates every 15 minutes. At the start of the incident, the volume spiked to more than 26MB of BGP updates per 15-minute period and stayed elevated throughout the incident.
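
That figure can be reproduced roughly from the RouteViews archive referenced below: the UPDATES files are written in 15-minute intervals, so their compressed size is a serviceable proxy for update volume. A small sketch follows, assuming the relevant files have already been mirrored into a local directory; the path is made up.

package main

import (
    "fmt"
    "log"
    "os"
    "sort"
    "strings"
)

func main() {
    // Directory of files mirrored from the RouteViews archive (see the
    // source link below); each compressed MRT dump covers one 15-minute
    // interval and is named like updates.20200830.1000.bz2.
    dir := "./routeviews-updates"

    entries, err := os.ReadDir(dir)
    if err != nil {
        log.Fatal(err)
    }

    // Compressed file size per 15-minute interval as a rough proxy for
    // BGP update volume.
    sizes := map[string]int64{}
    for _, e := range entries {
        parts := strings.Split(e.Name(), ".") // ["updates", "20200830", "1000", "bz2"]
        if len(parts) != 4 || parts[0] != "updates" {
            continue
        }
        info, err := e.Info()
        if err != nil {
            continue
        }
        sizes[parts[1]+" "+parts[2]] += info.Size()
    }

    // Print the intervals in order; the jump from a couple of MB to tens
    // of MB around 10:00 UTC on the day of the incident stands out.
    var intervals []string
    for k := range sizes {
        intervals = append(intervals, k)
    }
    sort.Strings(intervals)
    for _, interval := range intervals {
        fmt.Printf("%s UTC  %6.1f MB\n", interval, float64(sizes[interval])/1e6)
    }
}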

Analysis of Today's CenturyLink/Level(3) Outage
Source: http://archive.routeviews.org/bgpdata/2020.08/UPDATES/

These updates show the instability of BGP routes inside the CenturyLink/Level(3) backbone. The question is what would have caused this instability. The CenturyLink/Level(3) status update offers some hints and points at a flowspec update as the root cause.

Analysis of Today's CenturyLink/Level(3) Outage

What’s Flowspec?

In CenturyLink/Level(3)’s update they mention that a bad Flowspec rule caused the issue. So what is Flowspec? Flowspec is an extension to BGP, which allows firewall rules to be easily distributed across a network, or even between networks, using BGP. Flowspec is a powerful tool. It allows you to efficiently push rules across an entire network almost instantly. It is great when you are trying to quickly respond to something like an attack, but it can be dangerous if you make a mistake.

At Cloudflare, early in our history, we used to use Flowspec ourselves to push out firewall rules in order to, for instance, mitigate large network-layer DDoS attacks. We suffered our own Flowspec-induced outage more than 7 years ago. We no longer use Flowspec ourselves, but it remains a common protocol for pushing out network firewall rules.

We can only speculate what happened at CenturyLink/Level(3), but one plausible scenario is that they issued a Flowspec command to try to block an attack or other abuse directed at their network. The status report indicates that the Flowspec rule prevented BGP itself from being announced. We have no way of knowing what that Flowspec rule was, but here’s one in Juniper’s format that would have blocked all BGP communications across their network.

route DISCARD-BGP {
    match {
        protocol tcp;
        destination-port 179;
    }
    then discard;
}

Why So Many Updates?

A mystery remains, however: why did global BGP updates stay elevated throughout the incident? If the rule blocked BGP, you would expect to see a spike in BGP announcements initially and then a return to normal levels.

One possible explanation is that the offending Flowspec rule came near the end of a long list of BGP updates. If that were the case, what may have happened is that every router in CenturyLink/Level(3)’s network would receive the Flowspec rule. They would then block BGP. That would cause them to stop receiving the rule. They would start back up again, working their way through all the BGP updates until they got to the offending Flowspec rule again. BGP would be dropped again. The Flowspec rule would no longer be received. And the loop would continue, over and over.

One challenge of this is that, on every cycle, the queue of BGP updates within CenturyLink/Level(3)’s network would continue to grow. This may have reached the point where the memory and CPU of their routers were overloaded, causing an additional set of challenges in getting their network back online.
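
This is only a hypothesis, but the shape of the feedback loop is easy to see in a toy model: each time the sessions come back, the router has to replay a backlog that kept growing while it was down, and somewhere in that replay sits the same rule that kills BGP again. Every number below is invented purely to show the dynamic.

package main

import "fmt"

func main() {
    // Every number here is invented; the point is only the shape of the loop.
    const (
        newUpdatesPerSecond = 2000 // updates arriving from the rest of the network
        processPerSecond    = 5000 // how fast a router can replay its backlog
        reconnectSeconds    = 30   // time to re-establish BGP sessions after they drop
    )

    backlog := 1_000_000.0 // updates queued when the bad Flowspec rule first lands
    for cycle := 1; cycle <= 6; cycle++ {
        // The router replays its queue until it reaches the Flowspec rule
        // (assumed to sit near the end), at which point BGP is discarded
        // and the sessions drop.
        replaySeconds := backlog / processPerSecond

        // While it replays, and again while the sessions re-establish, the
        // rest of the network keeps generating updates, so the next cycle
        // starts with an even larger queue.
        downSeconds := replaySeconds + reconnectSeconds
        backlog += newUpdatesPerSecond * downSeconds

        fmt.Printf("cycle %d: replayed for ~%.0fs, next queue ~%.0f updates\n",
            cycle, replaySeconds, backlog)
    }
}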

Why Did It Take So Long to Fix?

This was a significant global Internet outage and, undoubtedly, the CenturyLink/Level(3) team received immediate alerts. They are a very sophisticated network operator with a world class Network Operations Center (NOC). So why did it take more than four hours to resolve?

Again, we can only speculate. First, it may have been that the Flowspec rule, and the significant load that the large number of BGP updates imposed on their routers, made it difficult for them to log in to their own interfaces. Several of the other tier-1 providers took action, apparently at CenturyLink/Level(3)’s request, to de-peer their networks. This would have limited the number of BGP announcements being received by the CenturyLink/Level(3) network and helped give it time to catch up.

Second, it also may have been that the Flowspec rule was not issued by CenturyLink/Level(3) themselves but rather by one of their customers. Many network providers will allow Flowspec peering. This can be a powerful tool for downstream customers wishing to block attack traffic, but can make it much more difficult to track down an offending Flowspec rule when something goes wrong.

Finally, it never helps when these issues occur early on a Sunday morning. Networks the size and scale of CenturyLink/Level(3)’s are extremely complicated. Incidents happen. We appreciate their team keeping us informed with what was going on throughout the incident. #hugops