Tag Archives: Outage

Cloudflare outage on June 21, 2022

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/

Introduction

Today, June 21, 2022, Cloudflare suffered an outage that affected traffic in 19 of our data centers. Unfortunately, these 19 locations handle a significant proportion of our global traffic. This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations. A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.

Depending on your location in the world you may have been unable to access websites and services that rely on Cloudflare. In other locations, Cloudflare continued to operate normally.

We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.

Background

Over the last 18 months, Cloudflare has been working to convert all of our busiest locations to a more flexible and resilient architecture. In this time, we’ve converted 19 of our data centers to this architecture, internally called Multi-Colo PoP (MCP): Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, Tokyo.

A critical part of this new architecture, which is designed as a Clos network, is an added layer of routing that creates a mesh of connections. This mesh allows us to easily disable and enable parts of the internal network in a data center for maintenance or to deal with a problem. This layer is represented by the spines in the following diagram.

This new architecture has provided us with significant reliability improvements, as well as allowing us to run maintenance in these locations without disrupting customer traffic. As these locations also carry a significant proportion of the Cloudflare traffic, any problem here can have a very wide impact, and unfortunately, that’s what happened today.

Incident timeline and impact

In order to be reachable on the Internet, networks like Cloudflare make use of a protocol called BGP. As part of this protocol, operators define policies that decide which prefixes (collections of adjacent IP addresses) are advertised to peers (the other networks they connect to), or accepted from peers.

These policies have individual components, which are evaluated sequentially. The end result is that any given prefix will either be advertised or not advertised. A change in policy can mean a previously advertised prefix is no longer advertised, known as being “withdrawn”, and those IP addresses will no longer be reachable on the Internet.
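
To make the sequential evaluation concrete, here is a minimal, purely illustrative Junos-style policy (not Cloudflare's actual configuration, and using a documentation prefix): a route that matches an earlier term is accepted or rejected there and never reaches the later terms.

policy-options {
    policy-statement EXAMPLE-TRANSIT-OUT {
        /* Advertise this documentation prefix to the peer. */
        term ADV-EXAMPLE-PREFIX {
            from {
                route-filter 192.0.2.0/24 orlonger;
            }
            then accept;
        }
        /* Anything that did not match an earlier term is withdrawn. */
        term REJECT-THE-REST {
            then reject;
        }
    }
}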

While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes.

Due to this withdrawal, Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. We have backup procedures for handling such an event and used them to take control of the affected locations.

03:56 UTC: We deploy the change to our first location. None of our locations are impacted by the change, as these are using our older architecture.
06:17 UTC: The change is deployed to our busiest locations, but not the locations with the MCP architecture.
06:27 UTC: The rollout reached the MCP-enabled locations, and the change is deployed to our spines. This is when the incident started, as this swiftly took these 19 locations offline.
06:32 UTC: Internal Cloudflare incident declared.
06:51 UTC: First change made on a router to verify the root cause.
06:58 UTC: Root cause found and understood. Work begins to revert the problematic change.
07:42 UTC: The last of the reverts has been completed. This was delayed as network engineers walked over each other’s changes, reverting the previous reverts, causing the problem to re-appear sporadically.
09:00 UTC: Incident closed.

The criticality of these data centers can clearly be seen in the volume of successful HTTP requests we handled globally:

Even though these locations are only 4% of our total network, the outage impacted 50% of total requests. The same can be seen in our egress bandwidth:

Technical description of the error and how it happened

As part of our continued effort to standardize our infrastructure configuration, we were rolling out a change to standardize the BGP communities we attach to a subset of the prefixes we advertise. Specifically, we were adding informational communities to our site-local prefixes.

These prefixes allow our metals (servers) to communicate with each other, as well as connect to customer origins. As part of the change procedure at Cloudflare, a Change Request ticket was created, which includes a dry-run of the change, as well as a stepped rollout procedure. Before it was allowed to go out, it was also peer reviewed by multiple engineers. Unfortunately, in this case, the steps weren’t small enough to catch the error before it hit all of our spines.

The change looked like this on one of the routers:

[edit policy-options policy-statement 4-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 4-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 6-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 6-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;

This was harmless, and just added some additional information to these prefix advertisements. The change on the spines was the following:

[edit policy-options policy-statement AGGREGATES-OUT]
term 6-DISABLED_PREFIXES { ... }
!    term 6-ADV-TRAFFIC-PREDICTOR { ... }
!    term 4-ADV-TRAFFIC-PREDICTOR { ... }
!    term ADV-FREE { ... }
!    term ADV-PRO { ... }
!    term ADV-BIZ { ... }
!    term ADV-ENT { ... }
!    term ADV-DNS { ... }
!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }
[edit policy-options policy-statement AGGREGATES-OUT term 4-ADV-SITE-LOCALS then]
community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;
[edit policy-options policy-statement AGGREGATES-OUT term 6-ADV-SITE-LOCALS then]
community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;

An initial glance at this diff might give the impression that this change is identical to the first one, but unfortunately, that’s not the case. If we focus on one part of the diff, it might become clear why:

!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }

In this diff format, the exclamation marks in front of the terms indicate a re-ordering of the terms. In this case, multiple terms moved up, and two terms were added to the bottom. Specifically, the 4-ADV-SITE-LOCALS and 6-ADV-SITE-LOCALS terms moved from the top to the bottom. These terms were now behind the REJECT-THE-REST term, and as might be clear from the name, this term is an explicit reject:

term REJECT-THE-REST {
    then reject;
} 

As this term is now before the site-local terms, we immediately stopped advertising our site-local prefixes, removing our direct access to all the impacted locations, as well as removing the ability of our servers to reach origin servers.
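
A straightforward way to make this failure mode impossible, sketched here in the same notation as the diff above (an illustration, not the actual redesign described later), is to keep the explicit reject as the very last term so the site-local terms are always evaluated before it:

[edit policy-options policy-statement AGGREGATES-OUT]
term 4-ADV-SITE-LOCALS { ... }
term 6-ADV-SITE-LOCALS { ... }
term REJECT-THE-REST {
    then reject;
}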

On top of the inability to contact origins, the removal of these site-local prefixes also caused our internal load balancing system Multimog (a variation of our Unimog load-balancer) to stop working, as it could no longer forward requests between the servers in our MCPs. This meant that our smaller compute clusters in an MCP received the same amount of traffic as our largest clusters, causing the smaller ones to overload.

Remediation and follow-up steps

This incident had widespread impact, and we take availability very seriously. We have identified several areas of improvement and will continue to work on uncovering any other gaps that could cause a recurrence.

Here is what we are working on immediately:

Process: While the MCP program was designed to improve availability, a procedural gap in how we updated these data centers ultimately caused a broader impact in MCP locations specifically. While we did use a stagger procedure for this change, the stagger policy did not include an MCP data center until the final step. Change procedures and automation need to include MCP-specific test and deploy procedures to ensure there are no unintended consequences.

Architecture: The incorrect router configuration prevented the proper routes from being announced, preventing traffic from flowing properly to our infrastructure. Ultimately the policy statement that caused the incorrect routing advertisement will be redesigned to prevent an unintentional incorrect ordering.

Automation: There are several opportunities in our automation suite that would mitigate some or all of the impact seen from this event. Primarily, we will be concentrating on automation improvements that enforce an improved stagger policy for rollouts of network configuration and provide an automated “commit-confirm” rollback. The former enhancement would have significantly lessened the overall impact, and the latter would have greatly reduced the Time-to-Resolve during the incident.
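
For reference, the “commit-confirm” pattern mentioned above already exists as a manual safeguard on routers running Junos (commit confirmed); the planned work is to automate the equivalent behavior. A brief sketch of the manual form, with a 10-minute window chosen purely as an example:

user@router# commit confirmed 10
(Junos rolls the change back automatically after 10 minutes unless it is confirmed.)
user@router# commit
(Issued within the window, after verifying that BGP sessions and traffic levels still look healthy.)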

Conclusion

Although Cloudflare has invested significantly in our MCP design to improve service availability, we clearly fell short of our customer expectations with this very painful incident. We are deeply sorry for the disruption to our customers and to all the users who were unable to access Internet properties during the outage. We have already started working on the changes outlined above and will continue our diligence to ensure this cannot happen again.

Exam time means Internet disruptions in Syria, Sudan and Algeria

Post Syndicated from David Belson original https://blog.cloudflare.com/syria-sudan-algeria-exam-internet-shutdown/

It is once again exam time in Syria, Sudan, and Algeria, and with it, we find these countries disrupting Internet connectivity in an effort to prevent cheating on these exams. As they have done over the past several years, Syria and Sudan are implementing multi-hour nationwide Internet shutdowns. Algeria has also taken a similar approach in the past, but this year appears to be implementing more targeted website/application blocking.

Syria

Syria has been implementing Internet shutdowns across the country since 2011, but exam-related shutdowns have only been in place since 2016. In 2021, exams took place between May 31 and June 22, with multi-hour shutdowns observed on each of the exam days.

This year, the first shutdown was observed on May 30, with subsequent shutdowns (to date) seen on June 2, 6, and 12. In the Cloudflare Radar graph below, traffic for Syria drops to zero while the shutdowns are active. According to Internet Society Pulse, several additional shutdowns are expected through June 21. Each takes place between 0200 and 0530 UTC (0500–0830 local time). According to a published report, the current exam cycle covers more than 500,000 students for basic and general secondary education certificates.

Consistent with shutdowns observed in prior years, Syria is once again implementing them in an asymmetric fashion – that is, inbound traffic is disabled, but egress traffic remains. This is clearly visible in request traffic from Syria to Cloudflare’s 1.1.1.1 DNS resolver. As the graph below shows, queries from clients in Syria are able to exit the country and reach Cloudflare, but responses can’t return, leading to retry floods, visible as spikes in the graph.

Last year, the Syrian Minister of Education noted that, for the first time, encryption and surveillance technologies would be used in an effort to curtail cheating, with an apparent promise to suspend Internet shutdowns in the future if these technologies proved successful.

Sudan

Sudan is also no stranger to nationwide Internet shutdowns, with some lasting for multiple weeks. Over the last several years, Sudan has also implemented Internet shutdowns during secondary school exams in an effort to limit cheating or leaking of exam questions. (We covered the 2021 round of shutdowns in a blog post.)

According to a schedule published by digital rights organization AccessNow, this year’s Secondary Certificate Exams will be taking place in Sudan daily between June 11–22, except June 17. As of this writing, near-complete shutdowns have been observed on June 11, 12, and 13 between 0530-0830 UTC (0730-1030 local time), as seen in the graph below. The timing of these shutdowns aligns with a communication reportedly sent to subscribers of telecommunications services in the country, which stated “In implementation of the decision of the Attorney General, the Internet service will be suspended during the Sudanese certificate exam sessions from 8 in the morning until 11 in the morning.”

It is interesting to note that the shutdown, while nationwide, does not appear to be complete. The graph below shows that Cloudflare continues to see a small volume of HTTP requests from Sudatel during the shutdown periods. This is not completely unusual, as Sudatel may have public sector, financial services, or other types of customers that remain online.

Algeria

Since 2018, Algeria has been shutting down the Internet nationwide during baccalaureate exams, following widespread cheating in 2016 that saw questions leaked online both before and during tests. These shutdowns reportedly cost businesses across the country an estimated 500 million Algerian Dinars (approximately $3.4 million USD) for every hour the Internet was unavailable. In 2021, there were two Internet shutdowns each day that exams took place—the first between 0700–1100 UTC (0800–1200 local time), and the second between 1330–1600 UTC (1430–1700 local time).

This year, more than 700,000 students will sit for the baccalaureate exams between June 12-16.

Perhaps recognizing the economic damage caused by these Internet shutdowns, this year the Algerian Minister of National Education announced that there would be no Internet shutdowns on exam days.

Thus far, it appears that this has been the case. However, it appears that the Algerian government has shifted to a content blocking-based approach, instead of a wide-scale Internet shutdown. The Cloudflare Radar graph below shows two nominal drops in country-level traffic during the two times on June 13 that the exams took place—0730–1000 UTC (0830–1100 local time) and 1330–1600 UTC (1430–1700 local time), similar to last year’s timing.

The disruptions are also visible in traffic graphs for several major Algerian network providers, as shown below.

Analysis of additional Cloudflare data further supports the hypothesis that Algeria is blocking access to specific websites and applications, rather than shutting down the Internet completely.

As described in a previous blog post, Network Error Logging (NEL) is a browser-based reporting system that allows users’ browsers to report connection failures to an endpoint specified by the webpage that failed to load. Below, a graph of NEL reports from browsers in Algeria shows clear spikes during the times (thus far) that the exams have taken place, with report levels significantly lower and more consistent during other times of the day.
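
For background, a site opts into NEL by sending two HTTP response headers that point browsers at a reporting endpoint. A minimal illustrative pair (the endpoint URL is a placeholder, not Cloudflare's actual collector) looks like this:

NEL: {"report_to": "network-errors", "max_age": 604800}
Report-To: {"group": "network-errors", "max_age": 604800, "endpoints": [{"url": "https://nel-reports.example.com/ingest"}]}

When a connection to the site later fails (DNS error, TCP timeout, TLS failure, and so on), the browser delivers a JSON report to that endpoint, which is why these failures can be counted even though the blocked requests themselves never arrive.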

Conclusion

In addition to Syria, Sudan, and Algeria, countries including India, Jordan, Iraq, Uzbekistan, and Ethiopia have shut down or limited access to the Internet as exams took place. It is unclear whether these brute-force methods are truly effective at preventing cheating on these exams. However, it is clear that the impact of these shutdowns goes beyond students, as they impose a significant financial cost on businesses within the affected countries as they lose Internet access for multiple hours a day over the course of several weeks.

If you want to follow the remaining scheduled disruptions for these countries, you can see live data on the Cloudflare Radar pages for Syria, Sudan, and Algeria.

AAE-1 & SMW5 cable cuts impact millions of users across multiple countries

Post Syndicated from David Belson original https://blog.cloudflare.com/aae-1-smw5-cable-cuts/

Just after 1200 UTC on Tuesday, June 7, the Africa-Asia-Europe-1 (AAE-1) and SEA-ME-WE-5 (SMW5) submarine cables suffered cable cuts. The damage reportedly occurred in Egypt, and impacted Internet connectivity for millions of Internet users across multiple countries in the Middle East and Africa, as well as thousands of miles away in Asia. In addition, Google Cloud Platform and OVHcloud reported connectivity issues due to these cable cuts.

The impact

Data from Cloudflare Radar showed significant drops in traffic across the impacted countries as the cable damage occurred, recovering approximately four hours later as the cables were repaired.

It appears that Saudi Arabia may have also been affected by the cable cut(s), but the impact was much less significant, and traffic recovered almost immediately.

In the graphs above, we show that Ethiopia was one of the impacted countries. However, as it is landlocked, there are obviously no submarine cable landing points within the country. The Afterfibre map from the Network Startup Resource Center (NSRC) shows that the fiber in Ethiopia connects to fiber in Somalia, which experienced an impact. In addition, Ethio Telecom routes traffic through network providers in Kenya and Djibouti. Djibouti Telecom, one of these providers, in turn peers with larger global providers like Telecom Italia (TI) Sparkle, which is one of the owners of SMW5.

In addition to impacting end-user connectivity in the impacted countries, the cable cuts also reportedly impacted cloud providers including Google Cloud Platform and OVHcloud. In their incident report, Google Cloud noted “Google Cloud Networking experienced increased packet loss for egress traffic from Google to the Middle East, and elevated latency between our Europe and Asia Regions as a result, for 3 hours and 12 minutes, affecting several related products including Cloud NAT, Hybrid Connectivity and Virtual Private Cloud (VPC). From preliminary analysis, the root cause of the issue was a capacity shortage following two simultaneous fiber-cuts.” OVHcloud noted that “Backbone links between Marseille and Singapore are currently down” and that “Upon further investigation, our Network OPERATION teams advised that the fault was related to our partner fiber cuts.”

When concurrent disruptions like those highlighted above are observed across multiple countries in one or more geographic areas, the culprit is often a submarine cable that connects the impacted countries to the global Internet. The impact of such cable cuts will vary across countries, largely due to the levels of redundancy that they may have in place. That is, are these countries solely dependent on an impacted cable for global Internet connectivity, or do they have redundant connectivity across other submarine or terrestrial cables? Additionally, the location of the country relative to the cable cut will also impact how connectivity in a given country may be affected. Due to these factors, we didn’t see a similar impact across all of the countries connected to the AAE-1 and SMW5 cables.

What happened?

Specific details are sparse, but as noted above, the cable damage reportedly occurred in Egypt – both of the impacted cables land in Abu Talat and Zafarana, which also serve as landing points for a number of other submarine cables. According to a 2021 article in Middle East Eye, “There are 10 cable landing stations on Egypt’s Mediterranean and Red Sea coastlines, and some 15 terrestrial crossing routes across the country.” Alan Mauldin, research director at telecommunications research firm TeleGeography, notes that cables running between Europe, the Middle East, and India are routed via Egypt because it offers the least amount of land to cross. This places the country in a unique position as a choke point for international Internet connectivity, with damage to infrastructure locally impacting the ability of millions of people thousands of miles away to access websites and applications, as well as impacting connectivity for leading cloud platform providers.

As the graphs above show, traffic returned to normal levels within a matter of hours, with tweets from telecommunications authorities in Pakistan and Oman also noting that Internet services had returned to their countries. Such rapid repairs to submarine cable infrastructure are unusual, as repair timeframes are often measured in days or weeks, as we saw with the cables damaged by the volcanic eruption in Tonga earlier this year. This is due to the need to locate the fault, send repair ships to the appropriate location, and then retrieve the cable and repair it. Given this, the damage to these cables likely occurred on land, after they came ashore.

Keeping content available

By deploying in data centers close to end users, Cloudflare helps to keep traffic local, which can mitigate the impact of catastrophic events like cable cuts, while improving performance, availability, and security. Being able to deliver content from our network generally requires first retrieving it from an origin, and with end users around the world, Cloudflare needs to be able to reach origins from multiple points around the world at the same time. However, a customer origin may be reachable from some networks but not from others, due to a cable cut or some other network disruption.

In September 2021, Cloudflare announced Orpheus, which provides reachability benefits for customers by finding unreachable paths on the Internet in real time, and guiding traffic away from those paths, ensuring that Cloudflare will always be able to reach an origin no matter what is happening on the Internet.

Conclusion

Because the Internet is an interconnected network of networks, an event such as a cable cut can have a ripple effect across the whole Internet, impacting connectivity for users thousands of miles away from where the incident occurred. Users may be unable to access content or applications, or the content/applications may suffer from reduced performance. Additionally, the providers of those applications may experience problems within their own network infrastructure due to such an event.

For network providers, the impact of such events can be mitigated through the use of multiple upstream providers/peers, and diverse physical paths for critical infrastructure like submarine cables. Cloudflare’s globally deployed network can help content and application providers ensure that their content and applications remain available and performant in the face of network disruptions.

PIPEFAIL: How a missing shell option slowed Cloudflare down

Post Syndicated from Alex Forster original https://blog.cloudflare.com/pipefail-how-a-missing-shell-option-slowed-cloudflare-down/

At Cloudflare, we’re used to being the fastest in the world. However, for approximately 30 minutes last December, Cloudflare was slow. Between 20:10 and 20:40 UTC on December 16, 2021, web requests served by Cloudflare were artificially delayed by up to five seconds before being processed. This post tells the story of how a missing shell option called “pipefail” slowed Cloudflare down.

Background

Before we can tell this story, we need to introduce you to some of its characters.

Cloudflare’s Front Line protects millions of users from some of the largest attacks ever recorded. This protection is orchestrated by a sidecar service called dosd, which analyzes traffic and looks for attacks. When dosd detects an attack, it provides Front Line with a list of attack fingerprints that describe how Front Line can match and block the attack traffic.

Instances of dosd run on every Cloudflare server, and they communicate with each other using a peer-to-peer mesh to identify malicious traffic patterns. This decentralized design allows dosd to perform analysis with much higher fidelity than is possible with a centralized system, but its scale also imposes some strict performance requirements. To meet these requirements, we need to provide dosd with very fast access to large amounts of configuration data, which naturally means that dosd depends on Quicksilver. Cloudflare developed Quicksilver to manage configuration data and replicate it around the world in milliseconds, allowing it to be accessed by services like dosd in microseconds.

One piece of configuration data that dosd needs comes from the Addressing API, which is our authoritative IP address management service. The addressing data it provides is important because dosd uses it to understand what kind of traffic is expected on particular IPs. Since addressing data doesn’t change very frequently, we use a simple Kubernetes cron job to query it at 10 minutes past each hour and write it into Quicksilver, allowing it to be efficiently accessed by dosd.
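
As an illustration of that schedule only, a hypothetical Kubernetes CronJob manifest might look like the sketch below; the names and image are made up, and only the cron expression reflects the “10 minutes past each hour” described above.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: addressing-to-quicksilver        # hypothetical name
spec:
  schedule: "10 * * * *"                 # minute 10 of every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: sync
              image: registry.example.com/addressing-sync:latest   # placeholder image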

With this context, let’s walk through the change we made on December 16 that ultimately led to the slowdown.

The Change

Approximately once a week, all of our Bug Fixes and Performance Improvements to the Front Line codebase are released to the network. On December 16, the Front Line team released a fix for a subtle bug in how the code handled compression in the presence of a Cache-Control: no-transform header. Unfortunately, the team realized pretty quickly that this fix actually broke some customers who had started depending on that buggy behavior, so the team decided to roll back the release and work with those customers to correct the issue.

Here’s a graph showing the progression of the rollback. While most releases and rollbacks are fully automated, this particular rollback needed to be performed manually due to its urgency. Since this was a manual rollback, SREs decided to perform it in two batches as a safety measure. The first batch went to our smaller tier 2 and 3 data centers, and the second batch went to our larger tier 1 data centers.

SREs started the first batch at 19:25 UTC, and it completed in about 30 minutes. Then, after verifying that there were no issues, they started the second batch at 20:10. That’s when the slowdown started.

The Slowdown

Within minutes of starting the second batch of rollbacks, alerts started firing. “Traffic levels are dropping.” “CPU utilization is dropping.” “A P0 incident has been automatically declared.” The timing could not be a coincidence. Somehow, a deployment of known-good code, which had been limited to a subset of the network and which had just been successfully performed 40 minutes earlier, appeared to be causing a global problem.

A P0 incident is an “all hands on deck” emergency, so dozens of Cloudflare engineers quickly began to assess impact to their services and test their theories about the root cause. The rollback was paused, but that did not fix the problem. Then, approximately 10 minutes after the start of the incident, my team – the DOS team – received a concerning alert: “dosd is not running on numerous servers.” Before that alert fired we had been investigating whether the slowdown was caused by an unmitigated attack, but this required our immediate attention.

Based on service logs, we were able to see that dosd was panicking because the customer addressing data in Quicksilver was corrupted in some way. Remember: the data in this Quicksilver key is important. Without it, dosd could not make correct choices anymore, so it refused to continue.

Once we realized that the addressing data was corrupted, we had to figure out how it was corrupted so that we could fix it. The answer turned out to be pretty obvious: the Quicksilver key was completely empty.

Following the old adage – “did you try restarting it?” – we decided to manually re-run the Kubernetes cron job that populates this key and see what happened. At 20:40 UTC, the cron job was manually triggered. Seconds after it completed, dosd started running again, and traffic levels began returning to normal. We confirmed that the Quicksilver key was no longer empty, and the incident was over.

The Aftermath

Despite fixing the problem, we still didn’t really understand what had just happened.

Why was the Quicksilver key empty?

It was urgent that we quickly figure out how an empty value was written into that Quicksilver key, because for all we knew, it could happen again at any moment.

We started by looking at the Kubernetes cron job, which turned out to have a bug.

This cron job is implemented using a small Bash script. If you’re unfamiliar with Bash (particularly shell pipelining), here’s what it does:

First, the dos-make-addr-conf executable runs. Its job is to query the Addressing API for various bits of JSON data and serialize it into a TOML document. Afterward, that TOML is “piped” as input into the dosctl executable, whose job is to simply write it into a Quicksilver key called template_vars.
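
The post shows the script only as a screenshot, so here is a minimal sketch of what such a pipeline could look like. The two executable names and the template_vars key come from the description above; the dosctl arguments are an assumption for illustration.

#!/bin/bash
# Hypothetical reconstruction, for illustration only.
# dos-make-addr-conf queries the Addressing API and prints a TOML document;
# dosctl writes whatever arrives on stdin into the template_vars Quicksilver key.
dos-make-addr-conf | dosctl set template_vars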

Can you spot the bug? Here’s a hint: what happens if dos-make-addr-conf fails for some reason and exits with a non-zero error code? It turns out that, by default, the shell pipeline ignores the error code and continues executing the next command! This means that the output of dos-make-addr-conf (which could be empty) gets unconditionally piped into dosctl and used as the value of the template_vars key, regardless of whether dos-make-addr-conf succeeded or failed.

About 30 years ago, after shell users were repeatedly burned by this problem, a shell option called “pipefail” was introduced. Enabling this option changes the shell’s behavior so that, if any command in a pipeline fails, the pipeline as a whole reports failure rather than silently succeeding. However, this option is not enabled by default, so it’s widely recommended as best practice that all scripts start by enabling this option (and a few others).

Here’s the fixed version of that cron job:

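Again as a sketch rather than the real script (which is shown only as a screenshot), the essence of the fix is enabling pipefail, together with the other commonly recommended strict-mode options, before the pipeline runs:

#!/bin/bash
# -e: exit when a command fails; -u: treat unset variables as errors;
# -o pipefail: the pipeline's exit status reflects any failing command in it,
# so a failure of dos-make-addr-conf can no longer be silently masked.
set -euo pipefail

# Hypothetical invocation, as in the sketch above.
dos-make-addr-conf | dosctl set template_vars

A further hardening, not necessarily what the actual script does, would be to capture the output in a variable first and only pass it to dosctl once the first command has succeeded.
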
This bug was particularly insidious because dosd actually did attempt to gracefully handle the case where this Quicksilver key contained invalid TOML. However, an empty string is a perfectly valid TOML document. If an error message had been accidentally written into this Quicksilver key instead of an empty string, then dosd would have rejected the update and continued to use the previous value.

Why did that cause the Front Line to slow down?

We had figured out how an empty key could be written into Quicksilver, and we were confident that it wouldn’t happen again. However, we still needed to untangle how that empty key caused such a severe incident.

As I mentioned earlier, the Front Line relies on dosd to tell it how to mitigate attacks, but it doesn’t depend on dosd directly to serve requests. Instead, once every few seconds, the Front Line asynchronously asks dosd for new attack fingerprints and stores them in an in-memory cache. This cache is consulted while serving each request, and if dosd ever fails to provide fresh attack fingerprints, then the stale fingerprints will continue to be used instead. So how could this have caused the impact that we saw?

As part of the rollback process, the Front Line’s code needed to be reloaded. Reloading this code implicitly flushed the in-memory caches, including the attack fingerprint data from dosd. The next time that a request tried to consult with the cache, the caching layer realized that it had no attack fingerprints to return and a “cache miss” happened.

To handle a cache miss, the caching layer tried to reach out to dosd, and this is when the slowdown happened. While the caching layer was waiting for dosd to reply, it blocked all pending requests from progressing. Since dosd wasn’t running, the attempt eventually timed out after five seconds when the caching layer gave up. But in the meantime, each pending request was stuck waiting for the timeout to happen. Once it did, all the pending requests that were queued up over the five-second timeout period became unblocked and were finally allowed to progress. This cycle repeated over and over again every five seconds on every server until the dosd failure was resolved.

To trigger this slowdown, not only did dosd have to fail, but the Front Line’s in-memory cache had to also be flushed at the same time. If dosd had failed, but the Front Line’s cache had not been flushed, then the stale attack fingerprints would have remained in the cache and request processing would not have been impacted.

Why didn’t the first rollback cause this problem?

These two batches of rollbacks were performed by forcing servers to run a Salt highstate. When each batch was executed, thousands of servers began running highstates at the same time. The highstate process involves, among other things, contacting the Addressing API in order to retrieve various bits of customer addressing information.

The first rollback started at 19:25 UTC, and the second rollback started 45 minutes later at 20:10. Remember how I mentioned that our Kubernetes cron job only runs on the 10th minute of every hour? At 20:10 – exactly the time that our cron job started executing – thousands of servers also began to highstate, flooding the Addressing API with requests. All of these requests were queued up and eventually served, but it took the Addressing API a few minutes to work through the backlog. This delay was long enough to cause our cron job to time out, and, due to the “pipefail” bug, inadvertently clobber the Quicksilver key that it was responsible for updating.

To trigger the “pipefail” bug, not only did we have to flood the Addressing API with requests, we also had to do it at exactly 10 minutes after the hour. If SREs had started the second batch of rollbacks a few minutes earlier or later, this bug would have continued to lie dormant.

Lessons Learned

This was a unique incident where a chain of small or unlikely failures cascaded into a severe and painful outage that we deeply regret. In response, we have hardened each link in the chain:

  • A manual rollback inadvertently triggered the thundering herd problem, which overwhelmed the Addressing API. We have since significantly scaled out the Addressing API, so that it can handle high request rates if it ever again has to.
  • An error in a Kubernetes cron job caused invalid data to be written to Quicksilver. We have since made sure that, when this cron job fails, it is no longer possible for that failure to clobber the Quicksilver key.
  • dosd did not correctly handle all possible error conditions when loading configuration data from Quicksilver, causing it to fail. We have since taken these additional conditions into account where necessary, so that dosd will gracefully degrade in the face of corrupt Quicksilver data.
  • The Front Line had an unexpected dependency on dosd, which caused it to fail when dosd failed. We have since removed all such dependencies, and the Front Line will now gracefully survive dosd failures.

More broadly, this incident has served as an example to us of why code and systems must always be resilient to failure, no matter how unlikely that failure may seem.

DNSSEC issues take Fiji domains offline

Post Syndicated from David Belson original https://blog.cloudflare.com/dnssec-issues-fiji/

On the morning of March 8, a post to Hacker News stated that “All .fj domains have gone offline”, listing several hostnames in domains within the Fiji top level domain (known as a ccTLD) that had become unreachable. Commenters in the associated discussion thread had mixed results in being able to reach .fj hostnames—some were successful, while others saw failures. The fijivillage news site also highlighted the problem, noting that the issue also impacted Vodafone’s M-PAiSA app/service, preventing users from completing financial transactions.

The impact of this issue can be seen in traffic to Cloudflare customer zones in the .com.fj second-level domain. The graph below shows that HTTP traffic to these zones dropped by approximately 40% almost immediately starting around midnight UTC on March 8. Traffic volumes continued to decline throughout the rest of the morning.

Looking at Cloudflare’s 1.1.1.1 resolver data for queries for .com.fj hostnames, we can also see that error volume associated with those queries climbs significantly starting just after midnight as well. This means that our resolvers encountered issues with the answers from .fj servers.

This observation suggests that the problem was strictly DNS related, rather than connectivity related—Cloudflare Radar does not show any indication of an Internet disruption in Fiji coincident with the start of this problem.

It was suggested within the Hacker News comments that the problem could be DNSSEC related. Upon further investigation, it appears that may be the cause. In verifying the DNSSEC record for the .fj ccTLD, shown in the kdig output below, we see that it states EDE: 9 (DNSKEY Missing): 'no SEP matching the DS found for fj.'

kdig fj. soa +dnssec @1.1.1.1 
;; ->>HEADER<<- opcode: QUERY; status: SERVFAIL; id: 12710
;; Flags: qr rd ra; QUERY: 1; ANSWER: 0; AUTHORITY: 0; ADDITIONAL: 1
 
;; EDNS PSEUDOSECTION:
;; Version: 0; flags: do; UDP size: 1232 B; ext-rcode: NOERROR
;; EDE: 9 (DNSKEY Missing): 'no SEP matching the DS found for fj.'
 
;; QUESTION SECTION:
;; fj.                          IN      SOA
 
;; Received 73 B
;; Time 2022-03-08 08:57:41 EST
;; From 1.1.1.1@53(UDP) in 17.2 ms

Extended DNS Error 9 (EDE: 9) is defined as “A DS record existed at a parent, but no supported matching DNSKEY record could be found for the child.” The Cloudflare Learning Center article on DNSKEY and DS records explains this relationship:

The DS record is used to verify the authenticity of child zones of DNSSEC zones. The DS key record on a parent zone contains a hash of the KSK in a child zone. A DNSSEC resolver can therefore verify the authenticity of the child zone by hashing its KSK record, and comparing that to what is in the parent zone’s DS record.
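
A rough way to check that relationship by hand is shown below, sticking with the kdig tool used above; the flags mirror dig's and are assumed to behave the same way here, and the comparison itself is done by eye.

# The DS record lives in the parent (root) zone and still validates,
# so a normal query returns it.
kdig fj. DS +short @1.1.1.1

# +cd disables DNSSEC validation, so the child zone's DNSKEYs come back
# even while the chain of trust is broken.
kdig fj. DNSKEY +short +cd @1.1.1.1

# If no DNSKEY's key tag and digest match the DS record, validating
# resolvers answer SERVFAIL with EDE 9, as in the output above.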

Ultimately, it appears that around midnight UTC, the .fj zone started to be signed with a key that was not referenced by the DS record in the root zone, possibly as the result of a scheduled rollover that happened without first checking that the root zone had been updated by IANA. (IANA owns contact with the TLD operators, and instructs the Root Zone Publisher on the changes to make in the next version of the root zone.)

DNSSEC problems as the root cause of the observed issue align with the observation in the Hacker News comments that some were able to access .fj websites, while others were not. Users behind resolvers doing strict DNSSEC validation would have seen an error in their browser, while users behind less strict resolvers would have been able to access the sites without a problem.

Conclusion

Further analysis of Cloudflare resolver metrics indicates that the problem was resolved around 1400 UTC, when the DS was updated. When DNSSEC is improperly configured for a single domain name, it can cause problems accessing websites or applications in that zone. However, when the misconfiguration occurs at a ccTLD level, the impact is much more significant. Unfortunately, this seems to occur all too often.

(Thank you to Ólafur Guðmundsson for his DNSSEC expertise.)

Internet is back in Tonga after 38 days of outage

Post Syndicated from João Tomé original https://blog.cloudflare.com/internet-is-back-in-tonga-after-38-days-of-outage/

Tonga, the South Pacific archipelago nation (with 169 islands), was reconnected to the Internet early this morning (UTC) and is back online after successful repairs to the undersea cable that was damaged on Saturday, January 15, 2022, by the January 14 volcanic eruption.

After 38 days without full access to the Internet, Cloudflare Radar shows that a little after midnight (UTC) — it was around 13:00 local time — on February 22, 2022, Internet traffic in Tonga started to increase to levels similar to those seen before the eruption.

The faded line shows what was normal in Tonga at the start of the year, and the dark blue line shows the evolution of traffic in the last 30 days. Digicel, Tonga’s main ISP, announced at 02:13 UTC that “data connectivity has been restored on the main island Tongatapu and Eua after undersea submarine cable repairs”.

When we expand the view to the previous 45 days, we can see more clearly how Internet traffic evolved before the volcanic eruption and after the undersea cable was repaired.

The repair ship Reliance took 20 days to replace a 92 km (57 mile) section of the 827 km submarine fiber optical cable that connects Tonga to Fiji and international networks and had “multiple faults and breaks due to the volcanic eruption”, according to Digicel.

Tonga Cable chief executive James Panuve told Reuters that people on the main island “will have access almost immediately”, and that was what we saw on Radar with a large increase in traffic persisting.

The residual traffic we saw from Tonga a few days after January 15, 2022, comes from satellite services that were used with difficulty by some businesses.

James Panuve also highlighted that the undersea work is still being finished to repair the domestic cable connecting the main island of Tongatapu with outlying islands that were worst hit by the tsunami, which, he told Reuters, could take six to nine months more.

So, for some of the people who live on the 36 inhabited islands, normal use of the Internet could take a lot longer. Tonga has a population of around 105,000, 70% of whom reside on the main island, Tongatapu, and around 5% (5,000) live on the nearby island of Eua (now also connected to the Internet).

Telecommunication companies in neighboring Pacific islands, particularly New Caledonia, provided lengths of cable when Tonga ran out, said Panuve.

A world of undersea cables for the world’s communications

We have mentioned before, for example in our first blog post about the Tonga outage, how undersea cables are important to global Internet traffic that is mostly carried by a complex network that connects countries and continents.

The full submarine cable system (the first communications cables laid were from the 1850s and carried telegraphy traffic) is what makes most of the world’s Internet function between countries and continents. There are 428 active submarine cables (36 are planned), running an estimated 1.3 million km around the globe.

World map of submarine cables. Antarctica is the only continent not yet reached by a submarine telecommunications cable. Source: TeleGeography (www.submarinecablemap.com)

The reliability of submarine cables is high, especially when multiple paths are available in the event of a cable break. That wasn’t the case for the Tonga outage, given that the 827 km submarine cable only connects Fiji to the Tonga archipelago — Fiji is connected to the main Southern Cross Cable, as the next image illustrates.

Submarine Cable Map shows the undersea cables that connect Australia to Fiji and the following connections to other archipelagos like Tonga. Source: TeleGeography (www.submarinecablemap.com)

In a recent conversation on a Cloudflare TV segment, we discussed the importance of undersea cables with Tom Paseka, a Network Strategist who is celebrating 10 years at Cloudflare and previously worked for undersea cable companies in Australia.

Internet outage in Yemen amid airstrikes

Post Syndicated from João Tomé original https://blog.cloudflare.com/internet-outage-in-yemen-amid-airstrikes/

The early hours of Friday, January 21, 2022, started in Yemen with a country-wide Internet outage. According to local and global news reports, airstrikes are taking place in the country, and the outage is likely related, as there are reports that a telecommunications building in Al-Hudaydah, where the FALCON undersea cable lands, was hit.

Cloudflare Radar shows that Internet traffic dropped close to zero between 21:30 UTC (January 20, 2022) and 22:00 UTC (01:00 local time).

The outage affected the main state-owned ISP, Public Telecommunication Corporation (AS30873 in blue in the next chart), which represents almost all the Internet traffic in the country.

Looking at BGP (Border Gateway Protocol) updates from Yemen’s ASNs around the time of the outage, we see a clear spike at the same time the main ASN was affected, ~21:55 UTC on January 20, 2022. These update messages are BGP signalling that Yemen’s main ASN was no longer routable, something similar to what we saw happening in The Gambia and Kazakhstan, but for very different reasons.
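
Route collectors make these update messages publicly visible, so this kind of spike can be checked independently. A hedged example using RIPEstat's data API for AS30873 around the relevant window (the endpoint and parameter names are quoted from memory and should be treated as an assumption):

curl -s "https://stat.ripe.net/data/bgp-updates/data.json?resource=AS30873&starttime=2022-01-20T21:00&endtime=2022-01-20T23:00"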

So far, 2022 has started with a few significant Internet disruptions for different reasons:

1. An Internet outage in The Gambia because of a cable problem.
2. An Internet shutdown in Kazakhstan because of unrest.
3. A mobile Internet shutdown in Burkina Faso because of a coup plot.
4. An Internet outage in Tonga because of a volcanic eruption (still ongoing).

You can keep an eye on Cloudflare Radar to monitor this situation as it unfolds.

Tonga’s likely lengthy Internet outage

Post Syndicated from João Tomé original https://blog.cloudflare.com/tonga-internet-outage/

2022 is only 19 days old, but so far this January there have already been four significant Internet disruptions:

1. An Internet outage in The Gambia because of a cable problem.
2. An Internet shutdown in Kazakhstan because of unrest.
3. A mobile Internet shutdown in Burkina Faso because of a coup plot.
4. An Internet outage in Tonga because of a volcanic eruption.

The latest Internet outage, in the South Pacific country of Tonga (with 169 islands), is still ongoing. It started with the large eruption of Hunga Tonga–Hunga Haʻapai, an uninhabited volcanic island of the Tongan archipelago, on Friday, January 14, 2022. The next day, Cloudflare Radar shows that the Internet outage started at around 03:00 UTC (16:00 local time) — Saturday, January 15, 2022 — and has been ongoing for more than four days. Tonga’s 105,000 residents are almost entirely unreachable, according to the BBC.

When we focus on the number of requests by ASN, the country’s main ISPs Digicel and Kalianet started to lose traffic after 03:00 UTC and by 05:30 UTC January 15, 2022, Cloudflare saw close to no traffic at all from them, as shown in the graph below.

Looking at the BGP (Border Gateway Protocol) updates from Tonga’s ASNs around the time of the outage, we see a clear spike at 05:35 UTC (18:35 local time). These update messages are BGP signalling that the Tongan ASNs are no longer routable. We saw the same trend in The Gambia outage of January 4, 2022 — there you can read about the importance of BGP as a mechanism to exchange routing information between autonomous systems on the Internet, something that was also seen in the 2021 Facebook outage.

BGP updates from Tongan ASNs around the time of the outage.

Cloudflare Radar data doesn’t show any significant disruptions for Internet traffic in Tonga’s neighbours American Samoa (although there was a small decrease in traffic on Friday and Saturday, January 14 and 15, 2022 in comparison with the previous week) and Fiji. In American Samoa, all schools were closed on Friday, January 14, because of severe weather, and on the same day, after the volcanic eruption, there were tsunami warnings and evacuation to higher ground was advised (that continued through the weekend).

Tonga, as a geographically remote Polynesian country more than 800 km from the Fiji archipelago, is highly dependent on the Internet for communications. That is something that was improved five years ago with an infrastructure connectivity program from the World Bank. Prior to that, the country depended on satellite links that reached only a very small percentage of the population.

Repairs could take a few weeks

Southern Cross Cable Network confirmed that the 827 km fiber-optic undersea communications cable connecting Tonga to the outside world may have been broken. The company is assisting Tonga Cable Limited (TCL), which owns the single cable that provides Internet access and almost all communications to and from the archipelago.

The eruption resulted in a fault in the international cable 37 kilometres from Nukuʻalofa (Tonga’s capital), and a further fault in a domestic cable 47 km from the capital.

TCL announced that it has already met with the US cable company SubCom to start preparations for SubCom’s cable repair ship Reliance to be dispatched from Papua New Guinea to Tonga, possibly via Samoa (more than 4,000 km away).

The repairs could take “at least” four weeks, given that a repair to a fiber-optic cable that has been cut on the seabed is considered more complicated than misconfigurations, power outages or other types of infrastructure damage. “The site conditions in Tonga have to be assessed thoroughly because of volcanic activities,” according to TCL chairman Samiuela Fonua.

Fonua also mentioned that the last cable cut (back in 2019) took nearly two weeks to repair, but this time the site conditions will determine the time it will take — the two cables are not far away from the eruption site (the volcano is still active). According to ZDNet, in 2019 Tonga signed a 15-year deal with Kacific for satellite connectivity, but since then the satellite provider says it is waiting on the Tongan government to activate its contract.

Svalbard Undersea Cable System also disrupted in January

Also in January, Space Norway, the operator of the world’s most northern submarine cable — the Svalbard Undersea Cable System — announced that on January 7 it located a disruption in one of the two twin submarine fiber optic communication cables connecting Longyearbyen with Andøya north of Harstad in northern Norway (in the area where the seabed goes from 300 meters down to 2,700 meters in the Greenland Sea). A repair mission is being planned.

A world of undersea cables for the world’s communications

A significant amount of Internet traffic is carried by a complex network of undersea fiber-optic cables that connect countries and continents. The full submarine cable system (the first communications cables laid were from the 1850s and carried telegraphy traffic) is what makes most of the world’s Internet function between countries and continents. There are 428 active submarine cables (36 are planned), running an estimated 1.3 million km around the globe.

World map of submarine cables. Antarctica is the only continent not yet reached by a submarine telecommunications cable. Source: TeleGeography (www.submarinecablemap.com)

This gives a sense that the Internet is literally a network of networks in a world where estimates indicate that around 99% of the data traffic that is crossing oceans is carried by these undersea cables (satellite Internet, so far, is still residual — SpaceX has around 145,000 users).

The reliability of submarine cables is high, especially when multiple paths are available in the event of a cable break. That’s not the case for the Tonga outage, given that the 827 km submarine cable only connects Fiji to the Tonga archipelago — Fiji is connected to the main Southern Cross Cable, as the next image illustrates.

Submarine Cable Map shows the undersea cables that connect Australia to Fiji and the following connections to other archipelagos like Tonga. Source: TeleGeography (www.submarinecablemap.com)


The total carrying capacity of submarine cables is enormous (EllaLink, the optical submarine cable linking the European and South American continents, for example, has 100 Tbps capacity) and grows year after year as the world gets more and more connected. For example, Google has recently finished a new cable with 350 Tbps of capacity. But, a transoceanic submarine cable system costs several hundred million dollars to construct. One of the latest, between Portugal and Egypt, with a total of 8,700 kilometers, is budgeted at 326 million euros.

The Tonga outage was not the only one of 2022 (so far) that happened because of cable problems. The Gambia outage that affected the country’s main ISP, Gamtel, was caused by “a primary link failure at ACE”, the cable system that serves 24 countries from Europe to Africa, specifically at the cable connection points between Senegal and The Gambia.

Although these two fiber cable problems were separated by only a few days at the start of 2022, Internet outages are more commonly caused by situations like misconfigurations, power outages, extreme weather or the frequent state-imposed shutdowns to deal with unrest, elections or exams — recently this was the case in Sudan and Kazakhstan.

Sudan: seven days without Internet access (and counting)

Post Syndicated from João Tomé original https://blog.cloudflare.com/sudan-seven-days-without-internet-access-and-counting/

It’s not every day that there is no Internet access in an entire country. In the case of Sudan, it has been five days without Internet after political turmoil that started last Monday, October 25, 2021 (as we described).

The outage continues with almost a flat line and just a trickle of Internet traffic from Sudan. Cloudflare Radar shows that the Internet in Sudan is still almost completely cut off.

Sudan: seven days without Internet access (and counting)

There was a blip of traffic on Tuesday at ~14:00 UTC, for about one hour, but it flattened out again, and it continues like that — anyone can track the evolution on the Sudan page of Cloudflare Radar.

Sudan: seven days without Internet access (and counting)

Internet shutdowns are not that rare

Internet disruptions, including shutdowns and social media restrictions, are common occurrences in some countries, and according to Human Rights Watch, Sudan is one where this happens more frequently than in most. In our June blog, we talked about Sudan when the country decided to shut down the Internet to prevent cheating in exams, but there have been past situations more similar to this days-long shutdown — something that usually happens when there’s political unrest.

The country’s longest recorded network disruption was back in 2018, when Sudanese authorities cut off access to social media (and messaging apps like WhatsApp) for 68 consecutive days from December 21, 2018 to February 26, 2019. There was a full mobile Internet shutdown reported from June 3 to July 9, 2019 that lasted 36 days.

You can keep an eye on Cloudflare Radar to monitor how we see the Internet traffic globally and in every country.

Sudan woke up without Internet

Post Syndicated from Celso Martinho original https://blog.cloudflare.com/sudan-woke-up-without-internet/

Sudan woke up without Internet

Sudan woke up without Internet

Today, October 25, following political turmoil, Sudan woke up without Internet access.

In our June blog, we talked about Sudan when the country decided to shut down the Internet to prevent cheating in exams.

Now, the disruption appears to have a different cause. AP is reporting that “military forces … detained at least five senior Sudanese government figures”. This afternoon (UTC), several media outlets confirmed that Sudan’s military dissolved the transitional government in a coup that shut down mobile phone networks and Internet access.

Cloudflare Radar allows anyone to track Internet traffic patterns around the world. The dedicated page for Sudan clearly shows that this Monday, when the country was waking up, the Internet traffic went down and continued that trend through the afternoon (16:00 local time, 14:00 UTC).

Sudan woke up without Internet

We dug a little more into the HTTP traffic data. It usually starts increasing after 06:00 local time (04:00 UTC). But this Monday morning, traffic was flat, and the trend continued into the afternoon (there were no signs of the Internet coming back at 18:00 local time).

Sudan woke up without Internet

When comparing today with the last seven days’ pattern, we see that today’s drop is abrupt and unusual.

Sudan woke up without Internet

We can see the same pattern when looking at HTTP traffic by ASN (Autonomous System Number). The shutdown affects all the major ISPs in Sudan.

Sudan woke up without Internet

Two weeks ago, we compared mobile traffic worldwide using Cloudflare Radar, and Sudan was one of the most mobile-friendly countries on the planet, with 83% of Internet traffic coming from mobile devices. Today, traffic from both mobile and desktop devices was disrupted.

Sudan woke up without Internet

Using Cloudflare Radar, we can also see a change in Layer 3 & 4 DDoS attack figures, caused by the lack of traffic data.

Sudan woke up without Internet

You can keep an eye on Cloudflare Radar to monitor how we see the Internet traffic globally and in every country.

What happened on the Internet during the Facebook outage

Post Syndicated from Celso Martinho original https://blog.cloudflare.com/during-the-facebook-outage/

What happened on the Internet during the Facebook outage

It’s been a few days now since Facebook, Instagram, and WhatsApp went AWOL and experienced one of the most extended and rough downtime periods in their existence.

When that happened, we reported our bird’s-eye view of the event and posted the blog Understanding How Facebook Disappeared from the Internet where we tried to explain what we saw and how DNS and BGP, two of the technologies at the center of the outage, played a role in the event.

In the meantime, more information has surfaced, and Facebook has published a blog post giving more details of what happened internally.

As we said before, these events are a gentle reminder that the Internet is a vast network of networks, and we, as industry players and end-users, are part of it and should work together.

In the aftermath of an event of this size, we don’t waste much time debating how peers handled the situation. We do, however, ask ourselves the more important questions: “How did this affect us?” and “What if this had happened to us?” Asking and answering these questions whenever something like this happens is a great and healthy exercise that helps us improve our own resilience.

Today, we’re going to show you how the Facebook and affiliate sites downtime affected us, and what we can see in our data.

1.1.1.1

1.1.1.1 is a fast and privacy-centric public DNS resolver operated by Cloudflare, used by millions of users, browsers, and devices worldwide. Let’s look at our telemetry and see what we find.

First, the obvious. If we look at the response rate, there was a massive spike in the number of SERVFAIL codes. SERVFAILs can happen for several reasons; we have an excellent blog called Unwrap the SERVFAIL that you should read if you’re curious.

In this case, we started serving SERVFAIL responses to all facebook.com and whatsapp.com DNS queries because our resolver couldn’t access the upstream Facebook authoritative servers. That was about 60 times more SERVFAILs than on a typical day.

What happened on the Internet during the Facebook outage
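
To make this failure mode concrete, here is a minimal Python sketch (not part of our production tooling) that sends a plain DNS query to 1.1.1.1 using the dnspython library, which is assumed to be installed, and reports the response code. During the outage, queries like these would have returned SERVFAIL.

# Minimal sketch: query 1.1.1.1 over plain UDP and print the DNS response code.
import dns.message
import dns.query
import dns.rcode

RESOLVER = "1.1.1.1"

def response_code(name: str) -> str:
    query = dns.message.make_query(name, "A")
    response = dns.query.udp(query, RESOLVER, timeout=5)
    return dns.rcode.to_text(response.rcode())  # e.g. "NOERROR" or "SERVFAIL"

for domain in ("facebook.com", "whatsapp.com"):
    print(domain, response_code(domain))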

If we look at all the queries, not specific to Facebook or WhatsApp domains, and we split them by IPv4 and IPv6 clients, we can see that our load increased too.

As explained before, this is due to a snowball effect associated with applications and users retrying after the errors and generating even more traffic. In this case, 1.1.1.1 had to handle more than the expected rate for A and AAAA queries.

What happened on the Internet during the Facebook outage

Here’s another fun one.

DNS vs. DoT and DoH. Typically, DNS queries and responses are sent in plaintext over UDP (or TCP sometimes), and that’s been the case for decades now. Naturally, this poses security and privacy risks to end-users as it allows in-transit attacks or traffic snooping.

With DNS over TLS (DoT) and DNS over HTTPS, clients can talk DNS using well-known, well-supported encryption and authentication protocols.

Our learning center has a good article on “DNS over TLS vs. DNS over HTTPS” that you can read. Browsers like Chrome, Firefox, and Edge have supported DoH for some time now, WARP uses DoH too, and you can even configure your operating system to use the new protocols.
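
As a rough illustration of what a DoH lookup looks like, here is a small Python sketch that asks 1.1.1.1 for an A record over HTTPS using Cloudflare’s JSON interface for DNS over HTTPS. It assumes the requests package is installed and omits error handling; during the outage, the Status field for facebook.com would have been 2 (SERVFAIL).

# Sketch of a DNS over HTTPS (DoH) lookup against 1.1.1.1 via the JSON interface.
import requests

def doh_lookup(name: str, rtype: str = "A") -> dict:
    resp = requests.get(
        "https://cloudflare-dns.com/dns-query",
        params={"name": name, "type": rtype},
        headers={"accept": "application/dns-json"},  # ask for the JSON format
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()

result = doh_lookup("facebook.com")
# Status 0 means NOERROR; 2 means SERVFAIL.
print("rcode:", result.get("Status"))
for answer in result.get("Answer", []):
    print(answer.get("name"), answer.get("data"))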

When Facebook went offline, we saw the number of DoT and DoH SERVFAIL responses grow to over 300 times the average rate.

What happened on the Internet during the Facebook outage
What happened on the Internet during the Facebook outage
What happened on the Internet during the Facebook outage

So, we got hammered with lots of requests and errors, causing traffic spikes on our 1.1.1.1 resolver and unexpected load on our edge network and systems. How did we perform during this stressful period?

Quite well. 1.1.1.1 kept its cool and continued serving the vast majority of requests around the famous 10ms mark. A small fraction of requests at the p95 and p99 percentiles saw increased response times, probably due to timeouts trying to reach Facebook’s nameservers.

What happened on the Internet during the Facebook outage

Another interesting perspective is the distribution of the ratio between SERVFAIL and good DNS answers, by country. In theory, the higher this ratio is, the more the country uses Facebook. Here’s the map with the countries that suffered the most:

What happened on the Internet during the Facebook outage

Here’s the top twelve country list, ordered by those that apparently use Facebook, WhatsApp and Instagram the most:

Country SERVFAIL/Good Answers ratio
Turkey 7.34
Grenada 4.84
Congo 4.44
Lesotho 3.94
Nicaragua 3.57
South Sudan 3.47
Syrian Arab Republic 3.41
Serbia 3.25
Turkmenistan 3.23
United Arab Emirates 3.17
Togo 3.14
French Guiana 3.00
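
The ratio itself is simple to compute. Here is a toy Python sketch of the idea, using hypothetical per-country counters; the numbers below are made up for illustration, and the real figures come from aggregated 1.1.1.1 resolver telemetry.

# Toy illustration of the SERVFAIL/good-answers ratio per country.
counts = {
    # country: (servfail_responses, good_responses) — hypothetical values
    "TR": (7_340_000, 1_000_000),
    "GD": (4_840, 1_000),
    "PT": (350_000, 1_000_000),
}

ratios = {
    country: servfail / good
    for country, (servfail, good) in counts.items()
    if good > 0
}

for country, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {ratio:.2f}")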

Impact on other sites

When Facebook, Instagram, and WhatsApp aren’t around, the world turns to other places to look for information on what’s going on, other forms of entertainment or other applications to communicate with their friends and family. Our data shows us those shifts. While Facebook was going down, other services and platforms were going up.

To get an idea of the changing traffic patterns, we look at DNS queries as an indicator of increased traffic to specific sites or types of sites.

Here are a few examples.

Other social media platforms saw a slight increase in use, compared to normal.

What happened on the Internet during the Facebook outage

Traffic to messaging platforms like Telegram, Signal, Discord and Slack got a little push too.

What happened on the Internet during the Facebook outage

Nothing like a little gaming time when Instagram is down, we guess, when looking at traffic to sites like Steam, Xbox, Minecraft and others.

What happened on the Internet during the Facebook outage

And yes, people want to know what’s going on and fall back on news sites like CNN, New York Times, The Guardian, Wall Street Journal, Washington Post, Huffington Post, BBC, and others:

What happened on the Internet during the Facebook outage

Attacks

One could speculate that the Internet was under attack from malicious hackers. Our Firewall doesn’t agree; nothing out of the ordinary stands out.

What happened on the Internet during the Facebook outage

Network Error Logs

Network Error Logging, NEL for short, is an experimental technology supported in Chrome. A website can issue a Report-To header and ask the browser to send reports about network problems, like bad requests or DNS issues, to a specific endpoint.

Cloudflare uses NEL data to quickly help triage end-user connectivity issues when end-users reach our network. You can learn more about this feature in our help center.
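
As a rough illustration of how a site opts in, the sketch below shows an origin returning the two response headers involved. The reporting endpoint URL is hypothetical, and the header values simply follow the NEL and Reporting API formats; this is a sketch, not a description of how any particular site is configured.

# Sketch: an origin opting into Network Error Logging via Report-To and NEL headers.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

REPORT_TO = json.dumps({
    "group": "network-errors",
    "max_age": 86400,
    "endpoints": [{"url": "https://reports.example.com/nel"}],  # hypothetical collector
})
NEL = json.dumps({
    "report_to": "network-errors",
    "max_age": 86400,
    "include_subdomains": True,
})

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Report-To", REPORT_TO)  # where the browser should send reports
        self.send_header("NEL", NEL)              # what network errors to report
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello\n")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()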

If Facebook is down and their DNS isn’t responding, Chrome will start reporting NEL events every time one of the pages in our zones fails to load Facebook comments, posts, ads, or authentication buttons. This chart shows it clearly.

What happened on the Internet during the Facebook outage

WARP

Cloudflare announced WARP in 2019, calling it “A VPN for People Who Don’t Know What V.P.N. Stands For”, and offered it for free to its customers. Today WARP is used by millions of people worldwide to securely and privately access the Internet on their desktop and mobile devices. Here’s what we saw during the outage by looking at traffic volume between WARP and Facebook’s network:

What happened on the Internet during the Facebook outage

You can see how the steep drop in Facebook ASN traffic coincides with the start of the incident and how it compares to the same period the day before.

Our own traffic

People tend to think of Facebook as a place to visit. We log in, we access Facebook, we post. It turns out that Facebook likes to visit us too, quite a lot. Like Google and other platforms, Facebook uses an army of crawlers to constantly check websites for data and updates. Those robots gather information about website content, such as titles, descriptions, thumbnail images, and metadata. You can learn more about this on the “The Facebook Crawler” page and the Open Graph website.

Here’s what we see when traffic is coming from the Facebook ASN, supposedly from crawlers, to our CDN sites:

What happened on the Internet during the Facebook outage

The robots went silent.

What about the traffic coming to our CDN sites from Facebook User-Agents? The gap is indisputable.

What happened on the Internet during the Facebook outage

We see about 30% of a typical request rate hitting us. But it’s not zero; why is that?

We’ll let you know a little secret. Never trust User-Agent information; it’s broken. User-Agent spoofing is everywhere. Browsers, apps, and other clients deliberately change the User-Agent string when they fetch pages from the Internet to hide, obtain access to certain features, or bypass paywalls (because pay-walled sites want sites like Facebook to index their content, so that then they get more traffic from links).

Fortunately, there are newer, and privacy-centric standards emerging like User-Agent Client Hints.

Core Web Vitals

Core Web Vitals are a subset of Web Vitals, an initiative by Google to provide a unified interface to measure real-world quality signals when a user visits a web page. Such signals include Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS).

We use Core Web Vitals with our privacy-centric Web Analytics product and collect anonymized data on how end-users experience the websites that enable this feature.

One of the metrics we can calculate using these signals is the page load time. Our theory is that if a page includes scripts coming from external sites (for example, Facebook “like” buttons, comments, ads), and they are unreachable, its total load time gets affected.

We used a list of about 400 domains that we know embed Facebook scripts in their pages and looked at the data.

What happened on the Internet during the Facebook outage
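
A minimal sketch of the comparison behind this kind of chart: given anonymized page-load samples (hour of day, load time in milliseconds) for pages known to embed third-party scripts, compare the distribution during the outage window with a baseline. The samples below are hypothetical; our Web Analytics pipeline does this at much larger scale.

# Toy comparison of page load times before vs. during the outage window.
from statistics import median

OUTAGE_HOURS = range(16, 22)  # roughly 15:50-21:20 UTC, rounded to whole hours

samples = [
    # (timestamp_utc_hour, load_time_ms) — hypothetical values
    (14, 1800), (15, 1900), (16, 3100), (17, 3600),
    (18, 3400), (19, 2900), (20, 3300), (22, 2000),
]

during = [ms for hour, ms in samples if hour in OUTAGE_HOURS]
baseline = [ms for hour, ms in samples if hour not in OUTAGE_HOURS]

slowdown = median(during) / median(baseline)
print(f"median load time during the outage is {slowdown:.1f}x the baseline")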

Now let’s look at the Largest Contentful Paint. LCP marks the point in the page load timeline when the page’s main content has likely loaded. The faster the LCP is, the better the end-user experience.

What happened on the Internet during the Facebook outage

Again, the page load experience got visibly degraded.

The outcome seems clear. Sites that embed Facebook scripts in their pages took about 1.5x as long to load during the outage, with some taking more than 2x the usual time. Facebook’s outage dragged down the performance of some other sites.

Conclusion

When Facebook, Instagram, and WhatsApp went down, the Web felt it. Some websites got slower or lost traffic, other services and platforms got unexpected load, and people lost the ability to communicate or do business normally.

Understanding How Facebook Disappeared from the Internet

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/october-2021-facebook-outage/

Understanding How Facebook Disappeared from the Internet

Understanding How Facebook Disappeared from the Internet

“Facebook can’t be down, can it?”, we thought, for a second.

Today at 1651 UTC, we opened an internal incident entitled “Facebook DNS lookup returning SERVFAIL” because we were worried that something was wrong with our DNS resolver 1.1.1.1. But as we were about to post on our public status page, we realized something more serious was going on.

Social media quickly burst into flames, reporting what our engineers rapidly confirmed too. Facebook and its affiliated services WhatsApp and Instagram were, in fact, all down. Their DNS names stopped resolving, and their infrastructure IPs were unreachable. It was as if someone had “pulled the cables” from their data centers all at once and disconnected them from the Internet.

How’s that even possible?

Meet BGP

BGP stands for Border Gateway Protocol. It’s a mechanism to exchange routing information between autonomous systems (AS) on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to their final destinations. Without BGP, the Internet routers wouldn’t know what to do, and the Internet wouldn’t work.

The Internet is literally a network of networks, and it’s bound together by BGP. BGP allows one network (say Facebook) to advertise its presence to other networks that form the Internet. As we write, Facebook is not advertising its presence, so ISPs and other networks can’t find Facebook’s network and it is unavailable.

The individual networks each have an ASN: an Autonomous System Number. An Autonomous System (AS) is an individual network with a unified internal routing policy. An AS can originate prefixes (saying it controls a group of IP addresses), as well as transit prefixes (saying it knows how to reach specific groups of IP addresses).

Cloudflare’s ASN is AS13335. Every ASN needs to announce its prefix routes to the Internet using BGP; otherwise, no one will know how to connect and where to find us.

Our learning center has a good overview of what BGP and ASNs are and how they work.

In this simplified diagram, you can see six autonomous systems on the Internet and two possible routes that one packet can use to go from Start to End: AS1 → AS2 → AS3 is the fastest, and AS1 → AS6 → AS5 → AS4 → AS3 is the slowest, but it can be used if the first fails.

Understanding How Facebook Disappeared from the Internet
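
As a toy illustration of the idea in this diagram, the Python sketch below prefers the shortest available AS path and falls back to the longer one when the preferred route is withdrawn. Real BGP best-path selection weighs many more attributes than path length, so this is only a simplification.

# Toy BGP-style path selection: prefer the shortest available AS path,
# fall back to a longer one if the preferred path is withdrawn.
routes = {
    "via AS2": ["AS1", "AS2", "AS3"],
    "via AS6": ["AS1", "AS6", "AS5", "AS4", "AS3"],
}
available = {"via AS2": True, "via AS6": True}

def best_path():
    candidates = [path for name, path in routes.items() if available[name]]
    return min(candidates, key=len) if candidates else None

print("best:", best_path())       # the short AS1 -> AS2 -> AS3 path
available["via AS2"] = False      # the preferred route is withdrawn
print("fallback:", best_path())   # the longer path via AS6
available["via AS6"] = False      # all routes withdrawn
print("no route:", best_path())   # None: the destination is unreachable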

At 1658 UTC we noticed that Facebook had stopped announcing the routes to their DNS prefixes. That meant that, at least, Facebook’s DNS servers were unavailable. Because of this Cloudflare’s 1.1.1.1 DNS resolver could no longer respond to queries asking for the IP address of facebook.com or instagram.com.

route-views>show ip bgp 185.89.218.0/23
% Network not in table
route-views>

route-views>show ip bgp 129.134.30.0/23
% Network not in table
route-views>

Meanwhile, other Facebook IP addresses remained routed but weren’t particularly useful since without DNS Facebook and related services were effectively unavailable:

route-views>show ip bgp 129.134.30.0   
BGP routing table entry for 129.134.0.0/17, version 1025798334
Paths: (24 available, best #14, table default)
  Not advertised to any peer
  Refresh Epoch 2
  3303 6453 32934
    217.192.89.50 from 217.192.89.50 (138.187.128.158)
      Origin IGP, localpref 100, valid, external
      Community: 3303:1004 3303:1006 3303:3075 6453:3000 6453:3400 6453:3402
      path 7FE1408ED9C8 RPKI State not found
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
route-views>

We keep track of all the BGP updates and announcements we see in our global network. At our scale, the data we collect gives us a view of how the Internet is connected and where the traffic is meant to flow from and to everywhere on the planet.

A BGP UPDATE message informs a router of any changes you’ve made to a prefix advertisement or entirely withdraws the prefix. We can clearly see this in the number of updates we received from Facebook when checking our time-series BGP database. Normally this chart is fairly quiet: Facebook doesn’t make a lot of changes to its network minute to minute.

But at around 15:40 UTC we saw a peak of routing changes from Facebook. That’s when the trouble began.

Understanding How Facebook Disappeared from the Internet

If we split this view by routes announcements and withdrawals, we get an even better idea of what happened. Routes were withdrawn, Facebook’s DNS servers went offline, and one minute after the problem occurred, Cloudflare engineers were in a room wondering why 1.1.1.1 couldn’t resolve facebook.com and worrying that it was somehow a fault with our systems.

Understanding How Facebook Disappeared from the Internet
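
The aggregation behind these charts is conceptually simple: take the stream of BGP update records for Facebook’s origin AS, bucket them per minute, and count announcements versus withdrawals. Here is a toy Python sketch with hypothetical records; the real data comes from our time-series BGP database.

# Toy aggregation of BGP updates into per-minute announcement/withdrawal counts.
from collections import Counter

updates = [
    # (minute, update_type) — hypothetical records
    ("15:39", "announcement"), ("15:40", "withdrawal"), ("15:40", "withdrawal"),
    ("15:40", "announcement"), ("15:41", "withdrawal"), ("15:41", "withdrawal"),
]

per_minute = Counter((minute, kind) for minute, kind in updates)
for minute in sorted({m for m, _ in updates}):
    announcements = per_minute[(minute, "announcement")]
    withdrawals = per_minute[(minute, "withdrawal")]
    print(f"{minute}  announcements={announcements}  withdrawals={withdrawals}")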

With those withdrawals, Facebook and its sites had effectively disconnected themselves from the Internet.

DNS gets affected

As a direct consequence of this, DNS resolvers all over the world stopped resolving their domain names.

➜  ~ dig @1.1.1.1 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com.			IN	A
➜  ~ dig @1.1.1.1 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com.			IN	A
➜  ~ dig @8.8.8.8 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com.			IN	A
➜  ~ dig @8.8.8.8 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com.			IN	A

This happens because DNS, like many other systems on the Internet, also has its own routing mechanism. When someone types the https://facebook.com URL in the browser, the DNS resolver, responsible for translating domain names into the actual IP addresses to connect to, first checks if it has something in its cache and uses it. If not, it tries to grab the answer from the domain’s nameservers, typically hosted by the entity that owns the domain.

If the nameservers are unreachable or fail to respond because of some other reason, then a SERVFAIL is returned, and the browser issues an error to the user.
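
Put as a simplified sketch, the resolver logic described above looks roughly like this. It is not how 1.1.1.1 is implemented; real resolvers also handle negative caching, retries across multiple nameservers, and much more.

# Simplified sketch of a caching resolver's decision flow.
import time

cache = {}  # name -> (ip, expiry timestamp)

def query_nameservers(name):
    """Placeholder: returns an IP from the authoritative nameservers,
    or None if they are unreachable (as Facebook's were during the outage)."""
    return None

def resolve(name, ttl=300):
    entry = cache.get(name)
    if entry and entry[1] > time.time():      # 1. answer from cache if still valid
        return entry[0]
    ip = query_nameservers(name)              # 2. otherwise ask the nameservers
    if ip is None:
        return "SERVFAIL"                     # 3. nameservers unreachable or failing
    cache[name] = (ip, time.time() + ttl)
    return ip

print(resolve("facebook.com"))  # -> SERVFAIL while the nameservers are unreachable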

Again, our learning center provides a good explanation on how DNS works.

Understanding How Facebook Disappeared from the Internet

Because Facebook stopped announcing the routes to their DNS prefixes through BGP, our and everyone else’s DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.

But that’s not all. Now human behavior and application logic kicks in and causes another exponential effect. A tsunami of additional DNS traffic follows.

This happened in part because apps won’t accept an error for an answer and start retrying, sometimes aggressively, and in part because end-users also won’t take an error for an answer and start reloading the pages, or killing and relaunching their apps, sometimes also aggressively.

This is the traffic increase (in number of requests) that we saw on 1.1.1.1:

Understanding How Facebook Disappeared from the Internet

So now, because Facebook and their sites are so big, we have DNS resolvers worldwide handling 30x more queries than usual and potentially causing latency and timeout issues to other platforms.

Fortunately, 1.1.1.1 was built to be Free, Private, Fast (as the independent DNS monitor DNSPerf can attest), and scalable, and we were able to keep servicing our users with minimal impact.

The vast majority of our DNS requests kept resolving in under 10ms. At the same time, a minimal fraction of requests at the p95 and p99 percentiles saw increased response times, probably due to expired TTLs forcing queries to the Facebook nameservers, which then timed out. The 10-second DNS timeout limit is well known amongst engineers.

Understanding How Facebook Disappeared from the Internet

Impacting other services

People look for alternatives and want to know more or discuss what’s going on. When Facebook became unreachable, we started seeing increased DNS queries to Twitter, Signal and other messaging and social media platforms.

Understanding How Facebook Disappeared from the Internet

We can also see another side effect of this unreachability in our WARP traffic to and from Facebook’s affected ASN 32934. This chart shows how traffic changed from 15:45 UTC to 16:45 UTC compared with three hours before in each country. All over the world WARP traffic to and from Facebook’s network simply disappeared.

Understanding How Facebook Disappeared from the Internet

The Internet

Today’s events are a gentle reminder that the Internet is a very complex and interdependent system of millions of systems and protocols working together. That trust, standardization, and cooperation between entities are at the center of making it work for almost five billion active users worldwide.

Update

At around 21:00 UTC we saw renewed BGP activity from Facebook’s network which peaked at 21:17 UTC.

Understanding How Facebook Disappeared from the Internet

This chart shows the availability of the DNS name ‘facebook.com’ on Cloudflare’s DNS resolver 1.1.1.1. It stopped being available at around 15:50 UTC and returned at 21:20 UTC.

Understanding How Facebook Disappeared from the Internet

Undoubtedly the Facebook, WhatsApp and Instagram services will take further time to come fully back online, but as of 22:28 UTC Facebook appears to be reconnected to the global Internet and its DNS is working again.

A Byzantine failure in the real world

Post Syndicated from Tom Lianza original https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/

A Byzantine failure in the real world

An analysis of the Cloudflare API availability incident on 2020-11-02

When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). Eliminating these is a necessary step in architecting a system you can be confident in. Ironically, when you’re designing a system with built-in redundancy, you spend most of your time thinking about how well it functions when that redundancy is lost.

On November 2, 2020, Cloudflare had an incident that impacted the availability of the API and dashboard for six hours and 33 minutes. During this incident, the success rate for queries to our API periodically dipped as low as 75%, and the dashboard experience was as much as 80 times slower than normal. While Cloudflare’s edge is massively distributed across the world (and kept working without a hitch), Cloudflare’s control plane (API & dashboard) is made up of a large number of microservices that are redundant across two regions. For most services, the databases backing those microservices are only writable in one region at a time.

Each of Cloudflare’s control plane data centers has multiple racks of servers. Each of those racks has two switches that operate as a pair—both are normally active, but either can handle the load if the other fails. Cloudflare survives rack-level failures by spreading the most critical services across racks. Every piece of hardware has two or more power supplies with different power feeds. Every server that stores critical data uses RAID 10 redundant disks or storage systems that replicate data across at least three machines in different racks, or both. Redundancy at each layer is something we review and require. So—how could things go wrong?

In this post we present a timeline of what happened, and how a difficult failure mode known as a Byzantine fault played a role in a cascading series of events.

2020-11-02 14:43 UTC: Partial Switch Failure

At 14:43, a network switch started misbehaving. Alerts began firing about the switch being unreachable to pings. The device was in a partially operating state: network control plane protocols such as LACP and BGP remained operational, while others, such as vPC, were not. The vPC link is used to synchronize ports across multiple switches, so that they appear as one large, aggregated switch to servers connected to them. At the same time, the data plane (or forwarding plane) was not processing and forwarding all the packets received from connected devices.

This failure scenario is completely invisible to the connected nodes, as each server only sees an issue for some of its traffic due to the load-balancing nature of LACP. Had the switch failed fully, all traffic would have failed over to the peer switch, as the connected links would’ve simply gone down, and the ports would’ve dropped out of the forwarding LACP bundles.

Six minutes later, the switch recovered without human intervention. But this odd failure mode led to further problems that lasted long after the switch had returned to normal operation.

2020-11-02 14:44 UTC: etcd Errors begin

The rack with the misbehaving switch included one server in our etcd cluster. We use etcd heavily in our core data centers whenever we need strongly consistent data storage that’s reliable across multiple nodes.

In the event that the cluster leader fails, etcd uses the RAFT protocol to maintain consistency and establish consensus to promote a new leader. In the RAFT protocol, cluster members are assumed to be either available or unavailable, and to provide accurate information or none at all. This works fine when a machine crashes, but is not always able to handle situations where different members of the cluster have conflicting information.

In this particular situation:

  • Network traffic between node 1 (in the affected rack) and node 3 (the leader) was being sent through the switch in the degraded state,
  • Network traffic between node 1 and node 2 was going through the working peer switch, and
  • Network traffic between node 2 and node 3 was unaffected.

This caused cluster members to have conflicting views of reality, known in distributed systems theory as a Byzantine fault. As a consequence of this conflicting information, node 1 repeatedly initiated leader elections, voting for itself, while node 2 repeatedly voted for node 3, which it could still connect to. This resulted in ties that did not promote a leader node 1 could reach. RAFT leader elections are disruptive, blocking all writes until they’re resolved, so this made the cluster read-only until the faulty switch recovered and node 1 could once again reach node 3.

A Byzantine failure in the real world

2020-11-02 14:45 UTC: Database system promotes a new primary database

Cloudflare’s control plane services use relational databases hosted across multiple clusters within a data center. Each cluster is configured for high availability. The cluster setup includes a primary database, a synchronous replica, and one or more asynchronous replicas. This setup allows redundancy within a data center. For cross-datacenter redundancy, a similar high availability secondary cluster is set up and replicated in a geographically dispersed data center for disaster recovery. The cluster management system leverages etcd for cluster member discovery and coordination.

When etcd became read-only, two clusters were unable to communicate that they had a healthy primary database. This triggered the automatic promotion of a synchronous database replica to become the new primary. This process happened automatically and without error or data loss.

There was a defect in our cluster management system that required a rebuild of all database replicas whenever a new primary database was promoted. So, although the new primary database was available instantly, the replicas would take considerable time to become available, depending on the size of the database. For one of the clusters, service was restored quickly: synchronous and asynchronous database replicas were rebuilt and started replicating successfully from the primary, and the impact was minimal.

For the other cluster, however, performant operation of that database required a replica to be online. Because this database handles authentication for API calls and dashboard activities, it takes a lot of reads, and one replica was heavily utilized to spare the primary the load. When this failover happened and no replicas were available, the primary was overloaded, as it had to take all of the load. This is when the main impact started.

Reduce Load, Leverage Redundancy

At this point we saw that our primary authentication database was overwhelmed and began shedding load from it. We dialed back the rate at which we push SSL certificates to the edge, send emails, and other features, to give it space to handle the additional load. Unfortunately, because of its size, we knew it would take several hours for a replica to be fully rebuilt.

A silver lining here is that every database cluster in our primary data center also has online replicas in our secondary data center. Those replicas are not part of the local failover process, and were online and available throughout the incident. The process of steering read-queries to those replicas was not yet automated, so we manually diverted API traffic that could leverage those read replicas to the secondary data center. This substantially improved our API availability.

The Dashboard

The Cloudflare dashboard, like most web applications, has the notion of a user session. When user sessions are created (each time a user logs in) we perform some database operations and keep data in a Redis cluster for the duration of that user’s session. Unlike our API calls, our user sessions cannot currently be moved across the ocean without disruption. As we took actions to improve the availability of our API calls, we were unfortunately making the user experience on the dashboard worse.

This is an area of the system that is currently designed to be able to fail over across data centers in the event of a disaster, but has not yet been designed to work in both data centers at the same time. After a first period in which users on the dashboard became increasingly frustrated, we failed the authentication calls fully back to our primary data center, and kept working on our primary database to ensure we could provide the best service levels possible in that degraded state.

2020-11-02 21:20 UTC Database Replica Rebuilt

The instant the first database replica was rebuilt, it put itself back into service, and performance returned to normal levels. We re-ramped all of the services that had been turned down so that all asynchronous processing could catch up, and after a period of monitoring we marked the end of the incident.

Redundant Points of Failure

The cascade of failures in this incident was interesting because each system, on its face, had redundancy. Moreover, no system fully failed—each entered a degraded state. That combination meant the chain of events that transpired was considerably harder to model and anticipate. It was frustrating yet reassuring that some of the possible failure modes were already being addressed.

A team was already working on fixing the limitation that requires a database replica rebuild upon promotion. Our user sessions system was inflexible in scenarios where we’d like to steer traffic around, and redesigning that was already in progress.

This incident also led us to revisit the configuration parameters we put in place for things that auto-remediate. In previous years, promoting a database replica to primary took far longer than we liked, so getting that process automated and able to trigger on a minute’s notice was a point of pride. At the same time, for at least one of our databases, the cure may be worse than the disease, and in fact we may not want to invoke the promotion process so quickly. Immediately after this incident we adjusted that configuration accordingly.

Byzantine Fault Tolerance (BFT) is a hot research topic. Solutions have been known since 1982, but have had to choose between a variety of engineering tradeoffs, including security, performance, and algorithmic simplicity. Most general-purpose cluster management systems choose to forgo BFT entirely and use protocols based on PAXOS, or simplifications of PAXOS such as RAFT, that perform better and are easier to understand than BFT consensus protocols. In many cases, a simple protocol that is known to be vulnerable to a rare failure mode is safer than a complex protocol that is difficult to implement correctly or debug.

The first uses of BFT consensus were in safety-critical systems such as aircraft and spacecraft controls. These systems typically have hard real time latency constraints that require tightly coupling consensus with application logic in ways that make these implementations unsuitable for general-purpose services like etcd. Contemporary research on BFT consensus is mostly focused on applications that cross trust boundaries, which need to protect against malicious cluster members as well as malfunctioning cluster members. These designs are more suitable for implementing general-purpose services such as etcd, and we look forward to collaborating with researchers and the open source community to make them suitable for production cluster management.

We are very sorry for the difficulty the outage caused, and are continuing to improve as our systems grow. We’ve since fixed the bug in our cluster management system, and are continuing to tune each of the systems involved in this incident to be more resilient to failures of their dependencies.  If you’re interested in helping solve these problems at scale, please visit cloudflare.com/careers.

Analysis of Today’s CenturyLink/Level(3) Outage

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/

Analysis of Today's CenturyLink/Level(3) Outage

Today CenturyLink/Level(3), a major ISP and Internet bandwidth provider, experienced a significant outage that impacted some of Cloudflare’s customers as well as a significant number of other services and providers across the Internet. While we’re waiting for a post mortem from CenturyLink/Level(3), I wanted to write up the timeline of what we saw, how Cloudflare’s systems routed around the problem, why some of our customers were still impacted in spite of our mitigations, and what appears to be the likely root cause of the issue.

Increase In Errors

At 10:03 UTC our monitoring systems started to observe an increased number of errors reaching our customers’ origin servers. These show up as “522 Errors” and indicate that there is an issue connecting from Cloudflare’s network to wherever our customers’ applications are hosted.

Cloudflare is connected to CenturyLink/Level(3) among a large and diverse set of network providers. When we see an increase in errors from one network provider, our systems automatically attempt to reach customers’ applications across alternative providers. Given the number of providers we have access to, we are generally able to continue to route traffic even when one provider has an issue.

Analysis of Today's CenturyLink/Level(3) Outage
The diverse set of network providers Cloudflare connects to. Source: https://bgp.he.net/AS13335#_asinfo‌‌

Automatic Mitigations

In this case, beginning within seconds of the increase in 522 errors, our systems automatically rerouted traffic from CenturyLink/Level(3) to alternate network providers we connect to including Cogent, NTT, GTT, Telia, and Tata.

Our Network Operations Center was also alerted and, beginning at 10:09 UTC, our team started taking additional steps to mitigate any issues our automated systems weren’t able to address automatically. We were successful in keeping traffic flowing across our network for most customers and end users even with the loss of CenturyLink/Level(3) as one of our network providers.

Analysis of Today's CenturyLink/Level(3) Outage
Dashboard Cloudflare’s automated systems recognizing the damage to the Internet caused by the CenturyLink/Level(3) failure and automatically routing around it.

The graph below shows traffic between Cloudflare’s network and six major tier-1 networks that are among the network providers we connect to. The red portion shows CenturyLink/Level(3) traffic, which dropped to near-zero during the incident. You can also see how we automatically shifted traffic to other network providers during the incident to mitigate the impact and ensure traffic continued to flow.

Analysis of Today's CenturyLink/Level(3) Outage
Traffic across six major tier-1 networks that are among the network providers Cloudflare connects to. CenturyLink/Level(3) in red.

The following graph shows 522 errors (indicating our inability to reach customers’ applications) across our network during the time of the incident.

Analysis of Today's CenturyLink/Level(3) Outage

The sharp spike up at 10:03 UTC was the CenturyLink/Level(3) network failing. Our automated systems immediately kicked in to attempt to reroute and rebalance traffic across alternative network providers, causing the errors to drop in half immediately and then fall to approximately 25 percent of the peak as those paths were automatically optimized.

Between 10:03 UTC and 10:11 UTC our systems automatically disabled CenturyLink/Level(3) in the 48 cities where we’re connected to them and rerouted traffic across alternate network providers. Our systems take into account capacity on other providers before shifting out traffic in order to prevent cascading failures. This is why the failover, while automatic, isn’t instantaneous in all locations. Our team was able to apply additional manual mitigations to reduce the number of errors another 5 percent.
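
A toy Python sketch of the capacity check described above: traffic from the failed provider is shifted to the alternatives only up to their spare capacity, so that the failover itself does not overload another path. The providers and numbers here are hypothetical, and the real systems work per-location with far more inputs.

# Toy capacity-aware failover: move traffic off a failed provider without
# exceeding the spare capacity of the remaining providers.
providers = {
    # name: {"capacity": Gbps, "load": Gbps} — hypothetical values
    "CenturyLink": {"capacity": 100, "load": 60},
    "Cogent":      {"capacity": 80,  "load": 50},
    "NTT":         {"capacity": 90,  "load": 40},
    "Telia":       {"capacity": 70,  "load": 30},
}

def shift_away(failed: str) -> float:
    to_move = providers[failed]["load"]
    providers[failed]["load"] = 0
    for name, p in providers.items():
        if name == failed or to_move <= 0:
            continue
        spare = p["capacity"] - p["load"]
        moved = min(spare, to_move)   # never exceed a provider's spare capacity
        p["load"] += moved
        to_move -= moved
    return to_move                    # traffic that could not be placed anywhere

unplaced = shift_away("CenturyLink")
print("unplaced traffic (Gbps):", unplaced)  # > 0 would mean errors persist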

Why Did the Errors Not Drop to Zero?

Unfortunately, there were still an elevated number of errors indicating we were still unable to reach some customers. CenturyLink/Level(3) is among the largest network providers in the world. As a result, many hosting providers only have single-homed connectivity to the Internet through their network.

To use the old Internet “superhighway” analogy, that’s like having only a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. For customers whose only connectivity to the Internet is via CenturyLink/Level(3), or where CenturyLink/Level(3) continued to announce bad routes after they’d been withdrawn, there was no way for us to reach their applications, and they continued to see 522 errors until CenturyLink/Level(3) resolved the issue around 14:30 UTC.

The same was a problem on the other (“eyeball”) side of the network. Individuals need to have an onramp onto the Internet’s superhighway. An onramp to the Internet is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States.

Analysis of Today's CenturyLink/Level(3) Outage
Source: https://broadbandnow.com/CenturyLink

Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. We saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.

So What Likely Happened Here?

While we will not know exactly what happened until CenturyLink/Level(3) issues a post mortem, we can see clues from BGP announcements and how they propagated across the Internet during the outage. BGP is the Border Gateway Protocol. It is how routers on the Internet announce to each other what IPs sit behind them and therefore what traffic they should receive.

Starting at 10:04 UTC, there were a significant number of BGP updates. A BGP update is the signal a router sends to say that a route has changed or is no longer available. Under normal conditions, the Internet sees about 1.5MB to 2MB of BGP updates every 15 minutes. At the start of the incident, the number of BGP updates spiked to more than 26MB per 15-minute period and stayed elevated throughout the incident.

Analysis of Today's CenturyLink/Level(3) Outage
Source: http://archive.routeviews.org/bgpdata/2020.08/UPDATES/

These updates show the instability of BGP routes inside the CenturyLink/Level(3) backbone. The question is what would have caused this instability. The CenturyLink/Level(3) status update offers some hints and points at a flowspec update as the root cause.

Analysis of Today's CenturyLink/Level(3) Outage

What’s Flowspec?

In CenturyLink/Level(3)’s update they mention that a bad Flowspec rule caused the issue. So what is Flowspec? Flowspec is an extension to BGP, which allows firewall rules to be easily distributed across a network, or even between networks, using BGP. Flowspec is a powerful tool. It allows you to efficiently push rules across an entire network almost instantly. It is great when you are trying to quickly respond to something like an attack, but it can be dangerous if you make a mistake.

At Cloudflare, early in our history, we used to use Flowspec ourselves to push out firewall rules in order to, for instance, mitigate large network-layer DDoS attacks. We suffered our own Flowspec-induced outage more than 7 years ago. We no longer use Flowspec ourselves, but it remains a common protocol for pushing out network firewall rules.

We can only speculate what happened at CenturyLink/Level(3), but one plausible scenario is that they issued a Flowspec command to try to block an attack or other abuse directed at their network. The status report indicates that the Flowspec rule prevented BGP itself from being announced. We have no way of knowing what that Flowspec rule was, but here’s one in Juniper’s format that would have blocked all BGP communications across their network.

route DISCARD-BGP {
   match {
      protocol tcp;
      destination-port 179;
   }
   then discard;
}

Why So Many Updates?

A mystery remains, however, why global BGP updates stayed elevated throughout the incident. If the rule blocked BGP then you would expect to see an increase in BGP announcements initially and then they would fall back to normal.

One possible explanation is that the offending Flowspec rule came near the end of a long list of BGP updates. If that were the case, what may have happened is that every router in CenturyLink/Level(3)’s network would receive the Flowspec rule. They would then block BGP. That would cause them to stop receiving the rule. They would start back up again, working their way through all the BGP rules until they got to the offending Flowspec rule again. BGP would be dropped again. The Flowspec rule would no longer be received. And the loop would continue, over and over.

One challenge is that on every cycle, the queue of BGP updates would continue to grow within CenturyLink/Level(3)’s network. This may have gotten to a point where the memory and CPU of their routers were overloaded, causing an additional set of challenges in getting their network back online.

Why Did It Take So Long to Fix?

This was a significant global Internet outage and, undoubtedly, the CenturyLink/Level(3) team received immediate alerts. They are a very sophisticated network operator with a world class Network Operations Center (NOC). So why did it take more than four hours to resolve?

Again, we can only speculate. First, it may have been that the Flowspec rule and the significant load that the large number of BGP updates imposed on their routers made it difficult for them to log in to their own interfaces. Several of the other tier-1 providers took action, it appears at CenturyLink/Level(3)’s request, to de-peer their networks. This would have limited the number of BGP announcements being received by the CenturyLink/Level(3) network and helped give it time to catch up.

Second, it also may have been that the Flowspec rule was not issued by CenturyLink/Level(3) themselves but rather by one of their customers. Many network providers will allow Flowspec peering. This can be a powerful tool for downstream customers wishing to block attack traffic, but can make it much more difficult to track down an offending Flowspec rule when something goes wrong.

Finally, it never helps when these issues occur early on a Sunday morning. Networks the size and scale of CenturyLink/Level(3)’s are extremely complicated. Incidents happen. We appreciate their team keeping us informed with what was going on throughout the incident. #hugops