Tag Archives: Outage

Cloudflare Incident on January 24th, 2023

Post Syndicated from Kenny Johnson original https://blog.cloudflare.com/cloudflare-incident-on-january-24th-2023/

Several Cloudflare services became unavailable for 121 minutes on January 24th, 2023 due to an error releasing code that manages service tokens. The incident degraded a wide range of Cloudflare products including aspects of our Workers platform, our Zero Trust solution, and control plane functions in our content delivery network (CDN).

Cloudflare provides a service token functionality to allow automated services to authenticate to other services. Customers can use service tokens to secure the interaction between an application running in a data center and a resource in a public cloud provider, for example. As part of the release, we intended to introduce a feature that showed administrators the time that a token was last used, giving users the ability to safely clean up unused tokens. The change inadvertently overwrote other metadata about the service tokens and rendered the tokens of impacted accounts invalid for the duration of the incident.

The reason a single release caused so much damage is because Cloudflare runs on Cloudflare. Service tokens impact the ability for accounts to authenticate, and two of the impacted accounts power multiple Cloudflare services. When these accounts’ service tokens were overwritten, the services that run on these accounts began to experience failed requests and other unexpected errors.

We know this impacted several customers and we know the impact was painful. We’re documenting what went wrong so that you can understand why this happened and the steps we are taking to prevent this from occurring again.

What is a service token?

When users log into an application or identity provider, they typically input a username and a password. The password allows that user to demonstrate that they are in control of the username and that the service should allow them to proceed. Layers of additional authentication can be added, like hard keys or device posture, but the workflow consists of a human proving they are who they say they are to a service.

However, humans are not the only users that need to authenticate to a service. Applications frequently need to talk to other applications. For example, imagine you build an application that shows a user information about their upcoming travel plans.

The airline holds details about the flight and its duration in their own system. They do not want to make the details of every individual trip public on the Internet and they do not want to invite your application into their private network. Likewise, the hotel wants to make sure that they only send details of a room booking to a valid, approved third party service.

Your application needs a trusted way to authenticate with those external systems. Service tokens solve this problem by functioning as a kind of username and password for your service. Like usernames and passwords, service tokens come in two parts: a Client ID and a Client Secret. Both the ID and Secret must be sent with a request for authentication. Tokens are also assigned a duration, after which they become invalid and must be rotated. You can grant your application a service token and, if the upstream systems you need validate it, your service can grab airline and hotel information and present it to the end user in a joint report.

When administrators create Cloudflare service tokens, we generate the Client ID and the Client Secret pair. Customers can then configure their requesting services to send both values as HTTP headers when they need to reach a protected resource. The requesting service can run in any environment, including inside of Cloudflare’s network in the form of a Worker or in a separate location like a public cloud provider. Customers need to deploy the corresponding protected resource behind Cloudflare’s reverse proxy. Our network checks every request bound for a configured service for the HTTP headers. If present, Cloudflare validates their authenticity and either blocks the request or allows it to proceed. We also log the authentication event.
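
To make the flow concrete, the hypothetical Go client below (not Cloudflare's code; the hostname is a placeholder) sends both halves of a service token as the CF-Access-Client-Id and CF-Access-Client-Secret request headers that Cloudflare Access expects.

package main

import (
    "fmt"
    "net/http"
)

func main() {
    // Hypothetical protected resource sitting behind Cloudflare's reverse proxy.
    req, err := http.NewRequest("GET", "https://internal.example.com/api/report", nil)
    if err != nil {
        panic(err)
    }

    // Both halves of the service token are sent as HTTP headers.
    req.Header.Set("CF-Access-Client-Id", "6b12308372690a99277e970a3039343c.access")
    req.Header.Set("CF-Access-Client-Secret", "<client-secret>")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Cloudflare validates the pair before the request reaches the protected
    // resource, logs the authentication event, and blocks the request if
    // validation fails.
    fmt.Println("status:", resp.Status)
}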

Incident Timeline

All Timestamps are UTC

At 2023-01-24 16:55 UTC the Access engineering team initiated the release that inadvertently began to overwrite service token metadata, causing the incident.

At 2023-01-24 17:05 UTC a member of the Access engineering team noticed an unrelated issue and rolled back the release which stopped any further overwrites of service token metadata.

Service token values are not updated across Cloudflare’s network until the service token itself is updated (more details below). This caused a staggered impact for the service tokens that had their metadata overwritten.

At 2023-01-24 17:50 UTC the first invalid service token for Cloudflare WARP was synced to the edge. Impact began for WARP and Zero Trust users.

WARP device posture uploads dropped to zero which raised an internal alert

At 2023-01-24 18:12 an incident was declared due to the large drop in successful WARP device posture uploads.

At 2023-01-24 18:19 UTC the first invalid service token for the Cloudflare API was synced to the edge. Impact began for Cache Purge, Cache Reserve, Images and R2. Alerts were triggered for these products, which revealed the larger scope of the incident.

At 2023-01-24 18:21 the overwritten service tokens were discovered during the initial investigation.

At 2023-01-24 18:28 the incident was elevated to include all impacted products.

At 2023-01-24 18:51 an initial solution was identified and implemented to revert the service token to its original value for the Cloudflare WARP account. Impact ended for WARP and Zero Trust.

At 2023-01-24 18:56 the same solution was implemented on the Cloudflare API account. Impact ended for Cache Purge, Cache Reserve, Images and R2.

At 2023-01-24 19:00 an update was made to the Cloudflare API account that incorrectly overwrote its service token again. Impact restarted for Cache Purge, Cache Reserve, Images and R2. All internal Cloudflare account changes were then locked until incident resolution.

At 2023-01-24 19:07 the Cloudflare API account was updated with the correct service token value. Impact ended for Cache Purge, Cache Reserve, Images and R2.

At 2023-01-24 19:51 all affected accounts had their service tokens restored from a database backup. Incident Ends.

What was released and how did it break?

The Access team was rolling out a new change to service tokens that added a “Last seen at” field. This was a popular feature request to help identify which service tokens were actively in use.

What went wrong?

The “last seen at” value was derived by scanning all new login events in an account’s login event Kafka queue. If a login event using a service token was detected, an update to the corresponding service token’s last seen value was initiated.

In order to update the service token’s “last seen at” value, a read-write transaction is made to collect the information about the corresponding service token. Service token read requests redact the “client secret” value by default for security reasons. The “last seen at” update then wrote the service token back using the data from that read, which did not include the “client secret”, so the token was saved with an empty “client secret”.
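
A minimal sketch of that failure mode, in hypothetical Go rather than Cloudflare's actual code: a read path that redacts the secret feeds a whole-record write, so the redacted empty string ends up persisted.

package main

import "fmt"

type ServiceToken struct {
    ClientID     string
    ClientSecret string
    LastSeenAt   int64
}

// Hypothetical in-memory stand-in for the service token store.
var store = map[string]ServiceToken{
    "token-1": {ClientID: "6b12308372690a99277e970a3039343c.access", ClientSecret: "<hashed-value>"},
}

// readRedacted models the default read path, which hides the client secret.
func readRedacted(id string) ServiceToken {
    t := store[id]
    t.ClientSecret = "" // redacted for security
    return t
}

// updateLastSeen models the buggy update: it writes the redacted copy back,
// replacing the stored secret with an empty string.
func updateLastSeen(id string, now int64) {
    t := readRedacted(id)
    t.LastSeenAt = now
    store[id] = t
}

func main() {
    updateLastSeen("token-1", 1674578100)
    fmt.Printf("%q\n", store["token-1"].ClientSecret) // "" — the token can no longer authenticate
}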

An example of the correct and incorrect service token values is shown below:

Example Access Service Token values

{
  "1a4ddc9e-a1234-4acc-a623-7e775e579c87": {
    "client_id": "6b12308372690a99277e970a3039343c.access",
    "client_secret": "<hashed-value>", <-- what you would expect
    "expires_at": 1698331351
  },
  "23ade6c6-a123-4747-818a-cd7c20c83d15": {
    "client_id": "1ab44976dbbbdadc6d3e16453c096b00.access",
    "client_secret": "", <--- this is the problem
    "expires_at": 1670621577
  }
}

The service token “client secret” database column did have a “not null” check; however, an empty text string does not count as a null value, so the check did not catch the bad write.
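
For illustration, here is a hedged sketch (hypothetical Go, invented function name) of the extra validation the remediation section below describes: a null check alone accepts an empty string, so an explicit empty-string check is needed as well, whether in application code or as a database CHECK constraint.

package main

import (
    "errors"
    "fmt"
)

// validateClientSecret mirrors the gap described above: rejecting nil is what
// a "not null" constraint does; rejecting "" requires an additional check.
func validateClientSecret(secret *string) error {
    if secret == nil {
        return errors.New("client_secret must not be null")
    }
    if *secret == "" {
        return errors.New("client_secret must not be empty")
    }
    return nil
}

func main() {
    empty := ""
    fmt.Println(validateClientSecret(nil))    // caught by the null check
    fmt.Println(validateClientSecret(&empty)) // only caught by the empty-string check
}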

As a result of the bug, any Cloudflare account that used a service token to authenticate during the 10 minutes the “last seen at” release was live had its “client secret” value set to an empty string. The service token then needed to be modified (and synced to our network) before the empty “client secret” was actually used for authentication. There were a total of four accounts in this state, all of which are internal to Cloudflare.

How did we fix the issue?

As a temporary solution, we were able to manually restore the correct service token values for the accounts with overwritten service tokens. This stopped the immediate impact across the affected Cloudflare services.

The database team was then able to implement a solution to restore the service tokens of all impacted accounts from an older database copy. This concluded any impact from this incident.

Why did this impact other Cloudflare services?

Service tokens impact the ability for accounts to authenticate. Two of the impacted accounts power multiple Cloudflare services. When these accounts’ service tokens were overwritten, the services that run on these accounts began to experience failed requests and other unexpected errors.

Cloudflare WARP Enrollment

Cloudflare provides a mobile and desktop forward proxy, Cloudflare WARP (our “1.1.1.1” app), that any user can install on a device to improve the privacy of their Internet traffic. Any individual can install this service without the need for a Cloudflare account and we do not retain logs that map activity to a user.

When a user connects using WARP, Cloudflare validates the enrollment of a device by relying on a service that receives and validates the keys on the device. In turn, that service communicates with another system that tells our network to provide the newly enrolled device with access to our network.

During the incident, the enrollment service could no longer communicate with systems in our network that would validate the device. As a result, users could no longer register new devices and/or install the app on a new device, and may have experienced issues upgrading to a new version of the app (which also triggers re-registration).

Cloudflare Zero Trust Device Posture and Re-Auth Policies

Cloudflare provides a comprehensive Zero Trust solution that customers can deploy with or without an agent living on the device. Some use cases are only available when using the Cloudflare agent on the device. The agent is an enterprise version of the same Cloudflare WARP solution and experienced similar degradation anytime the agent needed to send or receive device state. This impacted three use cases in Cloudflare Zero Trust.

First, similar to the consumer product, new devices could not be enrolled and existing devices could not be revoked. Administrators were also unable to modify settings of enrolled devices. In all cases, errors would have been presented to the user.

Second, many customers who replace their existing private network with Cloudflare’s Zero Trust solution may add rules that continually validate a user’s identity through the use of session duration policies. The goal of these rules is to force users to reauthenticate in order to prevent stale sessions from having ongoing access to internal systems. The agent on the device prompts the user to reauthenticate based on signals from Cloudflare’s control plane. During the incident, the signals were not sent and users could not successfully reauthenticate.

Finally, customers who rely on device posture rules also experienced impact. Device posture rules allow customers who use Access or Gateway policies to rely on the WARP agent to continually enforce that a device meets corporate compliance rules.

The agent communicates these signals to a Cloudflare service responsible for maintaining the state of the device. Cloudflare’s Zero Trust access control product uses a service token to receive this signal and evaluate it along with other rules to determine if a user can access a given resource. During this incident those rules defaulted to a block action, meaning that traffic modified by these policies would appear broken to the user. In some cases this meant that all internet bound traffic from a device was completely blocked leaving users unable to access anything.

Cloudflare Gateway caches the device posture state for users every 5 minutes to apply Gateway policies. The device posture state is cached so Gateway can apply policies without having to verify device state on every request. Depending on which Gateway policy type was matched, the user would experience two different outcomes. If they matched a network policy the user would experience a dropped connection and for an HTTP policy they would see a 5XX error page. We peaked at over 50,000 5XX errors/minute over baseline and had over 10.5 million posture read errors until the incident was resolved.
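
The fail-closed behavior described above can be sketched as follows. This is hypothetical Go with invented names, not Gateway's implementation; it simply assumes the five-minute cache window mentioned above and blocks when posture cannot be determined.

package main

import (
    "errors"
    "fmt"
    "time"
)

type posture struct {
    compliant bool
    fetchedAt time.Time
}

var cache = map[string]posture{} // deviceID -> last known posture

// fetchPosture stands in for the call to the device posture service. During
// the incident this call failed because the service token it relied on was invalid.
func fetchPosture(deviceID string) (posture, error) {
    return posture{}, errors.New("posture service: authentication failed")
}

// allowed refreshes posture at most every 5 minutes and fails closed:
// if posture cannot be determined, the device is treated as non-compliant.
func allowed(deviceID string) bool {
    p, ok := cache[deviceID]
    if !ok || time.Since(p.fetchedAt) > 5*time.Minute {
        fresh, err := fetchPosture(deviceID)
        if err != nil {
            return false // block when posture is unknown
        }
        cache[deviceID] = fresh
        p = fresh
    }
    return p.compliant
}

func main() {
    fmt.Println(allowed("laptop-42")) // false while posture reads are failing
}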

Gateway 5XX errors per minute

Total count of Gateway Device posture errors

Cloudflare R2 Storage and Cache Reserve

Cloudflare R2 Storage allows developers to store large amounts of unstructured data without the costly egress bandwidth fees associated with typical cloud storage services.

During the incident, the R2 service was unable to make outbound API requests to other parts of the Cloudflare infrastructure. As a result, R2 users saw elevated request failure rates when making requests to R2.  

Many Cloudflare products also depend on R2 for data storage and were also affected. For example, Cache Reserve users were impacted during this window and saw increased origin load for any items not in the primary cache. The majority of read and write operations to the Cache Reserve service were impacted during this incident causing entries into and out of Cache Reserve to fail. However, when Cache Reserve sees an R2 error, it falls back to the customer origin, so user traffic was still serviced during this period.
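
A short sketch of that fallback, in hypothetical Go with invented function names: when the Cache Reserve (R2) read fails, the request is served from the customer origin instead of failing outright.

package main

import (
    "errors"
    "fmt"
)

// readFromCacheReserve stands in for the R2-backed Cache Reserve lookup. During
// the incident these calls failed because the internal service token was invalid.
func readFromCacheReserve(key string) ([]byte, error) {
    return nil, errors.New("R2: request failed")
}

// fetchFromOrigin stands in for going back to the customer's origin server.
func fetchFromOrigin(key string) []byte {
    return []byte("origin copy of " + key)
}

// serve falls back to the origin on a Cache Reserve error, so user traffic
// is still served even while Cache Reserve reads and writes are failing.
func serve(key string) []byte {
    if body, err := readFromCacheReserve(key); err == nil {
        return body
    }
    return fetchFromOrigin(key)
}

func main() {
    fmt.Printf("%s\n", serve("/assets/logo.png"))
}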

Cloudflare Cache Purge

Cloudflare’s content delivery network (CDN) caches the content of Internet properties on our network in our data centers around the world to reduce the distance that a user’s request needs to travel for a response. In some cases, customers want to purge what we cache and replace it with different data.

The Cloudflare control plane, the place where an administrator interacts with our network, uses a service token to authenticate and reach the cache purge service. During the incident, many purge requests failed while the service token was invalid. We saw an average impact of 20 purge requests/second failing and a maximum of 70 requests/second.

What are we doing to prevent this from happening again?

We take incidents like this seriously and recognize the impact it had. We have identified several steps we can take to address the risk of a similar problem occurring in the future. We are implementing the following remediation plan as a result of this incident:

Test: The Access engineering team will add unit tests that would automatically catch any similar issues with service token overwrites before any new features are launched.

Alert: The Access team will implement an automatic alert for any dramatic increase in failed service token authentication requests to catch issues before they are fully launched.

Process: The Access team has identified process improvements to allow for faster rollbacks for specific database tables.

Implementation: All relevant database fields will be updated to include checks for empty strings on top of the existing “not null” checks.

We are sorry for the disruption this caused for our customers across a number of Cloudflare services. We are actively making these improvements to ensure improved stability moving forward and that this problem will not happen again.

Internet disruptions overview for Q4 2022

Post Syndicated from David Belson original https://blog.cloudflare.com/q4-2022-internet-disruption-summary/

Cloudflare operates in more than 250 cities in over 100 countries, where we interconnect with over 10,000 network providers in order to provide a broad range of services to millions of customers. The breadth of both our network and our customer base provides us with a unique perspective on Internet resilience, enabling us to observe the impact of Internet disruptions.

While Internet disruptions are never convenient, online interest in the 2022 World Cup in mid-November and the growth in online holiday shopping in many areas during November and December meant that connectivity issues could be particularly disruptive. Having said that, the fourth quarter appeared to be a bit quieter from an Internet disruptions perspective, although Iran and Ukraine continued to be hotspots, as we discuss below.

Government directed

Multi-hour Internet shutdowns are frequently used by authoritarian governments in response to widespread protests as a means of limiting communications among protestors, as well as preventing protestors from sharing information and video with the outside world. During the fourth quarter, Cuba and Sudan again implemented such shutdowns, while Iran continued the series of “Internet curfews” across mobile networks it started in mid-September, in addition to implementing several other regional Internet shutdowns.

Cuba

In late September, Hurricane Ian knocked out power across Cuba. While officials worked to restore service as quickly as possible, some citizens responded to perceived delays with protests that were reportedly the largest since anti-government demonstrations over a year earlier. In response to these protests, the Cuban government reportedly cut off Internet access several times. A shutdown on September 29-30 was covered in the Internet disruptions overview for Q3 2022, and the impact of the shutdown that occurred on October 1 (UTC) is shown in the figure below. The timing of this one was similar to the previous one, taking place between 1900 on September 30 and 0245 on October 1 (0000-0745 UTC on October 1).

Sudan

October 25 marked the first anniversary of a coup in Sudan that derailed the country’s transition to civilian rule, and thousands of Sudanese citizens marked the anniversary by taking to the streets in protest. Sudan’s government has a multi-year history of shutting down Internet access during times of civil unrest, and once again implemented an Internet shutdown in response to these protests. The figure below shows a near complete loss of Internet traffic from Sudan on October 25 between 0945-1740 local time (0745 – 1540 UTC).

Iran

As we covered in last quarter’s blog post, the Iranian government implemented daily Internet “curfews”, generally taking place between 1600 and midnight local time (1230-2030 UTC) across three mobile network providers — AS44244 (Irancell), AS57218 (RighTel), and AS197207 (MCCI) — in response to protests surrounding the death of Mahsa Amini. These multi-hour Internet curfew shutdowns continued into early October, with additional similar outages also observed on October 8, 12 and 15 as seen in the figure below. (The graph’s line for AS57218 (Rightel), the smallest of the three mobile providers, suggests that the shutdowns on this network were not implemented after the end of September.)

In addition to the mobile network shutdowns, several regional Internet disruptions were also observed in Iran during the fourth quarter, two of which we review below. The first was in Sanandaj, Kurdistan Province on October 26, where a complete Internet shutdown was implemented in response to demonstrations marking the 40th day since the death of Mahsa Amini. The figure below shows a complete loss of traffic starting at 1030 local time (0700 UTC), with the outage lasting until 0805 local time on October 27 (0435 UTC). In December, a province-level Internet disruption was observed starting on December 18, lasting through December 25.

Kurdistan Province, Iran. (Source: Map data ©2023 Google, MapaGISrael)

The Internet disruptions that have taken place in Iran over the last several months have had a significant economic impact on the country. A December post from Filterwatch shared concerns stated in a letter from mobile operator Rightel:

The letter, signed by the network’s Managing Director Yasser Rezakhah, states that “during the past few weeks, the company’s resources and income have significantly decreased during Internet shutdowns and other restrictions, such as limiting Internet bandwidth from 21 September. They have also caused a decrease in data use from subscribers, decreasing data traffic by around 50%.” The letter also states that the “continued lack of compensation for losses could lead to bankruptcy.”

The post also highlighted economic concerns shared by Iranian officials:

Some Iranian officials have expressed concern about the cost of Internet shutdowns, including Valiollah Bayati, MP for Tafresh and Ashtian in Markazi province. In a public session in Majles (parliament), he stated that continued Internet shutdowns have led to the closure of many jobs and people are worried, the government and the President must provide necessary measures.

Statistics in an article on news site enthkhab.ir provide a more tangible view of the local economic impact, stating (via Google Translate):

Since the 30th of Shahrivar month and with the beginning of the government disruption in the Internet, the country’s businesses have been damaged daily at least 50 million tomans and at most 500 million tomans. More than 41% of companies have lost 25-50% of their income during this period, and about 47% have had more than 50% reduction in sales. A review of the data of the research assistant of the country’s tax affairs organization shows that the Internet outage in Iran has caused 3000 billion tomans of damage per day. That is, the cost of 3 months of Internet outage in Iran is equal to 43% of one year’s oil revenue of the country ($25 billion).

Power outages

Bangladesh, October 4

Over 140 million people in Bangladesh were left without electricity on October 4 as the result of a reported grid failure, caused when power distribution companies failed to follow load-shedding instructions from the National Load Dispatch Centre. The power outage led to an observed drop in Internet traffic from the country, starting at 1405 local time (0805 UTC), as shown in the figure below. The disruption lasted approximately seven hours, with traffic returning to expected levels around 1900 local time (1500 UTC).

Pakistan

Two days later, a similar issue in Pakistan caused power outages across the southern part of the country, including Sindh, Punjab, and Balochistan. The power outages were caused by a fault in the national grid’s southern transmission system, reportedly due to faulty equipment and sub-standard maintenance. As expected, the power outages resulted in disruptions to Internet connectivity, and the figure below illustrates the impact observed in Sindh, where traffic dropped nearly 30% as compared to the previous week starting at 0935 local time (0435 UTC) on October 6. The disruption lasted over 15 hours, with traffic returning to expected levels at 0100 on October 7 (2000 UTC on October 6).

Sindh, Pakistan (Source: Map data ©2023 Google)

Kenya

On November 24, a Tweet from Kenya Power at 1525 local time noted that they had “lost bulk power supply to various parts of the country due to a system disturbance”. A subsequent Tweet published just over six hours later at 2150 local time stated that “normal power supply has been restored to all parts of the country.” The time stamps on these notifications align with the loss of Internet traffic visible in the figure below, which lasted between 1500-2050 local time (1200-1750 UTC).

United States (Moore County, North Carolina)

On December 3, two electrical substations in Moore County, North Carolina were targeted by gunfire, with the resultant damage causing localized power outages that took multiple days to resolve. The power outages reportedly began just after 1900 local time (0000 UTC on December 4), resulting in the concurrent loss of Internet traffic from communities within Moore County, as seen in the figure below.

Internet traffic within the community of West End appeared to return midday (UTC) on December 5, but that recovery was apparently short-lived, as it fell again during the afternoon of December 6. In Pinehurst, traffic began to slowly recover after about a day, but returned to more normal levels around 0800 local time (1300 UTC) on December 7.

West End and Pinehurst, North Carolina. (Source: Map data ©2023 Google)

Ukraine

The war in Ukraine has been going on since February 24, and Cloudflare has covered the impact of the war on the country’s Internet connectivity in a number of blog posts across the year (March, March, April, May, June, July, October, December). Throughout the fourth quarter of 2022, Russian missile strikes caused widespread damage to electrical infrastructure, resulting in power outages and disruptions to Internet connectivity. Below, we highlight several examples of the Internet disruptions observed in Ukraine during the fourth quarter, but they are just a few of the many disruptions that occurred.

On October 20, the destruction of several power stations in Kyiv resulted in a 25% drop in Internet traffic from Kyiv City as compared to the two previous weeks. The disruption began around 0900 local time (0700 UTC).

Kyiv City, Ukraine. (Source: Map data ©2023 Google)

On November 23, widespread power outages after Russian strikes caused a nearly 50% decrease in Internet traffic in Ukraine, starting just after 1400 local time (1200 UTC). This disruption lasted for nearly a day and a half, with traffic returning to expected levels around 2345 local time on November 24 (2145 UTC).

On December 16, power outages resulting from Russian air strikes targeting power infrastructure caused country-level Internet traffic to drop around 13% at 0915 local time (0715 UTC), with the disruption lasting until midnight local time (2200 UTC). However, at a network level, the impact was more significant, with AS13188 (Triolan) seeing a 70% drop in traffic, and AS15895 (Kyivstar) a 40% drop, both shown in the figures below.

Cable cuts

Shetland Islands, United Kingdom

The Shetland Islands are primarily dependent on the SHEFA-2 submarine cable system for Internet connectivity, connecting through the Scottish mainland. Late in the evening of October 19, damage to this cable knocked the Shetland Islands almost completely offline. At the time, there was heightened concern about the potential sabotage of submarine cables due to the reported sabotage of the Nord Stream natural gas pipelines in late September, but authorities believed that this cable damage was due to errant fishing vessels, and not sabotage.

The figure below shows that the impact of the damage to the cable was relatively short-lived, compared to the multi-day Internet disruptions often associated with submarine cable cuts. Traffic dropped just after 2300 local time (2200 UTC) on October 19, and recovered roughly 15.5 hours later, just after 1430 local time (1330 UTC) on October 20.

Shetland Islands, United Kingdom. (Source: Map data ©2023 GeoBasis-DE/BKG (©2009), Google)

Natural disasters

Solomon Islands

Earthquakes frequently cause infrastructure damage and power outages in affected areas, resulting in disruptions to Internet connectivity. We observed such a disruption in the Solomon Islands after a magnitude 7.0 earthquake struck near the islands on November 22. The figure below shows Internet traffic from the country dropping significantly at 1300 local time (0200 UTC), and recovering roughly seven hours later, around 2000 local time (0900 UTC).

Technical problems

Kyrgyzstan

On October 24, a three-hour Internet disruption was observed in Kyrgyzstan lasting between 1100-1400 local time (0500-0800 UTC), as seen in the figure below. According to the country’s Ministry of Digital Development, the issue was caused by “an accident on one of the main lines that supply the Internet”, but no additional details were provided regarding the type of accident or where it had occurred.

Australia (Aussie Broadband)

Customers of Australian broadband Internet provider Aussie Broadband in Victoria and New South Wales suffered brief Internet disruptions on October 27. As shown in the figure below, AS4764 (Aussie Broadband) traffic from Victoria dropped by approximately 40% between 1505-1745 local time (0405-0645 UTC). A similar, but briefer, loss of traffic from New South Wales was also observed, lasting between 1515-1550 local time (0415-0450 UTC). A representative of Aussie Broadband provided insight into the underlying cause of the disruption, stating “A config change was made which was pushed out through automation to the DHCP servers in those states. … The change has been rolled back but getting the sessions back online is taking time for VIC, and we are now manually bringing areas up one at a time.”

Victoria and New South Wales, Australia. (Source: Map data ©2023 Google)

Haiti

In Haiti, customers of Internet service provider Access Haiti experienced disrupted service for more than half a day on November 9. The figure below shows that Internet traffic for AS27759 (Access Haiti) fell precipitously around midnight local time (0500 UTC), remaining depressed until 1430 local time (1930 UTC), at which time it recovered quickly. A Facebook post from Access Haiti explained to customers that “Due to an intermittent outage on one of our international circuits, our network is experiencing difficulties that cause your Internet service to slow down.” While Access Haiti didn’t provide additional details on which international circuit was experiencing an outage, submarinecablemap.com shows that two submarine cables provide international Internet connectivity to Haiti — the Bahamas Domestic Submarine Network (BDSNi), which connects Haiti to the Bahamas, and Fibralink, which connects Haiti to the Dominican Republic and Jamaica.

Unknown

Many Internet disruptions can be easily tied to an underlying cause, whether through coverage in the press, a concurrent weather or natural disaster event, or communication from an impacted provider. However, the causes of other observed disruptions remain unknown as the impacted providers remain silent about what caused the problem.

United States (Wide Open West)

On November 15, customers of Wide Open West, an Internet service provider with a multi-state footprint in the United States, experienced an Internet service disruption that lasted a little over an hour. The figure below illustrates the impact of the disruption in Alabama and Michigan on AS12083 (Wide Open West), with traffic dropping at 1150 local time (1650 UTC) and recovering just after 1300 local time (1800 UTC).

Cuba

Cuba is no stranger to Internet disruptions, whether due to government-directed shutdowns (such as the one discussed above), fiber cuts, or power outages. However, no underlying cause was ever shared for the seven-hour disruption in the country’s Internet traffic observed between 2345 on November 25 and 0645 on November 26 local time (0445-1145 UTC on November 26). Traffic was down as much as 75% from previous levels during the disruption.

SpaceX Starlink

Because SpaceX Starlink provides low earth orbit (LEO) satellite Internet connectivity, disruptions to its service can have a global impact. On November 30, a disruption was observed on AS14593 (SPACEX-STARLINK) between 2050-2130 UTC, with traffic volume briefly dropping to near zero. Unfortunately, Starlink did not acknowledge the incident, nor did it provide any reason for the disruption.

Conclusion

Looking back at the Internet disruptions observed during 2022, a number of common themes can be found. In countries with more authoritarian governments, the Internet is often weaponized as a means of limiting communication within the country and with the outside world through network-level, regional, or national Internet shutdowns. As noted above, this approach was used aggressively in Iran during the last few months of the year.

Internet connectivity quickly became a casualty of war in Ukraine. Early in the conflict, network-level outages were common, and some Ukrainian networks ultimately saw traffic re-routed through upstream Russian Internet service providers. Later in the year, as electrical power infrastructure was increasingly targeted by Russian attacks, widespread power outages resulted in multi-hour disruptions of Internet traffic across the country.

While the volcanic eruption in Tonga took the country offline for over a month due to its reliance on a single submarine cable for Internet connectivity, the damage caused by earthquakes in other countries throughout the year resulted in much shorter and more limited disruptions.

And while a submarine cable issue can impact multiple countries along the cable’s route, the advent of services with an increasingly global footprint like SpaceX Starlink means that service disruptions can ultimately have a much broader impact. (Starlink’s subscriber base is comparatively small at the moment, but it currently has a service footprint in over 30 countries around the world.)

To follow Internet disruptions as they occur, check the Cloudflare Radar Outage Center (CROC) and follow @CloudflareRadar on Twitter. To review those disruptions observed earlier in 2022, refer to the Q1, Q2, and Q3 Internet disruptions overview blog posts.

Partial Cloudflare outage on October 25, 2022

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/partial-cloudflare-outage-on-october-25-2022/

Today, a change to our Tiered Cache system caused some requests to fail for users with status code 530. The impact lasted for almost six hours in total. We estimate that about 5% of all requests failed at peak. Because of the complexity of our system and a blind spot in our tests, we did not spot this when the change was released to our test environment.  

The failures were caused by side effects of how we handle cacheable requests across locations. At first glance, the errors looked like they were caused by a different system that had started a release some time before. It took our teams a number of tries to identify exactly what was causing the problems. Once identified we expedited a rollback which completed in 87 minutes.

We’re sorry, and we’re taking steps to make sure this does not happen again.

Background

One of Cloudflare’s products is our Content Delivery Network, or CDN. This is used to cache assets for websites globally. However, a data center is not guaranteed to have an asset cached. It could be new, expired, or purged. If that happens, and a user requests that asset, our CDN needs to retrieve a fresh copy from a website’s origin server. But the data center that the user is accessing might still be pretty far away from the origin server. This presents an additional issue for customers: every time an asset is not cached in a given data center, we need to retrieve a new copy from the origin server.

To improve cache hit ratios, we introduced Tiered Cache. With Tiered Cache, we organize our data centers in the CDN into a hierarchy of “lower tiers” which are closer to the end users and “upper tiers” that are closer to the origin. When a cache-miss occurs in a lower tier, the upper tier is checked. If the upper tier has a fresh copy of the asset, we can serve that in response to the request. This improves performance and reduces the amount of times that Cloudflare has to reach out to an origin server to retrieve assets that are not cached in lower tier data centers.
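
A rough sketch of that lookup order, in hypothetical Go rather than the CDN's actual implementation: check the local (lower-tier) cache, then the upper tier, and only go back to the origin when both miss.

package main

import "fmt"

type tier map[string]string // URL -> cached response body

func (t tier) get(key string) (string, bool) {
    v, ok := t[key]
    return v, ok
}

// lookup models Tiered Cache: local cache first, then the upper tier,
// and only then the customer's origin server.
func lookup(lower, upper tier, key string, origin func(string) string) string {
    if v, ok := lower.get(key); ok {
        return v // lower-tier cache hit
    }
    if v, ok := upper.get(key); ok {
        lower[key] = v // fill the lower tier for subsequent requests
        return v       // served without contacting the origin
    }
    v := origin(key) // miss in both tiers: fetch a fresh copy
    upper[key] = v
    lower[key] = v
    return v
}

func main() {
    lower := tier{}
    upper := tier{"/index.html": "<html>cached copy</html>"}
    fetch := func(key string) string { return "origin copy of " + key }
    fmt.Println(lookup(lower, upper, "/index.html", fetch)) // served from the upper tier
    fmt.Println(lookup(lower, upper, "/new.css", fetch))    // falls through to the origin
}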

Incident timeline and impact

At 08:40 UTC, a software release of a CDN component containing a bug began slowly rolling out. The bug was triggered when a user visited a site with either Tiered Cache, Cloudflare Images, or Bandwidth Alliance configured. This bug caused a subset of those customers to return HTTP Status Code 530 — an error. Content that could be served directly from a data center’s local cache was unaffected.

We started an investigation after receiving customer reports of an intermittent increase in 530s after the faulty component was released to a subset of data centers.

Once the release started rolling out globally to the remaining data centers, a sharp increase in 530s triggered alerts along with more customer reports, and an incident was declared.

Requests resulting in a response with status code 530

We confirmed a bad release was responsible by rolling back the release in a data center at 17:03 UTC. After the rollback, we observed a drop in 530 errors. After this confirmation, an accelerated global rollback began and the 530s started to decrease. Impact ended once the release was reverted in all data centers configured as Tiered Cache upper tiers at 18:04 UTC.

Timeline:

  • 2022-10-25 08:40: The release started to roll out to a small subset of data centers.
  • 2022-10-25 10:35: An individual customer alert fires, indicating an increase in 500 error codes.
  • 2022-10-25 11:20: After an investigation, a single small data center is pinpointed as the source of the issue and removed from production while teams investigate the issue there.
  • 2022-10-25 12:30: Issue begins spreading more broadly as more data centers get the code changes.
  • 2022-10-25 14:22: 530 errors increase as the release starts to slowly roll out to our largest data centers.
  • 2022-10-25 14:39: Multiple teams become involved in the investigation as more customers start reporting increases in errors.
  • 2022-10-25 17:03: CDN Release is rolled back in Atlanta and root cause is confirmed.
  • 2022-10-25 17:28: Peak impact with approximately 5% of all HTTP requests resulting in an error with status code 530.
  • 2022-10-25 17:38: An accelerated rollback continues with large data centers acting as Upper tier for many customers.
  • 2022-10-25 18:04: Rollback is complete in all Upper Tiers.
  • 2022-10-25 18:30: Rollback is complete.

During the early phases of the investigation, the indicators were that this was a problem with our internal DNS system that also had a release rolling out at the same time. As the following section shows, that was a side effect rather than the cause of the outage.  

Adding distributed tracing to Tiered Cache introduced the problem

In order to help improve our performance, we routinely add monitoring code to various parts of our services. Monitoring code helps by giving us visibility into how various components are performing, allowing us to determine bottlenecks that we can improve on. Our team recently added additional distributed tracing to our Tiered Cache logic. The tiered cache entrypoint code is as follows:

* Before:

function _M.go()
   -- code to run here
end

* After:

local trace_fn = require("opentracing").trace_fn

local function go()
   -- code to run here
end

function _M.go()
   trace_fn(ngx.ctx, "tiered_cache_rewrite", go)
end

The code above wraps the existing go() function with trace_fn(), which calls the go() function and then reports its execution time.

However, the logic that injects a function into the opentracing module clears control headers on every request:

require("opentracing").configure_module(conf,
-- control header extractor
function(ctx)
-- Always clear the headers.
clear_control_headers()

Normally, we extract data from these control headers before clearing them as a routine part of how we process requests.

But internal tiered cache traffic expects the control headers from the lower tier to be passed as-is. The combination of clearing the headers and using an upper tier meant that information critical to the routing of the request was no longer available. In the affected subset of requests, we were missing the hostname that our internal DNS lookup uses to resolve origin server IP addresses. As a result, a 530 DNS error was returned to the client.
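
The ordering problem can be sketched as follows in hypothetical Go; the header name cf-origin-host is invented for this example and is not Cloudflare's actual control header. Extracting what the request needs before clearing the control headers avoids the missing-hostname 530.

package main

import (
    "errors"
    "fmt"
)

type request struct {
    controlHeaders map[string]string
    originHost     string
}

// buggy mirrors the behavior introduced by the tracing wrapper: the control
// headers are cleared before the hostname is extracted from them.
func buggy(r *request) error {
    r.controlHeaders = map[string]string{} // cleared too early
    r.originHost = r.controlHeaders["cf-origin-host"]
    if r.originHost == "" {
        return errors.New("530: no hostname available for DNS lookup")
    }
    return nil
}

// fixed extracts the hostname first, then clears the headers.
func fixed(r *request) error {
    r.originHost = r.controlHeaders["cf-origin-host"]
    r.controlHeaders = map[string]string{}
    if r.originHost == "" {
        return errors.New("530: no hostname available for DNS lookup")
    }
    return nil
}

func main() {
    fmt.Println(buggy(&request{controlHeaders: map[string]string{"cf-origin-host": "origin.example.com"}}))
    fmt.Println(fixed(&request{controlHeaders: map[string]string{"cf-origin-host": "origin.example.com"}}))
}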

Remediation and follow-up steps

To prevent this from happening again, in addition to fixing the bug, we have identified a set of changes that will help us detect and prevent issues like this in the future:

  • Include a larger data center that is configured as a Tiered Cache upper tier in an earlier stage in the release plan. This will allow us to notice similar issues more quickly, before a global release.
  • Expand our acceptance test coverage to include a broader set of configurations, including various Tiered Cache topologies.
  • Alert more aggressively in situations where we do not have full context on requests, and need the extra host information in the control headers.
  • Ensure that our system correctly fails fast in an error like this, which would have helped identify the problem during development and test.

Conclusion

We experienced an incident that affected a significant set of customers using Tiered Cache. After identifying the faulty component, we were able to quickly roll back the release and remediate the issue. We are sorry for any disruption this caused our customers and end users trying to access services.

Remediations to prevent such an incident from happening in the future will be put in place as soon as possible.

Internet disruptions overview for Q3 2022

Post Syndicated from David Belson original https://blog.cloudflare.com/q3-2022-internet-disruption-summary/

Cloudflare operates in more than 275 cities in over 100 countries, where we interconnect with over 10,000 network providers in order to provide a broad range of services to millions of customers. The breadth of both our network and our customer base provides us with a unique perspective on Internet resilience, enabling us to observe the impact of Internet disruptions. In many cases, these disruptions can be attributed to a physical event, while in other cases, they are due to an intentional government-directed shutdown. In this post, we review selected Internet disruptions observed by Cloudflare during the third quarter of 2022, supported by traffic graphs from Cloudflare Radar and other internal Cloudflare tools, and grouped by associated cause or common geography. The new Cloudflare Radar Outage Center provides additional information on these, and other historical, disruptions.

Government directed shutdowns

Unfortunately, for the last decade, governments around the world have turned to shutting down the Internet as a means of controlling or limiting communication among citizens and with the outside world. In the third quarter, this was an all too popular cause of observed disruptions, impacting countries and regions in Africa, the Middle East, Asia, and the Caribbean.

Iraq

As mentioned in our Q2 summary blog post, on June 27, the Kurdistan Regional Government in Iraq began to implement twice-weekly (Mondays and Thursdays) multi-hour regional Internet shutdowns over the following four weeks, intended to prevent cheating on high school final exams. As seen in the figure below, these shutdowns occurred as expected each Monday and Thursday through July 21, with one exception. They impacted three governorates in Iraq, and lasted from 0630–1030 local time (0330–0730 UTC) each day.

Erbil, Sulaymaniyah, and Duhok Governorates, Iraq. (Source: Map data ©2022 Google, Mapa GISrael)

Cuba

In Cuba, an Internet disruption was observed between 0055-0150 local time (0455-0550 UTC) on July 15 amid reported anti-government protests in Los Palacios and Pinar del Rio.

Los Palacios and Pinar del Rio, Cuba. (Source: Map data ©2022 INEGI)

Closing out the quarter, another significant disruption was observed in Cuba, reportedly in response to protests over the lack of electricity in the wake of Hurricane Ian. A complete outage is visible in the figure below between 2030 on September 29 and 0315 on September 30 local time (0030-0715 UTC on September 30).

Afghanistan

Telecommunications services were reportedly shut down in part of Kabul, Afghanistan on the morning of August 8. The figure below shows traffic dropping starting around 0930 local time (0500 UTC), recovering 11 hours later, around 2030 local time (1600 UTC).

Kabul, Afghanistan. (Source: Map data ©2022 Google)

Sierra Leone

Protests in Freetown, Sierra Leone over the rising cost of living likely drove the Internet disruptions observed within the country on August 10 & 11. The first one occurred between 1200-1400 local time (1200-1400 UTC) on August 10. While this outage is believed to have been government directed as a means of quelling the protests, Zoodlabs, which manages Sierra Leone Cable Limited, claimed that the outage was the result of “emergency technical maintenance on some of our international routes”.

A second longer outage was observed between 0100-0730 local time (0100-0730 UTC) on August 11, as seen in the figure below. These shutdowns follow similar behavior in years past, where Internet connectivity was shut off following elections within the country.

Freetown, Sierra Leone (Source: Map data ©2022 Google, Inst. Geogr. Nacional)

Region of Somaliland

In Somaliland, local authorities reportedly cut off Internet service on August 11 ahead of scheduled opposition demonstrations. The figure below shows a complete Internet outage in Woqooyi Galbeed between 0645-1355 local time (0345-1055 UTC).

Woqooyi Galbeed, Region of Somaliland. (Source: Map data ©2022 Google, Mapa GISrael)

At a network level, the observed outage was due to a loss of traffic from AS37425 (SomCable) and AS37563 (Somtel), as shown in the figures below. Somtel is a mobile services provider, while SomCable is focused on providing wireline Internet access.

India

India is no stranger to government-directed Internet shutdowns, taking such action hundreds of times over the last decade. This may be changing in the future, however, as the country’s Supreme Court ordered the Ministry of Electronics and Information Technology (MEITY) to reveal the grounds upon which it imposes or approves Internet shutdowns. Until this issue is resolved, we will continue to see regional shutdowns across the country.

One such example occurred in Assam, where mobile Internet connectivity was shut down to prevent cheating on exams. The figure below shows that these shutdowns were implemented twice daily on August 21 and August 28. While the shutdowns were officially scheduled to take place between 1000-1200 and 1400-1600 local time (0430-0630 and 0830-1030 UTC), some providers reportedly suspended connectivity starting in the early morning.

Assam, India. (Source: Map data ©2022 Google, TMap Mobility)

Iran

In late September, protests and demonstrations erupted across Iran in response to the death of Mahsa Amini. Amini was a 22-year-old woman from the Kurdistan Province of Iran, who was arrested on September 13, 2022, in Tehran by Iran’s “morality police”, a unit that enforces strict dress codes for women. She died on September 16 while in police custody. In response to these protests and demonstrations, Internet connectivity across the country experienced multiple waves of disruptions.

In addition to multi-hour outages in Sanandaj and Tehran province on September 19 and 21 that were covered in a blog post, three mobile network providers — AS44244 (Irancell), AS57218 (RighTel), and AS197207 (MCCI) — implemented daily Internet “curfews”, generally taking place between 1600 and midnight local time (1230-2030 UTC), although the start times varied on several days. These regular shutdowns are clearly visible in the figure below, and continued into early October.

Sanandaj and Tehran, Iran. (Source: Map data ©2022 Google)

As noted in the blog post, access to DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) services was also blocked in Iran starting on September 20, and in a move that is likely related, connections over HTTP/3 and QUIC were blocked starting on September 22, as shown in the figure below from Cloudflare Radar.

Natural disasters

Natural disasters such as earthquakes and hurricanes wreak havoc on impacted geographies, often causing loss of life, as well as significant structural damage to buildings of all types. Infrastructure damage is also extremely common, with widespread loss of both electrical power and telecommunications infrastructure.

Papua New Guinea

On September 11, a 7.6 magnitude earthquake struck Papua New Guinea, resulting in landslides, cracked roads, and Internet connectivity disruptions. Traffic to the country dropped by 26% just after 1100 local time (0100 UTC). The figure below shows that traffic volumes remained lower into the following day as well. An announcement from PNG DataCo, a local provider, noted that the earthquake “has affected the operations of the Kumul Submarine Cable Network (KSCN) Express Link between Port Moresby and Madang and the PPC-1 Cable between Madang and Sydney.” This damage, they stated, resulted in the observed outage and degraded service.

Mexico

Just over a week later, a 7.6 magnitude earthquake struck the Colima-Michoacan border region in Mexico at 1305 local time (1805 UTC). As shown in the figure below, traffic dropped over 50% in the impacted states immediately after the quake occurred, but recovered fairly quickly, returning to normal levels by around 1600 local time (2100 UTC).

Earthquake epicenter, 35 km SW of Aguililla, Mexico. (Source: Map data ©2022 INEGI)

Hurricane Fiona

Several major hurricanes plowed their way up the east coast of North America in late September, causing significant damage and resulting in Internet disruptions. On September 18, island-wide power outages caused by Hurricane Fiona disrupted Internet connectivity in Puerto Rico. As the figure below illustrates, it took over 10 days for traffic volumes to return to expected levels. Luma Energy, the local power company, kept customers apprised of repair progress through regular updates to its Twitter feed.

Two days later, Hurricane Fiona slammed the Turks and Caicos islands, causing flooding and significant damage, as well as disrupting Internet connectivity. The figure below shows traffic starting to drop below expected levels around 1245 local time (1645 UTC) on September 20. Recovery took approximately a day, with traffic returning to expected levels around 1100 local time (1500 UTC) on September 21.

Continuing to head north, Hurricane Fiona ultimately made landfall in the Canadian province of Nova Scotia on September 24, causing power outages and disrupting Internet connectivity. The figure below shows that the most significant impact was seen in Nova Scotia, and that traffic volumes gradually increased as Nova Scotia Power worked to restore service to customers. By September 29, traffic volumes in the province had returned to normal levels.

Hurricane Ian

On September 28, Hurricane Ian made landfall in Florida, and was the strongest hurricane to hit Florida since Hurricane Michael in 2018. With over four million customers losing power due to damage from the storm, a number of cities experienced associated Internet disruptions. Traffic from impacted cities dropped significantly starting around 1500 local time (1900 UTC), and as the figure below shows, recovery has been slow, with traffic levels still not back to pre-storm volumes more than two weeks later.

Sarasota, Naples, Fort Myers, Cape Coral, North Port, Port Charlotte, Punta Gorda, and Marco Island, Florida. (Source: Map data ©2022 Google, INEGI)

Power outages

In addition to power outages caused by earthquakes and hurricanes, a number of other power outages caused multi-hour Internet disruptions during the third quarter.

Iran

A reported power outage in a key data center building disrupted Internet connectivity for customers of local ISP Shatel in Iran on July 25. As seen in the figure below, traffic dropped significantly at approximately 0715 local time (0345 UTC). Recovery began almost immediately, with traffic nearing expected levels by 0830 local time (0500 UTC).

Venezuela

Electrical issues frequently disrupt Internet connectivity in Venezuela, and the independent @vesinfiltro Twitter account tracks these events closely. One such example occurred on August 9, when electrical issues disrupted connectivity across multiple states, including Mérida, Táchira, Barinas, Portuguesa, and Estado Trujillo. The figure below shows evidence of two disruptions, the first around 1340 local time (1740 UTC) and the second a few hours later, starting at around 1615 local time (2015 UTC). In both cases, traffic volumes appeared to recover fairly quickly.

Mérida, Táchira, Barinas, Portuguesa, and Estado Trujillo, Venezuela. (Source: Map data ©2022 Google, INEGI)

Oman

On September 5, a power outage in Oman impacted energy, aviation, and telecommunications services. The latter is evident in the figure below, which shows the country’s traffic volume dropping nearly 60% when the outage began just before 1515 local time (0915 UTC). Although authorities claimed that “the electricity network would be restored within four hours,” traffic did not fully return to normal levels until approximately 11 hours later, at 0400 local time on September 6 (2200 UTC on September 5).

Ukraine

Over the last seven-plus months of war in Ukraine, we have observed multiple Internet disruptions due to infrastructure damage and power outages related to the fighting. We have covered these disruptions in our first and second quarter summary blog posts, and continue to do so on our @CloudflareRadar Twitter account as they occur. Power outages were behind Internet disruptions observed in Kharkiv on September 11, 12, and 13.

The figure below shows that the first disruption started around 2000 local time (1700 UTC) on September 11. This near-complete outage lasted just over 12 hours, with traffic returning to normal levels around 0830 local time (0530 UTC) on the 12th. However, later that day, another partial outage occurred, with a 50% traffic drop seen at 1330 local time (1030 UTC). This one was much shorter, with recovery starting approximately an hour later. Finally, a nominal disruption is visible at 0800 local time (0500 UTC) on September 13, with lower than expected traffic volumes lasting for around five hours.

Internet disruptions overview for Q3 2022

Cable damage

Damage to both terrestrial and submarine cables has caused many Internet disruptions over the years. The recent alleged sabotage of the sub-sea Nord Stream natural gas pipelines has brought an increasing level of interest from European media (including Swiss and French publications) around just how important submarine cables are to the Internet, and an increasing level of concern among policymakers about the safety of these cable systems and the potential impact of damage to them. However, the three instances of cable damage reviewed below are all related to terrestrial cables.

Iran

On August 1, a reported “fiber optic cable” problem caused by a fire in a telecommunications manhole disrupted connectivity across multiple network providers, including AS31549 (Aria Shatel), AS58224 (TIC), AS43754 (Asiatech), AS44244 (Irancell), and AS197207 (MCCI). The disruption started around 1215 local time (0845 UTC) and lasted for approximately four hours. Because it impacted a number of major wireless and wireline networks, the impact was visible at a country level as well, as seen in the figure below.

Internet disruptions overview for Q3 2022

Pakistan

Cable damage due to heavy rains and flooding caused several Internet disruptions in Pakistan in August. The first notable disruption occurred on August 19, starting around 0700 local time (0200 UTC) and lasting just over six and a half hours. Another significant disruption is visible on August 22, starting at 2250 local time (1750 UTC), with a further drop at 0530 local time (0030 UTC) on the 23rd. This second, more significant drop was brief, lasting only 45 minutes, after which traffic began to recover.

Internet disruptions overview for Q3 2022

Haiti

Amidst protests over fuel price hikes, fiber cuts in Haiti caused Internet outages on multiple network providers. Starting at 1500 local time (1900 UTC) on September 14, traffic on AS27759 (Access Haiti) fell to zero. According to a (translated) Twitter post from the provider, they had several fiber optic cables that were cut in various areas of the country, and blocked roads made it “really difficult” for their technicians to reach the problem areas. Repairs were eventually made, with traffic starting to increase again around 0830 local time (1230 UTC) on September 15, as shown in the figure below.

Internet disruptions overview for Q3 2022

Access Haiti provides AS27774 (Haiti Networking Group) with Internet connectivity (as an “upstream” provider), so the fiber cut impacted their connectivity as well, causing the outage shown in the figure below.

Internet disruptions overview for Q3 2022

Technical problems

As a heading, “technical problems” can be a catch-all, referring to multiple types of issues, including misconfigurations and routing problems. However, it is also sometimes the official explanation given by a government or telecommunications company for an observed Internet disruption.

Rogers

Arguably the most significant Internet disruption so far this year took place on AS812 (Rogers), one of Canada’s largest Internet service providers. At around 0845 UTC on July 8, a near complete loss of traffic was observed, as seen in the figure below.

Internet disruptions overview for Q3 2022

The figure below shows that small amounts of traffic were seen from the network over the course of the outage, but it took nearly 24 hours for traffic to return to normal levels.

Internet disruptions overview for Q3 2022

A notice posted by the Rogers CEO explained that “We now believe we’ve narrowed the cause to a network system failure following a maintenance update in our core network, which caused some of our routers to malfunction early Friday morning. We disconnected the specific equipment and redirected traffic, which allowed our network and services to come back online over time as we managed traffic volumes returning to normal levels.” A Cloudflare blog post covered the Rogers outage in real-time, highlighting related BGP activity and small increases of traffic.

Chad

A four-hour near-complete Internet outage took place in Chad on August 12, occurring between 1045 and 1500 local time (0945 to 1400 UTC). Authorities in Chad said that the disruption was due to a “technical problem” on connections between Sudachad and networks in Cameroon and Sudan.

Internet disruptions overview for Q3 2022

Unknown

In many cases, observed Internet disruptions are attributed to underlying causes thanks to statements by service providers, government officials, or media coverage of an associated event. However, for some disruptions, no published explanation or associated event could be found.

On August 11, a multi-hour outage impacted customers of US telecommunications provider CenturyLink in states including Colorado, Iowa, Missouri, Montana, New Mexico, Utah, and Wyoming, as shown in the figure below. The outage was also visible in a traffic graph for AS209, the associated autonomous system.

Internet disruptions overview for Q3 2022
Internet disruptions overview for Q3 2022

On August 30, a satellite Internet provider suffered a global service disruption lasting from 0630 to 1030 UTC, as seen in the figure below.

Internet disruptions overview for Q3 2022

Conclusion

As part of Cloudflare’s Birthday Week at the end of September, we launched the Cloudflare Radar Outage Center (CROC). The CROC is a section of our new Radar 2.0 site that archives information about observed Internet disruptions. The underlying data that powers the CROC is also available through an API, enabling interested parties to incorporate data into their own tools, sites, and applications. For regular updates on Internet disruptions as they occur and other Internet trends, follow @CloudflareRadar on Twitter.

The status page the Internet needs: Cloudflare Radar Outage Center

Post Syndicated from David Belson original https://blog.cloudflare.com/announcing-cloudflare-radar-outage-center/

The status page the Internet needs: Cloudflare Radar Outage Center

The status page the Internet needs: Cloudflare Radar Outage Center

Historically, Cloudflare has covered large-scale Internet outages with timely blog posts, such as those published for Iran, Sudan, Facebook, and Syria. While we still explore such outages on the Cloudflare blog, throughout 2022 we have ramped up our monitoring of Internet outages around the world, posting timely information about those outages to @CloudflareRadar on Twitter.

The new Cloudflare Radar Outage Center (CROC), launched today as part of Radar 2.0, is intended to be an archive of this information, organized by location, type, date, etc.

Furthermore, this initial release is also laying the groundwork for the CROC to become a first stop and key resource for civil society organizations, journalists/news media, and impacted parties to get information on, or corroboration of, reported or observed Internet outages.

The status page the Internet needs: Cloudflare Radar Outage Center

What information does the CROC contain?

At launch, the CROC includes summary information about observed outage events. This information includes:

  • Location: Where was the outage?
  • ASN: What autonomous system experienced a disruption in connectivity?
  • Type: How broad was the outage? Did connectivity fail nationwide, or at a sub-national level? Did just a single network provider have an outage?
  • Scope: If it was a sub-national/regional outage, what state or city was impacted? If it was a network-level outage, which one?
  • Cause: Insight into the likely cause of the outage, based on publicly available information. Historically, some have been government directed shutdowns, while others are caused by severe weather or natural disasters, or by infrastructure issues such as cable cuts, power outages, or filtering/blocking.
  • Start time: When did the outage start?
  • End time: When did the outage end?

Using the CROC

Radar pages, including the main landing page, include a card displaying information about the most recently observed outage, along with a link to the CROC. The CROC will also be linked from the left-side navigation bar.

The status page the Internet needs: Cloudflare Radar Outage Center

Within the CROC, we have tried to keep the interface simple and easily understandable. Based on the selected time period, the global map highlights locations where Internet outages have been observed, along with a tooltip showing the number of outages observed during that period. Similarly, the table includes information (as described above) about each observed outage, along with a link to more information. The linked information may be a Twitter post, a blog post, or a custom Radar graph.

The status page the Internet needs: Cloudflare Radar Outage Center
The status page the Internet needs: Cloudflare Radar Outage Center

As mentioned in the Radar 2.0 launch blog post, we launched an associated API alongside the new site. Outage information is available through this API as well — in fact, the CROC is built on top of this API. Interested parties, including civil society organizations, data journalists, or others, can use the API to integrate the available outage data with their own data sets, build their own related tools, or even develop a custom interface.

Information about the related API endpoint and how to access it can be found in the Cloudflare API documentation.
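
As a quick illustration, the Python sketch below shows what a minimal query for outage data might look like. It is only a sketch: the endpoint path, query parameters, and response field names used here are assumptions that should be checked against the API documentation linked above, and the token is a placeholder for a real Cloudflare API token with Radar read access.

import json
import urllib.request

# Assumed endpoint and fields; verify against the Cloudflare API documentation.
API_TOKEN = "YOUR_API_TOKEN"  # placeholder for an API token with Radar read access
URL = "https://api.cloudflare.com/client/v4/radar/annotations/outages?limit=10"

request = urllib.request.Request(URL, headers={"Authorization": f"Bearer {API_TOKEN}"})
with urllib.request.urlopen(request) as response:
    payload = json.load(response)

# Print a one-line summary per outage record (field names are assumptions).
for outage in payload.get("result", {}).get("annotations", []):
    print(outage.get("startDate"), outage.get("endDate"), outage.get("locations"))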

We recognize that some users may want to download the whole list of observed outages for local consumption and analysis. They can do so by clicking the “Download CSV” link below the table.

The status page the Internet needs (coming soon)

Today’s launch of the Cloudflare Radar Outage Center is just the beginning, as we plan to improve it over time. This includes increased automation of outage detection, enabling us to publish more timely information through both the API and the CROC tool, which is important for members of the community that track and respond to Internet outages. We are also exploring how we can use synthetic monitoring in combination with other network-level performance and availability information to detect outages of popular consumer and business applications/platforms.

And anyone who uses a cloud platform provider (such as AWS) will know that those companies’ status pages take a surprisingly long time to update when there’s an outage. It’s very common to experience difficulty accessing a service and see hundreds of messages on Twitter and message boards about it being down, only to visit the cloud platform provider’s status page and find everything green and “All systems normal”.

For the last few months we’ve been monitoring the performance of cloud platform providers to see if we can detect when they go down and provide our own, real-time status page for them. We believe we can, and Cloudflare Radar Outage Center will be extended to include cloud service providers and give the Internet the status page it needs.

The status page the Internet needs: Cloudflare Radar Outage Center

If you have questions about the CROC, or suggestions for features that you would like to see, please reach out to us on Twitter at @CloudflareRadar.

Internet disruptions overview for Q2 2022

Post Syndicated from David Belson original https://blog.cloudflare.com/q2-2022-internet-disruption-summary/

Internet disruptions overview for Q2 2022

Internet disruptions overview for Q2 2022

Cloudflare operates in more than 270 cities in over 100 countries, where we interconnect with over 10,000 network providers in order to provide a broad range of services to millions of customers. The breadth of both our network and our customer base provides us with a unique perspective on Internet resilience, enabling us to observe the impact of Internet disruptions. In many cases, these disruptions can be attributed to a physical event, while in other cases, they are due to an intentional government-directed shutdown. In this post, we review selected Internet disruptions observed by Cloudflare during the second quarter of 2022, supported by traffic graphs from Cloudflare Radar and other internal Cloudflare tools, and grouped by associated cause or common geography.

Optic outages

This quarter, we saw the usual complement of damage to both terrestrial and submarine fiber-optic cables, including one that impacted multiple countries across thousands of miles, and another more localized outage that was due to an errant rodent.

Comcast

On April 25, Comcast subscribers in nearly 20 southwestern Florida cities experienced an outage, reportedly due to a fiber cut. The traffic impact of this cut is clearly visible in the graph below, with Cloudflare traffic for these cities dropping to zero between 1915–2050 UTC (1515–1650 local time).

Internet disruptions overview for Q2 2022

Not only did the fiber cut force a significant number of Comcast subscribers offline, but it also impacted the types of traffic observed across Comcast as a whole. The graphs below illustrate the mix of mobile vs desktop clients, as well as IPv4 vs. IPv6 request volume across AS7922, Comcast’s primary autonomous system. During the brief disruption period, the percentage of Comcast traffic from mobile devices increased, while desktop devices dropped, and the percentage of IPv4 traffic dropped, with a corresponding increase in IPv6 traffic share.

Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022

South Africa

On the morning of May 17, Telkom SA, a South African telecommunications provider, tweeted an “important notice” to customers, noting that “Damage to a Fibre cable was detected on the Telkom network around 8:00am on Tuesday, 17 May 2022.” and outlining the impacted services and geographies. The graphs below show the impact to Cloudflare traffic from the Telkom autonomous system in three South African provinces. The top graph shows the impact to traffic in Gauteng, while the lower graph shows the impact in Limpopo and North West. Across all three, traffic falls at 0600 UTC (0800 local time), recovering around 1300 UTC (1500 local time). Telkom SA did not provide any additional information on where the fiber cut occurred or what caused it.

Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022

Venezuela

Although unconfirmed, a fiber cut was suspected to be the cause of an Internet disruption experienced by CANTV subscribers in Venezuela on May 19, the latest of several such incidents affecting that provider. Although the fiber cut reportedly impacted subscribers in multiple states, the most significant impact was measured in Falcón, as shown in the graph below. In this state, traffic dropped precipitously at 1800 UTC (1400 local time), finally recovering approximately 24 hours later.

Internet disruptions overview for Q2 2022

AAE-1 & SMW-5

Just after 1200 UTC on Tuesday, June 7, the Africa-Asia-Europe-1 (AAE-1) and SEA-ME-WE-5 (SMW-5) submarine cables suffered cable cuts, impacting Internet connectivity for millions of Internet users across multiple countries in the Middle East and Africa, as well as thousands of miles away in Asia. Although specific details are sparse, the cable damage reportedly occurred in Egypt – both of the impacted cables land in Abu Talat and Zafarana, which also serve as landing points for a number of other submarine cables.

The Cloudflare Radar graphs below illustrate the impact of these cable cuts across Africa, Asia, and the Middle East. Given that the associated traffic disruption only lasted several hours, the damage to these cables likely occurred on land, after they came ashore. More details on this event can be found in the “AAE-1 & SMW5 cable cuts impact millions of users across multiple countries” blog post.

Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022

Castor canadensis

Finally, on June 13, a beaver was responsible for an outage that impacted Internet users in British Columbia, Canada. According to a published report, a beaver gnawed its way through a tree, causing it to fall on both power lines and a Telus fiber optic cable. The damage to the fiber optic cable affected connectivity for customers in over a dozen communities across British Columbia, including those using CityWest (AS18988), a utility company that uses the Telus cable. In the graph below, the impact of the damage to the fiber optic cable is clearly visible, with no traffic to Cloudflare from CityWest subscribers in British Columbia from 1800 UTC on June 7 until 0310 UTC on June 8 (1100–2010 local time).

Internet disruptions overview for Q2 2022

School’s in, Internet’s out

Nationwide Internet shutdowns have, unfortunately, become a popular approach taken by authoritarian regimes over the past half dozen years to prevent cheating on secondary school exams. It is not clear that this heavy-handed tactic is actually effective in preventing cheating, but the associated damage to the national economies has been estimated to be in the tens to hundreds of millions of US dollars, depending on the duration and frequency of the shutdowns.

This year, governments in Sudan and Syria implemented a number of multi-hour shutdowns in late May into June, while Algeria’s government appears to have resorted to more targeted content blocking. Additional details on these Internet disruptions can be found in the recent “Exam time means Internet disruptions in Syria, Sudan and Algeria” blog post.

Starting on May 30, Syria implemented the first of four nationwide Internet shutdowns, the last of which occurred on June 12, as seen in the graph below. Interestingly, we have observed that these shutdowns tend to be “asymmetric” in nature — that is, inbound traffic (into the country) is disabled, but egress traffic (from the country) remains. One effect of this is visible as spikes in the DNS graph below. During three of the four shutdowns, requests to Cloudflare’s 1.1.1.1 resolver from clients in Syria increased because DNS queries were able to exit the country, but responses couldn’t return, leading to retry floods.

Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022

In Sudan, daily shutdowns were implemented between 0530 and 0830 UTC (0730–1030 local time) from June 11 through June 22, except for June 17. (It isn’t clear why that date was skipped.) The graph below shows that these shutdowns were nationwide, but not complete, as traffic from the country did not drop to zero.

Internet disruptions overview for Q2 2022

In Algeria, exams took place June 12 through June 16. In the past, the country has implemented nationwide shutdowns, but after recognizing the enormous cost to the economy, the government has apparently chosen an alternate tactic this year. The graph below shows nominal drops in country-level traffic during the two times each day that the exams took place—0730–1000 UTC (0830–1100 local time) and 1330–1600 UTC (1430–1700 local time). These drops in traffic are likely more indicative of a content-blocking approach, instead of a broad Internet shutdown.

Internet disruptions overview for Q2 2022

On June 27, the Kurdistan Regional Government in Iraq began to implement twice-weekly (Mondays and Thursdays) multi-hour regional Internet shutdowns, expected to last for a four-week period. The shutdowns are intended to prevent cheating on high school final exams, according to a published report, and are scheduled for 0630–1030 local time (0330–0730 UTC). The graph below shows the impact to traffic from three governorates in Kurdistan, with traffic dropping to near zero in all three areas for the duration of the shutdowns.

Internet disruptions overview for Q2 2022

Government-guided

In addition to shutting down the Internet to prevent cheating on exams, governments have also been known to use shutdowns as a tool to limit or control communication around elections, rallies, protests, etc. During the second quarter, we observed several such shutdowns of note.

On April 10, following the blocking of social networks, VPN providers, and cloud platforms, the government of Turkmenistan implemented a near complete Internet shutdown, starting at 1400 UTC. Apparently related to criticism over the recent presidential election, the disruption lasted nearly 40 hours, as traffic started to return around 0700 UTC on April 12. The graphs below show the impact of the shutdown at a country level, as well as at two major network providers within the country, Telephone Network of Ashgabat CJSC (AS51495) and TurkmenTelecom (AS20661).

Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022

A month and a half later, on May 25, an Internet disruption was observed in Pakistan amid protests led by the country’s former Prime Minister. The disruption lasted only two hours, and was limited in scope — it was not a nationwide shutdown. (Telecom providers claimed that it was due to problems with a web filtering system.) At a national level, the impact of the disruption is visible as a slight drop in traffic.

Internet disruptions overview for Q2 2022

In the cities of Lahore and Karachi, the disruption is visible a little more clearly, as is the rapid recovery in traffic.

Internet disruptions overview for Q2 2022

The impact of the disruption is most evident at a network level, as seen in the graphs below. Cyber Internet Services (AS9541) saw a modest drop in traffic, while Mobilink (AS45669) experienced a near complete outage.

Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022

Closing out the quarter, a communications blackout, including an Internet shutdown, was imposed in Sudan on June 30 as protestors staged rallies against the country’s military leadership. This shutdown follows similar disruptions seen in October 2021 after the military toppled the transitional government and attempted to limit protests, as well as the shutdowns seen earlier in June as the government attempted to prevent cheating on exams. The graphs below show that the shutdown started at 0600 UTC (0800 local time) and initially ended almost 12 hours later at 1740 UTC (1940 local time). Connectivity returned for approximately three hours, with traffic again dropping to near-zero levels around 2040 UTC (2240 local time). This second outage remained active at the end of the day.

As a complete nationwide shutdown, the impact is also visible in the loss of traffic at major local Internet providers including MTN, Sudatel, Kanartel, and Sudanese Mobile Telephone (SDN Mobitel / ZAIN).

Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022

Infrastructure issues

In addition to fiber/cable cuts, as discussed above, problems with other infrastructure, whether due to fires, electrical issues, or maintenance, can also disrupt Internet services.

Around 2030 local time on April 6 (0030 UTC on April 7), a fire erupted at the Costa Sur generation plant, one of the largest power plants in Puerto Rico, resulting in a widespread power outage across the island territory. This island-wide outage caused a significant interruption to Internet services, clearly visible in Cloudflare traffic data. The graph below shows that as the power failed, traffic from Puerto Rico immediately fell by more than half. The regular diurnal pattern remained in place, albeit at lower levels, with traffic returning to “normal levels” three days later. By April 10, Luma Energy reported that it had restored electrical power to 99.7% of its 1.5M customers.

Internet disruptions overview for Q2 2022

The impact of the Internet service disruption is also fairly significant when viewed at a network level. The graphs below show traffic for Datacom Caribe/Claro (AS10396) and Liberty Cablevision of Puerto Rico (AS14638). At Datacom Caribe/Claro, traffic immediately fell by more than half, while Liberty Cablevision traffic declined approximately 85%.

Internet disruptions overview for Q2 2022
Internet disruptions overview for Q2 2022

On the evening of May 3, Swiss telecom provider Swisscom tweeted that there had been an interruption to Internet service following maintenance work. A published report noted that the interruption occurred between 2223–2253 local time (2023–2053 UTC), and the graph below shows a complete loss of traffic, but quick recovery, during that 30-minute window. Beyond citing maintenance work, Swisscom did not provide any additional details about the Internet disruption.

Internet disruptions overview for Q2 2022

Iran

Iran has a history of both nationwide and regional Internet shutdowns, as well as connectivity disruptions due to infrastructure damage.

On May 6, the government disrupted Internet connectivity in Khuzestan province, reportedly in response to mass protests around shortages of bread and water. It was reported that mobile data had been cut off locally, and that fixed connectivity speeds were significantly reduced. Consistent with this, we observed a drop in traffic for Irancell (AS44244) (a mobile network provider) in Khuzestan starting around 1000 UTC, as seen in the graph below.

Internet disruptions overview for Q2 2022

A similar disruption affecting Irancell, occurring amid reports of ongoing protests in the country, was observed on May 12, with lower peak traffic during the day, and a further drop around 1800 UTC.

Internet disruptions overview for Q2 2022

Near-complete Internet outages were observed on multiple Iranian network providers on May 9 between 1300–1440 UTC (1730–1910 local time), as illustrated in the graph below. Impacted providers included Atrin Information & Communications Technology Company (AS39650), AryaSat (AS43343), Ariana Gostar Spadana (AS48309), and Pirooz Leen (AS51759). All of these networks share Fanaptelecom (AS24631) as an upstream provider, which, as the graph shows, was also experiencing an outage. No root cause for the Fanaptelecom outage was available.

Internet disruptions overview for Q2 2022

Mobile provider Mobinnet (AS50810) experienced a multi-hour Internet disruption on May 14, lasting from 1230–1530 UTC (1700–2000 local time). According to a tweet from Mobinnet, the disruption was due to a “widespread cyber attack of foreign origin”.

Internet disruptions overview for Q2 2022

Ukraine

Now more than four months into the war in Ukraine, the Internet continues to be an active battlefield, with ongoing Internet outages in multiple cities and across multiple networks. However, we want to highlight here two similar events observed during the second quarter.

The Russian-occupied city of Kherson experienced a near-complete Internet outage between 1600 UTC (1900 local time) on April 30 and 0430 UTC (0730 local time) on May 4. According to social media posts from Ukraine’s Vice Prime Minister Mykhailo Fedorov and the State Service of Special Communications and Information Protection, the outage was caused by “interruptions of fiber-optic trunk lines and disconnection from the power supply of equipment of operators in the region”. The graph below shows effectively no traffic for Kherson for approximately 24 hours after the disruption began, followed by a nominal amount of traffic for the next several days.

Internet disruptions overview for Q2 2022

Around the time that the nominal amount of traffic returned, we also observed a shift in routing for an IPv4 prefix announced by AS47598 (Khersontelecom). As shown in the table below, prior to the outage, it reached the Internet through several other Ukrainian network providers, including AS12883, AS3326, and AS35213. However, as traffic returned, its routing path now showed a Russian network, AS201776 (Miranda), as the upstream provider. The path through Miranda also includes AS12389 (Rostelecom), which bills itself as “the largest digital services provider in Russia”.

Peer AS | Last Update | AS Path
AS1299 (TWELVE99 Arelion, fka Telia Carrier) | 5/1/2022 16:02:26 | 1299 12389 201776 47598
AS6777 (AMS-IX-RS) | 4/28/2022 11:23:33 | 12883 47598

As the disruption ended on May 4, we observed updates to Khersontelecom’s routing path that enabled it to return to reaching the global Internet through non-Russian upstream providers.

Peer AS | Last Update | AS Path
AS174 (COGENT-174) | 5/4/2022 05:56:27 | 174 3326 3326 3326 47598
AS1273 (CW Vodafone Group PLC) | 5/4/2022 03:11:25 | 1273 12389 201776 47598

Additional details about this outage and re-routing event can be found in the “Tracking shifts in Internet connectivity in Kherson, Ukraine” blog post.

A month later, on May 30, we again observed a significant Internet disruption in Kherson starting at 1435 UTC (1735 local time). And once again, we observed updated routing for Khersontelecom, as it shifted from Ukrainian upstream providers to Russian ones. As of the end of June, the Internet disruption in Kherson and the routing through Russian upstream providers both remain firmly in place, although the loss of traffic has not been nearly as significant as the April/May disruption.

Internet disruptions overview for Q2 2022

Peer AS | Last Update | AS Path
AS4775 (Globe Telecoms) | 5/30/2022 13:56:22 | 4775 1273 12389 201776 47598
AS9002 (RETN-AS) | 5/30/2022 09:58:16 | 9002 3326 47598

Conclusion

This post is by no means an exhaustive review of the Internet outages, shutdowns, and disruptions that have occurred throughout the second quarter. Some were extremely brief or limited in scope, while others were observed but had no known or publicly conjectured underlying cause. Having said that, it is important to bring increased visibility to these events so that the community can share information on what is happening, why it happened, and what the impact was — human, financial, or otherwise.

Follow @CloudflareRadar on Twitter for updates on Internet disruptions as they occur, and find up-to-date information on Internet trends using Cloudflare Radar.

Cloudflare outage on June 21, 2022

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/

Cloudflare outage on June 21, 2022

Introduction

Cloudflare outage on June 21, 2022

Today, June 21, 2022, Cloudflare suffered an outage that affected traffic in 19 of our data centers. Unfortunately, these 19 locations handle a significant proportion of our global traffic. This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations. A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.

Depending on your location in the world you may have been unable to access websites and services that rely on Cloudflare. In other locations, Cloudflare continued to operate normally.

We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.

Background

Over the last 18 months, Cloudflare has been working to convert all of our busiest locations to a more flexible and resilient architecture. In this time, we’ve converted 19 of our data centers to this architecture, internally called Multi-Colo PoP (MCP): Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, Tokyo.

A critical part of this new architecture, which is designed as a Clos network, is an added layer of routing that creates a mesh of connections. This mesh allows us to easily disable and enable parts of the internal network in a data center for maintenance or to deal with a problem. This layer is represented by the spines in the following diagram.

Cloudflare outage on June 21, 2022

This new architecture has provided us with significant reliability improvements, as well as allowing us to run maintenance in these locations without disrupting customer traffic. As these locations also carry a significant proportion of the Cloudflare traffic, any problem here can have a very wide impact, and unfortunately, that’s what happened today.

Incident timeline and impact

In order to be reachable on the Internet, networks like Cloudflare make use of a protocol called BGP. As part of this protocol, operators define policies which decide which prefixes (a collection of adjacent IP addresses) are advertised to peers (the other networks they connect to), or accepted from peers.

These policies have individual components, which are evaluated sequentially. The end result is that any given prefix will either be advertised or not advertised. A change in policy can mean a previously advertised prefix is no longer advertised, known as being “withdrawn”, and those IP addresses will no longer be reachable on the Internet.

Cloudflare outage on June 21, 2022

While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes.

Due to this withdrawal, Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. We have backup procedures for handling such an event and used them to take control of the affected locations.

03:56 UTC: We deploy the change to our first location. None of our locations are impacted by the change, as these are using our older architecture.
06:17: The change is deployed to our busiest locations, but not the locations with the MCP architecture.
06:27: The rollout reached the MCP-enabled locations, and the change is deployed to our spines. This is when the incident started, as this swiftly took these 19 locations offline.
06:32: Internal Cloudflare incident declared.
06:51: First change made on a router to verify the root cause.
06:58: Root cause found and understood. Work begins to revert the problematic change.
07:42: The last of the reverts has been completed. This was delayed as network engineers walked over each other’s changes, reverting the previous reverts, causing the problem to re-appear sporadically.
09:00: Incident closed.

The criticality of these data centers can clearly be seen in the volume of successful HTTP requests we handled globally:

Cloudflare outage on June 21, 2022

Even though these locations are only 4% of our total network, the outage impacted 50% of total requests. The same can be seen in our egress bandwidth:

Cloudflare outage on June 21, 2022

Technical description of the error and how it happened

As part of our continued effort to standardize our infrastructure configuration, we were rolling out a change to standardize the BGP communities we attach to a subset of the prefixes we advertise. Specifically, we were adding informational communities to our site-local prefixes.

These prefixes allow our metals to communicate with each other, as well as connect to customer origins. As part of the change procedure at Cloudflare, a Change Request ticket was created, which includes a dry-run of the change, as well as a stepped rollout procedure. Before it was allowed to go out, it was also peer reviewed by multiple engineers. Unfortunately, in this case, the steps weren’t small enough to catch the error before it hit all of our spines.

The change looked like this on one of the routers:

[edit policy-options policy-statement 4-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 4-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 6-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 6-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;

This was harmless, and just added some additional information to these prefix advertisements. The change on the spines was the following:

[edit policy-options policy-statement AGGREGATES-OUT]
term 6-DISABLED_PREFIXES { ... }
!    term 6-ADV-TRAFFIC-PREDICTOR { ... }
!    term 4-ADV-TRAFFIC-PREDICTOR { ... }
!    term ADV-FREE { ... }
!    term ADV-PRO { ... }
!    term ADV-BIZ { ... }
!    term ADV-ENT { ... }
!    term ADV-DNS { ... }
!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }
[edit policy-options policy-statement AGGREGATES-OUT term 4-ADV-SITE-LOCALS then]
community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;
[edit policy-options policy-statement AGGREGATES-OUT term 6-ADV-SITE-LOCALS then]
community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;

An initial glance at this diff might give the impression that this change is identical to the first one, but unfortunately, that’s not the case. If we focus on one part of the diff, it might become clear why:

!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }

In this diff format, the exclamation marks in front of the terms indicate a re-ordering of the terms. In this case, multiple terms moved up, and two terms were added to the bottom. Specifically, the 4-ADV-SITE-LOCALS and 6-ADV-SITE-LOCALS terms moved from the top to the bottom. These terms were now behind the REJECT-THE-REST term, and as might be clear from the name, this term is an explicit reject:

term REJECT-THE-REST {
    then reject;
} 

As this term is now before the site-local terms, we immediately stopped advertising our site-local prefixes, removing our direct access to all the impacted locations, as well as removing the ability of our servers to reach origin servers.
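
To make the effect of that re-ordering concrete, here is a small, purely illustrative Python model of first-match policy evaluation. It is not Cloudflare tooling or actual router behavior; the term names mirror the diff above and the match logic is a stand-in. It simply shows why moving an unconditional reject ahead of the site-local terms stops those prefixes from being advertised.

def evaluate(terms, prefix):
    # Terms are evaluated in order; the first term whose match function
    # accepts the prefix decides the action (first match wins).
    for name, matches, action in terms:
        if matches(prefix):
            return name, action
    return None, "reject"  # implicit default if nothing matches

def is_site_local(prefix):
    return prefix.startswith("10.")  # stand-in for the real site-local match condition

def match_all(prefix):
    return True  # REJECT-THE-REST matches everything

policy_before = [
    ("4-ADV-SITE-LOCALS", is_site_local, "advertise"),
    ("REJECT-THE-REST", match_all, "reject"),
]
policy_after = [
    ("REJECT-THE-REST", match_all, "reject"),  # re-ordered ahead of the site-local term
    ("4-ADV-SITE-LOCALS", is_site_local, "advertise"),
]

print(evaluate(policy_before, "10.0.0.0/8"))  # ('4-ADV-SITE-LOCALS', 'advertise')
print(evaluate(policy_after, "10.0.0.0/8"))   # ('REJECT-THE-REST', 'reject'): prefix withdrawn

Because evaluation stops at the first matching term, the catch-all reject shadows every term placed after it, which is exactly what happened to the site-local prefixes on the spines.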

On top of the inability to contact origins, the removal of these site-local prefixes also caused our internal load balancing system Multimog (a variation of our Unimog load-balancer) to stop working, as it could no longer forward requests between the servers in our MCPs. This meant that our smaller compute clusters in an MCP received the same amount of traffic as our largest clusters, causing the smaller ones to overload.

Cloudflare outage on June 21, 2022

Remediation and follow-up steps

This incident had widespread impact, and we take availability very seriously. We have identified several areas of improvement and will continue to work on uncovering any other gaps that could cause a recurrence.

Here is what we are working on immediately:

Process: While the MCP program was designed to improve availability, a procedural gap in how we updated these data centers ultimately caused a broader impact in MCP locations specifically. While we did use a stagger procedure for this change, the stagger policy did not include an MCP data center until the final step. Change procedures and automation need to include MCP-specific test and deploy procedures to ensure there are no unintended consequences.

Architecture: The incorrect router configuration prevented the proper routes from being announced, preventing traffic from flowing properly to our infrastructure. Ultimately the policy statement that caused the incorrect routing advertisement will be redesigned to prevent an unintentional incorrect ordering.

Automation: There are several opportunities in our automation suite that would mitigate some or all of the impact seen from this event. Primarily, we will be concentrating on automation improvements that enforce an improved stagger policy for rollouts of network configuration and provide an automated “commit-confirm” rollback. The former enhancement would have significantly lessened the overall impact, and the latter would have greatly reduced the Time-to-Resolve during the incident.

Conclusion

Although Cloudflare has invested significantly in our MCP design to improve service availability, we clearly fell short of our customer expectations with this very painful incident. We are deeply sorry for the disruption to our customers and to all the users who were unable to access Internet properties during the outage. We have already started working on the changes outlined above and will continue our diligence to ensure this cannot happen again.

Exam time means Internet disruptions in Syria, Sudan and Algeria

Post Syndicated from David Belson original https://blog.cloudflare.com/syria-sudan-algeria-exam-internet-shutdown/

Exam time means Internet disruptions in Syria, Sudan and Algeria

Exam time means Internet disruptions in Syria, Sudan and Algeria

It is once again exam time in Syria, Sudan, and Algeria, and with it, we find these countries disrupting Internet connectivity in an effort to prevent cheating on these exams. As they have done over the past several years, Syria and Sudan are implementing multi-hour nationwide Internet shutdowns. Algeria has also taken a similar approach in the past, but this year appears to be implementing more targeted website/application blocking.

Syria

Syria has been implementing Internet shutdowns across the country since 2011, but exam-related shutdowns have only been in place since 2016. In 2021, exams took place between May 31 and June 22, with multi-hour shutdowns observed on each of the exam days.

This year, the first shutdown was observed on May 30, with subsequent shutdowns (to date) seen on June 2, 6, and 12. In the Cloudflare Radar graph below, traffic for Syria drops to zero while the shutdowns are active. According to Internet Society Pulse, several additional shutdowns are expected through June 21. Each takes place between 0200–0530 UTC (0500–0830 local time). According to a published report, the current exam cycle covers more than 500,000 students for basic and general secondary education certificates.

Exam time means Internet disruptions in Syria, Sudan and Algeria

Consistent with shutdowns observed in prior years, Syria is once again implementing them in an asymmetric fashion – that is, inbound traffic is disabled, but egress traffic remains. This is clearly visible in request traffic from Syria to Cloudflare’s 1.1.1.1 DNS resolver. As the graph below shows, queries from clients in Syria are able to exit the country and reach Cloudflare, but responses can’t return, leading to retry floods, visible as spikes in the graph.

Exam time means Internet disruptions in Syria, Sudan and Algeria

Last year, the Syrian Minister of Education noted that, for the first time, encryption and surveillance technologies would be used in an effort to curtail cheating, with an apparent promise to suspend Internet shutdowns in the future if these technologies proved successful.

Sudan

Sudan is also no stranger to nationwide Internet shutdowns, with some lasting for multiple weeks. Over the last several years, Sudan has also implemented Internet shutdowns during secondary school exams in an effort to limit cheating or leaking of exam questions. (We covered the 2021 round of shutdowns in a blog post.)

According to a schedule published by digital rights organization AccessNow, this year’s Secondary Certificate Exams will be taking place in Sudan daily between June 11–22, except June 17. As of this writing, near-complete shutdowns have been observed on June 11, 12, and 13 between 0530-0830 UTC (0730-1030 local time), as seen in the graph below. The timing of these shutdowns aligns with a communication reportedly sent to subscribers of telecommunications services in the country, which stated “In implementation of the decision of the Attorney General, the Internet service will be suspended during the Sudanese certificate exam sessions from 8 in the morning until 11 in the morning.”

Exam time means Internet disruptions in Syria, Sudan and Algeria

It is interesting to note that the shutdown, while nationwide, does not appear to be complete. The graph below shows that Cloudflare continues to see a small volume of HTTP requests from Sudatel during the shutdown periods. This is not completely unusual, as Sudatel may have public sector, financial services, or other types of customers that remain online.

Exam time means Internet disruptions in Syria, Sudan and Algeria

Algeria

Since 2018, Algeria has been shutting down the Internet nationwide during baccalaureate exams, following widespread cheating in 2016 that saw questions leaked online both before and during tests. These shutdowns reportedly cost businesses across the country an estimated 500 million Algerian Dinars (approximately $3.4 million USD) for every hour the Internet was unavailable. In 2021, there were two Internet shutdowns each day that exams took place—the first between 0700–1100 UTC (0800–1200 local time), and the second between 1330–1600 UTC (1430–1700 local time).

This year, more than 700,000 students will sit for the baccalaureate exams between June 12-16.

Perhaps recognizing the economic damage caused by these Internet shutdowns, this year the Algerian Minister of National Education announced that there would be no Internet shutdowns on exam days.

Thus far, it appears that this has been the case. However, it appears that the Algerian government has shifted to a content blocking-based approach, instead of a wide-scale Internet shutdown. The Cloudflare Radar graph below shows two nominal drops in country-level traffic during the two times on June 13 that the exams took place—0730–1000 UTC (0830–1100 local time) and 1330–1600 UTC (1430–1700 local time), similar to last year’s timing.

Exam time means Internet disruptions in Syria, Sudan and Algeria

The disruptions are also visible in traffic graphs for several major Algerian network providers, as shown below.

Exam time means Internet disruptions in Syria, Sudan and Algeria
Exam time means Internet disruptions in Syria, Sudan and Algeria
Exam time means Internet disruptions in Syria, Sudan and Algeria

Analysis of additional Cloudflare data further supports the hypothesis that Algeria is blocking access to specific websites and applications, rather than shutting down the Internet completely.

As described in a previous blog post, Network Error Logging (NEL) is a browser-based reporting system that allows users’ browsers to report connection failures to an endpoint specified by the webpage that failed to load. Below, a graph of NEL reports from browsers in Algeria shows clear spikes during the times (thus far) that the exams have taken place, with report levels significantly lower and more consistent during other times of the day.

Exam time means Internet disruptions in Syria, Sudan and Algeria
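
For reference, NEL is enabled by a pair of HTTP response headers that tell the browser where to send failure reports. The Python snippet below simply prints an example of those headers; the reporting endpoint URL and the policy values are hypothetical placeholders, not Cloudflare’s actual configuration.

# Example NEL configuration expressed as HTTP response headers.
# The endpoint URL and values below are hypothetical placeholders.
nel_headers = {
    "Report-To": '{"group": "network-errors", "max_age": 2592000, '
                 '"endpoints": [{"url": "https://reports.example.com/nel"}]}',
    "NEL": '{"report_to": "network-errors", "max_age": 2592000}',
}

for name, value in nel_headers.items():
    print(f"{name}: {value}")

When a request to a page configured this way fails, the browser later delivers a report describing the failure to the configured endpoint, which is what produces the spikes seen in the graph above.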

Conclusion

In addition to Syria, Sudan, and Algeria, countries including India, Jordan, Iraq, Uzbekistan, and Ethiopia have shut down or limited access to the Internet as exams took place. It is unclear whether these brute-force methods are truly effective at preventing cheating on these exams. However, it is clear that the impact of these shutdowns goes beyond students, as they impose a significant financial cost on businesses within the affected countries as they lose Internet access for multiple hours a day over the course of several weeks.

If you want to follow the remaining scheduled disruptions for these countries, you can see live data on the Cloudflare Radar pages for Syria, Sudan, and Algeria.

AAE-1 & SMW5 cable cuts impact millions of users across multiple countries

Post Syndicated from David Belson original https://blog.cloudflare.com/aae-1-smw5-cable-cuts/

AAE-1 & SMW5 cable cuts impact millions of users across multiple countries

AAE-1 & SMW5 cable cuts impact millions of users across multiple countries

Just after 1200 UTC on Tuesday, June 7, the Africa-Asia-Europe-1 (AAE-1) and SEA-ME-WE-5 (SMW-5) submarine cables suffered cable cuts. The damage reportedly occurred in Egypt, and impacted Internet connectivity for millions of Internet users across multiple countries in the Middle East and Africa, as well as thousands of miles away in Asia. In addition, Google Cloud Platform and OVHcloud reported connectivity issues due to these cable cuts.

The impact

Data from Cloudflare Radar showed significant drops in traffic across the impacted countries as the cable damage occurred, recovering approximately four hours later as the cables were repaired.

AAE-1 & SMW5 cable cuts impact millions of users across multiple countries
AAE-1 & SMW5 cable cuts impact millions of users across multiple countries
AAE-1 & SMW5 cable cuts impact millions of users across multiple countries
AAE-1 & SMW5 cable cuts impact millions of users across multiple countries
AAE-1 & SMW5 cable cuts impact millions of users across multiple countries
AAE-1 & SMW5 cable cuts impact millions of users across multiple countries
AAE-1 & SMW5 cable cuts impact millions of users across multiple countries

It appears that Saudi Arabia may have also been affected by the cable cut(s), but the impact was much less significant, and traffic recovered almost immediately.

AAE-1 & SMW5 cable cuts impact millions of users across multiple countries

In the graphs above, we show that Ethiopia was one of the impacted countries. However, as it is landlocked, there are obviously no submarine cable landing points within the country. The Afterfibre map from the Network Startup Resource Center (NSRC) shows that fiber in Ethiopia connects to fiber in Somalia, which experienced an impact. In addition, Ethio Telecom also routes traffic through network providers in Kenya and Djibouti. Djibouti Telecom, one of these providers, in turn peers with larger global providers like Telecom Italia (TI) Sparkle, which is one of the owners of SMW5.

In addition to impacting end-user connectivity in the impacted countries, the cable cuts also reportedly impacted cloud providers including Google Cloud Platform and OVHcloud. In their incident report, Google Cloud noted “Google Cloud Networking experienced increased packet loss for egress traffic from Google to the Middle East, and elevated latency between our Europe and Asia Regions as a result, for 3 hours and 12 minutes, affecting several related products including Cloud NAT, Hybrid Connectivity and Virtual Private Cloud (VPC). From preliminary analysis, the root cause of the issue was a capacity shortage following two simultaneous fiber-cuts.” OVHcloud noted that “Backbone links between Marseille and Singapore are currently down” and that “Upon further investigation, our Network OPERATION teams advised that the fault was related to our partner fiber cuts.”

When concurrent disruptions like those highlighted above are observed across multiple countries in one or more geographic areas, the culprit is often a submarine cable that connects the impacted countries to the global Internet. The impact of such cable cuts will vary across countries, largely due to the levels of redundancy that they may have in place. That is, are these countries solely dependent on an impacted cable for global Internet connectivity, or do they have redundant connectivity across other submarine or terrestrial cables? Additionally, the location of the country relative to the cable cut will also impact how connectivity in a given country may be affected. Due to these factors, we didn’t see a similar impact across all of the countries connected to the AAE-1 and SMW5 cables.

What happened?

Specific details are sparse, but as noted above, the cable damage reportedly occurred in Egypt – both of the impacted cables land in Abu Talat and Zafarana, which also serve as landing points for a number of other submarine cables. According to a 2021 article in Middle East Eye, “There are 10 cable landing stations on Egypt’s Mediterranean and Red Sea coastlines, and some 15 terrestrial crossing routes across the country.” Alan Mauldin, research director at telecommunications research firm TeleGeography, notes that cables connecting Europe with the Middle East and India are routed via Egypt because it requires crossing the least amount of land. This places the country in a unique position as a choke point for international Internet connectivity, with damage to infrastructure locally impacting the ability of millions of people thousands of miles away to access websites and applications, as well as impacting connectivity for leading cloud platform providers.

As the graphs above show, traffic returned to normal levels within a matter of hours, with tweets from telecommunications authorities in Pakistan and Oman also noting that Internet services had returned to their countries. Such rapid repairs to submarine cable infrastructure are unusual, as repair timeframes are often measured in days or weeks, as we saw with the cables damaged by the volcanic eruption in Tonga earlier this year. This is due to the need to locate the fault, send repair ships to the appropriate location, and then retrieve the cable and repair it. Given this, the damage to these cables likely occurred on land, after they came ashore.

Keeping content available

By deploying in data centers close to end users, Cloudflare helps to keep traffic local, which can mitigate the impact of catastrophic events like cable cuts, while improving performance, availability, and security. Being able to deliver content from our network generally requires first retrieving it from an origin, and with end users around the world, Cloudflare needs to be able to reach origins from multiple points around the world at the same time. However, a customer origin may be reachable from some networks but not from others, due to a cable cut or some other network disruption.

In September 2021, Cloudflare announced Orpheus, which provides reachability benefits for customers by finding unreachable paths on the Internet in real time, and guiding traffic away from those paths, ensuring that Cloudflare will always be able to reach an origin no matter what is happening on the Internet.

Conclusion

Because the Internet is an interconnected network of networks, an event such as a cable cut can have a ripple effect across the whole Internet, impacting connectivity for users thousands of miles away from where the incident occurred. Users may be unable to access content or applications, or the content/applications may suffer from reduced performance. Additionally, the providers of those applications may experience problems within their own network infrastructure due to such an event.

For network providers, the impact of such events can be mitigated through the use of multiple upstream providers/peers, and diverse physical paths for critical infrastructure like submarine cables. Cloudflare’s globally deployed network can help content and application providers ensure that their content and applications remain available and performant in the face of network disruptions.

PIPEFAIL: How a missing shell option slowed Cloudflare down

Post Syndicated from Alex Forster original https://blog.cloudflare.com/pipefail-how-a-missing-shell-option-slowed-cloudflare-down/

PIPEFAIL: How a missing shell option slowed Cloudflare down

PIPEFAIL: How a missing shell option slowed Cloudflare down

At Cloudflare, we’re used to being the fastest in the world. However, for approximately 30 minutes last December, Cloudflare was slow. Between 20:10 and 20:40 UTC on December 16, 2021, web requests served by Cloudflare were artificially delayed by up to five seconds before being processed. This post tells the story of how a missing shell option called “pipefail” slowed Cloudflare down.
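
As a brief primer on the option itself: by default, a shell pipeline’s exit status is the exit status of its last command, so a failure earlier in the pipeline can go unnoticed, while pipefail makes the whole pipeline fail if any stage fails. The Python sketch below is a generic illustration of that behavior (it assumes bash is available on the system) and is not the script involved in this incident.

import subprocess

# Without pipefail, the pipeline's exit status is that of the LAST command,
# so the failure of `false` is masked by the success of `cat`.
default_run = subprocess.run(["bash", "-c", "false | cat"])
print("without pipefail:", default_run.returncode)  # prints 0: the failure is hidden

# With pipefail, the pipeline fails if ANY stage fails.
pipefail_run = subprocess.run(["bash", "-o", "pipefail", "-c", "false | cat"])
print("with pipefail:", pipefail_run.returncode)    # prints 1: the failure is surfaced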

Background

Before we can tell this story, we need to introduce you to some of its characters.

PIPEFAIL: How a missing shell option slowed Cloudflare down

Cloudflare’s Front Line protects millions of users from some of the largest attacks ever recorded. This protection is orchestrated by a sidecar service called dosd, which analyzes traffic and looks for attacks. When dosd detects an attack, it provides Front Line with a list of attack fingerprints that describe how Front Line can match and block the attack traffic.

Instances of dosd run on every Cloudflare server, and they communicate with each other using a peer-to-peer mesh to identify malicious traffic patterns. This decentralized design allows dosd to perform analysis with much higher fidelity than is possible with a centralized system, but its scale also imposes some strict performance requirements. To meet these requirements, we need to provide dosd with very fast access to large amounts of configuration data, which naturally means that dosd depends on Quicksilver. Cloudflare developed Quicksilver to manage configuration data and replicate it around the world in milliseconds, allowing it to be accessed by services like dosd in microseconds.

PIPEFAIL: How a missing shell option slowed Cloudflare down

One piece of configuration data that dosd needs comes from the Addressing API, which is our authoritative IP address management service. The addressing data it provides is important because dosd uses it to understand what kind of traffic is expected on particular IPs. Since addressing data doesn’t change very frequently, we use a simple Kubernetes cron job to query it at 10 minutes past each hour and write it into Quicksilver, allowing it to be efficiently accessed by dosd.

With this context, let’s walk through the change we made on December 16 that ultimately led to the slowdown.

The Change

Approximately once a week, all of our Bug Fixes and Performance Improvements to the Front Line codebase are released to the network. On December 16, the Front Line team released a fix for a subtle bug in how the code handled compression in the presence of a Cache-Control: no-transform header. Unfortunately, the team realized pretty quickly that this fix actually broke some customers who had started depending on that buggy behavior, so the team decided to roll back the release and work with those customers to correct the issue.

PIPEFAIL: How a missing shell option slowed Cloudflare down

Here’s a graph showing the progression of the rollback. While most releases and rollbacks are fully automated, this particular rollback needed to be performed manually due to its urgency. Since this was a manual rollback, SREs decided to perform it in two batches as a safety measure. The first batch went to our smaller tier 2 and 3 data centers, and the second batch went to our larger tier 1 data centers.

SREs started the first batch at 19:25 UTC, and it completed in about 30 minutes. Then, after verifying that there were no issues, they started the second batch at 20:10. That’s when the slowdown started.

The Slowdown

Within minutes of starting the second batch of rollbacks, alerts started firing. “Traffic levels are dropping.” “CPU utilization is dropping.” “A P0 incident has been automatically declared.” The timing could not be a coincidence. Somehow, a deployment of known-good code, which had been limited to a subset of the network and which had just been successfully performed 40 minutes earlier, appeared to be causing a global problem.

A P0 incident is an “all hands on deck” emergency, so dozens of Cloudflare engineers quickly began to assess impact to their services and test their theories about the root cause. The rollback was paused, but that did not fix the problem. Then, approximately 10 minutes after the start of the incident, my team – the DOS team – received a concerning alert: “dosd is not running on numerous servers.” Before that alert fired we had been investigating whether the slowdown was caused by an unmitigated attack, but this required our immediate attention.

Based on service logs, we were able to see that dosd was panicking because the customer addressing data in Quicksilver was corrupted in some way. Remember: the data in this Quicksilver key is important. Without it, dosd could not make correct choices anymore, so it refused to continue.

Once we realized that the addressing data was corrupted, we had to figure out how it was corrupted so that we could fix it. The answer turned out to be pretty obvious: the Quicksilver key was completely empty.

Following the old adage – “did you try restarting it?” – we decided to manually re-run the Kubernetes cron job that populates this key and see what happened. At 20:40 UTC, the cron job was manually triggered. Seconds after it completed, dosd started running again, and traffic levels began returning to normal. We confirmed that the Quicksilver key was no longer empty, and the incident was over.

The Aftermath

Despite fixing the problem, we still didn’t really understand what had just happened.

Why was the Quicksilver key empty?

It was urgent that we quickly figure out how an empty value was written into that Quicksilver key, because for all we knew, it could happen again at any moment.

We started by looking at the Kubernetes cron job, which turned out to have a bug:

PIPEFAIL: How a missing shell option slowed Cloudflare down

This cron job is implemented using a small Bash script. If you’re unfamiliar with Bash (particularly shell pipelining), here’s what it does:

First, the dos-make-addr-conf executable runs. Its job is to query the Addressing API for various bits of JSON data and serialize them into a TOML document. Afterward, that TOML is “piped” as input into the dosctl executable, whose job is simply to write it into a Quicksilver key called template_vars.
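
To make the failure mode concrete, here is a minimal sketch of such a pipeline. The executable names come from the description above, but the schedule comment and the dosctl arguments are illustrative assumptions, not the actual script:

#!/bin/bash
#
# Illustrative sketch only. Runs as a Kubernetes cron job at 10 minutes past
# each hour (cron schedule "10 * * * *"); the dosctl arguments are hypothetical.
#
# Query the Addressing API, serialize the result as TOML, and write that TOML
# into the Quicksilver key "template_vars".
dos-make-addr-conf | dosctl set template_vars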

Can you spot the bug? Here’s a hint: what happens if dos-make-addr-conf fails for some reason and exits with a non-zero error code? It turns out that, by default, a shell pipeline only reports the exit status of its last command, so the failure is silently ignored. This means that the output of dos-make-addr-conf (which could be empty) gets unconditionally piped into dosctl and used as the value of the template_vars key, regardless of whether dos-make-addr-conf succeeded or failed.

Shell users have been burned by this behavior for decades, which is why an option called “pipefail” was eventually introduced. Enabling this option changes the shell’s behavior so that a pipeline reports failure whenever any command in it fails, rather than only reporting the exit status of the last command. However, this option is not enabled by default, so it’s widely recommended as best practice that all scripts start by enabling it (and a few other safety options).

Here’s the fixed version of that cron job:

PIPEFAIL: How a missing shell option slowed Cloudflare down
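
A sketch of a safer version follows, using the same hypothetical names as above. It shows one way to structure the fix, not necessarily the exact script Cloudflare shipped:

#!/bin/bash
#
# -e: abort the script on any command failure, -u: treat unset variables as
# errors, -o pipefail: a pipeline reports failure if any command in it fails.
set -euo pipefail

# Stage the output in a temporary file, so dosctl only ever sees data from a
# run of dos-make-addr-conf that completed successfully.
tmpfile="$(mktemp)"
trap 'rm -f "$tmpfile"' EXIT

dos-make-addr-conf > "$tmpfile"
dosctl set template_vars < "$tmpfile"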

This bug was particularly insidious because dosd actually did attempt to gracefully handle the case where this Quicksilver key contained invalid TOML. However, an empty string is a perfectly valid TOML document. If an error message had been accidentally written into this Quicksilver key instead of an empty string, then dosd would have rejected the update and continued to use the previous value.

Why did that cause the Front Line to slow down?

We had figured out how an empty key could be written into Quicksilver, and we were confident that it wouldn’t happen again. However, we still needed to untangle how that empty key caused such a severe incident.

As I mentioned earlier, the Front Line relies on dosd to tell it how to mitigate attacks, but it doesn’t depend on dosd directly to serve requests. Instead, once every few seconds, the Front Line asynchronously asks dosd for new attack fingerprints and stores them in an in-memory cache. This cache is consulted while serving each request, and if dosd ever fails to provide fresh attack fingerprints, then the stale fingerprints will continue to be used instead. So how could this have caused the impact that we saw?

PIPEFAIL: How a missing shell option slowed Cloudflare down

As part of the rollback process, the Front Line’s code needed to be reloaded. Reloading this code implicitly flushed the in-memory caches, including the attack fingerprint data from dosd. The next time that a request tried to consult with the cache, the caching layer realized that it had no attack fingerprints to return and a “cache miss” happened.

To handle a cache miss, the caching layer tried to reach out to dosd, and this is when the slowdown happened. While the caching layer was waiting for dosd to reply, it blocked all pending requests from progressing. Since dosd wasn’t running, the attempt eventually timed out after five seconds when the caching layer gave up. But in the meantime, each pending request was stuck waiting for the timeout to happen. Once it did, all the pending requests that were queued up over the five-second timeout period became unblocked and were finally allowed to progress. This cycle repeated over and over again every five seconds on every server until the dosd failure was resolved.

To trigger this slowdown, not only did dosd have to fail, but the Front Line’s in-memory cache had to also be flushed at the same time. If dosd had failed, but the Front Line’s cache had not been flushed, then the stale attack fingerprints would have remained in the cache and request processing would not have been impacted.

Why didn’t the first rollback cause this problem?

These two batches of rollbacks were performed by forcing servers to run a Salt highstate. When each batch was executed, thousands of servers began running highstates at the same time. The highstate process involves, among other things, contacting the Addressing API in order to retrieve various bits of customer addressing information.

The first rollback started at 19:25 UTC, and the second rollback started 45 minutes later at 20:10. Remember how I mentioned that our Kubernetes cron job only runs on the 10th minute of every hour? At 20:10 – exactly the time that our cron job started executing – thousands of servers also began to highstate, flooding the Addressing API with requests. All of these requests were queued up and eventually served, but it took the Addressing API a few minutes to work through the backlog. This delay was long enough to cause our cron job to time out, and, due to the “pipefail” bug, inadvertently clobber the Quicksilver key that it was responsible for updating.

To trigger the “pipefail” bug, not only did we have to flood the Addressing API with requests, we also had to do it at exactly 10 minutes after the hour. If SREs had started the second batch of rollbacks a few minutes earlier or later, this bug would have continued to lie dormant.

Lessons Learned

This was a unique incident where a chain of small or unlikely failures cascaded into a severe and painful outage that we deeply regret. In response, we have hardened each link in the chain:

  • A manual rollback inadvertently triggered the thundering herd problem, which overwhelmed the Addressing API. We have since significantly scaled out the Addressing API, so that it can handle high request rates if it ever again has to.
  • An error in a Kubernetes cron job caused invalid data to be written to Quicksilver. We have since made sure that, when this cron job fails, it is no longer possible for that failure to clobber the Quicksilver key.
  • dosd did not correctly handle all possible error conditions when loading configuration data from Quicksilver, causing it to fail. We have since taken these additional conditions into account where necessary, so that dosd will gracefully degrade in the face of corrupt Quicksilver data.
  • The Front Line had an unexpected dependency on dosd, which caused it to fail when dosd failed. We have since removed all such dependencies, and the Front Line will now gracefully survive dosd failures.

More broadly, this incident has served as an example to us of why code and systems must always be resilient to failure, no matter how unlikely that failure may seem.

DNSSEC issues take Fiji domains offline

Post Syndicated from David Belson original https://blog.cloudflare.com/dnssec-issues-fiji/

DNSSEC issues take Fiji domains offline

DNSSEC issues take Fiji domains offline

On the morning of March 8, a post to Hacker News stated that “All .fj domains have gone offline”, listing several hostnames in domains within Fiji’s country code top-level domain (ccTLD) that had become unreachable. Commenters in the associated discussion thread had mixed results in being able to reach .fj hostnames—some were successful, while others saw failures. The fijivillage news site also highlighted the problem, noting that the issue also impacted Vodafone’s M-PAiSA app/service, preventing users from completing financial transactions.

The impact of this issue can be seen in traffic to Cloudflare customer zones in the .com.fj second-level domain. The graph below shows that HTTP traffic to these zones dropped by approximately 40% almost immediately starting around midnight UTC on March 8. Traffic volumes continued to decline throughout the rest of the morning.

DNSSEC issues take Fiji domains offline

Looking at Cloudflare’s 1.1.1.1 resolver data for queries for .com.fj hostnames, we can also see that error volume associated with those queries climbs significantly starting just after midnight as well. This means that our resolvers encountered issues with the answers from .fj servers.

DNSSEC issues take Fiji domains offline

This observation suggests that the problem was strictly DNS related, rather than connectivity related—Cloudflare Radar does not show any indication of an Internet disruption in Fiji coincident with the start of this problem.

DNSSEC issues take Fiji domains offline

It was suggested within the Hacker News comments that the problem could be DNSSEC related. Upon further investigation, that appears to be the cause. When verifying DNSSEC for the .fj ccTLD, as shown in the kdig output below, we see the extended error EDE: 9 (DNSKEY Missing): 'no SEP matching the DS found for fj.'

kdig fj. soa +dnssec @1.1.1.1 
;; ->>HEADER<<- opcode: QUERY; status: SERVFAIL; id: 12710
;; Flags: qr rd ra; QUERY: 1; ANSWER: 0; AUTHORITY: 0; ADDITIONAL: 1
 
;; EDNS PSEUDOSECTION:
;; Version: 0; flags: do; UDP size: 1232 B; ext-rcode: NOERROR
;; EDE: 9 (DNSKEY Missing): 'no SEP matching the DS found for fj.'
 
;; QUESTION SECTION:
;; fj.                          IN      SOA
 
;; Received 73 B
;; Time 2022-03-08 08:57:41 EST
;; From 1.1.1.1@53(UDP) in 17.2 ms

Extended DNS Error 9 (EDE: 9) is defined as “A DS record existed at a parent, but no supported matching DNSKEY record could be found for the child.” The Cloudflare Learning Center article on DNSKEY and DS records explains this relationship:

The DS record is used to verify the authenticity of child zones of DNSSEC zones. The DS key record on a parent zone contains a hash of the KSK in a child zone. A DNSSEC resolver can therefore verify the authenticity of the child zone by hashing its KSK record, and comparing that to what is in the parent zone’s DS record.
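
For readers who want to check that relationship by hand, it can be done with standard DNS tooling. The commands below are illustrative (dnssec-dsfromkey ships with BIND):

# The DS record for .fj is published in the root (parent) zone; its first
# field is the key tag of the KSK that the root expects .fj to be using.
dig DS fj. +short @1.1.1.1

# The DNSKEY records are served by the .fj zone itself. +cd ("checking
# disabled") skips DNSSEC validation, so the query succeeds even while the
# chain of trust is broken.
dig DNSKEY fj. +cd @1.1.1.1

# dnssec-dsfromkey computes DS records from DNSKEYs, so the two outputs can be
# compared directly; during the incident, no computed DS matched the one
# published in the root zone.
dig DNSKEY fj. +cd @1.1.1.1 | dnssec-dsfromkey -f - fj.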

Ultimately, it appears that around midnight UTC, the .fj zone started to be signed with a key that did not match the DS record in the root zone, possibly as the result of a scheduled key rollover performed without first confirming that the root zone had been updated by IANA. (IANA owns the relationship with the TLD operators, and instructs the Root Zone Publisher on the changes to make in the next version of the root zone.)

DNSSEC problems as the root cause of the observed issue align with the observation in the Hacker News comments that some were able to access .fj websites, while others were not. Users behind resolvers doing strict DNSSEC validation would have seen an error in their browser, while users behind less strict resolvers would have been able to access the sites without a problem.
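
The difference is easy to reproduce from the command line; www.example.com.fj below is a hypothetical hostname used purely for illustration:

# During the incident, a validating resolver failed closed on names under .fj:
dig www.example.com.fj A @1.1.1.1        # would have returned SERVFAIL

# With the CD (checking disabled) bit set, roughly what users behind
# non-validating resolvers experienced, the same query returns an answer:
dig www.example.com.fj A +cd @1.1.1.1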

Conclusion

Further analysis of Cloudflare resolver metrics indicates that the problem was resolved around 1400 UTC, when the DS was updated. When DNSSEC is improperly configured for a single domain name, it can cause problems accessing websites or applications in that zone. However, when the misconfiguration occurs at a ccTLD level, the impact is much more significant. Unfortunately, this seems to occur all too often.

(Thank you to Ólafur Guðmundsson for his DNSSEC expertise.)

Internet is back in Tonga after 38 days of outage

Post Syndicated from João Tomé original https://blog.cloudflare.com/internet-is-back-in-tonga-after-38-days-of-outage/

Internet is back in Tonga after 38 days of outage

Internet is back in Tonga after 38 days of outage

Tonga, the South Pacific archipelago nation (with 169 islands), was reconnected to the Internet early this morning (UTC) and is back online after successful repairs to the undersea cable that was damaged on Saturday, January 15, 2022, by the January 14 volcanic eruption.

After 38 days without full access to the Internet, Cloudflare Radar shows that a little after midnight (UTC) — it was around 13:00 local time — on February 22, 2022, Internet traffic in Tonga started to increase to levels similar to those seen before the eruption.

Internet is back in Tonga after 38 days of outage

The faded line shows what was normal in Tonga at the start of the year, and the dark blue line shows the evolution of traffic in the last 30 days. Digicel, Tonga’s main ISP, announced at 02:13 UTC that “data connectivity has been restored on the main island Tongatapu and Eua after undersea submarine cable repairs”.

When we expand the view to the previous 45 days, we can see more clearly how Internet traffic evolved before the volcanic eruption and after the undersea cable was repaired.

Internet is back in Tonga after 38 days of outage

The repair ship Reliance took 20 days to replace a 92 km (57 mile) section of the 827 km submarine fiber optical cable that connects Tonga to Fiji and international networks and had “multiple faults and breaks due to the volcanic eruption”, according to Digicel.

Tonga Cable chief executive James Panuve told Reuters that people on the main island “will have access almost immediately”, and that was what we saw on Radar with a large increase in traffic persisting.

Internet is back in Tonga after 38 days of outage

The residual traffic we saw from Tonga a few days after January 15, 2022, comes from satellite services that were used with difficulty by some businesses.

James Panuve also highlighted that the undersea work is still being finished to repair the domestic cable connecting the main island of Tongatapu with outlying islands that were worst hit by the tsunami, which, he told Reuters, could take six to nine months more.

So, for some of the people who live on the 36 inhabited islands, normal use of the Internet could take a lot longer. Tonga has a population of around 105,000, 70% of whom reside on the main island, Tongatapu, and around 5% (5,000) live on the nearby island of Eua (now also connected to the Internet).

Telecommunication companies in neighboring Pacific islands, particularly New Caledonia, provided lengths of cable when Tonga ran out, said Panuve.

A world of undersea cables for the world’s communications

We have mentioned before, for example in our first blog post about the Tonga outage, how important undersea cables are to global Internet traffic, most of which is carried by a complex network of cables connecting countries and continents.

The full submarine cable system (the first communications cables laid were from the 1850s and carried telegraphy traffic) is what makes most of the world’s Internet function between countries and continents. There are 428 active submarine cables (36 are planned), spanning an estimated 1.3 million km around the globe.

Internet is back in Tonga after 38 days of outage
World map of submarine cables. Antarctica is the only continent not yet reached by a submarine telecommunications cable. Source: TeleGeography (www.submarinecablemap.com)

The reliability of submarine cables is high, especially when multiple paths are available in the event of a cable break. That wasn’t the case for the Tonga outage, given that the 827 km submarine cable only connects Fiji to the Tonga archipelago — Fiji is connected to the main Southern Cross Cable, as the next image illustrates.

Internet is back in Tonga after 38 days of outage
Submarine Cable Map shows the undersea cables that connect Australia to Fiji and the following connections to other archipelagos like Tonga. Source: TeleGeography (www.submarinecablemap.com)

In a recent conversation on a Cloudflare TV segment, we discussed the importance of undersea cables with Tom Paseka, a Network Strategist who is celebrating 10 years at Cloudflare and previously worked for undersea cable companies in Australia. Here’s a clip:

Internet outage in Yemen amid airstrikes

Post Syndicated from João Tomé original https://blog.cloudflare.com/internet-outage-in-yemen-amid-airstrikes/

Internet outage in Yemen amid airstrikes

The early hours of Friday, January 21, 2022, started in Yemen with a country-wide Internet outage. According to local and global news reports, airstrikes are happening in the country, and the outage is likely related, as there are reports that a telecommunications building in Al-Hudaydah, where the FALCON undersea cable lands, was hit.

Cloudflare Radar shows that Internet traffic dropped close to zero between 21:30 and 22:00 UTC on January 20, 2022 (around 01:00 local time on January 21).

Internet outage in Yemen amid airstrikes

The outage affected the main state-owned ISP, Public Telecommunication Corporation (AS30873 in blue in the next chart), which represents almost all the Internet traffic in the country.

Internet outage in Yemen amid airstrikes

Looking at BGP (Border Gateway Protocol) updates from Yemen’s ASNs around the time of the outage, we see a clear spike at around 21:55 UTC on January 20, 2022, when the main ASN was affected. These update messages are BGP signalling that Yemen’s main ASN was no longer routable, similar to what we saw happening in The Gambia and Kazakhstan, but for very different reasons.

Internet outage in Yemen amid airstrikes

So far, 2022 has started with a few significant Internet disruptions for different reasons:

1. An Internet outage in The Gambia because of a cable problem.
2. An Internet shutdown in Kazakhstan because of unrest.
3. A mobile Internet shutdown in Burkina Faso because of a coup plot.
4. An Internet outage in Tonga because of a volcanic eruption (still ongoing).

You can keep an eye on Cloudflare Radar to monitor this situation as it unfolds.

Tonga’s likely lengthy Internet outage

Post Syndicated from João Tomé original https://blog.cloudflare.com/tonga-internet-outage/

Tonga’s likely lengthy Internet outage

2022 is only 19 days old, but so far this January there have already been four significant Internet disruptions:

1. An Internet outage in The Gambia because of a cable problem.
2. An Internet shutdown in Kazakhstan because of unrest.
3. A mobile Internet shutdown in Burkina Faso because of a coup plot.
4. An Internet outage in Tonga because of a volcanic eruption.

The latest Internet outage, in the South Pacific country of Tonga (with 169 islands), is still ongoing. It started with the large eruption of Hunga Tonga–Hunga Haʻapai, an uninhabited volcanic island of the Tongan archipelago, on Friday, January 14, 2022. Cloudflare Radar shows that the Internet outage started the next day, Saturday, January 15, 2022, at around 03:00 UTC (16:00 local time) and has now been ongoing for more than four days. Tonga’s 105,000 residents are almost entirely unreachable, according to the BBC.

Tonga’s likely lengthy Internet outage

When we focus on the number of requests by ASN, the country’s main ISPs, Digicel and Kalianet, started to lose traffic after 03:00 UTC, and by 05:30 UTC on January 15, 2022, Cloudflare saw close to no traffic at all from them, as shown in the graph below.

Tonga’s likely lengthy Internet outage

Looking at the BGP (Border Gateway Protocol) updates from Tonga’s ASNs around the time of the outage, we see a clear spike at 05:35 UTC (18:35 local time). These update messages are BGP signalling that the Tongan ASNs are no longer routable. We saw the same trend in The Gambia outage of January 4, 2022 — there you can read about the importance of BGP as a mechanism to exchange routing information between autonomous systems on the Internet, something that was also seen in the 2021 Facebook outage.

Tonga’s likely lengthy Internet outage
BGP updates from Tongan ASNs around the time of the outage.

Cloudflare Radar data doesn’t show any significant disruptions for Internet traffic in Tonga’s neighbours American Samoa (although there was a small decrease in traffic on Friday and Saturday, January 14 and 15, 2022 in comparison with the previous week) and Fiji. In American Samoa, all schools were closed on Friday, January 14, because of severe weather, and on the same day, after the volcanic eruption, there were tsunami warnings and evacuation to higher ground was advised (that continued through the weekend).

Tonga, as a geographically remote Polynesian country more than 800 km from the Fiji archipelago, is highly dependent on the Internet for communications. That connectivity was improved five years ago with an infrastructure program from the World Bank. Prior to that, the country depended on satellite links that reached only a very small percentage of the population.

Repairs could take a few weeks

Southern Cross Cable Network confirmed that the 827 km fiber-optic undersea communications cable connecting Tonga to the outside world may have been broken. The company is assisting Tonga Cable Limited (TCL), which owns the single cable that provides Internet access and almost all communications to and from the archipelago.

The eruption resulted in a fault in the international cable 37 kilometres from Nukuʻalofa (Tonga’s capital), and a further fault in a domestic cable 47 km from the capital.

TCL announced that it has already met with the US cable company SubCom to start preparations for SubCom’s cable repair ship Reliance to be dispatched from Papua New Guinea to Tonga, possibly via Samoa (more than 4,000 km away).

The repairs could take “at least” four weeks, given that a repair to a fiber-optic cable that has been cut on the seabed is considered more complicated than misconfigurations, power outages or other types of infrastructure damage. “The site conditions in Tonga have to be assessed thoroughly because of volcanic activities,” according to TCL chairman Samiuela Fonua.

Fonua also mentioned that the last cable cut (back in 2019) took nearly two weeks to repair, but this time the site conditions will determine the time it will take — the two cables are not far away from the eruption site (the volcano is still active). According to ZDNet, in 2019 Tonga signed a 15-year deal with Kacific for satellite connectivity, but since then the satellite provider says it is waiting on the Tongan government to activate its contract.

Svalbard Undersea Cable System also disrupted in January

Also in January, Space Norway, the operator of the world’s most northern submarine cable — the Svalbard Undersea Cable System — announced that on January 7 it located a disruption in one of the two twin submarine fiber optic communication cables connecting Longyearbyen with Andøya north of Harstad in northern Norway (in the area where the seabed goes from 300 meters down to 2,700 meters in the Greenland Sea). A repair mission is being planned.

A world of undersea cables for the world’s communications

A significant amount of Internet traffic is carried by a complex network of undersea fiber-optic cables that connect countries and continents. The full submarine cable system (the first communications cables laid were from the 1850s and carried telegraphy traffic) is what makes most of the world’s Internet function between countries and continents. There are 428 active submarine cables (36 are planned), spanning an estimated 1.3 million km around the globe.

Tonga’s likely lengthy Internet outage
World map of submarine cables. Antarctica is the only continent not yet reached by a submarine telecommunications cable. Source: TeleGeography (www.submarinecablemap.com)

This gives a sense that the Internet is literally a network of networks, in a world where an estimated 99% of the data traffic crossing oceans is carried by these undersea cables (satellite Internet, so far, is still residual — SpaceX has around 145,000 users).

The reliability of submarine cables is high, especially when multiple paths are available in the event of a cable break. That’s not the case for the Tonga outage, given that the 827 km submarine cable only connects Fiji to the Tonga archipelago — Fiji is connected to the main Southern Cross Cable, as the next image illustrates.

Tonga’s likely lengthy Internet outage
Submarine Cable Map shows the undersea cables that connect Australia to Fiji and the following connections to other archipelagos like Tonga. Source: TeleGeography (www.submarinecablemap.com)


The total carrying capacity of submarine cables is enormous (EllaLink, the optical submarine cable linking the European and South American continents, for example, has 100 Tbps capacity) and grows year after year as the world gets more and more connected. For example, Google has recently finished a new cable with 350 Tbps of capacity. But, a transoceanic submarine cable system costs several hundred million dollars to construct. One of the latest, between Portugal and Egypt, with a total of 8,700 kilometers, is budgeted at 326 million euros.

The Tonga outage was not the only one of 2022 (so far) caused by cable problems. The Gambia outage that affected the country’s main ISP, Gamtel, was due to “a primary link failure at ACE”, the cable system that serves 24 countries from Europe to Africa, specifically at the point where the cable connects Senegal to The Gambia.

Although these two fiber cable problems happened within a few days of each other at the start of 2022, Internet outages are more commonly caused by misconfigurations, power outages, extreme weather, or state-imposed shutdowns to deal with unrest, elections or exams — as was recently the case in Sudan and Kazakhstan.

Sudan: seven days without Internet access (and counting)

Post Syndicated from João Tomé original https://blog.cloudflare.com/sudan-seven-days-without-internet-access-and-counting/

Sudan: seven days without Internet access (and counting)

Sudan: seven days without Internet access (and counting)

It’s not every day that there is no Internet access in an entire country. In the case of Sudan, it has been seven days without Internet after political turmoil that started last Monday, October 25, 2021 (as we described).

The outage continues with almost a flat line and just a trickle of Internet traffic from Sudan. Cloudflare Radar shows that the Internet in Sudan is still almost completely cut off.

Sudan: seven days without Internet access (and counting)

There was a blip of traffic on Tuesday at ~14:00 UTC, for about one hour, but it flattened out again, and it continues like that — anyone can track the evolution on the Sudan page of Cloudflare Radar.

Sudan: seven days without Internet access (and counting)

Internet shutdowns are not that rare

Internet disruptions, including shutdowns and social media restrictions, are common occurrences in some countries, and according to Human Rights Watch, Sudan is one where this happens more frequently than in most. In our June blog, we talked about Sudan when the country decided to shut down the Internet to prevent cheating in exams, but there have been past situations more similar to this days-long shutdown — something that usually happens when there’s political unrest.

The country’s longest recorded network disruption was back in 2018, when Sudanese authorities cut off access to social media (and messaging apps like WhatsApp) for 68 consecutive days from December 21, 2018 to February 26, 2019. There was a full mobile Internet shutdown reported from June 3 to July 9, 2019 that lasted 36 days.

You can keep an eye on Cloudflare Radar to monitor how we see the Internet traffic globally and in every country.

Sudan woke up without Internet

Post Syndicated from Celso Martinho original https://blog.cloudflare.com/sudan-woke-up-without-internet/

Sudan woke up without Internet

Sudan woke up without Internet

Today, October 25, following political turmoil, Sudan woke up without Internet access.

In our June blog, we talked about Sudan when the country decided to shut down the Internet to prevent cheating in exams.

Now, the disruption seems to be for other reasons. AP is reporting that “military forces … detained at least five senior Sudanese government figures”. This afternoon (UTC), several media outlets confirmed that Sudan’s military dissolved the transitional government in a coup that shut down mobile phone networks and Internet access.

Cloudflare Radar allows anyone to track Internet traffic patterns around the world. The dedicated page for Sudan clearly shows that this Monday, when the country was waking up, the Internet traffic went down and continued that trend through the afternoon (16:00 local time, 14:00 UTC).

Sudan woke up without Internet

We dug a little deeper into the HTTP traffic data. It usually starts increasing after 06:00 local time (04:00 UTC). But this Monday morning, traffic was flat, and the trend continued into the afternoon (there were no signs of the Internet coming back as of 18:00 local time).

Sudan woke up without Internet

When comparing today with the last seven days’ pattern, we see that today’s drop is abrupt and unusual.

Sudan woke up without Internet

We can see the same pattern when looking at HTTP traffic by ASN (Autonomous Systems Number). The shutdown affects all the major ISPs from Sudan.

Sudan woke up without Internet

Two weeks ago, we compared mobile traffic worldwide using Cloudflare Radar, and Sudan was one of the most mobile-friendly countries on the planet, with 83% of Internet traffic coming from mobile devices. Today, both mobile and desktop traffic was disrupted.

Sudan woke up without Internet

Using Cloudflare Radar, we can also see a change in Layer 3&4 DDoS attacks because of the lack of data.

Sudan woke up without Internet

You can keep an eye on Cloudflare Radar to monitor how we see the Internet traffic globally and in every country.

What happened on the Internet during the Facebook outage

Post Syndicated from Celso Martinho original https://blog.cloudflare.com/during-the-facebook-outage/

What happened on the Internet during the Facebook outage

It’s been a few days now since Facebook, Instagram, and WhatsApp went AWOL and experienced one of the longest and roughest downtime periods in their existence.

When that happened, we reported our bird’s-eye view of the event and posted the blog Understanding How Facebook Disappeared from the Internet where we tried to explain what we saw and how DNS and BGP, two of the technologies at the center of the outage, played a role in the event.

In the meantime, more information has surfaced, and Facebook has published a blog post giving more details of what happened internally.

As we said before, these events are a gentle reminder that the Internet is a vast network of networks, and we, as industry players and end-users, are part of it and should work together.

In the aftermath of an event of this size, we don’t waste much time debating how peers handled the situation. We do, however, ask ourselves the more important questions: “How did this affect us?” and “What if this had happened to us?” Asking and answering these questions whenever something like this happens is a great and healthy exercise that helps us improve our own resilience.

Today, we’re going to show you how the Facebook and affiliate sites downtime affected us, and what we can see in our data.

1.1.1.1

1.1.1.1 is a fast and privacy-centric public DNS resolver operated by Cloudflare, used by millions of users, browsers, and devices worldwide. Let’s look at our telemetry and see what we find.

First, the obvious. If we look at the response rate, there was a massive spike in the number of SERVFAIL codes. SERVFAILs can happen for several reasons; we have an excellent blog called Unwrap the SERVFAIL that you should read if you’re curious.

In this case, we started serving SERVFAIL responses to all facebook.com and whatsapp.com DNS queries because our resolver couldn’t access the upstream Facebook authoritative servers. We served about 60 times more SERVFAIL responses than the average on a typical day.

What happened on the Internet during the Facebook outage

If we look at all the queries, not specific to Facebook or WhatsApp domains, and we split them by IPv4 and IPv6 clients, we can see that our load increased too.

As explained before, this is due to a snowball effect associated with applications and users retrying after the errors and generating even more traffic. In this case, 1.1.1.1 had to handle more than the expected rate for A and AAAA queries.

What happened on the Internet during the Facebook outage

Here’s another fun one.

DNS vs. DoT and DoH. Typically, DNS queries and responses are sent in plaintext over UDP (or TCP sometimes), and that’s been the case for decades now. Naturally, this poses security and privacy risks to end-users as it allows in-transit attacks or traffic snooping.

With DNS over TLS (DoT) and DNS over HTTPS (DoH), clients can talk DNS using well-known, well-supported encryption and authentication protocols.

Our learning center has a good article on “DNS over TLS vs. DNS over HTTPS” that you can read. Browsers like Chrome, Firefox, and Edge have supported DoH for some time now, WARP uses DoH too, and you can even configure your operating system to use the new protocols.
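
As an illustration (these specific queries are examples, not part of the incident response), encrypted DNS queries can be issued by hand:

# DNS over HTTPS, using the JSON API on Cloudflare's resolver:
curl -s -H 'accept: application/dns-json' \
  'https://cloudflare-dns.com/dns-query?name=facebook.com&type=A'

# DNS over TLS, using kdig (from the knot-dnsutils package):
kdig @1.1.1.1 +tls facebook.com A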

When Facebook went offline, we saw the number of DoT and DoH SERVFAIL responses grow to more than 300 times the average rate.

What happened on the Internet during the Facebook outage
What happened on the Internet during the Facebook outage
What happened on the Internet during the Facebook outage

So, we got hammered with lots of requests and errors, causing traffic spikes to our 1.1.1.1 resolver and causing an unexpected load in the edge network and systems. How did we perform during this stressful period?

Quite well. 1.1.1.1 kept its cool and continued serving the vast majority of requests around the famous 10ms mark. A small fraction of requests at the p95 and p99 percentiles saw increased response times, probably due to timeouts while trying to reach Facebook’s nameservers.

What happened on the Internet during the Facebook outage

Another interesting perspective is the distribution of the ratio between SERVFAIL and good DNS answers, by country. In theory, the higher this ratio is, the more the country uses Facebook. Here’s the map with the countries that suffered the most:

What happened on the Internet during the Facebook outage

Here’s the top twelve country list, ordered by those that apparently use Facebook, WhatsApp and Instagram the most:

Country SERVFAIL/Good Answers ratio
Turkey 7.34
Grenada 4.84
Congo 4.44
Lesotho 3.94
Nicaragua 3.57
South Sudan 3.47
Syrian Arab Republic 3.41
Serbia 3.25
Turkmenistan 3.23
United Arab Emirates 3.17
Togo 3.14
French Guiana 3.00

Impact on other sites

When Facebook, Instagram, and WhatsApp aren’t around, the world turns to other places to look for information on what’s going on, other forms of entertainment or other applications to communicate with their friends and family. Our data shows us those shifts. While Facebook was going down, other services and platforms were going up.

To get an idea of the changing traffic patterns, we look at DNS queries as an indicator of increased traffic to specific sites or types of sites.

Here are a few examples.

Other social media platforms saw a slight increase in use, compared to normal.

What happened on the Internet during the Facebook outage

Traffic to messaging platforms like Telegram, Signal, Discord and Slack got a little push too.

What happened on the Internet during the Facebook outage

Nothing like a little gaming time when Instagram is down, we guess, when looking at traffic to sites like Steam, Xbox, Minecraft and others.

What happened on the Internet during the Facebook outage

And yes, people want to know what’s going on and fall back on news sites like CNN, New York Times, The Guardian, Wall Street Journal, Washington Post, Huffington Post, BBC, and others:

What happened on the Internet during the Facebook outage

Attacks

One could speculate that the Internet was under attack from malicious hackers. Our Firewall doesn’t agree; nothing out of the ordinary stands out.

What happened on the Internet during the Facebook outage

Network Error Logs

Network Error Logging, NEL for short, is an experimental technology supported in Chrome. A website can issue a Report-To header and ask the browser to send reports about network problems, like bad requests or DNS issues, to a specific endpoint.

Cloudflare uses NEL data to quickly help triage end-user connectivity issues when end-users reach our network. You can learn more about this feature in our help center.
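
As a rough illustration of what opting in looks like (the group name and endpoint URL below are placeholders, not Cloudflare’s actual configuration), a site sends two response headers:

Report-To: {"group": "network-errors", "max_age": 2592000, "endpoints": [{"url": "https://reports.example.com/nel"}]}
NEL: {"report_to": "network-errors", "max_age": 2592000}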

If Facebook is down and their DNS isn’t responding, Chrome will start reporting NEL events every time one of the pages in our zones fails to load Facebook comments, posts, ads, or authentication buttons. This chart shows it clearly.

What happened on the Internet during the Facebook outage

WARP

Cloudflare announced WARP in 2019, calling it “A VPN for People Who Don’t Know What V.P.N. Stands For”, and offered it for free to its customers. Today WARP is used by millions of people worldwide to securely and privately access the Internet on their desktop and mobile devices. Here’s what we saw during the outage by looking at traffic volume between WARP and Facebook’s network:

What happened on the Internet during the Facebook outage

You can see how the steep drop in Facebook ASN traffic coincides with the start of the incident and how it compares to the same period the day before.

Our own traffic

People tend to think of Facebook as a place to visit. We log in, we access Facebook, we post. It turns out that Facebook likes to visit us too, quite a lot. Like Google and other platforms, Facebook uses an army of crawlers to constantly check websites for data and updates. Those robots gather information about a website’s content, such as its titles, descriptions, thumbnail images, and metadata. You can learn more about this on “The Facebook Crawler” page and the Open Graph website.

Here’s what we see when traffic is coming from the Facebook ASN, supposedly from crawlers, to our CDN sites:

What happened on the Internet during the Facebook outage

The robots went silent.

What about the traffic coming to our CDN sites from Facebook User-Agents? The gap is indisputable.

What happened on the Internet during the Facebook outage

We see about 30% of a typical request rate hitting us. But it’s not zero; why is that?

We’ll let you in on a little secret. Never trust User-Agent information; it’s broken. User-Agent spoofing is everywhere. Browsers, apps, and other clients deliberately change the User-Agent string when they fetch pages from the Internet to hide, obtain access to certain features, or bypass paywalls (because pay-walled sites want sites like Facebook to index their content, so that they then get more traffic from links).

Fortunately, there are newer, and privacy-centric standards emerging like User-Agent Client Hints.

Core Web Vitals

Core Web Vitals are the subset of Web Vitals, an initiative by Google to provide a unified interface to measure real-world quality signals when a user visits a web page. Such signals include Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS).

We use Core Web Vitals with our privacy-centric Web Analytics product and collect anonymized data on how end-users experience the websites that enable this feature.

One of the metrics we can calculate using these signals is the page load time. Our theory is that if a page includes scripts coming from external sites (for example, Facebook “like” buttons, comments, ads), and they are unreachable, its total load time gets affected.

We used a list of about 400 domains that we know embed Facebook scripts in their pages and looked at the data.

What happened on the Internet during the Facebook outage

Now let’s look at the Largest Contentful Paint. LCP marks the point in the page load timeline when the page’s main content has likely loaded. The faster the LCP is, the better the end-user experience.

What happened on the Internet during the Facebook outage

Again, the page load experience got visibly degraded.

The outcome seems clear. The sites that use Facebook scripts in their pages took 1.5x more time to load their pages during the outage, with some of them taking more than 2x the usual time. Facebook’s outage dragged down the performance of some other sites.

Conclusion

When Facebook, Instagram, and WhatsApp went down, the Web felt it. Some websites got slower or lost traffic, other services and platforms got unexpected load, and people lost the ability to communicate or do business normally.

Understanding How Facebook Disappeared from the Internet

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/october-2021-facebook-outage/

Understanding How Facebook Disappeared from the Internet

Understanding How Facebook Disappeared from the Internet

“Facebook can’t be down, can it?”, we thought, for a second.

Today at 1651 UTC, we opened an internal incident entitled “Facebook DNS lookup returning SERVFAIL” because we were worried that something was wrong with our DNS resolver 1.1.1.1. But as we were about to post on our public status page, we realized something else, more serious, was going on.

Social media quickly burst into flames, reporting what our engineers rapidly confirmed too. Facebook and its affiliated services WhatsApp and Instagram were, in fact, all down. Their DNS names stopped resolving, and their infrastructure IPs were unreachable. It was as if someone had “pulled the cables” from their data centers all at once and disconnected them from the Internet.

How’s that even possible?

Meet BGP

BGP stands for Border Gateway Protocol. It’s a mechanism to exchange routing information between autonomous systems (AS) on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to their final destinations. Without BGP, the Internet routers wouldn’t know what to do, and the Internet wouldn’t work.

The Internet is literally a network of networks, and it’s bound together by BGP. BGP allows one network (say Facebook) to advertise its presence to other networks that form the Internet. As we write, Facebook is not advertising its presence, so ISPs and other networks can’t find Facebook’s network and it is unavailable.

The individual networks each have an ASN: an Autonomous System Number. An Autonomous System (AS) is an individual network with a unified internal routing policy. An AS can originate prefixes (say that they control a group of IP addresses), as well as transit prefixes (say they know how to reach specific groups of IP addresses).

Cloudflare’s ASN is AS13335. Every ASN needs to announce its prefix routes to the Internet using BGP; otherwise, no one will know how to connect and where to find us.
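
As a side note, one way to see which ASN originates the route covering a given IP address is the Team Cymru whois service (the address below is just an example):

# Map an IP address to the BGP prefix and origin AS that announce it:
whois -h whois.cymru.com " -v 1.1.1.1"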

Our learning center has a good overview of what BGP and ASNs are and how they work.

In this simplified diagram, you can see six autonomous systems on the Internet and two possible routes that one packet can use to go from Start to End. AS1 → AS2 → AS3 is the fastest, and AS1 → AS6 → AS5 → AS4 → AS3 is the slowest, but the slower route can be used if the first fails.

Understanding How Facebook Disappeared from the Internet

At 1658 UTC we noticed that Facebook had stopped announcing the routes to their DNS prefixes. That meant that, at least, Facebook’s DNS servers were unavailable. Because of this Cloudflare’s 1.1.1.1 DNS resolver could no longer respond to queries asking for the IP address of facebook.com or instagram.com.

route-views>show ip bgp 185.89.218.0/23
% Network not in table
route-views>

route-views>show ip bgp 129.134.30.0/23
% Network not in table
route-views>

Meanwhile, other Facebook IP addresses remained routed but weren’t particularly useful since without DNS Facebook and related services were effectively unavailable:

route-views>show ip bgp 129.134.30.0   
BGP routing table entry for 129.134.0.0/17, version 1025798334
Paths: (24 available, best #14, table default)
  Not advertised to any peer
  Refresh Epoch 2
  3303 6453 32934
    217.192.89.50 from 217.192.89.50 (138.187.128.158)
      Origin IGP, localpref 100, valid, external
      Community: 3303:1004 3303:1006 3303:3075 6453:3000 6453:3400 6453:3402
      path 7FE1408ED9C8 RPKI State not found
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
route-views>

We keep track of all the BGP updates and announcements we see in our global network. At our scale, the data we collect gives us a view of how the Internet is connected and where the traffic is meant to flow from and to everywhere on the planet.

A BGP UPDATE message informs a router of any changes you’ve made to a prefix advertisement or entirely withdraws the prefix. We can clearly see this in the number of updates we received from Facebook when checking our time-series BGP database. Normally this chart is fairly quiet: Facebook doesn’t make a lot of changes to its network minute to minute.

But at around 15:40 UTC we saw a peak of routing changes from Facebook. That’s when the trouble began.

Understanding How Facebook Disappeared from the Internet

If we split this view by routes announcements and withdrawals, we get an even better idea of what happened. Routes were withdrawn, Facebook’s DNS servers went offline, and one minute after the problem occurred, Cloudflare engineers were in a room wondering why 1.1.1.1 couldn’t resolve facebook.com and worrying that it was somehow a fault with our systems.

Understanding How Facebook Disappeared from the Internet

With those withdrawals, Facebook and its sites had effectively disconnected themselves from the Internet.

DNS gets affected

As a direct consequence of this, DNS resolvers all over the world stopped resolving their domain names.

➜  ~ dig @1.1.1.1 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com.			IN	A
➜  ~ dig @1.1.1.1 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com.			IN	A
➜  ~ dig @8.8.8.8 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com.			IN	A
➜  ~ dig @8.8.8.8 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com.			IN	A

This happens because DNS, like many other systems on the Internet, also has its own routing mechanism. When someone types the https://facebook.com URL in the browser, the DNS resolver, responsible for translating domain names into actual IP addresses to connect to, first checks if it has something in its cache and uses it. If not, it tries to grab the answer from the domain’s nameservers, typically hosted by the entity that owns it.

If the nameservers are unreachable or fail to respond because of some other reason, then a SERVFAIL is returned, and the browser issues an error to the user.

Again, our learning center provides a good explanation on how DNS works.

Understanding How Facebook Disappeared from the Internet

Due to Facebook stopping announcing their DNS prefix routes through BGP, our and everyone else’s DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.

But that’s not all. Now human behavior and application logic kicks in and causes another exponential effect. A tsunami of additional DNS traffic follows.

This happened in part because apps won’t accept an error for an answer and start retrying, sometimes aggressively, and in part because end-users also won’t take an error for an answer and start reloading the pages, or killing and relaunching their apps, sometimes also aggressively.

This is the traffic increase (in number of requests) that we saw on 1.1.1.1:

Understanding How Facebook Disappeared from the Internet

So now, because Facebook and their sites are so big, we have DNS resolvers worldwide handling 30x more queries than usual and potentially causing latency and timeout issues to other platforms.

Fortunately, 1.1.1.1 was built to be Free, Private, Fast (as the independent DNS monitor DNSPerf can attest), and scalable, and we were able to keep servicing our users with minimal impact.

The vast majority of our DNS requests kept resolving in under 10ms. At the same time, a minimal fraction of p95 and p99 percentiles saw increased response times, probably due to expired TTLs forcing queries to the Facebook nameservers, which then timed out. The 10-second DNS timeout limit is well known amongst engineers.

Understanding How Facebook Disappeared from the Internet

Impacting other services

People look for alternatives and want to know more or discuss what’s going on. When Facebook became unreachable, we started seeing increased DNS queries to Twitter, Signal and other messaging and social media platforms.

Understanding How Facebook Disappeared from the Internet

We can also see another side effect of this unreachability in our WARP traffic to and from Facebook’s affected ASN 32934. This chart shows how traffic changed from 15:45 UTC to 16:45 UTC compared with three hours before in each country. All over the world WARP traffic to and from Facebook’s network simply disappeared.

Understanding How Facebook Disappeared from the Internet

The Internet

Today’s events are a gentle reminder that the Internet is a very complex and interdependent system of millions of systems and protocols working together. Trust, standardization, and cooperation between entities are at the center of making it work for almost five billion active users worldwide.

Update

At around 21:00 UTC we saw renewed BGP activity from Facebook’s network which peaked at 21:17 UTC.

Understanding How Facebook Disappeared from the Internet

This chart shows the availability of the DNS name ‘facebook.com’ on Cloudflare’s DNS resolver 1.1.1.1. It stopped being available at around 15:50 UTC and returned at 21:20 UTC.

Understanding How Facebook Disappeared from the Internet

Undoubtedly Facebook, WhatsApp and Instagram services will take further time to come online but as of 22:28 UTC Facebook appears to be reconnected to the global Internet and DNS working again.

A Byzantine failure in the real world

Post Syndicated from Tom Lianza original https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/

A Byzantine failure in the real world

An analysis of the Cloudflare API availability incident on 2020-11-02

When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). Eliminating these is a necessary step in architecting a system you can be confident in. Ironically, when you’re designing a system with built-in redundancy, you spend most of your time thinking about how well it functions when that redundancy is lost.

On November 2, 2020, Cloudflare had an incident that impacted the availability of the API and dashboard for six hours and 33 minutes. During this incident, the success rate for queries to our API periodically dipped as low as 75%, and the dashboard experience was as much as 80 times slower than normal. While Cloudflare’s edge is massively distributed across the world (and kept working without a hitch), Cloudflare’s control plane (API & dashboard) is made up of a large number of microservices that are redundant across two regions. For most services, the databases backing those microservices are only writable in one region at a time.

Each of Cloudflare’s control plane data centers has multiple racks of servers. Each of those racks has two switches that operate as a pair—both are normally active, but either can handle the load if the other fails. Cloudflare survives rack-level failures by spreading the most critical services across racks. Every piece of hardware has two or more power supplies with different power feeds. Every server that stores critical data uses RAID 10 redundant disks or storage systems that replicate data across at least three machines in different racks, or both. Redundancy at each layer is something we review and require. So—how could things go wrong?

In this post we present a timeline of what happened, and how a difficult failure mode known as a Byzantine fault played a role in a cascading series of events.

2020-11-02 14:43 UTC: Partial Switch Failure

At 14:43, a network switch started misbehaving. Alerts began firing about the switch being unreachable to pings. The device was in a partially operating state: network control plane protocols such as LACP and BGP remained operational, while others, such as vPC, were not. The vPC link is used to synchronize ports across multiple switches, so that they appear as one large, aggregated switch to servers connected to them. At the same time, the data plane (or forwarding plane) was not processing and forwarding all the packets received from connected devices.

This failure scenario is effectively invisible to the connected nodes: because of the load-balancing nature of LACP, each server sees problems with only some of its traffic, and no link ever goes down to trigger a failover. Had the switch failed fully, all traffic would have failed over to the peer switch, as the connected links would simply have gone down and the ports would have dropped out of the forwarding LACP bundles.

Six minutes later, the switch recovered without human intervention. But this odd failure mode led to further problems that lasted long after the switch had returned to normal operation.

2020-11-02 14:44 UTC: etcd Errors begin

The rack with the misbehaving switch included one server in our etcd cluster. We use etcd heavily in our core data centers whenever we need strongly consistent data storage that’s reliable across multiple nodes.

In the event that the cluster leader fails, etcd uses the RAFT protocol to maintain consistency and establish consensus to promote a new leader. In the RAFT protocol, cluster members are assumed to be either available or unavailable, and to provide accurate information or none at all. This works fine when a machine crashes, but is not always able to handle situations where different members of the cluster have conflicting information.

In this particular situation:

  • Network traffic between node 1 (in the affected rack) and node 3 (the leader) was being sent through the switch in the degraded state,
  • Network traffic between node 1 and node 2 was going through the degraded switch’s working peer, and
  • Network traffic between node 2 and node 3 was unaffected.

This caused cluster members to have conflicting views of reality, known in distributed systems theory as a Byzantine fault. As a consequence of this conflicting information, node 1 repeatedly initiated leader elections, voting for itself, while node 2 repeatedly voted for node 3, which it could still connect to. This resulted in ties that did not promote a leader node 1 could reach. RAFT leader elections are disruptive, blocking all writes until they’re resolved, so this made the cluster read-only until the faulty switch recovered and node 1 could once again reach node 3.
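
To make the split-vote dynamic concrete, here is a deliberately simplified toy model in Python. It is not etcd’s implementation, and the connectivity map, function names, and voting rule are illustrative assumptions only; it just shows why a candidate that cannot reach the leader keeps losing elections while another follower still can.

# Toy model of the split-vote scenario above (not etcd's implementation).
# reachability[n] is the set of peers node n can currently reach.
reachability = {
    1: {2},        # the 1 <-> 3 path runs through the degraded switch
    2: {1, 3},
    3: {2},
}
CURRENT_LEADER = 3

def run_election(candidate):
    """Candidate asks for votes; a peer grants one only if it can hear the
    candidate and can no longer hear the current leader."""
    votes = 1  # the candidate always votes for itself
    for peer in reachability:
        if peer == candidate:
            continue
        if candidate not in reachability[peer]:
            continue  # the vote request never arrives
        if CURRENT_LEADER in reachability[peer]:
            continue  # this peer still hears the leader and backs it instead
        votes += 1
    return votes > len(reachability) // 2  # needs 2 of 3 votes

# Node 1 stops hearing the leader's heartbeats and keeps calling elections,
# but node 2 keeps siding with node 3, so node 1 never reaches a majority.
for attempt in range(1, 4):
    print(f"election attempt {attempt}: node 1 wins? {run_election(1)}")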

A Byzantine failure in the real world

2020-11-02 14:45 UTC: Database system promotes a new primary database

Cloudflare’s control plane services use relational databases hosted across multiple clusters within a data center. Each cluster is configured for high availability. The cluster setup includes a primary database, a synchronous replica, and one or more asynchronous replicas. This setup allows redundancy within a data center. For cross-datacenter redundancy, a similar high availability secondary cluster is set up and replicated in a geographically dispersed data center for disaster recovery. The cluster management system leverages etcd for cluster member discovery and coordination.

When etcd became read-only, two clusters were unable to communicate that they had a healthy primary database. This triggered the automatic promotion of a synchronous database replica to become the new primary. This process happened automatically and without error or data loss.
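
The coordination pattern behind this kind of automation can be sketched roughly as follows. This is a hypothetical illustration, not our cluster management system: the key name, TTL, and injected functions are assumptions. The point is that when etcd stops accepting writes, a perfectly healthy primary can no longer prove that it is healthy, and the failover logic reacts as though it were gone.

# Hypothetical sketch of the coordination pattern (not Cloudflare's cluster
# manager): the primary keeps a TTL'd health key alive in etcd, and a
# failover agent promotes the synchronous replica if that key ever expires.
import time

HEALTH_KEY = "/clusters/auth-db/primary-healthy"  # assumed key name
TTL_SECONDS = 10

def primary_heartbeat(etcd_put):
    """Runs next to the primary and renews the health key. If etcd has no
    leader and rejects writes, the key silently ages out."""
    while True:
        try:
            etcd_put(HEALTH_KEY, "ok", ttl=TTL_SECONDS)
        except Exception:
            pass  # write rejected: etcd is effectively read-only
        time.sleep(TTL_SECONDS / 3)

def failover_agent(etcd_get, promote_sync_replica):
    """Watches the health key; an expired key is indistinguishable from a
    dead primary, so the synchronous replica gets promoted either way."""
    while True:
        if etcd_get(HEALTH_KEY) is None:
            promote_sync_replica()
            return
        time.sleep(1)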

There was a defect in our cluster management system that required a rebuild of all database replicas when a new primary database was promoted. So, although the new primary database was available instantly, the replicas would take considerable time to become available, depending on the size of the database. For one of the clusters, service was restored quickly: its synchronous and asynchronous database replicas were rebuilt, started replicating successfully from the primary, and the impact was minimal.

For the other cluster, however, performant operation of that database required a replica to be online. Because this database handles authentication for API calls and dashboard activities, it serves a large volume of reads, and one replica was heavily utilized to spare the primary that load. When the failover happened and no replicas were available, the primary was overloaded, as it had to absorb all of that read load on its own. This is when the main impact started.

Reduce Load, Leverage Redundancy

At this point we saw that our primary authentication database was overwhelmed and began shedding load from it. We dialed back the rate at which we push SSL certificates to the edge, send emails, and perform other background work, to give it room to handle the additional load. Unfortunately, because of its size, we knew it would take several hours for a replica to be fully rebuilt.

A silver lining here is that every database cluster in our primary data center also has online replicas in our secondary data center. Those replicas are not part of the local failover process, and were online and available throughout the incident. The process of steering read-queries to those replicas was not yet automated, so we manually diverted API traffic that could leverage those read replicas to the secondary data center. This substantially improved our API availability.
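
Roughly, the manual mitigation amounted to splitting traffic by whether it needed to write. The sketch below is illustrative only (the routing rule and connection objects are assumptions, not our API gateway): read-only queries go to the cross-region replica, while writes stay on the overloaded primary.

# Illustrative only (assumed connection objects, not our API gateway): send
# read-only queries to the replica in the secondary data center and keep
# writes on the primary in the main data center.
READ_ONLY_VERBS = {"SELECT", "SHOW"}

def route_query(sql, primary_conn, secondary_replica_conn):
    verb = sql.lstrip().split(None, 1)[0].upper()
    if verb in READ_ONLY_VERBS:
        return secondary_replica_conn  # cross-region replica absorbs reads
    return primary_conn                # writes must stay on the primary

# route_query("SELECT ...", primary, replica)  -> replica
# route_query("UPDATE ...", primary, replica)  -> primary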

The Dashboard

The Cloudflare dashboard, like most web applications, has the notion of a user session. When user sessions are created (each time a user logs in) we perform some database operations and keep data in a Redis cluster for the duration of that user’s session. Unlike our API calls, our user sessions cannot currently be moved across the ocean without disruption. As we took actions to improve the availability of our API calls, we were unfortunately making the user experience on the dashboard worse.
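
As a rough sketch of that session pattern (the key names, TTL, and Redis endpoint are assumptions, not the dashboard’s actual code), session state is written to a Redis cluster local to the data center, which is why sessions could not simply follow API traffic to the other region:

# Hypothetical sketch (key names, TTL, and endpoint are assumptions, not the
# dashboard's actual code): session state lives in a Redis cluster local to
# the data center for the lifetime of the login.
import json
import secrets
import redis

r = redis.Redis(host="sessions.example.internal", port=6379)  # assumed

def create_session(user_id, ttl_seconds=3600):
    session_id = secrets.token_urlsafe(32)
    # SETEX stores the session blob with an expiry bounding its lifetime
    r.setex(f"session:{session_id}", ttl_seconds, json.dumps({"user": user_id}))
    return session_id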

This is an area of the system that is currently designed to be able to fail over across data centers in the event of a disaster, but has not yet been designed to work in both data centers at the same time. After a first period in which users on the dashboard became increasingly frustrated, we failed the authentication calls fully back to our primary data center, and kept working on our primary database to ensure we could provide the best service levels possible in that degraded state.

2020-11-02 21:20 UTC: Database Replica Rebuilt

The instant the first database replica finished rebuilding, it put itself back into service and performance returned to normal levels. We re-ramped all of the services that had been turned down so that asynchronous processing could catch up, and after a period of monitoring we marked the end of the incident.

Redundant Points of Failure

The cascade of failures in this incident was interesting because each system, on its face, had redundancy. Moreover, no system fully failed—each entered a degraded state. That combination meant the chain of events that transpired was considerably harder to model and anticipate. It was frustrating yet reassuring that some of the possible failure modes were already being addressed.

A team was already working on fixing the limitation that requires a database replica rebuild upon promotion. Our user sessions system was inflexible in scenarios where we’d like to steer traffic around, and redesigning that was already in progress.

This incident also led us to revisit the configuration parameters we put in place for things that auto-remediate. In previous years, promoting a database replica to primary took far longer than we liked, so getting that process automated and able to trigger on a minute’s notice was a point of pride. At the same time, for at least one of our databases, the cure may be worse than the disease, and in fact we may not want to invoke the promotion process so quickly. Immediately after this incident we adjusted that configuration accordingly.

Byzantine Fault Tolerance (BFT) is a hot research topic. Solutions have been known since 1982, but have had to choose between a variety of engineering tradeoffs, including security, performance, and algorithmic simplicity. Most general-purpose cluster management systems choose to forgo BFT entirely and use protocols based on PAXOS, or simplifications of PAXOS such as RAFT, that perform better and are easier to understand than BFT consensus protocols. In many cases, a simple protocol that is known to be vulnerable to a rare failure mode is safer than a complex protocol that is difficult to implement correctly or debug.

The first uses of BFT consensus were in safety-critical systems such as aircraft and spacecraft controls. These systems typically have hard real time latency constraints that require tightly coupling consensus with application logic in ways that make these implementations unsuitable for general-purpose services like etcd. Contemporary research on BFT consensus is mostly focused on applications that cross trust boundaries, which need to protect against malicious cluster members as well as malfunctioning cluster members. These designs are more suitable for implementing general-purpose services such as etcd, and we look forward to collaborating with researchers and the open source community to make them suitable for production cluster management.

We are very sorry for the difficulty the outage caused, and are continuing to improve as our systems grow. We’ve since fixed the bug in our cluster management system, and are continuing to tune each of the systems involved in this incident to be more resilient to failures of their dependencies.  If you’re interested in helping solve these problems at scale, please visit cloudflare.com/careers.

Analysis of Today’s CenturyLink/Level(3) Outage

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/

Analysis of Today's CenturyLink/Level(3) Outage

Today CenturyLink/Level(3), a major ISP and Internet bandwidth provider, experienced a significant outage that impacted some of Cloudflare’s customers as well as a significant number of other services and providers across the Internet. While we’re waiting for a post mortem from CenturyLink/Level(3), I wanted to write up the timeline of what we saw, how Cloudflare’s systems routed around the problem, why some of our customers were still impacted in spite of our mitigations, and what appears to be the likely root cause of the issue.

Increase In Errors

At 10:03 UTC our monitoring systems started to observe an increased number of errors reaching our customers’ origin servers. These show up as “522 Errors” and indicate that there is an issue connecting from Cloudflare’s network to wherever our customers’ applications are hosted.

Cloudflare is connected to CenturyLink/Level(3) among a large and diverse set of network providers. When we see an increase in errors from one network provider, our systems automatically attempt to reach customers’ applications across alternative providers. Given the number of providers we have access to, we are generally able to continue to route traffic even when one provider has an issue.

Analysis of Today's CenturyLink/Level(3) Outage
The diverse set of network providers Cloudflare connects to. Source: https://bgp.he.net/AS13335#_asinfo‌‌

Automatic Mitigations

In this case, beginning within seconds of the increase in 522 errors, our systems automatically rerouted traffic from CenturyLink/Level(3) to alternate network providers we connect to including Cogent, NTT, GTT, Telia, and Tata.

Our Network Operations Center was also alerted and, beginning at 10:09 UTC, our team took additional steps to mitigate any issues our automated systems weren’t able to address. We were successful in keeping traffic flowing across our network for most customers and end users even with the loss of CenturyLink/Level(3) as one of our network providers.

Analysis of Today's CenturyLink/Level(3) Outage
Dashboard showing Cloudflare’s automated systems recognizing the damage to the Internet caused by the CenturyLink/Level(3) failure and automatically routing around it.

The graph below shows traffic between Cloudflare’s network and six major tier-1 networks that are among the network providers we connect to. The red portion shows CenturyLink/Level(3) traffic, which dropped to near-zero during the incident. You can also see how we automatically shifted traffic to other network providers during the incident to mitigate the impact and ensure traffic continued to flow.

Analysis of Today's CenturyLink/Level(3) Outage
Traffic across six major tier-1 networks that are among the network providers Cloudflare connects to. CenturyLink/Level(3) in red.

The following graph shows 522 errors (indicating our inability to reach customers’ applications) across our network during the time of the incident.

Analysis of Today's CenturyLink/Level(3) Outage

The sharp spike at 10:03 UTC was the CenturyLink/Level(3) network failing. Our automated systems immediately kicked in to reroute and rebalance traffic across alternative network providers, causing errors to drop by half immediately and then fall to approximately 25 percent of the peak as those paths were automatically optimized.

Between 10:03 UTC and 10:11 UTC our systems automatically disabled CenturyLink/Level(3) in the 48 cities where we’re connected to them and rerouted traffic across alternate network providers. Our systems take into account capacity on other providers before shifting out traffic in order to prevent cascading failures. This is why the failover, while automatic, isn’t instantaneous in all locations. Our team was able to apply additional manual mitigations to reduce the number of errors another 5 percent.
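
A toy model of capacity-aware failover looks something like the sketch below. The provider names are peers mentioned above, but the capacity and load figures are invented, and this is not Cloudflare’s actual traffic engineering logic; it simply illustrates why traffic is shifted gradually rather than all at once.

# Toy model only: the provider names appear above, but the capacity and load
# figures are invented, and this is not Cloudflare's traffic engineering
# logic. It shows why failover respects spare capacity on the survivors.
providers = {
    #               total capacity, current load (arbitrary units), healthy?
    "CenturyLink": {"capacity": 100, "load": 60, "healthy": False},
    "Cogent":      {"capacity": 80,  "load": 50, "healthy": True},
    "NTT":         {"capacity": 90,  "load": 40, "healthy": True},
    "GTT":         {"capacity": 70,  "load": 55, "healthy": True},
}

def reroute(failed):
    to_move = providers[failed]["load"]
    providers[failed]["load"] = 0
    for name, p in providers.items():
        if name == failed or not p["healthy"] or to_move <= 0:
            continue
        spare = p["capacity"] - p["load"]   # never push a peer past capacity
        shifted = min(spare, to_move)
        p["load"] += shifted
        to_move -= shifted
    return to_move  # anything left over waits for manual mitigation

print("unplaced traffic:", reroute("CenturyLink"))  # 0 in this toy example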

Why Did the Errors Not Drop to Zero?

Unfortunately, there were still an elevated number of errors indicating we were still unable to reach some customers. CenturyLink/Level(3) is among the largest network providers in the world. As a result, many hosting providers only have single-homed connectivity to the Internet through their network.

To use the old Internet as a “superhighway” analogy, that’s like having only a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. For customers whose only connectivity to the Internet is via CenturyLink/Level(3), or where CenturyLink/Level(3) continued to announce bad routes after they’d been withdrawn, there was no way for us to reach their applications and they continued to see 522 errors until CenturyLink/Level(3) resolved their issue around 14:30 UTC.

The same problem existed on the other (“eyeball”) side of the network. Individuals need an onramp onto the Internet’s superhighway, and that onramp is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States.

Analysis of Today's CenturyLink/Level(3) Outage
Source: https://broadbandnow.com/CenturyLink

Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. We saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a near-complete outage of CenturyLink’s ISP service across the United States.

So What Likely Happened Here?

While we will not know exactly what happened until CenturyLink/Level(3) issue a post mortem, we can see clues from BGP announcements and how they propagated across the Internet during the outage. BGP is the Border Gateway Protocol. It is how routers on the Internet announce to each other what IPs sit behind them and therefore what traffic they should receive.

Starting at 10:04 UTC, there were a significant number of BGP updates. A BGP update is the signal a router sends to say that a route has changed or is no longer available. Under normal conditions, the Internet sees about 1.5MB – 2MB of BGP updates every 15 minutes. At the start of the incident, the volume spiked to more than 26MB of BGP updates per 15-minute period and stayed elevated throughout the incident.

Analysis of Today's CenturyLink/Level(3) Outage
Source: http://archive.routeviews.org/bgpdata/2020.08/UPDATES/

These updates show the instability of BGP routes inside the CenturyLink/Level(3) backbone. The question is what would have caused this instability. The CenturyLink/Level(3) status update offers some hints and points at a flowspec update as the root cause.

Analysis of Today's CenturyLink/Level(3) Outage

What’s Flowspec?

In CenturyLink/Level(3)’s update they mention that a bad Flowspec rule caused the issue. So what is Flowspec? Flowspec is an extension to BGP, which allows firewall rules to be easily distributed across a network, or even between networks, using BGP. Flowspec is a powerful tool. It allows you to efficiently push rules across an entire network almost instantly. It is great when you are trying to quickly respond to something like an attack, but it can be dangerous if you make a mistake.

At Cloudflare, early in our history, we used to use Flowspec ourselves to push out firewall rules in order to, for instance, mitigate large network-layer DDoS attacks. We suffered our own Flowspec-induced outage more than 7 years ago. We no longer use Flowspec ourselves, but it remains a common protocol for pushing out network firewall rules.

We can only speculate what happened at CenturyLink/Level(3), but one plausible scenario is that they issued a Flowspec command to try to block an attack or other abuse directed at their network. The status report indicates that the Flowspec rule prevented BGP itself from being announced. We have no way of knowing what that Flowspec rule was, but here’s one in Juniper’s format that would have blocked all BGP communications across their network.

route DISCARD-BGP {
   match {
      protocol tcp;
      destination-port 179;
   }
   then discard;
}
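
To see why a rule like this is so destructive, note that BGP sessions themselves run over TCP port 179. The toy Python check below (not real Flowspec machinery) shows that the rule matches exactly the packets that carry BGP, so applying it discards the very protocol that distributed the rule in the first place:

# Toy illustration (not real Flowspec machinery): BGP sessions run over TCP
# port 179, so this match discards the very protocol that distributes and
# withdraws the rule itself.
def matches_discard_bgp(packet):
    return packet.get("protocol") == "tcp" and packet.get("dst_port") == 179

bgp_keepalive = {"protocol": "tcp", "dst_port": 179}
https_request = {"protocol": "tcp", "dst_port": 443}

print(matches_discard_bgp(bgp_keepalive))  # True  -> discarded, session drops
print(matches_discard_bgp(https_request))  # False -> forwarded normally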

Why So Many Updates?

A mystery remains, however: why did global BGP updates stay elevated throughout the incident? If the rule had simply blocked BGP, you would expect an initial spike in BGP announcements followed by a return to normal levels.

One possible explanation is that the offending Flowspec rule came near the end of a long list of BGP updates. If that were the case, what may have happened is that every router in CenturyLink/Level(3)’s network would receive the Flowspec rule. They would then block BGP. That would cause them to stop receiving the rule. They would start back up again, working their way through all the BGP rules until they got to the offending Flowspec rule again. BGP would be dropped again. The Flowspec rule would no longer be received. And the loop would continue, over and over.

One challenge with this is that on every cycle, the queue of BGP updates within CenturyLink/Level(3)’s network would continue to grow. This may have reached a point where the memory and CPU of their routers were overloaded, causing an additional set of challenges in getting their network back online.
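
The following toy simulation captures that speculative loop. The numbers are invented and this is not how any real router implements BGP, but it shows how the amount of work per cycle, and therefore the queued state, only grows each time the session flaps:

# Pure speculation, mirroring the loop described above. Each cycle: the BGP
# session comes back up, the router replays its queued updates until it hits
# the offending Flowspec rule, the rule discards TCP/179 and the session
# drops again, while fresh updates keep piling up for the next replay.
queue = [f"update-{i}" for i in range(1000)] + ["flowspec-discard-bgp"]

for cycle in range(1, 6):
    replayed = 0
    for update in queue:
        replayed += 1
        if update == "flowspec-discard-bgp":
            break                     # BGP blocked; processing stops here
    churn = [f"late-update-{cycle}-{i}" for i in range(400)]
    queue = queue[:-1] + churn + ["flowspec-discard-bgp"]
    print(f"cycle {cycle}: replayed {replayed} updates before dropping BGP")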

Why Did It Take So Long to Fix?

This was a significant global Internet outage and, undoubtedly, the CenturyLink/Level(3) team received immediate alerts. They are a very sophisticated network operator with a world class Network Operations Center (NOC). So why did it take more than four hours to resolve?

Again, we can only speculate. First, the Flowspec rule and the significant load that the large number of BGP updates imposed on their routers may have made it difficult for them to log in to their own interfaces. Several of the other tier-1 providers took action, apparently at CenturyLink/Level(3)’s request, to de-peer their networks. This would have limited the number of BGP announcements being received by the CenturyLink/Level(3) network and helped give it time to catch up.

Second, it also may have been that the Flowspec rule was not issued by CenturyLink/Level(3) themselves but rather by one of their customers. Many network providers will allow Flowspec peering. This can be a powerful tool for downstream customers wishing to block attack traffic, but can make it much more difficult to track down an offending Flowspec rule when something goes wrong.

Finally, it never helps when these issues occur early on a Sunday morning. Networks the size and scale of CenturyLink/Level(3)’s are extremely complicated. Incidents happen. We appreciate their team keeping us informed with what was going on throughout the incident. #hugops