Tag Archives: Post Mortem

A Byzantine failure in the real world

Post Syndicated from Tom Lianza original https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/

A Byzantine failure in the real world

An analysis of the Cloudflare API availability incident on 2020-11-02

When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). Eliminating these is a necessary step in architecting a system you can be confident in. Ironically, when you’re designing a system with built-in redundancy, you spend most of your time thinking about how well it functions when that redundancy is lost.

On November 2, 2020, Cloudflare had an incident that impacted the availability of the API and dashboard for six hours and 33 minutes. During this incident, the success rate for queries to our API periodically dipped as low as 75%, and the dashboard experience was as much as 80 times slower than normal. While Cloudflare’s edge is massively distributed across the world (and kept working without a hitch), Cloudflare’s control plane (API & dashboard) is made up of a large number of microservices that are redundant across two regions. For most services, the databases backing those microservices are only writable in one region at a time.

Each of Cloudflare’s control plane data centers has multiple racks of servers. Each of those racks has two switches that operate as a pair—both are normally active, but either can handle the load if the other fails. Cloudflare survives rack-level failures by spreading the most critical services across racks. Every piece of hardware has two or more power supplies with different power feeds. Every server that stores critical data uses RAID 10 redundant disks or storage systems that replicate data across at least three machines in different racks, or both. Redundancy at each layer is something we review and require. So—how could things go wrong?

In this post we present a timeline of what happened, and how a difficult failure mode known as a Byzantine fault played a role in a cascading series of events.

2020-11-02 14:43 UTC: Partial Switch Failure

At 14:43, a network switch started misbehaving. Alerts began firing about the switch being unreachable to pings. The device was in a partially operating state: network control plane protocols such as LACP and BGP remained operational, while others, such as vPC, were not. The vPC link is used to synchronize ports across multiple switches, so that they appear as one large, aggregated switch to servers connected to them. At the same time, the data plane (or forwarding plane) was not processing and forwarding all the packets received from connected devices.

This failure scenario is completely invisible to the connected nodes, as each server only sees an issue for some of its traffic due to the load-balancing nature of LACP. Had the switch failed fully, all traffic would have failed over to the peer switch, as the connected links would’ve simply gone down, and the ports would’ve dropped out of the forwarding LACP bundles.

Six minutes later, the switch recovered without human intervention. But this odd failure mode led to further problems that lasted long after the switch had returned to normal operation.

2020-11-02 14:44 UTC: etcd Errors begin

The rack with the misbehaving switch included one server in our etcd cluster. We use etcd heavily in our core data centers whenever we need strongly consistent data storage that’s reliable across multiple nodes.

In the event that the cluster leader fails, etcd uses the RAFT protocol to maintain consistency and establish consensus to promote a new leader. In the RAFT protocol, cluster members are assumed to be either available or unavailable, and to provide accurate information or none at all. This works fine when a machine crashes, but is not always able to handle situations where different members of the cluster have conflicting information.

In this particular situation:

  • Network traffic between node 1 (in the affected rack) and node 3 (the leader) was being sent through the switch in the degraded state,
  • Network traffic between node 1 and node 2 were going through its working peer, and
  • Network traffic between node 2 and node 3 was unaffected.

This caused cluster members to have conflicting views of reality, known in distributed systems theory as a Byzantine fault. As a consequence of this conflicting information, node 1 repeatedly initiated leader elections, voting for itself, while node 2 repeatedly voted for node 3, which it could still connect to. This resulted in ties that did not promote a leader node 1 could reach. RAFT leader elections are disruptive, blocking all writes until they’re resolved, so this made the cluster read-only until the faulty switch recovered and node 1 could once again reach node 3.

A Byzantine failure in the real world

2020-11-02 14:45 UTC: Database system promotes a new primary database

Cloudflare’s control plane services use relational databases hosted across multiple clusters within a data center. Each cluster is configured for high availability. The cluster setup includes a primary database, a synchronous replica, and one or more asynchronous replicas. This setup allows redundancy within a data center. For cross-datacenter redundancy, a similar high availability secondary cluster is set up and replicated in a geographically dispersed data center for disaster recovery. The cluster management system leverages etcd for cluster member discovery and coordination.

When etcd became read-only, two clusters were unable to communicate that they had a healthy primary database. This triggered the automatic promotion of a synchronous database replica to become the new primary. This process happened automatically and without error or data loss.

There was a defect in our cluster management system that requires a rebuild of all database replicas when a new primary database is promoted. So, although the new primary database was available instantly, the replicas would take considerable time to become available, depending on the size of the database. For one of the clusters, service was restored quickly. Synchronous and asynchronous database replicas were rebuilt and started replicating successfully from primary, and the impact was minimal.

For the other cluster, however, performant operation of that database required a replica to be online. Because this database handles authentication for API calls and dashboard activities, it takes a lot of reads, and one replica was heavily utilized to spare the primary the load. When this failover happened and no replicas were available, the primary was overloaded, as it had to take all of the load. This is when the main impact started.

Reduce Load, Leverage Redundancy

At this point we saw that our primary authentication database was overwhelmed and began shedding load from it. We dialed back the rate at which we push SSL certificates to the edge, send emails, and other features, to give it space to handle the additional load. Unfortunately, because of its size, we knew it would take several hours for a replica to be fully rebuilt.

A silver lining here is that every database cluster in our primary data center also has online replicas in our secondary data center. Those replicas are not part of the local failover process, and were online and available throughout the incident. The process of steering read-queries to those replicas was not yet automated, so we manually diverted API traffic that could leverage those read replicas to the secondary data center. This substantially improved our API availability.

The Dashboard

The Cloudflare dashboard, like most web applications, has the notion of a user session. When user sessions are created (each time a user logs in) we perform some database operations and keep data in a Redis cluster for the duration of that user’s session. Unlike our API calls, our user sessions cannot currently be moved across the ocean without disruption. As we took actions to improve the availability of our API calls, we were unfortunately making the user experience on the dashboard worse.

This is an area of the system that is currently designed to be able to fail over across data centers in the event of a disaster, but has not yet been designed to work in both data centers at the same time. After a first period in which users on the dashboard became increasingly frustrated, we failed the authentication calls fully back to our primary data center, and kept working on our primary database to ensure we could provide the best service levels possible in that degraded state.

2020-11-02 21:20 UTC Database Replica Rebuilt

The instant the first database replica rebuilt, it put itself back into service, and performance resumed to normal levels. We re-ramped all of the services that had been turned down, so all asynchronous processing could catch up, and after a period of monitoring marked the end of the incident.

Redundant Points of Failure

The cascade of failures in this incident was interesting because each system, on its face, had redundancy. Moreover, no system fully failed—each entered a degraded state. That combination meant the chain of events that transpired was considerably harder to model and anticipate. It was frustrating yet reassuring that some of the possible failure modes were already being addressed.

A team was already working on fixing the limitation that requires a database replica rebuild upon promotion. Our user sessions system was inflexible in scenarios where we’d like to steer traffic around, and redesigning that was already in progress.

This incident also led us to revisit the configuration parameters we put in place for things that auto-remediate. In previous years, promoting a database replica to primary took far longer than we liked, so getting that process automated and able to trigger on a minute’s notice was a point of pride. At the same time, for at least one of our databases, the cure may be worse than the disease, and in fact we may not want to invoke the promotion process so quickly. Immediately after this incident we adjusted that configuration accordingly.

Byzantine Fault Tolerance (BFT) is a hot research topic. Solutions have been known since 1982, but have had to choose between a variety of engineering tradeoffs, including security, performance, and algorithmic simplicity. Most general-purpose cluster management systems choose to forgo BFT entirely and use protocols based on PAXOS, or simplifications of PAXOS such as RAFT, that perform better and are easier to understand than BFT consensus protocols. In many cases, a simple protocol that is known to be vulnerable to a rare failure mode is safer than a complex protocol that is difficult to implement correctly or debug.

The first uses of BFT consensus were in safety-critical systems such as aircraft and spacecraft controls. These systems typically have hard real time latency constraints that require tightly coupling consensus with application logic in ways that make these implementations unsuitable for general-purpose services like etcd. Contemporary research on BFT consensus is mostly focused on applications that cross trust boundaries, which need to protect against malicious cluster members as well as malfunctioning cluster members. These designs are more suitable for implementing general-purpose services such as etcd, and we look forward to collaborating with researchers and the open source community to make them suitable for production cluster management.

We are very sorry for the difficulty the outage caused, and are continuing to improve as our systems grow. We’ve since fixed the bug in our cluster management system, and are continuing to tune each of the systems involved in this incident to be more resilient to failures of their dependencies.  If you’re interested in helping solve these problems at scale, please visit cloudflare.com/careers.

Analysis of Today’s CenturyLink/Level(3) Outage

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/

Analysis of Today's CenturyLink/Level(3) Outage

Today CenturyLink/Level(3), a major ISP and Internet bandwidth provider, experienced a significant outage that impacted some of Cloudflare’s customers as well as a significant number of other services and providers across the Internet. While we’re waiting for a post mortem from CenturyLink/Level(3), I wanted to write up the timeline of what we saw, how Cloudflare’s systems routed around the problem, why some of our customers were still impacted in spite of our mitigations, and what  appears to be the likely root cause of the issue.

Increase In Errors

At 10:03 UTC our monitoring systems started to observe an increased number of errors reaching our customers’ origin servers. These show up as “522 Errors” and indicate that there is an issue connecting from Cloudflare’s network to wherever our customers’ applications are hosted.

Cloudflare is connected to CenturyLink/Level(3) among a large and diverse set of network providers. When we see an increase in errors from one network provider, our systems automatically attempt to reach customers’ applications across alternative providers. Given the number of providers we have access to, we are generally able to continue to route traffic even when one provider has an issue.

Analysis of Today's CenturyLink/Level(3) Outage
The diverse set of network providers Cloudflare connects to. Source: https://bgp.he.net/AS13335#_asinfo‌‌

Automatic Mitigations

In this case, beginning within seconds of the increase in 522 errors, our systems automatically rerouted traffic from CenturyLink/Level(3) to alternate network providers we connect to including Cogent, NTT, GTT, Telia, and Tata.

Our Network Operations Center was also alerted and our team began taking additional steps to mitigate any issues our automated systems weren’t automatically able to address beginning at 10:09 UTC. We were successful in keeping traffic flowing across our network for most customers and end users even with the loss of CenturyLink/Level(3) as one of our network providers.

Analysis of Today's CenturyLink/Level(3) Outage
Dashboard Cloudflare’s automated systems recognizing the damage to the Internet caused by the CenturyLink/Level(3) failure and automatically routing around it.

The graph below shows traffic between Cloudflare’s network and six major tier-1 networks that are among the network providers we connect to. The red portion shows CenturyLink/Level(3) traffic, which dropped to near-zero during the incident. You can also see how we automatically shifted traffic to other network providers during the incident to mitigate the impact and ensure traffic continued to flow.

Analysis of Today's CenturyLink/Level(3) Outage
Traffic across six major tier-1 networks that are among the network providers Cloudflare connects to. CenturyLink/Level(3) in red.

The following graph shows 522 errors (indicating our inability to reach customers’ applications) across our network during the time of the incident.

Analysis of Today's CenturyLink/Level(3) Outage

The sharp spike up at 10:03 UTC was the CenturyLink/Level(3) network failing. Our automated systems immediately kicked in to attempt to reroute and rebalance traffic across alternative network providers, causing the errors to drop in half immediately and then fall to approximately 25 percent of the peak as those paths were automatically optimized.

Between 10:03 UTC and 10:11 UTC our systems automatically disabled CenturyLink/Level(3) in the 48 cities where we’re connected to them and rerouted traffic across alternate network providers. Our systems take into account capacity on other providers before shifting out traffic in order to prevent cascading failures. This is why the failover, while automatic, isn’t instantaneous in all locations. Our team was able to apply additional manual mitigations to reduce the number of errors another 5 percent.

Why Did the Errors Not Drop to Zero?

Unfortunately, there were still an elevated number of errors indicating we were still unable to reach some customers. CenturyLink/Level(3) is among the largest network providers in the world. As a result, many hosting providers only have single-homed connectivity to the Internet through their network.

To use the old Internet as a “superhighway” analogy, that’s like only having a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. In the case of customers whose only connectivity to the Internet is via CenturyLink/Level(3), or if CenturyLink/Leve(3) continued to announce bad routes after they’d been withdrawn, there was no way for us to reach their applications and they continued to see 522 errors until CenturyLink/Level(3) resolved their issue around 14:30 UTC.

The same was a problem on the other (“eyeball”) side of the network. Individuals need to have an onramp onto the Internet’s superhighway. An onramp to the Internet is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States.

Analysis of Today's CenturyLink/Level(3) Outage
Source: https://broadbandnow.com/CenturyLink

Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. Globally, we saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.

So What Likely Happened Here?

While we will not know exactly what happened until CenturyLink/Level(3) issue a post mortem, we can see clues from BGP announcements and how they propagated across the Internet during the outage. BGP is the Border Gateway Protocol. It is how routers on the Internet announce to each other what IPs sit behind them and therefore what traffic they should receive.

Starting at 10:04 UTC, there were a significant number of BGP updates. A BGP update is the signal a router makes to say that a route has changed or is no longer available. Under normal conditions, the Internet sees about 1.5MBs – 2MBs of BGP updates every 15 minutes. At the start of the incident, the number of BGP updates spiked to more than 26MBs of BGP updates per 15 minute period and stayed elevated throughout the incident.

Analysis of Today's CenturyLink/Level(3) Outage
Source: http://archive.routeviews.org/bgpdata/2020.08/UPDATES/

These updates show the instability of BGP routes inside the CenturyLink/Level(3) backbone. The question is what would have caused this instability. The CenturyLink/Level(3) status update offers some hints and points at a flowspec update as the root cause.

Analysis of Today's CenturyLink/Level(3) Outage

What’s Flowspec?

In CenturyLink/Level(3)’s update they mention that a bad Flowspec rule caused the issue. So what is Flowspec? Flowspec is an extension to BGP, which allows firewall rules to be easily distributed across a network, or even between networks, using BGP. Flowspec is a powerful tool. It allows you to efficiently push rules across an entire network almost instantly. It is great when you are trying to quickly respond to something like an attack, but it can be dangerous if you make a mistake.

At Cloudflare, early in our history, we used to use Flowspec ourselves to push out firewall rules in order to, for instance, mitigate large network-layer DDoS attacks. We suffered our own Flowspec-induced outage more than 7 years ago. We no longer use Flowspec ourselves, but it remains a common protocol for pushing out network firewall rules.

We can only speculate what happened at CenturyLink/Level(3), but one plausible scenario is that they issued a Flowspec command to try to block an attack or other abuse directed at their network. The status report indicates that the Flowspec rule prevented BGP itself from being announced. We have no way of knowing what that Flowspec rule was, but here’s one in Juniper’s format that would have blocked all BGP communications across their network.

   match {
      protocol tcp;
      destination-port 179;
 then discard;

Why So Many Updates?

A mystery remains, however, why global BGP updates stayed elevated throughout the incident. If the rule blocked BGP then you would expect to see an increase in BGP announcements initially and then they would fall back to normal.

One possible explanation is that the offending Flowspec rule came near the end of a long list of BGP updates. If that were the case, what may have happened is that every router in CenturyLink/Level(3)’s network would receive the Flowspec rule. They would then block BGP. That would cause them to stop receiving the rule. They would start back up again, working their way through all the BGP rules until they got to the offending Flowspec rule again. BGP would be dropped again. The Flowspec rule would no longer be received. And the loop would continue, over and over.

One challenge of this is that on every cycle, the queue of BGP updates would continue to increase within CenturyLink/Level(3)’s network. This may have gotten to a point where the memory and CPU of their routers was overloaded, causing an additional set of challenges to getting their network back online.

Why Did It Take So Long to Fix?

This was a significant global Internet outage and, undoubtedly, the CenturyLink/Level(3) team received immediate alerts. They are a very sophisticated network operator with a world class Network Operations Center (NOC). So why did it take more than four hours to resolve?

Again, we can only speculate. First, it may have been that the Flowspec rule and the significant load that large number of BGP updates imposed on their routers made it difficult for them to login to their own interfaces. Several of the other tier-1 providers took action, it appears at CenturyLink/Level(3)’s request, to de-peer their networks. This would have limited the number of BGP announcements being received by the CenturyLink/Level(3) network and helped give it time to catch up.

Second, it also may have been that the Flowspec rule was not issued by CenturyLink/Level(3) themselves but rather by one of their customers. Many network providers will allow Flowspec peering. This can be a powerful tool for downstream customers wishing to block attack traffic, but can make it much more difficult to track down an offending Flowspec rule when something goes wrong.

Finally, it never helps when these issues occur early on a Sunday morning. Networks the size and scale of CenturyLink/Level(3)’s are extremely complicated. Incidents happen. We appreciate their team keeping us informed with what was going on throughout the incident. #hugops

Cloudflare outage on July 17, 2020

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/

Cloudflare outage on July 17, 2020

Today a configuration error in our backbone network caused an outage for Internet properties and Cloudflare services that lasted 27 minutes. We saw traffic drop by about 50% across our network. Because of the architecture of our backbone this outage didn’t affect the entire Cloudflare network and was localized to certain geographies.

The outage occurred because, while working on an unrelated issue with a segment of the backbone from Newark to Chicago, our network engineering team updated the configuration on a router in Atlanta to alleviate congestion. This configuration contained an error that caused all traffic across our backbone to be sent to Atlanta. This quickly overwhelmed the Atlanta router and caused Cloudflare network locations connected to the backbone to fail.

The affected locations were San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre. Other locations continued to operate normally.

For the avoidance of doubt: this was not caused by an attack or breach of any kind.

We are sorry for this outage and have already made a global change to the backbone configuration that will prevent it from being able to occur again.

The Cloudflare Backbone

Cloudflare outage on July 17, 2020

Cloudflare operates a backbone between many of our data centers around the world. The backbone is a series of private lines between our data centers that we use for faster and more reliable paths between them. These links allow us to carry traffic between different data centers, without going over the public Internet.

We use this, for example, to reach a website origin server sitting in New York, carrying requests over our private backbone to both San Jose, California, as far as Frankfurt or São Paulo. This additional option to avoid the public Internet allows a higher quality of service, as the private network can be used to avoid Internet congestion points. With the backbone, we have far greater control over where and how to route Internet requests and traffic than the public Internet provides.


All timestamps are UTC.

First, an issue occurred on the backbone link between Newark and Chicago which led to backbone congestion in between Atlanta and Washington, DC.

In responding to that issue, a configuration change was made in Atlanta. That change started the outage at 21:12. Once the outage was understood, the Atlanta router was disabled and traffic began flowing normally again at 21:39.

Shortly after, we saw congestion at one of our core data centers that processes logs and metrics, causing some logs to be dropped. During this period the edge network continued to operate normally.

  • 20:25: Loss of backbone link between EWR and ORD
  • 20:25: Backbone between ATL and IAD is congesting
  • 21:12 to 21:39: ATL attracted traffic from across the backbone
  • 21:39 to 21:47: ATL dropped from the backbone, service restored
  • 21:47 to 22:10: Core congestion caused some logs to drop, edge continues operating
  • 22:10: Full recovery, including logs and metrics

Here’s a view of the impact from Cloudflare’s internal traffic manager tool. The red and orange region at the top shows CPU utilization in Atlanta reaching overload, and the white regions show affected data centers seeing CPU drop to near zero as they were no longer handling traffic. This is the period of the outage.

Other, unaffected data centers show no change in their CPU utilization during the incident. That’s indicated by the fact that the green color does not change during the incident for those data centers.

Cloudflare outage on July 17, 2020

What happened and what we’re doing about it

As there was backbone congestion in Atlanta, the team had decided to remove some of Atlanta’s backbone traffic. But instead of removing the Atlanta routes from the backbone, a one line change started leaking all BGP routes into the backbone.

atl01# show | compare 
[edit policy-options policy-statement 6-BBONE-OUT term 6-SITE-LOCAL from]
!       inactive: prefix-list 6-SITE-LOCAL { ... }

The complete term looks like this:

from {
    prefix-list 6-SITE-LOCAL;
then {
    local-preference 200;
    community add SITE-LOCAL-ROUTE;
    community add ATL01;
    community add NORTH-AMERICA;

This term sets the local-preference, adds some communities, and accepts the routes that match the prefix-list. Local-preference is a transitive property on iBGP sessions (it will be transferred to the next BGP peer).

The correct change would have been to deactivate the term instead of the prefix-list.

By removing the prefix-list condition, the router was instructed to send all its BGP routes to all other backbone routers, with an increased local-preference of 200. Unfortunately at the time, local routes that the edge routers received from our compute nodes had a local-preference of 100. As the higher local-preference wins, all of the traffic meant for local compute nodes went to Atlanta compute nodes instead.

With the routes sent out, Atlanta started attracting traffic from across the backbone.

We are making the following changes:

  • Introduce a maximum-prefix limit on our backbone BGP sessions – this would have shut down the backbone in Atlanta, but our network is built to function properly without a backbone. This change will be deployed on Monday, July 20.
  • Change the BGP local-preference for local server routes. This change will prevent a single location from attracting other locations’ traffic in a similar manner. This change has been deployed following the incident.


We’ve never experienced an outage on our backbone and our team responded quickly to restore service in the affected locations, but this was a very painful period for everyone involved. We are sorry for the disruption to our customers and to all the users who were unable to access Internet properties while the outage was happening.

We’ve already made changes to the backbone configuration to make sure that this cannot happen again, and further changes will resume on Monday.

An AWS Elasticsearch Post-Mortem

Post Syndicated from Bozho original https://techblog.bozho.net/aws-elasticsearch-post-mortem/

So it happened that we had a production issue on the SaaS version of LogSentinel – our Elasticsearch stopped indexing new data. There was no data loss, as elasticsearch is just a secondary storage, but it caused some issues for our customers (they could not see the real-time data on their dashboards). Below is a post-mortem analysis – what happened, why it happened, how we handled it and how we can prevent it.

Let me start with a background of how the system operates – we accept audit trail entries (logs) through a RESTful API (or syslog), and push them to a Kafka topic. Then the Kafka topic is consumed to store the data in the primary storage (Cassandra) and index it for better visualization and analysis in Elasticsearch. The managed AWS Elasticsearch service was chosen because it saves you all the overhead of cluster management, and as a startup we want to minimize our infrastructure management efforts. That’s a blessing and a curse, as we’ll see below.

We have alerting enabled on many elements, including the Elasticsearch storage space and the number of application errors in the log files. This allows us to respond quickly to issues. So the “high number of application errors” alarm triggered. Indexing was blocked due to FORBIDDEN/8/index write. We have a system call that enables it, so I tried to run it, but after less than a minute it was blocked again. This meant that our Kafka consumers failed to process the messages, which is fine, as we have a sufficient message retention period in Kafka, so no data can be lost.

I investigated the possible reasons for such a block. And there are two, according to Amazon – increased JVM memory pressure and low disk space. I checked the metrics and everything looked okay – JVM memory pressure was barely reaching 70% (and 75% is the threshold), and there was more than 200GiB free storage. There was only one WARN in the elasticsearch application logs (it was “node failure”, but after that there were no issues reported)

There was another strange aspect of the issue – there were twice as many nodes as configured. This usually happens during upgrades, as AWS is using blue/green deployment for Elasticsearch, but we haven’t done any upgrade recently. These additional nodes usually go away after a short period of time (after the redeployment/upgrade is ready), but they wouldn’t go away in this case.

Being unable to SSH into the actual machine, being unable to unblock the indexing through Elasticsearch means, and being unable to shut down or restart the nodes, I raised a ticket with support. And after a few ours and a few exchanged messages, the problem was clear and resolved.

The main reason for the issue is 2-fold. First, we had a configuration that didn’t reflect the cluster status – we had assumed a bit more nodes and our shared and replica configuration meant we have unassigned replicas (more on shards and replicas here and here). The best practice is to have nodes > number of replicas, so that each node gets one replica (plus the main shard). Having unassigned shard replicas is not bad per se, and there are legitimate cases for it. Our can probably be seen as misconfiguration, but not one with immediate negative effects. We chose those settings in part because it’s not possible to change some settings in AWS after a cluster is created. And opening and closing indexes is not supported.

The second issue is AWS Elasticsearch logic for calculating free storage in their circuit breaker that blocks indexing. So even though there were 200+ GiB free space on each of the existing nodes, AWS Elasticsearch thought we were out of space and blocked indexing. There was no way for us to see that, as we only see the available storage, not what AWS thinks is available. So, the calculation gets the total number of shards+replicas and multiplies it by the per-shard storage. Which means unassigned replicas that do not take actual space are calculated as if they take up space. That logic is counterintuitive (if not plain wrong), and there is hardly a way to predict it.

This logic appears to be triggered when blue/green deployment occurs – so in normal operation the actual remaining storage space is checked, but during upgrades, the shard-based check is triggered. That has blocked the entire cluster. But what triggered the blue/green deployment process?

We occasionally need access to Kibana, and because of our strict security rules it is not accessible to anyone by default. So we temporarily change the access policy to allow access from our office IP(s). This change is not expected to trigger a new deployment, and has never lead to that. AWS documentation, however, states:

In most cases, the following operations do not cause blue/green deployments: Changing access policy, Changing the automated snapshot hour, If your domain has dedicated master nodes, changing data instance count.
There are some exceptions. For example, if you haven’t reconfigured your domain since the launch of three Availability Zone support, Amazon ES might perform a one-time blue/green deployment to redistribute your dedicated master nodes across Availability Zones.

There are other exceptions, apparently, and one of them happened to us. That lead to the blue/green deployment, which in turn, because of our flawed configuration, triggered the index block based on the odd logic to assume unassigned replicas as taking up storage space.

How we fixed it – we recreated the index with fewer replicas and started a reindex (it takes data from the primary source and indexes it in batches). That reduced the size taken and AWS manually intervened to “unstuck” the blue/green deployment. Once the problem was known, the fix was easy (and we have to recreate the index anyway due to other index configuration changes). It’s appropriate to (once again) say how good AWS support is, in both fixing the issue and communicating it.

As I said in the beginning, this did not mean there’s data loss because we have Kafka keep the messages for a sufficient amount of time. However, once the index was writable, we expected the consumer to continue from the last successful message – we have specifically written transactional behaviour that committed the offsets only after successful storing in the primary storage and successful indexing. Unfortunately, the kafka client we are using had auto-commit turned on that we have overlooked. So the consumer has skipped past the failed messages. They are still in Kafka and we are processing them with a separate tool, but that showed us that our assumption was wrong and the fact that the code calls “commit” doesn’t actually mean something.

So, the morals of the story:

  • Monitor everything. Bad things happen, it’s good to learn about them quickly.
  • Check your production configuration and make sure it’s adequate to the current needs. Be it replicas, JVM sizes, disk space, number of retries, auto-scaling rules, etc.
  • Be careful with managed cloud services. They save a lot of effort but also take control away from you. And they may have issues for which your only choice is contacting support.
  • If providing managed services, make sure you show enough information about potential edge cases. An error console, an activity console, or something, that would allow the customer to know what is happening.
  • Validate your assumptions about default settings of your libraries. (Ideally, libraries should warn you if you are doing something not expected in the current state of configuration)
  • Make sure your application is fault-tolerant, i.e. that failure in one component doesn’t stop the world and doesn’t lead to data loss.

To sum it up, a rare event unexpectedly triggered a blue/green deployment, where a combination of flawed configuration and flawed free space calculation resulted in an unwritable cluster. Fortunately, no data is lost and at least I learned something.

The post An AWS Elasticsearch Post-Mortem appeared first on Bozho's tech blog.

Details of the Cloudflare outage on July 2, 2019

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/

Almost nine years ago, Cloudflare was a tiny company and I was a customer not an employee. Cloudflare had launched a month earlier and one day alerting told me that my little site, jgc.org, didn’t seem to have working DNS any more. Cloudflare had pushed out a change to its use of Protocol Buffers and it had broken DNS.

I wrote to Matthew Prince directly with an email titled “Where’s my dns?” and he replied with a long, detailed, technical response (you can read the full email exchange here) to which I replied:

From: John Graham-Cumming
Date: Thu, Oct 7, 2010 at 9:14 AM
Subject: Re: Where's my dns?
To: Matthew Prince

Awesome report, thanks. I'll make sure to call you if there's a
problem.  At some point it would probably be good to write this up as
a blog post when you have all the technical details because I think
people really appreciate openness and honesty about these things.
Especially if you couple it with charts showing your post launch
traffic increase.

I have pretty robust monitoring of my sites so I get an SMS when
anything fails.  Monitoring shows I was down from 13:03:07 to
14:04:12.  Tests are made every five minutes.

It was a blip that I'm sure you'll get past.  But are you sure you
don't need someone in Europe? :-)

To which he replied:

From: Matthew Prince
Date: Thu, Oct 7, 2010 at 9:57 AM
Subject: Re: Where's my dns?
To: John Graham-Cumming

Thanks. We've written back to everyone who wrote in. I'm headed in to
the office now and we'll put something on the blog or pin an official
post to the top of our bulletin board system. I agree 100%    
transparency is best.

And so, today, as an employee of a much, much larger Cloudflare I get to be the one who writes, transparently about a mistake we made, its impact and what we are doing about it.

The events of July 2

On July 2, we deployed a new rule in our WAF Managed Rules that caused CPUs to become exhausted on every CPU core that handles HTTP/HTTPS traffic on the Cloudflare network worldwide. We are constantly improving WAF Managed Rules to respond to new vulnerabilities and threats. In May, for example, we used the speed with which we can update the WAF to push a rule to protect against a serious SharePoint vulnerability. Being able to deploy rules quickly and globally is a critical feature of our WAF.

Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving. This brought down Cloudflare’s core proxying, CDN and WAF functionality. The following graph shows CPUs dedicated to serving HTTP/HTTPS traffic spiking to nearly 100% usage across the servers in our network.

CPU utilization in one of our PoPs during the incident

This resulted in our customers (and their customers) seeing a 502 error page when visiting any Cloudflare domain. The 502 errors were generated by the front line Cloudflare web servers that still had CPU cores available but were unable to reach the processes that serve HTTP/HTTPS traffic.

We know how much this hurt our customers. We’re ashamed it happened. It also had a negative impact on our own operations while we were dealing with the incident.

It must have been incredibly stressful, frustrating and frightening if you were one of our customers. It was even more upsetting because we haven’t had a global outage for six years.

The CPU exhaustion was caused by a single WAF rule that contained a poorly written regular expression that ended up creating excessive backtracking. The regular expression that was at the heart of the outage is (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Although the regular expression itself is of interest to many people (and is discussed more below), the real story of how the Cloudflare service went down for 27 minutes is much more complex than “a regular expression went bad”. We’ve taken the time to write out the series of events that lead to the outage and kept us from responding quickly. And, if you want to know more about regular expression backtracking and what to do about it, then you’ll find it in an appendix at the end of this post.

What happened

Let’s begin with the sequence of events. All times in this blog are UTC.

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. This generated a Change Request ticket. We use Jira to manage these tickets and a screenshot is below.

Three minutes later the first PagerDuty page went out indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF (we have hundreds of such tests) from outside Cloudflare to ensure that it is working correctly. This was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors and then many reports from our points-of-presence (PoPs) in cities worldwide indicating there was CPU exhaustion.

Some of these alerts hit my watch and I jumped out of the meeting I was in and was on my way back to my desk when a leader in our Solutions Engineering group told me we had lost 80% of our traffic. I ran over to SRE where the team was debugging the situation. In the initial moments of the outage there was speculation it was an attack of some type we’d never seen before.

Cloudflare’s SRE team is distributed around the world, with continuous, around-the-clock coverage. Alerts like these, the vast majority of which are noting very specific issues of limited scopes in localized areas, are monitored in internal dashboards and addressed many times every day. This pattern of pages and alerts, however, indicated that something gravely serious had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.

The London engineering team was at that moment in our main event space listening to an internal tech talk. The talk was interrupted and everyone assembled in a large conference room and others dialed-in. This wasn’t a normal problem that SRE could handle alone, it needed every relevant team online at once.

At 14:00 the WAF was identified as the component causing the problem and an attack dismissed as a possibility. The Performance Team pulled live CPU data from a machine that clearly showed the WAF was responsible. Another team member used strace to confirm. Another team saw error logs indicating the WAF was in trouble. At 14:02 the entire team looked at me when it was proposed that we use a ‘global kill’, a mechanism built into Cloudflare to disable a single component worldwide.

But getting to the global WAF kill was another story. Things stood in our way. We use our own products and with our Access service down we couldn’t authenticate to our internal control panel (and once we were back we’d discover that some members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently).

And we couldn’t get to other internal services like Jira or the build system. To get to them we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). Eventually, a team member executed the global WAF kill at 14:07 and by 14:09 traffic levels and CPU were back to expected levels worldwide. The rest of Cloudflare’s protection mechanisms continued to operate.

Then we moved on to restoring the WAF functionality. Because of the sensitivity of the situation we performed both negative tests (asking ourselves “was it really that particular change that caused the problem?”) and positive tests (verifying the rollback worked) in a single city using a subset of traffic after removing our paying customers’ traffic from that location.

At 14:52 we were 100% satisfied that we understood the cause and had a fix in place and the WAF was re-enabled globally.

How Cloudflare operates

Cloudflare has a team of engineers who work on our WAF Managed Rules product; they are constantly working to improve detection rates, lower false positives, and respond rapidly to new threats as they emerge. In the last 60 days, 476 change requests have been handled for the WAF Managed Rules (averaging one every 3 hours).

This particular change was to be deployed in “simulate” mode where real customer traffic passes through the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and measure its false positive and false negative rate. But even in the simulate mode the rules actually need to execute and in this case the rule contained a regular expression that consumed excessive CPU.

As can be seen from the Change Request above there’s a deployment plan, a rollback plan and a link to the internal Standard Operating Procedure (SOP) for this type of deployment. The SOP for a rule change specifically allows it to be pushed globally. This is very different from all the software we release at Cloudflare where the SOP first pushes software to an internal dogfooding network point of presence (PoP) (which our employees pass through), then to a small number of customers in an isolated location, followed by a push to a large number of customers and finally to the world.

The process for a software release looks like this: We use git internally via BitBucket. Engineers working on changes push code which is built by TeamCity and when the build passes, reviewers are assigned. Once a pull request is approved the code is built and the test suite runs (again).

If the build and tests pass then a Change Request Jira is generated and the change has to be approved by the relevant manager or technical lead. Once approved deployment to what we call the “animal PoPs” occurs: DOG, PIG, and the Canaries.

The DOG PoP is a Cloudflare PoP (just like any of our cities worldwide) but it is used only by Cloudflare employees. This dogfooding PoP enables us to catch problems early before any customer traffic has touched the code. And it frequently does.

If the DOG test passes successfully code goes to PIG (as in “Guinea Pig”). This is a Cloudflare PoP where a small subset of customer traffic from non-paying customers passes through the new code.

If that is successful the code moves to the Canaries. We have three Canary PoPs spread across the world and run paying and non-paying customer traffic running through them on the new code as a final check for errors.

Cloudflare software release process

Once successful in Canary the code is allowed to go live. The entire DOG, PIG, Canary, Global process can take hours or days to complete, depending on the type of code change. The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a release is pushed to all our customers globally. But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats.

WAF Threats

In the last few years we have seen a dramatic increase in vulnerabilities in common applications. This has happened due to the increased availability of software testing tools, like fuzzing for example (we just posted a new blog on fuzzing here).

Source: https://cvedetails.com/

What is commonly seen is a Proof of Concept (PoC) is created and often published on Github quickly, so that teams running and maintaining applications can test to make sure they have adequate protections. Because of this, it’s imperative that Cloudflare are able to react as quickly as possible to new attacks to give our customers a chance to patch their software.

A great example of how Cloudflare proactively provided this protection was through the deployment of our protections against the SharePoint vulnerability in May (blog here). Within a short space of time from publicised announcements, we saw a huge spike in attempts to exploit our customer’s Sharepoint installations. Our team continuously monitors for new threats and writes rules to mitigate them on behalf of our customers.

The specific rule that caused last Tuesday’s outage was targeting Cross-site scripting (XSS) attacks. These too have increased dramatically in recent years.

Source: https://cvedetails.com/

The standard procedure for a WAF Managed Rules change indicates that Continuous Integration (CI) tests must pass prior to a global deploy. That happened normally last Tuesday and the rules were deployed. At 13:31 an engineer on the team had merged a Pull Request containing the change after it was approved.

At 13:37 TeamCity built the rules and ran the tests, giving it the green light. The WAF test suite tests that the core functionality of the WAF works and consists of a large collection of unit tests for individual matching functions. After the unit tests run the individual WAF rules are tested by executing a huge collection of HTTP requests against the WAF. These HTTP requests are designed to test requests that should be blocked by the WAF (to make sure it catches attacks) and those that should be let through (to make sure it isn’t over-blocking and creating false positives). What it didn’t do was test for runaway CPU utilization by the WAF and examining the log files from previous WAF builds shows that no increase in test suite run time was observed with the rule that would ultimately cause CPU exhaustion on our edge.

With the tests passing, TeamCity automatically began deploying the change at 13:42.


Because WAF rules are required to address emergent threats they are deployed using our Quicksilver distributed key-value (KV) store that can push changes globally in seconds. This technology is used by all our customers when making configuration changes in our dashboard or via the API and is the backbone of our service’s ability to respond to changes very, very rapidly.

We haven’t really talked about Quicksilver much. We previously used Kyoto Tycoon as a globally distributed key-value store, but we ran into operational issues with it and wrote our own KV store that is replicated across our more than 180 cities. Quicksilver is how we push changes to customer configuration, update WAF rules, and distribute JavaScript code written by customers using Cloudflare Workers.

From clicking a button in the dashboard or making an API call to change configuration to that change coming into effect takes seconds, globally. Customers have come to love this high speed configurability. And with Workers they expect near instant, global software deployment. On average Quicksilver distributes about 350 changes per second.

And Quicksilver is very fast.  On average we hit a p99 of 2.29s for a change to be distributed to every machine worldwide. Usually, this speed is a great thing. It means that when you enable a feature or purge your cache you know that it’ll be live globally nearly instantly. When you push code with Cloudflare Workers it’s pushed out a the same speed. This is part of the promise of Cloudflare fast updates when you need them.

However, in this case, that speed meant that a change to the rules went global in seconds. You may notice that the WAF code uses Lua. Cloudflare makes use of Lua extensively in production and details of the Lua in the WAF have been discussed before. The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression. More on that and what we’re doing about it below.

Everything that occurred up to the point the rules were deployed was done “correctly”: a pull request was raised, it was approved, CI/CD built the code and tested it, a change request was submitted with an SOP detailing rollout and rollback, and the rollout was executed.

Cloudflare WAF deployment process

What went wrong

As noted, we deploy dozens of new rules to the WAF every week, and we have numerous systems in place to prevent any negative impact of that deployment. So when things do go wrong, it’s generally the unlikely convergence of multiple causes. Getting to a single root cause, while satisfying, may obscure the reality. Here are the multiple vulnerabilities that converged to get to the point where Cloudflare’s service for HTTP/HTTPS went offline.

  1. An engineer wrote a regular expression that could easily backtrack enormously.
  2. A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
  3. The regular expression engine being used didn’t have complexity guarantees.
  4. The test suite didn’t have a way of identifying excessive CPU consumption.
  5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
  6. The rollback plan required running the complete WAF build twice taking too long.
  7. The first alert for the global traffic drop took too long to fire.
  8. We didn’t update our status page quickly enough.
  9. We had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on.
  10. SREs had lost access to some systems because their credentials had been timed out for security reasons.
  11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.

What’s happened since last Tuesday

Firstly, we stopped all release work on the WAF completely and are doing the following:

  1. Re-introduce the excessive CPU usage protection that got removed. (Done)
  2. Manually inspecting all 3,868 rules in the WAF Managed Rules to find and correct any other instances of possible excessive backtracking. (Inspection complete)
  3. Introduce performance profiling for all rules to the test suite. (ETA:  July 19)
  4. Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)
  5. Changing the SOP to do staged rollouts of rules in the same manner used for other software at Cloudflare while retaining the ability to do emergency global deployment for active attacks.
  6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare’s edge.
  7. Automating update of the Cloudflare Status page.

In the longer term we are moving away from the Lua WAF that I wrote years ago. We are porting the WAF to use the new firewall engine. This will make the WAF both faster and add yet another layer of protection.


This was an upsetting outage for our customers and for the team. We responded quickly to correct the situation and are correcting the process deficiencies that allowed the outage to occur and going deeper to protect against any further possible problems with the way we use regular expressions by replacing the underlying technology used.

We are ashamed of the outage and sorry for the impact on our customers. We believe the changes we’ve made mean such an outage will never recur.

Appendix: About Regular Expression Backtracking

To fully understand how (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))  caused CPU exhaustion you need to understand a little about how a standard regular expression engine works. The critical part is .*(?:.*=.*). The (?: and matching ) are a non-capturing group (i.e. the expression inside the parentheses is grouped together as a single expression).

For the purposes of the discussion of why this pattern causes CPU exhaustion we can safely ignore it and treat the pattern as .*.*=.*. When reduced to this, the pattern obviously looks unnecessarily complex; but what’s important is any “real-world” expression (like the complex ones in our WAF rules) that ask the engine to “match anything followed by anything” can lead to catastrophic backtracking. Here’s why.

In a regular expression, . means match a single character, .* means match zero or more characters greedily (i.e. match as much as possible) so .*.*=.* means match zero or more characters, then match zero or more characters, then find a literal = sign, then match zero or more characters.

Consider the test string x=x. This will match the expression .*.*=.*. The .*.* before the equal can match the first x (one of the .* matches the x, the other matches zero characters). The .* after the = matches the final x.

It takes 23 steps for this match to happen. The first .* in .*.*=.* acts greedily and matches the entire x=x string. The engine moves on to consider the next .*. There are no more characters left to match so the second .* matches zero characters (that’s allowed). Then the engine moves on to the =. As there are no characters left to match (the first .* having consumed all of x=x) the match fails.

At this point the regular expression engine backtracks. It returns to the first .* and matches it against x= (instead of x=x) and then moves onto the second .*. That .* matches the second x and now there are no more characters left to match. So when the engine tries to match the = in .*.*=.* the match fails. The engine backtracks again.

This time it backtracks so that the first .* is still matching x= but the second .* no longer matches x; it matches zero characters. The engine then moves on to try to find the literal = in the .*.*=.* pattern but it fails (because it was already matched against the first .*). The engine backtracks again.

This time the first .* matches just the first x. But the second .* acts greedily and matches =x. You can see what’s coming. When it tries to match the literal = it fails and backtracks again.

The first .* still matches just the first x. Now the second .* matches just =. But, you guessed it, the engine can’t match the literal = because the second .* matched it. So the engine backtracks again. Remember, this is all to match a three character string.

Finally, with the first .* matching just the first x, the second .* matching zero characters the engine is able to match the literal = in the expression with the = in the string. It moves on and the final .* matches the final x.

23 steps to match x=x. Here’s a short video of that using the Perl Regexp::Debugger showing the steps and backtracking as they occur.

That’s a lot of work but what happens if the string is changed from x=x to x=xx? This time is takes 33 steps to match. And if the input is x=xxx it takes 45. That’s not linear. Here’s a chart showing matching from x=x to x=xxxxxxxxxxxxxxxxxxxx (20 x’s after the =). With 20 x’s after the = the engine takes 555 steps to match! (Worse, if the x= was missing, so the string was just 20 x’s, the engine would take 4,067 steps to find the pattern doesn’t match).

This video shows all the backtracking necessary to match x=xxxxxxxxxxxxxxxxxxxx:

That’s bad because as the input size goes up the match time goes up super-linearly. But things could have been even worse with a slightly different regular expression. Suppose it had been .*.*=.*; (i.e. there’s a literal semicolon at the end of the pattern). This could easily have been written to try to match an expression like foo=bar;.

This time the backtracking would have been catastrophic. To match x=x takes 90 steps instead of 23. And the number of steps grows very quickly. Matching x= followed by 20 x’s takes 5,353 steps. Here’s the corresponding chart. Look carefully at the Y-axis values compared the previous chart.

To complete the picture here are all 5,353 steps of failing to match x=xxxxxxxxxxxxxxxxxxxx against .*.*=.*;

Using lazy rather than greedy matches helps control the amount of backtracking that occurs in this case. If the original expression is changed to .*?.*?=.*? then matching x=x takes 11 steps (instead of 23) and so does matching x=xxxxxxxxxxxxxxxxxxxx. That’s because the ? after the .* instructs the engine to match the smallest number of characters first before moving on.

But laziness isn’t the total solution to this backtracking behaviour. Changing the catastrophic example .*.*=.*; to .*?.*?=.*?; doesn’t change its run time at all. x=x still takes 555 steps and x= followed by 20 x’s still takes 5,353 steps.

The only real solution, short of fully re-writing the pattern to be more specific, is to move away from a regular expression engine with this backtracking mechanism. Which we are doing within the next few weeks.

The solution to this problem has been known since 1968 when Ken Thompson wrote a paper titled “Programming Techniques: Regular expression search algorithm”. The paper describes a mechanism for converting a regular expression into an NFA (non-deterministic finite automata) and then following the state transitions in the NFA using an algorithm that executes in time linear in the size of the string being matched against.

Thompson’s paper doesn’t actually talk about NFA but the linear time algorithm is clearly explained and an ALGOL-60 program that generates assembly language code for the IBM 7094 is presented. The implementation may be arcane but the idea it presents is not.

Here’s what the .*.*=.* regular expression would look like when diagrammed in a similar manner to the pictures in Thompson’s paper.

Figure 0 has five states starting at 0. There are three loops which begin with the states 1, 2 and 3. These three loops correspond to the three .* in the regular expression. The three lozenges with dots in them match a single character. The lozenge with an = sign in it matches the literal = sign. State 4 is the ending state, if reached then the regular expression has matched.

To see how such a state diagram can be used to match the regular expression .*.*=.* we’ll examine matching the string x=x. The program starts in state 0 as shown in Figure 1.

The key to making this algorithm work is that the state machine is in multiple states at the same time. The NFA will take every transition it can, simultaneously.

Even before it reads any input, it immediately transitions to both states 1 and 2 as shown in Figure 2.

Looking at Figure 2 we can see what happened when it considers  first x in x=x. The x can match the top dot by transitioning from state 1 and back to state 1. Or the x can match the dot below it by transitioning from state 2 and back to state 2.

So after matching the first x in x=x the states are still 1 and 2. It’s not possible to reach state 3 or 4 because a literal = sign is needed.

Next the algorithm considers the = in x=x. Much like the x before it, it can be matched by either of the top two loops transitioning from state 1 to state 1 or state 2 to state 2, but additionally the literal = can be matched and the algorithm can transition state 2 to state 3 (and immediately state 4). That’s illustrated in Figure 3.

Next the algorithm reaches the final x in x=x. From states 1 and 2 the same transitions are possible back to states 1 and 2. From state 3 the x can match the dot on the right and transition back to state 3.

At that point every character of x=x has been considered and because state 4 has been reached the regular expression matches that string. Each character was processed once so the algorithm was linear in the length of the input string. And no backtracking was needed.

It might also be obvious that once state 4 was reached (after x= was matched) the regular expression had matched and the algorithm could terminate without considering the final x at all.

This algorithm is linear in the size of its input.

Détails de la panne Cloudflare du 2 juillet 2019

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019-fr/

Il y a près de neuf ans, Cloudflare était une toute petite entreprise dont j’étais le client, et non l’employé. Cloudflare était sorti depuis un mois et un jour, une notification m’alerte que mon petit site,  jgc.org, semblait ne plus disposer d’un DNS fonctionnel. Cloudflare avait effectué une modification dans l’utilisation de Protocol Buffers qui avait endommagé le DNS.

J’ai contacté directement Matthew Prince avec un e-mail intitulé « Où est mon DNS ? » et il m’a envoyé une longue réponse technique et détaillée (vous pouvez lire tous nos échanges d’e-mails ici) à laquelle j’ai répondu :

De: John Graham-Cumming
Date: Jeudi 7 octobre 2010 à 09:14
Objet: Re: Où est mon DNS?
À: Matthew Prince

Superbe rapport, merci. Je veillerai à vous appeler s’il y a un
problème.  Il serait peut-être judicieux, à un certain moment, d’écrire tout cela dans un article de blog, lorsque vous aurez tous les détails techniques, car je pense que les gens apprécient beaucoup la franchise et l’honnêteté sur ce genre de choses. Surtout si vous y ajoutez les tableaux qui montrent l’augmentation du trafic suite à votre lancement.

Je dispose d’un système robuste de surveillance de mes sites qui m’envoie un SMS en cas de problème.  La surveillance montre une panne de 13:03:07 à 14:04:12.  Des tests sont réalisés toutes les cinq minutes.

Un accident de parcours dont vous vous relèverez certainement.  Mais ne pensez-vous pas qu’il vous faudrait quelqu’un en Europe? :-)

Ce à quoi il a répondu :

De: Matthew Prince
Date: Jeudi 7 octobre 2010 à 09:57
Objet: Re: Où est mon DNS?
À: John Graham-Cumming

Merci. Nous avons répondu à tous ceux qui nous ont contacté. Je suis en route vers le bureau et nous allons publier un article sur le blog ou épingler un post officiel en haut de notre panneau d’affichage. Je suis parfaitement d’accord, la transparence est nécessaire.

Aujourd’hui, en tant qu’employé d’un Cloudflare bien plus grand, je vous écris de manière transparente à propos d’une erreur que nous avons commise, de son impact et de ce que nous faisons pour régler le problème.

Les événements du 2 juillet

Le 2 juillet, nous avons déployé une nouvelle règle dans nos règles gérées du pare-feu applicatif Web (WAF) qui ont engendré un surmenage des processeurs sur chaque cœur de processeur traitant le trafic HTTP/HTTPS sur le réseau Cloudflare à travers le monde. Nous améliorons constamment les règles gérées du WAF pour répondre aux nouvelles menaces et vulnérabilités. En mai, par exemple, nous avons utilisé la vitesse avec laquelle nous pouvons mettre à jour le WAF pour appliquer une règle et nous protéger d’une vulnérabilité SharePoint importante. Être capable de déployer rapidement et globalement des règles est une fonctionnalité essentielle de notre WAF.

Malheureusement, la mise à jour de mardi dernier contenait une expression régulière qui nous a fait reculer énormément et qui a épuisé les processeurs utilisés pour la distribution HTTP/HTTPS. Les fonctionnalités essentielles de mise en proxy, du CDN et du WAF de Cloudflare sont tombées en panne. Le graphique suivant présente les processeurs dédiés à la distribution du trafic HTTP/HTTPS atteindre presque 100 % d’utilisation sur les serveurs de notre réseau.

Utilisation des processeurs pendant l’incident dans l’un de nos PoP

En conséquence, nos clients (et leurs clients) voyaient une page d’erreur 502 lorsqu’ils visitaient n’importe quel domaine Cloudflare. Les erreurs 502 étaient générées par les serveurs Web frontaux Cloudflare dont les cœurs de processeurs étaient encore disponibles mais incapables d’atteindre les processus qui distribuent le trafic HTTP/HTTPS.

Nous réalisons à quel points nos clients ont été affectés. Nous sommes terriblement gênés qu’une telle chose se soit produite. L’impact a été négatif également pour nos propres activités lorsque nous avons traité l’incident.

Cela a du être particulièrement stressant, frustrant et angoissant si vous étiez l’un de nos clients. Le plus regrettable est que nous n’avions pas eu de panne globale depuis six ans.

Le surmenage des processeurs fut causé par une seule règle WAF qui contenait une expression régulière mal écrite et qui a engendré un retour en arrière excessif. L’expression régulière au cœur de la panne est (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Bien que l’expression régulière soit intéressante pour de nombreuses personnes (et évoquée plus en détails ci-dessous), comprendre ce qui a causé la panne du service Cloudflare pendant 27 minutes est bien plus complexe « qu’une expression régulière qui a mal tourné ». Nous avons pris le temps de noter les séries d’événements qui ont engendré la panne et nous ont empêché de réagir rapidement. Et si vous souhaitez en savoir plus sur le retour en arrière d’une expression régulière et que faire dans ce cas, veuillez consulter l’annexe à la fin de cet article.

Ce qui s’est produit

Commençons par l’ordre des événements. Toutes les heures de ce blog sont en UTC.

À 13:42, un ingénieur de l’équipe pare-feu a déployé une légère modification aux règles de détection XSS via un processus automatique. Cela a généré un ticket de requête de modification. Nous utilisons Jira pour gérer ces tickets (capture d’écran ci-dessous).

Trois minutes plus tard, la première page PagerDuty indiquait une erreur avec le pare-feu applicatif Web (WAF). C’était un test synthétique qui vérifie les fonctionnalités du WAF (nous disposons de centaines de tests de ce type) en dehors de Cloudflare pour garantir qu’il fonctionne correctement. Ce test fut rapidement suivi par des pages indiquant de nombreux échecs d’autres tests de bout-en-bout des services Cloudflare, une alerte de perte de trafic globale, des erreurs 502 généralisées, puis par de nombreux rapports de nos points de présence (PoP) dans des villes à travers le monde indiquant un surmenage des processeurs.

Préoccupé par ces alertes, j’ai quitté la réunion à laquelle j’assistais et en me dirigeant vers mon bureau, un responsable de notre groupe Ingénieur Solutions m’a dit que nous avions perdu 80 % de notre trafic. J’ai couru au département SRE où l’équipe était en train de déboguer la situation. Au tout début de la panne, certains pensaient à un type d’attaque que nous n’avions jamais connu auparavant.

L’équipe SRE de Cloudflare est répartie dans le monde entier pour assurer une couverture continue. Les alertes de ce type, dont la majorité identifient des problèmes très spécifiques sur des cadres limités et dans des secteurs précis, sont surveillées dans des tableaux de bord internes et traitées de nombreuses fois chaque jour. Cependant, ce modèle de pages et d’alertes indiquait que quelque chose de grave s’était produit, et le département SRE a immédiatement déclaré un incident P0 et transmis à la direction de l’ingénierie et à l’ingénierie des systèmes.

À ce moment, l’équipe d’ingénieurs de Londres assistait à une conférence technique interne dans notre espace principal d’événements. La discussion fut interrompue et tout le monde s’est rassemblé dans une grande salle de conférence ; les autres se sont connectés à distance. Ce n’était pas un problème normal que le département SRE pouvait régler seul ; nous avions besoin que toutes les équipes appropriées soient en ligne au même moment.

À 14:00, nous avons identifié que le composant à l’origine du problème était le WAF, et nous avons écarté la possibilité d’une attaque. L’équipe Performances a extrait les données en direct du processeur d’une machine qui montrait clairement que le WAF était responsable. Un autre membre de l’équipe a utilisé Strace pour confirmer. Une autre équipe a observé des journaux d’erreur indiquant que le WAF présentait un problème. À 14:02, toute l’équipe m’a regardé : la proposition était d’utiliser un « global kill », un mécanisme intégré à Cloudflare pour désactiver un composant unique partout dans le monde.

Mais utiliser un « global kill » pour le WAF, c’était une toute autre histoire. Nous avons rencontré plusieurs obstacles. Nous utilisons nos propres produits et avec la panne de notre service Access, nous ne pouvions plus authentifier notre panneau de contrôle interne (une fois de retour en ligne, nous avons découvert que certains membres de l’équipe avaient perdu leur accès à cause d’une fonctionnalité de sécurité qui désactive les identifiants s’ils n’utilisent pas régulièrement le panneau de contrôle interne).

Et nous ne pouvions pas atteindre les autres services internes comme Jira ou le moteur de production. Pour les atteindre, nous avons du employer un mécanisme de contournement très rarement utilisé (un autre point à creuser après l’événement). Finalement, un membre de l’équipe a exécuté le « global kill » du WAF à 14:07 ; à 14:09, les niveaux du trafic et des processeurs sont revenus à la normale à travers le monde. Le reste des mécanismes de protection de Cloudflare a continué de fonctionner.

Nous sommes ensuite passés à la restauration de la fonctionnalité WAF. Le côté sensible de la situation nous a poussé à réaliser des tests négatifs (nous demander « est-ce réellement ce changement particulier qui a causé le problème ? ») et des tests positifs (vérifier le fonctionnement du retour) dans une ville unique à l’aide d’un sous-ensemble de trafic après avoir retiré de cet emplacement le trafic de nos clients d’offres payantes.

À 14:52, nous étions certains à 100 % d’avoir compris la cause, résolu le problème et que le WAF était réactivé partout dans le monde.

Fonctionnement de Cloudflare

Cloudflare dispose d’une équipe d’ingénieurs dédiée à notre solution de règles gérées du WAF ; ils travaillent constamment pour améliorer les taux de détection, réduire les faux positifs et répondre rapidement aux nouvelles menaces dès qu’elles apparaissent. Dans les 60 derniers jours, 476 requêtes de modifications ont été traitées pour les règles gérées du WAF (en moyenne une toutes les 3 heures).

Cette modification spécifique fut déployée en mode « simulation » où le trafic client passe au travers de la règle mais rien n’est bloqué. Nous utilisons ce mode pour tester l’efficacité d’une règle et mesurer son taux de faux positifs et de faux négatifs. Cependant, même en mode simulation, les règles doivent être exécutées, et dans cette situation, la règle contenait une expression régulière qui consommait trop de processeur.

On peut observer dans la requête de modification ci-dessus un plan de déploiement, un plan de retour et un lien vers la procédure opérationnelle standard (POS) interne pour ce type de déploiement. La POS pour une modification de règle lui permet d’être appliquée à l’échelle mondiale. Ce processus est très différent de tous les logiciels que nous publions chez Cloudflare, où la POS applique d’abord le logiciel au réseau interne de dogfooding d’un point de présence (PoP) (au travers duquel passent nos employés), puis à quelques clients sur un emplacement isolé, puis à un plus grand nombre de clients et finalement au monde entier.

Le processus pour le lancement d’un logiciel ressemble à cela : Nous utilisons Git en interne via BitBucket. Les ingénieurs qui travaillent sur les modifications intègrent du code créé par TeamCity ; les examinateurs sont désignés une fois la construction validée. Lorsqu’une requête d’extraction est approuvée, le code est créé et la série de tests exécutée (à nouveau).

Si la construction et les tests sont validés, une requête de modification Jira est générée et la modification doit être approuvée par le responsable approprié ou la direction technique. Une fois approuvée, le déploiement de ce que l’on appelle les « PDP animaux » survient : DOG, PIG et les Canaries

Le PoP DOG est un PoP Cloudflare (au même titre que toutes nos villes à travers le monde) mais qui n’est utilisé que par les employés Cloudflare. Ce PoP de dogfooding nous permet d’identifier les problèmes avant que le trafic client n’atteigne le code. Et il le fait régulièrement.

Si le test DOG réussit, le code est transféré vers PIG (en référence au « cobaye » ou « cochon d’Inde »). C’est un PoP Cloudflare sur lequel un petit sous-ensemble de trafic client provenant de clients de l’offre gratuite passe au travers du nouveau code.

Si le test réussit, le code est transféré vers les Canaries. Nous disposons de trois PoP Canaries répartis dans le monde entier ; nous exécutons du trafic client des offres payantes et gratuite qui traverse ces points sur le nouveau code afin de vérifier une dernière fois qu’il n’y a pas d’erreur.

Processus de lancement d’un logiciel Cloudflare

Une fois validé aux Canaries, le code peut être déployé. L’intégralité du processus DOG, PIG, Canari, Global peut prendre plusieurs heures ou plusieurs jours en fonction du type de modification de code. La diversité du réseau et des clients de Cloudflare nous permet de tester rigoureusement le code avant d’appliquer un lancement à tous nos clients partout dans le monde. Cependant, le WAF n’est pas conçu pour utiliser ce processus car il doit répondre rapidement aux menaces.

Menaces WAF

Ces derniers années, nous avons observé une augmentation considérable des vulnérabilités dans les applications communes. C’est une conséquence directe de la disponibilité augmentée des outils de test de logiciels, comme par exemple le fuzzing (nous venons de publier un nouveau blog sur le fuzzing ici).

Source: https://cvedetails.com/

Le plus souvent, une preuve de concept (PoC, Proof of Concept) est créée et publiée rapidement sur Github, afin que les équipes en charge de l’exécution et de la maintenance des applications puissent effectuer des tests et vérifier qu’ils disposent des protections adéquates. Il est donc essentiel que Cloudflare soit capable de réagir aussi vite que possible aux nouvelles attaques pour offrir à nos clients une chance de corriger leur logiciel.

Le déploiement de nos protections contre la vulnérabilité SharePoint au mois de mai illustre parfaitement la façon dont Cloudflare est capable d’offrir une protection de manière proactive (blog ici). Peu de temps après la médiatisation de nos annonces, nous avons observé un énorme pic de tentatives visant à exploiter les installations Sharepoint de notre client. Notre équipe surveille constamment les nouvelles menaces et rédige des règles pour les atténuer pour le compte de nos clients.

La règle spécifique qui a engendré la panne de mardi dernier ciblait les attaques Cross-Site Scripting (XSS). Ce type d’attaque a aussi considérablement augmenté ces dernières années.

Source: https://cvedetails.com/

La procédure standard pour une modification des règles gérées du WAF indique que les tests d’intégration continue (CI) doivent être validés avant un déploiement à l’échelle mondiale. Cela s’est passé normalement mardi dernier et les règles ont été déployées. À 13:31, un ingénieur de l’équipe fusionnait une requête d’extraction contenant la modification déjà approuvée.

À 13:37, TeamCity a construit les règles et exécuté les tests puis donné son feu vert. La série de tests WAF vérifie que les fonctionnalités principales du WAF fonctionnent ; c’est un vaste ensemble de tests individuels ayant chacun une fonction associée. Une fois les tests individuels réalisés, les règles individuelles du WAF sont testées en exécutant une longue série de requêtes HTTP sur le WAF. Ces requêtes HTTP sont conçues pour tester les requêtes qui doivent être bloquées par le WAF (pour s’assurer qu’il identifie les attaques) et celles qui doivent être transmises (pour s’assurer qu’il ne bloque pas trop et ne crée pas de faux positifs). Il n’a pas vérifié si le WAF utilisait excessivement le processeur, et en examinant les journaux des précédentes constructions du WAF, on peut voir qu’aucune augmentation n’est observée pendant l’exécution de la série de tests avec la règle qui aurait engendré le surmenage du processeur sur notre périphérie.

Après le succès des tests, TeamCity a commencé à déployer automatiquement la modification à 13:42.


Les règles du WAF doivent répondre à de nouvelles menaces ; elles sont donc déployées à l’aide de notre base de données clé-valeur Quicksilver, capable d’appliquer des modifications à l’échelle mondiale en quelques secondes. Cette technologie est utilisée par tous nos clients pour réaliser des changements de configuration dans notre tableau de bord ou via l’API. C’est la base de notre service : répondre très rapidement aux modifications.

Nous n’avons pas vraiment abordé Quicksilver. Auparavant, nous utilisions Kyoto Tycoon comme base de données clé-valeur distribuées globalement, mais nous avons rencontré des problèmes opérationnels et rédigé notre propre base de données clé-valeur que nous avons reproduite dans plus de 180 villes. Quicksilver nous permet d’appliquer des modifications de configuration client, de mettre à jour les règles du WAF et de distribuer du code JavaScript rédigé par nos clients à l’aide de Cloudflare Workers.

Cliquer sur un bouton dans le tableau de bord, passer un appel d’API pour modifier la configuration et cette modification prendra effet en quelques secondes à l’échelle mondiale. Le clients adorent cette configurabilité haut débit. Avec Workers, ils peuvent profiter d’un déploiement logiciel global quasi instantané. En moyenne, Quicksilver distribue environ 350 modifications par seconde.

Et Quicksilver est très rapide.  En moyenne, nous atteignons un p99 de 2,29 s pour distribuer une modification sur toutes nos machines à travers le monde. En général, cette vitesse est un avantage. Lorsque vous activez une fonctionnalité ou purgez votre cache, vous êtes sûr que ce sera fait à l’échelle mondiale presque instantanément. Lorsque vous intégrez du code avec Cloudflare Workers, celui-ci est appliqué à la même vitesse. La promesse de Cloudflare est de vous offrir des mises à jour rapides quand vous en avez besoin.

Cependant, dans cette situation, cette vitesse a donc appliqué la modification des règles à l’échelle mondiale en quelques secondes. Vous avez peut-être remarqué que le code du WAF utilise Lua. Cloudflare utilise beaucoup Lua en production, et des informations sur Lua dans le WAF ont été évoquées auparavant. Le WAF Lua utilise PCRE en interne, le retour en arrière pour la correspondance, et ne dispose pas de mécanisme pour se protéger d’une expression étendue. Ci-dessous, plus d’informations sur nos actions.

Tout ce qui s’est passé jusqu’au déploiement des règles a été fait « correctement » : une requête d’extraction a été lancée, puis approuvée, CI/CD a créé le code et l’a testé, une requête de modification a été envoyée avec une procédure opérationnelle standard précisant le déploiement et le retour en arrière, puis le déploiement a été exécuté.

Processus de déploiement du WAF Cloudflare

Les causes du problème

Nous déployons des douzaines de nouvelles règles sur le WAF chaque semaine, et nous avons de nombreux systèmes en place pour empêcher tout impact négatif sur ce déploiement. Quand quelque chose tourne mal, c’est donc généralement la convergence improbable de causes multiples. Trouver une cause racine unique est satisfaisant mais peut obscurcir la réalité. Voici les vulnérabilités multiples qui ont convergé pour engendrer la mise hors ligne du service HTTP/HTTPS de Cloudflare.

  • Un ingénieur a rédigé une expression régulière qui pouvait facilement causer un énorme retour en arrière.
  • Une protection qui aurait pu empêcher l’utilisation excessive du processeur par une expression régulière a été retirée par erreur lors d’une refonte du WAF quelques semaines plus tôt (refonte dont l’objectif était de limiter l’utilisation du processeur par le WAF).
  • Le moteur d’expression régulière utilisé n’avait pas de garanties de complexité.
  • La série de tests ne présentait pas de moyen d’identifier une consommation excessive du processeur.
  • La procédure opérationnelle standard a permis la mise en production globale d’une modification de règle non-urgente sans déploiement progressif.
  • Le plan de retour en arrière nécessitait la double exécution de l’intégralité du WAF, ce qui aurait pris trop de temps.
  • La première alerte de chute du trafic global a mis trop de temps à se déclencher.
  • Nous n’avons pas mis à jour suffisamment rapidement notre page d’état.
  • Nous avons eu des problèmes pour accéder à nos propres systèmes en raison de la panne et la procédure de contournement n’était pas bien préparée.
  • Les ingénieurs fiabilité avaient perdu l’accès à certains systèmes car leurs identifiants ont expiré pour des raisons de sécurité.
  • Nos clients ne pouvaient pas accéder au tableau de bord ou à l’API Cloudflare car ils passent par la périphérie Cloudflare.

Ce qui s’est passé depuis mardi dernier

Tout d’abord, nous avons interrompu la totalité du travail de publication sur le WAF et réalisons les actions suivantes :

  • Réintroduction de la protection en cas d’utilisation excessive du processeur qui avait été retirée. (Terminé)
  • Inspection manuelle de toutes les 3 868 règles des règles gérées du WAF pour trouver et corriger tout autre cas d’éventuel retour en arrière excessif. (Inspection terminée)
  • Introduction de profil de performances pour toutes les règles sur la série de tests. (ETA :  19 juillet)
  • Passage au moteur d’expression régulière re2 ou Rust qui disposent de garanties de temps d’exécution. (ETA : 31 juillet)
  • Modifier la procédure opérationnelle standard afin d’effectuer des déploiements progressifs de règles de la même manière que pour les autres logiciels chez Cloudflare tout en conservant la capacité de réaliser des déploiements globaux d’urgence pour les attaques actives.
  • Mettre en place une capacité d’urgence pour retirer le tableau de bord et l’API Cloudflare de la périphérie de Cloudflare.
  • Mettre à jour automatiquement la page d’état Cloudflare.

À long terme, nous nous éloignons du WAF Lua que j’ai rédigé il y a plusieurs années. Nous déplaçons le WAF pour qu’il utilise le nouveau moteur pare-feu. Cela rendra le WAF plus rapide et offrira une couche de protection supplémentaire.


Ce fut une panne regrettable pour nos clients et pour l’équipe. Nous avons réagi rapidement pour régler la situation et nous réparons les défauts de processus qui ont permis à cette panne de se produire. Afin de nous protéger contre d’éventuels problèmes ultérieurs avec notre utilisation des expressions régulières, nous allons remplacer notre technologie fondamentale.

Nous sommes très gênés par cette panne et désolés de l’impact sur nos clients. Nous sommes convaincus que les changements réalisés nous permettront de ne plus jamais subir une telle panne.

Annexe : À propos du retour en arrière d’expression régulière

Pour comprendre comment (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))  a causé le surmenage du processeur, vous devez vous familiariser avec le fonctionnement d’un moteur d’expression régulière standard. La partie critique est .*(?:.*=.*). Le(?: et les ) correspondantes sont un groupe de non-capture (l’expression entre parenthèses est regroupée dans une seule expression).

Pour identifier la raison pour laquelle ce modèle engendre le surmenage du processeur, nous pouvons l’ignorer et traiter le modèle .*.*=.*. En le réduisant, le modèle paraît excessivement complexe, mais ce qui est important, c’est que toute expression du « monde réel » (comme les expressions complexes au sein de nos règles WAF) qui demande au moteur de « faire correspondre quelque chose suivi de quelque chose » peut engendrer un retour en arrière catastrophique. Explication.

Dans une expression régulière, . signifie correspondre à un caractère unique, .* signifie correspondre à zéro ou plus de caractères abondamment (c’est à dire faire correspondre autant que possible). .*.*=.* signifie donc correspondre à zéro ou plus de caractères, puis à nouveau zéro ou plus de caractères, puis trouver un signe littéral =, puis correspondre à zéro ou plus de caractères.

Prenons la chaîne test x=x. Cela correspondra à l’expression .*.*=.*. Les .*.* précédant le signe = peuvent correspondre au premier x (l’un des .* correspond au x, l’autre correspond à zéro caractère). Le .* après le signe = correspond au x final.

Il faut 23 étapes pour atteindre cette correspondance. Le premier .* de .*.*=.* agit abondamment et correspond à la chaîne complète x=x. Le moteur avance pour examiner le .* suivant. Il ne reste aucun caractère à faire correspondre, le deuxième .* correspond donc à zéro caractère (c’est autorisé). Le moteur passe ensuite au =. La correspondance échoue car il n’y a plus de caractère à faire correspondre (le premier .* ayant consommé l’intégralité de x=x).

À ce moment, le moteur d’expression régulière retourne en arrière. Il retourne au premier  .* et le fait correspondre à x= (au lieu de x=x) et passe ensuite au deuxième .*. Ce .* correspond au deuxième x et il n’y a maintenant plus aucun caractère à associer. Quand le moteur essaie d’associer le = dans .*.*=.*, la correspondance échoue. Le moteur retourne de nouveau en arrière.

Avec ce retour en arrière, le premier .* correspond toujours à x= mais le deuxième .* ne correspond plus à x ; il correspond à zéro caractère. Le moteur avance ensuite pour essayer de trouver le signe littéral = dans le modèle .*.*=.* mais échoue (car il correspond déjà au premier .*). Le moteur retourne de nouveau en arrière.

Cette fois-ci, le premier .* correspond seulement au premier x. Mais le deuxième .* agit abondamment et correspond à =x. Vous imaginez la suite. Lorsqu’il essaie d’associer le signe littéral =, il échoue et retourne de nouveau en arrière.

Le premier .* correspond toujours seulement au premier x. Désormais, le deuxième .* correspond seulement à =. Comme vous l’avez deviné, le moteur ne peut pas associer le signe littéral = car il correspond déjà au deuxième .*. Le moteur retourne donc de nouveau en arrière. Tout cela pour associer une chaîne de trois caractères.

Enfin, avec le premier .* correspondant uniquement au premier x et le deuxième .* correspondant à zéro caractère, le moteur est capable d’associer le signe littéral = dans l’expression avec le = contenu dans la chaîne. Il avance et le dernier .* correspond au dernier x.

23 étapes pour faire correspondre x=x. Voici une courte vidéo de l’utilisation de Perl Regexp::Debugger présentant les étapes et le retour en arrière.

Cela représente beaucoup de travail, mais que se passe-t-il si la chaîne passe de x=x à x=xx ? La correspondance prend cette fois 33 étapes. Et si l’entrée est x=xxx, 45 étapes. Ce n’est pas linéaire. Voici un graphique présentant la correspondance de x=x à x=xxxxxxxxxxxxxxxxxxxx (20 x après le =). Avec 20 x après le =, le moteur requiert 555 étapes pour associer ! (Pire encore, si le x= était manquant et que la chaîne ne contenait que 20 x, le moteur nécessiterait 4 067 étapes pour réaliser que le modèle ne correspond pas).

Cette vidéo montre tout le retour en arrière nécessaire pour correspondre à  x=xxxxxxxxxxxxxxxxxxxx:

Ce n’est pas bon signe car à mesure que la taille de l’entrée augmente, le temps de correspondance augmente de manière ultra linéaire. Mais cela aurait pu être encore pire avec une expression régulière légèrement différente. Prenons l’exemple .*.*=.*; (avec un point-virgule littéral à la fin du modèle). Nous aurions facilement pu rédiger cela pour tenter d’associer une expression comme foo=bar;.

Le retour en arrière aurait été catastrophique. Associer x=x prend 90 étapes au lieu de 23. Et le nombre d’étapes augmente très rapidement. Associer x= suivi de 20 x prend 5 353 étapes. Voici le graphique correspondant : Observez attentivement les valeurs de l’axe Y par rapport au graphique précédent.

Pour terminer, voici toutes les 5 353 étapes de l’échec de la correspondance de x=xxxxxxxxxxxxxxxxxxxx avec .*.*=.*;

L’utilisation de correspondances fainéantes plutôt que gourmandes (lazy / greedy) permet de contrôler l’étendue du retour en arrière qui survient dans ce cas précis. Si l’expression originale est modifiée en .*?.*?=.*?, la correspondance de x=x prend 11 étapes (au lieu de 23), tout comme la correspondance de x=xxxxxxxxxxxxxxxxxxxx. C’est parce que le ? après le .*ordonne au moteur d’associer le plus petit nombre de caractères avant de commencer à avancer.

Mais la fainéantise n’est pas la solution totale à ce comportement de retour en arrière. Modifier l’exemple catastrophique .*.*=.*; en .*?.*?=.*?; n’affecte pas du tout son temps d’exécution. x=x prend toujours 555 étapes et x= suivi de 20 x prend toujours 5 353 étapes.

La seule vraie solution, mise à part la réécriture complète du modèle pour être plus précis, c’est de s’éloigner du moteur d’expression régulière avec ce mécanisme de retour en arrière. Ce que nous faisons au cours des prochaines semaines.

La solution à ce problème existe depuis 1968, lorsque Ken Thompson écrivait un article intitulé « Techniques de programmation : Algorithme de recherche d’expression régulière ». L’article décrit un mécanisme pour convertir une expression régulière en AFN (automate fini non-déterministe) et suivre les transitions d’état dans l’AFN à l’aide d’un algorithme exécuté en un temps linéaire de la taille de la chaîne que l’on souhaite faire correspondre.

L’article de Thompson n’évoque pas l’AFN mais l’algorithme de temps linéaire est clairement expliqué. Il présente également un programme ALGOL-60 qui génère du code de langage assembleur pour l’IBM 7094. La réalisation paraît ésotérique mais l’idée présentée ne l’est pas.

Voici à quoi ressemblerait l’expression régulière .*.*=.* si elle était schématisée d’une manière similaire aux images de l’article de Thompson.

Le schéma 0 contient 5 états commençant à 0. Il y a trois boucles qui commencent avec les états 1, 2 et 3. Ces trois boucles correspondent aux trois .* dans l’expression régulière. Les trois pastilles avec des points à l’intérieur correspondent à un caractère unique. La pastille avec le signe = à l’intérieur correspond au signe littéral =. L’état 4 est l’état final : s’il est atteint, l’expression régulière a trouvé une correspondance.

Pour voir comment un tel schéma d’état peut être utilisé pour associer l’expression régulière .*.*=.*, nous allons examiner la correspondance de la chaîne x=x. Le programme commence à l’état 0 comme indiqué dans le schéma 1.

La clé pour faire fonctionner cet algorithme est que la machine d’état soit dans plusieurs états au même moment. L’AFN prendra autant de transitions que possible simultanément.

Avant de lire une entrée, il passe immédiatement aux deux états 1 et 2 comme indiqué dans le schéma 2.

En observant le schéma 2, on peut voir ce qui s’est passé lorsqu’il considère le premier x dans x=x. Le x peut correspondre au point supérieur en faisant une transition de l’état 1 et en retournant à l’état 1. Ou le x peut correspondre au point inférieur en faisant une transition de l’état 2 et en retournant à l’état 2.

Après avoir associé le premier x dans x=x, les états restent 1 et 2. On ne peut pas atteindre l’état 3 ou 4 car il faut un signe littéral =.

L’algorithme considère ensuite le = dans x=x. Comme le x avant lui, il peut être associé par l’une des deux boucles supérieures faisant une transition de l’état 1 à l’état 1 ou de l’état 2 à l’état 2. En outre, le signe littéral = peut être associé et l’algorithme peut faire une transition de l’état 2 à l’état 3 (et directement à l’état 4). C’est illustré dans le schéma 3.

Ensuite, l’algorithme atteint le x final dans x=x. Depuis les états 1 et 2, les mêmes transitions sont possibles vers les états 1 et 2. Depuis l’état 3, le x peut correspondre au point sur la droite et retourner à l’état 3.

À ce niveau, chaque caractère de x=x a été considéré, et puisque l’état 4 a été atteint, l’expression régulière correspond à cette chaîne. Chaque caractère a été traité une fois, l’algorithme était donc linéaire avec la longueur de la chaîne saisie. Aucun retour en arrière nécessaire.

Cela peut paraître évident qu’une fois l’état 4 atteint (après la correspondance de x=), l’expression régulière correspond et l’algorithme peut terminer sans prendre en compte le x final.

Cet algorithme est linéaire avec la taille de son entrée.

Cloudflare outage caused by bad software deploy (updated)

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-outage/

This is a short placeholder blog and will be replaced with a full post-mortem and disclosure of what happened today.

For about 30 minutes today, visitors to Cloudflare sites received 502 errors caused by a massive spike in CPU utilization on our network. This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.

This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred. Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.

Update at 2009 UTC:

Starting at 1342 UTC today we experienced a global outage across our network that resulted in visitors to Cloudflare-proxied domains being shown 502 errors (“Bad Gateway”). The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules.

The intent of these new rules was to improve the blocking of inline JavaScript that is used in attacks. These rules were being deployed in a simulated mode where issues are identified and logged by the new rule but no customer traffic is actually blocked so that we can measure false positive rates and ensure that the new rules do not cause problems when they are deployed into full production.

Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%.

This chart shows CPU spiking in one of our PoPs:

We were seeing an unprecedented CPU exhaustion event, which was novel for us as we had not experienced global CPU exhaustion before.

We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage.

At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.

We then went on to review the offending pull request, roll back the specific rules, test the change to ensure that we were 100% certain that we had the correct fix, and re-enabled the WAF Managed Rulesets at 1452 UTC.

We recognize that an incident like this is very painful for our customers. Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Post Syndicated from Martin J Levy original https://blog.cloudflare.com/the-deep-dive-into-how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-monday/

A recap on what happened Monday

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

On Monday we wrote about a painful Internet wide route leak. We wrote that this should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. That blog entry came out around 19:58 UTC, just over seven hours after the route leak finished (which will we see below was around 12:39 UTC). Today we will dive into the archived routing data and analyze it. The format of the code below is meant to use simple shell commands so that any reader can follow along and, more importantly, do their own investigations on the routing tables.

This was a very public BGP route leak event. It was both reported online via many news outlets and the event’s BGP data was reported via social media as it was happening. Andree Toonk tweeted a quick list of 2,400 ASNs that were affected.

This blog contains a large number of acronyms and those are explained at the end of the blog.

Using RIPE NCC archived data

The RIPE NCC operates a very useful archive of BGP routing. It runs collectors globally and provides an API for querying the data. More can be seen at https://stat.ripe.net/. In the world of BGP all routing is public (within the ability of anyone collecting data to have enough collections points). The archived data is very valuable for research and that’s what we will do in this blog. The site can create some very useful data visualizations.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Dumping the RIPEstat data for this event

Presently, the RIPEstat data gets ingested around eight to twelve hours after real-time. It’s not meant to be a real-time service. The data can be queried in many ways, including a full web interface and an API. We are using the API to extract the data in a JSON format.

We are going to focus only on the Cloudflare routes that were leaked. Many other ASNs were leaked (see the tweet above); however, we want to deal with a finite data set and focus on what happened to Cloudflare’s routing. All the commands below can be run with ease on many systems. Both the scripts and the raw data file are now available on GitHub. The following was done on MacBook Pro running macOS Mojave.

First we collect 24 hours of route announcements and AS-PATH data that RIPEstat sees coming from AS13335 (Cloudflare).

$ # Collect 24 hours of data - more than enough
$ ASN="AS13335"
$ START="2019-06-24T00:00:00"
$ END="2019-06-25T00:00:00"
$ ARGS="resource=${ASN}&starttime=${START}&endtime=${END}"
$ URL="https://stat.ripe.net/data/bgp-updates/data.json?${ARGS}"
$ # Fetch the data from RIPEstat
$ curl -sS "${URL}" | jq . > 13335-routes.json
$ ls -l 13335-routes.json
-rw-r--r –  1 martin  staff  339363899 Jun 25 08:47 13335-routes.json

That’s 340MB of data – which seems like a lot, but it contains plenty of white space and plenty of data we just don’t need. Our second task is to reduce this raw data down to just the required data – that’s timestamps, actual routes, and AS-PATH. The third item will be very useful. Note we are using jq, which can be installed on macOS with the  brew package manager.

$ # Extract just the times, routes, and AS-PATH
$ jq -rc '.data.updates[]|.timestamp,.attrs.target_prefix,.attrs.path' < 13335-routes.json | paste - - - > 13335-listing-a.txt
$ wc -l 13335-listing-a.txt
691318 13335-listing-a.txt

We are down to just below seven hundred thousand routing events, however, that’s not a leak, that’s everything that includes Cloudflare’s ASN (the number 13335 above). For that we need to go back to Monday’s blog and realize it was AS396531 (Allegheny Technologies) that showed up with 701 (Verizon) in the leak. Now we reduce the data further:

$ # Extract the route leak 701,396531
$ # AS701 is Verizon and AS396531 is Allegheny Technologies
$ egrep '701,396531' < 13335-listing-a.txt > 13335-listing-b.txt
$ wc -l 13335-listing-b.txt
204568 13335-listing-b.txt

At 204 thousand data points, we are looking better. It’s still a lot of data because BGP can be very chatty if topology is changing. A route leak will cause exactly that. Now let’s see how many routes were affected:

$ # Extract the actual routes affected by the route leak
$ cut -f2 < 13335-listing-b.txt | sort -V -u > 13335-listing-c.txt
$ wc -l 13335-listing-c.txt
101 13335-listing-c.txt

It’s a much smaller number. We now have a listing of at least 101 routes that were leaked via Verizon. This may not be the full list because route collectors like RIPEstat don’t have direct feeds from Verizon, so this data is a blended view with Verizon’s path and other paths. We can see that if we look at the AS-PATH in the above files. Please note that I had a typo in this script when this blog was first published and only 20 routes showed up because the -n vs -V option was used on sort. Now the list is correct with 101 affected routes. Please see this short article from stackoverflow to see the issue.

Here’s a partial listing of affected routes.

$ cat 13335-listing-c.txt

This is an interesting list, as some of these routes do not originate from Cloudflare’s network, however, they show up with AS13335 (our ASN) as the originator. For example, the route is not announced from our network, but we do announce (which covers that route). More importantly, we have an IRR (Internet Routing Registries) route object plus an RPKI ROA for that block. Here’s the IRR object:

origin:         AS13335
source:         ARIN

And here’s the RPKI ROA. This ROA has Max Length set to 20, so no smaller route should be accepted.

Max Length:   /20
ASN:          13335
Trust Anchor: ARIN
Validity:     Thu, 02 Aug 2018 04:00:00 GMT - Sat, 31 Jul 2027 04:00:00 GMT
Emitted:      Thu, 02 Aug 2018 21:45:37 GMT
Name:         535ad55d-dd30-40f9-8434-c17fc413aa99
Key:          4a75b5de16143adbeaa987d6d91e0519106d086e
Parent Key:   a6e7a6b44019cf4e388766d940677599d0c492dc
Path:         rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-...

The Max Length field in an ROA says what the minimum size of an acceptable announcement is. The fact that this is a /20 route with a /20 Max Length says that a /21 (or /22 or /23 or /24) within this IP space isn’t allowed. Looking further at the route list above we get the following listing:

Route Seen            Cloudflare IRR & ROA    ROA Max Length    ->          /20   ->         /20    ->          /20   ->         /20    ->          /20     ->           /20   ->         /20   ->         /20   ->         /20     ->           /20   ->         /20    ->          /20     ->           /20

So how did all these /21’s show up? That’s where we dive into the world of BGP route optimization systems and their propensity to synthesize routes that should not exist. If those routes leak (and it’s very clear after this week that they can), all hell breaks loose. That can be compounded when not one, but two ISPs allow invalid routes to be propagated outside their autonomous network. We will explore the AS-PATH further down this blog.

More than 20 years ago, RFC1997 added the concept of communities to BGP. Communities are a way of tagging or grouping route advertisements. Communities are often used to label routes so that specific handling policies can be applied. RFC1997 includes a small number of universal well-known communities. One of these is the NO_EXPORT community, which has the following specification:

    All routes received carrying a communities attribute
    containing this value MUST NOT be advertised outside a BGP
    confederation boundary (a stand-alone autonomous system that
    is not part of a confederation should be considered a
    confederation itself).

The use of the NO_EXPORT community is very common within BGP enabled networks and is a community tag that would have helped alleviate this route leak immensely.

How BGP route optimization systems work (or don’t work in this case) can be a subject for a whole other blog entry.

Timing of the route leak

As we saved away the timestamps in the JSON file and in the text files, we can confirm the time for every route in the route leak by looking at the first and the last timestamp of a route in the data. We saved data from 00:00:00 UTC until 00:00:00 the next day, so we know we have covered the period of the route leak. We write a script that checks the first and last entry for every route and report the information sorted by start time:

$ # Extract the timing of the route leak
$ while read cidr
  echo $cidr
  fgrep $cidr < 13335-listing-b.txt | head -1 | cut -f1
  fgrep $cidr < 13335-listing-b.txt | tail -1 | cut -f1
done < 13335-listing-c.txt |\
paste - - - | sort -k2,3 | column -t | sed -e 's/2019-06-24T//g'
...   10:34:25  12:38:54     10:34:27  12:29:39    10:34:27  12:30:00   10:34:27  12:30:34  10:34:27  12:30:39  10:34:27  12:30:39    10:34:29  12:30:34   10:34:29  12:30:34   10:34:29  12:30:34    10:34:29  12:30:34     10:34:29  12:30:34     10:34:31  12:19:24     10:34:36  12:29:53    10:34:38  12:19:24   10:34:38  12:19:24      10:34:38  12:19:24     11:52:49  11:53:19   12:00:13  12:29:34    12:00:13  12:30:00   12:09:39  12:29:34

Now we know the times. The route leak started at 10:34:25 UTC (just before lunchtime London time) on 2019-06-24 and ended at 12:38:54 UTC. That’s a hair over two hours. Here’s that same time data in a graphical form showing the near-instant start of the event and the duration of each route leaked:

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

We can also go back to RIPEstat and look at the activity graph for Cloudflare’s AS13335 network:

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Clearly between 10:30 UTC and 12:40 UTC there’s a lot of route activity – far more than normal.

Note that as we mentioned above, RIPEstat doesn’t get a full view of Verizon’s network routing and hence some of the propagated routes won’t show up.

Drilling down on the AS-PATH part of the data

Having the routes is useful, but now we want to look at the paths of these leaked routes to see which ASNs are involved. We knew the offending ASNs during the route leak on Monday. Now we want to dig deeper using the archived data. This allows us to see the extent and reach of this route leak.

$ # Extract the AS-PATH of the route leak
$ # Use the list of routes to extract the full AS-PATH
$ # Merge the results together to show an amalgamation of paths.
$ # We know (luckily) the last few ASNs in the AS-PATH are consistent
$ cut -f3 < 13335-listing-b.txt | tr -d '[\[\]]' |\
awk '{
  n=split($0, a, ",");
  printf "%50s\n",
    a[n-5] "_" a[n-4] "_" a[n-3] "_" a[n-2] "_" a[n-1] "_" a[n];
}' | sort -u

This script clearly shows the AS-PATH of the leaked routes. It’s very consistent. Reading from the back of the line to the front, we have 13335 (Cloudflare), 3356 or 174 (Level3/CenturyLink or Cogent – both tier 1 transit providers for Cloudflare). So far, so good. Then we have 33154 (DQE Communications) and 396531 (Allegheny Technologies Inc) which is still technically not a leak, but trending that way. The reason why we can state this is still technically not a leak is because we don’t know the relationship between those two ASs. It’s possible they have a mutual transit agreement between them. That’s up to them.

Back to the AS-PATH’s. Still reading leftwards, we see 701 (Verizon), which is very very bad and clear evidence of a leak. It’s a leak for two reasons. First, this matches the path when a transit is leaking a non-customer route from a customer. This Verizon customer does not have 13335 (Cloudflare) listed as a customer. Second, the route contains within its path a tier 1 ASN. This is the point where a route leak should have been absolutely squashed by filtering on the customer BGP session. Beyond this point there be dragons.

And dragons there be! Everything above is about how Verizon filtered (or didn’t filter) its customer. What follows the 701 (i.e the number to the left of it) is the peers or customers of Verizon that have accepted these leaked routes. They are mainly other tier 1 networks of Verizon in this list: 174 (Cogent), 1239 (Sprint), 1273 (Vodafone), 3320 (DTAG), 3491 (PCCW), 6461 (Zayo), 6762 (Telecom Italia), etc.

What’s missing from that list are three networks worthy of mentioning – 1299 (Telia), 2914 (NTT), and 7018 (AT&T). All three implement a very simple AS-PATH filter which saved the day for their network. They do not allow one tier 1 ISP to send them a route which has another tier 1 further down the path. That’s because when that happens, it’s officially a leak as each tier 1 is fully connected to all other tier 1’s (which is part of the definition of a tier 1 network). The topology of the Internet’s global BGP routing tables simply states that if you see another tier 1 in the path, then it’s a bad route and it should be filtered away.

Additionally we know that 7018 (AT&T) operates a network which drops RPKI invalids. Because Cloudflare routes are RPKI signed, this also means that AT&T would have dropped these routes when it receives them from Verizon. This shows a clear win for RPKI (and for AT&T when you see their bandwidth graph below)!

That all said, keep in mind we are still talking about routes that Cloudflare didn’t announce. They all came from the route optimizer.

What should 701 Verizon network accept from their customer 396531?

This is a great question to ask. Normally we would look at the IRR (Internet Routing Registries) to see what policy an ASN wants for it’s routes.

$ whois -h whois.radb.net AS396531 ; the Verizon customer
%  No entries found for the selected source(s).
$ whois -h whois.radb.net AS33154  ; the downstream of that customer
%  No entries found for the selected source(s).

That’s enough to say that we should not be seeing this ASN anywhere on the Internet, however, we should go further into checking this. As we know the ASN of the network, we can search for any routes that are listed for that ASN. We find one:

$ whois -h whois.radb.net ' -i origin AS396531' | egrep '^route|^origin|^mnt-by|^source'
origin:         AS396531
mnt-by:         MNT-DCNSL
source:         ARIN

More importantly, now we have a maintainer (the owner of the routing registry entries). We can see what else is there for that network and we are specifically looking for this:

$ whois -h whois.radb.net ' -i mnt-by -T as-set MNT-DCNSL' | egrep '^as-set|^members|^mnt-by|^source'
as-set:         AS-DQECUST
members:        AS4130, AS5050, AS11199, AS11360, AS12017, AS14088, AS14162,
                AS14740, AS15327, AS16821, AS18891, AS19749, AS20326,
                AS21764, AS26059, AS26257, AS26461, AS27223, AS30168,
                AS32634, AS33039, AS33154, AS33345, AS33358, AS33504,
                AS33726, AS40549, AS40794, AS54552, AS54559, AS54822,
                AS393456, AS395440, AS396531, AS15204, AS54119, AS62984,
                AS13659, AS54934, AS18572, AS397284
mnt-by:         MNT-DCNSL
source:         ARIN

This object is important. It lists all the downstream ASNs that this network is expected to announce to the world. It does not contain Cloudflare’s ASN (or any of the leaked ASNs). Clearly this as-set was not used for any BGP filtering.

Just for completeness the same exercise can be done for the other ASN (the downstream of the customer of Verizon). In this case, we just searched for the maintainer object (as there are plenty of route and route6 objects listed).

$ whois -h whois.radb.net ' -i origin AS33154' | egrep '^mnt-by' | sort -u
mnt-by:         MNT-DCNSL
mnt-by:     MAINT-AS3257
mnt-by:     MAINT-AS5050

None of these maintainers are directly related to 33154 (DQE Communications). They have been created by other parties and hence they become a dead-end in that search.

It’s worth doing a secondary search to see if any as-set object exists with 33154 or 396531 included. We turned to the most excellent IRR Explorer website run by NLNOG. It provides deep insight into the routing registry data. We did a simple search for 33154 using http://irrexplorer.nlnog.net/search/33154 and we found these as-set objects.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

It’s interesting to see this ASN listed in other as-set’s but none are important or related to Monday’s route-leak. Next we looked at 396531

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

This shows that there’s nowhere else we need to check. AS-DQECUST is the as-set macro that controls (or should control) filtering for any transit provider of their network.

The summary of all the investigation is a solid statement that no Cloudflare routes or ASNs are listed anywhere within the routing registry data for the customer of Verizon. As there were 2,300 ASNs listed in the tweet above, we can conclusively state no filtering was in place and hence this route leak went on its way unabated.

IPv6? Where is the IPv6 route leak?

In what could be considered the only plus from Monday’s route leak, we can confirm that there was no route leak within IPv6 space. Why?

It turns out that 396531 (Allegheny Technologies Inc) is a network without IPv6 enabled. Normally you would hear Cloudflare chastise anyone that’s yet to enable IPv6, however, in this case we are quite happy that one of the two protocol families survived. IPv6 was stable during this route leak, which now can be called an IPv4-only route leak.

Yet that’s not really the whole story. Let’s look at the percentage of traffic Cloudflare sends Verizon that’s IPv6 (vs IPv4). Normally the IPv4/IPv6 percentage holds steady.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

This uptick in IPv6 traffic could be the direct result of Happy Eyeballs on mobile handsets picking a working IPv6 path into Cloudflare vs a non-working IPv4 path. Happy Eyeballs is meant to protect against IPv6 failure, however in this case it’s doing a wonderful job in protecting from an IPv4 failure. Yet we have to be careful with this graph because after further thought and investigation the percentage only increased because IPv4 reduces. Sometimes graphs can be misinterpreted, yet Happy Eyeballs still did a good job even as end users were being affected.

Happy Eyeballs, described in RFC8305, is a mechanism where a client (lets say a mobile device) tries to connect to a website both on IPv4 and IPv6 simultaneously. IPv6 is sometimes given a head-start. The theory is that, should a failure exist on one of the paths (sadly IPv6 is the norm), then IPv4 will save the day. Monday was a day of opposites for Verizon.

In fact, enabling IPv6 for mobile users is the one area where we can praise the Verizon network this week (or at least the Verizon mobile network), unlike the residential Verizon networks where IPv6 is almost non-existent.

Using bandwidth graphs to confirm routing leaks and stability.

As we have already stated,Verizon had impacted their own users/customers. Let’s start with their bandwidth graph:

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

The red line is 24 June 2019 (00:00 UTC to 00:00 UTC the next day). The gray lines are previous days for comparison. This graph includes both Verizon fixed-line services like FiOS along with mobile.

The AT&T graph is quite different.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

There’s no perturbation. This, along with some direct confirmation, shows that 7018 (AT&T) was not affected. This is an important point.

Going back and looking at a third tier 1 network, we can see 6762 (Telecom Italia) affected by this route leak and yet Cloudflare has a direct interconnect with them.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

We will be asking Telecom Italia to improve their route filtering as we now have this data.

Future work that could have helped on Monday

The IETF is doing work in the area of BGP path protection within the Secure Inter-Domain Routing Operations Working Group (sidrops) area. The charter of this IETF group is:

The SIDR Operations Working Group (sidrops) develops guidelines for the operation of SIDR-aware networks, and provides operational guidance on how to deploy and operate SIDR technologies in existing and new networks.

One new effort from this group should be called out to show how important the issue of route leaks like today’s event is. The draft document from Alexander Azimov et al named draft-ietf-sidrops-aspa-profile (ASPA stands for Autonomous System Provider Authorization) extends the RPKI data structures to handle BGP path information. This in ongoing work and Cloudflare and other companies are clearly interested in seeing it progress further.

However, as we said in Monday’s blog and something we should reiterate again and again: Cloudflare encourages all network operators to deploy RPKI now!

Acronyms used in the blog

  • API – Application Programming Interface
  • AS-PATH – The list of ASNs that a routes has traversed so far
  • ASN – Autonomous System Number – A unique number assigned for each network on the Internet
  • BGP – Border Gateway Protocol (version 4) – the core routing protocol for the Internet
  • IETF – Internet Engineering Task Force – an open standards organization
  • IPv4 – Internet Protocol version 4
  • IPv6 – Internet Protocol version 6
  • IRR – Internet Routing Registries – a database of Internet route objects
  • ISP – Internet Service Provider
  • JSON – JavaScript Object Notation – a lightweight data-interchange format
  • RFC – Request For Comment – published by the IETF
  • RIPE NCC – Réseaux IP Européens Network Coordination Centre – a regional Internet registry
  • ROA – Route Origin Authorization – a cryptographically signed attestation of a BGP route announcement
  • RPKI – Resource Public Key Infrastructure – a public key infrastructure framework for routing information
  • Tier 1 – A network that has no default route and peers with all other tier 1’s
  • UTC – Coordinated Universal Time – a time standard for clocks and time
  • "there be dragons" – a mistype, as it was meant to be "here be dragons" which means dangerous or unexplored territories