Tag Archives: bgp

Why BGP communities are better than AS-path prepends

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/prepends-considered-harmful/


The Internet, in its purest form, is a loosely connected graph of independent networks (also called Autonomous Systems (AS for short)). These networks use a signaling protocol called BGP (Border Gateway Protocol) to inform their neighbors (also known as peers) about the reachability of IP prefixes (a group of IP addresses) in and through their network. Part of this exchange contains useful metadata about the IP prefix that is used to inform network routing decisions. One example of the metadata is the full AS-path, which consists of the different autonomous systems an IP packet needs to pass through to reach its destination.

As we all want our packets to get to their destination as fast as possible, selecting the shortest AS-path for a given prefix is a good idea. This is where something called prepending comes into play.

Routing on the Internet, a primer

Let’s briefly talk about how the Internet works at its most fundamental level, before we dive into some nitty-gritty details.

The Internet is, at its core, a massively interconnected network of thousands of networks. Each network owns two things that are critical:

1. An Autonomous System Number (ASN): a 32-bit integer that uniquely identifies a network. For example, one of the Cloudflare ASNs (we have multiple) is 13335.

2. IP prefixes: An IP prefix is a range of IP addresses, bundled together in powers of two: In the IPv4 space, two addresses form a /31 prefix, four form a /30, and so on, all the way up to /0, which is shorthand for “all IPv4 prefixes”. The same applies for IPv6, but instead of aggregating 32 bits at most, you can aggregate up to 128 bits. The figure below shows this relationship between IP prefixes, in reverse — a /24 contains two /25s, each of which contains two /26s, and so on.

[Figure: a /24 contains two /25s, each of which contains two /26s, and so on]
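
To make that prefix arithmetic concrete, here is a minimal Python sketch using only the standard library ipaddress module; the prefix values are arbitrary examples, not anything from the post.

import ipaddress

# A /24 holds 256 addresses; splitting it yields two /25s of 128 addresses each.
block = ipaddress.ip_network("192.0.2.0/24")
halves = list(block.subnets(prefixlen_diff=1))

print(block, "->", block.num_addresses, "addresses")    # 192.0.2.0/24 -> 256 addresses
for half in halves:
    print(half, "->", half.num_addresses, "addresses")  # 192.0.2.0/25 and 192.0.2.128/25

# The same logic applies to IPv6, just with up to 128 bits of prefix length.
v6 = ipaddress.ip_network("2001:db8::/32")
print(v6.num_addresses)                                 # 2**96 addresses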

To communicate on the Internet, you must be able to reach your destination, and that’s where routing protocols come into play. They enable each node on the Internet to know where to send your message (and for the receiver to send a message back).


As mentioned earlier, these destinations are identified by IP addresses, and contiguous ranges of IP addresses are expressed as IP prefixes. We use IP prefixes for routing as an efficiency optimization: Keeping track of where to go for four billion (2^32) IP addresses in IPv4 would be incredibly complex, and require a lot of resources. Sticking to prefixes reduces that number down to about one million instead.

Now recall that Autonomous Systems are independently operated and controlled. In the Internet’s network of networks, how do I tell Source A in some other network that there is an available path to get to Destination B in (or through) my network? In comes BGP! BGP is the Border Gateway Protocol, and it is used to signal reachability information. Signal messages generated by the source ASN are referred to as ‘announcements’ because they declare to the Internet that IP addresses in the prefix are online and reachable.

[Figure: Source A learns of two different paths to Destination B]

Have a look at the figure above. Source A should now know how to get to Destination B through 2 different networks!

This is what an actual BGP message would look like:

BGP Message
    Type: UPDATE Message
    Path Attributes:
        Path Attribute - Origin: IGP
        Path Attribute - AS_PATH: 64500 64496
        Path Attribute - NEXT_HOP: 198.51.100.1
        Path Attribute - COMMUNITIES: 64500:13335
        Path Attribute - Multi Exit Discriminator (MED): 100
    Network Layer Reachability Information (NLRI):
        192.0.2.0/24

As you can see, BGP messages contain not just the IP prefix (the NLRI bit) and the path, but also a bunch of other metadata that provides additional information about the path. Other fields include communities (more on those later), as well as the MED and the origin code. The MED is a suggestion to other directly connected networks on which path should be taken if multiple options are available, and the lowest value wins. The origin code can be one of three values: IGP, EGP or Incomplete. IGP will be set if you originate the prefix through BGP, EGP is no longer used (it’s an ancient routing protocol), and Incomplete is set when you distribute a prefix into BGP from another routing protocol (like IS-IS or OSPF).

Now that source A knows how to get to Destination B through two different networks, let’s talk about traffic engineering!

Traffic engineering

Traffic engineering is a critical part of the day-to-day management of any network. Just like in the physical world, detours can be put in place by operators to optimize the traffic flows into (inbound) and out of (outbound) their network. Outbound traffic engineering is significantly easier than inbound traffic engineering, because operators can choose which neighboring network to send traffic to, and can even prioritize some traffic over other traffic. In contrast, inbound traffic engineering requires influencing a network that is operated by someone else entirely. The autonomy and self-governance of a network is paramount, so operators use available tools to inform or shape inbound packet flows from other networks. The understanding and use of those tools is complex, and can be a challenge.

The available set of traffic engineering tools, both in- and outbound, rely on manipulating attributes (metadata) of a given route. As we’re talking about traffic engineering between independent networks, we’ll be manipulating the attributes of an EBGP-learned route. BGP can be split into two categories:

  1. EBGP: BGP communication between two different ASNs
  2. IBGP: BGP communication within the same ASN.

While the protocol is the same, certain attributes can be exchanged on an IBGP session that aren’t exchanged on an EBGP session. One of those is local-preference. More on that in a moment.

BGP best path selection

When a network is connected to multiple other networks and service providers, it can receive path information to the same IP prefix from many of those networks, each with slightly different attributes. It is then up to the receiving network of that information to use a BGP best path selection algorithm to pick the “best” prefix (and route), and use this to forward IP traffic. I’ve put “best” in quotation marks, as best is a subjective requirement. “Best” is frequently the shortest, but what can be best for my network might not be the best outcome for another network.

BGP will consider multiple prefix attributes when filtering through the received options. However, rather than combine all those attributes into a single selection criterion, BGP best path selection uses the attributes in tiers — at any tier, if the available attributes are sufficient to choose the best path, then the algorithm terminates with that choice.

The BGP best path selection algorithm is extensive, containing 15 discrete steps to select the best available path for a given prefix. Given the numerous steps, it’s in the interest of the network to decide the best path as early as possible. The first four steps are the most used and influential, and are depicted in the figure below as sieves.
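
As an illustration of that tiered (“sieve”) evaluation, here is a hedged Python sketch of just the first four steps (local preference, AS-path length, origin code, MED). Real implementations evaluate many more steps and tie-breakers, and the route fields below are simplified assumptions, not a router's actual data structures.

from dataclasses import dataclass

# Lower origin value is better: IGP (0) < EGP (1) < Incomplete (2).
ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}

@dataclass
class Route:
    local_pref: int   # set locally on import, higher wins
    as_path: list     # list of ASNs, shorter wins
    origin: str       # "IGP", "EGP" or "INCOMPLETE", lower rank wins
    med: int          # suggestion from the neighbor, lower wins

def best_path(routes):
    """Pick the best route using the first four tiers; each tier only
    runs on the candidates that survived the previous one."""
    candidates = list(routes)
    for key, higher_is_better in (
        (lambda r: r.local_pref, True),             # 1. highest local preference
        (lambda r: len(r.as_path), False),          # 2. shortest AS-path
        (lambda r: ORIGIN_RANK[r.origin], False),   # 3. lowest origin code
        (lambda r: r.med, False),                   # 4. lowest MED
    ):
        best = max(candidates, key=key) if higher_is_better else min(candidates, key=key)
        candidates = [r for r in candidates if key(r) == key(best)]
        if len(candidates) == 1:
            return candidates[0]
    return candidates[0]   # remaining tie-breakers omitted in this sketch

# A prepended (longer) path only loses if local preference ties first.
a = Route(local_pref=250, as_path=[64496, 64496, 64496], origin="IGP", med=0)
b = Route(local_pref=100, as_path=[64500], origin="IGP", med=0)
assert best_path([a, b]) is a   # higher local-pref wins despite the longer AS-path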

[Figure: the first four steps of BGP best path selection, depicted as sieves]

Picking the shortest path possible is usually a good idea, which is why “AS-path length” is a step executed early on in the algorithm. However, looking at the figure above, “AS-path length” appears second, despite being the attribute that identifies the shortest path. So let’s talk about the first step: local preference.

Local preference
Local preference is an operator favorite because it allows them to handpick a route+path combination of their choice. It’s the first attribute in the algorithm because it is unique for any given route+neighbor+AS-path combination.

A network sets the local preference on import of a route (having learned about the route from a neighbor network). Local preference is a non-transitive property, meaning it is an attribute that is never sent in an EBGP message to other networks. This intrinsically means, for example, that the operator of AS 64496 can’t set the local preference of routes to their own (or transiting) IP prefixes inside neighboring AS 64511. The inability to do so is partially why inbound traffic engineering through EBGP is so difficult.

Prepending artificially increases AS-path length
Since no network is able to directly set the local preference for a prefix inside another network, the first opportunity to influence other networks’ choices is modifying the AS-path. If the next hops are valid, and the local preference for all the different paths for a given route are the same, modifying the AS-path is an obvious option to change the path traffic will take towards your network. In a BGP message, prepending looks like this:

BEFORE:

BGP Message
    Type: UPDATE Message
    Path Attributes:
        Path Attribute - Origin: IGP
        Path Attribute - AS_PATH: 64500 64496
        Path Attribute - NEXT_HOP: 198.51.100.1
        Path Attribute - COMMUNITIES: 64500:13335
        Path Attribute - Multi Exit Discriminator (MED): 100
    Network Layer Reachability Information (NLRI):
        192.0.2.0/24

AFTER:

BGP Message
    Type: UPDATE Message
    Path Attributes:
        Path Attribute - Origin: IGP
        Path Attribute - AS_PATH: 64500 64496 64496
        Path Attribute - NEXT_HOP: 198.51.100.1
        Path Attribute - COMMUNITIES: 64500:13335
        Path Attribute - Multi Exit Discriminator (MED): 100
    Network Layer Reachability Information (NLRI):
        192.0.2.0/24

Specifically, operators can do AS-path prepending. When doing AS-path prepending, an operator adds additional autonomous systems to the path (usually the operator uses their own AS, but that’s not enforced in the protocol). This way, an AS-path can go from a length of 1 to a length of 255. As the length has now increased dramatically, that specific path for the route will not be chosen. By changing the AS-path advertised to different peers, an operator can control the traffic flows coming into their network.

Unfortunately, prepending has a catch: To be the deciding factor, all the other attributes need to be equal. This is rarely true, especially in large networks that are able to choose from many possible routes to a destination.

Business Policy Engine

BGP is colloquially also referred to as a Business Policy Engine: it does not select the best path from a performance point of view; instead, and more often than not, it will select the best path from a business point of view. The business criteria could be anything from investment (port) efficiency to increased revenue, and more. This may sound strange but, believe it or not, this is what BGP is designed to do! The power (and complexity) of BGP is that it enables a network operator to make choices according to the operator’s needs, contracts, and policies, many of which cannot be reflected by conventional notions of engineering performance.

Different local preferences

A lot of networks (including Cloudflare) assign a local preference depending on the type of connection used to send us the routes. A higher value is a higher preference. For example, routes learned from transit network connections will get a lower local preference of 100 because they are the most costly to use; backbone-learned routes will be 150, Internet exchange (IX) routes get 200, and lastly private interconnect (PNI) routes get 250. This means that for egress (outbound) traffic, the Cloudflare network, by default, will prefer a PNI-learned route, even if a shorter AS-path is available through an IX or transit neighbor.
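
In policy terms, this kind of import preference boils down to a simple lookup on the session type. The values below mirror the example numbers in the paragraph above; the code itself is just an illustrative sketch, not Cloudflare's actual policy configuration.

# Example local preference values per session type; higher is preferred.
LOCAL_PREF_BY_SESSION = {
    "transit": 100,     # most costly, least preferred
    "backbone": 150,
    "ix": 200,          # Internet exchange peering
    "pni": 250,         # private interconnect, most preferred
}

def import_local_pref(session_type: str) -> int:
    # Unknown session types fall back to the most conservative value.
    return LOCAL_PREF_BY_SESSION.get(session_type, 100)

print(import_local_pref("pni"))   # 250: a PNI route wins even with a longer AS-path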

Part of the reason a PNI is preferred over an IX is reliability: there is no third-party switching platform involved that is out of our control, which is important because we operate on the assumption that all hardware can and will eventually break. Another part of the reason is port efficiency. Here, efficiency is defined by cost per megabit transferred on each port. Roughly speaking, the cost is calculated by:

((cost_of_switch / port_count) + transceiver_cost)

which is combined with the cross-connect cost (which might be a monthly recurring charge (MRC) or a one-time fee). A PNI is preferable because it helps optimize value by reducing the overall cost per megabit transferred: the unit price decreases with higher utilization of the port.
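
As a rough worked example of that port-efficiency arithmetic (all numbers below are made up for illustration, and the amortization period is an assumption), the cost per megabit falls as port utilization rises:

def monthly_cost_per_mbit(switch_cost, port_count, transceiver_cost,
                          cross_connect_monthly, utilization_mbit,
                          amortization_months=36):
    """Approximate monthly cost per megabit actually transferred on a port.
    Hardware is amortized over a fixed number of months; all inputs are
    illustrative assumptions, not real figures."""
    hardware_monthly = (switch_cost / port_count + transceiver_cost) / amortization_months
    total_monthly = hardware_monthly + cross_connect_monthly
    return total_monthly / utilization_mbit

# The same port gets cheaper per megabit the more traffic it carries.
print(monthly_cost_per_mbit(30000, 32, 400, 250, utilization_mbit=10000))  # lightly used
print(monthly_cost_per_mbit(30000, 32, 400, 250, utilization_mbit=80000))  # well used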

This reasoning is similar for a lot of other networks, and is very prevalent in transit networks. BGP is at least as much about cost and business policy, as it is about performance.

Transit local preference

For simplicity, when referring to transits, I mean the traditional tier-1 transit networks. Due to the nature of these networks, they have two distinct sets of network peers:

1. Customers (like Cloudflare)
2. Settlement-free peers (like other tier-1 networks)

In normal circumstances, transit customers will get a higher local preference assigned than the local preference used for their settlement-free peers. This means that, no matter how much you prepend a prefix, if traffic enters that transit network, it will always land on your interconnection with that transit network; it will not be offloaded to another peer.

A prepend can still be used if you want to shift or offload traffic from a single link with one transit when you have multiple distinct links with them, or if the source of traffic is multihomed behind multiple transits (and they don’t have their own local preference playbook preferring one transit over another). But steering inbound traffic away from one transit port to another through AS-path prepending has significant diminishing returns: once you’re past three prepends, it’s unlikely to change much, if anything, at that point.

Example

[Figure: Origin A single-homed behind Transit B, with Cloudflare connected to both transits]

In the above scenario, no matter the adjustment Cloudflare makes in its AS-path towards AS 64496, the traffic will keep flowing through the Transit B <> Cloudflare interconnection, even though the path Origin A → Transit B → Transit A → Cloudflare is shorter from an AS-path point of view.

[Figure: Origin A multi-homed behind both transit providers]

In this scenario, not a lot has changed, but Origin A is now multi-homed behind the two transit providers. In this case, the AS-path prepending was effective, as the paths seen on the Origin A side are both the prepended and non-prepended path. As long as Origin A is not doing any egress traffic engineering, and is treating both transit networks equally, then the path chosen will be Origin A → Transit A → Cloudflare.

Community-based traffic engineering

So we have now identified a pretty critical problem within the Internet ecosystem for operators: with the tools mentioned above, it’s not always possible (some might even say outright impossible) to accurately dictate the paths on which traffic enters your own network, reducing the control an autonomous system has over its own network. Fortunately, there is a solution for this problem: community-based local preference.

Some transit providers allow their customers to influence the local preference in the transit network through the use of BGP communities. BGP communities are an optional transitive attribute for a route advertisement. The communities can be informative (“I learned this prefix in Rome”), but they can also be used to trigger actions on the receiving side. For example, Cogent publishes the following action communities:

Community Local preference
174:10 10
174:70 70
174:120 120
174:125 125
174:135 135
174:140 140

When you know that Cogent uses the following default local preferences in their network:

Peers → Local preference 100
Customers → Local preference 130

It’s easy to see how we could use the communities provided to change the route used. It’s important to note though that, as we can’t set the local preference of a route to 100 (or 130), AS-path prepending remains largely irrelevant, as the local preference won’t ever be the same.

Take for example the following configuration:

term ADV-SITELOCAL {
    from {
        prefix-list SITE-LOCAL;
        route-type internal;
    }
    then {
        as-path-prepend "13335 13335";
        accept;
    }
}

[Figure: traffic flow with the two-prepend configuration]

We’re prepending the Cloudflare ASN two times, resulting in a total AS-path length of three, yet we were still seeing a lot (too much) traffic coming in on our Cogent link. At that point, an engineer could add another prepend, but for a well-connected network such as Cloudflare, if two or three prepends didn’t do much, then four or five aren’t going to do much either. Instead, we can leverage the Cogent communities documented above to change the routing within Cogent:

term ADV-SITELOCAL {
    from {
        prefix-list SITE-LOCAL;
        route-type internal;
    }
    then {
        community add COGENT_LPREF70;
        accept;
    }
}

The above configuration changes the traffic flow to this:

[Figure: traffic flow after applying the Cogent local-preference community]

Which is exactly what we wanted!

Conclusion

AS-path prepending is still useful, and has its place in the toolchain operators use for traffic engineering, but it should be used sparingly. Excessive prepending opens a network up to more widespread route hijacks, which should be avoided at all costs. As such, using community-based ingress traffic engineering is highly preferred (and recommended). In cases where communities aren’t available (or not available to steer customer traffic), prepends can be applied, but I encourage operators to actively monitor their effects, and roll them back if ineffective.

As a side-note, P Marcos et al. have published an interesting paper on AS-path prepending, and go into some trends seen in relation to prepending, I highly recommend giving it a read: https://www.caida.org/catalog/papers/2020_aspath_prepending/aspath_prepending.pdf

How we detect route leaks and our new Cloudflare Radar route leak service

Post Syndicated from Mingwei Zhang original https://blog.cloudflare.com/route-leak-detection-with-cloudflare-radar/


Today we’re introducing Cloudflare Radar’s route leak data and API so that anyone can get information about route leaks across the Internet. We’ve built a comprehensive system that takes in data from public sources and Cloudflare’s view of the Internet drawn from our massive global network. The system is now feeding route leak data on Cloudflare Radar’s ASN pages and via the API.

This blog post is in two parts. There’s a discussion of BGP and route leaks followed by details of our route leak detection system and how it feeds Cloudflare Radar.

About BGP and route leaks

Inter-domain routing, i.e., exchanging reachability information among networks, is critical to the health and performance of the Internet. The Border Gateway Protocol (BGP) is the de facto routing protocol that exchanges routing information among organizations and networks. At its core, BGP assumes the information being exchanged is genuine and trustworthy, which unfortunately is no longer a valid assumption on the current Internet. In many cases, networks can make mistakes or intentionally lie about the reachability information and propagate that to the rest of the Internet. Such incidents can cause significant disruptions to the normal operation of the Internet. One type of such disruptive incident is the route leak.

We consider route leaks to be the propagation of routing announcements beyond their intended scope (RFC7908). Route leaks can cause significant disruption affecting millions of Internet users, as we have seen in many past notable incidents. For example, in June 2019 a misconfiguration in a small network in Pennsylvania, US (AS396531 – Allegheny Technologies Inc) accidentally leaked a Cloudflare prefix to Verizon, which proceeded to propagate the misconfigured route to the rest of its peers and customers. As a result, the traffic of a large portion of the Internet was squeezed through the limited-capacity links of a small network. The resulting congestion caused most of Cloudflare traffic to and from the affected IP range to be dropped.

A similar incident in November 2018 caused widespread unavailability of Google services when a Nigerian ISP (AS37282 – Mainone) accidentally leaked a large number of Google IP prefixes to its peers and providers violating the valley-free principle.

These incidents illustrate not only that route leaks can be very impactful, but also the snowball effects that misconfigurations in small regional networks can have on the global Internet.

Despite the criticality of detecting and rectifying route leaks promptly, they are often detected only when users start reporting the noticeable effects of the leaks. The challenge with detecting and preventing route leaks stems from the fact that AS business relationships and BGP routing policies are generally undisclosed, and the affected network is often remote to the root of the route leak.

In the past few years, solutions have been proposed to prevent the propagation of leaked routes. Such proposals include RFC9234 and ASPA, which extend BGP to annotate sessions with the relationship type between the two connected AS networks, enabling the detection and prevention of route leaks.

An alternative proposal to implement similar signaling of BGP roles is through the use of BGP communities, a transitive attribute used to encode metadata in BGP announcements. While these directions are promising in the long term, they are still in very preliminary stages and are not expected to be adopted at scale soon.

At Cloudflare, we have developed a system to detect route leak events automatically and send notifications to multiple channels for visibility. As we continue our efforts to bring more relevant data to the public, we are happy to announce that today we are launching an open data API for our route leak detection results and integrating those results into Cloudflare Radar pages.


Route leak definition and types

Before we jump into how we design our systems, we will first do a quick primer on what a route leak is, and why it is important to detect it.

We refer to the published IETF RFC7908 document “Problem Definition and Classification of BGP Route Leaks” to define route leaks.

> A route leak is the propagation of routing announcement(s) beyond their intended scope.

The intended scope is often concretely defined as inter-domain routing policies based on business relationships between Autonomous Systems (ASes). These business relationships are broadly classified into four categories: customers, transit providers, peers and siblings, although more complex arrangements are possible.

In a customer-provider relationship the customer AS has an agreement with another network to transit its traffic to the global routing table. In a peer-to-peer relationship two ASes agree to free bilateral traffic exchange, but only between their own IPs and the IPs of their customers. Finally, ASes that belong under the same administrative entity are considered siblings, and their traffic exchange is often unrestricted.  The image below illustrates how the three main relationship types translate to export policies.

[Figure: export policies implied by the customer, provider, and peer relationship types]

By categorizing the types of AS-level relationships and their implications on the propagation of BGP routes, we can define multiple phases of a prefix origination announcement during propagation:

  • upward: all path segments during this phase are customer to provider
  • peering: one peer-peer path segment
  • downward: all path segments during this phase are provider to customer

An AS path that follows the valley-free routing principle will have upward, peering, and downward phases; all are optional, but they have to appear in that order. Here is an example of an AS path that conforms with valley-free routing.

[Figure: example AS path conforming to valley-free routing]
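
A hedged Python sketch of that valley-free check: given the relationship of each adjacent AS pair on the path (seen in the direction of propagation), the phases must appear in the order upward, then at most one peering step, then downward. The relationship encoding below is an assumption made for illustration, not the production detector.

# Relationship of each hop along the path: "c2p" (customer to provider),
# "p2p" (peer to peer) or "p2c" (provider to customer).
PHASE_ORDER = {"c2p": 0, "p2p": 1, "p2c": 2}

def is_valley_free(hop_relationships):
    """True if the phases only ever move forward (upward -> peering -> downward)
    and there is at most one peering step."""
    last_phase = 0
    peering_hops = 0
    for rel in hop_relationships:
        phase = PHASE_ORDER[rel]
        if phase < last_phase:
            return False          # e.g. going back up after a downward step
        if rel == "p2p":
            peering_hops += 1
            if peering_hops > 1:
                return False      # a second peering step (type 2 leak)
        last_phase = phase
    return True

print(is_valley_free(["c2p", "c2p", "p2p", "p2c"]))   # True: conforms to valley-free
print(is_valley_free(["p2c", "c2p"]))                 # False: type 1 "hairpin" leak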

In RFC7908, “Problem Definition and Classification of BGP Route Leaks”, the authors define six types of route leaks, and we refer to these definitions in our system design. Here are illustrations of each of the route leak types.

Type 1: Hairpin Turn with Full Prefix

> A multihomed AS learns a route from one upstream ISP and simply propagates it to another upstream ISP (the turn essentially resembling a hairpin).  Neither the prefix nor the AS path in the update is altered.

An AS path that contains a provider-customer and then a customer-provider segment is considered a type 1 leak. In the following example, AS4 → AS5 → AS6 forms a type 1 leak.

[Figure: type 1 leak example — AS4 → AS5 → AS6]

Type 1 is the most recognized type of route leak and is very impactful. In many cases, a customer route is preferable to a peer or a provider route. In this example, AS6 will likely prefer sending traffic via AS5 instead of its other peer or provider routes, causing AS5 to unintentionally become a transit provider. This can significantly affect the performance of the traffic related to the leaked prefix or cause outages if the leaking AS is not provisioned to handle a large influx of traffic.

In June 2015, Telekom Malaysia (AS4788), a regional ISP, leaked over 170,000 routes learned from its providers and peers to its other provider Level 3 (AS3549, now part of Lumen). Level 3 accepted the routes and further propagated them to its downstream networks, which in turn caused significant network issues globally.

Type 2: Lateral ISP-ISP-ISP Leak

Type 2 leak is defined as propagating routes obtained from one peer to another peer, creating two or more consecutive peer-to-peer path segments.

Here is an example: AS3 → AS4 → AS5 forms a  type 2 leak.

[Figure: type 2 leak example — AS3 → AS4 → AS5]

One example of such a leak is more than three very large networks appearing in sequence. Very large networks (such as Verizon and Lumen) do not purchase transit from each other, and having more than three such networks on the path in sequence is often an indication of a route leak.

However, in the real world, it is not unusual to see multiple small peering networks exchanging routes and passing on to each other. Legit business reasons exist for having this type of network path. We are less concerned about this type of route leak as compared to type 1.

Type 3 and 4: Provider routes to peer; peer routes to provider

These two types involve propagating routes from a provider or a peer not to a customer, but to another peer or provider. Here are the illustrations of the two types of leaks:

[Figure: type 3 leak — a provider route propagated to a peer]
[Figure: type 4 leak — a peer route propagated to a provider]

As in the previously mentioned example, a Nigerian ISP that peers with Google accidentally leaked the routes it learned from Google to its provider AS4809, and thus generated a type 4 route leak. Because routes via customers are usually preferred to others, the large provider (AS4809) rerouted its traffic to Google via its customer, i.e. the leaking ASN, overwhelming the small ISP and taking down Google for over an hour.

Route leak summary

So far, we have looked at the first four types of route leaks defined in RFC7908. The common thread of these four types is that they’re all defined using AS relationships, i.e., peers, customers, and providers. We summarize the types of leaks by categorizing the AS path propagation based on where the routes are learned from and where they propagate to. The results are shown in the following table.

Routes from \ propagates to    To provider    To peer    To customer
From provider                  Type 1         Type 3     Normal
From peer                      Type 4         Type 2     Normal
From customer                  Normal         Normal     Normal

We can summarize the whole table into one single rule: routes obtained from a non-customer AS can only be propagated to customers.
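
That single rule translates into a very small check. The sketch below (an illustrative encoding, not the production detector) flags a propagation step as a leak whenever a route learned from a non-customer is sent to a non-customer, and maps it back to the leak types in the table above.

# Where was the route learned from, and where is it being propagated to?
# Relationships are from the leaker's point of view: "provider", "peer" or "customer".
LEAK_TYPE = {
    ("provider", "provider"): 1,   # hairpin turn
    ("provider", "peer"):     3,
    ("peer",     "provider"): 4,
    ("peer",     "peer"):     2,   # lateral ISP-ISP-ISP
}

def leak_type(learned_from: str, propagated_to: str):
    """Return the RFC7908 leak type (1-4), or None if the propagation is normal.
    Routes obtained from a non-customer may only be propagated to customers."""
    if learned_from == "customer" or propagated_to == "customer":
        return None
    return LEAK_TYPE[(learned_from, propagated_to)]

print(leak_type("provider", "provider"))  # 1
print(leak_type("peer", "customer"))      # None: normal propagation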

Note: Type 5 and type 6 route leaks are defined as prefix re-origination and announcing of private prefixes. Type 5 is more closely related to prefix hijackings, which we plan to expand our system to as the next steps, while type 6 leaks are outside the scope of this work. Interested readers can refer to sections 3.5 and 3.6 of RFC7908 for more information.

The Cloudflare Radar route leak system

Now that we know what a  route leak is, let’s talk about how we designed our route leak detection system.

From a very high level, we compartmentalize our system into three different components:

  1. Raw data collection module: responsible for gathering BGP data from multiple sources and providing BGP message stream to downstream consumers.
  2. Leak detection module: responsible for determining whether a given AS-level path is a route leak, estimating the confidence level of the assessment, and aggregating and providing all external evidence needed for further analysis of the event.
  3. Storage and notification module: responsible for providing access to detected route leak events and sending out notifications to relevant parties. This could also include building a dashboard for easy access and search of the historical events and providing the user interface for high-level analysis of the event.

Data collection module

There are three types of data input we take into consideration:

  1. Historical: BGP archive files for some time range in the past
    a. RouteViews and RIPE RIS BGP archives
  2. Semi-real-time: BGP archive files as soon as they become available, with a 10-30 minute delay.
    a. RouteViews and RIPE RIS archives with data broker that checks new files periodically (e.g. BGPKIT Broker)
  3. Real-time: true real-time data sources
    a. RIPE RIS Live
    b. Cloudflare internal BGP sources


For the current version, we use the semi-real-time data source for the detection system, i.e., the BGP updates files from RouteViews and RIPE RIS. For data completeness, we process data from all public collectors from these two projects (a total of 63 collectors and over 2,400 collector peers) and implement a pipeline that’s capable of handling the BGP data processing as the data files become available.

For data files indexing and processing, we deployed an on-premises BGPKIT Broker instance with Kafka feature enabled for message passing, and a custom concurrent MRT data processing pipeline based on BGPKIT Parser Rust SDK. The data collection module processes MRT files and converts results into a BGP messages stream at over two billion BGP messages per day (roughly 30,000 messages per second).

Route leak detection

The route leak detection module works at the level of individual BGP announcements. The detection component investigates one BGP message at a time, and estimates how likely a given BGP message is a result of a route leak event.


We base our detection algorithm mainly on the valley-free model, which we believe can capture most of the notable route leak incidents. As mentioned previously, the key to having low false positives for detecting route leaks with the valley-free model is to have accurate AS-level relationships. While those relationship types are not publicized by every AS, there have been over two decades of research on the inference of the relationship types using publicly observed BGP data.

While state-of-the-art relationship inference algorithms have been shown to be highly accurate, even a small margin of error can still introduce inaccuracies in the detection of route leaks. To alleviate such artifacts, we synthesize multiple data sources for inferring AS-level relationships, including CAIDA/UCSD’s AS relationship data and our in-house built AS relationship dataset. Building on top of these two AS-level relationship datasets, we create a much more granular dataset at the per-prefix and per-peer levels. The improved dataset allows us to answer questions like: what is the relationship between AS1 and AS2 with respect to prefix P, as observed by collector peer X? This eliminates much of the ambiguity for cases where networks have multiple different relationships based on prefixes and geo-locations, and thus helps us reduce the number of false positives in the system. Besides the AS-relationship datasets, we also apply the AS Hegemony dataset from IHR IIJ to further reduce false positives.
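
One way to picture that granular dataset is as a lookup that is keyed as specifically as possible and falls back to the coarser, global relationship when no per-prefix or per-peer entry exists. This is only an illustration of the idea, not Cloudflare's actual data model; the key layout and example ASNs are assumptions.

def as_relationship(rel_db, global_rel, as1, as2, prefix=None, collector_peer=None):
    """Resolve the relationship between as1 and as2, preferring the most
    specific view available (per prefix and vantage point), then falling
    back to the global inference (e.g. CAIDA-style AS relationships)."""
    for key in (
        (as1, as2, prefix, collector_peer),   # most specific: prefix + collector peer
        (as1, as2, prefix, None),             # per prefix only
        (as1, as2, None, None),               # plain AS pair override
    ):
        if key in rel_db:
            return rel_db[key]
    return global_rel.get((as1, as2), "unknown")

# Hypothetical example: AS1 sells transit to AS2 globally, but the two peer
# directly for one specific prefix as seen from one collector peer.
rel_db = {(64496, 64500, "192.0.2.0/24", 64511): "peer"}
global_rel = {(64496, 64500): "provider"}
print(as_relationship(rel_db, global_rel, 64496, 64500, "192.0.2.0/24", 64511))  # peer
print(as_relationship(rel_db, global_rel, 64496, 64500))                         # provider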

Route leak storage and presentation

After processing each BGP message, we store the generated route leak entries in a database for long-term storage and exploration. We also aggregate individual route leak BGP announcements, grouping relevant leaks from the same leaker ASN within a short period into route leak events. The route leak events will then be available for consumption by different downstream applications like web UIs, an API, or alerts.
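
Grouping the per-announcement results into events can be as simple as bucketing by leaker ASN and closing a bucket after a quiet period. Here is a minimal sketch of that idea; the window length and record shape are assumptions, not the production pipeline.

def group_into_events(leak_records, quiet_seconds=3600):
    """leak_records: iterable of (timestamp, leaker_asn, prefix), sorted by time.
    Announcements from the same leaker ASN are merged into one event unless
    they are separated by more than quiet_seconds."""
    open_events = {}                  # leaker_asn -> currently open event
    events = []
    for ts, leaker, prefix in leak_records:
        event = open_events.get(leaker)
        if event is None or ts - event["end"] > quiet_seconds:
            event = {"leaker": leaker, "start": ts, "end": ts, "prefixes": set()}
            open_events[leaker] = event
            events.append(event)
        event["end"] = ts
        event["prefixes"].add(prefix)
    return events

records = [(0, 64500, "192.0.2.0/24"), (60, 64500, "198.51.100.0/24"), (90000, 64500, "192.0.2.0/24")]
print(len(group_into_events(records)))   # 2: the last announcement starts a new event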


Route leaks on Cloudflare Radar

At Cloudflare, we aim to help build a better Internet, and that includes sharing our efforts on monitoring and securing Internet routing. Today, we are releasing our route leak detection system as public beta.

Starting today, users going to the Cloudflare Radar ASN pages will now find the list of route leaks that affect that AS. We consider that an AS is being affected when the leaker AS is one hop away from it in any direction, before or after.

The Cloudflare Radar ASN page is directly accessible via https://radar.cloudflare.com/as{ASN}. For example, one can navigate to https://radar.cloudflare.com/as174 to view the overview page for Cogent AS174. ASN pages now show a dedicated card for route leaks detected relevant to the current ASN within the selected time range.

[Figure: the route leaks card on a Cloudflare Radar ASN page]

Users can also start using our public data API to look up route leak events for any given ASN. Our API supports filtering route leak results by time range and the ASes involved. Here is a screenshot of the route leak events API documentation page on the newly updated API docs site.

[Figure: screenshot of the route leak events API documentation page]

More to come on routing security

There is a lot more we are planning to do with route leak detection. More features like a global view page, route leak notifications, more advanced APIs, custom automation scripts, and historical archive datasets will begin to ship on Cloudflare Radar over time. Your feedback and suggestions are also very important for us as we continue improving our detection results and serving better data to the public.

Furthermore, we will continue to expand our work on other important topics of Internet routing security, including global BGP hijack detection (not limited to our customer networks), RPKI validation monitoring, open-sourcing tools and architecture designs, and a centralized routing security web gateway. Our goal is to provide the best data and tools for routing security to the communities so that we can build a better and more secure Internet together.

In the meantime, we opened a Radar room on our Developers Discord Server. Feel free to join and talk to us; the team is eager to receive feedback and answer questions.

Visit Cloudflare Radar for more Internet insights. You can also follow us on Twitter for more Radar updates.

Route leaks and confirmation biases

Post Syndicated from Maximilian Wilhelm original https://blog.cloudflare.com/route-leaks-and-confirmation-biases/


This is not what I imagined my first blog article would look like, but here we go.

On February 1, 2022, a configuration error on one of our routers caused a route leak of up to 2,000 Internet prefixes to one of our Internet transit providers. This leak lasted for 32 seconds and, at a later time, for another seven seconds. We did not see any traffic spikes or drops in our network and did not see any customer impact because of this error, but this may have caused an impact to external parties, and we are sorry for the mistake.


Timeline

All timestamps are UTC.

As part of our efforts to build the best network, we regularly update our Internet transit and peering links throughout our network. On February 1, 2022, we had a “hot-cut” scheduled with one of our Internet transit providers to simultaneously update router configurations on Cloudflare and ISP routers to migrate one of our existing Internet transit links in Newark to a link with more capacity. Doing a “hot-cut” means that both parties will change cabling and configuration at the same time, usually while being on a conference call, to reduce downtime and impact on the network. The migration started off-peak at 10:45 (05:45 local time) with our network engineer entering the bridge call with our data center engineers and remote hands on site as well as operators from the ISP.

At 11:17, we connected the new fiber link and established the BGP sessions to the ISP successfully. We had BGP filters in place on our end to not accept and send any prefixes, so we could evaluate the connection and settings without any impact on our network and services.

As the connection between our router and the ISP — like most Internet connections — was realized over a fiber link, the first item to check is the “light levels” of that link. This shows the strength of the optical signal received by our router from the ISP router and can indicate a bad connection when it’s too low. Low light levels are likely caused by unclean fiber ends or not fully seated connectors, but may also indicate a defective optical transceiver which connects the fiber link to the router – all of which can degrade service quality.

The next item on the checklist is interface errors, which will occur when a network device receives incorrect or malformed network packets, which would also indicate a bad connection and would likely lead to a degradation in service quality, too.

As light levels were good, and we observed no errors on the link, we deemed it ready for production and removed the BGP reject filters at 11:22.

This immediately triggered the maximum prefix-limit protection the ISP had configured on the BGP session and shut down the session, preventing further impact. The maximum prefix-limit is a safeguard in BGP to prevent the spread of route leaks and to protect the Internet. The limit is usually set just a little higher than the expected number of Internet prefixes from a peer to leave some headroom for growth but also catch configuration errors fast. The configured value was just 40 prefixes short of the number of prefixes we were advertising at that site, so this was considered the reason for the session to be shut down. After checking back internally, we asked the ISP to raise the prefix-limit, which they did.
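
Conceptually, the maximum prefix-limit is just a counter attached to the session. The Python sketch below only illustrates the safeguard; the threshold and the shutdown behavior are generic assumptions, not the ISP's actual configuration.

class BgpSessionGuard:
    """Tear a BGP session down once a peer advertises more prefixes than expected."""
    def __init__(self, max_prefixes):
        self.max_prefixes = max_prefixes
        self.prefixes = set()
        self.shutdown = False

    def on_announcement(self, prefix):
        if self.shutdown:
            return
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            self.shutdown = True   # on a real router: session reset plus an alarm
            print(f"prefix-limit exceeded ({len(self.prefixes)} > {self.max_prefixes}), shutting session")

guard = BgpSessionGuard(max_prefixes=100)
for i in range(150):
    guard.on_announcement(f"10.{i}.0.0/16")   # the 101st unique prefix trips the limit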

The BGP session was reestablished at 12:08 and immediately shut down again. The problem was identified and fixed at 12:14.

10:45: Start of scheduled maintenance

11:17: New link was connected and BGP sessions went up (filters still in place)

11:22: Link was deemed ready for production and filters removed

11:23: BGP sessions were torn down by ISP router due to configured prefix-limit

12:08: ISP configures higher prefix-limits, BGP sessions briefly come up again and are shut down

12:14: Issue identified and configuration updated

What happened and what we’re doing about it

The outage occurred while migrating one of our Internet transits to a link with more capacity. Once the new link and a BGP session had been established, and the link deemed error-free, our network engineering team followed the peer-reviewed deployment plan. The team removed the filters from the BGP sessions, which until then had prevented the Cloudflare router from accepting and sending prefixes via BGP.

Due to an oversight in the deployment plan, which had been peer-reviewed without this issue being noticed, no BGP filters were added to export only the prefixes of Cloudflare and our customers. A peer review on the internal chat did not notice this either, so the network engineer performing this change went ahead.

ewr02# show |compare                                     
[edit protocols bgp group 4-ORANGE-TRANSIT]
-  import REJECT-ALL;
-  export REJECT-ALL;
[edit protocols bgp group 6-ORANGE-TRANSIT]
-  import REJECT-ALL;
-  export REJECT-ALL;

The change resulted in our router sending all known prefixes to the ISP router, which shut down the session as the number of prefixes received exceeded the maximum prefix-limit configured.

As the configured values for the maximum prefix-limits turned out to be rather low for the number of prefixes on our network, this didn’t come as a surprise to our network engineering team and no investigation into why the BGP session went down was started. The prefix-limit being too low seemed to be a perfectly valid reason.

We asked the ISP to increase the prefix-limit, which they did after they received approval on their side. Once the prefix-limit had been increased and the previously shut down BGP sessions had been reset, the sessions were reestablished but were shut down immediately as the maximum prefix-limit was triggered again. This is when our network engineer started questioning whether there was another issue at fault and found and corrected the configuration error previously overlooked.

We made the following change in response to this event: we introduced an implicit reject policy for BGP sessions which will take effect if no import/export policy is configured for a specific BGP neighbor or neighbor group. This change has been deployed.

BGP security & preventing route-leaks — what’s in the cards?

Route leaks aren’t new, and they keep happening. The industry has come up with many approaches to limit the impact or even prevent route-leaks. Policies and filters are used to control which prefixes should be exported to or imported from a given peer. RPKI can help to make sure only allowed prefixes are accepted from a peer and a maximum prefix-limit can act as a last line of defense when everything else fails.

BGP policies and filters are commonly used to ensure only explicitly allowed prefixes are sent out to BGP peers, usually only allowing prefixes owned by the entity operating the network and its customers. They can also be used to tweak some knobs (BGP local-pref, MED, AS path prepend, etc.) to influence routing decisions and balance traffic across links. This is what the policies we have in place for our peers and transits do. As explained above, the maximum prefix-limit is intended to tear down BGP sessions if more prefixes are being sent or received than expected. We have talked about RPKI before; it’s the required cryptographic upgrade to BGP routing, and we are still on our path to securing Internet routing.

To improve the overall stability of the Internet even more, in 2017, a new Internet standard was proposed, which adds another layer of protection into the mix: RFC8212 defines Default External BGP (EBGP) Route Propagation Behavior without Policies, which pretty much tackles the exact issues we were facing.

This RFC updates the BGP-4 standard (RFC4271) which defines how BGP works and what vendors are expected to implement. On the Juniper operating system, JunOS, this can be activated by setting defaults ebgp no-policy reject-always on the protocols bgp hierarchy level starting with Junos OS Release 20.3R1.

If you are running an older version of JunOS, a similar effect can be achieved by defining a REJECT-ALL policy and setting this as import/export policy on the protocols bgp hierarchy level. Note that this will also affect iBGP sessions, which the solution above will have no impact on.

policy-statement REJECT-ALL {
  then reject;
}

protocols {
  bgp {
    import REJECT-ALL;
    export REJECT-ALL;
  }
}

Conclusion

We are sorry for leaking routes for prefixes which did not belong to Cloudflare or our customers, and we apologize to the network engineers who got paged as a result of this.

We have processes in place to make sure that changes to our infrastructure are reviewed before being executed, so potential issues can be spotted before they reach production. In this case, the review process failed to catch this configuration error. In response, we will increase our efforts to further our network automation, to fully derive the device configuration from an intended state.

While this configuration error was caused by human error, it could have been detected and mitigated significantly faster had confirmation bias not kicked in, making the operator think the observed behavior was to be expected. This error underlines the importance of our existing efforts on training our people to be aware of biases we have in our life. This also serves as a great example of how confirmation bias can influence and impact our work, and of why we should question our conclusions (early).

It also shows how important protocols like RPKI are. Route leaks are something even experienced network operators can cause accidentally, and technical solutions are needed to reduce the impact of leaks whether they are intentional or the result of an error.

What happened on the Internet during the Facebook outage

Post Syndicated from Celso Martinho original https://blog.cloudflare.com/during-the-facebook-outage/


It’s been a few days now since Facebook, Instagram, and WhatsApp went AWOL and experienced one of the longest and roughest downtime periods in their existence.

When that happened, we reported our bird’s-eye view of the event and posted the blog Understanding How Facebook Disappeared from the Internet where we tried to explain what we saw and how DNS and BGP, two of the technologies at the center of the outage, played a role in the event.

In the meantime, more information has surfaced, and Facebook has published a blog post giving more details of what happened internally.

As we said before, these events are a gentle reminder that the Internet is a vast network of networks, and we, as industry players and end-users, are part of it and should work together.

In the aftermath of an event of this size, we don’t waste much time debating how peers handled the situation. We do, however, ask ourselves the more important questions: “How did this affect us?” and “What if this had happened to us?” Asking and answering these questions whenever something like this happens is a great and healthy exercise that helps us improve our own resilience.

Today, we’re going to show you how the Facebook and affiliate sites downtime affected us, and what we can see in our data.

1.1.1.1

1.1.1.1 is a fast and privacy-centric public DNS resolver operated by Cloudflare, used by millions of users, browsers, and devices worldwide. Let’s look at our telemetry and see what we find.

First, the obvious. If we look at the response rate, there was a massive spike in the number of SERVFAIL codes. SERVFAILs can happen for several reasons; we have an excellent blog called Unwrap the SERVFAIL that you should read if you’re curious.

In this case, we started serving SERVFAIL responses to all facebook.com and whatsapp.com DNS queries because our resolver couldn’t access the upstream Facebook authoritative servers. That was about 60 times more than the average on a typical day.

[Figure: spike in SERVFAIL responses during the outage]

If we look at all the queries, not specific to Facebook or WhatsApp domains, and we split them by IPv4 and IPv6 clients, we can see that our load increased too.

As explained before, this is due to a snowball effect associated with applications and users retrying after the errors and generating even more traffic. In this case, 1.1.1.1 had to handle more than the expected rate for A and AAAA queries.

[Figure: overall 1.1.1.1 query load, split by IPv4 and IPv6 clients]

Here’s another fun one.

DNS vs. DoT and DoH. Typically, DNS queries and responses are sent in plaintext over UDP (or TCP sometimes), and that’s been the case for decades now. Naturally, this poses security and privacy risks to end-users as it allows in-transit attacks or traffic snooping.

With DNS over TLS (DoT) and DNS over HTTPS (DoH), clients can talk DNS using well-known, well-supported encryption and authentication protocols.

Our learning center has a good article on “DNS over TLS vs. DNS over HTTPS” that you can read. Browsers like Chrome, Firefox, and Edge have supported DoH for some time now, WARP uses DoH too, and you can even configure your operating system to use the new protocols.

When Facebook went offline, we saw the number of DoT+DoH SERVFAIL responses grow to over 300x the average rate.

[Figures: growth in SERVFAIL responses over DoT and DoH]

So, we got hammered with lots of requests and errors, causing traffic spikes to our 1.1.1.1 resolver and causing an unexpected load in the edge network and systems. How did we perform during this stressful period?

Quite well. 1.1.1.1 kept its cool and continued serving the vast majority of requests around the famous 10ms mark. A small fraction of requests, at the p95 and p99 percentiles, saw increased response times, probably due to timeouts trying to reach Facebook’s nameservers.

[Figure: 1.1.1.1 response time percentiles during the outage]

Another interesting perspective is the distribution of the ratio between SERVFAIL and good DNS answers, by country. In theory, the higher this ratio is, the more the country uses Facebook. Here’s the map with the countries that suffered the most:

[Figure: map of the SERVFAIL-to-good-answers ratio by country]

Here’s the top twelve country list, ordered by those that apparently use Facebook, WhatsApp and Instagram the most:

Country SERVFAIL/Good Answers ratio
Turkey 7.34
Grenada 4.84
Congo 4.44
Lesotho 3.94
Nicaragua 3.57
South Sudan 3.47
Syrian Arab Republic 3.41
Serbia 3.25
Turkmenistan 3.23
United Arab Emirates 3.17
Togo 3.14
French Guiana 3.00

Impact on other sites

When Facebook, Instagram, and WhatsApp aren’t around, the world turns to other places to look for information on what’s going on, other forms of entertainment or other applications to communicate with their friends and family. Our data shows us those shifts. While Facebook was going down, other services and platforms were going up.

To get an idea of the changing traffic patterns we look at DNS queries as an indicator of increased traffic to specific sites or types of site.

Here are a few examples.

Other social media platforms saw a slight increase in use, compared to normal.

[Figure: DNS query volume for other social media platforms]

Traffic to messaging platforms like Telegram, Signal, Discord and Slack got a little push too.

[Figure: DNS query volume for messaging platforms]

Nothing like a little gaming time when Instagram is down, we guess, when looking at traffic to sites like Steam, Xbox, Minecraft and others.

[Figure: DNS query volume for gaming platforms]

And yes, people want to know what’s going on and fall back on news sites like CNN, New York Times, The Guardian, Wall Street Journal, Washington Post, Huffington Post, BBC, and others:

[Figure: DNS query volume for news sites]

Attacks

One could speculate that the Internet was under attack from malicious hackers. Our Firewall doesn’t agree; nothing out of the ordinary stands out.

[Figure: firewall activity during the outage]

Network Error Logs

Network Error Logging, NEL for short, is an experimental technology supported in Chrome. A website can issue a Report-To header and ask the browser to send reports about network problems, like bad requests or DNS issues, to a specific endpoint.

Cloudflare uses NEL data to quickly help triage end-user connectivity issues when end-users reach our network. You can learn more about this feature in our help center.

If Facebook is down and their DNS isn’t responding, Chrome will start reporting NEL events every time one of the pages in our zones fails to load Facebook comments, posts, ads, or authentication buttons. This chart shows it clearly.​​

[Chart: NEL reports during the outage]

WARP

Cloudflare announced WARP in 2019, and called it “A VPN for People Who Don’t Know What V.P.N. Stands For” and offered it for free to its customers. Today WARP is used by millions of people worldwide to securely and privately access the Internet on their desktop and mobile devices. Here’s what we saw during the outage by looking at traffic volume between WARP and Facebook’s network:

[Figure: WARP traffic volume to Facebook’s network]

You can see how the steep drop in Facebook ASN traffic coincides with the start of the incident and how it compares to the same period the day before.

Our own traffic

People tend to think of Facebook as a place to visit. We log in, and we access Facebook, we post. It turns out that Facebook likes to visit us too, quite a lot. Like Google and other platforms, Facebook uses an army of crawlers to constantly check websites for data and updates. Those robots gather information about website content, such as titles, descriptions, thumbnail images, and metadata. You can learn more about this on the “The Facebook Crawler” page and the Open Graph website.

Here’s what we see when traffic is coming from the Facebook ASN, supposedly from crawlers, to our CDN sites:

[Figure: traffic from the Facebook ASN to Cloudflare CDN sites]

The robots went silent.

What about the traffic coming to our CDN sites from Facebook User-Agents? The gap is indisputable.

[Figure: traffic from Facebook User-Agents to Cloudflare CDN sites]

We see about 30% of a typical request rate hitting us. But it’s not zero; why is that?

We’ll let you know a little secret. Never trust User-Agent information; it’s broken. User-Agent spoofing is everywhere. Browsers, apps, and other clients deliberately change the User-Agent string when they fetch pages from the Internet to hide, obtain access to certain features, or bypass paywalls (because pay-walled sites want sites like Facebook to index their content, so that they then get more traffic from links).

Fortunately, there are newer, and privacy-centric standards emerging like User-Agent Client Hints.

Core Web Vitals

Core Web Vitals are the subset of Web Vitals, an initiative by Google to provide a unified interface to measure real-world quality signals when a user visits a web page. Such signals include Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS).

We use Core Web Vitals with our privacy-centric Web Analytics product and collect anonymized data on how end-users experience the websites that enable this feature.

One of the metrics we can calculate using these signals is the page load time. Our theory is that if a page includes scripts coming from external sites (for example, Facebook “like” buttons, comments, ads), and they are unreachable, its total load time gets affected.

We used a list of about 400 domains that we know embed Facebook scripts in their pages and looked at the data.

[Figure: page load time for sites embedding Facebook scripts]

Now let’s look at the Largest Contentful Paint. LCP marks the point in the page load timeline when the page’s main content has likely loaded. The faster the LCP is, the better the end-user experience.

[Figure: Largest Contentful Paint for sites embedding Facebook scripts]

Again, the page load experience got visibly degraded.

The outcome seems clear. The sites that use Facebook scripts in their pages took 1.5x more time to load their pages during the outage, with some of them taking more than 2x the usual time. Facebook’s outage dragged the performance of some other sites down.

Conclusion

When Facebook, Instagram, and WhatsApp went down, the Web felt it. Some websites got slower or lost traffic, other services and platforms got unexpected load, and people lost the ability to communicate or do business normally.

Hypotheses About What Happened to Facebook

Post Syndicated from Bozho original https://techblog.bozho.net/hypotheses-about-what-happened-to-facebook/

Facebook was down. I’d recommend reading Cloudflare’s summary. Then I recommend reading Facebook’s own account of the incident. But let me expand on that. Facebook published announcements and withdrawals for certain BGP prefixes, which led to the removal of its DNS servers from “the map of the internet” – they told everyone “the part of our network where our DNS servers are doesn’t exist”. That was the result of a self-inflicted backbone failure due to a bug in the auditing tool that checks whether the commands being executed aren’t doing harmful things.

Facebook owns a lot of IPs. According to RIPEstat they are part of 399 prefixes (147 of them IPv4). The DNS servers are located in two of those 399. Facebook uses a.ns.facebook.com, b.ns.facebook.com, c.ns.facebook.com and d.ns.facebook.com, which get queried whenever someone wants to know the IPs of Facebook-owned domains. These four nameservers are served by the same Autonomous System from just two prefixes – 129.134.30.0/23 and 185.89.218.0/23. Of course, “4 nameservers” is a logical construct; there are probably many actual servers behind them (using anycast).

I wrote a simple “script” to fetch all the withdrawals and announcements for all Facebook-owned prefixes (from the great API of RIPEstat). Facebook didn't remove itself from the map entirely. As Cloudflare points out, only some prefixes were affected. It may have been just these two, or a few others as well, but it seems that only a handful were affected. If we sort the resulting CSV from the above script by withdrawals, we'll notice that 129.134.30.0/23 and 185.89.218.0/23 are pretty high up (alongside the corresponding /24s, 185.89.218.0/24 and 129.134.30.0/24, which are contained in those /23s). That perfectly matches Facebook's account that their nameservers automatically withdraw themselves if they fail to connect to other parts of the infrastructure. Everything may have also been down, but the withdrawal logic is present only in the networks that have nameservers in them.
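A minimal sketch of such a script (not the author's actual one), using the RIPEstat bgp-updates endpoint that also appears later in this compilation; the time window shown and the per-update "type" field ("A" for announce, "W" for withdraw) are assumptions to verify against the live API:

# Pull a day of BGP updates for Facebook's AS32934 from RIPEstat and count
# announcements/withdrawals per prefix, then sort by withdrawals.
import json
import urllib.request
from collections import Counter

URL = ("https://stat.ripe.net/data/bgp-updates/data.json"
       "?resource=AS32934"
       "&starttime=2021-10-04T00:00:00"
       "&endtime=2021-10-05T00:00:00")

with urllib.request.urlopen(URL) as response:
    data = json.load(response)

announce, withdraw = Counter(), Counter()
for update in data["data"]["updates"]:
    prefix = update["attrs"]["target_prefix"]
    if update.get("type") == "W":            # assumed: "A" = announce, "W" = withdraw
        withdraw[prefix] += 1
    else:
        announce[prefix] += 1

# Sorting by withdrawals mirrors the "sort the resulting CSV" step above.
for prefix, count in withdraw.most_common(10):
    print(f"{prefix}\t{count} withdrawals\t{announce[prefix]} announcements")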

So first, let me make a few general observations that are not as obvious or as universal as they may sound, but are worth discussing:

  • Use longer DNS TTLs if possible – if Facebook had a 6-hour TTL on its domains, we may not have noticed that their nameservers were down. This is hard to ask of such a complex service that uses DNS for load-balancing and geographical distribution, but it's worth considering. Also, if they killed their backbone and their entire infrastructure was down anyway, a longer DNS TTL would not have solved the issue.
  • We need improved caching logic for DNS. It can't be just “present or not”; DNS caches could keep a “last known good state” and fall back to it in case of SERVFAIL. All of those DNS resolvers that had to ask the authoritative nameserver “where can I find facebook.com” knew where to find facebook.com just a minute earlier. Then they got a failure and suddenly they are wondering “oh, where could Facebook be?”. It's not that simple, of course, but such a cache improvement is worth considering (a minimal sketch of the idea follows this list). And again, if their entire infrastructure was down, this would not have helped.
  • Consider having an authoritative nameserver outside your main AS. If something bad happens to your AS routes (regardless of the reason), you may still have DNS working. That may have downsides – generally, it will be hard to manage and sync your DNS infrastructure. But at least having a spare set of nameservers and the option to quickly point glue records there is worth considering. It would not have saved Facebook in this case, as again, they claim the entire infrastructure was inaccessible due to a “broken” backbone.
  • Have a 100% test coverage on critical tools, such as the auditing tool that had a bug. 100% test coverage is rarely achievable in any project, but in such critical tools it’s a must.
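As a rough illustration of the “last known good state” idea from the second bullet: a tiny, library-free cache wrapper that keeps serving a stale answer when the upstream lookup fails. Real serve-stale behaviour is specified in RFC 8767; this sketch, including the flaky_resolver helper, is purely hypothetical.

import time

class ServeStaleCache:
    def __init__(self, resolver, ttl=300, stale_limit=86400):
        self.resolver = resolver          # callable: name -> list of IPs
        self.ttl = ttl                    # how long an answer counts as fresh
        self.stale_limit = stale_limit    # how long stale data may be served
        self.cache = {}                   # name -> (answer, fetched_at)

    def resolve(self, name):
        entry = self.cache.get(name)
        now = time.time()
        if entry and now - entry[1] < self.ttl:
            return entry[0]               # fresh answer from cache
        try:
            answer = self.resolver(name)
            self.cache[name] = (answer, now)
            return answer
        except Exception:
            # Upstream failed (think SERVFAIL): fall back to stale data if any.
            if entry and now - entry[1] < self.stale_limit:
                return entry[0]
            raise

# Hypothetical resolver that succeeds once, then starts failing.
def flaky_resolver(name, _state={"ok": True}):
    if _state["ok"]:
        _state["ok"] = False
        return ["157.240.1.35"]
    raise RuntimeError("SERVFAIL")

cache = ServeStaleCache(flaky_resolver, ttl=0)
print(cache.resolve("facebook.com"))      # fresh lookup
print(cache.resolve("facebook.com"))      # stale answer served despite the failure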

The main explanation is the accidental outage. This is what Facebook engineers explain in the blogpost and other accounts, and that’s what seems to have happened. However, there are alternative hypotheses floating around, so let me briefly discuss all of the options.

  • Accidental outage due to misconfiguration – a very likely scenario. These things can happen to anyone, and Facebook is known for its “break things” mentality, so it's not unlikely that they just didn't have the right safeguards in place and that someone ran a buggy update. There are many scenarios for why and how that may have happened, and we can't know from the outside (even after Facebook's brief description). This remains the primary explanation, following my favorite Hanlon's razor. A bug in the audit tool is absolutely realistic (by the way, I'd love Facebook to publish their internal tools).
  • Cyber attack – the data we have can't rule this out, but it would require a sophisticated attack that gained access to their BGP administration interface, which I would assume is properly protected. Not impossible, but a 6-hour outage of a social network is not something a sophisticated actor (e.g. a nation state) would invest resources in. We can't rule it out, as this might be “just a drill” for something bigger to follow. If I were an attacker that wanted to take Facebook down, I'd try to kill their DNS servers, or indeed, “de-route” them. If we didn't know that Facebook lets its DNS servers cut themselves off from the network in case of failures, the fact that so few prefixes were updated might be an indicator of a targeted attack, but this seems less and less likely.
  • Deliberate self-sabotage – 1.5 billion records were claimed to have been leaked yesterday. At the same time, a Facebook whistleblower is testifying in the US Congress. Both stories are potentially damaging to Facebook's reputation and share price. If they wanted to drown that news and the respective share price plunge in a technical story that few people understand but everyone is talking about (and then have their share price rebound, because technical issues happen to everyone), then that's the way to do it – just as a malicious actor would, but without all the hassle of gaining access from outside – de-route the prefixes for the DNS servers and you have a “perfect” outage. These coincidences have led people to assume such a plot, but given the observed outage and the explanation Facebook gave for why the DNS prefixes were automatically withdrawn, this sounds unlikely.

Distinguishing between the three options is actually hard. You can mask a deliberate outage as an accident, and a malicious actor can make an attack look like deliberate self-sabotage. That's why there are speculations. To me, however, based on all the data we have in RIPEstat and the various accounts by Cloudflare, Facebook and other experts, it seems that a chain of mistakes (operational and possibly design ones) led to this.

The post Hypotheses About What Happened to Facebook appeared first on Bozho's tech blog.

Understanding How Facebook Disappeared from the Internet

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/october-2021-facebook-outage/

Understanding How Facebook Disappeared from the Internet

Understanding How Facebook Disappeared from the Internet

“Facebook can’t be down, can it?”, we thought, for a second.

Today at 1651 UTC, we opened an internal incident entitled “Facebook DNS lookup returning SERVFAIL” because we were worried that something was wrong with our DNS resolver 1.1.1.1.  But as we were about to post on our public status page we realized something else more serious was going on.

Social media quickly burst into flames, reporting what our engineers rapidly confirmed too. Facebook and its affiliated services WhatsApp and Instagram were, in fact, all down. Their DNS names stopped resolving, and their infrastructure IPs were unreachable. It was as if someone had “pulled the cables” from their data centers all at once and disconnected them from the Internet.

How’s that even possible?

Meet BGP

BGP stands for Border Gateway Protocol. It’s a mechanism to exchange routing information between autonomous systems (AS) on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to their final destinations. Without BGP, the Internet routers wouldn’t know what to do, and the Internet wouldn’t work.

The Internet is literally a network of networks, and it's bound together by BGP. BGP allows one network (say Facebook) to advertise its presence to the other networks that form the Internet. As we write this, Facebook is not advertising its presence; ISPs and other networks can't find Facebook's network, and so it is unavailable.

The individual networks each have an ASN: an Autonomous System Number. An Autonomous System (AS) is an individual network with a unified internal routing policy. An AS can originate prefixes (say that they control a group of IP addresses), as well as transit prefixes (say they know how to reach specific groups of IP addresses).

Cloudflare’s ASN is AS13335. Every ASN needs to announce its prefix routes to the Internet using BGP; otherwise, no one will know how to connect and where to find us.

Our learning center has a good overview of what BGP and ASNs are and how they work.

In this simplified diagram, you can see six autonomous systems on the Internet and two possible routes that one packet can use to go from Start to End. AS1 → AS2 → AS3 being the fastest, and AS1 → AS6 → AS5 → AS4 → AS3 being the slowest, but that can be used if the first fails.

Understanding How Facebook Disappeared from the Internet

At 1658 UTC we noticed that Facebook had stopped announcing the routes to their DNS prefixes. That meant that, at least, Facebook’s DNS servers were unavailable. Because of this Cloudflare’s 1.1.1.1 DNS resolver could no longer respond to queries asking for the IP address of facebook.com or instagram.com.

route-views>show ip bgp 185.89.218.0/23
% Network not in table
route-views>

route-views>show ip bgp 129.134.30.0/23
% Network not in table
route-views>

Meanwhile, other Facebook IP addresses remained routed but weren’t particularly useful since without DNS Facebook and related services were effectively unavailable:

route-views>show ip bgp 129.134.30.0   
BGP routing table entry for 129.134.0.0/17, version 1025798334
Paths: (24 available, best #14, table default)
  Not advertised to any peer
  Refresh Epoch 2
  3303 6453 32934
    217.192.89.50 from 217.192.89.50 (138.187.128.158)
      Origin IGP, localpref 100, valid, external
      Community: 3303:1004 3303:1006 3303:3075 6453:3000 6453:3400 6453:3402
      path 7FE1408ED9C8 RPKI State not found
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
route-views>

We keep track of all the BGP updates and announcements we see in our global network. At our scale, the data we collect gives us a view of how the Internet is connected and where the traffic is meant to flow from and to everywhere on the planet.

A BGP UPDATE message informs a router of any changes you’ve made to a prefix advertisement or entirely withdraws the prefix. We can clearly see this in the number of updates we received from Facebook when checking our time-series BGP database. Normally this chart is fairly quiet: Facebook doesn’t make a lot of changes to its network minute to minute.

But at around 15:40 UTC we saw a peak of routing changes from Facebook. That’s when the trouble began.

Understanding How Facebook Disappeared from the Internet

If we split this view by routes announcements and withdrawals, we get an even better idea of what happened. Routes were withdrawn, Facebook’s DNS servers went offline, and one minute after the problem occurred, Cloudflare engineers were in a room wondering why 1.1.1.1 couldn’t resolve facebook.com and worrying that it was somehow a fault with our systems.

Understanding How Facebook Disappeared from the Internet

With those withdrawals, Facebook and its sites had effectively disconnected themselves from the Internet.

DNS gets affected

As a direct consequence of this, DNS resolvers all over the world stopped resolving their domain names.

➜  ~ dig @1.1.1.1 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com.			IN	A
➜  ~ dig @1.1.1.1 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com.			IN	A
➜  ~ dig @8.8.8.8 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com.			IN	A
➜  ~ dig @8.8.8.8 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com.			IN	A

This happens because DNS, like many other systems on the Internet, has its own routing mechanism. When someone types the https://facebook.com URL in the browser, the DNS resolver, responsible for translating domain names into actual IP addresses to connect to, first checks if it has something in its cache and uses it. If not, it tries to grab the answer from the domain's nameservers, typically hosted by the entity that owns the domain.

If the nameservers are unreachable or fail to respond because of some other reason, then a SERVFAIL is returned, and the browser issues an error to the user.
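For illustration, here is roughly the same check the dig commands above perform, sketched with the third-party dnspython package (an assumption: it is not part of the tooling used in this post). It sends a query straight to a resolver and prints the response code, which during the outage would have been SERVFAIL:

import dns.exception
import dns.message
import dns.query
import dns.rcode
import dns.rdatatype

def check(name, resolver_ip="1.1.1.1"):
    # Build an A query and send it over UDP directly to the resolver.
    query = dns.message.make_query(name, dns.rdatatype.A)
    try:
        response = dns.query.udp(query, resolver_ip, timeout=5)
        return dns.rcode.to_text(response.rcode())   # e.g. NOERROR or SERVFAIL
    except dns.exception.Timeout:
        return "TIMEOUT"

for domain in ("facebook.com", "whatsapp.com", "cloudflare.com"):
    print(domain, check(domain))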

Again, our learning center provides a good explanation on how DNS works.

Understanding How Facebook Disappeared from the Internet

Because Facebook stopped announcing its DNS prefix routes through BGP, our and everyone else's DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.

But that’s not all. Now human behavior and application logic kicks in and causes another exponential effect. A tsunami of additional DNS traffic follows.

This happened in part because apps won’t accept an error for an answer and start retrying, sometimes aggressively, and in part because end-users also won’t take an error for an answer and start reloading the pages, or killing and relaunching their apps, sometimes also aggressively.

This is the traffic increase (in number of requests) that we saw on 1.1.1.1:

Understanding How Facebook Disappeared from the Internet

So now, because Facebook and their sites are so big, we have DNS resolvers worldwide handling 30x more queries than usual and potentially causing latency and timeout issues to other platforms.

Fortunately, 1.1.1.1 was built to be Free, Private, Fast (as the independent DNS monitor DNSPerf can attest), and scalable, and we were able to keep servicing our users with minimal impact.

The vast majority of our DNS requests kept resolving in under 10ms. At the same time, a small fraction of p95 and p99 percentile requests saw increased response times, probably due to expired TTLs forcing resolvers to query the Facebook nameservers and time out. The 10-second DNS timeout limit is well known amongst engineers.

Understanding How Facebook Disappeared from the Internet

Impacting other services

People look for alternatives and want to know more or discuss what’s going on. When Facebook became unreachable, we started seeing increased DNS queries to Twitter, Signal and other messaging and social media platforms.

Understanding How Facebook Disappeared from the Internet

We can also see another side effect of this unreachability in our WARP traffic to and from Facebook’s affected ASN 32934. This chart shows how traffic changed from 15:45 UTC to 16:45 UTC compared with three hours before in each country. All over the world WARP traffic to and from Facebook’s network simply disappeared.

Understanding How Facebook Disappeared from the Internet

The Internet

Today’s events are a gentle reminder that the Internet is a very complex and interdependent system of millions of systems and protocols working together. That trust, standardization, and cooperation between entities are at the center of making it work for almost five billion active users worldwide.

Update

At around 21:00 UTC we saw renewed BGP activity from Facebook’s network which peaked at 21:17 UTC.

Understanding How Facebook Disappeared from the Internet

This chart shows the availability of the DNS name ‘facebook.com’ on Cloudflare’s DNS resolver 1.1.1.1. It stopped being available at around 15:50 UTC and returned at 21:20 UTC.

Understanding How Facebook Disappeared from the Internet

Undoubtedly the Facebook, WhatsApp and Instagram services will take further time to come fully online, but as of 22:28 UTC Facebook appears to be reconnected to the global Internet, with DNS working again.

Fall 2020 RPKI Update

Post Syndicated from Louis Poinsignon original https://blog.cloudflare.com/rpki-2020-fall-update/

Fall 2020 RPKI Update

The Internet is a network of networks. In order to find the path between two points and exchange data, the network devices rely on the information from their peers. This information consists of IP addresses and Autonomous Systems (AS) which announce the addresses using Border Gateway Protocol (BGP).

One problem arises from this design: what protects against a malevolent peer who decides to announce incorrect information? The damage caused by route hijacks can be major.

Resource Public Key Infrastructure (RPKI) is a framework created in 2008. Its goal is to provide a source of truth for Internet Resources (IP addresses) and ASes in cryptographically signed records called Route Origin Authorizations (ROAs).

Recently, we've seen the significant threshold of two hundred thousand ROAs being passed. This represents a big step in making the Internet more secure against accidental and deliberate BGP tampering.

We have talked about RPKI in the past but we thought it would be a good time for an update.

In a more technical context, the RPKI framework consists of two parts:

  • IP addresses need to be cryptographically signed by their owners in a database managed by a Trust Anchor: Afrinic, APNIC, ARIN, LACNIC and RIPE. Those five organizations are in charge of allocating Internet resources. The ROA indicates which Network Operator is allowed to announce the addresses using BGP.
  • Network operators download the list of ROAs, perform the cryptographic checks and then apply filters on the prefixes they receive: this is called BGP Origin Validation.
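As a rough sketch of the Origin Validation step described in the second bullet: given a table of ROAs (prefix, maximum length, origin ASN), a received announcement is valid, invalid, or simply unknown. The ROA entries below are illustrative, not an export from a real validator.

import ipaddress

ROAS = [
    # (prefix, max_length, origin_asn) -- illustrative entries only
    ("1.9.0.0/16", 24, 4788),
    ("103.21.244.0/23", 23, 13335),
]

def validate(prefix, origin_asn):
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_len, roa_asn in ROAS:
        roa_net = ipaddress.ip_network(roa_prefix)
        if net.version == roa_net.version and net.subnet_of(roa_net):
            covered = True
            if origin_asn == roa_asn and net.prefixlen <= max_len:
                return "VALID"
    return "INVALID" if covered else "NOT_FOUND"

print(validate("1.9.12.0/24", 4788))      # VALID: covered, right ASN, within max length
print(validate("1.9.12.0/25", 4788))      # INVALID: more specific than the /24 max length
print(validate("1.9.12.0/24", 64512))     # INVALID: covered but wrong origin ASN
print(validate("192.0.2.0/24", 64512))    # NOT_FOUND: no covering ROA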

The “Is BGP Safe Yet” website

The launch of the website isbgpsafeyet.com to test if your ISP correctly performs BGP Origin Validation was a success. Since launch, it has been visited more than five million times from over 223 countries and 13,000 unique networks (20% of the entire Internet), generating half a million BGP Origin Validation tests.

Many providers subsequently indicated on social media (for example, here or here) that they had an RPKI deployment in the works. This increase in Origin Validation by networks is increasing the security of the Internet globally.

The site’s test for Origin Validation consists of queries toward two addresses, one of which is behind an RPKI invalid prefix and the other behind an RPKI valid prefix. If the query towards the invalid succeeds, the test fails as the ISP does not implement Origin Validation. We counted the number of queries that failed to reach invalid.cloudflare.com. This also included a few thousand RIPE Atlas tests that were started by Cloudflare and various contributors, providing coverage for smaller networks.

Every month since launch we’ve seen that around 10 to 20 networks are deploying RPKI Origin Validation. Among the major providers we can build the following table:

Month Networks
August Swisscom (Switzerland), Salt (Switzerland)
July Telstra (Australia), Quadranet (USA), Videotron (Canada)
June Colocrossing (USA), Get Norway (Norway), Vocus (Australia), Hurricane Electric (Worldwide), Cogent (Worldwide)
May Sengked Fiber (Indonesia), Online.net (France), WebAfrica Networks (South Africa), CableNet (Cyprus), IDnet (Indonesia), Worldstream (Netherlands), GTT (Worldwide)

With the help of many contributors, we have compiled a list of network operators and public statements at the top of the isbgpsafeyet.com page.

We excluded providers that manually blocked the traffic towards the prefix instead of using RPKI. Among the techniques we see are firewall filtering and manual prefix rejection. The filtering is often propagated to other customer ISPs. In a unique case, an ISP generated a “more-specific” blackhole route that leaked to multiple peers over the Internet.

The deployment of RPKI by major transit providers, also known as Tier 1s, such as Cogent, GTT, Hurricane Electric, NTT and Telia made many downstream networks more secure without them having to deploy validation software themselves.

Overall, we looked at the evolution of successful tests per ASN and noticed a steady increase of 8% over recent months.

Fall 2020 RPKI Update

Furthermore, when we probed the entire IPv4 space this month, using a similar technique to the isbgpsafeyet.com test, many more networks were unable to reach an RPKI invalid prefix than during the same period last year. This confirms an increase in RPKI Origin Validation deployment across network operators. The picture below shows the IPv4 space behind a network with RPKI Origin Validation enabled in yellow and the active space in blue. It uses a Hilbert Curve to efficiently plot IP addresses: for example, one /20 prefix (4096 IPs) is a pixel, and a /16 prefix (65536 IPs) forms a 4×4 pixel square.

The more the yellow spreads, the safer the Internet becomes.

Fall 2020 RPKI Update

What does it mean exactly? If you were hijacking a prefix, the users behind the yellow space would likely not be affected. This also applies if you mis-sign your prefixes: you would not be able to reach the services or users behind the yellow space. Once RPKI is enabled everywhere, there will only be yellow squares.

Progression of signed prefixes

Owners of IP addresses indicate the networks allowed to announce them. They do this by signing prefixes: they create Route Origin Authorizations (ROAs). As of today, there are more than 200,000 ROAs. The distribution shows that the RIPE region is still leading in ROA count, followed by the APNIC region.

Fall 2020 RPKI Update

2020 started with 172,000 records and the count is getting close to 200,000 at the beginning of November, approximately a quarter of all the Internet routes. Since last year, the database of ROAs grew by more than 70 percent, from 100,000 records, an average pace of 5% every month.

On the following graph of unique ROAs count per day, we can see two points that were followed by a change in ROA creation rate: 140/day, then 231/day, and since August, 351 new ROAs per day.

It is not yet clear what caused the increase in August.

Fall 2020 RPKI Update

Free services and software

In 2018 and 2019, Cloudflare was impacted by BGP route hijacks. Both could have been avoided with RPKI. Not long after the first incident, we started signing prefixes and developing RPKI software. It was necessary to make BGP safer and we wanted to do more than talk about it. But we also needed enough networks to be deploying RPKI as well. By making deployment easier for everyone, we hoped to increase adoption.

The following is a reminder of what we built over the years around RPKI and how it grew.

OctoRPKI is Cloudflare’s open source RPKI Validation software. It periodically generates a JSON document of validated prefixes that we pass onto our routers using GoRTR. It generates most of the data behind the graphs here.

The latest version of OctoRPKI, 1.2.0, was released at the end of October. It implements important security fixes, better memory management, and extended logging. This is the first validator to report detailed information about cryptographically invalid records to Sentry and performance data to distributed tracing tools.
GoRTR remains heavily used in production, including by transit providers. It can natively connect to other validators like rpki-client.

When we released our public rpki.json endpoint in early 2019, the idea was to enable anyone to see what Cloudflare was filtering.

The file is also used as a bootstrap by GoRTR, so that users can test a deployment. The file is cached on more than 200 data centers, ensuring quick and secure delivery of a list of valid prefixes, making RPKI more accessible for smaller networks and developers.

Between March 2019 and November 2020, the number of queries more than doubled and there are five times more networks querying this file.

The growth of queries follows approximately the rate of ROA creation (~5% per month).

Fall 2020 RPKI Update

A public RTR server is also available on rtr.rpki.cloudflare.com. It includes a plaintext endpoint on port 8282 and an SSH endpoint on port 8283. This allows us to test new versions of GoRTR before release.

Later in 2019, we also built a public dashboard where you can see in-depth RPKI validation. With a GraphQL API, you can now explore the validation data, test a list of prefixes, or see the status of the current routing table.

Fall 2020 RPKI Update

Currently, the API is used by BGPalerter, an open-source tool that detects routing issues (including hijacks!) from a stream of BGP updates.

Additionally, starting in November, you can access the historical data from May 2019. Data is computed daily and contains the unique records. The team behind the dashboard worked hard to provide a fast and accurate visualization of the daily ROA changes and the volumes of files changed over the day.

Fall 2020 RPKI Update

The future

We believe RPKI is going to continue growing, and we would like to thank the hundreds of network engineers around the world who are making the Internet routing more secure by deploying RPKI.

25% of routes are signed and 20% of the Internet is doing origin validation, and those numbers grow every day. We believe BGP will be safer well before reaching 100% deployment; for instance, once the remaining transit providers enable Origin Validation, it is unlikely a BGP hijack will make it to the front page of world news outlets.

While difficult to quantify, we believe that critical mass of protected resources will be reached in late 2021.

We will keep improving the tooling; OctoRPKI and GoRTR are open-source and we welcome contributions. In the near future, we plan on releasing a packaged version of GoRTR that can be directly installed on certain routers. Stay tuned!

Is BGP Safe Yet? No. But we are tracking it carefully

Post Syndicated from Louis Poinsignon original https://blog.cloudflare.com/is-bgp-safe-yet-rpki-routing-security-initiative/

Is BGP Safe Yet? No. But we are tracking it carefully

BGP leaks and hijacks have been accepted as an unavoidable part of the Internet for far too long. We relied on protection at the upper layers, like TLS and DNSSEC, to ensure untampered delivery of packets, but a hijacked route often results in an unreachable IP address, which results in an Internet outage.

The Internet is too vital to allow this known problem to continue any longer. It’s time networks prevented leaks and hijacks from having any impact. It’s time to make BGP safe. No more excuses.

Border Gateway Protocol (BGP), a protocol to exchange routes, has existed and evolved since the 1980s. Over the years it has gained security features. The most notable security addition is Resource Public Key Infrastructure (RPKI), a security framework for routing. It has been the subject of a few blog posts following our deployment in mid-2018.

Today, the industry considers RPKI mature enough for widespread use, with a sufficient ecosystem of software and tools, including tools we’ve written and open sourced. We have fully deployed Origin Validation on all our BGP sessions with our peers and signed our prefixes.

However, the Internet can only be safe if the major network operators deploy RPKI. Those networks have the ability to spread a leak or hijack far and wide, and it's vital that they take part in stamping out the scourge of BGP problems, whether inadvertent or deliberate.

Many, like AT&T and Telia, pioneered global deployments of RPKI in 2019. They were successfully followed by Cogent and NTT in 2020. Hundreds of networks of all sizes have done a tremendous job over the last few years, but there is still work to be done.

If we observe the customer-cones of the networks that have deployed RPKI, we see around 50% of the Internet is more protected against route leaks. That’s great, but it’s nothing like enough.

Is BGP Safe Yet? No. But we are tracking it carefully

Today, we are releasing isBGPSafeYet.com, a website to track deployments and filtering of invalid routes by the major networks.

We are hoping this will help the community and we will crowdsource the information on the website. The source code is available on GitHub, we welcome suggestions and contributions.

We expect this initiative will make RPKI more accessible to everyone and ultimately will reduce the impact of route leaks. Share the message with your Internet Service Providers (ISPs), hosting providers, and transit networks to build a safer Internet.

Additionally, to monitor and test deployments, we decided to announce two bad prefixes from our 200+ data centers and via the 233+ Internet Exchange Points (IXPs) we are connected to:

  • 103.21.244.0/24
  • 2606:4700:7000::/48

Both these prefixes should be considered invalid and should not be routed by your provider if RPKI is implemented within their network. This makes it easy to demonstrate how far a bad route can go, and test whether RPKI is working in the real world.

Is BGP Safe Yet? No. But we are tracking it carefully
A Route Origin Authorization for 103.21.244.0/24 on rpki.cloudflare.com

In the test you can run on isBGPSafeYet.com, your browser will attempt to fetch two pages: the first one, valid.rpki.cloudflare.com, is behind an RPKI-valid prefix, and the second one, invalid.rpki.cloudflare.com, is behind the RPKI-invalid prefix.

The test has two outcomes:

  • If both pages were correctly fetched, your ISP accepted the invalid route. It does not implement RPKI.
  • If only valid.rpki.cloudflare.com was fetched, your ISP implements RPKI. You will be less sensitive to route-leaks.
Is BGP Safe Yet? No. But we are tracking it carefully
a simple test of RPKI invalid reachability
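The same test sketched from the command line instead of a browser; the two hostnames are the ones named above, while the timeout value and the interpretation logic are simplifications of what the site actually does:

import socket
import urllib.error
import urllib.request

def reachable(url, timeout=10):
    # A page behind an RPKI-invalid prefix should be unreachable if the local
    # ISP drops invalid routes, so any network error counts as "not reachable".
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except (urllib.error.URLError, socket.timeout):
        return False

valid_ok = reachable("https://valid.rpki.cloudflare.com")
invalid_ok = reachable("https://invalid.rpki.cloudflare.com")

if valid_ok and not invalid_ok:
    print("Your ISP appears to drop RPKI-invalid routes.")
elif valid_ok and invalid_ok:
    print("Both pages loaded: your ISP does not filter RPKI-invalid routes.")
else:
    print("Inconclusive: even the valid test page could not be fetched.")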

We will be performing tests using those prefixes to check for propagation. Traceroutes and probing helped us in the past by creating visualizations of deployment.

A simple indicator is the number of networks sending the accepted route to their peers and collectors:

Is BGP Safe Yet? No. But we are tracking it carefully
Routing status from online route collection tool RIPE Stat

In December 2019, we released a Hilbert curve map of the IPv4 address space. Every pixel represents a /20 prefix. If a dot is yellow, the prefix responded only to the probe from a RPKI-valid IP space. If it is blue, the prefix responded to probes from both RPKI valid and invalid IP space.

To summarize, the yellow areas are IP space behind networks that drop RPKI invalid prefixes. The Internet isn’t safe until the blue becomes yellow.

Is BGP Safe Yet? No. But we are tracking it carefully
Hilbert Curve Map of IP address space behind networks filtering RPKI invalid prefixes

Last but not least, we would like to thank every network that has already deployed RPKI and every developer that contributed to validator-software code bases. The last two years have shown that the Internet can become safer and we are looking forward to the day where we can call route leaks and hijacks an incident of the past.

RPKI and the RTR protocol

Post Syndicated from Martin J Levy original https://blog.cloudflare.com/rpki-and-the-rtr-protocol/

RPKI and the RTR protocol

Today’s Internet requires stronger protection within its core routing system and as we have already said: it’s high time to stop BGP route leaks and hijacks by deploying operationally-excellent RPKI!

Luckily, over the last year or so, a lot of good work has happened in this arena. If you've been following the growth of RPKI's validation data, then you'll know that more and more networks are signing their routes and creating ROAs, or Route Origin Authorizations. These are cryptographically signed assertions of the validity of an announced IP block, and they contribute to further securing the global routing table that makes for a safer Internet.

The protocol that we have not written much about is RTR. The Resource Public Key Infrastructure (RPKI) to Router Protocol – or RTR Protocol for short. Today we’re fixing that.

RPKI rewind

We have written a few times about RPKI (here and here). We have written about how Cloudflare both signs its announced routes and filters its routing inbound from other networks (both transits and peers) using RPKI data. We also added our efforts in the open-source software space with the release of the Cloudflare RPKI Toolkit.

The primary part of the RPKI (Resource Public Key Infrastructure) system is a cryptographically-signed database which is read and processed by a RPKI validator. The validator works with the published ROAs to build a list of validated routes. A ROA consists of an IP address block plus an ASN (Autonomous System Number) that together define who can announce which IP block.

RPKI and the RTR protocol

After that step, it is then the job of that validator (or some associated software module) to communicate its list of valid routes to an Internet router. That’s where the RTR protocol (the RPKI to Router Protocol) comes in. Its job is to communicate between the validator and device in charge of allowing or rejecting routes in its table.

RTR

The IETF defines the RTR protocol in RFC 8210. This blog post focuses on version 1 and ignores previous versions.

In order to verifiably validate the origin Autonomous Systems and Autonomous System Paths of BGP announcements, routers need a simple but reliable mechanism to receive Resource Public Key Infrastructure (RFC 6480) prefix origin data and router keys from a trusted cache. This document describes a protocol to deliver them.

This document describes version 1 of the RPKI-Router protocol.

The Internet's routers are, to put it bluntly, not the best place to run a routing table's cryptographic processing. The RTR protocol allows the heavy lifting to be done outside of the valuable processing modules that routers have. RTR is a very lightweight protocol with a low memory footprint. The router simply decides yay-or-nay when a route is received (called an “announce” in BGP speak), and hence the router never needs to touch the complex cryptographic validation algorithms. It also provides some isolation from the outside world, where certificates need to be fetched from across the globe and then stored, checked, processed, and kept in a local database. In many cases the control plane (where RTR communication happens) exists on private or protected networks. Separation is a good thing.

Cloudflare’s open-source software for RPKI validation also includes GoRTR, an implementation of the RTR protocol. As mentioned, in Cloudflare’s operational model, we separate the validation (done with OctoRPKI) from the RTR process.

RPKI and the RTR protocol

RTR protocol implementations are also provided in other RPKI validation software packages. In fact, RPKI is unable to filter routes without the final step of running RTR (or something similar – should it exist). Here’s a current list of RPKI software packages that either validate or validate and run RTR.

Each of these open source software packages has its own specific database model and operational methods. Because GoRTR reads a somewhat common JSON file format, you can mix and match between different validators and GoRTR’s code.

The RTR protocol

The protocol’s core is all about synchronizing a database between a validator and a router. This is done using serial-numbers and session-ids.

It's kicked off by a router setting up a TCP connection towards a backend RTR server, followed by a series of serial-number and data-record exchanges, such that a cache on the validator (or RTR server) can be synced fully with a cache on the router. As mentioned, the lightweight protocol is void of all the cryptographic data that RPKI is built upon and simply deals with the validated routing list, which consists of CIDRs, ASNs and maybe a MaxLength parameter.

Here’s a simple Cisco configuration for enabling RTR on a router:

router bgp 65001
 rpki server 192.168.1.100
  transport tcp port 8282
 !
!

The configuration can take additional parameters in order to enable SSH or similar transport options. Other platforms (such as Juniper, Arista, Bird 2.0, etc) have their own specific configuration language.

The RTR protocol supports IPv4 and IPv6 routing information (as you would expect).

RPKI and the RTR protocol

Being specified as a lightweight protocol, RTR allows the data to be transferred quickly. With a session-id created by the RTR cache server, plus serial-numbers exchanged between cache servers and routers, route authentication data on the router can be kept fresh with a minimal amount of actual data being transferred. Remember, as we said above, the router has much better things to do with its control plane processor, like running the BGP convergence algorithm, or SRv6, or ISIS, or any of the protocols needed to manage routing tables.
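To make the exchange concrete, here is a minimal sketch of the router's side of an initial sync: open a TCP session to an RTR cache, send a Reset Query, and collect IPv4 prefix records until End of Data. The PDU layouts follow RFC 8210 version 1, and the host/port are the public plaintext cache used later in this post; this is an illustration only, not the rpki-rtr-client tool.

import socket
import struct

HOST, PORT = "rtr.rpki.cloudflare.com", 8282    # public plaintext RTR cache
RESET_QUERY = struct.pack("!BBHI", 1, 2, 0, 8)  # version 1, type 2, zero, length 8

def read_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("cache closed the session")
        buf += chunk
    return buf

vrps = []
with socket.create_connection((HOST, PORT), timeout=30) as sock:
    sock.sendall(RESET_QUERY)
    while True:
        header = read_exact(sock, 8)
        version, pdu_type, session_id, length = struct.unpack("!BBHI", header)
        body = read_exact(sock, length - 8)
        if pdu_type == 4:                       # IPv4 Prefix PDU (20 bytes total)
            flags, plen, maxlen, _, addr, asn = struct.unpack("!BBBB4sI", body)
            vrps.append((socket.inet_ntoa(addr), plen, maxlen, asn, flags & 1))
        elif pdu_type == 7:                     # End of Data: initial sync complete
            break

print(f"received {len(vrps)} IPv4 records, for example: {vrps[:3]}")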

Is RTR a weak link in the RPKI story?

All aspects of RPKI data processing are built around solid cryptographic principles. The five RIRs each hold a root key called a Trust Anchor (TA). Each publishes data fully signed up/down so that every piece of information can be proven to be correct and untampered with. A validator's job is to do that processing and spit out (or store) a list of valid ROAs (Route Origin Authorizations) that are assertions traceable back to a known source. If you want to study this protocol, you can start with RFC6480 and work forward through all the other relevant RFCs (Hint: it's at least thirty more RFCs from RFC6483 through RFC8210 and counting).

However, RTR does not carry that trust through to the Internet router. All that complexity (and hence those assertions) is stripped away before a router sees anything. It is 100% up to the network operator to build a reliable and secure path between the validator or RTR cache and the router, so that this lightweight transfer is still trusted.

RTR helps somewhat in this space by providing more than one way to communicate between cache server and router. The RFC defines the following transport methods.

  • A plain TCP connection (which is clearly insecure). In this case the RFC states: “the cache and routers MUST be on the same trusted and controlled network.”.
  • A TCP connection with TCP-AO transport.
  • A Secure Shell version 2 (SSHv2) transport.
  • A TCP connection with TCP MD5 transport (which is already obsoleted by TCP-AO).
  • A TCP connection over IPsec transport.
  • Transport Layer Security (TLS) transport.

This plethora of options is all well and good; however, there’s no useful implementation of TCP-AO out in the production world and hence (ironically) a lot of early implementations are living with plain-text communications. SSH and TLS are much better options; however, this comes with classic operational problems to solve. For example, in SSH’s case, the RFC states:

It is assumed that the router and cache have exchanged keys out of band by some reasonably secured means.

For a TLS connection, there’s also some worthwhile security setup mentioned in the RFC. It starts off as follows:

Client routers using TLS transport MUST present client-side certificates to authenticate themselves to the cache in order to allow the cache to manage the load by rejecting connections from unauthorized routers.

Then the RFC continues with enough information to secure the connection fully. If implemented correctly, the connection between RTR cache and router is secured such that no MITM attack can take place.

Assuming these operational issues are handled fully, the RTR protocol is a perfect fit for operationally implementing RPKI's final linkage into the routers.

Testing the RTR protocol and open-source rpki-rtr-client

A modern router software stack can be configured to run RTR against a cache. If you have a test lab (as most modern networks do), then you have all you need to see RPKI route filtering (and the dropping of invalid routes).

However, if you are without a router and want to see RTR in action, Cloudflare has just placed rpki-rtr-client on GitHub. This software, written in Python, performs the router portion of the RTR protocol and comes with enough debug output that it can also be used to help write new RTR caches, or test existing code bases. The code was written directly from the RFC and then tested against a public RTR cache that Cloudflare operates.

$ pip3 install netaddr
...
$ git clone https://github.com/cloudflare/rpki-rtr-client.git
...
$ cd rpki-rtr-client
$

Operating the client is easy (and doubly-so if you use the Cloudflare provided cache).

$ ./rtr_client.py -h rtr.rpki.cloudflare.com -p 8282
...
^C
$

As there is no router (and hence no dropping of invalids) this code simply creates data files for later review. See the README file for more information.

$ ls -lt data/2020-02
total 21592
-rw-r--r--  1 martin martin  5520676 Feb 16 18:22 2020-02-17-022209.routes.00000365.json
-rw-r--r--  1 martin martin  5520676 Feb 16 18:42 2020-02-17-024242.routes.00000838.json
-rw-r--r--  1 martin martin      412 Feb 16 19:56 2020-02-17-035645.routes.00000841.json
-rw-r--r--  1 martin martin      272 Feb 16 20:16 2020-02-17-041647.routes.00000842.json
-rw-r--r--  1 martin martin      643 Feb 16 20:36 2020-02-17-043649.routes.00000843.json
$

As the RTR protocol communicates and increments its serial-number, the rpki-rtr-client software writes the routing information to a fresh file for later review.

$ for f in data/2020-02/*.json ; do echo "$f `jq -r '.routes.announce[]|.ip' < $f | wc -l` `jq -r '.routes.withdraw[]|.ip' < $f | wc -l`" ; done
data/2020-02/2020-02-17-022209.routes.00000365.json   128483        0
data/2020-02/2020-02-17-024242.routes.00000838.json   128483        0
data/2020-02/2020-02-17-035645.routes.00000841.json        3        6
data/2020-02/2020-02-17-041647.routes.00000842.json        5        0
data/2020-02/2020-02-17-043649.routes.00000843.json        9        5
$

Valid ROAs are listed as follows:

$ jq -r '.routes.announce[]|.ip,.asn,.maxlen' data/2020-02/*0838.json | paste - - - | sort -V | head
1.0.0.0/24      13335   null
1.1.1.0/24      13335   null
1.9.0.0/16      4788    24
1.9.12.0/24     65037   null
1.9.21.0/24     24514   null
1.9.23.0/24     65120   null
1.9.31.0/24     65077   null
1.9.65.0/24     24514   null
1.34.0.0/15     3462    24
1.36.0.0/16     4760    null
$

The code can also dump the raw binary protocol and then replay that data to debug the protocol.

As the code is on GitHub, any protocol developer can feel free to expand on the code.

Future of RTR protocol

The present RFC defines version 1 of the protocol, and it is expected that the protocol will progress to include additional functions while staying lightweight. RPKI is a Route Origin Validation protocol (i.e. mapping an IP route or CIDR to an ASN). It does not provide support for validating the AS-PATH. Neither does it provide any support for IRR databases (which are non-cryptographically-signed routing definitions). Presently, IRR data is the primary method used for filtering routing on the global Internet. Today that is done by building massive filter lists within a router's configuration file and not via a lightweight protocol like RTR.

At the present time there's an IETF proposal for RTR version 2. This is draft work that goes alongside the ASPA (Autonomous System Provider Authorization) drafts. These draft documents from Alexander Azimov et al. define ASPA, extending the RPKI data structures to handle BGP path information. Version 2 of the RTR protocol should provide the required messaging to move ASPA data into the router.

RPKI and the RTR protocol

Additionally, RPKI is potentially going to expand further, at some point, beyond today's single data type (the ROA object). Just like with the ASPA draft, RTR will need to advance in lock-step. Hopefully the open-source code we have published will help this effort.

Some final thoughts on RTR and RPKI

If RPKI is to become ubiquitous, then RTR support in all BGP speaking Internet routers is going to be required. Vendors need to complete their RTR software delivery and additionally support some of the more secure transport definitions from the RFC. Additionally, should the protocol advance, then timely support for the new version will be needed.

Cloudflare continues to be committed to a secure Internet; so if you share these thoughts and you like what you've read here or elsewhere on our blog, then please take a look at our jobs page. We have software and network engineering roles open in many of our offices around the world.

Using Machine Learning to Detect IP Hijacking

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2019/10/using_machine_l_1.html

This is interesting research:

In a BGP hijack, a malicious actor convinces nearby networks that the best path to reach a specific IP address is through their network. That’s unfortunately not very hard to do, since BGP itself doesn’t have any security procedures for validating that a message is actually coming from the place it says it’s coming from.

[…]

To better pinpoint serial attacks, the group first pulled data from several years’ worth of network operator mailing lists, as well as historical BGP data taken every five minutes from the global routing table. From that, they observed particular qualities of malicious actors and then trained a machine-learning model to automatically identify such behaviors.

The system flagged networks that had several key characteristics, particularly with respect to the nature of the specific blocks of IP addresses they use:

  • Volatile changes in activity: Hijackers’ address blocks seem to disappear much faster than those of legitimate networks. The average duration of a flagged network’s prefix was under 50 days, compared to almost two years for legitimate networks.
  • Multiple address blocks: Serial hijackers tend to advertise many more blocks of IP addresses, also known as “network prefixes.”

  • IP addresses in multiple countries: Most networks don’t have foreign IP addresses. In contrast, for the networks that serial hijackers advertised that they had, they were much more likely to be registered in different countries and continents.

Note that this is much more likely to detect criminal attacks than nation-state activities. But it’s still good work.

Academic paper.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Post Syndicated from Martin J Levy original https://blog.cloudflare.com/the-deep-dive-into-how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-monday/

A recap on what happened Monday

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

On Monday we wrote about a painful Internet-wide route leak. We wrote that this should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. That blog entry came out around 19:58 UTC, just over seven hours after the route leak finished (which, as we will see below, was around 12:39 UTC). Today we will dive into the archived routing data and analyze it. The format of the code below is meant to use simple shell commands so that any reader can follow along and, more importantly, do their own investigations on the routing tables.

This was a very public BGP route leak event. It was reported online by many news outlets, and the event's BGP data was shared via social media as it was happening. Andree Toonk tweeted a quick list of 2,400 ASNs that were affected.


Using RIPE NCC archived data

The RIPE NCC operates a very useful archive of BGP routing. It runs collectors globally and provides an API for querying the data. More can be seen at https://stat.ripe.net/. In the world of BGP all routing is public (within the ability of anyone collecting data to have enough collection points). The archived data is very valuable for research, and that's what we will use in this blog. The site can also create some very useful data visualizations.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Dumping the RIPEstat data for this event

Presently, the RIPEstat data gets ingested around eight to twelve hours after real-time. It’s not meant to be a real-time service. The data can be queried in many ways, including a full web interface and an API. We are using the API to extract the data in a JSON format.

We are going to focus only on the Cloudflare routes that were leaked. Many other ASNs were leaked (see the tweet above); however, we want to deal with a finite data set and focus on what happened to Cloudflare’s routing. All the commands below can be run with ease on many systems. The following was done on MacBook Pro running macOS Mojave.

First we collect 24 hours of route announcements and AS-PATH data that RIPEstat sees coming from AS13335 (Cloudflare).

$ # Collect 24 hours of data - more than enough
$ ASN="AS13335"
$ START="2019-06-24T00:00:00"
$ END="2019-06-25T00:00:00"
$ ARGS="resource=${ASN}&starttime=${START}&endtime=${END}"
$ URL="https://stat.ripe.net/data/bgp-updates/data.json?${ARGS}"
$ # Fetch the data from RIPEstat
$ curl -sS "${URL}" | jq . > 13335-routes.json
$ ls -l 13335-routes.json
-rw-r--r--  1 martin  staff  339363899 Jun 25 08:47 13335-routes.json
$

That’s 340MB of data – which seems like a lot, but it contains plenty of white space and plenty of data we just don’t need. Our second task is to reduce this raw data down to just the required data – that’s timestamps, actual routes, and AS-PATH. The third item will be very useful. Note we are using jq, which can be installed on macOS with the  brew package manager.

$ # Extract just the times, routes, and AS-PATH
$ jq -rc '.data.updates[]|.timestamp,.attrs.target_prefix,.attrs.path' < 13335-routes.json | paste - - - > 13335-listing-a.txt
$ wc -l 13335-listing-a.txt
691318 13335-listing-a.txt
$

We are down to just below seven hundred thousand routing events, however, that’s not a leak, that’s everything that includes Cloudflare’s ASN (the number 13335 above). For that we need to go back to Monday’s blog and realize it was AS396531 (Allegheny Technologies) that showed up with 701 (Verizon) in the leak. Now we reduce the data further:

$ # Extract the route leak 701,396531
$ # AS701 is Verizon and AS396531 is Allegheny Technologies
$ egrep '701,396531' < 13335-listing-a.txt > 13335-listing-b.txt
$ wc -l 13335-listing-b.txt
204568 13335-listing-b.txt
$

At 204 thousand data points, we are looking better. It’s still a lot of data because BGP can be very chatty if topology is changing. A route leak will cause exactly that. Now let’s see how many routes were affected:

$ # Extract the actual routes affected
$ cut -f2 < 13335-listing-b.txt | sort -n -u > 13335-listing-c.txt
$ wc -l 13335-listing-c.txt
20 13335-listing-c.txt
$

It’s a much smaller number. We now have a listing of at least twenty routes that were leaked via Verizon. This may not be the full list because route collectors like RIPEstat don’t have direct feeds from Verizon, so this data is a blended view with Verizon’s path and other paths. We can see that if we look at the AS-PATH in the above files. Here’s the listing of affected routes.

$ cat 13335-listing-c.txt
8.39.214.0/24
8.42.245.0/24
8.44.58.0/24
104.16.80.0/21
104.17.168.0/21
104.18.32.0/21
104.19.168.0/21
104.20.64.0/21
104.22.8.0/21
104.23.128.0/21
104.24.112.0/21
104.25.144.0/21
104.26.0.0/21
104.27.160.0/21
104.28.16.0/21
104.31.0.0/21
141.101.120.0/23
162.159.224.0/21
172.68.60.0/22
172.69.116.0/22
$

This is an interesting list, as some of these routes do not originate from Cloudflare's network; however, they show up with AS13335 (our ASN) as the originator. For example, the 104.26.0.0/21 route is not announced from our network, but we do announce 104.26.0.0/20 (which covers that route). More importantly, we have an IRR route object plus an RPKI ROA for that block. Here's the IRR object:

route:          104.26.0.0/20
origin:         AS13335
source:         ARIN

And here's the RPKI ROA. This ROA has its maxlength set to 20, so no more-specific (smaller) route should be accepted.

Prefix:       104.26.0.0/20
Max Length:   /20
ASN:          13335
Trust Anchor: ARIN
Validity:     Thu, 02 Aug 2018 04:00:00 GMT - Sat, 31 Jul 2027 04:00:00 GMT
Emitted:      Thu, 02 Aug 2018 21:45:37 GMT
Name:         535ad55d-dd30-40f9-8434-c17fc413aa99
Key:          4a75b5de16143adbeaa987d6d91e0519106d086e
Parent Key:   a6e7a6b44019cf4e388766d940677599d0c492dc
Path:         rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-...

The Max Length field in a ROA sets the most-specific (longest) prefix length that may be accepted for that announcement. The fact that this is a /20 route with a /20 Max Length says that a /21 (or /22 or /23 or /24) within this IP space isn't allowed. Looking further at the route list above, we get the following listing:

Route Seen            Cloudflare IRR & ROA    ROA Max Length
104.16.80.0/21    ->  104.16.80.0/20          /20
104.17.168.0/21   ->  104.17.160.0/20         /20
104.18.32.0/21    ->  104.18.32.0/20          /20
104.19.168.0/21   ->  104.19.160.0/20         /20
104.20.64.0/21    ->  104.20.64.0/20          /20
104.22.8.0/21     ->  104.22.0.0/20           /20
104.23.128.0/21   ->  104.23.128.0/20         /20
104.24.112.0/21   ->  104.24.112.0/20         /20
104.25.144.0/21   ->  104.25.144.0/20         /20
104.26.0.0/21     ->  104.26.0.0/20           /20
104.27.160.0/21   ->  104.27.160.0/20         /20
104.28.16.0/21    ->  104.28.16.0/20          /20
104.31.0.0/21     ->  104.31.0.0/20           /20
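A quick way to reproduce the table above in code: each leaked /21 sits inside a /20 signed with a Max Length of /20, so every one of them is RPKI-invalid. The three sample rows below come from the table; the check assumes, as shown above, that the covering ROAs exist with Max Length /20.

import ipaddress

leaked = ["104.16.80.0/21", "104.17.168.0/21", "104.26.0.0/21"]   # sample rows

for route in leaked:
    net = ipaddress.ip_network(route)
    covering = net.supernet(new_prefix=20)   # the /20 that the ROA signs
    invalid = net.prefixlen > 20             # ROA Max Length is /20
    print(f"{route} is inside {covering}; RPKI-invalid: {invalid}")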

So how did all these /21’s show up? That’s where we dive into the world of BGP route optimization systems and their propensity to synthesize routes that should not exist. If those routes leak (and it’s very clear after this week that they can), all hell breaks loose. That can be compounded when not one, but two ISPs allow invalid routes to be propagated outside their autonomous network. We will explore the AS-PATH further down this blog.

More than 20 years ago, RFC1997 added the concept of “communities” to BGP. Communities are a way of tagging or grouping route advertisements. Communities are often used to label routes so that specific handling policies can be applied. RFC1997 includes a small number of universal “well-known” communities. One of these is the NO_EXPORT community, which has the following specification:

    All routes received carrying a communities attribute
    containing this value MUST NOT be advertised outside a BGP
    confederation boundary (a stand-alone autonomous system that
    is not part of a confederation should be considered a
    confederation itself).

The use of the NO_EXPORT community is very common within BGP enabled networks and is a community tag that would have helped alleviate this route leak immensely.

How BGP route optimization systems work (or don’t work in this case) can be a subject for a whole other blog entry.

Timing of the route leak

As we saved away the timestamps in the JSON file and in the text files, we can confirm the time for every route in the route leak by looking at the first and the last timestamp of a route in the data. We saved data from 00:00:00 UTC until 00:00:00 the next day, so we know we have covered the period of the route leak. We write a script that checks the first and last entry for every route and report the information sorted by start time:

$ while read cidr
do
  echo $cidr
  fgrep $cidr < 13335-listing-b.txt | head -1 | cut -f1
  fgrep $cidr < 13335-listing-b.txt | tail -1 | cut -f1
done < 13335-listing-c.txt |\
paste - - - | sort -k2,3 | column -t | sed -e 's/2019-06-24T//g'
104.25.144.0/21   10:34:25  12:38:54
104.22.8.0/21     10:34:27  12:29:39
104.20.64.0/21    10:34:27  12:30:00
104.23.128.0/21   10:34:27  12:30:34
141.101.120.0/23  10:34:27  12:30:39
162.159.224.0/21  10:34:27  12:30:39
104.18.32.0/21    10:34:29  12:30:34
104.24.112.0/21   10:34:29  12:30:34
104.27.160.0/21   10:34:29  12:30:34
104.28.16.0/21    10:34:29  12:30:34
104.31.0.0/21     10:34:29  12:30:34
8.39.214.0/24     10:34:31  12:19:24
104.26.0.0/21     10:34:36  12:29:53
172.68.60.0/22    10:34:38  12:19:24
172.69.116.0/22   10:34:38  12:19:24
8.44.58.0/24      10:34:38  12:19:24
8.42.245.0/24     11:52:49  11:53:19
104.17.168.0/21   12:00:13  12:29:34
104.16.80.0/21    12:00:13  12:30:00
104.19.168.0/21   12:09:39  12:29:34
$

Now we know the times. The route leak started at 10:34:25 UTC (just before lunchtime London time) on 2019-06-24 and ended at 12:38:54 UTC. That’s a hair over two hours. Here’s that same time data in a graphical form showing the near-instant start of the event and the duration of each route leaked:

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

We can also go back to RIPEstat and look at the activity graph for Cloudflare’s AS13335 network:

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Clearly between 10:30 UTC and 12:40 UTC there’s a lot of route activity – far more than normal.

Note that as we mentioned above, RIPEstat doesn’t get a full view of Verizon’s network routing and hence some of the propagated routes won’t show up.

Drilling down on the AS-PATH part of the data

Having the routes is useful, but now we want to look at the paths of these leaked routes to see which ASNs are involved. We knew the offending ASNs during the route leak on Monday. Now we want to dig deeper using the archived data. This allows us to see the extent and reach of this route leak.

# Use the list of routes to extract the full AS-PATH
# Merge the results together to show an amalgamation of paths.
# We know (luckily) the last few ASNs in the AS-PATH are consistent
$ cut -f3 < 13335-listing-b.txt | tr -d '[\[\]]' |\
awk '{
  n=split($0, a, ",");
  printf "%50s\n",
    a[n-5] "_" a[n-4] "_" a[n-3] "_" a[n-2] "_" a[n-1] "_" a[n];
}' | sort -u
174_701_396531_33154_3356_13335
2497_701_396531_33154_174_13335
577_701_396531_33154_3356_13335
6939_701_396531_33154_174_13335
1239_701_396531_33154_3356_13335
1273_701_396531_33154_3356_13335
1280_701_396531_33154_3356_13335
2497_701_396531_33154_3356_13335
2516_701_396531_33154_3356_13335
3320_701_396531_33154_3356_13335
3491_701_396531_33154_3356_13335
4134_701_396531_33154_3356_13335
4637_701_396531_33154_3356_13335
6453_701_396531_33154_3356_13335
6461_701_396531_33154_3356_13335
6762_701_396531_33154_3356_13335
6830_701_396531_33154_3356_13335
6939_701_396531_33154_3356_13335
7738_701_396531_33154_3356_13335
12956_701_396531_33154_3356_13335
17639_701_396531_33154_3356_13335
23148_701_396531_33154_3356_13335
$

This script clearly shows the AS-PATH of the leaked routes. It’s very consistent. Reading from the back of the line to the front, we have 13335 (Cloudflare), 3356 or 174 (Level3/CenturyLink or Cogent – both tier 1 transit providers for Cloudflare). So far, so good. Then we have 33154 (DQE Communications) and 396531 (Allegheny Technologies Inc) which is still technically not a leak, but trending that way. The reason why we can state this is still technically not a leak is because we don’t know the relationship between those two ASs. It’s possible they have a mutual transit agreement between them. That’s up to them.

Back to the AS-PATH’s. Still reading leftwards, we see 701 (Verizon), which is very very bad and clear evidence of a leak. It’s a leak for two reasons. First, this matches the path when a transit is leaking a non-customer route from a customer. This Verizon customer does not have 13335 (Cloudflare) listed as a customer. Second, the route contains within its path a tier 1 ASN. This is the point where a route leak should have been absolutely squashed by filtering on the customer BGP session. Beyond this point there be dragons.

And dragons there be! Everything above is about how Verizon filtered (or didn’t filter) its customer. What follows the 701 (i.e the number to the left of it) is the peers or customers of Verizon that have accepted these leaked routes. They are mainly other tier 1 networks of Verizon in this list: 174 (Cogent), 1239 (Sprint), 1273 (Vodafone), 3320 (DTAG), 3491 (PCCW), 6461 (Zayo), 6762 (Telecom Italia), etc.

What’s missing from that list are two networks worthy of mentioning – 2914 (NTT) & 7018 (AT&T). Both implement a very simple AS-PATH filter which saved the day for their network. They do not allow one tier 1 ISP to send them a route which has another tier 1 further down the path. That’s because when that happens, it’s officially a leak as each tier 1 is fully connected to all other tier 1’s (which is part of the definition of a tier 1 network). The topology of the Internet’s global BGP routing tables simply states that if you see another tier 1 in the path, then it’s a bad route and it should be filtered away.

Additionally we know that 7018 (AT&T) operates a network which drops RPKI invalids. Because Cloudflare routes are RPKI signed, this also means that AT&T would have dropped these routes when it receives them from Verizon. This shows a clear win for RPKI (and for AT&T when you see their bandwidth graph below)!


That all said, keep in mind we are still talking about routes that Cloudflare didn’t announce. They all came from the route optimizer.

What should 701 Verizon network accept from their customer 396531?

This is a great question to ask. Normally we would look at the IRR (Internet Routing Registries) to see what policy an ASN wants for it’s routes.

$ whois -h whois.radb.net AS396531 ; the Verizon customer
%  No entries found for the selected source(s).
$ whois -h whois.radb.net AS33154  ; the downstream of that customer
%  No entries found for the selected source(s).
$ 

That’s enough to say that we should not be seeing this ASN anywhere on the Internet, however, we should go further into checking this. As we know the ASN of the network, we can search for any routes that are listed for that ASN. We find one:

$ whois -h whois.radb.net ' -i origin AS396531' | egrep '^route|^origin|^mnt-by|^source'
route:          192.92.159.0/24
origin:         AS396531
mnt-by:         MNT-DCNSL
source:         ARIN
$

More importantly, now we have a maintainer (the owner of the routing registry entries). We can see what else is there for that network and we are specifically looking for this:

$ whois -h whois.radb.net ' -i mnt-by -T as-set MNT-DCNSL' | egrep '^as-set|^members|^mnt-by|^source'
as-set:         AS-DQECUST
members:        AS4130, AS5050, AS11199, AS11360, AS12017, AS14088, AS14162,
                AS14740, AS15327, AS16821, AS18891, AS19749, AS20326,
                AS21764, AS26059, AS26257, AS26461, AS27223, AS30168,
                AS32634, AS33039, AS33154, AS33345, AS33358, AS33504,
                AS33726, AS40549, AS40794, AS54552, AS54559, AS54822,
                AS393456, AS395440, AS396531, AS15204, AS54119, AS62984,
                AS13659, AS54934, AS18572, AS397284
mnt-by:         MNT-DCNSL
source:         ARIN
$

This object is important. It lists all the downstream ASNs that this network is expected to announce to the world. It does not contain Cloudflare’s ASN (or any of the leaked ASNs). Clearly this as-set was not used for any BGP filtering.

Just for completeness the same exercise can be done for the other ASN (the downstream of the customer of Verizon). In this case, we just searched for the maintainer object (as there are plenty of route and route6 objects listed).

$ whois -h whois.radb.net ' -i origin AS33154' | egrep '^mnt-by' | sort -u
mnt-by:         MNT-DCNSL
mnt-by:     MAINT-AS3257
mnt-by:     MAINT-AS5050
$

None of these maintainers are directly related to 33154 (DQE Communications). They have been created by other parties and hence they become a dead-end in that search.

It’s worth doing a secondary search to see if any as-set object exists with 33154 or 396531 included. We turned to the most excellent IRR Explorer website run by NLNOG. It provides deep insight into the routing registry data. We did a simple search for 33154 using http://irrexplorer.nlnog.net/search/33154 and we found these as-set objects.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

It’s interesting to see this ASN listed in other as-set’s but none are important or related to Monday’s route-leak. Next we looked at 396531

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

This shows that there’s nowhere else we need to check. AS-DQECUST is the as-set macro that controls (or should control) filtering for any transit provider of their network.

The summary of all the investigation is a solid statement that no Cloudflare routes or ASNs are listed anywhere within the routing registry data for the customer of Verizon. As there were 2,300 ASNs listed in the tweet above, we can conclusively state no filtering was in place and hence this route leak went on its way unabated.

IPv6? Where is the IPv6 route leak?

In what could be considered the only plus from Monday’s route leak, we can confirm that there was no route leak within IPv6 space. Why?

It turns out that 396531 (Allegheny Technologies Inc) is a network without IPv6 enabled. Normally you would hear Cloudflare chastise anyone that’s yet to enable IPv6, however, in this case we are quite happy that one of the two protocol families survived. IPv6 was stable during this route leak, which now can be called an IPv4-only route leak.

Yet that’s not really the whole story. Let’s look at the percentage of traffic Cloudflare sends Verizon that’s IPv6 (vs IPv4). Normally the IPv4/IPv6 percentage holds steady.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

This uptick in IPv6 traffic could be the direct result of Happy Eyeballs on mobile handsets picking a working IPv6 path into Cloudflare vs a non-working IPv4 path. Happy Eyeballs is meant to protect against IPv6 failure, however in this case it’s doing a wonderful job in protecting from an IPv4 failure. Yet we have to be careful with this graph because after further thought and investigation the percentage only increased because IPv4 reduces. Sometimes graphs can be misinterpreted, yet Happy Eyeballs still did a good job even as end users were being affected.

Happy Eyeballs, described in RFC8305, is a mechanism where a client (lets say a mobile device) tries to connect to a website both on IPv4 and IPv6 simultaneously. IPv6 is sometimes given a head-start. The theory is that, should a failure exist on one of the paths (sadly IPv6 is the norm), then IPv4 will save the day. Monday was a day of opposites for Verizon.

In fact, enabling IPv6 for mobile users is the one area where we can praise the Verizon network this week (or at least the Verizon mobile network), unlike the residential Verizon networks where IPv6 is almost non-existent.

Using bandwidth graphs to confirm routing leaks and stability.

As we have already stated,Verizon had impacted their own users/customers. Let’s start with their bandwidth graph:

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

The red line is 24 July 2019 (00:00 UTC to 00:00 UTC the next day). The gray lines are previous days for comparison. This graph includes both Verizon fixed-line services like FiOS along with mobile.

The AT&T graph is quite different.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

There’s no perturbation. This, along with some direct confirmation, shows that 7018 (AT&T) was not affected. This is an important point.

Going back and looking at a third tier 1 network, we can see 6762 (Telecom Italia) affected by this route leak and yet Cloudflare has a direct interconnect with them.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

We will be asking Telecom Italia to improve their route filtering as we now have this data.

Future work that could have helped on Monday

The IETF is doing work in the area of BGP path protection within the Secure Inter-Domain Routing Operations Working Group (sidrops) area. The charter of this IETF group is:

The SIDR Operations Working Group (sidrops) develops guidelines for the operation of SIDR-aware networks, and provides operational guidance on how to deploy and operate SIDR technologies in existing and new networks.

One new effort from this group should be called out to show how important the issue of route leaks like today’s event is. The draft document from Alexander Azimov et al named draft-ietf-sidrops-aspa-profile (ASPA stands for Autonomous System Provider Authorization) extends the RPKI data structures to handle BGP path information. This in ongoing work and Cloudflare and other companies are clearly interested in seeing it progress further.

However, as we said in Monday’s blog and something we should reiterate again and again: Cloudflare encourages all network operators to deploy RPKI now!

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Post Syndicated from Martin J Levy original https://blog.cloudflare.com/the-deep-dive-into-how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-monday/

A recap on what happened Monday


On Monday we wrote about a painful Internet-wide route leak. We wrote that this should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. That blog entry came out around 19:58 UTC, just over seven hours after the route leak finished (which, as we will see below, was around 12:39 UTC). Today we will dive into the archived routing data and analyze it. The format of the code below is meant to use simple shell commands so that any reader can follow along and, more importantly, do their own investigations on the routing tables.

This was a very public BGP route leak event. It was both reported online via many news outlets and the event’s BGP data was reported via social media as it was happening. Andree Toonk tweeted a quick list of 2,400 ASNs that were affected.


This blog contains a large number of acronyms and those are explained at the end of the blog.

Using RIPE NCC archived data

The RIPE NCC operates a very useful archive of BGP routing. It runs collectors globally and provides an API for querying the data. More can be seen at https://stat.ripe.net/. In the world of BGP all routing is public (within the ability of anyone collecting data to have enough collection points). The archived data is very valuable for research, and that’s what we will use it for in this blog. The site can create some very useful data visualizations.


Dumping the RIPEstat data for this event

Presently, the RIPEstat data gets ingested around eight to twelve hours after real-time. It’s not meant to be a real-time service. The data can be queried in many ways, including a full web interface and an API. We are using the API to extract the data in a JSON format.

We are going to focus only on the Cloudflare routes that were leaked. Many other ASNs were leaked (see the tweet above); however, we want to deal with a finite data set and focus on what happened to Cloudflare’s routing. All the commands below can be run with ease on many systems. Both the scripts and the raw data file are now available on GitHub. The following was done on a MacBook Pro running macOS Mojave.

First we collect 24 hours of route announcements and AS-PATH data that RIPEstat sees coming from AS13335 (Cloudflare).

$ # Collect 24 hours of data - more than enough
$ ASN="AS13335"
$ START="2019-06-24T00:00:00"
$ END="2019-06-25T00:00:00"
$ ARGS="resource=${ASN}&starttime=${START}&endtime=${END}"
$ URL="https://stat.ripe.net/data/bgp-updates/data.json?${ARGS}"
$ # Fetch the data from RIPEstat
$ curl -sS "${URL}" | jq . > 13335-routes.json
$ ls -l 13335-routes.json
-rw-r--r--  1 martin  staff  339363899 Jun 25 08:47 13335-routes.json
$

That’s 340MB of data – which seems like a lot, but it contains plenty of white space and plenty of data we just don’t need. Our second task is to reduce this raw data down to just the required data – that’s timestamps, actual routes, and AS-PATH. The third item will be very useful. Note we are using jq, which can be installed on macOS with the  brew package manager.

$ # Extract just the times, routes, and AS-PATH
$ jq -rc '.data.updates[]|.timestamp,.attrs.target_prefix,.attrs.path' < 13335-routes.json | paste - - - > 13335-listing-a.txt
$ wc -l 13335-listing-a.txt
691318 13335-listing-a.txt
$

We are down to just below seven hundred thousand routing events, however, that’s not a leak, that’s everything that includes Cloudflare’s ASN (the number 13335 above). For that we need to go back to Monday’s blog and realize it was AS396531 (Allegheny Technologies) that showed up with 701 (Verizon) in the leak. Now we reduce the data further:

$ # Extract the route leak 701,396531
$ # AS701 is Verizon and AS396531 is Allegheny Technologies
$ egrep '701,396531' < 13335-listing-a.txt > 13335-listing-b.txt
$ wc -l 13335-listing-b.txt
204568 13335-listing-b.txt
$

At 204 thousand data points, we are looking better. It’s still a lot of data because BGP can be very chatty if topology is changing. A route leak will cause exactly that. Now let’s see how many routes were affected:

$ # Extract the actual routes affected by the route leak
$ cut -f2 < 13335-listing-b.txt | sort -V -u > 13335-listing-c.txt
$ wc -l 13335-listing-c.txt
101 13335-listing-c.txt
$

It’s a much smaller number. We now have a listing of at least 101 routes that were leaked via Verizon. This may not be the full list because route collectors like RIPEstat don’t have direct feeds from Verizon, so this data is a blended view with Verizon’s path and other paths. We can see that if we look at the AS-PATH in the above files. Please note that I had a typo in this script when this blog was first published and only 20 routes showed up because the -n vs -V option was used on sort. Now the list is correct with 101 affected routes. Please see this short article from stackoverflow to see the issue.

Here’s a partial listing of affected routes.

$ cat 13335-listing-c.txt
8.39.214.0/24
8.42.245.0/24
8.44.58.0/24
...
104.16.80.0/21
104.17.168.0/21
104.18.32.0/21
104.19.168.0/21
104.20.64.0/21
104.22.8.0/21
104.23.128.0/21
104.24.112.0/21
104.25.144.0/21
104.26.0.0/21
104.27.160.0/21
104.28.16.0/21
104.31.0.0/21
141.101.120.0/23
162.159.224.0/21
172.68.60.0/22
172.69.116.0/22
...
$

This is an interesting list, as some of these routes do not originate from Cloudflare’s network, however, they show up with AS13335 (our ASN) as the originator. For example, the 104.26.0.0/21 route is not announced from our network, but we do announce 104.26.0.0/20 (which covers that route). More importantly, we have an IRR (Internet Routing Registries) route object plus an RPKI ROA for that block. Here’s the IRR object:

route:          104.26.0.0/20
origin:         AS13335
source:         ARIN

And here’s the RPKI ROA. This ROA has Max Length set to /20, so no more-specific route (a /21 or longer) should be accepted.

Prefix:       104.26.0.0/20
Max Length:   /20
ASN:          13335
Trust Anchor: ARIN
Validity:     Thu, 02 Aug 2018 04:00:00 GMT - Sat, 31 Jul 2027 04:00:00 GMT
Emitted:      Thu, 02 Aug 2018 21:45:37 GMT
Name:         535ad55d-dd30-40f9-8434-c17fc413aa99
Key:          4a75b5de16143adbeaa987d6d91e0519106d086e
Parent Key:   a6e7a6b44019cf4e388766d940677599d0c492dc
Path:         rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-...

The Max Length field in a ROA specifies the longest (most specific) prefix length that is acceptable for an announcement. The fact that this is a /20 route with a /20 Max Length says that a /21 (or /22 or /23 or /24) within this IP space isn’t allowed. Looking further at the route list above we get the following listing:

Route Seen            Cloudflare IRR & ROA    ROA Max Length
104.16.80.0/21    ->  104.16.80.0/20          /20
104.17.168.0/21   ->  104.17.160.0/20         /20
104.18.32.0/21    ->  104.18.32.0/20          /20
104.19.168.0/21   ->  104.19.160.0/20         /20
104.20.64.0/21    ->  104.20.64.0/20          /20
104.22.8.0/21     ->  104.22.0.0/20           /20
104.23.128.0/21   ->  104.23.128.0/20         /20
104.24.112.0/21   ->  104.24.112.0/20         /20
104.25.144.0/21   ->  104.25.144.0/20         /20
104.26.0.0/21     ->  104.26.0.0/20           /20
104.27.160.0/21   ->  104.27.160.0/20         /20
104.28.16.0/21    ->  104.28.16.0/20          /20
104.31.0.0/21     ->  104.31.0.0/20           /20
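
We can double-check any of these leaked more-specifics against RPKI ourselves. This is only a sketch: the rpki-validation data call, its resource/prefix parameters, and the .data.status field are assumptions based on RIPEstat’s public documentation, so confirm against the current API docs before relying on it. Run against the /20 ROA above, a leaked /21 should come back as invalid.

$ # Ask RIPEstat to validate one of the leaked /21's against the published ROAs
$ # (endpoint and field names are assumptions based on the public RIPEstat docs)
$ PREFIX="104.26.0.0/21"
$ curl -sS "https://stat.ripe.net/data/rpki-validation/data.json?resource=AS13335&prefix=${PREFIX}" |\
  jq -r '.data.status'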

So how did all these /21’s show up? That’s where we dive into the world of BGP route optimization systems and their propensity to synthesize routes that should not exist. If those routes leak (and it’s very clear after this week that they can), all hell breaks loose. That can be compounded when not one, but two ISPs allow invalid routes to be propagated outside their autonomous network. We will explore the AS-PATH further down this blog.

More than 20 years ago, RFC1997 added the concept of communities to BGP. Communities are a way of tagging or grouping route advertisements. Communities are often used to label routes so that specific handling policies can be applied. RFC1997 includes a small number of universal well-known communities. One of these is the NO_EXPORT community, which has the following specification:

    All routes received carrying a communities attribute
    containing this value MUST NOT be advertised outside a BGP
    confederation boundary (a stand-alone autonomous system that
    is not part of a confederation should be considered a
    confederation itself).

The use of the NO_EXPORT community is very common within BGP enabled networks and is a community tag that would have helped alleviate this route leak immensely.
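
As a quick sanity check on our own archive, we can also look at which communities (NO_EXPORT is the well-known value 65535:65281) were attached to the leaked updates. This too is only a sketch: it assumes RIPEstat exposes the community attribute of each update as .attrs.community in the bgp-updates data, so adjust the field name if the API structure differs.

$ # Tally the communities seen on the leaked updates (null means none recorded);
$ # NO_EXPORT would show up as the well-known value 65535:65281
$ jq -rc '.data.updates[].attrs.community' < 13335-routes.json |\
  sort | uniq -c | sort -rn | head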

How BGP route optimization systems work (or don’t work in this case) can be a subject for a whole other blog entry.

Timing of the route leak

As we saved away the timestamps in the JSON file and in the text files, we can confirm the time for every route in the route leak by looking at the first and the last timestamp of a route in the data. We saved data from 00:00:00 UTC until 00:00:00 the next day, so we know we have covered the period of the route leak. We write a script that checks the first and last entry for every route and reports the information sorted by start time:

$ # Extract the timing of the route leak
$ while read cidr
do
  echo $cidr
  fgrep $cidr < 13335-listing-b.txt | head -1 | cut -f1
  fgrep $cidr < 13335-listing-b.txt | tail -1 | cut -f1
done < 13335-listing-c.txt |\
paste - - - | sort -k2,3 | column -t | sed -e 's/2019-06-24T//g'
...
104.25.144.0/21   10:34:25  12:38:54
104.22.8.0/21     10:34:27  12:29:39
104.20.64.0/21    10:34:27  12:30:00
104.23.128.0/21   10:34:27  12:30:34
141.101.120.0/23  10:34:27  12:30:39
162.159.224.0/21  10:34:27  12:30:39
104.18.32.0/21    10:34:29  12:30:34
104.24.112.0/21   10:34:29  12:30:34
104.27.160.0/21   10:34:29  12:30:34
104.28.16.0/21    10:34:29  12:30:34
104.31.0.0/21     10:34:29  12:30:34
8.39.214.0/24     10:34:31  12:19:24
104.26.0.0/21     10:34:36  12:29:53
172.68.60.0/22    10:34:38  12:19:24
172.69.116.0/22   10:34:38  12:19:24
8.44.58.0/24      10:34:38  12:19:24
8.42.245.0/24     11:52:49  11:53:19
104.17.168.0/21   12:00:13  12:29:34
104.16.80.0/21    12:00:13  12:30:00
104.19.168.0/21   12:09:39  12:29:34
...
$

Now we know the times. The route leak started at 10:34:25 UTC (just before lunchtime London time) on 2019-06-24 and ended at 12:38:54 UTC. That’s a hair over two hours. Here’s that same time data in a graphical form showing the near-instant start of the event and the duration of each route leaked:


We can also go back to RIPEstat and look at the activity graph for Cloudflare’s AS13335 network:


Clearly between 10:30 UTC and 12:40 UTC there’s a lot of route activity – far more than normal.

Note that as we mentioned above, RIPEstat doesn’t get a full view of Verizon’s network routing and hence some of the propagated routes won’t show up.

Drilling down on the AS-PATH part of the data

Having the routes is useful, but now we want to look at the paths of these leaked routes to see which ASNs are involved. We knew the offending ASNs during the route leak on Monday. Now we want to dig deeper using the archived data. This allows us to see the extent and reach of this route leak.

$ # Extract the AS-PATH of the route leak
$ # Use the list of routes to extract the full AS-PATH
$ # Merge the results together to show an amalgamation of paths.
$ # We know (luckily) the last few ASNs in the AS-PATH are consistent
$ cut -f3 < 13335-listing-b.txt | tr -d '[\[\]]' |\
awk '{
  n=split($0, a, ",");
  printf "%50s\n",
    a[n-5] "_" a[n-4] "_" a[n-3] "_" a[n-2] "_" a[n-1] "_" a[n];
}' | sort -u
   174_701_396531_33154_3356_13335
   2497_701_396531_33154_174_13335
   577_701_396531_33154_3356_13335
   6939_701_396531_33154_174_13335
  1239_701_396531_33154_3356_13335
  1273_701_396531_33154_3356_13335
  1280_701_396531_33154_3356_13335
  2497_701_396531_33154_3356_13335
  2516_701_396531_33154_3356_13335
  3320_701_396531_33154_3356_13335
  3491_701_396531_33154_3356_13335
  4134_701_396531_33154_3356_13335
  4637_701_396531_33154_3356_13335
  6453_701_396531_33154_3356_13335
  6461_701_396531_33154_3356_13335
  6762_701_396531_33154_3356_13335
  6830_701_396531_33154_3356_13335
  6939_701_396531_33154_3356_13335
  7738_701_396531_33154_3356_13335
 12956_701_396531_33154_3356_13335
 17639_701_396531_33154_3356_13335
 23148_701_396531_33154_3356_13335
$

This script clearly shows the AS-PATH of the leaked routes. It’s very consistent. Reading from the back of the line to the front, we have 13335 (Cloudflare), 3356 or 174 (Level3/CenturyLink or Cogent – both tier 1 transit providers for Cloudflare). So far, so good. Then we have 33154 (DQE Communications) and 396531 (Allegheny Technologies Inc) which is still technically not a leak, but trending that way. The reason why we can state this is still technically not a leak is because we don’t know the relationship between those two ASs. It’s possible they have a mutual transit agreement between them. That’s up to them.

Back to the AS-PATHs. Still reading leftwards, we see 701 (Verizon), which is very, very bad and clear evidence of a leak. It’s a leak for two reasons. First, this matches the path when a transit is leaking a non-customer route from a customer. This Verizon customer does not have 13335 (Cloudflare) listed as a customer. Second, the route contains within its path a tier 1 ASN. This is the point where a route leak should have been absolutely squashed by filtering on the customer BGP session. Beyond this point there be dragons.

And dragons there be! Everything above is about how Verizon filtered (or didn’t filter) its customer. What follows the 701 (i.e. the numbers to the left of it) are the peers or customers of Verizon that accepted these leaked routes. They are mainly other tier 1 networks in this list: 174 (Cogent), 1239 (Sprint), 1273 (Vodafone), 3320 (DTAG), 3491 (PCCW), 6461 (Zayo), 6762 (Telecom Italia), etc.

What’s missing from that list are three networks worthy of mentioning – 1299 (Telia), 2914 (NTT), and 7018 (AT&T). All three implement a very simple AS-PATH filter which saved the day for their network. They do not allow one tier 1 ISP to send them a route which has another tier 1 further down the path. That’s because when that happens, it’s officially a leak as each tier 1 is fully connected to all other tier 1’s (which is part of the definition of a tier 1 network). The topology of the Internet’s global BGP routing tables simply states that if you see another tier 1 in the path, then it’s a bad route and it should be filtered away.
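
That filter is simple enough to approximate with the data we already saved. Here is a minimal sketch run against 13335-listing-b.txt from above; the tier 1 list is hand-maintained for illustration and is an assumption, not an authoritative registry of tier 1 networks.

$ # Count the paths that carry two or more tier 1 ASNs beyond the first hop
$ # (the first ASN in each path is the collector's peer, so we skip it)
$ cut -f3 < 13335-listing-b.txt | tr -d '[\[\]]' |\
awk 'BEGIN {
  split("174 701 1239 1273 1299 2914 3320 3356 3491 6453 6461 6762 7018", t, " ");
  for (i in t) tier1[t[i]] = 1;
  bad = 0;
}
{
  n = split($0, a, ",");
  c = 0;
  for (i = 2; i <= n; i++) if (a[i] in tier1) c++;
  if (c >= 2) bad++;
}
END { print bad " of " NR " paths contain two or more tier 1 ASNs past the first hop" }'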

Additionally, we know that 7018 (AT&T) operates a network which drops RPKI invalids. Because Cloudflare routes are RPKI signed, this also means that AT&T would have dropped these routes when it received them from Verizon. This shows a clear win for RPKI (and for AT&T when you see their bandwidth graph below)!


That all said, keep in mind we are still talking about routes that Cloudflare didn’t announce. They all came from the route optimizer.

What should the 701 Verizon network accept from their customer 396531?

This is a great question to ask. Normally we would look at the IRR (Internet Routing Registries) to see what policy an ASN wants for its routes.

$ whois -h whois.radb.net AS396531 ; the Verizon customer
%  No entries found for the selected source(s).
$ whois -h whois.radb.net AS33154  ; the downstream of that customer
%  No entries found for the selected source(s).
$ 

That’s enough to say that we should not be seeing this ASN anywhere on the Internet, however, we should go further into checking this. As we know the ASN of the network, we can search for any routes that are listed for that ASN. We find one:

$ whois -h whois.radb.net ' -i origin AS396531' | egrep '^route|^origin|^mnt-by|^source'
route:          192.92.159.0/24
origin:         AS396531
mnt-by:         MNT-DCNSL
source:         ARIN
$

More importantly, now we have a maintainer (the owner of the routing registry entries). We can see what else is there for that network and we are specifically looking for this:

$ whois -h whois.radb.net ' -i mnt-by -T as-set MNT-DCNSL' | egrep '^as-set|^members|^mnt-by|^source'
as-set:         AS-DQECUST
members:        AS4130, AS5050, AS11199, AS11360, AS12017, AS14088, AS14162,
                AS14740, AS15327, AS16821, AS18891, AS19749, AS20326,
                AS21764, AS26059, AS26257, AS26461, AS27223, AS30168,
                AS32634, AS33039, AS33154, AS33345, AS33358, AS33504,
                AS33726, AS40549, AS40794, AS54552, AS54559, AS54822,
                AS393456, AS395440, AS396531, AS15204, AS54119, AS62984,
                AS13659, AS54934, AS18572, AS397284
mnt-by:         MNT-DCNSL
source:         ARIN
$

This object is important. It lists all the downstream ASNs that this network is expected to announce to the world. It does not contain Cloudflare’s ASN (or any of the leaked ASNs). Clearly this as-set was not used for any BGP filtering.

Just for completeness the same exercise can be done for the other ASN (the downstream of the customer of Verizon). In this case, we just searched for the maintainer object (as there are plenty of route and route6 objects listed).

$ whois -h whois.radb.net ' -i origin AS33154' | egrep '^mnt-by' | sort -u
mnt-by:         MNT-DCNSL
mnt-by:     MAINT-AS3257
mnt-by:     MAINT-AS5050
$

None of these maintainers are directly related to 33154 (DQE Communications). They have been created by other parties and hence they become a dead-end in that search.

It’s worth doing a secondary search to see if any as-set object exists with 33154 or 396531 included. We turned to the most excellent IRR Explorer website run by NLNOG. It provides deep insight into the routing registry data. We did a simple search for 33154 using http://irrexplorer.nlnog.net/search/33154 and we found these as-set objects.


It’s interesting to see this ASN listed in other as-sets, but none are important or related to Monday’s route leak. Next we looked at 396531.


This shows that there’s nowhere else we need to check. AS-DQECUST is the as-set macro that controls (or should control) filtering for any transit provider of their network.
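
Had any of the transit providers wanted to turn AS-DQECUST into an actual filter, the tooling to do so is freely available. The following is a minimal sketch using bgpq3 (bgpq4 behaves similarly); the flags shown (-4 for IPv4, -l to name the generated list) are our assumption for current versions, so check the man page of whichever build is installed.

$ # Expand the customer's as-set into an IPv4 prefix filter via the IRR
$ bgpq3 -4 -l dqe-customers-in AS-DQECUST | head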

The summary of all the investigation is a solid statement that no Cloudflare routes or ASNs are listed anywhere within the routing registry data for the customer of Verizon. As there were 2,300 ASNs listed in the tweet above, we can conclusively state no filtering was in place and hence this route leak went on its way unabated.

IPv6? Where is the IPv6 route leak?

In what could be considered the only plus from Monday’s route leak, we can confirm that there was no route leak within IPv6 space. Why?

It turns out that 396531 (Allegheny Technologies Inc) is a network without IPv6 enabled. Normally you would hear Cloudflare chastise anyone that’s yet to enable IPv6, however, in this case we are quite happy that one of the two protocol families survived. IPv6 was stable during this route leak, which now can be called an IPv4-only route leak.

Yet that’s not really the whole story. Let’s look at the percentage of traffic Cloudflare sends Verizon that’s IPv6 (vs IPv4). Normally the IPv4/IPv6 percentage holds steady.


This uptick in IPv6 traffic could be the direct result of Happy Eyeballs on mobile handsets picking a working IPv6 path into Cloudflare vs a non-working IPv4 path. Happy Eyeballs is meant to protect against IPv6 failure; in this case it did a wonderful job of protecting against an IPv4 failure. We do have to be careful with this graph, though: on further investigation, the percentage only increased because IPv4 traffic dropped, not because IPv6 traffic grew. Sometimes graphs can be misinterpreted, yet Happy Eyeballs still did a good job even as end users were being affected.

Happy Eyeballs, described in RFC8305, is a mechanism where a client (let’s say a mobile device) tries to connect to a website over both IPv4 and IPv6 simultaneously. IPv6 is sometimes given a head-start. The theory is that, should a failure exist on one of the paths (sadly, an IPv6 failure is the norm), then IPv4 will save the day. Monday was a day of opposites for Verizon.
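
You can get a rough feel for what a Happy Eyeballs client experiences with two curl probes, one per address family. This is only a sketch (a real RFC8305 client races the connections, giving IPv6 a short head-start, rather than running them back to back), and the hostname is just an example:

$ # Probe the same site over IPv4 and then IPv6 and compare the connect times
$ for fam in -4 -6
do
  curl $fam -sS -o /dev/null --connect-timeout 5 \
       -w "ipv${fam#-}: connect %{time_connect}s, http %{http_code}\n" \
       https://www.cloudflare.com/
done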

In fact, enabling IPv6 for mobile users is the one area where we can praise the Verizon network this week (or at least the Verizon mobile network), unlike the residential Verizon networks where IPv6 is almost non-existent.

Using bandwidth graphs to confirm routing leaks and stability

As we have already stated, Verizon impacted their own users/customers. Let’s start with their bandwidth graph:


The red line is 24 June 2019 (00:00 UTC to 00:00 UTC the next day). The gray lines are previous days for comparison. This graph includes both Verizon fixed-line services like FiOS along with mobile.

The AT&T graph is quite different.


There’s no perturbation. This, along with some direct confirmation, shows that 7018 (AT&T) was not affected. This is an important point.

Going back and looking at a third tier 1 network, we can see that 6762 (Telecom Italia) was affected by this route leak, even though Cloudflare has a direct interconnect with them.


We will be asking Telecom Italia to improve their route filtering as we now have this data.

Future work that could have helped on Monday

The IETF is doing work in the area of BGP path protection within the Secure Inter-Domain Routing Operations Working Group (sidrops) area. The charter of this IETF group is:

The SIDR Operations Working Group (sidrops) develops guidelines for the operation of SIDR-aware networks, and provides operational guidance on how to deploy and operate SIDR technologies in existing and new networks.

One new effort from this group should be called out to show how important the issue of route leaks like today’s event is. The draft document from Alexander Azimov et al named draft-ietf-sidrops-aspa-profile (ASPA stands for Autonomous System Provider Authorization) extends the RPKI data structures to handle BGP path information. This is ongoing work, and Cloudflare and other companies are clearly interested in seeing it progress further.

However, as we said in Monday’s blog and something we should reiterate again and again: Cloudflare encourages all network operators to deploy RPKI now!

Acronyms used in the blog

  • API – Application Programming Interface
  • AS-PATH – The list of ASNs that a route has traversed so far
  • ASN – Autonomous System Number – A unique number assigned for each network on the Internet
  • BGP – Border Gateway Protocol (version 4) – the core routing protocol for the Internet
  • IETF – Internet Engineering Task Force – an open standards organization
  • IPv4 – Internet Protocol version 4
  • IPv6 – Internet Protocol version 6
  • IRR – Internet Routing Registries – a database of Internet route objects
  • ISP – Internet Service Provider
  • JSON – JavaScript Object Notation – a lightweight data-interchange format
  • RFC – Request For Comment – published by the IETF
  • RIPE NCC – Réseaux IP Européens Network Coordination Centre – a regional Internet registry
  • ROA – Route Origin Authorization – a cryptographically signed attestation of a BGP route announcement
  • RPKI – Resource Public Key Infrastructure – a public key infrastructure framework for routing information
  • Tier 1 – A network that has no default route and peers with all other tier 1’s
  • UTC – Coordinated Universal Time – a time standard for clocks and time
  • "there be dragons" – a mistype, as it was meant to be "here be dragons" which means dangerous or unexplored territories

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today

Post Syndicated from Marketing Translators original https://blog.cloudflare.com/404/

A massive route leak impacted major Internet services, including Cloudflare

What happened?

Today (24 June 2019) at 10:30 UTC, the Internet suffered a significant shock. Many Internet routes that pass through Verizon (AS701), a major Internet transit provider, were steered through a small company in northern Pennsylvania. This was the equivalent of DiDi mistakenly routing an entire freeway down a small alley, leaving large numbers of Internet users unable to reach many websites on Cloudflare and the services of many other providers. This should never have happened, because Verizon should never have forwarded those routes to the rest of the Internet. To understand why, read on.

Unfortunate events like this are not uncommon, and we have published blog posts about them before. This time, the whole world once again witnessed the serious damage they can cause. The involvement of a “BGP Optimizer” product from Noction made today’s event even worse. This product has a feature that splits received IP prefixes into smaller component parts (called more-specifics). For example, our own IPv4 route 104.20.0.0/20 was turned into 104.20.0.0/21 and 104.20.8.0/21. It is as if the road sign pointing to “Beijing” were replaced by two signs, one for “Beijing East” and one for “Beijing West”. By splitting these major IP blocks into smaller parts, a network can steer traffic inside its own network, but such a split should never have been broadcast to the global Internet. The fact that it was is what caused today’s outage.

To explain what happened next, here is a quick recap of how the underlying “map” of the Internet works. “Internet” literally means a network of networks. It is made up of networks called Autonomous Systems (AS), and each of these networks has a unique identifier, its AS number. All of these networks are interconnected using the Border Gateway Protocol (BGP). BGP joins these networks together and builds the Internet “map” that allows traffic to travel from one place (for example, your ISP) to a popular website on the other side of the globe.

Using BGP, networks exchange routing information: how to reach them from wherever you are. These routes can be very specific, similar to looking up a particular city on your GPS, or very broad, like pointing your GPS at a province. This is where today’s problem lay.

An Internet service provider in Pennsylvania (AS33154 – DQE Communications) was using a BGP optimizer in its network, which meant there were many more-specific routes in its network. Specific routes take priority over more general routes (in the DiDi analogy, a route to Beijing Railway Station is more specific than a route to Beijing).

DQE announced these specific routes to its customer (AS396531 – Allegheny). All of this routing information was then sent on to their other transit provider (AS701 – Verizon), who leaked these “better” routes to the entire Internet. The routes were mistaken for “better” routes because they were more granular, more specific.

The leak should have stopped at Verizon. However, contrary to the many best practices outlined below, Verizon’s lack of filtering turned this leak into a major incident that affected many Internet services, such as Amazon, Linode, and Cloudflare.

This meant that Verizon, Allegheny, and DQE suddenly had to cope with a stampede of Internet users trying to reach those services through their networks. Because none of these networks was equipped to handle such a drastic increase in traffic, service was disrupted. Even if they had had sufficient capacity to absorb the surge, DQE, Allegheny, and Verizon should never have been allowed to claim they had the best route to Cloudflare, Amazon, Linode, and so on…

The BGP leak process involving a BGP optimizer

During the incident, at its worst, we observed a loss of about 15% of our global traffic.

Traffic levels at Cloudflare during the incident.

How could this kind of leak be prevented?

There are multiple ways this kind of leak could have been avoided:

First, when configuring a BGP session, a hard limit can be set on the number of prefixes it will accept. This means that if the number of received prefixes exceeds the threshold, the router can decide to shut down the session. If Verizon had had such a prefix limit in place, this incident would not have happened. Setting prefix limits is a best practice, and it costs a provider like Verizon nothing to do so. There is no reasonable explanation for why they had no such limits other than sloppiness or laziness.

Second, another way network operators can prevent leaks like this one is by implementing IRR-based filtering. The IRR is the Internet Routing Registry; network owners can add entries for their prefixes to these distributed databases. Other network operators can then use these IRR records to generate specific prefix lists for the BGP sessions with their peers. If IRR filtering had been used, none of the networks involved would have accepted the more-specific prefixes. What is quite shocking is that, even though IRR filtering has been around (and well documented) for 24 years, Verizon did not implement any such filtering on its BGP session with Allegheny. IRR filtering would not have added any cost for Verizon or limited its service in any way. Again, the only explanation we can think of is sloppiness or laziness.

Third, the RPKI framework that we implemented and deployed globally last year is designed to prevent this type of leak. It enables filtering on the origin network and the prefix size. The prefixes Cloudflare announces are signed for a maximum length of 20 bits. RPKI then indicates that no more-specific prefix should be accepted, no matter what the path is. For this mechanism to take effect, a network needs to enable BGP Origin Validation. Many providers, such as AT&T, have already enabled it successfully in their networks.

If Verizon had used RPKI, they would have seen that the advertised routes were invalid, and their routers could have dropped them automatically.

Cloudflare encourages all network operators to deploy RPKI now!

Route leak prevention using IRR, RPKI, and prefix limits

All of the above recommendations are nicely condensed into MANRS (Mutually Agreed Norms for Routing Security).

How it was resolved

Cloudflare’s network team reached out to the networks involved: AS33154 (DQE Communications) and AS701 (Verizon). Making contact was not easy, which may have been because the route leak started in the early morning hours on the US East Coast.

Screenshot of the email sent to Verizon

One of our network engineers made contact with DQE Communications quickly, and after a short delay they put us in touch with the people who could fix the problem. DQE worked with us over the phone to stop advertising these “optimized” routes to Allegheny. We are grateful for their help. Once that was done, the Internet stabilized and things returned to normal.

Screenshot of attempts to communicate with DQE and Verizon support

Unfortunately, although we tried to reach Verizon by both email and phone, at the time of writing (more than eight hours after the incident) we had not heard back from them, nor were we aware of any action on their part to resolve the issue.

At Cloudflare, we wish events like this never took place, but unfortunately the current state of the Internet does very little to prevent them. It is time for the industry to adopt better, more secure routing through systems like RPKI. We hope that major providers will follow the lead of Cloudflare, Amazon, and AT&T and start validating routes. In particular, we are watching you, Verizon, and still waiting for your reply.

Although the events that caused this disruption were outside our control, we are sorry for it all the same. Our team cares deeply about our service, and within minutes of identifying the problem we had engineers online in the US, UK, Australia, and Singapore working to resolve it.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/

Massive route leak impacts major parts of the Internet, including Cloudflare


What happened?

Today at 10:30 UTC, the Internet had a small heart attack. A small company in Northern Pennsylvania became a preferred path of many Internet routes through Verizon (AS701), a major Internet transit provider. This was the equivalent of Waze routing an entire freeway down a neighborhood street — resulting in many websites on Cloudflare, and many other providers, being unavailable from large parts of the Internet. This should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. To understand why, read on.

We have blogged about these unfortunate events in the past, as they are not uncommon. This time, the damage was seen worldwide. What exacerbated the problem today was the involvement of a “BGP Optimizer” product from Noction. This product has a feature that splits up received IP prefixes into smaller, contributing parts (called more-specifics). For example, our own IPv4 route 104.20.0.0/20 was turned into 104.20.0.0/21 and 104.20.8.0/21. It’s as if the road sign directing traffic to “Pennsylvania” was replaced by two road signs, one for “Pittsburgh, PA” and one for “Philadelphia, PA”. By splitting these major IP blocks into smaller parts, a network has a mechanism to steer traffic within their network but that split should never have been announced to the world at large. When it was, it caused today’s outage.

To explain what happened next, here’s a quick summary of how the underlying “map” of the Internet works. “Internet” literally means a network of networks and it is made up of networks called Autonomous Systems (AS), and each of these networks has a unique identifier, its AS number. All of these networks are interconnected using a protocol called Border Gateway Protocol (BGP). BGP joins these networks together and builds the Internet “map” that enables traffic to travel from, say, your ISP to a popular website on the other side of the globe.

Using BGP, networks exchange route information: how to get to them from wherever you are. These routes can either be specific, similar to finding a specific city on your GPS, or very general, like pointing your GPS to a state. This is where things went wrong today.

An Internet Service Provider in Pennsylvania  (AS33154 – DQE Communications) was using a BGP optimizer in their network, which meant there were a lot of more specific routes in their network. Specific routes override more general routes (in the Waze analogy a route to, say, Buckingham Palace is more specific than a route to London).

DQE announced these specific routes to their customer (AS396531 – Allegheny Technologies Inc). All of this routing information was then sent to their other transit provider (AS701 – Verizon), who proceeded to tell the entire Internet about these “better” routes. These routes were supposedly “better” because they were more granular, more specific.

The leak should have stopped at Verizon. However, against numerous best practices outlined below, Verizon’s lack of filtering turned this into a major incident that affected many Internet services such as Amazon, Fastly,  Linode and Cloudflare.

What this means is that suddenly Verizon, Allegheny, and DQE had to deal with a stampede of Internet users trying to access those services through their network. None of these networks were suitably equipped to deal with this drastic increase in traffic, causing disruption in service. Even if they had sufficient capacity, DQE, Allegheny, and Verizon were not allowed to say they had the best route to Cloudflare, Amazon, Fastly, Linode, etc…

BGP leak process with a BGP optimizer

During the incident, at its worst, we observed a loss of about 15% of our global traffic.

Traffic levels at Cloudflare during the incident.

How could this leak have been prevented?

There are multiple ways this leak could have been avoided:

A BGP session can be configured with a hard limit of prefixes to be received. This means a router can decide to shut down a session if the number of prefixes goes above the threshold. Had Verizon had such a prefix limit in place, this would not have occurred. It is a best practice to have such limits in place. It doesn’t cost a provider like Verizon anything to have such limits in place. And there’s no good reason, other than sloppiness or laziness, that they wouldn’t have such limits in place.

A different way network operators can prevent leaks like this one is by implementing IRR-based filtering. IRR is the Internet Routing Registry, and networks can add entries to these distributed databases. Other network operators can then use these IRR records to generate specific prefix lists for the BGP sessions with their peers. If IRR filtering had been used, none of the networks involved would have accepted the faulty more-specifics. What’s quite shocking is that it appears that Verizon didn’t implement any of this filtering in their BGP session with Allegheny Technologies, even though IRR filtering has been around (and well documented) for over 24 years. IRR filtering would not have increased Verizon’s costs or limited their service in any way. Again, the only explanation we can conceive of why it wasn’t in place is sloppiness or laziness.

The RPKI framework that we implemented and deployed globally last year is designed to prevent this type of leak. It enables filtering on origin network and prefix size. The prefixes Cloudflare announces are signed for a maximum size of 20. RPKI then indicates any more-specific prefix should not be accepted, no matter what the path is. In order for this mechanism to take action, a network needs to enable BGP Origin Validation. Many providers like AT&T have already enabled it successfully in their network.

If Verizon had used RPKI, they would have seen that the advertised routes were not valid, and the routes could have been automatically dropped by the router.

Cloudflare encourages all network operators to deploy RPKI now!

Route leak prevention using IRR, RPKI, and prefix limits

All of the above suggestions are nicely condensed into MANRS (Mutually Agreed Norms for Routing Security).

How it was resolved

The network team at Cloudflare reached out to the networks involved, AS33154 (DQE Communications) and AS701 (Verizon). We had difficulties reaching either network; this may have been due to the time of the incident, as it was still early on the East Coast of the US when the route leak started.

Screenshot of the email sent to Verizon

One of our network engineers made contact with DQE Communications quickly and after a little delay they were able to put us in contact with someone who could fix the problem. DQE worked with us on the phone to stop advertising these “optimized” routes to Allegheny Technologies Inc. We’re grateful for their help. Once this was done, the Internet stabilized, and things went back to normal.

Screenshot of attempts to communicate with the support for DQE and Verizon

It is unfortunate that while we tried both e-mail and phone calls to reach out to Verizon, at the time of writing this article (over 8 hours after the incident), we have not heard back from them, nor are we aware of them taking action to resolve the issue.

At Cloudflare, we wish that events like this never take place, but unfortunately the current state of the Internet does very little to prevent incidents such as this one from occurring. It’s time for the industry to adopt better routing security through systems like RPKI. We hope that major providers will follow the lead of Cloudflare, Amazon, and AT&T and start validating routes. And, in particular, we’re looking at you Verizon — and still waiting on your reply.

Despite this being caused by events outside our control, we’re sorry for the disruption. Our team cares deeply about our service and we had engineers in the US, UK, Australia, and Singapore online minutes after this problem was identified.


How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/comment-verizon-et-un-optimiseur-bgp-ont-affecte-de-nombreuses-partie-dinternet-aujourdhu/

A massive route leak impacted many parts of the Internet, including Cloudflare

What happened?

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today

Today at 10:30 UTC, the Internet had a sort of small heart attack. A small company in northern Pennsylvania became the preferred path for many Internet routes because of Verizon (AS701), a major Internet transit provider. It was a bit like Waze routing the traffic of an entire highway down a small neighborhood street: many websites on Cloudflare, and many other providers, were unavailable from large parts of the Internet. This should never have happened, because Verizon should never have forwarded those routes to the rest of the Internet. To understand why, read on.

We have written a number of articles in the past about these unfortunate events, which are more frequent than you might think. This time, the effects could be observed worldwide. Today, the problem was made worse by the involvement of a "BGP Optimizer" product from Noction. This product has a feature that splits received IP prefixes into smaller contributing parts (called more-specific prefixes). For example, our own IPv4 route 104.20.0.0/20 was turned into 104.20.0.0/21 and 104.20.8.0/21. Imagine a single road sign directing traffic from Paris towards Normandy being replaced by two signs, one for Le Havre and one for Dieppe. By splitting these large IP blocks into smaller parts, a network has a mechanism to steer traffic within its own network, but that split should never have been announced to the world at large. Because it was, today's outage happened.

To explain what happened next, here is a quick reminder of how the basic "map" of the Internet works. "Internet" literally means a network of networks, made up of networks called Autonomous Systems (AS for short). Each of these networks has a unique identifier, its AS number. All these networks are interconnected by a routing protocol called the Border Gateway Protocol (BGP). BGP joins these networks together and builds the Internet "map" that enables traffic to travel from, say, your Internet service provider to a popular website on the other side of the globe.

Using BGP, networks exchange route information: how to reach them from wherever you are. These routes can be either specific, like when you search for a city on your GPS, or more general, like when you ask your GPS to head towards a region. This is where things went wrong today.

An Internet service provider in Pennsylvania (AS33154 – DQE Communications) was using a BGP optimizer in their network, which meant there were many more-specific routes within their own network. More-specific routes override more general routes (to keep the Waze analogy, a route to the Palais de l'Elysée is much more specific than a route to Paris).

DQE announced these specific routes to their customer (AS396531 – Allegheny Technologies Inc). All of this routing information was then sent to their other transit provider (AS701 – Verizon), who proceeded to tell the entire Internet about these "optimized" routes. These routes were supposedly "optimized" because they were more granular, more specific.

The leak should have stopped at Verizon. However, against the best practices outlined below, Verizon's lack of filtering turned this into a major incident, causing outages for many Internet services such as Amazon, Linode, and Cloudflare.

In other words, Verizon, Allegheny, and DQE suddenly had to deal with a massive influx of Internet users trying to reach those services through their network. None of these networks was appropriately equipped to handle this dramatic increase in traffic, causing disruption in service. Even if they had had enough capacity, DQE, Allegheny, and Verizon were not allowed to claim they had the best route to Cloudflare, Amazon, Linode, etc.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
BGP route leak process with a BGP optimizer

At the worst of the incident, we observed a loss of about 15% of our global traffic.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
Traffic levels at Cloudflare during the incident.

How could this leak have been prevented?

There are multiple ways this leak could have been avoided:

A BGP session can be configured with a hard limit of prefixes to be received. This means a router can decide to shut down a session if the number of prefixes goes above the threshold. Had Verizon had such a prefix limit in place, this incident would not have occurred. Putting such limits in place is a best practice, and it would have cost Verizon nothing to do so. There is no good reason, other than sloppiness or laziness, not to have them.

A different way network operators can prevent leaks like this one is by implementing IRR-based filtering. IRR is the Internet Routing Registry, and networks can add entries to these distributed databases. Other network operators can then use these IRR records to generate specific prefix lists for the BGP sessions with their peers. If IRR filtering had been used, none of the networks involved would have accepted the faulty more-specifics. What's quite shocking is that it appears Verizon didn't apply any of this filtering on its BGP session with Allegheny Technologies, even though IRR filtering has been around (and well documented) for over 24 years. IRR filtering would not have increased Verizon's costs or limited their service in any way. Again, the only explanation we can conceive of for why it wasn't in place is sloppiness or laziness.

The RPKI framework that we implemented and deployed globally last year is designed to prevent exactly this type of leak. It enables filtering on both the origin network and the prefix size. The prefixes Cloudflare announces are signed for a maximum prefix length of /20, so RPKI indicates that any more-specific announcement should not be accepted, no matter what the path is. For this mechanism to take effect, a network needs to enable BGP Origin Validation. Many providers, like AT&T, have already enabled it successfully in their networks.

If Verizon had used RPKI, they would have seen that the advertised routes were not valid, and the routes could have been automatically dropped by the router.

Cloudflare encourages all network operators to deploy RPKI now!

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
Route leak prevention using IRR, RPKI, and prefix limits

All of the above suggestions are nicely condensed into MANRS (Mutually Agreed Norms for Routing Security).

How it was resolved

The network team at Cloudflare reached out to the networks involved, AS33154 (DQE Communications) and AS701 (Verizon). We had difficulty reaching either network; this may have been due to the time of the incident, as it was still early on the East Coast of the US when the route leak started.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
Screenshot of the email sent to Verizon

One of our network engineers made contact with DQE Communications quickly, and after a little delay they were able to put us in contact with someone who could fix the problem. We worked with DQE on the phone to stop advertising these "optimized" routes to Allegheny Technologies Inc. We're grateful for their help. Once this was done, the Internet stabilized and things went back to normal.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
Screenshot of attempts to communicate with the support for DQE and Verizon

It is unfortunate that, while we tried both e-mail and phone calls to reach out to Verizon, at the time of writing this article (over 8 hours after the incident) we have not heard back from them, nor are we aware of them taking action to resolve the issue.

At Cloudflare, we wish that events like this never took place, but unfortunately the current state of the Internet does very little to prevent incidents such as this one from occurring. It's time for the industry to adopt better routing security through systems like RPKI. We hope that major providers will follow the lead of Cloudflare, Amazon, and AT&T and start validating routes. We are looking at you in particular, Verizon, and we are still waiting for your reply.

Despite this being caused by events outside our control, we're sorry for the disruption. Our team cares deeply about our service, and we had engineers online in the US, UK, Australia, and Singapore within minutes of the problem being identified.

Securing Certificate Issuance using Multipath Domain Control Validation

Post Syndicated from Dina Kozlov original https://blog.cloudflare.com/secure-certificate-issuance/

Securing Certificate Issuance using Multipath Domain Control Validation

Securing Certificate Issuance using Multipath Domain Control Validation

Trust on the Internet is underpinned by the Public Key Infrastructure (PKI). PKI grants servers the ability to securely serve websites by issuing digital certificates, providing the foundation for encrypted and authentic communication.

Certificates make HTTPS encryption possible by using the public key in the certificate to verify server identity. HTTPS is especially important for websites that transmit sensitive data, such as banking credentials or private messages. Thankfully, modern browsers, such as Google Chrome, flag websites not secured using HTTPS by marking them "Not secure," helping users be more conscious of the security of the websites they visit.

Now that we know what certificates are used for, let’s talk about where they come from.

Certificate Authorities

Certificate Authorities (CAs) are the institutions responsible for issuing certificates.

When issuing a certificate for any given domain, they use Domain Control Validation (DCV) to verify that the entity requesting a certificate for the domain is the legitimate owner of the domain. With DCV the domain owner:

  1. creates a DNS resource record for a domain;
  2. uploads a document to the web server located at that domain; OR
  3. proves ownership of the domain’s administrative email account.

The DCV process prevents adversaries from obtaining private-key and certificate pairs for domains not owned by the requestor.  

Preventing adversaries from acquiring this pair is critical: if an incorrectly issued certificate and private-key pair wind up in an adversary’s hands, they could pose as the victim’s domain and serve sensitive HTTPS traffic. This violates our existing trust of the Internet, and compromises private data on a potentially massive scale.

For example, an adversary that tricks a CA into mis-issuing a certificate for gmail.com could then perform TLS handshakes while pretending to be Google, and exfiltrate cookies and login information to gain access to the victim’s Gmail account. The risks of certificate mis-issuance are clearly severe.

Domain Control Validation

To prevent attacks like this, CAs only issue a certificate after performing DCV. One way of validating domain ownership is HTTP validation, done by uploading a text file to a specific HTTP endpoint on the web server for the domain being secured. Another DCV method uses email verification, where an email with a validation code link is sent to the administrative contact for the domain.

HTTP Validation

Suppose Alice buys the domain name aliceswonderland.com and wants to get a dedicated certificate for this domain. Alice chooses to use Let's Encrypt as her certificate authority. First, Alice must generate her own private key and create a certificate signing request (CSR). She sends the CSR to Let's Encrypt, but the CA won't issue a certificate for that CSR and private key until they know Alice owns aliceswonderland.com. Alice can then choose to prove that she owns this domain through HTTP validation.

When Let’s Encrypt performs DCV over HTTP, they require Alice to place a randomly named file in the /.well-known/acme-challenge path for her website. The CA must retrieve the text file by sending an HTTP GET request to http://aliceswonderland.com/.well-known/acme-challenge/<random_filename>. An expected value must be present on this endpoint for DCV to succeed.

For HTTP validation, Alice would upload a file to http://aliceswonderland.com/.well-known/acme-challenge/YnV0dHNz

The CA (or anyone else) can then retrieve that file over HTTP to check its contents:

curl http://aliceswonderland.com/.well-known/acme-challenge/YnV0dHNz

GET /.well-known/acme-challenge/YnV0dHNz
Host: aliceswonderland.com

HTTP/1.1 200 OK
Content-Type: application/octet-stream

YnV0dHNz.TEST_CLIENT_KEY

The response body combines the Base64 token YnV0dHNz with TEST_CLIENT_KEY, an account-linked key that only the certificate requestor and the CA know. The CA uses this combination to verify that the certificate requestor actually controls the domain. Afterwards, Alice can get her certificate for her website!

DNS Validation

Another way users can validate domain ownership is to add a DNS TXT record containing a verification string or token from the CA to their domain's resource records. For example, here's a domain for an enterprise validating itself to Google:

$ dig TXT aliceswonderland.com
aliceswonderland.com.	 28 IN TXT "google-site-verification=COanvvo4CIfihirYW6C0jGMUt2zogbE_lC6YBsfvV-U"

Here, Alice chooses to create a TXT DNS resource record with a specific token value. A Google CA can verify the presence of this token to validate that Alice actually owns her website.

Types of BGP Hijacking Attacks

Certificate issuance is required for servers to securely communicate with clients. This is why it’s so important that the process responsible for issuing certificates is also secure. Unfortunately, this is not always the case.

Researchers at Princeton University recently discovered that common DCV methods are vulnerable to attacks executed by network-level adversaries. If Border Gateway Protocol (BGP) is the “postal service” of the Internet responsible for delivering data through the most efficient routes, then Autonomous Systems (AS) are individual post office branches that represent an Internet network run by a single organization. Sometimes network-level adversaries advertise false routes over BGP to steal traffic, especially if that traffic contains something important, like a domain’s certificate.

Bamboozling Certificate Authorities with BGP highlights five types of attacks that can be orchestrated during the DCV process to obtain a certificate for a domain the adversary does not own. After implementing these attacks, the authors were able to (ethically) obtain certificates for domains they did not own from the top five CAs: Let’s Encrypt, GoDaddy, Comodo, Symantec, and GlobalSign. But how did they do it?

Attacking the Domain Control Validation Process

There are two main approaches to attacking the DCV process with BGP hijacking:

  1. Sub-Prefix Attack
  2. Equally-Specific-Prefix Attack

These attacks create a vulnerability when an adversary sends a certificate signing request for a victim's domain to a CA. When the CA verifies the network resources using an HTTP GET request (as discussed earlier), the adversary uses BGP attacks to hijack traffic to the victim's domain so that the CA's request is rerouted to the adversary instead of the domain owner. To understand how these attacks are conducted, we first need to do a little bit of math.

Securing Certificate Issuance using Multipath Domain Control Validation

Every device on the Internet uses an IP (Internet Protocol) address as a numerical identifier. IPv4 addresses contain 32 bits and use slash notation to indicate the length of the network prefix. So, in the network address 123.1.2.0/24, "/24" refers to how many bits make up the network prefix. That leaves 8 bits for host addresses, for a total of 256 addresses. The smaller the prefix number, the more host addresses the network contains. With this knowledge, let's jump into the attacks!
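Before diving in, the arithmetic above can be double-checked with Python's standard ipaddress module (a quick illustration, nothing more):

import ipaddress

# A /24 leaves 32 - 24 = 8 host bits, i.e. 2**8 = 256 addresses.
print(ipaddress.ip_network("123.1.2.0/24").num_addresses)  # 256
# A shorter prefix covers more addresses: a /8 spans 2**24 of them.
print(ipaddress.ip_network("123.0.0.0/8").num_addresses)   # 16777216
# And the /24 sits inside the /8.
print(ipaddress.ip_network("123.1.2.0/24").subnet_of(
    ipaddress.ip_network("123.0.0.0/8")))                  # True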

Attack one: Sub-Prefix Attack

When BGP announces a route, routers always prefer to follow the more specific route. So if 123.0.0.0/8 and 123.1.2.0/24 are both advertised, a router will use the latter, as it is the more specific prefix. This becomes a problem when an adversary announces a more specific prefix that falls inside the victim's address space. Let's say our victim, leagueofentropy.com, sits inside the prefix 123.0.0.0/8. If an adversary announces the prefix 123.1.2.0/24, they will capture the victim's traffic, launching a sub-prefix hijack attack.
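A toy longest-prefix-match lookup (a sketch with an illustrative routing table, not real router code) shows why the /24 wins:

import ipaddress

# Toy routing table: prefix -> where the route came from. Entries are illustrative.
routes = {
    ipaddress.ip_network("123.0.0.0/8"): "legitimate origin",
    ipaddress.ip_network("123.1.2.0/24"): "hijacker",
}

def lookup(destination):
    dest = ipaddress.ip_address(destination)
    # Longest-prefix match: the most specific covering route wins.
    matches = [net for net in routes if dest in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]

print(lookup("123.1.2.10"))   # "hijacker": the /24 is more specific
print(lookup("123.200.0.1"))  # "legitimate origin": only the /8 covers it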

For example, in an attack during April 2018, routes were announced with the more specific /24 vs. the existing /23. In the diagram below, /23 is Texas and /24 is the more specific Austin, Texas. The new (but nefarious) routes overrode the existing routes for portions of the Internet. The attacker then ran a nefarious DNS server on the normal IP addresses with DNS records pointing at some new nefarious web server instead of the existing server. This attracted the traffic destined for the victim’s domain within the area the nefarious routes were being propagated. The reason this attack was successful was because a more specific prefix is always preferred by the receiving routers.

Securing Certificate Issuance using Multipath Domain Control Validation

Attack two: Equally-Specific-Prefix Attack

In the last attack, the adversary was able to hijack traffic by offering a more specific announcement, but what if the victim's prefix is already a /24 and a sub-prefix attack is not viable? In this case, an attacker launches an equally-specific-prefix hijack, announcing the same prefix as the victim. Each AS that receives both announcements then chooses its preferred route based on properties like path length, so this attack only ever intercepts a portion of the traffic.
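As a rough sketch of that tie-break, with entirely made-up AS paths, the outcome depends on where on the Internet the observer sits:

# Two announcements for the same prefix; a common tie-breaker is the shorter
# AS path as seen from this particular vantage point. Paths are illustrative.
announcements = {
    "victim":   ["AS174", "AS13335"],            # 2 hops to the real origin
    "attacker": ["AS701", "AS396531", "AS666"],  # 3 hops to the impostor
}

best = min(announcements, key=lambda who: len(announcements[who]))
print(best)  # "victim" from here; networks closer to the attacker may choose differently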

Securing Certificate Issuance using Multipath Domain Control Validation

There are more advanced attacks that are covered in more depth in the paper. They are fundamentally similar attacks but are more stealthy.

Once an attacker has successfully obtained a bogus certificate for a domain that they do not own, they can perform a convincing attack in which they pose as the victim's domain and decrypt and intercept the victim's TLS traffic. The ability to decrypt the TLS traffic allows the adversary to completely Monster-in-the-Middle (MITM) encrypted TLS traffic and reroute Internet traffic destined for the victim's domain to themselves. To stay stealthy, the adversary then forwards the traffic on to the victim's domain so the attack goes undetected.

DNS Spoofing

Another way an adversary can gain control of a domain is by spoofing DNS traffic by using a source IP address that belongs to a DNS nameserver. Because anyone can modify their packets’ outbound IP addresses, an adversary can fake the IP address of any DNS nameserver involved in resolving the victim’s domain, and impersonate a nameserver when responding to a CA.

This attack is more sophisticated than simply spamming a CA with falsified DNS responses. Because each DNS query has its own randomized query identifiers and source port, a fake DNS response must match the DNS query’s identifiers to be convincing. Because these query identifiers are random, making a spoofed response with the correct identifiers is extremely difficult.

Adversaries can fragment User Datagram Protocol (UDP) DNS packets so that identifying DNS response information (like the random DNS query identifier) is delivered in one packet, while the actual answer section follows in another packet. This way, the adversary spoofs the DNS response to a legitimate DNS query.

Say an adversary wants to get a mis-issued certificate for victim.com by forcing packet fragmentation and spoofing DNS validation. The adversary sends a DNS nameserver for victim.com a DNS packet with a small Maximum Transmission Unit, or maximum byte size. This gets the nameserver to start fragmenting DNS responses. When the CA sends a DNS query to a nameserver for victim.com asking for victim.com’s TXT records, the nameserver will fragment the response into the two packets described above: the first contains the query ID and source port, which the adversary cannot spoof, and the second one contains the answer section, which the adversary can spoof. The adversary can continually send a spoofed answer to the CA throughout the DNS validation process, in the hopes of sliding their spoofed answer in before the CA receives the real answer from the nameserver.

In doing so, the answer section of a DNS response (the important part!) can be falsified, and an adversary can trick a CA into mis-issuing a certificate.

Securing Certificate Issuance using Multipath Domain Control Validation

Solution

At first glance, one might think a Certificate Transparency log could expose a mis-issued certificate and allow a CA to quickly revoke it. CT logs, however, can take up to 24 hours to include newly issued certificates, and certificate revocation is inconsistently enforced across browsers. We need a solution that allows CAs to proactively prevent these attacks, not retroactively address them.

We’re excited to announce that Cloudflare provides CAs a free API to leverage our global network to perform DCV from multiple vantage points around the world. This API bolsters the DCV process against BGP hijacking and off-path DNS attacks.

Given that Cloudflare runs 175+ datacenters around the world, we are in a unique position to perform DCV from multiple vantage points. Each datacenter has a unique path to DNS nameservers or HTTP endpoints, which means that successful hijacking of a BGP route can only affect a subset of DCV requests, further hampering BGP hijacks. And since we use RPKI, we actually sign and verify BGP routes.

This DCV checker additionally protects CAs against off-path DNS spoofing attacks. One feature we built into the service to help protect against off-path attackers is DNS query source IP randomization. By making the source IP unpredictable to the attacker, it becomes much harder to spoof the second fragment of the forged DNS response to the DCV validation agent.

By comparing multiple DCV results collected over multiple paths, our DCV API makes it virtually impossible for an adversary to mislead a CA into thinking they own a domain when they actually don’t. CAs can use our tool to ensure that they only issue certificates to rightful domain owners.

Our multipath DCV checker consists of two services:

  1. DCV agents responsible for performing DCV out of a specific datacenter, and
  2. a DCV orchestrator that handles multipath DCV requests from CAs and dispatches them to a subset of DCV agents.

When a CA wants to ensure that DCV occurred without being intercepted, it can send a request to our API specifying the type of DCV to perform and its parameters.

Securing Certificate Issuance using Multipath Domain Control Validation

The DCV orchestrator then forwards each request to a random subset of over 20 DCV agents in different datacenters. Each DCV agent performs the DCV request and forwards the result to the DCV orchestrator, which aggregates what each agent observed and returns it to the CA.
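The following Python sketch captures the fan-out-and-compare idea; the vantage points, observed values, and quorum rule are illustrative assumptions, not Cloudflare's actual implementation:

import random

# Hypothetical view of what each vantage point observes for the challenge.
# In a real deployment each agent performs the HTTP or DNS lookup itself;
# here one poisoned vantage point simulates a localized BGP hijack.
OBSERVATIONS = {
    "SIN": "token-abc",
    "LHR": "token-abc",
    "SJC": "token-abc",
    "GRU": "token-abc",
    "FRA": "token-evil",  # a vantage point behind a hijacked path
}

def orchestrate(expected, agents, quorum=0.8):
    # Fan the check out to a random subset of agents and require that a
    # quorum of them observed the expected value before reporting success.
    sample = random.sample(sorted(agents), k=min(4, len(agents)))
    observed = {agent: OBSERVATIONS[agent] for agent in sample}
    agreeing = sum(value == expected for value in observed.values()) / len(sample)
    return agreeing >= quorum, observed

ok, detail = orchestrate("token-abc", OBSERVATIONS)
print(ok, detail)

Because a localized hijack only poisons some paths, a disagreeing vantage point stands out in the aggregate instead of deciding the outcome on its own.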

This approach can also be generalized to performing multipath queries over DNS records, like Certificate Authority Authorization (CAA) records. CAA records authorize CAs to issue certificates for a domain, so spoofing them to trick unauthorized CAs into issuing certificates is another attack vector that multipath observation prevents.

As we were developing our multipath checker, we were in contact with the Princeton research group that introduced the proof-of-concept (PoC) of certificate mis-issuance through BGP hijacking attacks. Prateek Mittal, coauthor of the Bamboozling Certificate Authorities with BGP paper, wrote:

“Our analysis shows that domain validation from multiple vantage points significantly mitigates the impact of localized BGP attacks. We recommend that all certificate authorities adopt this approach to enhance web security. A particularly attractive feature of Cloudflare’s implementation of this defense is that Cloudflare has access to a vast number of vantage points on the Internet, which significantly enhances the robustness of domain control validation.”

Our DCV checker follows our belief that trust on the Internet must be distributed, and vetted through third-party analysis (like that provided by Cloudflare) to ensure consistency and security. This tool joins our pre-existing Certificate Transparency monitor as a set of services CAs are welcome to use in improving the accountability of certificate issuance.

An Opportunity to Dogfood

Building our multipath DCV checker also allowed us to dogfood multiple Cloudflare products.

The DCV orchestrator as a simple fetcher and aggregator was a fantastic candidate for Cloudflare Workers. We implemented the orchestrator in TypeScript using this post as a guide, and created a typed, reliable orchestrator service that was easy to deploy and iterate on. Hooray that we don’t have to maintain our own dcv-orchestrator  server!

We use Argo Tunnel to allow Cloudflare Workers to contact DCV agents. Argo Tunnel allows us to easily and securely expose our DCV agents to the Workers environment. Since Cloudflare has approximately 175 datacenters running DCV agents, we expose many services through Argo Tunnel, and have had the opportunity to load test Argo Tunnel as a power user with a wide variety of origins. Argo Tunnel readily handled this influx of new origins!

Getting Access to the Multipath DCV Checker

If you and/or your organization are interested in trying our DCV checker, email [email protected] and let us know! We’d love to hear more about how multipath querying and validation bolsters the security of your certificate issuance.

As a new class of BGP and IP spoofing attacks threaten to undermine PKI fundamentals, it’s important that website owners advocate for multipath validation when they are issued certificates. We encourage all CAs to use multipath validation, whether it is Cloudflare’s or their own. Jacob Hoffman-Andrews, Tech Lead, Let’s Encrypt, wrote:

“BGP hijacking is one of the big challenges the web PKI still needs to solve, and we think multipath validation can be part of the solution. We’re testing out our own implementation and we encourage other CAs to pursue multipath as well”

Hopefully in the future, website owners will look at multipath validation support when selecting a CA.

Welcome to Crypto Week 2019

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/welcome-to-crypto-week-2019/

Welcome to Crypto Week 2019

Welcome to Crypto Week 2019

The Internet is an extraordinarily complex and evolving ecosystem. Its constituent protocols range from the ancient and archaic (hello FTP) to the modern and sleek (meet WireGuard), with a fair bit of everything in between. This evolution is ongoing, and as one of the most connected networks on the Internet, Cloudflare has a duty to be a good steward of this ecosystem. We take this responsibility to heart: Cloudflare’s mission is to help build a better Internet. In this spirit, we are very proud to announce Crypto Week 2019.

Every day this week we’ll announce a new project or service that uses modern cryptography to build a more secure, trustworthy Internet. Everything we release this week will be free and immediately useful. This blog is a fun exploration of the themes of the week.

  • Monday: Coming Soon
  • Tuesday: Coming Soon
  • Wednesday: Coming Soon
  • Thursday: Coming Soon
  • Friday: Coming Soon

The Internet of the Future

Many pieces of the Internet in use today were designed in a different era with different assumptions. The Internet’s success is based on strong foundations that support constant reassessment and improvement. Sometimes these improvements require deploying new protocols.

Performing an upgrade on a system as large and decentralized as the Internet can't be done by decree:

  • There are too many economic, cultural, political, and technological factors at play.
  • Changes must be compatible with existing systems and protocols to even be considered for adoption.
  • To gain traction, new protocols must provide tangible improvements for users. Nobody wants to install an update that doesn’t improve their experience!

The last time the Internet had a complete reboot and upgrade was during TCP/IP flag day in 1983. Back then, the Internet (called ARPANET) had fewer than ten thousand hosts! To have an Internet-wide flag day today to switch over to a core new protocol is inconceivable; the scale and diversity of the components involved are far too great. Too much would break. It's challenging enough to deprecate outmoded functionality. In some ways, the open Internet is a victim of its own success. The bigger a system grows and the longer it stays the same, the harder it is to change. The Internet is like a massive barge: it takes forever to steer in a different direction and it's carrying a lot of garbage.

Welcome to Crypto Week 2019
ARPANET, 1983 (Computer History Museum)

As you would expect, many of the warts of the early Internet still remain. Both academic security researchers and real-life adversaries are still finding and exploiting vulnerabilities in the system. Many vulnerabilities are due to the fact that most of the protocols in use on the Internet have a weak notion of trust inherited from the early days. With 50 hosts online, it’s relatively easy to trust everyone, but in a world-scale system, that trust breaks down in fascinating ways. The primary tool to scale trust is cryptography, which helps provide some measure of accountability, though it has its own complexities.

In an ideal world, the Internet would provide a trustworthy substrate for human communication and commerce. Some people naïvely assume that this is the natural direction the evolution of the Internet will follow. However, constant improvement is not a given. It’s possible that the Internet of the future will actually be worse than the Internet today: less open, less secure, less private, less trustworthy. There are strong incentives to weaken the Internet on a fundamental level by Governments, by businesses such as ISPs, and even by the financial institutions entrusted with our personal data.

In a system with as many stakeholders as the Internet, real change requires principled commitment from all invested parties. At Cloudflare, we believe everyone is entitled to an Internet built on a solid foundation of trust. Crypto Week is our way of helping nudge the Internet’s evolution in a more trust-oriented direction. Each announcement this week helps bring the Internet of the future to the present in a tangible way.

Ongoing Internet Upgrades

Before we explore the Internet of the future, let’s explore some of the previous and ongoing attempts to upgrade the Internet’s fundamental protocols.

Routing Security

As we highlighted in last year’s Crypto Week one of the weak links on the Internet is routing. Not all networks are directly connected.

To send data from one place to another, you might have to rely on intermediary networks to pass your data along. A packet sent from one host to another may have to be passed through up to a dozen of these intermediary networks. No single network knows the full path the data will have to take to get to its destination; it only knows which network to pass it to next. The protocol that determines how packets are routed is called the Border Gateway Protocol (BGP). Generally speaking, networks use BGP to announce to each other which addresses they know how to route packets for and (subject to a set of complex rules) these networks share what they learn with their neighbors.

Unfortunately, BGP is completely insecure:

  • Any network can announce any set of addresses to any other network, even addresses they don’t control. This leads to a phenomenon called BGP hijacking, where networks are tricked into sending data to the wrong network.
  • A BGP hijack is most often caused by accidental misconfiguration, but can also be the result of malice on the network operator’s part.
  • During a BGP hijack, a network inappropriately announces a set of addresses to other networks, which results in packets destined for the announced addresses to be routed through the illegitimate network.

Understanding the risk

If the packets represent unencrypted data, this can be a big problem, as it allows the hijacker to read or even change the data.

Mitigating the risk

The Resource Public Key Infrastructure (RPKI) system helps bring some trust to BGP by enabling networks to utilize cryptography to digitally sign network routes with certificates, making BGP hijacking much more difficult.

  • This enables participants of the network to gain assurances about the authenticity of route advertisements. Certificate Transparency (CT) is a tool that enables additional trust for certificate-based systems. Cloudflare operates the Cirrus CT log to support RPKI.

Since we announced our support of RPKI last year, routing security has made big strides. More routes are signed, more networks validate RPKI, and the software ecosystem has matured, but this work is not complete. Most networks are still vulnerable to BGP hijacking. For example, Pakistan knocked YouTube offline with a BGP hijack back in 2008, and could likely do the same today. Adoption here is driven less by providing a benefit to users than by reducing systemic risk, which is not the strongest motivating factor for adopting a complex new technology. Full routing security on the Internet could take decades.

DNS Security

The Domain Name System (DNS) is the phone book of the Internet. Or, for anyone under 25 who doesn’t remember phone books, it’s the system that takes hostnames (like cloudflare.com or facebook.com) and returns the Internet address where that host can be found. For example, as of this publication, www.cloudflare.com is 104.17.209.9 and 104.17.210.9 (IPv4) and 2606:4700::c629:d7a2, 2606:4700::c629:d6a2 (IPv6). Like BGP, DNS is completely insecure. Queries and responses sent unencrypted over the Internet are modifiable by anyone on the path.
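You can watch this mapping happen with nothing more than the Python standard library; the addresses returned will of course vary by time and location:

import socket

# Resolve a hostname to its addresses, the job DNS performs behind the scenes.
for family, _, _, _, sockaddr in socket.getaddrinfo("www.cloudflare.com", 443):
    label = "IPv6" if family == socket.AF_INET6 else "IPv4"
    print(label, sockaddr[0])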

There are many ongoing attempts to add security to DNS, such as:

  • DNSSEC that adds a chain of digital signatures to DNS responses
  • DoT/DoH that wraps DNS queries in the TLS encryption protocol (more on that later)

Both technologies are slowly gaining adoption, but have a long way to go.

Welcome to Crypto Week 2019
DNSSEC-signed responses served by Cloudflare

Welcome to Crypto Week 2019
Cloudflare’s 1.1.1.1 resolver queries are already over 5% DoT/DoH

Just like RPKI, securing DNS comes with a performance cost, making it less attractive to users. However,

The Web

Transport Layer Security (TLS) is a cryptographic protocol that gives two parties the ability to communicate over an encrypted and authenticated channel. TLS protects communications from eavesdroppers even in the event of a BGP hijack. TLS is what puts the “S” in HTTPS. TLS protects web browsing against multiple types of network adversaries.
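For instance, here is a minimal Python sketch that opens a TLS connection and prints what was negotiated; the hostname is just an example, and any HTTPS site would do:

import socket
import ssl

# Open an authenticated, encrypted channel and show what was negotiated.
context = ssl.create_default_context()  # verifies the certificate chain and hostname
with socket.create_connection(("www.cloudflare.com", 443)) as raw:
    with context.wrap_socket(raw, server_hostname="www.cloudflare.com") as tls:
        print(tls.version())                 # e.g. "TLSv1.3"
        print(tls.getpeercert()["subject"])  # identity asserted by the certificate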

Welcome to Crypto Week 2019
Requests hop from network to network over the Internet

Welcome to Crypto Week 2019
For unauthenticated protocols, an attacker on the path can impersonate the server

Welcome to Crypto Week 2019
Attackers can use BGP hijacking to change the path so that communication can be intercepted

Welcome to Crypto Week 2019
Authenticated protocols are protected from interception attacks

The adoption of TLS on the web is partially driven by the fact that:

  • It’s easy and free for websites to get an authentication certificate (via Let’s Encrypt, Universal SSL, etc.)
  • Browsers make TLS adoption appealing to website operators by only supporting new web features such as HTTP/2 over HTTPS.

This has led to the rapid adoption of HTTPS over the last five years.

Welcome to Crypto Week 2019
HTTPS adoption curve (from Google Chrome)‌‌

To further that adoption, TLS recently got an upgrade in TLS 1.3, making it faster and more secure (a combination we love). It’s taking over the Internet!

Welcome to Crypto Week 2019
TLS 1.3 adoption over the last 12 months (from Cloudflare’s perspective)

Despite this fantastic progress in the adoption of security for routing, DNS, and the web, there are still gaps in the trust model of the Internet. There are other things needed to help build the Internet of the future. To find and identify these gaps, we lean on research experts.

Research Farm to Table

Cryptographic security on the Internet is a hot topic and there have been many flaws and issues recently pointed out in academic journals. Researchers often study the vulnerabilities of the past and ask:

  • What other critical components of the Internet have the same flaws?
  • What underlying assumptions can subvert trust in these existing systems?

The answers to these questions help us decide what to tackle next. Some recent research  topics we’ve learned about include:

  • Quantum Computing
  • Attacks on Time Synchronization
  • DNS attacks affecting Certificate issuance
  • Scaling distributed trust

Cloudflare keeps abreast of these developments and we do what we can to bring these new ideas to the Internet at large. In this respect, we’re truly standing on the shoulders of giants.

Future-proofing Internet Cryptography

The new protocols we are currently deploying (RPKI, DNSSEC, DoT/DoH, TLS 1.3) use relatively modern cryptographic algorithms published in the 1970s and 1980s.

  • The security of these algorithms is based on hard mathematical problems in the field of number theory, such as factoring and the elliptic curve discrete logarithm problem.
  • If you can solve the hard problem, you can crack the code. Using a bigger key makes the problem harder, making it more difficult to break, but also slows performance.

Modern Internet protocols typically pick keys large enough to make it infeasible to break with classical computers, but no larger. The sweet spot is around 128 bits of security, meaning a computer has to do approximately 2¹²⁸ operations to break it.

Arjen Lenstra and others created a useful measure of security levels by comparing the amount of energy it takes to break a key to the amount of water you can boil using that much energy. You can think of this as the electric bill you’d get if you run a computer long enough to crack the key.

  • 35-bit security is “Teaspoon security” — It takes about the same amount of energy to break a 35-bit key as it does to boil a teaspoon of water (pretty easy).

Welcome to Crypto Week 2019

  • 65 bits gets you up to “Pool security” – The energy needed to boil the average amount of water in a swimming pool.

Welcome to Crypto Week 2019

  • 105 bits is “Sea Security” – The energy needed to boil the Mediterranean Sea.

Welcome to Crypto Week 2019

  • 114-bits is “Global Security” –  The energy needed to boil all water on Earth.

Welcome to Crypto Week 2019

  • 128-bit security is safely beyond that of Global Security – Anything larger is overkill.
  • 256-bit security corresponds to “Universal Security” – The estimated mass-energy of the observable universe. So, if you ever hear someone suggest 256-bit AES, you know they mean business.

Welcome to Crypto Week 2019

Post-Quantum of Solace

As far as we know, the algorithms we use for cryptography are functionally uncrackable with all known algorithms that classical computers can run. Quantum computers change this calculus. Instead of transistors and bits, a quantum computer uses the effects of quantum mechanics to perform calculations that just aren’t possible with classical computers. As you can imagine, quantum computers are very difficult to build. However, despite large-scale quantum computers not existing quite yet, computer scientists have already developed algorithms that can only run efficiently on quantum computers. Surprisingly, it turns out that with a sufficiently powerful quantum computer, most of the hard mathematical problems we rely on for Internet security become easy!

Although there are still quantum-skeptics out there, some experts estimate that within 15-30 years these large quantum computers will exist, which poses a risk to every security protocol online. Progress is moving quickly; every few months a more powerful quantum computer is announced.

Welcome to Crypto Week 2019

Luckily, there are cryptography algorithms that rely on different hard math problems that seem to be resistant to attack from quantum computers. These math problems form the basis of so-called quantum-resistant (or post-quantum) cryptography algorithms that can run on classical computers. These algorithms can be used as substitutes for most of our current quantum-vulnerable algorithms.

  • Some quantum-resistant algorithms (such as McEliece and Lamport Signatures) were invented decades ago, but there’s a reason they aren’t in common use: they lack some of the nice properties of the algorithms we’re currently using, such as key size and efficiency.
  • Some quantum-resistant algorithms require much larger keys to provide 128-bit security
  • Some are very CPU intensive,
  • And some just haven’t been studied enough to know if they’re secure.

It is possible to swap our current set of quantum-vulnerable algorithms with new quantum-resistant algorithms, but it’s a daunting engineering task. With widely deployed protocols, it is hard to make the transition from something fast and small to something slower, bigger or more complicated without providing concrete user benefits. When exploring new quantum-resistant algorithms, minimizing user impact is of utmost importance to encourage adoption. This is a big deal, because almost all the protocols we use to protect the Internet are vulnerable to quantum computers.

Cryptography-breaking quantum computing is still in the distant future, but we must start the transition to ensure that today’s secure communications are safe from tomorrow’s quantum-powered onlookers; however, that’s not the most timely problem with the Internet. We haven’t addressed that…yet.

Attacking time

Just like DNS, BGP, and HTTP, the Network Time Protocol (NTP) is fundamental to how the Internet works. And like these other protocols, it is completely insecure.

  • Last year, Cloudflare introduced Roughtime as a mechanism for computers to access the current time from a trusted server in an authenticated way.
  • Roughtime is powerful because it provides a way to distribute trust among multiple time servers so that if one server attempts to lie about the time, it will be caught.

However, Roughtime is not exactly a secure drop-in replacement for NTP.

  • Roughtime lacks the complex mechanisms of NTP that allow it to compensate for network latency and yet maintain precise time, especially if the time servers are remote. This leads to imprecise time.
  • Roughtime also involves expensive cryptography that can further reduce precision. This lack of precision makes Roughtime useful for browsers and other systems that need coarse time to validate certificates (most certificates are valid for 3 months or more), but some systems (such as those used for financial trading) require precision to the millisecond or below.

With Roughtime we supported the time protocol of the future, but there are things we can do to help improve the health of security online today.


Some academic researchers, including Aanchal Malhotra of Boston University, have demonstrated a variety of attacks against NTP, including BGP hijacking and off-path User Datagram Protocol (UDP) attacks.

  • Some of these attacks can be avoided by connecting to an NTP server that is close to you on the Internet.
  • However, to bring cryptographic trust to time while maintaining precision, we need something in between NTP and Roughtime.
  • To solve this, it’s natural to turn to the same system of trust that enabled us to patch HTTP and DNS: Web PKI.

Attacking the Web PKI

The Web PKI is similar to the RPKI, but it is far more visible to everyday users, since it secures websites rather than routing tables.

  • If you’ve ever clicked the lock icon on your browser’s address bar, you’ve interacted with it.
  • The PKI relies on a set of trusted organizations called Certificate Authorities (CAs) to issue certificates to websites and web services.
  • Websites use these certificates to authenticate themselves to clients as part of the TLS protocol in HTTPS.
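
A quick way to see the Web PKI at work is to open a TLS connection and look at the certificate the server presents. The sketch below does this with Python’s standard library; cloudflare.com is just an example hostname.

    import socket
    import ssl

    def show_certificate(hostname="cloudflare.com", port=443):
        context = ssl.create_default_context()  # trusts the CAs shipped with the system
        with socket.create_connection((hostname, port)) as raw:
            with context.wrap_socket(raw, server_hostname=hostname) as tls:
                cert = tls.getpeercert()  # already validated against those roots in the handshake
        subject = dict(pair[0] for pair in cert["subject"])
        issuer = dict(pair[0] for pair in cert["issuer"])
        print("subject:", subject.get("commonName"))
        print("issuer: ", issuer.get("organizationName"))
        print("expires:", cert["notAfter"])

    show_certificate()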

[Figure: TLS provides encryption and integrity from the client to the server with the help of a digital certificate]

[Figure: TLS connections are safe against MITM, because the client doesn’t trust the attacker’s certificate]

While we were all patting ourselves on the back for moving the web to HTTPS, some researchers managed to find and exploit a weakness in the system: the process for getting HTTPS certificates.

Certificate Authorities (CAs) use a process called domain control validation (DCV) to ensure that they only issue certificates to website owners who legitimately request them.

  • Some CAs do this validation manually, which is secure, but can’t scale to the total number of websites deployed today.
  • More progressive CAs have automated this validation process, but rely on insecure methods (HTTP and DNS) to validate domain ownership.
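
An automated HTTP-based validation looks roughly like the sketch below. It is loosely modelled on ACME-style HTTP challenges, and the path and helper names are illustrative rather than any CA’s real API: the CA hands the applicant a random token, asks them to publish it on the domain, and then fetches it over plain, unauthenticated HTTP from a single vantage point, precisely the kind of lookup a BGP hijack can redirect.

    import secrets
    import urllib.request

    def issue_challenge():
        # The CA hands the applicant a random token to publish on the domain being validated.
        return secrets.token_urlsafe(32)

    def check_http_challenge(domain, token, timeout=5.0):
        # Fetch the token over plain, unauthenticated HTTP from a single vantage point. If an
        # attacker can hijack the route to the domain, they can answer this request themselves
        # and obtain a certificate for a site they do not control.
        url = f"http://{domain}/.well-known/validation/{token}"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read().decode().strip() == token
        except OSError:
            return False

    token = issue_challenge()
    print(check_http_challenge("example.com", token))  # False unless example.com publishes the token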

Without ubiquitous cryptography in place (DNSSEC may never reach 100% deployment), there is no completely secure way to bootstrap this system. So, let’s look at how to distribute trust using other methods.

One tool at our disposal is the distributed nature of the Cloudflare network.

Cloudflare is global. We have locations all over the world connected to dozens of networks. That means we have different vantage points, resulting in different ways to traverse networks. This diversity can prove an advantage when dealing with BGP hijacking, since an attacker would have to hijack multiple routes from multiple locations to affect all the traffic between Cloudflare and other distributed parts of the Internet. The natural diversity of the network raises the cost of the attacks.

Using a distributed set of connections to the Internet as a quorum is a powerful way to distribute trust, with or without cryptography.
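
A minimal sketch of that quorum idea in Python follows. The vantage-point names and the toy check function are made up; a real deployment would run something like the DCV lookup from the earlier sketch through each location and compare the answers.

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def validate_from_quorum(vantage_points, check, domain, token, threshold=None):
        # Run the same (individually spoofable) check from several vantage points and accept
        # it only if a majority of them saw the expected answer.
        if threshold is None:
            threshold = len(vantage_points) // 2 + 1
        with ThreadPoolExecutor(max_workers=len(vantage_points)) as pool:
            results = list(pool.map(lambda vp: check(vp, domain, token), vantage_points))
        return Counter(results)[True] >= threshold

    # Toy demo: two vantage points see the published token, one sits behind a hijacked route.
    views = {"ams": True, "sin": True, "iad": False}
    ok = validate_from_quorum(list(views), lambda vp, d, t: views[vp], "example.com", "token123")
    print(ok)  # True: the hijacker fooled one location, not the quorum

An attacker now has to hijack the route as seen from most of those locations simultaneously, which is far harder than fooling a single validation server.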

Distributed Trust

This idea of distributing the source of trust is powerful. Last year we announced the Distributed Web Gateway, which:

  • Enables users to access content on the InterPlanetary File System (IPFS), a network structured to reduce the trust placed in any single party.
  • Ensures that even if a participant of the network is compromised, it can’t be used to distribute compromised content, because the network is content-addressed (see the sketch after this list).
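
To make the content-addressing point concrete, here is a minimal sketch in Python. It is a deliberate simplification (IPFS really uses multihash-encoded CIDs over chunked data structures), but the principle is the same: the address of a piece of content is derived from its bytes, so a node can refuse to serve the content, but it cannot silently substitute different bytes.

    import hashlib

    def address_of(content: bytes) -> str:
        # Simplified content address: the SHA-256 digest of the bytes themselves.
        return hashlib.sha256(content).hexdigest()

    def fetch(address: str, node) -> bytes:
        # `node` stands in for any untrusted peer claiming to have the content.
        content = node(address)
        if address_of(content) != address:
            raise ValueError("content does not match its address: tampering detected")
        return content

    original = b"hello, distributed web"
    addr = address_of(original)
    honest_node = lambda a: original
    evil_node = lambda a: b"something else entirely"

    print(fetch(addr, honest_node))  # b'hello, distributed web'
    # fetch(addr, evil_node) would raise ValueError: the lie is detected, not served
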
However, content-based addressing is not the only way to distribute trust between multiple independent parties.

Another way to distribute trust is to literally split authority between multiple independent parties. We’ve explored this topic before. In the context of Internet services, this means ensuring that no single server can authenticate itself to a client on its own. For example,

  • In HTTPS the server’s private key is the lynchpin of its security. Compromising the owner of the private key (by hook or by crook) gives an attacker the ability to impersonate (spoof) that service. This single point of failure puts services at risk. You can mitigate this risk by distributing the authority to authenticate the service between multiple independently-operated services.
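
One classic building block for splitting authority is Shamir secret sharing, sketched below. This is an illustration of the principle rather than the mechanism used by any particular product: the key becomes n shares, any k of which reconstruct it, while fewer than k reveal nothing.

    import secrets

    PRIME = 2**127 - 1  # all arithmetic happens in this prime field; the secret must be smaller

    def split_secret(secret, n, k):
        # Hide the secret as the constant term of a random degree (k - 1) polynomial and hand
        # out n points on it; any k points determine the polynomial, any k - 1 reveal nothing.
        coefficients = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
        def poly(x):
            return sum(c * pow(x, i, PRIME) for i, c in enumerate(coefficients)) % PRIME
        return [(x, poly(x)) for x in range(1, n + 1)]

    def recover_secret(shares):
        # Lagrange interpolation at x = 0 over the prime field.
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            numerator, denominator = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    numerator = (numerator * -xj) % PRIME
                    denominator = (denominator * (xi - xj)) % PRIME
            secret = (secret + yi * numerator * pow(denominator, -1, PRIME)) % PRIME
        return secret

    key = secrets.randbelow(PRIME)
    shares = split_secret(key, n=5, k=3)
    assert recover_secret(shares[:3]) == key   # any three shares rebuild the key
    assert recover_secret(shares[1:4]) == key
    assert recover_secret(shares[:2]) != key   # two shares alone are (almost surely) useless

Real deployments go further, for example by using threshold signatures so the key is never reassembled in one place, but the goal is the same: removing the single point of failure.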

[Figure: TLS doesn’t protect against server compromise]

[Figure: With distributed trust, multiple parties combine to protect the connection]

[Figure: An attacker that has compromised one of the servers cannot break the security of the system]

The Internet barge is old and slow, and we’ve only been able to improve it through the meticulous process of patching it piece by piece. Another option is to build new secure systems on top of this insecure foundation. IPFS is doing this, and IPFS is not alone in its design. There has been more research into secure systems with decentralized trust in the last ten years than ever before.

The result is radical new protocols and designs that use exotic new algorithms. These protocols do not supplant those at the core of the Internet (like TCP/IP), but instead, they sit on top of the existing Internet infrastructure, enabling new applications, much like HTTP did for the web.

Gaining Traction

Some of the most innovative technical projects were considered failures because they couldn’t attract users. New technology has to bring tangible benefits to its users to sustain it: useful functionality, content, and a decent user experience. Distributed projects such as IPFS are gaining popularity, but have not found mass adoption. This is a chicken-and-egg problem. New protocols have a high barrier to entry (users have to install new software), and because of the small audience, there is less incentive to create compelling content. Decentralization and distributed trust are nice security features to have, but they are not products. Users still need to get some benefit out of using the platform.

The web itself is an example of a system that broke this cycle. In 1992 the web was hardly a cornucopia of awesomeness. What helped drive the dominance of the web was its users.

  • The growth of the user base meant more incentive for people to build services, and the availability of more services attracted more users. It was a virtuous cycle.
  • It’s hard for a platform to gain momentum, but once the cycle starts, a flywheel effect kicks in to help the platform grow.

The Distributed Web Gateway project Cloudflare launched last year in Crypto Week is our way of exploring what happens if we try to kickstart that flywheel. By providing a secure, reliable, and fast interface from the classic web with its two billion users to the content on the distributed web, we give the fledgling ecosystem an audience.

  • If the advantages provided by building on the distributed web are appealing to users, then the larger audience will help these services grow in popularity.
  • This is somewhat reminiscent of how IPv6 gained adoption. It started as a niche technology only accessible using IPv4-to-IPv6 translation services.
  • IPv6 adoption has now grown so much that it is becoming a requirement for new services. For example, Apple is requiring that all apps work in IPv6-only contexts.

Eventually, as user-side implementations of distributed web technologies improve, people may move to using the distributed web natively rather than through an HTTP gateway. Or they may not! By leveraging Cloudflare’s global network to give users access to new technologies based on distributed trust, we give these technologies a better chance at gaining adoption.

Happy Crypto Week

At Cloudflare, we always support new technologies that help make the Internet better. Part of helping make a better Internet is scaling the systems of trust that underpin web browsing and protecting them from attack. We provide the tools to create better systems of assurance with fewer points of vulnerability. We work with academic security researchers to get a vision of the future and engineer away vulnerabilities before they can become widespread. It’s a constant journey.

Cloudflare knows that none of this is possible without the work of researchers. From award-winning researchers publishing papers in top journals to clever hobbyists writing blog posts, dedicated and curious people are moving the state of the world’s knowledge forward. However, the push to publish new and novel research sometimes holds researchers back from committing enough time and resources to fully realize their ideas. Great research can be powerful on its own, but it can have an even broader impact when combined with practical applications. We relish the opportunity to stand on the shoulders of these giants and use our engineering know-how and global reach to expand on their work to help build a better Internet.

So, to all of you dedicated researchers, thank you for your work! Crypto Week is yours as much as ours. If you’re working on something interesting and you want help to bring the results of your research to the broader Internet, please contact us at [email protected]. We want to help you realize your dream of making the Internet safe and trustworthy.

Massive Ad Fraud Scheme Relied on BGP Hijacking

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/12/massive_ad_frau.html

This is a really interesting story of an ad fraud scheme that relied on hijacking the Border Gateway Protocol:

Members of 3ve (pronounced “eve”) used their large reservoir of trusted IP addresses to conceal a fraud that otherwise would have been easy for advertisers to detect. The scheme employed a thousand servers hosted inside data centers to impersonate real human beings who purportedly “viewed” ads that were hosted on bogus pages run by the scammers themselves — who then received a check from ad networks for these billions of fake ad impressions. Normally, a scam of this magnitude coming from such a small pool of server-hosted bots would have stuck out to defrauded advertisers. To camouflage the scam, 3ve operators funneled the servers’ fraudulent page requests through millions of compromised IP addresses.

About one million of those IP addresses belonged to computers, primarily based in the US and the UK, that attackers had infected with botnet software strains known as Boaxxe and Kovter. But at the scale employed by 3ve, not even that number of IP addresses was enough. And that’s where the BGP hijacking came in. The hijacking gave 3ve a nearly limitless supply of high-value IP addresses. Combined with the botnets, the ruse made it seem like millions of real people from some of the most affluent parts of the world were viewing the ads.

Lots of details in the article.

An aphorism I often use in my talks is “expertise flows downhill: today’s top-secret NSA programs become tomorrow’s PhD theses and the next day’s hacking tools.” This is an example of that. BGP hacking — known as “traffic shaping” inside the NSA — has long been a tool of national intelligence agencies. Now it is being used by cybercriminals.

EDITED TO ADD (1/2): Classified NSA presentation on “network shaping.” I don’t know if there is a difference inside the NSA between the two terms.