Tag Archives: Routing Security

How we detect route leaks and our new Cloudflare Radar route leak service

Post Syndicated from Mingwei Zhang original https://blog.cloudflare.com/route-leak-detection-with-cloudflare-radar/

How we detect route leaks and our new Cloudflare Radar route leak service

How we detect route leaks and our new Cloudflare Radar route leak service

Today we’re introducing Cloudflare Radar’s route leak data and API so that anyone can get information about route leaks across the Internet. We’ve built a comprehensive system that takes in data from public sources and Cloudflare’s view of the Internet drawn from our massive global network. The system is now feeding route leak data on Cloudflare Radar’s ASN pages and via the API.

This blog post is in two parts. There’s a discussion of BGP and route leaks followed by details of our route leak detection system and how it feeds Cloudflare Radar.

About BGP and route leaks

Inter-domain routing, i.e., exchanging reachability information among networks, is critical to the wellness and performance of the Internet. The Border Gateway Protocol (BGP) is the de facto routing protocol that exchanges routing information among organizations and networks. At its core, BGP assumes the information being exchanged is genuine and trust-worthy, which unfortunately is no longer a valid assumption on the current Internet. In many cases, networks can make mistakes or intentionally lie about the reachability information and propagate that to the rest of the Internet. Such incidents can cause significant disruptions of the normal operations of the Internet. One type of such disruptive incident is route leaks.

We consider route leaks as the propagation of routing announcements beyond their intended scope (RFC7908). Route leaks can cause significant disruption affecting millions of Internet users, as we have seen in many past notable incidents. For example, in June 2016 a misconfiguration in a small network in Pennsylvania, US (AS396531 – Allegheny Technologies Inc) accidentally leaked a Cloudflare prefix to Verizon, which proceeded to propagate the misconfigured route to the rest of its peers and customers. As a result, the traffic of a large portion of the Internet was squeezed through the limited-capacity links of a small network. The resulting congestion caused most of Cloudflare traffic to and from the affected IP range to be dropped.

A similar incident in November 2018 caused widespread unavailability of Google services when a Nigerian ISP (AS37282 – Mainone) accidentally leaked a large number of Google IP prefixes to its peers and providers violating the valley-free principle.

These incidents illustrate not only that route leaks can be very impactful, but also the snowball effects that misconfigurations in small regional networks can have on the global Internet.

Despite the criticality of detecting and rectifying route leaks promptly, they are often detected only when users start reporting the noticeable effects of the leaks. The challenge with detecting and preventing route leaks stems from the fact that AS business relationships and BGP routing policies are generally undisclosed, and the affected network is often remote to the root of the route leak.

In the past few years, solutions have been proposed to prevent the propagation of leaked routes. Such proposals include RFC9234 and ASPA, which extends the BGP to annotate sessions with the relationship type between the two connected AS networks to enable the detention and prevention of route leaks.

An alternative proposal to implement similar signaling of BGP roles is through the use of BGP Communities; a transitive attribute used to encode metadata in BGP announcements. While these directions are promising in the long term, they are still in very preliminary stages and are not expected to be adopted at scale soon.

At Cloudflare, we have developed a system to detect route leak events automatically and send notifications to multiple channels for visibility. As we continue our efforts to bring more relevant data to the public, we are happy to announce that we are starting an open data API for our route leak detection results today and integrate results to Cloudflare Radar pages.

How we detect route leaks and our new Cloudflare Radar route leak service

Route leak definition and types

Before we jump into how we design our systems, we will first do a quick primer on what a route leak is, and why it is important to detect it.

We refer to the published IETF RFC7908 document “Problem Definition and Classification of BGP Route Leaks” to define route leaks.

> A route leak is the propagation of routing announcement(s) beyond their intended scope.

The intended scope is often concretely defined as inter-domain routing policies based on business relationships between Autonomous Systems (ASes). These business relationships are broadly classified into four categories: customers, transit providers, peers and siblings, although more complex arrangements are possible.

In a customer-provider relationship the customer AS has an agreement with another network to transit its traffic to the global routing table. In a peer-to-peer relationship two ASes agree to free bilateral traffic exchange, but only between their own IPs and the IPs of their customers. Finally, ASes that belong under the same administrative entity are considered siblings, and their traffic exchange is often unrestricted.  The image below illustrates how the three main relationship types translate to export policies.

How we detect route leaks and our new Cloudflare Radar route leak service

By categorizing the types of AS-level relationships and their implications on the propagation of BGP routes, we can define multiple phases of a prefix origination announcements during propagation:

  • upward: all path segments during this phase are customer to provider
  • peering: one peer-peer path segment
  • downward: all path segments during this phase are provider to customer

An AS path that follows valley-free routing principle will have upward, peering, downward phases, all optional but have to be in that order. Here is an example of an AS path that conforms with valley-free routing.

How we detect route leaks and our new Cloudflare Radar route leak service

In RFC7908, “Problem Definition and Classification of BGP Route Leaks”, the authors define six types of route leaks, and we refer to these definitions in our system design. Here are illustrations of each of the route leak types.

Type 1: Hairpin Turn with Full Prefix

> A multihomed AS learns a route from one upstream ISP and simply propagates it to another upstream ISP (the turn essentially resembling a hairpin).  Neither the prefix nor the AS path in the update is altered.

An AS path that contains a provider-customer and customer-provider segment is considered a type 1 leak. The following example: AS4 → AS5 → AS6 forms a type 1 leak.

How we detect route leaks and our new Cloudflare Radar route leak service

Type 1 is the most recognized type of route leaks and is very impactful. In many cases, a customer route is preferable to a peer or a provider route. In this example, AS6 will likely prefer sending traffic via AS5 instead of its other peer or provider routes, causing AS5 to unintentionally become a transit provider. This can significantly affect the performance of the traffic related to the leaked prefix or cause outages if the leaking AS is not provisioned to handle a large influx of traffic.

In June 2015, Telekom Malaysia (AS4788), a regional ISP, leaked over 170,000 routes learned from its providers and peers to its other provider Level3 (AS3549, now Lumin). Level3 accepted the routes and further propagated them to its downstream networks, which in turn caused significant network issues globally.

Type 2: Lateral ISP-ISP-ISP Leak

Type 2 leak is defined as propagating routes obtained from one peer to another peer, creating two or more consecutive peer-to-peer path segments.

Here is an example: AS3 → AS4 → AS5 forms a  type 2 leak.

How we detect route leaks and our new Cloudflare Radar route leak service

One example of such leaks is more than three very large networks appearing in sequence. Very large networks (such as Verizon and Lumin) do not purchase transit from each other, and having more than three such networks on the path in sequence is often an indication of a route leak.

However, in the real world, it is not unusual to see multiple small peering networks exchanging routes and passing on to each other. Legit business reasons exist for having this type of network path. We are less concerned about this type of route leak as compared to type 1.

Type 3 and 4: Provider routes to peer; peer routes to provider

These two types involve propagating routes from a provider or a peer not to a customer, but to another peer or provider. Here are the illustrations of the two types of leaks:

How we detect route leaks and our new Cloudflare Radar route leak service
How we detect route leaks and our new Cloudflare Radar route leak service

As in the previously mentioned example, a Nigerian ISP who peers with Google accidentally leaked its route to its provider AS4809, and thus generated a type 4 route leak. Because routes via customers are usually preferred to others, the large provider (AS4809) rerouted its traffic to Google via its customer, i.e. the leaking ASN, overwhelmed the small ISP and took down Google for over one hour.

Route leak summary

So far, we have looked at the four types of route leaks defined in RFC7908. The common thread of the four types of route leaks is that they’re all defined using AS-relationships, i.e., peers, customers, and providers. We summarize the types of leaks by categorizing the AS path propagation based on where the routes are learned from and propagate to. The results are shown in the following table.

Routes from / propagates to To provider To peer To customer
From provider Type 1 Type 3 Normal
From peer Type 4 Type 2 Normal
From customer Normal Normal Normal

We can summarize the whole table into one single rule: routes obtained from a non-customer AS can only be propagated to customers.

Note: Type 5 and type 6 route leaks are defined as prefix re-origination and announcing of private prefixes. Type 5 is more closely related to prefix hijackings, which we plan to expand our system to as the next steps, while type 6 leaks are outside the scope of this work. Interested readers can refer to sections 3.5 and 3.6 of RFC7908 for more information.

The Cloudflare Radar route leak system

Now that we know what a  route leak is, let’s talk about how we designed our route leak detection system.

From a very high level, we compartmentalize our system into three different components:

  1. Raw data collection module: responsible for gathering BGP data from multiple sources and providing BGP message stream to downstream consumers.
  2. Leak detection module: responsible for determining whether a given AS-level path is a route leak, estimate the confidence level of the assessment, aggregating and providing all external evidence needed for further analysis of the event.
  3. Storage and notification module: responsible for providing access to detected route leak events and sending out notifications to relevant parties. This could also include building a dashboard for easy access and search of the historical events and providing the user interface for high-level analysis of the event.

Data collection module

There are three types of data input we take into consideration:

  1. Historical: BGP archive files for some time range in the past
    a. RouteViews and RIPE RIS BGP archives
  2. Semi-real-time: BGP archive files as soon as they become available, with a 10-30 minute delay.
    a. RouteViews and RIPE RIS archives with data broker that checks new files periodically (e.g. BGPKIT Broker)
  3. Real-time: true real-time data sources
    a. RIPE RIS Live
    b. Cloudflare internal BGP sources

How we detect route leaks and our new Cloudflare Radar route leak service

For the current version, we use the semi-real-time data source for the detection system, i.e., the BGP updates files from RouteViews and RIPE RIS. For data completeness, we process data from all public collectors from these two projects (a total of 63 collectors and over 2,400 collector peers) and implement a pipeline that’s capable of handling the BGP data processing as the data files become available.

For data files indexing and processing, we deployed an on-premises BGPKIT Broker instance with Kafka feature enabled for message passing, and a custom concurrent MRT data processing pipeline based on BGPKIT Parser Rust SDK. The data collection module processes MRT files and converts results into a BGP messages stream at over two billion BGP messages per day (roughly 30,000 messages per second).

Route leak detection

The route leak detection module works at the level of individual BGP announcements. The detection component investigates one BGP message at a time, and estimates how likely a given BGP message is a result of a route leak event.

How we detect route leaks and our new Cloudflare Radar route leak service

We base our detection algorithm mainly on the valley-free model, which we believe can capture most of the notable route leak incidents. As mentioned previously, the key to having low false positives for detecting route leaks with the valley-free model is to have accurate AS-level relationships. While those relationship types are not publicized by every AS, there have been over two decades of research on the inference of the relationship types using publicly observed BGP data.

While state-of-the-art relationship inference algorithms have been shown to be highly accurate, even a small margin of errors can still incur inaccuracies in the detection of route leaks. To alleviate such artifacts, we synthesize multiple data sources for inferring AS-level relationships, including CAIDA/UCSD’s AS relationship data and our in-house built AS relationship dataset. Building on top of the two AS-level relationships, we create a much more granular dataset at the per-prefix and per-peer levels. The improved dataset allows us to answer the question like what is the relationship between AS1 and AS2 with respect to prefix P observed by collector peer X. This eliminates much of the ambiguity for cases where networks have multiple different relationships based on prefixes and geo-locations, and thus helps us reduce the number of false positives in the system. Besides the AS-relationships datasets, we also apply the AS Hegemony dataset from IHR IIJ to further reduce false positives.

Route leak storage and presentation

After processing each BGP message, we store the generated route leak entries in a database for long-term storage and exploration. We also aggregate individual route leak BGP announcements and group relevant leaks from the same leak ASN within a short period together into route-leak events. The route leak events will then be available for consumption by different downstream applications like web UIs, an API, or alerts.

How we detect route leaks and our new Cloudflare Radar route leak service

Route leaks on Cloudflare Radar

At Cloudflare, we aim to help build a better Internet, and that includes sharing our efforts on monitoring and securing Internet routing. Today, we are releasing our route leak detection system as public beta.

Starting today, users going to the Cloudflare Radar ASN pages will now find the list of route leaks that affect that AS. We consider that an AS is being affected when the leaker AS is one hop away from it in any direction, before or after.

The Cloudflare Radar ASN page is directly accessible via https://radar.cloudflare.com/as{ASN}. For example, one can navigate to https://radar.cloudflare.com/as174 to view the overview page for Cogent AS174. ASN pages now show a dedicated card for route leaks detected relevant to the current ASN within the selected time range.

How we detect route leaks and our new Cloudflare Radar route leak service

Users can also start using our public data API to lookup route leak events with regards to any given ASN.  Our API supports filtering route leak results by time ranges, and ASes involved. Here is a screenshot of the route leak events API documentation page on the newly updated API docs site.

How we detect route leaks and our new Cloudflare Radar route leak service

More to come on routing security

There is a lot more we are planning to do with route-leak detection. More features like a global view page, route leak notifications, more advanced APIs, custom automations scripts, and historical archive datasets will begin to ship on Cloudflare Radar over time. Your feedback and suggestions are also very important for us to continue improving on our detection results and serve better data to the public.

Furthermore, we will continue to expand our work on other important topics of Internet routing security, including global BGP hijack detection (not limited to our customer networks), RPKI validation monitoring, open-sourcing tools and architecture designs, and centralized routing security web gateway. Our goal is to provide the best data and tools for routing security to the communities so that we can build a better and more secure Internet together.

In the meantime, we opened a Radar room on our Developers Discord Server. Feel free to join and talk to us; the team is eager to receive feedback and answer questions.

Visit Cloudflare Radar for more Internet insights. You can also follow us on Twitter for more Radar updates.