Tag Archives: bgp

Why BGP communities are better than AS-path prepends

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/prepends-considered-harmful/


The Internet, in its purest form, is a loosely connected graph of independent networks (also called Autonomous Systems (AS for short)). These networks use a signaling protocol called BGP (Border Gateway Protocol) to inform their neighbors (also known as peers) about the reachability of IP prefixes (a group of IP addresses) in and through their network. Part of this exchange contains useful metadata about the IP prefix that is used to inform network routing decisions. One example of the metadata is the full AS-path, which consists of the different autonomous systems an IP packet needs to pass through to reach its destination.

As we all want our packets to get to their destination as fast as possible, selecting the shortest AS-path for a given prefix is a good idea. This is where something called prepending comes into play.

Routing on the Internet, a primer

Let’s briefly talk about how the Internet works at its most fundamental level, before we dive into some nitty-gritty details.

The Internet is, at its core, a massively interconnected network of thousands of networks. Each network owns two things that are critical:

1. An Autonomous System Number (ASN): a 32-bit integer that uniquely identifies a network. For example, one of the Cloudflare ASNs (we have multiple) is 13335.

2. IP prefixes: An IP prefix is a range of IP addresses, bundled together in powers of two: In the IPv4 space, two addresses form a /31 prefix, four form a /30, and so on, all the way up to /0, which is shorthand for “all IPv4 prefixes”. The same applies for IPv6, but instead of aggregating 32 bits at most, you can aggregate up to 128 bits. The figure below shows this relationship between IP prefixes, in reverse — a /24 contains two /25s, each of which contains two /26s, and so on.

[Figure: a /24 contains two /25s, each of which contains two /26s, and so on]
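
To make that prefix arithmetic concrete, here is a minimal Python sketch using only the standard library ipaddress module; the prefix values are arbitrary examples, not anything from the post.

import ipaddress

# A /24 holds 256 addresses; splitting it yields two /25s of 128 addresses each.
block = ipaddress.ip_network("192.0.2.0/24")
halves = list(block.subnets(prefixlen_diff=1))

print(block, "->", block.num_addresses, "addresses")    # 192.0.2.0/24 -> 256 addresses
for half in halves:
    print(half, "->", half.num_addresses, "addresses")  # 192.0.2.0/25 and 192.0.2.128/25

# The same logic applies to IPv6, just with up to 128 bits of prefix length.
v6 = ipaddress.ip_network("2001:db8::/32")
print(v6.num_addresses)                                 # 2**96 addresses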

To communicate on the Internet, you must be able to reach your destination, and that’s where routing protocols come into play. They enable each node on the Internet to know where to send your message (and for the receiver to send a message back).


As mentioned earlier, these destinations are identified by IP addresses, and contiguous ranges of IP addresses are expressed as IP prefixes. We use IP prefixes for routing as an efficiency optimization: Keeping track of where to go for four billion (2^32) IP addresses in IPv4 would be incredibly complex, and require a lot of resources. Sticking to prefixes reduces that number down to about one million instead.

Now recall that Autonomous Systems are independently operated and controlled. In the Internet’s network of networks, how do I tell Source A in some other network that there is an available path to get to Destination B in (or through) my network? In comes BGP! BGP is the Border Gateway Protocol, and it is used to signal reachability information. Signal messages generated by the source ASN are referred to as ‘announcements’ because they declare to the Internet that IP addresses in the prefix are online and reachable.

[Figure: Source A learns of two different paths to Destination B]

Have a look at the figure above. Source A should now know how to get to Destination B through 2 different networks!

This is what an actual BGP message would look like:

BGP Message
    Type: UPDATE Message
    Path Attributes:
        Path Attribute - Origin: IGP
        Path Attribute - AS_PATH: 64500 64496
        Path Attribute - NEXT_HOP: 198.51.100.1
        Path Attribute - COMMUNITIES: 64500:13335
        Path Attribute - Multi Exit Discriminator (MED): 100
    Network Layer Reachability Information (NLRI):
        192.0.2.0/24

As you can see, BGP messages contain not just the IP prefix (the NLRI bit) and the path, but also a bunch of other metadata that provides additional information about the path. Other fields include communities (more on those later), as well as the MED and the origin code. The MED is a suggestion to other directly connected networks on which path should be taken if multiple options are available, and the lowest value wins. The origin code can be one of three values: IGP, EGP or Incomplete. IGP will be set if you originate the prefix through BGP, EGP is no longer used (it’s an ancient routing protocol), and Incomplete is set when you distribute a prefix into BGP from another routing protocol (like IS-IS or OSPF).

Now that source A knows how to get to Destination B through two different networks, let’s talk about traffic engineering!

Traffic engineering

Traffic engineering is a critical part of the day-to-day management of any network. Just like in the physical world, detours can be put in place by operators to optimize the traffic flows into (inbound) and out of (outbound) their network. Outbound traffic engineering is significantly easier than inbound traffic engineering, because operators can choose which neighboring network to send traffic to, and can even prioritize some traffic over other traffic. In contrast, inbound traffic engineering requires influencing a network that is operated by someone else entirely. The autonomy and self-governance of a network is paramount, so operators use available tools to inform or shape inbound packet flows from other networks. The understanding and use of those tools is complex, and can be a challenge.

The available set of traffic engineering tools, both in- and outbound, rely on manipulating attributes (metadata) of a given route. As we’re talking about traffic engineering between independent networks, we’ll be manipulating the attributes of an EBGP-learned route. BGP can be split into two categories:

  1. EBGP: BGP communication between two different ASNs
  2. IBGP: BGP communication within the same ASN.

While the protocol is the same, certain attributes can be exchanged on an IBGP session that aren’t exchanged on an EBGP session. One of those is local-preference. More on that in a moment.

BGP best path selection

When a network is connected to multiple other networks and service providers, it can receive path information to the same IP prefix from many of those networks, each with slightly different attributes. It is then up to the receiving network of that information to use a BGP best path selection algorithm to pick the “best” prefix (and route), and use this to forward IP traffic. I’ve put “best” in quotation marks, as best is a subjective requirement. “Best” is frequently the shortest, but what can be best for my network might not be the best outcome for another network.

BGP will consider multiple prefix attributes when filtering through the received options. However, rather than combine all those attributes into a single selection criterion, BGP best path selection uses the attributes in tiers — at any tier, if the available attributes are sufficient to choose the best path, then the algorithm terminates with that choice.

The BGP best path selection algorithm is extensive, containing 15 discrete steps to select the best available path for a given prefix. Given the numerous steps, it’s in the interest of the network to decide the best path as early as possible. The first four steps are the most used and influential, and are depicted in the figure below as sieves.
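
As an illustration of that tiered (“sieve”) evaluation, here is a hedged Python sketch of just the first four steps (local preference, AS-path length, origin code, MED). Real implementations evaluate many more steps and tie-breakers, and the route fields below are simplified assumptions, not a router's actual data structures.

from dataclasses import dataclass

# Lower origin value is better: IGP (0) < EGP (1) < Incomplete (2).
ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}

@dataclass
class Route:
    local_pref: int   # set locally on import, higher wins
    as_path: list     # list of ASNs, shorter wins
    origin: str       # "IGP", "EGP" or "INCOMPLETE", lower rank wins
    med: int          # suggestion from the neighbor, lower wins

def best_path(routes):
    """Pick the best route using the first four tiers; each tier only
    runs on the candidates that survived the previous one."""
    candidates = list(routes)
    for key, higher_is_better in (
        (lambda r: r.local_pref, True),             # 1. highest local preference
        (lambda r: len(r.as_path), False),          # 2. shortest AS-path
        (lambda r: ORIGIN_RANK[r.origin], False),   # 3. lowest origin code
        (lambda r: r.med, False),                   # 4. lowest MED
    ):
        best = max(candidates, key=key) if higher_is_better else min(candidates, key=key)
        candidates = [r for r in candidates if key(r) == key(best)]
        if len(candidates) == 1:
            return candidates[0]
    return candidates[0]   # remaining tie-breakers omitted in this sketch

# A prepended (longer) path only loses if local preference ties first.
a = Route(local_pref=250, as_path=[64496, 64496, 64496], origin="IGP", med=0)
b = Route(local_pref=100, as_path=[64500], origin="IGP", med=0)
assert best_path([a, b]) is a   # higher local-pref wins despite the longer AS-path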

[Figure: the first four steps of BGP best path selection, depicted as sieves]

Picking the shortest path possible is usually a good idea, which is why “AS-path length” is a step executed early on in the algorithm. However, looking at the figure above, “AS-path length” appears second, despite being the attribute that identifies the shortest path. So let’s talk about the first step: local preference.

Local preference
Local preference is an operator favorite because it allows them to handpick a route+path combination of their choice. It’s the first attribute in the algorithm because it is unique for any given route+neighbor+AS-path combination.

A network sets the local preference on import of a route (having learned about the route from a neighbor network). Local preference is a non-transitive property, meaning it is an attribute that is never sent in an EBGP message to other networks. This intrinsically means, for example, that the operator of AS 64496 can’t set the local preference of routes to their own (or transiting) IP prefixes inside neighboring AS 64511. The inability to do so is partially why inbound traffic engineering through EBGP is so difficult.

Prepending artificially increases AS-path length
Since no network is able to directly set the local preference for a prefix inside another network, the first opportunity to influence other networks’ choices is modifying the AS-path. If the next hops are valid, and the local preference for all the different paths for a given route are the same, modifying the AS-path is an obvious option to change the path traffic will take towards your network. In a BGP message, prepending looks like this:

BEFORE:

BGP Message
    Type: UPDATE Message
    Path Attributes:
        Path Attribute - Origin: IGP
        Path Attribute - AS_PATH: 64500 64496
        Path Attribute - NEXT_HOP: 198.51.100.1
        Path Attribute - COMMUNITIES: 64500:13335
        Path Attribute - Multi Exit Discriminator (MED): 100
    Network Layer Reachability Information (NLRI):
        192.0.2.0/24

AFTER:

BGP Message
    Type: UPDATE Message
    Path Attributes:
        Path Attribute - Origin: IGP
        Path Attribute - AS_PATH: 64500 64496 64496
        Path Attribute - NEXT_HOP: 198.51.100.1
        Path Attribute - COMMUNITIES: 64500:13335
        Path Attribute - Multi Exit Discriminator (MED): 100
    Network Layer Reachability Information (NLRI):
        192.0.2.0/24

Specifically, operators can do AS-path prepending. When doing AS-path prepending, an operator adds additional autonomous systems to the path (usually the operator uses their own AS, but that’s not enforced in the protocol). This way, an AS-path can go from a length of 1 to a length of 255. As the length has now increased dramatically, that specific path for the route will not be chosen. By changing the AS-path advertised to different peers, an operator can control the traffic flows coming into their network.

Unfortunately, prepending has a catch: To be the deciding factor, all the other attributes need to be equal. This is rarely true, especially in large networks that are able to choose from many possible routes to a destination.

Business Policy Engine

BGP is colloquially also referred to as a Business Policy Engine: it does not select the best path from a performance point of view; instead, and more often than not, it will select the best path from a business point of view. The business criteria could be anything from investment (port) efficiency to increased revenue, and more. This may sound strange but, believe it or not, this is what BGP is designed to do! The power (and complexity) of BGP is that it enables a network operator to make choices according to the operator’s needs, contracts, and policies, many of which cannot be reflected by conventional notions of engineering performance.

Different local preferences

A lot of networks (including Cloudflare) assign a local preference depending on the type of connection used to send us the routes. A higher value is a higher preference. For example, routes learned from transit network connections will get a lower local preference of 100 because they are the most costly to use; backbone-learned routes will be 150, Internet exchange (IX) routes get 200, and lastly private interconnect (PNI) routes get 250. This means that for egress (outbound) traffic, the Cloudflare network, by default, will prefer a PNI-learned route, even if a shorter AS-path is available through an IX or transit neighbor.
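
In policy terms, this kind of import preference boils down to a simple lookup on the session type. The values below mirror the example numbers in the paragraph above; the code itself is just an illustrative sketch, not Cloudflare's actual policy configuration.

# Example local preference values per session type; higher is preferred.
LOCAL_PREF_BY_SESSION = {
    "transit": 100,     # most costly, least preferred
    "backbone": 150,
    "ix": 200,          # Internet exchange peering
    "pni": 250,         # private interconnect, most preferred
}

def import_local_pref(session_type: str) -> int:
    # Unknown session types fall back to the most conservative value.
    return LOCAL_PREF_BY_SESSION.get(session_type, 100)

print(import_local_pref("pni"))   # 250: a PNI route wins even with a longer AS-path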

Part of the reason a PNI is preferred over an IX is reliability: there is no third-party switching platform involved that is out of our control, which is important because we operate on the assumption that all hardware can and will eventually break. Another part of the reason is port efficiency. Here, efficiency is defined by cost per megabit transferred on each port. Roughly speaking, the cost is calculated by:

((cost_of_switch / port_count) + transceiver_cost)

which is combined with the cross-connect cost (which might be a monthly recurring charge (MRC) or a one-time fee). A PNI is preferable because it helps optimize value by reducing the overall cost per megabit transferred: the unit price decreases with higher utilization of the port.
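
As a rough worked example of that port-efficiency arithmetic (all numbers below are made up for illustration, and the amortization period is an assumption), the cost per megabit falls as port utilization rises:

def monthly_cost_per_mbit(switch_cost, port_count, transceiver_cost,
                          cross_connect_monthly, utilization_mbit,
                          amortization_months=36):
    """Approximate monthly cost per megabit actually transferred on a port.
    Hardware is amortized over a fixed number of months; all inputs are
    illustrative assumptions, not real figures."""
    hardware_monthly = (switch_cost / port_count + transceiver_cost) / amortization_months
    total_monthly = hardware_monthly + cross_connect_monthly
    return total_monthly / utilization_mbit

# The same port gets cheaper per megabit the more traffic it carries.
print(monthly_cost_per_mbit(30000, 32, 400, 250, utilization_mbit=10000))  # lightly used
print(monthly_cost_per_mbit(30000, 32, 400, 250, utilization_mbit=80000))  # well used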

This reasoning is similar for a lot of other networks, and is very prevalent in transit networks. BGP is at least as much about cost and business policy, as it is about performance.

Transit local preference

For simplicity, when referring to transits, I mean the traditional tier-1 transit networks. Due to the nature of these networks, they have two distinct sets of network peers:

1. Customers (like Cloudflare)
2. Settlement-free peers (like other tier-1 networks)

In normal circumstances, transit customers will get a higher local preference assigned than the local preference used for their settlement-free peers. This means that, no matter how much you prepend a prefix, if traffic enters that transit network, it will always land on your interconnection with that transit network; it will not be offloaded to another peer.

A prepend can still be used if you want to shift or offload traffic from a single link with one transit when you have multiple distinct links with them, or if the source of traffic is multihomed behind multiple transits (and they don’t have their own local preference playbook preferring one transit over another). But steering inbound traffic away from one transit port to another through AS-path prepending has significant diminishing returns: once you’re past three prepends, it’s unlikely to change much, if anything, at that point.

Example

[Figure: Origin A single-homed behind Transit B, with Cloudflare connected to both transits]

In the above scenario, no matter the adjustment Cloudflare makes in its AS-path towards AS 64496, the traffic will keep flowing through the Transit B <> Cloudflare interconnection, even though the path Origin A → Transit B → Transit A → Cloudflare is shorter from an AS-path point of view.

[Figure: Origin A multi-homed behind both transit providers]

In this scenario, not a lot has changed, but Origin A is now multi-homed behind the two transit providers. In this case, the AS-path prepending was effective, as the paths seen on the Origin A side are both the prepended and non-prepended path. As long as Origin A is not doing any egress traffic engineering, and is treating both transit networks equally, then the path chosen will be Origin A → Transit A → Cloudflare.

Community-based traffic engineering

So we have now identified a pretty critical problem within the Internet ecosystem for operators: with the tools mentioned above, it’s not always possible (some might even say outright impossible) to accurately dictate the paths on which traffic enters your own network, reducing the control an autonomous system has over its own network. Fortunately, there is a solution for this problem: community-based local preference.

Some transit providers allow their customers to influence the local preference in the transit network through the use of BGP communities. BGP communities are an optional transitive attribute for a route advertisement. The communities can be informative (“I learned this prefix in Rome”), but they can also be used to trigger actions on the receiving side. For example, Cogent publishes the following action communities:

Community Local preference
174:10 10
174:70 70
174:120 120
174:125 125
174:135 135
174:140 140

When you know that Cogent uses the following default local preferences in their network:

Peers → Local preference 100
Customers → Local preference 130

It’s easy to see how we could use the communities provided to change the route used. It’s important to note though that, as we can’t set the local preference of a route to 100 (or 130), AS-path prepending remains largely irrelevant, as the local preference won’t ever be the same.

Take for example the following configuration:

term ADV-SITELOCAL {
    from {
        prefix-list SITE-LOCAL;
        route-type internal;
    }
    then {
        as-path-prepend "13335 13335";
        accept;
    }
}

[Figure: traffic flow with the two-prepend configuration]

We’re prepending the Cloudflare ASN two times, resulting in a total AS-path length of three, yet we were still seeing a lot (too much) traffic coming in on our Cogent link. At that point, an engineer could add another prepend, but for a well-connected network such as Cloudflare, if two or three prepends didn’t do much, then four or five aren’t going to do much either. Instead, we can leverage the Cogent communities documented above to change the routing within Cogent:

term ADV-SITELOCAL {
    from {
        prefix-list SITE-LOCAL;
        route-type internal;
    }
    then {
        community add COGENT_LPREF70;
        accept;
    }
}

The above configuration changes the traffic flow to this:

[Figure: traffic flow after applying the Cogent local-preference community]

Which is exactly what we wanted!

Conclusion

AS-path prepending is still useful, and has its place in the toolchain operators use for traffic engineering, but it should be used sparingly. Excessive prepending opens a network up to more widespread route hijacks, which should be avoided at all costs. As such, using community-based ingress traffic engineering is highly preferred (and recommended). In cases where communities aren’t available (or not available to steer customer traffic), prepends can be applied, but I encourage operators to actively monitor their effects, and roll them back if ineffective.

As a side-note, P Marcos et al. have published an interesting paper on AS-path prepending, and go into some trends seen in relation to prepending, I highly recommend giving it a read: https://www.caida.org/catalog/papers/2020_aspath_prepending/aspath_prepending.pdf

How we detect route leaks and our new Cloudflare Radar route leak service

Post Syndicated from Mingwei Zhang original https://blog.cloudflare.com/route-leak-detection-with-cloudflare-radar/


Today we’re introducing Cloudflare Radar’s route leak data and API so that anyone can get information about route leaks across the Internet. We’ve built a comprehensive system that takes in data from public sources and Cloudflare’s view of the Internet drawn from our massive global network. The system is now feeding route leak data on Cloudflare Radar’s ASN pages and via the API.

This blog post is in two parts. There’s a discussion of BGP and route leaks followed by details of our route leak detection system and how it feeds Cloudflare Radar.

About BGP and route leaks

Inter-domain routing, i.e., exchanging reachability information among networks, is critical to the health and performance of the Internet. The Border Gateway Protocol (BGP) is the de facto routing protocol that exchanges routing information among organizations and networks. At its core, BGP assumes the information being exchanged is genuine and trustworthy, which unfortunately is no longer a valid assumption on the current Internet. In many cases, networks can make mistakes or intentionally lie about the reachability information and propagate that to the rest of the Internet. Such incidents can cause significant disruptions to the normal operation of the Internet. One type of such disruptive incident is the route leak.

We consider route leaks to be the propagation of routing announcements beyond their intended scope (RFC7908). Route leaks can cause significant disruption affecting millions of Internet users, as we have seen in many past notable incidents. For example, in June 2019 a misconfiguration in a small network in Pennsylvania, US (AS396531 – Allegheny Technologies Inc) accidentally leaked a Cloudflare prefix to Verizon, which proceeded to propagate the misconfigured route to the rest of its peers and customers. As a result, the traffic of a large portion of the Internet was squeezed through the limited-capacity links of a small network. The resulting congestion caused most of Cloudflare traffic to and from the affected IP range to be dropped.

A similar incident in November 2018 caused widespread unavailability of Google services when a Nigerian ISP (AS37282 – Mainone) accidentally leaked a large number of Google IP prefixes to its peers and providers violating the valley-free principle.

These incidents illustrate not only that route leaks can be very impactful, but also the snowball effects that misconfigurations in small regional networks can have on the global Internet.

Despite the criticality of detecting and rectifying route leaks promptly, they are often detected only when users start reporting the noticeable effects of the leaks. The challenge with detecting and preventing route leaks stems from the fact that AS business relationships and BGP routing policies are generally undisclosed, and the affected network is often remote to the root of the route leak.

In the past few years, solutions have been proposed to prevent the propagation of leaked routes. Such proposals include RFC9234 and ASPA, which extend BGP to annotate sessions with the relationship type between the two connected AS networks, enabling the detection and prevention of route leaks.

An alternative proposal to implement similar signaling of BGP roles is through the use of BGP communities, a transitive attribute used to encode metadata in BGP announcements. While these directions are promising in the long term, they are still in very preliminary stages and are not expected to be adopted at scale soon.

At Cloudflare, we have developed a system to detect route leak events automatically and send notifications to multiple channels for visibility. As we continue our efforts to bring more relevant data to the public, we are happy to announce that today we are launching an open data API for our route leak detection results and integrating those results into Cloudflare Radar pages.


Route leak definition and types

Before we jump into how we design our systems, we will first do a quick primer on what a route leak is, and why it is important to detect it.

We refer to the published IETF RFC7908 document “Problem Definition and Classification of BGP Route Leaks” to define route leaks.

> A route leak is the propagation of routing announcement(s) beyond their intended scope.

The intended scope is often concretely defined as inter-domain routing policies based on business relationships between Autonomous Systems (ASes). These business relationships are broadly classified into four categories: customers, transit providers, peers and siblings, although more complex arrangements are possible.

In a customer-provider relationship the customer AS has an agreement with another network to transit its traffic to the global routing table. In a peer-to-peer relationship two ASes agree to free bilateral traffic exchange, but only between their own IPs and the IPs of their customers. Finally, ASes that belong under the same administrative entity are considered siblings, and their traffic exchange is often unrestricted.  The image below illustrates how the three main relationship types translate to export policies.

[Figure: export policies implied by the customer, provider, and peer relationship types]

By categorizing the types of AS-level relationships and their implications on the propagation of BGP routes, we can define multiple phases of a prefix origination announcement during propagation:

  • upward: all path segments during this phase are customer to provider
  • peering: one peer-peer path segment
  • downward: all path segments during this phase are provider to customer

An AS path that follows the valley-free routing principle will have upward, peering, and downward phases; all are optional, but they have to appear in that order. Here is an example of an AS path that conforms with valley-free routing.

[Figure: example AS path conforming to valley-free routing]
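
A hedged Python sketch of that valley-free check: given the relationship of each adjacent AS pair on the path (seen in the direction of propagation), the phases must appear in the order upward, then at most one peering step, then downward. The relationship encoding below is an assumption made for illustration, not the production detector.

# Relationship of each hop along the path: "c2p" (customer to provider),
# "p2p" (peer to peer) or "p2c" (provider to customer).
PHASE_ORDER = {"c2p": 0, "p2p": 1, "p2c": 2}

def is_valley_free(hop_relationships):
    """True if the phases only ever move forward (upward -> peering -> downward)
    and there is at most one peering step."""
    last_phase = 0
    peering_hops = 0
    for rel in hop_relationships:
        phase = PHASE_ORDER[rel]
        if phase < last_phase:
            return False          # e.g. going back up after a downward step
        if rel == "p2p":
            peering_hops += 1
            if peering_hops > 1:
                return False      # a second peering step (type 2 leak)
        last_phase = phase
    return True

print(is_valley_free(["c2p", "c2p", "p2p", "p2c"]))   # True: conforms to valley-free
print(is_valley_free(["p2c", "c2p"]))                 # False: type 1 "hairpin" leak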

In RFC7908, “Problem Definition and Classification of BGP Route Leaks”, the authors define six types of route leaks, and we refer to these definitions in our system design. Here are illustrations of each of the route leak types.

Type 1: Hairpin Turn with Full Prefix

> A multihomed AS learns a route from one upstream ISP and simply propagates it to another upstream ISP (the turn essentially resembling a hairpin).  Neither the prefix nor the AS path in the update is altered.

An AS path that contains a provider-customer and then a customer-provider segment is considered a type 1 leak. In the following example, AS4 → AS5 → AS6 forms a type 1 leak.

[Figure: type 1 leak example — AS4 → AS5 → AS6]

Type 1 is the most recognized type of route leak and is very impactful. In many cases, a customer route is preferable to a peer or a provider route. In this example, AS6 will likely prefer sending traffic via AS5 instead of its other peer or provider routes, causing AS5 to unintentionally become a transit provider. This can significantly affect the performance of the traffic related to the leaked prefix or cause outages if the leaking AS is not provisioned to handle a large influx of traffic.

In June 2015, Telekom Malaysia (AS4788), a regional ISP, leaked over 170,000 routes learned from its providers and peers to its other provider Level 3 (AS3549, now part of Lumen). Level 3 accepted the routes and further propagated them to its downstream networks, which in turn caused significant network issues globally.

Type 2: Lateral ISP-ISP-ISP Leak

Type 2 leak is defined as propagating routes obtained from one peer to another peer, creating two or more consecutive peer-to-peer path segments.

Here is an example: AS3 → AS4 → AS5 forms a  type 2 leak.

[Figure: type 2 leak example — AS3 → AS4 → AS5]

One example of such a leak is more than three very large networks appearing in sequence. Very large networks (such as Verizon and Lumen) do not purchase transit from each other, and having more than three such networks on the path in sequence is often an indication of a route leak.

However, in the real world, it is not unusual to see multiple small peering networks exchanging routes and passing on to each other. Legit business reasons exist for having this type of network path. We are less concerned about this type of route leak as compared to type 1.

Type 3 and 4: Provider routes to peer; peer routes to provider

These two types involve propagating routes from a provider or a peer not to a customer, but to another peer or provider. Here are the illustrations of the two types of leaks:

[Figure: type 3 leak — a provider route propagated to a peer]
[Figure: type 4 leak — a peer route propagated to a provider]

As in the previously mentioned example, a Nigerian ISP that peers with Google accidentally leaked the routes it learned from Google to its provider AS4809, and thus generated a type 4 route leak. Because routes via customers are usually preferred to others, the large provider (AS4809) rerouted its traffic to Google via its customer, i.e. the leaking ASN, overwhelming the small ISP and taking down Google for over an hour.

Route leak summary

So far, we have looked at the first four types of route leaks defined in RFC7908. The common thread of these four types is that they’re all defined using AS relationships, i.e., peers, customers, and providers. We summarize the types of leaks by categorizing the AS path propagation based on where the routes are learned from and where they propagate to. The results are shown in the following table.

Routes from \ propagates to    To provider    To peer    To customer
From provider                  Type 1         Type 3     Normal
From peer                      Type 4         Type 2     Normal
From customer                  Normal         Normal     Normal

We can summarize the whole table into one single rule: routes obtained from a non-customer AS can only be propagated to customers.
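
That single rule translates into a very small check. The sketch below (an illustrative encoding, not the production detector) flags a propagation step as a leak whenever a route learned from a non-customer is sent to a non-customer, and maps it back to the leak types in the table above.

# Where was the route learned from, and where is it being propagated to?
# Relationships are from the leaker's point of view: "provider", "peer" or "customer".
LEAK_TYPE = {
    ("provider", "provider"): 1,   # hairpin turn
    ("provider", "peer"):     3,
    ("peer",     "provider"): 4,
    ("peer",     "peer"):     2,   # lateral ISP-ISP-ISP
}

def leak_type(learned_from: str, propagated_to: str):
    """Return the RFC7908 leak type (1-4), or None if the propagation is normal.
    Routes obtained from a non-customer may only be propagated to customers."""
    if learned_from == "customer" or propagated_to == "customer":
        return None
    return LEAK_TYPE[(learned_from, propagated_to)]

print(leak_type("provider", "provider"))  # 1
print(leak_type("peer", "customer"))      # None: normal propagation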

Note: Type 5 and type 6 route leaks are defined as prefix re-origination and announcing of private prefixes. Type 5 is more closely related to prefix hijackings, which we plan to expand our system to as the next steps, while type 6 leaks are outside the scope of this work. Interested readers can refer to sections 3.5 and 3.6 of RFC7908 for more information.

The Cloudflare Radar route leak system

Now that we know what a  route leak is, let’s talk about how we designed our route leak detection system.

From a very high level, we compartmentalize our system into three different components:

  1. Raw data collection module: responsible for gathering BGP data from multiple sources and providing BGP message stream to downstream consumers.
  2. Leak detection module: responsible for determining whether a given AS-level path is a route leak, estimating the confidence level of the assessment, and aggregating and providing all external evidence needed for further analysis of the event.
  3. Storage and notification module: responsible for providing access to detected route leak events and sending out notifications to relevant parties. This could also include building a dashboard for easy access and search of the historical events and providing the user interface for high-level analysis of the event.

Data collection module

There are three types of data input we take into consideration:

  1. Historical: BGP archive files for some time range in the past
    a. RouteViews and RIPE RIS BGP archives
  2. Semi-real-time: BGP archive files as soon as they become available, with a 10-30 minute delay.
    a. RouteViews and RIPE RIS archives with data broker that checks new files periodically (e.g. BGPKIT Broker)
  3. Real-time: true real-time data sources
    a. RIPE RIS Live
    b. Cloudflare internal BGP sources


For the current version, we use the semi-real-time data source for the detection system, i.e., the BGP updates files from RouteViews and RIPE RIS. For data completeness, we process data from all public collectors from these two projects (a total of 63 collectors and over 2,400 collector peers) and implement a pipeline that’s capable of handling the BGP data processing as the data files become available.

For data files indexing and processing, we deployed an on-premises BGPKIT Broker instance with Kafka feature enabled for message passing, and a custom concurrent MRT data processing pipeline based on BGPKIT Parser Rust SDK. The data collection module processes MRT files and converts results into a BGP messages stream at over two billion BGP messages per day (roughly 30,000 messages per second).

Route leak detection

The route leak detection module works at the level of individual BGP announcements. The detection component investigates one BGP message at a time, and estimates how likely a given BGP message is a result of a route leak event.


We base our detection algorithm mainly on the valley-free model, which we believe can capture most of the notable route leak incidents. As mentioned previously, the key to having low false positives for detecting route leaks with the valley-free model is to have accurate AS-level relationships. While those relationship types are not publicized by every AS, there have been over two decades of research on the inference of the relationship types using publicly observed BGP data.

While state-of-the-art relationship inference algorithms have been shown to be highly accurate, even a small margin of error can still introduce inaccuracies in the detection of route leaks. To alleviate such artifacts, we synthesize multiple data sources for inferring AS-level relationships, including CAIDA/UCSD’s AS relationship data and our in-house built AS relationship dataset. Building on top of these two AS-level relationship datasets, we create a much more granular dataset at the per-prefix and per-peer levels. The improved dataset allows us to answer questions like: what is the relationship between AS1 and AS2 with respect to prefix P, as observed by collector peer X? This eliminates much of the ambiguity for cases where networks have multiple different relationships based on prefixes and geo-locations, and thus helps us reduce the number of false positives in the system. Besides the AS-relationship datasets, we also apply the AS Hegemony dataset from IHR IIJ to further reduce false positives.
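
One way to picture that granular dataset is as a lookup that is keyed as specifically as possible and falls back to the coarser, global relationship when no per-prefix or per-peer entry exists. This is only an illustration of the idea, not Cloudflare's actual data model; the key layout and example ASNs are assumptions.

def as_relationship(rel_db, global_rel, as1, as2, prefix=None, collector_peer=None):
    """Resolve the relationship between as1 and as2, preferring the most
    specific view available (per prefix and vantage point), then falling
    back to the global inference (e.g. CAIDA-style AS relationships)."""
    for key in (
        (as1, as2, prefix, collector_peer),   # most specific: prefix + collector peer
        (as1, as2, prefix, None),             # per prefix only
        (as1, as2, None, None),               # plain AS pair override
    ):
        if key in rel_db:
            return rel_db[key]
    return global_rel.get((as1, as2), "unknown")

# Hypothetical example: AS1 sells transit to AS2 globally, but the two peer
# directly for one specific prefix as seen from one collector peer.
rel_db = {(64496, 64500, "192.0.2.0/24", 64511): "peer"}
global_rel = {(64496, 64500): "provider"}
print(as_relationship(rel_db, global_rel, 64496, 64500, "192.0.2.0/24", 64511))  # peer
print(as_relationship(rel_db, global_rel, 64496, 64500))                         # provider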

Route leak storage and presentation

After processing each BGP message, we store the generated route leak entries in a database for long-term storage and exploration. We also aggregate individual route leak BGP announcements, grouping relevant leaks from the same leaker ASN within a short period into route leak events. The route leak events will then be available for consumption by different downstream applications like web UIs, an API, or alerts.
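
Grouping the per-announcement results into events can be as simple as bucketing by leaker ASN and closing a bucket after a quiet period. Here is a minimal sketch of that idea; the window length and record shape are assumptions, not the production pipeline.

def group_into_events(leak_records, quiet_seconds=3600):
    """leak_records: iterable of (timestamp, leaker_asn, prefix), sorted by time.
    Announcements from the same leaker ASN are merged into one event unless
    they are separated by more than quiet_seconds."""
    open_events = {}                  # leaker_asn -> currently open event
    events = []
    for ts, leaker, prefix in leak_records:
        event = open_events.get(leaker)
        if event is None or ts - event["end"] > quiet_seconds:
            event = {"leaker": leaker, "start": ts, "end": ts, "prefixes": set()}
            open_events[leaker] = event
            events.append(event)
        event["end"] = ts
        event["prefixes"].add(prefix)
    return events

records = [(0, 64500, "192.0.2.0/24"), (60, 64500, "198.51.100.0/24"), (90000, 64500, "192.0.2.0/24")]
print(len(group_into_events(records)))   # 2: the last announcement starts a new event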


Route leaks on Cloudflare Radar

At Cloudflare, we aim to help build a better Internet, and that includes sharing our efforts on monitoring and securing Internet routing. Today, we are releasing our route leak detection system as public beta.

Starting today, users going to the Cloudflare Radar ASN pages will now find the list of route leaks that affect that AS. We consider that an AS is being affected when the leaker AS is one hop away from it in any direction, before or after.

The Cloudflare Radar ASN page is directly accessible via https://radar.cloudflare.com/as{ASN}. For example, one can navigate to https://radar.cloudflare.com/as174 to view the overview page for Cogent AS174. ASN pages now show a dedicated card for route leaks detected relevant to the current ASN within the selected time range.

[Figure: the route leaks card on a Cloudflare Radar ASN page]

Users can also start using our public data API to look up route leak events for any given ASN. Our API supports filtering route leak results by time range and the ASes involved. Here is a screenshot of the route leak events API documentation page on the newly updated API docs site.

[Figure: screenshot of the route leak events API documentation page]

More to come on routing security

There is a lot more we are planning to do with route leak detection. More features like a global view page, route leak notifications, more advanced APIs, custom automation scripts, and historical archive datasets will begin to ship on Cloudflare Radar over time. Your feedback and suggestions are also very important for us as we continue improving our detection results and serving better data to the public.

Furthermore, we will continue to expand our work on other important topics of Internet routing security, including global BGP hijack detection (not limited to our customer networks), RPKI validation monitoring, open-sourcing tools and architecture designs, and a centralized routing security web gateway. Our goal is to provide the best data and tools for routing security to the communities so that we can build a better and more secure Internet together.

In the meantime, we opened a Radar room on our Developers Discord Server. Feel free to join and talk to us; the team is eager to receive feedback and answer questions.

Visit Cloudflare Radar for more Internet insights. You can also follow us on Twitter for more Radar updates.

Route leaks and confirmation biases

Post Syndicated from Maximilian Wilhelm original https://blog.cloudflare.com/route-leaks-and-confirmation-biases/


This is not what I imagined my first blog article would look like, but here we go.

On February 1, 2022, a configuration error on one of our routers caused a route leak of up to 2,000 Internet prefixes to one of our Internet transit providers. This leak lasted for 32 seconds and, at a later time, for another seven seconds. We did not see any traffic spikes or drops in our network and did not see any customer impact because of this error, but this may have caused an impact to external parties, and we are sorry for the mistake.


Timeline

All timestamps are UTC.

As part of our efforts to build the best network, we regularly update our Internet transit and peering links throughout our network. On February 1, 2022, we had a “hot-cut” scheduled with one of our Internet transit providers to simultaneously update router configurations on Cloudflare and ISP routers to migrate one of our existing Internet transit links in Newark to a link with more capacity. Doing a “hot-cut” means that both parties will change cabling and configuration at the same time, usually while being on a conference call, to reduce downtime and impact on the network. The migration started off-peak at 10:45 (05:45 local time) with our network engineer entering the bridge call with our data center engineers and remote hands on site as well as operators from the ISP.

At 11:17, we connected the new fiber link and established the BGP sessions to the ISP successfully. We had BGP filters in place on our end to not accept and send any prefixes, so we could evaluate the connection and settings without any impact on our network and services.

As the connection between our router and the ISP — like most Internet connections — was realized over a fiber link, the first item to check is the “light levels” of that link. This shows the strength of the optical signal received by our router from the ISP router and can indicate a bad connection when it’s too low. Low light levels are likely caused by unclean fiber ends or not fully seated connectors, but may also indicate a defective optical transceiver which connects the fiber link to the router – all of which can degrade service quality.

The next item on the checklist is interface errors, which will occur when a network device receives incorrect or malformed network packets, which would also indicate a bad connection and would likely lead to a degradation in service quality, too.

As light levels were good, and we observed no errors on the link, we deemed it ready for production and removed the BGP reject filters at 11:22.

This immediately triggered the maximum prefix-limit protection the ISP had configured on the BGP session and shut down the session, preventing further impact. The maximum prefix-limit is a safeguard in BGP to prevent the spread of route leaks and to protect the Internet. The limit is usually set just a little higher than the expected number of Internet prefixes from a peer to leave some headroom for growth but also catch configuration errors fast. The configured value was just 40 prefixes short of the number of prefixes we were advertising at that site, so this was considered the reason for the session to be shut down. After checking back internally, we asked the ISP to raise the prefix-limit, which they did.
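
Conceptually, the maximum prefix-limit is just a counter attached to the session. The Python sketch below only illustrates the safeguard; the threshold and the shutdown behavior are generic assumptions, not the ISP's actual configuration.

class BgpSessionGuard:
    """Tear a BGP session down once a peer advertises more prefixes than expected."""
    def __init__(self, max_prefixes):
        self.max_prefixes = max_prefixes
        self.prefixes = set()
        self.shutdown = False

    def on_announcement(self, prefix):
        if self.shutdown:
            return
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            self.shutdown = True   # on a real router: session reset plus an alarm
            print(f"prefix-limit exceeded ({len(self.prefixes)} > {self.max_prefixes}), shutting session")

guard = BgpSessionGuard(max_prefixes=100)
for i in range(150):
    guard.on_announcement(f"10.{i}.0.0/16")   # the 101st unique prefix trips the limit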

The BGP session was reestablished at 12:08 and immediately shut down again. The problem was identified and fixed at 12:14.

10:45: Start of scheduled maintenance

11:17: New link was connected and BGP sessions went up (filters still in place)

11:22: Link was deemed ready for production and filters removed

11:23: BGP sessions were torn down by ISP router due to configured prefix-limit

12:08: ISP configures higher prefix-limits, BGP sessions briefly come up again and are shut down

12:14: Issue identified and configuration updated

What happened and what we’re doing about it

The outage occurred while migrating one of our Internet transits to a link with more capacity. Once the new link and a BGP session had been established, and the link deemed error-free, our network engineering team followed the peer-reviewed deployment plan. The team removed the filters from the BGP sessions, which until then had prevented the Cloudflare router from accepting and sending prefixes via BGP.

Due to an oversight in the deployment plan, which had been peer-reviewed without this issue being noticed, no BGP filters were added to export only the prefixes of Cloudflare and our customers. A peer review on the internal chat did not notice this either, so the network engineer performing this change went ahead.

ewr02# show |compare                                     
[edit protocols bgp group 4-ORANGE-TRANSIT]
-  import REJECT-ALL;
-  export REJECT-ALL;
[edit protocols bgp group 6-ORANGE-TRANSIT]
-  import REJECT-ALL;
-  export REJECT-ALL;

The change resulted in our router sending all known prefixes to the ISP router, which shut down the session as the number of prefixes received exceeded the maximum prefix-limit configured.

As the configured values for the maximum prefix-limits turned out to be rather low for the number of prefixes on our network, this didn’t come as a surprise to our network engineering team and no investigation into why the BGP session went down was started. The prefix-limit being too low seemed to be a perfectly valid reason.

We asked the ISP to increase the prefix-limit, which they did after they received approval on their side. Once the prefix-limit had been increased and the previously shut down BGP sessions had been reset, the sessions were reestablished but were shut down immediately as the maximum prefix-limit was triggered again. This is when our network engineer started questioning whether there was another issue at fault and found and corrected the configuration error previously overlooked.

We made the following change in response to this event: we introduced an implicit reject policy for BGP sessions which will take effect if no import/export policy is configured for a specific BGP neighbor or neighbor group. This change has been deployed.

BGP security & preventing route-leaks — what’s in the cards?

Route leaks aren’t new, and they keep happening. The industry has come up with many approaches to limit the impact or even prevent route-leaks. Policies and filters are used to control which prefixes should be exported to or imported from a given peer. RPKI can help to make sure only allowed prefixes are accepted from a peer and a maximum prefix-limit can act as a last line of defense when everything else fails.

BGP policies and filters are commonly used to ensure only explicitly allowed prefixes are sent out to BGP peers, usually only allowing prefixes owned by the entity operating the network and its customers. They can also be used to tweak some knobs (BGP local-pref, MED, AS path prepend, etc.) to influence routing decisions and balance traffic across links. This is what the policies we have in place for our peers and transits do. As explained above, the maximum prefix-limit is intended to tear down BGP sessions if more prefixes are being sent or received than expected. We have talked about RPKI before; it’s the required cryptographic upgrade to BGP routing, and we are still on our path to securing Internet routing.

To improve the overall stability of the Internet even more, in 2017, a new Internet standard was proposed, which adds another layer of protection into the mix: RFC8212 defines Default External BGP (EBGP) Route Propagation Behavior without Policies, which pretty much tackles the exact issues we were facing.

This RFC updates the BGP-4 standard (RFC4271) which defines how BGP works and what vendors are expected to implement. On the Juniper operating system, JunOS, this can be activated by setting defaults ebgp no-policy reject-always on the protocols bgp hierarchy level starting with Junos OS Release 20.3R1.

If you are running an older version of JunOS, a similar effect can be achieved by defining a REJECT-ALL policy and setting this as import/export policy on the protocols bgp hierarchy level. Note that this will also affect iBGP sessions, which the solution above will have no impact on.

policy-statement REJECT-ALL {
  then reject;
}

protocols {
  bgp {
    import REJECT-ALL;
    export REJECT-ALL;
  }
}

Conclusion

We are sorry for leaking routes for prefixes which did not belong to Cloudflare or our customers, and we apologize to the network engineers who got paged as a result of this.

We have processes in place to make sure that changes to our infrastructure are reviewed before being executed, so potential issues can be spotted before they reach production. In this case, the review process failed to catch this configuration error. In response, we will increase our efforts to further our network automation, to fully derive the device configuration from an intended state.

While this configuration error was caused by human error, it could have been detected and mitigated significantly faster had confirmation bias not kicked in, making the operator think the observed behavior was to be expected. This error underlines the importance of our existing efforts on training our people to be aware of biases we have in our life. This also serves as a great example of how confirmation bias can influence and impact our work, and of why we should question our conclusions (early).

It also shows how important protocols like RPKI are. Route leaks are something even experienced network operators can cause accidentally, and technical solutions are needed to reduce the impact of leaks whether they are intentional or the result of an error.

What happened on the Internet during the Facebook outage

Post Syndicated from Celso Martinho original https://blog.cloudflare.com/during-the-facebook-outage/


It’s been a few days now since Facebook, Instagram, and WhatsApp went AWOL and experienced one of the longest and roughest downtime periods in their existence.

When that happened, we reported our bird’s-eye view of the event and posted the blog Understanding How Facebook Disappeared from the Internet where we tried to explain what we saw and how DNS and BGP, two of the technologies at the center of the outage, played a role in the event.

In the meantime, more information has surfaced, and Facebook has published a blog post giving more details of what happened internally.

As we said before, these events are a gentle reminder that the Internet is a vast network of networks, and we, as industry players and end-users, are part of it and should work together.

In the aftermath of an event of this size, we don’t waste much time debating how peers handled the situation. We do, however, ask ourselves the more important questions: “How did this affect us?” and “What if this had happened to us?” Asking and answering these questions whenever something like this happens is a great and healthy exercise that helps us improve our own resilience.

Today, we’re going to show you how the Facebook and affiliate sites downtime affected us, and what we can see in our data.

1.1.1.1

1.1.1.1 is a fast and privacy-centric public DNS resolver operated by Cloudflare, used by millions of users, browsers, and devices worldwide. Let’s look at our telemetry and see what we find.

First, the obvious. If we look at the response rate, there was a massive spike in the number of SERVFAIL codes. SERVFAILs can happen for several reasons; we have an excellent blog called Unwrap the SERVFAIL that you should read if you’re curious.

In this case, we started serving SERVFAIL responses to all facebook.com and whatsapp.com DNS queries because our resolver couldn’t access the upstream Facebook authoritative servers. That was about 60 times more than the average on a typical day.

[Figure: spike in SERVFAIL responses during the outage]

If we look at all the queries, not specific to Facebook or WhatsApp domains, and we split them by IPv4 and IPv6 clients, we can see that our load increased too.

As explained before, this is due to a snowball effect associated with applications and users retrying after the errors and generating even more traffic. In this case, 1.1.1.1 had to handle more than the expected rate for A and AAAA queries.

[Figure: overall 1.1.1.1 query load, split by IPv4 and IPv6 clients]

Here’s another fun one.

DNS vs. DoT and DoH. Typically, DNS queries and responses are sent in plaintext over UDP (or TCP sometimes), and that’s been the case for decades now. Naturally, this poses security and privacy risks to end-users as it allows in-transit attacks or traffic snooping.

With DNS over TLS (DoT) and DNS over HTTPS (DoH), clients can talk DNS using well-known, well-supported encryption and authentication protocols.

Our learning center has a good article on “DNS over TLS vs. DNS over HTTPS” that you can read. Browsers like Chrome, Firefox, and Edge have supported DoH for some time now, WARP uses DoH too, and you can even configure your operating system to use the new protocols.

When Facebook went offline, we saw the number of DoT+DoH SERVFAIL responses grow to over 300x the average rate.

[Figures: growth in SERVFAIL responses over DoT and DoH]

So, we got hammered with lots of requests and errors, causing traffic spikes to our 1.1.1.1 resolver and causing an unexpected load in the edge network and systems. How did we perform during this stressful period?

Quite well. 1.1.1.1 kept its cool and continued serving the vast majority of requests around the famous 10ms mark. A small fraction of requests, at the p95 and p99 percentiles, saw increased response times, probably due to timeouts trying to reach Facebook’s nameservers.

[Figure: 1.1.1.1 response time percentiles during the outage]

Another interesting perspective is the distribution of the ratio between SERVFAIL and good DNS answers, by country. In theory, the higher this ratio is, the more the country uses Facebook. Here’s the map with the countries that suffered the most:

[Figure: map of the SERVFAIL-to-good-answers ratio by country]

Here’s the top twelve country list, ordered by those that apparently use Facebook, WhatsApp and Instagram the most:

Country SERVFAIL/Good Answers ratio
Turkey 7.34
Grenada 4.84
Congo 4.44
Lesotho 3.94
Nicaragua 3.57
South Sudan 3.47
Syrian Arab Republic 3.41
Serbia 3.25
Turkmenistan 3.23
United Arab Emirates 3.17
Togo 3.14
French Guiana 3.00

Impact on other sites

When Facebook, Instagram, and WhatsApp aren’t around, the world turns to other places to look for information on what’s going on, other forms of entertainment or other applications to communicate with their friends and family. Our data shows us those shifts. While Facebook was going down, other services and platforms were going up.

To get an idea of the changing traffic patterns we look at DNS queries as an indicator of increased traffic to specific sites or types of site.

Here are a few examples.

Other social media platforms saw a slight increase in use, compared to normal.

[Figure: DNS query volume for other social media platforms]

Traffic to messaging platforms like Telegram, Signal, Discord and Slack got a little push too.

[Figure: DNS query volume for messaging platforms]

Nothing like a little gaming time when Instagram is down, we guess, when looking at traffic to sites like Steam, Xbox, Minecraft and others.

[Figure: DNS query volume for gaming platforms]

And yes, people want to know what’s going on and fall back on news sites like CNN, New York Times, The Guardian, Wall Street Journal, Washington Post, Huffington Post, BBC, and others:

[Figure: DNS query volume for news sites]

Attacks

One could speculate that the Internet was under attack from malicious hackers. Our Firewall doesn’t agree; nothing out of the ordinary stands out.

[Figure: firewall activity during the outage]

Network Error Logs

Network Error Logging, NEL for short, is an experimental technology supported in Chrome. A website can issue a Report-To header and ask the browser to send reports about network problems, like bad requests or DNS issues, to a specific endpoint.

Cloudflare uses NEL data to quickly help triage end-user connectivity issues when end-users reach our network. You can learn more about this feature in our help center.

If Facebook is down and their DNS isn’t responding, Chrome will start reporting NEL events every time one of the pages in our zones fails to load Facebook comments, posts, ads, or authentication buttons. This chart shows it clearly.​​

[Chart: NEL reports during the outage]

WARP

Cloudflare announced WARP in 2019, and called it “A VPN for People Who Don’t Know What V.P.N. Stands For” and offered it for free to its customers. Today WARP is used by millions of people worldwide to securely and privately access the Internet on their desktop and mobile devices. Here’s what we saw during the outage by looking at traffic volume between WARP and Facebook’s network:

[Figure: WARP traffic volume to Facebook’s network]

You can see how the steep drop in Facebook ASN traffic coincides with the start of the incident and how it compares to the same period the day before.

Our own traffic

People tend to think of Facebook as a place to visit. We log in, and we access Facebook, we post. It turns out that Facebook likes to visit us too, quite a lot. Like Google and other platforms, Facebook uses an army of crawlers to constantly check websites for data and updates. Those robots gather information about website content, such as titles, descriptions, thumbnail images, and metadata. You can learn more about this on the “The Facebook Crawler” page and the Open Graph website.

Here’s what we see when traffic is coming from the Facebook ASN, supposedly from crawlers, to our CDN sites:

[Figure: traffic from the Facebook ASN to Cloudflare CDN sites]

The robots went silent.

What about the traffic coming to our CDN sites from Facebook User-Agents? The gap is indisputable.

[Figure: traffic from Facebook User-Agents to Cloudflare CDN sites]

We see about 30% of a typical request rate hitting us. But it’s not zero; why is that?

We’ll let you know a little secret. Never trust User-Agent information; it’s broken. User-Agent spoofing is everywhere. Browsers, apps, and other clients deliberately change the User-Agent string when they fetch pages from the Internet to hide, obtain access to certain features, or bypass paywalls (because pay-walled sites want sites like Facebook to index their content, so that they then get more traffic from links).

Fortunately, there are newer, and privacy-centric standards emerging like User-Agent Client Hints.

Core Web Vitals

Core Web Vitals are the subset of Web Vitals, an initiative by Google to provide a unified interface to measure real-world quality signals when a user visits a web page. Such signals include Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS).

We use Core Web Vitals with our privacy-centric Web Analytics product and collect anonymized data on how end-users experience the websites that enable this feature.

One of the metrics we can calculate using these signals is the page load time. Our theory is that if a page includes scripts coming from external sites (for example, Facebook “like” buttons, comments, ads), and they are unreachable, its total load time gets affected.

We used a list of about 400 domains that we know embed Facebook scripts in their pages and looked at the data.

[Figure: page load time for sites embedding Facebook scripts]

Now let’s look at the Largest Contentful Paint. LCP marks the point in the page load timeline when the page’s main content has likely loaded. The faster the LCP is, the better the end-user experience.

[Figure: Largest Contentful Paint for sites embedding Facebook scripts]

Again, the page load experience got visibly degraded.

The outcome seems clear. The sites that use Facebook scripts in their pages took 1.5x more time to load their pages during the outage, with some of them taking more than 2x the usual time. Facebook’s outage dragged the performance of some other sites down.

Conclusion

When Facebook, Instagram, and WhatsApp went down, the Web felt it. Some websites got slower or lost traffic, other services and platforms got unexpected load, and people lost the ability to communicate or do business normally.

Hypotheses About What Happened to Facebook

Post Syndicated from Bozho original https://techblog.bozho.net/hypotheses-about-what-happened-to-facebook/

Facebook was down. I’d recommend reading Cloudflare’s summary. Then I recommend reading Facebook’s own account of the incident. But let me expand on that. Facebook published announcements and withdrawals for certain BGP prefixes, which led to the removal of its DNS servers from “the map of the internet” – they told everyone “the part of our network where our DNS servers are doesn’t exist”. That was the result of a self-inflicted backbone failure due to a bug in the auditing tool that checks whether the commands being executed aren’t doing harmful things.

Facebook owns a lot of IPs. According to RIPEstat they are part of 399 prefixes (147 of them IPv4). The DNS servers are located in two of those 399. Facebook uses a.ns.facebook.com, b.ns.facebook.com, c.ns.facebook.com and d.ns.facebook.com, which get queried whenever someone wants to know the IPs of Facebook-owned domains. These four nameservers are served by the same Autonomous System from just two prefixes – 129.134.30.0/23 and 185.89.218.0/23. Of course, “4 nameservers” is a logical construct; there are probably many actual servers behind them (using anycast).

I wrote a simple “script” to fetch all the withdrawals and announcements for all Facebook-owned prefixes (from the great API of RIPEstat). Facebook didn't remove itself from the map entirely. As Cloudflare points out, only some prefixes were affected. It may have been just these two, or a few others as well, but it seems that only a handful were affected. If we sort the resulting CSV from the above script by withdrawals, we'll notice that 129.134.30.0/23 and 185.89.218.0/23 are pretty high up (alongside the corresponding /24s, 185.89.218.0/24 and 129.134.30.0/24, which are contained in those /23s). That perfectly matches Facebook's account that their nameservers automatically withdraw themselves if they fail to connect to other parts of the infrastructure. Everything may have also been down, but the withdrawal logic is present only in the networks that have nameservers in them.
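A minimal sketch of such a script (not the author's actual one), using the RIPEstat bgp-updates endpoint that also appears later in this compilation; the time window shown and the per-update "type" field ("A" for announce, "W" for withdraw) are assumptions to verify against the live API:

# Pull a day of BGP updates for Facebook's AS32934 from RIPEstat and count
# announcements/withdrawals per prefix, then sort by withdrawals.
import json
import urllib.request
from collections import Counter

URL = ("https://stat.ripe.net/data/bgp-updates/data.json"
       "?resource=AS32934"
       "&starttime=2021-10-04T00:00:00"
       "&endtime=2021-10-05T00:00:00")

with urllib.request.urlopen(URL) as response:
    data = json.load(response)

announce, withdraw = Counter(), Counter()
for update in data["data"]["updates"]:
    prefix = update["attrs"]["target_prefix"]
    if update.get("type") == "W":            # assumed: "A" = announce, "W" = withdraw
        withdraw[prefix] += 1
    else:
        announce[prefix] += 1

# Sorting by withdrawals mirrors the "sort the resulting CSV" step above.
for prefix, count in withdraw.most_common(10):
    print(f"{prefix}\t{count} withdrawals\t{announce[prefix]} announcements")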

So first, let me make a few general observations that are not as obvious or as universal as they may sound, but are worth discussing:

  • Use longer DNS TTLs if possible – if Facebook had a 6-hour TTL on its domains, we may not have noticed that their nameservers were down. This is hard to ask of such a complex service that uses DNS for load-balancing and geographical distribution, but it's worth considering. Also, if they killed their backbone and their entire infrastructure was down anyway, a longer DNS TTL would not have solved the issue.
  • We need improved caching logic for DNS. It can't be just “present or not”; DNS caches could keep a “last known good state” and fall back to it in case of SERVFAIL. All of those DNS resolvers that had to ask the authoritative nameserver “where can I find facebook.com” knew where to find facebook.com just a minute earlier. Then they got a failure and suddenly they are wondering “oh, where could Facebook be?”. It's not that simple, of course, but such a cache improvement is worth considering (a minimal sketch of the idea follows this list). And again, if their entire infrastructure was down, this would not have helped.
  • Consider having an authoritative nameserver outside your main AS. If something bad happens to your AS routes (regardless of the reason), you may still have DNS working. That may have downsides – generally, it will be hard to manage and sync your DNS infrastructure. But at least having a spare set of nameservers and the option to quickly point glue records there is worth considering. It would not have saved Facebook in this case, as again, they claim the entire infrastructure was inaccessible due to a “broken” backbone.
  • Have a 100% test coverage on critical tools, such as the auditing tool that had a bug. 100% test coverage is rarely achievable in any project, but in such critical tools it’s a must.
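As a rough illustration of the “last known good state” idea from the second bullet: a tiny, library-free cache wrapper that keeps serving a stale answer when the upstream lookup fails. Real serve-stale behaviour is specified in RFC 8767; this sketch, including the flaky_resolver helper, is purely hypothetical.

import time

class ServeStaleCache:
    def __init__(self, resolver, ttl=300, stale_limit=86400):
        self.resolver = resolver          # callable: name -> list of IPs
        self.ttl = ttl                    # how long an answer counts as fresh
        self.stale_limit = stale_limit    # how long stale data may be served
        self.cache = {}                   # name -> (answer, fetched_at)

    def resolve(self, name):
        entry = self.cache.get(name)
        now = time.time()
        if entry and now - entry[1] < self.ttl:
            return entry[0]               # fresh answer from cache
        try:
            answer = self.resolver(name)
            self.cache[name] = (answer, now)
            return answer
        except Exception:
            # Upstream failed (think SERVFAIL): fall back to stale data if any.
            if entry and now - entry[1] < self.stale_limit:
                return entry[0]
            raise

# Hypothetical resolver that succeeds once, then starts failing.
def flaky_resolver(name, _state={"ok": True}):
    if _state["ok"]:
        _state["ok"] = False
        return ["157.240.1.35"]
    raise RuntimeError("SERVFAIL")

cache = ServeStaleCache(flaky_resolver, ttl=0)
print(cache.resolve("facebook.com"))      # fresh lookup
print(cache.resolve("facebook.com"))      # stale answer served despite the failure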

The main explanation is the accidental outage. This is what Facebook engineers explain in the blogpost and other accounts, and that’s what seems to have happened. However, there are alternative hypotheses floating around, so let me briefly discuss all of the options.

  • Accidental outage due to misconfiguration – a very likely scenario. These things can happen to anyone, and Facebook is known for its “break things” mentality, so it's not unlikely that they just didn't have the right safeguards in place and that someone ran a buggy update. There are many scenarios for why and how that may have happened, and we can't know from the outside (even after Facebook's brief description). This remains the primary explanation, following my favorite Hanlon's razor. A bug in the audit tool is absolutely realistic (by the way, I'd love Facebook to publish their internal tools).
  • Cyber attack – the data we have can't rule this out, but it would require a sophisticated attack that gained access to their BGP administration interface, which I would assume is properly protected. Not impossible, but a 6-hour outage of a social network is not something a sophisticated actor (e.g. a nation state) would invest resources in. We can't rule it out, as this might be “just a drill” for something bigger to follow. If I were an attacker that wanted to take Facebook down, I'd try to kill their DNS servers, or indeed, “de-route” them. If we didn't know that Facebook lets its DNS servers cut themselves off from the network in case of failures, the fact that so few prefixes were updated might be an indicator of a targeted attack, but this seems less and less likely.
  • Deliberate self-sabotage – 1.5 billion records were claimed to have been leaked yesterday. At the same time, a Facebook whistleblower is testifying in the US Congress. Both stories are potentially damaging to Facebook's reputation and share price. If they wanted to drown that news and the respective share price plunge in a technical story that few people understand but everyone is talking about (and then have their share price rebound, because technical issues happen to everyone), then that's the way to do it – just as a malicious actor would, but without all the hassle of gaining access from outside – de-route the prefixes for the DNS servers and you have a “perfect” outage. These coincidences have led people to assume such a plot, but given the observed outage and the explanation Facebook gave for why the DNS prefixes were automatically withdrawn, this sounds unlikely.

Distinguishing between the three options is actually hard. You can mask a deliberate outage as an accident, and a malicious actor can make an attack look like deliberate self-sabotage. That's why there are speculations. To me, however, based on all the data we have in RIPEstat and the various accounts by Cloudflare, Facebook and other experts, it seems that a chain of mistakes (operational and possibly design ones) led to this.

The post Hypotheses About What Happened to Facebook appeared first on Bozho's tech blog.

Understanding How Facebook Disappeared from the Internet

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/october-2021-facebook-outage/

Understanding How Facebook Disappeared from the Internet

Understanding How Facebook Disappeared from the Internet

“Facebook can’t be down, can it?”, we thought, for a second.

Today at 1651 UTC, we opened an internal incident entitled “Facebook DNS lookup returning SERVFAIL” because we were worried that something was wrong with our DNS resolver 1.1.1.1.  But as we were about to post on our public status page we realized something else more serious was going on.

Social media quickly burst into flames, reporting what our engineers rapidly confirmed too. Facebook and its affiliated services WhatsApp and Instagram were, in fact, all down. Their DNS names stopped resolving, and their infrastructure IPs were unreachable. It was as if someone had “pulled the cables” from their data centers all at once and disconnected them from the Internet.

How’s that even possible?

Meet BGP

BGP stands for Border Gateway Protocol. It’s a mechanism to exchange routing information between autonomous systems (AS) on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to their final destinations. Without BGP, the Internet routers wouldn’t know what to do, and the Internet wouldn’t work.

The Internet is literally a network of networks, and it's bound together by BGP. BGP allows one network (say Facebook) to advertise its presence to the other networks that form the Internet. As we write this, Facebook is not advertising its presence; ISPs and other networks can't find Facebook's network, and so it is unavailable.

The individual networks each have an ASN: an Autonomous System Number. An Autonomous System (AS) is an individual network with a unified internal routing policy. An AS can originate prefixes (say that they control a group of IP addresses), as well as transit prefixes (say they know how to reach specific groups of IP addresses).

Cloudflare’s ASN is AS13335. Every ASN needs to announce its prefix routes to the Internet using BGP; otherwise, no one will know how to connect and where to find us.

Our learning center has a good overview of what BGP and ASNs are and how they work.

In this simplified diagram, you can see six autonomous systems on the Internet and two possible routes that one packet can use to go from Start to End. AS1 → AS2 → AS3 being the fastest, and AS1 → AS6 → AS5 → AS4 → AS3 being the slowest, but that can be used if the first fails.

Understanding How Facebook Disappeared from the Internet

At 1658 UTC we noticed that Facebook had stopped announcing the routes to their DNS prefixes. That meant that, at least, Facebook’s DNS servers were unavailable. Because of this Cloudflare’s 1.1.1.1 DNS resolver could no longer respond to queries asking for the IP address of facebook.com or instagram.com.

route-views>show ip bgp 185.89.218.0/23
% Network not in table
route-views>

route-views>show ip bgp 129.134.30.0/23
% Network not in table
route-views>

Meanwhile, other Facebook IP addresses remained routed but weren’t particularly useful since without DNS Facebook and related services were effectively unavailable:

route-views>show ip bgp 129.134.30.0   
BGP routing table entry for 129.134.0.0/17, version 1025798334
Paths: (24 available, best #14, table default)
  Not advertised to any peer
  Refresh Epoch 2
  3303 6453 32934
    217.192.89.50 from 217.192.89.50 (138.187.128.158)
      Origin IGP, localpref 100, valid, external
      Community: 3303:1004 3303:1006 3303:3075 6453:3000 6453:3400 6453:3402
      path 7FE1408ED9C8 RPKI State not found
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
route-views>

We keep track of all the BGP updates and announcements we see in our global network. At our scale, the data we collect gives us a view of how the Internet is connected and where the traffic is meant to flow from and to everywhere on the planet.

A BGP UPDATE message informs a router of any changes you’ve made to a prefix advertisement or entirely withdraws the prefix. We can clearly see this in the number of updates we received from Facebook when checking our time-series BGP database. Normally this chart is fairly quiet: Facebook doesn’t make a lot of changes to its network minute to minute.

But at around 15:40 UTC we saw a peak of routing changes from Facebook. That’s when the trouble began.

Understanding How Facebook Disappeared from the Internet

If we split this view by routes announcements and withdrawals, we get an even better idea of what happened. Routes were withdrawn, Facebook’s DNS servers went offline, and one minute after the problem occurred, Cloudflare engineers were in a room wondering why 1.1.1.1 couldn’t resolve facebook.com and worrying that it was somehow a fault with our systems.

Understanding How Facebook Disappeared from the Internet

With those withdrawals, Facebook and its sites had effectively disconnected themselves from the Internet.

DNS gets affected

As a direct consequence of this, DNS resolvers all over the world stopped resolving their domain names.

➜  ~ dig @1.1.1.1 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com.			IN	A
➜  ~ dig @1.1.1.1 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com.			IN	A
➜  ~ dig @8.8.8.8 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com.			IN	A
➜  ~ dig @8.8.8.8 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com.			IN	A

This happens because DNS, like many other systems on the Internet, has its own routing mechanism. When someone types the https://facebook.com URL in the browser, the DNS resolver, responsible for translating domain names into actual IP addresses to connect to, first checks if it has something in its cache and uses it. If not, it tries to grab the answer from the domain's nameservers, typically hosted by the entity that owns the domain.

If the nameservers are unreachable or fail to respond because of some other reason, then a SERVFAIL is returned, and the browser issues an error to the user.
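For illustration, here is roughly the same check the dig commands above perform, sketched with the third-party dnspython package (an assumption: it is not part of the tooling used in this post). It sends a query straight to a resolver and prints the response code, which during the outage would have been SERVFAIL:

import dns.exception
import dns.message
import dns.query
import dns.rcode
import dns.rdatatype

def check(name, resolver_ip="1.1.1.1"):
    # Build an A query and send it over UDP directly to the resolver.
    query = dns.message.make_query(name, dns.rdatatype.A)
    try:
        response = dns.query.udp(query, resolver_ip, timeout=5)
        return dns.rcode.to_text(response.rcode())   # e.g. NOERROR or SERVFAIL
    except dns.exception.Timeout:
        return "TIMEOUT"

for domain in ("facebook.com", "whatsapp.com", "cloudflare.com"):
    print(domain, check(domain))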

Again, our learning center provides a good explanation on how DNS works.

Understanding How Facebook Disappeared from the Internet

Because Facebook stopped announcing its DNS prefix routes through BGP, our and everyone else's DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.

But that’s not all. Now human behavior and application logic kicks in and causes another exponential effect. A tsunami of additional DNS traffic follows.

This happened in part because apps won’t accept an error for an answer and start retrying, sometimes aggressively, and in part because end-users also won’t take an error for an answer and start reloading the pages, or killing and relaunching their apps, sometimes also aggressively.

This is the traffic increase (in number of requests) that we saw on 1.1.1.1:

Understanding How Facebook Disappeared from the Internet

So now, because Facebook and their sites are so big, we have DNS resolvers worldwide handling 30x more queries than usual and potentially causing latency and timeout issues to other platforms.

Fortunately, 1.1.1.1 was built to be Free, Private, Fast (as the independent DNS monitor DNSPerf can attest), and scalable, and we were able to keep servicing our users with minimal impact.

The vast majority of our DNS requests kept resolving in under 10ms. At the same time, a small fraction of p95 and p99 percentile requests saw increased response times, probably due to expired TTLs forcing resolvers to query the Facebook nameservers and time out. The 10-second DNS timeout limit is well known amongst engineers.

Understanding How Facebook Disappeared from the Internet

Impacting other services

People look for alternatives and want to know more or discuss what’s going on. When Facebook became unreachable, we started seeing increased DNS queries to Twitter, Signal and other messaging and social media platforms.

Understanding How Facebook Disappeared from the Internet

We can also see another side effect of this unreachability in our WARP traffic to and from Facebook’s affected ASN 32934. This chart shows how traffic changed from 15:45 UTC to 16:45 UTC compared with three hours before in each country. All over the world WARP traffic to and from Facebook’s network simply disappeared.

Understanding How Facebook Disappeared from the Internet

The Internet

Today’s events are a gentle reminder that the Internet is a very complex and interdependent system of millions of systems and protocols working together. That trust, standardization, and cooperation between entities are at the center of making it work for almost five billion active users worldwide.

Update

At around 21:00 UTC we saw renewed BGP activity from Facebook’s network which peaked at 21:17 UTC.

Understanding How Facebook Disappeared from the Internet

This chart shows the availability of the DNS name ‘facebook.com’ on Cloudflare’s DNS resolver 1.1.1.1. It stopped being available at around 15:50 UTC and returned at 21:20 UTC.

Understanding How Facebook Disappeared from the Internet

Undoubtedly the Facebook, WhatsApp and Instagram services will take further time to come fully online, but as of 22:28 UTC Facebook appears to be reconnected to the global Internet, with DNS working again.

Fall 2020 RPKI Update

Post Syndicated from Louis Poinsignon original https://blog.cloudflare.com/rpki-2020-fall-update/

Fall 2020 RPKI Update

The Internet is a network of networks. In order to find the path between two points and exchange data, the network devices rely on the information from their peers. This information consists of IP addresses and Autonomous Systems (AS) which announce the addresses using Border Gateway Protocol (BGP).

One problem arises from this design: what protects against a malevolent peer who decides to announce incorrect information? The damage caused by route hijacks can be major.

Resource Public Key Infrastructure (RPKI) is a framework created in 2008. Its goal is to provide a source of truth for Internet Resources (IP addresses) and ASes in cryptographically signed records called Route Origin Authorizations (ROAs).

Recently, we've seen the significant threshold of two hundred thousand ROAs being passed. This represents a big step in making the Internet more secure against accidental and deliberate BGP tampering.

We have talked about RPKI in the past but we thought it would be a good time for an update.

In a more technical context, the RPKI framework consists of two parts:

  • IP addresses need to be cryptographically signed by their owners in a database managed by a Trust Anchor: Afrinic, APNIC, ARIN, LACNIC and RIPE. Those five organizations are in charge of allocating Internet resources. The ROA indicates which Network Operator is allowed to announce the addresses using BGP.
  • Network operators download the list of ROAs, perform the cryptographic checks and then apply filters on the prefixes they receive: this is called BGP Origin Validation.
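As a rough sketch of the Origin Validation step described in the second bullet: given a table of ROAs (prefix, maximum length, origin ASN), a received announcement is valid, invalid, or simply unknown. The ROA entries below are illustrative, not an export from a real validator.

import ipaddress

ROAS = [
    # (prefix, max_length, origin_asn) -- illustrative entries only
    ("1.9.0.0/16", 24, 4788),
    ("103.21.244.0/23", 23, 13335),
]

def validate(prefix, origin_asn):
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_len, roa_asn in ROAS:
        roa_net = ipaddress.ip_network(roa_prefix)
        if net.version == roa_net.version and net.subnet_of(roa_net):
            covered = True
            if origin_asn == roa_asn and net.prefixlen <= max_len:
                return "VALID"
    return "INVALID" if covered else "NOT_FOUND"

print(validate("1.9.12.0/24", 4788))      # VALID: covered, right ASN, within max length
print(validate("1.9.12.0/25", 4788))      # INVALID: more specific than the /24 max length
print(validate("1.9.12.0/24", 64512))     # INVALID: covered but wrong origin ASN
print(validate("192.0.2.0/24", 64512))    # NOT_FOUND: no covering ROA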

The “Is BGP Safe Yet” website

The launch of the website isbgpsafeyet.com to test if your ISP correctly performs BGP Origin Validation was a success. Since launch, it has been visited more than five million times from over 223 countries and 13,000 unique networks (20% of the entire Internet), generating half a million BGP Origin Validation tests.

Many providers subsequently indicated on social media (for example, here or here) that they had an RPKI deployment in the works. This increase in Origin Validation by networks is increasing the security of the Internet globally.

The site’s test for Origin Validation consists of queries toward two addresses, one of which is behind an RPKI invalid prefix and the other behind an RPKI valid prefix. If the query towards the invalid succeeds, the test fails as the ISP does not implement Origin Validation. We counted the number of queries that failed to reach invalid.cloudflare.com. This also included a few thousand RIPE Atlas tests that were started by Cloudflare and various contributors, providing coverage for smaller networks.

Every month since launch we’ve seen that around 10 to 20 networks are deploying RPKI Origin Validation. Among the major providers we can build the following table:

Month Networks
August Swisscom (Switzerland), Salt (Switzerland)
July Telstra (Australia), Quadranet (USA), Videotron (Canada)
June Colocrossing (USA), Get Norway (Norway), Vocus (Australia), Hurricane Electric (Worldwide), Cogent (Worldwide)
May Sengked Fiber (Indonesia), Online.net (France), WebAfrica Networks (South Africa), CableNet (Cyprus), IDnet (Indonesia), Worldstream (Netherlands), GTT (Worldwide)

With the help of many contributors, we have compiled a list of network operators and public statements at the top of the isbgpsafeyet.com page.

We excluded providers that manually blocked the traffic towards the prefix instead of using RPKI. Among the techniques we see are firewall filtering and manual prefix rejection. The filtering is often propagated to other customer ISPs. In a unique case, an ISP generated a “more-specific” blackhole route that leaked to multiple peers over the Internet.

The deployment of RPKI by major transit providers, also known as Tier 1s, such as Cogent, GTT, Hurricane Electric, NTT and Telia made many downstream networks more secure without them having to deploy validation software themselves.

Overall, we looked at the evolution of successful tests per ASN and noticed a steady increase of 8% over recent months.

Fall 2020 RPKI Update

Furthermore, when we probed the entire IPv4 space this month, using a similar technique to the isbgpsafeyet.com test, many more networks were unable to reach an RPKI invalid prefix than during the same period last year. This confirms an increase in RPKI Origin Validation deployment across network operators. The picture below shows the IPv4 space behind a network with RPKI Origin Validation enabled in yellow and the active space in blue. It uses a Hilbert Curve to efficiently plot IP addresses: for example, one /20 prefix (4096 IPs) is a pixel, and a /16 prefix (65536 IPs) forms a 4×4 pixel square.

The more the yellow spreads, the safer the Internet becomes.

Fall 2020 RPKI Update

What does it mean exactly? If you were hijacking a prefix, the users behind the yellow space would likely not be affected. This also applies if you mis-sign your prefixes: you would not be able to reach the services or users behind the yellow space. Once RPKI is enabled everywhere, there will only be yellow squares.

Progression of signed prefixes

Owners of IP addresses indicate the networks allowed to announce them. They do this by signing prefixes: they create Route Origin Authorizations (ROAs). As of today, there are more than 200,000 ROAs. The distribution shows that the RIPE region is still leading in ROA count, followed by the APNIC region.

Fall 2020 RPKI Update

2020 started with 172,000 records and the count is getting close to 200,000 at the beginning of November, approximately a quarter of all the Internet routes. Since last year, the database of ROAs grew by more than 70 percent, from 100,000 records, an average pace of 5% every month.

On the following graph of unique ROAs count per day, we can see two points that were followed by a change in ROA creation rate: 140/day, then 231/day, and since August, 351 new ROAs per day.

It is not yet clear what caused the increase in August.

Fall 2020 RPKI Update

Free services and software

In 2018 and 2019, Cloudflare was impacted by BGP route hijacks. Both could have been avoided with RPKI. Not long after the first incident, we started signing prefixes and developing RPKI software. It was necessary to make BGP safer and we wanted to do more than talk about it. But we also needed enough networks to be deploying RPKI as well. By making deployment easier for everyone, we hoped to increase adoption.

The following is a reminder of what we built over the years around RPKI and how it grew.

OctoRPKI is Cloudflare’s open source RPKI Validation software. It periodically generates a JSON document of validated prefixes that we pass onto our routers using GoRTR. It generates most of the data behind the graphs here.

The latest version of OctoRPKI, 1.2.0, was released at the end of October. It implements important security fixes, better memory management, and extended logging. This is the first validator to report detailed information about cryptographically invalid records to Sentry and performance data to distributed tracing tools.
GoRTR remains heavily used in production, including by transit providers. It can natively connect to other validators like rpki-client.

When we released our public rpki.json endpoint in early 2019, the idea was to enable anyone to see what Cloudflare was filtering.

The file is also used as a bootstrap by GoRTR, so that users can test a deployment. The file is cached on more than 200 data centers, ensuring quick and secure delivery of a list of valid prefixes, making RPKI more accessible for smaller networks and developers.

Between March 2019 and November 2020, the number of queries more than doubled and there are five times more networks querying this file.

The growth of queries follows approximately the rate of ROA creation (~5% per month).

Fall 2020 RPKI Update

A public RTR server is also available on rtr.rpki.cloudflare.com. It includes a plaintext endpoint on port 8282 and an SSH endpoint on port 8283. This allows us to test new versions of GoRTR before release.

Later in 2019, we also built a public dashboard where you can see in-depth RPKI validation. With a GraphQL API, you can now explore the validation data, test a list of prefixes, or see the status of the current routing table.

Fall 2020 RPKI Update

Currently, the API is used by BGPalerter, an open-source tool that detects routing issues (including hijacks!) from a stream of BGP updates.

Additionally, starting in November, you can access the historical data from May 2019. Data is computed daily and contains the unique records. The team behind the dashboard worked hard to provide a fast and accurate visualization of the daily ROA changes and the volumes of files changed over the day.

Fall 2020 RPKI Update

The future

We believe RPKI is going to continue growing, and we would like to thank the hundreds of network engineers around the world who are making the Internet routing more secure by deploying RPKI.

25% of routes are signed and 20% of the Internet is doing origin validation, and those numbers grow every day. We believe BGP will be safer well before reaching 100% deployment; for instance, once the remaining transit providers enable Origin Validation, it is unlikely a BGP hijack will make it to the front page of world news outlets.

While difficult to quantify, we believe that critical mass of protected resources will be reached in late 2021.

We will keep improving the tooling; OctoRPKI and GoRTR are open-source and we welcome contributions. In the near future, we plan on releasing a packaged version of GoRTR that can be directly installed on certain routers. Stay tuned!

Is BGP Safe Yet? No. But we are tracking it carefully

Post Syndicated from Louis Poinsignon original https://blog.cloudflare.com/is-bgp-safe-yet-rpki-routing-security-initiative/

Is BGP Safe Yet? No. But we are tracking it carefully

BGP leaks and hijacks have been accepted as an unavoidable part of the Internet for far too long. We relied on protection at the upper layers, like TLS and DNSSEC, to ensure untampered delivery of packets, but a hijacked route often results in an unreachable IP address, which results in an Internet outage.

The Internet is too vital to allow this known problem to continue any longer. It’s time networks prevented leaks and hijacks from having any impact. It’s time to make BGP safe. No more excuses.

Border Gateway Protocol (BGP), a protocol to exchange routes, has existed and evolved since the 1980s. Over the years it has gained security features. The most notable security addition is Resource Public Key Infrastructure (RPKI), a security framework for routing. It has been the subject of a few blog posts following our deployment in mid-2018.

Today, the industry considers RPKI mature enough for widespread use, with a sufficient ecosystem of software and tools, including tools we’ve written and open sourced. We have fully deployed Origin Validation on all our BGP sessions with our peers and signed our prefixes.

However, the Internet can only be safe if the major network operators deploy RPKI. Those networks have the ability to spread a leak or hijack far and wide, and it's vital that they take part in stamping out the scourge of BGP problems, whether inadvertent or deliberate.

Many, like AT&T and Telia, pioneered global deployments of RPKI in 2019. They were successfully followed by Cogent and NTT in 2020. Hundreds of networks of all sizes have done a tremendous job over the last few years, but there is still work to be done.

If we observe the customer-cones of the networks that have deployed RPKI, we see around 50% of the Internet is more protected against route leaks. That’s great, but it’s nothing like enough.

Is BGP Safe Yet? No. But we are tracking it carefully

Today, we are releasing isBGPSafeYet.com, a website to track deployments and filtering of invalid routes by the major networks.

We are hoping this will help the community and we will crowdsource the information on the website. The source code is available on GitHub, we welcome suggestions and contributions.

We expect this initiative will make RPKI more accessible to everyone and ultimately will reduce the impact of route leaks. Share the message with your Internet Service Providers (ISPs), hosting providers, and transit networks to build a safer Internet.

Additionally, to monitor and test deployments, we decided to announce two bad prefixes from our 200+ data centers and via the 233+ Internet Exchange Points (IXPs) we are connected to:

  • 103.21.244.0/24
  • 2606:4700:7000::/48

Both these prefixes should be considered invalid and should not be routed by your provider if RPKI is implemented within their network. This makes it easy to demonstrate how far a bad route can go, and test whether RPKI is working in the real world.

Is BGP Safe Yet? No. But we are tracking it carefully
A Route Origin Authorization for 103.21.244.0/24 on rpki.cloudflare.com

In the test you can run on isBGPSafeYet.com, your browser will attempt to fetch two pages: the first one, valid.rpki.cloudflare.com, is behind an RPKI-valid prefix, and the second one, invalid.rpki.cloudflare.com, is behind the RPKI-invalid prefix.

The test has two outcomes:

  • If both pages were correctly fetched, your ISP accepted the invalid route. It does not implement RPKI.
  • If only valid.rpki.cloudflare.com was fetched, your ISP implements RPKI. You will be less sensitive to route-leaks.
Is BGP Safe Yet? No. But we are tracking it carefully
a simple test of RPKI invalid reachability
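The same test sketched from the command line instead of a browser; the two hostnames are the ones named above, while the timeout value and the interpretation logic are simplifications of what the site actually does:

import socket
import urllib.error
import urllib.request

def reachable(url, timeout=10):
    # A page behind an RPKI-invalid prefix should be unreachable if the local
    # ISP drops invalid routes, so any network error counts as "not reachable".
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except (urllib.error.URLError, socket.timeout):
        return False

valid_ok = reachable("https://valid.rpki.cloudflare.com")
invalid_ok = reachable("https://invalid.rpki.cloudflare.com")

if valid_ok and not invalid_ok:
    print("Your ISP appears to drop RPKI-invalid routes.")
elif valid_ok and invalid_ok:
    print("Both pages loaded: your ISP does not filter RPKI-invalid routes.")
else:
    print("Inconclusive: even the valid test page could not be fetched.")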

We will be performing tests using those prefixes to check for propagation. Traceroutes and probing helped us in the past by creating visualizations of deployment.

A simple indicator is the number of networks sending the accepted route to their peers and collectors:

Is BGP Safe Yet? No. But we are tracking it carefully
Routing status from online route collection tool RIPE Stat

In December 2019, we released a Hilbert curve map of the IPv4 address space. Every pixel represents a /20 prefix. If a dot is yellow, the prefix responded only to the probe from a RPKI-valid IP space. If it is blue, the prefix responded to probes from both RPKI valid and invalid IP space.

To summarize, the yellow areas are IP space behind networks that drop RPKI invalid prefixes. The Internet isn’t safe until the blue becomes yellow.

Is BGP Safe Yet? No. But we are tracking it carefully
Hilbert Curve Map of IP address space behind networks filtering RPKI invalid prefixes

Last but not least, we would like to thank every network that has already deployed RPKI and every developer that contributed to validator-software code bases. The last two years have shown that the Internet can become safer and we are looking forward to the day where we can call route leaks and hijacks an incident of the past.

RPKI and the RTR protocol

Post Syndicated from Martin J Levy original https://blog.cloudflare.com/rpki-and-the-rtr-protocol/

RPKI and the RTR protocol

Today’s Internet requires stronger protection within its core routing system and as we have already said: it’s high time to stop BGP route leaks and hijacks by deploying operationally-excellent RPKI!

Luckily, over the last year or so, a lot of good work has happened in this arena. If you've been following the growth of RPKI's validation data, then you'll know that more and more networks are signing their routes and creating ROAs, or Route Origin Authorizations. These are cryptographically signed assertions of the validity of an announced IP block, and they contribute to further securing the global routing table that makes for a safer Internet.

The protocol that we have not written much about is RTR. The Resource Public Key Infrastructure (RPKI) to Router Protocol – or RTR Protocol for short. Today we’re fixing that.

RPKI rewind

We have written a few times about RPKI (here and here). We have written about how Cloudflare both signs its announced routes and filters its routing inbound from other networks (both transits and peers) using RPKI data. We also added our efforts in the open-source software space with the release of the Cloudflare RPKI Toolkit.

The primary part of the RPKI (Resource Public Key Infrastructure) system is a cryptographically-signed database which is read and processed by a RPKI validator. The validator works with the published ROAs to build a list of validated routes. A ROA consists of an IP address block plus an ASN (Autonomous System Number) that together define who can announce which IP block.

RPKI and the RTR protocol

After that step, it is then the job of that validator (or some associated software module) to communicate its list of valid routes to an Internet router. That’s where the RTR protocol (the RPKI to Router Protocol) comes in. Its job is to communicate between the validator and device in charge of allowing or rejecting routes in its table.

RTR

The IETF defines the RTR protocol in RFC 8210. This blog post focuses on version 1 and ignores previous versions.

In order to verifiably validate the origin Autonomous Systems and Autonomous System Paths of BGP announcements, routers need a simple but reliable mechanism to receive Resource Public Key Infrastructure (RFC 6480) prefix origin data and router keys from a trusted cache. This document describes a protocol to deliver them.

This document describes version 1 of the RPKI-Router protocol.

The Internet's routers are, to put it bluntly, not the best place to run a routing table's cryptographic processing. The RTR protocol allows the heavy lifting to be done outside of the valuable processing modules that routers have. RTR is a very lightweight protocol with a low memory footprint. The router simply decides yay-or-nay when a route is received (called an “announce” in BGP speak), and hence the router never needs to touch the complex cryptographic validation algorithms. It also provides some isolation from the outside world, where certificates need to be fetched from across the globe and then stored, checked, processed, and kept in a local database. In many cases the control plane (where RTR communication happens) exists on private or protected networks. Separation is a good thing.

Cloudflare’s open-source software for RPKI validation also includes GoRTR, an implementation of the RTR protocol. As mentioned, in Cloudflare’s operational model, we separate the validation (done with OctoRPKI) from the RTR process.

RPKI and the RTR protocol

RTR protocol implementations are also provided in other RPKI validation software packages. In fact, RPKI is unable to filter routes without the final step of running RTR (or something similar – should it exist). Here’s a current list of RPKI software packages that either validate or validate and run RTR.

Each of these open source software packages has its own specific database model and operational methods. Because GoRTR reads a somewhat common JSON file format, you can mix and match between different validators and GoRTR’s code.

The RTR protocol

The protocol’s core is all about synchronizing a database between a validator and a router. This is done using serial-numbers and session-ids.

It's kicked off by a router setting up a TCP connection towards a backend RTR server, followed by a series of serial-number and data-record exchanges, such that a cache on the validator (or RTR server) can be synced fully with a cache on the router. As mentioned, the lightweight protocol is void of all the cryptographic data that RPKI is built upon and simply deals with the validated routing list, which consists of CIDRs, ASNs and maybe a MaxLength parameter.

Here’s a simple Cisco configuration for enabling RTR on a router:

router bgp 65001
 rpki server 192.168.1.100
  transport tcp port 8282
 !
!

The configuration can take additional parameters in order to enable SSH or similar transport options. Other platforms (such as Juniper, Arista, Bird 2.0, etc) have their own specific configuration language.

The RTR protocol supports IPv4 and IPv6 routing information (as you would expect).

RPKI and the RTR protocol

Being specified as a lightweight protocol, RTR allows the data to be transferred quickly. With a session-id created by the RTR cache server, plus serial-numbers exchanged between cache servers and routers, route authentication data on the router can be kept fresh with a minimal amount of actual data being transferred. Remember, as we said above, the router has much better things to do with its control plane processor, like running the BGP convergence algorithm, or SRv6, or ISIS, or any of the protocols needed to manage routing tables.
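To make the exchange concrete, here is a minimal sketch of the router's side of an initial sync: open a TCP session to an RTR cache, send a Reset Query, and collect IPv4 prefix records until End of Data. The PDU layouts follow RFC 8210 version 1, and the host/port are the public plaintext cache used later in this post; this is an illustration only, not the rpki-rtr-client tool.

import socket
import struct

HOST, PORT = "rtr.rpki.cloudflare.com", 8282    # public plaintext RTR cache
RESET_QUERY = struct.pack("!BBHI", 1, 2, 0, 8)  # version 1, type 2, zero, length 8

def read_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("cache closed the session")
        buf += chunk
    return buf

vrps = []
with socket.create_connection((HOST, PORT), timeout=30) as sock:
    sock.sendall(RESET_QUERY)
    while True:
        header = read_exact(sock, 8)
        version, pdu_type, session_id, length = struct.unpack("!BBHI", header)
        body = read_exact(sock, length - 8)
        if pdu_type == 4:                       # IPv4 Prefix PDU (20 bytes total)
            flags, plen, maxlen, _, addr, asn = struct.unpack("!BBBB4sI", body)
            vrps.append((socket.inet_ntoa(addr), plen, maxlen, asn, flags & 1))
        elif pdu_type == 7:                     # End of Data: initial sync complete
            break

print(f"received {len(vrps)} IPv4 records, for example: {vrps[:3]}")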

Is RTR a weak link in the RPKI story?

All aspects of RPKI data processing are built around solid cryptographic principles. The five RIRs each hold a root key called a Trust Anchor (TA). Each publishes data fully signed up/down so that every piece of information can be proven to be correct and untampered with. A validator's job is to do that processing and spit out (or store) a list of valid ROAs (Route Origin Authorizations) that are assertions traceable back to a known source. If you want to study this protocol, you can start with RFC6480 and work forward through all the other relevant RFCs (Hint: it's at least thirty more RFCs from RFC6483 through RFC8210 and counting).

However, RTR does not carry that trust through to the Internet router. All that complexity (and hence those assertions) is stripped away before a router sees anything. It is 100% up to the network operator to build a reliable and secure path between the validator or RTR cache and the router, so that this lightweight transfer is still trusted.

RTR helps somewhat in this space by providing more than one way to communicate between cache server and router. The RFC defines the following transport methods.

  • A plain TCP connection (which is clearly insecure). In this case the RFC states: “the cache and routers MUST be on the same trusted and controlled network.”.
  • A TCP connection with TCP-AO transport.
  • A Secure Shell version 2 (SSHv2) transport.
  • A TCP connection with TCP MD5 transport (which is already obsoleted by TCP-AO).
  • A TCP connection over IPsec transport.
  • Transport Layer Security (TLS) transport.

This plethora of options is all well and good; however, there’s no useful implementation of TCP-AO out in the production world and hence (ironically) a lot of early implementations are living with plain-text communications. SSH and TLS are much better options; however, this comes with classic operational problems to solve. For example, in SSH’s case, the RFC states:

It is assumed that the router and cache have exchanged keys out of band by some reasonably secured means.

For a TLS connection, there’s also some worthwhile security setup mentioned in the RFC. It starts off as follows:

Client routers using TLS transport MUST present client-side certificates to authenticate themselves to the cache in order to allow the cache to manage the load by rejecting connections from unauthorized routers.

Then the RFC continues with enough information to secure the connection fully. If implemented correctly, the connection between RTR cache and router is secured such that no MITM attack can take place.

Assuming these operational issues are handled fully, the RTR protocol is a perfect fit for operationally implementing RPKI's final linkage into the routers.

Testing the RTR protocol and open-source rpki-rtr-client

A modern router software stack can be configured to run RTR against a cache. If you have a test lab (as most modern networks do), then you have all you need to see RPKI route filtering (and the dropping of invalid routes).

However, if you are without a router and want to see RTR in action, Cloudflare has just placed rpki-rtr-client on GitHub. This software, written in Python, performs the router portion of the RTR protocol and comes with enough debug output that it can also be used to help write new RTR caches, or test existing code bases. The code was written directly from the RFC and then tested against a public RTR cache that Cloudflare operates.

$ pip3 install netaddr
...
$ git clone https://github.com/cloudflare/rpki-rtr-client.git
...
$ cd rpki-rtr-client
$

Operating the client is easy (and doubly-so if you use the Cloudflare provided cache).

$ ./rtr_client.py -h rtr.rpki.cloudflare.com -p 8282
...
^C
$

As there is no router (and hence no dropping of invalids) this code simply creates data files for later review. See the README file for more information.

$ ls -lt data/2020-02
total 21592
-rw-r--r--  1 martin martin  5520676 Feb 16 18:22 2020-02-17-022209.routes.00000365.json
-rw-r--r--  1 martin martin  5520676 Feb 16 18:42 2020-02-17-024242.routes.00000838.json
-rw-r--r--  1 martin martin      412 Feb 16 19:56 2020-02-17-035645.routes.00000841.json
-rw-r--r--  1 martin martin      272 Feb 16 20:16 2020-02-17-041647.routes.00000842.json
-rw-r--r--  1 martin martin      643 Feb 16 20:36 2020-02-17-043649.routes.00000843.json
$

As the RTR protocol communicates and increments its serial-number, the rpki-rtr-client software writes the routing information to a fresh file for later review.

$ for f in data/2020-02/*.json ; do echo "$f `jq -r '.routes.announce[]|.ip' < $f | wc -l` `jq -r '.routes.withdraw[]|.ip' < $f | wc -l`" ; done
data/2020-02/2020-02-17-022209.routes.00000365.json   128483        0
data/2020-02/2020-02-17-024242.routes.00000838.json   128483        0
data/2020-02/2020-02-17-035645.routes.00000841.json        3        6
data/2020-02/2020-02-17-041647.routes.00000842.json        5        0
data/2020-02/2020-02-17-043649.routes.00000843.json        9        5
$

Valid ROAs are listed as follows:

$ jq -r '.routes.announce[]|.ip,.asn,.maxlen' data/2020-02/*0838.json | paste - - - | sort -V | head
1.0.0.0/24      13335   null
1.1.1.0/24      13335   null
1.9.0.0/16      4788    24
1.9.12.0/24     65037   null
1.9.21.0/24     24514   null
1.9.23.0/24     65120   null
1.9.31.0/24     65077   null
1.9.65.0/24     24514   null
1.34.0.0/15     3462    24
1.36.0.0/16     4760    null
$

The code can also dump the raw binary protocol and then replay that data to debug the protocol.

As the code is on GitHub, any protocol developer can feel free to expand on the code.

Future of RTR protocol

The present RFC defines version 1 of the protocol, and it is expected that the protocol will progress to include additional functions while staying lightweight. RPKI is a Route Origin Validation protocol (i.e. mapping an IP route or CIDR to an ASN). It does not provide support for validating the AS-PATH. Neither does it provide any support for IRR databases (which are non-cryptographically-signed routing definitions). Presently, IRR data is the primary method used for filtering routing on the global Internet. Today that is done by building massive filter lists within a router's configuration file and not via a lightweight protocol like RTR.

At the present time there's an IETF proposal for RTR version 2. This is draft work that goes alongside the ASPA (Autonomous System Provider Authorization) drafts. These draft documents from Alexander Azimov et al. define ASPA, extending the RPKI data structures to handle BGP path information. Version 2 of the RTR protocol should provide the required messaging to move ASPA data into the router.

RPKI and the RTR protocol

Additionally, RPKI is potentially going to expand further, at some point, beyond today's single data type (the ROA object). Just like with the ASPA draft, RTR will need to advance in lock-step. Hopefully the open-source code we have published will help this effort.

Some final thoughts on RTR and RPKI

If RPKI is to become ubiquitous, then RTR support in all BGP speaking Internet routers is going to be required. Vendors need to complete their RTR software delivery and additionally support some of the more secure transport definitions from the RFC. Additionally, should the protocol advance, then timely support for the new version will be needed.

Cloudflare continues to be committed to a secure Internet; so if you share these thoughts and you like what you've read here or elsewhere on our blog, then please take a look at our jobs page. We have software and network engineering roles open in many of our offices around the world.

Using Machine Learning to Detect IP Hijacking

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2019/10/using_machine_l_1.html

This is interesting research:

In a BGP hijack, a malicious actor convinces nearby networks that the best path to reach a specific IP address is through their network. That’s unfortunately not very hard to do, since BGP itself doesn’t have any security procedures for validating that a message is actually coming from the place it says it’s coming from.

[…]

To better pinpoint serial attacks, the group first pulled data from several years’ worth of network operator mailing lists, as well as historical BGP data taken every five minutes from the global routing table. From that, they observed particular qualities of malicious actors and then trained a machine-learning model to automatically identify such behaviors.

The system flagged networks that had several key characteristics, particularly with respect to the nature of the specific blocks of IP addresses they use:

  • Volatile changes in activity: Hijackers’ address blocks seem to disappear much faster than those of legitimate networks. The average duration of a flagged network’s prefix was under 50 days, compared to almost two years for legitimate networks.
  • Multiple address blocks: Serial hijackers tend to advertise many more blocks of IP addresses, also known as “network prefixes.”

  • IP addresses in multiple countries: Most networks don’t have foreign IP addresses. In contrast, for the networks that serial hijackers advertised that they had, they were much more likely to be registered in different countries and continents.

Note that this is much more likely to detect criminal attacks than nation-state activities. But it’s still good work.

Academic paper.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Post Syndicated from Martin J Levy original https://blog.cloudflare.com/the-deep-dive-into-how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-monday/

A recap on what happened Monday

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

On Monday we wrote about a painful Internet-wide route leak. We wrote that this should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. That blog entry came out around 19:58 UTC, just over seven hours after the route leak finished (which, as we will see below, was around 12:39 UTC). Today we will dive into the archived routing data and analyze it. The format of the code below is meant to use simple shell commands so that any reader can follow along and, more importantly, do their own investigations on the routing tables.

This was a very public BGP route leak event. It was reported online by many news outlets, and the event's BGP data was shared via social media as it was happening. Andree Toonk tweeted a quick list of 2,400 ASNs that were affected.


Using RIPE NCC archived data

The RIPE NCC operates a very useful archive of BGP routing. It runs collectors globally and provides an API for querying the data. More can be seen at https://stat.ripe.net/. In the world of BGP all routing is public (within the ability of anyone collecting data to have enough collection points). The archived data is very valuable for research, and that's what we will use in this blog. The site can also create some very useful data visualizations.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Dumping the RIPEstat data for this event

Presently, the RIPEstat data gets ingested around eight to twelve hours after real-time. It’s not meant to be a real-time service. The data can be queried in many ways, including a full web interface and an API. We are using the API to extract the data in a JSON format.

We are going to focus only on the Cloudflare routes that were leaked. Many other ASNs were leaked (see the tweet above); however, we want to deal with a finite data set and focus on what happened to Cloudflare’s routing. All the commands below can be run with ease on many systems. The following was done on MacBook Pro running macOS Mojave.

First we collect 24 hours of route announcements and AS-PATH data that RIPEstat sees coming from AS13335 (Cloudflare).

$ # Collect 24 hours of data - more than enough
$ ASN="AS13335"
$ START="2019-06-24T00:00:00"
$ END="2019-06-25T00:00:00"
$ ARGS="resource=${ASN}&starttime=${START}&endtime=${END}"
$ URL="https://stat.ripe.net/data/bgp-updates/data.json?${ARGS}"
$ # Fetch the data from RIPEstat
$ curl -sS "${URL}" | jq . > 13335-routes.json
$ ls -l 13335-routes.json
-rw-r--r--  1 martin  staff  339363899 Jun 25 08:47 13335-routes.json
$

That’s 340MB of data – which seems like a lot, but it contains plenty of white space and plenty of data we just don’t need. Our second task is to reduce this raw data down to just the required data – that’s timestamps, actual routes, and AS-PATH. The third item will be very useful. Note we are using jq, which can be installed on macOS with the  brew package manager.

$ # Extract just the times, routes, and AS-PATH
$ jq -rc '.data.updates[]|.timestamp,.attrs.target_prefix,.attrs.path' < 13335-routes.json | paste - - - > 13335-listing-a.txt
$ wc -l 13335-listing-a.txt
691318 13335-listing-a.txt
$

We are down to just below seven hundred thousand routing events, however, that’s not a leak, that’s everything that includes Cloudflare’s ASN (the number 13335 above). For that we need to go back to Monday’s blog and realize it was AS396531 (Allegheny Technologies) that showed up with 701 (Verizon) in the leak. Now we reduce the data further:

$ # Extract the route leak 701,396531
$ # AS701 is Verizon and AS396531 is Allegheny Technologies
$ egrep '701,396531' < 13335-listing-a.txt > 13335-listing-b.txt
$ wc -l 13335-listing-b.txt
204568 13335-listing-b.txt
$

At 204 thousand data points, we are looking better. It’s still a lot of data because BGP can be very chatty if topology is changing. A route leak will cause exactly that. Now let’s see how many routes were affected:

$ # Extract the actual routes affected
$ cut -f2 < 13335-listing-b.txt | sort -n -u > 13335-listing-c.txt
$ wc -l 13335-listing-c.txt
20 13335-listing-c.txt
$

It’s a much smaller number. We now have a listing of at least twenty routes that were leaked via Verizon. This may not be the full list because route collectors like RIPEstat don’t have direct feeds from Verizon, so this data is a blended view with Verizon’s path and other paths. We can see that if we look at the AS-PATH in the above files. Here’s the listing of affected routes.

$ cat 13335-listing-c.txt
8.39.214.0/24
8.42.245.0/24
8.44.58.0/24
104.16.80.0/21
104.17.168.0/21
104.18.32.0/21
104.19.168.0/21
104.20.64.0/21
104.22.8.0/21
104.23.128.0/21
104.24.112.0/21
104.25.144.0/21
104.26.0.0/21
104.27.160.0/21
104.28.16.0/21
104.31.0.0/21
141.101.120.0/23
162.159.224.0/21
172.68.60.0/22
172.69.116.0/22
$

This is an interesting list, as some of these routes do not originate from Cloudflare's network; however, they show up with AS13335 (our ASN) as the originator. For example, the 104.26.0.0/21 route is not announced from our network, but we do announce 104.26.0.0/20 (which covers that route). More importantly, we have an IRR route object plus an RPKI ROA for that block. Here's the IRR object:

route:          104.26.0.0/20
origin:         AS13335
source:         ARIN

And here's the RPKI ROA. This ROA has its maxlength set to 20, so no more-specific (smaller) route should be accepted.

Prefix:       104.26.0.0/20
Max Length:   /20
ASN:          13335
Trust Anchor: ARIN
Validity:     Thu, 02 Aug 2018 04:00:00 GMT - Sat, 31 Jul 2027 04:00:00 GMT
Emitted:      Thu, 02 Aug 2018 21:45:37 GMT
Name:         535ad55d-dd30-40f9-8434-c17fc413aa99
Key:          4a75b5de16143adbeaa987d6d91e0519106d086e
Parent Key:   a6e7a6b44019cf4e388766d940677599d0c492dc
Path:         rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-...

The Max Length field in a ROA sets the most-specific (longest) prefix length that may be accepted for that announcement. The fact that this is a /20 route with a /20 Max Length says that a /21 (or /22 or /23 or /24) within this IP space isn't allowed. Looking further at the route list above, we get the following listing:

Route Seen            Cloudflare IRR & ROA    ROA Max Length
104.16.80.0/21    ->  104.16.80.0/20          /20
104.17.168.0/21   ->  104.17.160.0/20         /20
104.18.32.0/21    ->  104.18.32.0/20          /20
104.19.168.0/21   ->  104.19.160.0/20         /20
104.20.64.0/21    ->  104.20.64.0/20          /20
104.22.8.0/21     ->  104.22.0.0/20           /20
104.23.128.0/21   ->  104.23.128.0/20         /20
104.24.112.0/21   ->  104.24.112.0/20         /20
104.25.144.0/21   ->  104.25.144.0/20         /20
104.26.0.0/21     ->  104.26.0.0/20           /20
104.27.160.0/21   ->  104.27.160.0/20         /20
104.28.16.0/21    ->  104.28.16.0/20          /20
104.31.0.0/21     ->  104.31.0.0/20           /20
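A quick way to reproduce the table above in code: each leaked /21 sits inside a /20 signed with a Max Length of /20, so every one of them is RPKI-invalid. The three sample rows below come from the table; the check assumes, as shown above, that the covering ROAs exist with Max Length /20.

import ipaddress

leaked = ["104.16.80.0/21", "104.17.168.0/21", "104.26.0.0/21"]   # sample rows

for route in leaked:
    net = ipaddress.ip_network(route)
    covering = net.supernet(new_prefix=20)   # the /20 that the ROA signs
    invalid = net.prefixlen > 20             # ROA Max Length is /20
    print(f"{route} is inside {covering}; RPKI-invalid: {invalid}")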

So how did all these /21’s show up? That’s where we dive into the world of BGP route optimization systems and their propensity to synthesize routes that should not exist. If those routes leak (and it’s very clear after this week that they can), all hell breaks loose. That can be compounded when not one, but two ISPs allow invalid routes to be propagated outside their autonomous network. We will explore the AS-PATH further down this blog.

More than 20 years ago, RFC1997 added the concept of “communities” to BGP. Communities are a way of tagging or grouping route advertisements. Communities are often used to label routes so that specific handling policies can be applied. RFC1997 includes a small number of universal “well-known” communities. One of these is the NO_EXPORT community, which has the following specification:

    All routes received carrying a communities attribute
    containing this value MUST NOT be advertised outside a BGP
    confederation boundary (a stand-alone autonomous system that
    is not part of a confederation should be considered a
    confederation itself).

The use of the NO_EXPORT community is very common within BGP enabled networks and is a community tag that would have helped alleviate this route leak immensely.

How BGP route optimization systems work (or don’t work in this case) can be a subject for a whole other blog entry.

Timing of the route leak

As we saved away the timestamps in the JSON file and in the text files, we can confirm the time for every route in the route leak by looking at the first and the last timestamp of a route in the data. We saved data from 00:00:00 UTC until 00:00:00 the next day, so we know we have covered the period of the route leak. We write a script that checks the first and last entry for every route and report the information sorted by start time:

$ while read cidr
do
  echo $cidr
  fgrep $cidr < 13335-listing-b.txt | head -1 | cut -f1
  fgrep $cidr < 13335-listing-b.txt | tail -1 | cut -f1
done < 13335-listing-c.txt |\
paste - - - | sort -k2,3 | column -t | sed -e 's/2019-06-24T//g'
104.25.144.0/21   10:34:25  12:38:54
104.22.8.0/21     10:34:27  12:29:39
104.20.64.0/21    10:34:27  12:30:00
104.23.128.0/21   10:34:27  12:30:34
141.101.120.0/23  10:34:27  12:30:39
162.159.224.0/21  10:34:27  12:30:39
104.18.32.0/21    10:34:29  12:30:34
104.24.112.0/21   10:34:29  12:30:34
104.27.160.0/21   10:34:29  12:30:34
104.28.16.0/21    10:34:29  12:30:34
104.31.0.0/21     10:34:29  12:30:34
8.39.214.0/24     10:34:31  12:19:24
104.26.0.0/21     10:34:36  12:29:53
172.68.60.0/22    10:34:38  12:19:24
172.69.116.0/22   10:34:38  12:19:24
8.44.58.0/24      10:34:38  12:19:24
8.42.245.0/24     11:52:49  11:53:19
104.17.168.0/21   12:00:13  12:29:34
104.16.80.0/21    12:00:13  12:30:00
104.19.168.0/21   12:09:39  12:29:34
$

Now we know the times. The route leak started at 10:34:25 UTC (just before lunchtime London time) on 2019-06-24 and ended at 12:38:54 UTC. That’s a hair over two hours. Here’s that same time data in a graphical form showing the near-instant start of the event and the duration of each route leaked:

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

We can also go back to RIPEstat and look at the activity graph for Cloudflare’s AS13335 network:

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Clearly between 10:30 UTC and 12:40 UTC there’s a lot of route activity – far more than normal.

Note that as we mentioned above, RIPEstat doesn’t get a full view of Verizon’s network routing and hence some of the propagated routes won’t show up.

Drilling down on the AS-PATH part of the data

Having the routes is useful, but now we want to look at the paths of these leaked routes to see which ASNs are involved. We knew the offending ASNs during the route leak on Monday. Now we want to dig deeper using the archived data. This allows us to see the extent and reach of this route leak.

# Use the list of routes to extract the full AS-PATH
# Merge the results together to show an amalgamation of paths.
# We know (luckily) the last few ASNs in the AS-PATH are consistent
$ cut -f3 < 13335-listing-b.txt | tr -d '[\[\]]' |\
awk '{
  n=split($0, a, ",");
  printf "%50s\n",
    a[n-5] "_" a[n-4] "_" a[n-3] "_" a[n-2] "_" a[n-1] "_" a[n];
}' | sort -u
174_701_396531_33154_3356_13335
2497_701_396531_33154_174_13335
577_701_396531_33154_3356_13335
6939_701_396531_33154_174_13335
1239_701_396531_33154_3356_13335
1273_701_396531_33154_3356_13335
1280_701_396531_33154_3356_13335
2497_701_396531_33154_3356_13335
2516_701_396531_33154_3356_13335
3320_701_396531_33154_3356_13335
3491_701_396531_33154_3356_13335
4134_701_396531_33154_3356_13335
4637_701_396531_33154_3356_13335
6453_701_396531_33154_3356_13335
6461_701_396531_33154_3356_13335
6762_701_396531_33154_3356_13335
6830_701_396531_33154_3356_13335
6939_701_396531_33154_3356_13335
7738_701_396531_33154_3356_13335
12956_701_396531_33154_3356_13335
17639_701_396531_33154_3356_13335
23148_701_396531_33154_3356_13335
$

This script clearly shows the AS-PATH of the leaked routes. It’s very consistent. Reading from the back of the line to the front, we have 13335 (Cloudflare), 3356 or 174 (Level3/CenturyLink or Cogent – both tier 1 transit providers for Cloudflare). So far, so good. Then we have 33154 (DQE Communications) and 396531 (Allegheny Technologies Inc) which is still technically not a leak, but trending that way. The reason why we can state this is still technically not a leak is because we don’t know the relationship between those two ASs. It’s possible they have a mutual transit agreement between them. That’s up to them.

Back to the AS-PATH’s. Still reading leftwards, we see 701 (Verizon), which is very very bad and clear evidence of a leak. It’s a leak for two reasons. First, this matches the path when a transit is leaking a non-customer route from a customer. This Verizon customer does not have 13335 (Cloudflare) listed as a customer. Second, the route contains within its path a tier 1 ASN. This is the point where a route leak should have been absolutely squashed by filtering on the customer BGP session. Beyond this point there be dragons.

And dragons there be! Everything above is about how Verizon filtered (or didn’t filter) its customer. What follows the 701 (i.e the number to the left of it) is the peers or customers of Verizon that have accepted these leaked routes. They are mainly other tier 1 networks of Verizon in this list: 174 (Cogent), 1239 (Sprint), 1273 (Vodafone), 3320 (DTAG), 3491 (PCCW), 6461 (Zayo), 6762 (Telecom Italia), etc.

What’s missing from that list are two networks worthy of mentioning – 2914 (NTT) & 7018 (AT&T). Both implement a very simple AS-PATH filter which saved the day for their network. They do not allow one tier 1 ISP to send them a route which has another tier 1 further down the path. That’s because when that happens, it’s officially a leak as each tier 1 is fully connected to all other tier 1’s (which is part of the definition of a tier 1 network). The topology of the Internet’s global BGP routing tables simply states that if you see another tier 1 in the path, then it’s a bad route and it should be filtered away.

Additionally we know that 7018 (AT&T) operates a network which drops RPKI invalids. Because Cloudflare routes are RPKI signed, this also means that AT&T would have dropped these routes when it receives them from Verizon. This shows a clear win for RPKI (and for AT&T when you see their bandwidth graph below)!


That all said, keep in mind we are still talking about routes that Cloudflare didn’t announce. They all came from the route optimizer.

What should 701 Verizon network accept from their customer 396531?

This is a great question to ask. Normally we would look at the IRR (Internet Routing Registries) to see what policy an ASN wants for it’s routes.

$ whois -h whois.radb.net AS396531 ; the Verizon customer
%  No entries found for the selected source(s).
$ whois -h whois.radb.net AS33154  ; the downstream of that customer
%  No entries found for the selected source(s).
$ 

That’s enough to say that we should not be seeing this ASN anywhere on the Internet, however, we should go further into checking this. As we know the ASN of the network, we can search for any routes that are listed for that ASN. We find one:

$ whois -h whois.radb.net ' -i origin AS396531' | egrep '^route|^origin|^mnt-by|^source'
route:          192.92.159.0/24
origin:         AS396531
mnt-by:         MNT-DCNSL
source:         ARIN
$

More importantly, now we have a maintainer (the owner of the routing registry entries). We can see what else is there for that network and we are specifically looking for this:

$ whois -h whois.radb.net ' -i mnt-by -T as-set MNT-DCNSL' | egrep '^as-set|^members|^mnt-by|^source'
as-set:         AS-DQECUST
members:        AS4130, AS5050, AS11199, AS11360, AS12017, AS14088, AS14162,
                AS14740, AS15327, AS16821, AS18891, AS19749, AS20326,
                AS21764, AS26059, AS26257, AS26461, AS27223, AS30168,
                AS32634, AS33039, AS33154, AS33345, AS33358, AS33504,
                AS33726, AS40549, AS40794, AS54552, AS54559, AS54822,
                AS393456, AS395440, AS396531, AS15204, AS54119, AS62984,
                AS13659, AS54934, AS18572, AS397284
mnt-by:         MNT-DCNSL
source:         ARIN
$

This object is important. It lists all the downstream ASNs that this network is expected to announce to the world. It does not contain Cloudflare’s ASN (or any of the leaked ASNs). Clearly this as-set was not used for any BGP filtering.

Just for completeness the same exercise can be done for the other ASN (the downstream of the customer of Verizon). In this case, we just searched for the maintainer object (as there are plenty of route and route6 objects listed).

$ whois -h whois.radb.net ' -i origin AS33154' | egrep '^mnt-by' | sort -u
mnt-by:         MNT-DCNSL
mnt-by:     MAINT-AS3257
mnt-by:     MAINT-AS5050
$

None of these maintainers are directly related to 33154 (DQE Communications). They have been created by other parties and hence they become a dead-end in that search.

It’s worth doing a secondary search to see if any as-set object exists with 33154 or 396531 included. We turned to the most excellent IRR Explorer website run by NLNOG. It provides deep insight into the routing registry data. We did a simple search for 33154 using http://irrexplorer.nlnog.net/search/33154 and we found these as-set objects.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

It’s interesting to see this ASN listed in other as-set’s but none are important or related to Monday’s route-leak. Next we looked at 396531

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

This shows that there’s nowhere else we need to check. AS-DQECUST is the as-set macro that controls (or should control) filtering for any transit provider of their network.

The summary of all the investigation is a solid statement that no Cloudflare routes or ASNs are listed anywhere within the routing registry data for the customer of Verizon. As there were 2,300 ASNs listed in the tweet above, we can conclusively state no filtering was in place and hence this route leak went on its way unabated.

IPv6? Where is the IPv6 route leak?

In what could be considered the only plus from Monday’s route leak, we can confirm that there was no route leak within IPv6 space. Why?

It turns out that 396531 (Allegheny Technologies Inc) is a network without IPv6 enabled. Normally you would hear Cloudflare chastise anyone that’s yet to enable IPv6, however, in this case we are quite happy that one of the two protocol families survived. IPv6 was stable during this route leak, which now can be called an IPv4-only route leak.

Yet that’s not really the whole story. Let’s look at the percentage of traffic Cloudflare sends Verizon that’s IPv6 (vs IPv4). Normally the IPv4/IPv6 percentage holds steady.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

This uptick in IPv6 traffic could be the direct result of Happy Eyeballs on mobile handsets picking a working IPv6 path into Cloudflare vs a non-working IPv4 path. Happy Eyeballs is meant to protect against IPv6 failure, however in this case it’s doing a wonderful job in protecting from an IPv4 failure. Yet we have to be careful with this graph because after further thought and investigation the percentage only increased because IPv4 reduces. Sometimes graphs can be misinterpreted, yet Happy Eyeballs still did a good job even as end users were being affected.

Happy Eyeballs, described in RFC8305, is a mechanism where a client (lets say a mobile device) tries to connect to a website both on IPv4 and IPv6 simultaneously. IPv6 is sometimes given a head-start. The theory is that, should a failure exist on one of the paths (sadly IPv6 is the norm), then IPv4 will save the day. Monday was a day of opposites for Verizon.

In fact, enabling IPv6 for mobile users is the one area where we can praise the Verizon network this week (or at least the Verizon mobile network), unlike the residential Verizon networks where IPv6 is almost non-existent.

Using bandwidth graphs to confirm routing leaks and stability.

As we have already stated,Verizon had impacted their own users/customers. Let’s start with their bandwidth graph:

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

The red line is 24 July 2019 (00:00 UTC to 00:00 UTC the next day). The gray lines are previous days for comparison. This graph includes both Verizon fixed-line services like FiOS along with mobile.

The AT&T graph is quite different.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

There’s no perturbation. This, along with some direct confirmation, shows that 7018 (AT&T) was not affected. This is an important point.

Going back and looking at a third tier 1 network, we can see 6762 (Telecom Italia) affected by this route leak and yet Cloudflare has a direct interconnect with them.

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

We will be asking Telecom Italia to improve their route filtering as we now have this data.

Future work that could have helped on Monday

The IETF is doing work in the area of BGP path protection within the Secure Inter-Domain Routing Operations Working Group (sidrops) area. The charter of this IETF group is:

The SIDR Operations Working Group (sidrops) develops guidelines for the operation of SIDR-aware networks, and provides operational guidance on how to deploy and operate SIDR technologies in existing and new networks.

One new effort from this group should be called out to show how important the issue of route leaks like today’s event is. The draft document from Alexander Azimov et al named draft-ietf-sidrops-aspa-profile (ASPA stands for Autonomous System Provider Authorization) extends the RPKI data structures to handle BGP path information. This in ongoing work and Cloudflare and other companies are clearly interested in seeing it progress further.

However, as we said in Monday’s blog and something we should reiterate again and again: Cloudflare encourages all network operators to deploy RPKI now!

The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

Post Syndicated from Martin J Levy original https://blog.cloudflare.com/the-deep-dive-into-how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-monday/

A recap on what happened Monday


On Monday we wrote about a painful Internet-wide route leak. We wrote that this should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. That blog entry came out around 19:58 UTC, just over seven hours after the route leak finished (which, as we will see below, was around 12:39 UTC). Today we will dive into the archived routing data and analyze it. The format of the code below is meant to use simple shell commands so that any reader can follow along and, more importantly, do their own investigations on the routing tables.

This was a very public BGP route leak event. It was both reported online via many news outlets and the event’s BGP data was reported via social media as it was happening. Andree Toonk tweeted a quick list of 2,400 ASNs that were affected.


This blog contains a large number of acronyms and those are explained at the end of the blog.

Using RIPE NCC archived data

The RIPE NCC operates a very useful archive of BGP routing. It runs collectors globally and provides an API for querying the data. More can be seen at https://stat.ripe.net/. In the world of BGP all routing is public (within the ability of anyone collecting data to have enough collection points). The archived data is very valuable for research, and that’s what we will use it for in this blog. The site can create some very useful data visualizations.


Dumping the RIPEstat data for this event

Presently, the RIPEstat data gets ingested around eight to twelve hours after real-time. It’s not meant to be a real-time service. The data can be queried in many ways, including a full web interface and an API. We are using the API to extract the data in a JSON format.

We are going to focus only on the Cloudflare routes that were leaked. Many other ASNs were leaked (see the tweet above); however, we want to deal with a finite data set and focus on what happened to Cloudflare’s routing. All the commands below can be run with ease on many systems. Both the scripts and the raw data file are now available on GitHub. The following was done on a MacBook Pro running macOS Mojave.

First we collect 24 hours of route announcements and AS-PATH data that RIPEstat sees coming from AS13335 (Cloudflare).

$ # Collect 24 hours of data - more than enough
$ ASN="AS13335"
$ START="2019-06-24T00:00:00"
$ END="2019-06-25T00:00:00"
$ ARGS="resource=${ASN}&starttime=${START}&endtime=${END}"
$ URL="https://stat.ripe.net/data/bgp-updates/data.json?${ARGS}"
$ # Fetch the data from RIPEstat
$ curl -sS "${URL}" | jq . > 13335-routes.json
$ ls -l 13335-routes.json
-rw-r--r--  1 martin  staff  339363899 Jun 25 08:47 13335-routes.json
$

That’s 340MB of data – which seems like a lot, but it contains plenty of white space and plenty of data we just don’t need. Our second task is to reduce this raw data down to just the required data – that’s timestamps, actual routes, and AS-PATH. The third item will be very useful. Note we are using jq, which can be installed on macOS with the  brew package manager.

$ # Extract just the times, routes, and AS-PATH
$ jq -rc '.data.updates[]|.timestamp,.attrs.target_prefix,.attrs.path' < 13335-routes.json | paste - - - > 13335-listing-a.txt
$ wc -l 13335-listing-a.txt
691318 13335-listing-a.txt
$

We are down to just below seven hundred thousand routing events, however, that’s not a leak, that’s everything that includes Cloudflare’s ASN (the number 13335 above). For that we need to go back to Monday’s blog and realize it was AS396531 (Allegheny Technologies) that showed up with 701 (Verizon) in the leak. Now we reduce the data further:

$ # Extract the route leak 701,396531
$ # AS701 is Verizon and AS396531 is Allegheny Technologies
$ egrep '701,396531' < 13335-listing-a.txt > 13335-listing-b.txt
$ wc -l 13335-listing-b.txt
204568 13335-listing-b.txt
$

At 204 thousand data points, we are looking better. It’s still a lot of data because BGP can be very chatty if topology is changing. A route leak will cause exactly that. Now let’s see how many routes were affected:

$ # Extract the actual routes affected by the route leak
$ cut -f2 < 13335-listing-b.txt | sort -V -u > 13335-listing-c.txt
$ wc -l 13335-listing-c.txt
101 13335-listing-c.txt
$

It’s a much smaller number. We now have a listing of at least 101 routes that were leaked via Verizon. This may not be the full list because route collectors like RIPEstat don’t have direct feeds from Verizon, so this data is a blended view with Verizon’s path and other paths. We can see that if we look at the AS-PATH in the above files. Please note that I had a typo in this script when this blog was first published and only 20 routes showed up because the -n vs -V option was used on sort. Now the list is correct with 101 affected routes. Please see this short article from stackoverflow to see the issue.

Here’s a partial listing of affected routes.

$ cat 13335-listing-c.txt
8.39.214.0/24
8.42.245.0/24
8.44.58.0/24
...
104.16.80.0/21
104.17.168.0/21
104.18.32.0/21
104.19.168.0/21
104.20.64.0/21
104.22.8.0/21
104.23.128.0/21
104.24.112.0/21
104.25.144.0/21
104.26.0.0/21
104.27.160.0/21
104.28.16.0/21
104.31.0.0/21
141.101.120.0/23
162.159.224.0/21
172.68.60.0/22
172.69.116.0/22
...
$

This is an interesting list, as some of these routes do not originate from Cloudflare’s network, however, they show up with AS13335 (our ASN) as the originator. For example, the 104.26.0.0/21 route is not announced from our network, but we do announce 104.26.0.0/20 (which covers that route). More importantly, we have an IRR (Internet Routing Registries) route object plus an RPKI ROA for that block. Here’s the IRR object:

route:          104.26.0.0/20
origin:         AS13335
source:         ARIN

And here’s the RPKI ROA. This ROA has Max Length set to /20, so no more-specific route (a /21 or longer) should be accepted.

Prefix:       104.26.0.0/20
Max Length:   /20
ASN:          13335
Trust Anchor: ARIN
Validity:     Thu, 02 Aug 2018 04:00:00 GMT - Sat, 31 Jul 2027 04:00:00 GMT
Emitted:      Thu, 02 Aug 2018 21:45:37 GMT
Name:         535ad55d-dd30-40f9-8434-c17fc413aa99
Key:          4a75b5de16143adbeaa987d6d91e0519106d086e
Parent Key:   a6e7a6b44019cf4e388766d940677599d0c492dc
Path:         rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-...

The Max Length field in a ROA specifies the longest (most specific) prefix length that is acceptable for an announcement. The fact that this is a /20 route with a /20 Max Length says that a /21 (or /22 or /23 or /24) within this IP space isn’t allowed. Looking further at the route list above we get the following listing:

Route Seen            Cloudflare IRR & ROA    ROA Max Length
104.16.80.0/21    ->  104.16.80.0/20          /20
104.17.168.0/21   ->  104.17.160.0/20         /20
104.18.32.0/21    ->  104.18.32.0/20          /20
104.19.168.0/21   ->  104.19.160.0/20         /20
104.20.64.0/21    ->  104.20.64.0/20          /20
104.22.8.0/21     ->  104.22.0.0/20           /20
104.23.128.0/21   ->  104.23.128.0/20         /20
104.24.112.0/21   ->  104.24.112.0/20         /20
104.25.144.0/21   ->  104.25.144.0/20         /20
104.26.0.0/21     ->  104.26.0.0/20           /20
104.27.160.0/21   ->  104.27.160.0/20         /20
104.28.16.0/21    ->  104.28.16.0/20          /20
104.31.0.0/21     ->  104.31.0.0/20           /20
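
We can double-check any of these leaked more-specifics against RPKI ourselves. This is only a sketch: the rpki-validation data call, its resource/prefix parameters, and the .data.status field are assumptions based on RIPEstat’s public documentation, so confirm against the current API docs before relying on it. Run against the /20 ROA above, a leaked /21 should come back as invalid.

$ # Ask RIPEstat to validate one of the leaked /21's against the published ROAs
$ # (endpoint and field names are assumptions based on the public RIPEstat docs)
$ PREFIX="104.26.0.0/21"
$ curl -sS "https://stat.ripe.net/data/rpki-validation/data.json?resource=AS13335&prefix=${PREFIX}" |\
  jq -r '.data.status'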

So how did all these /21’s show up? That’s where we dive into the world of BGP route optimization systems and their propensity to synthesize routes that should not exist. If those routes leak (and it’s very clear after this week that they can), all hell breaks loose. That can be compounded when not one, but two ISPs allow invalid routes to be propagated outside their autonomous network. We will explore the AS-PATH further down this blog.

More than 20 years ago, RFC1997 added the concept of communities to BGP. Communities are a way of tagging or grouping route advertisements. Communities are often used to label routes so that specific handling policies can be applied. RFC1997 includes a small number of universal well-known communities. One of these is the NO_EXPORT community, which has the following specification:

    All routes received carrying a communities attribute
    containing this value MUST NOT be advertised outside a BGP
    confederation boundary (a stand-alone autonomous system that
    is not part of a confederation should be considered a
    confederation itself).

The use of the NO_EXPORT community is very common within BGP enabled networks and is a community tag that would have helped alleviate this route leak immensely.
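
As a quick sanity check on our own archive, we can also look at which communities (NO_EXPORT is the well-known value 65535:65281) were attached to the leaked updates. This too is only a sketch: it assumes RIPEstat exposes the community attribute of each update as .attrs.community in the bgp-updates data, so adjust the field name if the API structure differs.

$ # Tally the communities seen on the leaked updates (null means none recorded);
$ # NO_EXPORT would show up as the well-known value 65535:65281
$ jq -rc '.data.updates[].attrs.community' < 13335-routes.json |\
  sort | uniq -c | sort -rn | head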

How BGP route optimization systems work (or don’t work in this case) can be a subject for a whole other blog entry.

Timing of the route leak

As we saved away the timestamps in the JSON file and in the text files, we can confirm the time for every route in the route leak by looking at the first and the last timestamp of a route in the data. We saved data from 00:00:00 UTC until 00:00:00 the next day, so we know we have covered the period of the route leak. We write a script that checks the first and last entry for every route and reports the information sorted by start time:

$ # Extract the timing of the route leak
$ while read cidr
do
  echo $cidr
  fgrep $cidr < 13335-listing-b.txt | head -1 | cut -f1
  fgrep $cidr < 13335-listing-b.txt | tail -1 | cut -f1
done < 13335-listing-c.txt |\
paste - - - | sort -k2,3 | column -t | sed -e 's/2019-06-24T//g'
...
104.25.144.0/21   10:34:25  12:38:54
104.22.8.0/21     10:34:27  12:29:39
104.20.64.0/21    10:34:27  12:30:00
104.23.128.0/21   10:34:27  12:30:34
141.101.120.0/23  10:34:27  12:30:39
162.159.224.0/21  10:34:27  12:30:39
104.18.32.0/21    10:34:29  12:30:34
104.24.112.0/21   10:34:29  12:30:34
104.27.160.0/21   10:34:29  12:30:34
104.28.16.0/21    10:34:29  12:30:34
104.31.0.0/21     10:34:29  12:30:34
8.39.214.0/24     10:34:31  12:19:24
104.26.0.0/21     10:34:36  12:29:53
172.68.60.0/22    10:34:38  12:19:24
172.69.116.0/22   10:34:38  12:19:24
8.44.58.0/24      10:34:38  12:19:24
8.42.245.0/24     11:52:49  11:53:19
104.17.168.0/21   12:00:13  12:29:34
104.16.80.0/21    12:00:13  12:30:00
104.19.168.0/21   12:09:39  12:29:34
...
$

Now we know the times. The route leak started at 10:34:25 UTC (just before lunchtime London time) on 2019-06-24 and ended at 12:38:54 UTC. That’s a hair over two hours. Here’s that same time data in a graphical form showing the near-instant start of the event and the duration of each route leaked:


We can also go back to RIPEstat and look at the activity graph for Cloudflare’s AS13335 network:


Clearly between 10:30 UTC and 12:40 UTC there’s a lot of route activity – far more than normal.

Note that as we mentioned above, RIPEstat doesn’t get a full view of Verizon’s network routing and hence some of the propagated routes won’t show up.

Drilling down on the AS-PATH part of the data

Having the routes is useful, but now we want to look at the paths of these leaked routes to see which ASNs are involved. We knew the offending ASNs during the route leak on Monday. Now we want to dig deeper using the archived data. This allows us to see the extent and reach of this route leak.

$ # Extract the AS-PATH of the route leak
$ # Use the list of routes to extract the full AS-PATH
$ # Merge the results together to show an amalgamation of paths.
$ # We know (luckily) the last few ASNs in the AS-PATH are consistent
$ cut -f3 < 13335-listing-b.txt | tr -d '[\[\]]' |\
awk '{
  n=split($0, a, ",");
  printf "%50s\n",
    a[n-5] "_" a[n-4] "_" a[n-3] "_" a[n-2] "_" a[n-1] "_" a[n];
}' | sort -u
   174_701_396531_33154_3356_13335
   2497_701_396531_33154_174_13335
   577_701_396531_33154_3356_13335
   6939_701_396531_33154_174_13335
  1239_701_396531_33154_3356_13335
  1273_701_396531_33154_3356_13335
  1280_701_396531_33154_3356_13335
  2497_701_396531_33154_3356_13335
  2516_701_396531_33154_3356_13335
  3320_701_396531_33154_3356_13335
  3491_701_396531_33154_3356_13335
  4134_701_396531_33154_3356_13335
  4637_701_396531_33154_3356_13335
  6453_701_396531_33154_3356_13335
  6461_701_396531_33154_3356_13335
  6762_701_396531_33154_3356_13335
  6830_701_396531_33154_3356_13335
  6939_701_396531_33154_3356_13335
  7738_701_396531_33154_3356_13335
 12956_701_396531_33154_3356_13335
 17639_701_396531_33154_3356_13335
 23148_701_396531_33154_3356_13335
$

This script clearly shows the AS-PATH of the leaked routes. It’s very consistent. Reading from the back of the line to the front, we have 13335 (Cloudflare), 3356 or 174 (Level3/CenturyLink or Cogent – both tier 1 transit providers for Cloudflare). So far, so good. Then we have 33154 (DQE Communications) and 396531 (Allegheny Technologies Inc) which is still technically not a leak, but trending that way. The reason why we can state this is still technically not a leak is because we don’t know the relationship between those two ASs. It’s possible they have a mutual transit agreement between them. That’s up to them.

Back to the AS-PATHs. Still reading leftwards, we see 701 (Verizon), which is very, very bad and clear evidence of a leak. It’s a leak for two reasons. First, this matches the path when a transit is leaking a non-customer route from a customer. This Verizon customer does not have 13335 (Cloudflare) listed as a customer. Second, the route contains within its path a tier 1 ASN. This is the point where a route leak should have been absolutely squashed by filtering on the customer BGP session. Beyond this point there be dragons.

And dragons there be! Everything above is about how Verizon filtered (or didn’t filter) its customer. What follows the 701 (i.e. the numbers to the left of it) are the peers or customers of Verizon that accepted these leaked routes. They are mainly other tier 1 networks in this list: 174 (Cogent), 1239 (Sprint), 1273 (Vodafone), 3320 (DTAG), 3491 (PCCW), 6461 (Zayo), 6762 (Telecom Italia), etc.

What’s missing from that list are three networks worthy of mentioning – 1299 (Telia), 2914 (NTT), and 7018 (AT&T). All three implement a very simple AS-PATH filter which saved the day for their network. They do not allow one tier 1 ISP to send them a route which has another tier 1 further down the path. That’s because when that happens, it’s officially a leak as each tier 1 is fully connected to all other tier 1’s (which is part of the definition of a tier 1 network). The topology of the Internet’s global BGP routing tables simply states that if you see another tier 1 in the path, then it’s a bad route and it should be filtered away.
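
That filter is simple enough to approximate with the data we already saved. Here is a minimal sketch run against 13335-listing-b.txt from above; the tier 1 list is hand-maintained for illustration and is an assumption, not an authoritative registry of tier 1 networks.

$ # Count the paths that carry two or more tier 1 ASNs beyond the first hop
$ # (the first ASN in each path is the collector's peer, so we skip it)
$ cut -f3 < 13335-listing-b.txt | tr -d '[\[\]]' |\
awk 'BEGIN {
  split("174 701 1239 1273 1299 2914 3320 3356 3491 6453 6461 6762 7018", t, " ");
  for (i in t) tier1[t[i]] = 1;
  bad = 0;
}
{
  n = split($0, a, ",");
  c = 0;
  for (i = 2; i <= n; i++) if (a[i] in tier1) c++;
  if (c >= 2) bad++;
}
END { print bad " of " NR " paths contain two or more tier 1 ASNs past the first hop" }'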

Additionally, we know that 7018 (AT&T) operates a network which drops RPKI invalids. Because Cloudflare routes are RPKI signed, this also means that AT&T would have dropped these routes when it received them from Verizon. This shows a clear win for RPKI (and for AT&T when you see their bandwidth graph below)!


That all said, keep in mind we are still talking about routes that Cloudflare didn’t announce. They all came from the route optimizer.

What should the 701 Verizon network accept from their customer 396531?

This is a great question to ask. Normally we would look at the IRR (Internet Routing Registries) to see what policy an ASN wants for its routes.

$ whois -h whois.radb.net AS396531 ; the Verizon customer
%  No entries found for the selected source(s).
$ whois -h whois.radb.net AS33154  ; the downstream of that customer
%  No entries found for the selected source(s).
$ 

That’s enough to say that we should not be seeing this ASN anywhere on the Internet, however, we should go further into checking this. As we know the ASN of the network, we can search for any routes that are listed for that ASN. We find one:

$ whois -h whois.radb.net ' -i origin AS396531' | egrep '^route|^origin|^mnt-by|^source'
route:          192.92.159.0/24
origin:         AS396531
mnt-by:         MNT-DCNSL
source:         ARIN
$

More importantly, now we have a maintainer (the owner of the routing registry entries). We can see what else is there for that network and we are specifically looking for this:

$ whois -h whois.radb.net ' -i mnt-by -T as-set MNT-DCNSL' | egrep '^as-set|^members|^mnt-by|^source'
as-set:         AS-DQECUST
members:        AS4130, AS5050, AS11199, AS11360, AS12017, AS14088, AS14162,
                AS14740, AS15327, AS16821, AS18891, AS19749, AS20326,
                AS21764, AS26059, AS26257, AS26461, AS27223, AS30168,
                AS32634, AS33039, AS33154, AS33345, AS33358, AS33504,
                AS33726, AS40549, AS40794, AS54552, AS54559, AS54822,
                AS393456, AS395440, AS396531, AS15204, AS54119, AS62984,
                AS13659, AS54934, AS18572, AS397284
mnt-by:         MNT-DCNSL
source:         ARIN
$

This object is important. It lists all the downstream ASNs that this network is expected to announce to the world. It does not contain Cloudflare’s ASN (or any of the leaked ASNs). Clearly this as-set was not used for any BGP filtering.

Just for completeness the same exercise can be done for the other ASN (the downstream of the customer of Verizon). In this case, we just searched for the maintainer object (as there are plenty of route and route6 objects listed).

$ whois -h whois.radb.net ' -i origin AS33154' | egrep '^mnt-by' | sort -u
mnt-by:         MNT-DCNSL
mnt-by:     MAINT-AS3257
mnt-by:     MAINT-AS5050
$

None of these maintainers are directly related to 33154 (DQE Communications). They have been created by other parties and hence they become a dead-end in that search.

It’s worth doing a secondary search to see if any as-set object exists with 33154 or 396531 included. We turned to the most excellent IRR Explorer website run by NLNOG. It provides deep insight into the routing registry data. We did a simple search for 33154 using http://irrexplorer.nlnog.net/search/33154 and we found these as-set objects.


It’s interesting to see this ASN listed in other as-sets, but none are important or related to Monday’s route leak. Next we looked at 396531.


This shows that there’s nowhere else we need to check. AS-DQECUST is the as-set macro that controls (or should control) filtering for any transit provider of their network.
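
Had any of the transit providers wanted to turn AS-DQECUST into an actual filter, the tooling to do so is freely available. The following is a minimal sketch using bgpq3 (bgpq4 behaves similarly); the flags shown (-4 for IPv4, -l to name the generated list) are our assumption for current versions, so check the man page of whichever build is installed.

$ # Expand the customer's as-set into an IPv4 prefix filter via the IRR
$ bgpq3 -4 -l dqe-customers-in AS-DQECUST | head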

The summary of all the investigation is a solid statement that no Cloudflare routes or ASNs are listed anywhere within the routing registry data for the customer of Verizon. As there were 2,300 ASNs listed in the tweet above, we can conclusively state no filtering was in place and hence this route leak went on its way unabated.

IPv6? Where is the IPv6 route leak?

In what could be considered the only plus from Monday’s route leak, we can confirm that there was no route leak within IPv6 space. Why?

It turns out that 396531 (Allegheny Technologies Inc) is a network without IPv6 enabled. Normally you would hear Cloudflare chastise anyone that’s yet to enable IPv6, however, in this case we are quite happy that one of the two protocol families survived. IPv6 was stable during this route leak, which now can be called an IPv4-only route leak.

Yet that’s not really the whole story. Let’s look at the percentage of traffic Cloudflare sends Verizon that’s IPv6 (vs IPv4). Normally the IPv4/IPv6 percentage holds steady.


This uptick in IPv6 traffic could be the direct result of Happy Eyeballs on mobile handsets picking a working IPv6 path into Cloudflare vs a non-working IPv4 path. Happy Eyeballs is meant to protect against IPv6 failure; in this case it did a wonderful job of protecting against an IPv4 failure. We do have to be careful with this graph, though: on further investigation, the percentage only increased because IPv4 traffic dropped, not because IPv6 traffic grew. Sometimes graphs can be misinterpreted, yet Happy Eyeballs still did a good job even as end users were being affected.

Happy Eyeballs, described in RFC8305, is a mechanism where a client (let’s say a mobile device) tries to connect to a website over both IPv4 and IPv6 simultaneously. IPv6 is sometimes given a head-start. The theory is that, should a failure exist on one of the paths (sadly, an IPv6 failure is the norm), then IPv4 will save the day. Monday was a day of opposites for Verizon.
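
You can get a rough feel for what a Happy Eyeballs client experiences with two curl probes, one per address family. This is only a sketch (a real RFC8305 client races the connections, giving IPv6 a short head-start, rather than running them back to back), and the hostname is just an example:

$ # Probe the same site over IPv4 and then IPv6 and compare the connect times
$ for fam in -4 -6
do
  curl $fam -sS -o /dev/null --connect-timeout 5 \
       -w "ipv${fam#-}: connect %{time_connect}s, http %{http_code}\n" \
       https://www.cloudflare.com/
done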

In fact, enabling IPv6 for mobile users is the one area where we can praise the Verizon network this week (or at least the Verizon mobile network), unlike the residential Verizon networks where IPv6 is almost non-existent.

Using bandwidth graphs to confirm routing leaks and stability

As we have already stated, Verizon impacted their own users/customers. Let’s start with their bandwidth graph:


The red line is 24 June 2019 (00:00 UTC to 00:00 UTC the next day). The gray lines are previous days for comparison. This graph includes both Verizon fixed-line services like FiOS along with mobile.

The AT&T graph is quite different.


There’s no perturbation. This, along with some direct confirmation, shows that 7018 (AT&T) was not affected. This is an important point.

Going back and looking at a third tier 1 network, we can see that 6762 (Telecom Italia) was affected by this route leak, even though Cloudflare has a direct interconnect with them.


We will be asking Telecom Italia to improve their route filtering as we now have this data.

Future work that could have helped on Monday

The IETF is doing work in the area of BGP path protection within the Secure Inter-Domain Routing Operations Working Group (sidrops) area. The charter of this IETF group is:

The SIDR Operations Working Group (sidrops) develops guidelines for the operation of SIDR-aware networks, and provides operational guidance on how to deploy and operate SIDR technologies in existing and new networks.

One new effort from this group should be called out to show how important the issue of route leaks like today’s event is. The draft document from Alexander Azimov et al named draft-ietf-sidrops-aspa-profile (ASPA stands for Autonomous System Provider Authorization) extends the RPKI data structures to handle BGP path information. This is ongoing work, and Cloudflare and other companies are clearly interested in seeing it progress further.

However, as we said in Monday’s blog and something we should reiterate again and again: Cloudflare encourages all network operators to deploy RPKI now!

Acronyms used in the blog

  • API – Application Programming Interface
  • AS-PATH – The list of ASNs that a route has traversed so far
  • ASN – Autonomous System Number – A unique number assigned for each network on the Internet
  • BGP – Border Gateway Protocol (version 4) – the core routing protocol for the Internet
  • IETF – Internet Engineering Task Force – an open standards organization
  • IPv4 – Internet Protocol version 4
  • IPv6 – Internet Protocol version 6
  • IRR – Internet Routing Registries – a database of Internet route objects
  • ISP – Internet Service Provider
  • JSON – JavaScript Object Notation – a lightweight data-interchange format
  • RFC – Request For Comment – published by the IETF
  • RIPE NCC – Réseaux IP Européens Network Coordination Centre – a regional Internet registry
  • ROA – Route Origin Authorization – a cryptographically signed attestation of a BGP route announcement
  • RPKI – Resource Public Key Infrastructure – a public key infrastructure framework for routing information
  • Tier 1 – A network that has no default route and peers with all other tier 1’s
  • UTC – Coordinated Universal Time – a time standard for clocks and time
  • "there be dragons" – a mistype, as it was meant to be "here be dragons" which means dangerous or unexplored territories

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today

Post Syndicated from Marketing Translators original https://blog.cloudflare.com/404/

A massive route leak impacted major Internet services, including Cloudflare

What happened?

Today (24 June 2019) at 10:30 UTC, the Internet suffered a significant shock. Many Internet routes that pass through Verizon (AS701), a major Internet transit provider, were steered through a small company in northern Pennsylvania. This was the equivalent of DiDi mistakenly routing an entire freeway down a small alley, leaving large numbers of Internet users unable to reach many websites on Cloudflare and the services of many other providers. This should never have happened, because Verizon should never have forwarded those routes to the rest of the Internet. To understand why, read on.

Unfortunate events like this are not uncommon, and we have published blog posts about them before. This time, the whole world once again witnessed the serious damage they can cause. The involvement of a “BGP Optimizer” product from Noction made today’s event even worse. This product has a feature that splits received IP prefixes into smaller component parts (called more-specifics). For example, our own IPv4 route 104.20.0.0/20 was turned into 104.20.0.0/21 and 104.20.8.0/21. It is as if the road sign pointing to “Beijing” were replaced by two signs, one for “Beijing East” and one for “Beijing West”. By splitting these major IP blocks into smaller parts, a network can steer traffic inside its own network, but such a split should never have been broadcast to the global Internet. The fact that it was is what caused today’s outage.

To explain what happened next, here is a quick recap of how the underlying “map” of the Internet works. “Internet” literally means a network of networks. It is made up of networks called Autonomous Systems (AS), and each of these networks has a unique identifier, its AS number. All of these networks are interconnected using the Border Gateway Protocol (BGP). BGP joins these networks together and builds the Internet “map” that allows traffic to travel from one place (for example, your ISP) to a popular website on the other side of the globe.

Using BGP, networks exchange routing information: how to reach them from wherever you are. These routes can be very specific, similar to looking up a particular city on your GPS, or very broad, like pointing your GPS at a province. This is where today’s problem lay.

An Internet service provider in Pennsylvania (AS33154 – DQE Communications) was using a BGP optimizer in its network, which meant there were many more-specific routes in its network. Specific routes take priority over more general routes (in the DiDi analogy, a route to Beijing Railway Station is more specific than a route to Beijing).

DQE announced these specific routes to its customer (AS396531 – Allegheny). All of this routing information was then sent on to their other transit provider (AS701 – Verizon), who leaked these “better” routes to the entire Internet. The routes were mistaken for “better” routes because they were more granular, more specific.

The leak should have stopped at Verizon. However, contrary to the many best practices outlined below, Verizon’s lack of filtering turned this leak into a major incident that affected many Internet services, such as Amazon, Linode, and Cloudflare.

This meant that Verizon, Allegheny, and DQE suddenly had to cope with a stampede of Internet users trying to reach those services through their networks. Because none of these networks was equipped to handle such a drastic increase in traffic, service was disrupted. Even if they had had sufficient capacity to absorb the surge, DQE, Allegheny, and Verizon should never have been allowed to claim they had the best route to Cloudflare, Amazon, Linode, and so on…

The BGP leak process involving a BGP optimizer

During the incident, at its worst, we observed a loss of about 15% of our global traffic.

Traffic levels at Cloudflare during the incident.

How could this kind of leak be prevented?

There are multiple ways this kind of leak could have been avoided:

First, when configuring a BGP session, a hard limit can be set on the number of prefixes it will accept. This means that if the number of received prefixes exceeds the threshold, the router can decide to shut down the session. If Verizon had had such a prefix limit in place, this incident would not have happened. Setting prefix limits is a best practice, and it costs a provider like Verizon nothing to do so. There is no reasonable explanation for why they had no such limits other than sloppiness or laziness.

Second, another way network operators can prevent leaks like this one is by implementing IRR-based filtering. The IRR is the Internet Routing Registry; network owners can add entries for their prefixes to these distributed databases. Other network operators can then use these IRR records to generate specific prefix lists for the BGP sessions with their peers. If IRR filtering had been used, none of the networks involved would have accepted the more-specific prefixes. What is quite shocking is that, even though IRR filtering has been around (and well documented) for 24 years, Verizon did not implement any such filtering on its BGP session with Allegheny. IRR filtering would not have added any cost for Verizon or limited its service in any way. Again, the only explanation we can think of is sloppiness or laziness.

Third, the RPKI framework that we implemented and deployed globally last year is designed to prevent this type of leak. It enables filtering on the origin network and the prefix size. The prefixes Cloudflare announces are signed for a maximum length of 20 bits. RPKI then indicates that no more-specific prefix should be accepted, no matter what the path is. For this mechanism to take effect, a network needs to enable BGP Origin Validation. Many providers, such as AT&T, have already enabled it successfully in their networks.

If Verizon had used RPKI, they would have seen that the advertised routes were invalid, and their routers could have dropped them automatically.

Cloudflare encourages all network operators to deploy RPKI now!

Route leak prevention using IRR, RPKI, and prefix limits

All of the above recommendations are nicely condensed into MANRS (Mutually Agreed Norms for Routing Security).

How it was resolved

Cloudflare’s network team reached out to the networks involved: AS33154 (DQE Communications) and AS701 (Verizon). Making contact was not easy, which may have been because the route leak started in the early morning hours on the US East Coast.

Screenshot of the email sent to Verizon

One of our network engineers made contact with DQE Communications quickly, and after a short delay they put us in touch with the people who could fix the problem. DQE worked with us over the phone to stop advertising these “optimized” routes to Allegheny. We are grateful for their help. Once that was done, the Internet stabilized and things returned to normal.

Screenshot of attempts to communicate with DQE and Verizon support

Unfortunately, although we tried to reach Verizon by both email and phone, at the time of writing (more than eight hours after the incident) we had not heard back from them, nor were we aware of any action on their part to resolve the issue.

At Cloudflare, we wish events like this never took place, but unfortunately the current state of the Internet does very little to prevent them. It is time for the industry to adopt better, more secure routing through systems like RPKI. We hope that major providers will follow the lead of Cloudflare, Amazon, and AT&T and start validating routes. In particular, we are watching you, Verizon, and still waiting for your reply.

Although the events that caused this disruption were outside our control, we are sorry for it all the same. Our team cares deeply about our service, and within minutes of identifying the problem we had engineers online in the US, UK, Australia, and Singapore working to resolve it.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/

Massive route leak impacts major parts of the Internet, including Cloudflare


What happened?

Today at 10:30 UTC, the Internet had a small heart attack. A small company in Northern Pennsylvania became a preferred path of many Internet routes through Verizon (AS701), a major Internet transit provider. This was the equivalent of Waze routing an entire freeway down a neighborhood street — resulting in many websites on Cloudflare, and many other providers, being unavailable from large parts of the Internet. This should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. To understand why, read on.

We have blogged about these unfortunate events in the past, as they are not uncommon. This time, the damage was seen worldwide. What exacerbated the problem today was the involvement of a “BGP Optimizer” product from Noction. This product has a feature that splits up received IP prefixes into smaller, contributing parts (called more-specifics). For example, our own IPv4 route 104.20.0.0/20 was turned into 104.20.0.0/21 and 104.20.8.0/21. It’s as if the road sign directing traffic to “Pennsylvania” was replaced by two road signs, one for “Pittsburgh, PA” and one for “Philadelphia, PA”. By splitting these major IP blocks into smaller parts, a network has a mechanism to steer traffic within their network but that split should never have been announced to the world at large. When it was, it caused today’s outage.

To explain what happened next, here’s a quick summary of how the underlying “map” of the Internet works. “Internet” literally means a network of networks and it is made up of networks called Autonomous Systems (AS), and each of these networks has a unique identifier, its AS number. All of these networks are interconnected using a protocol called Border Gateway Protocol (BGP). BGP joins these networks together and builds the Internet “map” that enables traffic to travel from, say, your ISP to a popular website on the other side of the globe.

Using BGP, networks exchange route information: how to get to them from wherever you are. These routes can either be specific, similar to finding a specific city on your GPS, or very general, like pointing your GPS to a state. This is where things went wrong today.

An Internet Service Provider in Pennsylvania  (AS33154 – DQE Communications) was using a BGP optimizer in their network, which meant there were a lot of more specific routes in their network. Specific routes override more general routes (in the Waze analogy a route to, say, Buckingham Palace is more specific than a route to London).

DQE announced these specific routes to their customer (AS396531 – Allegheny Technologies Inc). All of this routing information was then sent to their other transit provider (AS701 – Verizon), who proceeded to tell the entire Internet about these “better” routes. These routes were supposedly “better” because they were more granular, more specific.

The leak should have stopped at Verizon. However, against numerous best practices outlined below, Verizon’s lack of filtering turned this into a major incident that affected many Internet services such as Amazon, Fastly,  Linode and Cloudflare.

What this means is that suddenly Verizon, Allegheny, and DQE had to deal with a stampede of Internet users trying to access those services through their network. None of these networks were suitably equipped to deal with this drastic increase in traffic, causing disruption in service. Even if they had sufficient capacity, DQE, Allegheny, and Verizon were not allowed to say they had the best route to Cloudflare, Amazon, Fastly, Linode, etc…

BGP leak process with a BGP optimizer

During the incident, at its worst, we observed a loss of about 15% of our global traffic.

Traffic levels at Cloudflare during the incident.

How could this leak have been prevented?

There are multiple ways this leak could have been avoided:

A BGP session can be configured with a hard limit of prefixes to be received. This means a router can decide to shut down a session if the number of prefixes goes above the threshold. Had Verizon had such a prefix limit in place, this would not have occurred. It is a best practice to have such limits in place. It doesn’t cost a provider like Verizon anything to have such limits in place. And there’s no good reason, other than sloppiness or laziness, that they wouldn’t have such limits in place.

A different way network operators can prevent leaks like this one is by implementing IRR-based filtering. IRR is the Internet Routing Registry, and networks can add entries to these distributed databases. Other network operators can then use these IRR records to generate specific prefix lists for the BGP sessions with their peers. If IRR filtering had been used, none of the networks involved would have accepted the faulty more-specifics. What’s quite shocking is that it appears that Verizon didn’t implement any of this filtering in their BGP session with Allegheny Technologies, even though IRR filtering has been around (and well documented) for over 24 years. IRR filtering would not have increased Verizon’s costs or limited their service in any way. Again, the only explanation we can conceive of why it wasn’t in place is sloppiness or laziness.

The RPKI framework that we implemented and deployed globally last year is designed to prevent this type of leak. It enables filtering on origin network and prefix size. The prefixes Cloudflare announces are signed for a maximum size of 20. RPKI then indicates any more-specific prefix should not be accepted, no matter what the path is. In order for this mechanism to take action, a network needs to enable BGP Origin Validation. Many providers like AT&T have already enabled it successfully in their network.

If Verizon had used RPKI, they would have seen that the advertised routes were not valid, and the routes could have been automatically dropped by the router.

Cloudflare encourages all network operators to deploy RPKI now!

Route leak prevention using IRR, RPKI, and prefix limits

All of the above suggestions are nicely condensed into MANRS (Mutually Agreed Norms for Routing Security).

How it was resolved

The network team at Cloudflare reached out to the networks involved, AS33154 (DQE Communications) and AS701 (Verizon). We had difficulties reaching either network; this may have been due to the time of the incident, as it was still early on the East Coast of the US when the route leak started.

Screenshot of the email sent to Verizon

One of our network engineers made contact with DQE Communications quickly and after a little delay they were able to put us in contact with someone who could fix the problem. DQE worked with us on the phone to stop advertising these “optimized” routes to Allegheny Technologies Inc. We’re grateful for their help. Once this was done, the Internet stabilized, and things went back to normal.

Screenshot of attempts to communicate with the support for DQE and Verizon

It is unfortunate that while we tried both e-mail and phone calls to reach out to Verizon, at the time of writing this article (over 8 hours after the incident), we have not heard back from them, nor are we aware of them taking action to resolve the issue.

At Cloudflare, we wish that events like this never take place, but unfortunately the current state of the Internet does very little to prevent incidents such as this one from occurring. It’s time for the industry to adopt better routing security through systems like RPKI. We hope that major providers will follow the lead of Cloudflare, Amazon, and AT&T and start validating routes. And, in particular, we’re looking at you Verizon — and still waiting on your reply.

Despite this being caused by events outside our control, we’re sorry for the disruption. Our team cares deeply about our service and we had engineers in the US, UK, Australia, and Singapore online minutes after this problem was identified.


How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/comment-verizon-et-un-optimiseur-bgp-ont-affecte-de-nombreuses-partie-dinternet-aujourdhu/

A massive route leak impacted many parts of the Internet, including Cloudflare

What happened?

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today

Today at 10:30 UTC, the Internet had a sort of small heart attack. A small company in northern Pennsylvania became the preferred path for many Internet routes because of Verizon (AS701), a major Internet transit provider. It was a bit like Waze routing the traffic of an entire highway down a small neighborhood street: many websites on Cloudflare, and many other providers, were unavailable from large parts of the Internet. This should never have happened, because Verizon should never have forwarded those routes to the rest of the Internet. To understand why, read on.

We have written a number of articles in the past about these unfortunate events, which are more frequent than you might think. This time, the effects could be observed worldwide. Today, the problem was made worse by the involvement of a "BGP Optimizer" product from Noction. This product has a feature that splits received IP prefixes into smaller contributing parts (called more-specific prefixes). For example, our own IPv4 route 104.20.0.0/20 was turned into 104.20.0.0/21 and 104.20.8.0/21. Imagine a single road sign directing traffic from Paris towards Normandy being replaced by two signs, one for Le Havre and one for Dieppe. By splitting these large IP blocks into smaller parts, a network has a mechanism to steer traffic within its own network, but that split should never have been announced to the world at large. Because it was, today's outage happened.

To explain what happened next, here is a quick reminder of how the basic "map" of the Internet works. "Internet" literally means a network of networks, made up of networks called Autonomous Systems (AS for short). Each of these networks has a unique identifier, its AS number. All these networks are interconnected by a routing protocol called the Border Gateway Protocol (BGP). BGP joins these networks together and builds the Internet "map" that enables traffic to travel from, say, your Internet service provider to a popular website on the other side of the globe.

Using BGP, networks exchange route information: how to reach them from wherever you are. These routes can be either specific, like when you search for a city on your GPS, or more general, like when you ask your GPS to head towards a region. This is where things went wrong today.

An Internet service provider in Pennsylvania (AS33154 – DQE Communications) was using a BGP optimizer in their network, which meant there were many more-specific routes within their own network. More-specific routes override more general routes (to keep the Waze analogy, a route to the Palais de l'Elysée is much more specific than a route to Paris).

DQE announced these specific routes to their customer (AS396531 – Allegheny Technologies Inc). All of this routing information was then sent to their other transit provider (AS701 – Verizon), who proceeded to tell the entire Internet about these "optimized" routes. These routes were supposedly "optimized" because they were more granular, more specific.

The leak should have stopped at Verizon. However, against the best practices outlined below, Verizon's lack of filtering turned this into a major incident, causing outages for many Internet services such as Amazon, Linode, and Cloudflare.

In other words, Verizon, Allegheny, and DQE suddenly had to deal with a massive influx of Internet users trying to reach those services through their network. None of these networks was appropriately equipped to handle this dramatic increase in traffic, causing disruption in service. Even if they had had enough capacity, DQE, Allegheny, and Verizon were not allowed to claim they had the best route to Cloudflare, Amazon, Linode, etc.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
BGP route leak process with a BGP optimizer

At the worst of the incident, we observed a loss of about 15% of our global traffic.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
Traffic levels at Cloudflare during the incident.

How could this leak have been prevented?

There are multiple ways this leak could have been avoided:

A BGP session can be configured with a hard limit of prefixes to be received. This means a router can decide to shut down a session if the number of prefixes goes above the threshold. Had Verizon had such a prefix limit in place, this incident would not have occurred. Putting such limits in place is a best practice, and it would have cost Verizon nothing to do so. There is no good reason, other than sloppiness or laziness, not to have them.

A different way network operators can prevent leaks like this one is by implementing IRR-based filtering. IRR is the Internet Routing Registry, and networks can add entries to these distributed databases. Other network operators can then use these IRR records to generate specific prefix lists for the BGP sessions with their peers. If IRR filtering had been used, none of the networks involved would have accepted the faulty more-specifics. What's quite shocking is that it appears Verizon didn't apply any of this filtering on its BGP session with Allegheny Technologies, even though IRR filtering has been around (and well documented) for over 24 years. IRR filtering would not have increased Verizon's costs or limited their service in any way. Again, the only explanation we can conceive of for why it wasn't in place is sloppiness or laziness.

The RPKI framework that we implemented and deployed globally last year is designed to prevent exactly this type of leak. It enables filtering on both the origin network and the prefix size. The prefixes Cloudflare announces are signed for a maximum prefix length of /20, so RPKI indicates that any more-specific announcement should not be accepted, no matter what the path is. For this mechanism to take effect, a network needs to enable BGP Origin Validation. Many providers, like AT&T, have already enabled it successfully in their networks.

If Verizon had used RPKI, they would have seen that the advertised routes were not valid, and the routes could have been automatically dropped by the router.

Cloudflare encourages all network operators to deploy RPKI now!

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
Route leak prevention using IRR, RPKI, and prefix limits

All of the above suggestions are nicely condensed into MANRS (Mutually Agreed Norms for Routing Security).

How it was resolved

The network team at Cloudflare reached out to the networks involved, AS33154 (DQE Communications) and AS701 (Verizon). We had difficulty reaching either network; this may have been due to the time of the incident, as it was still early on the East Coast of the US when the route leak started.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
Screenshot of the email sent to Verizon

One of our network engineers made contact with DQE Communications quickly, and after a little delay they were able to put us in contact with someone who could fix the problem. We worked with DQE on the phone to stop advertising these "optimized" routes to Allegheny Technologies Inc. We're grateful for their help. Once this was done, the Internet stabilized and things went back to normal.

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
Screenshot of attempts to communicate with the support for DQE and Verizon

It is unfortunate that, while we tried both e-mail and phone calls to reach out to Verizon, at the time of writing this article (over 8 hours after the incident) we have not heard back from them, nor are we aware of them taking action to resolve the issue.

At Cloudflare, we wish that events like this never took place, but unfortunately the current state of the Internet does very little to prevent incidents such as this one from occurring. It's time for the industry to adopt better routing security through systems like RPKI. We hope that major providers will follow the lead of Cloudflare, Amazon, and AT&T and start validating routes. We are looking at you in particular, Verizon, and we are still waiting for your reply.

Despite this being caused by events outside our control, we're sorry for the disruption. Our team cares deeply about our service, and we had engineers online in the US, UK, Australia, and Singapore within minutes of the problem being identified.

Securing Certificate Issuance using Multipath Domain Control Validation

Post Syndicated from Dina Kozlov original https://blog.cloudflare.com/secure-certificate-issuance/

Securing Certificate Issuance using Multipath Domain Control Validation

Securing Certificate Issuance using Multipath Domain Control Validation

Trust on the Internet is underpinned by the Public Key Infrastructure (PKI). PKI grants servers the ability to securely serve websites by issuing digital certificates, providing the foundation for encrypted and authentic communication.

Certificates make HTTPS encryption possible by using the public key in the certificate to verify server identity. HTTPS is especially important for websites that transmit sensitive data, such as banking credentials or private messages. Thankfully, modern browsers, such as Google Chrome, flag websites not secured using HTTPS by marking them "Not secure," helping users be more conscious of the security of the websites they visit.

Now that we know what certificates are used for, let’s talk about where they come from.

Certificate Authorities

Certificate Authorities (CAs) are the institutions responsible for issuing certificates.

When issuing a certificate for any given domain, they use Domain Control Validation (DCV) to verify that the entity requesting a certificate for the domain is the legitimate owner of the domain. With DCV the domain owner:

  1. creates a DNS resource record for a domain;
  2. uploads a document to the web server located at that domain; OR
  3. proves ownership of the domain’s administrative email account.

The DCV process prevents adversaries from obtaining private-key and certificate pairs for domains not owned by the requestor.  

Preventing adversaries from acquiring this pair is critical: if an incorrectly issued certificate and private-key pair wind up in an adversary’s hands, they could pose as the victim’s domain and serve sensitive HTTPS traffic. This violates our existing trust of the Internet, and compromises private data on a potentially massive scale.

For example, an adversary that tricks a CA into mis-issuing a certificate for gmail.com could then perform TLS handshakes while pretending to be Google, and exfiltrate cookies and login information to gain access to the victim’s Gmail account. The risks of certificate mis-issuance are clearly severe.

Domain Control Validation

To prevent attacks like this, CAs only issue a certificate after performing DCV. One way of validating domain ownership is HTTP validation, done by uploading a text file to a specific HTTP endpoint on the web server for the domain being secured. Another DCV method uses email verification, where an email with a validation code link is sent to the administrative contact for the domain.

HTTP Validation

Suppose Alice buys the domain name aliceswonderland.com and wants to get a dedicated certificate for this domain. Alice chooses to use Let's Encrypt as her certificate authority. First, Alice must generate her own private key and create a certificate signing request (CSR). She sends the CSR to Let's Encrypt, but the CA won't issue a certificate for that CSR and private key until they know Alice owns aliceswonderland.com. Alice can then choose to prove that she owns this domain through HTTP validation.

When Let’s Encrypt performs DCV over HTTP, they require Alice to place a randomly named file in the /.well-known/acme-challenge path for her website. The CA must retrieve the text file by sending an HTTP GET request to http://aliceswonderland.com/.well-known/acme-challenge/<random_filename>. An expected value must be present on this endpoint for DCV to succeed.

For HTTP validation, Alice would upload a file to http://aliceswonderland.com/.well-known/acme-challenge/YnV0dHNz

The CA (or anyone else) can then retrieve that file over HTTP to check its contents:

curl http://aliceswonderland.com/.well-known/acme-challenge/YnV0dHNz

GET /.well-known/acme-challenge/YnV0dHNz
Host: aliceswonderland.com

HTTP/1.1 200 OK
Content-Type: application/octet-stream

YnV0dHNz.TEST_CLIENT_KEY

The response body combines the Base64 token YnV0dHNz with TEST_CLIENT_KEY, an account-linked key that only the certificate requestor and the CA know. The CA uses this combination to verify that the certificate requestor actually controls the domain. Afterwards, Alice can get her certificate for her website!

DNS Validation

Another way users can validate domain ownership is to add a DNS TXT record containing a verification string or token from the CA to their domain's resource records. For example, here's a domain for an enterprise validating itself to Google:

$ dig TXT aliceswonderland.com
aliceswonderland.com.	 28 IN TXT "google-site-verification=COanvvo4CIfihirYW6C0jGMUt2zogbE_lC6YBsfvV-U"

Here, Alice chooses to create a TXT DNS resource record with a specific token value. A Google CA can verify the presence of this token to validate that Alice actually owns her website.

Types of BGP Hijacking Attacks

Certificate issuance is required for servers to securely communicate with clients. This is why it’s so important that the process responsible for issuing certificates is also secure. Unfortunately, this is not always the case.

Researchers at Princeton University recently discovered that common DCV methods are vulnerable to attacks executed by network-level adversaries. If Border Gateway Protocol (BGP) is the “postal service” of the Internet responsible for delivering data through the most efficient routes, then Autonomous Systems (AS) are individual post office branches that represent an Internet network run by a single organization. Sometimes network-level adversaries advertise false routes over BGP to steal traffic, especially if that traffic contains something important, like a domain’s certificate.

Bamboozling Certificate Authorities with BGP highlights five types of attacks that can be orchestrated during the DCV process to obtain a certificate for a domain the adversary does not own. After implementing these attacks, the authors were able to (ethically) obtain certificates for domains they did not own from the top five CAs: Let’s Encrypt, GoDaddy, Comodo, Symantec, and GlobalSign. But how did they do it?

Attacking the Domain Control Validation Process

There are two main approaches to attacking the DCV process with BGP hijacking:

  1. Sub-Prefix Attack
  2. Equally-Specific-Prefix Attack

These attacks create a vulnerability when an adversary sends a certificate signing request for a victim's domain to a CA. When the CA verifies the network resources using an HTTP GET request (as discussed earlier), the adversary uses BGP attacks to hijack traffic to the victim's domain so that the CA's request is rerouted to the adversary instead of the domain owner. To understand how these attacks are conducted, we first need to do a little bit of math.

Securing Certificate Issuance using Multipath Domain Control Validation

Every device on the Internet uses an IP (Internet Protocol) address as a numerical identifier. IPv4 addresses contain 32 bits and use slash notation to indicate the length of the network prefix. So, in the network address 123.1.2.0/24, "/24" refers to how many bits make up the network prefix. That leaves 8 bits for host addresses, for a total of 256 addresses. The smaller the prefix number, the more host addresses the network contains. With this knowledge, let's jump into the attacks!
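Before diving in, the arithmetic above can be double-checked with Python's standard ipaddress module (a quick illustration, nothing more):

import ipaddress

# A /24 leaves 32 - 24 = 8 host bits, i.e. 2**8 = 256 addresses.
print(ipaddress.ip_network("123.1.2.0/24").num_addresses)  # 256
# A shorter prefix covers more addresses: a /8 spans 2**24 of them.
print(ipaddress.ip_network("123.0.0.0/8").num_addresses)   # 16777216
# And the /24 sits inside the /8.
print(ipaddress.ip_network("123.1.2.0/24").subnet_of(
    ipaddress.ip_network("123.0.0.0/8")))                  # True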

Attack one: Sub-Prefix Attack

When BGP announces a route, routers always prefer to follow the more specific route. So if 123.0.0.0/8 and 123.1.2.0/24 are both advertised, a router will use the latter, as it is the more specific prefix. This becomes a problem when an adversary announces a more specific prefix that falls inside the victim's address space. Let's say our victim, leagueofentropy.com, sits inside the prefix 123.0.0.0/8. If an adversary announces the prefix 123.1.2.0/24, they will capture the victim's traffic, launching a sub-prefix hijack attack.
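A toy longest-prefix-match lookup (a sketch with an illustrative routing table, not real router code) shows why the /24 wins:

import ipaddress

# Toy routing table: prefix -> where the route came from. Entries are illustrative.
routes = {
    ipaddress.ip_network("123.0.0.0/8"): "legitimate origin",
    ipaddress.ip_network("123.1.2.0/24"): "hijacker",
}

def lookup(destination):
    dest = ipaddress.ip_address(destination)
    # Longest-prefix match: the most specific covering route wins.
    matches = [net for net in routes if dest in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]

print(lookup("123.1.2.10"))   # "hijacker": the /24 is more specific
print(lookup("123.200.0.1"))  # "legitimate origin": only the /8 covers it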

For example, in an attack during April 2018, routes were announced with the more specific /24 vs. the existing /23. In the diagram below, /23 is Texas and /24 is the more specific Austin, Texas. The new (but nefarious) routes overrode the existing routes for portions of the Internet. The attacker then ran a nefarious DNS server on the normal IP addresses with DNS records pointing at some new nefarious web server instead of the existing server. This attracted the traffic destined for the victim’s domain within the area the nefarious routes were being propagated. The reason this attack was successful was because a more specific prefix is always preferred by the receiving routers.

Securing Certificate Issuance using Multipath Domain Control Validation

Attack two: Equally-Specific-Prefix Attack

In the last attack, the adversary was able to hijack traffic by offering a more specific announcement, but what if the victim's prefix is already a /24 and a sub-prefix attack is not viable? In this case, an attacker launches an equally-specific-prefix hijack, announcing the same prefix as the victim. Each AS that receives both announcements then chooses its preferred route based on properties like path length, so this attack only ever intercepts a portion of the traffic.
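As a rough sketch of that tie-break, with entirely made-up AS paths, the outcome depends on where on the Internet the observer sits:

# Two announcements for the same prefix; a common tie-breaker is the shorter
# AS path as seen from this particular vantage point. Paths are illustrative.
announcements = {
    "victim":   ["AS174", "AS13335"],            # 2 hops to the real origin
    "attacker": ["AS701", "AS396531", "AS666"],  # 3 hops to the impostor
}

best = min(announcements, key=lambda who: len(announcements[who]))
print(best)  # "victim" from here; networks closer to the attacker may choose differently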

Securing Certificate Issuance using Multipath Domain Control Validation

There are more advanced attacks that are covered in more depth in the paper. They are fundamentally similar attacks but are more stealthy.

Once an attacker has successfully obtained a bogus certificate for a domain that they do not own, they can perform a convincing attack in which they pose as the victim's domain and decrypt and intercept the victim's TLS traffic. The ability to decrypt the TLS traffic allows the adversary to completely Monster-in-the-Middle (MITM) encrypted TLS traffic and reroute Internet traffic destined for the victim's domain to themselves. To stay stealthy, the adversary then forwards the traffic on to the victim's domain so the attack goes undetected.

DNS Spoofing

Another way an adversary can gain control of a domain is by spoofing DNS traffic by using a source IP address that belongs to a DNS nameserver. Because anyone can modify their packets’ outbound IP addresses, an adversary can fake the IP address of any DNS nameserver involved in resolving the victim’s domain, and impersonate a nameserver when responding to a CA.

This attack is more sophisticated than simply spamming a CA with falsified DNS responses. Because each DNS query has its own randomized query identifiers and source port, a fake DNS response must match the DNS query’s identifiers to be convincing. Because these query identifiers are random, making a spoofed response with the correct identifiers is extremely difficult.

Adversaries can fragment User Datagram Protocol (UDP) DNS packets so that identifying DNS response information (like the random DNS query identifier) is delivered in one packet, while the actual answer section follows in another packet. This way, the adversary spoofs the DNS response to a legitimate DNS query.

Say an adversary wants to get a mis-issued certificate for victim.com by forcing packet fragmentation and spoofing DNS validation. The adversary sends a DNS nameserver for victim.com a DNS packet with a small Maximum Transmission Unit, or maximum byte size. This gets the nameserver to start fragmenting DNS responses. When the CA sends a DNS query to a nameserver for victim.com asking for victim.com’s TXT records, the nameserver will fragment the response into the two packets described above: the first contains the query ID and source port, which the adversary cannot spoof, and the second one contains the answer section, which the adversary can spoof. The adversary can continually send a spoofed answer to the CA throughout the DNS validation process, in the hopes of sliding their spoofed answer in before the CA receives the real answer from the nameserver.

In doing so, the answer section of a DNS response (the important part!) can be falsified, and an adversary can trick a CA into mis-issuing a certificate.

Securing Certificate Issuance using Multipath Domain Control Validation

Solution

At first glance, one might think a Certificate Transparency log could expose a mis-issued certificate and allow a CA to quickly revoke it. CT logs, however, can take up to 24 hours to include newly issued certificates, and certificate revocation is inconsistently enforced across browsers. We need a solution that allows CAs to proactively prevent these attacks, not retroactively address them.

We’re excited to announce that Cloudflare provides CAs a free API to leverage our global network to perform DCV from multiple vantage points around the world. This API bolsters the DCV process against BGP hijacking and off-path DNS attacks.

Given that Cloudflare runs 175+ datacenters around the world, we are in a unique position to perform DCV from multiple vantage points. Each datacenter has a unique path to DNS nameservers or HTTP endpoints, which means that successful hijacking of a BGP route can only affect a subset of DCV requests, further hampering BGP hijacks. And since we use RPKI, we actually sign and verify BGP routes.

This DCV checker additionally protects CAs against off-path DNS spoofing attacks. One feature we built into the service to help protect against off-path attackers is DNS query source IP randomization. By making the source IP unpredictable to the attacker, it becomes much harder to spoof the second fragment of the forged DNS response to the DCV validation agent.

By comparing multiple DCV results collected over multiple paths, our DCV API makes it virtually impossible for an adversary to mislead a CA into thinking they own a domain when they actually don’t. CAs can use our tool to ensure that they only issue certificates to rightful domain owners.

Our multipath DCV checker consists of two services:

  1. DCV agents responsible for performing DCV out of a specific datacenter, and
  2. a DCV orchestrator that handles multipath DCV requests from CAs and dispatches them to a subset of DCV agents.

When a CA wants to ensure that DCV occurred without being intercepted, it can send a request to our API specifying the type of DCV to perform and its parameters.

Securing Certificate Issuance using Multipath Domain Control Validation

The DCV orchestrator then forwards each request to a random subset of over 20 DCV agents in different datacenters. Each DCV agent performs the DCV request and forwards the result to the DCV orchestrator, which aggregates what each agent observed and returns it to the CA.
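The following Python sketch captures the fan-out-and-compare idea; the vantage points, observed values, and quorum rule are illustrative assumptions, not Cloudflare's actual implementation:

import random

# Hypothetical view of what each vantage point observes for the challenge.
# In a real deployment each agent performs the HTTP or DNS lookup itself;
# here one poisoned vantage point simulates a localized BGP hijack.
OBSERVATIONS = {
    "SIN": "token-abc",
    "LHR": "token-abc",
    "SJC": "token-abc",
    "GRU": "token-abc",
    "FRA": "token-evil",  # a vantage point behind a hijacked path
}

def orchestrate(expected, agents, quorum=0.8):
    # Fan the check out to a random subset of agents and require that a
    # quorum of them observed the expected value before reporting success.
    sample = random.sample(sorted(agents), k=min(4, len(agents)))
    observed = {agent: OBSERVATIONS[agent] for agent in sample}
    agreeing = sum(value == expected for value in observed.values()) / len(sample)
    return agreeing >= quorum, observed

ok, detail = orchestrate("token-abc", OBSERVATIONS)
print(ok, detail)

Because a localized hijack only poisons some paths, a disagreeing vantage point stands out in the aggregate instead of deciding the outcome on its own.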

This approach can also be generalized to performing multipath queries over DNS records, like Certificate Authority Authorization (CAA) records. CAA records authorize CAs to issue certificates for a domain, so spoofing them to trick unauthorized CAs into issuing certificates is another attack vector that multipath observation prevents.

As we were developing our multipath checker, we were in contact with the Princeton research group that introduced the proof-of-concept (PoC) of certificate mis-issuance through BGP hijacking attacks. Prateek Mittal, coauthor of the Bamboozling Certificate Authorities with BGP paper, wrote:

“Our analysis shows that domain validation from multiple vantage points significantly mitigates the impact of localized BGP attacks. We recommend that all certificate authorities adopt this approach to enhance web security. A particularly attractive feature of Cloudflare’s implementation of this defense is that Cloudflare has access to a vast number of vantage points on the Internet, which significantly enhances the robustness of domain control validation.”

Our DCV checker follows our belief that trust on the Internet must be distributed, and vetted through third-party analysis (like that provided by Cloudflare) to ensure consistency and security. This tool joins our pre-existing Certificate Transparency monitor as a set of services CAs are welcome to use in improving the accountability of certificate issuance.

An Opportunity to Dogfood

Building our multipath DCV checker also allowed us to dogfood multiple Cloudflare products.

The DCV orchestrator as a simple fetcher and aggregator was a fantastic candidate for Cloudflare Workers. We implemented the orchestrator in TypeScript using this post as a guide, and created a typed, reliable orchestrator service that was easy to deploy and iterate on. Hooray that we don’t have to maintain our own dcv-orchestrator  server!

We use Argo Tunnel to allow Cloudflare Workers to contact DCV agents. Argo Tunnel allows us to easily and securely expose our DCV agents to the Workers environment. Since Cloudflare has approximately 175 datacenters running DCV agents, we expose many services through Argo Tunnel, and have had the opportunity to load test Argo Tunnel as a power user with a wide variety of origins. Argo Tunnel readily handled this influx of new origins!

Getting Access to the Multipath DCV Checker

If you and/or your organization are interested in trying our DCV checker, email [email protected] and let us know! We’d love to hear more about how multipath querying and validation bolsters the security of your certificate issuance.

As a new class of BGP and IP spoofing attacks threaten to undermine PKI fundamentals, it’s important that website owners advocate for multipath validation when they are issued certificates. We encourage all CAs to use multipath validation, whether it is Cloudflare’s or their own. Jacob Hoffman-Andrews, Tech Lead, Let’s Encrypt, wrote:

“BGP hijacking is one of the big challenges the web PKI still needs to solve, and we think multipath validation can be part of the solution. We’re testing out our own implementation and we encourage other CAs to pursue multipath as well”

Hopefully in the future, website owners will look at multipath validation support when selecting a CA.

Welcome to Crypto Week 2019

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/welcome-to-crypto-week-2019/

Welcome to Crypto Week 2019

Welcome to Crypto Week 2019

The Internet is an extraordinarily complex and evolving ecosystem. Its constituent protocols range from the ancient and archaic (hello FTP) to the modern and sleek (meet WireGuard), with a fair bit of everything in between. This evolution is ongoing, and as one of the most connected networks on the Internet, Cloudflare has a duty to be a good steward of this ecosystem. We take this responsibility to heart: Cloudflare’s mission is to help build a better Internet. In this spirit, we are very proud to announce Crypto Week 2019.

Every day this week we’ll announce a new project or service that uses modern cryptography to build a more secure, trustworthy Internet. Everything we release this week will be free and immediately useful. This blog is a fun exploration of the themes of the week.

  • Monday: Coming Soon
  • Tuesday: Coming Soon
  • Wednesday: Coming Soon
  • Thursday: Coming Soon
  • Friday: Coming Soon

The Internet of the Future

Many pieces of the Internet in use today were designed in a different era with different assumptions. The Internet’s success is based on strong foundations that support constant reassessment and improvement. Sometimes these improvements require deploying new protocols.

Performing an upgrade on a system as large and decentralized as the Internet can't be done by decree:

  • There are too many economic, cultural, political, and technological factors at play.
  • Changes must be compatible with existing systems and protocols to even be considered for adoption.
  • To gain traction, new protocols must provide tangible improvements for users. Nobody wants to install an update that doesn’t improve their experience!

The last time the Internet had a complete reboot and upgrade was during TCP/IP flag day in 1983. Back then, the Internet (called ARPANET) had fewer than ten thousand hosts! To have an Internet-wide flag day today to switch over to a core new protocol is inconceivable; the scale and diversity of the components involved are far too great. Too much would break. It's challenging enough to deprecate outmoded functionality. In some ways, the open Internet is a victim of its own success. The bigger a system grows and the longer it stays the same, the harder it is to change. The Internet is like a massive barge: it takes forever to steer in a different direction and it's carrying a lot of garbage.

Welcome to Crypto Week 2019
ARPANET, 1983 (Computer History Museum)

As you would expect, many of the warts of the early Internet still remain. Both academic security researchers and real-life adversaries are still finding and exploiting vulnerabilities in the system. Many vulnerabilities are due to the fact that most of the protocols in use on the Internet have a weak notion of trust inherited from the early days. With 50 hosts online, it’s relatively easy to trust everyone, but in a world-scale system, that trust breaks down in fascinating ways. The primary tool to scale trust is cryptography, which helps provide some measure of accountability, though it has its own complexities.

In an ideal world, the Internet would provide a trustworthy substrate for human communication and commerce. Some people naïvely assume that this is the natural direction the evolution of the Internet will follow. However, constant improvement is not a given. It’s possible that the Internet of the future will actually be worse than the Internet today: less open, less secure, less private, less trustworthy. There are strong incentives to weaken the Internet on a fundamental level by Governments, by businesses such as ISPs, and even by the financial institutions entrusted with our personal data.

In a system with as many stakeholders as the Internet, real change requires principled commitment from all invested parties. At Cloudflare, we believe everyone is entitled to an Internet built on a solid foundation of trust. Crypto Week is our way of helping nudge the Internet’s evolution in a more trust-oriented direction. Each announcement this week helps bring the Internet of the future to the present in a tangible way.

Ongoing Internet Upgrades

Before we explore the Internet of the future, let’s explore some of the previous and ongoing attempts to upgrade the Internet’s fundamental protocols.

Routing Security

As we highlighted in last year’s Crypto Week one of the weak links on the Internet is routing. Not all networks are directly connected.

To send data from one place to another, you might have to rely on intermediary networks to pass your data along. A packet sent from one host to another may have to be passed through up to a dozen of these intermediary networks. No single network knows the full path the data will have to take to get to its destination; it only knows which network to pass it to next. The protocol that determines how packets are routed is called the Border Gateway Protocol (BGP). Generally speaking, networks use BGP to announce to each other which addresses they know how to route packets for and (subject to a set of complex rules) these networks share what they learn with their neighbors.

Unfortunately, BGP is completely insecure:

  • Any network can announce any set of addresses to any other network, even addresses they don’t control. This leads to a phenomenon called BGP hijacking, where networks are tricked into sending data to the wrong network.
  • A BGP hijack is most often caused by accidental misconfiguration, but can also be the result of malice on the network operator’s part.
  • During a BGP hijack, a network inappropriately announces a set of addresses to other networks, which results in packets destined for the announced addresses to be routed through the illegitimate network.

Understanding the risk

If the packets represent unencrypted data, this can be a big problem, as it allows the hijacker to read or even change the data.

Mitigating the risk

The Resource Public Key Infrastructure (RPKI) system helps bring some trust to BGP by enabling networks to utilize cryptography to digitally sign network routes with certificates, making BGP hijacking much more difficult.

  • This enables participants of the network to gain assurances about the authenticity of route advertisements. Certificate Transparency (CT) is a tool that enables additional trust for certificate-based systems. Cloudflare operates the Cirrus CT log to support RPKI.

Since we announced our support of RPKI last year, routing security has made big strides. More routes are signed, more networks validate RPKI, and the software ecosystem has matured, but this work is not complete. Most networks are still vulnerable to BGP hijacking. For example, Pakistan knocked YouTube offline with a BGP hijack back in 2008, and could likely do the same today. Adoption here is driven less by providing a benefit to users than by reducing systemic risk, which is not the strongest motivating factor for adopting a complex new technology. Full routing security on the Internet could take decades.

DNS Security

The Domain Name System (DNS) is the phone book of the Internet. Or, for anyone under 25 who doesn’t remember phone books, it’s the system that takes hostnames (like cloudflare.com or facebook.com) and returns the Internet address where that host can be found. For example, as of this publication, www.cloudflare.com is 104.17.209.9 and 104.17.210.9 (IPv4) and 2606:4700::c629:d7a2, 2606:4700::c629:d6a2 (IPv6). Like BGP, DNS is completely insecure. Queries and responses sent unencrypted over the Internet are modifiable by anyone on the path.
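You can watch this mapping happen with nothing more than the Python standard library; the addresses returned will of course vary by time and location:

import socket

# Resolve a hostname to its addresses, the job DNS performs behind the scenes.
for family, _, _, _, sockaddr in socket.getaddrinfo("www.cloudflare.com", 443):
    label = "IPv6" if family == socket.AF_INET6 else "IPv4"
    print(label, sockaddr[0])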

There are many ongoing attempts to add security to DNS, such as:

  • DNSSEC that adds a chain of digital signatures to DNS responses
  • DoT/DoH that wraps DNS queries in the TLS encryption protocol (more on that later)

Both technologies are slowly gaining adoption, but have a long way to go.

Welcome to Crypto Week 2019
DNSSEC-signed responses served by Cloudflare

Welcome to Crypto Week 2019
Cloudflare’s 1.1.1.1 resolver queries are already over 5% DoT/DoH

Just like RPKI, securing DNS comes with a performance cost, making it less attractive to users. However,

The Web

Transport Layer Security (TLS) is a cryptographic protocol that gives two parties the ability to communicate over an encrypted and authenticated channel. TLS protects communications from eavesdroppers even in the event of a BGP hijack. TLS is what puts the “S” in HTTPS. TLS protects web browsing against multiple types of network adversaries.
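For instance, here is a minimal Python sketch that opens a TLS connection and prints what was negotiated; the hostname is just an example, and any HTTPS site would do:

import socket
import ssl

# Open an authenticated, encrypted channel and show what was negotiated.
context = ssl.create_default_context()  # verifies the certificate chain and hostname
with socket.create_connection(("www.cloudflare.com", 443)) as raw:
    with context.wrap_socket(raw, server_hostname="www.cloudflare.com") as tls:
        print(tls.version())                 # e.g. "TLSv1.3"
        print(tls.getpeercert()["subject"])  # identity asserted by the certificate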

Welcome to Crypto Week 2019
Requests hop from network to network over the Internet

Welcome to Crypto Week 2019
For unauthenticated protocols, an attacker on the path can impersonate the server

Welcome to Crypto Week 2019
Attackers can use BGP hijacking to change the path so that communication can be intercepted

Welcome to Crypto Week 2019
Authenticated protocols are protected from interception attacks

The adoption of TLS on the web is partially driven by the fact that:

  • It’s easy and free for websites to get an authentication certificate (via Let’s Encrypt, Universal SSL, etc.)
  • Browsers make TLS adoption appealing to website operators by only supporting new web features such as HTTP/2 over HTTPS.

This has led to the rapid adoption of HTTPS over the last five years.

Welcome to Crypto Week 2019
HTTPS adoption curve (from Google Chrome)‌‌

To further that adoption, TLS recently got an upgrade in TLS 1.3, making it faster and more secure (a combination we love). It’s taking over the Internet!

Welcome to Crypto Week 2019
TLS 1.3 adoption over the last 12 months (from Cloudflare’s perspective)

Despite this fantastic progress in the adoption of security for routing, DNS, and the web, there are still gaps in the trust model of the Internet. There are other things needed to help build the Internet of the future. To find and identify these gaps, we lean on research experts.

Research Farm to Table

Cryptographic security on the Internet is a hot topic and there have been many flaws and issues recently pointed out in academic journals. Researchers often study the vulnerabilities of the past and ask:

  • What other critical components of the Internet have the same flaws?
  • What underlying assumptions can subvert trust in these existing systems?

The answers to these questions help us decide what to tackle next. Some recent research  topics we’ve learned about include:

  • Quantum Computing
  • Attacks on Time Synchronization
  • DNS attacks affecting Certificate issuance
  • Scaling distributed trust

Cloudflare keeps abreast of these developments and we do what we can to bring these new ideas to the Internet at large. In this respect, we’re truly standing on the shoulders of giants.

Future-proofing Internet Cryptography

The new protocols we are currently deploying (RPKI, DNSSEC, DoT/DoH, TLS 1.3) use relatively modern cryptographic algorithms published in the 1970s and 1980s.

  • The security of these algorithms is based on hard mathematical problems in the field of number theory, such as factoring and the elliptic curve discrete logarithm problem.
  • If you can solve the hard problem, you can crack the code. Using a bigger key makes the problem harder, making it more difficult to break, but also slows performance.

Modern Internet protocols typically pick keys large enough to make it infeasible to break with classical computers, but no larger. The sweet spot is around 128 bits of security, meaning a computer has to do approximately 2¹²⁸ operations to break it.

Arjen Lenstra and others created a useful measure of security levels by comparing the amount of energy it takes to break a key to the amount of water you can boil using that much energy. You can think of this as the electric bill you’d get if you run a computer long enough to crack the key.

  • 35-bit security is “Teaspoon security” — It takes about the same amount of energy to break a 35-bit key as it does to boil a teaspoon of water (pretty easy).

Welcome to Crypto Week 2019

  • 65 bits gets you up to “Pool security” – The energy needed to boil the average amount of water in a swimming pool.

Welcome to Crypto Week 2019

  • 105 bits is “Sea Security” – The energy needed to boil the Mediterranean Sea.

Welcome to Crypto Week 2019

  • 114-bits is “Global Security” –  The energy needed to boil all water on Earth.

Welcome to Crypto Week 2019

  • 128-bit security is safely beyond that of Global Security – Anything larger is overkill.
  • 256-bit security corresponds to “Universal Security” – The estimated mass-energy of the observable universe. So, if you ever hear someone suggest 256-bit AES, you know they mean business.

Welcome to Crypto Week 2019

Post-Quantum of Solace

As far as we know, the algorithms we use for cryptography are functionally uncrackable with all known algorithms that classical computers can run. Quantum computers change this calculus. Instead of transistors and bits, a quantum computer uses the effects of quantum mechanics to perform calculations that just aren’t possible with classical computers. As you can imagine, quantum computers are very difficult to build. However, despite large-scale quantum computers not existing quite yet, computer scientists have already developed algorithms that can only run efficiently on quantum computers. Surprisingly, it turns out that with a sufficiently powerful quantum computer, most of the hard mathematical problems we rely on for Internet security become easy!

Although there are still quantum-skeptics out there, some experts estimate that within 15-30 years these large quantum computers will exist, which poses a risk to every security protocol online. Progress is moving quickly; every few months a more powerful quantum computer is announced.

Welcome to Crypto Week 2019

Luckily, there are cryptography algorithms that rely on different hard math problems that seem to be resistant to attack from quantum computers. These math problems form the basis of so-called quantum-resistant (or post-quantum) cryptography algorithms that can run on classical computers. These algorithms can be used as substitutes for most of our current quantum-vulnerable algorithms.

  • Some quantum-resistant algorithms (such as McEliece and Lamport Signatures) were invented decades ago, but there’s a reason they aren’t in common use: they lack some of the nice properties of the algorithms we’re currently using, such as key size and efficiency.
  • Some quantum-resistant algorithms require much larger keys to provide 128-bit security
  • Some are very CPU intensive,
  • And some just haven’t been studied enough to know if they’re secure.

It is possible to swap our current set of quantum-vulnerable algorithms with new quantum-resistant algorithms, but it’s a daunting engineering task. With widely deployed protocols, it is hard to make the transition from something fast and small to something slower, bigger or more complicated without providing concrete user benefits. When exploring new quantum-resistant algorithms, minimizing user impact is of utmost importance to encourage adoption. This is a big deal, because almost all the protocols we use to protect the Internet are vulnerable to quantum computers.

Cryptography-breaking quantum computing is still in the distant future, but we must start the transition to ensure that today’s secure communications are safe from tomorrow’s quantum-powered onlookers; however, that’s not the most timely problem with the Internet. We haven’t addressed that…yet.

Attacking time

Just like DNS, BGP, and HTTP, the Network Time Protocol (NTP) is fundamental to how the Internet works. And like these other protocols, it is completely insecure.

  • Last year, Cloudflare introduced Roughtime as a mechanism for computers to access the current time from a trusted server in an authenticated way.
  • Roughtime is powerful because it provides a way to distribute trust among multiple time servers so that if one server attempts to lie about the time, it will be caught.

However, Roughtime is not exactly a secure drop-in replacement for NTP.

  • Roughtime lacks the complex mechanisms of NTP that allow it to compensate for network latency and yet maintain precise time, especially if the time servers are remote. This leads to imprecise time.
  • Roughtime also involves expensive cryptography that can further reduce precision. This lack of precision makes Roughtime useful for browsers and other systems that need coarse time to validate certificates (most certificates are valid for 3 months or more), but some systems (such as those used for financial trading) require precision to the millisecond or below.

With Roughtime we supported the time protocol of the future, but there are things we can do to help improve the health of security online today.


Some academic researchers, including Aanchal Malhotra of Boston University, have demonstrated a variety of attacks against NTP, including BGP hijacking and off-path User Datagram Protocol (UDP) attacks.

  • Some of these attacks can be avoided by connecting to an NTP server that is close to you on the Internet.
  • However, to bring cryptographic trust to time while maintaining precision, we need something in between NTP and Roughtime.
  • To solve this, it’s natural to turn to the same system of trust that enabled us to patch HTTP and DNS: Web PKI.

Attacking the Web PKI

The Web PKI is similar to the RPKI, but it is far more visible to everyday users, since it secures websites rather than routing tables.

  • If you’ve ever clicked the lock icon on your browser’s address bar, you’ve interacted with it.
  • The PKI relies on a set of trusted organizations called Certificate Authorities (CAs) to issue certificates to websites and web services.
  • Websites use these certificates to authenticate themselves to clients as part of the TLS protocol in HTTPS.
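
A quick way to see the Web PKI at work is to open a TLS connection and look at the certificate the server presents. The sketch below does this with Python’s standard library; cloudflare.com is just an example hostname.

    import socket
    import ssl

    def show_certificate(hostname="cloudflare.com", port=443):
        context = ssl.create_default_context()  # trusts the CAs shipped with the system
        with socket.create_connection((hostname, port)) as raw:
            with context.wrap_socket(raw, server_hostname=hostname) as tls:
                cert = tls.getpeercert()  # already validated against those roots in the handshake
        subject = dict(pair[0] for pair in cert["subject"])
        issuer = dict(pair[0] for pair in cert["issuer"])
        print("subject:", subject.get("commonName"))
        print("issuer: ", issuer.get("organizationName"))
        print("expires:", cert["notAfter"])

    show_certificate()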

[Figure: TLS provides encryption and integrity from the client to the server with the help of a digital certificate]

[Figure: TLS connections are safe against MITM, because the client doesn’t trust the attacker’s certificate]

While we were all patting ourselves on the back for moving the web to HTTPS, some researchers managed to find and exploit a weakness in the system: the process for getting HTTPS certificates.

Certificate Authorities (CAs) use a process called domain control validation (DCV) to ensure that they only issue certificates to website owners who legitimately request them.

  • Some CAs do this validation manually, which is secure, but can’t scale to the total number of websites deployed today.
  • More progressive CAs have automated this validation process, but rely on insecure methods (HTTP and DNS) to validate domain ownership.
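
An automated HTTP-based validation looks roughly like the sketch below. It is loosely modelled on ACME-style HTTP challenges, and the path and helper names are illustrative rather than any CA’s real API: the CA hands the applicant a random token, asks them to publish it on the domain, and then fetches it over plain, unauthenticated HTTP from a single vantage point, precisely the kind of lookup a BGP hijack can redirect.

    import secrets
    import urllib.request

    def issue_challenge():
        # The CA hands the applicant a random token to publish on the domain being validated.
        return secrets.token_urlsafe(32)

    def check_http_challenge(domain, token, timeout=5.0):
        # Fetch the token over plain, unauthenticated HTTP from a single vantage point. If an
        # attacker can hijack the route to the domain, they can answer this request themselves
        # and obtain a certificate for a site they do not control.
        url = f"http://{domain}/.well-known/validation/{token}"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read().decode().strip() == token
        except OSError:
            return False

    token = issue_challenge()
    print(check_http_challenge("example.com", token))  # False unless example.com publishes the token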

Without ubiquitous cryptography in place (DNSSEC may never reach 100% deployment), there is no completely secure way to bootstrap this system. So, let’s look at how to distribute trust using other methods.

One tool at our disposal is the distributed nature of the Cloudflare network.

Cloudflare is global. We have locations all over the world connected to dozens of networks. That means we have different vantage points, resulting in different ways to traverse networks. This diversity can prove an advantage when dealing with BGP hijacking, since an attacker would have to hijack multiple routes from multiple locations to affect all the traffic between Cloudflare and other distributed parts of the Internet. The natural diversity of the network raises the cost of the attacks.

Using a distributed set of connections to the Internet as a quorum is a powerful way to distribute trust, with or without cryptography.
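
A minimal sketch of that quorum idea in Python follows. The vantage-point names and the toy check function are made up; a real deployment would run something like the DCV lookup from the earlier sketch through each location and compare the answers.

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def validate_from_quorum(vantage_points, check, domain, token, threshold=None):
        # Run the same (individually spoofable) check from several vantage points and accept
        # it only if a majority of them saw the expected answer.
        if threshold is None:
            threshold = len(vantage_points) // 2 + 1
        with ThreadPoolExecutor(max_workers=len(vantage_points)) as pool:
            results = list(pool.map(lambda vp: check(vp, domain, token), vantage_points))
        return Counter(results)[True] >= threshold

    # Toy demo: two vantage points see the published token, one sits behind a hijacked route.
    views = {"ams": True, "sin": True, "iad": False}
    ok = validate_from_quorum(list(views), lambda vp, d, t: views[vp], "example.com", "token123")
    print(ok)  # True: the hijacker fooled one location, not the quorum

An attacker now has to hijack the route as seen from most of those locations simultaneously, which is far harder than fooling a single validation server.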

Distributed Trust

This idea of distributing the source of trust is powerful. Last year we announced the Distributed Web Gateway, which:

  • Enables users to access content on the InterPlanetary File System (IPFS), a network structured to reduce the trust placed in any single party.
  • Ensures that even if a participant of the network is compromised, it can’t be used to distribute compromised content, because the network is content-addressed (see the sketch after this list).
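
To make the content-addressing point concrete, here is a minimal sketch in Python. It is a deliberate simplification (IPFS really uses multihash-encoded CIDs over chunked data structures), but the principle is the same: the address of a piece of content is derived from its bytes, so a node can refuse to serve the content, but it cannot silently substitute different bytes.

    import hashlib

    def address_of(content: bytes) -> str:
        # Simplified content address: the SHA-256 digest of the bytes themselves.
        return hashlib.sha256(content).hexdigest()

    def fetch(address: str, node) -> bytes:
        # `node` stands in for any untrusted peer claiming to have the content.
        content = node(address)
        if address_of(content) != address:
            raise ValueError("content does not match its address: tampering detected")
        return content

    original = b"hello, distributed web"
    addr = address_of(original)
    honest_node = lambda a: original
    evil_node = lambda a: b"something else entirely"

    print(fetch(addr, honest_node))  # b'hello, distributed web'
    # fetch(addr, evil_node) would raise ValueError: the lie is detected, not served
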
However, content-based addressing is not the only way to distribute trust between multiple independent parties.

Another way to distribute trust is to literally split authority between multiple independent parties. We’ve explored this topic before. In the context of Internet services, this means ensuring that no single server can authenticate itself to a client on its own. For example,

  • In HTTPS the server’s private key is the lynchpin of its security. Compromising the owner of the private key (by hook or by crook) gives an attacker the ability to impersonate (spoof) that service. This single point of failure puts services at risk. You can mitigate this risk by distributing the authority to authenticate the service between multiple independently-operated services.
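
One classic building block for splitting authority is Shamir secret sharing, sketched below. This is an illustration of the principle rather than the mechanism used by any particular product: the key becomes n shares, any k of which reconstruct it, while fewer than k reveal nothing.

    import secrets

    PRIME = 2**127 - 1  # all arithmetic happens in this prime field; the secret must be smaller

    def split_secret(secret, n, k):
        # Hide the secret as the constant term of a random degree (k - 1) polynomial and hand
        # out n points on it; any k points determine the polynomial, any k - 1 reveal nothing.
        coefficients = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
        def poly(x):
            return sum(c * pow(x, i, PRIME) for i, c in enumerate(coefficients)) % PRIME
        return [(x, poly(x)) for x in range(1, n + 1)]

    def recover_secret(shares):
        # Lagrange interpolation at x = 0 over the prime field.
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            numerator, denominator = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    numerator = (numerator * -xj) % PRIME
                    denominator = (denominator * (xi - xj)) % PRIME
            secret = (secret + yi * numerator * pow(denominator, -1, PRIME)) % PRIME
        return secret

    key = secrets.randbelow(PRIME)
    shares = split_secret(key, n=5, k=3)
    assert recover_secret(shares[:3]) == key   # any three shares rebuild the key
    assert recover_secret(shares[1:4]) == key
    assert recover_secret(shares[:2]) != key   # two shares alone are (almost surely) useless

Real deployments go further, for example by using threshold signatures so the key is never reassembled in one place, but the goal is the same: removing the single point of failure.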

[Figure: TLS doesn’t protect against server compromise]

[Figure: With distributed trust, multiple parties combine to protect the connection]

[Figure: An attacker that has compromised one of the servers cannot break the security of the system]

The Internet barge is old and slow, and we’ve only been able to improve it through the meticulous process of patching it piece by piece. Another option is to build new secure systems on top of this insecure foundation. IPFS is doing this, and IPFS is not alone in its design. There has been more research into secure systems with decentralized trust in the last ten years than ever before.

The result is radical new protocols and designs that use exotic new algorithms. These protocols do not supplant those at the core of the Internet (like TCP/IP), but instead, they sit on top of the existing Internet infrastructure, enabling new applications, much like HTTP did for the web.

Gaining Traction

Some of the most innovative technical projects were considered failures because they couldn’t attract users. New technology has to bring tangible benefits to its users to sustain it: useful functionality, content, and a decent user experience. Distributed projects such as IPFS are gaining popularity, but have not found mass adoption. This is a chicken-and-egg problem. New protocols have a high barrier to entry (users have to install new software), and because of the small audience, there is less incentive to create compelling content. Decentralization and distributed trust are nice security features to have, but they are not products. Users still need to get some benefit out of using the platform.

The web itself is an example of a system that broke this cycle. In 1992 the web was hardly a cornucopia of awesomeness. What helped drive the dominance of the web was its users.

  • The growth of the user base meant more incentive for people to build services, and the availability of more services attracted more users. It was a virtuous cycle.
  • It’s hard for a platform to gain momentum, but once the cycle starts, a flywheel effect kicks in to help the platform grow.

The Distributed Web Gateway project Cloudflare launched last year in Crypto Week is our way of exploring what happens if we try to kickstart that flywheel. By providing a secure, reliable, and fast interface from the classic web with its two billion users to the content on the distributed web, we give the fledgling ecosystem an audience.

  • If the advantages provided by building on the distributed web are appealing to users, then the larger audience will help these services grow in popularity.
  • This is somewhat reminiscent of how IPv6 gained adoption. It started as a niche technology only accessible using IPv4-to-IPv6 translation services.
  • IPv6 adoption has now grown so much that it is becoming a requirement for new services. For example, Apple is requiring that all apps work in IPv6-only contexts.

Eventually, as user-side implementations of distributed web technologies improve, people may move to using the distributed web natively rather than through an HTTP gateway. Or they may not! By leveraging Cloudflare’s global network to give users access to new technologies based on distributed trust, we give these technologies a better chance at gaining adoption.

Happy Crypto Week

At Cloudflare, we always support new technologies that help make the Internet better. Part of helping make a better Internet is scaling the systems of trust that underpin web browsing and protecting them from attack. We provide the tools to create better systems of assurance with fewer points of vulnerability. We work with academic security researchers to get a vision of the future and engineer away vulnerabilities before they can become widespread. It’s a constant journey.

Cloudflare knows that none of this is possible without the work of researchers. From award-winning researchers publishing papers in top journals to clever hobbyists writing blog posts, dedicated and curious people are moving the state of the world’s knowledge forward. However, the push to publish new and novel research sometimes holds researchers back from committing enough time and resources to fully realize their ideas. Great research can be powerful on its own, but it can have an even broader impact when combined with practical applications. We relish the opportunity to stand on the shoulders of these giants and use our engineering know-how and global reach to expand on their work to help build a better Internet.

So, to all of you dedicated researchers, thank you for your work! Crypto Week is yours as much as ours. If you’re working on something interesting and you want help to bring the results of your research to the broader Internet, please contact us at [email protected]. We want to help you realize your dream of making the Internet safe and trustworthy.

Massive Ad Fraud Scheme Relied on BGP Hijacking

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/12/massive_ad_frau.html

This is a really interesting story of an ad fraud scheme that relied on hijacking the Border Gateway Protocol:

Members of 3ve (pronounced “eve”) used their large reservoir of trusted IP addresses to conceal a fraud that otherwise would have been easy for advertisers to detect. The scheme employed a thousand servers hosted inside data centers to impersonate real human beings who purportedly “viewed” ads that were hosted on bogus pages run by the scammers themselves — who then received a check from ad networks for these billions of fake ad impressions. Normally, a scam of this magnitude coming from such a small pool of server-hosted bots would have stuck out to defrauded advertisers. To camouflage the scam, 3ve operators funneled the servers’ fraudulent page requests through millions of compromised IP addresses.

About one million of those IP addresses belonged to computers, primarily based in the US and the UK, that attackers had infected with botnet software strains known as Boaxxe and Kovter. But at the scale employed by 3ve, not even that number of IP addresses was enough. And that’s where the BGP hijacking came in. The hijacking gave 3ve a nearly limitless supply of high-value IP addresses. Combined with the botnets, the ruse made it seem like millions of real people from some of the most affluent parts of the world were viewing the ads.

Lots of details in the article.

An aphorism I often use in my talks is “expertise flows downhill: today’s top-secret NSA programs become tomorrow’s PhD theses and the next day’s hacking tools.” This is an example of that. BGP hacking — known as “traffic shaping” inside the NSA — has long been a tool of national intelligence agencies. Now it is being used by cybercriminals.

EDITED TO ADD (1/2): Classified NSA presentation on “network shaping.” I don’t know if there is a difference inside the NSA between the two terms.