Tag Archives: Routing

How Cloudflare’s systems dynamically route traffic across the globe

Post Syndicated from David Tuber original http://blog.cloudflare.com/meet-traffic-manager/

Picture this: you’re at an airport, and you’re going through an airport security checkpoint. There are a bunch of agents who are scanning your boarding pass and your passport and sending you through to your gate. All of a sudden, some of the agents go on break. Maybe there’s a leak in the ceiling above the checkpoint. Or perhaps a bunch of flights are leaving at 6pm, and a number of passengers turn up at once. Either way, this imbalance between localized supply and demand can cause huge lines and unhappy travelers — who just want to get through the line to get on their flight. How do airports handle this?

Some airports may not do anything and just let you suffer in a longer line. Some airports may offer fast-lanes through the checkpoints for a fee. But most airports will tell you to go to another security checkpoint a little farther away to ensure that you can get through to your gate as fast as possible. They may even have signs up telling you how long each line is, so you can make an easier decision when trying to get through.

At Cloudflare, we have the same problem. We are located in 300 cities around the world that are built to receive end-user traffic for all of our product suites. And in an ideal world, we always have enough computers and bandwidth to handle everyone at their closest possible location. But the world is not always ideal; sometimes we take a data center offline for maintenance, or a connection to a data center goes down, or some equipment fails, and so on. When that happens, we may not have enough attendants to serve every person going through security in every location. It’s not because we haven’t built enough kiosks, but something has happened in our data center that prevents us from serving everyone.

So, we built Traffic Manager: a tool that balances supply and demand across our entire global network. This blog is about Traffic Manager: how it came to be, how we built it, and what it does now.

The world before Traffic Manager

The job now done by Traffic Manager used to be a manual process carried out by network engineers: our network would operate as normal until something happened that caused user traffic to be impacted at a particular data center.

When such events happened, user requests would start to fail with 499 or 500 errors because there weren’t enough machines to handle the request load of our users. This would trigger a page to our network engineers, who would then remove some Anycast routes for that data center. The end result: by no longer advertising those prefixes in the impacted data center, user traffic would divert to a different data center. This is how Anycast fundamentally works: user traffic is drawn to the closest data center advertising the prefix the user is trying to connect to, as determined by Border Gateway Protocol. For a primer on what Anycast is, check out this reference article.

Depending on how bad the problem was, engineers would remove some or even all the routes in a data center. When the data center was again able to absorb all the traffic, the engineers would put the routes back and the traffic would return naturally to the data center.

As you might guess, this was a challenging task for our network engineers to do every single time any piece of hardware on our network had an issue. It didn’t scale.

Never send a human to do a machine’s job

But doing it manually wasn’t just a burden on our Network Operations team. It also resulted in a sub-par experience for our customers; our engineers would need to take time to diagnose and re-route traffic. To solve both these problems, we wanted to build a service that would immediately and automatically detect if users were unable to reach a Cloudflare data center, and withdraw routes from the data center until users were no longer seeing issues. Once the service received notifications that the impacted data center could absorb the traffic, it could put the routes back and reconnect that data center. This service is called Traffic Manager, because its job (as you might guess) is to manage traffic coming into the Cloudflare network.

Accounting for second order consequences

When a network engineer removes a route from a router, they can make the best guess at where the user requests will move to, and try to ensure that the failover data center has enough resources to handle the requests — if it doesn’t, they can adjust the routes there accordingly prior to removing the route in the initial data center. To be able to automate this process, we needed to move from a world of intuition to a world of data — accurately predicting where traffic would go when a route was removed, and feeding this information to Traffic Manager, so it could ensure it doesn’t make the situation worse.

Meet Traffic Predictor

Although we can adjust which data centers advertise a route, we are unable to influence what proportion of traffic each data center receives. Each time we add a new data center or a new peering session, the distribution of traffic changes, and with over 300 cities and 12,500 peering sessions, it has become quite difficult for a human to keep track of or predict how traffic will move around our network. Traffic Manager needed a buddy: Traffic Predictor.

In order to do its job, Traffic Predictor carries out an ongoing series of real world tests to see where traffic actually moves. Traffic Predictor relies on a testing system that simulates removing a data center from service and measuring where traffic would go if that data center wasn’t serving traffic. To help understand how this system works, let’s simulate the removal of a subset of a data center in Christchurch, New Zealand:

  • First, Traffic Predictor gets a list of all the IP addresses that normally connect to Christchurch. Traffic Predictor will send a ping request to hundreds of thousands of IPs that have recently made a request there.
  • Traffic Predictor records if the IP responds, and whether the response returns to Christchurch using a special Anycast IP range specifically configured for Traffic Predictor.
  • Once Traffic Predictor has a list of IPs that respond to Christchurch, it withdraws the route containing that special range from Christchurch, waits a few minutes for the Internet routing table to be updated, and runs the test again.
  • Instead of being routed to Christchurch, the responses now arrive at the data centers around Christchurch. Traffic Predictor records which data centers receive each response, and uses those results as the failover set for Christchurch.

This allows us to simulate Christchurch going offline without actually taking Christchurch offline!
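
To make those steps concrete, here is a minimal sketch of that measurement loop in Python. It is purely illustrative: the probe and route-control primitives are passed in as callables because they stand in for internal tooling that is not described here, and the colo codes in the comments are just examples.

# Illustrative sketch of the Traffic Predictor measurement loop described above.
# The probe and route-control primitives are hypothetical stand-ins, not
# Cloudflare's real tooling, so they are injected as callables.
import time
from collections import Counter
from typing import Callable, Iterable

def measure_failover(
    colo: str,
    probe_prefix: str,
    recent_client_ips: Iterable[str],
    probe: Callable[[str], str],              # probe(ip) -> colo code that received the reply
    withdraw_route: Callable[[str, str], None],
    restore_route: Callable[[str, str], None],
    convergence_wait_s: int = 300,
) -> Counter:
    """Estimate where traffic lands if `colo` stops announcing `probe_prefix`."""
    # Baseline: keep only the clients whose probe replies currently land in `colo`.
    responsive = [ip for ip in recent_client_ips if probe(ip) == colo]

    # Simulate failure: withdraw the special probe range from this colo only.
    withdraw_route(colo, probe_prefix)
    time.sleep(convergence_wait_s)            # let Internet routing tables converge

    # Re-probe and count which data center now receives each reply.
    failover = Counter(probe(ip) for ip in responsive)

    restore_route(colo, probe_prefix)         # put the route back when done
    return failover                           # e.g. Counter({"AKL": 998, "WLG": 2})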

But Traffic Predictor doesn’t just do this for any one data center. To add additional resiliency, it also calculates a second layer of indirection: for each data center failure scenario, Traffic Predictor calculates failure scenarios and creates policies for the case where surrounding data centers fail as well.

Using our example from before, when Traffic Predictor tests Christchurch, it will run a series of tests that remove several surrounding data centers from service, including Christchurch, to calculate different failure scenarios. This ensures that even if something catastrophic happens that impacts multiple data centers in a region, we can still serve user traffic. If you think this data model is complicated, you’re right: it takes several days to calculate all of these failure paths and policies.

Here’s what those failure paths and failover scenarios look like for all of our data centers around the world when they’re visualized:

[Figure: visualization of the failure paths and failover scenarios for all Cloudflare data centers worldwide]

This can be a bit complicated for humans to parse, so let’s dig into that above scenario for Christchurch, New Zealand to make this a bit more clear. When we take a look at failover paths specifically for Christchurch, we see they look like this:

[Figure: failover paths for the Christchurch, New Zealand data center]

In this scenario we predict that 99.8% of Christchurch’s traffic would shift to Auckland, which is able to absorb all Christchurch traffic in the event of a catastrophic outage.

Traffic Predictor not only lets us see where traffic will move if something happens, it also lets us preconfigure Traffic Manager policies to move requests out of failover data centers ahead of time, preventing a thundering herd scenario: a sudden influx of requests can cause failures in a second data center if the first one has issues. With Traffic Predictor, Traffic Manager doesn’t just move traffic out of one data center when that one fails, it also proactively moves traffic out of other data centers to ensure a seamless continuation of service.

From a sledgehammer to a scalpel

With Traffic Predictor, Traffic Manager can dynamically advertise and withdraw prefixes while ensuring that every data center can handle all the traffic. But withdrawing prefixes as a means of traffic management can be a bit heavy-handed at times. One of the reasons for this is that the only way we had to add or remove traffic to a data center was through advertising routes from our Internet-facing routers. Each of our routes covers thousands of IP addresses, so removing even one still shifts a large portion of traffic.

Specifically, Internet applications advertise prefixes to the Internet from a /24 subnet at an absolute minimum, and many advertise larger (less specific) prefixes than that. This is generally done to prevent things like route leaks or route hijacks: many providers will actually filter out routes that are more specific than a /24 (for more information on that, check out this blog). If we assume that Cloudflare maps protected properties to IP addresses at a 1:1 ratio, then each /24 subnet would be able to service 256 customers, which is the number of IP addresses in a /24 subnet. If every IP address sent one request per second, we’d have to move four /24 subnets out of a data center if we needed to move 1,000 requests per second (RPS).

But in reality, Cloudflare maps a single IP address to hundreds of thousands of protected properties. So for Cloudflare, a /24 might take 3,000 requests per second, but if we needed to move 1,000 RPS out, we would have no choice but to move the whole /24 out. And that’s just assuming we advertise at a /24 level. If we used /20s to advertise, the amount we can withdraw gets even less granular: at a 1:1 website to IP address mapping, that’s 4,096 requests per second for each prefix, and even more if the website to IP address mapping is many to one.
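
As a back-of-the-envelope illustration of how coarse this lever is, the short sketch below uses the same simplified assumptions as the text: a 1:1 website-to-IP mapping and one request per second per IP.

# Back-of-the-envelope granularity of withdrawing a whole IPv4 prefix, assuming the
# simplified 1:1 website-to-IP mapping and 1 request/s per IP described above.
def rps_per_prefix(prefix_len: int, rps_per_ip: float = 1.0) -> float:
    ips = 2 ** (32 - prefix_len)        # number of IPv4 addresses in the prefix
    return ips * rps_per_ip

print(rps_per_prefix(24))   # 256.0  -> about four /24s must move to shed 1,000 RPS
print(rps_per_prefix(20))   # 4096.0 -> a single /20 moves far more than 1,000 RPS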

While withdrawing prefix advertisements improved the customer experience for those users who would have seen a 499 or 500 error, it also meant that a significant portion of users who wouldn’t have been impacted by the issue were still moved away from the data center they should have gone to, probably slowing them down, even if only a little bit. This concept of moving more traffic out than is necessary is called “stranding capacity”: the data center is theoretically able to service more users in a region but cannot because of how Traffic Manager was built.

We wanted to improve Traffic Manager so that it only moved the absolute minimum of users out of a data center that was seeing a problem and not strand any more capacity. To do so, we needed to shift percentages of prefixes, so we could be extra fine-grained and only move the things that absolutely need to be moved. To solve this, we built an extension of our Layer 4 load balancer Unimog, which we call Plurimog.

A quick refresher on Unimog and layer 4 load balancing: every single one of our machines contains a service that determines whether that machine can take a user request. If the machine can take a user request then it sends the request to our HTTP stack which processes the request before returning it to the user. If the machine can’t take the request, the machine sends the request to another machine in the data center that can. The machines can do this because they are constantly talking to each other to understand whether they can serve requests for users.

Plurimog does the same thing, but instead of talking between machines, Plurimog talks between data centers and points of presence. If a request goes into Philadelphia and Philadelphia is unable to take the request, Plurimog will forward it to another data center that can take the request, like Ashburn, where the request is decrypted and processed. Because Plurimog operates at layer 4, it can send individual TCP or UDP requests to other places, which allows it to be very fine-grained: it can send percentages of traffic to other data centers very easily, meaning that we only need to send away enough traffic to ensure that everyone can be served as fast as possible. Check out how that works in our Frankfurt data center, where we are able to shift progressively more and more traffic away to handle issues. This chart shows the number of actions taken on free traffic that cause it to be sent out of Frankfurt over time:

[Chart: actions taken on free traffic causing it to be sent out of Frankfurt, over time]

But even within a data center, we can route traffic around to prevent it from leaving the data center at all. Our large data centers, called Multi-Colo Points of Presence (MCPs), contain logical sections of compute within a data center that are distinct from one another. These MCP data centers are enabled with another version of Unimog called Duomog, which allows traffic to be shifted between logical sections of compute automatically. This makes MCP data centers fault-tolerant without sacrificing performance for our customers, and allows Traffic Manager to work within a data center as well as between data centers.

When evaluating portions of requests to move, Traffic Manager does the following:

  • Traffic Manager identifies the proportion of requests that need to be removed from a data center or subsection of a data center so that all requests can be served.
  • Traffic Manager then calculates the aggregated space metrics for each target to see how many requests each failover data center can take.
  • Traffic Manager then identifies how much traffic in each plan we need to move, and moves either a proportion of the plan, or all of the plan through Plurimog/Duomog, until we've moved enough traffic. We move Free customers first, and if there are no more Free customers in a data center, we'll move Pro, and then Business customers if needed.

For example, let’s look at Ashburn, Virginia: one of our MCPs. Ashburn has nine different subsections of capacity that can each take traffic. On 8/28, one of those subsections, IAD02, had an issue that reduced the amount of traffic it could handle.

During this time period, Duomog sent more traffic from IAD02 to other subsections within Ashburn, ensuring that Ashburn was always online, and that performance was not impacted during this issue. Then, once IAD02 was able to take traffic again, Duomog shifted traffic back automatically. You can see these actions visualized in the time series graph below, which tracks the percentage of traffic moved over time between subsections of capacity within IAD02 (shown in green):

[Chart: percentage of traffic moved over time between subsections of capacity within IAD02]

How does Traffic Manager know how much to move?

Although we used requests per second in the example above, requests per second isn’t an accurate enough metric when actually moving traffic. The reason for this is that different customers have different resource costs to our service; a website served mainly from cache with the WAF deactivated is much cheaper CPU-wise than a site with all WAF rules enabled and caching disabled. So we record the CPU time that each request takes. We can then aggregate the CPU time across each plan to find the CPU time usage per plan. We record the CPU time in ms, and take a per second value, resulting in a unit of milliseconds per second.
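
As a rough illustration of that unit, the sketch below aggregates per-request CPU time into milliseconds of CPU per second of wall clock for each plan. The request records and field names are invented for the example; they are not the real Traffic Manager data model.

# Toy aggregation of per-request CPU time into "CPU milliseconds per second" per plan.
# The request records and plan names are made up for illustration.
from collections import defaultdict

requests = [
    {"plan": "free", "cpu_ms": 1.2},
    {"plan": "free", "cpu_ms": 0.8},
    {"plan": "enterprise", "cpu_ms": 4.5},
]

def cpu_time_per_plan(requests, window_seconds: float = 1.0) -> dict:
    totals = defaultdict(float)
    for r in requests:
        totals[r["plan"]] += r["cpu_ms"]
    # Normalize by the measurement window to get ms of CPU per second of wall clock.
    return {plan: ms / window_seconds for plan, ms in totals.items()}

print(cpu_time_per_plan(requests))  # {'free': 2.0, 'enterprise': 4.5}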

CPU time is an important metric because of the impact it can have on latency and customer performance. As an example, consider the time it takes for an eyeball request to make it entirely through the Cloudflare front line servers: we call this the cfcheck latency. If this number goes too high, then our customers will start to notice, and they will have a bad experience. When cfcheck latency gets high, it’s usually because CPU utilization is high. The graph below shows 95th percentile cfcheck latency plotted against CPU utilization across all the machines in the same data center, and you can see the strong correlation:

[Chart: 95th percentile cfcheck latency plotted against CPU utilization across machines in the same data center]

So having Traffic Manager look at CPU time in a data center is a very good way to ensure that we’re giving customers the best experience and not causing problems.

After getting the CPU time per plan, we need to figure out how much of that CPU time to move to other data centers. To do this, we aggregate the CPU utilization across all servers to give a single CPU utilization across the data center. If a proportion of servers in the data center fail, due to network device failure, power failure, etc., then the requests that were hitting those servers are automatically routed elsewhere within the data center by Duomog. As the number of servers decreases, the overall CPU utilization of the data center increases. Traffic Manager has three thresholds for each data center: the maximum threshold, the target threshold, and the acceptable threshold:

  • Maximum: the CPU level at which performance starts to degrade, where Traffic Manager will take action
  • Target: the level to which Traffic Manager will try to reduce the CPU utilization to restore optimal service to users
  • Acceptable: the level below which a data center can receive requests forwarded from another data center, or revert active moves

When a data center goes above the maximum threshold, Traffic Manager takes the ratio of total CPU time across all plans to current CPU utilization, then applies that to the target CPU utilization to find the target CPU time. Doing it this way means we can compare a data center with 100 servers to a data center with 10 servers, without having to worry about the number of servers in each data center. This assumes that load increases linearly, which is close enough to true for the assumption to be valid for our purposes.

Target ratio equals current ratio:

target_cpu_time / target_cpu_utilization = current_cpu_time / current_cpu_utilization

Therefore:

target_cpu_time = (target_cpu_utilization / current_cpu_utilization) * current_cpu_time

Subtracting the target CPU time from the current CPU time gives us the CPU time to move:

cpu_time_to_move = current_cpu_time - target_cpu_time

For example, if the current CPU utilization was at 90% across the data center, the target was 85%, and the CPU time across all plans was 18,000, we would have:

target_cpu_time = (85% / 90%) * 18,000 = 17,000

This would mean Traffic Manager would need to move 1,000 CPU time:

cpu_time_to_move = 18,000 - 17,000 = 1,000

Now that we know the total CPU time that needs to move, we can go through the plans until the required amount has been met.
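
Restated as a couple of lines of arithmetic (this is just the formulas above, not Traffic Manager’s actual code), the calculation looks like this:

# Restating the formulas above: the ratio of CPU time to CPU utilization is assumed
# constant, so the target CPU time scales linearly with the target utilization.
def cpu_time_to_move(current_util: float, target_util: float, current_cpu_time: float) -> float:
    target_cpu_time = (target_util / current_util) * current_cpu_time
    return current_cpu_time - target_cpu_time

# The example from the text: 90% current utilization, 85% target, 18,000 ms/s of CPU time.
print(round(cpu_time_to_move(0.90, 0.85, 18_000)))  # 1000 (ms/s of CPU time to move)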

What is the maximum threshold?

A frequent problem that we faced was determining at which point Traffic Manager should start taking action in a data center – what metric should it watch, and what is an acceptable level?

As said before, different services have different requirements in terms of CPU utilization, and there are many cases of data centers that have very different utilization patterns.

To solve this problem, we turned to machine learning. We created a service that will automatically adjust the maximum thresholds for each data center according to customer-facing indicators. For our main service-level indicator (SLI), we decided to use the cfcheck latency metric we described earlier.

But we also need to define a service-level objective (SLO) in order for our machine learning application to be able to adjust the threshold. We set the SLO to 20 ms: comparing our SLI to our SLO, our 95th percentile cfcheck latency should never go above 20 ms, and if it does, we need to do something. The graph below shows 95th percentile cfcheck latency over time; customers start to get unhappy when cfcheck latency goes into the red zone:

[Chart: 95th percentile cfcheck latency over time, with a red zone marking SLO breaches]

If customers have a bad experience when CPU gets too high, then the goal of Traffic Manager’s maximum thresholds is to ensure that customer performance isn’t impacted and to start redirecting traffic away before performance starts to degrade. At a scheduled interval, the Traffic Manager service will fetch a number of metrics for each data center and apply a series of machine learning algorithms. After cleaning the data for outliers, we apply a simple quadratic curve fit, and we are currently testing a linear regression algorithm.

After fitting the models, we can use them to predict the CPU usage at which the SLI will equal our SLO, and then use that as our maximum threshold. If we plot the CPU values against the SLI, we can see clearly why these methods work so well for our data centers, as you can see for Barcelona in the graphs below, which show the quadratic curve fit and the linear regression fit, respectively.

[Charts: CPU utilization plotted against the SLI for Barcelona, with the quadratic curve fit and the linear regression fit]

In these charts the vertical line is the SLO, and the intersection of this line with the fitted model represents the value that will be used as the maximum threshold. This model has proved to be very accurate, and we are able to significantly reduce the SLO breaches. Let’s take a look at when we started deploying this service in Lisbon:

[Chart: cfcheck latency in Lisbon before and after the dynamic maximum threshold was deployed]

Before the change, cfcheck latency was constantly spiking, but Traffic Manager wasn’t taking action because the maximum threshold was static. After July 29, we see that cfcheck latency has never hit the SLO, because we are constantly measuring to make sure that customers are never impacted by CPU increases.
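
As an illustration of how such a threshold can be derived, the sketch below fits a quadratic to synthetic (CPU utilization, p95 cfcheck latency) samples and solves for the utilization at which the fitted latency crosses the 20 ms SLO. The data and the fitting details are simplified stand-ins for the production pipeline.

# Illustrative derivation of a per-data-center maximum threshold: fit a quadratic to
# (CPU utilization, p95 cfcheck latency) samples, then solve for the CPU level at which
# the fitted latency reaches the 20 ms SLO. The sample data here is synthetic.
import numpy as np

cpu = np.array([40, 50, 60, 70, 80, 90])           # CPU utilization samples (%)
lat = np.array([6.0, 7.0, 8.5, 11.0, 15.0, 21.0])  # p95 cfcheck latency samples (ms)
SLO_MS = 20.0

a, b, c = np.polyfit(cpu, lat, 2)                  # latency ~= a*cpu^2 + b*cpu + c

# Solve a*x^2 + b*x + (c - SLO) = 0 and keep the root inside a sensible CPU range.
roots = np.roots([a, b, c - SLO_MS])
candidates = [r.real for r in roots if abs(r.imag) < 1e-9 and 0 < r.real <= 100]
max_threshold = min(candidates)                    # first crossing of the SLO

print(round(max_threshold, 1))                     # 88.7: used as this data center's maximum threshold (%)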

Where to send the traffic?

So now that we have a maximum threshold, we need to find the third CPU utilization threshold which isn’t used when calculating how much traffic to move – the acceptable threshold. When a data center is below this threshold, it has unused capacity which, as long as it isn’t forwarding traffic itself, is made available for other data centers to use when required. To work out how much each data center is able to receive, we use the same methodology as above, substituting target for acceptable:

acceptable_cpu_time / acceptable_cpu_utilization = current_cpu_time / current_cpu_utilization

Therefore:

acceptable_cpu_time = (acceptable_cpu_utilization / current_cpu_utilization) * current_cpu_time

Subtracting the current CPU time from the acceptable CPU time gives us the amount of CPU time that a data center could accept:

available_cpu_time = acceptable_cpu_time - current_cpu_time

To find where to send traffic, Traffic Manager finds the available CPU time in all data centers, then orders them by latency from the data center that needs to move traffic. It moves through each of the data centers, using all available capacity based on the maximum thresholds before moving on to the next. When finding which plans to move, we move from the lowest priority plan to the highest, but when finding where to send them, we move in the opposite direction.

To make this clearer let's use an example:

We need to move 1,000 CPU time from data center A, and we have the following usage per plan: Free: 500ms/s, Pro: 400ms/s, Business: 200ms/s, Enterprise: 1000ms/s.

We would move 100% of Free (500ms/s), 100% of Pro (400ms/s), 50% of Business (100ms/s), and 0% of Enterprise.

Nearby data centers have the following available CPU time: B: 300ms/s, C: 300ms/s, D: 1,000ms/s.

With latencies: A-B: 100ms, A-C: 110ms, A-D: 120ms.

Starting with the lowest latency and the highest priority plan that requires action, we would be able to move all the Business CPU time to data center B and half of Pro. Next we would move on to data center C, and be able to move the rest of Pro and 20% of Free. The rest of Free could then be forwarded to data center D, resulting in the following action: Business: 50% → B, Pro: 50% → B, 50% → C, Free: 20% → C, 80% → D.
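
A simplified sketch of that allocation walk, using the numbers from the example, might look like the following. The function is an illustration of the ordering described above, not Traffic Manager’s actual implementation, and it assumes the target data centers have enough total capacity.

# Simplified sketch of the allocation walk described above: shed CPU time from the
# lowest-priority plans first, then place it in the lowest-latency data centers first,
# highest-priority plan closest. Assumes the targets have enough total capacity.
def plan_moves(cpu_to_move, plans, targets):
    """plans: [(name, ms_per_s)], lowest priority first.
    targets: [(name, available_ms_per_s)], sorted by latency, nearest first."""
    # 1. Decide how much of each plan must leave the overloaded data center.
    to_shed, remaining = [], cpu_to_move
    for plan, usage in plans:
        amount = min(usage, remaining)
        if amount > 0:
            to_shed.append((plan, amount))
            remaining -= amount

    # 2. Place the shed traffic: highest-priority plan first, nearest target first.
    moves, capacity = [], dict(targets)
    order = iter([name for name, _ in targets])
    dest = next(order)
    for plan, amount in reversed(to_shed):
        while amount > 0:
            placed = min(amount, capacity[dest])
            if placed > 0:
                moves.append((plan, placed, dest))
                capacity[dest] -= placed
                amount -= placed
            if amount > 0:
                dest = next(order)        # nearest target is full; move on to the next
    return moves

plans = [("Free", 500), ("Pro", 400), ("Business", 200), ("Enterprise", 1000)]
targets = [("B", 300), ("C", 300), ("D", 1000)]   # ordered by latency from data center A
print(plan_moves(1000, plans, targets))
# [('Business', 100, 'B'), ('Pro', 200, 'B'), ('Pro', 200, 'C'),
#  ('Free', 100, 'C'), ('Free', 400, 'D')]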

Reverting actions

In the same way that Traffic Manager is constantly looking to keep data centers from going above the threshold, it is also looking to bring any forwarded traffic back into a data center that is actively forwarding traffic.

Above we saw how Traffic Manager works out how much traffic a data center is able to receive from another data center — it calls this the available CPU time. When there is an active move we use this available CPU time to bring back traffic to the data center — we always prioritize reverting an active move over accepting traffic from another data center.

When you put this all together, you get a system that is constantly measuring system and customer health metrics for every data center and spreading traffic around to make sure that each request can be served given the current state of our network. When we put all of these moves between data centers on a map, it looks something like this, a map of all Traffic Manager moves for a period of one hour. This map doesn’t show our full data center deployment, but it does show the data centers that are sending or receiving moved traffic during this period:

Data centers in red or yellow are under load and shifting traffic to other data centers until they become green, which means that all metrics are showing as healthy. The size of the circles represents how many requests are shifted from or to those data centers, and the direction of the lines shows where the traffic is going. This is difficult to see at a world scale, so let’s zoom into the United States to see this in action for the same time period:

Here you can see Toronto, Detroit, New York, and Kansas City are unable to serve some requests due to hardware issues, so they will send those requests to Dallas, Chicago, and Ashburn until equilibrium is restored for users and data centers. Once data centers like Detroit are able to service all the requests they are receiving without needing to send traffic away, Detroit will gradually stop forwarding requests to Chicago until any issues in the data center are completely resolved, at which point it will no longer be forwarding anything. Throughout all of this, end users are online and are not impacted by any physical issues that may be happening in Detroit or any of the other locations sending traffic.

Happy network, happy products

Because Traffic Manager is plugged into the user experience, it is a fundamental component of the Cloudflare network: it keeps our products online and ensures that they’re as fast and reliable as they can be. It’s our real time load balancer, helping to keep our products fast by only shifting necessary traffic away from data centers that are having issues. Because less traffic gets moved, our products and services stay fast.

But Traffic Manager also helps keep our products online and reliable, because it allows them to predict where reliability issues may occur and preemptively move the products elsewhere. For example, Browser Isolation works directly with Traffic Manager to help ensure the uptime of the product. When you connect to a Cloudflare data center to create a hosted browser instance, Browser Isolation first asks Traffic Manager whether the data center has enough capacity to run the instance locally, and if so, the instance is created right then and there. If there isn’t sufficient capacity available, Traffic Manager tells Browser Isolation which data center is the closest one with sufficient available capacity, thereby helping Browser Isolation provide the best possible experience for the user.

Happy network, happy users

At Cloudflare, we operate this huge network to service all of our different products and customer scenarios. We’ve built this network for resiliency: in addition to our MCP locations designed to reduce impact from a single failure, we are constantly shifting traffic around on our network in response to internal and external issues.

But that is our problem — not yours.

When human beings had to fix those issues, it was customers and end users who would be impacted. To ensure that you’re always online, we’ve built a smart system that detects our hardware failures and preemptively balances traffic across our network to keep it online and as fast as possible. This system works faster than any person, not only allowing our network engineers to sleep at night, but also providing a better, faster experience for all of our customers.

And finally: if these kinds of engineering challenges sound exciting to you, then please consider checking out the Traffic Engineering team's open position on Cloudflare’s Careers page!

Routing information now on Cloudflare Radar

Post Syndicated from Mingwei Zhang original http://blog.cloudflare.com/radar-routing/

Routing is one of the most critical operations of the Internet. Routing decides how and where Internet traffic should flow from its source to its destination, and can be categorized into two major types: intra-domain routing and inter-domain routing. Intra-domain routing decides how individual packets are routed among the servers and routers within an organization’s network. When traffic reaches the edge of a network, inter-domain routing kicks in to decide what the next hop is and forward the traffic along to the corresponding networks. Border Gateway Protocol (BGP) is the de facto inter-domain routing protocol used on the Internet.

Today, we are introducing another section on Cloudflare Radar: the Routing page, which focuses on monitoring exchanged BGP messages to extract and present insights on IP prefixes, individual networks, countries, and the Internet overall. The new routing data allows users to quickly examine the routing status of the Internet, examine secure routing protocol deployment for a country, identify routing anomalies, validate IP block reachability, and much more, from globally distributed vantage points.

It’s a detailed view of how the Internet itself holds together.

Collecting routing statistics

The Internet consists of tens of thousands of interconnected organizations. Each organization manages its own internal networking infrastructure autonomously, and is referred to as an autonomous system (AS). ASes establish connectivity among each other and exchange routing information via BGP messages to form the current Internet.

When you open the Radar Routing page, the “Routing Statistics” block provides a quick glance at the size and status of an autonomous system (AS), a country, or the Internet overall. The routing statistics component contains the following count information:

  • The number of ASes on the Internet or registered from a given country;
  • The number of distinct prefixes and the routes toward them observed on the global routing table, worldwide, by country, or by AS;
  • The number of routes categorized by Resource Public Key Infrastructure (RPKI) validation results (valid, invalid, or unknown).

We also show the breakdown of these numbers for IPv4 and IPv6 separately, so users can better understand this information with respect to each IP protocol.

For a given network, we also show the BGP announcements volume chart for the past week as well as other basic information like network name, registration country, estimated user count, and sibling networks.

Identifying routing anomalies

BGP as a routing protocol suffers from a number of security weaknesses. On the new Routing page, we consolidate the BGP route leak and BGP hijack detection results in one single place, showing the relevant detected events for any given network or globally.

The BGP Route Leaks table shows the detected BGP route leak events. Each entry in the table contains information about the ASes related to the leak event, its start and end time, and other numeric statistics that reflect the scale and impact of the event. The BGP Origin Hijacks table shows the detected potential BGP origin hijacks. Apart from the relevant ASes, time, and impact information, we also show the key evidence that we collected for each event, to provide additional context on why, and how likely it is, that the event is a BGP hijack.

With this release, we introduce another anomaly detection: RPKI-invalid Multiple Origin AS (MOAS) events. A MOAS is a type of routing conflict where multiple networks (ASes) originate the same IP prefix at the same time, which goes against best practice recommendations. Our system examines the most recent global routing tables and identifies the MOASes present on them. With the help of Resource Public Key Infrastructure (RPKI), we can further identify MOAS events whose origins were proven RPKI-invalid, which are less likely to be legitimate cases. Users and operators can quickly identify such anomalies relevant to the networks of interest and take action accordingly.

The Routing page will be the permanent home for all things BGP and routing data in the future; we will gradually introduce more anomaly detections and improve our pipeline to provide more security insights.

Examining routing assets and connectivity

Apart from examining the overall routing statistics and anomalies, we also gather information on the routing assets (IP prefixes for a network and networks for a country) and networks’ connectivity.

Tens of thousands of autonomous systems (ASes) connect to each other to form the current Internet. The ASes differ in size and operate in different geographic locations. Generally, larger networks are more well-connected and considered “upstream”, while smaller networks are less connected and considered “downstream” on the Internet. Below is an example connectivity diagram showing how two smaller networks may connect to each other. AS1 announces its IP prefixes to its upstream providers, and the announcement propagates upwards until it reaches the large networks AS3 and AS4; the route then propagates downstream to smaller networks until it reaches AS6.

[Diagram: example AS connectivity, with AS1’s announcement propagating up through its upstreams to AS3 and AS4, then downstream until it reaches AS6]

On the Routing page, we examine which IP prefixes any given AS originates, as well as the interconnections among ASes. We show the full list of IP prefixes originated by any given AS, including breakdown lists by RPKI validation status. We also show the detected connectivity among other ASes, categorized into upstream, downstream, and peering connections. Users can easily search any AS’s upstreams, downstreams, or peers.

For a given country, we show the full list of networks registered in the country, sorted by the number of IP prefixes originated from the corresponding networks. This allows users to quickly glance and find networks from any given country. The table is also searchable by network name or AS number.

Routing data API access

Like all the other data, the Cloudflare Radar Routing data is powered by our developer API. The data API is freely available under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license. In the following table, we list all of our data APIs available at launch. As we improve the routing section, we will introduce more APIs in the future.

API                                     Type
Get BGP origin hijack events            Anomaly detection
Get BGP route leak events               Anomaly detection
Get MOASes                              Anomaly detection
Get BGP routing table stats             Routing information
Get prefix-to-origin mapping            Routing information
Get all ASes registered in a country    Routing information
Get AS-level relationship               Routing information

Example 1: lookup origin AS for a given prefix with cURL

The Cloudflare Radar prefix-to-origin mapping API returns the matching prefix-origin pairs observed on the global routing tables, allowing users to quickly examine the networks that originate a given prefix or list all the prefixes a network originates.

In this example, we ask “which network(s) originate the prefix 1.1.1.0/24?” using the following cURL command:

curl --request GET \
  --url "https://api.cloudflare.com/client/v4/radar/bgp/routes/pfx2as?prefix=1.1.1.0/24" \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer YOUR_TOKEN"

The returned JSON result shows that Cloudflare (AS13335) originates the prefix 1.1.1.0/24 and that it is an RPKI-valid origin. It also returns meta information such as the UTC timestamp of the query, as well as when the dataset was last updated (the data_time field).

{
  "success": true,
  "errors": [],
  "result": {
    "prefix_origins": [
      {
        "origin": 13335,
        "peer_count": 82,
        "prefix": "1.1.1.0/24",
        "rpki_validation": "Valid"
      }
    ],
    "meta": {
      "data_time": "2023-07-24T16:00:00",
      "query_time": "2023-07-24T18:04:55",
      "total_peers": 82
    }
  }
}
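
The same lookup is easy to script. Here is a minimal Python equivalent of the cURL example, using the requests library and the endpoint shown above (YOUR_TOKEN again stands in for a real API token):

# Minimal Python equivalent of the cURL example above: query the Radar
# prefix-to-origin endpoint for 1.1.1.0/24 and print each origin found.
import requests

resp = requests.get(
    "https://api.cloudflare.com/client/v4/radar/bgp/routes/pfx2as",
    params={"prefix": "1.1.1.0/24"},
    headers={"Authorization": "Bearer YOUR_TOKEN"},  # replace with a real API token
    timeout=10,
)
resp.raise_for_status()

for entry in resp.json()["result"]["prefix_origins"]:
    print(entry["prefix"], "originated by AS", entry["origin"],
          "| RPKI:", entry["rpki_validation"])
# 1.1.1.0/24 originated by AS 13335 | RPKI: Valid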

Example 2: integrate Radar API into command-line tool

BGPKIT monocle is an open-source command-line application that provides multiple utility functions like searching BGP messages on public archives, network lookup by name, RPKI validation status for a given IP prefix, etc.

By integrating Cloudflare Radar APIs into monocle, users can now quickly look up routing statistics or prefix-to-origin mappings by running the monocle radar stats [QUERY] and monocle radar pfx2as [PREFIX] commands.

➜  monocle radar stats   
┌─────────────┬─────────┬──────────┬─────────────────┬───────────────┬─────────────────┐
│ scope       │ origins │ prefixes │ rpki_valid      │ rpki_invalid  │ rpki_unknown    │
├─────────────┼─────────┼──────────┼─────────────────┼───────────────┼─────────────────┤
│ global      │ 81769   │ 1204488  │ 551831 (45.38%) │ 15652 (1.29%) │ 648462 (53.33%) │
├─────────────┼─────────┼──────────┼─────────────────┼───────────────┼─────────────────┤
│ global ipv4 │ 74990   │ 1001973  │ 448170 (44.35%) │ 11879 (1.18%) │ 550540 (54.48%) │
├─────────────┼─────────┼──────────┼─────────────────┼───────────────┼─────────────────┤
│ global ipv6 │ 31971   │ 202515   │ 103661 (50.48%) │ 3773 (1.84%)  │ 97922 (47.68%)  │
└─────────────┴─────────┴──────────┴─────────────────┴───────────────┴─────────────────┘

➜  monocle radar pfx2as 1.1.1.0/24
┌────────────┬─────────┬───────┬───────────────┐
│ prefix     │ origin  │ rpki  │ visibility    │
├────────────┼─────────┼───────┼───────────────┤
│ 1.1.1.0/24 │ as13335 │ valid │ high (98.78%) │
└────────────┴─────────┴───────┴───────────────┘

Visit Cloudflare Radar for additional insights around Internet disruptions, routing issues, Internet traffic trends, attacks, Internet quality, and more. Follow us on social media at @CloudflareRadar (Twitter), cloudflare.social/@radar (Mastodon), and radar.cloudflare.com (Bluesky), or contact us via e-mail.

Why BGP communities are better than AS-path prepends

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/prepends-considered-harmful/

The Internet, in its purest form, is a loosely connected graph of independent networks (also called Autonomous Systems, or AS for short). These networks use a signaling protocol called BGP (Border Gateway Protocol) to inform their neighbors (also known as peers) about the reachability of IP prefixes (a group of IP addresses) in and through their network. Part of this exchange contains useful metadata about the IP prefix that is used to inform network routing decisions. One example of this metadata is the full AS-path, which consists of the different autonomous systems an IP packet needs to pass through to reach its destination.

As we all want our packets to get to their destination as fast as possible, selecting the shortest AS-path for a given prefix is a good idea. This is where something called prepending comes into play.

Routing on the Internet, a primer

Let’s briefly talk about how the Internet works at its most fundamental level, before we dive into some nitty-gritty details.

The Internet is, at its core, a massively interconnected network of thousands of networks. Each network owns two things that are critical:

1. An Autonomous System Number (ASN): a 32-bit integer that uniquely identifies a network. For example, one of the Cloudflare ASNs (we have multiple) is 13335.

2. IP prefixes: An IP prefix is a range of IP addresses, bundled together in powers of two: in the IPv4 space, two addresses form a /31 prefix, four form a /30, and so on, all the way up to /0, which is shorthand for “all IPv4 prefixes”. The same applies to IPv6, but instead of aggregating 32 bits at most, you can aggregate up to 128 bits. The figure below shows this relationship between IP prefixes, in reverse — a /24 contains two /25s, each of which contains two /26s, and so on.

[Figure: the IP prefix hierarchy: a /24 contains two /25s, each of which contains two /26s, and so on]
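
If you want to play with this hierarchy yourself, Python’s standard ipaddress module makes it easy to explore; the snippet below is just an illustration of the aggregation described above, using a documentation prefix.

# A /24 contains two /25s, four /26s, and so on; the standard library's ipaddress
# module makes the hierarchy in the figure above easy to inspect.
import ipaddress

net = ipaddress.ip_network("192.0.2.0/24")
print(list(net.subnets(new_prefix=25)))       # two /25s: 192.0.2.0/25 and 192.0.2.128/25
print(len(list(net.subnets(new_prefix=26))))  # 4 -> four /26s inside the /24
print(net.num_addresses)                      # 256 addresses in a /24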

To communicate on the Internet, you must be able to reach your destination, and that’s where routing protocols come into play. They enable each node on the Internet to know where to send your message (and for the receiver to send a message back).

As mentioned earlier, these destinations are identified by IP addresses, and contiguous ranges of IP addresses are expressed as IP prefixes. We use IP prefixes for routing as an efficiency optimization: keeping track of where to go for four billion (2^32) IP addresses in IPv4 would be incredibly complex, and require a lot of resources. Sticking to prefixes reduces that number down to about one million instead.

Now recall that Autonomous Systems are independently operated and controlled. In the Internet’s network of networks, how do I tell Source A in some other network that there is an available path to get to Destination B in (or through) my network? In comes BGP! BGP is the Border Gateway Protocol, and it is used to signal reachability information. Signal messages generated by the source ASN are referred to as ‘announcements’ because they declare to the Internet that IP addresses in the prefix are online and reachable.

[Figure: BGP announcements giving Source A two paths, through two different networks, to reach Destination B]

Have a look at the figure above. Source A should now know how to get to Destination B through 2 different networks!

This is what an actual BGP message would look like:

BGP Message
    Type: UPDATE Message
    Path Attributes:
        Path Attribute - Origin: IGP
        Path Attribute - AS_PATH: 64500 64496
        Path Attribute - NEXT_HOP: 198.51.100.1
        Path Attribute - COMMUNITIES: 64500:13335
        Path Attribute - Multi Exit Discriminator (MED): 100
    Network Layer Reachability Information (NLRI):
        192.0.2.0/24

As you can see, BGP messages contain more than just the IP prefix (the NLRI bit) and the path; they also carry a bunch of other metadata that provides additional information about the path. Other fields include communities (more on those later), as well as MED and the origin code. MED is a suggestion to other directly connected networks on which path should be taken if multiple options are available, and the lowest value wins. The origin code can be one of three values: IGP, EGP or Incomplete. IGP is set if you originate the prefix through BGP, EGP is no longer used (it’s an ancient routing protocol), and Incomplete is set when you distribute a prefix into BGP from another routing protocol (like IS-IS or OSPF).

Now that source A knows how to get to Destination B through two different networks, let’s talk about traffic engineering!

Traffic engineering

Traffic engineering is a critical part of the day to day management of any network. Just like in the physical world, detours can be put in place by operators to optimize the traffic flows into (inbound) and out of (outbound) their network. Outbound traffic engineering is significantly easier than inbound traffic engineering because operators can choose from neighboring networks, even prioritize some traffic over others. In contrast, inbound traffic engineering requires influencing a network that is operated by someone else entirely. The autonomy and self-governance of a network is paramount, so operators use available tools to inform or shape inbound packet flows from other networks. The understanding and use of those tools is complex, and can be a challenge.

The available set of traffic engineering tools, both in- and outbound, relies on manipulating attributes (metadata) of a given route. As we’re talking about traffic engineering between independent networks, we’ll be manipulating the attributes of an EBGP-learned route. BGP sessions can be split into two categories:

  1. EBGP: BGP communication between two different ASNs
  2. IBGP: BGP communication within the same ASN.

While the protocol is the same, certain attributes can be exchanged on an IBGP session that aren’t exchanged on an EBGP session. One of those is local-preference. More on that in a moment.

BGP best path selection

When a network is connected to multiple other networks and service providers, it can receive path information to the same IP prefix from many of those networks, each with slightly different attributes. It is then up to the receiving network of that information to use a BGP best path selection algorithm to pick the “best” prefix (and route), and use this to forward IP traffic. I’ve put “best” in quotation marks, as best is a subjective requirement. “Best” is frequently the shortest, but what can be best for my network might not be the best outcome for another network.

BGP will consider multiple prefix attributes when filtering through the received options. However, rather than combine all those attributes into a single selection criteria, BGP best path selection uses the attributes in tiers — at any tier, if the available attributes are sufficient to choose the best path, then the algorithm terminates with that choice.
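
As a toy model of that tiered evaluation, the sketch below covers only the first few attributes discussed in this post and simplifies their semantics (for example, MED is compared across all routes rather than per neighboring AS); it is not how a real BGP implementation is structured.

# Toy model of tiered BGP best path selection, covering only a few attributes with
# simplified semantics; real implementations evaluate many more steps.
from dataclasses import dataclass

@dataclass
class Route:
    neighbor: str
    local_pref: int   # higher wins
    as_path: list     # shorter wins
    med: int          # lower wins (simplified: compared across all routes here)

def best_path(routes: list[Route]) -> Route:
    candidates = list(routes)
    for key, reverse in (
        (lambda r: r.local_pref, True),     # 1. highest local preference
        (lambda r: len(r.as_path), False),  # 2. shortest AS-path
        (lambda r: r.med, False),           # 3. lowest MED
    ):
        best = sorted(candidates, key=key, reverse=reverse)[0]
        candidates = [r for r in candidates if key(r) == key(best)]
        if len(candidates) == 1:            # this tier was decisive; stop early
            break
    return candidates[0]

routes = [
    Route("transit", 100, [64500, 64496], 100),
    Route("pni", 250, [64500, 64500, 64500, 64496], 100),
]
print(best_path(routes).neighbor)  # "pni": local preference beats the shorter AS-path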

The BGP best path selection algorithm is extensive, containing 15 discrete steps to select the best available path for a given prefix. Given the numerous steps, it’s in the interest of the network to decide the best path as early as possible. The first four steps are most used and influential, and are depicted in the figure below as sieves.

[Figure: the first four steps of the BGP best path selection algorithm, depicted as sieves]

Picking the shortest path possible is usually a good idea, which is why “AS-path length” is a step executed early on in the algorithm. However, looking at the figure above, “AS-path length” appears second, despite being the attribute to find the shortest path. So let’s talk about the first step: local preference.

Local preference

Local preference is an operator favorite because it allows them to handpick a route+path combination of their choice. It’s the first attribute in the algorithm because it is unique for any given route+neighbor+AS-path combination.

A network sets the local preference on import of a route (having learned about the route from a neighbor network). It is a non-transitive property, meaning it is an attribute that is never sent in an EBGP message to other networks. This intrinsically means, for example, that the operator of AS 64496 can’t set the local preference of routes to their own (or transiting) IP prefixes inside neighboring AS 64511. The inability to do so is partially why inbound traffic engineering through EBGP is so difficult.

Prepending artificially increases AS-path length

Since no network is able to directly set the local preference for a prefix inside another network, the first opportunity to influence other networks’ choices is modifying the AS-path. If the next hops are valid, and the local preference for all the different paths for a given route is the same, modifying the AS-path is an obvious option to change the path traffic will take towards your network. In a BGP message, prepending looks like this:

BEFORE:

BGP Message
    Type: UPDATE Message
    Path Attributes:
        Path Attribute - Origin: IGP
        Path Attribute - AS_PATH: 64500 64496
        Path Attribute - NEXT_HOP: 198.51.100.1
        Path Attribute - COMMUNITIES: 64500:13335
        Path Attribute - Multi Exit Discriminator (MED): 100
    Network Layer Reachability Information (NLRI):
        192.0.2.0/24

AFTER:

BGP Message
    Type: UPDATE Message
    Path Attributes:
        Path Attribute - Origin: IGP
        Path Attribute - AS_PATH: 64500 64496 64496
        Path Attribute - NEXT_HOP: 198.51.100.1
        Path Attribute - COMMUNITIES: 64500:13335
        Path Attribute - Multi Exit Discriminator (MED): 100
    Network Layer Reachability Information (NLRI):
        192.0.2.0/24

Specifically, operators can do AS-path prepending. When doing AS-path prepending, an operator adds additional autonomous systems to the path (usually the operator uses their own AS, but that’s not enforced in the protocol). This way, an AS-path can go from a length of 1 to a length of 255. As the length has now increased dramatically, that specific path for the route will not be chosen. By changing the AS-path advertised to different peers, an operator can control the traffic flows coming into their network.

Unfortunately, prepending has a catch: To be the deciding factor, all the other attributes need to be equal. This is rarely true, especially in large networks that are able to choose from many possible routes to a destination.

Business Policy Engine

BGP is colloquially also referred to as a Business Policy Engine: it does not select the best path from a performance point of view; instead, and more often than not, it will select the best path from a business point of view. The business criteria could be anything from investment (port) efficiency to increased revenue, and more. This may sound strange but, believe it or not, this is what BGP is designed to do! The power (and complexity) of BGP is that it enables a network operator to make choices according to the operator’s needs, contracts, and policies, many of which cannot be reflected by conventional notions of engineering performance.

Different local preferences

A lot of networks (including Cloudflare) assign a local preference depending on the type of connection used to send us the routes. A higher value is a higher preference. For example, routes learned from transit network connections will get a lower local preference of 100 because they are the most costly to use; backbone-learned routes will be 150, Internet exchange (IX) routes get 200, and lastly private interconnect (PNI) routes get 250. This means that for egress (outbound) traffic, the Cloudflare network, by default, will prefer a PNI-learned route, even if a shorter AS-path is available through an IX or transit neighbor.

Part of the reason a PNI is preferred over an IX is reliability: there is no third-party switching platform involved that is out of our control, which is important because we operate on the assumption that all hardware can and will eventually break. Another part of the reason is port efficiency. Here, efficiency is defined by cost per megabit transferred on each port. Roughly speaking, the cost is calculated by:

((cost_of_switch / port_count) + transceiver_cost)

which is combined with the cross-connect cost (which might be a monthly recurring charge (MRC) or a one-time fee). A PNI is preferable because it helps to optimize value by reducing the overall cost per megabit transferred: the unit price decreases with higher utilization of the port.
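
As a toy illustration of that port-efficiency argument, with all numbers invented for the example, the cost per megabit falls as the port is utilized more heavily:

# Toy illustration of port efficiency: the cost per megabit transferred falls as the
# port carries more traffic. All numbers are invented for the example.
def cost_per_mbit(cost_of_switch, port_count, transceiver_cost, xconn_monthly, mbps_used):
    port_cost = (cost_of_switch / port_count) + transceiver_cost  # formula from the text
    monthly_cost = port_cost / 36 + xconn_monthly                 # amortize hardware over ~3 years
    return monthly_cost / mbps_used

# Same port and cross-connect, different utilization levels:
print(round(cost_per_mbit(30_000, 32, 800, 300, 10_000), 4))  # 0.0348 -> lightly used port
print(round(cost_per_mbit(30_000, 32, 800, 300, 80_000), 4))  # 0.0044 -> well-utilized port, cheaper per Mbit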

This reasoning is similar for a lot of other networks, and is very prevalent in transit networks. BGP is at least as much about cost and business policy, as it is about performance.

Transit local preference

For simplicity, when referring to transits, I mean the traditional tier-1 transit networks. Due to the nature of these networks, they have two distinct sets of network peers:

1. Customers (like Cloudflare)
2. Settlement-free peers (like other tier-1 networks)

In normal circumstances, transit customers will get a higher local preference assigned than the local preference used for their settlement-free peers. This means that, no matter how much you prepend a prefix, if traffic enters that transit network, it will always land on your interconnection with that transit network; it will not be offloaded to another peer.

A prepend can still be used if you want to switch or offload traffic from a single link with one transit when you have multiple distinct links with them, or if the source of traffic is multihomed behind multiple transits (and they don’t have their own local preference playbook preferring one transit over another). But engineering inbound traffic away from one transit port to another through AS-path prepending has significant diminishing returns: once you’re past three prepends, it’s unlikely to change much, if anything, at that point.

Example

[Figure: Origin A reaching Cloudflare through Transit A and Transit B, with Cloudflare prepending its advertisements towards AS 64496]

In the above scenario, no matter the adjustment Cloudflare makes in its AS-path towards AS 64496, the traffic will keep flowing through the Transit B <> Cloudflare interconnection, even though the path Origin A → Transit B → Transit A → Cloudflare is shorter from an AS-path point of view.

[Figure: the same topology, but with Origin A now multi-homed behind both transit providers]

In this scenario, not a lot has changed, but Origin A is now multi-homed behind the two transit providers. In this case, the AS-path prepending was effective, as the paths seen on the Origin A side are both the prepended and non-prepended path. As long as Origin A is not doing any egress traffic engineering, and is treating both transit networks equally, then the path chosen will be Origin A → Transit A → Cloudflare.

Community-based traffic engineering

So we have now identified a pretty critical problem within the Internet ecosystem for operators: with the tools mentioned above, it’s not always possible (some might even say outright impossible) to accurately dictate the paths along which traffic enters your own network, reducing the control an autonomous system has over its own network. Fortunately, there is a solution for this problem: community-based local preference.

Some transit providers allow their customers to influence the local preference in the transit network through the use of BGP communities. BGP communities are an optional transitive attribute for a route advertisement. The communities can be informative (“I learned this prefix in Rome”), but they can also be used to trigger actions on the receiving side. For example, Cogent publishes the following action communities:

Community    Local preference
174:10       10
174:70       70
174:120      120
174:125      125
174:135      135
174:140      140

When you know that Cogent uses the following default local preferences in their network:

Peers → Local preference 100
Customers → Local preference 130

It’s easy to see how we could use these communities to change the route used. It’s important to note, though, that since we can’t set the local preference of a route to exactly 100 (or 130), AS-path prepending remains largely irrelevant, as the local preference won’t ever be the same.

Take for example the following configuration:

term ADV-SITELOCAL {
    from {
        prefix-list SITE-LOCAL;
        route-type internal;
    }
    then {
        as-path-prepend "13335 13335";
        accept;
    }
}

[Figure: traffic flow with the two prepends applied, still entering Cloudflare over the Cogent link]

We’re prepending the Cloudflare ASN two times, resulting in a total AS-path length of three, yet we were still seeing a lot (too much) traffic coming in on our Cogent link. At that point, an engineer could add another prepend, but for a well-connected network such as Cloudflare, if two prepends didn’t do much, or three, then four or five isn’t going to do much either. Instead, we can leverage the Cogent communities documented above to change the routing within Cogent:

term ADV-SITELOCAL {
    from {
        prefix-list SITE-LOCAL;
        route-type internal;
    }
    then {
        community add COGENT_LPREF70;
        accept;
    }
}

The above configuration changes the traffic flow to this:

[Figure: Why BGP communities are better than AS-path prepends]

Which is exactly what we wanted!

Conclusion

AS-path prepending is still useful and has its place in the traffic engineering toolchain for operators, but it should be used sparingly. Excessive prepending opens a network up to more widespread route hijacks, which should be avoided at all costs. As such, community-based ingress traffic engineering is highly preferred (and recommended). In cases where communities aren't available (or aren't available to steer customer traffic), prepends can be applied, but I encourage operators to actively monitor their effects, and roll them back if ineffective.

As a side note, P. Marcos et al. have published an interesting paper on AS-path prepending that goes into some of the trends seen in relation to prepending; I highly recommend giving it a read: https://www.caida.org/catalog/papers/2020_aspath_prepending/aspath_prepending.pdf

Inspect Subnet to Subnet traffic with Amazon VPC More Specific Routing

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/inspect-subnet-to-subnet-traffic-with-amazon-vpc-more-specific-routing/

Since December 2019, Amazon Virtual Private Cloud (VPC) has allowed you to route all ingress traffic (also known as north-south traffic) to a specific network interface. You might use this capability for a number of reasons. For example, to inspect incoming traffic using an intrusion detection system (IDS) appliance or to route ingress traffic to a firewall.

Since we launched this feature, many of you asked us to provide a similar capability to analyze traffic flowing from one subnet to another inside your VPC, also known as east-west traffic. Until today, it was not possible because a route in a routing table cannot be more specific than the default local route (check the VPC documentation for more details). In plain English, this means that no route can have a destination using a smaller CIDR range than the default local route (which is the CIDR range of the whole VPC). For example, when the VPC range is 10.0.0.0/16 and a subnet has 10.0.1.0/24, a route to 10.0.1.0/24 is more specific than a route to 10.0.0.0/16.

Routing tables no longer have this restriction. Routes in a routing table can now be more specific than the default local route. You can use such a more specific route to send all traffic to a dedicated appliance or service to inspect, analyze, or filter all traffic flowing between two subnets (east-west traffic). The route target can be the network interface (ENI) attached to an appliance you built or acquired, an AWS Gateway Load Balancer (GWLB) endpoint to distribute traffic to multiple appliances for performance or high availability reasons, an AWS Firewall Manager endpoint, or a NAT gateway. It also allows you to insert an appliance between a subnet and an AWS Transit Gateway.

It is possible to chain appliances to perform more than one type of analysis between source and destination subnets. For example, you might first filter traffic using a firewall (AWS managed or a third-party firewall appliance), then send the traffic to an intrusion detection and prevention system, and finally perform deep packet inspection. You can access virtual appliances from our AWS Partner Network and AWS Marketplace.

When you chain appliances, each appliance and each endpoint has to be in a separate subnet.

Let’s get our hands dirty and try this new capability.

How It Works
For the purpose of this blog post, let’s assume I have a VPC with three subnets. The first subnet is public and has a bastion host. It requires access to resources, such as an API or a database in the second subnet. The second subnet is private. It hosts the resources required by the bastion. I wrote a simple CDK script to help you to deploy this setup.

[Figure: VPC More Specific Routing]

For compliance reasons, my company requires that traffic to this private application flows through an intrusion detection system. The CDK script also creates a third, private subnet to host a network appliance. It provisions three Amazon Elastic Compute Cloud (Amazon EC2) instances: the bastion host, the application instance, and the network analysis appliance. The script also creates a NAT gateway, allowing the application instance to bootstrap and allowing me to connect to the three instances using AWS Systems Manager Session Manager (SSM).

Because this is a demo, the network appliance is just a regular Amazon Linux EC2 instance configured as an IP router. In real life, you're most probably going to use one of the many appliances provided by our partners on the AWS Marketplace, a Gateway Load Balancer endpoint, or a Network Firewall.

Let’s modify the routing tables to send the traffic through the appliance.

Using either the AWS Management Console, or the AWS Command Line Interface (CLI), I add a more specific route to the 10.0.0.0/24 and 10.0.1.0/24 subnet routing tables. These routes point to eni0, the network interface of the traffic-inspection appliance.

Using the CLI, I first collect the VPC ID, Subnet IDs, routing table IDs, and the ENI ID of the appliance.

VPC_ID=$(aws                                                    \
    --region $REGION cloudformation describe-stacks             \
    --stack-name SpecificRoutingDemoStack                       \
    --query "Stacks[].Outputs[?OutputKey=='VPCID'].OutputValue" \
    --output text)
echo $VPC_ID

APPLICATION_SUBNET_ID=$(aws                                                                      \
    --region $REGION ec2 describe-instances                                                      \
    --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].SubnetId" \
    --output text)
echo $APPLICATION_SUBNET_ID

APPLICATION_SUBNET_ROUTE_TABLE=$(aws                                                             \
    --region $REGION  ec2 describe-route-tables                                                  \
    --query "RouteTables[?VpcId=='${VPC_ID}'] | [?Associations[?SubnetId=='${APPLICATION_SUBNET_ID}']].RouteTableId" \
    --output text)
echo $APPLICATION_SUBNET_ROUTE_TABLE

APPLIANCE_ENI_ID=$(aws                                                                           \
    --region $REGION ec2 describe-instances                                                      \
    --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='appliance']].NetworkInterfaces[].NetworkInterfaceId" \
    --output text)
echo $APPLIANCE_ENI_ID

BASTION_SUBNET_ID=$(aws                                                                         \
    --region $REGION ec2 describe-instances                                                     \
    --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='BastionHost']].NetworkInterfaces[].SubnetId" \
    --output text)
echo $BASTION_SUBNET_ID

BASTION_SUBNET_ROUTE_TABLE=$(aws \
 --region $REGION ec2 describe-route-tables \
 --query "RouteTables[?VpcId=='${VPC_ID}'] | [?Associations[?SubnetId=='${BASTION_SUBNET_ID}']].RouteTableId" \
 --output text)
echo $BASTION_SUBNET_ROUTE_TABLE

Next, I add two more specific routes. One route sends traffic from the bastion public subnet to the application private subnet through the appliance network interface. The second route is in the opposite direction, to route the replies: it sends traffic from the application private subnet to the bastion public subnet through the appliance network interface. Confused? Let's look at the following diagram:

[Figure: VPC More Specific Routing]

First, let’s modify the bastion routing table:

aws ec2 create-route                                  \
     --region $REGION                                 \
     --route-table-id $BASTION_SUBNET_ROUTE_TABLE     \
     --destination-cidr-block 10.0.1.0/24             \
     --network-interface-id $APPLIANCE_ENI_ID

Next, let’s modify the application routing table:

aws ec2 create-route                                  \
    --region $REGION                                  \
    --route-table-id $APPLICATION_SUBNET_ROUTE_TABLE  \
    --destination-cidr-block 10.0.0.0/24              \
    --network-interface-id $APPLIANCE_ENI_ID

I can also use the Amazon VPC Console to make these modifications. I simply choose the “Bastion” routing table and, from the Routes tab, click Edit routes.

[Figure: Select a routing table]

I add a route to send traffic for 10.0.1.0/24 (the subnet of the application) to the appliance ENI (eni-055...).

[Figure: Create a route]

The next step is to define the opposite route for the replies: from the application subnet, send traffic destined for 10.0.0.0/24 to the appliance ENI (eni-05...). Once finished, the application subnet routing table should look like this:

[Figure: Final route table]

Configure the Appliance Instance
Finally, I configure the appliance instance to forward all traffic it receives. Your software appliance usually does that for you. No extra step is required when you use AWS Marketplace appliances or the instance created by the CDK script I provided for this demo. If you’re using a plain Linux instance, complete these two extra steps:

1. Connect to the EC2 appliance instance and configure IP traffic forwarding in the kernel:

sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv6.conf.all.forwarding=1

2. Configure the EC2 instance to accept traffic destined for addresses other than its own, by disabling the source/destination check:

APPLIANCE_ID=$(aws --region $REGION ec2 describe-instances                     \
     --filter "Name=tag:Name,Values=appliance"                                 \
     --query "Reservations[].Instances[?State.Name == 'running'].InstanceId[]" \
     --output text)

aws ec2 modify-instance-attribute --region $REGION     \
                         --no-source-dest-check        \
                         --instance-id $APPLIANCE_ID

Test the Setup
The appliance is now ready to forward traffic to the other EC2 instances.

If you are using the demo setup, there is no SSH key installed on the bastion host. Access is made through AWS Systems Manager Session Manager.

BASTION_ID=$(aws --region $REGION ec2 describe-instances                      \
    --filter "Name=tag:Name,Values=BastionHost"                               \
    --query "Reservations[].Instances[?State.Name == 'running'].InstanceId[]" \
    --output text)

aws --region $REGION ssm start-session --target $BASTION_ID

After you’re connected to the bastion host, issue the following cURL command to connect to the application host:

sh-4.2$ curl -I 10.0.1.239 # use the private IP address of your application host
HTTP/1.1 200 OK
Server: nginx/1.18.0
Date: Mon, 24 May 2021 10:00:22 GMT
Content-Type: text/html
Content-Length: 12338
Last-Modified: Mon, 24 May 2021 09:36:49 GMT
Connection: keep-alive
ETag: "60ab73b1-3032"
Accept-Ranges: bytes

To verify the traffic is really flowing through the appliance, you can enable source/destination check on the instance again. Use the --source-dest-check parameter with the modify-instance-attribute CLI command above. The traffic is blocked when the source/destination check is enabled.

I can also connect to the appliance host and inspect traffic with the tcpdump command.

(on your laptop)
APPLIANCE_ID=$(aws --region $REGION ec2 describe-instances                     \
     --filter "Name=tag:Name,Values=appliance"                                 \
     --query "Reservations[].Instances[?State.Name == 'running'].InstanceId[]" \
     --output text)

aws --region $REGION ssm start-session --target $APPLIANCE_ID

(on the appliance host)
tcpdump -i eth0 host 10.0.0.16 # the private IP address of the bastion host

08:53:22.760055 IP ip-10-0-0-16.us-west-2.compute.internal.46934 > ip-10-0-1-104.us-west-2.compute.internal.http: Flags [S], seq 1077227105, win 26883, options [mss 8961,sackOK,TS val 1954932042 ecr 0,nop,wscale 6], length 0
08:53:22.760073 IP ip-10-0-0-16.us-west-2.compute.internal.46934 > ip-10-0-1-104.us-west-2.compute.internal.http: Flags [S], seq 1077227105, win 26883, options [mss 8961,sackOK,TS val 1954932042 ecr 0,nop,wscale 6], length 0
08:53:22.760322 IP ip-10-0-1-104.us-west-2.compute.internal.http > ip-10-0-0-16.us-west-2.compute.internal.46934: Flags [S.], seq 4152624111, ack 1077227106, win 26847, options [mss 8961,sackOK,TS val 4094021737 ecr 1954932042,nop,wscale 6], length 0
08:53:22.760329 IP ip-10-0-1-104.us-west-2.compute.internal.http > ip-10-0-0-16.us-west-2.compute.internal.46934: Flags [S.], seq 4152624111, ack 1077227106, win 26847, options [mss 

Cleanup
If you used the CDK script I provided for this post, be sure to run cdk destroy when you’re finished so that you’re not billed for the three EC2 instances and the NAT gateway I use for this demo. Running the demo script in us-west-2 costs $0.062 per hour.

Things to Keep in Mind
There are a couple of things to keep in mind when using VPC more specific routes:

  • The network interface or service endpoint you are sending the traffic to must be in a dedicated subnet. It cannot be in the source or destination subnet of your traffic.
  • You can chain appliances. Each appliance must live in its dedicated subnet.
  • Each subnet you add consumes a block of IP addresses. If you’re using IPv4, be conscious of the number of IP addresses consumed (a /24 subnet consumes 256 addresses from your VPC). The smallest CIDR range allowed in a subnet is /28, which consumes just 16 IP addresses.
  • The appliance’s security group must have a rule accepting incoming traffic on the desired port. Similarly, the application’s security group must authorize traffic coming from the appliance security group or IP address.

This new capability is available in all AWS Regions, at no additional cost.

You can start using it today.

Customer Support workforce routing

Post Syndicated from Grab Tech original https://engineering.grab.com/customer-support-workforce-routing

Introduction

With Grab’s wide range of services, we receive large volumes of queries every day. Our Customer Support teams address concerns and issues ranging from safety incidents to general FAQs. The teams delight our customers through quick resolutions, resulting from a world-class support framework and an efficient workforce routing system.

Our workforce routing system ensures that available resources are efficiently assigned to a request based on the right skillset and deciding factors such as department, country, and request priority. Scalability to work across support channels (e.g. voice, chat, or digital) is another factor considered when routing a request to a particular support specialist.

[Figure: Sample flow of how it works today]

Having an efficient workforce routing system ensures that requests are directed to the support specialists most suited to handle a certain type of issue, resulting in quicker resolutions, happier and more satisfied customers, and reduced support costs.

We initially implemented a third-party solution; however, there were a few limitations, such as prioritisation, that motivated us to build our very own routing solution, which provides better routing configuration controls and reduces licensing costs.

This article describes how we built our in-house workforce routing system at Grab and focuses on Livechat, one of the domains of customer support.

Problem

Let’s run through the issues with our previous routing solution in the next sections.

Priority management

The third-party solution didn’t allow us to prioritise one group of requests over another. This was particularly important for safety issues, which must not be held up behind low-priority requests like enquiries. So our goal for the in-house solution was to ensure that we were able to configure the priority of the request queues.

Bespoke product customisation

With the third-party solution being a generic service provider, customisations often required long lead times, as not all product requests from Grab were well received by the mass market. Building this in-house meant Grab had full control over the design and configuration of routing. Here are a few sample use cases that were addressed by customisation:

  • Bulk configuration changes – Previously, it was challenging to assign the same configuration to multiple agents. So, we introduced another layer of grouping for agents that share the same configuration. For example, which queues the agents receive chats from and what the proficiency and max concurrency should be.
  • Resource Constraints – To avoid overwhelming resources with unlimited chats and maintaining reasonable wait times for our customers, we introduced a dynamic queue limit on the number of chat requests enqueued. This limit was based on factors like the number of incoming chats and the agent performance over the last hour.
  • Remote Work Challenges – With the pandemic situation and more of our agents working remotely, network issues were common. So we released an enhancement on the routing system to reroute chats handled by unavailable agents (due to disconnection for an extended period) to another available agent. The seamless experience helped increase customer satisfaction.

Reporting and analytics

Similar to the previous point, a solution addressing generic use cases doesn’t allow you to add monitoring customisations at will. With the custom implementation, we were able to add more granular metrics, which are very useful for assessing agent productivity and performance and help in planning resources ahead of time; this is why reporting and analytics were so valuable for workforce planning. A few of the customisations we added were:

  • Agent Time Utilisation – While basic agent tracking was available in the out-of-the-box solution, it limited users to three states (online, away, and invisible). With the custom routing solution, we were able to create customised statuses to reflect the time the agent spent in a particular status due to chat connection issues and failures, and reflect this on dashboards for immediate attention.
  • Chat Transfers – The number of chat transfers could only be tabulated manually. We then automated this process with a custom implementation.

Solution

Now that we’ve covered the issues we’re solving, let’s cover the solutions.

Prioritising high-priority requests

During routing, the constraint is on the number of resources available. The incoming requests cannot simply be assigned to the first available agent. The issue with this approach is that we would eventually run out of agents to serve the high-priority requests.

One way to prevent this is to have a separate group of agents solely handling high-priority requests. This alone does not solve the issue, as high-priority and low-priority requests still share the same queue and are de-queued in First-In, First-Out (FIFO) order: low-priority requests that arrived earlier are processed first, while high-priority requests have to wait their turn. Because of this queuing behaviour, prioritisation of requests is critical.

The need to prioritise

High-priority requests, such as safety issues, must not be in the queue for a long duration and should be handled as fast as possible even when the system is filled with low-priority requests.

There are two different kinds of queues: one kind handles requests at the priority level, and the other handles individual issue types; the latter are the business queues on which the queue limit constraints apply.

To illustrate further, here are two different scenarios of enqueuing/de-queuing:

Different issues with different priorities

In this scenario, the priority is set to dequeue safety issues, which are in the high-priority queue, before picking up the enquiry issues from the low-priority queue.

[Figure: Different issues with different priorities]

Identical issues with different priorities

In this scenario, where identical issues have different priorities, the reallocated enquiry issue in the high-priority queue is dequeued first, before a low-priority enquiry issue is picked up. Reallocations happen when a chat is transferred to another agent or when it was not accepted by the allocated agent. When reallocated, the request goes back to the queue with a higher priority.

[Figure: Identical issues with different priorities]

Approach

To implement different levels of priorities, we decided to use separate queues for each of the priorities and denoted the request queues by groups, which could logically exist in any of the priority queues.

For de-queueing, time slices of varied lengths were assigned to each of the queues to make sure the de-queueing worker spends more time on a higher priority queue.

The architecture uses multiple de-queueing workers running in parallel, with each worker looping over the queues and waiting for a message in a queue for a certain amount of time, and then allocating it to an agent.

// Each worker iterates over the priority queues, starting at its own offset.
for i := startIndex; i < len(consumer.priorityQueue); i++ {
    queue := consumer.priorityQueue[i]
    // Time slice for this queue: higher-priority queues are given longer slices.
    duration := queue.config.ProcessingDurationInMilliseconds
    for now := time.Now(); time.Since(now) < time.Duration(duration)*time.Millisecond; {
        // Wait for a message on this queue and, if one arrives, allocate it to an agent.
        consumer.processMessage(queue.client, queue.config)
        // cool down
        time.Sleep(time.Millisecond * 100)
    }
}

The above code snippet iterates over the individual priority queues, waits for a message for a certain duration, and then processes the message upon receipt. There is also a cooldown period of 100 ms before it moves on to receive a message from a different priority queue.

The caveat with the above approach is that the worker may end up spending more time than expected when it receives a message at the end of the waiting duration. We addressed this by having multiple workers running concurrently.

Request starvation

Now, when priority queues are used, there is a possibility that some of the low-priority requests remain unprocessed for long periods of time. To ensure that this doesn’t happen, the workers are forced to run out of sync by tweaking the order in which the priority queues are processed, such that when worker1 is processing a request from the high-priority queue, worker2 is waiting for a request in the medium-priority queue instead of the high-priority queue.
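To illustrate the idea, here is a minimal sketch of how workers could be staggered with different starting offsets. The queue names, the wrap-around rotation, and the way startIndex is assigned per worker are illustrative assumptions, not the exact implementation described in the post:

// Illustrative only: stagger de-queueing workers so they don't all start on the
// high-priority queue at the same time.
package main

import (
	"fmt"
	"time"
)

var priorityQueues = []string{"high", "medium", "low"}

func worker(id int, startIndex int) {
	for {
		// Rotate through the queues beginning at this worker's offset.
		for i := 0; i < len(priorityQueues); i++ {
			queue := priorityQueues[(startIndex+i)%len(priorityQueues)]
			fmt.Printf("worker %d polling %s-priority queue\n", id, queue)
			time.Sleep(100 * time.Millisecond) // stand-in for the real time slice
		}
	}
}

func main() {
	// Worker 1 starts on the high-priority queue, worker 2 on the medium one,
	// so a burst of high-priority traffic never fully starves the lower queues.
	go worker(1, 0)
	go worker(2, 1)
	time.Sleep(1 * time.Second)
}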

Customising to our needs

We wanted to make sure that agents with adequate skills are assigned to the right queues to handle the requests. On top of that, we wanted to ensure that there is a limit on the number of requests that a queue can accept at a time, guaranteeing that the system isn’t flooded with too many requests, which can lead to longer waiting times for request allocation.

Approach

The queues are configured with a dynamic queue limit, which is the upper limit on the number of requests that a queue can accept. Additionally attributes such as country, department, and skills are defined on the queue.

The dynamic queue limit takes into account the utilisation factor of the queue and the agents available at the given time, which ensures an appropriate waiting time at the queue level.
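As a rough illustration (the post doesn’t give the exact formula), such a dynamic limit could be derived from the number of online agents, their spare concurrency, and a utilisation factor. Everything in this sketch, including the function name and the buffer factor, is an assumption made for explanation only:

// Illustrative sketch of a dynamic queue limit; not the production logic.
package main

import (
	"fmt"
	"math"
)

type queueStats struct {
	onlineAgents      int     // agents currently online and mapped to this queue
	avgMaxConcurrency float64 // average max concurrent chats per agent
	utilisation       float64 // fraction of agent capacity already in use (0..1)
}

// dynamicQueueLimit grows with spare agent capacity and shrinks as agents fill up,
// which keeps the expected waiting time at the queue level bounded.
func dynamicQueueLimit(s queueStats, buffer float64) int {
	capacity := float64(s.onlineAgents) * s.avgMaxConcurrency
	spare := capacity * (1 - s.utilisation)
	return int(math.Ceil(spare * buffer))
}

func main() {
	s := queueStats{onlineAgents: 20, avgMaxConcurrency: 3, utilisation: 0.6}
	// With 20 agents, 3 chats each and 60% utilisation, allow a small backlog
	// on top of the ~24 free slots.
	fmt.Println("queue limit:", dynamicQueueLimit(s, 1.2)) // 29
}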

A simple approach to assigning which queues agents can receive requests from is to directly assign queues to agents. But this leads to another problem to solve: controlling the number of concurrent chats an agent can handle and defining how proficient an agent is at solving a request. Keeping this in mind, it made sense to have another grouping layer between the queue and the agent assignment, and to define attributes such as concurrency on it, so that these groups can be reused.

There are three entities in agent assignment:

  • Queue
  • Agent Group
  • Agent

When a request is de-queued, the list of agents mapped to the queue is retrieved and some additional business rules (e.g. checking for proficiency) are applied to calculate the eligibility score of each mapped agent, to decide which agent is best suited to cater to the request.

The factors impacting the eligibility score are proficiency, whether the agent is online or offline, the current concurrency, the max concurrency, and the last allocation time.
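Here is a minimal, self-contained sketch of how an eligibility score could combine those factors; the Agent fields mirror the list above, but the weights, field names, and scoring formula are assumptions for illustration only:

// Illustrative eligibility scoring; the weights and formula are not Grab's actual rules.
package main

import (
	"fmt"
	"time"
)

type Agent struct {
	Name           string
	Online         bool
	Proficiency    float64 // 0..1, how good the agent is at this issue type
	Concurrency    int     // chats currently handled
	MaxConcurrency int
	LastAllocation time.Time
}

// eligibilityScore returns a negative score for agents that cannot take the chat,
// and otherwise favours proficient agents with spare capacity who have been idle longest.
func eligibilityScore(a Agent, now time.Time) float64 {
	if !a.Online || a.Concurrency >= a.MaxConcurrency {
		return -1
	}
	headroom := float64(a.MaxConcurrency-a.Concurrency) / float64(a.MaxConcurrency)
	idleMinutes := now.Sub(a.LastAllocation).Minutes()
	return a.Proficiency*2 + headroom + idleMinutes*0.01
}

func main() {
	now := time.Now()
	agents := []Agent{
		{"alice", true, 0.9, 2, 3, now.Add(-5 * time.Minute)},
		{"bob", true, 0.7, 0, 3, now.Add(-30 * time.Minute)},
		{"carol", false, 1.0, 0, 3, now},
	}
	best, bestScore := "", -1.0
	for _, a := range agents {
		if s := eligibilityScore(a, now); s > bestScore {
			best, bestScore = a.Name, s
		}
	}
	fmt.Println("allocate to:", best) // bob: lower proficiency, but idle and fully free
}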

Ensuring the concurrency is not breached

To make sure that an agent doesn’t receive more chats than their defined concurrency, a locking mechanism is used at the per-agent level. During agent allocation, the worker acquires a lock on the agent record with an expiry, preventing other workers from allocating a chat to this agent. Only once the allocation process is complete (whether it failed or succeeded) is the concurrency updated and the lock released, allowing other workers to assign more chats to the agent depending on their bandwidth.

A similar approach was used to ensure that the queue limit isn’t exceeded.
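For illustration, here is a minimal in-memory sketch of a per-agent lock with expiry. The post doesn’t say which store backs the lock (a shared store would be needed for workers on different machines), so the map-based implementation and names here are purely a conceptual sketch:

// Illustrative per-agent lock with expiry; in production this would live in a shared store.
package main

import (
	"fmt"
	"sync"
	"time"
)

type agentLocks struct {
	mu    sync.Mutex
	until map[string]time.Time // agent ID -> lock expiry
}

func newAgentLocks() *agentLocks {
	return &agentLocks{until: make(map[string]time.Time)}
}

// TryLock acquires the lock for an agent if it is free or has expired.
func (l *agentLocks) TryLock(agentID string, ttl time.Duration) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if exp, ok := l.until[agentID]; ok && time.Now().Before(exp) {
		return false // another worker is allocating a chat to this agent
	}
	l.until[agentID] = time.Now().Add(ttl)
	return true
}

// Unlock releases the lock once the allocation attempt has finished.
func (l *agentLocks) Unlock(agentID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.until, agentID)
}

func main() {
	locks := newAgentLocks()
	if locks.TryLock("agent-42", 5*time.Second) {
		// ... allocate the chat, update concurrency ...
		locks.Unlock("agent-42")
	}
	fmt.Println("lock free again:", locks.TryLock("agent-42", 5*time.Second))
}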

Reallocation and transfers

With the routing configuration set up, the reallocation of agents follows the same steps as agent allocation.

To transfer a chat to another queue, the request goes back to the queue with a higher priority, so that the request is assigned faster.

Unaccepted chats

If the agent fails to accept the request within a given period of time, the request is put back into the queue, but this time with a higher priority. This is why there’s a corresponding reallocation queue with a higher priority than the normal queue, ensuring that those unaccepted requests don’t have to wait in the queue again.

Informing the frontend about allocation

When an agent allocation happens, the routing system needs to inform the frontend by sending a message over websocket. This is done with our super reliable messaging system called Hermes, which operates at scale, supporting 12k concurrent connections and establishing real-time communication between agents and customers.

Finding the online agents

The routing system should only send the allocation message to the frontend when the agent is online and accepting requests. The frontend uses the same websocket connection that receives the allocation messages to inform the routing system about the agent’s availability. This means that if, for some reason, the websocket connection is broken due to internet connection issues, the agent stops receiving any new chat requests.
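As a rough sketch of this availability tracking (the post doesn’t describe Hermes’ internals, so the gorilla/websocket library, the handler, and the in-memory presence map here are all assumptions), an agent could be marked online while its websocket is open and offline as soon as the connection drops:

// Illustrative sketch: mark an agent online while its websocket is connected
// and offline when the connection drops.
package main

import (
	"log"
	"net/http"
	"sync"

	"github.com/gorilla/websocket"
)

var (
	upgrader = websocket.Upgrader{CheckOrigin: func(r *http.Request) bool { return true }}

	mu     sync.Mutex
	online = map[string]bool{} // agent ID -> availability as seen by the routing system
)

func setOnline(agentID string, isOnline bool) {
	mu.Lock()
	defer mu.Unlock()
	online[agentID] = isOnline
}

func agentSocket(w http.ResponseWriter, r *http.Request) {
	agentID := r.URL.Query().Get("agent")
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade failed:", err)
		return
	}
	defer conn.Close()

	setOnline(agentID, true)        // connection open: eligible for allocations
	defer setOnline(agentID, false) // connection dropped: stop sending new chats

	for {
		// Read loop: the frontend can push availability updates; an error means
		// the connection is broken (e.g. the agent lost internet access).
		if _, _, err := conn.ReadMessage(); err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/ws/agent", agentSocket)
	log.Fatal(http.ListenAndServe(":8080", nil))
}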

Enriched reporting and analytics

The routing system is able to push monitoring metrics, such as the number of online agents, the number of chat requests assigned to each agent, and so on. The fine-grained control that comes with building this system in-house gives us the ability to push more custom metrics.

There are two levels of monitoring offered by this system: real-time monitoring, and non-real-time monitoring that can be used for analytics, such as calculating the productivity of an agent and the time they spent on each chat.

We achieved this with the help of StatsD for real-time monitoring; for analytical purposes, we sent the data used for Tableau visualisations and reporting to Presto tables.
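As an illustration of the kind of metric push involved (the client, host, port, and metric name below are assumptions; only the "name:value|g" wire format is standard StatsD), a gauge such as the number of online agents can be sent to StatsD over UDP:

// Illustrative StatsD gauge push over UDP.
package main

import (
	"fmt"
	"log"
	"net"
)

func pushGauge(statsdAddr, name string, value int) error {
	conn, err := net.Dial("udp", statsdAddr)
	if err != nil {
		return err
	}
	defer conn.Close()
	// StatsD plain-text protocol: <metric name>:<value>|g for a gauge.
	_, err = fmt.Fprintf(conn, "%s:%d|g", name, value)
	return err
}

func main() {
	if err := pushGauge("127.0.0.1:8125", "routing.agents.online", 42); err != nil {
		log.Println("failed to push metric:", err)
	}
}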

Given that the bottleneck for this system is the number of resources (i.e. the number of agents), real-time monitoring helps identify which configuration needs to be adjusted when there is a spike in the number of requests. Moreover, the persisted analytical data allows us to predict traffic and plan workforce management so that requests are handled efficiently.

Scalability

Making sure the system behaves appropriately when rolled out to multiple regions was a critical consideration. To ensure that there are enough workers to handle the requests, instances can be scaled horizontally when CPU utilisation increases.

To understand the system’s limitations and behaviour before releasing it to multiple regions, we ran load tests with 10x the expected traffic. This gave us an understanding of which monitors and alerts we should add to make sure the system functions efficiently, and it reduces our recovery time if something goes wrong.

Next steps

The enhancements lined up after building this routing solution focus on reducing customers’ waiting time and reducing the time agents spend on unresponsive customers who have waited too long in the queue. Aside from chats, we would like to extend this solution to handle digital issues (social media and emails) and voice requests (calls).


Special thanks to Andrea Carlevato and Karen Kue for making sure that the blogpost is interesting and represents the problem we solved accurately.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.