Tag Archives: Traffic

Waiting Room adds multi-host and path coverage, unlocking broader protection and multilingual setups

Post Syndicated from Arielle Olache original http://blog.cloudflare.com/multihost-waiting-room/

Waiting Room adds multi-host and path coverage, unlocking broader protection and multilingual setups

Waiting Room adds multi-host and path coverage, unlocking broader protection and multilingual setups

Cloudflare Waiting Room protects sites from overwhelming traffic surges by placing excess visitors in a fully customizable virtual waiting room, admitting them dynamically as spots become available. Instead of throwing error pages or delivering poorly-performing site pages, Waiting Room empowers customers to take control of their end-user experience during unmanageable traffic surges.

Waiting Room adds multi-host and path coverage, unlocking broader protection and multilingual setups

A key decision customers make when setting up a waiting room is what pages it will protect. Before now, customers could select one hostname and path combination to determine what pages would be covered by a waiting room. Today, we are thrilled to announce that Waiting Room now supports coverage of multiple hostname and path combinations with a single waiting room, giving customers more flexibility and offering broader site coverage without interruptions to end-user flows. This new capability is available to all Enterprise customers with an Advanced Purchase of Waiting Room.

Waiting Room site placement

As part of the simple, no-coding-necessary process for deploying a waiting room, customers specify a hostname and path combination to indicate which pages are covered by a particular waiting room. When a site visitor makes a preliminary request to that hostname and path or any of its subpaths, they will be issued a waiting room cookie and either be admitted to the site or placed in a waiting room if the site is at capacity.

Last year, we added Waiting Room Bypass Rules, giving customers many options for creating exceptions to this hostname and path coverage. This unlocked capabilities such as User Agent Bypassing, geo-targeting, URL exclusions, and administrative IP bypassing. It also allowed for some added flexibility regarding which pages a waiting room would apply to on a customer’s site by adding the ability to exclude URLs, paths, and query strings. While this update allowed for greater specificity regarding which traffic should be gated by Waiting Room, it didn’t offer the broader coverage that many customers still needed to protect greater portions of their sites with a single waiting room.

Why customers needed broader Waiting Room coverage

Let’s review some simple yet impactful examples of why this broader coverage option was important for our customers. Imagine you have an online store, example.com, and you want to ensure that a single waiting room covers the entire customer journey — from the homepage, to product browsing, to checkout. Many sites would delineate these steps in the flow using paths: example.com/, example.com/shop/product1, example.com/checkout. Because Waiting Room assumes an implied wildcard at the end of a waiting room’s configured path, this use case was already satisfied for these sites. Thus, placing a waiting room at example.com/ would cover all the URLs associated with every step of this customer journey. In this setup, once a site visitor has passed the waiting room, they would not be re-queued at any step in their user flow because they are still using the same waiting room’s cookie, which indicates to Waiting Room that they are the same user between URLs.

However, many sites delineate different steps of this type of shopping flow using subdomains instead of or as well as paths. For example, many sites will have their checkout page on a different subdomain, such as checkout.example.com. Before now, if a customer had this site structure and wanted to protect their entire site with a single waiting room, they would have needed to deploy a waiting room at example.com/ and another at checkout.example.com/. This option was not ideal for many customers because a site visitor could be queued at two different parts of the same customer journey. This is because the waiting room at checkout.example.com/ would count the same visitor as a different user than the waiting room covering example.com/.

That said, there are cases where it is wise to separate waiting rooms on a single site. For example, a ticketing website could place a waiting room at its apex domain (example.com) and distinct waiting rooms with pre-queues on pages for specific events (example.com/popular_artist_tour). The waiting room set at example.com/ ensures that the main point of entry to the site does not get overwhelmed and crash when ticket sales for any one event open. The waiting room placed on the specific event page ensures that traffic for a single event can start queuing ahead of the event without impacting traffic going to other parts of the site.

Ultimately, whether a customer wants one or multiple waiting rooms to protect their site, we wanted to give our customers the flexibility to deploy Waiting Room however was best for their use cases and site structure. We’re thrilled to announce that Waiting Room now supports multi-hostname and path coverage with a single waiting room!

Getting started with multi-host and path coverage

Customers can now configure a waiting room on multiple hostname and path combinations — or routes — belonging to the same zone. To do so, navigate to Traffic > Waiting Room and select Create. The name of your domain will already be populated. To add additional routes to your waiting room configuration, select Add Hostname and Path. You can then enter another hostname and path to be covered by the same waiting room. Remember, there is an implied wildcard after each path. Therefore, creating a waiting room for each URL you want your waiting room to cover is unnecessary. Only create additional routes for URLs that wouldn’t be covered by the other hostname and path combinations you have already entered.

Waiting Room adds multi-host and path coverage, unlocking broader protection and multilingual setups

When deploying a waiting room that covers multiple hostname and path combinations, you must create a unique cookie name for this waiting room (more on that later!). Then, proceed with the same workflow as usual to deploy your waiting room.

Deploying a multi-language waiting room

A frequent customer request was the ability to cover a multi-language site with a single waiting room — offering different text per language while counting all site traffic toward the same waiting room limits. There are various ways a site can be structured to delineate between different language options, but the two most common are either by subdomain or path. For a site that uses path delineation, this could look like example.com/en and example.com/es for English and Spanish, respectively. For a site using subdomain delineation, this could look like en.example.com/ and es.example.com/. Before multihost Waiting Room coverage, the subdomain variation would not have been able to be covered by a single waiting room.

Waiting Room’s existing configuration options already supported the path variation. However, this was only true if the customer wanted to gate the entire site, which they could do by placing the waiting room at example.com/. Many e-commerce customers asked for the ability to gate different high-demand product pages selling the same product but with different language options. For instance, consider example.com/en/product_123 and example.com/es/product_123, where the customer wants the same waiting room and traffic limits to cover both URLs. Before now, this would not have been possible without some complex bypass rule logic.

Now, customers can deploy a waiting room that accommodates either the subdomain or path approach to structuring a multi-language site. The only remaining step is setting up your waiting room to serve different languages when a user is queued in a waiting room. This can be achieved by constructing a template that reads the URL to determine the locale and define the appropriate translations for each of the locales within the template.

Here is an example of a template that determines the locale from the URL path, and renders the translated text:

<!DOCTYPE html>
<html>
  <head>
    <title>Waiting Room powered by Cloudflare</title>
  </head>
  <body>
    <section>
      <h1 id="inline-msg">
        You are now in line.
      </h1>
      <h1 id="patience-msg">
        Thank you for your patience.
      </h1>
    </section>
    <h2 id="waitTime"></h2>

    <script>
      var locale = location.pathname.split("/")[1] || "en";
      var translations = {
        "en": {
          "waittime_60_less": "Your estimated wait time is {{waitTime}} minute.",
          "waittime_60_greater": "Your estimated wait time is {{waitTimeHours}} hours and {{waitTimeHourMinutes}} minutes.",
          "inline-msg": "You are now in line.",
          "patience-msg": "Thank you for your patience.",
        },
        "es": {
          "waittime_60_less": "El tiempo de espera estimado es {{waitTime}} minuto.",
          "waittime_60_greater": "El tiempo de espera estimado es {{waitTimeHours}} de horas y {{waitTimeHourMinutes}} minutos.",
          "inline-msg": "Ahora se encuentra en la fila de espera previa.",
          "patience-msg": "Gracias por su paciencia.",
        }
      };

      {{#waitTimeKnown}}
      var minutes = parseInt( {{waitTime}} , 10);
      var time = document.getElementById('waitTime');

      if ( minutes < 61) {
        time.innerText = translations[locale]["waittime_60_less"]
      } else {
        time.innerText = translations[locale]["waittime_60_greater"]
      }
      {{/waitTimeKnown}}

      // translate template text for each of the elements with “id” suffixed with “msg”
      for (const msg of ["inline-msg", "patience-msg"]) {
        const el = document.getElementById(msg);
        if (el == null) continue;
        el.innerText = translations[locale][msg];
      }
    </script>
  </body>
</html>

Here’s how the template works: given a site distinguishes different locales with various paths such as /en/product_123 or /es/product_123 in the <script /> body after the rendering the page, the locale is extracted from the location.pathname via locale = location.pathname.split(“/”)[1]. If there isn’t a locale specified within the translations object we default it to “en”. For example, if a user visits example.com/product_123, then by default render the English text template.

Similarly, in order to support a multi-language waiting room for sites that delineate site structure via subdomain, it would only require you to update how you extract the locale from the URL. Instead of looking at the pathname we look at the hostname property of the window.location object like locale = location.hostname.split(“.”)[0], given, we have site structure as following: en.example.com, es.example.com.

The above code is a pared down example of starter templates we have added to our developer documentation, which we have included to make it easier for you to get started with a multi-language waiting room. You can download these templates and edit them to have the look and feel aligned with your site and with the language options your site offers.

Waiting Room adds multi-host and path coverage, unlocking broader protection and multilingual setups
This is an example of the starter template provided in dev docs. This waiting room is in Queue-all mode and displays the Spanish text when the user visits example.com/es/product_123.

These templates can further be expanded to include other languages by adding translations to the `translations` object for each of the locales. Thus, a single template is able to serve multiple languages depending on whatever the locale the site is being served as either via subdomain or path.

How we handle cookies with a multihostname and path waiting room

The waiting room assigns a __cfwaitingroom cookie to each user to manage the state of the user that determines the position of the user in line among other properties needed to make the queueing decisions for the user. Traditionally, for a single hostname and path configuration it was trivial to just set the cookie as __cfwaitingroom=[cookie-value]; Domain=example.com; Path=/es/product_123. This ensured that when the page refreshed it would send us the same Waiting Room cookie for us to examine and update. However, this became non-trivial when we wanted to share the same cookie across different subdomain or path combinations, for example, on example.com/en/product_123.

Approaches to setting multiple cookies

There were two approaches that we brainstormed and evaluated to allow the sharing of the cookie for the same waiting room.

The first approach we considered was to issue multiple Set-Cookie headers in the HTTP response. For example, in the multi-language example above, we could issue the following in the response header:

Set-Cookie: __cfwaitingroom=[cookie-value]…Domain=example.com; Path=/en/product_123 …
Set-Cookie: __cfwaitingroom=[cookie-value]...Domain=example.com; Path=/es/product_123 …

This approach would allow us to handle the multiple hostnames and paths for the same waiting room. However, it does not present itself as a very scalable solution. Per RFC6265, there isn’t a strict limit defined in the specification, browsers have their own implementation-specific limits. For example, Chrome allows the response header to grow up to 256K bytes before throwing the transaction with ERR_RESPONSE_HEADERS_TOO_BIG. Additionally, in this case, the response header would grow proportionally to the number of unique hostname and path combinations, and we would be redundantly repeating the same cookie value N (where N is the number of additional routes) number of times. While this approach would have allowed us to effectively handle a bounded list of hostname and path combinations that need to be set, it was not ideal for cases where we can expect more than a handful routes for a particular site.

The second approach that we considered and chose to move forward with was to set the cookie on the apex domain (or the most common subdomain). In other words, we would figure out the most common subdomain from the list of routes and use that as the domain. Similarly, for the paths, this would entail determining the least common prefix from the list of paths and use that as the value to be set on the path attribute. For example, consider a waiting room with the following two routes configured, a.example.com/shoes/product_123 and b.example.com/shoes/product_456.

To determine the domain that would be set for the cookie, we can see that example.com is the most common subdomain among the list of domains. Applying the most common subdomain algorithm, we would get example.com as the domain. Applying the least common prefix algorithm on the set of paths, /shoes/product_123 and /shoes/product_456, we would get /shoes as the path. Thus, the set-cookie header would be the following:

Set-Cookie: … __cfwaitingroom=[cookie-value]; Domain=example.com; Path=/shoes …

Now, when a visitor accesses any of the pages covered by this waiting room, we can guarantee Waiting Room receives the right cookie and there will only be Set-Cookie included in the response header.

However, we are still not there yet. Although we are able to identify the common subdomain (or apex domain) and common path prefix, there would still be a problem if routes from one waiting room would overlap with another waiting room. For example, say we configure two waiting rooms for the same site where we are selling our special shoes:

Waiting Room A
    a.example.com/shoes/product_123
    b.example.com/shoes/product_456

Waiting Room B
    c.example.com/shoes/product_789
    d.example.com/shoes/product_012

If we apply the same algorithm as described above, we would get the same domain and path properties set for the two cookies. For Waiting Room A, the domain would be example.com and the path would be /shoes. Similarly, for Waiting Room B, the most common subdomain would also be example.com and least common path prefix would be /shoes. This would effectively prevent us from distinguishing the two cookies and route the visitor to the right waiting room. In order to solve the issue of overlapping cookie names, we introduced the requirement of setting a custom suffix that is unique to the customer’s zone. This is why it is required to provide a custom cookie suffix when configuring a waiting room that covers multiple hostnames or paths. Therefore, if Waiting Room A was assigned cookie suffix “a” and Waiting Room B was assigned cookie suffix “b”, the two Set-Cookie definitions would look like the following:

Waiting Room A

Set-Cookie: __cfwaitingroom_a=[cookie-value]; Domain=example.com; Path=/shoes

Waiting Room B

Set-Cookie: __cfwaitingroom_b=[cookie-value]; Domain=example.com; Path=/shoes

When a visitor makes a request to any pages where the two different waiting rooms are configured, Waiting Room can distinguish and select which cookie corresponds to the given request and use this to determine which waiting room’s settings apply to that user depending on where they are on the site.

With the addition of multihost and multipath Waiting Room coverage, we’re thrilled to offer more flexible options for incorporating Waiting Room into your site, giving your visitors the best experience possible while protecting your site during times of high traffic. Make sure your site is always protected from traffic surges. Try out Waiting Room today or reach out to us to learn more about the Advanced Waiting Room!

How Cloudflare’s systems dynamically route traffic across the globe

Post Syndicated from David Tuber original http://blog.cloudflare.com/meet-traffic-manager/

How Cloudflare’s systems dynamically route traffic across the globe

How Cloudflare’s systems dynamically route traffic across the globe

Picture this: you’re at an airport, and you’re going through an airport security checkpoint. There are a bunch of agents who are scanning your boarding pass and your passport and sending you through to your gate. All of a sudden, some of the agents go on break. Maybe there’s a leak in the ceiling above the checkpoint. Or perhaps a bunch of flights are leaving at 6pm, and a number of passengers turn up at once. Either way, this imbalance between localized supply and demand can cause huge lines and unhappy travelers — who just want to get through the line to get on their flight. How do airports handle this?

Some airports may not do anything and just let you suffer in a longer line. Some airports may offer fast-lanes through the checkpoints for a fee. But most airports will tell you to go to another security checkpoint a little farther away to ensure that you can get through to your gate as fast as possible. They may even have signs up telling you how long each line is, so you can make an easier decision when trying to get through.

At Cloudflare, we have the same problem. We are located in 300 cities around the world that are built to receive end-user traffic for all of our product suites. And in an ideal world, we always have enough computers and bandwidth to handle everyone at their closest possible location. But the world is not always ideal; sometimes we take a data center offline for maintenance, or a connection to a data center goes down, or some equipment fails, and so on. When that happens, we may not have enough attendants to serve every person going through security in every location. It’s not because we haven’t built enough kiosks, but something has happened in our data center that prevents us from serving everyone.

So, we built Traffic Manager: a tool that balances supply and demand across our entire global network. This blog is about Traffic Manager: how it came to be, how we built it, and what it does now.

The world before Traffic Manager

The job now done by Traffic Manager used to be a manual process carried out by network engineers: our network would operate as normal until something happened that caused user traffic to be impacted at a particular data center.

When such events happened, user requests would start to fail with 499 or 500 errors because there weren’t enough machines to handle the request load of our users. This would trigger a page to our network engineers, who would then remove some Anycast routes for that data center. The end result: by no longer advertising those prefixes in the impacted data center, user traffic would divert to a different data center. This is how Anycast fundamentally works: user traffic is drawn to the closest data center advertising the prefix the user is trying to connect to, as determined by Border Gateway Protocol. For a primer on what Anycast is, check out this reference article.

Depending on how bad the problem was, engineers would remove some or even all the routes in a data center. When the data center was again able to absorb all the traffic, the engineers would put the routes back and the traffic would return naturally to the data center.

As you might guess, this was a challenging task for our network engineers to do every single time any piece of hardware on our network had an issue. It didn’t scale.

Never send a human to do a machine’s job

But doing it manually wasn’t just a burden on our Network Operations team. It also resulted in a sub-par experience for our customers; our engineers would need to take time to diagnose and re-route traffic. To solve both these problems, we wanted to build a service that would immediately and automatically detect if users were unable to reach a Cloudflare data center, and withdraw routes from the data center until users were no longer seeing issues. Once the service received notifications that the impacted data center could absorb the traffic, it could put the routes back and reconnect that data center. This service is called Traffic Manager, because its job (as you might guess) is to manage traffic coming into the Cloudflare network.

How Cloudflare’s systems dynamically route traffic across the globe

Accounting for second order consequences

When a network engineer removes a route from a router, they can make the best guess at where the user requests will move to, and try to ensure that the failover data center has enough resources to handle the requests — if it doesn’t, they can adjust the routes there accordingly prior to removing the route in the initial data center. To be able to automate this process, we needed to move from a world of intuition to a world of data — accurately predicting where traffic would go when a route was removed, and feeding this information to Traffic Manager, so it could ensure it doesn’t make the situation worse.

Meet Traffic Predictor

Although we can adjust which data centers advertise a route, we are unable to influence what proportion of traffic each data center receives. Each time we add a new data center, or a new peering session, the distribution of traffic changes, and as we are in over 300 cities and 12,500 peering sessions, it has become quite difficult for a human to keep track of, or predict the way traffic will move around our network. Traffic manager needed a buddy: Traffic Predictor.

In order to do its job, Traffic Predictor carries out an ongoing series of real world tests to see where traffic actually moves. Traffic Predictor relies on a testing system that simulates removing a data center from service and measuring where traffic would go if that data center wasn’t serving traffic. To help understand how this system works, let’s simulate the removal of a subset of a data center in Christchurch, New Zealand:

  • First, Traffic Predictor gets a list of all the IP addresses that normally connect to Christchurch. Traffic Predictor will send a ping request to hundreds of thousands of IPs that have recently made a request there.
  • Traffic Predictor records if the IP responds, and whether the response returns to Christchurch using a special Anycast IP range specifically configured for Traffic Predictor.
  • Once Traffic Predictor has a list of IPs that respond to Christchurch, it removes that route containing that special range from Christchurch, waits a few minutes for the Internet routing table to be updated, and runs the test again.
  • Instead of being routed to Christchurch, the responses are instead going to data centers around Christchurch. Traffic Predictor then uses the knowledge of responses for each data center, and records the results as the failover for Christchurch.

This allows us to simulate Christchurch going offline without actually taking Christchurch offline!

But Traffic Predictor doesn’t just do this for any one data center. To add additional layers of resiliency, Traffic Predictor even calculates a second layer of indirection: for each data center failure scenario, Traffic Predictor also calculates failure scenarios and creates policies for when surrounding data centers fail.

Using our example from before, when Traffic Predictor tests Christchurch, it will run a series of tests that remove several surrounding data centers from service including Christchurch to calculate different failure scenarios. This ensures that even if something catastrophic happens which impacts multiple data centers in a region, we still have the ability to serve user traffic. If you think this data model is complicated, you’re right: it takes several days to calculate all of these failure paths and policies.

Here’s what those failure paths and failover scenarios look like for all of our data centers around the world when they’re visualized:

How Cloudflare’s systems dynamically route traffic across the globe

This can be a bit complicated for humans to parse, so let’s dig into that above scenario for Christchurch, New Zealand to make this a bit more clear. When we take a look at failover paths specifically for Christchurch, we see they look like this:

How Cloudflare’s systems dynamically route traffic across the globe

In this scenario we predict that 99.8% of Christchurch’s traffic would shift to Auckland, which is able to absorb all Christchurch traffic in the event of a catastrophic outage.

Traffic Predictor allows us to not only see where traffic will move to if something should happen, but it allows us to preconfigure Traffic Manager policies to move requests out of failover data centers to prevent a thundering herd scenario: where sudden influx of requests can cause failures in a second data center if the first one has issues. With Traffic Predictor, Traffic Manager doesn’t just move traffic out of one data center when that one fails, but it also proactively moves traffic out of other data centers to ensure a seamless continuation of service.

From a sledgehammer to a scalpel

With Traffic Predictor, Traffic Manager can dynamically advertise and withdraw prefixes while ensuring that every datacenter can handle all the traffic. But withdrawing prefixes as a means of traffic management can be a bit heavy-handed at times. One of the reasons for this is that the only way we had to add or remove traffic to a data center was through advertising routes from our Internet-facing routers. Each one of our routes has thousands of IP addresses, so removing only one still represents a large portion of traffic.

Specifically, Internet applications will advertise prefixes to the Internet from a /24 subnet at an absolute minimum, but many will advertise prefixes larger than that. This is generally done to prevent things like route leaks or route hijacks: many providers will actually filter out routes that are more specific than a /24 (for more information on that, check out this blog here). If we assume that Cloudflare maps protected properties to IP addresses at a 1:1 ratio, then each /24 subnet would be able to service 256 customers, which is the number of IP addresses in a /24 subnet. If every IP address sent one request per second, we’d have to move 4 /24 subnets out of a data center if we needed to move 1,000 requests per second (RPS).

But in reality, Cloudflare maps a single IP address to hundreds of thousands of protected properties. So for Cloudflare, a /24 might take 3,000 requests per second, but if we needed to move 1,000 RPS out, we would have no choice but to move a single /24 out. And that’s just assuming we advertise at a /24 level. If we used /20s to advertise, the amount we can withdraw gets less granular: at a 1:1 website to IP address mapping, that’s 4,096 requests per second for each prefix, and even more if the website to IP address mapping is many to one.

While withdrawing prefix advertisements improved the customer experience for those users who would have seen a 499 or 500 error — there may have been a significant portion of users who wouldn’t have been impacted by an issue who still were moved away from the data center they should have gone to, probably slowing them down, even if only a little bit. This concept of moving more traffic out than is necessary is called “stranding capacity”: the data center is theoretically able to service more users in a region but cannot because of how Traffic Manager was built.

We wanted to improve Traffic Manager so that it only moved the absolute minimum of users out of a data center that was seeing a problem and not strand any more capacity. To do so, we needed to shift percentages of prefixes, so we could be extra fine-grained and only move the things that absolutely need to be moved. To solve this, we built an extension of our Layer 4 load balancer Unimog, which we call Plurimog.

A quick refresher on Unimog and layer 4 load balancing: every single one of our machines contains a service that determines whether that machine can take a user request. If the machine can take a user request then it sends the request to our HTTP stack which processes the request before returning it to the user. If the machine can’t take the request, the machine sends the request to another machine in the data center that can. The machines can do this because they are constantly talking to each other to understand whether they can serve requests for users.

Plurimog does the same thing, but instead of talking between machines, Plurimog talks in between data centers and points of presence. If a request goes into Philadelphia and Philadelphia is unable to take the request, Plurimog will forward to another data center that can take the request, like Ashburn, where the request is decrypted and processed. Because Plurimog operates at layer 4, it can send individual TCP or UDP requests to other places which allows it to be very fine-grained: it can send percentages of traffic to other data centers very easily, meaning that we only need to send away enough traffic to ensure that everyone can be served as fast as possible. Check out how that works in our Frankfurt data center, as we are able to shift progressively more and more traffic away to handle issues in our data centers. This chart shows the number of actions taken on free traffic that cause it to be sent out of Frankfurt over time:

How Cloudflare’s systems dynamically route traffic across the globe

But even within a data center, we can route traffic around to prevent traffic from leaving the datacenter at all. Our large data centers, called Multi-Colo Points of Presence (MCPs) contain logical sections of compute within a data center that are distinct from one another. These MCP data centers are enabled with another version of Unimog called Duomog, which allows for traffic to be shifted between logical sections of compute automatically. This makes MCP data centers fault-tolerant without sacrificing performance for our customers, and allows Traffic Manager to work within a data center as well as between data centers.

When evaluating portions of requests to move, Traffic Manager does the following:

  • Traffic Manager identifies the proportion of requests that need to be removed from a data center or subsection of a data center so that all requests can be served.
  • Traffic Manager then calculates the aggregated space metrics for each target to see how many requests each failover data center can take.
  • Traffic Manager then identifies how much traffic in each plan we need to move, and moves either a proportion of the plan, or all of the plan through Plurimog/Duomog, until we've moved enough traffic. We move Free customers first, and if there are no more Free customers in a data center, we'll move Pro, and then Business customers if needed.

For example, let’s look at Ashburn, Virginia: one of our MCPs. Ashburn has nine different subsections of capacity that can each take traffic. On 8/28, one of those subsections, IAD02, had an issue that reduced the amount of traffic it could handle.

During this time period, Duomog sent more traffic from IAD02 to other subsections within Ashburn, ensuring that Ashburn was always online, and that performance was not impacted during this issue. Then, once IAD02 was able to take traffic again, Duomog shifted traffic back automatically. You can see these actions visualized in the time series graph below, which tracks the percentage of traffic moved over time between subsections of capacity within IAD02 (shown in green):

How Cloudflare’s systems dynamically route traffic across the globe

How does Traffic Manager know how much to move?

Although we used requests per second in the example above, using requests per second as a metric isn’t accurate enough when actually moving traffic. The reason for this is that different customers have different resource costs to our service; a website served mainly from cache with the WAF deactivated is much cheaper CPU wise than a site with all WAF rules enabled and caching disabled. So we record the time that each request takes in the CPU. We can then aggregate the CPU time across each plan to find the CPU time usage per plan. We record the CPU time in ms, and take a per second value, resulting in a unit of milliseconds per second.

CPU time is an important metric because of the impact it can have on latency and customer performance. As an example, consider the time it takes for an eyeball request to make it entirely through the Cloudflare front line servers: we call this the cfcheck latency. If this number goes too high, then our customers will start to notice, and they will have a bad experience. When cfcheck latency gets high, it’s usually because CPU utilization is high. The graph below shows 95th percentile cfcheck latency plotted against CPU utilization across all the machines in the same data center, and you can see the strong correlation:

How Cloudflare’s systems dynamically route traffic across the globe

So having Traffic Manager look at CPU time in a data center is a very good way to ensure that we’re giving customers the best experience and not causing problems.

After getting the CPU time per plan, we need to figure out how much of that CPU time to move to other data centers. To do this, we aggregate the CPU utilization across all servers to give a single CPU utilization across the data center. If a proportion of servers in the data center fail, due to network device failure, power failure, etc., then the requests that were hitting those servers are automatically routed elsewhere within the data center by Duomog. As the number of servers decrease, the overall CPU utilization of the data center increases. Traffic Manager has three thresholds for each data center; the maximum threshold, the target threshold, and the acceptable threshold:

  • Maximum: the CPU level at which performance starts to degrade, where Traffic Manager will take action
  • Target: the level to which Traffic Manager will try to reduce the CPU utilization to restore optimal service to users
  • Acceptable: the level below which a data center can receive requests forwarded from another data center, or revert active moves

When a data center goes above the maximum threshold, Traffic Manager takes the ratio of total CPU time across all plans to current CPU utilization, then applies that to the target CPU utilization to find the target CPU time. Doing it this way means we can compare a data center with 100 servers to a data center with 10 servers, without having to worry about the number of servers in each data center. This assumes that load increases linearly, which is close enough to true for the assumption to be valid for our purposes.

Target ratio equals current ratio:

How Cloudflare’s systems dynamically route traffic across the globe

Therefore:

How Cloudflare’s systems dynamically route traffic across the globe

Subtracting the target CPU time from the current CPU time gives us the CPU time to move:

How Cloudflare’s systems dynamically route traffic across the globe

For example, if the current CPU utilization was at 90% across the data center, the target was 85%, and the CPU time across all plans was 18,000, we would have:

How Cloudflare’s systems dynamically route traffic across the globe

This would mean Traffic Manager would need to move 1,000 CPU time:

How Cloudflare’s systems dynamically route traffic across the globe

Now we know the total CPU time needed to move, we can go through the plans, until the required time to move has been met.

What is the maximum threshold?

A frequent problem that we faced was determining at which point Traffic Manager should start taking action in a data center – what metric should it watch, and what is an acceptable level?

As said before, different services have different requirements in terms of CPU utilization, and there are many cases of data centers that have very different utilization patterns.

To solve this problem, we turned to machine learning. We created a service that will automatically adjust the maximum thresholds for each data center according to customer-facing indicators. For our main service-level indicator (SLI), we decided to use the cfcheck latency metric we described earlier.

But we also need to define a service-level objective (SLO) in order for our machine learning application to be able to adjust the threshold. We set the SLO for 20ms. Comparing our SLO to our SLI, our 95th percentile cfcheck latency should never go above 20ms and if it does, we need to do something. The below graph shows 95th percentile cfcheck latency over time, and customers start to get unhappy when cfcheck latency goes into the red zone:

How Cloudflare’s systems dynamically route traffic across the globe

If customers have a bad experience when CPU gets too high, then the goal of Traffic Manager’s maximum thresholds are to ensure that customer performance isn’t impacted and to start redirecting traffic away before performance starts to degrade. At a scheduled interval the Traffic Manager service will fetch a number of metrics for each data center and apply a series of machine learning algorithms. After cleaning the data for outliers we apply a simple quadratic curve fit, and we are currently testing a linear regression algorithm.

After fitting the models we can use them to predict the CPU usage when the SLI is equal to our SLO, and then use it as our maximum threshold. If we plot the cpu values against the SLI we can see clearly why these methods work so well for our data centers, as you can see for Barcelona in the graphs below, which are plotted against curve fit and linear regression fit respectively.

How Cloudflare’s systems dynamically route traffic across the globe
How Cloudflare’s systems dynamically route traffic across the globe

In these charts the vertical line is the SLO, and the intersection of this line with the fitted model represents the value that will be used as the maximum threshold. This model has proved to be very accurate, and we are able to significantly reduce the SLO breaches. Let’s take a look at when we started deploying this service in Lisbon:

How Cloudflare’s systems dynamically route traffic across the globe

Before the change, cfcheck latency was constantly spiking, but Traffic Manager wasn’t taking actions because the maximum threshold was static. But after July 29, we see that cfcheck latency has never hit the SLO because we are constantly measuring to make sure that customers are never impacted by CPU increases.

Where to send the traffic?

So now that we have a maximum threshold, we need to find the third CPU utilization threshold which isn’t used when calculating how much traffic to move – the acceptable threshold. When a data center is below this threshold, it has unused capacity which, as long as it isn’t forwarding traffic itself, is made available for other data centers to use when required. To work out how much each data center is able to receive, we use the same methodology as above, substituting target for acceptable:

How Cloudflare’s systems dynamically route traffic across the globe

Therefore:

How Cloudflare’s systems dynamically route traffic across the globe

Subtracting the current CPU time from the acceptable CPU time gives us the amount of CPU time that a data center could accept:

How Cloudflare’s systems dynamically route traffic across the globe

To find where to send traffic, Traffic Manager will find the available CPU time in all data centers, then it will order them by latency from the data center needing to move traffic. It moves through each of the data centers, using all available capacity based on the maximum thresholds before moving onto the next. When finding which plans to move, we move from the lowest priority plan to highest, but when finding where to send them, we move in the opposite direction.

To make this clearer let's use an example:

We need to move 1,000 CPU time from data center A, and we have the following usage per plan: Free: 500ms/s, Pro: 400ms/s, Business: 200ms/s, Enterprise: 1000ms/s.

We would move 100% of Free (500ms/s), 100% of Pro (400ms/s), 50% of Business (100ms/s), and 0% of Enterprise.

Nearby data centers have the following available CPU time: B: 300ms/s, C: 300ms/s, D: 1,000ms/s.

With latencies: A-B: 100ms, A-C: 110ms, A-D: 120ms.

Starting with the lowest latency and highest priority plan that requires action, we would be able to move all the Business CPU time to data center B and half of Pro. Next we would move onto data center C, and be able to move the rest of Pro, and 20% of Free. The rest of Free could then be forwarded to data center D. Resulting in the following action: Business: 50% → B, Pro: 50% → B, 50% → C, Free: 20% → C, 80% → D.

Reverting actions

In the same way that Traffic Manager is constantly looking to keep data centers from going above the threshold, it is also looking to bring any forwarded traffic back into a data center that is actively forwarding traffic.

Above we saw how Traffic Manager works out how much traffic a data center is able to receive from another data center — it calls this the available CPU time. When there is an active move we use this available CPU time to bring back traffic to the data center — we always prioritize reverting an active move over accepting traffic from another data center.

When you put this all together, you get a system that is constantly measuring system and customer health metrics for every data center and spreading traffic around to make sure that each request can be served given the current state of our network. When we put all of these moves between data centers on a map, it looks something like this, a map of all Traffic Manager moves for a period of one hour. This map doesn’t show our full data center deployment, but it does show the data centers that are sending or receiving moved traffic during this period:

Data centers in red or yellow are under load and shifting traffic to other data centers until they become green, which means that all metrics are showing as healthy. The size of the circles represent how many requests are shifted from or to those data centers. Where the traffic is going is denoted by where the lines are moving. This is difficult to see at a world scale, so let’s zoom into the United States to see this in action for the same time period:

Here you can see Toronto, Detroit, New York, and Kansas City are unable to serve some requests due to hardware issues, so they will send those requests to Dallas, Chicago, and Ashburn until equilibrium is restored for users and data centers. Once data centers like Detroit are able to service all the requests they are receiving without needing to send traffic away, Detroit will gradually stop forwarding requests to Chicago until any issues in the data center are completely resolved, at which point it will no longer be forwarding anything. Throughout all of this, end users are online and are not impacted by any physical issues that may be happening in Detroit or any of the other locations sending traffic.

Happy network, happy products

Because Traffic Manager is plugged into the user experience, it is a fundamental component of the Cloudflare network: it keeps our products online and ensures that they’re as fast and reliable as they can be. It’s our real time load balancer, helping to keep our products fast by only shifting necessary traffic away from data centers that are having issues. Because less traffic gets moved, our products and services stay fast.

But Traffic Manager can also help keep our products online and reliable because they allow our products to predict where reliability issues may occur and preemptively move the products elsewhere. For example, Browser Isolation directly works with Traffic Manager to help ensure the uptime of the product. When you connect to a Cloudflare data center to create a hosted browser instance, Browser Isolation first asks Traffic Manager if the data center has enough capacity to run the instance locally, and if so, the instance is created right then and there. If there isn’t sufficient capacity available, Traffic Manager tells Browser Isolation which the closest data center with sufficient available capacity is, thereby helping Browser Isolation to provide the best possible experience for the user.

Happy network, happy users

At Cloudflare, we operate this huge network to service all of our different products and customer scenarios. We’ve built this network for resiliency: in addition to our MCP locations designed to reduce impact from a single failure, we are constantly shifting traffic around on our network in response to internal and external issues.

But that is our problem — not yours.

Similarly, when human beings had to fix those issues, it was customers and end users who would be impacted. To ensure that you’re always online, we’ve built a smart system that detects our hardware failures and preemptively balances traffic across our network to ensure it’s online and as fast as possible. This system works faster than any person — not only allowing our network engineers to sleep at night — but also providing a better, faster experience for all of our customers.

And finally: if these kinds of engineering challenges sound exciting to you, then please consider checking out the Traffic Engineering team's open position on Cloudflare’s Careers page!

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution

Post Syndicated from Brian Batraski original http://blog.cloudflare.com/elevate-load-balancing-with-private-ips-and-cloudflare-tunnels-a-secure-path-to-efficient-traffic-distribution/

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution

In the dynamic world of modern applications, efficient load balancing plays a pivotal role in delivering exceptional user experiences. Customers commonly leverage load balancing, so they can efficiently use their existing infrastructure resources in the best way possible. Though, load balancing is not a ‘one-size-fits-all, out of the box’ solution for everyone. As you go deeper into the details of your traffic shaping requirements and as your architecture becomes more complex, different flavors of load balancing are usually required to achieve these varying goals, such as steering between datacenters for public traffic, creating high availability for critical internal services with private IPs, applying steering between servers in a single datacenter, and more. We are extremely excited to announce a new addition to our Load Balancing solution, Local Traffic Management (LTM) with deep integrations with Zero Trust!

A common problem businesses run into is that almost no providers can satisfy all these requirements, resulting in a growing list of vendors to manage disparate data sources to get a clear view of your traffic pipeline, and investment into incredibly expensive hardware that is complicated to set up and maintain. Not having a single source of truth to dwindle down ‘time to resolution’ and a single partner to work with in times when things are not operating within the ideal path can be the difference between a proactive, healthy growing business versus one that is reactive and constantly having to put out fires. The latter can result in extreme slowdown to developing amazing features/services, reduction in revenue, tarnishing of brand trust, decreases in adoption – the list goes on!

For eight years, we have provided top-tier global traffic load balancing (GTM) capabilities to thousands of customers across the globe. But why should the steering intelligence, failover, and reliability we guarantee stop at the front door of the selected datacenter and only operate with public traffic? We came to the conclusion that we should go even further. Today is the start of a long series of new features that allow traffic steering, failover, session persistence, SSL/TLS offloading and much more to take place between servers after datacenter selection has occurred! Instead of relying only on the relative weight to determine which server traffic should be sent to, you can now bring the same intelligent steering policies, such as least outstanding requests steering or hash steering, to any of your many data centers. This also means you have a single partner for all of your load balancing initiatives and a single pane of glass to inform business decisions! Cloudflare is thrilled to introduce the powerful combination of private IP support for Load Balancing with Cloudflare Tunnels and Local Traffic Management, offering customers a solution that blends unparalleled efficiency, security, flexibility, and privacy.

What is a load balancer?

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution
A Cloudflare load balancer directs a request from a user to the appropriate origin pool within a data center

Load balancing — functionality that’s been around for the last 30 years to help businesses leverage their existing infrastructure resources. Load balancing works by proactively steering traffic away from unhealthy origin servers and — for more advanced solutions — intelligently distributing traffic load based on different steering algorithms. This process ensures that errors aren’t served to end users and empowers businesses to tightly couple overall business objectives to their traffic behavior. Cloudflare Load Balancing has made it simpler and easier to securely and reliably manage your traffic across multiple data centers around the world. With Cloudflare Load Balancing, your traffic will be directed reliably regardless of the scale of traffic or where it originates with customizable steering, affinity and failover. This clearly has an advantage over a physical load balancer since it can be configured easily and traffic doesn’t have to reach one of your data centers to be routed to another location, introducing single points of failure and significant latency. When compared with other global traffic management load balancers, Cloudflare’s Load Balancing offering is easier to set up, simpler to understand, and is fully integrated with the Cloudflare platform as one single product for all load balancing needs.

What are Cloudflare Tunnels?

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution
Origins and servers of various types can be connected to Cloudflare using Cloudflare Tunnel. Users can also secure their traffic using WARP, allowing traffic to be secured and managed end to end through Cloudflare.‌ ‌

In 2018, Cloudflare introduced Cloudflare Tunnels, a private, secure connection between your data center and Cloudflare. Traditionally, from the moment an Internet property is deployed, developers spend an exhaustive amount of time and energy locking it down through access control lists, rotating IP addresses, or more complex solutions like GRE tunnels. We built Tunnel to help alleviate that burden. With Tunnels, users can create a private link from their origin server directly to Cloudflare without exposing your services directly to the public internet or allowing incoming connections in your data center’s firewall. Instead, this private connection is established by running a lightweight daemon, cloudflared, in your data center, which creates a secure, outbound-only connection. This means that only traffic that you’ve configured to pass through Cloudflare can reach your private origin.

Unleashing the potential of Cloudflare Load Balancing with Cloudflare Tunnels

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution
Cloudflare Load Balancing can easily and securely direct a user’s request to a specific origin within your private data center or public cloud using Cloudflare Tunnels

Combining Cloudflare Tunnels with Cloudflare Load Balancing allows you to remove your physical load balancers from your data center and have your Cloudflare load balancer reach out to your servers directly via their private IP addresses with health checks, steering, and all other Load Balancing features currently available. Instead of configuring your on-premise load balancer to expose each service and then updating your Cloudflare load balancer, you can configure it all in one place. This means that from the end-user to the server handling the request, all your configuration can be done in a single place – the Cloudflare dashboard. On top of this, you can say goodbye to the multi hundred thousand dollar price tag to hardware appliances, the incredible management overhead and investing in a solution that has a time limit for its delivered value.

Load Balancing serves as the backbone for online services, ensuring seamless traffic distribution across servers or data centers. Traditional load balancing techniques often require exposing services on a data center’s public IP addresses, forcing organizations to create complex configurations vulnerable to security risks and potential data exposure. By harnessing the power of private IP support for Load Balancing in conjunction with Cloudflare Tunnels, Cloudflare is revolutionizing the way businesses protect and optimize their applications. With clear steps to install the cloudflared agent to connect your private network to Cloudflare’s network via Cloudflare Tunnels, directly and securely routing traffic into your data centers becomes easier than ever before!

Publicly exposing services in private data centers is complicated

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution
A visitor’s request hits a global traffic management (GTM) load balancer directing the request to a data center, then a firewall, then a local traffic management (LTM) load balancer and then an origin

Load balancing within a private data center can be expensive and difficult to manage. The idea of keeping security first while ensuring ease of use and flexibility for your internal workforce is a tricky balance to strike. It’s not only the ‘how’ of securely exposing internal services, but how to best balance traffic between servers at a single location within your private network!

In a private data center, even a very simple website can be fairly complex in terms of networking and configuration. Let’s walk through a simple example of a customer device connecting to a website. A customer device performs a DNS lookup for the business’s website and receives an IP address corresponding to a customer data center. The customer then makes an HTTPS request to that IP address, passing the original hostname via Server Name Indication (SNI). That load balancer forwards that request to the corresponding origin server and returns the response to the customer device.

This example doesn’t have any advanced functionality and the stack is already difficult to configure:

  • Expose the service or server on a private IP.
  • Configure your data center’s networking to expose the LB on a public IP or IP range.
  • Configure your load balancer to forward requests for that hostname and/or public IP to your server’s private IP.
  • Configure a DNS record for your domain to point to your load balancer’s public IP.

In large enterprises, each of these configuration changes likely requires approval from several stakeholders and modified through different repositories, websites and/or private web interfaces. Load balancer and networking configurations are often maintained as complex configuration files for Terraform, Chef, Puppet, Ansible or a similar infrastructure-as-code service. These configuration files can be syntax checked or tested but are rarely tested thoroughly prior to deployment. Each deployment environment is often unique enough that thorough testing is often not feasible given the time and hardware requirements needed to do so. This means that changes to these files can negatively affect other services within the data center. In addition, opening up an ingress to your data center widens the attack surface for varying security risks such as DDoS attacks or catastrophic data breaches. To make things worse, each vendor has a different interface or API for configuring their devices or services. For example, some registrars only have XML APIs while others have JSON REST APIs. Each device configuration may have different Terraform providers or Ansible playbooks. This results in complex configurations accumulating over time that are difficult to consolidate or standardize, inevitably resulting in technical debt.

Now let’s add additional origins. For each additional origin for our service, we’ll have to go set up and expose that origin and configure the physical load balancer to use our new origin. Now let’s add another data center. Now we need another solution to distribute across our data centers. This results in a separate global traffic management system and local traffic management system. These solutions have in the past come from different vendors and will have to be configured in different ways even though they should serve the same purpose: load balancing. This makes managing your web traffic unnecessarily difficult. Why should you have to configure your origins in two different load balancers? Why can’t you manage all the traffic for all the origins for a service in the same place?

Simpler and better: Load Balancing with Tunnels

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution
Cloudflare Load Balancing can manage traffic for all your offices, data centers, remote users, public clouds, private clouds and hybrid clouds in one place‌ ‌

With Cloudflare Load Balancing and Cloudflare Tunnel, you can manage all your public and private origins in one place: the Cloudflare dashboard. Cloudflare load balancers can be easily configured using the Cloudflare dashboard or the Cloudflare API. There’s no need to SSH or open a remote desktop to modify load balancer configurations for your public or private servers. All configurations can be done through the dashboard UI or Cloudflare API, with full parity between the two.

With Cloudflare Tunnel set up and running in your data center, everything is ready to connect your origin server to Cloudflare network and load balancers. You do not need to configure any ingress to your data center since Cloudflare Tunnel operates only over outbound connections and can securely reach out to privately addressed services inside your data center. To expose your service to Cloudflare, you just set up your private IP range to be routed over that tunnel. Then, you can create a Cloudflare load balancer and input the corresponding private IP address and virtual network ID into your origin pool. After that, Cloudflare manages the DNS and load balancing across your private servers. Now your origin is receiving traffic exclusively via Cloudflare Tunnel and your physical load balancer is no longer needed!

This groundbreaking integration enables organizations to deploy load balancers while keeping their applications securely shielded from the public Internet. The customer’s traffic passes through Cloudflare’s data centers, allowing customers to continue to take full advantage of Cloudflare’s security and performance services. Also, by leveraging Cloudflare Tunnels, traffic between Cloudflare and customer origins remains isolated within trusted networks, bolstering privacy, security, and peace of mind.

The advantages of Private IP support with Cloudflare Tunnels

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution
Cloudflare Load Balancing works in conjunction with all the security and privacy products that Cloudflare has to offer including DDoS protection, Web Application Firewall and Bot Managment

Combining Global and Local Traffic Management: All the features and ease of use that were part of Cloudflare Load Balancing for Global Traffic Management are also available with Local Traffic Management. You can configure your public and private origins in one dashboard as opposed to several services and vendors. Now, all your private origins can benefit from the features that Cloudflare Load Balancing is known for: instant failover, customizable steering between data centers, ease of use, custom rules and configuration updates in a matter of seconds. They will also benefit from our newer features including least connection steering, least outstanding request steering, and session affinity by header. This is just a small subset of the expansive feature set for Load Balancing. See our dev docs for more features and details on the offering.

Enhanced Security: By combining private IP support with Cloudflare Tunnels, organizations can fortify their security posture and protect sensitive data. With private IP addresses and encrypted connections via Cloudflare Tunnel, the risk of unauthorized access and potential attacks is significantly reduced – traffic remains within trusted networks. You can also configure Cloudflare Access to add single sign-on support for your application and restrict your application to a subset of authorized users. In addition, you still benefit from Firewall rules, Rate Limiting rules, Bot Management, DDoS protection and all the other Cloudflare products available today allowing comprehensive security configurations.

Uncompromising Privacy: As data privacy continues to take center stage, businesses must ensure the confidentiality of user information. Cloudflare's private IP support with Cloudflare Tunnels enables organizations to segregate applications and keep sensitive data within their private network boundaries. Custom rules also allow you to direct traffic for specific devices to specific data centers. For example, you can use custom rules to direct traffic from Eastern and Western Europe to your European data centers, so you can easily keep those users’ data within Europe. This minimizes the exposure of data to external entities, preserving user privacy and complying with strict privacy regulations across different geographies.

Flexibility & Reliability: Scale and adaptability are some of the major foundations of a well-operating business. Implementing solutions that fit your business’ needs today is not enough. Customers must find solutions that meet their needs for the next three or more years. The blend of Load Balancing with Cloudflare Tunnels within our Zero Trust solution lends to the very definition of flexibility and reliability! Changes to load balancer configurations propagate around the world in a matter of seconds, making load balancers an effective way to respond to incidents. Also, instant failover, health monitoring, and steering policies all help to maintain high availability for your applications, so you can deliver the reliability that your users expect. This is all in addition to best in class Zero Trust capabilities that are deeply integrated such as, but not limited to Secure Web Gateway (SWG), remote browser isolation, network logs. Data loss prevention.

Streamlined Infrastructure: Organizations can consolidate their network architecture and establish secure connections across distributed environments. This unification reduces complexity, lowers operational overhead, and facilitates efficient resource allocation. Whether you need to apply a global traffic manager to intelligently direct traffic between datacenters within your private network, or steer between specific servers after datacenter selection has taken place, there is now a clear, single lens to manage your global and local traffic, regardless of whether the source or destination of the traffic is public or private. Complexity can be a large hurdle in achieving and maintaining fast, agile business units. Consolidating into a single provider, like Cloudflare, that provides security, reliability, and observability will not only save significant cost but allows your teams to move faster and focus on growing their business, enhancing critical services, and developing incredible features, rather than taping together infrastructure that may not work in a few years. Leave the heavy lifting to us, and let us empower you and your team to focus on creating amazing experiences for your employees and end-users.

The lack of agility, flexibility, and lean operations of hardware appliances for Local Traffic Management does not justify the hundreds of thousands of dollars spent on them, along with the huge overhead of managing CPU, memory, power, cooling, etc. Instead, we want to help businesses move this logic to the cloud by abstracting away the needless overhead and bringing more focus back to teams to do what they do best, building amazing experiences, and allowing Cloudflare to do what we do best, protecting, accelerating, and building heightened reliability. Stay tuned for more updates on Cloudflare's Local Traffic Manager and how it can reduce architecture complexity while bringing more insight, security, and control to your teams. In the meantime, check out our new whitepaper!

Looking to the future

Cloudflare's impactful solution, private IP support for Load Balancing with Cloudflare Tunnels as part of the Zero Trust solution, reaffirms our commitment to providing cutting-edge tools that prioritize security, privacy, and performance. By leveraging private IP addresses and secure tunnels, Cloudflare empowers businesses to fortify their network infrastructure while ensuring compliance with regulatory requirements. With enhanced security, uncompromising privacy, and streamlined infrastructure, load balancing becomes a powerful driver of efficient and secure public or private services.

As a business grows and its systems scale up, they'll need the features that Cloudflare Load Balancing is known for: health monitoring, steering, and failover. As availability requirements increase due to growing demands and standards from end-users, customers can add health checks, enabling automatic failover to healthy servers when an unhealthy server begins to fail. When the business begins to receive more traffic from around the world, they can create new pools for different regions and use dynamic steering to reduce latency between the user and the server. For intensive or long-running requests, such as complex datastore queries, customers can benefit from leveraging least outstanding requests steering to reduce the number of concurrent requests per server. Before, this could all be done with publicly addressable IPs, but it is now available for pools with public IPs, private servers, or combinations of the two. Private IP Load Balancing along with Local Traffic Management is live and ready to use today! Check out our dev docs for instructions on how to get started.

Stay tuned for our next addition to add new Load Balancing onramp support for Spectrum and WARP with Cloudflare Tunnels with private IPs for your Layer 4 traffic, allowing us to support TCP and UDP applications in your private data centers!

Cloudflare’s view of the Virgin Media outage in the UK

Post Syndicated from David Belson original https://blog.cloudflare.com/virgin-media-outage-april-4-2023/

Cloudflare’s view of the Virgin Media outage in the UK

Just after midnight (UTC) on April 4, subscribers to UK ISP Virgin Media (AS5089) began experiencing an Internet outage, with subscriber complaints multiplying rapidly on platforms including Twitter and Reddit.

Cloudflare Radar data shows Virgin Media traffic dropping to near-zero around 00:30 UTC, as seen in the figure below. Connectivity showed some signs of recovery around 02:30 UTC, but fell again an hour later. Further nominal recovery was seen around 04:45 UTC, before again experiencing another complete outage between around 05:45-06:45 UTC, after which traffic began to recover, reaching expected levels around 07:30 UTC.

After the initial set of early-morning disruptions, Virgin Media experienced another round of issues in the afternoon. Cloudflare observed instability in traffic from Virgin Media’s network (called an autonomous system in Internet jargon) AS5089 starting around 15:00 UTC, with a significant drop just before 16:00 UTC. However in this case, it did not appear to be a complete outage, with traffic recovering approximately a half hour later.

Cloudflare’s view of the Virgin Media outage in the UK

Virgin Media’s Twitter account acknowledged the early morning disruption several hours after it began, posting responses stating “We’re aware of an issue that is affecting broadband services for Virgin Media customers as well as our contact centres. Our teams are currently working to identify and fix the problem as quickly as possible and we apologise to those customers affected.” Further responses after service restoration noted “We’ve restored broadband services for customers but are closely monitoring the situation as our engineers continue to investigate. We apologise for any inconvenience caused.”

However, the second disruption was acknowledged on Virgin Media’s Twitter account much more rapidly, with a post at 16:25 UTC stating “Unfortunately we have seen a repeat of an earlier issue which is causing intermittent broadband connectivity problems for some Virgin Media customers. We apologise again to those impacted, our teams are continuing to work flat out to find the root cause of the problem and fix it.”

At the time of the outages, www.virginmedia.com, which includes the provider’s status page, was unavailable. As seen in the figure below, a DNS lookup for the hostname resulted in a SERVFAIL error, indicating that the lookup failed to return a response. This is because the authoritative nameservers for virginmedia.com are listed as ns{1-4}.virginmedia.net, and these nameservers are all hosted within Virgin Media’s network (AS5089) and thus are not accessible during the outage.

Cloudflare’s view of the Virgin Media outage in the UK

Although Virgin Media has not publicly released a root cause for the series of disruptions that its network has experienced, looking at BGP activity can be instructive.

BGP is a mechanism to exchange routing information between networks on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver each network packet to its final destination. Without BGP, the Internet routers wouldn’t know what to do, and the Internet wouldn’t exist.

The Internet is literally a network of networks, or for math fans, a graph, with each individual network a node in it, and the edges representing the interconnections. All of this is bound together by BGP, which allows one network (Virgin Media, for instance) to advertise its presence to other networks that form the Internet. When Virgin Media is not advertising its presence, other networks can’t find its network and it becomes effectively unavailable.

BGP announcements inform a router of changes made to the routing of a prefix (a group of IP addresses) or entirely withdraws the prefix, removing it from the routing table. The figure below shows aggregate BGP announcement activity from AS5089 with spikes that align with the decreases and increases seen in the traffic graph above, suggesting that the underlying cause may in fact be BGP-related, or related to problems with core network infrastructure.

Cloudflare’s view of the Virgin Media outage in the UK

We can drill down further to break out the observed activity between BGP announcements (dark blue) and withdrawals (light blue) seen in the figure below, with key activity coincident with the loss and return of traffic. An initial set of withdrawals are seen just after midnight, effectively removing Virgin Media from the Internet resulting in the initial outage.

A set of announcements occurred just before 03:00 UTC, aligning with the nominal increase in traffic noted above, but those were followed quickly by another set of withdrawals. A similar announcement/withdrawal exchange was observed at 05:00 and 05:30 UTC respectively, before a final set of announcements restored connectivity at 07:00 UTC.

Things remained relatively stable through the morning into the afternoon, before another set of withdrawals presaged the afternoon’s connectivity problems, with a spike of withdrawals at 15:00 UTC, followed by additional withdrawal/announcement exchanges over the next several hours.

Cloudflare’s view of the Virgin Media outage in the UK

Conclusion

Track ongoing traffic trends for Virgin Media on Cloudflare Radar, and follow us on Twitter and Mastodon for regular updates.

Who won Super Bowl LV? A look at Internet traffic during the game

Post Syndicated from Marc Lamik original https://blog.cloudflare.com/who-won-super-bowl-lv/

Who won Super Bowl LV? A look at Internet traffic during the game

Who won Super Bowl LV? A look at Internet traffic during the game

The obvious answer is the Tampa Bay Buccaneers but the less obvious answer comes from asking “which Super Bowl advertiser got the biggest Internet bump?”. This blog aims to answer that question.

Before, during, and after the game a crack team of three people who work on Cloudflare Radar looked at real time statistics for traffic to advertisers’ websites, social media in the US, US food delivery services, and websites covering (American) football. Luckily, one of us (Kari) is (a) American and (b) a fan of football. Unluckily, one of us (Kari) is a fan of the Kansas City Chiefs.

Cloudflare Radar uses a variety of sources to provide aggregate information about Internet traffic and attack trends. In this blog post we use DNS name resolution data to estimate traffic to websites. We can’t see who visited the websites mentioned below, or what anyone did on the websites, but DNS can give us an estimate of the interest generated by the commercials. This analysis only looked at the top-level names in each domain (so example.com and www.example.com and not any other subdomains).

The Big Picture

To get the ball rolling here’s a look at traffic to NFL team websites and sports websites. Traffic builds to a peak as the game begins just after 1830 local time. As the game progresses traffic to those websites drops off hitting a mid-game low at about 2015 before jumping up for 30 minutes during the halftime show.

Who won Super Bowl LV? A look at Internet traffic during the game

The big peak at around 2145 appears to be at the same time as a streaker ran onto the field. A lesser peak comes soon after 2200 when the Buccaneers sealed their victory.

As well as reading about the game, fans were also using social media to get news and add their own commentary. Here’s a look at US social media use during the game.

Who won Super Bowl LV? A look at Internet traffic during the game

Social media use dips a little just as the game is about to begin and then ramps up until the Buccaneers’ victory is final. And then, as people go to sleep, social media use falls away.

Taking a look at food delivery services it looks like folks orders ramped up about 90 minutes before the game started and people’s hunger was sated before the game ended.

Who won Super Bowl LV? A look at Internet traffic during the game

The Internet Impact of Commercials

One question is “Does a Super Bowl commercial drive traffic to the company’s web site while the game is on?” Answer: yes. Here’s one that went vroom peaking at over 40x baseline:

Who won Super Bowl LV? A look at Internet traffic during the game

Vehicles and related services played a big part. GM tried to take on Norway with a commercial about their electric vehicle batteries.

Who won Super Bowl LV? A look at Internet traffic during the game

And the electric vehicle theme continued with Cadillac:

Who won Super Bowl LV? A look at Internet traffic during the game

And Jeep’s commercial caused a similar spike:

Who won Super Bowl LV? A look at Internet traffic during the game

Food and Drink

Anheuser-Busch had a “corporate” commercial this time (as opposed to a commercial for their individual brands). You can clearly see from this chart when that aired.

Who won Super Bowl LV? A look at Internet traffic during the game

And Bud Light got a boost that seemed to keep people thinking about beer through the game.

Who won Super Bowl LV? A look at Internet traffic during the game

Some brands see a spike when the commercial airs and then traffic reverts fairly quickly towards the baseline. Another brand that lingered on post-commercial is Mountain Dew. Prior to the commercial, traffic to the Mountain Dew website was steady, then came the airing of the commercial with a greater than 25x peak followed by interest throughout the game and into the night.

Who won Super Bowl LV? A look at Internet traffic during the game

Pepsi sponsored the halftime show and that’s clearly visible in the charts as they get a boost through The Weeknd’s performance.

Who won Super Bowl LV? A look at Internet traffic during the game

Moving on from drinks and reaching for the snacks we can see that Doritos were pretty popular all evening long with visible commercial induced spikes.

Who won Super Bowl LV? A look at Internet traffic during the game

Brands that might not be so well known get a large traffic boost from their Super Bowl commercials. Here’s the impact on oat milk company Oatly:

Who won Super Bowl LV? A look at Internet traffic during the game

And ever popular chain Jimmy John’s saw a large traffic jump when their commercial aired.

Who won Super Bowl LV? A look at Internet traffic during the game

Keeping Clean

Cleaning products (household and personal) were also the order of the day. First up, we have Dr. Squatch advertising their soap products for men.

Who won Super Bowl LV? A look at Internet traffic during the game

Microban24 makes hand sanitizer and got a jump from their commercial:

Who won Super Bowl LV? A look at Internet traffic during the game

But perhaps the biggest surprise in this category is Tide, which not only jumped up but stayed up throughout the game. Maybe it was the sight of all the sportswear on the field that was going to need cleaning:

Who won Super Bowl LV? A look at Internet traffic during the game

Highlights

The folks at WeatherTech showed their commercial more than once and hit over 20x baseline.

Who won Super Bowl LV? A look at Internet traffic during the game

Rocket Mortgage got a lot of people thinking about mortgages well into the night:

Who won Super Bowl LV? A look at Internet traffic during the game

Financial services firm Klarna got a big jump as the game was wrapping up.

Who won Super Bowl LV? A look at Internet traffic during the game

Throughout the game Paramount+ was touted more than once. Another streaming service? Definitely looked interesting to many.

Who won Super Bowl LV? A look at Internet traffic during the game

Skechers’ got people thinking about what they put on their feet throughout the first half of the game:

Who won Super Bowl LV? A look at Internet traffic during the game

And lastly, for this look at just some of the Super Bowl LV commercials, Fiverr got the message out about about freelancing:

Who won Super Bowl LV? A look at Internet traffic during the game

Which brings us to the all important question…

So, who won Super Bowl LV?

Taking “highest commercial induced peak” as the measure then it’s Dexcom. Dexcom makes wearable continuous glucose monitors for people with diabetes. They got a 100x boost.

Who won Super Bowl LV? A look at Internet traffic during the game

A close runner up is Inspiration4, a privately funded trip into space where one seat is up for grabs via a lottery. Inspiration4 went from very little traffic to over 70x and continuous interest throughout.

Who won Super Bowl LV? A look at Internet traffic during the game

Of course, this doesn’t tell the whole story. Inspiration4 had very little traffic prior to the game so the magnitude of the peak isn’t that surprising.

After a tough 2020 perhaps it’s not a surprise that a healthcare product and an inspirational project should “win” Super Bowl LV.

Of course, for brands that already get a lot of Internet traffic that spikes aren’t so high yet represent a great deal of traffic because the baseline is so much higher. Here’s online marketplace Mercari getting a 2x jump.

Who won Super Bowl LV? A look at Internet traffic during the game

And businesses like Disney or Amazon have so much traffic that commercials might drive a small increase in overall traffic but it tends to get lost.

One More Thing: Tom Brady

One person who didn’t advertise during the game but nevertheless got a bump in traffic to their website was Tom Brady. Brady’s fitness and nutrition brand TB12 saw traffic grow as the game ended and continued interest into the night.

Who won Super Bowl LV? A look at Internet traffic during the game

Want more?

Visit Cloudflare Radar for up to date Internet traffic and attack trends.

Uganda’s January 13, 2021 Internet Shut Down

Post Syndicated from Celso Martinho original https://blog.cloudflare.com/uganda-january-13-2021-internet-shut-down/

Uganda's January 13, 2021 Internet Shut Down

Two days ago, through its communications regulator, Uganda’s government ordered the “Suspension Of The Operation Of Internet Gateways” the day before the country’s general election. This action was confirmed by several users and journalists who got access to the letter sent to Internet providers. In other words, the government effectively cut off Internet access from the population to the rest of the world.

On Cloudflare Radar, we want to help anyone understand what happens on the Internet. We are continually monitoring our network and exposing insights, threats, and trends based on the aggregated data that we see.

Uganda’s unusual traffic patterns quickly popped up in our charts. Our 7-day change in Internet Traffic chart in Uganda shows a clear drop to near zero starting around 1900 local time, when the providers received the letter.

Uganda's January 13, 2021 Internet Shut Down

This is also obvious in the Application-level Attacks chart.

Uganda's January 13, 2021 Internet Shut Down

The traffic drop was also confirmed by the Uganda Internet eXchange point, a place where many providers exchange their data traffic, on their public statistics page.

Uganda's January 13, 2021 Internet Shut Down

We keep an eye on traffic levels and BGP routing to our edge network, and are able to see which networks carry traffic to and from Uganda and their relative traffic levels. The cutoff is clear in those statistics also. Each colored line is a different network inside Uganda (such as ISPs, mobile providers, etc.)

Uganda's January 13, 2021 Internet Shut Down

We will continue to keep an eye on traffic levels from Uganda and update the blog when we see significant changes. At the time of writing, Internet access appears to be still cut off.

Cloudflare Radar’s 2020 Year In Review

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-radar-2020-year-in-review/

Cloudflare Radar's 2020 Year In Review

Cloudflare Radar's 2020 Year In Review

Throughout 2020, we tracked changing Internet trends as the SARS-Cov-2 pandemic forced us all to change the way we were living, working, exercising and learning. In early April, we created a dedicated website https://builtforthis.net/ that showed some of the ways in which Internet use had changed, suddenly, because of the crisis.

On that website, we showed how traffic patterns had changed; for example, where people accessed the Internet from, how usage had jumped up dramatically, and how Internet attacks continued unabated and ultimately increased.

Today we are launching a dedicated Year In Review page with interactive maps and charts you can use to explore what changed on the Internet in 2020. Year In Review is part of Cloudflare Radar. We launched Radar in September 2020 to give anyone access to Internet use and abuse trends that Cloudflare normally had reserved only for employees.

Where people accessed the Internet

To get a sense for the Year In Review, let’s zoom in on London (you can do the same with any city from a long list of locations that we’ve analyzed). Here’s a map showing the change in Internet use comparing April (post-lockdown) and February (pre-lockdown). This map compares working hours Internet use on a weekday between those two months.

As you can clearly see, with offices closed in central London (and elsewhere), Internet use dropped (the blue colour) while usage increased in largely residential areas. Looking out to the west of London, a blue area near Windsor shows how Internet usage dropped at London’s Heathrow airport and surrounding areas.

Cloudflare Radar's 2020 Year In Review

A similar story plays out slightly later in the San Francisco Bay Area.

Cloudflare Radar's 2020 Year In Review

But that trend reverses in July, with an increase in Internet use in many places that saw a rapid decrease in April.

Cloudflare Radar's 2020 Year In Review

When you select a city from the map, a second chart shows the overall trend in Internet use for the country in which that city is located. For example, here’s the chart for the United States. The Y-axis shows the percentage change in Internet traffic compared to the start of the year.

Cloudflare Radar's 2020 Year In Review

Internet use really took off in March (when the lockdowns began) and rapidly increased to 40% higher than the start of the year. And usage has pretty much stayed there for all of 2020: that’s the new normal.

Here’s what happened in France (when selecting Paris) on the map view.

Cloudflare Radar's 2020 Year In Review

Internet use was flat until the lockdowns began. At that point, it took off and grew close to 40% over the beginning of the year. But there’s a visible slow down during the summer months, with Internet use up “only” 20% over the start of the year. Usage picked up again at “la rentrée” in September, with a new normal of about 30% growth in 2020.

What people did on the Internet

Returning to London, we can zoom into what people did on the Internet as the lockdowns began. The UK government announced a lockdown on March 23. On that day, the mixture of Internet use looked like this:

Cloudflare Radar's 2020 Year In Review

A few days later, the E-commerce category had jumped from 12.9% to 15.1% as people shopped online for groceries, clothing, webcams, school supplies, and more. Travel dropped from 1.5% of traffic to 1.1% (a decline of 30%).

Cloudflare Radar's 2020 Year In Review

And then by early mid-April E-commerce had increased to 16.2% of traffic with Travel remaining low.

Cloudflare Radar's 2020 Year In Review

But not all the trends are pandemic-related. One question is: to what extent is Black Friday (November 27, 2020) an event outside the US? We can answer that by moving the London slider to late November and look at the change in E-commerce. Watch carefully as E-commerce traffic grows towards Black Friday and actually peaks at 21.8% of traffic on Saturday, November 28.

As Christmas approached, E-commerce dropped off, but another category became very important: Entertainment. Notice how it peaked on Christmas Eve, as Britons, no doubt, turned to entertainment online during a locked-down Christmas.


And Hacking 2020

Of course, a pandemic didn’t mean that hacking activity decreased. Throughout 2020 and across the world, hackers continued to run their tools to attack websites, overwhelm APIs, and try to exfiltrate data.

Cloudflare Radar's 2020 Year In Review

Explore More

To explore data for 2020, you can check out Cloudflare Radar’s Year In Review page. To go deep into any specific country with up-to-date data about current trends, start at Cloudflare Radar’s homepage.

Internet traffic disruption caused by the Christmas Day bombing in Nashville

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/internet-traffic-disruption-caused-by-the-christmas-day-bombing-in-nashville/

Internet traffic disruption caused by the Christmas Day bombing in Nashville

On Christmas Day 2020, an apparent suicide bomb exploded in Nashville, TN. The explosion happened outside an AT&T network building on Second Avenue in Nashville at 1230 UTC. Damage to the AT&T building and its power supply and generators quickly caused an outage for telephone and Internet service for local people. These outages continued for two days.

Looking at traffic flow data for AT&T in the Nashville area to Cloudflare we can see that services continued operating (on battery power according to reports) for over five hours after the explosion, but at 1748 UTC we saw a dramatic drop in traffic. 1748 UTC is close to noon in Nashville when reports indicate that people lost phone and Internet service.

Internet traffic disruption caused by the Christmas Day bombing in Nashville

We saw traffic from Nashville via AT&T start to recover over a 45 minute period on December 27 at 1822 UTC making the total outage 2 days and 34 minutes.

Internet traffic disruption caused by the Christmas Day bombing in Nashville

Traffic flows continue to be normal and no further disruption has been seen.