All posts by David Tuber

Argo for Packets is Generally Available

2021-12-10 David Tuber

Post Syndicated from David Tuber original https://blog.cloudflare.com/argo-for-packets-generally-available/

What would you say if we told you your IP network can be faster by 10%, and all you have to do is reach out to your account team to make it happen?

Today, we’re announcing the general availability of Argo for Packets, which provides IP layer network optimizations to supercharge your Cloudflare network services products like Magic Transit (our Layer 3 DDoS protection service), Magic WAN (which lets you build your own SD-WAN on top of Cloudflare), and Cloudflare for Offices (our initiative to provide secure, performant connectivity into thousands of office buildings around the world).

If you’re not familiar with Argo, it’s a Cloudflare product that makes your traffic faster. Argo finds the fastest, most available path for your traffic on the Internet. Every day, Cloudflare carries trillions of requests, connections, and packets across our network and the Internet. Because our network, our customers, and their end users are well distributed globally, all of these requests flowing across our infrastructure paint a great picture of how different parts of the Internet are performing at any given time. Cloudflare leverages this picture to ensure that your traffic takes the fastest path through our infrastructure.

Previously, Argo optimized traffic at the Layer 7 application layer and at the Layer 4 protocol layer. With the GA of Argo for Packets, we’re now optimizing the IP layer for your private network. During Speed Week we announced the early access for Argo for Packets, and how it can offer a 10% latency reduction. Today, to celebrate Argo for Packets reaching GA, we’re going to dive deeper into the latency reductions, show you examples, explain how you can see even greater optimizations, and talk about how Argo’s secure data plane gives you additional encryption even at Layer 3.

And if you’re interested in enabling Argo for Packets today, please reach out to your account team to get the process started!

Better than BGP

As we said during Speed Week, Argo for Packets provides an average 10% latency improvement across the world in our internal testing:

As we moved towards GA, we found that our real world numbers match our internal testing, and we still see that 10% improvement. But it’s important to note that the 10% latency reduction numbers are an average across all paths across the world. Different customers can see different latency gains depending on their setup.

Argo for Packets achieves these latency gains by dynamically choosing the best possible path throughout our network. Let’s talk a bit about what that means.

Normal packets on the network find their way to their destination using something called the Border Gateway Protocol (BGP), which allows packets to traverse the “shortest” path to their destination. However, the shortest path in BGP terms isn’t strongly correlated with latency, but with network hops. For example, suppose traffic from a network has two possible paths to Cloudflare: 12345 – 54321 – 13335, and 12345 – 13335. Both paths start from the network 12345 and end at Cloudflare, which is AS 13335. BGP logic dictates that traffic will always take the second path, because it crosses fewer networks. But if the first path has lower network latency or lower packet loss, customers could be missing out on better performance and not know it!
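To make the distinction concrete, here is a minimal sketch in JavaScript. The AS paths are the ones from the example above; the latency figures are invented purely for illustration, and real BGP best-path selection weighs more attributes than AS-path length alone.

```javascript
// Two candidate routes from AS 12345 to Cloudflare (AS 13335).
// The latencyMs values are hypothetical, chosen only to illustrate the point.
const routes = [
  { asPath: [12345, 54321, 13335], latencyMs: 38 },
  { asPath: [12345, 13335], latencyMs: 61 },
];

// Simplified BGP-style choice: prefer the route with the fewest AS hops.
function pickByAsPathLength(candidates) {
  return candidates.reduce((best, r) =>
    r.asPath.length < best.asPath.length ? r : best);
}

// Latency-aware choice: prefer the route with the lowest measured latency.
function pickByLatency(candidates) {
  return candidates.reduce((best, r) =>
    r.latencyMs < best.latencyMs ? r : best);
}

const bgpChoice = pickByAsPathLength(routes);
const latencyChoice = pickByLatency(routes);

console.log('Shortest AS path:', bgpChoice.asPath.join(' - '));
console.log('Lowest latency:  ', latencyChoice.asPath.join(' - '));
```

With these made-up numbers the two selection rules disagree: hop count picks the direct route, while the latency measurement favors the longer one.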

There are two ways to remedy this. The first way is to invest in building out more pipes with network 12345 while expanding the network to be right next to every network. Customers can also build out their own networks or purchase expensive vendor MPLS networks. Either solution will cost a lot of money and time to reach the levels of performance customers want.

Cloudflare improves customer performance by leveraging our existing global network and backbone, plus the networking data from traffic that’s already being sent over to optimize routes back to you. This allows us to improve which paths are taken as traffic changes and congestion on the Internet happens. Argo looks at every path from every Cloudflare data center back to your origin, down to the individual network path. Argo compares existing Layer 4 traffic and network analytics across all of these unique paths to determine the fastest, most available path.

To make Argo personalized to your private network, Cloudflare makes use of a data source that we already built for Magic Transit. That data source: health check probes. Cloudflare leverages existing health check probes from every single Cloudflare data center back to each customer’s origin. These probes are used to determine the health of paths from Cloudflare back to a customer for Magic Transit, so that Cloudflare knows which paths back to origin are healthy. These probes contain a variety of information that can also be used to improve performance such as packet loss and latency data. By examining health check probes and adding them to existing Layer 4 data, Cloudflare can get a better understanding of one-way latencies and can construct a map that allows us to see all the interconnected data centers and how fast they are to each other. Cloudflare then finds the best path at layer 3 back to the customer datacenter by picking an entry location where the packet entered our network, and an exit location that is directly connected back to the customer via a Cloudflare Network Interconnect.

Using this map, Cloudflare constructs dynamic routes for each customer based on where their traffic enters Cloudflare’s network and where they need to go.
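As a rough illustration of picking a route over such a map, the sketch below runs a textbook shortest-path search (Dijkstra's algorithm) over a small latency graph. This is not Cloudflare's actual routing logic, which the post does not describe in detail; the data center codes and latencies are hypothetical.

```javascript
// Hypothetical one-way latencies (ms) between data centers; not real measurements.
const latencyMap = {
  SEA: { LAX: 27, DFW: 40, ATL: 78 },
  LAX: { SEA: 27, DFW: 30, ATL: 49 },
  DFW: { SEA: 40, LAX: 30, ATL: 17 },
  ATL: { SEA: 78, LAX: 49, DFW: 17 },
};

// Dijkstra's algorithm: lowest total latency from an entry to an exit location.
function fastestPath(graph, entry, exit) {
  const dist = {};
  const prev = {};
  const unvisited = new Set(Object.keys(graph));
  for (const node of unvisited) dist[node] = Infinity;
  dist[entry] = 0;

  while (unvisited.size > 0) {
    // Pick the unvisited node with the smallest known latency.
    let current = null;
    for (const node of unvisited) {
      if (current === null || dist[node] < dist[current]) current = node;
    }
    if (current === exit || dist[current] === Infinity) break;
    unvisited.delete(current);

    // Relax every edge leaving the current node.
    for (const [neighbor, ms] of Object.entries(graph[current])) {
      const candidate = dist[current] + ms;
      if (candidate < dist[neighbor]) {
        dist[neighbor] = candidate;
        prev[neighbor] = current;
      }
    }
  }

  // Walk back from the exit to rebuild the path.
  const path = [];
  for (let node = exit; node !== undefined; node = prev[node]) path.unshift(node);
  return { path, latencyMs: dist[exit] };
}

const seaToAtl = fastestPath(latencyMap, 'SEA', 'ATL');
console.log(seaToAtl.path.join(' -> '), seaToAtl.latencyMs + 'ms');
```

In this toy graph the direct Seattle to Atlanta link costs 78ms, but relaying through Dallas totals 57ms, which is the kind of indirect route a latency-aware map can surface.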

Let’s dive into some examples of how your latency reductions manifest depending on your setup.

Cloudflare’s Network is Your Network

In our Speed Week blog outlining how Magic products make your network faster, we outlined several different network topology examples and showed the improvements Magic Transit and Magic WAN had on their networks. Let’s supercharge those numbers by adding Argo for Packets on top of that to see how we can improve performance even further.

The example from the blog outlined a company with locations in South Carolina, Oregon, and Los Angeles. In that blog, we showed the latency improvements that Magic Transit by itself provided for one leg of the trip. That network looks like this:

Let’s break that out to show the latencies between all paths on that network. Let’s assume that South Carolina connects to Atlanta, and Oregon connects to Seattle, which is the most likely scenario:

Source Location	Destination Location	Magic WAN one-way latency (ms)	Argo for Packets one-way latency (ms)	Argo improvement (ms)	Latency improvement (%)
Los Angeles	Atlanta	49.1	45	4.11	8.36
Los Angeles	Seattle	32.4	27.2	5.18	16
Atlanta	Los Angeles	49	44.9	4.09	8.35
Atlanta	Seattle	78.1	56.9	21.2	27.1
Seattle	Los Angeles	32.2	27	5.22	16.2
Seattle	Atlanta	77.7	56.7	20.9	26.9

For this sample customer network, Argo for Packets improves latencies on every possible path. As you can see, the average percent improvement is much higher for this particular network than the global average of 10%.
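The average can be checked directly from the two latency columns. The short sketch below recomputes each path's percent improvement from the table values; because the published latencies are rounded, the recomputed figures can differ slightly from the table's last column.

```javascript
// One-way latencies in ms, copied from the table above.
const paths = [
  { from: 'Los Angeles', to: 'Atlanta', magicWan: 49.1, argo: 45 },
  { from: 'Los Angeles', to: 'Seattle', magicWan: 32.4, argo: 27.2 },
  { from: 'Atlanta', to: 'Los Angeles', magicWan: 49, argo: 44.9 },
  { from: 'Atlanta', to: 'Seattle', magicWan: 78.1, argo: 56.9 },
  { from: 'Seattle', to: 'Los Angeles', magicWan: 32.2, argo: 27 },
  { from: 'Seattle', to: 'Atlanta', magicWan: 77.7, argo: 56.7 },
];

// Percent improvement = latency saved / original latency.
function percentImprovement({ magicWan, argo }) {
  return ((magicWan - argo) / magicWan) * 100;
}

const improvements = paths.map(percentImprovement);
const averageImprovement =
  improvements.reduce((sum, p) => sum + p, 0) / improvements.length;

console.log(averageImprovement.toFixed(1) + '%');
```

Across the six paths the average works out to roughly 17%, well above the 10% global figure.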

Let’s take another example of a customer with locations in Asia: South Korea, Philippines, Singapore, Osaka, and Hong Kong. For a network with those locations, Argo for Packets is able to create a 17% latency reduction by finding the optimal paths between locations that were typically trickiest to navigate, like between South Korea, Osaka, and the Philippines. A customer with many locations will see huge benefits from Argo for Packets, because it optimizes the trickiest paths on the Internet and makes them just as fast as the other paths. It removes the latency incurred by your worst network paths and makes not only your average numbers look good, but especially your 90th percentile latency numbers.

Reducing these long-tail latencies is critical especially as customers move back to better and start returning to offices all around the world.

Next Stop: Your Office

Argo for Packets pairs brilliantly with Magic WAN and Cloudflare for Offices to create a hyper-optimized, ultra-secure, private network that adapts to whatever you throw at it. If this is your first time hearing about Cloudflare for Offices, it’s our new initiative to provide private, secure, performant connectivity to thousands of new locations around the world. And that private connectivity provides a great foundation for Argo for Packets to speed up your network.

Taking the above example from the United States, if this company adds two new locations in Boston and Dallas, those locations also see significant latency reduction through Argo for Packets. Now, their network looks like this:

Argo for Packets also ensures that those freshly added new offices will immediately see great performance on the private network:

Source Location	Destination Location	Argo improvement (ms)	Latency improvement (%)
Los Angeles	Dallas	9.89	23.3
Los Angeles	Atlanta	0.774	1.58
Los Angeles	Seattle	0.478	1.51
Los Angeles	Boston	13.3	16.8
Dallas	Los Angeles	9.66	23
Dallas	Atlanta	0	0
Dallas	Seattle	2.96	5.2
Dallas	Boston	0.43	0.955
Atlanta	Los Angeles	0.687	1.4
Atlanta	Dallas	0	0
Atlanta	Seattle	9.7	12.4
Atlanta	Boston	4.39	15.2
Seattle	Los Angeles	0.322	1.02
Seattle	Dallas	3.11	5.43
Seattle	Atlanta	9.81	12.6
Seattle	Boston	34.7	30.3
Boston	Los Angeles	13.3	16.8
Boston	Dallas	0.386	0.85
Boston	Atlanta	4.37	15
Boston	Seattle	33.7	29.6

Cloudflare for Offices makes it so easy to get those offices set up because customers don’t have to bring their perimeter firewalls, WAN devices, or anything else — they can just plug into Cloudflare at their building, and the power of Cloudflare One allows them to get all their network security services over a private connection to Cloudflare, optimized by Argo for Packets.

Your Network, but Faster

Argo for Packets is the perfect complement to any of our Cloudflare One solutions: providing faster bits through your network, built on Cloudflare. Now, your SD-WAN and Magic Transit solutions can be optimized to not just be secure, but performant as well.

If you’re interested in turning on Argo for Packets or onboarding your offices to a private and secure connectivity solution, reach out to your account team to get the process started.

Network Performance Update: Full Stack Week

2021-11-20 David Tuber

Post Syndicated from David Tuber original https://blog.cloudflare.com/network-performance-update-full-stack-week/

This blog was published on November 20, 2021. As we continue to optimize our network we’re publishing regular updates, which are available here.

A little over two months ago, we shared extensive benchmarking results of last mile networks all around the world. The results showed that across a range of tests (TCP connection time, time to first byte, time to last byte) and a range of measurements (p95, mean), Cloudflare was the fastest provider in 49% of networks around the world. Since then, we’ve worked to continuously improve performance until we’re the fastest everywhere. We set a goal to grow the number of networks where we’re the fastest by 10% every Innovation Week. We met that goal during Birthday Week (September 2021).

Today, we’re proud to report we blew the goal away for Full Stack Week (November 2021). Cloudflare measured our performance against the top 1,000 networks in the world (by number of IPv4 addresses advertised). Out of those, Cloudflare has become the fastest provider in 79 new networks, an increase of 14% of these 1,000 networks. Of course, we’re not done yet, but we wanted to share the latest results and explain how we did it.

However, before we go into more detail on our network performance, we wanted to share new performance metrics on our Workers platform (given it’s Full Stack Week!). We’ve crunched the numbers of Cloudflare Workers vs Fastly’s Compute@Edge, and the results are in: Workers is 196% faster.

Faster Network Means Faster Stack

A few months ago, we also discussed the performance of Cloudflare Workers, as compared to other similar offerings out there. We compared our performance to Lambda and Lambda@Edge, where Cloudflare Workers outperformed at 210% and 298% respectively.

At the time, we wanted to see how we measured up against all comparable offerings, but not all offerings were generally available. As a result, we weren’t able to report on how Workers compared to another solution: Fastly’s Compute@Edge.

Today, we’re excited to report that Cloudflare Workers is 196% faster than Fastly’s Compute@Edge based on the time to first byte from the tests we ran on 50 nodes using Catchpoint’s data from across the world.

As we have done in the past, we executed a function that simply returns the current time and measured wait time to first byte (the length of time between a client making an HTTP request to when the client receives the first byte of the request’s response, after DNS, connection, and TLS handshake). The tests were performed on November 8, 2021, using a free tier account for both Cloudflare Workers and Fastly’s Compute@Edge.

The code we ran on both providers was identical: a small function that returns all request headers.

// Register a fetch handler that answers every incoming request.
addEventListener('fetch', event => event.respondWith(handleRequest(event)));

// Echo the incoming request headers back as a JSON body.
async function handleRequest(event) {
  const requestHeaders = Object.fromEntries(event.request.headers);

  return new Response(JSON.stringify(requestHeaders), { status: 200 });
}

Blue: Cloudflare Workers
Green: Compute@Edge

If you want to explore the results on your own, here is a link to the data.

By building on our global network that we’re constantly accelerating, leveraging isolates, and driving cold starts to zero, we’re able to offer our customers ludicrously fast speeds across the board.

Now, let’s move on to an update on how Cloudflare’s broader network performance has continued to improve!

Measuring What Matters

To quantify network performance, we have to get enough data from around the world across all manner of different networks comparing ourselves with other providers. We used Real User Measurements (RUM) to fetch a 100kb file from several different providers. Users around the world report the performance of different providers. The more users who report the data, the higher fidelity the signal is. The goal is to provide an accurate picture of where different providers are faster, and more importantly, where Cloudflare can improve. You can read more about the methodology in the original Speed Week blog post here.

In the process of quantifying network performance, it became clear where we were not the fastest. After Birthday Week, we found 601 country/network pairs where we were more than 100ms behind the leading provider (where a country/network pair is defined as the performance of a network within a particular country).
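The 100ms filter is simple to express in code. The following is a sketch under stated assumptions: the data shape, provider names, ASNs (taken from the range reserved for documentation), and timings are all made up for illustration and are not drawn from the real measurement set.

```javascript
// Hypothetical measurements: TCP connection time (ms) per provider
// for each country/network pair. Names and numbers are made up.
const measurements = [
  { country: 'PE', network: 'AS64500', times: { cloudflare: 240, providerA: 90, providerB: 130 } },
  { country: 'LK', network: 'AS64501', times: { cloudflare: 150, providerA: 60, providerB: 45 } },
  { country: 'US', network: 'AS64502', times: { cloudflare: 20, providerA: 25, providerB: 31 } },
];

// Return the pairs where Cloudflare trails the fastest provider by more than thresholdMs.
function pairsBehindLeader(rows, thresholdMs = 100) {
  return rows
    .map(row => {
      const leaderMs = Math.min(...Object.values(row.times));
      return { ...row, gapMs: row.times.cloudflare - leaderMs };
    })
    .filter(row => row.gapMs > thresholdMs);
}

const lagging = pairsBehindLeader(measurements);
console.log(lagging.map(r => `${r.country}/${r.network}: ${r.gapMs}ms behind`));
```

Sorting the result by `gapMs` would give a natural worklist, with the widest gaps first.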

We are constantly working to figure out why we are slow on a given network, and then to improve. The challenges we faced were unique to each network and highlighted a variety of different issues that are prevalent on the Internet. We’re going to deep dive into a couple of networks, and show how we diagnosed and then improved performance.

But before we do, here are the results of our efforts in the past two weeks.

Cloudflare has become number one in TCP Connection Time in 79 new networks. This graph shows the number of networks where we ranked number 1 for TCP Connection Time during Full Stack Week compared to Birthday Week:

We’ve become faster in 79 more networks thanks to our efforts, which have represented a growth of 14% in networks where we were the fastest. Here’s a chart showing our ranking distribution comparing Birthday Week and Full Stack Week:

Now that we’ve talked about how we’ve improved, let’s share our stories about chasing peak performance across the world — each with a different set of challenges.

Placing Traffic Properly in Peru

Our first stop for improving network performance was in Peru. We observed that a lot of users in Lima were actually getting sent to Chile to be served. Cloudflare has multiple locations in Peru, so this shouldn’t have happened. Sending traffic to Chile caused us to be ranked fourth on that particular network in the country. Our engineers knew that the best way to get to number one was to ensure that all the Lima traffic stayed within the country, so we decided to look at why so much of our traffic was getting routed outside the country.

So much traffic was being routed outside the country because the network provider was distributing traffic to Cloudflare unevenly, and too many users were sent to one specific location. Our network has a series of checks and fail-safes that allow us to ensure that even if this happens, our users will continue to see a good experience. The checks were being engaged here because of the uneven distribution of traffic to our locations in Lima; as a side effect, however, the traffic was being sent out of the country.

To fix the situation in the short term, we decided to do a bit of manual load balancing across our locations in Lima while building automation to remove the need for manual actions in the future. We took one of the locations that was seeing the most traffic and stopped advertising some prefixes from that location. The hypothesis we had was that the traffic would simply flow to the other Lima locations instead of Chile, and everything would balance out, improving the TCP connect time for everyone while keeping the traffic in the country. We started to make the change on a small portion of our free traffic, and our hypothesis proved correct. At that point, we deployed the change in a larger scope, and the P90 Client TCP RTT dropped from 240ms to 60ms.
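For readers unfamiliar with the metric, P90 is the value that 90% of samples fall at or below. Here is a minimal nearest-rank implementation; the RTT samples are hypothetical and were chosen so the result mirrors the 240ms and 60ms figures quoted above.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical client TCP RTT samples in ms, before and after a routing change.
const rttBefore = [35, 40, 42, 55, 60, 180, 210, 230, 240, 260];
const rttAfter = [30, 32, 35, 38, 40, 44, 48, 55, 60, 75];

console.log('P90 before:', percentile(rttBefore, 90), 'ms');
console.log('P90 after: ', percentile(rttAfter, 90), 'ms');
```

A high percentile is a useful lens here because the users detoured out of the country sit in the slow tail, where a mean or median would partly hide them.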

As a result, Cloudflare is now number one in network performance in Peru.

Slimming Down Latencies in Sri Lanka

Our next example takes us halfway around the world to Sri Lanka, where we found a network provider who was routing requests from their users to Newark.

1 * * *
2 100.85.0.1 3.061ms 2.522ms 2.728ms
3 198.51.100.146 AS29766 3.651ms 1.855ms 2.715ms
4 198.51.100.145 AS29766 3.438ms 3.225ms 2.805ms
5 222.165.177.150 AS9329 2.233ms 2.272ms 2.843ms
6 222.165.177.145 AS9329 2.703ms 2.862ms 2.291ms
7 103.87.125.253 AS45489 3.658ms 3.708ms 3.613ms
8 103.87.124.245 AS45489 120.027ms 120.665ms 120.471ms
9 103.87.124.146 AS45489 115.597ms 115.863ms 115.178ms
10 50.208.235.157 be-107-2008-pe01.60hudson.ny.ibone.comcast.net AS7922 249.884ms 249.475ms 250.063ms -> going from Sri Lanka to New York
11 96.110.41.145 be-4101-cs01.newyork.ny.ibone.comcast.net AS7922 267.839ms 267.979ms 268.719ms
12 96.110.34.34 be-3112-pe12.111eighthave.ny.ibone.comcast.net AS7922 262.647ms 261.272ms 262.272ms
13 66.208.233.106 AS7922 262.378ms 258.948ms 258.057ms
14 172.70.108.4 AS13335 268.974ms 280.475ms 268.158ms
15 172.67.182.209 AS13335 267.329ms 266.466ms 266.593ms

This understandably caused significant latency problems, and Cloudflare was ranked fourth in Sri Lanka on this network as a result. Even though Colombo is a relatively small location, we moved as much traffic as possible and advertised through the location to improve the user experience and reduce the potential amount of traffic sent to Newark.
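One way to spot where a route leaves the region is to look for the largest latency jump between consecutive hops. The sketch below parses a few lines copied from the traceroute above; the parsing rules are an assumption based on this particular output format rather than a general traceroute parser.

```javascript
// A few hops copied from the traceroute above.
const traceroute = `
6 222.165.177.145 AS9329 2.703ms 2.862ms 2.291ms
7 103.87.125.253 AS45489 3.658ms 3.708ms 3.613ms
8 103.87.124.245 AS45489 120.027ms 120.665ms 120.471ms
9 103.87.124.146 AS45489 115.597ms 115.863ms 115.178ms
10 50.208.235.157 be-107-2008-pe01.60hudson.ny.ibone.comcast.net AS7922 249.884ms 249.475ms 250.063ms
11 96.110.41.145 be-4101-cs01.newyork.ny.ibone.comcast.net AS7922 267.839ms 267.979ms 268.719ms
`;

// Parse each line into { hop, asn, meanMs }, skipping hops with no replies.
function parseHops(text) {
  return text.trim().split('\n').flatMap(line => {
    const rtts = [...line.matchAll(/(\d+(?:\.\d+)?)ms\b/g)].map(m => Number(m[1]));
    if (rtts.length === 0) return [];
    const asnMatch = line.match(/\bAS(\d+)\b/);
    return [{
      hop: Number(line.trim().split(/\s+/)[0]),
      asn: asnMatch ? Number(asnMatch[1]) : null,
      meanMs: rtts.reduce((a, b) => a + b, 0) / rtts.length,
    }];
  });
}

// Find the hop with the largest latency increase over the previous hop.
function largestJump(hops) {
  let worst = null;
  for (let i = 1; i < hops.length; i++) {
    const deltaMs = hops[i].meanMs - hops[i - 1].meanMs;
    if (worst === null || deltaMs > worst.deltaMs) worst = { ...hops[i], deltaMs };
  }
  return worst;
}

const jump = largestJump(parseHops(traceroute));
console.log(`Hop ${jump.hop} (AS${jump.asn}) adds about ${jump.deltaMs.toFixed(0)}ms`);
```

On this excerpt the biggest jump lands on hop 10, the first hop in AS7922, which matches the point where the path is annotated as leaving Sri Lanka for New York.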

Once this was done, we noticed that the P90 Client TCP RTT dropped from 150ms to 50ms.

However, even though we were advertising all of our ranges through Colombo and our performance in aggregate improved, this provider was still sending traffic for some Cloudflare prefixes to Newark. We reached out to the provider and let them know about this user-impacting change they made.

After doing all of these things, Cloudflare moved from fourth in Sri Lanka to number one.

Update on Birthday Week

All of these network changes and more have allowed Cloudflare to become number one in network performance in more networks than before. During Birthday Week, we announced that we were faster in more networks than our competitors. Out of the top 1,000 networks in the world (by number of IPv4 addresses advertised), here’s how Cloudflare performed during Birthday Week (September 2021):

As of Full Stack Week (November 2021), we further improved our position to be faster in 79 new networks:

But we haven’t just increased our performance on the last mile, we’ve even gotten better on Time to Last Byte as well. Here’s how the landscape looked leading up to Birthday Week (September 2021):

And here’s the network landscape now (November 2021):

Cloudflare is also committed to being the fastest provider in every country. Network performance by country is a moving target, and is largely driven by which users are connecting on any given day. Also, looking at network performance in aggregate across countries for long time frames can leave a lot of data out. That being said, this world map uses that data to show the countries with the fastest network provider during Birthday Week (Sept 2021):

Here’s how it looks two months later during Full Stack Week (November 2021):

Long Tail Latency

The running theme of these performance updates has always been the long tail of issues to solve. Ironing out these kinks on our network is critical to ensure that we provide premier performance as we grow.

Our team has put in a lot of effort and yielded some great results, but we’re constantly trying to be faster. We’ve automated the discovery of performance issues like these, and we’re looking to build automation that will detect and remediate different classes of these issues to stay on top of network performance in the future.

Tracking performance like this doesn’t just make one number faster; it helps improve the performance of your entire stack, making everything lightning-fast.

We have one more innovation week coming in 2021, and we’ll be back to report on further progress on optimizing our performance globally.

Unboxing the Last Mile: Introducing Last Mile Insights

2021-09-16 David Tuber

Post Syndicated from David Tuber original https://blog.cloudflare.com/last-mile-insights/

“The last 20% of the work requires 80% of the effort.” The Pareto Principle applies in many domains — nowhere more so on the Internet, however, than on the Last Mile. Last Mile networks are heterogeneous and independent of each other, but all of them need to be running to allow for everyone to use the Internet. They’re typically the responsibility of Internet Service Providers (ISPs). However, if you’re an organization running a mission-critical service on the Internet, not paying attention to Last Mile networks is in effect handing off responsibility for the uptime and performance of your service over to those ISPs.

Probably not the best idea.

When a customer puts a service on Cloudflare, part of our job is to offer a good experience across the whole Internet. We couldn’t do that without focusing on Last Mile networks. In particular, we’re focused on two things:

Cloudflare needs to have strong connectivity to Last Mile ISPs and needs to be as close as possible to every Internet-connected person on the planet.
Cloudflare needs good observability tools to know when something goes wrong, and needs to be able to surface that data to you so that you can be informed.

Today, we’re excited to announce Last Mile Insights, to help with this last problem in particular. Last Mile Insights allows customers to see where their end-users are having trouble connecting to their Cloudflare properties. Cloudflare can now show customers the traffic that failed to connect to Cloudflare, where it failed to connect, and why. If you’re an enterprise Cloudflare customer, you can sign up to join the beta in the Cloudflare Dashboard starting today: in the Analytics tab under Edge Reachability.

The Last Mile is historically the most complicated, least understood, and in some ways the most important part of operating a reliable network. We’re here to make it easier.

What is the Last Mile?

The Last Mile is the connection between your home and your ISP. When we talk about how users connect to content on the Internet, we typically do it like this:

This is useful, but in reality, there are lots of things in the path between a user and anything on the Internet. Say that a user is connecting to a resource hosted behind Cloudflare. The path would look like this:

Cloudflare is a global Anycast network that takes traffic from the Internet and proxies it to your origin. Because we function as a proxy, we think of the life of a request in two legs: before it reaches Cloudflare (end users to Cloudflare), and after it reaches Cloudflare (Cloudflare to origin). However, in Internet parlance, there are generally three legs: the First Mile tends to represent the path from an origin server to the data that you are requesting. The Middle Mile represents the path from an origin server to any proxies or other network hops. And finally, there is the final hop from the ISP to the user, which is known as the Last Mile.

Issues with the Last Mile are difficult to detect. If users are unable to reach something on the Internet, it is difficult for the resource to report that there was a problem. This is because if a user never reaches the resource, then the resource will never know something is wrong. Multiply that one problem across hundreds of thousands of Last Mile ISPs coming from a diverse set of regions, and it can be really hard for services to keep track of all the possible things that can go wrong on the Internet. The above graphic actually doesn’t really reflect the scope of the problem, so let’s revise it a bit more:

It’s not an easy problem to keep on top of.

Brand New Last Mile Insights

Cloudflare is launching a closed beta of a brand new Last Mile reporting tool, Last Mile Insights. Last Mile Insights allows for customers to see where their end-users are having trouble connecting to their Cloudflare properties. Cloudflare can now show customers the traffic that failed to connect to Cloudflare, where it failed to connect, and why.

Access to this data is useful to our customers because when things break, knowing what is broken and why — and then communicating with your end users — is vital. During issues, users and employees may create support/helpdesk tickets and social media posts to understand what’s going on. Knowing what is going on, and then communicating effectively about what the problem is and where it’s happening, can give end users confidence that issues are identified and being investigated… even if the issues are occurring on a third party network. Beyond that, understanding the root of the problem can help with mitigations and speed time to resolution.

How do Last Mile Insights work?

Our Last Mile monitoring tools use a combination of signals and machine learning to detect errors and performance regressions on the Last Mile.

Among the signals: Network Error Logging (NEL) is a browser-based reporting system that allows users’ browsers to report connection failures to an endpoint specified by the webpage that failed to load. When a user is able to connect to Cloudflare on a site with NEL enabled, Cloudflare will pass back two headers that indicate to the browser that they should report any network failures to an endpoint specified in the headers. The browser will then operate as usual, and if something happens that prevents the browser from being able to connect to the site, it will log the failure as a report and send it to the endpoint. This all happens in real time; the endpoint receives failure reports instantly after the browser experiences them.
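The post does not name the two headers; in the W3C Network Error Logging draft they are `Report-To` and `NEL`. The sketch below builds a pair of them as plain strings. The endpoint URL, group name, and lifetime are hypothetical placeholders, not the values Cloudflare actually sends.

```javascript
// Build the two response headers that opt a site in to Network Error Logging.
// The endpoint URL and group name passed in below are hypothetical placeholders.
function buildNelHeaders({ endpointUrl, group, maxAgeSeconds }) {
  return {
    // Report-To names a reporting group and the endpoint(s) that receive reports.
    'Report-To': JSON.stringify({
      group,
      max_age: maxAgeSeconds,
      endpoints: [{ url: endpointUrl }],
    }),
    // NEL tells the browser to send network error reports to that group.
    'NEL': JSON.stringify({
      report_to: group,
      max_age: maxAgeSeconds,
    }),
  };
}

const nelHeaders = buildNelHeaders({
  endpointUrl: 'https://reports.example.com/nel',
  group: 'network-errors',
  maxAgeSeconds: 604800, // one week
});

console.log(nelHeaders);
```

The `report_to` value in the `NEL` header has to match the `group` in `Report-To`; that link is how the browser knows which endpoint should receive network error reports.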

The browser can send failure reports for many reasons: it could send reports because the TLS certificate was incorrect, the ISP or an upstream transit was having issues on the request path, the terminating server was overloaded and dropping requests, or a data center was unreachable. The W3C specification outlines specific buckets that the browser should sort failures into, and the browser uploads the matching bucket as the reason it could not connect. So the browser is literally telling the reporting endpoint why it was unable to reach the desired site. Here’s a sample report that a browser gives to Cloudflare’s endpoint:

The report itself is a JSON blob that contains a lot of things, but the things we care about are when in the request the failure occurred (phase), why the request failed (for example, tcp.timed_out), the ASN the request came over, and the metro area where the request came from. This information allows anyone looking at the reports to see where things are failing and why. Personally identifiable information is not captured in NEL reports. For more information, please see our KB article on NEL.
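To show how those fields combine into something actionable, here is a small sketch that tallies reports by network, metro, and failure reason. The report objects are hypothetical: `phase` and `type` follow the W3C report body, while `asn` and `metro` are assumed to be attached by the collector rather than sent by the browser.

```javascript
// Hypothetical NEL reports. The body fields follow the W3C report format;
// `asn` and `metro` are assumed to be added by the collector, not the browser.
const reports = [
  { type: 'network-error', asn: 64500, metro: 'San Diego', body: { phase: 'connection', type: 'tcp.timed_out' } },
  { type: 'network-error', asn: 64500, metro: 'San Diego', body: { phase: 'connection', type: 'tcp.timed_out' } },
  { type: 'network-error', asn: 64500, metro: 'San Diego', body: { phase: 'dns', type: 'dns.name_not_resolved' } },
  { type: 'network-error', asn: 64501, metro: 'Lima', body: { phase: 'connection', type: 'tcp.reset' } },
];

// Count failures per ASN / metro / phase / failure type.
function summarize(items) {
  const counts = new Map();
  for (const { asn, metro, body } of items) {
    const key = `AS${asn} | ${metro} | ${body.phase} | ${body.type}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Most frequent failure first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const summary = summarize(reports);
for (const [key, count] of summary) console.log(count, key);
```

A spike in a single row of this summary, such as connection timeouts concentrated on one ASN in one metro, points at a Last Mile problem rather than an origin outage.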

Many services can operate their own reporting endpoints and set their own headers indicating that users who connect to their site should upload these reports to the endpoint they specify. Cloudflare is also an operator of one such endpoint, and we’re excited to open up the data collected by us for customer use and visibility. Let’s talk about a customer who used Last Mile Insights to help make a bad day on the Internet a little better.

Case Study: Canva

Canva is a Cloudflare customer that provides a design and collaboration platform hosted in the cloud. With more than 60 million users around the world, having constant access to Canva’s platform is critical. Last year, Canva users connecting through Cox Communications in San Diego started to experience connectivity difficulties. Around 50% of Canva’s users connecting via Cox Communications saw disconnects during that time period, and these users weren’t able to access Canva or Cloudflare at all. This wasn’t a Canva or Cloudflare outage, but rather, was caused by Cox routing traffic destined for Canva incorrectly, and causing errors for mutual Cox/Canva customers as a result.

Normally, this scenario would have taken hours to diagnose and even longer to mitigate. Canva would’ve seen a slight drop in traffic, but as the outage wasn’t on Canva’s side, it wouldn’t have flagged any alerts based on traffic drops. Canva engineers, in this case, would be notified by the users which would then be followed by a lengthy investigation to diagnose the problem.

Fortunately, Cloudflare has invested in monitoring systems to proactively identify issues exactly like these. Within minutes of the routing anomaly being introduced on Cox’s network, Canva was made aware of the issue via our monitoring, and a conversation with Cox was started to remediate the issue. Meanwhile, Canva could advise their users on the steps to fix it.

Cloudflare is excited to be offering our internal monitoring solution to our customers so that they can see what we see.

But providing insights into seeing where problems happen on the Last Mile is only part of the solution. In order to truly deliver a reliable, fast network, we also need to be as close to end users as possible.

Getting close to users

Getting close to end users is important for one reason: it minimizes the time spent on the Last Mile. These networks can be unreliable and slow. The best way to improve performance is to spend as little time on them as possible. And the only way to do that is to get close to our users. In order to get close to our users, Cloudflare is constantly expanding our presence into new cities and markets. We’ve just announced expansion into new markets and are adding even more new markets all the time to get as close to every network and every user as we can.

This is because not every network is the same. Some users may be clustered very close together in cities with high bandwidth; in other places, this may not be the case. Because user populations are not homogeneous, each ISP operates their network differently to meet the needs of their users. Physical distance from where servers are matters a great deal, because nobody can outrun the speed of light. If you’re farther away from the content you want, it will take longer to reach it. But distance is not the only variable; bandwidth and speed will also vary, because networks are operated differently all over the world. But one thing we do know is that your network performance will also be impacted by how healthy your Last Mile network is.

Healthier networks perform better

A healthy network has no downtime, minimal congestion, and low packet loss. Downtime, congestion, and packet loss all add latency. If you’re driving somewhere, street closures, traffic, and bad roads will prevent you from going as fast as possible to where you need to be. Healthy networks provide the best possible conditions for you to connect, and Last Mile performance is better because of it. Consider three networks in the same country: ISP A, ISP B, and ISP C. These ISPs have similar distribution among their users. ISP A is healthy and is directly connected with Cloudflare. ISP B is healthy but is not directly connected to Cloudflare. ISP C is an unhealthy network. Our data shows that Last Mile latencies for ISP C are significantly higher than for ISP A or ISP B because the network quality of ISP C is worse.

This box plot shows that the latencies to Cloudflare for ISP C are 360% higher than ISPs A or B.

We want all networks to be like ISP A, but that’s not always the case, and it’s something Cloudflare can’t control. The only thing Cloudflare can do to mitigate performance problems like these is to limit how much time you spend on these networks.

Shrinking the Last Mile gives better performance

By placing data centers close to our users, we reduce the amount of time spent on these Last Mile networks, and the latency between end users and Cloudflare goes down. A great example of this is how bringing up new locations in Africa affected the latency for the Internet-connected population there. Blue shows the latency before these locations were added, and red shows after:

Our efforts globally have brought 95% of the Internet-connected population within 50ms of us:

You will also notice that 80% of the Internet is within 30ms of us. The tail for Last Mile latencies is very long, and every data center we add helps bring that tail closer to great performance. As we expand into more locations and countries, more of the Internet will be even better connected.
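One way to read these coverage figures is as a cumulative distribution: for a given latency threshold, what share of the connected population falls at or below it? The sketch below assumes a hypothetical list of (latency, population) pairs; the numbers are invented to mirror the 80% and 95% figures and are not Cloudflare measurements.

```python
def share_within(latency_population, threshold_ms):
    """Return the fraction of the population at or below threshold_ms.

    latency_population: list of (latency_ms, population) pairs.
    The data format is an assumption made for illustration.
    """
    total = sum(pop for _, pop in latency_population)
    if total == 0:
        raise ValueError("population must be non-zero")
    covered = sum(pop for latency, pop in latency_population
                  if latency <= threshold_ms)
    return covered / total


# Invented sample data, chosen only to illustrate the calculation.
samples = [(12, 500), (28, 300), (45, 150), (120, 50)]
print(share_within(samples, 30))  # 0.8
print(share_within(samples, 50))  # 0.95
```

The last entry in the sample stands in for the long tail: a small population with high latency that only a closer data center can move under the threshold.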

But even when the Last Mile is shrunken down by our infrastructure expansions, networks can still have issues that are difficult to detect. Existing logging and monitoring solutions don’t provide a good way to see what the problem is. Cloudflare has built a sophisticated set of tools to identify issues with Last Mile networks outside our control and reduce time to resolution, and these tools have already found problems on the Last Mile for our customers.

Cloudflare has unique performance and insight into Last Mile networking

Running an application on the Internet requires customers to look at the whole Internet. Many cloud services optimize latency starting at the first mile and work their way out, because it’s easier to optimize for things they can control. Because the Last Mile is controlled by hundreds or thousands of ISPs, it is difficult to influence how the Last Mile behaves.

Cloudflare is focused on closing performance gaps everywhere, including close to your users and employees. Last mile performance and reliability is critically important to delivering content, keeping employees productive, and all the other things the world depends on the Internet to do. If a Last Mile provider is having a problem, then users connecting to the Internet through them will have a bad day.

Cloudflare’s efforts to provide better Last Mile performance and visibility allow customers to rely on Cloudflare to optimize the Last Mile, making it one less thing they have to think about. Through Last Mile Insights (available today in the Cloudflare Dashboard, in the Analytics tab under Edge Reachability) and our network expansion efforts, we want to provide you the ability to see what’s really happening on the Internet while knowing that Cloudflare is working on giving your users the best possible Internet experience.

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

2021-09-14 David Tuber

Post Syndicated from David Tuber original https://blog.cloudflare.com/orpheus/

Cloudflare’s mission is to help build a better Internet for everyone. Building a better Internet means helping build more reliable and efficient services that everyone can use. To help realize this vision, we’re announcing the free distribution of two products, one old and one new:

Tiered Caching is now available to all customers for free. Tiered Caching reduces origin data transfer and improves performance, making web properties cheaper and faster to operate. Tiered Cache was previously a paid addition to Free, Pro, and Business plans as part of Argo.
Orpheus is now available to all customers for free. Orpheus routes around problems on the Internet to ensure that customer origin servers are reachable from everywhere, reducing the number of errors your visitors see.

Tiered Caching: improving website performance and economics for everyone

Tiered Cache uses the size of our network to reduce requests to customer origins by dramatically increasing cache hit ratios. With data centers around the world, Cloudflare caches content very close to end users, but if a piece of content is not in cache, the Cloudflare edge data centers must contact the origin server to receive the cacheable content. This can be slow and places load on an origin server compared to serving directly from cache.

Tiered Cache works by dividing Cloudflare’s data centers into a hierarchy of lower-tiers and upper-tiers. If content is not cached in lower-tier data centers (generally the ones closest to a visitor), the lower-tier must ask an upper-tier to see if it has the content. If the upper-tier does not have it, only the upper-tier can ask the origin for content. This practice improves bandwidth efficiency by limiting the number of data centers that can ask the origin for content, reduces origin load, and makes websites more cost-effective to operate.
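To make the hierarchy concrete, here is a minimal sketch of the lookup flow described above. The class names (`Origin`, `UpperTier`, `LowerTier`) are hypothetical and exist only for illustration; this is a toy model with no expiry or eviction, not Cloudflare’s implementation.

```python
class Origin:
    """Stand-in for a customer origin that counts how often it is asked."""

    def __init__(self, content):
        self.content = content
        self.requests = 0

    def fetch(self, key):
        self.requests += 1
        return self.content[key]


class UpperTier:
    """Only upper tiers are allowed to contact the origin."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.origin.fetch(key)
        return self.cache[key]


class LowerTier:
    """Lower tiers ask their upper tier on a miss, never the origin."""

    def __init__(self, upper):
        self.upper = upper
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.upper.get(key)
        return self.cache[key]


origin = Origin({"/logo.png": b"image-bytes"})
upper = UpperTier(origin)
lower_tiers = [LowerTier(upper) for _ in range(5)]

# Five lower tiers each miss once, but the origin is only asked once.
for tier in lower_tiers:
    tier.get("/logo.png")
print(origin.requests)  # 1
```

Without the upper tier, each of the five lower tiers would have gone to the origin on its own miss, for five origin requests instead of one.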

Dividing data centers like this results in improved performance for visitors because distances and links traversed between Cloudflare data centers are generally shorter and faster than the links between data centers and origins. It also reduces load on origins, making web properties more economical to operate. Customers enabling Tiered Cache can achieve a 60% or greater reduction in their cache miss rate as compared to Cloudflare’s traditional CDN service.

Additionally, Tiered Cache concentrates connections to origin servers so they come from a small number of data centers rather than the full set of network locations. This results in fewer open connections using server resources.

Tiered Cache is simple to enable:

Log into your Cloudflare account.
Navigate to Caching in the dashboard.
Under Caching, select Tiered Cache.
Enable Tiered Cache.

From there, customers will automatically be enrolled in Smart Tiered Cache Topology without needing to make any additional changes. Enterprise Customers can select from different prefab topologies or have a custom topology created for their unique needs.

Smart Tiered Cache dynamically selects the single best upper tier for each of your website’s origins with no configuration required, using Cloudflare’s performance and routing data. Cloudflare collects latency data for each request to an origin. Using this latency data, we can determine how well any upper-tier data center is connected with an origin and can empirically select the data center with the lowest latency to be the upper tier for that origin.
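As a rough illustration of latency-based selection, the sketch below picks the candidate data center with the lowest observed latency to an origin. The function name, the data center labels, and the choice of the median as the summary statistic are all assumptions made for this example; the post does not say how the latency samples are aggregated.

```python
from statistics import median


def pick_upper_tier(latency_samples_ms):
    """Pick the data center with the lowest median latency to the origin.

    latency_samples_ms: dict mapping a data center label to a list of
    latency samples in milliseconds (hypothetical input format).
    """
    candidates = {
        dc: median(samples)
        for dc, samples in latency_samples_ms.items()
        if samples  # skip data centers with no measurements
    }
    if not candidates:
        raise ValueError("no latency samples available")
    return min(candidates, key=candidates.get)


# Invented samples for an origin that happens to sit closest to "ORD".
samples = {
    "ORD": [8, 9, 11],
    "IAD": [22, 25, 21],
    "SEA": [48, 51, 47],
}
print(pick_upper_tier(samples))  # ORD
```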

Today, Smart Tiered Cache is being offered to ALL Cloudflare customers for free, in contrast to other CDNs who may charge exorbitant fees for similar or worse functionality. Current Argo customers will get additional benefits described here. We think that this is a foundational improvement to the performance and economics of running a website.

But what happens if an upper-tier can’t reach an origin?

Orpheus: solving origin reachability problems for everyone

Cloudflare is a reverse proxy that receives traffic from end users and proxies requests back to customer servers or origins. To be successful, Cloudflare needs to be reachable by end users while simultaneously being able to reach origins. With end users around the world, Cloudflare needs to be able to reach origins from multiple points around the world at the same time. This is easier said than done! The Internet is not homogeneous, and diverse Cloudflare network locations do not necessarily take the same paths to a given customer origin at any given time. A customer origin may be reachable from some networks but not from others.

Cloudflare developed Argo to be the Waze of the Internet, allowing our network to react to changes in Internet traffic conditions and route around congestion and breakages in real-time, ensuring end users always have a good experience. Argo Smart Routing provides amazing performance and reliability improvements to our customers.

Enter Orpheus. Orpheus provides reachability benefits for customers by finding unreachable paths on the Internet in real time, and guiding traffic away from those paths, ensuring that Cloudflare will always be able to reach an origin no matter what is happening on the Internet.

Today, we’re excited to announce that Orpheus is available to and being used by all our customers.

Fewer 522s

You may have seen this error before at one time or another.

This error indicates that a user was unable to reach content because Cloudflare couldn’t reach the origin. Because of the unpredictability of the Internet described above, users may see this error even when an origin is up and able to receive traffic.

So why do you see this error? The 522 error occurs when network instability causes traffic sent by Cloudflare to fail either before it reaches the origin, or on the way back from the origin to Cloudflare. This is the equivalent of either Cloudflare or your origin sending a request and never getting a response. Both sides think that they’re fine, but the network path between them is not reachable at all. This causes customer pain.

Orpheus solves that pain, ensuring that no matter where users are or where the origin is, an Internet application will always be reachable from Cloudflare.

How it works

Orpheus builds and provisions routes from Cloudflare to origins by analyzing data from users on every path from Cloudflare and ordering them on a per-data center level with the goal of eliminating connection errors and minimizing packet loss. If Orpheus detects errors on the current path from Cloudflare back to a customer origin, Orpheus will steer subsequent traffic from the impacted network path to the healthiest path available.

This is similar to how Argo works but with some key differences: Argo is always steering traffic down the fastest path, whereas Orpheus is reactive and steers traffic down healthy (and not necessarily the fastest) paths when needed.
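The reactive behavior can be sketched as a small decision function: stay on the current path while it is healthy, and move to the healthiest known path only when errors appear. The function name, the path labels, and the 1% error threshold are hypothetical; the post does not describe the actual thresholds or data structures Orpheus uses.

```python
def choose_path(error_rates, current, error_threshold=0.01):
    """Return the path to use for subsequent traffic.

    error_rates: dict mapping a path label to its observed error rate (0..1).
    current: label of the path currently in use.
    error_threshold: assumed cutoff above which a path counts as unhealthy.
    """
    if error_rates[current] <= error_threshold:
        return current  # healthy: no reason to steer away
    # Unhealthy: steer to the path with the fewest observed errors,
    # which is not necessarily the fastest one.
    return min(error_rates, key=error_rates.get)


# Invented error rates for three paths back to one origin.
error_rates = {"direct": 0.08, "via-boston": 0.001, "via-seattle": 0.004}
print(choose_path(error_rates, "direct"))       # via-boston
print(choose_path(error_rates, "via-seattle"))  # via-seattle
```

The second call shows the difference from always choosing the best path: traffic already on a healthy path stays put, even though another path has a lower error rate.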

Improving origin reachability for customers

Let’s look at an example.

Barry has an origin hosted on WordPress in Chicago for his daughter’s band. This zone primarily sees traffic from three locations: the location closest to his daughter in Seattle, the location closest to him in Boston, and the location closest to his parents in Tampa, who check in on their granddaughter’s site daily for updates.

One day, a link between Tampa and the Chicago origin gets cut by a wandering backhoe. This means that Tampa loses some connectivity back to the Chicago origin. As a result, Barry’s parents start to see failures when they connect to the site, because requests from Tampa fail on the way back to the origin. This shows up as a decrease in origin reachability. Orpheus helps here by finding alternate paths for Barry’s parents, whether it’s through Boston, Seattle, or any location in between that isn’t impacted by the fiber cut seen in Tampa.

So even though there is packet loss between one of Cloudflare’s data centers and Barry’s origin, because there is a path through a different Cloudflare data center that doesn’t have loss, the traffic will still succeed because the traffic will go down the non-lossy path.

How much does Orpheus help my origin reachability?

In our rollout of Orpheus for customers, we observed that Orpheus reduced origin reachability failures by 23%, improving origin reachability from 99.87% to 99.90%. Here is a chart showing the improvement Orpheus provides (lower is better):

We quantify this reachability improvement by measuring 522 rates for every data center-origin pair and then comparing traffic that traversed Orpheus routes with traffic that went directly back to origin. Orpheus was especially helpful at improving reachability for slightly lossy paths that could present small amounts of failure over a long period of time, whereas traffic sent directly to origin would see those failures.

Note that we’ll never get this number to 0% because, with or without Orpheus, some origins really are unreachable because they are down!
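A 0.03 percentage point gain in reachability and a 23% improvement can look inconsistent at first glance. The two line up once the change is measured against the failure rate rather than the success rate. The short calculation below uses only the rounded figures quoted above; the function name is ours.

```python
def failure_reduction_pct(reachability_before, reachability_after):
    """Relative reduction in the failure rate, as a percentage.

    Both arguments are reachability percentages, e.g. 99.87.
    """
    failures_before = 100.0 - reachability_before  # 0.13% of requests
    failures_after = 100.0 - reachability_after    # 0.10% of requests
    return (failures_before - failures_after) / failures_before * 100.0


print(round(failure_reduction_pct(99.87, 99.90), 1))  # 23.1
```

In other words, failures drop from roughly 13 to 10 out of every 10,000 requests, which is a reduction of about 23%.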

Orpheus makes Cloudflare products better

Orpheus pairs well with some of our products that are already designed to provide highly available services on an uncertain Internet. Let’s go over the interactions between Orpheus and three of our products: Load Balancing, Cloudflare Network Interconnect, and Tiered Cache.

Load Balancing

Orpheus and Load Balancing go together to provide high reachability for every origin endpoint. Load balancing allows for automatic selection of endpoints based on health probes, ensuring that if an origin isn’t working, customers will still be available and operational. Orpheus finds reachable paths from Cloudflare to every origin. These two products in tandem provide a highly available and reachable experience for customers.

Cloudflare Network Interconnect

Orpheus and Cloudflare Network Interconnect (CNI) combine to always provide a highly reachable path, no matter where in the world you are. Consider Acme, a company that is connected to the Internet by only one provider that has a lot of outages. Orpheus will do its best to steer traffic around the lossy paths, but if there’s only one path back to the customer, Orpheus won’t be able to find a less-lossy path. Cloudflare Network Interconnect solves this problem by providing a path that is separate from the transit provider that any Cloudflare data center can access. CNI provides a viable path back to Acme’s origin that will allow Orpheus to engage from any data center in the world if loss occurs.

Shields for All

Orpheus and Tiered Cache can combine to build an adaptive shield around an origin that caches as much as possible while improving traffic back to origin. Tiered Cache topologies allow for customers to deflect much of their static traffic away from their origin to reduce load, and Orpheus helps ensure that any traffic that has to go back to the origin traverses over highly available links.

Improving origin performance for everyone

The Internet is a growing, ever-changing ecosystem. With the release of Orpheus and Tiered Cache for everyone, we’ve given you the ability to navigate whatever the Internet has in store to provide the best possible experience to your customers.

Argo 2.0: Smart Routing Learns New Tricks

2021-09-14 David Tuber

Post Syndicated from David Tuber original https://blog.cloudflare.com/argo-v2/

We launched Argo in 2017 to improve performance on the Internet. Argo uses real-time global network information to route around brownouts, cable cuts, packet loss, and other problems on the Internet. Argo makes the network that Cloudflare relies on—the Internet—faster, more reliable, and more secure on every hop around the world.

Without any complicated configuration, Argo is able to use real-time traffic data to pick the fastest path across the Internet, improving performance and delivering more satisfying experiences to your customers and users.

Today, Cloudflare is announcing several upgrades to Argo’s intelligent routing:

When it launched, Argo was entirely focused on the “middle mile,” speeding up connections from Cloudflare to our customers’ servers. Argo now delivers optimal routes from clients and users to Cloudflare, further reducing end-to-end latency while still providing the impressive edge to origin performance that Argo is known for. These last-mile improvements reduce end user round trip times by up to 40%.
We’re also adding support for accelerating pure IP workloads, allowing Magic Transit and Magic WAN customers to build IP networks that enjoy the performance benefits of Argo.

Starting today, all Free, Pro, and Business plan Argo customers will see improved performance with no additional configuration or charge. Enterprise customers have already enjoyed the last mile performance improvements described here for some time. Magic Transit and Magic WAN customers can contact their account team to request Early Access to Argo Smart Routing for Packets.

What’s Argo?

Argo finds the best and fastest possible path for your traffic on the Internet. Every day, Cloudflare carries hundreds of billions of requests across our network and the Internet. Because our network, our customers, and their end users are well distributed globally, all of these requests flowing across our infrastructure paint a great picture of how different parts of the Internet are performing at any given time.

Just like Waze examines real data from real drivers to give you accurate, uncongested — and sometimes unorthodox — routes across town, Argo Smart Routing uses the timing data Cloudflare collects from each request to pick faster, more efficient routes across the Internet.

In practical terms, Cloudflare’s network is expansive in its reach. Some Internet links in a given region may be congested and cause poor performance (a literal traffic jam). By understanding this is happening and using alternative network locations and providers, Argo can put traffic on a less direct, but faster, route from its origin to its destination.

These benefits are not theoretical: enabling Argo Smart Routing shaves an average of 33% off HTTP time to first byte (TTFB).

One other thing we’re proud of: we’ve stayed super focused on making it easy to use. One click in the dashboard enables better, smarter routing, bringing the full weight of Cloudflare’s network, data, and engineering expertise to bear on making your traffic faster. Advanced analytics allow you to understand exactly how Argo is performing for you around the world.

You can read a lot more about how Argo works in our original launch blog post.

Even More Blazing Fast

We’ve continuously improved Argo since the day it was launched, making it faster, quicker to respond to changes on the Internet, and allowing more types of traffic to flow over smart routes.

Argo’s new performance optimizations improve last mile latencies and reduce time to first byte even further. Argo’s last mile optimizations can save up to 40% on last mile round trip time (RTT) with commensurate improvements to end-to-end latency.

Running benchmarks against an origin server in the central United States, with visitors coming from around the world, Argo delivered the following results:

The Argo improvements on the last mile reduced overall time to first byte by 39%, and reduced end-to-end latencies by 5% overall:

Faster, better caching

Argo customers don’t just see benefits to their dynamic traffic. Argo’s newfound skills provide benefits for static traffic as well. Because Argo now finds the best path to Cloudflare, client TTFB for cache hits sees the same last mile benefit as dynamic traffic.

Getting access to faster Argo

The best part about all these improvements? They’re already deployed and enabled for all Argo customers! These optimizations have been live for Enterprise customers for some time and were enabled for Free, Pro, and Business plans this week.

Moving Down the Stack: Argo Smart Routing for Packets

Customers use Magic Transit and Magic WAN to create their own IP networks on top of Cloudflare’s network, with access to a full suite of network functions (firewalls, DDoS mitigation, and more) delivered as a service. This allows customers to build secure, private, global networks without the need to purchase specialized hardware. Now, Argo Smart Routing for Packets allows these customers to create these IP networks with the performance benefits of Argo.

Consider a fictional gaming company, Golden Fleece Games. Golden Fleece deployed Magic Transit to mitigate attacks by malicious actors on the Internet. They want to be able to provide a quality game to their users while staying up. However, they also need their service to be as fast as possible. If their game sees additional latency, then users won’t play it: even if their service is technically up, the increased latency will cause a decrease in users. For Golden Fleece, being slow is just as bad as being down.

Finance customers have similar requirements for low latency, high security scenarios. Consider Jason Financial, a fictional Magic Transit customer using Argo Smart Routing for Packets. Jason Financial employees connect to Cloudflare in New York, and their requests are routed to their data center, which is connected to Cloudflare through a Cloudflare Network Interconnect attached to Cloudflare in Singapore. For Jason Financial, reducing latency is extraordinarily important: if their network is slow, then the latency penalties they incur can literally cost them millions of dollars due to how fast the stock market moves. Jason Financial wants Magic Transit and other Cloudflare One products to secure their network and prevent attacks, but improving performance is important for them as well.

Argo’s Smart Routing for Packets provides these customers with the security they need at speeds faster than before. Now, customers can get the best of both worlds: security and performance. Let’s talk a bit about how it works.

A bird’s eye view of the Internet

Argo Smart Routing for Packets picks the fastest possible path between two points. But how does Argo know that the chosen route is the fastest? As with all Argo products, the answer comes by analyzing a wealth of network data already available on the Cloudflare edge. In Argo for HTTP or Argo for TCP, Cloudflare is able to use existing timing data from traffic that’s already being sent over our edge to optimize routes. This allows us to improve which paths are taken as traffic changes and congestion on the Internet happens. However, to build Smart Routing for Packets, the game changed, and we needed to develop a new approach to collect latency data at the IP layer.

Let’s go back to the Jason Financial case. Previously, Argo would understand that the number of paths available from Cloudflare’s data centers back to Jason’s data center is proportional to the number of data centers Cloudflare has multiplied by the number of distinct interconnections between each data center. By looking at the traffic to Singapore, Cloudflare can use existing Layer 4 traffic and network analytics to determine the best path. But Layer 4 is not Layer 3, and when you move down the stack, you lose some insight into things like round trip time (RTT) and other metrics that compose time to first byte, because that data is only produced at higher levels of the application stack. It can become harder to figure out what the best path actually is.

Optimizing performance at the IP layer can be more difficult than at higher layers. This is because protocol and application layers have additional headers and stateful protocols that allow for further optimization. For example, connection reuse is a performance improvement that can only be realized at higher layers of the stack because HTTP requests can reuse existing TCP connections. IP layers don’t have the concept of connections or requests at all: it’s just packets flowing over the wire.

To help bridge the gap, Cloudflare makes use of a data source that already exists for every Magic Transit customer today: health check probes. Every Magic Transit customer relies on health check probes sent from every single Cloudflare data center back to the customer origin. These probes are used to determine tunnel health for Magic Transit, so that Cloudflare knows which paths back to origin are healthy. These probes contain a variety of information that can also be used to improve performance. By examining health check probes and adding them to existing Layer 4 data, Cloudflare can get a better understanding of one-way latencies and can construct a map that allows us to see all the interconnected data centers and how fast they are to each other. Once a customer gets a Cloudflare Network Interconnect, Argo can use the data center-to-data center probes to create an alternate path for the customer that’s different from the public Internet.
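To give a feel for how probe results could be turned into a latency map, here is a minimal sketch. The probe record shape (source, destination, latency, healthy flag), the location labels, and the use of the median are all assumptions made for this example; the post does not describe the real probe format or aggregation.

```python
from collections import defaultdict
from statistics import median


def build_latency_map(probes):
    """Aggregate probe results into a per-link latency map.

    probes: iterable of (src, dst, latency_ms, healthy) tuples.
    Returns a dict mapping (src, dst) to the median latency of the
    healthy probes seen on that link.
    """
    samples = defaultdict(list)
    for src, dst, latency_ms, healthy in probes:
        if healthy:  # failed probes carry no usable latency
            samples[(src, dst)].append(latency_ms)
    return {link: median(values) for link, values in samples.items()}


# Invented probe results between hypothetical locations.
probes = [
    ("NYC", "SIN", 240.0, True),
    ("NYC", "SIN", 250.0, True),
    ("NYC", "LHR", 70.0, True),
    ("NYC", "LHR", 0.0, False),  # failed probe, ignored
]
latency_map = build_latency_map(probes)
print(latency_map)
```

A link with no healthy probes never appears in the map, so an unhealthy tunnel is naturally excluded from any route built on top of it.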

Using this map, Cloudflare can construct dynamic routes for each customer based on where their traffic enters Cloudflare’s network and where they need to go. This allows us to find the optimal route for Jason Financial and allows us to always pick the fastest path.
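Given such a map, finding the fastest route between an entry point and a destination is a shortest-path problem. The sketch below uses Dijkstra’s algorithm, a standard textbook technique that we are substituting for illustration; the post does not say which algorithm Argo uses. The locations and latencies are invented.

```python
import heapq


def fastest_path(latency_map, source, target):
    """Return (total_latency_ms, path) for the lowest-latency route.

    latency_map: dict mapping (src, dst) to a one-way latency in ms.
    Returns None if the target cannot be reached from the source.
    """
    graph = {}
    for (src, dst), ms in latency_map.items():
        graph.setdefault(src, []).append((dst, ms))

    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, ms in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + ms, neighbor, path + [neighbor]))
    return None


# The direct link is slower than a detour through a third location.
latency_map = {
    ("NYC", "SIN"): 250.0,
    ("NYC", "LHR"): 70.0,
    ("LHR", "SIN"): 165.0,
}
print(fastest_path(latency_map, "NYC", "SIN"))  # (235.0, ['NYC', 'LHR', 'SIN'])
```

This is the "less direct, but faster" idea in miniature: the two-hop route wins because the sum of its link latencies is lower than the direct link.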

Packet-Level Latency Reductions

We’ve kind of buried the lede here! We’ve talked about how hard it is to optimize performance for IP traffic. The important bit: despite all these difficulties, Argo Smart Routing for Packets is able to provide a 10% average latency improvement worldwide in our internal testing!

Depending on your network topology, you may see latency reductions that are even higher!

How do I get Argo Smart Routing for Packets?

Argo Smart Routing for Packets is in closed beta and is available only for Magic Transit customers who have a Cloudflare Network Interconnect provisioned. If you are a Magic Transit customer interested in seeing the improved performance of Argo Smart Routing for Packets for yourself, reach out to your account team today! If you don’t have Magic Transit but want to take advantage of bigger performance gains while acquiring uncompromised levels of network security, begin your Magic Transit onboarding process today!

What’s next for Argo

Argo’s roadmap is simple: get ever faster, for any type of traffic.

Argo’s recent optimizations will help customers move data across the Internet at as close to the speed of light as possible. Internally, “how fast are we compared to the speed of light” is one of our engineering team’s key success metrics. We’re not done until we’re even.
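To show what a speed-of-light comparison might look like, the sketch below computes a theoretical minimum round trip time between two points and compares a measured RTT against it. The function names, the city coordinates, and the assumption that light in fiber travels at roughly two thirds of its vacuum speed are ours; this is not a description of Cloudflare’s internal metric.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_FACTOR = 2 / 3               # assumption: approximate speed in fiber
EARTH_RADIUS_KM = 6371.0           # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km, medium_factor=FIBER_FACTOR):
    """Lower bound on round trip time over the given distance."""
    return 2 * distance_km / (SPEED_OF_LIGHT_KM_S * medium_factor) * 1000


def light_speed_ratio(measured_rtt_ms, distance_km):
    """How many times slower than the fiber lower bound a measurement is."""
    return measured_rtt_ms / min_rtt_ms(distance_km)


# Approximate coordinates for New York and Singapore.
distance = great_circle_km(40.7128, -74.0060, 1.3521, 103.8198)
print(f"distance: {distance:.0f} km")
print(f"minimum RTT in fiber: {min_rtt_ms(distance):.1f} ms")
print(f"ratio for a hypothetical 240 ms RTT: {light_speed_ratio(240.0, distance):.2f}")
```

A ratio of 1.0 would mean traffic moves as fast as physics allows along the great-circle path; real paths are longer and pass through equipment that adds delay, so measured values sit above that.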