All posts by David Tuber

Unboxing the Last Mile: Introducing Last Mile Insights

Post Syndicated from David Tuber original

Unboxing the Last Mile: Introducing Last Mile Insights

Unboxing the Last Mile: Introducing Last Mile Insights

“The last 20% of the work requires 80% of the effort.” The Pareto Principle applies in many domains — nowhere more so on the Internet, however, than on the Last Mile. Last Mile networks are heterogeneous and independent of each other, but all of them need to be running to allow for everyone to use the Internet. They’re typically the responsibility of Internet Service Providers (ISPs). However, if you’re an organization running a mission-critical service on the Internet, not paying attention to Last Mile networks is in effect handing off responsibility for the uptime and performance of your service over to those ISPs.

Probably not the best idea.

When a customer puts a service on Cloudflare, part of our job is to offer a good experience across the whole Internet. We couldn’t do that without focusing on Last Mile networks. In particular, we’re focused on two things:

  • Cloudflare needs to have strong connectivity to Last Mile ISPs and needs to be as close as possible to every Internet-connected person on the planet.
  • Cloudflare needs good observability tools to know when something goes wrong, and needs to be able to surface that data to you so that you can be informed.

Today, we’re excited to announce Last Mile Insights, to help with this last problem in particular. Last Mile Insights allows customers to see where their end-users are having trouble connecting to their Cloudflare properties. Cloudflare can now show customers the traffic that failed to connect to Cloudflare, where it failed to connect, and why. If you’re an enterprise Cloudflare customer, you can sign up to join the beta in the Cloudflare Dashboard starting today: in the Analytics tab under Edge Reachability.

The Last Mile is historically the most complicated, least understood, and in some ways the most important part of operating a reliable network. We’re here to make it easier.

What is the Last Mile?

The Last Mile is the connection between your home and your ISP. When we talk about how users connect to content on the Internet, we typically do it like this:

Unboxing the Last Mile: Introducing Last Mile Insights

This is useful, but in reality, there are lots of things in the path between a user and anything on the Internet. Say that a user is connecting to a resource hosted behind Cloudflare. The path would look like this:

Unboxing the Last Mile: Introducing Last Mile Insights

Cloudflare is a global Anycast network that takes traffic from the Internet and proxies it to your origin. Because we function as a proxy, we think of the life of a request in two legs: before it reaches Cloudflare (end users to Cloudflare), and after it reaches Cloudflare (Cloudflare to origin). However, in Internet parlance, there are generally three legs: the First Mile tends to represent the path from an origin server to the data that you are requesting. The Middle Mile represents the path from an origin server to any proxies or other network hops. And finally, there is the final hop from the ISP to the user, which is known as the Last Mile.

Issues with the Last Mile are difficult to detect. If users are unable to reach something on the Internet, it is difficult for the resource to report that there was a problem. This is because if a user never reaches the resource, then the resource will never know something is wrong. Multiply that one problem across hundreds of thousands of Last Mile ISPs coming from a diverse set of regions, and it can be really hard for services to keep track of all the possible things that can go wrong on the Internet. The above graphic actually doesn’t really reflect the scope of the problem, so let’s revise it a bit more:

Unboxing the Last Mile: Introducing Last Mile Insights

It’s not an easy problem to keep on top of.

Brand New Last Mile Insights

Cloudflare is launching a closed beta of a brand new Last Mile reporting tool, Last Mile Insights. Last Mile Insights allows for customers to see where their end-users are having trouble connecting to their Cloudflare properties. Cloudflare can now show customers the traffic that failed to connect to Cloudflare, where it failed to connect, and why.

Unboxing the Last Mile: Introducing Last Mile Insights

Access to this data is useful to our customers because when things break, knowing what is broken and why — and then communicating with your end users — is vital. During issues, users and employees may create support/helpdesk tickets and social media posts to understand what’s going on. Knowing what is going on, and then communicating effectively about what the problem is and where it’s happening, can give end users confidence that issues are identified and being investigated… even if the issues are occurring on a third party network. Beyond that, understanding the root of the problem can help with mitigations and speed time to resolution.

How do Last Mile Insights work?

Our Last Mile monitoring tools use a combination of signals and machine learning to detect errors and performance regressions on the Last Mile.

Among the signals: Network Error Logging (NEL) is a browser-based reporting system that allows users’ browsers to report connection failures to an endpoint specified by the webpage that failed to load. When a user is able to connect to Cloudflare on a site with NEL enabled, Cloudflare will pass back two headers that indicate to the browser that they should report any network failures to an endpoint specified in the headers. The browser will then operate as usual, and if something happens that prevents the browser from being able to connect to the site, it will log the failure as a report and send it to the endpoint. This all happens in real time; the endpoint receives failure reports instantly after the browser experiences them.

The browser can send failure reports for many reasons: it could send reports because the TLS certificate was incorrect, the ISP or an upstream transit was having issues on the request path, the terminating server was overloaded and dropping requests, or a data center was unreachable. The W3C specification outlines specific buckets that the browser should break reports into and uploads those as reasons the browser could not connect. So the browser is literally telling the reporting endpoint why it was unable to reach the desired site. Here’s an example of a sample report a browser gives to Cloudflare’s endpoint:

Unboxing the Last Mile: Introducing Last Mile Insights

The report itself is a JSON blob that contains a lot of things, but the things we care about are when in the request the failure occurred (phase), why the request failed (tcp.timed_out), the ASN the request came over, and the metro area where the request came from. This information allows anyone looking at the reports to see where things are failing and why. Personal Identifiable Information is not captured in NEL reports. For more information, please see our KB article on NEL.

Many services can operate their own reporting endpoints and set their own headers indicating that users who connect to their site should upload these reports to the endpoint they specify. Cloudflare is also an operator of one such endpoint, and we’re excited to open up the data collected by us for customer use and visibility.  Let’s talk about a customer who used Last Mile Insights to help make a bad day on the Internet a little better.

Case Study: Canva

Canva is a Cloudflare customer that provides a design and collaboration platform hosted in the cloud. With more than 60 million users around the world, having constant access to Canva’s platform is critical. Last year, Canva users connecting through Cox Communications in San Diego started to experience connectivity difficulties. Around 50% of Canva’s users connecting via Cox Communications saw disconnects during that time period, and these users weren’t able to access Canva or Cloudflare at all. This wasn’t a Canva or Cloudflare outage, but rather, was caused by Cox routing traffic destined for Canva incorrectly, and causing errors for mutual Cox/Canva customers as a result.

Normally, this scenario would have taken hours to diagnose and even longer to mitigate. Canva would’ve seen a slight drop in traffic, but as the outage wasn’t on Canva’s side, it wouldn’t have flagged any alerts based on traffic drops. Canva engineers, in this case, would be notified by the users which would then be followed by a lengthy investigation to diagnose the problem.

Fortunately, Cloudflare has invested in monitoring systems to proactively identify issues exactly like these. Within minutes of the routing anomaly being introduced on Cox’s network, Canva was made aware of the issue via our monitoring, and a conversation with Cox was started to remediate the issue. Meanwhile, Canva could advise their users on the steps to fix it.

Cloudflare is excited to be offering our internal monitoring solution to our customers so that they can see what we see.

But providing insights into seeing where problems happen on the Last Mile is only part of the solution. In order to truly deliver a reliable, fast network, we also need to be as close to end users as possible.

Getting close to users

Getting close to end users is important for one reason: it minimizes the time spent on the Last Mile. These networks can be unreliable and slow. The best way to improve performance is to spend as little time on them as possible. And the only way to do that is to get close to our users. In order to get close to our users, Cloudflare is constantly expanding our presence into new cities and markets. We’ve just announced expansion into new markets and are adding even more new markets all the time to get as close to every network and every user as we can.

This is because not every network is the same. Some users may be clustered very close together in cities with high bandwidth, in others, this may not be the case. Because user populations are not homogeneous, each ISP operates their network differently to meet the needs of their users. Physical distance from where servers are matters a great deal, because nobody can outrun the speed of light. If you’re farther away from the content you want, it will take longer to reach it. But distance is not the only variable; bandwidth and speed will also vary, because networks are operated differently all over the world. But one thing we do know is that your network performance will also be impacted by how healthy your Last Mile network is.

Healthier networks perform better

A healthy network has no downtime, minimal congestion, and low packet loss.  These things all add latency. If you’re driving somewhere, street closures, traffic, and bad roads will prevent you from going as fast as possible to where you need to be. Healthy networks provide the best possible conditions for you to connect, and Last Mile performance is better because of it.  Consider three networks in the same country: ISP A, ISP B, and ISP C. These ISPs have similar distribution among their users. ISP A is healthy and is directly connected with Cloudflare. ISP B is healthy but is not directly connected to Cloudflare.  ISP C is an unhealthy network. Our data shows that Last Mile latencies for ISP C are significantly slower than Network A or B because the network quality of ISP C is worse.

Unboxing the Last Mile: Introducing Last Mile Insights

This box plot shows that the latencies to Cloudflare for ISP C are 360% higher than ISPs A or B.

We want all networks to be like Network A, but that’s not always the case, and it’s something Cloudflare can’t control. The only thing Cloudflare can do to mitigate performance problems like these is to limit how much time you spend on these networks.

Shrinking the Last Mile gives better performance

By placing data centers close to our users, we reduce the amount of time spent on these Last Mile networks, and the latency between end users and Cloudflare goes down. A great example of this is how bringing up new locations in Africa affected the latency for the Internet-connected population there.  Blue shows the latency before these locations were added, and red shows after:

Unboxing the Last Mile: Introducing Last Mile Insights

Our efforts globally have brought 95% of the Internet-connected population within 50ms of us:

Unboxing the Last Mile: Introducing Last Mile Insights

You will also notice that 80% of the Internet is within 30ms of us. The tail for Last Mile latencies is very long, and every data center we add helps bring that tail closer to great performance. As we expand into more locations and countries, more of the Internet will be even better connected.

But even when the Last Mile is shrunken down by our infrastructure expansions, networks can still have issues that are difficult to detect. Existing logging and monitoring solutions don’t provide a good way to see what the problem is. Cloudflare has built a sophisticated set of tools to identify issues with Last Mile networks outside our control, and help reduce time to resolution for this purpose, and it has already found problems on the Last Mile for our customers.

Cloudflare has unique performance and insight into Last Mile networking

Running an application on the Internet requires customers to look at the whole Internet. Many cloud services optimize latency starting at the first mile and work their way out, because it’s easier to optimize for things they can control. Because the Last Mile is controlled by hundreds or thousands of ISPs, it is difficult to influence how the Last Mile behaves.

Cloudflare is focused on closing performance gaps everywhere, including close to your users and employees. Last mile performance and reliability is critically important to delivering content, keeping employees productive, and all the other things the world depends on the Internet to do. If a Last Mile provider is having a problem, then users connecting to the Internet through them will have a bad day.

Cloudflare’s efforts to provide better Last Mile performance and visibility allow customers to rely on Cloudflare to optimize the Last Mile, making it one less thing they have to think about. Through Last Mile Insights and network expansion efforts — available today in the Cloudflare Dashboard,  in the Analytics tab under Edge Reachability — we want to provide you the ability to see what’s really happening on the Internet while knowing that Cloudflare is working on giving your users the best possible Internet experience.

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

Post Syndicated from David Tuber original

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

Cloudflare’s mission is to help build a better Internet for everyone. Building a better Internet means helping build more reliable and efficient services that everyone can use. To help realize this vision, we’re announcing the free distribution of two products, one old and one new:

  • Tiered Caching is now available to all customers for free. Tiered Caching reduces origin data transfer and improves performance, making web properties cheaper and faster to operate. Tiered Cache was previously a paid addition to Free, Pro, and Business plans as part of Argo.
  • Orpheus is now available to all customers for free. Orpheus routes around problems on the Internet to ensure that customer origin servers are reachable from everywhere, reducing the number of errors your visitors see.

Tiered Caching: improving website performance and economics for everyone

Tiered Cache uses the size of our network to reduce requests to customer origins by dramatically increasing cache hit ratios. With data centers around the world, Cloudflare caches content very close to end users, but if a piece of content is not in cache, the Cloudflare edge data centers must contact the origin server to receive the cacheable content. This can be slow and places load on an origin server compared to serving directly from cache.

Tiered Cache works by dividing Cloudflare’s data centers into a hierarchy of lower-tiers and upper-tiers. If content is not cached in lower-tier data centers (generally the ones closest to a visitor), the lower-tier must ask an upper-tier to see if it has the content. If the upper-tier does not have it, only the upper-tier can ask the origin for content. This practice improves bandwidth efficiency by limiting the number of data centers that can ask the origin for content, reduces origin load, and makes websites more cost-effective to operate.

Dividing data centers like this results in improved performance for visitors because distances and links traversed between Cloudflare data centers are generally shorter and faster than the links between data centers and origins. It also reduces load on origins, making web properties more economical to operate. Customers enabling Tiered Cache can achieve a 60% or greater reduction in their cache miss rate as compared to Cloudflare’s traditional CDN service.

Additionally, Tiered Cache concentrates connections to origin servers so they come from a small number of data centers rather than the full set of network locations. This results in fewer open connections using server resources.

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

Tiered Cache is simple to enable:

  • Log into your Cloudflare account.
  • Navigate to the Caching in the dashboard.
  • Under Caching, select Tiered Cache.
  • Enable Tiered Cache.

From there, customers will automatically be enrolled in Smart Tiered Cache Topology without needing to make any additional changes. Enterprise Customers can select from different prefab topologies or have a custom topology created for their unique needs.

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

Smart Tiered Cache dynamically selects the single best upper tier for each of your website’s origins with no configuration required. We will dynamically find the single best upper tier for an origin by using Cloudflare’s performance and routing data. Cloudflare collects latency data for each request to an origin. Using this latency data, we can determine how well any upper-tier data center is connected with an origin and can empirically select the best data center with the lowest latency to be the upper-tier for an origin.

Today, Smart Tiered Cache is being offered to ALL Cloudflare customers for free, in contrast to other CDNs who may charge exorbitant fees for similar or worse functionality. Current Argo customers will get additional benefits described here. We think that this is a foundational improvement to the performance and economics of running a website.

But what happens if an upper-tier can’t reach an origin?

Orpheus: solving origin reachability problems for everyone

Cloudflare is a reverse proxy that receives traffic from end users and proxies requests back to customer servers or origins. To be successful, Cloudflare needs to be reachable by end users while simultaneously being able to reach origins. With end users around the world, Cloudflare needs to be able to reach origins from multiple points around the world at the same time. This is easier said than done! The Internet is not homogenous, and diverse Cloudflare network locations do not necessarily take the same paths to a given customer origin at any given time. A customer origin may be reachable from some networks but not from others.

Cloudflare developed Argo to be the Waze of the Internet, allowing our network to react to changes in Internet traffic conditions and route around congestion and breakages in real-time, ensuring end users always have a good experience. Argo Smart Routing provides amazing performance and reliability improvements to our customers.

Enter Orpheus. Orpheus provides reachability benefits for customers by finding unreachable paths on the Internet in real time, and guiding traffic away from those paths, ensuring that Cloudflare will always be able to reach an origin no matter what is happening on the Internet.  

Today, we’re excited to announce that Orpheus is available to and being used by all our customers.

Fewer 522s

You may have seen this error before at one time or another.

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

This error indicates that a user was unable to reach content because Cloudflare couldn’t reach the origin. Because of the unpredictability of the Internet described above, users may see this error even when an origin is up and able to receive traffic.

So why do you see this error? The 522 error occurs when network instability causes traffic sent by Cloudflare to fail either before it reaches the origin, or on the way back from the origin to Cloudflare. This is the equivalent of either Cloudflare or your origin sending a request and never getting a response. Both sides think that they’re fine, but the network path between them is not reachable at all. This causes customer pain.

Orpheus solves that pain, ensuring that no matter where users are or where the origin is, an Internet application will always be reachable from Cloudflare.

How it works

Orpheus builds and provisions routes from Cloudflare to origins by analyzing data from users on every path from Cloudflare and ordering them on a per-data center level with the goal of eliminating connection errors and minimizing packet loss. If Orpheus detects errors on the current path from Cloudflare back to a customer origin, Orpheus will steer subsequent traffic from the impacted network path to the healthiest path available.

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

This is similar to how Argo works but with some key differences: Argo is always steering traffic down the fastest path, whereas Orpheus is reactionary and steers traffic down healthy (and not necessarily the fastest) paths when needed.

Improving origin reachability for customers

Let’s look at an example.

Barry has an origin hosted in WordPress in Chicago for his daughter’s band. This zone primarily sees traffic from three locations: the location closest to his daughter in Seattle, the location closest to him in Boston, and the location closest to his parents in Tampa, who check in on their granddaughter’s site daily for updates.

One day, a link between Tampa and the Chicago origin gets cut by a wandering backhoe. This means that Tampa loses some connectivity back to the Chicago origin. As a result, Barry’s parents start to see failures when connecting back to origin when connecting to the site. This reflects in origin reachability decreasing. Orpheus helps here by finding alternate paths for Barry’s parents, whether it’s through Boston, Seattle, or any location in between that isn’t impacted by the fiber cut seen in Tampa.

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

So even though there is packet loss between one of Cloudflare’s data centers and Barry’s origin, because there is a path through a different Cloudflare data center that doesn’t have loss, the traffic will still succeed because the traffic will go down the non-lossy path.

How much does Orpheus help my origin reachability?

In our rollout of Orpheus for customers, we observed that Orpheus improved Origin reachability by 23%, from 99.87% to 99.90%. Here is a chart showing the improvement Orpheus provides (lower is better):

Improving Origin Performance for Everyone with Orpheus and Tiered Cache

We measure this reachability improvement by measuring 522 rates for every data center-origin pair and then comparing traffic that traversed Orpheus routes with traffic that went directly back to origin. Orpheus was especially helpful at improving reachability for slightly lossy paths that could present small amounts of failure over a long period of time, whereas direct to origin would see those failures.

Note that we’ll never get this number to 0% because, with or without Orpheus, some origins really are unreachable because they are down!

Orpheus makes Cloudflare products better

Orpheus pairs well with some of our products that are already designed to provide highly available services on an uncertain Internet. Let’s go over the interactions between Orpheus and three of our products: Load Balancing, Cloudflare Network Interconnect, and Tiered Cache.

Load Balancing

Orpheus and Load Balancing go together to provide high reachability for every origin endpoint. Load balancing allows for automatic selection of endpoints based on health probes, ensuring that if an origin isn’t working, customers will still be available and operational. Orpheus finds reachable paths from Cloudflare to every origin. These two products in tandem provide a highly available and reachable experience for customers.

Cloudflare Network Interconnect

Orpheus and Cloudflare Network Interconnect (CNI) combine to always provide a highly reachable path, no matter where in the world you are. Consider Acme, a company who is connected to the Internet by only one provider that has a lot of outages. Orpheus will do its best to steer traffic around the lossy paths, but if there’s only one path back to the customer, Orpheus won’t be able to find a less-lossy path. Cloudflare Network Interconnect solves this problem by providing a path that is separate from the transit provider that any Cloudflare data center can access. CNI provides a viable path back to Acme’s origin that will allow Orpheus to engage from any data center in the world if loss occurs.

Shields for All

Orpheus and Tiered Cache can combine to build an adaptive shield around an origin that caches as much as possible while improving traffic back to origin. Tiered Cache topologies allow for customers to deflect much of their static traffic away from their origin to reduce load, and Orpheus helps ensure that any traffic that has to go back to the origin traverses over highly available links.

Improving origin performance for everyone

The Internet is a growing, ever-changing ecosystem. With the release of Orpheus and Tiered Cache for everyone, we’ve given you the ability to navigate whatever the Internet has in store to provide the best possible experience to your customers.

Argo 2.0: Smart Routing Learns New Tricks

Post Syndicated from David Tuber original

Argo 2.0: Smart Routing Learns New Tricks

Argo 2.0: Smart Routing Learns New Tricks

We launched Argo in 2017 to improve performance on the Internet. Argo uses real-time global network information to route around brownouts, cable cuts, packet loss, and other problems on the Internet. Argo makes the network that Cloudflare relies on—the Internet—faster, more reliable, and more secure on every hop around the world.

Without any complicated configuration, Argo is able to use real-time traffic data to pick the fastest path across the Internet, improving performance and delivering more satisfying experiences to your customers and users.

Today, Cloudflare is announcing several upgrades to Argo’s intelligent routing:

  • When it launched, Argo was entirely focused on the “middle mile,” speeding up connections from Cloudflare to our customers’ servers. Argo now delivers optimal routes from clients and users to Cloudflare, further reducing end-to-end latency while still providing the impressive edge to origin performance that Argo is known for. These last-mile improvements reduce end user round trip times by up to 40%.
  • We’re also adding support for accelerating pure IP workloads, allowing Magic Transit and Magic WAN customers to build IP networks to enjoy the performance benefits of Argo.

Starting today, all Free, Pro, and Business plan Argo customers will see improved performance with no additional configuration or charge. Enterprise customers have already enjoyed the last mile performance improvements described here for some time. Magic Transit and WAN customers can contact their account team to request Early Access to Argo Smart Routing for Packets.

What’s Argo?

Argo finds the best and fastest possible path for your traffic on the Internet. Every day, Cloudflare carries hundreds of billions of requests across our network and the Internet. Because our network, our customers, and their end users are well distributed globally, all of these requests flowing across our infrastructure paint a great picture of how different parts of the Internet are performing at any given time.

Just like Waze examines real data from real drivers to give you accurate, uncongested — and sometimes unorthodox — routes across town, Argo Smart Routing uses the timing data Cloudflare collects from each request to pick faster, more efficient routes across the Internet.

In practical terms, Cloudflare’s network is expansive in its reach. Some Internet links in a given region may be congested and cause poor performance (a literal traffic jam). By understanding this is happening and using alternative network locations and providers, Argo can put traffic on a less direct, but faster, route from its origin to its destination.

These benefits are not theoretical: enabling Argo Smart Routing shaves an average of 33% off HTTP time to first byte (TTFB).

One other thing we’re proud of: we’ve stayed super focused on making it easy to use. One click in the dashboard enables better, smarter routing, bringing the full weight of Cloudflare’s network, data, and engineering expertise to bear on making your traffic faster. Advanced analytics allow you to understand exactly how Argo is performing for you around the world.

You can read a lot more about how Argo works in our original launch blog post.

Even More Blazing Fast

We’ve continuously improved Argo since the day it was launched, making it faster, quicker to respond to changes on the Internet, and allowing more types of traffic to flow over smart routes.

Argo’s new performance optimizations improve last mile latencies and reduce time to first byte even further. Argo’s last mile optimizations can save up to 40% on last mile round trip time (RTT) with commensurate improvements to end-to-end latency.

Running benchmarks against an origin server in the central United States, with visitors coming from around the world, Argo delivered the following results:

Argo 2.0: Smart Routing Learns New Tricks

The Argo improvements on the last mile reduced overall time to first byte by 39%, and reduced end-to-end latencies by 5% overall:

Argo 2.0: Smart Routing Learns New Tricks

Faster, better caching

Argo customers don’t just see benefits to their dynamic traffic. Argo’s new found skills provide benefits for static traffic as well. Because Argo now finds the best path to Cloudflare, client TTFB for cache hits sees the same last mile benefit as dynamic traffic.

Getting access to faster Argo

The best part about all these improvements? They’re already deployed and enabled for all Argo customers! These optimizations have been live for Enterprise customers for some time and were enabled for Free, Pro, and Business plans this week.

Moving Down the Stack: Argo Smart Routing for Packets

Customers use Magic Transit and Magic WAN to create their own IP networks on top of Cloudflare’s network, with access to a full suite of network functions (firewalls, DDoS mitigation, and more) delivered as a service. This allows customers to build secure, private, global networks without the need to purchase specialized hardware. Now, Argo Smart Routing for Packets allows these customers to create these IP networks with the performance benefits of Argo.

Consider a fictional gaming company, Golden Fleece Games. Golden Fleece deployed Magic Transit to mitigate attacks by malicious actors on the Internet. They want to be able to provide a quality game to their users while staying up. However, they also need their service to be as fast as possible. If their game sees additional latency, then users won’t play it, and even if their service is technically up, the increased latency will show a decrease in users. For Golden Fleece, being slow is just as bad as being down.

Finance customers also have similar requirements for low latency, high security scenarios. Consider Jason Financial, a fictional Magic Transit customer using Packet Smart Routing. Jason Financial employees connect to Cloudflare in New York, and their requests are routed to their data center which is connected to Cloudflare through a Cloudflare Network Interconnect attached to Cloudflare in Singapore. For Jason Financial, reducing latency is extraordinarily important: if their network is slow, then the latency penalties they incur can literally cost them millions of dollars due to how fast the stock market moves. Jason wants Magic Transit and other Cloudflare One products to secure their network and prevent attacks, but improving performance is important for them as well.

Argo’s Smart Routing for Packets provides these customers with the security they need at speeds faster than before. Now, customers can get the best of both worlds: security and performance. Now, let’s talk a bit about how it works.

A bird’s eye view of the Internet

Argo Smart Routing for Packets picks the fastest possible path between two points. But how does Argo know that the chosen route is the fastest? As with all Argo products, the answer comes by analyzing a wealth of network data already available on the Cloudflare edge. In Argo for HTTP or Argo for TCP, Cloudflare is able to use existing timing data from traffic that’s already being sent over our edge to optimize routes. This allows us to improve which paths are taken as traffic changes and congestion on the Internet happens. However, to build Smart Routing for Packets, the game changed, and we needed to develop a new approach to collect latency data at the IP layer.

Let’s go back to the Jason Financial case. Before, Argo would understand that the number of paths that are available from Cloudflare’s data centers back to Jason’s data center is proportional to the number of data centers Cloudflare has multiplied by the number of distinct interconnections between each data center. By looking at the traffic to Singapore, Cloudflare can use existing Layer 4 traffic and network analytics to determine the best path. But Layer 4 is not Layer 3, and when you move down the stack, you lose some insight into things like round trip time (RTT), and other metrics that compose time to first byte because that data is only produced at higher levels of the application stack. It can become harder to figure out what the best path actually is.

Optimizing performance at the IP layer can be more difficult than at higher layers. This is because protocol and application layers have additional headers and stateful protocols that allow for further optimization. For example, connection reuse is a performance improvement that can only be realized at higher layers of the stack because HTTP requests can reuse existing TCP connections. IP layers don’t have the concept of connections or requests at all: it’s just packets flowing over the wire.

To help bridge the gap, Cloudflare makes use of an existing data source that already exists for every Magic Transit customer today: health check probes. Every Magic Transit customer leverages existing health check probes from every single Cloudflare data center back to the customer origin. These probes are used to determine tunnel health for Magic Transit, so that Cloudflare knows which paths back to origin are healthy. These probes contain a variety of information that can also be used to improve performance as well. By examining health check probes and adding them to existing Layer 4 data, Cloudflare can get a better understanding of one-way latencies and can construct a map that allows us to see all the interconnected data centers and how fast they are to each other. Once this customer gets a Cloudflare Network Interconnect, Argo can use the data center-to-data center probes to create an alternate path for the customer that’s different from the public Internet.

Argo 2.0: Smart Routing Learns New Tricks

Using this map, Cloudflare can construct dynamic routes for each customer based on where their traffic enters Cloudflare’s network and where they need to go. This allows us to find the optimal route for Jason Financial and allows us to always pick the fastest path.

Packet-Level Latency Reductions

We’ve kind of buried the lede here! We’ve talked about how hard it is to optimize performance for IP traffic. The important bit: despite all these difficulties, Argo Smart Routing for Packets is able to provide a 10% average latency improvement worldwide in our internal testing!

Argo 2.0: Smart Routing Learns New Tricks

Depending on your network topology, you may see latency reductions that are even higher!

How do I get Argo Smart Routing for Packets?

Argo Smart Routing for Packets is in closed beta and is available only for Magic Transit customers who have a Cloudflare Network Interconnect provisioned. If you are a Magic Transit customer interested in seeing the improved performance of Argo Smart Routing for Packets for yourself, reach out to your account team today! If you don’t have Magic Transit but want to take advantage of bigger performance gains while acquiring uncompromised levels of network security, begin your Magic Transit onboarding process today!

What’s next for Argo

Argo’s roadmap is simple: get ever faster, for any type of traffic.

Argo’s recent optimizations will help customers move data across the Internet at as close to the speed of light as possible. Internally, “how fast are we compared to the speed of light” is one of our engineering team’s key success metrics. We’re not done until we’re even.

Introducing Cloudflare Network Interconnect

Post Syndicated from David Tuber original

Introducing Cloudflare Network Interconnect

Introducing Cloudflare Network Interconnect

Today we’re excited to announce Cloudflare Network Interconnect (CNI). CNI allows our customers to interconnect branch and HQ locations directly with Cloudflare wherever they are, bringing Cloudflare’s full suite of network functions to their physical network edge. Using CNI to interconnect provides security, reliability, and performance benefits vs. using the public Internet to connect to Cloudflare. And because of Cloudflare’s global network reach, connecting to our network is straightforward no matter where on the planet your infrastructure and employees are.

At its most basic level, an interconnect is a link between two networks. Today, we’re offering customers the following options to interconnect with Cloudflare’s network:

  • Via a private network interconnect (PNI). A physical cable (or a virtual “pseudo-wire”; more on that later) that connects two networks.
  • Over an Internet Exchange (IX). A common switch fabric where multiple Internet Service Providers (ISPs) and Internet networks can interconnect with each other.

To use a real world analogy: Cloudflare over the years has built a network of highways across the Internet to handle all our customers’ traffic. We’re now providing dedicated on-ramps for our customers’ on-prem networks to get onto those highways.

Why interconnect with Cloudflare?

CNI provides more reliable, faster, and more private connectivity between your infrastructure and Cloudflare’s. This delivers benefits across our product suite. Here are some examples of specific products and how you can combine them with CNI:

  • Cloudflare Access: Cloudflare Access replaces corporate VPNs with Cloudflare’s network. Instead of placing internal tools on a private network, teams deploy them in any environment, including hybrid or multi-cloud models, and secure them consistently with Cloudflare’s network. CNI allows you to bring your own MPLS network to meet ours, allowing your employees to connect to your network securely and quickly no matter where they are.
  • CDN: Cloudflare’s CDN places content closer to visitors, improving site speed while minimizing origin load. CNI improves cache fill performance and reduces costs.
  • Magic Transit: Magic Transit protects datacenter and branch networks from unwanted attack and malicious traffic. Pairing Magic Transit with CNI decreases jitter and drives throughput improvements, and further hardens infrastructure from attack.
  • Cloudflare Workers: Workers is Cloudflare’s serverless compute platform. Integrating with CNI provides a secure connection to serverless cloud compute that does not traverse the public Internet, allowing customers to use Cloudflare’s unique set of Workers services with tighter network performance tolerances.

Let’s talk more about how CNI delivers these benefits.

Improving performance through interconnection

CNI is a great way to boost performance for many existing Cloudflare products. By utilizing CNI and setting up interconnection with Cloudflare wherever a customer’s origin infrastructure is, customers can get increased performance and security at lower cost than using public transit providers.

CNI makes things faster

As an example of the performance improvements network interconnects can deliver for Cloudflare customers, consider an HTTP application workload which flows through Cloudflare’s CDN and WAF. Many of our customers rely on our CDN to make their HTTP applications more responsive.

Cloudflare caches content very close to end users to provide the best performance possible. But, if content is not in cache, Cloudflare edge PoPs must contact the origin server to retrieve cacheable content. This can be slow, and places more load on an origin server compared to serving directly from cache.

With CNI, these origin pulls can be completed over a dedicated link, improving throughput and reducing overall time needed for origin pulls. Using Argo Tiered Cache, customers can manage tiered cache topologies and specify upstream cache tiers that correspond with locations where network interconnects are in place. Using Tiered Cache in this fashion lowers origin loads and increases cache hit rates, thereby improving performance and reducing origin infrastructures costs.

Here’s anonymized and sampled data from a real Cloudflare customer who recently provisioned interconnections between our network and theirs to further improve performance. Heavy users of our CDN, they were able to shave off precious milliseconds from their origin round trip time (RTT) by adding PNIs in multiple locations.

Introducing Cloudflare Network Interconnect

As an example, their 90th percentile round trip time in Warsaw, Poland decreased by 6.5ms as a result of provisioning a private network interconnect (from 7.5ms to 1ms), which is a performance win of 87%!  The jitter (variation in delay in received packets) on the link decreased from 82.9 to 0.3, which speaks to the dedicated, reliable nature of the link. CNI helps deliver reliable and performant network connectivity to your customers and employees.

Introducing Cloudflare Network Interconnect

Enhanced security through private connectivity

Customers with large on-premise networks want to move to the cloud: it’s cheaper, less hassle, and less overhead and maintenance.  However, customers want to also preserve their existing security and threat models.

Traditionally, CIOs trying to connect their IP networks to the Internet do so in two steps:

  1. Source connectivity to the Internet from transit providers (ISPs).
  2. Purchase, operate, and maintain network function specific hardware appliances. Think hardware load balancers, firewalls, DDoS mitigation equipment, WAN optimization, and more.

CNI allows CIOs to provision security services on Cloudflare and connect their existing networks to Cloudflare in a way that bypasses the public Internet.  Because Cloudflare integrates with on-premise networks and the cloud, customers can enforce security policies across both networks and create a consistent, secure boundary.

CNI increases cloud and network security by providing a private, dedicated link to the Cloudflare network. Since this link is reserved exclusively for the customer that provisions it, the customer’s traffic is isolated and private.

CNI + Magic Transit: Removing public Internet exposure

To use a product-specific example: through CNI’s integration with Magic Transit, customers can take advantage of private connectivity to minimize exposure of their network to the public Internet.

Magic Transit attracts customers’ IP traffic to our data centers by advertising their IP addresses from our edge via BGP. When traffic arrives, it’s filtered and sent along to customers’ data centers. Before CNI, all Magic Transit traffic was sent from Cloudflare to customers via Generic Routing Encapsulation (GRE) tunnels over the Internet. Because GRE endpoints are publicly routable, there is some risk these endpoints could be discovered and attacked, bypassing Cloudflare’s DDoS mitigation and security tools.

Using CNI removes this exposure to the Internet. Advantages of using CNI with Magic Transit include:

  • Reduced threat exposure. Although there are many steps companies can take to increase network security, some risk-sensitive organizations prefer not to expose endpoints to the public Internet at all. CNI allows Cloudflare to absorb that risk and forward only clean traffic (via Magic Transit) through a truly private interface.
  • Increased reliability. Traffic traveling over the public Internet is subject to factors outside of your control, including latency and packet loss on intermediate networks. Removing steps between Cloudflare’s network and yours means that after Magic Transit processes traffic, it’s forwarded directly and reliably to your network.
  • Simplified configuration. Soon, Magic Transit + CNI customers will have the option to skip making MSS (maximum segment size) changes when onboarding, a step that’s required for GRE-over-Internet and can be challenging for customers who need to consider their downstream customers’ MSS as well (eg. service providers).

Example deployment: Penguin Corp uses Cloudflare for Teams, Magic Transit, and CNI to protect branch and core networks, and employees.

Imagine Penguin Corp, a hypothetical company, has a fully connected private MPLS network.  Maintaining their network is difficult and they have a dedicated team of network engineers to do this.  They are currently paying a lot of money to run their own private cloud. To minimize costs, they limit their network egress points to two worldwide.  This creates a major performance problem for their users, whose bits have to travel a long way to accomplish basic tasks while still traversing Penguin’s network boundary.

Introducing Cloudflare Network Interconnect

SASE (Secure Access Service Edge) models look attractive to them, because they can, in theory, move away from their traditional MPLS network and move towards the cloud.  SASE deployments provide firewall, DDoS mitigation, and encryption services at the network edge, and bring security as a service to any cloud deployment, as seen in the diagram below:

Introducing Cloudflare Network Interconnect

CNI allows Penguin to use Cloudflare as their true network edge, hermetically sealing their branch office locations and datacenters from the Internet. Penguin can adapt to a SASE-like model while keeping exposure to the public Internet at zero. Penguin establishes PNIs with Cloudflare from their branch office in San Jose to Cloudflare’s San Jose location to take advantage of Cloudflare for Teams, and from their core colocation facility in Austin to Cloudflare’s Dallas location to use Magic Transit to protect their core networks.

Like Magic Transit, Cloudflare for Teams replaces traditional security hardware on-premise with Cloudflare’s global network. Customers who relied on VPN appliances to reach internal applications can instead connect securely through Cloudflare Access. Organizations maintaining physical web gateway boxes can send Internet-bound traffic to Cloudflare Gateway for filtering and logging.

Cloudflare for Teams services run in every Cloudflare data center, bringing filtering and authentication closer to your users and locations to avoid compromising performance. CNI improves that even further with a direct connection from your offices to Cloudflare. With a simple configuration change, all branch traffic reaches Cloudflare’s edge where Cloudflare for Teams policies can be applied. The link improves speed and reliability for users and replaces the need to backhaul traffic to centralized filtering appliances.

Once interconnected this way, Penguin’s network and employees realize two benefits:

  1. They get to use Cloudflare’s full set of security services without having to provision expensive and centralized physical or virtualized network appliances.
  2. Their security and performance services are running across Cloudflare’s global network in over 200 cities. This brings performance and usability improvements for users by putting security functions closer to them.
Introducing Cloudflare Network Interconnect

Scalable, global, and flexible interconnection options

CNI offers a big benefit to customers because it allows them to take advantage of our global footprint spanning 200+ cities: their branch office and datacenter infrastructure can connect to Cloudflare wherever they are.

This matters for two reasons: our globally distributed network makes it easier to interconnect locally, no matter where a customer’s branches and core infrastructure is, and allows for a globally distributed workforce to interact with our edge network with low latency and improved performance.

Customers don’t have to worry about securely expanding their network footprint: that’s our job.

To this point, global companies need to interconnect at many points around the world. Cloudflare Network Interconnect is priced for global network scale: Cloudflare doesn’t charge anything for enterprise customers to provision CNI. Customers may need to pay for access to an interconnection platform or a datacenter cross-connect. We’ll work with you and any other parties involved to make the ordering and provisioning process as smooth as possible.

In other words, CNI’s pricing is designed to accommodate complicated enterprise network topologies and modern IT budgets.

How to interconnect

Customers can interconnect with Cloudflare in one of three ways: over a private network interconnect (PNI), over an IX, or through one of our interconnection platform partners. We have worked closely with our global partners to meet our customers where they are and how they want.

Private Network Interconnects

Private Network Interconnects are available at any of our listed private peering facilities. Getting a physical connection to Cloudflare is easy: specify where you want to connect, port speeds, and target VLANs. From there, we’ll authorize it, you’ll place the order, and let us do the rest.  Customers should choose PNI as their connectivity option if they want higher throughput than a virtual connection or connection over an IX, or want to eliminate as many intermediaries from an interconnect as possible.

Internet Exchanges

Customers who want to use existing Internet Exchanges can interconnect with us at any of the 235+ Internet Exchanges we participate in. To connect with Cloudflare via an Internet Exchange, follow the IX’s instructions to connect, and Cloudflare will spin up our side of the connection.  Customers should choose Internet Exchanges as their connectivity option if they are either already peered at an IX, or they want to interconnect in a place where an interconnection platform isn’t present.

Interconnection Platform Partners

Cloudflare is proud to be partnering with Equinix, Megaport, PCCW ConsoleConnect, PacketFabric, and Zayo to provide you with easy ways to virtually connect with us in any of the partner-supported locations. Customers should choose to connect with an interconnection platform if they are already using these providers or want a quick and easy way to onboard onto a secure cloud experience.

Introducing Cloudflare Network Interconnect

If you’re interested in learning more, please see this blog post about all the different ways you can interconnect. For all of the interconnect methodologies described above, the BGP session establishment and IP routing are the same. The only thing that is different is the physical way in which we interconnect with other networks.

How do I find the best places to interconnect?

Our product page for CNI includes tools to better understand the right places for your network to interconnect with ours.  Customers can use this data to help figure out the optimal place to interconnect to have the most connectivity with other cloud providers and other ISPs in general.

What’s the difference between CNI and peering?

Technically, peering and CNI use similar mechanisms and technical implementations behind the scenes.

We have had an open peering policy for years with any network and will continue to abide by that policy: it allows us to help build a better Internet for everyone by interconnecting networks together, making the Internet more reliable. Traditional networks use interconnect/peering to drive better performance for their customers and connectivity while driving down costs. With CNI, we are opening up our infrastructure to extend the same benefits to our customers as well.

How do I learn more?

CNI provides customers with better performance, reliability, scalability, and security than using the public Internet. A customer can interconnect with Cloudflare in any of our physical locations today, getting dedicated links to Cloudflare that deliver security benefits and more stable latency, jitter, and available bandwidth through each interconnection point.

Contact our enterprise sales team about adding Cloudflare Network Interconnect to your existing offerings.