Tag Archives: Speed

Race ahead with Cloudflare Pages build caching

2023-09-28 Anni Wang

Post Syndicated from Anni Wang original http://blog.cloudflare.com/race-ahead-with-build-caching/

Race ahead with Cloudflare Pages build caching

Today, we are thrilled to release a beta of Cloudflare Pages support for build caching! With build caching, we are offering a supercharged Pages experience by helping you cache parts of your project to save time on subsequent builds.

For developers, time is not just money – it’s innovation and progress. When every second counts in crunch time before a new launch, the “need for speed” becomes critical. With Cloudflare Pages’ built-in continuous integration and continuous deployment (CI/CD), developers count on us to drive fast. We’ve already taken great strides in making sure we’re enabling quick development iterations for our users by making solid improvements on the stability and efficiency of our build infrastructure. But we always knew there was more to our build story.

Quick pit stops

Build times can feel like a developer's equivalent of a time-out, a forced pause in the creative process—the inevitable pit stop in a high-speed formula race.

Long build times not only breaks the flow of individual developers, but it can also create a ripple effect across the team. It can slow down iterations and push back deployments. In the fast-paced world of CI/CD, these delays can drastically impact productivity and the delivery of products.

We want to empower developers to win the race, miles ahead of competition.

Mechanics of build caching

At its core, build caching is a mechanism that stores artifacts of a build, allowing subsequent builds to reuse these artifacts rather than recomputing them from scratch. By leveraging the cached results, build times can be significantly reduced, leading to a more efficient build process.

Previously, when you initiated a build, the Pages CI system would generate every step of the build process, even if most parts of the codebase remain unchanged between builds. This is the equivalent to changing out every single part of the car during a pit stop, irrespective of if anything needs replacing.

Build caching refines this process. Now, the Pages build system will detect if cached artifacts can be leveraged, restore the artifacts, then focus on only computing the modified sections of the code. In essence, build caching acts like an experienced pit crew, smartly skipping unnecessary steps and focusing only on what's essential to get you back in the race faster.

What are we caching?

It boils down to two components: dependencies and build output.

The Pages build system supports dependency caching for select package managers and build output caching for select frameworks. Check out our documentation for more information on what’s currently supported and what’s coming up.

Let’s take a closer look at what exactly we are caching.

Dependencies: upon initiating a build, the Pages CI system checks for cached artifacts from previous builds. If it identifies a cache hit for dependencies, it restores from cache to speed up dependency installation.

Build output: if a cache hit for build output is identified, Pages will only build the changed assets. This approach enables the long awaited incremental builds for supported JavaScript frameworks.

Ready, set … go!

Build caching is now in beta, and ready for you to test drive!

In this release, the feature will support the node-based package managers npm, yarn, pnpm, as well as Bun. We’ve also ensured compatibility with the most popular frameworks that provide native incremental building support: Gatsby.js, Next.js and Astro – and more to come!

For you as a Pages user, interacting with build caching will be seamless. If you are working with an existing project, simply navigate to your project’s settings to toggle on Build Cache.

When you push a code change and initiate a build using Pages CI, build caching will kick-start and do its magic in the background.

“Cache” us on Discord

Have questions? Join us on our Discord Server [link]. We will be hosting an “Ask Us Anything” session on October 2nd where you can chat live with members of our team! Your feedback on this beta is invaluable to us, so after testing out build caching, don't hesitate to share your experiences! Happy building!

Cloudflare Fonts: enhancing website font privacy and speed

2023-09-25 Matt Bullock

Post Syndicated from Matt Bullock original http://blog.cloudflare.com/cloudflare-fonts-enhancing-website-privacy-speed/

Cloudflare Fonts: enhancing website font privacy and speed

We are thrilled to introduce Cloudflare Fonts! In the coming weeks sites that use Google Fonts will be able to effortlessly load their fonts from the site’s own domain rather than from Google. All at a click of a button. This enhances both privacy and performance. It enhances users' privacy by eliminating the need to load fonts from Google’s third-party servers. It boosts a site's performance by bringing fonts closer to end users, reducing the time spent on DNS lookups and TLS connections.

Sites that currently use Google Fonts will not need to self-host fonts or make complex code changes to benefit – Cloudflare Fonts streamlines the entire process, making it a breeze.

Fonts and privacy

When you load fonts from Google, your website initiates a data exchange with Google's servers. This means that your visitors' browsers send requests directly to Google. Consequently, Google has the potential to accumulate a range of data, including IP addresses, user agents (formatted descriptions of the browser and operating system), the referer (the page on which the Google font is to be displayed) and how often each IP makes requests to Google. While Google states that they do not use this data for targeted advertising or set cookies, any time you can prevent sharing your end user’s personal data unnecessarily is a win for privacy.

With Cloudflare Fonts, when you serve fonts directly from your own domain. This means no font requests are sent to third-party domains like Google, which some privacy regulators have found to be a problem in the past. Our pro-privacy approach means your end user’s IP address and other data are not sent to another domain. All that information stays within your control, within your domain. In addition, because Cloudflare Fonts eliminates data transmission to third-party servers like Google's, this can enhance your ability to comply with any potential data localization requirements.

Faster Google Font delivery through Cloudflare

Now that we have established that Cloudflare Fonts can improve your privacy, let's flip to the other side of the coin – how Cloudflare Fonts will improve your performance.

To do this, we first need to delve into how Google Fonts affects your website's performance. Subsequently, we'll explore how Cloudflare Fonts addresses and rectifies these performance challenges.

Google Fonts is a fantastic resource that offers website owners a range of royalty-free fonts for website usage. When you decide on the fonts you would like to incorporate, it’s super easy to integrate. You just add a snippet of HTML to your site. You then add styles to apply these fonts to various parts of your page:

<link href="https://fonts.googleapis.com/css?family=Open+Sans|Roboto+Slab" rel="stylesheet">
<style>
  body {
    font-family: 'Open Sans', sans-serif;
  }
  h1 {
    font-family: 'Roboto Slab', serif;
  }
</style>

But this ease of use comes with a performance penalty.

Upon loading your webpage, your visitors' browser fetches the CSS file as soon as the HTML starts to be parsed. Then, when the browser starts rendering the page and identifies the need for fonts in different text sections, it requests the required font files.

This is where the performance problem arises. Google Fonts employs a two-domain system: the CSS resides on one domain – fonts.googleapis.com – while the font files reside on another domain – fonts.gstatic.com.

This separation results in a minimum of four round trips to the third-party servers for each resource request. These round trips are DNS lookup, socket connection establishment, TLS negotiation (for HTTPS), and the final round trip for the actual resource request. Ultimately, getting a font from Google servers to a browser requires eight round trips.

Users can see this. If they are using Google Fonts they can open their network tab and filter for these Google domains.

You can visually see the impact of the extra DNS request and TLS connection that these requests add to your website experience. For example on my WordPress site that natively uses Google Fonts as part of the theme adds an extra ~150ms.

Fast fonts

Cloudflare Fonts streamlines this process, by reducing the number of round trips from eight to one. Two sets of DNS lookups, socket connections and TLS negotiations to third-parties are no longer required because there is no longer a third-party server involved in serving the CSS or the fonts. The only round trip involves serving the font files directly from the same domain where the HTML is hosted. This approach offers an additional advantage: it allows fonts to be transmitted over the same HTTP/2 or HTTP/3 connection as other page resources, benefiting from proper prioritization and preventing bandwidth contention.

The eagle-eyed amongst you might be thinking “Surely it is still two round trips – what about the CSS request?”. Well, with Cloudflare Fonts, we have also removed the need for a separate CSS request. This means there really is only one round-trip – fetching the font itself.

To achieve both the home-routing of font requests and the removal of the CSS request, we rewrite the HTML as it passes through Cloudflare’s global network. The CSS response is embedded, and font URL transformations are performed within the embedded CSS.

These transformations adjust the font URLs to align with the same domain as the HTML content. These modified responses seamlessly pass through Cloudflare's caching infrastructure, where they are automatically cached for a substantial performance boost. In the event of any cache misses, we use Fontsource and NPM to load these fonts and cache them within the Cloudflare infrastructure. This approach ensures that there's no inadvertent data exposure to Google's infrastructure, maintaining both performance and data privacy.

With Cloudflare Fonts enabled, you are able to see within your Network Tab that font files are now loaded from your own hostname from the /cf-fonts path and served from Cloudflare’s closest cache to the user, as indicated by the cf-cache-status: HIT.

Additionally, you can notice that the timings section in the browser no longer needs an extra DNS lookup for the hostname or the setup of a TLS connection. This happens because the content is served from your hostname, and the browser has already cached the DNS response and has an open TLS connection.

Finally, you can see the real-world performance benefits of Cloudflare Fonts. We conducted synthetic Google Lighthouse tests before enabling Cloudflare Fonts on a straightforward page that displays text. First Contentful Paint (FCP), which represents the time it takes for the first content element to appear on the page, was measured at 0.9 seconds in the Google fonts tests. After enabling Cloudflare Fonts, the First Contentful Paint (FCP) was reduced to 0.3 seconds, and our overall Lighthouse performance score improved from 98 to a perfect 100 out of 100.

Making Cloudflare Fonts fast with ROFL

In order to make Cloudflare Fonts this performant, we needed to make blazing-fast HTML alterations as responses stream through Cloudflare’s network. This has been made possible by leveraging one of Cloudflare’s more recent technologies.

Earlier this year, we finished rewriting one of Cloudflare's oldest components, which played a crucial role in dynamically altering HTML content. But as described in this blog post, a new solution was required to replace the old – A memory-safe solution, able to scale to Cloudflare’s ever-increasing load.

This new module is known as ROFL (Response Overseer for FL). It now powers various Cloudflare products that need to alter HTML as it streams, such as Email Obfuscation, Rocket Loader, and HTML Minification.

ROFL was developed entirely in Rust. This decision was driven by Rust's memory safety, performance, and security. The memory-safety features of Rust are indispensable to ensure airtight protection against memory leaks while we process a staggering volume of requests, measuring in the millions per second. Rust's compiled nature allows us to finely optimize our code for specific hardware configurations, delivering impressive performance gains compared to interpreted languages.

ROFL paved the way for the development of Cloudflare Fonts. The performance of ROFL allows us to rewrite HTML on-the-fly and modify the Google Fonts links quickly, safely and efficiently. This speed helps us reduce any additional latency added by processing the HTML file and improve the performance of your website.

Unlock the power of Cloudflare Fonts today! 🚀

Cloudflare Fonts will be available to all Cloudflare customers in October. If you're using Google Fonts, you will be able to supercharge your site's privacy and speed. By enabling this feature, you can seamlessly enhance your website's performance while safeguarding your user’s privacy.

How we scaled and protected Eurovision 2023 voting with Pages and Turnstile

2023-06-23 Dirk-Jan van Helmond

Post Syndicated from Dirk-Jan van Helmond original http://blog.cloudflare.com/how-cloudflare-scaled-and-protected-eurovision-2023-voting/

How we scaled and protected Eurovision 2023 voting with Pages and Turnstile

2023 was the first year that non-participating countries could vote for their favorites during the Eurovision Song Contest, adding millions of additional viewers and voters to an already impressive 162 million tuning in from the participating countries. It became a truly global event with a potential for disruption from multiple sources. To prepare for anything, Cloudflare helped scale and protect the voting application, used by millions of dedicated fans around the world to choose the winner.

In this blog we will cover how once.net built their platform based.io to monitor, manage and scale the Eurovision voting application to handle all traffic using many Cloudflare services. The speed with which DNS changes made through the Cloudflare API propagate globally allowed them to scale their backend within seconds. At the same time, Cloudflare Pages was ready to serve any amount of traffic to the voting landing page so fans didn’t miss a beat. And to cap it off, by combining Cloudflare CDN, DDoS protection, WAF, and Turnstile, they made sure that attackers didn’t steal any of the limelight.

The unsung heroes

Based.io is a resilient live data platform built by the once.net team, with the capability to scale up to 400 million concurrent connected users. It’s built from the ground up for speed and performance, consisting of an observable real time graph database, networking layer, cloud functions, analytics and infrastructure orchestration. Since all system information, traffic analysis and disruptions are monitored in real time, it makes the platform instantly responsive to variable demand, which enables real time scaling of your infrastructure during spikes, outages and attacks.

Although the based.io platform on its own is currently in closed beta, it is already serving a few flagship customers in production assisted by the software and services of the once.net team. One such customer is Tally, a platform used by multiple broadcasters in Europe to add live interaction to traditional television. Over 100 live shows have been performed using the platform. Another is Airhub, a startup that handles and logs automatic drone flights. And of course the star of this blog post, the Eurovision Song Contest.

Setting the stage

The Eurovision Song Contest is one of the world’s most popular broadcasted contests, and this year it reached 162 million people across 38 broadcasting countries. In addition, on TikTok the three live shows were viewed 4.8 million times, while 7.6 million people watched the Grand Final live on YouTube. With such an audience, it is no surprise that Cloudflare sees the impact of it on the Internet. Last year, we wrote a blog post where we showed lower than average traffic during, and higher than average traffic after the grand final. This year, the traffic from participating countries showed an even more remarkable surge:

Such large amounts of traffic are nothing new to the Eurovision Song Contest. Eurovision has relied on Cloudflare’s services for over a decade now and Cloudflare has helped to protect Eurovision.tv and improve its performance through noticeable faster load time to visitors from all corners of the world. Year after year, the team of Eurovision continued to use our services more, discovering additional features to improve performance and reliability further, with increasingly fine-grained control over their traffic flows. Eurovision.tv uses Page Rules to cache additional content on Cloudflare’s edge, speeding up delivery without sacrificing up-to-the-minute updates during the global event. Finally, to protect their backend and content management system, the team has placed their admin portals behind Cloudflare Zero Trust to delegate responsibilities down to individual levels.

Since then the contest itself has also evolved – sometimes by choice, sometimes by force. During the COVID-19 pandemic in-person cheering became impossible for many people due to a reduced live audience, resulting in the Eurovision Song Contest asking once.net to build a new iOS and Android application in which fans could cheer virtually. The feature was an instant hit, and it was clear that it would become part of this year’s contest as well.

In 2023, once.net was also asked to handle the paid voting from the regions where phone and SMS voting was not possible. It was the first time that Eurovision allowed voting online. The challenge that had to be overcome was the extreme peak demand on the platform when the show was live, and especially when the voting window started.

Complicating it further, was the fact that during last year’s show, there had been a large number of targeted and coordinated attacks.

To prepare for these spikes in demand and determined adversaries, once.net needed a platform that isn’t only resilient and highly scalable, but could also act as a mitigation layer in front of it. once.net selected Cloudflare for this functionality and integrated Cloudflare deeply with its real-time monitoring and management platform. To understand how and why, it’s essential to understand based.io underlying architecture.

The based.io platform

Instead of relying on network or HTTP load balancers, based.io uses a client-side service discovery pattern, selecting the most suitable server to connect to and leveraging Cloudflare's fast cache propagation infrastructure to handle spikes in traffic (both malicious and benign).

First, each server continuously registers a unique access key that has an expiration of 15 seconds, which must be used when a client connects to the server. In addition, the backend servers register their health (such as active connections, CPU, memory usage, requests per second, etc.) to the service registry every 300 milliseconds. Clients then request the optimal server URL and associated access key from a central discovery registry and proceed to establish a long lived connection with that server. When a server gets overloaded it will disconnect a certain amount of clients and those clients will go through the discovery process again.

The central discovery registry would normally be a huge bottleneck and attack target. based.io resolves this by putting the registry behind Cloudflare's global network with a cache time of three seconds. Since the system relies on real-time stats to distribute load and uses short lived access keys, it is crucial that the cache updates fast and reliably. This is where Cloudflare’s infrastructure proved its worth, both due to the fast updating cache and reducing load with Tiered Caching.

Not using load balancers means the based.io system allows clients to connect to the backend servers through Cloudflare, resulting in better performance and a more resilient infrastructure by eliminating the load balancers as potential attack surface. It also results in a better distribution of connections, using the real-time information of server health, amount of active connections, active subscriptions.

Scaling up the platform happens automatically under load by deploying additional machines that can each handle 40,000 connected users. These are spun up in batches of a couple of hundred and as each machine spins up, it reaches out directly to the Cloudflare API to configure its own DNS record and proxy status. Thanks to Cloudflare’s high speed DNS system, these changes are then propagated globally within seconds, resulting in a total machine turn-up time of around three seconds. This means faster discovery of new servers and faster dynamic rebalancing from the clients. And since the voting window of the Eurovision Song Contest is only 45 minutes, with the main peak within minutes after the window opens, every second counts!

To vote, users of the mobile app and viewers globally were pointed to the voting landing page, esc.vote. Building a frontend web application able to handle this kind of an audience is a challenge in itself. Although hosting it yourself and putting a CDN in front seems straightforward, this still requires you to own, configure and manage your origin infrastructure. once.net decided to leverage Cloudflare’s infrastructure directly by hosting the voting landing page on Cloudflare Pages. Deploying was as quick as a commit to their Git repository, and they never had to worry about reachability or scaling of the webpage.

once.net also used Cloudflare Turnstile to protect their payment API endpoints that were used to validate online votes. They used the invisible Turnstile widget to make sure the request was not coming from emulated browsers (e.g. Selenium). And best of all, using the invisible Turnstile widget the user did not have to go through extra steps, which allowed for a better user experience and better conversion.

Cloudflare Pages stealing the show!

After the two semi-finals went according to plan with approximately 200,000 concurrent users during each,May 13 brought the Grand Final. The once.net team made sure that there were enough machines ready to take the initial load, jumped on a call with Cloudflare to monitor and started looking at the number of concurrent users slowly increasing. During the event, there were a few attempts to DDoS the site, which were automatically and instantaneously mitigated without any noticeable impact to any visitors.

The based.io discovery registry server also got some attention. Since the cache TTL was set quite low at five seconds, a high rate of distributed traffic to it could still result in a significant load. Luckily, on its own, the highly optimized based.io server can already handle around 300,000 requests per second. Still, it was great to see that during the event the cache hit ratio for normal traffic was 20%, and during one significant attack the cache hit ratio peaked towards 80%. This showed how easy it is to leverage a combination of Cloudflare CDN and DDoS protection to mitigate such attacks, while still being able to serve dynamic and real time content.

When the curtains finally closed, 1.3 million concurrent users connected to the based.io platform at peak. The based.io platform handled a total of 350 million events and served seven million unique users in three hours. The voting landing page hosted by Cloudflare Pages served 2.3 million requests per second at peak, and made sure that the voting payments were by real human fans using Turnstile. Although the Cloudflare platform didn’t blink for such a flood of traffic, it is no surprise that it shows up as a short crescendo in our traffic statistics:

Get in touch with us

If you’re also working on or with an application that would benefit from Cloudflare’s speed and security, but don’t know where to start, reach out and we’ll work together.

Introducing the Cloudflare Radar Internet Quality Page

2023-06-23 David Belson

Post Syndicated from David Belson original http://blog.cloudflare.com/introducing-radar-internet-quality-page/

Introducing the Cloudflare Radar Internet Quality Page

Internet connections are most often marketed and sold on the basis of "speed", with providers touting the number of megabits or gigabits per second that their various service tiers are supposed to provide. This marketing has largely been successful, as most subscribers believe that "more is better”. Furthermore, many national broadband plans in countries around the world include specific target connection speeds. However, even with a high speed connection, gamers may encounter sluggish performance, while video conference participants may experience frozen video or audio dropouts. Speeds alone don't tell the whole story when it comes to Internet connection quality.

Additional factors like latency, jitter, and packet loss can significantly impact end user experience, potentially leading to situations where higher speed connections actually deliver a worse user experience than lower speed connections. Connection performance and quality can also vary based on usage – measured average speed will differ from peak available capacity, and latency varies under loaded and idle conditions.

The new Cloudflare Radar Internet Quality page

A little more than three years ago, as residential Internet connections were strained because of the shift towards working and learning from home due to the COVID-19 pandemic, Cloudflare announced the speed.cloudflare.com speed test tool, which enabled users to test the performance and quality of their Internet connection. Within the tool, users can download the results of their individual test as a CSV, or share the results on social media. However, there was no aggregated insight into Cloudflare speed test results at a network or country level to provide a perspective on connectivity characteristics across a larger population.

Today, we are launching these long-missing aggregated connection performance and quality insights on Cloudflare Radar. The new Internet Quality page provides both country and network (autonomous system) level insight into Internet connection performance (bandwidth) and quality (latency, jitter) over time. (Your Internet service provider is likely an autonomous system with its own autonomous system number (ASN), and many large companies, online platforms, and educational institutions also have their own autonomous systems and associated ASNs.) The insights we are providing are presented across two sections: the Internet Quality Index (IQI), which estimates average Internet quality based on aggregated measurements against a set of Cloudflare & third-party targets, and Connection Quality, which presents peak/best case connection characteristics based on speed.cloudflare.com test results aggregated over the previous 90 days. (Details on our approach to the analysis of this data are presented below.)

Users may note that individual speed test results, as well as the aggregate speed test results presented on the Internet Quality page will likely differ from those presented by other speed test tools. This can be due to a number of factors including differences in test endpoint locations (considering both geographic and network distance), test content selection, the impact of “rate boosting” by some ISPs, and testing over a single connection vs. multiple parallel connections. Infrequent testing (on any speed test tool) by users seeking to confirm perceived poor performance or validate purchased speeds will also contribute to the differences seen in the results published by the various speed test platforms.

And as we announced in April, Cloudflare has partnered with Measurement Lab (M-Lab) to create a publicly-available, queryable repository for speed test results. M-Lab is a non-profit third-party organization dedicated to providing a representative picture of Internet quality around the world. M-Lab produces and hosts the Network Diagnostic Tool, which is a very popular network quality test that records millions of samples a day. Given their mission to provide a publicly viewable, representative picture of Internet quality, we chose to partner with them to provide an accurate view of your Internet experience and the experience of others around the world using openly available data.

Connection speed & quality data is important

While most advertisements for fixed broadband and mobile connectivity tend to focus on download speeds (and peak speeds at that), there’s more to an Internet connection, and the user’s experience with that Internet connection, than that single metric. In addition to download speeds, users should also understand the upload speeds that their connection is capable of, as well as the quality of the connection, as expressed through metrics known as latency and jitter. Getting insight into all of these metrics provides a more well-rounded view of a given Internet connection, or in aggregate, the state of Internet connectivity across a geography or network.

The concept of download speeds are fairly well understood as a measure of performance. However, it is important to note that the average download speeds experienced by a user during common Web browsing activities, which often involves the parallel retrieval of multiple smaller files from multiple hosts, can differ significantly from peak download speeds, where the user is downloading a single large file (such as a video or software update), which allows the connection to reach maximum performance. The bandwidth (speed) available for upload is sometimes mentioned in ISP advertisements, but doesn’t receive much attention. (And depending on the type of Internet connection, there’s often a significant difference between the available upload and download speeds.) However, the importance of upload came to the forefront in 2020 as video conferencing tools saw a surge in usage as both work meetings and school classes shifted to the Internet during the COVID-19 pandemic. To share your audio and video with other participants, you need sufficient upload bandwidth, and this issue was often compounded by multiple people sharing a single residential Internet connection.

Latency is the time it takes data to move through the Internet, and is measured in the number of milliseconds that it takes a packet of data to go from a client (such as your computer or mobile device) to a server, and then back to the client. In contrast to speed metrics, lower latency is preferable. This is especially true for use cases like online gaming where latency can make a difference between a character’s life and death in the game, as well as video conferencing, where higher latency can cause choppy audio and video experiences, but it also impacts web page performance. The latency metric can be further broken down into loaded and idle latency. The former measures latency on a loaded connection, where bandwidth is actively being consumed, while the latter measures latency on an “idle” connection, when there is no other network traffic present. (These specific loaded and idle definitions are from the device’s perspective, and more specifically, from the speed test application’s perspective. Unless the speed test is being performed directly from a router, the device/application doesn't have insight into traffic on the rest of the network.) Jitter is the average variation found in consecutive latency measurements, and can be measured on both idle and loaded connections. A lower number means that the latency measurements are more consistent. As with latency, Internet connections should have minimal jitter, which helps provide more consistent performance.

Our approach to data analysis

The Internet Quality Index (IQI) and Connection Quality sections get their data from two different sources, providing two different (albeit related) perspectives. Under the hood they share some common principles, though.

IQI builds upon the mechanism we already use to regularly benchmark ourselves against other industry players. It is based on end user measurements against a set of Cloudflare and third-party targets, meant to represent a pattern that has become very common in the modern Internet, where most content is served from distribution networks with points of presence spread throughout the world. For this reason, and by design, IQI will show worse results for regions and Internet providers that rely on international (rather than peering) links for most content.

IQI is also designed to reflect the traffic load most commonly associated with web browsing, rather than more intensive use. This, and the chosen set of measurement targets, effectively biases the numbers towards what end users experience in practice (where latency plays an important role in how fast things can go).

For each metric covered by IQI, and for each ASN, we calculate the 25th percentile, median, and 75th percentile at 15 minute intervals. At the country level and above, the three calculated numbers for each ASN visible from that region are independently aggregated. This aggregation takes the estimated user population of each ASN into account, biasing the numbers away from networks that source a lot of automated traffic but have few end users.

The Connection Quality section gets its data from the Cloudflare Speed Test tool, which exercises a user's connection in order to see how well it is able to perform. It measures against the closest Cloudflare location, providing a good balance of realistic results and network proximity to the end user. We have a presence in 285 cities around the world, allowing us to be pretty close to most users.

Similar to the IQI, we calculate the 25th percentile, median, and 75th percentile for each ASN. But here these three numbers are immediately combined using an operation called the trimean — a single number meant to balance the best connection quality that most users have, with the best quality available from that ASN (users may not subscribe to the best available plan for a number of reasons).

Because users may choose to run a speed test for different motives at different times, and also because we take privacy very seriously and don’t record any personally identifiable information along with test results, we aggregate at 90-day intervals to capture as much variability as we can.

At the country level and above, the calculated trimean for each ASN in that region is aggregated. This, again, takes the estimated user population of each ASN into account, biasing the numbers away from networks that have few end users but which may still have technicians using the Cloudflare Speed Test to assess the performance of their network.

Navigating the Internet Quality page

The new Internet Quality page includes three views: Global, country-level, and autonomous system (AS). In line with the other pages on Cloudflare Radar, the country-level and AS pages show the same data sets, differing only in their level of aggregation. Below, we highlight the various components of the Internet Quality page.

Global

The top section of the global (worldwide) view includes time series graphs of the Internet Quality Index metrics aggregated at a continent level. The time frame shown in the graphs is governed by the selection made in the time frame drop down at the upper right of the page, and at launch, data for only the last three months is available. For users interested in examining a specific continent, clicking on the other continent names in the legend removes them from the graph. Although continent-level aggregation is still rather coarse, it still provides some insight into regional Internet quality around the world.

Further down the page, the Connection Quality section presents a choropleth map, with countries shaded according to the values of the speed, latency, or jitter metric selected from the drop-down menu. Hovering over a country displays a label with the country’s name and metric value, and clicking on the country takes you to the country’s Internet Quality page. Note that in contrast to the IQI section, the Connection Quality section always displays data aggregated over the previous 90 days.

Country-level

Within the country-level page (using Canada as an example in the figures below), the country’s IQI metrics over the selected time frame are displayed. These time series graphs show the median bandwidth, latency, and DNS response time within a shaded band bounded at the 25th and 75th percentile and represent the average expected user experience across the country, as discussed in the Our approach to data analysis section above.

Below that is the Connection Quality section, which provides a summary view of the country’s measured upload and download speeds, as well as latency and jitter, over the previous 90 days. The colored wedges in the Performance Summary graph are intended to illustrate aggregate connection quality at a glance, with an “ideal” connection having larger upload and download wedges and smaller latency and jitter wedges. Hovering over the wedges displays the metric’s value, which is also shown in the table to the right of the graph.

Below that, the Bandwidth and Latency/Jitter histograms illustrate the bucketed distribution of upload and download speeds, and latency and jitter measurements. In some cases, the speed histograms may show a noticeable bar at 1 Gbps, or 1000 ms (1 second) on the latency/jitter histograms. The presence of such a bar indicates that there is a set of measurements with values greater than the 1 Gbps/1000 ms maximum histogram values.

Autonomous system level

Within the upper-right section of the country-level page, a list of the top five autonomous systems within the country is shown. Clicking on an ASN takes you to the Performance page for that autonomous system. For others not displayed in the top five list, you can use the search bar at the top of the page to search by autonomous system name or number. The graphs shown within the AS level view are identical to those shown at a country level, but obviously at a different level of aggregation. You can find the ASN that you are connected to from the My Connection page on Cloudflare Radar.

Exploring connection performance & quality data

Digging into the IQI and Connection Quality visualizations can surface some interesting observations, including characterizing Internet connections, and the impact of Internet disruptions, including shutdowns and network issues. We explore some examples below.

Characterizing Internet connections

Verizon FiOS is a residential fiber-based Internet service available to customers in the United States. Fiber-based Internet services (as opposed to cable-based, DSL, dial-up, or satellite) will generally offer symmetric upload and download speeds, and the FiOS plans page shows this to be the case, offering 300 Mbps (upload & download), 500 Mbps (upload & download), and “1 Gig” (Verizon claims average wired speeds between 750-940 Mbps download / 750-880 Mbps upload) plans. Verizon carries FiOS traffic on AS701 (labeled UUNET due to a historical acquisition), and in looking at the bandwidth histogram for AS701, several things stand out. The first is a rough symmetry in upload and download speeds. (A cable-based Internet service provider, in contrast, would generally show a wide spread of download speeds, but have upload speeds clustered at the lower end of the range.) Another is the peaks around 300 Mbps and 750 Mbps, suggesting that the 300 Mbps and “1 Gig” plans may be more popular than the 500 Mbps plan. It is also clear that there are a significant number of test results with speeds below 300 Mbps. This is due to several factors: one is that Verizon also carries lower speed non-FiOS traffic on AS701, while another is that erratic nature of in-home WiFi often means that the speeds achieved on a test will be lower than the purchased service level.

Traffic shifts drive latency shifts

On May 9, 2023, the government of Pakistan ordered the shutdown of mobile network services in the wake of protests following the arrest of former Prime Minister Imran Khan. Our blog post covering this shutdown looked at the impact from a traffic perspective. Within the post, we noted that autonomous systems associated with fixed broadband networks saw significant increases in traffic when the mobile networks were shut down – that is, some users shifted to using fixed networks (home broadband) when mobile networks were unavailable.

Examining IQI data after the blog post was published, we found that the impact of this traffic shift was also visible in our latency data. As can be seen in the shaded area of the graph below, the shutdown of the mobile networks resulted in the median latency dropping about 25% as usage shifted from higher latency mobile networks to lower latency fixed broadband networks. An increase in latency is visible in the graph when mobile connectivity was restored on May 12.

Bandwidth shifts as a potential early warning sign

On April 4, UK mobile operator Virgin Media suffered several brief outages. In examining the IQI bandwidth graph for AS5089, the ASN used by Virgin Media (formerly branded as NTL), indications of a potential problem are visible several days before the outages occurred, as median bandwidth dropped by about a third, from around 35 Mbps to around 23 Mbps. The outages are visible in the circled area in the graph below. Published reports indicate that the problems lasted into April 5, in line with the lower median bandwidth measured through mid-day.

Submarine cable issues cause slower browsing

On June 5, Philippine Internet provider PLDT Tweeted an advisory that noted “One of our submarine cable partners confirms a loss in some of its internet bandwidth capacity, and thus causing slower Internet browsing.” IQI latency and bandwidth graphs for AS9299, a primary ASN used by PLDT, shows clear shifts starting around 06:45 UTC (14:45 local time). Median bandwidth dropped by half, from 17 Mbps to 8 Mbps, while median latency increased by 75% from 37 ms to around 65 ms. 75th percentile latency also saw a significant increase, nearly tripling from 63 ms to 180 ms coincident with the reported submarine cable issue.

Conclusion

Making network performance and quality insights available on Cloudflare Radar supports Cloudflare’s mission to help build a better Internet. However, we’re not done yet – we have more enhancements planned. These include making data available at a more granular geographical level (such as state and possibly city), incorporating AIM scores to help assess Internet quality for specific types of use cases, and embedding the Cloudflare speed test directly on Radar using the open source JavaScript module.

In the meantime, we invite you to use speed.cloudflare.com to test the performance and quality of your Internet connection, share any country or AS-level insights you discover on social media (tag @CloudflareRadar on Twitter or @[email protected] on Mastodon), and explore the underlying data through the M-Lab repository or the Radar API.

Benchmarking dashboard performance

2023-06-22 Richard Nguyen

Post Syndicated from Richard Nguyen original http://blog.cloudflare.com/benchmarking-dashboard-performance/

Benchmarking dashboard performance

In preparation of Cloudflare Speed Week 2023, we spent the last few weeks benchmarking the performance of a Cloudflare product that has gone through many transformations throughout the years: the Cloudflare dashboard itself!

Limitations and scope

Optimizing for user-experience is vital to the long-term success of both Cloudflare and our customers. Reliability and availability of the dashboard are also important, since millions of customers depend on our services every day. To avoid any potential service interruptions while we made changes to the application’s architecture, we decided to gradually roll out the improvements, starting with the login page.

As a global company, we strive to deliver the best experience to all of our customers around the world. While we were aware that performance was regional, with regions furthest from our core data centers experiencing up to 10 times longer loading speeds, we wanted to focus on improvements that would benefit all of our users, no matter where they geographically connect to the Dashboard.

Finally, throughout this exercise, it was important to keep in mind that our overall goal was to improve the user experience of the dashboard, with regards to loading performance. We chose to use a Lighthouse Performance score as a metric to measure performance, but we were careful to not set a target score. Once a measure becomes a target, it ceases to be a good measure.

Initial Benchmarks

Using a combination of open-source tools offered by Google (Lighthouse and PageSpeed Insights) and our own homegrown solution (Cloudflare Speed Test), we benchmarked our Lighthouse performance scores starting in Q1 2023. We found the results were… somewhat disappointing:

Although the site’s initial render occurred quickly (200ms), it took more than two seconds for the site to finish loading and be fully interactive.
In that time, the page was blocked for more than 500ms while the browser executed long JavaScript tasks.
Over half of the JavaScript served for the login page was not necessary to render the login page itself.

Improving what we've measured

The Cloudflare dashboard is a single page application that houses all of the UI for our wide portfolio of existing products, as well as the new features we're releasing every day. However, a less-than-performant experience is not acceptable to us; we owe it to our customers to deliver the best performance possible.

So what did we do?

Shipped less JavaScript

As obvious as it sounds, shipping less code to the user means they have to download fewer resources to load the application. In practice however, accomplishing this was harder than expected, especially for a five year old monolithic application.

We identified some of our largest dependencies with multiple versions, like lodash and our icon library, and deduped them. Bloated packages like the datacenter colo catalogs were refactored and drastically slimmed down. Packages containing unused code like development-only components, deprecated translations, and old Cloudflare Access UI components were removed entirely.

The result was a reduction in total assets being served to the user, going from 10MB (2.7MB gzipped) to 6.5MB (1.7MB gzipped). Lighthouse performance score improved to about 70. This was a good first step, but we could do better.

Identified and code split top-level boundaries

Code splitting is the process in which the application code is split into multiple bundles to be loaded on demand, reducing the initial amount of JavaScript a user downloads on page load. After logging in, as users navigate from account-level products like Workers and Pages, and then into specific zone-level products, like Page Shield for their domain, only the code necessary to render that particular page gets loaded dynamically.

Although most of the account-level and zone-level pages of the dashboard were properly code-split, the root application that imported these pages was not. It contained all of the code to bootstrap the application for both authenticated and unauthenticated users. This wasn’t a great experience for users who weren't even logged in yet, and we wanted to allow them to get into the main dashboard as quickly as possible.

So we split our monolithic application into two sub-applications: an authenticated and unauthenticated application. At a high level, on entrypoint initialization, we simply make an API request to check the user’s authentication state and dynamically load one sub-application or the other.

import React from 'react';
import { useAuth } from './useAuth';
const AuthenticatedAppLoadable = React.lazy(
  () => import('./AuthenticatedApp')
);
const UnauthenticatedAppLoadable = React.lazy(
  () => import('./UnauthenticatedApp')
);

// Fetch user auth state here and return user if logged in
// Render AuthenticatedApp or UnauthenticatedApp based on user
const Root: React.FC = () => {
  const { user } = useAuth();
  return user ? <AuthenticatedAppLoadable /> : <UnauthenticatedAppLoadable />;
};

That’s it! If a user is not logged in, we ship them a small bundle that only contains code necessary to render parts of the application related to login and signup, as well as a few global components. Code related to billing, account-level and zone-level products, sidebar navigation, and user profile settings are all bundled into a separate sub-application that only gets loaded once a user logs in.

Again, we saw significant improvements, especially to Largest Contentful Paint, pushing our performance scores to about 80. However, we ran a Chrome performance profile, and on closer inspection of the longest blocking task we noticed that there was still unnecessary code being parsed and evaluated, even though we never used it. For example, code for sidebar navigation was still loaded for unauthenticated users who never actually saw that component.

Optimized dead-code elimination

It turned out that our configuration for dead-code elimination was not optimized. Dead-code elimination, or “tree-shaking”, is the process in which your JavaScript transpiler automatically removes unused module imports from the final bundle. Although most modern transpilers have that setting on by default today, optimizing dead-code elimination for an existing application as old as the Cloudflare dashboard is not as straightforward.

We had to go through each individual JavaScript import to identify modules that didn’t produce side-effects so they could be marked by the transpiler to be removed. We were able to optimize “tree-shaking” for the majority of the modules, but this will be an ongoing process as we make more performance improvements.

Key results

Although the performance of the dashboard is not yet where we want it to be, we were still able to roll out significant improvements for the majority of our users. The table below shows the performance benchmarks for US users hitting the login page for the first time before and after the performance improvements.

Desktop

Mobile

What’s next

Overall, we were able to get some quick wins, but we’re still not done! This is just the first step in our mission to continually improve performance for all of our dashboard users. Here’s a look at some next steps that we will be experimenting with and testing in the coming months: decoupling signup pages from the main application, redesigning SSO login experience, exploring microfrontends and edge-side rendering.

In the meantime, check out Cloudflare Speed Test to generate a performance report and receive recommendations on how to improve the performance of your site today.

Spotlight on Zero Trust: We’re fastest and here’s the proof

2023-06-21 David Tuber

Post Syndicated from David Tuber original http://blog.cloudflare.com/spotlight-on-zero-trust/

Spotlight on Zero Trust: We're fastest and here's the proof

In January and in March we posted blogs outlining how Cloudflare performed against others in Zero Trust. The conclusion in both cases was that Cloudflare was faster than Zscaler and Netskope in a variety of Zero Trust scenarios. For Speed Week, we’re bringing back these tests and upping the ante: we’re testing more providers against more public Internet endpoints in more regions than we have in the past.

For these tests, we tested three Zero Trust scenarios: Secure Web Gateway (SWG), Zero Trust Network Access (ZTNA), and Remote Browser Isolation (RBI). We tested against three competitors: Zscaler, Netskope, and Palo Alto Networks. We tested these scenarios from 12 regions around the world, up from the four we’d previously tested with. The results are that Cloudflare is the fastest Secure Web Gateway in 42% of testing scenarios, the most of any provider. Cloudflare is 46% faster than Zscaler, 56% faster than Netskope, and 10% faster than Palo Alto for ZTNA, and 64% faster than Zscaler for RBI scenarios.

In this blog, we’ll provide a refresher on why performance matters, do a deep dive on how we’re faster for each scenario, and we’ll talk about how we measured performance for each product.

Performance is a threat vector

Performance in Zero Trust matters; when Zero Trust performs poorly, users disable it, opening organizations to risk. Zero Trust services should be unobtrusive when the services become noticeable they prevent users from getting their job done.

Zero Trust services may have lots of bells and whistles that help protect customers, but none of that matters if employees can’t use the services to do their job quickly and efficiently. Fast performance helps drive adoption and makes security feel transparent to the end users. At Cloudflare, we prioritize making our products fast and frictionless, and the results speak for themselves. So now let’s turn it over to the results, starting with our secure web gateway.

Cloudflare Gateway: security at the Internet

A secure web gateway needs to be fast because it acts as a funnel for all of an organization’s Internet-bound traffic. If a secure web gateway is slow, then any traffic from users out to the Internet will be slow. If traffic out to the Internet is slow, users may see web pages load slowly, video calls experience jitter or loss, or generally unable to do their jobs. Users may decide to turn off the gateway, putting the organization at risk of attack.

In addition to being close to users, a performant web gateway needs to also be well-peered with the rest of the Internet to avoid slow paths out to websites users want to access. Many websites use CDNs to accelerate their content and provide a better experience. These CDNs are often well-peered and embedded in last mile networks. But traffic through a secure web gateway follows a forward proxy path: users connect to the proxy, and the proxy connects to the websites users are trying to access. If that proxy isn’t as well-peered as the destination websites are, the user traffic could travel farther to get to the proxy than it would have needed to if it was just going to the website itself, creating a hairpin, as seen in the diagram below:

A well-connected proxy ensures that the user traffic travels less distance making it as fast as possible.

To compare secure web gateway products, we pitted the Cloudflare Gateway and WARP client against Zscaler, Netskope, and Palo Alto which all have products that perform the same functions. Cloudflare users benefit from Gateway and Cloudflare’s network being embedded deep into last mile networks close to users, being peered with over 12,000 networks. That heightened connectivity shows because Cloudflare Gateway is the fastest network in 42% of tested scenarios:

Number of testing scenarios where each provider is fastest for 95th percentile HTTP Response time (higher is better)
Provider	Scenarios where this provider is fastest
Cloudflare	48
Zscaler	14
Netskope	10
Palo Alto Networks	42

This data shows that we are faster to more websites from more places than any of our competitors. To measure this, we look at the 95th percentile HTTP response time: how long it takes for a user to go through the proxy, have the proxy make a request to a website on the Internet, and finally return the response. This measurement is important because it’s an accurate representation of what users see. When we look at the 95th percentile across all tests, we see that Cloudflare is 2.5% faster than Palo Alto Networks, 13% faster than Zscaler, and 6.5% faster than Netskope.

95th percentile HTTP response across all tests
Provider	95th percentile response (ms)
Cloudflare	515
Zscaler	595
Netskope	550
Palo Alto Networks	529

Cloudflare wins out here because Cloudflare’s exceptional peering allows us to succeed in places where others were not able to succeed. We are able to get locally peered in hard-to-reach places on the globe, giving us an edge. For example, take a look at how Cloudflare performs against the others in Australia, where we are 30% faster than the next fastest provider:

Cloudflare establishes great peering relationships in countries around the world: in Australia we are locally peered with all of the major Australian Internet providers, and as such we are able to provide a fast experience to many users around the world. Globally, we are peered with over 12,000 networks, getting as close to end users as we can to shorten the time requests spend on the public Internet. This work has previously allowed us to deliver content quickly to users, but in a Zero Trust world, it shortens the path users take to get to their SWG, meaning they can quickly get to the services they need.

Previously when we performed these tests, we only tested from a single Azure region to five websites. Existing testing frameworks like Catchpoint are unsuitable for this task because performance testing requires that you run the SWG client on the testing endpoint. We also needed to make sure that all of the tests are running on similar machines in the same places to measure performance as well as possible. This allows us to measure the end-to-end responses coming from the same location where both test environments are running.

In our testing configuration for this round of evaluations, we put four VMs in 12 cloud regions side by side: one running Cloudflare WARP connecting to our gateway, one running ZIA, one running Netskope, and one running Palo Alto Networks. These VMs made requests every five minutes to the 11 different websites mentioned below and logged the HTTP browser timings for how long each request took. Based on this, we are able to get a user-facing view of performance that is meaningful. Here is a full matrix of locations that we tested from, what websites we tested against, and which provider was faster:

	Endpoints
SWG Regions	Shopify	Walmart	Zendesk	ServiceNow	Azure Site	Slack	Zoom	Box	M365	GitHub	Bitbucket
East US	Cloudflare	Cloudflare	Palo Alto Networks	Cloudflare	Palo Alto Networks	Cloudflare	Palo Alto Networks	Cloudflare
West US	Palo Alto Networks	Palo Alto Networks	Cloudflare	Cloudflare	Palo Alto Networks	Cloudflare	Palo Alto Networks	Cloudflare
South Central US	Cloudflare	Cloudflare	Palo Alto Networks	Cloudflare	Palo Alto Networks	Cloudflare	Palo Alto Networks	Cloudflare
Brazil South	Cloudflare	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks	Zscaler	Zscaler	Zscaler	Palo Alto Networks	Cloudflare	Palo Alto Networks	Palo Alto Networks
UK South	Cloudflare	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks	Cloudflare	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks
Central India	Cloudflare	Cloudflare	Cloudflare	Palo Alto Networks	Palo Alto Networks	Cloudflare	Cloudflare	Cloudflare
Southeast Asia	Cloudflare	Cloudflare	Cloudflare	Cloudflare	Palo Alto Networks	Cloudflare	Cloudflare	Cloudflare
Canada Central	Cloudflare	Cloudflare	Palo Alto Networks	Cloudflare	Cloudflare	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks	Zscaler	Cloudflare	Zscaler
Switzerland North	netskope	Zscaler	Zscaler	Cloudflare	netskope	netskope	netskope	netskope	Cloudflare	Cloudflare	netskope
Australia East	Cloudflare	Cloudflare	netskope	Cloudflare	Cloudflare	Cloudflare	Cloudflare	Cloudflare
UAE Dubai	Zscaler	Zscaler	Cloudflare	Cloudflare	Zscaler	netskope	Palo Alto Networks	Zscaler	Zscaler	netskope	netskope
South Africa North	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks	Zscaler	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks	Palo Alto Networks	Zscaler	Palo Alto Networks	Palo Alto Networks

Blank cells indicate that tests to that particular website did not report accurate results or experienced failures for over 50% of the testing period. Based on this data, Cloudflare is generally faster, but we’re not as fast as we’d like to be. There are still some areas where we need to improve, specifically in South Africa, UAE, and Brazil. By Birthday Week in September, we want to be the fastest to all of these websites in each of these regions, which will bring our number up from fastest in 54% of tests to fastest in 79% of tests.

To summarize, Cloudflare’s Gateway is still the fastest SWG on the Internet. But Zero Trust isn’t all about SWG. Let’s talk about how Cloudflare performs in Zero Trust Network Access scenarios.

Instant (Zero Trust) access

Access control needs to be seamless and transparent to the user: the best compliment for a Zero Trust solution is for employees to barely notice it’s there. Services like Cloudflare Access protect applications over the public Internet, allowing for role-based authentication access instead of relying on things like a VPN to restrict and secure applications. This form of access management is more secure, but with a performant ZTNA service, it can even be faster.

Cloudflare outperforms our competitors in this space, being 46% faster than Zscaler, 56% faster than Netskope, and 10% faster than Palo Alto Networks:

Zero Trust Network Access P95 HTTP Response times
Provider	P95 HTTP response (ms)
Cloudflare	1252
Zscaler	2388
Netskope	2974
Palo Alto Networks	1471

For this test, we created applications hosted in three different clouds in 12 different locations: AWS, GCP, and Azure. However, it should be noted that Palo Alto Networks was the exception, as we were only able to measure them using applications hosted in one cloud from two regions due to logistical challenges with setting up testing: US East and Singapore.

For each of these applications, we created tests from Catchpoint that accessed the application from 400 locations around the world. Each of these Catchpoint nodes attempted two actions:

New Session: log into an application and receive an authentication token
Existing Session: refresh the page and log in passing the previously obtained credentials

We like to measure these scenarios separately, because when we look at 95th percentile values, we would almost always be looking at new sessions if we combined new and existing sessions together. For the sake of completeness though, we will also show the 95th percentile latency of both new and existing sessions combined.

Cloudflare was faster in both US East and Singapore, but let’s spotlight a couple of regions to delve into. Let’s take a look at a region where resources are heavily interconnected equally across competitors: US East, specifically Ashburn, Virginia.

In Ashburn, Virginia, Cloudflare handily beats Zscaler and Netskope for ZTNA 95th percentile HTTP Response:

95th percentile HTTP Response times (ms) for applications hosted in Ashburn, VA
AWS East US	Total (ms)	New Sessions (ms)	Existing Sessions (ms)
Cloudflare	2849	1749	1353
Zscaler	5340	2953	2491
Netskope	6513	3748	2897
Palo Alto Networks
Azure East US
Cloudflare	1692	989	1169
Zscaler	5403	2951	2412
Netskope	6601	3805	2964
Palo Alto Networks
GCP East US
Cloudflare	2811	1615	1320
Zscaler
Netskope	6694	3819	3023
Palo Alto Networks	2258	894	1464

You might notice that Palo Alto Networks looks to come out ahead of Cloudflare for existing sessions (and therefore for overall 95th percentile). But these numbers are misleading because Palo Alto Networks’ ZTNA behavior is slightly different than ours, Zscaler’s, or Netskope’s. When they perform a new session, it does a full connection intercept and returns a response from its processors instead of directing users to the login page of the application they are trying to access.

This means that Palo Alto Networks' new session response times don’t actually measure the end-to-end latency we’re looking for. Because of this, their numbers for new session latency and total session latency are misleading, meaning we can only meaningfully compare ourselves to them for existing session latency. When we look at existing sessions, when Palo Alto Networks acts as a pass-through, Cloudflare still comes out ahead by 10%.

This is true in Singapore as well, where Cloudflare is 50% faster than Zscaler and Netskope, and also 10% faster than Palo Alto Networks for Existing Sessions:

95th percentile HTTP Response times (ms) for applications hosted in Singapore
AWS Singapore	Total (ms)	New Sessions (ms)	Existing Sessions (ms)
Cloudflare	2748	1568	1310
Zscaler	5349	3033	2500
Netskope	6402	3598	2990
Palo Alto Networks
Azure Singapore
Cloudflare	1831	1022	1181
Zscaler	5699	3037	2577
Netskope	6722	3834	3040
Palo Alto Networks
GCP Singapore
Cloudflare	2820	1641	1355
Zscaler	5499	3037	2412
Netskope	6525	3713	2992
Palo Alto Networks	2293	922	1476

One critique of this data could be that we’re aggregating the times of all Catchpoint nodes together at P95, and we’re not looking at the 95th percentile of Catchpoint nodes in the same region as the application. We looked at that, too, and Cloudflare’s ZTNA performance is still better. Looking at only North America-based Catchpoint nodes, Cloudflare performs 50% better than Netskope, 40% better than Zscaler, and 10% better than Palo Alto Networks at P95 for warm connections:

Zero Trust Network Access 95th percentile HTTP Response times for warm connections with testing locations in North America
Provider	P95 HTTP response (ms)
Cloudflare	810
Zscaler	1290
Netskope	1351
Palo Alto Networks	871

Finally, one thing we wanted to show about our ZTNA performance was how well Cloudflare performed per cloud per region. This below chart shows the matrix of cloud providers and tested regions:

Fastest ZTNA provider in each cloud provider and region by 95th percentile HTTP Response
	AWS	Azure	GCP
Australia East	Cloudflare	Cloudflare	Cloudflare
Brazil South	Cloudflare	Cloudflare	N/A
Canada Central	Cloudflare	Cloudflare	Cloudflare
Central India	Cloudflare	Cloudflare	Cloudflare
East US	Cloudflare	Cloudflare	Cloudflare
South Africa North	Cloudflare	Cloudflare	N/A
South Central US	N/A	Cloudflare	Zscaler
Southeast Asia	Cloudflare	Cloudflare	Cloudflare
Switzerland North	N/A	N/A	Cloudflare
UAE Dubai	Cloudflare	Cloudflare	Cloudflare
UK South	Cloudflare	Cloudflare	netskope
West US	Cloudflare	Cloudflare	N/A

There were some VMs in some clouds that malfunctioned and didn’t report accurate data. But out of 30 available cloud regions where we had accurate data, Cloudflare was the fastest ZT provider in 28 of them, meaning we were faster in 93% of tested cloud regions.

To summarize, Cloudflare also provides the best experience when evaluating Zero Trust Network Access. But what about another piece of the puzzle: Remote Browser Isolation (RBI)?

Remote Browser Isolation: a secure browser hosted in the cloud

Remote browser isolation products have a very strong dependency on the public Internet: if your connection to your browser isolation product isn’t good, then your browser experience will feel weird and slow. Remote browser isolation is extraordinarily dependent on performance to feel smooth and seamless to the users: if everything is fast as it should be, then users shouldn’t even notice that they’re using browser isolation.

For this test, we’re again pitting Cloudflare against Zscaler. While Netskope does have an RBI product, we were unable to test it due to it requiring a SWG client, meaning we would be unable to get full fidelity of testing locations like we would when testing Cloudflare and Zscaler. Our tests showed that Cloudflare is 64% faster than Zscaler for remote browsing scenarios: Here’s a matrix of fastest provider per cloud per region for our RBI tests:

Fastest RBI provider in each cloud provider and region by 95th percentile HTTP Response
	AWS	Azure	GCP
Australia East	Cloudflare	Cloudflare	Cloudflare
Brazil South	Cloudflare	Cloudflare	Cloudflare
Canada Central	Cloudflare	Cloudflare	Cloudflare
Central India	Cloudflare	Cloudflare	Cloudflare
East US	Cloudflare	Cloudflare	Cloudflare
South Africa North	Cloudflare	Cloudflare
South Central US		Cloudflare	Cloudflare
Southeast Asia	Cloudflare	Cloudflare	Cloudflare
Switzerland North	Cloudflare	Cloudflare	Cloudflare
UAE Dubai	Cloudflare	Cloudflare	Cloudflare
UK South	Cloudflare	Cloudflare	Cloudflare
West US	Cloudflare	Cloudflare	Cloudflare

This chart shows the results of all of the tests run against Cloudflare and Zscaler to applications hosted on three different clouds in 12 different locations from the same 400 Catchpoint nodes as the ZTNA tests. In every scenario, Cloudflare was faster. In fact, no test against a Cloudflare-protected endpoint had a 95th percentile HTTP Response of above 2105 ms, while no Zscaler-protected endpoint had a 95th percentile HTTP response of below 5000 ms.

To get this data, we leveraged the same VMs to host applications accessed through RBI services. Each Catchpoint node would attempt to log into the application through RBI, receive authentication credentials, and then try to access the page by passing the credentials. We look at the same new and existing sessions that we do for ZTNA, and Cloudflare is faster in both new sessions and existing session scenarios as well.

Gotta go fast(er)

Our Zero Trust customers want us to be fast not because they want the fastest Internet access, but because they want to know that employee productivity won’t be impacted by switching to Cloudflare. That doesn’t necessarily mean that the most important thing for us is being faster than our competitors, although we are. The most important thing for us is improving our experience so that our users feel comfortable knowing we take their experience seriously. When we put out new numbers for Birthday Week in September and we’re faster than we were before, it won’t mean that we just made the numbers go up: it means that we are constantly evaluating and improving our service to provide the best experience for our customers. We care more that our customers in UAE have an improved experience with Office365 as opposed to beating a competitor in a test. We show these numbers so that we can show you that we take performance seriously, and we’re committed to providing the best experience for you, wherever you are.

Faster website, more customers: Cloudflare Observatory can help your business grow

2023-06-20 Matt Bullock

Post Syndicated from Matt Bullock original http://blog.cloudflare.com/cloudflare-observatory-generally-available/

Faster website, more customers: Cloudflare Observatory can help your business grow

This post is also available in 简体中文, 日本語 and Español.

Website performance is crucial to the success of online businesses. Study after study has shown that an increased load time directly affects sales. In highly competitive markets the performance of a website is crucial for success. Just like a physical shop situated in a remote area faces challenges in attracting customers, a slow website encounters similar difficulties in attracting traffic. It is vital to measure and improve website performance to enhance user experience and maximize online engagement. Results from testing at home don’t take into account how your customers in different countries, on different devices, with different Internet connections experience your website.

Simply put, you might not know how your website is performing. And that could be costing your business money every single day.

Today we are excited to announce Cloudflare Observatory – the new home of performance at Cloudflare.

Cloudflare users can now easily monitor website performance using Real User Monitoring (RUM) data along with scheduled tests from different regions in a single dashboard. This will identify any performance issues your website may have. The best bit? Once we’ve identified any issues, Observatory will highlight customized recommendations to resolve these issues, all with a single click.

Making your website faster just got a lot easier.

I feel the need. The need for speed!

Having a fast website is crucial for achieving online success. According to Google, even a one-second improvement in load time can boost mobile conversions by up to 27%.

A study from Deloitte found “With a 0.1s improvement in site speed, we observed that retail consumers spent almost 10% more”. Another study, from Google, found “53% will leave a mobile site if it takes more than 3 seconds to load”. There is a very real link between website performance and business success.

In today's digital landscape, customers expect instant access to information and seamless browsing experiences. We have all encountered the frustration of waiting for a website to load, often leading us to click the back button and click on the next link. For ecommerce sites, this delay directly translates to lost revenue as users quickly navigate elsewhere.

This importance is further amplified in the world of Search Engine Optimization (SEO). In May 2021, Google announced that page speed would be incorporated into their ranking algorithm, highlighting the significance of fast-loading web pages for higher search engine rankings.

Introducing Observatory

In 2019, we launched the new Speed Tab with the mission to address two crucial questions: "How fast is my website after moving to Cloudflare?" and "How fast could it be?" This tab allowed customers to compare their website's performance before and after enabling Cloudflare features. However, it required users to delve into analytics and analyze traffic patterns and cache hit ratios to optimize their sites, which proved challenging for new Cloudflare users.

To address this, we developed Observatory, a fresh approach to performance monitoring at Cloudflare. Observatory fills the gap that previously existed in understanding website performance and simplifies the process of addressing performance issues by providing tailored recommendations.

Observatory integrates Real-User Monitoring (RUM) data, which enables users to understand their website's performance as experienced by their end users across the globe. By leveraging RUM data we can show valuable insight into the areas of the website that can be optimized and surface Cloudflare features and functionality that can address these issues.

Additionally, Observatory incorporates Google Lighthouse, the industry standard tool for evaluating web performance. We replaced WebPageTest with Lighthouse due to its versatility and widespread adoption in the performance community. With Lighthouse, users can run, schedule, and access Lighthouse performance reports directly in the Cloudflare dashboard.

Observatory also enables regional testing, recognizing the importance of understanding performance variations across different locations. By simulating website performance in different regions, users can understand if their webpage performs well in certain countries and poorly in others. This enables users to optimize their websites for a global audience, ensuring consistent and fast user experiences regardless of location.

Observatory becomes your unified place within the Cloudflare dashboard for website performance by bringing together RUM data, Lighthouse insights, and regional testing. Users can gain a comprehensive understanding of their website's performance and implement Cloudflare recommendations based on this data with just a click of a button.

Measuring performance in Cloudflare Observatory

We support the two main methods of testing website performance. These are synthetic tests and Real User Monitoring (RUM) tests.

Synthetic tests involve simulating user interactions and monitoring performance under controlled environments. These tests can provide valuable baseline measurements and help identify potential issues before deploying changes.

On the other hand, RUM tests involve collecting data directly from real users as they interact with the website, capturing their actual experiences in different environments and network conditions. RUM tests offer insights into the true end-user perspective. By combining both synthetic and RUM tests, website owners can gain a holistic view of performance, understanding how changes and optimizations affect both simulated and real user experiences.

Cloudflare Observatory combines both of these in one location. The integration of Google Lighthouse within the Observatory gives Cloudflare users a simple way to synthetically measure and understand their site's performance. Google Lighthouse measures several key performance metrics that impact user experience and search engine ranking. The generated report provides an overall performance score ranging from 1 (least performant) to 99 (most performant), making it easy for website owners to understand their site's performance.

Observatory offers a user-friendly interface that presents each Lighthouse metric in a traffic light system, indicating the result of the tested metric. One critical metric is Largest Contentful Paint (LCP), which measures a page's loading performance of the primary content. An optimal LCP score is less than 2.5 seconds, indicating satisfactory loading speed for the user. Through Observatory website owners can easily see their LCP score and other metrics. This allows them to optimize their site's performance and user experience. For example, by examining the LCP score website owners can identify opportunities for improvement and make informed decisions to enhance their site's performance.

New Smarter Recommendations

Recommendations from Observatory have become smarter by leveraging the insights gathered from Lighthouse and RUM testing. This enables us to precisely identify issues and offer tailored Cloudflare settings to enhance performance. For instance, when you receive a Lighthouse report it will highlight areas in which your website can be improved. In the provided report, several enhancements for image optimization are suggested. Cloudflare takes this feedback into account and provides product recommendations, such as enabling Polish or utilizing Image Resizing. This empowers our customers to enhance their performance score with just a single click.

Customers will have the convenience of viewing these recommendations within the Cloudflare dashboard, directly linked to the audit. The dashboard will encompass a wide range of Cloudflare features and functionalities, continually improving over time. With the addition of Cache Rules recommendations for uncached static content and a comprehensive testing suite, users will gain valuable insights into the benefits of implementing specific Cloudflare features before enabling them.

By knowing the performance impact of a product or feature before it is enabled, customers can make informed decisions and optimize their website's performance with confidence.

More tests, multiple regions and recurring tests

A significant piece of feedback we received from our old Speed Tab and beta testing was regarding the number and location of tests. We're thrilled to announce that we have addressed this feedback by increasing the number of tests allowed and enabling all plan types to schedule at least one recurring test, originating from a US region.

Customers on our Pro, Business, and Enterprise family of subscriptions can run tests from various regions to understand their site's performance in those areas. For instance, if a website is solely hosted in Iowa, USA, and a visitor is accessing it from Sydney, Australia, they will experience a slower page load due to the time it takes for an uncached file to be sent and rendered by the user's browser over a distance of 14,000 kilometers. By running tests from various regions, our customers can gain valuable insights into their website's performance and make informed decisions to optimize it for a better user experience – and an improved page load time.

The higher your plan type the more tests you are able to run and the more regions you are able to use. For example Pro customers can set up five recurring tests for their most important page from five different locations. These test runs will then be stored within the Observatory history tab allowing them to understand their Page Speed score from around the globe. Below is a table detailing the number of tests each plan type can run and the regions available to them.

Plan	Ad-hoc tests	Recurring tests	Frequency of recurring tests	Regions supported
Free	5	1	Weekly	Iowa, USA
Pro	10	5	Daily	Everything in Free and South Carolina, USA North Virginia, USA Dallas, USA Oregon, USA Hamina, Finland Madrid, Spain St. Ghislain, Belgium Eemshaven, Netherlands Milan, Italy Paris, France Changhua County, Taiwan Tokyo, Japan Osaka, Japan Tel Aviv, Israel London, England Jurong West, Singapore Sydney, Australia Frankfurt, Germany Mumbai, India São Paulo, Brazil
Business	20	10	Daily
Enterprise	50	15	Daily

Incorporating RUM

Cloudflare’s RUM service provides insights to a user's browser or devices, tracking metrics such as page load times, response times, and other user interactions. Cloudflare collects RUM data through its Browser Insights feature, which inserts a JavaScript "beacon" into HTML pages. This beacon sends information back to Cloudflare about the performance of a website from the perspective of real users, including metrics such as page load time, time to first byte, and other Web Vitals.

While you can always try a few page loads on your own laptop and see the results, gathering data from real users is the only way to take into account real-life device performance and network conditions.

Observatory now incorporates RUM data to match against your tested paths. This allows you to easily see how real users experience your site across the globe. This data is also dissected and located in the Observatory tab against your tested paths. Allowing you to view synthetic test data directly against Real User metrics.

Our RUM provider already incorporates the Interactive Next Paint (INP) Score. In 2022, Google announced Interaction to Next Paint (INP) as that new metric, promoting INP as the new Core Web Vital metric for responsiveness, replacing First Input Delay (FID). FID measures the delay between a user's first interaction with a web page and the browser's response to that interaction. INP measures the delay for any user interaction on a website, not just limited to the first input. This change reflects a more comprehensive approach to evaluating the responsiveness of a website.

If you don't have Web Analytics enabled on your Cloudflare zone then we will be unable to collect and display RUM data within Observatory. Enabling this feature is very simple and instructions can be found here.

One click optimizations

Observatory now includes an enhanced Optimization layout, which introduces a one-click recommendations center. Enabling these features on your Cloudflare zone enhances optimization for the latest HTTP protocols, including HTTP/3. Additionally, Image Delivery is improved by converting PNGs and JPEGs to the efficient WebP format. Finally, Cloudflare performance tools are also enabled, allowing users to seamlessly implement new technologies such as Early Hints. These features are designed to contribute to improved website speed and overall performance.

As we release new features that we believe are beneficial to our customers, we will continue to add them to the One Click Optimizations. We have also made changes to the overall layout of the tab, splitting our products into subcategories to allow easy navigation to the individual performance products.

Available now

Observatory is available now! Become the Web Performance advocate in your organization by taking advantage of the Observatory features such as Google Lighthouse integration, RUM data, and multi-region testing, all available now. You will be able to gain valuable insights into your website's performance and make informed decisions to optimize and improve your site's performance.

In the coming months, we will continue expanding the Recommendations engine, introducing more products that empower you to continually enhance your website's performance. Additionally, we will provide the capability to simulate requests for specific features, giving you a comprehensive understanding of the real-world performance benefits before implementing them on your website.

INP. Get ready for the new Core Web Vital

2023-06-20 William Woodhead

Post Syndicated from William Woodhead original http://blog.cloudflare.com/inp-get-ready-for-the-new-core-web-vital/

INP will replace FID in the Core Web Vitals

INP. Get ready for the new Core Web Vital

On May 10, 2023, Google announced that INP will replace FID in the Core Web Vitals in March 2024. The Core Web Vitals play a role in the Google Search algorithm. So website owners who care about Search Engine Optimization (SEO) should prepare for the change. Otherwise their search ranking might suffer.

This post will first explain what FID, INP and the Core Web Vitals are. Then it will show how FID and INP relate to each other across a large range of Cloudflare sites. (Spoiler alert – If a site has ‘Good’ scoring FID, it might not have ‘Good’ scoring INP). Then it will discuss how to prepare for this change and how Cloudflare can help.

A few definitions

In order to make sense of the upcoming change, here are some definitions that will set the scene.

Core Web Vitals

Measuring user-centric web performance is challenging. To face this challenge, Google developed a series of metrics called the Web Vitals. These Web Vitals are signals that measure different aspects of web performance. For example Time To First Byte (TTFB) is one of the Web Vitals: from the perspective of the browser, TTFB “measures the time between the request for a resource and when the first byte of a response begins to arrive.” – https://web.dev/ttfb/

A subset of the Web Vitals are the Core Web Vitals. The Core Web Vitals are identified as the most critical Web Vitals to pay attention to. As such, they play a role in the Google Search algorithm. Improving a webpage’s Core Web Vitals can improve success with Google Search. The Core Web Vitals currently consist of three metrics: Largest Contentful Paint (LCP), First Input Delay (FID) and Cumulative Layout Shift (CLS). Respectively they measure Loading Speed, Interactivity and Visual Stability. This will change in March 2024 when Interaction To Next Paint (INP) replaces FID as the Core Web Vital for Interactivity.

Interaction

An Interaction on a webpage starts with a user input. The browser then reacts to this input. This includes the input delay, the processing time and the presentation delay before the next paint occurs and the new frame is presented.

Keeping in mind that an interaction is composed of these three serialized durations – input delay, processing time and presentation delay – will help later on to understand the difference between FID and INP.

First Input Delay (FID)

“FID measures the time from when a user first interacts with a page (that is, when they click a link, tap on a button, or use a custom, JavaScript-powered control) to the time when the browser is actually able to begin processing event handlers in response to that interaction.” – https://web.dev/fid/

FID is the current Core Web Vital that measures interactivity. FID measures the first interaction with the page. User interactions after the first interaction are not measured with FID. FID also only measures the input delay part of the first interaction. It does not measure the processing time and the presentation delay.

FID is classed as ‘Good’ if the input delay is less than 100 ms. FID is classed as ‘Needs Improvement’ between 100ms and 300ms. Delays above 300 ms are classed as ‘Poor’.

Interaction to Next Paint (INP)

“INP is a metric that assesses a page's overall responsiveness to user interactions by observing the latency of all click, tap, and keyboard interactions that occur throughout the lifespan of a user's visit to a page. The final INP value is the longest interaction observed, ignoring outliers.” - https://web.dev/inp/

INP, like FID, is also a Web Vital that measures interactivity. INP observes all user interactions in a page view and reports the longest. It measures more than just the input delay duration of each interaction. It also measures the processing time and the presentation delay until the next paint occurs. Once you understand the mechanics of it, the name ‘Interaction to Next Paint’ is a good description of the duration it measures.

INP is classed as ‘Good’ if the final reported value is less than 200 ms. INP is classed as ‘Needs Improvement’ between 200ms and 500ms. Any measurement above 500 ms is classed as ‘Poor’.

A more comprehensive measure of interactivity

Both FID and INP measure interactivity by measuring delays after user interactions. But INP is more comprehensive than FID. As an example, let’s say a user loads a web page and interacts only once by clicking a button. In this case, FID will capture this interaction by reporting the input delay after the user input. INP will also observe this interaction but it will measure not only the input delay but also the subsequent processing time and the presentation delay. Since this is the only interaction on the web page, INP will observe this as the longest duration within the page view and so will report this duration as the final reported INP value. Already it is clear that with just a single user interaction, INP provides a more comprehensive signal of interactivity because it includes processing time and presentation delay.

But INP also goes beyond the first interaction on the page. It observes all the interactions within a page view and reports the longest interaction of the set (ignoring outliers). This means that web sites need to ensure that all interactions throughout the lifecycle of a page view are under 200ms in order to score a ‘Good’ INP. With FID, web sites only need to focus on optimizing the first interaction’s input delay under 100 ms to score ‘Good’ FID. With INP joining the Core Web Vitals, web sites will need to focus on optimizing all interactions.

Comparing FID with INP across Cloudflare

At Cloudflare, for sites that have enabled Real User Measurements (RUM) via the Web Analytics Product or the Observatory Product, we already collect INP data. This means we have a rich dataset across 850k sites with which to analyze INP against FID.

FID vs INP

Over the period of a week, across all Cloudflare RUM-enabled sites, we observed over four billion reported FIDs and INPs. 93% of the FID values classify as ‘Good’ (under 100 ms). 75% of the INP values classify as ‘Good’ (under 200 ms). Since these values are sourced from the same set of page views, it is already possible to see at a glance that INP is distributed more lightly in the ‘Good’ range than FID.

Interestingly, the P75 value for FID is just 16 ms. A 0.16 times multiple of the FID ‘Good’ threshold of 100 ms. For INP, the P75 value is 200 ms. A one times multiple of the INP ‘Good’ threshold.

INP across devices

Focusing on INP, when the INP data is segmented by device type, we observe that 88% of desktop INP classify as ‘Good’, but only 67% of mobile INP classify as ‘Good’. This suggests that INP is generally suffering more on mobile devices than on desktop.

The bottom line is that INP is more challenging than FID to ‘get into the Green’. Since INP is a more comprehensive measure of interactivity, web sites are more exposed, and less likely to score favorably.

How to prepare for the change

INP will start affecting Google Search rankings when it becomes a Core Web Vital in March 2024. To get ready, websites need to improve interactivity beyond the initial load of the site.

As always, in order to improve on a metric, a baseline must be drawn by collecting data on the metric. But with INP, this can be challenging because INP is best reported from field data. Field data means that the data comes from real users interacting with web pages on real browsers. With field data, synthetic tools like Google Lighthouse aren’t a great fit. Synthetic tools are not designed to emulate the wide range of users on the wide range of browsers and operating systems that will navigate a web page.

In order to collect real user data from real browsers, websites need to enable a RUM provider. RUM providers collect performance data from real browsers, and process the data so that it can be aggregated and analyzed. It’s worth noting that performance data is anonymous and does not include any personal identifiable information (PII).

So the first step to understanding INP on a website is to enable a RUM provider.

Once a RUM provider is enabled on a website, it is then possible to analyze INP data to understand which web pages are not optimized for interactivity. There can be many possible causes for this. Since INP is composed of input delay, processing time and presentation delay, optimizing INP can be broken down into optimizing these distinct durations.

The best advice for optimizing INP is set out by web.dev. Optimizing INP is not straight-forward – it can involve breaking up JavaScript tasks, reducing DOM interactions and layouts, avoiding use of timers, and limiting third-party code amongst a range of other measures. That’s why it’s best to get started early and learn as much as possible before the change takes place.

How Cloudflare can help

A free RUM provider

Cloudflare offers a free RUM provider which collects INP as part of its dataset. It currently powers Cloudflare Web Analytics and Cloudflare Observatory. By enabling RUM with Cloudflare, you can explore the interactivity of your site’s web pages and start to identify interactivity issues.

Just log onto the Cloudflare Dashboard, and head to your target account. Go to Web Analytics.

Reduce the impact of third-party scripts

One of the contributing causes of degraded INP is JavaScript blocking the main thread after an interaction. Often this can be caused by heavy third-party scripts attaching event listeners to DOM elements. For example, third-party analytics scripts can register listeners on interactive elements of a page to power various forms of analytics. Many sites have dozens, if not hundreds of third-party scripts running, each one potentially degrading INP.

Cloudflare offers a unique product that can entirely remove third-party scripts from the browser. This frees up the main thread, boosting interactivity. If you haven’t come across Cloudflare Zaraz before, check it out in our docs. Not only can it improve INP, but it can also dramatically improve the security posture of your site by reducing (or even entirely removing) the amount of third-party JavaScript that runs on your website.

Making home Internet faster has little to do with “speed”

2023-04-18 Mike Conlow

Post Syndicated from Mike Conlow original https://blog.cloudflare.com/making-home-internet-faster/

Making home Internet faster has little to do with “speed”

More than ten years ago, researchers at Google published a paper with the seemingly heretical title “More Bandwidth Doesn’t Matter (much)”. We published our own blog showing it is faster to fly 1TB of data from San Francisco to London than it is to upload it on a 100 Mbps connection. Unfortunately, things haven’t changed much. When you make purchasing decisions about home Internet plans, you probably consider the bandwidth of the connection when evaluating Internet performance. More bandwidth is faster speed, or so the marketing goes. In this post, we’ll use real-world data to show both bandwidth and – spoiler alert! – latency impact the speed of an Internet connection. By the end, we think you’ll understand why Cloudflare is so laser focused on reducing latency everywhere we can find it.

First, we should quickly define bandwidth and latency. Bandwidth is the amount of data that can be transmitted at any single time. It’s the maximum throughput, or capacity, of the communications link between two servers that want to exchange data. Usually, the bottleneck – the place in the network where the connection is constrained by the amount of bandwidth available – is in the “last mile”, either the wire that connects a home, or the modem or router in the home itself.

If the Internet is an information superhighway, bandwidth is the number of lanes on the road. The wider the road, the more traffic can fit on the highway at any time. Bandwidth is useful for downloading large files like operating system updates and big game updates. While bandwidth, throughput and capacity are synonyms, confusingly “speed” has come to mean bandwidth when talking about Internet plans. More on this later.

We use bandwidth when streaming video, though probably less than you think. Netflix recommends 15 Mbps of bandwidth to watch a stream in 4K/Ultra HD. A 1 Gbps connection could stream more than 60 Netflix shows in 4K at the same time!

Latency, on the other hand, is how fast data is moving through the Internet. To extend our analogy, tatency is the speed at which traffic is moving on the highway. If traffic is moving quickly, you’ll get to your destination faster. Latency is measured in the number of milliseconds that it takes a packet of data to go from a client (such as your laptop computer) to a server, and then back to the client. If you’re practicing tennis against a wall, round-trip latency is the time the ball was in the air. On the Internet “backbone” data is traveling at almost 200,000 kilometers per second as it bounces off the glass on the inside of fiber optic wires. That’s fast!

While we can’t make light travel through glass much faster, we can improve latency by moving the content closer to users, shortening the distance data needs to travel. That’s the effect of our presence in more than 285 cities globally: when you’re on the Internet superhighway trying to reach Cloudflare, we want to be just off the next exit.

We know that low-latency (low delay; high speed) connections are important for gaming, where tiny bits of data, such as the change in position of players in a game, need to reach another computer quickly. And increasingly, we’re becoming aware of high latency when it makes our videoconferencing choppy and unpleasant.

But we don’t use the Internet only to play games, nor only watch streaming video. We do those, and we visit a lot of normal web pages in between.

In the 2010 paper from Google, the author simulated loading web pages while varying the throughput and latency of the connection. What he finds is that above about 5 Mbps, the page doesn’t load much faster. Increasing bandwidth from 1 Mbps to 2 Mbps is almost a 40 percent improvement in page load time. From 5 Mbps to 6 Mbps is less than a 5 percent improvement.

When he varies the latency (the Round Trip Time, or RTT), something interesting happens: there is a linear return to better latency. For every 20 milliseconds of reduced latency, the page load time improves by about 10%.

Let’s see what this looks like in real life with empirical data. Below is a chart from an excellent recent paper by two researchers from MIT. Using data from the FCC’s Measuring Broadband America program, these researchers produce a similar chart to what we expect from the 2010 simulation. Though the point of diminishing returns to more bandwidth has moved higher – to about 20 Mbps – the overall trend is exactly the same.

For the latency relationship, we use aggregate and anonymized data from Cloudflare’s own delivery network. It’s a familiar pattern. For every 200 milliseconds of latency we can save, we cut the page load time by over 1 second. That relationship applies when the latency is 950 milliseconds. And it applies when the latency is 50 milliseconds.

Let’s try to understand why this relationship exists. When you connect to a website, the first thing that your browser does is encrypt the connection and make sure the site you want to access is authenticated to be the site you think you’re accessing. This process is called the TLS establishment and SSL handshake. Before you even see a pixel of content, your browser and the website will exchange information up to seven times: three to perform TLS establishment, and three more to perform the SSL handshake.

On top of that, when we load a webpage after we establish encryption and verify website authority, we might be asking the browser to load hundreds of different files across dozens of different domains. Some of these files can be loaded in parallel, but others need to be loaded sequentially. As the browser races to compile all these different files, it’s the speed at which it can get to the server and back that determines how fast it can put the page together. The files are often quite small, but there’s a lot of them.

The chart below shows the beginning of what the browser does when it loads cnn.com. After a 301 redirect, the browser loads the main HTML page in step two. Only then, more than 1 second into the load, does it know about all the javascript files it needs to render the page. Files 3-19 become unblocked once the browser has the HTML file in step two, but we see more blocking later on. Files 20-27 are all blocked on earlier files. They can’t start until the browser has the previous file back from the server. There are 650 assets in this page load, and the blocking happens all the way through the page load. Here’s why this matters: better latency makes every file load faster, which in turn unblocks other files faster, and so on.

The browser will do its best to use all the bandwidth available, but often is using just a fraction of what’s available. It’s no wonder then that adding more bandwidth doesn’t speed up the page load, but better latency does. While developments like Early Hints help this by allowing browsers to pre-fetch files and content that don’t need to be serialized, this is still a problem for many websites on the Internet today.

Recently, Internet researchers have turned their attention to using our understanding of the relationship between throughput and latency to improve Internet Quality of Experience (QoE). A paper from the Broadband Internet Technical Advisory Group (BITAG) summarizes:

But we now recognize that it is not just greater throughput that matters, but also consistently low latency. Unfortunately, the way that we’ve historically understood and characterized latency was flawed, and our latency measurements and metrics were not aligned with end-user QoE.

There’s a difference between latency on an idle Internet connection and latency measured in working conditions, which we call “working latency” or “responsiveness”. Since responsiveness is what the user experiences as the speed of their Internet connection, it’s important to understand and measure this particular latency.

An Internet connection can suffer from poor responsiveness (even if it has good idle latency) when data is delayed in buffers. If you download a large file, for example an operating system update, the server sending the file might send the file with higher throughput than the Internet connection can accept. That’s ok. Extra bits of the file will sit in a buffer until it’s their turn to go through the funnel. Adding extra lanes to the highway allows more cars to pass through, and is a good strategy if we aren’t particularly concerned with the speed of the traffic.

Say for example, Christabel is watching a stream of the news while on a video meeting. When Christabel starts watching the video, her browser fetches a bunch of content and stores it in various buffers on the way from the content host to the browser. Those same buffers also contain data packets pertaining to the video meeting Christabel is currently in. If the data generated as part of a video conference sits in the same buffer as the video files, the video files will fill up the buffer and cause delay for the video meeting packets as well. That type of buffering delay is called bufferbloat, and leads to jittery video calls and delays where people speak over each other and get frustrated.

To help users understand the strengths and weaknesses of their connection, we recently added Aggregated Internet Measurement (AIM) scores to our own “Speed” Test. These scores remove the technical metrics and give users a real-world, plain-English understanding of what their connection will be good at, and where it might struggle. We’d also like to collect more data from our speed test to help track Page Load Times (PLT) and see how they are correlated with the reduction of lower working latency. You’ll start seeing those numbers on our speed test soon!

We all use our Internet connections in slightly different ways, but we share the desire for our connections to be as fast as possible. As more and more services move into the cloud – word documents, music, websites, communications, etc – the speed at which we can access those services becomes critical. While bandwidth plays a part, the latency of the connection – the real Internet “speed” – is more important.

At Cloudflare, we’re working every day to help build a more performant Internet. Want to help? Apply for one of our open engineering roles here.

Cloudflare Access is the fastest Zero Trust proxy

2023-03-17 David Tuber

Post Syndicated from David Tuber original https://blog.cloudflare.com/network-performance-update-security-week-2023/

Cloudflare Access is the fastest Zero Trust proxy

During every Innovation Week, Cloudflare looks at our network’s performance versus our competitors. In past weeks, we’ve focused on how much faster we are compared to reverse proxies like Akamai, or platforms that sell serverless compute that compares to our Supercloud, like Fastly and AWS. This week, we’d like to provide an update on how we compare to other reverse proxies as well as an update to our application services security product comparison against Zscaler and Netskope. This product is part of our Zero Trust platform, which helps secure applications and Internet experiences out to the public Internet, as opposed to our reverse proxy which protects your websites from outside users.

In addition to our previous post showing how our Zero Trust platform compared against Zscaler, we also have previously shared extensive network benchmarking results for reverse proxies from 3,000 last mile networks around the world. It’s been a while since we’ve shown you our progress towards being #1 in every last mile network. We want to show that data as well as revisiting our series of tests comparing Cloudflare Access to Zscaler Private Access and Netskope Private Access. For our overall network tests, Cloudflare is #1 in 47% of the top 3,000 most reported networks. For our application security tests, Cloudflare is 50% faster than Zscaler and 75% faster than Netskope.

In this blog we’re going to talk about why performance matters for our products, do a deep dive on what we’re measuring to show that we’re faster, and we’ll talk about how we measured performance for each product.

Why does performance matter?

We talked about it in our last blog, but performance matters because it impacts your employees’ experience and their ability to get their job done. Whether it’s accessing services through access control products, connecting out to the public Internet through a Secure Web Gateway, or securing risky external sites through Remote Browser Isolation, all of these experiences need to be frictionless.

A quick summary: say Bob at Acme Corporation is connecting from Johannesburg out to Slack or Zoom to get some work done. If Acme’s Secure Web Gateway is located far away from Bob in London, then Bob’s traffic may go out of Johannesburg to London, and then back into Johannesburg to reach his email. If Bob tries to do something like a voice call on Slack or Zoom, his performance may be painfully slow as he waits for his emails to send and receive. Zoom and Slack both recommend low latency for optimal performance. That extra hop Bob has to take through his gateway could decrease throughput and increase his latency, giving Bob a bad experience.

As we’ve discussed before, if these products or experiences are slow, then something worse might happen than your users complaining: they may find ways to turn off the products or bypass them, which puts your company at risk. A Zero Trust product suite is completely ineffective if no one is using it because it’s slow. Ensuring Zero Trust is fast is critical to the effectiveness of a Zero Trust solution: employees won’t want to turn it off and put themselves at risk if they barely know it’s there at all.

Much like Zscaler, Netskope may outperform many older, antiquated solutions, but their network still fails to measure up to a highly performant, optimized network like Cloudflare’s. We’ve tested all of our Zero Trust products against Netskope equivalents, and we’re even bringing back Zscaler to show you how Zscaler compares against them as well. So let’s dig into the data and show you how and why we’re faster in a critical Zero Trust scenario, comparing Cloudflare Access to Zscaler Private Access and Netskope Private Access.

Cloudflare Access: the fastest Zero Trust proxy

Access control needs to be seamless and transparent to the user: the best compliment for a Zero Trust solution is employees barely notice it’s there. These services allow users to cache authentication information on the provider network, ensuring applications can be accessed securely and quickly to give users that seamless experience they want. So having a network that minimizes the number of logins required while also reducing the latency of your application requests will help keep your Internet experience snappy and reactive.

Cloudflare Access does all that 75% faster than Netskope and 50% faster than Zscaler, ensuring that no matter where you are in the world, you’ll get a fast, secure application experience:

Cloudflare measured application access across ourselves, Zscaler and Netskope from 300 different locations around the world connecting to 6 distinct application servers in Hong Kong, Toronto, Johannesburg, São Paulo, Phoenix, and Switzerland. In each of these locations, Cloudflare’s P95 response time was faster than Zscaler and Netskope. Let’s take a look at the data when the application is hosted in Toronto, an area where Zscaler and Netskope should do well as it’s in a heavily interconnected region: North America.

ZT Access – Response time (95th Percentile) – Toronto
	95th Percentile Response (ms)
Cloudflare	2,182
Zscaler	4,071
Netskope	6,072

Cloudflare really stands out in regions with more diverse connectivity options like South America or Asia Pacific, where Zscaler compares better to Netskope than it does Cloudflare:

When we look at application servers hosted locally in South America, Cloudflare stands out:

ZT Access – Response time (95th Percentile) – South America
	95th Percentile Response (ms)
Cloudflare	2,961
Zscaler	9,271
Netskope	8,223

Cloudflare’s network shines here, allowing us to ingress connections close to the users. You can see this by looking at the Connect times in South America:

ZT Access – Connect time (95th Percentile) – South America
	95th Percentile Connect (ms)
Cloudflare	369
Zscaler	1,753
Netskope	1,160

Cloudflare’s network sets us apart here because we’re able to get users onto our network faster and find the optimal routes around the world back to the application host. We’re twice as fast as Zscaler and three times faster than Netskope because of this superpower. Across all the different tests, Cloudflare’s Connect times is consistently faster across all 300 testing nodes.

In our last blog, we looked at two distinct scenarios that need to be measured individually when we compared Cloudflare and Zscaler. The first scenario is when a user logs into their application and has to authenticate. In this case, the Zero Trust Access service will direct the user to a login page, the user will authenticate, and then be redirected to their application.

This is called a new session, because no authentication information is cached or exists on the Access network. The second scenario is called an existing session, when a user has already been authenticated and that authentication information can be cached. This scenario is usually much faster, because it doesn’t require an extra call to an identity provider to complete.

We like to measure these scenarios separately, because when we look at 95th percentile values, we would almost always be looking at new sessions if we combined new and existing sessions together. But across both scenarios, Cloudflare is consistently faster in every region. Let’s go back and look at an application hosted in Toronto, where users connecting to us connect faster than Zscaler and Netskope for both new and existing sessions.

ZT Access – Response Time (95th Percentile) – Toronto
	New Sessions (ms)	Existing Sessions (ms)
Cloudflare	1,276	1,022
Zscaler	2,415	1,797
Netskope	5,741	1,822

You can see that new sessions are generally slower as expected, but Cloudflare’s network and optimized software stack provides a consistently fast user experience. In scenarios where end-to-end connectivity can be more challenging, Cloudflare stands out even more. Let’s take a look at users in Asia connecting through to an application in Hong Kong.

ZT Access – Response Time (95th Percentile) – Hong Kong
	New Sessions (ms)	Existing Sessions (ms)
Cloudflare	2,582	2,075
Zscaler	4,956	3,617
Netskope	5,139	3,902

One interesting thing that stands out here is that while Cloudflare’s network is hyper-optimized for performance, Zscaler more closely compares to Netskope on performance than they do to Cloudflare. Netskope also performs poorly on new sessions, which indicates that their service does not react well when users are establishing new sessions.

We like to separate these new and existing sessions because it’s important to look at similar request paths to do a proper comparison. For example, if we’re comparing a request via Zscaler on an existing session and a request via Cloudflare on a new session, we could see that Cloudflare was much slower than Zscaler because of the need to authenticate. So when we contracted a third party to design these tests, we made sure that they took that into account.

For these tests, Cloudflare configured five application instances hosted in Toronto, Los Angeles, Sao Paulo, and Hong Kong. Cloudflare then used 300 different Catchpoint nodes from around the world to mimic a browser login as follows:

User connects to the application from a browser mimicked by a Catchpoint instance – new session
User authenticates against their identity provider
User accesses resource
User refreshes the browser page and tries to access the same resource but with credentials already present – existing session

This allows us to look at Cloudflare versus all the other products for application performance for both new and existing sessions, and we’ve shown that we’re faster. As we’ve mentioned, a lot of that is due to our network and how we get close to our users. So now we’re going to talk about how we compare to other large networks and how we get close to you.

Network effects make the user experience better

Getting closer to users improves the last mile Round Trip Time (RTT). As we discussed in the Access comparison, having a low RTT improves customer performance because new and existing sessions don’t have to travel very far to get to Cloudflare’s Zero Trust network. Embedding ourselves in these last mile networks helps us get closer to our users, which doesn’t just help Zero Trust performance, it helps web performance and developer performance, as we’ve discussed in prior blogs.

To quantify network performance, we have to get enough data from around the world, across all manner of different networks, comparing ourselves with other providers. We used Real User Measurements (RUM) to fetch a 100kb file from several different providers. Users around the world report the performance of different providers. The more users who report the data, the higher fidelity the signal is. The goal is to provide an accurate picture of where different providers are faster, and more importantly, where Cloudflare can improve. You can read more about the methodology in the original Speed Week 2021 blog post here.

We are constantly going through the process of figuring out why we were slow — and then improving. The challenges we faced were unique to each network and highlighted a variety of different issues that are prevalent on the Internet. We’re going to provide an overview of some of the efforts we use to improve our performance for our users.

But before we do, here are the results of our efforts since Developer Week 2022, the last time we showed off these numbers. Out of the top 3,000 networks in the world (by number of IPv4 addresses advertised), here’s a breakdown of the number of networks where each provider is number one in p95 TCP Connection Time, which represents the time it takes for a user on a given network to connect to the provider:

Here’s what those numbers look like as of this week, Security Week 2023:

As you can see, Cloudflare has extended its lead in being faster in more networks, while other networks that previously were faster like Akamai and Fastly lost their lead. This translates to the effects we see on the World Map. Here’s what that world map looked like in Developer Week 2022:

Here’s how that world map looks today during Security Week 2023:

As you can see, Cloudflare has gotten faster in Brazil, many countries in Africa including South Africa, Ethiopia, and Nigeria, as well as Indonesia in Asia, and Norway, Sweden, and the UK in Europe.

A lot of these countries benefited from the Edge Partner Program that we discussed in the Impact Week blog. A quick refresher: the Edge Partner Program encourages last mile ISPs to partner with Cloudflare to deploy Cloudflare locations that are embedded in the last mile ISP. This improves the last mile RTT and improves performance for things like Access. Since we last showed you this map, Cloudflare has deployed more partner locations in places like Nigeria, and Saudi Arabia, which have improved performance for users in all scenarios. Efforts like the Edge Partner Program help improve not just the Zero Trust scenarios like we described above, but also the general web browsing experience for end users who use websites protected by Cloudflare.

Next-generation performance in a Zero Trust world

In a non-Zero Trust world, you and your IT teams were the network operator — which gave you the ability to control performance. While this control was comforting, it was also a huge burden on your IT teams who had to manage middle mile connections between offices and resources. But in a Zero Trust world, your network is now… well, it’s the public Internet. This means less work for your teams — but a lot more responsibility on your Zero Trust provider, which has to manage performance for every single one of your users. The better your Zero Trust provider is at improving end-to-end performance, the better an experience your users will have and the less risk you expose yourself to. For real-time applications like authentication and secure web gateways, having a snappy user experience is critical.

A Zero Trust provider needs to not only secure your users on the public Internet, but it also needs to optimize the public Internet to make sure that your users continuously stay protected. Moving to Zero Trust doesn’t just reduce the need for corporate networks, it also allows user traffic to flow to resources more naturally. However, given your Zero Trust provider is going to be the gatekeeper for all your users and all your applications, performance is a critical aspect to evaluate to reduce friction for your users and reduce the likelihood that users will complain, be less productive, or turn the solutions off. Cloudflare is constantly improving our network to ensure that users always have the best experience, through programs like the Edge Partner Program and constantly improving our peering and interconnectivity. It’s this tireless effort that makes us the fastest Zero Trust provider.

Cloudflare is faster than Zscaler

2023-01-09 David Tuber

Post Syndicated from David Tuber original https://blog.cloudflare.com/network-performance-update-cio-edition/

Cloudflare is faster than Zscaler

Every Innovation Week, Cloudflare looks at our network’s performance versus our competitors. In past weeks, we’ve focused on how much faster we are compared to reverse proxies like Akamai, or platforms that sell edge compute that compares to our Supercloud, like Fastly and AWS. For CIO Week, we want to show you how our network stacks up against competitors that offer forward proxy services. These products are part of our Zero Trust platform, which helps secure applications and Internet experiences out to the public Internet, as opposed to our reverse proxy which protects your websites from outside users.

We’ve run a series of tests comparing our Zero Trust services with Zscaler. We’ve compared our ZT Application protection product Cloudflare Access against Zscaler Private Access (ZPA). We’ve compared our Secure Web Gateway, Cloudflare Gateway, against Zscaler Internet Access (ZIA), and finally our Remote Browser Isolation product, Cloudflare Browser Isolation, against Zscaler Cloud Browser Isolation. We’ve found that Cloudflare Gateway is 58% faster than ZIA in our tests, Cloudflare Access is 38% faster than ZPA worldwide, and Cloudflare Browser Isolation is 45% faster than Zscaler Cloud Browser Isolation worldwide. For each of these tests, we used 95th percentile Time to First Byte and Response tests, which measure the time it takes for a user to make a request, and get the start of the response (Time to First Byte), and the end of the response (Response). These tests were designed with the goal of trying to measure performance from an end-user perspective.

In this blog we’re going to talk about why performance matters for each of these products, do a deep dive on what we’re measuring to show that we’re faster, and we’ll talk about how we measured performance for each product.

Why does performance matter?

Performance matters because it impacts your employees’ experience and their ability to get their job done. Whether it’s accessing services through access control products, connecting out to the public Internet through a Secure Web Gateway, or securing risky external sites through Remote Browser Isolation, all of these experiences need to be frictionless.

Say Anna at Acme Corporation is connecting from Sydney out to Microsoft 365 or Teams to get some work done. If Acme’s Secure Web Gateway is located far away from Anna in Singapore, then Anna’s traffic may go out of Sydney to Singapore, and then back into Sydney to reach her email. If Acme Corporation is like many companies that require Anna to use Microsoft Outlook in online mode, her performance may be painfully slow as she waits for her emails to send and receive. Microsoft 365 recommends keeping latency as low as possible and bandwidth as high as possible. That extra hop Anna has to take through her gateway could decrease throughput and increase her latency, giving Anna a bad experience.

In another example, if Anna is connecting to a hosted, protected application like Jira to complete some tickets, she doesn’t want to be waiting constantly for pages to load or to authenticate her requests. In an access-controlled application, the first thing you do when you connect is you log in. If that login takes a long time, you may get distracted with a random message from a coworker or maybe will not want to tackle any of the work at all. And even when you get authenticated, you still want your normal application experience to be snappy and smooth: users should never notice Zero Trust when it’s at its best.

If these products or experiences are slow, then something worse might happen than your users complaining: they may find ways to turn off the products or bypass them, which puts your company at risk. A Zero Trust product suite is completely ineffective if no one is using it because it’s slow. Ensuring Zero Trust is fast is critical to the effectiveness of a Zero Trust solution: employees won’t want to turn it off and put themselves at risk if they barely know it’s there at all.

Services like Zscaler may outperform many older, antiquated solutions, but their network still fails to measure up to a highly performant, optimized network like Cloudflare’s. We’ve tested all of our Zero Trust products against Zscaler’s equivalents, and we’re going to show you that we’re faster. So let’s dig into the data and show you how and why we’re faster in three critical Zero Trust scenarios, starting with Secure Web Gateway: comparing Cloudflare Gateway to Zscaler Internet Access (ZIA).

Cloudflare Gateway: a performant secure web gateway at your doorstep

A secure web gateway needs to be fast because it acts as a funnel for all of an organization’s Internet-bound traffic. If a secure web gateway is slow, then any traffic from users out to the Internet will be slow. If traffic out to the Internet is slow, then users may be prompted to turn off the Gateway, putting the organization at risk of attack.

But in addition to being close to users, a performant web gateway needs to also be well-peered with the rest of the Internet to avoid slow paths out to websites users want to access. Remember that traffic through a secure web gateway follows a forward proxy path: users connect to the proxy, and the proxy connects to the websites users are trying to access. Therefore, it behooves the proxy to be well-connected to ensure that the user traffic can get where it needs to go as fast as possible.

When comparing secure web gateway products, we pitted the Cloudflare Gateway and WARP client against Zscaler Internet Access (ZIA), which performs the same functions. Fortunately for Cloudflare users, Gateway and Cloudflare’s network is not only embedded deep into last mile networks close to users, but is also one of the most well peered networks in the world. We use our most peered network to be 55% faster than ZIA for Gateway user scenarios. Below is a box plot showing the 95th percentile response time for Cloudflare, Zscaler, and a control set that didn’t use a gateway at all:

	Secure Web Gateway – Response Time
	95th percentile (ms)
Control	142.22
Cloudflare	163.77
Zscaler	365.77

This data shows that not only is Cloudflare much faster than Zscaler for Gateway scenarios, but that Cloudflare is actually more comparable to not using a secure web gateway at all rather than Zscaler.

To best measure the end-user Gateway experience, we are looking at 95th percentile response time from the end-user: we’re measuring how long it takes for a user to go through the proxy, have the proxy make a request to a website on the Internet, and finally return the response. This measurement is important because it’s an accurate representation of what users see.

When we measured against Zscaler, we had our end user client try to access five different websites: a website hosted in Azure, a Cloudflare-protected Worker, Google, Slack, and Zoom: websites that users would connect to on a regular basis. In each of those instances, Cloudflare outperformed Zscaler, and in the case of the Cloudflare-protected Worker, Gateway even outperformed the control for 95th percentile Response time. Here is a box plot showing the 95th percentile responses broken down by the different endpoints we queried as a part of our tests:

No matter where you go on the Internet, Cloudflare’s Gateway outperforms Zscaler Internet Access (ZIA) when you look at end-to-end response times. But why are we so much faster than Zscaler? The answer has to do with something that Zscaler calls proxy latency.

Proxy latency is the amount of time a user request spends on a Zscaler machine before being sent to its destination and back to the user. This number completely excludes the time it takes a user to reach Zscaler, and the time it takes Zscaler to reach the destination and restricts measurement to the milliseconds Zscaler spends processing requests.

Zscaler’s latency SLA says that 95% of your requests will spend less than 100 ms on a Zscaler device. Zscaler promises that the latency they can measure on their edge, not the end-to-end latency that actually matters, will be 100ms or less for 95% of user requests. You can even see those metrics in Zscaler’s Digital Experience to measure for yourself. If we can get this proxy latency from Zscaler logs and compare it to the Cloudflare equivalent, we can see how we stack up to Zscaler’s SLA metrics. While we don’t yet have those metrics exposed to customers, we were able to enable tracing on Cloudflare to measure the Cloudflare proxy latency.

The results show that at the 95th percentile, Zscaler was exceeding their SLA, and the Cloudflare proxy latency was 7ms. Furthermore, when our proxy latency was 100ms (meeting the Zscaler SLA), their proxy latencies were over 10x ours. Zscaler’s proxy latency accounts for the difference in performance we saw at the 95th percentile, being anywhere between 140-240 ms slower than Cloudflare for each of the sites. Here are the Zscaler proxy latency values at different percentiles for all sites tested, and then broken down by each site:

Zscaler Internet Access (ZIA)	P90 Proxy Latency (ms)	P95 Proxy Latency (ms)	P99 Proxy Latency (ms)	P99.9 Proxy Latency (ms)	P99.957 Proxy Latency (ms)
Global	06.0	142.0	625.0	1,071.7	1,383.7
Azure Site	97.0	181.0	458.5	1,032.7	1,291.3
Zoom	206.0	254.2	659.8	1,297.8	1,455.4
Slack	118.8	186.2	454.5	1,358.1	1,625.8
Workers Site	97.8	184.1	468.3	1,246.2	1,288.6
Google	13.7	100.8	392.6	848.9	1,115.0

At the 95th percentile, not only were their proxy latencies out of SLA, those values show the difference between Zscaler and Cloudflare: taking Zoom as an example, if Zscaler didn’t have the proxy latency, they would be on-par with Cloudflare and the control. Cloudflare’s equivalent of proxy latency is so small that using us is just like using the public Internet:

Cloudflare Gateway	P90 Proxy Latency (ms)	P95 Proxy Latency (ms)	P99 Proxy Latency (ms)	P99.9 Proxy Latency (ms)	P99.957 Proxy Latency (ms)
Global	5.6	7.2	15.6	32.2	101.9
Tubes Test	6.2	7.7	12.3	18.1	19.2
Zoom	5.1	6.2	9.6	25.5	31.1
Slack	5.3	6.5	10.5	12.5	12.8
Workers	5.1	6.1	9.4	17.3	20.5
Google	5.3	7.4	12.0	26.9	30.2

The 99.957 percentile may seem strange to include, but it marks the percentile at which Cloudflare’s proxy latencies finally exceeded 100ms. Cloudflare’s 99.957 percentile proxy latency is even faster than Zscaler’s 90th percentile proxy latency. Even on the metric Zscaler cares about and holds themselves accountable for despite proxy latency not being the metric customers care about, Cloudflare is faster.

Getting this view of data was not easy. Existing testing frameworks like Catchpoint are unsuitable for this task because performance testing requires that you run the ZIA client or the WARP client on the testing endpoint. We also needed to make sure that the Cloudflare test and the Zscaler test are running on similar machines in the same place to measure performance as good as possible. This allows us to measure the end-to-end responses coming from the same location where both test environments are running:

In our setup, we put three VMs in the cloud side by side: one running Cloudflare WARP connecting to our Gateway, one running ZIA, and one running no proxy at all as a control. These VMs made requests every three minutes to the five different endpoints mentioned above and logged the HTTP browser timings for how long each request took. Based on this, we are able to get a user-facing view of performance that is meaningful.

A quick summary so far: Cloudflare is faster than Zscaler when we’re protecting users from the public Internet through a secure web gateway from an end-user perspective. Cloudflare is even faster than Zscaler according to Zscaler’s small definition of what performance through a secure web gateway means. But let’s take a look at scenarios where you need access for specific applications through Zero Trust access.

Cloudflare Access: the fastest Zero Trust proxy

Access control needs to be seamless and transparent to the user: the best compliment for a Zero Trust solution is employees barely notice it’s there. Services like Cloudflare Access and Zscaler Private Access (ZPA) allow users to cache authentication information on the provider network, ensuring applications can be accessed securely and quickly to give users that seamless experience they want. So having a network that minimizes the number of logins required while also reducing the latency of your application requests will help keep your Internet experience snappy and reactive.

Cloudflare Access does all that 38% faster than Zscaler Private Access (ZPA), ensuring that no matter where you are in the world, you’ll get a fast, secure application experience:

	ZT Access – Time to First Byte (Global)
	95th Percentile (ms)
Cloudflare	849
Zscaler	1,361

When we drill into the data, we see that Cloudflare is consistently faster everywhere around the world. For example, take Tokyo, where Cloudflare’s 95th percentile time to first byte times are 22% faster than Zscaler:

	ZT Access – 95th Percentile Time to First Byte (Chicago)
	New Sessions (ms)	Existing Sessions (ms)
Cloudflare	1,032	293
Zscaler	1,373	338

When we evaluate Cloudflare against Zscaler for application access scenarios, we are looking at two distinct scenarios that need to be measured individually. The first scenario is when a user logs into their application and has to authenticate. In this case, the Zero Trust Access service will direct the user to a login page, the user will authenticate, and then be redirected to their application.

We like to measure these scenarios separately, because when we look at 95th percentile values, we would almost always be looking at new sessions if we combined new and existing sessions together. But across both scenarios, Cloudflare is consistently faster in every region. Here’s how this data looks when you take a location Zscaler is more likely to have good peering in: users in Chicago, IL connecting to an application hosted in US-Central.

	ZT Access – 95th Percentile Time to First Byte (Chicago)
	New Sessions (ms)	Existing Sessions (ms)
Cloudflare	1,032	293
Zscaler	1,373	338

Cloudflare is faster overall there as well. Here’s a histogram of 95th percentile response times for new connections overall:

You’ll see that Cloudflare’s network really gives a performance boost on login, helping find optimal paths back to authentication providers to retrieve login details. In this test, Cloudflare never takes more than 2.5 seconds to return a login response, but half of Zscaler’s 95th percentile responses are almost double that, at around four seconds. This would suggest that Zscaler’s network isn’t peered as well, which causes latency early on. But it may also suggest that Zscaler may do better when the connection is established and everything is cached. But on an existing connection, Cloudflare still comes out ahead:

Zscaler and Cloudflare do match up more evenly at lower latency buckets, but Cloudflare’s response times are much more consistent, and you can see that Zscaler has half of their responses which take almost a second to load. This further highlights how well-connected we are: because we’re in more places, we provide a better application experience, and we don’t have as many edge cases with high latency and poor application performance.

For these tests, Cloudflare contracted Miercom, a third party who performed a set of tests that was intended to replicate an end-user connecting to a resource protected by Cloudflare or Zscaler. Miercom set up application instances in 12 locations around the world, and devised a test that would log into the application through various Zero Trust providers to access certain content. The test methodology is described as follows, but you can look at the full report from Miercom detailing their test methodology here:

User connects to the application from a browser mimicked by a Catchpoint instance – new session
User authenticates against their identity provider
User accesses resource
User refreshes the browser page and tries to access the same resource but with credentials already present – existing session

This allows us to look at Cloudflare versus Zscaler for application performance for both new and existing sessions, and we’ve shown that we’re faster. We’re faster in secure web gateway scenarios too.

But what if you want to access resources on the public Internet and you don’t have a ZT client on your device? To do that, you’ll need remote browser isolation.

Cloudflare Browser Isolation: your friendly neighborhood web browser

Cloudflare once again is faster than Zscaler for remote browser isolation performance. Comparing 95th percentile time to first byte, Cloudflare is 45% faster than Zscaler across all regions:

	ZT RBI – Time to First Byte (Global)
	95th Percentile (ms)
Cloudflare	2,072
Zscaler	3,781

When you compare the total response time or the ability for a browser isolation product to deliver a full response back to a user, Cloudflare is still 39% faster than Zscaler:

	ZT RBI – Time to First Byte (Global)
	95th Percentile (ms)
Cloudflare	2,394
Zscaler	3,932

Cloudflare’s network really shines here to help deliver the best user experience to our customers. Because Cloudflare’s network is incredibly well-peered close to end-user devices, we are able to drive down our time to first byte and response times, helping improve the end-user experience.

To measure this, we went back to Miercom to help get us the data we needed by having Catchpoint nodes connect to Cloudflare Browser Isolation and Zscaler Cloud Browser Isolation across the world from the same 14 locations and had devices simulating clients try to reach applications through the browser isolation products in each locale. For more on the test methodology, you can refer to that same Miercom report, linked here.

Next-generation performance in a Zero Trust world

A Zero Trust provider needs to not only secure your users on the public Internet, but it also needs to optimize the public Internet to make sure that your users continuously stay protected. Moving to Zero Trust doesn’t just reduce the need for corporate networks, it also allows user traffic to flow to resources more naturally. However, given your Zero Trust provider is going to be the gatekeeper for all your users and all your applications, performance is a critical aspect to evaluate to reduce friction for your users and reduce the likelihood that users will complain, be less productive, or turn the solutions off. Cloudflare is constantly improving our network to ensure that users always have the best experience, and this comes not just from routing fixes, but also through expanding peering arrangements and adding new locations. It’s this tireless effort that makes us the fastest Zero Trust provider.

Check out our compare page for more detail on how Cloudflare’s network architecture stacks up against Zscaler.

A new era for Cloudflare Pages builds

2022-05-10 Nevi Shah

Post Syndicated from Nevi Shah original https://blog.cloudflare.com/cloudflare-pages-build-improvements/

A new era for Cloudflare Pages builds

Music is flowing through your headphones. Your hands are flying across the keyboard. You’re stringing together a masterpiece of code. The momentum is building up as you put on the finishing touches of your project. And at last, it’s ready for the world to see. Heart pounding with excitement and the feeling of victory, you push changes to the main branch…. only to end up waiting for the build to execute each step and spit out the build logs.

Starting afresh

Since the launch of Cloudflare Pages, there is no doubt that the build experience has been its biggest source of criticism. From the amount of waiting to inflexibility of CI workflow, Pages had a lot of opportunity for growth and improvement. With Pages, our North Star has always been designing a developer platform that fits right into your workflow and oozes simplicity. User pain points have been and always will be our priority, which is why today we are thrilled to share a list of exciting updates to our build times, logs and settings!

Over the last three quarters, we implemented a new build infrastructure that speeds up Pages builds, so you can iterate quickly and efficiently. In February, we soft released the Pages Fast Builds Beta, allowing you to opt in to this new infrastructure on a per-project basis. This not only allowed us to test our implementation, but also gave our community the opportunity to try it out and give us direct feedback in Discord. Today we are excited to announce the new build infrastructure is now generally available and automatically enabled for all existing and new projects!

Faster build times

As a developer, your time is extremely valuable, and we realize Pages builds were slow. It was obvious that creating an infrastructure that built projects faster and smarter was one of our top requirements.

Looking at a Pages build, there are four main steps: (1) initializing the build environment, (2) cloning your git repository, (3) building the application, and (4) deploying to Cloudflare’s global network. Each of these steps is a crucial part of the build process, and upon investigating areas suitable for optimization, we directed our efforts to cutting down on build initialization time.

In our old infrastructure, every time a build job was submitted, we created a new virtual machine to run that build, costing our users precious dev time. In our new infrastructure, we start jobs on machines that are ready and waiting to be used, taking a major chunk of time away from the build initialization step. This step previously ran for 2+ minutes, but with our new infrastructure update, projects are expected to see a build initialization time cut down to 2-3 SECONDS.

This means less time waiting and more time iterating on your code.

Fast and secure

In our old build infrastructure, because we spun up a new virtual machine (VM) for every build, it would take several minutes to boot up and initialize with the Pages build image needed to execute the build. Alternatively, one could reuse a collection of VMs, assigning a new build to the next available VM, but containers share a kernel with the host operating system, making them far less isolated, posing a huge security risk. This could allow a malicious actor to perform a “container escape” to break out of their sandbox. We wanted the best of both worlds: the speed of a container with the isolation of a virtual machine.

Enter gVisor, a container sandboxing technology that drastically limits the attack surface of a host. In the new infrastructure, each container running with gVisor is given its own independent application “kernel,” instead of directly sharing the kernel with its host. Then, to address the speed, we keep a cluster of virtual machines warm and ready to execute builds so that when a new Pages deployment is triggered, it takes just a few seconds for a new gVisor container to start up and begin executing meaningful work in a secure sandbox with near native performance.

Stream your build logs

After we solidified a fast and secure build, we wanted to enhance the user facing build experience. Because a build may not be successful every time, providing you with the tools you need to debug and access that information as fast as possible is crucial. While we have a long list of future improvements for a better logging experience, today we are starting by enabling you to stream your build logs.

Prior to today, with the aforementioned build steps required to complete a Pages build, you were required to wait until the build completed in order to view the resulting build logs. Easily addressable issues like incorrectly inputting the build command or specifying an environment variable would have required waiting for the entire build to finish before understanding the problem.

Today, we’re giving you the power to understand your build issues as soon as they happen. Spend less time waiting for your logs and start debugging the events of your builds within a second or less after they happen!

Control Branch Builds

Finally, the build experience does not just include the events during execution but everything leading up to the trigger of a build. For our final trick, we’re enabling our users to have full control of the precise branches they’d like to include and exclude for automatic deployments.

Before today, Pages submitted builds for every commit in both production and preview environments, which led to queued builds and even more waiting if you exceeded your concurrent build limit. We wanted to provide even more flexibility to control your CI workflow. Now you can configure your build settings to specify branches to build, as well as skip ad hoc commits.

Specify branches to build

While “unlimited staging” is one of Pages’ greatest advantages, depending on your setup, sometimes automatic deployments to the preview environment can cause extra noise.

In the Pages build configuration setting, you can specify automatic deployments to be turned off for the production environment, the preview environment, or specific preview branches. In a more extreme case, you can even pause all deployments so that any commit sent to your git source will not trigger a new Pages build.

Additionally, in your project’s settings, you can now configure the specific Preview branches you would like to include and exclude for automatic deployments. To make this configuration an even more powerful tool, you can use wildcard syntax to set rules for existing branches as well as any newly created preview branches.

Read more in our Pages docs on how to get started with configuring automatic deployments with Wildcard Syntax.

Using CI Skip

Sometimes commits need to be skipped on an ad hoc basis. A small update to copy or a set of changes within a small timespan don’t always require an entire site rebuild. That’s why we also implemented a CI Skip command for your commit message, signaling to Pages that the update should be skipped by our builder.

With both CI Skip and configured build rules, you can keep track of your site changes in Pages’ deployment history.

Where we’re going

We’re extremely excited to bring these updates to you today, but of course, this is only the beginning of improving our build experience. Over the next few quarters, we will be bringing more to the build experience to create a seamless developer journey from site inception to launch.

Incremental builds and caching

From beta testing, we noticed that our new infrastructure can be less impactful on larger projects that use heavier frameworks such as Gatsby. We believe that every user on our developer platform, regardless of their use case, has the right to fast builds. Up next, we will be implementing incremental builds to help Pages identify only the deltas between commits and rebuild only files that were directly updated. We will also be implementing other caching strategies such as caching external dependencies to save time on subsequent builds.

Build image updates

Because we’ve been using the same build image we launched Pages with back in 2021, we are going to make some major updates. Languages release new versions all the time, and we want to make sure we update and maintain the latest versions. An updated build image will mean faster builds, more security and of course supporting all the latest versions of languages and tools we provide. With new build image versions being released, we will allow users to opt in to the updated builds in order to maintain compatibility with all existing projects.

Productive error messaging

Lastly, while streaming build logs helps you to identify those easily addressable issues, the infamous “Internal error occurred” is sometimes a little more cryptic to decipher depending on the failure. While we recently published a “Debugging Cloudflare Pages” guide, in the future we’d like to provide the error feedback in a more productive manner, so you can easily identify the issue.

Have feedback?

As always, your feedback defines our roadmap. With all the updates we’ve made to our build experience, it’s important we hear from you! You can get in touch with our team directly through Discord. Navigate to our Pages specific section and check out our various channels specific to different parts of the product!

Join us at Cloudflare Connect!

Interested in learning more about building with Cloudflare Pages? If you’re based in the New York City area, join us on Thursday, May 12th for a series of workshops on how to build a full stack application on Pages! Follow along with a fully hands-on lab, featuring Pages in conjunction with other products like Workers, Images and Cloudflare Gateway, and hear directly from our product managers. Register now!

Benchmarking Edge Network Performance: Akamai, Cloudflare, AWS CloudFront, Fastly, and Google

2021-09-17 John Graham-Cumming

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/benchmarking-edge-network-performance/

Benchmarking Edge Network Performance: Akamai, Cloudflare, AWS CloudFront, Fastly, and Google

During Speed Week we’ve talked a lot about services that make the web faster. Argo 2.0 for better routing around bad Internet weather, Orpheus to ensure that origins are reachable from anywhere, image optimization to send just the right bits to the client, Tiered Cache to maximize cache hit rates and get the most out of Cloudflare’s new 25% bigger network, our expanded fiber backbone and more.

Those things are all great.

But it’s vital that we also measure the performance of our network and benchmark ourselves against industry players large and small to make sure we are providing the best, fastest service.

We recently ran a measurement experiment where we used Real User Measurement (RUM) data from the standard browser API to test the performance of Cloudflare and others in real-world conditions across the globe. We wanted to use third-party tests for this, but they didn’t have the granularity we wanted. We want to drill down to every single ISP in the world to make sure we optimize everywhere. We knew that in some places the answers we got wouldn’t be good, and we’d need to do work to improve our performance. But without detailed analysis across the entire globe we couldn’t know we were really the fastest or where we needed to improve.

In this blog post I’ll describe how that measurement worked and what the results are. But the short version is: Cloudflare is #1 in almost all cases whether you look at all the networks on the globe, or the top 1,000 largest, or the top 1,000 most active, and whether you look at mean timings or 95th percentile, and you measure how quickly a connection can be established, how long it takes for the first byte to arrive in a user’s web browser, or how long the complete download takes. And we’re not resting here, we’re committed to working network by network globally to ensure that we are always #1.

Why we measured

Commercial Internet measurement services (such as Cedexis, Catchpoint, Pingdom, ThousandEyes) already exist and perform the sorts of RUM measurements that Cloudflare used for this test. And we wanted to ensure that our test was as fair as possible and allowed each network to put its best foot forward.

We subscribe to the third party monitoring services already. And, when we looked at their data we weren’t satisfied.

First, we worried that the way they sampled wasn’t globally representative and was often skewed by measuring from the server, rather than the eyeball, side of the network. Or, even if operating from the eyeball side, could be skewed as artificial or tainted by bots and other automated traffic.

Second, it wasn’t granular enough. It showed our performance by country or region, but didn’t dive into individual networks and therefore obscured the details and outliers behind averages. While we looked good in third party tests, we didn’t trust them to be as thorough and accurate as we wanted. The goal isn’t to pick a test where we looked good. The goal was to be accurate and see where we weren’t good enough yet, so we could focus on those areas and improve. That’s why we had to build this ourselves.

We benchmark against others because it’s useful to know what’s possible. If someone else is faster than we are somewhere then it proves it’s possible. We won’t be satisfied until we’re at least as good as everyone else everywhere. Now we have the granular data to ensure that’ll happen. We plan our next update during Birthday Week when our target is to take 10% of networks where we’re not the fastest and become the fastest.

How we measured

To measure performance we did two things. We created a small internal team to do the measurements separate from the team that manages and optimizes our network. The goal was to show the team where we need to improve.

And to ensure that the other CDNs were tested using their most representative assets we used the very same endpoints that commercial measurement services use on the assumption that our competitors will have already ensured that those endpoints are optimized to show their networks’ best performance.

The measurements in this blog post are based on four days just before Speed Week began (2021-09-10 12:25:02 UTC to 2021-09-13 16:21:10 UTC). We took measurements of downloading exactly the same 100KB PNG file. We categorized them by the network the measurement was taken from. It’s identified by its ASN and a name. We’re not done with these measurements and will continue measuring and optimizing.

A 100KB file is a common test measurement used in the industry and allows us to measure network characteristics like the connection time, but also the total download time.

Before we get into results let’s talk a little about how the Internet fits together. The Internet is a network of networks that cooperate to form the global network that we all use. These networks are identified by a strangely named “autonomous system number” or ASN. The idea is that large networks (like ISPs, or cloud providers, or universities, or mobile phone companies) operate autonomously and then join the global Internet through a protocol called BGP (which we’ve written about in the past).

In a way the Internet is these ASNs and because the Internet is made of them we want to measure our performance for each ASN. Why? Because one part of improving performance is improving our connectivity to each ASN and knowing how we’re doing on a per-network basis helps enormously.

There are roughly 70,000 ASNs in the global Internet and during the measurement period we saw traffic from about 21,000 (exact number: 20,728) of them. This makes sense since not all networks are “external” (as in the source of traffic to websites); many ASNs are intermediaries moving traffic around the Internet.

For the rest of this blog we simply say “network” instead of “ASN”.

What we measured

Getting real user measurement data used to be hard but has been made easy for HTTP endpoints thanks to the Resource Timing API, supported by most modern browsers. This API allows a page to measure network timing data of fetched resources using high-resolution timestamps, accurate to 5 µs (microseconds).

The point of this API is to get timing information that shows how a real end-user experiences the Internet (and not a synthetic test that might measure a single component of all the things that happen when a user browses the web).

The Resource Timing API is supported by pretty much every browser enabling measurement on everything from old versions of Internet Explorer, to mobile browsers on iOS and Android to the latest version of Chrome. Using this API gives us a view of real world use on real world browsers.

We don’t just instruct the browser to download an image too. To make sure that we’re fair and replicate the real-life end-user experience, we make sure that no local caching was involved in the request, check if the object has been compressed by the server or not, take the HTTP headers size into account, and record if the connection has been pre-warmed or not, to name a few technical details.

Here’s a high-level example on how this works:

await fetch("https://example.com/100KB.png?r=7562574954", {
              mode: "cors",
              cache: "no-cache",
              credentials: "omit",
              method: "GET",
})

performance.getEntriesByType("resource");

{
   connectEnd: 1400.3999999761581
   connectStart: 1400.3999999761581
   decodedBodySize: 102400
   domainLookupEnd: 1400.3999999761581
   domainLookupStart: 1400.3999999761581
   duration: 51.60000002384186
   encodedBodySize: 102400
   entryType: "resource"
   fetchStart: 1400.3999999761581
   initiatorType: "fetch"
   name: "https://example.com/100KB.png"
   nextHopProtocol: "h2"
   redirectEnd: 0
   redirectStart: 0
   requestStart: 1406
   responseEnd: 1452
   responseStart: 1428.5
   secureConnectionStart: 1400.3999999761581
   startTime: 1400.3999999761581
   transferSize: 102700
   workerStart: 0
}

To measure the performance of each CDN we downloaded an image from each, when a browser visited one of our special pages. Once every image is downloaded we record the measurements using a Cloudflare Workers based API.

The three measurements: TCP connection time, TTFB and TTLB

We focused on three measurements to illustrate how fast our network is: TCP connection time, TTFB and TTLB. Here’s why those three values matter.

TCP connection time is used to show how well-connected we are to the global Internet as it counts only the time taken for a machine to establish a connection to the remote host (before any higher level protocols are used). The TCP connection time is calculated as connectEnd – connectStart (see the diagram above).

TTFB (or time to first byte) is the time taken once an HTTP request has been sent for the first byte of data to be returned by the server. This is a common measurement used to show how responsive a server is. We calculate TTFB as responseStart – connectStart – (requestStart – connectEnd).

TTLB (or time to last byte) is the time taken to send the entire response to the web browser. It’s a good measure of how long a complete download takes and helps measure how good the server (or CDN) is at sending data. We calculate TTLB as responseEnd – connectStart – (requestStart – connectEnd).

We then produced two sets of data: mean and p95. The mean is a very well understood number for laypeople and gives the average user experience, but it doesn’t capture the breadth of different speeds people experience very well. Because it averages everything together it can miss skewed distributions of data (where lots of people get good performance and lots bad performance, for example).

To address the mean’s problems we also used p95, the 95th percentile. This number tells us what performance 95% of measurements fall below. That can be a hard number to understand, but you can think of it as the “reasonable worst case” performance for a user. Only 5% of measurements were worse than this number.

An example chart

As this blog contains a lot of data, let’s start with a detailed look at a chart of results. We compared ourselves against industry giants Google and Amazon CloudFront, industry pioneer Akamai and up and comer Fastly.

For each network (represented by an ASN) and for each CDN we calculated which CDN had the best performance. Here, for example, is a count of the number of networks on which each CDN had the best performance for TTFB. This particular chart shows p95 and includes data from the top 1,000 networks (by number of IPv4 addresses advertised).

In these charts, longer bars are better; the bars indicate the number of networks for which that CDN had the lowest TTFB at p95.

This shows that Cloudflare had the lowest time to first byte (the TTFB, or the time it took the first byte of content to reach their browser) at the 95th percentile for the largest number of networks in the top 1,000 (based on the number IPv4 addresses they advertise). Google was next, then Fastly followed by Amazon CloudFront and Akamai.

All three measures, TCP connection time, time to first byte and time to last byte, matter to the user experience. For this example, I focused on time to first byte (TTFB) because it’s such a common measure of responsiveness of the web. It’s literally the time it takes a web server to start responding to a request from a browser for a web page.

To understand the data we gathered let’s look at two large US-based ISPs: Cox and Comcast. Cox serves about 6.5 million customers and Comcast has about 30 million customers. We performed roughly 22,000 measurements on Cox’s network and 100,000 on Comcast’s. Below we’ll make use of measurement counts to rank networks by their size, here we see that our measurements and customer counts of Cox and Comcast track nicely.

Cox Communications has ASN 22773 and our data shows that the p95 TTFB for the five CDNs was as follows: Cloudflare 332.6ms, Fastly 357.6ms, Google 380.3ms, Amazon CloudFront 404.4ms and Akamai 441.5ms. In this case Cloudflare was the fastest and about 7% faster than the next CDN (Fastly) which was faster than Google and so on.

Looking at Comcast (with ASN 7922) we see p95 TTFB was 323.7ms (Fastly), 324.2ms (Cloudflare), 353.7ms (Google), 384.6ms (Akamai) and 418.8ms (Amazon CloudFront). Here Fastly (323.7ms) edged out Cloudflare (324.2ms) by 0.2%.

Figures like these go into determining which CDN is the fastest for each network for this analysis and the charts presented. At a granular level they matter to Cloudflare as we can then target networks and connectivity for optimization.

The results

Shown below are the results for three different measurement types (TCP connection time, TTFB and TTLB) with two different aggregations (mean and p95) and for three different groupings of networks.

The first grouping is the simplest: it’s all the networks we have data for. The second grouping is the one used above, the top 1,000 networks by number of IP addresses. But we also show a third grouping, top 1,000 networks by number of observations. That last group is important.

Top 1,000 networks by number of IP addresses is interesting (it includes the major ISPs) but it also includes some networks that have huge numbers of IP addresses available that aren’t necessarily used. These come about because of historical allocations of IP addresses organisations like the US Department of Defense.

Cloudflare’s testing reveals which networks are most used, and so we also report results for the top 1,000 networks by number of observations to get an idea of how we’re performing on networks with heavy usage.

Hence, we have 18 charts showing all combinations of (TCP connection time, TTFB, TTLB), (mean, p95) and (all networks, top 1,000 networks by IP count, top 1,000 networks by observations).

You’ll see that in two of the charts Cloudflare is not #1 of 18 (we’re #2). We’ll be working to make sure we’re #1 for every measurement over the next few weeks. Both of those are average times. We’re most interested in the p95 measurements because they show the “reasonable worst case” scenario for someone using the Internet. But as we go about optimizing performance we want to be #1 on every chart so that we’re top no matter how performance is measured.

TCP Connection Time

Let’s start with TCP connection time to get a sense of how well-connected the five companies we’ve measured. Recall that longer bars are better here, they indicate that the particular CDN was the highest performance for that many networks: more networks is better.

Time To First Byte (TTFB)

Next up is TTFB for the five companies. Once again, longer bars is better means more networks where that CDN had the lowest TTFB.

Time To Last Byte (TTLB)

And finally the TTLB measurements. Once again, longer bars is better means more networks where that CDN had the lowest TTLB.

Optimization Targets

Looking not just at where we are #1 but where we are #1 or #2 helps us see how quickly we can optimize our network to be #1 in more places. Examining the top 1,000 networks by observations we see that we’re currently #1 or #2 for TTFB in 69.9% of networks, for TTLB in 65.0% of networks and for TCP connection time in 70.5%.

To see how much optimization we need to do to go from #2 to #1 we looked at the three measures and see that median TTFB of the #1 network is 92.3%, median TTLB is 94.0% and TCP connection time is 91.5%.

The latter figure is very interesting because it shows that we can make big gains by optimizing network level routing.

Where’s the world map?

It’s very common to present data about Internet performance on maps. World maps look good but they obscure information. The reason is very simple: world maps show geography (and, depending on the projection, a very skewed version of the world’s actual countries and land masses).

Here’s an example world map with countries colored by who had the lowest TTLB at p95. Cloudflare is in orange, Amazon CloudFront in yellow, Google in purple, Akamai in red and Fastly in blue.

What that doesn’t show is population. And Cloudflare is interested in how well we perform for people. Consider Russia and Indonesia. Indonesia has double the population of Russia and about 1/10th of the land.

By focusing on networks we can optimize for the people who use the Internet. For example, Biznet is a major ISP in Indonesia with ASN 17451. Looking at TTFB at p95 (the same statistics we discussed earlier for US ISPs Cox and Comcast) we see that for Biznet users Cloudflare has the fastest p95 TTFB at 677.7ms, Fastly 744.0ms, Google 872.8ms, Amazon CloudFront 1,239.9 and Akamai 1,248.9ms.

What’s next

The data we’ve gathered gives us a granular view of every network that connects to Cloudflare. This allows us to optimize routes and over the coming days and weeks we’ll be using the data to further increase Cloudflare’s performance.

In just over a week it’s Cloudflare’s Birthday Week, and we are aiming to improve our performance in 10% of the networks where we are not #1 today. Stay tuned.

Computing Euclidean distance on 144 dimensions

2020-12-18 Marek Majkowski

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/computing-euclidean-distance-on-144-dimensions/

Computing Euclidean distance on 144 dimensions

Late last year I read a blog post about our CSAM image scanning tool. I remember thinking: this is so cool! Image processing is always hard, and deploying a real image identification system at Cloudflare is no small achievement!

Some time later, I was chatting with Kornel: “We have all the pieces in the image processing pipeline, but we are struggling with the performance of one component.” Scaling to Cloudflare needs ain’t easy!

The problem was in the speed of the matching algorithm itself. Let me elaborate. As John explained in his blog post, the image matching algorithm creates a fuzzy hash from a processed image. The hash is exactly 144 bytes long. For example, it might look like this:

00e308346a494a188e1043333147267a 653a16b94c33417c12b433095c318012
5612442030d14a4ce82c623f4e224733 1dd84436734e4a5d6e25332e507a8218
6e3b89174e30372d

The hash is designed to be used in a fuzzy matching algorithm that can find “nearby”, related images. The specific algorithm is well defined, but making it fast is left to the programmer — and at Cloudflare we need the matching to be done super fast. We want to match thousands of hashes per second, of images passing through our network, against a database of millions of known images. To make this work, we need to seriously optimize the matching algorithm.

Naive quadratic algorithm

The first algorithm that comes to mind has O(K*N) complexity: for each query, go through every hash in the database. In naive implementation, this creates a lot of work. But how much work exactly?

First, we need to explain how fuzzy matching works.

Given a query hash, the fuzzy match is the “closest” hash in a database. This requires us to define a distance. We treat each hash as a vector containing 144 numbers, identifying a point in a 144-dimensional space. Given two such points, we can calculate the distance using the standard Euclidean formula.

For our particular problem, though, we are interested in the “closest” match in a database only if the distance is lower than some predefined threshold. Otherwise, when the distance is large, we can assume the images aren’t similar. This is the expected result — most of our queries will not have a related image in the database.

The Euclidean distance equation used by the algorithm is standard:

To calculate the distance between two 144-byte hashes, we take each byte, calculate the delta, square it, sum it to an accumulator, do a square root, and ta-dah! We have the distance!

Here’s how to count the squared distance in C:

This function returns the squared distance. We avoid computing the actual distance to save us from running the square root function – it’s slow. Inside the code, for performance and simplicity, we’ll mostly operate on the squared value. We don’t need the actual distance value, we just need to find the vector with the smallest one. In our case it doesn’t matter if we’ll compare distances or squared distances!

As you can see, fuzzy matching is basically a standard problem of finding the closest point in a multi-dimensional space. Surely this has been solved in the past — but let’s not jump ahead.

While this code might be simple, we expect it to be rather slow. Finding the smallest hash distance in a database of, say, 1M entries, would require going over all records, and would need at least:

144 * 1M subtractions
144 * 1M multiplications
144 * 1M additions

And more. This alone adds up to 432 million operations! How does it look in practice? To illustrate this blog post we prepared a full test suite. The large database of known hashes can be well emulated by random data. The query hashes can’t be random and must be slightly more sophisticated, otherwise the exercise wouldn’t be that interesting. We generated the test smartly by byte-swaps of the actual data from the database — this allows us to precisely control the distance between test hashes and database hashes. Take a look at the scripts for details. Here’s our first run of the first, naive, algorithm:

$ make naive
< test-vector.txt ./mmdist-naive > test-vector.tmp
Total: 85261.833ms, 1536 items, avg 55.509ms per query, 18.015 qps

We matched 1,536 test hashes against a database of 1 million random vectors in 85 seconds. It took 55ms of CPU time on average to find the closest neighbour. This is rather slow for our needs.

SIMD for help

An obvious improvement is to use more complex SIMD instructions. SIMD is a way to instruct the CPU to process multiple data points using one instruction. This is a perfect strategy when dealing with vector problems — as is the case for our task.

We settled on using AVX2, with 256 bit vectors. We did this for a simple reason — newer AVX versions are not supported by our AMD CPUs. Additionally, in the past, we were not thrilled by the AVX-512 frequency scaling.

Using AVX2 is easier said than done. There is no single instruction to count Euclidean distance between two uint8 vectors! The fastest way of counting the full distance of two 144-byte vectors with AVX2 we could find is authored by Vlad:

It’s actually simpler than it looks: load 16 bytes, convert vector from uint8 to int16, subtract the vector, store intermediate sums as int32, repeat. At the end, we need to do complex 4 instructions to extract the partial sums into the final sum. This AVX2 code improves the performance around 3x:

$ make naive-avx2 
Total: 25911.126ms, 1536 items, avg 16.869ms per query, 59.280 qps

We measured 17ms per item, which is still below our expectations. Unfortunately, we can’t push it much further without major changes. The problem is that this code is limited by memory bandwidth. The measurements come from my Intel i7-5557U CPU, which has the max theoretical memory bandwidth of just 25GB/s. The database of 1 million entries takes 137MiB, so it takes at least 5ms to feed the database to my CPU. With this naive algorithm we won’t be able to go below that.

Vantage Point Tree algorithm

Since the naive brute force approach failed, we tried using more sophisticated algorithms. My colleague Kornel Lesiński implemented a super cool Vantage Point algorithm. After a few ups and downs, optimizations and rewrites, we gave up. Our problem turned out to be unusually hard for this kind of algorithm.

We observed “the curse of dimensionality”. Space partitioning algorithms don’t work well in problems with large dimensionality — and in our case, we have an enormous number of 144 dimensions. K-D trees are doomed. Locality-sensitive hashing is also doomed. It’s a bizarre situation in which the space is unimaginably vast, but everything is close together. The volume of the space is a 347-digit-long number, but the maximum distance between points is just 3060 – sqrt(255*255*144).

Space partitioning algorithms are fast, because they gradually narrow the search space as they get closer to finding the closest point. But in our case, the common query is never close to any point in the set, so the search space can’t be narrowed to a meaningful degree.

A VP-tree was a promising candidate, because it operates only on distances, subdividing space into near and far partitions, like a binary tree. When it has a close match, it can be very fast, and doesn’t need to visit more than O(log(N)) nodes. For non-matches, its speed drops dramatically. The algorithm ends up visiting nearly half of the nodes in the tree. Everything is close together in 144 dimensions! Even though the algorithm avoided visiting more than half of the nodes in the tree, the cost of visiting remaining nodes was higher, so the search ended up being slower overall.

Smarter brute force?

This experience got us thinking. Since space partitioning algorithms can’t narrow down the search, and still need to go over a very large number of items, maybe we should focus on going over all the hashes, extremely quickly. We must be smarter about memory bandwidth though — it was the limiting factor in the naive brute force approach before.

Perhaps we don’t need to fetch all the data from memory.

Short distance

The breakthrough came from the realization that we don’t need to count the full distance between hashes. Instead, we can compute only a subset of dimensions, say 32 out of the total of 144. If this distance is already large, then there is no need to compute the full one! Computing more points is not going to reduce the Euclidean distance.

The proposed algorithm works as follows:

1. Take the query hash and extract a 32-byte short hash from it

2. Go over all the 1 million 32-byte short hashes from the database. They must be densely packed in the memory to allow the CPU to perform good prefetching and avoid reading data we won’t need.

3. If the distance of the 32-byte short hash is greater or equal a best score so far, move on

4. Otherwise, investigate the hash thoroughly and compute the full distance.

Even though this algorithm needs to do less arithmetic and memory work, it’s not faster than the previous naive one. See make short-avx2. The problem is: we still need to compute a full distance for hashes that are promising, and there are quite a lot of them. Computing the full distance for promising hashes adds enough work, both in ALU and memory latency, to offset the gains of this algorithm.

There is one detail of our particular application of the image matching problem that will help us a lot moving forward. As we described earlier, the problem is less about finding the closest neighbour and more about proving that the neighbour with a reasonable distance doesn’t exist. Remember — in practice, we don’t expect to find many matches! We expect almost every image we feed into the algorithm to be unrelated to image hashes stored in the database.

It’s sufficient for our algorithm to prove that no neighbour exists within a predefined distance threshold. Let’s assume we are not interested in hashes more distant than, say, 220, which squared is 48,400. This makes our short-distance algorithm variation work much better:

$ make short-avx2-threshold
Total: 4994.435ms, 1536 items, avg 3.252ms per query, 307.542 qps

Origin distance variation

At some point, John noted that the threshold allows additional optimization. We can order the hashes by their distance from some origin point. Given a query hash which has origin distance of A, we can inspect only hashes which are distant between |A-threshold| and |A+threshold| from the origin. This is pretty much how each level of Vantage Point Tree works, just simplified. This optimization — ordering items in the database by their distance from origin point — is relatively simple and can help save us a bit of work.

While great on paper, this method doesn’t introduce much gain in practice, as the vectors are not grouped in clusters — they are pretty much random! For the threshold values we are interested in, the origin distance algorithm variation gives us ~20% speed boost, which is okay but not breathtaking. This change might bring more benefits if we ever decide to reduce the threshold value, so it might be worth doing for production implementation. However, it doesn’t work well with query batching.

Transposing data for better AVX

But we’re not done with AVX optimizations! The usual problem with AVX is that the instructions don’t normally fit a specific problem. Some serious mind twisting is required to adapt the right instruction to the problem, or to reverse the problem so that a specific instruction can be used. AVX2 doesn’t have useful “horizontal” uint16 subtract, multiply and add operations. For example, _mm_hadd_epi16 exists, but it’s slow and cumbersome.

Instead, we can twist the problem to make use of fast available uint16 operands. For example we can use:

_mm256_sub_epi16
_mm256_mullo_epi16
and _mm256_add_epu16.

The add would overflow in our case, but fortunately there is add-saturate _mm256_adds_epu16.

The saturated add is great and saves us conversion to uint32. It just adds a small limitation: the threshold passed to the program (i.e., the max squared distance) must fit into uint16. However, this is fine for us.

To effectively use these instructions we need to transpose the data in the database. Instead of storing hashes in rows, we can store them in columns:

So instead of:

[a1, a2, a3],
[b1, b2, b3],
[c1, c2, c3],

…

We can lay it out in memory transposed:

[a1, b1, c1],
[a2, b2, c2],
[a3, b3, c3],

…

Now we can load 16 first bytes of hashes using one memory operation. In the next step, we can subtract the first byte of the querying hash using a single instruction, and so on. The algorithm stays exactly the same as defined above; we just make the data easier to load and easier to process for AVX.

The hot loop code even looks relatively pretty:

With the well-tuned batch size and short distance size parameters we can see the performance of this algorithm:

$ make short-inv-avx2
Total: 1118.669ms, 1536 items, avg 0.728ms per query, 1373.062 qps

Whoa! This is pretty awesome. We started from 55ms per query, and we finished with just 0.73ms. There are further micro-optimizations possible, like memory prefetching or using huge pages to reduce page faults, but they have diminishing returns at this point.

If you are interested in architectural tuning such as this, take a look at the new performance book by Denis Bakhvalov. It discusses roofline model analysis, which is pretty much what we did here.

Do take a look at our code and tell us if we missed some optimization!

Summary

What an optimization journey! We jumped between memory and ALU bottlenecked code. We discussed more sophisticated algorithms, but in the end, a brute force algorithm — although tuned — gave us the best results.

To get even better numbers, I experimented with Nvidia GPU using CUDA. The CUDA intrinsics like vabsdiff4 and dp4a fit the problem perfectly. The V100 gave us some amazing numbers, but I wasn’t fully satisfied with it. Considering how many AMD Ryzen cores with AVX2 we can get for the cost of a single server-grade GPU, we leaned towards general purpose computing for this particular problem.

This is a great example of the type of complexities we deal with every day. Making even the best technologies work “at Cloudflare scale” requires thinking outside the box. Sometimes we rewrite the solution dozens of times before we find the optimal one. And sometimes we settle on a brute-force algorithm, just very very optimized.

The computation of hashes and image matching are challenging problems that require running very CPU intensive operations.. The CPU we have available on the edge is scarce and workloads like this are incredibly expensive. Even with the optimization work talked about in this blog post, running the CSAM scanner at scale is a challenge and has required a huge engineering effort. And we’re not done! We need to solve more hard problems before we’re satisfied. If you want to help, consider applying!

A Last Call for QUIC, a giant leap for the Internet

2020-10-22 Lucas Pardue

Post Syndicated from Lucas Pardue original https://blog.cloudflare.com/last-call-for-quic/

A Last Call for QUIC, a giant leap for the Internet

QUIC is a new Internet transport protocol for secure, reliable and multiplexed communications. HTTP/3 builds on top of QUIC, leveraging the new features to fix performance problems such as Head-of-Line blocking. This enables web pages to load faster, especially over troublesome networks.

QUIC and HTTP/3 are open standards that have been under development in the IETF for almost exactly 4 years. On October 21, 2020, following two rounds of Working Group Last Call, draft 32 of the family of documents that describe QUIC and HTTP/3 were put into IETF Last Call. This is an important milestone for the group. We are now telling the entire IETF community that we think we’re almost done and that we’d welcome their final review.

A Last Call for QUIC, a giant leap for the Internet

Speaking personally, I’ve been involved with QUIC in some shape or form for many years now. Earlier this year I was honoured to be asked to help co-chair the Working Group. I’m pleased to help shepherd the documents through this important phase, and grateful for the efforts of everyone involved in getting us there, especially the editors. I’m also excited about future opportunities to evolve on top of QUIC v1 to help build a better Internet.

There are two aspects to protocol development. One aspect involves writing and iterating upon the documents that describe the protocols themselves. Then, there’s implementing, deploying and testing libraries, clients and/or servers. These aspects operate hand in hand, helping the Working Group move towards satisfying the goals listed in its charter. IETF Last Call marks the point that the group and their responsible Area Director (in this case Magnus Westerlund) believe the job is almost done. Now is the time to solicit feedback from the wider IETF community for review. At the end of the Last Call period, the stakeholders will take stock, address feedback as needed and, fingers crossed, go onto the next step of requesting the documents be published as RFCs on the Standards Track.

Although specification and implementation work hand in hand, they often progress at different rates, and that is totally fine. The QUIC specification has been mature and deployable for a long time now. HTTP/3 has been generally available on the Cloudflare edge since September 2019, and we’ve been delighted to see support roll out in user agents such as Chrome, Firefox, Safari, curl and so on. Although draft 32 is the latest specification, the community has for the time being settled on draft 29 as a solid basis for interoperability. This shouldn’t be surprising, as foundational aspects crystallize the scope of changes between iterations decreases. For the average person in the street, there’s not really much difference between 29 and 32.

So today, if you visit a website with HTTP/3 enabled—such as https://cloudflare-quic.com—you’ll probably see response headers that contain Alt-Svc: h3-29=”… . And in a while, once Last Call completes and the RFCs ship, you’ll start to see websites simply offer Alt-Svc: h3=”… (note, no draft version!).

Need a deep dive?

We’ve collected a bunch of resource links at https://cloudflare-quic.com. If you’re more of an interactive visual learner, you might be pleased to hear that I’ve also been hosting a series on Cloudflare TV called “Levelling up Web Performance with HTTP/3”. There are over 12 hours of content including the basics of QUIC, ways to measure and debug the protocol in action using tools like Wireshark, and several deep dives into specific topics. I’ve also been lucky to have some guest experts join me along the way. The table below gives an overview of the episodes that are available on demand.

Episode	Description
1	Introduction to QUIC.
2	Introduction to HTTP/3.
3	QUIC & HTTP/3 logging and analysis using qlog and qvis. Featuring Robin Marx.
4	QUIC & HTTP/3 packet capture and analysis using Wireshark. Featuring Peter Wu.
5	The roles of Server Push and Prioritization in HTTP/2 and HTTP/3. Featuring Yoav Weiss.
6	"After dinner chat" about curl and QUIC. Featuring Daniel Stenberg.
7	Qlog vs. Wireshark. Featuring Robin Marx and Peter Wu.
8	Understanding protocol performance using WebPageTest. Featuring Pat Meenan and Andy Davies.
9	Handshake deep dive.
10	Getting to grips with quiche, Cloudflare’s QUIC and HTTP/3 library.
11	A review of SIGCOMM’s EPIQ workshop on evolving QUIC.
12	Understanding the role of congestion control in QUIC. Featuring Junho Choi.

Whither QUIC?

So does Last Call mean QUIC is “done”? Not by a long shot. The new protocol is a giant leap for the Internet, because it enables new opportunities and innovation. QUIC v1 is basically the set of documents that have gone into Last Call. We’ll continue to see people gain experience deploying and testing this, and no doubt cool blog posts about tweaking parameters for efficiency and performance are on the radar. But QUIC and HTTP/3 are extensible, so we’ll see people interested in trying new things like multipath, different congestion control approaches, or new ways to carry data unreliably such as the DATAGRAM frame.

We’re also seeing people interested in using QUIC for other use cases. Mapping other application protocols like DNS to QUIC is a rapid way to get its improvements. We’re seeing people that want to use QUIC as a substrate for carrying other transport protocols, hence the formation of the MASQUE Working Group. There’s folks that want to use QUIC and HTTP/3 as a “supercharged WebSocket”, hence the formation of the WebTransport Working Group.

Whatever the future holds for QUIC, we’re just getting started, and I’m excited.

Secondary DNS – Deep Dive

2020-09-15 Alex Fattouche

Post Syndicated from Alex Fattouche original https://blog.cloudflare.com/secondary-dns-deep-dive/

How Does Secondary DNS Work?

Secondary DNS - Deep Dive

If you already understand how Secondary DNS works, please feel free to skip this section. It does not provide any Cloudflare-specific information.

Secondary DNS has many use cases across the Internet; however, traditionally, it was used as a synchronized backup for when the primary DNS server was unable to respond to queries. A more modern approach involves focusing on redundancy across many different nameservers, which in many cases broadcast the same anycasted IP address.

Secondary DNS involves the unidirectional transfer of DNS zones from the primary to the Secondary DNS server(s). One primary can have any number of Secondary DNS servers that it must communicate with in order to keep track of any zone updates. A zone update is considered a change in the contents of a zone, which ultimately leads to a Start of Authority (SOA) serial number increase. The zone’s SOA serial is one of the key elements of Secondary DNS; it is how primary and secondary servers synchronize zones. Below is an example of what an SOA record might look like during a dig query.

example.com	3600	IN	SOA	ashley.ns.cloudflare.com. dns.cloudflare.com. 
2034097105  // Serial
10000 // Refresh
2400 // Retry
604800 // Expire
3600 // Minimum TTL

Each of the numbers is used in the following way:

Serial – Used to keep track of the status of the zone, must be incremented at every change.
Refresh – The maximum number of seconds that can elapse before a Secondary DNS server must check for a SOA serial change.
Retry – The maximum number of seconds that can elapse before a Secondary DNS server must check for a SOA serial change, after previously failing to contact the primary.
Expire – The maximum number of seconds that a Secondary DNS server can serve stale information, in the event the primary cannot be contacted.
Minimum TTL – Per RFC 2308, the number of seconds that a DNS negative response should be cached for.

Using the above information, the Secondary DNS server stores an SOA record for each of the zones it is tracking. When the serial increases, it knows that the zone must have changed, and that a zone transfer must be initiated.

Serial Tracking

Serial increases can be detected in the following ways:

The fastest way for the Secondary DNS server to keep track of a serial change is to have the primary server NOTIFY them any time a zone has changed using the DNS protocol as specified in RFC 1996, Secondary DNS servers will instantly be able to initiate a zone transfer.
Another way is for the Secondary DNS server to simply poll the primary every “Refresh” seconds. This isn’t as fast as the NOTIFY approach, but it is a good fallback in case the notifies have failed.

One of the issues with the basic NOTIFY protocol is that anyone on the Internet could potentially notify the Secondary DNS server of a zone update. If an initial SOA query is not performed by the Secondary DNS server before initiating a zone transfer, this is an easy way to perform an amplification attack. There is two common ways to prevent anyone on the Internet from being able to NOTIFY Secondary DNS servers:

Using transaction signatures (TSIG) as per RFC 2845. These are to be placed as the last record in the extra records section of the DNS message. Usually the number of extra records (or ARCOUNT) should be no more than two in this case.
Using IP based access control lists (ACL). This increases security but also prevents flexibility in server location and IP address allocation.

Generally NOTIFY messages are sent over UDP, however TCP can be used in the event the primary server has reason to believe that TCP is necessary (i.e. firewall issues).

Zone Transfers

In addition to serial tracking, it is important to ensure that a standard protocol is used between primary and Secondary DNS server(s), to efficiently transfer the zone. DNS zone transfer protocols do not attempt to solve the confidentiality, authentication and integrity triad (CIA); however, the use of TSIG on top of the basic zone transfer protocols can provide integrity and authentication. As a result of DNS being a public protocol, confidentiality during the zone transfer process is generally not a concern.

Authoritative Zone Transfer (AXFR)

AXFR is the original zone transfer protocol that was specified in RFC 1034 and RFC 1035 and later further explained in RFC 5936. AXFR is done over a TCP connection because a reliable protocol is needed to ensure packets are not lost during the transfer. Using this protocol, the primary DNS server will transfer all of the zone contents to the Secondary DNS server, in one connection, regardless of the serial number. AXFR is recommended to be used for the first zone transfer, when none of the records are propagated, and IXFR is recommended after that.

Incremental Zone Transfer (IXFR)

IXFR is the more sophisticated zone transfer protocol that was specified in RFC 1995. Unlike the AXFR protocol, during an IXFR, the primary server will only send the secondary server the records that have changed since its current version of the zone (based on the serial number). This means that when a Secondary DNS server wants to initiate an IXFR, it sends its current serial number to the primary DNS server. The primary DNS server will then format its response based on previous versions of changes made to the zone. IXFR messages must obey the following pattern:

Current latest SOA
Secondary server current SOA
DNS record deletions
Secondary server current SOA + changes
DNS record additions
Current latest SOA

Steps 2,3,4,5,6 can be repeated any number of times, as each of those represents one change set of deletions and additions, ultimately leading to a new serial.

IXFR can be done over UDP or TCP, but again TCP is generally recommended to avoid packet loss.

How Does Secondary DNS Work at Cloudflare?

The DNS team loves microservice architecture! When we initially implemented Secondary DNS at Cloudflare, it was done using Mesos Marathon. This allowed us to separate each of our services into several different marathon apps, individually scaling apps as needed. All of these services live in our core data centers. The following services were created:

Zone Transferer – responsible for attempting IXFR, followed by AXFR if IXFR fails.
Zone Transfer Scheduler – responsible for periodically checking zone SOA serials for changes.
Rest API – responsible for registering new zones and primary nameservers.

In addition to the marathon apps, we also had an app external to the cluster:

Notify Listener – responsible for listening for notifies from primary servers and telling the Zone Transferer to initiate an AXFR/IXFR.

Each of these microservices communicates with the others through Kafka.

Secondary DNS - Deep Dive — Figure 1: Secondary DNS Microservice Architecture‌‌

Once the zone transferer completes the AXFR/IXFR, it then passes the zone through to our zone builder, and finally gets pushed out to our edge at each of our 200 locations.

Although this current architecture worked great in the beginning, it left us open to many vulnerabilities and scalability issues down the road. As our Secondary DNS product became more popular, it was important that we proactively scaled and reduced the technical debt as much as possible. As with many companies in the industry, Cloudflare has recently migrated all of our core data center services to Kubernetes, moving away from individually managed apps and Marathon clusters.

What this meant for Secondary DNS is that all of our Marathon-based services, as well as our NOTIFY Listener, had to be migrated to Kubernetes. Although this long migration ended up paying off, many difficult challenges arose along the way that required us to come up with unique solutions in order to have a seamless, zero downtime migration.

Challenges When Migrating to Kubernetes

Although the entire DNS team agreed that kubernetes was the way forward for Secondary DNS, it also introduced several challenges. These challenges arose from a need to properly scale up across many distributed locations while also protecting each of our individual data centers. Since our core does not rely on anycast to automatically distribute requests, as we introduce more customers, it opens us up to denial-of-service attacks.

The two main issues we ran into during the migration were:

How do we create a distributed and reliable system that makes use of kubernetes principles while also making sure our customers know which IPs we will be communicating from?
When opening up a public-facing UDP socket to the Internet, how do we protect ourselves while also preventing unnecessary spam towards primary nameservers?.

Issue 1:

As was previously mentioned, one form of protection in the Secondary DNS protocol is to only allow certain IPs to initiate zone transfers. There is a fine line between primary servers allow listing too many IPs and them having to frequently update their IP ACLs. We considered several solutions:

Open source k8s controllers
Altering Network Address Translation(NAT) entries
Do not use k8s for zone transfers
Allowlist all Cloudflare IPs and dynamically update
Proxy egress traffic

Ultimately we decided to proxy our egress traffic from k8s, to the DNS primary servers, using static proxy addresses. Shadowsocks-libev was chosen as the SOCKS5 implementation because it is fast, secure and known to scale. In addition, it can handle both UDP/TCP and IPv4/IPv6.

The partnership of k8s and Shadowsocks combined with a large enough IP range brings many benefits:

Horizontal scaling
Efficient load balancing
Primary server ACLs only need to be updated once
It allows us to make use of kubernetes for both the Zone Transferer and the Local ShadowSocks Proxy.
Shadowsocks proxy can be reused by many different Cloudflare services.

Issue 2:

The Notify Listener requires listening on static IPs for NOTIFY Messages coming from primary DNS servers. This is mostly a solved problem through the use of k8s services of type loadbalancer, however exposing this service directly to the Internet makes us uneasy because of its susceptibility to attacks. Fortunately DDoS protection is one of Cloudflare’s strengths, which lead us to the likely solution of dogfooding one of our own products, Spectrum.

Spectrum provides the following features to our service:

Reverse proxy TCP/UDP traffic
Filter out Malicious traffic
Optimal routing from edge to core data centers
Dual Stack technology

Figure 3 shows two interesting attributes of the system:

Spectrum <-> k8s IPv4 only:
This is because our custom k8s load balancer currently only supports IPv4; however, Spectrum has no issue terminating the IPv6 connection and establishing a new IPv4 connection.
Spectrum <-> k8s routing decisions based of L4 protocol:
This is because k8s only supports one of TCP/UDP/SCTP per service of type load balancer. Once again, spectrum has no issues proxying this correctly.

One of the problems with using a L4 proxy in between services is that source IP addresses get changed to the source IP address of the proxy (Spectrum in this case). Not knowing the source IP address means we have no idea who sent the NOTIFY message, opening us up to attack vectors. Fortunately, Spectrum’s proxy protocol feature is capable of adding custom headers to TCP/UDP packets which contain source IP/Port information.

As we are using miekg/dns for our Notify Listener, adding proxy headers to the DNS NOTIFY messages would cause failures in validation at the DNS server level. Alternatively, we were able to implement custom read and write decorators that do the following:

Reader: Extract source address information on inbound NOTIFY messages. Place extracted information into new DNS records located in the additional section of the message.
Writer: Remove additional records from the DNS message on outbound NOTIFY replies. Generate a new reply using proxy protocol headers.

There is no way to spoof these records, because the server only permits two extra records, one of which is the optional TSIG. Any other records will be overwritten.

This custom decorator approach abstracts the proxying away from the Notify Listener through the use of the DNS protocol.

Although knowing the source IP will block a significant amount of bad traffic, since NOTIFY messages can use both UDP and TCP, it is prone to IP spoofing. To ensure that the primary servers do not get spammed, we have made the following additions to the Zone Transferer:

Always ensure that the SOA has actually been updated before initiating a zone transfer.
Only allow at most one working transfer and one scheduled transfer per zone.

Additional Technical Challenges

Zone Transferer Scheduling

As shown in figure 1, there are several ways of sending Kafka messages to the Zone Transferer in order to initiate a zone transfer. There is no benefit in having a large backlog of zone transfers for the same zone. Once a zone has been transferred, assuming no more changes, it does not need to be transferred again. This means that we should only have at most one transfer ongoing, and one scheduled transfer at the same time, for any zone.

If we want to limit our number of scheduled messages to one per zone, this involves ignoring Kafka messages that get sent to the Zone Transferer. This is not as simple as ignoring specific messages in any random order. One of the benefits of Kafka is that it holds on to messages until the user actually decides to acknowledge them, by committing that messages offset. Since Kafka is just a queue of messages, it has no concept of order other than first in first out (FIFO). If a user is capable of reading from the Kafka topic concurrently, it is entirely possible that a message in the middle of the queue be committed before a message at the end of the queue.

Most of the time this isn’t an issue, because we know that one of the concurrent readers has read the message from the end of the queue and is processing it. There is one Kubernetes-related catch to this issue, though: pods are ephemeral. The kube master doesn’t care what your concurrent reader is doing, it will kill the pod and it’s up to your application to handle it.

Consider the following problem:

Read offset 1. Start transferring zone 1.
Read offset 2. Start transferring zone 2.
Zone 2 transfer finishes. Commit offset 2, essentially also marking offset 1.
Restart pod.
Read offset 3 Start transferring zone 3.

If these events happen, zone 1 will never be transferred. It is important that zones stay up to date with the primary servers, otherwise stale data will be served from the Secondary DNS server. The solution to this problem involves the use of a list to track which messages have been read and completely processed. In this case, when a zone transfer has finished, it does not necessarily mean that the kafka message should be immediately committed. The solution is as follows:

Keep a list of Kafka messages, sorted based on offset.
If finished transfer, remove from list:
If the message is the oldest in the list, commit the messages offset.

This solution is essentially soft committing Kafka messages, until we can confidently say that all other messages have been acknowledged. It’s important to note that this only truly works in a distributed manner if the Kafka messages are keyed by zone id, this will ensure the same zone will always be processed by the same Kafka consumer.

Life of a Secondary DNS Request

Although Cloudflare has a large global network, as shown above, the zone transferring process does not take place at each of the edge datacenter locations (which would surely overwhelm many primary servers), but rather in our core data centers. In this case, how do we propagate to our edge in seconds? After transferring the zone, there are a couple more steps that need to be taken before the change can be seen at the edge.

Zone Builder – This interacts with the Zone Transferer to build the zone according to what Cloudflare edge understands. This then writes to Quicksilver, our super fast, distributed KV store.
Authoritative Server – This reads from Quicksilver and serves the built zone.

What About Performance?

At the time of writing this post, according to dnsperf.com, Cloudflare leads in global performance for both Authoritative and Resolver DNS. Here, Secondary DNS falls under the authoritative DNS category here. Let’s break down the performance of each of the different parts of the Secondary DNS pipeline, from the primary server updating its records, to them being present at the Cloudflare edge.

Primary Server to Notify Listener – Our most accurate measurement is only precise to the second, but we know UDP/TCP communication is likely much faster than that.
NOTIFY to Zone Transferer – This is negligible
Zone Transferer to Primary Server – 99% of the time we see ~800ms as the average latency for a zone transfer.

4. Zone Transferer to Zone Builder – 99% of the time we see ~10ms to build a zone.

5. Zone Builder to Quicksilver edge: 95% of the time we see less than 1s propagation.

End to End latency: less than 5 seconds on average. Although we have several external probes running around the world to test propagation latencies, they lack precision due to their sleep intervals, location, provider and number of zones that need to run. The actual propagation latency is likely much lower than what is shown in figure 10. Each of the different colored dots is a separate data center location around the world.

An additional test was performed manually to get a real world estimate, the test had the following attributes:

Primary server: NS1
Number of records changed: 1
Start test timer event: Change record on NS1
Stop test timer event: Observe record change at Cloudflare edge using dig
Recorded timer value: 6 seconds

Conclusion

Cloudflare serves 15.8 trillion DNS queries per month, operating within 100ms of 99% of the Internet-connected population. The goal of Cloudflare operated Secondary DNS is to allow our customers with custom DNS solutions, be it on-premise or some other DNS provider, to be able to take advantage of Cloudflare’s DNS performance and more recently, through Secondary Override, our proxying and security capabilities too. Secondary DNS is currently available on the Enterprise plan, if you’d like to take advantage of it, please let your account team know. For additional documentation on Secondary DNS, please refer to our support article.