Tag Archives: Speed Week

NGINX structural enhancements for HTTP/2 performance

Post Syndicated from Nick Jones original https://blog.cloudflare.com/nginx-structural-enhancements-for-http-2-performance/

Introduction

My team, the Cloudflare PROTOCOLS team, is responsible for terminating HTTP traffic at the edge of the Cloudflare network. We deal with features related to TCP, QUIC, TLS and secure certificate management, HTTP/1 and HTTP/2. Over Q1, we were responsible for implementing the Enhanced HTTP/2 Prioritization product that Cloudflare announced during Speed Week.

This is a very exciting project to be part of, and doubly exciting to see the results of. During the course of the project, however, we had a number of interesting realisations about NGINX, the HTTP-oriented server onto which Cloudflare currently deploys its software infrastructure. We quickly became certain that our Enhanced HTTP/2 Prioritization project could not achieve even moderate success if the internal workings of NGINX were not changed.

Because of these realisations, we embarked upon a number of significant changes to the internal structure of NGINX, in parallel with the work on the core prioritization product. This blog post describes the motivation behind the structural changes, how we approached them, and what impact they had. We also identify additional changes that we plan to add to our roadmap, which we hope will improve performance further.

Background

Enhanced HTTP/2 Prioritization aims to do one thing to web traffic flowing between a client and a server: it provides a means to shape the many HTTP/2 streams as they flow from upstream (server or origin side) into a single HTTP/2 connection that flows downstream (client side).

Enhanced HTTP/2 Prioritization allows site owners and the Cloudflare edge systems to dictate the rules about how various objects should combine into the single HTTP/2 connection: whether a particular object should have priority and dominate that connection and reach the client as soon as possible, or whether a group of objects should evenly share the capacity of the connection and put more emphasis on parallelism.

As a result, Enhanced HTTP/2 Prioritization allows site owners to tackle two problems that exist between a client and a server: how to control the precedence and ordering of objects, and how to make the best use of a limited connection resource, which may be constrained by a number of factors such as bandwidth, volume of traffic and CPU workload at the various stages on the path of the connection.

What did we see?

The key to prioritisation is being able to compare two or more HTTP/2 streams in order to determine which one’s frame is to go down the pipe next. The Enhanced HTTP/2 Prioritization project necessarily drew us into the core NGINX codebase, as our intention was to fundamentally alter the way that NGINX compared and queued HTTP/2 data frames as they were written back to the client.

Very early in the analysis phase, as we rummaged through the NGINX internals to survey the site of our proposed features, we noticed a number of shortcomings in the structure of NGINX itself, in particular in how it moved data from upstream (server side) to downstream (client side) and how it temporarily stored (buffered) that data in its various internal stages. The main conclusion of our early analysis was that NGINX largely failed to give the stream data frames any ‘proximity’: either streams were processed in the NGINX HTTP/2 layer in isolated succession, or frames of different streams spent very little time in the same place, such as a shared queue. The net effect was a reduction in the opportunities for useful comparison.

We coined a new, barely scientific but useful measurement, Potential, to describe how effectively the Enhanced HTTP/2 Prioritization strategies (or even the default NGINX prioritization) can be applied to queued data streams. Potential is not so much a measurement of the effectiveness of prioritization per se (that metric would come later in the project); it is more a measurement of the level of participation during the application of the algorithm. In simple terms, it considers the number of streams, and frames thereof, that are included in an iteration of prioritization, with more streams and more frames leading to higher Potential.

What we could see from early on was that, by default, NGINX displayed low Potential, rendering prioritization instructions from either the browser, as in the traditional HTTP/2 prioritization model, or from our Enhanced HTTP/2 Prioritization product fairly useless.

What did we do?

With the goal of improving the specific problems related to Potential, and also improving general throughput of the system, we identified some key pain points in NGINX. These points, which will be described below, have either been worked on and improved as part of our initial release of Enhanced HTTP/2 Prioritization, or have now branched out into meaningful projects of their own that we will put engineering effort into over the course of the next few months.

HTTP/2 frame write queue reclamation

Write queue reclamation was successfully shipped with our release of Enhanced HTTP/2 Prioritization. Ironically, it wasn’t a change made to the original NGINX; it was a change made against our Enhanced HTTP/2 Prioritization implementation part way through the project, and it serves as a good example of something one might call conservation of data, which is a good way to increase Potential.

Similar to the original NGINX, our Enhanced HTTP/2 Prioritization algorithm will place a cohort of HTTP/2 data frames into a write queue as a result of an iteration of the prioritization strategies being applied to them. The contents of the write queue are destined to be written to the downstream TLS layer. Also similar to the original NGINX, the write queue may only be partially written to the TLS layer due to back-pressure from a network connection that has temporarily reached write capacity.

Early on in our project, if the write queue was only partially written to the TLS layer, we would simply leave the frames in the write queue until the backlog was cleared, then we would re-attempt to write that data to the network in a future write iteration, just like the original NGINX.

The original NGINX takes this approach because the write queue is the only place that waiting data frames are stored. However, in our NGINX modified for Enhanced HTTP/2 Prioritization, we have a unique structure that the original NGINX lacks: per-stream data frame queues where we temporarily store data frames before our prioritization algorithms are applied to them.

We came to the realisation that in the event of a partial write, we were able to restore the unwritten frames back into their per-stream queues. If it was the case that a subsequent data cohort arrived behind the partially unwritten one, then the previously unwritten frames could participate in an additional round of prioritization comparisons, thus raising the Potential of our algorithms.
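
In pseudocode similar to the snippets shown later in this post (the function and field names here are illustrative, not NGINX's actual API), the reclamation step amounts to something like this:

ngx_http_v2_reclaim_unwritten(ngx_http_v2_connection *h2_conn)
{
    ngx_http_v2_frame *frame;

    /* After a partial write, take the leftover frames newest-first and
       put each one back at the head of its stream's queue, preserving
       the order of frames within each stream. They can then take part
       in the next round of prioritization comparisons. */
    while ( ! ngx_queue_empty(h2_conn->queue) ) {
        frame = ngx_queue_pop_tail(h2_conn->queue);
        ngx_queue_push_head(frame->stream->frames, frame);
    }
}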

The following diagram illustrates this process:

[Diagram: unwritten frames being returned from the connection write queue to their per-stream queues]

We were very pleased to ship Enhanced HTTP/2 Prioritization with the reclamation feature included, as this single enhancement greatly increased Potential and made up for the fact that we had to withhold the next enhancement from the Speed Week release due to its delicacy.

HTTP/2 frame write event re-ordering

In Cloudflare infrastructure, we map the many streams of a single HTTP/2 connection from the eyeball to multiple HTTP/1.1 connections to the upstream Cloudflare control plane.

As a note: it may seem counterintuitive that we downgrade protocols like this, and it may seem doubly counterintuitive when I reveal that we also disable HTTP keepalive on these upstream connections, resulting in only one transaction per connection. However, this arrangement offers a number of advantages, particularly in the form of improved CPU workload distribution.

When NGINX monitors its upstream HTTP/1.1 connections for read activity, it may detect readability on many of those connections and process them all in a batch. However, within that batch, each of the upstream connections is processed sequentially, one at a time, from start to finish: from HTTP/1.1 connection read, to framing in the HTTP/2 stream, to HTTP/2 connection write to the TLS layer.

The existing NGINX workflow is illustrated in this diagram:

[Diagram: existing NGINX workflow, processing each upstream connection from read through framing to TLS write in turn]

By committing each stream’s frames to the TLS layer one stream at a time, many frames may pass entirely through the NGINX system before backpressure on the downstream connection allows the queue of frames to build up, providing an opportunity for these frames to be in proximity and allowing prioritization logic to be applied. This negatively impacts Potential and reduces the effectiveness of prioritization.

The Cloudflare Enhanced HTTP/2 Prioritization modified NGINX aims to re-arrange the internal workflow described above into the following model:

[Diagram: modified workflow, framing all streams first and then prioritizing and writing them in a single event]

Although we continue to frame upstream data into HTTP/2 data frames in separate iterations for each upstream connection, we no longer commit these frames to a single write queue within each iteration; instead, we arrange the frames into the per-stream queues described earlier. We then post a single event to the end of the per-connection iterations, and perform the prioritization, queuing and writing of the HTTP/2 data frames of all streams in that single event.

This single event finds the cohort of data conveniently stored in their respective per-stream queues, all in close proximity, which greatly increases the Potential of the Edge Prioritization algorithms.

In a form closer to actual code, the core of this modification looks a bit like this:

ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer) ) {
        /* Frame the buffered upstream data for this stream */
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream->frames,
                               buffer);
    }

    /* Prioritize and queue this stream's frames immediately... */
    ngx_http_v2_prioritise(h2_conn->queue,
                           h2_stream->frames);

    /* ...and write the queue straight out to the TLS layer */
    ngx_http_v2_write_queue(h2_conn->queue);
}

To this:

ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer) ) {
        /* Frame the buffered upstream data for this stream */
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream->frames,
                               buffer);
    }

    /* Defer prioritization: just note that this stream has data... */
    ngx_list_add(h2_conn->active_streams, h2_stream);

    /* ...and make sure the write pass runs once after the whole batch
       of upstream connections has been processed */
    ngx_call_once_async(ngx_http_v2_write_streams, h2_conn);
}

ngx_http_v2_write_streams(ngx_http_v2_connection *h2_conn)
{
    ngx_http_v2_stream *h2_stream;

    /* Prioritize the frames of all active streams together */
    while ( ! ngx_list_empty(h2_conn->active_streams) ) {
        h2_stream = ngx_list_pop(h2_conn->active_streams);

        ngx_http_v2_prioritise(h2_conn->queue,
                               h2_stream->frames);
    }

    /* Then write the combined queue to the TLS layer once */
    ngx_http_v2_write_queue(h2_conn->queue);
}

There is a high level of risk in this modification, for even though it is remarkably small, we are taking the well-established and debugged event flow in NGINX and switching it around to a significant degree. Like taking a number of Jenga pieces out of the tower and placing them in another location, we risk race conditions, event misfires and event black holes leading to lockups during transaction processing.

Because of this level of risk, we did not release this change in its entirety during Speed Week, but we will continue to test and refine it for future release.

Upstream buffer partial re-use

NGINX has an internal buffer region to store connection data it reads from upstream. To begin with, the entirety of this buffer is Ready for use. When data is read from upstream into the Ready buffer, the part of the buffer that holds the data is passed to the downstream HTTP/2 layer. Since HTTP/2 takes responsibility for that data, that portion of the buffer is marked as Busy, and it will remain Busy for as long as it takes the HTTP/2 layer to write the data into the TLS layer, which is a process that may take some time (in computer terms!).

During this gulf of time, the upstream layer may continue to read more data into the remaining Ready sections of the buffer and continue to pass that incremental data to the HTTP/2 layer until there are no Ready sections available.

When Busy data is finally finished with in the HTTP/2 layer, the buffer space that contained that data is then marked as Free.

The process is illustrated in this diagram:

[Diagram: upstream buffer regions cycling through the Ready, Busy and Free states]

You may ask: When the leading part of the upstream buffer is marked as Free (in blue in the diagram), even though the trailing part of the upstream buffer is still Busy, can the Free part be re-used for reading more data from upstream?

The answer to that question is: NO

Because even just a small part of the buffer is still Busy, NGINX refuses to allow any of the buffer space to be re-used for reads. Only when the entirety of the buffer is Free can the buffer be returned to the Ready state and used for another iteration of upstream reads. So, in summary, data can be read from upstream into Ready space at the tail of the buffer, but not into Free space at the head of the buffer.

This is a shortcoming in NGINX and is clearly undesirable as it interrupts the flow of data into the system. We asked: what if we could cycle through this buffer region and re-use parts at the head as they became Free? We seek to answer that question in the near future by testing the following buffering model in NGINX:

[Diagram: proposed buffering model re-using Free space at the head of the buffer as it becomes available]
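
A minimal sketch of that model, treating the upstream buffer as a ring in which space freed at the head can be re-used for new upstream reads (this is an illustration of the idea only, not NGINX's actual buffer code):

#include <stddef.h>

/* Illustrative ring buffer: 'head' marks the oldest byte still Busy in
   the HTTP/2 layer, 'tail' marks where the next upstream read may
   write. Space released at the head becomes writable again without
   waiting for the whole buffer to drain. */
typedef struct {
    unsigned char *data;
    size_t         size;
    size_t         head;   /* first byte still owned by the HTTP/2 layer */
    size_t         tail;   /* next write position for upstream reads     */
} ring_buffer;

/* Bytes available for a new upstream read. One slot is kept unused so
   that head == tail always means "empty". A real implementation would
   split a wrapping read into two parts. */
static size_t
ring_writable(const ring_buffer *rb)
{
    return (rb->head + rb->size - rb->tail - 1) % rb->size;
}

/* Called when the HTTP/2 layer has finished with 'n' bytes at the head. */
static void
ring_release(ring_buffer *rb, size_t n)
{
    rb->head = (rb->head + n) % rb->size;
}

/* Called after 'n' bytes have been read from upstream into the tail. */
static void
ring_commit(ring_buffer *rb, size_t n)
{
    rb->tail = (rb->tail + n) % rb->size;
}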

TLS layer Buffering

On a number of occasions in the above text, I have mentioned the TLS layer, and how the HTTP/2 layer writes data into it. In the OSI network model, TLS sits just below the protocol (HTTP/2) layer, and in many consciously designed networking software systems such as NGINX, the software interfaces are separated in a way that mimics this layering.

The NGINX HTTP/2 layer will collect the current cohort of data frames and place them in priority order into an output queue, then submit this queue to the TLS layer. The TLS layer makes use of a per-connection buffer to collect HTTP/2 layer data before performing the actual cryptographic transformations on that data.

The purpose of the buffer is to give the TLS layer a more meaningful quantity of data to encrypt; if the buffer were too small, or if the TLS layer simply relied on the units of data coming from the HTTP/2 layer, then the overhead of encrypting and transmitting a multitude of small blocks could negatively impact system throughput.

The following diagram illustrates this undersize buffer situation:

[Diagram: undersized TLS buffer causing many small encrypt-and-write operations]

If the TLS buffer is too big, then an excessive amount of HTTP/2 data will be committed to encryption, and if it fails to write to the network due to backpressure, it will be locked into the TLS layer and unavailable to return to the HTTP/2 layer for the reclamation process, thus reducing the effectiveness of reclamation. The following diagram illustrates this oversize buffer situation:

[Diagram: oversized TLS buffer locking data into the TLS layer when network writes are incomplete]

In the coming months, we will embark on a process to find the ‘Goldilocks’ spot for TLS buffering: to size the TLS buffer so it is big enough to maintain the efficiency of encryption and network writes, but not so big as to reduce the responsiveness to incomplete network writes and the efficiency of reclamation.
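
As a rough illustration of the trade-off (a simplified sketch, not NGINX's or any TLS library's actual interface), the buffering policy comes down to coalescing writes until a tunable threshold is reached:

#include <stddef.h>
#include <string.h>

/* Hypothetical coalescing buffer in front of the TLS layer: small
   writes from the HTTP/2 layer are batched until 'flush_threshold'
   bytes are pending, so each encryption call works on a meaningful
   amount of data. The threshold is the Goldilocks knob: too small and
   encryption overhead dominates; too large and data gets locked away
   from the reclamation process on partial network writes. */
typedef struct {
    unsigned char *data;
    size_t         capacity;        /* hard upper bound               */
    size_t         used;            /* bytes waiting to be encrypted  */
    size_t         flush_threshold; /* flush once this much is queued */
} tls_out_buffer;

/* Append as much of (p, len) as fits; returns the number of bytes
   buffered so the caller can keep the remainder for reclamation. */
static size_t
tls_buffer_append(tls_out_buffer *b, const unsigned char *p, size_t len)
{
    size_t room = b->capacity - b->used;
    size_t n = len < room ? len : room;

    memcpy(b->data + b->used, p, n);
    b->used += n;
    return n;
}

/* The caller encrypts and writes once enough data has been queued
   (or at the end of a cohort of frames). */
static int
tls_buffer_should_flush(const tls_out_buffer *b)
{
    return b->used >= b->flush_threshold;
}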

Thank you – Next!

The Enhanced HTTP/2 Prioritization project has the lofty goal of fundamentally re-shaping how we send traffic from the Cloudflare edge to clients, and as the results of our testing and the feedback from some of our customers show, we have certainly achieved that! However, one of the most important things we took away from the project was the critical role that the internal data flow within our NGINX software infrastructure plays in the traffic observed by our end users. We found that changing a few lines of (albeit critical) code could have significant impacts on the effectiveness and performance of our prioritization algorithms. Another positive outcome is that, in addition to improving HTTP/2, we are looking forward to carrying our newfound skills and lessons learned over to HTTP/3 over QUIC.

We are eager to share our modifications to NGINX with the community, so we have opened this ticket, through which we will discuss upstreaming the event re-ordering change and the buffer partial re-use change with the NGINX team.

As Cloudflare continues to grow, our requirements on our software infrastructure also shift. Cloudflare has already moved beyond proxying of HTTP/1 over TCP to support termination and Layer 3 and 4 protection for any UDP and TCP traffic. Now we are moving on to other technologies and protocols such as QUIC and HTTP/3, and full proxying of a wide range of other protocols such as messaging and streaming media.

For these endeavours we are looking at new ways to answer questions on topics such as scalability, localised performance, wide-scale performance, introspection and debuggability, release agility, and maintainability.

If you would like to help us answer these questions, and you know a bit about hardware and software scalability, network programming, asynchronous event and futures based software design, TCP, TLS, QUIC, HTTP, RPC protocols, Rust, or maybe something else, then have a look here.

One more thing… new Speed Page

Post Syndicated from Andrew Galloni original https://blog.cloudflare.com/new-speed-page/

Congratulations on making it through Speed Week. In the last week, Cloudflare has:

  • described how our global network speeds up the Internet,
  • launched an HTTP/2 prioritisation model that will improve web experiences on all browsers,
  • launched an image resizing service which will deliver the optimal image to every device,
  • optimized live video delivery,
  • detailed how to stream progressive images so that they render twice as fast, using the flexibility of our new HTTP/2 prioritisation model, and
  • prototyped a new over-the-wire format for JavaScript that could improve application start-up performance, especially on mobile devices.

As a bonus, we’re also rolling out one more new feature: “TCP Turbo”, which automatically chooses the TCP settings to further accelerate your website.

As a company, we want to help every one of our customers improve web experiences. The growth of Cloudflare, along with the increase in features, has often made simple questions difficult to answer:

  • How fast is my website?
  • How should I be thinking about performance features?
  • How much faster would the site be if I were to enable a particular feature?

This post will describe the exciting changes we have made to the Speed Page on the Cloudflare dashboard to give our customers a much clearer understanding of how their websites are performing and how they can be made even faster. The new Speed Page consists of:

  • A visual comparison of your website loading on Cloudflare, with caching enabled, compared to connecting directly to the origin.
  • The measured improvement expected if any performance feature is enabled.
  • A report describing how fast your website is on desktop and mobile.

We want to simplify the complexity of making web experiences fast and give our customers control. Take a look – we hope you like it.

Why do fast web experiences matter?

Customer experience: No one likes slow service. Imagine going to a restaurant where the service is slow, especially when you first arrive; you are not likely to go back or recommend it to your friends. It turns out the web works in the same way, and Internet customers are even more demanding. As many as 79% of customers who are “dissatisfied” with a website’s performance are less likely to buy from that site again.

Engagement and Revenue: There are many studies explaining how speed affects customer engagement, bounce rates and revenue.

Reputation: There is also brand reputation to consider, as customers associate an online experience with the brand. One study found that for 66% of the sample, website performance influences their impression of the company.

Diversity: Mobile traffic has grown to be larger than its desktop counterpart over the last few years. Mobile customers’ expectations have become increasingly demanding: they expect seamless Internet access regardless of location.

Mobile provides a new set of challenges, including the diversity of device specifications. When testing, be aware that the average mobile device is significantly less capable than the top-of-the-range models. For example, there can be orders-of-magnitude disparity in the time different mobile devices take to run JavaScript. Another challenge is the variance in mobile performance as customers move from a strong, high-quality office network to mobile networks of different speeds (3G/5G) and quality within the same browsing session.

New Speed Page

There is compelling evidence that a faster web experience is important for anyone online. Most of the major studies involve the largest tech companies, who have whole teams dedicated to measuring and improving web experiences for their own services. At Cloudflare we are on a mission to help build a better and faster Internet for everyone – not just a select few.

Delivering fast web experiences is not a simple matter. That much is clear.
To know what to send and when requires a deep understanding of every layer of the stack, from TCP tuning, protocol-level prioritisation and content delivery formats through to the intricate mechanics of browser rendering. You will also need a global network that strives to be within 10 ms of every Internet user. The intrinsic value of such a network should be clear to everyone. Cloudflare has this network, but it also offers many additional performance features.

With the Speed Page redesign, we are emphasizing the performance benefits of using Cloudflare and the additional improvements possible from our features.

The de facto standard for measuring website performance has been WebPageTest. Having its creator in-house at Cloudflare encouraged us to use it as the basis for website performance measurement. So, what is the easiest way to understand how a web page loads? A list of statistics does not paint a full picture of the actual user experience. One of the cool features of WebPageTest is that it can generate a filmstrip of screen snapshots taken during a web page load, enabling us to quantify how a page loads, visually. This view makes it significantly easier to determine how long the page is blank for, and how long it takes for the most important content to render. Being able to look at the results in this way provides the ability to empathise with the user.

How fast on Cloudflare?

After moving your website to Cloudflare, you may have asked: How fast did this decision make my website? Well, now we provide the answer:

Comparison of website performance using Cloudflare. 

As well as the increase in speed, we provide filmstrips of before and after, so that it is easy to compare and understand how differently a user will experience the website. If our tests are unable to reach your origin and you are already set up on Cloudflare, we will test with development mode enabled, which disables caching and minification.

Site performance statistics

How can we measure the user experience of a website?

Traditionally, page load was the important metric. Page load is a technical measurement used by browser vendors that has no bearing on the presentation or usability of a page. The metric reports on how long it takes not only to load the important content but also all of the 3rd party content (social network widgets, advertising, tracking scripts etc.). A user may very well not see anything until after all the page content has loaded, or they may be able to interact with a page immediately, while content continues to load.

A user will not decide whether a page is fast by a single measure or moment. A user will perceive how fast a website is from a combination of factors:

  • when they see any response
  • when they see the content they expect
  • when they can interact with the page
  • when they can perform the task they intended

Experience has shown that if you focus on one measure, it will likely be to the detriment of the others.

Importance of Visual response

If an impatient user navigates to your site and sees no content, or no valuable content, for several seconds, they are likely to get frustrated and leave. The paint timing spec defines a set of paint metrics (the moments at which content appears on a page) to measure the key points in how a user perceives performance.

First Contentful Paint (FCP) is the time when the browser first renders any DOM content.

First Meaningful Paint (FMP) is the point in time when the page’s “primary” content appears on the screen. This metric should relate to what the user has come to the site to see and is designed as the point in time when the largest visible layout change happens.

Speed Index attempts to quantify the value of the filmstrip rather than using a single paint timing. The speed index measures the rate at which content is displayed – essentially the area above the curve. In the chart below from our progressive image feature you can see that reaching 80% visual completeness happens much earlier for the parallelized (red) load than for the regular (blue) one.

[Chart: visual completeness over time for the parallelized (red) and regular (blue) image loads]
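
Concretely, given filmstrip samples of visual completeness over time, Speed Index can be approximated as the area above that curve. A simplified sketch of the WebPageTest definition (not its actual code):

#include <stddef.h>

/* Speed Index: the area above the visual-completeness curve, in
   milliseconds. 'vc[i]' is the fraction of the final page rendered
   (0.0 to 1.0) at time 't_ms[i]'; samples must be in increasing time
   order and are treated as a step function between samples. */
static double
speed_index(const double *t_ms, const double *vc, size_t n)
{
    double area = 0.0;
    size_t i;

    for (i = 1; i < n; i++) {
        area += (1.0 - vc[i - 1]) * (t_ms[i] - t_ms[i - 1]);
    }
    return area;
}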

Importance of interactivity

The same impatient user is now happy that the content they want to see has appeared. They will still become frustrated if they are unable to interact with the site.
Time to Interactive is the time it takes for content to be rendered and for the page to be ready to receive input from the user. Technically, this is defined as when the browser’s main processing thread has been idle for several seconds after first meaningful paint.

The Speed Tab displays these key metrics for mobile and desktop.

How much faster on Cloudflare?

The Cloudflare Dashboard provides a list of performance features which can, admittedly, be both confusing and daunting. What would be the benefit of turning on Rocket Loader, and on which performance metrics will it have the most impact? If you upgrade to Pro, what will be the value of the enhanced HTTP/2 prioritisation? The optimization section answers these questions.

Tests are run with each performance feature turned on and off. The values of the appropriate performance metrics for each test are displayed, along with the improvement. You can enable or upgrade the feature from this view. Here are a few examples:

If Rocket Loader were enabled for this website, the render-blocking JavaScript would be deferred, causing first paint time to drop from 1.25s to 0.81s – an improvement of 32% on desktop.

Image heavy sites do not perform well on slow mobile connections. If you enable Mirage, your customers on 3G connections would see meaningful content 1s sooner – an improvement of 29.4%.

So how about our new features?

We tested the enhanced HTTP/2 prioritisation feature on an Edge browser on desktop and saw meaningful content display 2s sooner – an improvement of 64%.

This is a more interesting result, taken from the blog example used to illustrate progressive image streaming. At first glance the improvement of 29% in speed index is good. The filmstrip comparison shows a more significant difference. In this case, before any images are shown, the page is already 43% visually complete in both scenarios after 1.5s. At 2.5s the difference is 77% visually complete compared to 50%.

This is a great example of how metrics do not tell the full story. They cannot completely replace viewing the page loading flow and understanding what is important for your site.

How to try

This is our first iteration of the new Speed Page and we are eager to get your feedback. We will be rolling it out to beta customers who are interested in seeing how their sites perform. To be added to the queue for activation of the new Speed Page, please click on the banner on the overview page,

or click on the banner on the existing Speed Page.

Faster script loading with BinaryAST?

Post Syndicated from Ingvar Stepanyan original https://blog.cloudflare.com/binary-ast/

JavaScript Cold starts

The performance of applications on the web platform is becoming increasingly bottlenecked by the startup (load) time. Large amounts of JavaScript code are required to create rich web experiences that we’ve become used to. When we look at the total size of JavaScript requested on mobile devices from HTTPArchive, we see that an average page loads 350KB of JavaScript, while 10% of pages go over the 1MB threshold. The rise of more complex applications can push these numbers even higher.

While caching helps, popular websites regularly release new code, which makes cold start (first load) times particularly important. With browsers moving to separate caches for different domains to prevent cross-site leaks, the importance of cold starts is growing even for popular subresources served from CDNs, as they can no longer be safely shared.

Usually, when talking about cold start performance, the primary factor considered is raw download speed. However, on modern interactive pages one of the other big contributors to cold starts is JavaScript parsing time. This might seem surprising at first, but makes sense – before starting to execute the code, the engine has to first parse the fetched JavaScript, make sure it doesn’t contain any syntax errors and then compile it to the initial bytecode. As networks become faster, parsing and compilation of JavaScript could become the dominant factor.

Device capability (CPU or memory performance) is the most important factor in the variance of JavaScript parsing times and, correspondingly, the time to application start. A 1MB JavaScript file will take on the order of 100 ms to parse on a modern desktop or high-end mobile device, but can take over a second on an average phone (Moto G4).

A more detailed post on the overall cost of parsing, compiling and execution of JavaScript shows how the JavaScript boot time can vary on different mobile devices. For example, in the case of news.google.com, it can range from 4s on a Pixel 2 to 28s on a low-end device.

While engines continuously improve raw parsing performance, with V8 in particular doubling it over the past year, as well as moving more things off the main thread, parsers still have to do lots of potentially unnecessary work that consumes memory and battery and might delay the processing of useful resources.

The “BinaryAST” Proposal

This is where BinaryAST comes in. BinaryAST is a new over-the-wire format for JavaScript proposed and actively developed by Mozilla that aims to speed up parsing while keeping the semantics of the original JavaScript intact. It does so by using an efficient binary representation for code and data structures, as well as by storing and providing extra information to guide the parser ahead of time.

The name comes from the fact that the format stores the JavaScript source as an AST encoded into a binary file. The specification lives at tc39.github.io/proposal-binary-ast and is being worked on by engineers from Mozilla, Facebook, Bloomberg and Cloudflare.

“Making sure that web applications start quickly is one of the most important, but also one of the most challenging parts of web development. We know that BinaryAST can radically reduce startup time, but we need to collect real-world data to demonstrate its impact. Cloudflare’s work on enabling use of BinaryAST with Cloudflare Workers is an important step towards gathering this data at scale.”

Till Schneidereit, Senior Engineering Manager, Developer Technologies
Mozilla

Parsing JavaScript

For regular JavaScript code to execute in a browser, the source is parsed into an intermediate representation known as an AST that describes the syntactic structure of the code. This representation can then be compiled into bytecode or native machine code for execution.

A simple example of adding two numbers can be represented in an AST as:

[Diagram: AST for an expression that adds two numbers]

Parsing JavaScript is not an easy task; no matter which optimisations you apply, it still requires reading the entire text file char by char, while tracking extra context for syntactic analysis.

The goal of BinaryAST is to reduce the complexity and the amount of work the browser parser has to do overall, by providing additional information and context at the time and place where the parser needs it.

To execute JavaScript delivered as BinaryAST the only steps required are:

[Diagram: the steps required to execute JavaScript delivered as BinaryAST]

Another benefit of BinaryAST is that it makes it possible to parse only the critical code necessary for start-up, completely skipping over the unused bits. This can dramatically improve the initial loading time.

This post will now describe some of the challenges of parsing JavaScript in more detail, explain how the proposed format addressed them, and how we made it possible to run its encoder in Workers.

Hoisting

JavaScript relies on hoisting for all declarations – variables, functions, classes. Hoisting is a property of the language that allows you to declare items after the point they’re syntactically used.

Let’s take the following example:

function f() {
	return g();
}

function g() {
	return 42;
}

Here, when the parser is looking at the body of f, it doesn’t know yet what g is referring to – it could be an already existing global function or something declared further in the same file – so it can’t finalise parsing of the original function and start the actual compilation.

BinaryAST fixes this by storing all the scope information and making it available upfront before the actual expressions.

As shown by the difference between the initial AST and the enhanced AST in a JSON representation:

[Image: the initial AST and the scope-enhanced AST, shown as JSON]

Lazy parsing

One common technique used by modern engines to improve parsing times is lazy parsing. It utilises the fact that lots of websites include more JavaScript than they actually need, especially for the start-up.

Working around this involves a set of heuristics that try to guess when any given function body in the code can be safely skipped by the parser initially and delayed for later. A common example of such a heuristic is immediately running the full parser for any function that is wrapped in parentheses:

(function(...

Such a prefix usually indicates that the following function is going to be an IIFE (immediately-invoked function expression), so the parser can assume that it will be compiled and executed ASAP and wouldn’t benefit from being skipped over and delayed for later.

(function() {
	…
})();

These heuristics significantly improve the performance of the initial parsing and cold starts, but they’re not completely reliable or trivial to implement.

One of the reasons is the same as in the previous section – even with lazy parsing, you still need to read the contents, analyse them and store additional scope information for the declarations.

Another reason is that the JavaScript specification requires reporting any syntax errors immediately during load time, and not when the code is actually executed. A class of these errors, called early errors, covers mistakes like usage of reserved words in invalid contexts, strict mode violations, variable name clashes and more. All of these checks require not only lexing the JavaScript source, but also tracking extra state even during lazy parsing.

Having to do such extra work means you need to be careful about marking functions as lazy too eagerly, especially if they actually end up being executed during the page load. Otherwise you’re making cold start costs even worse, as now every function that is erroneously marked as lazy needs to be parsed twice – once by the lazy parser and then again by the full one.

Because BinaryAST is meant to be an output format of other tools such as Babel, TypeScript and bundlers such as Webpack, the browser parser can rely on the JavaScript being already analysed and verified by the initial parser. This allows it to skip function bodies completely, making lazy parsing essentially free.

It reduces the cost of completely unused code – while including it is still a problem in terms of network bandwidth (don’t do this!), at least it no longer affects parsing times. These benefits apply equally to code that is used later in the page lifecycle (for example, invoked in response to user actions) but is not required during the startup.

Last but not least, a further benefit of this approach is that BinaryAST encodes lazy annotations as part of the format, giving tools and developers direct and full control over the heuristics. For example, a tool targeting the Web platform or a framework CLI can use its domain-specific knowledge to mark some event handlers as lazy or eager depending on the context and the event type.

Avoiding ambiguity in parsing

Using a text format for a programming language is great for readability and debugging, but it’s not the most efficient representation for parsing and execution.

For example, parsing low-level types like numbers, booleans and even strings from text requires extra analysis and computation, which is unnecessary when you can store them as native binary-encoded values in the first place and read them directly on the other side.

Another problem is an ambiguity in the grammar itself. It was already an issue in the ES5 world, but could usually be resolved with some extra bookkeeping based on the previously seen tokens. However, in ES6+ there are productions that can be ambiguous all the way through until they’re parsed completely.

For example, a token sequence like:

(a, {b: c, d}, [e = 1])...

can start either a parenthesized comma expression with nested object and array literals and an assignment:

(a, {b: c, d}, [e = 1]); // it was an expression

or a parameter list of an arrow expression function with nested object and array patterns and a default value:

(a, {b: c, d}, [e = 1]) => … // it was a parameter list

Both representations are perfectly valid, but have completely different semantics, and you can’t know which one you’re dealing with until you see the final token.

To work around this, parsers usually have to either backtrack, which can easily become exponentially slow, or parse the contents into intermediate node types that are capable of holding both expressions and patterns, with a subsequent conversion. The latter approach preserves linear performance, but makes the implementation more complicated and requires preserving more state.

In the BinaryAST format this issue doesn’t exist in the first place because the parser sees the type of each node before it even starts parsing its contents.

Cloudflare Implementation

Currently, the format is still in flux, but the very first version of the client-side implementation was released under a flag in Firefox Nightly several months ago. Keep in mind this is only an initial unoptimised prototype, and there are already several experiments changing the format to provide improvements to both size and parsing performance.

On the producer side, the reference implementation lives at github.com/binast/binjs-ref. Our goal was to take this reference implementation and consider how we would deploy it at Cloudflare scale.

If you dig into the codebase, you will notice that it currently consists of two parts.

One is the encoder itself, which is responsible for taking a parsed AST, annotating it with scope and other relevant information, and writing out the result in one of the currently supported formats. This part is written in Rust and is fully native.

Another part is what produces that initial AST – the parser. Interestingly, unlike the encoder, it’s implemented in JavaScript.

Unfortunately, there is currently no battle-tested native JavaScript parser with an open API, let alone one implemented in Rust. There have been a few attempts, but, given the complexity of the JavaScript grammar, it’s better to wait a bit and make sure they’re well-tested before incorporating one into the production encoder.

On the other hand, over the last few years the JavaScript ecosystem grew to extensively rely on developer tools implemented in JavaScript itself. In particular, this gave a push to rigorous parser development and testing. There are several JavaScript parser implementations that have been proven to work on thousands of real-world projects.

With that in mind, it makes sense that the BinaryAST implementation chose to use one of them – in particular, Shift – and integrated it with the Rust encoder, instead of attempting to use a native parser.

Connecting Rust and JavaScript

Integration is where things get interesting.

Rust is a native language that can compile to an executable binary, but JavaScript requires a separate engine to be executed. To connect them, we need some way to transfer data between the two without sharing the memory.

Initially, the reference implementation generated JavaScript code with an embedded input on the fly, passed it to Node.js and then read the output when the process had finished. That code contained a call to the Shift parser with an inlined input string and produced the AST back in a JSON format.

This doesn’t scale well when parsing lots of JavaScript files, so the first thing we did was transform the Node.js side into a long-lived daemon. Now Rust could spawn the required Node.js process just once and keep passing inputs into it and getting responses back as individual messages.

Running in the cloud

While the Node.js solution worked fairly well after these optimisations, shipping both a Node.js instance and a native bundle to production requires some effort. It’s also potentially risky and requires manual sandboxing of both processes to make sure we don’t accidentally start executing malicious code.

On the other hand, the only thing we needed from Node.js is the ability to run the JavaScript parser code. And we already have an isolated JavaScript engine running in the cloud – Cloudflare Workers! By additionally compiling the native Rust encoder to Wasm (which is quite easy with the native toolchain and wasm-bindgen), we can even run both parts of the code in the same process, making cold starts and communication much faster than in a previous model.

Optimising data transfer

The next logical step is to reduce the overhead of data transfer. JSON worked fine for communication between separate processes, but with a single process we should be able to retrieve the required bits directly from the JavaScript-based AST.

To attempt this, first of all, we needed to move away from the direct JSON usage to something more generic that would allow us to support various import formats. The Rust ecosystem already has an amazing serialisation framework for that – Serde.

Aside from allowing us to be more flexible in regard to the inputs, rewriting to Serde helped an existing native use case too. Now, instead of parsing JSON into an intermediate representation and then walking through it, all the native typed AST structures can be deserialized directly from the stdout pipe of the Node.js process in a streaming manner. This significantly improved both the CPU usage and memory pressure.

But there is one more thing we can do: instead of serializing and deserializing from an intermediate format (let alone, a text format like JSON), we should be able to operate [almost] directly on JavaScript values, saving memory and repetitive work.

How is this possible? wasm-bindgen provides a type called JsValue that stores a handle to an arbitrary value on the JavaScript side. This handle internally contains an index into a predefined array.

Each time a JavaScript value is passed to the Rust side as a result of a function call or a property access, it’s stored in this array and an index is sent to Rust. The next time Rust wants to do something with that value, it passes the index back and the JavaScript side retrieves the original value from the array and performs the required operation.
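
The underlying pattern is simply a handle table. Here is a tiny sketch of the general idea (written in C purely for illustration; this is not wasm-bindgen's actual JavaScript/Rust glue):

#include <stddef.h>

/* Toy handle table: the actual values live on one side of the boundary,
   while the other side only ever sees integer indices into this array. */
#define MAX_HANDLES 1024

static void  *slots[MAX_HANDLES];
static size_t next_slot = 1;            /* 0 is reserved as "no value" */

/* Store a value and hand out an index that the foreign side can keep. */
static size_t
handle_store(void *value)
{
    if (next_slot >= MAX_HANDLES) {
        return 0;                       /* table full */
    }
    slots[next_slot] = value;
    return next_slot++;
}

/* Resolve an index received from the foreign side back into the value. */
static void *
handle_get(size_t idx)
{
    return slots[idx];
}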

By reusing this mechanism, we could implement a Serde deserializer that requests only the required values from the JS side and immediately converts them to their native representation. It’s now open-sourced under https://github.com/cloudflare/serde-wasm-bindgen.

At first, we got much worse performance out of this due to the overhead of more frequent calls between 1) Wasm and JavaScript (SpiderMonkey has improved these recently, but other engines still lag behind) and 2) JavaScript and C++, which also can’t be optimised well in most engines.

The JavaScript <-> C++ overhead comes from the usage of TextEncoder to pass strings between JavaScript and Wasm in wasm-bindgen, and, indeed, it showed up as the highest in the benchmark profiles. This wasn’t surprising – after all, strings can appear not only in the value payloads, but also in property names, which have to be serialized and sent between JavaScript and Wasm over and over when using a generic JSON-like structure.

Luckily, because our deserializer doesn’t have to be compatible with JSON anymore, we can use our knowledge of Rust types and cache all the serialized property names as JavaScript value handles just once, and then keep reusing them for further property accesses.

This, combined with some changes to wasm-bindgen which we have upstreamed, allows our deserializer to be up to 3.5x faster in benchmarks than the original Serde support in wasm-bindgen, while saving ~33% off the resulting code size. Note that for string-heavy data structures it might still be slower than the current JSON-based integration, but the situation is expected to improve over time when the reference types proposal lands natively in Wasm.

After implementing and integrating this deserializer, we used the wasm-pack plugin for Webpack to build a Worker with both Rust and JavaScript parts combined and shipped it to some test zones.

Show me the numbers

Keep in mind that this proposal is in very early stages, and current benchmarks and demos are not representative of the final outcome (which should improve numbers much further).

As mentioned earlier, BinaryAST can mark functions that should be parsed lazily ahead of time. By using different levels of lazification in the encoder (https://github.com/binast/binjs-ref/blob/b72aff7dac7c692a604e91f166028af957cdcda5/crates/binjs_es6/src/lazy.rs#L43) and running tests against some popular JavaScript libraries, we found the following speed-ups.

Level 0 (no functions are lazified)

With lazy parsing disabled in both parsers we got a raw parsing speed improvement of between 3 and 10%.

Name           | Source size (kb) | JavaScript parse time (average ms) | BinaryAST parse time (average ms) | Diff (%)
React          | 20               | 0.403                              | 0.385                             | -4.56
D3 (v5)        | 240              | 11.178                             | 10.525                            | -6.018
Angular        | 180              | 6.985                              | 6.331                             | -9.822
Babel          | 780              | 21.255                             | 20.599                            | -3.135
Backbone       | 32               | 0.775                              | 0.699                             | -10.312
wabtjs         | 1720             | 64.836                             | 59.556                            | -8.489
Fuzzball (1.2) | 72               | 3.165                              | 2.768                             | -13.383

Level 3 (functions up to 3 levels deep are lazified)

But with the lazification set to skip nested functions of up to 3 levels, we see much more dramatic improvements in parsing time, between 90 and 97%. As mentioned earlier in the post, BinaryAST makes lazy parsing essentially free by completely skipping over the marked functions.

Name           | Source size (kb) | JavaScript parse time (average ms) | BinaryAST parse time (average ms) | Diff (%)
React          | 20               | 0.407                              | 0.032                             | -92.138
D3 (v5)        | 240              | 11.623                             | 0.224                             | -98.073
Angular        | 180              | 7.093                              | 0.680                             | -90.413
Babel          | 780              | 21.100                             | 0.895                             | -95.758
Backbone       | 32               | 0.898                              | 0.045                             | -94.989
wabtjs         | 1720             | 59.802                             | 1.601                             | -97.323
Fuzzball (1.2) | 72               | 2.937                              | 0.089                             | -96.970

All the numbers are from manual tests on a Linux x64 Intel i7 machine with 16 GB of RAM.

While these synthetic benchmarks are impressive, they are not representative of real-world scenarios. Normally you will use at least some of the loaded JavaScript during the startup. To check this scenario, we decided to test some realistic pages and demos on desktop and mobile Firefox and found speed-ups in page loads too.

For a sample application (https://github.com/cloudflare/binjs-demo, https://serve-binjs.that-test.site/) which weighed in at around 1.2 MB of JavaScript we got the following numbers for initial script execution:

Device              | JavaScript | BinaryAST
Desktop             | 338ms      | 314ms
Mobile (HTC One M8) | 2019ms     | 1455ms

Here is a video that will give you an idea of the improvement as seen by a user on mobile Firefox (in this case showing the entire page startup time):

[Video: full page startup on mobile Firefox, JavaScript vs. BinaryAST]

The next step is to start gathering data on real-world websites, while improving the underlying format.

How do I test BinaryAST on my website?

We’ve open-sourced our Worker so that it can be installed on any Cloudflare zone: https://github.com/binast/binjs-ref/tree/cf-wasm.

One thing to be wary of at the moment is that, even though the result gets stored in the cache, the initial encoding is still an expensive process, and might easily hit CPU limits on any non-trivial JavaScript files and fall back to the unencoded variant. We are working to improve this situation by releasing the BinaryAST encoder as a separate feature with more relaxed limits in the next few days.

Meanwhile, if you want to play with BinaryAST on larger real-world scripts, an alternative option is to use a static binjs_encode tool from https://github.com/binast/binjs-ref to pre-encode JavaScript files ahead of time. Then, you can use a Worker from https://github.com/cloudflare/binast-cf-worker to serve the resulting BinaryAST assets when supported and requested by the browser.

On the client side, you’ll currently need to download Firefox Nightly, go to about:config and enable unrestricted BinaryAST support via the following options:

[Screenshot: the about:config options that enable BinaryAST support]

Now, when opening a website with either of the Workers installed, Firefox will get BinaryAST instead of JavaScript automatically.

Summary

The amount of JavaScript in modern apps is presenting performance challenges for all consumers. Engine vendors are experimenting with different ways to improve the situation – some are focusing on raw decoding performance, some on parallelizing operations to reduce overall latency, some are researching new optimised formats for data representation, and some are inventing and improving protocols for the network delivery.

No matter which one it is, we all have a shared goal of making the Web better and faster. On Cloudflare’s side, we’re always excited about collaborating with all the vendors and combining various approaches to bring that goal closer with every step.

Live video just got more live: Introducing Concurrent Streaming Acceleration

Post Syndicated from Jon Levine original https://blog.cloudflare.com/introducing-concurrent-streaming-acceleration/

Today we’re excited to introduce Concurrent Streaming Acceleration, a new technique for reducing the end-to-end latency of live video on the web when using Stream Delivery.

Let’s dig into live-streaming latency, why it’s important, and what folks have done to improve it.

How “live” is “live” video?

Live streaming makes up an increasing share of video on the web. Whether it’s a TV broadcast, a live game show, or an online classroom, users expect video to arrive quickly and smoothly. And the promise of “live” is that the user is seeing events as they happen. But just how close to “real-time” is “live” Internet video?

Delivering live video on the Internet is still hard and adds lots of latency:

  1. The content source records video and sends it to an encoding server;
  2. The origin server transforms this video into a format like DASH, HLS or CMAF that can be delivered to millions of devices efficiently;
  3. A CDN is typically used to deliver encoded video across the globe;
  4. Client players decode the video and render it on the screen.

And all of this is under a time constraint – the whole process needs to happen in a few seconds, or video experiences will suffer. We call the total delay between when the video was shot and when it can be viewed on an end-user’s device the “end-to-end latency” (think of it as the time from the camera lens to your phone’s screen).

Traditional segmented delivery

Video formats like DASH, HLS, and CMAF work by splitting video into small files, called “segments”. A typical segment duration is 6 seconds.

If a client player needs to wait for a whole 6s segment to be encoded, sent through a CDN, and then decoded, it can be a long wait! It takes even longer if you want the client to build up a buffer of segments to protect against any interruptions in delivery. A typical player buffer for HLS is 3 segments:

[Diagram: clients may have to buffer three 6-second chunks, introducing at least 18s of latency]

When you consider encoding delays, it’s easy to see why live streaming latency on the Internet has typically been about 20-30 seconds. We can do better.

Reduced latency with chunked transfer encoding

A natural way to solve this problem is to enable client players to start playing the chunks while they’re downloading, or even while they’re still being created. Making this possible requires a clever bit of cooperation to encode and deliver the files in a particular way, known as “chunked encoding.” This involves splitting up segments into smaller, bite-sized pieces, or “chunks”. Chunked encoding can typically bring live latency down to 5 or 10 seconds.

Confusingly, the word “chunk” is overloaded to mean two different things:

  1. CMAF or HLS chunks, which are small pieces of a segment (typically 1s) that are aligned on key frames
  2. HTTP chunks, which are just a way of delivering any file over the web

[Diagram: chunked encoding splits segments into shorter chunks]

HTTP chunks are important because web clients have limited ability to process streams of data. Most clients can only work with data once they’ve received the full HTTP response, or at least a complete HTTP chunk. By using HTTP chunked transfer encoding, we enable video players to start parsing and decoding video sooner.
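
On the wire, HTTP chunked transfer encoding frames each piece as a hexadecimal length line followed by the bytes and a trailing CRLF, with a zero-length chunk marking the end of the response. A minimal sketch of emitting one chunk (send_all is a hypothetical helper that writes a whole buffer to the connection):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes all 'len' bytes of 'buf' to the socket. */
static void send_all(int fd, const void *buf, size_t len);

/* Write one HTTP chunk: "<hex length>\r\n<data>\r\n". Sending
   "0\r\n\r\n" afterwards terminates the chunked response. */
static void
send_http_chunk(int fd, const unsigned char *data, size_t len)
{
    char header[32];
    int  n = snprintf(header, sizeof(header), "%zx\r\n", len);

    send_all(fd, header, (size_t)n);
    send_all(fd, data, len);
    send_all(fd, "\r\n", 2);
}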

CMAF chunks are important so that decoders can actually play the bits that are in the HTTP chunks. Without encoding video in a careful way, decoders would have random bits of a video file that can’t be played.

CDNs can introduce additional buffering

Chunked encoding with HLS and CMAF is growing in use across the web today. Part of what makes this technique great is that HTTP chunked encoding is widely supported by CDNs – it’s been part of the HTTP spec for 20 years.

CDN support is critical because it allows low-latency live video to scale up and reach audiences of thousands or millions of concurrent viewers – something that’s currently very difficult to do with other, non-HTTP based protocols.

Unfortunately, even if you enable chunking to optimise delivery, your CDN may be working against you by buffering the entire segment. To understand why, consider what happens when many people request a live segment at the same time:

[Diagram: many viewers requesting the same live segment at the same time]

If the file is already in cache, great! CDNs do a great job at delivering cached files to huge audiences. But what happens when the segment isn’t in cache yet? Remember – this is the typical request pattern for live video!

Typically, CDNs are able to “stream on cache miss” from the origin. That looks something like this:

[Diagram: the CDN streaming a segment from the origin to a viewer on a cache miss]

But again – what happens when multiple people request the file at once? CDNs typically need to pull the entire file into cache before serving additional viewers:

Only one viewer can stream video, while other clients wait for the segment to buffer at the CDN

This behavior is understandable. CDN data centers consist of many servers. To avoid overloading origins, these servers typically coordinate amongst themselves using a “cache lock” (mutex) that allows only one server to request a particular file from origin at a given time. A side effect of this is that while a file is being pulled into cache, it can’t be served to any user other than the first one that requested it. Unfortunately, this cache lock also defeats the purpose of using chunked encoding!
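As an illustration only (this is not Cloudflare’s cache code), a cache lock behaves much like request coalescing: the first request for a key triggers the origin fetch, and every later request waits on that same in-flight operation:

// Illustrative request coalescing, not Cloudflare's actual implementation.
// The first request for a key goes to origin; later requests for the same
// key wait for that single fetch instead of hitting the origin again.
const inFlight = new Map();

async function fetchWithCacheLock(key, fetchFromOrigin) {
  if (!inFlight.has(key)) {
    inFlight.set(key, fetchFromOrigin(key).finally(() => inFlight.delete(key)));
  }
  // Every caller shares the same promise -- but nobody gets a single byte
  // until the whole origin fetch resolves, which is exactly what defeats
  // chunked encoding for live video.
  return inFlight.get(key);
}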

To recap thus far:

  • Chunked encoding splits up video segments into smaller pieces
  • This can reduce end-to-end latency by allowing chunks to be fetched and decoded by players, even while segments are being produced at the origin server
  • Some CDNs neutralize the benefits of chunked encoding by buffering entire files inside the CDN before they can be delivered to clients

Cloudflare’s solution: Concurrent Streaming Acceleration

As you may have guessed, we think we can do better. Put simply, we now have the ability to deliver un-cached files to multiple clients simultaneously while we pull the file once from the origin server.

This sounds like a simple change, but there’s a lot of subtlety to do this safely. Under the hood, we’ve made deep changes to our caching infrastructure to remove the cache lock and enable multiple clients to be able to safely read from a single file while it’s still being written.

The best part is – all of Cloudflare now works this way! There’s no need to opt in, or even make a config change, to get the benefit.

We rolled this feature out a couple of months ago and have been really pleased with the results so far. We measure success by the “cache lock wait time,” i.e. how long a request must wait for other requests – a direct component of Time To First Byte.  One OTT customer saw this metric drop from 1.5s at P99 to nearly 0, as expected:

[Chart: cache lock wait time at P99 dropping from 1.5s to nearly 0]

This directly translates into a 1.5-second improvement in end-to-end latency. Live video just got more live!

Conclusion

New techniques like chunked encoding have revolutionized live delivery, enabling publishers to deliver low-latency live video at scale. Concurrent Streaming Acceleration helps you unlock the power of this technique at your CDN, potentially shaving precious seconds off end-to-end latency.

If you’re interested in using Cloudflare for live video delivery, contact our enterprise sales team.

And if you’re interested in working on projects like this and helping us improve live video delivery for the entire Internet, join our engineering team!

Announcing Cloudflare Image Resizing: Simplifying Optimal Image Delivery

Post Syndicated from Isaac Specter original https://blog.cloudflare.com/announcing-cloudflare-image-resizing-simplifying-optimal-image-delivery/

In the past three years, the amount of image data on the median mobile webpage has doubled. Growing images translate directly to users hitting data transfer caps, experiencing slower websites, and even leaving if a website doesn’t load in a reasonable amount of time. The crime is that many of these images are so slow because they are larger than they need to be, sending data over the wire which has absolutely no (positive) impact on the user’s experience.

To provide a concrete example, let’s consider this photo of Cloudflare’s Lava Lamp Wall:

[Side-by-side photos: the image scaled to 300 pixels wide, and the same image delivered at its original high resolution]

On the left you see the photo, scaled to 300 pixels wide. On the right you see the same image delivered in its original high resolution, scaled in a desktop web browser. They both look exactly the same, yet the image on the right takes more than twenty times more data to load. Even for the best and most conscientious developers, resizing every image to handle every possible device geometry consumes valuable time, and it’s exceptionally easy to forget to do this resizing altogether.

Today we are launching a new product, Image Resizing, to fix this problem once and for all.

Announcing Image Resizing

With Image Resizing, Cloudflare adds another important product to its suite of available image optimizations.  This product allows customers to perform a rich set of key actions on images:

  • Resize – The source image will be resized to the specified height and width.  This action allows multiple different sized variants to be created for each specific use.
  • Crop – The source image will be resized to a new size that does not maintain the original aspect ratio and a portion of the image will be removed.  This can be especially helpful for headshots and product images where different formats must be achieved by keeping only a portion of the image.
  • Compress – The source image will have its file size reduced by applying lossy compression.  This should be used when slight quality reduction is an acceptable trade for file size reduction.
  • Convert to WebP – When the user’s browser supports it, the source image will be converted to WebP.  Delivering a WebP image takes advantage of the modern, highly optimized image format.

By using a combination of these actions, customers store a single high quality image on their server, and Image Resizing can be leveraged to create specialized variants for each specific use case.  Without any additional effort, each variant will also automatically benefit from Cloudflare’s global caching.

Examples

Ecommerce Thumbnails

Ecommerce sites typically store a high-quality image of each product.  From that image, they need to create different variants depending on how that product will be displayed.  One example is creating thumbnails for a catalog view.  Using Image Resizing, if the high quality image is located here:

https://example.com/images/shoe123.jpg

This is how to display a 75×75 pixel thumbnail using Image Resizing:

<img src="/cdn-cgi/image/width=75,height=75/images/shoe123.jpg">

Responsive Images

When tailoring a site to work on various device types and sizes, it’s important to always use correctly sized images.  This can be difficult when images are intended to fill a particular percentage of the screen.  To solve this problem, the srcset and sizes attributes of the <img> element can be used.

Without Image Resizing, multiple versions of the same image would need to be created and stored.  In this example, a single high quality copy of hero.jpg is stored, and Image Resizing is used to resize for each particular size as needed.

<img width="100%"
     srcset="/cdn-cgi/image/fit=contain,width=320/assets/hero.jpg 320w,
             /cdn-cgi/image/fit=contain,width=640/assets/hero.jpg 640w,
             /cdn-cgi/image/fit=contain,width=960/assets/hero.jpg 960w,
             /cdn-cgi/image/fit=contain,width=1280/assets/hero.jpg 1280w,
             /cdn-cgi/image/fit=contain,width=2560/assets/hero.jpg 2560w"
     src="/cdn-cgi/image/width=960/assets/hero.jpg">

Enforce Maximum Size Without Changing URLs

Image Resizing is also available from within a Cloudflare Worker. Workers allow you to write code which runs close to your users all around the world. For example, you might wish to add Image Resizing to your images while keeping the same URLs. Your users and clients would be able to use the same image URLs as always, but the images will be transparently modified in whatever way you need.

You can install a Worker on a route which matches your image URLs, and resize any images larger than a limit:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  return fetch(request, {
    cf: {
      image: {
        width: 800,
        height: 800,
        fit: 'scale-down'
      }
    }
  })
}

As a Worker is just code, it is also easy to run this worker only on URLs with image extensions, or even to only resize images being delivered to mobile clients.
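As a sketch of that idea (the extension list, the mobile check and the widths below are illustrative assumptions, not recommended values), the Worker above could be narrowed like this:

// Sketch only: resize just common image extensions, and use a smaller
// width for clients that identify as mobile. Values are illustrative.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const url = new URL(request.url)
  if (!/\.(jpe?g|png|gif)$/i.test(url.pathname)) {
    return fetch(request)  // not an image: pass through untouched
  }
  const isMobile = /Mobile/i.test(request.headers.get('User-Agent') || '')
  return fetch(request, {
    cf: {
      image: {
        width: isMobile ? 400 : 800,
        fit: 'scale-down'
      }
    }
  })
}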

Cloudflare and Images

Cloudflare has a long history building tools to accelerate images. Our caching has always helped reduce latency by storing a copy of images closer to the user.  Polish automates options for both lossless and lossy image compression to remove unnecessary bytes from images.  Mirage accelerates image delivery based on device type. We are continuing to invest in all of these tools, as they all serve a unique role in improving the image experience on the web.

Image Resizing is different because it is the first image product at Cloudflare to give developers full control over how their images will be served. You should choose Image Resizing if you are comfortable defining, in advance or within a Cloudflare Worker, the sizes you wish your images to be served at.

Next Steps and Simple Pricing

Image Resizing is available today for Business and Enterprise Customers.  To enable it, log in to the Cloudflare Dashboard and navigate to the Speed tab.  There you’ll find the section for Image Resizing, which you can enable with one click.

This product is included in the Business and Enterprise plans at no additional cost with generous usage limits.  Business Customers have a limit of 100k requests per month and will be charged $10 for each additional 100k requests per month.  Enterprise Customers have a 10M requests per month limit with discounted tiers for higher usage.  Requests are defined as a hit on a URI that contains Image Resizing or a call to Image Resizing from a Worker.

Now that you’ve enabled Image Resizing, it’s time to resize your first image.

  1. Using your existing site, store an image here: https://yoursite.com/images/yourimage.jpg
  2. Use this URL to resize that image:
    https://yoursite.com/cdn-cgi/image/width=100,height=100,quality=75/images/yourimage.jpg
  3. Experiment with changing width=, height=, and quality=.

The instructions above use the Default URL Format for Image Resizing.  For details on options, use cases, and compatibility, refer to our Developer Documentation.

Parallel streaming of progressive images

Post Syndicated from Andrew Galloni original https://blog.cloudflare.com/parallel-streaming-of-progressive-images/

Progressive image rendering and HTTP/2 multiplexing technologies have existed for a while, but now we’ve combined them in a new way that makes them much more powerful. With Cloudflare’s progressive streaming, images appear to load in half of the time, and browsers can start rendering pages sooner.

In HTTP/1.1 connections, servers didn’t have any choice about the order in which resources were sent to the client; they had to send responses, as a whole, in the exact order they were requested by the web browser. HTTP/2 improved this by adding multiplexing and prioritization, which allows servers to decide exactly what data is sent and when. We’ve taken advantage of these new HTTP/2 capabilities to improve perceived speed of loading of progressive images by sending the most important fragments of image data sooner.

This feature is compatible with all major browsers, and doesn’t require any changes to page markup, so it’s very easy to adopt. Sign up for the Beta to enable it on your site!

What is progressive image rendering?

Basic images load strictly from top to bottom. If a browser has received only half of an image file, it can show only the top half of the image. Progressive images have their content arranged not from top to bottom, but from a low level of detail to a high level of detail. Receiving a fraction of image data allows browsers to show the entire image, only with a lower fidelity. As more data arrives, the image becomes clearer and sharper.

This works great in the JPEG format, where only about 10-15% of the data is needed to display a preview of the image, and at 50% of the data the image looks almost as good as when the whole file is delivered. Progressive JPEG images contain exactly the same data as baseline images, merely reshuffled in a more useful order, so progressive rendering doesn’t add any cost to the file size. This is possible because JPEG doesn’t store the image as pixels. Instead, it represents the image as frequency coefficients, which are like a set of predefined patterns that can be blended together, in any order, to reconstruct the original image. The inner workings of JPEG are really fascinating, and you can learn more about them from my recent performance.now() conference talk.

The end result is that the images can look almost fully loaded in half of the time, for free! The page appears to be visually complete and can be used much sooner. The rest of the image data arrives shortly after, upgrading images to their full quality, before visitors have time to notice anything is missing.

HTTP/2 progressive streaming

But there’s a catch. Websites have more than one image (sometimes even hundreds of images). When the server sends image files naïvely, one after another, the progressive rendering doesn’t help that much, because overall the images still load sequentially:

[Diagram: multiple progressive images loading one after another over a single connection]

Having complete data for half of the images (and no data for the other half) doesn’t look as good as having half of the data for all images.

And there’s another problem: when the browser doesn’t know image sizes yet, it lays the page out with placeholders instead, and re-lays out the page as each image loads. This can make pages jump during loading, which is inelegant, distracting and annoying for the user.

Our new progressive streaming feature greatly improves the situation: we can send all of the images at once, in parallel. This way the browser gets size information for all of the images as soon as possible, can paint a preview of all images without having to wait for a lot of data, and large images don’t delay loading of styles, scripts and other more important resources.

This idea of streaming of progressive images in parallel is as old as HTTP/2 itself, but it needs special handling in low-level parts of web servers, and so far this hasn’t been implemented at a large scale.

When we were improving our HTTP/2 prioritization, we realized it can be also used to implement this feature. Image files as a whole are neither high nor low priority. The priority changes within each file, and dynamic re-prioritization gives us the behavior we want:

  • The image header that contains the image size is very high priority, because the browser needs to know the size as soon as possible to do page layout. The image header is small, so it doesn’t hurt to send it ahead of other data.

  • The minimum amount of data in the image required to show a preview of the image has a medium priority (we’d like to plug "holes" left for unloaded images as soon as possible, but also leave some bandwidth available for scripts, fonts and other resources)

  • The remainder of the image data is low priority. Browsers can stream it last to refine image quality once there’s no rush, since the page is already fully usable.

Knowing the exact amount of data to send in each phase requires understanding the structure of image files, but it seemed weird to us to make our web server parse image responses and have a format-specific behavior hardcoded at a protocol level. By framing the problem as a dynamic change of priorities, we were able to elegantly separate low-level networking code from knowledge of image formats. We can use Workers or offline image processing tools to analyze the images, and instruct our server to change HTTP/2 priorities accordingly.
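As one possible offline analysis (a rough sketch, not a description of our production tooling), the Start-of-Scan markers in a progressive JPEG give natural byte offsets at which to change priority, since each scan refines the whole image:

// Rough sketch: find Start-of-Scan (0xFF 0xDA) marker offsets in a
// progressive JPEG. Each scan refines the whole image, so these offsets
// are reasonable points at which to lower the priority of what remains.
// (Ignores edge cases such as embedded thumbnails inside metadata segments.)
function jpegScanOffsets(bytes) {
  const offsets = [];
  for (let i = 0; i + 1 < bytes.length; i++) {
    if (bytes[i] === 0xFF && bytes[i + 1] === 0xDA) {
      offsets.push(i);
    }
  }
  return offsets;
}

// These offsets can then drive a priority-change instruction like the
// cf-priority-change header described later in this post.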

The great thing about parallel streaming of images is that it doesn’t add any overhead. We’re still sending the same data, the same amount of data, we’re just sending it in a smarter order. This technique takes advantage of existing web standards, so it’s compatible with all browsers.

The waterfall

Here are waterfall charts from WebPageTest showing comparison of regular HTTP/2 responses and progressive streaming. In both cases the files were exactly the same, the amount of data transferred was the same, and the overall page loading time was the same (within measurement noise). In the charts, blue segments show when data was transferred, and green shows when each request was idle.

[Waterfall charts: regular HTTP/2 responses (first chart) and progressive streaming (second chart)]

The first chart shows a typical server behavior that makes images load mostly sequentially. The chart itself looks neat, but the actual experience of loading that page was not great — the last image didn’t start loading until almost the end.

The second chart shows images loaded in parallel. The blue vertical streaks throughout the chart are image headers sent early followed by a couple of stages of progressive rendering. You can see that useful data arrived sooner for all of the images. You may notice that one of the images has been sent in one chunk, rather than split like all the others. That’s because at the very beginning of a TCP/IP connection we don’t know the true speed of the connection yet, and we have to sacrifice some opportunity to do prioritization in order to maximize the connection speed.

The metrics compared to other solutions

There are other techniques intended to provide image previews quickly, such as low-quality image placeholders (LQIP), but they have several drawbacks. They add unnecessary data for the placeholders, usually interfere with browsers’ preload scanner, and delay loading of full-quality images due to their dependence on the JavaScript needed to upgrade the previews to full images.

  • Our solution doesn’t cause any additional requests, and doesn’t add any extra data. Overall page load time is not delayed.
  • Our solution doesn’t require any JavaScript. It takes advantage of functionality supported natively in the browsers.
  • Our solution doesn’t require any changes to page’s markup, so it’s very safe and easy to deploy site-wide.

The improvement in user experience is reflected in performance metrics such as SpeedIndex and time to visually complete. Notice that with regular image loading the visual progress is linear, but with progressive streaming it quickly jumps to mostly complete:

[Charts: visual completeness over time for regular image loading vs. progressive streaming]

Getting the most out of progressive rendering

Avoid ruining the effect with JavaScript. Scripts that hide images and wait until the onload event to reveal them (with a fade in, etc.) will defeat progressive rendering. Progressive rendering works best with the good old <img> element.
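For example, a lazy-reveal snippet like the following (shown purely to illustrate the anti-pattern) keeps images invisible until onload fires, so users never see the progressive preview:

// Anti-pattern (illustrative): hiding images until they have fully loaded
// throws away the progressive preview the browser could have shown.
document.querySelectorAll('img.fade-in').forEach(img => {
  img.style.opacity = '0';
  img.addEventListener('load', () => {
    img.style.transition = 'opacity 0.3s';
    img.style.opacity = '1'; // the user only sees the image now
  });
});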

Is it JPEG-only?

Our implementation is format-independent, but progressive streaming is useful only for certain file types. For example, it wouldn’t make sense to apply it to scripts or stylesheets: these resources are rendered as all-or-nothing.

Prioritizing of image headers (containing image size) works for all file formats.

The benefits of progressive rendering are unique to JPEG (supported in all browsers) and JPEG 2000 (supported in Safari). GIF and PNG have interlaced modes, but these modes come at a cost of worse compression. WebP doesn’t even support progressive rendering at all. This creates a dilemma: WebP is usually 20%-30% smaller than a JPEG of equivalent quality, but progressive JPEG appears to load 50% faster. There are next-generation image formats that support progressive rendering better than JPEG, and compress better than WebP, but they’re not supported in web browsers yet. In the meantime you can choose between the bandwidth savings of WebP or the better perceived performance of progressive JPEG by changing Polish settings in your Cloudflare dashboard.

Custom header for experimentation

We also support a custom HTTP header that allows you to experiment with, and optimize streaming of other resources on your site. For example, you could make our servers send the first frame of animated GIFs with high priority and deprioritize the rest. Or you could prioritize loading of resources mentioned in <head> of HTML documents before <body> is loaded.

The custom header can be set only from a Worker. The syntax is a comma-separated list of file positions with priority and concurrency. The priority and concurrency are the same as in the whole-file cf-priority header described in the previous blog post.

cf-priority-change: <offset in bytes>:<priority>/<concurrency>, ...

For example, for a progressive JPEG we use something like (this is a fragment of JS to use in a Worker):

let headers = new Headers(response.headers);
headers.set("cf-priority", "30/0");
headers.set("cf-priority-change", "512:20/1, 15000:10/n");
return new Response(response.body, {headers});

This instructs the server to use priority 30 initially, while it sends the first 512 bytes; then to switch to priority 20 with some concurrency (/1); and finally, after sending 15000 bytes of the file, to switch to low priority and high concurrency (/n) to deliver the rest of the file.

We’ll try to split HTTP/2 frames to match the offsets specified in the header to change the sending priority as soon as possible. However, priorities don’t guarantee that data of different streams will be multiplexed exactly as instructed, since the server can prioritize only when it has data of multiple streams waiting to be sent at the same time. If some of the responses arrive much sooner from the upstream server or the cache, the server may send them right away, without waiting for other responses.

Try it!

You can use our Polish tool to convert your images to progressive JPEG. Sign up for the beta to have them elegantly streamed in parallel.

Better HTTP/2 Prioritization for a Faster Web

Post Syndicated from Patrick Meenan original https://blog.cloudflare.com/better-http-2-prioritization-for-a-faster-web/

HTTP/2 promised a much faster web and Cloudflare rolled out HTTP/2 access for all our customers long, long ago. But one feature of HTTP/2, Prioritization, didn’t live up to the hype. Not because it was fundamentally broken but because of the way browsers implemented it.

Today Cloudflare is pushing out a change to HTTP/2 Prioritization that gives our servers control of prioritization decisions that truly make the web much faster.

Historically the browser has been in control of deciding how and when web content is loaded. Today we are introducing a radical change to that model for all paid plans that puts control into the hands of the site owner directly. Customers can enable “Enhanced HTTP/2 Prioritization” in the Speed tab of the Cloudflare dashboard: this overrides the browser defaults with an improved scheduling scheme that results in a significantly faster visitor experience (we have seen 50% faster on multiple occasions). With Cloudflare Workers, site owners can take this a step further and fully customize the experience to their specific needs.

Background

Web pages are made up of dozens (sometimes hundreds) of separate resources that are loaded and assembled by the browser into the final displayed content. This includes the visible content the user interacts with (HTML, CSS, images) as well as the application logic (JavaScript) for the site itself, ads, analytics for tracking site usage and marketing tracking beacons. The sequencing of how those resources are loaded can have a significant impact on how long it takes for the user to see the content and interact with the page.

A browser is basically an HTML processing engine that goes through the HTML document and follows the instructions in order from the start of the HTML to the end, building the page as it goes along. References to stylesheets (CSS) tell the browser how to style the page content and the browser will delay displaying content until it has loaded the stylesheet (so it knows how to style the content it is going to display). Scripts referenced in the document can have several different behaviors. If the script is tagged as “async” or “defer” the browser can keep processing the document and just run the script code whenever the scripts are available. If the scripts are not tagged as async or defer then the browser MUST stop processing the document until the script has downloaded and executed before continuing. These are referred to as “blocking” scripts because they block the browser from continuing to process the document until they have been loaded and executed.

The HTML document is split into two parts. The <head> of the document is at the beginning and contains stylesheets, scripts and other instructions for the browser that are needed to display the content. The <body> of the document comes after the head and contains the actual page content that is displayed in the browser window (though scripts and stylesheets are allowed to be in the body as well). Until the browser gets to the body of the document there is nothing to display to the user and the page will remain blank so getting through the head of the document as quickly as possible is important. “HTML5 rocks” has a great tutorial on how browsers work if you want to dive deeper into the details.

The browser is generally in charge of determining the order of loading the different resources it needs to build the page and to continue processing the document. In the case of HTTP/1.x, the browser is limited in how many things it can request from any one server at a time (generally 6 connections and only one resource at a time per connection) so the ordering is strictly controlled by the browser by how things are requested. With HTTP/2 things change pretty significantly. The browser can request all of the resources at once (at least as soon as it knows about them) and it provides detailed instructions to the server for how the resources should be delivered.

Optimal Resource Ordering

For most parts of the page loading cycle there is an optimal ordering of the resources that will result in the fastest user experience (and the difference between optimal and not can be significant – as much as a 50% improvement or more).

As described above, early in the page load cycle before the browser can render any content it is blocked on the CSS and blocking JavaScript in the <head> section of the HTML. During that part of the loading cycle it is best for 100% of the connection bandwidth to be used to download the blocking resources and for them to be downloaded one at a time in the order they are defined in the HTML. That lets the browser parse and execute each item while it is downloading the next blocking resource, allowing the download and execution to be pipelined.

The scripts take the same amount of time to download when downloaded in parallel or one after the other, but by downloading them sequentially the first script can be processed and executed while the second script is downloading.

Once the render-blocking content has loaded things get a little more interesting and the optimal loading may depend on the specific site or even business priorities (user content vs ads vs analytics, etc). Fonts in particular can be difficult as the browser only discovers what fonts it needs after the stylesheets have been applied to the content that is about to be displayed so by the time the browser knows about a font, it is needed to display text that is already ready to be drawn to the screen. Any delays in getting the font loaded end up as periods with blank text on the screen (or text displayed using the wrong font).

Generally there are some tradeoffs that need to be considered:

  • Custom fonts and visible images in the visible part of the page (viewport) should be loaded as quickly as possible. They directly impact the user’s visual experience of the page loading.
  • Non-blocking JavaScript should be downloaded serially relative to other JavaScript resources so the execution of each can be pipelined with the downloads. The JavaScript may include user-facing application logic as well as analytics tracking and marketing beacons and delaying them can cause a drop in the metrics that the business tracks.
  • Images benefit from downloading in parallel. The first few bytes of an image file contain the image dimensions which may be necessary for browser layout, and progressive images downloading in parallel can look visually complete with around 50% of the bytes transferred.

Weighing the tradeoffs, one strategy that works well in most cases is:

  • Custom fonts download sequentially and split the available bandwidth with visible images.
  • Visible images download in parallel, splitting the “images” share of the bandwidth among them.
  • When there are no more fonts or visible images pending:
    • Non-blocking scripts download sequentially and split the available bandwidth with non-visible images
    • Non-visible images download in parallel, splitting the “images” share of the bandwidth among them.

That way the content visible to the user is loaded as quickly as possible, the application logic is delayed as little as possible and the non-visible images are loaded in such a way that layout can be completed as quickly as possible.

Example

For illustrative purposes, we will use a simplified product category page from a typical e-commerce site. In this example the page has:

  • The HTML file for the page itself, represented by a blue box.
  • 1 external stylesheet (CSS file), represented by a green box.
  • 4 external scripts (JavaScript), represented by orange boxes. 2 of the scripts are blocking at the beginning of the page and 2 are asynchronous. The blocking script boxes use a darker shade of orange.
  • 1 custom web font, represented by a red box.
  • 13 images, represented by purple boxes. The page logo and 4 of the product images are visible in the viewport and 8 of the product images require scrolling to see. The 5 visible images use a darker shade of purple.

For simplicity, we will assume that all of the resources are the same size and each takes 1 second to download on the visitor’s connection. Loading everything takes a total of 20 seconds, but HOW it is loaded can have a huge impact on the experience.

This is what the described optimal loading would look like in the browser as the resources load:
[Animation: the example page loading with the optimal prioritization]

  • The page is blank for the first 4 seconds while the HTML, CSS and blocking scripts load, all using 100% of the connection.
  • At the 4-second mark the background and structure of the page is displayed with no text or images.
  • One second later, at 5 seconds, the text for the page is displayed.
  • From 5-10 seconds the images load, starting out as blurry but sharpening very quickly. By around the 7-second mark it is almost indistinguishable from the final version.
  • At the 10 second mark all of the visual content in the viewport has completed loading.
  • Over the next 2 seconds the asynchronous JavaScript is loaded and executed, running any non-critical logic (analytics, marketing tags, etc).
  • For the final 8 seconds the rest of the product images load so they are ready for when the user scrolls.

Current Browser Prioritization

All of the current browser engines implement different prioritization strategies, none of which are optimal.

Microsoft Edge and Internet Explorer do not support prioritization so everything falls back to the HTTP/2 default which is to load everything in parallel, splitting the bandwidth evenly among everything. Microsoft Edge is moving to use the Chromium browser engine in future Windows releases which will help improve the situation. In our example page this means that the browser is stuck in the head for the majority of the loading time since the images are slowing down the transfer of the blocking scripts and stylesheets.

Visually that results in a pretty painful experience of staring at a blank screen for 19 seconds before most of the content displays, followed by a 1-second delay for the text to display. Be patient when watching the animated progress because for the 19 seconds of blank screen it may feel like nothing is happening (even though it is):
[Animation: the example page loading in Edge/Internet Explorer]

Safari loads all resources in parallel, splitting the bandwidth between them based on how important Safari believes they are (with render-blocking resources like scripts and stylesheets being more important than images). Images load in parallel but also load at the same time as the render-blocking content.

While similar to Edge in that everything downloads at the same time, by allocating more bandwidth to the render-blocking resources Safari can display the content much sooner:
[Animation: the example page loading in Safari]

  • At around 8 seconds the stylesheet and scripts have finished loading so the page can start to be displayed. Since the images were loading in parallel, they can also be rendered in their partial state (blurry for progressive images). This is still twice as slow as the optimal case but much better than what we saw with Edge.
  • At around 11 seconds the font has loaded so the text can be displayed and more image data has been downloaded so the images will be a little sharper. This is comparable to the experience around the 7-second mark for the optimal loading case.
  • For the remaining 9 seconds of the load the images get sharper as more data for them downloads until it is finally complete at 20 seconds.

Firefox builds a dependency tree that groups resources and then schedules the groups to either load one after another or to share bandwidth between the groups. Within a given group the resources share bandwidth and download concurrently. The images are scheduled to load after the render-blocking stylesheets and to load in parallel but the render-blocking scripts and stylesheets also load in parallel and do not get the benefits of pipelining.

In our example case this ends up being a slightly faster experience than with Safari since the images are delayed until after the stylesheets complete:
[Animation: the example page loading in Firefox]

  • At the 6 second mark the initial page content is rendered with the background and blurry versions of the product images (compared to 8 seconds for Safari and 4 seconds for the optimal case).
  • At 8 seconds the font has loaded and the text can be displayed along with slightly sharper versions of the product images (compared to 11 seconds for Safari and 7 seconds in the Optimal case).
  • For the remaining 12 seconds of the loading the product images get sharper as the remaining content loads.

Chrome (and all Chromium-based browsers) prioritizes resources into a list. This works really well for the render-blocking content that benefits from loading in order but works less well for images. Each image loads to 100% completion before starting the next image.

In practice this is almost as good as the optimal loading case with the only difference being that the images load one at a time instead of in parallel:
[Animation: the example page loading in Chrome]

  • Up until the 5 second mark the Chrome experience is identical to the optimal case, displaying the background at 4 seconds and the text content at 5.
  • For the next 5 seconds the visible images load one at a time until they are all complete at the 10 second mark (compared to the optimal case where they are just slightly blurry at 7 seconds and sharpen up for the remaining 3 seconds).
  • After the visual part of the page is complete at 10 seconds (identical to the optimal case), the remaining 10 seconds are spent running the async scripts and loading the hidden images (just like with the optimal loading case).

Visual Comparison

Visually, the impact can be quite dramatic, even though they all take the same amount of time to technically load all of the content:

[Filmstrip comparison: visual progress in each browser versus the optimal loading]

Server-Side Prioritization

HTTP/2 prioritization is requested by the client (browser) and it is up to the server to decide what to do based on the request. A good number of servers don’t do anything at all with the prioritization, but those that do all honor the client’s request. Another option would be to decide on the best prioritization to use on the server side, taking into account the client’s request.

Per the specification, HTTP/2 prioritization is a dependency tree that requires full knowledge of all of the in-flight requests to be able to prioritize resources against each other. That allows for incredibly complex strategies but is difficult to implement well on either the browser or server side (as evidenced by the different browser strategies and varying levels of server support). To make prioritization easier to manage we have developed a simpler prioritization scheme that still has all of the flexibility needed for optimal scheduling.

The Cloudflare prioritization scheme consists of 64 priority “levels” and within each priority level there are groups of resources that determine how the connection is shared between them:

[Diagram: 64 priority levels, each with concurrency groups 0, 1 and n]

All of the resources at a higher priority level are transferred before moving on to the next lower priority level.

Within a given priority level, there are 3 different “concurrency” groups:

  • 0 : All of the resources in the concurrency “0” group are sent sequentially in the order they were requested, using 100% of the bandwidth. Only after all of the concurrency “0” group resources have been downloaded are other groups at the same level considered.
  • 1 : All of the resources in the concurrency “1” group are sent sequentially in the order they were requested. The available bandwidth is split evenly between the concurrency “1” group and the concurrency “n” group.
  • n : The resources in the concurrency “n” group are sent in parallel, splitting the bandwidth available to the group between them.

Practically speaking, the concurrency “0” group is useful for critical content that needs to be processed sequentially (scripts, CSS, etc). The concurrency “1” group is useful for less-important content that can share bandwidth with other resources but where the resources themselves still benefit from processing sequentially (async scripts, non-progressive images, etc). The concurrency “n” group is useful for resources that benefit from processing in parallel (progressive images, video, audio, etc).
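To make those sharing rules concrete, here is a toy model (purely illustrative, not the edge implementation) that computes which pending responses would receive bandwidth at a given instant, and what fraction each would get:

// Toy model of the scheme above, not the real server. Each response has a
// priority level (higher levels send first) and a concurrency group.
// Assumes `pending` is ordered by request time.
function bandwidthShares(pending) {
  const top = Math.max(...pending.map(r => r.level));
  const level = pending.filter(r => r.level === top);

  const g0 = level.filter(r => r.concurrency === '0');
  if (g0.length > 0) {
    // Group 0 is strictly sequential: only the first-requested item sends.
    return new Map([[g0[0].name, 1.0]]);
  }

  const g1 = level.filter(r => r.concurrency === '1');
  const gn = level.filter(r => r.concurrency === 'n');
  const shares = new Map();
  const split = g1.length > 0 && gn.length > 0;  // both groups present?
  if (g1.length > 0) {
    shares.set(g1[0].name, split ? 0.5 : 1.0);   // sequential within group 1
  }
  gn.forEach(r => shares.set(r.name, (split ? 0.5 : 1.0) / gn.length));
  return shares;
}

// Example: an async script (group 1) sharing the link with two progressive
// images (group n) at the same level:
// bandwidthShares([
//   { name: 'app.js', level: 20, concurrency: '1' },
//   { name: 'a.jpg',  level: 20, concurrency: 'n' },
//   { name: 'b.jpg',  level: 20, concurrency: 'n' },
// ]) => Map { 'app.js' => 0.5, 'a.jpg' => 0.25, 'b.jpg' => 0.25 }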

Cloudflare Default Prioritization

When enabled, the enhanced prioritization implements the “optimal” scheduling of resources described above. The specific prioritizations applied look like this:

[Table: Cloudflare’s default priority and concurrency assignments by resource type]

This prioritization scheme allows sending the render-blocking content serially, followed by the visible images in parallel and then the rest of the page content with some level of sharing to balance application and content loading. The “* If Detectable” caveat is that not all browsers differentiate between the different types of stylesheets and scripts, but loading will still be significantly faster in all cases. 50% faster by default, particularly for Edge and Safari visitors, is not unusual:

[Filmstrip: page load with default browser prioritization vs. Enhanced HTTP/2 Prioritization]

Customizing Prioritization with Workers

Faster-by-default is great, but where things get really interesting is that the ability to configure the prioritization is also exposed to Cloudflare Workers, so sites can override the default prioritization for resources or implement their own complete prioritization schemes.

If a Worker adds a “cf-priority” header to the response, Cloudflare edge servers will use the specified priority and concurrency for that response. The format of the header is <priority>/<concurrency> so something like response.headers.set('cf-priority', '30/0'); would set the priority to 30 with a concurrency of 0 for the given response. Similarly, “30/1” would set concurrency to 1 and “30/n” would set concurrency to n.

With this level of flexibility a site can tweak resource prioritization to meet its needs: boosting the priority of some critical async scripts, for example, or increasing the priority of hero images before the browser has identified that they are in the viewport.
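For instance, a sketch of a Worker doing exactly that might look like the following (the paths and priority values are hypothetical; the header uses the <priority>/<concurrency> format described above):

// Sketch: override the default prioritization for a few known resources.
// The specific paths and values here are illustrative only.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const response = await fetch(request)
  const url = new URL(request.url)
  const headers = new Headers(response.headers)

  if (url.pathname.startsWith('/images/hero-')) {
    headers.set('cf-priority', '40/n')  // hero images: high priority, in parallel
  } else if (url.pathname === '/js/critical-async.js') {
    headers.set('cf-priority', '35/1')  // critical async script: share bandwidth, sequential
  }
  return new Response(response.body, { headers })
}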

To help inform any prioritization decisions, the Workers runtime also exposes the browser-requested prioritization information in the request object passed in to the Worker’s fetch event listener (request.cf.requestPriority). The incoming requested priority is a semicolon-delimited list of attributes that looks something like this: “weight=192;exclusive=0;group=3;group-weight=127”.

  • weight: The browser-requested weight for the HTTP/2 prioritization.

  • exclusive: The browser-requested HTTP/2 exclusive flag (1 for Chromium-based browsers, 0 for others).

  • group: HTTP/2 stream ID for the request group (only non-zero for Firefox).

  • group-weight: HTTP/2 weight for the request group (only non-zero for Firefox).
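A small parsing sketch (illustrative only) can turn that attribute string into an object for use in your own prioritization logic:

// Sketch: parse "weight=192;exclusive=0;group=3;group-weight=127"
// into { weight: 192, exclusive: 0, group: 3, 'group-weight': 127 }.
function parseRequestPriority(str) {
  const result = {};
  for (const part of (str || '').split(';')) {
    const [key, value] = part.split('=');
    if (key) result[key.trim()] = Number(value);
  }
  return result;
}

// Inside a Worker fetch handler:
//   const prio = parseRequestPriority(request.cf.requestPriority);
//   if (prio.weight >= 192) { /* the browser considers this important */ }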

This is Just the Beginning

The ability to tune and control the prioritization of responses is the basic building block that a lot of future work will benefit from. We will be implementing our own advanced optimizations on top of it but by exposing it in Workers we have also opened it up to sites and researchers to experiment with different prioritization strategies. With the Apps Marketplace it is also possible for companies to build new optimization services on top of the Workers platform and make it available to other sites to use.

If you are on a Pro plan or above, head over to the speed tab in the Cloudflare dashboard and turn on “Enhanced HTTP/2 Prioritization” to accelerate your site.

Argo and the Cloudflare Global Private Backbone

Post Syndicated from Rustam Lalkaka original https://blog.cloudflare.com/argo-and-the-cloudflare-global-private-backbone/

Welcome to Speed Week! Each day this week, we’re going to talk about something Cloudflare is doing to make the Internet meaningfully faster for everyone.

Cloudflare has built a massive network of data centers in 180 cities in 75 countries. One way to think of Cloudflare is a global system to transport bits securely, quickly, and reliably from any point A to any other point B on the planet.

To make that a reality, we built Argo. Argo uses real-time global network information to route around brownouts, cable cuts, packet loss, and other problems on the Internet. Argo makes the network that Cloudflare relies on—the Internet—faster, more reliable, and more secure on every hop around the world.

We launched Argo two years ago, and it now carries over 22% of Cloudflare’s traffic. On an average day, Argo cuts the amount of time Internet users spend waiting for content by 112 years!

As Cloudflare and our traffic volumes have grown, it now makes sense to build our own private backbone to add further security, reliability, and speed to key connections between Cloudflare locations.

Today, we’re introducing the Cloudflare Global Private Backbone. It’s been in operation for a while now and links Cloudflare locations with private fiber connections.

This private backbone benefits all Cloudflare customers, and it shines in combination with Argo. Argo can select the best available link across the Internet on a per-data-center basis, and takes full advantage of the Cloudflare Global Private Backbone automatically.

Let’s open the hood on Argo and explain how our backbone network further improves performance for our customers.

What’s Argo?

Argo is like Waze for the Internet. Every day, Cloudflare carries hundreds of billions of requests across our network and the Internet. Because our network, our customers, and their end-users are well distributed globally, all of these requests flowing across our infrastructure paint a great picture of how different parts of the Internet are performing at any given time.

Just like Waze examines real data from real drivers to give you accurate, uncongested (and sometimes unorthodox) routes across town, Argo Smart Routing uses the timing data Cloudflare collects from each request to pick faster, more efficient routes across the Internet.

In practical terms, Cloudflare’s network is expansive in its reach. Some of the Internet links in a given region may be congested and cause poor performance (a literal traffic jam). By understanding this is happening and using alternative network locations and providers, Argo can put traffic on a less direct, but faster, route from its origin to its destination.

These benefits are not theoretical: enabling Argo Smart Routing shaves an average of 33% off HTTP time to first byte (TTFB).

One other thing we’re proud of: we’ve stayed super focused on making it easy to use. One click in the dashboard enables better, smarter routing, bringing the full weight of Cloudflare’s network, data, and engineering expertise to bear on making your traffic faster. Advanced analytics allow you to understand exactly how Argo is performing for you around the world.

You can read a lot more about how Argo works in our original launch blog post.

So far, we’ve been talking about Argo at a functional level: you turn it on and it makes requests that traverse the Internet to your origin faster. How does it actually work? Argo is dependent on a few things to make its magic happen: Cloudflare’s network, up-to-the-second performance data on how traffic is moving on the Internet, and machine learning routing algorithms.

Cloudflare’s Global Network

Cloudflare maintains a network of data centers around the world, and our network continues to grow significantly. Today, we have more than 180 data centers in 75 countries. That’s an additional 69 data centers since we launched Argo in May 2017.

In addition to adding new locations, Cloudflare is constantly working with network partners to add connectivity options to our network locations. A single Cloudflare data center may be peered with a dozen networks, connected to multiple Internet eXchanges (IXs), connected to multiple transit providers (e.g. Telia, GTT, etc), and now, connected to our own physical backbone. A given destination may be reachable over multiple different links from the same location; each of these links will have different performance and reliability characteristics.

This increased network footprint is important in making Argo faster. Additional network locations and providers mean Argo has more options at its disposal to route around network disruptions and congestion. Every time we add a new network location, we exponentially grow the number of routing options available to any given request.

Better routing for improved performance

Argo requires the huge global network we’ve built to do its thing. But it wouldn’t do much of anything if it didn’t have the smarts to actually take advantage of all our data centers and cables between them to move traffic faster.

Argo combines multiple machine learning techniques to build routes, test them, and disqualify routes that are not performing as we expect.

The generation of routes is performed on data using “offline” optimization techniques: Argo’s route construction algorithms take an input data set (timing data) and fixed optimization target (“minimize TTFB”), outputting routes that it believes satisfy this constraint.

Route disqualification is performed by a separate pipeline that has no knowledge of the route construction algorithms. These two systems are intentionally designed to be adversarial, allowing Argo to be both aggressive in finding better routes across the Internet but also adaptive to rapidly changing network conditions.

One specific example of Argo’s smarts is its ability to distinguish between multiple potential connectivity options as it leaves a given data center. We call this “transit selection”.

As we discussed above, some of our data centers may have a dozen different, viable options for reaching a given destination IP address. It’s as if you subscribed to every available ISP at your house, and you could choose to use any one of them for each website you tried to access. Transit selection enables Cloudflare to pick the fastest available path in real-time at every hop to reach the destination.

With transit selection, Argo is able to specify both:

1) Network location waypoints on the way to the origin.
2) The specific transit provider or link at each waypoint in the journey of the packet all the way from the source to the destination.

To analogize this to Waze, Argo giving directions without transit selection is like telling someone to drive to a waypoint (go to New York from San Francisco, passing through Salt Lake City), without specifying the roads to actually take to Salt Lake City or New York. With transit selection, we’re able to give full turn-by-turn directions — take I-80 out of San Francisco, take a left here, enter the Salt Lake City area using SR-201 (because I-80 is congested around SLC), etc. This allows us to route around issues on the Internet with much greater precision.
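Conceptually (this is a toy illustration, not Argo’s actual algorithm), transit selection boils down to keeping recent timing samples per available link and picking the currently fastest one for each destination:

// Toy illustration of transit selection, not Argo's real logic: given
// recent latency samples (ms) per available link for one destination,
// pick the link with the lowest median latency.
function pickTransit(samplesByLink) {
  let best = null;
  for (const [link, samples] of Object.entries(samplesByLink)) {
    const sorted = [...samples].sort((a, b) => a - b);
    const median = sorted[Math.floor(sorted.length / 2)];
    if (best === null || median < best.median) {
      best = { link, median };
    }
  }
  return best;
}

// pickTransit({ transitA: [27, 31, 29], transitB: [24, 26, 90], backbone: [21, 22, 21] })
// => { link: 'backbone', median: 21 }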

Transit selection requires logic in our inter-data center data plane (the components that actually move data across our network) to allow for differentiation between different providers and links available in each location. Some interesting network automation and advertisement techniques allow us to be much more discerning about which link actually gets picked to move traffic.

Without modifications to the Argo data plane, those options would be abstracted away by our edge routers, with the choice of transit left to BGP. We plan to talk more publicly about the routing techniques used in the future.

We are able to directly measure the impact transit selection has on Argo customer traffic. Looking at global average improvement, transit selection gets customers an additional 16% TTFB latency benefit over taking standard BGP-derived routes. That’s huge!

One thing we think about: Argo can itself change network conditions when moving traffic from one location or provider to another by inducing demand (adding additional data volume because of improved performance) and changing traffic profiles. With great power comes great intricacy.

Adding the Cloudflare Global Private Backbone

Given our diversity of transit and connectivity options in each of our data centers, and the smarts that allow us to pick between them, why did we go through the time and trouble of building a backbone for ourselves? The short answer: operating our own private backbone allows us much more control over end-to-end performance and capacity management.

When we buy transit or use a partner for connectivity, we’re relying on that provider to manage the link’s health and ensure that it stays uncongested and available. Some networks are better than others, and conditions change all the time.

As an example, here’s a measurement of jitter (variance in round trip time) between two of our data centers, Chicago and Newark, over a transit provider’s network:

[Chart: jitter between Chicago and Newark over a transit provider’s network]

Average jitter over the pictured 6 hours is 4ms, with average round trip latency of 27ms. Some amount of latency is something we just need to learn to live with; the speed of light is a tough physical constant to do battle with, and network protocols are built to function over links with high or low latency.

Jitter, on the other hand, is “bad” because it is unpredictable and network protocols and applications built on them often degrade quickly when jitter rises. Jitter on a link is usually caused by more buffering, queuing, and general competition for resources in the routing hardware on either side of a connection. As an illustration, having a VoIP conversation over a network with high latency is annoying but manageable. Each party on a call will notice “lag”, but voice quality will not suffer. Jitter causes the conversation to garble, with packets arriving on top of each other and unpredictable glitches making the conversation unintelligible.
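For reference, one common way to quantify jitter (a simplified sketch; not necessarily the exact method behind the charts here) is the mean absolute difference between successive round-trip time samples:

// Simplified jitter estimate: mean absolute difference between successive
// round-trip time samples, in milliseconds. Illustrative only.
function jitter(rtts) {
  let total = 0;
  for (let i = 1; i < rtts.length; i++) {
    total += Math.abs(rtts[i] - rtts[i - 1]);
  }
  return total / (rtts.length - 1);
}

// jitter([27, 31, 26, 33, 27]) => (4 + 5 + 7 + 6) / 4 = 5.5 ms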

Here’s the same jitter chart between Chicago and Newark, except this time, transiting the Cloudflare Global Private Backbone:

[Chart: jitter between Chicago and Newark over the Cloudflare Global Private Backbone]

Much better! Here we see a jitter measurement of 536μs (microseconds), almost eight times better than the measurement over a transit provider between the same two sites.

The combination of fiber we control end-to-end and Argo Smart Routing allows us to unlock the full potential of Cloudflare’s backbone network. Argo’s routing system knows exactly how much capacity the backbone has available, and can manage how much additional data it tries to push through it. By controlling both ends of the pipe, and the pipe itself, we can guarantee certain performance characteristics and build those expectations into our routing models. The same principles do not apply to transit providers and networks we don’t control.

Latency, be gone!

Our private backbone is another tool available to us to improve performance on the Internet. Combining Argo’s cutting-edge machine learning and direct fiber connectivity between points on our large network allows us to route customer traffic with predictable, excellent performance.

We’re excited to see the backbone and its impact continue to expand.

Speaking personally as a product manager, Argo is really fun to work on. We make customers happier by making their websites, APIs, and networks faster. Enabling Argo allows customers to do that with one click, and see immediate benefit. Under the covers, huge investments in physical and virtual infrastructure begin working to accelerate traffic as it flows from its source to destination.  

From an engineering perspective, our weekly goals and objectives are directly measurable — did we make our customers faster by doing additional engineering work? When we ship a new optimization to Argo and immediately see our charts move up and to the right, we know we’ve done our job.

Building our physical private backbone is the latest thing we’ve done in our need for speed.

Welcome to Speed Week!

Activate Argo now, or contact sales to learn more!

Welcome to Speed Week!

Post Syndicated from Patrick Meenan original https://blog.cloudflare.com/welcome-to-speed-week/

Every year, we celebrate Cloudflare’s birthday in September when we announce the products we’re releasing to help make the Internet better for everyone. We’re always building new and innovative products throughout the year, and having to pick five announcements for just one week of the year is always challenging. Last year we brought back Crypto Week where we shared new cryptography technologies we’re supporting and helping advance to help build a more secure Internet.

Today I’m thrilled to announce we are launching our first-ever Speed Week and we want to showcase some of the things that we’re obsessed with to make the Internet faster for everyone.

How much faster is faster?

When we built the software stack that runs our network, we knew that both security and speed are important to our customers, and they should never have to compromise one for the other. All of the products we’re announcing this week will help our customers have a better experience on the Internet with as much as a 50% improvement in page load times for websites, getting the  most out of HTTP/2’s features (while only lifting a finger to click the button that enables them), finding the optimal route across the Internet, and providing the best live streaming video experience. I am constantly amazed by the talented engineers that work on the products that we launch all year round. I wish we could have weeks like this all year round to celebrate the wins they’ve accumulated as they tackle the difficult performance challenges of the web. We’re never content to settle for the status quo, and as our network continues to grow, so does our ability to improve our flagship products like Argo, or how we support rich media sites that rely heavily on images and video. The sheer scale of our network provides rich data that we can use to make better decisions on how we support our customers’ web properties.

We also recognize that the Internet is evolving. New standards and protocols such as HTTP/2, QUIC, TLS 1.3 are great advances to improve web performance and security, but they can also be challenging for many developers to easily deploy. HTTP/2 was introduced in 2015 by the IETF, and was the first major revision of the HTTP protocol. While our customers have always been able to benefit from HTTP/2, we’re exploring how we can make that experience even faster.

All things Speed

Want a sneak peek at what we’re announcing this week? I’m really excited to see this week’s announcements unfold. Each day we’ll post a new blog where we’ll share product announcements and customer stories that demonstrate how we’re making life better for our customers.

  • Monday: An inside view of how we’re making faster, smarter routing decisions
  • Tuesday: HTTP/2 can be faster, we’ll show you how
  • Wednesday: Simplify image management and speed up load times on any device
  • Thursday: How we’re improving our network for faster video streaming
  • Friday: How we’re helping make JavaScript faster

For bonus points, sign up for a live stream webinar where Kornel Lesinski and I will be hosted by Dennis Publishing to discuss the many challenges of the modern web: “Stronger, Better, Faster: Solving the performance challenges of the modern web.” The event will be held on Monday, May 13th at 11:00 am BST and you can either register for the live event or sign up for one of the on-demand sessions later in the week.

I hope you’re just as excited about our upcoming Speed Week as I am. Be sure to subscribe to the blog to get daily updates sent to your inbox, because who knows… there may even be “one last thing”