Tag Archives: Speed Week

NGINX structural enhancements for HTTP/2 performance

Post Syndicated from Nick Jones original https://blog.cloudflare.com/nginx-structural-enhancements-for-http-2-performance/

Introduction

My team, the Cloudflare PROTOCOLS team, is responsible for terminating HTTP traffic at the edge of the Cloudflare network. We deal with features related to TCP, QUIC, TLS and secure certificate management, HTTP/1 and HTTP/2. Over Q1, we were responsible for implementing the Enhanced HTTP/2 Prioritization product that Cloudflare announced during Speed Week.

This is a very exciting project to be part of, and doubly exciting to see the results of. During the course of the project, however, we had a number of interesting realisations about NGINX, the HTTP-oriented server onto which Cloudflare currently deploys its software infrastructure. We quickly became certain that our Enhanced HTTP/2 Prioritization project could not achieve even moderate success if the internal workings of NGINX were not changed.

Because of these realisations, we embarked upon a number of significant changes to the internal structure of NGINX, in parallel with the work on the core prioritization product. This blog post describes the motivation behind the structural changes, how we approached them, and what impact they had. We also identify additional changes that we plan to add to our roadmap, which we hope will improve performance further.

Background

Enhanced HTTP/2 Prioritization aims to do one thing to web traffic flowing between a client and a server: it provides a means to shape the many HTTP/2 streams as they flow from upstream (server or origin side) into a single HTTP/2 connection that flows downstream (client side).

Enhanced HTTP/2 Prioritization allows site owners and the Cloudflare edge systems to dictate the rules about how various objects should combine into the single HTTP/2 connection: whether a particular object should have priority and dominate that connection and reach the client as soon as possible, or whether a group of objects should evenly share the capacity of the connection and put more emphasis on parallelism.

As a result, Enhanced HTTP/2 Prioritization allows site owners to tackle two problems that exist between a client and a server: how to control the precedence and ordering of objects, and how to make the best use of a limited connection resource, which may be constrained by a number of factors such as bandwidth, volume of traffic and CPU workload at the various stages on the path of the connection.

What did we see?

The key to prioritisation is being able to compare two or more HTTP/2 streams in order to determine which one’s frame is to go down the pipe next. The Enhanced HTTP/2 Prioritization project necessarily drew us into the core NGINX codebase, as our intention was to fundamentally alter the way that NGINX compared and queued HTTP/2 data frames as they were written back to the client.

Very early in the analysis phase, as we rummaged through the NGINX internals to survey the site of our proposed features, we noticed a number of shortcomings in the structure of NGINX itself, in particular in how it moved data from upstream (server side) to downstream (client side) and how it temporarily stored (buffered) that data in its various internal stages. The main conclusion of our early analysis was that NGINX largely failed to give the stream data frames any ‘proximity’: either streams were processed in the NGINX HTTP/2 layer in isolated succession, or frames of different streams spent very little time in the same place, such as a shared queue. The net effect was a reduction in the opportunities for useful comparison.

We coined a new, barely scientific but useful measurement, Potential, to describe how effectively the Enhanced HTTP/2 Prioritization strategies (or even the default NGINX prioritization) can be applied to queued data streams. Potential is not so much a measurement of the effectiveness of prioritization per se (that metric would come later in the project); it is more a measurement of the level of participation during the application of the algorithm. In simple terms, it considers the number of streams, and frames thereof, that are included in an iteration of prioritization, with more streams and more frames leading to higher Potential.

What we could see from early on was that, by default, NGINX displayed low Potential, rendering prioritization instructions from either the browser, as in the traditional HTTP/2 prioritization model, or from our Enhanced HTTP/2 Prioritization product fairly useless.

What did we do?

With the goal of improving the specific problems related to Potential, and also improving general throughput of the system, we identified some key pain points in NGINX. These points, which will be described below, have either been worked on and improved as part of our initial release of Enhanced HTTP/2 Prioritization, or have now branched out into meaningful projects of their own that we will put engineering effort into over the course of the next few months.

HTTP/2 frame write queue reclamation

Write queue reclamation was successfully shipped with our release of Enhanced HTTP/2 Prioritization. Ironically, it wasn’t a change made to the original NGINX; it was a change made against our Enhanced HTTP/2 Prioritization implementation part way through the project, and it serves as a good example of something one might call conservation of data, which is a good way to increase Potential.

Similar to the original NGINX, our Enhanced HTTP/2 Prioritization algorithm will place a cohort of HTTP/2 data frames into a write queue as a result of an iteration of the prioritization strategies being applied to them. The contents of the write queue are destined to be written to the downstream TLS layer. Also similar to the original NGINX, the write queue may only be partially written to the TLS layer due to back-pressure from a network connection that has temporarily reached write capacity.

Early on in our project, if the write queue was only partially written to the TLS layer, we would simply leave the frames in the write queue until the backlog was cleared, then we would re-attempt to write that data to the network in a future write iteration, just like the original NGINX.

The original NGINX takes this approach because the write queue is the only place that waiting data frames are stored. However, in our NGINX modified for Enhanced HTTP/2 Prioritization, we have a unique structure that the original NGINX lacks: per-stream data frame queues where we temporarily store data frames before our prioritization algorithms are applied to them.

We came to the realisation that in the event of a partial write, we were able to restore the unwritten frames back into their per-stream queues. If it was the case that a subsequent data cohort arrived behind the partially unwritten one, then the previously unwritten frames could participate in an additional round of prioritization comparisons, thus raising the Potential of our algorithms.
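
In pseudocode similar to the snippets shown later in this post (the function and field names here are illustrative, not NGINX's actual API), the reclamation step amounts to something like this:

ngx_http_v2_reclaim_unwritten(ngx_http_v2_connection *h2_conn)
{
    ngx_http_v2_frame *frame;

    /* After a partial write, take the leftover frames newest-first and
       put each one back at the head of its stream's queue, preserving
       the order of frames within each stream. They can then take part
       in the next round of prioritization comparisons. */
    while ( ! ngx_queue_empty(h2_conn->queue) ) {
        frame = ngx_queue_pop_tail(h2_conn->queue);
        ngx_queue_push_head(frame->stream->frames, frame);
    }
}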

The following diagram illustrates this process:

[Diagram: unwritten frames being returned from the connection write queue to their per-stream queues]

We were very pleased to ship Enhanced HTTP/2 Prioritization with the reclamation feature included, as this single enhancement greatly increased Potential and made up for the fact that we had to withhold the next enhancement from the Speed Week release due to its delicacy.

HTTP/2 frame write event re-ordering

In Cloudflare infrastructure, we map the many streams of a single HTTP/2 connection from the eyeball to multiple HTTP/1.1 connections to the upstream Cloudflare control plane.

As a note: it may seem counterintuitive that we downgrade protocols like this, and it may seem doubly counterintuitive when I reveal that we also disable HTTP keepalive on these upstream connections, resulting in only one transaction per connection. However, this arrangement offers a number of advantages, particularly in the form of improved CPU workload distribution.

When NGINX monitors its upstream HTTP/1.1 connections for read activity, it may detect readability on many of those connections and process them all in a batch. However, within that batch, each of the upstream connections is processed sequentially, one at a time, from start to finish: from HTTP/1.1 connection read, to framing in the HTTP/2 stream, to HTTP/2 connection write to the TLS layer.

The existing NGINX workflow is illustrated in this diagram:

[Diagram: existing NGINX workflow, processing each upstream connection from read through framing to TLS write in turn]

By committing each stream’s frames to the TLS layer one stream at a time, many frames may pass entirely through the NGINX system before backpressure on the downstream connection allows the queue of frames to build up, providing an opportunity for these frames to be in proximity and allowing prioritization logic to be applied. This negatively impacts Potential and reduces the effectiveness of prioritization.

The Cloudflare Enhanced HTTP/2 Prioritization modified NGINX aims to re-arrange the internal workflow described above into the following model:

[Diagram: modified workflow, framing all streams first and then prioritizing and writing them in a single event]

Although we continue to frame upstream data into HTTP/2 data frames in separate iterations for each upstream connection, we no longer commit these frames to a single write queue within each iteration; instead, we arrange the frames into the per-stream queues described earlier. We then post a single event to the end of the per-connection iterations, and perform the prioritization, queuing and writing of the HTTP/2 data frames of all streams in that single event.

This single event finds the cohort of data conveniently stored in their respective per-stream queues, all in close proximity, which greatly increases the Potential of the Edge Prioritization algorithms.

In a form closer to actual code, the core of this modification looks a bit like this:

ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer) ) {
        /* Frame the buffered upstream data for this stream */
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream->frames,
                               buffer);
    }

    /* Prioritize and queue this stream's frames immediately... */
    ngx_http_v2_prioritise(h2_conn->queue,
                           h2_stream->frames);

    /* ...and write the queue straight out to the TLS layer */
    ngx_http_v2_write_queue(h2_conn->queue);
}

To this:

ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer) ) {
        /* Frame the buffered upstream data for this stream */
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream->frames,
                               buffer);
    }

    /* Defer prioritization: just note that this stream has data... */
    ngx_list_add(h2_conn->active_streams, h2_stream);

    /* ...and make sure the write pass runs once after the whole batch
       of upstream connections has been processed */
    ngx_call_once_async(ngx_http_v2_write_streams, h2_conn);
}

ngx_http_v2_write_streams(ngx_http_v2_connection *h2_conn)
{
    ngx_http_v2_stream *h2_stream;

    /* Prioritize the frames of all active streams together */
    while ( ! ngx_list_empty(h2_conn->active_streams) ) {
        h2_stream = ngx_list_pop(h2_conn->active_streams);

        ngx_http_v2_prioritise(h2_conn->queue,
                               h2_stream->frames);
    }

    /* Then write the combined queue to the TLS layer once */
    ngx_http_v2_write_queue(h2_conn->queue);
}

There is a high level of risk in this modification, for even though it is remarkably small, we are taking the well-established and debugged event flow in NGINX and switching it around to a significant degree. Like taking a number of Jenga pieces out of the tower and placing them in another location, we risk race conditions, event misfires and event black holes leading to lockups during transaction processing.

Because of this level of risk, we did not release this change in its entirety during Speed Week, but we will continue to test and refine it for future release.

Upstream buffer partial re-use

NGINX has an internal buffer region to store connection data it reads from upstream. To begin with, the entirety of this buffer is Ready for use. When data is read from upstream into the Ready buffer, the part of the buffer that holds the data is passed to the downstream HTTP/2 layer. Since HTTP/2 takes responsibility for that data, that portion of the buffer is marked as Busy, and it will remain Busy for as long as it takes the HTTP/2 layer to write the data into the TLS layer, which is a process that may take some time (in computer terms!).

During this gulf of time, the upstream layer may continue to read more data into the remaining Ready sections of the buffer and continue to pass that incremental data to the HTTP/2 layer until there are no Ready sections available.

When Busy data is finally finished with in the HTTP/2 layer, the buffer space that contained that data is then marked as Free.

The process is illustrated in this diagram:

[Diagram: upstream buffer regions cycling through the Ready, Busy and Free states]

You may ask: When the leading part of the upstream buffer is marked as Free (in blue in the diagram), even though the trailing part of the upstream buffer is still Busy, can the Free part be re-used for reading more data from upstream?

The answer to that question is: NO

Because even just a small part of the buffer is still Busy, NGINX refuses to allow any of the buffer space to be re-used for reads. Only when the entirety of the buffer is Free can the buffer be returned to the Ready state and used for another iteration of upstream reads. So, in summary, data can be read from upstream into Ready space at the tail of the buffer, but not into Free space at the head of the buffer.

This is a shortcoming in NGINX and is clearly undesirable as it interrupts the flow of data into the system. We asked: what if we could cycle through this buffer region and re-use parts at the head as they became Free? We seek to answer that question in the near future by testing the following buffering model in NGINX:

[Diagram: proposed buffering model re-using Free space at the head of the buffer as it becomes available]
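
A minimal sketch of that model, treating the upstream buffer as a ring in which space freed at the head can be re-used for new upstream reads (this is an illustration of the idea only, not NGINX's actual buffer code):

#include <stddef.h>

/* Illustrative ring buffer: 'head' marks the oldest byte still Busy in
   the HTTP/2 layer, 'tail' marks where the next upstream read may
   write. Space released at the head becomes writable again without
   waiting for the whole buffer to drain. */
typedef struct {
    unsigned char *data;
    size_t         size;
    size_t         head;   /* first byte still owned by the HTTP/2 layer */
    size_t         tail;   /* next write position for upstream reads     */
} ring_buffer;

/* Bytes available for a new upstream read. One slot is kept unused so
   that head == tail always means "empty". A real implementation would
   split a wrapping read into two parts. */
static size_t
ring_writable(const ring_buffer *rb)
{
    return (rb->head + rb->size - rb->tail - 1) % rb->size;
}

/* Called when the HTTP/2 layer has finished with 'n' bytes at the head. */
static void
ring_release(ring_buffer *rb, size_t n)
{
    rb->head = (rb->head + n) % rb->size;
}

/* Called after 'n' bytes have been read from upstream into the tail. */
static void
ring_commit(ring_buffer *rb, size_t n)
{
    rb->tail = (rb->tail + n) % rb->size;
}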

TLS layer Buffering

On a number of occasions in the above text, I have mentioned the TLS layer, and how the HTTP/2 layer writes data into it. In the OSI network model, TLS sits just below the protocol (HTTP/2) layer, and in many consciously designed networking software systems such as NGINX, the software interfaces are separated in a way that mimics this layering.

The NGINX HTTP/2 layer will collect the current cohort of data frames and place them in priority order into an output queue, then submit this queue to the TLS layer. The TLS layer makes use of a per-connection buffer to collect HTTP/2 layer data before performing the actual cryptographic transformations on that data.

The purpose of the buffer is to give the TLS layer a more meaningful quantity of data to encrypt; if the buffer were too small, or if the TLS layer simply relied on the units of data coming from the HTTP/2 layer, then the overhead of encrypting and transmitting a multitude of small blocks could negatively impact system throughput.

The following diagram illustrates this undersize buffer situation:

[Diagram: undersized TLS buffer causing many small encrypt-and-write operations]

If the TLS buffer is too big, then an excessive amount of HTTP/2 data will be committed to encryption, and if it fails to write to the network due to backpressure, it will be locked into the TLS layer and unavailable to return to the HTTP/2 layer for the reclamation process, thus reducing the effectiveness of reclamation. The following diagram illustrates this oversize buffer situation:

[Diagram: oversized TLS buffer locking data into the TLS layer when network writes are incomplete]

In the coming months, we will embark on a process to find the ‘Goldilocks’ spot for TLS buffering: to size the TLS buffer so it is big enough to maintain the efficiency of encryption and network writes, but not so big as to reduce the responsiveness to incomplete network writes and the efficiency of reclamation.
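
As a rough illustration of the trade-off (a simplified sketch, not NGINX's or any TLS library's actual interface), the buffering policy comes down to coalescing writes until a tunable threshold is reached:

#include <stddef.h>
#include <string.h>

/* Hypothetical coalescing buffer in front of the TLS layer: small
   writes from the HTTP/2 layer are batched until 'flush_threshold'
   bytes are pending, so each encryption call works on a meaningful
   amount of data. The threshold is the Goldilocks knob: too small and
   encryption overhead dominates; too large and data gets locked away
   from the reclamation process on partial network writes. */
typedef struct {
    unsigned char *data;
    size_t         capacity;        /* hard upper bound               */
    size_t         used;            /* bytes waiting to be encrypted  */
    size_t         flush_threshold; /* flush once this much is queued */
} tls_out_buffer;

/* Append as much of (p, len) as fits; returns the number of bytes
   buffered so the caller can keep the remainder for reclamation. */
static size_t
tls_buffer_append(tls_out_buffer *b, const unsigned char *p, size_t len)
{
    size_t room = b->capacity - b->used;
    size_t n = len < room ? len : room;

    memcpy(b->data + b->used, p, n);
    b->used += n;
    return n;
}

/* The caller encrypts and writes once enough data has been queued
   (or at the end of a cohort of frames). */
static int
tls_buffer_should_flush(const tls_out_buffer *b)
{
    return b->used >= b->flush_threshold;
}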

Thank you – Next!

The Enhanced HTTP/2 Prioritization project has the lofty goal of fundamentally re-shaping how we send traffic from the Cloudflare edge to clients, and as the results of our testing and the feedback from some of our customers show, we have certainly achieved that! However, one of the most important things we took away from the project was the critical role that the internal data flow within our NGINX software infrastructure plays in the traffic observed by our end users. We found that changing a few lines of (albeit critical) code could have significant impacts on the effectiveness and performance of our prioritization algorithms. Another positive outcome is that, in addition to improving HTTP/2, we are looking forward to carrying our newfound skills and lessons learned over to HTTP/3 over QUIC.

We are eager to share our modifications to NGINX with the community, so we have opened this ticket, through which we will discuss upstreaming the event re-ordering change and the buffer partial re-use change with the NGINX team.

As Cloudflare continues to grow, our requirements on our software infrastructure also shift. Cloudflare has already moved beyond proxying of HTTP/1 over TCP to support termination and Layer 3 and 4 protection for any UDP and TCP traffic. Now we are moving on to other technologies and protocols such as QUIC and HTTP/3, and full proxying of a wide range of other protocols such as messaging and streaming media.

For these endeavours we are looking at new ways to answer questions on topics such as scalability, localised performance, wide-scale performance, introspection and debuggability, release agility, and maintainability.

If you would like to help us answer these questions, and you know a bit about hardware and software scalability, network programming, asynchronous event and futures based software design, TCP, TLS, QUIC, HTTP, RPC protocols, Rust, or maybe something else, then have a look here.

One more thing… new Speed Page

Post Syndicated from Andrew Galloni original https://blog.cloudflare.com/new-speed-page/

Congratulations on making it through Speed Week. In the last week, Cloudflare has:

  • described how our global network speeds up the Internet,
  • launched an HTTP/2 prioritisation model that will improve web experiences on all browsers,
  • launched an image resizing service which will deliver the optimal image to every device,
  • optimized live video delivery,
  • detailed how to stream progressive images so that they render twice as fast, using the flexibility of our new HTTP/2 prioritisation model, and
  • prototyped a new over-the-wire format for JavaScript that could improve application start-up performance, especially on mobile devices.

As a bonus, we’re also rolling out one more new feature: “TCP Turbo”, which automatically chooses the TCP settings to further accelerate your website.

As a company, we want to help every one of our customers improve web experiences. The growth of Cloudflare, along with the increase in features, has often made simple questions difficult to answer:

  • How fast is my website?
  • How should I be thinking about performance features?
  • How much faster would the site be if I were to enable a particular feature?

This post will describe the exciting changes we have made to the Speed Page on the Cloudflare dashboard to give our customers a much clearer understanding of how their websites are performing and how they can be made even faster. The new Speed Page consists of:

  • A visual comparison of your website loading on Cloudflare, with caching enabled, compared to connecting directly to the origin.
  • The measured improvement expected if any performance feature is enabled.
  • A report describing how fast your website is on desktop and mobile.

We want to simplify the complexity of making web experiences fast and give our customers control. Take a look – we hope you like it.

Why do fast web experiences matter?

Customer experience: No one likes slow service. Imagine going to a restaurant where the service is slow, especially when you first arrive; you are not likely to go back or recommend it to your friends. It turns out the web works in the same way, and Internet customers are even more demanding. As many as 79% of customers who are “dissatisfied” with a website’s performance are less likely to buy from that site again.

Engagement and Revenue: There are many studies explaining how speed affects customer engagement, bounce rates and revenue.

Reputation: There is also brand reputation to consider, as customers associate an online experience with the brand. One study found that for 66% of the sample, website performance influences their impression of the company.

Diversity: Mobile traffic has grown to be larger than its desktop counterpart over the last few years. Mobile customers’ expectations have become increasingly demanding: they expect seamless Internet access regardless of location.

Mobile provides a new set of challenges, including the diversity of device specifications. When testing, be aware that the average mobile device is significantly less capable than the top-of-the-range models. For example, there can be orders-of-magnitude disparity in the time different mobile devices take to run JavaScript. Another challenge is the variance in mobile performance as customers move from a strong, high-quality office network to mobile networks of different speeds (3G/5G) and quality within the same browsing session.

New Speed Page

There is compelling evidence that a faster web experience is important for anyone online. Most of the major studies involve the largest tech companies, who have whole teams dedicated to measuring and improving web experiences for their own services. At Cloudflare we are on a mission to help build a better and faster Internet for everyone – not just a select few.

Delivering fast web experiences is not a simple matter. That much is clear.
To know what to send and when requires a deep understanding of every layer of the stack, from TCP tuning, protocol-level prioritisation and content delivery formats through to the intricate mechanics of browser rendering. You will also need a global network that strives to be within 10 ms of every Internet user. The intrinsic value of such a network should be clear to everyone. Cloudflare has this network, but it also offers many additional performance features.

With the Speed Page redesign, we are emphasizing the performance benefits of using Cloudflare and the additional improvements possible from our features.

The de facto standard for measuring website performance has been WebPageTest. Having its creator in-house at Cloudflare encouraged us to use it as the basis for website performance measurement. So, what is the easiest way to understand how a web page loads? A list of statistics does not paint a full picture of the actual user experience. One of the cool features of WebPageTest is that it can generate a filmstrip of screen snapshots taken during a web page load, enabling us to quantify how a page loads, visually. This view makes it significantly easier to determine how long the page is blank for, and how long it takes for the most important content to render. Being able to look at the results in this way provides the ability to empathise with the user.

How fast on Cloudflare?

After moving your website to Cloudflare, you may have asked: How fast did this decision make my website? Well, now we provide the answer:

Comparison of website performance using Cloudflare. 

As well as the increase in speed, we provide filmstrips of before and after, so that it is easy to compare and understand how differently a user will experience the website. If our tests are unable to reach your origin and you are already set up on Cloudflare, we will test with development mode enabled, which disables caching and minification.

Site performance statistics

How can we measure the user experience of a website?

Traditionally, page load was the important metric. Page load is a technical measurement used by browser vendors that has no bearing on the presentation or usability of a page. The metric reports on how long it takes not only to load the important content but also all of the 3rd party content (social network widgets, advertising, tracking scripts etc.). A user may very well not see anything until after all the page content has loaded, or they may be able to interact with a page immediately, while content continues to load.

A user will not decide whether a page is fast by a single measure or moment. A user will perceive how fast a website is from a combination of factors:

  • when they see any response
  • when they see the content they expect
  • when they can interact with the page
  • when they can perform the task they intended

Experience has shown that if you focus on one measure, it will likely be to the detriment of the others.

Importance of Visual response

If an impatient user navigates to your site and sees no content, or no valuable content, for several seconds, they are likely to get frustrated and leave. The paint timing spec defines a set of paint metrics (the moments at which content appears on a page) to measure the key points in how a user perceives performance.

First Contentful Paint (FCP) is the time when the browser first renders any DOM content.

First Meaningful Paint (FMP) is the point in time when the page’s “primary” content appears on the screen. This metric should relate to what the user has come to the site to see and is designed as the point in time when the largest visible layout change happens.

Speed Index attempts to quantify the value of the filmstrip rather than using a single paint timing. The speed index measures the rate at which content is displayed – essentially the area above the curve. In the chart below from our progressive image feature you can see that reaching 80% visual completeness happens much earlier for the parallelized (red) load than for the regular (blue) one.

[Chart: visual completeness over time for the parallelized (red) and regular (blue) image loads]
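
Concretely, given filmstrip samples of visual completeness over time, Speed Index can be approximated as the area above that curve. A simplified sketch of the WebPageTest definition (not its actual code):

#include <stddef.h>

/* Speed Index: the area above the visual-completeness curve, in
   milliseconds. 'vc[i]' is the fraction of the final page rendered
   (0.0 to 1.0) at time 't_ms[i]'; samples must be in increasing time
   order and are treated as a step function between samples. */
static double
speed_index(const double *t_ms, const double *vc, size_t n)
{
    double area = 0.0;
    size_t i;

    for (i = 1; i < n; i++) {
        area += (1.0 - vc[i - 1]) * (t_ms[i] - t_ms[i - 1]);
    }
    return area;
}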

Importance of interactivity

The same impatient user is now happy that the content they want to see has appeared. They will still become frustrated if they are unable to interact with the site.
Time to Interactive is the time it takes for content to be rendered and for the page to be ready to receive input from the user. Technically, this is defined as when the browser’s main processing thread has been idle for several seconds after first meaningful paint.

The Speed Tab displays these key metrics for mobile and desktop.

How much faster on Cloudflare?

The Cloudflare Dashboard provides a list of performance features which can, admittedly, be both confusing and daunting. What would be the benefit of turning on Rocket Loader, and on which performance metrics will it have the most impact? If you upgrade to Pro, what will be the value of the enhanced HTTP/2 prioritisation? The optimization section answers these questions.

Tests are run with each performance feature turned on and off. The values of the appropriate performance metrics for each test are displayed, along with the improvement. You can enable or upgrade the feature from this view. Here are a few examples:

If Rocket Loader were enabled for this website, the render-blocking JavaScript would be deferred, causing first paint time to drop from 1.25s to 0.81s – an improvement of 32% on desktop.

Image heavy sites do not perform well on slow mobile connections. If you enable Mirage, your customers on 3G connections would see meaningful content 1s sooner – an improvement of 29.4%.

So how about our new features?

We tested the enhanced HTTP/2 prioritisation feature on an Edge browser on desktop and saw meaningful content display 2s sooner – an improvement of 64%.

This is a more interesting result, taken from the blog example used to illustrate progressive image streaming. At first glance the improvement of 29% in speed index is good. The filmstrip comparison shows a more significant difference. In this case, before any images are shown, the page is already 43% visually complete in both scenarios after 1.5s. At 2.5s the difference is 77% visually complete compared to 50%.

This is a great example of how metrics do not tell the full story. They cannot completely replace viewing the page loading flow and understanding what is important for your site.

How to try

This is our first iteration of the new Speed Page and we are eager to get your feedback. We will be rolling it out to beta customers who are interested in seeing how their sites perform. To be added to the queue for activation of the new Speed Page, please click on the banner on the overview page,

or click on the banner on the existing Speed Page.

Faster script loading with BinaryAST?

Post Syndicated from Ingvar Stepanyan original https://blog.cloudflare.com/binary-ast/

JavaScript Cold starts

The performance of applications on the web platform is becoming increasingly bottlenecked by the startup (load) time. Large amounts of JavaScript code are required to create rich web experiences that we’ve become used to. When we look at the total size of JavaScript requested on mobile devices from HTTPArchive, we see that an average page loads 350KB of JavaScript, while 10% of pages go over the 1MB threshold. The rise of more complex applications can push these numbers even higher.

While caching helps, popular websites regularly release new code, which makes cold start (first load) times particularly important. With browsers moving to separate caches for different domains to prevent cross-site leaks, the importance of cold starts is growing even for popular subresources served from CDNs, as they can no longer be safely shared.

Usually, when talking about cold start performance, the primary factor considered is raw download speed. However, on modern interactive pages one of the other big contributors to cold starts is JavaScript parsing time. This might seem surprising at first, but makes sense – before starting to execute the code, the engine has to first parse the fetched JavaScript, make sure it doesn’t contain any syntax errors and then compile it to the initial bytecode. As networks become faster, parsing and compilation of JavaScript could become the dominant factor.

Device capability (CPU or memory performance) is the most important factor in the variance of JavaScript parsing times and, correspondingly, the time to application start. A 1MB JavaScript file will take on the order of 100 ms to parse on a modern desktop or high-end mobile device, but can take over a second on an average phone (Moto G4).

A more detailed post on the overall cost of parsing, compiling and execution of JavaScript shows how the JavaScript boot time can vary on different mobile devices. For example, in the case of news.google.com, it can range from 4s on a Pixel 2 to 28s on a low-end device.

While engines continuously improve raw parsing performance, with V8 in particular doubling it over the past year, as well as moving more things off the main thread, parsers still have to do lots of potentially unnecessary work that consumes memory and battery and might delay the processing of useful resources.

The “BinaryAST” Proposal

This is where BinaryAST comes in. BinaryAST is a new over-the-wire format for JavaScript proposed and actively developed by Mozilla that aims to speed up parsing while keeping the semantics of the original JavaScript intact. It does so by using an efficient binary representation for code and data structures, as well as by storing and providing extra information to guide the parser ahead of time.

The name comes from the fact that the format stores the JavaScript source as an AST encoded into a binary file. The specification lives at tc39.github.io/proposal-binary-ast and is being worked on by engineers from Mozilla, Facebook, Bloomberg and Cloudflare.

“Making sure that web applications start quickly is one of the most important, but also one of the most challenging parts of web development. We know that BinaryAST can radically reduce startup time, but we need to collect real-world data to demonstrate its impact. Cloudflare’s work on enabling use of BinaryAST with Cloudflare Workers is an important step towards gathering this data at scale.”

Till Schneidereit, Senior Engineering Manager, Developer Technologies
Mozilla

Parsing JavaScript

For regular JavaScript code to execute in a browser, the source is parsed into an intermediate representation known as an AST that describes the syntactic structure of the code. This representation can then be compiled into bytecode or native machine code for execution.

A simple example of adding two numbers can be represented in an AST as:

[Diagram: AST for an expression that adds two numbers]

Parsing JavaScript is not an easy task; no matter which optimisations you apply, it still requires reading the entire text file char by char, while tracking extra context for syntactic analysis.

The goal of BinaryAST is to reduce the complexity and the amount of work the browser parser has to do overall, by providing additional information and context at the time and place where the parser needs it.

To execute JavaScript delivered as BinaryAST the only steps required are:

[Diagram: the steps required to execute JavaScript delivered as BinaryAST]

Another benefit of BinaryAST is that it makes it possible to parse only the critical code necessary for start-up, completely skipping over the unused bits. This can dramatically improve the initial loading time.

This post will now describe some of the challenges of parsing JavaScript in more detail, explain how the proposed format addressed them, and how we made it possible to run its encoder in Workers.

Hoisting

JavaScript relies on hoisting for all declarations – variables, functions, classes. Hoisting is a property of the language that allows you to declare items after the point they’re syntactically used.

Let’s take the following example:

function f() {
	return g();
}

function g() {
	return 42;
}

Here, when the parser is looking at the body of f, it doesn’t know yet what g is referring to – it could be an already existing global function or something declared further in the same file – so it can’t finalise parsing of the original function and start the actual compilation.

BinaryAST fixes this by storing all the scope information and making it available upfront before the actual expressions.

As shown by the difference between the initial AST and the enhanced AST in a JSON representation:

[Image: the initial AST and the scope-enhanced AST, shown as JSON]

Lazy parsing

One common technique used by modern engines to improve parsing times is lazy parsing. It utilises the fact that lots of websites include more JavaScript than they actually need, especially for the start-up.

Working around this involves a set of heuristics that try to guess when any given function body in the code can be safely skipped by the parser initially and delayed for later. A common example of such a heuristic is immediately running the full parser for any function that is wrapped in parentheses:

(function(...

Such a prefix usually indicates that the following function is going to be an IIFE (immediately-invoked function expression), so the parser can assume that it will be compiled and executed ASAP and wouldn’t benefit from being skipped over and delayed for later.

(function() {
	…
})();

These heuristics significantly improve the performance of the initial parsing and cold starts, but they’re not completely reliable or trivial to implement.

One of the reasons is the same as in the previous section – even with lazy parsing, you still need to read the contents, analyse them and store additional scope information for the declarations.

Another reason is that the JavaScript specification requires reporting any syntax errors immediately during load time, and not when the code is actually executed. A class of these errors, called early errors, covers mistakes like usage of reserved words in invalid contexts, strict mode violations, variable name clashes and more. All of these checks require not only lexing the JavaScript source, but also tracking extra state even during lazy parsing.

Having to do such extra work means you need to be careful about marking functions as lazy too eagerly, especially if they actually end up being executed during the page load. Otherwise you’re making cold start costs even worse, as now every function that is erroneously marked as lazy needs to be parsed twice – once by the lazy parser and then again by the full one.

Because BinaryAST is meant to be an output format of other tools such as Babel, TypeScript and bundlers such as Webpack, the browser parser can rely on the JavaScript being already analysed and verified by the initial parser. This allows it to skip function bodies completely, making lazy parsing essentially free.

It reduces the cost of completely unused code – while including it is still a problem in terms of network bandwidth (don’t do this!), at least it no longer affects parsing times. These benefits apply equally to code that is used later in the page lifecycle (for example, invoked in response to user actions) but is not required during the startup.

Last but not least, a further benefit of this approach is that BinaryAST encodes lazy annotations as part of the format, giving tools and developers direct and full control over the heuristics. For example, a tool targeting the Web platform or a framework CLI can use its domain-specific knowledge to mark some event handlers as lazy or eager depending on the context and the event type.

Avoiding ambiguity in parsing

Using a text format for a programming language is great for readability and debugging, but it’s not the most efficient representation for parsing and execution.

For example, parsing low-level types like numbers, booleans and even strings from text requires extra analysis and computation, which is unnecessary when you can store them as native binary-encoded values in the first place and read them directly on the other side.

Another problem is an ambiguity in the grammar itself. It was already an issue in the ES5 world, but could usually be resolved with some extra bookkeeping based on the previously seen tokens. However, in ES6+ there are productions that can be ambiguous all the way through until they’re parsed completely.

For example, a token sequence like:

(a, {b: c, d}, [e = 1])...

can start either a parenthesized comma expression with nested object and array literals and an assignment:

(a, {b: c, d}, [e = 1]); // it was an expression

or a parameter list of an arrow expression function with nested object and array patterns and a default value:

(a, {b: c, d}, [e = 1]) => … // it was a parameter list

Both representations are perfectly valid, but have completely different semantics, and you can’t know which one you’re dealing with until you see the final token.

To work around this, parsers usually have to either backtrack, which can easily become exponentially slow, or parse the contents into intermediate node types that are capable of holding both expressions and patterns, with a subsequent conversion. The latter approach preserves linear performance, but makes the implementation more complicated and requires preserving more state.

In the BinaryAST format this issue doesn’t exist in the first place because the parser sees the type of each node before it even starts parsing its contents.

Cloudflare Implementation

Currently, the format is still in flux, but the very first version of the client-side implementation was released under a flag in Firefox Nightly several months ago. Keep in mind this is only an initial unoptimised prototype, and there are already several experiments changing the format to provide improvements to both size and parsing performance.

On the producer side, the reference implementation lives at github.com/binast/binjs-ref. Our goal was to take this reference implementation and consider how we would deploy it at Cloudflare scale.

If you dig into the codebase, you will notice that it currently consists of two parts.

One is the encoder itself, which is responsible for taking a parsed AST, annotating it with scope and other relevant information, and writing out the result in one of the currently supported formats. This part is written in Rust and is fully native.

Another part is what produces that initial AST – the parser. Interestingly, unlike the encoder, it’s implemented in JavaScript.

Unfortunately, there is currently no battle-tested native JavaScript parser with an open API, let alone one implemented in Rust. There have been a few attempts, but, given the complexity of the JavaScript grammar, it’s better to wait a bit and make sure they’re well-tested before incorporating one into the production encoder.

On the other hand, over the last few years the JavaScript ecosystem grew to extensively rely on developer tools implemented in JavaScript itself. In particular, this gave a push to rigorous parser development and testing. There are several JavaScript parser implementations that have been proven to work on thousands of real-world projects.

With that in mind, it makes sense that the BinaryAST implementation chose to use one of them – in particular, Shift – and integrated it with the Rust encoder, instead of attempting to use a native parser.

Connecting Rust and JavaScript

Integration is where things get interesting.

Rust is a native language that can compile to an executable binary, but JavaScript requires a separate engine to be executed. To connect them, we need some way to transfer data between the two without sharing the memory.

Initially, the reference implementation generated JavaScript code with an embedded input on the fly, passed it to Node.js and then read the output when the process had finished. That code contained a call to the Shift parser with an inlined input string and produced the AST back in a JSON format.

This doesn’t scale well when parsing lots of JavaScript files, so the first thing we did was transform the Node.js side into a long-lived daemon. Now Rust could spawn the required Node.js process just once and keep passing inputs into it and getting responses back as individual messages.

Running in the cloud

While the Node.js solution worked fairly well after these optimisations, shipping both a Node.js instance and a native bundle to production requires some effort. It’s also potentially risky and requires manual sandboxing of both processes to make sure we don’t accidentally start executing malicious code.

On the other hand, the only thing we needed from Node.js is the ability to run the JavaScript parser code. And we already have an isolated JavaScript engine running in the cloud – Cloudflare Workers! By additionally compiling the native Rust encoder to Wasm (which is quite easy with the native toolchain and wasm-bindgen), we can even run both parts of the code in the same process, making cold starts and communication much faster than in a previous model.

Optimising data transfer

The next logical step is to reduce the overhead of data transfer. JSON worked fine for communication between separate processes, but with a single process we should be able to retrieve the required bits directly from the JavaScript-based AST.

To attempt this, first of all, we needed to move away from the direct JSON usage to something more generic that would allow us to support various import formats. The Rust ecosystem already has an amazing serialisation framework for that – Serde.

Aside from allowing us to be more flexible in regard to the inputs, rewriting to Serde helped an existing native use case too. Now, instead of parsing JSON into an intermediate representation and then walking through it, all the native typed AST structures can be deserialized directly from the stdout pipe of the Node.js process in a streaming manner. This significantly improved both the CPU usage and memory pressure.

But there is one more thing we can do: instead of serializing and deserializing from an intermediate format (let alone, a text format like JSON), we should be able to operate [almost] directly on JavaScript values, saving memory and repetitive work.

How is this possible? wasm-bindgen provides a type called JsValue that stores a handle to an arbitrary value on the JavaScript side. This handle internally contains an index into a predefined array.

Each time a JavaScript value is passed to the Rust side as a result of a function call or a property access, it’s stored in this array and an index is sent to Rust. The next time Rust wants to do something with that value, it passes the index back and the JavaScript side retrieves the original value from the array and performs the required operation.
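
The underlying pattern is simply a handle table. Here is a tiny sketch of the general idea (written in C purely for illustration; this is not wasm-bindgen's actual JavaScript/Rust glue):

#include <stddef.h>

/* Toy handle table: the actual values live on one side of the boundary,
   while the other side only ever sees integer indices into this array. */
#define MAX_HANDLES 1024

static void  *slots[MAX_HANDLES];
static size_t next_slot = 1;            /* 0 is reserved as "no value" */

/* Store a value and hand out an index that the foreign side can keep. */
static size_t
handle_store(void *value)
{
    if (next_slot >= MAX_HANDLES) {
        return 0;                       /* table full */
    }
    slots[next_slot] = value;
    return next_slot++;
}

/* Resolve an index received from the foreign side back into the value. */
static void *
handle_get(size_t idx)
{
    return slots[idx];
}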

By reusing this mechanism, we could implement a Serde deserializer that requests only the required values from the JS side and immediately converts them to their native representation. It’s now open-sourced under https://github.com/cloudflare/serde-wasm-bindgen.

At first, we got much worse performance out of this due to the overhead of more frequent calls between 1) Wasm and JavaScript (SpiderMonkey has improved these recently, but other engines still lag behind) and 2) JavaScript and C++, which also can’t be optimised well in most engines.

The JavaScript <-> C++ overhead comes from the usage of TextEncoder to pass strings between JavaScript and Wasm in wasm-bindgen, and, indeed, it showed up as the highest in the benchmark profiles. This wasn’t surprising – after all, strings can appear not only in the value payloads, but also in property names, which have to be serialized and sent between JavaScript and Wasm over and over when using a generic JSON-like structure.

Luckily, because our deserializer doesn’t have to be compatible with JSON anymore, we can use our knowledge of Rust types and cache all the serialized property names as JavaScript value handles just once, and then keep reusing them for further property accesses.

This, combined with some changes to wasm-bindgen which we have upstreamed, allows our deserializer to be up to 3.5x faster in benchmarks than the original Serde support in wasm-bindgen, while saving ~33% off the resulting code size. Note that for string-heavy data structures it might still be slower than the current JSON-based integration, but the situation is expected to improve over time when the reference types proposal lands natively in Wasm.

After implementing and integrating this deserializer, we used the wasm-pack plugin for Webpack to build a Worker with both Rust and JavaScript parts combined and shipped it to some test zones.

Show me the numbers

Keep in mind that this proposal is in very early stages, and current benchmarks and demos are not representative of the final outcome (which should improve numbers much further).

As mentioned earlier, BinaryAST can mark functions that should be parsed lazily ahead of time. By using different levels of lazification in the encoder (https://github.com/binast/binjs-ref/blob/b72aff7dac7c692a604e91f166028af957cdcda5/crates/binjs_es6/src/lazy.rs#L43) and running tests against some popular JavaScript libraries, we found the following speed-ups.

Level 0 (no functions are lazified)

With lazy parsing disabled in both parsers we got a raw parsing speed improvement of between 3 and 10%.

Name           | Source size (kb) | JavaScript parse time (average ms) | BinaryAST parse time (average ms) | Diff (%)
React          | 20               | 0.403                              | 0.385                             | -4.56
D3 (v5)        | 240              | 11.178                             | 10.525                            | -6.018
Angular        | 180              | 6.985                              | 6.331                             | -9.822
Babel          | 780              | 21.255                             | 20.599                            | -3.135
Backbone       | 32               | 0.775                              | 0.699                             | -10.312
wabtjs         | 1720             | 64.836                             | 59.556                            | -8.489
Fuzzball (1.2) | 72               | 3.165                              | 2.768                             | -13.383

Level 3 (functions up to 3 levels deep are lazified)

But with the lazification set to skip nested functions of up to 3 levels, we see much more dramatic improvements in parsing time, between 90 and 97%. As mentioned earlier in the post, BinaryAST makes lazy parsing essentially free by completely skipping over the marked functions.

Name           | Source size (kb) | JavaScript parse time (average ms) | BinaryAST parse time (average ms) | Diff (%)
React          | 20               | 0.407                              | 0.032                             | -92.138
D3 (v5)        | 240              | 11.623                             | 0.224                             | -98.073
Angular        | 180              | 7.093                              | 0.680                             | -90.413
Babel          | 780              | 21.100                             | 0.895                             | -95.758
Backbone       | 32               | 0.898                              | 0.045                             | -94.989
wabtjs         | 1720             | 59.802                             | 1.601                             | -97.323
Fuzzball (1.2) | 72               | 2.937                              | 0.089                             | -96.970

All the numbers are from manual tests on a Linux x64 Intel i7 machine with 16 GB of RAM.

While these synthetic benchmarks are impressive, they are not representative of real-world scenarios. Normally you will use at least some of the loaded JavaScript during the startup. To check this scenario, we decided to test some realistic pages and demos on desktop and mobile Firefox and found speed-ups in page loads too.

For a sample application (https://github.com/cloudflare/binjs-demo, https://serve-binjs.that-test.site/) which weighed in at around 1.2 MB of JavaScript we got the following numbers for initial script execution:

Device              | JavaScript | BinaryAST
Desktop             | 338ms      | 314ms
Mobile (HTC One M8) | 2019ms     | 1455ms

Here is a video that will give you an idea of the improvement as seen by a user on mobile Firefox (in this case showing the entire page startup time):

[Video: full page startup on mobile Firefox, JavaScript vs. BinaryAST]

The next step is to start gathering data on real-world websites, while improving the underlying format.

How do I test BinaryAST on my website?

We’ve open-sourced our Worker so that it can be installed on any Cloudflare zone: https://github.com/binast/binjs-ref/tree/cf-wasm.

One thing to be wary of at the moment is that, even though the result gets stored in the cache, the initial encoding is still an expensive process, and might easily hit CPU limits on any non-trivial JavaScript files and fall back to the unencoded variant. We are working to improve this situation by releasing the BinaryAST encoder as a separate feature with more relaxed limits in the next few days.

Meanwhile, if you want to play with BinaryAST on larger real-world scripts, an alternative option is to use a static binjs_encode tool from https://github.com/binast/binjs-ref to pre-encode JavaScript files ahead of time. Then, you can use a Worker from https://github.com/cloudflare/binast-cf-worker to serve the resulting BinaryAST assets when supported and requested by the browser.

On the client side, you’ll currently need to download Firefox Nightly, go to about:config and enable unrestricted BinaryAST support via the following options:

[Screenshot: the about:config options that enable BinaryAST support]

Now, when opening a website with either of the Workers installed, Firefox will get BinaryAST instead of JavaScript automatically.

Summary

The amount of JavaScript in modern apps is presenting performance challenges for all consumers. Engine vendors are experimenting with different ways to improve the situation – some are focusing on raw decoding performance, some on parallelizing operations to reduce overall latency, some are researching new optimised formats for data representation, and some are inventing and improving protocols for the network delivery.

No matter which one it is, we all have a shared goal of making the Web better and faster. On Cloudflare’s side, we’re always excited about collaborating with all the vendors and combining various approaches to bring that goal closer with every step.

Live video just got more live: Introducing Concurrent Streaming Acceleration

Post Syndicated from Jon Levine original https://blog.cloudflare.com/introducing-concurrent-streaming-acceleration/

Today we’re excited to introduce Concurrent Streaming Acceleration, a new technique for reducing the end-to-end latency of live video on the web when using Stream Delivery.

Let’s dig into live-streaming latency, why it’s important, and what folks have done to improve it.

How “live” is “live” video?

Live streaming makes up an increasing share of video on the web. Whether it’s a TV broadcast, a live game show, or an online classroom, users expect video to arrive quickly and smoothly. And the promise of “live” is that the user is seeing events as they happen. But just how close to “real-time” is “live” Internet video?

Delivering live video on the Internet is still hard and adds lots of latency:

  1. The content source records video and sends it to an encoding server;
  2. The origin server transforms this video into a format like DASH, HLS or CMAF that can be delivered to millions of devices efficiently;
  3. A CDN is typically used to deliver encoded video across the globe;
  4. Client players decode the video and render it on the screen.

And all of this is under a time constraint – the whole process needs to happen in a few seconds, or video experiences will suffer. We call the total delay between when the video was shot and when it can be viewed on an end-user’s device the “end-to-end latency” (think of it as the time from the camera lens to your phone’s screen).

Traditional segmented delivery

Video formats like DASH, HLS, and CMAF work by splitting video into small files, called “segments”. A typical segment duration is 6 seconds.

If a client player needs to wait for a whole 6s segment to be encoded, sent through a CDN, and then decoded, it can be a long wait! It takes even longer if you want the client to build up a buffer of segments to protect against any interruptions in delivery. A typical player buffer for HLS is 3 segments:

[Diagram: clients may have to buffer three 6-second chunks, introducing at least 18s of latency]

When you consider encoding delays, it’s easy to see why live streaming latency on the Internet has typically been about 20-30 seconds. We can do better.

Reduced latency with chunked transfer encoding

A natural way to solve this problem is to enable client players to start playing the chunks while they’re downloading, or even while they’re still being created. Making this possible requires a clever bit of cooperation to encode and deliver the files in a particular way, known as “chunked encoding.” This involves splitting up segments into smaller, bite-sized pieces, or “chunks”. Chunked encoding can typically bring live latency down to 5 or 10 seconds.

Confusingly, the word “chunk” is overloaded to mean two different things:

  1. CMAF or HLS chunks, which are small pieces of a segment (typically 1s) that are aligned on key frames
  2. HTTP chunks, which are just a way of delivering any file over the web

[Diagram: chunked encoding splits segments into shorter chunks]

HTTP chunks are important because web clients have limited ability to process streams of data. Most clients can only work with data once they’ve received the full HTTP response, or at least a complete HTTP chunk. By using HTTP chunked transfer encoding, we enable video players to start parsing and decoding video sooner.
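
On the wire, HTTP chunked transfer encoding frames each piece as a hexadecimal length line followed by the bytes and a trailing CRLF, with a zero-length chunk marking the end of the response. A minimal sketch of emitting one chunk (send_all is a hypothetical helper that writes a whole buffer to the connection):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes all 'len' bytes of 'buf' to the socket. */
static void send_all(int fd, const void *buf, size_t len);

/* Write one HTTP chunk: "<hex length>\r\n<data>\r\n". Sending
   "0\r\n\r\n" afterwards terminates the chunked response. */
static void
send_http_chunk(int fd, const unsigned char *data, size_t len)
{
    char header[32];
    int  n = snprintf(header, sizeof(header), "%zx\r\n", len);

    send_all(fd, header, (size_t)n);
    send_all(fd, data, len);
    send_all(fd, "\r\n", 2);
}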

CMAF chunks are important so that decoders can actually play the bits that are in the HTTP chunks. Without encoding video in a careful way, decoders would have random bits of a video file that can’t be played.

CDNs can introduce additional buffering

Chunked encoding with HLS and CMAF is growing in use across the web today. Part of what makes this technique great is that HTTP chunked encoding is widely supported by CDNs – it’s been part of the HTTP spec for 20 years.

CDN support is critical because it allows low-latency live video to scale up and reach audiences of thousands or millions of concurrent viewers – something that’s currently very difficult to do with other, non-HTTP based protocols.

Unfortunately, even if you enable chunking to optimise delivery, your CDN may be working against you by buffering the entire segment. To understand why, consider what happens when many people request a live segment at the same time:

[Diagram: many viewers requesting the same live segment at the same time]

If the file is already in cache, great! CDNs do a great job at delivering cached files to huge audiences. But what happens when the segment isn’t in cache yet? Remember – this is the typical request pattern for live video!

Typically, CDNs are able to “stream on cache miss” from the origin. That looks something like this:

[Diagram: the CDN streaming a segment from the origin to a viewer on a cache miss]

But again – what happens when multiple people request the file at once? CDNs typically need to pull the entire file into cache before serving additional viewers:

Only one viewer can stream video, while other clients wait for the segment to buffer at the CDN

This behavior is understandable. CDN data centers consist of many servers. To avoid overloading origins, these servers typically coordinate amongst themselves using a “cache lock” (mutex) that allows only one server to request a particular file from origin at a given time. A side effect of this is that while a file is being pulled into cache, it can’t be served to any user other than the first one that requested it. Unfortunately, this cache lock also defeats the purpose of using chunked encoding!
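As an illustration only (this is not Cloudflare’s cache code), a cache lock behaves much like request coalescing: the first request for a key triggers the origin fetch, and every later request waits on that same in-flight operation:

// Illustrative request coalescing, not Cloudflare's actual implementation.
// The first request for a key goes to origin; later requests for the same
// key wait for that single fetch instead of hitting the origin again.
const inFlight = new Map();

async function fetchWithCacheLock(key, fetchFromOrigin) {
  if (!inFlight.has(key)) {
    inFlight.set(key, fetchFromOrigin(key).finally(() => inFlight.delete(key)));
  }
  // Every caller shares the same promise -- but nobody gets a single byte
  // until the whole origin fetch resolves, which is exactly what defeats
  // chunked encoding for live video.
  return inFlight.get(key);
}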

To recap thus far:

  • Chunked encoding splits up video segments into smaller pieces
  • This can reduce end-to-end latency by allowing chunks to be fetched and decoded by players, even while segments are being produced at the origin server
  • Some CDNs neutralize the benefits of chunked encoding by buffering entire files inside the CDN before they can be delivered to clients

Cloudflare’s solution: Concurrent Streaming Acceleration

As you may have guessed, we think we can do better. Put simply, we now have the ability to deliver un-cached files to multiple clients simultaneously while we pull the file once from the origin server.

This sounds like a simple change, but there’s a lot of subtlety to do this safely. Under the hood, we’ve made deep changes to our caching infrastructure to remove the cache lock and enable multiple clients to be able to safely read from a single file while it’s still being written.

The best part is – all of Cloudflare now works this way! There’s no need to opt in, or even make a config change, to get the benefit.

We rolled this feature out a couple of months ago and have been really pleased with the results so far. We measure success by the “cache lock wait time,” i.e. how long a request must wait for other requests – a direct component of Time To First Byte.  One OTT customer saw this metric drop from 1.5s at P99 to nearly 0, as expected:

[Chart: cache lock wait time at P99 dropping from 1.5s to nearly 0]

This directly translates into a 1.5-second improvement in end-to-end latency. Live video just got more live!

Conclusion

New techniques like chunked encoding have revolutionized live delivery, enabling publishers to deliver low-latency live video at scale. Concurrent Streaming Acceleration helps you unlock the power of this technique at your CDN, potentially shaving precious seconds off end-to-end latency.

If you’re interested in using Cloudflare for live video delivery, contact our enterprise sales team.

And if you’re interested in working on projects like this and helping us improve live video delivery for the entire Internet, join our engineering team!

Announcing Cloudflare Image Resizing: Simplifying Optimal Image Delivery

Post Syndicated from Isaac Specter original https://blog.cloudflare.com/announcing-cloudflare-image-resizing-simplifying-optimal-image-delivery/

In the past three years, the amount of image data on the median mobile webpage has doubled. Growing images translate directly to users hitting data transfer caps, experiencing slower websites, and even leaving if a website doesn’t load in a reasonable amount of time. The crime is that many of these images are so slow because they are larger than they need to be, sending data over the wire which has absolutely no (positive) impact on the user’s experience.

To provide a concrete example, let’s consider this photo of Cloudflare’s Lava Lamp Wall:

[Side-by-side photos: the image scaled to 300 pixels wide, and the same image delivered at its original high resolution]

On the left you see the photo, scaled to 300 pixels wide. On the right you see the same image delivered in its original high resolution, scaled in a desktop web browser. They both look exactly the same, yet the image on the right takes more than twenty times more data to load. Even for the best and most conscientious developers, resizing every image to handle every possible device geometry consumes valuable time, and it’s exceptionally easy to forget to do this resizing altogether.

Today we are launching a new product, Image Resizing, to fix this problem once and for all.

Announcing Image Resizing

With Image Resizing, Cloudflare adds another important product to its suite of available image optimizations.  This product allows customers to perform a rich set of key actions on images:

  • Resize – The source image will be resized to the specified height and width.  This action allows multiple different sized variants to be created for each specific use.
  • Crop – The source image will be resized to a new size that does not maintain the original aspect ratio and a portion of the image will be removed.  This can be especially helpful for headshots and product images where different formats must be achieved by keeping only a portion of the image.
  • Compress – The source image will have its file size reduced by applying lossy compression.  This should be used when slight quality reduction is an acceptable trade for file size reduction.
  • Convert to WebP – When the user’s browser supports it, the source image will be converted to WebP.  Delivering a WebP image takes advantage of the modern, highly optimized image format.

By using a combination of these actions, customers store a single high quality image on their server, and Image Resizing can be leveraged to create specialized variants for each specific use case.  Without any additional effort, each variant will also automatically benefit from Cloudflare’s global caching.

Examples

Ecommerce Thumbnails

Ecommerce sites typically store a high-quality image of each product.  From that image, they need to create different variants depending on how that product will be displayed.  One example is creating thumbnails for a catalog view.  Using Image Resizing, if the high quality image is located here:

https://example.com/images/shoe123.jpg

This is how to display a 75×75 pixel thumbnail using Image Resizing:

<img src="/cdn-cgi/image/width=75,height=75/images/shoe123.jpg">

Responsive Images

When tailoring a site to work on various device types and sizes, it’s important to always use correctly sized images.  This can be difficult when images are intended to fill a particular percentage of the screen.  To solve this problem, the srcset and sizes attributes of the <img> element can be used.

Without Image Resizing, multiple versions of the same image would need to be created and stored.  In this example, a single high quality copy of hero.jpg is stored, and Image Resizing is used to resize for each particular size as needed.

<img width="100%"
     srcset="/cdn-cgi/image/fit=contain,width=320/assets/hero.jpg 320w,
             /cdn-cgi/image/fit=contain,width=640/assets/hero.jpg 640w,
             /cdn-cgi/image/fit=contain,width=960/assets/hero.jpg 960w,
             /cdn-cgi/image/fit=contain,width=1280/assets/hero.jpg 1280w,
             /cdn-cgi/image/fit=contain,width=2560/assets/hero.jpg 2560w"
     src="/cdn-cgi/image/width=960/assets/hero.jpg">

Enforce Maximum Size Without Changing URLs

Image Resizing is also available from within a Cloudflare Worker. Workers allow you to write code which runs close to your users all around the world. For example, you might wish to add Image Resizing to your images while keeping the same URLs. Your users and clients would be able to use the same image URLs as always, but the images will be transparently modified in whatever way you need.

You can install a Worker on a route which matches your image URLs, and resize any images larger than a limit:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  return fetch(request, {
    cf: {
      image: {
        width: 800,
        height: 800,
        fit: 'scale-down'
      }
    }
  })
}

As a Worker is just code, it is also easy to run this worker only on URLs with image extensions, or even to only resize images being delivered to mobile clients.
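As a sketch of that idea (the extension list, the mobile check and the widths below are illustrative assumptions, not recommended values), the Worker above could be narrowed like this:

// Sketch only: resize just common image extensions, and use a smaller
// width for clients that identify as mobile. Values are illustrative.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const url = new URL(request.url)
  if (!/\.(jpe?g|png|gif)$/i.test(url.pathname)) {
    return fetch(request)  // not an image: pass through untouched
  }
  const isMobile = /Mobile/i.test(request.headers.get('User-Agent') || '')
  return fetch(request, {
    cf: {
      image: {
        width: isMobile ? 400 : 800,
        fit: 'scale-down'
      }
    }
  })
}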

Cloudflare and Images

Cloudflare has a long history building tools to accelerate images. Our caching has always helped reduce latency by storing a copy of images closer to the user.  Polish automates options for both lossless and lossy image compression to remove unnecessary bytes from images.  Mirage accelerates image delivery based on device type. We are continuing to invest in all of these tools, as they all serve a unique role in improving the image experience on the web.

Image Resizing is different because it is the first image product at Cloudflare to give developers full control over how their images will be served. You should choose Image Resizing if you are comfortable defining, in advance or within a Cloudflare Worker, the sizes you wish your images to be served at.

Next Steps and Simple Pricing

Image Resizing is available today for Business and Enterprise Customers.  To enable it, log in to the Cloudflare Dashboard and navigate to the Speed tab.  There you’ll find the section for Image Resizing, which you can enable with one click.

This product is included in the Business and Enterprise plans at no additional cost with generous usage limits.  Business Customers have a limit of 100k requests per month and will be charged $10 for each additional 100k requests per month.  Enterprise Customers have a 10M requests per month limit with discounted tiers for higher usage.  Requests are defined as a hit on a URI that contains Image Resizing or a call to Image Resizing from a Worker.

Now that you’ve enabled Image Resizing, it’s time to resize your first image.

  1. Using your existing site, store an image here: https://yoursite.com/images/yourimage.jpg
  2. Use this URL to resize that image:
    https://yoursite.com/cdn-cgi/image/width=100,height=100,quality=75/images/yourimage.jpg
  3. Experiment with changing width=, height=, and quality=.

The instructions above use the Default URL Format for Image Resizing.  For details on options, use cases, and compatibility, refer to our Developer Documentation.

Parallel streaming of progressive images

Post Syndicated from Andrew Galloni original https://blog.cloudflare.com/parallel-streaming-of-progressive-images/

Progressive image rendering and HTTP/2 multiplexing technologies have existed for a while, but now we’ve combined them in a new way that makes them much more powerful. With Cloudflare’s progressive streaming, images appear to load in half of the time, and browsers can start rendering pages sooner.

In HTTP/1.1 connections, servers didn’t have any choice about the order in which resources were sent to the client; they had to send responses, as a whole, in the exact order they were requested by the web browser. HTTP/2 improved this by adding multiplexing and prioritization, which allows servers to decide exactly what data is sent and when. We’ve taken advantage of these new HTTP/2 capabilities to improve perceived speed of loading of progressive images by sending the most important fragments of image data sooner.

This feature is compatible with all major browsers, and doesn’t require any changes to page markup, so it’s very easy to adopt. Sign up for the Beta to enable it on your site!

What is progressive image rendering?

Basic images load strictly from top to bottom. If a browser has received only half of an image file, it can show only the top half of the image. Progressive images have their content arranged not from top to bottom, but from a low level of detail to a high level of detail. Receiving a fraction of image data allows browsers to show the entire image, only with a lower fidelity. As more data arrives, the image becomes clearer and sharper.

This works great in the JPEG format, where only about 10-15% of the data is needed to display a preview of the image, and at 50% of the data the image looks almost as good as when the whole file is delivered. Progressive JPEG images contain exactly the same data as baseline images, merely reshuffled in a more useful order, so progressive rendering doesn’t add any cost to the file size. This is possible because JPEG doesn’t store the image as pixels. Instead, it represents the image as frequency coefficients, which are like a set of predefined patterns that can be blended together, in any order, to reconstruct the original image. The inner workings of JPEG are really fascinating, and you can learn more about them from my recent performance.now() conference talk.

The end result is that the images can look almost fully loaded in half of the time, for free! The page appears to be visually complete and can be used much sooner. The rest of the image data arrives shortly after, upgrading images to their full quality, before visitors have time to notice anything is missing.

HTTP/2 progressive streaming

But there’s a catch. Websites have more than one image (sometimes even hundreds of images). When the server sends image files naïvely, one after another, the progressive rendering doesn’t help that much, because overall the images still load sequentially:

[Diagram: multiple progressive images loading one after another over a single connection]

Having complete data for half of the images (and no data for the other half) doesn’t look as good as having half of the data for all images.

And there’s another problem: when the browser doesn’t know image sizes yet, it lays the page out with placeholders instead, and re-lays out the page as each image loads. This can make pages jump during loading, which is inelegant, distracting and annoying for the user.

Our new progressive streaming feature greatly improves the situation: we can send all of the images at once, in parallel. This way the browser gets size information for all of the images as soon as possible, can paint a preview of all images without having to wait for a lot of data, and large images don’t delay loading of styles, scripts and other more important resources.

This idea of streaming of progressive images in parallel is as old as HTTP/2 itself, but it needs special handling in low-level parts of web servers, and so far this hasn’t been implemented at a large scale.

When we were improving our HTTP/2 prioritization, we realized it can be also used to implement this feature. Image files as a whole are neither high nor low priority. The priority changes within each file, and dynamic re-prioritization gives us the behavior we want:

  • The image header that contains the image size is very high priority, because the browser needs to know the size as soon as possible to do page layout. The image header is small, so it doesn’t hurt to send it ahead of other data.

  • The minimum amount of data in the image required to show a preview of the image has a medium priority (we’d like to plug "holes" left for unloaded images as soon as possible, but also leave some bandwidth available for scripts, fonts and other resources)

  • The remainder of the image data is low priority. Browsers can stream it last to refine image quality once there’s no rush, since the page is already fully usable.

Knowing the exact amount of data to send in each phase requires understanding the structure of image files, but it seemed weird to us to make our web server parse image responses and have a format-specific behavior hardcoded at a protocol level. By framing the problem as a dynamic change of priorities, we were able to elegantly separate low-level networking code from knowledge of image formats. We can use Workers or offline image processing tools to analyze the images, and instruct our server to change HTTP/2 priorities accordingly.
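As one possible offline analysis (a rough sketch, not a description of our production tooling), the Start-of-Scan markers in a progressive JPEG give natural byte offsets at which to change priority, since each scan refines the whole image:

// Rough sketch: find Start-of-Scan (0xFF 0xDA) marker offsets in a
// progressive JPEG. Each scan refines the whole image, so these offsets
// are reasonable points at which to lower the priority of what remains.
// (Ignores edge cases such as embedded thumbnails inside metadata segments.)
function jpegScanOffsets(bytes) {
  const offsets = [];
  for (let i = 0; i + 1 < bytes.length; i++) {
    if (bytes[i] === 0xFF && bytes[i + 1] === 0xDA) {
      offsets.push(i);
    }
  }
  return offsets;
}

// These offsets can then drive a priority-change instruction like the
// cf-priority-change header described later in this post.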

The great thing about parallel streaming of images is that it doesn’t add any overhead. We’re still sending the same data, the same amount of data, we’re just sending it in a smarter order. This technique takes advantage of existing web standards, so it’s compatible with all browsers.

The waterfall

Here are waterfall charts from WebPageTest showing comparison of regular HTTP/2 responses and progressive streaming. In both cases the files were exactly the same, the amount of data transferred was the same, and the overall page loading time was the same (within measurement noise). In the charts, blue segments show when data was transferred, and green shows when each request was idle.

[Waterfall charts: regular HTTP/2 responses (first chart) and progressive streaming (second chart)]

The first chart shows a typical server behavior that makes images load mostly sequentially. The chart itself looks neat, but the actual experience of loading that page was not great — the last image didn’t start loading until almost the end.

The second chart shows images loaded in parallel. The blue vertical streaks throughout the chart are image headers sent early followed by a couple of stages of progressive rendering. You can see that useful data arrived sooner for all of the images. You may notice that one of the images has been sent in one chunk, rather than split like all the others. That’s because at the very beginning of a TCP/IP connection we don’t know the true speed of the connection yet, and we have to sacrifice some opportunity to do prioritization in order to maximize the connection speed.

The metrics compared to other solutions

There are other techniques intended to provide image previews quickly, such as low-quality image placeholders (LQIP), but they have several drawbacks. They add unnecessary data for the placeholders, usually interfere with browsers’ preload scanner, and delay loading of full-quality images due to their dependence on the JavaScript needed to upgrade the previews to full images.

  • Our solution doesn’t cause any additional requests, and doesn’t add any extra data. Overall page load time is not delayed.
  • Our solution doesn’t require any JavaScript. It takes advantage of functionality supported natively in the browsers.
  • Our solution doesn’t require any changes to page’s markup, so it’s very safe and easy to deploy site-wide.

The improvement in user experience is reflected in performance metrics such as SpeedIndex and time to visually complete. Notice that with regular image loading the visual progress is linear, but with progressive streaming it quickly jumps to mostly complete:

[Charts: visual completeness over time for regular image loading vs. progressive streaming]

Getting the most out of progressive rendering

Avoid ruining the effect with JavaScript. Scripts that hide images and wait until the onload event to reveal them (with a fade in, etc.) will defeat progressive rendering. Progressive rendering works best with the good old <img> element.
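For example, a lazy-reveal snippet like the following (shown purely to illustrate the anti-pattern) keeps images invisible until onload fires, so users never see the progressive preview:

// Anti-pattern (illustrative): hiding images until they have fully loaded
// throws away the progressive preview the browser could have shown.
document.querySelectorAll('img.fade-in').forEach(img => {
  img.style.opacity = '0';
  img.addEventListener('load', () => {
    img.style.transition = 'opacity 0.3s';
    img.style.opacity = '1'; // the user only sees the image now
  });
});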

Is it JPEG-only?

Our implementation is format-independent, but progressive streaming is useful only for certain file types. For example, it wouldn’t make sense to apply it to scripts or stylesheets: these resources are rendered as all-or-nothing.

Prioritizing of image headers (containing image size) works for all file formats.

The benefits of progressive rendering are unique to JPEG (supported in all browsers) and JPEG 2000 (supported in Safari). GIF and PNG have interlaced modes, but these modes come at a cost of worse compression. WebP doesn’t even support progressive rendering at all. This creates a dilemma: WebP is usually 20%-30% smaller than a JPEG of equivalent quality, but progressive JPEG appears to load 50% faster. There are next-generation image formats that support progressive rendering better than JPEG, and compress better than WebP, but they’re not supported in web browsers yet. In the meantime you can choose between the bandwidth savings of WebP or the better perceived performance of progressive JPEG by changing Polish settings in your Cloudflare dashboard.

Custom header for experimentation

We also support a custom HTTP header that allows you to experiment with, and optimize streaming of other resources on your site. For example, you could make our servers send the first frame of animated GIFs with high priority and deprioritize the rest. Or you could prioritize loading of resources mentioned in <head> of HTML documents before <body> is loaded.

The custom header can be set only from a Worker. The syntax is a comma-separated list of file positions with priority and concurrency. The priority and concurrency are the same as in the whole-file cf-priority header described in the previous blog post.

cf-priority-change: <offset in bytes>:<priority>/<concurrency>, ...

For example, for a progressive JPEG we use something like (this is a fragment of JS to use in a Worker):

let headers = new Headers(response.headers);
headers.set("cf-priority", "30/0");
headers.set("cf-priority-change", "512:20/1, 15000:10/n");
return new Response(response.body, {headers});

This instructs the server to use priority 30 initially, while it sends the first 512 bytes; then to switch to priority 20 with some concurrency (/1); and finally, after sending 15000 bytes of the file, to switch to low priority and high concurrency (/n) to deliver the rest of the file.

We’ll try to split HTTP/2 frames to match the offsets specified in the header to change the sending priority as soon as possible. However, priorities don’t guarantee that data of different streams will be multiplexed exactly as instructed, since the server can prioritize only when it has data of multiple streams waiting to be sent at the same time. If some of the responses arrive much sooner from the upstream server or the cache, the server may send them right away, without waiting for other responses.

Try it!

You can use our Polish tool to convert your images to progressive JPEG. Sign up for the beta to have them elegantly streamed in parallel.

Better HTTP/2 Prioritization for a Faster Web

Post Syndicated from Patrick Meenan original https://blog.cloudflare.com/better-http-2-prioritization-for-a-faster-web/

HTTP/2 promised a much faster web and Cloudflare rolled out HTTP/2 access for all our customers long, long ago. But one feature of HTTP/2, Prioritization, didn’t live up to the hype. Not because it was fundamentally broken but because of the way browsers implemented it.

Today Cloudflare is pushing out a change to HTTP/2 Prioritization that gives our servers control of prioritization decisions that truly make the web much faster.

Historically the browser has been in control of deciding how and when web content is loaded. Today we are introducing a radical change to that model for all paid plans that puts control into the hands of the site owner directly. Customers can enable “Enhanced HTTP/2 Prioritization” in the Speed tab of the Cloudflare dashboard: this overrides the browser defaults with an improved scheduling scheme that results in a significantly faster visitor experience (we have seen 50% faster on multiple occasions). With Cloudflare Workers, site owners can take this a step further and fully customize the experience to their specific needs.

Background

Web pages are made up of dozens (sometimes hundreds) of separate resources that are loaded and assembled by the browser into the final displayed content. This includes the visible content the user interacts with (HTML, CSS, images) as well as the application logic (JavaScript) for the site itself, ads, analytics for tracking site usage and marketing tracking beacons. The sequencing of how those resources are loaded can have a significant impact on how long it takes for the user to see the content and interact with the page.

A browser is basically an HTML processing engine that goes through the HTML document and follows the instructions in order from the start of the HTML to the end, building the page as it goes along. References to stylesheets (CSS) tell the browser how to style the page content and the browser will delay displaying content until it has loaded the stylesheet (so it knows how to style the content it is going to display). Scripts referenced in the document can have several different behaviors. If the script is tagged as “async” or “defer” the browser can keep processing the document and just run the script code whenever the scripts are available. If the scripts are not tagged as async or defer then the browser MUST stop processing the document until the script has downloaded and executed before continuing. These are referred to as “blocking” scripts because they block the browser from continuing to process the document until they have been loaded and executed.

The HTML document is split into two parts. The <head> of the document is at the beginning and contains stylesheets, scripts and other instructions for the browser that are needed to display the content. The <body> of the document comes after the head and contains the actual page content that is displayed in the browser window (though scripts and stylesheets are allowed to be in the body as well). Until the browser gets to the body of the document there is nothing to display to the user and the page will remain blank so getting through the head of the document as quickly as possible is important. “HTML5 rocks” has a great tutorial on how browsers work if you want to dive deeper into the details.

The browser is generally in charge of determining the order of loading the different resources it needs to build the page and to continue processing the document. In the case of HTTP/1.x, the browser is limited in how many things it can request from any one server at a time (generally 6 connections and only one resource at a time per connection) so the ordering is strictly controlled by the browser by how things are requested. With HTTP/2 things change pretty significantly. The browser can request all of the resources at once (at least as soon as it knows about them) and it provides detailed instructions to the server for how the resources should be delivered.

Optimal Resource Ordering

For most parts of the page loading cycle there is an optimal ordering of the resources that will result in the fastest user experience (and the difference between optimal and not can be significant – as much as a 50% improvement or more).

As described above, early in the page load cycle before the browser can render any content it is blocked on the CSS and blocking JavaScript in the <head> section of the HTML. During that part of the loading cycle it is best for 100% of the connection bandwidth to be used to download the blocking resources and for them to be downloaded one at a time in the order they are defined in the HTML. That lets the browser parse and execute each item while it is downloading the next blocking resource, allowing the download and execution to be pipelined.

The scripts take the same amount of time to download when downloaded in parallel or one after the other, but by downloading them sequentially the first script can be processed and executed while the second script is downloading.

Once the render-blocking content has loaded things get a little more interesting and the optimal loading may depend on the specific site or even business priorities (user content vs ads vs analytics, etc). Fonts in particular can be difficult as the browser only discovers what fonts it needs after the stylesheets have been applied to the content that is about to be displayed so by the time the browser knows about a font, it is needed to display text that is already ready to be drawn to the screen. Any delays in getting the font loaded end up as periods with blank text on the screen (or text displayed using the wrong font).

Generally there are some tradeoffs that need to be considered:

  • Custom fonts and visible images in the visible part of the page (viewport) should be loaded as quickly as possible. They directly impact the user’s visual experience of the page loading.
  • Non-blocking JavaScript should be downloaded serially relative to other JavaScript resources so the execution of each can be pipelined with the downloads. The JavaScript may include user-facing application logic as well as analytics tracking and marketing beacons and delaying them can cause a drop in the metrics that the business tracks.
  • Images benefit from downloading in parallel. The first few bytes of an image file contain the image dimensions which may be necessary for browser layout, and progressive images downloading in parallel can look visually complete with around 50% of the bytes transferred.

Weighing the tradeoffs, one strategy that works well in most cases is:

  • Custom fonts download sequentially and split the available bandwidth with visible images.
  • Visible images download in parallel, splitting the “images” share of the bandwidth among them.
  • When there are no more fonts or visible images pending:
    • Non-blocking scripts download sequentially and split the available bandwidth with non-visible images
    • Non-visible images download in parallel, splitting the “images” share of the bandwidth among them.

That way the content visible to the user is loaded as quickly as possible, the application logic is delayed as little as possible and the non-visible images are loaded in such a way that layout can be completed as quickly as possible.

Example

For illustrative purposes, we will use a simplified product category page from a typical e-commerce site. In this example the page has:

  • The HTML file for the page itself, represented by a blue box.
  • 1 external stylesheet (CSS file), represented by a green box.
  • 4 external scripts (JavaScript), represented by orange boxes. 2 of the scripts are blocking at the beginning of the page and 2 are asynchronous. The blocking script boxes use a darker shade of orange.
  • 1 custom web font, represented by a red box.
  • 13 images, represented by purple boxes. The page logo and 4 of the product images are visible in the viewport and 8 of the product images require scrolling to see. The 5 visible images use a darker shade of purple.

For simplicity, we will assume that all of the resources are the same size and each takes 1 second to download on the visitor’s connection. Loading everything takes a total of 20 seconds, but HOW it is loaded can have a huge impact on the experience.

This is what the described optimal loading would look like in the browser as the resources load:
[Animation: the example page loading with the optimal prioritization]

  • The page is blank for the first 4 seconds while the HTML, CSS and blocking scripts load, all using 100% of the connection.
  • At the 4-second mark the background and structure of the page is displayed with no text or images.
  • One second later, at 5 seconds, the text for the page is displayed.
  • From 5-10 seconds the images load, starting out as blurry but sharpening very quickly. By around the 7-second mark it is almost indistinguishable from the final version.
  • At the 10 second mark all of the visual content in the viewport has completed loading.
  • Over the next 2 seconds the asynchronous JavaScript is loaded and executed, running any non-critical logic (analytics, marketing tags, etc).
  • For the final 8 seconds the rest of the product images load so they are ready for when the user scrolls.

Current Browser Prioritization

All of the current browser engines implement different prioritization strategies, none of which are optimal.

Microsoft Edge and Internet Explorer do not support prioritization so everything falls back to the HTTP/2 default which is to load everything in parallel, splitting the bandwidth evenly among everything. Microsoft Edge is moving to use the Chromium browser engine in future Windows releases which will help improve the situation. In our example page this means that the browser is stuck in the head for the majority of the loading time since the images are slowing down the transfer of the blocking scripts and stylesheets.

Visually that results in a pretty painful experience of staring at a blank screen for 19 seconds before most of the content displays, followed by a 1-second delay for the text to display. Be patient when watching the animated progress because for the 19 seconds of blank screen it may feel like nothing is happening (even though it is):
[Animation: the example page loading in Edge/Internet Explorer]

Safari loads all resources in parallel, splitting the bandwidth between them based on how important Safari believes they are (with render-blocking resources like scripts and stylesheets being more important than images). Images load in parallel but also load at the same time as the render-blocking content.

While similar to Edge in that everything downloads at the same time, by allocating more bandwidth to the render-blocking resources Safari can display the content much sooner:
[Animation: the example page loading in Safari]

  • At around 8 seconds the stylesheet and scripts have finished loading so the page can start to be displayed. Since the images were loading in parallel, they can also be rendered in their partial state (blurry for progressive images). This is still twice as slow as the optimal case but much better than what we saw with Edge.
  • At around 11 seconds the font has loaded so the text can be displayed and more image data has been downloaded so the images will be a little sharper. This is comparable to the experience around the 7-second mark for the optimal loading case.
  • For the remaining 9 seconds of the load the images get sharper as more data for them downloads until it is finally complete at 20 seconds.

Firefox builds a dependency tree that groups resources and then schedules the groups to either load one after another or to share bandwidth between the groups. Within a given group the resources share bandwidth and download concurrently. The images are scheduled to load after the render-blocking stylesheets and to load in parallel but the render-blocking scripts and stylesheets also load in parallel and do not get the benefits of pipelining.

In our example case this ends up being a slightly faster experience than with Safari since the images are delayed until after the stylesheets complete:
[Animation: the example page loading in Firefox]

  • At the 6 second mark the initial page content is rendered with the background and blurry versions of the product images (compared to 8 seconds for Safari and 4 seconds for the optimal case).
  • At 8 seconds the font has loaded and the text can be displayed along with slightly sharper versions of the product images (compared to 11 seconds for Safari and 7 seconds in the Optimal case).
  • For the remaining 12 seconds of the loading the product images get sharper as the remaining content loads.

Chrome (and all Chromium-based browsers) prioritizes resources into a list. This works really well for the render-blocking content that benefits from loading in order but works less well for images. Each image loads to 100% completion before starting the next image.

In practice this is almost as good as the optimal loading case with the only difference being that the images load one at a time instead of in parallel:
[Animation: the example page loading in Chrome]

  • Up until the 5 second mark the Chrome experience is identical to the optimal case, displaying the background at 4 seconds and the text content at 5.
  • For the next 5 seconds the visible images load one at a time until they are all complete at the 10 second mark (compared to the optimal case where they are just slightly blurry at 7 seconds and sharpen up for the remaining 3 seconds).
  • After the visual part of the page is complete at 10 seconds (identical to the optimal case), the remaining 10 seconds are spent running the async scripts and loading the hidden images (just like with the optimal loading case).

Visual Comparison

Visually, the impact can be quite dramatic, even though they all take the same amount of time to technically load all of the content:

[Filmstrip comparison: visual progress in each browser versus the optimal loading]

Server-Side Prioritization

HTTP/2 prioritization is requested by the client (browser) and it is up to the server to decide what to do based on the request. A good number of servers don’t do anything at all with the prioritization, but those that do all honor the client’s request. Another option would be to decide on the best prioritization to use on the server side, taking into account the client’s request.

Per the specification, HTTP/2 prioritization is a dependency tree that requires full knowledge of all of the in-flight requests to be able to prioritize resources against each other. That allows for incredibly complex strategies but is difficult to implement well on either the browser or server side (as evidenced by the different browser strategies and varying levels of server support). To make prioritization easier to manage we have developed a simpler prioritization scheme that still has all of the flexibility needed for optimal scheduling.

The Cloudflare prioritization scheme consists of 64 priority “levels” and within each priority level there are groups of resources that determine how the connection is shared between them:

[Diagram: 64 priority levels, each with concurrency groups 0, 1 and n]

All of the resources at a higher priority level are transferred before moving on to the next lower priority level.

Within a given priority level, there are 3 different “concurrency” groups:

  • 0 : All of the resources in the concurrency “0” group are sent sequentially in the order they were requested, using 100% of the bandwidth. Only after all of the concurrency “0” group resources have been downloaded are other groups at the same level considered.
  • 1 : All of the resources in the concurrency “1” group are sent sequentially in the order they were requested. The available bandwidth is split evenly between the concurrency “1” group and the concurrency “n” group.
  • n : The resources in the concurrency “n” group are sent in parallel, splitting the bandwidth available to the group between them.

Practically speaking, the concurrency “0” group is useful for critical content that needs to be processed sequentially (scripts, CSS, etc). The concurrency “1” group is useful for less-important content that can share bandwidth with other resources but where the resources themselves still benefit from processing sequentially (async scripts, non-progressive images, etc). The concurrency “n” group is useful for resources that benefit from processing in parallel (progressive images, video, audio, etc).
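To make those sharing rules concrete, here is a toy model (purely illustrative, not the edge implementation) that computes which pending responses would receive bandwidth at a given instant, and what fraction each would get:

// Toy model of the scheme above, not the real server. Each response has a
// priority level (higher levels send first) and a concurrency group.
// Assumes `pending` is ordered by request time.
function bandwidthShares(pending) {
  const top = Math.max(...pending.map(r => r.level));
  const level = pending.filter(r => r.level === top);

  const g0 = level.filter(r => r.concurrency === '0');
  if (g0.length > 0) {
    // Group 0 is strictly sequential: only the first-requested item sends.
    return new Map([[g0[0].name, 1.0]]);
  }

  const g1 = level.filter(r => r.concurrency === '1');
  const gn = level.filter(r => r.concurrency === 'n');
  const shares = new Map();
  const split = g1.length > 0 && gn.length > 0;  // both groups present?
  if (g1.length > 0) {
    shares.set(g1[0].name, split ? 0.5 : 1.0);   // sequential within group 1
  }
  gn.forEach(r => shares.set(r.name, (split ? 0.5 : 1.0) / gn.length));
  return shares;
}

// Example: an async script (group 1) sharing the link with two progressive
// images (group n) at the same level:
// bandwidthShares([
//   { name: 'app.js', level: 20, concurrency: '1' },
//   { name: 'a.jpg',  level: 20, concurrency: 'n' },
//   { name: 'b.jpg',  level: 20, concurrency: 'n' },
// ]) => Map { 'app.js' => 0.5, 'a.jpg' => 0.25, 'b.jpg' => 0.25 }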

Cloudflare Default Prioritization

When enabled, the enhanced prioritization implements the “optimal” scheduling of resources described above. The specific prioritizations applied look like this:

[Table: Cloudflare’s default priority and concurrency assignments by resource type]

This prioritization scheme allows sending the render-blocking content serially, followed by the visible images in parallel and then the rest of the page content with some level of sharing to balance application and content loading. The “* If Detectable” caveat is that not all browsers differentiate between the different types of stylesheets and scripts, but loading will still be significantly faster in all cases. 50% faster by default, particularly for Edge and Safari visitors, is not unusual:

[Filmstrip: page load with default browser prioritization vs. Enhanced HTTP/2 Prioritization]

Customizing Prioritization with Workers

Faster-by-default is great, but where things get really interesting is that the ability to configure the prioritization is also exposed to Cloudflare Workers, so sites can override the default prioritization for resources or implement their own complete prioritization schemes.

If a Worker adds a “cf-priority” header to the response, Cloudflare edge servers will use the specified priority and concurrency for that response. The format of the header is <priority>/<concurrency> so something like response.headers.set('cf-priority', '30/0'); would set the priority to 30 with a concurrency of 0 for the given response. Similarly, “30/1” would set concurrency to 1 and “30/n” would set concurrency to n.

With this level of flexibility a site can tweak resource prioritization to meet its needs: boosting the priority of some critical async scripts, for example, or increasing the priority of hero images before the browser has identified that they are in the viewport.
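For instance, a sketch of a Worker doing exactly that might look like the following (the paths and priority values are hypothetical; the header uses the <priority>/<concurrency> format described above):

// Sketch: override the default prioritization for a few known resources.
// The specific paths and values here are illustrative only.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const response = await fetch(request)
  const url = new URL(request.url)
  const headers = new Headers(response.headers)

  if (url.pathname.startsWith('/images/hero-')) {
    headers.set('cf-priority', '40/n')  // hero images: high priority, in parallel
  } else if (url.pathname === '/js/critical-async.js') {
    headers.set('cf-priority', '35/1')  // critical async script: share bandwidth, sequential
  }
  return new Response(response.body, { headers })
}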

To help inform any prioritization decisions, the Workers runtime also exposes the browser-requested prioritization information in the request object passed in to the Worker’s fetch event listener (request.cf.requestPriority). The incoming requested priority is a semicolon-delimited list of attributes that looks something like this: “weight=192;exclusive=0;group=3;group-weight=127”.

  • weight: The browser-requested weight for the HTTP/2 prioritization.

  • exclusive: The browser-requested HTTP/2 exclusive flag (1 for Chromium-based browsers, 0 for others).

  • group: HTTP/2 stream ID for the request group (only non-zero for Firefox).

  • group-weight: HTTP/2 weight for the request group (only non-zero for Firefox).
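A small parsing sketch (illustrative only) can turn that attribute string into an object for use in your own prioritization logic:

// Sketch: parse "weight=192;exclusive=0;group=3;group-weight=127"
// into { weight: 192, exclusive: 0, group: 3, 'group-weight': 127 }.
function parseRequestPriority(str) {
  const result = {};
  for (const part of (str || '').split(';')) {
    const [key, value] = part.split('=');
    if (key) result[key.trim()] = Number(value);
  }
  return result;
}

// Inside a Worker fetch handler:
//   const prio = parseRequestPriority(request.cf.requestPriority);
//   if (prio.weight >= 192) { /* the browser considers this important */ }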

This is Just the Beginning

The ability to tune and control the prioritization of responses is the basic building block that a lot of future work will benefit from. We will be implementing our own advanced optimizations on top of it but by exposing it in Workers we have also opened it up to sites and researchers to experiment with different prioritization strategies. With the Apps Marketplace it is also possible for companies to build new optimization services on top of the Workers platform and make it available to other sites to use.

If you are on a Pro plan or above, head over to the speed tab in the Cloudflare dashboard and turn on “Enhanced HTTP/2 Prioritization” to accelerate your site.

Argo and the Cloudflare Global Private Backbone

Post Syndicated from Rustam Lalkaka original https://blog.cloudflare.com/argo-and-the-cloudflare-global-private-backbone/

Welcome to Speed Week! Each day this week, we’re going to talk about something Cloudflare is doing to make the Internet meaningfully faster for everyone.

Cloudflare has built a massive network of data centers in 180 cities in 75 countries. One way to think of Cloudflare is a global system to transport bits securely, quickly, and reliably from any point A to any other point B on the planet.

To make that a reality, we built Argo. Argo uses real-time global network information to route around brownouts, cable cuts, packet loss, and other problems on the Internet. Argo makes the network that Cloudflare relies on—the Internet—faster, more reliable, and more secure on every hop around the world.

We launched Argo two years ago, and it now carries over 22% of Cloudflare’s traffic. On an average day, Argo cuts the amount of time Internet users spend waiting for content by 112 years!

As Cloudflare and our traffic volumes have grown, it now makes sense to build our own private backbone to add further security, reliability, and speed to key connections between Cloudflare locations.

Today, we’re introducing the Cloudflare Global Private Backbone. It’s been in operation for a while now and links Cloudflare locations with private fiber connections.

This private backbone benefits all Cloudflare customers, and it shines in combination with Argo. Argo can select the best available link across the Internet on a per-data-center basis, and takes full advantage of the Cloudflare Global Private Backbone automatically.

Let’s open the hood on Argo and explain how our backbone network further improves performance for our customers.

What’s Argo?

Argo is like Waze for the Internet. Every day, Cloudflare carries hundreds of billions of requests across our network and the Internet. Because our network, our customers, and their end-users are well distributed globally, all of these requests flowing across our infrastructure paint a great picture of how different parts of the Internet are performing at any given time.

Just like Waze examines real data from real drivers to give you accurate, uncongested (and sometimes unorthodox) routes across town, Argo Smart Routing uses the timing data Cloudflare collects from each request to pick faster, more efficient routes across the Internet.

In practical terms, Cloudflare’s network is expansive in its reach. Some of the Internet links in a given region may be congested and cause poor performance (a literal traffic jam). By understanding this is happening and using alternative network locations and providers, Argo can put traffic on a less direct, but faster, route from its origin to its destination.

These benefits are not theoretical: enabling Argo Smart Routing shaves an average of 33% off HTTP time to first byte (TTFB).

One other thing we’re proud of: we’ve stayed super focused on making it easy to use. One click in the dashboard enables better, smarter routing, bringing the full weight of Cloudflare’s network, data, and engineering expertise to bear on making your traffic faster. Advanced analytics allow you to understand exactly how Argo is performing for you around the world.

You can read a lot more about how Argo works in our original launch blog post.

So far, we’ve been talking about Argo at a functional level: you turn it on and it makes requests that traverse the Internet to your origin faster. How does it actually work? Argo is dependent on a few things to make its magic happen: Cloudflare’s network, up-to-the-second performance data on how traffic is moving on the Internet, and machine learning routing algorithms.

Cloudflare’s Global Network

Cloudflare maintains a network of data centers around the world, and our network continues to grow significantly. Today, we have more than 180 data centers in 75 countries. That’s an additional 69 data centers since we launched Argo in May 2017.

In addition to adding new locations, Cloudflare is constantly working with network partners to add connectivity options to our network locations. A single Cloudflare data center may be peered with a dozen networks, connected to multiple Internet eXchanges (IXs), connected to multiple transit providers (e.g. Telia, GTT, etc), and now, connected to our own physical backbone. A given destination may be reachable over multiple different links from the same location; each of these links will have different performance and reliability characteristics.

This increased network footprint is important in making Argo faster. Additional network locations and providers mean Argo has more options at its disposal to route around network disruptions and congestion. Every time we add a new network location, we exponentially grow the number of routing options available to any given request.

Better routing for improved performance

Argo requires the huge global network we’ve built to do its thing. But it wouldn’t do much of anything if it didn’t have the smarts to actually take advantage of all our data centers and cables between them to move traffic faster.

Argo combines multiple machine learning techniques to build routes, test them, and disqualify routes that are not performing as we expect.

The generation of routes is performed on data using “offline” optimization techniques: Argo’s route construction algorithms take an input data set (timing data) and fixed optimization target (“minimize TTFB”), outputting routes that it believes satisfy this constraint.

Route disqualification is performed by a separate pipeline that has no knowledge of the route construction algorithms. These two systems are intentionally designed to be adversarial, allowing Argo to be both aggressive in finding better routes across the Internet but also adaptive to rapidly changing network conditions.

One specific example of Argo’s smarts is its ability to distinguish between multiple potential connectivity options as it leaves a given data center. We call this “transit selection”.

As we discussed above, some of our data centers may have a dozen different, viable options for reaching a given destination IP address. It’s as if you subscribed to every available ISP at your house, and you could choose to use any one of them for each website you tried to access. Transit selection enables Cloudflare to pick the fastest available path in real-time at every hop to reach the destination.

With transit selection, Argo is able to specify both:

1) Network location waypoints on the way to the origin.
2) The specific transit provider or link at each waypoint in the journey of the packet all the way from the source to the destination.

To analogize this to Waze, Argo giving directions without transit selection is like telling someone to drive to a waypoint (go to New York from San Francisco, passing through Salt Lake City), without specifying the roads to actually take to Salt Lake City or New York. With transit selection, we’re able to give full turn-by-turn directions — take I-80 out of San Francisco, take a left here, enter the Salt Lake City area using SR-201 (because I-80 is congested around SLC), etc. This allows us to route around issues on the Internet with much greater precision.
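Conceptually (this is a toy illustration, not Argo’s actual algorithm), transit selection boils down to keeping recent timing samples per available link and picking the currently fastest one for each destination:

// Toy illustration of transit selection, not Argo's real logic: given
// recent latency samples (ms) per available link for one destination,
// pick the link with the lowest median latency.
function pickTransit(samplesByLink) {
  let best = null;
  for (const [link, samples] of Object.entries(samplesByLink)) {
    const sorted = [...samples].sort((a, b) => a - b);
    const median = sorted[Math.floor(sorted.length / 2)];
    if (best === null || median < best.median) {
      best = { link, median };
    }
  }
  return best;
}

// pickTransit({ transitA: [27, 31, 29], transitB: [24, 26, 90], backbone: [21, 22, 21] })
// => { link: 'backbone', median: 21 }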

Transit selection requires logic in our inter-data center data plane (the components that actually move data across our network) to allow for differentiation between different providers and links available in each location. Some interesting network automation and advertisement techniques allow us to be much more discerning about which link actually gets picked to move traffic.

Without modifications to the Argo data plane, those options would be abstracted away by our edge routers, with the choice of transit left to BGP. We plan to talk more publicly about the routing techniques used in the future.

We are able to directly measure the impact transit selection has on Argo customer traffic. Looking at global average improvement, transit selection gets customers an additional 16% TTFB latency benefit over taking standard BGP-derived routes. That’s huge!

One thing we think about: Argo can itself change network conditions when moving traffic from one location or provider to another by inducing demand (adding additional data volume because of improved performance) and changing traffic profiles. With great power comes great intricacy.

Adding the Cloudflare Global Private Backbone

Given our diversity of transit and connectivity options in each of our data centers, and the smarts that allow us to pick between them, why did we go through the time and trouble of building a backbone for ourselves? The short answer: operating our own private backbone allows us much more control over end-to-end performance and capacity management.

When we buy transit or use a partner for connectivity, we’re relying on that provider to manage the link’s health and ensure that it stays uncongested and available. Some networks are better than others, and conditions change all the time.

As an example, here’s a measurement of jitter (variance in round trip time) between two of our data centers, Chicago and Newark, over a transit provider’s network:

[Chart: jitter between Chicago and Newark over a transit provider’s network]

Average jitter over the pictured 6 hours is 4ms, with average round trip latency of 27ms. Some amount of latency is something we just need to learn to live with; the speed of light is a tough physical constant to do battle with, and network protocols are built to function over links with high or low latency.

Jitter, on the other hand, is “bad” because it is unpredictable and network protocols and applications built on them often degrade quickly when jitter rises. Jitter on a link is usually caused by more buffering, queuing, and general competition for resources in the routing hardware on either side of a connection. As an illustration, having a VoIP conversation over a network with high latency is annoying but manageable. Each party on a call will notice “lag”, but voice quality will not suffer. Jitter causes the conversation to garble, with packets arriving on top of each other and unpredictable glitches making the conversation unintelligible.
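For reference, one common way to quantify jitter (a simplified sketch; not necessarily the exact method behind the charts here) is the mean absolute difference between successive round-trip time samples:

// Simplified jitter estimate: mean absolute difference between successive
// round-trip time samples, in milliseconds. Illustrative only.
function jitter(rtts) {
  let total = 0;
  for (let i = 1; i < rtts.length; i++) {
    total += Math.abs(rtts[i] - rtts[i - 1]);
  }
  return total / (rtts.length - 1);
}

// jitter([27, 31, 26, 33, 27]) => (4 + 5 + 7 + 6) / 4 = 5.5 ms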

Here’s the same jitter chart between Chicago and Newark, except this time, transiting the Cloudflare Global Private Backbone:

[Chart: jitter between Chicago and Newark over the Cloudflare Global Private Backbone]

Much better! Here we see a jitter measurement of 536μs (microseconds), almost eight times better than the measurement over a transit provider between the same two sites.

The combination of fiber we control end-to-end and Argo Smart Routing allows us to unlock the full potential of Cloudflare’s backbone network. Argo’s routing system knows exactly how much capacity the backbone has available, and can manage how much additional data it tries to push through it. By controlling both ends of the pipe, and the pipe itself, we can guarantee certain performance characteristics and build those expectations into our routing models. The same principles do not apply to transit providers and networks we don’t control.

Latency, be gone!

Our private backbone is another tool available to us to improve performance on the Internet. Combining Argo’s cutting-edge machine learning and direct fiber connectivity between points on our large network allows us to route customer traffic with predictable, excellent performance.

We’re excited to see the backbone and its impact continue to expand.

Speaking personally as a product manager, Argo is really fun to work on. We make customers happier by making their websites, APIs, and networks faster. Enabling Argo allows customers to do that with one click, and see immediate benefit. Under the covers, huge investments in physical and virtual infrastructure begin working to accelerate traffic as it flows from its source to destination.  

From an engineering perspective, our weekly goals and objectives are directly measurable — did we make our customers faster by doing additional engineering work? When we ship a new optimization to Argo and immediately see our charts move up and to the right, we know we’ve done our job.

Building our physical private backbone is the latest thing we’ve done in our need for speed.

Welcome to Speed Week!

Activate Argo now, or contact sales to learn more!

Welcome to Speed Week!

Post Syndicated from Patrick Meenan original https://blog.cloudflare.com/welcome-to-speed-week/

Every year, we celebrate Cloudflare’s birthday in September when we announce the products we’re releasing to help make the Internet better for everyone. We’re always building new and innovative products throughout the year, and having to pick five announcements for just one week of the year is always challenging. Last year we brought back Crypto Week where we shared new cryptography technologies we’re supporting and helping advance to help build a more secure Internet.

Today I’m thrilled to announce we are launching our first-ever Speed Week and we want to showcase some of the things that we’re obsessed with to make the Internet faster for everyone.

How much faster is faster?

When we built the software stack that runs our network, we knew that both security and speed are important to our customers, and they should never have to compromise one for the other. All of the products we’re announcing this week will help our customers have a better experience on the Internet with as much as a 50% improvement in page load times for websites, getting the  most out of HTTP/2’s features (while only lifting a finger to click the button that enables them), finding the optimal route across the Internet, and providing the best live streaming video experience. I am constantly amazed by the talented engineers that work on the products that we launch all year round. I wish we could have weeks like this all year round to celebrate the wins they’ve accumulated as they tackle the difficult performance challenges of the web. We’re never content to settle for the status quo, and as our network continues to grow, so does our ability to improve our flagship products like Argo, or how we support rich media sites that rely heavily on images and video. The sheer scale of our network provides rich data that we can use to make better decisions on how we support our customers’ web properties.

We also recognize that the Internet is evolving. New standards and protocols such as HTTP/2, QUIC, TLS 1.3 are great advances to improve web performance and security, but they can also be challenging for many developers to easily deploy. HTTP/2 was introduced in 2015 by the IETF, and was the first major revision of the HTTP protocol. While our customers have always been able to benefit from HTTP/2, we’re exploring how we can make that experience even faster.

All things Speed

Want a sneak peek at what we’re announcing this week? I’m really excited to see this week’s announcements unfold. Each day we’ll post a new blog where we’ll share product announcements and customer stories that demonstrate how we’re making life better for our customers.

  • Monday: An inside view of how we’re making faster, smarter routing decisions
  • Tuesday: HTTP/2 can be faster, we’ll show you how
  • Wednesday: Simplify image management and speed up load times on any device
  • Thursday: How we’re improving our network for faster video streaming
  • Friday: How we’re helping make JavaScript faster

For bonus points, sign up for a live stream webinar where Kornel Lesinski and I will be hosted by Dennis Publishing to discuss the many challenges of the modern web: “Stronger, Better, Faster: Solving the performance challenges of the modern web.” The event will be held on Monday, May 13th at 11:00 am BST and you can either register for the live event or sign up for one of the on-demand sessions later in the week.

I hope you’re just as excited about our upcoming Speed Week as I am. Be sure to subscribe to the blog to get daily updates sent to your inbox, because who knows… there may even be “one last thing”