Tag Archives: Performance

Creating low-latency, high-volume APIs with Provisioned Concurrency

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/creating-low-latency-high-volume-apis-with-provisioned-concurrency/

The AWS Lambda service runs customer code on-demand in response to events. It works by creating a new execution environment and downloading your code. This initial setup is commonly called a “cold start” and introduces latency to the total execution time of the function.

Cold starts happen when you first invoke a function, or when a function is invoked after being inactive for an extended period. They also happen when Lambda scales up a function, since each new instance of the function is a new execution environment.

The serverless community has previously created “function warmer” libraries to help improve the likelihood of a Lambda invocation using an existing execution environment. This is a good approach for development and test workloads, or where you do not need hyper-ready performance. The Provisioned Concurrency feature is designed for workloads that need predictable low latency.

This blog post shows how to eliminate cold starts in architectures supporting web applications. I reference code from the Ask Around Me example application. This allows users to ask and answer questions in their local geographic area in real time. To learn more, refer to part 1 of the blog application series.

Cold starts and web applications

The Ask Around Me application uses the following backend architecture:

Ask Around Me backend architecture

This represents a typical web application. Some Lambda functions are invoked by Amazon API Gateway while others are invoked by services further down the application stack. API Gateway invokes Lambda functions synchronously, meaning the caller is blocked until the function returns a value.

Functions invoked by services like Amazon SQS and Amazon DynamoDB are called asynchronously. This means that the caller continues with other work during execution and the function does not return a value. This application uses both types of invocation:

Sync and async parts of the backend

Generally, cold starts are less impactful in asynchronous executions. The latency overhead in starting the execution environment usually has less impact on overall performance of the application in this case. For web applications in particular, cold starts are most noticeable in synchronous applications closer to the frontend. This is where the speed of the API request has the most influence on the user experience of your application.

In Ask Around Me, there are four Lambda functions supporting the API endpoints for the application. Three of these are lightweight functions that put messages in SQS queues and retrieve data from a DynamoDB table. The most complex function is GetQuestions, which fetches questions based upon the latitude and longitude of the user. It’s also expected to receive the most usage, around 50,000 queries per hour, so it’s the most important for performance optimization.

Measuring the existing Lambda function performance

In a previous blog post on load testing this application, the GetQuestions API shows considerable variability in performance. In an API load test for 30 seconds with 20 requests per second, the median response is 175 ms while the slowest is 2149 ms:

Load testing performance output

In this application, the frontend application waits until this synchronous API call is completed. The median performance is likely acceptable to a user, whereas any response time over one second makes interaction with the application appear slow.

To gain more insight into the performance of this function, I turn on AWS X-Ray for this function. From the Lambda console, I select the GetQuestions function and check the Active tracing check box in the AWS X-Ray panel. After saving the function, X-Ray is now enabled.

Enable active tracing in Lambda
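
If you prefer to script this step instead of clicking through the console, the same setting can be applied with the AWS SDK. A minimal boto3 sketch, where the function name is a placeholder for your own function:

import boto3

lambda_client = boto3.client("lambda")

# Turn on X-Ray active tracing for the function (name is a placeholder).
lambda_client.update_function_configuration(
    FunctionName="GetQuestions",
    TracingConfig={"Mode": "Active"},
)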

I re-run the load test for this function and navigate to the X-Ray console. In the Analytics menu, the Response time distribution panel graphs the performance of the function invocations:

Response time distribution

I select all the invocations in the graph after the p95 marker, representing the slowest 5% of all requests. This filters to 34 slow requests, which corresponds to the number of Concurrent Executions for the function shown in the function’s Metrics console:

Concurrent executions

X-Ray lists the individual traces for the 34 slowest calls, and selecting the slowest single invocation breaks down the durations of each segment:

Slowest single invocation

This analysis shows that this function’s performance is impacted by cold starts. The initialization of the execution environment and the function code is contributing over 1 second of latency in this example. Each of the 34 slowest invocations corresponds with scaling up events for this function.

Configuring Provisioned Concurrency for a Lambda function

Provisioned concurrency is a Lambda feature that allows you to prepare execution environments before receiving traffic. In addition to downloading the function’s code, it also runs the initialization code outside of the main Lambda handler. This provides a reliable way to keep functions ready to respond within double-digit millisecond latency.

While all Provisioned Concurrency functions start more quickly than the existing on-demand Lambda execution style, this is particularly beneficial for certain function profiles. Runtimes like C# and Java have much slower initialization times than Node.js or Python, but faster execution times once initialized. With Provisioned Concurrency turned on, these runtimes benefit from both the consistent low latency of the function’s start-up and the performance during execution.

To enable Provisioned Concurrency for a Lambda function:

  1. Go to the AWS Lambda console and then choose your existing Lambda function.
  2. Provisioned concurrency settings must be applied to a published version or an alias. Go to the Actions drop-down and choose Publish new version.

    Publish new version

  3. Choose Publish. Scroll down to the Concurrency panel and choose Add Configuration.
    Configure Provisioned Concurrency
  4. Enter your preferred concurrency and choose Save.
  5. After a few minutes, Lambda has prepared the execution environments and the Status shows Ready in the console.

    Status is ready

It’s important to remember that the feature is applied explicitly to a function version or alias. Ensure that your invocation method is calling this alias, and not the $LATEST version. Provisioned Concurrency cannot be applied to the $LATEST version.
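
The console steps above can also be scripted. Here is a minimal boto3 sketch of the same flow; the function name, alias name, and concurrency value are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Publish a new version, since Provisioned Concurrency cannot target $LATEST.
version = lambda_client.publish_version(FunctionName="GetQuestions")["Version"]

# Point an alias at the published version; callers should invoke this alias.
lambda_client.create_alias(
    FunctionName="GetQuestions",
    Name="live",
    FunctionVersion=version,
)

# Reserve prepared execution environments for the alias.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="GetQuestions",
    Qualifier="live",
    ProvisionedConcurrentExecutions=20,
)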

When configuring Provisioned Concurrency, you select capacity to reserve. During usage, if you exceed this level, any additional function invocations then use the on-demand model. These invocations exhibit a more typical Lambda start-up performance profile, but you are not throttled or limited from running invocations at high levels of throughput.

Using Amazon CloudWatch Logs or the Monitoring tab for your function in the Lambda console, you can see metrics for the number of Provisioned Concurrency invocations, compared with the total. This can help identify when total load is above the amount of concurrency, and you can make changes accordingly.

You can also use Application Auto Scaling to help you automate provisioning the appropriate capacity. Instead of reserving a fixed amount of capacity, this increases the amount of concurrency during peak loads, and decreases as load reduces. You can configure this in both the AWS CLI and AWS Serverless Application Model.
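
As a rough sketch of the target-tracking approach using boto3, where the alias name, capacity bounds, and utilization target are illustrative assumptions rather than recommendations:

import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the alias's provisioned concurrency as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId="function:GetQuestions:live",
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=5,
    MaxCapacity=100,
)

# Track utilization of the provisioned capacity, scaling to stay near 70%.
autoscaling.put_scaling_policy(
    ServiceNamespace="lambda",
    ResourceId="function:GetQuestions:live",
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    PolicyName="provisioned-concurrency-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 0.7,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
        },
    },
)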

Comparing performance before and after Provisioned Concurrency

I run the same load test on the same function, now using Provisioned Concurrency – 20 requests per second over 2 minutes. The results show a median latency of 165 ms, a p95 time of 202 ms, and a slowest execution of 532 ms:

Load test result after Provisioned Concurrency

In X-Ray, the latest Response time distribution graph shows the significantly improved performance across the 2400 requests:

Load test result in X-Ray

By enabling Provisioned Concurrency for this Lambda function, the slowest performance has been improved by 75%. The function can serve 1200 requests per minute with a much more consistent performance for users.

Function warmers and Provisioned Concurrency

The broader serverless community offers open source libraries to “warm” Lambda functions via a pinging mechanism. This approach uses Amazon CloudWatch Events to invoke the function every minute to help keep the execution environment active. As a result, this can increase the likelihood of using a warm environment when you invoke the function.

However, this is not a guaranteed way to reduce cold starts. It does not help in production environments when functions scale up to meet traffic. It also does not work if the Lambda service runs your function in another Availability Zone as part of normal load balancing operations. Additionally, the Lambda service reaps execution environments regularly to keep these fresh, so it’s possible to invoke a function in between pings. In all of these cases, you experience cold starts despite using a warming library.

This approach might be adequate for development and test environments, or low-traffic or low-priority workloads. However, if you need predictable function start times for your workload, Provisioned Concurrency is the recommended solution to ensure predictable latency. It keeps your functions initialized and hyper-ready to respond in double-digit milliseconds at the scale you need.

Conclusion

This post examines how cold starts impact performance in serverless backends for web applications. It shows how the most important focus area is usually synchronous APIs called by the frontend application. I explain options available for targeting cold starts in the Lambda service.

Using the Ask Around Me application, I apply Provisioned Concurrency to the most latency-sensitive Lambda function. I compare the load testing performance before and after enabling this feature. This shows predictable start-up times and a 75% reduction in the slowest execution time. Finally, I show when you might use function warmers, and how Provisioned Concurrency is more suitable for latency-sensitive and production workloads.

To learn about the cost of using this feature, visit the Lambda pricing page.

Making the WAF 40% faster

Post Syndicated from Miguel de Moura original https://blog.cloudflare.com/making-the-waf-40-faster/

Making the WAF 40% faster

Cloudflare’s Web Application Firewall (WAF) protects against malicious attacks aiming to exploit vulnerabilities in web applications. It is continuously updated to provide comprehensive coverage against the most recent threats while ensuring a low false positive rate.

As with all Cloudflare security products, the WAF is designed to not sacrifice performance for security, but there is always room for improvement.

This blog post provides a brief overview of the latest performance improvements that were rolled out to our customers.

Transitioning from PCRE to RE2

Back in July of 2019, the WAF transitioned from using a regular expression engine based on PCRE to one inspired by RE2, which is based around using a deterministic finite automaton (DFA) instead of backtracking algorithms. This change came as a result of an outage where an update added a regular expression which backtracked enormously on certain HTTP requests, resulting in exponential execution time.

After the migration was finished, we saw no measurable difference in CPU consumption at the edge, but noticed execution time outliers in the 95th and 99th percentiles decreased, something we expected given RE2’s guarantees of a linear time execution with the size of the input.

As the WAF engine uses a thread pool, we also had to implement and tune a regex cache shared between the threads to avoid excessive memory consumption (the first implementation turned out to use a staggering amount of memory).

These changes, along with others outlined in the post-mortem blog post, helped us improve reliability and safety at the edge and have the confidence to explore further performance improvements.

But while we’ve highlighted regular expressions, they are only one of the many capabilities of the WAF engine.

Matching Stages

When an HTTP request reaches the WAF, it gets organized into several logical sections to be analyzed: method, path, headers, and body. These sections are all stored in Lua variables. If you are interested in more detail on the implementation of the WAF itself you can watch this old presentation.

Before matching these variables against specific malicious request signatures, some transformations are applied. These transformations are functions that range from simple modifications like lowercasing strings to complex tokenizers and parsers looking to fingerprint certain malicious payloads.

As the WAF currently uses a variant of the ModSecurity syntax, this is what a rule might look like:

SecRule REQUEST_BODY "@rx /\x00+evil" "drop, t:urlDecode, t:lowercase"

It takes the request body stored in the REQUEST_BODY variable, applies the urlDecode() and lowercase() functions to it and then compares the result with the regular expression signature \x00+evil.

In pseudo-code, we can represent it as:

rx( "/\x00+evil", lowercase( urlDecode( REQUEST_BODY ) ) )

Which in turn would match a request whose body contained percent encoded NULL bytes followed by the word “evil”, e.g.:

GET /cms/admin?action=post HTTP/1.1
Host: example.com
Content-Type: text/plain; charset=utf-8
Content-Length: 16

thiSis%2F%00eVil

The WAF contains thousands of these rules and its objective is to execute them as quickly as possible to minimize any added latency to a request. And to make things harder, it needs to run most of the rules on nearly every request. That’s because almost all HTTP requests are non-malicious and no rules are going to match. So we have to optimize for the worst case: execute everything!

To help mitigate this problem, one of the first matching steps executed for many rules is pre-filtering. By checking if a request contains certain bytes or sets of strings we are able to potentially skip a considerable number of expressions.

In the previous example, doing a quick check for the NULL byte (represented by \x00 in the regular expression) allows us to completely skip the rule if it isn’t found:

contains( "\x00", REQUEST_BODY )
and
rx( "/\x00+evil", lowercase( urlDecode( REQUEST_BODY ) ) )

Since most requests don’t match any rule and these checks are quick to execute, overall we aren’t doing more operations by adding them.
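
As an illustration only (not the WAF’s actual Lua implementation), the pre-filter short-circuit above could be sketched in Python like this:

import re
from urllib.parse import unquote

RULE_RE = re.compile(r"/\x00+evil")

def rule_matches(request_body: str) -> bool:
    # Cheap pre-filter, loosely mirroring the pseudo-code above: only run the
    # expensive transformations and regex if a NULL byte (raw or percent-encoded)
    # is present in the body at all.
    if "\x00" not in request_body and "%00" not in request_body.lower():
        return False
    transformed = unquote(request_body).lower()
    return RULE_RE.search(transformed) is not None

# Matches the example request body shown earlier.
print(rule_matches("thiSis%2F%00eVil"))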

Other steps can also be used to scan through and combine several regular expressions and avoid execution of rule expressions. As usual, doing less work is often the simplest way to make a system faster.

Memoization

Which brings us to memoization – caching the output of a function call to reuse it in future calls.

Let’s say we have the following expressions:

1. rx( "\x00+evil", lowercase( url_decode( body ) ) )
2. rx( "\x00+EVIL", remove_spaces( url_decode( body ) ) )
3. rx( "\x00+evil", lowercase( url_decode( headers ) ) )
4. streq( "\x00evil", lowercase( url_decode( body ) ) )

In this case, we can reuse the result of the nested function calls in (1) as they’re the same in (4). By saving intermediate results we are also able to take advantage of the result of url_decode( body ) from (1) and use it in (2) and (4). Sometimes it is also possible to swap the order in which functions are applied to improve caching, though in this case we would get different results.

A naive implementation of this system can simply be a hash table with each entry having the function(s) name(s) and arguments as the key and its output as the value.
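
A toy version of that idea in Python might look like the following; this is purely illustrative and not the engine’s actual code:

from urllib.parse import unquote

# Cache keyed on (transformation chain, variable name) -> transformed value.
memo = {}

TRANSFORMS = {
    "url_decode": unquote,
    "lowercase": str.lower,
    "remove_spaces": lambda s: s.replace(" ", ""),
}

def apply_chain(chain, var_name, var_value):
    value = var_value
    applied = ()
    # Find the longest already-cached prefix of the chain for this variable.
    for i in range(len(chain), 0, -1):
        if (chain[:i], var_name) in memo:
            applied, value = chain[:i], memo[(chain[:i], var_name)]
            break
    # Apply the remaining transformations, caching every intermediate result.
    for name in chain[len(applied):]:
        value = TRANSFORMS[name](value)
        applied = applied + (name,)
        memo[(applied, var_name)] = value
    return value

# The first call computes and caches url_decode(body); the second reuses it.
apply_chain(("url_decode", "lowercase"), "body", "thiSis%2F%00eVil")
apply_chain(("url_decode", "remove_spaces"), "body", "thiSis%2F%00eVil")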

Some of these functions are expensive and caching the result does lead to significant savings. To give a sense of magnitude, one of the rules we modified to ensure memoization took place saw its execution time reduced by about 95%:

Making the WAF 40% faster
Execution time per rule

The WAF engine implements memoization and the rules take advantage of it, but there’s always room to increase cache hits.

Rewriting Rules and Results

Cloudflare has a very regular cadence of releasing updates and new rules to the Managed Rulesets. However, as more rules are added and new functions implemented, the memoization cache hit rate tends to decrease.

To improve this, we first looked into the rules taking the most wall-clock time to execute using some of our performance metrics:

Making the WAF 40% faster
Execution time per rule

Having these, we cross-referenced them with the ones having cache misses (output is truncated with […]):

$ ./parse.py --profile
Hit Ratio:
-------------
0.5608

Hot entries:
-------------
[urlDecode, replaceComments, REQUEST_URI, REQUEST_HEADERS, ARGS_POST]
[urlDecode, REQUEST_URI]
[urlDecode, htmlEntityDecode, jsDecode, replaceNulls, removeWhitespace, REQUEST_URI, REQUEST_HEADERS]
[urlDecode, lowercase, REQUEST_FILENAME]
[urlDecode, REQUEST_FILENAME]
[urlDecode, lowercase, replaceComments, compressWhitespace, ARGS, REQUEST_FILENAME]
[urlDecode, replaceNulls, removeWhitespace, REQUEST_URI, REQUEST_HEADERS, ARGS_POST]
[...]

Candidates:
-------------
100152A - replace t:removeWhitespace with t:compressWhitespace,t:removeWhitespace
100214 - replace t:lowercase with (?i)
100215 - replace t:lowercase with (?i)
100300 - consider REQUEST_URI over REQUEST_FILENAME
100137D - invert order of t:replaceNulls,t:lowercase
[...]

After identifying more than 40 rules, we rewrote them to take full advantage of memoization and added pre-filter checks where possible. Many of these changes were not immediately obvious, which is why we’re also creating tools to aid analysts in creating even more efficient rules. This also helps ensure they run in accordance with the latency budgets the team has set.

This change resulted in an increase of the cache hit percentage from 56% to 74%, which crucially included the most expensive transformations.

Most importantly, we also observed a sharp decrease of 40% in the average time the WAF takes to process and analyze an HTTP request at the Cloudflare edge.

Making the WAF 40% faster
WAF Request Processing – Time Average

A comparable decrease was also observed for the 95th and 99th percentiles.

Finally, we saw a drop of CPU consumption at the edge of around 4.3%.

Next Steps

While the Lua WAF has served us well throughout all these years, we are currently porting it to use the same engine powering Firewall Rules. It is based on our open-sourced wirefilter execution engine, which uses a filter syntax inspired by Wireshark®. In addition to allowing more flexible filter expressions, it provides better performance and safety.

The rule optimizations we’ve described in this blog post are not lost when moving to the new engine, however, as the changes were deliberately not specific to the current Lua engine’s implementation. And while we’re routinely profiling, benchmarking and making complex optimizations to the Firewall stack, sometimes just relatively simple changes can have a surprisingly huge effect.

Introducing Cache Analytics

Post Syndicated from Jon Levine original https://blog.cloudflare.com/introducing-cache-analytics/

Introducing Cache Analytics

Today, I’m delighted to announce Cache Analytics: a new tool that gives deeper exploration capabilities into what Cloudflare’s caching and content delivery services are doing for your web presence.

Caching is the most effective way to improve the performance and economics of serving your website to the world. Unsurprisingly, customers consistently ask us how they can optimize their cache performance to get the most out of Cloudflare.

With Cache Analytics, it’s easier than ever to learn how to speed up your website, and reduce traffic sent to your origin. Some of my favorite capabilities include:

  • See what resources are missing from cache, expired, or never eligible for cache in the first place
  • Slice and dice your data as you see fit: filter by hostnames, or see a list of top URLs that miss cache
  • Switch between views of requests and data transfer to understand both performance and cost
Introducing Cache Analytics
An overview of Cache Analytics

Cache Analytics is available today for all customers on our Pro, Business, and Enterprise plans.

In this blog post, I’ll explain why we built Cache Analytics and how you can get the most out of it.

Why do we need analytics focused on caching?

If you want to scale the delivery of a fast, high-performance website, then caching is critical. Caching has two main goals:

First, caching improves performance. Cloudflare data centers are within 100ms of 90% of the planet; putting your content in Cloudflare’s cache gets it physically closer to your customers and visitors, meaning that visitors will see your website faster when they request it! (Plus, reading assets on our edge SSDs is really fast, rather than waiting for origins to generate a response.)

Second, caching helps reduce bandwidth costs associated with operating a presence on the Internet. Origin data transfer is one of the biggest expenses of running a web service, so serving content out of Cloudflare’s cache can significantly reduce costs incurred by origin infrastructure.

Because it’s not safe to cache all content (we wouldn’t want to cache your bank balance by default), Cloudflare relies on customers to tell us what’s safe to cache with HTTP Cache-Control headers and page rules. But even with page rules, it can be hard to understand what’s actually getting cached — or more importantly, what’s not getting cached, and why. Is a resource expired? Or was it even eligible for cache in the first place?
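
For example, an origin that wants an asset cached at the edge for a day might respond with a Cache-Control header like the one below. This is a minimal illustration using Python’s standard library; the path, content, and TTL are made up:

from http.server import BaseHTTPRequestHandler, HTTPServer

class Origin(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"body { color: #222; }"
        self.send_response(200)
        # Tell intermediary caches (including Cloudflare) that this response is
        # safe to cache and may be reused for up to 86400 seconds (one day).
        self.send_header("Cache-Control", "public, max-age=86400")
        self.send_header("Content-Type", "text/css")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), Origin).serve_forever()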

Faster or cheaper? Why not both!

Cache Analytics was designed to help users understand how Cloudflare’s cache is performing, but it can also be used as a general-purpose analytics tool. Here I’ll give a quick walkthrough of the interface.

First, at the top-left, you should decide if you want to focus on requests or data transfer.

Introducing Cache Analytics
Cache Analytics enables you to toggle between views of requests and data transfer.

As a rule of thumb, requests (the default view) is more useful for understanding performance, because every request that misses cache results in a performance hit. Data transfer is useful for understanding cost, because most hosts charge for every byte that leaves their network — every gigabyte served by Cloudflare translates into money saved at the origin.

You can always toggle between these two views while keeping filters enabled.

A filter for every occasion

Let’s say you’re focused on improving the performance of a specific subdomain on your zone. Cache Analytics allows flexible filtering of the data that’s important to you:

Introducing Cache Analytics
Cache Analytics enables flexible filtering of data.

Filtering is essential for zooming in on the chunk of traffic that you’re most interested in. You can filter by cache status, hostname, path, content type, and more. This is helpful, for example, if you’re trying to reduce data transfer for a specific subdomain, or are trying to tune the performance of your HTML pages.

Seeing the big picture

When analyzing traffic patterns, it’s essential to understand how things change over time. Perhaps you just applied a configuration change and want to see the impact, or just launched a big sale on your e-commerce site.

Introducing Cache Analytics

“Served by Cloudflare” indicates traffic that we were able to serve from our edge without reaching your origin server. “Served by Origin” indicates traffic that was proxied back to origin servers. (It can be really satisfying to add a page rule and see the amount of traffic “Served by Cloudflare” go up!)

Note that this graph will change significantly when you switch between “Requests” and “Data Transfer.” Revalidated requests are particularly interesting; because Cloudflare checks with the origin before returning a result from cache, these count as “Served by Cloudflare” for the purposes of data transfer, but “Served by Origin” for the purposes of “requests.”

Slicing the pie

After the high-level summary, we show an overview of cache status, which explains why traffic might be served from Cloudflare or from origin. We also show a breakdown of cache status by Content-Type to give an overview on how different components of your website perform:

Introducing Cache Analytics

Cache statuses are also essential for understanding what you need to do to optimize cache ratios. For example:

  • Dynamic indicates that a request was never eligible for cache, and went straight to origin. This is the default for many file types, including HTML. Learn more about making more content eligible for cache using page rules. Fixing this is one of the fastest ways to reduce origin data transfer cost.
  • Revalidated indicates content that was expired, but after Cloudflare checked the origin, it was still fresh! If you see a lot of revalidated content, it’s a good sign you should increase your Edge Cache TTLs through a page rule or max-age origin directive. Updating TTLs is one of the easiest ways to make your site faster.
  • Expired resources are ones that were in our cache, but were expired. Consider if you can extend TTLs on these, or at least support revalidation at your origin.
  • A miss indicates that Cloudflare has not seen that resource recently. These can be tricky to optimize, but there are a few potential remedies: Enable Argo Tiered Caching to check another datacenter’s cache before going to origin, or use a Custom Cache Key to make multiple URLs match the same cached resource (for example, by ignoring query string)

For a full explanation of each cache status, see our help center.

To the Nth dimension

Finally, Cache Analytics shows a number of what we call “Top Ns” — various ways to slice and dice the above data on useful dimensions.

Introducing Cache Analytics

It’s often helpful to apply filters (for example, to a specific cache status) before looking at these lists. For example, when trying to tune performance, I often filter to just “expired” or “revalidated,” then see if there are a few URLs that dominate these stats.

But wait, there’s more

Cache Analytics is available now for customers on our Pro, Business, and Enterprise plans. Pro customers have access to up to 3 days of analytics history. Business and Enterprise customers have access to up to 21 days, with more coming soon.

This is just the first step for Cache Analytics. We’re planning to add more dimensions to drill into the data. And we’re planning to add even more essential statistics — for example, about how cache keys are being used.

Finally, I’m really excited about Cache Analytics because it shows what we have in store for Cloudflare Analytics more broadly. We know that you’ve asked for many features— like per-hostname analytics, or the ability to see top URLs — for a long time, and we’re hard at work on bringing these to Zone Analytics. Stay tuned!

Test your home network performance

Post Syndicated from Achiel van der Mandele original https://blog.cloudflare.com/test-your-home-network-performance/

Test your home network performance

With many people being forced to work from home, there’s increased load on consumer ISPs. You may be asking yourself: how well is my ISP performing with even more traffic? Today we’re announcing the general availability of speed.cloudflare.com, a way to gain meaningful insights into exactly how well your network is performing.

We’ve seen a massive shift from users accessing the Internet from busy office districts to spread out urban areas.

Although there are a slew of speed testing tools out there, none of them give you precise insights into how they came to those measurements and how they map to real-world performance. With speed.cloudflare.com, we give you insights into what we’re measuring and how exactly we calculate the scores for your network connection. Best of all, you can easily download the measurements from right inside the tool if you’d like to perform your own analysis.

We also know you care about privacy. We believe that you should know what happens with the results generated by this tool. Many other tools sell the data to third parties. Cloudflare does not sell your data. Performance data is collected and anonymized and is governed by the terms of our Privacy Policy. The data is used anonymously to determine how we can improve our network, both in terms of capacity as well as to help us determine which Internet Service Providers to peer with.

Test your home network performance

The test has three main components: download, upload and a latency test. Each measures a different aspect of your network connection.

Down

For starters we run you through a basic download test. We start off downloading small files and progressively move up to larger and larger files until the test has saturated your Internet downlink. Small files (we start off with 10KB, then 100KB and so on) are a good representation of how websites will load, as these typically encompass many small files such as images, CSS stylesheets and JSON blobs.

For each file size, we show you the measurements inside a table, allowing you to drill down. Each dot in the bar graph represents one of the measurements, with the thin line delineating the range of speeds we’ve measured. The slightly thicker block represents the set of measurements between the 25th and 75th percentile.

Test your home network performance

Getting up to the larger file sizes we can see true maximum throughput: how much bandwidth do you really have? You may be wondering why we have to use progressively larger files. The reason is that download speeds start off slow (this is aptly called slow start) and then progressively get faster. If we were to use only small files we would never get to the maximum throughput that your network provider supports, which should be close to the Internet speed your ISP quoted you when you signed up for service.

The maximum throughput on larger files will be indicative of how fast you can download large files such as games (GTA V is almost 100 GB to download!) or the maximum quality that you can stream video on (lower download speed means you have to use a lower resolution to get continuous playback). We only increase download file sizes up to the absolute minimum required to get accurate measurements: no wasted bandwidth.

Up

Upload is the opposite of download: we send data from your browser to the Internet. This metric is more important nowadays with many people working from home: it directly affects live video conferencing. A faster upload speed means your microphone and video feed can be of higher quality, meaning people can see and hear you more clearly on videoconferences.

Measurements for upload operate in the same manner: we progressively try to upload larger and larger files up until the point we notice your connection is saturated.

Speed measurements are never 100% consistent, which is why we repeat them. An easy way for us to report your speed would be to simply report the fastest speed we see. The problem is that this will not be representative of your real-world experience: latency and packet loss constantly fluctuates, meaning you can’t expect to see your maximum measured performance all the time.

To compensate for this, we take the 90th percentile of measurements, or p90 and report that instead of the absolute maximum speed that we measured. Taking the 90th percentile is a more accurate representation in that it discounts peak outliers, which is a much closer approximation of what you can expect in terms of speeds in the real world.
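
Computing a p90 from repeated measurements is straightforward. This is an illustrative sketch only; the actual speed test runs in your browser, not in Python:

def p90(measurements_mbps):
    # Sort the samples and take the value below which 90% of measurements fall,
    # discounting the fastest outliers rather than reporting the absolute maximum.
    ordered = sorted(measurements_mbps)
    index = int(0.9 * (len(ordered) - 1))
    return ordered[index]

samples = [91.2, 94.8, 88.5, 97.1, 90.3, 95.6, 89.9, 93.4, 102.7, 92.0]
print(f"Reported speed: {p90(samples):.1f} Mbps")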

Latency and Jitter

Download and upload are important metrics but don’t paint the entire picture of the quality of your Internet connection. Many of us find ourselves interacting with work and friends over videoconferencing software more than ever. Although speeds matter, video is also very sensitive to the latency of your Internet connection. Latency represents the time an IP packet needs to travel from your device to the service you’re using on the Internet and back. High latency means that when you’re talking on a video conference, it will take longer for the other party to hear your voice.

But, latency only paints half the picture. Imagine yourself in a conversation where you have some delay before you hear what the other person says. That may be annoying but after a while you get used to it. What would be even worse is if the delay differed constantly: sometimes the audio is almost in sync and sometimes it has a delay of a few seconds. You can imagine how often this would result in two people starting to talk at the same time. This is directly related to how stable your latency is and is represented by the jitter metric. Jitter is the average variation found in consecutive latency measurements. A lower number means that the latencies measured are more consistent, meaning your media streams will have the same delay throughout the session.
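
As a rough sketch of how jitter can be derived from a series of latency samples (again purely illustrative, with made-up numbers):

def jitter_ms(latencies_ms):
    # Jitter here is the average absolute difference between consecutive
    # latency measurements: lower means a more stable connection.
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return sum(diffs) / len(diffs)

samples = [24.1, 25.3, 23.8, 30.2, 24.5, 24.9]
print(f"Average latency: {sum(samples) / len(samples):.1f} ms")
print(f"Jitter: {jitter_ms(samples):.1f} ms")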

Test your home network performance

We’ve designed speed.cloudflare.com to be as transparent as possible: you can click into any of the measurements to see the average, median, minimum, maximum measurements, and more. If you’re interested in playing around with the numbers, there’s a download button that will give you the raw results we measured.

Test your home network performance

The entire speed.cloudflare.com backend runs on Workers, meaning all logic runs entirely on the Cloudflare edge and your browser, no server necessary! If you’re interested in seeing how the benchmarks take place, we’ve open-sourced the code; feel free to take a peek at our GitHub repository.

We hope you’ll enjoy adding this tool to your set of network debugging tools. We love being transparent and our tools reflect this: your network performance is more than just one number. Give it a whirl and let us know what you think.

CUBIC and HyStart++ Support in quiche

Post Syndicated from Junho Choi original https://blog.cloudflare.com/cubic-and-hystart-support-in-quiche/

CUBIC and HyStart++ Support in quiche

quiche, Cloudflare’s IETF QUIC implementation, has been running CUBIC congestion control for a while in our production environment (as mentioned in Comparing HTTP/3 vs. HTTP/2 Performance). Recently we also added HyStart++ to the congestion control module for further improvements.

In this post, we will talk about QUIC congestion control and loss recovery briefly and CUBIC and HyStart++ in the quiche congestion control module. We will also discuss lab test results and how to visualize those using qlog which was recently added to the quiche library as well.

QUIC Congestion Control and Loss Recovery

In the network transport area, congestion control decides how much data a connection can send into the network. It has an important role: it must not overrun the link, and at the same time it needs to play nicely with other connections in the same network to ensure that the overall network, the Internet, doesn’t collapse. Basically, congestion control tries to detect the current capacity of the link and tune itself in real time, and it’s one of the core algorithms for running the Internet.

QUIC congestion control has been written based on many years of TCP experience, so it is little surprise that the two have mechanisms that bear resemblance. It’s based on the CWND (congestion window, the limit of how many bytes you can send to the network) and the SSTHRESH (slow start threshold, how to start a new connection slowly so as not to overwhelm the network). Congestion control mechanisms can have complicated edge cases and can be hard to tune. Since QUIC is a new transport protocol that people are implementing from scratch, the current draft recommends Reno as a relatively simple mechanism to get people started. However, it has known limitations and so QUIC is designed to have pluggable congestion control; it’s up to implementers to adopt any more advanced ones of their choosing.

Since Reno became the standard for TCP congestion control, many congestion control algorithms have been proposed by academia and industry. Largely there are two categories: loss-based congestion control such as Reno and CUBIC, where the congestion control responds to a packet loss event, and delay-based congestion control, such as Vegas and BBR, where the algorithm tries to find a balance between bandwidth and RTT increase and tunes the packet send rate.

You can port TCP based congestion control algorithms to QUIC without much change by implementing a few hooks. quiche provides a modular API to add a new congestion control module easily.

Loss detection is how the sender detects packet loss. It’s usually separate from the congestion control algorithm but helps the congestion control respond quickly to congestion. Packet loss can be a result of congestion on the link, but the link layer may also drop a packet without congestion due to the characteristics of the physical layer, such as on a WiFi or mobile network.

Traditionally TCP uses 3 DUP ACKs for ACK-based detection, but delay-based loss detection such as RACK has also been used over the years. QUIC combines the lessons from TCP into two categories. One is based on a packet threshold (similar to 3 DUP ACK detection) and the other is based on a time threshold (similar to RACK). QUIC also has ACK Ranges, similar to TCP SACK, to report holes in the sequence of received packets, but ACK Ranges can keep a longer list of received packets in the ACK frame than TCP SACK. This simplifies the implementation overall and helps provide quick recovery when there are multiple losses.
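
A simplified sketch of the two detection categories, loosely following the recovery draft’s recommended defaults (a packet threshold of 3 and a time threshold of 9/8 of the RTT); this is not quiche’s exact code:

PACKET_THRESHOLD = 3      # similar in spirit to TCP's 3 DUP ACK heuristic
TIME_THRESHOLD = 9 / 8    # fraction of the RTT, similar in spirit to RACK

def detect_lost_packets(sent_packets, largest_acked, latest_rtt, smoothed_rtt, now):
    # sent_packets: dict of packet_number -> time_sent for unacknowledged packets.
    loss_delay = TIME_THRESHOLD * max(latest_rtt, smoothed_rtt)
    lost = []
    for pn, time_sent in sent_packets.items():
        if pn >= largest_acked:
            continue  # only packets sent before the largest acknowledged one
        packet_threshold_hit = largest_acked - pn >= PACKET_THRESHOLD
        time_threshold_hit = now - time_sent >= loss_delay
        if packet_threshold_hit or time_threshold_hit:
            lost.append(pn)
    return lost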

Reno

Reno (often referred to as NewReno) is a standard congestion control for TCP and QUIC.

Reno is easy to understand and doesn’t need additional memory to store state, so it can be implemented on low-spec hardware too. However, its slow start can be very aggressive because it keeps increasing the CWND quickly until it sees congestion. In other words, it doesn’t stop until it sees packet loss.

Note that there are multiple states for Reno; Reno starts from “slow start” mode which increases the CWND very aggressively, roughly 2x for every RTT until the congestion is detected or CWND > SSTHRESH. When packet loss is detected, it enters into the “recovery” mode until packet loss is recovered.

When it exits from recovery (no lost ranges) and CWND > SSTHRESH, it enters into the “congestion avoidance” mode where the CWND grows slowly (roughly a full packet per RTT) and tries to converge on a stable CWND. As a result you will see a “sawtooth” pattern when you make a graph of the CWND over time.
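
In pseudo-Python, the Reno CWND update on each ACK looks roughly like this. It is a simplification of the recovery draft, not quiche’s actual code, and the datagram size is an assumption:

MAX_DATAGRAM_SIZE = 1350  # bytes; a typical QUIC packet size, assumed here

def on_packet_acked(acked_bytes, cwnd, ssthresh):
    if cwnd < ssthresh:
        # Slow start: CWND grows by the number of bytes acknowledged,
        # roughly doubling every RTT.
        cwnd += acked_bytes
    else:
        # Congestion avoidance: roughly one full packet of growth per RTT.
        cwnd += MAX_DATAGRAM_SIZE * acked_bytes // cwnd
    return cwnd

def on_congestion_event(cwnd):
    # On packet loss, Reno halves the window and updates the threshold,
    # producing the "sawtooth" pattern described above.
    cwnd = max(cwnd // 2, 2 * MAX_DATAGRAM_SIZE)
    ssthresh = cwnd
    return cwnd, ssthresh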

Here is an example of Reno congestion control CWND graph. See the “Congestion Window” line.

CUBIC and HyStart++ Support in quiche

CUBIC

CUBIC was announced in 2008 and became the default congestion control in the Linux kernel. Currently it’s defined in RFC8312 and implemented in many operating systems including Linux, BSD and Windows. quiche’s CUBIC implementation follows RFC8312 with a fix made by Google in the Linux kernel.

What makes the difference from Reno is that during congestion avoidance its CWND growth is based on a cubic function, as follows:

CUBIC and HyStart++ Support in quiche
(from the CUBIC paper: https://www.cs.princeton.edu/courses/archive/fall16/cos561/papers/Cubic08.pdf)

Wmax is the value of CWND when congestion is detected. The CWND is then reduced by 30% and starts to grow again using the cubic function shown in the graph, approaching Wmax aggressively in the first half but converging to Wmax slowly later. This makes sure that CWND growth approaches the previous point carefully and, once it passes Wmax, starts to grow aggressively again after some time to find a new maximum CWND (this is called “Max Probing”).

It also has a “TCP-friendly” (actually a Reno-friendly) mode to make sure CWND growth is always at least as fast as it would be under Reno. When a congestion event happens, CUBIC reduces its CWND by 30%, whereas Reno cuts the CWND by 50%. This makes CUBIC a little more aggressive on packet loss.

Note that the original CUBIC only defines how to update the CWND during congestion avoidance. Slow start mode is exactly the same as Reno.
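
The window growth formula from RFC 8312 can be written down directly as a small function. This is a transcription of the formula with the RFC’s constants, for illustration only, not quiche’s implementation:

C = 0.4            # scaling constant from RFC 8312
BETA_CUBIC = 0.7   # CWND is reduced to 70% on a congestion event

def cubic_k(w_max):
    # Time (in seconds) for the window to grow back to w_max after a loss.
    return ((w_max * (1 - BETA_CUBIC)) / C) ** (1 / 3)

def w_cubic(t, w_max):
    # Target congestion window (in packets) t seconds after the congestion event.
    return C * (t - cubic_k(w_max)) ** 3 + w_max

# Example: with w_max = 100 packets, growth is fast at first, flattens near
# w_max, then accelerates again ("Max Probing").
for t in (0, 1, 2, 3, 4, 5):
    print(t, round(w_cubic(t, 100), 1))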

HyStart++

The authors of CUBIC made a separate effort to improve slow start because CUBIC only changed the way the CWND grows during congestion avoidance. They came up with the idea of HyStart.

HyStart is based on two ideas and basically changes how the CWND is updated during slow start:

  • RTT delay samples: when the RTT increases during slow start and goes over a threshold, it exits slow start early and enters congestion avoidance.
  • ACK train: when the ACK inter-arrival time gets higher and goes over a threshold, it exits slow start early and enters congestion avoidance.

However in the real world, ACK train is not very useful because of ACK compression (merging multiple ACKs into one). Also RTT delay may not work well when the network is unstable.

To improve on such situations, there is a new IETF draft proposed by Microsoft engineers named HyStart++. HyStart++ is included in the Windows 10 TCP stack with CUBIC.

It’s a little different from the original HyStart:

  • No ACK Train, only RTT sampling.
  • Add an LSS (Limited Slow Start) phase after exiting slow start. LSS grows the CWND faster than congestion avoidance but slower than Reno slow start. Instead of going into congestion avoidance directly, slow start exits to LSS, and LSS exits to congestion avoidance when packet loss happens.
  • Simpler implementation.

In quiche, HyStart++ is turned on by default for both Reno and CUBIC congestion control and can be configured via API.
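
As a rough illustration of the RTT-sampling idea described above; the thresholds and sample counts here are arbitrary placeholders, not the values from the draft or from quiche:

MIN_RTT_SAMPLES = 8            # placeholder
RTT_INCREASE_THRESHOLD_MS = 4  # placeholder; the draft derives this from the RTT

class SlowStartState:
    def __init__(self):
        self.last_round_min_rtt = None
        self.current_round_min_rtt = float("inf")
        self.samples_in_round = 0

    def on_rtt_sample(self, rtt_ms):
        # Track the minimum RTT observed in the current round of slow start.
        self.current_round_min_rtt = min(self.current_round_min_rtt, rtt_ms)
        self.samples_in_round += 1

    def should_exit_slow_start(self):
        # Exit early (into Limited Slow Start) if this round's RTT floor has
        # risen noticeably above the previous round's: queues are building up.
        if self.last_round_min_rtt is None or self.samples_in_round < MIN_RTT_SAMPLES:
            return False
        return (self.current_round_min_rtt
                >= self.last_round_min_rtt + RTT_INCREASE_THRESHOLD_MS)

    def end_round(self):
        self.last_round_min_rtt = self.current_round_min_rtt
        self.current_round_min_rtt = float("inf")
        self.samples_in_round = 0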

Lab Test

Here is a test result from the test lab. The test conditions are as follows:

  • 5Mbps bandwidth, 60ms RTT with a different packet loss from 0% to 8%
  • Measure download time of 8MB file
  • NGINX 1.16.1 server with the HTTP3 patch
  • TCP: CUBIC in Linux kernel 4.14
  • QUIC: Cloudflare quiche
  • Download 20 times and take a median download time

I run the test with the following combinations:

  • TCP CUBIC (TCP-CUBIC)
  • QUIC Reno (QUIC-RENO)
  • QUIC Reno with Hystart++ (QUIC-RENO-HS)
  • QUIC CUBIC (QUIC-CUBIC)
  • QUIC CUBIC with Hystart++ (QUIC-CUBIC-HS)

Overall Test Result

Here is a chart of overall test results:

CUBIC and HyStart++ Support in quiche

In these tests, TCP-CUBIC (blue bars) is the baseline to which we compare the performance of QUIC congestion control variants. We include QUIC-RENO (red and yellow bars) because that is the default QUIC baseline. Reno is simpler so we expect it to perform worse than TCP-CUBIC. QUIC-CUBIC (green and orange bars) should perform the same or better than TCP-CUBIC.

You can see that with 0% packet loss, TCP and QUIC perform almost the same (but QUIC is slightly slower). As packet loss increases, QUIC CUBIC performs better than TCP CUBIC. QUIC loss recovery looks to work well, which is great news for real-world networks that do encounter loss.

With HyStart++, overall performance doesn’t change but that is to be expected, because the main goal of HyStart++ is to prevent overshooting the network. We will see that in the next section.

The impact of HyStart++

HyStart++ may not improve the download time, but it will reduce packet loss while maintaining the same performance as without it. Since slow start will exit to congestion avoidance when packet loss is detected, we focus on 0% packet loss, where only network congestion creates packet loss.

Packet Loss

For each test, the number of detected packets lost (not the retransmit count) is shown in the following chart. The lost packets number is the average of 20 runs for each test.

CUBIC and HyStart++ Support in quiche

As shown above, you can see that HyStart++ reduces a lot of packet loss.

Note that compared with Reno, CUBIC can create more packet loss in general. This is because the CUBIC CWND can grow faster than Reno during congestion avoidance and also reduces the CWND less (30%) than Reno (50%) at the congestion event.

Visualization using qlog and qvis

qvis is a visualization tool based on qlog. Since quiche has implemented qlog support, we can take qlogs from a QUIC connection and use the qvis tool to visualize connection stats. This is a very useful tool for protocol development. We already used qvis for the Reno graph, but let’s see a few more examples to understand how HyStart++ works.

CUBIC without HyStart++

Here is a qvis congestion chart for a 16MB transfer in the same lab test conditions, with 0% packet loss. You can see a high peak of CWND in the beginning due to slow start. After some time, it starts to show the CUBIC window growth pattern (concave function).

CUBIC and HyStart++ Support in quiche

When we zoom into the slow start section (the first 1.5 seconds), we can see a linear increase of the CWND during slow start. This continues until we see a packet lost around 770ms and the connection enters congestion avoidance, as you can see in the following chart:

CUBIC and HyStart++ Support in quiche

CUBIC with HyStart++

Let’s see the same graph when HyStart++ is enabled. You can see the slow start peak is much smaller than without HyStart++, which will lead to less overshooting and packet loss:

CUBIC and HyStart++ Support in quiche

When we zoom into the slow start part again, we can now see that slow start exits to Limited Slow Start (LSS) around 430ms and exits to congestion avoidance at the congestion event around 620ms.

As a result you can see the slope is less steep until congestion is detected. It will lead to less packet loss due to less overshooting the network and faster convergence to a stable CWND.

CUBIC and HyStart++ Support in quiche

Conclusions and Future Tasks

The QUIC draft spec already has integrated a lot of experience from TCP congestion control and loss recovery. It recommends the simple Reno mechanism as a means to get people started implementing the protocol but is under no illusion that there are better performing ones out there. So QUIC is designed to be pluggable in order for it to adopt mechanisms that are being deployed in state-of-the-art TCP implementations.

CUBIC and HyStart++ are known implementations in the TCP world and give better performance (faster download and less packet loss) than Reno. We’ve made quiche pluggable and have added CUBIC and HyStart++ support. Our lab testing shows that QUIC is a clear performance winner in lossy network conditions, which is the very thing it is designed for.

In the future, we also plan to work on advanced features in quiche, such as packet pacing, advanced recovery and BBR congestion control for better QUIC performance. Using quiche you can switch among multiple congestion control algorithms using the config API at the connection level, so you can play with it and choose the best one depending on your need. qlog endpoint logging can be visualized to provide high accuracy insight into how QUIC is behaving, greatly helping understanding and development.

CUBIC and HyStart++ code is available in the quiche master branch today. Please try it!

Comparing HTTP/3 vs. HTTP/2 Performance

Post Syndicated from Sreeni Tellakula original https://blog.cloudflare.com/http-3-vs-http-2/

Comparing HTTP/3 vs. HTTP/2 Performance

Comparing HTTP/3 vs. HTTP/2 Performance

We announced support for HTTP/3, the successor to HTTP/2 during Cloudflare’s birthday week last year. Our goal is and has always been to help build a better Internet. Collaborating on standards is a big part of that, and we’re very fortunate to do that here.

Even though HTTP/3 is still in draft status, we’ve seen a lot of interest from our users. So far, over 113,000 zones have activated HTTP/3 and, if you are using an experimental browser, those zones can be accessed using the new protocol! It’s been great seeing so many people enable HTTP/3: having real websites accessible through HTTP/3 means browsers have more diverse properties to test against.

When we launched support for HTTP/3, we did so in partnership with Google, who simultaneously launched experimental support in Google Chrome. Since then, we’ve seen more browsers add experimental support: Firefox to their nightly builds, other Chromium-based browsers such as Opera and Microsoft Edge through the underlying Chrome browser engine, and Safari via their technology preview. We closely follow these developments and partner wherever we can help; having a large network with many sites that have HTTP/3 enabled gives browser implementers an excellent testbed against which to try out code.

So, what’s the status and where are we now?

The IETF standardization process develops protocols as a series of document draft versions with the ultimate aim of producing a final draft version that is ready to be marked as an RFC. The members of the QUIC Working Group collaborate on analyzing, implementing and interoperating the specification in order to find things that don’t work quite right. We launched with support for Draft-23 for HTTP/3 and have since kept up with each new draft, with 27 being the latest at time of writing. With each draft the group improves the quality of the QUIC definition and gets closer to “rough consensus” about how it behaves. In order to avoid a perpetual state of analysis paralysis and endless tweaking, the bar for proposing changes to the specification has been increasing with each new draft. This means that changes between versions are smaller, and that a final RFC should closely match the protocol that we’ve been running in production.

Benefits

One of the main touted advantages of HTTP/3 is increased performance, specifically around fetching multiple objects simultaneously. With HTTP/2, any interruption (packet loss) in the TCP connection blocks all streams (Head of line blocking). Because HTTP/3 is UDP-based, if a packet gets dropped that only interrupts that one stream, not all of them.

In addition, HTTP/3 offers 0-RTT support, which means that subsequent connections can start up much faster by eliminating the TLS acknowledgement from the server when setting up the connection. This means the client can start requesting data much faster than with a full TLS negotiation, meaning the website starts loading earlier.

The following illustrates the packet loss and its impact: HTTP/2 multiplexing two requests. A request comes over HTTP/2 from the client to the server requesting two resources (we’ve colored the requests and their associated responses green and yellow). The responses are broken up into multiple packets and, alas, a packet is lost so both requests are held up.

Comparing HTTP/3 vs. HTTP/2 Performance
Comparing HTTP/3 vs. HTTP/2 Performance

The above shows HTTP/3 multiplexing 2 requests. A packet is lost that affects the yellow response but the green one proceeds just fine.

Improvements in session startup mean that ‘connections’ to servers start much faster, which means the browser starts to see data more quickly. We were curious to see how much of an improvement, so we ran some tests. To measure the improvement resulting from 0-RTT support, we ran some benchmarks measuring time to first byte (TTFB). On average, with HTTP/3 we see the first byte appearing after 176ms. With HTTP/2 we see 201ms, meaning HTTP/3 is already performing 12.4% better!

Comparing HTTP/3 vs. HTTP/2 Performance

Interestingly, not every aspect of the protocol is governed by the drafts or RFC. Implementation choices can affect performance, such as efficient packet transmission and choice of congestion control algorithm. Congestion control is a technique your computer and server use to adapt to overloaded networks: by dropping packets, transmission is subsequently throttled. Because QUIC is a new protocol, getting the congestion control design and implementation right requires experimentation and tuning.

In order to provide a safe and simple starting point, the Loss Detection and Congestion Control specification recommends the Reno algorithm but allows endpoints to choose any algorithm they might like. We started with New Reno but we know from experience that we can get better performance with something else. We have recently moved to CUBIC, and on our network with larger transfers and packet loss, CUBIC shows improvement over New Reno. Stay tuned for more details in the future.

For our existing HTTP/2 stack, we currently support BBR v1 (TCP). This means that in our tests, we’re not performing an exact apples-to-apples comparison as these congestion control algorithms will behave differently for smaller vs larger transfers. That being said, we can already see a speedup in smaller websites using HTTP/3 when compared to HTTP/2. With larger zones, the improved congestion control of our tuned HTTP/2 stack shines in performance.

For a small test page of 15KB, HTTP/3 takes an average of 443ms to load compared to 458ms for HTTP/2. However, once we increase the page size to 1MB that advantage disappears: HTTP/3 is just slightly slower than HTTP/2 on our network today, taking 2.33s to load versus 2.30s.

Comparing HTTP/3 vs. HTTP/2 Performance
Comparing HTTP/3 vs. HTTP/2 Performance
Comparing HTTP/3 vs. HTTP/2 Performance

Synthetic benchmarks are interesting, but we wanted to know how HTTP/3 would perform in the real world.

To measure, we wanted a third party that could load websites on our network, mimicking a browser. WebPageTest is a common framework that is used to measure the page load time, with nice waterfall charts. For analyzing the backend, we used our in-house Browser Insights, to capture timings as our edge sees it. We then tied both pieces together with bits of automation.

As a test case we decided to use this very blog for our performance monitoring. We configured our own instances of WebPageTest spread over the world to load these sites over both HTTP/2 and HTTP/3. We also enabled HTTP/3 and Browser Insights. So, every time our test scripts kick off a webpage test with an HTTP/3-supporting browser loading the page, browser analytics report the data back. Rinse and repeat for HTTP/2 to be able to compare.

The following graph shows the page load time for a real world page — blog.cloudflare.com, to compare the performance of HTTP/3 and HTTP/2. We have these performance measurements running from different geographical locations.

Comparing HTTP/3 vs. HTTP/2 Performance

As you can see, HTTP/3 performance still trails HTTP/2 performance, by about 1-4% on average in North America and similar results are seen in Europe, Asia and South America. We suspect this could be due to the difference in congestion algorithms: HTTP/2 on BBR v1 vs. HTTP/3 on CUBIC. In the future, we’ll work to support the same congestion algorithm on both to get a more accurate apples-to-apples comparison.

Conclusion

Overall, we’re very excited to be allowed to help push this standard forward. Our implementation is holding up well, offering better performance in some cases and at worst similar to HTTP/2. As the standard finalizes, we’re looking forward to seeing browsers add support for HTTP/3 in mainstream versions. As for us, we continue to support the latest drafts while at the same time looking for more ways to leverage HTTP/3 to get even better performance, be it congestion tuning, prioritization or system capacity (CPU and raw network throughput).

In the meantime, if you’d like to try it out, just enable HTTP/3 on our dashboard and download a nightly version of one of the major browsers. Instructions on how to enable HTTP/3 can be found on our developer documentation.

Cloudflare for SSH, RDP and Minecraft

Post Syndicated from Achiel van der Mandele original https://blog.cloudflare.com/cloudflare-for-ssh-rdp-and-minecraft/

Cloudflare for SSH, RDP and Minecraft

Cloudflare for SSH, RDP and Minecraft

Almost exactly two years ago, we launched Cloudflare Spectrum for our Enterprise customers. Today, we’re thrilled to extend DDoS protection and traffic acceleration with Spectrum for SSH, RDP, and Minecraft to our Pro and Business plan customers.

When we think of Cloudflare, a lot of the time we think about protecting and improving the performance of websites. But the Internet is so much more, ranging from gaming, to managing servers, to cryptocurrencies. How do we make sure these applications are secure and performant?

With Spectrum, you can put Cloudflare in front of your SSH, RDP and Minecraft services, protecting them from DDoS attacks and improving network performance. This allows you to protect the management of your servers, not just your website. Better yet, by leveraging the Cloudflare network you also get increased reliability and increased performance: lower latency!

Remote access to servers

While access to websites from home is incredibly important, being able to remotely manage your servers can be equally critical. Losing access to your infrastructure can be disastrous: people need to know their infrastructure is safe and that connectivity is good and performant. Usually, server management is done through SSH (Linux or Unix based servers) and RDP (Windows based servers). With these protocols, performance and reliability are key: you need to know you can always reliably manage your servers and that the bad guys are kept out. What's more, low latency is really important. Every time you type a key in an SSH terminal or click a button in a remote desktop session, that key press or button click has to traverse the Internet to your origin before the server can process the input and send feedback. While increasing bandwidth can help, lowering latency helps even more in making your sessions feel like you're working on a local machine and not one halfway across the globe.

All work and no play makes Jack Steve a dull boy

While we stay at home, many of us are also looking to play and not only work. Video games in particular have seen a huge increase in popularity. As personal interaction becomes more difficult to come by, Minecraft has become a popular social outlet. Many of us at Cloudflare are using it to stay in touch and have fun with friends and family in the current age of quarantine. And it's not just Cloudflare employees who feel this way: we've seen a big increase in Minecraft traffic flowing through our network. Weekly traffic had remained steady for a while, but it has more than tripled since many countries put their citizens in lockdown:

Cloudflare for SSH, RDP and Minecraft

Minecraft is a particularly popular target for DDoS attacks: it's not uncommon for people to develop feuds whilst playing the game. When they do, some of the more tech-savvy players opt to take matters into their own hands and launch a (D)DoS attack, rendering the target server unusable for the duration of the attack. Our friends at Hypixel and Nodecraft have known this for many years, which is why they've chosen to protect their servers using Spectrum.

While we love recommending their services, we realize some of you prefer to run your own Minecraft server on a VPS (virtual private server like a DigitalOcean droplet) that you maintain. To help you protect your Minecraft server, we’re providing Spectrum for Minecraft as well, available on Pro and Business plans. You’ll be able to use the entire Cloudflare network to protect your server and increase network performance.

How does it work?

Configuring Spectrum is easy: just log in to your dashboard and head on over to the Spectrum tab. From there you can choose a protocol and configure the IP of your server:

Cloudflare for SSH, RDP and Minecraft

After that, all you have to do is connect using the subdomain you configured instead of your IP address. Traffic is proxied using Spectrum on the Cloudflare network, keeping the bad guys out and your services safe.

Cloudflare for SSH, RDP and Minecraft
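As a hypothetical example, if you created a Spectrum SSH application on ssh.example.com pointing at your origin's IP address, you would then connect using the hostname instead of the address:

$ ssh admin@ssh.example.com

RDP and Minecraft clients work the same way: enter the configured hostname as the server address instead of the raw IP.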

So how much does this cost? We're happy to announce that all paid plans get access to Spectrum for free, with a generous free data allowance. Pro plans can use SSH and Minecraft with up to 5 gigabytes free each month. Business plans can go up to 10 gigabytes free and also get access to RDP. Beyond the free allowance, you are billed on a per-gigabyte basis.

Spectrum is complementary to Access: it offers DDoS protection and improved network performance as a ‘drop-in’ product, no configuration necessary on your origins. If you want more control over who has access to which services, we highly recommend taking a look at Cloudflare for Teams.

We're very excited to extend Cloudflare's services beyond HTTP traffic, allowing you to protect your core management services and Minecraft game servers. In the future, we'll add support for more protocols. If you have a suggestion, let us know! In the meantime, if you have a Pro or Business account, head on over to the dashboard and enable Spectrum today!

Speeding up Linux disk encryption

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Speeding up Linux disk encryption

Data encryption at rest is a must-have for any modern Internet company. Many companies, however, don’t encrypt their disks, because they fear the potential performance penalty caused by encryption overhead.

Encrypting data at rest is vital for Cloudflare with more than 200 data centres across the world. In this post, we will investigate the performance of disk encryption on Linux and explain how we made it at least two times faster for ourselves and our customers!

Encrypting data at rest

When it comes to encrypting data at rest there are several ways it can be implemented on a modern operating system (OS). Available techniques are tightly coupled with a typical OS storage stack. A simplified version of the storage stack and the encryption solutions can be found in the diagram below:

Speeding up Linux disk encryption

On top of the stack are applications, which read and write data in files (or streams). The file system in the OS kernel keeps track of which blocks of the underlying block device belong to which files, and translates these file reads and writes into block reads and writes; the hardware specifics of the underlying storage device are abstracted away from the file system. Finally, the block subsystem actually passes the block reads and writes to the underlying hardware using the appropriate device drivers.

The concept of the storage stack is actually similar to the well-known network OSI model, where each layer has a higher-level view of the information and the implementation details of the lower layers are abstracted away from the upper layers. And, similar to the OSI model, one can apply encryption at different layers (think about TLS vs IPsec or a VPN).

For data at rest we can apply encryption either at the block layer (either in hardware or in software) or at the file level (either directly in applications or in the file system).

Block vs file encryption

Generally, the higher in the stack we apply encryption, the more flexibility we have. With application-level encryption the application maintainers can apply any encryption code they please to any particular data they need. The downside of this approach is that they actually have to implement it themselves, and encryption in general is not very developer-friendly: one has to know the ins and outs of a specific cryptographic algorithm and properly generate keys, nonces, IVs and so on. Additionally, application-level encryption does not leverage OS-level caching, and the Linux page cache in particular: each time the application needs to use the data, it has to either decrypt it again, wasting CPU cycles, or implement its own decrypted "cache", which introduces more complexity to the code.

File system level encryption makes data encryption transparent to applications, because the file system itself encrypts the data before passing it to the block subsystem, so files are encrypted regardless of whether the application has crypto support or not. Also, file systems can be configured to encrypt only a particular directory or to use different keys for different files. This flexibility, however, comes at the cost of a more complex configuration. File system encryption is also considered less secure than block device encryption, as only the contents of the files are encrypted. Files also have associated metadata, like the file size, the number of files, the directory tree layout and so on, which remain visible to a potential adversary.

Encryption down at the block layer (often referred to as disk encryption or full disk encryption) also makes data encryption transparent to applications and even whole file systems. Unlike file system level encryption, it encrypts all data on the disk, including file metadata and even free space. It is less flexible though: one can only encrypt the whole disk with a single key, so there is no per-directory, per-file or per-user configuration. From the crypto perspective, not all cryptographic algorithms can be used, as the block layer no longer has a high-level overview of the data, so it needs to process each block independently. Most common algorithms require some sort of block chaining to be secure, so they are not applicable to disk encryption. Instead, special modes were developed just for this specific use case.

So which layer should you choose? As always, it depends… Application and file system level encryption are usually the preferred choice for client systems because of the flexibility. For example, each user on a multi-user desktop may want to encrypt their home directory with a key they own and leave some shared directories unencrypted. By contrast, on server systems managed by SaaS/PaaS/IaaS companies (including Cloudflare), the preference is configuration simplicity and security: with full disk encryption enabled, any data from any application is automatically encrypted with no exceptions or overrides. We believe that all data needs to be protected without sorting it into "important" vs "not important" buckets, so the selective flexibility the upper layers provide is not needed.

Hardware vs software disk encryption

When encrypting data at the block layer it is possible to do it directly in the storage hardware, if the hardware supports it. Doing so usually gives better read/write performance and consumes fewer resources from the host. However, since most hardware firmware is proprietary, it does not receive as much attention and review from the security community. In the past this led to flaws in some implementations of hardware disk encryption that rendered the whole security model useless. Microsoft, for example, has since started to prefer software-based disk encryption.

We didn't want to put our data or our customers' data at risk by using potentially insecure solutions, and we strongly believe in open source. That's why we rely only on software disk encryption in the Linux kernel, which is open and has been audited by many security professionals across the world.

Linux disk encryption performance

We aim not only to save bandwidth costs for our customers, but to deliver content to Internet users as fast as possible.

At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is a supposedly public Internet cache) is not a sustainable option, we decided to take a closer look at Linux disk encryption performance.

Device mapper and dm-crypt

Linux implements transparent disk encryption via the dm-crypt module, and dm-crypt itself is part of the device mapper kernel framework. In a nutshell, the device mapper allows IO requests to be pre/post-processed as they travel between the file system and the underlying block device.

dm-crypt in particular encrypts "write" IO requests before sending them further down the stack to the actual block device and decrypts "read" IO requests before sending them up to the file system driver. Simple and easy! Or is it?

Benchmarking setup

For the record, the numbers in this post were obtained by running specified commands on an idle Cloudflare G9 server out of production. However, the setup should be easily reproducible on any modern x86 laptop.

Generally, benchmarking anything around a storage stack is hard because of the noise introduced by the storage hardware itself. Not all disks are created equal, so for the purpose of this post we will use the fastest disks available out there – that is no disks.

Instead Linux has an option to emulate a disk directly in RAM. Since RAM is much faster than any persistent storage, it should introduce little bias in our results.

The following command creates a 4GB ramdisk:

$ sudo modprobe brd rd_nr=1 rd_size=4194304
$ ls /dev/ram0

Now we can set up a dm-crypt instance on top of it thus enabling encryption for the disk. First, we need to generate the disk encryption key, "format" the disk and specify a password to unlock the newly generated key.

$ fallocate -l 2M crypthdr.img
$ sudo cryptsetup luksFormat /dev/ram0 --header crypthdr.img

WARNING!
========
This will overwrite data on crypthdr.img irrevocably.

Are you sure? (Type uppercase yes): YES
Enter passphrase:
Verify passphrase:

Those who are familiar with LUKS/dm-crypt might have noticed we used a LUKS detached header here. Normally, LUKS stores the password-encrypted disk encryption key on the same disk as the data, but since we want to compare read/write performance between encrypted and unencrypted devices, we might accidentally overwrite the encrypted key during our benchmarking later. Keeping the encrypted key in a separate file avoids this problem for the purposes of this post.
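As a quick sanity check (not required for the benchmark), you can confirm that the LUKS metadata lives in the detached header file rather than on the ramdisk itself by dumping it; the output should list the cipher and key slots:

$ sudo cryptsetup luksDump crypthdr.img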

Now, we can actually "unlock" the encrypted device for our testing:

$ sudo cryptsetup open --header crypthdr.img /dev/ram0 encrypted-ram0
Enter passphrase for /dev/ram0:
$ ls /dev/mapper/encrypted-ram0
/dev/mapper/encrypted-ram0

At this point we can now compare the performance of encrypted vs unencrypted ramdisk: if we read/write data to /dev/ram0, it will be stored in plaintext. Likewise, if we read/write data to /dev/mapper/encrypted-ram0, it will be decrypted/encrypted on the way by dm-crypt and stored in ciphertext.

It’s worth noting that we’re not creating any file system on top of our block devices to avoid biasing results with a file system overhead.

Measuring throughput

When it comes to storage testing and benchmarking, Flexible I/O tester (fio) is the usual go-to solution. Let's simulate a simple sequential read/write load with a 4K block size on the ramdisk without encryption:

$ sudo fio --filename=/dev/ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=plain
plain: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=21013MB, aggrb=1126.5MB/s, minb=1126.5MB/s, maxb=1126.5MB/s, mint=18655msec, maxt=18655msec
  WRITE: io=21023MB, aggrb=1126.1MB/s, minb=1126.1MB/s, maxb=1126.1MB/s, mint=18655msec, maxt=18655msec

Disk stats (read/write):
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

The above command would run for a long time, so we just stop it after a while. As we can see from the stats, we're able to read and write with roughly the same throughput of around 1126 MB/s. Let's repeat the test with the encrypted ramdisk:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=1693.7MB, aggrb=150874KB/s, minb=150874KB/s, maxb=150874KB/s, mint=11491msec, maxt=11491msec
  WRITE: io=1696.4MB, aggrb=151170KB/s, minb=151170KB/s, maxb=151170KB/s, mint=11491msec, maxt=11491msec

Whoa, that’s a drop! We only get ~147 MB/s now, which is more than 7 times slower! And this is on a totally idle machine!

Maybe, crypto is just slow

The first thing we considered was to make sure we were using the fastest crypto implementation. cryptsetup allows us to benchmark all the available crypto implementations on the system to select the best one:

$ sudo cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1340890 iterations per second for 256-bit key
PBKDF2-sha256    1539759 iterations per second for 256-bit key
PBKDF2-sha512    1205259 iterations per second for 256-bit key
PBKDF2-ripemd160  967321 iterations per second for 256-bit key
PBKDF2-whirlpool  720175 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   969.7 MiB/s  3110.0 MiB/s
 serpent-cbc   128b           N/A           N/A
 twofish-cbc   128b           N/A           N/A
     aes-cbc   256b   756.1 MiB/s  2474.7 MiB/s
 serpent-cbc   256b           N/A           N/A
 twofish-cbc   256b           N/A           N/A
     aes-xts   256b  1823.1 MiB/s  1900.3 MiB/s
 serpent-xts   256b           N/A           N/A
 twofish-xts   256b           N/A           N/A
     aes-xts   512b  1724.4 MiB/s  1765.8 MiB/s
 serpent-xts   512b           N/A           N/A
 twofish-xts   512b           N/A           N/A

It seems aes-xts with a 256-bit data encryption key is the fastest here. But which one are we actually using for our encrypted ramdisk?

$ sudo dmsetup table /dev/mapper/encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0

We do use aes-xts with a 256-bit data encryption key (count all the zeroes conveniently masked by the dmsetup tool; if you want to see the actual bytes, add the --showkeys option to the above command). The numbers do not add up, however. cryptsetup benchmark tells us above not to rely on its results, as "Tests are approximate using memory only (no storage IO)", but that is exactly how we've set up our experiment with the ramdisk. Even in a somewhat pessimistic case (assuming we read all the data and then encrypt/decrypt it sequentially with no parallelism), a back-of-the-envelope calculation says we should be getting around (1126 * 1823) / (1126 + 1823) ≈ 696 MB/s: data has to flow through the ramdisk at ~1126 MB/s and through the cipher at ~1823 MB/s, so the combined rate is the product over the sum. That is still quite far from the actual 147 * 2 = 294 MB/s (total for reads and writes).

dm-crypt performance flags

While reading the cryptsetup man page we noticed that it has two options prefixed with --perf-, which are probably related to performance tuning. The first one is --perf-same_cpu_crypt with a rather cryptic description:

Perform encryption using the same cpu that IO was submitted on.  The default is to use an unbound workqueue so that encryption work is automatically balanced between available CPUs.  This option is only relevant for open action.

So let's enable the option:

$ sudo cryptsetup close encrypted-ram0
$ sudo cryptsetup open --header crypthdr.img --perf-same_cpu_crypt /dev/ram0 encrypted-ram0

Note: according to the latest man page there is also a cryptsetup refresh command, which can be used to enable these options live without having to "close" and "re-open" the encrypted device. Our cryptsetup however didn’t support it yet.

Verifying that the option has really been enabled:

$ sudo dmsetup table encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 same_cpu_crypt

Yes, we can now see same_cpu_crypt in the output, which is what we wanted. Let’s rerun the benchmark:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=1596.6MB, aggrb=139811KB/s, minb=139811KB/s, maxb=139811KB/s, mint=11693msec, maxt=11693msec
  WRITE: io=1600.9MB, aggrb=140192KB/s, minb=140192KB/s, maxb=140192KB/s, mint=11693msec, maxt=11693msec

Hmm, now it is ~136 MB/s, which is slightly worse than before, so no good. What about the second option, --perf-submit_from_crypt_cpus:

Disable offloading writes to a separate thread after encryption.  There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly.  The default is to offload write bios to the same thread.  This option is only relevant for open action.

Maybe, we are in the "some situation" here, so let’s try it out:

$ sudo cryptsetup close encrypted-ram0
$ sudo cryptsetup open --header crypthdr.img --perf-submit_from_crypt_cpus /dev/ram0 encrypted-ram0
Enter passphrase for /dev/ram0:
$ sudo dmsetup table encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 submit_from_crypt_cpus

And now the benchmark:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=2066.6MB, aggrb=169835KB/s, minb=169835KB/s, maxb=169835KB/s, mint=12457msec, maxt=12457msec
  WRITE: io=2067.7MB, aggrb=169965KB/s, minb=169965KB/s, maxb=169965KB/s, mint=12457msec, maxt=12457msec

~166 MB/s, which is a bit better, but still not good…

Asking the community

Getting desperate, we decided to seek support from the Internet and posted our findings to the dm-crypt mailing list, but the response we got was not very encouraging:

If the numbers disturb you, then this is from lack of understanding on your side. You are probably unaware that encryption is a heavy-weight operation…

We decided to do some scientific research on this topic by typing "is encryption expensive" into Google Search, and one of the top results that actually contains meaningful measurements is… our own post about the cost of encryption, but in the context of TLS! This is a fascinating read on its own, but the gist is: modern crypto on modern hardware is very cheap, even at Cloudflare scale (doing millions of encrypted HTTP requests per second). In fact, it is so cheap that Cloudflare was the first provider to offer free SSL/TLS for everyone.

Digging into the source code

When trying out the custom dm-crypt options described above, we were curious why they exist in the first place and what all that "offloading" is about. Originally we expected dm-crypt to be a simple "proxy", which just encrypts/decrypts data as it flows through the stack. It turns out dm-crypt does more than just encrypt memory buffers; a (simplified) IO traverse path diagram is presented below:

Speeding up Linux disk encryption

When the file system issues a write request, dm-crypt does not process it immediately; instead it puts it into a workqueue named "kcryptd". In a nutshell, a kernel workqueue just schedules some work (encryption in this case) to be performed at some later time, when it is more convenient. When "the time" comes, dm-crypt sends the request to the Linux Crypto API for actual encryption. However, the modern Linux Crypto API may be asynchronous as well, so depending on which particular implementation your system uses, the request most likely will not be processed immediately, but queued again for "a later time". When the Linux Crypto API finally does the encryption, dm-crypt may try to sort pending write requests by putting each request into a red-black tree. Then a separate kernel thread, again at "some time later", actually takes all the IO requests in the tree and sends them down the stack.

Now for read requests: this time we need to get the encrypted data from the hardware first, but dm-crypt does not just ask the driver for the data; it queues the request into a different workqueue named "kcryptd_io". At some point later, when we actually have the encrypted data, we schedule it for decryption using the now familiar "kcryptd" workqueue. "kcryptd" sends the request to the Linux Crypto API, which may decrypt the data asynchronously as well.
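As a side note, some of this machinery is visible on a live system. The per-device write thread described above, for example, shows up in the process list (the exact thread name varies between kernel versions, so treat this as a rough check):

$ ps -e | grep dmcrypt_write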

To be fair, the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering whether all this extra queueing could cause performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is:

A significant amount of tail latency is due to queueing effects

So, why are all these queues there and can we remove them?

Git archeology

No-one writes more complex code just for fun, especially for the OS kernel. So all these queues must have been put there for a reason. Luckily, the Linux kernel source is managed by git, so we can try to retrace the changes and the decisions around them.

The "kcryptd" workqueue was in the source since the beginning of the available history with the following comment:

Needed because it would be very unwise to do decryption in an interrupt context, so bios returning from read requests get queued here.

So it was for reads only. But even then, why do we care whether we are in interrupt context or not, if the Linux Crypto API will likely use a dedicated thread/queue for encryption anyway? Well, back in 2005 the Crypto API was not asynchronous, so this made perfect sense.

In 2006 dm-crypt started to use the "kcryptd" workqueue not only for encryption, but for submitting IO requests:

This patch is designed to help dm-crypt comply with the new constraints imposed by the following patch in -mm: md-dm-reduce-stack-usage-with-stacked-block-devices.patch

It seems the goal here was not to add more concurrency, but rather to reduce kernel stack usage, which again makes sense: the kernel has a common stack across all the code, so it is quite a limited resource. It is worth noting, however, that the Linux kernel stack was expanded in 2014 for x86 platforms, so this might not be a problem anymore.

The first version of the "kcryptd_io" workqueue was added in 2007 with the intent of avoiding:

starvation caused by many requests waiting for memory allocation…

The request processing was bottlenecking on a single workqueue here, so the solution was to add another one. Makes sense.

We are definitely not the first ones experiencing performance degradation because of extensive queueing: in 2011 a change was introduced to conditionally revert some of the queueing for read requests:

If there is enough memory, code can directly submit bio instead queuing this operation in a separate thread.

Unfortunately, at that time Linux kernel commit messages were not as verbose as today, so there is no performance data available.

In 2015 dm-crypt started to sort writes in a separate "dmcrypt_write" thread before sending them down the stack:

On a multiprocessor machine, encryption requests finish in a different order than they were submitted. Consequently, write requests would be submitted in a different order and it could cause severe performance degradation.

It does make sense, as sequential disk access used to be much faster than random access and dm-crypt was breaking that pattern. But this mostly applies to spinning disks, which were still dominant in 2015. It may not be as important with modern fast SSDs (including NVMe SSDs).

Another part of the commit message is worth mentioning:

…in particular it enables IO schedulers like CFQ to sort more effectively…

It mentions the performance benefits for the CFQ IO scheduler, but Linux schedulers have improved since then, to the point that the CFQ scheduler was removed from the kernel in 2018.
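If you are curious which IO scheduler a disk on your system currently uses, the kernel exposes it in sysfs; the active one is shown in brackets, and both the device name and the list of available schedulers will differ on your machine:

$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq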

The same patchset replaces the sorting list with a red-black tree:

In theory the sorting should be performed by the underlying disk scheduler, however, in practice the disk scheduler only accepts and sorts a finite number of requests. To allow the sorting of all requests, dm-crypt needs to implement its own sorting.

The overhead associated with rbtree-based sorting is considered negligible so it is not used conditionally.

All of that makes sense, but it would be nice to have some backing data.

Interestingly, in the same patchset we see the introduction of our familiar "submit_from_crypt_cpus" option:

There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly

Overall, we can see that every change was reasonable and needed at the time. However, things have changed since then:

  • hardware became faster and smarter
  • Linux resource allocation was revisited
  • coupled Linux subsystems were rearchitected

And many of the design choices above may not be applicable to modern Linux.

The "clean-up"

Based on the research above we decided to try to remove all the extra queueing and asynchronous behaviour and revert dm-crypt to its original purpose: simply encrypt/decrypt IO requests as they pass through. But for the sake of stability and further benchmarking we ended up not removing the actual code, but rather adding yet another dm-crypt option, which bypasses all the queues/threads, if enabled. The flag allows us to switch between the current and new behaviour at runtime under full production load, so we can easily revert our changes should we see any side-effects. The resulting patch can be found on the Cloudflare GitHub Linux repository.

Synchronous Linux Crypto API

From the diagram above we remember that not all queueing is implemented in dm-crypt. The modern Linux Crypto API may also be asynchronous, and for the sake of this experiment we want to eliminate queues there as well. What does "may be" mean, though? The OS may contain different implementations of the same algorithm (for example, hardware-accelerated AES-NI on x86 platforms and generic C-code AES implementations). By default the system chooses the "best" one based on the configured algorithm priority. dm-crypt allows overriding this behaviour and requesting a particular cipher implementation using the capi: prefix. However, there is one problem. Let us actually check the available AES-XTS (this is our disk encryption cipher, remember?) implementations on our system:

$ grep -A 11 'xts(aes)' /proc/crypto
name         : xts(aes)
driver       : xts(ecb(aes-generic))
module       : kernel
priority     : 100
refcnt       : 7
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : __xts(aes)
driver       : cryptd(__xts-aes-aesni)
module       : cryptd
priority     : 451
refcnt       : 1
selftest     : passed
internal     : yes
type         : skcipher
async        : yes
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : xts(aes)
driver       : xts-aes-aesni
module       : aesni_intel
priority     : 401
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : yes
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : __xts(aes)
driver       : __xts-aes-aesni
module       : aesni_intel
priority     : 401
refcnt       : 7
selftest     : passed
internal     : yes
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64

We want to explicitly select a synchronous cipher from the above list to avoid queueing effects in threads, but the only two synchronous implementations are xts(ecb(aes-generic)) (the generic C implementation) and __xts-aes-aesni (the x86 hardware-accelerated implementation). We definitely want the latter, as it is much faster (we're aiming for performance here), but it is suspiciously marked as internal (see internal: yes). If we check the source code:

Mark a cipher as a service implementation only usable by another cipher and never by a normal user of the kernel crypto API

So this cipher is meant to be used only by other wrapper code in the Crypto API and not outside it. In practice this means that the caller of the Crypto API needs to explicitly specify this flag when requesting a particular cipher implementation, but dm-crypt does not do it, because by design it is not part of the Linux Crypto API, but rather an "external" user. We already patch the dm-crypt module, so we could just add the relevant flag as well. However, there is another problem with AES-NI in particular: the x86 FPU. "Floating point", you say? Why do we need floating point math to do symmetric encryption, which should only be about bit shifts and XOR operations? We don't need the math, but the AES-NI instructions use some of the CPU registers which are dedicated to the FPU. Unfortunately, the Linux kernel does not always preserve these registers in interrupt context for performance reasons (saving/restoring the FPU is expensive). But dm-crypt may execute code in interrupt context, so we risk corrupting some other process's data, and we are back to the "it would be very unwise to do decryption in an interrupt context" statement from the original code.

Our solution to address the above was to create another somewhat "smart" Crypto API module. This module is synchronous and does not roll its own crypto, but is just a "router" of encryption requests:

  • if we can use the FPU (and thus AES-NI) in the current execution context, we just forward the encryption request to the faster, "internal" __xts-aes-aesni implementation (and we can use it here, because now we are part of the Crypto API)
  • otherwise, we just forward the encryption request to the slower, generic C-based xts(ecb(aes-generic)) implementation

Using the whole lot

Let’s walk through the process of using it all together. The first step is to grab the patches and recompile the kernel (or just compile dm-crypt and our xtsproxy modules).

Next, let’s restart our IO workload in a separate terminal, so we can make sure we can reconfigure the kernel at runtime under load:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...

In the main terminal make sure our new Crypto API module is loaded and available:

$ sudo modprobe xtsproxy
$ grep -A 11 'xtsproxy' /proc/crypto
driver       : xts-aes-xtsproxy
module       : xtsproxy
priority     : 0
refcnt       : 0
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
ivsize       : 16
chunksize    : 16

Reconfigure the encrypted disk to use our newly loaded module and enable our patched dm-crypt flag (we have to use the low-level dmsetup tool, as cryptsetup obviously is not aware of our modifications):

$ sudo dmsetup table encrypted-ram0 --showkeys | sed 's/aes-xts-plain64/capi:xts-aes-xtsproxy-plain64/' | sed 's/$/ 1 force_inline/' | sudo dmsetup reload encrypted-ram0

We just "loaded" the new configuration, but for it to take effect, we need to suspend/resume the encrypted device:

$ sudo dmsetup suspend encrypted-ram0 && sudo dmsetup resume encrypted-ram0

And now observe the result. We may go back to the other terminal running the fio job and look at the output, but to make things nicer, here’s a snapshot of the observed read/write throughput in Grafana:

Speeding up Linux disk encryption
Speeding up Linux disk encryption

Wow, we have more than doubled the throughput! With the total throughput of ~640 MB/s we’re now much closer to the expected ~696 MB/s from above. What about the IO latency? (The await statistic from the iostat reporting tool):

Speeding up Linux disk encryption

The latency has been cut in half as well!
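For reference, the await numbers above come from iostat's extended device statistics, along the lines of the following command; depending on your sysstat version, the columns are either a single await or separate r_await/w_await values:

$ iostat -x 1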

To production

So far we have been using a synthetic setup with some parts of the full production stack missing, like file systems, real hardware and most importantly, production workload. To ensure we’re not optimising imaginary things, here is a snapshot of the production impact these changes bring to the caching part of our stack:

Speeding up Linux disk encryption

This graph represents a three-way comparison of the worst-case response times (99th percentile) for a cache hit in one of our servers. The green line is from a server with unencrypted disks, which we will use as baseline. The red line is from a server with encrypted disks with the default Linux disk encryption implementation and the blue line is from a server with encrypted disks and our optimisations enabled. As we can see the default Linux disk encryption implementation has a significant impact on our cache latency in worst case scenarios, whereas the patched implementation is indistinguishable from not using encryption at all. In other words the improved encryption implementation does not have any impact at all on our cache response speed, so we basically get it for free! That’s a win!

We’re just getting started

This post shows how an architecture review can double the performance of a system. Also we reconfirmed that modern cryptography is not expensive and there is usually no excuse not to protect your data.

We are going to submit this work for inclusion in the main kernel source tree, but most likely not in its current form. Although the results look encouraging we have to remember that Linux is a highly portable operating system: it runs on powerful servers as well as small resource constrained IoT devices and on many other CPU architectures as well. The current version of the patches just optimises disk encryption for a particular workload on a particular architecture, but Linux needs a solution which runs smoothly everywhere.

That said, if you think your case is similar and you want to take advantage of the performance improvements now, you may grab the patches and hopefully provide feedback. The runtime flag makes it easy to toggle the functionality on the fly and a simple A/B test may be performed to see if it benefits any particular case or setup. These patches have been running across our wide network of more than 200 data centres on five generations of hardware, so can be reasonably considered stable. Enjoy both performance and security from Cloudflare for all!


Keepalives considered harmful

Post Syndicated from Ivan Babrou original https://blog.cloudflare.com/keepalives-considered-harmful/

Keepalives considered harmful

This may sound like a weird title, but hear me out. You'd think keepalives would always be helpful, but it turns out reality isn't always what you expect it to be. It really helps if you read Why does one NGINX worker take all the load? first. This post is an adaptation of a rather old post on Cloudflare's internal blog, so not all details are exactly as they are in production today, but the lessons are still valid.

This is a story about how we were seeing some complaints about sporadic latency spikes, made some unconventional changes, and were able to slash the 99.9th latency percentile by 4x!

Request flow on Cloudflare edge

I’m going to focus only on two parts of our edge stack: FL and SSL.

  • FL accepts plain HTTP connections and does the main request logic, including our WAF
  • SSL terminates SSL connections and passes them to FL over a local Unix socket:

Here’s a diagram:

Keepalives considered harmful

These days we route all traffic through SSL for simplicity, but in the grand scheme of things it’s not going to matter much.

Each of these is not itself a single process, but rather a master process and a collection of workers that do the actual processing. Here's another diagram to make this more visual:

Keepalives considered harmful

Keepalives

Requests come in over connections that are reused for performance reasons, which is sometimes referred to as "keepalive". It's generally expensive for a client to open a new TCP connection, and our servers keep some memory pools associated with connections that need to be recycled. In fact, one of the mitigations we use for attack traffic is disabling keepalives for abusive clients, forcing them to reopen connections, which slows them down considerably.

To illustrate the usefulness of keepalives, here’s me requesting http://example.com/ from curl over the same connection:

$ curl -s -w "Time to connect: %{time_connect} Time to first byte: %{time_starttransfer}\n" -o /dev/null http://example.com/ -o /dev/null http://example.com/

Time to connect: 0.012108 Time to first byte: 0.018724
Time to connect: 0.000019 Time to first byte: 0.007391

The first request took 18.7ms, and 12.1ms of that was spent establishing a new connection (and that's not even a TLS handshake). When I sent another request over the same connection, there was no such extra cost and it took just 7.3ms to service my request. That's a big win, and it gets even bigger if you need to establish a brand new TLS connection (especially if it's not TLSv1.3 or QUIC). DNS was also cached in the example above, but for domains with a low TTL it can be another negative factor.

Keepalives tend to be used extensively because they seem beneficial, and they are generally enabled by default for this exact reason.

Taking all the load

Due to how Linux works (see the first link in this blog post), most of the load goes to a few workers out of the pool. When such a worker is not able to accept() a pending connection because it's busy processing requests, that connection spills over to another worker. The process cascades until some worker is ready to pick it up.

This leaves us with a ladder of load spread between workers:

               CPU%                                       Runtime
nobody    4254 51.2  0.8 5655600 1093848 ?     R    Aug23 3938:34 nginx: worker process
nobody    4257 47.9  0.8 5615848 1071612 ?     S    Aug23 3682:05 nginx: worker process
nobody    4253 43.8  0.8 5594124 1069424 ?     R    Aug23 3368:27 nginx: worker process
nobody    4255 39.4  0.8 5573888 1070272 ?     S    Aug23 3030:01 nginx: worker process
nobody    4256 36.2  0.7 5556700 1052560 ?     R    Aug23 2784:23 nginx: worker process
nobody    4251 33.1  0.8 5563276 1063700 ?     S    Aug23 2545:07 nginx: worker process
nobody    4252 29.2  0.8 5561232 1058748 ?     S    Aug23 2245:59 nginx: worker process
nobody    4248 26.7  0.8 5554652 1057288 ?     S    Aug23 2056:19 nginx: worker process
nobody    4249 24.5  0.7 5537276 1043568 ?     S    Aug23 1883:18 nginx: worker process
nobody    4245 22.5  0.7 5552340 1048592 ?     S    Aug23 1736:37 nginx: worker process
nobody    4250 20.7  0.7 5533728 1038676 ?     R    Aug23 1598:16 nginx: worker process
nobody    4247 19.6  0.7 5547548 1044480 ?     S    Aug23 1507:27 nginx: worker process
nobody    4246 18.4  0.7 5538104 1043452 ?     S    Aug23 1421:23 nginx: worker process
nobody    4244 17.5  0.7 5530480 1035264 ?     S    Aug23 1345:39 nginx: worker process
nobody    4243 16.6  0.7 5529232 1024268 ?     S    Aug23 1281:55 nginx: worker process
nobody    4242 16.6  0.7 5537956 1038408 ?     R    Aug23 1278:40 nginx: worker process

The third column is the instantaneous CPU%; the time after the date is the total on-CPU time. If you look at the same processes and count their open sockets, you'll see this (using the same order of processes as above):

4254 2357
4257 1833
4253 2180
4255 1609
4256 1565
4251 1519
4252 1175
4248 1065
4249 1056
4245 1201
4250 886
4247 908
4246 968
4244 1180
4243 867
4242 884

More load corresponds to more open sockets, and more open sockets generate more load to serve those connections. It's a vicious circle.

The twist

Now that we have all these workers holding onto these connections, requests over those connections are also, in a way, pinned to workers:

Keepalives considered harmful

In FL we do some things that are somewhat compute intensive, which means that some workers can be busy for a prolonged period of time while other workers sit idle, increasing latency for requests. Ideally we want to always hand a request over to a worker that's idle, because that maximizes our chances of not being blocked, even if it takes a few event loop iterations to serve the request (meaning that we may still block down the road).

One part of dealing with the issue is offloading some of the compute intensive parts of the request processing into a thread pool, but that was something that hadn’t happened at that point. We had to do something else in the meantime.

Clients have no way of knowing that their worker is busy and that they should probably ask another worker to serve the connection. For clients over a real network this doesn't even make sense, since by the time their request arrives at NGINX, up to tens of milliseconds later, the situation will probably have changed.

But our clients are not on a long-haul network! They are local: it's SSL connecting to FL over a Unix socket. It's not all that expensive to open a new connection for each request, and what it gives us is the ability to pass a fully formed request, already buffered in SSL, into FL:

Keepalives considered harmful

Two key points here:

  • The request is always picked up by an idle worker
  • The request is readily available for potentially compute intensive processing

The former is the most important part.

Validating the hypothesis

To validate this assumption, I wrote the following program:

package main
 
import (
    "flag"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "time"
)
 
func main() {
    u := flag.String("url", "", "url to request")
    p := flag.Duration("pause", 0, "pause between requests")
    t := flag.Duration("threshold", 0, "threshold for reporting")
    c := flag.Int("count", 10000, "number of requests to send")
    n := flag.Int("close", 1, "close connection after every that many requests")
 
    flag.Parse()
 
    if *u == "" {
        flag.PrintDefaults()
        os.Exit(1)
    }
 
    client := http.Client{}
 
    for i := 0; i < *c; i++ {
        started := time.Now()
 
        request, err := http.NewRequest("GET", *u, nil)
        if err != nil {
            log.Fatalf("Error constructing request: %s", err)
        }
 
        if i%*n == 0 {
            request.Header.Set("Connection", "Close")
        }
 
        response, err := client.Do(request)
        if err != nil {
            log.Fatalf("Error performing request: %s", err)
        }
 
        _, err = ioutil.ReadAll(response.Body)
        if err != nil {
            log.Fatalf("Error reading request body: %s", err)
        }
 
        response.Body.Close()
 
        elapsed := time.Since(started)
        if elapsed > *t {
            log.Printf("Request %d took %dms", i, int(elapsed.Seconds()*1000))
        }
 
        time.Sleep(*p)
    }
}

The program connects to a requested URL and recycles the connection after X requests have completed. We also pause for a short time between requests.

If we close a connection after every request:

$ go run /tmp/main.go -url http://test.domain/cdn-cgi/trace -count 10000 -pause 5ms -threshold 20ms -close 1
2018/08/24 23:42:34 Request 453 took 32ms
2018/08/24 23:42:38 Request 1044 took 24ms
2018/08/24 23:43:00 Request 4106 took 83ms
2018/08/24 23:43:12 Request 5778 took 27ms
2018/08/24 23:43:16 Request 6292 took 27ms
2018/08/24 23:43:20 Request 6856 took 21ms
2018/08/24 23:43:32 Request 8578 took 45ms
2018/08/24 23:43:42 Request 9938 took 22ms

We request an endpoint that’s served in FL by Lua, so seeing any slow requests is unfortunate. There’s an element of luck in this game and our program sees no cooperation from eyeballs or SSL, so it’s somewhat expected.

Now, if we start closing the connection only after every other request, the situation immediately gets a lot worse:

$ go run /tmp/main.go -url http://teste1.cfperf.net/cdn-cgi/trace -count 10000 -pause 5ms -threshold 20ms -close 2
2018/08/24 23:43:51 Request 162 took 22ms
2018/08/24 23:43:51 Request 220 took 21ms
2018/08/24 23:43:53 Request 452 took 23ms
2018/08/24 23:43:54 Request 540 took 41ms
2018/08/24 23:43:54 Request 614 took 23ms
2018/08/24 23:43:56 Request 900 took 40ms
2018/08/24 23:44:02 Request 1705 took 21ms
2018/08/24 23:44:03 Request 1850 took 27ms
2018/08/24 23:44:03 Request 1878 took 36ms
2018/08/24 23:44:08 Request 2470 took 21ms
2018/08/24 23:44:11 Request 2926 took 22ms
2018/08/24 23:44:14 Request 3350 took 37ms
2018/08/24 23:44:14 Request 3404 took 21ms
2018/08/24 23:44:16 Request 3598 took 32ms
2018/08/24 23:44:16 Request 3606 took 22ms
2018/08/24 23:44:19 Request 4026 took 33ms
2018/08/24 23:44:20 Request 4250 took 74ms
2018/08/24 23:44:22 Request 4483 took 20ms
2018/08/24 23:44:23 Request 4572 took 21ms
2018/08/24 23:44:23 Request 4644 took 23ms
2018/08/24 23:44:24 Request 4758 took 63ms
2018/08/24 23:44:25 Request 4808 took 39ms
2018/08/24 23:44:30 Request 5496 took 28ms
2018/08/24 23:44:31 Request 5736 took 88ms
2018/08/24 23:44:32 Request 5845 took 43ms
2018/08/24 23:44:33 Request 5988 took 52ms
2018/08/24 23:44:34 Request 6042 took 26ms
2018/08/24 23:44:34 Request 6049 took 23ms
2018/08/24 23:44:40 Request 6872 took 86ms
2018/08/24 23:44:40 Request 6940 took 23ms
2018/08/24 23:44:40 Request 6964 took 23ms
2018/08/24 23:44:44 Request 7532 took 32ms
2018/08/24 23:44:49 Request 8224 took 22ms
2018/08/24 23:44:49 Request 8234 took 29ms
2018/08/24 23:44:51 Request 8536 took 24ms
2018/08/24 23:44:55 Request 9028 took 22ms
2018/08/24 23:44:55 Request 9050 took 23ms
2018/08/24 23:44:55 Request 9092 took 26ms
2018/08/24 23:44:57 Request 9330 took 25ms
2018/08/24 23:45:01 Request 9962 took 48ms

If we close after every 5 requests, the number of slow responses almost doubles. This is counter-intuitive: keepalives are supposed to help with latency, not make it worse!

Trying this in the wild

To see how this works out in the real world, we disabled keepalives between SSL and FL in one location, forcing SSL to send every request over a separate connection (remember: these are cheap local Unix sockets). Here's how our probes toward that location reacted to this:

Keepalives considered harmful

Here’s cumulative wait time between SSL and FL:

Keepalives considered harmful

And finally 99.9th percentile of the same measurement:

Keepalives considered harmful

This is a huge win.

Another telling graph is comparing our average “edge processing time” (which includes WAF) in a test location to a global value:

Keepalives considered harmful

We reduced unnecessary wait time due to an imbalance without increasing the CPU load, which directly translates into improved user experience and lower resource consumption for us.

The downsides

There has to be a downside to this, right? The problem we introduced is that the CPU imbalance between individual CPU cores went up:

Keepalives considered harmful

Overall CPU usage did not change, just its distribution. We already know of a way of dealing with this: SO_REUSEPORT. Either that or EPOLLROUNDROBIN, which doesn't have some of the drawbacks of SO_REUSEPORT (which does not work for a Unix socket, for example) but requires a patched kernel. If we combine disabled keepalives with the EPOLLROUNDROBIN change, we can see the CPUs allocated to FL converge nicely in their utilization:

Keepalives considered harmful

We've tried different combinations, and having both EPOLLROUNDROBIN and disabled keepalives worked best. Having just one of them is not as effective at lowering latency.

Conclusions

We disabled keepalives between SSL and FL running on the same box, and this greatly improved the tail latency caused by requests landing on non-optimal FL workers. This was an unexpected fix, but it worked, and we are able to explain it.

This doesn't mean that you should go and disable keepalives everywhere. Generally keepalives are great and should stay enabled, but in our case the latency of establishing a local connection is much lower than the delay we can get from landing on a busy worker.

In reality this means that we can run our machines hotter and not see latency rise as much as it did before. Imagine moving the CPU cap from 50% to 80% with no effect on latency. The numbers are arbitrary, but the idea holds. Running hotter means fewer machines are needed to serve the same amount of traffic, reducing our overall footprint. 🌎

Returning 575 Terabytes of storage space back to our users

Post Syndicated from Grab Tech original https://engineering.grab.com/returning-storage-space-back-to-our-users

Have you ever run out of storage on your phone? Mobile phones come with limited storage, and with the proliferation of apps and large video files, many of you are running out of space.

In this article, we explain how we measure and reduce the storage footprint of the Grab App on a user’s device to help you overcome this issue.

The wakeup call

Android vitals (information provided by the Google Play Console about our app's performance) gives us two main pieces of information about our storage footprint.

15.7% of users have less than 1GB of free storage and they tend to uninstall more than other users (1.2x).

The first is the proportion of 30-day active devices which reported less than 1GB of free storage, calculated as a 30-day rolling average.

Active devices with <1GB free space

The second is the ratio of uninstalls on active devices with less than 1GB of free storage to uninstalls on all active devices, calculated as a 30-day rolling average.

Ratio of uninstalls on active devices with less than 1GB

Instrumentation to know where we stand

First things first, we needed to know how much space the Grab app occupies on a user's device, so we started with our personal devices. You can find this information by opening the phone settings and selecting the Grab app.

App Settings

For this device (screenshot above), the application itself (installed binary) was 186 MB and the total footprint was 322 MB. Since this information varies a lot based on how the app is used, we needed to collect it directly from our users in production.
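If you want to check a similar number from the command line for an app you can debug, the app's private data directory can be inspected with run-as; the package name below is a placeholder, and this only works for debuggable builds on devices where du is available:

$ adb shell run-as com.example.app du -sh .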

Disclaimer: We are only measuring files that are inside the internal Grab app folder (Cache/Database). We do NOT measure any file that is not inside the private Grab folder.

We decided to leverage our existing implementation of the StorageManager API to gather the following information at each session launch:

  • Application Size (Installed binary size)
  • Cache folder size
  • Total footprint
Sample code to retrieve storage information on Android

Data analysis

We began analysing this data one month after our users updated their app and found that the cache size was anomalously large (> 1GB) for a lot of users. Intrigued, we dug deeper.

We added code to log the largest files inside the cache folder, and we found that most of the files were inside a sub-cache folder that was no longer in use. This was left over from a 3rd party library that had been removed from our app. We added a specific metric to track the size of this folder.

In the end, a lot of users still had this old cache data, and for some of them it added up to as much as 1GB.

Root cause analysis

The Grab app relies a lot on 3rd party libraries. For example, Picasso was a library we used in the past for image display, which has since been replaced by Glide. Picasso uses a cache to store images and avoid making the same network calls again and again. After removing Picasso from the app, we didn't delete this cache folder on the user's device. We knew there were likely more discontinued third-party libraries, so we expanded our analysis to look at how other 3rd party libraries cached their data.

Freeing up space on user’s phone

Here comes the fun part. We implemented a cleanup mechanism to remove old cache folders: when users update the Grab app, any old cache folders that were there before are automatically removed. Through this, we returned up to 1GB of space to a user's device in a second. In total, we removed 575 terabytes of old cache data across more than 13 million devices (approximately 40MB per user on average).

Data summary

The following graph shows the total size of junk data (in terabytes) that we could potentially remove each day, calculated by summing up the maximum cache size reported when a user opens the Grab app each day.

The first half of the graph reflects the amount of junk data in relation to the latest app version before auto-clean up was activated. The second half of the graph shows a dramatic dip in junk data after auto-clean up was activated. We were deleting up to 33 Terabytes of data per day on the user’s device when we first started!

Sum of all junk data on user's device reported per day in Terabytes

Next step

This is the first phase of our journey in reducing the storage footprint of our app on Android devices. We specifically focused on making improvements at scale i.e. deliver huge storage gains to the most number of users in the shortest time. In the next phase, we will look at more targeted improvements for specific groups of users that still have a high storage footprint. In addition, we are also reviewing iOS data to see if a round of clean up is necessary.

Concurrently, we are also reducing the maximum size of the caches created by some libraries. For example, Glide creates a 250MB cache by default, but this can be configured and optimised.

We hope you found this piece insightful and please remember to update your app regularly to benefit from the improvements we’re making every day. If you find that your app is still taking a lot of space on your phone, be assured that we’re looking into it.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Journey to a Faster Everyday Super App Where Every Millisecond Counts

Post Syndicated from Grab Tech original https://engineering.grab.com/journey-to-a-faster-everyday-super-app

Introduction

At Grab, we are moving faster than ever. In 2019 alone, we released dozens of new features in the Grab passenger app. With our goal to delight users in Southeast Asia with a powerful everyday super app, the app’s performance became one of the most critical components in delivering that experience to our users.

This post narrates the journey of our performance improvement efforts on the Grab passenger app. It highlights how we were able to reduce the time spent starting the app by more than 60%, while preventing regressions introduced by new features. We use the p95 scale when referring to these improvements.

Here’s a quick look at the improvements and timeline:

Improvements Timeline

Improving App Performance

While app performance consists of different aspects – such as battery consumption rate, network performance, app responsiveness, etc. – the first thing users notice is the time it takes for an app to start. Apps that take too long to load frustrate users, leading to bad reviews and uninstalls.

We focused our efforts on the app’s time to interactive (TTI), which consists of two main operations:

  • Starting the app
  • Displaying interactive service tiles (these are the icons for the services offered on the app such as Transport, Food, Delivery, and so on)

There are many other operations that occur in the background, which we won’t cover in this article.

We prioritised optimising the app’s ability to load the service tiles (highlighted in the image below) and render them as interactive upon startup (cold start). This allows users to start using the app as soon as they launch it.

Service Tiles

Instrumentation and Benchmarking

Before we could start improving the app’s performance, we needed to know where we stood and set measurable goals.

We couldn’t get a baseline from local performance testing as it did not simulate real-world conditions, where network variability and device performance are contributing factors. Thus, we needed to use real production data to get an accurate reflection of our current performance at scale. In production, we measured the performance of ~8-9 million users per day – a small subset of our overall active user base.

As a start, we measured the different components contributing to TTI, such as binary loading, library initialisations, and tiles loading. For example, if we had to measure the time taken by function A, this is roughly how it looked in the code:

fun functionA() {
    val startMs = System.currentTimeMillis()   // start the timer

    // ... the actual work of function A ...

    // Stop the timer, calculate the time difference and send it as an analytics event
    val elapsedMs = System.currentTimeMillis() - startMs
    sendAnalyticsEvent("functionA_duration_ms", elapsedMs)   // illustrative event name
}

With all the numbers from the contributing components, we took the sum to calculate the full TTI (as shown in the following image).

Full TTI

When the numbers started rolling in from production, we needed specific measurements to interpret those numbers, so we started looking at TTI’s 50th, 90th, and 95th percentile. A 90th percentile (p90) of x seconds means that 90% of the users have an interactive screen in at most x seconds.
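
As a quick illustration of what these percentile numbers mean, here is a small sketch that derives p50/p90/p95 from raw TTI samples using the nearest-rank method; the sample values are made up.

import kotlin.math.ceil

// Sketch: nearest-rank percentile over raw TTI samples (in seconds).
fun percentile(samples: List<Double>, p: Double): Double {
    require(samples.isNotEmpty() && p in 0.0..100.0)
    val sorted = samples.sorted()
    val rank = ceil(p / 100.0 * sorted.size).toInt().coerceIn(1, sorted.size)
    return sorted[rank - 1]
}

fun main() {
    val ttiSeconds = listOf(1.2, 1.4, 1.9, 2.3, 2.8, 3.1, 4.0, 5.2, 6.5, 9.8)
    println("p50=${percentile(ttiSeconds, 50.0)} p90=${percentile(ttiSeconds, 90.0)} p95=${percentile(ttiSeconds, 95.0)}")
}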

We chose to only focus on p50 and p95 as these cover the majority of our users who deal with performance issues. Improving performance below p50 (users who already have high-end devices) would not bring much value, and improving beyond p95 would be very difficult, as app performance there is limited by device capability.

By the end of January, we got the p50, p90, and p95 numbers for the contributing components that summed up to TTI numbers for tiles, which allowed us to start identifying areas with potential improvements.

Caching and Animation Removal

While reviewing the TTI numbers, we were drawn to contributors with high time consumption, such as tile loading and the app start animation. Another evident improvement we worked on was caching data between app launches instead of waiting for a network response to load tiles at every single app launch.

Tile Caching

Based on the gathered data, the service tiles only change when a user travels between cities. This is because the available services vary in each city. Since users do not frequently change cities, the service tiles do not change very frequently either, and so caching the tiles made sense. However, we also needed to sync fresh tiles in case of any change. So, we updated the logic based on these findings, as illustrated in the following image:

Tile Caching Logic
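
A rough Kotlin sketch of that cache-then-refresh logic; all class and function names are illustrative rather than the actual implementation.

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

// Sketch: render cached tiles immediately, then sync fresh tiles in the background
// and re-render only if they changed.
class TileLoader(
    private val cache: TileCache,
    private val api: TileApi,
    private val scope: CoroutineScope
) {
    fun loadTiles(city: String, render: (List<Tile>) -> Unit) {
        cache.get(city)?.let(render) // no network wait on launch if a cached copy exists

        scope.launch(Dispatchers.IO) {
            val fresh = api.fetchTiles(city)
            if (fresh != cache.get(city)) {
                cache.put(city, fresh)
                withContext(Dispatchers.Main) { render(fresh) }
            }
        }
    }
}

// Illustrative supporting types.
data class Tile(val id: String, val name: String)

interface TileCache {
    fun get(city: String): List<Tile>?
    fun put(city: String, tiles: List<Tile>)
}

interface TileApi {
    suspend fun fetchTiles(city: String): List<Tile>
}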

Caching tiles brought us a huge improvement of ~3s on each platform.

Animation Removal

We came across a beautifully created animation at app start that didn’t provide any additional value in terms of information or practicality.

After detailed discussions and trade-off evaluations with our designers, we removed the animation and improved our TTI by a further 1s.

In conclusion, with the caching and animation removal alone, we improved the TTI by 4s.

Welcome Static Linking and Coroutines

At this point, our users gained 4 seconds of their time back, but we didn’t want to stop with that number. So, we dug through the data to see what further enhancements we could do. When we could not find anything else that was similar to caching and animation removal, we shifted to architecture fundamentals.

We knew that this was not an easy route to take and that it would come with a cost; if we decided to choose a component related to architecture fundamentals, all the other teams working on the Grab app would be impacted. We had to evaluate our options and make decisions with trade-offs for overall improvements. And this eventually led to static linking on iOS and coroutines on Android.

Binary Loading

Binary loading is one of the first steps in both mobile platforms when an app is launched. It primarily contributes to pre-main and dex-loading, on iOS and Android respectively.

The pre-main time on iOS was about 7.9s. It is well known in the iOS development world that each framework (binary) can be either dynamically or statically linked. While static linking helps the app start faster, it adds complexity when building frameworks that are elaborate or contain resource bundles. Building a lot of libraries statically also impacts build times negatively. After weighing these trade-offs, we decided to take the route of enabling more static linking.

Apple recommends a maximum of half a dozen dynamic frameworks for optimal performance. Guess what? Our passenger app had 107 dynamically linked frameworks, many of them internal.

The task looked daunting at first, since it affected all parts of the app, but we were ready to tackle the challenge head on. Deciding to take this on was the easy part; the actual work entailed lots of tricky coordination and collaboration with multiple teams.

We created an RFC (Request For Comments) doc to propose the static linking of frameworks wherever applicable, and coordinated with teams on agreed timelines to execute this change.

While collaborating with teams, we learned that we could remove 12 frameworks entirely that were no longer required. This exercise highlighted the importance of regular cleanup and deprecation in our codebase, and was added into our standard process.

And so we were left with 95 frameworks, 75 of which were statically linked successfully, resulting in our p90 pre-main dropping by 41%.

As Grabbers, it’s in our DNA to push ourselves a little more. With the remaining 20 frameworks, our pre-main was still considerably high. Out of the 20 frameworks, 10 could not be statically linked without issues. As a workaround, we merged multiple dynamic frameworks into one. One of our outstanding engineers even created a plug-in for this, called cocoapods-pod-merge. With this plug-in, we were able to merge 10 dynamically linked frameworks into 2. We’ve made this plug-in open source: https://github.com/grab/cocoapods-pod-merge.

With all of the above steps, we were finally left with 12 dynamic frameworks – a huge 88% reduction.

The following image summarises the numbers mentioned above:

Static Linking

Using cocoapods-pod-merge gave us a further ~0.8s of improvement.

Coroutines

While we were executing the static linking initiative on iOS, we also started refactoring the application initialisation on Android for modular, clean code. This resulted in an ApplicationInitialiser class, which handles the entire application initialisation process with maximum parallelism using coroutines.

Now all the libraries are initialised in parallel via coroutines, enabling better utilisation of computing resources and a faster TTI.
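
A minimal sketch of that idea using kotlinx.coroutines; the shape of ApplicationInitialiser and the library names are illustrative, not the actual Grab class.

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.joinAll
import kotlinx.coroutines.launch

// Sketch: initialise independent libraries concurrently instead of one after another.
class ApplicationInitialiser(private val scope: CoroutineScope) {

    fun initialise() {
        scope.launch(Dispatchers.Default) {
            listOf(
                launch { initAnalytics() },   // illustrative library initialisers
                launch { initNetworking() },
                launch { initImageLoader() }
            ).joinAll()

            onInitialisationComplete() // runs only after all libraries are ready
        }
    }

    private fun initAnalytics() { /* ... */ }
    private fun initNetworking() { /* ... */ }
    private fun initImageLoader() { /* ... */ }
    private fun onInitialisationComplete() { /* ... */ }
}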

This refactoring and background initialisation of libraries on Android gained us ~0.4s of improvement.

Changing the Basics – Visualisation Setup

By the end of H1 2019, we observed a 50% improvement in TTI, and now it was time to set new goals for H2 2019. Until this point, we would query our database for all metric numbers, copy the numbers into a spreadsheet, and compare them across weeks and app versions.

Despite the heavy manual work and other challenges, this method still worked at the beginning because of the sizeable improvements we were focused on.

However, in H2 2019 it became apparent that we had to reassess our methodology of reading numbers. So, we started thinking about other ways to present and visualise these numbers better. With help from our Product Analyst, we took advantage of Metabase’s advanced capabilities and presented our goals and metrics in a clear and easy-to-understand format.

For example, here is a graph that shows the top contributing metrics for Android:

Android Metrics

Looking at it, we could clearly tell which metric needed to be prioritised for improvements.

We did this not only for our metrics, but also for our main goals, which allowed us to easily see our progress and track our improvements on a daily basis.

Visualisation

The colour bars in the above image depict the status of our numbers against our goals and also show the actual numbers at p50, p90, and p95.

As our tracking progressed, we started including more granular and precise measurements, to help guide the team and achieve more impactful improvements of around ~0.3-0.4s.

Fortunately, we were deprecating a third-party library for analytics and experimentation, which happened to be one of the highest contributors to TTI on both platforms due to a high number of operations on the main thread. We started using our own in-house experimentation platform, where we had better control over performance. We removed this third-party dependency, and it gave us huge improvements of ~2.5s on Android and ~0.5-0.7s on iOS.

You might be wondering why there is such a big difference between the iOS and Android improvement numbers for this dependency. This was due to a set-user-attributes operation that ran only in the Android codebase; it was performed on the main thread and took a huge amount of time. Moments like these made us realise that we should focus more on consistency between both platforms, identify which third-party library APIs are used, and assess whether they are absolutely necessary.

*Tip*: Audit your own codebase for such cross-platform inconsistencies and eliminate any that you find.

Ok, there goes our third quarter with ~3s of improvement on Android and ~1.3s on iOS.

Performance Regression Detection

Entering Q4 brought us many challenges as we were running out of improvements to make. Even finding an improvement worth ~0.05s was really difficult! We were also strongly challenged by regressions (increases in TTI numbers) because of continuous feature releases and code additions to the app start process.

So, maintaining the TTI numbers became our primary task for this period. We started looking into setting up processes to block regressions from being merged to master, or at least to get notified before they hit production.

To begin with, we identified the main sources of regressions: static linking breakage on iOS and library initialisation in the app startup process on Android.

We took the following measures to cover these cases:

Linters

We built linters on the Continuous Integration (CI) pipeline to detect potential changes in static linking on iOS and the ApplicationInitialiser class on Android. The linters block the changelist and enforce a special review process for such changes.
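
For illustration only, a guard rail of this kind could be as simple as the sketch below, where the guarded paths and the CI wiring are assumptions rather than the actual linter:

import kotlin.system.exitProcess

// Hypothetical sketch: fail the CI step when a changelist touches startup-critical files,
// forcing the special review process described above.
fun main(args: Array<String>) {
    val changedFiles = args.toList() // e.g. supplied by the pipeline from `git diff --name-only`
    val guardedPaths = listOf("ApplicationInitialiser", "Podfile") // illustrative watch list
    val flagged = changedFiles.filter { path -> guardedPaths.any { guarded -> path.contains(guarded) } }

    if (flagged.isNotEmpty()) {
        System.err.println("Startup-critical files changed, special review required: $flagged")
        exitProcess(1)
    }
}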

Library Integration Process

The team also focused on setting up a process for library integrations, where each library (internal or third party) is first evaluated for performance impact before it is integrated into the codebase.

While regression guarding was in progress, we were simultaneously trying to bring in more TTI improvements. We enabled the Link Time Optimisation (LTO) flag on iOS to improve overall app performance. We also experimented with order files on iOS and Anko layouts on Android, but these were ruled out due to known issues.

On Android, we hit a wall, with only minimal improvements left to find. Fortunately, it was a different story on iOS. We managed to get improvements worth ~0.6s by opting for lazy loading, optimising I/O operations, and deferring more operations to after app start (where applicable).

Next Steps

We will be looking at the different aspects of performance such as network, battery, and storage, while maintaining our current numbers for TTI.

  • Network performance – Track the turnaround time for network requests, then move on to optimisations.
  • Battery performance – Focus on profiling the app for CPU and energy intensive operations, which drain the battery, and then move on to optimisations.
  • Storage performance – Review our caching and storage mechanisms, and then look for ways to optimise them.

In addition to these, we are also focusing on bringing performance initiatives to all the teams at Grab. We believe that performance is a collaborative effort, and we would like to improve the app performance in all aspects.

We defined different metrics to track performance e.g. Time to Interactive, Time to feedback (the time taken to get the feedback for a user action), UI smoothness indicators, storage, and network metrics.

We are enabling all teams to benchmark their performance numbers based on defined metrics and move on to a path of improvement.

Conclusion

Overall, we improved by 60%, and this calls for a big celebration! Woohoo! The bigger celebration came from knowing that we’ve improved our customers’ experience in using our app.

This graph represents our performance improvement journey for the entire 2019, in terms of TTI.

Performance Graph

Based on the graph, looking at our p95 improvements and converting them to the number of hours saved per day gives us ~21,388 hours on iOS and ~38,194 hours on Android.

Hey, did you know that it takes approximately 80-85 hours to watch all the episodes of Friends? Just saying. 🙂

We will continue to serve our customers for a better and faster experience in the upcoming years.

Announcing improved VPC networking for AWS Lambda functions

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/announcing-improved-vpc-networking-for-aws-lambda-functions/

We’re excited to announce a major improvement to how AWS Lambda functions work with your Amazon VPC networks. With today’s launch, you will see dramatic improvements to function startup performance and more efficient usage of elastic network interfaces. These improvements are rolling out to all existing and new VPC functions, at no additional cost. Roll out begins today and continues gradually over the next couple of months across all Regions.

Lambda first supported VPCs in February 2016, allowing you to access resources in your VPCs or on-premises systems using an AWS Direct Connect link. Since then, we’ve seen customers widely use VPC connectivity to access many different services:

  • Relational databases such as Amazon RDS
  • Datastores such as Amazon ElastiCache for Redis or Amazon Elasticsearch Service
  • Other services running in Amazon EC2 or Amazon ECS

Today’s launch brings enhancements to how this connectivity between Lambda and your VPCs works.

Previous Lambda integration model for VPCs

It helps to understand some basics about the way that networking with Lambda works. Today, all of the compute infrastructure for Lambda runs inside of VPCs owned by the service.

Lambda runs in a VPC controlled by the service

When you invoke a Lambda function from any invocation source and with any execution model (synchronous, asynchronous, or poll based), it occurs through the Lambda API. There is no direct network access to the execution environment where your functions run:

Invoke access only through the Lambda API

By default, when your Lambda function is not configured to connect to your own VPCs, the function can access anything available on the public internet such as other AWS services, HTTPS endpoints for APIs, or services and endpoints outside AWS. The function then has no way to connect to your private resources inside of your VPC.

When you configure your Lambda function to connect to your own VPC, it creates an elastic network interface in your VPC and then does a cross-account attachment. These network interfaces allow network access from your Lambda functions to your private resources. These Lambda functions continue to run inside of the Lambda service’s VPC and can now only access resources over the network through your VPC.

All invocations for functions continue to come from the Lambda service’s API and you still do not have direct network access to the execution environment. The attached network interface exists inside of your account and counts towards limits that exist in your account for that Region.

Lambda mapped to a customer ENI

With this model, there are a number of challenges that you might face. The time taken to create and attach new network interfaces can cause longer cold-starts, increasing the time it takes to spin up a new execution environment before your code can be invoked.

As your function’s execution environment scales to handle an increase in requests, more network interfaces are created and attached to the Lambda infrastructure. The exact number of network interfaces created and attached is a factor of your function configuration and concurrency.

Many ENIs mapped to Lambda environments

Every network interface created for your function is associated with and consumes an IP address in your VPC subnets. It counts towards your account level maximum limit of network interfaces.

As your function scales, you have to be mindful of several issues:

  • Managing the IP address space in your subnets
  • Reaching the account level network interface limit
  • The potential to hit the API rate limit on creating new network interfaces

What’s changing

Starting today, we’re changing the way that your functions connect to your VPCs. AWS Hyperplane, the Network Function Virtualization platform used for Network Load Balancer and NAT Gateway, has supported inter-VPC connectivity for offerings like AWS PrivateLink, and we are now leveraging Hyperplane to provide NAT capabilities from the Lambda VPC to customer VPCs.

The Hyperplane ENI is a managed network resource that the Lambda service controls, allowing multiple execution environments to securely access resources inside of VPCs in your account. Instead of the previous solution of mapping network interfaces in your VPC directly to Lambda execution environments, network interfaces in your VPC are mapped to the Hyperplane ENI and the functions connect using it.
VPC to VPC NAT

Hyperplane still uses a cross account attached network interface that exists in your VPC, but in a greatly improved way:

  • The network interface creation happens when your Lambda function is created or its VPC settings are updated. When a function is invoked, the execution environment simply uses the pre-created network interface and quickly establishes a network tunnel to it. This dramatically reduces the latency that was previously associated with creating and attaching a network interface at cold start.
  • Because the network interfaces are shared across execution environments, typically only a handful of network interfaces are required per function. Every unique security group:subnet combination across functions in your account requires a distinct network interface. If a combination is shared across multiple functions in your account, we reuse the same network interface across functions.
  • Your function scaling is no longer directly tied to the number of network interfaces, and Hyperplane ENIs can scale to support large numbers of concurrent function executions.

Hyperplane ENIs are tied to a security group:subnet combination in your account. Functions in the same account that share the same security group:subnet pairing use the same network interfaces. This way, a single application with multiple functions but the same network and security configuration can benefit from the existing interface configuration.
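
As a rough illustration of this pairing rule (a sketch with made-up identifiers, not an AWS API), the number of shared ENIs needed depends on the number of unique security group/subnet combinations rather than the number of functions:

// Sketch: ENIs are shared per unique (security groups, subnet) combination in an account.
data class VpcConfig(val securityGroups: Set<String>, val subnet: String)

fun requiredEnis(functionConfigs: List<VpcConfig>): Int =
    functionConfigs.toSet().size

fun main() {
    val configs = listOf(
        VpcConfig(setOf("sg-aaa"), "subnet-1"),
        VpcConfig(setOf("sg-aaa"), "subnet-1"), // same pairing as above, reuses the same ENI
        VpcConfig(setOf("sg-aaa"), "subnet-2")  // new pairing, needs its own ENI
    )
    println(requiredEnis(configs)) // prints 2
}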

Several things do not change in this new model:

  • Your Lambda functions still need the IAM permissions required to create and delete network interfaces in your VPC.
  • You still control the subnet and security group configurations of these network interfaces. You can continue to apply normal network security controls and follow best practices on VPC configuration.
  • You still have to use a NAT device (for example, a VPC NAT Gateway) to give a function internet access or use VPC endpoints to connect to services outside of your VPC.
  • Nothing changes about the types of resources that your functions can access inside of your VPCs.

As mentioned earlier, Hyperplane now creates a shared network interface when your Lambda function is first created or when its VPC settings are updated, improving function setup performance and scalability. This one-time setup can take up to 90 seconds to complete. If your Lambda function is invoked while the shared network interface is still being created, the invocation still succeeds but may see longer cold-start times. When setup is complete, your Lambda function no longer requires network interface creation at invocation time. If Lambda functions in an account go idle for consecutive weeks, the service will reclaim the unused Hyperplane resources and so very infrequently invoked functions may still see longer cold-start times.

As I mentioned earlier, you don’t have to enable this new capability yourself, and there are no new associated costs. Starting today, we are gradually rolling this out across all AWS Regions over the next couple of months. We will update this post on a Region by Region basis after the rollout has completed in a given Region.

If you have existing VPC-configured Lambda workloads, you will soon see your applications scale and respond to new invocations more quickly. You can see this change using tools like AWS X-Ray or those from Lambda’s third party partners.

The following screenshots show a before/after example taken with AWS X-Ray:

The first shows a function that was launched in a VPC, with a full cold start with network interface creation at invocation.

In this second image, you see the same function executed but this time in an account that has the new VPC networking capability. You can see the massive difference that this has made in function execution duration, with it dropping to 933 ms from 14.8 seconds!

Conclusion

On the AWS Lambda team, we consider “Simplicity” a core tenet to how we think about building services for our customers. We’re constantly evaluating ways that we can improve and simplify the experience that you see in building serverless applications with Lambda. These changes in how we connect with your VPCs improve the performance and scale for your Lambda functions. They enable you to harness the full power of serverless architectures.

NGINX structural enhancements for HTTP/2 performance

Post Syndicated from Nick Jones original https://blog.cloudflare.com/nginx-structural-enhancements-for-http-2-performance/

NGINX structural enhancements for HTTP/2 performance

Introduction

My team, the Cloudflare PROTOCOLS team, is responsible for termination of HTTP traffic at the edge of the Cloudflare network. We deal with features related to: TCP, QUIC, TLS and Secure Certificate management, HTTP/1 and HTTP/2. Over Q1, we were responsible for implementing the Enhanced HTTP/2 Prioritization product that Cloudflare announced during Speed Week.

This is a very exciting project to be part of, and doubly exciting to see the results of, but during the course of the project, we had a number of interesting realisations about NGINX: the HTTP oriented server onto which Cloudflare currently deploys its software infrastructure. We quickly became certain that our Enhanced HTTP/2 Prioritization project could not achieve even moderate success if the internal workings of NGINX were not changed.

Due to these realisations we embarked upon a number of significant changes to the internal structure of NGINX in parallel to the work on the core prioritization product. This blog post describes the motivation behind the structural changes, how we approached them, and what impact they had. We also identify additional changes that we plan to add to our roadmap, which we hope will improve performance further.

Background

Enhanced HTTP/2 Prioritization aims to do one thing to web traffic flowing between a client and a server: it provides a means to shape the many HTTP/2 streams as they flow from upstream (server or origin side) into a single HTTP/2 connection that flows downstream (client side).

Enhanced HTTP/2 Prioritization allows site owners and the Cloudflare edge systems to dictate the rules about how various objects should combine into the single HTTP/2 connection: whether a particular object should have priority and dominate that connection and reach the client as soon as possible, or whether a group of objects should evenly share the capacity of the connection and put more emphasis on parallelism.

As a result, Enhanced HTTP/2 Prioritization allows site owners to tackle two problems that exist between a client and a server: how to control precedence and ordering of objects, and: how to make the best use of a limited connection resource, which may be constrained by a number of factors such as bandwidth, volume of traffic and CPU workload at the various stages on the path of the connection.

What did we see?

The key to prioritisation is being able to compare two or more HTTP/2 streams in order to determine which one’s frame is to go down the pipe next. The Enhanced HTTP/2 Prioritization project necessarily drew us into the core NGINX codebase, as our intention was to fundamentally alter the way that NGINX compared and queued HTTP/2 data frames as they were written back to the client.

Very early in the analysis phase, as we rummaged through the NGINX internals to survey the site of our proposed features, we noticed a number of shortcomings in the structure of NGINX itself, in particular: how it moved data from upstream (server side) to downstream (client side) and how it temporarily stored (buffered) that data in its various internal stages. The main conclusion of our early analysis of NGINX was that it largely failed to give the stream data frames any ‘proximity’. Either streams were processed in the NGINX HTTP/2 layer in isolated succession or frames of different streams spent very little time in the same place: a shared queue for example. The net effect was a reduction in the opportunities for useful comparison.

We coined a new, barely scientific but useful measurement: Potential, to describe how effectively the Enhanced HTTP/2 Prioritization strategies (or even the default NGINX prioritization) can be applied to queued data streams. Potential is not so much a measurement of the effectiveness of prioritization per se, that metric would be left for later on in the project, it is more a measurement of the levels of participation during the application of the algorithm. In simple terms, it considers the number of streams and frames thereof that are included in an iteration of prioritization, with more streams and more frames leading to higher Potential.

What we could see from early on was that by default, NGINX displayed low Potential: rendering prioritization instructions from either the browser, as is the case in the traditional HTTP/2 prioritization model, or from our Enhanced HTTP/2 Prioritization product, fairly useless.

What did we do?

With the goal of improving the specific problems related to Potential, and also improving general throughput of the system, we identified some key pain points in NGINX. These points, which will be described below, have either been worked on and improved as part of our initial release of Enhanced HTTP/2 Prioritization, or have now branched out into meaningful projects of their own that we will put engineering effort into over the course of the next few months.

HTTP/2 frame write queue reclamation

Write queue reclamation was successfully shipped with our release of Enhanced HTTP/2 Prioritization and ironically, it wasn’t a change made to the original NGINX, it was in fact a change made against our Enhanced HTTP/2 Prioritization implementation when we were part way through the project, and it serves as a good example of something one may call: conservation of data, which is a good way to increase Potential.

Similar to the original NGINX, our Enhanced HTTP/2 Prioritization algorithm will place a cohort of HTTP/2 data frames into a write queue as a result of an iteration of the prioritization strategies being applied to them. The contents of the write queue are destined to be written to the downstream TLS layer. Also similar to the original NGINX, the write queue may only be partially written to the TLS layer due to back-pressure from a network connection that has temporarily reached write capacity.

NGINX structural enhancements for HTTP/2 performance

Early on in our project, if the write queue was only partially written to the TLS layer, we would simply leave the frames in the write queue until the backlog was cleared, then we would re-attempt to write that data to the network in a future write iteration, just like the original NGINX.

The original NGINX takes this approach because the write queue is the only place that waiting data frames are stored. However, in our NGINX modified for Enhanced HTTP/2 Prioritization, we have a unique structure that the original NGINX lacks: per-stream data frame queues where we temporarily store data frames before our prioritization algorithms are applied to them.

We came to the realisation that in the event of a partial write, we were able to restore the unwritten frames back into their per-stream queues. If it was the case that a subsequent data cohort arrived behind the partially unwritten one, then the previously unwritten frames could participate in an additional round of prioritization comparisons, thus raising the Potential of our algorithms.

The following diagram illustrates this process:

NGINX structural enhancements for HTTP/2 performance

We were very pleased to ship Enhanced HTTP/2 Prioritization with the reclamation feature included, as this single enhancement greatly increased Potential and made up for the fact that we had to withhold the next enhancement for Speed Week due to its delicacy.

HTTP/2 frame write event re-ordering

In Cloudflare infrastructure, we map the many streams of a single HTTP/2 connection from the eyeball to multiple HTTP/1.1 connections to the upstream Cloudflare control plane.

As a note: it may seem counterintuitive that we downgrade protocols like this, and it may seem doubly counterintuitive when I reveal that we also disable HTTP keepalive on these upstream connections, resulting in only one transaction per connection, however this arrangement offers a number of advantages, particularly in the form of improved CPU workload distribution.

When NGINX monitors its upstream HTTP/1.1 connections for read activity, it may detect readability on many of those connections and process them all in a batch. However, within that batch, each of the upstream connections is processed sequentially, one at a time, from start to finish: from HTTP/1.1 connection read, to framing in the HTTP/2 stream, to HTTP/2 connection write to the TLS layer.

The existing NGINX workflow is illustrated in this diagram:

NGINX structural enhancements for HTTP/2 performance

By committing each stream’s frames to the TLS layer one stream at a time, many frames may pass entirely through the NGINX system before backpressure on the downstream connection allows a queue of frames to build up; it is only such a queue that gives frames the proximity needed for prioritization logic to be applied. This negatively impacts Potential and reduces the effectiveness of prioritization.

The Cloudflare Enhanced HTTP/2 Prioritization modified NGINX aims to re-arrange the internal workflow described above into the following model:

NGINX structural enhancements for HTTP/2 performance

Although we continue to frame upstream data into HTTP/2 data frames in separate iterations for each upstream connection, we no longer commit these frames to a single write queue within each iteration; instead, we arrange the frames into the per-stream queues described earlier. We then post a single event to the end of the per-connection iterations, and perform the prioritization, queuing and writing of the HTTP/2 data frames of all streams in that single event.

This single event finds the cohort of data conveniently stored in their respective per-stream queues, all in close proximity, which greatly increases the Potential of the Edge Prioritization algorithms.

In a form closer to actual code, the core of this modification looks a bit like this:

ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer) ) {
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream->frames,
                               buffer);
    }

    ngx_http_v2_prioritise(h2_conn->queue,
                           h2_stream->frames);

    ngx_http_v2_write_queue(h2_conn->queue);
}

To this:

ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer) ) {
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream->frames,
                               buffer);
    }

    ngx_list_add(h2_conn->active_streams, h2_stream);

    ngx_call_once_async(ngx_http_v2_write_streams, h2_conn);
}

ngx_http_v2_write_streams(ngx_http_v2_connection *h2_conn)
{
    ngx_http_v2_stream *h2_stream;

    while ( ! ngx_list_empty(h2_conn->active_streams)) {
        h2_stream = ngx_list_pop(h2_conn->active_streams);

        ngx_http_v2_prioritise(h2_conn->queue,
                               h2_stream->frames);
    }

    ngx_http_v2_write_queue(h2_conn->queue);
}

There is a high level of risk in this modification, for even though it is remarkably small, we are taking the well established and debugged event flow in NGINX and switching it around to a significant degree. Like taking a number of Jenga pieces out of the tower and placing them in another location, we risk: race conditions, event misfires and event blackholes leading to lockups during transaction processing.

Because of this level of risk, we did not release this change in its entirety during Speed Week, but we will continue to test and refine it for future release.

Upstream buffer partial re-use

NGINX has an internal buffer region to store connection data it reads from upstream. To begin with, the entirety of this buffer is Ready for use. When data is read from upstream into the Ready buffer, the part of the buffer that holds the data is passed to the downstream HTTP/2 layer. Since HTTP/2 takes responsibility for that data, that portion of the buffer is marked as: Busy and it will remain Busy for as long as it takes for the HTTP/2 layer to write the data into the TLS layer, which is a process that may take some time (in computer terms!).

During this gulf of time, the upstream layer may continue to read more data into the remaining Ready sections of the buffer and continue to pass that incremental data to the HTTP/2 layer until there are no Ready sections available.

When Busy data is finally finished in the HTTP/2 layer, the buffer space that contained that data is then marked as: Free.

The process is illustrated in this diagram:

NGINX structural enhancements for HTTP/2 performance

You may ask: When the leading part of the upstream buffer is marked as Free (in blue in the diagram), even though the trailing part of the upstream buffer is still Busy, can the Free part be re-used for reading more data from upstream?

The answer to that question is: NO

Because just a small part of the buffer is still Busy, NGINX will refuse to allow any of the entire buffer space to be re-used for reads. Only when the entirety of the buffer is Free, can the buffer be returned to the Ready state and used for another iteration of upstream reads. So in summary, data can be read from upstream into Ready space at the tail of the buffer, but not into Free space at the head of the buffer.

This is a shortcoming in NGINX and is clearly undesirable as it interrupts the flow of data into the system. We asked: what if we could cycle through this buffer region and re-use parts at the head as they became Free? We seek to answer that question in the near future by testing the following buffering model in NGINX:

NGINX structural enhancements for HTTP/2 performance

TLS layer Buffering

On a number of occasions in the above text, I have mentioned the TLS layer, and how the HTTP/2 layer writes data into it. In the OSI network model, TLS sits just below the protocol (HTTP/2) layer, and in many consciously designed networking software systems such as NGINX, the software interfaces are separated in a way that mimics this layering.

The NGINX HTTP/2 layer will collect the current cohort of data frames and place them in priority order into an output queue, then submit this queue to the TLS layer. The TLS layer makes use of a per-connection buffer to collect HTTP/2 layer data before performing the actual cryptographic transformations on that data.

The purpose of the buffer is to give the TLS layer a more meaningful quantity of data to encrypt, for if the buffer was too small, or the TLS layer simply relied on the units of data from the HTTP/2 layer, then the overhead of encrypting and transmitting the multitude of small blocks may negatively impact system throughput.

The following diagram illustrates this undersize buffer situation:

NGINX structural enhancements for HTTP/2 performance

If the TLS buffer is too big, then an excessive amount of HTTP/2 data will be committed to encryption and if it failed to write to the network due to backpressure, it would be locked into the TLS layer and not be available to return to the HTTP/2 layer for the reclamation process, thus reducing the effectiveness of reclamation. The following diagram illustrates this oversize buffer situation:

NGINX structural enhancements for HTTP/2 performance

In the coming months, we will embark on a process to attempt to find the ‘goldilocks’ spot for TLS buffering: To size the TLS buffer so it is big enough to maintain efficiency of encryption and network writes, but not so big as to reduce the responsiveness to incomplete network writes and the efficiency of reclamation.

Thank you – Next!

The Enhanced HTTP/2 Prioritization project has the lofty goal of fundamentally re-shaping how we send traffic from the Cloudflare edge to clients, and as the results of our testing and feedback from some of our customers show, we have certainly achieved that! However, one of the most important lessons we took away from the project was the critical role that the internal data flow within our NGINX software infrastructure plays in the traffic observed by our end users. We found that changing a few lines of (albeit critical) code could have significant impacts on the effectiveness and performance of our prioritization algorithms. Another positive outcome is that in addition to improving HTTP/2, we are looking forward to carrying our newfound skills and lessons learned over to HTTP/3 over QUIC.

We are eager to share our modifications to NGINX with the community, so we have opened this ticket, through which we will discuss upstreaming the event re-ordering change and the buffer partial re-use change with the NGINX team.

As Cloudflare continues to grow, our requirements on our software infrastructure also shift. Cloudflare has already moved beyond proxying of HTTP/1 over TCP to support termination and Layer 3 and 4 protection for any UDP and TCP traffic. Now we are moving on to other technologies and protocols such as QUIC and HTTP/3, and full proxying of a wide range of other protocols such as messaging and streaming media.

For these endeavours we are looking at new ways to answer questions on topics such as: scalability, localised performance, wide scale performance, introspection and debuggability, release agility, maintainability.

If you would like to help us answer these questions and know a bit about: hardware and software scalability, network programming, asynchronous event and futures based software design, TCP, TLS, QUIC, HTTP, RPC protocols, Rust or maybe something else, then have a look here.

One more thing… new Speed Page

Post Syndicated from Andrew Galloni original https://blog.cloudflare.com/new-speed-page/

Congratulations on making it through Speed Week. In the last week, Cloudflare has: described how our global network speeds up the Internet, launched an HTTP/2 prioritisation model that will improve web experiences on all browsers, launched an image resizing service which will deliver the optimal image to every device, optimised live video delivery, detailed how to stream progressive images so that they render twice as fast using the flexibility of our new HTTP/2 prioritisation model, and finally, prototyped a new over-the-wire format for JavaScript that could improve application start-up performance, especially on mobile devices. As a bonus, we’re also rolling out one more new feature: “TCP Turbo”, which automatically chooses the TCP settings to further accelerate your website.

As a company, we want to help every one of our customers improve web experiences. The growth of Cloudflare, along with the increase in features, has often made simple questions difficult to answer:

  • How fast is my website?
  • How should I be thinking about performance features?
  • How much faster would the site be if I were to enable a particular feature?

This post will describe the exciting changes we have made to the Speed Page on the Cloudflare dashboard to give our customers a much clearer understanding of how their websites are performing and how they can be made even faster. The new Speed Page consists of:

  • A visual comparison of your website loading on Cloudflare, with caching enabled, compared to connecting directly to the origin.
  • The measured improvement expected if any performance feature is enabled.
  • A report describing how fast your website is on desktop and mobile.

We want to simplify the complexity of making web experiences fast and give our customers control. Take a look – we hope you like it.

Why do fast web experiences matter?

Customer experience : No one likes slow service. Imagine if you go to a restaurant and the service is slow, especially when you arrive; you are not likely to go back or recommend it to your friends. It turns out the web works in the same way and Internet customers are even more demanding. As many as 79% of customers who are “dissatisfied” with a website’s performance are less likely to buy from that site again.

Engagement and Revenue : There are many studies explaining how speed affects customer engagement, bounce rates and revenue.

Reputation : There is also brand reputation to consider, as customers associate an online experience with the brand. A study found that for 66% of the sample, website performance influenced their impression of the company.

Diversity : Mobile traffic has grown to be larger than its desktop counterpart over the last few years. Mobile customers have become increasingly demanding and expect seamless Internet access regardless of location.

Mobile provides a new set of challenges that includes the diversity of device specifications. When testing, be aware that the average mobile device is significantly less capable than the top-of-the-range models. For example, there can be orders-of-magnitude disparity in the time different mobile devices take to run JavaScript. Another challenge is the variance in mobile performance, as customers move from a strong, high quality office network to mobile networks of different speeds (3G/5G), and quality within the same browsing session.

New Speed Page

There is compelling evidence that a faster web experience is important for anyone online. Most of the major studies involve the largest tech companies, who have whole teams dedicated to measuring and improving web experiences for their own services. At Cloudflare we are on a mission to help build a better and faster Internet for everyone – not just the selected few.

Delivering fast web experiences is not a simple matter. That much is clear.
To know what to send and when requires a deep understanding of every layer of the stack, from TCP tuning, protocol-level prioritisation and content delivery formats through to the intricate mechanics of browser rendering. You will also need a global network that strives to be within 10 ms of every Internet user. The intrinsic value of such a network should be clear to everyone. Cloudflare has this network, but it also offers many additional performance features.

With the Speed Page redesign, we are emphasizing the performance benefits of using Cloudflare and the additional improvements possible from our features.

The de facto standard for measuring website performance has been WebPageTest. Having its creator in-house at Cloudflare encouraged us to use it as the basis for website performance measurement. So, what is the easiest way to understand how a web page loads? A list of statistics does not paint a full picture of actual user experience. One of the cool features of WebPageTest is that it can generate a filmstrip of screen snapshots taken during a web page load, enabling us to quantify how a page loads, visually. This view makes it significantly easier to determine how long the page is blank for, and how long it takes for the most important content to render. Being able to look at the results in this way provides the ability to empathise with the user.

How fast on Cloudflare?

After moving your website to Cloudflare, you may have asked: How fast did this decision make my website? Well, now we provide the answer:

Comparison of website performance using Cloudflare. 

As well as the increase in speed, we provide filmstrips of before and after, so that it is easy to compare and understand how differently a user will experience the website. If our tests are unable to reach your origin and you are already set up on Cloudflare, we will test with development mode enabled, which disables caching and minification.

Site performance statistics

How can we measure the user experience of a website?

Traditionally, page load was the important metric. Page load is a technical measurement used by browser vendors that has no bearing on the presentation or usability of a page. The metric reports on how long it takes not only to load the important content but also all of the 3rd party content (social network widgets, advertising, tracking scripts etc.). A user may very well not see anything until after all the page content has loaded, or they may be able to interact with a page immediately, while content continues to load.

A user will not decide whether a page is fast by a single measure or moment. A user will perceive how fast a website is from a combination of factors:

  • when they see any response
  • when they see the content they expect
  • when they can interact with the page
  • when they can perform the task they intended

Experience has shown that if you focus on one measure, it will likely be to the detriment of the others.

Importance of Visual response

If an impatient user navigates to your site and sees no content for several seconds or no valuable content, they are likely to get frustrated and leave. The paint timing spec defines a set of paint metrics, when content appears on a page, to measure the key moments in how a user perceives performance.

First Contentful Paint (FCP) is the time when the browser first renders any DOM content.

First Meaningful Paint (FMP) is the point in time when the page’s “primary” content appears on the screen. This metric should relate to what the user has come to the site to see and is designed as the point in time when the largest visible layout change happens.

Speed Index attempts to quantify the value of the filmstrip rather than using a single paint timing. Speed Index measures the rate at which content is displayed – essentially the area above the visual-completeness curve. In the chart below from our progressive image feature, you can see that reaching 80% visual completeness happens much earlier for the parallelised (red) load than for the regular (blue) one.

Chart: visual completeness over time for the parallelised (red) and regular (blue) image loads
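
To make the “area above the curve” idea concrete, here is a small sketch that approximates a Speed Index from filmstrip samples; the sample values and interval are made up.

// Sketch: Speed Index approximated as the area above the visual-completeness curve,
// using snapshots taken at a fixed interval (completeness expressed as 0..1).
fun speedIndexMs(visualCompleteness: List<Double>, intervalMs: Double): Double =
    visualCompleteness.sumOf { completeness -> (1.0 - completeness) * intervalMs }

fun main() {
    // Snapshots every 100 ms; the page reaches 100% visual completeness after ~1s.
    val samples = listOf(0.0, 0.0, 0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 1.0)
    println(speedIndexMs(samples, 100.0)) // lower is better
}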

Importance of interactivity

The same impatient user is now happy that the content they want to see has appeared. They will still become frustrated if they are unable to interact with the site.
Time to Interactive is the time it takes for content to be rendered and for the page to be ready to receive input from the user. Technically, this is defined as the point when the browser’s main processing thread has been idle for several seconds after First Meaningful Paint.

The Speed Tab displays these key metrics for mobile and desktop.

How much faster on Cloudflare?

The Cloudflare Dashboard provides a list of performance features which can, admittedly, be both confusing and daunting. What would be the benefit of turning on Rocket Loader, and on which performance metrics will it have the most impact? If you upgrade to Pro, what will be the value of the enhanced HTTP/2 prioritisation? The optimisation section answers these questions.

Tests are run with each performance feature turned on and off. The values of the appropriate performance metrics for these tests are displayed, along with the improvement. You can enable or upgrade the feature from this view. Here are a few examples:

If Rocket Loader were enabled for this website, the render-blocking JavaScript would be deferred causing first paint time to drop from 1.25s to 0.81s – an improvement of 32% on desktop.

Image heavy sites do not perform well on slow mobile connections. If you enable Mirage, your customers on 3G connections would see meaningful content 1s sooner – an improvement of 29.4%.

So how about our new features?

We tested the enhanced HTTP/2 prioritisation feature on an Edge browser on desktop and saw meaningful content display 2s sooner – an improvement of 64%.

This is a more interesting result taken from the blog example used to illustrate the progressive image streaming. At first glance the improvement of 29% in speed index is good. The filmstrip comparison shows a more significant difference. In this case the page with no images shown is already 43% visually complete for both scenarios after 1.5s. At 2.5s the difference is 77% compared to 50%.

This is a great example of how metrics do not tell the full story. They cannot completely replace viewing the page loading flow and understanding what is important for your site.

How to try

This is our first iteration of the new Speed Page and we are eager to get your feedback. We will be rolling this out to beta customers who are interested in seeing how their sites perform. To be added to the queue for activation of the new Speed Page please click on the banner on the overview page,

or click on the banner on the existing Speed Page.

Faster script loading with BinaryAST?

Post Syndicated from Ingvar Stepanyan original https://blog.cloudflare.com/binary-ast/

Faster script loading with BinaryAST?

JavaScript Cold starts

Faster script loading with BinaryAST?

The performance of applications on the web platform is becoming increasingly bottlenecked by the startup (load) time. Large amounts of JavaScript code are required to create rich web experiences that we’ve become used to. When we look at the total size of JavaScript requested on mobile devices from HTTPArchive, we see that an average page loads 350KB of JavaScript, while 10% of pages go over the 1MB threshold. The rise of more complex applications can push these numbers even higher.

While caching helps, popular websites regularly release new code, which makes cold start (first load) times particularly important. With browsers moving to separate caches for different domains to prevent cross-site leaks, the importance of cold starts is growing even for popular subresources served from CDNs, as they can no longer be safely shared.

Usually, when talking about cold start performance, the primary factor considered is raw download speed. However, on modern interactive pages one of the other big contributors to cold starts is JavaScript parsing time. This might seem surprising at first, but makes sense -- before starting to execute the code, the engine has to first parse the fetched JavaScript, make sure it doesn’t contain any syntax errors and then compile it to the initial bytecode. As networks become faster, parsing and compilation of JavaScript could become the dominant factor.

Faster script loading with BinaryAST?

The device capability (CPU or memory performance) is the most important factor in the variance of JavaScript parsing times and, correspondingly, the time to application start. A 1MB JavaScript file will take on the order of 100 ms to parse on a modern desktop or high-end mobile device but can take over a second on an average phone (Moto G4).

A more detailed post on the overall cost of parsing, compiling and execution of JavaScript shows how the JavaScript boot time can vary on different mobile devices. For example, in the case of news.google.com, it can range from 4s on a Pixel 2 to 28s on a low-end device.

While engines continuously improve raw parsing performance, with V8 in particular doubling it over the past year, as well as moving more things off the main thread, parsers still have to do lots of potentially unnecessary work that consumes memory, battery and might delay the processing of the useful resources.

The “BinaryAST” Proposal

This is where BinaryAST comes in. BinaryAST is a new over-the-wire format for JavaScript proposed and actively developed by Mozilla that aims to speed up parsing while keeping the semantics of the original JavaScript intact. It does so by using an efficient binary representation for code and data structures, as well as by storing and providing extra information to guide the parser ahead of time.

The name comes from the fact that the format stores the JavaScript source as an AST encoded into a binary file. The specification lives at tc39.github.io/proposal-binary-ast and is being worked on by engineers from Mozilla, Facebook, Bloomberg and Cloudflare.

“Making sure that web applications start quickly is one of the most important, but also one of the most challenging parts of web development. We know that BinaryAST can radically reduce startup time, but we need to collect real-world data to demonstrate its impact. Cloudflare’s work on enabling use of BinaryAST with Cloudflare Workers is an important step towards gathering this data at scale.”

Till Schneidereit, Senior Engineering Manager, Developer Technologies
Mozilla

Parsing JavaScript

For regular JavaScript code to execute in a browser, the source is parsed into an intermediate representation known as an AST that describes the syntactic structure of the code. This representation can then be compiled into bytecode or native machine code for execution.

A simple example of adding two numbers can be represented in an AST as:

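As an illustrative sketch (using ESTree-style node names, rather than the exact node shapes the BinaryAST tooling uses internally), the expression 1 + 2 becomes a small tree:

{
  "type": "ExpressionStatement",
  "expression": {
    "type": "BinaryExpression",
    "operator": "+",
    "left": { "type": "Literal", "value": 1 },
    "right": { "type": "Literal", "value": 2 }
  }
}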

Parsing JavaScript is not an easy task; no matter which optimisations you apply, it still requires reading the entire text file char by char, while tracking extra context for syntactic analysis.

The goal of BinaryAST is to reduce the complexity and the amount of work the browser parser has to do overall, by providing additional information and context at the time and place where the parser needs it.

To execute JavaScript delivered as BinaryAST, the engine only needs to decode the binary-encoded AST and compile it, skipping the text parsing stage entirely.

Another benefit of BinaryAST is that it makes it possible to parse only the critical code necessary for start-up, completely skipping over the unused bits. This can dramatically improve the initial loading time.


This post will now describe some of the challenges of parsing JavaScript in more detail, explain how the proposed format addresses them, and how we made it possible to run its encoder in Workers.

Hoisting

JavaScript relies on hoisting for all declarations -- variables, functions, classes. Hoisting is a property of the language that allows you to declare items after the point they’re syntactically used.

Let’s take the following example:

function f() {
	return g();
}

function g() {
	return 42;
}

Here, when the parser is looking at the body of f, it doesn’t know yet what g is referring to -- it could be an already existing global function or something declared further in the same file -- so it can’t finalise parsing of the original function and start the actual compilation.

BinaryAST fixes this by storing all the scope information and making it available upfront before the actual expressions.

The difference is easiest to see by comparing the initial AST and the enhanced AST in a JSON representation.

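As a rough sketch (the real format is binary and the field names here are purely illustrative), the enhanced AST carries the declared names up front, so the parser already knows about g when it reaches the body of f:

// Initial AST: the statements must be read before the scope is known.
{
  "type": "Script",
  "statements": [ /* declarations of f and g */ ]
}

// Enhanced AST: scope information is stored ahead of the statements.
{
  "type": "Script",
  "scope": { "declaredNames": ["f", "g"] },
  "statements": [ /* declarations of f and g; bodies can be skipped */ ]
}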

Lazy parsing

One common technique used by modern engines to improve parsing times is lazy parsing. It utilises the fact that lots of websites include more JavaScript than they actually need, especially for the start-up.

Working around this involves a set of heuristics that try to guess when any given function body in the code can be safely skipped by the parser initially and delayed for later. A common example of such a heuristic is immediately running the full parser for any function that is wrapped in parentheses:

(function(...

Such a prefix usually indicates that the following function is going to be an IIFE (immediately-invoked function expression), so the parser can assume that it will be compiled and executed ASAP, and wouldn’t benefit from being skipped over and delayed for later.

(function() {
	…
})();

These heuristics significantly improve the performance of the initial parsing and cold starts, but they’re not completely reliable or trivial to implement.

One of the reasons is the same as in the previous section -- even with lazy parsing, you still need to read the contents, analyse them and store additional scope information for the declarations.

Another reason is that the JavaScript specification requires reporting any syntax errors immediately during load time, and not when the code is actually executed. A class of these errors, called early errors, covers mistakes like usage of reserved words in invalid contexts, strict mode violations, variable name clashes and more. All of these checks require not only lexing the JavaScript source, but also tracking extra state even during lazy parsing.

Having to do such extra work means you need to be careful about marking functions as lazy too eagerly, especially if they actually end up being executed during the page load. Otherwise you’re making cold start costs even worse, as every function that is erroneously marked as lazy needs to be parsed twice -- once by the lazy parser and then again by the full one.

Because BinaryAST is meant to be an output format of other tools such as Babel, TypeScript and bundlers such as Webpack, the browser parser can rely on the JavaScript being already analysed and verified by the initial parser. This allows it to skip function bodies completely, making lazy parsing essentially free.

It reduces the cost of completely unused code -- while including it is still a problem in terms of network bandwidth (don’t do this!), at least it no longer affects parsing times. These benefits apply equally to code that is used later in the page lifecycle (for example, invoked in response to user actions), but is not required during startup.

Last but not least, a benefit of this approach is that BinaryAST encodes lazy annotations as part of the format, giving tools and developers direct and full control over the heuristics. For example, a tool targeting the Web platform or a framework CLI can use its domain-specific knowledge to mark some event handlers as lazy or eager depending on the context and the event type.

Avoiding ambiguity in parsing

Using a text format for a programming language is great for readability and debugging, but it’s not the most efficient representation for parsing and execution.

For example, parsing low-level types like numbers, booleans and even strings from text requires extra analysis and computation, which is unnecessary when you can store them as native binary-encoded values in the first place and read them directly on the other side.
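As a tiny illustration of the general idea (this snippet is not part of the BinaryAST format itself), compare parsing a number out of text with reading a binary-encoded one:

// Text: the parser has to scan characters and handle signs, decimal points, exponents...
const fromText = Number.parseFloat("3.14159");

// Binary: the value sits in a fixed-width field and is read back directly.
const view = new DataView(new ArrayBuffer(8));
view.setFloat64(0, 3.14159);
const fromBinary = view.getFloat64(0); // same value, no text scanning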

Another problem is an ambiguity in the grammar itself. It was already an issue in the ES5 world, but could usually be resolved with some extra bookkeeping based on the previously seen tokens. However, in ES6+ there are productions that can be ambiguous all the way through until they’re parsed completely.

For example, a token sequence like:

(a, {b: c, d}, [e = 1])...

can start either a parenthesized comma expression with nested object and array literals and an assignment:

(a, {b: c, d}, [e = 1]); // it was an expression

or a parameter list of an arrow expression function with nested object and array patterns and a default value:

(a, {b: c, d}, [e = 1]) => … // it was a parameter list

Both representations are perfectly valid, but have completely different semantics, and you can’t know which one you’re dealing with until you see the final token.

To work around this, parsers usually have to either backtrack, which can easily get exponentially slow, or parse contents into intermediate node types that are capable of holding both expressions and patterns, with a subsequent conversion. The latter approach preserves linear performance, but makes the implementation more complicated and requires preserving more state.

In the BinaryAST format this issue doesn’t exist in the first place because the parser sees the type of each node before it even starts parsing its contents.

Cloudflare Implementation

Currently, the format is still in flux, but the very first version of the client-side implementation was released under a flag in Firefox Nightly several months ago. Keep in mind this is only an initial unoptimised prototype, and there are already several experiments changing the format to provide improvements to both size and parsing performance.

On the producer side, the reference implementation lives at github.com/binast/binjs-ref. Our goal was to take this reference implementation and consider how we would deploy it at Cloudflare scale.

If you dig into the codebase, you will notice that it currently consists of two parts.

One is the encoder itself, which is responsible for taking a parsed AST, annotating it with scope and other relevant information, and writing out the result in one of the currently supported formats. This part is written in Rust and is fully native.

Another part is what produces that initial AST -- the parser. Interestingly, unlike the encoder, it’s implemented in JavaScript.

Unfortunately, there is currently no battle-tested native JavaScript parser with an open API, let alone one implemented in Rust. There have been a few attempts, but, given the complexity of the JavaScript grammar, it’s better to wait a bit and make sure they’re well-tested before incorporating one into the production encoder.

On the other hand, over the last few years the JavaScript ecosystem grew to extensively rely on developer tools implemented in JavaScript itself. In particular, this gave a push to rigorous parser development and testing. There are several JavaScript parser implementations that have been proven to work on thousands of real-world projects.

With that in mind, it makes sense that the BinaryAST implementation chose to use one of them -- in particular, Shift -- and integrated it with the Rust encoder, instead of attempting to use a native parser.

Connecting Rust and JavaScript

Integration is where things get interesting.

Rust is a native language that can compile to an executable binary, but JavaScript requires a separate engine to be executed. To connect them, we need some way to transfer data between the two without sharing the memory.

Initially, the reference implementation generated JavaScript code with an embedded input on the fly, passed it to Node.js and then read the output when the process had finished. That code contained a call to the Shift parser with an inlined input string and produced the AST back in a JSON format.

This doesn’t scale well when parsing lots of JavaScript files, so the first thing we did was transform the Node.js side into a long-lived daemon. Now Rust could spawn the required Node.js process just once and keep passing inputs into it and getting responses back as individual messages.

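A minimal sketch of such a daemon (assuming the shift-parser npm package for the Shift parser mentioned above; the actual protocol used by binjs-ref differs) might read one JSON message per line on stdin and reply with the parsed AST on stdout:

const readline = require("readline");
const { parseScript } = require("shift-parser");

const rl = readline.createInterface({ input: process.stdin });

rl.on("line", line => {
  const { source } = JSON.parse(line); // one request per line: {"source": "..."}
  try {
    process.stdout.write(JSON.stringify({ ast: parseScript(source) }) + "\n");
  } catch (err) {
    process.stdout.write(JSON.stringify({ error: err.message }) + "\n");
  }
});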

Running in the cloud

While the Node.js solution worked fairly well after these optimisations, shipping both a Node.js instance and a native bundle to production requires some effort. It’s also potentially risky and requires manual sandboxing of both processes to make sure we don’t accidentally start executing malicious code.

On the other hand, the only thing we needed from Node.js was the ability to run the JavaScript parser code. And we already have an isolated JavaScript engine running in the cloud -- Cloudflare Workers! By additionally compiling the native Rust encoder to Wasm (which is quite easy with the native toolchain and wasm-bindgen), we can even run both parts of the code in the same process, making cold starts and communication much faster than in the previous model.

Optimising data transfer

The next logical step is to reduce the overhead of data transfer. JSON worked fine for communication between separate processes, but with a single process we should be able to retrieve the required bits directly from the JavaScript-based AST.

To attempt this, first of all, we needed to move away from the direct JSON usage to something more generic that would allow us to support various import formats. The Rust ecosystem already has an amazing serialisation framework for that -- Serde.

Aside from allowing us to be more flexible in regard to the inputs, rewriting to Serde helped an existing native use case too. Now, instead of parsing JSON into an intermediate representation and then walking through it, all the native typed AST structures can be deserialized directly from the stdout pipe of the Node.js process in a streaming manner. This significantly improved both the CPU usage and memory pressure.

But there is one more thing we can do: instead of serializing and deserializing from an intermediate format (let alone, a text format like JSON), we should be able to operate [almost] directly on JavaScript values, saving memory and repetitive work.

How is this possible? wasm-bindgen provides a type called JsValue that stores a handle to an arbitrary value on the JavaScript side. This handle internally contains an index into a predefined array.

Each time a JavaScript value is passed to the Rust side as a result of a function call or a property access, it’s stored in this array and an index is sent to Rust. The next time Rust wants to do something with that value, it passes the index back and the JavaScript side retrieves the original value from the array and performs the required operation.
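Conceptually, the JavaScript side of this mechanism boils down to a handle table like the following (a simplified sketch; the real wasm-bindgen glue also reserves a few slots and recycles freed indices):

const heap = [];

function addHeapObject(value) {
  heap.push(value);
  return heap.length - 1; // this index is what actually crosses the Wasm boundary
}

function getObject(idx) {
  return heap[idx]; // Rust hands the index back whenever it needs the value
}

// For example, a property access requested from the Rust side returns a new handle:
function getProperty(objIdx, keyIdx) {
  return addHeapObject(getObject(objIdx)[getObject(keyIdx)]);
}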

By reusing this mechanism, we could implement a Serde deserializer that requests only the required values from the JS side and immediately converts them to their native representation. It’s now open-sourced under https://github.com/cloudflare/serde-wasm-bindgen.

At first, we got much worse performance out of this, due to the overhead of more frequent calls between 1) Wasm and JavaScript -- SpiderMonkey has improved these recently, but other engines still lag behind -- and 2) JavaScript and C++, which also can’t be optimised well in most engines.

The JavaScript <-> C++ overhead comes from the usage of TextEncoder to pass strings between JavaScript and Wasm in wasm-bindgen, and, indeed, it showed up as the highest in the benchmark profiles. This wasn’t surprising -- after all, strings can appear not only in the value payloads, but also in property names, which have to be serialized and sent between JavaScript and Wasm over and over when using a generic JSON-like structure.

Luckily, because our deserializer doesn’t have to be compatible with JSON anymore, we can use our knowledge of Rust types and cache all the serialized property names as JavaScript value handles just once, and then keep reusing them for further property accesses.

This, combined with some changes to wasm-bindgen which we have upstreamed, allows our deserializer to be up to 3.5x faster in benchmarks than the original Serde support in wasm-bindgen, while saving ~33% off the resulting code size. Note that for string-heavy data structures it might still be slower than the current JSON-based integration, but the situation is expected to improve over time once the reference types proposal lands natively in Wasm.

After implementing and integrating this deserializer, we used the wasm-pack plugin for Webpack to build a Worker with both Rust and JavaScript parts combined and shipped it to some test zones.

Show me the numbers

Keep in mind that this proposal is in very early stages, and current benchmarks and demos are not representative of the final outcome (which should improve numbers much further).

As mentioned earlier, BinaryAST can mark functions that should be parsed lazily ahead of time. By using different levels of lazification in the encoder (https://github.com/binast/binjs-ref/blob/b72aff7dac7c692a604e91f166028af957cdcda5/crates/binjs_es6/src/lazy.rs#L43) and running tests against some popular JavaScript libraries, we found the following speed-ups.

Level 0 (no functions are lazified)

With lazy parsing disabled in both parsers we got a raw parsing speed improvement of between 3 and 10%.

| Name | Source size (kb) | JavaScript parse time (average ms) | BinaryAST parse time (average ms) | Diff (%) |
|------|------------------|------------------------------------|-----------------------------------|----------|
| React | 20 | 0.403 | 0.385 | -4.56 |
| D3 (v5) | 240 | 11.178 | 10.525 | -6.018 |
| Angular | 180 | 6.985 | 6.331 | -9.822 |
| Babel | 780 | 21.255 | 20.599 | -3.135 |
| Backbone | 32 | 0.775 | 0.699 | -10.312 |
| wabtjs | 1720 | 64.836 | 59.556 | -8.489 |
| Fuzzball (1.2) | 72 | 3.165 | 2.768 | -13.383 |

Level 3 (functions up to 3 levels deep are lazified)

But with lazification set to skip nested functions up to 3 levels deep, we see much more dramatic improvements in parsing time, between 90 and 97%. As mentioned earlier in the post, BinaryAST makes lazy parsing essentially free by completely skipping over the marked functions.

| Name | Source size (kb) | Parse time (average ms) | BinaryAST parse time (average ms) | Diff (%) |
|------|------------------|-------------------------|-----------------------------------|----------|
| React | 20 | 0.407 | 0.032 | -92.138 |
| D3 (v5) | 240 | 11.623 | 0.224 | -98.073 |
| Angular | 180 | 7.093 | 0.680 | -90.413 |
| Babel | 780 | 21.100 | 0.895 | -95.758 |
| Backbone | 32 | 0.898 | 0.045 | -94.989 |
| wabtjs | 1720 | 59.802 | 1.601 | -97.323 |
| Fuzzball (1.2) | 72 | 2.937 | 0.089 | -96.970 |

All the numbers are from manual tests on a Linux x64 Intel i7 with 16 GB of RAM.

While these synthetic benchmarks are impressive, they are not representative of real-world scenarios. Normally you will use at least some of the loaded JavaScript during the startup. To check this scenario, we decided to test some realistic pages and demos on desktop and mobile Firefox and found speed-ups in page loads too.

For a sample application (https://github.com/cloudflare/binjs-demo, https://serve-binjs.that-test.site/) which weighed in at around 1.2 MB of JavaScript we got the following numbers for initial script execution:

| Device | JavaScript | BinaryAST |
|--------|------------|-----------|
| Desktop | 338ms | 314ms |
| Mobile (HTC One M8) | 2019ms | 1455ms |

Here is a video that will give you an idea of the improvement as seen by a user on mobile Firefox (in this case showing the entire page startup time):

The next step is to start gathering data on real-world websites, while improving the underlying format.

How do I test BinaryAST on my website?

We’ve open-sourced our Worker so that it could be installed on any Cloudflare zone: https://github.com/binast/binjs-ref/tree/cf-wasm.

One thing to be wary of currently is that, even though the result gets stored in the cache, the initial encoding is still an expensive process, and might easily hit CPU limits on any non-trivial JavaScript files and fall back to the unencoded variant. We are working to improve this situation by releasing the BinaryAST encoder as a separate feature with more relaxed limits in the next few days.

Meanwhile, if you want to play with BinaryAST on larger real-world scripts, an alternative option is to use a static binjs_encode tool from https://github.com/binast/binjs-ref to pre-encode JavaScript files ahead of time. Then, you can use a Worker from https://github.com/cloudflare/binast-cf-worker to serve the resulting BinaryAST assets when supported and requested by the browser.

On the client side, you’ll currently need to download Firefox Nightly, go to about:config and enable unrestricted BinaryAST support via the corresponding preferences.

Now, when opening a website with either of the Workers installed, Firefox will get BinaryAST instead of JavaScript automatically.

Summary

The amount of JavaScript in modern apps is presenting performance challenges for all consumers. Engine vendors are experimenting with different ways to improve the situation -- some are focusing on raw decoding performance, some on parallelizing operations to reduce overall latency, some are researching new optimised formats for data representation, and some are inventing and improving protocols for the network delivery.

No matter which one it is, we all have a shared goal of making the Web better and faster. On Cloudflare’s side, we’re always excited about collaborating with all the vendors and combining various approaches to make that goal closer with every step.

Live video just got more live: Introducing Concurrent Streaming Acceleration

Post Syndicated from Jon Levine original https://blog.cloudflare.com/introducing-concurrent-streaming-acceleration/

Today we’re excited to introduce Concurrent Streaming Acceleration, a new technique for reducing the end-to-end latency of live video on the web when using Stream Delivery.

Let’s dig into live-streaming latency, why it’s important, and what folks have done to improve it.

How “live” is “live” video?

Live streaming makes up an increasing share of video on the web. Whether it’s a TV broadcast, a live game show, or an online classroom, users expect video to arrive quickly and smoothly. And the promise of “live” is that the user is seeing events as they happen. But just how close to “real-time” is “live” Internet video?

Delivering live video on the Internet is still hard and adds lots of latency:

  1. The content source records video and sends it to an encoding server;
  2. The origin server transforms this video into a format like DASH, HLS or CMAF that can be delivered to millions of devices efficiently;
  3. A CDN is typically used to deliver encoded video across the globe;
  4. Client players decode the video and render it on the screen.

And all of this is under a time constraint: the whole process needs to happen in a few seconds, or video experiences will suffer. We call the total delay between when the video was shot and when it can be viewed on an end-user’s device the “end-to-end latency” (think of it as the time from the camera lens to your phone’s screen).

Traditional segmented delivery

Video formats like DASH, HLS, and CMAF work by splitting video into small files, called “segments”. A typical segment duration is 6 seconds.

If a client player needs to wait for a whole 6s segment to be encoded, sent through a CDN, and then decoded, it can be a long wait! It takes even longer if you want the client to build up a buffer of segments to protect against any interruptions in delivery. A typical player buffer for HLS is 3 segments:

Clients may have to buffer three 6-second chunks, introducing at least 18s of latency.

When you consider encoding delays, it’s easy to see why live streaming latency on the Internet has typically been about 20-30 seconds. We can do better.

Reduced latency with chunked transfer encoding

A natural way to solve this problem is to enable client players to start playing the chunks while they’re downloading, or even while they’re still being created. Making this possible requires a clever bit of cooperation to encode and deliver the files in a particular way, known as “chunked encoding.” This involves splitting up segments into smaller, bite-sized pieces, or “chunks”. Chunked encoding can typically bring live latency down to 5 or 10 seconds.

Confusingly, the word “chunk” is overloaded to mean two different things:

  1. CMAF or HLS chunks, which are small pieces of a segment (typically 1s) that are aligned on key frames
  2. HTTP chunks, which are just a way of delivering any file over the web

Chunked encoding splits segments into shorter chunks.

HTTP chunks are important because web clients have limited ability to process streams of data. Most clients can only work with data once they’ve received the full HTTP response, or at least a complete HTTP chunk. By using HTTP chunked transfer encoding, we enable video players to start parsing and decoding video sooner.

CMAF chunks are important so that decoders can actually play the bits that are in the HTTP chunks. Without encoding video in a careful way, decoders would have random bits of a video file that can’t be played.
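To make the client side concrete, here is a rough sketch (the URL is a placeholder) of how a player can consume an HTTP response incrementally with the Streams API instead of waiting for the complete segment:

async function streamSegment(url) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // `value` is a Uint8Array with the bytes received so far; it can be handed
    // to the decoder immediately instead of waiting for the full segment.
  }
}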

CDNs can introduce additional buffering

Chunked encoding with HLS and CMAF is growing in use across the web today. Part of what makes this technique great is that HTTP chunked encoding is widely supported by CDNs – it’s been part of the HTTP spec for 20 years.

CDN support is critical because it allows low-latency live video to scale up and reach audiences of thousands or millions of concurrent viewers – something that’s currently very difficult to do with other, non-HTTP based protocols.

Unfortunately, even if you enable chunking to optimise delivery, your CDN may be working against you by buffering the entire segment. To understand why, consider what happens when many people request a live segment at the same time:

If the file is already in cache, great! CDNs do a great job at delivering cached files to huge audiences. But what happens when the segment isn’t in cache yet? Remember – this is the typical request pattern for live video!

Typically, CDNs are able to “stream on cache miss” from the origin. That looks something like this:

But again – what happens when multiple people request the file at once? CDNs typically need to pull the entire file into cache before serving additional viewers:

Only one viewer can stream video, while other clients wait for the segment to buffer at the CDN.

This behavior is understandable. CDN data centers consist of many servers. To avoid overloading origins, these servers typically coordinate amongst themselves using a “cache lock” (mutex) that allows only one server to request a particular file from origin at a given time. A side effect of this is that while a file is being pulled into cache, it can’t be served to any user other than the first one that requested it. Unfortunately, this cache lock also defeats the purpose of using chunked encoding!

To recap thus far:

  • Chunked encoding splits up video segments into smaller pieces
  • This can reduce end-to-end latency by allowing chunks to be fetched and decoded by players, even while segments are being produced at the origin server
  • Some CDNs neutralize the benefits of chunked encoding by buffering entire files inside the CDN before they can be delivered to clients

Cloudflare’s solution: Concurrent Streaming Acceleration

As you may have guessed, we think we can do better. Put simply, we now have the ability to deliver un-cached files to multiple clients simultaneously while we pull the file once from the origin server.

This sounds like a simple change, but there’s a lot of subtlety to do this safely. Under the hood, we’ve made deep changes to our caching infrastructure to remove the cache lock and enable multiple clients to be able to safely read from a single file while it’s still being written.

The best part is – all of Cloudflare now works this way! There’s no need to opt-in, or even make a config change to get the benefit.

We rolled this feature out a couple of months ago and have been really pleased with the results so far. We measure success by the “cache lock wait time,” i.e. how long a request must wait for other requests, which is a direct component of Time To First Byte. One OTT customer saw this metric drop from 1.5s at P99 to nearly 0, as expected.

This directly translates into a 1.5-second improvement in end-to-end latency. Live video just got more live!

Conclusion

New techniques like chunked encoding have revolutionized live delivery, enabling publishers to deliver low-latency live video at scale. Concurrent Streaming Acceleration helps you unlock the power of this technique at your CDN, potentially shaving precious seconds off your end-to-end latency.

If you’re interested in using Cloudflare for live video delivery, contact our enterprise sales team.

And if you’re interested in working on projects like this and helping us improve live video delivery for the entire Internet, join our engineering team!

Announcing Cloudflare Image Resizing: Simplifying Optimal Image Delivery

Post Syndicated from Isaac Specter original https://blog.cloudflare.com/announcing-cloudflare-image-resizing-simplifying-optimal-image-delivery/

In the past three years, the amount of image data on the median mobile webpage has doubled. Growing images translate directly to users hitting data transfer caps, experiencing slower websites, and even leaving if a website doesn’t load in a reasonable amount of time. The crime is that many of these images are so slow because they are larger than they need to be, sending data over the wire which has absolutely no (positive) impact on the user’s experience.

To provide a concrete example, let’s consider this photo of Cloudflare’s Lava Lamp Wall:

On the left you see the photo, scaled to 300 pixels wide. On the right you see the same image delivered in its original high resolution, scaled in a desktop web browser. They both look exactly the same, yet the image on the right takes more than twenty times more data to load. Even for the best and most conscientious developers resizing every image to handle every possible device geometry consumes valuable time, and it’s exceptionally easy to forget to do this resizing altogether.

Today we are launching a new product, Image Resizing, to fix this problem once and for all.

Announcing Image Resizing

With Image Resizing, Cloudflare adds another important product to its suite of available image optimizations. This product allows customers to perform a rich set of key actions on images:

  • Resize – The source image will be resized to the specified height and width.  This action allows multiple different sized variants to be created for each specific use.
  • Crop – The source image will be resized to a new size that does not maintain the original aspect ratio and a portion of the image will be removed.  This can be especially helpful for headshots and product images where different formats must be achieved by keeping only a portion of the image.
  • Compress – The source image will have its file size reduced by applying lossy compression.  This should be used when slight quality reduction is an acceptable trade for file size reduction.
  • Convert to WebP – When the user’s browser supports it, the source image will be converted to WebP.  Delivering a WebP image takes advantage of this modern, highly optimized image format.

By using a combination of these actions, customers store a single high quality image on their server, and Image Resizing can be leveraged to create specialized variants for each specific use case.  Without any additional effort, each variant will also automatically benefit from Cloudflare’s global caching.

Examples

Ecommerce Thumbnails

Ecommerce sites typically store a high-quality image of each product.  From that image, they need to create different variants depending on how that product will be displayed.  One example is creating thumbnails for a catalog view.  Using Image Resizing, if the high quality image is located here:

https://example.com/images/shoe123.jpg

This is how to display a 75×75 pixel thumbnail using Image Resizing:

<img src="/cdn-cgi/image/width=75,height=75/images/shoe123.jpg">

Responsive Images

When tailoring a site to work on various device types and sizes, it’s important to always use correctly sized images.  This can be difficult when images are intended to fill a particular percentage of the screen.  To solve this problem, <img srcset sizes> can be used.

Without Image Resizing, multiple versions of the same image would need to be created and stored.  In this example, a single high quality copy of hero.jpg is stored, and Image Resizing is used to resize for each particular size as needed.

<img width="100%"
     srcset="/cdn-cgi/image/fit=contain,width=320/assets/hero.jpg 320w,
             /cdn-cgi/image/fit=contain,width=640/assets/hero.jpg 640w,
             /cdn-cgi/image/fit=contain,width=960/assets/hero.jpg 960w,
             /cdn-cgi/image/fit=contain,width=1280/assets/hero.jpg 1280w,
             /cdn-cgi/image/fit=contain,width=2560/assets/hero.jpg 2560w"
     src="/cdn-cgi/image/width=960/assets/hero.jpg">

Enforce Maximum Size Without Changing URLs

Image Resizing is also available from within a Cloudflare Worker. Workers allow you to write code which runs close to your users all around the world. For example, you might wish to add Image Resizing to your images while keeping the same URLs. Your users and client would be able to use the same image URLs as always, but the images will be transparently modified in whatever way you need.

You can install a Worker on a route which matches your image URLs, and resize any images larger than a limit:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  // Ask Image Resizing to scale anything larger than 800x800 down to fit.
  return fetch(request, {
    cf: {
      image: {
        width: 800,
        height: 800,
        fit: 'scale-down'
      }
    }
  })
}

As a Worker is just code, it is also easy to run this worker only on URLs with image extensions, or even to only resize images being delivered to mobile clients.
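As a rough sketch of that idea (the extension list and target widths are arbitrary, and the CF-Device-Type header is only populated when device detection is enabled on the zone):

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const url = new URL(request.url)

  // Pass non-image requests through untouched.
  if (!/\.(jpe?g|png|gif)$/i.test(url.pathname)) {
    return fetch(request)
  }

  // Serve smaller variants to mobile clients.
  const isMobile = request.headers.get('CF-Device-Type') === 'mobile'

  return fetch(request, {
    cf: {
      image: {
        width: isMobile ? 480 : 1280,
        fit: 'scale-down'
      }
    }
  })
}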

Cloudflare and Images

Cloudflare has a long history building tools to accelerate images. Our caching has always helped reduce latency by storing a copy of images closer to the user.  Polish automates options for both lossless and lossy image compression to remove unnecessary bytes from images.  Mirage accelerates image delivery based on device type. We are continuing to invest in all of these tools, as they all serve a unique role in improving the image experience on the web.

Image Resizing is different because it is the first image product at Cloudflare to give developers full control over how their images are served. You should choose Image Resizing if you are comfortable defining the sizes you wish your images to be served at in advance or within a Cloudflare Worker.

Next Steps and Simple Pricing

Image Resizing is available today for Business and Enterprise Customers.  To enable it, login to the Cloudflare Dashboard and navigate to the Speed Tab.  There you’ll find the section for Image Resizing which you can enable with one click.

This product is included in the Business and Enterprise plans at no additional cost with generous usage limits.  Business Customers have a limit of 100k requests per month and will be charged $10 for each additional 100k requests per month.  Enterprise Customers have a 10M requests per month limit with discounted tiers for higher usage.  Requests are defined as a hit on a URI that contains Image Resizing or a call to Image Resizing from a Worker.

Now that you’ve enabled Image Resizing, it’s time to resize your first image.

  1. Using your existing site, store an image here: https://yoursite.com/images/yourimage.jpg
  2. Use this URL to resize that image:
    https://yoursite.com/cdn-cgi/image/width=100,height=100,quality=75/images/yourimage.jpg
  3. Experiment with changing width=, height=, and quality=.

The instructions above use the Default URL Format for Image Resizing.  For details on options, use cases, and compatibility, refer to our Developer Documentation.

Parallel streaming of progressive images

Post Syndicated from Andrew Galloni original https://blog.cloudflare.com/parallel-streaming-of-progressive-images/

Progressive image rendering and HTTP/2 multiplexing technologies have existed for a while, but now we’ve combined them in a new way that makes them much more powerful. With Cloudflare progressive streaming images appear to load in half of the time, and browsers can start rendering pages sooner.


In HTTP/1.1 connections, servers didn’t have any choice about the order in which resources were sent to the client; they had to send responses, as a whole, in the exact order they were requested by the web browser. HTTP/2 improved this by adding multiplexing and prioritization, which allows servers to decide exactly what data is sent and when. We’ve taken advantage of these new HTTP/2 capabilities to improve perceived speed of loading of progressive images by sending the most important fragments of image data sooner.

This feature is compatible with all major browsers, and doesn’t require any changes to page markup, so it’s very easy to adopt. Sign up for the Beta to enable it on your site!

What is progressive image rendering?

Basic images load strictly from top to bottom. If a browser has received only half of an image file, it can show only the top half of the image. Progressive images have their content arranged not from top to bottom, but from a low level of detail to a high level of detail. Receiving a fraction of image data allows browsers to show the entire image, only with a lower fidelity. As more data arrives, the image becomes clearer and sharper.

This works great in the JPEG format, where only about 10-15% of the data is needed to display a preview of the image, and at 50% of the data the image looks almost as good as when the whole file is delivered. Progressive JPEG images contain exactly the same data as baseline images, merely reshuffled in a more useful order, so progressive rendering doesn’t add any cost to the file size. This is possible, because JPEG doesn’t store the image as pixels. Instead, it represents the image as frequency coefficients, which are like a set of predefined patterns that can be blended together, in any order, to reconstruct the original image. The inner workings of JPEG are really fascinating, and you can learn more about them from my recent performance.now() conference talk.

The end result is that the images can look almost fully loaded in half of the time, for free! The page appears to be visually complete and can be used much sooner. The rest of the image data arrives shortly after, upgrading images to their full quality, before visitors have time to notice anything is missing.

HTTP/2 progressive streaming

But there’s a catch. Websites have more than one image (sometimes even hundreds of images). When the server sends image files naïvely, one after another, the progressive rendering doesn’t help that much, because overall the images still load sequentially:

Having complete data for half of the images (and no data for the other half) doesn’t look as good as having half of the data for all images.

And there’s another problem: when the browser doesn’t know image sizes yet, it lays the page out with placeholders instead, and relays out the page when each image loads. This can make pages jump during loading, which is inelegant, distracting and annoying for the user.

Our new progressive streaming feature greatly improves the situation: we can send all of the images at once, in parallel. This way the browser gets size information for all of the images as soon as possible, can paint a preview of all images without having to wait for a lot of data, and large images don’t delay loading of styles, scripts and other more important resources.

This idea of streaming of progressive images in parallel is as old as HTTP/2 itself, but it needs special handling in low-level parts of web servers, and so far this hasn’t been implemented at a large scale.

When we were improving our HTTP/2 prioritization, we realized it can be also used to implement this feature. Image files as a whole are neither high nor low priority. The priority changes within each file, and dynamic re-prioritization gives us the behavior we want:

  • The image header that contains the image size is very high priority, because the browser needs to know the size as soon as possible to do page layout. The image header is small, so it doesn’t hurt to send it ahead of other data.

  • The minimum amount of data in the image required to show a preview of the image has a medium priority (we’d like to plug "holes" left for unloaded images as soon as possible, but also leave some bandwidth available for scripts, fonts and other resources)

  • The remainder of the image data is low priority. Browsers can stream it last to refine image quality once there’s no rush, since the page is already fully usable.

Knowing the exact amount of data to send in each phase requires understanding the structure of image files, but it seemed weird to us to make our web server parse image responses and have format-specific behavior hardcoded at the protocol level. By framing the problem as a dynamic change of priorities, we’re able to elegantly separate low-level networking code from knowledge of image formats. We can use Workers or offline image processing tools to analyze the images, and instruct our server to change HTTP/2 priorities accordingly.
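For example, a simplified sketch of such an analysis for JPEG (illustrative only -- a production version would also record the offsets of the progressive scans, not just where the header ends):

function jpegHeaderSize(bytes) {          // bytes: Uint8Array with the file contents
  let i = 2;                              // skip the SOI marker (0xFFD8)
  while (i + 3 < bytes.length && bytes[i] === 0xff) {
    if (bytes[i + 1] === 0xda) return i;  // SOS: entropy-coded image data starts here
    i += 2 + ((bytes[i + 2] << 8) | bytes[i + 3]); // skip over this marker segment
  }
  return -1;                              // not a JPEG layout this sketch understands
}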

The great thing about parallel streaming of images is that it doesn’t add any overhead. We’re still sending the same data, the same amount of data, we’re just sending it in a smarter order. This technique takes advantage of existing web standards, so it’s compatible with all browsers.

The waterfall

Here are waterfall charts from WebPageTest showing comparison of regular HTTP/2 responses and progressive streaming. In both cases the files were exactly the same, the amount of data transferred was the same, and the overall page loading time was the same (within measurement noise). In the charts, blue segments show when data was transferred, and green shows when each request was idle.

The first chart shows a typical server behavior that makes images load mostly sequentially. The chart itself looks neat, but the actual experience of loading that page was not great — the last image didn’t start loading until almost the end.

The second chart shows images loaded in parallel. The blue vertical streaks throughout the chart are image headers sent early followed by a couple of stages of progressive rendering. You can see that useful data arrived sooner for all of the images. You may notice that one of the images has been sent in one chunk, rather than split like all the others. That’s because at the very beginning of a TCP/IP connection we don’t know the true speed of the connection yet, and we have to sacrifice some opportunity to do prioritization in order to maximize the connection speed.

The metrics compared to other solutions

There are other techniques intended to provide image previews quickly, such as low-quality image placeholders (LQIP), but they have several drawbacks. They add unnecessary data for the placeholders, usually interfere with browsers’ preload scanner, and delay loading of full-quality images due to a dependence on the JavaScript needed to upgrade the previews to full images.

  • Our solution doesn’t cause any additional requests, and doesn’t add any extra data. Overall page load time is not delayed.
  • Our solution doesn’t require any JavaScript. It takes advantage of functionality supported natively in the browsers.
  • Our solution doesn’t require any changes to page’s markup, so it’s very safe and easy to deploy site-wide.

The improvement in user experience is reflected in performance metrics such as the SpeedIndex metric and time to visually complete. Notice that with regular image loading the visual progress is linear, but with progressive streaming it quickly jumps to mostly complete.

Getting the most out of progressive rendering

Avoid ruining the effect with JavaScript. Scripts that hide images and wait until the onload event to reveal them (with a fade in, etc.) will defeat progressive rendering. Progressive rendering works best with the good old <img> element.

Is it JPEG-only?

Our implementation is format-independent, but progressive streaming is useful only for certain file types. For example, it wouldn’t make sense to apply it to scripts or stylesheets: these resources are rendered as all-or-nothing.

Prioritizing of image headers (containing image size) works for all file formats.

The benefits of progressive rendering are unique to JPEG (supported in all browsers) and JPEG 2000 (supported in Safari). GIF and PNG have interlaced modes, but these modes come at a cost of worse compression. WebP doesn’t even support progressive rendering at all. This creates a dilemma: WebP is usually 20%-30% smaller than a JPEG of equivalent quality, but progressive JPEG appears to load 50% faster. There are next-generation image formats that support progressive rendering better than JPEG, and compress better than WebP, but they’re not supported in web browsers yet. In the meantime you can choose between the bandwidth savings of WebP or the better perceived performance of progressive JPEG by changing Polish settings in your Cloudflare dashboard.

Custom header for experimentation

We also support a custom HTTP header that allows you to experiment with, and optimize streaming of other resources on your site. For example, you could make our servers send the first frame of animated GIFs with high priority and deprioritize the rest. Or you could prioritize loading of resources mentioned in <head> of HTML documents before <body> is loaded.

The custom header can be set only from a Worker. The syntax is a comma-separated list of file positions with priority and concurrency. The priority and concurrency are the same as in the whole-file cf-priority header described in the previous blog post.

cf-priority-change: <offset in bytes>:<priority>/<concurrency>, ...

For example, for a progressive JPEG we use something like (this is a fragment of JS to use in a Worker):

let headers = new Headers(response.headers);
headers.set("cf-priority", "30/0");
headers.set("cf-priority-change", "512:20/1, 15000:10/n");
return new Response(response.body, {headers});

This instructs the server to use priority 30 initially, while it sends the first 512 bytes. Then it switches to priority 20 with some concurrency (/1), and finally, after sending 15000 bytes of the file, it switches to low priority and high concurrency (/n) to deliver the rest of the file.

We’ll try to split HTTP/2 frames to match the offsets specified in the header to change the sending priority as soon as possible. However, priorities don’t guarantee that data of different streams will be multiplexed exactly as instructed, since the server can prioritize only when it has data of multiple streams waiting to be sent at the same time. If some of the responses arrive much sooner from the upstream server or the cache, the server may send them right away, without waiting for other responses.

Try it!

You can use our Polish tool to convert your images to progressive JPEG. Sign up for the beta to have them elegantly streamed in parallel.

Argo and the Cloudflare Global Private Backbone

Post Syndicated from Rustam Lalkaka original https://blog.cloudflare.com/argo-and-the-cloudflare-global-private-backbone/

Welcome to Speed Week! Each day this week, we’re going to talk about something Cloudflare is doing to make the Internet meaningfully faster for everyone.

Cloudflare has built a massive network of data centers in 180 cities in 75 countries. One way to think of Cloudflare is a global system to transport bits securely, quickly, and reliably from any point A to any other point B on the planet.

To make that a reality, we built Argo. Argo uses real-time global network information to route around brownouts, cable cuts, packet loss, and other problems on the Internet. Argo makes the network that Cloudflare relies on—the Internet—faster, more reliable, and more secure on every hop around the world.

We launched Argo two years ago, and it now carries over 22% of Cloudflare’s traffic. On an average day, Argo cuts the amount of time Internet users spend waiting for content by 112 years!

As Cloudflare and our traffic volumes have grown, it now makes sense to build our own private backbone to add further security, reliability, and speed to key connections between Cloudflare locations.

Today, we’re introducing the Cloudflare Global Private Backbone. It’s been in operation for a while now and links Cloudflare locations with private fiber connections.

This private backbone benefits all Cloudflare customers, and it shines in combination with Argo. Argo can select the best available link across the Internet on a per data center-basis, and takes full advantage of the Cloudflare Global Private Backbone automatically.

Let’s open the hood on Argo and explain how our backbone network further improves performance for our customers.

What’s Argo?

Argo is like Waze for the Internet. Every day, Cloudflare carries hundreds of billions of requests across our network and the Internet. Because our network, our customers, and their end-users are well distributed globally, all of these requests flowing across our infrastructure paint a great picture of how different parts of the Internet are performing at any given time.

Just like Waze examines real data from real drivers to give you accurate, uncongested (and sometimes unorthodox) routes across town, Argo Smart Routing uses the timing data Cloudflare collects from each request to pick faster, more efficient routes across the Internet.

In practical terms, Cloudflare’s network is expansive in its reach. Some of the Internet links in a given region may be congested and cause poor performance (a literal traffic jam). By understanding this is happening and using alternative network locations and providers, Argo can put traffic on a less direct, but faster, route from its origin to its destination.

These benefits are not theoretical: enabling Argo Smart Routing shaves an average of 33% off HTTP time to first byte (TTFB).

One other thing we’re proud of: we’ve stayed super focused on making it easy to use. One click in the dashboard enables better, smarter routing, bringing the full weight of Cloudflare’s network, data, and engineering expertise to bear on making your traffic faster. Advanced analytics allow you to understand exactly how Argo is performing for you around the world.

You can read a lot more about how Argo works in our original launch blog post.

So far, we’ve been talking about Argo at a functional level: you turn it on and it makes requests that traverse the Internet to your origin faster. How does it actually work? Argo is dependent on a few things to make its magic happen: Cloudflare’s network, up-to-the-second performance data on how traffic is moving on the Internet, and machine learning routing algorithms.

Cloudflare’s Global Network

Cloudflare maintains a network of data centers around the world, and our network continues to grow significantly. Today, we have more than 180 data centers in 75 countries. That’s an additional 69 data centers since we launched Argo in May 2017.

In addition to adding new locations, Cloudflare is constantly working with network partners to add connectivity options to our network locations. A single Cloudflare data center may be peered with a dozen networks, connected to multiple Internet eXchanges (IXs), connected to multiple transit providers (e.g. Telia, GTT, etc), and now, connected to our own physical backbone. A given destination may be reachable over multiple different links from the same location; each of these links will have different performance and reliability characteristics.

This increased network footprint is important in making Argo faster. Additional network locations and providers mean Argo has more options at its disposal to route around network disruptions and congestion. Every time we add a new network location, we exponentially grow the number of routing options available to any given request.

Better routing for improved performance

Argo requires the huge global network we’ve built to do its thing. But it wouldn’t do much of anything if it didn’t have the smarts to actually take advantage of all our data centers and cables between them to move traffic faster.

Argo combines multiple machine learning techniques to build routes, test them, and disqualify routes that are not performing as we expect.

The generation of routes is performed on data using “offline” optimization techniques: Argo’s route construction algorithms take an input data set (timing data) and fixed optimization target (“minimize TTFB”), outputting routes that it believes satisfy this constraint.

Route disqualification is performed by a separate pipeline that has no knowledge of the route construction algorithms. These two systems are intentionally designed to be adversarial, allowing Argo to be both aggressive in finding better routes across the Internet but also adaptive to rapidly changing network conditions.

One specific example of Argo’s smarts is its ability to distinguish between multiple potential connectivity options as it leaves a given data center. We call this “transit selection”.

As we discussed above, some of our data centers may have a dozen different, viable options for reaching a given destination IP address. It’s as if you subscribed to every available ISP at your house, and you could choose to use any one of them for each website you tried to access. Transit selection enables Cloudflare to pick the fastest available path in real-time at every hop to reach the destination.

With transit selection, Argo is able to specify both:

1) Network location waypoints on the way to the origin.
2) The specific transit provider or link at each waypoint in the journey of the packet all the way from the source to the destination.

To analogize this to Waze, Argo giving directions without transit selection is like telling someone to drive to a waypoint (go to New York from San Francisco, passing through Salt Lake City), without specifying the roads to actually take to Salt Lake City or New York. With transit selection, we’re able to give full turn-by-turn directions — take I-80 out of San Francisco, take a left here, enter the Salt Lake City area using SR-201 (because I-80 is congested around SLC), etc. This allows us to route around issues on the Internet with much greater precision.

Transit selection requires logic in our inter-data center data plane (the components that actually move data across our network) to allow for differentiation between different providers and links available in each location. Some interesting network automation and advertisement techniques allow us to be much more discerning about which link actually gets picked to move traffic.

Without modifications to the Argo data plane, those options would be abstracted away by our edge routers, with the choice of transit left to BGP. We plan to talk more publicly about the routing techniques used in the future.

We are able to directly measure the impact transit selection has on Argo customer traffic. Looking at global average improvement, transit selection gets customers an additional 16% TTFB latency benefit over taking standard BGP-derived routes. That’s huge!

One thing we think about: Argo can itself change network conditions when moving traffic from one location or provider to another by inducing demand (adding additional data volume because of improved performance) and changing traffic profiles. With great power comes great intricacy.

Adding the Cloudflare Global Private Backbone

Given our diversity of transit and connectivity options in each of our data centers, and the smarts that allow us to pick between them, why did we go through the time and trouble of building a backbone for ourselves? The short answer: operating our own private backbone allows us much more control over end-to-end performance and capacity management.

When we buy transit or use a partner for connectivity, we’re relying on that provider to manage the link’s health and ensure that it stays uncongested and available. Some networks are better than others, and conditions change all the time.

As an example, here’s a measurement of jitter (variance in round trip time) between two of our data centers, Chicago and Newark, over a transit provider’s network.

Average jitter over the measured 6 hours is 4ms, with an average round trip latency of 27ms. Some amount of latency is something we just need to learn to live with; the speed of light is a tough physical constant to do battle with, and network protocols are built to function over links with high or low latency.

Jitter, on the other hand, is “bad” because it is unpredictable and network protocols and applications built on them often degrade quickly when jitter rises. Jitter on a link is usually caused by more buffering, queuing, and general competition for resources in the routing hardware on either side of a connection. As an illustration, having a VoIP conversation over a network with high latency is annoying but manageable. Each party on a call will notice “lag”, but voice quality will not suffer. Jitter causes the conversation to garble, with packets arriving on top of each other and unpredictable glitches making the conversation unintelligible.

Here’s the same jitter chart between Chicago and Newark, except this time, transiting the Cloudflare Global Private Backbone:

Jitter between Chicago and Newark over the Cloudflare Global Private Backbone

Much better! Here we see a jitter measurement of 536μs (microseconds), almost eight times better than the measurement over a transit provider between the same two sites.

The combination of fiber we control end-to-end and Argo Smart Routing allows us to unlock the full potential of Cloudflare’s backbone network. Argo’s routing system knows exactly how much capacity the backbone has available, and can manage how much additional data it tries to push through it. By controlling both ends of the pipe, and the pipe itself, we can guarantee certain performance characteristics and build those expectations into our routing models. The same principles do not apply to transit providers and networks we don’t control.
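
As a rough illustration of what "managing how much additional data it tries to push" could mean, here is a toy capacity-aware placement function. The numbers and the safety margin are invented for the example and are not Cloudflare's actual policy:

```python
def split_demand(demand_mbps, backbone_headroom_mbps):
    """Decide how much of a traffic flow to place on the backbone.

    A toy version of capacity-aware placement: fill the backbone up to a
    safety margin of its measured headroom, and spill the rest to transit.
    The 80% margin and all figures are illustrative assumptions.
    """
    safe_capacity = 0.8 * backbone_headroom_mbps
    on_backbone = min(demand_mbps, safe_capacity)
    on_transit = demand_mbps - on_backbone
    return on_backbone, on_transit

print(split_demand(demand_mbps=1200, backbone_headroom_mbps=1000))
# -> (800.0, 400.0): 800 Mbps on the backbone, 400 Mbps spilled to transit
```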

Latency, be gone!

Our private backbone is another tool available to us to improve performance on the Internet. Combining Argo’s cutting-edge machine learning and direct fiber connectivity between points on our large network allows us to route customer traffic with predictable, excellent performance.

We’re excited to see the backbone and its impact continue to expand.

Speaking personally as a product manager, I find Argo really fun to work on. We make customers happier by making their websites, APIs, and networks faster. Enabling Argo allows customers to do that with one click and see an immediate benefit. Under the covers, huge investments in physical and virtual infrastructure begin working to accelerate traffic as it flows from its source to its destination.

From an engineering perspective, our weekly goals and objectives are directly measurable — did we make our customers faster by doing additional engineering work? When we ship a new optimization to Argo and immediately see our charts move up and to the right, we know we’ve done our job.

Building our physical private backbone is the latest thing we’ve done in our need for speed.

Welcome to Speed Week!

Activate Argo now, or contact sales to learn more!

Welcome to Speed Week!

Post Syndicated from Patrick Meenan original https://blog.cloudflare.com/welcome-to-speed-week/

Every year, we celebrate Cloudflare’s birthday in September when we announce the products we’re releasing to help make the Internet better for everyone. We’re always building new and innovative products throughout the year, and having to pick five announcements for just one week of the year is always challenging. Last year we brought back Crypto Week where we shared new cryptography technologies we’re supporting and helping advance to help build a more secure Internet.

Today I’m thrilled to announce that we are launching our first-ever Speed Week, where we want to showcase some of the things we’re obsessed with to make the Internet faster for everyone.

How much faster is faster?

When we built the software stack that runs our network, we knew that both security and speed are important to our customers, and that they should never have to compromise one for the other. All of the products we’re announcing this week will help our customers have a better experience on the Internet: as much as a 50% improvement in page load times for websites, getting the most out of HTTP/2’s features (while only lifting a finger to click the button that enables them), finding the optimal route across the Internet, and providing the best live streaming video experience.

I am constantly amazed by the talented engineers who work on the products we launch all year round, and I wish we could have weeks like this more often to celebrate the wins they’ve accumulated as they tackle the difficult performance challenges of the web. We’re never content to settle for the status quo, and as our network continues to grow, so does our ability to improve our flagship products like Argo and how we support rich media sites that rely heavily on images and video. The sheer scale of our network provides rich data that we can use to make better decisions about how we support our customers’ web properties.

We also recognize that the Internet is evolving. New standards and protocols such as HTTP/2, QUIC, and TLS 1.3 are great advances for web performance and security, but they can also be challenging for many developers to deploy. HTTP/2, introduced by the IETF in 2015, was the first major revision of the HTTP protocol. While our customers have always been able to benefit from HTTP/2, we’re exploring how we can make that experience even faster.
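
As a small practical aside, you can check whether a server is willing to speak HTTP/2 by looking at what it negotiates via TLS ALPN. This sketch uses only the Python standard library; the host is just an example:

```python
import socket
import ssl

# Ask the server, via ALPN, whether it will negotiate HTTP/2 ("h2").
host = "cloudflare.com"
ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["h2", "http/1.1"])

with socket.create_connection((host, 443), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        # Prints "h2" if HTTP/2 was negotiated, otherwise "http/1.1".
        print(tls.selected_alpn_protocol())
```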

All things Speed

Want a sneak peek at what’s coming? I’m really excited to see this week’s announcements unfold. Each day we’ll post a new blog sharing product announcements and customer stories that demonstrate how we’re making life better for our customers.

  • Monday: An inside view of how we’re making faster, smarter routing decisions
  • Tuesday: HTTP/2 can be faster, we’ll show you how
  • Wednesday: Simplify image management and speed up load times on any device
  • Thursday: How we’re improving our network for faster video streaming
  • Friday: How we’re helping make JavaScript faster

For bonus points, sign up for a live stream webinar in which Kornel Lesiński and I, hosted by Dennis Publishing, will discuss the many challenges of the modern web in “Stronger, Better, Faster: Solving the performance challenges of the modern web.” The event will be held on Monday, May 13th at 11:00 am BST, and you can either register for the live event or sign up for one of the on-demand sessions later in the week.

I hope you’re just as excited about our upcoming Speed Week as I am. Be sure to subscribe to the blog to get daily updates sent to your inbox, because who knows… there may even be “one last thing”.