We are excited to announce that Cloudflare has joined the Microsoft 365 Networking Partner Program (NPP). Cloudflare One, which provides an optimized path for traffic from Cloudflare customers to Microsoft 365, recently qualified for the NPP by demonstrating that on-ramps through Cloudflare’s network help optimize user connectivity to Microsoft.
Connecting users to the Internet on a faster network
Customers who deploy Cloudflare One give their team members access to the world’s fastest network, on average, as their on-ramp to the rest of the Internet. Users connect from their devices or offices and reach Cloudflare’s network in over 250 cities around the world. Cloudflare’s network accelerates traffic to its final destination through a combination of intelligent routing and software improvements.
We’re also excited that, in many cases, the final destination that a user visits already sits on Cloudflare’s network. Cloudflare serves over 28M HTTP requests per second, on average, for the millions of customers who secure their applications on our network. When those applications do not run on our network, we can rely on our own global private backbone and our connectivity with over 10,000 networks globally to connect the user.
For Microsoft 365 traffic, we focus on breaking out traffic as locally and as directly as possible to bring users to the productivity tools they need without slowing them down. Legacy security solutions can introduce additional hops or backhauling that slows down connectivity to tools like Microsoft 365. With Cloudflare One, we provide the flexibility to identify that traffic and give it the most direct path to Microsoft’s own network of service endpoints around the world.
Securing data and users with Cloudflare Zero Trust
With this setting, trusted traffic to Microsoft uses the most direct path without additional processing. However, the rest of the Internet should not be trusted. Cloudflare’s network also secures the connections, queries, and requests your teams make to protect organizations from attacks and data loss. We can do that without slowing users down because we deliver that security in the data centers at our edge.
SaaS applications delivered over the Internet can make any device with a browser into a workstation. However, that also means that those same devices can connect to the rest of the Internet. Attackers try to lure users into lookalike sites to steal credentials, or they attempt to have users download malware to compromise the device. Either type of attack can put the data stored in SaaS applications at risk.
Cloudflare helps organizations stop those types of attacks through a defense-in-depth strategy. First, Cloudflare delivers a next-generation network firewall in our data centers, filtering traffic for connections to potentially dangerous destinations. Next, Cloudflare runs the world’s fastest DNS resolver and combines it with the data we see about the rest of the Internet to filter queries to phishing domains or sites that host malware.
Finally, Cloudflare’s Secure Web Gateway can inspect HTTP traffic for data loss and viruses, or isolate the browser for specific sites or entire categories. While Cloudflare’s network secures users from attacks on the rest of the Internet, Cloudflare One ensures that users have a direct, unfettered connection to the Microsoft 365 tools they need.
With traffic secured, Cloudflare can also give administrators visibility into the other applications used in their organization. Without any additional software or features, Cloudflare uses its Zero Trust security suite to analyze and categorize the requests to all applications in a comprehensive Shadow IT report. Administrators can mark applications as approved, unapproved, or unknown and pending review. For example, administrators could mark Microsoft 365 traffic as approved — which is also the default setting in deployments that use the one-click enablement being released today.
In some cases, that visibility leads to surprises. Security and IT teams discover that users are doing work in SaaS platforms that have not been reviewed and approved by the organization. In those cases, teams can use Cloudflare’s Secure Web Gateway to block requests to those destinations or to prevent specific activities, such as file uploads to tools other than OneDrive. With Shadow IT, we can help teams that use Microsoft 365 ensure that data stays only in Microsoft 365.
Our participation in Microsoft 365 Networking Partner Program
Cloudflare has joined the Microsoft 365 Networking Partner Program (NPP). The program offers customers a set of partners whose deployment practices and guidance align with Microsoft’s networking principles for Microsoft 365, so users get the best possible experience. Microsoft established the NPP to work with networking companies to provide optimal connectivity to its service. We are excited to work with a partner whose global network and security principles align with ours.
Starting today, through Cloudflare One, organizations can ensure as direct a connection as possible for Microsoft 365 traffic. This allows customers using our WARP client to benefit from a seamless user experience for Microsoft 365, while securing the rest of their traffic, whether to SaaS apps, on-prem apps, or the rest of the Internet, through Cloudflare’s global network and security suite of products.
To do this, all customers need to do is enable the Microsoft 365 traffic optimization setting in their Cloudflare One dashboard. With this setting enabled, even if Microsoft 365 connections are routed through Cloudflare gateways, they are handled with the least possible additional overhead; for example, a “Do not inspect” policy is applied automatically.
For Exclude Office 365 traffic and Bypass Office 365 traffic, click Create entries.
“We’re thrilled to welcome Cloudflare into the Networking Partner Program for Microsoft 365,” said Scott Schnoll, Senior Product Marketing Manager, Microsoft. “Cloudflare is a valued partner that is focused on helping Microsoft 365 customers implement the Microsoft 365 Network Connectivity Principles. Microsoft only recommends Networking Partner Program member solutions for connectivity to Microsoft 365.”
Your organization can start deploying Cloudflare One today alongside your existing Microsoft 365 usage. We’re excited to work with Microsoft to give your team members fast, reliable, and secure connectivity to the tools they need to do their jobs.
AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD) provides a fully managed service for Microsoft Active Directory (AD) in the AWS cloud. When you create your directory, AWS deploys two domain controllers in separate Availability Zones that are exclusively yours for high availability. For use cases requiring even higher resilience and performance, in a specific Region or during specific hours, AWS Managed Microsoft AD allows you to scale by deploying additional domain controllers to meet your needs. These domain controllers can help load-balance, increase overall performance, or simply provide additional nodes to protect against temporary availability issues. AWS Managed Microsoft AD allows you to define the correct number of domain controllers for your directory based on your individual use case.
This post will walk you through how to automate scaling in AWS Managed Microsoft AD using utilization metrics from your directory. You’ll do this using Amazon CloudWatch Alarms, SNS notifications, and a Lambda function to increase the number of domain controllers in your directory based on utilization peaks.
Simplified directory scaling
AWS Managed Microsoft AD has now simplified this directory scaling process by integrating with Amazon CloudWatch metrics. This new integration enables you to:
Analyze your directory to identify expected average and peak directory utilization
Scale your directory based on utilization data to adequately address the expected load
Automate the addition of domain controllers to handle unexpected load.
Integration is available for both domain controller utilization metrics such as CPU, Memory, Disk and Network, and for AD-specific metrics, such as LDAP searches, binds, DNS queries, and Directory reads/writes. Analyzing this data over time to identify expected average and peak utilization on your directory can help you deploy additional domain controllers in Regions that need them. Once you’ve established this utilization baseline, you can deploy additional domain controllers to service this load, and configure alarms for anything exceeding this baseline.
In this example, our AWS Managed Microsoft AD has the default two domain controllers; once your utilization threshold is reached, you’ll add one additional domain controller (domain controller 3 in the diagram) to cover this additional load.
Choose the Directory Service namespace, then choose AWS Managed Microsoft AD.
In the Directory ID column, select your directory and choose Search for this only.
In the Metric Category column, select Processor, then choose Add to search. This view will show the processor utilization for your directory.
Figure 2. Processor utilization metrics
To see the average utilization across all domain controllers, choose Add Math, then All Functions, then AVG to create a metric math expression for average CPU utilization across all domain controllers.
Figure 3. Adding a math function to compute average
Next, choose the Graphed Metrics tab in the CloudWatch metrics console, select the newly created expression, then select the bell icon from the Actions column to create a CloudWatch alarm based on this metric.
Figure 4. Create a CloudWatch Alarm using Metric Math Expression
Configure the threshold alarm to trigger when CPU utilization exceeds 70%.
In the Metrics section, under Period, choose 1 Hour. In the Conditions section, under Threshold Type, choose Static. Under Define the alarm condition, choose Greater than threshold. Under Define the threshold value, enter 70. See Figure 5 for an image of how alarm parameters should look on your screen. Choose Next to Configure actions.
Figure 5. Configure the alarm parameters
On the Configure actions screen, configure the actions using the parameters listed below to send an email notification when the alarm state is triggered. See Figure 6 for an image of how email notifications are configured.
In the Notification section, set Alarm state trigger to In alarm. Set Select an SNS topic to Create topic. Fill in the name of the alarm in the Create a new topic field, and add the email where notifications should be sent to the Email endpoints that will receive notification field. An email address is required to create the SNS topic and you should use an email address that’s accessible by your operations team. This SNS topic will be used to trigger the Lambda automation described in a later section. Note: make a note of the SNS topic name you chose; you will use it later when creating the Lambda function in the To create an AWS Lambda function to automate scale out procedure below.
Figure 6. Create SNS topic and email notification
In the Alarm name field, provide a name for the alarm. You can optionally also add an Alarm description. Choose Next.
Review your configuration, and choose Create alarm to create the alarm.
Once you’ve completed these steps, you will have an alarm that fires when domain controller CPU utilization exceeds an average of 70% across both domain controllers. When your directory is experiencing a heavy load, the alarm will publish to an SNS topic, which both starts the Lambda automation and sends an informational email notification. In the next section, we’ll configure an AWS Lambda function to automate the addition of a domain controller based on this SNS topic.
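If you prefer to script this setup, a roughly equivalent alarm can be created with the AWS SDK for Python (boto3). The sketch below is illustrative only: the alarm name and SNS topic ARN are placeholders, and the namespace, metric, and dimension names should be copied from the CloudWatch metrics view for your directory (Figure 2), since the values shown here are assumptions.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Sketch of the alarm built in the console above. Replace the placeholder
# SNS topic ARN, and copy the exact namespace, metric, and dimension names
# from the CloudWatch metrics view for your directory.
cloudwatch.put_metric_alarm(
    AlarmName="ManagedAD-AvgCPU-Above-70",
    ComparisonOperator="GreaterThanThreshold",
    Threshold=70,
    EvaluationPeriods=1,
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ManagedAD-ScaleOut"],
    Metrics=[
        {
            # Metric math expression that averages all member metrics in this alarm
            "Id": "avg_cpu",
            "Expression": "AVG(METRICS())",
            "Label": "Average CPU across domain controllers",
            "ReturnData": True,
        },
        {
            # One entry per domain controller; add a second entry for the other DC
            "Id": "m1",
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/DirectoryService",      # illustrative value
                    "MetricName": "Processor Usage",          # illustrative value
                    "Dimensions": [
                        {"Name": "Directory ID", "Value": "d-906752246f"},
                        {"Name": "Domain Controller IP", "Value": "10.0.0.10"},
                    ],
                },
                "Period": 3600,
                "Stat": "Average",
            },
        },
    ],
)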
The sample Lambda function shown below checks the number of domain controllers in this Region, and increases that by adding one additional domain controller. This procedure describes how to configure the IAM role required for this Lambda function, then how to deploy the Lambda function to execute when the alarm is triggered to automatically add a domain controller when your load exceeds your typical usage baseline.
Choose Next:Tags to add tags (optional) before choosing Next:Review.
On the Create Policy screen, provide a name in the Name field. You can optionally also add a description. Choose Create policy to complete creating the new policy.
Note: make a note of the policy name you chose; you will use it later when updating the execution role for the Lambda function.
Figure 7. Provide a name to create the IAM policy
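If you prefer to create the policy programmatically, here is a sketch using boto3. The two ds: actions correspond to the Directory Service API calls the Lambda function makes later in this post; the policy name is an example, and the Resource element is left broad here, so tighten it to match your requirements.

import json
import boto3

iam = boto3.client("iam")

# Sketch of the IAM policy attached to the Lambda execution role. The two
# ds: actions mirror the Directory Service API calls made by the function.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ds:DescribeDomainControllers",
                "ds:UpdateNumberOfDomainControllers",
            ],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="DirectoryService-DCNumberUpdate",
    PolicyDocument=json.dumps(policy_document),
)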
In the AWS Console, navigate to Lambda and choose Create Function.
On the Create Function screen, select Author from Scratch and provide a Name, then choose Create Function.
Figure 8. Create a Lambda function
Once created, on the Lambda function’s page, choose the Configuration tab, then choose Permissions from the sidebar and choose the execution role name linked under Role name. This will open the IAM console in another tab, preloaded to your Lambda execution role.
Figure 9: Select the Execution Role
On the execution role screen, choose Attach policies and select the IAM policy you’ve just created (e.g. DirectoryService-DCNumber Update). On the Attach Permissions screen, choose Attach policy to complete updating the execution role. Once you’ve completed this step, you can close this tab and return to the previous browser tab.
Figure 10. Select and attach the IAM policy
On the Lambda function screen, choose the Configuration tab, then choose Triggers from the sidebar.
On the Add Trigger screen, choose the pulldown under Trigger configuration and select SNS. On the SNS topic box, select the SNS topic you created in Step 9 of the To create a CloudWatch Alarm with SNS topic notifications procedure above. Then choose Add to complete the trigger configuration.
On the Lambda function screen, choose the Configuration tab, then choose Environment variables from the sidebar.
On the Environment variables card, click Edit.
On the Edit environment variables screen, choose Add environment variable, set the Key to DIRECTORY_ID, and set the Value to the directory ID of your AWS Managed Microsoft AD.
Figure 11. The “Edit environment variables” screen
On the Lambda function screen, choose the Code tab to open the in-browser code editor experience inside the Code source card. Paste in the sample Lambda function code given below to complete the implementation.
Figure 12. Paste sample code to complete the Lambda function setup
Sample Lambda function code
The sample Lambda function given below automates adding another domain controller to your directory. When your CloudWatch alarm triggers, you will receive a notification email, and an additional domain controller will be deployed to provide the added capacity to support the increase in directory usage.
Note: The example code contains a variable for the maximum number of domain controllers (maxDcNum) to prevent you from over provisioning in the event of a misconfiguration. This value is set to 3 for this blog post’s example and can be increased to suit your use case.
import os
import json
import boto3

maxDcNum = 3
minDcNum = 2
region = "us-east-1"
# Read the directory ID from the DIRECTORY_ID environment variable configured
# earlier, falling back to a hardcoded example ID.
dsId = os.environ.get("DIRECTORY_ID", "d-906752246f")

ds = boto3.client('ds', region_name=region)

def lambda_handler(event, context):
    ## Get the current number of domain controllers
    dcs = ds.describe_domain_controllers(DirectoryId=dsId)
    DomainControllers = dcs["DomainControllers"]
    DCcount = len(DomainControllers)
    print(">>> Current number of DCs: " + str(DCcount))
    ## Increase the number of DCs if we are still below the maximum
    if DCcount < maxDcNum:
        NewDCnumber = DCcount + 1
        ds.update_number_of_domain_controllers(DirectoryId=dsId, DesiredNumber=NewDCnumber)
        return {
            'statusCode': 200,
            'body': json.dumps("New DC number will be " + str(NewDCnumber))
        }
    else:
        return {
            'statusCode': 200,
            'body': json.dumps("Max number of DCs reached. The number of DCs is " + str(DCcount))
        }
Note: When testing this Lambda function, remember that this will increase the number of domain controllers for your directory in that Region. If the additional domain controller is not needed, please reduce the count after the test to avoid costs for an additional domain controller. The same principles used in this article to automate the addition of domain controllers can be applied to automate the reduction of domain controllers and you should consider automating the reduction to optimize for resilience, performance and cost.
In this post, you’ve implemented alarms based on thresholds in domain controller utilization using Amazon CloudWatch, and automation to increase the number of domain controllers using an AWS Lambda function. This solution helps to cost-effectively improve the resilience and performance of your directory by scaling it based on historical load patterns.
Today we’re excited to announce the general availability of Argo for Spectrum, a way to turbo-charge any TCP-based application. With Argo for Spectrum, you can reduce latency and packet loss, and improve connectivity for any TCP application, including common protocols like Minecraft, Remote Desktop Protocol, and SFTP.
The Internet — more than just a browser
When people think of the Internet, many of us think about using a browser to view websites. Of course, it’s so much more! We often use other ways to connect to each other and to the resources we need for work. For example, you may interact with servers for work using SSH File Transfer Protocol (SFTP), git or Remote Desktop software. At home, you might play a video game on the Internet with friends.
To help protect these services against DDoS attacks, we launched Spectrum in 2018, extending Cloudflare’s DDoS protection to any TCP- or UDP-based protocol. Customers use it for a wide variety of use cases, including protecting video streaming (RTMP), gaming, and internal IT systems. Spectrum also supports common VoIP protocols such as SIP and RTP, which have recently seen an increase in DDoS ransomware attacks. A lot of these applications are also highly sensitive to performance issues. No one likes waiting for a file to upload or dealing with a lagging video game.
Latency and throughput are the two metrics people generally discuss when talking about network performance. Latency refers to the amount of time a piece of data (a packet) takes to traverse between two systems. Throughput refers to the number of bits you can actually send per second. This blog will discuss how these two interact and how we improve them with Argo for Spectrum.
Argo to the rescue
There are a number of factors that cause poor performance between two points on the Internet, including network congestion, the distance between the two points, and packet loss. This is a problem many of our customers have, even on web applications. To help, we launched Argo Smart Routing in 2017, a way to reduce latency (or time to first byte, to be precise) for any HTTP request that goes to an origin.
That’s great for folks who run websites, but what if you’re working on an application that doesn’t speak HTTP? Up until now people had limited options for improving performance for these applications. That changes today with the general availability of Argo for Spectrum. Argo for Spectrum offers the same benefits as Argo Smart Routing for any TCP-based protocol.
Argo for Spectrum takes the same smarts from our network traffic and applies it to Spectrum. At time of writing, Cloudflare sits in front of approximately 20% of the Alexa top 10 million websites. That means that we see, in near real-time, which networks are congested, which are slow and which are dropping packets. We use that data and take action by provisioning faster routes, which sends packets through the Internet faster than normal routing. Argo for Spectrum works the exact same way, using the same intelligence and routing plane but extending it to any TCP based application.
But what does this mean for real application performance? To find out, we ran a set of benchmarks on Catchpoint. Catchpoint is a service that allows you to set up performance monitoring from all over the world. Tests are repeated at intervals and aggregate results are reported. We wanted to use a third party such as Catchpoint to get objective results (as opposed to running the tests ourselves).
For our test case, we used a file server in the Netherlands as our origin. We provisioned various tests on Catchpoint to measure file transfer performance from various places in the world: Rabat, Tokyo, Los Angeles and Lima.
Depending on location, transfers saw increases of up to 108% (for locations such as Tokyo) and 85% on average. Why is it so much faster? The answer is the bandwidth-delay product. In layman’s terms, the bandwidth-delay product means that the higher the latency, the lower the throughput. This is because with transmission protocols such as TCP, we need to wait for the other party to acknowledge that they received data before we can send more.
As an analogy, let’s assume we’re operating a water cleaning facility. We send unprocessed water through a pipe to a cleaning facility, but we’re not sure how much capacity the facility has! To test, we send an amount of water through the pipe. Once the water has arrived, the facility will call us up and say, “we can easily handle this amount of water at a time, please send more.” If the pipe is short, the feedback loop is quick: the water will arrive, and we’ll immediately be able to send more without having to wait. If we have a very, very long pipe, we have to stop sending water for a while before we get confirmation that the water has arrived and there’s enough capacity.
The same happens with TCP: we send an amount of data onto the wire and wait for confirmation that it arrived. If latency is high, it reduces throughput because we’re constantly waiting for confirmation. If latency is low, we can ramp throughput up quickly. With Spectrum and Argo, we help in two ways: the first is that Spectrum terminates the TCP connection close to the user, meaning that latency for that link is low. The second is that Argo reduces the latency between our edge and the origin. In concert, they create a set of low-latency connections, resulting in a low overall bandwidth-delay product between users and the origin. The result is a much higher throughput than you would otherwise get.
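To make the effect concrete, here is a rough back-of-the-envelope calculation. The window size and round-trip times below are illustrative, not measured values: with a fixed amount of unacknowledged data in flight, throughput is bounded by the window size divided by the round-trip time.

def max_throughput_mbps(window_bytes, rtt_ms):
    # Throughput ceiling imposed by the bandwidth-delay product:
    # bits in flight divided by the time it takes to get an acknowledgment.
    return (window_bytes * 8) / (rtt_ms / 1000) / 1_000_000

window = 64 * 1024  # 64 KiB of unacknowledged data in flight
print(max_throughput_mbps(window, rtt_ms=200))  # long path: ~2.6 Mbps
print(max_throughput_mbps(window, rtt_ms=20))   # short, terminated-close path: ~26 Mbps

Cutting the round-trip time on each leg, as Spectrum and Argo do, raises that ceiling proportionally.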
Argo for Spectrum supports any TCP based protocol. This includes commonly used protocols like SFTP, git (over SSH), RDP and SMTP, but also media streaming and gaming protocols such as RTMP and Minecraft. Setting up Argo for Spectrum is easy. When creating a Spectrum application, just hit the “Argo Smart Routing” toggle. Any traffic will automatically be smart routed.
Argo for Spectrum covers much more than just these applications: we support any TCP-based protocol. If you’re interested, reach out to your account team today to see what we can do for you.
If you’re considering delegating encryption to an AWS service to use a key under your control when it encrypts your data in that service, you might wonder how to ensure the AWS service can only use your key when you want it to and not have full access to decrypt any of your resources at any time. The answer is to use scoped-down dynamic permissions in AWS KMS. Specifically, a combination of permissions that you define in the KMS key policy document along with additional permissions that are created dynamically using KMS grants define the conditions under which one or more AWS services can use your KMS keys to encrypt and decrypt your data.
In this blog post, I discuss:
An example of how an AWS service uses your KMS key policy and grants to securely manage access to your encryption keys. The example uses Amazon RDS and demonstrates how the block storage volume behind your database instance is encrypted.
Best practices for using grants from AWS KMS in your own workloads.
Recent performance improvements when using grants in AWS KMS.
Case study: How RDS uses grants from AWS KMS to encrypt your database volume
Many Amazon RDS instance types are hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance where the underlying storage layer is an Amazon Elastic Block Store (Amazon EBS) volume. The blocks of the EBS volume that stores the database content are encrypted under a randomly generated 256-bit symmetric data key that is itself encrypted under a KMS key that you configure RDS to use when you create your database instance. Let’s look at how RDS interacts with EBS, EC2, and AWS KMS to securely create an RDS instance using a KMS key.
When you send a request to RDS to create your database, there are several asynchronous requests being made among the RDS, EC2, EBS, and KMS services to:
Create the underlying storage volume with a unique encryption key.
Create the compute instance in EC2.
Load the database engine into the EC2 instance.
Give the EC2 instance permissions to use the encryption key to read and write data to the database storage volume.
The initial authenticated request that you make to RDS to create a new database is made by an AWS Identity and Access Management (IAM) principal in your account (e.g. a user or role). Once the request is received, a series of things has to happen:
RDS needs to request EBS to create an encrypted volume to store your future data.
EBS needs to request that AWS KMS generate a unique 256-bit data key for the volume and encrypt it under the KMS key you told RDS to use.
RDS then needs to request that EC2 launch an instance, attach that encrypted volume, and make the data key available to EC2 for use in reads and writes to the volume.
From your perspective, the IAM principal used to create the database also must have permissions in the KMS key policy for the GenerateDataKeyWithoutPlaintext and Decrypt actions. This enables the unique 256-bit data key to be created and encrypted under the desired KMS key as well as allowing the user or role to have the data key decrypted and provisioned to the Nitro card managing your EC2 instance so that reads/writes can happen from/to the database. Given the asynchronous nature of the process of creating the database vs. launching the database volume in the future, how do the RDS, EBS, and EC2 services all get the necessary least privileged permissions to create and provision the data key for use with your database? The answer starts with your IAM principal having permission for the AWS KMS CreateGrant action in the key policy.
RDS uses the identity from your IAM principal to create a grant in AWS KMS that allows it to create other grants for EC2 and EBS with very limited permissions that are further scoped down compared to the original permissions your IAM principal has on the AWS KMS key. A total of three grants are created:
The initial RDS grant.
A subsequent EBS grant that allows EBS to call AWS KMS and generate a 256-bit data key that is encrypted under the KMS key you defined when creating your database.
The attachment grant, which allows the specific EC2 instance hosting your database volume to decrypt the encrypted data key and provision it for use during I/O between the instance and the EBS volume.
In this example, let’s say you’ve created an RDS instance with an ID of db-1234 and specified a KMS key for encryption. The following grant is created on the KMS key, allowing RDS to create more grants for EC2 and EBS to use in the asynchronous processes required to launch your database instance. The RDS grant is as follows:
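The grant isn’t reproduced verbatim here; the sketch below shows the shape of such a grant using the fields of a KMS CreateGrant request. The operations listed and the encryption context key name are assumptions for illustration; the actual grant is created by RDS on your behalf.

# Illustrative sketch of the initial RDS grant; field values are assumptions,
# not the literal grant that RDS creates.
rds_grant = {
    "KeyId": "<your KMS key ARN>",
    "GranteePrincipal": "<Regional RDS service account>",
    "Operations": ["Decrypt", "GenerateDataKeyWithoutPlaintext", "CreateGrant"],
    "Constraints": {
        "EncryptionContextEquals": {"aws:rds:db-id": "db-1234"}  # key name assumed
    },
}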
In plain English, this grant gives RDS permission to use the KMS key for three specific operations (API actions) only when the call specifies the RDS instance ID db-1234 in the Encryption Context parameter. The grant provides access for the grantee principal, which in this case is the value shown for the <Regional RDS service account>. This grant is created in AWS KMS and associated with your KMS key. Because the EC2 instance hasn’t yet been created and launched, the grantee principal cannot include the EC2 instance ID and must instead be the regional RDS service account.
With the RDS instance and initial AWS KMS grant created, RDS requests EC2 to launch an instance for the RDS database. EC2 creates an instance with a unique ID (e.g. i-1234567890abcdefg) using EC2 permissions you gave to the original IAM principal. In addition to the EC2 instance being created, RDS requests that Amazon EBS create an encrypted volume dedicated to the database. As a part of volume creation, EBS needs permission to call AWS KMS to generate a unique 256-bit data key for the volume and encrypt that data key under the KMS key you defined.
The EC2 instance ID is used as the name of the identity for future calls to AWS KMS, so RDS inserts it as the grantee principal in the EBS grant it creates. The EBS grant is as follows:
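As an illustrative sketch only (the operations and constraint key names are assumptions):

# Illustrative sketch of the EBS grant; the grantee principal is now the
# EC2 instance identity, and the constraint key names are assumptions.
ebs_grant = {
    "KeyId": "<your KMS key ARN>",
    "GranteePrincipal": "i-1234567890abcdefg",
    "Operations": ["GenerateDataKeyWithoutPlaintext"],
    "Constraints": {
        "EncryptionContextEquals": {
            "aws:rds:db-id": "db-1234",
            "aws:ec2:instance-id": "i-1234567890abcdefg",  # key name assumed
        }
    },
}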
You’ll notice that this grant uses the same encryption context as the initial RDS grant. However, now that we have the EC2 instance ID associated with the database ID, the permissions that EBS gets to use your key as the grantee principal can be scoped down to require both values. Once this grant is created, EBS can create the EBS volume (e.g. vol-0987654321gfedcba) and call AWS KMS to generate and encrypt a 256-bit data key that can only be used for that volume. This encrypted data key is stored by EBS in preparation for the volume attachment process.
The final step in creating the RDS instance is to attach the EBS volume to the EC2 instance hosting your database. EC2 now uses the previously created EBS grant to create the attachment grant with the i-1234567890abcdefg instance identity. This grant allows EC2 to decrypt the encrypted data key, provision it to the Nitro card that manages the instance, and begin encrypting I/O to the EBS volume of the RDS database. The attachment grant in this example will be as follows:
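Again as an illustrative sketch (operations, key names, and values are assumptions):

# Illustrative sketch of the attachment grant, the most tightly scoped of the three.
attachment_grant = {
    "KeyId": "<your KMS key ARN>",
    "GranteePrincipal": "i-1234567890abcdefg",  # the EC2 instance identity is the caller
    "Operations": ["Decrypt"],
    "Constraints": {
        "EncryptionContextEquals": {
            "aws:rds:db-id": "db-1234",
            "aws:ebs:id": "vol-0987654321gfedcba",  # key names assumed
        }
    },
}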
The attachment grant is the most restrictive of the three grants. It requires the caller to know the IDs of all the AWS entities involved: EC2 instance ID, EBS volume ID, and RDS database ID. This design ensures that your KMS key can only be used for decryption by these AWS services in order to launch the specific RDS database you want.
The encrypted EBS volume is now active and attached to the EC2 instance. Should you terminate the RDS instance, the services retire all the relevant KMS grants so they no longer have any permission to use your KMS key to decrypt the 256-bit data key required to decrypt data in your database. If you need to launch your encrypted database again, a similar set of three grants will be dynamically created with the RDS database, EC2 instance, and EBS volume IDs used to scope down permissions on the AWS KMS key.
The process described in the previous paragraphs is graphically shown in Figure 1:
Considering all the AWS KMS key permissions that are added and removed as a part of launching a database, you might ask why not just use the key policy document to make these changes? A KMS key allows only one key policy with a maximum document size of 32 KB. Because one key could be used to encrypt any number of AWS resources, trying to dynamically add and remove scoped-down permissions related to each resource to the key policy document creates two risks. First, the maximum allowable size of the key policy document (32KB) might be exceeded. Second, depending on how many resources are being accessed concurrently, you may exceed the request rate quota for the PutKeyPolicy API action in AWS KMS.
In contrast, there can be any number of grants on a given AWS KMS key, each grant specifying a scoped-down permission for the use of a KMS key with any AWS service that is integrated with AWS KMS. Grant creation and deletion is also designed for much higher-volume request rates than modifications to the key policy document. Finally, permission to call PutKeyPolicy is a highly privileged permission, as it lets the caller make unrestricted changes to the permissions on the key, including changes to administrative permissions to disable or schedule the key for deletion. Grants on a key can only allow permissions to use the key, not administer the key. Also, grants that allow the creation of other grants by other IAM principals prohibit the escalation of privilege. In the RDS example above, the permissions RDS receives from the IAM principal in your account during the first CreateGrant request cannot be more permissive than what you defined for the IAM principal in the KMS key policy. The permissions RDS gives to EC2 and EBS during the database creation process cannot be more permissive than the original permission RDS has from the initial grant. This design ensures that AWS services cannot escalate their privileges and use your KMS key for purposes different than what you intend.
Best practices for using AWS KMS grants
AWS KMS grants are a powerful tool to dynamically define permissions to use keys. They are automatically created when you use server-side encryption features in various AWS services. You can also use grants to control permission in your own applications that perform client-side encryption. Here are some best practices to consider:
Design the permissions to be as scoped down as possible. Use a specific grantee principal, such as an IAM role, and give the principal access only to the AWS KMS API actions that are needed. You can further limit the scope of grants with the Encryption Context parameter by using any element you want to ensure callers are using the AWS KMS key only for the intended purpose. Below is a specific example that grants AWS account 123456789012 permission to call the GenerateDataKey or Decrypt APIs, but only if the supplied encryption context for customerID is 5678.
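Expressed as a boto3 CreateGrant call, such a grant could be created roughly as follows (the KMS key ARN is a placeholder):

import boto3

kms = boto3.client("kms")

# Grant account 123456789012 permission to call GenerateDataKey and Decrypt,
# but only when the request's encryption context includes customerID=5678.
response = kms.create_grant(
    KeyId="arn:aws:kms:us-east-1:111122223333:key/<key-id>",  # placeholder ARN
    GranteePrincipal="arn:aws:iam::123456789012:root",
    Operations=["GenerateDataKey", "Decrypt"],
    Constraints={"EncryptionContextSubset": {"customerID": "5678"}},
)
print(response["GrantId"])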
This grant could prevent your application from decrypting data belonging to customer “5678” without explicitly passing the expected customerID in the request to AWS KMS. This can be a useful defense-in-depth mechanism to prevent unauthorized access to your customers’ data if your application’s AWS credentials are compromised and used by a different caller who doesn’t know that the encryption context is a required parameter for encrypting and decrypting data.
Remember that grants don’t automatically expire. Your code needs to retire or revoke them once you know the permission is no longer needed on the KMS key. Grants that aren’t retired are leftover permissions that might create a security risk for encrypted resources. See retiring and revoking grants in the AWS KMS developer guide for more detail.
Avoid creating duplicate grants. A duplicate grant is a grant that shares the same AWS KMS key ID, API actions, grantee principal, encryption context, and name. If you retire the original grant after use and not the duplicates, then the leftover duplicate grants can lead to unintended access to encrypt or decrypt data.
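To illustrate the last two points, here is a sketch of how an application might list and retire the grants it created once they are no longer needed. The key ARN and the grant name used for matching are placeholders:

import boto3

kms = boto3.client("kms")
key_id = "arn:aws:kms:us-east-1:111122223333:key/<key-id>"  # placeholder

# List the grants on the key and retire the ones this application created,
# identified here by a hypothetical grant Name, once they're no longer needed.
response = kms.list_grants(KeyId=key_id)
for grant in response["Grants"]:
    if grant.get("Name") == "my-app-customer-5678":  # hypothetical name
        kms.retire_grant(KeyId=key_id, GrantId=grant["GrantId"])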
Recent performance improvements to AWS KMS grants: Removing a resource quota
For customers who use AWS KMS to encrypt resources in AWS services that use grants, there used to be cases where AWS KMS had to enforce a quota on the number of concurrently active resources that could be encrypted under the same KMS key. For example, customers of Amazon RDS, Amazon WorkSpaces, or Amazon EBS would run into this quota at very large scale. This was the Grants for a given principal per key quota and was previously set to 500. You might have seen the error message “Keys only support 500 grants per grantee principal in this region” when trying to create a resource in one of these services.
We recently made a change to AWS KMS to remove this quota entirely and this error message no longer exists. With this quota removed, you can now attach unlimited grants to any KMS key when using any AWS service.
In this blog post, you’ve seen how services such as Amazon RDS use AWS KMS grants to pass scoped-down permissions through the Amazon EC2 and Amazon EBS instances. You also saw some best practices for using AWS KMS grants in your own applications. Finally, you learned about how AWS KMS has improved grants by removing one of the resource quotas.
There are many design choices to consider when we build our monitoring environment for high-frequency monitoring. How to minimize performance impact? What are the data retention policies with storage space in mind? What are the available out-of-the-box features to solve these potential problems?
In this blog post, we will discuss when you should use preprocessing and when it is better to use the “Do not keep history” option for your metrics, and what the pros and cons of each approach are.
Throttling and other preprocessing steps
We’ve discussed throttling previously as the go-to approach for high-frequency monitoring. Indeed, with throttling, you can discard repeated values and do so with a heartbeat. This is extremely useful with metrics that come as discrete values – service states, network port statuses, and so on.
Example of throttling with and without heartbeat
In addition, starting from Zabbix 4.2, all preprocessing is also performed by Zabbix proxies. This means we can discard repeated values before they reach the Zabbix server. This helps both with performance (fewer metrics to insert into the Zabbix server DB) and with DB size (fewer metrics stored in the DB, which also improves overall Zabbix performance).
There are a few caveats with this approach – since metrics get discarded before they reach the Zabbix server, triggers will not react to these metrics (this is where having a heartbeat is useful), and, since trends are calculated by the Zabbix server based on the received history data, there could be a lack of trend information for these metrics. Keep in mind that this applies not only to throttling preprocessing rules – any preprocessing can be done on the proxy, and any preprocessing rules can be used to transform your data.
Understanding “Do not keep history” option
The behavior of “Do not keep history”, which we can define when configuring an item, is a bit different though. If an item is collected by a proxy and configured with “Do not keep history”, the history won’t always get discarded! There are a couple of reasons for this.
First off, let’s not forget that some of our values can populate host inventory! If the particular item is configured to populate an inventory field – it will be forwarded to the Zabbix server, but it will not get stored in the history tables.
If the item does not populate an inventory field, text data such as character, log, and text types will indeed get discarded before reaching the Zabbix server, but numeric values – both float and integer – will get forwarded to the server. The reason is that trend information is derived from the numeric values. Mind that the numeric data will still not get stored in the history tables; only trends will be available for these items.
Note: This behavior has been properly implemented starting from Zabbix 5.2. See ZBX-17548
Setting the “Do not keep history” option for an item
Using trend functions with high-frequency monitoring
With the specifics of “Do not keep history” in mind, we should now recall that starting from Zabbix 5.2 we have trend functions available at our disposal!
Trend functions such as trendavg, trendcount, trendmax, trendmin, and trendsum allow us to perform different kinds of trend calculations – from counting the number of trend values to retrieving min/max/avg trend values for a time period.
This means, that if we require only the metric trend for specific time periods (hours, days, weeks, etc) we can use these trend functions together with “Do not keep history” option, thus discarding unnecessary data and improving our Zabbix server performance!
There are two approaches to using trend functions:
If you wish to collect and display the trend data, you need to create the item which will collect the metrics (say, a net.if.in Agent item for collecting incoming network traffic) and then create a separate calculated item that uses the trend function to calculate the avg/min/max value for the trend over a time period. The original item can then have “Do not keep history” option selected for it.
trendavg item for calculating hourly trends from the net.if.in[ifHCInOctets.5] item
If you wish to simply define triggers and react on long-term trends and are not required to collect the trend values, then we can skip the creation of the calculated item and simply use the trend function on the original item in the trigger.
This trigger fires if the hourly average trend value exceeds 100M. Note: In this case only the original item is required.
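For reference, in the trigger expression syntax used by Zabbix 5.x, such a trigger could look roughly like {your-host:net.if.in[ifHCInOctets.5].trendavg(1h,now/h)}>100M, where the host name is a placeholder and the exact function parameters should be verified against the documentation for your Zabbix version.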
By combining these approaches in our environment – using preprocessing when we wish to discard or transform the data, and opting out of storing history data whenever that is appropriate – we can minimize the performance impact on our Zabbix instance. Add a layer of distributed Zabbix proxies on top of this and you can truly achieve a large, scalable Zabbix infrastructure optimized for high-frequency ingestion and processing of your data.
The web has come a long way and initially consisted of very simple protocols. One of them was HTTP/1.0, which required browsers to make a separate connection for every subresource on the page. This design was quickly recognized as having significant performance bottlenecks and was extended with HTTP pipelining and persistent connections in HTTP/1.1 revision, which allowed HTTP requests to reuse the same TCP connection. But, yet again, this was no silver bullet: while multiple requests could share the same connection, they still had to be serialized one after the other, so a client and server could only execute a single request/response exchange at any given time for each connection. As time passed, websites became more complex in structure and dynamic in nature, and HTTP/1.1 was identified as a major bottleneck. The only way to gain concurrency at the network layer was to use multiple TCP connections to the same origin in parallel, but this meant losing most benefits of persistent connections and ended up overloading the origin servers which were unable to meet the concurrency demand.
To address these performance limitations, the SPDY protocol was introduced over a decade later. SPDY supported stream multiplexing, where requests to and responses from the server used a single interleaved TCP connection, and allowed browsers to prioritize requests for critical subresources that were blocking page rendering. A modified variant of SPDY was adopted by the IETF as the basis for HTTP/2 in 2012 and published as RFC 7540 in 2015.
HTTP/2 and later versions retained this model of connection reuse. More specifically, all subresources on the same domain were able to reuse the same TCP/TLS (or UDP/QUIC) connection without any head-of-line blocking (at least at the application layer). This resulted in a single connection for all the subresources — reducing extraneous requests on page loads — potentially speeding up some websites and applications.
Interestingly, the protocol has a lesser-known feature to also enable subresources at different hostnames to be fetched over the same connection. We studied the real-world feasibility and benefits of this technique as an effort to improve users’ experience for websites across our network.
The technique is often referred to as Connection Coalescing and, to put it simply, is a way to access resources from different hostnames that are accessible from the same web server.
There are several reasons why a single server could handle requests for different hosts, ranging from low-cost virtual hosting to the usage of CDNs and cloud providers (including Cloudflare, which acts as a reverse proxy for approximately 25 million Internet properties). Before going into the technical conditions required to enable connection coalescing, we should take a look at some benefits such a strategy can provide.
Privacy. When resources at different hostnames are loaded via separate TLS connections, those connections expose metadata to ISPs and other observers via the Server Name Indication (SNI) field about the destinations that are being contacted (i.e., in the absence of encrypted SNI). This set of exposed SNIs can allow an on-path adversary to fingerprint traffic and possibly determine user interactions on the webpage. On the other hand, coalesced requests for more than one hostname on a single connection expose only one destination, and help avoid such threats.
Resource Prioritization. Multiplexing requests on a single connection means that applications have better visibility and more direct control over how related resources are prioritized and scheduled. In the absence of coalescing, the network properties (for example, route congestion) can interfere with the intended order of delivery for resources. This reliability gained through connection coalescing opens up new optimization opportunities to improve web page load times, among other things.
However, along with all these potential benefits, connection coalescing also has some associated risk factors that need to be considered in practice. First, TCP incorporates “fair” congestion control mechanisms — if there are ten connections on the same route, each gets approximately 1/10th of the total bandwidth. So with a route congested and bandwidth restricted, a client relying on multiple connections might be better off (for example, if they have five of the ten connections, their total share of bandwidth would be half). Second, browsers will use different parallelization routines for scheduling requests on multiple connections versus the same connection — it is not immediately clear whether the former or latter would perform better. Third, multiple connections exhibit an inherent form of load balancing for TLS-termination processes. That’s because multiple requests on the same connection must be answered by the same TLS-termination process that holds the session keys (often on the same physical server). So, it is important to study connection coalescing carefully before rolling it out widely.
With this context in mind, we studied the feasibility of connection coalescing on real-world traffic. More specifically, the two questions we wanted to answer were (a) can we empirically demonstrate and quantify the theoretical benefits of connection coalescing?, and (b) could coalescing cause unintended side effects, such as performance degradation, due to the risks highlighted above?
We then identified a list of approximately four thousand websites using Cloudflare (on the Free plan) that likely used cdnjs. We divided this list of sites into evenly-sized and randomly-picked control and experiment groups. Our plan was to enable coalescing only for the experiment group, so that subresource requests generated from their web pages for cdnjs could reuse existing connections. In this way, we were able to compare results obtained on the experiment group, with the ones for the control group, and attribute any differences observed to connection coalescing.
In order to signal browsers that the requests can be coalesced, we served cdnjs and the sites from the same IP address in a few regions around the world. This meant the same DNS responses for all the zones that were part of the study — eventually load balanced by our Anycast network. These sites also had TLS certificates that included cdnjs.
The above two conditions (same IP and compatible certificate) are required to achieve coalescing as per the HTTP/2 spec. However, the QUIC spec allows coalescing even if only the second condition is met. Major web browsers are yet to adopt the QUIC coalescing mechanism, and currently use only the HTTP/2 coalescing logic for both protocols.
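To make those two conditions concrete, here is a rough Python sketch that checks whether two hostnames resolve to a shared IP address and whether the origin’s certificate also covers the second hostname. This is an illustration of the checks, not how browsers implement coalescing internally; the hostnames in the usage comment are placeholders and the wildcard matching is deliberately simplified.

import socket
import ssl

def cert_sans(hostname, port=443):
    # Fetch the DNS subjectAltName entries presented by a host's TLS certificate.
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return {name for typ, name in cert.get("subjectAltName", ()) if typ == "DNS"}

def may_coalesce(origin_host, subresource_host):
    # HTTP/2-style coalescing check: the two hostnames resolve to a shared IP,
    # and the origin's certificate also covers the subresource hostname.
    origin_ips = set(socket.gethostbyname_ex(origin_host)[2])
    sub_ips = set(socket.gethostbyname_ex(subresource_host)[2])
    shares_ip = bool(origin_ips & sub_ips)
    sans = cert_sans(origin_host)
    covers_host = any(
        san == subresource_host
        or (san.startswith("*.") and subresource_host.endswith(san[1:]))
        for san in sans
    )
    return shares_ip and covers_host

# Example usage (hostnames are placeholders):
# print(may_coalesce("example.com", "cdnjs.cloudflare.com"))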
We started noticing evidence of real-world coalescing from the day our experiment was launched. The following graph shows that approximately 50% of requests to cdnjs from our experiment group sites are coalesced (i.e., their TLS SNI does not equal cdnjs) as compared to 0% of requests from the control group sites.
In addition, we conducted active measurements using our private WebPageTest instances at the landing pages of experiment and control sites — using the two well-supported browsers: Google Chrome and Firefox. From our results, Chrome created about 78% fewer TLS connections to cdnjs for our experiment group sites, as compared to the control group. But surprisingly, Firefox created just roughly 22% fewer connections. As TLS handshakes are computationally expensive because they involve cryptographic signatures and key exchange algorithms, fewer handshakes meant less CPU cycles spent by both the client and the server.
Upon further analysis, we were able to make two observations from the data:
A fraction of sites that never coalesced connections with either browser appeared to load subresources with CORS enabled (i.e., <script src="https://cdnjs.cloudflare.com/..." integrity="sha512-894Y..." crossorigin="anonymous">). This is the default way cdnjs recommends inclusion of subresources, as CORS is needed for integrity checks that provide substantial mitigations against script-manipulation attacks. We do not recommend removing this attribute. Our testing also revealed that using XMLHttpRequest or Fetch APIs to load subresources disabled coalescing as well. It is unclear why browsers choose to not coalesce such connections, and we are in contact with the vendors to find out.
Although both Firefox and Chrome coalesced requests for cdnjs on existing connections, the reason for the discrepancy in the number of TLS connections to cdnjs (approximately 78% vs roughly 22%) is because Firefox appears to open new connections even if it does not end up using them.
After evaluating the potential benefits of coalescing, we wanted to understand if coalescing caused any unintended side effects. Hence, the final measurement we conducted was to check whether our experiments were detrimental to a website’s performance. We tracked Page Load Times (PLT) and Largest Contentful Paint (LCP) across a variety of simulated network conditions using both Chrome and Firefox and found the results for the experiment vs control group to not be statistically significant.
We consider our experimentation successful in determining the feasibility of connection coalescing and highlighting its potential benefits in terms of privacy and performance. More specifically, we observed the privacy benefits of coalescing in more than 50% of requests to cdnjs from real-world traffic. In addition, our active testing demonstrated that browsers create fewer TLS connections with coalescing enabled. Interestingly, our results also revealed that the benefits might not always occur (i.e., CORS-enabled requests, Firefox creating additional TLS connections despite coalescing). Finally, we did not find any evidence that coalescing can cause harm to real-world users’ experience on the Internet.
Some future directions we would like to explore include:
More aggressive connection reuse with multiple hostnames, while identifying conditions most suitable for coalescing.
Understanding how different connection reuse methods compare, e.g., IP-based coalescing vs. the use of Origin Frames, and what effects they have on user experience over the Internet.
Evaluating coalescing support among different browser vendors, and encouraging adoption of HTTP/3 QUIC based coalescing.
Reaping the full benefits of connection coalescing by experimenting with custom priority schemes for requests within the same connection.
Please send questions and feedback to [email protected]. We’re excited to continue this line of work in our effort to help build a better Internet! For those interested in joining our team please visit our Careers Page.
We often put together experiments that measure hardware performance to improve our understanding and provide insights to our hardware partners. We recently wanted to know more about Hyper-Threading and Turbo Boost. The last time we assessed these two technologies was when we were still deploying the Intel Xeons (Skylake/Purley), but beginning with our Gen X servers we switched over to the AMD EPYC (Zen 2/Rome). This blog is about our latest attempt at quantifying the performance impact of Hyper-Threading and Turbo Boost on our AMD-based servers running our software stack.
Intel briefly introduced Hyper-Threading with NetBurst (Northwood) back in 2002, then reintroduced Hyper-Threading six years later with Nehalem along with Turbo Boost. AMD presented their own implementation of these technologies with Zen in 2017, but AMD’s version of Turbo Boost actually dates back to AMD K10 (Thuban), in 2010, when it used to be called Turbo Core. Since Zen, Hyper-Threading and Turbo Boost are known as simultaneous multithreading (SMT) and Core Performance Boost (CPB), respectively. The underlying implementation of Hyper-Threading and Turbo Boost differs between the two vendors, but the high-level concept remains the same.
Hyper-Threading, or simultaneous multithreading, creates a second hardware thread within a processor’s core, also known as a logical core, by duplicating various parts of the core to support the context of a second application thread. The two hardware threads execute simultaneously within the core, across their dedicated and remaining shared resources. If the two hardware threads do not contend over a particular shared resource, throughput can be drastically increased.
Turbo Boost or Core Performance Boost opportunistically allows the processor to operate beyond its rated base frequency as long as the processor operates within guidelines set by Intel or AMD. Generally speaking, the higher the frequency, the faster the processor finishes a task.
Similar to Intel’s Hyper-Threading, AMD implemented 2-way simultaneous multithreading. The AMD EPYC 7642 has 48 cores, and with simultaneous multithreading enabled it can simultaneously execute 96 hardware threads. Core Performance Boost allows the AMD EPYC 7642 to operate anywhere between 2.3 to 3.3 GHz, depending on the workload and limitations imposed on the processor. With Core Performance Boost disabled, the processor will operate at 2.3 GHz, the rated base frequency on the AMD EPYC 7642. We took our usual simulated traffic pattern of 10 KiB cached assets over HTTPS, provided by our performance team, to generate a sustained workload that saturated the processor to 100% CPU utilization.
After establishing a baseline with simultaneous multithreading and Core Performance Boost disabled, we started enabling one feature at a time. When we enabled Core Performance Boost, the processor operated near its peak turbo frequency, hovering between 3.2 and 3.3 GHz, which is more than 39% higher than the base frequency. The higher operating frequency translated directly into 40% additional requests per second. We then disabled Core Performance Boost and enabled simultaneous multithreading. Similar to Core Performance Boost, simultaneous multithreading alone improved requests per second by 43%. Lastly, by enabling both features, we observed an 86% improvement in requests per second.
Latencies were generally lowered by either or both Core Performance Boost and simultaneous multithreading. While Core Performance Boost consistently maintained a lower latency than the baseline, simultaneous multithreading gradually took longer to process a request as it reached tail latencies. Though not depicted in the figure below, when we examined beyond p9999 or 99.99th percentile, simultaneous multithreading, even with the help of Core Performance Boost, exponentially increased in latency by more than 150% over the baseline, presumably due to the two hardware threads contending over a shared resource within the core.
Moving into production, since our traffic fluctuates throughout the day, we took four identical Gen X servers and measured in parallel during peak hours. The only changes we made to the servers were enabling and disabling simultaneous multithreading and Core Performance Boost to create a comprehensive test matrix. We conducted the experiment in two different regions to identify any anomalies and mismatching trends. All trends were alike.
Before diving into the results, we should preface that the baseline server operated at a higher CPU utilization than others. Every generation, our servers deliver a noticeable improvement in performance. So our load balancer, named Unimog, sends a different number of connections to the target server based on its generation to balance out the CPU utilization. When we disabled simultaneous multithreading and Core Performance Boost, the baseline server’s performance degraded to the point where Unimog encountered a “guard rail” or the lower limit on the requests sent to the server, and so its CPU utilization rose instead. Given that the baseline server operated at a higher CPU utilization, the baseline server processed more requests per second to meet the minimum performance threshold.
Due to the skewed baseline, we only observed 7% additional requests per second when Core Performance Boost was enabled. Next, simultaneous multithreading alone improved requests per second by 41%. Lastly, with both features enabled, we saw an 86% improvement in requests per second.
Though we lack a clean baseline, we can normalize requests per second by CPU utilization to approximate the improvement for each scenario. Once normalized, the estimated improvements in requests per second from Core Performance Boost and simultaneous multithreading were 36% and 80%, respectively. With both features enabled, requests per second improved by 136%.
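As a small illustration of that normalization (the numbers below are made up, not our measurements), dividing requests per second by CPU utilization puts servers running at different load levels on a comparable footing:

#include <stdio.h>

/* Illustrative only: normalize throughput by CPU utilization so that two
 * servers running at different load levels can be compared. */
static double requests_per_cpu_percent(double requests_per_second,
                                       double cpu_utilization_percent)
{
    return requests_per_second / cpu_utilization_percent;
}

int main(void)
{
    double baseline = requests_per_cpu_percent(10000.0, 90.0); /* hypothetical skewed baseline */
    double boosted  = requests_per_cpu_percent(12000.0, 79.0); /* hypothetical boosted server */

    printf("estimated improvement: %.0f%%\n", (boosted / baseline - 1.0) * 100.0);
    return 0;
}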
Latency was not as interesting since the baseline server operated at a higher CPU utilization, and in turn, it produced a higher tail latency than we would have otherwise expected. All other servers maintained a lower latency due to their lower CPU utilization in conjunction with Core Performance Boost, simultaneous multithreading, or both.
At this point, our experiment had not gone as we had planned: our baseline was skewed, and we only got partially useful answers. However, we find experimenting important because we usually end up discovering other helpful insights along the way.
Let’s add power data. Since our baseline server was operating at a higher CPU utilization, we knew it was serving more requests and therefore consumed more power than it needed to. Enabling Core Performance Boost allowed the processor to run up to its peak turbo frequency, increasing power consumption by 35% over the skewed baseline. More interestingly, enabling simultaneous multithreading increased power consumption by only 7%. Combining Core Performance Boost with simultaneous multithreading resulted in a 58% increase in power consumption.
AMD’s implementation of simultaneous multithreading appears to be power efficient, achieving 41% additional requests per second while consuming only 7% more power compared to the skewed baseline. For completeness, we combined the performance and power data to obtain performance per watt as a summary of power efficiency. We divided the non-normalized requests per second by power consumption to produce the requests-per-watt figure below. Our Gen X servers attained the best performance per watt by enabling just simultaneous multithreading.
In our assessment of AMD’s counterparts to Hyper-Threading and Turbo Boost, the original experiment we designed to measure requests per second and latency did not pan out as expected. As soon as we entered production, our baseline measurement was skewed by the imbalance in CPU utilization, and the results only partially reproduced our lab findings.
We added power to the experiment and found other meaningful insights. By analyzing the performance and power characteristics of simultaneous multithreading and Core Performance Boost, we concluded that simultaneous multithreading could be a power-efficient mechanism to attain additional requests per second. Drawbacks of simultaneous multithreading include long tail latency that is currently curtailed by enabling Core Performance Boost. While the higher frequency enabled by Core Performance Boost provides latency reduction and more requests per second, we are more mindful that the increase in power consumption is quite significant.
Do you want to help shape the Cloudflare network? This blog was a glimpse of the work we do at Cloudflare. Come join us and help complete the feedback loop for our developers and hardware partners.
During Speed Week we’ve talked a lot about services that make the web faster. Argo 2.0 for better routing around bad Internet weather, Orpheus to ensure that origins are reachable from anywhere, image optimization to send just the right bits to the client, Tiered Cache to maximize cache hit rates and get the most out of Cloudflare’s new 25% bigger network, our expanded fiber backbone and more.
Those things are all great.
But it’s vital that we also measure the performance of our network and benchmark ourselves against industry players large and small to make sure we are providing the best, fastest service.
We recently ran a measurement experiment where we used Real User Measurement (RUM) data from the standard browser API to test the performance of Cloudflare and others in real-world conditions across the globe. We wanted to use third-party tests for this, but they didn’t have the granularity we wanted. We want to drill down to every single ISP in the world to make sure we optimize everywhere. We knew that in some places the answers we got wouldn’t be good, and we’d need to do work to improve our performance. But without detailed analysis across the entire globe we couldn’t know we were really the fastest or where we needed to improve.
In this blog post I’ll describe how that measurement worked and what the results are. But the short version is: Cloudflare is #1 in almost all cases, whether you look at all the networks on the globe, the top 1,000 largest, or the top 1,000 most active; whether you look at mean timings or the 95th percentile; and whether you measure how quickly a connection can be established, how long it takes for the first byte to arrive in a user’s web browser, or how long the complete download takes. And we’re not resting here; we’re committed to working network by network globally to ensure that we are always #1.
Why we measured
Commercial Internet measurement services (such as Cedexis, Catchpoint, Pingdom, ThousandEyes) already exist and perform the sorts of RUM measurements that Cloudflare used for this test. And we wanted to ensure that our test was as fair as possible and allowed each network to put its best foot forward.
We already subscribe to third-party monitoring services, and when we looked at their data we weren’t satisfied.
First, we worried that the way they sampled wasn’t globally representative and was often skewed by measuring from the server side of the network rather than the eyeball side. And even when measuring from the eyeball side, the samples could be artificial or tainted by bots and other automated traffic.
Second, it wasn’t granular enough. It showed our performance by country or region, but didn’t dive into individual networks and therefore obscured the details and outliers behind averages. While we looked good in third party tests, we didn’t trust them to be as thorough and accurate as we wanted. The goal isn’t to pick a test where we looked good. The goal was to be accurate and see where we weren’t good enough yet, so we could focus on those areas and improve. That’s why we had to build this ourselves.
We benchmark against others because it’s useful to know what’s possible. If someone else is faster than we are somewhere then it proves it’s possible. We won’t be satisfied until we’re at least as good as everyone else everywhere. Now we have the granular data to ensure that’ll happen. We plan our next update during Birthday Week when our target is to take 10% of networks where we’re not the fastest and become the fastest.
How we measured
To measure performance we did two things. We created a small internal team to do the measurements separate from the team that manages and optimizes our network. The goal was to show the team where we need to improve.
And to ensure that the other CDNs were tested using their most representative assets we used the very same endpoints that commercial measurement services use on the assumption that our competitors will have already ensured that those endpoints are optimized to show their networks’ best performance.
The measurements in this blog post are based on four days just before Speed Week began (2021-09-10 12:25:02 UTC to 2021-09-13 16:21:10 UTC). We took measurements of downloading exactly the same 100KB PNG file. We categorized them by the network the measurement was taken from. It’s identified by its ASN and a name. We’re not done with these measurements and will continue measuring and optimizing.
A 100KB file is a common test measurement used in the industry and allows us to measure network characteristics like the connection time, but also the total download time.
Before we get into results let’s talk a little about how the Internet fits together. The Internet is a network of networks that cooperate to form the global network that we all use. These networks are identified by a strangely named “autonomous system number” or ASN. The idea is that large networks (like ISPs, or cloud providers, or universities, or mobile phone companies) operate autonomously and then join the global Internet through a protocol called BGP (which we’ve written about in the past).
In a way the Internet is these ASNs and because the Internet is made of them we want to measure our performance for each ASN. Why? Because one part of improving performance is improving our connectivity to each ASN and knowing how we’re doing on a per-network basis helps enormously.
There are roughly 70,000 ASNs in the global Internet and during the measurement period we saw traffic from about 21,000 (exact number: 20,728) of them. This makes sense since not all networks are “external” (as in the source of traffic to websites); many ASNs are intermediaries moving traffic around the Internet.
For the rest of this blog we simply say “network” instead of “ASN”.
What we measured
Getting real user measurement data used to be hard but has been made easy for HTTP endpoints thanks to the Resource Timing API, supported by most modern browsers. This API allows a page to measure network timing data of fetched resources using high-resolution timestamps, accurate to 5 µs (microseconds).
The point of this API is to get timing information that shows how a real end-user experiences the Internet (and not a synthetic test that might measure a single component of all the things that happen when a user browses the web).
The Resource Timing API is supported by pretty much every browser, enabling measurement on everything from old versions of Internet Explorer, to mobile browsers on iOS and Android, to the latest version of Chrome. Using this API gives us a view of real-world use on real-world browsers.
We don’t just naively instruct the browser to download an image, either. To make sure that we’re fair and replicate the real-life end-user experience, we make sure that no local caching is involved in the request, check whether the object has been compressed by the server, take the HTTP header size into account, and record whether the connection has been pre-warmed, to name a few technical details.
To measure the performance of each CDN we downloaded an image from each, when a browser visited one of our special pages. Once every image is downloaded we record the measurements using a Cloudflare Workers based API.
The three measurements: TCP connection time, TTFB and TTLB
We focused on three measurements to illustrate how fast our network is: TCP connection time, TTFB and TTLB. Here’s why those three values matter.
TCP connection time is used to show how well-connected we are to the global Internet as it counts only the time taken for a machine to establish a connection to the remote host (before any higher level protocols are used). The TCP connection time is calculated as connectEnd – connectStart (see the diagram above).
TTFB (or time to first byte) is the time taken once an HTTP request has been sent for the first byte of data to be returned by the server. This is a common measurement used to show how responsive a server is. We calculate TTFB as responseStart – connectStart – (requestStart – connectEnd).
TTLB (or time to last byte) is the time taken to send the entire response to the web browser. It’s a good measure of how long a complete download takes and helps measure how good the server (or CDN) is at sending data. We calculate TTLB as responseEnd – connectStart – (requestStart – connectEnd).
We then produced two sets of data: mean and p95. The mean is a well-understood number for laypeople and gives the average user experience, but it doesn’t capture the breadth of different speeds people experience very well. Because it averages everything together, it can miss skewed distributions of data (where lots of people get good performance and lots get bad performance, for example).
To address the mean’s problems we also used p95, the 95th percentile. This number tells us what performance 95% of measurements fall below. That can be a hard number to understand, but you can think of it as the “reasonable worst case” performance for a user. Only 5% of measurements were worse than this number.
An example chart
As this blog contains a lot of data, let’s start with a detailed look at a chart of results. We compared ourselves against industry giants Google and Amazon CloudFront, industry pioneer Akamai and up and comer Fastly.
For each network (represented by an ASN) and for each CDN we calculated which CDN had the best performance. Here, for example, is a count of the number of networks on which each CDN had the best performance for TTFB. This particular chart shows p95 and includes data from the top 1,000 networks (by number of IPv4 addresses advertised).
In these charts, longer bars are better; the bars indicate the number of networks for which that CDN had the lowest TTFB at p95.
This shows that Cloudflare had the lowest time to first byte (the TTFB, or the time it took the first byte of content to reach their browser) at the 95th percentile for the largest number of networks in the top 1,000 (based on the number of IPv4 addresses they advertise). Google was next, then Fastly, followed by Amazon CloudFront and Akamai.
All three measures, TCP connection time, time to first byte and time to last byte, matter to the user experience. For this example, I focused on time to first byte (TTFB) because it’s such a common measure of responsiveness of the web. It’s literally the time it takes a web server to start responding to a request from a browser for a web page.
To understand the data we gathered, let’s look at two large US-based ISPs: Cox and Comcast. Cox serves about 6.5 million customers and Comcast has about 30 million customers. We performed roughly 22,000 measurements on Cox’s network and 100,000 on Comcast’s. Below we’ll make use of measurement counts to rank networks by their size; here we see that our measurement counts and the customer counts for Cox and Comcast track nicely.
Cox Communications has ASN 22773 and our data shows that the p95 TTFB for the five CDNs was as follows: Cloudflare 332.6ms, Fastly 357.6ms, Google 380.3ms, Amazon CloudFront 404.4ms and Akamai 441.5ms. In this case Cloudflare was the fastest and about 7% faster than the next CDN (Fastly) which was faster than Google and so on.
Looking at Comcast (with ASN 7922) we see p95 TTFB was 323.7ms (Fastly), 324.2ms (Cloudflare), 353.7ms (Google), 384.6ms (Akamai) and 418.8ms (Amazon CloudFront). Here Fastly (323.7ms) edged out Cloudflare (324.2ms) by 0.2%.
Figures like these go into determining which CDN is the fastest for each network for this analysis and the charts presented. At a granular level they matter to Cloudflare as we can then target networks and connectivity for optimization.
Shown below are the results for three different measurement types (TCP connection time, TTFB and TTLB) with two different aggregations (mean and p95) and for three different groupings of networks.
The first grouping is the simplest: it’s all the networks we have data for. The second grouping is the one used above, the top 1,000 networks by number of IP addresses. But we also show a third grouping, top 1,000 networks by number of observations. That last group is important.
Top 1,000 networks by number of IP addresses is interesting (it includes the major ISPs), but it also includes some networks that have huge numbers of IP addresses available that aren’t necessarily used. These come about because of historical allocations of IP addresses to organisations like the US Department of Defense.
Cloudflare’s testing reveals which networks are most used, and so we also report results for the top 1,000 networks by number of observations to get an idea of how we’re performing on networks with heavy usage.
Hence, we have 18 charts showing all combinations of (TCP connection time, TTFB, TTLB), (mean, p95) and (all networks, top 1,000 networks by IP count, top 1,000 networks by observations).
You’ll see that in two of the 18 charts Cloudflare is not #1 (we’re #2). We’ll be working to make sure we’re #1 for every measurement over the next few weeks. Both of those are average times. We’re most interested in the p95 measurements because they show the “reasonable worst case” scenario for someone using the Internet. But as we go about optimizing performance we want to be #1 on every chart, so that we’re top no matter how performance is measured.
TCP Connection Time
Let’s start with TCP connection time to get a sense of how well-connected the five companies we’ve measured are. Recall that longer bars are better here: they indicate that the particular CDN had the best performance for that many networks, and more networks is better.
Time To First Byte (TTFB)
Next up is TTFB for the five companies. Once again, longer bars are better: they mean more networks where that CDN had the lowest TTFB.
Time To Last Byte (TTLB)
And finally, the TTLB measurements. Once again, longer bars are better: they mean more networks where that CDN had the lowest TTLB.
Looking not just at where we are #1 but where we are #1 or #2 helps us see how quickly we can optimize our network to be #1 in more places. Examining the top 1,000 networks by observations we see that we’re currently #1 or #2 for TTFB in 69.9% of networks, for TTLB in 65.0% of networks and for TCP connection time in 70.5%.
To see how much optimization we need to do to go from #2 to #1, we looked at the three measures. Where we are #2, the #1 CDN’s median TTFB is 92.3% of ours, its median TTLB is 94.0% of ours, and its median TCP connection time is 91.5% of ours.
The last figure is very interesting because it shows that we can make big gains by optimizing network-level routing.
Where’s the world map?
It’s very common to present data about Internet performance on maps. World maps look good but they obscure information. The reason is very simple: world maps show geography (and, depending on the projection, a very skewed version of the world’s actual countries and land masses).
Here’s an example world map with countries colored by who had the lowest TTLB at p95. Cloudflare is in orange, Amazon CloudFront in yellow, Google in purple, Akamai in red and Fastly in blue.
What that doesn’t show is population. And Cloudflare is interested in how well we perform for people. Consider Russia and Indonesia. Indonesia has double the population of Russia and about 1/10th of the land.
By focusing on networks we can optimize for the people who use the Internet. For example, Biznet is a major ISP in Indonesia with ASN 17451. Looking at TTFB at p95 (the same statistic we discussed earlier for US ISPs Cox and Comcast), we see that for Biznet users Cloudflare has the fastest p95 TTFB at 677.7ms, followed by Fastly at 744.0ms, Google at 872.8ms, Amazon CloudFront at 1,239.9ms and Akamai at 1,248.9ms.
The data we’ve gathered gives us a granular view of every network that connects to Cloudflare. This allows us to optimize routes and over the coming days and weeks we’ll be using the data to further increase Cloudflare’s performance.
In just over a week it’s Cloudflare’s Birthday Week, and we are aiming to improve our performance in 10% of the networks where we are not #1 today. Stay tuned.
Cloudflare has millions of free customers. Not only is it something we’re incredibly proud of in the context of helping to build a better Internet — but it’s something that has made the Cloudflare service measurably better. One of the ways we’ve benefited is that it’s created a very strong imperative for Cloudflare to maintain a network that is as efficient as possible. There’s simply no other way to serve so many free customers.
In the spirit of this, we are very excited about the latest step in our energy-efficiency journey: turning to Arm for our server CPUs. It has been a long journey getting here — we started testing our first Arm CPUs all the way back in November 2017. It’s only recently, however, that the quantum of energy efficiency improvement from Arm has become clear. Our first Arm CPU was deployed in production earlier this month — July 2021.
Our most recently deployed generation of edge servers, Gen X, used AMD Rome CPUs. Compared with those, the newest Arm-based CPUs process an incredible 57% more Internet requests per watt. While AMD has a sequel to Rome, Milan (which Cloudflare will also be deploying), it doesn’t achieve the same degree of energy efficiency as the Arm processor, managing only 39% more requests per watt than the Rome CPUs in our existing fleet. As Arm-based CPUs become more widely deployed, and our software is further optimized to take advantage of the Arm architecture, we expect further improvements in the energy efficiency of Arm servers.
Using Arm, Cloudflare can now securely process over ten times as many Internet requests for every watt of power consumed as we did with the servers we designed in 2013.
(In the graphic below, for 2021, the perforated data point refers to x86 CPUs, whereas the bold data point refers to Arm CPUs)
As Arm server CPUs demonstrate their performance and become more widely deployed, we hope this will inspire x86 CPU manufacturers (such as Intel and AMD) to urgently take energy efficiency more seriously. This is especially important since, worldwide, x86 CPUs continue to represent the vast majority of global data center energy consumption.
Together, we can reduce the carbon impact of Internet use. The environment depends on it.
Caching is a magic trick. Instead of a customer’s origin responding to every request, Cloudflare’s 200+ data centers around the world respond with content that is cached geographically close to visitors. This dramatically improves the load performance for web pages while decreasing the bandwidth costs by having Cloudflare respond to a request with cached content.
However, if content is not in cache, Cloudflare data centers must contact the origin server to receive the content. This isn’t as fast as delivering content from cache. It also places load on an origin server, and is more costly compared to serving directly from cache. These issues can be amplified depending on the geographic distribution of a website’s visitors, the number of data centers contacting the origin, and the available origin resources for responding to requests.
To decrease the number of times our network of data centers communicates with an origin, we organize data centers into tiers so that only upper-tier data centers can request content from an origin; they then spread that content to the lower tiers. This means content loads faster for visitors, is cheaper to serve, and consumes fewer origin resources.
Today, I’m thrilled to announce a fundamental improvement to Argo Tiered Cache we’re calling Smart Tiered Cache Topology. When enabled, Argo Tiered Cache will now dynamically select the single best upper tier for each of your website’s origins while providing tiered cache analytics showing how your custom topology is performing.
Smarter Tiered Cache Topology Generation
Tiered Cache is part of Argo, a constellation of products that analyzes and optimizes routing decisions across the global Internet in real-time by processing information from every Cloudflare request to determine which routes to an origin are fast, which are slow, and what the optimum path from visitor to content is at any given moment. Previously, Argo Tiered Cache would use a static collection of upper-tier data centers to communicate with the origin. With the improvements we’re announcing today, Tiered Cache can now dynamically find the single best upper tier for an origin using Argo performance and routing data. When Argo is enabled and a request for particular content is made, we collect latency data for each request to pick the optimal path. Using this latency data, we can determine how well any upper-tier data center is connected with an origin and can empirically select the best data center with the lowest latency to be the upper tier for an origin.
Argo Tiered Cache
Taking one step back, tiered caching is a practice where Cloudflare’s network of global data centers are subdivided into a hierarchy of upper tiers and lower tiers. In order to control bandwidth and number of connections between an origin and Cloudflare, only upper tiers are permitted to request content from an origin and must propagate that information to the lower tiers. In this way, Cloudflare data centers first talk to each other to find content before asking the origin. This practice improves bandwidth efficiency by limiting the number of data centers that can ask the origin for content, reduces origin load, and makes websites more cost-effective to operate. Argo Tiered Cache customers only pay for data transfer between the client and edge, and we take care of the rest. Tiered caching also allows for improved performance for visitors, because distances and links traversed between Cloudflare data centers are generally shorter and faster than the links between data centers and origins.
Previously, when Argo Tiered Cache was enabled for a website, several of Cloudflare’s largest and most-connected data centers were determined to be upper tiers and could pull content from an origin on a cache MISS. While utilizing a topology consisting of numerous upper-tier data centers may be globally performant, we found that cost-sensitive customers generally wanted to find the single best upper tier for their origin to ensure efficient data transfer of their content to Cloudflare’s network. We built Smart Tiered Cache Topology for this reason.
How to enable Smart Tiered Cache Topology
When you enable Argo Tiered Cache, Cloudflare now by default concentrates connections to origin servers so they come from a single data center. This is done without needing to work with our Customer Success or Solutions Engineering organization to custom configure the best single upper tier. Argo customers can generate this topology by:
1. Logging into your Cloudflare account.
2. Navigating to the Traffic tab in the dashboard.
3. Ensuring you have Argo enabled.
From there, Non-Enterprise Argo customers will automatically be enrolled in Smart Tiered Cache Topology without needing to make any additional changes.
Enterprise customers can select the type of topology they’d like to generate.
More data, fewer problems
Once enabled, in addition to performance and cost improvements, Smart Tiered Cache Topology also delivers summary analytics for how the upper tiers are performing, so that you can monitor the cost and performance benefits your website is receiving. These analytics are available in the Cache tab of the dashboard, in the Tiered Cache section. The “Primary Data Center” and “Secondary Data Center” fields show you which data centers were determined to be the best upper tier and failover for your origin. “Cached Hits” and the “Hit Ratio” show you the proportion of requests that were served by the upper tier and how many needed to be forwarded on to the origin for a response. “Bytes Saved” indicates the total transfer from the upper-tier data center to the lower tiers, showing the total bandwidth saved by having Cloudflare’s lower-tier data centers ask the upper tiers for the content instead of the origin.
Smart Tiered Cache Topology works with Cloudflare’s existing products to deliver you a seamless, easy, and performant experience that saves you money and provides you useful information about how your upper tiers are working with your origins. Smart Tiered Cache Topology stands on the shoulders of some of the most resilient and useful products at Cloudflare to provide even more benefits to webmasters.
If you’re interested in seeing how Argo and Smart Tiered Cache Topology can benefit your web property, please log in to your Cloudflare account and find more information in the Traffic tab of the dashboard.
A few years ago, we released Argo to help make the Internet faster and more efficient. Argo observes network conditions and finds the optimal route across the Internet for origin server requests, avoiding congestion along the way.
Tiered Cache is an Argo feature that reduces the number of data centers responsible for requesting assets from the origin. With Tiered Cache active, a request in South Africa won’t go directly to an origin in North America, but, instead, look in a large, nearby data center to see if the data requested is cached there first. The number and location of the data centers used by Tiered Cache is controlled by a piece of configuration called the topology. By default, we use a generic topology for every customer that strikes a balance between cache hit ratios and latency that is suitable for most users.
Today we’re introducing Smart Topology, which maximizes cache hit ratios by building on Argo’s internal infrastructure to identify the single best data center for making requests to the origin.
The standard method for caching assets is to let each data center be a reverse proxy for the origin server. In this scheme, a miss in any data center causes a request to the origin for an asset. A request to the origin for one asset could be made as many times as there are data centers.
A cache miss in any data center will result in a request being sent to the origin server even if the asset is cached in some other data center. This is because the data centers are completely oblivious of each other.
Theoretically, a request for the asset would have to be sent to every data center in order to reduce the cache misses to the minimum possible. However, sending every request to every data center is not practical.
The minimum possible cache hit latency is achieved if the asset is moved into the nearest cache before the request for it is made, but this kind of prediction is generally not possible. Instead, a good heuristic is to move the asset into the nearest cache after the first cache miss.
However, the asset has to be copied from somewhere and it isn’t possible to know where in the network it might be without querying each data center.
To avoid querying each data center, a copy of the asset has to be stored in a known location after the first cache miss so it is available to other data centers. This is precisely what Tiered Cache does.
Tiered Cache improves cache hit ratios by allowing some data centers to serve as caches for others, before the latter has to make a request to the origin. With Tiered Cache, certain data centers are reverse proxies to the origin for other data centers.
If the proxied data centers make requests for the same asset, the asset will already be cached in the proxying data center and can be retrieved from there rather than from the origin. Fewer requests to the origin are made overall.
In Tiered Cache, the topology describes which data center should serve as a proxy for others.
For customers, devising an optimal topology is a challenge requiring optimization and continuous maintenance. The best topology is a configuration based on information that is privately held by the customer and other information held only by Cloudflare.
For instance, knowing the desired balance of latency versus cache hit ratio is information only the customer has, but how to best make use of the Internet is something we would know. Enterprise customers usually have dedicated infrastructure teams that work with our solutions engineers to manually optimize and maintain their tiered cache topology.
Not every customer would want to personalize their topology. For this reason a generic topology exists.
The generic topology is designed to achieve good latency and cache efficiency for any origin, regardless of location. A balance is struck between two constraints — cache efficiency and latency.
The generic topology has multiple proxying data centers that are distributed around the world in order to ensure that requests that result in a cache miss do not take a very long detour before going to the origin. There is a balance between the number of proxying data centers and the cache hit ratio, because the proxying data centers are oblivious to each other.
If a proxying data center is taken offline, the proxied data centers either use a fallback (if the fallback is online) or revert to behaving like Tiered Cache is disabled.
To achieve the best balance for general usage, the generic topology instructs the smaller data centers to be proxied by the larger data centers in the same geographic region.
Smart Topology assumes the origin is in one place and then automatically configures itself to be optimal once the customer just flips a switch in the dashboard. In order to actually do this, Cloudflare needs to be able to determine which data center has the lowest latency to the origin without making the customer tell Cloudflare where the origin is.
Methods for Latency Determination
There are a few ways to determine which data center has the lowest latency with respect to the origin.
IP geolocation: Physical distance can be used as an approximation for latency, but Smart Topology was not built this way for a couple of reasons. First, even the best commercial IP geolocation database doesn’t have the required coverage and accuracy. Second, even with perfect accuracy, physical distance is a questionable approximation of Internet latency.
Probing: Latency to an IP address can be determined exactly by probing that address. The probe can just be the time required to perform the TCP handshake. Each data center probes the origin so that the latencies can be directly measured and the minimum can be found. Except for edge cases involving Anycast and TCP termination, we can assume that the latency to an IP address is the same as the latency to the origin server behind that IP address.
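As a rough illustration of such a probe (a simplified sketch with assumed names, not Cloudflare’s probing infrastructure), timing a plain TCP handshake to an origin IP might look like this:

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Sketch: measure TCP handshake time to an IPv4 address, in milliseconds.
 * Error handling is minimal; a real prober would add timeouts, retries,
 * and multiple samples. */
static double tcp_handshake_ms(const char *ip, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1.0;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    int rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);

    if (rc != 0)
        return -1.0;
    return (end.tv_sec - start.tv_sec) * 1e3 + (end.tv_nsec - start.tv_nsec) / 1e6;
}

int main(void)
{
    /* 203.0.113.10 is a documentation-only address used as a placeholder. */
    printf("handshake: %.2f ms\n", tcp_handshake_ms("203.0.113.10", 443));
    return 0;
}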
Topology Selection Algorithm
The goal of the topology selection algorithm is to minimize cache misses and latency. The topology chooses a single proxying data center in order to maximize the cache hit ratio. The proxying data center is chosen to be close to the origin so that the latencies of cache misses in the proxied data centers are not much worse than they would be with tiered cache turned off.
The choice should eventually become stable. Stability is important because each time the choice changes, cache misses in proxied data centers are likely to cause cache misses in the new proxying data center. Capacity is important because when a data center goes offline, it can cause a large number of cache misses. Minimizing latency to the origin is important to ensure that the network is used efficiently.
The data center selection algorithm is rather like a leaderboard of the fastest data center for each origin. As data is collected, a faster data center can knock others off a given origin’s leaderboard. This competition is based on the 24 hour median latency and is held each hour. Only a subset of data centers deemed large enough are permitted to compete.
Eventually, the choice for proxying data centers becomes stable. Over time, data centers produce competing records for each origin and less competitive records in the leaderboard are replaced as necessary. Thus, latencies for any origin on the leaderboard can only monotonically decrease. There are always physical limits in the real world, so eventually the ideal data center will set a record that is too good to beat.
Also, the leaderboard actually includes both the lowest latency data center and the second lowest latency data center. The second lowest latency data center serves as a fallback if the preferred data center is taken offline for maintenance.
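A toy version of that per-origin leaderboard, with hypothetical names and heavily simplified compared to what runs in production, could look like this:

#include <stdint.h>

/* Sketch: per-origin record of the best and second-best (fallback) proxying
 * data centers, keyed by 24-hour median latency. Illustrative only. */
struct leaderboard_entry {
    uint32_t best_dc;      /* data center with the lowest median latency */
    double   best_ms;
    uint32_t fallback_dc;  /* second-lowest latency data center */
    double   fallback_ms;
};

/* Called once per hour for each competing data center with its 24-hour
 * median latency to the origin. Returns 1 if the leaderboard changed. */
int leaderboard_submit(struct leaderboard_entry *e, uint32_t dc, double median_ms)
{
    if (dc == e->best_dc) {
        if (median_ms < e->best_ms) {
            e->best_ms = median_ms;
            return 1;
        }
        return 0;
    }
    if (median_ms < e->best_ms) {
        /* The previous leader becomes the fallback. */
        e->fallback_dc = e->best_dc;
        e->fallback_ms = e->best_ms;
        e->best_dc = dc;
        e->best_ms = median_ms;
        return 1;
    }
    if (median_ms < e->fallback_ms) {
        e->fallback_dc = dc;
        e->fallback_ms = median_ms;
        return 1;
    }
    return 0;
}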
Anycast Networks: We are measuring the latency to the origin IP address and assuming that it represents the latency to the origin server, but this can break down in certain cases. A few cloud providers other than Cloudflare also use Anycast technology to provide their services. In Anycast, multiple machines can share an IP address regardless of where they are connected to the Internet, and the Internet will typically route packets destined for that address to the closest machine. If an Anycast network is used to proxy an origin server, then the apparent latency to the IP address for the origin server is actually the latency to the edge of the Anycast network rather than the latency to the origin server. The real latency to the origin server cannot be determined by probing.
The algorithm would fail to select the single best proxying data center if the latencies are not representative of the actual latency between data center and origin. Selecting the wrong data center would adversely affect latencies for requests to the origin, and could be expensive.
For instance, imagine a cloud provider provides an IP address that actually routes to multiple data centers all over the world. Packets are routed through private infrastructure to the correct destination once they enter the network. The lowest latency data center to this Anycast IP address could potentially even be on a different continent than the actual origin server. Therefore, the apparent latency cannot actually be trusted as a true measure of actual latency to the origin.
The data center selection algorithm assumes that the origin is in a single geographic location and can be probed to determine latency from each data center. These networks break one or both of those assumptions, so a procedure had to be developed to detect them. First, it is assumed that the IP appears in a single geographic location and is not proxied by such a network. The latency to the origin is bounded by the speed of light through fiber. Although the distance between any data center and the origin server is not known, the distances between data centers are known to Cloudflare.
Imagine the journey between two data centers, with the origin server as a pitstop along the way. Then, the theoretical minimum possible pair of latencies between the origin server and those two data centers can be computed. We have latency probe data from both of those data centers to the origin, so we can check whether the observed latencies are lower than what is physically possible.
The original assumption was that the origin IP address identifies an origin server that is in one location and the latency to that IP address is the latency to the origin server. If the observed latencies are faster than light then clearly the assumption is false. Smart Topology falls back to the generic topology when the original assumption does not hold. To be extra sure, we check this constraint on a bunch of data centers around the world and fall back if there is even a single physically impossible observation.
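Here’s a small sketch of that sanity check. The constants and names are assumptions: roughly 200 km per millisecond for light in fiber, and measured round-trip times halved to approximate one-way latency.

#include <stdbool.h>

/* Sketch: detect physically impossible latency pairs. If the origin really
 * sits "between" data centers A and B, the one-way latencies from A and B to
 * the origin must add up to at least the minimum light-in-fiber time between
 * A and B. Constants are rough approximations. */
#define FIBER_KM_PER_MS 200.0  /* roughly 2/3 of the speed of light in vacuum */

bool latencies_are_plausible(double rtt_a_ms, double rtt_b_ms, double distance_a_b_km)
{
    double one_way_sum_ms = rtt_a_ms / 2.0 + rtt_b_ms / 2.0;
    double min_possible_ms = distance_a_b_km / FIBER_KM_PER_MS;
    return one_way_sum_ms >= min_possible_ms;
}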
The Big Picture
When Smart Topology is enabled many Cloudflare systems work together to ensure the correct data center is eventually used to request assets from the origin.
When the customer enables Tiered Cache Smart Topology, one of a few things can happen from the perspective of the origin. If a proxying data center has already been assigned to the CIDR block that encompasses the origin IP, the preferred or fallback data center is used to request assets from the origin. Otherwise, the generic topology is used to determine which proxying data centers to use to pull assets from the origin. The latency to the proxying data center should only decrease as the choice for proxying data center is updated over time.
Developing this technology offered a lot of opportunities to exercise great engineering and build an impactful product. It was not done in a vacuum; we used infrastructure that Cloudflare had already built, and we moved along that exponential gradient of using existing progress to make more progress. Building this framework opens a lot of doors to future progress too; for instance, in the future, we can explore ways to select the ideal proxying data center even for origins behind Anycast networks that hide the true latency to the origin.
“Where has all that free memory gone?” This is the question we ask ourselves every time our application emits that dreaded OutOfMemoryError just before it crashes. Amazon CodeGuru Profiler can help you find the answer.
Thanks to its brand-new memory profiling capabilities, troubleshooting and resolving memory issues in Java applications (or almost anything that runs on the JVM) is much easier. AWS launched the CodeGuru Profiler Heap Summary feature at re:Invent 2020. This is the first step in helping us, developers, understand what our software is doing with all that memory it uses.
The Heap Summary view shows a list of Java classes and data types present in the Java Virtual Machine heap, alongside the amount of memory they’re retaining and the number of instances they represent. The following screenshot shows an example of this view.
Because CodeGuru Profiler is a low-overhead, production profiling service designed to be always on, it can capture and represent how memory utilization varies over time, providing helpful visual hints about the object types and the data types that exhibit a growing trend in memory consumption.
In the preceding screenshot, we can see that several lines on the graph are trending upwards:
The red top line, horizontal and flat, shows how much memory has been reserved as heap space in the JVM. In this case, we see a heap size of 512 MB, which can usually be configured in the JVM with command line parameters like -Xmx.
The second line from the top, blue, represents the total memory in use in the heap, independent of object type.
The third, fourth, and fifth lines show how much memory space each specific type has been using historically in the heap. We can easily spot that java.util.LinkedHashMap$Entry and java.lang.UUID display growing trends, whereas byte has a flat line and seems stable in memory usage.
Types that exhibit a constantly growing trend of memory utilization over time deserve a closer look. Profiler helps you focus your attention on these cases. By associating the information presented by the Profiler with your own knowledge of your application and code base, you can evaluate whether the amount of memory being used for a specific data type can be considered normal, or whether it might be a memory leak – the unintentional holding of memory by an application due to a failure to free unused objects. In our example above, java.util.LinkedHashMap$Entry and java.lang.UUID are good candidates for investigation.
To make this functionality available to customers, CodeGuru Profiler uses the power of Java Flight Recorder (JFR), which is now openly available with Java 8 (since OpenJDK release 262) and above. The Amazon CodeGuru Profiler agent for Java, which already does an awesome job capturing data about CPU utilization, has been extended to periodically collect memory retention metrics from JFR and submit them for processing and visualization via Amazon CodeGuru Profiler. Thanks to its high stability and low overhead, the Profiler agent can be safely deployed to services in production, because it is exactly there, under real workloads, that really interesting memory issues are most likely to show up.
For more information about CodeGuru Profiler and other AI-powered services in the Amazon CodeGuru family, see Amazon CodeGuru. If you haven’t tried the CodeGuru Profiler yet, start your 90-day free trial right now and understand why continuous profiling is becoming a must-have in every production environment. For Amazon CodeGuru customers who are already enjoying the benefits of always-on profiling, this new feature is available at no extra cost. Just update your Profiler agent to version 1.1.0 or newer, and enable Heap Summary in your agent configuration.
Releasing often is a good thing. It’s cool, and helps us deliver new functionality quickly, but I want to share one positive side-effect – it helps with analyzing production performance issues.
We do releases every 5 to 10 days and after a recent release, the application CPU chart jumped twice (the lines are differently colored because we use blue-green deployment):
What are the typical ways to find performance issues with production loads?
Connect a profiler directly to production – tricky, as it requires managing network permissions and might introduce unwanted overhead
Run performance tests against a staging or local environment and do profiling there – good, except your performance tests might not hit exactly the functionality that causes the problem (this is what happened in our case, as the problem was caused by particular types of API calls that weren’t covered by our performance tests). Also, performance tests can be tricky
Do a thread dump (and heap dump) and analyze them locally – a good step, but requires some luck and a lot of experience analyzing dumps, even if equipped with the right tools
Check your git history / release notes for what change might have caused it – this is what helped us resolve the issue. And it was possible because there were only 10 days of commits between the releases.
We could go through all of the commits and spot potential performance issues. Most of them turned out not to be a problem, and one seemingly unproblematic piece was discovered to be the culprit after we commented it out for a brief period and deployed a quick release without it, to test the hypothesis. I’ll share a separate post about the particular issue, but we would have had to waste a lot more time if that release had had three months’ worth of commits rather than 10 days.
Sometimes it’s not an obvious spike in CPU or memory, but a more gradual issue that you introduce at some point and that only starts being a problem a few months later. That’s what happened a few months ago, when we noticed a steady growth in CPU usage as the amount of ingested data grew. Logical in theory, but the CPU usage grew faster than the data ingestion rate, which isn’t good.
So we were able to answer the question “when did it start growing?” and pinpoint the release that introduced the issue. Because that release had only 5 days of commits, it was much easier to find the culprit.
All of the above techniques are useful and should be employed at the right time. But releasing often gives you a hand with analyzing where a performance issue is coming from.
Late last year I read a blog post about our CSAM image scanning tool. I remember thinking: this is so cool! Image processing is always hard, and deploying a real image identification system at Cloudflare is no small achievement!
Some time later, I was chatting with Kornel: “We have all the pieces in the image processing pipeline, but we are struggling with the performance of one component.” Scaling to Cloudflare needs ain’t easy!
The problem was in the speed of the matching algorithm itself. Let me elaborate. As John explained in his blog post, the image matching algorithm creates a fuzzy hash from a processed image. The hash is exactly 144 bytes long.
The hash is designed to be used in a fuzzy matching algorithm that can find “nearby”, related images. The specific algorithm is well defined, but making it fast is left to the programmer — and at Cloudflare we need the matching to be done super fast. We want to match thousands of hashes per second, of images passing through our network, against a database of millions of known images. To make this work, we need to seriously optimize the matching algorithm.
Naive quadratic algorithm
The first algorithm that comes to mind has O(K*N) complexity: for each query, go through every hash in the database. In a naive implementation, this creates a lot of work. But how much work exactly?
First, we need to explain how fuzzy matching works.
Given a query hash, the fuzzy match is the “closest” hash in a database. This requires us to define a distance. We treat each hash as a vector containing 144 numbers, identifying a point in a 144-dimensional space. Given two such points, we can calculate the distance using the standard Euclidean formula.
For our particular problem, though, we are interested in the “closest” match in a database only if the distance is lower than some predefined threshold. Otherwise, when the distance is large, we can assume the images aren’t similar. This is the expected result — most of our queries will not have a related image in the database.
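The post’s actual distance routine isn’t reproduced here; a minimal sketch of it, with assumed naming, might look like this:

#include <stdint.h>

/* Squared Euclidean distance between two 144-byte fuzzy hashes. */
uint32_t distance_squared(const uint8_t a[144], const uint8_t b[144])
{
    uint32_t sum = 0;
    for (int i = 0; i < 144; i++) {
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        sum += (uint32_t)(d * d);
    }
    return sum;
}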
This function returns the squared distance. We avoid computing the actual distance to save us from running the square root function – it’s slow. Inside the code, for performance and simplicity, we’ll mostly operate on the squared value. We don’t need the actual distance value, we just need to find the vector with the smallest one. In our case it doesn’t matter if we’ll compare distances or squared distances!
As you can see, fuzzy matching is basically a standard problem of finding the closest point in a multi-dimensional space. Surely this has been solved in the past — but let’s not jump ahead.
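A naive scan over the whole database, building on the sketch above (assumed layout and names), is then just a loop that keeps the best score seen so far:

#include <stddef.h>
#include <stdint.h>

/* Naive O(K*N) search: for one query, scan every hash in the database and
 * remember the smallest squared distance and its index. */
size_t find_nearest(const uint8_t (*db)[144], size_t n,
                    const uint8_t query[144], uint32_t *best_out)
{
    uint32_t best = UINT32_MAX;
    size_t best_idx = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t d = distance_squared(db[i], query);
        if (d < best) {
            best = d;
            best_idx = i;
        }
    }
    *best_out = best;
    return best_idx;
}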
While this code might be simple, we expect it to be rather slow. Finding the smallest hash distance in a database of, say, 1M entries, would require going over all records, and would need at least:
144 * 1M subtractions
144 * 1M multiplications
144 * 1M additions
And more. This alone adds up to 432 million operations! How does it look in practice? To illustrate this blog post we prepared a full test suite. The large database of known hashes can be well emulated by random data. The query hashes can’t be random and must be slightly more sophisticated, otherwise the exercise wouldn’t be that interesting. We generated the test hashes by byte-swapping actual data from the database — this allows us to precisely control the distance between test hashes and database hashes. Take a look at the scripts for details. Here’s our first run of the first, naive, algorithm:
$ make naive
< test-vector.txt ./mmdist-naive > test-vector.tmp
Total: 85261.833ms, 1536 items, avg 55.509ms per query, 18.015 qps
We matched 1,536 test hashes against a database of 1 million random vectors in 85 seconds. It took 55ms of CPU time on average to find the closest neighbour. This is rather slow for our needs.
SIMD for help
An obvious improvement is to use more complex SIMD instructions. SIMD is a way to instruct the CPU to process multiple data points using one instruction. This is a perfect strategy when dealing with vector problems — as is the case for our task.
Using AVX2 is easier said than done. There is no single instruction to count the Euclidean distance between two uint8 vectors! The fastest way we could find of computing the full distance between two 144-byte vectors with AVX2 was authored by Vlad.
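Vlad’s routine isn’t reproduced here; the sketch below is a reconstruction of the same approach for illustration only, not the production code:

#include <immintrin.h>
#include <stdint.h>

/* Sketch of an AVX2 squared-distance routine for 144-byte hashes
 * (nine blocks of 16 bytes). Reconstruction for illustration only. */
uint32_t distance_squared_avx2(const uint8_t a[144], const uint8_t b[144])
{
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < 144; i += 16) {
        /* Load 16 bytes from each hash and widen uint8 -> int16. */
        __m256i va = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        /* Differences fit in int16; madd squares them and pairs results into int32. */
        __m256i d = _mm256_sub_epi16(va, vb);
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(d, d));
    }
    /* Collapse the eight int32 partial sums into one value. */
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    return (uint32_t)_mm_cvtsi128_si32(sum);
}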
It’s actually simpler than it looks: load 16 bytes, convert the vectors from uint8 to int16, subtract them, store intermediate sums as int32, repeat. At the end, a short sequence of four instructions extracts the partial sums into the final sum. This AVX2 code improves performance around 3x:
$ make naive-avx2
Total: 25911.126ms, 1536 items, avg 16.869ms per query, 59.280 qps
We measured 17ms per item, which still falls short of our expectations. Unfortunately, we can’t push it much further without major changes. The problem is that this code is limited by memory bandwidth. The measurements come from my Intel i7-5557U CPU, which has a maximum theoretical memory bandwidth of just 25GB/s. The database of 1 million entries takes 137MiB, so it takes at least 5ms just to feed the database to my CPU. With this naive algorithm we won’t be able to go below that.
Vantage Point Tree algorithm
Since the naive brute force approach failed, we tried using more sophisticated algorithms. My colleague Kornel Lesiński implemented a super cool Vantage Point algorithm. After a few ups and downs, optimizations and rewrites, we gave up. Our problem turned out to be unusually hard for this kind of algorithm.
We observed “the curse of dimensionality”. Space partitioning algorithms don’t work well in problems with large dimensionality — and in our case, we have an enormous number of 144 dimensions. K-D trees are doomed. Locality-sensitive hashing is also doomed. It’s a bizarre situation in which the space is unimaginably vast, but everything is close together. The volume of the space is a 347-digit-long number, but the maximum distance between points is just 3060, which is sqrt(255*255*144).
Space partitioning algorithms are fast, because they gradually narrow the search space as they get closer to finding the closest point. But in our case, the common query is never close to any point in the set, so the search space can’t be narrowed to a meaningful degree.
A VP-tree was a promising candidate, because it operates only on distances, subdividing space into near and far partitions, like a binary tree. When it has a close match, it can be very fast, and doesn’t need to visit more than O(log(N)) nodes. For non-matches, its speed drops dramatically: the algorithm ends up visiting nearly half of the nodes in the tree. Everything is close together in 144 dimensions! Even though the algorithm avoided visiting more than half of the nodes in the tree, the cost of visiting each remaining node was higher than in a simple linear scan, so the search ended up being slower overall.
Smarter brute force?
This experience got us thinking. Since space partitioning algorithms can’t narrow down the search, and still need to go over a very large number of items, maybe we should focus on going over all the hashes, extremely quickly. We must be smarter about memory bandwidth though — it was the limiting factor in the naive brute force approach before.
Perhaps we don’t need to fetch all the data from memory.
The breakthrough came from the realization that we don’t need to compute the full distance between hashes. Instead, we can compute only a subset of dimensions, say 32 out of the total of 144. If this partial distance is already large, then there is no need to compute the full one! Computing more dimensions can only increase the distance, never reduce it.
The proposed algorithm works as follows:
1. Take the query hash and extract a 32-byte short hash from it
2. Go over all the 1 million 32-byte short hashes from the database. They must be densely packed in the memory to allow the CPU to perform good prefetching and avoid reading data we won’t need.
3. If the distance computed on the 32-byte short hash is greater than or equal to the best score so far, move on
4. Otherwise, investigate the hash thoroughly and compute the full distance.
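A scalar sketch of that two-stage scan, without the AVX2 used in make short-avx2 and with assumed names and layout, reusing the distance_squared sketch from earlier:

#include <stddef.h>
#include <stdint.h>

#define SHORT_LEN 32
#define FULL_LEN  144

/* Sketch of the two-stage scan: a cheap 32-byte partial distance first, and
 * the full 144-byte distance only when the partial distance still beats the
 * best score seen so far. Illustrative, not the production code. */
uint32_t find_nearest_short(const uint8_t (*short_db)[SHORT_LEN],
                            const uint8_t (*full_db)[FULL_LEN],
                            size_t n, const uint8_t query[FULL_LEN])
{
    uint32_t best = UINT32_MAX;
    for (size_t i = 0; i < n; i++) {
        uint32_t partial = 0;
        for (int j = 0; j < SHORT_LEN; j++) {
            int32_t d = (int32_t)short_db[i][j] - (int32_t)query[j];
            partial += (uint32_t)(d * d);
        }
        if (partial >= best)
            continue;  /* the full distance can only be larger, so skip it */
        uint32_t full = distance_squared(full_db[i], query);
        if (full < best)
            best = full;
    }
    return best;
}

In the thresholded variation described a little further down, best would simply start at the squared threshold instead of UINT32_MAX.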
Even though this algorithm needs to do less arithmetic and memory work, it’s not faster than the previous naive one. See make short-avx2. The problem is: we still need to compute a full distance for hashes that are promising, and there are quite a lot of them. Computing the full distance for promising hashes adds enough work, both in ALU and memory latency, to offset the gains of this algorithm.
There is one detail of our particular application of the image matching problem that will help us a lot moving forward. As we described earlier, the problem is less about finding the closest neighbour and more about proving that the neighbour with a reasonable distance doesn’t exist. Remember — in practice, we don’t expect to find many matches! We expect almost every image we feed into the algorithm to be unrelated to image hashes stored in the database.
It’s sufficient for our algorithm to prove that no neighbour exists within a predefined distance threshold. Let’s assume we are not interested in hashes more distant than, say, 220, which squared is 48,400. This makes our short-distance algorithm variation work much better:
$ make short-avx2-threshold
Total: 4994.435ms, 1536 items, avg 3.252ms per query, 307.542 qps
Origin distance variation
At some point, John noted that the threshold allows an additional optimization. We can order the hashes by their distance from some origin point. Given a query hash whose origin distance is A, we only need to inspect hashes whose origin distance lies between A-threshold and A+threshold, a consequence of the triangle inequality. This is pretty much how each level of a Vantage Point Tree works, just simplified. This optimization, ordering items in the database by their distance from an origin point, is relatively simple and can help save us a bit of work.
While great on paper, this method doesn’t introduce much gain in practice, as the vectors are not grouped in clusters — they are pretty much random! For the threshold values we are interested in, the origin distance algorithm variation gives us ~20% speed boost, which is okay but not breathtaking. This change might bring more benefits if we ever decide to reduce the threshold value, so it might be worth doing for production implementation. However, it doesn’t work well with query batching.
Transposing data for better AVX
But we’re not done with AVX optimizations! The usual problem with AVX is that the instructions don’t normally fit a specific problem. Some serious mind twisting is required to adapt the right instruction to the problem, or to reverse the problem so that a specific instruction can be used. AVX2 doesn’t have useful “horizontal” uint16 subtract, multiply and add operations. For example, _mm_hadd_epi16 exists, but it’s slow and cumbersome.
Instead, we can twist the problem to make use of the fast uint16 operations that are available. For example, the per-dimension subtraction and multiplication map directly onto AVX2 instructions (_mm256_sub_epi16 and _mm256_mullo_epi16).
The add would overflow in our case, but fortunately there is add-saturate _mm256_adds_epu16.
The saturated add is great and saves us conversion to uint32. It just adds a small limitation: the threshold passed to the program (i.e., the max squared distance) must fit into uint16. However, this is fine for us.
To effectively use these instructions, we need to transpose the data in the database: instead of storing hashes in rows, we can store them in columns.
So instead of:
[a1, a2, a3],
[b1, b2, b3],
[c1, c2, c3],
We can lay it out in memory transposed:
[a1, b1, c1],
[a2, b2, c2],
[a3, b3, c3],
Now we can load the first byte of 16 different hashes using one memory operation. In the next step, we can subtract the first byte of the query hash from all of them with a single instruction, and so on. The algorithm stays exactly the same as defined above; we just make the data easier to load and easier to process with AVX.
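As a rough illustration of what the transposed inner loop looks like, here is a sketch only, with made-up names, assuming the short 32-byte hashes and a multiple-of-16 database size (the real code also has to deal with remainders and with selecting promising candidates afterwards). The per-dimension work maps directly onto the AVX2 instructions mentioned above:

```rust
use std::arch::x86_64::*;

const SHORT: usize = 32; // dimensions in the short hash

// Squared distances between the query's first 32 bytes and 16 consecutive
// database hashes starting at index `base`. `columns[d]` holds the d-th byte
// of every hash, densely packed (the transposed layout).
#[target_feature(enable = "avx2")]
unsafe fn distances_16(columns: &[Vec<u8>], query: &[u8; SHORT], base: usize) -> [u16; 16] {
    let mut acc = _mm256_setzero_si256();
    for d in 0..SHORT {
        // One load fetches the d-th byte of 16 different hashes; widen to u16.
        let col = _mm_loadu_si128(columns[d].as_ptr().add(base) as *const __m128i);
        let col16 = _mm256_cvtepu8_epi16(col);
        // Broadcast the d-th byte of the query.
        let q16 = _mm256_set1_epi16(query[d] as i16);
        // The difference may be negative, but its square still fits in 16 bits
        // (255 * 255 = 65,025), so the low 16 bits of the product are correct.
        let diff = _mm256_sub_epi16(col16, q16);
        let sq = _mm256_mullo_epi16(diff, diff);
        // Saturating add: overflowing candidates stay pinned at 65,535 instead
        // of wrapping around and looking like good matches.
        acc = _mm256_adds_epu16(acc, sq);
    }
    let mut out = [0u16; 16];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, acc);
    out
}
```

Because the accumulator saturates at 65,535, any candidate whose running distance has already blown past the threshold simply stays pinned at the maximum and is discarded when the 16 results are compared against the threshold.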
$ make short-inv-avx2
Total: 1118.669ms, 1536 items, avg 0.728ms per query, 1373.062 qps
Whoa! This is pretty awesome. We started from 55ms per query, and we finished with just 0.73ms. There are further micro-optimizations possible, like memory prefetching or using huge pages to reduce page faults, but they have diminishing returns at this point.
What an optimization journey! We jumped between memory and ALU bottlenecked code. We discussed more sophisticated algorithms, but in the end, a brute force algorithm — although tuned — gave us the best results.
To get even better numbers, I experimented with an Nvidia GPU using CUDA. The CUDA intrinsics like vabsdiff4 and dp4a fit the problem perfectly. The V100 gave us some amazing numbers, but I wasn’t fully satisfied with it. Considering how many AMD Ryzen cores with AVX2 we can get for the cost of a single server-grade GPU, we leaned towards general purpose computing for this particular problem.
This is a great example of the type of complexities we deal with every day. Making even the best technologies work “at Cloudflare scale” requires thinking outside the box. Sometimes we rewrite the solution dozens of times before we find the optimal one. And sometimes we settle on a brute-force algorithm, just very very optimized.
The computation of hashes and image matching are challenging problems that require very CPU-intensive operations. The CPU we have available on the edge is scarce, and workloads like this are incredibly expensive. Even with the optimization work described in this blog post, running the CSAM scanner at scale is a challenge and has required a huge engineering effort. And we’re not done! We need to solve more hard problems before we’re satisfied. If you want to help, consider applying!
Making things fast is one of the things we do at Cloudflare. More responsive websites, apps, APIs, and networks directly translate into improved conversion and user experience. Today, Google announced that Google Search will directly take web performance and page experience data into account when ranking results on their search engine results pages (SERPs), beginning in May 2021.
Specifically, Google Search will prioritize results based on how pages score on Core Web Vitals, a measurement methodology that Cloudflare worked closely with Google to establish and that we have implemented support for in our analytics tools.
The Core Web Vitals metrics are Largest Contentful Paint (LCP, a loading measurement), First Input Delay (FID, a measure of interactivity), and Cumulative Layout Shift (CLS, a measure of visual stability). Each one is directly associated with user perceptible page experience milestones. All three can be improved using our performance products, and all three can be measured with our Cloudflare Browser Insights product, and soon, with our free privacy-aware Cloudflare Web Analytics.
SEO experts have long suspected that faster pages lead to better search ranking. With today’s announcement from Google, we can say with confidence that Cloudflare helps you achieve the web performance trifecta: our product suite makes your site faster, gives you direct visibility into how it is performing (and lets you use that data to iteratively improve), and directly drives improved search ranking and business results.
“Google providing more transparency about how Search ranking works is great for the open Web. The fact they are ranking using real metrics that are easy to measure with tools like Cloudflare’s analytics suite makes today’s announcement all the more exciting. Cloudflare offers a full set of tools to make sites incredibly fast and measure ‘incredibly’ directly.”
– Matt Weinberg, president of Happy Cog, a full-service digital agency.
Cloudflare helps make your site faster
Cloudflare offers a diverse, easy to deploy set of products to improve page experience for your visitors. We offer a rich, configurable set of tools to improve page speed, more than this post has room to cover. Unlike Fermat, who famously claimed a proof but said “the margin is too small to contain” it, and then left folks to spend three hundred-plus years trying to figure out his enigma, I’m going to tell you how to solve web performance problems with Cloudflare. Here are the highlights:
Caching and Smart Routing
The typical website is composed of a mix of static assets, like images and product descriptions, and dynamic content, like the contents of a shopping cart or a user’s profile page. Cloudflare caches customers’ static content at our edge, avoiding the need for a full roundtrip to origin servers each time content is requested. Because our edge network places content very close (in physical terms) to users, there is less distance to travel and page loads are consequently faster. Thanks, Einstein.
And Argo Smart Routing helps speed page loads that require dynamic content. It analyzes and optimizes routing decisions across the global Internet in real-time. Think Waze, the automobile route optimization app, but for Internet traffic.
Just as Waze can tell you which route to take when driving by monitoring which roads are congested or blocked, Smart Routing can route connections across the Internet efficiently by avoiding packet loss, congestion, and outages.
Using caching and Smart Routing directly improves page speed and experience scores like Web Vitals. With today’s announcement from Google, this also means improved search ranking.
Caching and Smart Routing are designed to reduce and speed up round trips from your users to your origin servers, respectively. Cloudflare also offers features to optimize the content we do serve.
Cloudflare Image Resizing allows on-demand sizing, quality, and format adjustments to images, including the ability to convert images to modern file formats like WebP and AVIF.
Delivering images this way to your end-users helps you save bandwidth costs and improve performance, since Cloudflare allows you to optimize images already cached at the edge.
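As an illustration of how this looks in practice (the path and option values here are just an example; see the Image Resizing documentation for the full option list), a resized variant can be requested through a URL prefix on the same zone:
https://example.com/cdn-cgi/image/width=800,quality=75,format=auto/uploads/hero.jpg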
For WordPress operators, we recently launched Automatic Platform Optimization (APO). With APO, Cloudflare will serve your entire site from our edge network, ensuring that customers see improved performance when visiting your site. By default, Cloudflare only caches static content, but with APO we can also cache dynamic content (like HTML) so the entire site is served from cache. This removes round trips to the origin, drastically improving TTFB and other site performance metrics. In addition to caching dynamic content, APO caches third-party scripts to further reduce the need for requests that leave Cloudflare’s edge network.
Workers and Workers Sites
Reducing load on customer origins and making sure we serve the right content to the right clients at the right time are great, but what if customers want to take things a step further and eliminate origin round trips entirely? What if there was no origin? Before we get into Schrödinger’s cat/server territory, we can make this concrete: Cloudflare offers tools to serve entire websites from our edge, without an origin server being involved at all. For more on Workers Sites, check out our introductory blog post and peruse our Built With Workers project gallery.
As big proponents of dogfooding, many of Cloudflare’s own web properties are deployed to Workers Sites, and we use Web Vitals to measure our customers’ experiences.
Using Workers Sites, our developers.cloudflare.com site, which gets hundreds of thousands of visits a day and is critical to developers building atop our platform, is able to attain incredible Web Vitals scores:
These scores are superb, showing the performance and ease of use of our edge, our static website delivery system, and our analytics toolchain.
Cloudflare Web Analytics and Browser Insights directly measure the signals Google is prioritizing
As illustrated above, Cloudflare makes it easy to directly measure Web Vitals with Browser Insights. Enabling Browser Insights for websites proxied by Cloudflare takes one click in the Speed tab of the Cloudflare dashboard. And if you’re not proxying sites through Cloudflare, Web Vitals measurements will be supported in our upcoming, free, Cloudflare Web Analytics product that any site, using Cloudflare’s proxy or not, can use.
Web Vitals breaks down user experience into three components:
Loading: How long did it take for content to become available?
Interactivity: How responsive is the website when you interact with it?
Visual stability: How much does the page move around while loading?
Cloudflare Browser Insights measures all three metrics directly in your users’ browsers, all with one-click enablement from the Cloudflare dashboard.
To start using Browser Insights, just head over to the Speed tab in the dashboard.
Making pages fast is better for everyone
Google’s announcement today, that Web Vitals measurements will be a key part of search ranking starting in May 2021, places even more emphasis on running fast, accessible websites.
Using Cloudflare’s performance tools, like our best-of-breed caching, Argo Smart Routing, content optimization, and Cloudflare Workers® products, directly improves page experience and Core Web Vitals measurements, and now, very directly, where your pages appear in Google Search results. And you don’t have to take our word for this — our analytics tools directly measure Web Vitals scores, instrumenting your real users’ experiences.
We’re excited to help our customers build fast websites, understand exactly how fast they are, and rank highly on Google search as a result. Render on!
Amazon CodeGuru is a set of developer tools powered by machine learning that provides intelligent recommendations for improving code quality and identifying an application’s most expensive lines of code. Amazon CodeGuru Profiler allows you to profile your applications in a low-impact, always-on manner. It helps you improve your application’s performance, reduce cost, and diagnose application issues through rich data visualization and proactive recommendations. CodeGuru Profiler was a very successful and widely used service within Amazon before it was offered as a public service. This post discusses a few ways in which internal Amazon teams have used and benefited from continuous profiling of their production applications. These use cases can give you better insight into how to reap similar benefits for your applications using CodeGuru Profiler.
Inside Amazon, over 100,000 applications currently use CodeGuru Profiler across various environments globally. Over the last few years, CodeGuru Profiler has served as an indispensable tool for resolving issues in the following three categories:
Performance bottlenecks, high latency and CPU utilization
Cost and Infrastructure utilization
Diagnosis of an application impacting event
API latency improvement for CodeGuru Profiler
What could be a better example than CodeGuru Profiler using itself to improve its own performance? CodeGuru Profiler offers an API called BatchGetFrameMetricData, which allows you to fetch time series data for a set of frames or methods. We noticed that the 99th percentile latency (i.e. the slowest 1 percent of requests over a 5 minute period) metric for this API was approximately 5 seconds, higher than what we wanted for our customers.
CodeGuru Profiler is built on a microservice architecture, with the BatchGetFrameMetricData API implemented as a set of AWS Lambda functions. It also leverages other AWS services such as Amazon DynamoDB to store data and Amazon CloudWatch to record performance metrics.
When investigating the latency issue, the team found that the 5-second latency spikes were happening during certain time intervals rather than continuously, which made it difficult to reproduce and determine the root cause of the issue in a pre-production environment. The new Lambda profiling feature in CodeGuru came in handy, so the team decided to enable profiling for all of its Lambda functions. The low-impact, continuous profiling capability of CodeGuru Profiler allowed the team to capture comprehensive profiles over a period of time, including when the latency spikes occurred, enabling the team to better understand the issue. After capturing the profiles, the team went through the flame graphs of one of the Lambda functions (TimeSeriesMetricsGeneratorLambda) and learned that all of its CPU time was spent by the thread responsible for publishing metrics to CloudWatch. The following screenshot shows a flame graph during one of these spikes.
As seen above, there is a single call stack visible in the flame graph, indicating that all the CPU time was taken by the thread publishing metrics to CloudWatch. This helped the team immediately understand what was happening: the thread was publishing these metrics inside a synchronized block, and because it took most of the CPU, it caused all other threads to wait and the latency to spike. To fix the issue, the team simply changed the TimeSeriesMetricsGeneratorLambda code to publish CloudWatch metrics at the end of the function, which eliminated this thread’s contention with all the other threads.
After the fix was deployed, the 5 second latency spikes were gone, as seen in the following graph.
Cost, infrastructure and other improvements for CAGE
CAGE is an internal Amazon retail service that does royalty aggregation for digital products, such as Kindle eBooks, MP3 songs and albums, and more. Like many other Amazon services, CAGE is also a customer of CodeGuru Profiler.
CAGE was experiencing latency delays and growing infrastructure cost, and wanted to reduce them. Thanks to CodeGuru Profiler’s always-on profiling capabilities, rich visualization and recommendations, the team was able to successfully diagnose the issues, determine the root cause and fix them.
With the help of CodeGuru Profiler, the CAGE team identified several reasons for their degraded service performance and increased hardware utilization:
Excessive garbage collection activity – The team reviewed the service flame graphs (see the following screenshot) and identified that a lot of CPU time, 65.07% of the total service CPU, was spent on garbage collection activity.
Metadata overhead – The team followed a CodeGuru Profiler recommendation and identified that the service’s DynamoDB response handling was consuming a noticeable share of CPU, 2.86% of total CPU time. This was due to response metadata caching in the AWS SDK v1.x HTTP client, which is turned on by default and causes higher CPU overhead for high-throughput applications such as CAGE. The following screenshot shows the relevant recommendation.
Excessive logging – The team also identified excessive logging of its internal Amazon ION structures. The team initially added this logging for debugging purposes, but was unaware of its impact on the CPU cost, taking 2.28% of the overall service CPU. The following screenshot is part of the flame graph that helped identify the logging impact.
The team used these flame graphs and the recommendations provided by CodeGuru Profiler to determine the root cause of the issues and systematically resolve them.
After making these changes, the team was able to reduce their infrastructure cost by 25%, saving close to $2,600 per month. Service latency also improved, with the service’s 99th percentile latency dropping from approximately 2,500 milliseconds to 250 milliseconds in the North America (NA) region, as shown below.
The team also realized a side benefit of having reduced log verbosity and saw a reduction in log size by 55%.
Event Analysis of increased checkout latency for Amazon.com
During one high-traffic period, Amazon retail customers experienced higher than normal latency on the checkout page. The issue was due to a downstream service’s API experiencing high latency and CPU utilization. While the team quickly mitigated the issue by adding servers to the service, the always-on CodeGuru Profiler came to the rescue to help diagnose and fix the issue permanently.
The team analyzed the flame graphs from CodeGuru Profiler at the time of the event and noticed excessive CPU consumption (69.47%) when logging exceptions using Log4j2. See the following screenshot taken from an earlier version of CodeGuru Profiler user interface.
With the CodeGuru Profiler flame graph and other metrics, the team quickly confirmed that the issue was due to excessive exception logging using Log4j2. This downstream service had recently upgraded to Log4j2 version 2.8, in which exception logging can be expensive due to the way Log4j2 handles class loading of certain stack frames. Log4j 2.x enables this class loading by default, whereas it was disabled in 1.x, causing the increased latency and CPU utilization. The team was not able to detect this issue in a pre-production environment, as the impact was observable only in high-traffic situations.
After they understood the issue, the team successfully rolled out a fix that removed the unnecessary exception trace logging. Performance issues like this one, and many others, are proactively surfaced as CodeGuru Profiler recommendations, so you can learn about such issues in your applications and quickly resolve them.
I hope this post provided a glimpse into the various ways CodeGuru Profiler can benefit your business and applications. To get started using CodeGuru Profiler, see Setting up CodeGuru Profiler. For more information, see the CodeGuru Profiler documentation.
In late June, Cloudflare’s resolver team noticed a spike in DNS requests for the 65479 Resource Record thanks to data exposed through our new Radar service. We began investigating and found these to be part of Apple’s iOS 14 beta release, where they were testing out a new SVCB/HTTPS record type.
Once we saw that Apple was requesting this record type, and while the iOS 14 beta was still ongoing, we rolled out support across the Cloudflare customer base.
This blog post explains what this new record type does and its significance, but there’s also a deeper story: Cloudflare customers get automatic support for new protocols like this.
That means that today if you’ve enabled HTTP/3 on an Apple device running iOS 14, when it needs to talk to a Cloudflare customer (say you browse to a Cloudflare-protected website, or use an app whose API is on Cloudflare) it can find the best way of making that connection automatically.
And if you’re a Cloudflare customer you have to do… absolutely nothing… to give Apple users the best connection to your Internet property.
Negotiating HTTP security and performance
Whenever a user types a URL in the browser box without specifying a scheme (like “https://” or “http://”), the browser cannot assume, without prior knowledge such as a Strict-Transport-Security (HSTS) cache or preload list entry, whether the requested website supports HTTPS or not. The browser will first try to fetch the resources using plaintext HTTP, and only if the website redirects to an HTTPS URL, or if it specifies an HSTS policy in the initial HTTP response, the browser will then fetch the resource again over a secure connection.
This means that the latency incurred in fetching the initial resource (say, the index page of a website) is doubled, due to the fact that the browser needs to re-establish the connection over TLS and request the resource all over again. But worse still, the initial request is leaked to the network in plaintext, which could potentially be modified by malicious on-path attackers (think of all those unsecured public WiFi networks) to redirect the user to a completely different website. In practical terms, this weakness is sometimes used by said unsecured public WiFi network operators to sneak advertisements into people’s browsers.
Unfortunately, that’s not the full extent of it. This problem also impacts HTTP/3, the newest revision of the HTTP protocol that provides increased performance and security. HTTP/3 is advertised using the Alt-Svc HTTP header, which is only returned after the browser has already contacted the origin using a different and potentially less performant HTTP version. The browser ends up missing out on using faster HTTP/3 on its first visit to the website (although it does store the knowledge for later visits).
The fundamental problem comes from the fact that negotiation of HTTP-related parameters (such as whether HTTPS or HTTP/3 can be used) is done through HTTP itself (either via a redirect, HSTS and/or Alt-Svc headers). This leads to a chicken and egg problem where the client needs to use the most basic HTTP configuration that has the best chance of succeeding for the initial request. In most cases this means using plaintext HTTP/1.1. Only after it learns of parameters can it change its configuration for the following requests.
But before the browser can even attempt to connect to the website, it first needs to resolve the website’s domain to an IP address via DNS. This presents an opportunity: what if additional information required to establish a connection could be provided, in addition to IP addresses, with DNS?
That’s what we’re excited to be announcing today: Cloudflare has rolled out initial support for HTTPS records to our edge network. Cloudflare’s DNS servers will now automatically generate HTTPS records on the fly to advertise whether a particular zone supports HTTP/3 and/or HTTP/2, based on whether those features are enabled on the zone.
Service Bindings via DNS
The new proposal, currently under discussion at the Internet Engineering Task Force (IETF), defines a family of DNS resource record types (“SVCB”) that can be used to negotiate parameters for a variety of application protocols.
The generic DNS record “SVCB” can be instantiated into records specific to different protocols. The draft specification defines one such instance called “HTTPS”, specific to the HTTP protocol, which can be used not only to signal to the client that it can connect over a secure connection (skipping the initial unsecured request), but also to advertise the different HTTP versions supported by the website. In the future, potentially even more features could be advertised.
example.com 3600 IN HTTPS 1 . alpn="h3,h2"
The DNS record above advertises support for the HTTP/3 and HTTP/2 protocols for the example.com origin.
This is best used alongside DNS over HTTPS or DNS over TLS, and DNSSEC, to again prevent malicious actors from manipulating the record.
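If you want to see one of these records for yourself, you can query for it directly. Recent versions of dig understand the HTTPS mnemonic; older ones can fall back to the numeric TYPE65 form:
$ dig example.com HTTPS +short
$ dig example.com TYPE65 +short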
The client will need to fetch not only the typical A and AAAA records to get the origin’s IP addresses, but also the HTTPS record. It can of course do these lookups in parallel to avoid additional latency at the start of the connection, but this could potentially lead to A/AAAA and HTTPS responses diverging from each other. For example, in cases where the origin makes use of DNS load-balancing: if an origin can be served by multiple CDNs it might happen that the responses for A and/or AAAA records come from one CDN, while the HTTPS record comes from another. In some cases this can lead to failures when connecting to the origin (say, if the HTTPS record from one of the CDNs advertises support for HTTP/3, but the CDN the client ends up connecting to doesn’t support it).
This is solved by the SVCB and HTTPS records by providing the IP addresses directly, without the need for the client to look at A and AAAA records. This is done via the “ipv4hint” and “ipv6hint” parameters that can optionally be added to these records, which provide lists of IPv4 and IPv6 addresses that can be used by the client in lieu of the addresses specified in A and AAAA records. Of course clients will still need to query the A and AAAA records, to support cases where no SVCB or HTTPS record is available, but these IP hints provide an additional layer of robustness.
example.com 3600 IN HTTPS 1 . alpn="h3,h2" ipv4hint="192.0.2.1" ipv6hint="2001:db8::1"
In addition to all this, SVCB and HTTPS can also be used to define alternative endpoints that are authoritative for a service, in a similar vein to SRV records:
example.com 3600 IN HTTPS 1 example.net alpn="h3,h2"
example.com 3600 IN HTTPS 2 example.org alpn="h2"
In this case the “example.com” HTTPS service can be provided by both “example.net” (which supports both HTTP/3 and HTTP/2, in addition to HTTP/1.x) as well as “example.org” (which only supports HTTP/2 and HTTP/1.x). The client will first need to fetch A and AAAA records for “example.net” or “example.org” before being able to connect, which might increase the connection latency, but the service operator can make use of the IP hint parameters discussed above in this case as well, to reduce the amount of required DNS lookups the client needs to perform.
This means that SVCB and HTTPS records might finally provide a way for SRV-like functionality to be supported by popular browsers and other clients that have historically not supported SRV records.
There is always room at the top apex
When setting up a website on the Internet, it’s common practice to use a “www” subdomain (like in “www.cloudflare.com”) to identify the site, as well as the “apex” (or “root”) of the domain (in this case, “cloudflare.com”). In order to avoid duplicating the DNS configuration for both domains, the “www” subdomain can typically be configured as a CNAME (Canonical Name) record, that is, a record that maps to a different DNS record.
cloudflare.com. 3600 IN A 192.0.2.1
cloudflare.com. 3600 IN AAAA 2001:db8::1
www 3600 IN CNAME cloudflare.com.
This way the list of IP addresses of the websites won’t need to be duplicated all over again, but clients requesting A and/or AAAA records for “www.cloudflare.com” will still get the same results as “cloudflare.com”.
However, there are some cases where using a CNAME might seem like the best option, but ends up subtly breaking the DNS configuration for a website. For example when setting up services such as GitLab Pages, GitHub Pages or Netlify with a custom domain, the user is generally asked to add an A (and sometimes AAAA) record to the DNS configuration for their domain. Those IP addresses are hard-coded in users’ configurations, which means that if the provider of the service ever decides to change the addresses (or add new ones), even if just to provide some form of load-balancing, all of their users will need to manually change their configuration.
Using a CNAME to a more stable domain which can then have variable A and AAAA records might seem like a better option, and some of these providers do support that, but it’s important to note that this generally only works for subdomains (like “www” in the previous example) and not apex records. This is because the DNS specification that defines CNAME records states that when a CNAME is defined for a particular name, there can’t be any other records associated with that name. This is fine for subdomains, but apex records need to have additional records defined, such as SOA and NS, for the DNS configuration to work properly, and might also have records such as MX to make sure emails get properly delivered. In practical terms, this means that defining a CNAME record at the apex of a domain might appear to work fine in some cases, but be subtly broken in ways that are not immediately apparent.
But what does this all have to do with SVCB and HTTPS records? Well, it turns out that those records can also solve this problem, by defining an alternative format called “alias form” that behaves in the same manner as a CNAME in all the useful ways, but without the annoying historical baggage. A domain operator will be able to define a record such as:
example.com. 3600 IN HTTPS example.org.
and expect it to work as if a CNAME was defined, but without the subtle side-effects.
One more thing
Encrypted SNI is an extension to TLS intended to improve privacy of users on the Internet. You might remember how it makes use of a custom DNS record to advertise the server’s public key share used by clients to then derive the secret key necessary to actually encrypt the SNI. In newer revisions of the specification (which is now called “Encrypted ClientHello” or “ECH”) the custom TXT record used previously is simply replaced by a new parameter, called “echconfig”, for the SVCB and HTTPS records.
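Such a record simply carries the ECH configuration as one more parameter next to the ALPN list. Purely as an illustration (the value below is a placeholder, not a real configuration):
example.com 3600 IN HTTPS 1 . alpn="h3,h2" echconfig="<base64-encoded ECH configuration>"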
This means that SVCB/HTTPS are a requirement to support newer revisions of Encrypted SNI/Encrypted ClientHello. More on this later this year.
This all sounds great, but what does it actually mean for Cloudflare customers? As mentioned earlier, we have enabled initial support for HTTPS records across our edge network. Cloudflare’s DNS servers will automatically generate HTTPS records on the fly to advertise whether a particular zone supports HTTP/3 and/or HTTP/2, based on whether those features are enabled on the zone, and we will later also add Encrypted ClientHello support.
Thanks to Cloudflare’s large network that spans millions of web properties (we happen to be one of the most popular DNS providers), serving these records on our customers’ behalf will help build a more secure and performant Internet for anyone that is using a supporting client.
Adopting new protocols requires cooperation between multiple parties. We have been working with various browsers and clients to increase the support and adoption of HTTPS records. Over the last few weeks, Apple’s iOS 14 release has included client support for HTTPS records, allowing connections to be upgraded to QUIC when the HTTP/3 parameter is returned in the DNS record. Apple has reported that so far, of the population that has manually enabled HTTP/3 on iOS 14, 8% of the QUIC connections had the HTTPS record response.
Other browser vendors, such as Google and Mozilla, are also working on shipping support for HTTPS records to their users, and we hope to be hearing more on this front soon.
Many of us at Cloudflare obsess about how to make websites faster. But to improve performance, you have to measure it first. Last year we launched Browser Insights to help our customers measure web performance from the perspective of end users.
Today, we’re partnering with the Google Chrome team to bring Web Vitals measurements into Browser Insights. Web Vitals are a new set of metrics to help web developers and website owners measure and understand load time, responsiveness, and visual stability. And with Cloudflare’s Browser Insights, they’re easier to measure than ever – and it’s free for anyone to collect data from the whole web.
Why do we need Web Vitals?
When trying to understand performance, it’s tempting to focus on the metrics that are easy to measure — like Time To First Byte (TTFB). While TTFB and similar metrics are important to understand, we’ve learned that they don’t always tell the whole story.
Our partners on the Google Chrome team have tackled this problem by breaking down user experience into three components:
Loading: How long did it take for content to become available?
Interactivity: How responsive is the website when you interact with it?
Visual stability: How much does the page move around while loading? (I think of this as the inverse of “jankiness”)
Measuring the Core Web Vitals isn’t the end of the story. Rather, they’re a jumping off point to understand what factors impact a website’s performance. Web Vitals tells you what is happening at a high level, and other more detailed metrics help you understand why user experience could be slow.
For more information about improving web performance, check out Google’s guides to improving LCP, FID, and CLS.
Why measure Web Vitals with Cloudflare?
First, we think that RUM (Real User Measurement) is a critical companion to synthetic measurement. While you can always try a few page loads on your own laptop and see the results, gathering data from real users is the only way to take into account real-life device performance and network conditions.
There are other great RUM tools out there. Google’s Chrome User Experience Report (CrUX) collects data about the entire web and makes it available through tools like Page Speed Insights (PSI), which combines synthetic and RUM results into useful diagnostic information.
One major benefit of Cloudflare’s Browser Insights is that it updates constantly; new data points are available shortly after seeing a request from an end-user. The data in the Chrome UX Report is a 28-day rolling average of aggregated metrics, so you need to wait until you can see changes reflected in the data.
Another benefit of Browser Insights is that we can measure any browser — not just Chrome. As of this writing, the APIs necessary to report Web Vitals are only supported in Chromium browsers, but we’ll support Safari and Firefox when they implement those APIs.
Finally, Browser Insights is free to use! We’ve worked really hard to make our analytics blazing fast for websites with any amount of traffic. We’re excited to support slicing and grouping by URL, Browser, OS, and Country, and plan to support several more dimensions soon.
Push a button to start measuring
To start using Browser Insights, just head over to the Speed tab in the dashboard. Starting today, Web Vitals metrics are now available for everyone!
Coming soon, we’re excited to make this available for all our Web Analytics customers — even those who don’t use Cloudflare today. We’re also hard at work adding much-requested features like client-side error reporting, and diagnostics tools to make it easier to understand where to improve.
At Cloudflare, we’re constantly working on improving the performance of our edge — and that was exactly what my internship this summer entailed. I’m excited to share some improvements we’ve made to our popular Firewall Rules product over the past few months.
Firewall Rules lets customers filter the traffic hitting their site. It’s built using our engine, Wirefilter, which takes powerful boolean expressions written by customers and matches incoming requests against them. Customers can then choose how to respond to traffic which matches these rules. We will discuss some in-depth optimizations we have recently made to Wirefilter, so you may wish to get familiar with how it works if you haven’t already.
Minimizing CPU usage
As a new member of the Firewall team, I quickly learned that performance is important — even in our security products. We look for opportunities to make our customers’ Internet properties faster where it’s safe to do so, maximizing both security and performance.
Our engine is already heavily used, powering all of Firewall Rules. But we have bigger plans. More and more products like our Web Application Firewall (WAF) will be running behind our Wirefilter-based engine, and it will become responsible for eating up a sizable chunk of our total CPU usage before long.
How to measure performance?
Measuring performance is a notoriously tricky task, and as you can probably imagine trying to do this in a highly distributed environment (aka Cloudflare’s edge) does not help. We’ve been surprised in the past by optimizations that look good on paper, but, when tested out in production, just don’t seem to do much.
Our solution? Performance measurement as a service — an isolated and reproducible benchmark for our Firewall engine and a framework for engineers to easily request runs and view results. It’s worth noting that we took a lot of inspiration from the fantastic Rust Compiler benchmarks to build this.
What to measure?
Our next challenge was to find some meaningful performance metrics. Some experimentation quickly uncovered that time was far too volatile a measure for meaningful comparisons, so we turned to hardware counters. It’s not hard to find tools to measure these (perf and VTune are two such examples), although they (mostly) don’t allow control over which parts of the program are recorded. In our case, we wished to individually record measurements for different stages of filter processing — parsing, compilation, analysis, and execution.
Once again we took inspiration from the Rust compiler, and its self-profiling options, using the perf_event_open API to record counters from inside our binary. We then output something like the following, which our framework can easily ingest and store for later visualization.
Whilst we mainly focussed on metrics relating to CPU usage, we also use a combination of getrusage and clear_refs to find the maximum resident set size (RSS). This is useful to understand the memory impact of particular algorithms in addition to CPU.
But the challenge was not over. Cloudflare’s standard CI agents use virtualization and sandboxing for security and convenience, but this makes accessing hardware counters virtually impossible. Running our benchmarks on a dedicated machine gave us access to these counters, and ensured more reproducible results.
Speeding up the speed test
Our benchmarks were designed from the outset to take an important place in our development process. For instance, we now perform a full benchmark run before releasing each new version to detect performance regressions.
But with our benchmarks in place, it quickly became clear that we had a problem. Our benchmarks simply weren’t fast enough — at least if we wanted to complete them in less than a few hours! The problem was that we have a very large number of filters, and since our engine would never usually execute requests against this many filters at once, benchmarking them all was proving incredibly costly. We came up with a few tricks to cut this down…
Deduplication. It turns out that only around a third of filters are structurally unique (something that is easy to check as Wirefilter can helpfully serialize to JSON). We managed to cut down a great deal of time by ignoring duplicate filters in our benchmarks.
Sampling. Still, we had too many filters, and random sampling presented an easy solution. A more subtle challenge was to make sure that the random sample was always the same, to maintain reproducibility.
Partitioning. We worried that deduplication and sampling would cause us to miss important cases that are useful to optimize. By first partitioning filters by Wirefilter language feature, we can ensure we’re getting a good range of filters. It also helpfully gives us more detail about where specifically the impact of a performance change is.
Most of these are trade-offs, but very necessary ones which allow us to run continual benchmarks without development speed grinding to a halt. At the time of writing, we’ve managed to get a benchmark run down to around 20 minutes using these ideas.
Optimizing our engine
With a benchmarking framework in place, we were ready to begin testing optimizations. But how do you optimize an interpreter like Wirefilter? Just-in-time (JIT) compilation, selective inlining and replication were some ideas floating around in the world of interpreters that seemed attractive. After all, we previously wrote about the cost of dynamic dispatch in Wirefilter. All of these techniques aim to reduce that effect.
However, running some real filters through a profiler tells a different story. Most execution time, around 65%, is spent not resolving dynamic dispatch calls but instead performing operations like comparison and searches. Filters currently in production tend to be pretty light on functions, but throw in a few more of these and even less time would be spent on dynamic dispatch. We suspect that even a fair chunk of the remaining 35% is actually spent reading the memory of request fields.
Breakdown of CPU time while executing a typical production filter.
An adventure in substring searching
By now, you shouldn’t be surprised that the contains operator was one of the first in line for optimization. If you’ve ever written a Firewall Rule, you’re probably already familiar with what it does — it checks whether a substring is present in the field you are matching against. For example, the following expression would match when the host is “example.com” or “www.example.net”, but not when it is “cloudflare.com”. In string searching algorithms, this is commonly referred to as finding a ‘needle’ (“example”) within a ‘haystack’ (“example.com”).
http.host contains "example"
How does this work under the hood? Ordinarily, we may have used Rust’s `String::contains` function but Wirefilter also allows raw byte expressions that don’t necessarily conform to UTF-8.
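A byte-oriented version is straightforward to express with the memmem module of the memchr crate; this is just a minimal sketch of the idea, not Wirefilter’s actual implementation:

```rust
use memchr::memmem;

// `contains` over raw bytes: no UTF-8 validity required for either side.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    memmem::find(haystack, needle).is_some()
}

fn main() {
    assert!(contains(b"www.example.net", b"example"));
    assert!(!contains(b"cloudflare.com", b"example"));
}
```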
Sounds good, right? It was, and it was working pretty well, although we’d noticed that rewriting `contains` filters using regular expressions could, bizarrely, often make them faster.
http.host matches "example"
Regular expressions are great, but since they’re far more powerful than the `contains` operator, they shouldn’t be faster than a specialized algorithm in simple cases like this one.
Something was definitely up. It turns out that Rust’s regex library comes equipped with a whole host of specialized matchers for what it deems to be simple expressions like this. The obvious question was whether we could therefore simply use the regex library. Interestingly, you may not have realized that the popular ripgrep tool does just that when searching for fixed-string patterns.
However, our use case is a little different. Since we’re building an interpreter (and we’re using dynamic dispatch in any case), we would prefer to dispatch to a specialized case for `contains` expressions, rather than matching on some enum deep within the regex crate when the filter is executed. What’s more, there are some pretty cool things being done to perform substring searching that leverage SIMD instruction sets. So we wired up our engine to some previous work by Wojciech Muła, and the results were fantastic.
Improvements in instruction count for expressions using the `contains` operator, comparing Wojciech Muła’s sse4-strstr library against the memmem crate within Wirefilter.
I encourage you to read more on “Algorithm 1”, which we used, but it works something like this (I’ve changed the order a little to help make it clearer). It’s worth reading up on SIMD instructions if you’re unfamiliar with them — they’re the essence behind what makes this algorithm fast.
We fill one SIMD register with the first byte of the needle being searched for, simply repeated over and over.
We load as much of our haystack as we can into another SIMD register and perform a bitwise equality operation with our previous register.
Now, any position in the resulting register that is 0 cannot be the start of a match, since the haystack byte there doesn’t equal the first byte of the needle.
We now repeat this process with the last byte of the needle, offsetting the haystack, to rule out any positions that don’t end with the same byte as the needle.
Bitwise ANDing these two results together, we (hopefully) have now drastically reduced our potential matches.
Each of the remaining potential matches can be checked manually using a memcmp operation. If we find a match, then we’re done.
If not, we continue with the next part of our haystack and repeat until we’ve checked the entire thing.
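To make those steps concrete, here is a heavily simplified Rust sketch using AVX2 intrinsics. It is not the sliceslice implementation: for clarity it ignores the tail of the haystack entirely (the next section covers why that part is tricky), and it always uses the first and last bytes of the needle.

```rust
use std::arch::x86_64::*;

/// Simplified first/last-byte SIMD filter plus byte-wise verification.
/// Assumes `needle.len() >= 2` and skips the final partial blocks of the
/// haystack, which a real implementation must handle safely.
#[target_feature(enable = "avx2")]
unsafe fn find_avx2(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let k = needle.len();
    // Registers filled with the first and last bytes of the needle.
    let first = _mm256_set1_epi8(needle[0] as i8);
    let last = _mm256_set1_epi8(needle[k - 1] as i8);

    let mut i = 0;
    while i + k + 32 <= haystack.len() {
        // Load 32 haystack bytes at offset i, and 32 bytes at offset i + k - 1.
        let block_first = _mm256_loadu_si256(haystack.as_ptr().add(i) as *const __m256i);
        let block_last = _mm256_loadu_si256(haystack.as_ptr().add(i + k - 1) as *const __m256i);

        // Positions whose byte doesn't match the first (or last) needle byte
        // cannot start (or end) a match.
        let eq_first = _mm256_cmpeq_epi8(first, block_first);
        let eq_last = _mm256_cmpeq_epi8(last, block_last);

        // AND the two masks; surviving bits are candidate match positions.
        let mut mask = _mm256_movemask_epi8(_mm256_and_si256(eq_first, eq_last)) as u32;

        // Verify each candidate with a plain byte comparison.
        while mask != 0 {
            let pos = mask.trailing_zeros() as usize;
            if &haystack[i + pos..i + pos + k] == needle {
                return Some(i + pos);
            }
            mask &= mask - 1; // clear the lowest set bit
        }
        // Move on to the next 32-byte chunk of the haystack.
        i += 32;
    }
    None
}
```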
When it goes wrong
You may be wondering what happens if our haystack doesn’t fit neatly into registers. In the original algorithm, nothing. It simply continues reading into oblivion past the end of the haystack until the last register is full, and uses a bitmask to ignore potential false positives from this additional region of memory.
As we mentioned, security is our priority when it comes to optimizations, so we could never deploy something with this kind of behaviour. We ended up porting Muła’s library to Rust (we’ve also open-sourced the crate!) and performed an overlapping registers modification found in ARM’s blog.
It’s best illustrated by example — notice the difference between how we would fill registers on an imaginary SIMD instruction-set with 4-byte registers.
In our case, repeating some bytes within two different registers will never change the final outcome, so this modification is allowed as-is. However, in reality, we found it was better to use a bitmask to exclude repeated parts of the final register and minimize the number of memcmp calls.
What if the haystack is too small to even fill a single register? In this case, we can’t use our overlapping trick since there’s nothing to overlap with. Our solution is straightforward: while we were primarily targeting AVX2, which can store 32 bytes in a register, we can easily move down to another instruction set with smaller registers that the haystack can fit into. In reality, we don’t currently go any smaller than SSE2. Beyond this, we instead use an implementation of the Rabin-Karp searching algorithm, which appears to perform well.
Register sizes in different SIMD instruction sets. We did not consider AVX512 since support for this is not widespread enough.
Is it always fast?
Choosing the first and last bytes of the needle to rule out potential matches is a great idea. It means that when it does come to performing a memcmp, we can ignore these, as we know they already match. Unfortunately, as Muła points out, this also makes the algorithm susceptible to a worst-case attack in some instances.
Let’s give an expression that a customer might write to illustrate this.
http.request.uri.path contains "/wp-admin/"
If we try to search for this within a very long sequence of ‘/’s, we will find a potential match in every position and make lots of calls to memcmp, essentially performing a slow brute-force substring search.
Clearly we need to choose different bytes from the needle. But which ones should we choose? For each choice, an adversary can always find a slightly different, but equally troublesome, worst case. We instead use randomness to throw off our would-be adversary, picking the first byte of the needle as before, but then choosing another random byte to use.
Our new version is unsurprisingly slower than Muła’s, yet it still exhibits a great improvement over both the memmem and regex crates. Performance, but without sacrificing safety.
Improvements in instruction count for expressions using the `contains` operator, comparing sse4-strstr and sliceslice (our version) against the memmem crate within Wirefilter.
This is only a small taste of the performance work we’ve been doing, and we have much more yet to come. Nevertheless, none of this would have been possible without the support of my manager Richard and my mentor Elie, who contributed a lot of these ideas. I’ve learned so much over the past few months, but most of all that Cloudflare is an amazing place to be an intern!
 Since our benchmarks are not run within a production environment, results in this post do not represent traffic on our edge.
 We found instruction counts to be a particularly stable measure, and CPU cycles a particularly unstable one.
 Note that SWAR is technically not an instruction set, but instead uses regular registers as if they were vector registers.
The AWS Lambda service runs customer code on-demand in response to events. It works by creating a new execution environment and downloading your code. This initial setup is commonly called a “cold start” and introduces latency to the total execution time of the function.
Cold starts happen when you first invoke a function, or when a function is invoked after being inactive for an extended period. They also happen when Lambda scales up a function, since each new instance of the function is a new execution environment.
The serverless community has previously created “function warmer” libraries to help improve the likelihood of a Lambda invocation using an existing execution environment. This is a good approach for development and test workloads, or where you do not need hyper-ready performance. The Provisioned Concurrency feature is designed for workloads needing predictable low-latency.
This blog post shows how to eliminate cold starts in architectures supporting web applications. I reference code from the Ask Around Me example application. This allows users to ask and answer questions in their local geographic area in real time. To learn more, refer to part 1 of the blog application series.
Cold starts and web applications
The Ask Around Me application uses the following backend architecture:
This represents a typical web application. Some Lambda functions are invoked by Amazon API Gateway while others are invoked by services further down the application stack. API Gateway invokes Lambda functions synchronously, meaning the caller is blocked until the function returns a value.
Functions invoked by services like Amazon SQS and Amazon DynamoDB are called asynchronously. This means that the caller continues with other work during execution and the function does not return a value. This application uses both types of invocation:
Generally, cold starts are less impactful in asynchronous executions. The latency overhead in starting the execution environment usually has less impact on overall performance of the application in this case. For web applications in particular, cold starts are most noticeable in synchronous applications closer to the frontend. This is where the speed of the API request has the most influence on the user experience of your application.
In Ask Around Me, there are four Lambda functions supporting the API endpoints for the application. Three of these are lightweight functions that put messages in SQS queues and retrieve data from a DynamoDB table. The most complex function is GetQuestions, which fetches questions based upon the latitude and longitude of the user. It is also expected to receive the most usage, around 50,000 queries per hour, so it’s the most important candidate for performance optimization.
Measuring the existing Lambda function performance
In a previous blog post on load testing this application, the GetQuestions API shows considerable variability in performance. In an API load test for 30 seconds with 20 requests per second, the median response is 175 ms while the slowest is 2149 ms:
In this application, the frontend application waits until this synchronous API call is completed. The median performance is likely acceptable to a user, whereas any response time over one second makes interaction with the application appear slow.
To gain more insight into the performance of this function, I turn on AWS X-Ray for this function. From the Lambda console, I select the GetQuestions function and check the Active tracing check box in the AWS X-Ray panel. After saving the function, X-Ray is now enabled.
I re-run the load test for this function and navigate to the X-Ray console. In the Analytics menu, the Response time distribution panel graphs the performance of the function invocations:
I select all the invocations in the graph after the p95 marker, representing the slowest 5% of all requests. This filters 34 slow requests, which correspond to the number of Concurrent Executions for the function shown in the function’s Metrics console:
X-Ray lists the individual traces for the 34 slowest calls, and selecting the slowest single invocation breaks down the durations of each segment:
This analysis shows that this function’s performance is impacted by cold starts. The initialization of the execution environment and the function code is contributing over 1 second of latency in this example. Each of the 34 slowest invocations corresponds with scaling up events for this function.
Configuring Provisioned Concurrency for a Lambda function
Provisioned concurrency is a Lambda feature that allows you to prepare execution environments before receiving traffic. In addition to downloading the function’s code, it also runs the initialization code outside of the main Lambda handler. This provides a reliable way to keep functions ready to respond within double-digit millisecond latency.
While all Provisioned Concurrency functions start more quickly than the existing on-demand Lambda execution style, this is particularly beneficial for certain function profiles. Runtimes like C# and Java have much slower initialization times than Node.js or Python, but faster execution times once initialized. With Provisioned Concurrency turned on, these runtimes benefit from both the consistent low latency of the function’s start-up and the performance during execution.
To enable Provisioned Concurrency for a Lambda function:
Provisioned concurrency settings must be applied to a published version or an alias. Go to the Actions drop-down and choose Publish new version.
Choose Publish. Scroll down to the Concurrency panel and choose Add Configuration.
Enter your preferred concurrency and choose Save.
After a few minutes, Lambda has prepared the execution environments and the Status shows Ready in the console.
It’s important to remember that the feature is applied explicitly to a function version or alias. Ensure that your invocation method is calling this alias, and not the $LATEST version. Provisioned Concurrency cannot be applied to the $LATEST version.
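If you prefer to script this instead of using the console, the same setting can be applied with the AWS CLI against the published alias (the function name and alias here are placeholders):
$ aws lambda put-provisioned-concurrency-config --function-name GetQuestions --qualifier live --provisioned-concurrent-executions 20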
When configuring Provisioned Concurrency, you select the capacity to reserve. During usage, if you exceed this level, any additional function invocations then use the on-demand model. These invocations exhibit a more typical Lambda start-up performance profile, but you are not throttled or limited from running invocations at high levels of throughput.
Using Amazon CloudWatch Logs or the Monitoring tab for your function in the Lambda console, you can see metrics for the number of Provisioned Concurrency invocations, compared with the total. This can help identify when total load is above the amount of concurrency, and you can make changes accordingly.
You can also use Application Auto Scaling to help you automate provisioning the appropriate capacity. Instead of reserving a fixed amount of capacity, this increases the amount of concurrency during peak loads, and decreases as load reduces. You can configure this in both the AWS CLI and AWS Serverless Application Model.
Comparing performance before and after Provisioned Concurrency
I run the same load test on the same function, now using Provisioned Concurrency – 20 requests per second over 2 minutes. The results show a median latency of 165 ms, a p95 time of 202 ms, and a slowest execution of 532 ms:
In X-Ray, the latest Response time distribution graph shows the significantly improved performance across the 2400 requests:
By enabling Provisioned Concurrency for this Lambda function, the slowest performance has been improved by 75%. The function can serve 1200 requests per minute with a much more consistent performance for users.
Function warmers and Provisioned Concurrency
The broader serverless community offers open source libraries to “warm“ Lambda functions via a pinging mechanism. This approach uses Amazon CloudWatch Events to invoke the function every minute to help keep the execution environment active. As a result, this can increase the likelihood of using a warm environment when you invoke the function.
However, this is not a guaranteed way to reduce cold starts. It does not help in production environments when functions scale up to meet traffic. It also does not work if the Lambda service runs your function in another Availability Zone as part of normal load balancing operations. Additionally, the Lambda service reaps execution environments regularly to keep these fresh, so it’s possible to invoke a function in between pings. In all of these cases, you experience cold starts despite using a warming library.
This approach might be adequate for development and test environments, or for low-traffic or low-priority workloads. However, if you need predictable function start times for your workload, Provisioned Concurrency is the recommended solution to ensure predictable latency. It keeps your functions initialized and hyper-ready to respond in double-digit milliseconds at the scale you need.
This post examines how cold starts impact performance in serverless backends for web applications. It shows how the most important focus area is usually synchronous APIs called by the frontend application. I explain options available for targeting cold starts in the Lambda service.
Using the Ask Around Me application, I apply Provisioned Concurrency to the most latency-sensitive Lambda function. I compare the load testing performance before and after enabling this feature. This shows predictable start-up times and a 75% reduction in the slowest execution time. Finally, I show when you might use function warmers, and how Provisioned Concurrency is more suitable for latency-sensitive and production workloads.