Tag Archives: Edge

An attendee’s guide to hybrid cloud and edge computing at AWS re:Invent 2023

2023-11-14 Chris Munns

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/an-attendees-guide-to-hybrid-cloud-and-edge-computing-at-aws-reinvent-2023/

This post is written by Savitha Swaminathan, AWS Sr. Product Marketing Manager

AWS re:Invent 2023 starts on Nov 27th in Las Vegas, Nevada. The event brings technology business leaders, AWS partners, developers, and IT practitioners together to learn about the latest innovations, meet AWS experts, and network with their peers.

This year, AWS re:Invent will once again have a dedicated track for hybrid cloud and edge computing. The sessions in this track will feature the latest innovations from AWS to help you build and run applications securely in the cloud, on premises, and at the edge – wherever you need to. You will hear how AWS customers are using our cloud services to innovate on premises and at the edge. You will also be able to immerse yourself in hands-on experiences with AWS hybrid and edge services through innovative demos and workshops.

At re:Invent there are several session types, each designed to help you learn in the way that suits you best:

Innovation Talks provide a comprehensive overview of how AWS is working with customers to solve their most important problems.
Breakout sessions are lecture-style presentations focused on a topic or area of interest and are well liked by business leaders and IT practitioners alike.
Chalk talks deep dive on customer reference architectures and invite audience members to actively participate in the whiteboarding exercise.
Workshops and builder sessions, popular with developers and architects, provide the most hands-on experience, where attendees can build real-time solutions with AWS experts.

The hybrid edge track will include one leadership overview session and 15 other sessions (4 breakouts, 6 chalk talks, and 5 workshops). The sessions are organized around 4 key themes: Low latency, Data residency, Migration and modernization, and AWS at the far edge.

Hybrid Cloud & Edge Overview

HYB201 | AWS wherever you need it

Join Jan Hofmeyr, Vice President, Amazon EC2, in this leadership session where he presents a comprehensive overview of AWS hybrid cloud and edge computing services, and how we are helping customers innovate on AWS wherever they need it – from Regions, to metro centers, 5G networks, on premises, and at the far edge. Jun Shi, CEO and President of Accton, will also join Jan on stage to discuss how Accton enables smart manufacturing across its global manufacturing sites using AWS hybrid, IoT, and machine learning (ML) services.

Low latency

Many customer workloads require single-digit millisecond latencies for optimal performance. Customers in every industry are looking for ways to run these latency sensitive portions of their applications in the cloud while simplifying operations and optimizing for costs. You will hear about customer use cases and how AWS edge infrastructure is helping companies like Riot Games meet their application performance goals and innovate at the edge.

Data residency

As cloud has become mainstream, governments and standards bodies continue to develop security, data protection, and privacy regulations. Having control over digital assets and meeting data residency regulations is becoming increasingly important for public sector customers and organizations operating in regulated industries. The data residency sessions deep dive into the challenges, solutions, and innovations that customers are addressing with AWS to meet their data residency requirements.

Breakout session

HYB309 | Navigating data residency and protecting sensitive data

Chalk talk

HYB307 | Architecting for data residency and data protection at the edge

Workshops

HYB301 | Addressing data residency requirements with AWS edge services

Migration and modernization

Migration and modernization in industries that have traditionally operated with on-premises infrastructure or self-managed data centers is helping customers achieve scale, flexibility, cost savings, and performance. We will dive into customer stories and real-world deployments, and share best practices for hybrid cloud migrations.

Breakout session

HYB203 | A migration strategy for edge and on-premises workloads

Chalk talk

HYB313 | Real-world analysis of successful hybrid cloud migrations

AWS at the far edge

Some customers operate in what we call the far edge: remote oil rigs, military and defense territories, and even space! In these sessions we cover customer use cases and explore how AWS brings cloud services to the far edge and helps customers gain the benefits of the cloud regardless of where they operate.

Breakout session

HYB306 | Bringing AWS to remote edge locations

Chalk talk

HYB312 | Deploying cloud-enabled applications starting at the edge

Workshops

HYB304 | Generative AI for robotics: Race for the best drone control assistant

In addition to the sessions across the 4 themes listed above, the track includes two additional chalk talks covering topics that apply more broadly to customers operating hybrid workloads. These chalk talks were chosen based on customer interest and will have repeat sessions due to high customer demand.

HYB310 | Building highly available and fault-tolerant edge applications

HYB311 | AWS hybrid and edge networking architectures

Learn through interactive demos

In addition to breakout sessions, chalk talks, and workshops, make sure you check out our interactive demos to see the benefits of hybrid cloud and edge in action:

Drone Inspector: Generative AI at the Edge

Location: AWS Village | Venetian Level 2, Expo Hall, Booth 852 | AWS for Every App activation

Embark on a competitive adventure where generative artificial intelligence (AI) intersects with edge computing. Experience how drones can swiftly respond to chat instructions for a time-sensitive object detection mission. Learn how you can deploy foundation models and computer vision (CV) models at the edge using AWS hybrid and edge services for real-time insights and actions.

AWS Hybrid Cloud & Edge kiosk

Location: AWS Village | Venetian Level 2, Expo Hall, Booth 852 | Kiosk #9 & 10

Stop by and chat with our experts about AWS Local Zones, AWS Outposts, AWS Snow Family, AWS Wavelength, AWS Private 5G, AWS Telco Network Builder, and Integrated Private Wireless on AWS. Check out the hardware innovations inside an AWS Outposts rack up close and in person. Learn how you can set up a reliable private 5G network within days and live stream video content with minimal latency.

AWS Next Gen Infrastructure Experience

Location: AWS Village | Venetian Level 2, Expo Hall, Booth 852

Check out demos across Global Infrastructure, AWS for Hybrid Cloud & Edge, Compute, Storage, and Networking kiosks, share on social, and win prizes!

The Future of Connected Mobility

Location: Venetian Level 4, EBC Lounge, wall outside of Lando 4201B

Step into the driver’s seat and experience high fidelity 3D terrain driving simulation with AWS Local Zones. Gain real-time insights from vehicle telemetry with AWS IoT Greengrass running on AWS Snowcone and a broader set of AWS IoT services and Amazon Managed Grafana in the Region. Learn how to combine local data processing with cloud analytics for enhanced safety, performance, and operational efficiency. Explore how you can rapidly deliver the same experience to global users in 75+ countries with minimal application changes using AWS Outposts.

Immersive tourism experience powered by 5G and AR/VR

Location: Venetian, Level 2 | Expo Hall | Telco demo area

Explore and travel to Chichen Itza with an augmented reality (AR) application running on a private network fully built on AWS, which includes the Radio Access Network (RAN), the core, security, and applications, combined with services for deployment and operations. This demo features AWS Outposts.

AWS unplugged: A real time remote music collaboration session using 5G and MEC

Location: Venetian, Level 2 | Expo Hall | Telco demo area

We will demonstrate how musicians in Los Angeles and Las Vegas can collaborate in real time with AWS Wavelength. You will witness songwriters and musicians in Los Angeles and Las Vegas in a live jam session.

Disaster relief with AWS Snowball Edge and AWS Wickr

Location: AWS for National Security & Defense | Venetian, Casanova 606

The hurricane has passed, leaving you with no cell coverage and only a slim chance of getting on the internet. You need to set up a situational awareness and communications network for your team, fast. Using Wickr on Snowball Edge Compute, you can rapidly deploy a platform that provides both secure communications with rich collaboration functionality and real-time situational awareness with the Wickr ATAK integration, allowing you to get on with what’s important.

We hope this guide to the Hybrid Cloud and Edge track at AWS re:Invent 2023 helps you plan for the event and we hope to see you there!

Protect your Amazon Cognito user pool with AWS WAF

2023-04-21 Maitreya Ranganath

Post Syndicated from Maitreya Ranganath original https://aws.amazon.com/blogs/security/protect-your-amazon-cognito-user-pool-with-aws-waf/

Many of our customers use Amazon Cognito user pools to add authentication, authorization, and user management capabilities to their web and mobile applications. You can enable the built-in advanced security in Amazon Cognito to detect and block the use of credentials that have been compromised elsewhere, and to detect unusual sign-in activity and then prompt users for additional verification or block sign-ins. Additionally, you can associate an AWS WAF web access control list (web ACL) with your user pool to allow or block requests to Amazon Cognito user pools, based on security rules.

In this post, we’ll show how you can use AWS WAF with Amazon Cognito user pools and provide a sample set of rate-based rules and advanced AWS WAF rule groups. We’ll also show you how to test and tune the rules to help protect your user pools from common threats.

Rate-based rules for Amazon Cognito user pool endpoints

The following are endpoints exposed publicly by an Amazon Cognito user pool that you can protect with AWS WAF:

Hosted UI — These endpoints are listed in the OIDC and hosted UI API reference. Cognito creates these endpoints when you assign a domain to your user pool. Your users will interact with these endpoints when they use the Hosted UI web interface directly, or when your application calls Cognito OAuth endpoints such as Authorize or Token.
Public API operations — These generate a request to Cognito API actions that are either unauthenticated or authenticated with a session string or access token, but not with AWS credentials.

A good way to protect these endpoints is to deploy rate-based AWS WAF rules. These rules will detect and block requests with high rates that could indicate an attempt to exceed your Amazon Cognito API request rate quotas and that could subsequently impact requests from legitimate users.

When you apply rate limits, it helps to group Amazon Cognito API actions into four action categories. You can set a specific rate limit per action category, which gives you traffic visibility for each category.

User Creation — This category includes operations that create new users in Cognito. Setting a rate limit for this category provides visibility for traffic of these operations and threats such as fake users being created in Cognito, which drives up your Monthly Active User (MAU) costs for Cognito.
Sign-in — This category includes operations to initiate a sign-in operation. Setting a rate limit for this category can provide visibility into the abuse of these operations. This could indicate high frequency, automated attempts to guess user credentials, sometimes referred to as credential stuffing.
Account Recovery — This category includes operations to recover accounts, including “forgot password” flows. Setting a rate limit for this category can provide visibility into the abuse of these operations. Malicious activity can include sending fake reset attempts, which might result in emails and SMS messages being sent to users.
Default — This is a catch-all rate limit that applies to an operation that is not in one of the prior categories. Setting a default rate limit can provide visibility and mitigation from request flooding attacks.

Table 1 below shows selected Hosted UI endpoint paths (the equivalent of individual API actions) and the recommended rate-based rule limit category for each.

Table 1: Amazon Cognito Hosted UI URL paths mapped to action categories

Hosted UI URL path	Authentication method	Action category
/signup	Unauthenticated	User Creation
/confirmUser	Confirmation code	User Creation
/resendcode	Unauthenticated	User Creation
/login	Unauthenticated	Sign-in
/oauth2/authorize	Unauthenticated	Sign-in
/forgotPassword	Unauthenticated	Account Recovery
/confirmForgotPassword	Confirmation code	Account Recovery
/logout	Unauthenticated	Default
/oauth2/revoke	Refresh token	Default
/oauth2/token	Auth code, or refresh token, or client credentials	Default
/oauth2/userInfo	Access token	Default
/oauth2/idpresponse	Authorization code	Default
/saml2/idpresponse	SAML assertion	Default

Table 2 below shows selected Cognito API actions and the recommended rate-based rule category for each.

Table 2: Selected Cognito API actions mapped to action categories

API action name	Authentication method	Action category
SignUp	Unauthenticated	User Creation
ConfirmSignUp	Confirmation code	User Creation
ResendConfirmationCode	Unauthenticated	User Creation
InitiateAuth	Unauthenticated	Sign-in
RespondToAuthChallenge	Unauthenticated	Sign-in
ForgotPassword	Unauthenticated	Account Recovery
ConfirmForgotPassword	Confirmation code	Account Recovery
AssociateSoftwareToken	Access token or session	Default
VerifySoftwareToken	Access token or session	Default
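
The mappings in Tables 1 and 2 can be expressed as a simple lookup. The following Python sketch is illustrative only: the function name categorize_request and the dictionary layout are our own assumptions and are not part of the CloudFormation template. It identifies API actions by the value of the x-amz-target header, the same way the Athena query later in this post does, and treats anything not listed as Default.

```python
from typing import Optional

# Hosted UI URL paths from Table 1. Paths in the Default category are omitted
# because any unlisted path falls through to "Default".
HOSTED_UI_CATEGORIES = {
    "/signup": "User Creation",
    "/confirmUser": "User Creation",
    "/resendcode": "User Creation",
    "/login": "Sign-in",
    "/oauth2/authorize": "Sign-in",
    "/forgotPassword": "Account Recovery",
    "/confirmForgotPassword": "Account Recovery",
}

# API actions from Table 2 (Default actions are again left implicit).
API_ACTION_CATEGORIES = {
    "SignUp": "User Creation",
    "ConfirmSignUp": "User Creation",
    "ResendConfirmationCode": "User Creation",
    "InitiateAuth": "Sign-in",
    "RespondToAuthChallenge": "Sign-in",
    "ForgotPassword": "Account Recovery",
    "ConfirmForgotPassword": "Account Recovery",
}

# Prefix of the x-amz-target header value for Cognito user pool API calls.
TARGET_PREFIX = "AWSCognitoIdentityProviderService."


def categorize_request(uri_path: str, amz_target: Optional[str] = None) -> str:
    """Return the action category for a Hosted UI path or an API action."""
    if amz_target and amz_target.startswith(TARGET_PREFIX):
        action = amz_target[len(TARGET_PREFIX):]
        return API_ACTION_CATEGORIES.get(action, "Default")
    return HOSTED_UI_CATEGORIES.get(uri_path, "Default")
```

For example, categorize_request("/login") returns Sign-in, and a request with the x-amz-target value AWSCognitoIdentityProviderService.SignUp is categorized as User Creation.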

Additionally, the rate-based rules we provide in this post include the following:

Two IP sets that represent allow lists for IPv4 and IPv6. You can add IPs that represent your trusted source IP addresses to these IP sets so that other AWS WAF rules don’t apply to requests that originate from these IP addresses.
Two IP sets that represent deny lists for IPv4 and IPv6. Add IPs to these IP sets that you want to block in all cases, regardless of the result of other rules.
An AWS managed IP reputation rule group: The AWS managed IP reputation list rule group contains rules that are based on Amazon internal threat intelligence, to identify IP addresses typically associated with bots or other threats. You can limit requests that match rules in this rule group to a specific rate limit.

Deploy rate-based rules

You can deploy the rate-based rules described in the previous section by using the AWS CloudFormation template that we provide here.

To deploy rate-based rules using the template

(Optional but recommended) If you want to enable AWS WAF logging and resources to analyze request rates, create an Amazon Simple Storage Service (Amazon S3) bucket in the same AWS Region as your Amazon Cognito user pool, with a bucket name starting with the prefix aws-waf-logs-. If you previously created an S3 bucket for AWS WAF logs, you can choose to reuse it, or you can create a new bucket to store AWS WAF logs for Amazon Cognito.
Choose the following Launch Stack button to launch a CloudFormation stack in your account.

Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template and deploy it to the selected Region.

This template creates the following resources in your AWS account:
- A rule group for the rate-based rules, according to the limits shown in Tables 1 and 2.
- Four IP sets for an allow list and deny list for IPv4 and IPv6 addresses.
- A web ACL that includes the rule group that is created, IP set based rules, and the AWS managed IP reputation rule group.
- (Optional) The template enables AWS WAF logging for the web ACL to an S3 bucket that you specify.
- (Optional) The template creates resources to help you analyze AWS WAF logs in S3 to calculate peak request rates that you can use to set rate limits for the rate-based rules.

Set the template parameters as needed. The following table shows the default values for the parameters. We recommend that you deploy the template with the default values and with TestMode set to Yes so that all rules are set to Count. This allows all requests but emits Amazon CloudWatch metrics and AWS WAF log events for each rule that matches. You can then follow the guidance in the next section to analyze the logs and tune the rate limits to match the traffic patterns to your user pool. When you are satisfied with the unique rate limits for each parameter, you can update the stack and set TestMode to No to start blocking requests that exceed the rate limits.

The rate limits for AWS WAF rate-based rules are configured as the number of requests per 5-minute period per unique source IP. The value of the rate limit can be between 100 and 2,000,000,000 (2 billion).
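
Because the limit is expressed per 5-minute period rather than per second, it can help to convert between the two when you reason about traffic. The following Python sketch shows the arithmetic under the bounds stated above; the function name five_minute_limit is a hypothetical helper, not something the template provides.

```python
import math

WINDOW_SECONDS = 5 * 60        # rate-based rules count requests per 5-minute period
MIN_LIMIT = 100                # lowest allowed rate limit value
MAX_LIMIT = 2_000_000_000      # highest allowed rate limit value


def five_minute_limit(requests_per_second: float) -> int:
    """Convert a sustained per-IP request rate into a 5-minute rate limit,
    clamped to the range of values that the rate limit can take."""
    limit = math.ceil(requests_per_second * WINDOW_SECONDS)
    return max(MIN_LIMIT, min(MAX_LIMIT, limit))
```

Read the other way, the default SignInRateLimit of 4000 in Table 3 corresponds to roughly 13 requests per second (4000 / 300) sustained from a single source IP.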

Table 3: Default values for template parameters

Parameter name	Description	Default value	Allowed values
Request rate limits by action category
UserCreationRateLimit	Rate limit applied to User Creation actions	2000	100–2,000,000,000
SignInRateLimit	Rate limit applied to Sign-in actions	4000	100–2,000,000,000
AccountRecoveryRateLimit	Rate limit applied to Account Recovery actions	1000	100–2,000,000,000
IPReputationRateLimit	Rate limit applied to requests that match the AWS Managed IP reputation list	1000	100–2,000,000,000
DefaultRateLimit	Default rate limit applied to actions that are not in any of the prior categories	6000	100–2,000,000,000
Test mode
TestMode	Set to Yes to test rules by overriding rule actions to Count. Set to No to apply the default actions for rules after you’ve tested the impact of these rules.	Yes	Yes or No
AWS WAF logging and rate analysis
EnableWAFLogsAndRateAnalysis	Set to Yes to enable logging for the AWS WAF web ACL to an S3 bucket and create resources for request rate analysis. Set to No to disable AWS WAF logging and skip creating resources for rate analysis. If No, the rest of the parameter values in this section are ignored. If Yes, choose values for the rest of the parameters in this section.	Yes	Yes or No
WAFLogsS3Bucket	The name of an existing S3 bucket where AWS WAF logs are delivered. The bucket name must start with aws-waf-logs- and can end with any suffix. Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.	None	Name of an existing S3 bucket that starts with the prefix aws-waf-logs-
DatabaseName	The name of the AWS Glue database to create, which will contain the request rate analysis tables created by this template. (Important: The name cannot contain hyphens.) Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.	rate_analysis
WorkgroupName	The name of the Amazon Athena workgroup to create for rate analysis. Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.	rate_analysis
WAFLogsTableName	The name of the AWS Glue table for AWS WAF logs. Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.	waf_logs
WAFLogsProjectionStartDate	The earliest date to analyze AWS WAF logs, in the format YYYY/MM/DD (example: 2023/02/28). Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.	None	Set this to the current date, in the format YYYY/MM/DD

Wait for the CloudFormation template to be created successfully.
Go to the AWS WAF console and choose the web ACL created by the template. It will have a name ending with CognitoWebACL.
Choose the Associated AWS resources tab, and then choose Add AWS resource.
For Resource type, choose Amazon Cognito user pool, and then select the Amazon Cognito user pools that you want to protect with this web ACL.
Choose Add.

Now that your user pool is being protected by the rate-based rules in the web ACL you created, you can proceed to tune the rate-based rule limits by analyzing AWS WAF logs.

Tune AWS WAF rate-based rule limits

As described in the previous section, the rate-based rules give you the ability to set separate rate limit values for each category of Amazon Cognito API actions.

Although the CloudFormation template has default starting values for these rate limits, it is important that you tune these values to match the traffic patterns for your user pool. To begin the tuning process, deploy the template with default values for all parameters, including Yes for TestMode. This overrides all rule actions to Count, allowing all requests but emitting CloudWatch metrics and AWS WAF log events for each rule that matches.

After you collect AWS WAF logs for a period of time (this period can vary depending on your traffic, from a couple of hours to a couple of days), you can analyze them, as shown in the next section, to get peak request rates to tune the rate limits to match observed traffic patterns for your user pool.

Query AWS WAF logs to calculate peak request rates by request type

You can calculate peak request rates by analyzing information that is present in AWS WAF logs. One way to analyze these is to send AWS WAF logs to S3 and to analyze the logs by using SQL queries in Amazon Athena. If you deploy the template in this post with default values, it creates the resources you need to analyze AWS WAF logs in S3 to calculate peak requests rates by request type.

If you are instead ingesting AWS WAF logs into your security information and event management (SIEM) system or a different analytics environment, you can create equivalent queries by using the query language for your SIEM or analytics environment to get similar results.
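
As a rough illustration of what such an equivalent analysis needs to compute, the following Python sketch bins requests into 5-minute windows per source IP and action category, and then keeps the highest counts per category. It assumes that you have already extracted each log record into a (timestamp in milliseconds, client IP, request category) tuple; that record layout and the function name peak_request_counts are our own assumptions, not an AWS WAF log format or API.

```python
from collections import Counter

WINDOW_MS = 5 * 60 * 1000  # 5-minute window, in milliseconds


def peak_request_counts(records, top_n=5):
    """Return {category: [(request_count, window_start_ms, client_ip), ...]}.

    records is an iterable of (timestamp_ms, client_ip, request_category)
    tuples. Each list holds at most top_n rows, highest request count first;
    ties are broken by the most recent window.
    """
    counts = Counter()
    for timestamp_ms, client_ip, category in records:
        time_bin = timestamp_ms // WINDOW_MS
        counts[(category, client_ip, time_bin)] += 1

    by_category = {}
    for (category, client_ip, time_bin), count in counts.items():
        row = (count, time_bin * WINDOW_MS, client_ip)
        by_category.setdefault(category, []).append(row)

    return {
        category: sorted(rows, key=lambda r: (-r[0], -r[1]))[:top_n]
        for category, rows in by_category.items()
    }
```

The first row for each category is the peak number of requests that any single source IP sent in one 5-minute window, which is the figure you compare against the rate limit for that category.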

To access and edit the queries created by the CloudFormation template

Open the Athena console and switch to the Athena workgroup that was created by the template (the default name is rate_analysis).

On the Saved queries tab, choose the query named Peak request rate per 5-minute period by source IP and request category. The following SQL query will be loaded into the edit panel.

-- Gets the top 5 source IPs sending the most requests in a 5-minute period per request category
-- NOTE: change the start and end timestamps to match the duration of interest
SELECT request_category, from_unixtime(time_bin*60*5) AS date_time, client_ip, request_count FROM (
  SELECT *, row_number() OVER (PARTITION BY request_category ORDER BY request_count DESC, time_bin DESC) AS row_num FROM (
    SELECT
      CASE
        WHEN ip_reputation_labels.name IN (
          'awswaf:managed:aws:amazon-ip-list:AWSManagedIPReputationList',
          'awswaf:managed:aws:amazon-ip-list:AWSManagedReconnaissanceList',
          'awswaf:managed:aws:amazon-ip-list:AWSManagedIPDDoSList'
        ) THEN 'IPReputation'
        WHEN target.value IN (
          'AWSCognitoIdentityProviderService.InitiateAuth',
          'AWSCognitoIdentityProviderService.RespondToAuthChallenge'
        ) THEN 'SignIn'
        WHEN target.value IN (
          'AWSCognitoIdentityProviderService.ResendConfirmationCode',
          'AWSCognitoIdentityProviderService.SignUp',
          'AWSCognitoIdentityProviderService.ConfirmSignUp'
        ) THEN 'UserCreation'
        WHEN target.value IN (
          'AWSCognitoIdentityProviderService.ForgotPassword',
          'AWSCognitoIdentityProviderService.ConfirmForgotPassword'
        ) THEN 'AccountRecovery'
        WHEN httprequest.uri IN (
          '/login',
          '/oauth2/authorize'
        ) THEN 'SignIn'
        WHEN httprequest.uri IN (
          '/signup',
          '/confirmUser',
          '/resendcode'
        ) THEN 'UserCreation'
        WHEN  httprequest.uri IN (
          '/forgotPassword',
          '/confirmForgotPassword'
        ) THEN 'AccountRecovery'
        ELSE 'Default'
      END AS request_category,
      httprequest.clientip AS client_ip,
      FLOOR("timestamp"/(1000*60*5)) AS time_bin,
      COUNT(*) AS request_count
    FROM waf_logs
      LEFT OUTER JOIN UNNEST(FILTER(httprequest.headers, h -> h.name = 'x-amz-target')) AS t(target) ON TRUE
      LEFT OUTER JOIN UNNEST(FILTER(labels, l -> l.name like 'awswaf:managed:aws:amazon-ip-list:%')) AS t(ip_reputation_labels) ON TRUE
    WHERE
      from_unixtime("timestamp"/1000) BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2023-01-01 00:00:00'
    GROUP BY 1, 2, 3
    ORDER BY 1, 4 DESC
  )
) WHERE row_num <= 5 ORDER BY request_category ASC, row_num ASC

Scroll down to Line 48 in the Query Editor and edit the timestamps to match the start and end time of the time window of interest.
Run the query to calculate the top 5 peak request rates per 5-minute period by source IP and by action category.

The results show the action category, source IP, time, and count of requests. You can use the request count to tune the rate limits for each action category.

The lowest rate limit you can set for AWS WAF rate-based rules is 100 requests per 5-minute period. If your query results show that the peak request count is less than 100, set the rate limit as 100 or higher.
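
One way to turn the peak counts from the query results into parameter values is sketched below in Python. The 1.5x headroom multiplier is an assumption chosen for illustration, not a recommendation from this post, and tuned_parameters is a hypothetical helper. The category names are the ones the query emits (for example SignIn), and appending RateLimit to each gives the matching parameter name from Table 3.

```python
import math

MIN_RATE_LIMIT = 100  # lowest value that a rate-based rule limit can take


def tuned_parameters(peak_counts: dict, headroom: float = 1.5) -> dict:
    """Map {request_category: peak_request_count} to template parameter values.

    Each limit is the observed peak multiplied by a headroom factor (an
    assumed safety margin), rounded up, and never lower than the minimum.
    """
    return {
        f"{category}RateLimit": max(MIN_RATE_LIMIT, math.ceil(peak * headroom))
        for category, peak in peak_counts.items()
    }
```

With observed peaks of 1,200 sign-in requests and 40 account recovery requests per 5-minute period, this sketch yields a SignInRateLimit of 1800 and an AccountRecoveryRateLimit of 100, because the second value is raised to the minimum.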

After you have tuned the rate limits, you can apply the changes to your web ACL by updating the CloudFormation stack.

To update the CloudFormation stack

On the CloudFormation console, choose the stack you created earlier.
Choose Update. For Prepare template, choose Use current template, and then choose Next.
Update the values of the parameters with rate limits to match the tuned values from your analysis.
You can choose to enable blocking of requests by setting TestMode to No. This will set the action to Block for the rate-based rules in the web ACL and start blocking traffic that exceeds the rate limits you have chosen.
Choose Next and then Next again to update the stack.

Now the rate-based rules are updated with your tuned limits, and requests will be blocked if you set TestMode to No.

Protect endpoints with user interaction

Now that we’ve covered the bases with rate-based rules, we’ll show you some more advanced AWS WAF rules that further help protect your user pool. We’ll explore two sample scenarios in detail, and provide AWS WAF rules for each. You can use the rules provided as a guideline to build others that can help with similar use cases.

Rules to verify human activity

The first scenario is protecting endpoints where users have interaction with the page. This will be a browser-based interaction, and a human is expected to be behind the keyboard. This scenario applies to the Hosted UI endpoints such as /login, /signup, and /forgotPassword, where a CAPTCHA can be rendered on the user’s browser for the user to solve. Let’s take the login (sign-in) endpoint as an example, and imagine you want to make sure that only actual human users are attempting to sign in and you want to block bots that might try to guess passwords.

To illustrate how to protect this endpoint with AWS WAF, we’re sharing a sample rule, shown in Figure 1. In this rule, you can take input from prior rules like the Amazon IP reputation list or the Anonymous IP list (which are configured to Count requests and add labels) and combine that with a CAPTCHA action. The logic of the rule says that if the request matches the reputation rules (and has received the corresponding labels) and is going to the /login endpoint, then the AWS WAF action should be to respond with a CAPTCHA challenge. This will present a challenge that increases the confidence that a human is performing the action, and it also adds a custom label so you can efficiently identify and have metrics on how many requests were matched by this rule. The rule is provided in the CloudFormation template and is in JSON format, because it has advanced logic that cannot be displayed by the console. Learn more about labels and CAPTCHA actions in the AWS WAF documentation.

Figure 1: Login sample rule flow

Note that the rate-based rules you created in the previous section are evaluated before the advanced rules. The rate-based rules will block requests to the /login endpoint that exceed the rate limit you have configured, while this advanced rule will match requests that are below the rate limit but match the other conditions in the rule.
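
To make the shape of such a rule more concrete, the following Python sketch builds a rule in the AWS WAF JSON rule format as a dictionary. It is a simplified approximation, not the rule from the CloudFormation template: the rule name, metric name, and custom label are hypothetical, it matches only the IP reputation list label that also appears in the Athena query in this post, and it leaves out the Anonymous IP list condition.

```python
import json

# Label added by the AWS managed IP reputation list rule (as seen in the query above).
IP_REPUTATION_LABEL = "awswaf:managed:aws:amazon-ip-list:AWSManagedIPReputationList"


def login_captcha_rule(priority: int) -> dict:
    """Build a rule that answers with a CAPTCHA when a request carries the
    IP reputation label AND targets the /login endpoint."""
    return {
        "Name": "LoginCaptchaOnReputationMatch",  # hypothetical rule name
        "Priority": priority,
        "Statement": {
            "AndStatement": {
                "Statements": [
                    {"LabelMatchStatement": {"Scope": "LABEL", "Key": IP_REPUTATION_LABEL}},
                    {
                        "ByteMatchStatement": {
                            "SearchString": "/login",
                            "FieldToMatch": {"UriPath": {}},
                            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                            "PositionalConstraint": "EXACTLY",
                        }
                    },
                ]
            }
        },
        "Action": {"Captcha": {}},
        # Custom label so that matches can be identified in logs and metrics.
        "RuleLabels": [{"Name": "custom:login:captcha"}],  # hypothetical label
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "LoginCaptchaOnReputationMatch",  # hypothetical metric name
        },
    }


if __name__ == "__main__":
    print(json.dumps(login_captcha_rule(priority=10), indent=2))
```

Because the rule depends on a label, the managed rule that adds the label must be evaluated earlier in the web ACL, which is why the priority is left as a parameter here.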

Rules for specific activity

The second scenario explores activity on specific application clients within the user pool. You can spot this activity by monitoring the logs provided by AWS WAF, or other traffic logs like Application Load Balancer (ALB) logs. The application client information is provided in the call to the service.

In the Amazon Cognito user pool in this scenario, we have different application clients and they’re constrained by geography. For example, for one of the application clients, requests are expected to come from the United States at or below a certain rate. We can create a rule that combines the rate and geographical criteria to block requests that don’t meet the conditions defined.

The flow of this rule is shown in Figure 2. The logic of the rule will evaluate the application client information provided in the request and the geographic information identified by the service, and apply the selected rate limit. If blocked, the rule will provide a custom response code by using HTTP code 429 Too Many Requests, which can help the sender understand the reason for the block. For requests that you make with the Amazon Cognito API, you could also customize the response body of a request that receives a Block response. Adding a custom response helps provide the sender context and adjust the rate or information that is sent.

Figure 2: AppClientId sample rule flow

AWS WAF can detect geo location with Region accuracy and add specific labels for the location. These can then be used in other rule evaluations. This rule is also provided as a sample in the CloudFormation template.
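
The decision flow for this scenario can be sketched in plain Python. This is a simulation of the logic under one possible reading of the rule (block when the request for a known application client comes from outside the expected country, or when it exceeds the rate for that client), not AWS WAF code and not the rule from the template. The class ClientPolicy, the function evaluate, and the example client ID are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ClientPolicy:
    allowed_countries: frozenset  # country codes that requests are expected from
    rate_limit: int               # max requests per 5-minute period per source IP


def evaluate(
    policies: Dict[str, ClientPolicy],
    client_id: str,
    country_code: str,
    requests_in_window: int,
) -> Tuple[str, Optional[int]]:
    """Return (action, HTTP status) for one request.

    Requests for application clients without a policy are not decided here
    and are left to the other rules in the web ACL.
    """
    policy = policies.get(client_id)
    if policy is None:
        return ("ALLOW", None)
    if country_code not in policy.allowed_countries:
        return ("BLOCK", 429)
    if requests_in_window > policy.rate_limit:
        return ("BLOCK", 429)
    return ("ALLOW", None)


# Example: one application client that expects traffic only from the United States.
POLICIES = {"example-app-client": ClientPolicy(frozenset({"US"}), rate_limit=500)}
```

Here, evaluate(POLICIES, "example-app-client", "US", 501) returns a block with status 429, while the same client at 100 requests in the window is allowed.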

Advanced protections

To build on the rules we’ve shared so far, you can consider using some of the other intelligent threat mitigation rules that are available as managed rules—namely, bot control for common or targeted bots. These rules offer advanced capabilities to detect bots in sensitive endpoints where automation or non-browser user agents are not expected or allowed. If you receive machine traffic to the endpoint, these rules will result in false positives that would need to be tuned. For more information, see Options for intelligent threat mitigation.

The sample rule flow in Figure 3 shows an example for our Hosted UI, which builds on the first rule we built for specific activity and adds signals coming from the Bot Control common bots managed rule, in this case the non-browser-user-agent label.

Figure 3: Login sample rule with advanced protections

Adding the bot detection label will also add accuracy to the evaluation, because AWS WAF will consider multiple different sources of information when analyzing the request. This can also block attacks that come from a small set of IPs or easily recognizable bots.

We’ve shared this rule in the CloudFormation template sample. The rule requires you to add AWS WAF Bot Control (ABC) before the custom rule evaluation. ABC has additional costs associated with it and should only be used for specific use cases. For more information on ABC and how to enable it, see this blog post.

After adding these protections, we have a complete set of rules for our Hosted UI–specific needs; consider that your traffic and needs might be different. Figure 4 shows you what the rule priority looks like. All rules except the last are included in the provided CloudFormation template. Managed rule evaluations need to have higher priority and be in Count mode; this way, a matching request can get labels that can be evaluated further down the priority list by using the custom rules that were created. For more information, see How labeling works.

Figure 4: Summary of the rules discussed in this post

Conclusion

In this post, we examined the different protections provided by the integration between AWS WAF and Amazon Cognito. This integration makes it simpler for you to view and monitor the activity in the different Amazon Cognito endpoints and APIs, while also adding rate-based rules and IP reputation evaluations. For more specific use cases and advanced protections, we provided sample custom rules that use labels, as well as an advanced rule that uses bot control for common bots. You can use these advanced rules as examples to create similar rules that apply to your use cases.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the re:Post with tag AWS WAF or contact AWS Support.

Oxy: Fish/Bumblebee/Splicer subsystems to improve reliability

2023-04-20 Quang Luong

Post Syndicated from Quang Luong original https://blog.cloudflare.com/oxy-fish-bumblebee-splicer-subsystems-to-improve-reliability/

At Cloudflare, we are building proxy applications on top of Oxy that must be able to handle a huge amount of traffic. Besides high performance requirements, the applications must also be resilient against crashes or reloads. As the framework evolves, the complexity also increases. While migrating WARP to support soft-unicast (Cloudflare servers don’t own IPs anymore), we needed to add different functionalities to our proxy framework. Those additions increased not only the code size but also the resource usage and the amount of state that must be preserved between process upgrades.

To address those issues, we opted to split a big proxy process into smaller, specialized services. Following the Unix philosophy, each service should have a single responsibility, and it must do it well. In this blog post, we will talk about how our proxy interacts with three different services – Splicer (which pipes data between sockets), Bumblebee (which upgrades an IP flow to a TCP socket), and Fish (which handles layer 3 egress using soft-unicast IPs). Those three services helped us improve system reliability and efficiency as we migrated WARP to support soft-unicast.

Splicer

Most transmission tunnels in our proxy forward packets without making any modifications. In other words, given two sockets, the proxy just relays the data between them: read from one socket and write to the other. This is a common pattern within Cloudflare, and we reimplement very similar functionality in separate projects. These projects often have their own tweaks for buffering, flushing, and terminating connections, and they also have to coordinate long-running proxy tasks with their own process restart or upgrade handling.

Turning this into a service allows other applications to hand a long-running proxying task to Splicer. The applications pass the two sockets to Splicer and no longer need to worry about keeping the connection alive when they restart. After finishing the task, Splicer returns the two original sockets and the original metadata attached to the request, so the original application can inspect the final state of the sockets – for example using TCP_INFO – and finalize audit logging if required.
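The relay loop at the heart of this pattern can be sketched with the standard library alone. This is a minimal, blocking illustration and not Splicer's implementation: the `relay` name and the 8 KiB buffer size are assumptions made for the example, and it covers a single direction only.

```rust
use std::io::{self, Read, Write};

/// Copy bytes from `from` to `to` until `from` reports end-of-stream.
/// Returns the number of bytes relayed. Hypothetical helper for illustration.
pub fn relay<R: Read, W: Write>(from: &mut R, to: &mut W) -> io::Result<u64> {
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = match from.read(&mut buf) {
            Ok(0) => break, // the peer closed its write side
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        to.write_all(&buf[..n])?;
        total += n as u64;
    }
    to.flush()?;
    Ok(total)
}
```

A bidirectional proxy would run one such loop per direction, for example on two threads that each hold a `try_clone()` of the two `TcpStream`s. Buffering, flushing, and termination policy are exactly the per-project tweaks mentioned above.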

Bumblebee

Many of Cloudflare’s on-ramps are IP-based (layer 3) but most of our services operate on TCP or UDP sockets (layer 4). To handle TCP termination, we want to create a kernel TCP socket from IP packets received from the client (and we can later forward this socket and an upstream socket to Splicer to proxy data between the eyeball and origin). Bumblebee performs the upgrades by spawning a thread in an anonymous network namespace with the unshare syscall, NAT-ing the IP packets, and using a tun device there to perform TCP three-way handshakes to a listener. You can find a more detailed write-up on how we upgrade an IP flow to a TCP stream here.

In short, other services just need to pass a socket carrying the IP flow, and Bumblebee will upgrade it to a TCP socket, no user-space TCP stack involved! After the socket is created, Bumblebee will return the socket to the application requesting the upgrade. Again, the proxy can restart without breaking the connection as Bumblebee pipes the IP socket while Splicer handles the TCP ones.

Fish

Fish forwards IP packets using soft-unicast IP space without upgrading them to layer 4 sockets. We previously implemented packet forwarding on shared IP space using iptables and conntrack. However, IP/port mapping management is not simple when you have many possible IPs to egress from and variable port assignments. Conntrack is highly configurable, but applying configuration through iptables rules requires careful coordination and debugging iptables execution can be challenging. Plus, relying on configuration when sending a packet through the network stack results in arcane failure modes when conntrack is unable to rewrite a packet to the exact IP or port range specified.

Fish attempts to overcome this problem by rewriting the packets and configuring conntrack using the netlink protocol. Put differently, a proxy application sends Fish a socket containing IP packets from the client, together with the desired soft-unicast IP and port range. Fish then forwards those packets to their destination. The client’s choice of IP address does not matter; Fish ensures that egressed IP packets have a unique five-tuple within the root network namespace and performs the necessary packet rewriting to maintain this isolation. Fish’s internal state also survives restarts.
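To make the five-tuple requirement concrete, here is a small sketch of choosing a source port so that no two egress flows share the same (protocol, source IP, source port, destination IP, destination port). It is not Fish's algorithm, which the post does not describe; `EgressTable` and `allocate` are hypothetical names, and a first-fit scan over the port range is assumed purely for simplicity.

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;
use std::ops::RangeInclusive;

/// (protocol, source IP, source port, destination IP, destination port)
type FiveTuple = (u8, Ipv4Addr, u16, Ipv4Addr, u16);

/// Hypothetical table of the five-tuples currently in use.
#[derive(Default)]
pub struct EgressTable {
    in_use: HashSet<FiveTuple>,
}

impl EgressTable {
    /// Pick the first source port in `ports` that yields an unused five-tuple
    /// for this egress IP and destination, or `None` if the range is exhausted.
    pub fn allocate(
        &mut self,
        proto: u8,
        egress_ip: Ipv4Addr,
        ports: RangeInclusive<u16>,
        dst_ip: Ipv4Addr,
        dst_port: u16,
    ) -> Option<u16> {
        for port in ports {
            // `insert` returns true only if the tuple was not already present.
            if self.in_use.insert((proto, egress_ip, port, dst_ip, dst_port)) {
                return Some(port);
            }
        }
        None
    }
}
```

Because uniqueness is checked on the whole tuple, the same source port can be reused for a different destination, which matters when many clients share a small port range on one soft-unicast IP.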

The Unix philosophy, manifest

To sum up what we have so far: instead of adding the functionalities directly to the proxy application, we create smaller, reusable services. It becomes possible to understand the failure cases present in a smaller system and design it to exhibit reliable behavior. If we can then split a larger system into subsystems, we can apply this logic to each of them. By focusing on making each smaller service work correctly, we improve the whole system’s reliability and development agility.

Although the business logic of these three services differs, they have one thing in common: they receive sockets, or file descriptors, from other applications so that those applications can restart. The services themselves can also be restarted without dropping connections. Let’s take a look at how graceful restart and file descriptor passing work in our case.

File descriptor passing

We use Unix domain sockets for inter-process communication, which is a common pattern. Besides sending raw data, Unix sockets also allow passing file descriptors between different processes. This is essential for our architecture as well as for graceful restarts.

There are two main ways to transfer a file descriptor: using the pidfd_getfd syscall or SCM_RIGHTS. The latter is the better choice for us here, as our use cases lean toward the proxy application “giving” the sockets rather than the microservices “taking” them. Moreover, the first method would require special permission and a way for the proxy to signal which file descriptor to take.

Currently we have our own internal library named hot-potato to pass the file descriptors around, as we use stable Rust in production. If you are fine with using nightly Rust, you may want to consider the unix_socket_ancillary_data feature. The linked blog post above about SCM_RIGHTS also explains how that can be implemented. Still, we want to add some “interesting” details you may want to know before using SCM_RIGHTS in production:

There is a maximum number of file descriptors you can pass per message. The limit is defined by the constant SCM_MAX_FD in the kernel, which has been set to 253 since kernel version 2.6.38.
Getting the peer credentials of a socket may be quite useful for observability in multi-tenant settings.
SCM_RIGHTS ancillary data forms a message boundary.
It is possible to send any file descriptor, not only sockets. We use this trick together with memfd_create to get around the maximum buffer size without implementing something like length-encoded frames. This also makes zero-copy message passing possible.
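One practical consequence of the per-message limit is that a sender holding more descriptors than SCM_MAX_FD has to split them across several messages. The sketch below shows only that batching step; the `batches` helper is a hypothetical name, and the actual sendmsg call with SCM_RIGHTS ancillary data is deliberately left out.

```rust
use std::os::unix::io::RawFd;

/// Per-message limit described above (SCM_MAX_FD in the Linux kernel).
pub const SCM_MAX_FD: usize = 253;

/// Split a list of descriptors into batches that each fit in one message.
pub fn batches(fds: &[RawFd]) -> Vec<&[RawFd]> {
    fds.chunks(SCM_MAX_FD).collect()
}
```

Since each SCM_RIGHTS message forms its own boundary, the receiver sees one batch per receive call and needs some way to know how many batches to expect.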

Graceful restart

We explored the general strategy for graceful restart in the “Oxy: the journey of graceful restarts” blog post. Let’s dive into how we leverage tokio and file descriptor passing to migrate all important states in the old process to the new one. We can terminate the old process almost instantly without leaving any connection behind.

Passing states and file descriptors

Applications like NGINX can be reloaded with no downtime. However, if there are pending requests then there will be lingering processes that handle those connections before they terminate. This is not ideal for observability. It can also cause performance degradation when the old processes start building up after consecutive restarts.

In the three microservices in this blog post, we use the state-passing concept, where pending requests are paused and transferred to the new process. The new process picks up both new requests and the old ones immediately on start. This method is admittedly more complex than keeping the old process running. At a high level, we have the following extra steps when the application receives an upgrade request (usually SIGHUP): pause all tasks, wait until all tasks (in groups) are paused, and send them to the new process.

WaitGroup using JoinSet

Problem statement: we dynamically spawn different concurrent tasks, and each task can spawn new child tasks. We must wait for some of them to complete before continuing.

In other words, tasks can be managed as groups. In Go, waiting for a collection of tasks to complete is a solved problem with WaitGroup. We discussed a way to implement WaitGroup in Rust using channels in a previous blog. There also exist crates like waitgroup that simply use AtomicWaker. Another approach is using JoinSet, which may make the code more readable. Consider the example below, where we group the requests using a JoinSet.

    use tokio::task::JoinSet;

    let mut task_group = JoinSet::new();

    loop {
        // Receive the next request from a listener
        let Some(request) = listener.recv().await else {
            println!("There is no more request");
            break;
        };
        // Spawn a task that will process the request.
        // This returns immediately
        task_group.spawn(process_request(request));
    }

    // Wait for all requests to complete before continuing
    while task_group.join_next().await.is_some() {}

However, an obvious problem with this is that if we receive a lot of requests, the JoinSet will need to keep the results for all of them. Let’s change the code to clean up the JoinSet as the application processes new requests, so we have lower memory pressure:

    loop {
        tokio::select! {
            biased; // This is optional

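            // Clean up the JoinSet as we go
            // Note: checking for is_empty is important 😉
            // On an empty set join_next() resolves to None immediately,
            // so without the guard this branch would fire on every iteration.
            _task_result = task_group.join_next(), if !task_group.is_empty() => {}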

            req = listener.recv() => {
                let Some(request) = req else {
                    println!("There is no more request");
                    break;
                };
                task_group.spawn(process_request(request));
            }
        }
    }

    while task_group.join_next().await.is_some() {}

Cancellation

We want to pass the pending requests to the new process as soon as possible once the upgrade signal is received. This requires us to pause all requests we are processing. In other terms, to be able to implement graceful restart, we need to implement graceful shutdown. The official tokio tutorial already covered how this can be achieved by using channels. Of course, we must guarantee the tasks we are pausing are cancellation-safe. The paused results will be collected into the JoinSet, and we just need to pass them to the new process using file descriptor passing.
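The pause-and-collect idea can be illustrated without an async runtime. The sketch below swaps tokio tasks and channels for standard-library threads and a shared atomic flag, so it is not how the services in this post are built; `Outcome`, `run_task`, and `pause_all` are hypothetical names. Each task checks the flag only between units of work, which is the thread-based analogue of pausing at a cancellation-safe point.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// What a task hands back when it stops. `Paused` carries the state
/// (here just the remaining work units) that would go to the new process.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Finished,
    Paused { remaining: u32 },
}

/// Hypothetical task: performs `units` steps, checking the pause flag between
/// steps so that it only stops where its state is consistent.
pub fn run_task(units: u32, pause: &AtomicBool) -> Outcome {
    let mut remaining = units;
    while remaining > 0 {
        if pause.load(Ordering::Acquire) {
            return Outcome::Paused { remaining };
        }
        remaining -= 1; // one step of real work would go here
    }
    Outcome::Finished
}

/// Start the tasks, signal all of them to pause, then collect every result.
pub fn pause_all(task_units: Vec<u32>) -> Vec<Outcome> {
    let pause = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = task_units
        .into_iter()
        .map(|units| {
            let pause = Arc::clone(&pause);
            thread::spawn(move || run_task(units, &pause))
        })
        .collect();
    pause.store(true, Ordering::Release);
    handles
        .into_iter()
        .map(|h| h.join().expect("task panicked"))
        .collect()
}
```

Whether a given task comes back as `Finished` or `Paused` depends on timing; what matters is that every task returns a well-defined state that can be serialized and handed over.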

For example, in Bumblebee, a paused state will include the environment’s file descriptors, client socket, and the socket proxying IP flow. We also need to transfer the current NAT table to the new process, which could be larger than the socket buffer. So the NAT table state is encoded into an anonymous file descriptor, and we just need to pass the file descriptor to the new process.

Conclusion

We considered how a complex proxy app can be divided into smaller components. Those components can run as separate processes with different lifetimes. Still, this type of architecture does incur additional costs: distributed tracing and inter-process communication. We consider those costs acceptable given the performance, maintainability, and reliability improvements. In upcoming blog posts, we will talk about the debugging tricks we learned when working with a large codebase with complex service interactions, using tools like strace and eBPF.

Oxy: the journey of graceful restarts

2023-04-04 Chris Branch

Post Syndicated from Chris Branch original https://blog.cloudflare.com/oxy-the-journey-of-graceful-restarts/

Any software under continuous development and improvement will eventually need a new version deployed to the systems running it. This can happen in several ways, depending on how much you care about things like reliability, availability, and correctness. When I started out in web development, I didn’t think about any of these qualities; I simply blasted my new code over FTP directly to my /cgi-bin/ directory, which was the style at the time. For those of us producing desktop software, often you sidestep this entirely by having the user save their work, close the program and install an update – but they usually get to decide when this happens.

At Cloudflare we have to take this seriously. Our software is in constant use and cannot simply be stopped abruptly. A dropped HTTP request can cause an entire webpage to load incorrectly, and a broken connection can kick you out of a video call. Taking away reliability creates a vacuum filled only by user frustration.

The limitations of the typical upgrade process

There is no one right way to upgrade software reliably. Some programming languages and environments make it easier than others, but in a Turing-complete language few things are impossible.

One popular and generally applicable approach is to start a new version of the software, make it responsible for a small number of tasks at first, and then gradually increase its workload until the new version is responsible for everything and the old version responsible for nothing. At that point, you can stop the old version.

Most of Cloudflare’s proxies follow a similar pattern: they receive connections or requests from many clients over the Internet, communicate with other internal services to decide how to serve the request, and fetch content over the Internet if we cannot serve it locally. In general, all of this work happens within the lifetime of a client’s connection. If we aren’t serving any clients, we aren’t doing any work.

The safest time to restart, therefore, is when there is nobody to interrupt. But does such a time really exist? The Internet operates 24 hours a day and many users rely on long-running connections for things like backups, real-time updates or remote shell sessions. Even if you defer restarts to a “quiet” period, the next-best strategy of “interrupt the fewest number of people possible” will fail when you have a critical security fix that needs to be deployed immediately.

Despite this challenge, we have to start somewhere. You rarely arrive at the perfect solution on your first try.

(╯°□°）╯︵ ┻━┻

We have previously blogged about implementing graceful restarts in Cloudflare’s Go projects, using a library called tableflip. This starts a new version of your program and allows the new version to signal to the old version that it started successfully, then lets the old version clear its workload. For a proxy like any Oxy application, that means the old version stops accepting new connections once the new version starts accepting connections, then drives its remaining connections to completion.

This is the simplest case of the migration strategy previously described: the new version immediately takes all new connections, instead of a gradual rollout. But in aggregate across Cloudflare’s server fleet the upgrade process is spread across several hours and the result is as gradual as a deployment orchestrated by Kubernetes or similar.

tableflip also allows your program to bind to sockets, or to reuse the sockets opened by a previous instance. This enables the new instance to accept new connections on the same socket and let the old instance release that responsibility.

Oxy is a Rust project, so we can’t reuse tableflip. We rewrote the spawning/signaling section in Rust, but not the socket code. For that we had an alternative approach.

Socket management with systemd

systemd is a widely used suite of programs for starting and managing all of the system software needed to run a useful Linux system. It is responsible for running software in the correct order – for example ensuring the network is ready before starting a program that needs network access – or running it only if it is needed by another program.

Socket management falls in this latter category, under the term ‘socket activation’. Its intended and original use is interesting but ultimately irrelevant here; for our purposes, systemd is a mere socket manager. Many Cloudflare services configure their sockets using systemd .socket files, and when their service is started the socket is brought into the process with it. This is how we deploy most Oxy-based services, and Oxy has first-class support for sockets opened by systemd.

Using systemd decouples the lifetime of sockets from the lifetime of the Oxy application. If Oxy creates its own sockets on startup, those sockets are closed whenever you restart or temporarily stop the Oxy application. When clients attempt to connect to the proxy during this time, they will get a very unfriendly “connection refused” error. If, however, systemd manages the socket, that socket remains open even while the Oxy application is stopped. Clients can still connect to the socket and those connections will be served as soon as the Oxy application starts up successfully.
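For readers unfamiliar with how a service learns about sockets opened by systemd, the convention is that inherited descriptors start at number 3, that the LISTEN_FDS environment variable gives their count, and that LISTEN_PID names the process they are meant for. The sketch below implements just that lookup; it is not Oxy's code, and `inherited_fds` is a hypothetical helper that takes the variable values as arguments so it can be exercised without touching the real environment.

```rust
use std::os::unix::io::RawFd;

/// First descriptor passed by systemd (SD_LISTEN_FDS_START).
const LISTEN_FDS_START: RawFd = 3;

/// Work out which inherited descriptors belong to this process.
/// `listen_pid` and `listen_fds` are the values of the LISTEN_PID and
/// LISTEN_FDS environment variables; `my_pid` is the current process ID.
pub fn inherited_fds(
    listen_pid: Option<&str>,
    listen_fds: Option<&str>,
    my_pid: u32,
) -> Vec<RawFd> {
    // The variables only apply if LISTEN_PID names this exact process.
    let pid_matches = listen_pid.and_then(|p| p.parse::<u32>().ok()) == Some(my_pid);
    if !pid_matches {
        return Vec::new();
    }
    let count = listen_fds
        .and_then(|n| n.parse::<RawFd>().ok())
        .unwrap_or(0)
        .max(0);
    (LISTEN_FDS_START..LISTEN_FDS_START.saturating_add(count)).collect()
}
```

In a real program the arguments would come from `std::env::var` and `std::process::id()`, and each returned descriptor would be wrapped with something like `unsafe { TcpListener::from_raw_fd(fd) }` once its type is known.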

Channeling your inner WaitGroup

A useful piece of library code our Go projects use is WaitGroups. These are essential in Go, where goroutines – asynchronously-running code blocks – are pervasive. Waiting for goroutines to complete before continuing another task is a common requirement. Even the example for tableflip uses them, to demonstrate how to wait for tasks to shut down cleanly before quitting your process.

There is not an out-of-the-box equivalent in tokio – the async Rust runtime Oxy uses – or async/await generally, so we had to create one ourselves. Fortunately, most of the building blocks to roll your own exist already. Tokio has multi-producer, single consumer (MPSC) channels, generally used by multiple tasks to push the results of work onto a queue for a single task to process, but we can exploit the fact that it signals to that single receiver when all the sender channels have been closed and no new messages are expected.

To start, we create an MPSC channel. Each task takes a clone of the producer end of the channel, and when that task completes it closes its instance of the producer. When we want to wait for all of the tasks to complete, we await a result on the consumer end of the MPSC channel. When every instance of the producer channel is closed – i.e. all tasks have completed – the consumer receives a notification that all of the channels are closed. Closing the channel when a task completes is an automatic consequence of Rust’s RAII rules. Because the language enforces this rule it is harder to write incorrect code, though in fact we need to write very little code at all.
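The same trick works with the standard library's MPSC channel, which makes for a compact illustration. The sketch below uses `std::sync::mpsc` and threads in place of tokio's channel and tasks, so it is an analogue rather than Oxy's code; `run_and_wait` is a hypothetical name and the counter merely stands in for real work.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Spawn `n` tasks and block until all of them have finished.
/// Returns how many tasks ran.
pub fn run_and_wait(n: usize) -> usize {
    // No message is ever sent; the channel exists only to track senders.
    let (tx, rx) = mpsc::channel::<()>();
    let done = Arc::new(AtomicUsize::new(0));

    for _ in 0..n {
        let tx = tx.clone();
        let done = Arc::clone(&done);
        thread::spawn(move || {
            done.fetch_add(1, Ordering::SeqCst); // the task's work
            drop(tx); // explicit here; it would happen anyway when the closure ends
        });
    }

    // Drop the original sender, otherwise recv() would never observe
    // that every producer has gone away.
    drop(tx);

    // recv() returns Err once every sender clone has been dropped.
    assert!(rx.recv().is_err());
    done.load(Ordering::SeqCst)
}
```

The explicit `drop(tx)` inside the task is only for emphasis: because the closure owns its clone, the sender is released even if the task returns early or panics, which is the RAII guarantee described above.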

Getting feedback on failure

Many programs that implement a graceful reload/restart mechanism use Unix signals to trigger the process to perform an action. Signals are an ancient technique introduced in early versions of Unix to solve a specific problem while creating dozens more. A common pattern is to change a program’s configuration on disk, then send it a signal (often SIGHUP) which the program handles by reloading those configuration files.

The limitations of this technique are obvious as soon as you make a mistake in the configuration, or when an important file referenced in the configuration is deleted. You reload the program and wonder why it isn’t behaving as you expect. If an error is raised, you have to look in the program’s log output to find out.

This problem compounds when you use an automated configuration management tool. It is not useful if that tool makes a configuration change and reports that it successfully reloaded your program, when in fact the program failed to read the change. The only thing that was successful was sending the reload signal!

We solved this in Oxy by creating a Unix socket specifically for coordinating restarts, and adding a new mode to Oxy that triggers a restart. In this mode:

The restarter process validates the configuration file.
It connects to the restart coordination socket defined in that file.
It sends a “restart requested” message.
The current proxy instance receives this message.
A new instance is started, inheriting a pipe it will use to notify its parent instance.
The current instance waits for the new instance to report success or fail.
The current instance sends a “restart response” message back to the restarter process, containing the result.
The restarter process reports this result back to the user, using exit codes for automated systems to detect failure.
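The request/response exchange in these steps can be sketched over a connected pair of Unix sockets. Everything specific here is an assumption made for illustration: the line-based message strings, the exit codes 0 and 1, and the `start_new_instance` callback, which stands in for spawning the new instance and waiting on its pipe. Configuration validation and looking up the socket path are omitted.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;
use std::thread;

/// Proxy side: read one request line, try to start the new instance,
/// and send the outcome back to the restarter.
fn serve_restart(
    conn: UnixStream,
    start_new_instance: fn() -> Result<(), String>,
) -> io::Result<()> {
    let mut line = String::new();
    BufReader::new(&conn).read_line(&mut line)?;
    let reply = if line.trim() != "restart-requested" {
        "error unknown request".to_string()
    } else {
        match start_new_instance() {
            Ok(()) => "ok".to_string(),
            Err(reason) => format!("error {}", reason),
        }
    };
    (&conn).write_all(format!("{}\n", reply).as_bytes())
}

/// Restarter side: send the request, wait for the response, and turn it
/// into a process exit code (0 = success, 1 = failure).
fn request_restart(conn: UnixStream) -> io::Result<i32> {
    (&conn).write_all(b"restart-requested\n")?;
    let mut reply = String::new();
    BufReader::new(&conn).read_line(&mut reply)?;
    Ok(if reply.trim() == "ok" { 0 } else { 1 })
}

/// Wire both sides together over a socket pair and return the exit code.
pub fn simulate(start_new_instance: fn() -> Result<(), String>) -> i32 {
    let (restarter_end, proxy_end) = UnixStream::pair().expect("socketpair");
    let proxy = thread::spawn(move || serve_restart(proxy_end, start_new_instance));
    let code = request_restart(restarter_end).expect("restarter I/O");
    proxy.join().expect("proxy thread").expect("proxy I/O");
    code
}
```

The important property is that the restarter blocks until it receives an explicit result, so a configuration management tool sees a non-zero exit code when the new instance fails to start.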

Now when we make a change to any of our Oxy applications, we can be confident that failures are detected using nothing more than our SREs’ existing tooling. This lets us discover failures earlier, narrow down root causes sooner, and avoid our systems getting into an inconsistent state.

This technique is described more generally in a coworker’s blog, using an internal HTTP endpoint instead. Yet HTTP is missing one important property of Unix sockets for the purpose of replacing signals. A user may only send a signal to a process if the process belongs to them – i.e. they started it – or if the user is root. This prevents another user logged into the same machine as you from terminating all of your processes. As Unix sockets are files, they also follow the Unix permission model. Write permissions are required to connect to a socket. Thus we can trivially reproduce the signals security model by making the restart coordination socket user writable only. (Root, as always, bypasses all permission checks.)
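Restricting the coordination socket to its owner takes only a few lines with the standard library. This is a sketch under stated assumptions rather than Oxy's code: `bind_owner_only` is a hypothetical helper, and mode 0o600 is one way to express "user writable only".

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::os::unix::net::UnixListener;
use std::path::Path;

/// Bind a Unix socket at `path` and make it accessible to its owner only.
pub fn bind_owner_only(path: &Path) -> io::Result<UnixListener> {
    // A stale socket file from a previous run would make bind() fail.
    match fs::remove_file(path) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    let listener = UnixListener::bind(path)?;
    // 0o600: read/write for the owning user, nothing for group or others.
    fs::set_permissions(path, fs::Permissions::from_mode(0o600))?;
    Ok(listener)
}
```

Note that the socket briefly exists with default permissions between `bind` and `set_permissions`; placing it inside a directory that only the owner can enter closes that window.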

Leave no connection behind

We have put a lot of effort into making restarts as graceful as possible, but there are still certain limitations. After restarting, eventually the old process has to terminate, to prevent a build-up of old processes after successive restarts consuming excessive memory and reducing the performance of other running services. There is an upper bound to how long we’ll let the old process run for; when this is reached, any connections remaining are forcibly broken.

The configuration changes that can be applied using graceful restart are limited by the design of systemd. While some configuration like resource limits can now be applied without restarting the service it applies to, others cannot; most significantly, new sockets. This is a problem inherent to the fork-and-inherit model.

For UDP-based protocols like HTTP/3, there is not even a concept of listener socket. The new process may open UDP sockets, but by default incoming packets are balanced between all open unconnected UDP sockets for a given address. How does the old process drain existing sessions without receiving packets intended for the new process and vice versa?

Is there a way to carry existing state to a new process to avoid some of these limitations? This is a hard problem to solve generally, and even in languages designed to support hot code upgrades there is some degree of running old tasks with old versions of code. Yet there are some common useful tasks that can be carried between processes so we can “interrupt the fewest number of people possible”.

Let’s not forget the unplanned outages: segfaults, the OOM killer and other crashes. These are thankfully rare in Rust code, but not impossible.

You can find the source for our Rust implementation of graceful restarts, named shellflip, in its GitHub repository. However, restarting correctly is just the first step of many needed to achieve our ultimate reliability goals. In a follow-up blog post we’ll talk about some creative solutions to these limitations.

Introducing Rollbacks for Workers Deployments

2023-04-03 Cloudflare

Post Syndicated from Cloudflare original https://blog.cloudflare.com/introducing-rollbacks-for-workers-deployments/

In November 2022, we introduced deployments for Workers. Deployments are created as you make changes to a Worker. Each one is unique. These let you track changes to your Workers over time, seeing who made the changes, and where they came from.

When we made the announcement, we also said our intention was to build more functionality on top of deployments.

Today, we’re proud to release rollbacks for deployments.

Rollbacks

As nice as it would be to know that every deployment is perfect, it’s not always possible – for various reasons. Rollbacks provide a quick way to deploy past versions of a Worker – providing another layer of confidence when developing and deploying with Workers.

Via the dashboard

In the dashboard, you can navigate to the Deployments tab. For each deployment that’s not the most recent, you should see a new icon on the far right of the deployment. Hovering over that icon will display the option to roll back to the specified deployment.

Clicking on that will bring up a confirmation dialog, where you can enter a reason for rollback. This provides another mechanism of record-keeping and helps give more context for why the rollback was necessary.

Once you enter a reason and confirm, a new rollback deployment will be created. This deployment has its own ID, but is a duplicate of the one you rolled back to. A message appears with the new deployment ID, as well as an icon showing the rollback message you entered above.

Via Wrangler

With Wrangler version 2.13, you can roll back deployments using a new command – wrangler rollback. This command takes an optional ID to roll back to a specific deployment, but can also be run without an ID to roll back to the previous deployment. This provides an even faster way to roll back in a situation where you know that the previous deployment is the one that you want.

Just like the dashboard, when you initiate a rollback you will be prompted to add a rollback reason and to confirm the action.

In addition to wrangler rollback we’ve done some refactoring to the wrangler deployments command. Now you can run wrangler deployments list to view up to the last 10 deployments.

Here, you can see two new annotations: rollback from and message. These match the dashboard experience, and provide more visibility into your deployment history.

To view an individual deployment, you can run wrangler deployments view. This will display the last deployment made, which is the active deployment. If you would like to see a specific deployment, you can run wrangler deployments view [ID].

We’ve updated this command to display more data, such as the compatibility date, usage model, and bindings. This additional data will help you quickly visualize changes to a Worker or see more about a specific Worker deployment without having to open your editor and go through source code.

Keep deploying!

We hope this feature provides even more confidence in deploying Workers, and we encourage you to try it out! If you leverage the Cloudflare dashboard to manage deployments, you should have access immediately. Wrangler users will need to update to version 2.13 to see the new functionality.

Make sure to check out our updated deployments docs for more information, as well as information on limitations to rollbacks. If you have any feedback, please let us know via this form.

Oxy is Cloudflare’s Rust-based next generation proxy framework

2023-03-02 Ivan Nikulin

Post Syndicated from Ivan Nikulin original https://blog.cloudflare.com/introducing-oxy/

In this blog post, we are proud to introduce Oxy – our modern proxy framework, developed using the Rust programming language. Oxy is a foundation of several Cloudflare projects, including the Zero Trust Gateway, the iCloud Private Relay second hop proxy, and the internal egress routing service.

Oxy leverages our years of experience building high-load proxies to implement the latest communication protocols, enabling us to effortlessly build sophisticated services that can accommodate massive amounts of daily traffic.

We will be exploring Oxy in greater detail in upcoming technical blog posts, providing a comprehensive and in-depth look at its capabilities and potential applications. For now, let us embark on this journey and discover what Oxy is and how we built it.

What Oxy does

We refer to Oxy as our “next-generation proxy framework”. But what do we really mean by “proxy framework”? Picture a server (like NGINX, which readers might be familiar with) that can proxy traffic with an array of protocols, including various predefined common traffic flow scenarios that enable you to route traffic to specific destinations or even egress with a different protocol than the one used for ingress. This server can be configured in many ways for specific flows and boasts tight integration with the surrounding infrastructure, whether telemetry consumers or networking services.

Now, take all of that and add in the ability to programmatically control every aspect of the proxying: protocol decapsulation, traffic analysis, routing, tunneling logic, DNS resolution, and so much more. And this is what the Oxy proxy framework is: a feature-rich proxy server tightly integrated with our internal infrastructure that’s customizable to meet application requirements, allowing engineers to tweak every component.

This design is in line with our belief in an iterative approach to development, where a basic solution is built first and then gradually improved over time. With Oxy, you can start with a basic solution that can be deployed to our servers and then add additional features as needed, taking advantage of the many extensibility points offered by Oxy. In fact, you can avoid writing any code besides a few lines of bootstrap boilerplate and still get a production-ready server with a wide variety of startup configuration options and traffic flow scenarios.

For example, suppose you’d like to implement an HTTP firewall. With Oxy, you can proxy HTTP(S) requests right out of the box, eliminating the need to write any code related to production services, such as request metrics and logs. You simply need to implement an Oxy hook handler for HTTP requests and responses. If you’ve used Cloudflare Workers before, then you should be familiar with this extensibility model.
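To give a feel for this extensibility model, here is a purely hypothetical sketch of what a request hook for an HTTP firewall could look like. Oxy's real interface is not shown in this post, so every name below (`HttpRequestHook`, `RequestInfo`, `Verdict`, `Firewall`) is invented for illustration and should not be read as Oxy's API.

```rust
/// What the hook tells the proxy to do with a request.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Allow,
    Block { status: u16 },
}

/// Minimal stand-in for the request data a hook might see.
pub struct RequestInfo<'a> {
    pub method: &'a str,
    pub host: &'a str,
    pub path: &'a str,
}

/// Hypothetical hook trait: the framework would call this for each request.
pub trait HttpRequestHook {
    fn on_request(&self, req: &RequestInfo<'_>) -> Verdict;
}

/// Example firewall: block a fixed list of hosts and any path under /admin.
pub struct Firewall {
    pub blocked_hosts: Vec<String>,
}

impl HttpRequestHook for Firewall {
    fn on_request(&self, req: &RequestInfo<'_>) -> Verdict {
        let host_blocked = self
            .blocked_hosts
            .iter()
            .any(|h| h.eq_ignore_ascii_case(req.host));
        if host_blocked || req.path.starts_with("/admin") {
            Verdict::Block { status: 403 }
        } else {
            Verdict::Allow
        }
    }
}
```

The application code contains only the policy decision; listening, TLS, metrics, and logging would all be the framework's job under this model.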

Similarly, you can implement a layer 4 firewall by providing application hooks that handle ingress and egress connections. This goes beyond a simple block/accept scenario, as you can build authentication functionality or a traffic router that sends traffic to different destinations based on the geographical information of the ingress connection. The capabilities are incredibly rich, and we’ve made the extensibility model as ergonomic and flexible as possible. As an example, if information obtained from layer 4 is insufficient to make an informed firewall decision, the app can simply ask Oxy to decapsulate the traffic and process it with the HTTP firewall.

The aforementioned scenarios are prevalent in many products we build at Cloudflare, so having a foundation that incorporates ready solutions is incredibly useful. This foundation has absorbed lots of experience we’ve gained over the years, taking care of many sharp and dark corners of high-load service programming. As a result, application implementers can stay focused on the business logic of their application with Oxy taking care of the rest. In fact, we’ve been able to create a few privacy proxy applications using Oxy that now serve massive amounts of traffic in production with less than a couple of hundred lines of code. This is something that would have taken multiple orders of magnitude more time and lines of code before.

As previously mentioned, we’ll dive deeper into the technical aspects in future blog posts. However, for now, we’d like to provide a brief overview of Oxy’s capabilities. This will give you a glimpse of the many ways in which Oxy can be customized and used.

On-ramps

An on-ramp defines a combination of transport layer socket type and protocols that server listeners can use for ingress traffic.

Oxy supports a wide variety of traffic on-ramps:

HTTP 1/2/3 (including various CONNECT protocols for layer 3 and 4 traffic)
TCP and UDP traffic over Proxy Protocol
general purpose IP traffic, including ICMP

With Oxy, you have the ability to analyze and manipulate traffic at multiple layers of the OSI model – from layer 3 to layer 7. This allows for a wide range of possibilities in terms of how you handle incoming traffic.

One of the most notable and powerful features of Oxy is the ability for applications to force decapsulation. This means that an application can analyze traffic at a higher level, even if it originally arrived at a lower level. For example, if an application receives IP traffic, it can choose to analyze the UDP traffic encapsulated within the IP packets. With just a few lines of code, the application can tell Oxy to upgrade the IP flow to a UDP tunnel, effectively allowing the same code to be used for different on-ramps.

The application can even go further and ask Oxy to sniff UDP packets and check if they contain HTTP/3 traffic. In this case, Oxy can upgrade the UDP traffic to HTTP and handle HTTP/3 requests that were originally received as raw IP packets. This allows for the simultaneous processing of traffic at all three layers (L3, L4, L7), enabling applications to analyze, filter, and manipulate the traffic flow from multiple perspectives. This provides a robust toolset for developing advanced traffic processing applications.
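To make the first decapsulation step concrete, here is a minimal sketch of pulling a UDP datagram out of a raw IPv4 packet. It is not Oxy code; `UdpDatagram` and `decapsulateUdp` are hypothetical names, and the sketch deliberately ignores IPv6, fragmented packets, and checksum validation. Detecting HTTP/3 would be a further step on the returned payload and is not shown.

```typescript
interface UdpDatagram {
  sourceAddress: string;
  destinationAddress: string;
  sourcePort: number;
  destinationPort: number;
  payload: Uint8Array;
}

const IPPROTO_UDP = 17;

// Returns the UDP datagram carried by an IPv4 packet, or null when the
// packet is not a well-formed, unfragmented IPv4/UDP packet.
function decapsulateUdp(packet: Uint8Array): UdpDatagram | null {
  if (packet.length < 20) return null;
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);

  const versionAndIhl = view.getUint8(0);
  const version = versionAndIhl >> 4;
  const headerLength = (versionAndIhl & 0x0f) * 4; // IHL counts 32-bit words
  if (version !== 4 || headerLength < 20) return null;
  if (view.getUint8(9) !== IPPROTO_UDP) return null;

  // "More fragments" flag or a non-zero fragment offset: out of scope here.
  if ((view.getUint16(6) & 0x3fff) !== 0) return null;

  const totalLength = view.getUint16(2);
  if (totalLength > packet.length || totalLength < headerLength + 8) return null;

  const udpLength = view.getUint16(headerLength + 4); // UDP header + payload
  if (udpLength < 8 || headerLength + udpLength > totalLength) return null;

  const address = (offset: number): string =>
    Array.from(packet.subarray(offset, offset + 4)).join(".");

  return {
    sourceAddress: address(12),
    destinationAddress: address(16),
    sourcePort: view.getUint16(headerLength),
    destinationPort: view.getUint16(headerLength + 2),
    payload: packet.subarray(headerLength + 8, headerLength + udpLength),
  };
}
```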

Off-ramps

An off-ramp defines a combination of transport layer socket type and protocols that proxy server connectors can use for egress traffic.

Oxy offers versatility in its egress methods, supporting a range of protocols including HTTP 1 and 2, UDP, TCP, and IP. It is equipped with internal DNS resolution and caching, as well as customizable resolvers, with automatic fallback options for maximum system reliability. Oxy implements Happy Eyeballs for TCP and advanced tunnel timeout logic, and it can route traffic to internal services with accompanying metadata.
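The combination of caching and automatic fallback between resolvers can be sketched as follows. This is a simplified illustration, not Oxy's resolver: `Resolver` and `CachingResolver` are hypothetical names, a fixed TTL replaces per-record TTLs, and the resolvers are synchronous for brevity where real ones would be asynchronous.

```typescript
// A resolver returns addresses for a hostname, or throws when it cannot answer.
type Resolver = (hostname: string) => string[];

interface DnsCacheEntry {
  addresses: string[];
  expiresAt: number; // milliseconds, measured on the same clock as `now`
}

class CachingResolver {
  private cache = new Map<string, DnsCacheEntry>();
  private resolvers: Resolver[];
  private ttlMs: number;
  private now: () => number;

  // `resolvers` are tried in order; `now` is injectable to simplify testing.
  constructor(resolvers: Resolver[], ttlMs: number, now: () => number = Date.now) {
    this.resolvers = resolvers;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  resolve(hostname: string): string[] {
    const key = hostname.toLowerCase();
    const cached = this.cache.get(key);
    if (cached && cached.expiresAt > this.now()) return cached.addresses;

    for (const resolver of this.resolvers) {
      try {
        const addresses = resolver(key);
        if (addresses.length > 0) {
          this.cache.set(key, { addresses, expiresAt: this.now() + this.ttlMs });
          return addresses;
        }
      } catch {
        // This resolver failed; fall through to the next one.
      }
    }
    throw new Error(`no resolver could answer for ${hostname}`);
  }
}
```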

Additionally, through collaboration with one of our internal services (which is an Oxy application itself!) Oxy is able to offer geographical egress — allowing applications to route traffic to the public Internet from various locations in our extensive network covering numerous cities worldwide. This complex and powerful feature can be easily utilized by Oxy application developers at no extra cost, simply by adjusting configuration settings.

Tunneling and request handling

We’ve discussed Oxy’s communication capabilities with the outside world through on-ramps and off-ramps. In the middle, Oxy handles efficient stateful tunneling of various traffic types including TCP, UDP, QUIC, and IP, while giving applications full control over traffic blocking and redirection.

Additionally, Oxy effectively handles HTTP traffic, providing full control over requests and responses, and allowing it to serve as a direct HTTP or API service. With built-in tools for streaming analysis of HTTP bodies, Oxy makes it easy to extract and process data, such as form data from uploads and downloads.

In addition to its multi-layer traffic processing capabilities, Oxy also supports advanced HTTP tunneling methods, such as CONNECT-UDP and CONNECT-IP, using the latest extensions to the HTTP/3 and HTTP/2 protocols. It can even process HTTP CONNECT request payloads on layer 4 and recursively process the payload as HTTP if the encapsulated traffic is HTTP.

TLS

The modern Internet is unimaginable without traffic encryption, and Oxy, of course, provides this essential aspect. Oxy’s cryptography and TLS are based on BoringSSL, providing both a FIPS-compliant version with a limited set of certified features and the latest version that supports all the currently available TLS features. Oxy also allows applications to switch between the two versions in real-time, on a per-request or per-connection basis.

Oxy’s TLS client is designed to make HTTPS requests to upstream servers, with the functionality and security of a browser-grade client. This includes the reconstruction of certificate chains, certificate revocation checks, and more. In addition, Oxy applications can be secured with TLS v1.3, and optionally mTLS, allowing for the extraction of client authentication information from X.509 certificates.

Oxy has the ability to inspect and filter HTTPS traffic, including HTTP/3, and provides the means for dynamically generating certificates, serving as a foundation for implementing data loss prevention (DLP) products. Additionally, Oxy’s internal fork of BoringSSL, which is not FIPS-compliant, supports the use of raw public keys as an alternative to WebPKI, making it ideal for internal service communication. This allows for all the benefits of TLS without the hassle of managing root certificates.

Gluing everything together

Oxy is more than just a set of building blocks for network applications. It acts as a cohesive glue, handling the bootstrapping of the entire proxy application with ease, including parsing and applying configurations, setting up an asynchronous runtime, applying seccomp hardening and providing automated graceful restarts functionality.

With built-in support for panic reporting to Sentry, Prometheus metrics with a Rust-macro based API, Kibana logging, distributed tracing, memory and runtime profiling, Oxy offers comprehensive monitoring and analysis capabilities. It can also generate detailed audit logs for layer 4 traffic, useful for billing and network analysis.

To top it off, Oxy includes an integration testing framework, allowing for easy testing of application interactions using TypeScript-based tests.

Extensibility model

To take full advantage of Oxy’s capabilities, one must understand how to extend and configure its features. Oxy applications are configured using YAML configuration files, offering numerous options for each feature. Additionally, application developers can extend these options by leveraging the convenient macros provided by the framework, making customization a breeze.

Suppose the Oxy application uses a key-value database to retrieve user information. In that case, it would be beneficial to expose a YAML configuration settings section for this purpose. With Oxy, defining a structure and annotating it with the #[oxy_app_settings] attribute is all it takes to accomplish this:

/// Application's key-value (KV) database settings
#[oxy_app_settings]
pub struct MyAppKVSettings {
    /// Key prefix.
    pub prefix: Option<String>,
    /// Path to the UNIX domain socket for the appropriate KV
    /// server instance.
    pub socket: Option<String>,
}

Oxy can then generate a default YAML configuration file listing available options and their default values, including those extended by the application. The configuration options are automatically documented in the generated file from the Rust doc comments, following best Rust practices.

Moreover, Oxy supports multi-tenancy, allowing a single application instance to expose multiple on-ramp endpoints, each with a unique configuration. Sometimes, though, even a YAML configuration file is not enough to build the desired application. This is where Oxy’s comprehensive set of hooks comes in handy. These hooks can be used to extend the application with Rust code and cover almost all aspects of the traffic processing.

To give you an idea of how easy it is to write an Oxy application, here is an example of basic Oxy code:

struct MyApp;

// Defines types for various application extensions to Oxy's
// data types. Contexts provide information and control knobs for
// the different parts of the traffic flow and applications can extend
// all of them with their custom data. As was mentioned before,
// applications could also define their custom configuration.
// It's just a matter of defining a configuration object with
// `#[oxy_app_settings]` attribute and providing the object type here.
impl OxyExt for MyApp {
    type AppSettings = MyAppKVSettings;
    type EndpointAppSettings = ();
    type EndpointContext = ();
    type IngressConnectionContext = MyAppIngressConnectionContext;
    type RequestContext = ();
    type IpTunnelContext = ();
    type DnsCacheItem = ();
}

#[async_trait]
impl OxyApp for MyApp {
    fn name() -> &'static str {
        "My app"
    }

    fn version() -> &'static str {
        env!("CARGO_PKG_VERSION")
    }

    fn description() -> &'static str {
        "This is an example of Oxy application"
    }

    async fn start(
        settings: ServerSettings<MyAppKVSettings, ()>
    ) -> anyhow::Result<Hooks<Self>> {
        // Here the application initializes various hooks, with each
        // hook being a trait implementation containing multiple
        // optional callbacks invoked during the lifecycle of the
        // traffic processing.
        let ingress_hook = create_ingress_hook(&settings);
        let egress_hook = create_egress_hook(&settings);
        let tunnel_hook = create_tunnel_hook(&settings);
        let http_request_hook = create_http_request_hook(&settings);
        let ip_flow_hook = create_ip_flow_hook(&settings);

        Ok(Hooks {
            ingress: Some(ingress_hook),
            egress: Some(egress_hook),
            tunnel: Some(tunnel_hook),
            http_request: Some(http_request_hook),
            ip_flow: Some(ip_flow_hook),
            ..Default::default()
        })
    }
}

// The entry point of the application
fn main() -> OxyResult<()> {
    oxy::bootstrap::<MyApp>()
}

Technology choice

Oxy leverages the safety and performance benefits of Rust as its implementation language. At Cloudflare, Rust has emerged as a popular choice for new product development, and there are ongoing efforts to migrate some of the existing products to the language as well.

Rust offers memory and concurrency safety through its ownership and borrowing system, preventing issues like null pointers and data races. This safety is achieved without sacrificing performance, as Rust provides low-level control and the ability to write code with minimal runtime overhead. Rust’s balance of safety and performance has made it popular for building safe performance-critical applications, like proxies.

We intentionally tried to stand on the shoulders of giants with this project and avoid reinventing the wheel. Oxy heavily relies on open-source dependencies, with hyper and tokio being the backbone of the framework. Our philosophy is that we should pull from existing solutions as much as we can, which allows for faster iteration and lets us rely on widely battle-tested code. If something doesn’t work for us, we try to collaborate with maintainers and contribute back our fixes and improvements. In fact, we now have two team members who are core team members of the tokio and hyper projects.

Even though Oxy is a proprietary project, we try to give back some love to the open-source community, without which the project wouldn’t be possible, by open-sourcing some of the building blocks such as https://github.com/cloudflare/boring and https://github.com/cloudflare/quiche.

The road to implementation

At the beginning of our journey, we set out to implement a proof-of-concept for an HTTP firewall using Rust for what would eventually become the Zero Trust Gateway product. This project was originally part of the WARP service repository. However, as the PoC rapidly advanced, it became clear that it needed to be separated into its own Gateway proxy for both technical and operational reasons.

Later on, when tasked with implementing a relay proxy for iCloud Private Relay, we saw the opportunity to reuse much of the code from the Gateway proxy. The Gateway project could also benefit from the HTTP/3 support that was being added for the Private Relay project. In fact, early iterations of the relay service were forks of the Gateway server.

It was then that we realized we could extract common elements from both projects to create a new framework, Oxy. The history of Oxy can be traced back to its origins in the commit history of the Gateway and Private Relay projects, up until its separation as a standalone framework.

Since Oxy’s inception, we have leveraged its power to efficiently roll out multiple projects that would have required a significant amount of time and effort without it. Our iterative development approach has been a strength of the project, as we have been able to identify common, reusable components through hands-on testing and implementation.

Our small core team is supplemented by internal contributors from across the company, ensuring that the best subject-matter experts are working on the relevant parts of the project. This contribution model also allows us to shape the framework’s API to meet the functional and ergonomic needs of its users, while the core team ensures that the project stays on track.

Relation to Pingora

Although Pingora, another proxy server developed by us in Rust, shares some similarities with Oxy, it was intentionally designed as a separate proxy server with a different objective. Pingora was created to serve traffic from millions of our clients’ upstream servers, including those with ancient and unusual configurations. Non-UTF-8 URLs and TLS settings that are not supported by most TLS libraries are just a few such quirks among many others. This focus on handling technically challenging unusual configurations sets Pingora apart from other proxy servers.

The concept of Pingora came about during the same period when we were beginning to develop Oxy, and we initially considered merging the two projects. However, we quickly realized that their objectives were too different to do that. Pingora is specifically designed to establish Cloudflare’s HTTP connectivity with the Internet, even in its most technically obscure corners. On the other hand, Oxy is a multipurpose platform that supports a wide variety of communication protocols and aims to provide a simple way to develop high-performance proxy applications with business logic.

Conclusion

Oxy is a proxy framework that we have developed to meet the demanding needs of modern services. It has been designed to provide a flexible and scalable solution that can be adapted to meet the unique requirements of each project. By leveraging the power of Rust, we made it both safe and fast.

Looking forward, Oxy is poised to play a critical role in our company’s larger effort to modernize and improve our architecture. It provides a solid building block in the foundation on which we can keep building a better Internet.

As the framework continues to evolve and grow, we remain committed to our iterative approach to development, constantly seeking out new opportunities to reuse existing solutions and improve our codebase. This collaborative, community-driven approach has already yielded impressive results, and we are confident that it will continue to drive the future success of Oxy.

Stay tuned for more tech-savvy blog posts on the subject!

Incremental adoption of micro-frontends with Cloudflare Workers

2022-11-17 Peter Bacon Darwin

Post Syndicated from Peter Bacon Darwin original https://blog.cloudflare.com/fragment-piercing/

Bring micro-frontend benefits to legacy Web applications

Recently, we wrote about a new fragment architecture for building Web applications that is fast, cost-effective, and scales to the largest projects, while enabling a fast iteration cycle. The approach uses multiple collaborating Cloudflare Workers to render and stream micro-frontends into an application that is interactive faster than traditional client-side approaches, leading to better user experience and SEO scores.

This approach is great if you are starting a new project or have the capacity to rewrite your current application from scratch. But in reality most projects are too large to be rebuilt from scratch and can adopt architectural changes only in an incremental way.

In this post we propose a way to replace only selected parts of a legacy client-side rendered application with server-side rendered fragments. The result is an application where the most important views are interactive sooner, can be developed independently, and receive all the benefits of the micro-frontend approach, while avoiding large rewrites of the legacy codebase. This approach is framework-agnostic; in this post we demonstrate fragments built with React, Qwik, and SolidJS.

The pain of large frontend applications

Many large frontend applications developed today fail to deliver good user experience. This is often caused by architectures that require large amounts of JavaScript to be downloaded, parsed and executed before users can interact with the application. Despite efforts to defer non-critical JavaScript code via lazy loading, and the use of server-side rendering, these large applications still take too long to become interactive and respond to the user’s inputs.

Furthermore, large monolithic applications can be complex to build and deploy. Multiple teams may be collaborating on a single codebase and the effort to coordinate testing and deployment of the project makes it hard to develop, deploy and iterate on individual features.

As outlined in our previous post, micro-frontends powered by Cloudflare Workers can solve these problems but converting an application monolith to a micro-frontend architecture can be difficult and expensive. It can take months, or even years, of engineering time before any benefits are perceived by users or developers.

What we need is an approach where a project can incrementally adopt micro-frontends into the most impactful parts of the application, without needing to rewrite the whole application in one go.

Fragments to the rescue

The goal of a fragment based architecture is to significantly decrease loading and interaction latency for large web applications (as measured via Core Web Vitals) by breaking the application into micro-frontends that can be quickly rendered (and cached) in Cloudflare Workers. The challenge is how to integrate a micro-frontend fragment into a legacy client-side rendered application with minimal cost to the original project.

The technique we propose allows us to convert the most valuable parts of a legacy application’s UI, in isolation from the rest of the application.

It turns out that, in many applications, the most valuable parts of the UI are often nested within an application “shell” that provides header, footer, and navigational elements. Examples of these include a login form, a product details panel in an e-commerce application, or the inbox in an email client.

Let’s take a login form as an example. If it takes our application several seconds to display the login form, the users will dread logging in, and we might lose them. We can however convert the login form into a server-side rendered fragment, which is displayed and interactive immediately, while the rest of the legacy application boots up in the background. Since the fragment is interactive early, the user can even submit their credentials before the legacy application has started and rendered the rest of the page.

Animation showing the login form being available before the main application

This approach enables engineering teams to deliver valuable improvements to users in just a fraction of the time and engineering cost compared to traditional approaches, which either sacrifice user experience improvements, or require a lengthy and high-risk rewrite of the entire application. It allows teams with monolithic single-page applications to adopt a micro-frontend architecture incrementally, target the improvements to the most valuable parts of the application, and therefore front-load the return on investment.

An interesting challenge in extracting parts of the UI into server-side rendered fragments is that, once displayed in the browser, we want the legacy application and the fragments to feel like a single application. The fragments should be neatly embedded within the legacy application shell, keeping the application accessible by correctly forming the DOM hierarchy, but we also want the server-side rendered fragments to be displayed and become interactive as quickly as possible — even before the legacy client-side rendered application shell comes into existence. How can we embed UI fragments into an application shell that doesn’t exist yet? We resolved this problem via a technique we devised, which we call “fragment piercing”.

Fragment piercing

Fragment piercing combines HTML/DOM produced by server-side rendered micro-frontend fragments with HTML/DOM produced by a legacy client-side rendered application.

The micro-frontend fragments are rendered directly into the top level of the HTML response, and are designed to become immediately interactive. In the background, the legacy application is client-side rendered as a sibling of these fragments. When it is ready, the fragments are “pierced” into the legacy application – the DOM of each fragment is moved to its appropriate place within the DOM of the legacy application – without causing any visual side effects, or loss of client-side state, such as focus, form data, or text selection. Once “pierced”, a fragment can begin to communicate with the legacy application, effectively becoming an integrated part of it.

Here, you can see a “login” fragment and the empty legacy application “root” element at the top level of the DOM, before piercing.

<body>
  <div id="root"></div>
  <piercing-fragment-host fragment-id="login">
    <login q:container...>...</login>
  </piercing-fragment-host>
</body>

And here you can see that the fragment has been pierced into the “login-page” div in the rendered legacy application.

<body>
  <div id="root">
    <header>...</header>
    <main>
      <div class="login-page">
        <piercing-fragment-outlet fragment-id="login">
          <piercing-fragment-host fragment-id="login">
            <login q:container...>...</login>
          </piercing-fragment-host>
        </piercing-fragment-outlet>
      </div>
    </main>
    <footer>...</footer>
  </div>
</body>

To keep the fragment from moving and causing a visible layout shift during this transition, we apply CSS styles that position the fragment in the same way before and after piercing.

At any time an application can be displaying any number of pierced fragments, or none at all. This technique is not limited only to the initial load of the legacy application. Fragments can also be added to and removed from an application, at any time. This allows fragments to be rendered in response to user interactions and client-side routing.

With fragment piercing, you can start to incrementally adopt micro-frontends, one fragment at a time. You decide on the granularity of fragments, and which parts of the application to turn into fragments. The fragments don’t all have to use the same Web framework, which can be useful when switching stacks, or during a post-acquisition integration of multiple applications.

The “Productivity Suite” demo

As a demonstration of fragment piercing and incremental adoption we have developed a “productivity suite” demo application that allows users to manage to-do lists, read Hacker News, etc. We implemented the shell of this application as a client-side rendered React application, a common tech choice in corporate applications. This is our “legacy application”. There are three routes in the application that have been updated to use micro-frontend fragments:

/login – a simple dummy login form with client-side validation, displayed when users are not authenticated (implemented in Qwik).
/todos – manages one or more todo lists, implemented as two collaborating fragments:
- Todo list selector – a component for selecting/creating/deleting Todo lists (implemented in Qwik).
- Todo list editor – a clone of the TodoMVC app (implemented in React).
/news – a clone of the HackerNews demo (implemented in SolidJS).

This demo showcases that different independent technologies can be used for both the legacy application and for each of the fragments.

The application is deployed at https://productivity-suite.web-experiments.workers.dev/.

To try it out, you first need to log in – simply use any username you like (no password needed). The user’s data is saved in a cookie, so you can log out and back in using the same username. After you’ve logged in, navigate through the various pages using the navigation bar at the top of the application. In particular, take a look at the “Todo Lists” and “News” pages to see the piercing in action.

At any point, try reloading the page to see that fragments are rendered instantly while the legacy application loads slowly in the background. Try interacting with the fragments even before the legacy application has appeared!

At the very top of the page there are controls to let you see the impact of fragment piercing in action.

Use the “Legacy app bootstrap delay” slider to set the simulated delay before the legacy application starts.
Toggle “Piercing Enabled” to see what the user experience would be if the app did not use fragments.
Toggle “Show Seams” to see where each fragment is on the current page.

How it works

The application is composed of a number of building blocks.

The Legacy application host in our demo serves the files that define the client-side React application (HTML, JavaScript and stylesheets). Applications built with other tech stacks would work just as well. The Fragment Workers host the micro-frontend fragments, as described in our previous fragment architecture post. And the Gateway Worker handles requests from the browser, selecting, fetching and combining response streams from the legacy application and micro-frontend fragments.

Once these pieces are all deployed, they work together to handle each request from the browser. Let’s look at what happens when you go to the `/login` route.

The user navigates to the application and the browser makes a request to the Gateway Worker to get the initial HTML. The Gateway Worker identifies that the browser is requesting the login page. It then makes two parallel sub-requests – one to fetch the index.html of the legacy application, and another to request the server-side rendered login fragment. It then combines these two responses into a single response stream containing the HTML that is delivered to the browser.

The browser displays the HTML response containing the empty root element for the legacy application, and the server-side rendered login fragment, which is immediately interactive for the user.

The browser then requests the legacy application’s JavaScript. This request is proxied by the Gateway Worker to the Legacy application host. Similarly, any other assets for the legacy application or fragments get routed through the Gateway Worker to the legacy application host or appropriate Fragment Worker.

Once the legacy application’s JavaScript has been downloaded and executed, rendering the shell of the application in the process, the fragment piercing kicks in, moving the fragment into the appropriate place in the legacy application, while preserving all of its UI state.

Although we focused on the login fragment to explain fragment piercing, the same ideas apply to the other fragments implemented in the /todos and /news routes.
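The combining step performed by the Gateway Worker can be sketched in a few lines. This is a simplified illustration rather than the piercing library's implementation: the real gateway merges response streams, while this sketch works on complete strings, and `RenderedFragment`, `wrapInHost`, and `combineHtml` are hypothetical names.

```typescript
interface RenderedFragment {
  fragmentId: string; // assumed to be a trusted identifier, so it is not escaped
  html: string; // server-side rendered markup returned by a Fragment Worker
}

// Wraps a fragment in the host element so that the browser can find it
// and move it into place later.
function wrapInHost(fragment: RenderedFragment): string {
  return (
    `<piercing-fragment-host fragment-id="${fragment.fragmentId}">` +
    fragment.html +
    `</piercing-fragment-host>`
  );
}

// Inserts the wrapped fragments just before </body> of the legacy index.html,
// making them siblings of the (still empty) legacy root element.
function combineHtml(legacyIndexHtml: string, fragments: RenderedFragment[]): string {
  const hosts = fragments.map(wrapInHost).join("");
  const bodyEnd = legacyIndexHtml.lastIndexOf("</body>");
  if (bodyEnd === -1) return legacyIndexHtml + hosts;
  return legacyIndexHtml.slice(0, bodyEnd) + hosts + legacyIndexHtml.slice(bodyEnd);
}
```

Given the legacy `index.html` and a rendered login fragment, the output has the same shape as the "before piercing" markup shown earlier: an empty root element followed by a fragment host.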

The piercing library

Despite being implemented using different Web frameworks, all the fragments are integrated into the legacy application in the same way using helpers from a “Piercing Library”. This library is a collection of server-side and client-side utilities that we developed, for the demo, to handle integrating the legacy application with micro-frontend fragments. The main features of the library are the PiercingGateway class, fragment host and fragment outlet custom elements, and the MessageBus class.

PiercingGateway

The PiercingGateway class can be used to instantiate a Gateway Worker that handles all requests for our application’s HTML, JavaScript and other assets. The `PiercingGateway` routes requests through to the appropriate Fragment Workers or to the host of the Legacy Application. It also combines the HTML response streams from these fragments with the response from the legacy application into a single HTML stream that is returned to the browser.

Implementing a Gateway Worker is straightforward using the Piercing Library. Create a new gateway instance of PiercingGateway, passing it the URL to the legacy application host and a function to determine whether piercing is enabled for the given request. Export the gateway as the default export from the Worker script so that the Workers runtime can wire up its fetch() handler.

const gateway = new PiercingGateway<Env>({
  // Configure the origin URL for the legacy application.
  getLegacyAppBaseUrl: (env) => env.APP_BASE_URL,
  shouldPiercingBeEnabled: (request) => ...,
});
...

export default gateway;

Fragments can be registered by calling the registerFragment() method so that the gateway can automatically route requests for a fragment’s HTML and assets to its Fragment Worker. For example, registering the login fragment would look like:

gateway.registerFragment({
  fragmentId: "login",
  prePiercingStyles: "...",
  shouldBeIncluded: async (request) => !(await isUserAuthenticated(request)),
});

Fragment host and outlet

Routing requests and combining HTML responses in the Gateway Worker is only half of what makes piercing possible. The other half needs to happen in the browser where the fragments need to be pierced into the legacy application using the technique we described earlier.

The fragment piercing in the browser is facilitated by a pair of custom elements, the fragment host (<piercing-fragment-host>) and the fragment outlet (<piercing-fragment-outlet>).

The Gateway Worker wraps the HTML for each fragment in a fragment host. In the browser, the fragment host manages the life-time of the fragment and is used when moving the fragment’s DOM into position in the legacy application.

<piercing-fragment-host fragment-id="login">
  <login q:container...>...</login>
</piercing-fragment-host>

In the legacy application, the developer marks where a fragment should appear when it is pierced by adding a fragment outlet. Our demo application’s Login route looks as follows:

export function Login() {
  …
  return (
    <div className="login-page" ref={ref}>
      <piercing-fragment-outlet fragment-id="login" />
    </div>
  );
}

When a fragment outlet is added to the DOM, it searches the current document for its associated fragment host. If found, the fragment host and its contents are moved inside the outlet. If the fragment host is not found, the outlet will make a request to the gateway worker to fetch the fragment HTML, which is then streamed directly into the fragment outlet, using the writable-dom library (a small but powerful library developed by the MarkoJS team).

This fallback mechanism enables client-side navigation to routes that contain new fragments. This way fragments can be rendered in the browser via both initial (hard) navigation and client-side (soft) navigation.

Message bus

Unless the fragments in our application are completely presentational or self-contained, they also need to communicate with the legacy application and other fragments. The MessageBus is a simple asynchronous, isomorphic, and framework-agnostic communication bus that the legacy application and each of the fragments can access.

In our demo application the login fragment needs to inform the legacy application when the user has authenticated. This message dispatch is implemented in the Qwik LoginForm component as follows:

const dispatchLoginEvent = $(() => {
  getBus(ref.value).dispatch("login", {
    username: state.username,
    password: state.password,
  });
  state.loading = true;
});

The legacy application can then listen for these messages like this:

useEffect(() => {
  return getBus().listen<LoginMessage>("login", async (user) => {
    setUser(user);
    await addUserDataIfMissing(user.username);
    await saveCurrentUser(user.username);
    getBus().dispatch("authentication", user);
    navigate("/", { replace: true, });
  });
}, []);

We settled on this message bus implementation because we needed a solution that was framework-agnostic and worked well on both the server and the client.
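The listen/dispatch pattern used above can be sketched in a few lines. This is a simplified, synchronous illustration, not the demo’s actual MessageBus: the SimpleMessageBus class is hypothetical, and it only assumes, as the useEffect example suggests, that listen() returns a function that removes the listener.

```typescript
type Listener<T> = (payload: T) => void;

// Hypothetical publish/subscribe bus; it has no framework dependencies,
// so the same code can run on the server and in the browser.
class SimpleMessageBus {
  private listeners = new Map<string, Set<Listener<unknown>>>();

  // Registers a listener and returns a function that removes it again,
  // which fits the cleanup convention of React's useEffect.
  listen<T>(eventName: string, listener: Listener<T>): () => void {
    const registered =
      this.listeners.get(eventName) ?? new Set<Listener<unknown>>();
    registered.add(listener as Listener<unknown>);
    this.listeners.set(eventName, registered);
    return () => {
      registered.delete(listener as Listener<unknown>);
    };
  }

  // Delivers the payload to every listener currently registered for the event.
  dispatch<T>(eventName: string, payload: T): void {
    this.listeners.get(eventName)?.forEach((listener) => listener(payload));
  }
}
```

Because nothing here depends on React, Qwik, or the DOM, a bus of this shape can be shared by the legacy application and by fragments written in different frameworks.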

Give it a go!

With fragments, fragment piercing, and Cloudflare Workers, you can improve performance as well as the development cycle of legacy client-side rendered applications. These changes can be adopted incrementally, and you can even do so while implementing fragments with a Web framework of your choice.

The “Productivity Suite” application demonstrating these capabilities can be found at https://productivity-suite.web-experiments.workers.dev/.

All the code we have shown is open-source and published to GitHub: https://github.com/cloudflare/workers-web-experiments/tree/main/productivity-suite.

Feel free to clone the repository. It is easy to run locally and even deploy your own version (for free) to Cloudflare. We tried to make the code as reusable as possible. Most of the core logic is in the piercing library that you could try in your own projects. We’d be thrilled to receive feedback, suggestions, or hear about applications you’d like to use it for. Join our GitHub discussion or reach us on our Discord channel.

We believe that combining Cloudflare Workers with the latest ideas from frameworks will drive the next big steps forward in improved experiences for both users and developers in Web applications. Expect to see more demos, blog posts and collaborations as we continue to push the boundaries of what the Web can offer. And if you’d also like to be directly part of this journey, we are also happy to share that we are hiring!

Cloudflare Workers and micro-frontends: made for one another

2022-10-20 Peter Bacon Darwin

Post Syndicated from Peter Bacon Darwin original https://blog.cloudflare.com/better-micro-frontends/

To help developers build better web applications we researched and devised a fragments architecture to build micro-frontends using Cloudflare Workers that is lightning fast, cost-effective to develop and operate, and scales to the needs of the largest enterprise teams without compromising release velocity or user experience.

Here we share a technical overview and a proof of concept of this architecture.

Why micro-frontends?

One of the challenges of modern frontend web development is that applications are getting bigger and more complex. This is especially true for enterprise web applications supporting e-commerce, banking, insurance, travel, and other industries, where a unified user interface provides access to a large amount of functionality. In such projects it is common for many teams to collaborate to build a single web application. These monolithic web applications, usually built with JavaScript technologies like React, Angular, or Vue, span thousands, or even millions of lines of code.

When a monolithic JavaScript architecture is used with applications of this scale, the result is a slow and fragile user experience with low Lighthouse scores. Furthermore, collaborating development teams often struggle to maintain and evolve their parts of the application, as their fates are tied to those of all the other teams, so the mistakes and tech debt of one team often impact all.

Drawing on ideas from microservices, the frontend community has started to advocate for micro-frontends to enable teams to develop and deploy their features independently of other teams. Each micro-frontend is a self-contained mini-application, that can be developed and released independently, and is responsible for rendering a “fragment” of the page. The application then combines these fragments together so that from the user’s perspective it feels like a single application.

An application consisting of multiple micro-frontends

Fragments could represent vertical application features, like “account management” or “checkout”, or horizontal features, like “header” or “navigation bar”.

Client-side micro-frontends

A common approach to micro-frontends is to rely upon client-side code to lazy load and stitch fragments together (e.g. via Module Federation). Client-side micro-frontend applications suffer from a number of problems.

Common code must either be duplicated or published as a shared library. Shared libraries are problematic themselves. It is not possible to tree-shake unused library code at build time, so more code than necessary is downloaded to the browser. Coordinating between teams when shared libraries need to be updated can also be complex and awkward.

Also, the top-level container application must bootstrap before the micro-frontends can even be requested, and they also need to boot before they become interactive. If they are nested, then you may end up getting a waterfall of requests to get micro-frontends leading to further runtime delays.

These problems can result in a sluggish application startup experience for the user.

Server-side rendering could be used with client-side micro-frontends to improve how quickly a browser displays the application but implementing this can significantly increase the complexity of development, deployment and operation. Furthermore, most server-side rendering approaches still suffer from a hydration delay before the user can fully interact with the application.

Addressing these challenges was the main motivation for exploring an alternative solution, which relies on the distributed, low latency properties provided by Cloudflare Workers.

Micro-frontends on Cloudflare Workers

Cloudflare Workers is a compute platform that offers a highly scalable, low latency JavaScript execution environment that is available in over 275 locations around the globe. In our exploration we used Cloudflare Workers to host and render micro-frontends from anywhere on our global network.

Fragments architecture

In this architecture the application consists of a tree of “fragments” each deployed to Cloudflare Workers that collaborate to server-side render the overall response. The browser makes a request to a “root fragment”, which will communicate with “child fragments” to generate the final response. Since Cloudflare Workers can communicate with each other with almost no overhead, applications can be server-side rendered quickly by child fragments, all working in parallel to render their own HTML, streaming their results to the parent fragment, which combines them into the final response stream delivered to the browser.
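The parallel composition described here can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the demo’s code: FragmentRenderer and composeFragments are hypothetical names, each child fragment is modeled as an async function that returns its complete HTML (the real fragments stream their output), and write stands in for the parent’s response stream.

```typescript
// Stand-in for a request to a child fragment Worker.
type FragmentRenderer = () => Promise<string>;

async function composeFragments(
  children: FragmentRenderer[],
  write: (chunk: string) => void
): Promise<void> {
  // Start every child render up front so that they all proceed in parallel.
  const pending = children.map((render) => render());
  // Write results in document order. An early child is written as soon as it
  // is ready, without waiting for the children that come after it.
  for (const html of pending) {
    write(await html);
  }
}
```

The key design point is that the requests are started before any of them is awaited, so the total time is bounded by the slowest child rather than by the sum of all children.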

Visit the “Cloud Gallery”

We have built an example of a “Cloud Gallery” application to show how this can work in practice. It is deployed to Cloudflare Workers at https://cloud-gallery.web-experiments.workers.dev/

The demo application is a simple filtered gallery of cloud images built using our fragments architecture. Try selecting a tag in the type-ahead to filter the images listed in the gallery. Then change the delay on the stream of cloud images to see how the type-ahead filtering can be interactive before the page finishes loading.

Multiple Cloudflare Workers

The application is composed of a tree of six collaborating but independently deployable Cloudflare Workers, each rendering their own fragment of the screen and providing their own client-side logic, and assets such as CSS stylesheets and images.

The “main” fragment acts as the root of the application. The “header” fragment has a slider to configure an artificial delay to the display of gallery images. The “body” fragment contains the “filter” and “gallery” fragments. Finally, the “footer” fragment just shows some static content.

The full source code of the demo app is available on GitHub.

Benefits and features

This architecture of multiple collaborating server-side rendered fragments, deployed to Cloudflare Workers, has some interesting features.

Encapsulation

Fragments are entirely encapsulated, so they can control what they own and what they make available to other fragments.

Fragments can be developed and deployed independently

Updating one of the fragments is as simple as redeploying that fragment. The next request to the main application will use the new fragment. Also, fragments can host their own assets (client-side JavaScript, images, etc.), which are streamed through their parent fragment to the browser.

Server-only code is not sent to the browser

As well as reducing the cost of downloading unnecessary code to the browser, security sensitive code that is only needed for server-side rendering the fragment is never exposed to other fragments and is not downloaded to the browser. Also, features can be safely hidden behind feature flags in a fragment, allowing more flexibility with rolling out new behavior safely.

Composability

Fragments are fully composable – any fragment can contain other fragments. The resulting tree structure gives you more flexibility in how you architect and deploy your application. This helps larger projects to scale their development and deployment. Also, fine-grained control over how fragments are composed could allow fragments that are expensive to server-side render to be cached individually.

Fantastic Lighthouse scores

Streaming server-rendered HTML results in great user experiences and Lighthouse scores, which in practice means happier users and higher chance of conversions for your business.

Each fragment can parallelize requests to its child fragments and pipe the resulting HTML streams into its own single streamed server-side rendered response. Not only can this reduce the time to render the whole page but streaming each fragment through to the browser reduces the time to the first byte of each fragment.

Eager interactivity

One of the powers of a fragments architecture is that fragments can become interactive even while the rest of the application (including other fragments) is still being streamed down to the browser.

In our demo, the “filter” fragment is immediately interactive as soon as it is rendered, even if the image HTML for the “gallery” fragment is still loading.

To make it easier to see this, we added a slider to the top of the “header” that can simulate a network or database delay that slows down the HTML stream which renders the “gallery” images. Even when the “gallery” fragment is still loading, the type-ahead input, in the “filter” fragment, is already fully interactive.

Just think of all the frustration that this eager interactivity could avoid for web application users with an unreliable Internet connection.

Under the hood

As discussed already, this architecture relies upon deploying the application as many cooperating Cloudflare Workers. Let’s look into some details of how this works in practice.

We experimented with various technologies, and while this approach can be used with many frontend libraries and frameworks, we found the Qwik framework to be a particularly good fit, because of its HTML-first focus and low JavaScript overhead, which avoids any hydration problems.

Implementing a fragment

Each fragment is a server-side rendered Qwik application deployed to its own Cloudflare Worker. This means that you can even browse to these fragments directly. For example, the “header” fragment is deployed to https://cloud-gallery-header.web-experiments.workers.dev/.

The header fragment is defined as a Header component using Qwik. This component is rendered in a Cloudflare Worker via a fetch() handler:

export default {
  fetch(request: Request, env: Record<string, unknown>): Promise<Response> {
    return renderResponse(request, env, <Header />, manifest, "header");
  },
};

cloud-gallery/header/src/entry.ssr.tsx

The renderResponse() function is a helper we wrote that server-side renders the fragment and streams it into the body of a Response that we return from the fetch() handler.

The header fragment serves its own JavaScript and image assets from its Cloudflare Worker. We configure Wrangler to upload these assets to Cloudflare and serve them from our network.

Implementing fragment composition

Fragments that contain child fragments have additional responsibilities:

Request and inject child fragments when rendering their own HTML.
Proxy requests for child fragment assets through to the appropriate fragment.

Injecting child fragments

The position of a child fragment inside its parent can be specified by a FragmentPlaceholder helper component that we have developed. For example, the “body” fragment has the “filter” and “gallery” fragments.

<div class="content">
  <FragmentPlaceholder name="filter" />
  <FragmentPlaceholder name="gallery" />
</div>

cloud-gallery/body/src/root.tsx

The FragmentPlaceholder component is responsible for making a request for the fragment and piping the fragment stream into the output stream.

Proxying asset requests

As mentioned earlier, fragments can host their own assets, especially client-side JavaScript files. When a request for an asset arrives at the parent fragment, it needs to know which child fragment should receive the request.

In our demo we use a convention that such asset paths will be prefixed with /_fragment/<fragment-name>. For example, the header logo image path is /_fragment/header/cf-logo.png. We developed a tryGetFragmentAsset() helper which can be added to the parent fragment’s fetch() handler to deal with this:

export default {
  async fetch(
    request: Request,
    env: Record<string, unknown>
  ): Promise<Response> {
    // Proxy requests for assets hosted by a fragment.
    const asset = await tryGetFragmentAsset(env, request);
    if (asset !== null) {
      return asset;
    }
    // Otherwise server-side render the template injecting child fragments.
    return renderResponse(request, env, <Root />, manifest, "div");
  },
};

cloud-gallery/body/src/entry.ssr.tsx
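The path convention itself is simple to illustrate. The sketch below shows one way a parent could work out which child fragment an asset request belongs to. It is not the code of tryGetFragmentAsset(): parseFragmentAssetPath and its return shape are hypothetical, and whether the real helper strips the prefix before forwarding the request is an assumption.

```typescript
interface FragmentAssetPath {
  fragmentName: string;
  assetPath: string;
}

const FRAGMENT_PREFIX = "/_fragment/";

// Returns the fragment name and the remaining asset path,
// or null when the request is not for a fragment asset.
function parseFragmentAssetPath(pathname: string): FragmentAssetPath | null {
  if (!pathname.startsWith(FRAGMENT_PREFIX)) {
    return null;
  }
  const rest = pathname.slice(FRAGMENT_PREFIX.length);
  const slash = rest.indexOf("/");
  if (slash <= 0) {
    // Either the fragment name is empty or no asset path follows it.
    return null;
  }
  return {
    fragmentName: rest.slice(0, slash),
    assetPath: rest.slice(slash),
  };
}
```

For the header logo path mentioned above, this yields the fragment name header and the asset path /cf-logo.png, while any path without the prefix falls through to normal server-side rendering.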

Fragment asset paths

If a fragment hosts its own assets, then we need to ensure that any HTML it renders uses the special _fragment/<fragment-name> path prefix mentioned above when referring to these assets. We have implemented a strategy for this in the helpers we developed.

The FragmentPlaceholder component adds a base searchParam to the fragment request to tell it what this prefix should be. The renderResponse() helper extracts this prefix and provides it to the Qwik server-side renderer. This ensures that any request for client-side JavaScript has the correct prefix. Fragments can apply a hook that we developed called useFragmentRoot(). This allows components to gather the prefix from a FragmentContext context.

For example, since the “header” fragment hosts the Cloudflare and GitHub logos as assets, it must call the useFragmentRoot() hook:

export const Header = component$(() => {
  useStylesScoped$(HeaderCSS);
  useFragmentRoot();

  return (...);
});

cloud-gallery/header/src/root.tsx

The FragmentContext value can then be accessed in components that need to apply the prefix. For example, the Image component:

export const Image = component$((props: Record<string, string | number>) => {
  const { base } = useContext(FragmentContext);
  return <img {...props} src={base + props.src} />;
});

cloud-gallery/helpers/src/image/image.tsx

Service-binding fragments

Cloudflare Workers provide a mechanism called service bindings for making requests between Cloudflare Workers efficiently, without going over the network. In the demo we use this mechanism to make the requests from parent fragments to their child fragments with almost no performance overhead, while still allowing the fragments to be independently deployed.

Comparison to current solutions

This fragments architecture has three properties that distinguish it from other current solutions.

Unlike monoliths, or client-side micro-frontends, fragments are developed and deployed as independent server-side rendered applications that are composed together on the server-side. This significantly improves rendering speed, and lowers interaction latency in the browser.

Unlike server-side rendered micro-frontends with Node.js or cloud functions, Cloudflare Workers is a globally distributed compute platform with a region-less deployment model. It has incredibly low latency, and a near-zero communication overhead between fragments.

Unlike solutions based on module federation, a fragment’s client-side JavaScript is very specific to the fragment it is supporting. This means that it is small enough that we don’t need to have shared library code, eliminating the version skew issues and coordination problems when updating shared libraries.

Future possibilities

This demo is just a proof of concept, so there are still areas to investigate. Here are some of the features we’d like to explore in the future.

Caching

Each micro-frontend fragment can be cached independently of the others based on how static its content is. When requesting the full page, the fragments only need to run server-side rendering for micro-frontends that have changed.

With per-fragment caching you can return the HTML response to the browser faster, and avoid incurring compute costs in re-rendering content unnecessarily.
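As a rough sketch of what per-fragment caching could look like, the snippet below keeps rendered HTML in memory for a fixed time. This is an idea we have not implemented in the demo, so everything here is hypothetical: FragmentCache, getOrRender(), and the time-to-live policy are illustrative names and choices, and a real deployment would more likely rely on a shared cache than on per-instance memory.

```typescript
interface CacheEntry {
  html: string;
  expiresAt: number;
}

// Hypothetical in-memory cache keyed by fragment name.
class FragmentCache {
  private entries = new Map<string, CacheEntry>();

  // The clock is injectable so that expiry can be tested deterministically.
  constructor(private readonly now: () => number = () => Date.now()) {}

  async getOrRender(
    fragmentName: string,
    ttlMs: number,
    render: () => Promise<string>
  ): Promise<string> {
    const cached = this.entries.get(fragmentName);
    if (cached !== undefined && cached.expiresAt > this.now()) {
      // Still fresh: skip server-side rendering entirely.
      return cached.html;
    }
    const html = await render();
    this.entries.set(fragmentName, { html, expiresAt: this.now() + ttlMs });
    return html;
  }
}
```

A mostly static fragment such as the footer could use a long time-to-live, while a personalized fragment would use a short one or bypass the cache altogether.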

Our demo application used micro-frontend fragments to compose a single page. We could however use this approach to implement page routing as well. When server-side rendering, the main fragment could insert the appropriate “page” fragment based on the visited URL. When navigating, client-side, within the app, the main fragment would remain the same while the displayed “page” fragment would change.

This approach combines the best of server-side and client-side routing with the power of fragments.
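One way to picture the server-side half of this is a small lookup from the visited URL to a “page” fragment. The sketch below is purely illustrative: PageRoute, selectPageFragment, and the longest-prefix matching rule are assumptions, not something the demo implements.

```typescript
interface PageRoute {
  pathPrefix: string;
  fragmentName: string;
}

// Picks the fragment whose path prefix matches the URL most specifically,
// falling back to a default fragment when nothing matches.
function selectPageFragment(
  pathname: string,
  routes: PageRoute[],
  fallback: string
): string {
  let best: PageRoute | undefined;
  for (const route of routes) {
    // Match the prefix exactly or at a "/" boundary, so that "/account"
    // does not accidentally match "/accounting".
    const matches =
      pathname === route.pathPrefix ||
      pathname.startsWith(route.pathPrefix + "/");
    if (
      matches &&
      (best === undefined || route.pathPrefix.length > best.pathPrefix.length)
    ) {
      best = route;
    }
  }
  return best === undefined ? fallback : best.fragmentName;
}
```

The main fragment would call a function like this while server-side rendering, and the same table could drive client-side navigation so that both paths agree on which fragment to show.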

Using other frontend frameworks

Although the Cloud Gallery application uses Qwik to implement all fragments, it is possible to use other frameworks as well. If really necessary, it’s even possible to mix and match frameworks.

To achieve good results, the framework of choice should be capable of server-side rendering, and should have a small client-side JavaScript footprint. HTML streaming capabilities, while not required, can significantly improve performance of large applications.

Incremental migration strategies

Adopting a new architecture, compute platform, and deployment model is a lot to take in all at once, and for existing large applications it is prohibitively risky and expensive. To make this fragment-based architecture available to legacy projects, an incremental adoption strategy is key.

Developers could test the waters by migrating just a single piece of the user-interface within their legacy application to a fragment, integrating with minimal changes to the legacy application. Over time, more of the application could then be moved over, one fragment at a time.

Convention over configuration

As you can see in the Cloud Gallery demo application, setting up a fragment-based micro-frontend requires quite a bit of configuration. A lot of this configuration is very mechanical and could be abstracted away via conventions and better tooling. Following the productivity-focused precedents found in Ruby on Rails and filesystem-based routing meta-frameworks, we could make a lot of this configuration disappear.

Try it yourself!

There is still so much to dig into! Web applications have come a long way in recent years and their growth is hard to overstate. Traditional implementations of micro-frontends have had only mixed success in helping developers scale development and deployment of large applications. Cloudflare Workers, however, unlock new possibilities which can help us tackle many of the existing challenges and help us build better web applications.

Thanks to the generous free plan offered by Cloudflare Workers, you can check out the Gallery Demo code and deploy it yourself.

If all of this sounds interesting to you, and you would like to work with us on improving the developer experience for Cloudflare Workers, we are also happy to share that we are hiring!

Let’s Architect! Architecting for the edge

2022-08-24 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecting-for-the-edge/

Edge computing comprises elements of geography and networking and brings computing closer to the end users of the application.

For example, using a content delivery network (CDN) such as Amazon CloudFront can help video streaming providers reduce latency for distributing their material by taking advantage of caching at the edge. Another example might look like an Internet of Things (IoT) solution that helps a company run business logic in remote areas or with low latency.

IoT is a challenging field because there are multiple aspects to consider as architects, like hardware, protocols, networking, and software. All of these aspects must be designed to interact together and be fault tolerant.

In this edition of Let’s Architect!, we share resources that are helpful for teams that are approaching or expanding their workloads for edge computing. We cover macro topics such as security, best practices for IoT, patterns for machine learning (ML), and scenarios with strict latency requirements.

Build Machine Learning at the edge applications

In Let’s Architect! Architecting for Machine Learning, we touched on some of the most relevant aspects to consider while putting ML into production. However, in many scenarios, you may also have specific constraints like latency or a lack of connectivity that require you to design a deployment at the edge.

This blog post considers a solution based on ML applied to agriculture, where a reliable connection to the Internet is not always available. You can learn from this scenario, which includes information from model training to deployment, to design your ML workflows for the edge. The solution uses Amazon SageMaker in the cloud to explore, train, package, and deploy the model to AWS IoT Greengrass, which is used for inference at the edge.

High-level architecture of the components that reside on the farm and how they interact with the cloud environment

Security at the edge

Security is one of the fundamental pillars described in the AWS Well-Architected Framework. In all organizations, security is a major concern both for the business and the technical stakeholders. It impacts the products they are building and the perception that customers have.

We covered security in Let’s Architect! Architecting for Security, but we didn’t focus specifically on edge technologies. This whitepaper shows approaches for implementing a security strategy at the edge, with a focus on describing how AWS services can be used. You can learn how to secure workloads designed for content delivery, as well as how to implement network protection to defend against DDoS attacks and protect your IoT solutions.

The AWS Well-Architected Tool is designed to help you review the state of your applications and workloads. It provides a central place for architectural best practices and guidance.

AWS Outposts High Availability Design and Architecture Considerations

AWS Outposts allows companies to run some AWS services on-premises, which may be crucial to comply with strict data residency or low latency requirements. With Outposts, you can deploy servers and racks from AWS directly into your data center.

This whitepaper introduces architectural patterns, anti-patterns, and recommended practices for building highly available systems based on Outposts. You will learn how to manage your Outposts capacity and use networking and data center facility services to set up highly available solutions. Moreover, you can learn from mental models that AWS engineers adopted to consider the different failure modes and the corresponding mitigations, and apply the same models to your architectural challenges.

An Outpost deployed in a customer data center and connected back to its anchor Availability Zone and parent Region

AWS IoT Lens

The AWS Well-Architected Lenses are designed for specific industry or technology scenarios. When approaching the IoT domain, the AWS IoT Lens is a key resource to learn the best practices to adopt for IoT. This whitepaper breaks down the IoT workloads into the different subdomains (for example, communication, ingestion) and maps the AWS services for IoT with each specific challenge in the corresponding subdomain.

As architects and developers, we tend to automate and reduce the risk of human errors, so the IoT Lens Checklist is a great resource to review your workloads by following a structured approach.

Workload context checklist from the IoT Lens Checklist

See you next time!

Thanks for joining our discussion on architecting for the edge! See you in two weeks when we talk about database architectures on AWS.

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Packet captures at the edge

2022-03-17 Annika Garbers

Post Syndicated from Annika Garbers original https://blog.cloudflare.com/packet-captures-at-edge/

Packet captures are a critical tool used by network and security engineers every day. As more network functions migrate from legacy on-prem hardware to cloud-native services, teams risk losing the visibility they used to get by capturing 100% of traffic funneled through a single device in a datacenter rack. We know having easy access to packet captures across all your network traffic is important for troubleshooting problems and deeply understanding traffic patterns, so today, we’re excited to announce the general availability of on-demand packet captures from Cloudflare’s global network.

What are packet captures and how are they used?

A packet capture is a file that contains all packets that were seen by a particular network box, usually a firewall or router, during a specific time frame. Packet captures are a powerful and commonly used tool for debugging network issues or getting better visibility into attack traffic to tighten security (e.g. by adding firewall rules to block a specific attack pattern).

A network engineer might use a pcap file in combination with other tools, like mtr, to troubleshoot problems with reachability to their network. For example, if an end user reports intermittent connectivity to a specific application, an engineer can set up a packet capture filtered to the user’s source IP address to record all packets received from their device. They can then analyze that packet capture and compare it to other sources of information (e.g. pcaps from the end user’s side of the network path, traffic logs and analytics) to understand the magnitude and isolate the source of the problem.

Security engineers can also use packet captures to gain a better understanding of potentially malicious traffic. Let’s say an engineer notices an unexpected spike in traffic that they suspect could be an attempted attack. They can grab a packet capture to record the traffic as it’s hitting their network and analyze it to determine whether the packets are valid. If they’re not, for example, if the packet payload is randomly generated gibberish, the security engineer can create a firewall rule to block traffic that looks like this from entering their network.

Fragmenting traffic creates gaps in visibility

Traditionally, users capture packets by logging into their router or firewall and starting a process like tcpdump. They’d set up a filter to only match on certain packets and grab the file. But as networks have become more fragmented and users are moving security functions out to the edge, it’s become increasingly challenging to collect packet captures for relevant traffic. Instead of just one device that all traffic flows through (think of a drawbridge in the “castle and moat” analogy), engineers may have to capture packets across many different physical and virtual devices spread across locations. Many of these devices may not allow taking pcaps at all, and then users have to try to stitch the captures back together to create a full picture of their network traffic. This is a nearly impossible task today and only getting harder as networks become more fractured and complex.

On-demand packet captures from the Cloudflare global network

With Cloudflare, you can regain this visibility. With Magic Transit and Magic WAN, customers route all their public and private IP traffic through Cloudflare’s network to make it more secure, faster, and more reliable, but also to increase visibility. You can think of Cloudflare like a giant, globally distributed version of the drawbridge in our old analogy: because we act as a single cloud-based router and firewall across all your traffic, we can capture packets across your entire network and deliver them back to you in one place.

How does it work?

Customers can request a packet capture using our Packet Captures API. To get the packets you’re looking for you can provide a filter with the IP address, ports, and protocol of the packets you want.

curl -X POST https://api.cloudflare.com/client/v4/accounts/${account_id}/pcaps \
-H 'Content-Type: application/json' \
-H 'X-Auth-Email: [email protected]' \
-H 'X-Auth-Key: 00000000000' \
--data '{
        "filter_v1": {
               "source_address": "1.2.3.4",
               "protocol": 6
        },
        "time_limit": 300,
        "byte_limit": "10mb",
        "packet_limit": 10000,
        "type": "simple",
        "system": "magic-transit"
}'

Example of a request for packet capture using our API.
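For readers who would rather build the request body in code, here is a small typed sketch of the same payload. It covers only the fields that appear in the curl example above and is not a complete description of the API: PcapRequestBody, buildPcapRequestBody, and IP_PROTOCOL are hypothetical names, and the limits are simply the values from the example. The protocol value 6 in the filter is the IANA-assigned protocol number for TCP.

```typescript
// IANA-assigned IP protocol numbers, as used in the "protocol" filter field.
const IP_PROTOCOL = { icmp: 1, tcp: 6, udp: 17 } as const;

// Request body limited to the fields shown in the curl example.
interface PcapRequestBody {
  filter_v1: { source_address: string; protocol: number };
  time_limit: number;
  byte_limit: string;
  packet_limit: number;
  type: "simple";
  system: "magic-transit";
}

// Builds a body that captures traffic from one source address and protocol,
// reusing the limits from the example request.
function buildPcapRequestBody(
  sourceAddress: string,
  protocol: number
): PcapRequestBody {
  return {
    filter_v1: { source_address: sourceAddress, protocol },
    time_limit: 300,
    byte_limit: "10mb",
    packet_limit: 10000,
    type: "simple",
    system: "magic-transit",
  };
}
```

Serializing the result of buildPcapRequestBody("1.2.3.4", IP_PROTOCOL.tcp) with JSON.stringify() produces the same fields as the --data argument in the curl command.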

We leverage nftables to apply the filter to the customer’s incoming packets and log them using nflog:

table inet pcaps_1 {
    chain pcap_1 {
        ip protocol 6 ip saddr 1.2.3.4 log group 1 comment "packet capture"
    }
}

Example nftables configuration used to filter and log customer packets.

nflog creates a netfilter socket through which logs of a packet are sent from the Linux kernel to user space. In user space, we use tcpdump to read packets off the netfilter socket and generate a packet capture file:

tcpdump -i nflog:1 -w pcap_1.pcap

Example tcpdump command to create a packet capture file.

Usually tcpdump is used by listening to incoming packets on a network interface, but in our case we configure it to read packet logs from an nflog group. tcpdump will convert the packet logs into a packet capture file.

Once we have a packet capture file, we need to deliver it to customers. Because packet capture files can be large and contain sensitive information (e.g. packet payloads), we send them to customers directly from our machines to a cloud storage service of their choice. This means we never store sensitive data, and it’s easy for customers to manage and store these large files.

Get started today

On-demand packet captures are now generally available for customers who have purchased the Advanced features of Magic Firewall. The packet capture API allows customers to capture the first 160 bytes of packets, sampled at a default rate of 1/100. More functionality including full packet captures and on-demand packet capture control in the Cloudflare Dashboard is coming in the following weeks. Contact your account team to stay updated on the latest!
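The effect of the two limits mentioned above (the first 160 bytes of each packet, sampled at 1/100) can be pictured with a short Python sketch. This is not how Cloudflare implements sampling; random selection is an assumption made here purely for illustration, and the function name is hypothetical.

```python
import random

SNAP_LEN = 160     # bytes kept per packet, as stated in the article
SAMPLE_RATE = 100  # keep roughly 1 packet in 100, as stated in the article


def sample_and_truncate(packets, rng, sample_rate=SAMPLE_RATE, snap_len=SNAP_LEN):
    """Yield about 1 in `sample_rate` packets, cut to the first `snap_len` bytes."""
    for packet in packets:
        if rng.randrange(sample_rate) == 0:
            yield packet[:snap_len]


rng = random.Random(0)           # fixed seed so repeated runs behave the same
packets = [bytes(1500)] * 10_000  # 10,000 dummy 1500-byte packets
captured = list(sample_and_truncate(packets, rng))
print(len(captured), "packets kept, each at most", SNAP_LEN, "bytes")
```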

New – Securely manage your AWS IoT Greengrass edge devices using AWS Systems Manager

2021-11-29 Sean M. Tracey

Post Syndicated from Sean M. Tracey original https://aws.amazon.com/blogs/aws/new-securely-manage-your-aws-iot-greengrass-edge-devices-using-aws-systems-manager/

In 2020, we launched AWS IoT Greengrass 2.0, an open-source edge runtime and cloud service for building, deploying, and managing device software and applications. Today, we’re very excited to announce the ability to securely manage your AWS IoT Greengrass edge devices using AWS Systems Manager (SSM).

Managing vast fleets of varying systems and applications remotely can be a challenge for administrators of edge devices. AWS IoT Greengrass was built to enable these administrators to manage their edge device application stack. While this addressed the needs of many typical edge device administrators, system software on these devices still needed to be updated and maintained through operational policies consistent with those of their broader IT organizations. To this end, administrators would typically have to build or integrate tools to create a centralized interface for managing their edge and IT device software stacks – from security updates, to remote access, and operating system patches.

Until today, IT administrators have had to build or integrate custom tools to make sure edge devices can be managed alongside EC2 and on-prem instances, through a consistent set of policies. At scale, managing device and systems software across a wide variety of edge and IT systems becomes a significant investment in time and money. This is time that could be better spent deploying, optimizing, and managing the very edge devices that they’re maintaining.

What’s New?
Today, we have integrated IoT Greengrass and Systems Manager to simplify the management and maintenance of system software for edge devices. When coupled with the AWS IoT Greengrass Client Software, edge device administrators can now remotely access and securely manage the multitude of devices that they own – from OS patching to application deployments. Additionally, regularly scheduled operations that maintain edge compute systems can be automated, all without the need for creating additional custom processes. For IT administrators, this release gives a complete overview of all of their devices through a centralized interface, and a consistent set of tools and policies with AWS Systems Manager.

For customers new to the AWS IoT Greengrass platform, the integration with Systems Manager simplifies setup even further with a new onboarding wizard that can reduce the time it takes to create operational management systems for edge devices from weeks to hours.

How is this achieved?
This new capability is enabled by the AWS Systems Manager (SSM) Agent. As of today, customers can deploy the AWS Systems Manager Agent, via the AWS IoT Greengrass console, to their existing edge devices. Once installed on each device, AWS Systems Manager will list all of the devices in the Systems Manager Console, thereby giving administrators and IoT stakeholders an overview of their entire fleet. When coupled with the AWS IoT Greengrass console, administrators can manage their newly configured devices remotely; patching or updating operating systems, troubleshooting remotely, and deploying new applications, all through a centralized, integrated user interface. Devices can be patched individually, or in groups organized by tags or resource groups.
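As an illustration of patching devices in groups organized by tags, the following Python sketch builds the arguments for a Systems Manager SendCommand call that runs the AWS-RunPatchBaseline document against every managed node carrying a given tag. The tag key and value are hypothetical, and the helper names are invented for this sketch; whether a particular edge device accepts this document depends on its operating system and setup.

```python
def patch_command_for_tag(tag_key, tag_value, operation="Scan"):
    """Build SendCommand arguments that target managed nodes by tag."""
    if operation not in ("Scan", "Install"):
        raise ValueError("operation must be 'Scan' or 'Install'")
    return {
        "DocumentName": "AWS-RunPatchBaseline",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {"Operation": [operation]},
    }


def send_patch_command(tag_key, tag_value, operation="Scan"):
    """Send the command; needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder above runs without AWS access

    ssm = boto3.client("ssm")
    return ssm.send_command(**patch_command_for_tag(tag_key, tag_value, operation))


# "Site" and "warehouse-1" are hypothetical tag values.
print(patch_command_for_tag("Site", "warehouse-1"))
```

Starting with the Scan operation reports missing patches without installing anything, which is a cautious first step for a new fleet.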

Further information
These new features are now available in all regions where AWS Systems Manager and AWS IoT Greengrass are available. To get started, please visit the IoT Greengrass home page.

Computer Vision at the Edge with AWS Panorama

2021-10-20 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/computer-vision-at-the-edge-with-aws-panorama/

Today, the AWS Panorama Appliance is generally available to all of you. The AWS Panorama Appliance is a computer vision (CV) appliance designed to be deployed on your network to analyze images provided by your on-premises cameras.

Every week, I read about new and innovative use cases for computer vision. Some customers are using CV to verify pallet trucks are parked in designated areas to ensure worker safety in warehouses, some are analyzing customer walking flows in retail stores to optimize space and product placement, and some are using it to recognize cats and mice, just to name a few.

AWS customers agree the cloud is the most convenient place to train computer vision models thanks to its virtually infinite access to storage and compute resources. In the cloud, data scientists have access to powerful tools such as Amazon SageMaker and a wide variety of compute resources and frameworks.

However, when it’s time to analyze images from one or multiple video feeds, many of you are telling us the cloud is not the place where you want to run such workloads. There are a number of reasons for that: sometimes the facilities where the images are captured do not have enough bandwidth to send video feeds to the cloud, some use cases require very low latency, or some just want to keep their images on premises and not send them for analysis outside of their network.

At re:Invent 2020, we announced the AWS Panorama Appliance and SDK to address these requirements.

AWS Panorama is a machine learning appliance and software development kit (SDK) that allows you to bring computer vision to on-premises cameras to make predictions locally with high accuracy and low latency. With the AWS Panorama Appliance, you can automate tasks that have traditionally required human inspection to improve visibility into potential issues. For example, you can use AWS Panorama Appliance to evaluate manufacturing quality, identify bottlenecks in industrial processes, and monitor workplace security even in environments with limited or no internet connectivity. The software development kit allows camera manufacturers to bring equivalent capabilities directly inside their IP camera.

As usual on this blog, I would like to walk you through the development and deployment of a computer vision application for the AWS Panorama Appliance. The demo application from this blog uses a machine learning model to recognise objects in frames of video from a network camera. The application loads a model onto the AWS Panorama Appliance, gets images from a camera, and runs those images through the model. The application then overlays the results on top of the original video and outputs it to a connected display. The application uses libraries provided by AWS Panorama to interact with input and output video streams and the model; no low-level programming is required.

Let’s first define a few concepts. I borrowed the following definitions from the AWS Panorama documentation page.

Concepts
The AWS Panorama Appliance is the hardware that runs your applications. You use the AWS Panorama console or AWS SDKs to register an appliance, update its software, and deploy applications to it. The software that runs on the appliance discovers and connects to camera streams, sends frames of video to your application, and optionally displays video output on an attached display.

The appliance is an edge device. Instead of sending images to the AWS Cloud for processing, it runs applications locally on optimized hardware. This enables you to analyze video in real time and process the results with limited connectivity. The appliance only requires an internet connection to report its status, upload logs, and get software updates and deployments.

An application comprises multiple components called nodes, which represent cameras, models, code, or global variables. A node can be configuration only (inputs and outputs), or include artifacts (models and code). Application nodes are bundled in node packages that you upload to an S3 access point, where the AWS Panorama Appliance can access them. An application manifest is a configuration file that defines connections between the nodes.

A computer vision model is a machine learning network that is trained to process images. Computer vision models can perform various tasks such as classification, detection, segmentation, and tracking. A computer vision model takes an image as input and outputs information about the image or objects in the image.

AWS Panorama supports models built with Apache MXNet, DarkNet, GluonCV, Keras, ONNX, PyTorch, TensorFlow, and TensorFlow Lite. You can build models with Amazon SageMaker and import them from an Amazon Simple Storage Service (Amazon S3) bucket.

Now that we grasp the concepts, let’s get hands-on.

Unboxing Your AWS Panorama Appliance
In the box the service team sent me, I found the appliance itself (no surprise!), a power cord, and two ethernet cables. The box also contains a USB key to initially configure the appliance. The device is designed to work in industrial environments. It has two ethernet ports next to the power connector on the back. On the front, protected behind a sliding door, I found an SD card reader, one HDMI connector, and two USB ports. There is also a power button and a reset button to reinitialise the device to its factory state.

Configuring Your Appliance
I first configured it for my network (cable + DHCP, but it also supports static IP configuration) and registered it to securely connect back to my AWS Account. To do so, I navigated to the AWS Management Console and entered my network configuration details. It generated a set of configuration files and certificates. I copied them to the appliance using the provided USB key. My colleague Martin Beeby shared screenshots of this process. The team slightly modified the screens based on the feedback they received during the preview, but I don’t think it is worth going through the step-by-step process again. Tip from the field: be sure to use the USB key provided in the box; it is correctly formatted and automatically recognised by the appliance (my own USB key was not recognised properly).

I then downloaded a sample application from the Panorama GitHub repository and tried it with the Test Utility for Panorama, also available on this GitHub (the test utility is an EC2 instance configured to act as a simulator). The Test Utility for Panorama uses Jupyter notebooks to quickly experiment with sample applications or your code before deploying it to the appliance. It also lists commands allowing you to deploy your applications to the appliance programmatically.

Panorama Command Line
The Panorama command line simplifies the operations to create a project, import assets, package it, and deploy it to the AWS Panorama Appliance. You can follow these instructions to download and install the Panorama command line.

When receiving an application developed by someone else, like the sample application, I have to replace AWS account IDs in all application files and directory names. I do this with one single command:

panorama-cli import-application

Application Structure
A Panorama application structure looks as follows:

├── assets
├── graphs
│   └── example_project
│       └── graph.json
└── packages
    ├── accountXYZ-model-1.0
    │   ├── descriptor.json
    │   └── package.json
    └── accountXYZ-sample-app-1.0
        ├── Dockerfile
        ├── descriptor.json
        ├── package.json
        └── src
            └── app.py

graph.json lists all the packages and nodes in this application. Nodes are the way to define an application in Panorama.
In each package, package.json has details about the package and the assets it uses.
The model package (model) has a descriptor.json which contains the metadata required for compiling the model.
The container package (sample-app) contains the application code in the src directory and a Dockerfile to build the container. Its descriptor.json has details about which command and file to use when the container is launched.
The assets directory is where all the assets reside, such as packaged code and compiled models. You should not make any changes in this directory.

Note that package names are prefixed with your account number.

When my application is ready, I build the container (I am using a Linux machine with Docker Engine and Docker CLI to avoid using Docker Desktop for macOS or Windows.)

panorama-cli build-container                      \
    --container-asset-name {container_asset_name} \
    --package-path packages/{account_id}-{package_name}-1.0

A Note About the Cameras
AWS Panorama Appliance has a concept of “abstract cameras”. Abstract camera sources are placeholders that can be mapped to actual camera devices during application deployment. The Test Utility for Panorama allows you to map abstract cameras to video files for easy, repeatable tests.
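The idea of an abstract camera can be sketched as a simple lookup that is resolved at test or deployment time. The Python below is a conceptual illustration only and does not use any AWS Panorama API: the camera name, the file path, the RTSP URL, and the function name are all hypothetical.

```python
def resolve_cameras(abstract_cameras, mapping):
    """Map each abstract camera name to a concrete source, failing on any gap."""
    missing = [name for name in abstract_cameras if name not in mapping]
    if missing:
        raise KeyError(f"no source mapped for abstract cameras: {missing}")
    return {name: mapping[name] for name in abstract_cameras}


# The application only knows the abstract name.
app_cameras = ["front_door_camera"]

# In tests, the placeholder points at a video file for repeatable runs.
test_sources = resolve_cameras(app_cameras, {"front_door_camera": "videos/front_door.mp4"})

# At deployment, the same placeholder points at a real camera stream.
live_sources = resolve_cameras(app_cameras, {"front_door_camera": "rtsp://192.0.2.10/stream"})

print(test_sources)
print(live_sources)
```

Because the application code refers only to the abstract name, the same build can be exercised against recorded video and then deployed against live cameras without changes.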

Adding an ML Model
The AWS Panorama Appliance supports multiple ML model frameworks. Models may be trained on Amazon SageMaker or any other solution of your choice. I import my ML model from S3 into my project:

panorama-cli add-raw-model                                                 \
    --model-asset-name {asset_name}                                        \
    --model-s3-uri s3://{S3_BUCKET}/{project_name}/{ML_MODEL_FNAME}.tar.gz \
    --descriptor-path {descriptor_path}                                    \
    --packages-path {package_path}

Behind the scenes, ML Models are compiled to optimise them to the Nvidia Accelerated Linux Arm64 architecture of the AWS Panorama Appliance.

Package the Application
Now that I have an ML model and my application code packaged in a container, I am ready to package my application assets for the AWS Panorama Appliance:

panorama-cli package-application

This command uploads all my application assets to the AWS cloud account along with all the manifests.

Deploy the Application
Finally, I deploy the application to the AWS Panorama Appliance. A deployment copies the application and its configuration, like camera stream selection, from the AWS cloud to my on-premises AWS Panorama Appliance. I may deploy my application programmatically using Python code (and the Boto3 SDK you might know already):


import boto3

# manifest, role, and device are assumed to be defined earlier in the script
client = boto3.client('panorama')
client.create_application_instance(
    Name="AWS News Blog Sample Application",
    Description="An object detection app",
    ManifestPayload={
       'PayloadData': manifest         # <== this is the graph.json file content
    },
    RuntimeRoleArn=role,               # <== this is a role that gives my app permissions to use AWS Services such as CloudWatch
    DefaultRuntimeContextDevice=device # <== this is my device name
)

Alternatively, I may use the AWS Management Console:

On Deployed applications, I select Deploy application.

I copy and paste the content of graphs/<project name>/graph.json to the console and select Next.

I give my application a name and an optional description. I select Proceed to deploy.

The next steps are:

declare an IAM role to give my application permissions to use AWS services. The minimal permissions set allows calling the PutMetricData API on CloudWatch.
select the AWS Panorama Appliance I want to deploy to
map the abstract cameras defined in the application descriptors.json to physical cameras known by the AWS Panorama Appliance
fill in any application-specific inputs, such as acceptable threshold value, log level etc.

An example IAM role with this policy, expressed as an AWS CloudFormation template, is:

AWSTemplateFormatVersion: '2010-09-09'
Description: Resources for an AWS Panorama application.
Resources:
  runtimeRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: Allow
            Principal:
              Service:
                - panorama.amazonaws.com
            Action:
              - sts:AssumeRole
      Policies:
        - PolicyName: cloudwatch-putmetrics
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: 'cloudwatch:PutMetricData'
                Resource: '*'
      Path: /service-role/
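
The role above permits a single action, cloudwatch:PutMetricData. As a hedged illustration of what application code might do with that permission, the Python sketch below builds the arguments for a PutMetricData call. The namespace, metric name, dimension, and function names are hypothetical choices for this example and are not part of the sample application.

```python
def detection_metric(count, camera_name):
    """Build PutMetricData arguments for a per-camera detection count."""
    return {
        "Namespace": "PanoramaSampleApp",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "ObjectsDetected",  # hypothetical metric name
                "Dimensions": [{"Name": "Camera", "Value": camera_name}],
                "Value": float(count),
                "Unit": "Count",
            }
        ],
    }


def publish(count, camera_name):
    """Publish the metric; needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("cloudwatch").put_metric_data(**detection_metric(count, camera_name))


print(detection_metric(3, "front_door_camera"))
```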

These six screenshots capture this process:

The deployment takes 15-30 minutes depending on the size of your code and your ML models, and the appliance’s available bandwidth. Eventually, the status turns green to “Running”.
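Since a deployment can take a while, a script that deploys programmatically may want to poll for completion. The Python sketch below shows one way to do that. It takes the describe call as a parameter so it can run here against a stand-in; with Boto3 you would pass the Panorama client's describe_application_instance method. The status strings are my reading of the Panorama API and should be checked against the API reference, and the function name and instance ID are hypothetical.

```python
import time


def wait_for_deployment(describe, instance_id, poll_seconds=60,
                        timeout_seconds=3600, sleep=time.sleep):
    """Poll until the application instance reports success, failure, or timeout."""
    waited = 0
    while True:
        status = describe(ApplicationInstanceId=instance_id)["Status"]
        if status == "DEPLOYMENT_SUCCEEDED":
            return status
        if status in ("DEPLOYMENT_ERROR", "DEPLOYMENT_FAILED"):
            raise RuntimeError(f"deployment ended with status {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for the real API call, so the sketch runs without an appliance.
_statuses = iter(["DEPLOYMENT_PENDING", "DEPLOYMENT_IN_PROGRESS", "DEPLOYMENT_SUCCEEDED"])


def fake_describe(ApplicationInstanceId):
    return {"Status": next(_statuses)}


print(wait_for_deployment(fake_describe, "example-instance-id", sleep=lambda s: None))
```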

Once the application is deployed to your AWS Panorama Appliance it begins to run, continuously analyzing video and generating highly accurate predictions locally within milliseconds. I connect an HDMI cable to the AWS Panorama Appliance to monitor the output, and I can see:

Should anything go wrong during the deployment or during the life of the application, I have access to the logs on Amazon CloudWatch. There are two log streams created, one for the AWS Panorama Appliance itself and one for the application.

Pricing and Availability
The AWS Panorama Appliance is available to purchase on the AWS Elemental order page in the AWS Console. You can place orders from the United States, Canada, the United Kingdom, and the European Union. There is a one-time charge of $4,000 for the appliance itself.

There is a usage charge of $8.33 / month / camera feed.

AWS Panorama stores versioned copies of all assets deployed to the AWS Panorama Appliance (including ML models and business logic) in the cloud. You are charged $0.10 per-GB, per-month for this storage.

You may incur additional charges if the business logic deployed to your AWS Panorama Appliance uses other AWS services. For example, if your business logic uploads ML predictions to S3 for offline analysis, you will be billed separately by S3 for any storage charges incurred.
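To see how these charges add up, here is a small Python sketch that combines the three prices quoted above. It ignores taxes, shipping, and charges from other AWS services, and the function name and the example quantities are my own.

```python
APPLIANCE_PRICE = 4000.00     # one-time charge per appliance (USD)
FEED_PRICE_PER_MONTH = 8.33   # per camera feed, per month (USD)
STORAGE_PRICE_PER_GB = 0.10   # per GB of stored assets, per month (USD)


def panorama_cost(appliances, camera_feeds, storage_gb, months):
    """Estimate total cost in USD from the prices listed in this post."""
    hardware = APPLIANCE_PRICE * appliances
    usage = FEED_PRICE_PER_MONTH * camera_feeds * months
    storage = STORAGE_PRICE_PER_GB * storage_gb * months
    return round(hardware + usage + storage, 2)


# One appliance, four camera feeds, 5 GB of stored assets, for one year.
print(panorama_cost(appliances=1, camera_feeds=4, storage_gb=5, months=12))
```

For this example, the total is $4,000 for the appliance, $399.84 for the camera feeds (4 × 12 × $8.33), and $6.00 for storage (5 × 12 × $0.10), or $4,405.84 in all.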

The AWS Panorama Appliance can be installed anywhere. The appliance connects back to the AWS Panorama service in the AWS cloud in one of the following AWS Regions: US East (N. Virginia), US West (Oregon), Canada (Central), or Europe (Ireland).

Go and build your first computer vision model today.

— seb

Introducing the Security at the Edge: Core Principles whitepaper

2021-10-15 Maddie Bacon

Post Syndicated from Maddie Bacon original https://aws.amazon.com/blogs/security/introducing-the-security-at-the-edge-core-principles-whitepaper/

Amazon Web Services (AWS) recently released the Security at the Edge: Core Principles whitepaper. Today’s business leaders know that it’s critical to ensure that both the security of their environments and the security present in traditional cloud networks are extended to workloads at the edge. The whitepaper provides security executives the foundations for implementing a defense in depth strategy for security at the edge by addressing three areas of edge security:

AWS services at AWS edge locations
How those services and others can be used to implement the best practices outlined in the design principles of the AWS Well-Architected Framework Security Pillar
Additional AWS edge services, which customers can use to help secure their edge environments or expand operations into new, previously unsupported environments

Together, these elements offer core principles for designing a security strategy at the edge, and demonstrate how AWS services can provide a secure environment extending from the core cloud to the edge of the AWS network and out to customer edge devices and endpoints. You can find more information in the Security at the Edge: Core Principles whitepaper.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Enhancing Existing Building Systems with AWS IoT Services

2021-06-16 Lewis Taylor

Post Syndicated from Lewis Taylor original https://aws.amazon.com/blogs/architecture/enhancing-existing-building-systems-with-aws-iot-services/

With the introduction of cloud technology and by extension the rapid emergence of Internet of Things (IoT), the barrier to entry for creating smart building solutions has never been lower. These solutions offer commercial real estate customers potential cost savings and the ability to enhance their tenants’ experience. You can differentiate your business from competitors by offering new amenities and add new sources of revenue by understanding more about your buildings’ operations.

There are several building management systems to consider in commercial buildings, such as air conditioning, fire, elevator, security, and grey/white water. Each system continues to add more features and become more automated, meaning that control mechanisms use all kinds of standards and protocols. This has led to fragmented building systems and inefficiency.

In this blog, we’ll show you how to use AWS for the Edge to bring these systems into one data path for cloud processing. You’ll learn how to use AWS IoT services to review and use this data to build smart building functions. Some common use cases include:

Provide building facility teams a holistic view of building status and performance, alerting them to problems sooner and helping them solve problems faster.
Provide a detailed record of the efficiency and usage of the building over time.
Use historical building data to help optimize building operations and predict maintenance needs.
Offer enriched tenant engagement through services like building control and personalized experiences.
Allow building owners to gather granular usage data from multiple buildings so they can react to changing usage patterns in a single platform.

Securely connecting building devices to AWS IoT Core

AWS IoT Core supports connections with building devices, wireless gateways, applications, and services. Devices connect to AWS IoT Core to send and receive data from AWS IoT Core services and other devices. Buildings often use different device types, and AWS IoT Core has multiple options to ingest data and enable connectivity within your building. AWS IoT Core is made up of the following components:

Device Gateway is the entry point for all devices. It manages your device connections and supports HTTPS and MQTT (3.1.1) protocols.
Message Broker is an elastic and fully managed pub/sub message broker that securely transmits messages (for example, device telemetry data) to and from all your building devices.
Registry is a database of all your devices and associated attributes and metadata. It allows you to group devices and services based upon attributes such as building, software version, vendor, class, floor, etc.

The architecture in Figure 1 shows how building devices can connect into AWS IoT Core. AWS IoT Core supports multiple connectivity options:

Native MQTT – Many building management systems and device controllers support MQTT out of the box.
AWS IoT Device SDK – This option supports MQTT protocol and multiple programming languages.
AWS IoT Greengrass – The previous options assume that devices are connected to the internet, but this isn’t always possible. AWS IoT Greengrass extends the cloud to the building’s edge. Devices can connect directly to AWS IoT Greengrass and send telemetry to AWS IoT Core.
AWS for the Edge partner products – There are several partner solutions, such as Ignition Edge from Inductive Automation, that offer protocol translation software to normalize in-building sensor data.

Figure 1. Data ingestion options from on-premises devices to AWS

Challenges when connecting buildings to the cloud

There are two common challenges when connecting building devices to the cloud:

You need a flexible platform to aggregate building device communication data
You need to transform the building data to a standard protocol, such as MQTT

Building data is made up of various protocols and formats. Many of these are system-specific or legacy protocols. To overcome this, we suggest processing building device data at the edge, extracting important data points/values before transforming to MQTT, and then sending the data to the cloud.

Transforming protocols can be complex because they can abstract naming and operation types. AWS IoT Greengrass and partner products such as Ignition Edge make it possible to read that data, normalize the naming, and extract useful information for device operation. Combined with AWS IoT Greengrass, this gives you a single way to validate the building device data and standardize its processing.
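As a rough picture of what normalizing the naming can look like at the edge, the Python sketch below turns a raw point reading into an MQTT topic and a JSON payload. The topic layout, field names, and function name are assumptions made for illustration; they are not a convention defined by AWS IoT Greengrass or Ignition Edge.

```python
import json


def to_mqtt(building, system, raw_point_name, value, unit):
    """Normalize a raw point name and return an (MQTT topic, JSON payload) pair."""
    point = raw_point_name.strip().lower().replace(" ", "_")
    topic = f"buildings/{building}/{system}/{point}"
    payload = json.dumps({"value": value, "unit": unit}, sort_keys=True)
    return topic, payload


topic, payload = to_mqtt("hq", "hvac", " Supply Air Temp ", 18.5, "celsius")
print(topic)
print(payload)
```

Keeping the building, system, and point in the topic lets downstream rules subscribe to a whole system or a single point with MQTT wildcards.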

Using building data to develop smart building solutions

The architecture in Figure 2 shows an in-building lighting system. It is connected to AWS IoT Core and reports on devices’ status and gives users control over connected lights.

The architecture in Figure 2 has two data paths, which we’ll provide details on in the following sections, but here’s a summary:

The “cold” path gathers all incoming data for batch data analysis and historical dashboarding.
The “warm” bidirectional path is for faster, real-time data. It gathers devices’ current state data. This path is used by end-user applications for sending control messages, real-time reporting, or initiating alarms.

Figure 2. Architecture diagram of a building lighting system connected to AWS IoT Core

Cold data path

The cold data path gathers all lighting device telemetry data, such as power consumption, operating temperature, health data, etc. to help you understand how the lighting system is functioning.

Building devices can often deliver unstructured, inconsistent, and large volumes of data. AWS IoT Analytics helps clean up this data by applying filters, transformations, and enrichment from other data sources before storing it. By using Amazon Simple Storage Service (Amazon S3), you can analyze your data in different ways. Here we use Amazon Athena and Amazon QuickSight for building operational dashboard visualizations.

Let’s discuss a real-world example. For building lighting systems, understanding your energy consumption is important for evaluating energy and cost efficiency. Data ingested into AWS IoT Core can be stored long term in Amazon S3, making it available for historical reporting. Athena and QuickSight can quickly query this data and build visualizations that show lighting state (on or off) and annual energy consumption over a set period of time. You can also overlay this data with sunrise and sunset data to provide insight into whether you are using your lighting systems efficiently. For example, you can adjust the lighting schedule according to the darker winter months versus the brighter summer months.
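The sunrise and sunset overlay boils down to an interval overlap. The Python sketch below computes how many hours the lights were on while it was light outside. The function name, timestamps, and times of day are made-up example values; in the architecture described here, the equivalent query would more likely run in Athena.

```python
from datetime import datetime


def daylight_overlap_hours(lights_on, lights_off, sunrise, sunset):
    """Return the hours during which the lights were on between sunrise and sunset."""
    start = max(lights_on, sunrise)
    end = min(lights_off, sunset)
    return max((end - start).total_seconds(), 0.0) / 3600


def at(hhmm, day="2021-06-16"):
    """Helper for the example: build a datetime on a fixed day."""
    return datetime.fromisoformat(f"{day}T{hhmm}")


# Lights on 06:00-20:00 while daylight runs 05:00-21:00.
wasted = daylight_overlap_hours(at("06:00"), at("20:00"), at("05:00"), at("21:00"))
print(wasted)
```

Here every one of the 14 hours the lights were on fell within daylight, which suggests the schedule is worth reviewing.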

Warm data path

In the warm data path, the AWS IoT Device Shadow service makes the device state available. Shadow updates are forwarded by an AWS IoT rule into downstream services such as AWS IoT Events, which tracks and monitors multiple devices and data points and then initiates actions based on specific events. Further, you could build APIs that interact with AWS IoT Device Shadow. In this architecture, we have used AWS AppSync and AWS Lambda to enable building controls via a tenant smartphone application.
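A device shadow document holds a desired state (what an application asked for) and a reported state (what the device last said it was doing). The Python sketch below shows the comparison between the two for a flat set of properties. The Device Shadow service computes this difference itself, so this is only an illustration of the idea; the property names and the function name are hypothetical.

```python
def shadow_delta(shadow):
    """Return the desired properties that the device has not yet reported."""
    state = shadow.get("state", {})
    desired = state.get("desired", {})
    reported = state.get("reported", {})
    return {key: value for key, value in desired.items() if reported.get(key) != value}


# A tenant app asked for the meeting room light to be on at 80% brightness.
shadow = {
    "state": {
        "desired": {"power": "on", "brightness": 80},
        "reported": {"power": "on", "brightness": 40},
    }
}
print(shadow_delta(shadow))
```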

Let’s discuss a real-world example. In an office meeting room lighting system, maintaining a certain brightness level is important for health and safety. If that space is unoccupied, you can save money by turning the lighting down or off. AWS IoT Events can take inputs from lumen sensors, lighting systems, and motorized blinds and put them into a detector model. This model calculates and prompts the best action to maintain the room’s brightness throughout the day. If the lumen level drops below a specific brightness threshold in a room, AWS IoT Events could prompt an action to maintain an optimal brightness level in the room. If an occupancy sensor is added to the room, the model can know if someone is in the room and maintain the lighting state. If that person leaves, it will turn off that lighting. The ongoing calculation of state can also evaluate the time of day or weather conditions. It would then select the most economical option for the room, such as opening the window blinds rather than turning on the lighting system.
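The decision logic in this example can be written down in a few lines. The Python sketch below is a plain function that mirrors the rules described above; it is not an AWS IoT Events detector model, which is defined as states and transitions in the service. The input names, action names, and thresholds are assumptions for illustration.

```python
def choose_action(occupied, lumens, threshold, is_daylight, blinds_open):
    """Pick the most economical action that keeps an occupied room bright enough."""
    if not occupied:
        return "lights_off"   # nobody in the room: save energy
    if lumens >= threshold:
        return "hold"         # bright enough: change nothing
    if is_daylight and not blinds_open:
        return "open_blinds"  # free daylight is cheaper than lighting
    return "lights_on"


print(choose_action(occupied=True, lumens=250, threshold=400, is_daylight=True, blinds_open=False))
print(choose_action(occupied=True, lumens=250, threshold=400, is_daylight=False, blinds_open=False))
print(choose_action(occupied=False, lumens=250, threshold=400, is_daylight=True, blinds_open=True))
```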

Conclusion

In this blog, we demonstrated how to collect and aggregate the data produced by on-premises building management platforms. We discussed how augmenting this data with the AWS IoT Core platform allows for development of smart building solutions such as building automation and operational dashboarding. AWS products and services can enable your buildings to be more efficient while also providing engaging tenant experiences. For more information on how to get started, please check out our getting started with AWS IoT Core developer guide.

Automating data center expansions with Airflow

2021-01-27 Jet Mariscal

Post Syndicated from Jet Mariscal original https://blog.cloudflare.com/automating-data-center-expansions-with-airflow/

Cloudflare’s network keeps growing, and that growth doesn’t just come from building new data centers in new cities. We’re also upgrading the capacity of existing data centers by adding newer generations of servers — a process that makes our network safer, faster, and more reliable for our users.

Connecting new Cloudflare servers to our network has always been complex, in large part because of the amount of manual effort that used to be required. Members of our Data Center and Infrastructure Operations, Network Operations, and Site Reliability Engineering teams had to carefully follow steps in an extremely detailed standard operating procedure (SOP) document, often copying command-line snippets directly from the document and pasting them into terminal windows.

But such a manual process can only scale so far, and we knew there must be a way to automate the installation of new servers.

Here’s how we tackled that challenge by building our own Provisioning-as-a-Service (PraaS) platform, cutting by 90% the amount of time our team spent on mundane operational tasks.

Choosing and using an automation framework

When we began our automation efforts, we quickly realized it made sense to replace each of these manual SOP steps with an API-call equivalent and to present them in a self-service web-based portal.

To organize these new automatic steps, we chose Apache Airflow, an open-source workflow management platform. Airflow is built around directed acyclic graphs, or DAGs, which are collections of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
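To make the DAG idea concrete without requiring an Airflow installation, here is a small sketch using Python's standard-library graphlib module (Python 3.9 or later). It is not Airflow code. Apart from enable_anycast, which appears later in this post, the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dag = {
    "verify_hardware": set(),
    "configure_network": {"verify_hardware"},
    "run_health_checks": {"configure_network"},
    "update_inventory": {"configure_network"},
    "enable_anycast": {"run_health_checks", "update_inventory"},
}

# static_order() yields the tasks in an order that respects every dependency.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The two middle tasks do not depend on each other, so a scheduler is free to run them in parallel; the only guarantee is that each task runs after everything it depends on.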

In this new system, each SOP step is implemented as a task in the DAG. The majority of these tasks are API calls to Salt — software which automates the management and configuration of any infrastructure or application, and which we use to manage our servers, switches, and routers. Other DAG tasks are calls to query Prometheus (systems monitoring and alerting toolkit), Thanos (a highly available Prometheus setup with long-term storage capabilities), Google Chat webhooks, JIRA, and other internal systems.

Here is an example of one of these tasks. In the original SOP, SREs were given the following instructions to enable anycast:

Log in to a remote system.
Copy and paste the command in the terminal.
Replace the router placeholder in the command snippet with the actual value.
Execute the command.

In our new workflow, this step becomes a single task in the DAG named “enable_anycast”:

enable_anycast = builder.wrap_class(AsyncSaltAPIOperator)(
             task_id='enable_anycast',
             target='{{ params.netops }}',
             function='cmd.run',
             fun_kwargs={'cmd': 'salt {{ get_router(params.colo_name) }} '
                         'anycast.enable --out=json --out-indent=-1'},
             salt_conn_id='salt_api',
             trigger_rule='one_success')

As you can see, automation eliminates the need for a human operator to log in to a remote system and to figure out which router should replace the placeholder in the command to be executed.

In Airflow, a task is an implementation of an Operator. The Operator in the automated step is the “AsyncSaltAPIOperator”, a custom operator built in-house. This extensibility is one of the many reasons that made us decide to use Apache Airflow. It allowed us to extend its functionality by writing custom operators that suit our needs.

SREs from various teams have written quite a lot of custom Airflow Operators that integrate with Salt, Prometheus, Bitbucket, Google Chat, JIRA, PagerDuty, among others.

Manual SOP steps transformed into a feature-packed automation

The tasks that replaced steps in the SOP are marvelously feature-packed. Here are some highlights of what they are capable of, on top of just executing a command:

Failure Handling
When a task fails for whatever reason, it automatically retries until it exhausts the maximum retry limit that we set for the task. We employ various retry strategies, including instructing tasks not to retry at all. We do this when retrying is impractical, or when we deliberately want the task to stop regardless of how many retry attempts remain, such as when it encounters an exception or a condition that is unlikely to change for the better.
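
The retry behavior described above can be sketched in plain Python. This is only an illustration of the idea, not Airflow's or PraaS's implementation; the names `DoNotRetry` and `run_with_retries` are hypothetical.

```python
import time


class DoNotRetry(Exception):
    """Hypothetical marker: the failure will not improve on a retry."""


def run_with_retries(task, max_retries=3, delay_seconds=0.0):
    """Run task(); retry ordinary failures, give up at once on DoNotRetry.

    Returns (result, attempts). Re-raises the last error when retries run out.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except DoNotRetry:
            raise  # deliberately skip any remaining retries
        except Exception:
            if attempts > max_retries:
                raise  # retry budget exhausted
            time.sleep(delay_seconds)


# Example: a task that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result, attempts = run_with_retries(flaky)
```

Separating "never retry" failures from ordinary ones keeps the retry budget for problems that can plausibly clear up on their own.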

Logging
Each task provides a comprehensive log during executions. We’ve written our tasks to ensure that we log as much information as possible that would help us audit and troubleshoot issues.

Notifications
We’ve written our tasks to send a notification with information such as the name of the DAG, the name of the task, its task state, the number of attempts it took to reach a certain state, and a link to view the task logs.

When a task fails, we definitely want to be notified, so we also set tasks to additionally provide information such as the number of retry attempts and links to view relevant wiki pages or Grafana dashboards.

Depending on the criticality of the failure, we can also instruct it to page the relevant on-call person on the provisioning shift, should it require immediate attention.

Jinja Templating
Jinja templating lets us use code to supply dynamic content to otherwise static objects such as strings. We use it in combination with macros to pass in parameters that can change during execution, since macros are evaluated while the task runs.

Macros
Macros are used to pass dynamic information into task instances at runtime. Macros are a way to expose objects to templates. In other words, macros are functions that take input, modify that input, and give the modified output.

Adapting tasks for preconditions and human intervention

There are a few steps in the SOP that require certain preconditions to be met. We use sensors to set dependencies between these tasks, and even between different DAGs, so that one does not run until the dependency has been met.

Below is an example of a sensor that waits until all nodes resolve to their assigned DNS records:

verify_node_dns = builder.wrap_class(DNSSensor)(
            task_id='verify_node_dns',
            zone=domain,
            nodes_from='{{ to_json(run_ctx.globals.import_nodes_via_mpl) }}',
            timeout=60 * 30,
            poke_interval=60 * 10,
            mode='reschedule')

In addition, some of our tasks still require input from a human operator. In these circumstances, we use sensors as blocking tasks that prevent work from starting until certain preconditions are met.

The code below is a simple example of a task that will send notifications to get the attention of a human operator, and waits until a Change Request ticket has been provided and verified:

verify_jira_input = builder.wrap_class(InputSensor)(
            task_id='verify_jira_input',
            var_key='jira',
            prompt='Please provide the Change Request ticket.',
            notify=True,
            require_human=True)

Another sensor task example is waiting until a zone has been deployed by a Cloudflare engineer as described in https://blog.cloudflare.com/improving-the-resiliency-of-our-infrastructure-dns-zone/.

In order for PraaS to be able to accept human inputs, we’ve written a separate DAG we call our DAG Manager. Whenever we need to submit input back to a running expansion DAG, we simply trigger the DAG Manager and pass in our input as a JSON configuration, which it processes and then submits back to the expansion DAG.

Automating data center expansions with Airflow

Managing Dependencies Between Tasks

Replacing SOP steps with DAG tasks was only the first part of our journey towards greater automation. We also had to define the dependencies between these tasks and construct the workflow accordingly.

Here’s an example of what this looks like in code:

verify_cr >> parse_cr >> [execute_offline, execute_online]
execute_online >> silence_highstate_runner >> silence_metals >> \
    disable_highstate_runner

The code simply uses bit shift operators to chain the operations. A list of tasks can also be set as dependencies:

change_metal_status >> [wait_for_change_metal_status, verify_zone_update] >> \
    evaluate_ecmp_management

With the bit shift operator, chaining multiple dependencies becomes concise.
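
To show why this chaining syntax works, here is a minimal sketch of how a class can overload `>>` so that both single tasks and lists of tasks can be chained. The `Task` class is a hypothetical stand-in that only records edges; it is not Airflow's operator class.

```python
class Task:
    """Hypothetical stand-in for an Airflow task; it only records edges."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # Handles `task >> other` and `task >> [a, b]`.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.downstream.append(target.task_id)
        return other  # returning the right-hand side lets the chain continue

    def __rrshift__(self, others):
        # Handles `[a, b] >> task`: a list has no `>>` of its own, so
        # Python falls back to the right operand's reflected method.
        for upstream in others:
            upstream >> self
        return self


change = Task("change_metal_status")
wait = Task("wait_for_change_metal_status")
verify = Task("verify_zone_update")
evaluate = Task("evaluate_ecmp_management")

change >> [wait, verify] >> evaluate
```

After the last line, `change` points at both middle tasks, and each middle task points at `evaluate`.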

By default, a downstream task will only run if its upstream has succeeded. For a more complex dependency setup, we set a trigger_rule which defines the rule by which the generated task gets triggered.

All operators have a trigger_rule argument. The Airflow scheduler decides whether to run the task or not depending on what rule was specified in the task. An example rule that we use a lot in PraaS is “one_success” — it fires as soon as at least one parent succeeds, and it does not wait for all parents to be done.
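
The difference between the default rule and “one_success” can be sketched as a small decision function. This is a simplified model for illustration, not the Airflow scheduler's code; the function name `should_run` and the three state strings are assumptions, and only two rules are modelled.

```python
def should_run(trigger_rule, upstream_states):
    """Decide whether a task may start, given its parents' states.

    upstream_states holds 'success', 'failed', or 'running' per parent.
    """
    if trigger_rule == "all_success":
        # Default behavior: every parent must have succeeded.
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "one_success":
        # Fires as soon as any parent succeeds, without waiting for the rest.
        return any(state == "success" for state in upstream_states)
    raise ValueError(f"rule not modelled in this sketch: {trigger_rule}")


parents = ["success", "running"]
default_decision = should_run("all_success", parents)
one_success_decision = should_run("one_success", parents)
```

With one parent finished and another still running, the default rule holds the task back while “one_success” lets it start.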

Solving Complex Workflows with Branching and Multi-DAGs

Having complex workflows means that we need a workflow to branch, or only go down a certain path, based on an arbitrary condition, which is typically related to something that happened in an upstream task. Branching is used to perform conditional logic, that is, execute a set of tasks based on a condition. We use BranchPythonOperator to achieve this.

At some point in the workflow, our data center expansion DAGs trigger various external DAGs to accomplish complex tasks. This is why we have written our DAGs to be fully reusable. We did not try to incorporate all the logic into a single DAG; instead, we created other separable DAGs that are fully reusable and can be triggered on demand, either by hand or programmatically. Our DAG Manager and the “helper” DAG are examples of this.

The Helper DAG comprises logic that allows us to mimic a “for loop” by having the DAG respawn itself if needed, technically doing cycles. If you recall, a DAG is acyclic, but some tasks in our workflow require complex loops, and we solve those by using a helper DAG.

We designed reusable DAGs early on, which allowed us to build complex automation workflows from separable DAGs, each of which handles distinct and well-defined tasks. Each data center DAG could easily reuse other DAGs by triggering them programmatically.

Having separate DAGs that run independently, that are triggered by other DAGs, and that keep inter-dependencies between them, is a pattern we use a lot. It has allowed us to execute very complex workflows.

Creating DAGs that Scale and Executing Tasks at Scale

Data center expansions are done in two phases:

Phase 1 – this is the phase in which servers are powered on. Each server boots our custom Linux kernel and begins the provisioning process.

Phase 2 – this is the phase in which newly provisioned servers are enabled in the cluster to receive production traffic.

To reflect these phases in the automation workflow, we also wrote two separate DAGs, one for each phase. However, we have over 200 data centers, so if we were to write a pair of DAGs for each, we would end up writing and maintaining 400 files!

A viable option could be to parameterize our DAGs. At first glance, this approach sounds reasonable. However, it poses one major challenge: tracking the progress of DAG runs will be too difficult and confusing for the human operator using PraaS.

Following the software design principle called DRY (Don’t Repeat Yourself), and inspired by the Factory Method design pattern in programming, we’ve instead written both phase 1 and phase 2 DAGs in a way that allows them to dynamically create multiple different DAGs with exactly the same tasks, and to fully reuse the exact same code. As a result, we only maintain one code base, and as we add new data centers, we are also able to generate a DAG for each new data center instantly, without writing a single line of code.
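
A factory of this kind might look roughly like the sketch below. It uses a hypothetical `FakeDag` dataclass in place of a real Airflow DAG, and the `dag_id` naming scheme, the data center names, and the task list are illustrative assumptions rather than the PraaS code.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDag:
    """Hypothetical stand-in for an Airflow DAG object."""
    dag_id: str
    tasks: list = field(default_factory=list)


def build_expansion_dag(colo_name, phase):
    """Factory: the same task list for every data center, a unique dag_id."""
    dag = FakeDag(dag_id=f"expansion_phase{phase}_{colo_name}")
    dag.tasks = ["verify_cr", "parse_cr", "enable_anycast"]  # illustrative
    return dag


def generate_dags(colo_names, phases=(1, 2)):
    """Return {dag_id: dag} for every (data center, phase) pair."""
    dags = {}
    for colo in colo_names:
        for phase in phases:
            dag = build_expansion_dag(colo, phase)
            dags[dag.dag_id] = dag
    return dags


dags = generate_dags(["colo01", "colo02", "colo03"])
```

Adding a data center then means adding one name to the input list. In a real Airflow deployment the generated DAG objects would also need to be exposed at module level so that the scheduler can discover them.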

And Airflow even made it easy to put a simple customized web UI on top of the process, which made it simple to use by more employees who didn’t have to understand all the details.

The death of SOPs?

We would like to think that all of this automation removes the need for our original SOP document. But this is not really the case. Automation can fail, the components in it can fail, and a particular task in the DAG may fail. When this happens, our SOPs will be used again to prevent provisioning and expansion activities from stopping completely.

Introducing automation paved the way for what we call an SOP-as-Code practice. We made sure that every task in the DAG had an equivalent manual step in the SOP that SREs can execute by hand, should the need arise, and that every change in the SOP has a corresponding pull request (PR) in the code.

What’s next for PraaS

Onboarding of the other provisioning activities into PraaS, such as decommissioning, is already ongoing.

For expansions, our ultimate goal is a fully autonomous system that monitors whether new servers have been racked in our edge data centers — and automatically triggers expansions — with no human intervention.

An introduction to three-phase power and PDUs

2020-12-04 Rob Dinh

Post Syndicated from Rob Dinh original https://blog.cloudflare.com/an-introduction-to-three-phase-power-and-pdus/

Our fleet of over 200 locations comprises various generations of servers and routers. And with the ever changing landscape of services and computing demands, it’s imperative that we manage power in our data centers right. This blog is a brief Electrical Engineering 101 session going over specifically how power distribution units (PDU) work, along with some good practices on how we use them. It appears to me that we could all use a bit more knowledge on this topic, and more love and appreciation of something that’s critical but usually taken for granted, like hot showers and opposable thumbs.

A PDU is a device used in data centers to distribute power to multiple rack-mounted machines. It’s an industrial grade power strip typically designed to power an average consumption of about seven US households. Advanced models have monitoring features and can be accessed via SSH or webGUI to turn on and off power outlets. How we choose a PDU depends on what country the data center is in and what it provides in terms of voltage, phase, and plug type.

For each of our racks, all of our dual power-supply (PSU) servers are cabled to one of the two vertically mounted PDUs. As shown in the picture above, one PDU feeds a server’s PSU via a red cable, and the other PDU feeds that server’s other PSU via a blue cable. This is to ensure we have power redundancy, maximizing our service uptime; in case one of the PDUs or server PSUs fails, the redundant power feed will be available, keeping the server alive.

Faraday’s Law and Ohm’s Law

Like most high-voltage applications, PDUs and servers are designed to use AC power. That means voltage and current aren’t constant: they’re sine waves with magnitudes that alternate between positive and negative at a certain frequency. For example, a voltage feed of 100V is not constantly at 100V, but it bounces between 100V and -100V like a sine wave. One complete sine wave cycle is one phase of 360 degrees, and running at 50Hz means there are 50 cycles per second.

The sine wave can be explained by Faraday’s Law and by looking at how an AC power generator works. Faraday’s Law tells us that a current is induced to flow due to a changing magnetic field. Below illustrates a simple generator with a permanent magnet rotating at constant speed and a coil coming in contact with the magnet’s magnetic field. Magnetic force is strongest at the North and South ends of the magnet. So as it rotates on itself near the coil, current flow fluctuates in the coil. One complete rotation of the magnet represents one phase. As the North end approaches the coil, current increases from zero. Once the North end leaves, current decreases to zero. The South end in turn approaches, now the current “increases” in the opposite direction. And finishing the phase, the South end leaves returning the current back to zero. Current alternates its direction at every half cycle, hence the naming of Alternating Current.

Current and voltage in AC power fluctuate in-phase, or “in tandem”, with each other. So by the power formula that accompanies Ohm’s Law, Power = Voltage x Current, power will always be positive. Notice on the graph below that AC power (Watts) has two peaks per cycle. But for practical purposes, we’d like to use a constant power value. We do that by interpreting AC voltage and current as their “DC” equivalents using root-mean-square (RMS) averaging, which for a sine wave takes the max value and divides it by √2. For example, in the US, our conditions are 208V 24A at 60Hz. When we look at spec sheets, all of these values can be assumed as RMS’d into their constant DC equivalent values. When we say we’re fully utilizing a PDU’s max capacity of 5kW, it actually means that the power consumption of our machines bounces between 0 and about 10kW (5kW x 2), because the voltage and the current each peak at √2 times their RMS values.
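
The relationship between RMS values, average power, and instantaneous power can be checked numerically. The sketch below samples one cycle of an in-phase sine-wave voltage and current using the 208V and 24A figures from the text; the sample count is an arbitrary assumption.

```python
import math


def rms(samples):
    """Root-mean-square of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


V_RMS, I_RMS = 208.0, 24.0        # spec-sheet (RMS) values from the text
v_peak = V_RMS * math.sqrt(2)     # a sine wave peaks at RMS x sqrt(2)
i_peak = I_RMS * math.sqrt(2)

n = 1000                          # samples over one full cycle (assumption)
angles = [2 * math.pi * k / n for k in range(n)]
voltage = [v_peak * math.sin(a) for a in angles]
current = [i_peak * math.sin(a) for a in angles]  # in phase with voltage
power = [v * i for v, i in zip(voltage, current)]

average_power = sum(power) / n    # what a spec sheet calls "5kW"
peak_power = max(power)           # instantaneous maximum within the cycle
```

Under these assumptions the instantaneous power never goes negative, its average matches the RMS voltage multiplied by the RMS current, and its peak is twice that average.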

It’s also critical to figure out the sum of power our servers will need in a rack so that it falls under the PDU’s design max power capacity. For our US example, a PDU is typically 5kW (208 volts x 24 amps); therefore, we’re budgeting 5kW and fitting as many machines as we can under that. If we need more machines and the total sum power goes above 5kW, we’d need to provision another power source. That would lead to possibly another set of PDUs and racks that we may not fully use depending on demand, i.e. more underutilized costs. All we can do is abide by P = V x I.

However, there is a way we can increase the max power capacity economically: a 3-phase PDU. Compared to single phase, its max capacity is √3 or 1.7 times higher. A 3-phase PDU of the same US specs above has a capacity of 8.6kW (5kW x √3), allowing us to power more machines under the same source. Using a 3-phase setup might mean thicker cables and a bigger plug; but despite being more expensive than a 1-phase, its value is higher compared to two 1-phase rack setups for these reasons:

It’s more cost-effective, because there are fewer hardware resources to buy
Say the computing demand adds up to 215kW of hardware, we would need 25 3-phase racks compared to 43 1-phase racks.
Each rack needs two PDUs for power redundancy. Using the example above, we would need 50 3-phase PDUs compared to 86 1-phase PDUs to power 215kW worth of hardware.
That also means a smaller rack footprint and fewer power sources provided and charged by the data center, saving us up to √3 or 1.7 times in opex.
It’s more resilient, because there are more circuit breakers in a 3-phase PDU, one more than in a 1-phase. For example, a 48-outlet PDU that is 1-phase would be split into two circuits of 24 outlets, while a 3-phase one would be split into 3 circuits of 16 outlets. If a breaker tripped, we’d lose 16 outlets using a 3-phase PDU instead of 24.
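
The rack and PDU counts in the list above follow from simple ceiling division. The sketch below reproduces them using integer watts, which avoids floating-point rounding; the function name `racks_needed` is hypothetical.

```python
def racks_needed(total_watts, pdu_watts):
    """Smallest number of racks whose PDU capacity covers total_watts."""
    return -(-total_watts // pdu_watts)  # ceiling division without floats


DEMAND_W = 215_000                        # 215kW of hardware, from the text

racks_1_phase = racks_needed(DEMAND_W, 5_000)   # 5kW per 1-phase PDU
racks_3_phase = racks_needed(DEMAND_W, 8_600)   # 8.6kW per 3-phase PDU

# Each rack carries two PDUs for power redundancy.
pdus_1_phase = 2 * racks_1_phase
pdus_3_phase = 2 * racks_3_phase

# Outlets lost when one breaker trips on a 48-outlet PDU.
outlets_lost_1_phase = 48 // 2
outlets_lost_3_phase = 48 // 3
```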

The PDU shown above is a 3-phase model of 48 outlets. We can see three pairs of circuit breakers for the three branches that are intertwined with each other — white, grey, and black. Industry demands today pressure engineers to maximize compute performance and minimize physical footprint, making the 3-phase PDU a widely-used part of operating a data center.

What is 3-phase?

A 3-phase AC generator has three coils instead of one, where the coils are 120 degrees apart inside the cylindrical core, as shown in the figure below. Just like the 1-phase generator, current flow is induced by the rotation of the magnet, thus creating power from each coil sequentially at every one-third of the magnet’s rotation cycle. In other words, we’re generating three 1-phase power feeds offset by 120 degrees.

A 3-phase feed is set up by joining any of its three coils into line pairs. L1, L2, and L3 coils are live wires with each on their own phase carrying their own phase voltage and phase current. Two phases joining together form one line carrying a common line voltage and line current. L1 and L2 phase voltages create the L1/L2 line voltage. L2 and L3 phase voltages create the L2/L3 line voltage. L1 and L3 phase voltages create the L1/L3 line voltage.

Let’s take a moment to clarify the terminology. Some other sources may refer to line voltage (or current) as line-to-line or phase-to-phase voltage (or current). It can get confusing, because line voltage is the same as phase voltage in 1-phase circuits, as there’s only one phase. Also, the magnitude of the line voltage is equal to the magnitude of the phase voltage in 3-phase Delta circuits, while the magnitude of the line current is equal to the magnitude of the phase current in 3-phase Wye circuits.

Conversely, the line current equals the phase current times √3 in Delta circuits. In Wye circuits, the line voltage equals the phase voltage times √3.

In Delta circuits:
V_line = V_phase
I_line = √3 x I_phase

In Wye circuits:
V_line = √3 x V_phase
I_line = I_phase

Delta and Wye circuits are the two methods that three wires can join together. This happens both at the power source with three coils and at the PDU end with three branches of outlets. Note that the generator and the PDU don’t need to match each other’s circuit types.

On PDUs, these phases join when we plug servers into the outlets. So we conceptually use the wirings of coils above and replace them with resistors to represent servers. Below is a simplified wiring diagram of a 3-phase Delta PDU showing the three line pairs as three modular branches. Each branch carries two phase currents and its own one common voltage drop.

And this one below is of a 3-phase Wye PDU. Note that Wye circuits have an additional line known as the neutral line where all three phases meet at one point. Here each branch carries one phase and a neutral line, therefore one common current. The neutral line isn’t considered as one of the phases.

Thanks to a neutral line, a Wye PDU can offer a second voltage source that is √3 times lower for smaller devices, like laptops or monitors. Common voltages for Wye PDUs are 230V/400V, or 120V/208V, the latter particularly in North America.

Where does the √3 come from?

Why are we multiplying by √3? As the name implies, we are adding phasors. Phasors are complex numbers representing sine wave functions. Adding phasors is like adding vectors. Say your GPS tells you to walk 1 mile East (vector a), then walk 1 mile North (vector b). You walked 2 miles, but you only moved by 1.4 miles NE from the original location (vector a+b). That 1.4 miles of “work” is what we want.

Let’s apply this to L1 and L2 in a Delta circuit. When we combine phases L1 and L2, we get an L1/L2 line. We assume the 2 coils are identical. Let’s say α represents the voltage magnitude for each phase. The 2 phases are 120 degrees offset as designed in the 3-phase power generator:

|L1| = |L2| = α
L1 = |L1|∠0° = α∠0°
L2 = |L2|∠-120° = α∠-120°

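Using vector addition to solve for L1/L2. The line voltage is the potential difference between the two phases, so we add L1 to the reverse of L2:

L1/L2 = L1 + (-L2) = L1 - L2
L1/L2 = α∠0° - α∠-120°
L1/L2 = α(1 + j0) - α(-1/2 - j√3/2)
L1/L2 = α(3/2 + j√3/2)

Convert L1/L2 into polar form:

L1/L2 = √3α∠30°

Since voltage is a scalar, we’re only interested in the “work”:

|L1/L2| = √3α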

Given that α also applies for L3, this means that for any of the three line pairs, we multiply the phase voltage by √3 to calculate the line voltage.

V_line = √3 x V_phase
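
The √3 factor can be verified numerically with Python's complex-number support. The sketch below takes the line voltage as the difference between two phase phasors; the value α = 120V and the placement of L3 at +120° are assumptions made for illustration.

```python
import cmath
import math

alpha = 120.0  # assumed phase-voltage magnitude in volts

# Phase voltages as phasors, 120 degrees apart.
L1 = cmath.rect(alpha, math.radians(0))
L2 = cmath.rect(alpha, math.radians(-120))
L3 = cmath.rect(alpha, math.radians(120))   # assumed position of the third phase

# Line voltage across the L1/L2 pair: the difference of the two phasors.
line_12 = L1 - L2
magnitude, angle = cmath.polar(line_12)

ratio = magnitude / alpha                    # expected to be close to sqrt(3)
angle_degrees = math.degrees(angle)
```

With α = 120V the line voltage comes out at roughly 208V, which matches the 120V/208V pairing seen on Wye PDUs.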

Now with the three phase powers being equal, we can add them all to get the overall effective power. The derivation below works for both Delta and Wye circuits, because in each of them exactly one of the voltage and the current picks up the √3 factor.

P_overall = 3 x P_phase
P_overall = 3 x (V_phase x I_phase)
P_overall = (3/√3) x (V_line x I_line)
P_overall = √3 x V_line x I_line

Using the US example, V_line is 208V and I_line is 24A. This leads to the overall 3-phase power to be 8646W (√3 x 208V x 24A) or 8.6kW. There lies the biggest advantage for using 3-phase systems. By adding 2 sets of coils and wires (ignoring the neutral wire), we’ve turned a 1-phase generator into one that can produce √3 or 1.7 times more power!

Dealing with 3-phase

The derivation in the section above assumes that the magnitude at all three phases is equal, but we know in practice that’s not always the case. In fact, it’s barely ever. We rarely have servers and switches evenly distributed across all three branches on a PDU. Each machine may have different loads and different specs, so power could be wildly different, potentially causing a dramatic phase imbalance. Having a heavily imbalanced setup could potentially hinder the PDU’s available capacity.

A perfectly balanced and fully utilized PDU at 8.6kW means that each of its three branches has 2.88kW of power consumed by machines. Laid out simply, it’s spread 2.88 + 2.88 + 2.88. This is the best case scenario. If we were to take 1kW worth of machines out of one branch, the spread would become 2.88 + 1.88 + 2.88. Imbalance is introduced and the PDU is underutilized, but we’re fine. However, if we were to put back that 1kW into another branch, like 3.88 + 1.88 + 2.88, the PDU is over capacity, even though the sum is still 8.6kW. In fact, it would be over capacity even if you just added 500W instead of 1kW on the wrong branch, thus reaching 3.38 + 1.88 + 2.88 (8.1kW).

That’s because an 8.6kW PDU is spec’d to have a maximum of 24A for each phase current. Overloading one of the branches can force phase currents to go over 24A. Theoretically, we can reach the PDU’s capacity by loading one branch until its current reaches 24A and leave the other two branches unused. That’ll render it into a 1-phase PDU, losing the benefit of the √3 multiplier. In reality, the branch would have fuses rated less than 24A (usually 20A) to ensure we won’t reach that high and cause overcurrent issues. Therefore the same 8.6kW PDU would have one of its branches tripped at 4.2kW (208V x 20A).

Loading up one branch is the easiest way to overload the PDU. Being heavily imbalanced significantly lowers PDU capacity and increases risk of failure. To help minimize that, we must:

Ensure that total power consumption of all machines is under the PDU’s max power capacity
Try to be as phase-balanced as possible by spreading cabling evenly across the three branches
Ensure that the sum of phase currents from powered machines at each branch is under the fuse rating at the circuit breaker.

This spreadsheet from Raritan is very useful when designing racks.

For the sake of simplicity, let’s ignore other machines like switches. Our latest 2U4N servers are rated at 1800W. That means we can only fit a maximum of four of these 2U4N chassis (8600W / 1800W = 4.7 chassis). Rounding them up to 5 would reach a total rack level power consumption of 9kW, so that’s a no-no.

Splitting 4 chassis into 3 branches evenly is impossible, so one of the branches will have to take 2 chassis. That would lead to a non-ideal phase balancing:

Keeping phase currents under 24A, there’s only 1.1A (24A – 22.9A) to add on L1 or L2 before the PDU gets overloaded. Say we want to add as many machines as we can under the PDU’s power capacity. One solution is we can add up to 242W on the L1/L2 branch until both L1 and L2 currents reach their 24A limit.

Alternatively, we can add up to 298W on the L2/L3 branch until L2 current reaches 24A. Note we can also add another 298W on the L3/L1 branch until L1 current reaches 24A.

In the examples above, we can see that various solutions are possible. Adding two 298W machines each at L2/L3 and L3/L1 is the most phase balanced solution, given the parameters. Nonetheless, PDU capacity isn’t optimized at 7.8kW.
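
The currents quoted in this example can be reproduced with a short calculation. The sketch below assumes a Delta PDU at 208V with purely resistive loads, so each branch current is in phase with its branch voltage and the two branch currents meeting at a wire are 120 degrees apart. The function name `delta_line_currents` and the branch ordering are assumptions.

```python
import math


def delta_line_currents(branch_watts, line_volts=208.0):
    """Currents on L1, L2, L3 for a Delta PDU with resistive loads.

    branch_watts = (P_L1L2, P_L2L3, P_L3L1) in watts.
    """
    i12, i23, i31 = (p / line_volts for p in branch_watts)

    def combine(a, b):
        # Magnitude of the difference of two phasors 120 degrees apart.
        return math.sqrt(a * a + b * b + a * b)

    # L1 touches branches L1/L2 and L3/L1, L2 touches L1/L2 and L2/L3,
    # and L3 touches L2/L3 and L3/L1.
    return combine(i12, i31), combine(i12, i23), combine(i23, i31)


# Four 1800W chassis: two on L1/L2, one each on L2/L3 and L3/L1.
base = delta_line_currents((3600, 1800, 1800))

# Option 1: add 242W on the L1/L2 branch.
option_1 = delta_line_currents((3600 + 242, 1800, 1800))

# Option 2: add 298W on both the L2/L3 and L3/L1 branches.
option_2 = delta_line_currents((3600, 1800 + 298, 1800 + 298))
```

Under these assumptions the base layout puts about 22.9A on L1 and L2, and both options bring those two wires to just under the 24A limit.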

Dealing with an 1800W server is not ideal, because whichever branch we choose to power one would significantly swing the phase balance unfavorably. Thankfully, our Gen X servers take up less space and are more power efficient. A smaller footprint allows us to have more flexibility and fine-grained control over our racks in many of our diverse data centers. Assuming each 1U server is 450W, as if we physically split the 1800W 2U4N into four, each with its own power supply, we’re now able to fit 18 nodes. That’s 2 more nodes than the four 2U4N setup:

Adding two more servers here means we’ve increased our value by 12.5%. While there are more factors not considered here to calculate the Total Cost of Ownership, this is still a great way to show we can be smarter with asset costs.

Cloudflare provides the back-end services so that our customers can enjoy the performance, reliability, security, and global scale of our edge network. Meanwhile, we manage all of our hardware in over 100 countries with various power standards and compliances, and ensure that our physical infrastructure is as reliable as it can be.

There’s no Cloudflare without hardware, and there’s no Cloudflare without power. Want to know more? Watch this Cloudflare TV segment about power: https://cloudflare.tv/event/7E359EDpCZ6mHahMYjEgQl.

Unimog – Cloudflare’s edge load balancer

2020-09-09 David Wragg

Post Syndicated from David Wragg original https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/

As the scale of Cloudflare’s edge network has grown, we sometimes reach the limits of parts of our architecture. About two years ago we realized that our existing solution for spreading load within our data centers could no longer meet our needs. We embarked on a project to deploy a Layer 4 Load Balancer, internally called Unimog, to improve the reliability and operational efficiency of our edge network. Unimog has now been deployed in production for over a year.

This post explains the problems Unimog solves and how it works. Unimog builds on techniques used in other Layer 4 Load Balancers, but there are many details of its implementation that are tailored to the needs of our edge network.

The role of Unimog in our edge network

Cloudflare operates an anycast network, meaning that our data centers in 200+ cities around the world serve the same IP addresses. For example, our own cloudflare.com website uses Cloudflare services, and one of its IP addresses is 104.17.175.85. All of our data centers will accept connections to that address and respond to HTTP requests. By the magic of Internet routing, when you visit cloudflare.com and your browser connects to 104.17.175.85, your connection will usually go to the closest (and therefore fastest) data center.

Inside those data centers are many servers. The number of servers in each varies greatly (the biggest data centers have a hundred times more servers than the smallest ones). The servers run the application services that implement our products (our caching, DNS, WAF, DDoS mitigation, Spectrum, WARP, etc). Within a single data center, any of the servers can handle a connection for any of our services on any of our anycast IP addresses. This uniformity keeps things simple and avoids bottlenecks.

But if any server within a data center can handle any connection, when a connection arrives from a browser or some other client, what controls which server it goes to? That’s the job of Unimog.

There are two main reasons why we need this control. The first is that we regularly move servers in and out of operation, and servers should only receive connections when they are in operation. For example, we sometimes remove a server from operation in order to perform maintenance on it. And sometimes servers are automatically removed from operation because health checks indicate that they are not functioning correctly.

The second reason concerns the management of the load on the servers (by load we mean the amount of computing work each one needs to do). If the load on a server exceeds the capacity of its hardware resources, then the quality of service to users will suffer. The performance experienced by users degrades as a server approaches saturation, and if a server becomes sufficiently overloaded, users may see errors. We also want to prevent servers being underloaded, which would reduce the value we get from our investment in hardware. So Unimog ensures that the load is spread across the servers in a data center. This general idea is called load balancing (balancing because the work has to be done somewhere, and so for the load on one server to go down, the load on some other server must go up).

Note that in this post, we’ll discuss how Cloudflare balances the load on its own servers in edge data centers. But load balancing is a requirement that occurs in many places in distributed computing systems. Cloudflare also has a Layer 7 Load Balancing product to allow our customers to balance load across their servers. And Cloudflare uses load balancing in other places internally.

Deploying Unimog led to a big improvement in our ability to balance the load on our servers in our edge data centers. Here’s a chart for one data center, showing the difference due to Unimog. Each line shows the processor utilization of an individual server (the colour of the lines indicates server model). The load on the servers varies during the day with the activity of users close to this data center. The white line marks the point when we enabled Unimog. You can see that after that point, the load on the servers became much more uniform. We saw similar results when we deployed Unimog to our other data centers.

How Unimog compares to other load balancers

There are a variety of techniques for load balancing. Unimog belongs to a category called Layer 4 Load Balancers (L4LBs). L4LBs direct packets on the network by inspecting information up to layer 4 of the OSI network model, which distinguishes them from the more common Layer 7 Load Balancers.

The advantage of L4LBs is their efficiency. They direct packets without processing the payload of those packets, so they avoid the overheads associated with higher level protocols. For any load balancer, it’s important that the resources consumed by the load balancer are low compared to the resources devoted to useful work. At Cloudflare, we already pay close attention to the efficient implementation of our services, and that sets a high bar for the load balancer that we put in front of those services.

The downside of L4LBs is that they can only control which connections go to which servers. They cannot modify the data going over the connection, which prevents them from participating in higher-level protocols like TLS, HTTP, etc. (in contrast, Layer 7 Load Balancers act as proxies, so they can modify data on the connection and participate in those higher-level protocols).

L4LBs are not new. They are mostly used at companies which have scaling needs that would be hard to meet with L7LBs alone. Google has published a paper about Maglev, Facebook open-sourced Katran, and GitHub has open-sourced their GLB.

Unimog is the L4LB that Cloudflare has built to meet the needs of our edge network. It shares features with other L4LBs, and it is particularly strongly influenced by GLB. But there are some requirements that were not well-served by existing L4LBs, leading us to build our own:

Unimog is designed to run on the same general-purpose servers that provide application services, rather than requiring a separate tier of servers dedicated to load balancing.
It performs dynamic load balancing: measurements of server load are used to adjust the number of connections going to each server, in order to accurately balance load.
It supports long-lived connections that remain established for days.
Virtual IP addresses are managed as ranges (Cloudflare serves hundreds of thousands of IPv4 addresses on behalf of our customers, so it is impractical to configure these individually).
Unimog is tightly integrated with our existing DDoS mitigation system, and the implementation relies on the same XDP technology in the Linux kernel.

The rest of this post describes these features and the design and implementation choices that follow from them in more detail.

For Unimog to balance load, it’s not enough to send the same (or approximately the same) number of connections to each server, because the performance of our servers varies. We regularly update our server hardware, and we’re now on our 10th generation. Once we deploy a server, we keep it in service for as long as it is cost effective, and the lifetime of a server can be several years. It’s not unusual for a single data center to contain a mix of server models, due to expansion and upgrades over time. Processor performance has increased significantly across our server generations. So within a single data center, we need to send different numbers of connections to different servers to utilize the same percentage of their capacity.

It’s also not enough to give each server a fixed share of connections based on static estimates of their capacity. Not all connections consume the same amount of CPU. And there are other activities running on our servers and consuming CPU that are not directly driven by connections from clients. So in order to accurately balance load across servers, Unimog does dynamic load balancing: it takes regular measurements of the load on each of our servers, and uses a control loop that increases or decreases the number of connections going to each server so that their loads converge to an appropriate value.

Refresher: TCP connections

The relationship between TCP packets and connections is central to the operation of Unimog, so we’ll briefly describe that relationship.

(Unimog supports UDP as well as TCP, but for clarity most of this post will focus on the TCP support. We explain how UDP support differs towards the end.)

Here is the outline of a TCP packet:

The TCP connection that this packet belongs to is identified by the four labelled header fields, which span the IPv4/IPv6 (i.e. layer 3) and TCP (i.e. layer 4) headers: the source and destination addresses, and the source and destination ports. Collectively, these four fields are known as the 4-tuple. When we say that Unimog sends a connection to a server, we mean that all the packets with the 4-tuple identifying that connection are sent to that server.

A TCP connection is established via a three-way handshake between the client and the server handling that connection. Once a connection has been established, it is crucial that all the incoming packets for that connection go to that same server. If a TCP packet belonging to the connection is sent to a different server, that server will tell the client that it doesn’t know about the connection by replying with a TCP RST (reset) packet. Upon receiving this notification, the client terminates the connection, probably resulting in the user seeing an error. So a misdirected packet is much worse than a dropped packet. As usual, we consider the network to be unreliable, and it’s fine for occasional packets to be dropped. But even a single misdirected packet can lead to a broken connection.

Cloudflare handles a wide variety of connections on behalf of our customers. Many of these connections carry HTTP, and are typically short lived. But some HTTP connections are used for websockets, and can remain established for hours or days. Our Spectrum product supports arbitrary TCP connections. TCP connections can be terminated or stall for many reasons, and ideally all applications that use long-lived connections would be able to reconnect transparently, and applications would be designed to support such reconnections. But not all applications and protocols meet this ideal, so we strive to maintain long-lived connections. Unimog can maintain connections that last for many days.

Forwarding packets

The previous section explained that Unimog’s function is to steer connections to servers. We’ll now explain how this is implemented.

To start with, let’s consider how one of our data centers might look without Unimog or any other load balancer. Here’s a conceptual view:

Packets arrive from the Internet, and pass through the router, which forwards them on to servers (in reality there is usually additional network infrastructure between the router and the servers, but it doesn’t play a significant role here so we’ll ignore it).

But is such a simple arrangement possible? Can the router spread traffic over servers without some kind of load balancer in between? Routers have a feature called ECMP (equal cost multipath) routing. Its original purpose is to allow traffic to be spread across multiple paths between two locations, but it is commonly repurposed to spread traffic across multiple servers within a data center. In fact, Cloudflare relied on ECMP alone to spread load across servers before we deployed Unimog. ECMP uses a hashing scheme to ensure that packets on a given connection use the same path (Unimog also employs a hashing scheme, so we’ll discuss how this can work in further detail below). But ECMP is vulnerable to changes in the set of active servers, such as when servers go in and out of service. These changes cause rehashing events, which break connections to all the servers in an ECMP group. Also, routers impose limits on the sizes of ECMP groups, which means that a single ECMP group cannot cover all the servers in our larger edge data centers. Finally, ECMP does not allow us to do dynamic load balancing by adjusting the share of connections going to each server. These drawbacks mean that ECMP alone is not an effective approach.
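To make the rehashing problem concrete, here is a minimal Python sketch. It assumes the router picks a server as hash(4-tuple) modulo the group size; real routers use vendor-specific hashing, so `flow_hash`, `ecmp_pick`, and the SHA-256 stand-in are illustrative names and choices, not anything taken from a router or from Unimog.

```python
import hashlib

def flow_hash(four_tuple):
    """Stable hash of a (src_ip, src_port, dst_ip, dst_port) tuple."""
    data = repr(four_tuple).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def ecmp_pick(four_tuple, servers):
    """Hypothetical ECMP-style choice: hash modulo the group size."""
    return servers[flow_hash(four_tuple) % len(servers)]

def moved_fraction(flows, before, after):
    """Fraction of flows whose server changes when the group changes."""
    moved = sum(1 for f in flows if ecmp_pick(f, before) != ecmp_pick(f, after))
    return moved / len(flows)

# 10,000 synthetic client flows towards one address and port.
flows = [("198.51.100.%d" % (i % 250), 1024 + i, "192.0.2.1", 443)
         for i in range(10000)]
before = ["s0", "s1", "s2", "s3"]
after = ["s0", "s1", "s2"]  # s3 goes out of service
fraction = moved_fraction(flows, before, after)
```

Under this modulo assumption, removing one server out of four moves roughly three quarters of the flows, including many that were on servers that stayed in service. That is why a rehash can break connections across the whole group and not only those on the server that left.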

Ideally, to overcome the drawbacks of ECMP, we could program the router with the appropriate logic to direct connections to servers in the way we want. But although programmable network data planes have been a hot research topic in recent years, commodity routers are still essentially fixed-function devices.

We can work around the limitations of routers by having the router send the packets to some load balancing servers, and then programming those load balancers to forward packets as we want. If the load balancers all act on packets in a consistent way, then it doesn’t matter which load balancer gets which packets from the router (so we can use ECMP to spread packets across the load balancers). That suggests an arrangement like this:

And indeed L4LBs are often deployed like this.

Instead, Unimog makes every server into a load balancer. The router can send any packet to any server, and that initial server will forward the packet to the right server for that connection:

We have two reasons to favour this arrangement:

First, in our edge network, we avoid specialised roles for servers. We run the same software stack on the servers in our edge network, providing all of our product features, whether DDoS attack prevention, website performance features, Cloudflare Workers, WARP, etc. This uniformity is key to the efficient operation of our edge network: we don’t have to manage how many load balancers we have within each of our data centers, because all of our servers act as load balancers.

The second reason relates to stopping attacks. Cloudflare’s edge network is the target of incessant attacks. Some of these attacks are volumetric – large packet floods which attempt to overwhelm the ability of our data centers to process network traffic from the Internet, and so impact our ability to service legitimate traffic. To successfully mitigate such attacks, it’s important to filter out attack packets as early as possible, minimising the resources they consume. This means that our attack mitigation system needs to occur before the forwarding done by Unimog. That mitigation system is called l4drop, and we’ve written about it before. l4drop and Unimog are closely integrated. Because l4drop runs on all of our servers, and because l4drop comes before Unimog, it’s natural for Unimog to run on all of our servers too.

XDP and xdpd

Unimog implements packet forwarding using a Linux kernel facility called XDP. XDP allows a program to be attached to a network interface, and the program gets run for every packet that arrives, before it is processed by the kernel’s main network stack. The XDP program returns an action code to tell the kernel what to do with the packet:

PASS: Pass the packet on to the kernel’s network stack for normal processing.
DROP: Drop the packet. This is the basis for l4drop.
TX: Transmit the packet back out of the network interface. The XDP program can modify the packet data before transmission. This action is the basis for Unimog forwarding.

XDP programs run within the kernel, making this an efficient approach even at high packet rates. XDP programs are expressed as eBPF bytecode, and run within an in-kernel virtual machine. Upon loading an XDP program, the kernel compiles its eBPF code into machine code. The kernel also verifies the program to check that it does not compromise security or stability. eBPF is not only used in the context of XDP: many recent Linux kernel innovations employ eBPF, as it provides a convenient and efficient way to extend the behaviour of the kernel.

XDP is much more convenient than alternative approaches to packet-level processing, particularly in our context where the servers involved also have many other tasks. We have continued to enhance Unimog since its initial deployment. Our deployment model for new versions of our Unimog XDP code is essentially the same as for userspace services, and we are able to deploy new versions on a weekly basis if needed. Also, established techniques for optimizing the performance of the Linux network stack provide good performance for XDP.

There are two main alternatives for efficient packet-level processing:

Kernel-bypass networking (such as DPDK), where a program in userspace manages a network interface (or some part of one) directly without the involvement of the kernel. This approach works best when servers can be dedicated to a network function (due to the need to dedicate processor or network interface hardware resources, and awkward integration with the normal kernel network stack; see our old post about this). But we avoid putting servers in specialised roles. (GitHub’s open-source GLB uses DPDK, and this is one of the main factors that made GLB unsuitable for us.)
Kernel modules, where code is added to the kernel to perform the necessary network functions. The Linux IPVS (IP Virtual Server) subsystem falls into this category. But developing, testing, and deploying kernel modules is cumbersome compared to XDP.

The following diagram shows an overview of our use of XDP. Both l4drop and Unimog are implemented by an XDP program. l4drop matches attack packets, and uses the DROP action to discard them. Unimog forwards packets, using the TX action to resend them. Packets that are not dropped or forwarded pass through to the normal Linux network stack. To support our elaborate use of XDP, we have developed the xdpd daemon, which performs the necessary supervisory and support functions for our XDP programs.

Rather than a single XDP program, we have a chain of XDP programs that must be run for each packet (l4drop, Unimog, and others we have not covered here). One of the responsibilities of xdpd is to prepare these programs, and to make the appropriate system calls to load them and assemble the full chain.
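The flow of a packet through such a chain can be modelled in a few lines of Python. This is only a conceptual sketch: the real programs are eBPF running in the kernel, and the packet fields `attack` and `to_vip`, the function names, and the rule that the first verdict other than PASS ends the chain are all assumptions made for illustration. The numeric values match `enum xdp_action` in the Linux UAPI headers.

```python
from enum import IntEnum

class XdpAction(IntEnum):
    # Values mirror enum xdp_action in <linux/bpf.h>.
    ABORTED = 0
    DROP = 1
    PASS = 2
    TX = 3
    REDIRECT = 4

def l4drop(packet):
    """Stand-in for attack filtering: drop packets flagged as attack traffic."""
    return XdpAction.DROP if packet.get("attack") else XdpAction.PASS

def unimog(packet):
    """Stand-in for forwarding: retransmit packets addressed to a VIP."""
    return XdpAction.TX if packet.get("to_vip") else XdpAction.PASS

def run_chain(packet, programs=(l4drop, unimog)):
    """Run programs in order; the first verdict other than PASS is final."""
    for program in programs:
        action = program(packet)
        if action != XdpAction.PASS:
            return action
    return XdpAction.PASS  # falls through to the normal network stack
```

Because l4drop comes first in the chain, an attack packet addressed to a VIP is dropped before any forwarding work is spent on it.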

Our XDP programs come from two sources. Some are developed in a conventional way: engineers write C code, our build system compiles it (with clang) to eBPF ELF files, and our release system deploys those files to our servers. Our Unimog XDP code works like this. In contrast, the l4drop XDP code is dynamically generated by xdpd based on information it receives from attack detection systems.

xdpd has many other duties to support our use of XDP:

XDP programs can be supplied with data using data structures called maps. xdpd populates the maps needed by our programs, based on information received from control planes.
Programs (for instance, our Unimog XDP program) may depend upon configuration values which are fixed while the program runs, but do not have universal values known at the time their C code was compiled. It would be possible to supply these values to the program via maps, but that would be inefficient (retrieving a value from a map requires a call to a helper function). So instead, xdpd will fix up the eBPF program to insert these constants before it is loaded.
Cloudflare carefully monitors the behaviour of all our software systems, and this includes our XDP programs: they emit metrics (via another use of maps), which xdpd exposes to our metrics and alerting system (Prometheus).
When we deploy a new version of xdpd, it gracefully upgrades in such a way that there is no interruption to the operation of Unimog or l4drop.

Although the XDP programs are written in C, xdpd itself is written in Go. Much of its code is specific to Cloudflare. But in the course of developing xdpd, we have collaborated with Cilium to develop https://github.com/cilium/ebpf, an open source Go library that provides the operations needed by xdpd for manipulating and loading eBPF programs and related objects. We’re also collaborating with the Linux eBPF community to share our experience, and extend the core eBPF technology in ways that make features of xdpd obsolete.

In evaluating the performance of Unimog, our main concern is efficiency: that is, the resources consumed for load balancing relative to the resources used for customer-visible services. Our measurements show that Unimog costs less than 1% of the processor utilization, compared to a scenario where no load balancing is in use. Other L4LBs, intended to be used with servers dedicated to load balancing, may place more emphasis on maximum throughput of packets. Nonetheless, our experience with Unimog and XDP in general indicates that the throughput is more than adequate for our needs, even during large volumetric attacks.

Unimog is not the first L4LB to use XDP. In 2018, Facebook open sourced Katran, their XDP-based L4LB data plane. We considered the possibility of reusing code from Katran. But it would not have been worthwhile: the core C code needed to implement an XDP-based L4LB is relatively modest (about 1000 lines of C, both for Unimog and Katran). Furthermore, we had requirements that were not met by Katran, and we also needed to integrate with existing components and systems at Cloudflare (particularly l4drop). So very little of the code could have been reused as-is.

Encapsulation

As discussed at the start of this post, clients make connections to one of our edge data centers with a destination IP address that can be served by any one of our servers. These addresses that do not correspond to a specific server are known as virtual IPs (VIPs). When our Unimog XDP program forwards a packet destined to a VIP, it must replace that VIP address with the direct IP (DIP) of the appropriate server for the connection, so that when the packet is retransmitted it will reach that server. But it is not sufficient to overwrite the VIP in the packet headers with the DIP, as that would hide the original destination address from the server handling the connection (the original destination address is often needed to correctly handle the connection).

Instead, the packet must be encapsulated: Another set of packet headers is prepended to the packet, so that the original packet becomes the payload in this new packet. The DIP is then used as the destination address in the outer headers, but the addressing information in the headers of the original packet is preserved. The encapsulated packet is then retransmitted. Once it reaches the target server, it must be decapsulated: the outer headers are stripped off to yield the original packet as if it had arrived directly.

Encapsulation is a general concept in computer networking, and is used in a variety of contexts. The headers to be added to the packet by encapsulation are defined by an encapsulation format. Many different encapsulation formats have been defined within the industry, tailored to the requirements in specific contexts. Unimog uses a format called GUE (Generic UDP Encapsulation), in order to allow us to re-use the glb-redirect component from GitHub’s GLB (glb-redirect is discussed below).

GUE is a relatively simple encapsulation format. It encapsulates within a UDP packet, placing a GUE-specific header between the outer IP/UDP headers and the payload packet to allow extension data to be carried (and we’ll see how Unimog takes advantage of this):

When an encapsulated packet arrives at a server, the encapsulation process must be reversed. This step is called decapsulation. The headers that were added during the encapsulation process are removed, leaving the original packet to be processed by the network stack as if it had arrived directly from the client.

An issue that can arise with encapsulation is hitting limits on the maximum packet size, because the encapsulation process makes packets larger. The de-facto maximum packet size on the Internet is 1500 bytes, and not coincidentally this is also the maximum packet size on Ethernet networks. For Unimog, encapsulating a 1500-byte packet results in a 1536-byte packet. To allow for these enlarged encapsulated packets, we have enabled jumbo frames on the networks inside our data centers, so that the 1500-byte limit only applies to packets headed out to the Internet.
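The following Python sketch shows one header layout that accounts for the 36 bytes of overhead: a 20-byte outer IPv4 header, an 8-byte UDP header, a 4-byte GUE header, and a 4-byte extension field. The post does not spell out this breakdown, so it is an assumption, as are the UDP port 6080, the source port, and the simplified field values (checksums are left at zero, so these bytes are not valid on a real network).

```python
import socket
import struct

GUE_PORT = 6080      # assumption: the UDP port registered for GUE
IPPROTO_UDP = 17
IPPROTO_IPIP = 4     # the inner packet is assumed to be IPv4

def encapsulate(inner, outer_src, outer_dst, ext_addr):
    """Prepend outer IPv4 + UDP + GUE headers to the packet `inner`."""
    ext = socket.inet_aton(ext_addr)  # 4-byte extension field
    # GUE header: header length in 32-bit words, inner protocol, flags.
    gue = struct.pack("!BBH", len(ext) // 4, IPPROTO_IPIP, 0) + ext
    udp_len = 8 + len(gue) + len(inner)
    udp = struct.pack("!HHHH", 49152, GUE_PORT, udp_len, 0)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + udp_len, 0, 0, 64,
                     IPPROTO_UDP, 0,
                     socket.inet_aton(outer_src), socket.inet_aton(outer_dst))
    return ip + udp + gue + inner

def decapsulate(packet):
    """Strip the outer headers; return (extension address, inner packet)."""
    ip_header_len = (packet[0] & 0x0F) * 4
    gue_offset = ip_header_len + 8  # skip the UDP header
    ext_len = (packet[gue_offset] & 0x1F) * 4
    ext = packet[gue_offset + 4:gue_offset + 4 + ext_len]
    ext_addr = socket.inet_ntoa(ext[:4]) if ext_len >= 4 else None
    return ext_addr, packet[gue_offset + 4 + ext_len:]
```

With this layout, `len(encapsulate(bytes(1500), "10.0.0.1", "10.0.0.2", "10.0.0.3"))` is 1536, which is over the 1500-byte limit and therefore needs jumbo frames inside the data center.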

Forwarding logic

So far, we have described the technology used to implement the Unimog load balancer, but not how our Unimog XDP program selects the DIP address when forwarding a packet. This section describes the basic scheme. But as we’ll see, there is a problem, so then we’ll describe how this scheme is elaborated to solve that problem.

In outline, our Unimog XDP program processes each packet in the following way:

Determine whether the packet is destined for a VIP address. Not all of the packets arriving at a server are for VIP addresses. Other packets are passed through for normal handling by the kernel’s network stack. (xdpd obtains the VIP address ranges from the Unimog control plane.)
Determine the DIP for the server handling the packet’s connection.
Encapsulate the packet, and retransmit it to the DIP.

In step 2, note that all the load balancers must act consistently – when forwarding packets, they must all agree about which connections go to which servers. The rate of new connections arriving at a data center is large, so it’s not practical for load balancers to agree by communicating information about connections amongst themselves. Instead L4LBs adopt designs which allow the load balancers to reach consistent forwarding decisions independently. To do this, they rely on hashing schemes: Take the 4-tuple identifying the packet’s connection, put it through a hash function to obtain a key (the hash function ensures that these key values are uniformly distributed), then perform some kind of lookup into a data structure to turn the key into the DIP for the target server.

Unimog uses such a scheme, with a data structure that is simple compared to some other L4LBs. We call this data structure the forwarding table, and it consists of an array where each entry contains a DIP specifying the target server for the relevant packets (we call these entries buckets). The forwarding table is generated by the Unimog control plane and broadcast to the load balancers (more on this below), so that it has the same contents on all load balancers.

To look up a packet’s key in the forwarding table, the low N bits from the key are used as the index for a bucket (the forwarding table is always a power-of-2 in size):
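As a rough illustration of this lookup, here is a short Python sketch. The post does not say which hash function Unimog uses, so SHA-256 is only a stand-in, and `connection_key` and `lookup` are hypothetical names.

```python
import hashlib

def connection_key(src_ip, src_port, dst_ip, dst_port):
    """Hash the 4-tuple to a 32-bit key (SHA-256 is a stand-in hash)."""
    data = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def lookup(table, key):
    """Use the low N bits of the key to index a table of 2**N buckets."""
    size = len(table)
    if size == 0 or size & (size - 1) != 0:
        raise ValueError("forwarding table size must be a power of 2")
    return table[key & (size - 1)]

# A toy table with 8 buckets spread over three servers.
table = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1",
         "10.0.0.2", "10.0.0.3", "10.0.0.1", "10.0.0.2"]
key = connection_key("198.51.100.7", 40000, "192.0.2.1", 443)
dip = lookup(table, key)
```

Because the table size is a power of 2, masking with `size - 1` selects the low N bits and gives the same index as `key % size` without a division. Any load balancer holding the same table computes the same DIP for the same 4-tuple.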

Note that this approach does not provide per-connection control – each bucket typically applies to many connections. All load balancers in a data center use the same forwarding table, so they all forward packets in a consistent manner. This means it doesn’t matter which packets are sent by the router to which servers, and so ECMP re-hashes are a non-issue. And because the forwarding table is immutable and simple in structure, lookups are fast.

Although the above description only discusses a single forwarding table, Unimog supports multiple forwarding tables, each one associated with a trafficset – the traffic destined for a particular service. Ranges of VIP addresses are associated with a trafficset. Each trafficset has its own configuration settings and forwarding tables. This gives us the flexibility to differentiate how Unimog behaves for different services.

Precise load balancing requires the ability to make fine adjustments to the number of connections arriving at each server. So we make the number of buckets in the forwarding table more than 100 times the number of servers. Our data centers can contain hundreds of servers, and so it is normal for a Unimog forwarding table to have tens of thousands of buckets. The DIP for a given server is repeated across many buckets in the forwarding table, and by increasing or decreasing the number of buckets that refer to a server, we can control the share of connections going to that server. Not all buckets will correspond to exactly the same number of connections at a given point in time (the properties of the hash function make this a statistical matter). But experience with Unimog has demonstrated that the relationship between the number of buckets and resulting server load is sufficiently strong to allow for good load balancing.
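One way to picture the relationship between bucket counts and connection share is the sketch below, which fills a table so that each server's number of buckets is proportional to a weight. The post does not describe how the control plane lays out buckets, so `build_table`, the weights, and the largest-remainder rounding are assumptions for illustration only.

```python
def build_table(weights, size):
    """Assign `size` buckets to servers in proportion to `weights`."""
    total = sum(weights.values())
    quotas = {s: w * size / total for s, w in weights.items()}
    counts = {s: int(q) for s, q in quotas.items()}
    # Hand out buckets lost to rounding, largest fractional part first.
    leftover = size - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - counts[s],
                          reverse=True)
    for s in by_remainder[:leftover]:
        counts[s] += 1
    table = []
    for s in sorted(counts):
        table.extend([s] * counts[s])
    return table

# Server c is assumed to have twice the capacity of a and b.
table = build_table({"a": 1, "b": 1, "c": 2}, 1024)
```

With 1024 buckets, moving a single bucket from one server to another shifts about 0.1% of new connections, which is the kind of fine adjustment that a large table makes possible.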

But as mentioned, there is a problem with this scheme as presented so far. Updating a forwarding table, and changing the DIPs in some buckets, would break connections that hash to those buckets (because packets on those connections would get forwarded to a different server after the update). But one of the requirements for Unimog is to allow us to change which servers get new connections without impacting the existing connections. For example, sometimes we want to drain the connections to a server, maintaining the existing connections to that server but not forwarding new connections to it, in the expectation that many of the existing connections will terminate of their own accord. The next section explains how we fix this scheme to allow such changes.

Maintaining established connections

To make changes to the forwarding table without breaking established connections, Unimog adopts the “daisy chaining” technique described in the paper Stateless Datacenter Load-balancing with Beamer.

To understand how the Beamer technique works, let’s look at what can go wrong when a forwarding table changes: imagine the forwarding table is updated so that a bucket which contained the DIP of server A now refers to server B. A packet that would formerly have been sent to A by the load balancers is now sent to B. If that packet initiates a new connection (it’s a TCP SYN packet), there’s no problem – server B will continue the three-way handshake to complete the new connection. On the other hand, if the packet belongs to a connection established before the change, then the TCP implementation of server B has no matching TCP socket, and so sends a RST back to the client, breaking the connection.

This explanation hints at a solution: the problem occurs when server B receives a forwarded packet that does not match a TCP socket. If we could change its behaviour in this case to forward the packet a second time to the DIP of server A, that would allow the connection to server A to be preserved. For this to work, server B needs to know the DIP for the bucket before the change.

To accomplish this, we extend the forwarding table so that each bucket has two slots, each containing the DIP for a server. The first slot contains the current DIP, which is used by the load balancer to forward packets as discussed (and here we refer to this forwarding as the first hop). The second slot preserves the previous DIP (if any), in order to allow the packet to be forwarded again on a second hop when necessary.

For example, imagine we have a forwarding table that refers to servers A, B, and C, and then it is updated to stop new connections going to server A, but maintaining established connections to server A. This is achieved by replacing server A’s DIP in the first slot of any buckets where it appears, but preserving it in the second slot:
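The same update can be sketched in Python, modelling each bucket as a (first slot, second slot) pair. The function name `drain` and the round-robin choice of replacement servers are assumptions; the post does not say how replacements are picked.

```python
def drain(table, server, replacements):
    """Return a new table where `server` no longer receives new connections.

    Each bucket is a (first_slot, second_slot) pair. Wherever `server`
    is in the first slot, a replacement takes over the first slot and
    `server` moves to the second slot so established connections survive.
    """
    new_table = []
    used = 0
    for first, second in table:
        if first == server:
            new_table.append((replacements[used % len(replacements)], server))
            used += 1
        else:
            new_table.append((first, second))
    return new_table

table = [("A", None), ("B", None), ("C", None), ("A", None)]
drained = drain(table, "A", ["B", "C"])
# drained == [("B", "A"), ("B", None), ("C", None), ("C", "A")]
```

After the update, server A appears only in second slots, so it can still be reached on a second hop but is never the first choice for a new connection.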

In addition to extending the forwarding table, this approach requires a component on each server to forward packets on the second hop when necessary. This diagram shows where this redirector fits into the path a packet can take:

The redirector follows some simple logic to decide whether to process a packet locally on the first-hop server or to forward it on to the second-hop server:

If the packet is a SYN packet, initiating a new connection, then it is always processed by the first-hop server. This ensures that new connections go to the first-hop server.
For other packets, the redirector checks whether the packet belongs to a connection with a corresponding TCP socket on the first-hop server. If so, it is processed by that server.
Otherwise, the packet has no corresponding TCP socket on the first-hop server. So it is forwarded on to the second-hop server to be processed there (in the expectation that it belongs to some connection established on the second-hop server that we wish to maintain).
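These three rules translate almost directly into code. The Python sketch below is a model of the decision only, not the redirector itself; the function name, its arguments, and the behaviour when there is no second-hop DIP (process locally) are assumptions.

```python
def redirect_decision(is_syn, has_local_socket, second_hop_dip):
    """Decide where a packet arriving at the first-hop server is handled.

    Returns ("local", None) or ("forward", dip).
    """
    if is_syn:
        # New connections always stay on the first-hop server.
        return ("local", None)
    if has_local_socket:
        # The connection was established on this server.
        return ("local", None)
    if second_hop_dip is not None:
        # Probably an older connection established on the second-hop server.
        return ("forward", second_hop_dip)
    # Assumption: with no second hop recorded, handle the packet locally.
    return ("local", None)
```

For example, `redirect_decision(False, False, "10.0.0.9")` returns `("forward", "10.0.0.9")`, while any SYN packet is handled locally regardless of the other arguments.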

In that last step, the redirector needs to know the DIP for the second hop. To avoid the need for the redirector to do forwarding table lookups, the second-hop DIP is placed into the encapsulated packet by the Unimog XDP program (which already does a forwarding table lookup, so it has easy access to this value). This second-hop DIP is carried in a GUE extension header, so that it is readily available to the redirector if it needs to forward the packet again.

This second hop, when necessary, does have a cost. But in our data centers, the fraction of forwarded packets that take the second hop is usually less than 1% (despite the significance of long-lived connections in our context). The result is that the practical overhead of the second hops is modest.

When we initially deployed Unimog, we adopted the glb-redirect iptables module from GitHub’s GLB to serve as the redirector component. In fact, some implementation choices in Unimog, such as the use of GUE, were made in order to facilitate this re-use. glb-redirect worked well for us initially, but subsequently we wanted to enhance the redirector logic. glb-redirect is a custom Linux kernel module, and developing and deploying changes to kernel modules is more difficult for us than for eBPF-based components such as our XDP programs. This is not merely due to Cloudflare having invested more engineering effort in software infrastructure for eBPF; it also results from the more explicit boundary between the kernel and eBPF programs (for example, we are able to run the same eBPF programs on a range of kernel versions without recompilation). We wanted to achieve the same ease of development for the redirector as for our XDP programs.

To that end, we decided to write an eBPF replacement for glb-redirect. While the redirector could be implemented within XDP, like our load balancer, practical concerns led us to implement it as a TC classifier program instead (TC is the traffic control subsystem within the Linux network stack). A downside to XDP is that the packet contents prior to processing by the XDP program are not visible using conventional tools such as tcpdump, complicating debugging. TC classifiers do not have this downside, and in the context of the redirector, which passes most packets through, the performance advantages of XDP would not be significant.

The result is cls-redirect, a redirector implemented as a TC classifier program. We have contributed our cls-redirect code as part of the Linux kernel test suite. In addition to implementing the redirector logic, cls-redirect also implements decapsulation, removing the need to separately configure GUE tunnel endpoints for this purpose.

There are some features suggested in the Beamer paper that Unimog does not implement:

Beamer embeds generation numbers in the encapsulated packets to address a potential corner case where an ECMP rehash event occurs at the same time as a forwarding table update is propagating from the control plane to the load balancers. Given the combination of circumstances required for a connection to be impacted by this issue, we believe that in our context the number of affected connections is negligible, and so the added complexity of the generation numbers is not worthwhile.
In the Beamer paper, the concept of daisy-chaining encompasses third hops etc. to preserve connections across a series of changes to a bucket. Unimog only uses two hops (the first and second hops above), so in general it can only preserve connections across a single update to a bucket. But our experience is that even with only two hops, a careful strategy for updating the forwarding tables permits connection lifetimes of days.

To elaborate on this second point: when the control plane is updating the forwarding table, it often has some choice in which buckets to change, depending on the event that led to the update. For example, if a server is being brought into service, then some buckets must be assigned to it (by placing the DIP for the new server in the first slot of the bucket). But there is a choice about which buckets. A strategy of choosing the least-recently modified buckets will tend to minimise the impact to connections.
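A least-recently-modified strategy could look something like the following Python sketch. The per-bucket timestamp list and the function name are hypothetical; the post only states the selection principle.

```python
def assign_new_server(table, last_modified, new_server, count, now):
    """Give `new_server` the first slot of the `count` oldest buckets.

    `table` is a list of (first_slot, second_slot) pairs and
    `last_modified` holds one timestamp per bucket. Both are updated
    in place. The previous first-slot server moves to the second slot.
    """
    oldest = sorted(range(len(table)), key=lambda i: last_modified[i])[:count]
    for i in oldest:
        previous_first, _ = table[i]
        table[i] = (new_server, previous_first)
        last_modified[i] = now
    return oldest

table = [("A", None), ("B", None), ("A", None), ("B", None)]
last_modified = [30, 10, 20, 40]
changed = assign_new_server(table, last_modified, "C", 2, now=50)
# changed == [1, 2]; buckets 1 and 2 now have C in the first slot
```

Overwriting a bucket's second slot discards whatever server was recorded there, so picking the buckets that have gone longest without a change gives connections to that discarded server the most time to have finished.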

Furthermore, when updating the forwarding table to adjust the balance of load between servers, Unimog often uses a novel trick: due to the redirector logic, exchanging the first-hop and second-hop DIPs for a bucket only affects which server receives new connections for that bucket, and never impacts any established connections. Unimog is able to achieve load balancing in our edge data centers largely through forwarding table changes of this type.
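A small simulation shows why the exchange is safe. It combines a two-slot bucket with a simplified form of the redirector rules described earlier; `deliver`, `swap`, and the `sockets` mapping are illustrative names, not part of Unimog.

```python
def deliver(bucket, sockets, conn, is_syn=False):
    """Return the server that ends up handling a packet for `conn`."""
    first, second = bucket
    if is_syn or conn in sockets[first]:
        return first   # handled on the first hop
    return second      # forwarded on the second hop

def swap(bucket):
    """Exchange the first-hop and second-hop DIPs of a bucket."""
    first, second = bucket
    return (second, first)

# c1 is established on server A and c2 on server B.
sockets = {"A": {"c1"}, "B": {"c2"}}
bucket = ("A", "B")
swapped = swap(bucket)
```

Before and after the swap, packets for `c1` reach A and packets for `c2` reach B. Only the destination of a new SYN changes, from A to B.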

Control plane

So far, we have discussed the Unimog data plane – the part that processes network packets. But much of the development effort on Unimog has been devoted to the control plane – the part that generates the forwarding tables used by the data plane. In order to correctly maintain the forwarding tables, the control plane consumes information from multiple sources:

Server information: Unimog needs to know the set of servers present in a data center, some key information about each one (such as their DIP addresses), and their operational status. It also needs signals about transitional states, such as when a server is being withdrawn from service, in order to gracefully drain connections (preventing the server from receiving new connections, while maintaining its established connections).
Health: Unimog should only send connections to servers that are able to correctly handle those connections, otherwise those servers should be removed from the forwarding tables. To ensure this, it needs health information at the node level (indicating that a server is available) and at the service level (indicating that a service is functioning normally on a server).
Load: in order to balance load, Unimog needs information about the resource utilization on each server.
IP address information: Cloudflare serves hundreds of thousands of IPv4 addresses, and these are something that we have to treat as a dynamic resource rather than something statically configured.

The control plane is implemented by a process called the conductor. In each of our edge data centers, there is one active conductor, but there are also standby instances that will take over if the active instance goes away.

We use Hashicorp’s Consul in a number of ways in the Unimog control plane (we have an independent Consul server cluster in each data center):

Consul provides a key-value store, with support for blocking queries so that changes to values can be received promptly. We use this to propagate the forwarding tables and VIP address information from the conductor to xdpd on the servers.
Consul provides server- and service-level health checks. We use this as the source of health information for Unimog.
The conductor stores its state in the Consul KV store, and uses Consul’s distributed locks to ensure that only one conductor instance is active.

The conductor obtains server load information from Prometheus, which we already use for metrics throughout our systems. It balances the load across the servers using a control loop, periodically adjusting the forwarding tables to send more connections to underloaded servers and fewer connections to overloaded servers. The load for a server is defined by a Prometheus metric expression which measures processor utilization (with some intricacies to better handle characteristics of our workloads). The determination of whether a server is underloaded or overloaded is based on comparison with the average value of the load metric, and the adjustments made to the forwarding table are proportional to the deviation from the average. So the result of the feedback loop is that the load metric for all servers converges on the average.
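One iteration of such a proportional control loop might look like the following sketch. The `weight` field (standing in for a server's share of buckets), the `gain` constant and the function name are assumptions made for illustration; the real load metric and adjustment rules are more intricate.

```javascript
// One iteration of a proportional controller. `weight` stands for a server's
// share of forwarding-table buckets; `gain` is a hypothetical tuning constant.
function rebalance(servers, gain = 0.5) {
  const mean = servers.reduce((sum, s) => sum + s.load, 0) / servers.length;
  if (mean === 0) return servers.map((s) => ({ ...s })); // nothing to balance
  return servers.map((s) => {
    const deviation = (s.load - mean) / mean; // > 0 means overloaded
    // Shrink the share of overloaded servers and grow that of underloaded ones,
    // in proportion to how far each is from the average.
    return { ...s, weight: Math.max(0, s.weight * (1 - gain * deviation)) };
  });
}

const next = rebalance([
  { name: "a", load: 0.8, weight: 1 },
  { name: "b", load: 0.5, weight: 1 },
  { name: "c", load: 0.5, weight: 1 },
]);
for (const s of next) console.log(s.name, s.weight.toFixed(3));
```

Running this repeatedly against fresh load measurements is what nudges every server toward the average; a gain below 1 keeps each step small so the loop does not overshoot.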

Finally, the conductor queries internal Cloudflare APIs to obtain the necessary information on servers and addresses.

Unimog is a critical system: incorrect, poorly adjusted or stale forwarding tables could cause incoming network traffic to a data center to be dropped, or servers to be overloaded, to the point that a data center would have to be removed from service. To maintain a high quality of service and minimise the overhead of managing our many edge data centers, we have to be able to upgrade all components. So to the greatest extent possible, all components are able to tolerate brief absences of the other components without any impact to service. In some cases this is possible through careful design. In other cases, it requires explicit handling. For example, we have found that Consul can temporarily report inaccurate health information for a server and its services when the Consul agent on that server is restarted (for example, in order to upgrade Consul). So we implemented the necessary logic in the conductor to detect and disregard these transient health changes.

Unimog also forms a complex system with feedback loops: the conductor reacts to its observations of the behaviour of the servers, and the servers react to the control information they receive from the conductor. This can lead to behaviours of the overall system that are hard to anticipate or test for. For instance, not long after we deployed Unimog we encountered surprising behaviour when data centers became overloaded. This is of course a scenario that we strive to avoid, and we have automated systems to remove traffic from overloaded data centers if it does occur. But if a data center became sufficiently overloaded, then health information from its servers would indicate that many servers were degraded to the point that Unimog would stop sending new connections to those servers. Under normal circumstances, this is the correct reaction to a degraded server. But if enough servers become degraded, diverting new connections to other servers would mean those servers became degraded, while the original servers were able to recover. So it was possible for a data center that became temporarily overloaded to get stuck in a state where servers oscillated between healthy and degraded, even after the level of demand on the data center had returned to normal. To correct this issue, the conductor now has logic to distinguish between isolated degraded servers and such data center-wide problems. We have continued to improve Unimog in response to operational experience, ensuring that it behaves in a predictable manner over a wide range of conditions.

UDP Support

So far, we have described Unimog’s support for directing TCP connections. But Unimog also supports UDP traffic. UDP does not have explicit connections between clients and servers, so how it works depends upon how the UDP application exchanges packets between the client and server. There are a few cases of interest:

Request-response UDP applications

Some applications, such as DNS, use a simple request-response pattern: the client sends a request packet to the server, and expects a response packet in return. Here, there is nothing corresponding to a connection (the client only sends a single packet, so there is no requirement to make sure that multiple packets arrive at the same server). But Unimog can still provide value by spreading the requests across our servers.

To cater to this case, Unimog operates as described in previous sections, hashing the 4-tuple from the packet headers (the source and destination IP addresses and ports). But the Beamer daisy-chaining technique that allows connections to be maintained does not apply here, and so the buckets in the forwarding table only have a single slot.
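The lookup for this single-slot case can be sketched as below. FNV-1a is used purely as a stand-in hash (the text does not say which hash function Unimog uses), and the packet shape and table size are hypothetical.

```javascript
// FNV-1a 32-bit hash, used here only as a stand-in hash function.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// Map a packet's 4-tuple to a bucket index.
function bucketFor(packet, bucketCount) {
  const key = [packet.srcIp, packet.srcPort, packet.dstIp, packet.dstPort].join("|");
  return fnv1a(key) % bucketCount;
}

// Single-slot forwarding table for a request-response trafficset such as DNS.
const dips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"];
const forwardingTable = Array.from({ length: 256 }, (_, i) => dips[i % dips.length]);

const query = { srcIp: "198.51.100.7", srcPort: 53124, dstIp: "192.0.2.1", dstPort: 53 };
const bucket = bucketFor(query, forwardingTable.length);
console.log(`bucket ${bucket} -> ${forwardingTable[bucket]}`);
```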

UDP applications with flows

Some UDP applications have long-lived flows of packets between the client and server. Like TCP connections, these flows are identified by the 4-tuple. It is necessary that such flows go to the same server (even when Cloudflare is just passing a flow through to the origin server, it is convenient for detecting and mitigating certain kinds of attack to have that flow pass through a single server within one of Cloudflare’s data centers).

It’s possible to treat these flows by hashing the 4-tuple, skipping the Beamer daisy-chaining technique as for request-response applications. But then adding servers will cause some flows to change servers (this would effectively be a form of consistent hashing). For UDP applications, we can’t say in general what impact this has, as we can for TCP connections. But it’s possible that it causes some disruption, so it would be nice to avoid this.

So Unimog adapts the daisy-chaining technique to apply it to UDP flows. The outline remains similar to that for TCP: the same redirector component on each server decides whether to send a packet on a second hop. But UDP does not have anything corresponding to TCP’s SYN packet that indicates a new connection. So for UDP, the part that depends on SYNs is removed, and the logic applied for each packet becomes:

The redirector checks whether the packet belongs to a connection with a corresponding UDP socket on the first-hop server. If so, it is processed by that server.
Otherwise, the packet has no corresponding UDP socket on the first-hop server. So it is forwarded on to the second-hop server to be processed there (in the expectation that it belongs to some flow established on the second-hop server that we wish to maintain).
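The two steps above can be sketched as a per-packet decision. The `localFlows` set is a hypothetical stand-in for the socket lookup the redirector performs on the first-hop server.

```javascript
// Key a flow by its 4-tuple.
function flowKey(p) {
  return `${p.srcIp}:${p.srcPort}->${p.dstIp}:${p.dstPort}`;
}

// Per-packet decision on the first-hop server for a UDP trafficset.
// `localFlows` is a hypothetical set of 4-tuples that have UDP sockets here.
function redirectUdp(packet, localFlows) {
  return localFlows.has(flowKey(packet)) ? "process-locally" : "forward-to-second-hop";
}

const known = { srcIp: "198.51.100.7", srcPort: 40000, dstIp: "192.0.2.1", dstPort: 443 };
const unknown = { ...known, srcPort: 40001 };
const localFlows = new Set([flowKey(known)]);

console.log(redirectUdp(known, localFlows));
console.log(redirectUdp(unknown, localFlows));
```

Note that there is no way to tell a brand-new flow from one established elsewhere, so both take the second hop in this model.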

Although the change compared to the TCP logic is not large, it has the effect of switching the roles of the first- and second-hop servers: for UDP, new flows go to the second-hop server. The Unimog control plane has to take account of this when it updates a forwarding table. When it introduces a server into a bucket, that server should receive new connections or flows. For a TCP trafficset, this means it becomes the first-hop server. For a UDP trafficset, it must become the second-hop server.
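A minimal sketch of that control-plane rule follows. The function name and bucket shape are hypothetical, and the choice of which existing server keeps the other slot is an assumption: in each case the sketch keeps the server that was receiving new connections or flows, since it holds the most recent state.

```javascript
// Introduce `newDip` into a bucket so that it receives new connections or flows.
function introduceServer(bucket, newDip, protocol) {
  const [firstHop, secondHop] = bucket.hops;
  if (protocol === "tcp") {
    // TCP: SYNs are handled by the first hop, so the new server goes first.
    // Assumption: the old first hop becomes the second hop.
    return { hops: [newDip, firstHop] };
  }
  if (protocol === "udp") {
    // UDP: packets without a local socket fall through to the second hop,
    // so the new server goes second. Assumption: the old second hop moves first.
    return { hops: [secondHop, newDip] };
  }
  throw new Error(`unknown protocol: ${protocol}`);
}

console.log(introduceServer({ hops: ["A", "B"] }, "C", "tcp").hops);
console.log(introduceServer({ hops: ["A", "B"] }, "C", "udp").hops);
```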

This difference between handling of TCP and UDP also leads to higher overheads for UDP. In the case of TCP, as new connections are formed and old connections terminate over time, fewer packets will require the second hop, and so the overhead tends to diminish. But with UDP, new flows always involve the second hop. This is why we differentiate the two cases, taking advantage of SYN packets in the TCP case.

The UDP logic also places a requirement on services. The redirector must be able to match packets to the corresponding sockets on a server according to their 4-tuple. This is not a problem in the TCP case, because all TCP connections are represented by connected sockets in the BSD sockets API (these sockets are obtained from an accept system call, so that they have a local address and a peer address, determining the 4-tuple). But for UDP, unconnected sockets (lacking a declared peer address) can be used to send and receive packets. So some UDP services only use unconnected sockets. For the redirector logic above to work, services must create connected UDP sockets in order to expose their flows to the redirector.

UDP applications with sessions

Some UDP-based protocols have explicit sessions, with a session identifier in each packet. Session identifiers allow sessions to persist even if the 4-tuple changes. This happens in mobility scenarios – for example, if a mobile device passes from a WiFi to a cellular network, causing its IP address to change. An example of a UDP-based protocol with session identifiers is QUIC (which calls them connection IDs).

Our Unimog XDP program allows a flow dissector to be configured for different trafficsets. The flow dissector is the part of the code that is responsible for taking a packet and extracting the value that identifies the flow or connection (this value is then hashed and used for the lookup into the forwarding table). For TCP and UDP, there are default flow dissectors that extract the 4-tuple. But specialised flow dissectors can be added to handle UDP-based protocols.
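Conceptually, a flow dissector is a function from a packet to a flow key. The sketch below models that in plain JavaScript (the real dissectors are part of the XDP program); the trafficset names and the `sessionId` field are hypothetical.

```javascript
// Flow dissectors: each takes a parsed packet and returns the value to be hashed.
const dissectors = {
  // Default for TCP and UDP: the 4-tuple.
  fourTuple: (p) => `${p.srcIp}|${p.srcPort}|${p.dstIp}|${p.dstPort}`,
  // Specialised: a session identifier carried in the payload.
  sessionId: (p) => `session:${p.sessionId}`,
};

// Hypothetical configuration mapping trafficsets to dissectors.
const trafficsets = { web: "fourTuple", tunnel: "sessionId" };

function flowKeyFor(trafficset, packet) {
  return dissectors[trafficsets[trafficset]](packet);
}

// The same session seen from WiFi and then from a cellular network.
const onWifi = { srcIp: "198.51.100.7", srcPort: 40000, dstIp: "192.0.2.1", dstPort: 2408, sessionId: "abc123" };
const onCellular = { ...onWifi, srcIp: "203.0.113.9", srcPort: 51000 };

console.log(flowKeyFor("web", onWifi) === flowKeyFor("web", onCellular));
console.log(flowKeyFor("tunnel", onWifi) === flowKeyFor("tunnel", onCellular));
```

With the 4-tuple dissector the address change produces a different key, and so possibly a different bucket, while the session-based key is unaffected.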

We have used this functionality in our WARP product. We extended the Wireguard protocol used by WARP in a backwards-compatible way to include a session identifier, and added a flow dissector to Unimog to exploit it. There are more details in our post on the technical challenges of WARP.

Conclusion

Unimog has been deployed to all of Cloudflare’s edge data centers for over a year, and it has become essential to our operations. Throughout that time, we have continued to enhance Unimog (many of the features described here were not present when it was first deployed). So the ease of developing and deploying changes, due to XDP and xdpd, has been a significant benefit. Today we continue to extend it, to support more services, and to help us manage our traffic and the load on our servers in more contexts.

Rendering React on the Edge with Flareact and Cloudflare Workers

2020-09-03 Guest Author

Post Syndicated from Guest Author original https://blog.cloudflare.com/rendering-react-on-the-edge-with-flareact-and-cloudflare-workers/

The following is a guest post from Josh Larson, Engineer at Vox Media.

Imagine you’re the maintainer of a high-traffic media website, and your DNS is already hosted on Cloudflare.

Page speed is critical. You need to get content to your audience as quickly as possible on every device. You also need to render ads in a speedy way to maintain a good user experience and make money to support your journalism.

One solution would be to render your site statically and cache it at the edge. This would help ensure you have top-notch delivery speed because you don’t need a server to return a response. However, your site has decades worth of content. If you wanted to make even a small change to the site design, you would need to regenerate every single page during your next deploy. This would take ages.

Another issue is that your site would be static — and future updates to content or new articles would not be available until you deploy again.

That’s not going to work.

Another solution would be to render each page dynamically on your server. This ensures you can return a dynamic response for new or updated articles.

However, you’re going to need to pay for some beefy servers to be able to handle spikes in traffic and respond to requests in a timely manner. You’ll also probably need to implement a system of internal caches to optimize the performance of your app, which could lead to a more complicated development experience. That also means you’ll be at risk of a thundering herd problem if, for any reason, your cache becomes invalidated.

Neither of these solutions is great, and you’re forced to make a tradeoff between the two approaches.

Thankfully, you’ve recently come across a project like Next.js which offers a hybrid approach: static-site generation along with incremental regeneration. You’re in love with the patterns and developer experience in Next.js, but you’d also love to take advantage of the Cloudflare Workers platform to host your site.

Cloudflare Workers allow you to run your code on the edge quickly, efficiently and at scale. Instead of paying for a server to host your code, you can host it directly inside the datacenter — reducing the number of network trips required to load your application. In a perfect world, we wouldn’t need to find hosting for a Next.js site, because Cloudflare offers the same JavaScript hosting functionality with the Workers platform. With their dynamic runtime and edge caching capabilities, we wouldn’t need to worry about making a tradeoff between static and dynamic for our site.

Unfortunately, frameworks like Next.js and Cloudflare Workers don’t mesh together particularly well due to technical constraints. Until now:

I’m excited to announce Flareact, a new open-source React framework built for Cloudflare Workers.

With Flareact, you don’t need to make the tradeoff between a static site and a dynamic application.

Flareact allows you to render your React apps at the edge rather than on the server. It is modeled after Next.js, which means it supports file-based page routing, dynamic page paths and edge-side data fetching APIs.

Not only are Flareact pages rendered at the edge — they’re also cached at the edge using the Cache API. This allows you to provide a dynamic content source for your app without worrying about traffic spikes or response times.

With no servers or origins to deal with, your site is instantly available to your audience. Cloudflare Workers gives you a 0ms cold start and responses from the edge within milliseconds.

You can check out the docs to get started now.

To get started manually, install the latest wrangler, and use the handy wrangler generate command below to create your first project:

npm i @cloudflare/wrangler -g
wrangler generate my-project https://github.com/flareact/flareact-template

What’s the big deal?

Hosting React apps on Cloudflare Workers Sites is not a new concept. In fact, you’ve always been able to deploy a create-react-app project to Workers Sites in addition to static versions of other frameworks like Gatsby and Next.js.

However, Flareact renders your React application at the edge. This allows you to provide an initial server response with HTML markup — which can be helpful for search engine crawlers. You can also cache the response at the edge and optionally invalidate that cache on a timed basis — meaning your static markup will be regenerated if you need it to be fresh.

This isn’t a new pattern: Next.js has done the hard work in defining the shape of this API with SSG support and Incremental Static Regeneration. While there are nuanced differences in the implementation between Flareact and Next.js, they serve a similar purpose: to get your application to your end-user in the quickest and most-scalable way possible.

A focus on developer experience

A magical developer experience is a crucial ingredient to any successful product.

As a longtime fan and user of Next.js, I wanted to experiment with running the framework on Cloudflare Workers. However, Next.js and its APIs are framed around the Node.js HTTP Server API, while Cloudflare Workers use V8 isolates and are modeled after the FetchEvent type.

Since we don’t have typical access to a filesystem inside V8 isolates, it’s tough to mimic the environment required to run a dynamic Next.js server at the edge. Though projects like Fab have come up with workarounds, I decided to approach the project with a clean slate and use existing patterns established in Next.js in a brand-new framework.

As a developer, I absolutely love the simplicity of exporting an asynchronous function from my page to have it supply props to the component. Flareact implements this pattern by allowing you to export a getEdgeProps function. This is similar to getStaticProps in Next.js, and it matches the expected return shape of that function in Next.js — including a revalidate parameter. Learn more about data fetching in Flareact.

I was also inspired by the API Routes feature of Next.js when I implemented the API Routes feature of Flareact — enabling you to write standard Cloudflare Worker scripts directly within your React app.

I hope porting over an existing Next.js project to Flareact is a breeze!

How it works

When a FetchEvent request comes in, Flareact inspects the URL pathname to decide how to handle it:

If the request is for a page or for page props, it checks the cache for that request and returns the cached response if there’s a hit. If there is a cache miss, it renders the page or runs the props function, stores the result in the cache, and returns the response.

If the request is for an API route, it sends the entire FetchEvent along to the user-defined API function, allowing the user to respond as they see fit.

If you want your cached page to be revalidated after a certain amount of time, you can return an additional revalidate property from getEdgeProps(). This instructs Flareact to cache the endpoint for that number of seconds before generating a new response.

Finally, if the request is for a static asset, it returns it directly from the Workers KV.
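The cache-then-generate flow with revalidate can be modeled in a few lines. This is a simulation only: it uses an in-memory Map instead of the Cache API, a fake clock, and hypothetical names (`createPageHandler`, `handle`); it is not Flareact's actual implementation.

```javascript
// In-memory stand-in for the edge cache, keyed by pathname.
function createPageHandler(getEdgeProps, clock = () => Date.now()) {
  const cache = new Map();
  return async function handle(pathname) {
    const hit = cache.get(pathname);
    // A cached entry is served until its revalidate window (if any) has passed.
    if (hit && (hit.expiresAt === null || clock() < hit.expiresAt)) {
      return { props: hit.props, cache: "HIT" };
    }
    const { props, revalidate } = await getEdgeProps();
    // `revalidate` is a number of seconds; without it the entry never expires here.
    const expiresAt = typeof revalidate === "number" ? clock() + revalidate * 1000 : null;
    cache.set(pathname, { props, expiresAt });
    return { props, cache: "MISS" };
  };
}

async function demo() {
  let now = 0;
  let calls = 0;
  const getEdgeProps = async () => ({ props: { build: ++calls }, revalidate: 60 });
  const handle = createPageHandler(getEdgeProps, () => now);
  const results = [];
  results.push(await handle("/")); // first request generates the props
  now = 30000;
  results.push(await handle("/")); // within 60 seconds: served from cache
  now = 61000;
  results.push(await handle("/")); // window elapsed: regenerated
  return results;
}

demo().then((results) => console.log(results.map((r) => r.cache).join(" ")));
```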

The Worker

The core responsibilities of the Worker — or in a traditional SSR framework, the server — are to:

Render the initial React page component into static HTML markup.
Provide the initial page props as a JSON object, embedded into the static markup in a script tag.
Load the client-side JavaScript bundles and stylesheets necessary to render the interactive page.

One challenge with building Flareact is that its Webpack build targets the webworker output rather than the node output. This makes it difficult to inform the worker which pages exist in the filesystem, since there is no access to the filesystem.

To get around this, Flareact leverages require.context, a Webpack-specific API, to inspect the project and build a manifest of pages on the client and the worker. I’d love to replace this with a smarter bundling strategy on the client-side eventually.
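To give a feel for what a page manifest involves, the sketch below turns a list of page file keys into route patterns. The keys are hard-coded in the shape that `require.context(...).keys()` returns, since that API only exists inside a Webpack bundle. The `[slug]` bracket convention is borrowed from Next.js as an assumption, and none of the helper names come from Flareact.

```javascript
// Keys in the shape returned by Webpack's require.context(...).keys().
const pageKeys = ["./index.js", "./about.js", "./posts/[slug].js"];

// Convert a page key into a route with a matching regular expression.
function toRoute(key) {
  const page = key.replace(/^\./, "").replace(/\.js$/, ""); // e.g. "/posts/[slug]"
  const pathname = page.replace(/\/index$/, "") || "/";
  const paramNames = [];
  const pattern = pathname
    .split("/")
    .map((segment) => {
      const dynamic = segment.match(/^\[(.+)\]$/);
      // Static segments are escaped; dynamic ones capture a single path segment.
      if (!dynamic) return segment.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
      paramNames.push(dynamic[1]);
      return "([^/]+)";
    })
    .join("/");
  return { page, regex: new RegExp(`^${pattern}$`), paramNames };
}

// Find the page and params for an incoming pathname, or null if nothing matches.
function matchRoute(routes, pathname) {
  for (const route of routes) {
    const match = route.regex.exec(pathname);
    if (!match) continue;
    const params = {};
    route.paramNames.forEach((name, i) => {
      params[name] = match[i + 1];
    });
    return { page: route.page, params };
  }
  return null;
}

const routes = pageKeys.map(toRoute);
console.log(matchRoute(routes, "/posts/hello-world"));
console.log(matchRoute(routes, "/"));
```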

The Client

In addition to handling incoming Worker requests, Flareact compiles a client bundle containing the code necessary for routing, data fetching and more from the browser.

The core responsibilities of the client are to:

Listen for routing events
Fetch the necessary page component and its props from the worker over AJAX

Building a client router from scratch has been a challenge. It listens for changes to the internal route state, updates the URL pathname with pushState, makes an AJAX request to the worker for the page props, and then updates the current component in the render tree with the requested page.
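The navigation sequence described above can be outlined as follows. This is a sketch with injected dependencies so that it runs outside a browser; in a real client `history` would be `window.history`, and the loader, props fetcher and render callback (all hypothetical names here) would be backed by the bundler, an AJAX request and React.

```javascript
// Minimal client router: update the URL, load the page and its props, re-render.
function createRouter({ history, loadPage, fetchProps, render }) {
  return {
    async push(pathname) {
      history.pushState({}, "", pathname); // update the address bar
      const [component, props] = await Promise.all([
        loadPage(pathname), // the page component for this route
        fetchProps(pathname), // the page props, requested from the worker
      ]);
      render(component, props); // swap the current component in the tree
    },
  };
}

// Fakes standing in for browser and framework APIs.
const visited = [];
const rendered = [];
const router = createRouter({
  history: { pushState: (state, title, url) => visited.push(url) },
  loadPage: async (pathname) => `PageFor(${pathname})`,
  fetchProps: async (pathname) => ({ title: `Props for ${pathname}` }),
  render: (component, props) => rendered.push({ component, props }),
});

const navigation = router.push("/about");
navigation.then(() => console.log(visited, rendered));
```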

It was fun building a flareact/link component similar to next/link:

import Link from "flareact/link";

export default function Index() {
  return (
    <div>
      <Link href="/about">
        <a>Go to About</a>
      </Link>
    </div>
  );
}

I also set out to build a custom version of next/head for Flareact. As it turns out, this was non-trivial! With lots of interesting stuff going on behind the scenes to support SSR and client-side routing events, I decided to make flareact/head a simple wrapper around react-helmet instead:

import Head from "flareact/head";

export default function Index() {
  return (
    <div>
      <Head>
        <title>My page title</title>
      </Head>
      <h1>Hello, world.</h1>
    </div>
  );
}

Local Development

The local developer experience of Flareact leverages the new wrangler dev command, sending server requests through a local tunnel to the Cloudflare edge and back to your machine.

This is a huge win for productivity, since you don’t need to manually build and deploy your application to see how it will perform in a production environment.

It’s also a really exciting update to the serverless toolchain. Running a robust development environment in a serverless world has always been a challenge, since your code is executing in a non-traditional context. Tunneling local code to the edge and back is such a great addition to Cloudflare’s developer experience.

Use cases

Flareact is a great candidate for a lot of Jamstack-adjacent applications, like blogs or static marketing sites.

It could also be used for more dynamic applications, with robust API functions and authentication mechanisms — all implemented using Cloudflare Workers.

Imagine building a high-traffic e-commerce site with Flareact, where both site reliability and dynamic rendering for things like price changes and stock availability are crucial.

There are also untold possibilities for integrating the Workers KV into your edge props or API functions as a first-class database solution. No need to reach for an externally-hosted database!

While the project is still in its early days, here are a couple of real-world examples:

The Flareact docs site, powered by Markdown files
A blog site, powered by a headless WordPress API

The road ahead

I have to be honest: creating a server-side rendered React framework with little prior knowledge was very difficult. There’s still a ton to learn, and Flareact has a long way to go to reach parity with Next.js in the areas of optimization and production-readiness.

Here’s what I’m hoping to add to Flareact in the near future:

Smarter client bundling and Webpack chunks to reduce individual page weight
A more feature-complete client-side router
The ability to extend and customize the root document of the app
Support for more style frameworks (CSS-in-JS, Sass, CSS modules, etc)
A more stable development environment
Documentation and support for environment variables, secrets and KV namespaces
A guide for deploying from GitHub Actions and other CI tools

If the project sounds interesting to you, be sure to check out the source code on GitHub. Contributors are welcome!