Tag Archives: Rust

Open sourcing Pingora: our Rust framework for building programmable network services

2024-02-28 Yuchen Wu

Post Syndicated from Yuchen Wu original https://blog.cloudflare.com/pingora-open-source

Today, we are proud to open source Pingora, the Rust framework we have been using to build services that power a significant portion of the traffic on Cloudflare. Pingora is released under the Apache License version 2.0.

As mentioned in our previous blog post, Pingora is a Rust async multithreaded framework that assists us in constructing HTTP proxy services. Since our last blog post, Pingora has handled nearly a quadrillion Internet requests across our global network.

We are open sourcing Pingora to help build a better and more secure Internet beyond our own infrastructure. We want to provide tools, ideas, and inspiration to our customers, users, and others to build their own Internet infrastructure using a memory safe framework. Having such a framework is especially crucial given the increasing awareness of the importance of memory safety across the industry and the US government. Under this common goal, we are collaborating with the Internet Security Research Group (ISRG) Prossimo project to help advance the adoption of Pingora in the Internet’s most critical infrastructure.

In our previous blog post, we discussed why and how we built Pingora. In this one, we will talk about why and how you might use Pingora.

Pingora provides building blocks for not only proxies but also clients and servers. Along with these components, we also provide a few utility libraries that implement common logic such as event counting, error handling, and caching.

What’s in the box

Pingora provides libraries and APIs to build services on top of HTTP/1 and HTTP/2, TLS, or just TCP/UDP. As a proxy, it supports HTTP/1 and HTTP/2 end-to-end, gRPC, and websocket proxying. (HTTP/3 support is on the roadmap.) It also comes with customizable load balancing and failover strategies. For compliance and security, it supports both the commonly used OpenSSL and BoringSSL libraries, which come with FIPS compliance and post-quantum crypto.

Besides providing these features, Pingora provides filters and callbacks to allow its users to fully customize how the service should process, transform and forward the requests. These APIs will be especially familiar to OpenResty and NGINX users, as many map intuitively onto OpenResty’s “*_by_lua” callbacks.

Operationally, Pingora provides zero downtime graceful restarts to upgrade itself without dropping a single incoming request. Syslog, Prometheus, Sentry, OpenTelemetry and other must-have observability tools are also easily integrated with Pingora.

Who can benefit from Pingora

You should consider Pingora if:

Security is your top priority: Pingora is a more memory safe alternative for services that are written in C/C++. While some might argue about memory safety among programming languages, from our practical experience, we find ourselves way less likely to make coding mistakes that lead to memory safety issues. Besides, as we spend less time struggling with these issues, we are more productive implementing new features.

Your service is performance-sensitive: Pingora is fast and efficient. As explained in our previous blog post, we saved a lot of CPU and memory resources thanks to Pingora’s multi-threaded architecture. The saving in time and resources could be compelling for workloads that are sensitive to the cost and/or the speed of the system.

Your service requires extensive customization: The APIs that the Pingora proxy framework provides are highly programmable. For users who wish to build a customized and advanced gateway or load balancer, Pingora provides powerful yet simple ways to implement it. We provide examples in the next section.

Let’s build a load balancer

Let’s explore Pingora’s programmable API by building a simple load balancer. The load balancer will select between https://1.1.1.1/ and https://1.0.0.1/ to be the upstream in a round-robin fashion.

First let’s create a blank HTTP proxy.

pub struct LB();

#[async_trait]
impl ProxyHttp for LB {
    async fn upstream_peer(...) -> Result<Box<HttpPeer>> {
        todo!()
    }
}

Any object that implements the ProxyHttp trait (similar to the concept of an interface in C++ or Java) is an HTTP proxy. The only required method there is upstream_peer(), which is called for every request. This function should return an HttpPeer which contains the origin IP to connect to and how to connect to it.
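To make the "one required method" idea concrete, here is a minimal, std-only sketch of a trait with one required method and one optional filter that has a do-nothing default. The names MiniProxy, Peer, and FixedOrigin are hypothetical and exist only for illustration; the real ProxyHttp trait is async and has different signatures.

```rust
/// Hypothetical stand-in for an upstream address; not a Pingora type.
#[derive(Debug, PartialEq)]
pub struct Peer {
    pub addr: String,
    pub tls: bool,
}

/// Hypothetical miniature of a proxy trait: one required method,
/// plus an optional filter with a do-nothing default.
pub trait MiniProxy {
    /// Required: choose where a request goes.
    fn upstream_peer(&self, path: &str) -> Peer;

    /// Optional: adjust outgoing headers. The default leaves them untouched.
    fn upstream_request_filter(&self, _headers: &mut Vec<(String, String)>) {}
}

/// A proxy that always picks the same origin and overrides nothing else.
pub struct FixedOrigin;

impl MiniProxy for FixedOrigin {
    fn upstream_peer(&self, _path: &str) -> Peer {
        Peer { addr: "1.1.1.1:443".to_string(), tls: true }
    }
}
```

An implementer only has to write upstream_peer(); every other hook can be overridden later, one at a time, as the service grows.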

Next let’s implement the round-robin selection. The Pingora framework already provides the LoadBalancer with common selection algorithms such as round robin and hashing, so let’s just use it. If the use case requires more sophisticated or customized server selection logic, users can simply implement it themselves in this function.

pub struct LB(Arc<LoadBalancer<RoundRobin>>);

#[async_trait]
impl ProxyHttp for LB {
    async fn upstream_peer(...) -> Result<Box<HttpPeer>> {
        let upstream = self.0
            .select(b"", 256) // hash doesn't matter for round robin
            .unwrap();

        // Set SNI to one.one.one.one
        let peer = Box::new(HttpPeer::new(upstream, true, "one.one.one.one".to_string()));
        Ok(peer)
    }
}

Since we are connecting to an HTTPS server, the SNI also needs to be set. Certificates, timeouts, and other connection options can also be set here in the HttpPeer object if needed.
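For readers curious what round-robin selection amounts to, the following is a std-only sketch. It is not Pingora's LoadBalancer; RoundRobinSketch and its methods are hypothetical names, and the sketch ignores health checks and weights.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin selector; not Pingora's `LoadBalancer`.
pub struct RoundRobinSketch {
    backends: Vec<String>,
    next: AtomicUsize,
}

impl RoundRobinSketch {
    pub fn new(backends: Vec<String>) -> Self {
        Self { backends, next: AtomicUsize::new(0) }
    }

    /// Returns the next backend in rotation, or `None` if the list is empty.
    /// `fetch_add` lets many threads share the counter without a lock.
    pub fn select(&self) -> Option<&str> {
        if self.backends.is_empty() {
            return None;
        }
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.backends.len();
        Some(self.backends[i].as_str())
    }
}
```

Calling select() repeatedly on a selector built from "1.1.1.1:443" and "1.0.0.1:443" alternates between the two addresses.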

Finally, let’s put the service in action. In this example we hardcode the origin server IPs. In real-life workloads, the origin server IPs can also be discovered dynamically when upstream_peer() is called or in the background. After the service is created, we just tell the LB service to listen on 127.0.0.1:6188. We also create a Pingora server, which is the process that runs the load balancing service.

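fn main() {
    // The server comes first because the proxy service needs its configuration.
    let mut my_server = Server::new(None).unwrap();
    my_server.bootstrap();

    let upstreams = LoadBalancer::try_from_iter(["1.1.1.1:443", "1.0.0.1:443"]).unwrap();

    // LB holds an Arc<LoadBalancer<RoundRobin>>, so wrap the load balancer in an Arc.
    let mut lb = pingora_proxy::http_proxy_service(&my_server.configuration, LB(Arc::new(upstreams)));
    lb.add_tcp("127.0.0.1:6188");

    my_server.add_service(lb);
    my_server.run_forever();
}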

Let’s try it out:

curl 127.0.0.1:6188 -svo /dev/null
> GET / HTTP/1.1
> Host: 127.0.0.1:6188
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 403 Forbidden

We can see that the proxy is working, but the origin server rejects us with a 403. This is because our service simply proxies the Host header, 127.0.0.1:6188, set by curl, which upsets the origin server. How do we make the proxy correct that? This can simply be done by adding another filter called upstream_request_filter. This filter runs on every request after the origin server is connected and before any HTTP request is sent. We can add, remove or change HTTP request headers in this filter.

async fn upstream_request_filter(…, upstream_request: &mut RequestHeader, …) -> Result<()> {
    upstream_request.insert_header("Host", "one.one.one.one")
}
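The effect of this filter on the outgoing request can be pictured with a small std-only sketch. It assumes that inserting a header replaces any existing value of the same name; Headers and set_header are hypothetical names, not Pingora's RequestHeader API.

```rust
/// Hypothetical header list: (name, value) pairs in order.
/// Not Pingora's `RequestHeader`.
pub type Headers = Vec<(String, String)>;

/// Replaces every existing value of `name` (matched case-insensitively,
/// as HTTP header names are) with a single new value.
pub fn set_header(headers: &mut Headers, name: &str, value: &str) {
    headers.retain(|(n, _)| !n.eq_ignore_ascii_case(name));
    headers.push((name.to_string(), value.to_string()));
}
```

Applied to the request curl sent, set_header(&mut headers, "Host", "one.one.one.one") leaves exactly one Host header, and it no longer carries 127.0.0.1:6188.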

Let’s try again:

curl 127.0.0.1:6188 -svo /dev/null
< HTTP/1.1 200 OK

This time it works! The complete example can be found here.

Below is a very simple diagram of how this request flows through the callback and filter we used in this example. The Pingora proxy framework currently provides more filters and callbacks at different stages of a request to allow users to modify, reject, route and/or log the request (and response).

Behind the scenes, the Pingora proxy framework takes care of connection pooling, TLS handshakes, reading, writing, parsing requests and any other common proxy tasks so that users can focus on logic that matters to them.

Open source, present and future

Pingora is a library and toolset, not an executable binary. In other words, Pingora is the engine that powers a car, not the car itself. Although Pingora is production-ready for industry use, we understand a lot of folks want a batteries-included, ready-to-go web service with low or no-code config options. Building that application on top of Pingora will be the focus of our collaboration with the ISRG to expand Pingora’s reach. Stay tuned for future announcements on that project.

Other caveats to keep in mind:

Today, API stability is not guaranteed. Although we will try to minimize how often we make breaking changes, we still reserve the right to add, remove, or change components such as request and response filters as the library evolves, especially during this pre-1.0 period.
Support for non-Unix based operating systems is not currently on the roadmap. We have no immediate plans to support these systems, though this could change in the future.

How to contribute

Feel free to raise bug reports, documentation issues, or feature requests in our GitHub issue tracker. Before opening a pull request, we strongly suggest you take a look at our contribution guide.

Conclusion

In this blog post we announced the open sourcing of our Pingora framework. We showed that Internet entities and infrastructure can benefit from Pingora’s security, performance and customizability. We also demonstrated how easy it is to use Pingora and how customizable it is.

Whether you’re building production web services or experimenting with network technologies, we hope you find value in Pingora. It’s been a long journey, but sharing this project with the open source community has been a goal from the start. We’d like to thank the Rust community, as Pingora is built with many great open source Rust crates. Moving to a memory safe Internet may feel like an impossible journey, but it’s one we hope you join us on.

Announcing bpftop: Streamlining eBPF performance optimization

2024-02-26 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/announcing-bpftop-streamlining-ebpf-performance-optimization-6a727c1ae2e5

By Jose Fernandez

Today, we are thrilled to announce the release of bpftop, a command-line tool designed to streamline the performance optimization and monitoring of eBPF applications. As Netflix increasingly adopts eBPF [1, 2], applying the same rigor to these applications as we do to other managed services is imperative. Striking a balance between eBPF’s benefits and system load is crucial, ensuring it enhances rather than hinders our operational efficiency. This tool enables Netflix to embrace eBPF’s potential.

Introducing bpftop

bpftop provides a dynamic real-time view of running eBPF programs. It displays the average execution runtime, events per second, and estimated total CPU % for each program. This tool minimizes overhead by enabling performance statistics only while it is active.

bpftop simplifies the performance optimization process for eBPF programs by enabling an efficient cycle of benchmarking, code refinement, and immediate feedback. Without bpftop, optimization efforts would require manual calculations, adding unnecessary complexity to the process. With bpftop, users can quickly establish a baseline, implement improvements, and verify enhancements, streamlining the process.

A standout feature of this tool is its ability to display the statistics in time series graphs. This approach can uncover patterns and trends that could be missed otherwise.

How it works

bpftop uses the BPF_ENABLE_STATS syscall command to enable global eBPF runtime statistics gathering, which is disabled by default to reduce performance overhead. It collects these statistics every second, calculating the average runtime, events per second, and estimated CPU utilization for each eBPF program within that sample period. This information is displayed in a top-like tabular format or a time series graph over a 10s moving window. Once bpftop terminates, it turns off the statistics-gathering function. The tool is written in Rust, leveraging the libbpf-rs and ratatui crates.
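The arithmetic behind those three figures can be sketched as follows. This is a sketch under stated assumptions rather than bpftop's actual code: it assumes each sample carries the kernel's cumulative run_time_ns and run_cnt counters for a program, and that rates are derived from the difference between two samples. Sample, Rates, and rates are hypothetical names.

```rust
/// Cumulative counters for one eBPF program: total time spent running
/// and number of runs, as reported while statistics are enabled.
#[derive(Clone, Copy)]
pub struct Sample {
    pub run_time_ns: u64,
    pub run_cnt: u64,
}

#[derive(Debug, PartialEq)]
pub struct Rates {
    pub avg_runtime_ns: u64,
    pub events_per_sec: f64,
    pub cpu_percent: f64,
}

/// Derives per-period rates from two samples taken `period_ns` apart.
/// Assumes `period_ns` is greater than zero.
pub fn rates(prev: Sample, curr: Sample, period_ns: u64) -> Rates {
    let runtime_delta = curr.run_time_ns.saturating_sub(prev.run_time_ns);
    let count_delta = curr.run_cnt.saturating_sub(prev.run_cnt);
    // Avoid dividing by zero when the program did not run in this period.
    let avg_runtime_ns = if count_delta == 0 { 0 } else { runtime_delta / count_delta };
    let period_secs = period_ns as f64 / 1e9;
    Rates {
        avg_runtime_ns,
        events_per_sec: count_delta as f64 / period_secs,
        // Share of one CPU's time spent inside the program during the period.
        cpu_percent: runtime_delta as f64 / period_ns as f64 * 100.0,
    }
}
```

For example, a program that ran 200 times for a total of 2 ms within a one-second window averages 10,000 ns per run, at 200 events per second and roughly 0.2% of one CPU.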

Getting started

Visit the project’s GitHub page to learn more about using the tool. We’ve open-sourced bpftop under the Apache 2 license and look forward to contributions from the community.

Announcing bpftop: Streamlining eBPF performance optimization was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Introducing Foundations – our open source Rust service foundation library

2024-01-24 Ivan Nikulin

Post Syndicated from Ivan Nikulin original https://blog.cloudflare.com/introducing-foundations-our-open-source-rust-service-foundation-library

In this blog post, we’re excited to present Foundations, our foundational library for Rust services, now released as open source on GitHub. Foundations is designed to help scale programs for distributed, production-grade systems. It enables engineers to concentrate on the core business logic of their services, rather than the intricacies of production operation setups.

Originally developed as part of our Oxy proxy framework, Foundations has evolved to serve a wider range of applications. For those interested in exploring its technical capabilities, we recommend consulting the library’s API documentation. Additionally, this post will cover the motivations behind Foundations’ creation and provide a concise summary of its key features. Stay with us to learn more about how Foundations can support your Rust projects.

What is Foundations?

In software development, seemingly minor tasks can become complex when scaled up. This complexity is particularly evident when comparing the deployment of services on server hardware globally to running a program on a personal laptop.

The key question is: what fundamentally changes when transitioning from a simple laptop-based prototype to a full-fledged service in a production environment? Through our experience in developing numerous services, we’ve identified several critical differences:

Observability: locally, developers have access to various tools for monitoring and debugging. However, these tools are not as accessible or practical when dealing with thousands of software instances running on remote servers.
Configuration: local prototypes often use basic, sometimes hardcoded, configurations. This approach is impractical in production, where changes require a more flexible and dynamic configuration system. Hardcoded settings are cumbersome, and command-line options, while common, don’t always suit complex hierarchical configurations or align with the “Configuration as Code” paradigm.
Security: services in production face a myriad of security challenges, exposed to diverse threats from external sources. Basic security hardening becomes a necessity.

Addressing these distinctions, Foundations emerges as a comprehensive library, offering solutions to these challenges. Derived from our Oxy proxy framework, Foundations brings the tried-and-tested functionality of Oxy to a broader range of Rust-based applications at Cloudflare.

Foundations was developed with these guiding principles:

High modularity: recognizing that many services predate Foundations, we designed it to be modular. Teams can adopt individual components at their own pace, facilitating a smooth transition.
API ergonomics: a top priority for us is user-friendly library interaction. Foundations leverages Rust’s procedural macros to offer an intuitive, well-documented API, aiming for minimal friction in usage.
Simplified setup and configuration: our goal is for engineers to spend minimal time on setup. Foundations is designed to be ‘plug and play’, with essential functions working immediately and adjustable settings for fine-tuning. We understand that this focus on ease of setup over extreme flexibility might be debatable, as it implies a trade-off. Unlike other libraries that cater to a wide range of environments with potentially verbose setup requirements, Foundations is tailored for specific, production-tested environments and workflows. This doesn’t restrict Foundations’ adaptability to other settings, but we approach this with compile-time features to manage setup workflows, rather than a complex setup API.

Next, let’s delve into the components Foundations offers. To better illustrate the functionality that Foundations provides we will refer to the example web server from Foundations’ source code repository.

Telemetry

In any production system, observability, which we refer to as telemetry, plays an essential role. Generally, three primary types of telemetry are adequate for most service needs:

Logging: this involves recording arbitrary textual information, which can be enhanced with tags or structured fields. It’s particularly useful for documenting operational errors that aren’t critical to the service.
Tracing: this method offers a detailed timing breakdown of various service components. It’s invaluable for identifying performance bottlenecks and investigating issues related to timing.
Metrics: these are quantitative data points about the service, crucial for monitoring the overall health and performance of the system.

Foundations integrates an API that encompasses all these telemetry aspects, consolidating them into a unified package for ease of use.

Tracing

Foundations’ tracing API shares similarities with tokio/tracing, employing a comparable approach with implicit context propagation, instrumentation macros, and futures wrapping:

#[tracing::span_fn("respond to request")]
async fn respond(
    endpoint_name: Arc<String>,
    req: Request<Body>,
    routes: Arc<Map<String, ResponseSettings>>,
) -> Result<Response<Body>, Infallible> {
    …
}

Refer to the example web server and documentation for more comprehensive examples.

However, Foundations distinguishes itself in a few key ways:

Simplified API: we’ve streamlined the setup process for tracing, aiming for a more minimalistic approach compared to tokio/tracing.
Enhanced trace sampling flexibility: Foundations allows for selective override of the sampling ratio in specific code branches. This feature is particularly useful for detailed performance bug investigations, enabling a balance between global trace sampling for overall performance monitoring and targeted sampling for specific accounts, connections, or requests.
Distributed trace stitching: our API supports the integration of trace data from multiple services, contributing to a comprehensive view of the entire pipeline. This functionality includes fine-tuned control over sampling ratios, allowing upstream services to dictate the sampling of specific traffic flows in downstream services.
Trace forking capability: addressing the challenge of long-lasting connections with numerous multiplexed requests, Foundations introduces trace forking. This feature enables each request within a connection to have its own trace, linked to the parent connection trace. This method significantly simplifies the analysis and improves performance, particularly for connections handling thousands of requests.
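One way to picture a sampling ratio with a per-branch override is the sketch below. It is not Foundations' implementation: it assumes a trace is identified by a uniformly distributed 64-bit ID and that a local override, when present, simply replaces the global ratio. The function names are hypothetical.

```rust
/// Picks the ratio that applies to the current code path: a local override
/// if one was set, otherwise the global default.
pub fn effective_ratio(global: f64, local_override: Option<f64>) -> f64 {
    local_override.unwrap_or(global)
}

/// Decides whether a trace is sampled. Deterministic for a given ID, and
/// assumes trace IDs are uniformly distributed over the `u64` range.
pub fn should_sample(trace_id: u64, ratio: f64) -> bool {
    if ratio >= 1.0 {
        return true; // sample everything
    }
    if ratio <= 0.0 {
        return false; // sample nothing
    }
    // Scale the ratio to the ID space and sample IDs below the threshold.
    let threshold = (ratio * u64::MAX as f64) as u64;
    trace_id < threshold
}
```

With a global ratio of 0.01, a branch handling a specific account could pass Some(1.0) as the override so that every one of its requests is traced while the rest of the traffic stays lightly sampled.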

We regard telemetry as a vital component of our software, not merely an optional add-on. As such, we believe in rigorous testing of this feature, considering it our primary tool for monitoring software operations. Consequently, Foundations includes an API and user-friendly macros to facilitate the collection and analysis of tracing data within tests, presenting it in a format conducive to assertions.

Logging

Foundations’ logging API shares its foundation with tokio/tracing and slog, but introduces several notable enhancements.

During our work on various services, we recognized the hierarchical nature of logging contextual information. For instance, in a scenario involving a connection, we might want to tag each log record with the connection ID and HTTP protocol version. Additionally, for requests served over this connection, it would be useful to attach the request URL to each log record, while still including connection-specific information.

Typically, achieving this would involve creating a new logger for each request, copying tags from the connection’s logger, and then manually passing this new logger throughout the relevant code. This method, however, is cumbersome, requiring explicit handling and storage of the logger object.

To streamline this process and prevent telemetry from obstructing business logic, we adopted a technique similar to tokio/tracing’s approach for tracing, applying it to logging. This method relies on future instrumentation machinery (tracing-rs documentation has a good explanation of the concept), allowing for implicit passing of the current logger. This enables us to “fork” logs for each request and use this forked log seamlessly within the current code scope, automatically propagating it down the call stack, including through asynchronous function calls:

let conn_tele_ctx = TelemetryContext::current();

let on_request = service_fn({
    let endpoint_name = Arc::clone(&endpoint_name);

    move |req| {
        let routes = Arc::clone(&routes);
        let endpoint_name = Arc::clone(&endpoint_name);

        // Each request gets independent log inherited from the connection log and separate
        // trace linked to the connection trace.
        conn_tele_ctx
            .with_forked_log()
            .with_forked_trace("request")
            .apply(async move { respond(endpoint_name, req, routes).await })
    }
});

Refer to the example web server and documentation for more comprehensive examples.
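The hierarchical nature of log context described above can be illustrated with a std-only sketch. LogContext and its methods are hypothetical and are passed around explicitly here; Foundations' TelemetryContext propagates the forked log implicitly, which this sketch does not attempt to reproduce.

```rust
/// Hypothetical logging context: an ordered list of key/value tags.
/// Not Foundations' `TelemetryContext`.
#[derive(Clone, Debug, Default)]
pub struct LogContext {
    tags: Vec<(String, String)>,
}

impl LogContext {
    /// Adds a tag to this context only.
    pub fn add_tag(&mut self, key: &str, value: &str) {
        self.tags.push((key.to_string(), value.to_string()));
    }

    /// Creates an independent child that starts with the parent's tags.
    pub fn fork(&self) -> LogContext {
        self.clone()
    }

    /// Renders a log line with all tags attached.
    pub fn render(&self, message: &str) -> String {
        let tags: Vec<String> = self
            .tags
            .iter()
            .map(|(k, v)| format!("{}={}", k, v))
            .collect();
        format!("{} [{}]", message, tags.join(" "))
    }
}
```

A connection context tagged with conn_id=42 can be forked per request and tagged with the URL; the request's records then carry both tags, while the connection's own records are unaffected.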

In an effort to simplify the user experience, we merged all APIs related to context management into a single TelemetryContext object that is implicitly available in each code scope. This integration not only simplifies the process but also lays the groundwork for future advanced features. These features could blend tracing and logging information into a cohesive narrative by cross-referencing each other.

As with tracing, Foundations also offers a user-friendly API for testing a service’s logging.

Metrics

Foundations incorporates the official Prometheus Rust client library for its metrics functionality, with a few enhancements for ease of use. One key addition is a procedural macro provided by Foundations, which simplifies the definition of new metrics with typed labels, reducing boilerplate code:

use foundations::telemetry::metrics::{metrics, Counter, Gauge};
use std::sync::Arc;

#[metrics]
pub(crate) mod http_server {
    /// Number of active client connections.
    pub fn active_connections(endpoint_name: &Arc<String>) -> Gauge;

    /// Number of failed client connections.
    pub fn failed_connections_total(endpoint_name: &Arc<String>) -> Counter;

    /// Number of HTTP requests.
    pub fn requests_total(endpoint_name: &Arc<String>) -> Counter;

    /// Number of failed requests.
    pub fn requests_failed_total(endpoint_name: &Arc<String>, status_code: u16) -> Counter;
}

Refer to the example web server and documentation for more information on how metrics can be defined and used.

In addition to this, we have refined the approach to metrics collection and structuring. Foundations offers a streamlined, user-friendly API for both these tasks, focusing on simplicity and minimalism.

Memory profiling

Recognizing the efficiency of jemalloc for long-lived services, Foundations includes a feature for enabling jemalloc memory allocation. A notable aspect of jemalloc is its memory profiling capability. Foundations packages this functionality into a straightforward and safe Rust API, making it accessible and easy to integrate.

Telemetry server

Foundations comes equipped with a built-in, customizable telemetry server endpoint. This server automatically handles a range of functions including health checks, metric collection, and memory profiling requests.

Security

A vital component of Foundations is its robust and ergonomic API for seccomp, a Linux kernel feature for syscall sandboxing. This feature enables the setting up of hooks for syscalls used by an application, allowing actions like blocking or logging. Seccomp acts as a formidable line of defense, offering an additional layer of security against threats like arbitrary code execution.

Foundations provides a simple way to define lists of all allowed syscalls, also allowing a composition of multiple lists (in addition, Foundations ships predefined lists for common use cases):

use foundations::security::common_syscall_allow_lists::{ASYNC, NET_SOCKET_API, SERVICE_BASICS};
use foundations::security::{allow_list, enable_syscall_sandboxing, ViolationAction};

allow_list! {
    static ALLOWED = [
        ..SERVICE_BASICS,
        ..ASYNC,
        ..NET_SOCKET_API
    ]
}

enable_syscall_sandboxing(ViolationAction::KillProcess, &ALLOWED)

Refer to the web server example and documentation for more comprehensive examples of this functionality.
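Conceptually, composing allow lists is a set union followed by a membership check. The std-only sketch below illustrates only that idea: the group names and contents are hypothetical, syscalls are matched by name for readability, and nothing here talks to the kernel. Real seccomp filters operate on syscall numbers and are enforced by Linux itself.

```rust
use std::collections::HashSet;

// Hypothetical syscall groups; real lists would be considerably longer.
pub const BASICS: &[&str] = &["read", "write", "exit_group"];
pub const NET: &[&str] = &["socket", "connect", "accept4"];

/// Merges several groups into one allow list, dropping duplicates.
pub fn compose(groups: &[&[&'static str]]) -> HashSet<&'static str> {
    groups.iter().flat_map(|g| g.iter().copied()).collect()
}

/// Outcome for a syscall under an allow-list policy.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Allow,
    Kill,
}

/// Anything that is not explicitly allowed is treated as a violation.
pub fn check(allowed: &HashSet<&'static str>, syscall: &str) -> Verdict {
    if allowed.contains(syscall) {
        Verdict::Allow
    } else {
        Verdict::Kill
    }
}
```

The deny-by-default shape is the important part: a syscall such as ptrace that appears in no group is rejected without anyone having to list it.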

Settings and CLI

Foundations simplifies the management of service settings and command-line argument parsing. Services built on Foundations typically use YAML files for configuration. We advocate for a design where every service comes with a default configuration that’s functional right off the bat. This philosophy is embedded in Foundations’ settings functionality.

In practice, applications define their settings and defaults using Rust structures and enums. Foundations then transforms Rust documentation comments into configuration annotations. This integration allows the CLI to generate a default, fully annotated YAML configuration file. As a result, service users can quickly and easily understand the service settings:

use foundations::settings::collections::Map;
use foundations::settings::net::SocketAddr;
use foundations::settings::settings;
use foundations::telemetry::settings::TelemetrySettings;

#[settings]
pub(crate) struct HttpServerSettings {
    /// Telemetry settings.
    pub(crate) telemetry: TelemetrySettings,
    /// HTTP endpoints configuration.
    #[serde(default = "HttpServerSettings::default_endpoints")]
    pub(crate) endpoints: Map<String, EndpointSettings>,
}

impl HttpServerSettings {
    fn default_endpoints() -> Map<String, EndpointSettings> {
        let mut endpoint = EndpointSettings::default();

        endpoint.routes.insert(
            "/hello".into(),
            ResponseSettings {
                status_code: 200,
                response: "World".into(),
            },
        );

        endpoint.routes.insert(
            "/foo".into(),
            ResponseSettings {
                status_code: 403,
                response: "bar".into(),
            },
        );

        [("Example endpoint".into(), endpoint)]
            .into_iter()
            .collect()
    }
}

#[settings]
pub(crate) struct EndpointSettings {
    /// Address of the endpoint.
    pub(crate) addr: SocketAddr,
    /// Endpoint's URL path routes.
    pub(crate) routes: Map<String, ResponseSettings>,
}

#[settings]
pub(crate) struct ResponseSettings {
    /// Status code of the route's response.
    pub(crate) status_code: u16,
    /// Content of the route's response.
    pub(crate) response: String,
}

The settings definition above automatically generates the following default configuration YAML file:

---
# Telemetry settings.
telemetry:
  # Distributed tracing settings
  tracing:
    # Enables tracing.
    enabled: true
    # The address of the Jaeger Thrift (UDP) agent.
    jaeger_tracing_server_addr: "127.0.0.1:6831"
    # Overrides the bind address for the reporter API.
    # By default, the reporter API is only exposed on the loopback
    # interface. This won't work in environments where the
    # Jaeger agent is on another host (for example, Docker).
    # Must have the same address family as `jaeger_tracing_server_addr`.
    jaeger_reporter_bind_addr: ~
    # Sampling ratio.
    #
    # This can be any fractional value between `0.0` and `1.0`.
    # Where `1.0` means "sample everything", and `0.0` means "don't sample anything".
    sampling_ratio: 1.0
  # Logging settings.
  logging:
    # Specifies log output.
    output: terminal
    # The format to use for log messages.
    format: text
    # Set the logging verbosity level.
    verbosity: INFO
    # A list of field keys to redact when emitting logs.
    #
    # This might be useful to hide certain fields in production logs as they may
    # contain sensitive information, but allow them in testing environment.
    redact_keys: []
  # Metrics settings.
  metrics:
    # How the metrics service identifier defined in `ServiceInfo` is used
    # for this service.
    service_name_format: metric_prefix
    # Whether to report optional metrics in the telemetry server.
    report_optional: false
  # Server settings.
  server:
    # Enables telemetry server
    enabled: true
    # Telemetry server address.
    addr: "127.0.0.1:0"
# HTTP endpoints configuration.
endpoints:
  Example endpoint:
    # Address of the endpoint.
    addr: "127.0.0.1:0"
    # Endpoint's URL path routes.
    routes:
      /hello:
        # Status code of the route's response.
        status_code: 200
        # Content of the route's response.
        response: World
      /foo:
        # Status code of the route's response.
        status_code: 403
        # Content of the route's response.
        response: bar

Refer to the example web server and documentation for settings and CLI API for more comprehensive examples of how settings can be defined and used with Foundations-provided CLI API.
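The idea of pairing defaults with documentation to produce an annotated configuration file can be sketched by hand. Foundations derives this from doc comments through its settings macro; the sketch below writes the pairing out manually, and to_annotated_yaml is a hypothetical helper, not part of the library.

```rust
/// Settings for a single route, with a default that works out of the box.
pub struct ResponseSettings {
    pub status_code: u16,
    pub response: String,
}

impl Default for ResponseSettings {
    fn default() -> Self {
        Self { status_code: 200, response: "World".to_string() }
    }
}

impl ResponseSettings {
    /// Emits each field preceded by its description as a YAML comment.
    /// The descriptions are written out here; a macro could instead take
    /// them from the fields' doc comments.
    pub fn to_annotated_yaml(&self) -> String {
        let fields = [
            ("Status code of the route's response.", "status_code", self.status_code.to_string()),
            ("Content of the route's response.", "response", self.response.clone()),
        ];
        let mut out = String::new();
        for (doc, key, value) in fields.iter() {
            out.push_str(&format!("# {}\n{}: {}\n", doc, key, value));
        }
        out
    }
}
```

Calling ResponseSettings::default().to_annotated_yaml() yields a small, self-describing file that a service user can edit without consulting the source.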

Wrapping Up

At Cloudflare, we greatly value the contributions of the open source community and are eager to reciprocate by sharing our work. Foundations has been instrumental in reducing our development friction, and we hope it can do the same for others. We welcome external contributions to Foundations, aiming to integrate diverse experiences into the project for the benefit of all.

If you’re interested in working on projects like Foundations, consider joining our team — we’re hiring!

Why AWS is the Best Place to Run Rust

2023-10-23 Deval Parikh

Post Syndicated from Deval Parikh original https://aws.amazon.com/blogs/devops/why-aws-is-the-best-place-to-run-rust/

Introduction

The Rust programming language was created by Mozilla Research in 2010 to be “a programming language empowering everyone to build reliable and efficient (fast) software” [1]. If you are a beginner-level SDE, a DevOps engineer, or a decision maker in your organization looking to adopt Rust for your specific use, you will find this blog helpful to get started with Rust on AWS. We will begin by explaining why Rust has gained significant traction over programming languages like C, C++, Java, Python, and Go. We will then talk about why AWS is one of the best platforms for Rust. Finally, we will provide an example of how you can quickly run a Rust program using an AWS Lambda function.

Why Rust?

Rust is an efficient and reliable programming language that addresses performance, reliability, and productivity all at once. It distinguishes itself from its peers by offering memory safety and thread safety without the need for a garbage collector.

Historically, C and C++ have held the title of being the most performant programming languages; however, their speed has often come at a significant cost to safety and maintainability. The threats in using such languages range from corruption of valid data to the execution of arbitrary code. The frequency of these issues becomes obvious when you notice that, from 2007 to 2019, 70 percent of all vulnerabilities addressed by Microsoft through security updates pertained to memory safety [2]. Languages like Java have come a long way in mitigating such vulnerabilities using a garbage collector; however, this has come with a significant performance cost. Rust seeks to marry performance and safety through its type system and novel borrow checker, a form of compile-time static analysis that helps rule out errors such as null-pointer dereferences and data races.
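As a small illustration of the data-race point, the sketch below increments one counter from several threads. Handing a plain mutable reference to multiple threads would be rejected at compile time, so the sharing has to be spelled out; Arc<Mutex<_>> is one common way to do that. The function name parallel_count is hypothetical.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Increments a shared counter from `threads` threads, `per_thread` times each.
/// The `Mutex` serializes access, and the `Arc` keeps the counter alive
/// until the last thread is done with it.
pub fn parallel_count(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    // Wait for every thread before reading the final value.
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}
```

Because every increment happens under the lock, the result is always threads multiplied by per_thread, with no lost updates.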

There are other ways programs may access invalid memory. Iterating through an array, for example, requires the iterator to know how many elements are in the array to create a stopping condition. Furthermore, without bounds checking, how would an accessor method be sure it is not accessing an index that does not exist? Here, safety comes with a performance overhead. Typically, the safety benefits of languages like Java are worth the performance overhead. However, for situations where safety and speed are both an absolute necessity, developers may choose to run their mission-critical applications in Rust. Here, Rust can be viewed as a memory-safe, fast, low-resource programming language that requires no runtime. This also makes Rust suitable for embedded or low-resource devices.
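To make the bounds-checking point concrete, here is a minimal sketch of how it looks in Rust. The helper names `safe_get` and `sum_all` are made up for this illustration; `slice::get` and iterators are part of the standard library.

```rust
/// Returns the element at `index`, or `None` if the index is out of bounds.
/// `safe_get` is a hypothetical helper name used only for this sketch.
fn safe_get(values: &[i32], index: usize) -> Option<i32> {
    // `get` performs the bounds check and reports failure as `None`
    // instead of reading memory outside the slice.
    values.get(index).copied()
}

/// Sums every element using an iterator. The iterator carries the slice
/// length, so the loop has a well-defined stopping condition.
fn sum_all(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    let values = [10, 20, 30];
    println!("{:?}", safe_get(&values, 1)); // an index that exists
    println!("{:?}", safe_get(&values, 7)); // an index that does not exist
    println!("{}", sum_all(&values));
}
```

Plain indexing such as `values[7]` is also checked: it panics rather than reading invalid memory.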

Rust brings polished tooling, a robust package manager (Cargo), and perhaps most importantly – a fast-growing and passionate community of developers. As Rust gains in popularity, so does the number of high-profile organizations adopting it (including AWS!) for critical applications where performance and safety are top concerns. Did you know that Amazon S3 leverages Rust to attempt to return responses with single-digit millisecond latency? To name a few, AWS product components written in Rust include Amazon CloudFront, Amazon EC2, and AWS Lambda among others.

There are many great resources to learn Rust. Most Rust developers start with the official Rust book, which is available for free online.

[1]: Rust Language official website

[2]: https://www.zdnet.com/article/microsoft-70-percent-of-all-security-bugs-are-memory-safety-issues/

[3]: https://codilime.com/blog/why-is-rust-programming-language-so-popular/#:~:text=Rust%20is%20a%20statically%2Dtyped,developed%20originally%20at%20Mozilla%20Research

Why Rust on AWS?

Rust matters to AWS for two main reasons. First, our customers are choosing to use Rust for their mission-critical workloads and adoption is growing, so it is imperative that AWS provides the best tools possible to run Rust on AWS. In the next section, I will provide an example to show how easy it is to interact with AWS services using the Rust runtime on AWS Lambda.

Additionally, it is important that we create highly performant, safe infrastructure and services for our customers to run their business-critical workloads on AWS. In 2018, AWS first launched its open source microVM technology, Firecracker, written completely in Rust. Since then, AWS has delivered over two dozen open source projects developed in Rust. For instance, AWS uses Firecracker to run AWS Lambda and AWS Fargate. Today, AWS Lambda processes trillions of executions for hundreds of thousands of active customers every month. Firecracker’s ability to launch a microVM for AWS Lambda or AWS Fargate in less than 125 ms is attributable in part to the speed of Rust. AWS also developed and launched Bottlerocket, a Linux-based open source OS purpose-built for running containers. Veeva Systems, a leader in cloud-based software for the life sciences industry, runs a variety of microservices on Bottlerocket securely, with enhanced resource efficiency and decreased management overhead, thanks to Rust.

Here at AWS, our product development teams have leveraged Rust to deliver more than a dozen services. Besides services such as Amazon Simple Storage Service (Amazon S3), AWS developers use Rust as the language of choice to develop product components for Amazon Elastic Compute Cloud (Amazon EC2), Amazon CloudFront, Amazon Route 53, and more. Our Amazon EC2 team uses Rust for new AWS Nitro System components, including sensitive applications such as Nitro Enclaves.

Not only is AWS using Rust to improve its product response times, we are also actively contributing to and supporting Rust and the open source ecosystem around it. AWS employs a number of core open source contributors to the Rust project and to popular Rust libraries like tokio, which is used for writing asynchronous applications with Rust. According to Marc Brooker, Distinguished Engineer and Vice President of Database and AI at AWS, “Hiring engineers to work directly on Rust allows us to improve it in ways that matter to us and to our customers, and help grow the overall Rust community.” AWS is an active member of the Board of Directors for the Rust Foundation and has generously donated infrastructure and technology services to the Rust Foundation. You can read more about how AWS is helping the Rust community here.

Getting Started with Rust on AWS

This demonstration will walk you through creating your first AWS Lambda + Rust app! We’ll bootstrap the development process by using the AWS Serverless Application Model (SAM), a tool designed for building, deploying, and managing serverless applications. AWS SAM streamlines the Rust development process by setting up AWS’s official Rust Lambda Runtime, Cargo Lambda. This runtime offers a specialized build tool command for direct deployment to AWS. Additionally, AWS SAM integrates both an Amazon DynamoDB table and an Amazon API Gateway endpoint. The provided example serves as a foundational template for using the AWS Rust SDK with Amazon DynamoDB.

architecture diagram

Prerequisites

Steps

1. Open a terminal and navigate to your project directory.

2. Initialize the project using sam init

3. Choose “1 - AWS Quick Start Templates”, then “16 - DynamoDB Example”.

4. Name the project (for demo: “rust-ddb-example-app“)

5. Now navigate into the newly created directory with the SAM application code and execute sam build && sam deploy --guided.

a. Accept prompts with “y” or defaults.

6. After deployment concludes, record the Amazon CloudFormation “PutApi” output URL (e.g., https://a1b2c3d4e5f6.execute-api.us-west-2.amazonaws.com/Prod/).

7. Add an element to your table. For the demo, the id of our element will be foo and the payload will be bar (e.g., curl -X PUT <PutApi URL>/foo -d "bar").

8. Validate the addition via the AWS Console’s DynamoDB. Locate the table named after your AWS SAM app and verify the new item. You can do this by going to the AWS Console, clicking DynamoDB, then Tables, and then Explore Items.

dynamodb example

What Next?

This is a great starting point on your journey with Rust on AWS. To take your development journey to the next level, consider:

Explore More Rust on AWS: AWS provides a plethora of examples and documentation. Explore the AWS Rust GitHub Repository for more intricate use cases and examples.
Join a Rust Workshop: AWS often hosts workshops and webinars on various topics. Keep an eye on the AWS Events Page for an upcoming Rust-focused session.
Deepen Your Rust Knowledge: If you’re new to Rust or want to delve deeper, the Rust Book is an excellent resource. We also highly recommend watching the videos on the Cargo Lambda documentation page.
Engage with the Community: The Rust community is vibrant and welcoming. Join forums, attend meetups, and participate in discussions to grow your network and knowledge. Become a member of Rust Foundation to collaborate with other members of the community.
Contribute to make Rust even better: Report on bugs or fix them, write documentation, and add new features. Here is how.

Conclusion

For those of us used to the safety net of an interpreter or managed runtime, Rust shows that code can still execute safely in a compiled world. Most importantly, Rust brings speed and performance without compromising the security and stability of the system. It is a language of choice in embedded-systems programming, mission-critical systems, and blockchain and crypto development, and it has found its place in 3D video gaming as well.

Rust on AWS is a game changer in that it makes it easy for developers to run code without needing to set up extensive infrastructure. It serves as an excellent backend service with zero administration. AWS Lambda’s built-in Rust support further exemplifies AWS’s commitment to accommodating the growing popularity of this language.

Why Rust is the most admired language among developers

2023-08-30 Sara Verdi

Post Syndicated from Sara Verdi original https://github.blog/2023-08-30-why-rust-is-the-most-admired-language-among-developers/

For the eighth year in a row, Rust has topped the chart as “the most desired programming language” in Stack Overflow’s annual developer survey. And with more than 80% of developers reporting that they’d like to use the language again next year, you have to wonder how a language created less than 20 years ago has stolen the hearts of developers around the world.

In this article, we’ll look at the history of Rust, what it’s commonly used for, why developers love it so much, and some resources to help you start learning one of the fastest growing languages on GitHub.

So, what is the Rust programming language?

Rust’s print macro displaying the output “Hello, World!”

Originally intended to serve as a safer alternative to C and C++, Rust is a systems programming language that has gained significant popularity among developers thanks to its emphasis on safety, performance, and productivity. Rust is a statically typed language, so variable and expression types are determined and checked at compile time, which helps enhance memory safety and error detection, resulting in more reliable builds.

In 2006, software developer Graydon Hoare started Rust as a personal project while he was working at Mozilla. According to an interview with MIT Technology Review, the inspiration for Rust came from a broken elevator in Hoare’s apartment building. The software for the lift operation system had crashed, and Hoare understood that issues like this usually came from problems with how a program uses memory.

Quite often, the software for these types of devices is written in C or C++, but these languages require significant memory management, which can lead to errors that would cause the system to crash. So, Hoare set to work on figuring out how to create a programming language that could be both compact and memory bug-free.

He later showed the project to a manager—which led to Mozilla sponsoring it in 2009 as part of a longer-term effort to incorporate the language into the development of an experimental browser engine. In 2010, Mozilla Research officially announced the Rust project and released the source code to the public as an open-source project. After several years of development, Rust reached a stable and mature state—and in May 2015, Rust 1.0 was released. This milestone signaled that Rust was ready for production and provided a foundation for developers to build upon.

Since the 1.0 release, Rust has exploded in popularity and adoption, with top applications, such as Microsoft Windows, utilizing Rust to rewrite core libraries with its memory-safe code. Outside of the tech giants, Rust also has a vibrant community of developers, or “Rustaceans,” that are dedicated to making the Rust experience an active and collaborative one.

Ferris, an orange cartoon crustacean who is the unofficial mascot of Rust.

Meet Ferris, the unofficial mascot for Rust!

According to a recent survey by SlashData, there are roughly 2.8 million Rust developers worldwide in 2023, a number that has nearly tripled over the past two years. With plenty of active forums, documentation, and a supportive community for developers of all skill levels, it’s perhaps unsurprising that Rust keeps topping the most-desired language lists.

What makes Rust special?

So, what are some of Rust’s key features that make it so attractive to developers?

In simple terms, Rust solves some of developers’ most frustrating memory management problems commonly associated with C and C++, but that’s not its only shining capability. One of GitHub’s staff software engineers, Jason Orendorff, who co-authored a book on programming with Rust, said about the language:

“To me, what’s great about Rust is that it’s both fast AND reliable,” Orendorff said. “It lets me write multithreaded programs that run on 16 cores and keep them readable, maintainable, and crash-free. It also lets me write very low-level algorithms requiring control over memory layout and pull in a crate that makes HTTPS requests super simple. It’s the combination of these features that makes Rust so unique.”

Building on that, here are a few more of its well-loved characteristics and features:

Concurrency. Rust has built-in support for concurrent programming through its ownership system which enforces strict rules for data access, and its borrowing model, which prevents data races by allowing controlled, simultaneous access. This ensures that multiple threads can work on shared data without introducing memory-related issues.
No garbage collection. Unlike some programming languages, Rust does not employ garbage collection. Instead, its ownership and borrowing rules manage memory, which helps empower developers to have precise control over memory allocation and deallocation for efficient resource management.
Cargo Package Manager. Rust’s built-in package manager, Cargo, streamlines project management, dependency tracking, and building, which helps contribute to efficient and organized development workflows. But this doesn’t make it clear just how bonkers the Cargo ecosystem is. According to Orendorff, “My team takes advantage of high-quality open source packages for hashing, serialization, multithreading, data structures, compression, and a lot more. These are performance-critical libraries. Without some of these, our project to rethink code search on GitHub wouldn’t have been possible.” And here’s a fun fact: Rust was actually the first systems programming language to have a standard package manager, and, as a result, the Rust ecosystem is incredibly robust.
Zero-cost abstractions. This feature allows developers to write high-level code abstractions and features without introducing any runtime performance overhead.
Pattern matching. This powerful language feature enables developers to concisely and effectively match complex data structures against specific patterns to extract and handle different cases or scenarios in a clean and readable manner.
Type inference. This feature allows Rust’s compiler to automatically detect the type of an expression based on context while you code. “Many programming languages have some type inference,” Orendorff said. “C# and C++ have some, Rust has a little more, and languages like Haskell, Scala, and ML have even more.”
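A few of the features above can be seen together in a short sketch. The `Event` type and the function names are invented for this illustration; `Arc`, `Mutex`, and `thread::spawn` come from the standard library.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// A hypothetical message type used only for this sketch.
enum Event {
    Add(u64),
    Reset,
}

/// Pattern matching: each variant is handled explicitly, and the compiler
/// rejects the `match` if a variant is forgotten.
fn apply(total: u64, event: &Event) -> u64 {
    match event {
        Event::Add(n) => total + n,
        Event::Reset => 0,
    }
}

/// Concurrency: several threads update one counter. `Arc` shares ownership
/// across threads and `Mutex` serializes access to the value inside.
fn parallel_count(threads: u64, per_thread: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    // Type inference: the element type of `handles` is left to the compiler.
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // No garbage collector: the counter is freed when the last `Arc` is dropped.
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    let events = [Event::Add(2), Event::Add(3), Event::Reset, Event::Add(5)];
    let total = events.iter().fold(0, |acc, event| apply(acc, event));
    println!("total after events: {}", total);
    println!("parallel count: {}", parallel_count(4, 1_000));
}
```

Trying to hand the same `&mut u64` to several threads without the `Mutex` would be rejected at compile time, which is the data-race prevention the list describes.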

fn main() {
    break rust;
}

Run this code for an inside joke among Rust developers 😆

What is Rust commonly used for?

Thanks to its direct access to both hardware and memory, Rust is well suited for embedded systems and bare-metal development. And since it’s a general purpose language, it can also be used for a variety of applications.

Let’s explore a few key use cases:

Using Rust to build performance-critical backend systems

Performance-critical backend systems are software components or services that handle tasks that require high-speed processing, low-latency responses, and efficient resource utilization—and Rust’s performance, thread safety, and error handling make it an excellent choice for developing these types of systems. In fact, we use Rust to build some of these systems at GitHub. For example, the backend of our code search feature is written in Rust (and you can read more about the development of GitHub’s newest code search with Rust, too).

Using Rust to develop operating systems

Rust was originally created to solve an operating system issue (remember the elevator problem?)—so, unsurprisingly, it’s often used to build operating systems, kernels, device drivers, or other low-level components where control over memory and performance is crucial. Redox, a Unix-like operating system, was written in Rust, which contributes to its most crucial feature: its security. “Fuchsia is another example that was built at Google,” Orendorff said. “If you have a Google Nest smart speaker, it’s likely running Fuchsia.”

Rust for operating system-adjacent code

Rust is also well-suited for writing code that performs tasks that closely interact with the operating system. For example, the Codespaces team at GitHub is leveraging Rust to enhance the speed of starting up the virtual disk within GitHub Codespaces and optimize the utilization of Azure storage. Coursera also employs Rust in its online grading system, as it operates within Docker and needs a language that compiles to machine code with minimal dependencies.

Using Rust for web development

Rust is increasingly being used for web development, especially on the server side. The async programming model and performance characteristics of Rust make it fitting for building high-performance web servers, APIs, and backend services. Plus, there’s been an influx of web frameworks for Rust, like Rocket, that can help folks get started with writing secure web applications. The emergence of these frameworks underscores Rust’s position as a mature language, and also helps increase the support for folks looking to use Rust in front-end or back-end work.

Using Rust for crypto and blockchain development

Rust’s speed, memory management, and security all contribute to its involvement with cryptocurrency and blockchain technologies. For example, Polkadot, which is designed to enable the interoperability and interaction between multiple blockchains to share information and assets in a secure and decentralized manner, utilizes Rust to build its core infrastructure. Polkadot’s runtime logic, which governs the behavior and rules of the blockchain, is also written in Rust. Check out this repository, awesome-blockchain-rust, for some useful components for building your own blockchain applications with Rust.

Using Rust to build CLI tools

Rust’s compilation to efficient machine code and its expressive syntax make it a strong choice for building command line tools and applications. Plus, writing a command line app is a great way to learn and get comfortable with Rust. Take a look at this comprehensive guide on how to build your own CLI application with Rust in 15 minutes!

Learn how open source developers are making the command line more friendly—and more powerful.

Using Rust for embedded systems and IoT development

Rust’s minimal runtime and control over memory layout make it incredibly useful for developing embedded systems and Internet of Things (IoT) devices. Its ability to prevent memory-related bugs, manage concurrency, and generate small, efficient binaries caters to IoT’s security, real-time, and efficiency needs.

Why developers love Rust

While Rust’s user base isn’t nearly as large as Java’s or Python’s, Rust continues to compete with the big hitters in most-admired lists across the internet. There’s even a full website composed of developers’ praise for Rust.

But why exactly is Rust so admired by developers? If you boil it down to just a handful of reasons why developers love Rust so much, they’d have to be the language’s speed, safety, and performance.

Moreover, Rust is continuing to evolve and grow with new frameworks, tools, and resources. You can keep tabs on contributions to the language in the awesome-rust repository, which hosts an impressive list of Rust code and resources.

The bottom line: Admiring Rust isn’t just about adopting a language—it’s embracing a mindset that prioritizes innovation without compromising on the core tenets of stability and security.

How to get started with Rust

We know, there are plenty of resources to sharpen your Rust skills peppered throughout this article, but we have another pro tip for you: take Rust for a test drive with GitHub Copilot. As your AI-powered pair programmer, GitHub Copilot can help you learn and refine the basics of Rust as you go with tailored code suggestions.

Here’s a developer advocate at GitHub experimenting with Rust for the first time with GitHub Copilot. And to all of our seasoned Rustaceans out there, what do you think? Was the suggestion correct?

If you’re ready to begin your coding journey with Rust, GitHub Copilot can jumpstart your progress—all without the need to study documentation for hours at a time.

Get started

The post Why Rust is the most admired language among developers appeared first on The GitHub Blog.

Wasm core dumps and debugging Rust in Cloudflare Workers

2023-08-14 Sven Sauleau

Post Syndicated from Sven Sauleau original http://blog.cloudflare.com/wasm-coredumps/

A clear sign of maturing for any new programming language or environment is how easy and efficient debugging them is. Programming, like any other complex task, involves various challenges and potential pitfalls. Logic errors, off-by-ones, null pointer dereferences, and memory leaks are some examples of things that can make software developers desperate if they can't pinpoint and fix these issues quickly as part of their workflows and tools.

WebAssembly (Wasm) is a binary instruction format designed to be a portable and efficient target for the compilation of high-level languages like Rust, C, C++, and others. In recent years, it has gained significant traction for building high-performance applications in web and serverless environments.

Cloudflare Workers has had first-party support for Rust and Wasm for quite some time. We've been using this powerful combination to bootstrap and build some of our most recent services, like D1, Constellation, and Signed Exchanges, to name a few.

Using tools like Wrangler, our command-line tool for building with Cloudflare developer products, makes streaming real-time logs from our applications running remotely easy. Still, to be honest, debugging Rust and Wasm with Cloudflare Workers involves a lot of the good old time-consuming and nerve-wracking printf'ing strategy.

What if there’s a better way? This blog is about enabling and using Wasm core dumps and how you can easily debug Rust in Cloudflare Workers.

What are core dumps?

In computing, a core dump consists of the recorded state of the working memory of a computer program at a specific time, generally when the program has crashed or otherwise terminated abnormally. Core dumps also include things like the processor registers, stack pointer, program counter, and other information that may be relevant to fully understanding why the program crashed.

Depending on the system’s configuration, core dumps are usually initiated by the operating system in response to a program crash. You can then use a debugger like gdb to examine what happened and hopefully determine the cause of the crash. gdb allows you to run the executable to try to replicate the crash in a more controlled environment, inspect the variables, and much more. The Windows equivalent of a core dump is a minidump. Other mature languages that are interpreted, like Python, or languages that run inside a virtual machine, like Java, also have their own ways of generating core dumps for post-mortem analysis.

Core dumps are particularly useful for post-mortem debugging, determining the conditions that led to a failure after it has occurred.

WebAssembly core dumps

WebAssembly has had a proposal for implementing core dumps in discussion for a while. It's a work-in-progress experimental specification, but it provides basic support for the main ideas of post-mortem debugging, including using the DWARF (debugging with attributed record formats) debug format, the same that Linux and gdb use. Some of the most popular Wasm runtimes, like Wasmtime and Wasmer, have experimental flags that you can enable to start playing with Wasm core dumps today.

If you run Wasmtime or Wasmer with the flag:

--coredump-on-trap=/path/to/coredump/file

The core dump file will be emitted at that location path if a crash happens. You can then use tools like wasmgdb to inspect the file and debug the crash.

But let's dig into how the core dumps are generated in WebAssembly, and what’s inside them.

How are Wasm core dumps generated

(and what’s inside them)

When WebAssembly terminates execution due to abnormal behavior, we say that it entered a trap. With Rust, examples of operations that can trap are accessing out-of-bounds addresses or dividing an integer by zero. You can read about the security model of WebAssembly to learn more about traps.

The core dump specification plugs into the trap workflow. When WebAssembly crashes and enters a trap, core dumping support kicks in and starts unwinding the call stack gathering debugging information. For each frame in the stack, it collects the function parameters and the values stored in locals and in the stack, along with binary offsets that help us map to exact locations in the source code. Finally, it snapshots the memory and captures information like the tables and the global variables.

DWARF is used by many mature languages like C, C++, Rust, Java, or Go. When the compiler emits DWARF information into the binary, a debugger can provide information such as the source file name and the line number where the exception occurred, function and argument names, and more. Without DWARF, the core dumps would be just pure assembly code without any contextual information or metadata related to the source code that generated it before compilation, and they would be much harder to debug.

WebAssembly uses a (lighter) version of DWARF that maps functions, or a module and local variables, to their names in the source code (you can read about the WebAssembly name section for more information), and naturally core dumps use this information.

All this information for debugging is then bundled together and saved to the file, the core dump file.

The core dump structure has multiple sections, but the most important are:

General information about the process;
The threads and their stack frames (note that WebAssembly is single threaded in Cloudflare Workers);
A snapshot of the WebAssembly linear memory or only the relevant regions;
Optionally, other sections like globals, data, or table.

Here’s the thread definition from the core dump specification:

corestack   ::= customsec(thread-info vec(frame))
thread-info ::= 0x0 thread-name:name ...
frame       ::= 0x0 ... funcidx:u32 codeoffset:u32 locals:vec(value)
                stack:vec(value)

A thread is a custom section called corestack. A corestack section contains the thread name and a vector (or array) of frames. Each frame contains the function index in the WebAssembly module (funcidx), the code offset relative to the function's start (codeoffset), the list of locals, and the list of values in the stack.

Values are defined as follows:

value ::= 0x01       => ∅
        | 0x7F n:i32 => n
        | 0x7E n:i64 => n
        | 0x7D n:f32 => n
        | 0x7C n:f64 => n

At the time of this writing, these are the possible number types in a value. Again, we wanted to describe the basics; you should track the full specification to get more detail or find information about future changes. WebAssembly core dump support is in its early stages of specification and implementation; things will get better, and things might change.
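To make the frame and value definitions above more tangible, here is a rough sketch of how a reader of such a file might model and decode them. It is not the specification's reference code. It assumes that integers use the signed LEB128 encoding of the Wasm binary format, that floats are little-endian IEEE 754, and that the 0x01 tag marks a value that is absent; the names `Value`, `Frame`, `read_sleb128`, and `decode_value` are made up for this example.

```rust
use std::convert::{TryFrom, TryInto};

/// One captured value, mirroring the tags in the grammar above.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Missing, // tag 0x01: no value recorded (assumed meaning)
    I32(i32),
    I64(i64),
    F32(f32),
    F64(f64),
}

/// One stack frame, mirroring the `frame` production.
#[derive(Debug, Clone, PartialEq)]
struct Frame {
    funcidx: u32,
    codeoffset: u32,
    locals: Vec<Value>,
    stack: Vec<Value>,
}

/// Decodes a signed LEB128 integer; returns the value and bytes consumed.
fn read_sleb128(bytes: &[u8]) -> Option<(i64, usize)> {
    let mut result: i64 = 0;
    let mut shift = 0u32;
    for (i, &byte) in bytes.iter().enumerate() {
        if shift >= 64 {
            return None; // too many continuation bytes
        }
        result |= i64::from(byte & 0x7f) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            // Last byte: sign-extend when its sign bit (0x40) is set.
            if shift < 64 && byte & 0x40 != 0 {
                result |= -1i64 << shift;
            }
            return Some((result, i + 1));
        }
    }
    None // input ended in the middle of a number
}

/// Decodes one tagged value; returns it with the number of bytes consumed.
fn decode_value(bytes: &[u8]) -> Option<(Value, usize)> {
    let (&tag, rest) = bytes.split_first()?;
    match tag {
        0x01 => Some((Value::Missing, 1)),
        0x7F => {
            let (n, used) = read_sleb128(rest)?;
            Some((Value::I32(i32::try_from(n).ok()?), 1 + used))
        }
        0x7E => {
            let (n, used) = read_sleb128(rest)?;
            Some((Value::I64(n), 1 + used))
        }
        0x7D => {
            let raw: [u8; 4] = rest.get(..4)?.try_into().ok()?;
            Some((Value::F32(f32::from_le_bytes(raw)), 5))
        }
        0x7C => {
            let raw: [u8; 8] = rest.get(..8)?.try_into().ok()?;
            Some((Value::F64(f64::from_le_bytes(raw)), 9))
        }
        _ => None, // unknown tag
    }
}

fn main() {
    let (first_local, _) = decode_value(&[0x7F, 0x2A]).expect("valid i32 value");
    let frame = Frame {
        funcidx: 1,
        codeoffset: 0,
        locals: vec![first_local, Value::Missing],
        stack: Vec::new(),
    };
    println!(
        "func {} at offset {}: locals {:?}, stack {:?}",
        frame.funcidx, frame.codeoffset, frame.locals, frame.stack
    );
}
```

Returning the consumed byte count alongside each value lets a caller walk a `vec(value)` by advancing through the buffer one value at a time.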

This is all great news. Unfortunately, however, the Cloudflare Workers runtime doesn’t support WebAssembly core dumps yet. There is no technical impediment to adding this feature to workerd; after all, it's based on V8, but since it powers a critical part of our production infrastructure and products, we tend to be conservative when it comes to adding specifications or standards that are still considered experimental and still going through the definitions phase.

So, how do we get Wasm core dumps in Cloudflare Workers today?

Polyfilling

Polyfilling means using userland code to provide modern functionality in older environments that do not natively support it. Polyfills are widely popular in the JavaScript community and the browser environment; they've been used extensively to address issues where browser vendors haven't yet caught up with the latest standards, or where they implement the same features in different ways, or to address cases where old browsers can never support a new standard.

Meet wasm-coredump-rewriter, a tool that you can use to rewrite a Wasm module and inject the core dump runtime functionality in the binary. This runtime code will catch most traps (exceptions in host functions are not yet caught, and memory violations are not caught by default) and generate a standard core dump file. To some degree, this is similar to how Binaryen's Asyncify works.

Let’s look at code and see how this works. Here’s some simple pseudocode:

export function entry(v1, v2) {
    return addTwo(v1, v2)
}

function addTwo(v1, v2) {
    res = v1 + v2;
    throw "something went wrong";

    return res
}

An imaginary compiler could take that source and generate the following Wasm code (shown here in its text format):

  (func $entry (param i32 i32) (result i32)
    (local.get 0)
    (local.get 1)
    (call $addTwo)
  )

  (func $addTwo (param i32 i32) (result i32)
    (local.get 0)
    (local.get 1)
    (i32.add)
    (unreachable) ;; something went wrong
  )

  (export "entry" (func $entry))

“;;” is used to denote a comment.

entry() is the Wasm function exported to the host. In an environment like the browser, JavaScript (being the host) can call entry().

Irrelevant parts of the code have been snipped for brevity, but this is what the Wasm code will look like after wasm-coredump-rewriter rewrites it:

  (func $entry (type 0) (param i32 i32) (result i32)
    ...
    local.get 0
    local.get 1
    call $addTwo ;; see the addTwo function below
    global.get 2 ;; is unwinding?
    if  ;; label = @1
      i32.const x ;; code offset
      i32.const 0 ;; function index
      i32.const 2 ;; local_count
      call $coredump/start_frame
      local.get 0
      call $coredump/add_i32_local
      local.get 1
      call $coredump/add_i32_local
      ...
      call $coredump/write_coredump
      unreachable
    end)

  (func $addTwo (type 0) (param i32 i32) (result i32)
    local.get 0
    local.get 1
    i32.add
    ;; the unreachable instruction was here before
    call $coredump/unreachable_shim
    i32.const 1 ;; funcidx
    i32.const 2 ;; local_count
    call $coredump/start_frame
    local.get 0
    call $coredump/add_i32_local
    local.get 1
    call $coredump/add_i32_local
    ...
    return)

  (export "entry" (func $entry))

As you can see, a few things changed:

The (unreachable) instruction in addTwo() was replaced by a call to $coredump/unreachable_shim which starts the unwinding process. Then, the location and debugging data is captured, and the function returns normally to the entry() caller.
Code has been added after the addTwo() call instruction in entry() that detects if we have an unwinding process in progress or not. If we do, then it also captures the local debugging data, writes the core dump file and then, finally, moves to the unconditional trap unreachable.

In short, we unwind until the exported function entry() terminates by executing unreachable.

For more clarity, let’s go over the runtime functions that we inject. Stay with us:

$coredump/start_frame(funcidx, local_count) starts a new frame in the coredump.
$coredump/add_*_local(value) captures the values of function arguments and locals (capturing values from the stack isn’t implemented yet).
$coredump/write_coredump is used at the end and writes the core dump to memory. We take advantage of the first 1 KiB of the Wasm linear memory, which is unused, to store our core dump.
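The control flow of the rewritten module can be modeled in ordinary Rust to make the unwinding easier to follow. This is only a simulation under simplifying assumptions, not the rewriter's actual runtime: `CoredumpRuntime`, `CapturedFrame`, and the `Result` returned by `entry` are stand-ins invented for this sketch, the code offset argument is omitted, and the "core dump" is just a flag plus a list of frames.

```rust
/// One captured frame; a simplified stand-in for a real core dump frame.
#[derive(Debug, PartialEq)]
struct CapturedFrame {
    funcidx: u32,
    locals: Vec<i32>,
}

/// Stand-in for the injected runtime state (hypothetical names throughout).
#[derive(Default)]
struct CoredumpRuntime {
    unwinding: bool,            // plays the role of the "is unwinding?" global
    frames: Vec<CapturedFrame>, // frames recorded so far, innermost first
    written: bool,              // set once the dump has been "written"
}

impl CoredumpRuntime {
    /// Replaces the original `unreachable`: starts unwinding instead of trapping.
    fn unreachable_shim(&mut self) {
        self.unwinding = true;
    }

    /// Opens a new frame for the function with index `funcidx`.
    fn start_frame(&mut self, funcidx: u32, local_count: usize) {
        self.frames.push(CapturedFrame {
            funcidx,
            locals: Vec::with_capacity(local_count),
        });
    }

    /// Records one i32 argument or local in the frame opened most recently.
    fn add_i32_local(&mut self, value: i32) {
        if let Some(frame) = self.frames.last_mut() {
            frame.locals.push(value);
        }
    }

    /// Marks the dump as complete; the real code serializes it to memory.
    fn write_coredump(&mut self) {
        self.written = true;
    }
}

/// Model of the rewritten addTwo: capture the frame, then return normally.
fn add_two(rt: &mut CoredumpRuntime, v1: i32, v2: i32) -> i32 {
    let res = v1.wrapping_add(v2);
    rt.unreachable_shim(); // the unreachable instruction was here before
    rt.start_frame(1, 2);
    rt.add_i32_local(v1);
    rt.add_i32_local(v2);
    res
}

/// Model of the rewritten entry: after the call, check for unwinding.
fn entry(rt: &mut CoredumpRuntime, v1: i32, v2: i32) -> Result<i32, &'static str> {
    let res = add_two(rt, v1, v2);
    if rt.unwinding {
        rt.start_frame(0, 2);
        rt.add_i32_local(v1);
        rt.add_i32_local(v2);
        rt.write_coredump();
        return Err("unreachable"); // stands in for the final trap
    }
    Ok(res)
}

fn main() {
    let mut rt = CoredumpRuntime::default();
    let outcome = entry(&mut rt, 2, 3);
    println!("outcome: {:?}", outcome);
    println!("dump written: {}, frames: {:?}", rt.written, rt.frames);
}
```

The point of the model is the ordering: the innermost function records its frame first, each caller appends its own frame on the way out, and only the outermost exported function writes the dump before trapping.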

A diagram is worth a thousand words:

Wait, what’s this about the first 1 KiB of the memory being unused, you ask? Well, it turns out that most WebAssembly linkers and tools, including Emscripten and WebAssembly’s LLVM, don’t use the first 1 KiB of memory. Rust and Zig also use LLVM, but they changed the default. This isn’t pretty, but the hugely popular Asyncify polyfill relies on the same trick, so there’s reasonable support until we find a better way.

But we digress, let’s continue. After the crash, the host, typically JavaScript in the browser, can now catch the exception and extract the core dump from the Wasm instance’s memory:

try {
    wasmInstance.exports.someExportedFunction();
} catch(err) {
    const image = new Uint8Array(wasmInstance.exports.memory.buffer);
    writeFile("coredump." + Date.now(), image);
}

If you're curious about the actual details of the core dump implementation, you can find the source code here. It was written in AssemblyScript, a TypeScript-like language for WebAssembly.

This is how we use the polyfilling technique to implement Wasm core dumps when the runtime doesn’t support them yet. Interestingly, some Wasm runtimes, being optimizing compilers, are likely to make debugging more difficult because function arguments, locals, or functions themselves can be optimized away. Polyfilling or rewriting the binary could actually preserve more source-level information for debugging.

You might be asking: what about performance? We did some testing and found that the impact is negligible; the cost-benefit of being able to debug our crashes is positive. Also, you can easily turn Wasm core dumps on or off for specific builds or environments; deciding when you need them is up to you.

Debugging from a core dump

We now know how to generate a core dump, but how do we use it to diagnose and debug a software crash?

Similarly to gdb (GNU Project Debugger) on Linux, wasmgdb is the tool you can use to parse and make sense of core dumps in WebAssembly; it understands the file structure, uses DWARF to provide naming and contextual information, and offers interactive commands to navigate the data. To exemplify how it works, wasmgdb has a demo of a Rust application that deliberately crashes; we will use it.

Let's imagine that our Wasm program crashed, wrote a core dump file, and we want to debug it.

$ wasmgdb source-program.wasm /path/to/coredump
wasmgdb>

When you fire up wasmgdb, you enter a REPL (Read-Eval-Print Loop) interface, and you can start typing commands. The tool tries to mimic the gdb command syntax; you can find the list here.

Let's examine the backtrace using the bt command:

wasmgdb> bt
#18     000137 as __rust_start_panic () at library/panic_abort/src/lib.rs
#17     000129 as rust_panic () at library/std/src/panicking.rs
#16     000128 as rust_panic_with_hook () at library/std/src/panicking.rs
#15     000117 as {closure#0} () at library/std/src/panicking.rs
#14     000116 as __rust_end_short_backtrace<std::panicking::begin_panic_handler::{closure_env#0}, !> () at library/std/src/sys_common/backtrace.rs
#13     000123 as begin_panic_handler () at library/std/src/panicking.rs
#12     000194 as panic_fmt () at library/core/src/panicking.rs
#11     000198 as panic () at library/core/src/panicking.rs
#10     000012 as calculate (value=0x03000000) at src/main.rs
#9      000011 as process_thing (thing=0x2cff0f00) at src/main.rs
#8      000010 as main () at src/main.rs
#7      000008 as call_once<fn(), ()> (???=0x01000000, ???=0x00000000) at /rustc/b833ad56f46a0bbe0e8729512812a161e7dae28a/library/core/src/ops/function.rs
#6      000020 as __rust_begin_short_backtrace<fn(), ()> (f=0x01000000) at /rustc/b833ad56f46a0bbe0e8729512812a161e7dae28a/library/std/src/sys_common/backtrace.rs
#5      000016 as {closure#0}<()> () at /rustc/b833ad56f46a0bbe0e8729512812a161e7dae28a/library/std/src/rt.rs
#4      000077 as lang_start_internal () at library/std/src/rt.rs
#3      000015 as lang_start<()> (main=0x01000000, argc=0x00000000, argv=0x00000000, sigpipe=0x00620000) at /rustc/b833ad56f46a0bbe0e8729512812a161e7dae28a/library/std/src/rt.rs
#2      000013 as __original_main () at <directory not found>/<file not found>
#1      000005 as _start () at <directory not found>/<file not found>
#0      000264 as _start.command_export at <no location>

Each line represents a frame from the program's call stack; see frame #3:

#3      000015 as lang_start<()> (main=0x01000000, argc=0x00000000, argv=0x00000000, sigpipe=0x00620000) at /rustc/b833ad56f46a0bbe0e8729512812a161e7dae28a/library/std/src/rt.rs

You can read the funcidx, function name, argument names and values, and source location; they are all present. Let's select frame #9 now and inspect the locals, which include the function arguments:

wasmgdb> f 9
000011 as process_thing (thing=0x2cff0f00) at src/main.rs
wasmgdb> info locals
thing: *MyThing = 0xfff1c

Let’s use the p command to inspect the content of the thing argument:

wasmgdb> p (*thing)
thing (0xfff2c): MyThing = {
    value (0xfff2c): usize = 0x00000003
}

You can also use the p command to inspect the value of the variable, which can be useful for nested structures:

wasmgdb> p (*thing)->value
value (0xfff2c): usize = 0x00000003

And you can use p to inspect memory addresses. Let’s point at 0xfff2c, the start of the MyThing structure, and inspect:

wasmgdb> p (MyThing) 0xfff2c
0xfff2c (0xfff2c): MyThing = {
    value (0xfff2c): usize = 0x00000003
}

All this information in every step of the stack is very helpful to determine the cause of a crash. In our test case, if you look at frame #10, we triggered an integer overflow. Once you get comfortable walking through wasmgdb and using its commands to inspect the data, debugging core dumps will be another powerful skill under your belt.

Tidying up everything in Cloudflare Workers

We learned about core dumps and how they work, and we know how to make Cloudflare Workers generate them using the wasm-coredump-rewriter polyfill, but how does all this work in practice end to end?

We've been dogfooding the technique described in this blog at Cloudflare for a while now. Wasm core dumps have been invaluable in helping us debug Rust-based services running on top of Cloudflare Workers like D1, Privacy Edge, AMP, or Constellation.

Today we're open-sourcing the Wasm Coredump Service and enabling anyone to deploy it. This service collects the Wasm core dumps originating from your projects and applications when they crash, parses them, prints an exception with the stack information in the logs, and can optionally store the full core dump in a file in an R2 bucket (which you can then use with wasmgdb) or send the exception to Sentry.

We use a service binding to facilitate the communication between your application Worker and the Coredump service Worker. A service binding allows you to send HTTP requests to another Worker without those requests going over the Internet, thus avoiding network latency or having to deal with authentication. Here’s a diagram of how it works:

Using it is as simple as npm/yarn installing @cloudflare/wasm-coredump, configuring a few options, and then adding a few lines of code to your other applications running in Cloudflare Workers, in the exception handling logic:

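import shim, { getMemory, wasmModule } from "../build/worker/shim.mjs"
// recordCoredump is assumed to be exported by the @cloudflare/wasm-coredump package installed above.
import { recordCoredump } from "@cloudflare/wasm-coredump"

const timeoutSecs = 20;

async function fetch(request, env, ctx) {
    try {
        // see https://github.com/rustwasm/wasm-bindgen/issues/2724.
        return await Promise.race([
            shim.fetch(request, env, ctx),
            // Reject after timeoutSecs so a panic that never settles the Promise still reaches the catch block.
            new Promise((_resolve, reject) => setTimeout(() => reject("timeout"), timeoutSecs * 1000))
        ]);
    } catch (err) {
        // Grab the Wasm memory, which holds the core dump, and hand it to the Coredump service.
        const memory = getMemory();
        const coredumpService = env.COREDUMP_SERVICE;
        await recordCoredump({ memory, wasmModule, request, coredumpService });
        throw err;
    }
}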

The ../build/worker/shim.mjs import comes from the worker-build tool, from the workers-rs packages and is automatically generated when wrangler builds your Rust-based Cloudflare Workers project. If the Wasm throws an exception, we catch it, extract the core dump from memory, and send it to our Core dump service.

You might have noticed that we race the workers-rs shim.fetch() entry point with another Promise to generate a timeout exception if the Rust code doesn't respond earlier. This is because wasm-bindgen, which workers-rs uses to generate the glue between JavaScript and Rust, currently has an issue where a Promise might not be rejected if Rust panics asynchronously, leading to the Worker runtime killing the worker with “Error: The script will never generate a response”. This can block the wasm-coredump code and make the core dump generation flaky.

We are working to improve this, but in the meantime, make sure to adjust timeoutSecs to something slightly bigger than the typical response time of your application.

Here’s an example of a Wasm core dump exception in Sentry:

You can find a working example, the Sentry and R2 configuration options, and more details in the @cloudflare/wasm-coredump GitHub repository.

Too big to fail

It's worth mentioning one corner case of this debugging technique and the solution: sometimes your codebase is so big that adding core dump and DWARF debugging information might result in a Wasm binary that is too big to fit in a Cloudflare Worker. Well, worry not; we have a solution for that too.

Fortunately the DWARF for WebAssembly specification also supports external DWARF files. To make this work, we have a tool called debuginfo-split that you can add to the build command in the wrangler.toml configuration:

command = "... && debuginfo-split ./build/worker/index.wasm"

What this does is it strips the debugging information from the Wasm binary, and writes it to a new separate file called debug-{UUID}.wasm. You then need to upload this file to the same R2 bucket used by the Wasm Coredump Service (you can automate this as part of your CI or build scripts). The same UUID is also injected into the main Wasm binary; this allows us to correlate the Wasm binary with its corresponding DWARF debugging information. Problem solved.

Binaries without DWARF information can be significantly smaller. Here’s our example:

4.5 MiB	debug-63372dbe-41e6-447d-9c2e-e37b98e4c656.wasm
313 KiB	build/worker/index.wasm

Final words

We hope you enjoyed reading this blog as much as we did writing it and that it can help you take your Wasm debugging journeys, using Cloudflare Workers or not, to another level.

Note that while the examples used here were around using Rust and WebAssembly because that's a common pattern, you can use the same techniques if you're compiling WebAssembly from other languages like C or C++.

Also, note that the WebAssembly core dump standard is a hot topic, and its implementations and adoption are evolving quickly. We will continue improving the wasm-coredump-rewriter, debuginfo-split, and wasmgdb tools and the wasm-coredump service. More and more runtimes, including V8, will eventually support core dumps natively, thus eliminating the need to use polyfills, and the tooling, in general, will get better; that's a certainty. For now, we present you with a solution that works today, and we have strong incentives to keep supporting it.

As usual, you can talk to us on our Developers Discord or the Community forum or open issues or PRs in our GitHub repositories; the team will be listening.

Every request, every microsecond: scalable machine learning at Cloudflare

2023-06-19 Alex Bocharov

Post Syndicated from Alex Bocharov original http://blog.cloudflare.com/scalable-machine-learning-at-cloudflare/

In this post, we will take you through the advancements we've made in our machine learning capabilities. We'll describe the technical strategies that have enabled us to expand the number of machine learning features and models, all while substantially reducing the processing time for each HTTP request on our network. Let's begin.

Background

For a comprehensive understanding of our evolved approach, it's important to grasp the context within which our machine learning detections operate. Cloudflare, on average, serves over 46 million HTTP requests per second, surging to more than 63 million requests per second during peak times.

Machine learning detection plays a crucial role in ensuring the security and integrity of this vast network. In fact, it classifies the largest volume of requests among all our detection mechanisms, providing the final Bot Score decision for over 72% of all HTTP requests. Going beyond, we run several machine learning models in shadow mode for every HTTP request.

At the heart of our machine learning infrastructure lies our reliable ally, CatBoost. It enables ultra low-latency model inference and ensures high-quality predictions to detect novel threats, such as bots targeting our customers' mobile apps. However, it's worth noting that machine learning model inference is just one component of the overall latency equation. Other critical components include machine learning feature extraction and preparation. In our quest for optimal performance, we've continuously optimized each aspect contributing to the overall latency of our system.

Initially, our machine learning models relied on single-request features, such as presence or value of certain headers. However, given the ease of spoofing these attributes, we evolved our approach. We turned to inter-request features that leverage aggregated information across multiple dimensions of a request in a sliding time window. For example, we now consider factors like the number of unique user agents associated with certain request attributes.
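As a rough illustration of an inter-request feature, the sketch below counts the distinct user agents seen for one dimension key within a sliding time window. The type UniqueUserAgents, its method names, and the example key format are hypothetical and chosen for this sketch; it is not how the production feature pipeline is implemented.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Tracks, per dimension key, the user agents seen during the last
/// `window_secs` seconds. Timestamps are assumed to be non-decreasing.
pub struct UniqueUserAgents {
    window_secs: u64,
    events: HashMap<String, VecDeque<(u64, String)>>,
}

impl UniqueUserAgents {
    pub fn new(window_secs: u64) -> Self {
        Self {
            window_secs,
            events: HashMap::new(),
        }
    }

    /// Records one request and returns the number of distinct user agents
    /// observed for `key` in the window (now - window_secs, now].
    pub fn observe(&mut self, key: &str, user_agent: &str, now_secs: u64) -> usize {
        let window = self.window_secs;
        let queue = self.events.entry(key.to_string()).or_default();
        queue.push_back((now_secs, user_agent.to_string()));

        // Evict events that have fallen out of the window.
        while queue
            .front()
            .map_or(false, |(ts, _)| ts.saturating_add(window) <= now_secs)
        {
            queue.pop_front();
        }

        // Count distinct user agents among the remaining events.
        queue
            .iter()
            .map(|(_, ua)| ua.as_str())
            .collect::<HashSet<_>>()
            .len()
    }
}
```

Because the count aggregates many requests, a client cannot change it by spoofing a single header, which is what makes this kind of feature harder to evade than a single-request one. A production version would keep running counts rather than rescanning the window on every request.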

The extraction and preparation of inter-request features were handled by Gagarin, a Go-based feature serving platform we developed. As a request arrived at Cloudflare, we extracted dimension keys from the request attributes. We then looked up the corresponding machine learning features in the multi-layered cache. If the desired machine learning features were not found in the cache, a memcached "get" request was made to Gagarin to fetch those. Then machine learning features were plugged into CatBoost models to produce detections, which were then surfaced to the customers via Firewall and Workers fields and internally through our logging pipeline to ClickHouse. This allowed our data scientists to run further experiments, producing more features and models.

Initially, Gagarin exhibited decent latency, with a median latency around 200 microseconds to serve all machine learning features for given keys. However, as our system evolved and we introduced more features and dimension keys, coupled with increased traffic, the cache hit ratio began to wane. The median latency had increased to 500 microseconds and during peak times, the latency worsened significantly, with the p99 latency soaring to roughly 10 milliseconds. Gagarin underwent extensive low-level tuning, optimization, profiling, and benchmarking. Despite these efforts, we encountered the limits of inter-process communication (IPC) using Unix Domain Socket (UDS), among other challenges, explored below.

Problem definition

In summary, the previous solution had its drawbacks, including:

High tail latency: during the peak time, a portion of requests experienced increased latency caused by CPU contention on the Unix socket and the Lua garbage collector.
Suboptimal resource utilization: CPU and RAM utilization was not optimized to the full potential, leaving fewer resources for other services running on the server.
Machine learning features availability: decreased due to memcached timeouts, which resulted in a higher likelihood of false positives or false negatives for a subset of the requests.
Scalability constraints: as we added more machine learning features, we approached the scalability limit of our infrastructure.

Equipped with a comprehensive understanding of the challenges and armed with quantifiable metrics, we ventured into the next phase: seeking a more efficient way to fetch and serve machine learning features.

Exploring solutions

In our quest for more efficient methods of fetching and serving machine learning features, we evaluated several alternatives. The key approaches included:

Further optimizing Gagarin: as we pushed our Go-based memcached server to its limits, we encountered a lower bound on latency reductions. This arose from the synchronization overhead and multiple data copies of IPC over UDS, the serialization/deserialization overheads, as well as the inherent latency of the garbage collector and the performance of hashmap lookups in Go.

Considering Quicksilver: we contemplated using Quicksilver, but the volume and update frequency of machine learning features posed capacity concerns and potential negative impacts on other use cases. Moreover, it uses a Unix socket with the memcached protocol, reproducing the same limitations previously encountered.

Increasing multi-layered cache size: we investigated expanding cache size to accommodate tens of millions of dimension keys. However, the associated memory consumption, due to duplication of these keys and their machine learning features across worker threads, rendered this approach untenable.

Sharding the Unix socket: we considered sharding the Unix socket to alleviate contention and improve performance. Despite showing potential, this approach only partially solved the problem and introduced more system complexity.

Switching to RPC: we explored the option of using RPC for communication between our front line server and Gagarin. However, since RPC still requires some form of communication bus (such as TCP, UDP, or UDS), it would not significantly change the performance compared to the memcached protocol over UDS, which was already simple and minimalistic.

After considering these approaches, we shifted our focus towards investigating alternative Inter-Process Communication (IPC) mechanisms.

IPC mechanisms

Adopting a first principles design approach, we questioned: "What is the most efficient low-level method for data transfer between two processes provided by the operating system?" Our goal was to find a solution that would enable the direct serving of machine learning features from memory for corresponding HTTP requests. By eliminating the need to traverse the Unix socket, we aimed to reduce CPU contention, improve latency, and minimize data copying.

To identify the most efficient IPC mechanism, we evaluated various options available within the Linux ecosystem. We used ipc-bench, an open-source benchmarking tool specifically designed for this purpose, to measure the latencies of different IPC methods in our test environment. The measurements were based on sending one million 1,024-byte messages back and forth (i.e., ping pong) between two processes.

IPC method	Avg duration, μs	Avg throughput, msg/s
eventfd (bi-directional)	9.456	105,533
TCP sockets	8.74	114,143
Unix domain sockets	5.609	177,573
FIFOs (named pipes)	5.432	183,388
Pipe	4.733	210,369
Message Queue	4.396	226,421
Unix Signals	2.45	404,844
Shared Memory	0.598	1,616,014
Memory-Mapped Files	0.503	1,908,613

Based on our evaluation, we found that Unix sockets, while taking care of synchronization, were not the fastest IPC method available. The two fastest IPC mechanisms were shared memory and memory-mapped files. Both approaches offered similar performance, with the former using a specific tmpfs volume in /dev/shm and dedicated system calls, while the latter could be stored in any volume, including tmpfs or HDD/SSD.
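The ping-pong measurement described above can be approximated with the Rust standard library alone. This is a minimal sketch, not ipc-bench itself: it bounces messages over a Unix domain socket pair between two threads rather than two processes, the function name unix_socket_ping_pong is made up for illustration, and the figure it reports is a full round trip on whatever machine runs it, so it is not directly comparable to the table.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;
use std::time::{Duration, Instant};

/// Bounces `rounds` messages of `msg_size` bytes over a Unix domain socket
/// pair and returns the average round-trip time per message.
pub fn unix_socket_ping_pong(rounds: u32, msg_size: usize) -> std::io::Result<Duration> {
    let (mut client, mut server) = UnixStream::pair()?;

    // Echo side: read one full message, then send it straight back.
    let echo = thread::spawn(move || -> std::io::Result<()> {
        let mut buf = vec![0u8; msg_size];
        for _ in 0..rounds {
            server.read_exact(&mut buf)?;
            server.write_all(&buf)?;
        }
        Ok(())
    });

    let payload = vec![0xABu8; msg_size];
    let mut reply = vec![0u8; msg_size];
    let start = Instant::now();
    for _ in 0..rounds {
        client.write_all(&payload)?;
        client.read_exact(&mut reply)?;
    }
    let elapsed = start.elapsed();

    echo.join().expect("echo thread panicked")?;
    // Guard against division by zero when `rounds` is 0.
    Ok(elapsed / rounds.max(1))
}
```

Every round trip here crosses the kernel twice and copies the payload each way, which is the overhead that shared memory and memory-mapped files avoid.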

Missing ingredients

In light of these findings, we decided to employ memory-mapped files as the IPC mechanism for serving machine learning features. This choice promised reduced latency, decreased CPU contention, and minimal data copying. However, it did not inherently offer data synchronization capabilities like Unix sockets. Unlike Unix sockets, memory-mapped files are simply files in a Linux volume that can be mapped into memory of the process. This sparked several critical questions:

How could we efficiently fetch an array of hundreds of float features for given dimension keys when dealing with a file?
How could we ensure safe, concurrent and frequent updates for tens of millions of keys?
How could we avert the CPU contention previously encountered with Unix sockets?
How could we effectively support the addition of more dimensions and features in the future?

To address these challenges we needed to further evolve this new approach by adding a few key ingredients to the recipe.

Augmenting the Idea

To realize our vision of memory-mapped files as a method for serving machine learning features, we needed to employ several key strategies, touching upon aspects like data synchronization, data structure, and deserialization.

Wait-free synchronization

When dealing with concurrent data, ensuring safe, concurrent, and frequent updates is paramount. Traditional locks are often not the most efficient solution, especially when dealing with high concurrency environments. Here's a rundown on three different synchronization techniques:

With-lock synchronization: a common approach using mechanisms like mutexes or spinlocks. It ensures only one thread can access the resource at a given time, but can suffer from contention, blocking, and priority inversion, as we saw with Unix sockets.

Lock-free synchronization: this non-blocking approach employs atomic operations to ensure at least one thread always progresses. It eliminates traditional locks but requires careful handling of edge cases and race conditions.

Wait-free synchronization: a more advanced technique that guarantees every thread makes progress and completes its operation without being blocked by other threads. It provides stronger progress guarantees compared to lock-free synchronization, ensuring that each thread completes its operation within a finite number of steps.

	Disjoint Access Parallelism	Starvation Freedom	Finite Execution Time
With lock	No	No	No
Lock-free	Yes	No	No
Wait-free	Yes	Yes	Yes
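The three techniques above can be contrasted on the simplest possible shared state, a counter. This is an illustrative sketch using only the Rust standard library; the function names are invented for the example. Note the assumption in the third function: fetch_add is wait-free only where the hardware provides a native atomic add (as on x86-64), while on other targets it may compile to a retry loop.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// With-lock: only one thread at a time may touch the counter; the others block.
pub fn increment_with_lock(counter: &Mutex<u64>) {
    let mut guard = counter.lock().unwrap();
    *guard += 1;
}

/// Lock-free: retry a compare-and-swap until it succeeds. Some thread always
/// makes progress, but an unlucky thread may retry an unbounded number of times.
pub fn increment_lock_free(counter: &AtomicU64) {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return,
            Err(observed) => current = observed,
        }
    }
}

/// Wait-free (assuming a native atomic add): a single atomic instruction, so
/// every caller finishes in a bounded number of steps.
pub fn increment_wait_free(counter: &AtomicU64) {
    counter.fetch_add(1, Ordering::AcqRel);
}

/// Runs `threads` workers that each call `op` `per_thread` times.
pub fn run_workers<F>(threads: usize, per_thread: usize, op: F)
where
    F: Fn() + Send + Sync + 'static,
{
    let op = Arc::new(op);
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let op = Arc::clone(&op);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    op();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```

All three variants produce the same final count; they differ only in what each thread is guaranteed while it waits for the others.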

Our wait-free data access pattern draws inspiration from Linux kernel's Read-Copy-Update (RCU) pattern and the Left-Right concurrency control technique. In our solution, we maintain two copies of the data in separate memory-mapped files. Write access to this data is managed by a single writer, with multiple readers able to access the data concurrently.

We store the synchronization state, which coordinates access to these data copies, in a third memory-mapped file, referred to as "state". This file contains an atomic 64-bit integer, which represents an InstanceVersion and a pair of additional atomic 32-bit variables, tracking the number of active readers for each data copy. The InstanceVersion consists of the currently active data file index (1 bit), the data size (39 bits, accommodating data sizes up to 549 GB), and a data checksum (24 bits).
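To show how the three fields can share one 64-bit word, here is a minimal packing sketch. The bit widths (1, 39, and 24) come from the description above; the field order within the word, the constructor, and the accessor names are assumptions made for illustration and may not match the layout mmap-sync actually uses.

```rust
const SIZE_BITS: u32 = 39;
const CHECKSUM_BITS: u32 = 24;
const SIZE_MASK: u64 = (1 << SIZE_BITS) - 1;
const CHECKSUM_MASK: u64 = (1 << CHECKSUM_BITS) - 1;

/// A 64-bit word holding: data file index (1 bit), data size (39 bits),
/// and data checksum (24 bits).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct InstanceVersion(u64);

impl InstanceVersion {
    /// Packs the fields; returns `None` if a value does not fit its bit width.
    pub fn new(index: u8, size: u64, checksum: u32) -> Option<Self> {
        if index > 1 || size > SIZE_MASK || u64::from(checksum) > CHECKSUM_MASK {
            return None;
        }
        // Assumed layout, from low bits to high: index | size | checksum.
        Some(Self(
            u64::from(index) | (size << 1) | (u64::from(checksum) << (1 + SIZE_BITS)),
        ))
    }

    /// Which of the two data files is currently active (0 or 1).
    pub fn index(self) -> u8 {
        (self.0 & 1) as u8
    }

    /// Size of the serialized data in bytes.
    pub fn size(self) -> u64 {
        (self.0 >> 1) & SIZE_MASK
    }

    /// Truncated checksum of the serialized data.
    pub fn checksum(self) -> u32 {
        ((self.0 >> (1 + SIZE_BITS)) & CHECKSUM_MASK) as u32
    }

    /// Raw value, suitable for storing in an `AtomicU64`.
    pub fn as_u64(self) -> u64 {
        self.0
    }
}
```

Because everything fits in a single word, a writer can publish a new index, size, and checksum with one atomic store, and readers never observe a mix of old and new fields.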

Zero-copy deserialization

To efficiently store and fetch machine learning features, we needed to address the challenge of deserialization latency. Here, zero-copy deserialization provides an answer. This technique reduces the time and memory required to access and use data by directly referencing bytes in the serialized form.

We turned to rkyv, a zero-copy deserialization framework in Rust, to help us with this task. rkyv implements total zero-copy deserialization, meaning no data is copied during deserialization and no work is done to deserialize data. It achieves this by structuring its encoded representation to match the in-memory representation of the source type.

One of the key features of rkyv that our solution relies on is its ability to access HashMap data structures in a zero-copy fashion. This is a unique capability among Rust serialization libraries and one of the main reasons we chose rkyv for our implementation. It also has a vibrant Discord community, eager to offer best-practice advice and accommodate feature requests.
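The core idea, reading values directly out of the serialized bytes instead of first building an owned copy, can be sketched without any external crate. The FeatureView type below is hypothetical and much simpler than rkyv: it borrows a byte buffer of little-endian f32 values and decodes one value on demand, whereas rkyv lays out whole structures (including hash maps) so they can be referenced in place.

```rust
use std::convert::TryInto;

/// Borrowed view over little-endian `f32` values stored in a byte buffer.
pub struct FeatureView<'a> {
    bytes: &'a [u8],
}

impl<'a> FeatureView<'a> {
    /// Validates the buffer length once; no bytes are copied or decoded here.
    pub fn new(bytes: &'a [u8]) -> Option<Self> {
        if bytes.len() % 4 == 0 {
            Some(Self { bytes })
        } else {
            None
        }
    }

    /// Number of features in the buffer.
    pub fn len(&self) -> usize {
        self.bytes.len() / 4
    }

    pub fn is_empty(&self) -> bool {
        self.bytes.is_empty()
    }

    /// Reads feature `i` straight from the borrowed bytes, without allocating.
    pub fn get(&self, i: usize) -> Option<f32> {
        let start = i.checked_mul(4)?;
        let end = start.checked_add(4)?;
        let chunk = self.bytes.get(start..end)?;
        Some(f32::from_le_bytes(chunk.try_into().ok()?))
    }
}
```

Constructing the view costs the same regardless of how many features the buffer holds, which is the property that matters when the buffer is a memory-mapped file with hundreds of features per key.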

Enter mmap-sync crate

Leveraging the benefits of memory-mapped files, wait-free synchronization and zero-copy deserialization, we've crafted a unique and powerful tool for managing high-performance, concurrent data access between processes. We've packaged these concepts into a Rust crate named mmap-sync, which we're thrilled to open-source for the wider community.

At the core of the mmap-sync package is a structure named Synchronizer. It offers an avenue to read and write any data expressible as a Rust struct. Users simply have to implement or derive a specific Rust trait on the struct definition, a task requiring just a single line of code. The Synchronizer presents an elegantly simple interface, equipped with "write" and "read" methods.

impl Synchronizer {
    /// Write a given `entity` into the next available memory mapped file.
    pub fn write<T>(&mut self, entity: &T, grace_duration: Duration) -> Result<(usize, bool), SynchronizerError> {
        …
    }

    /// Reads and returns `entity` struct from mapped memory wrapped in `ReadResult`
    pub fn read<T>(&mut self) -> Result<ReadResult<T>, SynchronizerError> {
        …
    }
}

/// FeaturesMetadata stores features along with their metadata
#[derive(Archive, Deserialize, Serialize, Debug, PartialEq)]
#[archive_attr(derive(CheckBytes))]
pub struct FeaturesMetadata {
    /// Features version
    pub version: u32,
    /// Features creation Unix timestamp
    pub created_at: u32,
    /// Features represented by vector of hash maps
    pub features: Vec<HashMap<u64, Vec<f32>>>,
}

A read operation through the Synchronizer performs zero-copy deserialization and returns a "guarded" ReadResult encapsulating a reference to the Rust struct, using the RAII design pattern. This operation also increments the atomic counter of active readers using the struct. Once the ReadResult goes out of scope, the Synchronizer decrements the number of readers.
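The guard behaviour described above follows a common Rust pattern: increment a counter when the guard is created and decrement it in Drop. The sketch below shows that pattern in isolation; ReadGuard is a hypothetical stand-in, not the actual mmap-sync type, and the counter lives in ordinary process memory here rather than in a memory-mapped state file.

```rust
use std::ops::Deref;
use std::sync::atomic::{AtomicU32, Ordering};

/// Guard that keeps a reader registered for as long as it is alive.
pub struct ReadGuard<'a, T> {
    value: &'a T,
    readers: &'a AtomicU32,
}

impl<'a, T> ReadGuard<'a, T> {
    /// Registers a reader, then hands out shared access to `value`.
    pub fn new(value: &'a T, readers: &'a AtomicU32) -> Self {
        readers.fetch_add(1, Ordering::AcqRel);
        Self { value, readers }
    }
}

impl<T> Deref for ReadGuard<'_, T> {
    type Target = T;

    fn deref(&self) -> &T {
        self.value
    }
}

impl<T> Drop for ReadGuard<'_, T> {
    /// Unregisters the reader when the guard goes out of scope.
    fn drop(&mut self) {
        self.readers.fetch_sub(1, Ordering::AcqRel);
    }
}
```

A writer that wants to reuse a data copy can then check that its reader counter has dropped to zero before overwriting it, without any reader having to take a lock.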

The synchronization mechanism used in mmap-sync is not only "lock-free" but also "wait-free". This ensures an upper bound on the number of steps an operation will take before it completes, thus providing a performance guarantee.

The data is stored in shared mapped memory, which allows the Synchronizer to “write” to it and “read” from it concurrently. This design makes mmap-sync a highly efficient and flexible tool for managing shared, concurrent data access.

Now, with an understanding of the underlying mechanics of mmap-sync, let's explore how this package plays a key role in the broader context of our Bot Management platform, particularly within the newly developed components: the bliss service and library.

System design overhaul

Our new design is a significant evolution from the previous setup, in which a Lua-based module made memcached requests over a Unix socket to Gagarin, written in Go, to fetch machine learning features. The change pivots around the introduction of mmap-sync, our newly developed Rust package, laying the groundwork for a substantial performance upgrade. This development led to a comprehensive system redesign and introduced two new components that form the backbone of our Bots Liquidation Intelligent Security System (BLISS, for short): the bliss service and the bliss library.

Bliss service

The bliss service operates as a Rust-based, multi-threaded sidecar daemon. It has been designed for optimal batch processing of vast data quantities and extensive I/O operations. Among its key functions, it fetches, parses, and stores machine learning features and dimensions for effortless data access and manipulation. This has been made possible through the incorporation of the Tokio event-driven platform, which allows for efficient, non-blocking I/O operations.

Bliss library

Operating as a single-threaded dynamic library, the bliss library seamlessly integrates into each worker thread using the Foreign Function Interface (FFI) via a Lua module. Optimized for minimal resource usage and ultra-low latency, this lightweight library performs tasks without the need for heavy I/O operations. It efficiently serves machine learning features and generates corresponding detections.

In addition to leveraging the mmap-sync package for efficient machine learning feature access, our new design includes several other performance enhancements:

Allocations-free operation: bliss library re-uses pre-allocated data structures and performs no heap allocations, only low-cost stack allocations. To enforce our zero-allocation policy, we run integration tests using the dhat heap profiler.
SIMD optimizations: wherever possible, the bliss library employs vectorized CPU instructions. For instance, AVX2 and SSE4 instruction sets are used to expedite hex-decoding of certain request attributes, yielding a tenfold speedup.
Compiler tuning: We compile both the bliss service and library with the following flags for superior performance:

[profile.release]
codegen-units = 1
debug = true
lto = "fat"
opt-level = 3

Benchmarking & profiling: We use Criterion for benchmarking every major feature or component within bliss. Moreover, we are also able to use the Go pprof profiler on Criterion benchmarks to view flame graphs and more:

cargo bench -p integration -- --verbose --profile-time 100

go tool pprof -http=: ./target/criterion/process_benchmark/process/profile/profile.pb
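To illustrate the allocation-free style mentioned in the list above, the sketch below decodes hex into a buffer supplied by the caller, so the hot path never touches the heap. It is a plain scalar version written for this example, not the AVX2/SSE4 code used in bliss, and the function names are hypothetical.

```rust
/// Decodes ASCII hex from `input` into the caller-provided `out` buffer.
/// Returns the number of bytes written, or `None` on odd length, an invalid
/// digit, or insufficient space. Performs no heap allocation. On an invalid
/// digit, `out` may already contain partially decoded bytes.
pub fn hex_decode_into(input: &[u8], out: &mut [u8]) -> Option<usize> {
    if input.len() % 2 != 0 || out.len() < input.len() / 2 {
        return None;
    }
    for (pair, slot) in input.chunks_exact(2).zip(out.iter_mut()) {
        *slot = (nibble(pair[0])? << 4) | nibble(pair[1])?;
    }
    Some(input.len() / 2)
}

/// Converts one ASCII hex digit to its 4-bit value.
fn nibble(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'a'..=b'f' => Some(c - b'a' + 10),
        b'A'..=b'F' => Some(c - b'A' + 10),
        _ => None,
    }
}
```

The output buffer can be a fixed-size array on the stack or a pre-allocated structure reused across requests; a heap profiler such as dhat can then assert that the call performs zero allocations.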

This comprehensive overhaul of our system has not only streamlined our operations but also has been instrumental in enhancing the overall performance of our Bot Management platform. Stay tuned to witness the remarkable changes brought about by this new architecture in the next section.

Rollout results

Our system redesign has brought some truly "blissful" dividends. Above all, our commitment to a seamless user experience and the trust of our customers have guided our innovations. We ensured that the transition to the new design was seamless, maintaining full backward compatibility, with no customer-reported false positives or negatives encountered. This is a testament to the robustness of the new system.

As the old adage goes, the proof of the pudding is in the eating. This couldn't be truer when examining the dramatic latency improvements achieved by the redesign. Our overall processing latency for HTTP requests at Cloudflare improved by an average of 12.5% compared to the previous system.

This improvement is even more significant in the Bot Management module, where latency improved by an average of 55.93%.

More specifically, our machine learning features fetch latency dropped by a factor of roughly 60 at the median and more than 500 at the tail:

Latency metric	Before (μs)	After (μs)	Change
p50	532	9	-98.30% or x59
p99	9510	18	-99.81% or x528
p999	16000	29	-99.82% or x551

To truly grasp this impact, consider this: with Cloudflare’s average rate of 46 million requests per second, a saving of 523 microseconds per request equates to saving over 24,000 days or 65 years of processing time every single day!
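The arithmetic behind that figure is short enough to spell out. The sketch below uses the two numbers from the text (46 million requests per second and 523 microseconds saved per request); the function names are made up for the example, and a 365-day year is assumed.

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;

/// Processing seconds saved during each second of wall-clock time.
pub fn seconds_saved_per_second(requests_per_second: f64, saved_micros_per_request: f64) -> f64 {
    requests_per_second * saved_micros_per_request / 1_000_000.0
}

/// Processing days saved during each calendar day. Numerically this equals
/// the seconds saved per second, because the 86,400 factor cancels out.
pub fn days_saved_per_day(requests_per_second: f64, saved_micros_per_request: f64) -> f64 {
    let seconds_saved_per_day =
        seconds_saved_per_second(requests_per_second, saved_micros_per_request) * SECONDS_PER_DAY;
    seconds_saved_per_day / SECONDS_PER_DAY
}

/// The same saving expressed in years, assuming 365 days per year.
pub fn years_saved_per_day(requests_per_second: f64, saved_micros_per_request: f64) -> f64 {
    days_saved_per_day(requests_per_second, saved_micros_per_request) / 365.0
}
```

With 46,000,000 requests per second and 523 microseconds saved per request, this gives 24,058 days, or just under 66 years, of processing time saved per day, consistent with the "over 24,000 days or 65 years" quoted above.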

In addition to latency improvements, we also reaped other benefits from the rollout:

Enhanced feature availability: thanks to eliminating Unix socket timeouts, machine learning feature availability is now a robust 100%, resulting in fewer false positives and negatives in detections.
Improved resource utilization: our system overhaul liberated resources equivalent to thousands of CPU cores and hundreds of gigabytes of RAM – a substantial enhancement of our server fleet's efficiency.
Code cleanup: another positive spin-off has been in our Lua and Go code. Thousands of lines of less performant and less memory-safe code have been weeded out, reducing technical debt.
Upscaled machine learning capabilities: last but certainly not least, we've significantly expanded our machine learning features, dimensions, and models. This upgrade empowers our machine learning inference to handle hundreds of machine learning features and dozens of dimensions and models.

Conclusion

In the wake of our redesign, we've constructed a powerful and efficient system that truly embodies the essence of 'bliss'. Harnessing the advantages of memory-mapped files, wait-free synchronization, allocation-free operations, and zero-copy deserialization, we've established a robust infrastructure that maintains peak performance while achieving remarkable reductions in latency. As we navigate towards the future, we're committed to leveraging this platform to further improve our Security machine learning products and cultivate innovative features. Additionally, we're excited to share parts of this technology through an open-sourced Rust package mmap-sync.

As we leap into the future, we are building upon our platform's impressive capabilities, exploring new avenues to amplify the power of machine learning. We are deploying a new machine learning model built on BLISS with select customers. If you are a Bot Management subscriber and want to test the new model, please reach out to your account team.

Separately, we are on the lookout for more Cloudflare customers who want to run their own machine learning models at the edge today. If you’re a developer considering making the switch to Workers for your application, sign up for our Constellation AI closed beta. If you’re a Bot Management customer and looking to run an already trained, lightweight model at the edge, we would love to hear from you. Let's embark on this path to bliss together.

How Cloudflare runs machine learning inference in microseconds

2023-06-19 Austin Hartzheim

Post Syndicated from Austin Hartzheim original http://blog.cloudflare.com/how-cloudflare-runs-ml-inference-in-microseconds/

Cloudflare executes an array of security checks on servers spread across our global network. These checks are designed to block attacks and prevent malicious or unwanted traffic from reaching our customers’ servers. But every check carries a cost – some amount of computation, and therefore some amount of time must be spent evaluating every request we process. As we deploy new protections, the amount of time spent executing security checks increases.

Latency is a key metric on which CDNs are evaluated. Just as we optimize network latency by provisioning servers in close proximity to end users, we also optimize processing latency – which is the time spent processing a request before serving a response from cache or passing the request forward to the customers’ servers. Due to the scale of our network and the diversity of use-cases we serve, our edge software is subject to demanding specifications, both in terms of throughput and latency.

Cloudflare's bot management module is one suite of security checks which executes during the hot path of request processing. This module calculates a variety of bot signals and integrates directly with our front line servers, allowing us to customize behavior based on those signals. This module evaluates every request for heuristics and behaviors indicative of bot traffic, and scores every request with several machine learning models.

To reduce processing latency, we've undertaken a project to rewrite our bot management technology, porting it from Lua to Rust, and applying a number of performance optimizations. This post focuses on optimizations applied to the machine-learning detections within the bot management module, which account for approximately 15% of the latency added by bot detection. By switching away from a garbage collected language, removing memory allocations, and optimizing our parsers, we reduce the P50 latency of the bot management module by 79μs – a 20% reduction.

Engineering for zero allocations

Writing software without memory allocation poses several challenges. Indeed, high-level programming languages often trade memory management for productivity, abstracting away the details of memory management. But, in those details, are a number of algorithms to find contiguous regions of free memory, handle fragmentation, and call into the kernel to request new memory pages. Garbage collected languages incur additional costs throughout program execution to track when memory can be freed, plus pauses in program execution while the garbage collector executes. But, when performance is a critical requirement, languages should be evaluated for their ability to meet performance constraints.

Stack allocation

One of the simplest ways to reduce memory allocations is to work with fixed-size buffers. Fixed-sized buffers can be placed on the stack, which eliminates the need to invoke heap allocation logic; the compiler simply reserves space in the current stack frame to hold local variables. Alternatively, the buffers can be heap-allocated outside the hot path (for example, at application startup), incurring a one-time cost.

Arrays can be stack allocated:

let mut buf = [0u8; BUFFER_SIZE];

Vectors can be created outside the hot path:

let mut buf = Vec::with_capacity(BUFFER_SIZE);
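As a rough illustration of how a fixed-size buffer is used on a hot path, the sketch below lower-cases input into a caller-provided slice. The names `lowercase_into` and `matches_on_stack`, and the `BUFFER_SIZE` of 64, are assumptions made for this example. The same helper would work with a slice borrowed from a `Vec` that was allocated once at startup.

```rust
const BUFFER_SIZE: usize = 64; // illustrative size, not taken from the article

/// Lower-case `src` into `buf` and return the filled prefix, or `None` if
/// `src` does not fit. This function performs no heap allocation.
fn lowercase_into<'a>(src: &[u8], buf: &'a mut [u8]) -> Option<&'a [u8]> {
    let dst = buf.get_mut(..src.len())?;
    dst.copy_from_slice(src);
    dst.make_ascii_lowercase();
    Some(&*dst)
}

/// Hot-path caller: the scratch buffer lives in this function's stack frame.
/// `pat` is expected to be lower-case already.
fn matches_on_stack(s: &str, pat: &str) -> bool {
    let mut buf = [0u8; BUFFER_SIZE];
    lowercase_into(s.as_bytes(), &mut buf) == Some(pat.as_bytes())
}
```

Inputs longer than the buffer are treated as a non-match here; a real caller would need to decide how to handle that case.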

To demonstrate the performance difference, let's compare two implementations of a case-insensitive string equality check. The first will allocate a buffer for each invocation to store the lower-case version of the string. The second will use a buffer that has already been allocated for this purpose.

Allocate a new buffer for each iteration:

fn case_insensitive_equality_buffer_with_capacity(s: &str, pat: &str) -> bool {
	let mut buf = String::with_capacity(s.len());
	buf.extend(s.chars().map(|c| c.to_ascii_lowercase()));
	buf == pat
}

Re-use a buffer for each iteration, avoiding allocations:

fn case_insensitive_equality_buffer_reuse(s: &str, pat: &str, buf: &mut String) -> bool {
	buf.clear();
	buf.extend(s.chars().map(|c| c.to_ascii_lowercase()));
	buf == pat
}

Benchmarking the two code snippets, the first executes in ~40ns per iteration, the second in ~25ns. Changing only the memory allocation behavior, the second implementation is ~38% faster.

Choice of algorithms

Another strategy to reduce the number of memory allocations is to choose algorithms that operate on the data in-place and store any necessary state on the stack.

Returning to our string comparison function from above, let's rewrite it to operate completely on the stack, and without copying data into a separate buffer:

fn case_insensitive_equality_buffer_iter(s: &str, pat: &str) -> bool {
	s.chars().map(|c| c.to_ascii_lowercase()).eq(pat.chars())
}

In addition to being the shortest, this function is also the fastest. This function benchmarks at ~13ns/iter, which is just slightly slower than the 11ns used to execute eq_ignore_ascii_case from the standard library. And the standard library implementation similarly avoids buffer allocation through use of iterators.
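The three variants can be placed side by side to check that they agree with the standard library. This is a sketch with shortened, hypothetical function names (`eq_alloc`, `eq_reuse`, `eq_iter`). One caveat worth noting: these variants lower-case only `s`, so they match `eq_ignore_ascii_case` only when `pat` is already lower-case.

```rust
/// Allocates a fresh buffer on every call.
fn eq_alloc(s: &str, pat: &str) -> bool {
    let mut buf = String::with_capacity(s.len());
    buf.extend(s.chars().map(|c| c.to_ascii_lowercase()));
    buf.as_str() == pat
}

/// Re-uses a caller-provided buffer; allocates only if `buf` must grow.
fn eq_reuse(s: &str, pat: &str, buf: &mut String) -> bool {
    buf.clear();
    buf.extend(s.chars().map(|c| c.to_ascii_lowercase()));
    buf.as_str() == pat
}

/// Compares lazily with iterators; no buffer at all.
fn eq_iter(s: &str, pat: &str) -> bool {
    s.chars().map(|c| c.to_ascii_lowercase()).eq(pat.chars())
}

/// True when all three variants agree with the standard library.
/// Assumes `pat` is already lower-case.
fn all_agree(s: &str, pat: &str, buf: &mut String) -> bool {
    let expected = s.eq_ignore_ascii_case(pat);
    eq_alloc(s, pat) == expected && eq_reuse(s, pat, buf) == expected && eq_iter(s, pat) == expected
}
```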

Testing allocations

Automated testing of memory allocation on the critical path prevents accidental use of functions or libraries which allocate memory. dhat is a crate in the Rust ecosystem that supports such testing. By setting a new global allocator, dhat is able to count the number of allocations, as well as the number of bytes allocated on a given code path.

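#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

/// Assert that the hot path logic performs zero allocations.
#[test]
fn zero_allocations() {
	let _profiler = dhat::Profiler::builder().testing().build();

	// Execute hot-path logic here.

	// Read the heap statistics gathered since the profiler was built.
	let stats = dhat::HeapStats::get();

	// Assert that no allocations occurred.
	dhat::assert_eq!(stats.total_blocks, 0);
	dhat::assert_eq!(stats.total_bytes, 0);
}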

Note that dhat has a limitation: it only detects allocations that Rust code makes through the global allocator. External libraries can still allocate memory without using the Rust allocator. FFI calls, such as those made to C, are one such place where memory allocations may slip past dhat's measurements.
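To make the "new global allocator" idea concrete, here is a minimal standard-library-only sketch of a counting allocator. It is not how dhat is implemented; `CountingAllocator`, `ALLOCATIONS`, and `count_allocations` are hypothetical names, and the counter is per-thread so that other threads do not disturb the measurement.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Per-thread allocation counter. It is const-initialized and has no
    // destructor, so touching it from inside the allocator never allocates.
    static ALLOCATIONS: Cell<usize> = const { Cell::new(0) };
}

/// Hypothetical wrapper around the system allocator that counts `alloc` calls.
struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let _ = ALLOCATIONS.try_with(|count| count.set(count.get() + 1));
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Run `f` and return its result plus the number of heap allocations the
/// current thread made while it ran.
fn count_allocations<R>(f: impl FnOnce() -> R) -> (R, usize) {
    let before = ALLOCATIONS.with(|count| count.get());
    let result = f();
    let after = ALLOCATIONS.with(|count| count.get());
    (result, after - before)
}
```

Like dhat, this only sees allocations routed through Rust's global allocator, so memory allocated inside a C library called over FFI would not be counted.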

Zero allocation decision trees

CatBoost is an open-source machine learning library used within the bot management module. The core logic of CatBoost is implemented in C++, and the library exposes bindings to a number of other languages – such as C and Rust. The Lua-based implementation of the bot management module relied on FFI calls to the C API to execute our models.

By removing memory allocations and implementing buffer re-use, we optimize the execution duration of the sample model included in the CatBoost repository by 10%. Our production models see gains up to 15%.

Optimize for single-document evaluation

By optimizing CatBoost to evaluate a single set of features at a time, we reduce memory allocations and reduce latency. The CatBoost API has several functions which are optimized for evaluating multiple documents at a time, but this API does not benefit our application where requests are evaluated in the order they are received, and delaying processing to coalesce batches is undesirable. To support evaluation of a variable number of documents, the CatBoost implementation allocates vectors and iterates over the input documents, writing them into the vectors.

TVector<TConstArrayRef<float>> floatFeaturesVec(docCount);
TVector<TConstArrayRef<int>> catFeaturesVec(docCount);
for (size_t i = 0; i < docCount; ++i) {
    if (floatFeaturesSize > 0) {
        floatFeaturesVec[i] = TConstArrayRef<float>(floatFeatures[i], floatFeaturesSize);
    }
    if (catFeaturesSize > 0) {
        catFeaturesVec[i] = TConstArrayRef<int>(catFeatures[i], catFeaturesSize);
    }
}
FULL_MODEL_PTR(modelHandle)->Calc(floatFeaturesVec, catFeaturesVec, TArrayRef<double>(result, resultSize));

To evaluate a single document, however, CatBoost only needs access to a reference to contiguous memory holding feature values. The above code can be replaced with the following:

TConstArrayRef<float> floatFeaturesArray = TConstArrayRef<float>(floatFeatures, floatFeaturesSize);
TConstArrayRef<int> catFeaturesArray = TConstArrayRef<int>(catFeatures, catFeaturesSize);
FULL_MODEL_PTR(modelHandle)->Calc(floatFeaturesArray, catFeaturesArray, TArrayRef<double>(result, resultSize));

Similar to the C++ implementation, the CatBoost Rust bindings also allocate vectors to support multi-document evaluation. For example, the bindings iterate over a vector of vectors, mapping it to a newly allocated vector of pointers:

let mut float_features_ptr = float_features
   .iter()
   .map(|x| x.as_ptr())
   .collect::<Vec<_>>();

But in the single-document case, we don't need the outer vector at all. We can simply pass the inner pointer value directly:

let float_features_ptr = float_features.as_ptr();

Reusing buffers

The previous API in the Rust bindings accepted owned Vecs as input. By taking ownership of a heap-allocated data structure, the function also takes responsibility for freeing the memory at the conclusion of its execution. This is undesirable as it forecloses the possibility of buffer reuse. Additionally, categorical features are passed as owned Strings, which prevents us from passing references to bytes in the original request. Instead, we must allocate a temporary String on the heap and copy bytes into it.

pub fn calc_model_prediction(
	&self,
	float_features: Vec<Vec<f32>>,
	cat_features: Vec<Vec<String>>,
) -> CatBoostResult<Vec<f64>> { ... }

Let's rewrite this function to take &[f32] and &[&str]:

pub fn calc_model_prediction_single(
	&self,
	float_features: &[f32],
	cat_features: &[&str],
) -> CatBoostResult<f64> { ... }

But, we also execute several models per request, and those models may use the same categorical features. Instead of calculating the hash for each separate model we execute, let's compute the hashes first and then pass them to each model that requires them:

pub fn calc_model_prediction_single_with_hashed_cat_features(
	&self,
	float_features: &[f32],
	hashed_cat_features: &[i32],
) -> CatBoostResult<f64> { ... }
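The hash-once pattern can be sketched as follows. Everything here is illustrative: the `Model` trait and `ToyModel` stand in for a real model handle, and `hash_cat_feature` uses the standard library's `DefaultHasher` purely as a placeholder, since CatBoost uses its own hash function for categorical features. Both the hash buffer and the score buffer are owned by the caller so they can be re-used across requests.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Placeholder hash for illustration only; not CatBoost's hash function.
fn hash_cat_feature(value: &str) -> i32 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish() as i32 // truncation is acceptable for this sketch
}

/// Hypothetical model interface taking pre-hashed categorical features.
trait Model {
    fn predict(&self, float_features: &[f32], hashed_cat_features: &[i32]) -> f64;
}

/// Toy model so the sketch is runnable: bias + sum of floats + feature count.
struct ToyModel {
    bias: f64,
}

impl Model for ToyModel {
    fn predict(&self, float_features: &[f32], hashed_cat_features: &[i32]) -> f64 {
        let sum: f32 = float_features.iter().sum();
        self.bias + f64::from(sum) + hashed_cat_features.len() as f64
    }
}

/// Hash the categorical features once, then score every model with them.
/// `hash_buf` and `scores` are cleared and refilled, keeping their capacity.
fn score_request(
    models: &[&dyn Model],
    float_features: &[f32],
    cat_features: &[&str],
    hash_buf: &mut Vec<i32>,
    scores: &mut Vec<f64>,
) {
    hash_buf.clear();
    hash_buf.extend(cat_features.iter().map(|value| hash_cat_feature(value)));
    let hashed = hash_buf.as_slice();
    scores.clear();
    scores.extend(models.iter().map(|model| model.predict(float_features, hashed)));
}
```

As long as the buffers were created with enough capacity outside the hot path, repeated calls to `score_request` do not need to grow them.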

Summary

By optimizing away unnecessary memory allocations in the bot management module, we reduced P50 latency from 388μs to 309μs (20%), and reduced P99 latency from 940μs to 813μs (14%). And, in many cases, the optimized code is shorter and easier to read than the unoptimized implementation.

These optimizations were targeted at model execution in the bot management module. To learn more about how we are porting bot management from Lua to Rust, check out this blog post.

How Pingora keeps count

2023-05-12 Yuchen Wu

Post Syndicated from Yuchen Wu original http://blog.cloudflare.com/how-pingora-keeps-count/

A while ago we shared how we replaced NGINX with our in-house proxy, Pingora. We promised to share more technical details as well as our open sourcing plan. This blog post will be the first of a series that shares both the code libraries that power Pingora and the ideas behind them.

Today, we take a look at one of Pingora’s libraries: pingora-limits.

pingora-limits provides the functionality to count inflight events and estimate the rate of events over time. These functions are commonly used to protect infrastructure and services from being overwhelmed by certain types of malicious or misbehaving requests.

For example, when an origin server becomes slow or unresponsive, requests will accumulate on our servers, which adds pressure on both our servers and our customers’ servers. With this library, we are able to identify which origins have issues, so that action can be taken without affecting other traffic.

The problem can be abstracted in a very simple way. The input is a (never ending) stream of different types of events. At any point, the system should be able to tell the number of appearances (or the rate) of a certain type of event.

In a simple example, colors are used as the type of event. The following is one possible example of a sequence of events:

red, blue, red, orange, green, brown, red, blue,...

In this example, the system should report that “red” appears three times.

The corresponding algorithms are straightforward to design. One obvious answer is to use a hash table, where the keys are the colors and the values are their corresponding appearances. Whenever a new event appears, the algorithm looks up the hash table and increases the appearance counter. It is not hard to tell that this algorithm’s time complexity is O(1) (per event) and the space complexity O(n) where n is the number of the types of events.
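The hash table approach takes only a few lines. The sketch below uses a hypothetical helper named `count_events` and the color sequence from the example above.

```rust
use std::collections::HashMap;

/// Count how many times each event type appears: O(1) work per event and
/// O(n) space, where n is the number of distinct event types.
fn count_events<'a>(events: impl IntoIterator<Item = &'a str>) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for event in events {
        *counts.entry(event).or_insert(0) += 1;
    }
    counts
}
```

Feeding it `["red", "blue", "red", "orange", "green", "brown", "red", "blue"]` yields a count of 3 for "red", and the table holds one entry per distinct color.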

How Pingora does it

The hash table solution is fine in common scenarios, but we believe there are a few things that can be improved.

We observe traffic to millions of different servers when the misbehaving ones are only a few at a given time. It seems a bit wasteful to require space (memory) that holds the counter for all the keys.
Concurrently updating the hash table (especially when adding new keys) requires a lock. This behavior potentially forces all concurrent event processing to go through our system serialized. In other words, when lock contention is severe, the lock slows down the system.

The motivation to improve the above algorithm is even stronger considering such algorithms need to be deployed at scale. This algorithm operates on tens of thousands of machines. It handles more than twenty million requests per second. The benefits of efficiency improvement can be significant.

pingora-limits adopts a different approach: count–min sketch (CM sketch) estimation. CM sketch estimates the counts of events in O(1) (per event) but only using O(log(n)) of space (polylogarithmic, to be precise, more details here). Because of the simplicity of this algorithm, which we will discuss in a bit, it can be implemented without locks. Therefore, pingora-limits runs much faster and more efficiently compared to the hash table approach discussed earlier.

CM sketch

The idea of a CM sketch is similar to a Bloom filter. The mathematical details of the CM sketch can be found in this paper. In this section, we will just illustrate how it works.

A CM sketch data structure takes two parameters: H, the number of hashes (rows), and N, the number of counters (columns) per hash (row). The rows and columns form a matrix that takes H*N space. Each row has its own independent hash function (hash_i()).

For this example, we use H=3 and N=4:

0	0	0	0
0	0	0	0
0	0	0	0

When an event, "red", arrives, it is counted by every row independently. Each row will use its own hashing function ( hash_i(“red”) ) to choose a column. The counter of the column is increased without worrying about collisions (see the end of this section).

The table below illustrates a possible state of the matrix after a single “red” event:

0	1	0	0
0	0	1	0
1	0	0	0

Then, let’s assume the event "blue" arrives, and we assume it collides with "red" at row 2: both hash to the third slot:

1	1	0	0
0	0	2	0
1	0	0	1

Let’s say another series of events arrives: “blue, red, red, red, blue, red”. So far the algorithm has observed 5 “red”s and 3 “blue”s in total. Following the algorithm, the estimator eventually becomes:

3	5	0	0
0	0	8	0
5	0	0	3

Now, let’s see how the matrix reports the occurrence of each event. To retrieve the count of a key, the estimator returns the minimum value across the cells that the key hashes to, one per row. So the count of red is min(5, 8, 5) = 5 and blue is min(3, 8, 3) = 3.

This algorithm chooses the cells with the fewest collisions (via the min() operation). Collisions in individual cells are therefore acceptable: as long as at least one cell for a given type of event is collision-free, the count for that event is accurate.

The estimator can overestimate when two (or more) keys collide on all slots. Assuming there are only two keys, the probability of their total collision is 1/ N^H (1/64 in this example). On the other hand, it never underestimates because it never loses count of any events.

Practical implementation

Because the algorithm only requires hashing, array index and counter increment, it can be implemented in a few lines of code and lock-free.

The following is a code snippet of how it is implemented in Rust.

pub struct Estimator {
    estimator: Box<[(Box<[AtomicIsize]>, RandomState)]>,
}
 
impl Estimator {
    /// Increment `key` by the value given. Return the new estimated value as a result.
    pub fn incr<T: Hash>(&self, key: T, value: isize) -> isize {
        let mut min = isize::MAX;
        for (slot, hasher) in self.estimator.iter() {
            let hash = hash(&key, hasher) as usize;
            let counter = &slot[hash % slot.len()];
            let current = counter.fetch_add(value, Ordering::Relaxed);
            min = std::cmp::min(min, current + value);
        }
        min
    }
}
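For readers who want to run the snippet, here is a self-contained sketch that fills in the missing pieces. The constructor `new` and the read-only `get` method are written for this example and may differ from the actual pingora-limits API; the `hash` helper is replaced by the standard library's `BuildHasher::hash_one`.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash};
use std::sync::atomic::{AtomicIsize, Ordering};

pub struct Estimator {
    estimator: Box<[(Box<[AtomicIsize]>, RandomState)]>,
}

impl Estimator {
    /// Build a sketch with `hashes` rows and `slots` counters per row.
    pub fn new(hashes: usize, slots: usize) -> Self {
        assert!(hashes > 0 && slots > 0, "sketch dimensions must be non-zero");
        let estimator = (0..hashes)
            .map(|_| {
                let row: Box<[AtomicIsize]> = (0..slots).map(|_| AtomicIsize::new(0)).collect();
                // Each row gets its own randomly seeded hasher.
                (row, RandomState::new())
            })
            .collect();
        Estimator { estimator }
    }

    /// Increment `key` by `value` in every row; return the new estimate.
    pub fn incr<T: Hash>(&self, key: T, value: isize) -> isize {
        let mut min = isize::MAX;
        for (slot, hasher) in self.estimator.iter() {
            let hash = hasher.hash_one(&key) as usize;
            let counter = &slot[hash % slot.len()];
            let current = counter.fetch_add(value, Ordering::Relaxed);
            min = std::cmp::min(min, current + value);
        }
        min
    }

    /// Estimate the count of `key`: the minimum over its cell in each row.
    pub fn get<T: Hash>(&self, key: T) -> isize {
        self.estimator
            .iter()
            .map(|(slot, hasher)| {
                let hash = hasher.hash_one(&key) as usize;
                slot[hash % slot.len()].load(Ordering::Relaxed)
            })
            .min()
            .unwrap_or(0)
    }
}
```

Because the hashers are randomly seeded, which keys collide varies from run to run, but the estimate for a key is never lower than its true count.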

Performance

We compare the design above with two hash-table-based approaches.

naive: Mutex<HashMap<u32, usize>>. This approach references the simple hash table approach mentioned above. This design requires a lock on every operation.
optimized: DashMap<u32, AtomicUsize>. DashMap leverages multiple hash tables in order to shard the keys to reduce contentions across different keys. We also use atomic counters here so that counting existing keys won't need a write lock.
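For reference, the naive baseline looks roughly like the sketch below. The function `count_with_mutex` and its parameters are hypothetical and are not the benchmark code itself; the point is that every single increment has to take the same lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Count events from several threads in one shared, mutex-protected map.
/// `keys` must be greater than zero.
fn count_with_mutex(threads: usize, events_per_thread: usize, keys: u32) -> HashMap<u32, usize> {
    let counts: Arc<Mutex<HashMap<u32, usize>>> = Arc::new(Mutex::new(HashMap::new()));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counts = Arc::clone(&counts);
            thread::spawn(move || {
                for i in 0..events_per_thread {
                    let key = (i as u32) % keys;
                    // The lock is taken for every event, serializing all threads.
                    *counts.lock().unwrap().entry(key).or_insert(0) += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    Arc::try_unwrap(counts)
        .expect("all workers have finished")
        .into_inner()
        .unwrap()
}
```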

We have two test cases, one that is single threaded and another that is multi-threaded. In both cases, we have one million keys. We generate 100 million events from the keys. The keys are uniformly distributed among the events.

The benchmarks below were run on a Debian VM running on an M1 MacBook Pro.

Speed
Per event (the incr() function above) timing, lower is better:

	pingora-limits	naive	optimized
Single thread	10ns	51ns	43ns
Eight threads	212ns	1505ns	212ns

In the single thread case, where there is no lock contention, our approach is 5x faster than the naive one and 4x faster than the optimized one. With multiple threads, there is a high amount of contention. Our approach is similar to the optimized version. Both are 7x faster than the naive one. The reason the performance of pingora-limits and the optimized hash table are similar is because in both approaches the hot path is just updating the atomic counter.

Memory consumption
Lower is better. The numbers are collected only from the single threaded test cases for simplicity.

	peak memory bytes	total allocations	total allocated bytes
pingora-limits	26,184	9	26,184
naive	53,477,392	20	71,303,260
optimized	36,211,208	491	71,307,722

Pingora-limits at peak requires roughly 1/2000 of the memory of the naive one and roughly 1/1400 of the memory of the optimized one.

From the data above, pingora-limits is both CPU and memory efficient.

The estimator provided by Pingora-limits is a biased estimator because it is possible for it to overestimate the appearance of events.

In the case of accurate counting, where false positives are absolutely unacceptable, pingora-limits can still be very useful. It can work as a first stage filter where only the events beyond a certain threshold are fed to a hash table to perform accurate counting. In this case, the majority of low frequency event types are filtered out by the filter so that the hash table also consumes little memory without losing any accuracy.

How it is used in production

In production, Pingora uses this library in a few places. The most common one is the connection limit feature. When our servers try to establish too many connections to a single origin server, in order to protect the server and our infrastructure from becoming overloaded, this feature will start rejecting new requests with 503 errors.

In this feature every incoming request increases a counter, shared by all other requests with the same customer ID, server IP and the server hostname. When the request finishes, the counter decreases accordingly. If the value of the counter is beyond a certain threshold, the request is rejected with a 503 error response. In our production environment we choose the parameters of the library so that a theoretical collision chance between two unrelated customers is about 1 / 2 ^ 52. Additionally, the rejection threshold is significantly higher than what a healthy customer’s traffic would reach. Therefore, even if multiple customers’ counters collide, it is not likely that the overestimated value would reach the threshold. So a false positive on the connection limit is not likely to happen.
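The increment-on-arrival, decrement-on-finish pattern can be sketched with a small guard type. This is an illustration rather than Pingora's code: a single `AtomicIsize` stands in for the estimator cells selected by customer ID, server IP and hostname, and the names `InflightGuard` and `admit` are hypothetical.

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

/// Decrements the shared counter when the request finishes (is dropped).
struct InflightGuard<'a> {
    counter: &'a AtomicIsize,
}

impl Drop for InflightGuard<'_> {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Count the incoming request. `None` means the caller should reject it,
/// for example with a 503 response.
fn admit(counter: &AtomicIsize, limit: isize) -> Option<InflightGuard<'_>> {
    let inflight = counter.fetch_add(1, Ordering::Relaxed) + 1;
    let guard = InflightGuard { counter };
    if inflight > limit {
        // `guard` is dropped on return, which undoes the increment.
        None
    } else {
        Some(guard)
    }
}
```

Tying the decrement to `Drop` means the counter is released on every exit path of request handling, including early returns and errors.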

Conclusion

The pingora-limits crate is available now on GitHub. Both the core functionality and the performance benchmark performed above can be found there.

In this blog post, we introduced pingora-limits, a library that counts events efficiently. We explained the core idea, which is based on a probabilistic data structure. We also showed through a performance benchmark that the pingora-limits implementation is fast and very efficient for memory consumption.

Not only that, but we will continue introducing and open sourcing Pingora components and libraries because we believe that sharing the idea behind the code is equally important as sharing the code itself.

Interested in joining us to help build a better Internet? Our engineering teams are hiring.

Oxy: Fish/Bumblebee/Splicer subsystems to improve reliability

2023-04-20 Quang Luong

Post Syndicated from Quang Luong original https://blog.cloudflare.com/oxy-fish-bumblebee-splicer-subsystems-to-improve-reliability/

At Cloudflare, we are building proxy applications on top of Oxy that must be able to handle a huge amount of traffic. Besides high performance requirements, the applications must also be resilient against crashes or reloads. As the framework evolves, the complexity also increases. While migrating WARP to support soft-unicast (Cloudflare servers don’t own IPs anymore), we needed to add different functionalities to our proxy framework. Those additions increased not only the code size but also resource usage and states required to be preserved between process upgrades.

To address those issues, we opted to split a big proxy process into smaller, specialized services. Following the Unix philosophy, each service should have a single responsibility, and it must do it well. In this blog post, we will talk about how our proxy interacts with three different services – Splicer (which pipes data between sockets), Bumblebee (which upgrades an IP flow to a TCP socket), and Fish (which handles layer 3 egress using soft-unicast IPs). Those three services help us to improve system reliability and efficiency as we migrated WARP to support soft-unicast.

Splicer

Most transmission tunnels in our proxy forward packets without making any modifications. In other words, given two sockets, the proxy just relays the data between them: read from one socket and write to the other. This is a common pattern within Cloudflare, and we reimplement very similar functionality in separate projects. These projects often have their own tweaks for buffering, flushing, and terminating connections, but they also have to coordinate long-running proxy tasks with their process restart or upgrade handling, too.

Turning this into a service allows other applications to send a long-running proxying task to Splicer. The applications pass the two sockets to Splicer and they will not need to worry about keeping the connection alive when they restart. After finishing the task, Splicer will return the two original sockets and the original metadata attached to the request, so the original application can inspect the final state of the sockets – for example using TCP_INFO – and finalize audit logging if required.

Bumblebee

Many of Cloudflare’s on-ramps are IP-based (layer 3) but most of our services operate on TCP or UDP sockets (layer 4). To handle TCP termination, we want to create a kernel TCP socket from IP packets received from the client (and we can later forward this socket and an upstream socket to Splicer to proxy data between the eyeball and origin). Bumblebee performs the upgrades by spawning a thread in an anonymous network namespace with the unshare syscall, NAT-ing the IP packets, and using a tun device there to perform TCP three-way handshakes to a listener. You can find a more detailed write-up on how we upgrade an IP flow to a TCP stream here.

In short, other services just need to pass a socket carrying the IP flow, and Bumblebee will upgrade it to a TCP socket, no user-space TCP stack involved! After the socket is created, Bumblebee will return the socket to the application requesting the upgrade. Again, the proxy can restart without breaking the connection as Bumblebee pipes the IP socket while Splicer handles the TCP ones.

Fish

Fish forwards IP packets using soft-unicast IP space without upgrading them to layer 4 sockets. We previously implemented packet forwarding on shared IP space using iptables and conntrack. However, IP/port mapping management is not simple when you have many possible IPs to egress from and variable port assignments. Conntrack is highly configurable, but applying configuration through iptables rules requires careful coordination and debugging iptables execution can be challenging. Plus, relying on configuration when sending a packet through the network stack results in arcane failure modes when conntrack is unable to rewrite a packet to the exact IP or port range specified.

Fish attempts to overcome this problem by rewriting the packets and configuring conntrack using the netlink protocol. Put differently, a proxy application sends a socket containing IP packets from the client, together with the desired soft-unicast IP and port range, to Fish. Fish then ensures that those packets are forwarded to their destination. The client’s choice of IP address does not matter; Fish ensures that egressed IP packets have a unique five-tuple within the root network namespace and performs the necessary packet rewriting to maintain this isolation. Fish’s internal state also survives restarts.

The Unix philosophy, manifest

To sum up what we have so far: instead of adding the functionalities directly to the proxy application, we create smaller and reusable services. It becomes possible to understand the failure cases present in a smaller system and design it to exhibit reliable behavior. Then, if we can extract subsystems from a larger system, we can apply this logic to those subsystems. By focusing on making the smaller service work correctly, we improve the whole system’s reliability and development agility.

Although the business logic of those three services differs, you may notice what they have in common: they receive sockets, or file descriptors, from other applications so that those applications can restart. The services themselves can also be restarted without dropping connections. Let’s take a look at how graceful restart and file descriptor passing work in our cases.

File descriptor passing

We use Unix domain sockets for inter-process communication, which is a common pattern. Besides sending raw data, Unix sockets also allow passing file descriptors between different processes. This is essential for our architecture as well as graceful restarts.

There are two main ways to transfer a file descriptor: using the pidfd_getfd syscall or SCM_RIGHTS. The latter is the better choice for us here as the use cases gear toward the proxy application “giving” the sockets instead of the microservices “taking” them. Moreover, the first method would require special permission and a way for the proxy to signal which file descriptor to take.

Currently we have our own internal library named hot-potato to pass the file descriptors around as we use stable Rust in production. If you are fine with using nightly Rust, you may want to consider the unix_socket_ancillary_data feature. The linked blog post above about SCM_RIGHTS also explains how that can be implemented. Still, we also want to add some “interesting” details you may want to know before using your SCM_RIGHTS in production:

There is a maximum number of file descriptors you can pass per message. The limit is defined by the constant SCM_MAX_FD in the kernel, which has been set to 253 since kernel version 2.6.38.
Getting the peer credentials of a socket may be quite useful for observability in multi-tenant settings.
SCM_RIGHTS ancillary data forms a message boundary.
It is possible to send any file descriptor, not only sockets. We use this trick together with memfd_create to get around the maximum buffer size without implementing something like length-encoded frames. This also makes zero-copy message passing possible.

Graceful restart

We explored the general strategy for graceful restart in “Oxy: the journey of graceful restarts” blog. Let’s dive into how we leverage tokio and file descriptor passing to migrate all important states in the old process to the new one. We can terminate the old process almost instantly without leaving any connection behind.

Passing states and file descriptors

Applications like NGINX can be reloaded with no downtime. However, if there are pending requests then there will be lingering processes that handle those connections before they terminate. This is not ideal for observability. It can also cause performance degradation when the old processes start building up after consecutive restarts.

In the three microservices in this blog post, we use the state-passing concept, where the pending requests will be paused and transferred to the new process. The new process will pick up both new requests and the old ones immediately on start. This method does require more complexity than keeping the old process running. At a high level, we have the following extra steps when the application receives an upgrade request (usually SIGHUP): pause all tasks, wait until all tasks (in groups) are paused, and send them to the new process.

WaitGroup using JoinSet

Problem statement: we dynamically spawn different concurrent tasks, and each task can spawn new child tasks. We must wait for some of them to complete before continuing.

In other words, tasks can be managed as groups. In Go, waiting for a collection of tasks to complete is a solved problem with WaitGroup. We discussed a way to implement WaitGroup in Rust using channels in a previous blog. There also exist crates like waitgroup that simply use AtomicWaker. Another approach is using JoinSet, which may make the code more readable. Consider the example below, where we group the requests using a JoinSet.

    let mut task_group = JoinSet::new();

    loop {
        // Receive the request from a listener
        let Some(request) = listener.recv().await else {
            println!("There is no more request");
            break;
        };
        // Spawn a task that will process request.
        // This returns immediately
        task_group.spawn(process_request(request));
    }

    // Wait for all requests to be completed before continue
    while task_group.join_next().await.is_some() {}

However, an obvious problem is that if we receive a lot of requests, the JoinSet has to keep the results of all of them until they are joined. Let’s change the code to clean up the JoinSet as the application processes new requests, which lowers memory pressure.

    loop {
        tokio::select! {
            biased; // This is optional

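            // Clean up the JoinSet as we go.
            // Note: the is_empty check is important 😉 On an empty set,
            // join_next() resolves to None immediately, so without the guard
            // this branch would always win and starve the listener.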
            _task_result = task_group.join_next(), if !task_group.is_empty() => {}

            req = listener.recv() => {
                let Some(request) = req else {
                    println!("There is no more request");
                    break;
                };
                task_group.spawn(process_request(request));
            }
        }
    }

    while task_group.join_next().await.is_some() {}

Cancellation

We want to pass the pending requests to the new process as soon as possible once the upgrade signal is received. This requires us to pause all requests we are processing. In other words, to be able to implement graceful restart, we need to implement graceful shutdown. The official tokio tutorial already covers how this can be achieved using channels. Of course, we must guarantee that the tasks we are pausing are cancellation-safe. The paused results will be collected into the JoinSet, and we just need to pass them to the new process using file descriptor passing.
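The pause-and-collect idea can be sketched without tokio. The example below is a minimal illustration that swaps async tasks for standard-library threads and swaps the shutdown channel for a shared atomic flag. The `PausedRequest` type and the function names are hypothetical and this is not the code used in our services.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical snapshot of a request that was stopped at a safe point.
#[derive(Debug, PartialEq)]
struct PausedRequest {
    id: u32,
    bytes_done: u64,
}

/// Processes a request in small steps and checks the pause flag between
/// steps, so a request is only ever stopped where no work is lost.
fn process_request(id: u32, total: u64, pause: &AtomicBool) -> Result<u64, PausedRequest> {
    let mut bytes_done = 0;
    while bytes_done < total {
        if pause.load(Ordering::Acquire) {
            return Err(PausedRequest { id, bytes_done });
        }
        bytes_done += 1; // stand-in for one unit of real work
    }
    Ok(bytes_done)
}

/// Starts `n` long-running requests, asks all of them to pause, and collects
/// the paused states that would be handed to the new process.
fn pause_all(n: u32) -> Vec<PausedRequest> {
    let pause = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = (0..n)
        .map(|id| {
            let pause = Arc::clone(&pause);
            thread::spawn(move || process_request(id, u64::MAX, &pause))
        })
        .collect();

    pause.store(true, Ordering::Release);
    handles
        .into_iter()
        .filter_map(|handle| handle.join().expect("worker panicked").err())
        .collect()
}

fn main() {
    let paused = pause_all(3);
    println!("{} requests paused: {:?}", paused.len(), paused);
}
```

The important property is that a request only stops between steps. In async code the equivalent safe points are the `.await` points, which is why cancellation safety matters.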

For example, in Bumblebee, a paused state will include the environment’s file descriptors, client socket, and the socket proxying IP flow. We also need to transfer the current NAT table to the new process, which could be larger than the socket buffer. So the NAT table state is encoded into an anonymous file descriptor, and we just need to pass the file descriptor to the new process.

Conclusion

We considered how a complex proxy app can be divided into smaller components. Those components can run as separate processes, allowing different lifetimes. Still, this type of architecture does incur additional costs: distributed tracing and inter-process communication. However, the costs are acceptable considering the performance, maintainability, and reliability improvements. In the upcoming blog posts, we will talk about different debug tricks we learned when working with a large codebase with complex service interactions using tools like strace and eBPF.

Oxy: the journey of graceful restarts

2023-04-04 Chris Branch

Post Syndicated from Chris Branch original https://blog.cloudflare.com/oxy-the-journey-of-graceful-restarts/

Any software under continuous development and improvement will eventually need a new version deployed to the systems running it. This can happen in several ways, depending on how much you care about things like reliability, availability, and correctness. When I started out in web development, I didn’t think about any of these qualities; I simply blasted my new code over FTP directly to my /cgi-bin/ directory, which was the style at the time. For those of us producing desktop software, often you sidestep this entirely by having the user save their work, close the program and install an update – but they usually get to decide when this happens.

At Cloudflare we have to take this seriously. Our software is in constant use and cannot simply be stopped abruptly. A dropped HTTP request can cause an entire webpage to load incorrectly, and a broken connection can kick you out of a video call. Taking away reliability creates a vacuum filled only by user frustration.

The limitations of the typical upgrade process

There is no one right way to upgrade software reliably. Some programming languages and environments make it easier than others, but in a Turing-complete language few things are impossible.

One popular and generally applicable approach is to start a new version of the software, make it responsible for a small number of tasks at first, and then gradually increase its workload until the new version is responsible for everything and the old version responsible for nothing. At that point, you can stop the old version.

Most of Cloudflare’s proxies follow a similar pattern: they receive connections or requests from many clients over the Internet, communicate with other internal services to decide how to serve the request, and fetch content over the Internet if we cannot serve it locally. In general, all of this work happens within the lifetime of a client’s connection. If we aren’t serving any clients, we aren’t doing any work.

The safest time to restart, therefore, is when there is nobody to interrupt. But does such a time really exist? The Internet operates 24 hours a day and many users rely on long-running connections for things like backups, real-time updates or remote shell sessions. Even if you defer restarts to a “quiet” period, the next-best strategy of “interrupt the fewest number of people possible” will fail when you have a critical security fix that needs to be deployed immediately.

Despite this challenge, we have to start somewhere. You rarely arrive at the perfect solution in your first try.

(╯°□°）╯︵ ┻━┻

We have previously blogged about implementing graceful restarts in Cloudflare’s Go projects, using a library called tableflip. This starts a new version of your program and allows the new version to signal to the old version that it started successfully, then lets the old version clear its workload. For a proxy like any Oxy application, that means the old version stops accepting new connections once the new version starts accepting connections, then drives its remaining connections to completion.

This is the simplest case of the migration strategy previously described: the new version immediately takes all new connections, instead of a gradual rollout. But in aggregate across Cloudflare’s server fleet the upgrade process is spread across several hours and the result is as gradual as a deployment orchestrated by Kubernetes or similar.

tableflip also allows your program to bind to sockets, or to reuse the sockets opened by a previous instance. This enables the new instance to accept new connections on the same socket and let the old instance release that responsibility.

Oxy is a Rust project, so we can’t reuse tableflip. We rewrote the spawning/signaling section in Rust, but not the socket code. For that we had an alternative approach.

Socket management with systemd

systemd is a widely used suite of programs for starting and managing all of the system software needed to run a useful Linux system. It is responsible for running software in the correct order – for example ensuring the network is ready before starting a program that needs network access – or running it only if it is needed by another program.

Socket management falls in this latter category, under the term ‘socket activation’. Its intended and original use is interesting but ultimately irrelevant here; for our purposes, systemd is a mere socket manager. Many Cloudflare services configure their sockets using systemd .socket files, and when their service is started the socket is brought into the process with it. This is how we deploy most Oxy-based services, and Oxy has first-class support for sockets opened by systemd.

Using systemd decouples the lifetime of sockets from the lifetime of the Oxy application. When Oxy creates its own sockets on startup, those sockets are closed whenever you restart or temporarily stop the Oxy application. When clients attempt to connect to the proxy during this time, they will get a very unfriendly “connection refused” error. If, however, systemd manages the socket, that socket remains open even while the Oxy application is stopped. Clients can still connect to the socket and those connections will be served as soon as the Oxy application starts up successfully.
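To illustrate how a service picks up sockets opened by systemd, here is a minimal sketch of the socket activation convention: systemd sets `LISTEN_PID` and `LISTEN_FDS`, and the inherited descriptors start at number 3. The helper name `inherited_fds` is hypothetical and the sketch assumes every inherited socket is a TCP listener. It is not Oxy's implementation.

```rust
use std::net::TcpListener;
use std::os::unix::io::{FromRawFd, RawFd};

/// First descriptor number systemd uses for passed sockets.
const SD_LISTEN_FDS_START: RawFd = 3;

/// Returns the descriptors passed by systemd, given the raw values of
/// LISTEN_PID and LISTEN_FDS. The list is empty when the variables are
/// missing, malformed, or addressed to a different process.
fn inherited_fds(listen_pid: Option<&str>, listen_fds: Option<&str>, my_pid: u32) -> Vec<RawFd> {
    let for_us = listen_pid.and_then(|p| p.parse::<u32>().ok()) == Some(my_pid);
    if !for_us {
        return Vec::new();
    }
    let count = listen_fds
        .and_then(|n| n.parse::<RawFd>().ok())
        .unwrap_or(0)
        .max(0);
    (SD_LISTEN_FDS_START..SD_LISTEN_FDS_START.saturating_add(count)).collect()
}

fn main() {
    let pid = std::env::var("LISTEN_PID").ok();
    let fds = std::env::var("LISTEN_FDS").ok();
    for fd in inherited_fds(pid.as_deref(), fds.as_deref(), std::process::id()) {
        // SAFETY: the descriptor was handed to this process by systemd and
        // nothing else in this sketch owns it.
        let listener = unsafe { TcpListener::from_raw_fd(fd) };
        println!("inherited listener on {:?}", listener.local_addr());
    }
}
```

Because the socket belongs to systemd rather than to the process, the listener's backlog keeps accepting connection attempts while the service is down.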

Channeling your inner WaitGroup

A useful piece of library code our Go projects use is WaitGroups. These are essential in Go, where goroutines – asynchronously-running code blocks – are pervasive. Waiting for goroutines to complete before continuing another task is a common requirement. Even the example for tableflip uses them, to demonstrate how to wait for tasks to shut down cleanly before quitting your process.

There is not an out-of-the-box equivalent in tokio – the async Rust runtime Oxy uses – or async/await generally, so we had to create one ourselves. Fortunately, most of the building blocks to roll your own exist already. Tokio has multi-producer, single consumer (MPSC) channels, generally used by multiple tasks to push the results of work onto a queue for a single task to process, but we can exploit the fact that it signals to that single receiver when all the sender channels have been closed and no new messages are expected.

To start, we create an MPSC channel. Each task takes a clone of the producer end of the channel, and when that task completes it closes its instance of the producer. When we want to wait for all of the tasks to complete, we await a result on the consumer end of the MPSC channel. When every instance of the producer channel is closed – i.e. all tasks have completed – the consumer receives a notification that all of the channels are closed. Closing the channel when a task completes is an automatic consequence of Rust’s RAII rules. Because the language enforces this rule it is harder to write incorrect code, though in fact we need to write very little code at all.
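The same trick works with the standard library's `std::sync::mpsc` channel, which also reports disconnection once every sender is dropped. The sketch below uses threads instead of tokio tasks to stay dependency-free. The function name `run_and_wait` is hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

/// Spawns `tasks` workers, waits for all of them, and returns how many ran.
fn run_and_wait(tasks: usize) -> usize {
    // No message is ever sent; only the closing of the channel matters.
    let (guard, all_done) = mpsc::channel::<()>();
    let finished = Arc::new(AtomicUsize::new(0));

    for _ in 0..tasks {
        let guard = guard.clone();
        let finished = Arc::clone(&finished);
        thread::spawn(move || {
            finished.fetch_add(1, Ordering::SeqCst); // stand-in for real work
            drop(guard); // explicit here; RAII would do it at the end of the closure
        });
    }

    // Drop the original sender so that only the workers keep the channel open.
    drop(guard);

    // recv() fails with a disconnect error once every clone has been dropped.
    assert!(all_done.recv().is_err());
    finished.load(Ordering::SeqCst)
}

fn main() {
    println!("{} tasks completed", run_and_wait(8));
}
```

Forgetting to drop the original sender is the classic mistake with this pattern: the wait would then never return.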

Getting feedback on failure

Many programs that implement a graceful reload/restart mechanism use Unix signals to trigger the process to perform an action. Signals are an ancient technique introduced in early versions of Unix to solve a specific problem while creating dozens more. A common pattern is to change a program’s configuration on disk, then send it a signal (often SIGHUP) which the program handles by reloading those configuration files.

The limitations of this technique are obvious as soon as you make a mistake in the configuration, or when an important file referenced in the configuration is deleted. You reload the program and wonder why it isn’t behaving as you expect. If an error is raised, you have to look in the program’s log output to find out.

This problem compounds when you use an automated configuration management tool. It is not useful if that tool makes a configuration change and reports that it successfully reloaded your program, when in fact the program failed to read the change. The only thing that was successful was sending the reload signal!

We solved this in Oxy by creating a Unix socket specifically for coordinating restarts, and adding a new mode to Oxy that triggers a restart. In this mode:

The restarter process validates the configuration file.
It connects to the restart coordination socket defined in that file.
It sends a “restart requested” message.
The current proxy instance receives this message.
A new instance is started, inheriting a pipe it will use to notify its parent instance.
The current instance waits for the new instance to report success or fail.
The current instance sends a “restart response” message back to the restarter process, containing the result.
The restarter process reports this result back to the user, using exit codes for automated systems to detect failure.
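The message exchange in the steps above can be sketched with a connected pair of Unix sockets. The one-byte messages, the function names, and the closure standing in for "start a new instance and wait for its report" are all assumptions made for illustration. Configuration validation and the real wire format are left out.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;

// Hypothetical one-byte wire messages; a real protocol would carry more detail.
const RESTART_REQUESTED: u8 = b'R';
const RESTART_SUCCEEDED: u8 = b'S';
const RESTART_FAILED: u8 = b'F';

/// Proxy side: handles one restart request on the coordination socket.
/// `start_new_instance` stands in for spawning the new process and waiting
/// for it to report success or failure over its inherited pipe.
fn serve_restart(mut conn: UnixStream, start_new_instance: impl FnOnce() -> bool) -> io::Result<()> {
    let mut request = [0u8; 1];
    conn.read_exact(&mut request)?;
    if request[0] == RESTART_REQUESTED {
        let reply = if start_new_instance() { RESTART_SUCCEEDED } else { RESTART_FAILED };
        conn.write_all(&[reply])?;
    }
    Ok(())
}

/// Restarter side: asks for a restart and maps the response to an exit code.
fn request_restart(mut conn: UnixStream) -> io::Result<i32> {
    conn.write_all(&[RESTART_REQUESTED])?;
    let mut reply = [0u8; 1];
    conn.read_exact(&mut reply)?;
    Ok(if reply[0] == RESTART_SUCCEEDED { 0 } else { 1 })
}

/// Runs both sides over a connected socket pair and returns the exit code.
fn restart_round_trip(new_instance_ok: bool) -> i32 {
    let (restarter_end, proxy_end) = UnixStream::pair().expect("socket pair");
    let proxy = thread::spawn(move || serve_restart(proxy_end, move || new_instance_ok));
    let code = request_restart(restarter_end).expect("restart request failed");
    proxy.join().expect("proxy thread panicked").expect("proxy I/O failed");
    code
}

fn main() {
    println!("exit code on success: {}", restart_round_trip(true));
    println!("exit code on failure: {}", restart_round_trip(false));
}
```

The key difference from a signal is the reply: the caller blocks until it learns whether the new instance actually started.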

Now when we make a change to any of our Oxy applications, we can be confident that failures are detected using nothing more than our SREs’ existing tooling. This lets us discover failures earlier, narrow down root causes sooner, and avoid our systems getting into an inconsistent state.

This technique is described more generally in a coworker’s blog, using an internal HTTP endpoint instead. Yet HTTP is missing one important property of Unix sockets for the purpose of replacing signals. A user may only send a signal to a process if the process belongs to them – i.e. they started it – or if the user is root. This prevents another user logged into the same machine as you from terminating all of your processes. As Unix sockets are files, they also follow the Unix permission model. Write permissions are required to connect to a socket. Thus we can trivially reproduce the signals security model by making the restart coordination socket user writable only. (Root, as always, bypasses all permission checks.)
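As a small illustration of the permission model, the sketch below binds a Unix socket and restricts it to its owner with mode `0600`. The helper name `bind_owner_only` and the socket path are hypothetical. Note that there is a short window between `bind` and `set_permissions`, so a careful implementation might also place the socket in a private directory or tighten the umask first.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::os::unix::net::UnixListener;
use std::path::Path;

/// Binds a coordination socket that only its owner (and root) can connect to.
fn bind_owner_only(path: &Path) -> io::Result<UnixListener> {
    // Remove a stale socket file left behind by a previous run, if any.
    let _ = fs::remove_file(path);
    let listener = UnixListener::bind(path)?;
    // Connecting requires write permission on the socket file, so mode 0600
    // mirrors the "same user or root" rule that applies to signals.
    fs::set_permissions(path, fs::Permissions::from_mode(0o600))?;
    Ok(listener)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join(format!("restart-sketch-{}.sock", std::process::id()));
    let _listener = bind_owner_only(&path)?;
    let mode = fs::metadata(&path)?.permissions().mode() & 0o777;
    println!("socket mode: {:o}", mode);
    fs::remove_file(&path)
}
```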

Leave no connection behind

We have put a lot of effort into making restarts as graceful as possible, but there are still certain limitations. After restarting, eventually the old process has to terminate, to prevent a build-up of old processes after successive restarts consuming excessive memory and reducing the performance of other running services. There is an upper bound to how long we’ll let the old process run for; when this is reached, any connections remaining are forcibly broken.
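The upper bound can be expressed as a timed wait on a "all connections finished" signal. The sketch below assumes a channel-based wait group, where every live connection holds a sender clone, and uses the standard library's `recv_timeout`. The `Drain` enum and the function name are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Drain {
    /// Every connection finished before the limit.
    Completed,
    /// The limit was reached; remaining connections must be broken.
    DeadlineReached,
}

/// Waits until all senders are dropped (all connections done) or until
/// `limit` elapses, whichever comes first.
fn wait_for_drain(all_done: &Receiver<()>, limit: Duration) -> Drain {
    match all_done.recv_timeout(limit) {
        Err(RecvTimeoutError::Disconnected) => Drain::Completed,
        // A timeout (or a stray message, which this sketch never sends)
        // means at least one connection is still alive.
        _ => Drain::DeadlineReached,
    }
}

fn main() {
    let (connection_guard, all_done) = mpsc::channel::<()>();
    // One connection is still open, so the short limit is hit first.
    println!("{:?}", wait_for_drain(&all_done, Duration::from_millis(20)));
    // Once the connection ends, its guard is dropped and the drain completes.
    drop(connection_guard);
    println!("{:?}", wait_for_drain(&all_done, Duration::from_millis(20)));
}
```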

The configuration changes that can be applied using graceful restart are limited by the design of systemd. While some configuration like resource limits can now be applied without restarting the service it applies to, others cannot; most significantly, new sockets. This is a problem inherent to the fork-and-inherit model.

For UDP-based protocols like HTTP/3, there is not even a concept of listener socket. The new process may open UDP sockets, but by default incoming packets are balanced between all open unconnected UDP sockets for a given address. How does the old process drain existing sessions without receiving packets intended for the new process and vice versa?

Is there a way to carry existing state to a new process to avoid some of these limitations? This is a hard problem to solve generally, and even in languages designed to support hot code upgrades there is some degree of running old tasks with old versions of code. Yet there are some common useful tasks that can be carried between processes so we can “interrupt the fewest number of people possible”.

Let’s not forget the unplanned outages: segfaults, oomkiller and other crashes. Thankfully rare in Rust code, but not impossible.

You can find the source for our Rust implementation of graceful restarts, named shellflip, in its GitHub repository. However, restarting correctly is just the first step of many needed to achieve our ultimate reliability goals. In a follow-up blog post we’ll talk about some creative solutions to these limitations.

From IP packets to HTTP: the many faces of our Oxy framework

2023-03-30 Nuno Diegues

Post Syndicated from Nuno Diegues original https://blog.cloudflare.com/from-ip-packets-to-http-the-many-faces-of-our-oxy-framework/

We have recently introduced Oxy, our Rust-based framework for proxies powering many Cloudflare services and products. Today, we will explain why and how it spans various layers of the OSI model, by directly handling raw IP packets, TCP connections and UDP payloads, all the way up to application protocols such as HTTP and SSH.

On-ramping IP packets

An application built on top of Oxy defines — in a configuration file — the on-ramps that will accept ingress traffic to be proxied to some off-ramp. One of the possibilities is to on-ramp raw IP packets. But why operate at that layer?

The answer is: to power Cloudflare One, our network offering for customers to extend their private networks — such as offices, data centers, cloud networks and roaming users — with the Cloudflare global network. Such private networks operate based on Zero Trust principles, which means every access is authenticated and authorized, contrasting with legacy approaches where you can reach every private service after authenticating once with the Virtual Private Network.

To effectively extend our customer’s private network into ours, we need to support arbitrary protocols that rely on the Internet Protocol (IP). Hence, we on-ramp Cloudflare One customers’ traffic at (OSI model) layer 3, as a stream of IP packets. Naturally, those will often encapsulate TCP streams and UDP sessions. But nothing precludes other traffic from flowing through.

IP tunneling

Cloudflare’s operational model dictates that every service, machine and network be operated in a homogeneous way, usable by every one of our customers the same way. We essentially have a gigantic multi-tenanted system. Simply on-ramping raw IP packets does not suffice: we must always move the IP packets within the scope of the tenant they belong to.

This is why we introduced the concept of IP tunneling in Oxy: every IP packet handled has context associated with it; at the very least, the tenant that it belongs to. Other arbitrary contexts can be added, but that is up to each application (built on top of Oxy) to define, parse and consume in its Oxy hooks. This allows applications to extend and customize Oxy’s behavior.

You have probably heard of (or even used!) Cloudflare Zero Trust WARP: a client software that you can install on your device(s) to create virtual private networks managed and handled by Cloudflare. You begin by authenticating with your Cloudflare One account, and then the software will on-ramp your device’s traffic through the nearest Cloudflare data center: either to be upstreamed to Internet public IPs, or to other Cloudflare One connectors, such as another WARP device.

Today, WARP routes the traffic captured in your device (e.g. your smartphone) via a WireGuard tunnel that is terminated in a server in the nearest Cloudflare data center. That server then opens an IP tunnel to an Oxy instance running on the same server. To convey context about that traffic, namely the identity of the tenant, some context must be attached to the IP tunnel.

For this, we use a Unix SOCK_SEQPACKET, which is a datagram-oriented socket exposing a connection-based interface with reliable and ordered delivery. It only accepts connections locally, within the machine where it is bound. Oxy receives the context in the first datagram, which the application parses; it can be in any format the application using Oxy desires. Then all subsequent datagrams are assumed to be raw self-describing IP packets, with no overhead whatsoever.
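The "context first, raw packets after" framing can be sketched as follows. The Rust standard library has no SOCK_SEQPACKET type, so this illustration substitutes a connected `UnixDatagram` pair, which also preserves message boundaries. The `Tunnel` type, the `tenant=42` context string, and the function names are hypothetical.

```rust
use std::io;
use std::os::unix::net::UnixDatagram;

/// What the receiving side learns from one IP tunnel connection.
struct Tunnel {
    /// Application-defined context, carried only in the first datagram.
    context: String,
    /// Every later datagram is one raw IP packet.
    packets: Vec<Vec<u8>>,
}

/// IP packets are self-describing: the top four bits give the IP version.
fn ip_version(packet: &[u8]) -> Option<u8> {
    packet.first().map(|byte| byte >> 4)
}

/// Reads the context datagram followed by `packet_count` IP packets.
fn read_tunnel(sock: &UnixDatagram, packet_count: usize) -> io::Result<Tunnel> {
    let mut buf = [0u8; 2048];
    let n = sock.recv(&mut buf)?;
    let context = String::from_utf8_lossy(&buf[..n]).into_owned();

    let mut packets = Vec::with_capacity(packet_count);
    for _ in 0..packet_count {
        let n = sock.recv(&mut buf)?;
        packets.push(buf[..n].to_vec());
    }
    Ok(Tunnel { context, packets })
}

fn main() -> io::Result<()> {
    let (sender, receiver) = UnixDatagram::pair()?;
    sender.send(b"tenant=42")?; // context, sent once
    sender.send(&[0x45, 0, 0, 20])?; // truncated IPv4 header, for illustration
    sender.send(&[0x60, 0, 0, 0])?; // truncated IPv6 header, for illustration

    let tunnel = read_tunnel(&receiver, 2)?;
    println!("context: {}", tunnel.context);
    for packet in &tunnel.packets {
        println!("IP version {:?}", ip_version(packet));
    }
    Ok(())
}
```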

Other examples are the on-ramps of Magic WAN, such as GRE or IPsec tunnels, which also bring raw IP packets from customers’ networks to Cloudflare data centers. Unlike WARP, whose IP packets are decapsulated in user space, for GRE and IPsec we rely on the Linux kernel to do the job for us. Hence, we have no state whatsoever between two consecutive IP packets coming from the same customer, as the Linux kernel is routing them independently.

To accommodate the differences between IP packet handling in user space and the kernel, Oxy differentiates two types of IP tunnels:

Connected IP tunnels — as explained for WARP above, where the context is passed once, in the first datagram of the IP Tunnel SEQPACKET connection
Unconnected IP tunnels — used by Magic WAN, where each IP packet is encapsulated (using GUE, i.e. Generic UDP Encapsulation) to accommodate the context and unconnected UDP sockets are used

Encapsulating every IP packet comes at the cost of extra CPU usage. But moving the packet around to and from an Oxy instance does not change much regardless of the encapsulation, as we do not have MTU limitations inside our data centers. This way we avoid causing IP packet fragmentation, whose reassembly takes a toll on CPU and memory usage.

Tracking IP flows

Once IP packets arrive at Oxy, regardless of how they on-ramp, we must decide what to do with them. We decided to rely on the idea of IP flows, as that is inherent to most protocols: a point to point interaction will generally be bounded in time and follow some type of state machine, either known by the transport or by the application protocol.

We perform flow tracking to detect IP flows. When handling an on-ramped IP packet, we parse its IP header and possible transport (i.e. OSI Model layer 4) header. We use the excellent etherparse Rust crate for this purpose, which parses the flow signature, with a source and destination IP address, ports (optional) and protocol. We then look up whether there is already a known IP flow for that signature: if so, then the packet is proxied through the path already determined for that flow towards its off-ramp. If the flow is new, then its upstream route is computed and memoized for future packets. This is in essence what routers do, and to some extent Oxy handling of IP packets is meant to operate as a router.
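The lookup-or-compute logic can be sketched as a hash map keyed by the flow signature. To stay dependency-free, this illustration parses the IPv4 header by hand instead of using etherparse, handles IPv4 only, and ignores fragmentation. The `FlowSignature`, `OffRamp`, and `FlowTable` types are hypothetical and not Oxy's own.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct FlowSignature {
    src: Ipv4Addr,
    dst: Ipv4Addr,
    protocol: u8,
    /// (source port, destination port), present for TCP and UDP only.
    ports: Option<(u16, u16)>,
}

/// Hypothetical routing decision for a flow.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OffRamp {
    Internet,
    Connector(u32),
}

/// Extracts the flow signature from a raw IPv4 packet.
fn flow_signature(packet: &[u8]) -> Option<FlowSignature> {
    if packet.len() < 20 || packet[0] >> 4 != 4 {
        return None;
    }
    let header_len = usize::from(packet[0] & 0x0f) * 4;
    if header_len < 20 {
        return None;
    }
    let transport = packet.get(header_len..)?;
    let ports = match packet[9] {
        // 6 = TCP, 17 = UDP: both start with source and destination ports.
        6 | 17 if transport.len() >= 4 => Some((
            u16::from_be_bytes([transport[0], transport[1]]),
            u16::from_be_bytes([transport[2], transport[3]]),
        )),
        _ => None,
    };
    Some(FlowSignature {
        src: Ipv4Addr::new(packet[12], packet[13], packet[14], packet[15]),
        dst: Ipv4Addr::new(packet[16], packet[17], packet[18], packet[19]),
        protocol: packet[9],
        ports,
    })
}

#[derive(Default)]
struct FlowTable {
    routes: HashMap<FlowSignature, OffRamp>,
}

impl FlowTable {
    /// Returns the memoized route, computing it only for a new flow.
    fn route_for(&mut self, sig: FlowSignature, compute: impl FnOnce(&FlowSignature) -> OffRamp) -> OffRamp {
        *self.routes.entry(sig).or_insert_with(|| compute(&sig))
    }
}

fn main() {
    // IPv4 + UDP header: 10.0.0.1:12345 -> 10.0.0.2:53, no payload.
    let packet: [u8; 28] = [
        0x45, 0, 0, 28, 0, 0, 0, 0, 64, 17, 0, 0, 10, 0, 0, 1, 10, 0, 0, 2,
        0x30, 0x39, 0, 53, 0, 8, 0, 0,
    ];
    let sig = flow_signature(&packet).expect("valid IPv4 packet");
    let mut table = FlowTable::default();
    println!("{:?} -> {:?}", sig, table.route_for(sig, |_| OffRamp::Internet));
    // The second packet of the same flow reuses the stored decision.
    println!("{:?}", table.route_for(sig, |_| OffRamp::Connector(7)));
}
```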

The interesting thing about tracking IP flows is that we can now expose their lifetime events to the application built on top of Oxy, via its hooks. Applications can then use these events for interesting operations, such as:

Applying Zero Trust principles before allowing the IP flow through, such as our Secure Web Gateway policies
Emitting audit logs that collect the decisions taken at the start of the IP flow
Collecting metadata about the traffic processed by the time the IP flow ends, e.g., to support billing calculations
Computing routing decisions of where to send the IP flow next, e.g. to another Cloudflare product/service, or off-ramped to the Internet, or to another Cloudflare One connector
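To make the idea of lifetime hooks concrete, here is a minimal sketch of what such an interface could look like. The `FlowHooks` trait, its method names, and the `ExampleApp` type are invented for illustration and do not reflect Oxy's real hook API.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Deny,
}

/// Hypothetical hooks an application implements to observe IP flows.
trait FlowHooks {
    /// Called once when a new flow is detected; decides whether it may pass.
    fn on_flow_start(&mut self, tenant: u32, dst: Ipv4Addr) -> Verdict;
    /// Called once when the flow ends, with the bytes it carried.
    fn on_flow_end(&mut self, tenant: u32, bytes: u64);
}

#[derive(Default)]
struct ExampleApp {
    blocked: Vec<Ipv4Addr>,
    audit_log: Vec<String>,
    billed_bytes: HashMap<u32, u64>,
}

impl FlowHooks for ExampleApp {
    fn on_flow_start(&mut self, tenant: u32, dst: Ipv4Addr) -> Verdict {
        // Policy check first, then record the decision for auditing.
        let verdict = if self.blocked.contains(&dst) { Verdict::Deny } else { Verdict::Allow };
        self.audit_log.push(format!("tenant={} dst={} verdict={:?}", tenant, dst, verdict));
        verdict
    }

    fn on_flow_end(&mut self, tenant: u32, bytes: u64) {
        // Accumulate usage per tenant, e.g. to support billing.
        *self.billed_bytes.entry(tenant).or_insert(0) += bytes;
    }
}

fn main() {
    let mut app = ExampleApp {
        blocked: vec![Ipv4Addr::new(203, 0, 113, 7)],
        ..Default::default()
    };
    println!("{:?}", app.on_flow_start(1, Ipv4Addr::new(203, 0, 113, 7)));
    println!("{:?}", app.on_flow_start(1, Ipv4Addr::new(198, 51, 100, 1)));
    app.on_flow_end(1, 1500);
    println!("audit log: {:?}", app.audit_log);
    println!("billed: {:?}", app.billed_bytes);
}
```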

From an IP flow to a TCP stream

You would think that most applications do not handle IP packets directly. That is a good hunch, and also a fact at Cloudflare: many systems operate at the application layer (OSI Model layer 7) where they can inspect traffic in a way much closer to what the end user is perceiving.

To get closer to that reality, Oxy can upgrade an IP flow to the transport layer (OSI Model layer 4). We first consider what this means for the case of TCP traffic. The problem that we want to solve is to process a given stream of raw IP packets, with the same TCP flow signature initiating a TCP handshake, and obtain as a result a TCP connection streaming data. Hence, we need a TCP protocol implementation that can be used from userspace.

The best Rust-native implementation is the smoltcp crate. However, its stated objectives do not match our needs, as it does not implement many of the performance and reliability enhancements of TCP that are expected of a first-class TCP, therefore not sufficing for the sheer amount of traffic and demands we have.

Instead, we rely on the Linux kernel to help us here. After all, it has the most battle-tested TCP protocol implementation in the world.

To leverage that, we set up a TUN interface, and add an IP route to forward traffic to that interface (more details below as to what IPs to use). A TUN interface is a virtual network device whose network data is generated by user-programmable software, rather than a device driver for a physically-connected network adapter. But otherwise it looks and works like a physical network adapter for all purposes.

We write the IP packets (those meant to be upgraded to a TCP stream) to the file descriptor backing the TUN interface. However, that’s not enough, as the kernel in our machines will drop those packets since customers’ IP addresses only make sense in their own infrastructure.

The step we are missing is that those packets must be transformed, i.e. Network Address Translated (NAT), so that the kernel routes them into the TUN interface. Hence, Oxy maintains its own stateful NAT: every IP flow desired to be upgraded to a TCP stream must claim a NAT slot (to be returned when the TCP stream finishes), and have its packets’ addresses rewritten for the IPs that the TUN interface route encompasses.
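The slot bookkeeping of such a NAT can be sketched as follows. This is an assumed design for illustration: each flow claims one source port on an address inside the TUN route, and the destination is rewritten to the local TCP listener. The `Nat` and `Flow` types are hypothetical, and a real implementation would also rewrite the packet bytes and fix the IP and TCP checksums.

```rust
use std::collections::HashMap;
use std::net::{Ipv4Addr, SocketAddrV4};
use std::ops::RangeInclusive;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Flow {
    src: SocketAddrV4,
    dst: SocketAddrV4,
}

struct Nat {
    /// Source address used for translated packets, inside the TUN route.
    nat_ip: Ipv4Addr,
    /// Address of the TCP listener that accepts upgraded flows.
    listener: SocketAddrV4,
    free_ports: Vec<u16>,
    /// Claimed slots: NAT source port -> original flow addresses.
    slots: HashMap<u16, Flow>,
}

impl Nat {
    fn new(nat_ip: Ipv4Addr, listener: SocketAddrV4, ports: RangeInclusive<u16>) -> Self {
        // Reverse so that pop() hands out the lowest port first.
        Nat { nat_ip, listener, free_ports: ports.rev().collect(), slots: HashMap::new() }
    }

    /// Claims a slot and returns the rewritten addresses, or None when full.
    fn claim(&mut self, original: Flow) -> Option<Flow> {
        let port = self.free_ports.pop()?;
        self.slots.insert(port, original);
        Some(Flow { src: SocketAddrV4::new(self.nat_ip, port), dst: self.listener })
    }

    /// Looks up the original addresses, used to undo the translation.
    fn original(&self, nat_port: u16) -> Option<Flow> {
        self.slots.get(&nat_port).copied()
    }

    /// Returns the slot once the TCP stream finishes.
    fn release(&mut self, nat_port: u16) {
        if self.slots.remove(&nat_port).is_some() {
            self.free_ports.push(nat_port);
        }
    }
}

fn main() {
    let client = Flow {
        src: SocketAddrV4::new(Ipv4Addr::new(192, 168, 1, 10), 50000),
        dst: SocketAddrV4::new(Ipv4Addr::new(10, 1, 2, 3), 22),
    };
    let listener = SocketAddrV4::new(Ipv4Addr::new(10, 0, 0, 1), 8080);
    let mut nat = Nat::new(Ipv4Addr::new(10, 0, 0, 2), listener, 40000..=40009);

    let translated = nat.claim(client).expect("free slot");
    println!("{:?} is rewritten to {:?}", client, translated);
    println!("reverse lookup: {:?}", nat.original(translated.src.port()));
    nat.release(translated.src.port());
}
```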

Once packets flow into the TUN interface with the right addresses, the kernel will process them as if they had entered the machine through your network card. This means that you can now bind a TCP listener to accept TCP connections on the IP address to which the NAT-ed IP packets are destined, and voilà, we have our IP flows upgraded to TCP streams.

We are left with one question: what IP address should the NAT use? One option is to just reserve some machine-local IP address and hope that no other application running in that machine uses it, as otherwise unexpected traffic will show up in our TUN device.

Instead, we chose to not have to worry about that at all by relying on Linux network namespaces. A network namespace provides you with an isolated network in a machine, acting as a virtualization layer provided by the kernel. Even if you do not know what this is, you are likely using it, e.g. via Docker.

Hence, Oxy dynamically starts a network namespace to run its TUN interface for upgrading IP flows, where it can use all the local IP space and ports freely. After all, those TCP connections only matter locally, between Oxy’s NAT and Oxy’s L4 proxy.

An interesting aspect here is that the Oxy application itself runs in the default/root namespace, making it easily reachable for on-ramping traffic, and also able to off-ramp traffic to other services operating on the same machine in the default/root namespace. But that raises the question: how is Oxy able to operate simultaneously in the root namespace as well as in the namespace dedicated to upgrading IP flows to TCP connections? The trick is to:

Run the Oxy-based process in the root namespace, without any special permissions (no elevated permissions required).
That process calls clone into a new unnamed user and network namespace.
The child (cloned) and parent (original) processes communicate via a paired pipe.
The child brings up the TUN interface and establishes the IP routes to it.
The child process binds a TCP listener on an IP address that is bound to the TUN interface and passes that file descriptor to the parent process using SCM_RIGHTS.

This way, the Oxy process will now have a TCP listener, to obtain the upgraded IP flow connections from, while running in the default namespace and despite that TCP listener — and any connections accepted from it — operating in an unnamed dynamically created namespace.

From a TCP stream to HTTP

Once Oxy has a TCP stream, it may also upgrade it, in a sense, to be handled as HTTP traffic. Again, the framework provides the capabilities, but it is up to the application (built on top of Oxy) to make the decision. Analogously to the IP flow, the TCP stream start also triggers a hook to let the application know about a new connection, and to let it decide what to do with it. One of the choices is to treat it as HTTP(S) traffic, at which point Oxy will pass the connection through a Hyper server (possibly also doing TLS if necessary). If you are curious about this part, then rest assured we will have a blog post focused just on that soon.

What about UDP

While we have been focusing on TCP so far, all of the capabilities implemented for TCP are supported for UDP as well. We’ve glossed over it so far because it is easier to handle, since converting an IP packet to UDP payloads requires only stripping the IP and UDP headers. We do this in Oxy logic, in user space, thereby replacing the idea employed for TCP that relies on the TUN interface. Everything else works the same way across TCP and UDP, with UDP traffic potentially being HTTPS for the case of QUIC-based HTTP/3.
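Stripping the headers is simple enough to sketch in a few lines. The function below is a minimal, IPv4-only illustration that ignores fragmentation and does not verify checksums. The name `udp_payload` is hypothetical.

```rust
/// Returns the UDP payload of a raw IPv4 packet, or None if the packet is
/// not a well-formed IPv4/UDP packet.
fn udp_payload(packet: &[u8]) -> Option<&[u8]> {
    // Version must be 4 and the protocol field (byte 9) must be 17 (UDP).
    if packet.len() < 20 || packet[0] >> 4 != 4 || packet[9] != 17 {
        return None;
    }
    // The IPv4 header length is given in 32-bit words.
    let header_len = usize::from(packet[0] & 0x0f) * 4;
    if header_len < 20 {
        return None;
    }
    let udp = packet.get(header_len..)?;
    if udp.len() < 8 {
        return None;
    }
    // The UDP length field covers the 8-byte UDP header plus the payload.
    let udp_len = usize::from(u16::from_be_bytes([udp[4], udp[5]]));
    udp.get(8..udp_len)
}

fn main() {
    // 10.0.0.1:12345 -> 10.0.0.2:53 carrying the two bytes "hi".
    let packet = [
        0x45, 0, 0, 30, 0, 0, 0, 0, 64, 17, 0, 0, 10, 0, 0, 1, 10, 0, 0, 2,
        0x30, 0x39, 0, 53, 0, 10, 0, 0, b'h', b'i',
    ];
    println!("{:?}", udp_payload(&packet));
}
```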

From TCP/UDP back to IP flow

We have been looking at IP packets on-ramping in Oxy and converting from IP flows to TCP/UDP. Eventually that traffic is sent to an upstream that will respond back, and so we ought to obtain resulting IP packets to send to the client. This happens quite naturally in the code base as we only need to revert the operation done in the upgrade:

For UDP, we add the IP and UDP headers to the payload of each datagram and thereby obtain the IP packet to send to the client.
For TCP, writing to the upgraded TCP socket causes the kernel to generate IP packets routed to the TUN interface. We read these packets from the TUN interface and undo the NAT operation explained above — applied to packets being written to the TUN interface — thereby obtaining the IP packet to send to the client.
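For the UDP case, adding the headers back can be sketched as follows. This is a minimal IPv4-only illustration with fixed header fields (TTL 64, "don't fragment" set, identification 0) chosen for the example. It leaves the UDP checksum at zero, which IPv4 permits as "not computed". The function names are hypothetical.

```rust
use std::net::{Ipv4Addr, SocketAddrV4};

/// Internet checksum over an IPv4 header (ones' complement of the
/// ones' complement sum of its 16-bit words).
fn ipv4_checksum(header: &[u8]) -> u16 {
    let mut sum: u32 = header
        .chunks_exact(2)
        .map(|word| u32::from(u16::from_be_bytes([word[0], word[1]])))
        .sum();
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// Wraps a UDP payload in UDP and IPv4 headers.
fn wrap_udp(src: SocketAddrV4, dst: SocketAddrV4, payload: &[u8]) -> Option<Vec<u8>> {
    // 20 bytes of IPv4 header + 8 bytes of UDP header must fit in 65535.
    if payload.len() > 65535 - 28 {
        return None;
    }
    let udp_len = (8 + payload.len()) as u16;
    let total_len = udp_len + 20;

    let mut packet = Vec::with_capacity(usize::from(total_len));
    packet.extend_from_slice(&[0x45, 0]); // version 4, header length 5 words
    packet.extend_from_slice(&total_len.to_be_bytes());
    packet.extend_from_slice(&[0, 0, 0x40, 0]); // id 0, don't fragment
    packet.extend_from_slice(&[64, 17, 0, 0]); // TTL, UDP, checksum placeholder
    packet.extend_from_slice(&src.ip().octets());
    packet.extend_from_slice(&dst.ip().octets());
    let checksum = ipv4_checksum(&packet[..20]);
    packet[10..12].copy_from_slice(&checksum.to_be_bytes());

    packet.extend_from_slice(&src.port().to_be_bytes());
    packet.extend_from_slice(&dst.port().to_be_bytes());
    packet.extend_from_slice(&udp_len.to_be_bytes());
    packet.extend_from_slice(&[0, 0]); // UDP checksum not computed
    packet.extend_from_slice(payload);
    Some(packet)
}

fn main() {
    let src = SocketAddrV4::new(Ipv4Addr::new(10, 0, 0, 2), 53);
    let dst = SocketAddrV4::new(Ipv4Addr::new(10, 0, 0, 1), 12345);
    let packet = wrap_udp(src, dst, b"hi").expect("payload fits");
    println!("{} bytes: {:02x?}", packet.len(), packet);
}
```

A header with a correct checksum sums to zero when the checksum routine is run over it again, which gives a cheap self-check.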

More interestingly, the application built on top of Oxy may also define that TCP/UDP traffic (handled as layer 4) is to be downgraded to IP flow (i.e. layer 3). To imagine where this would be usable, consider another Cloudflare One example, where a WARP client establishes an SSH session to a remote WARP device (which is now possible) and has configured SSH command audit logging — in that case, we will have the following steps:

On-ramp the IP packets from WARP client device into the Oxy application.
Oxy tracks the IP flows; as mandated by the application, Oxy checks whether each one is a TCP flow with destination port 22 and, if so, upgrades it to a TCP connection.
The application is given control of the TCP connection and, in this case, our Secure Web Gateway (an Oxy application) parses the traffic to perform the SSH command logging.
Since the upstream is determined to be another WARP device, Oxy is mandated to downgrade the TCP connection to IP packets, so that they can be off-ramped to the upstream as such.

Therefore, we need to provide the capability to do step 4, which we haven’t described yet. For UDP the operation is trivial: add or remove the IP/UDP headers as necessary.

For TCP, we will again resort to (another) TUN interface. This is slightly more complicated than upgrading, because when upgrading we use a single TCP listener from the network namespace where all upgraded connections appear, whereas to downgrade we need a TCP client connection from the network namespace per downgraded connection. Therefore we need to interact with the network namespace to obtain these on-demand TCP client connections at runtime, as explained next, making the process to downgrade more convoluted.

To enable that, we rely on the paired pipe maintained between the Oxy (parent) process and the cloned (child) process that operates inside the dynamic namespace: it is used for requesting the TCP client socket for a specific IP flow. This entails the following steps:

The Oxy process reserves a NAT mapping for that IP flow for downgrade.
It requests (via a pipe sendmsg) the cloned child process to establish a TCP connection to the NAT-ed addresses.
By doing so, the child process inherently makes the Linux kernel TCP implementation issue a TCP handshake to the upstream, causing a SYN IP packet to show up in the TUN interface.
The Oxy process is consuming packets from the downgrading namespace’s TUN interface, and hence will consume that packet, for which it promptly reverts the NAT. The IP packet is then off-ramped as explained in the next section.
In the meantime, the child process will have sent back (via the paired pipe) the file descriptor for the TCP client socket, again using SCM_RIGHTS. The Oxy application will now proxy the client TCP connection (meant to be downgraded) into that obtained TCP connection, to result in the raw IP packets read from the TUN interface.

Despite being elaborate, this is quite intuitive, particularly if you’ve read through the earlier upgrade section, which describes a simpler version of this idea.

The overall picture

In the sections above we have covered the life of an IP packet entering Oxy and what happens to it until exiting towards its upstream destination. This is summarized in the following diagram illustrating the life cycle of such packets.

We are left with how to exit the traffic. Sending the proxied traffic towards its destination (referred to as upstream) is what we call off-ramping it. We support off-ramping traffic across the same OSI Model layers that we allow to on-ramp: that is, as IP packets, TCP or UDP sockets, or HTTP(S) directly.

It is up to the application logic (that uses the Oxy framework) to make that decision and instruct Oxy on which layer to use. There is a lot to be said about this part, such as what IPs to use when egressing to the Internet — so if you are curious for more details, then stay tuned for more blog posts about Oxy.

No software overview is complete without its tests. The one interesting thing to think about here is that, to test all of the above, we need to generate raw IP packets in our tests. That’s not ideal as one would like to just write plain Rust logic that establishes TCP connections towards the Oxy proxy. Hence, to simplify all of this, our tests actually reuse our internal library (described above) to create dynamic network namespaces and downgrade/upgrade the TCP connections as necessary.

Therefore, our tests talk normal TCP against a TCP downgrader running together with the tests, which outputs raw IP packets that we pipe to the Oxy instance being tested. It is an elegant and simple way to work around the challenge while further battle-testing the TUN interface logic.

Wrapping up

A framework that covers everything from proxying IP packets to HTTP requests can feel overly broad. We felt the same at first at Cloudflare, particularly because Oxy was not born in a day, and in fact it started first with HTTP proxying and then started to go down the OSI Model layers. In hindsight, doing it all feels like the right decision: being able to upgrade and downgrade traffic as necessary has been very useful, and in fact our proxying logic shares the majority of code despite handling different layers (socket primitives, observability, security aspects, configurability, etc.).

Today, all of the ideas above are powering Cloudflare One Zero Trust as well as plain WARP. This means they are battle-tested across millions of daily users exchanging most of their traffic (both to the Internet as well as towards private/corporate networks) through the Cloudflare global network.

If you’ve enjoyed reading this and are interested in working on similar challenges with Rust, then be sure to check our open positions as we continue to grow our team. Likewise, there will be more blog posts related to our learnings developing Oxy, so tag along for the ride!

Oxy is Cloudflare’s Rust-based next generation proxy framework

2023-03-02 Ivan Nikulin

Post Syndicated from Ivan Nikulin original https://blog.cloudflare.com/introducing-oxy/

In this blog post, we are proud to introduce Oxy – our modern proxy framework, developed using the Rust programming language. Oxy is a foundation of several Cloudflare projects, including the Zero Trust Gateway, the iCloud Private Relay second hop proxy, and the internal egress routing service.

Oxy leverages our years of experience building high-load proxies to implement the latest communication protocols, enabling us to effortlessly build sophisticated services that can accommodate massive amounts of daily traffic.

We will be exploring Oxy in greater detail in upcoming technical blog posts, providing a comprehensive and in-depth look at its capabilities and potential applications. For now, let us embark on this journey and discover what Oxy is and how we built it.

What Oxy does

We refer to Oxy as our “next-generation proxy framework”. But what do we really mean by “proxy framework”? Picture a server (like NGINX, which readers might be familiar with) that can proxy traffic with an array of protocols, including various predefined common traffic flow scenarios that enable you to route traffic to specific destinations or even egress with a different protocol than the one used for ingress. This server can be configured in many ways for specific flows and boasts tight integration with the surrounding infrastructure, whether telemetry consumers or networking services.

Now, take all of that and add in the ability to programmatically control every aspect of the proxying: protocol decapsulation, traffic analysis, routing, tunneling logic, DNS resolution, and so much more. This is what the Oxy proxy framework is: a feature-rich proxy server tightly integrated with our internal infrastructure that’s customizable to meet application requirements, allowing engineers to tweak every component.

This design is in line with our belief in an iterative approach to development, where a basic solution is built first and then gradually improved over time. With Oxy, you can start with a basic solution that can be deployed to our servers and then add additional features as needed, taking advantage of the many extensibility points offered by Oxy. In fact, you can avoid writing any code besides a few lines of bootstrap boilerplate and still get a production-ready server with a wide variety of startup configuration options and traffic flow scenarios.

For example, suppose you’d like to implement an HTTP firewall. With Oxy, you can proxy HTTP(S) requests right out of the box, eliminating the need to write any code related to production services, such as request metrics and logs. You simply need to implement an Oxy hook handler for HTTP requests and responses. If you’ve used Cloudflare Workers before, then you should be familiar with this extensibility model.
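To make the idea concrete, here is a minimal sketch of what such a hook handler could look like. Oxy is not public, so every name below (`HttpRequest`, `Verdict`, `HttpRequestHook`, `HostBlocklist`) is an assumption made for illustration and not Oxy's actual API.

```rust
// Hypothetical types for illustration only; none of these are Oxy's real API.

/// A simplified view of an HTTP request as a hook might see it.
pub struct HttpRequest {
    pub method: String,
    pub host: String,
    pub path: String,
}

/// What the firewall tells the proxy to do with a request.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// Forward the request upstream unchanged.
    Allow,
    /// Reject the request with the given HTTP status code.
    Block { status: u16 },
}

/// A hook with a single callback invoked for each request.
pub trait HttpRequestHook {
    fn on_request(&self, req: &HttpRequest) -> Verdict;
}

/// A firewall that blocks requests to a fixed set of hosts.
pub struct HostBlocklist {
    pub blocked_hosts: Vec<String>,
}

impl HttpRequestHook for HostBlocklist {
    fn on_request(&self, req: &HttpRequest) -> Verdict {
        // Host names are case-insensitive, so compare without regard to case.
        let blocked = self
            .blocked_hosts
            .iter()
            .any(|h| h.eq_ignore_ascii_case(&req.host));
        if blocked {
            Verdict::Block { status: 403 }
        } else {
            Verdict::Allow
        }
    }
}
```

In this sketch the application supplies only the decision logic. Listening, TLS, metrics, and logging are assumed to be the framework's job.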

Similarly, you can implement a layer 4 firewall by providing application hooks that handle ingress and egress connections. This goes beyond a simple block/accept scenario, as you can build authentication functionality or a traffic router that sends traffic to different destinations based on the geographical information of the ingress connection. The capabilities are incredibly rich, and we’ve made the extensibility model as ergonomic and flexible as possible. As an example, if information obtained from layer 4 is insufficient to make an informed firewall decision, the app can simply ask Oxy to decapsulate the traffic and process it with the HTTP firewall.
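A layer 4 decision of this kind can be sketched as a function from connection metadata to an outcome. The types, pool names, and routing rules below are hypothetical, and the country lookup is assumed to have happened elsewhere.

```rust
use std::net::SocketAddr;

// Hypothetical types; this is a sketch of the idea, not Oxy's interface.

/// Metadata known about a connection when it arrives.
pub struct IngressConnection {
    pub peer: SocketAddr,
    pub dest_port: u16,
    /// ISO 3166-1 alpha-2 country code, if the geo lookup succeeded.
    pub country: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum L4Decision {
    /// Drop the connection.
    Block,
    /// Tunnel the connection to the named upstream pool.
    RouteTo(&'static str),
    /// Layer 4 data is not enough: decapsulate and run the HTTP hooks.
    UpgradeToHttp,
}

pub fn decide(conn: &IngressConnection) -> L4Decision {
    match (conn.country.as_deref(), conn.dest_port) {
        // Unknown origin: refuse.
        (None, _) => L4Decision::Block,
        // Web ports: defer the decision to the HTTP layer.
        (Some(_), 80 | 443) => L4Decision::UpgradeToHttp,
        // Everything else is routed by geography.
        (Some("DE"), _) => L4Decision::RouteTo("eu-pool"),
        (Some(_), _) => L4Decision::RouteTo("default-pool"),
    }
}
```

The `UpgradeToHttp` variant stands in for the "ask Oxy to decapsulate" step: the layer 4 hook does not have to decide everything itself.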

The aforementioned scenarios are prevalent in many products we build at Cloudflare, so having a foundation that incorporates ready solutions is incredibly useful. This foundation has absorbed lots of experience we’ve gained over the years, taking care of many sharp and dark corners of high-load service programming. As a result, application implementers can stay focused on the business logic of their application with Oxy taking care of the rest. In fact, we’ve been able to create a few privacy proxy applications using Oxy that now serve massive amounts of traffic in production with less than a couple of hundred lines of code. This is something that would have taken multiple orders of magnitude more time and lines of code before.

As previously mentioned, we’ll dive deeper into the technical aspects in future blog posts. However, for now, we’d like to provide a brief overview of Oxy’s capabilities. This will give you a glimpse of the many ways in which Oxy can be customized and used.

On-ramps

On-ramp defines a combination of transport layer socket type and protocols that server listeners can use for ingress traffic.

Oxy supports a wide variety of traffic on-ramps:

HTTP 1/2/3 (including various CONNECT protocols for layer 3 and 4 traffic)
TCP and UDP traffic over Proxy Protocol
general purpose IP traffic, including ICMP

With Oxy, you have the ability to analyze and manipulate traffic at multiple layers of the OSI model – from layer 3 to layer 7. This allows for a wide range of possibilities in terms of how you handle incoming traffic.

One of the most notable and powerful features of Oxy is the ability for applications to force decapsulation. This means that an application can analyze traffic at a higher level, even if it originally arrived at a lower level. For example, if an application receives IP traffic, it can choose to analyze the UDP traffic encapsulated within the IP packets. With just a few lines of code, the application can tell Oxy to upgrade the IP flow to a UDP tunnel, effectively allowing the same code to be used for different on-ramps.
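At the packet level, upgrading an IP flow to UDP starts with locating the UDP datagram inside each IP packet. The following is a minimal sketch of that step for IPv4 only, using the standard header layouts. It ignores fragmentation, checksums, and IPv6, and the names `UdpView` and `udp_from_ipv4` are made up for this example.

```rust
/// The pieces of a UDP datagram carried inside an IPv4 packet.
#[derive(Debug, PartialEq)]
pub struct UdpView<'a> {
    pub src_port: u16,
    pub dst_port: u16,
    pub payload: &'a [u8],
}

/// Returns the UDP datagram inside an IPv4 packet, or `None` if the packet
/// is not IPv4/UDP or is truncated.
pub fn udp_from_ipv4(packet: &[u8]) -> Option<UdpView<'_>> {
    const IPPROTO_UDP: u8 = 17;
    let first = *packet.first()?;
    if first >> 4 != 4 {
        return None; // version nibble is not 4
    }
    // IHL counts 32-bit words; the minimum legal header is 5 words (20 bytes).
    let header_len = usize::from(first & 0x0f) * 4;
    if header_len < 20 || packet.len() < header_len + 8 {
        return None;
    }
    // Byte 9 of the IPv4 header is the protocol number.
    if packet[9] != IPPROTO_UDP {
        return None;
    }
    let udp = &packet[header_len..];
    let src_port = u16::from_be_bytes([udp[0], udp[1]]);
    let dst_port = u16::from_be_bytes([udp[2], udp[3]]);
    // The UDP length field covers the 8-byte UDP header plus the payload.
    let udp_len = usize::from(u16::from_be_bytes([udp[4], udp[5]]));
    if udp_len < 8 || udp_len > udp.len() {
        return None;
    }
    Some(UdpView { src_port, dst_port, payload: &udp[8..udp_len] })
}
```

Once the payload and ports are available, the same code that handles a native UDP on-ramp can process traffic that arrived as raw IP.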

The application can even go further and ask Oxy to sniff UDP packets and check if they contain HTTP/3 traffic. In this case, Oxy can upgrade the UDP traffic to HTTP and handle HTTP/3 requests that were originally received as raw IP packets. This allows for the simultaneous processing of traffic at all three layers (L3, L4, L7), enabling applications to analyze, filter, and manipulate the traffic flow from multiple perspectives. This provides a robust toolset for developing advanced traffic processing applications.
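HTTP/3 runs over QUIC, so one plausible first step when sniffing is to check whether a UDP payload looks like the start of a QUIC connection. The sketch below tests for a QUIC version 1 Initial packet using the long-header layout from RFC 9000. It is a heuristic only: it does not decrypt the packet or confirm that HTTP/3 was negotiated, and it is not a description of how Oxy performs the check.

```rust
/// Heuristic: does this UDP payload look like a QUIC version 1 Initial packet?
pub fn looks_like_quic_v1_initial(payload: &[u8]) -> bool {
    // Need the first byte plus the 4-byte version field.
    if payload.len() < 5 {
        return false;
    }
    let first = payload[0];
    // Bit 0x80: header form (1 = long header).
    let long_header = (first & 0x80) != 0;
    // Bit 0x40: fixed bit, which is set in QUIC v1 packets.
    let fixed_bit = (first & 0x40) != 0;
    // Bits 0x30: long packet type (0 = Initial in QUIC v1).
    let is_initial = (first & 0x30) == 0x00;
    let version = u32::from_be_bytes([payload[1], payload[2], payload[3], payload[4]]);
    long_header && fixed_bit && is_initial && version == 1
}
```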

Off-ramps

Off-ramp defines a combination of transport layer socket type and protocols that proxy server connectors can use for egress traffic.

Oxy offers versatility in its egress methods, supporting a range of protocols including HTTP 1 and 2, UDP, TCP, and IP. It is equipped with internal DNS resolution and caching, as well as customizable resolvers, with automatic fallback options for maximum system reliability. Oxy implements happy eyeballs for TCP, advanced tunnel timeout logic and has the ability to route traffic to internal services with accompanying metadata.
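Happy eyeballs (RFC 8305) has two parts: ordering the candidate addresses so that address families alternate, and then racing staggered connection attempts. The sketch below covers only the first part, under the assumption that IPv6 is preferred. The function name is made up for this example and nothing here is taken from Oxy's implementation.

```rust
use std::net::SocketAddr;

/// Orders candidate addresses so that address families alternate, starting
/// with IPv6. The relative order within each family is preserved.
pub fn interleave_families(addrs: &[SocketAddr]) -> Vec<SocketAddr> {
    let mut v6 = addrs.iter().copied().filter(|a| a.is_ipv6());
    let mut v4 = addrs.iter().copied().filter(|a| a.is_ipv4());
    let mut ordered = Vec::with_capacity(addrs.len());
    loop {
        let (a, b) = (v6.next(), v4.next());
        if a.is_none() && b.is_none() {
            break;
        }
        // `Option` is iterable, so `extend` appends zero or one address.
        ordered.extend(a);
        ordered.extend(b);
    }
    ordered
}
```

A connector would then try the addresses in this order, starting the next attempt after a short delay if the previous one has not completed, and keep whichever connection succeeds first.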

Additionally, through collaboration with one of our internal services (which is an Oxy application itself!) Oxy is able to offer geographical egress — allowing applications to route traffic to the public Internet from various locations in our extensive network covering numerous cities worldwide. This complex and powerful feature can be easily utilized by Oxy application developers at no extra cost, simply by adjusting configuration settings.

Tunneling and request handling

We’ve discussed Oxy’s communication capabilities with the outside world through on-ramps and off-ramps. In the middle, Oxy handles efficient stateful tunneling of various traffic types including TCP, UDP, QUIC, and IP, while giving applications full control over traffic blocking and redirection.

Additionally, Oxy effectively handles HTTP traffic, providing full control over requests and responses, and allowing it to serve as a direct HTTP or API service. With built-in tools for streaming analysis of HTTP bodies, Oxy makes it easy to extract and process data, such as form data from uploads and downloads.

In addition to its multi-layer traffic processing capabilities, Oxy also supports advanced HTTP tunneling methods, such as CONNECT-UDP and CONNECT-IP, using the latest extensions to HTTP 3 and 2 protocols. It can even process HTTP CONNECT request payloads on layer 4 and recursively process the payload as HTTP if the encapsulated traffic is HTTP.

TLS

The modern Internet is unimaginable without traffic encryption, and Oxy, of course, provides this essential aspect. Oxy’s cryptography and TLS are based on BoringSSL, providing both a FIPS-compliant version with a limited set of certified features and the latest version that supports all the currently available TLS features. Oxy also allows applications to switch between the two versions in real-time, on a per-request or per-connection basis.

Oxy’s TLS client is designed to make HTTPS requests to upstream servers, with the functionality and security of a browser-grade client. This includes the reconstruction of certificate chains, certificate revocation checks, and more. In addition, Oxy applications can be secured with TLS v1.3, and optionally mTLS, allowing for the extraction of client authentication information from x509 certificates.

Oxy has the ability to inspect and filter HTTPS traffic, including HTTP/3, and provides the means for dynamically generating certificates, serving as a foundation for implementing data loss prevention (DLP) products. Additionally, Oxy’s internal fork of BoringSSL, which is not FIPS-compliant, supports the use of raw public keys as an alternative to WebPKI, making it ideal for internal service communication. This allows for all the benefits of TLS without the hassle of managing root certificates.

Gluing everything together

Oxy is more than just a set of building blocks for network applications. It acts as a cohesive glue, handling the bootstrapping of the entire proxy application with ease, including parsing and applying configurations, setting up an asynchronous runtime, applying seccomp hardening and providing automated graceful restarts functionality.

With built-in support for panic reporting to Sentry, Prometheus metrics with a Rust-macro based API, Kibana logging, distributed tracing, memory and runtime profiling, Oxy offers comprehensive monitoring and analysis capabilities. It can also generate detailed audit logs for layer 4 traffic, useful for billing and network analysis.

To top it off, Oxy includes an integration testing framework, allowing for easy testing of application interactions using TypeScript-based tests.

Extensibility model

To take full advantage of Oxy’s capabilities, one must understand how to extend and configure its features. Oxy applications are configured using YAML configuration files, offering numerous options for each feature. Additionally, application developers can extend these options by leveraging the convenient macros provided by the framework, making customization a breeze.

Suppose the Oxy application uses a key-value database to retrieve user information. In that case, it would be beneficial to expose a YAML configuration settings section for this purpose. With Oxy, defining a structure and annotating it with the #[oxy_app_settings] attribute is all it takes to accomplish this:

/// Application’s key-value (KV) database settings
#[oxy_app_settings]
pub struct MyAppKVSettings {
    /// Key prefix.
    pub prefix: Option<String>,
    /// Path to the UNIX domain socket for the appropriate KV 
    /// server instance.
    pub socket: Option<String>,
}

Oxy can then generate a default YAML configuration file listing available options and their default values, including those extended by the application. The configuration options are automatically documented in the generated file from the Rust doc comments, following best Rust practices.

Moreover, Oxy supports multi-tenancy, allowing a single application instance to expose multiple on-ramp endpoints, each with a unique configuration. Sometimes, however, even a YAML configuration file is not enough to build the desired application. This is where Oxy’s comprehensive set of hooks comes in handy. These hooks can be used to extend the application with Rust code and cover almost all aspects of the traffic processing.

To give you an idea of how easy it is to write an Oxy application, here is an example of basic Oxy code:

struct MyApp;

// Defines types for various application extensions to Oxy's
// data types. Contexts provide information and control knobs for
// the different parts of the traffic flow and applications can extend
// all of them with their custom data. As was mentioned before,
// applications could also define their custom configuration.
// It’s just a matter of defining a configuration object with
// `#[oxy_app_settings]` attribute and providing the object type here.
impl OxyExt for MyApp {
    type AppSettings = MyAppKVSettings;
    type EndpointAppSettings = ();
    type EndpointContext = ();
    type IngressConnectionContext = MyAppIngressConnectionContext;
    type RequestContext = ();
    type IpTunnelContext = ();
    type DnsCacheItem = ();

}
   
#[async_trait]
impl OxyApp for MyApp {
    fn name() -> &'static str {
        "My app"
    }

    fn version() -> &'static str {
        env!("CARGO_PKG_VERSION")
    }

    fn description() -> &'static str {
        "This is an example of Oxy application"
    }

    async fn start(
        settings: ServerSettings<MyAppKVSettings, ()>
    ) -> anyhow::Result<Hooks<Self>> {
        // Here the application initializes various hooks, with each
        // hook being a trait implementation containing multiple
        // optional callbacks invoked during the lifecycle of the
        // traffic processing.
        let ingress_hook = create_ingress_hook(&settings);
        let egress_hook = create_egress_hook(&settings);
        let tunnel_hook = create_tunnel_hook(&settings);
        let http_request_hook = create_http_request_hook(&settings);
        let ip_flow_hook = create_ip_flow_hook(&settings);

        Ok(Hooks {
            ingress: Some(ingress_hook),
            egress: Some(egress_hook),
            tunnel: Some(tunnel_hook),
            http_request: Some(http_request_hook),
            ip_flow: Some(ip_flow_hook),
            ..Default::default()
        })
    }
}

// The entry point of the application
fn main() -> OxyResult<()> {
    oxy::bootstrap::<MyApp>()
}
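The example above describes each hook as a trait implementation with multiple optional callbacks. In Rust, one common way to get that shape is a trait whose methods have default no-op bodies. The sketch below illustrates the pattern with hypothetical names: the callbacks and the fields of `MyAppIngressConnectionContext` are assumptions, since the original example does not define them.

```rust
// Hypothetical names throughout; this is not Oxy's real hook API.

/// Per-connection data an application might attach to ingress connections.
#[derive(Debug, Default)]
pub struct MyAppIngressConnectionContext {
    pub bytes_seen: u64,
}

/// A hook trait whose callbacks all have default no-op bodies, so an
/// implementation only overrides the events it cares about.
pub trait IngressHook {
    fn on_connect(&self, _ctx: &mut MyAppIngressConnectionContext) {}
    fn on_data(&self, _ctx: &mut MyAppIngressConnectionContext, _chunk: &[u8]) {}
    fn on_close(&self, _ctx: &mut MyAppIngressConnectionContext) {}
}

/// Overrides only `on_data`; the other callbacks keep their defaults.
pub struct ByteCounter;

impl IngressHook for ByteCounter {
    fn on_data(&self, ctx: &mut MyAppIngressConnectionContext, chunk: &[u8]) {
        ctx.bytes_seen += chunk.len() as u64;
    }
}
```

With this pattern, adding a new callback to the trait does not break existing hook implementations, which suits a framework that grows iteratively.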

Technology choice

Oxy leverages the safety and performance benefits of Rust as its implementation language. At Cloudflare, Rust has emerged as a popular choice for new product development, and there are ongoing efforts to migrate some of the existing products to the language as well.

Rust offers memory and concurrency safety through its ownership and borrowing system, preventing issues like null pointers and data races. This safety is achieved without sacrificing performance, as Rust provides low-level control and the ability to write code with minimal runtime overhead. Rust’s balance of safety and performance has made it popular for building safe performance-critical applications, like proxies.

We intentionally tried to stand on the shoulders of giants with this project and avoid reinventing the wheel. Oxy heavily relies on open-source dependencies, with hyper and tokio being the backbone of the framework. Our philosophy is that we should pull from existing solutions as much as we can, allowing for faster iteration, but also use widely battle-tested code. If something doesn’t work for us, we try to collaborate with maintainers and contribute back our fixes and improvements. In fact, we now have two team members who are core team members of the tokio and hyper projects.

Even though Oxy is a proprietary project, we try to give back some love to the open-source community without which the project wouldn’t be possible by open-sourcing some of the building blocks such as https://github.com/cloudflare/boring and https://github.com/cloudflare/quiche.

The road to implementation

At the beginning of our journey, we set out to implement a proof-of-concept for an HTTP firewall using Rust for what would eventually become the Zero Trust Gateway product. This project was originally part of the WARP service repository. However, as the PoC rapidly advanced, it became clear that it needed to be separated into its own Gateway proxy for both technical and operational reasons.

Later on, when tasked with implementing a relay proxy for iCloud Private Relay, we saw the opportunity to reuse much of the code from the Gateway proxy. The Gateway project could also benefit from the HTTP/3 support that was being added for the Private Relay project. In fact, early iterations of the relay service were forks of the Gateway server.

It was then that we realized we could extract common elements from both projects to create a new framework, Oxy. The history of Oxy can be traced back to its origins in the commit history of the Gateway and Private Relay projects, up until its separation as a standalone framework.

Since our inception, we have leveraged the power of Oxy to efficiently roll out multiple projects that would have required a significant amount of time and effort without it. Our iterative development approach has been a strength of the project, as we have been able to identify common, reusable components through hands-on testing and implementation.

Our small core team is supplemented by internal contributors from across the company, ensuring that the best subject-matter experts are working on the relevant parts of the project. This contribution model also allows us to shape the framework’s API to meet the functional and ergonomic needs of its users, while the core team ensures that the project stays on track.

Relation to Pingora

Although Pingora, another proxy server developed by us in Rust, shares some similarities with Oxy, it was intentionally designed as a separate proxy server with a different objective. Pingora was created to serve traffic from millions of our clients’ upstream servers, including those with ancient and unusual configurations. Non-UTF-8 URLs and TLS settings that are not supported by most TLS libraries are just a few such quirks among many others. This focus on handling technically challenging unusual configurations sets Pingora apart from other proxy servers.

The concept of Pingora came about during the same period when we were beginning to develop Oxy, and we initially considered merging the two projects. However, we quickly realized that their objectives were too different to do that. Pingora is specifically designed to establish Cloudflare’s HTTP connectivity with the Internet, even in its most technically obscure corners. On the other hand, Oxy is a multipurpose platform that supports a wide variety of communication protocols and aims to provide a simple way to develop high-performance proxy applications with business logic.

Conclusion

Oxy is a proxy framework that we have developed to meet the demanding needs of modern services. It has been designed to provide a flexible and scalable solution that can be adapted to meet the unique requirements of each project, and by leveraging the power of Rust, we have made it both safe and fast.

Looking forward, Oxy is poised to play a critical role in our company’s larger effort to modernize and improve our architecture. It provides a solid building block in the foundation on which we can keep building a better Internet.

As the framework continues to evolve and grow, we remain committed to our iterative approach to development, constantly seeking out new opportunities to reuse existing solutions and improve our codebase. This collaborative, community-driven approach has already yielded impressive results, and we are confident that it will continue to drive the future success of Oxy.

Stay tuned for more tech savvy blog posts on the subject!

How we built Pingora, the proxy that connects Cloudflare to the Internet

2022-09-14 Yuchen Wu

Post Syndicated from Yuchen Wu original https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/

Introduction

Today we are excited to talk about Pingora, a new HTTP proxy we’ve built in-house using Rust that serves over 1 trillion requests a day, boosts our performance, and enables many new features for Cloudflare customers, all while requiring only a third of the CPU and memory resources of our previous proxy infrastructure.

As Cloudflare has scaled we’ve outgrown NGINX. It was great for many years, but over time its limitations at our scale meant building something new made sense. We could no longer get the performance we needed nor did NGINX have the features we needed for our very complex environment.

Many Cloudflare customers and users use the Cloudflare global network as a proxy between HTTP clients (such as web browsers, apps, IoT devices and more) and servers. In the past, we’ve talked a lot about how browsers and other user agents connect to our network, and we’ve developed a lot of technology and implemented new protocols (see QUIC and optimizations for http2) to make this leg of the connection more efficient.

Today, we’re focusing on a different part of the equation: the service that proxies traffic between our network and servers on the Internet. This proxy service powers our CDN, Workers fetch, Tunnel, Stream, R2 and many, many other features and products.

Let’s dig in on why we chose to replace our legacy service and how we developed Pingora, our new system designed specifically for Cloudflare’s customer use cases and scale.

Why build yet another proxy

Over the years, our usage of NGINX has run up against limitations. For some limitations, we optimized or worked around them. But others were much harder to overcome.

Architecture limitations hurt performance

The NGINX worker (process) architecture has operational drawbacks for our use cases that hurt our performance and efficiency.

First, in NGINX each request can only be served by a single worker. This results in unbalanced load across all CPU cores, which leads to slowness.

Because of this request-process pinning effect, requests that do CPU-heavy or blocking IO tasks can slow down other requests. As earlier blog posts attest, we’ve spent a lot of time working around these problems.

The most critical problem for our use cases is poor connection reuse. Our machines establish TCP connections to origin servers to proxy HTTP requests. Connection reuse speeds up TTFB (time-to-first-byte) of requests by reusing previously established connections from a connection pool, skipping TCP and TLS handshakes required on a new connection.

However, the NGINX connection pool is per worker. When a request lands on a certain worker, it can only reuse the connections within that worker. When we add more NGINX workers to scale up, our connection reuse ratio gets worse because the connections are scattered across more isolated pools of all the processes. This results in slower TTFB and more connections to maintain, which consumes resources (and money) for both us and our customers.
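A toy model makes the effect visible. The sketch below counts how many requests find an already-open connection when requests are spread round-robin across a given number of isolated pools. It is a deliberately simplified simulation (connections never close and are never busy), and the function name is an assumption for this example.

```rust
use std::collections::HashSet;

/// Counts how many requests find an already-open connection to their origin.
/// Request `i` lands on pool `i % pools`, and each pool only knows about the
/// connections it opened itself. Connections are never closed in this model.
pub fn reused_requests(origins: &[u32], pools: usize) -> usize {
    assert!(pools > 0, "need at least one pool");
    let mut open: Vec<HashSet<u32>> = vec![HashSet::new(); pools];
    let mut reused = 0;
    for (i, origin) in origins.iter().enumerate() {
        // `insert` returns false when this pool already holds the origin.
        if !open[i % pools].insert(*origin) {
            reused += 1;
        }
    }
    reused
}
```

For 40 requests cycling over 5 origins, a single shared pool (`pools = 1`) reuses a connection for 35 of them. With 8 isolated pools, none of the 40 requests can reuse a connection, because each pool sees each origin exactly once.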

As mentioned in past blog posts, we have workarounds for some of these issues. But if we can address the fundamental issue, the worker/process model, we will resolve all these problems naturally.

Some types of functionality are difficult to add

NGINX is a very good web server, load balancer or simple gateway. But Cloudflare does way more than that. We used to build all the functionality we needed around NGINX, which is not easy to do while trying not to diverge too much from the NGINX upstream codebase.

For example, when retrying/failing over a request, sometimes we want to send a request to a different origin server with a different set of request headers. But that is not something NGINX allows us to do. In cases like this, we spend time and effort on working around the NGINX constraints.
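In a programmable proxy, this kind of failover can be expressed as a list of attempts, each carrying its own origin and headers. The sketch below is hypothetical: `Attempt`, `send_with_failover`, and the `send` callback (which stands in for the real network call) are names invented for this example.

```rust
/// One upstream attempt: which origin to contact and which headers to send.
pub struct Attempt {
    pub origin: &'static str,
    pub headers: Vec<(&'static str, &'static str)>,
}

/// Tries each attempt in order and returns the index of the first one that
/// succeeds, or `None` if all of them fail. `send` stands in for the real
/// network call and returns `true` on success.
pub fn send_with_failover<F>(attempts: &[Attempt], mut send: F) -> Option<usize>
where
    F: FnMut(&Attempt) -> bool,
{
    // `position` stops at the first attempt for which `send` returns true.
    attempts.iter().position(|attempt| send(attempt))
}
```

Because each `Attempt` owns its header set, the retry to a backup origin can carry different headers from the original request.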

Meanwhile, the programming languages we had to work with didn’t help alleviate the difficulties. NGINX is written purely in C, which is not memory safe by design. It is very error-prone to work with such a 3rd party code base. It is quite easy to get into memory safety issues, even for experienced engineers, and we wanted to avoid these as much as possible.

The other language we used to complement C is Lua. It is less risky but also less performant. In addition, we often found ourselves missing static typing when working with complicated Lua code and business logic.

And the NGINX community is not very active, and development tends to be “behind closed doors”.

Choosing to build our own

Over the past few years, as we’ve continued to grow our customer base and feature set, we continually evaluated three choices:

Continue to invest in NGINX and possibly fork it to tailor it 100% to our needs. We had the expertise needed, but given the architecture limitations mentioned above, significant effort would be required to rebuild it in a way that fully supported our needs.
Migrate to another 3rd party proxy codebase. There are definitely good projects, like envoy and others. But this path means the same cycle may repeat in a few years.
Start with a clean slate, building an in-house platform and framework. This choice requires the most upfront investment in terms of engineering effort.

We evaluated each of these options every quarter for the past few years. There is no obvious formula to tell which choice is the best. For several years, we continued with the path of least resistance, continuing to augment NGINX. However, at some point, the return on investment of building our own proxy seemed worth it. We made the call to build a proxy from scratch, and began designing the proxy application of our dreams.

The Pingora Project

Design decisions

To make a proxy that serves millions of requests per second fast, efficient and secure, we have to make a few important design decisions first.

We chose Rust as the language of the project because it can do what C can do in a memory safe way without compromising performance.

Although there are some great off-the-shelf 3rd party HTTP libraries, such as hyper, we chose to build our own because we want to maximize the flexibility in how we handle HTTP traffic and to make sure we can innovate at our own pace.

At Cloudflare, we handle traffic across the entire Internet. We have many cases of bizarre and non-RFC compliant HTTP traffic that we have to support. For example, hyper did not support HTTP status codes greater than 599 until late 2020, three years after people initially raised the issue and repeatedly argued that it was necessary. And we need more than being correct. We need a robust, permissive, customizable HTTP library that can survive the wilds of the Internet. The best way to guarantee that is to implement our own.

The next design decision was around our workload scheduling system. We chose multithreading over multiprocessing in order to share resources, especially connection pools, easily. We also decided that work stealing was required to avoid some classes of performance problems mentioned above. The Tokio async runtime turned out to be a great fit for our needs.

Finally, we wanted our project to be intuitive and developer friendly. What we build is not the final product, and should be extensible as a platform as more features are built on top of it. We decided to implement a “life of a request” event based programmable interface similar to NGINX/OpenResty. For example, the “request filter” phase allows developers to run code to modify or reject the request when a request header is received. With this design, we can separate our business logic and generic proxy logic cleanly. Developers who previously worked on NGINX can easily switch to Pingora and quickly become productive.
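The separation between business logic and generic proxy logic can be sketched as a trait of phases plus a driver that calls them in order. The version below is synchronous and heavily simplified, and its names (`Request`, `Filter`, `Phases`, `run_phases`, `AdminGuard`) are illustrative assumptions and not Pingora's actual interface.

```rust
/// A simplified request header as seen by the phases.
pub struct Request {
    pub path: String,
    pub headers: Vec<(String, String)>,
}

pub enum Filter {
    /// Continue to the next phase with the (possibly modified) request.
    Continue,
    /// Stop processing and answer the client with this status code.
    Reject(u16),
}

/// Business logic: one method per phase in the life of a request.
pub trait Phases {
    /// Runs when the request header has been received.
    fn request_filter(&self, req: &mut Request) -> Filter;
    /// Runs just before the request is sent to the origin.
    fn upstream_request_filter(&self, req: &mut Request);
}

/// Generic proxy logic: drives the phases in a fixed order.
pub fn run_phases(logic: &impl Phases, mut req: Request) -> Result<Request, u16> {
    if let Filter::Reject(status) = logic.request_filter(&mut req) {
        return Err(status);
    }
    logic.upstream_request_filter(&mut req);
    Ok(req)
}

/// Example business logic: reject admin paths, tag everything else.
pub struct AdminGuard;

impl Phases for AdminGuard {
    fn request_filter(&self, req: &mut Request) -> Filter {
        if req.path.starts_with("/admin") {
            Filter::Reject(403)
        } else {
            Filter::Continue
        }
    }

    fn upstream_request_filter(&self, req: &mut Request) {
        req.headers.push(("x-proxied-by".to_string(), "sketch".to_string()));
    }
}
```

The driver knows nothing about admin paths or custom headers, and the business logic knows nothing about sockets, which is the separation the event-based interface is meant to provide.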

Pingora is faster in production

Let’s fast-forward to the present. Pingora handles almost every HTTP request that needs to interact with an origin server (for a cache miss, for example), and we’ve collected a lot of performance data in the process.

First, let’s see how Pingora speeds up our customers’ traffic. Overall traffic on Pingora shows a 5ms reduction in median TTFB and an 80ms reduction at the 95th percentile. This is not because we run code faster. Even our old service could handle requests in the sub-millisecond range.

The savings come from our new architecture which can share connections across all threads. This means a better connection reuse ratio, which spends less time on TCP and TLS handshakes.

Across all customers, Pingora makes only a third as many new connections per second compared to the old service. For one major customer, it increased the connection reuse ratio from 87.1% to 99.92%, which reduced new connections to their origins by 160x. To present the number more intuitively, by switching to Pingora, we save our customers and users 434 years of handshake time every day.
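The 160x figure follows from the two reuse ratios: at 87.1% reuse, 12.9% of requests need a new connection, while at 99.92% reuse only 0.08% do. The short sketch below reproduces that arithmetic. The function names are ours, chosen for this example.

```rust
/// Fraction of requests that need a brand-new connection for a given reuse ratio.
pub fn new_connection_fraction(reuse_ratio: f64) -> f64 {
    1.0 - reuse_ratio
}

/// How many times fewer new connections are made after the reuse ratio improves.
pub fn reduction_factor(old_reuse: f64, new_reuse: f64) -> f64 {
    new_connection_fraction(old_reuse) / new_connection_fraction(new_reuse)
}
```

Calling `reduction_factor(0.871, 0.9992)` gives roughly 161, consistent with the rounded 160x quoted above.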

More features

Having a developer-friendly interface that engineers are familiar with, while eliminating the previous constraints, allows us to develop more features, more quickly. Core functionality like new protocols acts as a building block for more products we can offer to customers.

As an example, we were able to add HTTP/2 upstream support to Pingora without major hurdles. This allowed us to offer gRPC to our customers shortly afterwards. Adding this same functionality to NGINX would have required significantly more engineering effort and might not have materialized.

More recently we’ve announced Cache Reserve where Pingora uses R2 storage as a caching layer. As we add more functionality to Pingora, we’re able to offer new products that weren’t feasible before.

More efficient

In production, Pingora consumes about 70% less CPU and 67% less memory compared to our old service with the same traffic load. The savings come from a few factors.

Our Rust code runs more efficiently compared to our old Lua code. On top of that, there are also efficiency differences from their architectures. For example, in NGINX/OpenResty, when the Lua code wants to access an HTTP header, it has to read it from the NGINX C struct, allocate a Lua string and then copy it to the Lua string. Afterwards, Lua has to garbage-collect its new string as well. In Pingora, it would just be a direct string access.

The multithreading model also makes sharing data across requests more efficient. NGINX also has shared memory but due to implementation limitations, every shared memory access has to use a mutex lock and only strings and numbers can be put into shared memory. In Pingora, most shared items can be accessed directly via shared references behind atomic reference counters.
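In Rust, "shared references behind atomic reference counters" usually means `std::sync::Arc`. The sketch below shows several threads reading the same structured data through an `Arc` without copying it and without taking a mutex for reads. `RouteTable` and `lookup_from_threads` are hypothetical names for this example and do not come from Pingora.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Read-only routing data shared by all worker threads.
pub struct RouteTable {
    pub routes: HashMap<String, String>,
}

/// Spawns `workers` threads that all read the same table through an `Arc`.
/// Cloning the `Arc` only increments an atomic counter; the table itself is
/// not copied, and no mutex is taken for reads.
pub fn lookup_from_threads(
    table: Arc<RouteTable>,
    host: &str,
    workers: usize,
) -> Vec<Option<String>> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let table = Arc::clone(&table);
            let host = host.to_string();
            thread::spawn(move || table.routes.get(&host).cloned())
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

Unlike the shared-memory case described above, the shared value here is an ordinary Rust struct containing a `HashMap`, not just strings and numbers.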

Another significant portion of CPU saving, as mentioned above, is from making fewer new connections. TLS handshakes are expensive compared to just sending and receiving data via established connections.

Safer

Shipping features quickly and safely is difficult, especially at our scale. It’s hard to predict every edge case that can occur in a distributed environment processing millions of requests a second. Fuzzing and static analysis can only mitigate so much. Rust’s memory-safe semantics guard us from undefined behavior and give us confidence our service will run correctly.

With those assurances we can focus more on how a change to our service will interact with other services or a customer’s origin. We can develop features at a higher cadence and not be burdened by memory safety and hard-to-diagnose crashes.

When crashes do occur, an engineer needs to spend time diagnosing how it happened and what caused it. Since Pingora’s inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.

In fact, Pingora crashes are so rare we usually find unrelated issues when we do encounter one. Recently we discovered a kernel bug soon after our service started crashing. We’ve also discovered hardware issues on a few machines. In the past, ruling out rare memory bugs caused by our software was nearly impossible, even after significant debugging.

Conclusion

To summarize, we have built an in-house proxy that is faster, more efficient and versatile as the platform for our current and future products.

We will be back with more technical details regarding the problems we faced, the optimizations we applied and the lessons we learned from building Pingora and rolling it out to power a significant portion of the Internet. We will also be back with our plan to open source it.

Pingora is our latest attempt at rewriting our system, but it won’t be our last. It is also only one of the building blocks in the re-architecting of our systems.

Interested in joining us to help build a better Internet? Our engineering teams are hiring.

Building Cloudflare Images in Rust and Cloudflare Workers

2021-09-15 Yevgen Safronov

Post Syndicated from Yevgen Safronov original https://blog.cloudflare.com/building-cloudflare-images-in-rust-and-cloudflare-workers/

This post explains how we implemented the Cloudflare Images product with reusable Rust libraries and Cloudflare Workers. It covers the technical design of Cloudflare Image Resizing and Cloudflare Images. Using Rust and Cloudflare Workers helps us quickly iterate and deliver product improvements over the coming weeks and months.

Reuse of code in Rusty image projects

We developed Image Resizing in Rust. It’s a web server that receives HTTP requests for images along with resizing options, fetches the full-size images from the origin, applies resizing and other image processing operations, compresses, and returns the HTTP response with the optimized image.

Rust makes it easy to split projects into libraries (called crates). The image processing and compression parts of Image Resizing are usable as libraries.

We also have a product called Polish, which is a Golang-based service that recompresses images in our cache. Polish was initially designed to run command-line programs like jpegtran and pngcrush. We took the core of Image Resizing and wrapped it in a command-line executable. This way, when Polish needs to apply lossy compression or generate WebP images or animations, it can use Image Resizing via a command-line tool instead of a third-party tool.

Reusing libraries has allowed us to easily unify processing between Image Resizing and Polish (for example, to ensure that both handle metadata and color profiles in the same way).

Cloudflare Images is another product we’ve built in Rust. It added support for a custom storage back-end, variants (size presets), support for signing URLs and more. We made it as a collection of Rust crates, so we can reuse pieces of it in other services running anywhere in our network. Image Resizing provides image processing for Cloudflare Images and shares libraries with Images to understand the new URL scheme, access the storage back-end, and query the database for variants.

How Image Resizing works

The Image Resizing service runs at the edge and is deployed on every server of the Cloudflare global network. Thanks to Cloudflare’s global Anycast network, the closest Cloudflare data center will handle eyeball image resizing requests. Image Resizing is tightly integrated with the Cloudflare cache and handles eyeball requests only on a cache miss.

There are two ways to use Image Resizing. The default URL scheme provides an easy, declarative way of specifying image dimensions and other options. The other way is to use a JavaScript API in a Worker. Cloudflare Workers give powerful programmatic control over every image resizing request.

How Cloudflare Images work

Cloudflare Images consists of the following components:

The Images core service that powers the public API to manage image assets.
The Image Resizing service responsible for image transformations and caching.
The Image delivery Cloudflare Worker responsible for serving images and passing corresponding parameters through to the Image Resizing service.
Image storage that provides access and storage for original image assets.

To support Cloudflare Images scenarios for image transformations, we made several changes to the Image Resizing service:

Added access to Cloudflare storage with original image assets.
Added access to variant definitions (size presets).
Added support for signing URLs.

Image delivery

The primary use case for Cloudflare Images is to provide a simple and easy-to-use way of managing image assets. To cover egress costs, we provide image delivery through the Cloudflare managed imagedelivery.net domain. It is configured with Tiered Caching to maximize the cache hit ratio for image assets. imagedelivery.net provides image hosting without a need to configure a custom domain to proxy through Cloudflare.

A Cloudflare Worker powers image delivery. It parses image URLs and passes the corresponding parameters to the image resizing service.

How we store Cloudflare Images

There are several places we store information on Cloudflare Images:

image metadata in Cloudflare’s core data centers
variant definitions in Cloudflare’s edge data centers
original images in core data centers
optimized images in Cloudflare cache, physically close to eyeballs.

Image variant definitions are stored and delivered to the edge using Cloudflare’s distributed key-value store called Quicksilver. We use a single source of truth for variants. The Images core service makes calls to Quicksilver to read and update variant definitions.

The rest of the information about the image is stored in the image URL itself:
https://imagedelivery.net/<encoded account id>/<image id>/<variant name>
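As a rough illustration of how a delivery URL path breaks down into those three parts, here is a small parser. The `ImageUrl` type and `parse_image_path` function are hypothetical, and the sketch assumes the image id contains no `/`, which may not hold for every id.

```rust
/// The three path components of an image delivery URL.
#[derive(Debug, PartialEq)]
struct ImageUrl<'a> {
    account: &'a str,
    image_id: &'a str,
    variant: &'a str,
}

/// Splits `/<encoded account id>/<image id>/<variant name>` into its parts.
/// Returns `None` unless there are exactly three non-empty segments.
/// Simplification: assumes the image id itself contains no `/`.
fn parse_image_path(path: &str) -> Option<ImageUrl<'_>> {
    let mut parts = path.strip_prefix('/')?.split('/');
    let account = parts.next()?;
    let image_id = parts.next()?;
    let variant = parts.next()?;
    let any_empty = [account, image_id, variant].iter().any(|s| s.is_empty());
    if parts.next().is_some() || any_empty {
        return None;
    }
    Some(ImageUrl { account, image_id, variant })
}
```

The parts are borrowed slices of the original path, so parsing does not allocate.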

<image id> contains a flag indicating whether the image is publicly available or requires access verification. It’s not feasible to store any image metadata in Quicksilver as the data volume would increase linearly with the number of images we host. Instead, we only allow a finite number of variants per account, so we responsibly utilize available disk space on the edge. The downside of storing image metadata as part of <image id> is that <image id> will change on access change.

How we keep Cloudflare Images up to date

The only way to access images is through the use of variants. Each variant is a named image resizing configuration. Once the image asset is fetched, we cache the transformed image in the Cloudflare cache. The critical question is how we keep processed images up to date. The answer is by purging the Cloudflare cache when necessary. There are two use cases:

access to the image is changed
the variant definition is updated

In the first instance, we purge the cache by calling a URL:
https://imagedelivery.net/<encoded account id>/<image id>

When the customer updates the variant, we issue a cache purge request by tag:
account-id/variant-name

To support cache purge by tag, the image resizing service adds the necessary tags for all transformed images.
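The two purge cases can be summarized in a few lines of Rust. This is only a sketch: the `Change` and `Purge` types and the `purge_for` function are names we invented here, and the real purge calls go through Cloudflare's cache APIs rather than returning a value.

```rust
/// What changed about an image or its configuration.
enum Change<'a> {
    AccessChanged { account: &'a str, image_id: &'a str },
    VariantUpdated { account: &'a str, variant: &'a str },
}

/// The purge to issue in response. Names are illustrative only.
#[derive(Debug, PartialEq)]
enum Purge {
    ByUrl(String),
    ByTag(String),
}

/// Maps a change to the kind of cache purge described above.
fn purge_for(change: &Change<'_>) -> Purge {
    match change {
        // Access change: purge everything under the image's URL.
        Change::AccessChanged { account, image_id } => {
            Purge::ByUrl(format!("https://imagedelivery.net/{}/{}", account, image_id))
        }
        // Variant change: purge by the `account-id/variant-name` tag.
        Change::VariantUpdated { account, variant } => {
            Purge::ByTag(format!("{}/{}", account, variant))
        }
    }
}
```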

How we restrict access to Cloudflare Images

The Image Resizing service supports restricted access to images by using URL signatures with expiration. URLs are signed using HMAC-SHA-256 with a secret key. The steps to produce valid signatures are:

Take the path and query string (the path starts with /).
Compute the path’s SHA-256 HMAC with the query string, using the Images’ URL signing key as the secret. The key is configured in the Dashboard.
If the URL is meant to expire, compute the Unix timestamp (number of seconds since 1970) of the expiration time, and append ?exp= and the timestamp as an integer to the URL.
Append ? or & to the URL as appropriate (? if it had no query string; & if it had a query string).
Append sig= and the HMAC as hex-encoded 64 characters.
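The URL assembly part of these steps can be sketched as follows. Computing the HMAC is deliberately left out: `sig_hex` is assumed to come from an HMAC-SHA-256 implementation of your choice, and both function names are ours rather than part of any Cloudflare library.

```rust
/// Appends `key=value` to a URL, using `?` if the URL has no query string
/// yet and `&` otherwise.
fn append_param(url: &str, key: &str, value: &str) -> String {
    let sep = if url.contains('?') { '&' } else { '?' };
    format!("{}{}{}={}", url, sep, key, value)
}

/// Attaches an optional expiry (Unix seconds) and an already-computed,
/// hex-encoded HMAC to a URL. The HMAC computation itself is out of scope.
fn attach_signature(url: &str, expiry: Option<u64>, sig_hex: &str) -> String {
    let with_exp = match expiry {
        Some(ts) => append_param(url, "exp", &ts.to_string()),
        None => url.to_string(),
    };
    append_param(&with_exp, "sig", sig_hex)
}
```

Choosing the separator in one helper keeps the `?` versus `&` rule in a single place, whether or not the URL already carries a query string.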

A signed URL looks like this:

A signed URL with an expiration timestamp looks like this:

The signature of the /hello/world URL with the secret ‘this is a secret’ is 6293f9144b4e9adc83416d1b059abcac750bf05b2c5c99ea72fd47cc9c2ace34.

https://imagedelivery.net/hello/world?sig=6293f9144b4e9adc83416d1b059abcac750bf05b2c5c99ea72fd47cc9c2ace34

Direct creator uploads with Cloudflare Worker and KV

Similar to Cloudflare Stream, Images supports direct creator uploads. This allows users to upload images without API tokens. Direct creator uploads are commonly used by web apps, client-side applications, or mobile apps where users upload content directly to Cloudflare Images.

Once again, we used our serverless platform to support direct creator uploads. The successful API call stores the account’s information in Workers KV with the specified expiration date. A simple Cloudflare Worker handles the upload URL, which reads the KV value and grants upload access only on a successful call to KV.
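The flow can be modeled with a small in-memory store. This is a toy stand-in, not the Workers KV API: the `UploadTokens` type, its methods, and the use of plain Unix seconds for expiry are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// An in-memory stand-in for Workers KV with per-key expiry.
/// This is a toy model, not the Workers KV API.
struct UploadTokens {
    // token -> (account id, expiry as Unix seconds)
    entries: HashMap<String, (String, u64)>,
}

impl UploadTokens {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Called when the API hands out an upload URL.
    fn issue(&mut self, token: &str, account: &str, expires_at: u64) {
        self.entries
            .insert(token.to_string(), (account.to_string(), expires_at));
    }

    /// Called by the upload handler: returns the account to upload into,
    /// or `None` if the token is unknown or has expired.
    fn authorize(&self, token: &str, now: u64) -> Option<&str> {
        match self.entries.get(token) {
            Some((account, expires_at)) if now < *expires_at => Some(account.as_str()),
            _ => None,
        }
    }
}
```

In the real system the expiry is enforced by the key-value store itself, so an expired entry simply stops being readable.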

Future Work

The Cloudflare Images product has an exciting roadmap. Let’s review what’s possible with the current architecture of Cloudflare Images.

Resizing hints on upload

At the moment, no image transformations happen on upload. That means we can serve the image globally once it is uploaded to Image storage. We are considering adding resizing hints on image upload. That won’t necessarily schedule image processing in all cases but could provide a valuable signal to resize the most critical image variants. An example could be to generate an AVIF variant for the most vital image assets.

Serving images from custom domains

We think serving images from a domain we manage (with Tiered Caching) is a great default option for many customers. The downside is that loading Cloudflare Images requires additional TLS negotiations on the client-side, adding latency and impacting loading performance. On the other hand, serving Cloudflare Images from custom domains will be a viable option for customers who set up a website through Cloudflare. The good news is that the current architecture can support such functionality without radical changes.

Conclusion

The Cloudflare Images product runs on top of the Cloudflare global network. We built Cloudflare Images in Rust and Cloudflare Workers. This way, we use reusable Rust libraries in several products such as Cloudflare Images, Image Resizing, and Polish. Cloudflare’s serverless platform is an indispensable tool to build Cloudflare products internally. If you are interested in building innovative products in Rust and Cloudflare Workers, we’re hiring.

Native Rust support on Cloudflare Workers

2021-09-09 Steve Manuel

Post Syndicated from Steve Manuel original https://blog.cloudflare.com/workers-rust-sdk/

You can now write Cloudflare Workers in 100% Rust, no JavaScript required. Try it out: https://github.com/cloudflare/workers-rs

Cloudflare Workers has long supported the building blocks to run many languages using WebAssembly. However, there has always been a challenging “trampoline” step required to allow languages like Rust to talk to JavaScript APIs such as fetch().

In addition to the sizable amount of boilerplate needed, lots of “off the shelf” bindings between languages don’t include support for Cloudflare APIs such as KV and Durable Objects. What we wanted was a way to write a Worker in idiomatic Rust, quickly, and without needing knowledge of the host JavaScript environment. While we had a nice “starter” template that made it easy enough to pull in some Rust libraries and use them from JavaScript, the barrier was still too high if your goal was to write a full program in Rust and ship it to our edge.

Not anymore!

Introducing the worker crate, available on GitHub and crates.io, which makes Rust developers feel right at home on the Workers platform by running code inside the V8 WebAssembly engine. In the snippet below, you can see how the worker crate does all the heavy lifting by providing Rustacean-friendly Workers APIs.

use worker::*;

#[event(fetch)]
pub async fn main(req: Request, env: Env) -> Result<Response> {
    console_log!(
        "{} {}, located at: {:?}, within: {}",
        req.method().to_string(),
        req.path(),
        req.cf().coordinates().unwrap_or_default(),
        req.cf().region().unwrap_or("unknown region".into())
    );

    if !matches!(req.method(), Method::Post) {
        return Response::error("Method Not Allowed", 405);
    }

    if let Some(file) = req.form_data().await?.get("file") {
        return match file {
            FormEntry::File(buf) => {
                Response::ok(&format!("size = {}", buf.bytes().await?.len()))
            }
            _ => Response::error("`file` part of POST form must be a file", 400),
        };
    }

    Response::error("Bad Request", 400)
}

Get your own Worker in Rust started with a single command:

# see installation instructions for our `wrangler` CLI at https://github.com/cloudflare/wrangler
# (requires v1.19.2 or higher)
$ wrangler generate --type=rust my-project

We’ve stripped away all the glue code, provided an ergonomic HTTP framework, and baked in what you need to build small scripts or full-fledged Workers apps in Rust. You’ll find fetch, a router, easy-to-use HTTP functionality, Workers KV stores and Durable Objects, secrets, and environment variables too. It’s all open source, and we’d love your feedback!

Why are we doing this?

Cloudflare Workers is on a mission to simplify the developer experience. When we took a hard look at the previous experience writing non-JavaScript Workers, we knew we could do better. Rust happens to be a great language for us to kick-start our mission: it has first-class support for WebAssembly, and a wonderful, growing ecosystem. Tools like wasm-bindgen, libraries like web-sys, and Rust’s powerful macro system gave us a significant starting-off point. Plus, Rust’s popularity is growing rapidly, and if our own use of Rust at Cloudflare is any indication, there is no question that Rust is staking its claim as a must-have in the developer toolbox.

So give it a try, leave some feedback, even open a PR! By the way, we’re always on the lookout for great people to join us, and we are hiring for many open roles (including Rust engineers!) — take a look.

Pin, Unpin, and why Rust needs them

2021-08-26 Adam Chalmers

Post Syndicated from Adam Chalmers original https://blog.cloudflare.com/pin-and-unpin-in-rust/

Using async Rust libraries is usually easy. It’s just like using normal Rust code, with a little async or .await here and there. But writing your own async libraries can be hard. The first time I tried this, I got really confused by arcane, esoteric syntax like T: ?Unpin and Pin<&mut Self>. I had never seen these types before, and I didn’t understand what they were doing. Now that I understand them, I’ve written the explainer I wish I could have read back then. In this post, we’re gonna learn

What Futures are
What self-referential types are
Why they were unsafe
How Pin/Unpin made them safe
Using Pin/Unpin to write tricky nested futures

What are Futures?

A few years ago, I needed to write some code which would take some async function, run it and collect some metrics about it, e.g. how long it took to resolve. I wanted to write a type TimedWrapper that would work like this:

// Some async function, e.g. polling a URL with [https://docs.rs/reqwest]
// Remember, Rust functions do nothing until you .await them, so this isn't
// actually making a HTTP request yet.
let async_fn = reqwest::get("http://adamchalmers.com");

// Wrap the async function in my hypothetical wrapper.
let timed_async_fn = TimedWrapper::new(async_fn);

// Call the async function, which will send a HTTP request and time it.
let (resp, time) = timed_async_fn.await;
println!("Got a HTTP {} in {}ms", resp.unwrap().status(), time.as_millis())

I like this interface, it’s simple and should be easy for the other programmers on my team to use. OK, let’s implement it! I know that, under the hood, Rust’s async functions are just regular functions that return a Future. The Future trait is pretty simple. It just means a type which:

Can be polled
When it’s polled, it might return “Pending” or “Ready”
If it’s pending, you should poll it again later
If it’s ready, it responds with a value. We call this “resolving”.

Here’s a really easy example of implementing a Future. Let’s make a Future that returns a random u16.

use std::{future::Future, pin::Pin, task::{Context, Poll}};

/// A future which returns a random number when it resolves.
#[derive(Default)]
struct RandFuture;

impl Future for RandFuture {
	// Every future has to specify what type of value it returns when it resolves.
	// This particular future will return a u16.
	type Output = u16;

	// The `Future` trait has only one method, named "poll".
	fn poll(self: Pin<&mut Self>, _cx: &mut Context) -> Poll<Self::Output> {
		// Uses the `rand` crate. This future is ready the first time it is polled.
		Poll::Ready(rand::random())
	}
}
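RandFuture is always ready on its first poll, so it never shows the "Pending, poll again later" half of the contract. The sketch below, using only the standard library, adds a future that is pending a few times before resolving, plus a tiny loop that keeps polling it. `Countdown`, `NoopWake` and `poll_to_completion` are names invented for this example; a real executor would wait for the waker instead of polling in a busy loop. The `Pin` and `Unpin` that appear in the signatures are explained later in this post.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A future that reports `Pending` a fixed number of times, then resolves.
struct Countdown(u32);

impl Future for Countdown {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.0 == 0 {
            Poll::Ready("done")
        } else {
            self.0 -= 1;
            // Ask to be polled again. A real future would arrange for this
            // to happen once whatever it is waiting on becomes ready.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// A waker that does nothing, which is enough for a busy-polling loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` until it resolves; returns the output and the number of polls.
fn poll_to_completion<F: Future + Unpin>(mut fut: F) -> (F::Output, u32) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = Pin::new(&mut fut).poll(&mut cx) {
            return (out, polls);
        }
    }
}
```

With `Countdown(3)`, the loop sees three `Pending` results and resolves on the fourth poll.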

Not too hard! I think we’re ready to implement TimedWrapper.

Trying and failing to use nested Futures

Let’s start by defining the type.

pub struct TimedWrapper<Fut: Future> {
	start: Option<Instant>,
	future: Fut,
}

OK, so a TimedWrapper is generic over a type Fut, which must be a Future. And it will store a future of that type as a field. It’ll also have a start field which will record when it was first polled. Let’s write a constructor:

impl<Fut: Future> TimedWrapper<Fut> {
	pub fn new(future: Fut) -> Self {
		Self { future, start: None }
	}
}

Nothing too complicated here. The new function takes a future and wraps it in the TimedWrapper. Of course, we have to set start to None, because it hasn’t been polled yet. So, let’s implement the poll method, which is the only thing we need to implement Future and make it .awaitable.

impl<Fut: Future> Future for TimedWrapper<Fut> {
	// This future will output a pair of values:
	// 1. The value from the inner future
	// 2. How long it took for the inner future to resolve
	type Output = (Fut::Output, Duration);

	fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
		// Call the inner poll, measuring how long it took.
		let start = self.start.get_or_insert_with(Instant::now);
		let inner_poll = self.future.poll(cx);
		let elapsed = start.elapsed();

		match inner_poll {
			// The inner future needs more time, so this future needs more time too
			Poll::Pending => Poll::Pending,
			// Success!
			Poll::Ready(output) => Poll::Ready((output, elapsed)),
		}
	}
}

OK, that wasn’t too hard. There’s just one problem: this doesn’t work.

So, the Rust compiler reports an error on self.future.poll(cx), which is “no method named poll found for type parameter Fut in the current scope”. This is confusing, because we know Fut is a Future, so surely it has a poll method? OK, but Rust continues: Fut doesn’t have a poll method, but Pin<&mut Fut> has one. What is this weird type?

Well, we know that methods have a “receiver”, which is some way it can access self. The receiver might be self, &self or &mut self, which mean “take ownership of self,” “borrow self,” and “mutably borrow self” respectively. So this is just a new, unfamiliar kind of receiver. Rust is complaining because we have Fut and we really need a Pin<&mut Fut>. At this point I have two questions:

What is Pin?
If I have a T value, how do I get a Pin<&mut T>?

The rest of this post is going to be answering those questions. I’ll explain some problems in Rust that could lead to unsafe code, and why Pin safely solves them.
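Before diving in, it may help to see the receiver kinds side by side on an ordinary type. `Counter` and its methods are made up for this illustration.

```rust
use std::pin::Pin;

struct Counter {
    count: u32,
}

impl Counter {
    /// `&self`: borrow immutably.
    fn get(&self) -> u32 {
        self.count
    }

    /// `&mut self`: borrow mutably.
    fn bump(&mut self) {
        self.count += 1;
    }

    /// `self`: take ownership; the counter is consumed.
    fn into_count(self) -> u32 {
        self.count
    }

    /// `self: Pin<&mut Self>`: a mutable borrow that also promises the
    /// value will not be moved. This is the receiver `Future::poll` uses.
    fn bump_pinned(mut self: Pin<&mut Self>) {
        // Direct field access works here because `Counter` is an ordinary,
        // freely movable type.
        self.count += 1;
    }
}
```

Given a plain `let mut c = Counter { count: 0 }`, the first three methods can be called directly, but `bump_pinned` has to be called as `Pin::new(&mut c).bump_pinned()`. Calling `c.bump_pinned()` is rejected by the compiler, which is the same situation we hit with `self.future.poll(cx)`.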

Self-reference is unsafe

Pin exists to solve a very specific problem: self-referential datatypes, i.e. data structures which have pointers into themselves. For example, a binary search tree might have self-referential pointers, which point to other nodes in the same struct.

Self-referential types can be really useful, but they’re also hard to make memory-safe. To see why, let’s use this example type with two fields, an i32 called val and a pointer to an i32 called pointer.

So far, everything is OK. The pointer field points to the val field in memory address A, which contains a valid i32. All the pointers are valid, i.e. they point to memory that does indeed encode a value of the right type (in this case, an i32). But the Rust compiler often moves values around in memory. For example, if we pass this struct into another function, it might get moved to a different memory address. Or we might Box it and put it on the heap. Or if this struct was in a Vec<MyStruct>, and we pushed more values in, the Vec might outgrow its capacity and need to move its elements into a new, larger buffer.

When we move it, the struct’s fields change their address, but not their value. So the pointer field is still pointing at address A, but address A now doesn’t have a valid i32. The data that was there was moved to address B, and some other value might have been written there instead! So now the pointer is invalid. This is bad — at best, invalid pointers cause crashes, at worst they cause hackable vulnerabilities. We only want to allow memory-unsafe behaviour in unsafe blocks, and we should be very careful to document this type and tell users to update the pointers after moves.
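Here is a small sketch of that scenario in code. It builds the struct on the stack, moves it to the heap with Box, and then compares the address stored in `pointer` with the address where `val` now lives. The stale pointer is only compared, never dereferenced, so the example itself stays safe. The function name is ours.

```rust
/// A struct whose `pointer` field is meant to point at its own `val`.
struct MyStruct {
    val: i32,
    pointer: *const i32,
}

/// Builds a self-referential value on the stack, moves it to the heap,
/// and returns (address stored in `pointer`, actual address of `val`).
/// Only addresses are compared; the stale pointer is never dereferenced.
fn stale_pointer_after_move() -> (usize, usize) {
    let mut s = MyStruct { val: 42, pointer: std::ptr::null() };
    s.pointer = &s.val; // `pointer` now holds the stack address of `val`

    let boxed = Box::new(s); // the move: fields are copied to a new address

    let stored = boxed.pointer as usize;
    let actual = &boxed.val as *const i32 as usize;
    (stored, actual)
}
```

The two addresses differ: `pointer` still holds address A, while `val` has moved to address B. Nothing in the type system stopped the move, which is exactly the gap Pin is designed to close.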

Unpin and !Unpin

To recap, all Rust types fall into two categories.

Types that are safe to move around in memory. This is the default, the norm. For example, this includes primitives like numbers, strings, bools, as well as structs or enums entirely made of them. Most types fall into this category!
Self-referential types, which are not safe to move around in memory. These are pretty rare. An example is the intrusive linked list inside some Tokio internals. Another example is most types which implement Future and also borrow data, for reasons explained in the Rust async book.

Types in category (1) are totally safe to move around in memory. You won’t invalidate any pointers by moving them around. But if you move a type in (2), then you invalidate pointers and can get undefined behaviour, as we saw before. In earlier versions of Rust, you had to be really careful using these types to not move them, or if you moved them, to use unsafe and update all the pointers. But since Rust 1.33, the compiler can automatically figure out which category any type is in, and make sure you only use it safely.

Any type in (1) implements a special auto trait called Unpin. Weird name, but its meaning will become clear soon. Again, most “normal” types implement Unpin, and because it’s an auto trait (like Send or Sync or Sized), you don’t have to worry about implementing it yourself. If you’re unsure if a type can be safely moved, just check it on docs.rs and see if it impls Unpin!

Types in (2) are creatively named !Unpin (the ! in a trait means “does not implement”). To use these types safely, we can’t use regular pointers for self-reference. Instead, we use special pointers that “pin” their values into place, ensuring they can’t be moved. This is exactly what the Pin type does.

Pin wraps a pointer and stops its value from moving. The only exception is if the value impls Unpin — then we know it’s safe to move. Voila! Now we can write self-referential structs safely! This is really important, because as discussed above, many Futures are self-referential, and we need them for async/await.
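Besides checking docs.rs, you can ask the compiler directly whether a type is Unpin. The sketch below uses a generic function with an `Unpin` bound as a compile-time check, and the standard `PhantomPinned` marker to opt a type out of Unpin. The helper and struct names are invented for this example.

```rust
use std::marker::PhantomPinned;

/// Compiles only if `T` implements `Unpin`.
fn assert_unpin<T: Unpin>() {}

/// An ordinary struct: `Unpin` is implemented automatically.
struct Plain {
    id: u32,
    name: String,
}

/// Opts out of `Unpin` by containing the standard marker `PhantomPinned`.
struct Immovable {
    data: String,
    _pin: PhantomPinned,
}

fn check() {
    assert_unpin::<i32>();
    assert_unpin::<String>();
    assert_unpin::<Plain>();
    // assert_unpin::<Immovable>(); // does not compile: `Immovable` is `!Unpin`
}
```

Because the check happens at compile time, uncommenting the last line turns a potential runtime memory bug into a build error.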

Using Pin

So now we understand why Pin exists, and why our Future poll method takes a pinned Pin<&mut Self> instead of a regular &mut self. So let’s get back to the problem we had before: I need a pinned reference to the inner future. More generally: given a pinned struct, how do we access its fields?

The solution is to write helper functions which give you references to the fields. These references might be normal Rust references like &mut, or they might also be pinned. You can choose whichever one you need. This is called projection: if you have a pinned struct, you can write a projection method that gives you access to all its fields.

Projecting is really just getting data into and out of Pins. For example, we need to get the start: Option<Instant> field out of the Pin<&mut Self>, and we need to put the future: Fut into a Pin so we can call its poll method. If you read the Pin methods you’ll see this is always safe if it points to an Unpin value, but requires unsafe otherwise.

// Putting data into Pin
pub        fn new          <P: Deref<Target:Unpin>>(pointer: P) -> Pin<P>;
pub unsafe fn new_unchecked<P>                     (pointer: P) -> Pin<P>;

// Getting data from Pin
pub        fn into_inner          <P: Deref<Target: Unpin>>(pin: Pin<P>) -> P;
pub unsafe fn into_inner_unchecked<P>                      (pin: Pin<P>) -> P;
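For an Unpin type, the safe halves of those two pairs can be used freely. A minimal sketch with String (which is Unpin); the `round_trip` function is just a name for this example:

```rust
use std::pin::Pin;

/// `String` is `Unpin`, so pinning and unpinning it needs no `unsafe`.
fn round_trip(mut s: String) -> String {
    // Putting data into a Pin.
    let mut pinned: Pin<&mut String> = Pin::new(&mut s);
    // `Pin<&mut T>` allows mutable access through deref when `T: Unpin`.
    pinned.push_str(" (pinned)");

    // Getting data back out of the Pin.
    let unpinned: &mut String = Pin::into_inner(pinned);
    unpinned.push_str(" (unpinned)");
    s
}
```

Swap String for a type that is not Unpin and neither `Pin::new` nor `Pin::into_inner` will compile, which leaves only the unsafe `_unchecked` variants.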

I know unsafe can be a bit scary, but it’s OK to write unsafe code! I think of unsafe as the compiler saying “hey, I can’t tell if this code follows the rules here, so I’m going to rely on you to check for me.” The Rust compiler does so much work for us, it’s only fair that we do some of the work every now and then. If you want to learn how to write your own projection methods, I can highly recommend this fasterthanli.me blog post on the topic. But we’re going to take a little shortcut.

Using pin-project instead

So, OK, look, it’s time for a confession: I don’t like using unsafe. I know I just explained why it’s OK, but still, given the option, I would rather not.

I didn’t start writing Rust because I wanted to carefully think about the consequences of my actions, damnit, I just want to go fast and not break things. Luckily, someone sympathized with me and made a crate which generates totally safe projections! It’s called pin-project and it’s awesome. All we need to do is change our definition:

#[pin_project::pin_project] // This generates a `project` method
pub struct TimedWrapper<Fut: Future> {
	// For each field, we need to choose whether `project` returns an
	// unpinned (&mut T) or pinned (Pin<&mut T>) reference to the field.
	// By default, it assumes unpinned:
	start: Option<Instant>,
	// Opt into pinned references with this attribute:
	#[pin]
	future: Fut,
}

For each field, you have to choose whether its projection should be pinned or not. By default, you should use a normal reference, just because they’re easier and simpler. But if you know you need a pinned reference — for example, because you want to call .poll(), whose receiver is Pin<&mut Self> — then you can do that with #[pin].

Now we can finally poll the inner future!

fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
	// This returns a type with all the same fields, with all the same types,
	// except that the fields defined with #[pin] will be pinned.
	let mut this = self.project();
	
	// Call the inner poll, measuring how long it took.
	let start = this.start.get_or_insert_with(Instant::now);
	let inner_poll = this.future.as_mut().poll(cx);
	let elapsed = start.elapsed();

	match inner_poll {
		// The inner future needs more time, so this future needs more time too
		Poll::Pending => Poll::Pending,
		// Success!
		Poll::Ready(output) => Poll::Ready((output, elapsed)),
	}
}

Finally, our goal is complete — and we did it all without any unsafe code.
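For comparison, here is roughly what the same wrapper looks like with a hand-written projection and no extra crates. This is a sketch of the general technique, not the code pin-project generates. The `block_on` and `NoopWake` helpers are a deliberately naive busy-polling driver added only so the example can run on its own; in real code you would use your async runtime.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

pub struct TimedWrapper<Fut: Future> {
    start: Option<Instant>,
    future: Fut,
}

impl<Fut: Future> TimedWrapper<Fut> {
    pub fn new(future: Fut) -> Self {
        Self { future, start: None }
    }
}

impl<Fut: Future> Future for TimedWrapper<Fut> {
    type Output = (Fut::Output, Duration);

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // SAFETY: we never move `future` out of `this`, and we only hand out
        // a pinned reference to it, so the pinning guarantee is preserved.
        // `start` is treated as an ordinary (unpinned) field.
        let this = unsafe { self.get_unchecked_mut() };
        let start = *this.start.get_or_insert_with(Instant::now);
        let future = unsafe { Pin::new_unchecked(&mut this.future) };

        match future.poll(cx) {
            Poll::Pending => Poll::Pending,
            Poll::Ready(output) => Poll::Ready((output, start.elapsed())),
        }
    }
}

/// A waker that does nothing; enough for the naive driver below.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Naive driver for demonstration: pins the future on the heap and polls it
/// in a loop until it resolves.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

The two `unsafe` lines are the projection: one gets a plain `&mut` to the whole struct, the other re-pins just the `future` field. Getting the safety argument right for every field is the work that pin-project takes off your hands.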

Summary

If a Rust type has self-referential pointers, it can’t be moved safely. After all, moving doesn’t update the pointers, so they’ll still be pointing at the old memory address, so they’re now invalid. Rust can automatically tell which types are safe to move (and will auto impl the Unpin trait for them). If you have a Pin-ned pointer to some data, Rust can guarantee that nothing unsafe will happen (if it’s safe to move, you can move it, if it’s unsafe to move, then you can’t). This is important because many Future types are self-referential, so we need Pin to safely poll a Future. You probably won’t have to poll a future yourself (just use async/await instead), but if you do, use the pin-project crate to simplify things.

I hope this helped — if you have any questions, please ask me on Twitter. And if you want to get paid to talk to me about Rust and networking protocols, my team at Cloudflare is hiring, so be sure to visit careers.cloudflare.com.

References

Complete TimedWrapper example code on GitHub
This post is based on a presentation I gave at a Rust Bay Area meetup a few weeks ago. My talk starts around 40 minutes in.
The std::pin docs have a pretty good explanation of Pin’s details.
The Rust async book explains why Futures often need self-referential pointers.
Comprehensive article on how pin projection actually works by @fasterthanlime
Great article explaining when and how Rust moves values to different memory addresses, by @HashRust

Thanks to Nick Vollmar for feedback and to Shepmaster for helping me use pin-project when I first needed to write a nested Future.

Building an ARM64 Rust development environment using AWS Graviton2 and AWS CDK

2021-06-09 Alistair McLean

Post Syndicated from Alistair McLean original https://aws.amazon.com/blogs/devops/building-an-arm64-rust-development-environment-using-aws-graviton2-and-aws-cdk/

2020 was the year that ARM chips made the headlines by moving from largely mobile form factors into the cloud thanks to AWS Graviton2, allowing you to have up to 40% better price performance over comparable current generation x86 Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Relational Database Service (Amazon RDS) instances.

We speak to customers daily about Graviton2. One recurring question we hear is “Graviton2 is great, but how can my team develop for ARM natively without the complexity of cross-compilation or having to buy custom hardware on premises?” This post seeks to answer that question by setting up the Visual Studio Code-based Code Server IDE, running on a Graviton2 EC2 instance that enables native development in a cost-effective and secure manner accessed via your browser.

The Rust programming language has gained a huge amount of popularity recently. This post aims to show that you can use this environment for Rust development as well as hundreds of other supported languages. AWS has committed to supporting the Rust community and using the language to deliver fast and robust services to customers at scale, and we want to enable our customers to do the same.

We also include instructions for building and installing the rust-analyzer and CodeLLDB debugger plugins to add additional language features.

Solution overview

The following diagram illustrates our solution architecture.

Architecture of the solution showing components and their linkages

The solution consists of an EC2 Graviton2 instance located in a private VPC subnet routed through an AWS Global Accelerator accelerator to provide routing optimization and keep packet loss, jitter, and latency lower by up to 60%. An internal facing Application Load Balancer containing the AWS Certificate Manager certificate decrypts and forwards traffic to this instance.

Code Server queries AWS Secrets Manager to initially set the login password on startup and allow for continued password-based authentication and easy password rotation. The EC2 instance has access to the internet through a NAT gateway and has no public IP address or key pair associated, and is accessible only through AWS Systems Manager Session Manager.

Prerequisites

For this walkthrough, the following are prerequisites:

An AWS account
Familiarity with the AWS Command Line Interface (AWS CLI)
Account capacity for two Elastic IPs for the NAT gateways
Access to an AWS account with administrator or PowerUser (or equivalent) AWS Identity and Access Management (IAM) role policies attached
Access to AWS CloudShell
Basic knowledge of the Linux operating system
A public hosted zone in Amazon Route 53—for instructions on setting one up, see Getting started with Amazon Route 53
A private CIDR range for the new VPC that is created as part of the AWS Cloud Development Kit (AWS CDK) stack

AWS CDK stack

In order to deploy our architecture, I use the AWS CDK. As a developer, it’s more intuitive to me to define my infrastructure using a language and tooling with which I am familiar. I can also do things like environment variable injection and scripting as part of the stack creation to add stack parameters and customization points.

The AWS CDK application comprises five stacks. Each stack defines a separate part of the architecture:

Networking – Defines a VPC across two Availability Zones with the CIDR range of your choice. The routing and public/private subnet creation is done for us as part of the default configuration.
Certificate – This is the reason for the domain prerequisite. It’s a best practice to encrypt web applications using TLS, and for that we need a certificate and therefore a domain. This stack creates a certificate for the subdomain you specify as part of the stack creation and DNS validation in Route 53.
Amazon EC2 configuration – This defines both our AMI and the instance type and configuration. In this case, we’re using Amazon Linux 2 ARM64 edition. Here we also set the instance-managed roles that allow Session Manager connectivity and Secrets Manager access.
ALB configuration – Here we define the internal load balancer and specify the listener, certificate, and target configuration. I have injected the Amazon EC2 configuration as part of the class constructor so that I can reference it directly as a target.
Global accelerator configuration – Finally, the accelerator is defined here with two ports open and the ALB we defined in the ALB stack as a target. Most importantly, this stack adds a CNAME DNS entry pointing to the DNS name of the accelerator.

Walkthrough overview

This walkthrough uses the AWS CDK command line tools to deploy the stack. Session Manager is enabled to allow access to the EC2 instance and configure the Code Server application and associated plugins.

The walkthrough specifically covers the following steps:

Deploy the AWS CDK stacks via CloudShell to build out the application infrastructure and associated IAM roles.
Launch Code Server via the official Docker container with the commands to get and set the password stored in Secrets Manager.
Log in and build the rust-analyzer and CodeLLDB plugins from a terminal to allow for debugging within a “Hello World” application.

Start CloudShell and install the appropriate tooling

In this section, I use dummy values for the domain, the VPC CIDR, AWS Region, and the secret password. You need to submit real values as appropriate.

sudo yum groupinstall -y "Development Tools"
sudo npm install aws-cdk -g
git clone https://github.com/aws-samples/cdk-graviton2-alb-aga-route53.git
cd cdk-graviton2-alb-aga-route53
python3 -m venv .
source bin/activate
python -m pip install -r requirements.txt
export VPC_CIDR="10.0.0.1/16" #Substitute your CIDR here.
export CDK_DEPLOY_ACCOUNT=`aws sts get-caller-identity | jq -r '.Account'`
export CDK_DEPLOY_REGION=$AWS_REGION
export R53_DOMAIN="code-server.example.com" #Substitute your domain here.
cdk bootstrap aws://$CDK_DEPLOY_ACCOUNT/$CDK_DEPLOY_REGION
cdk deploy --all

The deploy step takes around 10 to 15 minutes to run and prompts a couple of times for confirmation before adding resources like security groups and IAM roles.

Log in to the new instance using Session Manager

Install the latest version of the Session Manager plugin for the AWS CLI:

cd ~
curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm" -o "session-manager-plugin.rpm"
sudo yum install -y session-manager-plugin.rpm

Now start a session on the newly created EC2 instance and switch to the ec2-user account:

aws ssm start-session --target i-1234xyz7890abc #Substitute the instance id we just created here
#Once session is active:
sudo su - ec2-user

Add the password as a secret and start the container

Enter the following code to add the password as a secret in Secrets Manager and start the container:

aws secretsmanager create-secret --name CodeServerProd --secret-string Password123abc # Substitute the appropriate password here.
sudo docker run -d --name=code-server -e PUID=1000 -e PGID=1000 -e PASSWORD=`aws secretsmanager get-secret-value --secret-id CodeServerProd | jq -r '.SecretString'` -p 8080:8080 -v /home/ec2-user/.config:/config --restart unless-stopped codercom/code-server

Access and configure the web application for Rust development

So far, we have accomplished the following:

Created the infrastructure in the diagram via AWS CDK deployment
Configured the EC2 instance to run Docker and added this to the systemctl startup scripts
Created a secret in Secrets Manager to use as the application login password
Instantiated a Docker container running Code Server

Next, we access the running container via the web interface and install the required development tools.

Log in to the Code Server web application

To log in to the Code Server web application, complete the following steps:

Browse to https://code-server.example.com, where example.com is the name of the domain you supplied in the AWS CDK step.
Log in using the password you created in Secrets Manager.
Create a new terminal by choosing the hamburger icon and, under Terminal, choosing New Terminal.
Issue the following commands into the terminal to install the Rust programming language:

bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential npm clang lldb
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

Install the rust-analyzer plugin

Open the extensions panel and enter Rust Analyzer in the search bar. Then install the plugin.

Install the debugger

Go back to the extensions panel in the Code Server application and enter CodeLLDB into the search bar. Then install this extension.

Create a sample application and open it in the Code Server window

To create and use our sample application, complete the following steps:

In the existing Code Server terminal, enter the following:

mkdir -p ~/src/
cd ~/src
cargo new helloworld --bin

Open the newly created folder in Code Server, verifying that the helloworld directory was successfully created.

Open File or Folder dialog in Code Server

Rust-analyzer runs when you open src/main.rs and indexes the file.
You can run the program by choosing Run in the editor.

Main Code Server editor window showing helloworld Rust program code.

Similarly, to launch the debugger, choose Debug in the editor.

Code Server Debugger view

Troubleshooting

If the CloudShell session times out, you need to reset your environment variables in order to re-deploy, modify, and delete the stack deployment.

Clean up

This stack incurs an estimated monthly cost of $143.00.

To delete the stack, log in to CloudShell and enter the following commands:

cd cdk-graviton2-alb-aga-route53
source bin/activate

# Re-set the environment variables again if required
export VPC_CIDR="10.0.0.1/16" #Substitute your CIDR here.
export CDK_DEPLOY_ACCOUNT=`aws sts get-caller-identity | jq -r '.Account'`
export CDK_DEPLOY_REGION=$AWS_REGION
export R53_DOMAIN="code-server.example.com" #Substitute your domain here.
cdk destroy --all

This destroys all the resources created in the first step. You can verify this by browsing to the AWS CloudFormation console and noting the deletion of all the stacks.

Conclusion

AWS is a place where builders can reinvent the future. The future of development means supporting different chipsets depending on different business requirements. This post is designed to enable development targeting the ARM64 microarchitecture by utilizing AWS Graviton2. Happy building!

Author bio

Alistair is a Principal Solutions Architect at AWS focused on EdTech customers. Originally from the west coast of Scotland, Alistair now lives in Fairfield, Connecticut, with his wife and two daughters and enjoys spending time with his family, skiing, golfing, cycling, and using his pellet smoker.

Using One Cron Parser Everywhere With Rust and Saffron

2020-12-25 Aaron Loyd

Post Syndicated from Aaron Loyd original https://blog.cloudflare.com/using-one-cron-parser-everywhere-with-rust-and-saffron/

As part of the development for Cron Triggers on Cloudflare Workers, we had an interesting problem to tackle relating to parsers and the cron expression format. Cron expressions are the format used to write schedules in Cron Triggers, and extensions for cron expressions are everywhere. They vary between parsers and platforms as well, and aren’t standardized by a governing body, which means most parsers out there support many different feature sets, which isn’t good if you’d like something off the shelf that just works.

It can be tough to find the right parser for each part of the Cron Triggers stack, when its user interface, API, and edge service are all written in different languages. On top of that, it isn’t practical to reinvent the wheel multiple times by writing the same parser in different languages and make sure they all match perfectly. So you’re likely stuck with a less-than-perfect solution.

However, in the end, because we wrote our backend service in Rust, it took much less effort to solve this problem. Rust has a great ecosystem for working across multiple languages, which allows us to write a parser once and pull it from the backend to the frontend and everywhere in between with minimal glue code.

The Trouble with Cron

Cron expressions are a set of fields that represent a set of times. They act as a pattern that matches over the minute, hour, day of the month, month, and day of the week of a given time. Since cron is a simple format, it’s easy to extend with extra fields, so some parsers and platforms allow specifying seconds and years as well. However, seconds are a bit too granular and years are a bit too long, so we opted to not support them as part of Cron Triggers.

In the original cron program, the supported expressions were simple. Each field could contain one of the following:

A star (‘*’) representing all values,
A value (a number for any field, or a three-letter abbreviation for months or days of the week, like JUN or FRI),
A range of values (e.g. ‘0-30’), or
A set of ranges and/or values (e.g. ‘0-15,30,45-50,55’)
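
As a rough illustration of how little the original syntax demands, here is a minimal Rust sketch of a matcher for a single numeric field. It is not saffron's implementation: the names `field_matches` and `parse_num` are invented for this example, and month or weekday abbreviations such as JUN or FRI are left out.

```rust
/// Parse one number from a field, reporting what failed to parse.
/// (Hypothetical helper for this sketch.)
fn parse_num(s: &str) -> Result<u32, String> {
    s.parse::<u32>()
        .map_err(|_| format!("'{}' is not a number", s))
}

/// Does `value` match a field written in the original cron syntax?
/// Handles `*`, a single value, a range, or a comma-separated mix.
fn field_matches(field: &str, value: u32) -> Result<bool, String> {
    if field == "*" {
        return Ok(true);
    }
    let mut matched = false;
    for part in field.split(',') {
        // A part is either "lo-hi" or a single number.
        let (lo, hi) = match part.split_once('-') {
            Some((a, b)) => (parse_num(a)?, parse_num(b)?),
            None => {
                let n = parse_num(part)?;
                (n, n)
            }
        };
        if lo > hi {
            return Err(format!("range '{}' is backwards", part));
        }
        // Keep validating the remaining parts even after a match.
        matched |= (lo..=hi).contains(&value);
    }
    Ok(matched)
}
```

With this sketch, `field_matches("0-15,30,45-50,55", 30)` reports a match and `field_matches("0-30", 31)` does not. A real parser would also check each value against the bounds of its field.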

This is a good start for specifying most time patterns, but many extensions exist out there to fill in some gaps. For example,

‘L’ can be used for the day of the month position to specify the last day of the month, or in the day of the week position with a day value to specify the last of that weekday during the month (e.g. 7L, the last Saturday of the month).
‘W’ can be used for the day of the month, and lets you specify “the closest weekday (MON-FRI) to a given day”, like 15W, or the closest weekday to the 15th of the month.
‘/’ can be used for step values in any field. For example, */5 in the minute field is every 5th minute in the hour. This can be combined with a range to specify things such as ‘30-59/5’, or every 5th minute from minute 30 to minute 59 in the hour.
‘#’ can be used with a day of the week value to specify the “nth day of the month”, such as ‘5#3’, or the 3rd Thursday of the month.
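
Two of these extensions reduce to simple arithmetic, which the following Rust sketch tries to show. It assumes nothing about saffron's internals: `expand_step` and `occurrence_in_month` are hypothetical names, and `min` and `max` stand for the bounds of whichever field is being parsed.

```rust
/// Expand `<start>-<end>/<step>` or `*/<step>` into the values it selects.
/// `min` and `max` are the bounds of the field (0 and 59 for minutes).
fn expand_step(expr: &str, min: u32, max: u32) -> Option<Vec<u32>> {
    let (range, step) = expr.split_once('/')?;
    // A step of zero would never advance, so reject it.
    let step = step.parse::<usize>().ok().filter(|s| *s > 0)?;
    let (lo, hi) = if range == "*" {
        (min, max)
    } else {
        let (a, b) = range.split_once('-')?;
        (a.parse::<u32>().ok()?, b.parse::<u32>().ok()?)
    };
    if lo < min || hi > max || lo > hi {
        return None;
    }
    Some((lo..=hi).step_by(step).collect())
}

/// For `<weekday>#<n>`: which occurrence of its weekday a day of the month is.
/// Days 1-7 are the first occurrence, 8-14 the second, and so on.
/// Expects a day of the month starting at 1.
fn occurrence_in_month(day_of_month: u32) -> u32 {
    (day_of_month - 1) / 7 + 1
}
```

Under these assumptions, ‘30-59/5’ expands to minutes 30, 35, 40, 45, 50, and 55, and a ‘5#3’ schedule matches a Thursday only when `occurrence_in_month` returns 3 for its date, which is the case for the 15th through the 21st.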

So far I’ve only listed the extensions we currently support on Workers, but others exist, such as ‘H’ in Jenkins and ‘?’, which some cron implementations use to mean start-up time. Most libraries don’t support these extensions, although some implementations accept ‘?’ in certain circumstances with a meaning other than start-up time. With all these extensions and a lack of standardization, no library is guaranteed to support them all.

The Multitude of Libraries

During the development of Cron Triggers, we needed some things to just work, and to do that, we opted to pull some libraries off the shelf from package repositories for different parts of the stack.

In the Rust backend, we needed a cron library that supported all the extensions we wanted, left out other field extensions like seconds and years, and had an API that let us simply check whether a given time matched the expression pattern. None of the crates on crates.io offered all of this, so we had to write one ourselves. Using the nom crate, it was easy to draft a simple, fast, safe parser, named ‘saffron’. As time went on and we got closer to release, it became clearer which extensions we really wanted to support. It was incredibly easy to add support for the new features without worrying about memory safety since the compiler checked it for us, so all we had to do was extensive logic testing. Last offset weekdays (“L-XW”) and leap years were difficult to get right the first time, but testing them was easy with Rust.

#[test]
fn parse_check_offset_weekend_start_months() {
    let cron = "0 0 L-30W * *";

    check_does_contain(
        cron,
        &["2021-05-3T00:00:00+00:00", "2022-01-3T00:00:00+00:00"],
    );
}

#[test]
fn parse_check_offset_leap_days() {
    let cron = "0 0 L-1 FEB *";

    check_does_contain(
        cron,
        &[
            "2400-02-28T00:00:00+00:00",
            "2300-02-27T00:00:00+00:00",
            "2200-02-27T00:00:00+00:00",
            "2100-02-27T00:00:00+00:00",
            "2024-02-28T00:00:00+00:00",
            "2020-02-28T00:00:00+00:00",
            "2004-02-28T00:00:00+00:00",
            "2000-02-28T00:00:00+00:00",
        ],
    );

    check_does_not_contain(
        cron,
        &[
            "2400-02-29T00:00:00+00:00",
            "2300-02-28T00:00:00+00:00",
            "2200-02-28T00:00:00+00:00",
            "2100-02-28T00:00:00+00:00",
            "2024-02-29T00:00:00+00:00",
            "2020-02-29T00:00:00+00:00",
            "2004-02-29T00:00:00+00:00",
            "2000-02-29T00:00:00+00:00",
        ],
    );
}
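
To see why those dates are the expected ones, here is a small self-contained Rust sketch of the calendar arithmetic behind ‘L-X’ and ‘W’. It is an illustration, not saffron's code: all function names are invented, the weekday calculation uses Sakamoto's well-known method, and the rule that ‘W’ never leaves the month is an assumption, although it agrees with the May 2021 and January 2022 dates in the first test.

```rust
/// Gregorian rule: every 4th year, except centuries not divisible by 400.
fn is_leap_year(year: i32) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

fn days_in_month(year: i32, month: u32) -> u32 {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 if is_leap_year(year) => 29,
        2 => 28,
        _ => panic!("month must be in 1..=12"),
    }
}

/// Day of the week via Sakamoto's method; 0 = Sunday ... 6 = Saturday.
fn weekday(year: i32, month: u32, day: u32) -> u32 {
    const T: [i32; 12] = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4];
    let y = if month < 3 { year - 1 } else { year };
    let sum = y + y / 4 - y / 100 + y / 400 + T[(month - 1) as usize] + day as i32;
    sum.rem_euclid(7) as u32
}

/// `L-<offset>`: the day `offset` days before the last day of the month.
/// Returns None when the offset reaches past the 1st.
fn last_offset_day(year: i32, month: u32, offset: u32) -> Option<u32> {
    days_in_month(year, month)
        .checked_sub(offset)
        .filter(|d| *d >= 1)
}

/// `W`: the nearest Monday-to-Friday to `day`, staying inside the month
/// (assumed behavior for this sketch).
fn closest_weekday(year: i32, month: u32, day: u32) -> u32 {
    let last = days_in_month(year, month);
    match weekday(year, month, day) {
        6 if day > 1 => day - 1,    // Saturday: back to Friday
        6 => day + 2,               // Saturday the 1st: forward to Monday
        0 if day < last => day + 1, // Sunday: forward to Monday
        0 => day - 2,               // Sunday on the last day: back to Friday
        _ => day,
    }
}
```

For ‘L-30W’ in May 2021, `last_offset_day` gives the 1st, which is a Saturday, so `closest_weekday` moves forward to Monday the 3rd. For ‘L-1’ in February, the leap-year rule yields the 28th in 2400 and the 27th in 2300.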

However, the UI had a different set of requirements. It didn’t need to know whether a given time matched a cron pattern — we wanted to provide information to the user about the cron expression they’d written, so it needed to provide a more human readable translation (description) of their cron expression and show them their next five executions (future times). But we were on a limited time budget — we needed something off the shelf.

We used two different JavaScript libraries for displaying info about given cron expressions: one gave us descriptions, the other gave us future times. Since these two libraries were tasked with parsing cron expressions, they also acted as validation; however, just using these two libraries for validation proved to be less than optimal. Both of the libraries supported extensions that were different both from each other and from the backend. Because of that they’d sometimes allow users to add schedules that would be rejected by the API on submit, which doesn’t translate into a good user experience. This validation should happen while the user writes their cron expression, not after they already hit submit! Because of this fracture in extension support, the UI parsers also sometimes didn’t parse expressions that should be supported and were accepted by the API!

Before release on the API side, we simply used a Go library for validation. This proved to be an easy solution, but we quickly noticed that the API accepted more than the schedule runner supported. As a result, some triggers were successfully added to the schedule but then ignored by the runner because they failed to parse.

So before launch, we were using four completely different parsers! This probably wouldn’t be much of an issue if cron expressions were standardized. But because they aren’t, inconsistencies could exist at every step in the trigger creation process: between the two libraries we used on the frontend, between the frontend and API, and between the API and the backend.

To solve these issues in the UI and API before release, we synced the API and backend by adding another schedule runner entrypoint that simply read a cron expression from stdin, parsed it, and returned whether it was valid, so that the two matched perfectly. We also added a validation endpoint to the API that the UI could use to check that the backend actually accepted a cron expression. This fixed all cases of the API and UI being too accepting of expressions that weren’t supported, but neither of these solutions was optimal.
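
A validation entrypoint of that kind can be very small. The Rust sketch below shows the general shape under stated assumptions: `looks_like_cron` is a placeholder that only counts fields and stands in for the real parser, and `validate_from` and its exit codes are invented for this example.

```rust
use std::io::BufRead;

/// Placeholder check, NOT saffron's grammar: it only requires five
/// whitespace-separated fields. A real entrypoint would call the parser.
fn looks_like_cron(expr: &str) -> bool {
    expr.split_whitespace().count() == 5
}

/// Read one expression from `input` and return a process exit code:
/// 0 = valid, 1 = invalid, 2 = the input could not be read.
fn validate_from<R: BufRead>(mut input: R) -> i32 {
    let mut line = String::new();
    match input.read_line(&mut line) {
        Ok(_) if looks_like_cron(line.trim()) => 0,
        Ok(_) => 1,
        Err(_) => 2,
    }
}
```

A `main` function could then call `std::process::exit(validate_from(std::io::stdin().lock()))`, letting the caller decide validity from the exit status alone. Taking any `BufRead` rather than stdin directly keeps the logic testable with an in-memory byte slice.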

For one, they weren’t performant. Each time we wanted to validate a cron expression in the UI, we’d have to parse the expression twice in JavaScript (once for a description, and again for future times) and make a request to the API, which would start an instance of the schedule runner, parse the expression, and return whether it parsed properly.

Another reason this wasn’t optimal is that we were still limited in the features we supported by one library. One of our UI libraries didn’t support the ‘L’ and ‘W’ extensions, and since we had also programmed the UI to accept an expression only if all parsers accepted it, expressions that used those extensions couldn’t be added.

So even though we dropped it to three parsers before release, it still didn’t seem good enough. Soon after release, I made plans to remedy it and started working on saffron (originally this project was called cfron but Cloudflare’s CTO couldn’t resist suggesting renaming it to saffron because he loves puns) to fill in for the one library holding us back in the UI. It would’ve been OK if missing extension support was the only thing wrong after release, but soon some other issues came up.

Off By One

Saffron is based on the Quartz open source scheduler’s cron parser, which makes days of the week when specified as integers start from 1 (Sunday) and go to 7 (Saturday). Both parsers on the frontend follow the original values for cron, where days start from 0 and go to 6, and 7 could be used for Sunday as well. So when users entered 1-5, the UI told them they were entering a schedule from Monday to Friday, and the backend ended up executing Sunday to Thursday! This was missed when testing Cron Triggers initially and was caught by observant community members on the forum.
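
The mismatch is easy to reproduce. The following Rust sketch spells out the two numbering schemes described above; the function names are invented for illustration and are not part of saffron.

```rust
const NAMES: [&str; 7] = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];

/// Classic cron numbering: 0-6 with 0 = Sunday, and 7 also accepted as Sunday.
fn classic_day_name(n: u32) -> Option<&'static str> {
    if n <= 7 {
        Some(NAMES[(n % 7) as usize])
    } else {
        None
    }
}

/// Quartz-style numbering: 1-7 with 1 = Sunday.
fn quartz_day_name(n: u32) -> Option<&'static str> {
    if (1..=7).contains(&n) {
        Some(NAMES[(n - 1) as usize])
    } else {
        None
    }
}

/// Translate a classic value into the Quartz-style numbering.
fn classic_to_quartz(n: u32) -> Option<u32> {
    if n <= 7 {
        Some(n % 7 + 1)
    } else {
        None
    }
}
```

Here the value 1 is Monday for `classic_day_name` but Sunday for `quartz_day_name`, and 5 is Friday for one and Thursday for the other, which is exactly how ‘1-5’ turned into Sunday to Thursday.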

Fixing the issue turned out to be a bit difficult. While the library we were using for descriptions had the option to simply switch from 0-6 to 1-7 days of the week, our future times library did not have that option. Luckily, I was already halfway through developing its replacement in Saffron. However, we couldn’t place it directly on the frontend yet, since web bindings didn’t exist and I didn’t have time to write them. We needed something easier to develop quickly.

Reintroducing: Cloudflare Workers!

Workers made it incredibly easy to take the existing code, add some wasm entry points for a makeshift API, and call with JavaScript. No need to build a whole separate API in Go! Just take your existing code and put it directly within 100ms of nearly everyone on the Internet. Why call all the way back home when the nearest PoP works just as well?

Plus, we don’t have to worry about building and publishing, because wrangler does it for us! For example, our validation code is all written in Rust:

#[wasm_bindgen]
#[derive(Clone, Debug)]
pub struct ValidationResult {
   errors: Option<Vec<String>>,
}
 
#[wasm_bindgen]
pub fn validate(crons: JsArray) -> ValidationResult {
   set_panic_hook();
 
   let len = crons.length();
   let mut map = HashMap::with_capacity(len as usize);
   for i in 0..len {
       let string = match crons.get(i).as_string() {
           Some(string) => string,
           None => {
               return ValidationResult {
                   errors: Some(vec![format!("Element '{}' is not a string", i)]),
               }
           }
       };
 
       let cron: Cron = match string.parse() {
           Ok(cron) => cron,
           Err(err) => {
               return ValidationResult {
                   errors: Some(vec![format!(
                       "Failed to parse expression at index '{}': {}",
                       i, err
                   )]),
               }
           }
       };
 
       if let Some(old_str) = map.insert(cron, string.clone()) {
           return ValidationResult {
               errors: Some(vec![format!(
                   "Expression '{}' already exists in the form of '{}'",
                   string, old_str
               )]),
           };
       }
   }
 
   ValidationResult { errors: None }
}

and our code to handle processing the request and response is written in JavaScript:

const path = new URL(request.url).pathname;
switch (path) {
  case "/validate": {
    let body;
    try {
      body = await request.json();
    } catch (e) {
      return status(400, "Bad Request");
    }
    let crons = body.crons;
    if (!Array.isArray(crons)) {
      return status(400, "Bad Request");
    }

    let result = validate(crons).errors();
    let success = result == null;
    return apiResponse({}, success, result);
  }

After a week of dedicated development, a Worker was written, the future times were calculated, and the UI was fixed! On top of that, we also implicitly introduced support for more extensions by removing the old parser and replacing it with the same one used on the backend as part of the fix itself. But we’re still using two parsers, so inconsistencies that we haven’t seen yet may still exist.

For example, the expression “0 0 L-1W 2 *”, or “12:00 AM on the closest weekday to the 2nd to last day of the month in February”, cannot be parsed by the parser we use for descriptions, but it’s accepted by the API, backend, and Worker. You can use it in your cron triggers, but the UI won’t give you a description for it.

The Quest for the One True Parser

This brings us to today. In the search for better and faster, we want to bring the number of parsers down from two to one. One source of truth for the entire stack. To make it all faster, we should do parsing on the frontend locally instead of making a call to a remote Worker (if possible). In the API, the separate entry point was a nice easy solution, but starting the schedule runner just to check whether a cron string is valid every time a user adds one doesn’t seem like the best we can do.

Luckily Rust has a vibrant ecosystem that can meet all these needs! To bring the parser to the UI, we can compile saffron to wasm and use generated bindings created with wasm-pack. This can be easily integrated with our existing webpack setup, making it simple to get future times and create descriptions of cron strings on the frontend. Then, to bring the parser closer to the API, we can use Rust’s ability to create C APIs that we can then integrate with Go using cgo.
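
To give a feel for the C API route, here is a hedged Rust sketch of a function that Go could call through cgo. It is not saffron's actual interface: `cron_is_valid`, its return codes, and the `looks_like_cron` placeholder are all assumptions made for this example.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Placeholder for the real parser (hypothetical): it only requires five
/// whitespace-separated fields.
fn looks_like_cron(expr: &str) -> bool {
    expr.split_whitespace().count() == 5
}

/// C-callable check: returns 1 if valid, 0 if invalid, and -1 if the pointer
/// is null or the bytes are not UTF-8.
///
/// The attribute keeps the symbol name unmangled so C and cgo can link to it.
/// Toolchains older than Rust 1.82 spell it `#[no_mangle]`.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn cron_is_valid(expr: *const c_char) -> i32 {
    if expr.is_null() {
        return -1;
    }
    // SAFETY: the caller promises `expr` points to a NUL-terminated string
    // that stays alive for the duration of this call.
    let c_str = unsafe { CStr::from_ptr(expr) };
    match c_str.to_str() {
        Ok(s) if looks_like_cron(s) => 1,
        Ok(_) => 0,
        Err(_) => -1,
    }
}
```

Built as a static or dynamic library, a function like this would be declared on the C side as `int cron_is_valid(const char *expr);`, which is the form cgo consumes. Returning a plain integer avoids passing ownership of any memory across the language boundary.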

With our parser everywhere, we can then focus exclusively on cron descriptions to replace the one other parser we’re using in the UI. At that point we will have one parser for the whole stack, a single source of truth that anyone can reference to understand how the frontend, API, and backend all work together. It also simplifies our graph. Now instead of multiple libraries written in different languages, we have one library with multiple language wrappers, each serving a different part of the stack. No inconsistencies will exist since they’re all using the same parser!

However, we wanted to do something before that…

We made it open source!

I think this project serves as a great example of Rust’s type system, its safety, and its extensibility across the entire stack. The project itself is simple, easy to understand, and easy to port and provide bindings for. By open sourcing, we can publish packages for these bindings on npm and crates.io, allowing anyone to use these bindings for whatever they want. It also means you can follow along with development to see the finishing touches get added and maybe make some suggestions for future improvements in the UI and the parser itself.

You can view the project on GitHub at https://github.com/cloudflare/saffron.

Building even faster interpreters in Rust

2020-09-24 Zak Cutner

Post Syndicated from Zak Cutner original https://blog.cloudflare.com/building-even-faster-interpreters-in-rust/

At Cloudflare, we’re constantly working on improving the performance of our edge — and that was exactly what my internship this summer entailed. I’m excited to share some improvements we’ve made to our popular Firewall Rules product over the past few months.

Firewall Rules lets customers filter the traffic hitting their site. It’s built using our engine, Wirefilter, which takes powerful boolean expressions written by customers and matches incoming requests against them. Customers can then choose how to respond to traffic which matches these rules. We will discuss some in-depth optimizations we have recently made to Wirefilter, so you may wish to get familiar with how it works if you haven’t already.

Minimizing CPU usage

As a new member of the Firewall team, I quickly learned that performance is important — even in our security products. We look for opportunities to make our customers’ Internet properties faster where it’s safe to do so, maximizing both security and performance.

Our engine is already heavily used, powering all of Firewall Rules. But we have bigger plans. More and more products like our Web Application Firewall (WAF) will be running behind our Wirefilter-based engine, and it will become responsible for eating up a sizable chunk of our total CPU usage before long.

How to measure performance?

Measuring performance is a notoriously tricky task, and as you can probably imagine trying to do this in a highly distributed environment (aka Cloudflare’s edge) does not help. We’ve been surprised in the past by optimizations that look good on paper, but, when tested out in production, just don’t seem to do much.

Our solution? Performance measurement as a service — an isolated and reproducible benchmark for our Firewall engine and a framework for engineers to easily request runs and view results. It’s worth noting that we took a lot of inspiration from the fantastic Rust Compiler benchmarks to build this.

What to measure?

Our next challenge was to find some meaningful performance metrics. Some experimentation quickly uncovered that time was far too volatile a measure for meaningful comparisons, so we turned to hardware counters [2]. It’s not hard to find tools to measure these (perf and VTune are two such examples), although they (mostly) don’t allow control over which parts of the program are recorded. In our case, we wished to individually record measurements for different stages of filter processing — parsing, compilation, analysis, and execution.

Once again we took inspiration from the Rust compiler, and its self-profiling options, using the perf_event_open API to record counters from inside our binary. We then output something like the following, which our framework can easily ingest and store for later visualization.

Whilst we mainly focussed on metrics relating to CPU usage, we also use a combination of getrusage and clear_refs to find the maximum resident set size (RSS). This is useful to understand the memory impact of particular algorithms in addition to CPU.

But the challenge was not over. Cloudflare’s standard CI agents use virtualization and sandboxing for security and convenience, but this makes accessing hardware counters virtually impossible. Running our benchmarks on a dedicated machine gave us access to these counters, and ensured more reproducible results.

Speeding up the speed test

Our benchmarks were designed from the outset to take an important place in our development process. For instance, we now perform a full benchmark run before releasing each new version to detect performance regressions.

But with our benchmarks in place, it quickly became clear that we had a problem. Our benchmarks simply weren’t fast enough, at least if we wanted to complete them in less than a few hours! The problem was that we have a very large number of filters. Since our engine would never usually execute requests against this many filters at once, running them all was proving incredibly costly. We came up with a few tricks to cut this down…

Deduplication. It turns out that only around a third of filters are structurally unique (something that is easy to check as Wirefilter can helpfully serialize to JSON). We managed to cut down a great deal of time by ignoring duplicate filters in our benchmarks.
Sampling. Still, we had too many filters and random sampling presented an easy solution. A more subtle challenge was to make sure that the random sample was always the same to maintain reproducibility.
Partitioning. We worried that deduplication and sampling would cause us to miss important cases that are useful to optimize. By first partitioning filters by Wirefilter language feature, we can ensure we’re getting a good range of filters. It also helpfully gives us more detail about where specifically the impact of a performance change is.
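
The three tricks compose naturally, as the Rust sketch below tries to show. It is only an illustration of the idea, not our benchmark code: filters are represented by their serialized strings, the feature label comes from a caller-supplied closure, and a seeded FNV-1a hash stands in for whatever reproducible sampling the real framework uses.

```rust
use std::collections::{BTreeMap, HashSet};

/// FNV-1a with a seed mixed in: no per-process randomness, so the same
/// seed always selects the same sample.
fn fnv1a(seed: u64, bytes: &[u8]) -> u64 {
    let mut h = 0xcbf29ce484222325u64 ^ seed;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Deduplicate on the serialized form, then keep the `k` filters with the
/// smallest seeded hash.
fn dedup_and_sample(serialized: &[String], seed: u64, k: usize) -> Vec<String> {
    let unique: HashSet<&String> = serialized.iter().collect();
    let mut keyed: Vec<(u64, &String)> = unique
        .into_iter()
        .map(|s| (fnv1a(seed, s.as_bytes()), s))
        .collect();
    keyed.sort(); // by hash, then by string, so ties are also deterministic
    keyed.into_iter().take(k).map(|(_, s)| s.clone()).collect()
}

/// Partition by feature label first, then deduplicate and sample each bucket.
fn partition_and_sample<F>(
    serialized: &[String],
    feature_of: F,
    seed: u64,
    k: usize,
) -> BTreeMap<String, Vec<String>>
where
    F: Fn(&str) -> String,
{
    let mut buckets: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for s in serialized {
        buckets.entry(feature_of(s.as_str())).or_default().push(s.clone());
    }
    buckets
        .into_iter()
        .map(|(feature, group)| (feature, dedup_and_sample(&group, seed, k)))
        .collect()
}
```

Sampling per bucket rather than across the whole set is what keeps rarely used language features from being sampled away entirely.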

Most of these are trade-offs, but very necessary ones which allow us to run continual benchmarks without development speed grinding to a halt. At the time of writing, we’ve managed to get a benchmark run down to around 20 minutes using these ideas.

Optimizing our engine

With a benchmarking framework in place, we were ready to begin testing optimizations. But how do you optimize an interpreter like Wirefilter? Just-in-time (JIT) compilation, selective inlining and replication were some ideas floating around in the world of interpreters that seemed attractive. After all, we previously wrote about the cost of dynamic dispatch in Wirefilter. All of these techniques aim to reduce that effect.

However, running some real filters through a profiler tells a different story. Most execution time, around 65%, is spent not resolving dynamic dispatch calls but instead performing operations like comparison and searches. Filters currently in production tend to be pretty light on functions, but throw in a few more of these and even less time would be spent on dynamic dispatch. We suspect that even a fair chunk of the remaining 35% is actually spent reading the memory of request fields.

Function	CPU time
`matches` operator	0.6%
`in` operator	1.1%
`eq` operator	11.8%
`contains` operator	51.5%
Everything else	35.0%

Breakdown of CPU time while executing a typical production filter.

An adventure in substring searching

By now, you shouldn’t be surprised that the contains operator was one of the first in line for optimization. If you’ve ever written a Firewall Rule, you’re probably already familiar with what it does — it checks whether a substring is present in the field you are matching against. For example, the following expression would match when the host is “example.com” or “www.example.net”, but not when it is “cloudflare.com”. In string searching algorithms, this is commonly referred to as finding a ‘needle’ (“example”) within a ‘haystack’ (“example.com”).

http.host contains "example"

How does this work under the hood? Ordinarily, we might have used Rust’s `String::contains` function, but Wirefilter also allows raw byte expressions that don’t necessarily conform to UTF-8.

http.host contains 65:78:61:6d:70:6c:65

We therefore used the memmem crate, which performs a two-way substring search algorithm on raw bytes.
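To make the raw byte form concrete, here is a minimal sketch of a byte-oriented `contains`. It is not the memmem crate's two-way algorithm, just a naive sliding window, and the helper names `parse_hex_bytes` and `contains_bytes` are hypothetical names chosen for illustration.

```rust
/// Parse a colon-separated hex literal such as `65:78:61:6d:70:6c:65`
/// into raw bytes. Returns `None` if any component is not valid hex.
/// (Hypothetical helper; Wirefilter's own parser is not shown here.)
fn parse_hex_bytes(literal: &str) -> Option<Vec<u8>> {
    literal
        .split(':')
        .map(|part| u8::from_str_radix(part, 16).ok())
        .collect()
}

/// Naive byte-wise substring search: slide a window over the haystack.
/// Unlike `str::contains`, this accepts bytes that are not valid UTF-8.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        // `windows(0)` would panic, and an empty needle always matches.
        return true;
    }
    haystack.windows(needle.len()).any(|window| window == needle)
}
```

The bytes `65:78:61:6d:70:6c:65` are the ASCII encoding of "example", so both spellings of the filter search for the same needle. Working on `&[u8]` rather than `&str` is what lets the haystack or needle contain arbitrary bytes.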

Sounds good, right? It was, and it was working pretty well, although we’d noticed that, bizarrely, rewriting `contains` filters as regular expressions could often make them faster.

http.host matches "example"

Regular expressions are great, but since they’re far more powerful than the `contains` operator, they shouldn’t be faster than a specialized algorithm in simple cases like this one.

Something was definitely up. It turns out that Rust’s regex library comes equipped with a whole host of specialized matchers for what it deems to be simple expressions like this. The obvious question was whether we could therefore simply use the regex library. Interestingly, you may not have realized that the popular ripgrep tool does just that when searching for fixed-string patterns.

However, our use case is a little different. Since we’re building an interpreter (and we’re using dynamic dispatch in any case), we would prefer to dispatch to a specialized case for `contains` expressions, rather than matching on some enum deep within the regex crate when the filter is executed. What’s more, there are some pretty cool things being done to perform substring searching that leverage SIMD instruction sets. So we wired up our engine to some previous work by Wojciech Muła and the results were fantastic.

Benchmark	Improvement
Expressions using `contains` operator	72.3%
‘Simple’ expressions	0.0%
All expressions	31.6%

Improvements in instruction count using Wojciech Muła’s sse4-strstr library over the memmem crate with Wirefilter.

I encourage you to read more on “Algorithm 1”, which we used, but it works something like this (I’ve changed the order a little to help make it clearer). It’s worth reading up on SIMD instructions if you’re unfamiliar with them — they’re the essence behind what makes this algorithm fast.

We fill one SIMD register with the first byte of the needle being searched for, simply repeated over and over.
We load as much of our haystack as we can into another SIMD register and perform a byte-wise equality comparison with our previous register.
Now, any position in the resultant register that is 0 cannot be the start of the match since it doesn’t start with the same byte of the needle.
We now repeat this process with the last byte of the needle, offsetting the haystack, to rule out any positions that don’t end with the same byte as the needle.
Bitwise ANDing these two results together, we (hopefully) have now drastically reduced our potential matches.
Each of the remaining potential matches can be checked manually using a memcmp operation. If we find a match, then we’re done.
If not, we continue with the next part of our haystack and repeat until we’ve checked the entire thing.
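The steps above can be sketched in plain Rust. This is a scalar emulation under stated assumptions, not the sse4-strstr code: an inner loop over `LANES` positions stands in for the two SIMD comparisons, the bitmask plays the role of the ANDed result, and the name `find` and the 16-lane width are illustrative choices. For simplicity it compares the whole needle at each candidate instead of skipping the first and last bytes.

```rust
/// Number of byte lanes in our pretend SIMD register (an assumption).
const LANES: usize = 16;

/// Scalar emulation of the first/last-byte filter. Returns the offset
/// of the first match, if any.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let n = needle.len();
    if n == 0 {
        return Some(0);
    }
    if n > haystack.len() {
        return None;
    }
    let (first, last) = (needle[0], needle[n - 1]);
    // Valid start positions are 0..positions.
    let positions = haystack.len() - n + 1;

    let mut block = 0;
    while block < positions {
        // A real SIMD version always loads a full register; clamping
        // here sidesteps the tail problem entirely.
        let lanes = LANES.min(positions - block);
        // Bit i is set when both the first and the last needle byte
        // match at start position block + i.
        let mut mask: u32 = 0;
        for i in 0..lanes {
            let start = block + i;
            let eq_first = haystack[start] == first;
            let eq_last = haystack[start + n - 1] == last;
            if eq_first && eq_last {
                mask |= 1 << i;
            }
        }
        // Verify each surviving candidate with a full comparison.
        while mask != 0 {
            let i = mask.trailing_zeros() as usize;
            let start = block + i;
            if &haystack[start..start + n] == needle {
                return Some(start);
            }
            mask &= mask - 1; // clear the lowest set bit
        }
        block += LANES;
    }
    None
}
```

In a real implementation the inner `for` loop disappears: two vector comparisons and one AND produce the whole mask at once, which is where the speed comes from.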

When it goes wrong

You may be wondering what happens if our haystack doesn’t fit neatly into registers. In the original algorithm, nothing. It simply continues reading into oblivion past the end of the haystack until the last register is full, and uses a bitmask to ignore potential false positives from this additional region of memory.

As we mentioned, security is our priority when it comes to optimizations, so we could never deploy something with this kind of behaviour. We ended up porting Muła’s library to Rust (we’ve also open-sourced the crate!) and performed an overlapping registers modification found in ARM’s blog.

It’s best illustrated by example — notice the difference between how we would fill registers on an imaginary SIMD instruction-set with 4-byte registers.

Before modification

After modification

In our case, repeating some bytes within two different registers will never change the final outcome, so this modification is allowed as-is. However, in reality, we found it was better to use a bitmask to exclude repeated parts of the final register and minimize the number of memcmp calls.
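As a rough sketch of the overlapping idea, the function below computes where each register load would start and how many leading lanes of the final load repeat bytes already covered. The name `overlapping_loads` and the `(offset, skip)` representation are assumptions made for illustration; the real crate works with SIMD loads and bitmasks rather than a list of pairs.

```rust
/// For a haystack of `len` bytes and registers of `width` bytes, return
/// (offset, skip) pairs: load `width` bytes at `offset` and ignore the
/// first `skip` lanes, because an earlier load already covered them.
/// Returns `None` when the haystack cannot fill even one register.
fn overlapping_loads(len: usize, width: usize) -> Option<Vec<(usize, usize)>> {
    if width == 0 || len < width {
        return None;
    }
    let mut loads = Vec::new();
    let mut offset = 0;
    // Full, non-overlapping loads.
    while offset + width <= len {
        loads.push((offset, 0));
        offset += width;
    }
    // If bytes remain, slide the final load back so it ends exactly at
    // the end of the haystack instead of reading past it.
    if offset < len {
        let start = len - width;
        loads.push((start, offset - start));
    }
    Some(loads)
}
```

With the imaginary 4-byte registers and a 10-byte haystack, this yields loads at offsets 0, 4 and 6. The last load repeats bytes 6 and 7, so its first two lanes are the ones a bitmask would exclude.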

What if the haystack is too small to even fill a single register? In this case, we can’t use our overlapping trick since there’s nothing to overlap with. Our solution is straightforward: while we were primarily targeting AVX2, which can store 32 bytes in a register, we can easily move down to another instruction set with smaller registers that the haystack can fit into. In reality, we don’t currently go any smaller than SSE2. Beyond this, we instead use an implementation of the Rabin-Karp searching algorithm which appears to perform well.

Instruction set	Register size
AVX2	32 bytes
SSE2	16 bytes
SWAR (u64)	8 bytes
SWAR (u32)	4 bytes
…	…

Register sizes in different SIMD instruction sets [3]. We did not consider AVX512 since support for this is not widespread enough.
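For haystacks too short for any register, the text above falls back to Rabin-Karp. The post does not say which variant is used, so the following is only a textbook rolling-hash sketch; the function name `rabin_karp`, the base 257 and the wrapping 32-bit arithmetic are all assumptions.

```rust
/// Textbook Rabin-Karp substring search over raw bytes.
/// Hash arithmetic wraps modulo 2^32; BASE is an arbitrary choice.
fn rabin_karp(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    const BASE: u32 = 257;
    let n = needle.len();
    if n == 0 {
        return Some(0);
    }
    if n > haystack.len() {
        return None;
    }
    // Polynomial hash: b0*BASE^(n-1) + b1*BASE^(n-2) + ... + b(n-1).
    let hash = |bytes: &[u8]| {
        bytes
            .iter()
            .fold(0u32, |h, &b| h.wrapping_mul(BASE).wrapping_add(b as u32))
    };
    // BASE^(n-1), the weight of the byte leaving the window.
    let high = (1..n).fold(1u32, |p, _| p.wrapping_mul(BASE));

    let target = hash(needle);
    let mut window = hash(&haystack[..n]);
    for start in 0..=haystack.len() - n {
        // Hashes can collide, so confirm with a real comparison.
        if window == target && &haystack[start..start + n] == needle {
            return Some(start);
        }
        if start + n < haystack.len() {
            // Roll: drop haystack[start], append haystack[start + n].
            window = window
                .wrapping_sub((haystack[start] as u32).wrapping_mul(high))
                .wrapping_mul(BASE)
                .wrapping_add(haystack[start + n] as u32);
        }
    }
    None
}
```

Each step updates the window hash in constant time, so the full byte comparison only runs when the hashes agree.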

Is it always fast?

Choosing the first and last bytes of the needle to rule out potential matches is a great idea. It means that when it does come to performing a memcmp, we can ignore these, as we know they already match. Unfortunately, as Muła points out, this also makes the algorithm susceptible to a worst-case attack in some instances.

Let’s give an expression that a customer might write to illustrate this.

http.request.uri.path contains "/wp-admin/"

If we try to search for this within a very long sequence of ‘/’s, we will find a potential match in every position and make lots of calls to memcmp, essentially performing a slow brute-force substring search.

Clearly we need to choose different bytes from the needle. But which ones should we choose? For each choice, an adversary can always find a slightly different, but equally troublesome, worst case. We instead use randomness to throw off our would-be adversary, picking the first byte of the needle as before, but then choosing another random byte to use.
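To see why the choice of the second byte matters, the sketch below counts how many start positions survive the two-byte filter for a given probe index. The function `count_candidates` is a hypothetical helper; here the probe index is passed in explicitly so the effect is reproducible, whereas the approach described above picks it at random.

```rust
/// Count start positions that survive the two-byte filter: the needle's
/// first byte and the byte at index `probe` must both match. Each
/// survivor would cost one full comparison.
/// (`probe` is a parameter here only to keep the example deterministic.)
fn count_candidates(haystack: &[u8], needle: &[u8], probe: usize) -> usize {
    let n = needle.len();
    if n == 0 || n > haystack.len() || probe >= n {
        return 0;
    }
    (0..=haystack.len() - n)
        .filter(|&start| {
            haystack[start] == needle[0] && haystack[start + probe] == needle[probe]
        })
        .count()
}
```

For a haystack of 1,000 slashes and the needle `/wp-admin/`, probing the last byte leaves all 991 start positions as candidates, while probing index 1 (the `w`) leaves none. An adversary who knows the probe index can craft a similar worst case for any fixed choice, which is the motivation for choosing it randomly.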

Our new version is unsurprisingly slower than Muła’s, yet it still exhibits a great improvement over both the memmem and regex crates. Performance, but without sacrificing safety.

Benchmark	Improvement with sse4-strstr (original)	Improvement with sliceslice (our version)
Expressions using `contains` operator	72.3%	49.1%
‘Simple’ expressions	0.0%	0.1%
All expressions	31.6%	24.0%

Improvements in instruction count from using sse4-strstr and sliceslice over the memmem crate with Wirefilter.

What’s next?

This is only a small taste of the performance work we’ve been doing, and we have much more yet to come. Nevertheless, none of this would have been possible without the support of my manager Richard and my mentor Elie, who contributed a lot of these ideas. I’ve learned so much over the past few months, but most of all that Cloudflare is an amazing place to be an intern!

[1] Since our benchmarks are not run within a production environment, results in this post do not represent traffic on our edge.

[2] We found instruction counts to be a particularly stable measure, and CPU cycles a particularly unstable one.

[3] Note that SWAR is technically not an instruction set, but instead uses regular registers like vector registers.