Tag Archives: nginx

Cloudflare just got faster and more secure, powered by Rust

Post Syndicated from Richard Boulton original https://blog.cloudflare.com/20-percent-internet-upgrade/

Cloudflare is relentless about building and running the world’s fastest network. We have been tracking and reporting on our network performance since 2021: you can see the latest update here.

Building the fastest network requires work in many areas. We invest a lot of time in our hardware, to have efficient and fast machines. We invest in peering arrangements, to make sure we can talk to every part of the Internet with minimal delay. On top of this, we also have to invest in the software we run our network on, especially as each new product can otherwise add more processing delay.

No matter how fast messages arrive, we introduce a bottleneck if that software takes too long to think about how to process and respond to requests. Today we are excited to share a significant upgrade to our software that cuts the median time we take to respond by 10ms and delivers a 25% performance boost, as measured by third-party CDN performance tests.

We’ve spent the last year rebuilding major components of our system, and we’ve just slashed the latency of traffic passing through our network for millions of our customers. At the same time, we’ve made our system more secure, and we’ve reduced the time it takes for us to build and release new products. 

Where did we start?

Every request that hits Cloudflare starts a journey through our network. It might come from a browser loading a webpage, a mobile app calling an API, or automated traffic from another service. These requests first terminate at our HTTP and TLS layer, then pass into a system we call FL, and finally through Pingora, which performs cache lookups or fetches data from the origin if needed.


FL is the brain of Cloudflare. Once a request reaches FL, we then run the various security and performance features in our network. It applies each customer’s unique configuration and settings, from enforcing WAF rules and DDoS protection to routing traffic to the Developer Platform and R2.

Built more than 15 years ago, FL has been at the core of Cloudflare’s network. It enables us to deliver a broad range of features, but over time that flexibility became a challenge. As we added more products, FL grew harder to maintain, slower to process requests, and more difficult to extend. Each new feature required careful checks across existing logic, and every addition introduced a little more latency, making it increasingly difficult to sustain the performance we wanted.

You can see how FL is key to our system — we’ve often called it the “brain” of Cloudflare. It’s also one of the oldest parts of our system: the first commit to the codebase was made by one of our founders, Lee Holloway, well before our initial launch. We’re celebrating our 15th Birthday this week – this system started 9 months before that!

commit 39c72e5edc1f05ae4c04929eda4e4d125f86c5ce
Author: Lee Holloway <q@t60.(none)>
Date:   Wed Jan 6 09:57:55 2010 -0800

    nginx-fl initial configuration

As the commit implies, the first version of FL was implemented based on the NGINX webserver, with product logic implemented in PHP.  After 3 years, the system became too complex to manage effectively, and too slow to respond, and an almost complete rewrite of the running system was performed. This led to another significant commit, this time made by Dane Knecht, who is now our CTO.

commit bedf6e7080391683e46ab698aacdfa9b3126a75f
Author: Dane Knecht
Date:   Thu Sep 19 19:31:15 2013 -0700

    remove PHP.

From this point on, FL was implemented using NGINX, the OpenResty framework, and LuaJIT.  While this was great for a long time, over the last few years it started to show its age. We had to spend increasing amounts of time fixing or working around obscure bugs in LuaJIT. The highly dynamic and unstructured nature of our Lua code, which was a blessing when first trying to implement logic quickly, became a source of errors and delay when trying to integrate large amounts of complex product logic. Each time a new product was introduced, we had to go through all the other existing products to check if they might be affected by the new logic.

It was clear that we needed a rethink. So, in July 2024, we cut an initial commit for a brand new, and radically different, implementation. To save time agreeing on a new name for this, we just called it “FL2”, and started, of course, referring to the original FL as “FL1”.

commit a72698fc7404a353a09a3b20ab92797ab4744ea8
Author: Maciej Lechowski
Date:   Wed Jul 10 15:19:28 2024 +0100

    Create fl2 project

Rust and rigid modularization

We weren’t starting from scratch. We’ve previously blogged about how we replaced another one of our legacy systems with Pingora, which is built in the Rust programming language, using the Tokio runtime. We’ve also blogged about Oxy, our internal framework for building proxies in Rust. We write a lot of Rust, and we’ve gotten pretty good at it.

We built FL2 in Rust, on Oxy, and built a strict module framework to structure all the logic in FL2.

Why Oxy?

When we set out to build FL2, we knew we weren’t just replacing an old system; we were rebuilding the foundations of Cloudflare. That meant we needed more than just a proxy; we needed a framework that could evolve with us, handle the immense scale of our network, and let teams move quickly without sacrificing safety or performance. 

Oxy gives us a powerful combination of performance, safety, and flexibility. Built in Rust, it eliminates entire classes of bugs that plagued our NGINX/LuaJIT-based FL1, like memory safety issues and data races, while delivering C-level performance. At Cloudflare’s scale, those guarantees aren’t nice-to-haves; they’re essential. Every microsecond saved per request translates into tangible improvements in user experience, and every crash or edge case avoided keeps the Internet running smoothly. Rust’s strict compile-time guarantees also pair perfectly with FL2’s modular architecture, where we enforce clear contracts between product modules and their inputs and outputs.

But the choice wasn’t just about language. Oxy is the culmination of years of experience building high-performance proxies. It already powers several major Cloudflare services, from our Zero Trust Gateway to Apple’s iCloud Private Relay, so we knew it could handle the diverse traffic patterns and protocol combinations that FL2 would see. Its extensibility model lets us intercept, analyze, and manipulate traffic from layer 3 up to layer 7, and even decapsulate and reprocess traffic at different layers. That flexibility is key to FL2’s design because it means we can treat everything from HTTP to raw IP traffic consistently and evolve the platform to support new protocols and features without rewriting fundamental pieces.

Oxy also comes with a rich set of built-in capabilities that previously required large amounts of bespoke code. Things like monitoring, soft reloads, dynamic configuration loading and swapping are all part of the framework. That lets product teams focus on the unique business logic of their module rather than reinventing the plumbing every time. This solid foundation means we can make changes with confidence, ship them quickly, and trust they’ll behave as expected once deployed.

Smooth restarts – keeping the Internet flowing

One of the most impactful improvements Oxy brings is handling of restarts. Any software under continuous development and improvement will eventually need to be updated. In desktop software, this is easy: you close the program, install the update, and reopen it. On the web, things are much harder. Our software is in constant use and cannot simply stop. A dropped HTTP request can cause a page to fail to load, and a broken connection can kick you out of a video call. Reliability is not optional.

In FL1, upgrades meant restarts of the proxy process. Restarting a proxy meant terminating the process entirely, which immediately broke any active connections. That was particularly painful for long-lived connections such as WebSockets, streaming sessions, and real-time APIs. Even planned upgrades could cause user-visible interruptions, and unplanned restarts during incidents could be even worse.

Oxy changes that. It includes a built-in mechanism for graceful restarts that lets us roll out new versions without dropping connections whenever possible. When a new instance of an Oxy-based service starts up, the old one stops accepting new connections but continues to serve existing ones, allowing those sessions to continue uninterrupted until they end naturally.

This means that if you have an ongoing WebSocket session when we deploy a new version, the restart does not tear that session down. Across Cloudflare’s fleet, deployments are orchestrated over several hours, so the aggregate rollout is smooth and nearly invisible to end users.
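
The drain behaviour described above can be sketched with two counters. This is a minimal illustration, not Oxy's actual implementation: the `DrainState` type and its method names are hypothetical, and a real proxy would stop accepting by closing or handing off its listening socket rather than checking a flag.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical bookkeeping for an old proxy instance during a graceful restart.
pub struct DrainState {
    accepting: AtomicBool,
    active: AtomicUsize,
}

impl DrainState {
    pub fn new() -> Self {
        DrainState {
            accepting: AtomicBool::new(true),
            active: AtomicUsize::new(0),
        }
    }

    /// Called for each incoming connection; refuses it once draining has begun.
    pub fn try_accept(&self) -> bool {
        if self.accepting.load(Ordering::SeqCst) {
            self.active.fetch_add(1, Ordering::SeqCst);
            true
        } else {
            false
        }
    }

    /// Called when a connection ends naturally.
    pub fn connection_closed(&self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }

    /// Called once the new instance has started: stop taking new connections.
    pub fn begin_drain(&self) {
        self.accepting.store(false, Ordering::SeqCst);
    }

    /// The old process may exit only when it is draining and nothing is in flight.
    pub fn can_exit(&self) -> bool {
        !self.accepting.load(Ordering::SeqCst) && self.active.load(Ordering::SeqCst) == 0
    }
}
```

In this sketch a long-lived WebSocket keeps `can_exit()` false until `connection_closed()` is called for it, which is the property that lets existing sessions survive a deployment.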

We take this a step further by using systemd socket activation. Instead of letting each proxy manage its own sockets, we let systemd create and own them. This decouples the lifetime of sockets from the lifetime of the Oxy application itself. If an Oxy process restarts or crashes, the sockets remain open and ready to accept new connections, which will be served as soon as the new process is running. That eliminates the “connection refused” errors that could happen during restarts in FL1 and improves overall availability during upgrades.
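
As a rough sketch of what socket activation looks like from the application side: systemd passes inherited sockets as file descriptors starting at 3 and announces them through the `LISTEN_PID` and `LISTEN_FDS` environment variables. The helper names below are hypothetical and this is not Oxy's code; it only illustrates the protocol.

```rust
/// First file descriptor systemd hands over (SD_LISTEN_FDS_START in sd_listen_fds).
const LISTEN_FDS_START: i32 = 3;

/// Returns the inherited descriptors, or an empty list if the variables are
/// missing, malformed, or addressed to a different process.
pub fn inherited_fds(listen_pid: Option<&str>, listen_fds: Option<&str>, my_pid: u32) -> Vec<i32> {
    let pid_matches = listen_pid
        .and_then(|p| p.parse::<u32>().ok())
        .map_or(false, |p| p == my_pid);
    if !pid_matches {
        return Vec::new();
    }
    let count = listen_fds.and_then(|n| n.parse::<i32>().ok()).unwrap_or(0);
    (LISTEN_FDS_START..LISTEN_FDS_START + count).collect()
}

/// Convenience wrapper that reads the real environment of the current process.
pub fn inherited_fds_from_env() -> Vec<i32> {
    let pid = std::env::var("LISTEN_PID").ok();
    let fds = std::env::var("LISTEN_FDS").ok();
    inherited_fds(pid.as_deref(), fds.as_deref(), std::process::id())
}
```

Each returned descriptor could then be wrapped with `std::net::TcpListener::from_raw_fd` (an `unsafe` call on Unix). Because systemd owns the socket, connections arriving while the process restarts wait in the kernel's accept queue instead of being refused.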

We also built our own coordination mechanism in Rust, shellflip, to replace Go libraries like tableflip. This uses a restart coordination socket that validates configuration, spawns new instances, and ensures the new version is healthy before the old one shuts down. This improves feedback loops and lets our automation tools detect and react to failures immediately, rather than relying on blind signal-based restarts.

Composing FL2 from Modules

To avoid the problems we had in FL1, we wanted a design where all interactions between product logic were explicit and easy to understand. 

So, on top of the foundations provided by Oxy, we built a platform which separates all the logic built for our products into well-defined modules. After some experimentation and research, we designed a module system which enforces some strict rules:

  • No IO (input or output) can be performed by the module.

  • The module provides a list of phases.

  • Phases are evaluated in a strictly defined order, which is the same for every request.

  • Each phase defines a set of inputs which the platform provides to it, and a set of outputs which it may emit.

Here’s an example of what a module phase definition looks like:

Phase {
    name: phases::SERVE_ERROR_PAGE,
    request_types_enabled: PHASE_ENABLED_FOR_REQUEST_TYPE,
    inputs: vec![
        InputKind::IPInfo,
        InputKind::ModuleValue(
            MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE.as_str(),
        ),
        InputKind::ModuleValue(MODULE_VALUE_ORIGINAL_SERVE_RESPONSE.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_UPSTREAM_ERROR_DETAILS.as_str()),
        InputKind::RayId,
        InputKind::StatusCode,
        InputKind::Visitor,
    ],
    outputs: vec![OutputValue::ServeResponse],
    filters: vec![],
    func: phase_serve_error_page::callback,
}

This phase is for our custom error page product.  It takes a few things as input — information about the IP of the visitor, some header and other HTTP information, and some “module values.” Module values allow one module to pass information to another, and they’re key to making the strict properties of the module system workable. For example, this module needs some information that is produced by the output of our rulesets-based custom errors product (the “MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT” input). These input and output definitions are enforced at compile time.

While these rules are strict, we’ve found that we can implement all our product logic within this framework. The benefit of doing so is that we can immediately tell which products might affect each other.
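
To illustrate why declared inputs and outputs make cross-product effects visible, here is a small sketch. The `PhaseDecl` type, the `affected_by` function, and the string value names are all hypothetical simplifications, not the FL2 platform's real types.

```rust
/// A simplified, hypothetical phase declaration: which values it reads and writes.
pub struct PhaseDecl {
    pub module: &'static str,
    pub inputs: Vec<&'static str>,
    pub outputs: Vec<&'static str>,
}

/// Names of the modules that read at least one value written by `producer`.
pub fn affected_by(phases: &[PhaseDecl], producer: &str) -> Vec<&'static str> {
    // Everything the producing module declares as an output.
    let produced: Vec<&str> = phases
        .iter()
        .filter(|p| p.module == producer)
        .flat_map(|p| p.outputs.iter().copied())
        .collect();

    // Any other module that declares one of those values as an input.
    let mut affected: Vec<&'static str> = phases
        .iter()
        .filter(|p| p.module != producer)
        .filter(|p| p.inputs.iter().any(|i| produced.contains(i)))
        .map(|p| p.module)
        .collect();
    affected.sort();
    affected.dedup();
    affected
}
```

Because a module cannot read anything it has not declared, a query like this over the declarations is enough to list every product that a change could reach.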

How to replace a running system

Building a framework is one thing. Building all the product logic and getting it right, so that customers don’t notice anything other than a performance improvement, is another.

The FL code base supports 15 years of Cloudflare products, and it’s changing all the time. We couldn’t stop development. So, one of our first tasks was to find ways to make the migration easier and safer.

Step 1 – Rust modules in OpenResty

It’s a big enough distraction from shipping products to customers to rebuild product logic in Rust. Asking all our teams to maintain two versions of their product logic, and to reimplement every change a second time until we finished our migration, was too much.

So, we implemented a layer in our old NGINX and OpenResty based FL which allowed the new modules to be run. Instead of maintaining a parallel implementation, teams could implement their logic in Rust, and replace their old Lua logic with that, without waiting for the full replacement of the old system.

For example, here’s part of the implementation for the custom error page module phase defined earlier (we’ve cut out some of the more boring details, so this doesn’t quite compile as-written):

pub(crate) fn callback(_services: &mut Services, input: &Input<'_>) -> Output {
    // Rulesets produced a response to serve - this can either come from a special
    // Cloudflare worker for serving custom errors, or be directly embedded in the rule.
    if let Some(rulesets_params) = input
        .get_module_value(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT)
        .cloned()
    {
        // Select either the result from the special worker, or the parameters embedded
        // in the rule.
        let body = input
            .get_module_value(MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE)
            .and_then(|response| {
                handle_custom_errors_fetch_response("rulesets", response.to_owned())
            })
            .or(rulesets_params.body);

        // If we were able to load a body, serve it, otherwise let the next bit of logic
        // handle the response
        if let Some(body) = body {
            let final_body = replace_custom_error_tokens(input, &body);

            // Increment a metric recording number of custom error pages served
            custom_pages::pages_served("rulesets").inc();

            // Return a phase output with one final action, causing an HTTP response to be served.
            return Output::from(TerminalAction::ServeResponse(ResponseAction::OriginError {
                rulesets_params.status,
                source: "rulesets http_custom_errors",
                headers: rulesets_params.headers,
                body: Some(Bytes::from(final_body)),
            }));
        }
    }
}

The internal logic in each module is quite cleanly separated from the handling of data, with very clear and explicit error handling encouraged by the design of the Rust language.

Many of our most actively developed modules were handled this way, allowing the teams to maintain their change velocity during our migration.

Step 2 – Testing and automated rollouts


It’s essential to have a seriously powerful test framework to cover such a migration.  We built a system, internally named Flamingo, which allows us to run thousands of full end-to-end test requests concurrently against our production and pre-production systems. The same tests run against FL1 and FL2, giving us confidence that we’re not changing behaviours.

Whenever we deploy a change, that change is rolled out gradually across many stages, with increasing amounts of traffic. Each stage is automatically evaluated, and only passes when the full set of tests have been successfully run against it – as well as overall performance and resource usage metrics being within acceptable bounds. This system is fully automated, and pauses or rolls back changes if the tests fail.
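
The gate applied at each stage can be sketched as a pure function. The `StageReport` fields and the single `max_ratio` threshold are assumptions made for illustration; the real system evaluates many more signals.

```rust
#[derive(Debug, PartialEq)]
pub enum StageOutcome {
    /// Continue to the next stage with more traffic.
    Promote,
    /// Pause or roll back the change.
    Halt,
}

/// Hypothetical summary of one rollout stage.
pub struct StageReport {
    pub tests_run: u32,
    pub tests_failed: u32,
    /// CPU use of the new version divided by the baseline.
    pub cpu_ratio: f64,
    /// Memory use of the new version divided by the baseline.
    pub memory_ratio: f64,
}

/// A stage passes only if every test passed and resource use stayed within bounds.
pub fn evaluate_stage(report: &StageReport, max_ratio: f64) -> StageOutcome {
    let tests_ok = report.tests_run > 0 && report.tests_failed == 0;
    let metrics_ok = report.cpu_ratio <= max_ratio && report.memory_ratio <= max_ratio;
    if tests_ok && metrics_ok {
        StageOutcome::Promote
    } else {
        StageOutcome::Halt
    }
}
```

Treating a stage with zero completed tests as a failure is a deliberate choice in this sketch: the absence of evidence should not promote a change.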

The benefit is that we’re able to build and ship new product features in FL2 within 48 hours – where it would have taken weeks in FL1. In fact, at least one of the announcements this week involved such a change!

Step 3 – Fallbacks

Over 100 engineers have worked on FL2, and we have over 130 modules. And we’re not quite done yet. We’re still putting the final touches on the system, to make sure it replicates all the behaviours of FL1.

So how do we send traffic to FL2 without it being able to handle everything? If FL2 receives a request, or a piece of configuration for a request, that it doesn’t know how to handle, it gives up and does what we’ve called a fallback – it passes the whole thing over to FL1. It does this at the network level – it just passes the bytes on to FL1.

As well as making it possible for us to send traffic to FL2 without it being fully complete, this has another massive benefit. When we have implemented a piece of new functionality in FL2, but want to double check that it is working the same as in FL1, we can evaluate the functionality in FL2, and then trigger a fallback. We are able to compare the behaviour of the two systems, allowing us to get a high confidence that our implementation was correct.
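
The two uses of fallback described above can be sketched as a routing decision. The `Route` enum, the feature names, and the idea of representing features as strings are hypothetical; the sketch only captures the decision logic, not how the bytes are handed to FL1.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum Route {
    /// FL2 handles the request end to end.
    ServeFromFl2,
    /// FL2 cannot handle something the request needs, so FL1 serves it.
    FallbackToFl1,
    /// FL2 evaluates the new logic for comparison, then FL1 serves the request.
    EvaluateThenFallback,
}

/// Decide how to route a request given the features it requires.
pub fn route(
    required: &[&str],
    supported: &HashSet<&str>,
    under_verification: &HashSet<&str>,
) -> Route {
    if required.iter().any(|f| !supported.contains(*f)) {
        return Route::FallbackToFl1;
    }
    if required.iter().any(|f| under_verification.contains(*f)) {
        return Route::EvaluateThenFallback;
    }
    Route::ServeFromFl2
}
```

In this sketch, watching the share of `FallbackToFl1` decisions fall over time corresponds to the fallback rate going down as more functionality is completed.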

Step 4 – Rollout

We started running customer traffic through FL2 early in 2025, and have been progressively increasing the amount of traffic served throughout the year. Essentially, we’ve been watching two graphs: one with the proportion of traffic routed to FL2 going up, and another with the proportion of traffic failing to be served by FL2 and falling back to FL1 going down.

We started this process by passing traffic for our free customers through the system. We were able to prove that the system worked correctly, and drive the fallback rates down for our major modules. Our Cloudflare Community MVPs acted as an early warning system, smoke testing and flagging when they suspected the new platform might be the cause of a new reported problem. Crucially, their support allowed our team to investigate quickly, apply targeted fixes, or confirm the move to FL2 was not to blame.

We then advanced to our paying customers, gradually increasing the number of customers using the system. We also worked closely with some of our largest customers, who wanted the performance benefits of FL2, and onboarded them early in exchange for lots of feedback on the system.

Right now, most of our customers are using FL2. We still have a few features to complete, and are not quite ready to onboard everyone, but our target is to turn off FL1 within a few more months.

Impact of FL2

As we described at the start of this post, FL2 is substantially faster than FL1. The biggest reason for this is simply that FL2 performs less work. You might have noticed in the module definition example a line

    filters: vec![],

Every module is able to provide a set of filters, which control whether they run or not. This means that we don’t run logic for every product for every request — we can very easily select just the required set of modules. The incremental cost for each new product we develop has gone away.
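
A minimal sketch of filter-based module selection follows. The `Request` and `Module` types are hypothetical, and treating an empty filter list as "always run" is an assumption of this sketch rather than something the platform is documented to do.

```rust
/// Hypothetical, heavily simplified view of a request.
pub struct Request {
    pub status: u16,
    pub is_websocket: bool,
}

/// A module together with the predicates that decide whether it runs.
pub struct Module {
    pub name: &'static str,
    pub filters: Vec<fn(&Request) -> bool>,
}

/// Select only the modules whose filters all accept this request.
/// Assumption: a module with no filters runs for every request.
pub fn modules_to_run(modules: &[Module], req: &Request) -> Vec<&'static str> {
    modules
        .iter()
        .filter(|m| m.filters.iter().all(|f| f(req)))
        .map(|m| m.name)
        .collect()
}
```

Modules that are filtered out cost only the evaluation of their predicates, which is why adding a product no longer adds work to requests that do not use it.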

Another huge reason for better performance is that FL2 is a single codebase, implemented in a performance-focused language. In comparison, FL1 was based on NGINX (which is written in C), combined with LuaJIT (Lua, and C interface layers), and also contained plenty of Rust modules. In FL1, we spent a lot of time and memory converting data from the representation needed by one language to the representation needed by another.

As a result, our internal measures show that FL2 uses less than half the CPU of FL1, and much less than half the memory. That’s a huge bonus — we can spend the CPU on delivering more and more features for our customers!

How do we measure if we are getting better?

Using our own tools and independent benchmarks like CDNPerf, we measured the impact of FL2 as we rolled it out across the network. The results are clear: websites are responding 10 ms faster at the median, a 25% performance boost.


Security

FL2 is also more secure by design than FL1. No software system is perfect, but the Rust language brings us huge benefits over LuaJIT. Rust has strong compile-time memory checks and a type system that avoids large classes of errors. Combine that with our rigid module system, and we can make most changes with high confidence.

Of course, no system is secure if used badly. It’s easy to write code in Rust that causes memory corruption. To reduce risk, we maintain strong compile-time linting and checking, together with strict coding standards, testing and review processes.

We have long followed a policy that any unexplained crash of our systems needs to be investigated as a high priority. We won’t be relaxing that policy, though the main cause of novel crashes in FL2 so far has been hardware failure. The massively reduced rates of such crashes will give us time to do a good job of such investigations.

What’s next?

We’re spending the rest of 2025 completing the migration from FL1 to FL2, and will turn off FL1 in early 2026. We’re already seeing the benefits in terms of customer performance and speed of development, and we’re looking forward to giving these to all our customers.

We have one last service to completely migrate. The “HTTP & TLS Termination” box from the diagram way back at the top is also an NGINX service, and we’re midway through a rewrite in Rust. We’re making good progress on this migration, and expect to complete it early next year.

After that, when everything is modular, in Rust and tested and scaled, we can really start to optimize! We’ll reorganize and simplify how the modules connect to each other, expand support for non-HTTP traffic like RPC and streams, and much more. 

If you’re interested in being part of this journey, check out our careers page for open roles. We’re always looking for new talent to help us build a better Internet.

Upgrading one of the oldest components in Cloudflare’s software stack

Post Syndicated from Maciej Lechowski original https://blog.cloudflare.com/upgrading-one-of-the-oldest-components-in-cloudflare-software-stack/

Cloudflare serves a huge amount of traffic: 45 million HTTP requests per second on average (as of 2023; 61 million at peak) from more than 285 cities in over 100 countries. What inevitably happens with that kind of scale is that software will be pushed to its limits. As we grew, one of the problems we faced was related to deploying our code. Sometimes, a release would be delayed because of inadequate hardware resources on our servers. Buying more and more hardware is expensive and there are limits to e.g. how much memory we can realistically have on a server. In this article, we explain how we optimised our software and its release process so that no additional resources are needed.

In order to handle traffic, each of our servers runs a set of specialised proxies. Historically, they were based on NGINX, but increasingly they include services created in Rust. Out of our proxy applications, FL (Front Line) is the oldest and still has a broad set of responsibilities.

At its core, it’s one of the last uses of NGINX at Cloudflare. It contains a large amount of business logic that runs many Cloudflare products, using a variety of Lua and Rust libraries. As a result, it consumes a large amount of system resources: up to 50-60 GiB of RAM. As FL grew, it became more and more difficult to release. The standard upgrade mechanism requires double the memory that FL uses at runtime, and that memory is sometimes not available. This was causing delays in releases. We have now improved the release procedure of FL and, in effect, removed the need for additional memory during the release process, improving its speed and performance.

Architecture

To accomplish all of its work, each FL instance runs many workers (individual processes). By design, individual processes handle requests while the master process controls them and makes sure they stay up. This allows NGINX to handle more traffic by adding more workers. We take advantage of this architecture.

The number of workers depends on the total number of CPU cores present. It’s typically equal to half of the CPU cores available, e.g. on a 128-core CPU we use 64 FL workers.

So far so good, what’s the problem then?

We aim to deploy code in a way that’s transparent to our customers. We want to continue serving requests without interruptions. In practice, this means briefly running both versions of FL at the same time during the upgrade, so that we can flawlessly transition from one version to another. As soon as the new instance is up and running, we begin to shut down the old one, giving it some time to finish its work. In the end, only the new version is left running. NGINX implements this procedure and FL makes use of it.

After a new version of FL is installed on a server, the upgrade procedure starts. NGINX’s implementation revolves around communicating with the master process using signals. The upgrade process starts by sending the USR2 signal which will start the new master process and its workers.

At that point, as can be seen below, both versions will be running and processing requests. The unfortunate side effect of this is the memory footprint has been doubled.

  PID  PPID COMMAND
33126     1 nginx: master process /usr/local/nginx/sbin/nginx
33134 33126 nginx: worker process (nginx)
33135 33126 nginx: worker process (nginx)
33136 33126 nginx: worker process (nginx)
36264 33126 nginx: master process /usr/local/nginx/sbin/nginx
36265 36264 nginx: worker process (nginx)
36266 36264 nginx: worker process (nginx)
36267 36264 nginx: worker process (nginx)

Then, the WINCH signal is sent to the old master process, which asks its workers to shut down gracefully. Eventually, they all quit, leaving just the original master process running (it can then be shut down with the QUIT signal). A successful upgrade leaves just the new instance running, which looks similar to this:

  PID  PPID COMMAND
36264     1 nginx: master process /usr/local/nginx/sbin/nginx
36265 36264 nginx: worker process (nginx)
36266 36264 nginx: worker process (nginx)
36267 36264 nginx: worker process (nginx)
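
The stock sequence described above can be written down as three signals sent to the old master process. The sketch below only builds the equivalent `kill` command lines; the function name is hypothetical, and in practice an operator or deployment tool waits between steps and checks that the new instance is healthy before continuing.

```rust
/// Builds the shell commands for the standard NGINX binary upgrade,
/// all addressed to the OLD master process.
pub fn standard_upgrade_commands(old_master_pid: u32) -> Vec<String> {
    // USR2:  start a new master process (and its workers) from the new binary.
    // WINCH: ask the old master to shut its workers down gracefully.
    // QUIT:  shut down the old master once its workers have exited.
    ["USR2", "WINCH", "QUIT"]
        .iter()
        .map(|signal| format!("kill -{} {}", signal, old_master_pid))
        .collect()
}
```

Between the first and second step both sets of workers are alive, which is where the doubled memory footprint comes from.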

The standard NGINX upgrade mechanism is visualised in this diagram:

[Diagram: the standard NGINX upgrade mechanism]

It’s also clearly visible in the memory usage graph below (notice the large bump during the upgrade).

[Graph: memory usage during a standard upgrade, with a large bump while both versions run]

The mechanism outlined above has both versions running at the same time for a while. When both sets of workers are running, they are still sharing the same sockets, so all of them accept requests. As the release progresses, ‘old’ workers stop listening and accepting new requests (at that point only the new workers accept new requests). As we release new code multiple times per week, this process is effectively doubling up our memory requirements. At our scale, it’s easy to see how multiplying this event by the number of servers we operate results in an immense waste of memory resources.

In addition, sometimes servers would take hours to upgrade (a concern especially when we need to release something quickly), as we are waiting to have enough memory available to kick off the reload action.

New upgrade mechanism

We solved this problem by modifying NGINX’s method for upgrading the executable. Here’s how it works.

The crux of the problem is that NGINX treats the entire instance (master + workers) as one. When upgrading, we need to start all the workers whilst all the previous ones are still running. Considering the number of workers we use and how heavy they are, this is not sustainable.

So, instead, we modified NGINX to be able to control individual workers. Rather than starting and stopping them all at once, we can do so by selecting them individually. To accomplish this, the master process and workers understand additional signals compared to the ones NGINX uses. The actual mechanism to accomplish this in NGINX is nearly the same as when handling workers in bulk. However, there’s a crucial difference.

Typically, NGINX’s master process ensures that the right number of workers is actually running (per configuration). If any of them crashes, it will be restarted. This is good, but it doesn’t work for our new upgrade mechanism because when we need to shut down a single worker, we don’t want the NGINX master process to think that a worker has crashed and needs to be restarted. So we introduced a signal to disable that behaviour in NGINX while we’re shutting down a single process.

Step by step

Our improved mechanism handles each worker individually. Here are the steps:

  1. At the beginning, we have an instance of FL running 64 workers.
  2. Disable the feature to automatically restart workers which exit.
  3. Shut down a worker from the first (old) instance of FL. We’re down to 63 workers.
  4. Create a new instance of FL but only with a single worker. We’re back to 64 workers but including one from the new version.
  5. Re-enable the feature to automatically restart worker processes which exit.
  6. We continue this pattern of shutting down a worker from an older instance and creating a new one to replace it. This can be visualised in the diagram below.

[Diagram: workers replaced one at a time during the upgrade]

We can observe our new mechanism in action in the graph below. Thanks to our new procedure, our use of memory remains stable during the release.

[Graph: memory usage remaining stable during an upgrade with the new mechanism]
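
To make the difference concrete, the following sketch counts the peak number of worker processes alive at once under each procedure. It is a toy model with hypothetical function names: it assumes all workers use roughly the same memory and ignores the master processes.

```rust
/// Stock NGINX upgrade: all new workers start while all old ones still run.
pub fn peak_workers_standard(workers: usize) -> usize {
    workers + workers
}

/// Modified upgrade: stop one old worker, then start one new worker, and repeat.
pub fn peak_workers_rolling(workers: usize) -> usize {
    let mut old = workers;
    let mut new = 0usize;
    let mut peak = old + new;
    while old > 0 {
        old -= 1; // shut down a worker from the old instance
        peak = peak.max(old + new);
        new += 1; // start its replacement in the new instance
        peak = peak.max(old + new);
    }
    peak
}
```

With 64 workers the stock procedure peaks at 128 processes, while the rolling procedure never exceeds 64, which matches the flat memory graph.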

But why do we begin by shutting down a worker belonging to an older instance (v1)? This turns out to be important.

Worker-CPU pinning

During this workflow, we also had to take care of CPU pinning. FL workers on our servers are pinned to CPU cores with one process occupying one CPU core to help us distribute resources more efficiently. If we start a new worker first, it will share the CPU core with another one for a brief amount of time. This will make one CPU overloaded compared to other ones running FL, impacting the latency for requests served. That’s why we start by freeing up a CPU core which then can be taken over by a new worker rather than starting by creating a new worker.

For the same reason, pinning of worker processes to cores must be maintained throughout the whole operation. At no point can we have two different workers sharing a CPU core. We make sure this is the case by executing the whole procedure in the same order every time.

We start from CPU core 1 (or whichever is the first one used by FL) and do the following:

  1. Shut down a worker that’s running there.
  2. Create a new worker. NGINX will pin it to the CPU core we have freed up in the previous step.

After doing that for all workers, we end up with a new set of workers which are correctly pinned to their CPU cores.
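
The ordering argument can be checked with a small simulation. This is a toy model, assuming one pinned worker per core at the start; the function name and the `stop_old_first` flag are hypothetical.

```rust
/// Returns the largest number of workers that ever share one core
/// while every core's worker is replaced in turn.
pub fn max_workers_per_core(cores: usize, stop_old_first: bool) -> usize {
    // One old worker is pinned to each core before the upgrade starts.
    let mut per_core = vec![1i32; cores];
    let mut worst = per_core.iter().copied().max().unwrap_or(0);
    // -1 models stopping the old worker, +1 models starting the new one.
    let steps: [i32; 2] = if stop_old_first { [-1, 1] } else { [1, -1] };
    for core in 0..cores {
        for delta in steps.iter() {
            per_core[core] += *delta;
            worst = worst.max(per_core[core]);
        }
    }
    worst as usize
}
```

Stopping the old worker first keeps the result at 1 for any number of cores, whereas starting the new worker first briefly puts 2 workers on a core.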

Conclusion

At Cloudflare, we need to release new software multiple times per day across our fleet. The standard upgrade mechanism used by NGINX is not suitable at our scale. For this reason, we customised the process to avoid increasing the amount of memory needed to release FL. This enabled us to safely ship code whenever it’s needed, everywhere. The custom upgrade mechanism enables us to release a large application such as FL reliably regardless of how much memory we have available on an edge server. We showed that it’s possible to extend NGINX and its built-in upgrade mechanism to accommodate the unique requirements of Cloudflare.

If you enjoy solving challenging application infrastructure problems and want to help maintain the busiest web server in the world, we’re hiring!

ROFL with a LOL: rewriting an NGINX module in Rust

Post Syndicated from Sam Howson original https://blog.cloudflare.com/rust-nginx-module/

At Cloudflare, engineers spend a great deal of time refactoring or rewriting existing functionality. When your company doubles the amount of traffic it handles every year, what was once an elegant solution to a problem can quickly become outdated as the engineering constraints change. Not only that, but when you’re averaging 40 million requests a second, issues that might affect 0.001% of requests flowing through our network are big incidents which may impact millions of users, and one-in-a-trillion events happen several times a day.

Recently, we’ve been working on a replacement to one of our oldest and least-well-known components called cf-html, which lives inside the core reverse web proxy of Cloudflare known as FL (Front Line). Cf-html is the framework in charge of parsing and rewriting HTML as it streams back through from the website origin to the website visitor. Since the early days of Cloudflare, we’ve offered features which will rewrite the response body of web requests for you on the fly. The first ever feature we wrote in this way was to replace email addresses with chunks of JavaScript, which would then load the email address when viewed in a web browser. Since bots are often unable to evaluate JavaScript, this helps to prevent scraping of email addresses from websites. You can see this in action if you view the source of this page and look for this email address: [email protected].

FL is where most of the application infrastructure logic for Cloudflare runs, and largely consists of code written in the Lua scripting language, which runs on top of NGINX as part of OpenResty. In order to interface with NGINX directly, some parts (like cf-html) are written in lower-level languages like C and C++. In the past, there were many such OpenResty services at Cloudflare, but these days FL is one of the few left, as we move other components to Workers or Rust-based proxies. The platform that once was the best possible blend of developer ease and speed has more than started to show its age for us.

When discussing what happens to an HTTP request passing through our network and in particular FL, nearly all the attention is given to what happens up until the request reaches the customer’s origin. That’s understandable as this is where most of the business logic happens: firewall rules, Workers, and routing decisions all happen on the request. But it’s not the end of the story. From an engineering perspective, much of the more interesting work happens on the response, as we stream the HTML response back from the origin to the site visitor.

The logic to handle this is contained in a static NGINX module, and runs in the Response Body Filters phase in NGINX, as chunks of the HTTP response body are streamed through. Over time, more features were added, and the system became known as cf-html. cf-html uses a streaming HTML parser called Lazy HTML, or lhtml, to match on specific HTML tags and content, with much of the logic for both it and the cf-html features written using the Ragel state machine engine.

Memory safety

All the cf-html logic was written in C, and therefore was susceptible to memory corruption issues that plague many large C codebases. In 2017 this led to a security bug as the team was trying to replace part of cf-html. FL was reading arbitrary data from memory and appending it to response bodies. This could potentially include data from other requests passing through FL at the same time. This security event became known widely as Cloudbleed.

Since this episode, Cloudflare implemented a number of policies and safeguards to ensure something like that never happened again. While work has been carried out on cf-html over the years, there have been few new features implemented on the framework, and we’re now hyper-sensitive to crashes happening in FL (and, indeed, any other process running on our network), especially in parts that can reflect data back with a response.

Fast-forward to 2022 into 2023, and the FL Platform team have been getting more and more requests for a system they can easily use to look at and rewrite response body data. At the same time, another team has been working on a new response body parsing and rewriting framework for Workers called lol-html or Low Output Latency HTML. Not only is lol-html faster and more efficient than Lazy HTML, but it’s also currently in full production use as part of the Worker interface, and written in Rust, which is much safer than C in terms of its handling of memory. It’s ideal, therefore, as a replacement for the ancient and creaking HTML parser we’ve been using in FL up until now.

So we started working on a new framework, written in Rust, that would incorporate lol-html and allow other teams to write response body parsing features without the threat of causing massive security issues. The new system is called ROFL or Response Overseer for FL, and it’s a brand-new NGINX module written completely in Rust. As of now, ROFL is running in production on millions of responses a second, with comparable performance to cf-html. In building ROFL, we’ve been able to deprecate one of the scariest bits of code in Cloudflare’s entire codebase, while providing teams at Cloudflare with a robust system they can use to write features which need to parse and rewrite response body data.

Writing an NGINX module in Rust

While writing the new module, we learned a lot about how NGINX works, and how we can get it to talk to Rust. NGINX doesn’t provide much documentation on writing modules written in languages other than C, and so there was some work which needed to be done to figure out how to write an NGINX module in our language of choice. When starting out, we made heavy use of parts of the code from the nginx-rs project, particularly around the handling of buffers and memory pools. While writing a full NGINX module in Rust is a long process and beyond the scope of this blog post, there are a few key bits that make the whole thing possible, and that are worth talking about.

The first one of these is generating the Rust bindings so that our Rust code and NGINX can communicate. To do that, we used the Rust library Bindgen to build the FFI bindings for us, based on the symbol definitions in NGINX’s header files. To add this to an existing Rust project, the first thing is to pull down a copy of NGINX and configure it. Ideally this would be done in a simple script or Makefile, but when done by hand it would look something like this:

$ git clone --depth=1 https://github.com/nginx/nginx.git
$ cd nginx
$ ./auto/configure --without-http_rewrite_module --without-http_gzip_module

With NGINX in the right state, we need to create a build.rs file in our Rust project to auto-generate the bindings when the module is built. We’ll now add the necessary arguments to the build, and use Bindgen to generate the bindings.rs file for us. For the arguments, we just need to include all the directories that may contain header files for clang to do its thing. We can then feed them into Bindgen, along with some allowlist arguments, so it knows which items it should generate bindings for, and which it can ignore. Adding a little boilerplate code to the top, the whole file should look something like this:

use std::env;
use std::path::PathBuf;

fn main() {
    println!("cargo:rerun-if-changed=build.rs");

    let clang_args = [
        "-Inginx/objs/",
        "-Inginx/src/core/",
        "-Inginx/src/event/",
        "-Inginx/src/event/modules/",
        "-Inginx/src/os/unix/",
        "-Inginx/src/http/",
        "-Inginx/src/http/modules/"
    ];

    let bindings = bindgen::Builder::default()
        .header("wrapper.h")
        .layout_tests(false)
        .allowlist_type("ngx_.*")
        .allowlist_function("ngx_.*")
        .allowlist_var("NGX_.*|ngx_.*|nginx_.*")
        .parse_callbacks(Box::new(bindgen::CargoCallbacks))
        .clang_args(clang_args)
        .generate()
        .expect("Unable to generate bindings");

    let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());

    bindings.write_to_file(out_path.join("bindings.rs"))
        .expect("Unable to write bindings.");
}

Hopefully this is all fairly self-explanatory. Bindgen traverses the NGINX source and generates equivalent constructs in Rust in a file called bindings.rs, which we can import into our project. There’s just one more thing to add: Bindgen has trouble with a couple of symbols in NGINX, which we’ll need to fix in a file called wrapper.h. It should have the following contents:

#include <ngx_http.h>

const char* NGX_RS_MODULE_SIGNATURE = NGX_MODULE_SIGNATURE;
const size_t NGX_RS_HTTP_LOC_CONF_OFFSET = NGX_HTTP_LOC_CONF_OFFSET;

With this in place and Bindgen set in the [build-dependencies] section of the Cargo.toml file, we should be ready to build.

$ cargo build
   Compiling rust-nginx-module v0.1.0 (/Users/sam/cf-repos/rust-nginx-module)
    Finished dev [unoptimized + debuginfo] target(s) in 4.70s

With any luck, we should see a file called bindings.rs in the target/debug/build directory, which contains Rust definitions of all the NGINX symbols.

$ find target -name 'bindings.rs' 
target/debug/build/rust-nginx-module-c5504dc14560ecc1/out/bindings.rs

$ head target/debug/build/rust-nginx-module-c5504dc14560ecc1/out/bindings.rs
/* automatically generated by rust-bindgen 0.61.0 */
[...]

To be able to use them in the project, we can include them in a new file under the src directory which we’ll call bindings.rs.

$ cat > src/bindings.rs
include!(concat!(env!("OUT_DIR"), "/bindings.rs"));

With that set, we just need to add the usual imports to the top of the lib.rs file, and we can access NGINX constructs from Rust. Not only does this make bugs in the interface between NGINX and our Rust module much less likely than if these values were hand-coded, but it’s also a fantastic reference we can use to check the structure of things in NGINX when building modules in Rust, and it takes a lot of the leg-work out of setting everything up. It’s really a testament to the quality of a lot of Rust libraries such as Bindgen that something like this can be done with so little effort, in a robust way.

Once the Rust library has been built, the next step is to hook it into NGINX. Most NGINX modules are compiled statically. That is, the module is compiled as part of the compilation of NGINX as a whole. However, since NGINX 1.9.11, it has supported dynamic modules, which are compiled separately and then loaded using the load_module directive in the nginx.conf file. This is what we needed to use to build ROFL, so that the library could be compiled separately and loaded in at the time NGINX starts up. Working out from the documentation which symbols the module needs to export, and in what format, was tricky, though, and although it is possible to use a separate config file to set some of this metadata, it’s better if we can load it as part of the module, to keep things neat. Luckily, it doesn’t take much spelunking through the NGINX codebase to find where dlopen is called.

So after that it’s just a case of making sure the relevant symbols exist.

use std::os::raw::c_char;
use std::ptr;

#[no_mangle]
pub static mut ngx_modules: [*const ngx_module_t; 2] = [
    unsafe { rust_nginx_module as *const ngx_module_t },
    ptr::null()
];

#[no_mangle]
pub static mut ngx_module_type: [*const c_char; 2] = [
    "HTTP_FILTER\0".as_ptr() as *const c_char,
    ptr::null()
];

#[no_mangle]
pub static mut ngx_module_names: [*const c_char; 2] = [
    "rust_nginx_module\0".as_ptr() as *const c_char,
    ptr::null()
];
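Each of these tables ends with a null pointer, and each string ends with a `\0` byte, because the code on the other side reads them C-style. The following sketch shows what such a reader might look like. It is not NGINX’s loader: `EXAMPLE_MODULE_NAMES` and `read_names` are hypothetical names used only for illustration.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;
use std::ptr;

// Hypothetical stand-in for an exported table such as `ngx_module_names`.
pub static mut EXAMPLE_MODULE_NAMES: [*const c_char; 3] = [
    "rust_nginx_module\0".as_ptr() as *const c_char,
    "another_module\0".as_ptr() as *const c_char,
    ptr::null(),
];

/// Walk a null-terminated array of C strings, the way a C loader would.
///
/// # Safety
/// `table` must point to valid NUL-terminated strings and must end with a
/// null pointer.
pub unsafe fn read_names(table: *const *const c_char) -> Vec<String> {
    let mut names = Vec::new();
    let mut i = 0;
    loop {
        // SAFETY: the caller guarantees the array is terminated by a null
        // pointer, so we never read past its end.
        let entry = unsafe { *table.add(i) };
        if entry.is_null() {
            break;
        }
        // SAFETY: the caller guarantees each entry is NUL-terminated.
        let name = unsafe { CStr::from_ptr(entry) };
        names.push(name.to_string_lossy().into_owned());
        i += 1;
    }
    names
}
```

Leaving out either terminator would send a reader like this past the end of the data, which is why every table and string in the module carries one.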

When writing an NGINX module, it’s crucial to get its order relative to the other modules correct. Dynamic modules get loaded as NGINX starts, which means they are (perhaps counterintuitively) the first to run on a response. Ensuring your module runs after gzip decompression by specifying its order relative to the gunzip module is essential, otherwise you can spend lots of time staring at streams of unprintable characters, wondering why you aren’t seeing the response you expected. Not fun. Fortunately this is also something that can be solved by looking at the NGINX source, and making sure the relevant entities exist in your module. Here’s an example of what you might set:

// Exported unmangled, like the other tables, so the symbol can be found by name.
#[no_mangle]
pub static mut ngx_module_order: [*const c_char; 3] = [
    "rust_nginx_module\0".as_ptr() as *const c_char,
    "ngx_http_headers_more_filter_module\0".as_ptr() as *const c_char,
    ptr::null()
];

We’re essentially saying we want our module rust_nginx_module to run just before the ngx_http_headers_more_filter_module module, which should allow it to run in the place we expect.
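To make the effect of the order list concrete, here is a simplified model of how a loader could use it to decide where a new module goes. This is a sketch of the idea under our own assumptions, not NGINX’s implementation; `place_module` and the module names in the usage note are hypothetical.

```rust
/// Simplified model of placing a dynamic module using its order list.
///
/// `loaded` is the current module list and `order` is the module's order
/// list without the trailing null. The new module is inserted before the
/// earliest already-loaded module that appears after its own name in
/// `order`; if no such module is loaded, it is appended at the end.
/// Returns the index at which the module was inserted.
pub fn place_module(loaded: &mut Vec<String>, name: &str, order: &[&str]) -> usize {
    // Entries that follow our own name are the modules we must come before.
    let after_self: &[&str] = match order.iter().position(|&m| m == name) {
        Some(i) => &order[i + 1..],
        None => &[],
    };

    // Find the earliest loaded module among them.
    let index = after_self
        .iter()
        .filter_map(|&m| loaded.iter().position(|l| l.as_str() == m))
        .min()
        .unwrap_or(loaded.len());

    loaded.insert(index, name.to_string());
    index
}
```

With a loaded list of `module_a`, `ngx_http_headers_more_filter_module`, `module_b` and the order list shown above, `place_module` inserts `rust_nginx_module` at index 1, directly ahead of `ngx_http_headers_more_filter_module`.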

One of the quirks of NGINX and OpenResty is how hostile they are to making calls to external services at the point that you’re dealing with the HTTP response. It’s something that isn’t provided as part of the OpenResty Lua framework, even though it would make working with the response phase of a request much easier. While we could do this anyway, that would mean having to fork NGINX and OpenResty, which would bring its own challenges. As a result, we’ve spent a lot of time over the years thinking about ways to pass state from the time when NGINX is dealing with an HTTP request, over to the time when it’s streaming through the response, and much of our logic is built around this style of work.

For ROFL, that means in order to determine if we need to apply a certain feature for a response, we need to figure that out on the request, then pass that information over to the response so that we know which features to activate. To do that, we need to use one of the utilities that NGINX provides you with. With the help of the bindings.rs file generated earlier, we can take a look at the definition of the ngx_http_request_s struct, which contains all the state associated with a given request:

#[repr(C)]
#[derive(Debug, Copy, Clone)]
pub struct ngx_http_request_s {
    pub signature: u32,
    pub connection: *mut ngx_connection_t,
    pub ctx: *mut *mut ::std::os::raw::c_void,
    pub main_conf: *mut *mut ::std::os::raw::c_void,
    pub srv_conf: *mut *mut ::std::os::raw::c_void,
    pub loc_conf: *mut *mut ::std::os::raw::c_void,
    pub read_event_handler: ngx_http_event_handler_pt,
    pub write_event_handler: ngx_http_event_handler_pt,
    pub cache: *mut ngx_http_cache_t,
    pub upstream: *mut ngx_http_upstream_t,
    pub upstream_states: *mut ngx_array_t,
    pub pool: *mut ngx_pool_t,
    pub header_in: *mut ngx_buf_t,
    pub headers_in: ngx_http_headers_in_t,
    pub headers_out: ngx_http_headers_out_t,
    pub request_body: *mut ngx_http_request_body_t,
[...]
}

As we can see, there’s a member called ctx. As the NGINX Development Guide mentions, it’s a place where you’re able to store any value associated with a request, which should live for as long as the request does. In OpenResty this is used heavily for the storing of state to do with a request over its lifetime in a Lua context. We can do the same thing for our module, so that settings initialised during the request phase are there when our HTML parsing and rewriting is run in the response phase. Here’s an example function which can be used to get the request ctx:

pub fn get_ctx(request: &ngx_http_request_t) -> Option<&mut Ctx> {
    unsafe {
        match *request.ctx.add(ngx_http_rofl_module.ctx_index) {
            p if p.is_null() => None,
            p => Some(&mut *(p as *mut Ctx)),
        }
    }
}

Notice that ctx is at the offset of the ctx_index member of ngx_http_rofl_module – this is the structure of type ngx_module_t that’s part of the module definition needed to make an NGINX module. Once we have this, we can point it to a structure containing any setting we want. For example, here’s the actual function we use to enable the Email Obfuscation feature from Lua, via FFI to the Rust module using LuaJIT’s FFI tools:

#[no_mangle]
pub extern "C" fn rofl_module_email_obfuscation_new(
    request: &mut ngx_http_request_t,
    dry_run: bool,
    decode_script_url: *const u8,
    decode_script_url_len: usize,
) {
    let ctx = context::get_or_init_ctx(request);
    let decode_script_url = unsafe {
        std::str::from_utf8(std::slice::from_raw_parts(decode_script_url, decode_script_url_len))
            .expect("invalid utf-8 string for decode script")
    };

    ctx.register_module(EmailObfuscation::new(decode_script_url.to_owned()), dry_run);
}

The function is called get_or_init_ctx here; it performs the same job as get_ctx, but also initialises the structure if it doesn’t exist yet. Once we’ve set whatever data we need in ctx during the request, we can then check what features need to be run in the response, without having to make any calls to external databases, which might slow us down.
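The body of get_or_init_ctx is not shown in this post. As a rough illustration of the pattern, here is a self-contained sketch that uses a mock request instead of the real bindings. `MockRequest`, `MODULE_CTX_INDEX`, `free_ctx` and the fields of `Ctx` are all assumptions made for this example, and the heap allocation stands in for the request pool.

```rust
use std::os::raw::c_void;
use std::ptr;

/// Hypothetical per-request state; the real `Ctx` is not shown in the post.
#[derive(Debug, Default)]
pub struct Ctx {
    pub features: Vec<String>,
}

/// Stand-in for `ngx_http_request_t`: one `ctx` slot per module.
pub struct MockRequest {
    pub ctx: Vec<*mut c_void>,
}

/// Stand-in for the module's `ctx_index`.
pub const MODULE_CTX_INDEX: usize = 2;

/// Return this module's context, creating it on first use.
pub fn get_or_init_ctx(request: &mut MockRequest) -> &mut Ctx {
    let slot = &mut request.ctx[MODULE_CTX_INDEX];
    if slot.is_null() {
        // A real module would allocate from the request pool, not the heap.
        *slot = Box::into_raw(Box::new(Ctx::default())) as *mut c_void;
    }
    // SAFETY: the slot is non-null and always holds a pointer made from a `Box<Ctx>`.
    unsafe { &mut *(*slot as *mut Ctx) }
}

/// Free the context. With a real pool, a cleanup handler would do this.
pub fn free_ctx(request: &mut MockRequest) {
    let slot = &mut request.ctx[MODULE_CTX_INDEX];
    if !slot.is_null() {
        // SAFETY: the pointer came from `Box::into_raw` in `get_or_init_ctx`.
        unsafe { drop(Box::from_raw(*slot as *mut Ctx)) };
        *slot = ptr::null_mut();
    }
}
```

Calling `get_or_init_ctx` twice on the same request returns the same `Ctx`, so a feature registered during the request phase is still visible when the response is processed.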

One of the nice things about storing state on ctx in this way, and working with NGINX in general, is that NGINX relies heavily on memory pools to store request content. This largely removes any need for the programmer to have to think about freeing memory after use: the pool is automatically allocated at the start of a request, and is automatically freed when the request is done. All that’s needed is to allocate the memory using NGINX’s built-in functions for allocating memory to the pool and then register a callback that will be called to free everything. In Rust, that would look something like the following:

use std::mem;
use std::os::raw::c_void;
use std::ptr;

pub struct Pool<'a>(&'a mut ngx_pool_t);

impl<'a> Pool<'a> {
    /// Register a cleanup handler that will get called at the end of the request.
    fn add_cleanup<T>(&mut self, value: *mut T) -> Result<(), ()> {
        unsafe {
            let cln = ngx_pool_cleanup_add(self.0, 0);
            if cln.is_null() {
                return Err(());
            }
            (*cln).handler = Some(cleanup_handler::<T>);
            (*cln).data = value as *mut c_void;
            Ok(())
        }
    }

    /// Allocate memory for a given value.
    pub fn alloc<T>(&mut self, value: T) -> Option<&'a mut T> {
        unsafe {
            let p = ngx_palloc(self.0, mem::size_of::<T>()) as *mut T;
            // ngx_palloc returns NULL on failure, so check before writing.
            if p.is_null() {
                return None;
            }
            ptr::write(p, value);
            if self.add_cleanup(p).is_err() {
                // No handler was registered, so drop the value ourselves.
                ptr::drop_in_place(p);
                return None;
            }
            Some(&mut *p)
        }
    }
}

/// Runs the destructor of the value; the pool releases the memory itself.
unsafe extern "C" fn cleanup_handler<T>(data: *mut c_void) {
    ptr::drop_in_place(data as *mut T);
}

This should allow us to allocate memory for whatever we want, safe in the knowledge that NGINX will handle it for us.
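To see the lifecycle without a running NGINX, the same idea can be tried against a toy pool. This is only a model: `MockPool`, `DropCounter` and `drop_handler` are hypothetical, and the toy pool uses the heap and Rust’s own `Drop` where the real one uses NGINX’s allocator and its cleanup handlers.

```rust
use std::cell::Cell;
use std::os::raw::c_void;
use std::rc::Rc;

type CleanupFn = unsafe fn(*mut c_void);

/// Toy pool: remembers a type-erased cleanup handler for every allocation.
#[derive(Default)]
pub struct MockPool {
    cleanups: Vec<(CleanupFn, *mut c_void)>,
}

/// Type-erased handler: rebuild the Box so the value is dropped and freed.
unsafe fn drop_handler<T>(data: *mut c_void) {
    // SAFETY: `data` was produced by `Box::into_raw` for a `Box<T>` in `alloc`.
    unsafe { drop(Box::from_raw(data as *mut T)) };
}

impl MockPool {
    /// Store `value` and register the handler that will destroy it later.
    pub fn alloc<T>(&mut self, value: T) -> &mut T {
        let p = Box::into_raw(Box::new(value));
        self.cleanups.push((drop_handler::<T> as CleanupFn, p as *mut c_void));
        // SAFETY: `p` is valid until the pool itself is dropped.
        unsafe { &mut *p }
    }
}

impl Drop for MockPool {
    fn drop(&mut self) {
        // "End of request": run every registered handler exactly once.
        for (handler, data) in self.cleanups.drain(..) {
            // SAFETY: each pair was registered together in `alloc`.
            unsafe { handler(data) };
        }
    }
}

/// Test helper that counts how many times it has been dropped.
pub struct DropCounter(pub Rc<Cell<u32>>);

impl Drop for DropCounter {
    fn drop(&mut self) {
        self.0.set(self.0.get() + 1);
    }
}
```

Allocating two `DropCounter` values in a `MockPool` leaves the counter at 0 while the pool is alive, and at 2 once the pool goes out of scope. The caller never frees anything by hand.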

It is regrettable that we have to write a lot of unsafe blocks when dealing with NGINX’s interface in Rust. Although we’ve done a lot of work to minimise them where possible, unfortunately this is often the case with writing Rust code which has to manipulate C constructs through FFI. We have plans to do more work on this in the future, and remove as many lines as possible from unsafe.

Challenges encountered

The NGINX module system allows for a massive amount of flexibility in terms of the way the module itself works, which makes it very accommodating to specific use-cases, but that flexibility can also lead to problems. One that we ran into had to do with the way the response data is handled between Rust and FL. In NGINX, response bodies are chunked, and these chunks are then linked together into a list. Additionally, there may be more than one of these linked lists per response, if the response is large.

Efficiently handling these chunks means processing them and passing them on as quickly as possible. When writing a Rust module for manipulating responses, it’s tempting to implement a Rust-based view into these linked lists. However, if you do that, you must be sure to update both the Rust-based view and the underlying NGINX data structures when mutating them, otherwise this can lead to serious bugs where Rust becomes out of sync with NGINX. Here’s a small function from an early version of ROFL that caused headaches:

fn handle_chunk(&mut self, chunk: &[u8]) {
    let mut free_chain = self.chains.free.borrow_mut();
    let mut out_chain = self.chains.out.borrow_mut();
    let mut data = chunk;

    self.metrics.borrow_mut().bytes_out += data.len() as u64;

    while !data.is_empty() {
        let free_link = self
            .pool
            .get_free_chain_link(free_chain.head, self.tag, &mut self.metrics.borrow_mut())
            .expect("Could not get a free chain link.");

        let mut link_buf = unsafe { TemporaryBuffer::from_ngx_buf(&mut *(*free_link).buf) };
        data = link_buf.write_data(data).unwrap_or(b"");
        out_chain.append(free_link);
    }
}

What this code was supposed to do is take the output of lol-html’s HTMLRewriter, and write it to the output chain of buffers. Importantly, the output can be larger than a single buffer, so you need to take new buffers off the chain in a loop until you’ve written all the output to buffers. Within this logic, NGINX is supposed to take care of popping the buffer off the free chain and appending the new chunk to the output chain, which it does. However, if you’re only thinking in terms of the way NGINX handles its view of the linked list, you may not notice that Rust never changes which buffer its free_chain.head points to, causing the logic to loop forever and the NGINX worker process to lock-up completely. This sort of issue can take a long time to track down, especially since we couldn’t reproduce it on our personal machines until we understood it was related to the response body size.
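A loop of this shape terminates once the Rust-side view of the free list advances on every iteration. The sketch below models that with standard-library containers. It is not the fix that went into ROFL, and `Buf` and this version of `handle_chunk` are hypothetical stand-ins for the NGINX buffer chain types.

```rust
use std::collections::VecDeque;

/// Fixed-capacity buffer standing in for an NGINX buffer (hypothetical type).
#[derive(Debug)]
pub struct Buf {
    pub data: Vec<u8>,
    pub capacity: usize,
}

impl Buf {
    /// Copy as much of `input` as fits; return the part that did not fit.
    fn write_data<'a>(&mut self, input: &'a [u8]) -> &'a [u8] {
        let room = self.capacity.saturating_sub(self.data.len());
        let n = room.min(input.len());
        self.data.extend_from_slice(&input[..n]);
        &input[n..]
    }
}

/// Write `chunk` across as many free buffers as needed.
/// Returns `None` if the free list runs out before all data is written.
pub fn handle_chunk(free: &mut VecDeque<Buf>, out: &mut Vec<Buf>, chunk: &[u8]) -> Option<()> {
    let mut data = chunk;
    while !data.is_empty() {
        // `pop_front` moves the head of the free list forward every time,
        // so the loop can never hand out the same buffer twice.
        let mut buf = free.pop_front()?;
        data = buf.write_data(data);
        out.push(buf);
    }
    Some(())
}
```

Because each iteration removes a buffer from the free list, the loop either finishes writing or runs out of buffers and returns `None`. It cannot spin forever on a buffer that is already full.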

Getting a coredump to perform some analysis with gdb was also hard because once we noticed it happening, it was already too late and the process memory had grown to the point the server was in danger of falling over, and the memory consumed was too large to be written to disk. Fortunately, this code never made it to production. As ever, while Rust’s compiler can help you to catch a lot of common mistakes, it can’t help as much if the data is being shared via FFI from another environment, even without much direct use of unsafe, so extra care must be taken in these cases, especially when NGINX allows the kind of flexibility that might lead to a whole machine being taken out of service.

Another major challenge we faced had to do with backpressure from incoming response body chunks. In essence, if ROFL increased the size of the response due to having to inject some large amount of code into the stream (such as replacing an email address with a large chunk of JavaScript), NGINX could feed the output from ROFL to the other downstream modules faster than they could push it along, potentially leading to data being dropped and HTTP response bodies being truncated if the EAGAIN error from the next module was left unhandled. This was another case where the issue was really hard to test, because most of the time the response would be flushed fast enough for backpressure never to be a problem. To handle this, we had to create a special chain to store these chunks called saved_in, which required a special method for appending to it.

#[derive(Debug)]
pub struct Chains {
    /// This saves buffers from the `in` chain that were not processed for any reason (most likely
    /// backpressure for the next nginx module).
    saved_in: RefCell<Chain>,
    pub free: RefCell<Chain>,
    pub busy: RefCell<Chain>,
    pub out: RefCell<Chain>,
    [...]
}

Effectively we’re ‘queueing’ the data for a short period of time so that we don’t overwhelm the other modules by feeding them data faster than they can handle it. The NGINX Development Guide has a lot of great information, but many of its examples are trivial to the point where issues like this don’t come up. Things such as this are the result of working in a complex NGINX-based environment, and need to be discovered independently.
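The queueing idea can be sketched without NGINX. In this simplified model, `Downstream`, `Filter` and `Status` are hypothetical types: the downstream filter accepts a limited number of chunks per call, and anything it rejects is kept in `saved_in` and retried on the next call instead of being dropped.

```rust
use std::collections::VecDeque;

/// Outcome of a filter call, modelled on the OK / EAGAIN distinction.
#[derive(Debug, PartialEq)]
pub enum Status {
    Ok,
    Again,
}

/// Toy downstream filter that accepts only `budget` more chunks.
pub struct Downstream {
    pub budget: usize,
    pub received: Vec<Vec<u8>>,
}

impl Downstream {
    /// Accept the chunk, or hand it back when there is no budget left.
    fn push(&mut self, chunk: Vec<u8>) -> Result<(), Vec<u8>> {
        if self.budget == 0 {
            return Err(chunk);
        }
        self.budget -= 1;
        self.received.push(chunk);
        Ok(())
    }
}

/// Our filter, holding chunks the downstream filter could not take yet.
#[derive(Default)]
pub struct Filter {
    saved_in: VecDeque<Vec<u8>>,
}

impl Filter {
    /// Queue new chunks behind any saved ones, then forward them in order
    /// until the downstream filter pushes back.
    pub fn process(&mut self, incoming: Vec<Vec<u8>>, next: &mut Downstream) -> Status {
        self.saved_in.extend(incoming);
        while let Some(chunk) = self.saved_in.pop_front() {
            if let Err(rejected) = next.push(chunk) {
                // Put the chunk back at the front so nothing is lost or reordered.
                self.saved_in.push_front(rejected);
                return Status::Again;
            }
        }
        Status::Ok
    }
}
```

When the downstream budget is smaller than the number of chunks, `process` returns `Status::Again` and keeps the remainder. A later call with no new input delivers the saved chunks in their original order.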

A future without NGINX

The obvious question a lot of people might ask is: why are we still using NGINX? As already mentioned, Cloudflare is well on its way to replacing components that either used to run NGINX/OpenResty proxies, or would have done without heavy investment in home-grown platforms. That said, some components are easier to replace than others and FL, being where most of the logic for Cloudflare’s application services runs, is definitely on the more challenging end of the spectrum.

Another motivating reason for doing this work is that whichever platform we eventually migrate to, we’ll need to run the features that make up cf-html, and in order to do that we’ll want to have a system that is less heavily integrated with and dependent on NGINX. ROFL has been specifically designed with the intention of running it in multiple places, so it will be easy to move it to another Rust-based web proxy (or indeed our Workers platform) without too much trouble. That said, it’s hard to imagine we’d be in the same place without a language like Rust, which offers speed at the same time as a high degree of safety, not to mention high-quality libraries like Bindgen and Serde. More broadly, the FL team are working to migrate other aspects of the platform over to Rust, and while cf-html and the features that make it up are a key part of our infrastructure that needed work, there are many others.

Safety in programming languages is often seen as beneficial in terms of preventing bugs, but as a company we’ve found that it also allows you to do things which would be considered very hard, or otherwise impossible to do safely. Whether it be providing a Wireshark-like filter language for writing firewall rules, allowing millions of users to write arbitrary JavaScript code and run it directly on our platform or rewriting HTML responses on the fly, having strict boundaries in place allows us to provide services we wouldn’t be able to otherwise, all while safe in the knowledge that the kind of memory-safety issues that used to plague the industry are increasingly a thing of the past.

If you enjoy rewriting code in Rust, solving challenging application infrastructure problems and want to help maintain the busiest web server in the world, we’re hiring!

How we built Pingora, the proxy that connects Cloudflare to the Internet

Post Syndicated from Yuchen Wu original https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/

Introduction

Today we are excited to talk about Pingora, a new HTTP proxy we’ve built in-house using Rust that serves over 1 trillion requests a day, boosts our performance, and enables many new features for Cloudflare customers, all while requiring only a third of the CPU and memory resources of our previous proxy infrastructure.

As Cloudflare has scaled we’ve outgrown NGINX. It was great for many years, but over time its limitations at our scale meant building something new made sense. We could no longer get the performance we needed nor did NGINX have the features we needed for our very complex environment.

Many Cloudflare customers and users use the Cloudflare global network as a proxy between HTTP clients (such as web browsers, apps, IoT devices and more) and servers. In the past, we’ve talked a lot about how browsers and other user agents connect to our network, and we’ve developed a lot of technology and implemented new protocols (see QUIC and optimizations for http2) to make this leg of the connection more efficient.

Today, we’re focusing on a different part of the equation: the service that proxies traffic between our network and servers on the Internet. This proxy service powers our CDN, Workers fetch, Tunnel, Stream, R2 and many, many other features and products.

Let’s dig in on why we chose to replace our legacy service and how we developed Pingora, our new system designed specifically for Cloudflare’s customer use cases and scale.

Why build yet another proxy

Over the years, our usage of NGINX has run up against limitations. For some limitations, we optimized or worked around them. But others were much harder to overcome.

Architecture limitations hurt performance

The NGINX worker (process) architecture has operational drawbacks for our use cases that hurt our performance and efficiency.

First, in NGINX each request can only be served by a single worker. This results in unbalanced load across all CPU cores, which leads to slowness.

Because of this request-process pinning effect, requests that do CPU heavy or blocking IO tasks can slow down other requests. As those blog posts attest we’ve spent a lot of time working around these problems.

The most critical problem for our use cases is poor connection reuse. Our machines establish TCP connections to origin servers to proxy HTTP requests. Connection reuse speeds up TTFB (time-to-first-byte) of requests by reusing previously established connections from a connection pool, skipping TCP and TLS handshakes required on a new connection.

However, the NGINX connection pool is per worker. When a request lands on a certain worker, it can only reuse the connections within that worker. When we add more NGINX workers to scale up, our connection reuse ratio gets worse because the connections are scattered across more isolated pools of all the processes. This results in slower TTFB and more connections to maintain, which consumes resources (and money) for both us and our customers.
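The effect of splitting the pool can be shown with a small counting exercise. This is a deliberately simplified model with made-up inputs, not a measurement: it assumes connections never expire and that each request finishes before the next one starts, and `count_reuse` is a hypothetical helper.

```rust
use std::collections::HashSet;

/// Count reused connections for a sequence of `(worker, origin)` requests.
///
/// Returns `(reused with per-worker pools, reused with one shared pool)`.
pub fn count_reuse(requests: &[(usize, u32)]) -> (usize, usize) {
    // Per-worker pools: a connection is reusable only by the worker that opened it.
    let mut per_worker: HashSet<(usize, u32)> = HashSet::new();
    // Shared pool: any worker can reuse a connection to a known origin.
    let mut shared: HashSet<u32> = HashSet::new();
    let mut reused_per_worker = 0;
    let mut reused_shared = 0;

    for &(worker, origin) in requests {
        // `insert` returns false when the key is already present, which here
        // means an idle connection already exists in that pool.
        if !per_worker.insert((worker, origin)) {
            reused_per_worker += 1;
        }
        if !shared.insert(origin) {
            reused_shared += 1;
        }
    }
    (reused_per_worker, reused_shared)
}
```

For the request sequence `[(0, 1), (1, 1), (2, 1), (0, 1), (1, 2), (0, 2)]`, the per-worker model reuses a connection once and the shared model four times. Spreading the same traffic over more workers can only lower the first number, while the second stays the same.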

As mentioned in past blog posts, we have workarounds for some of these issues. But if we can address the fundamental issue, the worker/process model, we will resolve all these problems naturally.

Some types of functionality are difficult to add

NGINX is a very good web server, load balancer or simple gateway. But Cloudflare does way more than that. We used to build all the functionality we needed around NGINX, which is not easy to do while trying not to diverge too much from the NGINX upstream codebase.

For example, when retrying/failing over a request, sometimes we want to send a request to a different origin server with a different set of request headers. But that is not something NGINX allows us to do. In cases like this, we spend time and effort on working around the NGINX constraints.

Meanwhile, the programming languages we had to work with did little to alleviate the difficulties. NGINX is written purely in C, which is not memory safe by design. It is very error-prone to work with such a 3rd party code base. It is quite easy to get into memory safety issues, even for experienced engineers, and we wanted to avoid these as much as possible.

The other language we used to complement C is Lua. It is less risky but also less performant. In addition, we often found ourselves missing static typing when working with complicated Lua code and business logic.

And the NGINX community is not very active, and development tends to be “behind closed doors”.

Choosing to build our own

Over the past few years, as we’ve continued to grow our customer base and feature set, we continually evaluated three choices:

  1. Continue to invest in NGINX and possibly fork it to tailor it 100% to our needs. We had the expertise needed, but given the architecture limitations mentioned above, significant effort would be required to rebuild it in a way that fully supported our needs.
  2. Migrate to another 3rd party proxy codebase. There are definitely good projects, like envoy and others. But this path means the same cycle may repeat in a few years.
  3. Start with a clean slate, building an in-house platform and framework. This choice requires the most upfront investment in terms of engineering effort.

We evaluated each of these options every quarter for the past few years. There is no obvious formula to tell which choice is the best. For several years, we continued with the path of least resistance, continuing to augment NGINX. However, at some point, the return on investment of building our own proxy seemed worth it. We made a call to build a proxy from scratch, and began designing the proxy application of our dreams.

The Pingora Project

Design decisions

To make a proxy that serves millions of requests per second fast, efficient and secure, we have to make a few important design decisions first.

We chose Rust as the language of the project because it can do what C can do in a memory safe way without compromising performance.

Although there are some great off-the-shelf 3rd party HTTP libraries, such as hyper, we chose to build our own because we want to maximize the flexibility in how we handle HTTP traffic and to make sure we can innovate at our own pace.

At Cloudflare, we handle traffic across the entire Internet. We have many cases of bizarre and non-RFC compliant HTTP traffic that we have to support. For example, hyper did not support HTTP status codes greater than 599 until late 2020, three years after people initially raised the issue and repeatedly argued that it was necessary. And we need more than being correct. We need a robust, permissive, customizable HTTP library that can survive the wilds of the Internet. The best way to guarantee that is to implement our own.

The next design decision was around our workload scheduling system. We chose multithreading over multiprocessing in order to share resources, especially connection pools, easily. We also decided that work stealing was required to avoid some classes of performance problems mentioned above. The Tokio async runtime turned out to be a great fit for our needs.

Finally, we wanted our project to be intuitive and developer friendly. What we build is not the final product, and should be extensible as a platform as more features are built on top of it. We decided to implement a “life of a request” event based programmable interface similar to NGINX/OpenResty. For example, the “request filter” phase allows developers to run code to modify or reject the request when a request header is received. With this design, we can separate our business logic and generic proxy logic cleanly. Developers who previously worked on NGINX can easily switch to Pingora and quickly become productive.
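A phase-based interface of this kind can be sketched as a trait with one method per phase. The following is only an illustration of the idea under our own assumptions: the names `ProxyPhases`, `Decision`, `Request`, `BlockAdmin` and `handle` are hypothetical and are not Pingora's actual API.

```rust
use std::collections::HashMap;

/// A minimal request representation (hypothetical, for illustration only).
#[derive(Debug, Default)]
pub struct Request {
    pub path: String,
    pub headers: HashMap<String, String>,
}

/// Outcome of a phase: keep proxying, or respond at once with a status code.
#[derive(Debug, PartialEq)]
pub enum Decision {
    Continue,
    Reject(u16),
}

/// Hypothetical "life of a request" hooks. Each hook has a default, so
/// business logic only overrides the phases it cares about.
pub trait ProxyPhases {
    fn request_filter(&self, _req: &mut Request) -> Decision {
        Decision::Continue
    }
    fn upstream_peer(&self, _req: &Request) -> String {
        "127.0.0.1:8080".to_string()
    }
}

/// Example business logic: reject admin paths, tag everything else.
pub struct BlockAdmin;

impl ProxyPhases for BlockAdmin {
    fn request_filter(&self, req: &mut Request) -> Decision {
        if req.path.starts_with("/admin") {
            return Decision::Reject(403);
        }
        req.headers.insert("x-filtered".to_string(), "1".to_string());
        Decision::Continue
    }
}

/// Generic proxy logic: runs the phases in order and knows nothing about
/// the business rules behind them.
pub fn handle<P: ProxyPhases>(phases: &P, mut req: Request) -> Result<(String, Request), u16> {
    match phases.request_filter(&mut req) {
        Decision::Reject(code) => Err(code),
        Decision::Continue => {
            let peer = phases.upstream_peer(&req);
            Ok((peer, req))
        }
    }
}

fn main() {
    let req = Request { path: "/admin/users".into(), ..Default::default() };
    println!("{:?}", handle(&BlockAdmin, req).map(|(peer, _)| peer));
}
```

Here `handle` plays the role of the generic proxy logic, while `BlockAdmin` holds the business logic, which is the separation described above.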

Pingora is faster in production

Let’s fast-forward to the present. Pingora handles almost every HTTP request that needs to interact with an origin server (for a cache miss, for example), and we’ve collected a lot of performance data in the process.

First, let’s see how Pingora speeds up our customers’ traffic. Overall traffic on Pingora shows a 5ms reduction in median TTFB and an 80ms reduction at the 95th percentile. This is not because we run code faster. Even our old service could handle requests in the sub-millisecond range.

The savings come from our new architecture which can share connections across all threads. This means a better connection reuse ratio, which spends less time on TCP and TLS handshakes.
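A toy simulation shows why a shared pool reuses connections more often. It is a sketch under simplifying assumptions, not a model of our production system: requests arrive one at a time and round-robin across workers, there is a single origin, and the `Pool` type and `connections_opened` function are hypothetical.

```rust
use std::collections::HashMap;

/// A toy pool of idle connections keyed by origin. It only counts how many
/// connections it had to open; no real sockets are involved.
#[derive(Default)]
pub struct Pool {
    idle: HashMap<String, u32>,
    opened: u32,
}

impl Pool {
    /// Take an idle connection for `origin`, or "open" a new one.
    pub fn acquire(&mut self, origin: &str) {
        if let Some(count) = self.idle.get_mut(origin) {
            if *count > 0 {
                *count -= 1;
                return;
            }
        }
        self.opened += 1;
    }

    /// Return a connection to the idle set so a later request can reuse it.
    pub fn release(&mut self, origin: &str) {
        *self.idle.entry(origin.to_string()).or_insert(0) += 1;
    }
}

/// Requests arrive round-robin at `workers` workers (must be non-zero).
/// With `shared == false` every worker has its own pool, as in a
/// process-per-worker model; with `shared == true` all workers use one pool.
pub fn connections_opened(workers: usize, requests: usize, shared: bool) -> u32 {
    let pool_count = if shared { 1 } else { workers };
    let mut pools: Vec<Pool> = (0..pool_count).map(|_| Pool::default()).collect();
    for i in 0..requests {
        let pool = &mut pools[(i % workers) % pool_count];
        pool.acquire("origin.example");
        pool.release("origin.example");
    }
    pools.iter().map(|p| p.opened).sum()
}

fn main() {
    println!("per-worker pools: {}", connections_opened(8, 100, false));
    println!("shared pool: {}", connections_opened(8, 100, true));
}
```

In this simplified setting, eight isolated pools each open their own connection to the origin, while the shared pool opens a single one and reuses it for every later request.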

Across all customers, Pingora makes only a third as many new connections per second compared to the old service. For one major customer, it increased the connection reuse ratio from 87.1% to 99.92%, which reduced new connections to their origins by 160x. To present the number more intuitively, by switching to Pingora, we save our customers and users 434 years of handshake time every day.
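The 160x figure follows from the reuse ratios: the share of requests that need a new connection drops from 12.9% to 0.08%. A short check of that arithmetic, with hypothetical function names, looks like this:

```rust
/// Fraction of requests that need a new connection, given the connection
/// reuse ratio as a percentage.
pub fn new_connection_fraction(reuse_percent: f64) -> f64 {
    (100.0 - reuse_percent) / 100.0
}

/// How many times fewer new connections are made after the reuse ratio
/// improves from `old_percent` to `new_percent`.
pub fn reduction_factor(old_percent: f64, new_percent: f64) -> f64 {
    new_connection_fraction(old_percent) / new_connection_fraction(new_percent)
}

fn main() {
    let factor = reduction_factor(87.1, 99.92);
    println!("new connections reduced by about {:.0}x", factor);
}
```

The result is 12.9 / 0.08, or about 161, which matches the roughly 160x reduction quoted above.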

More features

A developer-friendly interface that engineers are already familiar with, combined with the removal of the previous constraints, allows us to develop more features, more quickly. Core functionality such as new protocols acts as a building block for more products we can offer to customers.

As an example, we were able to add HTTP/2 upstream support to Pingora without major hurdles. This allowed us to offer gRPC to our customers shortly afterwards. Adding this same functionality to NGINX would have required significantly more engineering effort and might not have materialized.

More recently we’ve announced Cache Reserve where Pingora uses R2 storage as a caching layer. As we add more functionality to Pingora, we’re able to offer new products that weren’t feasible before.

More efficient

In production, Pingora consumes about 70% less CPU and 67% less memory compared to our old service with the same traffic load. The savings come from a few factors.

Our Rust code runs more efficiently compared to our old Lua code. On top of that, there are also efficiency differences from their architectures. For example, in NGINX/OpenResty, when the Lua code wants to access an HTTP header, it has to read it from the NGINX C struct, allocate a Lua string and then copy it to the Lua string. Afterwards, Lua has to garbage-collect its new string as well. In Pingora, it would just be a direct string access.
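The difference between copying a header and accessing it directly can be sketched as follows. This is an illustration only: `RequestHeader` and its methods are hypothetical, and `get_copied` merely mimics the allocate-and-copy step described above rather than reproducing what OpenResty does internally.

```rust
use std::collections::HashMap;

/// Hypothetical parsed request header, with names stored in lowercase.
pub struct RequestHeader {
    headers: HashMap<String, String>,
}

impl RequestHeader {
    pub fn new(pairs: &[(&str, &str)]) -> Self {
        let headers = pairs
            .iter()
            .map(|(name, value)| (name.to_ascii_lowercase(), value.to_string()))
            .collect();
        RequestHeader { headers }
    }

    /// Copying access: allocates a fresh String on every call, which then
    /// has to be freed later (by a garbage collector in the Lua case).
    pub fn get_copied(&self, name: &str) -> Option<String> {
        self.headers.get(name).cloned()
    }

    /// Direct access: borrows the stored bytes, with no allocation or copy.
    pub fn get(&self, name: &str) -> Option<&str> {
        self.headers.get(name).map(String::as_str)
    }
}

fn main() {
    let req = RequestHeader::new(&[("Host", "example.com")]);
    println!("borrowed: {:?}", req.get("host"));
    println!("copied:   {:?}", req.get_copied("host"));
}
```

Both calls return the same value, but only `get_copied` creates a second copy of the header in memory.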

The multithreading model also makes sharing data across requests more efficient. NGINX also has shared memory but due to implementation limitations, every shared memory access has to use a mutex lock and only strings and numbers can be put into shared memory. In Pingora, most shared items can be accessed directly via shared references behind atomic reference counters.
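As a minimal sketch of sharing data behind an atomic reference counter, the example below uses `std::sync::Arc` from the Rust standard library. The `Config` type and `lookup_on_workers` function are hypothetical, and the threads are plain OS threads rather than tasks on an async runtime.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Hypothetical read-only configuration shared by all worker threads.
/// Unlike NGINX shared memory, it can hold any type, not only strings
/// and numbers.
pub struct Config {
    pub origins: HashMap<String, String>,
}

/// Look up each host on its own thread. Every thread reads the same map
/// through a shared reference, without taking a lock.
pub fn lookup_on_workers(config: Arc<Config>, hosts: Vec<String>) -> Vec<Option<String>> {
    let handles: Vec<_> = hosts
        .into_iter()
        .map(|host| {
            // Cloning the Arc bumps an atomic counter; the map is not copied.
            let cfg = Arc::clone(&config);
            thread::spawn(move || cfg.origins.get(&host).cloned())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let mut origins = HashMap::new();
    origins.insert("example.com".to_string(), "10.0.0.1:443".to_string());
    let config = Arc::new(Config { origins });
    let hosts = vec!["example.com".to_string(), "missing.test".to_string()];
    println!("{:?}", lookup_on_workers(config, hosts));
}
```

Because the data is immutable once shared, readers need no mutex; data that must change after sharing would still need some form of synchronization.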

Another significant portion of CPU saving, as mentioned above, is from making fewer new connections. TLS handshakes are expensive compared to just sending and receiving data via established connections.

Safer

Shipping features quickly and safely is difficult, especially at our scale. It’s hard to predict every edge case that can occur in a distributed environment processing millions of requests a second. Fuzzing and static analysis can only mitigate so much. Rust’s memory-safe semantics guard us from undefined behavior and give us confidence our service will run correctly.

With those assurances, we can focus more on how a change to our service will interact with other services or a customer’s origin. We can develop features at a higher cadence and not be burdened by memory safety issues and hard-to-diagnose crashes.

When crashes do occur, an engineer needs to spend time diagnosing how they happened and what caused them. Since Pingora’s inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.

In fact, Pingora crashes are so rare that we usually find unrelated issues when we do encounter one. Recently we discovered a kernel bug soon after our service started crashing. We’ve also discovered hardware issues on a few machines. In the past, ruling out rare memory bugs caused by our software was nearly impossible, even after significant debugging.

Conclusion

To summarize, we have built an in-house proxy that is faster, more efficient and more versatile as the platform for our current and future products.

We will be back with more technical details regarding the problems we faced, the optimizations we applied and the lessons we learned from building Pingora and rolling it out to power a significant portion of the Internet. We will also be back with our plan to open source it.

Pingora is our latest attempt at rewriting our system, but it won’t be our last. It is also only one of the building blocks in the re-architecting of our systems.

Interested in joining us to help build a better Internet? Our engineering teams are hiring.