
Code Orange: Fail Small — Our resilience plan following recent incidents

2025-12-20 Dane Knecht

Post Syndicated from Dane Knecht original https://blog.cloudflare.com/fail-small-resilience-plan/

On November 18, 2025, Cloudflare’s network experienced significant failures to deliver network traffic for approximately two hours and ten minutes. Nearly three weeks later, on December 5, 2025, our network again failed to serve traffic for 28% of applications behind our network for about 25 minutes.

We published detailed post-mortem blog posts following both incidents, but we know that we have more to do to earn back your trust. Today we are sharing details about the work underway at Cloudflare to prevent outages like these from happening again.

We are calling the plan “Code Orange: Fail Small”, which reflects our goal of making our network more resilient to errors or mistakes that could lead to a major outage. A “Code Orange” means the work on this project is prioritized above all else. For context, we declared a “Code Orange” at Cloudflare once before, following another major incident that required top priority from everyone across the company. We feel the recent events require the same focus. Code Orange is our way to enable that to happen, allowing teams to work cross-functionally as necessary to get the job done while pausing any other work.

The Code Orange work is organized into three main areas:

Require controlled rollouts for any configuration change that is propagated to the network, just like we do today for software binary releases.
Review, improve, and test failure modes of all systems handling network traffic to ensure they exhibit well-defined behavior under all conditions, including unexpected error states.
Change our internal “break glass”* procedures, and remove any circular dependencies so that we, and our customers, can act fast and access all systems without issue during an incident.

These projects will deliver iterative improvements as they proceed, rather than one “big bang” change at their conclusion. Every individual update will contribute to more resiliency at Cloudflare. By the end, we expect Cloudflare’s network to be much more resilient, including for issues such as those that triggered the global incidents we experienced in the last two months.

We understand that these incidents are painful for our customers and the Internet as a whole. We’re deeply embarrassed by them, which is why this work is the first priority for everyone here at Cloudflare.

* Break glass procedures at Cloudflare allow certain individuals to elevate their privilege under certain circumstances to perform urgent actions to resolve high severity scenarios.

What went wrong?

In the first incident, users visiting a customer site on Cloudflare saw error pages that indicated Cloudflare could not deliver a response to their request. In the second, they saw blank pages.

Both outages followed a similar pattern. In the moments leading up to each incident we instantaneously deployed a configuration change in our data centers in hundreds of cities around the world.

The November change was an automatic update to our Bot Management classifier. We run various artificial intelligence models that learn from the traffic flowing through our network to build detections that identify bots. We constantly update those systems to stay ahead of bad actors trying to evade our security protection to reach customer sites.

During the December incident, while trying to protect our customers from a vulnerability in the popular open source framework React, we deployed a change to a security tool used by our security analysts to improve our signatures. Similar to the urgency of new bot management updates, we needed to get ahead of the attackers who wanted to exploit the vulnerability. That change triggered the start of the incident.

This pattern exposed a serious gap in how we deploy configuration changes at Cloudflare, versus how we release software updates. When we release software version updates, we do so in a controlled and monitored fashion. For each new binary release, the deployment must successfully complete multiple gates before it can serve worldwide traffic. We deploy first to employee traffic, before carefully rolling out the change to increasing percentages of customers worldwide, starting with free users. If we detect an anomaly at any stage, we can revert the release without any human intervention.

We have not applied that methodology to configuration changes. Unlike releasing the core software that powers our network, a configuration change modifies the values that control how that software behaves, and we can make it instantly. We give this power to our customers too: if you make a change to a setting in Cloudflare, it will propagate globally in seconds.

While that speed has advantages, it also comes with risks that we need to address. The past two incidents have demonstrated that we need to treat any change that is applied to how we serve traffic in our network with the same level of tested caution that we apply to changes to the software itself.

We will change how we deploy configuration updates at Cloudflare

Our ability to deploy configuration changes globally within seconds was the core commonality across the two incidents. In both events, a wrong configuration took down our network in seconds.

Introducing controlled rollouts of our configuration, just as we already do for software releases, is the most important workstream of our Code Orange plan.

Configuration changes at Cloudflare propagate to the network very quickly. When a user creates a new DNS record, or creates a new security rule, it reaches 90% of servers on the network within seconds. This is powered by a software component that we internally call Quicksilver.

Quicksilver is also used for any configuration change required by our own teams. The speed is a feature: we can react and globally update our network behavior very quickly. However, in both incidents this caused a breaking change to propagate to the entire network in seconds rather than passing through gates to test it.

While the ability to deploy changes to our network on a near-instant basis is useful in many cases, it is rarely necessary. Work is underway to treat configuration the same way that we treat code by introducing controlled deployments within Quicksilver to any configuration change.

We release software updates to our network multiple times per day through what we call our Health Mediated Deployment (HMD) system. In this framework, every team at Cloudflare that owns a service (a piece of software deployed into our network) must define the metrics that indicate a deployment has succeeded or failed, the rollout plan, and the steps to take if it does not succeed.

Different services will have slightly different variables. Some might need longer wait times before proceeding to more data centers, while others might have lower tolerances for error rates even if it causes false positive signals.

Once deployed, our HMD toolkit begins to carefully progress against that plan while monitoring each step before proceeding. If any step fails, the rollback will automatically begin and the team can be paged if needed.
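The gated progression described above can be illustrated with a minimal sketch. It is not the HMD implementation: the stage names, the single error-rate metric, the threshold, and the callback names are all assumptions made for illustration, and a real system would also wait and sample metrics over a window before each step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str             # hypothetical stage label, e.g. "employees"
    traffic_percent: int  # share of traffic exposed at this stage


def staged_rollout(
    stages: List[Stage],
    apply_change: Callable[[Stage], None],
    error_rate: Callable[[Stage], float],
    rollback: Callable[[], None],
    max_error_rate: float,
) -> bool:
    """Advance stage by stage; roll back automatically at the first unhealthy stage."""
    for stage in stages:
        apply_change(stage)
        if error_rate(stage) > max_error_rate:
            rollback()      # no human intervention needed to revert
            return False    # later stages never receive the change
    return True


# Stand-in callbacks: the change looks healthy until it reaches 10% of traffic.
applied: List[str] = []
rolled_back: List[bool] = []
plan = [
    Stage("employees", 0),
    Stage("free users", 1),
    Stage("10% of customers", 10),
    Stage("everyone", 100),
]
ok = staged_rollout(
    plan,
    apply_change=lambda s: applied.append(s.name),
    error_rate=lambda s: 0.20 if s.traffic_percent >= 10 else 0.001,
    rollback=lambda: rolled_back.append(True),
    max_error_rate=0.01,
)
print(ok, applied)
```

In this run the final "everyone" stage is never reached, which is the property the plan aims to extend from software releases to configuration changes.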

By the end of Code Orange, configuration updates will follow this same process. We expect this to allow us to quickly catch the kinds of issues that occurred in these past two incidents long before they become widespread problems.

How will we address failure modes between services?

While we are optimistic that better control over configuration changes will catch more problems before they become incidents, we know that mistakes can and will occur. During both incidents, errors in one part of our network became problems in most of our technology stack, including the control plane that customers rely on to configure how they use Cloudflare.

We need to think about careful, graduated rollouts not just in terms of geographic progression (spreading to more of our data centers) or in terms of population progression (spreading to employees and customer types). We also need to plan for safer deployments that contain failures from service progression (spreading from one product like our Bot Management service to an unrelated one like our dashboard).

To that end, we are in the process of reviewing the interface contracts between every critical product and service that makes up our network to ensure that we a) assume failure will occur at each interface and b) handle that failure in the most reasonable way possible.

To go back to our Bot Management service failure, there were at least two key interfaces where, if we had assumed failure was going to happen, we could have handled it gracefully to the point that it was unlikely any customer would have been impacted. The first was in the interface that read the corrupted config file. Instead of panicking, there should have been a sane set of validated defaults which would have allowed traffic to pass through our network, while we would have, at worst, lost the realtime fine-tuning that feeds into our bot detection machine-learning models.

The second interface was between the core software that runs our network and the Bot Management module itself. In the event that our Bot Management module failed (as it did), we should not have dropped traffic by default. Instead, we could have applied, once again, a saner default: allowing the traffic to pass with a passable classification.
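Both interfaces can be sketched in a few lines. This is a hypothetical illustration, not Cloudflare's code: the JSON format, the `features` key, the cap of 200, and the function names are all assumptions. The first function falls back to validated defaults when a config file is corrupt or out of range; the second lets traffic pass unscored when the scoring module fails.

```python
import json
from typing import Any, Callable, Dict, Optional

# Hypothetical known-good defaults; the real values are not given in the post.
DEFAULT_CONFIG: Dict[str, Any] = {"features": []}


def load_config(raw: str, max_features: int = 200) -> Dict[str, Any]:
    """Return the parsed config, or validated defaults if it is corrupt or out of range."""
    try:
        config = json.loads(raw)
        features = config["features"]
        if not isinstance(features, list) or len(features) > max_features:
            raise ValueError("feature list missing or over the cap")
        return config
    except (ValueError, KeyError, TypeError) as exc:
        # json.JSONDecodeError is a ValueError, so malformed input lands here too.
        print(f"config rejected, using defaults: {exc}")
        return dict(DEFAULT_CONFIG)


def score_request(
    request: Dict[str, Any],
    score_fn: Callable[[Dict[str, Any]], float],
) -> Optional[float]:
    """Fail open: if scoring raises, return None so the request passes unscored."""
    try:
        return score_fn(request)
    except Exception as exc:
        print(f"bot scoring failed, passing traffic unscored: {exc}")
        return None


def broken_scorer(request: Dict[str, Any]) -> float:
    raise RuntimeError("module crashed")


good = load_config('{"features": ["a", "b"]}')
corrupt = load_config("{not json")
oversized = load_config(json.dumps({"features": list(range(300))}))
unscored = score_request({"path": "/"}, broken_scorer)
```

The trade-off is the one described above: at worst the system loses fine-tuning or a score for some requests, instead of dropping them.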

How will we solve emergencies faster?

During the incidents, it took us too long to resolve the problem. In both cases, this was worsened by our security systems preventing team members from accessing the tools they needed to fix the problem, and in some cases, circular dependencies slowed us down as some internal systems also became unavailable.

As a security company, all our tools are behind authentication layers with fine-grained access controls to ensure customer data is safe and to prevent unauthorized access. This is the right thing to do, but at the same time, our current processes and systems slowed us down when speed was a top priority.

Circular dependencies also affected our customers' experience. For example, during the November 18 incident, Turnstile, our no-CAPTCHA bot solution, became unavailable. Because we use Turnstile on the login flow to the Cloudflare dashboard, customers who did not have active sessions or API service tokens were not able to log in to Cloudflare at the moment they most needed to make critical changes.

Our team will be reviewing and improving all of the break glass procedures and technology to ensure that, when necessary, we can access the right tools as fast as possible while maintaining our security requirements. This includes reviewing and removing circular dependencies, or being able to “bypass” them quickly in the event there is an incident. We will also increase the frequency of our training exercises, so that processes are well understood by all teams prior to any potential disaster scenario in the future.
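One way to review circular dependencies is to model services as a directed graph and search it for cycles. The sketch below uses a depth-first search. The graph is a deliberately simplified, hypothetical example: the post only states that the dashboard login flow used Turnstile, and the other edges and all service names are assumptions.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of service names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state = {name: WHITE for name in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we have closed a loop
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for name in deps:
        if state[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None


# Hypothetical graph: "X": ["Y"] means X needs Y to be available.
deps = {
    "dashboard_login": ["turnstile"],
    "turnstile": ["edge_network"],
    "edge_network": ["incident_mitigation"],
    "incident_mitigation": ["dashboard_login"],
}
cycle = find_cycle(deps)
print(cycle)
```

A check like this could run whenever a dependency is declared, so that a new loop is flagged before an incident exposes it.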

When will we be done?

While we haven’t captured in this post all the work being undertaken internally, the workstreams detailed above describe the top priorities the teams are being asked to focus on. Each of these workstreams maps to a detailed plan touching nearly every product and engineering team at Cloudflare. We have a lot of work to do.

By the end of Q1, and largely before then, we will:

Ensure all production systems are covered by Health Mediated Deployments (HMD) for configuration management.
Update our systems to adhere to proper failure modes as appropriate for each product set.
Ensure we have processes in place so the right people have the right access to provide proper remediation during an emergency.

Some of these goals will be evergreen. We will always need to better handle circular dependencies as we launch new software, and our break glass procedures will need to be updated to reflect how our security technology changes over time.

We failed our users and the Internet as a whole in these past two incidents. We have work to do to make it right. We plan to share updates as this work proceeds and appreciate the questions and feedback we have received from our customers and partners.

Cloudflare outage on December 5, 2025

2025-12-05 Dane Knecht

Post Syndicated from Dane Knecht original https://blog.cloudflare.com/5-december-2025-outage/

On December 5, 2025, at 08:47 UTC (all times in this blog are UTC), a portion of Cloudflare’s network began experiencing significant failures. The incident was resolved at 09:12 (~25 minutes total impact), when all services were fully restored.

A subset of customers were impacted, accounting for approximately 28% of all HTTP traffic served by Cloudflare. Several factors needed to combine for an individual customer to be affected as described below.

The issue was not caused, directly or indirectly, by a cyber attack on Cloudflare’s systems or malicious activity of any kind. Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

Any outage of our systems is unacceptable, and we know we have let the Internet down again following the incident on November 18. We will be publishing details next week about the work we are doing to stop these types of incidents from occurring.

What happened

[Graph: HTTP 500 errors served by our network during the incident timeframe (red line at the bottom), compared to unaffected total Cloudflare traffic (green line at the top).]

Cloudflare’s Web Application Firewall (WAF) provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis. Before today, the buffer size was set to 128KB.

As part of our ongoing work to protect customers using React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications. We wanted to make sure as many customers as possible were protected.

This change was being rolled out using our gradual deployment system, and, as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules. As this was an internal tool, and the fix being rolled out was a security improvement, we decided to disable the tool for the time being as it was not required to serve or protect customer traffic.

Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.

Under certain circumstances in the FL1 version of our proxy, this latter change caused an error state that resulted in HTTP 500 error codes being served from our network.

As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following Lua exception:

[lua] Failed to run module rulesets callback late_routing: /usr/local/nginx-fl/lua/modules/init.lua:314: attempt to index field 'execute' (a nil value)

resulting in HTTP code 500 errors being issued.

The issue was identified shortly after the change was applied, and was reverted at 09:12, after which all traffic was served correctly.

Customers that have their web assets served by our older FL1 proxy AND had the Cloudflare Managed Ruleset deployed were impacted. All requests for websites in this state returned an HTTP 500 error, with the small exception of some test endpoints such as /cdn-cgi/trace.

Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.

The runtime error

Cloudflare’s rulesets system consists of sets of rules which are evaluated for each request entering our system. A rule consists of a filter, which selects some traffic, and an action which applies an effect to that traffic. Typical actions are “block”, “log”, or “skip”. Another type of action is “execute”, which is used to trigger evaluation of another ruleset.

Our internal logging system uses this feature to evaluate new rules before we make them available to the public. A top level ruleset will execute another ruleset containing test rules. It was these test rules that we were attempting to disable.

We have a killswitch subsystem as part of the rulesets system which is intended to allow a rule which is misbehaving to be disabled quickly. This killswitch system receives information from our global configuration system mentioned in the prior sections. We have used this killswitch system on a number of occasions in the past to mitigate incidents and have a well-defined Standard Operating Procedure, which was followed in this incident.

However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset:

if rule_result.action == "execute" then
  rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end

This code expects that, if the rule has action="execute", the rule_result.execute object will exist. However, because the rule had been skipped, the rule_result.execute object did not exist, and Lua raised an error when the code attempted to index a nil value.
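The failure mode can be reproduced outside Lua. The Python sketch below is an analog written for illustration, not the fix that was deployed (the post does not show one). The dictionary layout mirrors the field names in the Lua excerpt, while the function name and sample data are assumptions. The guard checks that the execute payload exists before indexing into it.

```python
from typing import Any, Dict, List, Optional

RuleResult = Dict[str, Any]


def attach_sub_results(rule_result: RuleResult, ruleset_results: List[Any]) -> RuleResult:
    """Link an "execute" rule to its sub-ruleset results, tolerating a skipped rule."""
    execute: Optional[Dict[str, Any]] = rule_result.get("execute")
    # A killswitched rule keeps action == "execute" but has no "execute" payload.
    if rule_result.get("action") == "execute" and execute is not None:
        # Python lists are 0-based, unlike Lua tables, so the sample index below is 0.
        execute["results"] = ruleset_results[int(execute["results_index"])]
    return rule_result


sub_results = [["test rule matched"]]
normal = attach_sub_results(
    {"action": "execute", "execute": {"results_index": "0"}}, sub_results
)
killswitched = attach_sub_results({"action": "execute"}, sub_results)
print(normal["execute"]["results"], killswitched)
```

Without the `execute is not None` check, the killswitched case would raise an exception in the same way the Lua code did.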

This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

What about the changes being made after the incident on November 18, 2025?

We made an unrelated change that caused a similar, longer availability incident two weeks ago on November 18, 2025. In both cases, a deployment to help mitigate a security issue for our customers propagated to our entire network and led to errors for nearly all of our customer base.

We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.

We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization. In particular, the projects outlined below should help contain the impact of these kinds of changes:

Enhanced Rollouts & Versioning: Similar to how we slowly deploy software with strict health validation, data used for rapid threat response and general configuration needs to have the same safety and blast mitigation features. This includes health validation and quick rollback capabilities among other things.
Streamlined break glass capabilities: Ensure that critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers.
“Fail-Open” Error Handling: As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios. This will include drift-prevention capabilities to ensure this is enforced continuously.

Before the end of next week we will publish a detailed breakdown of all the resiliency projects underway, including the ones listed above. While that work is underway, we are locking down all changes to our network in order to ensure we have better mitigation and rollback systems before we begin again.

These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours. On behalf of the team at Cloudflare we want to apologize for the impact and pain this has caused again to our customers and the Internet as a whole.

Timeline

Time (UTC)	Status	Description
08:47	INCIDENT start	Configuration change deployed and propagated to the network
08:48	Full impact	Change fully propagated
08:50	INCIDENT declared	Automated alerts
09:11	Change reverted	Configuration change reverted and propagation started
09:12	INCIDENT end	Revert fully propagated, all traffic restored

Every Cloudflare feature, available to everyone

2025-09-25 Dane Knecht

Post Syndicated from Dane Knecht original https://blog.cloudflare.com/enterprise-grade-features-for-all/

Over the next year Cloudflare will make nearly every feature we offer available to any customer who wants to buy and use it regardless of whether they are an enterprise account. No need to pick up a phone and talk to a sales team member. No requirement to find time with a solutions engineer in our team to turn on a feature. No contract necessary. We believe that if you want to use something we offer, you should just be able to buy it.

Today’s launch starts by bringing Single Sign-On (SSO) into our dashboard out of our enterprise plan and making it available to any user. That capability is the first of many. We will be sharing updates over the next few months as more and more features become available for purchase on any plan.

We are also making a commitment to ensuring that all future releases will follow this model. The goal is not to restrict new tools to the enterprise tier for some amount of time before making them widely available. We believe helping build a better Internet means making sure the best tools are available to anyone who needs them.

Enterprise grade for everyone

It’s not enough to build the best tools on the web. At Cloudflare our mission is to help build a better Internet and that means making the tools we build accessible. We believe the best way to make the Internet faster and more secure is to put powerful features into the hands of as many people as possible.

We first launched an Enterprise tier years ago when larger customers came to us looking to scale their usage of Cloudflare in new ways. They needed procurement options beyond a credit card, like invoices, custom contracts, and dedicated support. This offering was a necessary and important step to bring the benefits of our network and tools to large organizations with complex needs.

This created an unintended side effect in how we shipped products. Some of our most powerful and innovative features were launched within an enterprise-only tier. This created a gap, a two-tiered system where some of the most advanced features were reserved only for the largest companies.

It also created a divergence in our product development. Features built for our self-service customers had to be incredibly simple and intuitive from day-one. Features designated “enterprise-only” didn’t always face that same pressure to scale – we could instead rely on our solutions teams or partners to help set up and support.

It’s time to fix that. Starting today, we are doing away with the concept of “enterprise-only” features. Over the coming months and quarters, we will make many of our most advanced capabilities available to all of our customers.

The change will help build a more secure Internet by removing barriers to the adoption of the most advanced tools available. The change improves the experience for all customers. Smaller teams on our self-service plans will have access to the most powerful configuration options we offer. Existing enterprise teams will have easier pathways to adopt new tools without calling their account manager. And our own Product teams have even more reason to continue to make all features we ship easy to use.

Today we are beginning with dashboard SSO; instructions on how to set it up right now are below. It is the first of many, though: capabilities like apex proxying and expanded upload limits, along with many others of our most requested enterprise features, will follow.

Starting with how you sign in to Cloudflare

One example of a feature we launched only to enterprise customers because of the complexity in setting it up is SSO. Enterprise teams maintain their own identity provider where they can manage internal employee accounts and how their team members log into different services.

They integrate these identity providers with the tools their employees need so that team members do not need to create and remember a username and password for each and every service. More importantly, the management of identity in a single place gives enterprises the ability to control authentication policies, onboard and offboard users, and hand out licenses for tools.

We first launched our own SSO support way back in 2018. In the last seven years we have been helping thousands of enterprise customers manually set this up, but we know that teams of all sizes rely on the security and convenience of an identity provider. As part of this announcement, the first enterprise feature we are making available to everyone is dashboard SSO.

The functionality is available immediately to anyone on any plan. To get started, follow the instructions here to integrate your identity provider with Cloudflare and to then connect your domain with your account. By setting up your identity provider for dashboard SSO you will also be able to begin using the vast majority of our Zero Trust security features, which are available at no cost for up to 50 users.

We also know that some teams are too early or distributed to have a full-fledged identity provider but want the convenience and security of managing logins in one place. To that end, we are also excited to launch support for GitHub as a social login provider to the Cloudflare dashboard as part of today’s announcement.

And extending to almost everything else over the next year

We prioritized dashboard SSO because just about every team that uses Cloudflare wants it. This one change helps make nearly every customer safer by allowing them to centrally manage team access. As we burn down the list of previously enterprise-only features, we will continue targeting those that have similar broad impact.

Some capabilities, like Magic Transit, have less broad appeal. The organizations that maintain their own networks and want to deploy Magic Transit tend to already want to be enterprise customers for account management reasons. That said, we can still improve their experience by making tools like Magic Transit available to all plans, because we will have to remove some of the friction in the setup that we have historically just solved with people hours from our solutions engineers and partners.

We also realize that the way some of these features are priced only made sense with an invoice or enterprise license agreement model. To make this work, we need to revisit how some of our usage metering and billing works. That will continue to be a priority for us, and we are excited about how this will push us to continue making our packaging and billing even simpler for all customers.

There are some features that we can’t make available to everyone because of non-technical reasons. For example, using our China Network has complicated legal requirements in China that are impossible for us to manage for millions of customers.

Self-service by default going forward

One thing we are not announcing today is a strategy to continue to release “enterprise-only” features for a while before they eventually make it to the self-service plans. Going forward, to launch something at Cloudflare the team will need to make sure that any customer can buy it off the shelf without talking to someone.

We expect that requirement to improve how all products are built here, not just the more advanced capabilities. We also consider it mission-critical. We have a long history of making the kinds of tools that only the largest businesses could buy available to anyone, from universal SSL over a decade ago to newer features this week that were available for self-service plans immediately like per-customer bot detection IDs and security of data in transit between SaaS applications. We are excited to continue this tradition.

What’s next?

You can get started right now setting up dashboard SSO in your Cloudflare account using the documentation available here. We will continue to share updates as previously enterprise-only features are made available to any plan.
