Tag Archives: Dogfooding

Let’s DO this: detecting Workers Builds errors across 1 million Durable Objects

Post Syndicated from Jacob Hands original https://blog.cloudflare.com/detecting-workers-builds-errors-across-1-million-durable-durable-objects/

Cloudflare Workers Builds is our CI/CD product that makes it easy to build and deploy Workers applications every time code is pushed to GitHub or GitLab. What makes Workers Builds special is that projects can be built and deployed with minimal configuration. Just hook up your project and let us take care of the rest!

But what happens when things go wrong, such as failing to install tools or dependencies? What usually happens is that we don’t fix the problem until a customer contacts us about it, at which point many other customers have likely faced the same issue. This can be a frustrating experience for both us and our customers because of the lag time between issues occurring and us fixing them.

We want Workers Builds to be reliable, fast, and easy to use so that developers can focus on building, not dealing with our bugs. That’s why we recently started building an error detection system that can detect, categorize, and surface all build issues occurring on Workers Builds, enabling us to proactively fix issues and add missing features.

It’s also no secret that we’re big fans of being “Customer Zero” at Cloudflare, and Workers Builds is itself a product that’s built end-to-end on our Developer Platform using Workers, Durable Objects, Hyperdrive, Containers, Queues, Workers KV, R2, and Workers Observability.

In this post, we will dive into how we used the Cloudflare Developer Platform to check for issues across more than 1 million Durable Objects.

Background: Workers Builds architecture

Back in October 2024, we wrote about how we built Workers Builds entirely on the Workers platform. To recap, Builds is built using Workers, Durable Objects, Workers KV, R2, Queues, Hyperdrive, and a Postgres database. Some of these things were not present when launched back in October (for example, Queues and KV). But the core of the architecture is the same.

A client Worker receives GitHub/GitLab webhooks and stores build metadata in Postgres (via Hyperdrive). A build management Worker uses two Durable Object classes: a Scheduler class to find builds in Postgres that need scheduling, and a class called BuildBuddy to manage the lifecycle of a build. When a build needs to be started, Scheduler creates a new BuildBuddy instance which is responsible for creating a container for the build (using Cloudflare Containers), monitoring the container with health checks, and receiving build logs so that they can be viewed in the Cloudflare Dashboard.


In addition to this core scheduling logic, we have several Workers Queues for background work such as sending PR comments to GitHub/GitLab.

The problem: builds are failing

While this architecture has worked well for us so far, we found ourselves with a problem: compared to Cloudflare Pages, a concerning percentage of builds were failing. We needed to dig deeper and figure out what was wrong, and understand how we could improve Workers Builds so that developers can focus more on shipping instead of build failures.

Types of build failures

Not all build failures are the same. We have several categories of failures that we monitor:

  • Initialization failures: when the container fails to start.

  • Clone failures: failing to clone the repository from GitHub/GitLab.

  • Build timeouts: builds that ran past the limit and were terminated by BuildBuddy.

  • Builds failing health checks: the container stopped responding to health checks, e.g. the container crashed for an unknown reason.

  • Failure to install tools or dependencies.

  • Failed user build/deploy commands.

The first few failure types were straightforward, and we’ve been able to track down and fix issues in our build system and control plane to improve what we call “build completion rate”. We define build completion as the following:

  1. We successfully started the build.

  2. We attempted to install tools/dependencies (considering failures as “user error”).

  3. We attempted to run the user-defined build/deploy commands (again, considering failures as “user error”).

  4. We successfully marked the build as stopped in our database.

For example, we had a bug where builds for a deleted Worker would attempt to run and continuously fail, which affected our build completion rate metric.

User error

We’ve made a lot of progress improving the reliability of build and container orchestration, but we had a significant percentage of build failures in the “user error” metric. We started asking ourselves “is this actually user error? Or is there a problem with the product itself?”

This presented a challenge because questions like “did the build command fail due to a bug in the build system, or user error?” are a lot harder to answer than pass/fail issues like failing to create a container for the build. To answer these questions, we had to build something new, something smarter.

Build logs

The most obvious way to determine why a build failed is to look at its logs. When spot-checking build failures, we can typically identify what went wrong. For example, some builds fail to install dependencies because of an out of date lockfile (e.g. package-lock.json out of date with package.json). But looking through build failures one by one doesn’t scale. We didn’t want engineers looking through customer build logs without at least suspecting that there was an issue with our build system that we could fix.

Automating error detection

At this point, next steps were clear: we needed an automated way to identify why a build failed based on build logs, and provide a way for engineers to see what the top issues were while ensuring privacy (e.g. removing account-specific identifiers and file paths from the aggregate data).

Detecting errors in build logs using Workers Queues

The first thing we needed was a way to categorize build errors after a build fails. To do this, we created a queue named BuildErrorsQueue to process builds and look for errors. After a build fails, BuildBuddy will send the build ID to BuildErrorsQueue which fetches the logs, checks for issues, and saves results to Postgres.


We started out with a few static patterns to match things like Wrangler errors in log lines:

export const DetectedErrorCodes = {
  wrangler_error: {
    detect: async (lines: LogLines) => {
      const errors: DetectedError[] = []
      for (const line of lines) {
        if (line[2].trim().startsWith('✘ [ERROR]')) {
          errors.push({
            error_code: 'wrangler_error',
            error_group: getWranglerLogGroupFromLogLine(line, wranglerRegexMatchers),
            detected_on: new Date(),
            lines_matched: [line],
          })
        }
      }
      return errors
    },
  },
  installing_tools_or_dependencies_failed: { ... },
}

It wouldn’t be useful if all Wrangler errors were grouped under a single generic “wrangler_error” code, so we further grouped them by normalizing the log lines into groups:

function getWranglerLogGroupFromLogLine(
  logLine: LogLine,
  regexMatchers: RegexMatcher[]
): string {
  const original = logLine[2].trim().replaceAll(/[\t\n\r]+/g, ' ')
  let message = original
  let group = original
  for (const { mustMatch, patterns, stopOnMatch, name, useNameAsGroup } of regexMatchers) {
    if (mustMatch !== undefined) {
      const matched = matchLineToRegexes(message, mustMatch)
      if (!matched) continue
    }
    if (patterns) {
      for (const [pattern, mask] of patterns) {
        message = message.replaceAll(pattern, mask)
      }
    }
    if (useNameAsGroup === true) {
      group = name
    } else {
      group = message
    }
    if (Boolean(stopOnMatch) && message !== original) break
  }
  return group
}

const wranglerRegexMatchers: RegexMatcher[] = [
  {
    name: 'could_not_resolve',
    // ✘ [ERROR] Could not resolve "./balance"
    // ✘ [ERROR] Could not resolve "node:string_decoder" (originally "string_decoder/")
    mustMatch: [/^✘ \[ERROR\] Could not resolve "[@\w :/\\.-]*"/i],
    stopOnMatch: true,
    patterns: [
      [/(?<=^✘ \[ERROR\] Could not resolve ")[@\w :/\\.-]*(?=")/gi, '<MODULE>'],
      [/(?<=\(originally ")[@\w :/\\.-]*(?=")/gi, '<MODULE>'],
    ],
  },
  {
    name: 'no_matching_export_for_import',
    // ✘ [ERROR] No matching export in "src/db/schemas/index.ts" for import "someCoolTable"
    mustMatch: [/^✘ \[ERROR\] No matching export in "/i],
    stopOnMatch: true,
    patterns: [
      [/(?<=^✘ \[ERROR\] No matching export in ")[@~\w:/\\.-]*(?=")/gi, '<MODULE>'],
      [/(?<=" for import ")[\w-]*(?=")/gi, '<IMPORT>'],
    ],
  },
  // ...many more added over time
]

Once we had our error detection matchers and normalizing logic in place, implementing the BuildErrorsQueue consumer was easy:

export async function handleQueue(
  batch: MessageBatch,
  env: Bindings,
  ctx: ExecutionContext
): Promise<void> {
  ...
  await pMap(batch.messages, async (msg) => {
    try {
      const { build_id } = BuildErrorsQueueMessageBody.parse(msg.body)
      await store.buildErrors.deleteErrorsByBuildId({ build_id })
      const bb = getBuildBuddy(env, build_id)
      const errors: DetectedError[] = []
      let cursor: LogsCursor | undefined
      let hasMore = false

      do {
        using maybeNewLogs = await bb.getLogs(cursor, false)
        const newLogs = LogsWithCursor.parse(maybeNewLogs)
        cursor = newLogs.cursor
        const newErrors = await detectErrorsInLogLines(newLogs.lines)
        errors.push(...newErrors)
        hasMore = Boolean(cursor) && newLogs.lines.length > 0
      } while (hasMore)

      if (errors.length > 0) {
        await store.buildErrors.insertErrors(
          errors.map((e) => ({
            build_id,
            error_code: e.error_code,
            error_group: e.error_group,
          }))
        )
      }
      msg.ack()
    } catch (e) {
      msg.retry()
      sentry.captureException(e)
    }
  })
}

Here, we’re fetching logs from each build’s BuildBuddy Durable Object, detecting why it failed using the matchers we wrote, and saving errors to the Postgres DB. We also delete any existing errors for when we improve our error detection patterns to prevent subsequent runs from adding duplicate data to our database.

What about historical builds?

The BuildErrorsQueue was great for new builds, but this meant we still didn’t know why all the previous build failures happened other than “user error”. We considered only tracking errors in new builds, but this was unacceptable because it would significantly slow down our ability to improve our error detection system because each iteration would require us to wait days to identify issues we need to prioritize.

Problem: logs are stored across one million+ Durable Objects

Remember how every build has an associated BuildBuddy DO to store logs? This is a great design for ensuring our logging pipeline scales with our customers, but it presented a challenge when trying to aggregate issues based on logs because something would need to go through all historical builds (>1 million at the time) to fetch logs and detect why they failed.

If we were using Go and Kubernetes, we might solve this using a long-running container that goes through all builds and runs our error detection. But how do we solve this in Workers?

How do we backfill errors for historical builds?

At this point, we already had the Queue to process new builds. If we could somehow send all of the old build IDs to the queue, it could scan them all quickly using Queues concurrent consumers to quickly work through all builds. We thought about hacking together a local script to fetch all of the log IDs and sending them to an API to put them on a queue. But we wanted something more secure and easier to use so that running a new backfill was as simple as an API call.

That’s when an idea hit us: what if we used a Durable Object with alarms to fetch a range of builds and send them to BuildErrorsQueue? At first, it seemed far-fetched, given that Durable Object alarms have a limited amount of work they can do per invocation. But wait, if AI Agents built on Durable Objects can manage background tasks, why can’t we fetch millions of build IDs and forward them to queues?

Building a Build Errors Agent with Durable Objects

The idea was simple: create a Durable Object class named BuildErrorsAgent and run a single instance that loops through the specified range of builds in the database and sends them to BuildErrorsQueue.


The first thing we did was set up an RPC method to start a backfill and save the parameters in Durable Object KV storage so that it can be read each time the alarm executes:

async start({
  min_build_id,
  max_build_id,
}: {
  min_build_id: BuildRecord['build_id']
  max_build_id: BuildRecord['build_id']
}): Promise<void> {
  logger.setTags({ handler: 'start', environment: this.env.ENVIRONMENT })
  try {
    if (min_build_id < 0) throw new Error('min_build_id cannot be negative')
    if (max_build_id < min_build_id) {
      throw new Error('max_build_id cannot be less than min_build_id')
    }
    const [started_on, stopped_on] = await Promise.all([
      this.kv.get('started_on'),
      this.kv.get('stopped_on'),
    ])
    await match({ started_on, stopped_on })
      .with({ started_on: P.not(null), stopped_on: P.nullish }, () => {
        throw new Error('BuildErrorsAgent is already running')
      })
      .otherwise(async () => {
        // delete all existing data and start queueing failed builds
        await this.state.storage.deleteAlarm()
        await this.state.storage.deleteAll()
        this.kv.put('started_on', new Date())
        this.kv.put('config', { min_build_id, max_build_id })
        void this.state.storage.setAlarm(this.getNextAlarmDate())
      })
  } catch (e) {
    this.sentry.captureException(e)
    throw e
  }
}

The most important part of the implementation is the alarm that runs every second until the job is complete. Each alarm invocation has the following steps:

  1. Set a new alarm (always first to ensure an error doesn’t cause it to stop).

  2. Retrieve state from KV.

  3. Validate that the agent is supposed to be running:

    1. Ensure the agent is supposed to be running.

    2. Ensure we haven’t reached the max build ID set in the config.

  4. Finally, queue up another batch of builds by querying Postgres and sending to the BuildErrorsQueue.


async alarm(): Promise<void> {
  logger.setTags({ handler: 'alarm', environment: this.env.ENVIRONMENT })
  try {
    void this.state.storage.setAlarm(Date.now() + 1000)
    const kvState = await this.getKVState()
    this.sentry.setContext('BuildErrorsAgent', kvState)
    const ctxLogger = logger.withFields({ state: JSON.stringify(kvState) })

    await match(kvState)
      .with({ started_on: P.nullish }, async () => {
        ctxLogger.info('BuildErrorsAgent is not started, cancelling alarm')
        await this.state.storage.deleteAlarm()
      })
      .with({ stopped_on: P.not(null) }, async () => {
        ctxLogger.info('BuildErrorsAgent is stopped, cancelling alarm')
        await this.state.storage.deleteAlarm()
      })
      .with(
        // we should never have started_on set without config set, but just in case
        { started_on: P.not(null), config: P.nullish },
        async () => {
          const msg =
            'BuildErrorsAgent started but config is empty, stopping and cancelling alarm'
          ctxLogger.error(msg)
          this.sentry.captureException(new Error(msg))
          this.kv.put('stopped_on', new Date())
          await this.state.storage.deleteAlarm()
        }
      )
      .when(
        // make sure there are still builds to enqueue
        (s) =>
          s.latest_build_id !== null &&
          s.config !== null &&
          s.latest_build_id >= s.config.max_build_id,
        async () => {
          ctxLogger.info('BuildErrorsAgent job complete, cancelling alarm')
          this.kv.put('stopped_on', new Date())
          await this.state.storage.deleteAlarm()
        }
      )
      .with(
        {
          started_on: P.not(null),
          stopped_on: P.nullish,
          config: P.not(null),
          latest_build_id: P.any,
        },
        async ({ config, latest_build_id }) => {
          // 1. select batch of ~1000 builds
          // 2. send them to Queues 100 at a time, updating
          //    latest_build_id after each batch is sent
          const failedBuilds = await this.store.builds.selectFailedBuilds({
            min_build_id: latest_build_id !== null ? latest_build_id + 1 : config.min_build_id,
            max_build_id: config.max_build_id,
            limit: 1000,
          })
          if (failedBuilds.length === 0) {
            ctxLogger.info(`BuildErrorsAgent: ran out of builds, stopping and cancelling alarm`)
            this.kv.put('stopped_on', new Date())
            await this.state.storage.deleteAlarm()
          }

          for (
            let i = 0;
            i < BUILDS_PER_ALARM_RUN && i < failedBuilds.length;
            i += QUEUES_BATCH_SIZE
          ) {
            const batch = failedBuilds
              .slice(i, QUEUES_BATCH_SIZE)
              .map((build) => ({ body: build }))

            if (batch.length === 0) {
              ctxLogger.info(`BuildErrorsAgent: ran out of builds in current batch`)
              break
            }
            ctxLogger.info(
              `BuildErrorsAgent: sending ${batch.length} builds to build errors queue`
            )
            await this.env.BUILD_ERRORS_QUEUE.sendBatch(batch)
            this.kv.put(
              'latest_build_id',
              Math.max(...batch.map((m) => m.body.build_id).concat(latest_build_id ?? 0))
            )

            this.kv.put(
              'total_builds_processed',
              ((await this.kv.get('total_builds_processed')) ?? 0) + batch.length
            )
          }
        }
      )
      .otherwise(() => {
        const msg = 'BuildErrorsAgent has nothing to do - this should never happen'
        this.sentry.captureException(msg)
        ctxLogger.info(msg)
      })
  } catch (e) {
    this.sentry.captureException(e)
    throw e
  }
}

Using pattern matching with ts-pattern made it much easier to understand what states we were expecting and what will happen compared to procedural code. We considered using a more powerful library like XState, but decided on ts-pattern due to its simplicity.

Running the backfill

Once everything rolled out, we were able to trigger an errors backfill for over a million failed builds in a couple of hours with a single API call, categorizing 80% of failed builds on the first run. With a fast backfill process, we were able to iterate on our regex matchers to further refine our error detection and improve error grouping. Here’s what the error list looks like in our staging environment:


Fixes and improvements

Having a better understanding of what’s going wrong has already enabled us to make several improvements:

  • Wrangler now shows a clearer error message when no config file is found.

  • Fixed multiple edge-cases where the wrong package manager was used in TypeScript/JavaScript projects.

  • Added support for bun.lock (previously only checked for bun.lockb).

  • Fixed several edge cases where build caching did not work in monorepos.

  • Projects that use a runtime.txt file to specify a Python version no longer fail.

  • ….and more!

We’re still working on fixing other bugs we’ve found, but we’re making steady progress. Reliability is a feature we’re striving for in Workers Builds, and this project has helped us make meaningful progress towards that goal. Instead of waiting for people to contact support for issues, we’re able to proactively identify and fix issues (and catch regressions more easily).

One of the great things about building on the Developer Platform is how easy it is to ship things. The core of this error detection pipeline (the Queue and Durable Object) only took two days to build, which meant we could spend more time working on improving Workers Builds instead of spending weeks on the error detection pipeline itself.

What’s next?

In addition to continuing to improve build reliability and speed, we’ve also started thinking about other ways to help developers build their applications on Workers. For example, we built a Builds MCP server that allows users to debug builds directly in Cursor/Claude/etc. We’re also thinking about ways we can expose these detected issues in the Cloudflare Dashboard so that users can identify issues more easily without scrolling through hundreds of logs.

Ready to get started?

Building applications on Workers has never been easier! Try deploying a Durable Object-backed chat application with Workers Builds: 

How Cloudflare Security does Zero Trust

Post Syndicated from Noelle Gotthardt original https://blog.cloudflare.com/how-cloudflare-security-does-zero-trust/

How Cloudflare Security does Zero Trust

How Cloudflare Security does Zero Trust

Throughout Cloudflare One week, we provided playbooks on how to replace your legacy appliances with Zero Trust services. Using our own products is part of our team’s culture, and we want to share our experiences when we implemented Zero Trust.

Our journey was similar to many of our customers. Not only did we want better security solutions, but the tools we were using made our work more difficult than it needed to be. This started with just a search for an alternative to remotely connecting on a clunky VPN, but soon we were deploying Zero Trust solutions to protect our employees’ web browsing and email. Next, we are looking forward to upgrading our SaaS security with our new CASB product.

We know that getting started with Zero Trust can seem daunting, so we hope that you can learn from our own journey and see how it benefited us.

Replacing a VPN: launching Cloudflare Access

Back in 2015, all of Cloudflare’s internally-hosted applications were reached via a hardware-based VPN. On-call engineers would fire up a client on their laptop, connect to the VPN, and log on to Grafana. This process was frustrating and slow.

Many of the products we build are a direct result of the challenges our own team is facing, and Access is a perfect example. Launching as an internal project in 2015, Access enabled employees to access internal applications through our identity provider. We started with just one application behind Access with the goal of improving incident response times. Engineers who received a notification on their phones could tap a link and, after authenticating via their browser, would immediately have the access they needed. As soon as people started working with the new authentication flow, they wanted it everywhere. Eventually our security team mandated that we move our apps behind Access, but for a long time it was totally organic: teams were eager to use it.

With authentication occuring at our network edge, we were able to support a globally-distributed workforce without the latency of a VPN, and we were able to do so securely. Moreover, our team is committed to protecting our internal applications with the most secure and usable authentication mechanisms, and two-factor authentication is one of the most important security controls that can be implemented. With Cloudflare Access, we’re able to rely on the strong two-factor authentication mechanisms of our identity provider.

Not all second factors of authentication deliver the same level of security. Some methods are still vulnerable to man-in-the-middle (MITM) attacks. These attacks often feature bad actors stealing one-time passwords, commonly through phishing, to gain access to private resources. To eliminate that possibility, we implemented FIDO2 supported security keys. FIDO2 is an authenticator protocol designed to prevent phishing, and we saw it as an improvement to our reliance on soft tokens at the time.

While the implementation of FIDO2 can present compatibility challenges, we were enthusiastic to improve our security posture. Cloudflare Access enabled us to limit access to our systems to only FIDO2. Cloudflare employees are now required to use their hardware keys to reach our applications. The onboarding of Access was not only a huge win for ease of use, the enforcement of security keys was a massive improvement to our security posture.

Mitigate threats & prevent data exfiltration: Gateway and Remote Browser Isolation

Deploying secure DNS in our offices

A few years later, in 2020, many customers’ security teams were struggling to extend the controls they had enabled in the office to their remote workers. In response, we launched Cloudflare Gateway, offering customers protection from malware, ransomware, phishing, command & control, shadow IT, and other Internet risks over all ports and protocols. Gateway directs and filters traffic according to the policies implemented by the customer.

Our security team started with Gateway to implement DNS filtering in all of our offices. Since Gateway was built on top of the same network as 1.1.1.1, the world’s fastest DNS resolver, any current or future Cloudflare office will have DNS filtering without incurring additional latency. Each office connects to the nearest data center and is protected.

Deploying secure DNS for our remote users

Cloudflare’s WARP client was also built on top of our 1.1.1.1 DNS resolver. It extends the security and performance offered in offices to remote corporate devices. With the WARP client deployed, corporate devices connect to the nearest Cloudflare data center and are routed to Cloudflare Gateway. By sitting between the corporate device and the Internet, the entire connection from the device is secure, while also offering improved speed and privacy.

We sought to extend secure DNS filtering to our remote workforce and deployed the Cloudflare WARP client to our fleet of endpoint devices. The deployment enabled our security teams to better preserve our privacy by encrypting DNS traffic over DNS over HTTPS (DoH). Meanwhile, Cloudflare Gateway categorizes domains based on Radar, our own threat intelligence platform, enabling us to block high risk and suspicious domains for users everywhere around the world.

How Cloudflare Security does Zero Trust

Adding on HTTPS filtering and Browser Isolation

DNS filtering is a valuable security tool, but it is limited to blocking entire domains. Our team wanted a more precise instrument to block only malicious URLs, not the full domain. Since Cloudflare One is an integrated platform, most of the deployment was already complete. All we needed was to add the Cloudflare Root CA to our endpoints and then enable HTTP filtering in the Zero Trust dashboard. With those few simple steps, we were able to implement more granular blocking controls.

In addition to precision blocking, HTTP filtering enables us to implement tenant control. With tenant control, Gateway HTTP policies regulate access to corporate SaaS applications. Policies are implemented using custom HTTP headers. If the custom request header is present and the request is headed to an organizational account, access is granted. If the request header is present and the request goes to a non-organizational account, such as a personal account, the request can be blocked or opened in an isolated browser.

After protecting our users’ traffic at the DNS and HTTP layers, we implemented Browser Isolation. When Browser Isolation is implemented, all browser code executes in the cloud on Cloudflare’s network. This isolates our endpoints from malicious attacks and common data exfiltration techniques. Some remote browser isolation products introduce latency and frustrate users. Cloudflare’s Browser Isolation uses the power of our network to offer a seamless experience for our employees. It quickly improved our security posture without compromising user experience.

Preventing phishing attacks: Onboarding Area 1 email security

Also in early 2020, we saw an uptick in employee-reported phishing attempts. Our cloud-based email provider had strong spam filtering, but they fell short at blocking malicious threats and other advanced attacks. As we experienced increasing phishing attack volume and frequency we felt it was time to explore more thorough email protection options.

The team looked for four main things in a vendor: the ability to scan email attachments, the ability to analyze suspected malicious links, business email compromise protection, and strong APIs into cloud-native email providers. After testing many vendors, Area 1 became the clear choice to protect our employees. We implemented Area 1’s solution in early 2020, and the results have been fantastic.

Given the overwhelmingly positive response to the product and the desire to build out our Zero Trust portfolio, Cloudflare acquired Area 1 Email Security in April 2022. We are excited to offer the same protections we use to our customers.

What’s next: Getting started with Cloudflare’s CASB

Cloudflare acquired Vectrix in February 2022. Vectrix’s CASB offers functionality we are excited to add to Cloudflare One. SaaS security is an increasing concern for many security teams. SaaS tools are storing more and more sensitive corporate data, so misconfigurations and external access can be a significant threat. However, securing these platforms can present a significant resource challenge. Manual reviews for misconfigurations or externally shared files are time consuming, yet necessary processes for many customers. CASB reduces the burden on teams by ensuring security standards by scanning SaaS instances and identifying vulnerabilities with just a few clicks.

We want to ensure we maintain the best practices for SaaS security, and like many of our customers, we have many SaaS applications to secure. We are always seeking opportunities to make our processes more efficient, so we are excited to onboard one of our newest Zero Trust products.

Always striving for improvement

Cloudflare takes pride in deploying and testing our own products. Our security team works directly with Product to “dog food” our own products first. It’s our mission to help build a better Internet — and that means providing valuable feedback from our internal teams. As the number one consumer of Cloudflare’s products, the Security team is not only helping keep the company safer, but also contributing to build better products for our customers.

We hope you have enjoyed Cloudflare One week. We really enjoyed sharing our stories with you. To check out our recap of the week, please visit our Cloudflare TV segment.

Securing Cloudflare Using Cloudflare

Post Syndicated from Molly Cinnamon original https://blog.cloudflare.com/securing-cloudflare-using-cloudflare/

Securing Cloudflare Using Cloudflare

Securing Cloudflare Using Cloudflare

When a new security threat arises — a publicly exploited vulnerability (like log4j) or the shift from corporate-controlled environments to remote work or a potential threat actor — it is the Security team’s job to respond to protect Cloudflare’s network, customers, and employees. And as security threats evolve, so should our defense system. Cloudflare is committed to bolstering our security posture with best-in-class solutions — which is why we often turn to our own products as any other Cloudflare customer would.

We’ve written about using Cloudflare Access to replace our VPN, Purpose Justification to create granular access controls, and Magic + Gateway to prevent lateral movement from in-house. We experience the same security needs, wants, and concerns as security teams at enterprises worldwide, so we rely on the same solutions as the Fortune 500 companies that trust Cloudflare for improved security, performance, and speed. Using our own products is embedded in our team’s culture.

Security Challenges, Cloudflare Solutions

We’ve built the muscle to think Cloudflare-first when we encounter a security threat. In fact, many security problems we encounter have a Cloudflare solution.

  • Problem: Remote work creates a security blind spot of remote devices and networks.
  • Solution: Deploying Cloudflare Gateway and WARP to shield users and devices from threats, no matter their device or network connection.

Our Detection & Response team has regained visibility into security threats by connecting corporate devices to Gateway through our Cloudflare WARP application. These Zero Trust products shield users and devices from security threats like malware, phishing, shadow IT and more, as well as enable our Security team to instantly block threats and prevent sensitive data from leaving our organization — with no performance impact for our employees, no matter their location.

  • Problem: Larger company footprint (in size and location) increases the complexity of internal tooling.
  • Solution: Deploying Cloudflare Access and Purpose Justification to give network administrators granular and contextual application access controls.

Our Identity and Access Management team uses Access to create policies that minimize data access, ensuring that our employees only have access to what they need. Flexibility and instantaneous application of policies allow for ease of scalability as our internal tooling and teams evolve. With Purpose Justification Capture, employees must also justify their use case for visiting domains with particularly sensitive data — which not only solves an internal need for Cloudflare, but helps our customers meet data policy requirements (like GDPR).

  • Problem: Engineering and Product teams move at a rapid pace. Conducting a manual review of every pull request is not scalable.
  • Solution: A tool built on top of Workers that enables scanning of PRs for security bugs.

Our Product Security Engineering team uses Cloudflare’s development platform Workers to seamlessly deploy a code review assist framework to flag secrets, vulnerability dependencies, and binary security flags. The flexibility of Workers enables the Security team to evolve the tool depending on security concerns and scale it to the hundreds of PRs the company generates per week.

These are just some of the ways the Security team has used Cloudflare’s products to block malicious domains, streamline access management, provide visibility into threats, and harden our overall security posture. To give a sense of how we think about these challenges technically, we will dive into the implementation of a use of Cloudflare to secure Cloudflare.

Phish-proof websites using Cloudflare Access

Two-factor authentication is one of the most important security controls that can be implemented. Not all second factors provide the same level of security though. Time-based One-time password (TOTP) apps like Google Authenticator are a strong second factor, but are vulnerable to phishing via man-in-the-middle attacks. A successful phishing attack on an employee with privileged access is a terrifying thought, and it is a risk we wanted to completely eliminate.

FIDO2 was developed to provide simple UX and complete protection against phishing attacks. We decided to fully embrace FIDO2 supported security keys in all contexts, but FIDO2 support is not yet ubiquitous and there are many challenging compatibility issues. Cloudflare Access allowed us to enforce that FIDO2 was the only second factor that can be used when reaching systems protected by Cloudflare Access.

In order to manage our Cloudflare Access policies we check each one into source control as terraform code. Group based access control is enforced for our applications.

resource "cloudflare_access_policy" "prod_cloudflare_users" {
  application_id = cloudflare_access_application.prod_sandbox_access_application.id
  zone_id        = cloudflare_zone.prod_sandbox.id
  name           = "Allow "
  decision       = "allow"

  include {
    email_domain = ["cloudflare.com"]
    okta {
      # Require membership in Sandbox group
      name                 = ["ACL-ACCESS-sandbox"]
      identity_provider_id = cloudflare_access_identity_provider.okta_prod.id
    }
  }

  # Require a security key
  require {
    auth_method = "swk"
  }
}

The require section enforces that Cloudflare employees are using their FIDO2 supported security keys to access all of our internal and external applications that are protected by Access. More deeply described in RFC8176, auth_method is enforcing specific values are returned during the login flow from our OIDC provider within the amr field. The enforced swk is “Proof of possession of a software-secured key” which corresponds to logins using a security key.

The ability to enforce the use of security keys when accessing internal sites caused a massive improvement in our security posture. Prior to implementing this change, for many of our internal services we allowed both soft tokens like TOTP with Google Authenticator, along with WebAuthn because of a small number of systems that still didn’t support FIDO2. You’ll see that our use of soft tokens dropped to near zero after enforcing this change.

Securing Cloudflare Using Cloudflare

A Continued Practice

Not only does the Security team deploy Cloudflare’s products, but we test them first too. We work directly with Product to “dog food” our own products first. It’s our mission to help build a better Internet — and that means testing our products and collecting valuable feedback from our internal teams before every launch. As the number one consumer of Cloudflare’s products, the Security team is not only helping keep the company safer, but also contributing to build better products for our customers.

To learn more about examples and technical implementation of our use of Cloudflare products at Cloudflare, please check out this recent Cloudflare TV segment on Securing Cloudflare with Cloudflare.

And for more information on the products mentioned in this document, reach out to our Sales team to find out how we can help you secure your business, teams, and users.