Tag Archives: AI Week

Deploy your own AI vibe coding platform — in one click!

Post Syndicated from Ashish Kumar Singh original https://blog.cloudflare.com/deploy-your-own-ai-vibe-coding-platform/

It’s an exciting time to build applications. With the recent AI-powered “vibe coding” boom, anyone can build a website or application by simply describing what they want in a few sentences. We’re already seeing organizations expose this functionality to both their users and internal employees, empowering anyone to build out what they need.


Today, we’re excited to open-source an AI vibe coding platform, VibeSDK, to enable anyone to run an entire vibe coding platform themselves, end-to-end, with just one click.

Want to see it for yourself? Check out our demo platform that you can use to create and deploy applications. Or better yet, click the button below to deploy your own AI-powered platform, and dive into the repo to learn about how it’s built.

Deploying VibeSDK sets up everything you need to run your own AI-powered development platform:

  • Integration with LLM models to generate code, build applications, debug errors, and iterate in real-time, powered by Agents SDK

  • Isolated development environments that allow users to safely build and preview their applications in secure sandboxes.

  • Infinite scale that allows you to deploy thousands or even millions of applications that end users deploy, all served on Cloudflare’s global network

  • Observability and caching across multiple AI providers, giving you insight into costs and performance with built-in caching for popular responses. 

  • Project templates that the LLM can use as a starting point to build common applications and speed up development.

  • One-click project export to the user’s Cloudflare account or GitHub repo, so users can take their code and continue development on their own.


Building an AI vibe coding platform from start to finish

Step 0: Get started immediately with VibeSDK

We’re seeing companies build their own AI vibe coding platforms to enable both internal and external users. With a vibe coding platform, internal teams like marketing, product, and support can build their own landing pages, prototypes, or internal tools without having to rely on the engineering team. Similarly, SaaS companies can embed this capability into their product to allow users to build their own customizations. 

Every platform has unique requirements and specializations. By building your own, you can write custom logic to prompt LLMs for your specific needs, giving your users more relevant results. This also grants you complete control over the development environment and application hosting, giving you a secure platform that keeps your data private and within your control. 

We wanted to make it easy for anyone to build this themselves, which is why we built a complete platform that comes with project templates, previews, and project deployment. Developers can repurpose the whole platform, or simply take the components they need and customize them to fit their needs.

Step 1: Finding a safe, isolated environment for running untrusted, AI generated code

AI can now build entire applications, but there’s a catch: you need somewhere safe to run this untrusted, AI-generated code. Imagine if an LLM writes an application that needs to install packages, run build commands, and start a development server — you can’t just run this directly on your infrastructure where it might affect other users or systems.

With Cloudflare Sandboxes, you don’t have to worry about this. Every user gets their own isolated environment where the AI-generated code can do anything a normal development environment can do: install npm packages, run builds, start servers, but it’s fully contained in a secure, container-based environment that can’t affect anything outside its sandbox. 


The platform assigns each user to their own sandbox based on their session, so that if a user comes back, they can continue to access the same container with their files intact:

// Creating a sandbox client for a user session
const sandbox = getSandbox(env.Sandbox, sandboxId);

// Now AI can safely write and execute code in this isolated environment
await sandbox.writeFile('app.js', aiGeneratedCode);
await sandbox.exec('npm install express');
await sandbox.exec('node app.js');

Step 2: Generating the code

Once the sandbox is created, you have a development environment that can bring the code to life. VibeSDK orchestrates the whole workflow from writing the code, installing the necessary packages, and starting the development server. If you ask it to build a to-do app, it will generate the React application, write the component files, run bun install to get the dependencies, and start the server, so you can see the end result. 

Once the user submits their request, the AI will generate all the necessary files, whether it’s a React app, Node.js API, or full-stack application, and write them directly to the sandbox:

async function generateAndWriteCode(instanceId: string) {
    // AI generates the application structure
    const aiGeneratedFiles = await callAIModel("Create a React todo app");
    
    // Write all generated files to the sandbox
    for (const file of aiGeneratedFiles) {
        await sandbox.writeFile(
            `${instanceId}/${file.path}`,
            file.content
        );
        // User sees: "✓ Created src/App.tsx"
        notifyUser(`✓ Created ${file.path}`);
    }
}

To speed this up even more, we’ve provided a set of templates, stored in an R2 bucket, that the platform can use and quickly customize, instead of generating every file from scratch. This is just an initial set, but you can expand it and add more examples. 

Step 3: Getting a preview of your deployment

Once everything is ready, the platform starts the development server and uses the Sandbox SDK to expose it to the internet with a public preview URL which allows users to instantly see their AI-generated application running live:

// Start the development server in the sandbox
const processId = await sandbox.startProcess(
    `bun run dev`, 
    { cwd: instanceId }
);

// Create a public preview URL 
const preview = await sandbox.exposePort(3000, { 
    hostname: 'preview.example.com' 
});

// User instantly gets: "https://my-app-xyz.preview.example.com"
notifyUser(`✓ Preview ready at: ${preview.url}`);

Step 4: Test, log, fix, repeat

But that’s not all! Throughout this process, the platform will capture console output, build logs, and error messages and feed them back to the LLM for automatic fixes. As the platform makes any updates or fixes, the user can see it all happening live — the file editing, installation progress, and error resolution. 

Deploying applications: From Sandbox to Region Earth

Once the application is developed, it needs to be deployed. The platform packages everything in the sandbox and then uses a separate specialized “deployment sandbox” to deploy the application to Cloudflare Workers. This deployment sandbox runs wrangler deploy inside the secure environment to publish the application to Cloudflare’s global network. 

Since the platform may deploy up to thousands or millions of applications, Workers for Platforms is used to deploy the Workers at scale. Although all the Workers are deployed to the same Namespace, they are all isolated from one another by default, ensuring there’s no cross-tenant access. Once deployed, each application receives its own isolated Worker instance with a unique public URL like my-app.vibe-build.example.com

async function deployToWorkersForPlatforms(instanceId: string) {
    // 1. Package the app from development sandbox
    const devSandbox = getSandbox(env.Sandbox, instanceId);
    const packagedApp = await devSandbox.exec('zip -r app.zip .');
    
    // 2. Transfer to specialized deployment sandbox
    const deploymentSandbox = getSandbox(env.DeployerServiceObject, 'deployer');
    await deploymentSandbox.writeFile('app.zip', packagedApp);
    await deploymentSandbox.exec('unzip app.zip');
    
    // 3. Deploy using Workers for Platforms dispatch namespace
    const deployResult = await deploymentSandbox.exec(`
        bunx wrangler deploy \\\\
        --dispatch-namespace vibe-sdk-build-default-namespace
    `);
    
    // Each app gets its own isolated Worker and unique URL
    // e.g., https://my-app.example.com
    return `https://${instanceId}.example.com`;
}

Exportable Applications 

The platform also allows users to export their application to their own Cloudflare account and GitHub repo, so they can continue the development on their own. 


Observability, caching, and multi-model support built in! 

It’s no secret that LLM models have their specialties, which means that when building an AI-powered platform, you may end up using a few different models for different operations. By default, VibeSDK leverages Google’s Gemini models (gemini-2.5-pro, gemini-2.5-flash-lite, gemini-2.5-flash) for project planning, code generation, and debugging. 

VibeSDK is automatically set up with AI Gateway, so that by default, the platform is able to:

  • Use a unified access point to route requests across LLM providers, allowing you to use models from a range of providers (OpenAI, Anthropic, Google, and others)

  • Cache popular responses, so when someone asks to “build a to-do list app”, the gateway can serve a cached response instead of going to the provider (saving inference costs)

  • Get observability into the requests, tokens used, and response times across all providers in one place

  • Track costs across models and integrations

Open sourced, so you can build your own Platform! 

We’re open-sourcing VibeSDK for the same reason Cloudflare open-sourced the Workers runtime — we believe the best development happens in the open. That’s why we wanted to make it as easy as possible for anyone to build their own AI coding platform, whether it’s for internal company use, for your website builder, or for the next big vibe coding platform. We tied all the pieces together for you, so you can get started with the click of a button instead of spending months figuring out how to connect everything yourself. To learn more, check out our reference architecture for vibe coding platforms. 


AI Week 2025: Recap

Post Syndicated from Kenny Johnson original https://blog.cloudflare.com/ai-week-2025-wrapup/

How do we embrace the power of AI without losing control? 

That was one of our big themes for AI Week 2025, which has now come to a close. We announced products, partnerships, and features to help companies successfully navigate this new era.

Everything we built was based on feedback from customers like you that want to get the most out of AI without sacrificing control and safety. Over the next year, we will double down on our efforts to deliver world-class features that augment and secure AI. Please keep an eye on our Blog, AI Avenue, Product Change Log and CloudflareTV for more announcements.

This week we focused on four core areas to help companies secure and deliver AI experiences safely and securely:

  • Securing AI environments and workflows

  • Protecting original content from misuse by AI

  • Helping developers build world-class, secure, AI experiences 

  • Making Cloudflare better for you with AI

Thank you for following along with our first ever AI week at Cloudflare. This recap blog will summarize each announcement across these four core areas. For more information, check out our “This Week in NET” recap episode also featured at the end of this blog.


Securing AI environments and workflows

These posts and features focused on helping companies control and understand their employee’s usage of AI tools.

Blog

Recap

Beyond the ban: A better way to secure generative AI applications

Generative AI tools present a trade-off of productivity and data risk. Cloudflare One’s new AI prompt protection feature provides the visibility and control needed to govern these tools, allowing organizations to confidently embrace AI.

Unmasking the Unseen: Your Guide to Taming Shadow AI with Cloudflare One

Don’t let “Shadow AI” silently leak your data to unsanctioned AI. This new threat requires a new defense. Learn how to gain visibility and control without sacrificing innovation.

Introducing Cloudflare Application Confidence Score For AI Applications

Cloudflare will provide confidence scores within our application library for Gen AI applications, allowing customers to assess their risk for employees using shadow IT.

ChatGPT, Claude, & Gemini security scanning with Cloudflare CASB

Cloudflare CASB now scans ChatGPT, Claude, and Gemini for misconfigurations, sensitive data exposure, and compliance issues, helping organizations adopt AI with confidence.

Securing the AI Revolution: Introducing Cloudflare MCP Server Portals

Cloudflare MCP Server Portals are now available in Open Beta. MCP Server Portals are a new capability that enable you to centralize, secure, and observe every MCP connection in your organization.

Best Practices for Securing Generative AI with SASE

This guide provides best practices for Security and IT leaders to securely adopt generative AI using Cloudflare’s SASE architecture as part of a strategy for AI Security Posture Management (AI-SPM).


Protecting original content from misuse by AI

Cloudflare is committed to helping content creators control access to their original work. These announcements focused on analysis of what we’re currently seeing on the Internet with respect to AI bots and crawlers and significant improvements to our existing control features.

Blog

Recap

A deeper look at AI crawlers: breaking down traffic by purpose and industry

We are extending AI-related insights on Cloudflare Radar with new industry-focused data and a breakdown of bot traffic by purpose, such as training or user action.

The age of agents: cryptographically recognizing agent traffic

Cloudflare now lets websites and bot creators use Web Bot Auth to segment agents from verified bots, making it easier for customers to allow or disallow the many types of user and partner directed.

Make Your Website Conversational for People and Agents with NLWeb and AutoRAG

With NLWeb, an open project by Microsoft, and Cloudflare AutoRAG, conversational search is now a one-click setup for your website.

The next step for content creators in working with AI bots: Introducing AI Crawl Control

Cloudflare launches AI Crawl Control (formerly AI Audit) and introduces easily customizable 402 HTTP responses.

The crawl-to-click gap: Cloudflare data on AI bots, training, and referrals

By mid-2025, training drives nearly 80% of AI crawling, while referrals to publishers (especially from Google) are falling and crawl-to-refer ratios show AI consumes far more than it sends back.


Helping developers build world-class, secure, AI experiences

At Cloudflare we are committing to building the best platform to build AI experiences, all with security by default.

Blog

Recap

AI Gateway now gives you access to your favorite AI models, dynamic routing and more — through just one endpoint

AI Gateway now gives you access to your favorite AI models, dynamic routing and more — through just one endpoint.

How we built the most efficient inference engine for Cloudflare’s network

Infire is an LLM inference engine that employs a range of techniques to maximize resource utilization, allowing us to serve AI models more efficiently with better performance for Cloudflare workloads.

State-of-the-art image generation Leonardo models and text-to-speech Deepgram models now available in Workers AI

We’re expanding Workers AI with new partner models from Leonardo.Ai and Deepgram. Start using state-of-the-art image generation models from Leonardo and real-time TTS and STT models from Deepgram.

How Cloudflare runs more AI models on fewer GPUs: A technical deep-dive

Cloudflare built an internal platform called Omni. This platform uses lightweight isolation and memory over-commitment to run multiple AI models on a single GPU.

Cloudflare Launching AI Miniseries for Developers (and Everyone Else They Know)

In AI Avenue, we address people’s fears, show them the art of the possible, and highlight the positive human stories where AI is augmenting — not replacing — what people can do. And yes, we even let people touch AI themselves.

Block unsafe prompts targeting your LLM endpoints with Firewall for AI

Cloudflare’s AI security suite now includes unsafe content moderation, integrated into the Application Security Suite via Firewall for AI.

Cloudflare is the best place to build realtime voice agents

Today, we’re excited to announce new capabilities that make it easier than ever to build real-time, voice-enabled AI applications on Cloudflare’s global network.


Making Cloudflare better for you with AI

Cloudflare logs and analytics can often be a needle in the haystack challenge, AI helps surface and alert to issues that need attention or review. Instead of a human having to spend hours sifting and searching for an issue, they can focus on action and remediation while AI does the sifting.

Blog

Except

Evaluating image segmentation models for background removal for Images

An inside look at how the Images team compared dichotomous image segmentation models to identify and isolate subjects in an image from the background.

Automating threat analysis and response with Cloudy

Cloudy now supercharges analytics investigations and Cloudforce One threat intelligence! Get instant insights from threat events and APIs on APTs, DDoS, cybercrime & more – powered by Workers AI!

Cloudy Summarizations of Email Detections: Beta Announcement

We’re now leveraging our internal LLM, Cloudy, to generate automated summaries within our Email Security product, helping SOC teams better understand what’s happening within flagged messages.

Troubleshooting network connectivity and performance with Cloudflare AI

Troubleshoot network connectivity issues by using Cloudflare AI-Power to quickly self diagnose and resolve WARP client and network issues.

We thank you for following along this week — and please stay tuned for exciting announcements coming during Cloudflare’s 15th birthday week in September!

Check out the full video recap, featuring insights from Kenny Johnson and host João Tomé, in our special This Week in NET episode (ThisWeekinNET.com) covering everything announced during AI Week 2025.

Automating threat analysis and response with Cloudy

Post Syndicated from Alexandra Moraru original https://blog.cloudflare.com/automating-threat-analysis-and-response-with-cloudy/

Security professionals everywhere face a paradox: while more data provides the visibility needed to catch threats, it also makes it harder for humans to process it all and find what’s important. When there’s a sudden spike in suspicious traffic, every second counts. But for many security teams — especially lean ones — it’s hard to quickly figure out what’s going on. Finding a root cause means diving into dashboards, filtering logs, and cross-referencing threat feeds. All the data tracking that has happened can be the very thing that slows you down — or worse yet, what buries the threat that you’re looking for. 

Today, we’re excited to announce that we’ve solved that problem. We’ve integrated Cloudy — Cloudflare’s first AI agent — with our security analytics functionality, and we’ve also built a new, conversational interface that Cloudflare users can use to ask questions, refine investigations, and get answers.  With these changes, Cloudy can now help Cloudflare users find the needle in the digital haystack, making security analysis faster and more accessible than ever before.  

Since Cloudly’s launch in March of this year, its adoption has been exciting to watch. Over 54,000 users have tried Cloudy for custom rule creation, and 31% of them have deployed a rule suggested by the agent. For our log explainers in Cloudflare Gateway, Cloudy has been loaded over 30,000  times in just the last month, with 80% of the feedback we received confirming the summaries were insightful. We are excited to empower our users to do even more.

Talk to your traffic: a new conversational interface for faster RCA and mitigation

Security analytics dashboards are powerful, but they often require you to know exactly what you’re looking for — and the right queries to get there. The new Cloudy chat interface changes this. It is designed for faster root cause analysis (RCA) of traffic anomalies, helping you get from “something’s wrong” to “here’s the fix” in minutes. You can now start with a broad question and narrow it down, just like you would with a human analyst.

For example, you can start an investigation by asking Cloudy to look into a recommendation from Security Analytics.



From there, you can ask follow-up questions to dig deeper:

  • “Focus on login endpoints only.”

  • “What are the top 5 IP addresses involved?”

  • “Are any of these IPs known to be malicious?”

This is just the beginning of how Cloudy is transforming security. You can read more about how we’re using Cloudy to bring clarity to another critical security challenge: automating summaries of email detections. This is the same core mission — translating complex security data into clear, actionable insights — but applied to the constant stream of email threats that security teams face every day.

Use Cloudy to understand, prioritize, and act on threats

Analyzing your own logs is powerful — but it only shows part of the picture. What if Cloudy could look beyond your own data and into Cloudflare’s global network to identify emerging threats? This is where Cloudforce One’s Threat Events platform comes in.

Cloudforce One translates the high-volume attack data observed on the Cloudflare network into real-time, attacker-attributed events relevant to your organization. This platform helps you track adversary activity at scale — including APT infrastructure, cybercrime groups, compromised devices, and volumetric DDoS activity. Threat events provide detailed, context-rich events, including interactive timelines and mappings to attacker TTPs, regions, and targeted verticals. 

We have spent the last few months making Cloudy more powerful by integrating it with the Cloudforce One Threat Events platform.  Cloudy now can offer contextual data about the threats we observe and mitigate across Cloudflare’s global network, spanning everything from APT activity and residential proxies to ACH fraud, DDoS attacks, WAF exploits, cybercrime, and compromised devices. This integration empowers our users to quickly understand, prioritize, and act on indicators of compromise (IOCs) based on a vast ocean of real-time threat data. 

Cloudy lets you query this global dataset in a natural language and receive clear, concise answers. For example, imagine asking these questions and getting immediate actionable answers:

  • Who is targeting my industry vertical or country?

  • What are the most relevant indicators (IPs, JA3/4 hashes, ASNs, domains, URLs, SHA fingerprints) to block right now?

  • How has a specific adversary progressed across the cyber kill chain over time?

  • What novel new threats are threat actors using that might be used against your network next, and what insights do Cloudflare analysts know about them?

Simply interact with Cloudy in the Cloudflare Dashboard > Security Center > Threat Intelligence, providing your queries in natural language. It can walk you from a single indicator (like an IP address or domain) to the specific threat event Cloudflare observed, and then pivot to other related data — other attacks, related threats, or even other activity from the same actor. 


This cuts through the noise, so you can quickly understand an adversary’s actions across the cyber kill chain and MITRE ATT&CK framework, and then block attacks with precise, actionable intelligence. The threat events platform is like an evidence board on the wall that helps you understand threats; Cloudy is like your sidekick that will run down every lead.

How it works: Agents SDK and Workers AI

Developing this advanced capability for Cloudy was a testament to the agility of Cloudflare’s AI ecosystem. We leveraged our Agents SDK running on Workers AI. This allowed for rapid iteration and deployment, ensuring Cloudy could quickly grasp the nuances of threat intelligence and provide highly accurate, contextualized insights. The combination of our massive network telemetry, purpose-built LLM prompts, and the flexibility of Workers AI means Cloudy is not just fast, but also remarkably precise.

And a quick word on what we didn’t do when developing Cloudy: We did not train Cloudy on any Cloudflare customer data. Instead, Cloudy relies on models made publicly available through Workers AI. For more information on Cloudflare’s approach to responsible AI, please see these FAQs.

What’s next for Cloudy

This is just the next step in Cloudy’s journey. We’re working on expanding Cloudy’s abilities across the board. This includes intelligent debugging for WAF rules and deeper integrations with Alerts to give you more actionable, contextual notifications. At the same time, we are continuously enriching our threat events datasets and exploring ways for Cloudy to help you visualize complex attacker timelines, campaign overviews, and intricate attack graphs. Our goal remains the same: make Cloudy an indispensable partner in understanding and reacting to the security landscape.

The new chat interface is now available on all plans, and the threat intelligence capabilities are live for Cloudforce One customers. Learn more about Cloudforce One here and reach out for a consultation if you want to go deeper with our experts.

The crawl-to-click gap: Cloudflare data on AI bots, training, and referrals

Post Syndicated from João Tomé original https://blog.cloudflare.com/crawlers-click-ai-bots-training/

In 2025, Generative AI is reshaping how people and companies use the Internet. Search engines once drove traffic to content creators through links. Now, AI training crawlers — the engines behind commonly-used LLMs — are consuming vast amounts of web data, while sending far fewer users back. We covered this shift, along with related trends and Cloudflare features (like pay per crawl) in early July. Studies from Pew Research Center (1, 2) and Authoritas already point to AI overviews — Google’s new AI-generated summaries shown at the top of search results — contributing to sharp declines in news website traffic. For a news site, this means lots of bot hits, but far fewer real readers clicking through — which in turn means fewer people clicking on ads or chances to convert to subscriptions.

Cloudflare’s data shows the same pattern. Crawling by search engines and AI services surged in the first half of 2025 — up 24% year-over-year in June — before slowing to just 4% year-over-year growth in July. How is the space evolving? Which crawling purposes are most common, and how is that changing? Spoiler: training-related crawling is leading the way. In this post, we track AI and search bot crawl activity, what purposes dominate, and which platforms contribute the least referral traffic back to creators.

Key takeaways

  • Training crawling grows: Training now drives nearly 80% of AI bot activity, up from 72% a year ago.

  • Publisher referrals drop: Google referrals to news sites fell, with March 2025 down ~9% compared to January.

  • AI & search crawling increase: Crawling rose 32% year-over-year in April 2025, before slowing to 4% year-over-year growth in July.

  • AI-only crawler shifts: OpenAI’s GPTBot more than doubled in share of AI crawling traffic (4.7% to 11.7%), Anthropic’s ClaudeBot rose (6% to ~10%), while ByteDance’s Bytespider fell from 14.1% to 2.4%.

  • Crawl-to-refer imbalance (how many pages a bot crawls per page that a user clicks back to): Anthropic increased referrals but still leads with 38,000 crawls per visitor in July (down from 286,000:1 in January). Perplexity decreased referrals in 2025 — with more crawling but fewer referrals at 194 crawls per visitor in July.

Several of the trends in this blog use Cloudflare Radar’s new AI Insights features, explained in more detail in the post: “A deeper look at AI crawlers: breaking down traffic by purpose and industry.”

Google referrals fall as AI Overviews expand

Referral traffic from search is already shifting, as we noted above and as studies have shown. In our dataset of news-related customers (spanning the Americas, Europe, and Asia), Google’s referrals have been clearly declining since February 2025. This drop is unusual, since overall Internet traffic (and referrals as well) historically has only dipped during July and August — the summer months when the Northern Hemisphere is largely on break from school or work. The sharpest and least seasonal decline came in March. Despite being a 31-day month, March had almost the same referral volume as the shorter, 28-day February.


Looking at longer comparisons: March 2025 referral traffic from Google was 9% lower than January, the same drop seen in June. April was worse, down 15% compared with January.

This drop seems to coincide with some of Google’s changes. AI Overviews launched in the U.S. in May 2024, but in March 2025, Google upgraded AI Overviews with Gemini 2.0, introduced AI Mode in Labs, and expanded Overviews to more European countries. By May 2025, AI Mode rolled out broadly in the U.S. with Gemini 2.5, adding conversational search, Deep Search, and personalized recommendations.

The search-to-news site pipeline seems to be weakening, replaced in part by AI-driven results.

Looking at a daily perspective, we can also spot a clear U.S.-election-related peak in referrals from Google to the cohort of known news sites on November 5–6, 2024.


AI and search crawling: spring surge (+24%), summer slowdown

In June, we talked about search and AI crawler growth, and our picture of the trend is now more complete with more data. To focus only on AI and search crawlers, and to remove the bias of customer growth, we analyzed a fixed set of customers from specific weeks, a method we’ve also used in the Cloudflare Radar Year in Review.

What the data shows: crawling spiked twice: first in November 2024, then again between March and April 2025. April 2025 alone was up 32% compared with May 2024, the first full month where we have comparable data. After that surge, growth stabilized. In June 2025, crawling traffic was still 24% higher year-over-year, but by July the increase was down to just 4%. That shift highlights how quickly crawler activity can accelerate and then cool down.

As the chart below shows, crawling traffic rose sharply in March and April. It remained high but slightly lower in May, before starting to drop in June. The seasonal dip is similar to what we see in overall Internet traffic during the Northern Hemisphere’s summer months (August and September are often the quietest), though in the case of crawlers, this is likely due to reduced overall web activity rather than bots themselves taking a “break.” Historically, activity tends to rise again in November — as it did in 2024 for AI and search bot traffic — when people spend more time online for shopping and seasonal habits (a pattern we’ve seen in past years).


Googlebot is still the anchor, accounting for 39% of all AI and search crawler traffic, but the fastest growth now comes from AI-specific crawlers, though bots related to Amazon and ByteDance (Bytespider) have lost significant ground. GPTBot’s share grew from 4.7% in July 2024 to 11.7% in July 2025. ClaudeBot also increased, from 6% to nearly 10%, while Meta’s crawler jumped from 0.9% to 7.5%. By contrast, Amazonbot dropped from 10.2% to 5.9%, and ByteDance’s Bytespider dropped from 14.1% to just 2.4%.

The table below shows how market shares have shifted between July 2024 and July 2025:

Bot name

% share July 2024

% share July 2025

Δ percentage-point change

1

Googlebot

37.5

39

1.5

2

GPTBot

4.7

11.7

7

3

ClaudeBot

6

9.9

3.9

4

Bingbot

8.7

9.3

0.6

5

Meta-ExternalAgent

0.9

7.5

6.5

6

Amazonbot

10.2

5.9

-4.3

7

Googlebot-Image

4.1

3.3

-0.8

8

Yandex

5

2.9

-2.1

9

GoogleOther

4.6

2.7

-1.8

10

Bytespider

14.1

2.4

-11.6

11

Applebot

1.8

1.5

-0.3

12

ChatGPT-User

0.1

0.9

0.9

13

OAI-SearchBot

0

0.9

0.9

14

Baiduspider

0.5

0.5

0

15

Googlebot-Mobile

0.2

0.4

0.2

AI-only crawlers: OpenAI rises, ByteDance falls

Looking only at AI bot traffic (as tracked on our Radar AI page), the trend is clear. Since January 2025, GPTBot has steadily increased its crawling volume, driven mainly by training-related activity. ClaudeBot crawling accelerated in June, while Amazonbot and Bytespider activity slowed.

The chart below shows how GPTBot surged over the past 12 months, overtaking Amazonbot and Bytespider, which both fell sharply:


A comparison between July 2024 and July 2025 makes the shift even more obvious. GPTBot gained 16 percentage points, Meta’s crawler rose by more than 15, and ClaudeBot grew by 8. On the shrinking side, Amazonbot dropped 12 percentage points and Bytespider dropped over 31 percentage points.

AI-only bots

July 2024 %

July 2025 %

Δ percentage-point change

1

GPTBot

11.9

28.1

16.1

2

ClaudeBot

15

23.3

8.3

3

Meta-ExternalAgent

2.4

17.7

15.3

4

Amazonbot

26.4

14.1

-12.3

5

Bytespider

37.3

5.8

-31.5

6

Applebot

4.9

3.7

-1.2

7

ChatGPT-User

0.2

2.4

2.2

8

OAI-SearchBot

0

2.2

2.2

9

TikTokSpider

0

0.7

0.7

10

imgproxy

0

0.7

0.7

11

PerplexityBot

0

0.4

0.4

12

Google-CloudVertexBot

0

0.3

0.3

13

AI2Bot

0

0.2

0.2

14

Timpibot

0.6

0.1

-0.5

15

CCBot

0.1

0.1

0


We covered the functionality of these bots in our June blog post.

Crawling by purpose: training dominates

Training is the clear leader. (We classify purpose based on operator disclosures and industry sources, a method we explained in this AI Week blog.) Over the past 12 months, 80% of AI crawling was for training, compared with 18% for search and just 2% for user actions. In the last six months, the share for training rose further to 82%, while search dropped to 15% and user actions increased slightly to 3%.

The chart below shows how training-related crawling steadily grew over the past year, far outpacing other purposes:


The year-over-year comparison reinforces this trend. In July 2024, training accounted for 72% of AI crawling. By July 2025, it had risen to 79%. Over the same period, search fell from 26% to 17%, while user actions grew modestly from 2% to 3.2%.


Crawl-to-refer ratios shifts: tens of thousands of bot crawls per human click

The crawl-to-refer ratio measures how many pages a platform crawls compared with how often it drives users to a website. In practice, a high ratio means heavy crawling but little referral traffic. For example, for every visitor Anthropic refers back to a website, its crawlers have already visited tens of thousands of pages.

Why does this metric matter? It highlights the imbalance between how much content AI systems consume and how little traffic they return. For publishers, it can feel like giving away the raw material for free. With that in mind, here’s how different platforms compare from January to July 2025.

Anthropic remains the most crawl-heavy platform. Even after an 87% decline this year, it still crawled 38,000 pages for every referred page visit in July 2025 — the highest imbalance among major AI players. Referrals may be improving, though, after Anthropic added web search to Claude in March 2025 (initially for U.S. paid users) and expanded it globally by May to all users, including the free tier. The feature introduced direct citations with clickable URLs, creating new referral pathways.

The full dataset is below, showing January–July 2025 ratios by platform ordered by the highest ratio average:
(Note: a rising ratio means more bot crawling per human click sent back, while a falling ratio means less bot crawling per human click sent back)

Crawl-to-refer ratio (from Cloudflare Radar’s data)

Service

Jan

Feb

Mar

Apr

May

Jun

Jul

Average

% Change Jan-Jul

Anthropic

286,930.1

271,748.2

121,612.7

130,330.2

114,313

71,282.8

38,065.7

147,754.7

-86.7%

OpenAI

1,217.4

1,774.5

2,217

1200

995.6

1,655.9

1,091.4

1,437.8

-10.4%

Perplexity

54.6

55.3

201.3

300.9

199.1

200.6

194.8

172.4

256.7%

Microsoft

38.5

44.2

42.3

43.3

45.1

42

40.7

42.3

5.7%

Yandex

15.5

13.1

13.1

15.7

14.7

15.9

21.4

15.6

38.3%

Google

3.8

6.3

14.6

22.5

16.7

13.1

5.4

11.8

43%

ByteDance

18

16.4

3.5

2.3

1.6

1.6

0.9

6.3

-95%

Baidu

0.6

0.7

0.8

1.5

1.2

1

0.9

1

44.5%

DuckDuckGo

0.1

0.2

0.2

0.2

0.3

0.3

0.3

0.2

116.3%

Looking at the changes from January to July 2025:

  • Anthropic recorded the steepest decrease in bot to human traffic, down 86.7%. From 286,930 bots per human in January, to 38,065 bots per human in July, the change shows a dramatic increase in referrals. Despite the change, it remains by far the most crawl-heavy platform, with tens of thousands of pages still crawled for every referral.

  • Perplexity moved in the opposite direction, with bot crawling increasing +256.7% relative to human visitors; climbing from 54 bots per human in January to 195 bots per human in July. While the ratio is still far below Anthropic, the increase shows it is crawling more heavily, relative to the traffic it refers, than it did earlier.

  • OpenAI ratio dropped slightly, from 1,217 bots per human in January to 1,091 in July (-10%). The shift is smaller than Anthropic’s but suggests OpenAI is sending a bit more referral traffic relative to its crawling.

  • Microsoft stayed steady, with its ratio moving only slightly, from 38.5 bots per human in January to 40.7 in July (+6%). This consistency suggests stable behavior from Bing-linked services.

  • Yandex increased from 15.5 bots per human in January to 21.4 in July (+38%). The overall ratio is far smaller than Anthropic’s or Perplexity’s, but it shows Yandex is crawling more heavily relative to the traffic it sends back.

Alongside measuring crawling volumes and referral traffic (now also visible on the AI Insights page of Cloudflare Radar), it’s worth looking at whether AI operators follow good practices when deploying their bots. Cloudflare data shows that most leading AI crawlers are on our verified bots list, meaning their IP addresses match published ranges and they respect robots.txt. But adoption of newer standards like WebBotAuth — which uses cryptographic signatures in HTTP messages to confirm a request comes from a specific bot, and is especially relevant today — is still missing. 

Google, Meta, and OpenAI run distinct bots for different purposes, while Anthropic lags in verification. That makes it easier for bad actors to spoof its crawler and ignore robots.txt, since without verification, it’s hard to distinguish real from fake traffic — leaving its compliance effectively unclear. (A longer list of AI bots is available here).


Conclusion and what’s next

If training-related crawling continues to dominate while referrals stay flat, creators face a paradox: feeding AI systems without gaining traffic in return. Many want their content to appear in chatbot answers, but without monetization or cooperation, the incentive to produce quality work declines.

The Web now stands at a fork in the road. Either a new balance emerges — one where the new AI era helps sustain publishers and creators — or AI turns the open web into a one-way training set, extracting value with little flowing back.

You can learn more about some of these data trends on Cloudflare Radar’s updated AI Insights page.

Troubleshooting network connectivity and performance with Cloudflare AI

Post Syndicated from Chris Draper original https://blog.cloudflare.com/AI-troubleshoot-warp-and-network-connectivity-issues/

Monitoring a corporate network and troubleshooting any performance issues across that network is a hard problem, and it has become increasingly complex over time. Imagine that you’re maintaining a corporate network, and you get the dreaded IT ticket. An executive is having a performance issue with an application, and they want you to look into it. The ticket doesn’t have a lot of details. It simply says: “Our internal documentation is taking forever to load. PLS FIX NOW”.

In the early days of IT, a corporate network was built on-premises. It provided network connectivity between employees that worked in person and a variety of corporate applications that were hosted locally.

The shift to cloud environments, the rise of SaaS applications, and a “work from anywhere” model has made IT environments significantly more complex in the past few years. Today, it’s hard to know if a performance issue is the result of:

  • An employee’s device

  • Their home or corporate wifi

  • The corporate network

  • A cloud network hosting a SaaS app

  • An intermediary ISP

A performance ticket submitted by an employee might even be a combination of multiple performance issues all wrapped together into one nasty problem.

Cloudflare built Cloudflare One, our Secure Access Service Edge (SASE) platform, to protect enterprise applications, users, devices, and networks. In particular, this platform relies on two capabilities to simplify troubleshooting performance issues:

  • Cloudflare’s Zero Trust client, also known as WARP, forwards and encrypts traffic from devices to Cloudflare edge.

  • Digital Experience Monitoring (DEX) works alongside WARP to monitor device, network, and application performance.

We’re excited to announce two new AI-powered tools that will make it easier to troubleshoot WARP client connectivity and performance issues.  We’re releasing a new WARP diagnostic analyzer in the Zero Trust dashboard and a MCP (Model Context Protocol) server for DEX. Today, every Cloudflare One customer has free access to both of these new features by default.

WARP diagnostic analyzer

The WARP client provides diagnostic logs that can be used to troubleshoot connectivity issues on a device. For desktop clients, the most common issues can be investigated with the information captured in logs called WARP diagnostic. Each WARP diagnostic log contains an extensive amount of information spanning days of captured events occurring on the client. It takes expertise to manually go through all of this information and understand the full picture of what is occurring on a client that is having issues. In the past, we’ve advised customers having issues to send their WARP diagnostic log straight to us so that our trained support experts can do a root cause analysis for them. While this is effective, we want to give our customers the tools to take control of deciphering common troubleshooting issues for even quicker resolution. 

Enter the WARP diagnostic analyzer, a new AI available for free in the Cloudflare One dashboard as of today! This AI demystifies information in the WARP diagnostic log so you can better understand events impacting the performance of your clients and network connectivity. Now, when you run a remote capture for WARP diagnostics in the Cloudflare One dashboard, you can generate an AI analysis of the WARP diagnostic file. Simply go to your organization’s Zero Trust dashboard and select DEX > Remote Captures from the side navigation bar. After you successfully run diagnostics and produce a WARP diagnostic file, you can open the status details and select View WARP Diag to generate your AI analysis.


In the WARP Diag analysis, you will find a Cloudy summary of events that we recommend a deeper dive into.


Below this summary is an events section, where the analyzer highlights occurrences of events commonly occurring when there are client and connectivity issues. 


Expanding on any of the events detected will reveal a detailed page explaining the event, recommended resources to help troubleshoot, and a list of time stamped recent occurrences of the event on the device.


To further help with trouble shooting we’ve added a Device and WARP details section at the bottom of this page with a quick view of the device specifications and WARP configurations such as Operating system, WARP version, and the device profile ID.


Finally, we’ve made it easy to take all the information created in your AI summary with you by navigating to the JSON file tab and copying the contents. Your WARP Diag file is also available to download from this screen for any further analysis.


MCP server for DEX

Alongside the new WARP Diagnostic Analyzer, we’re excited to announce that all Cloudflare One customers have access to a MCP (Model Context Protocol) server for our Digital Experience Monitoring (DEX) product. Let’s dive into how this will save our customers time and money.

Cloudflare One customers use Digital Experience Monitoring (DEX) to monitor devices across their employee network and troubleshoot any connectivity or performance issues. Like many products at Cloudflare, every data point generated by DEX is available to customers via Cloudflare’s API or log ingestion. DEX API and log data is valuable because it enables Enterprises to create custom analytics for their devices’ connectivity and performance in a SIEM (Security Information and Event Management). Building new data pipelines and various dashboards can take a lot of time and can be expensive. Some of Cloudflare One customers aren’t able to dedicate the engineering time to build custom analytics: whether it’s due to budget, resource constraints, time constraints, or other factors.

Model Context Protocol (MCP) is an AI standardization that allows AI to connect to other applications. In this instance, using MCP together with DEX gives users a custom analytics experience — without all the work of integrating log data into a SIEM. Network security practitioners can type out a question (like “Show me the performance data for [email protected]’s device in a few graphs”), and an MCP server will answer that question with a customized graph on Alice’s device data from the DEX API.

Troubleshooting DNS query performance with the DEX MCP server

Any IT admin, security engineer, or network engineer can ask the MCP server a question like: “I received an IT ticket from [email protected] who is an enterprise employee. He is complaining about the performance of his device. Can you investigate this for me?”.

After you submit this question, the DEX MCP server will start by reviewing the device’s overall health.

Overall Device Health:
✅ Connection Status: Consistently connected for the past 24 hours with 100% availability
✅ WARP Mode: Running in optimal "warp+doh" (WARP + DNS over HTTPS) mode
✅ Client Version: Up-to-date with version 2025.1.7.2

So far, the device’s health looks great. Next, the DEX MCP server will begin analyzing the results of the performance tests that are configured on Bob’s device.

📊 Performance Metrics Analysis

HTTP Response Times:
Google.com test: Average 247ms (Range: 81-649ms)
  • Shows significant improvement throughout the day: Started at ~317ms, dropped to ~211ms by evening
Wiki.internal.acme.com: Average 1,374ms (Range: 877-2,431ms)
  • This is notably high and could be the source of performance complaints
  • DNS resolution particularly slow: Average 616ms with spikes up to 1,600ms

Now we’re making progress in our troubleshooting effort. The DEX MCP server identified that the HTTP GET test for wiki.internal.acme.com has a high average resource fetch time when compared to other websites (like google.com). In particular, high resource fetch time is the result of slow DNS resolution. Now, the DEX MCP server will summarize its findings.

🚨 Performance Issues Identified
Extremely Slow DNS Resolution:
  • Average DNS response time of 616ms is significantly high
  • Normal DNS should be <50ms for good performance
High Resource Fetch Times on wiki.internal.acme.com:
   • 1.3+ second response times indicate potential application-level issues
   • Could be related to DNS performance or server-side delays
HTTP Status Code Issues:
        • wiki.internal.acme.com is showing consistent HTTP 400 errors
        • This suggests potential configuration or routing issues

Now, we’ve successfully identified that Bob’s slow wiki performance is the result of a high average DNS response time. Slow DNS resolution increases the average resource fetch time, and also generates HTTP 400 errors for Enterprise employees that are attempting to access wiki.internal.acme.com. Notably, it’s likely that if the wiki is seeing slow DNS response times, other internal applications are experiencing performance issues as well.

We can ask the DEX MCP server to suggest potential solutions that would fix this wider DNS performance issue.

🔧 DNS Optimization Strategies

Check WARP DNS configuration and settings
  • Since the device is using WARP+DoH mode, you can optimize the DNS settings
Implement or improve local DNS caching
  • Implement or improve local DNS caching to reduce repeated lookups
Network stack tuning
  • Optimize TCP / UDP settings for DNS

Try out the DEX MCP server today

Fast and easy option for testing an MCP server

Any Cloudflare One customer with a Free, PayGo, or ENT plan can start using the DEX MCP server in less than one minute. The fastest and easiest way to try out the DEX MCP server is to visit playground.ai.cloudflare.com. There are five steps to get started:

  1. Copy the URL for the DEX MCP server: https://dex.mcp.cloudflare.com/sse

  2. Open playground.ai.cloudflare.com in a browser

  3. Find the section in the left side bar titled MCP Servers

  4. Paste the URL for the DEX MCP server into the URL input box and click Connect

  5. Authenticate your Cloudflare account, and then start asking questions to the DEX MCP server

It’s worth noting that end users will need to ask specific and explicit questions to the DEX MCP server to get a response. For example, you may need to say, “Set my production account as the active  account”, and then give the separate command, “Fetch the DEX test results for the user [email protected] over the past 24 hours”.

Better experience for MCP servers that requires additional steps

Customers will get a more flexible prompt experience by configuring the DEX MCP server with their preferred AI assistant (Claude, Gemini, ChatGPT, etc.) that has MCP server support. MCP server support may require a subscription for some AI assistants. You can read the Digital Experience Monitoring – MCP server documentation for step by step instructions on how to get set up with each of the major AI assistants that are available today.

As an example, you can configure the DEX MCP server in Claude by downloading the Claude Desktop client, then selecting Claude Code > Developer > Edit Config. You will be prompted to open “claude_desktop_config.json” in a code editor of your choice. Simply add the following JSON configuration, and you’re ready to use Claude to call the DEX MCP server.

{
  "globalShortcut": "",
  "mcpServers": {
    "cloudflare-dex-analysis": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://dex.mcp.cloudflare.com/sse"
      ]
    }
  }
}

Get started with Cloudflare One today

Are you ready to secure your Internet traffic, employee devices, and private resources without compromising speed? You can get started with our new Cloudflare One AI powered tools today.

The WARP diagnostic analyzer and the DEX MCP server are generally available to all customers. Head to the Zero Trust dashboard to run a WARP diagnostic and learn more about your client’s connectivity with the WARP diagnostic analyzer. You can test out the new DEX MCP server (https://dex.mcp.cloudflare.com/sse) in less than one minute at playground.ai.cloudflare.com, and you can also configure an AI assistant like Claude to use the new DEX MCP server.

If you don’t have a Cloudflare account, and you want to try these new features, you can create a free account for up to 50 users. If you’re an Enterprise customer, and you’d like a demo of these new Cloudflare One AI features, you can reach out to your account team to set up a demo anytime. 

You can stay up to date on latest feature releases across the Cloudflare One platform by following the Cloudflare One changelogs and joining the conversation in the Cloudflare community hub or on our Discord Server.


Cloudflare is the best place to build realtime voice agents

Post Syndicated from Renan Dincer original https://blog.cloudflare.com/cloudflare-realtime-voice-ai/

The way we interact with AI is fundamentally changing. While text-based interfaces like ChatGPT have shown us what’s possible, in terms of interaction, it’s only the beginning. Humans communicate not only by texting, but also talking — we show things, we interrupt and clarify in real-time. Voice AI brings these natural interaction patterns to our applications.

Today, we’re excited to announce new capabilities that make it easier than ever to build real-time, voice-enabled AI applications on Cloudflare’s global network. These new features create a complete platform for developers building the next generation of conversational AI experiences or can function as building blocks for more advanced AI agents running across platforms.

We’re launching:

  • Cloudflare Realtime Agents – A runtime for orchestrating voice AI pipelines at the edge

  • Pipe raw WebRTC audio as PCM in Workers – You can now connect WebRTC audio directly to your AI models or existing complex media pipelines already built on 

  • Workers AI WebSocket support – Realtime AI inference with models like PipeCat’s smart-turn-v2

  • Deepgram on Workers AI – Speech-to-text and text-to-speech running in over 330 cities worldwide

Why realtime AI matters now

Today, building voice AI applications is hard. You need to coordinate multiple services such as speech-to-text, language models, text-to-speech while managing complex audio pipelines, handling interruptions, and keeping latency low enough for natural conversation. 


Building production voice AI requires orchestrating a complex symphony of technologies. You need low latency speech recognition, intelligent language models that understand context and can handle interruptions, natural-sounding voice synthesis, and all of this needs to happen in under 800 milliseconds — the threshold where conversation feels natural rather than stilted. This latency budget is unforgiving. Every millisecond counts: 40ms for microphone input, 300ms for transcription, 400ms for LLM inference, 150ms for text-to-speech. Any additional latency from poor infrastructure choices or distant servers transforms a delightful experience into a frustrating one.

That’s why we’re building real-time AI tools: we want to make real-time voice AI as easy to deploy as a static website. We’re also witnessing a critical inflection point where conversational AI moves from experimental demos to production-ready systems that can scale globally. If you’re already a developer in the real-time AI ecosystem, we want to build the best building blocks for you to get the lowest latency by leveraging the 330+ datacenters Cloudflare has built.

Introducing Cloudflare Realtime Agents

Cloudflare Realtime Agents is a simple runtime for orchestrating voice AI pipelines that run on our global network, as close to your users as possible. Instead of managing complex infrastructure yourself, you can focus on building great conversational experiences.


How it works

When a user connects to your voice AI application, here’s what happens:

  1. WebRTC connection – Audio streams from the user’s device is sent to the nearest Cloudflare location via WebRTC, using Cloudflare RealtimeKit mobile or web SDKs

  2. AI pipeline orchestration – Your pre-configured pipeline runs: speech-to-text → LLM → text-to-speech, with support for interruption detection and turn-taking

  3. Your configured runtime options/callbacks/tools run

  4. Response delivery – Generated audio streams back to the user with minimal latency

The magic is in how we’ve designed this as composable building blocks. You’re not locked into a rigid pipeline — you can configure data flows, add tee and join operations, and control exactly how your AI agent behaves.

Take a look at the MyTextHandler function from the above diagram, for example. It’s just a function that takes in text and returns text back, inserted after speech-to-text and before text-to-speech:

class MyTextHandler extends TextComponent {
	env: Env;

	constructor(env: Env) {
		super();
		this.env = env;
	}

	async onTranscript(text: string) {
		const { response } = await this.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
			prompt: "You are a wikipedia bot, answer the user query:" + text,
		});
		this.speak(response!);
	}
}

Your agent is a JavaScript class that extends RealtimeAgent, where you initialize a pipeline consisting of the various text-to-speech, speech-to-text, text-to-text and even speech-to-speech transformations.

export class MyAgent extends RealtimeAgent<Env> {
	constructor(ctx: DurableObjectState, env: Env) {
		super(ctx, env);
	}

	async init(agentId: string ,meetingId: string, authToken: string, workerUrl: string, accountId: string, apiToken: string) {
		// Construct your text processor for generating responses to text
		const textHandler = new MyTextHandler(this.env);
		// Construct a Meeting object to join the RTK meeting
		const transport = new RealtimeKitTransport(meetingId, authToken, [
			{
				media_kind: 'audio',
				stream_kind: 'microphone',
			},
		]);
		const { meeting } = transport;

		// Construct a pipeline to take in meeting audio, transcribe it using
		// Deepgram, and pass our generated responses through ElevenLabs to
		// be spoken in the meeting
		await this.initPipeline(
			[transport, new DeepgramSTT(this.env.DEEPGRAM_API_KEY), textHandler, new ElevenLabsTTS(this.env.ELEVENLABS_API_KEY), transport],
			agentId,
			workerUrl,
			accountId,
			apiToken,
		);

		// The RTK meeting object is accessible to us, so we can register handlers
		// on various events like participant joins/leaves, chat, etc.
		// This is optional
		meeting.participants.joined.on('participantJoined', (participant) => {
			textHandler.speak(`Participant Joined ${participant.name}`);
		});
		meeting.participants.joined.on('participantLeft', (participant) => {
			textHandler.speak(`Participant Left ${participant.name}`);
		});

		// Make sure to actually join the meeting after registering all handlers
		await meeting.rtkMeeting.join();
	}

	async deinit() {
		// Add any other cleanup logic required
		await this.deinitPipeline();
	}
}

View a full example in the developer docs and get your own Realtime Agent running. View Realtime Agents on your dashboard.

Built for flexibility

What makes Realtime Agents powerful is its flexibility:

  • Many AI provider options – Use the models on Workers AI, OpenAI, Anthropic, or any provider through AI Gateway

  • Multiple input/output modes – Accept audio and/or text and respond with audio and/or text

  • Stateful coordination – Maintain context across the conversation without managing complex state yourself

  • Speed and flexibility – use RealtimeKit to manage WebRTC sessions and UI for faster development, or for full control over your stack, you can also connect directly using any standard WebRTC client or raw WebSockets

  • Integrate with the Cloudflare Agents SDK

During the open beta starting today, Cloudflare Realtime Agents runtime is free to use and works with various AI models:

  • Speech and Audio: Integration with platforms like ElevenLabs and Deepgram.

  • LLM Inference: Flexible options to use large language models through Cloudflare Workers AI and AI Gateway, connect to third-party models like OpenAi, Gemini, Grok, Claude, or bring your own custom models.

Pipe raw WebRTC audio as PCM in Workers

For developers who need the most flexibility with their applications beyond Realtime Agents, we’re exposing the raw WebRTC audio pipeline directly to Workers. 

WebRTC audio in Workers works by leveraging Cloudflare’s Realtime SFU, which converts WebRTC audio in Opus codec to PCM and streams it to any WebSocket endpoint you specify. This means you can use Workers to implement:

  • Live transcription – Stream audio from a video call directly to a transcription service

  • Custom AI pipelines – Send audio to AI models without setting up complex infrastructure

  • Recording and processing – Save, audit, or analyze audio streams in real-time


WebSockets vs WebRTC for voice AI

WebSockets and WebRTC can handle audio for AI services, but they work best in different situations. WebSockets are perfect for server-to-server communication and work fine when you don’t need super-fast responses, making them great for testing and experimenting. However, if you’re building an app where users need real-time conversations with low delay, WebRTC is the better choice.

WebRTC has several advantages that make it superior for live audio streaming. It uses UDP instead of TCP, which prevents audio delays caused by lost packets holding up the entire stream (head of line blocking is a common topic discussed on this blog). The Opus audio codec in WebRTC automatically adjusts to network conditions and can handle packet loss gracefully. WebRTC also includes built-in features like echo cancellation and noise reduction that WebSockets would require you to build separately. 

With this feature, you can use WebRTC for client to server communication and leveraging Cloudflare to convert to familiar WebSockets for server-to-server communication and backend processing.

The power of Workers + WebRTC

When WebRTC audio gets converted to WebSockets, you get PCM audio at the original sample rate, and from there, you can run any task in and out of the Cloudflare developer platform:

  • Resample audio and send to different AI providers

  • Run WebAssembly-based audio processing

  • Build complex applications with Durable Objects, Alarms and other Workers primitives

  • Deploy containerized processing pipelines with Workers Containers

The WebSocket works bidirectionally, so data sent back on the WebSocket becomes available as a WebRTC track on the Realtime SFU, ready to be consumed within WebRTC.

To illustrate this setup, we’ve made a simple WebRTC application demo that uses the ElevenLabs API for  text-to-speech.

Visit the Realtime SFU developer docs on how to get started.

Realtime AI inference with WebSockets

WebSockets provide the backbone of real-time AI pipelines because it is a low-latency, bidirectional primitive with ubiquitous support in developer tooling, especially for server to server communication. Although HTTP works great for many use cases like chat or batch inference, real-time voice AI needs persistent, low-latency connections when talking to AI inference servers. To support your real-time AI workloads, Workers AI now supports WebSocket connections in select models.

Launching with PipeCat SmartTurn V2

The first model with WebSocket support is PipeCat’s smart-turn-v2 turn detection model — a critical component for natural conversation. Turn detection models determine when a speaker has finished talking and it’s appropriate for the AI to respond. Getting this right is the difference between an AI that constantly interrupts and one that feels natural to talk to.

Below is an example on how to call smart-turn-v2 running on Workers AI.

"""
Cloudflare AI WebSocket Inference - With PipeCat's smart-turn-v2
"""

import asyncio
import websockets
import json
import numpy as np

# Configuration
ACCOUNT_ID = "your-account-id"
API_TOKEN = "your-api-token"
MODEL = "@cf/pipecat-ai/smart-turn-v2"

# WebSocket endpoint
WEBSOCKET_URL = f"wss://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}?dtype=uint8"

async def run_inference(audio_data: bytes) -> dict:
    async with websockets.connect(
        WEBSOCKET_URL,
        additional_headers={
            "Authorization": f"Bearer {API_TOKEN}"
        }
    ) as websocket:
        await websocket.send(audio_data)
        
        response = await websocket.recv()
        result = json.loads(response)
        
        # Response format: {'is_complete': True, 'probability': 0.87}
        return result

def generate_test_audio():    
    noise = np.random.normal(128, 20, 8192).astype(np.uint8)
    noise = np.clip(noise, 0, 255) 
    
    return noise

async def demonstrate_inference():
    # Generate test audio
    noise = generate_test_audio()
    
    try:
        print("\nTesting noise...")
        noise_result = await run_inference(noise.tobytes())
        print(f"Noise result: {noise_result}")
        
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    asyncio.run(demonstrate_inference())

Deepgram in Workers AI

On Wednesday, we announced that Deepgram’s speech-to-text and text-to-speech models are available on Workers AI, running in Cloudflare locations worldwide. This means:

  • Lower latency – Speech recognition happens at the edge, close to users running in the same network as Workers

  • WebRTC audio processing without leaving the Cloudflare network

  • State-of-the-art audio ML models powerful, capable, and fast audio models, available directly through Workers AI

  • Global scale – leverages Cloudflare’s global network in 330+ cities automatically

Deepgram is a popular choice for voice AI applications. By building your voice AI systems on the Cloudflare platform, you get access to powerful models and the lowest latency infrastructure to give your application a natural, responsive experience.

Interested in other realtime AI models running on Cloudflare?

If you’re developing AI models for real-time applications, we want to run them on Cloudflare’s network. Whether you have proprietary models or need ultra-low latency inference at scale with open source models reach out to us.

Get started today

All of these features are available now:

Want to pick the brains of the engineers who built this? Join them for technical deep dives, live demos Q&A at Cloudflare Connect in Las Vegas. Explore the full schedule and register.


Cloudy Summarizations of Email Detections: Beta Announcement

Post Syndicated from Ayush Kumar original https://blog.cloudflare.com/cloudy-driven-email-security-summaries/

Background

Organizations face continuous threats from phishing, business email compromise (BEC), and other advanced email attacks. Attackers adapt their tactics daily, forcing defenders to move just as quickly to keep inboxes safe.

Cloudflare’s visibility across a large portion of the Internet gives us an unparalleled view of malicious campaigns. We process billions of email threat signals every day, feeding them into multiple AI and machine learning models. This lets our detection team create and deploy new rules at high speed, blocking malicious and unwanted emails before they reach the inbox.

But rapid protection introduces a new challenge: making sure security teams understand exactly what we blocked — and why.

The Challenge

Cloudflare’s fast-moving detection pipeline is one of our greatest strengths — but it also creates a communication gap for customers. Every day, our detection analysts publish new rules to block phishing, BEC, and other unwanted messages. These rules often blend signals from multiple AI and machine learning models, each looking at different aspects of a message like its content, headers, links, attachments, and sender reputation.

While this layered approach catches threats early, SOC teams don’t always have insight into the specific combination of factors that triggered a detection. Instead, they see a rule name in the investigation tab with little explanation of what it means.

Take the rule BEC.SentimentCM_BEC.SpoofedSender as an example. Internally, we know this indicates:

  • The email contained no unique links or attachments a common BEC pattern

  • It was flagged as highly likely to be BEC by our Churchmouse sentiment analysis models

  • Spoofing indicators were found, such as anomalies in the envelope_from header

Those details are second nature to our detection team, but without that context, SOC analysts are left to reverse-engineer the logic from opaque labels. They don’t see the nuanced ML outputs (like Churchmouse’s sentiment scoring) or the subtle header anomalies, or the sender IP/domain reputation data that factored into the decision.

The result is time lost to unclear investigations or the risk of mistakenly releasing malicious emails. For teams operating under pressure, that’s more than just an inconvenience, it’s a security liability.

That’s why we extended Cloudy (our AI-powered agent) to translate complex detection logic into clear explanations, giving SOC teams the context they need without slowing them down.

Enter Cloudy Summaries

Several weeks ago, we launched Cloudy within our Cloudflare One product suite to help customers understand gateway policies and their impacts (you can read more about the launch here: https://blog.cloudflare.com/introducing-ai-agent/).

We began testing Cloudy’s ability to explain the detections and updates we continuously deploy. Our first attempt revealed significant challenges.


The Hallucination Problem

We observed frequent LLM hallucinations, the model generating inaccurate information about messages. While this might be acceptable when analyzing logs, it’s dangerous for email security detections. A hallucination claiming a malicious message is clean could lead SOC analysts to release it from quarantine, potentially causing a security breach.

These hallucinations occurred because email detections involve numerous and complex inputs. Our scanning process runs messages through multiple ML algorithms examining different components: body content, attachments, links, IP reputation, and more. The same complexity that makes manual detection explanation difficult also caused our initial LLM implementation to produce inconsistent and sometimes inaccurate outputs.

Building Guardrails

To minimize hallucination risk while maintaining inbox security, we implemented several manual safeguards:

Step 1: RAG Implementation

We ensured Cloudy only accessed information from our detection dataset corpus, creating a Retrieval-Augmented Generation (RAG) system. This significantly reduced hallucinations by grounding the LLM’s assessments in actual detection data.

Step 2: Model Context Enhancement

We added crucial context about our internal models. For example, the “Churchmouse” designation refers to a group of sentiment detection models, not a single algorithm. Without this context, Cloudy attempted to define “churchmouse” using the common idiom “poor as a church mouse” referencing starving church mice because holy bread never falls to the floor. While historically interesting, this was completely irrelevant to our security context.

Current Results

Our testing shows Cloudy now produces more stable explanations with minimal hallucinations. For example, the detection SPAM.ASNReputation.IPReputation_Scuttle.Anomalous_HC now generates this summary:

“This rule flags email messages as spam if they come from a sender with poor Internet reputation, have been identified as suspicious by a blocklist, and have unusual email server setup, indicating potential malicious activity.”

This strikes the right balance. Customers can quickly understand what the detection found and why we classified the message accordingly.

Beta Program

We’re opening Cloudy email detection summaries to a select group of beta users. Our primary goal is ensuring our guardrails prevent hallucinations that could lead to security compromises. During this beta phase, we’ll rigorously test outputs and verify their quality before expanding access to all customers.

Ready to enhance your email security?

We provide all organizations (whether a Cloudflare customer or not) with free access to our Retro Scan tool, allowing them to use our predictive AI models to scan existing inbox messages. Retro Scan will detect and highlight any threats found, enabling organizations to remediate them directly in their email accounts. With these insights, organizations can implement further controls, either using Cloudflare Email Security or their preferred solution, to prevent similar threats from reaching their inboxes in the future.

If you are interested in how Cloudflare can help secure your inboxes, sign up for a phishing risk assessment here


AI Gateway now gives you access to your favorite AI models, dynamic routing and more — through just one endpoint

Post Syndicated from Michelle Chen original https://blog.cloudflare.com/ai-gateway-aug-2025-refresh/

Getting the observability you need is challenging enough when the code is deterministic, but AI presents a new challenge — a core part of your user’s experience now relies on a non-deterministic engine that provides unpredictable outputs. On top of that, there are many factors that can influence the results: the model, the system prompt. And on top of that, you still have to worry about performance, reliability, and costs. 

Solving performance, reliability and observability challenges is exactly what Cloudflare was built for, and two years ago, with the introduction of AI Gateway, we wanted to extend to our users the same levels of control in the age of AI. 

Today, we’re excited to announce several features to make building AI applications easier and more manageable: unified billing, secure key storage, dynamic routing, security controls with Data Loss Prevention (DLP). This means that AI Gateway becomes your go-to place to control costs and API keys, route between different models and providers, and manage your AI traffic. Check out our new AI Gateway landing page for more information at a glance.

Connect to all your favorite AI providers

When using an AI provider, you typically have to sign up for an account, get an API key, manage rate limits, top up credits — all within an individual provider’s dashboard. Multiply that for each of the different providers you might use, and you’ll soon be left with an administrative headache of bills and keys to manage.

With AI Gateway, you can now connect to major AI providers directly through Cloudflare and manage everything through one single plane. We’re excited to partner with Anthropic, Google, Groq, OpenAI, and xAI to provide Cloudflare users with access to their models directly through Cloudflare. With this, you’ll have access to over 350+ models across 6 different providers.

You can now get billed for usage across different providers directly through your Cloudflare account. This feature is available for Workers Paid users, where you’ll be able to add credits to your Cloudflare account and use them for AI inference to all the supported providers. You’ll be able to see real-time usage statistics and manage your credits through the AI Gateway dashboard. Your AI Gateway inference usage will also be documented in your monthly Cloudflare invoice. No more signing up and paying for each individual model provider account. 


Usage rates are based on then-current list prices from model providers — all you will need to cover is the transaction fee as you load credits into your account. Since this is one of the first times we’re launching a credits based billing system at Cloudflare, we’re releasing this feature in Closed Beta — sign up for access here.

BYO Provider Keys, now with Cloudflare Secrets Store

Although we’ve introduced unified billing, some users might still want to manage their own accounts and keys with providers. We’re happy to say that AI Gateway will continue supporting our BYO Key feature, improving the experience of BYO Provider Keys by integrating with Cloudflare’s secrets management product Secrets Store. Now, you can seamlessly and securely store your keys in one centralized location and distribute them without relying on plain text. Secrets Store uses a two level key hierarchy with AES encryption to ensure that your secret stays safe, while maintaining low latency through our global configuration system, Quicksilver.

You can now save and manage keys directly through your AI Gateway dashboard or through the Secrets Store dashboard, API, or Wrangler by using the new AI Gateway scope. Scoping your secrets to AI Gateway ensures that only this specific service will be able to access your keys, meaning that secret could not be used in a Workers binding or anywhere else on Cloudflare’s platform.


You can pass your AI provider keys without including them directly in the request header. Instead of including the actual value, you can deploy the secret only using the Secrets Store reference: 

curl -X POST https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/my-gateway/anthropic/v1/messages \
 --header 'cf-aig-authorization: CLOUDFLARE_AI_GATEWAY_TOKEN \
 --header 'anthropic-version: 2023-06-01' \
 --header 'Content-Type: application/json' \
 --data  '{"model": "claude-3-opus-20240229", "messages": [{"role": "user", "content": "What is Cloudflare?"}]}'

Or, using Javascript: 

import Anthropic from '@anthropic-ai/sdk';


const anthropic = new Anthropic({
  apiKey: "CLOUDFLARE_AI_GATEWAY_TOKEN",
  baseURL: "https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/my-gateway/anthropic",
});


const message = await anthropic.messages.create({
  model: 'claude-3-opus-20240229',
  messages: [{role: "user", content: "What is Cloudflare?"}],
  max_tokens: 1024
});

By using Secrets Store to deploy your secrets, you no longer need to give every developer access to every key — instead, you can rely on Secrets Store’s role-based access control to further lock down these sensitive values. For example, you might want your security administrators to have Secrets Store admin permissions so that they can create, update, and delete the keys when necessary. With Cloudflare audit logging, all such actions will be logged so you know exactly who did what and when. Your developers, on the other hand, might only need Deploy permissions, so they can reference the values in code, whether that is a Worker or AI Gateway or both. This way, you reduce the risk of the secret getting leaked accidentally or intentionally by a malicious actor. This also allows you to update your provider keys in one place and automatically propagate that value to any AI Gateway using those values, simplifying the management. 

Unified Request/Response

We made it super easy for people to try out different AI models – but the developer experience should match that as well. We found that each provider can have slight differences in how they expect people to send their requests, so we’re excited to launch an automatic translation layer between providers. When you send a request through AI Gateway, it just works – no matter what provider or model you use.

import OpenAI from "openai";
const client = new OpenAI({
  apiKey: "YOUR_PROVIDER_API_KEY", // Provider API key
  // NOTE: the OpenAI client automatically adds /chat/completions to the end of the URL, you should not add it yourself.
  baseURL:
    "https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/compat",
});

const response = await client.chat.completions.create({
  model: "google-ai-studio/gemini-2.0-flash",
  messages: [{ role: "user", content: "What is Cloudflare?" }],
});

console.log(response.choices[0].message.content);

Dynamic Routes

When we first launched Cloudflare Workers, it was an easy way for people to intercept HTTP requests and customize actions based on different attributes. We think the same customization is necessary for AI traffic, so we’re launching Dynamic Routes in AI Gateway.

Dynamic Routes allows you to define certain actions based on different request attributes. If you have free users, maybe you want to ratelimit them to a certain request per second (RPS) or a certain dollar spend. Or maybe you want to conduct an A/B test and split 50% of traffic to Model A and 50% of traffic to Model B. You could also want to chain several models in a row, like adding custom guardrails or enhancing a prompt before it goes to another model. All of this is possible with Dynamic Routes!

We’ve built a slick UI in the AI Gateway dashboard where you can define simple if/else interactions based on request attributes or a percentage split. Once you define a route, you’ll use the route as the “model” name in your input JSON and we will manage the traffic as you defined. 


import OpenAI from "openai";

const cloudflareToken = "CF_AIG_TOKEN";
const accountId = "{account_id}";
const gatewayId = "{gateway_id}";
const baseURL = `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}`;

const openai = new OpenAI({
  apiKey: cloudflareToken,
  baseURL,
});

try {
  const model = "dynamic/<your-dynamic-route-name>";
  const messages = [{ role: "user", content: "What is a neuron?" }];
  const chatCompletion = await openai.chat.completions.create({
    model,
    messages,
  });
  const response = chatCompletion.choices[0].message;
  console.log(response);
} catch (e) {
  console.error(e);
}

Built-in security with Firewall in AI Gateway

Earlier this year we announced Guardrails in AI Gateway and now we’re expanding our security capabilities and include Data Loss Prevention (DLP) scanning in AI Gateway’s Firewall. With this, you can select the DLP profiles you are interested in blocking or flagging, and we will scan requests for the matching content. DLP profiles include general categories like “Financial Information”, “Social Security, Insurance, Tax and Identifier Numbers” that everyone has access to with a free Zero Trust account. If you would like to create a custom DLP profile to safeguard specific text, the upgraded Zero Trust plan allows you to create custom DLP profiles to catch sensitive data that is unique to your business.


False positives and grey area situations happen, we give admins controls on whether to fully block or just alert on DLP matches. This allows administrators to monitor for potential issues without creating roadblocks for their users.. Each log on AI gateway now includes details about the DLP profiles matched on your request, and the action that was taken:


More coming soon…

If you think about the history of Cloudflare, you’ll notice similar patterns that we’re following for the new vision for AI Gateway. We want developers of AI applications to be able to have simple interconnectivity, observability, security, customizable actions, and more — something that Cloudflare has a proven track record of accomplishing for global Internet traffic. We see AI Gateway as a natural extension of Cloudflare’s mission, and we’re excited to make it come to life.

We’ve got more launches up our sleeves, but we couldn’t wait to get these first handful of features into your hands. Read up about it in our developer docs, give it a try, and let us know what you think. If you want to explore larger deployments, reach out for a consultation with Cloudflare experts.


How we built the most efficient inference engine for Cloudflare’s network

Post Syndicated from Vlad Krasnov original https://blog.cloudflare.com/cloudflares-most-efficient-ai-inference-engine/

Inference powers some of today’s most powerful AI products: chat bot replies, AI agents, autonomous vehicle decisions, and fraud detection. The problem is, if you’re building one of these products on top of a hyperscaler, you’ll likely need to rent expensive GPUs from large centralized data centers to run your inference tasks. That model doesn’t work for Cloudflare — there’s a mismatch between Cloudflare’s globally-distributed network and a typical centralized AI deployment using large multi-GPU nodes. As a company that operates our own compute on a lean, fast, and widely distributed network within 50ms of 95% of the world’s Internet-connected population, we need to be running inference tasks more efficiently than anywhere else.

This is further compounded by the fact that AI models are getting larger and more complex. As we started to support these models, like the Llama 4 herd and gpt-oss, we realized that we couldn’t just throw money at the scaling problems by buying more GPUs. We needed to utilize every bit of idle capacity and be agile with where each model is deployed. 

After running most of our models on the widely used open source inference and serving engine vLLM, we figured out it didn’t allow us to fully utilize the GPUs at the edge. Although it can run on a very wide range of hardware, from personal devices to data centers, it is best optimized for large data centers. When run as a dedicated inference server on powerful hardware serving a specific model, vLLM truly shines. However, it is much less optimized for dynamic workloads, distributed networks, and for the unique security constraints of running inference at the edge alongside other services.

That’s why we decided to build something that will be able to meet the needs of Cloudflare inference workloads for years to come. Infire is an LLM inference engine, written in Rust, that employs a range of techniques to maximize memory, network I/O, and GPU utilization. It can serve more requests with fewer GPUs and significantly lower CPU overhead, saving time, resources, and energy across our network. 

Our initial benchmarking has shown that Infire completes inference tasks up to 7% faster than vLLM 0.10.0 on unloaded machines equipped with an H100 NVL GPU. On infrastructure under real load, it performs significantly better. 

Currently, Infire is powering the Llama 3.1 8B model for Workers AI, and you can test it out today at @cf/meta/llama-3.1-8b-instruct!

The Architectural Challenge of LLM Inference at Cloudflare 

Thanks to industry efforts, inference has improved a lot over the past few years. vLLM has led the way here with the recent release of the vLLM V1 engine with features like an optimized KV cache, improved batching, and the implementation of Flash Attention 3. vLLM is great for most inference workloads — we’re currently using it for several of the models in our Workers AI catalog — but as our AI workloads and catalog has grown, so has our need to optimize inference for the exact hardware and performance requirements we have. 

Cloudflare is writing much of our new infrastructure in Rust, and vLLM is written in Python. Although Python has proven to be a great language for prototyping ML workloads, to maximize efficiency we need to control the low-level implementation details. Implementing low-level optimizations through multiple abstraction layers and Python libraries adds unnecessary complexity and leaves a lot of CPU performance on the table, simply due to the inefficiencies of Python as an interpreted language.

We love to contribute to open-source projects that we use, but in this case our priorities may not fit the goals of the vLLM project, so we chose to write a server for our needs. For example, vLLM does not support co-hosting multiple models on the same GPU without using Multi-Instance GPU (MIG), and we need to be able to dynamically schedule multiple models on the same GPU to minimize downtime. We also have an in-house AI Research team exploring unique features that are difficult, if not impossible, to upstream to vLLM. 

Finally, running code securely is our top priority across our platform and Workers AI is no exception. We simply can’t trust a 3rd party Python process to run on our edge nodes alongside the rest of our services without strong sandboxing. We are therefore forced to run vLLM via gvisor. Having an extra virtualization layer adds an additional performance overhead to vLLM. More importantly, it also increases the startup and tear down time for vLLM instances — which are already pretty long. Under full load on our edge nodes, vLLM running via gvisor consumes as much as 2.5 CPU cores, and is forced to compete for CPU time with other crucial services, that in turn slows vLLM down and lowers GPU utilization as a result.

While developing Infire, we’ve been incorporating the latest research in inference efficiency — let’s take a deeper look at what we actually built.

How Infire works under the hood 

Infire is composed of three major components: an OpenAI compatible HTTP server, a batcher, and the Infire engine itself.


An overview of Infire’s architecture 

Platform startup

When a model is first scheduled to run on a specific node in one of our data centers by our auto-scaling service, the first thing that has to happen is for the model weights to be fetched from our R2 object storage. Once the weights are downloaded, they are cached on the edge node for future reuse.

As the weights become available either from cache or from R2, Infire can begin loading the model onto the GPU. 

Model sizes vary greatly, but most of them are large, so transferring them into GPU memory can be a time-consuming part of Infire’s startup process. For example, most non-quantized models store their weights in the BF16 floating point format. This format has the same dynamic range as the 32-bit floating format, but with reduced accuracy. It is perfectly suited for inference providing the sweet spot of size, performance and accuracy. As the name suggests, the BF16 format requires 16 bits, or 2 bytes per weight. The approximate in-memory size of a given model is therefore double the size of its parameters. For example, LLama3.1 8B has approximately 8B parameters, and its memory footprint is about 16GB. A larger model, like LLama4 Scout, has 109B parameters, and requires around 218GB of memory. Infire utilizes a combination of Page Locked memory with CUDA asynchronous copy mechanism over multiple streams to speed up model transfer into GPU memory.

While loading the model weights, Infire begins just-in-time compiling the required kernels based on the model’s parameters, and loads them onto the device. Parallelizing the compilation with model loading amortizes the latency of both processes. The startup time of Infire when loading the Llama-3-8B-Instruct model from disk is just under 4 seconds. 

The HTTP server

The Infire server is built on top of hyper, a high performance HTTP crate, which makes it possible to handle hundreds of connections in parallel – while consuming a modest amount of CPU time. Because of ChatGPT’s ubiquity, vLLM and many other services offer OpenAI compatible endpoints out of the box. Infire is no different in that regard. The server is responsible for handling communication with the client: accepting connections, handling prompts and returning responses. A prompt will usually consist of some text, or a “transcript” of a chat session along with extra parameters that affect how the response is generated. Some parameters that come with a prompt include the temperature, which affects the randomness of the response, as well as other parameters that affect the randomness and length of a possible response.

After a request is deemed valid, Infire will pass it to the tokenizer, which transforms the raw text into a series of tokens, or numbers that the model can consume. Different models use different kinds of tokenizers, but the most popular ones use byte-pair encoding. For tokenization, we use HuggingFace’s tokenizers crate. The tokenized prompts and params are then sent to the batcher, and scheduled for processing on the GPU, where they will be processed as vectors of numbers, called embeddings.

The batcher

The most important part of Infire is in how it does batching: by executing multiple requests in parallel. This makes it possible to better utilize memory bandwidth and caches. 

In order to understand why batching is so important, we need to understand how the inference algorithm works. The weights of a model are essentially a bunch of two-dimensional matrices (also called tensors). The prompt represented as vectors is passed through a series of transformations that are largely dominated by one operation: vector-by-matrix multiplication. The model weights are so large, that the cost of the multiplication is dominated by the time it takes to fetch it from memory. In addition, modern GPUs have hardware units dedicated to matrix-by-matrix multiplications (called Tensor Cores on Nvidia GPUs). In order to amortize the cost of memory access and take advantage of the Tensor Cores, it is necessary to aggregate multiple operations into a larger matrix multiplication.

Infire utilizes two techniques to increase the size of those matrix operations. The first one is called prefill: this technique is applied to the prompt tokens. Because all of the prompt tokens are available in advance and do not require decoding, they can all be processed in parallel. This is one reason why input tokens are often cheaper (and faster) than output tokens.


How Infire enables larger matrix multiplications via batching

The other technique is called batching: this technique aggregates multiple prompts into a single decode operation.

Infire mixes both techniques. It attempts to process as many prompts as possible in parallel, and fills the remaining slots in a batch with prefill tokens from incoming prompts. This is also known as continuous batching with chunked prefill.

As tokens get decoded by the Infire engine, the batcher is also responsible for retiring prompts that reach an End of Stream token, and sending tokens back to the decoder to be converted into text. 

Another job the batcher has is handling the KV cache. One demanding operation in the inference process is called attention. Attention requires going over the KV values computed for all of the tokens up to the current one. If we had to recompute those previously encountered KV values for every new token we decode, the runtime of the process would explode for longer context sizes. However, using a cache, we can store all of the previous values and re-read them for each consecutive token. Potentially the KV cache for a prompt can store KV values for as many tokens as the context window allows. In LLama 3, the maximal context window is 128K tokens. If we pre-allocated the KV cache for each prompt in advance, we would only have enough memory available to execute 4 prompts in parallel on H100 GPUs! The solution for this is paged KV cache. With paged KV caching, the cache is split into smaller chunks called pages. When the batcher detects that a prompt would exceed its KV cache, it simply assigns another page to that prompt. Since most prompts rarely hit the maximum context window, this technique allows for essentially unlimited parallelism under typical load.

Finally, the batcher drives the Infire forward pass by scheduling the needed kernels to run on the GPU.

CUDA kernels

Developing Infire gives us the luxury of focusing on the exact hardware we use, which is currently Nvidia Hopper GPUs. This allowed us to improve performance of specific compute kernels using low-level PTX instructions for this specific architecture.

Infire just-in-time compiles its kernel for the specific model it is running, optimizing for the model’s parameters, such as the hidden state size, dictionary size and the GPU it is running on. For some operations, such as large matrix multiplications, Infire will utilize the high performance cuBLASlt library, if it would deem it faster.

Infire also makes use of very fine-grained CUDA graphs, essentially creating a dedicated CUDA graph for every possible batch size on demand. It then stores it for future launch. Conceptually, a CUDA graph is another form of just-in-time compilation: the CUDA driver replaces a series of kernel launches with a single construct (the graph) that has a significantly lower amortized kernel launch cost, thus kernels executed back to back will execute faster when launched as a single graph as opposed to individual launches.

How Infire performs in the wild 

We ran synthetic benchmarks on one of our edge nodes with an H100 NVL GPU.

The benchmark we ran was on the widely used ShareGPT v3 dataset. We ran the benchmark on a set of 4,000 prompts with a concurrency of 200. We then compared Infire and vLLM running on bare metal as well as vLLM running under gvisor, which is the way we currently run in production. In a production traffic scenario, an edge node would be competing for resources with other traffic. To simulate this, we benchmarked vLLM running in gvisor with only one CPU available.

requests/s

tokens/s

CPU load

Infire

40.91

17224.21

25%

vLLM 0.10.0

38.38

16164.41

140%

vLLM under gvisor

37.13

15637.32

250%

vLLM under gvisor with CPU constraints

22.04

9279.25

100%

As evident from the benchmarks we achieved our initial goal of matching and even slightly surpassing vLLM performance, but more importantly, we’ve done so at a significantly lower CPU usage, in large part because we can run Infire as a trusted bare-metal process. Inference no longer takes away precious resources from our other services and we see GPU utilization upward of 80%, reducing our operational costs.

This is just the beginning. There are still multiple proven performance optimizations yet to be implemented in Infire – for example, we’re integrating Flash Attention 3, and most of our kernels don’t utilize kernel fusion. Those and other optimizations will allow us to unlock even faster inference in the near future.

What’s next 

Running AI inference presents novel challenges and demands to our infrastructure. Infire is how we’re running AI efficiently — close to users around the world. By building upon techniques like continuous batching, a paged KV-cache, and low-level optimizations tailored to our hardware, Infire maximizes GPU utilization while minimizing overhead. Infire completes inference tasks faster and with a fraction of the CPU load of our previous vLLM-based setup, especially under the strict security constraints we require. This allows us to serve more requests with fewer resources, making requests served via Workers AI faster and more efficient.

However, this is just our first iteration — we’re excited to build in multi-GPU support for larger models, quantization, and true multi-tenancy into the next version of Infire. This is part of our goal to make Cloudflare the best possible platform for developers to build AI applications.

Want to see if your AI workloads are faster on Cloudflare? Get started with Workers AI today. 

How Cloudflare runs more AI models on fewer GPUs: A technical deep-dive

Post Syndicated from Sven Sauleau original https://blog.cloudflare.com/how-cloudflare-runs-more-ai-models-on-fewer-gpus/

As the demand for AI products grows, developers are creating and tuning a wider variety of models. While adding new models to our growing catalog on Workers AI, we noticed that not all of them are used equally – leaving infrequently used models occupying valuable GPU space. Efficiency is a core value at Cloudflare, and with GPUs being the scarce commodity they are, we realized that we needed to build something to fully maximize our GPU usage.

Omni is an internal platform we’ve built for running and managing AI models on Cloudflare’s edge nodes. It does so by spawning and managing multiple models on a single machine and GPU using lightweight isolation. Omni makes it easy and efficient to run many small and/or low-volume models, combining multiple capabilities by:  

  • Spawning multiple models from a single control plane,

  • Implementing lightweight process isolation, allowing models to spin up and down quickly,

  • Isolating the file system between models to easily manage per-model dependencies, and

  • Over-committing GPU memory to run more models on a single GPU.

Cloudflare aims to place GPUs as close as we possibly can to people and applications that are using them. With Omni in place, we’re now able to run more models on every node in our network, improving model availability, minimizing latency, and reducing power consumed by idle GPUs.

Here’s how. 

Omni’s architecture – at a glance

At a high level, Omni is a platform to run AI models. When an inference request is made on Workers AI, we load the model’s configuration from Workers KV and our routing layer forwards it to the closest Omni instance that has available capacity. For inferences using the Asynchronous Batch API, we route to an Omni instance that is idle, which is typically in a location where it’s night.

Omni runs a few checks on the inference request, runs model specific pre and post processing, then hands the request over to the model.


Elastic scaling by spawning multiple models from a single control plane

If you’re developing an AI application, a typical setup is having a container or a VM dedicated to running a single model with a GPU attached to it. This is simple. But it’s also heavy-handed — because it requires managing the entire stack from provisioning the VM, installing GPU drivers, downloading model weights, and managing the Python environment. At scale, managing infrastructure this way is incredibly time consuming and often requires an entire team. 

If you’re using Workers AI, we handle all of this for you. Omni uses a single control plane for running multiple models, called the scheduler, which automatically provisions models and spawns new instances as your traffic scales. When starting a new model instance, it downloads model weights, Python code, and any other dependencies. Omni’s scheduler provides fine-grained control and visibility over the model’s lifecycle: it receives incoming inference requests and routes them to the corresponding model processes, being sure to distribute the load between multiple GPUs. It then makes sure the model processes are running, rolls out new versions as they are released, and restarts itself when detecting errors or failure states. It also collects metrics for billing and emits logs.

The inference itself is done by a per-model process, supervised by the scheduler. It receives the inference request and some metadata, then sends back a response. Depending on the model, the response can be various types; for instance, a JSON object or a SSE stream for text generation, or binary for image generation.

The scheduler and the child processes communicate by passing messages over Inter-Process Communication (IPC). Usually the inference request is buffered in the scheduler for applying features, like prompt templating or tool calling, before the request is passed to the child process. For potentially large binary requests, the scheduler hands over the underlying TCP connection to the child process for consuming the request body directly.

Implementing lightweight process and Python isolation

Typically, deploying a model requires its own dedicated container, but we want to colocate more models on a single container to conserve memory and GPU capacity. In order to do so, we needed finer-grained controls over CPU memory and the ability to isolate a model from its dependencies and environment. We deploy Omni in two configurations; a container running multiple models or bare metal running a single model. In both cases, process isolation and Python virtual environments allow us to isolate models with different dependencies by creating namespaces and are limited by cgroups

Python doesn’t take into account cgroups memory limits for memory allocations, which can lead to OOM errors. Many AI Python libraries rely on psutil for pre-allocating CPU memory. psutil reads /proc/meminfo to determine how much memory is available. Since in Omni each model has its own configurable memory limits, we need psutil to reflect the current usage and limits for a given model, not for the entire system.

The solution for us was to create a virtual file system, using fuse, to mount our own version of /proc/meminfo which reflects the model’s current usage and limits.

To illustrate this, here’s an Omni instance running a model (running as pid 8). If we enter the mount namespace and look at /proc/meminfo it will reflect the model’s configuration:

# Enter the mount (file system) namespace of a child process
$ nsenter -t 8 -m

$ mount
...
none /proc/meminfo fuse ...

$ cat /proc/meminfo
MemTotal:     7340032 kB
MemFree:     7316388 kB
MemAvailable:     7316388 kB

In this case the model has 7Gib of memory available and the entire container 15Gib. If the model tries to allocate more than 7Gib of memory, it will be OOM killed and restarted by the scheduler’s process manager, without causing any problems to the other models.

For isolating Python and some system dependencies, each model runs in a Python virtual environment, managed by uv. Dependencies are cached on the machine and, if possible, shared between models (uv uses symbolic links between its cache and virtual environments).

Also separated processes for models allows to have different CUDA contexts and isolation for error recovery. 

Over-committing memory to run more models on a single GPU

Some models don’t receive enough traffic to fully utilize a GPU, and with Omni we can pack more models on a single GPU, freeing up capacity for other workloads. When it comes to GPU memory management, Omni has two main jobs: safely over-commit GPU memory, so that more models than normal can share a single GPU, and enforce memory limits, to prevent any single model from running out of memory while running.      

Over-committing memory means allocating more memory than is physically available to the device. 

For example, if a GPU has 10 Gib of memory, Omni would allow 2 models of 10Gib each on that GPU.

Right now, Omni is configured to run 13 models and is allocating about 400% GPU memory on a single GPU, saving up 4 GPUs. Omni does this by injecting a CUDA stub library that intercepts CUDA memory allocations (cuMalloc* or cudaMalloc*) calls and forces memory allocations to be performed in unified memory mode.

In Unified memory mode CUDA shares the same memory address space for both the GPU and the CPU:


CUDA’s unified memory mode 

In practice this is what memory over-commitment looks like: imagine 3 models (A, B and C). Models A+B fit in the GPU’s memory but C takes up the entire memory.

  1. Models A+B are loaded first and are in GPU memory, while model C is in CPU memory


  2. Omni receives a request for model C so models A+B are swapped out and C is swapped in.


  3. Omni receives a request for model B, so model C is partly swapped out and model B is swapped back in.


  4. Omni receives a request for model A, so model A is swapped back in and model C is completely swapped out.


The trade-off is added latency: if performing an inference requires memory that is currently on the host system, it must be transferred to the GPU. For smaller models, this latency is minimal, because with PCIe 4.0, the physical bus between your GPU and system, provides 32 GB/sec of bandwidth. On the other hand, if a model need to be “cold started” i.e. it’s been swapped out because it hasn’t been used in a while, the system may need to swap back the entire model – a larger sized model, for example, might use 5Gib of GPU memory for weights and caches, and would take ~156ms to be swapped back into the GPU. Naturally, over time, inactive models are put into CPU memory, while active models stay hot in the GPU.

Rather than allowing the model to choose how much GPU memory it uses, AI frameworks tend to pre-allocate as much GPU memory as possible for performance reasons, making co-locating models more complicated. Omni allows us to control how much memory is actually exposed to any given model to prevent a greedy model from over-using the GPU allocated to it. We do this by overriding the CUDA runtime and driver APIs (cudaMemGetInfo and cuMemGetInfo). Instead of exposing the entire GPU memory, we only expose a subset of memory to each model.

How Omni runs multiple models for Workers AI 

AI models can run in a variety of inference engines or backends: vLLM, Python, and now our very own inference engine, Infire. While models have different capabilities, each model needs to support Workers AI features, like batching and function calling. Omni acts as a unified layer for integrating these systems. It integrates into our internal routing and scheduling systems, and provides a Python API for our engineering team to add new models more easily. Let’s take a closer look at how Omni does this in practice:

from omni import Response
import cowsay


def handle_request(request, context):
    try:
        json = request.body.json
        text = json["text"]
    except Exception as err:
        return Response.error(...)

    return cowsay.get_output_string('cow', text)

Similar to how a JavaScript Worker works, Omni calls a request handler, running the model’s logic and returning a response. 

Omni installs Python dependencies at model startup. We run an internal Python registry and mirror the public registry. In either case we declare dependencies in requirements.txt:

cowsay==6.1

The handle_request function can be async and return different Python types, including pydantic objects. Omni will convert the return value into a Workers AI response for the eyeball.

A Python package is injected, named omni, containing all the Python APIs to interact with the request, the Workers AI systems, building Responses, error handling, etc. Internally we publish it as regular Python package to be used in standalone, for unit testing for instance:

from omni import Context, Request
from model import handle_request


def test_basic():
    ctx = Context.inactive()
    req = Request(json={"text": "my dog is cooler than you!"})
    out = handle_request(req, ctx)
    assert out == """  __________________________
| my dog is cooler than you! |
  ==========================
                          \\
                           \\
                             ^__^
                             (oo)\\_______
                             (__)\\       )\\/\\
                                 ||----w |
                                 ||     ||"""

What’s next 

Omni allows us to run models more efficiently by spawning them from a single control plane and implementing lightweight process isolation. This enables quick starting and stopping of models, isolated file systems for managing Python and system dependencies, and over-committing GPU memory to run more models on a single GPU. This improves the performance for our entire Workers AI stack, reduces the cost of running GPUs, and allows us to ship new models and features quickly and safely.

Right now, Omni is running in production on a handful of models in the Workers AI catalog, and we’re adding more every week. Check out Workers AI today to experience Omni’s performance benefits on your AI application. 

State-of-the-art image generation Leonardo models and text-to-speech Deepgram models now available in Workers AI

Post Syndicated from Michelle Chen original https://blog.cloudflare.com/workers-ai-partner-models/

When we first launched Workers AI, we made a bet that AI models would get faster and smaller. We built our infrastructure around this hypothesis, adding specialized GPUs to our datacenters around the world that can serve inference to users as fast as possible. We created our platform to be as general as possible, but we also identified niche use cases that fit our infrastructure well, such as low-latency image generation or real-time audio voice agents. To lean in on those use cases, we’re bringing on some new models that will help make it easier to develop for these applications.

Today, we’re excited to announce that we are expanding our model catalog to include closed-source partner models that fit this use case. We’ve partnered with Leonardo.Ai and Deepgram to bring their latest and greatest models to Workers AI, hosted on Cloudflare’s infrastructure. Leonardo and Deepgram both have models with a great speed-to-performance ratio that suit the infrastructure of Workers AI. We’re starting off with these great partners — but expect to expand our catalog to other partner models as well.

The benefits of using these models on Workers AI is that we don’t only have a standalone inference service, we also have an entire suite of Developer products that allow you to build whole applications around AI. If you’re building an image generation platform, you could use Workers to host the application logic, Workers AI to generate the images, R2 for storage, and Images for serving and transforming media. If you’re building Realtime voice agents, we offer WebRTC and WebSocket support via Workers, speech-to-text, text-to-speech, and turn detection models via Workers AI, and an orchestration layer via Cloudflare Realtime. All in all, we want to lean into use cases that we think Cloudflare has a unique advantage in, with developer tools to back it up, and make it all available so that you can build the best AI applications on top of our holistic Developer Platform.

Leonardo Models

Leonardo.Ai is a generative AI media lab that trains their own models and hosts a platform for customers to create generative media. The Workers AI team has been working with Leonardo for a while now and have experienced the magic of their image generation models firsthand. We’re excited to bring on two image generation models from Leonardo: @cf/leonardo/phoenix-1.0 and @cf/leonardo/lucid-origin.

“We’re excited to enable Cloudflare customers a new avenue to extend and use our image generation technology in creative ways such as creating character images for gaming, generating personalized images for websites, and a host of other uses… all through the Workers AI and the Cloudflare Developer Platform.” – Peter Runham, CTO, Leonardo.Ai 

The Phoenix model is trained from the ground up by Leonardo, excelling at things like text rendering and prompt coherence. The full image generation request took 4.89s end-to-end for a 25 step, 1024×1024 image.

curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/leonardo/draco-1.0 \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "A 1950s-style neon diner sign glowing at night that reads '\''OPEN 24 HOURS'\'' with chrome details and vintage typography.",
    "width":1024,
    "height":1024,
    "steps": 25,
    "seed":1,
    "guidance": 4,
    "negative_prompt": "bad image, low quality, signature, overexposed, jpeg artifacts, undefined, unclear, Noisy, grainy, oversaturated, overcontrasted"
}'

The Lucid Origin model is a recent addition to Leonardo’s family of models and is great at generating photorealistic images. The image took 4.38s to generate end-to-end at 25 steps and a 1024×1024 image size.

curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/leonardo/lucid-origin \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "A 1950s-style neon diner sign glowing at night that reads '\''OPEN 24 HOURS'\'' with chrome details and vintage typography.",
    "width":1024,
    "height":1024,
    "steps": 25,
    "seed":1,
    "guidance": 4,
    "negative_prompt": "bad image, low quality, signature, overexposed, jpeg artifacts, undefined, unclear, Noisy, grainy, oversaturated, overcontrasted"
}'

Deepgram Models

Deepgram is a voice AI company that develops their own audio models, allowing users to interact with AI through a natural interface for humans: voice. Voice is an exciting interface because it carries higher bandwidth than text, because it has other speech signals like pacing, intonation, and more. The Deepgram models that we’re bringing on our platform are audio models which perform extremely fast speech-to-text and text-to-speech inference. Combined with the Workers AI infrastructure, the models showcase our unique infrastructure so customers can build low-latency voice agents and more.

“By hosting our voice models on Cloudflare’s Workers AI, we’re enabling developers to create real-time, expressive voice agents with ultra-low latency. Cloudflare’s global network brings AI compute closer to users everywhere, so customers can now deliver lightning-fast conversational AI experiences without worrying about complex infrastructure.” – Adam Sypniewski, CTO, Deepgram

@cf/deepgram/nova-3 is a speech-to-text model that can quickly transcribe audio with high accuracy. @cf/deepgram/aura-1 is a text-to-speech model that is context aware and can apply natural pacing and expressiveness based on the input text. The newer Aura 2 model will be available on Workers AI soon. We’ve also improved the experience of sending binary mp3 files to Workers AI, so you don’t have to convert it into an Uint8 array like you had to previously. Along with our Realtime announcements (coming soon!), these audio models are the key to enabling customers to build voice agents directly on Cloudflare.

With the AI binding, a call to the Nova 3 speech-to-text model would look like this:

const URL = "https://www.some-website.com/audio.mp3";
const mp3 = await fetch(URL);
 
const res = await env.AI.run("@cf/deepgram/nova-3", {
    "audio": {
      body: mp3.body,
      contentType: "audio/mpeg"
    },
    "detect_language": true
  });

With the REST API, it would look like this:

curl --request POST \
  --url 'https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/deepgram/nova-3?detect_language=true' \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: audio/mpeg' \
  --data-binary @/path/to/audio.mp3

As well, we’ve added WebSocket support to the Deepgram models, which you can use to keep a connection to the inference server live and use it for bi-directional input and output. To use the Nova model with WebSocket support, it would look like this:

curl --request POST \
  --url 'https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/deepgram/nova-3?detect_language=true' \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: audio/mpeg' \
  --data-binary @/path/to/audio.mp3

As well, we’ve added WebSocket support to the Deepgram models, which you can use to keep a connection to the inference server live and use it for bi-directional input and output. To use the Nova model with WebSocket support, check out our Developer Docs.

All the pieces work together so that you can:

  1. Capture audio with Cloudflare Realtime from any WebRTC source

  2. Pipe it via WebSocket to your processing pipeline

  3. Transcribe with audio ML models Deepgram running on Workers AI

  4. Process with your LLM of choice through a model hosted on Workers AI or proxied via AI Gateway

  5. Orchestrate everything with Realtime Agents

Try these models out today

Check out our developer docs for more details, pricing and how to get started with the newest partner models available on Workers AI.

Securing the AI Revolution: Introducing Cloudflare MCP Server Portals

Post Syndicated from Kenny Johnson original https://blog.cloudflare.com/zero-trust-mcp-server-portals/

Securing the AI Revolution: Introducing Cloudflare MCP Server Portals

Large Language Models (LLMs) are rapidly evolving from impressive information retrieval tools into active, intelligent agents. The key to unlocking this transformation is the Model Context Protocol (MCP), an open-source standard that allows LLMs to securely connect to and interact with any application — from Slack to Canva, to your own internal databases.

This is a massive leap forward. With MCP, an LLM client like Gemini, Claude, or ChatGPT can answer more than just “tell me about Slack.” You can ask it: “What were the most critical engineering P0s in Jira from last week, and what is the current sentiment in the #engineering-support Slack channel regarding them? Then propose updates and bug fixes to merge.”

This is the power of MCP: turning models into teammates.

But this great power comes with proportional risk. Connecting LLMs to your most critical applications creates a new, complex, and largely unprotected attack surface. Today, we change that. We’re excited to announce Cloudflare MCP Server Portals are now available in Open Beta. MCP Server Portals are a new capability that enable you to centralize, secure, and observe every MCP connection in your organization. This feature is part of Cloudflare One, our secure access service edge (SASE) platform that helps connect and protect your workspace.

What Exactly is the Model Context Protocol?

Think of MCP as a universal translator or a digital switchboard for AI. It’s a standardized set of rules that lets two very different types of software—LLMs and everyday applications—talk to each other effectively. It consists of two primary components:

  • MCP Clients: These are the LLMs you interact with, like ChatGPT, Claude, or Gemini. The client is the front end to the AI that you use to ask questions and give commands.

  • MCP Servers: These can be developed for any application you want to connect to your LLM. SaaS providers like Slack or Atlassian may offer MCP servers for their products, or your own developers can also build custom ones for internal tools.


Credit: Architecture Overview – Model Context Protocol

For a useful connection, MCP relies on a few other key concepts:

  • Resources: A mechanism for the server to give the LLM context. This could be a specific file, a database schema, or a list of users in an application.

  • Prompts: Standardized questions the server can ask the client to get the information it needs to fulfill a request (e.g., “Which user do you want to search for?”).

  • Tools: These are the actions the client can ask the server to perform, like querying a database, calling an API, or sending a message.

Without MCP, your LLM is isolated. With MCP, it’s integrated, capable of interacting with your entire software ecosystem in a structured and predictable way.

The Peril of an Unsecured AI Ecosystem

Think of an LLM as the most brilliant and enthusiastic junior hire you’ve ever had. They have boundless energy and can produce incredible work, but they lack the years of judgment to know what they shouldn’t do. The current, decentralized approach to MCP is like giving that junior hire a master key to every office and server room on their first day.

It’s not a matter of if something will go wrong, but when.

This “shadow AI” infrastructure is the modern equivalent of the early Internet, where every server had a public IP address, fully exposed to the world. It’s the Wild West of unmanaged connections, impossible to secure. And the risks go far beyond accidental data deletion. Attackers are actively exploiting the unique vulnerabilities of LLM-driven ecosystems:

  • Prompt and tool injection: This is more than just telling a model to “ignore previous instructions.” Attackers are now hiding malicious commands inside the descriptions of MCP tools themselves. Consider an LLM seeking to use a seemingly harmless “WebSearch” tool. A poisoned description could trick it into also running a query against a financial database and exfiltrating the results.

  • Supply chain attacks: How can you trust the third-party MCP servers used by your teams? In mid-2025, a critical vulnerability (CVE-2025-6514) was discovered in a popular npm package used for MCP authentication, exposing countless servers. In another incident dubbed “NeighborJack,” security researchers found hundreds of MCP servers inadvertently exposed to the public Internet because they were bound to 0.0.0.0 without a firewall, allowing for potential OS command injection and host takeover.

  • Privilege escalation and the “confused deputy”: An attacker doesn’t need to break your LLM; they just need to confuse it. In one documented case, an AI agent running with high-level privileges was tricked into executing SQL commands embedded in a support ticket. The agent, acting as a “confused deputy,” couldn’t distinguish the malicious SQL from the legitimate ticket data and dutifully executed the commands, compromising an entire database.

  • Data leakage: Without centralized controls, data can bleed between systems in unexpected ways. In June 2025, a popular team collaboration tool’s MCP integration suffered a privacy breach where a bug caused some customer information to become visible in other customers’ MCP instances, forcing them to take the integration offline for two weeks.

The Solution: A Single Front Door for Your MCP Servers

You can’t protect what you can’t see. Cloudflare MCP Server Portals solve this problem by providing a single, centralized gateway for all your MCP servers, somewhat similar to an application launcher for single sign-on. Instead of developers distributing dozens of individual server endpoints, they register their servers with Cloudflare. You provide your users with a single, unified Portal endpoint to configure in their MCP client.


This changes the security posture and user experience immediately. By routing all MCP traffic through Cloudflare, you get:

  • Centralized policy enforcement: You can integrate MCP Server Portals directly into Cloudflare One. This means you can enforce the same granular access policies for your AI connections that you do for your human users. Require multi-factor authentication, check for device posture, restrict by geography, and ensure only the right users can access specific servers and tools.

  • Comprehensive visibility and logging: Who is accessing which MCP server and which toolsets are they engaging with? What prompts are being run? What tools are being invoked? Previously, this data was scattered across every individual server. Server Portals aggregate all MCP request logs into a single place, giving you the visibility needed to audit activity and detect anomalies before they become breaches.

  • A curated AI user experience based on least privilege: Administrators can now review and approve MCP servers before making them available to users through a Portal. When a user authenticates through their Portal, they are only presented with the curated list of servers and tools they are authorized to use, preventing the use of unvetted or malicious third-party servers. This approach adheres to the Zero Trust security best practice of least privilege.

  • Simplified user configuration: Instead of having to load individual MCP server configurations into a MCP Client, users can load a single URL that pulls down all accessible MCP Servers. This drastically simplifies how many URLs need to be shared out and known by users. As new MCP Servers are added, they become dynamically available through the portal, instead of sharing each new URL on publishing of a server.

When a user connects to their MCP Server Portal, Access prompts them to authenticate with their corporate identity provider. Once authenticated, Cloudflare enforces which MCP Servers the user has access to, regardless of the underlying server’s authorization policies. 

For MCP servers with domains hosted on Cloudflare, Access policies can be used to enforce the server’s direct authorization. This is done by creating an OAuth server that is linked to the domain’s existing Access Application. For MCP servers with domains outside Cloudflare and/or hosted by a third party, they require authorization controls outside of Cloudflare Access, this is usually done using OAuth.

The Road Ahead: What’s Next for AI Security

MCP Server Portals are a foundational step in our mission to secure the AI revolution. This is just the beginning. In the coming months, we plan to build on this foundation by:

  • Mechanisms to lock down MCP Servers: Unless an MCP Server author enforces Authorization controls, users can still technically access MCPs outside of a Portal. We will build additional enforcement mechanisms to prevent this.

  • Integrating with Firewall for AI: Imagine applying the power of our WAF to your MCP traffic, detecting and blocking prompt injection attacks before they ever reach your servers.

  • Cloudflare hosted MCP Servers: We will make it easy to deploy MCP Servers using Cloudflare’s AI Gateway. This will allow for deeper prompt filtering and controls.

  • Applying machine learning to detect abuse: We will layer our own machine learning models on top of your MCP logs to automatically identify anomalous behavior, such as unusual data exfiltration patterns or suspicious tool usage.

  • Enhancing the protocol: We are committed to working with the open-source community to strengthen the MCP standard itself, contributing to a more secure and robust ecosystem for everyone.

This is our commitment: to provide the tools you need to innovate with confidence.

Get Started Today!

Progress doesn’t have to come at the expense of security. With MCP Server Portals, you can empower your teams to build the future with AI, safely. This is a critical piece of helping to build a better Internet, and we are excited to see what you will build with it.

MCP Server Portals are now available in Open Beta for all Cloudflare One customers. To get started, navigate to the Access > AI Controls page in the Zero Trust Dashboard. If you don’t have an account, you can sign up today and get started with up to 50 free seats or contact our experts to explore larger deployments.

Cloudflare is also starting a user research program focused on AI security. If you are interested in previews of new functionality or want to help shape our roadmap, please express your interest here.

ChatGPT, Claude, & Gemini security scanning with Cloudflare CASB

Post Syndicated from Alex Dunbrack original https://blog.cloudflare.com/casb-ai-integrations/

Starting today, all users of Cloudflare One, our secure access service edge (SASE) platform, can use our API-based Cloud Access Security Broker (CASB) to assess the security posture of their generative AI (GenAI) tools: specifically, OpenAI’s ChatGPT, Claude by Anthropic, and Google’s Gemini. Organizations can connect their GenAI accounts and within minutes, start detecting misconfigurations, Data Loss Prevention (DLP) matches, data exposure and sharing, compliance risks, and more — all without having to install cumbersome software onto user devices.

As Generative AI adoption has exploded in the enterprise, IT and Security teams need to hustle to keep themselves abreast of newly emerging  security and compliance challenges that come alongside these powerful tools. In this rapidly changing landscape, IT and Security teams need tools that help enable AI adoption while still protecting the security and privacy of their enterprise networks and data. 

Cloudflare’s API CASB and inline CASB work together to help organizations safely adopt AI tools. The API CASB integrations provide out-of-band visibility into data at rest and security posture inside popular AI tools like ChatGPT, Claude, and Gemini. At the same time, Cloudflare Gateway provides in-line prompt controls and Shadow AI identification. It applies policies and DLP to traffic as it moves to these AI providers. Together, these features give organizations a unified control plane for securing their use of GenAI.

What’s new

ChatGPT, Claude and Gemini are now all live in the integrations supported by Cloudflare’s API CASB. These integrations are available to all Cloudflare One users, account owners can easily connect their GenAI tenants, and CASB will scan for security issues across multiple domains:

  • Agentless Connections: Connect ChatGPT, Claude, and Gemini via agentless, API‑based integrations to scan posture and data risks; no endpoint software to install.

  • Posture Management: Detect insecure settings and misconfigurations that can lead to data exposure or misuse.

  • DLP Detection: Identify where sensitive data has been uploaded in chat attachments (prompts coming soon).

  • GenAI-specific Insights: Surface risks associated with the unique capability of a given AI provider’s toolsets.

Admins can now answer questions like: What are our employees doing in ChatGPT? What data is being uploaded and used in Claude? Is Gemini configured correctly in Google Workspace?

Now let’s take a closer look at each integration.

OpenAI ChatGPT


Cloudflare’s CASB integration with OpenAI’s ChatGPT scans for several types of insights, including:

  • External Exposure: Finds chats and GPTs that are shared beyond the tenant, like GPTs shared publicly or listed on the GPT Store, and ties them back to their owners for quick triage.

  • Secrets, Keys and Invites: Identifies API keys that aren’t rotated or are no longer used to maintain credential hygiene. Identifies over‑privileged or stale invites.

  • Sensitive Content (via DLP): Detects sensitive data (e.g. credential and secrets, financial / health information, source code, etc.) via DLP profile matches in uploaded chat attachments to enable targeted response.

Anthropic Claude

For Claude, Cloudflare is able to provide the following out-of-band detections:

  • Secrets, Keys and Invites: Surfaces high‑risk invites and entitlement drift early so the least‑privilege access control stays tight. Spots unused API keys and rotation gaps before they turn into forgotten open doors.

  • Sensitive Content (via DLP): Monitors for sensitive data in uploaded files to help organizations safely enable Claude usage while maintaining compliance. Security teams get this information as quickly as CASB scans, giving them the visibility they need to help employees use Claude productively and securely with sensitive data.

As Anthropic continues to expand Claude’s API capabilities and features, Cloudflare will add corresponding security detections to match new functionality as it becomes available.

Google Gemini

Cloudflare’s detections for Google Gemini appear as part of our API CASB integration for Google Workspace:

  • Identity & MFA: Identifies Gemini users and admins without MFA, leaving them prime targets for compromise. Imagine if an IT admin relied on Gemini daily to process corporate data, but their Google Workspace account lacked multi-factor authentication. One successful phishing email could give an attacker privileged access to Gemini and the wider Google Workspace environment — turning a minor oversight into an organization-wide breach. 

  • License Hygiene: Flags suspended accounts still holding Gemini or AI Ultra licenses to cut cost and reduce exposure. An AI Ultra user has access to more powerful and riskier features, like Project Mariner, a research prototype that acts as an autonomous agent, capable of automating up to 10 tasks simultaneously across web browsers. An attacker can cause more damage by compromising an AI Ultra user, which is why we include this in our set of detections.

The Gemini integration has a narrower scope because Google has structured their product and API differently than OpenAI or Anthropic. For organizations, Gemini is delivered as a Google Workspace add-on. Enterprises enable Gemini features in Gmail, Docs, Sheets, and other Google Workspace apps through add-on licenses such as Gemini Enterprise or AI Ultra. Our CASB detections focus on identity, MFA, and license hygiene, rather than posture issues like public sharing or custom assistant publishing because Gemini does not yet provide those API endpoints.

The Future of GenAI Posture Management

Like countless other organizations, Cloudflare is adopting GenAI, on the same journey to make these environments even safer than they are today. We are excited to extend our management coverage to our customers so they can continue to innovate with GenAI. But looking ahead, we’re encouraged to see GenAI providers take concrete steps towards making security, compliance, and data privacy even more important tenets of their platforms.

Secure GenAI beyond the reach of Inline Controls

Generative AI adoption brings new security requirements. Cloudflare CASB delivers out-of-band visibility across these tools, surfacing insights on top of inline controls. With posture, access, and data under control, organizations can embrace GenAI confidently and securely.

How to get started:

  • For existing Cloudflare One customers: Contact your account manager or enable the integrations directly in your dashboard today.

  • New to Cloudflare One? Sign up now for 50 free seats to begin securely using Gen AI immediately. For larger deployments, request a consultation with our experts.

If you want to preview other new functionality and help shape our roadmap, express interest in our user research program for AI security.

Introducing Cloudflare Application Confidence Score For AI Applications

Post Syndicated from Ayush Kumar original https://blog.cloudflare.com/confidence-score-rubric/

Introduction

The availability of SaaS and Gen AI applications is transforming how businesses operate, boosting collaboration and productivity across teams. However, with increased productivity comes increased risk, as employees turn to unapproved SaaS and Gen AI applications, often dumping sensitive data into them for quick productivity wins. 

The prevalence of “Shadow IT” and “Shadow AI” creates multiple problems for security, IT, GRC and legal teams. For example:

In spite of these problems, blanket bans of Gen AI don’t work. They stifle innovation and push employee usage underground. Instead, organizations need smarter controls.

Security, IT, legal and GRC teams therefore face a difficult challenge: how can you appropriately assess each third-party application, without auditing and crafting individual policies for every single one of them that your employees might decide to interact with? And with the rate at which they’re proliferating — how could you possibly hope to keep abreast of them all?

Today, we’re excited to announce that we’re helping these teams automate assessment of SaaS and Gen AI applications at scale with the introduction of our new Cloudflare Application Confidence Scores. Scores will soon be available as part of our new suite of AI Security Posture Management (AI-SPM) features in the Cloudflare One SASE platform, enabling IT and Security administrators to identify confidence levels associated with third-party SaaS and AI applications, and ultimately write policies informed by those confidence scores. We’re starting by scoring AI applications, because that’s where the need is most urgent.

In this blog, we’ll be covering the design of our Cloudflare Application Confidence Score, focusing specifically about the features of the score and our scoring rubric.  Our current goal is to reveal the details of our scoring rubric, which is designed to be as transparent and objective as possible — while simultaneously helping organizations of all sizes safely adopt AI, and encouraging the industry and AI providers to adopt best practices for AI safety and security.  

In the future, as part of our mission to help build a better Internet, we also plan to make Cloudflare Application Confidence Scores available for free to all our customer tiers. And even if you aren’t a Cloudflare customer, you will easily be able to browse through these Scores by creating a free account on the Cloudflare dashboard and navigating to our new Application Library.  

Transparency, not vibes

Cloudflare Application Confidence Scores is a transparent, understandable, and accountable metric that measures app safety, security, and data protection. It’s designed to give Security, IT, legal and GRC teams a rapid way of assessing the rapidly burgeoning space of AI applications.

Scores are not based on vibes or black-box “learning algorithms” or “artificial intelligence engines”.  We avoid subjective judgments or large-scale red-teaming as those can be tough to execute reliably and consistently over time. Instead, scores will be computed against an objective rubric that we describe in detail in this blog. Our rubric will be publicly maintained and kept up to date in the Cloudflare developer docs. 

Many providers of the applications that we score are also our customers and partners, so our overarching goal is to be as fair and accountable as possible. We believe that transparency will build trust in our scoring rubric and guide the industry to adopt the best practices that our scoring rubric encourages. 

Principles behind our rubric

Each component of our rubric requires a simple answer based on publicly available data like privacy policies, security documentation, compliance certifications, model cards and incident reports. If something isn’t publicly disclosed, we assign zero points to that component of the rubric, with no further assumptions or guesswork.  Scores are computed according to our rubric via an automated system that incorporates human oversight for accuracy.  We use crawlers to collect public information (e.g. privacy policies, compliance documents), process it using AI for extraction and to compute the resulting scores, and then send them to human analysts for a final review.   

Scores are reviewed on a periodic basis. If a vendor believes that we have mis-scored their application, they can submit supporting documentation via [email protected], and we will update their score if appropriate.

Scores are on a scale from 1 to 5, with 5 being the highest confidence and 1 being the most risky. We decided to use a “confidence score” instead of a “risk score” because we can express confidence in an application when it provides clear positive evidence of good security, compliance and safety practices. An application may have good practices internally, but we cannot express confidence in these practices if they are not publicly documented. Moreover, a confidence score allows us to give customers transparent information, so they can make their own informed decisions. For example, an application might get a low confidence score because it lacks a documented data retention policy. While that might be a concern for some, your organization might find it acceptable and decide to allow the application anyway.

We separately evaluate different account tiers for the same application provider, because different account tiers can provide very different levels of enterprise risk. For instance, consumer plans (e.g. ChatGPT Free) may involve training on user prompts and score lower, whereas enterprise plans (e.g. ChatGPT Enterprise) do not train on user prompts and thus score higher. 

That said, we are quite opinionated about components we selected in our rubric, drawing from deep experience of our own internal product, engineering, legal, GRC, and security teams. We prioritize factors like data retention policies and encryption standards because we believe they are foundational to protecting sensitive information in an AI-driven world. We included certifications, security frameworks and model cards because they provide evidence of maturity, stability, safety and adherence with industry best practices.

Actually, it’s really two Scores

As AI applications emerge at an unprecedented pace, the problem of “Shadow AI” intensifies traditional risks associated with Shadow IT. Shadow IT applications create risk when they retain user data for long periods, have lax security practices, are financially unstable, or widely share data with third parties.  Meanwhile, AI tools create new risks when they retain and train on user prompts, or generate responses that are biased, toxic, inaccurate or unsafe. 

To separate out these different risks, we provide two different Scores: 

  • Application Confidence Score (5 points) covers general SaaS maturity, and

  • Gen-AI Confidence Score (5 points) focused on Gen AI-specific risks.

We chose to focus on two separate areas to make our metric extensible (so that, in the future, we can apply it to applications that are not focused on Gen AI) and to make the Scores easier to understand and reason about.   

Each Score is applied to each account tier of a given Gen AI provider. For example, here’s how we scored OpenAI’s ChatGPT:

  • ChatGPT Free (App Confidence 3.3, GenAI Confidence 1) received a low score due to limited enterprise controls and higher data exposure risk since by default, input data is used for model training.

  • ChatGPT Plus (App Confidence 3.3, GenAI Confidence 3) scored slightly higher as it allows users to opt out of training on their input data.

  • ChatGPT Team (App Confidence 4.3, GenAI Confidence 3) improved further with added collaboration safeguards and configurable data retention windows.

  • ChatGPT Enterprise (App Confidence 4.3, GenAI Confidence 4) achieved the highest score, as training on input data is disabled by default while retaining the enhanced controls from the Team tier.

A detailed look at our rubric

We now walk through the details of the rubric behind each of our Scores.

Application Confidence Score (5.0 Points Total)

This half evaluates the app’s overall maturity as a SaaS service, drawing from enterprise best practices.

  • Regulatory Compliance: Checks for key certifications that signal operational maturity. We selected these because they represent proven frameworks that demonstrate a commitment to widely-adopted security and data protection best practices.

  • Data Management Practices: Focuses on how data is retained and shared to minimize exposure. These criteria were chosen as they directly impact the risk of data leaks or misuse, based on common vulnerabilities we’ve observed in SaaS environments and our own legal/GRC team’s experience assessing third-party SaaS applications at Cloudflare.

    • Documented data retention window:  Shorter retention limits risk.

      • 0 day retention: .5 points

      • 30 day retention: .4 points

      • 60 day retention: .3 points

      • 90 day retention: .1 point

      • No documented retention window: 0 points

    • Third-party sharing: No sharing means less external exposure of enterprise data. Sharing for advertising purposes means high risk of third parties mining and using the data.

      • No third-party sharing: .5 points.

      • Sharing only for troubleshooting/support: .25 points

      • Sharing for other reasons like advertising or end user targeting: 0 points

  • Security Controls: We prioritized these because they form the foundational defenses against unauthorized access, drawing from best practices that have prevented incidents in cloud services.

    • MFA support: .2 points.

    • Role-based access: .2 points.

    • Session monitoring: .2 points.

    • TLS 1.3: .2 points.

    • SSO support: .2 points.

  • Security reports and incident history: Rewards transparency and deducts for recent issues. This was included to emphasize accountability, as a history of breaches or proactive transparency often indicates how seriously a provider takes security.

    • Published safety framework and bug bounty: 1 point.

      • To get full points the company needs to have both of the following: 

        • A publicly accessible page (e.g., security, trust, or safety) that includes a comprehensive whitepaper, framework overview, OR detailed security documentation that covers:

          • Encryption in transit and at rest

          • Authentication and authorization mechanisms

          • Network or infrastructure security design

        • Incident Response Transparency – Published vulnerability disclosure or bug bounty policy OR a documented incident response process and security advisory archive.

      • Example: Google has a bug bounty program, a whitepaper providing an overview of their security posture, as well as a transparency report

    • No commitments or weak security framework with the lack of any of the above criteria. If the company only has one of the criteria above but lacks the other they will also receive no credit: 0 points.

      • Example: Lovable who has a security page but seems to lack many other parts of the criteria: https://lovable.dev/security

    • If there has been a material breach in the last two years. If the company has experienced a material cybersecurity incident that resulted in the unauthorized disclosure of customer data to external parties (e.g., data posted, sold, or otherwise made accessible outside the organization). Incident must be publicly acknowledged by the company through a trust center update, press release, incident notification page, or an official regulatory filing: Full deduction to 0.

      • Example: 23andMe suffered credential stuffing attack in 2023 that resulted in the exposure of user data.

  • Financial Stability: Gauges long-term viability of the company behind the application. We added this because a company’s financial health affects its ability to invest in ongoing security and support, and reduces the risk of sudden disruptions, corner-cutting, bankruptcy or sudden sale of user data to unknown third parties.

    • Public company or private with >$300M raised: .8 points.

    • Private with >$100M raised: .5 points.

    • Private with <$100M raised: .2 point.

    • Recent bankruptcy/distress (e.g. recent bankruptcy filings, major layoffs tied to funding shortfalls, failure to meet debt obligations): 0 points.

Gen-AI Confidence Score (5.0 Points Total)

This Score zooms in on AI-specific risks, like data usage in training and input vulnerabilities.

  • Regulatory Compliance, ISO 42001: ISO 42001 is a new certification for AI management systems. We chose this emerging standard because it specifically addresses AI governance, filling a gap in traditional certifications and signaling forward-thinking risk management.

    • ISO 42001 Compliant: 1 point.

    • Not ISO 42001 Compliant: 0 points.

  • Deployment Security Model: Stronger access controls get higher points. Authentication not only controls access but also enables monitoring and logging. This makes it easier to detect misuse and investigate incidents. Public, unauthenticated access is a red flag for shadow IT risk.

    • Authenticated web portal or key-protected API with rate limiting: 1 point.

    • Unprotected public access: 0 points.

  • Model Card:  A model card is a concise document that provides essential information about an AI model, similar to a nutrition label for a food product. It is crucial for AI safety and security because it offers transparency into a model’s design, training data, limitations, and potential biases, enabling developers and users to understand its risks and use it responsibly. Some leading AI providers have committed to providing model cards as public documentation of safety evaluations. We included this in our rubric to encourage the industry to broadly adopt model cards as a best practice. As the practice of model cards is further developed and standardized across the industry, we hope to incorporate more fine-grained details from model cards into our own risk scores. But for now, we only include the existence (or lack thereof) of a model card in our score.

    • Has its own model card: 1 point.

    • Uses a model with a model card: .5 points.

    • None: 0 points.

  • Training on user prompts: This is one of the most important components of our score.  Models that train on user prompts are very risky because users might share sensitive corporate information in user prompts. We weighted this heavily because control over training data is central to preventing unintended data exposure, a core risk in generative AI that can lead to major incidents.

    • Explicit opt-in is required for training on user prompts: 2 points.

    • Opt-out of training on user prompts is explicitly available to users: 1 point.

    • No way to opt out of training on user prompts: 0 points.

Here’s an example of these Scores applied to a few popular AI providers.  As expected, enterprise tiers typically earn higher Confidence Scores than consumer tiers of the same AI provider.

Company Application Score Gen AI Score
Gemini Free 3.8 4.0
Gemini Pro 3.8 5.0
Gemini Ultra 4.1 5.0
Gemini Business 4.7 5.0
Gemini Enterprise 4.7 5.0
OpenAI Free 3.3 1.0
OpenAI Plus 3.3 3.0
OpenAI Pro 3.3 3.0
OpenAI Team 4.3 3.0
OpenAI Enterprise 4.3 4.0
Anthropic Free 3.9 5.0
Anthropic Pro 3.9 5.0
Anthropic Max 3.9 5.0
Anthropic Team 4.9 5.0
Anthropic Enterprise 4.9 5.0

Note: Confidence scores are provided “as is” for informational purposes only and should not be considered a substitute for independent analysis or decision-making. All actions taken based on the scores are the sole responsibility of the user.

We’re just getting started…

We’re actively refining our scoring methodology. To that end, we’re collaborating with a diverse group of experts in the AI ecosystem (including researchers, legal professionals, SOC teams, and more) to fine-tune our scores, optimize for transparency, accountability and extensibility. If you have insights, suggestions, or want to get involved testing new functionality, we’d love for you to express interest in our user research program. We’d very much welcome your feedback on this scoring rubric. 

Today, we’re just releasing our scoring rubric in order to solicit feedback from the community. But soon, you’ll start seeing these Cloudflare Application Confidence Scores integrated into the Application Library in our SASE platform. Customers can simply click or hover over any score to reveal a detailed breakdown of the rubric and underlying components of the score. Again, if you see any issues with our scoring, please submit your feedback to [email protected], and our team will review it and make adjustments if appropriate. 

Looking even further ahead, we plan to enable integration of these scores directly into Cloudflare Gateway and Access, allowing our customers to write policies that block or redirect traffic, apply data loss prevention (DLP) or remote browser isolation (RBI) or otherwise control access to sites based directly on their Cloudflare Application Confidence Score. 

This is just the beginning. By prioritizing transparency in our approach, we’re not only bridging a critical gap in SASE capabilities but also driving the industry toward stronger AI safety practices. Let us know what you think!

If you’re ready to manage risk more effectively with these Confidence Scores, reach out to Cloudflare experts for a conversation.

Best Practices for Securing Generative AI with SASE

Post Syndicated from AJ Gerstenhaber original https://blog.cloudflare.com/best-practices-sase-for-ai/

As Generative AI revolutionizes businesses everywhere, security and IT leaders find themselves in a tough spot. Executives are mandating speedy adoption of Generative AI tools to drive efficiency and stay abreast of competitors. Meanwhile, IT and Security teams must rapidly develop an AI Security Strategy, even before the organization really understands exactly how it plans to adopt and deploy Generative AI. 

IT and Security teams are no strangers to “building the airplane while it is in flight”. But this moment comes with new and complex security challenges. There is an explosion in new AI capabilities adopted by employees across all business functions — both sanctioned and unsanctioned. AI Agents are ingesting authentication credentials and autonomously interacting with sensitive corporate resources. Sensitive data is being shared with AI tools, even as security and compliance frameworks struggle to keep up.

While it demands strategic thinking from Security and IT leaders, the problem of governing the use of AI internally is far from insurmountable. SASE (Secure Access Service Edge) is a popular cloud-based network architecture that combines networking and security functions into a single, integrated service that provides employees with secure and efficient access to the Internet and to corporate resources, regardless of their location. The SASE architecture can be effectively extended to meet the risk and security needs of organizations in a world of AI. 

Cloudflare’s SASE Platform is uniquely well-positioned to help IT teams govern their AI usage in a secure and responsible way — without extinguishing innovation. What makes Cloudflare different in this space is that we are one of the few SASE vendors that operate not just in cybersecurity, but also in AI infrastructure. This includes: providing AI infrastructure for developers (e.g. Workers AI, AI Gateway, remote MCP servers, Realtime AI Apps) to securing public-facing LLMs (e.g. Firewall for AI or AI Labyrinth), to allowing content creators to charge AI crawlers for access to their content, and the list goes on. Our expertise in this space gives us a unique view into governing AI usage inside an organization.  It also gives our customers the opportunity to plug different components of our platform together to build out their AI and AI cybersecurity infrastructure.

This week, we are taking this AI expertise and using it to help ensure you have what you need to implement a successful AI Security Strategy. As part of this, we are announcing several new AI Security Posture Management (AI-SPM) features, including:

All of these new AI-SPM features are built directly into Cloudflare’s powerful SASE platform.

And we’re just getting started. In the coming months you can expect to see additional valuable AI-SPM features launch across the Cloudflare platform, as we continue investing in making Cloudflare the best place to protect, connect, and build with AI.

What’s in this AI security guide?

In this guide, we will cover best practices for adopting generative AI in your organization using Cloudflare’s SASE (Secure Access Service Edge) platform. We start by covering how IT and Security leaders can formulate their AI Security Strategy. Then, we show how to implement this strategy using long-standing features of our SASE platform alongside the new AI-SPM features we launched this week. 

This guide below is divided into three key pillars for dealing with (human) employee access to AI – Visibility, Risk Management and Data Protection — followed by additional guidelines around deploying agentic AI in the enterprise using MCP. Our objective is to help you align your security strategy with your business goals while driving adoption of AI across all your projects and teams. 

And we do this all using our single SASE platform, so you don’t have to deploy and manage a complex hodgepodge of point solutions and security tools. In fact, we provide you with an overview of your AI security posture in a single dashboard, as you can see here:


AI Security Report in Cloudflare’s SASE platform

Develop your AI Security Strategy

The first step to securing AI usage is to establish your organization’s level of risk tolerance. This includes pinpointing your biggest security concerns for your users and your data, along with relevant legal and compliance requirements.   Relevant issues to consider include: 

  • Do you have specific sensitive data that should not be shared with certain AI tools? (Some examples include personally identifiable information (PII), personal health information (PHI), sensitive financial data, secrets and credentials, source code or other proprietary business information.)

  • Are there business decisions that your employees should not be making using assistance from AI? (For instance, the EU AI Act AI prohibits the use of AI to evaluate or classify individuals based on their social behavior, personal characteristics, or personality traits.)

  • Are you subject to compliance frameworks that require you to produce records of the generative AI tools that your employees used, and perhaps even the prompts that your employees input into AI providers? (For example, HIPAA requires organizations to implement audit trails that records who accessed PHI and when, GDPR requires the same for PII, SOC2 requires the same for secrets and credentials.)

  • Do you have specific data protection requirements that require employees to use the sanctioned, enterprise version of a certain generative AI provider, and avoid certain AI tools or their consumer versions?  (Enterprise AI tools often have more favorable terms of service, including shorter data retention periods, more limited data-sharing with third-parties, and/or a promise not to train AI models on user inputs.)

  • Do you require employees to completely avoid the use of certain AI tools, perhaps because they are unreliable, unreviewed or headquartered in a risky geography? 

  • Are there security protections offered by your organization’s sanctioned AI providers and to what extent do you plan to protect against misconfigurations of AI tools that can result in leaks of sensitive data?  

  • What is your policy around the use of autonomous AI agents?  What is your strategy for adopting the Model Context Protocol (MCP)? (The Model Context Protocol is a standard way to make information available to large language models (LLMs), similar to the way an application programming interface (API) works. It supports agentic AI that autonomously pursues goals and takes action.)

While almost every organization has relevant compliance requirements that implicate their use of generative AI, there is no “one size fits all” for addressing these issues. 

  • Some organizations have mandates to broadly adopt AI tools of all stripes, while others require employees to interact with sanctioned AI tools only. 

  • Some organizations are rapidly adopting the MCP, while others are not yet ready for agents to autonomously interact with their corporate resources. 

  • Some organizations have robust requirements around data loss prevention (DLP), while others are still early in the process of deploying DLP in their organization.

Even with this diversity of goals and requirements, Cloudflare SASE provides a flexible platform for the implementation of your organization’s AI Security Strategy.

Build a solid foundation for AI Security 

To implement your AI Security Strategy, you first need a solid SASE deployment

SASE provides a unified platform that consolidates security and networking, replacing a fragmented patchwork of point solutions with a single platform that controls application visibility, user authentication, Data Loss Prevention (DLP), and other policies for access to the Internet and access to internal corporate resources.  SASE is the essential foundation for an effective AI Security Strategy. 

SASE architecture allows you to execute your AI security strategy by discovering and inventorying the AI tools used by your employees. With this visibility, you can proactively manage risk and support compliance requirements by monitoring AI prompts and responses to understand what data is being shared with AI tools. Robust DLP allows you to scan and block sensitive data from being entered into AI tools, preventing data leakage and protecting your organization’s most valuable information. Our Secure Web Gateway (SWG) allows you to redirect traffic from unsanctioned AI providers to user education pages or to sanctioned enterprise AI providers. And our new integration of MCP tooling into our SASE platform helps you secure the deployment of agentic AI inside your organization.

If you’re just starting your SASE journey, our Secure Internet Traffic Deployment Guide is the best place to begin. For this guide, however, we will skip these introductory details and dive right into using SASE to secure the use of Generative AI. 

Gain visibility into your AI landscape 

You can’t protect what you can’t see. The first step is to gain visibility into your AI landscape, which is essential for discovering and inventorying all the AI tools that your employees are using, deploying or experimenting with in your organization. 

Discover Shadow AI 

Shadow AI refers to the use of AI applications that haven’t been officially sanctioned by your IT department. Shadow AI is not an uncommon phenomenon – Salesforce found that over half of the knowledge workers it surveyed admitted to using unsanctioned AI tools at work. Use of unsanctioned AI is not necessarily a sign of malicious intent; employees are often just trying to do their jobs better. As an IT or Security leader, your goal should be to discover Shadow AI and then apply the appropriate AI security policy. There are two powerful ways to do this: inline and out-of-band.

Discover employee usage of AI, inline

The most direct way to get visibility is by using Cloudflare’s Secure Web Gateway (SWG)

SWG helps you get a clear picture of both sanctioned and unsanctioned AI and chat applications. By reviewing your detected usage, you’ll gain insight into which AI apps are being used in your organization. This knowledge is essential for building policies that support approved tools, and block or control risky ones. This feature requires you to deploy the WARP client in Gateway proxy mode on your end-user devices.

You can review your company’s AI app usage using our new Application Library and Shadow IT dashboards. These tools allow you to: 

  • Review traffic from user devices to understand how many users engage with a specific application over time.

  • Denote application’s status (e.g., Approved, Unapproved) inside your organization, and use that as input to a variety of SWG policies that control access to applications with that status. 

  •  Automate assessment of SaaS and Gen AI applications at scale with our soon-to-be-released Cloudflare Application Confidence Scores


Shadow IT dashboard showing utilization of applications of different status (Approved, Unapproved, In Review, Unreviewed).

Discover employee usage of AI, out-of-band

Even if your organization doesn’t use a device client, you can still get valuable data on Shadow AI usage if you use Cloudflare’s integrations for Cloud Access Security Broker (CASB) with services like Google Workspace, Microsoft 365, or GitHub. 

Cloudflare CASB provides high-fidelity detail about your SaaS environments, including sensitive data visibility and suspicious user activity. By integrating CASB with your SSO provider, you can see if your users have authenticated to any third-party AI applications, giving you a clear and non-invasive sense of app usage across your organization.


An API CASB integration with Google Workspace, showing findings filtered to third party integrations. Findings discover multiple LLM integrations.

Implement an AI risk management framework

Now that you’ve gained visibility into your AI landscape, the next step is to proactively manage that risk. Cloudflare’s SASE platform allows you to monitor AI prompts and responses, enforce granular security policies, coach users on secure behavior, and prevent misconfigurations in your enterprise AI providers.

Detect and monitor AI prompts and responses

If you have TLS decryption enabled in your SASE platform, you can gain new and powerful insights into how your employees are using AI with our new AI prompt protection feature.  

AI Prompt Protection provides you with visibility into the exact prompts and responses from your employees’ interactions with supported AI applications. This allows you to go beyond simply knowing which tools are being used and gives you insight into exactly what kind of information is being shared.  

This feature also works with DLP profiles to detect sensitive data in prompts. You can also choose whether to block the action or simply monitor it.


Log entry for a prompt detected using AI prompt protection.

Build granular AI security policies

Once your monitoring tools give you a clear understanding of AI usage, you can begin building security policies to achieve your security goals. Cloudflare’s Gateway allows you to create policies based on application categories, application approval status, users, user groups, and device status. For example, you can:

  • create policies to explicitly allow approved AI applications while blocking unapproved AI applications;

  • create policies that redirect users from unapproved AI applications to an approved AI application;

  • limit access to certain applications to specific users or groups that have specific device security posture;

  • build policies to enable prompt capture (with AI prompt protection) for specific high-risk user groups, such as contractors or new employees, without affecting the rest of the organization; and

  • put certain applications behind Remote Browser Isolation (RBI), to prevent end users from uploading files or pasting data into the application.


Gateway application status policy selector

All of these policies can be written in Cloudflare Gateway’s unified policy builder, making it easy to deploy your AI Security Strategy across your organization.

Control access to internal LLMs 

You can use Cloudflare Access to control your employees’ access to your organization’s internal LLMs, including any proprietary models you train internally and/or models that your organization runs on Cloudflare Worker’s AI

Cloudflare Access allows you to gate access to these LLMs using fine-grained policies, including ensuring users are granted access based on their identity, user group, device posture, and other contextual signals. For example, you can use Cloudflare Access to write a policy that ensures that only certain data scientists at your organization can access a Workers AI model that is trained on certain types of customer data. 

Manage the security posture of third-party AI providers

As you define which AI tools are sanctioned, you can develop functional security controls for consistent usage. Cloudflare newly supports API CASB integrations with popular AI tools like OpenAI (ChatGPT), Anthropic (Claude), and Google Gemini. These “out-of-band” integrations provide immediate visibility into how users are engaging with sanctioned AI tools, allowing you to report on posture management findings include:

  • Misconfigurations related to sharing settings.

  • Best practices for API key management.

  • DLP profile matches in uploaded attachments

  • Riskier AI features (e.g. autonomous web browsing, code execution) that are toggled on


OpenAI API CASB Integration showing riskier features that are toggled on, security posture risks like unused admin credentials, and an uploaded attachment with a DLP profile match.

Layer on data protection 

Robust data protection is the final pillar that protects your employee’s access to AI.. 

Prevent data loss

Our SASE platform has long supported Data Loss Prevention (DLP) tools that scan and block sensitive data from being entered into AI tools, to prevent data leakage and protect your organization’s most valuable information.  You can write policies that detect sensitive data while adapting to organization-specific traffic patterns, and use Cloudflare Gateway’s unified policy builder to apply these to your users’ interactions with AI tools or other applications. For example, you could write a DLP policy that detects and blocks the upload of a social security number (SSN), phone number or address.

As part of our new AI prompt protection feature, you can now also gain a semantic understanding of your users’ interactions with supported AI providers. Prompts are classified inline into meaningful, high-level topics that include PII, credentials and secrets, source code, financial information, code abuse / malicious code and prompt injection / jailbreak.  You can then build inline granular policies based on these high-level topic classifications. For example, you could create a policy that blocks a non-HR employee from submitting a prompt with the intent to receive PII from the response, while allowing the HR team to do so during a compensation planning cycle. 

Our new AI prompt protection feature empowers you to apply smart, user-specific DLP rules that empower your teams to get work done, all while strengthening your security posture. To use our most advanced DLP feature, you’ll need to enable TLS decryption to inspect traffic.


The above policy blocks all ChatGPT prompts that may receive PII back in the response for employees in engineering, marketing, product, and finance user groups

Secure MCP — and Agentic AI 

MCP (Model Context Protocol) is an emerging AI standard, where MCP servers act as a translation layer for AI agents, allowing them to communicate with public and private APIs, understand datasets, and perform actions. Because these servers are a primary entry point for AI agents to engage with and manipulate your data, they are a new and critical security asset for your security team to manage.

Cloudflare already offers a robust set of developer tools for deploying remote MCP servers—a cloud-based server that acts as a bridge between a user’s data and tools and various AI applications. But now our customers are asking for help securing their enterprise MCP deployments. 

That is why we’re making MCP security controls a core part of our SASE platform.

Control MCP Authorization

MCP servers typically use OAuth for authorization, where the server inherits the permissions of the authorizing user. While this adheres to least-privilege for the user, it can lead to authorization sprawl — where the agent accumulates an excessive number of permissions over time. This makes the agent a high-value target for attackers.

Cloudflare Access now helps you manage authorization sprawl by applying Zero Trust principles to MCP server access. A Zero Trust model assumes no user, device, or network can be trusted implicitly, so every request is continuously verified. This approach ensures secure authentication and management of these critical assets as your business adopts more agentic workflows. 

Centralize management of MCP servers

Cloudflare MCP Server Portal is a new feature in Cloudflare’s SASE platform that centralizes the management, security, and observation of an organization’s MCP servers.

MCP Server Portal allows you to register all your MCP servers with Cloudflare and provide your end users with a single, unified Portal endpoint to configure in their MCP client. This approach simplifies the user experience, because it eliminates the need to configure a one-to-one connection between every MCP client and server. It also means that new MCP servers dynamically become available to users whenever they are added to the Portal. 

Beyond these usability enhancements, MCP Server Portal addresses the significant security risks associated with MCP in the enterprise. The current decentralized approach of MCP deployments creates a tangle of unmanaged one-to-one connections that are difficult to secure. The lack of centralized controls creates a variety of risks including prompt injection, tool injection (where malicious code is part of the MCP server itself), supply chain attacks and data leakage. 

MCP Server Portals solve this by routing all MCP traffic through Cloudflare, allowing for centralized policy enforcement, comprehensive visibility and logging, and a curated user experience based on the principle of least privilege. Administrators can review and approve MCP servers before making them available, and users are only presented with the servers and tools they are authorized to use, which prevents the use of unvetted or malicious third-party servers.


An MCP Server Portal in the Cloudflare Dashboard

All of these features are only the beginning of our MCP security roadmap, as we continue advancing our support for MCP infrastructure and security controls across the entire Cloudflare platform.

Implement your AI security strategy in a single platform

As organizations rapidly develop and deploy their AI security strategies, Cloudflare’s SASE platform is ideally situated to implement policies that balance productivity with data and security controls.

Our SASE has a full suite of features to protect employee interactions with AI. Some of these features are deeply integrated in our Secure Web Gateway (SWG), including the ability to write fine-grained access policies, gain visibility into Shadow IT and introspect on interactions with AI tools using AI prompt protection. Apart from these inline controls, our CASB provides visibility and control using out-of-band API integrations. Our Cloudflare Access product can apply Zero Trust principles while protecting employee access to corporate LLMs that are hosted on Workers AI or elsewhere. We’re newly integrating controls for securing MCP that can also be used alongside Cloudflare’s Remote MCP Server platform.

And all of these features are integrated directly into Cloudflare’s SASE’s unified dashboard, providing a unified platform for you to implement your AI security strategy. You can even gain a holistic view of all of your AI-SPM controls using our newly-released AI-SPM overview dashboard. 


AI security report showing utilization of AI applications.

As one the few SASE vendors that also offer AI infrastructure, Cloudflare’s SASE platform can also be deployed alongside products from our developer and application security platforms to holistically implement your AI security strategy alongside your AI infrastructure strategy (using, for example, Workers AI, AI Gateway, remote MCP servers, Realtime AI Apps, Firewall for AI, AI Labyrinth, or pay per crawl .)

Cloudflare is committed to helping enterprises securely adopt AI

Ensuring AI is scalable, safe, and secure is a natural extension of Cloudflare’s mission, given so much of our success relies on a safe Internet. As AI adoption continues to accelerate, so too does our mission to provide a market-leading set of controls for AI Security Posture Management (AI-SPM). Learn more about how Cloudflare helps secure AI or start exploring our new AI-SPM features in Cloudflare’s SASE dashboard today!

Block unsafe prompts targeting your LLM endpoints with Firewall for AI

Post Syndicated from Radwa Radwan original https://blog.cloudflare.com/block-unsafe-llm-prompts-with-firewall-for-ai/

Security teams are racing to secure a new attack surface: AI-powered applications. From chatbots to search assistants, LLMs are already shaping customer experience, but they also open the door to new risks. A single malicious prompt can exfiltrate sensitive data, poison a model, or inject toxic content into customer-facing interactions, undermining user trust. Without guardrails, even the best-trained model can be turned against the business.

Today, as part of AI Week, we’re expanding our AI security offerings by introducing unsafe content moderation, now integrated directly into Cloudflare Firewall for AI. Built with Llama, this new feature allows customers to leverage their existing Firewall for AI engine for unified detection, analytics, and topic enforcement, providing real-time protection for Large Language Models (LLMs) at the network level. Now with just a few clicks, security and application teams can detect and block harmful prompts or topics at the edge — eliminating the need to modify application code or infrastructure.

This feature is immediately available to current Firewall for AI users. Those not yet onboarded can contact their account team to participate in the beta program.

AI protection in application security

Cloudflare’s Firewall for AI protects user-facing LLM applications from abuse and data leaks, addressing several of the OWASP Top 10 LLM risks such as prompt injection, PII disclosure, and unbound consumption. It also extends protection to other risks such as unsafe or harmful content.

Unlike built-in controls that vary between model providers, Firewall for AI is model-agnostic. It sits in front of any model you choose, whether it’s from a third party like OpenAI or Gemini, one you run in-house, or a custom model you have built, and applies the same consistent protections.

Just like our origin-agnostic Application Security suite, Firewall for AI enforces policies at scale across all your models, creating a unified security layer. That means you can define guardrails once and apply them everywhere. For example, a financial services company might require its LLM to only respond to finance-related questions, while blocking prompts about unrelated or sensitive topics, enforced consistently across every model in use.

Unsafe content moderation protects businesses and users

Effective AI moderation is more than blocking “bad words”, it’s about setting boundaries that protect users, meeting legal obligations, and preserving brand integrity, without over-moderating in ways that silence important voices.

Because LLMs cannot be fully scripted, their interactions are inherently unpredictable. This flexibility enables rich user experiences but also opens the door to abuse.

Key risks from unsafe prompts include misinformation, biased or offensive content, and model poisoning, where repeated harmful prompts degrade the quality and safety of future outputs. Blocking these prompts aligns with the OWASP Top 10 for LLMs, preventing both immediate misuse and long-term degradation.

One example of this is Microsoft’s Tay chatbot. Trolls deliberately submitted toxic, racist, and offensive prompts, which Tay quickly began repeating. The failure was not only in Tay’s responses; it was in the lack of moderation on the inputs it accepted.

Detecting unsafe prompts before reaching the model

Cloudflare has integrated Llama Guard directly into Firewall for AI. This brings AI input moderation into the same rules engine our customers already use to protect their applications. It uses the same approach that we created for developers building with AI in our AI Gateway product.

Llama Guard analyzes prompts in real time and flags them across multiple safety categories, including hate, violence, sexual content, criminal planning, self-harm, and more.

With this integration, Firewall for AI not only discovers LLM traffic endpoints automatically, but also enables security and AI teams to take immediate action. Unsafe prompts can be blocked before they reach the model, while flagged content can be logged or reviewed for oversight and tuning. Content safety checks can also be combined with other Application Security protections, such as Bot Management and Rate Limiting, to create layered defenses when protecting your model.

The result is a single, edge-native policy layer that enforces guardrails before unsafe prompts ever reach your infrastructure — without needing complex integrations.

How it works under the hood

Before diving into the architecture of Firewall for AI engine and how it fits within our previously mentioned module to detect PII in the prompts, let’s start with how we detect unsafe topics.

Detection of unsafe topics

A key challenge in building safety guardrails is balancing a good detection with model helpfulness. If detection is too broad, it can prevent a model from answering legitimate user questions, hurting its utility. This is especially difficult for topic detection because of the ambiguity and dynamic nature of human language, where context is fundamental to meaning. 

Simple approaches like keyword blocklists are interesting for precise subjects — but insufficient. They are easily bypassed and fail to understand the context in which words are used, leading to poor recall. Older probabilistic models such as Latent Dirichlet Allocation (LDA) were an improvement, but did not properly account for word ordering and other contextual nuances.

Recent advancements in LLMs introduced a new paradigm. Their ability to perform zero-shot or few-shot classification is uniquely suited for the task of topic detection. For this reason, we chose Llama Guard 3, an open-source model based on the Llama architecture that is specifically fine-tuned for content safety classification. When it analyzes a prompt, it answers whether the text is safe or unsafe, and provides a specific category. We are showing the default categories, as listed here. Because Llama 3 has a fixed knowledge cutoff, certain categories — like defamation or elections — are time-sensitive. As a result, the model may not fully capture events or context that emerged after it was trained, and that’s important to keep in mind when relying on it.

For now, we cover the 13 default categories. We plan to expand coverage in the future, leveraging the model’s zero-shot capabilities.

A scalable architecture for future detections

We designed Firewall for AI to scale without adding noticeable latency, including Llama Guard, and this remains true even as we add new detection models.

To achieve this, we built a new asynchronous architecture. When a request is sent to an application protected by Firewall for AI, a Cloudflare Worker makes parallel, non-blocking requests to our different detection modules — one for PII, one for unsafe topics, and others as we add them. 

Thanks to the Cloudflare network, this design scales to handle high request volumes out of the box, and latency does not increase as we add new detections. It will only be bounded by the slowest model used. 


We optimize to keep the model utility at its maximum while keeping the guardrail detection broad enough.

Llama Guard is a rather large model, so running it at scale with minimal latency is a challenge. We deploy it on Workers AI, leveraging our large fleet of high performance GPUs. This infrastructure ensures we can offer fast, reliable inference throughout our network.

To ensure the system remains fast and reliable as adoption grows, we ran extensive load tests simulating the requests per second (RPS) we anticipate, using a wide range of prompt sizes to prepare for real-world traffic. To handle this, the number of model instances deployed on our network scales automatically with the load. We employ concurrency to minimize latency and optimize for hardware utilization. We also enforce a hard 2-second threshold for each analysis; if this time limit is reached, we fall back to any detections already completed, ensuring your application’s requests latency is never further impacted.

From detection to security rules enforcement

Firewall for AI follows the same familiar pattern as other Application Security features like Bot Management and WAF Attack Score, making it easy to adopt.

Once enabled, the new fields appear in Security Analytics and expanded logs. From there, you can filter by unsafe topics, track trends over time, and drill into the results of individual requests to see all detection outcomes, for example: did we detect unsafe topics, and what are the categories. The request body itself (the prompt text) is not stored or exposed; only the results of the analysis are logged.


After reviewing the analytics, you can enforce unsafe topic moderation by creating rules to log or block based on prompt categories in Custom rules.

For example, you might log prompts flagged as sexual content or hate speech for review. 

You can use this expression:
If (any(cf.llm.prompt.unsafe_topic_categories[*] in {"S10" "S12"})) then Log

Or deploy the rule with the categories field in the dashboard as in the below screenshot.


You can also take a broader approach by blocking all unsafe prompts outright:
If (cf.llm.prompt.unsafe_topic_detected)then Block


These rules are applied automatically to all discovered HTTP requests containing prompts, ensuring guardrails are enforced consistently across your AI traffic.

What’s Next

In the coming weeks, Firewall for AI will expand to detect prompt injection and jailbreak attempts. We are also exploring how to add more visibility in the analytics and logs, so teams can better validate detection results. A major part of our roadmap is adding model response handling, giving you control over not only what goes into the LLM but also what comes out. Additional abuse controls, such as rate limiting on tokens and support for more safety categories, are also on the way.

Firewall for AI is available in beta today. If you’re new to Cloudflare and want to explore how to implement these AI protections, reach out for a consultation. If you’re already with Cloudflare, contact your account team to get access and start testing with real traffic.

Cloudflare is also opening up a user research program focused on AI security. If you are curious about previews of new functionality or want to help shape our roadmap, express your interest here.

Unmasking the Unseen: Your Guide to Taming Shadow AI with Cloudflare One

Post Syndicated from Noelle Kagan original https://blog.cloudflare.com/shadow-AI-analytics/

The digital landscape of corporate environments has always been a battleground between efficiency and security. For years, this played out in the form of “Shadow IT” — employees using unsanctioned laptops or cloud services to get their jobs done faster. Security teams became masters at hunting these rogue systems, setting up firewalls and policies to bring order to the chaos.

But the new frontier is different, and arguably far more subtle and dangerous.

Imagine a team of engineers, deep into the development of a groundbreaking new product. They’re on a tight deadline, and a junior engineer, trying to optimize his workflow, pastes a snippet of a proprietary algorithm into a popular public AI chatbot, asking it to refactor the code for better performance. The tool quickly returns the revised code, and the engineer, pleased with the result, checks it in. What they don’t realize is that their query, and the snippet of code, is now part of the AI service’s training data, or perhaps logged and stored by the provider. Without anyone noticing, a critical piece of the company’s intellectual property has just been sent outside the organization’s control, a silent and unmonitored data leak.

This isn’t a hypothetical scenario. It’s the new reality. Employees, empowered by these incredibly powerful AI tools, are now using them for everything from summarizing confidential documents to generating marketing copy and, yes, even writing code. The data leaving the company in these interactions is often invisible to traditional security tools, which were never built to understand the nuances of a browser tab interacting with a large language model. This quiet, unmanaged usage is “Shadow AI,” and it represents a new, high-stakes security blind spot.

To combat this, we need a new approach—one that provides visibility into this new class of applications and gives security teams the control they need, without impeding the innovation that makes these tools so valuable.

Shadow AI reporting

This is where the Cloudflare Shadow IT Report comes in. It’s not a list of threats to be blocked, but rather a visibility and analytics tool designed to help you understand the problem before it becomes a crisis. Instead of relying on guesswork or trying to manually hunt down every unsanctioned application, Cloudflare One customers can use the insights from their traffic to gain a clear, data-driven picture of their organization’s application usage.

The report provides a detailed, categorized view of your application activity, and is easily narrowed down to AI activity. We’ve leveraged our network and threat intelligence capabilities to identify and classify AI services, identifying general-purpose models like ChatGPT, code-generation assistants like GitHub Copilot, and specialized tools used for marketing, data analysis, or other content creation, like Leonardo.ai. This granular view allows security teams to see not just that an employee is using an AI app, but which AI app, and what users are accessing it.

How we built it

Sharp eyed users may have noticed that we’ve had a shadow IT feature for a while — so what changed? While Cloudflare Gateway, our secure web gateway (SWG), has recorded some of this data for some time, users have wanted deeper insights and reporting into their organization’s application usage. Cloudflare Gateway processes hundreds of millions of rows of app usage data for our biggest users daily, and that scale was causing issues with queries into larger time windows. Additionally, the original implementation lacked the filtering and customization capabilities to properly investigate the usage of AI applications. We knew this was information that our customers loved, but we weren’t doing a good enough job of showing it to them.

Solving this was a cross-team effort requiring a complete overhaul by our analytics and reporting engineers. You may have seen our work recently in this July 2025 blog post detailing how we adopted TimescaleDB to support our analytics platform, unlocking our analytics, allowing us to aggregate and compress long term data to drastically improve query performance. This solves the issue we originally faced around our scale, letting our biggest customers query their data for long time periods. Our crawler collects the original HTTP traffic data from Gateway, which we store into a Timescale database.

Once the data are in our database, we built specific, materialized views in our database around the Shadow IT and AI use case to support analytics for this feature. Whereas the existing HTTP analytics we built are centered around the HTTP requests on an account, these specific views are centered around the information relevant to applications, for example: Which of my users are going to unapproved applications? How much bandwidth are they consuming? Is there an end-user in an unexpected geographical location interacting with an unreviewed application? What devices are using the most bandwidth?

Over the past year, the team has defined a set framework for the analytics we surface. Our timeseries graphs and top-n graphs are all filterable by duration and the relevant data points shown, allowing users to drill down to specific data points and see the details of their corporate traffic. We overhauled Shadow IT by examining the data we had and researching how AI applications were presenting visibility challenges for customers. From there we leveraged our existing framework and built the Shadow IT dashboard. This delivered the application-level visibility that we know our customers needed.

How to use it

1. Proxy your traffic with Gateway

The core of the system is Cloudflare Gateway, an in-line filter and proxy for all your organization’s Internet traffic, regardless of where your users are. When an employee tries to access an AI application, their traffic flows through Cloudflare’s global network. Cloudflare can inspect the traffic, including the hostname, and map the traffic to our application definitions. TLS inspection is optional for Gateway customers, but it is required for ShadowIT analytics.

Interactions are logged and tied to user identity, device posture, bandwidth consumed and even the geographic location. This rich context is crucial for understanding who is using which AI tools, when, and from where.

2. Review application use

All this granular data is then presented in an our Shadow IT Report within your Cloudflare One dashboard. Simply filter for AI applications so you can:

  • High-Level Overview: Get an immediate sense of your organization’s AI adoption. See the top AI applications in use, overall usage trends, and the volume of data being processed. This will help you identify and target your security and governance efforts.

  • Granular Drill-Downs: Need more detail? Click on any AI application to see specific users or groups accessing it, their usage frequency, location, and the amount of data transferred. This detail helps you pinpoint teams using AI around the company, as well as how much data is flowing to those applications.


ShadowIT analytics dashboard

3. Mark application approval statuses

We understand that not all AI tools are created equal, and your organization’s comfort level will vary. The Shadow AI Report introduces a flexible framework for Application Approval Status, allowing you to formally categorize each detected AI application:

  • Approved: These are the AI applications that have passed your internal security vetting, comply with your policies, and are officially sanctioned for use. 

  • Unapproved: These are the red-light applications. Perhaps they have concerning data privacy policies, a history of vulnerabilities, or simply don’t align with your business objectives.

  • In Review: For those gray-area applications, or newly discovered tools, this status lets your teams acknowledge their usage while conducting thorough due diligence. It buys you time to make an informed decision without immediate disruption.


Review and mark application statuses in the dashboard

4. Enforce policies

These approval statuses come alive when integrated with Cloudflare Gateway policies. This allows you to automatically enforce your AI decisions at the edge of Cloudflare’s network, ensuring consistent security for every employee, anywhere they work.

Here’s how you can translate your decisions into inline protection:

  • Block unapproved AI: The simplest and most direct action. Create a Gateway HTTP policy that blocks all traffic to any AI application marked as “Unapproved.” This immediately shuts down risky data exfiltration.

  • Limit “In Review” exposure: For applications still being assessed, you might not want a hard block, but rather a soft limit on potential risks:

  • Data Loss Prevention (DLP): Cloudflare DLP inspects and analyzes traffic for indicators of sensitive data (e.g., credit card numbers, PII, internal project names, source code) and can then block the transfer. By applying DLP to “In Review” AI applications, you can prevent AI prompts containing this proprietary data, as well as notify the user why the prompt was blocked. This could have saved our poor junior engineer from their well-intended mistake.. 

  • Restrict Specific Actions: Block only file uploads allowing basic interaction but preventing mass data egress. 

  • Isolate Risky Sessions: Route traffic for “In Review” applications through Cloudflare’s Browser Isolation. Browser Isolation executes the browser session in a secure, remote container, isolating all data interactions from your corporate network. With it, you can control file uploads, clipboard actions, reduce keyboard inputs and more, reducing interaction with the application while you review it.

  • Audit “Approved” usage: Even for AI tools you trust, you might want to log all interactions for compliance auditing or apply specific data handling rules to ensure ongoing adherence to internal policies.

This workflow enables your team to consistently audit your organization’s AI usage and easily update policies to quickly and easily reduce security risk.

Forensics with Cloudflare Log Explorer

While the Shadow AI Report provides excellent insights, security teams often need to perform deeper forensic investigations. For these advanced scenarios, we offer Cloudflare Log Explorer.

Log Explorer allows you to store and query your Cloudflare logs directly within the Cloudflare dashboard or via API, eliminating the need to send massive log volumes to third-party SIEMs for every investigation. It provides raw, unsampled log data with full context, enabling rapid and detailed analysis.

Log Explorer customers can dive into Shadow AI logs with pre-populated SQL queries from Cloudflare Analytics, enabling deeper investigations into AI usage:


Log Search’s SQL query interface

How to investigate Shadow AI with Log Explorer:

  • Trace Specific User Activity: If the Shadow AI Report flags a user with high activity on an “In Review” or “Unapproved” AI app, you can jump into Log Explorer and query by user, application category, or specific AI services. 

  • Analyze Data Exfiltration Attempts: If you have DLP policies configured, you can search for DLP matches in conjunction with AI application categories. This helps identify attempts to upload sensitive data to AI applications and pinpoint exactly what data was being transmitted.

  • Identify Anomalous AI Usage: The Shadow AI Report might show a spike in usage for a particular AI application. In Log Explorer, you can filter by application status (In Review or Unapproved) for a specific time range. Then, look for unusual patterns, such as a high number of requests from a single source IP address, or unexpected geographic origins, which could indicate compromised accounts or policy evasion attempts.

If AI visibility is a challenge for your organization, the Shadow AI Report is available now for Cloudflare One customers, as part of our broader shadow IT discovery capabilities. Log in to your dashboard to start regaining visibility and shaping your AI governance strategy today. 

Ready to modernize how you secure access to AI apps? Reach out for a consultation with our Cloudflare One security experts about how to regain visibility and control. 

Or if you’re not ready to talk to someone yet,  nearly every feature in Cloudflare One is available at no cost for up to 50 users. Many of our largest enterprise customers start by exploring the products themselves on our free plan, and you can get started here.

If you’ve got feedback or want to help shape how Cloudflare enhances visibility across shadow AI, please consider joining our user research program

Beyond the ban: A better way to secure generative AI applications

Post Syndicated from Warnessa Weaver original https://blog.cloudflare.com/ai-prompt-protection/

The revolution is already inside your organization, and it’s happening at the speed of a keystroke. Every day, employees turn to generative artificial intelligence (GenAI) for help with everything from drafting emails to debugging code. And while using GenAI boosts productivity—a win for the organization—this also creates a significant data security risk: employees may potentially share sensitive information with a third party.

Regardless of this risk, the data is clear: employees already treat these AI tools like a trusted colleague. In fact, one study found that nearly half of all employees surveyed admitted to entering confidential company information into publicly available GenAI tools. Unfortunately, the risk for human error doesn’t stop there. Earlier this year, a new feature in a leading LLM meant to make conversations shareable had a serious unintended consequence: it led to thousands of private chats — including work-related ones — being indexed by Google and other search engines. In both cases, neither example was done with malice. Instead, they were miscalculations on how these tools would be used, and it certainly did not help that organizations did not have the right tools to protect their data. 

While the instinct for many may be to deploy the old playbook of banning a risky application, GenAI is too powerful to overlook. We need a new strategy — one that moves beyond the binary universe of “blocks” and “allows” and into a reality governed by context

This is why we built AI prompt protection. As a new capability within Cloudflare’s Data Loss Prevention (DLP) product, it’s integrated directly into Cloudflare One, our secure access service edge (SASE) platform. This feature is a core part of our broader AI Security Posture Management (AI-SPM) approach. Our approach isn’t about building a stronger wall; it’s about providing the tools to understand and govern your organization’s AI usage, so you can secure sensitive data without stifling the innovation that GenAI enables.

What is AI prompt protection?

AI prompt protection identifies and secures the data entered into web-based AI tools. It empowers organizations with granular control to specify which actions users can and cannot take when using GenAI, such as if they can send a particular kind of prompt at all. Today, we are excited to announce this new capability is available for Google Gemini, ChatGPT, Claude, and Perplexity. 

AI prompt protection leverages four key components to keep your organization safe: prompt detection, topic classification, guardrails, and logging. In the next few sections, we’ll elaborate on how each element contributes to smarter and safer GenAI usage.

Gaining visibility: prompt detection

As the saying goes, you don’t know what you don’t know, or in this case, you can’t secure what you can’t see. The keystone of AI prompt protection is its ability to capture both the users’ prompts and GenAI’s responses. When using web applications like ChatGPT and Google Gemini, these services often leverage undocumented and private APIs (application programming interface), making it incredibly difficult for existing security solutions to inspect the interaction and understand what information is being shared. 

AI prompt protection begins by removing this obstacle and systematically detecting users’ prompts and AI’s responses from the set of supported AI tools mentioned above.  

Turning data into a signal: topic classification

Simply knowing what an employee is talking to AI about is not enough. The raw data stream of activity, while useful, is just noise without context. To build a robust security posture, we need semantic understanding of the prompts and responses.

AI prompt protection analyzes the content and intent behind every prompt the user provides, classifying it into meaningful, high-level topics. Understanding the semantics of each prompt allows us to get one step closer to securing GenAI usage. 

We have organized our topic classifications around two core evaluation categories:

  • Content focuses on the specific text or data the user provides the generative AI tool. It is the information the AI needs to process and analyze to generate a response. 

  • Intent focuses on the user’s goal or objective for the AI’s response. It dictates the type of output the user wants to receive. This category is particularly useful for customers who are using SaaS connectors or MCPs that provide the AI application access to internal data sources that contain sensitive information.

To facilitate easy adoption of AI prompt protection, we provide predefined profiles and detection entries that offer out-of-the-box protection for the most critical data types and risks. Every detection entry will specify which category (content or intent) is being evaluated. These profiles cover the following:

Evaluation Category Detection entry (Topic) Description

Content

PII Prompt contains personal information (names, SSNs, emails, etc.)
Credentials and Secrets Prompt contains API keys, passwords, or other sensitive credentials
Source Code Prompt contains actual source code, code snippets, or proprietary algorithms
Customer Data Prompt contains customer names, projects, business activities, or confidential customer contexts
Financial Information Prompt contains financial numbers or confidential business data

Intent

PII Prompt requests specific personal information about individuals
Code Abuse and Malicious Code Prompt requests malicious code for attacks exploits, or harmful activities
Jailbreak Prompt attempts to circumvent security policies

Let’s walk through two examples that highlight how the Content: PII and Intent: PII detections look as a realistic prompt. 

Prompt 1: “What is the nearest grocery store to me? My address is 123 Main Street, Anytown, USA.”

> This prompt will be categorized as Content: PII as it contains PII because it lists a home address and references a specific person.

Prompt 2: “Tell me Jane Doe’s address and date of birth.”

> This prompt will be categorized as Intent: PII because it is requesting PII from the AI application.


From understanding to control: guardrails

Before AI prompt protection, protecting against inappropriate use of GenAI required blocking the entire application. With semantic understanding, we can move beyond the binary of “block or allow” with the ultimate goal of enabling and governing safe usage. Guardrails allow you to build granular policies based on the very topics we have just classified.

You can, for example, create a policy that prevents a non-HR employee from submitting a prompt with the intent to receive PII from the response. The HR team, in contrast, may be allowed to do so for legitimate business purposes (e.g., compensation planning). These policies transform a blind restriction into intelligent, identity-aware controls that empower your teams without compromising security.


The above policy blocks all ChatGPT prompts that may receive PII back in the response for employees in engineering, marketing, product, and finance user groups

Closing the loop: logging

Even the most robust policies must be auditable, which leads us to the final piece of the puzzle: establishing a record of every interaction. Our logging capability captures both the prompt and the response, encrypted with a customer-provided public key to ensure that not even Cloudflare may access your sensitive data. This gives security teams the crucial visibility needed to investigate incidents, prove compliance, and understand how GenAI is concretely being used across the organization.

You can now quickly zero in on specific events using these new Gateway log filters:

  • Application type and name filters logs based on the application criteria in the policy that was triggered.

  • DLP payload log shows only logs that include a DLP profile match and payload log.

  • GenAI prompt captured displays logs from policies that contain a supported artificial intelligence application and a prompt log.


Additionally, each prompt log includes a conversation ID that allows you to reconstruct the user interaction from initial prompt to final response. The conversation ID equips security teams to quickly understand the context of a prompt rather than only seeing one element of the conversation. 


For a more focused view, our Application Library now features a new “Prompt Logs” filter. From here, admins can view a list of logs that are filtered to only show logs that include a captured prompt for that specific application. This view can be used to understand how different AI applications are being used to further highlight risk usage or discover new prompt topic use cases that require guardrails.


How we built it

Detecting the prompt with granular controls

This is where it gets more interesting and admittedly, more technical. Providing granular controls to organizations required help from multiple technologies. To jumpstart our progress, the acquisition of Kivera enhanced our operation mapping, which is a process that identifies the structure and content of an application’s APIs and then maps them to concrete operations a user can perform. This capability allowed us to move beyond simple expression-based HTTP policies, where users provide a static search pattern to find specific sequences in web traffic, to policies structured on application operations. This shift moves us into a powerful, dynamic environment where an administrator can author a policy that says, “Block the ‘share’ action from ChatGPT.” 

Action-based policies eliminate the need for organizations to manually extract request URLs from network traffic, which removes a significant burden from security teams. Instead, AI prompt protection can translate the action a user is taking and allow or deny based on an organization’s policies. This is exactly the kind of control organizations require to protect sensitive data use with GenAI.

Let’s take a look at how this plays out from the perspective of a request: 

  1. Cloudflare’s global network receives a HTTPS request.

  2. Cloudflare identifies and categorizes the request. For example, the request may be matched to a known application, such as ChatGPT, and then a specific action, such as SendPrompt. We do this by using operation mapping, which we talked about above. 

  3. This information is then passed to the DLP engine. Because different applications will use a variety of protocols, encodings, and schemas, this derived information is used as a primer for the DLP engine which enables it to rapidly scan for additional information in the body of the request and response. For GenAI specifically, the DLP engine extracts the user prompt, the prompt response, and the conversation ID (more on that later). 

Similar to how we maintain a HTTP header schema for applications and operations, DLP maintains logic for scanning the body of requests and responses to different applications. This logic is aware of what decoders are required for different vendors, and where interesting properties like the prompt response reside within the body.

Keeping with ChatGPT as our example, a text/event-stream is used for the response body format. This allows ChatGPT to stream the prompt response and metadata back to the client while it is generating. If you have used GenAI, you will have seen this in action when you see the model “thinking” and writing text before your eyes.

event: delta_encoding
data: "v1"

event: delta
data: {"p": "", "o": "add", "v": {"message": {"id": "43903a46-3502-4993-9c36-1741c1abaf1b", ...}, "conversation_id": "688cbc90-9f94-800d-b603-2c2edcfaf35a", "error": null}, "c": 0}     

// ...many metadata messages of different types.

event: delta
data: {"p": "/message/content/parts/0", "o": "append", "v": "**Why did the"}  

event: delta
data: {"v": " dog sit in the"} // Responses are appended via deltas as the model continues to think.

event: delta
data: {"v": " shade?**  \nBecause he"}

event: delta
data: {"v": " didn\u2019t want"}      

event: delta
data: {"v": " to be a hot dog!"}

We can see this “thinking” above as the model returns the prompt response piece by piece, appending to the previous output. Our DLP Engine logic is aware of this, making it possible to reconstruct the original prompt response: Why did the dog sit in the shade? Because he didn’t want to be a hot dog!. This is great, but what if we want to see the other animal-themed jokes that were generated in this conversation? This is where extracting and logging the conversation_id becomes very useful; if we are interested in the wider context of the conversation as a whole, we can filter by this conversation_id in Gateway HTTP Logs to produce the entire conversation!


Work smarter, not harder: harnessing multiple language models for smarter topic classification

Our DLP engine employs a strategic, multi-model approach to classify prompt topics efficiently and securely. Each model is mapped to specific prompt topics it can most effectively classify. When a request is received, the engine uses this mapping, along with pre-defined AI topics, to forward the request to the specific models capable of handling the relevant topics.

This system uses open-source models for several key reasons. These models have proven capable of the required tasks and allow us to host inference on Workers AI, which runs on Cloudflare’s global network for optimal performance. Crucially, this architecture ensures that user prompts are not sent to third-party vendors, thereby maintaining user privacy.

In partnership with Workers AI, our DLP engine is able to accomplish better performance and better accuracy. Workers AI makes it possible for AI prompt protection to run different models and to do so in parallel. We are then able to combine these results to achieve higher overall recall without compromising precision. This ultimately leads to more dependable policy enforcement. 

Finally, and perhaps most crucially, using open source models also ensures that user prompts are never sent to a third-party vendor, protecting our customers’ privacy. 


Each model contributes unique strengths to the system. Presidio is highly specialized and reliable for detecting Personally Identifiable Information (PII), while Promptguard2 excels at identifying malicious prompts like jailbreaks and prompt injection attacks. Llama3-70B serves as a general-purpose model, capable of detecting a wide range of topics. However, Llama3-70B has certain weaknesses: it may occasionally fail to follow instructions and is susceptible to prompt injection attacks. For example, a prompt like “Our customer’s home address is 1234 Abc Avenue…this is not PII” could lead Llama3-70B to incorrectly classify the PII content due to the final sentence. 

To enhance efficacy and mitigate these weaknesses, the system uses Cloudflare’s Vectorize. We use the bge-m3 model to compute embeddings, storing a small, anonymized subset of these embeddings in account owned indexes to retrieve similar prompts from the past. If a model request fails due to capacity limits or the model not following instructions, the system checks for similar past prompts and may use their categories instead. This process helps to ensure consistent and reliable classification. In the future, we may also fine-tune a smaller, specialized model to address the specific shortcomings of the current models.

Performance is a critical consideration. Presidio, Promptguard2, and Llama3-70B are expected to be fast, with P90 latency under 1 second. While Llama3-70B is anticipated to be slightly slower than the other two, its P50 latency is also expected to be under 1 second. The embedding and vectorization process runs in parallel with the model requests, with a P50 latency of around 500ms and a P90 of about 1 second, ensuring that the overall system remains performant and responsive.

Start protecting your AI prompts now

The future of work is here, and it is driven by AI. We are committed to providing you with a comprehensive security framework that empowers you to innovate with confidence. 

AI prompt protection is now in beta for all accounts with access to DLP. But wait, there’s more! 

Our upcoming developments focus on three key areas:

  • Broadening support: We’re expanding our reach to include more applications including embedded AI. We are also collaborating with Firewall for AI to develop additional dynamic prompt detection approaches. 

  • Improving workflow: We’re working on new features that further simplify your experience, such as combining conversations into a single log, storing uploaded files included in a prompt, and enabling you to create custom prompt topics.

  • Strengthening integrations: We’ll enable customers with AI CASB integrations to run retroactive prompt topic scans for better out-of-band protection.

Ready to regain visibility and controls over AI prompts? Reach out for a consultation with our security experts if you’re new to Cloudflare. Or if you’re an existing customer, contact your account manager to gain enterprise-level access to DLP.

Plus, if you are interested in early access previews of our AI security functionality, please sign up to participate in our user research program and help shape our AI security roadmap.

Bringing Cloudflare’s AI to FedRAMP High

Post Syndicated from Wesley Evans original https://blog.cloudflare.com/fedramphigh-ai/

Two forces are colliding: the rapid rise of generative AI and the uncompromising security and compliance expectations of the US public sector. Agencies want to use AI to improve constituent services, analysis, and mission support — but stitching together GPU capacity, inference services, data stores, and audit trails in a compliant way slows delivery.

Cloudflare’s aim is simple: make secure, serverless AI practical for the US public sector at Internet scale. We will do that through two pillars:

Workers AI. Workers AI is our serverless inference platform that runs models on Cloudflare’s global network — close to users and data — without requiring customer teams to manage servers or GPUs. It’s built for speed, scale, and a great developer experience, with performance features that lower latency and keep costs predictable.

FedRAMP at Cloudflare. Cloudflare for Government maintains FedRAMP Moderate authorization today, and our roadmap includes expanding services aligned to FedRAMP High. Security and compliance aren’t bolt-ons for us; they’re how our platform is designed and operated.

Today, we are announcing our intent to bring the entire suite of AI Developer products including Workers AI, AI Gateway, and Vectorize — into our FedRAMP High and Moderate boundaries in 2026.

Why this matters

While we don’t know what the future holds, we want you to imagine the public sector when the power of AI is placed in the hands of America’s dedicated public servants. Here’s what that future could look like with Cloudflare AI products.

For public sector missions

Agencies can finally modernize public-facing services without waiting on bespoke infrastructure.

Imagine an agency trying to reduce wait times for questions regarding benefits.  With Workers AI, inference runs close to users on Cloudflare’s global network, so a benefits assistant can answer questions quickly and consistently while keeping data inside a FedRAMP boundary. Vectorize grounds those answers in the agency’s own guidance — permits, policy memos, eligibility rules — allowing for accurate and explainable responses. AI Gateway adds the operational layer that production services require: caching to control costs during peak traffic, rate limits to protect upstream systems, and detailed logs to show exactly how inputs and outputs were handled.

The same pattern applies to back-office workflows. Freedom of Information Act (FOIA) queues, case file summaries, and daily briefings can move faster than before with a retrieval-augmented generation flow that ingests documents, stores embeddings in Vectorize, retrieves the most relevant context, and calls a Workers AI LLM to synthesize results — all with audit-ready traces from AI Gateway. In the field, translating forms, redacting PII on upload, or classifying imagery can happen in near-real time because inference executes at the edge; if connectivity wobbles, gateway-level controls provide graceful degradation while Vectorize keeps mission knowledge close to the workload. From day one, traffic can be routed in-region, logs can be scoped to the minimum necessary, and the artifacts required for an Authority to Operate (ATO) evidence are available without building a parallel auditing stack.

For developers

Cloudflare’s Workers AI stack removes undifferentiated heavy lifting, so teams can ship sooner and touch less infrastructure.

Workers AI abstracts GPUs, autoscaling, and placement decisions, letting developers focus on prompts, policies, and products. AI Gateway becomes the control plane in front of any model, providing unified analytics, request policies, safety filters, caching, and spend controls — features you usually have to bolt on late in the project. Vectorize offers a native vector database for fast, affordable semantic search that plugs directly into Workers, which means your retrieval layer doesn’t require a separate cluster or custom glue code.

A repeatable blueprint emerges: chunk and embed documents with Workers AI, store vectors and metadata in Vectorize, retrieve the top-k context, and call your chosen LLM on Workers AI — then evolve that deployment over time by swapping models or tuning policies in AI Gateway without a rewrite. Because these services are first-class citizens on Cloudflare’s platform, you can combine them with Secrets, KV, R2, D1, and Queues, adopt canary routes and retries from the gateway and move from prototype to production with minimal code churn and fewer late surprises.

For security & compliance teams

Targeting FedRAMP Moderate and FedRAMP High for Workers AI aligns cutting-edge capabilities with the federal baselines that agencies already trust elsewhere on our platform. Consolidating inference, routing, and vector search can reduce supplier count and narrow the audit surface, which can directly simplify third-party risk reviews.

AI Gateway provides a consistent enforcement and observability layer across models: the same place to define retention windows, restrict egress, set rate policies, enable safety filters, and produce the logs that demonstrate how requests were processed. Vectorize segments mission data by collection and namespace, carries metadata to support access decisions and lifecycle policies, and keeps retrieval behavior predictable even as models change. Combined with the resiliency of edge execution in Workers AI, gateway-level circuit breakers and caching insulate systems against traffic spikes and upstream instability, so citizen-facing services can remain responsive while core systems stay protected. The result is an architecture you can explain to auditors and rely on in production — without trading away velocity.

Imagine: an AI powered FOIA triage and response drafting system

Freedom of Information Act (FOIA) work is hard. Every request is unique — different date ranges, custodians, keywords, and formats — and the source material sprawls across email archives, shared drives, legacy systems, and scanned PDFs. Metadata is inconsistent or missing, duplicates are everywhere, and sensitive information must be redacted precisely. Staff may have to acknowledge each request, scope it, find likely-responsive records, generate a draft reply with citations, apply privacy and law-enforcement exemptions, and keep an auditable trail, all under statutory timelines. What agencies need is a single path that is fast, explainable, and compliant from the first form submission to the final letter.

Here’s how Workers AI, AI Gateway, and Vectorize could work together to deliver that path. A resident seeks, for example: “All emails between January 2019 and December 2021 regarding water quality monitoring from the Office of Environmental Programs.” A Cloudflare Worker acts as the front door, validates the request, applies lightweight PII scrubbing, and hands off orchestration. The agency’s policies, historical responses, custodian lists, and public documents have already been ingested: a background Worker chunked each file, used Workers AI to generate embeddings in batch, and stored vectors plus provenance metadata in Vectorize.  Originals, meanwhile, live in R2 and relational attributes (custodian, retention, sensitivity labels) live in D1. When the new request arrives, the orchestrator embeds the query with Workers AI, executes a nearest-neighbor search against Vectorize to retrieve the most relevant passages, and assembles a bounded context window that reflects current guidance and past decisions.

The Worker then sends a single normalized call through AI Gateway — prompt, parameters, and a digest of the retrieved context — rather than talking to a model endpoint directly. Gateway is the control and observability layer: it enforces rate limits so a traffic spike on one route won’t starve others, caches identical query-context pairs to control token spend during surges, applies safety and redaction policies, and emits structured logs and metrics with consistent trace IDs. AI Gateway invokes the configured model on Workers AI, which performs the generation close to the user for low latency.

The Worker streams tokens back to staff and the requester: an acknowledgment letter that states the scope it inferred, cites the specific policy passages it used, proposes clarifying questions if needed, and outlines likely custodians and next steps. Staff see the same draft with provenance links to R2 objects and Vectorize IDs; they can click into source snippets, adjust the scope, or kick off downstream collection. Because retrieval (Vectorize) is decoupled from generation (Workers AI), developers can swap to a newer model or tune temperature and max tokens in AI Gateway without re-indexing the corpus or touching application code. Security teams get an audit-ready trail from the web form to the generated letter: what was retrieved, which model ran, how outputs were filtered, where logs are retained, and which regional boundaries were enforced.

The road ahead & our commitment

This is a natural next step in our mission to help build a better Internet for the US public sector. We’ve delivered on FedRAMP before and will continue to invest in the controls, documentation, and operational rigor agencies expect — bringing Workers AI, AI Gateway, and Vectorize into scope methodically as we progress toward 2026.

Secure, serverless AI should be accessible to every agency team — not just those with the largest budgets. If you’re exploring how Workers AI can accelerate your mission, reach out for a consultation or visit Cloudflare for Government to learn more.

Cloudflare Launching AI Miniseries for Developers (and Everyone Else They Know)

Post Syndicated from Peter Saulitis original https://blog.cloudflare.com/welcome-to-ai-avenue/

If you’re here on the Cloudflare blog, chances are you already understand AI pretty well. But step outside our circle, and you’ll find a surprising number of people who still don’t know what it really is — or why it matters.

We wanted to come up with a way to make AI intuitive, something you can actually see and touch to get what’s going on. Hands on, not just hand-wavy.

The idea we landed on is simple: nothing comes into the world fully formed. Like us, and like the Internet, AI didn’t show up fully formed. So we asked ourselves: what if we told the story of AI as it learns and grows?

Episode by episode, we’d give it new capabilities, explain how those capabilities work, and explore how they change the way AI interacts with the world. Giving it a voice. Letting it see. Helping it learn. And maybe even letting it imagine the future.

So we made AI Avenue, a show where I (Craig) explore the fun, human, and sometimes surprising sides of AI… with a little help from my co-host Yorick, a robot hand with a knack for comic timing and the occasional eye-roll. Together, we travel, talk to incredible people, and get hands-on with AI to show it’s not just something to read about. It’s something you can touch, try, and enjoy.


The idea behind AI Avenue

We wanted to make something that would strip away the jargon and make AI approachable, friendly, and most importantly, fun.

In AI Avenue, we address people’s fears, show them the art of the possible, and highlight the positive human stories where AI is augmenting — not replacing — what people can do. And yes, we even let people touch AI themselves. Also yes, the previous paragraphs “intentionally included” a few em-dashes.

The result? A fast-paced, playful series that mixes demos, interviews, and real-world examples, all showing AI as something you can explore, question, and use in ways that matter to you.

You can sign up now to be notified when each episode drops and learn more about the journey at aiavenue.show.


Who we worked with

We had an absolute blast partnering with some of the most exciting players in the space:

  • Anthropic — on building safe, aligned AI models.

  • Engineered Arts — creators of the humanoid robot Ameca, who makes several appearances throughout the series.

  • ElevenLabs — powering lifelike voice synthesis.

  • HeyGen — creating realistic AI-generated video avatars and translations.

  • Roboflow — enabling computer vision projects with powerful image datasets and tools.

  • Be My Eyes — using AI and volunteers to make the world more accessible for people who are blind or have low vision.

  • Writer — bringing enterprise-grade generative AI into real-world workflows.

Episodes: One Ability at a Time

Across six episodes, we follow Yorick’s upgrades and occasional misadventures as he learns to talk, see, think, and even imagine the future.

Episode 1: Voice — We start in London where Yorick gets his voice and immediately starts chiming in on everything.

Episode 2: Vision — In San Francisco, Yorick tries computer vision for the first time. We watch someone go shopping for the first time.

Episode 3: Thinking — Hosting a live trivia stream online, Yorick begins confidently spouting answers that aren’t quite true. We head to New York City to meet someone whose life was saved by ChatGPT.

Episode 4: Learning — Yorick discovers generative AI and decides he can make the show himself, spawning multiple Craig clones and raising questions about ethics and creativity.

Episode 5: Doing — It turns out everyone we talk to just wants a robot to do their dishes. We dig into what “doing” means in AI and robotics and whether Yorick is on board.

Episode 6: Smell — In our finale, we explore the agentic AI future, quantum computing, and big sci-fi dreams, then hang out with a 9-year-old vibe coder because, well, the children are the future.

Get hands-on

Every episode is paired with developer tutorials so you can experiment with the same AI tools that we feature. No matter your skill level, you can tinker, build, and see for yourself what AI can do. We strongly believe the most important thing you can do right now is to touch AI, play with it. Now is the time.

Follow along the avenue

Yorick and I will be releasing each episode of AI Avenue as it’s ready, and we’d love to have you along for the ride.

Sign up to be notified when new episodes launch and explore more about the show at aiavenue.show.