[$] An unstable Debian stable update

Post Syndicated from jzb original https://lwn.net/Articles/1038699/

A bug in a recent release of systemd’s network manager caused
headaches for people managing systems that have a virtual LAN (VLAN)
interface on a bridge; something one might want to do, for example,
when configuring network interfaces for virtual machines. The bug
affected several Debian users when upgrading the systemd package
from v257.7-1 to v257.8-1. The updated package is part of the Debian 13.1
release, and the bug has snared enough users to cause a minor
stir, due as much to the maintainer’s response as to
the bug itself.

Cloudflare Confidence Scorecards – making AI safer for the Internet

Post Syndicated from Ayush Kumar original https://blog.cloudflare.com/cloudflare-confidence-scorecards-making-ai-safer-for-the-internet/

Security and IT teams face an impossible balancing act: Employees are adopting AI tools every day, but each tool carries unique risks tied to compliance, data privacy, and security practices. Employees using these tools without seeking prior approval leads to a new type of Shadow IT which is referred to as Shadow AI. Preventing Shadow AI requires manually vetting each AI application to determine whether it should be approved or disapproved. This isn’t scalable. And blanket bans of AI applications will only drive AI usage deeper underground, making it harder to secure.

That’s why today we are launching Cloudflare Application Confidence Scorecards. This is part of our new suite of AI Security features within the Cloudflare One SASE platform. These scores bring scale and automation to the labor- and time-intensive task of evaluating generative AI and SaaS applications one by one. Instead of spending hours trying to find AI applications’ compliance certifications or data-handling practices, evaluators get a clear score that reflects an application’s safety and trustworthiness. With that signal, decision makers within organizations can confidently set policies or apply guardrails where needed, and block risky tools so their organizations can embrace innovation without compromising security.

Our Cloudflare Application Confidence Scorecards rate AI and SaaS applications on a number of factors, including whether the provider has achieved industry-recognized certifications, whether it follows certain data management and security measures, and how mature the company is. Meanwhile, among other considerations, our Generative AI confidence score awards higher scores to AI models that provide system cards describing testing for bias, ethics, and safety considerations, and that do not train on user inputs. We hope our emphasis on privacy, security, and safety helps drive safer and more secure AI for everyone.



Rapid increase in Shadow AI

Over the last decade, SaaS adoption has reshaped how businesses work. Employees can now pick up a new tool in minutes with nothing more than a credit card or free trial link. Now with the growth of generative AI, entire workflows are moving outside corporate oversight. From writing assistants to image generators, employees are relying on these tools daily, without knowing whether they comply with corporate or regulatory requirements. 

The risks of these tools are wide-ranging. Sensitive data can be stored or transmitted outside of company controls. Tools may lack certifications such as SOC 2 or ISO 27001. Many providers retain user data indefinitely or use it to train external models. Others face financial or operational instability that could disrupt your business if they go bankrupt or suffer a breach. Models can produce biased outputs that introduce compliance risks or lead to erroneous business decisions. Security leaders tell us they cannot keep up with auditing every new application.

We score them for you, at scale

In order to make this effective, we needed two things: a rubric that could judge AI and SaaS applications, and then a mechanism to scalably score all those applications. Here’s how we did it.

How the rubric works

The Application Posture Score (5 points) evaluates a SaaS provider across five major categories:

  • Security and Privacy Compliance (1.2 points): Credit for SOC 2 and ISO 27001 certifications, which signal operational maturity.

  • Data Management Practices (1 point): Retention windows and whether the provider shares data with third parties. Shorter retention and no sharing earn the highest marks.

  • Security Controls (1 point): Support for MFA, SSO, TLS 1.3, role-based access, and session monitoring. These are the table stakes of modern SaaS security.

  • Security Reports and Incident History (1 point): Availability of a trust or security page, bug bounty program, and incident response transparency. A recent material breach results in a full deduction.

  • Financial Stability (0.8 points): Public companies and heavily capitalized providers score highest, while startups with less funding or firms in distress score lower.

The Gen-AI Posture Score (5 points) evaluates AI-specific risks:

  • Compliance (1 point): Presence of the ISO 42001 certification for AI management systems.

  • Deployment Security Model (1 point): Whether access is authenticated and rate-limited or left publicly exposed.

  • System Card (1 point): Publication of a model or system card that documents evaluations of safety, bias, and risk.

  • Training Data Governance (2 points): Whether user data is explicitly excluded from model training or if there are available controls allowing opt-in/opt-out of training user data.

Together, these scores give a transparent view of how much confidence you can place in a provider.
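To make the arithmetic concrete, here is a minimal sketch of how the two rubrics add up. The category maximums come from the lists above; everything else is an assumption for illustration, in particular the idea that each category is awarded as a fraction between 0 and 1 and the property names used for the categories.

```typescript
// Category maximums are taken from the rubric above. The "fraction earned"
// inputs (0..1 per category) are a hypothetical simplification of partial credit.
type Weights = Record<string, number>;

const APPLICATION_POSTURE: Weights = {
  securityAndPrivacyCompliance: 1.2,
  dataManagementPractices: 1.0,
  securityControls: 1.0,
  securityReportsAndIncidentHistory: 1.0,
  financialStability: 0.8,
};

const GEN_AI_POSTURE: Weights = {
  compliance: 1.0,
  deploymentSecurityModel: 1.0,
  systemCard: 1.0,
  trainingDataGovernance: 2.0,
};

// Sum each category's maximum multiplied by the fraction earned,
// clamping fractions to [0, 1] and rounding to one decimal place.
function postureScore(weights: Weights, earned: Record<string, number>): number {
  let total = 0;
  for (const [category, max] of Object.entries(weights)) {
    const fraction = Math.min(1, Math.max(0, earned[category] ?? 0));
    total += max * fraction;
  }
  return Math.round(total * 10) / 10;
}
```

Under these assumptions, a provider that earns full marks everywhere except Security Reports and Incident History (for example, after a recent material breach) would land at 4 out of 5 on the Application Posture Score.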

How we score at scale

In the same way it’s not scalable for you to stay on top of every new AI and SaaS tool being created, our team quickly realized that we too would have the same problem. AI applications are being spun up so quickly that trying to keep pace manually would require a large team of people. 

We knew we had to build a methodology to do it automatically, so we designed infrastructure that can crawl the Internet to answer the rubric questions at scale. We built a system that scrapes public trust centers, privacy policies, security pages, and compliance documents. Large language models parse those documents to identify relevant answers, but we also hardened the process to resist hallucinations by requiring source validation and structured extraction.
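One way to picture "source validation and structured extraction" is a check that refuses any extracted answer whose supporting quote cannot be found in the page it cites. The sketch below is only an illustration of that idea, not a description of our pipeline; the `ExtractedAnswer` shape and the `isGrounded` helper are hypothetical names.

```typescript
// Hypothetical shape for one LLM-extracted rubric answer.
interface ExtractedAnswer {
  question: string; // e.g. "Is the provider SOC 2 certified?"
  answer: "yes" | "no" | "unknown";
  evidence: string; // verbatim quote the model claims supports the answer
  sourceUrl: string; // page the quote is said to come from
}

// Collapse whitespace and case so layout differences do not cause false mismatches.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Accept an answer only if its quoted evidence really appears in the fetched page.
function isGrounded(answer: ExtractedAnswer, pages: Map<string, string>): boolean {
  const page = pages.get(answer.sourceUrl);
  if (page === undefined || answer.evidence.trim() === "") return false;
  return normalize(page).includes(normalize(answer.evidence));
}
```

An answer that fails this check would be discarded or sent for review rather than scored, which is one simple way to keep a hallucinated certification out of a rubric.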


Every score produced by automation is then reviewed and audited by Cloudflare analysts before it goes live in the Application Library. This combination of automated crawling/extraction and human validation makes sure that the scores are both comprehensive and trustworthy.

We make it easy to act on it

Confidence scores are built directly into the Application Library, making them actionable from day one. When you click on a score in your Cloudflare dashboard, you will see a detailed breakdown of how the app performed across each dimension of the rubric. Scores update as vendors improve their security and compliance, giving you a live view instead of a static report.


This approach makes life easier for every stakeholder. IT and security teams can spot high-risk tools at a glance. Procurement and Governance, Risk & Compliance teams can accelerate vendor reviews, while developers and employees can make smarter choices without waiting weeks for approvals.

And it’s getting even better

Visibility is just the start. Soon, these scores will also drive enforcement across your Cloudflare One environment. You will be able to use Gateway to block or warn employees about low-scoring apps or tie DLP policies directly to confidence scores. That way untrusted AI and SaaS providers never become a backdoor for sensitive information.
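As a rough illustration of what score-driven enforcement could look like, the sketch below maps a confidence score to a block, warn, or allow decision. The thresholds and the `actionForScore` name are assumptions made up for this example; they are not Cloudflare defaults or a Gateway API.

```typescript
type Action = "allow" | "warn" | "block";

// Thresholds are illustrative assumptions, not product defaults.
interface Thresholds {
  blockBelow: number;
  warnBelow: number;
}

// Pick an action for an application based on its confidence score (0..5).
function actionForScore(
  score: number,
  thresholds: Thresholds = { blockBelow: 2, warnBelow: 3.5 },
): Action {
  if (score < thresholds.blockBelow) return "block";
  if (score < thresholds.warnBelow) return "warn";
  return "allow";
}
```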

By embedding scores into both visibility and enforcement, we are turning them into a tool for keeping your corporate environment safer.

Interested in these scores?

Cloudflare Application Confidence Scorecards are now live in the Application Library. You can explore them today in the Cloudflare dashboard, use them to evaluate the tools your teams rely on, and soon enforce policies across the Cloudflare Zero Trust platform.

This is one more step in our mission to make the Internet safer, faster, and more reliable not just for networks, but for the applications and AI tools that power modern work.

If you are a Cloudflare customer you can check out the Application Library, explore the confidence scores, and let us know what you think. And if you’re not, fear not! Application scores are freely available to all users, including those on the free plan. You can get started by simply creating a free account and seeing these scores yourself.

Finally, if you want to get involved testing new functionality or sharing insights related to AI security, we would love for you to express interest in joining our user research program.

Deploy your own AI vibe coding platform — in one click!

Post Syndicated from Ashish Kumar Singh original https://blog.cloudflare.com/deploy-your-own-ai-vibe-coding-platform/

It’s an exciting time to build applications. With the recent AI-powered “vibe coding” boom, anyone can build a website or application by simply describing what they want in a few sentences. We’re already seeing organizations expose this functionality to both their users and internal employees, empowering anyone to build out what they need.


Today, we’re excited to open-source an AI vibe coding platform, VibeSDK, to enable anyone to run an entire vibe coding platform themselves, end-to-end, with just one click.

Want to see it for yourself? Check out our demo platform that you can use to create and deploy applications. Or better yet, click the button below to deploy your own AI-powered platform, and dive into the repo to learn about how it’s built.

Deploying VibeSDK sets up everything you need to run your own AI-powered development platform:

  • Integration with LLMs to generate code, build applications, debug errors, and iterate in real time, powered by Agents SDK.

  • Isolated development environments that allow users to safely build and preview their applications in secure sandboxes.

  • Infinite scale that allows you to deploy thousands or even millions of applications that end users deploy, all served on Cloudflare’s global network.

  • Observability and caching across multiple AI providers, giving you insight into costs and performance with built-in caching for popular responses. 

  • Project templates that the LLM can use as a starting point to build common applications and speed up development.

  • One-click project export to the user’s Cloudflare account or GitHub repo, so users can take their code and continue development on their own.


Building an AI vibe coding platform from start to finish

Step 0: Get started immediately with VibeSDK

We’re seeing companies build their own AI vibe coding platforms to enable both internal and external users. With a vibe coding platform, internal teams like marketing, product, and support can build their own landing pages, prototypes, or internal tools without having to rely on the engineering team. Similarly, SaaS companies can embed this capability into their product to allow users to build their own customizations. 

Every platform has unique requirements and specializations. By building your own, you can write custom logic to prompt LLMs for your specific needs, giving your users more relevant results. This also grants you complete control over the development environment and application hosting, giving you a secure platform that keeps your data private and within your control. 

We wanted to make it easy for anyone to build this themselves, which is why we built a complete platform that comes with project templates, previews, and project deployment. Developers can repurpose the whole platform, or simply take the components they need and customize them to fit their needs.

Step 1: Finding a safe, isolated environment for running untrusted, AI generated code

AI can now build entire applications, but there’s a catch: you need somewhere safe to run this untrusted, AI-generated code. Imagine if an LLM writes an application that needs to install packages, run build commands, and start a development server — you can’t just run this directly on your infrastructure where it might affect other users or systems.

With Cloudflare Sandboxes, you don’t have to worry about this. Every user gets their own isolated environment where the AI-generated code can do anything a normal development environment can do: install npm packages, run builds, and start servers. All of it is fully contained in a secure, container-based environment that can’t affect anything outside its sandbox.


The platform assigns each user to their own sandbox based on their session, so that if a user comes back, they can continue to access the same container with their files intact:

// getSandbox is provided by Cloudflare's Sandbox SDK package
import { getSandbox } from "@cloudflare/sandbox";

// Creating a sandbox client for a user session
const sandbox = getSandbox(env.Sandbox, sandboxId);

// Now AI can safely write and execute code in this isolated environment
await sandbox.writeFile('app.js', aiGeneratedCode);
await sandbox.exec('npm install express');
await sandbox.exec('node app.js');
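For a returning user to land in the same container, the sandbox ID has to be derived from the session in a deterministic way. Here is a minimal sketch of one way to do that; the `sandboxIdForSession` helper and the choice of a 32-bit FNV-1a hash are assumptions for illustration, not how VibeSDK necessarily does it.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units, used here only to derive
// a short, stable identifier. It is not a cryptographic hash.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// The same session always maps to the same sandbox ID, so a returning
// user is routed back to the container that holds their files.
function sandboxIdForSession(sessionId: string): string {
  return `sandbox-${fnv1a(sessionId)}`;
}
```

A 32-bit hash can collide once there are many sessions, so a real platform would more likely use a longer digest or the session ID itself; the point here is only that the mapping is stable across visits.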

Step 2: Generating the code

Once the sandbox is created, you have a development environment that can bring the code to life. VibeSDK orchestrates the whole workflow, from writing the code to installing the necessary packages and starting the development server. If you ask it to build a to-do app, it will generate the React application, write the component files, run bun install to get the dependencies, and start the server, so you can see the end result.

Once the user submits their request, the AI will generate all the necessary files, whether it’s a React app, Node.js API, or full-stack application, and write them directly to the sandbox:

async function generateAndWriteCode(instanceId: string) {
    // AI generates the application structure
    const aiGeneratedFiles = await callAIModel("Create a React todo app");
    
    // Write all generated files to the sandbox
    for (const file of aiGeneratedFiles) {
        await sandbox.writeFile(
            `${instanceId}/${file.path}`,
            file.content
        );
        // User sees: "✓ Created src/App.tsx"
        notifyUser(`✓ Created ${file.path}`);
    }
}

To speed this up even more, we’ve provided a set of templates, stored in an R2 bucket, that the platform can use and quickly customize, instead of generating every file from scratch. This is just an initial set, but you can expand it and add more examples. 

Step 3: Getting a preview of your deployment

Once everything is ready, the platform starts the development server and uses the Sandbox SDK to expose it to the internet with a public preview URL which allows users to instantly see their AI-generated application running live:

// Start the development server in the sandbox
const processId = await sandbox.startProcess(
    `bun run dev`, 
    { cwd: instanceId }
);

// Create a public preview URL 
const preview = await sandbox.exposePort(3000, { 
    hostname: 'preview.example.com' 
});

// User instantly gets: "https://my-app-xyz.preview.example.com"
notifyUser(`✓ Preview ready at: ${preview.url}`);

Step 4: Test, log, fix, repeat

But that’s not all! Throughout this process, the platform will capture console output, build logs, and error messages and feed them back to the LLM for automatic fixes. As the platform makes any updates or fixes, the user can see it all happening live — the file editing, installation progress, and error resolution. 
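The loop described here can be sketched in a few lines. This is a simplified illustration under stated assumptions: `runBuild` and `requestFix` are hypothetical callbacks standing in for "run the build in the sandbox" and "send the logs to the LLM and apply its patch", and the retry limit is arbitrary.

```typescript
interface BuildResult {
  ok: boolean;
  logs: string; // console output, build logs, and error messages
}

// Run the build; on failure, hand the captured logs to the model for a fix
// and try again, up to maxAttempts builds in total.
async function buildWithAutoFix(
  runBuild: () => Promise<BuildResult>,
  requestFix: (logs: string) => Promise<void>,
  maxAttempts = 3,
): Promise<{ ok: boolean; attempts: number }> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await runBuild();
    if (result.ok) return { ok: true, attempts: attempt };
    // No point asking for another fix if we will not build again.
    if (attempt < maxAttempts) await requestFix(result.logs);
  }
  return { ok: false, attempts: maxAttempts };
}
```

Capping the number of attempts keeps a stubborn error from turning into an unbounded series of model calls.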

Deploying applications: From Sandbox to Region Earth

Once the application is developed, it needs to be deployed. The platform packages everything in the sandbox and then uses a separate specialized “deployment sandbox” to deploy the application to Cloudflare Workers. This deployment sandbox runs wrangler deploy inside the secure environment to publish the application to Cloudflare’s global network. 

Since the platform may deploy up to thousands or millions of applications, Workers for Platforms is used to deploy the Workers at scale. Although all the Workers are deployed to the same Namespace, they are all isolated from one another by default, ensuring there’s no cross-tenant access. Once deployed, each application receives its own isolated Worker instance with a unique public URL like my-app.vibe-build.example.com.

async function deployToWorkersForPlatforms(instanceId: string) {
    // 1. Package the app from development sandbox
    const devSandbox = getSandbox(env.Sandbox, instanceId);
    const packagedApp = await devSandbox.exec('zip -r app.zip .');
    
    // 2. Transfer to specialized deployment sandbox
    const deploymentSandbox = getSandbox(env.DeployerServiceObject, 'deployer');
    await deploymentSandbox.writeFile('app.zip', packagedApp);
    await deploymentSandbox.exec('unzip app.zip');
    
    // 3. Deploy using Workers for Platforms dispatch namespace
    const deployResult = await deploymentSandbox.exec(`
        bunx wrangler deploy \\
        --dispatch-namespace vibe-sdk-build-default-namespace
    `);
    
    // Each app gets its own isolated Worker and unique URL
    // e.g., https://my-app.example.com
    return `https://${instanceId}.example.com`;
}
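Once the Workers live in a dispatch namespace, something has to route each incoming hostname to the right one. A minimal sketch of that routing step follows. The `DISPATCHER` binding name, the `routeRequest` helper, and the subdomain-to-Worker naming scheme are assumptions for illustration; the `get(name).fetch(request)` call mirrors how Workers for Platforms dispatch namespace bindings are used, with the types declared locally so the sketch is self-contained.

```typescript
// Minimal structural types so the sketch stands alone; in a real Worker
// these come from the Workers runtime type definitions.
interface DispatchNamespace {
  get(scriptName: string): { fetch(request: Request): Promise<Response> };
}
interface Env {
  DISPATCHER: DispatchNamespace; // binding name is an assumption
}

// "my-app.vibe-build.example.com" with base "vibe-build.example.com" -> "my-app".
// Returns null for hosts outside the base domain or with nested subdomains.
function appNameFromHost(host: string, baseDomain: string): string | null {
  const suffix = `.${baseDomain}`;
  if (!host.endsWith(suffix)) return null;
  const name = host.slice(0, -suffix.length);
  return /^[a-z0-9-]+$/.test(name) ? name : null;
}

// Forward the request to the user Worker named after the subdomain.
async function routeRequest(request: Request, env: Env, baseDomain: string): Promise<Response> {
  const appName = appNameFromHost(new URL(request.url).hostname, baseDomain);
  if (appName === null) return new Response("Not found", { status: 404 });
  return env.DISPATCHER.get(appName).fetch(request);
}
```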

Exportable Applications 

The platform also allows users to export their application to their own Cloudflare account and GitHub repo, so they can continue the development on their own. 


Observability, caching, and multi-model support built in! 

It’s no secret that LLMs have their specialties, which means that when building an AI-powered platform, you may end up using a few different models for different operations. By default, VibeSDK leverages Google’s Gemini models (gemini-2.5-pro, gemini-2.5-flash-lite, gemini-2.5-flash) for project planning, code generation, and debugging.

VibeSDK is automatically set up with AI Gateway, so that by default, the platform is able to:

  • Use a unified access point to route requests across LLM providers, allowing you to use models from a range of providers (OpenAI, Anthropic, Google, and others)

  • Cache popular responses, so when someone asks to “build a to-do list app”, the gateway can serve a cached response instead of going to the provider (saving inference costs)

  • Get observability into the requests, tokens used, and response times across all providers in one place

  • Track costs across models and integrations
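To show what a "unified access point" means in practice, here is a small sketch that builds a gateway endpoint for a given provider and a caching header for a request. The URL shape and the `cf-aig-cache-ttl` header reflect AI Gateway's public documentation as we understand it, so treat both as assumptions to verify against the current docs; the account ID, gateway ID, and helper names are placeholders.

```typescript
// Build the per-provider base URL for a gateway. Only the provider segment
// changes when switching between providers.
function gatewayEndpoint(accountId: string, gatewayId: string, provider: string): string {
  const account = encodeURIComponent(accountId);
  const gateway = encodeURIComponent(gatewayId);
  return `https://gateway.ai.cloudflare.com/v1/${account}/${gateway}/${provider}`;
}

// Header asking the gateway to cache the response for the given number of seconds
// (header name assumed from the AI Gateway caching docs).
function cacheHeaders(ttlSeconds: number): Record<string, string> {
  return { "cf-aig-cache-ttl": String(ttlSeconds) };
}
```

Because every provider sits behind the same base URL, logging, caching, and cost tracking happen in one place regardless of which model serves the request.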

Open sourced, so you can build your own Platform! 

We’re open-sourcing VibeSDK for the same reason Cloudflare open-sourced the Workers runtime — we believe the best development happens in the open. That’s why we wanted to make it as easy as possible for anyone to build their own AI coding platform, whether it’s for internal company use, for your website builder, or for the next big vibe coding platform. We tied all the pieces together for you, so you can get started with the click of a button instead of spending months figuring out how to connect everything yourself. To learn more, check out our reference architecture for vibe coding platforms. 


Building unique, per-customer defenses against advanced bot threats in the AI era

Post Syndicated from Jin-Hee Lee original https://blog.cloudflare.com/per-customer-bot-defenses/

Today, we are announcing a new approach to catching bots: using models to provide behavioral anomaly detection unique to each bot management customer and stop sophisticated bot attacks. 

With this per-customer approach, we’re giving every bot management customer hyper-personalized security capabilities to stop even the sneakiest bots. We’re doing this by not only making a first-request judgement call, but also by tracking the behavior of bots that play the long game and continuously execute unwanted behavior on our customers’ websites. We want to share how this service works, and where we’re focused. Our new platform has the power to fuel hundreds of thousands of unique detection suites, and we’ve heard our first target loud and clear from site owners: protect websites from the explosion of sophisticated, AI-driven web scraping.

The new arms race: the rise of AI-driven scraping

The battle against malicious bots used to be a simpler affair. Attackers used scripts that were fairly easy to identify through static, predictable signals: a request with a missing User-Agent header, a malformed method name, or traffic from a non-standard port was a clear indicator of malicious intent. However, the Internet is always evolving. As websites became more dynamic to create rich user experiences, attackers evolved their tools in response. The simple scripts of yesterday were replaced by headless browsers and automation frameworks, capable of rendering pages and mimicking human interaction with far greater fidelity.

AI has made this even trickier. The rise of Generative AI has fundamentally changed the capabilities and the motivations of attackers. The web scraping of today isn’t limited to competitive price intelligence or content aggregation; it is driven by the voracious appetite of Large Language Models (LLMs) for training data.

Cloudflare’s data shows this shift in stark terms. In mid-2025, crawling for the purpose of AI model training accounted for nearly 80% of all AI bot activity on our network, a significant increase from the year prior. Modern scraping tools are now AI-powered themselves. They leverage LLMs for semantic understanding of page content, use computer vision to solve visual challenges, and employ reinforcement learning to navigate complex websites they’ve never seen before.

The evolution of these bots exposes a critical vulnerability in the traditional, one-size-fits-all approach to security. While global threat intelligence is immensely powerful for stopping widespread attacks, these new AI-powered scrapers are designed to blend in. They can rotate IP addresses through residential proxies, generate human-like user agents, and mimic plausible browsing patterns. A request from one of these bots might not look anomalous when compared to the trillions of requests we see across the Cloudflare network, but would appear anomalous when compared to the established patterns of legitimate users on a specific website. This means we need to build defenses against these bots from every angle we have, from the global view to specific behavior on a single application.


Globally scalable bot fingerprinting

To target specific well-known bots or bot actors, we leverage the Cloudflare network to fingerprint bots that we see behave similarly across millions of websites. Since June, Cloudflare’s bot detection security analysts have written 50 heuristics to catch bots using a variety of signals, including but not limited to HTTP/2 fingerprints and Client Hello extensions. By observing traffic on millions of websites, we establish a baseline of legitimate fingerprints of common browsers and benign devices. When a new, unique fingerprint suddenly appears across many different sites, it’s a tell-tale sign of a distributed botnet or a new automation tool, allowing our analysts to block the bot’s signature itself and neutralize the entire campaign, regardless of the thousands of different IP addresses it might use.
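The "new fingerprint suddenly appears across many sites" signal can be illustrated with a small counting sketch. This is a toy model under stated assumptions, not our detection logic: fingerprints are opaque strings, the baseline is a set of known-good fingerprints, and the `minZones` threshold and function name are made up for the example.

```typescript
interface Observation {
  fingerprint: string; // e.g. an HTTP/2 or Client Hello fingerprint, treated as opaque
  zoneId: string; // the website where it was seen
}

// Return fingerprints that are not in the known-good baseline and were
// seen on at least minZones distinct zones.
function newlyWidespreadFingerprints(
  observations: Observation[],
  knownGood: Set<string>,
  minZones: number,
): string[] {
  const zonesByFingerprint = new Map<string, Set<string>>();
  for (const { fingerprint, zoneId } of observations) {
    if (knownGood.has(fingerprint)) continue;
    let zones = zonesByFingerprint.get(fingerprint);
    if (!zones) {
      zones = new Set<string>();
      zonesByFingerprint.set(fingerprint, zones);
    }
    zones.add(zoneId);
  }
  const flagged: string[] = [];
  for (const [fingerprint, zones] of zonesByFingerprint) {
    if (zones.size >= minZones) flagged.push(fingerprint);
  }
  return flagged;
}
```

Note that the count is over distinct zones rather than IP addresses, which is why rotating through thousands of IPs does not help a tool whose fingerprint stays the same.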

Recently, we also introduced detection improvements to tackle residential proxy networks and similar commercial proxies, which are used by attackers to make their bots appear as thousands of distinct real visitors, allowing them to bypass traditional security measures. The superpower of this detection improvement? Combining the vast amount of network data we see with particular client-side fingerprints obtained through the millions of challenge solves that happen across the Internet daily. Challenges have always served as an ideal mitigation action for customers who want to protect their applications without compromising real-user experience, but now they also serve as a gift that keeps on giving: in this case, feeding the Cloudflare threat detection teams a constant stream of client-side information that allows us to pattern match to determine IP addresses that are used by residential proxy networks.

This detection improvement is already ingesting data from the entire Cloudflare network, automatically catching more malicious traffic for all customers using Super Bot Fight Mode (bot protection included for Pro, Business, and all Enterprise customers) and Enterprise Bot Management. Examining 7 days of data from the time of authoring this post, we’ve observed 11 billion requests from millions of unique IP addresses that we’ve identified as connected to residential or commercial proxy networks. This is just one piece of the global detection puzzle; the existing residential proxy detection features in our ML already catch tens of millions of requests every hour.

Hyper-personalized security: learning what’s normal for you

The new arms race against AI-powered bots necessitates a closer look — something more precise. For instance, a script that systematically scrapes every user profile on a social media site, or every product listing on an e-commerce platform, is exhibiting behavior that is fundamentally abnormal for that application, even if a standalone request appears benign. This realization is at the heart of our new strategy: to win this new arms race, defenses must become as bespoke and adaptive as the attacks they face.

To meet this challenge, we built a new, foundational platform engineered to deploy custom machine learning models for every bot management customer. We’re creating a unique defense for every application. Because each website has different traffic, the traffic that we flag as anomalous will, of course, be different for each zone — for this system, we want to be clear that data from one customer’s zone won’t be used to train the model for another customer’s use.

Announcing this as a new platform capability, rather than a single feature, is a deliberate choice. It aligns with how we’ve approached our most significant innovations, from Cloudflare Workers changing how developers build applications, to AI Gateway creating a single control plane for AI observability and security. By focusing on the platform, we tackle the scraping problems our customers are seeing today and power future detections as bot attacks become increasingly sophisticated.

Our new generation of per-customer anomaly detection is a three-step process, designed to identify malicious behavior by first understanding what constitutes legitimate traffic for each individual website and API.

Step 1: Establishing a dynamic baseline

For each customer zone, our behavioral detections ingest traffic data to build a baseline of normal activity. Rather than taking a static snapshot, our new platform ingests data to make living, continuously updated calculations of what “normal” looks like on a specific website. This approach understands seasonality, recognizes traffic spikes from legitimate marketing campaigns, and maps the typical pathways users take through a site. This approach evolves the concept of Anomaly Detection already present in our Enterprise Bot Management suite, but applies it at a far more granular and dynamic per-customer level.
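A "living, continuously updated" baseline can be approximated with an exponentially weighted moving average, which is the technique swapped in for this sketch. It is a deliberately simple stand-in rather than the model we run: it tracks one metric (say, requests per minute for a zone), does not model seasonality, and the class name and smoothing factor are assumptions.

```typescript
// Exponentially weighted mean and variance for a single per-zone metric.
// Recent values count more than old ones, so the baseline drifts with the site.
class RollingBaseline {
  private mean = 0;
  private variance = 0;
  private seen = false;
  private readonly alpha: number;

  constructor(alpha = 0.1) {
    this.alpha = alpha; // higher alpha = faster adaptation, noisier baseline
  }

  update(value: number): void {
    if (!this.seen) {
      this.mean = value;
      this.seen = true;
      return;
    }
    const delta = value - this.mean;
    this.mean += this.alpha * delta;
    this.variance = (1 - this.alpha) * (this.variance + this.alpha * delta * delta);
  }

  // How many standard deviations a new value sits from the current baseline.
  zScore(value: number): number {
    const std = Math.sqrt(this.variance);
    return std === 0 ? 0 : (value - this.mean) / std;
  }
}
```

Feeding it a stream of per-minute counts and calling `zScore` on the next observation gives a rough sense of how unusual that observation is for this particular zone.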

Step 2: Identifying the anomalies

Once the baseline of “normal” is established, we begin the true work — identifying deviations. Because the baseline is specific to each website, the anomalies detected are highly contextual, perhaps even invisible to a global system. We can examine a few different types of websites to unpack this:

  • For a gaming company: A normal traffic baseline might show millions of users making frequent, rapid API calls to a matchmaking service or an in-game inventory system. A behavioral detection model trained on this baseline would immediately flag a single user making slow, methodical, sequential API calls to scrape the entire player leaderboard. This behavior, while low in volume, is a clear anomaly against the backdrop of normal gameplay patterns.

  • For a retail website: The normal baseline is a complex funnel of users browsing categories, viewing products, adding items to a cart, and proceeding to checkout. These detections would identify an actor that systematically visits every single product page in alphabetical order at a machine-like pace, without ever interacting with the cart or session cookies, as a significant anomaly indicative of content scraping.

  • For a media publisher: Normal user behavior involves reading a few articles, following internal links, and spending a measurable amount of time on each page. An anomaly would be a script that hits thousands of article URLs per minute, spending less than a second on each, purely to extract the text content for AI model training.

In each case, the malicious activity is defined not by a universal signature, but by its deviation from the application’s unique, established norm.
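The point that an anomaly is relative to each application's own norm can be shown with a short sketch. The feature names, baseline numbers, and threshold below are invented for illustration, and a simple z-score stands in for whatever a real model would compute.

```typescript
type Features = Record<string, number>;

interface FeatureBaseline {
  mean: number;
  std: number;
}

// Return the names of features that deviate from this zone's baseline
// by more than `threshold` standard deviations, in either direction.
function anomalousFeatures(
  session: Features,
  baseline: Record<string, FeatureBaseline>,
  threshold = 3,
): string[] {
  const flagged: string[] = [];
  for (const [name, value] of Object.entries(session)) {
    const b = baseline[name];
    if (!b || b.std <= 0) continue; // no usable baseline for this feature
    if (Math.abs(value - b.mean) / b.std > threshold) flagged.push(name);
  }
  return flagged;
}

// The same session measured against two hypothetical zones.
const session: Features = { requestsPerMinute: 100 };
const publisherBaseline = { requestsPerMinute: { mean: 2, std: 1.5 } };
const gamingApiBaseline = { requestsPerMinute: { mean: 120, std: 40 } };
```

With these made-up numbers, 100 requests per minute is wildly abnormal for the publisher and unremarkable for the gaming API, which is exactly why a single global threshold would miss one case or over-block the other.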

Step 3: Generating actionable findings

Detecting an anomaly is only half the battle. The power of bot management comes from its seamless integration into the Cloudflare security ecosystem you already use, turning detection into immediate, actionable findings. Customers can benefit from these behavioral detection improvements in two ways:

  1. New Bot Detection IDs: For our Enterprise customers, we’re introducing a new set of Bot Detection IDs. Website owners and security teams can write WAF security rules to challenge, rate-limit, or block traffic based on the specific anomalies flagged by these detections. Since each detection type is tied to a unique ID, customers can see exactly what kind of behavior caused a request to be flagged as anomalous, offering a detailed, per-request view into stealthy malicious traffic. And for a wider view, customers can filter by Detection ID from their Security Analytics, to see the bigger picture of all traffic captured by that detection type.

  2. Improving Bot Score: Another key output from these new, per-customer models will be to directly influence the Bot Score of a request. A request flagged as anomalous will have its score lowered, moving it into the “Likely Automated” (scores 2-29) or “Automated” (score 1) categories. This means that existing WAF custom rules based on Bot Score will automatically see impact and become more effective against bespoke attacks, with no changes required. This functionality update is available today for our latest account takeover detection, residential proxy detections and our recent enhancements, and will be implemented in the future for our behavioral scraping detection. 

This three-step process is already in action with our behavioral detections to catch account takeover attacks. Taking bot detection ID 201326598 as an example: it (1) establishes a zone-level baseline that understands what normal traffic patterns look like for a specific website, (2) examines anomalous login failures to identify brute force and credential stuffing attacks, then (3) allows customers to mitigate these attacks by automatically influencing bot score and offering more visibility with the detection ID’s analytics. 
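To make the two outputs concrete, here is a sketch of how an application might combine Bot Score categories with Detection IDs when deciding whether to challenge a request. The score ranges for "Automated" and "Likely Automated" and the example detection ID come from the text above; the `shouldChallenge` policy and the idea of passing the values in as plain numbers are assumptions for illustration, not a WAF rule.

```typescript
type BotCategory = "automated" | "likely_automated" | "other";

// Ranges as described above: score 1 is "Automated", 2-29 is "Likely Automated".
function botCategory(score: number): BotCategory {
  if (score === 1) return "automated";
  if (score >= 2 && score <= 29) return "likely_automated";
  return "other";
}

// Hypothetical policy: challenge when the score looks automated, or when
// any detection ID on the request is one we have chosen to act on.
function shouldChallenge(
  score: number,
  detectionIds: number[],
  watchedIds: Set<number>,
): boolean {
  return botCategory(score) !== "other" || detectionIds.some((id) => watchedIds.has(id));
}
```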


This integration strategy creates a flywheel effect: the new intelligence from these improved detections immediately enhances the value of existing products like Super Bot Fight Mode, Bot Management, and the WAF, making the entire Cloudflare platform stronger for you.

Taking on sophisticated scrapers

The first challenge we’re tackling is sophisticated scraping. AI-driven scraping is one of the most pressing and rapidly evolving threats facing website owners today, and its adaptive nature makes it an ideal adversary for a system designed to fight an enemy that constantly changes its tactics.

The first generation of our improved behavioral detections is tuned specifically to detect scraping by analyzing signals that go beyond simple request headers. These include:

  • Behavioral Analysis: Looking at session traversal paths, the sequence of requests, and interaction (or lack thereof) with dynamic page elements.

  • Client Fingerprinting: Analyzing subtle signals from the client to identify signs of automation such as JA4 fingerprints in the context of the customer’s specific traffic baseline.

  • Content-Agnostic Detection: These models do not need to understand the content of a page, only the patterns of how it is being accessed. This makes them highly scalable and efficient, without actually using the unique content on a website to make judgement calls.

How do these scraping detections look, in practice? We validated our logic for detecting scraping with early adopters in a closed beta, in order to receive ground-truth feedback and tune our detections. As with any ideal detection, our goal is to capture as much malicious traffic as possible, without compromising the experience of legitimate website visitors. Looking at just a 24-hour period, our new scraping detections have caught hundreds of millions of requests, flagging 138 million scraping requests on just 5 of our early beta zones.


Naturally, we see an overlap with our existing system of bot scoring, but the numbers here show us concretely that our new method of behavioral detection adds something genuinely new: 34% of the requests flagged by our new scraping detections would not have been detected by our existing bot score system, making us all the more eager to use these novel detections to inform the way we score automation.

A birthday gift for the Internet

Our mission to help build a better Internet means that when we develop powerful new defenses, we believe in democratizing access to them. Protecting the entire Internet from new and evolving threats requires raising the baseline of security for everyone.

In that spirit, we’re excited to announce that our enhanced behavioral detections will not only roll out to bot management customers, but will also benefit Cloudflare customers using our global Super Bot Fight Mode system. For our Enterprise Bot Management customers, we automatically tune our detections based on the exact traffic for each zone. Because these advanced models are trained on your zone’s specific traffic, they detect even the most evasive attacks: from account takeovers to web scraping to other attacks executed through residential proxy networks — and we consider this only the tip of the iceberg of behavioral bot profiling. 

The road ahead

Our initial focus on scraping is just the beginning of a new wave of behavioral bot detections. The infrastructure we’ve built is a flexible, powerful foundation for tackling a wide range of malicious behavior on your websites; the same principles of establishing a per-customer baseline and detecting anomalies can be applied to other critical threats that are unique to an application’s logic, such as credential stuffing, inventory hoarding, carding attacks, and API abuse.

We are moving into an era where generic defenses are no longer enough. As threats become more personal, so must the defenses against them, and paving this path of behavioral detections is our latest gift to the Internet. Our first offering of scraping behavioral detections is just around the corner: customers will be able to turn on this new detection from the Security Overview page in their dashboard. 


(We’re always looking for enthusiastic humans to help us in our mission against bots! If you’re interested in helping us build a better Internet, check out our open positions.)

Why Cloudflare, Netlify, and Webflow are collaborating to support Open Source tools like Astro and TanStack

Post Syndicated from Rita Kozlov original https://blog.cloudflare.com/cloudflare-astro-tanstack/

Open source is the core fabric of the web, and the open source tools that power the modern web depend on the stability and support of the community. 

To ensure two major open source projects have the resources they need, we are proud to announce our financial sponsorship of two cornerstone frameworks in the modern web ecosystem: Astro and TanStack.

Critically, we think it’s important we don’t do this alone — for the open web to continue to thrive, we must bet on and support technologies and frameworks that are open and accessible to all, and not beholden to any one company. 

Which is why we are also excited to announce that for these sponsorships we are joining forces with our peers at Netlify to sponsor TanStack and Webflow to sponsor Astro.

Why Astro and TanStack? Investing in the Future of the Frontend

Our decision to support Astro and TanStack was deliberate. These two projects represent distinct but complementary visions for the future of web development. One is redefining the architecture for high-performance, content-driven websites, while the other provides a full-stack toolkit for building the most ambitious web applications.

Astro: the framework for high-performance sites

When it comes to endorsing a technology, we believe actions speak louder than words. 

That’s why our support for Astro isn’t just financial—it’s foundational. We run our developer documentation site, developers.cloudflare.com, entirely on Astro. This isn’t a small side project — it’s a critical resource visited by hundreds of thousands of developers every day, with dozens of contributors constantly keeping it updated. For a site like this, performance isn’t a feature; it’s a requirement. 

We chose Astro because its core principles mirror our own. Its “zero JS by default” architecture delivers the raw performance and stellar SEO that a content-heavy site demands, ensuring our docs are fast and discoverable. Just as importantly, Astro is framework-agnostic, letting teams use components from React, Vue, or Svelte without vendor lock-in. 

Astro makes it easy for our global team to keep content up-to-date and, most importantly, keep our docs blazing fast. Our sponsorship is a direct result of the immense value we’ve experienced firsthand.   

Cloudflare’s partnership and support affirms our shared mission: to make the web faster, more open, and better for everyone who builds on it.  – Fred K. Schott, Astro Co-creator, Project Steward

Webflow gives marketers, designers, and developers the freedom to build without compromise. Astro shares that same spirit by removing barriers, speeding up workflows, and opening new creative possibilities. Together with Cloudflare and Netlify, we’re helping ensure the tools our community relies on stay open, sustainable, and ready for the future. – Allan Leinwand, Webflow CTO

TanStack Start: the full-stack framework for ambitious applications

If Astro provides the ideal foundation for content-heavy sites, TanStack provides the ideal engine for complex web applications. TanStack is not a single framework but a suite of powerful, headless, and type-safe libraries that solve the hardest problems in modern application development.

Libraries like TanStack Query have become the de facto industry standard for managing asynchronous server state, elegantly solving complex challenges like caching, background refetching, and optimistic updates that once required thousands of lines of fragile, bespoke code. Similarly, TanStack Router brings full type-safety to routing, eliminating an entire class of common bugs, while TanStack Table and TanStack Form provide the robust, headless primitives needed to build sophisticated, data-intensive user interfaces.

And today, TanStack announced the release candidate for TanStack Start 1.0, taking a big stride towards production-readiness.

TanStack Start is a new full-stack framework that composes these powerful libraries into a cohesive, enterprise-grade development experience. With features like full-document Server-Side Rendering (SSR), streaming, and a “deploy anywhere” architecture, TanStack Start is designed for the modern, serverless edge. It provides the power and type-safety needed for ambitious applications and is a perfect match for deployment environments like Cloudflare Workers.

With Cloudflare alongside us, TanStack can keep raising the bar for fast, scalable, and type-safe tools for powering the next generation of web apps while protecting the openness and freedom developers depend on. – Tanner Linsley, TanStack creator

Supporting an open web is not a nice-to-have for us, but a requirement for us to fulfill our mission to build a better web. Collaborating with Cloudflare on making sure these top projects are funded is the easiest decision we can make! – Mat B, CEO

Joining forces builds a stronger open web

It is not lost on us that this coalition includes companies that compete in the market. We believe this is a feature, not a bug. It demonstrates a shared understanding that we are all building on the same open-source foundations. A healthy, innovative, and sustainable open-source ecosystem is the rising tide that lifts all of our boats.

This joint sponsorship model means Astro and TanStack are more resilient. For you, that means you can build on them with confidence, knowing they aren’t dependent on a single company’s shifting priorities.

With that, show us what you build!

The best way to support open source is to use it, build with it, and contribute back to it. See how easy it is to get started with Astro and TanStack and deploy an application to Cloudflare in minutes with the following framework guides:

Launching the x402 Foundation with Coinbase, and support for x402 transactions

Post Syndicated from Will Allen original https://blog.cloudflare.com/x402/

Cloudflare is partnering with Coinbase to create the x402 Foundation. This foundation’s mission will be to encourage the adoption of the x402 protocol, an updated framework that allows clients and services to exchange value on the web using a common language. In addition to today’s partnership, we are shipping a set of features to allow developers to use x402 in the Agents SDK and our MCP integrations, as well as proposing a new deferred payment scheme.

Payments in the age of agents

Payments on the web have historically been designed for humans. We browse a merchant’s website, show intent by adding items to a cart, and confirm our intent to purchase by inputting our credit card information and clicking “Pay.” But what if you want to enable direct transactions between digital services? We need protocols to allow machine-to-machine transactions. 

Every day, sites on Cloudflare send out over a billion HTTP 402 response codes to bots and crawlers trying to access their content and e-commerce stores. This response code comes with a simple message: “Payment Required.”

Yet these 402 responses too often go unheard. One reason is a lack of standardization. Without a specification for how to format and respond to those response codes, content creators, publishers, and website operators lack adequate tools to convey their payment requests. x402 can give developers a clear, open protocol for websites and automated agents to negotiate payments across the globe. 

A Primer on x402

Coinbase authored the x402 transaction flow, outlined below, to help machines pay directly for resources over HTTP:

  1. A client attempts to access a resource gated by x402. 

  2. The server responds with the status code 402 Payment Required. The response body contains payment instructions including the payment amount and recipient.

  3. The client requests the x402-gated resource with the payment authorization header.

  4. The payment facilitator verifies the client’s payment payload and settles the transaction.

  5. The server responds with the requested resource in the response, along with the payment response header that confirms the payment outcome. 

This flow creates programmatic access to resources across the Internet. Clients and servers capable of interpreting the x402 protocol are able to transact without the need for accounts, subscriptions, or API keys.
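
The five steps above can be sketched from the client's side. This is a minimal illustration, not the x402 reference client: the `Transport` and `Payer` types are hypothetical, and the `X-PAYMENT` header name is an assumption made for this sketch.

```typescript
// Minimal shapes for the sketch; all names here are hypothetical.
interface SimpleResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}
type Transport = (url: string, headers: Record<string, string>) => Promise<SimpleResponse>;
type Payer = (instructions: string) => Promise<string>; // turns instructions into a payment payload

async function fetchWithPayment(url: string, send: Transport, pay: Payer): Promise<SimpleResponse> {
  // Step 1: request the resource without any payment attached.
  const first = await send(url, {});
  if (first.status !== 402) return first;

  // Step 2: the 402 body carries the payment instructions (amount, recipient).
  const payload = await pay(first.body);

  // Step 3: retry with a payment authorization header (assumed name).
  // Steps 4 and 5 happen on the server side: the facilitator verifies and
  // settles, and the server returns the resource with a payment response header.
  return send(url, { "X-PAYMENT": payload });
}
```

Passing the transport in as a parameter keeps the sketch independent of any particular HTTP library and makes the retry logic easy to exercise against a fake server.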

x402 can be used to monetize traditional use cases, but also enables monetization of a new class of use cases. For example:

  • An assistant that is able to purchase accessories for your Halloween costume from multiple merchants.

  • An AI agent that pays per browser rendering session, instead of committing to a monthly subscription fee.

  • An autonomous stock trader that makes micropayments for a high quality real-time data feed to drive decisions.

Future versions of x402 could be agnostic of the payment rails, accommodating credit cards and bank accounts in addition to stablecoins. 

Cloudflare’s pay per crawl: proposing the x402 deferred payment scheme 

Agents and crawlers often require two important functions that already exist in much of today’s financial infrastructure: delayed settlement to account for disputes; and a single, aggregated payment to make their accounting simpler. For example, crawlers participating in our private beta of pay per crawl are able to crawl a vast number of pages easily, generate audit logs, and then be charged a single fee via a connected credit card or bank account at the end of each day. 

To account for these types of payment scenarios, we’re proposing a new deferred payment scheme for the x402 protocol. This new scheme is specifically designed for agentic payments that don’t need immediate settlement and can be handled either through traditional payment methods or stablecoins. By proposing this addition, we’re helping to ensure that any compliant server can optionally decouple the cryptographic handshake from the payment settlement itself, giving agents and servers the ability to use pre-negotiated licensing agreements, batch settlements, or subscriptions.

We will be bringing this new deferred payment scheme to pay per crawl as we expand and evolve the private beta. 

The Handshake Explained

Here’s our initial proposal for the handshake that could be released in the next major version of x402:

1. The Server’s Offer

Today, an unauthenticated or unauthorized client attempts to access a resource and receives a 402 Payment Required response. The server provides a payment commitment payload that the client can use to construct a re-request. This response is a machine-readable offer, and our proposal adds a new scheme to it, called deferred.

HTTP/1.1 402 Payment Required
Content-Type: application/json

{
  "accepts": [
    {
      "scheme": "deferred",
      "network": "example-network-provider",
      "resource": "https://example.com/page",
      "...": "...",
      "extras": {
        "id": "abc123",
        "termsUrl": "https://example.com/terms"
      }
    }
  ]
}
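
As a rough illustration of how a client might read this offer, the sketch below parses a 402 body and picks out the deferred entry. The interfaces mirror only the fields visible in the example above; the elided fields are left out, and the function name is an assumption.

```typescript
// Shapes mirror the example 402 body; elided fields are intentionally omitted.
interface PaymentOffer {
  scheme: string;
  network: string;
  resource: string;
  extras?: { id?: string; termsUrl?: string };
}

interface PaymentRequired {
  accepts: PaymentOffer[];
}

// Return the first offer whose scheme is "deferred", or undefined if none is offered.
function pickDeferredOffer(body: string): PaymentOffer | undefined {
  const parsed = JSON.parse(body) as PaymentRequired;
  if (!Array.isArray(parsed.accepts)) return undefined;
  return parsed.accepts.find((offer) => offer.scheme === "deferred");
}
```
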
2. The Client’s Signed Commitment

Next, the client re-sends the request with a signed payload containing their payment commitment. The deferred scheme uses HTTP Message Signatures where a JWK-formatted public key is available in a hosted directory. The Signature-Input header clearly explains which parts of the request are included in the Signature to serve as cryptographic proof of the client’s intent, verifiable by the service provider without an on-chain transaction. 

GET /path/to/resource HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 Chrome/113.0.0 MyBotCrawler/1.1
Payment:
    scheme="deferred",
    network="example-network-provider",
    id="abc123"
Signature-Agent: signer.example.com
Signature-Input:
    sig=("payment" "signature-agent");
    created=1700000000;
    expires=1700011111;
    keyid="ba3e64==";
    tag="web-bot-auth"
Signature: sig=abc==
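
To show how the Signature-Input value above is put together, here is a small sketch that serializes the covered components and parameters, plus a validity-window check a verifier might apply. It does not perform any cryptographic signing or verification, and the type and function names are assumptions.

```typescript
interface SignatureParams {
  components: string[]; // covered header names, lowercase
  created: number;      // Unix seconds
  expires: number;      // Unix seconds
  keyid: string;
  tag: string;
}

// Serialize the parameters into a single-line Signature-Input value for the label "sig".
function buildSignatureInput(p: SignatureParams): string {
  const covered = p.components.map((c) => `"${c}"`).join(" ");
  return `sig=(${covered});created=${p.created};expires=${p.expires};keyid="${p.keyid}";tag="${p.tag}"`;
}

// A verifier would typically reject signatures used outside their validity window.
function isWithinWindow(p: SignatureParams, nowSeconds: number): boolean {
  return nowSeconds >= p.created && nowSeconds <= p.expires;
}
```
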
3. Successful Response

The resource server validates the signature and returns the content with a confirmation header. The server is responsible for attributing the payment to the account associated with the HTTP message signature, verifying the client’s identity and then delivering the content. In this scenario, there is no blockchain associated with the payments. 

HTTP/1.1 200 OK
Content-Type: text/html
Payment-Response:
    scheme="deferred",
    network="example-network-provider",
    id="abc123",
    timestamp=1730872968
4. Payment Settlement

The server can now handle the settlement flexibly. The validated id from the handshake acts as a reference for the transaction. This approach enables a flexible use model without per-request overhead, allowing the server to roll up payments on a subscription, daily, or even batch basis. This creates a flexible framework where the cryptographic trust is established immediately, while the financial settlement can use traditional payment rails or stablecoins. 
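
A roll-up of this kind could look like the sketch below, which sums individual deferred charges into one total per account. The `DeferredCharge` shape and its field names are hypothetical, and amounts are kept in integer cents to avoid floating-point drift.

```typescript
interface DeferredCharge {
  account: string;     // account tied to the verified signature (hypothetical field)
  id: string;          // id validated during the handshake
  amountCents: number; // integer cents
}

// Roll individual charges up into a single total per account.
function rollUp(charges: DeferredCharge[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const c of charges) {
    totals.set(c.account, (totals.get(c.account) ?? 0) + c.amountCents);
  }
  return totals;
}
```

Running this once per day over the day's validated ids would yield the single aggregated charge per crawler described earlier.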

Cloudflare’s MCP servers, Agents SDK, and x402 payments

Running code is what moves an open convention from the theoretical to truly useful, and eventually to a recognized standard. Agents built using Cloudflare’s Agents SDK can now pay for resources with x402, and MCP servers can expose tools to be paid for via x402. To show how this works, we created the x402 playground, a live demo employing x402. The x402 playground is powered by the Agents SDK and has access to tools from MCP servers deployed on Cloudflare.


When you open the x402 playground, a new wallet is created and funded with Testnet USDC on a Base blockchain testnet. The agent, built with Agents SDK, has access to an MCP server with both free and paid tools.

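import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { McpAgent } from "agents/mcp";
import { withX402 } from "agents/x402";
import { z } from "zod"; // schema builder used for the tool parameters below

// X402_CONFIG (the payment settings) is assumed to be defined elsewhere; it is not shown here.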

export class PayMCP extends McpAgent {
  server = withX402(
    new McpServer({ name: "PayMCP", version: "1.0.0" }),
    X402_CONFIG
  );

  async init() {
    // Paid tool
    this.server.paidTool(
      "square",
      "Squares a number",
      0.01, // Tool price
      {
        a: z.number()
      },
      {},
      async ({ a }) => {
        return { content: [{ type: "text", text: String(a ** 2) }] };
      }
    );

    // Free tool
    this.server.tool(
      "add-two-numbers",
      "Adds two numbers",
      {
        a: z.number(),
        b: z.number(),
      },
      async ({ a, b }) => {
        return { content: [{ type: 'text', text: String(a + b) }] };
      }
    );
  }
}

When the agent attempts to use a paid tool, the MCP server responds with a 402 Payment Required. The agent is able to interpret the payment instructions and prompt the human whether they want to proceed with the transaction. Building an x402-compatible client requires a basic wrapper on the tool call:

import { Agent } from "agents";
import { withX402Client } from "agents/x402";

export class MyAgent extends Agent {
  // Your Agent definitions...

  async onToolCall() {

    // Build the x402 client
    const x402Client = withX402Client(
      myMcpClient,
      { network: "base-sepolia", account: this.account }
    );

    // The first parameter becomes the confirmation callback.
    // We can set it to `null` if we want the agent to pay automatically.
    const res = await x402Client.callTool(
      this.onPaymentRequired,
      {
        name: toolName,
        arguments: toolArgs
      }
    );
  }
}

This test agent draws down the funds from the wallet and sends the payment payload to the MCP server, which settles the transaction. The transactions can be specified to execute with or without human confirmation, allowing you to design the interface best suited for your application.
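
One way to combine the two modes is a confirmation policy that pays small amounts automatically and asks a human for anything larger. The sketch below is illustrative only: the post does not show the callback's arguments, so the `PaymentRequest` shape and `makeConfirm` are assumptions rather than part of the Agents SDK.

```typescript
// Hypothetical shape: the real callback's arguments are not shown in the post.
interface PaymentRequest {
  toolName: string;
  priceUsd: number;
}

// Build a callback that auto-approves small payments and defers larger ones
// to a human prompt supplied by the caller.
function makeConfirm(
  autoApproveUpToUsd: number,
  askHuman: (req: PaymentRequest) => Promise<boolean>
): (req: PaymentRequest) => Promise<boolean> {
  return async (req) => {
    if (req.priceUsd <= autoApproveUpToUsd) return true;
    return askHuman(req);
  };
}
```
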

What’s next? 

You can get started today by using the Agents SDK or by deploying your own MCP server.

We’ll continue to work closely with Coinbase to establish the x402 Foundation. Stay tuned for more announcements on the specifics of the structure very soon.

We believe in the value of open and interoperable protocols – which is why we are encouraging everyone to contribute to the x402 protocol directly. To get in touch with the team at Cloudflare working on x402, email us at [email protected].

Helping protect journalists and local news from AI crawlers with Project Galileo

Post Syndicated from Patrick Day original https://blog.cloudflare.com/ai-crawl-control-for-project-galileo/

We are excited to announce that Project Galileo will now include access to Cloudflare’s Bot Management and AI Crawl Control services. Participants in the program, which include roughly 750 journalists, independent news organizations, and other non-profits supporting news-gathering around the world, will now have the ability to protect their websites from AI crawlers—for free. 

Project Galileo is Cloudflare’s free program to help protect important civic voices online. Launched in 2014, it now includes more than 3,000 organizations in 125 countries, and it has served as the foundation for other free Cloudflare programs that help protect democratic elections, public schools, public health clinics, and other critical infrastructure.  

Although we think all Project Galileo participants will benefit from these additional free services, we believe they are essential for news organizations. 

News organizations, particularly local news, are facing significant challenges in transitioning to the AI-driven web. As people increasingly turn to AI models for information, less of their web traffic is making it to the actual website where that information originated. Industries that rely on user traffic to generate revenue, like news, are increasingly at risk.

Allowing news organizations to monitor and control how AI crawlers are interacting with their websites will help them better protect their content and make more informed decisions about engaging with AI companies. Ultimately, our goal is to provide the tools news organizations need to negotiate fair compensation for their work.

Traffic and the news

AI is fundamentally changing how traffic flows on the Internet. Cloudflare recently published data showing that with OpenAI, it’s 750 times more difficult for website owners to get the same volume of traffic than it was with Google search in the past. With Anthropic, it’s 30,000 times more difficult.

News organizations rely on traffic to not only connect with their readers, but also generate revenue from subscriptions, advertising, e-commerce, and licensing. The CEO of the Financial Times recently stated that AI had caused a “pretty sudden and sustained” decline of 25% to 30% in traffic to its articles arriving via search engines.

Potential losses of user traffic and revenue come at an already precarious time for the news industry. It is well-documented that small, independent newspapers and news radio stations continue to face significant financial pressure, particularly in the United States. According to recent US Congressional testimony, more than two newspapers closed per week in 2024 with one third of the country’s newspapers set to close before the beginning of 2025. A 2024 report by the Northwestern Local News Initiative reported more than 206 US counties were without any local news source, and 1,561 had only one.  

Recent funding cuts to the Corporation for Public Broadcasting and National Public Radio, which provided grants, programming, and other support to public news stations around the US, have put further strain on these organizations, with more closures expected.

Giving control back to journalists

An important first step in helping journalists and news organizations adapt to the AI-driven web is providing tools to help them monitor and control AI models’ access to their content. 

“In an era defined by AI and digital disruption, providing robust tools to independent media isn’t just support – it’s a lifeline” – Meera, CEO Internews Europe

“Independent publishers need tools that are easy to use and affordable, so they can focus on growing their business. LION appreciates the security and protection Cloudflare has provided our members through Project Galileo for years, and we’re excited to see more resources now available to help members manage the rapidly evolving landscape of digital security.”  – Sarah Gustavus Lim, LION Membership Director 

Cloudflare Bot Management and AI Crawl Control were designed for exactly these purposes. Bot management is a security tool that uses machine learning to analyze web traffic to distinguish between good bots, like search engine crawlers, and bad bots that attack websites or steal credentials. It allows website owners to block bad bots from reaching their websites, while making sure helpful bots can continue to do their work.

AI Crawl Control provides similar tools to identify and manage AI crawlers. Cloudflare uses a variety of techniques to identify and categorize crawlers (HTTP headers, heuristics, and other behavior), giving website owners the ability to analyze their activity by type (e.g. AI search, AI scraper), where they are coming from (Google, OpenAI, Anthropic, etc.), and what content they are accessing. Here’s the kind of data that Cloudflare’s AI Crawl Control tool can provide (using the radar.cloudflare.com domain as an example):



Cloudflare combines these insights with easy-to-use controls that allow website owners to make informed decisions about whether to make their data available, including to only certain types of bots or to individual AI companies. This would, for example, allow a local newspaper to decide to block all AI crawlers and maintain direct connection to their readers via their own website, block only AI scrapers while allowing AI search crawlers that refer traffic, or negotiate and sell exclusive access to their content to a single AI company. The following image shows how AI Crawl Control lets users allow or block access on a crawler-by-crawler basis:


We think the ability to control and monitor AI crawler activity will provide immediate help to news organizations looking to protect their content and understand how models are using their data. 
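
The crawler-by-crawler choices described above can be pictured as a small policy lookup. This is a conceptual sketch, not AI Crawl Control's configuration format: the two categories come from the text, while the types, the `decide` function, and the example operator names are assumptions.

```typescript
type CrawlerCategory = "ai-search" | "ai-scraper"; // the two types named in the text
type Action = "allow" | "block";

interface CrawlerPolicy {
  byOperator: Map<string, Action>;              // per-company overrides
  byCategory: Record<CrawlerCategory, Action>;  // default per crawler type
}

// A per-operator override wins; otherwise fall back to the category default.
function decide(policy: CrawlerPolicy, operator: string, category: CrawlerCategory): Action {
  return policy.byOperator.get(operator) ?? policy.byCategory[category];
}
```

A local newspaper that blocks scrapers, allows search crawlers that refer traffic, and has licensed its content to one company would set both categories accordingly and add a single "allow" override for that company.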

We also think it will provide longer term insights that will allow news organizations to negotiate mutually beneficial relationships with AI companies over time.  

“Independent media’s ability to fulfill its democratic function by gathering news and distributing trusted information depends on generating revenues free from political or business influence. By monitoring and monetizing the crawling of publisher’s sites, media can protect their intellectual property while developing new revenue streams to support their quality journalism.” – Ryan Powell, Head of Innovation and Media Business at International Press Institute

A free press, if we can keep it

Journalism is part of the foundation of free society and democratic governance. It helps hold power accountable and provides a voice to the marginalized and underrepresented. It also protects the free and open markets that allow startups to challenge powerful incumbents.  

Local news in particular helps create shared identity, not only by covering community events, high school sports, farmers markets, and new businesses, but also by providing essential transparency and oversight over local officials, school boards, public safety events, and elections.

Helping protect journalists and news organizations online has always been part of Cloudflare’s mission. We see it as essential to our business and the future of the Internet.  

If you are interested in learning more about Project Galileo, sign up today. If you are interested in helping build a better Internet, come join us.

Apple’s New Memory Integrity Enforcement

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2025/09/apples-new-memory-integrity-enforcement.html

Apple has introduced a new hardware/software security feature in the iPhone 17: “Memory Integrity Enforcement,” targeting the memory safety vulnerabilities that spyware products like Pegasus tend to use to get unauthorized system access. From Wired:

In recent years, a movement has been steadily growing across the global tech industry to address a ubiquitous and insidious type of bugs known as memory-safety vulnerabilities. A computer’s memory is a shared resource among all programs, and memory safety issues crop up when software can pull data that should be off limits from a computer’s memory or manipulate data in memory that shouldn’t be accessible to the program. When developers—even experienced and security-conscious developers—write software in ubiquitous, historic programming languages, like C and C++, it’s easy to make mistakes that lead to memory safety vulnerabilities. That’s why proactive tools like special programming languages have been proliferating with the goal of making it structurally impossible for software to contain these vulnerabilities, rather than attempting to avoid introducing them or catch all of them.

[…]

With memory-unsafe programming languages underlying so much of the world’s collective code base, Apple’s Security Engineering and Architecture team felt that putting memory safety mechanisms at the heart of Apple’s chips could be a deus ex machina for a seemingly intractable problem. The group built on a specification known as Memory Tagging Extension (MTE) released in 2019 by the chipmaker Arm. The idea was to essentially password protect every memory allocation in hardware so that future requests to access that region of memory are only granted by the system if the request includes the right secret.

Arm developed MTE as a tool to help developers find and fix memory corruption bugs. If the system receives a memory access request without passing the secret check, the app will crash and the system will log the sequence of events for developers to review. Apple’s engineers wondered whether MTE could run all the time rather than just being used as a debugging tool, and the group worked with Arm to release a version of the specification for this purpose in 2022 called Enhanced Memory Tagging Extension.

To make all of this a constant, real-time defense against exploitation of memory safety vulnerabilities, Apple spent years architecting the protection deeply within its chips so the feature could be on all the time for users without sacrificing overall processor and memory performance. In other words, you can see how generating and attaching secrets to every memory allocation and then demanding that programs manage and produce these secrets for every memory request could dent performance. But Apple says that it has been able to thread the needle.
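
The "secret per allocation" idea can be illustrated with a toy model. This is a conceptual simulation in software, not how Apple's or Arm's hardware is implemented: the class, its methods, and the retagging rule on free are assumptions, and the 4-bit tag width follows Arm's published MTE design.

```typescript
// Toy model: every allocation gets a small tag, and a pointer carries the
// tag it was issued with. An access succeeds only if the two still match.
class TaggedMemory {
  private tags = new Map<number, number>(); // address -> current tag
  private nextAddr = 0;

  allocate(): { addr: number; tag: number } {
    const addr = this.nextAddr++;
    const tag = Math.floor(Math.random() * 16); // 4-bit tag
    this.tags.set(addr, tag);
    return { addr, tag };
  }

  free(addr: number): void {
    // Retag on free so stale pointers no longer match (illustrative rule).
    const old = this.tags.get(addr) ?? 0;
    this.tags.set(addr, (old + 1) % 16);
  }

  access(ptr: { addr: number; tag: number }): boolean {
    return this.tags.get(ptr.addr) === ptr.tag;
  }
}
```

In this model a use-after-free fails the check because the pointer still carries the old tag, which is the class of bug the always-on design is meant to stop.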

Optimizing Financial Routines and Infrastructure with Banpará

Post Syndicated from Michael Kammer original https://blog.zabbix.com/optimizing-financial-routines-and-infrastructure-with-banpara/30815/

Banco do Estado do Pará (Banpará) is the main public financial institution in the Brazilian state of Pará. It is a mixed-capital company, organized as a multiple bank with the mission of generating value for the state of Pará. It currently has approximately 198 physical customer service units and is present in all 144 municipalities in the state.

The challenge

Until 2016, Banpará used a monitoring environment installed on a single physical server. This environment was centralized, not very scalable, and vulnerable due to the lack of updates to recent versions of the software used. Centralization created a critical dependency – if there was a server failure, the entire monitoring system would be compromised.

There was no integration with the tool that orchestrates the company’s routine activities, which also generated alerts and needed proper support from the bank’s infrastructure. In addition, the routines of the internal demand generation tool had to be added to the monitoring panel manually.

With each new routine created, it was necessary to open calls with the technical teams for inclusion in the monitoring plan, which were then entered into a list of tasks. This process, in addition to being time-consuming, was subject to human error and delays, which compromised real-time visibility of critical operations.

The lack of proactive and integrated monitoring in Banpará’s structure resulted in operational gaps that created real risks to the continuous functioning of banking operations.

The solution

Given the challenges posed, the project developed with Zabbix had as its main objective to recreate the monitoring environment in a virtualized, scalable and resilient way, without dependence on a physical server. From rebuilding the infrastructure to integrating it with critical banking systems, the primary requirements included the following:

  • Integration with existing systems
  • Intelligent data processing and analysis
  • Reduction of manual processes and operational dependency
  • Development of customized solutions
  • Reorganization of the technological infrastructure

After implementing and structuring Zabbix at the bank (with the help of Master Support, an official Zabbix Certified Partner in Brazil), the structure became modular, scalable, and resilient, aligned with best practices, and able to expand monitoring without compromising system performance as the bank integrated new routines and services.

The results

The modernization of the monitoring environment with Zabbix brought immediate benefits for Banpará’s IT monitoring, especially with regard to operational efficiency, reliability, and process automation:

  • More than 2,000 monitored devices
  • Around 100,000 metrics collected
  • More than 26,000 active alerts in Zabbix
  • Automated coverage of around 2,300 routines
  • An estimated gain of 2,300 operational hours

The adoption of Zabbix as a monitoring tool at Banpará was a practical response to the need to modernize the bank’s IT infrastructure. The project contributed to the elimination of manual processes, reduction of operational time, and increased visibility over critical routines. It also enabled the monitoring of a greater number of services, with greater agility in identifying failures and supporting decision-making.

In conclusion

With the current structure, Banpará now has a more integrated monitoring system, adjusted to operational demands and with the capacity to monitor the evolution of the bank’s activities in an organized and secure manner.

To learn more about what Zabbix can do for customers in banking and finance, visit our website.

The post Optimizing Financial Routines and Infrastructure with Banpará appeared first on Zabbix Blog.

Artificial intelligence: creator or terminator?

Post Syndicated from original https://www.toest.bg/izkustveniyat-intelekt-tvorets-ili-terminator/

When I first found out that artificial intelligence (AI) could draw, I was thrilled. I am an average Bulgarian fantasy author, which means I am not especially widely read. The print runs of my books are small, and the chance of hiring an artist to illustrate my characters is a pipe dream. To be fair, I have received drawings of my characters from fans, and they are certainly dear to me, but since I am deeply immersed in visual culture (that is, a fan of anime, special-effects cinema, and video games), I wanted “swanky” illustrations of my characters.

And I was not disappointed. The first AI I used for this purpose was Gencraft. It “spat out” images that I still like very much to this day, even though the technology has improved and even free AIs such as the ubiquitous ChatGPT or DeviantArt’s built-in tool do a better job today. At the same time, it was precisely through DeviantArt that I noticed the serious rift between users who use and like machine-generated illustrations and those who insist that every picture be made by a human.

The question became even more personal for me when I came across Toolbaz.AI, an AI that, among other things, can also produce literary (or scientific) text from a short description. As someone who works with writing, I initially bristled at the idea. But later I was tempted to generate texts in English, and so it became possible for my passing whims to take shape. Of course, when I write in Bulgarian, whether an article for a site like Toest, a book I want to try publishing, or a translation, I do it “by hand”, if only because I love writing itself, even setting aside the fact that using AI-created text is to some extent a form of plagiarism. That said, machine translation from AIs such as DeepL can sometimes offer a decent rendering of more complex sentences, from which you can draw ideas for your own authorial approach.

AI: a threat to the creator

Far from everyone, however, is delighted that AI can write, and write better with each update. Opinions have emerged that machine-made text is increasingly hard to distinguish from that of an average author, and that this in turn will reduce the very value of writing. At its extreme, the hypothesis goes so far as to claim that the book will cease to exist, since everyone will be able to generate text about exactly what interests them. There are similar fears about music, and there are even apocalyptic forecasts that within a few years AI will be able to make entire films. For now it can produce short clips.

All this squarely raises the question of AI’s role in art. Does it have a place there, or should it be banned? Even if it does not, is it really possible to stop its use? Can the coming change be prevented at all?

AI is already everywhere. It was recently reported that it has invented new kinds of antibiotics, an important event for medicine given rising antibiotic resistance, which threatens to return humanity to the dark times before penicillin. During a recent MotoGP race I heard a startling comment about how the Ducati team uses AI to set up its bike. And although horrifying and undeniably disturbing cases appear in the news, for example of AI encouraging suicide, I personally can say that in difficult moments I have used it as an outlet, and that has kept me “above water”.

But when it comes to culture, creators are at daggers drawn. Some argue that AI must be banned, or else it will take the bread out of people’s mouths.

Of course, there is also a counter-thesis.

With AI, art is democratized. Now anyone can create a work that “clothes” their ideas decently. As an author, I can say this is a great relief, since working together with a human artist pits the egos of two creators against each other, and that is sometimes taxing. For example, you were inspired for a character in your book by another character (a so-called expy, common among young writers still searching for their voice). You share this with the illustrator. He tells you that the character who inspired you is dumb. Ergo, so is yours, or at least that is how your pride feels it. Sometimes there is also an openly dismissive attitude, which you learn about indirectly from chats between the artist and the editor (“What nonsense has this guy written”). A writer I know claims this is a legacy of communism, when artists received higher fees.

However, I have also come across books by Western authors whose covers totally ignore the description of a character in the text. A glaring example is the novel Ghost of a Chance by Simon R. Green, in which the protagonist is described as having long hair like a rock star. But on the cover he has short hair, with just one unruly lock.

The machine removes all the tension between author and illustrator. I am convinced that the other side would feel similarly: you have drawn a character, you have an idea of it, but along comes the all-knowing author and tells you that it, your idea, is complete nonsense. The robot would not do that; it will write whatever you want, and reasonably well. It is a separate matter that I, at least, use its services as an illustrator for non-commercial purposes, but there are fellow publishers who already create covers for their books this way. For an artist it will be harder to find a use for a text generator and to make money from it. But not impossible: they could use it to create a description summarizing the plot to accompany their illustration on a site where they sell their work.

Too convenient and too fast

AI can also significantly speed up the creative process. I own a fan site for fantasy and science fiction called “Цитаделата” (The Citadel), where I sometimes post quick news from the fantasy and horror genres. A short synopsis of a book coming out in English can be translated quickly and fairly competently by ChatGPT (though it is good to “sweep up” after it). Or say you have run a poll about a fight between characters from different films. AI can quickly write you a fanfic [from fan fiction: works by non-professionals inspired by other works; editor’s note] based on the voting result, so that you can publish it together with the news. Otherwise writing a story would take at least several days.

But creativity is a muscle that atrophies if it is not used. If you start leaning on AI too often, you will lose your own voice. You will get out of the habit of writing; it will start to seem tedious. Why should I struggle when the machine will do it, and probably no worse than I would? But if you do not feel like writing, why did you become a writer at all?

Recently an artist I know announced that he was opening a second profile to post AI-generated images. His fans were horrified, precisely because they feared that what is unique in his style would disappear, even if the pictures were technically much better.

The copyright problems are also serious.

According to experts, AI systems have been trained through the greatest theft of intellectual labor known to humanity, including books posted on pirate sites. This seriously harms and infuriates authors: someone has struggled for years to build a style of their own, and now the machine can replicate it, and anyone who uses the machine can imitate that style successfully.

Copyright violations by the people who use AI are no small problem either. This most often happens through the sale of images and clips featuring characters from popular franchises (Superman, Spider-Man, Harry Potter, The Lion King, and so on). These pictures and videos, sometimes with content that departs from the original creators’ intent, are sold to fans. But the rights to them belong neither to the people who made them nor to the AI, but to the copyright holders. Often, however, the sites that host such works turn a blind eye, and if they intervene at all, it is only after a report.

At the same time, AI-generated illustrations may not be sold at all, but serve solely to express the dreams and wishes of their creator, which otherwise could not be realized, if only because the person who thought of them feels awkward asking a professional artist to illustrate, for example, a passionate kiss between their favorite characters. AI can also be used to create original images whose subject matter departs from publicly accepted notions of decency, such as erotic content. In this way, people with particular fantasies and fetishes can realize them visually or in text without the embarrassment of sharing them with a living person. This is a kind of freedom that will in all likelihood strike many as excessive, but for me it is rather a plus, provided, of course, that nothing illegal is done.

Personal experience

As an author of books and a consumer of AI-created content, I can share a twofold perspective. My opinion as an author is that if a person loves to write, they will not give it up because of AI. They may use it as an ally, as an editor, for help structuring a text, for questions or ideas, but if they love to write, they will write. Whether they do it more skillfully than the machine depends on their ability. Perhaps I am reasoning a little selfishly, but for myself I can say that the texts I type on the keyboard have a specific style that (for now) I cannot achieve with a prompt [an instruction; editor’s note] given to AI. At the same time, AI allows me to realize ideas that I otherwise would not sit down to develop in detail. This acts as preventive care for the mind, even if you later delete the generated text.

As a fan, I believe good works can be created with AI, at least in the visual arts. The most serious criticism of them is that they have no soul, but for me that is not quite true. In some of the works by people I follow, I have seen messages, ideas, and feelings that they could not have expressed so successfully without AI’s help. This does not stop me from appreciating the works of those who create by hand and devote care to it. In my view they will never lose their value. That will happen to mid-level, mass-produced art, which AI will probably render pointless. Even now, though, its purpose is above all financial.

High-quality works will acquire the reputation of handmade Belgian chocolate:

expensive, rare, but worth it. In a sense they will become even more important, since AI, at least for now, does not create anything new but only shapes ideas that already exist in the human mind. This divine game cannot yet be simulated. Perhaps when an artificial general intelligence appears, one that arrives at Cogito, ergo sum (“I think, therefore I am”), this will become possible. But as of today the human mind remains irreplaceable.

It has simply acquired one more tool to work with.

Powering Partner Gateway metrics with Apache Pinot

Post Syndicated from Grab Tech original https://engineering.grab.com/pinot-partnergateway-tech-blog

Introduction

Grab operates as a dynamic ecosystem involving partners and various service providers, necessitating real-time intelligence and decision-making for seamless integration and service delivery. To facilitate this, GrabDeveloper serves as Grab’s centralized platform for developers and partners. It supports API integration, partner onboarding, and product management. It also provides tech support through staging and production portals with detailed documentation.

Working alongside Developer Home, Partner Gateway acts as Grab’s secure interface for exposing APIs to third-party entities. It enables seamless interactions between Grab’s hosted services and external consumers, such as mobile apps, web browsers, and partners. Partner Gateway enhances the experience by offering advanced metrics tracking through time-series charts and dashboards. Partner Gateway delivers actionable insights that ensure high performance, reliability, and user satisfaction in application integrations with Grab services.

Use cases

Let’s explore GrabDeveloper integration use cases with one of our partners, whom we’ll refer to as “Alpha.” Alpha is a company that specializes in producing and distributing a diverse range of perishable goods. To optimize their operations, time-series charts tracking API traffic request status codes and average API response times play a crucial role.

API traffic request service status codes chart

Time-series charts tracking API traffic request status codes offer valuable insights into the performance and reliability of APIs used for managing supply chain logistics, customer orders, and distribution networks. By monitoring these status codes, Alpha can promptly detect and resolve disruptions or failures in their digital systems, ensuring seamless operations and minimizing downtime.

Figure 1: API traffic chart from 5th Jan 2025 to 4th Mar 2025.

API average response times chart

Analyzing average response times helps Alpha maintain efficient communication between its systems, improving the speed and reliability of transactions and data exchanges. This proactive monitoring supports Alpha in delivering consistent, high-quality service to customers and partners, which improves operational efficiency and customer satisfaction.

Figure 2: Average response time chart from 12 Mar 2025 3am to 12 Mar 2025 3pm (Endpoints are mocked for security purposes).

Endpoint status dashboard

For Alpha, the endpoint status dashboard delivers real-time insights into API performance, enabling swift issue resolution and seamless integration with the company’s systems. The dashboard enhances service reliability, supports business operations, and ensures uninterrupted data exchange, all of which are critical for Alpha’s business processes and customer satisfaction. Furthermore, the transparency and reliability provided by the dashboard strengthen trust in the partnership, allowing Alpha to confidently rely on the integration to drive their digital initiatives and operational goals.

Figure 3: Endpoint status dashboard of express API for company Alpha. *Endpoints are mocked for security purposes.

Why choose Apache Pinot and what is it?

To accommodate these use cases, we need a backend storage system engineered for low-latency queries across a wide range of time intervals, from one-hour snapshots to 30-day retrospective analyses. A single dataset can contain up to ~6.8 billion rows of data in a 30-day period. This led us to choose Apache Pinot for these use cases, a distributed Online Analytical Processing (OLAP) system designed for low-latency analytical queries on large-scale data with millisecond query latencies.

Apache Pinot is a real-time distributed OLAP datastore designed to deliver low-latency analytics on large-scale data. It is optimized for high-throughput ingestion and real-time query processing making it ideal for scenarios such as user-facing analytics, dashboards, and anomaly detection. Apache Pinot supports complex queries, including aggregations and filtering. It delivers sub-second response times by leveraging techniques like columnar storage, indexing, and data partitioning to achieve efficient query execution.

Data ingestion process

Figure 4: Data ingestion process.
  1. API call initiation: An API call is made on the partner application and routed through the Partner Gateway.
  2. Metric tracking: Dimensions such as client ID, partner ID, status code, endpoint, metric name, timestamp, and value (which is the metric) are tracked and uploaded to Datadog, a cloud-based monitoring platform.
  3. Kafka message transformation: Within the partner gateway code, an Apache Kafka Producer converts these metrics into Kafka messages and stores them in a Kafka Topic. Grab utilizes Protobuf for serialization and deserialization of Kafka messages. Since Grab’s Golang Kafka ecosystem does not use the Confluent Schema Registry, Kafka messages must be serialized with a magic byte which indicates that they are using Confluent’s Schema Registry, followed by the Schema ID.
  4. Serialization via Apache Flink: Serialization is managed using Apache Flink, an open-source stream processing framework. This ensures compatibility with the Confluent Schema Registry Protobuf Decoder plugin on Apache Pinot. The messages are then written to a separate Kafka Topic.
  5. Ingestion to Apache Pinot: Messages from the Kafka Topic containing the magic byte are ingested directly into Pinot, which references the Confluent Schema Registry to accurately deserialize the messages.
  6. Query execution: Queries on the Pinot table can be executed via the Pinot Rest Proxy API.
  7. Data visualization: Users can view their project charts and dashboards on the GrabDeveloper Home UI, where data points are retrieved from queries executed in step 6.
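
The magic-byte framing mentioned in steps 3 to 5 can be sketched as follows. This is a minimal illustration of the Confluent wire format (one zero byte, then a 4-byte big-endian schema ID, then the payload), not Grab's actual Flink job. The function names are hypothetical, and the single `0x00` message-index byte assumes the payload is the first message type in the Protobuf schema.

```python
import struct

MAGIC_BYTE = 0  # Confluent wire-format marker


def frame_confluent(schema_id: int, payload: bytes,
                    message_indexes: bytes = b"\x00") -> bytes:
    """Prefix an already-serialized Protobuf payload with the Confluent header.

    Layout: 1 magic byte, 4-byte big-endian schema ID, message indexes, payload.
    The default message_indexes value is an assumption (first message in the schema).
    """
    header = struct.pack(">BI", MAGIC_BYTE, schema_id)
    return header + message_indexes + payload


def parse_confluent(message: bytes) -> tuple[int, bytes]:
    """Return (schema_id, remainder) where remainder is everything after the 5-byte header."""
    if len(message) < 5:
        raise ValueError("message too short for a Confluent header")
    magic, schema_id = struct.unpack(">BI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("missing Confluent magic byte")
    return schema_id, message[5:]
```

A decoder that expects this framing can look up the schema by ID before deserializing, which is why unframed messages have to be rewritten to a separate topic first.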

Challenges faced

During the initial setup, we encountered significant performance challenges when executing aggregation queries on large datasets exceeding 150GB. Specifically, attempts to retrieve and process data for periods ranging from 20 to 30 days resulted in frequent timeout issues as the queries took longer than 10 seconds. This was particularly concerning as it compromised our ability to meet our Service Level Agreement (SLA) of delivering query results within 300 milliseconds. The existing query infrastructure struggled to efficiently manage the volume and complexity of data within the required timeframe, necessitating optimization efforts to improve performance and reliability.

Solution

Drawing from the insights gained on the limitations of our initial solutions, we implemented these strategic optimizations to significantly enhance our table’s performance.

Partitioning by metric name

  • Improved data locality: Partitioning the Kafka Topic by metric name ensures that related data is grouped together. When a query filters on a specific metric, Pinot can directly access the relevant partitions, minimizing the need to scan unrelated data. This significantly reduces I/O overhead and processing time.
  • Efficient query pruning: By physically partitioning data, only the servers holding the relevant partitions are queried. This leads to more efficient query pruning, as irrelevant data is excluded early in the process, further optimizing performance.
  • Enhanced parallel processing: Partitioning enables Pinot to distribute queries across multiple nodes, allowing different metrics to be processed in parallel. This leverages distributed computing resources, accelerating query execution and improving scalability for large datasets.
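
The effect of partitioning by metric name can be illustrated with a small sketch. It assumes a hypothetical partition count and uses CRC32 purely as a stand-in for a stable hash; it is not the partition function that the Kafka producer or Pinot actually uses.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_for(metric_name: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a metric name to a partition using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(metric_name.encode("utf-8")) % num_partitions


def group_by_partition(records):
    """Group records so that all rows for one metric land in the same partition."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_for(record["metric"])].append(record)
    return dict(partitions)


def partitions_to_scan(metric_name: str) -> list[int]:
    """A query filtering on one metric only needs the partition that metric hashes to."""
    return [partition_for(metric_name)]
```

Because every row for a given metric hashes to the same partition, a query that filters on that metric can skip all other partitions, which is the pruning behavior described above.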

Column based on aggregation intervals

Table 1
  • Facilitates time-based aggregations: Rounded time columns (e.g., Timestamp_1h for hourly intervals) group data into coarser time buckets, enabling efficient aggregations such as hourly or daily metrics. This simplifies indexing and optimizes storage by precomputing aggregates for specific time intervals.
  • Efficient data filtering: Rounded time columns allow for precise filtering of data within specific aggregation intervals. For example, the query SELECT SUM(Value) FROM Table WHERE Timestamp_1h = '2025-01-20 01:00:00' can exclude irrelevant columns (e.g., column 2) and focus only on rows within the specified time interval, further enhancing query efficiency.
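
As a rough sketch of how a rounded time column such as Timestamp_1h could be derived, the snippet below floors an epoch-millisecond timestamp to its hourly bucket. The function names are hypothetical, and treating timestamps as UTC epoch milliseconds is an assumption; the article does not say where or how the rounding is performed.

```python
from datetime import datetime, timezone

HOUR_MS = 3_600_000  # milliseconds in one hour


def round_down_to_hour(epoch_ms: int) -> int:
    """Floor an epoch-millisecond timestamp to the start of its hour."""
    return epoch_ms - (epoch_ms % HOUR_MS)


def format_bucket(epoch_ms: int) -> str:
    """Render a bucket start in the 'YYYY-MM-DD HH:MM:SS' form used in the sample query."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")
```

With a column like this, an hourly filter becomes an equality check on the bucket value rather than a range scan over raw timestamps.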

Utilizing the Star-tree index in Apache Pinot

The Star-tree Index in Apache Pinot is an advanced indexing structure that enhances query performance by pre-aggregating data across multiple dimensions (e.g., D1, D2). It features a hierarchical tree with a root node, leaf nodes (holding up to T records), and non-leaf nodes that split into child nodes when exceeding T records. Special star nodes store pre-aggregated records by omitting the splitting dimension. The tree is constructed based on a dimensionsSplitOrder, dictating node splitting at each level.

Sample table configuration for Star-tree index:

"tableIndexConfig": {
  "starTreeIndexConfigs": [{
    "dimensionsSplitOrder": [
      "Metric",
      "Endpoint",
      "Timestamp_1h"
    ],
    "skipStarNodeCreationForDimensions": [
    ],
    "functionColumnPairs": [
      "AVG__Value"
    ],
    "maxLeafRecords": 1
  }],
  ...
}

Configuration explanation:

  • dimensionsSplitOrder: This specifies the order in which dimensions are split at each level of the tree. The order is “Metric”, “Endpoint”, “Timestamp_1h”. This means the tree will first split by Metric, then by Endpoint, and finally by Timestamp_1h.
  • skipStarNodeCreationForDimensions: This array is empty, indicating that star nodes will be created for all dimensions specified in the split order. No dimensions are omitted from star node creation.
  • functionColumnPairs: This specifies the aggregation functions to be applied to columns when creating star nodes. The configuration includes “AVG__Value”, meaning the average of the “Value” column will be calculated and stored in star nodes.
  • maxLeafRecords: This is set to 1, indicating that each leaf node will contain only one record. If a node exceeds this number, it will split into child nodes.

Star-tree diagram

Figure 5: Star-tree Index Structure.

Components:

  • Root node (orange): This is the starting point for traversing the tree structure.
  • Leaf node (blue): These nodes contain up to a configurable number of records, denoted by T. In this configuration, maxLeafRecords is set to 1, meaning each leaf node will contain a maximum of one record.
  • Non-leaf node (green): These nodes will split into child nodes if they exceed the maxLeafRecords threshold. Since maxLeafRecords is set to 1, any node with more than one record will split.
  • Star-node (yellow): These nodes store pre-aggregated records by omitting the dimension used for splitting at that level. This helps in reducing the data size and improving query performance.

Example:

A practical way to explain the star-tree diagram is to display the star-tree documents in a table format along with the sample queries used to retrieve the data.

Table 2: Star-tree documents table

Sample queries:

SELECT SUM(Value) FROM Table:
With no group-by clause, select the Star-Node for all dimensions (document 19) to quickly obtain the aggregated result of 250 by processing just this document.

SELECT SUM(Value) FROM Table WHERE Metric = 'XYZ_Req_Count':
Select the node with XYZ_Req_Count for Metric, and the Star-Node for Endpoint and Timestamp_1h (document 12). This reduces processing to one document, returning an aggregated result of 130, instead of filtering and aggregating three documents (documents 7, 8, 9).

SELECT SUM(Value) FROM Table WHERE Timestamp_1h = '2025-01-20 00:00:00':
Select the Star-Node for Metric and Endpoint, and the node with '2025-01-20 00:00:00' for Timestamp_1h (document 16). This allows aggregation from a single document, yielding a result of 40.

SELECT SUM(Value) FROM Table GROUP BY Endpoint:
With a group-by on Endpoint, select the Star-Node for Metric and Timestamp_1h, and all non Star-Node for Endpoint (documents 13, 14, 15). Process one document per group to obtain the group-by results efficiently.
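
The lookups above can be imitated with a simplified sketch. It pre-aggregates SUM(Value) for every combination of concrete and star (`*`) dimension values, which is a full cube rather than Pinot's actual tree (a real Star-tree follows the split order and the maxLeafRecords threshold). The sample rows are hypothetical and are not the documents from Table 2.

```python
from collections import defaultdict
from itertools import product

STAR = "*"
DIMENSIONS = ("Metric", "Endpoint", "Timestamp_1h")

# Hypothetical rows for illustration only; not the data behind Table 2.
SAMPLE_ROWS = [
    {"Metric": "A_Req_Count", "Endpoint": "/orders",
     "Timestamp_1h": "2025-01-20 00:00:00", "Value": 10},
    {"Metric": "A_Req_Count", "Endpoint": "/orders",
     "Timestamp_1h": "2025-01-20 01:00:00", "Value": 20},
    {"Metric": "B_Req_Count", "Endpoint": "/status",
     "Timestamp_1h": "2025-01-20 00:00:00", "Value": 5},
]


def build_preaggregates(rows):
    """Sum Value for every mix of concrete and star values across the dimensions."""
    sums = defaultdict(int)
    for row in rows:
        choices = [(row[dim], STAR) for dim in DIMENSIONS]
        for key in product(*choices):
            sums[key] += row["Value"]
    return dict(sums)


def lookup_sum(preagg, **filters):
    """Answer SUM(Value) with equality filters by reading one pre-aggregated entry."""
    key = tuple(filters.get(dim, STAR) for dim in DIMENSIONS)
    return preagg.get(key, 0)


def group_by_sum(preagg, dimension):
    """Answer SUM(Value) GROUP BY one dimension, with star on all other dimensions."""
    position = DIMENSIONS.index(dimension)
    result = {}
    for key, total in preagg.items():
        others_are_star = all(v == STAR for i, v in enumerate(key) if i != position)
        if key[position] != STAR and others_are_star:
            result[key[position]] = total
    return result
```

Each query reads a single pre-computed entry per result group, mirroring how the star nodes let Pinot avoid scanning and aggregating the underlying rows.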

Comparing performance after the optimization

Figure 6: Chart of query latency with and without optimization.

The graph in Figure 6 compares query performance with and without the optimizations. Query execution times are significantly reduced, as shown by the logarithmic scale values.

For the first query, which calculates the latency for a particular aggregation interval, the log scale indicates a reduction from 4.64 to 2.32, translating to a decrease in query latency from 43,713 to 209 milliseconds.

Similarly, the second query, which aggregates the sum of the latency based on the tags for a particular metric, shows a log scale reduction from 3.71 to 1.54, with query latency improving from 5,072 to 35 milliseconds. These results underscore the efficacy of the optimizations in enhancing query performance, enabling faster data retrieval and processing.
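
The log-scale values quoted above are base-10 logarithms of the latencies in milliseconds. A short sketch, using only the figures reported in this section, shows the conversion and the implied speedup factors; the helper names are hypothetical.

```python
import math

# Latencies in milliseconds as reported above: (before, after) per query.
REPORTED_LATENCIES_MS = {
    "query_1": (43_713, 209),
    "query_2": (5_072, 35),
}


def log10_latency(latency_ms: float) -> float:
    """Base-10 logarithm of a latency in milliseconds, rounded as in the chart."""
    return round(math.log10(latency_ms), 2)


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the optimized query is."""
    return before_ms / after_ms


if __name__ == "__main__":
    for name, (before, after) in REPORTED_LATENCIES_MS.items():
        print(name, log10_latency(before), log10_latency(after),
              round(speedup(before, after), 1))
```

Both optimized latencies fall under the 300-millisecond SLA mentioned earlier, while both unoptimized ones exceed it.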

Tradeoffs

Star-tree indexes in Apache Pinot are designed to significantly enhance query performance by pre-computing aggregations. This approach allows for rapid query execution by utilizing pre-calculated results, rather than computing aggregations on-the-fly. However, this performance boost comes with a tradeoff in terms of storage space.

Before implementing the Star-tree index, the total storage size for 30 days of data was approximately 192GB. With the Star-tree index, this increased to 373GB, nearly doubling the storage requirements. Despite the increase in storage, the performance benefits substantially outweigh the costs associated with additional storage.

The cost impact is relatively minor. We use AWS gp3 EBS volumes, so the extra storage costs roughly USD 14.48 per month (calculated as 0.08 USD x 181 GB). This cost is insignificant compared to the substantial gains in query performance. Alternatively, precomputing the metrics via an ETL job is also feasible; however, it is less cost-effective due to the additional expenses required to maintain the pipeline.
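
The storage arithmetic can be reproduced in a few lines. The sketch below uses only the sizes and the 0.08 USD rate quoted above; the function names are hypothetical.

```python
def extra_storage_gb(before_gb: float, after_gb: float) -> float:
    """Additional storage consumed after enabling the index."""
    return after_gb - before_gb


def extra_monthly_cost_usd(before_gb: float, after_gb: float,
                           usd_per_gb_month: float) -> float:
    """Monthly cost of the additional storage, rounded to cents."""
    return round(extra_storage_gb(before_gb, after_gb) * usd_per_gb_month, 2)


def growth_factor(before_gb: float, after_gb: float) -> float:
    """Ratio of storage after versus before, rounded to two decimals."""
    return round(after_gb / before_gb, 2)
```

With the reported 192 GB and 373 GB, this gives 181 GB of extra storage and the 14.48 USD monthly figure.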

The decision to use Star-tree indexes is justified by the dramatic improvement in query speed, which enhances user experience and efficiency. The modest increase in storage costs is a worthwhile investment for achieving optimal performance.

Conclusion

In conclusion, Grab’s integration of Apache Pinot as a backend solution within the Partner Gateway represents a forward-thinking strategy to meet the evolving demands of real-time analytics. Apache Pinot’s ability to deliver low-latency queries empowers our partners with immediate, actionable insights into API performance that enhances their integration experience and operational efficiency. This is crucial for partners who require rapid data access to make informed decisions and optimize their services.

The adoption of Star-tree indexing within Pinot further refines our analytics infrastructure by strategically balancing the trade-offs between query latency and storage costs. This optimization ensures Partner Gateway can support a diverse range of use cases with sub-second query latencies while maintaining high performance and reliability in service delivery, reinforcing Grab’s commitment to delivering superior performance across its ecosystem.

Ultimately, the integration of Apache Pinot enhances Grab’s real-time analytics capabilities while empowering the company to drive innovation and consistently deliver exceptional service to both partners and users.

Credits to Manh Nguyen from the Coban Infrastructure Team, Michael Wengle from the Midas Team and Yuqi Wang from the DevHome team.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How to accelerate security finding reviews using automated business context validation in AWS Security Hub

Post Syndicated from Reetesh Surjani original https://aws.amazon.com/blogs/security/how-to-accelerate-security-finding-reviews-using-automated-business-context-validation-in-aws-security-hub/

Security teams must efficiently validate and document exceptions to AWS Security Hub findings, while maintaining proper governance. Enterprise security teams need to make sure that exceptions to security best practices are properly validated and documented, while development teams need a streamlined process for implementing and verifying compensating controls.

In this blog post, we show you an automated solution that’s ideal for organizations using AWS Security Hub that need to manage security exceptions at scale while maintaining governance controls. It’s particularly valuable for enterprises that have complex compliance requirements and multiple development teams. By implementing this solution, you can accelerate the Security Hub findings review process while maintaining proper security governance and providing clear business context for security exceptions.

Note: The solution in this post is provided as a reference architecture and should not be implemented as-is in production environments. Organizations must thoroughly review, customize, and enhance this solution to align with their specific security requirements, compliance frameworks, governance policies, and risk tolerance. Engage with your security, compliance, and legal teams before deploying this automated security validation solution.

The challenge

Security Hub provides a comprehensive view of your AWS security posture across AWS accounts. However, in real-world scenarios, you’ll encounter legitimate business reasons for exceptions to security best practices. For example:

Managing exceptions to security best practices can be challenging and typically involves multiple steps. Security teams spend significant time reviewing exception requests and defining and validating compensating controls, and developers must then implement and validate those controls. Multiple teams must be included to create and manage documentation for compliance and audit purposes. Overall, this process, if done manually, is time-intensive, error-prone (with a risk of missing implementation issues), and has a risk of poor visibility because of limited or missing documentation of the business context in the security findings.

Solution prerequisites

For this solution, you must have the following elements in place:

aws securityhub enable-security-hub

  • AWS Config is recommended for enhanced validation capabilities
aws configservice put-configuration-recorder \
    --configuration-recorder name=default,roleARN=arn:aws:iam::ACCOUNT_ID:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig

Automated validation

The solution includes a pre-deployment validation script (validate-environment.sh) that automatically verifies the following:

  • Tool versions and installations
  • AWS service enablement status
  • Resource conflicts

This validation runs automatically during deployment (it is integrated into the deploy.sh script) to help make sure that required prerequisites are met before infrastructure creation begins.

Additional resources

See the Cost Estimation Guide for a detailed pricing breakdown of prerequisites and the Troubleshooting Guide for common setup issues and solutions.

Solution overview

This solution provides sample code and CloudFormation templates that organizations can deploy to automate the validation of compensating controls for suppressed Security Hub findings while maintaining proper segregation of duties between the security and development teams.

Architecture

Figure 1: Solution architecture diagram

Figure 1 illustrates the solution workflow that’s initiated when a developer changes a Security Hub finding’s workflow status to SUPPRESSED to request a business-justified security exception. The process concludes with the solution adding validation results as notes to the respective Security Hub finding, maintaining a complete audit trail of the exception request and validation outcome.
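As a rough illustration of the trigger described above, the following Python sketch pulls suppressed findings out of an EventBridge event delivered through Amazon SQS. It assumes the Lambda function receives standard SQS records whose body is a Security Hub Findings - Imported event. The pattern constant, the function name, and the sample event are hypothetical and are not taken from the solution's code.

```python
import json

# Hypothetical EventBridge pattern for the rule; the solution's actual rule may differ.
EVENT_PATTERN = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {"findings": {"Workflow": {"Status": ["SUPPRESSED"]}}},
}


def suppressed_findings(sqs_event):
    """Return the findings in an SQS batch whose workflow status is SUPPRESSED."""
    matches = []
    for record in sqs_event.get("Records", []):
        event = json.loads(record["body"])  # EventBridge event forwarded through SQS
        for finding in event.get("detail", {}).get("findings", []):
            if finding.get("Workflow", {}).get("Status") == "SUPPRESSED":
                matches.append(finding)
    return matches


# Made-up batch containing one suppressed finding and one new finding.
sample = {"Records": [{"body": json.dumps({
    "source": "aws.securityhub",
    "detail-type": "Security Hub Findings - Imported",
    "detail": {"findings": [
        {"Id": "finding-1", "Workflow": {"Status": "SUPPRESSED"}},
        {"Id": "finding-2", "Workflow": {"Status": "NEW"}},
    ]},
})}]}
print([f["Id"] for f in suppressed_findings(sample)])
```

Re-checking the status inside the function is a defensive choice: it keeps the handler correct even if the rule's pattern is later broadened.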

Note: Before initiating this workflow, developers must first consult with their organization’s security team to explain their business justification for the exception. During this initial consultation, the security team defines required compensating controls for the finding type. The security team uses the add-controls-role-based.sh script to add controls to DynamoDB. A developer enables the required compensating controls before proceeding with the workflow status change.

The workflow shown in Figure 1 includes the following steps:

  1. A developer changes the Security Hub finding status to SUPPRESSED.
  2. EventBridge detects the status change to SUPPRESSED.
  3. An EventBridge rule sends an event to the Amazon SQS queue.
  4. A Lambda function retrieves messages from the Amazon SQS queue.
  5. The Lambda function fetches compensating controls from the DynamoDB compensating controls table.
  6. The Lambda function validates each control using the appropriate AWS services APIs.
  7. Evidence is collected for each validation and stored in DynamoDB.
  8. Findings validation results and timestamps are stored in the DynamoDB Findings table.
  9. A versioned history of finding validation attempts is stored in the DynamoDB History table.
  10. If the controls provided by the security team pass validation, the finding remains SUPPRESSED, and a note with the adjusted severity information is added to the respective Security Hub finding. If any of these controls fails validation, the finding status is changed to NOTIFIED, and a note listing the failed controls is added to the respective Security Hub finding. In both cases, the original severity assigned by Security Hub isn’t changed by this solution.
  11. OPTIONAL: Extend the solution with Amazon OpenSearch for SOC teams to perform advanced search, correlation, and visualization of validation evidence across findings, and historical trend analysis of compensating control effectiveness. Use Amazon QuickSight for visualization of compliance metrics, and AWS Security Lake to centralize validation data across multiple accounts and Regions, standardizing it in OCSF format for comprehensive cross-account analysis and long-term compliance reporting.
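The pass/fail rule in step 10 can be sketched as a small pure function. This is a minimal illustration and not the solution's Lambda code: the function name, the shape of the result records, and the note wording are assumptions, and so is the choice to treat an empty control list as a failure.

```python
def decide_outcome(results, adjusted_severity):
    """Map per-control validation results to a workflow status and a note.

    results: list of dicts with 'controlId' (str) and 'passed' (bool).
    """
    if not results:
        # Assumption: with no controls defined, nothing justifies the suppression.
        return "NOTIFIED", "No compensating controls are defined for this finding type."
    failed = [r["controlId"] for r in results if not r["passed"]]
    if failed:
        return "NOTIFIED", "Compensating controls failed validation: " + ", ".join(failed)
    return (
        "SUPPRESSED",
        f"All compensating controls passed. Security-approved adjusted severity: {adjusted_severity}",
    )


status, note = decide_outcome(
    [
        {"controlId": "VPC-FLOW-LOGS", "passed": True},
        {"controlId": "SECURITY-ALARMS", "passed": False},
    ],
    "MEDIUM",
)
print(status, "-", note)
```

Note that the function only returns the new workflow status and note text; it never touches the finding's original severity, which matches the behavior described in the steps.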

Note: This solution should be deployed in accordance with your organization’s security policies and the AWS Shared Responsibility Model. Review and test security controls before deploying in production environments.

How it works

This solution is designed exclusively for deployment and management by organizational security teams. Only security teams should have permissions to deploy the AWS CloudFormation stack, modify Lambda validation code, add/modify compensating controls, or access the four DynamoDB tables (Controls, Findings, History, Evidence).

Developers are restricted to two specific actions: suppressing Security Hub findings and reading compensating control requirements. This strict role separation facilitates proper governance and helps prevent bypass of security validation logic. Organizations must implement appropriate IAM policies to enforce these access restrictions in production environments.

Here’s how the solution works:

  1. The security team defines controls: The security team establishes compensating controls for specific Security Hub finding types and stores them in a DynamoDB table. This helps make sure that approved exceptions follow security-approved guidelines and maintain compliance standards.
    • Key files for security teams:
      File | Purpose
      add-controls-role-based.sh | Utility script for adding compensating controls
      /templates/findings/*.json | Example compensating controls for reference
      /docs/guides/compensating-controls.md | Guide for defining controls
    • Supported validation types: The solution supports 13 validation methods to accommodate diverse security requirements:
    Validation type | Description | Example use case
    CONFIG_RULE | Validates using AWS Config rules | For GuardDuty not enabled finding: vpc-flow-logs-enabled Config rule helps make sure that network traffic is monitored
    API_CALL | Validates using direct AWS API calls | For Amazon S3 public access finding: API call to verify CloudFront distribution exists in front of the S3 bucket
    SECURITY_HUB_CONTROL | Validates using Security Hub control status | For GuardDuty not enabled finding: CloudTrail.1 control passing confirms comprehensive API logging
    CLOUDWATCH | Validates using CloudWatch alarms | For GuardDuty not enabled finding: Alarms monitoring for suspicious API calls and network traffic patterns
    CLOUDTRAIL | Validates CloudTrail configuration | For GuardDuty not enabled finding: Multi-Region CloudTrail with log validation and CloudWatch integration
    SYSTEMS_MANAGER | Validates using Systems Manager parameters | For GuardDuty not enabled finding: Parameter confirming custom threat detection solution is enabled
    PROCESS_CONTROL | Validates process-based controls | For GuardDuty not enabled finding: Documented incident response process for network security events
    INSPECTOR | Validates Amazon Inspector configuration | For vulnerability finding: Inspector EC2 scanning enabled with zero critical findings allowed
    ACCESS_ANALYZER | Validates AWS IAM Access Analyzer | For IAM permission finding: IAM Access Analyzer enabled with zero active findings allowed
    MACIE | Validates Amazon Macie configuration | For data protection finding: Macie enabled with sensitive data discovery and zero sensitive buckets allowed
    AUDIT_MANAGER | Validates AWS Audit Manager frameworks | For compliance finding: Custom security framework active with required control sets
    EVENTBRIDGE | Validates EventBridge rules | For GuardDuty not enabled finding: Rules monitoring AWS CloudTrail events with Lambda targets for automated response
    TRUSTED_ADVISOR | Validates AWS Trusted Advisor checks | For security best practice finding: S3 bucket permissions check passing with zero warnings or error resources
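One way to organize handlers for the validation types listed above is a registry keyed by type name. The sketch below is an assumption about structure, not the solution's implementation. Only two types are shown, the facts dictionary stands in for data the real solution would fetch from AWS APIs, and the documentReference parameter is a hypothetical name.

```python
VALIDATORS = {}


def validator(validation_type):
    """Decorator that registers a handler for one validation type."""
    def register(func):
        VALIDATORS[validation_type] = func
        return func
    return register


@validator("CONFIG_RULE")
def validate_config_rule(params, facts):
    # facts["config_rules"] stands in for compliance data from AWS Config.
    return facts.get("config_rules", {}).get(params["ruleName"]) == "COMPLIANT"


@validator("PROCESS_CONTROL")
def validate_process_control(params, facts):
    # documentReference is a hypothetical parameter used only for illustration.
    return params.get("documentReference") in facts.get("approved_documents", set())


def validate_control(control, facts):
    """Run the registered handler; unknown validation types fail closed."""
    handler = VALIDATORS.get(control["validationType"])
    if handler is None:
        return False
    return handler(control["validationParams"], facts)


facts = {"config_rules": {"vpc-flow-logs-enabled": "COMPLIANT"}}
control = {
    "controlId": "VPC-FLOW-LOGS",
    "validationType": "CONFIG_RULE",
    "validationParams": {"ruleName": "vpc-flow-logs-enabled"},
}
print(validate_control(control, facts))
```

Failing closed on an unrecognized type means a typo in a control definition leads to a NOTIFIED finding instead of a silently accepted exception.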

    Note: Only security team members have access to add or modify compensating controls. The solution enforces this through IAM permissions and runtime checks to maintain proper governance.

    Approved security exceptions must have an expiration date to facilitate periodic review. The solution automatically enforces these time limits based on the expiration date defined by the security team.
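A minimal sketch of such an expiration check follows. It assumes the expiration date is stored as a UTC timestamp in the form 2026-12-31T00:00:00Z; the function name is hypothetical, and how the solution actually enforces the limit is not shown in this post.

```python
from datetime import datetime, timezone


def is_expired(expiration_date, now=None):
    """Return True once the exception's expiration timestamp has been reached.

    expiration_date is assumed to use the 'YYYY-MM-DDTHH:MM:SSZ' form.
    """
    expires = datetime.strptime(expiration_date, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return now >= expires


# Fixed reference times keep the example deterministic.
before = datetime(2026, 12, 30, tzinfo=timezone.utc)
after = datetime(2027, 1, 1, tzinfo=timezone.utc)
print(is_expired("2026-12-31T00:00:00Z", now=before))
print(is_expired("2026-12-31T00:00:00Z", now=after))
```

Comparing timezone-aware values avoids an off-by-hours error when the code runs in a Region whose local time differs from UTC.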

    For this post, we provide a utility script (add-controls-role-based.sh) to demonstrate adding compensating controls. However, in a production enterprise environment, organizations should integrate DynamoDB with their existing governance systems (such as Jira, ServiceNow, and so on) to automatically populate controls from authorized security team sources. This solution focuses on validating controls, not prescribing how they’re ingested.

    2. Developers implement controls: When Security Hub findings are suppressed, developers must implement the required compensating controls defined by the security team.

    How developers interact with the solution:

    1. View required controls: The solution provides clear requirements for each finding type.
    2. Implement compensating controls: Developers implement the compensating controls defined by the security team in their AWS environment. The specific controls depend on the finding type and the security team's requirements.
    3. Finding status change: Developers change the Security Hub finding status to SUPPRESSED in Security Hub.
    4. Automatic validation: The solution validates the compensating controls when the workflow status of a Security Hub finding is changed to SUPPRESSED.
    5. Status updates: Findings remain SUPPRESSED if controls pass validation; they change to NOTIFIED with failure details if validation fails.

    Note: This solution doesn’t modify the original severity of findings in Security Hub. Instead, it adds business context to each finding, including the security-approved adjusted severity, based on validation of the security-approved compensating controls. This helps security teams make informed decisions.

    For this solution, we’re simulating the developer workflow of addressing Security Hub findings by implementing and validating compensating controls. In a production environment, developers would receive notifications about findings that require attention, implement the necessary controls according to security team guidance, and use this validation system to verify their implementations. The solution focuses on the validation aspect but assumes organizations will integrate it with their existing developer workflows, ticketing systems, and continuous integration and delivery (CI/CD) pipelines to create a seamless process from finding detection to remediation verification.

    Evidence collection and audit trail

    The solution automatically captures comprehensive evidence for each validation activity. The key features of the solution are:

    1. Four-table design: Separate tables for Controls, Findings, History, and Evidence (shown in Figure 2) provide security through segregation while maintaining a complete audit trail

      Figure 2: The four-table design for storing compensating controls, evidence, findings, and history

    2. Detailed evidence: Each validation stores specific evidence based on its type, from AWS Config rule compliance details to API responses and process documentation verification
    3. Immutable records: Each evidence record includes timestamps, validation context, and results that cannot be modified after collection (shown in Figure 3)

      Figure 3: Sample evidence collected for a CONFIG_RULE validation showing PASSED status

    4. Historical tracking: The solution maintains a complete history of each validation attempt, allowing organizations to demonstrate continuous compliance over time
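One common way to get write-once behavior in DynamoDB is a conditional put that refuses to overwrite an existing key. The sketch below only builds the request parameters and makes no AWS call. The table name, attribute names, and ID scheme are all hypothetical, and the post does not say how the solution enforces immutability.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_evidence_put(table_name, finding_id, control_id, result, details, now=None):
    """Build keyword arguments for a DynamoDB PutItem call that never overwrites."""
    timestamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Hypothetical ID scheme: hash of finding, control, and collection time.
    evidence_id = hashlib.sha256(
        f"{finding_id}|{control_id}|{timestamp}".encode()
    ).hexdigest()
    return {
        "TableName": table_name,
        "Item": {
            "evidenceId": {"S": evidence_id},
            "findingId": {"S": finding_id},
            "controlId": {"S": control_id},
            "result": {"S": result},
            "collectedAt": {"S": timestamp},
            "details": {"S": json.dumps(details, sort_keys=True)},
        },
        # Write-once guard: the put fails if an item with this key already exists.
        "ConditionExpression": "attribute_not_exists(evidenceId)",
    }


put_args = build_evidence_put(
    "EvidenceTable",
    "finding-1",
    "VPC-FLOW-LOGS",
    "PASSED",
    {"ruleName": "vpc-flow-logs-enabled", "compliance": "COMPLIANT"},
    now=datetime(2025, 8, 5, 8, 49, 51, tzinfo=timezone.utc),
)
print(put_args["Item"]["collectedAt"]["S"], put_args["ConditionExpression"])
```

A conditional put blocks accidental overwrites from the application. Protecting records from deletion or direct edits would still depend on IAM permissions for the table.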

    Deployment and configuration

    You can deploy the solution using the provided scripts.

    1. Use the following commands to clone the repository and change to its directory:
      git clone https://github.com/aws-samples/sample-automated-securityhub-validator.git
      cd sample-automated-securityhub-validator

    2. Use the following commands to check service quotas and to create the security team and developer roles:
      cd scripts
      ./create-roles-quotas-check.sh

    3. Use the following command to assume the security team role:
    aws sts assume-role --role-arn arn:aws:iam::ACCOUNT_ID:role/securityhub-validator-SecurityTeamRole --role-session-name SecurityTeamSession

    In the preceding command’s output, note the AccessKeyId, SecretAccessKey, and SessionToken. The timestamp in the Expiration field is in the UTC time zone and shows when the IAM role’s temporary credentials expire. After the temporary credentials expire, the user must assume the role again.

    Note: For temporary credentials, you can use the DurationSeconds parameter (--duration-seconds in the AWS CLI) to request a longer session, up to the maximum session duration configured for the IAM role.

    4. Create environment variables to assume the security team role and verify that the user assumed the IAM role:
      • Run the following commands to set the environment variables to assume the IAM role:
      export AWS_ACCESS_KEY_ID=RoleAccessKeyID
      export AWS_SECRET_ACCESS_KEY=RoleSecretKey
      export AWS_SESSION_TOKEN=RoleSessionToken

      Note: Replace the example values with the values that you noted when you assumed the IAM role. On Windows, replace export with set.

      • Run the get-caller-identity command to verify that the user assumed the IAM role:

      aws sts get-caller-identity

      Note: In the preceding command’s output, confirm that the ARN is arn:aws:sts::ACCOUNT_ID:assumed-role/securityhub-validator-SecurityTeamRole/SecurityTeamSession instead of arn:aws:iam::ACCOUNT_ID:user/username.
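The manual copy step above can also be scripted. The following sketch assumes the JSON printed by aws sts assume-role has been captured as a string; it turns that output into export lines and checks whether a caller ARN belongs to the expected assumed role. The helper names and sample values are hypothetical.

```python
import json
import shlex


def export_lines(assume_role_output):
    """Turn the JSON output of 'aws sts assume-role' into shell export lines."""
    creds = json.loads(assume_role_output)["Credentials"]
    mapping = {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }
    # shlex.quote protects values that contain shell-special characters.
    return [f"export {name}={shlex.quote(value)}" for name, value in mapping.items()]


def is_assumed_role(arn, role_name):
    """Check for arn:aws:sts::ACCOUNT_ID:assumed-role/ROLE_NAME/SESSION_NAME."""
    parts = arn.split(":", 5)[-1].split("/")
    return len(parts) == 3 and parts[0] == "assumed-role" and parts[1] == role_name


# Made-up credentials in the shape returned by STS.
sample_output = json.dumps({"Credentials": {
    "AccessKeyId": "ASIAEXAMPLEKEYID",
    "SecretAccessKey": "exampleSecretKey",
    "SessionToken": "exampleSessionToken",
    "Expiration": "2025-08-05T09:49:51Z",
}})
for line in export_lines(sample_output):
    print(line)
print(is_assumed_role(
    "arn:aws:sts::ACCOUNT_ID:assumed-role/securityhub-validator-SecurityTeamRole/SecurityTeamSession",
    "securityhub-validator-SecurityTeamRole",
))
```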

      5. Use the following command to deploy the solution:
      cd scripts
      ./deploy.sh

      6. You can verify that the stack has been created by going to the AWS Management Console for CloudFormation and using the following steps:
        1. In the CloudFormation console, choose Stacks and then Stack details in the navigation pane.
        2. Locate and select the stack securityhub-validator to open its details page.
        3. On the stack details page, select the Resources tab.
        4. In the Resources section, you’ll see a list of the resources that are part of the stack.
      Figure 4: Resources created using the CloudFormation stack

      The deployment script creates a CloudFormation stack with the necessary resources:

      • DynamoDB tables for controls, findings, history, and evidence
      • A Lambda function for validation and Security Hub updates
      • An EventBridge rule for capturing finding status changes
      • An Amazon SQS queue and dead letter queue (DLQ) for message processing
      • IAM roles with least privilege permissions
      7. Add compensating controls (security team):
      cd scripts
      ./add-controls-role-based.sh

      8. Implement controls (developers).

      Now, a developer will assume the developer role and implement the required controls based on the security team’s specifications. The solution automatically validates these implementations when the Security Hub finding workflow status is changed to SUPPRESSED by a developer.

      For example implementations of common controls, see the example compensating controls for the GuardDuty.1 finding.

      Test the solution

      To test the solution, you can validate the compensating controls for a GuardDuty finding using the following example scenario:

      A developer wants a security exception for the Security Hub finding GuardDuty.1: GuardDuty should be enabled. Because of cost constraints, the developer’s organization hasn’t implemented GuardDuty, so the developer has requested a security exception from the organization’s security team.

      Compensating controls provided by the security team include:

      Note: To simulate this finding, do not enable GuardDuty so that the GuardDuty should be enabled finding appears in the Security Hub console.

      Approximately 20–30 minutes after enabling AWS Config and Security Hub, you can locate the finding in the console using the following steps and then add the compensating controls provided by the security team.

      For this use case, we’re using the GuardDuty should be enabled Security Hub finding:

      1. Navigate to the AWS Security Hub console and choose Findings in the navigation pane.
      2. In the Add filter search bar at the top, select Severity label and set the is value to HIGH.
      3. After applying the filter, select GuardDuty should be enabled in the Finding column to view its details in the right-hand pane.
      4. Choose Actions in the top-right corner and select View JSON.
      Figure 5: Security Hub findings

      5. In the JSON details window, locate the SecurityControlId field and note the value. You’ll be prompted to enter it by the add-controls-role-based.sh utility in the next step.

      Note: The SecurityControlId value is required by the add-controls-role-based.sh utility to properly associate your compensating control with the correct Security Hub finding.

      Figure 6: SecurityControlId from the GuardDuty finding

      6. Use the following command to clone the repository:
      git clone https://github.com/aws-samples/sample-automated-securityhub-validator.git
      cd sample-automated-securityhub-validator

      7. For this demo, you will act as a member of the security team by assuming the security team role and using the add-controls-role-based.sh utility to create compensating controls and push them to the compensating controls DynamoDB table. From the repository root, run:
      cd scripts
      ./add-controls-role-based.sh

      8. Use the following prompt values in add-controls-role-based.sh to create compensating control table entries using four compensating controls given by the security team for the GuardDuty.1 finding type:
      ./add-controls-role-based.sh
      Security Team - Compensating Controls Management Utility
      --------------------------------------------------------
      SECURITY NOTICE: This utility is restricted to security team members only
      Validating security team role...
      ✓ Security team role validated: arn:aws:sts::xxxxxxxxxxx:assumed-role/securityhub-validator-SecurityTeamRole/SecurityTeamSession
      Using AWS Region: us-east-1
      Using stack: securityhub-validator
      Using controls table: securityhub-validator-ControlsTable-ARDQCU67CBCN
      Enter finding type (e.g., GuardDuty.1): GuardDuty.1
      Security approved adjusted risk level [CRITICAL/HIGH/MEDIUM/LOW/INFORMATIONAL]: MEDIUM
      Expiration date (YYYY-MM-DD): 2026-12-31
      Ticket reference: JIRA-SEC-1234
      Business justification: Alternative monitoring solution provides equivalent detection capabilities
      Adding Control #1
      Control ID: VPC-FLOW-LOGS
      Control description: VPC Flow logs must be enabled for network monitoring 
      Validation type [CONFIG_RULE/API_CALL/SECURITY_HUB_CONTROL/INSPECTOR/ACCESS_ANALYZER/CLOUDTRAIL/MACIE/AUDIT_MANAGER/CLOUDWATCH/SYSTEMS_MANAGER/EVENTBRIDGE/TRUSTED_ADVISOR/PROCESS_CONTROL]: CONFIG_RULE
      Config rule name (exact name): vpc-flow-logs-enabled
      Description of how this rule mitigates the finding: Provides comprehensive network traffic visibility similar to GuardDuty's network monitoring capabilities
      Add another control? [y/n]: y
      Adding Control #2
      Control ID: SECURITY-ALARMS
      Control description: CloudWatch alarms for suspicious activity
      Validation type [CONFIG_RULE/API_CALL/SECURITY_HUB_CONTROL/INSPECTOR/ACCESS_ANALYZER/CLOUDTRAIL/MACIE/AUDIT_MANAGER/CLOUDWATCH/SYSTEMS_MANAGER/EVENTBRIDGE/TRUSTED_ADVISOR/PROCESS_CONTROL]: CLOUDWATCH
      Alarm name pattern: SecurityMonitoring-
      Required metrics (comma-separated): UnauthorizedAPICalls,NetworkPortProbing
      Required alarm state [ALARM/OK/INSUFFICIENT_DATA/ANY]: ANY
      Minimum number of matching alarms required: 2
      Description of how these alarms mitigate the finding: Alarms detect suspicious API calls and network activity similar to GuardDuty's threat detection
      Add another control? [y/n]: n
      Generated controls:
      {
        "findingType": {
          "S": "GuardDuty.1"
        },
        "securityApprovedAdjustedRiskLevel": {
          "S": "MEDIUM"
        },
        "expirationDate": {
          "S": "2026-12-31T00:00:00Z"
        },
        "ticketReference": {
          "S": "JIRA-SEC-1234"
        },
        "businessJustification": {
          "S": "Alternative monitoring solution provides equivalent detection capabilities"
        },
        "auditInfo": {
          "S": "{\"createdBy\":\"arn:aws:sts::xxxxxxxxxxx:assumed-role/securityhub-validator-SecurityTeamRole/SecurityTeamSession\",\"createdAt\":\"2025-08-05T08:49:51Z\",\"lastModifiedBy\":\"arn:aws:sts::xxxxxxxxxxx:assumed-role/securityhub-validator-SecurityTeamRole/SecurityTeamSession\",\"lastModifiedAt\":\"2025-08-05T08:49:51Z\"}"
        },
        "securityControlHash": {
          "S": "a0b33a0a96a6b282bad1c093586d89cef832d40bb379abd4a004d00afdf603d1"
        },
        "requiredControls": {
          "S": "[{\"controlId\":\"VPC-FLOW-LOGS\",\"description\":\"VPC Flow logs must be enabled for network monitoring\",\"validationType\":\"CONFIG_RULE\",\"validationParams\":{\"ruleName\":\"vpc-flow-logs-enabled\",\"justification\":\"Provides comprehensive network traffic visibility similar to GuardDuty's network monitoring capabilities\"}},{\"controlId\":\"SECURITY-ALARMS\",\"description\":\"CloudWatch alarms for suspicious activity\",\"validationType\":\"CLOUDWATCH\",\"validationParams\":{\"alarmNamePattern\":\"SecurityMonitoring-\",\"requiredMetrics\":[\"UnauthorizedAPICalls\",\"NetworkPortProbing\"],\"requiredState\":\"ANY\",\"minimumAlarms\":2,\"justification\":\"Alarms detect suspicious API calls and network activity similar to GuardDuty's threat detection\"}}]"
        }
      }
      Save to DynamoDB? [y/n]: y
      Compensating controls saved to DynamoDB!
      This action has been logged for audit purposes.

      9. When prompted to save to DynamoDB, enter Y. Compensating controls will be added to the DynamoDB compensating controls table.
      Figure 7: Compensating controls for GuardDuty.1 finding
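In the generated item shown earlier, every attribute is a DynamoDB string, and requiredControls and auditInfo hold JSON documents serialized into those strings. The following sketch shows one way a consumer might decode such an item. The helper name is hypothetical and the sample item is abbreviated from the output above.

```python
import json


def plain_item(item):
    """Flatten a DynamoDB item whose attributes are all strings ({"S": ...})."""
    flat = {name: value["S"] for name, value in item.items()}
    # These two attributes carry JSON documents serialized as strings.
    flat["requiredControls"] = json.loads(flat["requiredControls"])
    flat["auditInfo"] = json.loads(flat["auditInfo"])
    return flat


# Abbreviated copy of the stored item from the session above.
stored = {
    "findingType": {"S": "GuardDuty.1"},
    "securityApprovedAdjustedRiskLevel": {"S": "MEDIUM"},
    "expirationDate": {"S": "2026-12-31T00:00:00Z"},
    "auditInfo": {"S": "{\"createdAt\":\"2025-08-05T08:49:51Z\"}"},
    "requiredControls": {"S": json.dumps([
        {"controlId": "VPC-FLOW-LOGS", "validationType": "CONFIG_RULE",
         "validationParams": {"ruleName": "vpc-flow-logs-enabled"}},
        {"controlId": "SECURITY-ALARMS", "validationType": "CLOUDWATCH",
         "validationParams": {"alarmNamePattern": "SecurityMonitoring-", "minimumAlarms": 2}},
    ])},
}
decoded = plain_item(stored)
print([c["controlId"] for c in decoded["requiredControls"]])
```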

      10. For this proof-of-concept demonstration, the compensating controls implementation requires additional AWS permissions beyond what the developer role provides. In a production environment, these controls would typically be implemented by infrastructure teams or through automated deployment pipelines.
      • Switch to administrative credentials. For the demonstration, temporarily switch back to your administrative AWS credentials (the ones used to create the roles):

        # Unset the security team role credentials
        unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

      • Implement the required controls

      Control 1: Enable VPC Flow Logs, starting by getting your VPC ID:

      VPC_ID=$(aws ec2 describe-vpcs --query 'Vpcs[0].VpcId' --output text)

      Create flow logs:

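      # FLOW_LOGS_ROLE is a placeholder for an IAM role that allows VPC Flow Logs to publish to CloudWatch Logs
      aws ec2 create-flow-logs \
          --resource-type VPC \
          --resource-ids "$VPC_ID" \
          --traffic-type ALL \
          --log-destination-type cloud-watch-logs \
          --log-group-name VPCFlowLogs \
          --deliver-logs-permission-arn arn:aws:iam::ACCOUNT_ID:role/FLOW_LOGS_ROLE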

      Create the AWS Config Rule:

      aws configservice put-config-rule \
          --config-rule '{
              "ConfigRuleName": "vpc-flow-logs-enabled",
              "Source": {
                  "Owner": "AWS",
                  "SourceIdentifier": "VPC_FLOW_LOGS_ENABLED"
              }
          }'

      Control 2: Create security monitoring alarms, starting with metric filters for CloudTrail logs. First, create a log group for CloudTrail (if none exists):

      aws logs create-log-group --log-group-name CloudTrail/SecurityEvents

      Create a metric filter for unauthorized API calls:

      aws logs put-metric-filter \
          --log-group-name CloudTrail/SecurityEvents \
          --filter-name UnauthorizedAPICallsFilter \
          --filter-pattern '{ ($.errorCode = "*UnauthorizedOperation") || ($.errorCode = "AccessDenied*") }' \
          --metric-transformations metricName=UnauthorizedAPICalls,metricNamespace=SecurityMetrics,metricValue=1

      Create a filter for network port probing:

      aws logs put-metric-filter \
          --log-group-name CloudTrail/SecurityEvents \
          --filter-name NetworkPortProbingFilter \
          --filter-pattern '[version, account, eni, source, destination, srcport, destport="22" || destport="3389" || destport="1433", protocol, packets, bytes, windowstart, windowend, action="REJECT", flowlogstatus]' \
          --metric-transformations metricName=NetworkPortProbing,metricNamespace=SecurityMetrics,metricValue=1

      Create required CloudWatch alarms, starting with Alarm 1 for Unauthorized API calls:

      aws cloudwatch put-metric-alarm \
          --alarm-name "SecurityMonitoring-UnauthorizedAPICalls" \
          --alarm-description "Detects unauthorized API calls" \
          --metric-name "UnauthorizedAPICalls" \
          --namespace "SecurityMetrics" \
          --statistic Sum \
          --period 300 \
          --threshold 1 \
          --comparison-operator GreaterThanOrEqualToThreshold \
          --evaluation-periods 1

      Alarm 2: Network port probing:

      aws cloudwatch put-metric-alarm \
          --alarm-name "SecurityMonitoring-NetworkPortProbing" \
          --alarm-description "Detects network port probing activity" \
          --metric-name "NetworkPortProbing" \
          --namespace "SecurityMetrics" \
          --statistic Sum \
          --period 300 \
          --threshold 5 \
          --comparison-operator GreaterThanOrEqualToThreshold \
          --evaluation-periods 1
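To show how the two alarms relate to the SECURITY-ALARMS control defined earlier, the following sketch checks a list of alarms against that control's parameters. It assumes alarm records shaped like the MetricAlarms entries returned by the CloudWatch DescribeAlarms API and treats the alarm name pattern as a prefix. The function name and the matching rules are assumptions, not the solution's actual logic.

```python
def validate_alarms(alarms, params):
    """Check alarms against a CLOUDWATCH control's parameters.

    Returns (passed, missing_metrics).
    """
    # Assumption: the alarm name pattern is matched as a prefix.
    matching = [a for a in alarms if a["AlarmName"].startswith(params["alarmNamePattern"])]
    if params["requiredState"] != "ANY":
        matching = [a for a in matching if a["StateValue"] == params["requiredState"]]
    covered = {a["MetricName"] for a in matching}
    missing = [m for m in params["requiredMetrics"] if m not in covered]
    passed = len(matching) >= params["minimumAlarms"] and not missing
    return passed, missing


# Parameters copied from the SECURITY-ALARMS control definition.
params = {
    "alarmNamePattern": "SecurityMonitoring-",
    "requiredMetrics": ["UnauthorizedAPICalls", "NetworkPortProbing"],
    "requiredState": "ANY",
    "minimumAlarms": 2,
}
# Made-up alarm records matching the two alarms created above, plus an unrelated one.
alarms = [
    {"AlarmName": "SecurityMonitoring-UnauthorizedAPICalls",
     "MetricName": "UnauthorizedAPICalls", "StateValue": "INSUFFICIENT_DATA"},
    {"AlarmName": "SecurityMonitoring-NetworkPortProbing",
     "MetricName": "NetworkPortProbing", "StateValue": "OK"},
    {"AlarmName": "Billing-Threshold", "MetricName": "EstimatedCharges", "StateValue": "OK"},
]
print(validate_alarms(alarms, params))
```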

      11. Now assume the DeveloperRole to suppress the finding:
      aws sts assume-role \
          --role-arn arn:aws:iam::ACCOUNT_ID:role/securityhub-validator-DeveloperRole \
          --role-session-name DeveloperSession

      Configure the returned credentials:

      export AWS_ACCESS_KEY_ID=<from assume-role output>
      export AWS_SECRET_ACCESS_KEY=<from assume-role output>
      export AWS_SESSION_TOKEN=<from assume-role output>

      12. Change the workflow status of the Security Hub finding related to GuardDuty from NEW to SUPPRESSED.

      To change the workflow status using the AWS CLI (developer):

      # Get the finding ARN first (command shown for reference)
      aws securityhub get-findings \
          --filters '{"GeneratorId":[{"Value":"security-control/GuardDuty.1","Comparison":"EQUALS"}]}' \
          --query 'Findings[0].Id' \
          --output text
      # Get the product ARN (command shown for reference)
      aws securityhub get-findings \
          --filters '{"GeneratorId":[{"Value":"security-control/GuardDuty.1","Comparison":"EQUALS"}]}' \
          --query 'Findings[0].ProductArn' \
          --output text
      # Then suppress the finding
      aws securityhub batch-update-findings \
        --finding-identifiers '[{"Id":"finding-arn-from-above","ProductArn":"product-arn-from-above"}]' \
        --workflow '{"Status":"SUPPRESSED"}' \
        --note '{"Text":"Implemented compensating controls as per security team requirements","UpdatedBy":"[email protected]"}'
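For teams that script this step, the same request can be assembled in Python. The sketch below only builds the parameters in the shape the Security Hub BatchUpdateFindings API expects and makes no AWS call. The function name and sample values are hypothetical.

```python
def suppress_request(finding, note_text, updated_by):
    """Build BatchUpdateFindings parameters that set one finding to SUPPRESSED."""
    return {
        "FindingIdentifiers": [
            {"Id": finding["Id"], "ProductArn": finding["ProductArn"]}
        ],
        "Workflow": {"Status": "SUPPRESSED"},
        "Note": {"Text": note_text, "UpdatedBy": updated_by},
    }


# Placeholder finding with the two fields retrieved by the commands above.
finding = {"Id": "finding-arn-from-above", "ProductArn": "product-arn-from-above"}
request = suppress_request(
    finding,
    "Implemented compensating controls as per security team requirements",
    "developer-session",
)
print(request["Workflow"], request["FindingIdentifiers"])
```

With boto3 installed and developer credentials configured, a dictionary like this could be passed as keyword arguments to the Security Hub client's batch_update_findings method.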

      To change the workflow status using the console (developer):

      1. Go to the Security Hub console.
      2. In the navigation pane, choose Findings.
      3. In the search bar, select the Compliance Security Control ID filter and set the Is value to GuardDuty.1.
      4. Select the finding GuardDuty should be enabled and under Workflow status, select SUPPRESSED.
      5. In the Note field, enter Implemented compensating controls as per security team requirements.
      6. Choose Set status to save the note.
      Figure 8: GuardDuty.1 finding workflow status changed from NEW to SUPPRESSED

      Note: Only suppress findings after implementing the required compensating controls provided by the security team.

      13. After the workflow status of the finding is SUPPRESSED, the automated validation process begins. You can follow the validations that were performed in the Lambda function logs in the CloudWatch console.

      To view Lambda function logs in the CloudWatch console:

      1. Go to the Amazon CloudWatch console.
      2. In the navigation pane, under Logs, choose Log groups.
      3. Select the log group with the Lambda function name.
      4. Select the most recent log stream to view the logs.
      Figure 9: Lambda function CloudWatch logs

      The solution updates the note section of the findings in Security Hub with the validation results:

      If all controls pass:

      • Finding status remains SUPPRESSED.
      • A note is added with validation results and adjusted risk level.
      • Business context is added to the finding.

      If one of the controls fails:

      • Finding status changes to NOTIFIED.
      • A note is added with details about failed controls.
      • The security team reviews the changes as part of their standard process.

      To view the finding’s workflow status and updated note using the console (developer):

      1. Go to the Security Hub console.
      2. In the navigation pane, choose Findings.
      3. In the search bar, select the Compliance Security Control ID filter and set the Is value to GuardDuty.1.
      4. Select the finding GuardDuty should be enabled and check the Workflow status.
      5. For Actions, choose Add note.
      6. Check the Last note added.
      Figure 10: Security Hub updated finding note

      The finding note shows that automated validation performed its checks and documented the results. The original severity of HIGH that was assigned by Security Hub is maintained, and the adjusted severity of MEDIUM that was provided by the security team is added in the Note section and to the Evidence table. This provides transparency and accountability while preserving the original severity assigned by Security Hub.

      Clean up

      To avoid incurring ongoing charges, use the following command to clean up resources created for this post.

      ./cleanup.sh

      This deployment process is designed to be straightforward and to maintain security best practices such as encryption, least privilege, and segregation of duties.

      Conclusion

      In this post, we showed you how to implement a solution that security teams can use to define compensating controls for AWS Security Hub findings and automatically validate their implementation. We walked through the challenges of managing security exceptions and demonstrated how this solution helps to bridge the gap between security requirements and practical implementation.

      The solution provides a structured workflow where security teams define acceptable compensating controls, developers implement them, and an automated system validates their effectiveness. With support for 13 different validation types, from AWS Config rules to process documentation, the solution offers comprehensive coverage for various security scenarios.

      We also demonstrated the end-to-end process of adding compensating controls for a GuardDuty finding and showed how the solution maintains the original finding severity assigned by Security Hub while documenting the adjusted risk level approved by the security team. This approach helps maintain transparency and auditability while allowing for necessary exceptions.

      Give it a try and share your feedback in the comments section.

      Security Implication Disclaimer: The Amazon S3 configurations demonstrated in this post involve public access settings that expose data to the internet and should only be used for demonstration or non-sensitive content. Public S3 buckets carry significant risks including data exposure, unexpected costs from unauthorized usage, compliance violations, and potential security breaches. For production environments, use IAM roles, implement least privilege access policies, enable S3 Block Public Access settings, and consider CloudFront with Origin Access Control for public content delivery. Consult your security team and confirm compliance with organizational policies before implementing public S3 configurations in production systems.


      Reetesh Surjani

      Reetesh is a Delivery Consultant in Security Risk & Compliance at AWS Professional Services, based in Pune, India. He works closely with customers across diverse verticals to help strengthen their security infrastructure and achieve their security goals.

      Satish Kamat

      Satish is a Senior Delivery Consultant in Application Development at AWS Professional Services, based in Pune, India. He works closely with customers in their cloud transformation and migration journeys across various verticals like BFSI, automotive, and telecom.

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/scaling-muse-how-netflix-powers-data-driven-creative-insights-at-trillion-row-scale-aa9ad326fd77

By Andrew Pierce, Chris Thrailkill, Victor Chiapaikeo

At Netflix, we prioritize getting timely data and insights into the hands of the people who can act on them. One of our key internal applications for this purpose is Muse. Muse’s ultimate goal is to help Netflix members discover content they’ll love by ensuring our promotional media is as effective and authentic as possible. It achieves this by equipping creative strategists and launch managers with data-driven insights showing which artwork or video clips resonate best with global or regional audiences and flagging outliers such as potentially misleading (clickbait-y) assets. These kinds of applications fall under Online Analytical Processing (OLAP), a category of systems designed for complex querying and data exploration. However, enabling Muse to support new, more advanced filtering and grouping capabilities while maintaining high performance and data accuracy has been a challenge. Previous posts have touched on artwork personalization and our impressions architecture. In this post, we’ll discuss some steps we’ve taken to evolve the Muse data serving layer to enable new capabilities while maintaining high performance and data accuracy.

Muse application

An Evolving Architecture

Like many early analytics applications, Muse began as a simple dashboard powered by batch data pipelines (Spark¹) and a modest Druid² cluster. As the application evolved, so did user demands. Users wanted new features like outlier detection and notification delivery, media comparison and playback, and advanced filtering, all while requiring lower latency and supporting ever-growing datasets (on the order of trillions of rows a year). One of the most challenging requirements was enabling dynamic analysis of promotional media performance by “audience” affinities: internally defined, algorithmically inferred labels representing collections of viewers with similar tastes. Answering questions like “Does specific promotional media resonate more with Character Drama fans or Pop Culture enthusiasts?” required augmenting already voluminous impression and playback data. Supporting filtering and grouping by these many-to-many audience relationships led to a combinatorial explosion in data volume, pushing the limits of our original architecture.

To address these complexities and support the evolving needs of our users, we undertook a significant evolution of Muse’s architecture. Today’s Muse is a React app that queries a GraphQL layer served by a set of Spring Boot gRPC microservices. In the remainder of this post, we’ll focus on steps we took to scale the data microservice, its backing ETL, and our Druid cluster. Specifically, we’ve changed the data model to rely on HyperLogLog (HLL) sketches, used Hollow for access to in-memory, precomputed aggregates, and taken a series of steps to tune Druid. To ensure the accuracy of these changes, we relied heavily on internal debugging tools to compare results before and after each change.

Muse’s Current Architecture

Moving to HyperLogLog (HLL) Sketches for Distinct Counts

Some of the most important metrics we track are impressions, the number of times an asset is shown to a user within a time window, and qualified plays, which link a playback event with a minimum duration back to a specific impression. Calculating these metrics requires counting distinct users. However, performing distinct counts in distributed systems is resource-intensive and challenging. For instance, to determine how many unique profiles have ever seen a particular asset, we need to compare each new set of profile ids with those from all days before it, potentially spanning months or even years.

For performance, we can trade some accuracy. The Apache DataSketches library allows us to get distinct count estimates that are within 1–2% error. This is tunable with a precision parameter called logK (0.8% in our case with logK of 17). We build sketches in two places:

  1. During Druid ingest: we use the HLLSketchBuild aggregator with Druid rollup set to true to reduce our data in preparation for fast distinct counting.
  2. During our Spark ETL: we persist precomputed aggregates like all-time impressions per asset in the form of HLL sketches. Each day, we merge a new HLL sketch into the existing one using a combination of hll_union and hll_union_agg (functions added by our very own Ryan Berti).

We use Datasketches in our ETL and serving systems
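To make the daily merge concrete, here is a toy HyperLogLog in plain Java. It is not the DataSketches implementation, nor the hll_union / hll_union_agg Spark functions; the class name TinyHll, the hash mixer, and the precision used in the example are assumptions chosen for illustration. The point it tries to show is that merging two sketches is a register-wise maximum, so an existing all-time sketch can absorb a new day's sketch without revisiting raw profile ids.

```java
/** Toy HyperLogLog for illustration only; production code should use a vetted library. */
final class TinyHll {
    private final int p;            // precision: number of bits used to pick a register
    private final byte[] registers; // m = 2^p registers, each holding a max "rank"

    TinyHll(int p) {
        this.p = p;
        this.registers = new byte[1 << p];
    }

    /** 64-bit mixer (splitmix64 finalizer) so sequential ids spread evenly. */
    private static long mix(long x) {
        long z = x + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void add(long profileId) {
        long h = mix(profileId);
        int idx = (int) (h >>> (64 - p)); // top p bits select the register
        long rest = h << p;               // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Union is a register-wise max, so it never needs the raw ids again. */
    TinyHll union(TinyHll other) {
        if (other.p != p) {
            throw new IllegalArgumentException("precision mismatch");
        }
        TinyHll out = new TinyHll(p);
        for (int i = 0; i < registers.length; i++) {
            out.registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
        return out;
    }

    double estimate() {
        int m = registers.length;
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / m); // bias constant, valid for m >= 128
        double raw = alpha * m * m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros); // small-range correction
        }
        return raw;
    }
}
```

A sketch built from two overlapping days of ids and the union of the two per-day sketches end up with identical registers, which is why the merge can run incrementally each day.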

HLL has been a huge performance boost for us in both the serving and ETL layers. Across our most common OLAP query patterns, we’ve seen latencies drop by approximately 50%. Nevertheless, running APPROX_COUNT_DISTINCT over large date ranges on the Druid cluster for very large titles exhausts the cluster’s limited threads, especially in high-concurrency situations. To further offload Druid query volume and preserve cluster threads, we’ve also relied extensively on the Hollow library.

Hollow as a Read-Only Key Value Store for Precomputed Aggregates

Our in-house Hollow³ infrastructure allows us to easily create Hollow feeds — essentially highly compressed and performant in-memory key/value stores — from Iceberg⁴ tables. In this setup, dedicated producer servers listen for changes to Iceberg tables, and when updates occur, they push the latest data to downstream consumers. On the consumer side, our Spring Boot applications listen to announcements from these producers and automatically refresh in-memory caches with the latest dataset.
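The consumer-side refresh can be pictured with a plain-Java stand-in. This sketch does not use the Hollow API; the class SnapshotCache and its methods are hypothetical, and it only illustrates the general pattern of swapping in a complete, read-only snapshot when a newer version is announced.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a feed consumer; not the Hollow API. */
final class SnapshotCache<K, V> {
    // Readers always see one complete, immutable snapshot.
    private final AtomicReference<Map<K, V>> current =
            new AtomicReference<>(Map.<K, V>of());
    private long version = -1; // guarded by the lock on this object

    /** Called when a producer announces a new dataset version. */
    synchronized boolean onAnnouncement(long announcedVersion, Map<K, V> snapshot) {
        if (announcedVersion <= version) {
            return false; // stale or duplicate announcement, keep the current data
        }
        current.set(Map.copyOf(snapshot)); // defensive, unmodifiable copy
        version = announcedVersion;
        return true;
    }

    /** Lock-free read path: a plain in-memory map lookup. */
    V lookup(K key) {
        return current.get().get(key);
    }
}
```

Because readers only ever dereference the current snapshot, a refresh never exposes a half-updated dataset to in-flight requests.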

This architecture has enabled us to migrate several data access patterns from Druid to Hollow, specifically ones with a limited number of parameter combinations per title. One of these was fetching distinct filter dimensions. For example, while most Netflix-branded titles are released globally, licensed titles often have rights restrictions that limit their availability to specific countries and time windows. As a result, a particular licensed title might only be available to members in Germany and Luxembourg.

Distinct countries queried from a Hollow feed for the assets for Manta Manta

In the past, retrieving these distinct country values per asset required issuing a SELECT DISTINCT query to our Druid cluster. With Hollow, we maintain a feed of distinct dimension values, allowing us to perform stream operations like the one below directly on a cached dataset.

/**
 * Returns the possible filter values for a dimension such as countries
 */
public List<Dimension> getDimensions(long movieId, String dimensionId) {
    // Access in-memory Hollow feed with near instant query time
    Map<String, List<Dimension>> dimensions = dimensionsHollowConsumer.lookup(movieId);
    return dimensions.getOrDefault(dimensionId, List.of()).stream()
            .sorted(Comparator.comparing(Dimension::getName))
            .toList();
}

Although this approach adds complexity to our service, requiring more intricate request routing and a higher memory footprint, pre-computed aggregates have given us greater stability and performance. In the case of fetching distinct dimensions, we’ve observed query times drop from hundreds of milliseconds to just tens of milliseconds. More importantly, this shift has offloaded high concurrency demands from our Druid cluster, resulting in more consistent query performance. In addition to this use case, cached pre-computed aggregates also power features such as retrieving recently launched titles, accessing all-time asset metrics, and serving various pieces of title metadata.

Tuning Druid

Even with the efficiencies gained from HLL sketches and Hollow feeds, ensuring that our Druid cluster operates performantly has been an ongoing challenge. Fortunately, at Netflix, we are in the company of multiple Apache Druid PMC members like Maytas Monsereenusorn and Jesse Tuğlu who have helped us wring out every ounce of performance. Some of the key optimizations we’ve implemented include:

  • Increasing broker count relative to historical nodes: We aim for a broker-to-historical ratio close to the recommended 1:15, which helps improve query throughput.
  • Tuning segment sizes: By targeting the 300–700 MB “sweet spot” for segment sizes (primarily using the tuningConfig.targetRowsPerSegment parameter during ingestion), we ensure that each segment a single historical thread scans is not overly large.
  • Leveraging Druid lookups for data enrichment: Since joins can be prohibitively expensive in Druid, we use lookups at query time for any key column enrichment.
  • Optimizing search predicates: We ensure that all search predicates operate on physical columns rather than virtual ones, creating necessary columns during ingestion with transformSpec.transforms.
  • Filtering and slimming data sources at ingest: By applying filters within transformSpec.filter and removing all unused columns in dimensionsSpec.dimensions, we keep our data sources lean and improve the possibility of higher rollup yield.
  • Use of multi-value dimensions: Exploiting the Druid multi-value dimension feature was key to overcoming the “many-to-many” combinatorial quandary when integrating audience filtering and grouping functionality mentioned in the “An Evolving Architecture” section above.
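The multi-value point can be illustrated with a small in-memory model. This is not Druid's storage format or API; the class MultiValueDemo and its methods are hypothetical. It assumes the behavior Druid documents for multi-value dimensions: a filter matches a row when any of its values matches, and a group-by counts the row once under each of its values.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of impression rows; not Druid's storage format or API. */
final class MultiValueDemo {
    /** One stored row, with audiences kept as a single multi-value dimension. */
    static final class Row {
        final String asset;
        final List<String> audiences;
        final long impressions;

        Row(String asset, List<String> audiences, long impressions) {
            this.asset = asset;
            this.audiences = audiences;
            this.impressions = impressions;
        }
    }

    /** Rows needed if every (row, audience) pair were stored as its own row. */
    static long explodedRowCount(List<Row> rows) {
        long count = 0;
        for (Row row : rows) {
            count += row.audiences.size();
        }
        return count;
    }

    /** Filter semantics: a row matches when any of its values equals the filter value. */
    static long impressionsFor(List<Row> rows, String audience) {
        long total = 0;
        for (Row row : rows) {
            if (row.audiences.contains(audience)) {
                total += row.impressions;
            }
        }
        return total;
    }

    /** Group-by semantics: a row contributes once to each of its values. */
    static Map<String, Long> impressionsByAudience(List<Row> rows) {
        Map<String, Long> totals = new TreeMap<>();
        for (Row row : rows) {
            for (String audience : row.audiences) {
                totals.merge(audience, row.impressions, Long::sum);
            }
        }
        return totals;
    }
}
```

In this model the stored row count stays at one per asset and time bucket, while the per-audience expansion happens only when a query asks for it.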

Together, these optimizations, combined with previous ones, have decreased our p99 Druid latencies by roughly 50%.

Validation & Rollout

Rolling out these changes to our metrics system required a thorough validation and release strategy. Our approach prioritized both data integrity and user trust, leveraging a blend of automation, targeted tooling, and incremental exposure to production traffic. At the core of our strategy was a parallel stack deployment: both the legacy and new metric stacks operated side-by-side within the Muse Data microservice. This setup allowed us to validate data quality, monitor real-world performance, and mitigate risk by enabling seamless fallback at any stage.

We adopted a two-pronged validation process:

  • Automated Offline Validation: Using Jupyter Notebooks, we automated the sampling and comparison of key metrics across both the legacy and new stacks. Our sampling set included a representative mix: recently accessed titles, high-profile launches, and edge-case titles with unique handling requirements. This allowed us to catch subtle discrepancies in metrics early in the process. Iterative testing on this set guided fixes, such as tuning the HLL logK parameter and benchmarking end-to-end latency improvements.
  • In-App Data Comparison Tooling: To facilitate rapid triage, we built a developer-facing comparison feature within our application that displays data from both the legacy and new metric stacks side by side. The tool automatically highlights any significant differences, making it easy to quickly spot and investigate discrepancies identified during offline validation or reported by users.
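A comparison step of this kind might look like the following sketch. The class MetricComparator, the metric names, and the 2% tolerance in the usage note are assumptions made for illustration, not details of the Muse tooling. A relative tolerance is used because HLL-based counts are estimates and small differences from exact counts are expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper that flags metrics differing by more than a relative tolerance. */
final class MetricComparator {
    static List<String> significantDiffs(Map<String, Double> legacy,
                                         Map<String, Double> candidate,
                                         double tolerance) {
        List<String> flagged = new ArrayList<>();
        // Sorted union of metric names so the report order is stable.
        TreeSet<String> names = new TreeSet<>(legacy.keySet());
        names.addAll(candidate.keySet());
        for (String name : names) {
            Double a = legacy.get(name);
            Double b = candidate.get(name);
            if (a == null || b == null) {
                flagged.add(name); // present in only one stack
                continue;
            }
            double scale = Math.max(Math.abs(a), Math.abs(b));
            if (scale > 0 && Math.abs(a - b) / scale > tolerance) {
                flagged.add(name);
            }
        }
        return flagged;
    }
}
```

For example, with a tolerance of 0.02, a legacy value of 1000 against a new value of 1008 would pass, while 200 against 260 would be flagged for investigation.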

We implemented several release best practices to mitigate risk and maintain stability:

  • Staggered Implementation by Application Segment: We developed and deployed the new metric stack in stages, focusing on specific application segments. This meant building out support for asset types like artwork and video separately and then further dividing by CEE phase (Explore, Exploit). By implementing changes segment by segment, we were able to isolate issues early, validate each piece independently, and reduce overall risk during the migration.
  • Shadow Testing (“Dark Launch”): Prior to exposing the new stack to end users, we mirrored production traffic asynchronously to the new implementation. This allowed us to validate real-world latency and catch potential faults in a live environment, without impacting the actual user experience.
  • Granular Feature Flagging: We implemented fine-grained feature flags to control exposure within each segment. This allowed us to target specific user groups or titles and instantly roll back or adjust the rollout scope if any issues were detected, ensuring rapid mitigation with minimal disruption.
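The shadow-testing idea can be sketched with the JDK's CompletableFuture. The class ShadowRouter and its mismatch counter are hypothetical and stand in for whatever mirroring and metrics the real service uses. The sketch assumes the two stacks can be modeled as functions from a query to a result.

```java
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Hypothetical router: serves from the legacy stack and mirrors calls to the new one. */
final class ShadowRouter<Q, R> {
    private final Function<Q, R> legacy;
    private final Function<Q, R> candidate;
    private final Executor shadowExecutor;
    private final AtomicLong mismatches = new AtomicLong();

    ShadowRouter(Function<Q, R> legacy, Function<Q, R> candidate, Executor shadowExecutor) {
        this.legacy = legacy;
        this.candidate = candidate;
        this.shadowExecutor = shadowExecutor;
    }

    R handle(Q query) {
        R served = legacy.apply(query); // the user-facing answer always comes from legacy
        try {
            // Mirror on a separate executor; errors and differences are only counted.
            CompletableFuture.supplyAsync(() -> candidate.apply(query), shadowExecutor)
                    .whenComplete((shadow, error) -> {
                        if (error != null || !Objects.equals(served, shadow)) {
                            mismatches.incrementAndGet();
                        }
                    });
        } catch (RuntimeException rejected) {
            // For example a saturated executor: drop the shadow call, never fail the request.
        }
        return served;
    }

    long mismatches() {
        return mismatches.get();
    }
}
```

In production the executor would typically be a bounded pool so that a slow new stack cannot pile up work; passing Runnable::run as the executor runs the shadow call inline, which is convenient in tests.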

Learnings and Next Steps

Our journey with Muse tested the limits of several parts of the stack: the ETL layer, the Druid layer, and the data serving layer. While some choices, like leveraging Netflix’s in-house Hollow infrastructure, were influenced by available resources, simple principles like offloading query volume, pre-filtering of rows and columns before Druid rollup, and optimizing search predicates (along with a bit of HLL magic) went a long way in allowing us to support new capabilities while maintaining performance. Additionally, engineering best practices like producing side-by-side implementations and backwards-compatible changes enabled us to roll out revisions steadily while maintaining rigorous validation standards. Looking ahead, we’ll continue to build on this foundation by supporting a wider range of content types like Live and Games, incorporating synopsis data, deepening our understanding of how assets work together to influence member choosing, and incorporating new metrics to distinguish between “effective” and “authentic” promotional assets, in service of helping members find content that truly resonates with them.

¹ Apache Spark is an open-source analytics engine for processing large-scale data, enabling tasks like batch processing, machine learning, and stream processing.

² Apache Druid is a high-performance, real-time analytics database designed for quickly querying large volumes of data.

³ Hollow is a Java library for efficient in-memory storage and access to moderately sized, read-only datasets, making it ideal for high-performance data retrieval.

⁴ Apache Iceberg is an open-source table format designed for large-scale analytical datasets stored in data lakes. It provides a robust and reliable way to manage data in formats like Parquet or ORC within cloud object storage or distributed file systems.


Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
