Diagnose EKS Node Issues Faster with AWS DevOps Agent and Custom MCP

Post Syndicated from Shyam Kulkarni original https://aws.amazon.com/blogs/devops/diagnose-eks-node-issues-faster-with-aws-devops-agent-and-custom-mcp/

AWS DevOps Agent can investigate a growing range of production incidents autonomously. It diagnoses CrashLoopBackOff failures, traces ConfigMap deletions through audit logs, and correlates Amazon CloudWatch metrics with cluster events — all without human intervention.

But AWS DevOps Agent has a visibility boundary. When the data it needs lives outside its native integrations — on a node’s operating system, inside a third-party monitoring tool, behind a database’s internal diagnostics — the agent stalls. It can describe symptoms, but it can’t reach the evidence needed to identify root causes.

This post shows how to extend AWS DevOps Agent by building a custom Model Context Protocol (MCP) server that bridges that gap. Using a concrete example, we give AWS DevOps Agent structured access to Amazon EKS worker node diagnostics and explain how the same approach applies to data sources the agent can’t natively reach. By the end of this walkthrough, you will have a working MCP server that gives AWS DevOps Agent access to 20+ node-level log sources — providing autonomous investigation capabilities that can assist in root cause analysis compared to manual SSH sessions.

Prerequisites

Before you begin, make sure you have the following:

  • An Amazon EKS cluster with AWS Systems Manager Agent (SSM Agent) running on the worker nodes (included by default on Amazon EKS optimized AMIs)
  • Node.js v18 or later
  • AWS CLI v2
  • AWS CDK v2 installed and bootstrapped in your target account and Region
  • An AWS account with permissions to create IAM roles, Lambda functions, and Amazon S3 buckets
  • Familiarity with Amazon EKS, AWS Systems Manager, and the Model Context Protocol (MCP)

How AWS DevOps Agent discovers custom tools through MCP

MCP is an open standard that defines how AI agents discover and invoke external tools. AWS DevOps Agent supports connecting to custom MCP servers, which means you can expose new capabilities to it without modifying the agent itself. When you connect an MCP server to AWS DevOps Agent, the agent automatically discovers the available tools, understands their schemas, and calls them as part of its investigation workflow. You build and connect the MCP server — the agent handles the rest.

The extensibility model follows three steps: first, identify the data source that AWS DevOps Agent cannot natively access; second, build an MCP server that wraps safe, structured access to that data source; and third, connect the MCP server to AWS DevOps Agent so it can incorporate the new tools into its investigations.

Three design principles make this work. Return structured data, not raw text — pre-index findings with severity levels and stable IDs so the agent can filter, reference, and correlate them. Never give the agent a shell — mediate interactions through a controlled, auditable execution model. Make tools composable — design tool outputs to serve as inputs to other tools, creating a chain of evidence the agent can follow.

Why Amazon EKS node OS visibility matters

AWS DevOps Agent integrates with Amazon EKS to inspect pod status, read container logs, query CloudWatch Container Insights, and correlate cluster events. This covers application crashes, container-level resource exhaustion, and configuration drift.

However, EKS production issues with nodes originate in a layer these tools cannot reach: the node operating system. Artifacts such as iptables rules, full CNI configuration and IPAMD state, route tables, conntrack entries, dmesg kernel messages, containerd runtime logs, sysctl parameters, ENI metadata, and the unfiltered kubelet journal exist exclusively on the node. These artifacts are the primary evidence for diagnosing IP allocation failures, DNS resolution issues, network policy enforcement problems, storage mount timeouts, and node registration failures.

Integrating AWS DevOps Agent with an EKS node diagnostics MCP server

The sample-eks-node-diagnostics-mcp repository (sample-eks-node-diagnostics-mcp repository) demonstrates this pattern. It provides an MCP server that gives AWS DevOps Agent structured access to node-level diagnostic data, backed by AWS Systems Manager (SSM) Automation for safe, auditable execution.

How it works

AWS DevOps Agent connects over MCP/HTTPS to AgentCore Gateway, which authenticates via Amazon Cognito OAuth 2.0 and routes tool calls through a Lambda-based Tool Router to SSM Automation. SSM Automation dispatches runbooks to EKS worker nodes running SSM Agent, which upload collected log archives to a KMS-encrypted S3 bucket. An S3 event triggers a Lambda function that extracts and indexes findings for the agent to query.

Figure 1: End-to-end architecture of the EKS Node Diagnostics MCP server. AWS DevOps Agent discovers and invokes 19 tools through AgentCore Gateway, which dispatches SSM Automation runbooks to worker nodes for log collection and uploads results to Amazon S3 for extraction and indexing.

  1. AWS DevOps Agent calls a collect tool with an instance ID.
  2. The MCP server dispatches an SSM Automation execution to the target node, running the AWS-managed AWSSupport-CollectEKSInstanceLogs runbook.
  3. The runbook collects 20+ log sources — kubelet, containerd, iptables, CNI config, route tables, dmesg, sysctl, ENI metadata, IPAMD logs, and more — packages them into an archive, and uploads it to an Amazon S3 bucket where you configure AWS KMS encryption.
  4. A processing pipeline extracts the archive, pre-indexes errors with severity classification and stable finding IDs, and provides the results to you through additional MCP tools.

The server exposes tools for log collection, pre-indexed error retrieval, cross-file search and correlation, structured network diagnostics, and live packet capture. A typical agent workflow chains these together: collect → status → errors → search → correlate → read → summarize, with each step producing outputs that feed into the next.

AWS DevOps Agent does not get a shell on the node. Every interaction is mediated by SSM Automation — an auditable, IAM-controlled, non-interactive execution model.

Connecting through Amazon Bedrock AgentCore Gateway

The reference implementation uses Amazon Bedrock AgentCore Gateway to expose the Lambda-backed MCP server to AWS DevOps Agent. AgentCore Gateway converts Lambda functions into MCP-compatible tools and handles authentication, protocol translation, and tool discovery through a single managed endpoint.

The integration follows three steps:

Step 1: Create an OAuth authorizer with Amazon Cognito. The CDK stack provisions a Cognito User Pool configured for the OAuth 2.0 client credentials flow. This secures inbound access to the gateway — only clients with valid tokens can invoke tools.

Step 2: Create a gateway and register the Lambda as a target. Register the Lambda function that handles tool invocations as a target on the gateway. AgentCore Gateway automatically discovers the tool schemas from the Lambda and makes them available through the MCP protocol. The gateway endpoint becomes the single MCP URL for AWS DevOps Agent.

Step 3: Connect AWS DevOps Agent. Register the MCP server at the account level in the AWS DevOps Agent console, providing the gateway URL and OAuth configuration. Then allowlist the specific tools each Agent Space needs. AWS DevOps Agent authenticates by obtaining a JWT from the Cognito token endpoint using the client credentials grant and passes it as a Bearer token in requests to the gateway URL.

Deploying the MCP server

Deploy the entire stack using AWS CDK :

git clone https://github.com/aws-samples/sample-eks-node-diagnostics-mcp.git
 cd sample-eks-node-diagnostics-mcp
 chmod +x deploy.sh
 ./deploy.sh

The script walks you through cluster selection and node role configuration. Have the following ready before running the script: your target EKS cluster name, the IAM role ARN you attached to your worker nodes, and the AWS Region where your cluster runs. The script outputs your MCP gateway URL, OAuth credentials, and token endpoint — everything you need to configure the connection in AWS DevOps Agent. See the repository README for detailed deployment instructions, CI/CD mode, and prerequisite details.

Seeing it in action

To demonstrate the MCP server’s capabilities, we walk through a realistic node-level failure scenario on a test EKS cluster. We manually inject a fault that blocks pod DNS resolution at the iptables level — an issue that is invisible from kubectl since pods appear Running — then show how AWS DevOps Agent investigates and identifies the root cause using the MCP server’s tools.

Setting up the scenario

Start with an EKS cluster that has a managed node group with SSM Agent running (included by default on Amazon EKS optimized AMIs). Deploy a sample workload to one of the nodes:

kubectl create namespace demo-app

cat <<EOF | kubectl apply -f -
 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: web-frontend
   namespace: demo-app
 spec:
   replicas: 3
   selector:
     matchLabels:
       app: web-frontend
   template:
     metadata:
       labels:
         app: web-frontend
     spec:
       containers:
       - name: nginx
         image: nginx:latest
         ports:
         - containerPort: 80
 EOF

Identify the node and instance ID where the pods are running:

kubectl get pods -n demo-app -o wide

Injecting the fault

⚠ WARNING: The following commands will disrupt DNS resolution for all pods on the target node. Only run these in a non-production test environment. Do not execute on production nodes.

Connect to the target node using SSM Session Manager and run the following commands to block pod DNS traffic at the iptables level. This simulates a subtle networking issue – pods continue running but can’t resolve DNS, and the root cause is only visible in the node’s iptables rules:

# Block pod traffic to kube-dns ClusterIP — pods run but DNS fails
 # Only affects FORWARD chain (pod traffic), not the node's own DNS
 sudo iptables -I FORWARD -d 10.100.0.10/32 -p udp --dport 53 -j DROP
 sudo iptables -I FORWARD -d 10.100.0.10/32 -p tcp --dport 53 -j DROP

Replace 10.100.0.10 with your cluster’s kube-dns ClusterIP (kubectl get svc kube-dns -n kube-system -o jsonpath=’{.spec.clusterIP}’).

This fault is particularly insidious because kubectl get pods shows all pods in Running state. The applications fail with DNS resolution errors, but there is no Kubernetes event or pod status that points to the cause. The iptables DROP rules targeting the kube-dns ClusterIP exist only in the node’s firewall configuration — a layer that no Kubernetes API call can inspect.

Investigating with AWS DevOps Agent

An engineer notices applications reporting DNS failures and asks AWS DevOps Agent to investigate:

“Pods on node i-xxxxxxxxxx in cluster EKS-sample (us-east-1) are running but applications report DNS resolution failures. Collect the node logs and investigate.”

The AWS DevOps Agent "Start an investigation" dialog with the investigation details field populated: "Pods on node i-xxxxxxxxxxxx in cluster EKS-sample (us-east-1) are running but applications report DNS resolution failures. Collect the node logs and investigate." The date and time of incident is set to 2026-03-26T16:55:30.593Z.

Figure 2: Starting an investigation in AWS DevOps Agent. The engineer provides the symptom description and incident timestamp, and the agent autonomously plans and executes the investigation.

AWS DevOps Agent begins the investigation by recording the symptom and launching two parallel actions: collecting node logs via the nodelog_collect tool and checking cluster health. The cluster health check confirms all four nodes are running and SSM-online. The agent then polls the log collection status, tracking progress from 25% through 75% to completion. Once collection finishes, the agent fans out into parallel workstreams — running network diagnostics, performing quick triage, and collecting logs from a healthy node for comparison.

The investigation timeline progresses from "Starting" at 11:59:45 AM through symptom identification at +12 seconds, cluster health check at +33 seconds confirming all four nodes are running, log collection polling at 25% and 75%, to log collection complete at +1 minute 22 seconds. The agent then launches parallel network diagnostics, quick triage, and healthy node comparison.

Figure 3: Investigation timeline showing the initial data collection phase. The agent identifies the symptom, confirms cluster health, collects node logs via SSM Automation, polls for completion, and launches parallel diagnostic workstreams.

With the initial data collected, the agent launches four parallel investigation tasks to maximize coverage and minimize time-to-root-cause: (1) deep-dive-iptables-routes examines the node’s firewall rules and routing table in detail, completing in 1 minute 44 seconds across 8 tool calls; (2) search-network-errors scans the collected logs for network-related error patterns, running 15 tool calls over 7 minutes 51 seconds; (3) collect-healthy-node gathers the same diagnostics from a known-good node for comparison, taking 13 tool calls over 4 minutes 55 seconds; (4) check-oom-and-pod-status investigates kernel OOM kills and pod health, executing 19 tool calls over 8 minutes 12 seconds. Each task produces a structured report that feeds into the final synthesis.

Four parallel investigation tasks execute concurrently: deep-dive-iptables-routes (8 tool calls, 1 minute 44 seconds), search-network-errors (15 tool calls, 7 minutes 51 seconds), collect-healthy-node (13 tool calls, 4 minutes 5 seconds), and check-oom-and-pod-status (19 tool calls, 8 minutes 12 seconds). At +14 minutes 22 seconds, all four tasks complete and the agent begins synthesizing findings.

Figure 4: Parallel investigation phase. The agent runs four concurrent deep-dive tasks — iptables/route analysis, network error search, healthy node comparison, and OOM/pod status check — then synthesizes the findings into a unified report.

The iptables and route table deep-dive reveals the root cause. The agent identifies two CRITICAL findings: a FAULT-INJECT-DROP-POD-TO-POD rule in the FORWARD chain that drops inter-pod traffic, and a FAULT-INJECT-DROP-SERVICE-CIDR rule that drops forwarded traffic to the service CIDR range. It also flags a MEDIUM-severity finding — a blackhole route for 10.96.0.0/12 (the Kubernetes service CIDR) that does not exist on healthy nodes. The remaining checks come back normal: kube-proxy chains are intact, AWS VPC CNI SNAT/CONNMARK chains are properly configured, and the default gateway and ENI route tables are correct. This structured severity classification allows the agent to immediately focus on the critical items.

A severity-classified findings summary table from the deep-dive-iptables-routes task. Two CRITICAL findings: a FAULT-INJECT-DROP-POD-TO-POD rule and a FAULT-INJECT-DROP-SERVICE-CIDR rule, both in the FORWARD chain. One MEDIUM finding about limited pod /32 routes. Six Normal findings confirm kube-proxy chains, AWS VPC CNI SNAT/CONNMARK chains, FORWARD chain policy, per-ENI route table, and default gateway are all properly configured.

Figure 5: Deep-dive findings from the iptables and route table analysis. Two CRITICAL fault-injection DROP rules in the FORWARD chain are identified as the primary issue, while standard networking components — kube-proxy, VPC CNI, and routing — check normal.

The healthy node comparison confirms the diagnosis. The agent compares the unhealthy node against a known-good node across seven dimensions: security groups, ENI count, DNS configuration, iptables rules, route tables, conntrack entries, and IPAMD state. The key differences are definitive: the blackhole route for 10.96.0.0/12 exists only on the unhealthy node, kubelet API server timeout errors appear only on the unhealthy node, conntrack entries are 12x higher (1,962 vs 169), and IPAMD reconciliation errors are 5x more frequent. The iptables FORWARD chain counters show 2.4 billion packets processed on the unhealthy node versus zero on the freshly-started healthy node — confirming sustained traffic disruption.

A comparison table titled "Summary of Key Differences" between the unhealthy and healthy nodes. Five differences are listed: a blackhole route for 10.96.0.0/12 present only on the unhealthy node, kubelet API server timeout errors present only on the unhealthy node, conntrack entries at 1,962 versus 169, IPAMD reconcile errors at 5 versus 1, and iptables FORWARD counters at 2.4 billion packets versus 0 on the fresh healthy node. DNS configuration is identical on both nodes.

Figure 6: Healthy node comparison confirming the diagnosis. The agent compares diagnostics across both nodes and identifies five key differences — the blackhole route, elevated conntrack entries, and high FORWARD chain packet counts exist only on the affected node.

The agent synthesizes the findings into a definitive root cause determination. It identifies a fault-injection namespace on the EKS cluster that is running chaos experiments, introducing three specific network-disrupting modifications on the target node: (1) a FAULT-INJECT-DROP-POD-TO-POD iptables rule in the FORWARD chain that drops inter-pod traffic, (2) a FAULT-INJECT-DROP-SERVICE-CIDR rule that drops forwarded traffic to the Kubernetes service CIDR, and (3) a blackhole route for 10.96.0.0/12 that does not exist on healthy nodes. Together, these three modifications create a multi-vector network disruption — pods appear Running but cannot communicate with each other or reach Kubernetes services, including kube-dns.

The Root causes panel identifies one root cause: "Fault-injection workloads on node i-09ffc4a0ea5da9cb7 causing multi-vector network disruption." The explanation states that a fault-injection namespace is running chaos experiments that introduced two iptables FORWARD chain DROP rules (FAULT-INJECT-DROP-POD-TO-POD and FAULT-INJECT-DROP-SERVICE-CIDR) and a blackhole route for 10.96.0.0/12 that does not exist on healthy nodes.

Figure 7: Root cause determination. The agent traces the multi-vector network disruption to three fault-injection modifications — two iptables DROP rules and a blackhole route — deployed by a chaos experiment namespace on the target node.

Cleaning up the fault

To restore the node after the demo, connect via SSM Session Manager and run:

sudo iptables -D FORWARD -d 10.100.0.10/32 -p udp --dport 53 -j DROP
sudo iptables -D FORWARD -d 10.100.0.10/32 -p tcp --dport 53 -j DROP

Extending this pattern to other data sources

The EKS node diagnostics use case demonstrates the pattern, but the architecture generalizes to systems where the SSM Agent is running and you can define an SSM Automation runbook to collect the data you need.

For example, an EC2 instance with SSM Agent can use this same approach — collect OS-level logs, network configuration, package state, or application diagnostics through a custom or pre-built SSM Automation runbook, upload results to S3, and expose them through MCP tools. The same applies to ECS container instances (Docker daemon logs, ECS agent state, iptables), on-premises servers registered via SSM Hybrid Activations, or managed nodes in your fleet.

The pattern also extends beyond SSM-managed hosts. Network devices can be reached through API calls to their management planes, databases through read-only diagnostic queries, and third-party APM tools through vendor API integrations. In each case, the same three-step approach holds: identify the unreachable data, build an MCP server that wraps safe access to it, and connect it to AWS DevOps Agent.

When to use this approach
This pattern works well for incident response where diagnostic data lives outside AWS DevOps Agent’s native reach, fleet-wide triage where manual access to individual systems is impractical, and cross-source correlation where evidence spans multiple log sources.

It is not a replacement for continuous monitoring (use CloudWatch Container Insights or Prometheus for real-time alerting), log shipping (if you have compliance requirements for continuous retention), or native integrations where the agent already has access to the data source.

The reference implementation requires SSM Agent running on the nodes with appropriate IAM permissions. It is a proof of concept — validate it in non-production environments before using it with production workloads.

Clean up

Cost considerations: This solution uses AWS Lambda, Amazon S3, AWS KMS, Amazon Cognito, and Amazon Bedrock AgentCore Gateway. Costs vary based on usage. Lambda charges apply per invocation and duration. S3 charges apply for log storage. KMS charges a per-key monthly fee plus per-request charges. Cognito charges per monthly active user. AgentCore Gateway pricing is based on API calls. For current pricing details, see the AWS Pricing page for each service. To minimize costs during evaluation, delete the stack when not in use.

Remove the deployed resources by running cdk destroy from the repository root. The S3 log bucket uses a RETAIN removal policy — delete it manually after stack destruction if needed.

Conclusion

MCP provides a standardized extensibility mechanism that lets you bridge visibility gaps in AWS DevOps Agent without modifying the agent itself. The pattern is straightforward: identify the unreachable data source, build an MCP server that wraps safe and structured access to it, and connect it to AWS DevOps Agent through Amazon Bedrock AgentCore Gateway. The agent handles the reasoning. The MCP server handles the data access.

To get started:

  • Deploy the reference implementation (sample-eks-node-diagnostics-mcp repository) in a non-production environment.
  • Review the MCP specification (MCP specification).
  • Explore the Amazon EKS troubleshooting documentation (Amazon EKS troubleshooting documentation).
  • Connect custom MCP servers to AWS DevOps Agent — see the Connecting MCP Servers guide in the AWS DevOps Agent documentation.
  • Set up AgentCore Gateway — see the Amazon Bedrock AgentCore Gateway quick start guide.

About the author

Shyam Kulkarni

Shyam Kulkarni

Shyam Kulkarni is a Sr. Technical Account Manager at AWS, where he helps enterprise customers design and implement cloud-native architectures with a focus on container orchestration, platform engineering, and observability at scale. He advises organizations on strategic modernization initiatives and is passionate about architecting AI-native systems, including agentic AI platforms and scalable AI infrastructure. Outside of work, Shyam is an avid travel and landscape photographer who enjoys exploring new destinations and capturing dramatic natural scenery. He’s also an enthusiastic home cook and baker who loves experimenting with new recipes, flavors, and techniques in the kitchen. When not behind a camera or in the kitchen, you’ll find him hiking remote trails.

Build RAG-powered AI solutions at the edge with AWS Local Zones and Outposts

Post Syndicated from Fernando Galves original https://aws.amazon.com/blogs/compute/build-rag-powered-ai-solutions-at-the-edge-with-aws-local-zones-and-outposts/

Organizations in regulated industries or with strict information security requirements are increasingly looking to use generative AI. However, they often face a dilemma: how to utilize powerful models while keeping data strictly on-premises or within specific geographic boundaries. The solution lies in deploying self-managed Small Language Models (SLMs) on premises with AWS Outposts or in adjacent metros using AWS Local Zones.

SLMs can achieve accuracy comparable to large models for specific, well-scoped use cases. However, all language models suffer from a knowledge gap: their internal knowledge is static, probabilistic, and often outdated. This challenge is acute for SLMs, which have significantly smaller parametric memory than Large Language Models (LLMs). To equip an SLM to perform accurately in an enterprise context, it must be supported by an architecture that provides fresh, governed facts.

This is achieved through Retrieval-Augmented Generation (RAG). RAG is not merely an extension; it is the architectural pattern that bridges the gap between a model’s frozen memory and your dynamic enterprise data.

This post provides a solution template for deploying an SLM augmented with RAG. This architecture allows the model to perform accurately while offering enhanced Total Cost of Ownership (TCO) because of reduced size and latency. To address data residency and InfoSec needs, we provide guidance on deploying this solution entirely within AWS Local Zones and AWS Outposts.

Solution overview

To demonstrate this architecture, we present a Chatbot application designed to answer detailed technical questions regarding AWS Hybrid Edge products (specifically AWS Local Zones and AWS Outposts) to a level 200-300 knowledge depth.

A chatbot was selected as it represents the most common use case requested by AWS customers. The technical domain demonstrates the system’s ability to handle complex, specific queries. This solution provides enterprises with full control over the foundation model, including its operating location, configuration, and the security of confidential data.

Infrastructure components

The solution runs on four EC2 instances deployed on AWS Outposts or in an AWS Local Zone, each serving a distinct role in the RAG pipeline:

Component Instance Type Role
Vector Embeddings Service

g4dn or G7e (GPU)a/b

Note:

  1. Design optimized for g4dn
  2. G7e will allow larger models and higher performance
Encodes documents and queries into dense vector representations using BAAI/bge-large-en-v1.5 1
Reranking Service

g4dn or G7e (GPU)a/b

Note

  1. Design optimized for g4dn
  2. G7e will allow larger models and higher performance
Re-scores candidate chunks for contextual relevance using BAAI/bge-reranker-large 1
Milvus Vector Database

m5.xlarge

Note : Check current instance availability for your Local Zone or Outposts deployment

Stores and retrieves vector embeddings via high-dimensional similarity search
Small Language Model

See companion blog

https://aws.amazon.com/blogs/compute/running-and-optimizing-small-language-models-on-premises-and-at-the-edge/

Generates grounded responses from retrieved context

All instances use the Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2023) for GPU workloads and Amazon Linux 2023 for the database instance. For instructions on setting up the SLM with Llama.cpp, refer to the companion post: Running and optimizing small language models on-premises and at the edge.

Solution architecture showing the four EC2 instances and RAG pipeline components deployed on AWS Outposts or Local Zones

Figure 1. Elements of the chatbot

Why RAG matters for SLMs

RAG optimizes model output by referencing an authoritative knowledge base outside of its training data before generating a response. By offloading knowledge to a vector database, we allow the SLM to focus on reasoning and syntax, significantly reducing hallucinations and providing end-to-end traceability for every answer.

Architecture overview

The RAG workflow operates through a seven-stage pipeline designed so that data never leaves your controlled environment.

Seven-stage RAG pipeline architecture from user prompt through embedding, retrieval, reranking, context construction, generation, and response

Figure 2. Architecture overview

  1. Prompt: Users submit questions to the generative AI application.
  2. Embedding: The application forwards the query to the vector embeddings application to generate a dense vector representation.
  3. Retrieval: The system searches for relevant information in the Milvus vector database, which securely stores proprietary data within the AWS Outposts environment.
    • Architectural Note: This blog demonstrates a dense retrieval pipeline. However, production enterprise systems often combine this with sparse retrieval (Keyword/BM25) to create a hybrid retrieval pattern. This helps make sure that exact-match for identifiers like error codes or product SKUs are retrieved reliably, since dense embeddings alone can struggle to distinguish rare tokens.
  4. Reranking: The reranking application receives the initial candidate list (top K) and evaluates the chunks to identify the most contextually relevant information.
  5. Context construction: The prompt and the optimized set of chunks are sent to the SLM.
  6. Generation: The SLM processes the question and generates the response.
  7. Response: The final answer is returned to the user, augmented with citations, without sensitive data leaving the on-premises environment.

This design makes sure all components operate within organizational boundaries while delivering advanced AI capabilities using infrastructure deployed entirely on AWS Local Zones or Outposts.

Solution deployment

The following instructions detail how to deploy this RAG environment on AWS Outposts or Local Zones. The solution uses a range of models but these are changeable as new models come into popularity.

Prerequisites

  1. Deployed AWS Outposts or access to AWS Local Zones in your region.
  2. Two g4dn EC2 instances deployed with Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2023).
  3. One m5.xlarge EC2 instance deployed with Amazon Linux 2023.
  4. One EC2 instance running the SLM. (For instructions on setting up the SLM with Llama.cpp, refer to the blog post: Running and optimizing small language models on-premises and at the edge)
  5. Verify that you have installed the necessary libraries: pip install sentence-transformers==3.4.1 pymilvus==2.5.8.

Vector embeddings configuration

Vector embeddings are the foundation of the RAG system. Selecting the right model requires balancing dimension size, latency, and accuracy. In this post, we use the BAAI/bge-large-en-v1.5 model to encode proprietary data and user queries.

Strategic chunking

Before embedding, proprietary documents must be split into chunks. If chunks are too large, they waste the SLM’s limited context window; if too small, they lack the context needed for reasoning. For this solution, we recommend recursive character chunking as a baseline. Configure your ingestion pipeline to create chunks of 600–800 tokens with a 10–15% overlap. This makes sure that concepts don’t get cut off mid-sentence and that the SLM receives coherent “units of evidence” rather than fragmented text.

# Important: The sample code, architecture diagrams, and sample text provided in this blog post are for
# demonstration purposes only. You should always conduct your own independent security review before
# deploying any solution in production

from sentence_transformers import SentenceTransformer

# Specify and load the BGE-Large-EN-v1.5 model
model_name = "BAAI/bge-large-en-v1.5"
embedding_model = SentenceTransformer(model_name)


def generate_embeddings(text_list: list[str]) -> list[list[float]]:
    """
    Encodes a list of text strings into vector embeddings.

    Args:
        text_list: A list of text strings to embed.

    Returns:
        A list of vector embeddings.
    """
    embeddings = embedding_model.encode(text_list, normalize_embeddings=True)
    return embeddings.tolist()  # Convert to list for broader compatibility


# Example:
documents = ["Proprietary document text 1.", "Another piece of information."]
document_vectors = generate_embeddings(documents)

query = "User question regarding proprietary data."
query_vector = generate_embeddings([query])[0]

Vector database configuration and optimization

Once vector embeddings are generated based on the data provided, a specialized database is required for efficient storage and similarity search operations. Milvus will be deployed for this RAG architecture. It is an open-source vector database optimized for high-dimensional similarity search at scale while maintaining low query latency. You can follow the instructions available in the Run Milvus in Docker (Linux) section on the Milvus website. The following Python snippet demonstrates how to create a collection schema in the Milvus database:

def setup_milvus_collection():
    # Connect to Milvus
    # PRODUCTION: Enable TLS and token-based authentication
    # See https://milvus.io/docs/authenticate.md and https://milvus.io/docs/tls.md

    connections.connect(
        "default",
        host=MILVUS_HOST,
        port=MILVUS_PORT,
        # For production, add:
        # secure=True,
        # server_pem_path="/path/to/server.pem",
        # token="your_auth_token"
    )

    # The best practice for production workloads is to define MILVUS_HOST and MILVUS_PORT
    # as environment variables or AWS Systems Manager Parameter Store for production

    collection_name = "document_store"

    # Define collection schema
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=7000),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
        #
        # PRODUCTION: Add metadata fields for retrieval access control, e.g.:
        # FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=128),
        # FieldSchema(name="user_role", dtype=DataType.VARCHAR, max_length=64),
        #
        # Then include these as filters in every search query to enforce
        # document-level authorization.
    ]

    schema = CollectionSchema(fields=fields, description="Document embeddings")

    # Create collection
    collection = Collection(name=collection_name, schema=schema)

    # Create index for vector field
    # We use baseline HNSW parameters here; production deployments should tune M
    # and efConstruction based on recall requirements.

    index_params = {
        "metric_type": "COSINE",
        "index_type": "HNSW",
        "params": {"M": 8, "efConstruction": 64},
    }
    collection.create_index(field_name="embedding", index_params=index_params)

    return collection

We use baseline HNSW parameters here; production deployments should tune M and efConstruction based on recall requirements.

Reranking implementation and configuration

A reranking step significantly improves retrieval quality by re-scoring initial vector search results with a cross-encoder model. The BAAI/bge-reranker-large model compares query-document pairs directly, providing more accurate relevance assessment than initial embedding similarity alone. The following Python snippet outlines a conceptual reranking application:

# PRODUCTION: Add authentication middleware (API key, mTLS, or IAM-based auth)
# to all FastAPI endpoints before exposing them on any network.

# Input size limits to prevent resource exhaustion
MAX_DOCUMENTS = 50
MAX_QUERY_LENGTH = 1000

@app.post("/rerank", response_model=RerankResponse)
async def rerank_documents_endpoint(request: RerankRequest):
    """
    Receives a query and a list of document texts, returns them reranked by relevance
    using the HuggingFaceCrossEncoder's score method directly.
    """
    # Check if the model is loaded and ready
    if cross_encoder_model is None:
        logger.error("Cross-encoder model not initialized. Service unavailable.")
        # Return 503 Service Unavailable if model isn't ready
        raise HTTPException(status_code=503, detail="Service temporarily unavailable.")
    # --- Input validation ---------------------------------------------------

    if len(request.query) > MAX_QUERY_LENGTH:
        logger.error(f"Query exceeds maximum length of {MAX_QUERY_LENGTH} characters.")
        raise HTTPException(status_code=400, detail="Service temporarily unavailable.")

    if len(request.documents) > MAX_DOCUMENTS:
        logger.error(f"Document list exceeds maximum size of {MAX_DOCUMENTS}.")
        raise HTTPException(status_code=400, detail="Service temporarily unavailable.")
    # ------------------------------------------------------------------------

    logger.info(
        f"Received request to rerank {len(request.documents)} documents for query: '{request.query[:50]}...'"
    )

    try:
        # 1. Create pairs of (query, document) for scoring
        query_doc_pairs: List[Tuple[str, str]] = [
            (request.query, doc_text) for doc_text in request.documents
        ]

        # 2. Get scores from the cross-encoder model
        logger.info(f"Scoring {len(query_doc_pairs)} pairs...")
        scores: List[float] = cross_encoder_model.score(query_doc_pairs)
        logger.info(f"Scoring complete. Received {len(scores)} scores.")

        # Ensure we got a score for each document
        if len(scores) != len(request.documents):
            logger.error(
                f"Mismatch between number of documents ({len(request.documents)}) and scores received ({len(scores)})."
            )
            # PRODUCTION: Return a generic message; log details server-side only.
            raise HTTPException(status_code=500, detail="Service temporarily unavailable.")

        # 3. Combine documents with their scores
        doc_score_pairs = list(zip(request.documents, scores))

        # 4. Sort by score in descending order
        # Lambda function sorts based on the second element (score) of each tuple
        sorted_doc_score_pairs = sorted(
            doc_score_pairs, key=lambda item: item[1], reverse=True
        )

        # 5. Select the top N results
        top_n = request.top_n if request.top_n is not None else len(sorted_doc_score_pairs)
        top_results = sorted_doc_score_pairs[:top_n]

        # 6. Format the response
        response_docs = [
            RerankedDocument(page_content=doc_text, relevance_score=score)
            for doc_text, score in top_results
        ]

        logger.info(f"Successfully reranked documents. Returning top {len(response_docs)}.")

        # Return the structured response
        return RerankResponse(
            reranked_documents=response_docs,
            model_name=MODEL_NAME,
            device_used=MODEL_DEVICE,
        )

    except RuntimeError as e:
        # Handle specific runtime errors like CUDA OOM during processing
        if "CUDA out of memory" in str(e):
            logger.error(f"CUDA out of memory during reranking.", exc_info=True)
        else:
            # Handle other runtime errors
            logger.error(f"Runtime error during reranking: {e}", exc_info=True)

        # Return a generic 500 error to the client
        raise HTTPException(
            status_code=500, detail="Service temporarily unavailable."
        ) from e

    except Exception as e:
        # Catch any other unexpected exceptions
        logger.error(f"Unexpected error during reranking: {e}", exc_info=True)
        # Return a generic 500 error to the client
        raise HTTPException(status_code=500, detail="Service temporarily unavailable.")

Performance optimization with reranking

While RAG efficiency enhances generative AI responses with relevant context, vector similarity search limitations can be challenging when deploying RAG at the edge. An additional consideration is that the context size of the prompt expands significantly adding to the latency of the SLM to generate the response, as it processes the larger prompt. One solution can be to perform a complex semantic search taking time. The alternative approach is to use a reranker to refine the output of the search, prioritizing the most contextually relevant chunks before they reach the SLM.

Vector similarity search results showing five retrieved chunks with scores from 0.7614 to 0.5422, all passing the 50 percent threshold filter

Figure 3. RAG without reranking

As illustrated, initial retrievals identify potentially relevant chunks with scores ranging from 0.7614 to 0.5422. When these chunks contain genuinely relevant information, they provide the SLM with the precise context needed for accurate and insightful responses. In this example, using a 50% similarity filter threshold, all five chunks qualify and are sent to the SLM model.

However, in cases when there are less relevant chunks in the list with scores above the filter, processing them can introduce inefficiencies in the SLM. By identifying and filtering these less valuable chunks from the SLM input, you can improve resource allocation and processing efficiency. This selective approach prevents the model from wasting computational resources on information that contributes minimally to response quality, focusing instead on the most informative content that enhances the generated answers.

Reranking results showing separated relevance scores with the top chunk at 0.9906 and less relevant chunks downgraded to 0.0044, with the threshold filter selecting only the top chunk

Figure 4. RAG with reranking

Figure 4 shows implementing a reranking process effectively identifies and prioritizes the relevant chunks to be sent to the SLM. The reranker transforms the compressed similarity scores into a highly separated spectrum. It elevates the most relevant chunk to 0.9906 while downgrading less relevant content to scores as low as 0.0044. This clear separation enables the 50% threshold filter to automatically select only the single most valuable chunk to be sent to the SLM, eliminating four unnecessary chunks from processing.

Sending only high-relevance chunks to the SLM delivers dual benefits that improve RAG performance. Technical improvements materialize through reduced token processing, faster inference, and lower GPU memory consumption while response quality increases as the model focuses exclusively on meaningful information. This optimization maximizes the GPU investments while delivering superior results compared to standard retrieval alone.

To determine if this reranking optimization applies to your specific workload, you can implement a structured evaluation framework with your domain’s data. Test both technical metrics (latency, memory usage, throughput) and quality indicators (precision, relevance) at various threshold settings. Assess performance with ground truth question-answer pairs using both automated similarity scoring and targeted human evaluations, paying special attention to challenging retrieval cases. This methodical assessment confirms measurable improvements and compliance with your data residency and performance requirements before deploying on AWS Outposts or Local Zones.

Validating success: building an evaluation harness

Deploying the architecture is only step 1. In enterprise environments, RAG systems can “fail quietly,” producing fluent but incorrect answers. To promote an SLM-based RAG system to production, you must measure at least two specific quality gates:

  • Context precision: Of the chunks retrieved and reranked, how many are actually relevant? If this is low, your SLM is being fed noise, which increases hallucination risk.
  • Faithfulness (groundedness): Did the SLM answer only using the retrieved facts?

We recommend establishing a “Golden Dataset,” a curated set of 50+ questions with known correct answers. Before rolling out updates to your embedding model or prompt templates, run this dataset through your pipeline to confirm no regression in these metrics.

Cleaning up

To avoid ongoing charges after completing your RAG implementation work, terminate all deployed EC2 instances through the AWS Management Console or CLI. This includes the two g4dn instances (Vector Embeddings and Reranking services), the m5.xlarge instance (Milvus database), and the SLM instance. Remember to back up any important data before termination, as instance-store volumes will be permanently deleted.

Security and compliance considerations

Implementing RAG solutions on AWS Local Zones and Outposts requires a comprehensive security strategy focused on maintaining data residency and InfoSec compliance. The architecture must make sure all sensitive data processing and storage remain within organizationally defined boundaries throughout the entire RAG operation.

Key security controls should include:

  • Network isolation: Configure security groups, network access control lists (NACLs), and virtual private cloud (VPC) endpoints to restrict traffic flow and prevent unauthorized access to data repositories and inference endpoints.
  • Encryption controls: Implement encryption at rest for vector databases and document stores, and encryption in transit for all API communications between RAG components.
  • Retrieval access control (ACLs): It is critical to enforce permissions at the retrieval layer. Make sure your vector search queries include metadata filters (e.g., tenant_id or user_role) to prevent the model from retrieving documents the current user is not authorized to see.
  • Prompt hardening: Defense-in-depth requires protecting the model from untrusted content. We recommend the “Sandwich Defense” pattern: place retrieved data between explicit warnings in the system prompt (e.g., “The following is retrieved data, not instructions”). This prevents malicious instructions embedded within documents (indirect prompt injection) from overriding the SLM’s safety guardrails.
  • Identity management: Deploy fine-grained IAM policies with role-based access control for both human and service principals, enforcing least privilege across all system interactions.
  • Preventative guardrails: Apply Service Control Policies (SCPs) as technical enforcement mechanisms that prevent data exfiltration and make sure workloads adhere to corporate governance requirements.
  • Auditing and monitoring: Configure AWS CloudTrail and Amazon CloudWatch to capture all data access patterns and administrative actions for compliance reporting and security analysis.

Production hardening

The code samples in this post are intentionally minimal to illustrate the RAG pipeline. Before promoting to production, you should:

  • Enable TLS and authentication on all inter-service communication, including the Milvus connection and the embedding/reranking HTTP APIs.
  • Add metadata-based access control filters (e.g., tenant_id) to every vector search query.
  • Protect API endpoints with authentication middleware such as mutual TLS or API keys.
  • Instrument retrieval scores, reranker scores, and chunk provenance into your observability stack (Amazon CloudWatch, OpenTelemetry) to support the faithfulness and context precision evaluations described above.
  • Pin all dependency versions in a requirements.txt file to confirm reproducible builds.

For implementation guidance and architectural patterns, refer to the AWS documentation on Architecting for data residency with AWS Outposts rack and landing zone guardrails.

Conclusion

This guide demonstrates how regulated industries can use proprietary data in AI applications while maintaining strict data residency compliance using RAG implementations on AWS Local Zones and Outposts. The use of SLMs augmented with RAG combined with reranking delivers both security and performance. This system allows organizations to meet regulatory requirements while still benefiting from advanced AI capabilities. Visit the AWS Outposts website today to start building compliant, data-driven AI applications tailored to your specific industry needs.

Optimize EC2 costs with AWS Compute Optimizer right sizing

Post Syndicated from Darshan Patel original https://aws.amazon.com/blogs/compute/optimize-ec2-costs-with-aws-compute-optimizer-right-sizing/

One of the most impactful ways to improve the ROI on your Amazon Elastic Compute Cloud (Amazon EC2) investment is rightsizing — when you match your instance types and sizes to the actual resource demands of your workloads. However, doing this manually across hundreds or thousands of instances is time-consuming and error-prone. AWS Compute Optimizer analyzes your AWS resources’ configuration and utilization metrics to provide rightsizing recommendations designed to help you identify opportunities to reduce cost while helping to maintain performance and capacity requirements.

In this post, we walk you through how to evaluate AWS Compute Optimizer’s EC2 rightsizing recommendations, configure recommendation preferences that align with your organization’s priorities, enrich recommendations with memory utilization data, and assess Graviton-based alternatives — all to help you make more informed, data-driven rightsizing decisions.

Prerequisites

To follow along with the best practices in this post, you need:

  • An AWS account with access to AWS Compute Optimizer
  • At least one running EC2 instance with 30+ hours of Amazon CloudWatch metric data in the past 14 days

Optional (for enhanced recommendations):

The challenge: balancing cost and performance at scale

Most organizations don’t have clear insights into the best performance-cost ratio for their EC2 instances — leading to overprovisioning and wasted spend on one side, or undersized instances and degraded user experience on the other. The key questions engineering and FinOps teams face are:

  • Which instances are oversized? Where are we paying for capacity we don’t use?
  • Which instances are undersized? Where are we risking performance degradation?
  • What’s the right trade-off? How do we optimize cost without introducing performance risk?

AWS Compute Optimizer analyzes up to 93 days of utilization data from Amazon CloudWatch and delivers recommendations classified by savings opportunity and performance risk to help you address these questions.

How Compute Optimizer evaluates EC2 instances

Compute Optimizer analyzes the following CloudWatch metrics for your EC2 instances, with recommendations refreshed daily:

  • CPU utilization — the percentage of allocated EC2 compute units in use on the instance. Metric: CPUUtilization
  • Memory utilization — the percentage of memory in use during the sample period (when enabled — see below). Metric: MemoryUtilization
  • Network I/O — the volume of incoming/outgoing traffic and packets on all network interfaces. Metrics: NetworkIn, NetworkOut, NetworkPacketsIn, NetworkPacketsOut
  • Disk I/O — read/write operations and throughput for instance store volumes. Metrics: DiskReadOps, DiskWriteOps, DiskReadBytes, DiskWriteBytes
  • EBS throughput and IOPS — read/write throughput and operations for attached EBS volumes. Metrics: VolumeReadBytes, VolumeWriteBytes, VolumeReadOps, VolumeWriteOps
  • GPU utilization — the percentage of allocated GPUs in use, GPU memory usage, and active encoder sessions (when enabled via the CloudWatch Agent with NVIDIA GPU metrics). Metrics: GPUUtilization, GPUMemoryUtilization, GPUEncoderStatsSessionCount

Based on these metrics, Compute Optimizer classifies each instance as:

Finding Meaning
Over-provisioned Instance resources exceed workload needs — downsize opportunity
Under-provisioned Workload demands exceed instance capacity — performance risk
Optimized Current instance is well-matched to workload requirements
Idle Instance has very low utilization — candidate for termination or consolidation (shown on a dedicated Idle Resource Recommendations page; criteria: peak CPU below 5% and network I/O under 5 MB/day over the 14-day lookback period; GPU instances (G/P families) have additional GPU-specific idle criteria)

When AWS Cost Optimization Hub is enabled, Compute Optimizer factors in your existing pricing commitments (AWS Savings Plans, Reserved Instances and other specific pricing discounts) when generating savings estimates — see Best practice 1 below for details.

For each finding, Compute Optimizer lists up to three optimization recommendations for a specific instance, ranked by estimated savings, performance risk, and migration effort.

Note: While this post focuses on EC2 instance rightsizing, Compute Optimizer also generates recommendations for Amazon EC2 Auto Scaling groups (including mixed instance types and scaling policies), Amazon Elastic Block Store (Amazon EBS) volumes, AWS Lambda functions, Amazon Elastic Container Service (Amazon ECS) services on AWS Fargate, commercial software licenses, and Amazon Aurora/Amazon Relational Database Service (Amazon RDS) databases. Idle resource detection extends further — covering EC2 instances, Auto Scaling groups, EBS volumes, ECS on Fargate, Aurora/RDS, and NAT Gateways. For the full list of supported resources, see Supported resources.

Evaluating recommendations in the console

In the Compute Optimizer console, navigate to EC2 Instances and select any instance to view its detail page. From here you can:

  1. Compare utilization metrics — View side-by-side graphs showing how your current instance’s CPU, memory, network, and disk metrics map to the recommended instance’s capacity.
  2. Review estimated savings — See projected monthly cost savings for each recommended option. With AWS Cost Optimization Hub enabled, savings reflect your actual pricing discounts rather than On-Demand rates (see Best practice 1).
  3. Assess performance risk — Understand the likelihood that switching to the recommended instance may result in resource contention.
  4. Evaluate migration effort — Compute Optimizer rates each recommendation from Very low to High based on CPU architecture compatibility and inferred workload type. Same architecture is Very low effort; AWS Graviton (ARM64) recommendation with a known compatible workload (for example, Amazon EMR) is Low; Graviton with an unidentified workload is Medium; and a different architecture with no known compatible version is High effort.
  5. Toggle CPU architecture preferences — Use the architecture drop-down to compare x86-based recommendations against AWS Graviton (ARM64) alternatives for additional price-performance improvements.

Best practice 1: Enable Cost Optimization Hub for after-discount savings

Why this matters: Enabling Cost Optimization Hub gives Compute Optimizer visibility into your Savings Plans, Reserved Instances, and other pricing discounts — so every recommendation reflects what you would actually save given your existing commitments. This is especially valuable for organizations with significant discount coverage, where On-Demand savings estimates may be significantly higher than what you would actually realize after accounting for existing commitments.

When you enable Cost Optimization Hub, Compute Optimizer automatically switches to AfterDiscounts mode and uses your organization-specific pricing discounts to generate recommendations. The console then displays two savings columns — Estimated monthly savings (after discounts) and Estimated monthly savings (On-Demand) — giving you both views side by side. To enable Cost Optimization Hub for your organization, see Getting started with Cost Optimization Hub. The savings estimation mode preference allows Compute Optimizer to analyze specific pricing discounts when generating the estimated cost savings of rightsizing recommendations. You can verify or override the savings estimation mode under Preferences > Savings estimation mode in the Compute Optimizer console. See Savings estimation mode for details.

Best practice 2: Enable memory metrics for accurate recommendations

Why this matters: Memory utilization is not collected by default in CloudWatch. By enabling it, you give Compute Optimizer a complete picture of your workload — CPU, network, disk, and memory together. This is especially valuable for memory-intensive workloads (databases, caching layers, JVM-based applications), where memory is often the critical sizing factor. With full visibility, Compute Optimizer can factor memory needs into every recommendation, resulting in higher-confidence suggestions that your teams can implement with greater assurance.

Option A: CloudWatch Agent

Deploy the unified CloudWatch Agent on your instances to publish memory utilization metrics. Compute Optimizer automatically incorporates these metrics once they’re available in CloudWatch.

Note: Collecting memory metrics with the CloudWatch Agent incurs charges. See Amazon CloudWatch Pricing.

Key steps:

  1. Install the CloudWatch Agent via AWS Systems Manager or manually.
  2. Configure the agent to collect memory metrics.
  3. Verify metrics appear in CloudWatch.
  4. Allow up to 24 hours for Compute Optimizer to incorporate the new data.

Option B: External metrics ingestion

If your organization uses a third-party observability platform, Compute Optimizer supports ingesting EC2 memory utilization metrics from:

  • Datadog
  • Dynatrace
  • Instana
  • New Relic

When external metrics ingestion is enabled, Compute Optimizer analyzes external memory data alongside native CloudWatch metrics to generate enhanced recommendations.

Learn more: Configuring external metrics ingestion

Best practice 3: Configure rightsizing preferences to match your strategy

Why this matters: Compute Optimizer’s defaults — P99.5 threshold (sizes instances to handle 99.5% of observed CPU peaks), 20% headroom (adds a 20% capacity buffer above those peaks for future growth), and 14-day lookback — work well for many workloads. Customizing these preferences lets you go further — extending the lookback to 32 or 93 days captures monthly or seasonal patterns for even more accurate recommendations, while adjusting headroom and threshold lets you fine-tune the balance between savings and performance for each environment. The result: recommendations tailored to your actual risk tolerance and workload patterns, producing suggestions your teams will trust and confidently implement.

Compute Optimizer supports configurable rightsizing preferences that tailor recommendations to your workload requirements. Preferences can be set at the organization level (applies to all member accounts in your AWS Organizations), account level (applies to a specific account — useful when production and dev/test accounts need different settings), or regional level (applies within a specific region — useful when workloads differ across regions). This hierarchy lets you set conservative defaults org-wide and override for specific accounts or regions that need different treatment.

Key preference options include:

Preference Description When to use
CPU utilization threshold Before generating recommendations, Compute Optimizer filters your CPU data through this percentile. Think of it as a noise filter: P99.5 (default) keeps 99.5% of your data and only discards the rarest 0.5% of spikes — so the recommendation is sized to handle almost every peak you’ve ever seen. P90 discards the top 10% of spikes, treating them as anomalies, and produces smaller (cheaper) recommendations. Options: P90, P95, P99.5 Use P99.5 for production where you can’t afford to miss peaks; P90 for dev/test where occasional spikes from deployments or one-off events are acceptable to ignore
CPU utilization headroom After Compute Optimizer determines the right instance size based on your historical peaks, it adds this percentage as a safety cushion for future growth. For example: if your analyzed peak needs 60% of an instance’s CPU, a 20% headroom means the recommended instance will still have 20 percentage points of spare capacity above that peak — room to grow without needing another resize. Options: 30%, 20% (default), 0% Use 30% for workloads with unpredictable or growing traffic; 20% for typical production; 0% for steady-state workloads where you want maximum savings and accept a tight fit
Memory utilization headroom Added memory capacity buffer (30%, 20%, or 10%) above analyzed usage to accommodate future increases. Default is 20% Use 30% for memory-sensitive workloads; 10% for steady-state where you want maximum savings
Lookback period Choose 14 days (default, no additional charge), 32 days (no additional charge), or 93 days (requires Enhanced Infrastructure Metrics (EIM), a paid feature). You can enable EIM at the organization, account, or individual resource level — useful for activating it only on production workloads where the cost is justified Use 32 days for monthly patterns; 93 days for seasonal or quarterly workloads
Preferred instance types Restrict recommendations to specific instance families or types. For example, if you have purchased Savings Plans and Reserved Instances, you can specify instances only covered by those pricing models. Or, if you want to use only instances equipped with certain processors or non-burstable instances because of your application design, you can specify those instances for your recommendation output When organizational standards, procurement commitments (RIs/SPs), or application design require approved instance families

Learn more: How to take advantage of rightsizing recommendation preferences

Best practice 4: Evaluate Graviton recommendations carefully

Why this matters: Compute Optimizer can recommend migrating x86 workloads to AWS Graviton instances, which deliver up to 40% better price-performance. However, unlike same-architecture rightsizing (which is a configuration change), Graviton involves a CPU architecture shift from x86 to ARM64 — so a structured evaluation process helps you validate compatibility and capture the full savings with confidence.

Before migrating to Graviton:

  1. Assess architecture compatibility — Verify that your application binaries, libraries, and dependencies support ARM64. Container-based workloads (using multi-arch images) typically require less modification to migrate.
  2. Check software dependencies — Confirm third-party agents, drivers, and monitoring tools are available for ARM64.
  3. Test in non-production first — Deploy the recommended Graviton instance in a staging environment.
  4. Run load tests — Validate performance parity with the current instance.
  5. Use the Graviton Transition Guide — Follow the AWS Graviton Getting Started guide for a structured migration approach.
  6. How to identify a good target workload — A good candidate for Graviton adoption is a workload running on Linux or BSD, built either using open-source components or source code that you control. Having full access to the source code of every component allows you to make any necessary changes quickly and easily as part of this adoption plan. If you use third-party software, many ISVs already support the Arm64 architecture implemented by AWS Graviton processors.

When to defer Graviton recommendations:

  • Legacy applications compiled for x86 without source code access.
  • Workloads with licensing tied to specific CPU architectures.
  • Applications with untested third-party binary dependencies.

Learn more: AWS Compute Optimizer Graviton migration guidance

Best practice 5: Implement a rightsizing workflow

Why this matters: A structured workflow turns Compute Optimizer’s recommendations into sustained, measurable cost savings. By establishing a regular cadence — reviewing, validating with stakeholders, and tracking results — your organization builds a continuous optimization loop that adapts as workloads evolve, compounds savings over time, and gives finance teams clear visibility into realized cost reductions.

To operationalize Compute Optimizer recommendations across your organization:

  1. Establish a regular review cadence — Schedule weekly or bi-weekly rightsizing reviews with your FinOps or cloud operations team.
  2. Prioritize by savings and confidence — Focus first on Over-provisioned instances with high estimated savings and low performance risk.
  3. Validate with application owners — Share recommendations with workload owners for context on usage patterns that metrics alone may not reveal (for example, seasonal traffic, scheduled batch jobs).
  4. Track implementation — Use AWS Cost Explorer to measure realized savings after rightsizing changes.

Note: Tag instances for effective rightsizing at scale. Compute Optimizer recommendations become more actionable when your instances carry consistent tags. At minimum, tag with Environment (prod/staging/dev) to drive review priority, and Application/Workload and Owner/Team to route recommendations to the right team. Compute Optimizer’s console, exports, and API all support tag-based filtering (tag:key and tag-key filters).

Taking it further — automate your workflow: For organizations ready to move beyond manual reviews, Compute Optimizer offers built-in automation that allows you to create automation rules that continuously clean up unattached volumes and upgrade volume types based on Compute Optimizer’s data-driven recommendations. For EC2 instance rightsizing, AWS provides a reference architecture for automating Compute Optimizer recommendations using AWS Step Functions, Amazon EventBridge, and AWS Lambda. See: Optimize costs by automating AWS Compute Optimizer recommendations

Clean up

If you installed the CloudWatch Agent as part of best practice 2 and no longer need memory metrics, stop and remove the agent to avoid ongoing custom metric charges.

Conclusion

AWS Compute Optimizer provides data-driven recommendations to help you make more informed EC2 rightsizing decisions. By enabling memory metrics, configuring recommendation preferences aligned to your workload needs, carefully evaluating Graviton alternatives, and establishing a systematic review process, you can identify opportunities to help optimize your EC2 fleet and help reduce costs while considering the performance your applications require.

Further reading

Building AI shopping agent using Amazon Bedrock AgentCore Runtime and Amazon OpenSearch Service

Post Syndicated from Omama Khurshid original https://aws.amazon.com/blogs/big-data/building-ai-shopping-agent-using-amazon-bedrock-agentcore-runtime-and-amazon-opensearch-service/

In this post, we explore how to build an online shopping AI agent. We focus on its architecture and implementation with Amazon OpenSearch Service, Amazon Bedrock AgentCore, and Strands Agents. Amazon Bedrock AgentCore is an agentic platform for deploying and operating those agents and tools securely at scale without managing infrastructure. AgentCore Runtime is the secure, serverless runtime that hosts your Strands Agents and tools as containerized applications. Strands Agents is an open source SDK for building AI agents. In this SDK, an agent is defined by a model, tools, and a prompt. Tools are callable functions that allow agents to perform actions beyond text generation, such as API calls, database queries, and file operations. The framework lets the model autonomously plan steps and invoke tools to complete tasks.

Today’s AI shopping assistants understand natural language, context, and shopping intent, creating a more human-like interaction. These assistants handle complex shopping requirements, such as “Find me a formal dress under $200 that’s appropriate for a summer wedding.” They maintain conversation history, process follow-up questions naturally, and provide personalized recommendations based on user preferences and past interactions. Customers can use visual search to upload images of items that they want, and the AI finds similar products across multiple retailers, matching styles and patterns. The goal is to provide instant, relevant, and personalized assistance at scale, creating a more efficient shopping journey for consumers worldwide.

AI agents combined with Retrieval Augmented Generation (RAG) on Amazon OpenSearch Service represent an evolution in conversational search. This integration builds AI agents on enriched catalogs, supporting context-aware and autonomous search experiences while maintaining accuracy and relevance through grounded responses.

Solution overview

The following diagram illustrates the solution architecture of an AI-powered online shopping agent built using Strands Agents, Amazon Bedrock AgentCore Runtime, and Amazon OpenSearch Service. For simplicity, the diagram doesn’t show authentication and authorization. In a production setup, secure access to the backend by using mechanisms such as Amazon API Gateway, AWS Identity and Access Management (IAM) roles, or OAuth-based authentication.

Architecture diagram showing an AI shopping agent: a user prompt flows from the front end to AgentCore Runtime, which routes the request to a Strands Retail Agent that calls a search tool against Amazon OpenSearch Service and an Amazon Bedrock LLM, then returns a natural-language response to the user.

The following is a walkthrough of the reference architecture:

  1. The user submits a question through the front-end application. AgentCore Runtime receives the request and routes it to the Strands Retail Agent.
  2. The Strands Agent processes the task and invokes the search_product_catalog tool.
  3. OpenSearch Service performs semantic search and returns relevant product results.
  4. The Strands Agent invokes Amazon Bedrock large language models (LLMs) to generate a natural language response.
  5. The agent response is returned to the user through the front end.

Walkthrough

The following section walks you through how to build an online shopping AI agent.

Prerequisites

To implement this solution, you need an AWS account. You also need an OpenSearch Service domain with OpenSearch version 2.13 or later. You can use an existing domain or create a new domain.

To use the vector search capabilities of OpenSearch Service with Strands Agents on AgentCore, you use ingest pipelines. These ingestion pipelines apply built-in processors to pre-process your documents before you index them in OpenSearch Service.

You use the text_embedding processor, which relies on the ML Commons plugin and a registered embedding model—Amazon Nova Multimodal Embeddings on Amazon Bedrock. OpenSearch Service uses the ML Commons plugin to generate vector embedding for your data and uses the same model to convert incoming queries into vectors. This supports semantic search across your indexed content.

You extend your semantic search backend by adding an agent built with Strands Agents and deployed on Amazon Bedrock AgentCore.

Code samples provided in this post are tested in Python 3.11. You only need to install Python 3.11 in your environment to execute the python scripts. You also need Node.js 18 or later installed to use the AgentCore CLI. The provided code scripts will deploy into your AWS account so make sure your terminal has access to necessary AWS credentials.

Install AgentCore CLI

Install the AgentCore CLI globally using npm:

npm install -g @aws/agentcore

Python Dependencies

You also need to create a requirements.txt file with following dependencies in your workspace to deploy the agents.

boto3
uv
opensearch-py
requests-aws4auth
strands-agents
strands-agents-tools
bedrock-agentcore

Run pip install -r requirements.txt in your terminal to install the required dependencies. To avoid conflicts with other dependencies in your system, you can use a virtual environment.

Now, walk through each step.

Step 1: Configure IAM permissions

Complete the following steps to register the Nova Multimodal Embeddings model with OpenSearch Service and verify that your OpenSearch Service domain has permission to invoke the Amazon Bedrock API.

  1. Go to the IAM console and create a new role with a custom trust policy. Add the following trust policy.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Statement1",
          "Effect": "Allow",
          "Principal": {
            "Service": "opensearchservice.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

  2. Skip adding a permission policy.
  3. Give your role a name and create it. For this post, we use OpenSearchBedrockEmbeddingRole as the role name. OpenSearch Service uses this role to invoke the Nova Multimodal Embeddings model on Amazon Bedrock.
  4. On the Permissions tab, attach an inline policy with the following permissions. For this post, we name this policy OpenSearchBedrockEmbeddingPolicy.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Statement1",
          "Effect": "Allow",
          "Action": [
            "bedrock:InvokeAgent",
            "bedrock:InvokeModel"
          ],
          "Resource": [
            "arn:aws:bedrock:us-east-1::foundation-model/*"
          ]
        }
      ]
    }

  5. Create a passRole policy with the following JSON document and assign it to the IAM role that creates the ML connector. This lets the principal running the Python code pass the OpenSearchBedrockEmbeddingRole to OpenSearch. Replace <your-aws-account-id> with your own AWS account ID.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": "arn:aws:iam::\$<your-aws-account-id>:role/OpenSearchBedrockEmbeddingRole"
        }
      ]
    }

  6. By using fine-grained access control (FGAC), map the IAM role as a backend role for the ml_full_access role in the OpenSearch Dashboards Security plugin. This mapping lets the user create ML connectors:
    1. Log in to OpenSearch Dashboards and open the Security page from the navigation menu.
    2. Choose Roles and select ml_full_access.
    3. Choose Mapped Users and Manage Mapping.
    4. Under Backend roles, add the ARN of the IAM role that you created in the previous steps.

Animated demo of OpenSearch Dashboards Security plugin showing the ml_full_access role with the Mapped Users tab open and an IAM role being added as a backend role.

Step 2: Connect to the model by using OpenSearch ML Connectors

In this section, you create an ML connector to link OpenSearch Service with the Bedrock Nova Multimodal Embeddings model. You then register and deploy the model so you can use it for neural search queries.

  1. Create a file named create-connector.py with the following code. Replace <your hostname>, <your region>, and <your account id> placeholders within the code.
import boto3
import requests
from requests_aws4auth import AWS4Auth
host = '<your hostname>'##CHANGE THIS
region = '<your region>' ##CHANGE THIS
account_id = '<your account id>' ##CHANGE THIS
service = 'es'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)
path = '/_plugins/_ml/connectors/_create'
url = host + path
payload = {
"name": "Amazon Bedrock Nova multimodal model - text embedding",
"description": "Test connector for Amazon Bedrock Nova multimodal model - text embedding",
"version": 1,
"protocol": "aws_sigv4",
"credential": {
"roleArn": f"arn:aws:iam::{account_id}:role/OpenSearchBedrockEmbeddingRole"
},
"parameters": {
"region": region,
"service_name": "bedrock",
"model": "amazon.nova-2-multimodal-embeddings-v1:0",
"input_docs_processed_step_size": 1,
"dimensions": 1024,
"embeddingTypes": [
"float"
],
"truncationMode": "NONE"
},
"actions": [
{
"action_type": "predict",
"method": "POST",
"headers": {
"content-type": "application/json",
"x-amz-content-sha256": "required"
},
"url": "https://bedrock-runtime.${parameters.region}.amazonaws.com/model/${parameters.model}/invoke",
"request_body": "{\\n \\"taskType\\": \\"SINGLE_EMBEDDING\\",\\n \\"singleEmbeddingParams\\": {\\n \\"embeddingPurpose\\": \\"GENERIC_INDEX\\",\\n \\"embeddingDimension\\": ${parameters.dimensions},\\n \\"text\\": {\\n \\"truncationMode\\": \\"${parameters.truncationMode}\\",\\n \\"value\\": \\"${parameters.inputText}\\"\\n }\\n }\\n}",
"pre_process_function": "connector.pre_process.bedrock.nova.text_embedding",
"post_process_function": "connector.post_process.bedrock.nova.embedding"
}
]
}
headers = {"Content-Type": "application/json"}
r = requests.post(url, auth=awsauth, json=payload, headers=headers, timeout=15)
print(r.status_code)
print(r.text)
  1. Run python create-connector.py in your terminal by using the IAM role with ml_full_access and passRole permissions created in the previous step. This script creates a connector between OpenSearch Service and the Bedrock Nova Multimodal Embeddings model.
  2. The program responds with connector_id. Take a note of it. Then, navigate to OpenSearch Dashboards and open Dev Tools. Create a model group against which to register this model in the OpenSearch Service domain.
POST /_plugins/_ml/model_groups/_register
{
  "name": "agent-conversational-search-model-group",
  "description": "A model group for bedrock Nova embedding models used for conversational search"
}
  1. Register a model by using connector_id and model_group_id.
POST /_plugins/_ml/models/_register
{
  "name": "nova-2-multimodal-embedding-v1",
  "function_name": "remote",
  "model_group_id": "<model group id>",
  "description": "Nova 2 Multimodal Embeddings Model",
  "connector_id": "<connector id>",
  "interface": {}
}
  1. Run the following API call to deploy the model. Use the registered model ID from the previous step.
POST /_plugins/_ml/models/<registered-model-id>/_deploy

Step 3: Create an ingest pipeline for data indexing

Use the following code to create an ingest pipeline for data indexing. The pipeline establishes a connection to the embedding model, retrieves the embedding for the title field, and stores it in the OpenSearch index.

PUT /_ingest/pipeline/nova_multimodal_embedding
{
  "description": "Text embedding pipeline using nova_multimodal_embedding",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<deployed model id>",
        "field_map": {
          "title": "title_vector"
        }
      }
    }
  ]
}

Step 4: Create an index for storing data

Create an index named product for storing data by using Dev Tools. This index stores raw text and 1024-dimensional embeddings of the title field, and uses the ingest pipeline you created in the previous step.

PUT /product
{
  "settings": {
    "index": {
      "default_pipeline": "nova_multimodal_embedding",
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "title_vector": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "lucene"
        }
      },
      "price": {
        "type": "float"
      },
      "description": {
        "type": "text"
      },
      "category": {
        "type": "keyword"
      },
      "image_url": {
        "type": "text"
      },
      "rating": {
        "properties": {
          "rate": {
            "type": "float"
          },
          "count": {
            "type": "integer"
          }
        }
      }
    }
  }
}

Step 5: Ingest sample data

Use the following code to ingest the sample product data in Dev Tools.

POST /_bulk
{"index": {"_index": "product", "_id": "2"}}
{"id":2,"title":"Mens Casual Premium Slim Fit T-Shirts","price":22.3,"description":"Slim-fitting style, contrast raglan long sleeve, three-button henley placket, light weight & soft fabric for breathable and comfortable wearing.","category":"men's clothing","image":"https://fakestoreapi.com/img/71-3HjGNDUL._AC_SY879._SX._UX._SY._UY_.jpg","rating":{"rate":4.1,"count":259}}
{"index": {"_index": "product", "_id": "3"}}
{"id":3,"title":"Mens Cotton Jacket","price":55.99,"description":"great outerwear jackets for Spring/Autumn/Winter, suitable for many occasions, such as working, hiking, camping, mountain/rock climbing, cycling, traveling or other outdoors.","category":"men's clothing","image":"https://fakestoreapi.com/img/71li-ujtlUL._AC_UX679_.jpg","rating":{"rate":4.7,"count":500}}
{"index": {"_index": "product", "_id": "4"}}
{"id":4,"title":"Mens Casual Slim Fit","price":15.99,"description":"The color could be slightly different between on the screen and in practice.","category":"men's clothing","image":"https://fakestoreapi.com/img/71YXzeOuslL._AC_UY879_.jpg","rating":{"rate":2.1,"count":430}}
{"index": {"_index": "product", "_id": "5"}}
{"id":5,"title":"John Hardy Women's Legends Naga Gold & Silver Dragon Station Chain Bracelet","price":695,"description":"From our Legends Collection, the Naga was inspired by the mythical water dragon that protects the ocean's pearl.","category":"jewelery","image":"https://fakestoreapi.com/img/71pWzhdJNwL._AC_UL640_QL65_ML3_.jpg","rating":{"rate":4.6,"count":400}}

Step 6: Query the index

Run the following API call to test semantic search by using the Nova Multimodal Embeddings model.

GET /product/_search
{
  "_source": false,
  "fields": [
    "title",
    "price",
    "category",
    "image"
  ],
  "size": 3,
  "query": {
    "neural": {
      "title_vector": {
        "query_text": "jacket",
        "model_id": "<deployed model id>",
        "k": 5
      }
    }
  }
}

The output of the preceding query should look like the following.

{
  ...
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 0.8333229,
    "hits": [
      {
        "_index": "product",
        "_id": "3",
        "_score": 0.8333229,
        "fields": {
          "image": [
            "https://fakestoreapi.com/img/71li-ujtlUL._AC_UX679_.jpg"
          ],
          "title": [
            "Mens Cotton Jacket"
          ],
          "category": [
            "men's clothing"
          ],
          "price": [
            55.99
          ]
        }
      }
      ...
    ]
  }
}

Step 7: Create an agent with Strands and Bedrock AgentCore Runtime

Now, create the Strands Agent that uses Anthropic Claude Sonnet 4.6 on Amazon Bedrock to search products from the OpenSearch Service index. To do so:

  • Import the Runtime app with from bedrock_agentcore.runtime import BedrockAgentCoreApp.
  • Initialize the app in your code with app = BedrockAgentCoreApp().
  • Create the OpenSearch Service connection and search query with the @tool decorator.
  • Decorate the invocation function with the @app.entrypoint decorator.
  • Let AgentCore Runtime control the running of the agent with app.run().

Now, complete the following steps:

  1. Make sure that you have installed the necessary dependencies from the Prerequisites section of this post.
  2. Create and save a file named search_agent.py with the following code. Replace <your hostname>, <your region>, and <your account id> placeholders within the code.
    from strands import Agent, tool
    import argparse
    import json
    from bedrock_agentcore.runtime import BedrockAgentCoreApp
    from strands.models import BedrockModel
    import boto3
    from opensearchpy import OpenSearch, RequestsHttpConnection
    from requests_aws4auth import AWS4Auth
    app = BedrockAgentCoreApp()
    @tool
    def search_products(query: str, size: int = 5):
    try:
    # OpenSearch configuration
    host = '' ## CHANGE THIS, DOMAIN ENDPOINT WITHOUT HTTPS!
    region = '' ##CHANGE THIS
    model_id= '' ##CHANGE THIS with your deployed model id in OpenSearch
    service = 'es'
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)
    # Create OpenSearch client
    client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
    )
    """Search products in OpenSearch using neural search"""
    search_body = {
    "_source": False,
    "fields": ["title", "price", "category", "image"],
    "size": size,
    "query": {
    "neural": {
    "title_vector": {
    "query_text": query,
    "model_id": model_id,
    "k": 3
    }
    }
    }
    }
    response = client.search(
    body=search_body,
    index="product"
    )
    products = []
    for hit in response['hits']['hits']:
    fields = hit.get('fields', {})
    product = {
    'title': fields.get('title', [''])[0] if fields.get('title') else '',
    'price': fields.get('price', [''])[0] if fields.get('price') else '',
    'category': fields.get('category', [''])[0] if fields.get('category') else '',
    'image': fields.get('image', [''])[0] if fields.get('image') else ''
    }
    products.append(product)
    return f"Found {len(products)} products: {json.dumps(products, indent=2)}"
    except Exception as e:
    return f"Search error: {str(e)}"
    model_id = "global.anthropic.claude-haiku-4-5-20251001-v1:0"
    model = BedrockModel(
    model_id=model_id,
    )
    agent = Agent(
    model=model,
    tools=[search_products],
    system_prompt="You're a helpful assistant. You can do product search, and tell the product details."
    )
    @app.entrypoint
    def strands_agent_bedrock(payload):
    """
    Invoke the agent with a payload
    """
    user_input = payload.get("prompt")
    print("User input:", user_input)
    response = agent(user_input)
    return response.message['content'][0]['text']
    if __name__ == "__main__":
    #strands_agent_bedrock({"prompt": "Search jacket"}) ##UNCOMMENT THIS FOR TESTING
    #app.run() ##UNCOMMENT THIS FOR DEPLOYMENT, MAKE SURE THE ABOVE LINE IS COMMENTED WHEN YOU ARE DEPLOYING TO AGENTCORE

    This deploys your agent locally for testing purposes.

  3. Navigate to the IAM console and add the AmazonBedrockLimitedAccess permission policy to the principal running the code.
  4. Navigate to OpenSearch Dashboards and, from the left menu, choose Security plugin, then choose Roles.
  5. Choose Create Role.
  6. Name the role agentcore-permissions.
  7. Under cluster permissions, add cluster:admin/opensearch/ml/models/get and cluster:admin/opensearch/ml/predict.
  8. Under index permissions, enter product* as the index pattern. Add search and get permissions.
  9. Create the role.
  10. Choose the role you created, switch to the Mapped Users tab, choose Manage mapping, and add the role that you use for running the Python code as a backend role.
  11. Uncomment the line strands_agent_bedrock({"prompt": "Search jacket"}) and make sure the app.run() line is commented in the code.
  12. Run python search_agent.py in your terminal to start the shopping agent. The output should look similar to the following.
    "Here are the jacket search results:\n\n1. **Mens Cotton Jacket** - $55.99\n2. **Mens Casual Slim Fit** - $15.99\n3. **Mens Casual Premium Slim Fit T-Shirts** - $22.30\n4. **John Hardy Women's Legends Naga Gold & Silver Dragon Station Chain Bracelet** - $695.00\n\nThe most relevant jacket option is the **Mens Cotton Jacket** at $55.99. Would you like to know more about any of these products?

  13. Comment strands_agent_bedrock({"prompt": "Search jacket"}) and uncomment the app.run() line in the code before going into the next step.

Step 8: Configure and launch your agent to Bedrock AgentCore Runtime

The AgentCore CLI is a command-line tool provided by AWS that simplifies deployment of agents to Amazon Bedrock AgentCore Runtime. When you run the CLI deployment command, it automates the entire deployment workflow: it creates the necessary IAM execution role with proper permissions, packages your Python application code along with its dependencies, uses AWS CodeBuild to build an optimized Docker container image, pushes that container image to Amazon Elastic Container Registry (ECR), and finally provisions the AgentCore Runtime environment that hosts your containerized agent. This eliminates the need for manual Dockerfile creation, container builds, or infrastructure management.

  1. Before you start this step, make sure you have gone through section 7 and installed the AgentCore CLI and Python dependencies listed in the Prerequisites section.
  2. Create a policy named AgentCoreAccessPolicy with the following permissions and attach it to the role running the code. Replace <ACCOUNT_ID> and <REGION> placeholders.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockAgentCoreServiceAccess",
      "Effect": "Allow",
      "Action": [
        "bedrock-agentcore:*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECRAuthorizationToken",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECRRepositoryAccess",
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:DescribeRepositories",
        "ecr:CreateRepository",
        "ecr:ListImages",
        "ecr:DescribeImages"
      ],
      "Resource": [
        "arn:aws:ecr:<REGION>:<ACCOUNT_ID>:repository/bedrock-agentcore-*"
      ]
    },
    {
      "Sid": "CodeBuildProjectAccess",
      "Effect": "Allow",
      "Action": [
        "codebuild:CreateProject",
        "codebuild:UpdateProject",
        "codebuild:StartBuild",
        "codebuild:BatchGetBuilds",
        "codebuild:DeleteProject"
      ],
      "Resource": "*"
    },
    {
      "Sid": "IAMRoleManagement",
      "Effect": "Allow",
      "Action": [
        "iam:CreateRole",
        "iam:AttachRolePolicy",
        "iam:PutRolePolicy",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:PassRole",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:DetachRolePolicy",
        "iam:CreateServiceLinkedRole"
      ],
      "Resource": [
        "arn:aws:iam::<ACCOUNT_ID>:role/BedrockAgentCoreExecutionRole-*",
        "arn:aws:iam::<ACCOUNT_ID>:role/AmazonBedrockAgentCoreSDKCodeBuild-*",
        "arn:aws:iam::<ACCOUNT_ID>:role/aws-service-role/runtime-identity.bedrock-agentcore.amazonaws.com/AWSServiceRoleForBedrockAgentCoreRuntimeIdentity"
      ]
    },
    {
      "Sid": "CloudWatchLogsAccess",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:PutResourcePolicy"
      ],
      "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws/bedrock-agentcore/*"
    },
    {
      "Sid": "CloudWatchLogsResourcePolicy",
      "Effect": "Allow",
      "Action": [
        "logs:PutResourcePolicy"
      ],
      "Resource": "*"
    },
    {
      "Sid": "S3BucketManagement",
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:PutBucketPolicy",
        "s3:PutBucketVersioning",
        "s3:PutBucketPublicAccessBlock",
        "s3:PutLifecycleConfiguration",
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::bedrock-agentcore-codebuild-sources-<ACCOUNT_ID>-*",
        "arn:aws:s3:::bedrock-agentcore-codebuild-sources-<ACCOUNT_ID>-*/*"
      ]
    }
  ]
}
  1. Create a file named agentcore.yaml in your project directory with the following configuration. Replace <REGION>, <ACCOUNT_ID>, and <OPENSEARCH_DOMAIN_NAME> placeholders:
# AgentCore Runtime Configuration
  runtime:
    name: shopping-search-agent-runtime
    entrypoint: search_agent.py:strands_agent_bedrock
    region: <REGION>  # CHANGE THIS - Your AWS region (e.g., us-east-1)

    execution_role:
      create: true
      name: BedrockAgentCoreExecutionRole-shopping-agent
      policies:
        - policy_name: BedrockAndOpenSearchAccess
          policy_document:
            Version: "2012-10-17"
            Statement:
              - Sid: BedrockModelAccess
                Effect: Allow
                Action:
                  - bedrock:InvokeModel
                  - bedrock:InvokeModelWithResponseStream
                Resource:
                  - arn:aws:bedrock:*::foundation-model/*
                  - arn:aws:bedrock:<REGION>:<ACCOUNT_ID>:inference-profile/*  # CHANGE THIS - Replace <REGION> and <ACCOUNT_ID>
              - Sid: OpenSearchAccess
                Effect: Allow
                Action:
                  - es:ESHttpGet
                  - es:ESHttpPost
                  - es:ESHttpPut
                Resource: arn:aws:es:<REGION>:<ACCOUNT_ID>:domain/<OPENSEARCH_DOMAIN_NAME>/*  # CHANGE THIS - Replace all three placeholders
              - Sid: CloudWatchLogsAccess
                Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws/bedrock-agentcore/runtimes/*  # CHANGE THIS - Replace <REGION> and <ACCOUNT_ID>
              - Sid: ECRImageAccess
                Effect: Allow
                Action:
                  - ecr:GetAuthorizationToken
                  - ecr:BatchGetImage
                  - ecr:GetDownloadUrlForLayer
                Resource: "*"

    container:
      architecture: arm64  # Options: arm64 or x86_64
      requirements_file: requirements.txt

    ecr:
      auto_create: true
  1. Run the following command in your terminal to deploy the agent to AgentCore Runtime:
    • Create the AgentCore project and replace with your search agent.
      agentcore create --name ShoppingAgent --defaults
      cd ShoppingAgent
      cp ../search_agent.py app/ShoppingAgent/main.py

    • Add OpenSearch and other dependencies:
      cd app/ShoppingAgent
      uv add opensearch-py requests-aws4auth boto3
      cd ../..

    • Deploy the agent to Agentcore Runtime. This process takes approximately 5-10 minutes.
      agentcore deploy

    • Once deployment completes, verify the runtime status:
      agentcore status

      You should see:

      ShoppingAgent: Deployed - Runtime: READY (arn:aws:bedrock-agentcore:<REGION>:<ACCOUNT_ID>:runtime/ShoppingAgent_...)
      
      URL: https://bedrock-agentcore.<REGION>.amazonaws.com/runtimes/.../invocations

Step 9: Configure OpenSearch Service access

Map your AgentCore execution role to an OpenSearch backend role so the agent can access your data.

  1. Navigate to OpenSearch Dashboards. From the left menu, choose the Security plugin, then choose Roles.
  2. Search for agentcore-permissions and choose the role. Then, navigate to the Mapped Users tab, choose Manage mapping, and add arn:aws:iam::<ACCOUNT_ID>:role/AmazonBedrockAgentCoreSDKRuntime-us-east-1-custom as a backend role. Replace the <ACCOUNT_ID> placeholder with your account ID.

Animated demo of OpenSearch Dashboards Security plugin showing the agent-permissions role with the Mapped Users tab open and the AgentCore SDK runtime IAM role added as a backend role.

Step 10: Invoke the Bedrock AgentCore Runtime

You can test the agent in Agent Sandbox. Enter the prompt Search jacket less than 50$, and the agent returns the relevant result from the OpenSearch Service index with a summary.

Agent Sandbox console showing a shopping agent response that returns the Mens Cotton Jacket as the relevant result for the prompt “Search jacket less than 50$”.

In real-world scenarios, you can design a search application with a Strands Agent deployed in AgentCore Runtime. You can add AgentCore Memory, which gives your AI agents the ability to remember past interactions and provide more context-aware, personalized conversations.

Cleanup

To avoid incurring future charges, delete the resources created while building this solution:

  1. Delete the OpenSearch Service domain.
  2. Delete the Amazon Bedrock AgentCore Runtime resources.

Conclusion

In this post, you saw how to create a conversational search with Amazon OpenSearch Service and Strands Agents. You also learned how to deploy the agent on Amazon Bedrock AgentCore Runtime. You can further enhance this shopping agent by using other AgentCore capabilities. For example, AgentCore Memory retains user preferences and past interactions across sessions, AgentCore Identity manages shopper authentication and access control, and AgentCore Observability helps you monitor and debug agent behavior in production. Together, these services help you build shopping experiences that deliver instant, relevant assistance at scale.

Now it’s your turn. Build your own conversational search experience by integrating OpenSearch Service and Strands Agents with your product catalog. To learn more, see the Amazon OpenSearch Service and Amazon Bedrock AgentCore detail pages.


About the authors

Omama Khurshid

Omama Khurshid

Omama is an GTM Specialist Solutions Architect Analytics at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, listening to music, and learning new technologies.

Jumana Nagaria

Jumana Nagaria

Jumana is a Prototyping Architect at AWS. She builds innovative prototypes with customers to solve their business challenges. She is passionate about cloud computing and data analytics. Outside of work, Jumana enjoys travelling, reading, painting, and spending quality time with friends and family.

Canberk Keles

Canberk Keles

Canberk is a Solutions Architect at Amazon Web Services, helping software companies achieve their business goals by leveraging AWS technologies. He is part of OpenSearch specialist community within AWS and has been guiding customers harness the power of OpenSearch. Outside of work, he enjoys sports, reading, traveling and playing video games.

[$] Automatic mTHP creation in 7.2

Post Syndicated from corbet original https://lwn.net/Articles/1077208/

The Linux kernel has long tried to use huge pages as a way to improve
performance, sometimes with more success than others. The size of huge
pages has traditionally been imposed by the hardware, which typically only
offers a couple of relatively large options. In more recent times, though,
the use of multi-size transparent huge pages (mTHPs), with more flexible
sizing implemented in software, has been growing. If all goes well, the
7.2 development cycle will include the addition of a new feature,
contributed by Nico Pache, to make the use of mTHPs even more transparent.

Young people’s computer programs get data from space

Post Syndicated from Fergus Kirkpatrick original https://www.raspberrypi.org/blog/young-peoples-computer-programs-get-data-from-space/

An amazing 25,707 participants had their code run on the International Space Station (ISS) this year, marking the European Astro Pi Challenge’s 10th anniversary in style.

Yesterday, Astro Pi teams and their mentors received their official certificates and data — the final stage of this year’s challenge. On each certificate, participants can see the exact time and the location of the ISS when their program was run.

Congratulations to every student, teacher, volunteer, and parent involved. Your support made this historic year possible. We are also thrilled to share a special message from our 2025/26 Astro Pi Ambassador, ESA Astronaut Sophie Adenot.

The European Astro Pi Challenge is an ESA Education project run in collaboration with the Raspberry Pi Foundation, implemented by ESEROs at a national level. It offers young people the amazing opportunity to conduct scientific investigations in space by writing computer programs that run on Raspberry Pi computers on board the International Space Station.

Mission Zero: Art in orbit

Mission Zero invites young people to create nature-inspired pixel art to display for the astronauts aboard the ISS. This year, we ran a total of 17,170 programs created by 24,408 participants.

By using the Astro Pi’s colour sensor to set their background colours, these programs combined live data from the station with each team’s unique artwork. The results brought a vibrant reminder of nature and Earth to the crew. You can explore these creations on our interactive mosaic — can you find your team’s pixel art on the mosaic?

Pixel art creations made by young people who participated in Mission Zero.
A selection of Mission Zero pixel art submissions

Mission Space Lab: The speed of light (and cameras)

In Mission Space Lab, teams wrote Python programs to calculate the speed of the ISS. Using the Astro Pi sensors and Raspberry Pi High Quality cameras, 387 teams (representing 1,299 young people) achieved the prestigious ‘flight status’ and had their programs run in space.

These teams are now receiving their raw data sets, which include images of Earth’s surface captured from 400km above.

Photos of the Earth’s surface that Mission Space Lab teams captured with their programs
Earth observation images captured from the ISS by Mission Space Lab teams

Doing science in space: The ‘blue shift’ mystery

Science in orbit often brings surprises. This year, we noticed the colour balance in some ocean images was shifting toward a bright blue. After investigating, we found the camera’s white balance algorithm was reacting to ‘blue shift’.

This occurs when the spectrum of light compresses as the Earth turns toward the camera at dawn. It’s a fantastic example of the real-world physics our participants encounter when dealing with orbital data!

Photos of the ocean with varying range of blue brightness.
A selection of images showing the ‘blue shift’ effect

Inspiring even more young people and communities

We know what a great opportunity Astro Pi is and how much of an impact it can have on participants and their communities. So we constantly challenge ourselves to widen our reach and bring the challenge to communities around the UK and Ireland, especially those that don’t normally get the chance to send code to space. This year, we visited schools, clubs, and science events. We also trained teachers and volunteers to help us share the challenge.

A young person colouring a pixel art design.
A young person designing Mission Zero pixel art

What’s next?

That’s a wrap for the 2025/26 challenge, but the journey doesn’t end here.

  • On Friday 12 June, ESA astronaut Pablo Álvarez Fernández will answer questions submitted by the Mission Space Lab teams. You can watch the livestreamed event on YouTube.
  • Save the date: Astro Pi 2026/27 launches Monday 14 September 2026
  • Mission Zero: We’ll be selecting new code examples from this year’s Mission Zero submissions for our next project guide
  • Mission Space Lab: We have some exciting technical updates coming for the next cohort of Space Lab teams

In the meantime, stay curious, space travelers. The journey has only just begun!

The post Young people’s computer programs get data from space appeared first on Raspberry Pi Foundation.

Security updates for Thursday

Post Syndicated from jzb original https://lwn.net/Articles/1077536/

Security updates have been issued by AlmaLinux (.NET 10.0, .NET 8.0, .NET 9.0, podman, poppler, and postgresql-jdbc), Debian (chromium, jackson-core, libdbi-perl, and libinput), Fedora (httpd, rust, and xmlstarlet), Mageia (openssh, postfix, and roundcubemail), Oracle (frr, kernel, libyang, n, postgresql-jdbc, and unbound), Red Hat (.NET 10.0, .NET 8.0, .NET 9.0, redis, and redis:7), SUSE (agama-web-ui, cockpit, cosign, glibc, google-cloud-sap-agent, google-osconfig-agent, kanidm, kernel, kubernetes, kubernetes1.23, kubernetes1.24, kubernetes1.25, kubernetes1.27, kubernetes1.28, libpodofo-devel, libyang, NetworkManager-libreswan, openCryptoki, python311-pypdf, rclone, steampipe, wicked, and xen), and Ubuntu (exim4, libcrypt-saltedhash-perl, libhttp-daemon-perl, samba, and uriparser).

Criminal AI-as-a-Service in 2026: How the Underground Market Is Operationalizing Cybercrime

Post Syndicated from Jeremy Makowski original https://www.rapid7.com/blog/post/tr-criminal-ai-underground-market-operationalizing-cybercrime-2026

Introduction

The underground market for criminally oriented generative AI has moved beyond the early hype surrounding ‘malicious chatbots.’ The gradual integration of AI as a productivity layer within cybercrime operations has become the dominant story, indicating that while the potential for fully autonomous AI hacking systems is possible, attackers are not embracing them as expected. Instead, threat actors are increasingly using AI to accelerate routine, but operationally significant, tasks to scale their operations. Drafting phishing lures, profiling targets, debugging code, generating forged documents, modifying malware, translating victim communications, and processing stolen data at scale were once time-consuming activities that AI has made significantly easier. AI does not replace cybercriminals; it lowers friction, increases speed, and expands the range of actors able to perform tasks that previously required more time, skill, or external support.

AI is being absorbed into criminal tradecraft, embedding itself in social engineering, fraud enablement, impersonation, identity abuse, and post-breach data exploitation. The market supporting this demand is not a single coherent product category, but a broader ecosystem of jailbreak wrappers, Telegram-based bots, prompt packs, open-weight model deployments, stolen AI accounts, and hijacked API keys. Their importance lies less in technical elegance than in usability. They provide criminals with accessible, repeatable, and commercially packaged ways to apply AI to operational problems.

This ecosystem should not be mistaken for a stable or fully mature criminal market. Compared with more established sectors, criminal AI remains volatile, uneven, and heavily exposed to hype. Some services offer genuine operational utility while others are little more than repackaged public models marketed at inflated prices. Many are short-lived, deceptive, or opportunistic rebrands. 

Even so, the demand is real. The core shift is not the arrival of a single dominant criminal model, but the commercialization of access to AI-enabled criminal capability. The strategic significance of criminal AI lies in compressing time, lowering skill barriers, improving communication quality, and scaling existing criminal workflows.

Criminal AI-as-a-Service

The defining features of this market have little to do with any technical novelty, but rather the packaging and monetization of access. By early 2026, many underground services were marketed through familiar commercial mechanisms like subscriptions, private support channels, Telegram-based delivery, gated communities, and promises of uncensored output, privacy, or reduced logging. These are clear signs of SaaS-style commercialization, albeit far less mature or stable than its legitimate counterparts.

The market should be best understood as “Criminal AI-as-a-Service.” Most offerings do not appear to rely on original foundational models built by threat actors. Instead, they typically depend on jailbreaks, wrappers around commercial services, fine-tuned open-weight models, repackaged interfaces, or modular combinations of existing capabilities. 

Pricing patterns suggest growing commercialization, but not a stable market structure. Entry-level access may be inexpensive, while premium services can be marketed at significantly higher rates with promises of priority support or additional functionality. These prices should be treated as indicative, not definitive (Figures 1 and 2). They are highly volatile and shaped by takedowns, fraud, rebranding, and shifting demand. 

At the lower end, free tools and stolen access to legitimate AI services often remain the default. In the middle of the market, recurring subscriptions are increasingly common. At the upper end, some services claim to use more modular or self-hosted architectures to reduce dependence on mainstream platforms. Together, these patterns point to a market that is becoming more operationalized, even if it remains unstable and hype-driven.

xanthorox-pricing.png
Figure 1: Xanthorox’s pricing

wormGPT-pricing.png
Figure 2: WormGPT’s pricing

Main criminal AI tool families

The criminal AI ecosystem is defined by several distinct tool families that reflect how threat actors adopt, package, and market generative AI for illicit use. Some platforms function as fraud-enabling assistants, others as uncensored Telegram-native chatbots, modular offensive frameworks, or low-barrier tools aimed at novice users. Examining these categories is more useful than focusing solely on individual brand names, as it reveals the market’s underlying operational logic. That logic is based on how these tools are distributed, which users they target, and which stages of the criminal workflow they are designed to support. 

Overall, the market is increasingly splitting into two complementary directions. At one end are low-cost, mass-market tools that help less experienced actors produce phishing content, scam scripts, malware prompts, forged material, and social engineering narratives at scale. At the other end are more specialized platforms that integrate AI into execution workflows, supporting targeting, automation, and operational optimization for fewer but more precise attacks. This volume-versus-precision dynamic shows that criminal AI is no longer only about accelerating malicious content generation; it is also becoming a way to make illicit operations more scalable, quieter, and strategically targeted.

FraudGPT 

This tool family represents the distribution model for criminal AI by fraud shops. Emerging in mid-2023 for a few hundred dollars per month, its longevity on the black market stems from its positioning as an “all-in-one” operational assistant rather than a simple programming tool. Most buyers are not using it to engineer highly complex malware; instead, they treat it as a productivity engine to orchestrate the entire fraud chain. 

Threat actors use it to systematically design lookalike phishing pages, scrape target data, draft convincing spear-phishing lures, and generate scam scripts. Even as the underlying architecture has evolved away from standalone models and toward basic wrappers around legitimate, jailbroken corporate APIs, FraudGPT remains a staple of the underground economy because it effectively democratizes advanced social engineering, allowing entry-level scammers to execute highly localized, grammatically flawless, and high-volume fraud operations (Figure 3).

FraudGPT-website.png
Figure 3: FraudGPT’s website

GhostGPT 

This tool family reflects the Telegram-native distribution model. Its reported selling points — uncensored output, ease of access, and reduced operational friction — illustrate the convenience and perceived safety many criminal buyers claim to value most. However, like many tools in this category, independent verification of its capabilities is limited, and its significance lies more in what it signals about buyer preferences than in any confirmed technical differentiation.

WormGPT

This tool family serves as the ultimate case study in the power and persistence of criminal branding. While the original, headline-grabbing tool was officially shut down by its creator in August 2023 following intense law enforcement and media exposure, the name has essentially become a generic dark-web trademark for unrestricted AI. The market is saturated with opportunistic copycats, such as “WormGPT v4” and various Telegram bots trading on the name. 

Threat intelligence analysis of these modern variants reveals that they share zero code with the original system; instead, they are highly volatile marketing shells, often basic API wrappers around commercial models like Grok or Mixtral that use specialized system prompts to bypass safety guardrails. WormGPT’s relevance in 2026 lies not in its technical uniqueness but in its sociological impact. It is an entry-level gateway tool used by script kiddies and sophisticated actors alike to quickly generate functional exploit scripts, craft persuasive business email compromise (BEC) lures, and scale offensive workflows (Figure 4).

WormGPT_s-website.png
Figure 4: WormGPT‘s website

KawaiiGPT 

This is a freely accessible or low-cost criminally oriented AI chatbot/tool marketed in underground spaces to generate or support illicit content and cybercrime-related tasks. Its use highlights the problem of low-barrier access in the criminal LLM market. Its relevance does not lie in any demonstrated advanced capability and there is little evidence that it provides meaningful technical sophistication beyond basic generative AI functions. Rather, KawaiiGPT is important as an example of how free or near-free tools can normalize AI-assisted offending among less experienced users. Its significance is therefore sociological rather than technical as it lowers the threshold for participation, makes AI-assisted offending appear accessible and low-risk, and introduces novice actors to workflows such as phishing text generation, fraud scripting, impersonation, and other forms of low-level cybercrime support.

BruteForceAI 

This tool family represents a meaningfully different category from the chatbot-style tools that dominate criminal AI branding. BruteForceAI prioritizes precision over content generation. It integrates large language models for intelligent form analysis and sophisticated multi-threaded attack execution. This distinction matters. The broader trend it reflects is one of attackers making fewer, better-targeted attempts rather than relying on brute volume. AI here is not a content tool. It is an execution layer, and the shift from noisy credential stuffing to quiet, optimized targeting is strategically more significant than any individual tool name (Figure 5).

BruteforceAI-program.png
Figure 5: BruteforceAI program

Xanthorox 

This AI represents the modular criminal AI platform. Its significance lies in how it is marketed. Public reporting describes it as more than another “evil chatbot,” with claims around coding support, multiple model components, and broader operational utility. Still, Xanthorox should be framed cautiously. It is better treated as an emerging or ambitiously marketed platform than as a universally verified flagship of the underground market (Figure 6).

Xanthorox-website.png
Figure 6: Xanthorox’s website

The wide variety of smaller adversarial AI tools in 2026, including names like DarkGPT, EscapeGPT, WolfGPT, Evil-GPT, XXXGPT, and BadGPT, should be viewed with caution. These brands do not constitute a coherent or reliable category; instead, they often function as short-lived rebrandings or simple interfaces built on public or open-source models. In many cases, these are “scam-of-the-month” services hosted on Telegram, designed to capitalize on hype, with entry-level memberships starting at a few dozen dollars. However, they should not be dismissed outright, as some do offer genuine un-censorship or serve as testing grounds for malicious exploits. The bottom line in 2026 is that the brand name matters less than the underlying architecture. Most “GPT” labels are disposable marketing shells used to evade takedown measures or rebuild credibility after a service failure.

What truly defines the threat is the infrastructure supporting them. While entry-level tiers cost very little, professional-grade systems can cost thousands of dollars. At this level, the value isn’t in the name, but in the technical setup.: These include the specific model used, how the service is delivered, the reliability of the operator, and how well it connects with other criminal tools like phishing kits, stealers, and ransomware support. Ultimately, the market has shifted toward operationalizing AI, focusing on tools that can automate and maximize the efficiency of entire illicit workflows.

Stolen AI accounts as an overlooked criminal market

One of the most important and still underappreciated developments in this landscape is the resale and abuse of legitimate AI access. This pattern is not new. Every widely adopted and commercially valuable technology eventually generates a secondary criminal market around stolen credentials, compromised accounts, and unauthorized access. AI is now following the same trajectory. Threat actors do not rely only on underground “dark AI” tools. They also misuse mainstream AI platforms directly.

However, the abuse of stolen AI accounts and hijacked API keys may be more consequential than many earlier credential markets. Access to legitimate AI services can provide threat actors with scalable cognitive and operational capabilities, not just access to a single platform or dataset. A compromised AI account may enable faster reconnaissance, multilingual targeting, automated content production, code generation, malware troubleshooting, and the refinement of phishing or fraud workflows. Hijacked API keys may also allow actors to consume compute resources at the victim’s expense, bypass usage restrictions tied to their own identities, and access more capable models or enterprise-grade infrastructure. In this sense, stolen AI access is not merely another credential commodity. It can function as an operational force multiplier across multiple stages of the attack lifecycle, making its abuse both expected and potentially more impactful than many traditional forms of account compromise (Figures 7 and 8).

Stolen-AI-accounts-for-sale-cybercrime-forum.png
Figure 7: Stolen AI accounts for sale on a cybercrime forum

More-stolen-AI-accounts-for-sale-cybercrime-forum.png
Figure 8: More stolen AI accounts for sale on a cybercrime forum

The impact on organizations can be serious as AI accounts may contain proprietary information such as prompts, uploaded files, source code, legal drafts, customer data, internal summaries, product plans, meeting notes, investigative material, or strategic analysis. If compromised, the exposure extends beyond the credential itself. Enterprise AI accounts and AI-related access tokens should therefore be treated like cloud credentials, developer secrets, email accounts, or administrative SaaS access.

Deepfake services: From impersonation to KYC bypass

Deepfake services have become one of the criminal AI market’s most important adjacent segments, particularly in fraud, synthetic identity creation, onboarding abuse, and KYC bypass. These services are marketed not as experimental technologies, but as practical fraud enablers. Common offerings include face swaps, voice cloning, fake selfie generation, synthetic profiles, document manipulation, virtual camera injection, video-call impersonation, and full onboarding bypass packages (Figure 9). Their significance stems from the fact that many digital platforms continue to rely heavily on remote identity verification and visual trust cues.

The purpose of bypassing KYC controls is to create, validate, or access accounts that should not exist or should not be available to the offender. Once established, such accounts can support money laundering, mule activity, romance scams, investment fraud, payment abuse, sanctions evasion, account resale, and marketplace manipulation. The threat is no longer limited to static fake images. Attackers can combine face swaps, synthetic video, animated media, and virtual camera injection to impersonate real individuals during onboarding or verification.

Deepfake services also strengthen broader fraud operations. Romance scams, fake recruitment schemes, executive impersonation, vendor fraud, and investment scams all become more persuasive when synthetic voice or video is added to the deception chain. These services should therefore be understood as part of the same criminal AI capability stack. LLMs generate scripts, refine pretexts, localize language, and support interaction at scale. Stolen data enhances personalization. Deepfake tools add the visual and audio layer that increases trust and makes deception harder to detect. Together, these capabilities form a more complete deception architecture.

Deepfake-KYC-bypass-service-advertisement.png
Figure 9: Cybercrime forum’s advertisement for a Deepfake KYC bypass service website

Organizational impact and defensive priorities

For organizations, the impact of AI-enabled cybercrime is both economic and operational. The main concern is not the sudden arrival of fully autonomous AI hacking, but the steady increase in attacker productivity, deception quality, operational flexibility, and post-compromise efficiency.

This last concern is important to note. Once attackers obtain data, AI can help them review it more quickly and more systematically. Models can summarize large document sets, identify sensitive or monetizable material, extract victim-specific details, and support tailored extortion or fraud. This does not require a purpose-built criminal model. It requires access to a capable model, relevant data, and a clear criminal objective.

At the same time, enterprise AI environments are becoming part of the attack surface. AI accounts, API keys, prompts, uploaded files, connectors, retrieval systems, internal knowledge bases, and agentic workflows can all expose sensitive business information if they are compromised, misused, or poorly governed. These assets should therefore be managed with the same seriousness as other critical systems, including clear ownership, least-privilege access, logging, monitoring, retention rules, and periodic access reviews.

Organizations should respond by treating criminal AI as a challenge of trust, identity, workflow security, and data governance, rather than only as a malware issue. High-risk business processes should be reinforced with stronger approval controls, transaction verification, segregation of duties, and out-of-band confirmation, especially for financial transfers, access changes, sensitive data requests, and executive communications.

Phishing and fraud defenses must also adapt. Poor grammar and obvious language errors are no longer reliable indicators of malicious activity. Organizations should assume that many adversaries can now generate polished, localized, and credible communications at scale. Detection should therefore rely more heavily on behavioral indicators, sender validation, process anomalies, identity verification, and transaction integrity than on superficial language cues.

At the same time, organizations should prepare for AI-assisted post-breach exploitation by improving data minimization, segmentation, access controls, monitoring, logging, and incident response planning. They should also monitor the broader underground capability stack, including jailbreak services, stolen AI accounts, and synthetic media tooling, because these increasingly shape attacker tradecraft in practice.

The market will likely see more bundling of text generation, translation, impersonation, data analysis, and synthetic media into a single criminal offering. It will also likely see continued abuse of legitimate AI platforms alongside wrapper-based underground services. The ecosystem will likely remain uneven, opportunistic, and hype-heavy, while becoming strategically important because it makes cybercrime easier to execute, scale, and detectFor organizations, the main risk is not only higher financial loss, but also the growing operational strain created by AI-assisted attacks that are faster, more scalable, and harder to triage.

Enterprise AI accounts, API keys, prompts, uploaded files, connectors, retrieval systems, internal knowledge bases, and agentic workflows should be managed as critical assets, with clear ownership, least-privilege access, logging, monitoring, retention rules, and periodic access reviews. Sensitive data should be exposed to AI systems only when there is a clear business need, especially when AI tools connect to email, cloud storage, code repositories, customer databases, financial systems, or external services. High-risk AI connectors and workflows should be inventoried, risk-ranked, and monitored for abnormal access, bulk data movement, privilege escalation, or unauthorized agent actions.

 As phishing tactics become better, core controls should include MFA, phishing-resistant authentication, conditional access, DLP, EDR/XDR, API security monitoring, secrets scanning, prompt and output filtering, and model-access controls. Incident response plans should also cover stolen AI accounts, exposed prompts, compromised API keys, leaked embeddings, abused connectors, and sensitive data retained in AI workspaces.

The organizations best positioned for the next phase will be those that integrate AI risk into existing security governance rather than treating it as a separate technical issue. As criminal use of AI becomes part of everyday attacker tradecraft, resilience will depend on the ability to verify identity, control access, protect data flows, monitor AI-enabled workflows, and maintain human oversight over high-impact decisions. The future defensive priority is therefore not to predict every AI-enabled attack, but to build security architectures that remain reliable when attackers become faster, more persuasive, and more efficient.

Enhanced License Plate Tracking

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2026/06/enhanced-license-plate-tracking.html

The surveillance company Leonardo wants more data:

A surveillance company plans to add sensors to automatic license plate readers (ALPRs) that would mean the devices, as well as capture the license plate of passing vehicles, would also sweep up unique identifiers of mobile phones, wearables, and other Bluetooth-enabled devices in those cars, potentially letting law enforcement identify specific drivers or passengers.

The technology, called SignalTrace, would turn ALPR cameras from devices focused on tracking cars to ones that can more readily track the location of particular people. ALPR cameras have become a commonly deployed technology all across the U.S.; SignalTrace would make some of those cameras capable of collecting much more data.

Yes, it’s bad that more companies are collecting this level of surveillance data. But all of this pales in comparison to the type and quantity of data our smartphones already collect about us.

Alternate link.

[$] LWN.net Weekly Edition for June 11, 2026

Post Syndicated from jzb original https://lwn.net/Articles/1076254/

Inside this week’s LWN.net Weekly Edition:

  • Front: Suspicious AI activity in Fedora; fork() + exec(); splice() + vmsplice(); BPF loop verification; fanotify; trusted publishing.
  • Briefs: CA age bill; Bundler cooldowns; insecure code completion; Asahi and macOS 27 beta; Buildroot 2026.05; Ubuntu MATE; rsync 3.4.4; Quotes; …
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Dell Pro Max 16 Plus Review A More Mobile NVIDIA RTX Pro 5000 Blackwell System

Post Syndicated from Ryan Smith original https://www.servethehome.com/dell-pro-max-16-plus-review-intel-nvidia-rtx-pro-5000-blackwell-system/

Sparing no expense, Dell’s flagship workstation laptop, the Pro Max 16 Plus, aims to deliver as much performance as is possible in a 16-inch laptop while still being modestly portable

The post Dell Pro Max 16 Plus Review A More Mobile NVIDIA RTX Pro 5000 Blackwell System appeared first on ServeTheHome.

The collective thoughts of the interwebz