Tag Archives: devops

Accelerate autonomous incident resolutions using the Datadog MCP server and AWS DevOps agent (in preview)

Post Syndicated from Nina Chen original https://aws.amazon.com/blogs/devops/accelerate-autonomous-incident-resolutions-using-the-datadog-mcp-server-and-aws-devops-agent-in-preview/

This post was co-written with Omri Sass (Director of Product Management), Cansu Berkem (Director of Product Management), and Mohammad Jama (Product Marketing Manager) from Datadog.

On-call engineers spend hours manually investigating incidents across multiple observability tools, logs, and monitoring systems. This process delays incident resolution and impacts business operations, especially when teams need to correlate data across different monitoring platforms. AWS DevOps Agent (in preview) is a frontier agent that resolves and proactively prevents incidents, continuously improving reliability and performance of applications in AWS, multicloud, and hybrid environments. Frontier agents represent a new class of AI agents that are autonomous, massively scalable, and work for hours or days without constant intervention. AWS DevOps Agent offers built-in integration with Datadog Model Context Protocol (MCP) Server, enabling you to access the untapped insights in your data by connecting directly to Datadog’s monitoring solutions. DevOps Agent maps your application resources and correlates telemetry, code, and deployment data to reduce MTTR (Mean Time To Resolution) and drive operational excellence.

You can use this integration to collect and analyze Datadog logs, metrics, and traces, correlating this data across AWS services. When incidents occur, AWS DevOps Agent identifies issues and provides mitigation plans which engineers can then implement. Engineers can monitor automated investigations through a central dashboard and engage with the agent through interactive chat at any time. Using this integration, engineers are able to reduce mean time to resolution (MTTR) from hours to minutes, while maintaining full visibility into automated actions.

How Datadog MCP and AWS DevOps Agent work together

The integration between Datadog MCP Server and AWS DevOps Agent connects your monitoring data with automated incident response. Datadog MCP Server acts as a central access point for your monitoring data. It securely connects to Datadog through a standardized protocol, allowing AWS DevOps Agent to query logs, metrics, and traces during investigations. The service uses OAuth 2.0 authentication and supports multiple regions to help maintain data sovereignty requirements.

AWS DevOps Agent learns your resources and relationships while correlating data from both AWS services and Datadog. It analyzes Amazon CloudWatch logs and metrics, deployment data, and code alongside Datadog telemetry to build a complete picture of the incident. This combined view helps identify root causes faster than examining each data source separately. Security considerations are built into every interaction. All interactions between AWS DevOps Agent and Datadog MCP Server uses authentication, authorization, encryption, and logging for audit purposes. While the service currently only runs in us-east-1, it can monitor and analyze applications deployed across any AWS Region in customer accounts globally.

Setting up and using AWS DevOps Agent with Datadog

In this section, we will guide you through the steps required to enable Datadog MCP Server in your AWS DevOps Agent account and configure it for incident resolution.

Pre-requisites

For this walkthrough, you should have access to and understanding of the following:

  • An AWS account with permissions to create AWS IAM (Identity and Access Management) roles:
    • Agent Space role – for basic service operations
    • Agent Space web app role – for using the Agent Space web app functionality
    •  (Optional) Secondary source account roles if monitoring multiple AWS accounts. Refer to the DevOps Agent user guide for the details on setting up these roles.
  • A Datadog account
  • Access to Datadog MCP Server (in preview)

Setting up Datadog in the AWS DevOps Agent console

Start the setup in the AWS DevOps Agent console by connecting your Datadog MCP Server. Navigate to Settings, select the Datadog integration panel, and choose “Register.” Enter your Datadog MCP Server details when prompted (you can learn more about requesting access to this server in their documentation). AWS DevOps Agent validates the connection and displays a confirmation message.

This is the configuration in AWS DevOps Agent for Datadog MCP Server Details with three input fields: Server Name (with example 'my-datadog-server'), Endpoint URL (showing 'https://mcp.datadog.com/api/unstable/mcp-server/mcp'), and an optional Description field. The form includes navigation steps at the top and Cancel/Next buttons at the bottom. The interface has a dark theme with blue accents.Figure 1: Setting up Datadog MCP Server in AWS DevOps Agent Console

Create an AWS DevOps Agent Agent Space

Next, create an Agent Space in your primary AWS account. This requires an AWS IAM role that grants AWS DevOps Agent access to your AWS resources. After creating your Agent Space, add Datadog MCP Server as a telemetry source to enable comprehensive incident investigation.

To create your Agent Space, start by accessing the AWS DevOps Agent console in us-east-1. Choose the “Create Agent Space” button and provide a meaningful name and description for your space. After submitting the form, you’ll need to configure the required IAM roles, which can be done through either the automated creation process or manual setup.

This is the configuration for creating an AWS DevOps Agent AgentSpace. The screen shows the option to create a DevOps Agents, with areas to give agent details, resource access, and more. The interface is dark blue theme. Figure 2: Creating a AWS DevOps Agent in Agent Space

Your Agent Space topology can be initialized using either AWS CloudFormation stacks or AWS Tags as starting points to identify your application components. Once the basic setup is complete, you can enhance your Agent Space configuration by adding Secondary source accounts for multi-account monitoring and configuring integrations with services like SIM ticketing system, Pipelines (where GitFarm packages and CloudFormation Stacks are located), Slack, and most importantly for our use case, Telemetry with the Datadog MCP Server.

This is a page that has options for adding telemetry source (datadog) in agent space. Here, there is a pop-up to add source association. The selected source here to add is Datadog. Figure 3: Add additional telemetry sources for AWS DevOps Agent to investigate

From here, we can launch the Agent Space web app to begin the investigation.

Real-World example: Resolving API Gateway errors

Let’s walk through how AWS DevOps Agent and Datadog work together to resolve a production incident. In this scenario, Datadog detects a spike in Amazon API Gateway 5XX errors affecting downstream services.

This is a sample monitor view of sample 5XX errors in Datadog. There is a monitor of Amazon API Gateway pulled up. On the right, there is a monitor showing "Your 5XX Errors" with over 220 errors. Figure 4: Sample API Gateway errors in Datadog

Investigating 5XX errors from API Gateway Incident with the Datadog MCP Server and AWS DevOps Agent

When the alert triggers, AWS DevOps Agent automatically analyzes both Datadog metrics and API Gateway logs. Through the investigation chat interface, an engineer guides AWS DevOps Agent to examine the API Gateway configuration. The agent correlates API Gateway and AWS Lambda execution logs, quickly identifying error patterns.

This is a view in AWS DevOps Agent to allow for investigating an incident with AWS DevOps Agent and Datadog MCPFigure 4: Investigating an incident with AWS DevOps Agent and Datadog MCP

Resolving and prevention

AWS DevOps Agent helps identify potential misconfigurations in the Lambda and Amazon DynamoDB integration and implements immediate fixes. The agent documents all findings and actions in an incident record, backed by telemetry from both Datadog and AWS services. After resolution, AWS DevOps Agent generates a detailed analysis report with specific recommendations to prevent similar incidents. Teams can review and implement these suggestions through the Prevention feature in the AWS DevOps Agent web app.

This view show the investigation summary produced by AWS DevOps Agent. Here, we see the root cause for this sample incident. The root cause head line states that "1. DynamoDB table name misconfiguration - typo in environment variable". There is a longer description explaining this under it. The background for this view is plain white. Figure 5: Investigation summary produced by AWS DevOps Agent

Clean up

When you’re done using the integration, you can clean up your resources by following these steps:

  1. Delete your Agent Space from the AWS DevOps Agent console
  2. Remove the Datadog MCP Server connection from your settings
  3. Delete the IAM roles created for the Agent Space
  4. (Optional) If you created additional source account roles, remove those as well

Conclusion

The integration between Datadog MCP Server and AWS DevOps Agent reduces incident resolution time by automatically correlating data across your monitoring tools. Instead of manually switching between Datadog and AWS dashboards during incidents, teams can now get an AI-powered investigation that identifies root causes and suggests fixes. Early adopters report significant improvements in their incident response. Resolution times drop from hours to minutes, while on-call teams spend less time gathering data. Teams also see more consistent incident responses and improved root cause analysis through comprehensive data correlation. To learn more, check out the AWS DevOps Agent product page.

Datadog is an AWS Specialization Partner and AWS Marketplace Seller that has been building integrations with AWS services for over a decade, amassing a growing catalog of 100+ AWS and 1000+ built-in integrations. This new AWS DevOps Agent and Datadog MCP Server integration builds upon Datadog’s strong track record of AWS partnership success. If you’re not already using Datadog, you can get started with a 14-day free trial via the AWS Marketplace.

Sujatha Kuppuraju

Sujatha Kuppuraju is a Principal Solutions Architect at AWS, specializing in Cloud and, Generative AI Security. She collaborates with software companies’ leadership teams to architect secure, scalable solutions on AWS and guide strategic product development. Leveraging her expertise in cloud architecture and emerging technologies, Sujatha helps organizations optimize offerings, maintain robust security, and bring innovative products to market in an evolving tech landscape.

DhilipVenkatesh Uvarajan

DhilipVenkatesh Uvarajan is as an Enterprise Support Lead TAM within AWS Enterprise Support, specializing in Independent Software Vendors (ISVs) across the United States. In this role, Dhilip provides strategic technical guidance to help customers innovate, optimize their AWS architecture, and ensure the seamless operation of their business-critical applications on the AWS cloud. Beyond his professional endeavors, Dhilip is passionate about AI and Robotics, often exploring innovative projects in his spare time.

Nina Chen

Nina Chen is a Customer Solutions Manager at AWS specializing in leading software companies to leverage the power of the AWS cloud to accelerate their product innovation and growth. With over 4 years of experience working in the strategic Independent Software Vendor (ISV) vertical, Nina enjoys guiding ISV partners through their cloud transformation journeys, helping them optimize their cloud infrastructure, driving product innovation, and delivering exceptional customer experiences.

Omri Sass

Omri Sass is a Director of Product Management at Datadog, where he’s overseen the development and launch of a multitude of products and capabilities including Bits AI SRE and updog.ai. He is a keen advocate for good user experience and doing what’s right by users.

Cansu Berkem

Cansu Berkem is a Director of Product Management at Datadog, overseeing the company’s end-to-end incident response experience, including Incident Management, On-Call, Automations, and Bits AI SRE. Her products help engineers resolve incidents faster through AI-driven workflows, powered by Bits AI SRE as an autonomous incident investigator and supported by integration-rich incident management and paging flows.

Mohammad Jama

Mohammad Jama is a Product Marketing Manager at Datadog. He leads go-to-market for Datadog’s AWS integrations, working closely with product, marketing, and sales to help companies observe and secure their hybrid and AWS environments.

AWS DevOps Agent helps you accelerate incident response and improve system reliability (preview)

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-devops-agent-helps-you-accelerate-incident-response-and-improve-system-reliability-preview/

Today, we’re announcing the public preview of AWS DevOps Agent, a frontier agent that helps you respond to incidents, identify root causes, and prevent future issues through systematic analysis of past incidents and operational patterns.

Frontier agents represent a new class of AI agents that are autonomous, massively scalable, and work for hours or days without constant intervention.

When production incidents occur, on-call engineers face significant pressure to quickly identify root causes while managing stakeholder communications. They must analyze data across multiple monitoring tools, review recent deployments, and coordinate response teams. After service restoration, teams often lack bandwidth to transform incident learnings into systematic improvements.

AWS DevOps Agent is your always-on, autonomous on-call engineer. When issues arise, it automatically correlates data across your operational toolchain, from metrics and logs to recent code deployments in GitHub or GitLab. It identifies probable root causes and recommends targeted mitigations, helping reduce mean time to resolution. The agent also manages incident coordination, using Slack channels for stakeholder updates and maintaining detailed investigation timelines.

To get started, you connect AWS DevOps Agent to your existing tools through the AWS Management Console. The agent works with popular services such as Amazon CloudWatch, Datadog, Dynatrace, New Relic, and Splunk for observability data, while integrating with GitHub Actions and GitLab CI/CD to track deployments and their impact on your cloud resources. Through the bring your own (BYO) Model Context Protocol (MCP) server capability, you can also integrate additional tools such as your organization’s custom tools, specialized platforms or open source observability solutions, such as Grafana and Prometheus into your investigations.

The agent acts as a virtual team member and can be configured to automatically respond to incidents from your ticketing systems. It includes built-in support for ServiceNow, and through configurable webhooks, can respond to events from other incident management tools like PagerDuty. As investigations progress, the agent updates tickets and relevant Slack channels with its findings. All of this is powered by an intelligent application topology the agent builds—a comprehensive map of your system components and their interactions, including deployment history that helps identify potential deployment-related causes during investigations.

Let me show you how it works
To show you how it works, I deployed a straigthforward AWS Lambda function that intentionally generates errors when invoked. I deployed it in an AWS CloudFormation stack.

Step 1: Create an Agent Space

An Agent Space defines the scope of what AWS DevOps Agent can access as it performs tasks.

You can organize Agent Spaces based on your operational model. Some teams align an Agent Space with a single application, others create one per on-call team managing multiple services, and some organizations use a centralized approach. For this demonstration, I’ll show you how to create an Agent Space for a single application. This setup helps isolate investigations and resources for that specific application, making it easier to track and analyze incidents within its context.

In the AWS DevOps Agent section of the AWS Management Console, I select Create Agent Space, enter a name for this space and create the AWS Identity and Access Management (IAM) roles it uses to introspect AWS resources in my or others’ AWS accounts.

AWS DevOps Agent - Create an Agent SpaceFor this demo, I choose to enable the AWS DevOps Agent web app; more about this later. This can be done at a later stage.

When ready, I choose Create.

AWS DevOps Agent - Enable Web AppAfter it has been created, I choose the Topology tab.

This view shows the key resources, entities, and relationships AWS DevOps Agent has selected as a foundation for performing its tasks efficiently. It doesn’t represent everything AWS DevOps Agent can access or see, only what the Agent considers most relevant right now. By default, the Topology includes the AWS resources that are contained in my account. As your agent completes more tasks, it will discover and add new resources to this list.

AWS DevOps Agent - Topology

Step 2: Configure the AWS DevOps web app for the operators

The AWS DevOps Agent web app provides a web interface for on-call engineers to manually trigger investigations, view investigation details including relevant topology elements, steer investigations, and ask questions about an investigation.

I can access the web app directly from my Agent Space in the AWS console by choosing the Operator access link. Alternatively, I can use AWS IAM Identity Center to configure user access for my team. IAM Identity Center lets me manage users and groups directly or connect to an identity provider (IdP), providing a centralized way to control who can access the AWS DevOps Agent web app.

AWS DevOps Agent - web app access

At this stage, I have an Agent Space all set up to focus investigations and resources for this specific application, and I’ve enabled the DevOps team to initiate investigations using the web app.

Now that the one-time setup for this application is done, I start invoking the faulty Lambda function. It generates errors at each invocation. The CloudWatch alarm associated with the Lambda errors count turns on to ALARM state. In real life, you might receive an alert from external services, such as ServiceNow. You can configure AWS DevOps Agent to automatically start investigations when receiving such alerts.

For this demo, I manually start the investigation by selecting Start Investigation.

You can also choose from several preconfigured starting points to quickly begin your investigation: Latest alarm to investigate your most recent triggered alarm and analyze the underlying metrics and logs to determine the root cause, High CPU usage to investigate high CPU utilization metrics across your compute resources and identify which processes or services are consuming excessive resources, or Error rate spike to investigate the recent increase in application error rates by analyzing metrics, application logs, and identifying the source of failures.

AWS DevOps Agent - web app

I enter some information, such as Investigation details, Investigation starting point, the Date and time of the incident, the AWS Account ID for the incident.

- web app - start investigation

In the AWS DevOps Agent web app, you can watch the investigation unfold in real time. The agent identifies the application stack. It correlates metrics from CloudWatch, examines logs from CloudWatch Logs or external sources, such as Splunk, reviews recent code changes from GitHub, and analyzes traces from AWS X-Ray.

- web app - application stack

It identifies the error patterns and provides a detailed investigation summary. In the context of this demo, the investigation reveals that these are intentional test exceptions, shows the timeline of function invocations leading to the alarm, and even suggests monitoring improvements for error handling.

The agent uses a dedicated incident channel in Slack, notifies on-call teams if needed, and provides real-time status updates to stakeholders. Through the investigation chat interface, you can interact directly with the agent by asking clarifying questions such as “which logs did you analyze?” or steering the investigation by providing additional context, such as “focus on these specific log groups and rerun your analysis.” If you need expert assistance, you can create an AWS Support case with a single click, automatically populating it with the agent’s findings, and engage with AWS Support experts directly through the investigation chat window.

For this demo, the AWS DevOps Agent correctly identified manual activities in the Lambda console to invoke a function that intentionally triggers errors 😇.

- web app - root cause

Beyond incident response, AWS DevOps Agent analyzes my recent incidents to identify high-impact improvements that prevent future issues.

During active incidents, the agent offers immediate mitigation plans through its incident mitigations tab to help restore service quickly. Mitigation plans consist of specs that provide detailed implementation guidance for developers and agentic development tools like Kiro.

For longer-term resilience, it identifies potential enhancements by examining gaps in observability, infrastructure configurations, and deployment pipeline. My straightforward demo that triggered intentional errors was not enough to generate relevant recommendations though.

AWS DevOps Agent - web app - recommendations

For example, it might detect that a critical service lacks multi-AZ deployment and comprehensive monitoring. The agent then creates detailed recommendations with implementation guidance, considering factors like operational impact and implementation complexity. In an upcoming quick follow-up release, the agent will expand its analysis to include code bugs and testing coverage improvements.

Availability
You can try AWS DevOps Agent today in the US East (N. Virginia) Region. Although the agent itself runs in US East (N. Virginia) (us-east-1), it can monitor applications deployed in any Region, across multiple AWS accounts.

During the preview period, you can use AWS DevOps Agent at no charge, but there will be a limit on the number of agent task hours per month.

As someone who has spent countless nights debugging production issues, I’m particularly excited about how AWS DevOps Agent combines deep operational insights with practical, actionable recommendations. The service helps teams move from reactive firefighting to proactive system improvement.

To learn more and sign up for the preview, visit AWS DevOps Agent. I look forward to hearing how AWS DevOps Agent helps improve your operational efficiency.

— seb

Introducing the AWS Infrastructure as Code MCP Server: AI-Powered CDK and CloudFormation Assistance

Post Syndicated from Idriss Laouali Abdou original https://aws.amazon.com/blogs/devops/introducing-the-aws-infrastructure-as-code-mcp-server-ai-powered-cdk-and-cloudformation-assistance/

Streamline your AWS infrastructure development with AI-powered documentation search, validation, and troubleshooting

Introduction

Today, we’re excited to introduce the AWS Infrastructure-as-Code (IaC) MCP Server, a new tool that bridges the gap between AI assistants and your AWS infrastructure development workflow. Built on the Model Context Protocol (MCP), this server enables AI assistants like Kiro CLI, Claude or Cursor to help you search AWS CloudFormation and Cloud Development Kit (CDK) documentation, validate templates, troubleshoot deployments, and follow best practices – all while maintaining the security of local execution.

Whether you’re writing AWS CloudFormation templates or AWS Cloud Development Kit (CDK) code, the IaC MCP Server acts as an intelligent companion that understands your infrastructure needs and provides contextual assistance throughout your development lifecycle.

The Model Context Protocol (MCP) is an open standard that enables AI assistants to securely connect to external data sources and tools. Think of it as a universal adapter that lets AI models interact with your development tools while keeping sensitive operations local and under your control.

The IaC MCP Server provides nine specialized tools organized into two categories:

Remote Documentation Search Tools

These tools connect to the AWS Knowledge MCP backend to retrieve relevant, up-to-date information:

  1.  search_cdk_documentation
    Search the AWS CDK knowledge base for APIs, concepts, and implementation guidance.
  2. search_cdk_samples_and_constructs
    Discover pre-built AWS CDK constructs and patterns from the AWS Construct Library.
  3. search_cloudformation_documentation
    Query CloudFormation documentation for resource types, properties, and intrinsic functions.
  4. read_cdk_documentation_page
    Retrieve and read full documentation pages returned from searches or provided URLs.

Local Validation and Troubleshooting Tools

These tools run entirely on your machine

  1. cdk_best_practices
    Access a curated collection of AWS CDK best practices and design principles.
  2. validate_cloudformation_template
    Perform syntax and schema validation using cfn-lint to catch errors before deployment.
  3. check_cloudformation_template_compliance
    Run security and compliance checks against your templates using AWS Guard rules and cfn-guard.
  4. troubleshoot_cloudformation_deployment
    Analyze CloudFormation stack deployment failures with integrated CloudTrail event analysis. This tool will use your AWS credentials to analyze your stack status.
  5. get_cloudformation_pre_deploy_validation_instructions
    Returns instructions for CloudFormation’s pre-deployment validation feature, which validates templates during change set creation.

Key Use Cases

  1. Intelligent Documentation Assistant

Instead of manually searching through documentation, ask your AI assistant natural language questions:

“How do I create an S3 bucket with encryption enabled in CDK?”

The server searches CDK best practic and samples, returning relevant code examples and explanations.

     2. Proactive Template Validation

Before deploying infrastructure changes:

User: “Validate my CloudFormation template and check for security issues”

AI Agent: [Uses validate_cloudformation_template and check_cloudformation_template_compliance]

“Found 2 issues: Missing encryption on EBS volumes,

and S3 bucket lacks public access block configuration”

 3. Rapid Deployment Troubleshooting

When a stack deployment fails:

User: “My stack ‘stack_03’ in us-east-1 failed to deploy. What happened?”

AI Agent: [Uses troubleshoot_stack_deployment with CloudTrail integration]

“The deployment failed due to insufficient IAM permissions.

CloudTrail shows AccessDenied for ec2:CreateVpc.

You need to add VPC permissions to your deployment role.”

     4. Learning and Exploration

New to AWS CDK? The server helps you discover constructs and patterns:

User: “Show me how to build a serverless API”

AI Agent: [Searches CDK constructs and samples]

“Here are three approaches using API Gateway + Lambda…”

Architecture and Security

Security Design

Local Execution: The MCP server runs entirely on your local machine using uv (the fast Python package manager). No code or templates are sent to external services except for documentation searches.

AWS Credentials: The server uses your existing AWS credentials (from ~/.aws/credentials, environment variables, or IAM roles) to access CloudFormation and CloudTrail APIs. This follows the same security model as the AWS CLI.

stdio Communication: The server communicates with AI assistants over standard input/output (stdio), with no network ports opened.

Minimal Permissions: For full functionality, the server requires read-only access to CloudFormation stacks and CloudTrail events—no write permissions needed for validation and troubleshooting workflows.

Getting Started

Prerequisites

  • Python 3.10 or later
    uv package manager
    AWS credentials configured locally
    MCP-compatible AI client (e.g., Kiro CLI, Claude Desktop)

Configuration

Configure the MCP server in your MCP client configuration. For this blog we will focus on Kiro CLI. Edit .kiro/settings/mcp.json):

{
  "mcpServers": {
    "awslabs.aws-iac-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.aws-iac-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "your-named-profile",
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}

Security Considerations

Privacy Notice: This MCP server executes AWS API calls using your credentials and shares the response data with your third-party AI model provider (e.g., Amazon Q, Claude Desktop, Cursor, VS Code). Users are responsible for understanding your AI provider’s data handling practices and ensuring compliance with your organization’s security and privacy requirements when using this tool with AWS resources.

IAM Permissions

The MCP server requires the following AWS permissions:

For Template Validation and Compliance:

  • No AWS permissions required (local validation only)

For Deployment Troubleshooting:

  • cloudformation:DescribeStacks
  • cloudformation:DescribeStackEvents
  • cloudformation:DescribeStackResources
  • cloudtrail:LookupEvents (for CloudTrail deep links)

Example IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:DescribeStacks",
        "cloudformation:DescribeStackEvents",
        "cloudformation:DescribeStackResources",
        "cloudtrail:LookupEvents"
      ],
      "Resource": "*"
    }
  ]
}

Example Use Case With Kiro CLI

IMPORTANT: Ensure you have satisfied all prerequisites before attempting these commands.

1. With the mcp.json file correctly set, try to run a sample prompt. In your terminal, run kiro-cli chat to start using Kiro-cli in the CLI.

Figure 1: Kiro-CLI with AWS IaC MCP server

Figure 1: Kiro-CLI with AWS IaC MCP server

Scenarios:

  • “What are the CDK best practices for Lambda functions?”

Figure 2 Search the CDK best practices for Lambda functions

Figure 2: Search the CDK best practices for Lambda functions

  • “Search for CDK samples that use DynamoDB with Lambda”

Figure 3: Search for CDK samples that use DynamoDB with Lambda

Figure 3: Search for CDK samples that use DynamoDB with Lambda

  • “Validate my CloudFormation template at ./template.yaml”

Figure 4: Validate my CloudFormation template with AWS IaC MCP Server

Figure 4: Validate my CloudFormation template with AWS IaC MCP Server

  • “Check if my template complies with security best practices”

Figure 5: Check if my template complies with security best practices with AWS IaC MCP Server

Figure 5: Check if my template complies with security best practices with AWS IaC MCP Server

Best Practices

  • Start with Documentation Search: Before writing code, search for existing constructs and patterns
  • Validate Early and Often: Run validation tools before attempting deployment
  • Check Compliance: Use check_template_compliance to catch security issues during development
  • Leverage CloudTrail: When troubleshooting, the CloudTrail integration provides detailed failure context
  • Follow CDK Best Practices: Use the cdk_best_practices tool to align with AWS recommendations

What’s Next?

The IAC MCP Server represents a new paradigm in the AI agentic workflow infrastructure development – one where AI assistants understand your tools, help you navigate complex documentation, and provide intelligent assistance throughout the development lifecycle.

Get Involved

The AWS IaC MCP Server is available now:

  • Documentation and GitHub Repository: aws-iac-mcp-server
  • Feedback: We welcome issues and pull requests! Or respond to our IaC survey here.

Ready to supercharge your infrastructure as code development? Install the IaC MCP Server today and experience AI-powered assistance for your AWS CDK and CloudFormation workflows.

Have questions or feedback? Reach out to the blog authors on the AWS Developer Forums.

About Authors

Idriss Laouali Abdou

Idriss is a Sr. Product Manager Technical on the AWS Infrastructure-as-Code team based in Seattle. He focuses on improving developer productivity through AWS CloudFormation and StackSets Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Brian Terry

Brian Terry, Senior WW Data & AI PSA, is an innovation leader with more than 20 years of experience in technology and engineering. Brian is pursuing a PhD in computer science at the University of North Dakota and has spearheaded generative AI projects, optimized infrastructure scalability, and driven partner integration strategies. He is passionate about leveraging technology to deliver scalable, resilient solutions that foster business growth and innovation.

The Future of AWS CodeCommit

Post Syndicated from Anthony Hayes original https://aws.amazon.com/blogs/devops/aws-codecommit-returns-to-general-availability/

Back in July 2024, we announced plans to de-emphasize AWS CodeCommit based on adoption patterns and our assessment of customer needs. We never stopped looking at the data or listening to you, and what you’ve shown us is clear: you need an AWS-managed solution for your code repositories. Based on this feedback, CodeCommit is returning to full General Availability, effective immediately.

We Listened, and We Heard You

After the de-emphasis announcement last year, we heard from many of you. Your feedback was direct and revealing. You told us that CodeCommit isn’t just another code repository for you—it’s a critical piece of your infrastructure. Its deep IAM integration, VPC endpoint support, CloudTrail logging, and seamless connectivity with CodePipeline and CodeBuild provide value that’s difficult to replicate with third-party solutions, especially for teams operating in regulated industries or those who want all their development infrastructure within AWS boundaries. In short, we learned that CodeCommit is essential for many of you, so we’re bringing it back.

We acknowledge the uncertainty the de-emphasis has caused. If you invested time and resources planning or executing a migration away from CodeCommit, we apologize. We’ve learned from this, and we’re committed to doing better.

What’s Changing Today

Here’s what you need to know:

CodeCommit is open to new customers again – New customer sign-ups are open as of today. If you’ve been waiting to onboard new accounts or create repositories, you can do so right now through the AWS Console, CLI, or APIs.

For current and former customers – If you already migrated away, we understand you may have completed your transition to GitHub, GitLab, Bitbucket, or another provider. Those are excellent platforms, and we fully support your decision to use them. If you’re interested in returning to CodeCommit, our support team and account teams are available to help.

If you’re mid-migration, you can pause or reverse your plans. Contact AWS Support or your account team to discuss your specific situation and determine the best path forward.

If you stayed with CodeCommit, thank you for your patience during this period. We’re working through the backlog of feature requests and support tickets that accumulated, prioritizing by customer need. Continue to tell us how we can improve the service and support your workflows (human, machine, and agentic) moving forward.

What’s Coming Next

We’re not just maintaining CodeCommit—we’re investing in it. Here’s what’s on the roadmap:

Git LFS Support (Q1 2026) – This has been your most requested feature. Git Large File Storage will enable you to efficiently manage large binary files like images, videos, design assets, and compiled binaries without bloating your repositories. You’ll get faster clones, better performance, and cleaner version history for large assets.

Regional Expansions (Starting Q3 2026) – CodeCommit will expand to additional AWS Regions in eu-south-2 and ca-west-1, bringing the service closer to where you’re building and deploying your applications.

We’ll share more details about these features and additional roadmap items in the coming months. Keep an eye on our What’s New feed for the latest AWS launches.

Pricing, SLA, and Getting Started

Pricing remains unchanged—you can review the current structure on the CodeCommit pricing page. We continue to maintain our 99.9% uptime SLA as defined in our service terms.

If you’re new to CodeCommit or returning after a migration, check out our Getting Started Guide for step-by-step instructions. For migration assistance or questions about your specific setup, contact AWS Support or your account team.

Available Now

AWS CodeCommit is available now in 29 regions. New customers can begin creating repositories immediately. Visit the CodeCommit console to get started.

Thank you for your feedback, your patience, and your continued trust in AWS. We’re committed to making CodeCommit the best integrated Git repository service for AWS development.

Learn More:

Take fine-grained control of your AWS CloudFormation StackSets Deployment with StackSet Dependencies

Post Syndicated from Tanvi Ravindra Malali original https://aws.amazon.com/blogs/devops/take-fine-grained-control-of-your-aws-cloudformation-stacksets-deployment-with-stackset-dependencies/

Introduction

AWS CloudFormation StackSets enable you to deploy CloudFormation stacks across multiple AWS accounts and regions with a single operation, providing centralized management of infrastructure at scale through AWS Organizations integration. In enterprise environments, multiple StackSet often need to deploy in a specific order. For example, networking infrastructure must be ready before applications can deploy successfully.

Architecture diagram showing an Administrator account with a Stack set, and many target accounts with their own stacks, which in turn control other stacks. Demonstrating how a multi account, multi stack architecture can get complicated.

Figure 1: Example of a multi-region AWS CloudFormation StackSet architecture with an administrative account and target accounts

Previously, when multiple StackSets had auto-deployment enabled, they operated independently without coordination. This could cause deployment failures when dependent infrastructure wasn’t ready, forcing customers to implement complex workarounds or disable auto-deployment entirely.

We are announcing StackSets dependencies, a new feature that gives you fine-grained control over the deployment order of your auto-deployed StackSets, elegantly solving these orchestration challenges.

Feature Overview

This new feature introduces the ability to define dependencies between StackSets using the new DependsOn parameter in the AutoDeployment configuration. When accounts move between Organizational Units or are added to your organization, StackSets automatically orchestrates deployments according to your defined sequence, ensuring foundational infrastructure deploys before dependent applications.

Key capabilities include:

  • Dependency Management: Define up to 10 dependencies per StackSet, with up to 100 dependencies per account. For example, if you have 5 StackSets with 5 dependencies each, you have 25 dependencies counting towards the 100 dependency limit. You can request a limit increase through the service quota console.
  • Cycle Detection: Built-in validation prevents circular dependencies with error messages.
  • Cross-Region Support: Dependencies work across regions.
  • Automatic Cleanup: Dependencies are removed when StackSets are deleted or Organizations are deactivated.

How it works

Let’s walk through this feature with a practical example. Consider an infrastructure setup where you have: A central Infrastructure StackSet that creates IAM roles and networking components and multiple Application StackSets that depend on these foundational resources.

With StackSets dependencies, you can make sure the Infrastructure StackSet completes deployment before any Application StackSets begin, preventing deployment failures due to missing dependencies.

Implementation Scenarios

Let’s explore three common scenarios where StackSets Dependencies provides value:

Scenario 1: Foundation-First Deployment

Use Case: You have a foundational Infrastructure StackSet that creates IAM roles and networking components, and multiple Application StackSets that depend on these resources.

Setup:

  • Infrastructure StackSet ARNs (creates IAM roles, VPCs, security groups)
  • App1 StackSet (web application requiring IAM roles)
  • App2 StackSet (API service requiring networking components)
  • No additional permissions are required to use this feature.

Console Experience

The CloudFormation console provides an intuitive interface for managing StackSet dependencies. Log into the AWS console with your credentials, with an IAM user or administrative user, according to your access. Navigate to the Cloudformation service and create a new Stack or add a YAML/JSON template, where you will be configuring dependencies. In the Step 4 of the Create StackSet wizard, you’ll find a new “StackSet dependencies” form field in the Auto-deployment options section. You can use the attribute editor to add StackSet ARNs for dependencies. The console includes input validation for ARN format and helpful alerts about dependency behavior.

Console view showing options to Activate or Deactivate Automatic deployment, and whether to Delete or Retain stacks, and the new feature, Stack set dependencies, and a space to designate a dependent stack set.

Figure 2: CloudFormation StackSets Console – Auto-deployment options view

AWS CLI Implementation:

  1. Create the foundational Infrastructure StackSet:

aws cloudformation create-stack-set \
  --stack-set-name Infrastructure \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=true \
  --template-body file://infrastructure-template.yaml \
  --region us-east-1

2. Create App1 with dependency on Infrastructure:

aws cloudformation create-stack-set \
  --stack-set-name App1 \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=true,\
  DependsOn=arn:aws:cloudformation:us-east-1:123456789012:StackSet/Infrastructure:uuid \
  --template-body file://app1-template.yaml \
  --region us-east-1

3. Create App2 with dependency on Infrastructure:

aws cloudformation create-stack-set \
  --stack-set-name App2 \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=true,DependsOn=arn:aws:cloudformation:us-east-1:123456789012:StackSet/Infrastructure:uuid \
  --template-body file://app2-template.yaml \
  --region us-west-2

Now, when accounts are added to your organization, Infrastructure deploys first, then App1 and App2 deploy in parallel after Infrastructure completes.

Scenario 2: Multi-Dependency Application

Use Case: Your application requires both networking and security components to be ready before deployment.

Setup:

  • Networking StackSet (VPCs, subnets, route tables)
  • Security StackSet (security groups, NACLs, IAM policies)
  • Application StackSet (requires both networking and security)

Implementation:

  1. Create Networking StackSet

aws cloudformation create-stack-set \
  --stack-set-name Networking \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=true \
  --template-body file://networking-template.yaml \
  --region us-east-1

2. Create Security StackSet

aws cloudformation create-stack-set \
  --stack-set-name Security \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=true \
  --template-body file://security-template.yaml \
  --region us-east-1

3. Create Application with dependencies on both Networking and Security

aws cloudformation create-stack-set \
  --stack-set-name Application \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=true,DependsOn=arn:aws:cloudformation:us-east-1:123456789012:StackSet/Networking:uuid,arn:aws:cloudformation:us-east-1:123456789012:Stackset/Security:uuid \
  --template-body file://application-template.yaml \
  --region us-east-1

As a result, Networking and Security StackSets deploy in parallel, and Application waits for both to complete before starting.

Scenario 3: Resolving Dependency Conflicts

Use Case: You need to update existing StackSets to fix incorrect dependency relationships.

Problem: You have App1 and App2 StackSets. There is an existing dependency that App2 has on App1, but you realize App1 should depend on App2, not the other way around.

Implementation:

First, try to set App1 to depend on App2 (this will fail due to cycle):

aws cloudformation update-stack-set \
  --stack-set-name App1 \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=true,DependsOn=arn:aws:cloudformation:us-east-1:123456789012:StackSet/App2:uuid \
  --use-previous-template

This action will result in error: “Detected cycle(s) between auto-deployment dependencies”. If dependency validation cannot be completed, you’ll receive appropriate error messages to help troubleshoot configuration issues.

Now let’s remove the existing dependency from App2:

aws cloudformation update-stack-set \
  --stack-set-name App2 \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=true \
  --use-previous-template

Now successfully set App1 to depend on App2:

aws cloudformation update-stack-set \
  --stack-set-name App1 \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=true,DependsOn=arn:aws:cloudformation:us-east-1:123456789012:StackSet/App2:uuid \
  --use-previous-template

This scenario demonstrates cycle detection and how to resolve dependency conflicts.

Getting Started

StackSet dependencies is available now in all AWS Regions where CloudFormation StackSets are supported. To get started:

  1. Identify Dependencies: Determine which StackSets should deploy first in your infrastructure.
  2. Configure Relationships: Use the CloudFormation console or AWS CLI to set up dependencies using StackSet ARNs.
  3. Test Your Sequence: Validate your dependency configuration in a test environment.
  4. Monitor Deployments: Use CloudFormation events to track sequenced deployments.

Log into your account in the console and visit the AWS CloudFormation StackSets console or use the AWS CLI/SDK with AWS credentials configured to start controlling StackSet dependencies today.

Authors


Tanvi Ravindra Malali

Tanvi Ravindra Malali is an Associate Delivery Consultant in the AWS A2C team in ProServe. She is based in New York City. She handles customer projects and codebases, specializing in AI/ML, Data Engineering and Infrastructure as Code. Outside of work, she loves to paint landscapes, DJing her favorite songs, and dances Tango.

Idriss Louali Abdou
Idriss Laouali Abdou

Idriss Laouali Abdou is a Sr. Product Manager Technical on the AWS Infrastructure-as-Code team based in Seattle. He focuses on improving developer productivity through CloudFormation and StackSets Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Monitor network performance and traffic across your EKS clusters with Container Network Observability

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/monitor-network-performance-and-traffic-across-your-eks-clusters-with-container-network-observability/

Organizations are increasingly expanding their Kubernetes footprint by deploying microservices to incrementally innovate and deliver business value faster. This growth places increased reliance on the network, giving platform teams exponentially complex challenges in monitoring network performance and traffic patterns in EKS. As a result, organizations struggle to maintain operational efficiency as their container environments scale, often delaying application delivery and increasing operational costs.

Today, I’m excited to announce Container Network Observability in Amazon Elastic Kubernetes Service (Amazon EKS), a comprehensive set of network observability features in Amazon EKS that you can use to better measure your network performance in your system and dynamically visualize the landscape and behavior of network traffic in EKS.

Here’s a quick look at Container Network Observability in Amazon EKS:

Container Network Observability in EKS addresses observability challenges by providing enhanced visibility of workload traffic. It offers performance insights into network flows within the cluster and those with cluster-external destinations. This makes your EKS cluster network environment more observable while providing built-in capabilities for more precise troubleshooting and investigative efforts.

Getting started with Container Network Observability in EKS

I can enable this new feature for a new or existing EKS cluster. For a new EKS cluster, during the Configure observability setup, I navigate to the Configure network observability section. Here, I select Edit container network observability. I can see there are three included features: Service map, Flow table, and Performance metric endpoint, which are enabled by Amazon CloudWatch Network Flow Monitor.

On the next page, I need to install the AWS Network Flow Monitor Agent.

After it’s enabled, I can navigate to my EKS cluster and select Monitor cluster.

This will bring me to my cluster observability dashboard. Then, I select the Network tab.


Comprehensive observability features
Container Network Observability in EKS provides several key features, including performance metrics, service map, and flow table with three views: AWS service view, cluster view, and external view.

With Performance metrics, you can now scrape network-related system metrics for pods and worker nodes directly from the Network Flow Monitor agent and send them to your preferred monitoring destination. Available metrics include ingress/egress flow counts, packet counts, bytes transferred, and various allowance exceeded counters for bandwidth, packets per second, and connection tracking limits. The following screenshot shows an example of how you can use Amazon Managed Grafana to visualize the performance metrics scraped using Prometheus.


With the Service map feature, you can dynamically visualize intercommunication between workloads in your cluster, making it straightforward to understand your application topology with a quick look. The service map helps you quickly identify performance issues by highlighting key metrics such as retransmissions, retransmission timeouts, and data transferred for network flows between communicating pods.

Let me show you how this works with a sample e-commerce application. The service map provides both high-level and detailed views of your microservices architecture. In this e-commerce example, we can see three core microservices working together: the GraphQL service acts as an API gateway, orchestrating requests between the frontend and backend services.

When a customer browses products or places an order, the GraphQL service coordinates communication with both the products service (for catalog data, pricing, and inventory) and the orders service (for order processing and management). This architecture allows each service to scale independently while maintaining clear separation of concerns.

For deeper troubleshooting, you can expand the view to see individual pod instances and their communication patterns. The detailed view reveals the complexity of microservices communication. Here, you can see multiple pod instances for each service and the network of connections between them.

This granular visibility is crucial for identifying issues like uneven load distribution, pod-to-pod communication bottlenecks, or when specific pod instances are experiencing higher latency. For example, if one GraphQL pod is making disproportionately more calls to a particular products pod, you can quickly spot this pattern and investigate potential causes.

Use the Flow table to monitor the top talkers across Kubernetes workloads in your cluster from three different perspectives, each providing unique insights into your network traffic patterns.

Flow table – Monitor the top talkers across Kubernetes workloads in your cluster from three different perspectives, each providing unique insights into your network traffic patterns:

  • AWS service view shows which workloads generate the most traffic to Amazon Web Services (AWS) services such as Amazon DynamoDB and Amazon Simple Storage Service (Amazon S3), so you can optimize data access patterns and identify potential cost optimization opportunities.
  • The Cluster view reveals the heaviest communicators within your cluster (east-west traffic), which means you can spot chatty microservices that might benefit from optimization or colocation strategies
  • External viewidentifies workloads with the highest traffic to destinations outside AWS (internet or on premises), which is useful for security monitoring and bandwidth management.

The flow table provides detailed metrics and filtering capabilities to analyze network traffic patterns. In this example, we can see the flow table displaying cluster view traffic between our e-commerce services. The table shows that the orders pod is communicating with multiple products pods, transferring amounts of data. This pattern suggests the orders service is making frequent product lookups during order processing.

The filtering capabilities are useful for troubleshooting, for example, to focus on traffic from a specific orders pod. This granular filtering helps you quickly isolate communication patterns when investigating performance issues. For instance, if customers are experiencing slow checkout times, you can filter to see if the orders service is making too many calls to the products service, or if there are network bottlenecks between specific pod instances.

Additional things to know
Here are key points to note about Container Network Observability in EKS:

  • Pricing – For network monitoring, you pay standard Amazon CloudWatch Network Flow Monitor pricing.
  • Availability – Container Network Observability in EKS is available in all commercial AWS regions where Amazon CloudWatch Network Flow Monitor is available.
  • Export metrics to your preferred monitoring solution – Metrics are available in OpenMetrics format, compatible with Prometheus and Grafana. For configuration details, refer to Network Flow Monitor documentation.

Get started with Container Network Observability in Amazon EKS today to improve network observability in your cluster.

Happy building!
Donnie

Safely Handle Configuration Drift with CloudFormation Drift-Aware Change Sets

Post Syndicated from JJ Lei original https://aws.amazon.com/blogs/devops/safely-handle-configuration-drift-with-cloudformation-drift-aware-change-sets/

Introduction

Is configuration drift preventing you from accessing the speed, safety, and governance benefits of AWS CloudFormation for infrastructure management? Configuration drift occurs when cloud resources are modified outside of CloudFormation, leading to a mismatch in the actual state and template definition of resources. Drift tends to accumulate from infrastructure changes that engineers make via the AWS Management Console to resolve production incidents or troubleshoot malfunctioning applications. Drift can cause unexpected changes during subsequent IaC deployments or leave resources in a non-compliant state. Unresolved drift can lead to cost increases when resources are over-provisioned outside of template definitions, or compliance violations that may result in audit penalties. Additionally, drift makes it hard to reproduce applications for testing or disaster recovery.

CloudFormation now offers drift-aware change sets that allow you to safely handle configuration drift and keep your infrastructure in sync with your templates. In this post, we will explore the process of leveraging drift-aware change sets to resolve common scenarios in which drift impacts the availability or security of your application.

Solution Overview

Drift-aware change sets are a type of CloudFormation change sets that can bring drifted resources in line with template definitions and preview the required changes to actual infrastructure states before deployment. Drift-aware change sets surface a three-way comparison of your new template, actual resource states, and previous template before deployment, allowing you to prevent unexpected overwrites of drift. Additionally, drift-aware change sets offer you a systematic mechanism to restore drifted resources to approved template definitions, strengthening the reproducibility and compliance posture of applications. You can create drift-aware change sets either from the CloudFormation Management Console or from the AWS CLI or SDK by passing the --deployment-mode REVERT_DRIFT parameter to the CreateChangeSet API.

Prerequisites

AWS CLI latest version with CloudFormation permissions configured.

AWS Identity and Access Management (IAM) permissions required: Permissions to create and manage CloudFormation stacks, AWS Lambda functions, Security Groups, Amazon Simple Storage Service (Amazon S3) buckets, and IAM roles. PowerUserAccess or Administrator access recommended for testing.

• Test environment (non-production AWS account recommended)

• Basic CloudFormation knowledge (stacks, templates, change sets)

Important Note: These sample templates are provided for educational purposes only and should not be used in production environments without proper security review and testing. You are responsible for testing, securing, and optimizing these templates based on your specific quality control practices and standards. Deploying these templates may incur AWS charges for creating or using AWS resources. Work with your security and legal teams to meet your organizational security, regulatory, and compliance requirements before any production deployment.

Scenario 1: Prevent Dangerous Overwrites

This scenario demonstrates how drift-aware change sets prevent dangerous overwrites when Lambda function memory is increased outside of CloudFormation during an outage, and a subsequent template update could accidentally reduce memory, causing performance issues.

Story: Your team deploys a Lambda function with 128 MB memory via CloudFormation. During a production outage, an engineer increases the memory to 512 MB through the Lambda Console to resolve performance issues. Later, another developer updates the template to 256 MB for a code change, unaware of the console modification. Without drift-aware change sets, CloudFormation would unexpectedly reduce memory from 512 MB to 256 MB—potentially causing the outage to recur.

User journey: Create stack with 128MB => Increase memory to 512MB via console during outage => Create drift-aware change set with 256MB template => Review three-way comparison showing dangerous memory reduction => Cancel change set to prevent outage => Update template to match production state (512MB) => Create and execute drift-aware change set with updated template (512MB) to resolve drift

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with Lambda function (128 MB memory).

Figure 1

CloudFormation stack “lambda-memory-drift-test” successfully deployed with CREATE_COMPLETE status

2. Emergency Memory Increase (Console)

Manually increase Lambda memory to 512 MB through AWS Console (simulating emergency performance fix during outage).

Figure 2

Initial Lambda function showing 128 MB memory as configured in template

Figure 3

Lambda memory increased to 512 MB through console during outage, creating drift from template

3. Create Drift-Aware Change Set

Create change set with 256 MB template using drift-aware mode to reveal the dangerous memory reduction.

Figure 4

CloudFormation console showing the new “Drift aware change set” option selected. This compares the new template with the live state of your stack and shows changes to drifted resources before deployment, unlike standard change sets that only compare templates.

aws cloudformation create-change-set \
--stack-name lambda-memory-drift-test \
--change-set-name detect-memory-overwrite \
--template-body file://lambda-memory-drift-scenario-256mb.yaml \
--deployment-mode REVERT_DRIFT \
--capabilities CAPABILITY_IAM \
--region us-east-1

4. Review Change Set – The Critical Three-Way Comparison

Examine the drift-aware change set to see the dangerous memory reduction that would occur.

Figure 5

Critical insight revealed: The change set shows Live resource state (512 MB) vs Proposed resource state (256 MB), revealing a dangerous memory reduction that would impact performance.

Figure 6: view drift

Drift analysis: Clicking “View drift” reveals the complete picture – Previous template (128 MB) vs Live resource state (512 MB). This shows the live state has 4x more memory than the original template, indicating emergency changes were made during the outage that must be preserved.

Key Insight: The drift-aware change set reveals that:

  • Previous template: 128 MB (original deployment)
  • Live resource state: 512 MB (emergency change during outage)
  • Proposed template: 256 MB (new deployment)

This would cause a dangerous reduction from 512 MB to 256 MB, potentially recreating the original performance issue. Without drift-aware change sets, this critical information would be hidden.

5. Recreate Drift-aware Change Set with Updated Template (512MB) to Resolve Drift

Update the template to match the live production state (512 MB) and create a new drift-aware change set to safely resolve the drift.

Figure 7

Resolution confirmed: The drift-aware change set shows both Live resource state and Proposed resource state at 512 MB, with change set action ” Sync with live”. This verifies that the updated template now matches production, preventing the dangerous memory reduction and safely resolving the drift without impacting performance.

CloudFormation Templates

Initial Template (128 MB):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 128
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Updated Template (256 MB – lambda-memory-drift-scenario-256mb.yaml):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 256
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name lambda-memory-drift-test --template-body file://lambda-memory-drift-scenario.yaml --capabilities CAPABILITY_IAM --region us-east-1
  1. Get function name:
aws cloudformation describe-stack-resources --stack-name lambda-memory-drift-test --logical-resource-id DriftTestFunction --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name lambda-memory-drift-test --change-set-name detect-memory-overwrite --template-body file://lambda-memory-drift-scenario-256mb.yaml --deployment-mode REVERT_DRIFT --capabilities CAPABILITY_IAM --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name detect-memory-overwrite --stack-name lambda-memory-drift-test --region us-east-1

Scenario 2: Remediate Unauthorized Changes

This scenario demonstrates how drift-aware change sets systematically remediate unauthorized changes when a developer adds temporary debugging rules to a security group but forgets to remove them, creating a compliance violation.

Story: Your team deploys a security group with only HTTP access via CloudFormation for compliance. During debugging, a developer adds SSH access (port 22) through the AWS Console for their IP address to troubleshoot an application issue. They forget to remove this rule after debugging. Later, security compliance requires reverting to the original template state. A standard change set shows no changes since the template is unchanged, but a drift-aware change set can detect and systematically remove the unauthorized SSH rule.

User journey: Create stack with HTTP-only access => Add SSH rule via console for debugging => Forget to remove SSH rule => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing SSH rule removal => Execute change set to restore compliance

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with security group allowing only HTTP traffic.

Figure 8

CloudFormation stack “sg-revert-drift-test” successfully deployed with DriftTestSecurityGroup resource

2. Make Unauthorized Changes (Console)

Manually add SSH ingress rule through AWS Console (simulating developer debugging access that wasn’t removed).

Figure 9: http only

Initial security group showing only HTTP (port 80) access as configured in template – compliant state

Figure 10: ssh-added

Security group now shows 2 permission entries: SSH (port 22) for specific IP and HTTP (port 80) for all traffic. The SSH rule creates drift and a compliance violation that needs systematic removal.

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to systematically remove the unauthorized SSH rule.

Figure 11

Creating drift-aware change set for security group compliance restoration. Note the “Drift aware change set” option is selected to compare with live state and detect unauthorized changes.

aws cloudformation create-change-set \
--stack-name sg-revert-drift-test \
--change-set-name revert-ssh-drift \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Systematic Compliance Restoration

Examine the drift-aware change set to see systematic removal of unauthorized SSH rule.

Figure 12

Compliance violation detected: The drift -aware change set shows that the SSH rule in the live resource state (rule 232 for IP 15.248.7.53/32 on port 22) is not present in the proposed resource state derived from the template. This unauthorized SSH rule violates security policy and will be systematically removed

Key Insight: The drift-aware change set enables systematic compliance restoration by:

  • Previous template: Only HTTP (port 80) access – compliant state
  • Live resource state: HTTP + SSH (port 22) for 15.248.7.53/32 – compliance violation
  • Action: Remove unauthorized SSH rule to restore compliance

This provides a systematic, auditable way to remove unauthorized changes rather than manual cleanup.

Figure 13

Stack events showing successful execution of the drift-aware change set – SSH rule removed

CloudFormation Templates

security-group-drift-scenario.yaml:

Resources:
  DriftTestSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Security group for drift testing"
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
          Description: "Allow HTTP traffic for demo purposes"
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: 0.0.0.0/0
          Description: "Allow all outbound traffic"

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name sg-revert-drift-test --template-body file://security-group-drift-scenario.yaml --region us-east-1
  1. Get security group ID:
aws ec2 describe-security-groups --filters "Name=tag:aws:cloudformation:stack-name,Values=sg-revert-drift-test" --query 'SecurityGroups[0].GroupId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name sg-revert-drift-test --change-set-name revert-ssh-drift --template-body file://security-group-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name revert-ssh-drift --stack-name sg-revert-drift-test --region us-east-1

Scenario 3: Recreate Deleted Resources

This scenario demonstrates drift detection when a dependent resource (logs bucket) is accidentally deleted outside of CloudFormation during troubleshooting. The main application bucket depends on this logs bucket for access logging. You need to recreate the deleted resource while maintaining the existing infrastructure dependencies.

Story: Your team deploys a main S3 bucket with a dependent logs bucket for access logging via CloudFormation. During troubleshooting, an operator accidentally deletes the logs bucket through the AWS Console. The main bucket still exists but its logging configuration now references a non-existent bucket. You need to recreate the deleted logs bucket while maintaining the dependency relationship.

User journey: Create stack with main and logs buckets => Accidentally delete logs bucket => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing LogBucket will be recreated => Execute change set to restore deleted resource

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with main S3 bucket and dependent logs bucket.

Figure 14

CloudFormation stack “s3-deletion-drift-test” successfully deployed with both LogBucket and MainBucket resources in CREATE_COMPLETE status

2. Accidental Deletion (Console)

Manually delete the logs bucket through AWS Console (simulating accidental deletion during troubleshooting).

Figure 15

LogBucket accidentally deleted outside of CloudFormation during troubleshooting, creating drift – the MainBucket still exists but its logging configuration now references a non-existent bucket

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to recreate the deleted LogBucket.

Figure 16

Creating drift-aware change set with “Drift aware change set” option selected to detect and recreate the deleted resource by comparing template with live state

aws cloudformation create-change-set \
--stack-name s3-deletion-drift-test \
--change-set-name recreate-deleted-bucket \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Resource Recreation

Examine change set to see LogBucket recreation while preserving MainBucket dependencies.

Figure 17

Change set preview showing LogBucket will be recreated to restore the deleted resource and MainBucket updated to maintain infrastructure dependencies

Key Insight: The drift-aware change set detects that:

  • Template expectation: Both LogBucket and MainBucket should exist
  • Live resource state: Only MainBucket exists, LogBucket is missing
  • Action: Recreate LogBucket with original configuration to restore logging functionality

This enables systematic recovery of accidentally deleted resources while maintaining infrastructure dependencies.

CloudFormation Templates

s3-drift-scenario.yaml:

Resources:
  LogBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
  
  MainBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
      LoggingConfiguration:
        DestinationBucketName: !Ref LogBucket

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name s3-deletion-drift-test --template-body file://s3-drift-scenario.yaml --region us-east-1
  1. Get LogBucket name:
aws cloudformation describe-stack-resources --stack-name s3-deletion-drift-test --logical-resource-id LogBucket --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name s3-deletion-drift-test --change-set-name recreate-deleted-bucket --template-body file://s3-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name recreate-deleted-bucket --stack-name s3-deletion-drift-test --region us-east-1

Best Practices

When working with drift-aware change sets, consider these best practices:

Always review three-way comparisons before executing change sets to understand the full impact

Use REVERT_DRIFT deployment mode when you want to bring resources back to template compliance

Document emergency changes made outside of CloudFormation to inform future template updates

Implement change management processes to minimize unauthorized drift

Regular drift detection helps identify configuration changes before they become problematic

Test drift-aware change sets in non-production environments first

Cleanup

Important: Execute these cleanup commands promptly after completing the scenarios to avoid incurring unnecessary AWS charges. Resources such as Lambda functions, S3 buckets (even if empty), and security groups may incur costs if left running. Ensure all stacks are successfully deleted by verifying the DELETE_COMPLETE status.

Commands to delete all test resources:

# Scenario 1: Lambda Memory Drift
aws cloudformation delete-stack --stack-name lambda-memory-drift-test --region us-east-1

# Scenario 2: Security Group Drift
aws cloudformation delete-stack --stack-name sg-revert-drift-test --region us-east-1

# Scenario 3: S3 Bucket Deletion Drift
aws cloudformation delete-stack --stack-name s3-deletion-drift-test --region us-east-1

# Verify all stacks are deleted
aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE --region us-east-1

Note: CloudFormation will automatically clean up all resources created by the stacks, including Lambda functions, security groups, and S3 buckets.

Conclusion

Drift-aware change sets enable you to mitigate the operational and security risks of configuration drift, allowing you to confidently automate and govern your infrastructure updates with CloudFormation. Through the scenarios described in this post, you have seen how you can leverage drift-aware change sets to prevent outages in production environments, maintain the integrity of your test environments, and manage the compliance posture of all environments. Remember to thoroughly review the infrastructure changes previewed by drift-aware change sets before executing deployments.

Available Now

Drift-aware change sets are available in AWS Regions where CloudFormation is available. Please refer to the AWS Region table to learn more.

StackSets Deployment Strategies: Balancing Speed, Safety, and Scale to Optimize Deployments for Different Organizational Needs

Post Syndicated from Amar Meriche original https://aws.amazon.com/blogs/devops/stacksets-deployment-strategies-balancing-speed-safety-and-scale-to-optimize-deployments-for-different-organizational-needs/

AWS CloudFormation StackSets enables organizations to deploy infrastructure consistently across multiple AWS accounts and regions. However, success depends on choosing the right deployment strategy that balances three critical factors: deployment speed, operational safety, and organizational scale. This guide explores proven StackSets deployment strategies specifically designed for multi-account infrastructure management.

Understanding StackSets Deployment Fundamentals

What are StackSets Actually Used For?

Unlike single-account AWS CloudFormation templates, StackSets are specifically designed for multi-account infrastructure governance. Common use cases include Security baselines (deploying IAM policies, security groups, and access controls across all accounts), Compliance controls (rolling out AWS Config rules, AWS CloudTrail configurations, and audit requirements), Organizational standards (establishing consistent VPC configurations, tagging policies, and naming conventions), Shared services (deploying monitoring solutions, logging infrastructure, and backup policies) or Cost management (implementing budget controls, cost allocation tags, and resource optimization policies)

The Multi-Account Challenge

Managing infrastructure across dozens or hundreds of AWS accounts presents unique challenges:

Single Account (CFN Template)     Multi-Account (StackSets)
      App A                           Org Unit A (50 accounts)
        |                                     |
   [Deploy Once]               [Deploy consistently across all]
        |                                     |
    Success/Fail                Complex success/failure matrix

Multi account and multi region Cloudformation deployment complexity

The Speed-Safety-Scale Triangle

Every StackSets deployment strategy involves trade-offs: Speed (how quickly changes propagate across your organization), Safety (risk mitigation and failure containment) and Scale (ability to manage hundreds of accounts efficiently)

Prerequisites

Before implementing any of the deployment strategies described in this guide, ensure you have:

  1. AWS CLI Installation
    1. Install the latest version of AWS CLI by following the AWS CLI installation guide
    2. Verify installation with: aws –version
  2. AWS Profile Configuration
    1. Configure your AWS credentials using: aws configure
    2. For details on configuration, see AWS CLI configuration basics
    3. Ensure your profile has appropriate permissions for CloudFormation StackSets operations as described in AWS StackSets prerequisites
  3. Proper Account Access The commands in this guide must be executed from either:
    1. The management account of your AWS Organization
    2. OR a delegated administrator account for CloudFormation

For information on setting up a delegated administrator, see Register a delegated administrator

Note: StackSets deployments using service-managed permissions cannot be performed from standalone accounts.

Verify you’re using the correct account with:

bash
# For management account
aws organizations describe-organization
# For delegated admin
aws cloudformation list-stack-sets —call-as DELEGATED_ADMIN

AWS CLI to check the usage of an Organization and not a Standalone account

Core Deployment Strategies

As explained in the StackSet documentation:

  • “For a more conservative deployment, set Maximum Concurrent Accounts to 1, and Failure Tolerance to 0. Set your lowest-impact region to be first in the Region Order Start with one region.”
  • “For a faster deployment, increase the values of Maximum Concurrent Accounts and Failure Tolerance as needed. ”

Based on the above, we are proposing below several deployment strategies, depending on the speed, safety and scale you want to achieve.

1. Sequential Deployment: Maximum Safety

Use Case : Critical security updates, compliance requirements, first-time organizational rollouts

Below are listed some possible use cases:

  • Security baseline updates: New IAM policies affecting root access
  • Compliance rollouts: SOX, HIPAA, or PCI-DSS control implementations
  • Critical infrastructure changes: VPC security group modifications
  • Organizational policy changes: New AWS Config rules for audit compliance

Implementation Example:

For this example, we will download the following template ConfigRuleCloudtrailEnabled.yml from the Cloudformation sample library in the AWS documentation to configure an AWS Config rule to determine if AWS CloudTrail is enabled and follow the next steps:

Step 1: Create the StackSet

With the AWS CLI:

# Create Stackset for security baseline
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
  --stack-set-name security-baseline \
  --template-body file://ConfigRuleCloudtrailEnabled.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
  --region us-east-1

AWS CLI to create a security-baseline Stackset

The expected response should be similar to the following :

{"StacksetId": "security-baseline: ...."}

Step 2: Create Stack Instances

Before you launch the below command, you need to adjust the values of the following parameters:

  • OrganizationalUnitIds: you must change the value “ou-test” in the below command line to the name of the target OU you want to deploy to. I recommend creating a new test OU in the console or via the CLI for the purpose of this test.
  • regions: if needed, change the “us-east-1 eu-west-1” value, here you need to list all the regions you want to deploy to. AWS Config must be active in the accounts/regions that you choose, otherwise you’ll get an error when deploying the Stack.

# Deploy security baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1 and eu-west-1
# SEQUENTIAL = One region at a time, sequentially
# MaxConcurrentPercentage = Deploy to 5% of accounts at once
# FailureTolerancePercentage = Stop on first failure
aws cloudformation create-stack-instances \
  --stack-set-name security-baseline \
  --deployment-targets OrganizationalUnitIds=ou-test\
  --regions us-east-1 eu-west-1 \
  --region us-east-1 \
  --operation-preferences RegionConcurrencyType=SEQUENTIAL,MaxConcurrentPercentage=5,FailureTolerancePercentage=0

AWS CLI to create security-baseline Stack Instances sequentially for maximum safety

The CLI output should look like the following:

{"OperationId": ....}

Or create the StackSet and add the Stacks with the AWS Console:

In the CloudFormation Console, click “Create StackSet”

AWS CloudFormation Console: create a security-baseline Stackset

AWS CloudFormation Console: create a security-baseline Stackset

Upload your template from S3 or from your computer and click Next:

AWS CloudFormation Console: specify a template

AWS CloudFormation Console: specify a template

Specify the StackSet name and parameters and click Next:

AWS CloudFormation Console: specify the StackSet name and parameters

AWS CloudFormation Console: specify the StackSet name and parameters

Configure StackSet options and click Next:

AWS CloudFormation Console: configure the StackSet options

AWS CloudFormation Console: configure the StackSet options

Set deployment options and click Next:

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set more deployment options

Then Review and Submit.

Not to overweight this blog, we’ll provide only this example of CLI output and Console screenshot, but the “Parallel Deployment” and “Balanced Approach” will be similar to this example. You just need to update the parameters for the different StackSet Operations options.

A real-world example would be a financial services company deploying new MFA requirements across 200 production accounts. They could use sequential deployment with 5 concurrency to ensure each batch was validated before proceeding.

2. Parallel Deployment: Maximum Speed

The Parallel Deployment is best for non-critical updates, development environments, routine maintenance

Here are some possible use cases:

  • Development account standardization: Rolling out new development tools
  • Monitoring infrastructure: Deploying Amazon CloudWatch dashboards and alarms
  • Cost optimization: Implementing automated resource cleanup policies
  • Non-production updates: Updating development and staging environments

Implementation Example:

For this example, we will copy paste the .yml template from this Re:Post article about monitoring IAM events in a file called “monitoring-baseline.yml”, and use it in the following command lines.

Step 1: Create the StackSet

# Create Stackset for monitoring baseline
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name monitoring-baseline \
--template-body file://monitoring-baseline.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a monitoring-baseline Stackset

Step 2: Create Stack Instances

Just like in the previous example, before you launch the below command, you need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to dev and sandbox accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1 and eu-west-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 80% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 20% of accounts
aws cloudformation create-stack-instances \
--stack-set-name monitoring-baseline \
--deployment-targets OrganizationalUnitIds=ou-development,ou-sandbox \
--regions us-east-1 eu-west-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=80,FailureTolerancePercentage=20

AWS CLI to create monitoring-baseline Stack Instances in parallel with high value for max concurrent percentage for maximum speed

3. Progressive Deployment: Balanced Approach or Multi Phase Approach (Recommended)

For most production scenarios with moderate risk tolerance, it is recommended to use a Balanced Approach, or Multi-Phase Implementation.

Balanced Approach

For this example, to make it easier, you can create a copy of “monitoring-baseline.yml” created previously, and name it “balanced-template.yml”.

cp monitoring-baseline.yml balanced-template.yml

bash command to copy the monitoring-baseline.yml file to balanced-template.yml

Then you can use it in the following command lines.

Step 1: Create the StackSet

# Create Stackset for a balanced creation
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name balanced-deployment \
--template-body file://balanced-template.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a balanced-deployment Stackset

Step 2: Create Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1 and ap-southeast-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 25% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 8% of accounts
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-development,ou-sandbox \
--regions us-east-1 eu-west-1 ap-southeast-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=8

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment

Multi-Phase Implementation:

Step 1: Create the StackSet

# Create Stackset for a balanced creation
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name balanced-deployment \
--template-body file://balanced-template.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a balanced-deployment Stackset

Phase 1: Pilot Accounts (10% of target)

Phase 1: Create Pilot Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1
# SEQUENTIAL = Deployment in sequence
# MaxConcurrentPercentage = 100% Deploy full speed for small pilot
# FailureTolerancePercentage = Zero tolerance in pilot
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets Accounts=pilot-account-1,pilot-account-2 \
--regions us-east-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=SEQUENTIAL,MaxConcurrentPercentage=100,FailureTolerancePercentage=0

AWS CLI to create balanced-deployment Stack Instances sequentially for maximum safety in Pilot accounts

Wait for Pilot validation before proceeding to Phase 2

Phase 2: Early Adopter OUs (30% of target)

Phase 2: Create Early Adopter Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 25% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 5% of accounts
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-early-adopter \
--regions us-east-1 \
--region us-east-1 eu-west-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=5

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment in Early Adopter OU

Wait for Early Adopter validation before proceeding to Phase 3

Phase 3: Full Deployment (Remaining 60%)

Phase 3: Full Deployment

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1 and ap-southeast-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 40% of accounts at once for higher speed after validation
# FailureTolerancePercentage = Tolerate failures in 10% of accounts for moderate tolerance
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-standard-prod,ou-legacy-prod \
--regions us-east-1 \
--region us-east-1 eu-west-1 ap-southeast-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=5

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment in the remaining OUs

Using Step Functions for Orchestration

AWS Step Functions provides a serverless workflow service that can orchestrate StackSets deployments with advanced control flow, error handling, and state management capabilities. This approach enhances your multi-account deployments with features not available through standard StackSets operations alone.

Some of the Key Benefits include:

  • Advanced Deployment Orchestration: Coordinate multi-phase rollouts with validation gates
  • Human Approval Workflows: Implement manual approval steps for critical changes
  • Enhanced Error Handling: Define sophisticated retry policies and fallback mechanisms
  • Visual Monitoring: Track deployment progress through the Step Functions visual console

Real-World Use Case: Compliance Control Rollout

In regulated industries, AWS Step Functions enables a phased approach that combines automation with necessary governance. For instance, you can:

  1. Deploy compliance controls to test accounts
  2. Run automated validation and generate compliance reports
  3. Obtain manual approval from compliance team
  4. Deploy to production accounts with comprehensive monitoring

This approach ensures consistent governance while maintaining the complete audit trail required for regulatory compliance.

Monitoring and Optimization

AWS CloudFormation StackSets do not have extensive built-in Amazon CloudWatch metrics specifically designed for monitoring StackSet operations and health. This is actually why the monitoring implementation in our blog post is valuable.

Here’s what AWS does and doesn’t provide out of the box:

What AWS provides natively:

  • Basic AWS API call metrics via AWS CloudTrail (which show that operations happened but don’t track success rates or performance)
  • General service quotas and throttling metrics for CloudFormation as a whole
  • CloudFormation provides some metrics for individual stacks, but not consolidated StackSet-specific metrics

What requires custom implementation (as in our blog post):

  • Success rate metrics for StackSet operations across accounts
  • Deployment completion time tracking
  • Configuration drift detection and monitoring
  • Account-specific failure analysis
  • Comprehensive dashboards that show StackSet health across your organization

The code in our blog post demonstrates how to implement the success rate custom metrics by:

  1. Gathering data from the CloudFormation API about StackSet operations
  2. Calculating the success rate metrics for StackSet deployments
  3. Creating custom Amazon CloudWatch metrics in a custom namespace (like “StackSetMonitoring”)
  4. Setting up alerts for issues

This explains why organizations need to implement custom monitoring solutions like the one shown in our blog post rather than relying solely on built-in metrics.

Automated Monitoring Implementation: example of a custom metric to monitor the StackSet operations success rate

The following AWS Cloudformation template provides real-time monitoring and alerting for AWS CloudFormation StackSet operations through automated infrastructure deployment. This solution creates a complete monitoring system using a AWS Lambda function, Amazon EventBridge rules, Amazon SNS notifications, and Amazon CloudWatch dashboards to track StackSet success and failure rates. The core Lambda function named StackSetMonitor continuously monitors all active StackSets in your account, calculating success rates and publishing custom metrics to Amazon CloudWatch under the StackSetMonitoring namespace.

Below you’ll find a few example of possible custom metrics that could be implemented based on this AWS Cloudformation template:

  • Count of all operations (CREATE, UPDATE, DELETE) per StackSet over time periods
  • Number of stack instances with configuration drift (requires additional API calls)
  • Average time taken for StackSet operations to complete
  • Rate of StackSet operations to identify peak usage times
  • Number of individual stack instances that failed during operations
  • Number of retried operations (indicates infrastructure issues)

Here’s the StackSetMonitor.yml CloudFormation Template:

# StackSetMonitor.yml 
# CFN template for monitoring AWS CloudFormation StackSet operations with real-time alerts, metrics, and dashboards.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudFormation template for StackSet operation monitoring using CloudWatch and SNS'

Parameters:
  StackSetName:
    Type: String
    Description: 'Name of the StackSet to monitor'
    Default: 'security-baseline'
    MinLength: 1
    MaxLength: 128
    AllowedPattern: '[a-zA-Z][-a-zA-Z0-9]*'
    ConstraintDescription: 'Must be a valid StackSet name (1-128 characters, alphanumeric and hyphens, must start with a letter)'
  
  VpcId:
    Type: String
    Description: 'VPC ID where the Lambda function will be deployed (leave empty to create new VPC)'
    Default: ''
  
  SubnetIds:
    Type: CommaDelimitedList
    Description: 'List of subnet IDs for the Lambda function (leave empty to create new subnets)'
    Default: ''
    
  SecurityGroupIds:
    Type: CommaDelimitedList
    Description: 'List of security group IDs for the Lambda function (leave empty to create new security group)'
    Default: ''

Conditions:
  CreateVPC: !Equals [!Ref VpcId, '']
  CreateVPCAndSubnets: !And [!Equals [!Ref VpcId, ''], !Equals [!Join [',', !Ref SubnetIds], '']]
  HasCustomSecurityGroups: !Not [!Equals [!Join [',', !Ref SecurityGroupIds], '']]
  
Resources:
  # KMS Key for CloudWatch Logs encryption
  LogsKMSKey:
    Type: AWS::KMS::Key
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      Description: 'KMS Key for StackSet Monitor CloudWatch Logs and Lambda environment variable encryption'
      EnableKeyRotation: true
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Sid: Enable IAM User Permissions
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:root'
            Action: 'kms:*'
            Resource: '*'
          - Sid: Allow CloudWatch Logs
            Effect: Allow
            Principal:
              Service: !Sub 'logs.${AWS::Region}.amazonaws.com'
            Action:
              - 'kms:Encrypt'
              - 'kms:Decrypt'
              - 'kms:ReEncrypt*'
              - 'kms:GenerateDataKey*'
              - 'kms:DescribeKey'
            Resource: '*'
            Condition:
              ArnEquals:
                'kms:EncryptionContext:aws:logs:arn': 
                  - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor'
                  - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets'
          - Sid: Allow Lambda Service
            Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action:
              - 'kms:Encrypt'
              - 'kms:Decrypt'
              - 'kms:ReEncrypt*'
              - 'kms:GenerateDataKey*'
              - 'kms:DescribeKey'
            Resource: '*'

  LogsKMSKeyAlias:
    Type: AWS::KMS::Alias
    Properties:
      AliasName: alias/stackset-monitor-logs
      TargetKeyId: !Ref LogsKMSKey

  # VPC Resources (created when no existing VPC is provided)
  StackSetMonitorVPC:
    Type: AWS::EC2::VPC
    Condition: CreateVPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: StackSetMonitor-VPC
        - Key: Purpose
          Value: VPC for StackSet Monitor Lambda function


  PrivateSubnet1:
    Type: AWS::EC2::Subnet
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-Subnet-1
        - Key: Purpose
          Value: Private subnet for StackSet Monitor Lambda

  PrivateSubnet2:
    Type: AWS::EC2::Subnet
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: !Select [1, !GetAZs '']
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-Subnet-2
        - Key: Purpose
          Value: Private subnet for StackSet Monitor Lambda

  PrivateRouteTable1:
    Type: AWS::EC2::RouteTable
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-RT-1

  PrivateRouteTable2:
    Type: AWS::EC2::RouteTable
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-RT-2

  PrivateSubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: CreateVPC
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      SubnetId: !Ref PrivateSubnet1

  PrivateSubnet2RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: CreateVPC
    Properties:
      RouteTableId: !Ref PrivateRouteTable2
      SubnetId: !Ref PrivateSubnet2

  # VPC Endpoints for AWS Services (no internet access needed)
  CloudFormationVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.cloudformation
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - cloudformation:ListStackSets
              - cloudformation:ListStackSetOperations
              - cloudformation:ListStackInstances
              - cloudformation:DescribeStackInstance
              - cloudformation:DescribeStacks
              - cloudformation:GetTemplate
            Resource: '*'

  CloudWatchVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.monitoring
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - cloudwatch:PutMetricData
            Resource: '*'

  SNSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sns
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sns:Publish
            Resource: '*'

  EventsVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.events
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - events:PutEvents
            Resource: '*'

  LogsVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.logs
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: '*'

  SQSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sqs
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sqs:SendMessage
            Resource: '*'

  STSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sts
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sts:AssumeRole
              - sts:GetCallerIdentity
              - sts:AssumeRoleWithWebIdentity
            Resource: '*'

  # Security Group for Lambda function
  LambdaSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for StackSet Monitor Lambda function
      VpcId: !If
        - CreateVPC
        - !Ref StackSetMonitorVPC
        - !Ref VpcId
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 10.0.0.0/16
          Description: HTTPS to VPC Endpoints
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS TCP to VPC for name resolution
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS UDP to VPC for name resolution
      Tags:
        - Key: Name
          Value: StackSetMonitor-Lambda-SG
        - Key: Purpose
          Value: Security group for StackSet Monitor Lambda

  VPCEndpointSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Condition: CreateVPC
    Properties:
      GroupDescription: Security group for VPC Endpoints
      VpcId: !Ref StackSetMonitorVPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: HTTPS from Lambda security group
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: DNS TCP from Lambda security group
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: DNS UDP from Lambda security group
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 10.0.0.0/16
          Description: HTTPS outbound within VPC
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS TCP outbound within VPC
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS UDP outbound within VPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-VPCEndpoint-SG
        - Key: Purpose
          Value: Security group for VPC Endpoints

  # Dead Letter Queue for Lambda function
  StackSetMonitorDLQ:
    Type: AWS::SQS::Queue
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      QueueName: StackSetMonitor-DLQ
      MessageRetentionPeriod: 1209600  # 14 days
      KmsMasterKeyId: alias/aws/sqs
      Tags:
        - Key: Purpose
          Value: Dead Letter Queue for StackSet Monitor Lambda

  StackSetAlertsTopic:
    Type: AWS::SNS::Topic
    Properties: 
      TopicName: StackSetAlerts
      DisplayName: StackSet Monitoring Alerts
      KmsMasterKeyId: alias/aws/sns
  
  StackSetLogGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties: 
      LogGroupName: /aws/cloudformation/stacksets
      RetentionInDays: 30
      KmsKeyId: !GetAtt LogsKMSKey.Arn

  LambdaLogGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      LogGroupName: /aws/lambda/StackSetMonitor
      RetentionInDays: 30
      KmsKeyId: !GetAtt LogsKMSKey.Arn
  
  StackSetMonitoringDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: StackSetMonitoring
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "width": 24,
              "height": 8,
              "properties": {
                "metrics": [
                  [ "StackSetMonitoring", "SuccessRate", "StackSetName", "${StackSetName}" ]
                ],
                "region": "${AWS::Region}",
                "title": "StackSet Operations",
                "period": 300,
                "stat": "Average"
              }
            },
            {
              "type": "log",
              "width": 24,
              "height": 6,
              "properties": {
                "query": "SOURCE '/aws/lambda/StackSetMonitor' | fields @timestamp, @message\n| sort @timestamp desc\n| limit 20",
                "region": "${AWS::Region}",
                "title": "Latest StackSet Monitor Logs",
                "view": "table"
              }
            }
          ]
        }
  
  # Consolidated rule to catch ALL StackSet events for comprehensive monitoring
  AllStackSetOperationsRule:
    Type: AWS::Events::Rule
    Properties:
      Name: AllStackSetOperationsRule
      Description: "Rule for monitoring all CloudFormation StackSet operations with failure notifications"
      EventPattern: {source: ["aws.cloudformation"], detail-type: ["CloudFormation StackSet Operation Status Change"]}
      State: ENABLED
      Targets:
        - Id: ProcessAllEvents
          Arn: !GetAtt StackSetMonitorLambda.Arn
        - Id: NotifyFailure
          Arn: !Ref StackSetAlertsTopic
          InputTransformer:
            InputPathsMap:
              "stackSetId": "$.detail.stack-set-id"
              "operationId": "$.detail.operation-id"
              "status": "$.detail.status"
              "time": "$.time"
            InputTemplate: '"StackSet Event: ID: <stackSetId>, Op: <operationId>, Status: <status>, Time: <time>"'

  StackSetMonitorLambda:
    Type: AWS::Lambda::Function
    DependsOn: LambdaLogGroup
    Properties:
      FunctionName: StackSetMonitor
      Handler: index.lambda_handler
      Role: !GetAtt StackSetMonitorRole.Arn
      Runtime: python3.12
      Timeout: 300
      MemorySize: 512
      ReservedConcurrentExecutions: 1
      DeadLetterConfig:
        TargetArn: !GetAtt StackSetMonitorDLQ.Arn
      VpcConfig:
        SecurityGroupIds: !If
          - HasCustomSecurityGroups
          - !Ref SecurityGroupIds
          - - !Ref LambdaSecurityGroup
        SubnetIds: !If
          - CreateVPCAndSubnets
          - - !Ref PrivateSubnet1
            - !Ref PrivateSubnet2
          - !Ref SubnetIds
      KmsKeyArn: !GetAtt LogsKMSKey.Arn
      Code:
        ZipFile: |
          import boto3
          import json
          import os
          import logging
          import time
          import datetime
          from typing import Dict, Any, Optional
          
          # Custom JSON encoder to handle datetime objects
          class DateTimeEncoder(json.JSONEncoder):
              def default(self, obj):
                  if isinstance(obj, datetime.datetime):
                      return obj.isoformat()
                  return super().default(obj)
          
          # Set up logging with more details
          logger = logging.getLogger()
          logger.setLevel(logging.INFO)
          
          # Log initialization to verify Lambda is loading correctly
          print("StackSetMonitor Lambda initializing...")
          
          def validate_event(event: Dict[str, Any]) -> bool:
              """Validate the incoming event structure"""
              if not isinstance(event, dict):
                  logger.error("Event must be a dictionary")
                  return False
              
              # If it's an EventBridge event, validate required fields
              if 'detail' in event:
                  detail = event.get('detail', {})
                  if not isinstance(detail, dict):
                      logger.error("Event detail must be a dictionary")
                      return False
                  
                  # Validate StackSet event structure
                  if 'stack-set-id' in detail:
                      stack_set_id = detail.get('stack-set-id')
                      if not isinstance(stack_set_id, str) or not stack_set_id.strip():
                          logger.error("stack-set-id must be a non-empty string")
                          return False
                      
                      # Validate operation-id if present
                      operation_id = detail.get('operation-id')
                      if operation_id is not None and not isinstance(operation_id, str):
                          logger.error("operation-id must be a string if provided")
                          return False
                      
                      # Validate status if present
                      status = detail.get('status')
                      if status is not None and not isinstance(status, str):
                          logger.error("status must be a string if provided")
                          return False
              
              return True
          
          def validate_context(context: Any) -> bool:
              """Validate the Lambda context object"""
              if context is None:
                  logger.error("Context cannot be None")
                  return False
              
              # Check for required context attributes
              required_attrs = ['function_name', 'function_version', 'invoked_function_arn', 'memory_limit_in_mb']
              for attr in required_attrs:
                  if not hasattr(context, attr):
                      logger.error(f"Context missing required attribute: {attr}")
                      return False
              
              return True
          
          def sanitize_string(value: str, max_length: int = 255) -> str:
              """Sanitize and truncate string inputs"""
              if not isinstance(value, str):
                  return str(value)[:max_length]
              return value.strip()[:max_length]
          
          def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
              """Main Lambda handler function for StackSet monitoring with input validation"""
              
              # Input validation
              if not validate_event(event):
                  return {
                      "statusCode": 400,
                      "body": json.dumps({
                          "status": "error",
                          "message": "Invalid event structure"
                      }, cls=DateTimeEncoder)
                  }
              
              if not validate_context(context):
                  return {
                      "statusCode": 400,
                      "body": json.dumps({
                          "status": "error",
                          "message": "Invalid context object"
                      }, cls=DateTimeEncoder)
                  }
              
              # Log the validated event for debugging
              logger.info(f"Event received: {json.dumps(event, cls=DateTimeEncoder)}")
              logger.info(f"Function: {context.function_name}, Version: {context.function_version}")
              
              try:
                  cf = boto3.client('cloudformation')
                  cw = boto3.client('cloudwatch')
                  
                  # Log that we're starting processing
                  logger.info(f"Starting StackSet monitoring at {time.time()}")
                  
                  # Check if this is an event from EventBridge
                  if 'detail' in event and 'stack-set-id' in event.get('detail', {}):
                      detail = event['detail']
                      stack_set_id = sanitize_string(detail['stack-set-id'])
                      operation_id = sanitize_string(detail.get('operation-id', 'N/A'))
                      status = sanitize_string(detail.get('status', 'N/A'))
                      
                      # Validate stack_set_id format
                      if not stack_set_id or len(stack_set_id) > 128:
                          logger.error(f"Invalid stack_set_id: {stack_set_id}")
                          return {
                              "statusCode": 400,
                              "body": json.dumps({
                                  "status": "error",
                                  "message": "Invalid stack_set_id format"
                              }, cls=DateTimeEncoder)
                          }
                      
                      # Log the StackSet operation with additional context
                      logger.info(f"Processing StackSet event - ID: {stack_set_id}, Op: {operation_id}, Status: {status}")
                      
                      # Extract stack set name from the ID
                      stack_set_name = stack_set_id.split('/')[-1] if '/' in stack_set_id else stack_set_id
                      stack_set_name = sanitize_string(stack_set_name, 128)
                      logger.info(f"Extracted StackSet name: {stack_set_name}")
                  
                  # Always gather metrics regardless of event type
                  # Get all active StackSets
                  stack_sets_response = cf.list_stack_sets(Status='ACTIVE')
                  stack_sets = stack_sets_response.get('Summaries', [])
                  
                  if not isinstance(stack_sets, list):
                      logger.error("Invalid response from list_stack_sets")
                      return {
                          "statusCode": 500,
                          "body": json.dumps({
                              "status": "error",
                              "message": "Invalid CloudFormation API response"
                          }, cls=DateTimeEncoder)
                      }
                  
                  logger.info(f"Found {len(stack_sets)} active StackSets")
                  
                  for stack_set in stack_sets:
                      if not isinstance(stack_set, dict) or 'StackSetName' not in stack_set:
                          logger.warning(f"Skipping invalid stack_set entry: {stack_set}")
                          continue
                      
                      stack_set_name = sanitize_string(stack_set['StackSetName'], 128)
                      logger.info(f"Processing StackSet: {stack_set_name}")
                      
                      try:
                          operations = cf.list_stack_set_operations(StackSetName=stack_set_name, MaxResults=5)
                          
                          # Validate operations response
                          if not isinstance(operations, dict):
                              logger.error(f"Invalid operations response for {stack_set_name}")
                              continue
                          
                          # Calculate success rate
                          successes = 0
                          operations_list = operations.get('Summaries', [])
                          
                          if not isinstance(operations_list, list):
                              logger.error(f"Invalid operations list for {stack_set_name}")
                              continue
                          
                          total_ops = len(operations_list)
                          logger.info(f"Found {total_ops} recent operations for {stack_set_name}")
                          
                          for op in operations_list:
                              if isinstance(op, dict) and op.get('Status') == 'SUCCEEDED':
                                  successes += 1
                          
                          success_rate = (successes / total_ops * 100) if total_ops > 0 else 100
                          
                          # Validate success_rate is within expected bounds
                          if not (0 <= success_rate <= 100):
                              logger.error(f"Invalid success_rate calculated: {success_rate}")
                              continue
                          
                          # Publish metrics to CloudWatch
                          cw.put_metric_data(
                              Namespace='StackSetMonitoring',
                              MetricData=[
                                  {'MetricName': 'SuccessRate', 'Value': success_rate, 
                                   'Dimensions': [{'Name': 'StackSetName', 'Value': stack_set_name}]}
                              ]
                          )
                          
                          logger.info(f"Published metrics for {stack_set_name}: Success Rate = {success_rate}%")
                      except Exception as e:
                          logger.error(f"Error processing StackSet {stack_set_name}: {str(e)}")
                  
                  return {
                      "statusCode": 200,
                      "body": json.dumps({
                          "status": "completed",
                          "message": f"Processed {len(stack_sets)} StackSets"
                      }, cls=DateTimeEncoder)
                  }
                  
              except Exception as e:
                  logger.error(f"Error in Lambda function: {str(e)}")
                  # Return a proper response even on error
                  return {
                      "statusCode": 500,
                      "body": json.dumps({
                          "status": "error",
                          "message": str(e)
                      }, cls=DateTimeEncoder)
                  }
  
  # Managed IAM Policies
  CloudFormationAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for CloudFormation and CloudWatch access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - cloudformation:ListStackSets
              - cloudformation:ListStackSetOperations
              - cloudformation:ListStackInstances
              - cloudformation:DescribeStackInstance
            Resource: 
              - !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stackset/*"
              - !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stackset-target/*"
          - Effect: Allow
            Action:
              - cloudwatch:PutMetricData
            Resource: "*"
            Condition:
              StringEquals:
                "cloudwatch:namespace": "StackSetMonitoring"
          - Effect: Allow
            Action:
              - sns:Publish
            Resource: !Ref StackSetAlertsTopic

  EventsAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for EventBridge access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - events:PutEvents
            Resource: !Sub "arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:event-bus/default"

  LogsAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for CloudWatch Logs access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: 
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor:*"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets:*"

  DLQAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for Dead Letter Queue access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - sqs:SendMessage
            Resource: !GetAtt StackSetMonitorDLQ.Arn

  StackSetMonitorRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole
        - !Ref CloudFormationAccessPolicy
        - !Ref EventsAccessPolicy
        - !Ref LogsAccessPolicy
        - !Ref DLQAccessPolicy

  # Permissions for event rules to invoke Lambda
  AllOperationsRuleLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref StackSetMonitorLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt AllStackSetOperationsRule.Arn
  
  # Using a one minute schedule for testing, but you can change this value
  StackSetMonitorSchedule:
    Type: AWS::Events::Rule
    Properties:
      Name: RegularStackSetMonitoring
      Description: "Triggers Lambda function every 1 minute to check StackSet operations"
      ScheduleExpression: "rate(1 minute)"
      State: ENABLED
      Targets:
        - Id: RunMonitor
          Arn: !GetAtt StackSetMonitorLambda.Arn
  
  ScheduleLambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref StackSetMonitorLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt StackSetMonitorSchedule.Arn
  
  StackSetSuccessRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "Alarm when StackSet operation success rate is low"
      MetricName: SuccessRate
      Namespace: "StackSetMonitoring"
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      DatapointsToAlarm: 2
      Threshold: 80
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref StackSetAlertsTopic]
      Dimensions: [{Name: StackSetName, Value: !Ref StackSetName}]

Outputs:
  SNSTopicArn: 
    Description: The ARN of the SNS topic for alerts
    Value: !Ref StackSetAlertsTopic
  DashboardURL: 
    Description: URL to the CloudWatch Dashboard
    Value: !Sub https://console.aws.amazon.com/cloudwatch/home?region=${AWS::Region}#dashboards:name=StackSetMonitoring
  LambdaLogGroupName:
    Description: Name of the CloudWatch Log Group for Lambda logs
    Value: !Ref LambdaLogGroup
  DeadLetterQueueArn:
    Description: ARN of the Dead Letter Queue for Lambda function failures
    Value: !GetAtt StackSetMonitorDLQ.Arn
  DeadLetterQueueURL:
    Description: URL of the Dead Letter Queue for monitoring failed Lambda executions
    Value: !Ref StackSetMonitorDLQ
  TestLambdaCommand:
    Description: Command to manually test the Lambda function
    Value: !Sub "aws lambda invoke --function-name ${StackSetMonitorLambda} --payload '{}' response.json && cat response.json"
  LambdaFunctionArn:
    Description: ARN of the Lambda function configured with VPC
    Value: !GetAtt StackSetMonitorLambda.Arn
  LambdaSecurityGroupId:
    Description: Security Group ID created for the Lambda function
    Value: !Ref LambdaSecurityGroup
  VpcConfiguration:
    Description: VPC configuration summary for the Lambda function
    Value: !Sub 
      - "VPC: ${VpcId}, Subnets: ${SubnetList}, Security Groups: ${LambdaSecurityGroup}"
      - SubnetList: !Join [',', !Ref SubnetIds]

You need to run the following CLI command to deploy the CloudFormation stacks. You can change the ParameterValue of StackSetName“your-stackset-name” by the name of the StackSet you want to monitor. The default value is “security-baseline”. Your CLI profile should use region=“us-east-1“.

aws cloudformation create-stack --stack-name stackset-monitor --template-body file://StackSetMonitor.yml --parameters ParameterKey=StackSetName,ParameterValue="security-baseline" --capabilities CAPABILITY_IAM

AWS CLI to deploy the StackSetMonitor.yml CloudFormation template

The CLI output should look like the following:

{"StackId": "arn:aws:cloudformation:...."}

Here’s the expected output for the CloudFormation template:

StackSetMonitor Console output

StackSetMonitor Console output

And an example of Amazon CloudWatch Dashboard and Alarm screen:

Amazon CloudWatch Dashboard screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Dashboard screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Alarm screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Alarm screenshot for StackSetMonitor stack to track StackSet operations success rate

SNS subscription setup involves retrieving the topic ARN from stack outputs and configuring notifications for email or SMS endpoints (below example CLI for email subscription):

aws sns subscribe --topic-arn $SNS_TOPIC_ARN --protocol email --notification-endpoint [email protected]

AWS CLI to subscribe to the topic providing the user email

Cost:

The estimated monthly expenses ranges between 5 and 15 USD depending on StackSet activity levels, with approximately 2,880 Lambda executions per day (each minute) under the default monitoring schedule.

The solution supports customization of monitoring frequency by modifying the ScheduleExpression from the default one-minute interval. The cost will decrease if the monitoring is less frequent.

Cleanup:

For cleanup, you can run the following command lines:

  • To cleanup the Stack Instances and StackSets created in the Core Deployment Strategies section:

aws cloudformation delete-stack-instances --stack-set-name security-baseline --deployment-targets OrganizationalUnitIds=ou-xxx --regions us-east-1 eu-west-1 --region us-east-1 --no-retain-stack

AWS CLI to delete the Stack Instances

You need to change the parameter OrganizationalUnitIds value with the name of the OU, the parameter regions with the list of regions where you want to delete your stack instances, and the value of the stack-set-name parameter (security-baseline, monitoring-baseline, balanced-deployment…).

Then you can delete the StackSet:

aws cloudformation delete-stack-set --stack-set-name security-baseline

AWS CLI to delete the StackSet

You can change the value of the stack-set-name parameter.

  • To cleanup the stackset-monitor stack

aws cloudformation delete-stack --stack-name stackset-monitor

AWS CLI to delete the stackset-monitor Stack

You can also remove any IAM roles/policies that you specifically created for this blog that you might not need anymore

Conclusion

Throughout this guide, we’ve explored the nuanced approaches to AWS CloudFormation StackSets deployments across large-scale environments. The key takeaways include:

  • Balance is Critical: Every deployment strategy requires careful consideration of the trade-offs between speed, safety, and scale based on your organizational needs.
  • Progressive Adoption Works: For most organizations, a progressive deployment approach with validation gates provides the optimal balance of safety and efficiency.
  • Organizational Context Matters: Enterprise, startup, and regulated industry patterns demonstrate that deployment strategies should be tailored to your specific business requirements and risk tolerance.
  • Monitoring is Essential: As organizations scale to hundreds of accounts, comprehensive monitoring becomes critical for maintaining visibility and ensuring compliance.

These different approaches will help you adopt the right strategy for your AWS CloudFormation Stacksets deployments in your AWS Organization.

You can now test these different approaches on your sandbox environment, before adapting them for your specific needs, in order to balance Speed, Safety and Scale to optimize your deployments.

Amar Meriche

Amar is a Sr Cloud Operations Architect at AWS in Paris. He helps his customers improve their operational posture through advocacy and guidance, and is an active member of the DevOps and IaC community at AWS. He’s passionate about helping customers use the various IaC tools available at AWS following best practices. When he’s not working with customers, Amar can be found on the mountain trails with his family or playing basketball with his team.

Idriss Laouali Abdou

Idriss is a Sr. Product Manager Technical for AWS Infrastructure-as-Code based in Seattle. He focuses on improving developer productivity through StackSets and CloudFormation Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Reduce Docker image build time on AWS CodeBuild using Amazon ECR as a remote cache

Post Syndicated from Kirubakaran Sundaramoorthy original https://aws.amazon.com/blogs/devops/reduce-docker-image-build-time-on-aws-codebuild-using-amazon-ecr-as-a-remote-cache/

In modern software development, containerization with Docker has revolutionized how we build and deploy applications. While Docker enables packaging applications into portable containers, the continuous need to update these images can be resource intensive. AWS CodeBuild addresses this challenge by providing a managed build service that eliminates infrastructure maintenance overhead. In this blog post, we’ll explore how AWS CodeBuild integration with Amazon Elastic Container Registry (Amazon ECR) as a cache backend can significantly accelerate our Docker image build process, making development more efficient and streamlined.

AWS CodeBuild creates isolated environments for each build, which means build artifacts cannot be permanently stored on the host system. While CodeBuild does offer a native local caching feature, it provides only temporary storage and is most effective for builds that occur in quick succession.

This local caching mechanism, however, is not reliable when builds are triggered at varying intervals, as it operates on a best-effort basis. To address this limitation, we recommend using Amazon Elastic Container Registry as a persistent cache for Docker layers. This solution offers several advantages:

  • It provides a reliable, long-term storage solution for build caches
  • The cached layers can be reused across multiple builds regardless of timing
  • The cache remains valid and accessible at any point in time

This post shows how to implement a simple, effective, and durable Docker layer cache for CodeBuild using Amazon ECR repository as a cache backend to significantly reduce image build runtime.

Solution Overview

The following diagram illustrates the high-level architecture of this solution. We describe implementing each stage in more detail in the following paragraphs.

Solution Flow Diagram

Figure 1: Solution Flow Diagram

To use an Amazon ECR registry as a backend for caching, we must first enable the containerd image store in our Docker driver. This feature is not enabled in the default Docker driver configuration. Therefore, we create a new docker driver using docker buildx command with containerd (docker-container driver) image store enabled.

When CodeBuild runs for the first time, it will attempt to retrieve cache data from the Amazon ECR repository. Since this is the first run, no cache will be available. CodeBuild will then proceed to build the Docker image from scratch, generate cache data during this initial build and export both the newly built image and its associated cache to the Amazon ECR repository.

In each subsequent build, CodeBuild will import the previously stored cache from Amazon ECR. This cached data will be used to speed up the image building process, as only the changed layers will need to be rebuilt. Finally, the updated cache and image will be stored back in Amazon ECR.

Prerequisites

Before we begin the walk-through, we must have an AWS account. If you don’t have one, sign up at https://aws.amazon.com.

Walk-through

Launch the following AWS CloudFormation template to create Amazon ECR repository and AWS CodeBuild project including CodeBuild service role and required permission as a managed policy.

AWSTemplateFormatVersion: "2010-09-09"

Description: 'AWS CloudFormation template to create infrastructure which demo using Amazon ECR as a remote cache for AWS CodeBuild'

Parameters:
  CodeBuildProjectName:
    Type: String
    Default: CBECRCacheDemoProject
    Description: "Enter name for your CodeBuild project"
  CodeBuildServiceRolePolicyName:
    Type: String
    Default: CodeBuildDockerCachePolicy
    Description: "Enter name for the IAM policy"
  ECRRepoName:
    Type: String
    Default: amazon_linux_codebuild_image
    Description: "Enter name for Amazon ECR repository"
  GitHubLocation:
    Type:  String
    Default: "https://github.com/aws/aws-codebuild-docker-images"
    Description: "Enter your source code GitHub URL"
  ImageTag:
    Type: String
    Default: demo
    Description: "Enter Tag name for your application docker image"
  CacheTag:
    Type: String
    Default: demo-cache
    Description: "Enter tag name for the cache image"

Resources:

  CodeBuildServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument: |
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": "codebuild.amazonaws.com"
               },
              "Action": "sts:AssumeRole"
            }
         ]
        }
      Path: /

  CodeBuildServiceRolePolicy:
    Type: AWS::IAM::RolePolicy
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - ecr:BatchGetImage
              - ecr:BatchCheckLayerAvailability
              - ecr:InitiateLayerUpload
              - ecr:UploadLayerPart
              - ecr:CompleteLayerUpload
              - ecr:PutImage
              - ecr:GetDownloadUrlForLayer
            Resource: !GetAtt ECRRepository.Arn
          - Effect: Allow
            Action: 
              - ecr:GetAuthorizationToken
            Resource: '*'
          - Effect: Allow
            Action:
              - codeconnections:UseConnection
              - codeconnections:GetConnectionToken
              - codeconnections:GetConnection
              - codestar-connections:GetConnectionToken
              - codestar-connections:GetConnection
            Resource: '*'
          - Effect: Allow
            Action: 
              - logs:CreateLogStream
              - logs:CreateLogGroup
              - logs:PutLogEvents
            Resource: 
              - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/${CodeBuildProjectName}'
              - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/${CodeBuildProjectName}:*'
          - Effect: Allow
            Action:
              - s3:PutObject
              - s3:GetObject
              - s3:GetObjectVersion
              - s3:GetBucketAcl
              - s3:GetBucketLocation
            Resource: 
              - !Sub 'arn:${AWS::Partition}:s3:::codepipeline-${AWS::Region}-*'
      PolicyName: !Ref CodeBuildServiceRolePolicyName
      RoleName: !Ref CodeBuildServiceRole

  ECRRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Ref ECRRepoName
      ImageScanningConfiguration:
        ScanOnPush: true

  CodeBuildProject:
    Type: AWS::CodeBuild::Project
    Properties:
      Name: !Ref CodeBuildProjectName
      Source:
        Type: GITHUB
        Location: !Ref GitHubLocation
        BuildSpec: !Sub |
          version: 0.2

          phases:
            install:
             commands:
               - docker buildx create --name containerd --driver=docker-container --driver-opt default-load=true

            pre_build:
             commands:
               - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${AWS::AccountId}.dkr.ecr.$AWS_REGION.amazonaws.com

            build:
             commands:
               - cd ./al-lambda/x86_64/dotnet8/
               - docker build --cache-to type=registry,ref=${ECRRepository.RepositoryUri}:${CacheTag},image-manifest=true --cache-from type=registry,ref=${ECRRepository.RepositoryUri}:${CacheTag} --tag ${ECRRepository.RepositoryUri}:${ImageTag} --builder=containerd .

            post_build:
             commands:
               - docker push ${ECRRepository.RepositoryUri}:${ImageTag}
      ServiceRole: !GetAtt CodeBuildServiceRole.Arn
      Artifacts:
        Type: NO_ARTIFACTS
      Environment:
        Type: LINUX_CONTAINER
        Image: aws/codebuild/amazonlinux-x86_64-standard:5.0
        ComputeType: BUILD_GENERAL1_SMALL
        PrivilegedMode: true
      Cache:
        Type: LOCAL
        Modes:
          - LOCAL_DOCKER_LAYER_CACHE

Specify the following parameters while creating the CloudFormation stack (see Figure 2):

  1.  Set the cache image tag (CacheTag) to “demo-cache“.
  2. Name the CodeBuild project (CodeBuildProjectName) as “CBECRCacheDemoProject“.
  3. Specify the IAM policy name (CodeBuildServiceRolePolicyName) as “CodeBuildDockerCachePolicy“.
  4. Define the ECR repository name (ECRRepoName) as “amazon_al_lambda_codebuild_image“.
  5. Enter the GitHub repository URL (GitHubLocation) as “https://github.com/aws/aws-codebuild-docker-images“.
  6. Set the Docker image tag (ImageTag) to “demo
CloudFormation Stack parameter

Figure 2: CloudFormation Stack parameter

The CloudFormation stack will set up a comprehensive development environment for our project. It will create a CodeBuild project equipped with all necessary IAM roles and permissions, ensuring smooth and secure build processes. Additionally, the stack will create an Amazon ECR repository. This repository is configured to automatically scan Docker images for vulnerabilities upon upload. The ECR will serve as a secure storage location for both our Docker images and cache images.

The CodeBuild project will be created with a buildspec file which will instruct CodeBuild to do the following:

  • Creates a new driver called “containerd” using buildx since the default Docker driver supports registry cache backend only when the containerd image store is enabled.
  • To pull and push both Docker images and cache, authentication with the Amazon ECR repository is required.
  • During the image build process:
    • We use the --cache-from parameter to force Docker to check for and use any existing cache from the repository.
    • The image-manifest option is set to true to enable cache storage in the Amazon ECR repository.
    • The --cache-to parameter is used to push or update the cache to the Amazon ECR repository.
  • After the build is complete, in the post-build phase, the image is pushed to Amazon ECR. The cache is automatically uploaded to the Amazon ECR repository as part of the Docker build command execution.

Testing the solution

After having successfully created the CloudFormation stack, we can proceed to test and evaluate how it performs.

The initial build process took approximately 10 minutes to complete. Since this was the first execution, no cache was available in Amazon ECR, requiring the system to build the image entirely from scratch. Although the cache import operation failed as expected during this initial run, the system continued with the build process without any cached layers. This first build served a dual purpose: it not only created the application image but also generated a cache, which was then exported to Amazon ECR as a separate image. This cached data would become available for future builds, setting the foundation for more efficient subsequent builds.

To verify the effectiveness of the solution, we triggered a second build after introducing a minor modification – adding an echo command in the middle of the Dockerfile’s active commands (excluding commented line). During this subsequent build, Docker intelligently utilized the cached layers up until the point of modification, after which it rebuilt only the necessary layers. This smart caching strategy resulted in a build time of approximately 6 minutes, clearly demonstrating how the caching system optimizes the build process even when changes are introduced. Further validation across multiple large-scale projects confirmed the effectiveness of this approach, consistently achieving build time reductions of up to 25%.

We enabled CodeBuild’s built-in Docker layer caching feature on a best-effort basis. This approach is recommended as it uses cached layers in the local when available instead of downloading them from the repository, which will further improve the overall build speed.

Cleaning up

When we finished testing, we should de-provision the following resources to avoid incurring further charges and keep the account clean from unused resources:

  • Delete the docker images from the Amazon ECR repository amazon_al_lambda_codebuild_image.
  • Delete the CloudFormation stack which has been created in the “Launch the AWS CloudFormation template” section.

Conclusion

In this discussion, we explored an efficient and straightforward solution for implementing external Docker caching in CodeBuild using Amazon ECR as a backend storage system. This approach offers several key benefits:

The solution reduces Docker build times in CodeBuild up to 25% and is versatile enough to handle most scenarios, including complex multi-stage builds. A particularly valuable advantage is that Amazon ECR stores the cache separately in its repository, making it reusable across different projects.

The business impact is substantial: shorter build times lead directly to reduced compute costs. More importantly, this optimization results in a more streamlined development lifecycle, enabling faster feature releases at lower operational costs.

In essence, this caching solution not only improves technical efficiency but also delivers tangible business value through reduced costs and accelerated development cycles.

About the author

Kirubakaran Sundaramoorthy

Kirubakaran Sundaramoorthy is a Cloud Support Engineer specializing in DevOps practices and AWS architecture, with expertise in AWS CloudFormation and CI/CD implementations. He builds efficient cloud infrastructure solutions using automation processes, cloud deployment strategies, infrastructure as code, and DevOps best practices to help businesses succeed.

Multi Agent Collaboration with Strands

Post Syndicated from Aaron Sempf original https://aws.amazon.com/blogs/devops/multi-agent-collaboration-with-strands/

In the evolving landscape of autonomous systems, multi-agent collaboration is becoming not only feasible but necessary. As agents gain more capabilities, like advanced reasoning, adaptation, and tool use, the challenge shifts from individual performance to effective coordination. The question is no longer “can an agent solve a task?” but “how do we organize execution across many intelligent agents?”

A foundational step toward answering this came with the Supervisor pattern, introduced in our article on creating asynchronous AI agents with Amazon Bedrock. The Supervisor addresses the first generation of coordination challenges by acting as a centralized orchestrator, monitoring and delegating tasks across agents in a structured, serverless workflow. It provides asynchronous orchestration, fallback handling, and state tracking across loosely coupled agents, giving organizations a reliable way to move from single-agent prototypes to multi-agent systems.

Yet as agentic systems scale and become more dynamic, the limitations of static supervision become clear. The Supervisor model assumes a relatively stable set of agents and predictable workflows; but modern systems face constantly shifting tasks, emergent capabilities, and the need for adaptive coordination. This is where the Arbiter pattern emerges as the natural evolution: a next-generation supervisory model that extends the Supervisor with dynamic agent generation, semantic task routing, and blackboard-model-based coordination. By addressing the unpredictability and fluidity of large, evolving agent ecosystems, the Arbiter pattern enables systems not only to manage complexity but to thrive in it.

The Arbiter pattern builds directly on this by adding three key capabilities:

  1. Semantic Capability Matching: Instead of only assigning known tasks to known agents, the Arbiter reasons about what kind of agent should exist for a task—even if that agent doesn’t exist yet.
  2. Delegated Agent Creation: If no suitable agent is found, the Arbiter escalates the request to a Fabricator agent that dynamically generates a task-specific agent on demand. This moves beyond delegation to true adaptive generation.
  3. Task Planning + Contextual Memory: Building on the Supervisors task coordination capability, Arbiter decomposes complex inputs into structured task plans, and uses contextual memory to track execution, retry logic, and agent performance.

In short, the Arbiter transforms static orchestration into adaptive coordination.

The Blackboard Model Revisited

To enable loose, extensible collaboration across agents, the Arbiter Pattern incorporates principles from the blackboard model – a classic architecture from distributed AI. In this model, agents contribute opportunistically to a shared data space (the “blackboard”), reacting to changes and collectively solving problems.

Reference: See “The Blackboard Model of Control” (Hayes-Roth et al.), and early applications like Hearsay-II for foundational research.

In our extended Arbiter Pattern, the blackboard becomes a semantic event substrate. Agents, including the Arbiter, publish and consume task-relevant state, enabling loosely coupled, event-driven collaboration.

How It Works

When an event enters the system, the Arbiter takes on the supervisory role but extends it with greater dynamism and adaptability. Like the Supervisor pattern, it begins by interpreting the event and identifying the required objectives and sub-tasks. It then performs a capability assessment, using a local index or peer-published manifests, much like the Supervisor querying an Agents config table.

  1. Interpretation: The Arbiter uses LLM-based reasoning to extract task objectives and sub-tasks.
  2. Capability Assessment: It evaluates which agents can handle each sub-task using a local index or peer capability manifests.
  3. Delegation or Generation:
    • If a suitable agent exists, the task is routed accordingly.
    • If not, the Arbiter sends a generation request to the fabricator agent.
  4. Blackboard Coordination: All agents involved read/write to a shared semantic blackboard, contributing as needed based on observed task state.
  5. Reflection and Adaptation: Performance data is logged and used to inform future agent creation, adaptation, or deprecation.

Arbiter Pattern Architecture

Unlike the Supervisor, which maintains orchestration through a static config list, the Arbiter introduces a shared semantic blackboard that allows all participating agents to read, write, and coordinate based on evolving task state. This blackboard serves as a dynamic collaboration space, enabling mid-task adaptation and richer multi-agent coordination.

The following Diagram 1: Agentic AI Arbiter pattern implemented as a code example can be downloaded here

Architecture diagram of the Arbiter Pattern for Agentic AI. The diagram illustrates the components and flow of the pattern, showing how multiple AI agents interact with an arbiter to coordinate tasks and decision-making in a structured system

Diagram 1: Agentic AI Arbiter pattern

The following sequence describes the Arbiter pattern, according to the numbered steps in the diagram 1: Agentic AI Arbiter pattern

  1. Events entering the system trigger the Supervisor function
  2. Supervisor queries Agents Config table for agent capabilities
  3. Supervisor uses Agents config list as context to plan orchestration of tasks

Option: New Agent:

If no capable agent is found, the Arbiter goes further than the basic supervisor pattern: it issues a generation request to a fabricator agent, which synthesizes new worker code, stores it for runtime access, and updates the capability registry so the agentic system can immediately benefit from the new skill.

  1. Task cannot be completed, request create new capability
  2. Request to fabricate triggers Fabrication agent instance
  3. Fabrication agent queries resources register for available tools (capabilities)
  4. Fabricator generates worker agent code
  5. Worker agent code stored in bucket for runtime access
  6. New worker added to Agents config list with agent capabilities description
  7. Result of fabrication posted to message bus

Repeat steps 1, 2 & 3

Option: Orchestrate workflow:

If a suitable agent exists, the Arbiter orchestrates the workflow by invoking the appropriate worker agents, tracking progress and state as in the Supervisor model.

  1. Orchestration of tasks is stored for tracking end-to-end process
  2. Request to invoke worker agent, by name/id. Add workflow state for agent invocation.
  3. Request to invoke worker agent triggers worker agent wrapper instance
  4. Worker agent wrapper loads agent code
  5. Worker agent reasons and takes action
  6. Worker agent sends response to message bus
  7. Supervisor agent updates workflow state and tracks against orchestration

The Arbiter incorporates a reflection and adaptation loop: performance data from task execution is logged, analyzed, and fed back into the fabricator and coordination logic. This ensures that not only are tasks completed in the moment, but the system continuously adapts, retires underperforming agents, and evolves toward greater efficiency.

The Arbiter Agent: Event Orchestration Engine

The Supervisor Agent (Arbiter Agent) serves as the central coordinator component, managing complex event-driven workflows through intelligent task delegation.

Event Processing Workflow:

The Arbiter pattern follows a structured approach to handle incoming events

  1. Configuration Loading: Loads available agent configurations from Amazon DynamoDB via load_config_from_dynamodb()
  2. LLM Invocation: Invokes Amazon Bedrock LLM with event context and available tool specifications
  3. Decision Analysis: LLM analyzes the event and returns tool invocation decisions with parameters
  4. Task Dispatch: For each specified tool call:
    • Extracts tool name, input parameters, and tool use ID
    • Dispatches message to corresponding Amazon Simple Queue Service (SQS) queue via process_tool_call()
    • Maintains tool invocation list for workflow tracking

Workflow State Management:

The system maintains comprehensive state tracking throughout execution

  • Creates workflow tracking record in DynamoDB with create_workflow_tracking_record()
  • Initializes all invoked agents as incomplete
  • Associates unique request ID with orchestration instance
  • Persists orchestration state including conversation history and request mapping

Completion Coordination:

The Arbiter coordinates task completion through a systematic process

  1. Event Reception: Receives agent completion events via Amazon EventBridge
  2. Status Updates: Updates workflow tracking with update_workflow_tracking()
  3. Completion Check: Performs completion check across all tracked agents
  4. Result Aggregation: When all agents complete:
    • Aggregates results from DynamoDB data field
    • Appends tool results to conversation as user messages
    • Re-invokes orchestration with updated context
  5. Continuation: Continues until LLM provides final response without tool calls

The Fabricator Agent: Dynamic Capability Generation

The Fabricator Agent implements just-in-time agent development using the Strands agents framework, creating new capabilities when required functionality doesn’t exist in the system.

Agent Development Architecture:

The Fabricator operates as a specialized Strands Agent with specific characteristics

  • Implemented as a Strands Agent with specialized system prompt for code generation
  • Triggered by “New worker agent” events from the Arbiter
  • Receives capability requirements through prompt augmentation with agent directive
  • System prompt includes:
    • Strands Agent implementation examples
    • Complete catalog of available Strands Tools
    • Code generation patterns and conventions
    • Standardized handler() function requirements

Code Generation Process:

The agent follows a structured development workflow

  1. Requirement Analysis: LLM analyzes capability requirements and generates Python implementation
  2. Tool Selection: Prioritizes use of existing Strands Tools over custom @tool implementations
  3. Code Structure: Creates agents following standardized patterns:
    • Bedrock model initialization with models.BedrockModel()
    • Agent instantiation with appropriate tool selection
    • Standardized handler() function interface
    • Event-driven completion signaling
  4. File Creation: Writes generated code to /tmp/ directory for immediate availability

Capability Registration Pipeline:

New capabilities are registered through a multi-step process

  1. File Storage: File upload to Amazon Simple Storage Service (S3) via upload_file_to_s3() tool
  2. Metadata Registration: Registration in DynamoDB via store_agent_config_dynamo():
    • toolId: Unique capability identifier
    • filename: S3 object reference
    • schema: OpenAPI specification for LLM tool calling
    • description: Human-readable capability documentation
    • action: SQS queue routing configuration for Generic Wrapper
  3. Completion Notification: Completion event publication to Arbiter via complete_task() tool

Testing Considerations:

The original implementation revealed important insights about testing approaches

  • Previous Approach: Agent testing within the Fabricator resulted in:
    • Unstructured testing leading to false negatives
    • Overzealous optimization of generated agents
  • Recommendation: Separate testing agent with standardized harness for validation feedback

The Generic Wrapper: Dynamic Execution Runtime

The Generic Wrapper implements a hot-loading pattern that enables unlimited agent creation without infrastructure scaling, providing a universal execution environment for Fabricator-generated agents.

This hot-loading approach is critical because it decouples capability growth from infrastructure scaling. Instead of provisioning and maintaining new infrastructure components for every new agent, which could be dozens or even hundreds of agents, the system reuses a single execution wrapper that can dynamically load and execute arbitrary agent code.

This not only makes agent creation effectively limitless but also ensures infrastructure efficiency, cost optimization, and simplified operations, allowing the Arbiter and Fabricator to evolve system capabilities without operational bottlenecks.

In the AWS Samples code, found here, the Hot-loading handler is implemented as am AWS Lambda function, represented in the following code snippet:

def process_event(event, context):
    orchestration_id = event["orchestration_id"]
    tool_use_id = event["tool_use_id"]
    request = event["tool_input"]
    tool_name = event['node']

    # Based on the tool from the event, load the details from DDB
    tool = load_config_from_dynamodb(tool_name)
    config = tool['config']

    if isinstance(config, str):
        config = json.loads(config)

    file_name = config['filename']

    load_file_from_s3_into_tmp(os.environ["AGENT_BUCKET_NAME"], file_name)

    # Hot load the module from the tmp directory
    spec = importlib.util.spec_from_file_location("module.name", "/tmp/loaded_module.py")
    loaded_module = importlib.util.module_from_spec(spec)
    sys.modules["module.name"] = loaded_module
    spec.loader.exec_module(loaded_module)

    # Invoke the generic handler with whatever args were passed in by the Arbiter
    try:
        print("attempting to use module")
        response = loaded_module.handler(**request)
        print(f"response: {response}")
    except Exception as e:
        print(f"error running module: {e}")
        response = "The task could not be completed, this agent has issues, please ignore for now."

    # Finally. report back to the Arbiter. Handled by the wrapper. To avoid the Frabricator from attempting to code this part itself
    post_task_complete(response, tool_use_id, tool_name, orchestration_id)

Although this example is demonstrated through a lambda function, the Hot-Loading code can be executed in Amazon Bedrock AgentCore Runtime, or AWS native container services, such as Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS)

Hot-Loading Architecture:

The wrapper implements several key architectural principles

  • Single infrastructure component handles execution of all dynamically created agents
  • Eliminates need for separate infrastructure provisioning per agent
  • Implements runtime code loading from S3 storage
  • Accepts latency trade-off for infrastructure efficiency in non-ultra-low-latency environment

Dynamic Loading Process:

The system follows a precise loading sequence

  1. Message Processing: Extracts agent identifier from incoming SQS message
  2. Configuration Retrieval: Queries DynamoDB for agent configuration via load_config_from_dynamodb()
  3. Code Download: Downloads agent implementation from S3 to /tmp/ directory
  4. Runtime Loading: Module loading using importlib.util:
    • spec_from_file_location() creates module specification
    • module_from_spec() instantiates module object
    • exec_module() performs actual code loading and execution

Execution Management:

The wrapper provides comprehensive execution oversight

  • Invokes standardized handler() function with provided parameters
  • Captures execution results and handles error conditions gracefully
  • Maintains execution isolation between different agent invocations
  • Implements resource cleanup after agent execution completion

Standardized Communication Protocol:

Communication follows strict standardization to ensure system reliability, which is critical in multi-agent environments where dozens or even hundreds of dynamically generated agents may interact. Without consistent message formats, routing rules, and completion signals, orchestration would become brittle, errors would propagate unpredictably, and debugging would be nearly impossible. Standardization guarantees that every agent, no matter when it was created, can interoperate seamlessly, enabling the Arbiter to maintain end-to-end visibility, traceability, and fault-tolerance across the entire system.

Event Handling Principles:

  • Event posting handled exclusively by Generic Wrapper, not individual agents
  • Ensures consistent event-driven communication patterns across all agents

Completion Event Structure:

  • orchestration_id: Workflow context linkage
  • tool_use_id: LLM tool invocation mapping
  • node: Agent identifier for tracking
  • data: Execution results or error information

Reliability Measures:

  • Publishes completion events to EventBridge for Arbiter processing
  • Guarantees workflow tracking receives completion signals regardless of execution outcome

Scalability Characteristics:

The hot-loading approach provides significant scalability benefits

  • Enables agent scaling creation without minimal infrastructure impact
  • S3 download latency acceptable within overall system performance profile
  • Single wrapper instance can execute multiple agent types
  • Memory and resource management handled at container level

Conclusion

The Arbiter Pattern represents a significant evolution beyond the Supervisor architecture, delivering the flexibility required for truly autonomous agentic systems. By introducing semantically rich, context-aware orchestration, it enables dynamic scalability, where agent capabilities grow in step with task demands. The architecture is resilient, redistributing or regenerating tasks when agents fail, and it achieves loose coupling by having agents interact through semantically meaningful events rather than rigid APIs. Most importantly, it embeds continuous adaptation through Arbiter-guided feedback loops, allowing systems to learn and evolve over time. This marks a shift from pre-programmed logic to generative, blackboard-model-based coordination, paving the way for decentralized, intelligent systems that can learn, adapt, and collaborate effectively at scale.

The system delivers several critical capabilities

  • Asynchronous Processing: SQS-based message passing for scalable execution
  • Persistent State Management (Short-term memory): DynamoDB-based workflow tracking
  • Scalability: Hot-loading architecture for unlimited agent creation
  • Intelligent Orchestration: LLM-driven task decomposition and sequencing
  • Self-Expanding Capabilities: Strands-based agent creation on demand
  • Standardized Communication: Reliable event-driven protocols

This architecture enables processing of arbitrary event types by dynamically creating necessary processing capabilities and coordinating their execution through LLM-driven workflow orchestration, while maintaining infrastructure efficiency through hot-loading patterns.


About the Authors

aaron sempfAaron Sempf is Next Gen Tech Lead for the AWS Partner Organization in Asia-Pacific and Japan. With over 20 years in distributed system engineering design and development, he focuses on solving for large scale complex integration and event driven systems. In his spare time, he can be found coding prototypes for autonomous robots, IoT devices, distributed solutions, and designing agentic architecture patterns for generative AI assisted business automation.

josh tothJoshua Toth is a Senior Prototyping Engineer with over a decade of experience in software engineering and distributed systems. He specializes in solving complex business challenges through technical prototypes, demonstrating the art of the possible. With deep expertise in proof of concept development, he focuses on bridging the gap between emerging technologies and practical business applications. In his spare time, he can be found developing next-generation interactive demonstrations and exploring cutting-edge technological innovations.

Introducing an Interactive Code Review Experience with Amazon Q Developer in GitHub

Post Syndicated from Sundaresh Iyer original https://aws.amazon.com/blogs/devops/introducing-an-interactive-code-review-experience-with-amazon-q-developer-in-github/

Code reviews are one of the most valuable rituals in software development. They help ensure quality, maintain consistency, and foster growth as engineers. But they’re also one of the most time consuming steps in the software development lifecycle. A common pattern I’ve seen is a developer opening a pull request (PR), receiving automated or peer comments, and then needing to search through documentation, Slack threads, or past code just to understand why a change was suggested. That search for missing context creates a friction that slows teams down, adds back-and-forth cycles, and often distracts from the bigger picture of building great products.

In the initial preview experience, teams used Amazon Q Developer in GitHub across issues and PRs for feature work, automated code reviews, and common modernization tasks. This kept work inside GitHub and reduced handoffs. Automatic reviews on new or reopened PRs surfaced findings early, but teams still wanted more context and a tighter loop inside the PR.

Today we’re introducing an interactive code review experience for PRs You can ask Amazon Q Developer questions about any finding using /q, see a concise summary with threaded findings, and apply suggested changes without leaving GitHub. Code reviews by Amazon Q Developer now complete quicker than before, which reduces wait time and shortens the review cycle so teams can merge sooner and spend more time building.

What’s new and why it matters

  • Interactive Conversations in the pull request: Comment with /q to get inline answers, or ask Q Developer to propose a code change you can apply in the PR. For example:/q explain this finding or /q propose a change that replaces class toggles with a data attribute for state.
  • Code review summaries with threaded findings: Each code review begins with a concise summary and findings are threaded underneath. This makes updates easier to follow and reduces noise.
  • Faster execution with clearer notifications: Amazon Q Developer completes its analysis quicker and notifications are organized and easier to scan. This reduces wait time and shortens the review cycle.
    When you create or open a new PR, Amazon Q Developer automatically starts a code review if the code review feature is enabled for your GitHub installation in the Amazon Q Developer console. Subsequent commits do not trigger another automatic review. To run a fresh analysis, post /q review as a new comment on the PR.

Getting Started with Amazon Q Developer in GitHub

To get started, install the Amazon Q Developer GitHub App in your GitHub organization or repository. The app is available through the GitHub Marketplace and can be used without an AWS account during the preview. During installation, you choose whether to provide access to all repositories or only selected repositories in your GitHub organization. You can increase free usage by registering the app installation in the Amazon Q Developer console.
For more details on installation, permissions, and configuration options, see the Amazon Q Developer for GitHub documentation. Once the app is installed, you can begin using Q Developer to review PRs automatically.

Using Amazon Q Developer in Pull Requests

To dive deeper, here’s an end-to-end walk-through of the new interactive code review experience using a simple card game I built with Amazon Q Developer.

  1. Create a new pull request : In this example, I started by creating a feature branch and named it demo, added atailwind.css file to the JavaScript and HTML card game app, pushed the branch, and opened a PR for review.
    • GitHub interface showing the creation of a new pull request for a demo branch, with changes to a tailwind.css file in a card game application
  2. Amazon Q Developer automatically starts a code review, analyzing code quality, potential issues, and adherence to best practices. A concise summary appeared at the top, with individual findings threaded underneath. This gave me the big picture and the specifics in one place.
    •  GitHub pull request interface showing a notification that Amazon Q Developer has automatically initiated a code review and is analyzing the changes, with a progress indicator
    • GitHub interface displaying Amazon Q Developer's completed review with a concise summary at the top and detailed findings organized in threaded comments below, highlighting code quality issues and suggested improvements
  3. Code review the summary and findings: I reviewed the summary and threaded findings to decide which change to take on first. Seeing both the rationale and the exact lines called out meant I knew where to begin, without hunting through files.
    •         GitHub interface showing Amazon Q Developer's threaded findings, where each finding is organized as a separate comment thread with detailed explanations of identified issues in the code
    •        
  4. Ask for Clarification with /q : One of the findings suggested using state property to track the card status in my card game application. so I asked Q Developer for clarification. It responded quickly with concrete context and pointers, which reduced back and forth and improved the quality of the review.
    •         Screenshot showing a conversation thread where a developer uses the /q command to ask Amazon Q Developer for clarification about state property implementation in the card game
    •         GitHub interface displaying Amazon Q Developer's detailed response to the clarification request, providing context and specific explanations about the state property implementation recommendation
  5. Continue the conversation (if needed) : I reviewed Q Developer’s suggestion and responded back stating that I preferred an alternate approach and Q Developer quickly returned a complete implementation I could apply in the PR
    •      GitHub interface showing a follow-up exchange between the developer and Amazon Q Developer, where the developer proposes an alternate approach and receives implementation suggestions
  6. Apply Fixes : After reviewing the implementation suggestion, I clicked on Commit suggestion to create a new commit on the PR branch with my username as the author.
  7. Re-run the review : I didn’t need this for my example, but if you push additional changes, you can run a fresh analysis by posting /q review as a new top-level comment. Q Developer will run the review and post updated findings.

With the code review complete and checks passing, I merged. The new interactive code review experience reduced wait time and review cycles and made the “why” behind each finding and suggested change clear.

Conclusion

Amazon Q Developer for GitHub is available today in preview. Whether you are an individual developer or part of a large engineering team, this update helps you ship cleaner code with fewer cycles and makes code reviews something to look forward to rather than avoid.
Try it out on your next PR. Type /q, ask a question, and see how smarter conversational reviews transform your workflow.

Introducing AWS Cloud Control API MCP Server: Natural Language Infrastructure Management on AWS

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/introducing-aws-cloud-control-api-mcp-server-natural-language-infrastructure-management-on-aws/

Today, we’re officially announcing the AWS Cloud Control API (CCAPI) MCP Server. This MCP server transforms AWS infrastructure management by allowing developers to create, read, update, delete, and list resources using natural language. As part of the awslabs/mcp project, this new and innovative tool serves as a bridge between natural language commands and AWS infrastructure deployment and management. This MCP server is powered by the AWS Cloud Control API – a standardized API that allows CRUDL (Create/Read/Update/Delete/List) operations to be performed against AWS and third party resources using a single endpoint.

Key Features:

  • Leverages AWS Cloud Control API for CRUDL operations for more than 1,200 AWS resources
  • Enables LLM-powered agents and developers to manage infrastructure with natural language prompts
  • Provides the option to output Infrastructure as Code (IaC) templates for infrastructure it will create, allowing to still be used with existing CI/CD pipelines
  • Integrates with AWS Pricing API to provide cost estimates for the infrastructure it will create
  • Applies security best practices automatically using Checkov

Why Use CCAPI MCP Server?

  • Simplified Infrastructure Management: No more wrestling with complex templates or documentation
  • Increased Developer Productivity: Focus on what you need, not how to configure it
  • Reduced Learning Curve: Onboard new team members faster with natural language commands
  • LLM Integration: Perfect companion for AI-assisted development workflows

The CCAPI MCP Server transforms infrastructure management by enabling natural language interactions for AWS resource operations. Bridging natural language commands with AWS infrastructure deployment and management, this MCP Server allows developers to manage cloud infrastructure through conversational inputs such as:

  • Can you create a new s3 bucket for me?or
  • Find all of my EC2 instances and tell me which one have an instance type that is not t2.large

This significantly reduces configuration overhead and accelerates onboarding for new team members, directly translates developer intent into cloud infrastructure.

Let’s see it in action.

Creating and Managing Cloud Infrastructure

Prerequisites

  • uv package manager installed
  • Python 3.x.x installed
  • AWS credentials with appropriate permissions. The MCP server supports multiple ways to define these credentials. See the MCP documentation for more information. Using dynamic credentials such as one provided via SSO is recommended. For more information on configuring AWS credentials, see the AWS CLI documentation.
  • An MCP Host application installed that supports MCP Clients and MCP Servers (e.g. Amazon Q Developer, Claude Desktop, Cursor, etc.). To follow this blog install Amazon Q Developer for CLI (CLI) as described in the installation instructions

Integration with Developer Tools

To start using the CCAPI MCP server, you will need to set up your server configuration which is typically in a file named mcp.json. For this blog we will focus on using the CCAPI MCP server with Amazon Q Developer. Note that for other MCP Host applications the path to the mcp configuration file may differ. You will need to create the file if it does not already exist in the directory.

1. Global Configuration: ~/.aws/amazon/mcp.json – Applies to all workspaces

2. Workspace Configuration: .amazonq/mcp.json – Specific to the current workspace

More information can be found in the Amazon Q Developer User Guide.

Configuration file structure

The MCP configuration file uses a JSON format with the following structure:

mcp.json

{
  "mcpServers": {
    "server-name": {
      "command": "command-to-run",
      "args": ["arg1", "arg1",],
      "env": {
        "ENV_VAR1": "value1",
        "ENV_VAR2": "value2",
      },
    }
  }
}

Here is mcp.json with the CCAPI MCP Server configuration:

{
  "mcpServers": {
   "awslabs.ccapi-mcp-server": {
      "command": "uvx",
      "args": [
        "awslabs.ccapi-mcp-server@latest"
      ],
      "env": {
        "AWS_PROFILE": "your named AWS profile",
	"DEFAULT_TAGS": “enabled”,
	"SECURITY_SCANNING": “enabled”,
	"FASTMCP_LOG_LEVEL": “ERROR”
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}

Important

Ensure you correctly set your AWS credentials in the MCP server config. It is essential that you properly configure these credentials, as the MCP server uses their associated permissions when invoking the AWS Cloud Control API for CRUDL operations in your AWS account. The server supports multiple methods of consuming these credentials such as AWS profiles, Environment Variables, SSO tokens, etc. You can see some of this in the aws_client.py file. See these docs on using named profiles for more information.

Read Only Mode

If you would like to prevent the MCP server from performing mutating actions (e.g. Create/Update/Delete Resource), you can specify the --readonly flag as demonstrated below:

{
  "mcpServers": {
   "awslabs.ccapi-mcp-server": {
      "command": "uvx",
      "args": [
        "awslabs.ccapi-mcp-server@latest",
        “--readonly”"
      ],
      "env": {
        "AWS_PROFILE": "your named AWS profile",
	"DEFAULT_TAGS": “enabled”,
	"SECURITY_SCANNING": “enabled”,
	"FASTMCP_LOG_LEVEL": “ERROR”
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}

More information on the configuration and tools the CCAPI MCP server provides can be found in the AWS CloudFormation MCP Server documentation.

Security Considerations

  • Ensure the IAM credentials include permissions for Cloud Control API actions (List, Get, Create, Update, Delete). See the AWS CCAPI API documentation for more info
  • Follow IAM least privilege principles
  • Enable AWS CloudTrail auditing
  • Consider running in read-only mode with --readonly flag for safer operations

Example Use Case: Creating an S3 Bucket with KMS Encryption

IMPORTANT: Ensure you have satisfied all prerequisites before attempting these commands.

1. With the mcp.json file correctly set, try to run a sample prompt. In your terminal, run q chat to start using Amazon Q in the CLI.

Q CLI Initial Load of Cloud Control API MCP Server 2. This will start initializing the MCP servers in the background, allowing you to immediately start using Q Chat even if they are still loading. As a note, if these have not finished loading, your prompts will be handled without using any MCP servers. To check the status of the servers, run /mcp

3. Once that you have validated that the MCP server was loaded successfully, try a sample command. Simply tell Amazon Q : Create an S3 bucket with versioning and encrypt it using a new KMS key

Amazon Q will use the server to automatically:

  1. Fetch your current environment variables
  2. Use those to fetch your current AWS session info
  3. Create code that defines what is in your prompt
  4. Explain the code that was generated
  5. Run security analysis against the code that was generated (if enabled)
  6. Explain the results of the security analysis
  7. Validate the configuration against AWS Cloud Control API schemas (which use CloudFormation Resource Provider Schemas as their foundation) and IAM policies. This validation ensures compliance with Cloud Control API requirements, which is essential for resource creation
  8. Create the resources directly through Cloud Control API

Note: While CloudFormation schemas are referenced in the validation step, this solution uses Cloud Control API for resource management, not CloudFormation. The schemas are used because they define the standardized resource properties that Cloud Control API expects.

4. First, Amazon Q will mention that it needs to check the environment variables to find information related to the AWS session information. It will inform you about the specific tool it aims to use and will ask for permission. Select y to accept and allow actions.

5. Next, Amazon Q will ask to use get_aws_session_info() to fetch information about the AWS session it should use for subsequent actions. It will use the relevant values from the environment variables defined in the MCP configuration file (e.g. ~/.aws/amazon/mcp.json)

6.Amazon Q will then display the AWS account ID and region it will use to deploy resources. To start, it will use generate_infrastructure_code() to generate the resource properties for a KMS key that will be sent to Cloud Control API. These properties mirror the structure defined in AWS CloudFormation Resource Provider Schemas (which Cloud Control API uses as its foundation), allowing for security validation through Checkov before deployment. The key will be configured following security best practices, with a key policy scoped to only allow usage within the AWS account.

7. Once that Amazon Q has generated the code for the resource, it will run then use the explain() tool to explain the infrastructure code that was generated. Note that default tags MANAGED_BY, MCP_SERVER_SOURCE_CODE, and MCP_SERVER_VERSION are added for all resources managed by the CCAPI MCP server. These tags provide for ease of identification of infrastructure that is being managed by the MCP server. They are configurable and you optionally can disable them, but we highly recommend adding tags to ensure you have visibility into infrastructure that is being managed by the CCAPI MCP server.

8. It will then attempt to use the run_checkov() tool to inspect the security of the code. This tool is triggered because SECURITY_SCANNING was set to enabled in your server configuration file.

9. After Checkov has run, it will then attempt to use the explain() tool again to explain the security findings from the Checkov run. If there were no security issues, it will attempt to proceed. If there were security issues, you will be asked how you’d like to proceed, and Amazon Q will recommend necessary fixes. By default, the checks that passed will only give a minimal summary. If you’d like to get more information, just ask for more details.

10. The next tool that Amazon Q will use is the create_resource() tool. This tool will attempt to create the resource using the AWS Cloud Control API, and then use the get_resource_request_status() tool to check the status of the creation. This tool uses the request token to identify the request that was submitted to the Cloud Control API and uses this to fetch its status information.

11. Amazon Q will continue using the CCAPI MCP server tools as needed until it finishes creation of both the S3 Bucket and KMS Key and will output a summary.

12. Now, ask Amazon Q to make a change potentially negatively affecting security, for example by allowing the S3 bucket to be publicly accessible. While this configuration is generally advised against, sometimes it is necessary – such as when you want to use the S3 bucket for public website hosting. Amazon Q will respond letting you know that what you are asking for is not the best practice, and explain why. However, since this could be a valid request depending on your use case, it will prompt you to confirm.

13. The CCAPI MCP server also has integrations with the AWS Pricing API, so you can even ask for the estimated cost of what it has deployed.

14. Lastly, ask Amazon Q to create a CloudFormation template of what it has created so far so you can either have a backup, or if you want to redeploy something similar, you will have a template to work off. It will use the create_template() tool to accomplish this task.

Note: The create_template() tool comes with predefined settings:

  • Outputs YAML format by default (can be JSON)
  • Sets DeletionPolicy to RETAIN
  • Sets UpdateReplacePolicy to RETAIN
  • Allows optional parameters for template ID, file saving location, and region specification

For more information, review the tool in the source code.

15. Try one more dangerous operation, attempting to delete all resources within an AWS account. The security checks block this attempt and suggest other alternatives.

16. Finally, ask Amazon Q to just delete what it has created. This time it will use the get_resource() tool to get information about the existing resources it created, the explain() tool to explain the changes that will be made, and finally the delete_resource() tool to delete the resources.

After successfully deleting the resources, it will provide a final summary.

Sample Prompts for Easy Start

Sample Prompt What It Does
“Create a VPC with private and public subnets” Sets up a complete network environment
“List all my EC2 instances” Shows running instances across your account
“Create a serverless API for my application” Deploys API Gateway with Lambda integration
“Set up a load-balanced web application” Creates ALB with target groups and instances

Conclusion

The AWS Cloud Control API MCP Server represents a significant advancement in AWS infrastructure management, making operations on cloud resources easy to express and access through natural language. Whether you’re streamlining operations, experimenting with LLM-based development, or onboarding new team members, whether you are using Amazon Q Developer in CLI or any other MCP Host application (such as Claude Desktop or Cursor), the CCAPI MCP servet and its tools offer a truly intuitive way to interact with AWS.

Authors

Kevon Mayers

Kevon Mayers is a Games Solutions Architect at AWS and is the Infrastructure as Code (IaC) Focus Area Lead for the NextGen Developer Experience Technical Field Community at AWS. Kevon is a Core Contributor for Terraform and has led multiple Terraform initiatives within AWS. Prior to joining AWS, he was working as a DevOps engineer and developer, and before that was working with the GRAMMYs/The Recording Academy as a studio manager, music producer, and audio engineer. He also owns a professional production company, MM Productions.

Brian Terry

Brian Terry, Senior WW Data & AI PSA, is an innovation leader with 20+ years of experience in technology and engineering. Pursuing a Ph.D. in Computer Science at the University of North Dakota. Brian has spearheaded generative AI projects, optimized infrastructure scalability, and driven partner integration strategies. He is passionate about leveraging technology to deliver scalable, resilient solutions that foster business growth and innovation.

Flexibility to Framework: Building MCP Servers with Controlled Tool Orchestration

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/flexibility-to-framework-building-mcp-servers-with-controlled-tool-orchestration/

MCP (Model Control Protocol) is a protocol designed to standardize interactions with Generative AI models, making it easier to build and manage AI applications. It provides a consistent way to communicate context with different types of models, regardless of where they’re hosted or how they’re implemented. The protocol helps bridge the gap between model deployment and application development by providing a unified interface for model interactions. While this protocol provides flexibility in tool choice, there are key challenges when the order of tool usage needs to be enforced. In this blog post, you will learn about how I designed this functionality and implemented it into the AWS Cloud Control API (CCAPI) MCP server .

The Challenge – Enforcing Tool Ordering in MCP

When you think of MCP, you likely think of choice. Arguably one of the main reasons you may want to use an MCP server, is to allow a Large Language Model (LLM) (through agents) to access a set of tools such as reading from a database, sending an email, or in something along those lines. The MCP framework doesn’t provide a native mechanism to enforce the sequence in which tools must be called.

Let’s take as an example two tools – fetch_weather_data() and send_email(). For the LLM using your MCP server, it is reasonable to think that you may want to enforce that an email that is sent has the current weather included. Or for another example, tools getOrderId() and getOrderDetail(), where the OrderId would be required to subsequently fetch the OrderDetail. Since MCP currently lacks tool ordering preferences, these types of sequential dependencies can be challenging to enforce.

MCP tools are designed to be independent functions that an LLM can invoke as needed. There’s no built-in concept of “workflow” or “sequence” in the MCP framework itself. Each tool call is treated as a separate operation, with no inherent knowledge of what came before or what should come after. This means that by default, an LLM can technically call your tools in any order it chooses, regardless of the logical workflow you intend.

While LLMs excel at flexible decision-making, some scenarios like infrastructure management require strict operational ordering. This presents a unique challenge when building MCP servers: how do you maintain the LLM’s natural flexibility while enforcing critical sequential dependencies?

When you think of Infrastructure as Code (IaC), you think of repeatability, consistency, versioning, and continuous integration/continuous deployment (CI/CD). Within CI/CD you have a set flow:

  1. Pull request is generated
  2. CI/CD pipeline is triggered
  3. Series of steps runs to run linting, security tests, unit tests, end-to-end tests, etc.
  4. A failure in any stage should stop the entire pipeline run

This posed a challenge with IaC and LLMs. Generative AI is non-deterministic, meaning the same prompt may not always generate the same exact response. If the result deviates significantly from what it should be, it is considered a hallucination. So, what can be done to guide the LLM on what you want it to do? Let’s talk about how this was addressed in the CCAPI MCP server.

Understanding MCP Tool Discovery and Initialization

Before diving into the solution, it’s important to understand how MCP servers communicate with AI Agents. During initialization, the MCP protocol follows specific lifecycle phases where capabilities and tools are discovered.

The Model Context Protocol defines a structured lifecycle for client-server connections that ensures proper capability negotiation and state management.

MCP Lifecycle

These phases include:

  1. Initialization: Capability negotiation and protocol version agreement
  2. Operation: Normal protocol communication
  3. Shutdown: Graceful termination of the connection

The initialization phase establishes protocol compatibility and shares implementation details. This is when an AI Agent learns about available tools through schema definitions and receives instructions for tool usage. This initialization process is crucial to the solution, as it’s where AI Agents first discover what tools are available and how they should be used. During this phase, the client sends information about its protocol version, capabilities, and implementation details. This is how tools like Amazon Q CLI receive information about an MCP server’s version, available tools, and usage instructions.

Note: For more information on the MCP lifecycle, see these docs.

Solution – Token-Based Tool Orchestration: A New Pattern for AI Agents in MCP

MCP Token Orchestration

MCP presents a specific challenge: tools cannot directly communicate with each other to enforce execution order. The CCAPI MCP server addresses this through a token messenger pattern shown above, where the server generates and controls validation tokens, and the AI Agent (as the MCP client) passes these tokens between tool calls.

Core Implementation:

  1. Function Enhancement – The mcp.tool() decorator transforms each function into a more capable entity. It wraps the function with a schema that defines required inputs and their validation rules, while preserving detailed documentation through docstrings. Each enhanced function clearly communicates its requirements and provides explicit error messages when dependencies aren’t met.
  2. Dependency Discovery – During the initialize phase in the MCP lifecycle, the AI Agent (as the MCP client) receives a complete map of all defined tools and their schemas from the MCP server. The LLM, which is part of the AI Agent, uses these schemas to understand dependencies through both parameter descriptions and required input arguments. For instance, when a tool requires a parameter described as “Result from get_aws_session_info()” and defines security_scan_token as a required input argument, the LLM understands it needs both valid tokens before proceeding. This combination of descriptive text and explicit input requirements enables the AI Agent to execute sequences like get_aws_session_info() → generate_infrastructure_code() → run_checkov() → create_resource().
  3. Token Validation Control –The server generates and controls all workflow tokens through a unified server-side storage system (_workflow_store). Each tool in the workflow generates cryptographically secure tokens, and these tokens are stored server-side with their associated data.

The AI Agent maintains these tokens in its conversation context throughout the workflow, passing them between tool calls. For security, each token used by the AI Agent must be validated against the server’s token storage. Since these tokens are short-lived, they are stored in memory (RAM) and are actively managed by the MCP server, which deletes tokens after use to maintain freshness. Any remaining tokens are automatically cleared when the server process ends or restarts. If a token doesn’t exist in the server’s storage (either because it’s invalid or already consumed), the operation fails immediately with an error. This validation is uniform across all token types, ensuring the AI Agent cannot create or modify tokens.

As the workflow progresses, tools consume existing tokens and generate new ones. For example, when explain() receives a properties_token, it first validates it exists and matches what is in _workflow_store, then consumes it and generates a new explained_properties_token. This creates a cryptographically secure chain of operations that enforces the workflow sequence (generate → scan → create), with server-side validation at every step.

The result is a predictable workflow system with strong security controls – tokens must be generated by the server and validated against server-side storage at each step, helping ensure the integrity of the infrastructure management process. This approach provides robust workflow enforcement within the confines of the current functionality of the FastMCP framework. While explicit schema-defined dependencies like @mcp.tool(depends_on=["run_checkov"]) as mentioned in this GitHub Issue would be ideal and could hopefully be added in future FastMCP versions, the current token-based approach with descriptive parameter names and clear validation provides reliable tool ordering that LLMs consistently follow without confusion.

 Potential Limitations and Solutions

  1. Session Management – When an AI Agent’s session ends or refreshes, any in-progress workflows must be restarted. This is by design – tokens are meant to be short-lived and tied to specific workflow sequences. AWS credentials naturally expire within hours as part of standard security practices, providing a natural boundary for workflow sessions.
  2. Concurrent Workflows – Each AI Agent interaction operates independently, which is appropriate for maintaining security boundaries between different workflow instances. While this means each session starts fresh, it ensures clean separation between different infrastructure operations.
  3. Implementation Options – For organizations requiring workflow persistence, traditional database storage could maintain session state between restarts. However, since tokens are designed to be short-lived security controls, most implementations can rely on the default in-memory storage with natural session boundaries.

The token messenger pattern provides a solid foundation for secure workflow orchestration, with its intentionally ephemeral tokens ensuring proper tool sequencing and data integrity during infrastructure operations.

The Future of MCP

While the above solution works, this process made me think about the future of MCP and how it can and should continue to grow. There are many updates to the framework I’ve seen recently, and it’s great to see activity. For Agentic AI in general, there are strong signs that the future of agentic platforms may be more deterministic in nature, as highlighted by Claude Code’s new support for lifecycle hooks. Per their docs, “Hooks provide deterministic control over Claude Code’s behavior, ensuring certain actions always happen rather than relying on the LLM to choose to run them.” For IaC and other deterministic technologies that it is desired to integrate AI with, this is essential for wide-scale adoption.

Conclusion

The journey of Model Control Protocol (MCP) and this new frontier of leveraging AI for managing cloud infrastructure continues to evolve, presenting both opportunities and challenges in the world of cloud computing and artificial intelligence. Current approaches using prompt loading and parameter dependencies have helped address initial challenges around tool ordering and security protocols, demonstrating how MCP can be effectively used in enterprise applications.

While the current implementation using workflow tokens and validation checks provides a functional solution, we continue to explore ways to enhance the protocol’s capabilities. For those interested in contributing to MCP’s evolution, you can find our proposals for protocol improvements, including enhanced dependency management, in the modelcontextprotocol GitHub org as well as in the FastMCP GitHub repository.

If you’d like to learn more about the AWS Cloud Control API MCP server mentioned in this blog, check out the documentation and GitHub repo. If you’d like to get hands on with it and other AWS MCP servers, check out this AWS workshop. Happy vibe coding my friends.

Authors

Kevon Mayers

Kevon Mayers is a Games Solutions Architect at AWS and is the Infrastructure as Code (IaC) Focus Area Lead for the NextGen Developer Experience Technical Field Community at AWS. Kevon is a Core Contributor for Terraform and has led multiple Terraform initiatives within AWS. Prior to joining AWS, he was working as a DevOps engineer and developer, and before that was working with the GRAMMYs/The Recording Academy as a studio manager, music producer, and audio engineer. He also owns a professional production company, MM Productions.

Beyond IAM access keys: Modern authentication approaches for AWS

Post Syndicated from Mitch Beaumont original https://aws.amazon.com/blogs/security/beyond-iam-access-keys-modern-authentication-approaches-for-aws/

When it comes to AWS authentication, relying on long-term credentials, such as AWS Identity and Access Management (IAM) access keys, introduces unnecessary risks; including potential credential exposure, unauthorized sharing, or theft. In this post, I present five common use cases where AWS customers traditionally use IAM access keys and present more secure alternatives that you should consider.

AWS CLI access: Embrace CloudShell

If you’re primarily using access keys for AWS Command Line Interface (AWS CLI) access, consider AWS CloudShell—a browser-based CLI that minimizes the need for local credential management while providing the same powerful CLI capabilities that you’re accustomed to.

AWS CLI with enhanced security: IAM Identity Center

If you need a more robust solution, AWS CLI v2 combined with AWS IAM Identity Center offers a superior authentication approach. This integration enables:

  • Centralized user management
  • Seamless multi-factor authentication (MFA) integration
  • Enhanced security controls

Configuration is straightforward using the AWS CLI documentation, and MFA can be enabled following the IAM Identity Center MFA guide.

Local development: IDE integration

For developers working in local environments, modern integrated development environments (IDEs) such as Visual Studio Code, with AWS Toolkit support offer secure authentication through IAM Identity Center. This alleviates the need for static access keys while maintaining a smooth development experience. Learn more about AWS IDE integrations.

AWS compute services and CI/CD access

When your applications and automation pipelines need AWS resource access, whether running on AWS compute services (Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), or AWS Lambda) or through continuous integration and delivery (CI/CD) tools, IAM roles can provide the ideal solution. These roles automatically manage temporary credential rotation and follow security best practices.

  • For AWS compute services: Use standard IAM roles with your compute resources. Review the EC2 IAM roles documentation for implementation details.
  • For AWS-hosted CI/CD: When using AWS CodePipeline or AWS CodeBuild for example, use service-linked roles to manage permissions securely.
  • For CI/CD tools self-hosted on Amazon EC2: If you’re running tools such as Jenkins or GitLab on AWS resources, use the instance profile roles the same as you would with other compute services.

For third-party CI/CD services (such as GitHub Actions, CircleCI, and so on), see External access requirements.

External access requirements

For scenarios involving third-party applications or on-premises workloads, AWS offers three methods:

  • Third-party applications: Implement temporary security credentials through IAM roles instead of static access keys. Never use root account access keys. See third-party access documentation.
  • On-premises workloads: Use AWS IAM Roles Anywhere to generate temporary credentials for non-AWS workloads. For more information, see Access for non AWS workloads.
  • CI/CD software as a service (SaaS): For cloud-based CI/CD services, use OpenID Connect (OIDC) integration with IAM roles to minimize the need for long-term credentials. This allows your CI/CD pipelines to obtain temporary credentials through trust relationships. See the AWS OIDC provider documentation for implementation details.

Best practice: Principle of least privilege

Regardless of your authentication method, always implement the principle of least privilege. This helps make sure that users and applications have only the permissions they need. For guidance on crafting precise IAM policies, see Techniques for writing least privilege IAM policies.

Note: AWS also offers policy generation based on AWS CloudTrail logs, helping you create permission templates based on actual usage patterns. Learn about this feature in the IAM policy generation documentation.

Conclusion

As you’ve seen, there are numerous secure alternatives to IAM access keys that you can use to enhance your AWS authentication strategy while reducing security risks. By using tools such as CloudShell, IAM Identity Center, IDE integrations, IAM roles, and IAM Roles Anywhere, you can implement robust authentication mechanisms that align with modern security best practices.Key takeaways:

  • Prefer temporary credentials over long-term access keys
  • Choose the authentication method that best fits your use case
  • Implement the principle of least privilege across all access methods
  • Take advantage of the built-in tools provided by AWS for policy generation and management
  • Regularly review and update your authentication methods as new solutions become available

By making these changes, you can not only improve your security posture but also streamline your authentication processes across your AWS environment. Start small by identifying your current IAM access key use cases and gradually transition to these more secure alternatives. Your future self—and your security team—will thank you.

If you have feedback about this post, submit comments in the Comments section below.

Mitch Beaumont

Mitch Beaumont

Mitch is a Principal Solutions Architect for Amazon Web Services based in Sydney, Australia. Mitch works with some of Australia’s largest financial services customers, helping them to continually raise the security bar for the products and features that they build and ship. Outside of work, Mitch enjoys spending time with his family, photography, and surfing.

Streamline DevOps troubleshooting: Integrate CloudWatch investigations with Slack

Post Syndicated from Paige Broderick original https://aws.amazon.com/blogs/devops/streamline-devops-troubleshooting-integrate-cloudwatch-investigations-with-slack/

Infrastructure alerts pose a challenge for DevOps teams, particularly when they occur outside of regular business hours. The complexity isn’t merely in receiving notifications, it lies in rapidly assessing their severity and determining the root cause. This challenge is compounded when upstream service disruptions cascade into multiple downstream alerts, creating a confusion of notifications that mask the true source of the problem. DevOps teams find themselves working backwards through a complex web of interconnected services, unsure whether to start investigating at the application, network, or infrastructure level.

To reduce resolution time and alert root cause analysis, AWS introduced CloudWatch Investigations, a generative AI-powered capability within Amazon CloudWatch. Powered by Amazon Q Developer, a generative AI–powered assistant for software development, CloudWatch investigations analyzes multiple metrics, logs, and deployment events to provide suggestions for remediation and root-cause analyses, reducing alarm resolution time. A key advantage of this feature is the ability to integrate these findings directly into Microsoft Teams and Slack, making sure developers and stakeholders receive immediate alerts when issues arise. This centralized collaboration approach enables teams to work together efficiently, reducing duplicate efforts and facilitating consistent problem-solving across the organization.

In this blog post, we will walk through how to integrate CloudWatch Investigations with Slack channels and demonstrate how to interact with investigations in Slack.

Overview of the solution

CloudWatch Investigations can be started in multiple ways, like from existing Amazon CloudWatch log insights, metrics, or alarms. To demonstrate CloudWatch Investigations functionalities, we will use CloudWatch alarms in a sample web application available in the aws-samples GitHub repository. Steps on how to deploy this web app in your AWS environment, via a CloudFormation template, can be found here. You can learn more about the architecture of the resources deployed in the AWS One Observability workshop. If you choose to deploy the sample web application, you will be responsible for all service charges associated with the CloudFormation template deployment. Alternatively, you can use existing CloudWatch alarms in your environment. Examples of common Amazon CloudWatch alarms include: MemoryUtilization, CPUUtiliziation, 5xxErrors and 4xxError. A full list of available alarms can be found here.

For this blog, we will utilize a pre-configured alarm to monitor when one of the website services, backed by an Application Load Balancer, experiences abnormal response times. When the alarm triggers, CloudWatch Investigations automatically initiates an investigation, analyzing both the current alarm state and 90 days of CloudTrail event history to generate hypotheses and determine potential root causes. The investigation insights are published to a Slack channel via Amazon Q Developer in Chat Applications and Amazon Simple Notification Service (SNS).

Figure 1. Architecture diagram of the services involved in the investigation integration in Slack

Prerequisites

  1. Launch the Amazon CloudFormation template associated with the One Observability lab outlined in the AWS Samples GitHub.
  2. Set up a Standard Amazon SNS topic by following the instructions outlined here. To enable CloudWatch investigations to send notifications to Slack, you must add an access policy to the Amazon SNS topic, an example can be found here.
  3. When the topic configuration is complete, navigate to Amazon Q Developer in Chat Applications (formerly AWS Chatbot) to configure the integration between Amazon Q and Slack by following the instructions outlined here. To allow channel members to interact with the investigation in Slack, add the following permission templates to the Channel role settings: Notification Permissions, Amazon Q Permissions, and Amazon Q Operations assistant permissions. More details on these permissions can be located here.

Setting up CloudWatch Investigations

To get started, navigate to the Amazon CloudWatch console. Choose AI Operations and then Configuration.

Figure 2. Configure for this account button within the AWS Console

Before we can set up an investigation, we need to create an investigation group. This is an organizational structure to manage common properties of the investigation like retention requirements, encryption, access permissions and the SNS topic linked. Click Configure for this account and follow the prompts in the console to set up the investigation group. Detailed explanations for each prompt are located in the documentation here. For this demo, we left the default options for steps 1 and 2 of the prompts. In step 3, please select the existing SNS topic created in the prerequisites section.

Figure 3. Select SNS topic for Q Developer Operational Insights

For the investigation trigger, we will use an existing alarm created by the CloudFormation deployment mentioned at the beginning of this blog. The sample alarm is named:

ApplicationInsights/Services/AWS/ApplicationELB/TargetResponseTime/app/Servic-lista-... 

and it goes into ALARM state when one of the website services, backed by an Application Load Balancer, experiences abnormal response times.

To configure this alarm to automatically start an investigation when it goes into an ALARM state:

  1. In the CloudWatch console, choose Alarms, All alarms
  2. Search for the alarm name and click on it
  3. Choose Actions, Edit
  4. Choose Next once to skip the metrics and conditions section
  5. Choose Add investigation action and then select your investigation group as outlined in figure 4
  6. Choose Skip to Preview and create, then choose Update alarm

Figure 4. Configure alarm to automatically start investigations

Testing the solution

At this point, we are ready to test the solution. To simulate a website traffic overload and trigger the alarm, we are going to use Amazon ECS tasks deployed as part of the sample web application. Open up CloudShell and run the following command:

PETLISTADOPTIONS_CLUSTER=$(aws ecs list-clusters | jq '.clusterArns[]|select(contains("PetList"))' -r)

TRAFFICGENERATOR_SERVICE=$(aws ecs list-services --cluster $PETLISTADOPTIONS_CLUSTER | jq '.serviceArns[]|select(contains("trafficgenerator"))' -r)

aws ecs update-service --cluster $PETLISTADOPTIONS_CLUSTER --service $TRAFFICGENERATOR_SERVICE --desired-count 5

The command will launch 5 instances of the Amazon ECS traffic generator container task. Once the tasks are running (after about 5 minutes), the ALB will become overloaded with requests, forcing the alarm into ALARM state as shown below. You should also see a new investigation created.

Figure 5. CloudWatch Alarm in ALARM state

Interacting with the investigation via Slack

Once the alarm is triggered, an investigation is initiated. Since we associated the investigation with an Amazon SNS topic and subscribed our Slack client to it, we can see a message in our Slack channel from Amazon Q as seen in figure 6.

Figure 6. Slack notification for open investigation

Within Slack, channel members can accept useful hypotheses and discard unhelpful ones by clicking on the Accept or Discard button. They can also add text-based notes of observations or evidence to the investigation by clicking on the Add Note button. Amazon Q will respond to messages within the same thread as the original investigation message. Channel members will be able to track who has accepted or discarded messages, as well as notes made about the investigation. This emphasizes the power of Slack integration, as teams can collaborate on the investigation and track who is actively working on it. It is important to note that CloudWatch Investigations uses Generative AI and may provide suggestions different from those below based on your specific account environment.

Figure 7. Accept or discard investigation suggestions from Slack

When integrated with Slack, CloudWatch Investigations can provide suggestions and root-cause hypotheses. Channel members with appropriate permissions can access metrics, charts, and additional information related to the investigation by clicking the blue header at the top of the investigation message. This link will direct users to the CloudWatch Investigations feed in the AWS console as shown below in figure 8.

Figure 8. CloudWatch Investigations in CloudWatch console.

Integrating CloudWatch Investigations with Slack or Teams channels improves developers’ visibility of arising issues and provides targeted recommendations to reduce alarm resolution time. The Accept and Discard buttons make it straightforward to track who is actively working on an investigation, fostering a culture of collaboration. The best part? The integration is quick to set up, especially with existing alarms.

Clean Up

If you launched the CloudFormation template mentioned at the beginning of this blog, the services will continue to run unless you delete them. To make sure that you are not charged for use of the resources after the demo, please follow the below steps to delete the resources created as part of the steps performed on this blog.

  1. Remove the Amazon Q in Chat Applications Slack integration by clicking on Remove Workspace Integration and policy as explained here.
  2. Delete Amazon SNS topic and subscription as explained here.
  3. Remove the CloudWatch Investigations as explained here.
  4. Delete the images under the Amazon ECR repository named cdk-…-container-assets… as explained here.
  5. Open the CloudShell console or AWS CLI and execute the two commands below:
curl https://raw.githubusercontent.com/aws-samples/one-observability-demo/main/PetAdoptions/cdk/pet_stack/resources/destroy_stack.sh | bash

aws cloudformation delete-stack –stack-name CDKToolkit

After executing the above command, the resources of the demo should be destroyed. Look at the CloudFormation console in case of potential errors.

Conclusion

The new CloudWatch Investigations feature reduces alarm resolution time for development teams by providing actionable insights and recommendations. It is straightforward to connect investigations to a team’s primary form of communication, such as Teams or Slack, to improve notification awareness and interaction. To learn more about the capabilities of CloudWatch Investigations check out the feature announcement and documentation.

Happy investigating!

GitOps continuous delivery with ArgoCD and EKS using natural language

Post Syndicated from Jagdish Komakula original https://aws.amazon.com/blogs/devops/gitops-continuous-delivery-with-argocd-and-eks-using-natural-language/

Introduction

ArgoCD is a leading GitOps tool that empowers teams to manage Kubernetes deployments declaratively, using Git as the single source of truth. Its robust feature set, including automated sync, rollback support, drift detection, advanced deployment strategies, RBAC integration, and multi-cluster support, makes it a go-to solution for Kubernetes application delivery. However, as organizations scale, several pain points and operational challenges become apparent.

Pain Points with Traditional ArgoCD Usage

  • ArgoCD’s UI and CLI are designed for users with extensive technical background. Interacting with YAML manifests, understanding Kubernetes resource relationships, and troubleshooting sync errors require specialized knowledge. This limits access to GitOps workflows for less technical stakeholders and increases reliance on DevOps engineers.
  • Managing ArgoCD across multiple clusters or environments (using hub-spoke, per-cluster, or grouped models) introduces significant operational complexity. Teams must handle multiple ArgoCD instances, maintain consistent configuration, and coordinate deployments, which can become a bottleneck as service footprints grow.
  • ArgoCD excels at syncing and monitoring Kubernetes resources but lacks built-in mechanisms for pre-deployment (e.g., image scanning) or post-deployment (e.g., load testing) tasks. This forces teams to rely on external tools or custom scripts, fragmenting the deployment pipeline and increasing maintenance effort.
  • Promoting applications across environments (Dev → Test → Prod) is not natively streamlined. Teams must manually orchestrate or script these promotions, slowing down urgent fixes and complicating the release process.
  • As organizations adopt multi-cluster strategies, managing ArgoCD’s access, RBAC, and resource visibility across environments becomes cumbersome, often leading to fragmented workflows and potential security gaps.

How ArgoCD MCP Server with Amazon Q CLI addresses these challenges:

  • The integration of the ArgoCD MCP (Model Context Protocol) Server with Amazon Q CLI fundamentally transforms the user experience by introducing natural language interaction for GitOps operations.
  • With MCP, users can manage deployments, monitor application states, and perform sync or rollback operations using plain conversational language rather than technical commands or YAML. For example, a user can simply ask, “What applications are out of sync in production?” or “Sync the api-service application,” and the system executes the appropriate ArgoCD API calls in the background.
  • This democratizes access to GitOps, enabling less technical team members (such as QA, product managers, or support engineers) to safely interact with deployment workflows.
  • Natural language interfaces abstract away the complexity of multi-cluster and multi-environment management. Users can query or act on resources across clusters without memorizing resource names, namespaces, or API endpoints.
  • The MCP server handles authentication, session management, and robust error handling, reducing the need for manual troubleshooting and custom scripting.
  • The integration provides detailed feedback, intelligent endpoint handling, and comprehensive error messages, making it easier to diagnose and resolve issues. Full static type checking and environment-based configuration further enhance reliability and maintainability.
  • By leveraging Amazon Q CLI’s extensibility, users gain access to pre-built integrations and context-aware prompts, accelerating development and deployment workflows.
  • The MCP server enables AI assistants and language models to automate routine tasks, recommend actions, and even debug issues, acting as a virtual DevOps engineer. This can significantly reduce manual effort and speed up incident response.

Traditional ArgoCD vs. ArgoCD MCP Server with Amazon Q CLI

Feature/Challenge Traditional ArgoCD With MCP Server + Amazon Q CLI
User Interface Technical UI/CLI, YAML required Natural language, conversational
Access for Non-Engineers Limited Broad, democratized
Multi-Cluster Management Complex, manual Simplified, abstracted
Pre-Post Deployment Tasks External tools/scripts needed (Still external, but easier to invoke)
Application Promotion Manual or scripted Natural language, easier orchestration
Troubleshooting Technical, error-prone Guided, AI-assisted, detailed feedback
Automation Scripting required AI/agent-driven, proactive

You can perform the following actions using natural language using Amazon Q CLI integration with ArgoCD MCP server.

  • Application Management: List, create, update, and delete ArgoCD applications
  • Sync Operations: Trigger sync operations and monitor their status
  • Resource Tree Visualization: View the hierarchy of resources managed by applications
  • Health Status Monitoring: Check the health of applications and their resources
  • Event Tracking: View events related to applications and resources
  • Log Access: Retrieve logs from application workloads
  • Resource Actions: Execute actions on resources managed by applications

Setting Up Your Environment

Pre-requisites

Following are the pre-requisites for setting up your EKS environment to be managed by ArgoCD using Amazon Q CLI.

  • An AWS account with appropriate permissions
  • AWS CLI v2.13.0 or later
  • Node.js v18.0.0 or later
  • npm v9.0.0 or later
  • Amazon Q CLI v1.0.0 or later (npm install -g @aws/amazon-q-cli)
  • An EKS cluster (v1.27 or later) with ArgoCD v2.8 or later installed

Connecting to your EKS cluster

  1. Use AWS CLI to update your kubeconfig

aws eks update-kubeconfig --name <cluster_name> --region <region> --role-arn <iam_role_arn>

  1. Verify ArgoCD pods are running properly in the argocd namespace

kubectl get pods -n argocd

  1. Access the ArgoCD server UI locally using port forwarding command

kubectl port-forward svc/blueprints-addon-argocd-server -n argocd 8080:443

Create AgroCD API Token

  1. Access the ArgoCD UI at https://localhost:8080
  2. Log in with the admin credentials
  3. Navigate to User Settings > API Tokens
  4. Click “Generate New” to create a token
  5. Create an Amazon Q CLI MCP configuration file at .amazonq/mcp.json and update the ARGOCD_BASE_URL and ARGOCD_API_TOKEN as per your environment setup.

Integrating with Amazon Q CLI

{ 
  "mcpServers": {
    "argocd-mcp-stdio": { 
      "type": "stdio", 
      "command": "npx", 
      "args": [ 
         "argocd-mcp@latest", 
         "stdio" 
      ], 
      "env": { 
        "ARGOCD_BASE_URL": "<ARGOCD_BASE_URL>",
        "ARGOCD_API_TOKEN": "<ARGOCD_API_TOKEN>", 
        "NODE_TLS_REJECT_UNAUTHORIZED": "0" 
      } 
    } 
  }
}

Once configured, you can start using natural language commands with Amazon Q CLI to interact with your ArgoCD applications.

Managing ArgoCD applications using natural language

Listed below are some example prompts to interact with ArgoCD applications in your EKS cluster.

List ArgoCD application

Prompt: “List all ArgoCD applications in my cluster

Amazon Q listing all ArgoCD applications in my clusterAmazon Q will use the ArgoCD MCP server to retrieve and display all applications

Create new ArgoCD application

Prompt: Create new argocd application using App name: game-2048   Repo: https://github.com/aws-ia/terraform-aws-eks-blueprints  Path: patterns/gitops/getting-started-argocd/k8s. Branch: main  Namespace: argocd

Amazon Q creating new argocd application using MCP ServerAmazon Q will create a new application from GitRepo information provided

Viewing deployment status

Prompt: “Show me the resource tree for team-carmen app

Amazon Q showing Resource tree of argocd application
Amazon Q will display the hierarchy of Kubernetes resources managed by the application

Synchronizing applications

Prompt: “Show me the applications that’s out of sync

Amazon Q showing argocd out of sync applicationsAmazon Q will display the out of sync applications

Prompt: “Sync the application

Amazon Q syncing argocd applicationsAmazon Q syncing application

Amazon Q will:

  • Initiate a sync operation for the specified application
  • Monitor the sync progress
  • Report the final status of the sync operation

Healthchecks and monitoring

Prompt:”Check the health of all resources in the team-geordie application

Amazon Q showing health status of all the resources in an applicationAmazon Q showing health status of all the resources in an application

Amazon Q will:

  • Retrieve the health status of all resources
  • Identify any unhealthy components
  • Provide recommendations for addressing issues

Prompt: “Show me the logs for the failing pod in the team-platform application

Amazon Q showing logs for the failing podAmazon Q showing logs of problematic pod

Amazon Q will:

  • Identify problematic pods
  • Retrieve and display relevant logs
  • Highlight potential error messages

Conclusion

The integration of Amazon Q CLI with ArgoCD through the MCP server marks a transformative advancement in Kubernetes management, combining ArgoCD’s GitOps capabilities with Amazon Q’s natural language processing. By transforming complex Kubernetes operations into simple conversational interactions, this solution allows teams to focus on what truly matters – creating value for their business. Rather than spending time memorizing commands or navigating technical complexities, teams can now manage their cloud infrastructure through natural dialogue, making the cloud-native journey more accessible and efficient for everyone.Ready to transform your EKS and ArgoCD experience? It’s highly recommended to try out Amazon Q CLI integration with ArgoCD MCP and discover why DevOps teams are making it an essential part of their toolkit.


About the authors

Jagdish Komakula Picture Jagdish Komakula is a passionate Sr. Delivery Consultant working with AWS Professional Services. With over two decades of experience in Information Technology, he helped numerous enterprise clients successfully navigate their digital transformation journeys and cloud adoption initiatives.
Aditya Ambati Picture Aditya Ambati, Is an experienced DevOps Engineer with 12 plus years of experience in IT. Excellent reputation for resolving problems, improving customer satisfaction, and driving overall operational improvements.
Anand Krishna Varanasi Picture Anand Krishna Varanasi, is a seasoned AWS builder and architect who began his career over 16 years ago. He guides customers with cutting-edge cloud technology migration strategies (the 7 Rs) and modernization. He is very passionate about the role that technology plays in bridging the present with all the possibilities for our future.

 

Streamline Operational Troubleshooting with Amazon Q Developer CLI

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/devops/streamline-operational-troubleshooting-with-amazon-q-developer-cli/

Amazon Q Developer is the most capable generative AI–powered assistant for software development, helping developers perform complex workflows. Amazon Q Developer command-line interface (CLI) combines conversational AI with direct access to AWS services, helping you understand, build, and operate applications more effectively. The Amazon Q Developer CLI executes commands, analyzes outputs, and provides contextual recommendations based on best practices for troubleshooting tools and platforms available on your local machine.

In today’s cloud-native environments, troubleshooting production issues often involves juggling multiple terminal windows, parsing through extensive log files, and navigating numerous AWS console pages. This constant context-switching delays problem resolution and adds cognitive burden to teams managing cloud infrastructure.

In this blog post, you will explore how Amazon Q Developer CLI transforms the troubleshooting experience by streamlining challenging scenarios through conversational interactions.

The Traditional Troubleshooting Experience

When issues arise, engineers typically spend hours manually examining infrastructure configurations, reviewing logs across services, and analyzing error patterns. The process requires switching between multiple interfaces, correlating information from various sources, and deep AWS knowledge. This complex workflow often extends problem resolution from hours into days and increase the burden on the infrastructure teams.

Solution: Amazon Q Developer CLI

Amazon Q Developer CLI streamlines the entire troubleshooting process, from initial investigation to problem resolution, making complex AWS troubleshooting accessible and efficient through simple conversations.

How Amazon Q Developer CLI works:

  • Natural Language Interface: Execute AWS CLI commands and interact with AWS services using conversational prompts
  • Automated Discovery: Map out infrastructure and analyze configurations
  • Intelligent Log Analysis: Parse, correlate, and analyze logs across services
  • Root Cause Identification: Pinpoint issues through AI-powered reasoning
  • Guided Remediation: Implement fixes with minimal human intervention
  • Validation: Test solutions and explain complex issues simply

One of the built-in tools within the Amazon Q Developer CLI, use_aws, enables natural language interaction with AWS services, as shown in Figure 1. This tool leverages the AWS CLI permissions configured on your local machine, allowing secure and authorized access to your AWS resources.

A command line interface showing a list of tools and their permissions. The display is titled "/tools" and shows several built-in tools including execute_bash, fs_read, fs_write, report_issue, and use_aws. Each tool has an associated permission level indicated by asterisks. The use_aws tool is highlighted with "trust read-only commands" permission. At the bottom, there's a note stating "Trusted tools will run without confirmation" and a tip to "Use /tools help to edit permissions".

Figure 1: Tools selection in Amazon Q Developer CLI

Real-World Troubleshooting Scenario

Demonstration Environment Setup

This demonstration was performed with the following environment configuration:

The environment includes a local development machine with necessary tools, appropriate AWS account permissions, and terminal access. By starting Amazon Q Developer CLI in the project directory, it has immediate access to relevant code and configuration files.

Scenario: Troubleshooting NGINX 5XX Errors

The scenario demonstrates troubleshooting a multi-tier application architecture as shown in figure 2 deployed on Amazon ECS Fargate with:

  • Application Load Balancer (ALB) distributing traffic across availability zones
  • NGINX reverse proxy service handling incoming requests
  • Node.js backend service processing business logic
  • Service discovery enabling internal communication
  • CloudWatch Logs providing centralized logging

An AWS cloud architecture diagram showing the flow of traffic from an Internet user through multiple components. The diagram includes: At the top: An Internet user connecting to an Internet Gateway Within a VPC (Virtual Private Cloud): Two public subnets containing a NAT Gateway and Application Load Balancer Two private subnets within an ECS Cluster containing: An NGINX service (Fargate) A Backend service (Fargate) A 10-second timeout between them A Cloud Map Service Discovery component at the bottom CloudWatch Logs integration on the right side The diagram includes a note about gateway timeouts: "504 Gateway Timeout - Backend takes 15s to respond, NGINX timeout is 10s" All components are connected with arrows showing the flow of traffic and data through the system. The infrastructure follows AWS best practices with public and private subnet separation for security.

Figure 2: AWS Architecture diagram for the app used in this blog post

Traditional Troubleshooting Steps

For the architecture in figure 2, when 502 Gateway Timeout errors occur, traditional troubleshooting requires:

  1. Checking ALB target group health
  2. Examining ECS service status across multiple consoles
  3. Analyzing CloudWatch logs from different log groups
  4. Correlating error patterns between services
  5. Reviewing infrastructure code for configuration issues
  6. Implementing and deploying fixes

Amazon Q Developer CLI Approach

Instead, let’s see how Amazon Q Developer CLI handles this systematically, step by step:

Step1: Initial Problem Report

Amazon Q Developer CLI is provided with the initial prompt as a problem statement within the application project directory as shown in the following screenshot in figure 3. Amazon Q Developer responds back and says it is going investigate the 502 Gateway Timeout errors in the NGINX application.

Prompt:

Our production NGINX application is experiencing 502 Gateway Timeout errors. 
I have checked out the application and infrastructure code locally and the AWS CLI 
profile 'demo-profile' is configured with access to the AWS account where the 
infrastructure and application is deployed to. Can you help investigate and diagnose the issue?

A Visual Studio Code window showing a debugging session for an NGINX application. The interface has three main sections: a file explorer on the left showing project files including 'app.ts' and 'nginx-config-task.json', a terminal tab in the center displaying an "Amazon Q" ASCII art logo, and a conversation where a user is reporting 502 Gateway Timeout errors. The terminal shows AWS CLI command execution using a tool called "use_aws" with parameters including the service name "ecs" and region "us-west-2". The interface has red annotations highlighting key areas like "project files", "User provided initial prompt", and "Q CLI executing AWS CLI calls.

Figure 3: Amazon Q Developer CLI with initial prompt and problem statement

Step2: Systematic Infrastructure Discovery

Amazon Q Developer CLI start to systematically discovering the infrastructure as shown in the following screenshot in figure 4. If you see the initial prompt did not include that the app is hosted on ECS, but Amazon Q Developer CLI understood the context and executes the AWS CLI calls to describe the Cluster and the services within it. It made sure that the ECS tasks are running for both the services within the Cluster. It is a key discovery that both services show healthy status (1/1 desired count), indicating the issue isn’t service availability.

A terminal window showing three sequential AWS CLI commands being executed through a "use_aws" tool: First command: "list-clusters" operation for ECS service in us-west-2 region using demo-profile, completing in 1.244 seconds Second command: "list-services" operation targeting the NginxSimulationCluster, completing in 0.877 seconds with confirmation of finding both nginx-service and backend-service Third command: "describe-services" operation examining both services in detail, completing in 0.968 seconds with confirmation that both services are running as expected (1/1 desired count) Each command includes execution details, parameters, and completion status, with the system preparing to check CloudWatch logs next.

Figure 4: AWS Infrastructure discovery by Amazon Q Developer CLI

Step 3: Intelligent Log Analysis

Amazon Q Developer CLI retrieves and analyzes recent CloudWatch logs from the NGINX container, immediately identifying the critical error pattern as shown in the following screenshot in figure 5, where Amazon Q Developer responds: “Perfect! I found the issue. The NGINX logs show clear 504 gateway timeout with upstream timeout messages.”

A terminal window showing two AWS CloudWatch Logs commands being executed: First command: "describe-log-streams" operation for the "/ecs/nginx-service" log group, limiting to 5 most recent entries, ordered by LastEventTime in descending order Second command: "get-log-events" operation retrieving 50 log entries from a specific NGINX container log stream The output reveals a critical error message highlighted at the bottom showing an upstream timeout (error 110) occurring while reading response headers. The error details include client IP 10.0.0.247, upstream server at http://10.0.3.18:3000/, and host 52.35.62.210.

Figure 5: CloudWatch Log analysis by Amazon Q Developer CLI

Step 4: Amazon Q Developer CLI Analysis and Root Cause Identification

Amazon Q Developer examines backend service logs and discovers a mismatch between the backend service response time and NGINX timeout settings, as seen in the following screenshot in figure 6.

A terminal window showing AWS CloudWatch Logs commands and their output. The first command describes log streams for a backend container, and the second retrieves log events. The output reveals a debugging analysis showing that while health checks work fine, regular requests are being delayed by about 15 seconds, causing NGINX timeout issues. The log group is "/ecs/backend-service" in the us-west-2 region using a demo-profile.

Figure 6: Root cause identification by Amazon Q Developer CLI

Step 5: Amazon Q Developer CLI Root Cause Analysis

Amazon Q Developer CLI examines the ECS task definitions to identify the exact configuration mismatch, as shown in the following screenshot in figure 7. Amazon Q Developer finds that:

  • Backend service is configured with response_delay=15000 (15 secs)
  • NGINX proxy is configured with proxy_read_timeout 10s

This mismatch causes 504 gateway timeout errors when the backend response exceeds NGINX’s timeout threshold.

A terminal window showing two AWS CLI commands to describe ECS task definitions in the us-west-2 region. Below the commands is a highlighted "Root Cause Analysis" section that explains a timeout mismatch: the backend service is configured with a 15-second response delay while NGINX has a 10-second proxy timeout, resulting in 502 Gateway Timeout errors. Both commands use a demo-profile and are labeled as checking timeout and response delay configurations.

Figure 7: Root cause analysis and issue detection by Amazon Q Developer CLI

Step 6: Automated Code Fix

Here’s where Amazon Q Developer CLI truly excels—it doesn’t just diagnose; it implements the fix. Since Amazon Q Developer CLI is started within the project where the CDK code for ECS task definition is defined, it identified the code configuration and also modified it, as shown in the following screenshot in figure 8.

A terminal window showing file operations using fs_read and fs_write tools. The code changes show an NGINX configuration update in ecs-nginx-cdk.ts, where the proxy_read_timeout is being modified from '10s' to '20s'. The file also shows additional timeout configurations being added, including proxy_connect_timeout and proxy_send_timeout. The update is confirmed with a user prompt and completed in 0.2 seconds.

Figure 8: CDK code fix by Amazon Q Developer CLI

Step 7: Deployment

Amazon Q Developer CLI builds and deploys the fix by executing cdk synth and cdk deploy using the ‘demo-profile‘ AWS CLI profile that was initially provided in the prompt, as shown in the following screenshot in figure 9.

A terminal window showing two execute_bash commands running in sequence. The first command builds a CDK project using 'npm run build' in the nginx-app directory, completing in 4.102s. The second command deploys the updated CDK stack using 'cdk deploy' with the demo-profile, showing deployment progress including some warnings about minHealthyPercent configurations and CloudFormation stack updates in us-west-2 region.

Figure 9: CDK code build and deployment by Amazon Q Developer CLI

Step 8: Validation

Amazon Q Developer CLI validates the solution by sending a curl request to the ALB endpoint after the successful deployment, as shown in the following screenshot in figure 10.

A terminal window showing the execution of a curl command to test an NGINX application on AWS. The command targets an Elastic Load Balancer in the us-west-2 region. The response shows a successful HTTP 200 OK status after 14 seconds, with a JSON response containing the message "Hello from backend". The test completes in 15.100 seconds, indicating the fix for previous 502 errors was successful.

Figure 10: Fix validation by Amazon Q Developer CLI

In addition to that, Amazon Q Developer also sends a request to the health check endpoint and validates everything is working after the fix was deployed, as shown in the following screenshot in figure 11.

A terminal screenshot showing the results of a health check on an Nginx server using curl. The command executed shows a successful response with "healthy" status, completing in 0.65 seconds. The output displays various metrics including download speed (386 B/s), 100% completion rate, and timing statistics for real, user, and system processes.

Figure 11: Health endpoint validation by Amazon Q Developer CLI

What Amazon Q Developer CLI Accomplished

Using just conversational commands, Amazon Q Developer CLI performed a complete troubleshooting cycle:

  • Infrastructure Discovery: Automatically mapped ECS clusters, services, and dependencies
  • Log Correlation: Analyzed thousands of log entries across multiple services
  • Root Cause Analysis: Identified exact configuration mismatch between NGINX’s timeout (10s) and the backend’s response delay (15s)
  • Code-Level Diagnosis: Located problematic timeout setting in CDK infrastructure code
  • Automated Implementation: Modified infrastructure code to increase the NGINX timeout
  • End-to-End Deployment: Built, deployed, and validated the complete solution
  • Comprehensive Testing: Verified both fix effectiveness and overall system health

Amazon Q Developer CLI handles troubleshooting tasks through a single, conversational interface, eliminating the need for multiple tools or AWS CLI commands.

Conclusion

Amazon Q Developer CLI represents a significant evolution in how we troubleshoot cloud infrastructure issues. By combining natural language understanding with powerful command execution capabilities, it transforms complex troubleshooting workflows into efficient, action-oriented dialogues. Whether you’re dealing with NGINX 5XX errors or similar issues across other AWS services, Amazon Q Developer CLI can help you diagnose issues, implement fixes, and validate solutions—all through a conversational interface that feels natural and intuitive.

Give Amazon Q Developer CLI a try the next time you encounter a troubleshooting challenge, and experience the difference it can make in your operational workflow.

To learn more about Amazon Q Developer’s features and pricing details, visit the Amazon Q Developer product page.

About the Author

kirankumar.jpeg

Kirankumar Chandrashekar is a Generative AI Specialist Solutions Architect at AWS, focusing on Amazon Q Developer. Bringing deep expertise in AWS cloud services, DevOps, modernization, and infrastructure as code, he helps customers accelerate their development cycles and elevate developer productivity through innovative AI-powered solutions. By leveraging Amazon Q Developer, he enables teams to build applications faster, automate routine tasks, and streamline development workflows. Kirankumar is dedicated to enhancing developer efficiency while solving complex customer challenges, and enjoys music, cooking, and traveling.

Announcing the new AWS CDK EKS v2 L2 Constructs

Post Syndicated from Matteo Luigi Restelli original https://aws.amazon.com/blogs/devops/announcing-the-new-aws-cdk-eks-v2-l2-constructs/

Introduction

Today, we’re announcing the release of aws-eks-v2 construct, a new alpha version of AWS Cloud Development Kit (CDK) L2 construct for Amazon Elastic Kubernetes Service (EKS). This construct represents a significant change in how developers can define and manage their EKS environments using infrastructure as code. While maintaining the powerful capabilities of its predecessor library for creating and managing EKS clusters, this alpha release introduces key architectural improvements that enhance both flexibility and maintainability.

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework that enables you to define your cloud infrastructure using familiar programming languages and deploy it through AWS CloudFormation.
The CDK uses constructs – a layered abstraction concept where Layer 1 (L1) constructs map directly to CloudFormation resources, while Layer 2 (L2) constructs provide intuitive APIs, helper functions, best-practice defaults, and generate a lot of the boilerplate code and glue logic for you. This layered approach means you can seamlessly move between high-level abstractions for common use cases and low-level resource definitions when you need fine-grained control. The result is an Infrastructure as Code (IaC) experience that helps you maintain productivity while ensuring you have access to the full power of AWS services when you need it.
You can read more about constructs and their benefits in the CDK user guide.

In this post we’ll explore:

  • The reasoning behind the creation of a new L2 construct for EKS and the improvements introduced by this new library
  • How to use the new EKS v2 construct

Background

Amazon EKS is a managed Kubernetes service that makes it easy to run Kubernetes on AWS without needing to manage the control plane or nodes. EKS automatically handles critical tasks like patching, node provisioning, and upgrades. You can run EKS using EC2 instances for worker nodes, AWS Fargate for serverless containers, or a combination of both, providing the flexibility to choose the right compute option for your workloads.

While the existing EKS L2 construct has served customers well, we identified opportunities to further enhance the developer experience and operational efficiency based on their feedback. The new aws-eks-v2 construct delivers significant improvements through native AWS CloudFormation resources, modern Access Entry-based authentication, and enhanced architectural flexibility. Key benefits include reduced deployment overhead, simplified cluster access management, support for multiple EKS clusters within a single stack, and granular control over resource creation with features like the optional kubectl Lambda handler.
These improvements help customers build and manage their EKS infrastructure more efficiently while maintaining the robust functionality they expect from AWS CDK constructs.

Using the L2

Given that this construct is in the alpha stage, you’ll need to install and import the construct using the experimental construct libraries process. During the alpha stage, the CDK team is actively gathering customer feedback and iterating on the implementation. Once the construct meets our bar for general availability, we’ll integrate it directly into the AWS CDK core library, making it as easily accessible as our other L1 and L2 constructs. This approach allows us to rapidly deliver new capabilities while ensuring they meet the high standards our customers expect.

Deploying EKS Cluster with Default Configuration

Let’s explore how to create an Amazon EKS cluster using AWS CDK aws-eks-v2 construct with minimal configuration requirements. The following example demonstrates the most straightforward way to define an EKS cluster, leveraging the power of CDK’s opinionated defaults.
Creating a new cluster is done using the Cluster construct. The only required property is the Kubernetes version.

import * as eksv2 from '@aws-cdk/aws-eks-v2-alpha';

// Creating an EKS Cluster with default properties
const eksCluster = new eksv2.Cluster(this, 'EksCluster', {
    version: eksv2.KubernetesVersion.V1_32
});

This translates in the following Architecture as shown in figure 1:

L2 CDK Construct v2 for EKS - Default Architecture

Figure 1 – L2 CDK construct v2 for EKS, Default Architecture

  • Amazon Virtual Private Cloud (VPC) – A logically isolated section of the AWS Cloud that spans across two Availability Zones, equipped with an Internet Gateway to enable secure communication with the internet. This multi-AZ design helps ensure your applications remain available even if an Availability Zone experiences issues.
  • Amazon EKS Control Plane – A fully managed Kubernetes control plane deployed in an AWS-managed VPC , providing high availability and automatic version management for the Kubernetes control plane components.
  • Public Subnet Infrastructure – Two public subnets, each with its own NAT Gateway Instance, enabling your cluster components to securely access the internet for essential operations like pulling container images and downloading updates. These NAT Gateways provide a secure outbound path while protecting your workloads from direct internet exposure.
  • Private Subnet Configuration – Two private subnets optimized for running your EKS worker nodes, offering enhanced security by isolating your workloads from direct internet access while maintaining the ability to communicate with AWS services and the internet through the NAT Gateways.
  • IAM Security Foundation – A comprehensive set of IAM roles and policies that implement the principle of least privilege:
    • Control plane service role that enables EKS to manage AWS resources on your behalf
    • Node IAM role that allows worker nodes to interact with other AWS services and join the EKS cluster

You can also use FargateCluster to provision a cluster that uses only Fargate workers.

import * as eksv2 from '@aws-cdk/aws-eks-v2-alpha';

// Creating an EKS Cluster with default properties and Fargate workers
const eksFargateCluster = new eksv2.FargateCluster(this, 'EksFargateCluster', {
   version: eksv2.KubernetesVersion.V1_32,
});

To help our customers maintain better control over their cluster access patterns, the Kubectl Handler is not automatically deployed with the default configuration. You can easily enable this functionality by configuring the kubectlProviderOptions property when you need kubectl access management as shown below.

import * as eksv2 from '@aws-cdk/aws-eks-v2-alpha';
import { KubectlV32Layer } from '@aws-cdk/lambda-layer-kubectl-v32'

// Creating an EKS Cluster with default properties and kubectl handler
const eksCluster = new eksv2.Cluster(this, 'EksCluster', {
   version: eksv2.KubernetesVersion.V1_32,
   kubectlProviderOptions: {
      kubectlLayer: new KubectlV32Layer(this, 'KubectlLayer')
   },
});

Deploying EKS Cluster with AutoMode

EKS Auto Mode represents a significant advancement in how Amazon EKS manages compute capacity for Kubernetes clusters. This intelligent capacity management system automatically provisions and scales node groups based on workload demands, removing the need for manual capacity planning.

When you create a new cluster with the aws-eks-v2 construct, EKS Automode is activated by default, by means that DefaultCapacityType.AUTOMODE is automatically set as the default capacity type for the EKS Cluster. If you prefer, you can specify the defaultCapacityType to AutoMode:

import * as eksv2 from '@aws-cdk/aws-eks-v2-alpha';

// Creating an EKS Cluster with AutoMode
const eksCluster = new eksv2.Cluster(this, 'EksCluster', {
   version: eksv2.KubernetesVersion.V1_32,
   defaultCapacityType: eksv2.DefaultCapacityType.AUTOMODE, // default value
});

After deploying the Stack containing the construct instance, in the EKS Console you’ll be able to see that an EKS Cluster has been created with AutoMode enabled:

EKS Cluster Deployed with Automode

Figure 2 – EKS Cluster Deployed with Automode

Auto Mode enhances your Amazon EKS experience by automatically configuring two strategically designed node pools out of the box:

  • A system node pool optimized for running critical cluster system components and add-ons, ensuring reliable cluster operations.
  • A general node pool specifically tuned for your application workloads, providing the flexibility needed for diverse containerized applications.

You can configure which node pools to enable through the compute property:

import * as eksv2 from '@aws-cdk/aws-eks-v2-alpha';

// Creating an EKS Cluster with Automode and selecting nodePools
const eksCluster = new eksv2.Cluster(this, 'EksCluster', {
   version: eksv2.KubernetesVersion.V1_32,
   defaultCapacityType: eksv2.DefaultCapacityType.AUTOMODE,
   compute: {
      nodePools: ['system', 'general-purpose'],
   },
});

Deploying EKS Cluster with Managed Node Groups

Amazon EKS Managed Node Groups deliver a seamless compute management experience for your Kubernetes clusters. This powerful capability eliminates operational complexity by automating the end-to-end lifecycle of Amazon EC2 instances that power your containerized applications.
Behind the scenes, Amazon EKS managed node groups intelligently orchestrate these changes, ensuring zero-disruption to your applications through graceful node draining. The service automatically leverages the latest Amazon EKS-optimized AMIs, providing a secure and optimized foundation for your workloads.

By setting defaultCapacityType to NODEGROUP, customers can leverage the traditional managed node group management approach:

import * as eksv2 from '@aws-cdk/aws-eks-v2-alpha';

// Creating an EKS Cluster with Managed Node Groups and default instance types
const eksCluster = new eksv2.Cluster(this, 'EksCluster', {
   version: eksv2.KubernetesVersion.V1_32,
   defaultCapacityType: eksv2.DefaultCapacityType.NODEGROUP,
});

By default, when using DefaultCapacityType.NODEGROUP, this library will allocate a managed node group with two m5.large instances.
After deploying the above code, you can check the EKS Console to see that an EKS Cluster has been deployed as shown in figure 3:

EKS Cluster Deployed with Managed Node Groups

Figure 3 – EKS Cluster Deployed with Managed Node Groups

You can also check the Compute tab and see the Managed Node Group Configuration as shown in figure 4:

EKS Cluster Managed Node Group Default Configuration

Figure 4 – EKS Cluster Managed Node Group Default Configuration

If you want to have control over instance types of a Managed Node Group, you can specify the default EC2 type as property of the construct:

import * as eksv2 from '@aws-cdk/aws-eks-v2-alpha';
import * as ec2 from 'aws-cdk-lib/aws-ec2'

// Creating an EKS Cluster with Managed Node Groups and specific instance types
const eksCluster = new eksv2.Cluster(this, 'EksCluster', {
   version: eksv2.KubernetesVersion.V1_32,
   defaultCapacityType: eksv2.DefaultCapacityType.NODEGROUP,
   defaultCapacity: 5,
   defaultCapacityInstance: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.SMALL),
});

You can also specify additional customizations after the EKS cluster declaration, via the addNodegroupCapacity method:

import * as eksv2 from '@aws-cdk/aws-eks-v2-alpha';
import * as ec2 from 'aws-cdk-lib/aws-ec2'

// Creating an EKS Cluster with Managed Node Groups and specific instance types
const eksCluster = new eksv2.Cluster(this, 'EksCluster', {
   version: eksv2.KubernetesVersion.V1_32,
   defaultCapacityType: eksv2.DefaultCapacityType.NODEGROUP,
   defaultCapacity: 0,
});

eksCluster.addNodegroupCapacity('custom-node-group', {
  instanceTypes: [new ec2.InstanceType('m5.large')],
  minSize: 4,
  diskSize: 100,
});

Managing Permissions through Access Entries

The new aws-eks-v2 construct transitions away from the previous ConfigMap-based authentication (which is deprecated in EKS) in favor of the Access Entries Authentication mode. This change introduces Access Entry as the standardized method for managing cluster permissions, offering a more streamlined and secure approach to granting cluster access to IAM users and roles.

You can define Access Policies through the AccessPolicy construct and you can adjust the scope of the Access Policy to the entire EKS cluster or to specific EKS Namespaces:

import * as eksv2 from '@aws-cdk/aws-eks-v2-alpha';

// AmazonEKSClusterAdminPolicy with `cluster` scope
eks.AccessPolicy.fromAccessPolicyName('AmazonEKSClusterAdminPolicy', {
   accessScopeType: eks.AccessScopeType.CLUSTER,
});

// AmazonEKSAdminPolicy with `namespace` scope
eks.AccessPolicy.fromAccessPolicyName('AmazonEKSAdminPolicy', {
   accessScopeType: eks.AccessScopeType.NAMESPACE,
   namespaces: ['foo', 'bar'] 
});

You can then grant access to specific IAM Roles using the grantAccess method:

import * as iam from 'aws-cdk-lib/aws-iam'

// Defining a IAM Role
const clusterAdminRole = new iam.Role(this, 'ClusterAdminRole', {
   assumedBy: new iam.ArnPrincipal('arn_for_trusted_principal'),
});

// Creating an EKS Cluster with AutoMode
const eksCluster = new eksv2.Cluster(this, 'EksCluster', {
   version: eksv2.KubernetesVersion.V1_32,
   defaultCapacityType: eksv2.DefaultCapacityType.AUTOMODE,
});

// Cluster Admin role for this cluster
eksCluster.grantAccess('clusterAdminAccess', clusterAdminRole.roleArn, [
	eks.AccessPolicy.fromAccessPolicyName('AmazonEKSClusterAdminPolicy', {
   	    accessScopeType: eks.AccessScopeType.CLUSTER,
    }),
]);

When the Principal assumes the ClusterAdminRole, it receives seamless access to the EKS cluster through a carefully orchestrated permission chain. This access is governed by the AmazonEKSClusterAdminPolicy, which is automatically attached to the Access Policy linked to the IAM Role.

Conclusion

In this post, we introduced the new AWS CDK L2 construct (aws-eks-v2) for Amazon EKS, demonstrating how it simplifies cluster deployment while offering enhanced flexibility and operational efficiency. Through practical examples, we showcased how customers can leverage the construct’s intelligent defaults and customization options to build production-ready Kubernetes environments on AWS

The new L2 construct for Amazon EKS delivers significant improvements that help customers accelerate their container adoption journey:

  • Enhanced Performance: Eliminates dependency on Custom Resources and AWS Lambda functions by utilizing native AWS CloudFormation resources, resulting in faster and more reliable deployments.
  • Modern Authentication: Implements Access Entry-based authentication, replacing the deprecated ConfigMap approach with a more secure and programmable solution.
  • Improved Scalability: Removes the single-cluster-per-stack limitation and eliminates nested stacks, enabling more flexible architectural patterns.
  • Optimized Resource Creation: Makes the kubectl Lambda handler optional, giving customers fine-grained control over their infrastructure components.
  • Streamlined Operations: Provides automated node group management with intelligent defaults while maintaining full customer control when needed.

To get started with the new EKS L2 construct, visit the AWS CDK documentation. If you have specific features you’d like to see added, we encourage you to submit a feature request in the aws-cdk GitHub repository. Your feedback helps us continue innovating on your behalf.

About the author

Matteo Luigi Restelli

Matteo Luigi Restelli is an AWS Sr. Partner Solutions Architect. He mainly focuses on Italian Consulting AWS Partners and is also specialized in Infrastructure as Code, Cloud Native App Development and DevOps. Outside of work, he enjoys swimming, rock & roll music and learning something new everyday especially in Computer Science space.

How to enhance your application resiliency using Amazon Q Developer

Post Syndicated from Dr. Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/how-to-enhance-your-application-resiliency-using-amazon-q-developer/

“Everything fails, all the time” – Werner Vogels, Amazon.com CTO

In today’s digital landscape, designing applications with resilience in mind is crucial. Resiliency is the ability of applications to handle failures gracefully, adapt to changing conditions, and recover swiftly from disruptions. By integrating resilience into your application architecture, you can minimize downtime, mitigate the impact of failures, and ensure continuous availability and performance for end-users.

Amazon Q Developer, a generative AI-powered assistant for software development lifecycle (SDLC), helps design resilient architectures and enhance application availability. It recommends best practices, analyzes code, and identifies potential failure points, serving as an expert companion to strengthen application architecture and boost system availability through the following key resiliency practices.

  • Resilient design pattern recommendations: Access tailored design patterns like distributed systems, microservices, and serverless architectures. Amazon Q offers recommendations across redundancy, robust failovers, and circuit breakers to boost resilience in your environment.
  • Disaster Recovery planning: Amazon Q offers expert guidance on comprehensive disaster recovery (DR), including efficient backups, systematic restorations, strategic data replication, and seamless failovers to ensure rapid recovery from disruptions with minimal impact.
  • Customized Resiliency testing frameworks: Create custom templates to simulate diverse failure scenarios, such as network degradation and infrastructure outages. This streamlines thorough resilience verification across your systems.
  • Failure mode evaluation: Use Amazon Q to conduct comprehensive Failure Mode and Effects Analysis (FMEA) identifying infrastructure vulnerabilities and assessing their impact. Amazon Q then ranks these issues by severity, enabling you to prioritize and address the most critical risks to protect your production environment.

In the following sections, we will demonstrate how Amazon Q improves the resiliency of a foundational application architecture.

Prerequisites

To begin using Amazon Q, the following are required:

Application Overview

We have a three-tier web application shown below that is running on AWS in a single Availability Zone (AZ). The architecture consists of Application Layer hosted on Amazon Elastic Kubernetes Service (Amazon EKS) cluster with two Amazon Elastic Compute Cloud (Amazon EC2) nodes in a single-AZ and the Data Layer uses Amazon Relational Database Service (Amazon RDS) instance deployed in single-AZ configuration. The architecture is functional but has several limitations. It poses a single point of failure and offers limited application availability with no fault tolerance. High response times may occur because there is no caching layer in front of the database. Additionally, the lack of auto-scaling can lead to resource contention.

A three-tier web application basic architecture running on AWS in a single Availability Zone.

Basic Application Overview

Enhance Application Resiliency

Let’s explore how Amazon Q helps incorporate resiliency best practices that enhance system availability in our basic application architecture.

Resilient architecture recommendations

The initial architecture faced challenges with reliability, performance and scalability, largely due to its single-point of failure and lacked redundancy. To address this, we described the existing application design and its challenges to Amazon Q using a natural language prompt to seek resiliency recommendations.

Prompt for improving the architecture design:

I have manually setup an application that runs within an EKS cluster on two EC2 nodes in single AZ. My application is not highly available and scalable. It talks to an RDS database which is single AZ. However, there is high response times from database. Provide me only the recommendations to re-design this application architecture at each layer that will addresses all these issues.

Amazon Q offering resiliency architecture recommendations

Amazon Q offering resiliency architecture recommendations

Amazon Q analyzed the provided context and recommended improvements such as introducing Multi-AZ deployments for high availability, adding auto-scaling groups for elasticity, and incorporating caching layers to enhance performance. These targeted recommendations helped redesign the architecture to be more resilient and scalable, directly addressing the initial shortcomings.

Disaster Recovery (DR) recommendations to improve the architecture

To further enhance resiliency, we prompted Amazon Q for disaster recovery (DR) recommendations. We asked for guidance aligned with the AWS Well-Architected Framework. This built upon the previously improved architecture design.

Prompt for recommendations on Disaster Recovery (DR) and architecture based on RTO/RPO

Based on the above improvements on AWS architecture design, share recommendations for Disaster Recovery (DR) based on AWS Well Architected Framework

Optionally, we can use advanced prompts like the below with additional context:

Please provide a recommendations to redesign my application that is running on an EKS cluster with two EC2 nodes and a single-AZ RDS database, addressing high database latency, low availability, and scalability issues. Suggest improvements across all architectural layers including presentation tier, application tier and data tier to enhance performance, resiliency, and scalability. Also, recommend DR strategies aligned with the AWS Well-Architected Framework focusing on resilience, data protection, and recovery.

Amazon Q tailoring recommendations based on business requirements using  AWS Well-Architected Framework

Amazon Q tailoring recommendations based on business requirements using AWS Well-Architected Framework

Amazon Q provided detailed DR strategies. These included multi-region configuration, backup and restore procedures, and best practices for meeting specific Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements.

Prepare DR strategy based on RTO and RPO requirements:

Diving further, asking for a specific disaster recovery strategy that meets the application RTO requirements of 2 hours and RPO requirements of 30 minutes.

Prompt for DR strategy based on RTO/RPO values

Which DR strategy should I use if my RTO is less than 2 hours and RPO is less than 30 minutes?

Amazon Q recommending disaster recovery strategy

Amazon Q recommending disaster recovery strategy

Amazon Q recommended a Pilot light approach, detailing the setup and components needed to achieve the specified disaster recovery objectives.

Define resiliency testing workflow, identify key metrics and tools

As we incorporate resiliency best practices into the application architecture, its is important to employ a resiliency test workflow to ensure application’s resiliency requirements are met. To do this, we are asking for guidance to define an end-to-end resiliency testing process workflow. We also want to identify the key metrics and tools needed to test the resilience of each AWS service involved in the architecture.

Prompt for defining the resiliency testing workflow:

Define the end-to-end resiliency testing process workflow. Also, identify the key metrics and tools that should be used to test the resilience of each AWS service involved in the improved architecture design.

Amazon Q offering resiliency testing best practices and tools

Amazon Q offering resiliency testing best practices and tools

Amazon Q offers a step-by-step approach to define resiliency testing experiments and prepare the environment for testing.

Failure mode evaluation to prioritize resiliency tests

Failure Mode and Effects Analysis (FMEA) can further assist with designing the resiliency tests. It is a proactive method to identify potential failures in processes or systems, assess their impact, and prioritize critical issues. It evaluates failure modes across hardware, software, human factors, and external events, enabling teams to develop strategies for prevention, detection, and mitigation, ultimately enhancing system resilience.

Leveraging Amazon Q, we requested a comprehensive FMEA report that includes components, cause, effect and their respective Risk Priority Numbers (RPN). RPNs are calculated by multiplying three key factors: Severity (S), Occurrence (O), and Detection (D). It helps organizations understand and prioritize which risks to address first.

Prompt for designing the FMEA template and scoring:

Create the FMEA in tabular format with scoring for improved architecture design above keeping in mind the RTO/RPO values and provide the steps for execution as well.

Amazon Q assisting with systematic risk assessment and FMEA report

Amazon Q assisting with systematic risk assessment and FMEA report

Amazon Q intelligently incorporated previously defined RTO and RPO requirements to identify critical failure scenarios and calculated RPN for each potential incident.

Enhanced Architecture Implementing Resiliency Best Practices

After identifying the key pain points in our original architecture such as single points of failure, limited scalability, and lack of automated recovery, we leveraged Amazon Q to analyze our architecture to get targeted recommendations to elevate the resiliency. By describing our requirements and challenges to Amazon Q, we received actionable guidance on AWS best practices and service configurations, which we then implemented to transform our infrastructure for high resilience and availability.

Resilient Application Architecture

Resilient Application Architecture

The original Application Layer was running in a single Availability Zone without auto-scaling, leading to potential downtime and performance bottlenecks. Amazon Q recommended distributing Amazon EKS worker nodes across multiple Availability Zones and enabling the Cluster Autoscaler to dynamically adjust node capacity based on traffic patterns. Additionally, it suggested implementing horizontal pod autoscaling within Amazon EKS to automatically scale application resources according to CPU utilization and custom metrics. Following these recommendations, we deployed Amazon EKS worker nodes across three Availability Zones, configured Cluster Autoscaler and horizontal pod autoscaling, and integrated an Application Load Balancer, to intelligently distribute incoming traffic. These changes significantly improved scalability, fault tolerance, and performance.

The Data Layer initially relied on a single-instance Amazon RDS deployment, which posed a risk of downtime and limited read performance. Upon review, Amazon Q advised implementing a Multi-AZ Amazon RDS configuration to enable automated failover and improve availability. It also recommended deploying read replicas to offload read-heavy workloads and enhance performance. Furthermore, Amazon Q suggested adding a Multi-AZ Amazon ElastiCache for Redis to reduce database load and speed up data access. We incorporated these recommendations, resulting in a more resilient and performant data layer capable of handling failover scenarios and scaling read operations efficiently.

The Presentation Layer lacked an optimized content delivery mechanism and comprehensive security controls. Amazon Q recommended integrating Amazon CloudFront as a content delivery network to accelerate the delivery of static content and reduce load on application servers. It also suggested deploying AWS WAF to protect against common web exploits. To improve operational visibility, Amazon Q emphasized the importance of comprehensive monitoring using Amazon CloudWatch, combining logs, metrics, and traces for rapid issue detection and resolution. Implementing these recommendations enhanced both the performance and security posture of the presentation layer.

Conclusion

Amazon Q Developer transforms how teams build resilient applications by serving as your expert companion throughout the development journey. Its guidance helps create systems that excel in resilience, scalability, and availability—critical factors for today’s demanding digital landscape. Amazon Q goes beyond theoretical advice by providing practical, step-by-step implementation guidance. In the above, we’ve witnessed how Amazon Q’s expertise can transform basic architectures into robust, failure-resistant systems. Its recommendations such as Multi-AZ redundancy, elastic scaling, strategic caching, and proactive resilience testing create applications that maintain performance and availability even during significant disruptions.

Ready to strengthen your applications against unexpected challenges? Harness Amazon Q’s capabilities to create resilient infrastructure that consistently delivers for your customers, regardless of conditions. Unlock the full potential of your AWS infrastructure and deliver uninterrupted service to your customers, today. To learn more about Amazon Q refer to the documentation.

About the authors:

Dr. Rahul Sharad Gaikwad

Dr. Rahul is a Solutions Architect at AWS, driving cloud innovation through migration and modernization of customer workloads. A Generative AI and DevOps enthusiast, he architects cutting-edge solutions and is recognized as an APJC HashiCorp Ambassador. He earned his Ph.D. in AIOps and he is recipient of the Man of Excellence Award , Indian Achievers’ Award , Best PhD Thesis Award, Research Scholar of the Year Award and Young Researcher Award.

Janardhan Molumuri

Janardhan Molumuri is a Principal Technical Leader at AWS, comes with over two decades of Engineering leadership experience, advising customers on Cloud Adoption strategies and emerging technologies including generative AI. He has passion for thought leadership, speaking, writing, and enjoys exploring technology trends to solve problems at scale.

Migrating a CDK v1 Application to CDK v2 with Amazon Q Developer

Post Syndicated from Dr. Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/migrating-a-cdk-v1-application-to-cdk-v2-with-amazon-q-developer/

Introduction:

AWS Cloud Development Kit (AWS CDK) is an open-source software development framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. As of June 1, 2023, AWS CDK version 1 is no longer supported. To avoid the potential issues that come with using an outdated version and to take advantage of the latest features and improvements, we highly recommend upgrading to AWS CDK version 2.

Amazon Q Developer, a generative AI-powered assistant for software development, enhances the efficiency of software development teams. It facilitates the creation of deployment-ready infrastructure as code (IaC) for AWS CloudFormation, AWS CDK, and Terraform. By using Amazon Q, developers can accelerate IaC development, enhance code quality, and decrease the likelihood of configuration errors.

This post demonstrates how Amazon Q Developer helps in upgrading the existing AWS CDK v1 application to AWS CDK v2.

Prerequisites

Planning

In this blog post, I will explore a code example where I have created a VPC, Subnets, and an ECS Fargate cluster using AWS CDK version 1. I will then explain how you can use Amazon Q to transform the code from CDK v1 to CDK v2.

1. In order to initiate this process, I have begun by asking Amazon Q Developer for the necessary steps to migrate from CDK version 1 to version 2, which are outlined below.

Can you provide the steps to migrate from cdk version 1 to version 2?

Amazon Q Developer outlining the comprehensive process to upgrade AWS CDK applications from version 1 to version 2.

2. In the above screenshot Amazon Q Developer outlined several steps we can take to make the necessary changes. The first step is to update the dependencies. If I need guidance on how to update the dependencies, I can ask the Amazon Q Developer again for help by asking the steps regarding updating dependencies as below .

Can you provide the steps to update dependencies?

Amazon Q Developer offering detailed, AI-powered guidance to upgrade project dependencies by analyzing the existing codebase, identifying outdated or deprecated libraries and frameworks, and recommending precise updates to ensure compatibility with newer language versions.

3. After updating the dependencies, the next step is to update the import statements. To get guidance on how to update the import statements, I can ask the Amazon Q Developer assistant again for help by asking the steps regarding how to import statements as shown below.

@workspace Can you provide the steps to update import statements?

Amazon Q Developer advises on updating import statements by analyzing the current code context and guiding developers to replace legacy or outdated import paths with the latest.

In the above screenshot if you have noticed I have added @workspace before the question which automatically includes the most relevant chunks of my workspace code as context.

4. If any errors occur while updating the code as recommended by Amazon Q Developer, I can use Amazon Q Developer to debug the issue and provide the needed inputs to resolve it.

Amazon Q Developer diagnosing issues by analyzing error messages and AWS resource states, providing natural language explanations of root causes such as permission errors and misconfigurations.

5. Once I have finished the required steps, I can deploy the application using version 2 of the AWS CDK by running the cdk deploy command.

Deployment of the updated AWS CDK version 2 application, involving synthesizing CDK stacks to generate CloudFormation templates and deployment artifacts, bootstrapping the AWS environment to provision necessary resources.

6. In addition to its other capabilities, Amazon Q offers code review functionality. To initiate a code review, simply select Amazon Q and use the /review command. I’ll then have the option to review either the active files or the entire open workspace. Select your preference, and Amazon Q will analyze your project and provide comprehensive review results.

Amazon Q Developer performs comprehensive code analysis by reviewing your entire codebase or real-time code as you write, identifying security vulnerabilities, code quality issues, and deployment risks.

7. Amazon Q Developer can also generate documentation, including README files. To create documentation, select Amazon Q and enter the /doc command. Amazon Q will automatically generate a README file for your project. I can then review the generated documentation, accept the changes, or provide specific instructions for further modifications.

Amazon Q Developer automatically generates a comprehensive README file for the entire project by analyzing the codebase, project structure, and dependencies within the selected folder in the IDE.

Conclusion

In this blog, I demonstrated how Amazon Q Developer can simplify and accelerate the upgrade process from AWS CDK version 1 to version 2, ensuring your cloud infrastructure remains secure, efficient, and aligned with the latest AWS innovations. AWS CDK v2 offers a streamlined, consolidated library with improved performance and ongoing support, making infrastructure management easier and more reliable.

By leveraging Amazon Q Developer, a generative AI-powered assistant, teams can automate Infrastructure as Code development, enhance code quality, and minimize configuration errors. Together, these tools empower development teams to confidently modernize and scale their AWS environments, turning the upgrade process into a seamless opportunity for innovation and growth.

Resources

To learn more about Amazon Q Developer, see the following resources:

To learn more about the AWS CDK, see the following resources:

About the authors:

Dr. Rahul Sharad Gaikwad

Dr. Rahul is a Solutions Architect at AWS, driving cloud innovation through migration and modernization of customer workloads. A Generative AI and DevOps enthusiast, he architects cutting-edge solutions and is recognized as an APJC HashiCorp Ambassador. He earned his Ph.D. in AIOps and he is recipient of the Man of Excellence Award , Indian Achievers’ Award , Best PhD Thesis Award, Research Scholar of the Year Award and Young Researcher Award.

Vinodkumar Mandalapu

Vinodkumar is a Devops Consultant at AWS, specializing in designing and implementing cloud-based infrastructure and deployment pipelines on AWS. With extensive experience in automating and streamlining software delivery, he has helped organizations of all sizes leverage the power of the cloud to drive innovation, improve scalability, and enhance operational efficiency. In his leisure time, he enjoys traveling and spending quality time with his son.

Tamilselvan P

Tamilselvan is a Devops Consultant at AWS, focusing on architecting and deploying cloud-native systems and continuous delivery within the ecosystem. Leveraging his comprehensive expertise in orchestrating and refining software release processes, he has assisted customers across various industries and scales in harnessing cloud technology to faster innovation, boost scalability, and elevate operational performance. During his free time, he enjoys playing cricket.