Tag Archives: Developer Tools

Building with AI-DLC using Amazon Q Developer

Post Syndicated from Will Matos original https://aws.amazon.com/blogs/devops/building-with-ai-dlc-using-amazon-q-developer/

The AI-Driven Development Life Cycle (AI-DLC) methodology marks a significant change in software development by strategically assigning routine tasks to AI while maintaining human oversight for critical decisions. Amazon Q Developer, a generative AI coding assistant, supports the entire software development lifecycle and offers the Project Rules feature, allowing users to tailor their development practices within the platform.

Recently, AWS made its AI-DLC workflow open-source, enabling developers to create software using this methodology. This workflow is implemented in Amazon Q Developer through its Project Rules customization feature. In this post, we will demonstrate how the AI-DLC workflow operates in Amazon Q Developer using an example use case.

AI-DLC Workflow Overview

The AI-DLC workflow is the practical implementation of the AI-DLC methodology for executing software development tasks. As outlined in the AI-DLC Method Definition Paper, the workflow has three phases. These phases are Inception, Construction, and Operations. Inception involves planning and architecture. Construction focuses on design and implementation. Operations cover deployment and monitoring. Each phase includes distinct stages. These stages address specific software development life cycle functions. The workflow adapts to project requirements. It analyzes requests, codebases, and complexity. This analysis determines the necessary stages. Simple bug fixes skip planning. They go directly to code generation. Complex features need requirements analysis. They also require architectural design and detailed testing.

The workflow maintains quality and control through structured milestones and transparent decision-making. At each phase, AI-DLC asks clarifying questions, creates execution plans, and waits for approval. Every decision, input, and response is logged in an audit trail for traceability. Whether building a new microservice, refactoring legacy code, or fixing a production bug, AI-DLC scales its rigor to match needs—comprehensive when complex, efficient when simple, and always in control. Figure 1 shows the phases and stages within the adaptive AI-DLC workflow. The stages shown in green boxes are mandatory, while those in yellow boxes are conditional.

AI-DLC workflow diagram showing three phases: Inception Phase (blue) with mandatory steps for Workspace Detection, Requirements Analysis, and Workflow Planning, plus conditional steps for Reverse Engineering, User Stories, Application Design, and Units Generation; Construction Phase (green) with conditional steps for Functional Design, NFR Requirements, NFR Design, and Infrastructure Design, followed by mandatory Code Generation and Build and Test steps that loop for each unit; and Operations Phase (orange) with an Operations step. The workflow flows from User Request at the top to Complete at the bottom.

Figure 1. SDLC phases and stages in AI-DLC workflow

Prerequisites

Before we begin the walk-through, we must have an AWS account or AWS Builder Id for authenticating Amazon Q Developer. If you don’t have one, sign up for AWS account or create an AWS builder id. You can use any of the Integrated Development Environments (IDEs) supported by Amazon Q Developer and install the extension as per the AWS documentation. In this post, we’ll be using the Amazon Q Developer extension in VS Code IDE. Once the plug-in is installed, you’ll need to authenticate Q Developer with the AWS cloud backend. Refer to the AWS documentation for Q Developer authentication instructions.

The AI-DLC workflow generates various Mermaid diagrams in markdown files. To view these diagrams within your IDE, you can install a Mermaid viewer plugin.

Let’s Begin Building!

Let’s construct a simple River Crossing Puzzle as a web UI app using AI-DLC. By choosing a straightforward app, we can concentrate more on learning the AI-DLC workflow and less on the project’s technical intricacies.

The sections below outline the individual steps in the AI-DLC development process using Amazon Q Developer. We’ll showcase screenshots of our IDE with the Amazon Q Developer plug-in and demonstrate how to interact with the workflow.

Although we’ve used the Amazon Q Developer IDE plug-in in this blog post, you can also use Kiro Command Line Interface (CLI) to build with AI-DLC without any additional setup. The workflow remains the same, except that you’ll interact through the command line instead of the graphical interface in the IDE.

As we progress through the workflow, your AI-DLC experience will be tailored to your specific problem statement. You’ll also notice the probabilistic nature of large language models (LLMs), as the questions and artifacts generated by them will vary from one run to another for the same problem statement. For example, if you attempt to replicate the same problem statement we used in this blog post, your experience will likely differ. This is expected and desirable. Despite these minor variations, we’ll eventually find a solution to the problem we initially set out to address.

Step 1: Clone GitHub repo containing the AI-DLC Q Developer Rules

Clone the GitHub repo containing the AI-DLC Q Developer Rules:

git clone https://github.com/awslabs/aidlc-workflows.git

Step 2: Load Q Developer Rules in your project workspace

Follow the README.md instructions in the GitHub repo to copy the rules files over to your project folder.

Step 3: Install and authenticate Amazon Q Developer Extension in IDE

Open the project folder you created in Step 2 in VS Code. Open the Amazon Q Chat Panel in the IDE and ensure that the AI-DLC workflow rules are loaded in Q Developer, as shown in Figure 2. If you don’t see what’s shown in Figure 2, please double-check the steps you performed in Step 2.

Screenshot showing four steps to access AI-DLC rules in Amazon Q: Step 1 shows opening Amazon Q Chat Panel from the left sidebar; Step 2 shows opening a chat session in Amazon Q at the top; Step 3 shows clicking on the Rules button in the chat interface; Step 4 shows ensuring AI-DLC rules are loaded in the rules panel on the right side of the screen.

Figure 2: AI-DLC rules enabling in Amazon Q Developer

Step 4: Start the AI-DLC workflow by entering a high-level problem statement

Our development environment is now set up, and we’re ready to begin application development using AI-DLC. In our Q Developer chat session, we enter the following problem statement:

Using AI-DLC let's build a web application to solve the river crossing puzzle.

Notice that we’ve prefixed our problem statement with “Using AI-DLC …” to ensure that Q Developer engages the AI-DLC workflow. Figure 3 shows what happens next. The AI-DLC workflow is triggered within Q Developer. It greets us with a welcome message and provides a brief overview of the AI-DLC methodology.

Figure 3 shows an expanded view of the AI-DLC workflow rules folder structure on the left. You’ll notice that a single aws-aidlc-rules/core-workflow.md is placed in the designated .amazonq/rules folder, while the rest of the rules files are placed in an ordinary aws-aidlc-rule-details folder. This arrangement is designed to optimize model efficiency. By placing the aws-aidlc-rules/core-workflow.md file in the .amazonq/rules folder, , it serves as additional context, ensuring that the core workflow structure is always accessible without incurring additional token consumption. Conversely, the detailed phase and stage-level behavior rules are stored in the aws-aidlc-rule-details folder and are dynamically loaded as required. This approach conserves Amazon Q’s context window and token usage by retaining only the necessary information within the context at any given time, thereby enhancing model efficiency.

The rules files under the aws-aidlc-rule-details folder are organized into three sub-folders, each representing a phase of AI-DLC. Within each phase, there are stage-specific files. A common folder houses cross-cutting rules applicable to all AI-DLC phases and stages such as the “human-in-the-loop”.

The AI-DLC workflow is self-guided and provides us with a clear understanding of what to expect next. It informs us that it will enter the AI-DLC Inception phase next, starting with the Workspace Detection stage within it.

Screenshot of Amazon Q interface showing AI-DLC workflow initialization. The left sidebar displays a file tree with various workflow stages and configuration files. The main chat area shows a problem statement input box at the top with placeholder text 'Using AI-DLC let's build a web application to solve the most pressing problem.' Below is a welcome message explaining AI-DLC (AI-Driven Development Life Cycle) and its capabilities. Three callout annotations highlight: 1) The core workflow dynamically utilizes detailed instructions for different phases and stages, loading and unloading them as required; 2) AI-DLC workflow kicks off with a welcome message and precise overview; 3) The workflow is structured to load a single Q Developer Rule file, one workflow.md, which then dynamically loads and unloads the stage definitions housed in the 'aws-aidlc-rule-details' folder as needed.

Figure 3: User enters high level problem statement in Amazon Q. AI-DLC workflow is triggered.

Step 5: Workspace Detection

We enter the Workspace Detection stage within the Inception phase. In this stage, AI-DLC analyzes the current workspace and determines whether it’s a greenfield (new) or brownfield (existing) application. Since AI-DLC is an adaptive workflow, it decides whether the next stage will be Reverse Engineering (for brownfield projects) or Requirements Analysis (for greenfield projects).

Since we’re building a greenfield application and there’s no existing code in the workspace to reverse engineer, the workflow will guide us to Requirements Analysis next. If we were working on a brownfield application, the workflow would have performed Reverse Engineering first and then moved on to Requirements Analysis. This demonstrates the adaptive nature of the workflow.

Figure 4 illustrates the process in our IDE when we enter this stage. The workflow requests our permission to create an aidlc-docs folder under the project root. This folder will serve as the repository for all the artifacts generated by AI-DLC during the workflow execution. Subsequently, the workflow generates two files within this folder: aidlc-state.md and audit.md. The purpose of these files is explained in Figure 4.

Screenshot of AI-DLC workspace detection phase showing the Amazon Q chat interface. The left sidebar displays the file tree with an 'aidlc-doc' folder highlighted. The main chat area shows the Inception Phase - Workspace Detection stage with explanatory text about analyzing the workspace. Five callout annotations explain: 1) Workflow creates aidlc-doc directory for storing AI-DLC generated artifacts; 2) The workflow tracks its progress in aidlc-metadata.json for error recovery and session continuity; 3) The audit.md file stores user's prompts; 4) Workflow highlights the AI-DLC phase and stage name with a clear heading for easy tracking; 5) Workflow loads detailed stage-level behavior files dynamically such that they don't consume the context window statically. At the bottom, a user approval prompt shows 'mkdir -p /Users/[...]/NewConsumerPortal/aidlc-docs' with the user asked to approve the 'mkdir' command.

Figure 4: Workspace Detection

The Workspace Detection will quickly finish as this is a greenfield project. The workflow will guide us into Requirements Analysis stage within the Inception phase next.

Step 6: Requirements Analysis

The workflow has progressed to the Requirements Analysis stage, where we will define the application requirements. The AI-DLC workflow presented our high-level problem statement to the Q Developer, which then responded with several requirements clarifications questions, as illustrated in Figure 4.

Several AI-DLC rules came into play at this stage. One rule instructed Amazon Q to avoid making assumptions on the user’s behalf and instead ask clarifying questions. Since LLMs tend to make assumptions and rush towards outcomes, they must be explicitly instructed to align with the engineering rigor of the AI-DLC methodology. To achieve this, the Q Developer presented several requirements clarification questions in requirement-verification-questions.md file and asked us to answer them inline in the file.

Another AI-DLC rule instructed the Q Developer to present questions in multiple-choice format and always include an open-ended option (“Other”) to enhance user convenience and provide flexibility in answering.

As shown in Figure 5, Amazon Q has asked us about the desired puzzle variant, such as the Classic Farmer, Fox, Chicken, and Grain puzzle or other popular variations. Additionally, it has asked us questions about user interaction methods, score persistence across multiple players, and the creation of a leaderboard.

These questions are essential for achieving our desired application outcome. Our responses to these questions will determine the final product. While we didn’t explicitly specify this level of detail in our high-level problem statement, AI-DLC has delegated detailed requirements elaboration to Amazon Q, but we still retain control over what gets built.

Screenshot of AI-DLC Requirements Analysis phase showing a split view. The left side displays a requirements clarification questions markdown file with multiple-choice questions about the Kuer Crossing Portal, including sections about user crossing portal variants, primary user interaction methods, and data storage preferences. The right side shows the Amazon Q chat interface with the Inception Phase - Requirements Analysis heading. Two callout annotations highlight: 1) AI-DLC asks questions in multiple choice format, with an 'Other' option that leaves an open-ended fill-in-the-blank when the answer doesn't match the predefined options; 2) AI-DLC generates config.requirements-clarification-questions.md file containing requirements clarification questions, with questions placed in an MD file where the user can respond inline in the file, using 'Answered' to indicate completion. The chat shows instructions for answering questions to clarify requirements.

Figure 5: Requirements Analysis

We answer all the questions in requirement-verification-questions.md file and enter “Done” in the chat window.

Amazon Q processes our responses. The AI-DLC workflow is designed to identify human errors. It checks if we’ve answered all the questions and identifies any contradictions or ambiguities in our answers. Any confusions, contradictions, or ambiguities will be flagged for follow-up questions. AI-DLC adheres to high standards and ensures that we don’t proceed to the next step until we’re fully in agreement on the requirements between us and Amazon Q.

Since we answered all the questions and there were no contradictions in our answers, the workflow continues and generates a comprehensive requirements.md document, as shown in figure 6.

Screenshot showing AI-DLC requirements review phase with split view. The left side displays a Requirements Document for the River Crossing Puzzle Web Application, including Intent Analysis Summary, User Request details, Request Type, and Functional Requirements with Core Puzzle Functionality items (FR-001 through FR -006) describing game features like classic farmer puzzle, timer display, move tracking, puzzle state validation, and victory messages. The right side shows the Amazon Q chat interface with 'Requirements Analysis Complete' heading, displaying project details including Puzzle Type (Classic Farmer, Fox, Chicken, and Grain river crossing puzzle), Technology (React-based modern web application), and Target Devices (web browsers only). Three callout annotations highlight: 1) Requirements Analysis phase complete; 2) Requirements document generated; 3) User may Request Changes, Add User Stories for Approval, or Approve & Continue, with a REVIEW SAFETY note warning users to review requirements and approve to continue, with options to request changes or add modifications if required.

Figure 6: Requirements Review

The workflow prompts us to review the requirements.md document and decide on the next step. If we’re not aligned on the requirements, we can prompt Amazon Q to help us achieve alignment. We can then iterate on the requirements until we’re fully aligned. Once we’re fully aligned, we prompt AI-DLC to progress to the next stage.

Given the adaptive nature of the AI-DLC workflow, Amazon Q has recommended that this application is simple enough, and we can skip the User Stories stage. If we felt otherwise, we would have overridden the model’s recommendation. In this case, we agree with Q’s recommendation and will therefore enter “Continue” in the chat window.

The workflow will enter Workflow Planning stage next.

Step 7: Workflow Planning

With our requirements established, we proceed to the Workflow Planning stage. In this phase, we leverage the requirements context and the workflow’s intelligence to plan the execution of specific stages of AI-DLC within the workflow to build our application as per the requirements specification.

Figure 7 illustrates the workflow planning stage in Q Developer. The workflow has generated an execution-plan.md file that outlines the recommended stages for execution and those that should be skipped.

The workflow planning process is highly contextual to the requirements. During requirements analysis, we decided to develop a simple river crossing puzzle application, consisting of a single HTML file, without a backend, leaderboard, or persistence. Consequently, Amazon Q recommends that we skip all the conditional stages, such as User Stories, Application Design, Units of Work Planning, and so on, and proceed directly to the Code Generation Planning stage in the Construction phase.

Figure 7 visually represents the recommended workflow graphically, indicating the stages that will be executed and those that will be skipped.

Screenshot of AI-DLC Workflow Planning phase showing an Execution Plan document on the left with Detailed Analysis Summary including user-facing changes, brownfield changes, API changes, and NFR changes, plus Risk Assessment. Below is a Workflow Visualization flowchart diagram showing the workflow stages from Inception through Construction to Operations phases. The right side shows the Amazon Q chat with 'Workflow Planning Complete' heading. Three callout annotations highlight: 1) AI -DLC workflow has analyzed requirements and based on the problem complexity has proposed a set of stages to execute in the workflow; 2) Problem is simple enough that AI-DLC is proposing to skip the detailed optional stages; 3) User may Request Changes, Add back skipped stages or Approve & Continue.

Figure 7: Workflow Planning

Since we’ve opted for a straightforward web UI app in this blog post for brevity, the workflow execution plan suggested by AI-DLC aligns seamlessly with our objectives. Should we not be aligned with the AI-DLC recommended workflow execution plan, we would request Q Developer to modify the plan to suit our preferences.

Since we’ve agreed on the workflow plan, we’ll enter “Continue” in Q’s chat session. If we weren’t aligned with the recommended workflow execution plan, we’d have prompted Q with our concerns and iterated over the revised execution plan until it aligned with our preferences. Following the recommended execution plan, the workflow will transition into the Construction phase and directly into the Code Generation Plan stage in the phase.

Step 8: Code Generation Planning

AI-DLC prioritizes planning over rushing to outcomes. This approach aligns with the concept of human-in-the-loop behavior, allowing us to detect issues early on, provide feedback on the plan, and prevent wrong assumptions from propagating further. Before we proceed with actual Code Generation, we undergo Code Generation Planning.

During Code Generation Planning, AI-DLC creates a detailed, numbered plan. It analyzes the requirements and design artifacts, breaking down the process into explicit steps for generating business logic, the API layer, the data layer, tests, documentation, and deployment files.

The plan is documented in a {unit-name}-code-generation-plan.md file, complete with check boxes. This ensures transparency, allowing users to see what will be built. It also provides control, enabling users to modify the plan. Additionally, it maintains quality by ensuring comprehensive coverage of code, tests, and documentation.

Figure 8 illustrates the AI-DLC’s code generation plan. The proposed workflow comprises eight steps, starting with creating an HTML structure and progressing to adding styling, game logic, and concluding with testing and documentation.

Screenshot of AI-DLC Code Generation Planning showing a Code Generation Plan document for River Crossing Puzzle on the left, with Unit Context listing HTML Structure, CSS Styling, and JavaScript files, followed by Unit Generation Steps including Step 1: HTML Structure Generation, Step 2: CSS Styling Generation, and Step 3 : Core Game Logic Generation with detailed checkboxes for each step. The right side shows Amazon Q chat with code generation plan details. Three callout annotations highlight: 1) The plan doc contains to-do items for AI-DLC to execute. These checkboxes get completed when the task is done; 2) This is how AI-DLC workflow persists and tracks progress state; 3) AI-DLC has proposed an 8-step code generation plan with checkboxes and review prompts, and User may Request Changes or Approve & Continue.

Figure 8: Code Generation Planning

The code generation plan appears reasonable to us. We will proceed to the Code Generation stage by entering “Continue” in Q’s chat session.

Step 9: Code Generation

The Code Generation stage executes the Code Generation Plan we approved in the previous step. It generates actual code artifacts step-by-step, including business logic, APIs, data layers, tests, and documentation. Completed steps are marked with check boxes, progress is tracked, and story traceability is ensured before presenting the generated code for user approval.

Figure 9 illustrates that the Code Generation stage has been completed. We are now reviewing a single index.html file generated with embedded styling and JavaScript consistent with our preference specified in requirements.md.

The workflow provides a summary of the activities performed during the Code Generation phase.

Screenshot of AI-DLC Code Generation phase showing generated HTML code on the left with embedded styling and JavaScript for the River Crossing Puzzle application. The right side shows Amazon Q chat with 'Code Generation Complete - river-crossing-puzzle' heading and a list of generated artifacts including HTML file, CSS interface, drag-and-drop interface, game logic, and testing services. Two callout annotations highlight: 1) The generated code is an HTML file with embedded styling and JavaScript; 2) We have specified during requirements analysis phase that we want a single-file index.html file implementation; 3) Code generation has been completed, and a summary of the generated artifacts is provided.

Figure 9: Code Generation

We’re about to test our newly created application soon. While it may be straightforward to test this simple puzzle app right now, for complex applications, we generate build and test instructions using AI-DLC.

We’ll enter “Continue” in the workflow and enter the final Build and Test stage in the Construction phase.

Step 10: Build and Test

These questions are essential for achieving our desired application.
We’ve reached the final stage of the AI-DLC Construction Phase, known as the Build and Test stage. During this stage, we create comprehensive instruction files that guide the build and packaging of the project, and document the necessary testing layers. These layers include unit tests (validating generated code), integration tests (checking unit interactions), performance tests (load/stress testing), and additional tests as required (security, contract, e2e).

The generated build instructions include dependencies and commands, test execution steps with expected results, and a summary document that provides an overview of the overall build/test status and the project’s readiness for deployment.

Figure 10 illustrates the documentation generated during this stage.

Screenshot of AI-DLC Build and Test phase showing a Build and Test Summary document on the left with Build Status (Build Tool, Build Status, Build Artifacts, Build Warnings) and Test Execution Summary including Unit Tests, Integration Tests, and Performance Tests sections with checkmarks and failure indicators. The right side shows Amazon Q chat with build and test completion status and project summary. Two callout annotations highlight: 1) Build and Test Complete! Build and Test instructions have been documented; 2) The AI-DLC workflow has concluded with a comprehensive summary of all completed stages and generated artifacts.

Figure 10: Build and Test

The AI-DLC workflow has now concluded.

Let’s Solve the Puzzle!

We open index.html in a web browser to access our newly created River Crossing Puzzle application. As shown in figure 11, we see our graphical web UI.

During requirements assessment, we chose a straightforward user interface using HTML, CSS, and JavaScript (without any frameworks), as evident in the display shown in Figure 11. Your display may vary due to the probabilistic nature of LLMs and the choices you made for requirements.

We attempt to solve the puzzle and find that it works as expected.

Side-by-side screenshots of the River Crossing Puzzle web application showing two game states. The left screenshot shows the initial state with a farmer on the left bank, and fox, chicken, and grain items listed below, with a blue river in the center and right bank on the right. The right screenshot shows a game state after moves with the farmer on the right bank and a success message 'Congratulations! You won in 7 moves!' displayed at the bottom. Both screens have a yellow 'Start Over' button and show move counts.

Figure 11: River Crossing Puzzle Web App

Conclusion

This post shows how AWS’s open-source AI-DLC workflow, guided by Amazon Q Developer’s Project Rules feature, helps developers build applications with structured oversight and transparency.

Using a River Crossing Puzzle web application as an example, the walk-through illustrates how AI-DLC methodology adapts its rigor based on project complexity, skipping unnecessary stages for simple applications while maintaining comprehensive processes for complex projects. Throughout each stage, AI-DLC enforces “human-in-the-loop” behavior, requiring user approval at critical checkpoints, asking clarifying questions, and maintaining complete audit trails for traceability.

The exercise successfully demonstrates how AI-DLC balances AI automation with human oversight, enhancing productivity without sacrificing quality or control. By following this structured, repeatable methodology, development teams can leverage generative AI’s capabilities while ensuring humans remain in charge of architectural decisions and implementation approaches. This framework provides the necessary guardrails for responsible and effective AI-assisted software development across projects of varying complexity.

Cleanup

We did not create any AWS resources in this walk-through, so no AWS cleanup is needed. You may cleanup your project workspace at your discretion.

Ready to get started? Visit our GitHub repository to download the AI-DLC workflow and join the AI-Native Builders Community to contribute to the future of software development.

About the authors:

Raja SP

Raja is a Principal Solutions Architect at AWS, where he leads Developer Transformation Programs. He has worked with more than 100 large customers, helping them design and deliver mission critical systems built on modern architectures, platform engineering practices, and Amazon inspired operating models. As generative AI reshapes the software development landscape, Raja and his team created the AI Driven Development Lifecycle (AI-DLC) — an end to end, AI native methodology that re-imagines how large teams collaboratively build production-grade software in the AI era.

Raj Jain

Raj is a Senior Solutions Architect, Developer Specialist at AWS. Prior to this role, Raj worked as a Senior Software Development Engineer at Amazon, where he helped build the security infrastructure underlying the Amazon platform. Raj is a published author in the Bell Labs Technical Journal, and has also authored IETF standards, AWS Security blogs, and holds twelve patents

Siddhesh Jog

Siddhesh is a Senior Solutions Architect at AWS. He has worked in multiple industries in a wide variety of roles and is passionate about all things technology. At AWS Siddhesh is most excited to help customers transition to the AI Driven Development Lifecycle and enable them to build applications rapidly in a secure, complaint and cost efficient cloud environment.

Will Matos

Will Matos is a Principal Specialist Solutions Architect with AWS’s Next Generation Developer Experience (NGDE) team, revolutionizing developer productivity through Generative AI, AI-powered chat interfaces, and code generation. With 27 years of technology, AI, and software development experience, he collaborates with product teams and customers to create intelligent solutions that streamline workflows and accelerate software development cycles. A thought leader engaging early adopters, Will bridges innovation and real-world needs .

Open-Sourcing Adaptive Workflows for AI-Driven Development Life Cycle (AI-DLC)

Post Syndicated from Will Matos original https://aws.amazon.com/blogs/devops/open-sourcing-adaptive-workflows-for-ai-driven-development-life-cycle-ai-dlc/

AI-Driven Development Life Cycle (AI-DLC) holds the promise of unlocking the full potential of AI in software development. By emphasizing AI-led workflows and human-centric decision-making, AI-DLC can deliver velocity and quality. However, realizing these gains hinges on how organizations effectively integrate AI into their engineering workflows.

Through our work with engineering teams across industries, we have identified three recurring challenges. These challenges consistently limit the effectiveness of AI in accelerating modern software development. The first challenge is one-size-fits-all workflows. These workflows force every project through the same rigid sequence of steps. The second challenge is the lack of flexible depth in workflow stages. This leads to over-engineering or insufficient rigor. The third challenge is tools that over-automate. These tools unintentionally divert humans away from critical validation and oversight responsibilities.

Achieving true, sustainable productivity requires the process and AI coding agents to become adaptive to context, flexible in depth, and collaborative by design. In this blog, we’ll show you how AI-DLC’s core principles address these three challenges, transforming them from productivity blockers into opportunities for adaptive, human-centered development. We’ll describe how AI-DLC enables workflows that adapt to the problem at hand by intelligently selecting stages, modulating depth, and embedding human oversight at every critical decision point.

We will also introduce our open-source Amazon Q Developer/Kiro Rules implementation, which brings AI-DLC principles to life through adaptive workflow scaffolds. This allows you to start applying these principles in your own projects and experience AI-native development that accelerates delivery without compromising engineering discipline or human judgment.

How does AI-DLC address these challenges?

Let’s explore how AI-DLC addresses these challenges.

1. The “One-Size-Fits-All” Workflow Problem

Software development has never been a linear process. In practice, different projects follow distinct pathways with their own checkpoints and deliverables. Consider these examples:

  • A simple defect fix doesn’t require elaborate requirements analysis and planning
  • A pure infrastructure porting project doesn’t warrant application design with domain modeling
  • A new feature or service addition demands different steps than applying a security patch

Yet, many modern Agentic coding tools provide hard-wired, opinionated workflows that ignore this diversity. Regardless of intent or scope, every project is forced through the same rigid sequence of steps—even when some add little or no value. This rigidity introduces friction, wastes time, and reduces productivity. The result: artificial ceremonies, unnecessary artifacts, redundant approvals, and process overhead that impede velocity.

How AI-DLC addresses this challenge:
AI-DLC addresses this challenge through the Principle 10 (No Hard-Wired, Opinionated SDLC Workflows) as defined in the AI-DLC Method Definition Paper.

AI-DLC avoids prescribing opinionated workflows for different development pathways (such as new system development, refactoring, defect fixes, or microservice scaling). Instead, it adopts a truly AI-First approach where AI recommends the Level 1 Plan based on the given pathway intention.

2. Lack of Flexible Depth Within Each Stage

True adaptivity must go beyond the breadth of a workflow and extend into its depth and intensity. This is how human experts intuitively plan software projects today.

Even when workflows are flexible, many tools fail to modulate the depth of engagement at each stage. For example, building a lightweight utility function doesn’t require full-scale Domain-Driven Design or detailed architectural modeling. When an AI coding agent compels teams to follow these steps regardless of need, the consequence is wasted effort and an over-engineered product. Developers spend cycles reviewing artifacts as the tools dictate rather than delivering business value.

How AI-DLC addresses this challenge:
Through the same principle 10, AI-DLC adapts both the breadth (choice of stages) and the depth of each stage to match the complexity of the intent and context. For example, the complexity of the requirements determines whether a conceptual design is sufficient or whether a full architectural deep dive is required in the Design stage.

Humans validate and adjust this AI-proposed breadth and depth, ensuring that each stage’s rigor matches the scope of the challenge. This elasticity—balancing breadth and depth—is essential for sustaining true velocity without sacrificing engineering discipline.

3. Tools that Reduce the Emphasis on Human Oversight

As AI tools automate more of the Software Development Life Cycle (SDLC), a new risk has emerged: process atrophy. Developers, excited by automation, often drift into passive execution—allowing AI to “decide everything.” The result is a loss of reflection, weakened oversight, and erosion of shared understanding. AI tools must not only automate work but also amplify the significance of human judgment. They should remind practitioners that “human in the loop” is not a checkbox—it is the cornerstone of trust, accountability, and correctness in AI-native development. Equally critical are the rituals and rhythms that sustain collaborative engineering.

How AI-DLC addresses this challenge:
AI-DLC addresses this challenge by requiring a collaborative human-in-the-loop cycle at every stage of the workflow. In this loop, AI generates a plan to execute a task, and relevant stakeholders assemble, review, and validate it.

These rituals, defined as Mob Elaboration and Mob Construction in AI-DLC, ensure that AI’s suggestions are not blindly accepted. Approved plans are executed, and stakeholders again review and validate the final artifacts. The AI-DLC workflow records every human action and approval, embedding reflection to ensure that humans remain the compass, guiding AI’s acceleration.

Circular workflow diagram showing AI-DLC collaboration cycle. Starting at top: Humans Provide Task (orange person icon) , arrow to AI Creates Plan and Seeks Clarification (blue brain icon), arrow to Humans Provide Clarification (orange person icon), arrow to AI Refines Plan (blue brain icon), arrow to Humans Approve Plan (orange person icon), arrow to AI Executes Plan (blue brain icon), arrow to Humans Verify Outcome (orange person icon), completing the cycle back to the start. The diagram illustrates iterative human-AI collaboration with humans making decisions and AI performing execution tasks.

Figure 1: AI-DLC workflow: Humans decide and validate, AI plans and executes.

Effective tooling must therefore emphasize:

  • Promoting for stakeholder collaboration: The system should explicitly call for collaborative rituals involving stakeholders
  • Auditability: Every AI-generated plan and artifact should surface rationale and invite review, recording every human oversight and interaction
  • Flow awareness: Tools should detect when automation races ahead of human validation and deliberately slow down to emphasize critical checkpoints

The goal is not to suppress automation but to embed critical human ownership.

From Principles to Practice

The ideas we outlined — adaptive workflows, flexible depth, and embedded human oversight — are compelling in theory and validated by all engineering teams we’ve engaged. The critical question is: How do we operationalize these ideas into practice without reintroducing the rigidity we seek to eliminate?

One approach is manual prompt engineering: crafting structured prompts that guide AI assistants through the AI-DLC workflow step by step. Each prompt encodes the role AI should assume, the task at hand, the governance requirements, and the audit trail expectations. This structured approach transforms a simple AI interaction into a disciplined workflow that embodies AI-DLC principles.

This approach, while promising, faces its own limitations. Crafting intricate prompts demands discipline and expertise, posing barriers to widespread adoption. Moreover, humans become responsible for maintaining workflow adaptability, selecting the appropriate prompt at the right moment, and ensuring collaborative checkpoints are honored. This places the burden of orchestration back on practitioners, diverging from our core principle of truly AI-native development, where AI itself drives adaptive decision-making.

The question arises: How can we embed AI-DLC principles directly into the execution layer, making adaptivity and collaboration inherent properties of the system rather than manual responsibilities?

Steering for Productivity

The answer lies in workflow scaffolds. These are Rules or Steering customizations for AI Coding Agents. They operationalize AI-DLC principles within the tools. This is done while maintaining transparency, audibility, and modifiability. Our implementation uses Rules/Steering Files. These serve as the foundation of this execution layer. It transforms AI from a passive assistant into an adaptive decision engine.

Rather than requiring developers to craft elaborate prompts, AI-Driven development begins with a simple statement of intent. From there, the workflow scaffolds evaluate context, assess complexity, and dynamically construct an appropriate development pathway. The core workflow definition, including a library of stages and decision heuristics for when and how to apply them, empowers AI to continuously tailor the development process to the nature of the work at hand.

Each AI-DLC phase (Inception, Construction, Operations) evaluates the depth at which it should execute, resulting in a process that adapts to the problem rather than forcing the problem to adapt to the process. This approach yields several critical outcomes:

  1. Adaptive decisioning: The workflow conforms to the problem’s shape, intelligently skipping or deepening stages based on contextual assessment rather than predetermined rules.
  2. Transparent checkpoints: Human approvals are embedded at every decision gate, preserving oversight while maintaining velocity. The system doesn’t just automate; it orchestrates collaboration.
  3. End-to-end traceability: Every artifact, decision, and conversation is logged, creating a continuous, inspectable trail of reasoning that supports both accountability and continuous improvement.

The result is a process that is context-aware, scalable, and self-correcting – capable of supporting everything from a single-line defect fix to a comprehensive system modernization, all while maintaining the rigor and human judgment that define engineering excellence.

Build, Test, and Evolve with Us

We’re open-sourcing the AI-DLC workflow, implemented as Amazon Q Rules and Kiro Steering Files, so organizations everywhere can experience AI-DLC in practice and build production-grade systems. We invite developers, architects, and engineering leaders to:

  1. Apply the steering rules in real-world projects, whether brownfield or greenfield. Refer to our companion AI-DLC workflow walkthrough blog for step-by-step instructions on how to build using AI-DLC in Amazon Q Developer.
  2. Observe how the process adapts to your project’s size, scope, and intent.
  3. Share your experience through our GitHub repository, where you can open issues, propose improvements, and contribute ideas.

Your feedback will help evolve this into a foundation for AI-native software development – one that accelerates delivery without sacrificing rigor or human judgment. Together, we can redefine what software engineering looks like in the age of AI: not scripted but steered.

Conclusion

AI-DLC addresses multiple challenges limiting AI’s effectiveness in software development such as rigid workflows, inflexible workflow depth, and tools that reduce human oversight. AI-DLC enables adaptive workflows that intelligently select stages, modulate depth, and embed human oversight at critical decision points. This approach, implemented through open-source tools like Amazon Q Developer Rules and Kiro Steering, accelerates delivery while maintaining engineering discipline and human judgment.

AI-DLC emphasizes human oversight and collaboration in AI-driven software development. Workflow scaffolds, embed AI-DLC principles into the execution layer, enabling adaptive decision-making, transparent checkpoints, and end-to-end traceability. Open-sourcing the AI-DLC workflow allows organizations to experience AI-DLC in practice and contribute to its evolution.

Ready to get started? Visit our GitHub repository to download the AI-DLC workflow and join the AI-Native Builders Community to contribute to the future of software development.

 

About the authors:

Raja SP

Raja is a Principal Solutions Architect at AWS, where he leads Developer Transformation Programs. He has worked with more than 100 large customers, helping them design and deliver mission critical systems built on modern architectures, platform engineering practices, and Amazon inspired operating models. As generative AI reshapes the software development landscape, Raja and his team created the AI Driven Development Lifecycle (AI-DLC) — an end to end, AI native methodology that re-imagines how large teams collaboratively build production-grade software in the AI era.

Raj Jain

Raj is a Senior Solutions Architect, Developer Specialist at AWS. Prior to this role, Raj worked as a Senior Software Development Engineer at Amazon, where he helped build the security infrastructure underlying the Amazon platform. Raj is a published author in the Bell Labs Technical Journal, and has also authored IETF standards, AWS Security blogs, and holds twelve patents

Siddhesh Jog

Siddhesh is a Senior Solutions Architect at AWS. He has worked in multiple industries in a wide variety of roles and is passionate about all things technology. At AWS Siddhesh is most excited to help customers transition to the AI Driven Development Lifecycle and enable them to build applications rapidly in a secure, complaint and cost efficient cloud environment.

Will Matos

Will Matos is a Principal Specialist Solutions Architect with AWS’s Next Generation Developer Experience (NGDE) team, revolutionizing developer productivity through Generative AI, AI-powered chat interfaces, and code generation. With 27 years of technology, AI, and software development experience, he collaborates with product teams and customers to create intelligent solutions that streamline workflows and accelerate software development cycles. A thought leader engaging early adopters, Will bridges innovation and real-world needs.

Introducing the AWS Infrastructure as Code MCP Server: AI-Powered CDK and CloudFormation Assistance

Post Syndicated from Idriss Laouali Abdou original https://aws.amazon.com/blogs/devops/introducing-the-aws-infrastructure-as-code-mcp-server-ai-powered-cdk-and-cloudformation-assistance/

Streamline your AWS infrastructure development with AI-powered documentation search, validation, and troubleshooting

Introduction

Today, we’re excited to introduce the AWS Infrastructure-as-Code (IaC) MCP Server, a new tool that bridges the gap between AI assistants and your AWS infrastructure development workflow. Built on the Model Context Protocol (MCP), this server enables AI assistants like Kiro CLI, Claude or Cursor to help you search AWS CloudFormation and Cloud Development Kit (CDK) documentation, validate templates, troubleshoot deployments, and follow best practices – all while maintaining the security of local execution.

Whether you’re writing AWS CloudFormation templates or AWS Cloud Development Kit (CDK) code, the IaC MCP Server acts as an intelligent companion that understands your infrastructure needs and provides contextual assistance throughout your development lifecycle.

The Model Context Protocol (MCP) is an open standard that enables AI assistants to securely connect to external data sources and tools. Think of it as a universal adapter that lets AI models interact with your development tools while keeping sensitive operations local and under your control.

The IaC MCP Server provides nine specialized tools organized into two categories:

Remote Documentation Search Tools

These tools connect to the AWS Knowledge MCP backend to retrieve relevant, up-to-date information:

  1.  search_cdk_documentation
    Search the AWS CDK knowledge base for APIs, concepts, and implementation guidance.
  2. search_cdk_samples_and_constructs
    Discover pre-built AWS CDK constructs and patterns from the AWS Construct Library.
  3. search_cloudformation_documentation
    Query CloudFormation documentation for resource types, properties, and intrinsic functions.
  4. read_cdk_documentation_page
    Retrieve and read full documentation pages returned from searches or provided URLs.

Local Validation and Troubleshooting Tools

These tools run entirely on your machine

  1. cdk_best_practices
    Access a curated collection of AWS CDK best practices and design principles.
  2. validate_cloudformation_template
    Perform syntax and schema validation using cfn-lint to catch errors before deployment.
  3. check_cloudformation_template_compliance
    Run security and compliance checks against your templates using AWS Guard rules and cfn-guard.
  4. troubleshoot_cloudformation_deployment
    Analyze CloudFormation stack deployment failures with integrated CloudTrail event analysis. This tool will use your AWS credentials to analyze your stack status.
  5. get_cloudformation_pre_deploy_validation_instructions
    Returns instructions for CloudFormation’s pre-deployment validation feature, which validates templates during change set creation.

Key Use Cases

  1. Intelligent Documentation Assistant

Instead of manually searching through documentation, ask your AI assistant natural language questions:

“How do I create an S3 bucket with encryption enabled in CDK?”

The server searches CDK best practic and samples, returning relevant code examples and explanations.

     2. Proactive Template Validation

Before deploying infrastructure changes:

User: “Validate my CloudFormation template and check for security issues”

AI Agent: [Uses validate_cloudformation_template and check_cloudformation_template_compliance]

“Found 2 issues: Missing encryption on EBS volumes,

and S3 bucket lacks public access block configuration”

 3. Rapid Deployment Troubleshooting

When a stack deployment fails:

User: “My stack ‘stack_03’ in us-east-1 failed to deploy. What happened?”

AI Agent: [Uses troubleshoot_stack_deployment with CloudTrail integration]

“The deployment failed due to insufficient IAM permissions.

CloudTrail shows AccessDenied for ec2:CreateVpc.

You need to add VPC permissions to your deployment role.”

     4. Learning and Exploration

New to AWS CDK? The server helps you discover constructs and patterns:

User: “Show me how to build a serverless API”

AI Agent: [Searches CDK constructs and samples]

“Here are three approaches using API Gateway + Lambda…”

Architecture and Security

Security Design

Local Execution: The MCP server runs entirely on your local machine using uv (the fast Python package manager). No code or templates are sent to external services except for documentation searches.

AWS Credentials: The server uses your existing AWS credentials (from ~/.aws/credentials, environment variables, or IAM roles) to access CloudFormation and CloudTrail APIs. This follows the same security model as the AWS CLI.

stdio Communication: The server communicates with AI assistants over standard input/output (stdio), with no network ports opened.

Minimal Permissions: For full functionality, the server requires read-only access to CloudFormation stacks and CloudTrail events—no write permissions needed for validation and troubleshooting workflows.

Getting Started

Prerequisites

  • Python 3.10 or later
    uv package manager
    AWS credentials configured locally
    MCP-compatible AI client (e.g., Kiro CLI, Claude Desktop)

Configuration

Configure the MCP server in your MCP client configuration. For this blog we will focus on Kiro CLI. Edit .kiro/settings/mcp.json):

{
  "mcpServers": {
    "awslabs.aws-iac-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.aws-iac-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "your-named-profile",
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}

Security Considerations

Privacy Notice: This MCP server executes AWS API calls using your credentials and shares the response data with your third-party AI model provider (e.g., Amazon Q, Claude Desktop, Cursor, VS Code). Users are responsible for understanding your AI provider’s data handling practices and ensuring compliance with your organization’s security and privacy requirements when using this tool with AWS resources.

IAM Permissions

The MCP server requires the following AWS permissions:

For Template Validation and Compliance:

  • No AWS permissions required (local validation only)

For Deployment Troubleshooting:

  • cloudformation:DescribeStacks
  • cloudformation:DescribeStackEvents
  • cloudformation:DescribeStackResources
  • cloudtrail:LookupEvents (for CloudTrail deep links)

Example IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:DescribeStacks",
        "cloudformation:DescribeStackEvents",
        "cloudformation:DescribeStackResources",
        "cloudtrail:LookupEvents"
      ],
      "Resource": "*"
    }
  ]
}

Example Use Case With Kiro CLI

IMPORTANT: Ensure you have satisfied all prerequisites before attempting these commands.

1. With the mcp.json file correctly set, try to run a sample prompt. In your terminal, run kiro-cli chat to start using Kiro-cli in the CLI.

Figure 1: Kiro-CLI with AWS IaC MCP server

Figure 1: Kiro-CLI with AWS IaC MCP server

Scenarios:

  • “What are the CDK best practices for Lambda functions?”

Figure 2 Search the CDK best practices for Lambda functions

Figure 2: Search the CDK best practices for Lambda functions

  • “Search for CDK samples that use DynamoDB with Lambda”

Figure 3: Search for CDK samples that use DynamoDB with Lambda

Figure 3: Search for CDK samples that use DynamoDB with Lambda

  • “Validate my CloudFormation template at ./template.yaml”

Figure 4: Validate my CloudFormation template with AWS IaC MCP Server

Figure 4: Validate my CloudFormation template with AWS IaC MCP Server

  • “Check if my template complies with security best practices”

Figure 5: Check if my template complies with security best practices with AWS IaC MCP Server

Figure 5: Check if my template complies with security best practices with AWS IaC MCP Server

Best Practices

  • Start with Documentation Search: Before writing code, search for existing constructs and patterns
  • Validate Early and Often: Run validation tools before attempting deployment
  • Check Compliance: Use check_template_compliance to catch security issues during development
  • Leverage CloudTrail: When troubleshooting, the CloudTrail integration provides detailed failure context
  • Follow CDK Best Practices: Use the cdk_best_practices tool to align with AWS recommendations

What’s Next?

The IAC MCP Server represents a new paradigm in the AI agentic workflow infrastructure development – one where AI assistants understand your tools, help you navigate complex documentation, and provide intelligent assistance throughout the development lifecycle.

Get Involved

The AWS IaC MCP Server is available now:

  • Documentation and GitHub Repository: aws-iac-mcp-server
  • Feedback: We welcome issues and pull requests! Or respond to our IaC survey here.

Ready to supercharge your infrastructure as code development? Install the IaC MCP Server today and experience AI-powered assistance for your AWS CDK and CloudFormation workflows.

Have questions or feedback? Reach out to the blog authors on the AWS Developer Forums.

About Authors

Idriss Laouali Abdou

Idriss is a Sr. Product Manager Technical on the AWS Infrastructure-as-Code team based in Seattle. He focuses on improving developer productivity through AWS CloudFormation and StackSets Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Brian Terry

Brian Terry, Senior WW Data & AI PSA, is an innovation leader with more than 20 years of experience in technology and engineering. Brian is pursuing a PhD in computer science at the University of North Dakota and has spearheaded generative AI projects, optimized infrastructure scalability, and driven partner integration strategies. He is passionate about leveraging technology to deliver scalable, resilient solutions that foster business growth and innovation.

The Future of AWS CodeCommit

Post Syndicated from Anthony Hayes original https://aws.amazon.com/blogs/devops/aws-codecommit-returns-to-general-availability/

Back in July 2024, we announced plans to de-emphasize AWS CodeCommit based on adoption patterns and our assessment of customer needs. We never stopped looking at the data or listening to you, and what you’ve shown us is clear: you need an AWS-managed solution for your code repositories. Based on this feedback, CodeCommit is returning to full General Availability, effective immediately.

We Listened, and We Heard You

After the de-emphasis announcement last year, we heard from many of you. Your feedback was direct and revealing. You told us that CodeCommit isn’t just another code repository for you—it’s a critical piece of your infrastructure. Its deep IAM integration, VPC endpoint support, CloudTrail logging, and seamless connectivity with CodePipeline and CodeBuild provide value that’s difficult to replicate with third-party solutions, especially for teams operating in regulated industries or those who want all their development infrastructure within AWS boundaries. In short, we learned that CodeCommit is essential for many of you, so we’re bringing it back.

We acknowledge the uncertainty the de-emphasis has caused. If you invested time and resources planning or executing a migration away from CodeCommit, we apologize. We’ve learned from this, and we’re committed to doing better.

What’s Changing Today

Here’s what you need to know:

CodeCommit is open to new customers again – New customer sign-ups are open as of today. If you’ve been waiting to onboard new accounts or create repositories, you can do so right now through the AWS Console, CLI, or APIs.

For current and former customers – If you already migrated away, we understand you may have completed your transition to GitHub, GitLab, Bitbucket, or another provider. Those are excellent platforms, and we fully support your decision to use them. If you’re interested in returning to CodeCommit, our support team and account teams are available to help.

If you’re mid-migration, you can pause or reverse your plans. Contact AWS Support or your account team to discuss your specific situation and determine the best path forward.

If you stayed with CodeCommit, thank you for your patience during this period. We’re working through the backlog of feature requests and support tickets that accumulated, prioritizing by customer need. Continue to tell us how we can improve the service and support your workflows (human, machine, and agentic) moving forward.

What’s Coming Next

We’re not just maintaining CodeCommit—we’re investing in it. Here’s what’s on the roadmap:

Git LFS Support (Q1 2026) – This has been your most requested feature. Git Large File Storage will enable you to efficiently manage large binary files like images, videos, design assets, and compiled binaries without bloating your repositories. You’ll get faster clones, better performance, and cleaner version history for large assets.

Regional Expansions (Starting Q3 2026) – CodeCommit will expand to additional AWS Regions in eu-south-2 and ca-west-1, bringing the service closer to where you’re building and deploying your applications.

We’ll share more details about these features and additional roadmap items in the coming months. Keep an eye on our What’s New feed for the latest AWS launches.

Pricing, SLA, and Getting Started

Pricing remains unchanged—you can review the current structure on the CodeCommit pricing page. We continue to maintain our 99.9% uptime SLA as defined in our service terms.

If you’re new to CodeCommit or returning after a migration, check out our Getting Started Guide for step-by-step instructions. For migration assistance or questions about your specific setup, contact AWS Support or your account team.

Available Now

AWS CodeCommit is available now in 29 regions. New customers can begin creating repositories immediately. Visit the CodeCommit console to get started.

Thank you for your feedback, your patience, and your continued trust in AWS. We’re committed to making CodeCommit the best integrated Git repository service for AWS development.

Learn More:

Your Guide to the Developer Tools Track at AWS re:Invent 2025

Post Syndicated from Brian Beach original https://aws.amazon.com/blogs/devops/your-guide-to-the-developer-tools-track-at-aws-reinvent-2025/

AWS re:Invent 2025 is just around the corner, and if you’re a developer looking to level up your skills, the Developer Tool (DVT) track has an incredible lineup waiting for you. From CI/CD pipelines and full-stack development to Infrastructure as Code and AI-powered coding agents, this year’s sessions will help you build faster, smarter, and more efficiently. Here’s your essential guide to navigating the week.

Must-Attend Sessions

AWS re:Invent is a learning focused conference and the best place for developer to learn is in one of the roughly 75 sessions on the Developer Tools track. With breakout sessions, lightening talks, chalk talks, code talks, workshops, builder sessions, and meetups, you are sure to find a something that appeals the developer in you. Check you the event catalog, or start with these stand out sessions.

  • DVT202: Continuous integration and continuous delivery (CI/CD) on AWS – Learn about creating complete CI/CD pipelines using infrastructure as code on AWS, with hands-on insights into planning work, collaborating on code, and deploying applications. Mandalay Bay – Monday 10:00 AM
  • DVT203: AWS infrastructure as code: A year in review – Discover the latest features and improvements for AWS CloudFormation and AWS CDK, and learn how these tools can bring rigor, clarity, and reliability to your application development. MGM Grand – Monday 10:30 AM
  • DVT204: What’s new in full-stack AWS app development – Find out how AWS is evolving to help web developers deliver differentiating experiences at 10x speed with solutions that empower you to get started easily, ship quickly, and iterate rapidly. Mandalay Bay – Monday 12:00 PM
  • DVT209: Kiro: Your agentic IDE for spec-driven development – Explore how Kiro is revolutionizing development with spec-driven workflows, agent hooks, multimodal agent chat, and MCP support to help you go from idea to production faster. MGM Grand – Wednesday 11:30 AM
  • DVT405: From Code completion to autonomous agents: The evolution of software development – Journey through the evolution of AI-powered coding agents from inline code completion to sophisticated autonomous tools, grounded in empirical evidence and real-world applications. MGM Grand – Wednesday 3:00 PM
  • DVT207: Developer experience economics: moving past productivity metrics – Learn Amazon’s approach to understanding the impact of developer experience and tooling, and discover how to bring strategic thinking to your team’s developer experience improvements. Mandalay Bay – Tuesday 5:30 PM

House of Kiro

Start your journey at the House of Kiro in the Venetian. Walk through Kiro’s haunted house filled with developer nightmares and horrors, and explore how Kiro brings structure to coding chaos through spec-driven development, vibe coding, and agent hooks. If you survive the haunted house, you will be rewarded with Kiro swag.

Rustic wooden cabin structure with "KIRO" branding and ghost logo on the roof, featuring boarded-up windows with glowing purple light emanating from behind, creating a haunted house aesthetic with a front porch and chimney.

AWS Village

Visit the AWS Village in the Expo at the Venetian Level 2 Hall B to speak with me and other experts at either the Kiro kiosk or the Developer Tools kiosk, covering CodePipeline, CodeBuild, CloudFormation, CDK, and all the essential developer tools.

  • The Venetian, Monday, Dec 1: 4:00 PM – 7:00 PM
  • Tuesday, Dec 2: 10:00 AM – 6:00 PM
  • Wednesday, Dec 3: 10:00 AM – 6:00 PM
  • Thursday, Dec 4: 10:00 AM – 4:00 PM

AWS booth at a conference or trade show featuring the iconic AWS logo and smile design suspended above a multi-level exhibition space with purple and blue gradient lighting, surrounded by attendees exploring various demo stations.

Builders Loft

Located at the south end of the strip in Mandalay Bay, the Builders Loft offers a collaborative workspace with dedicated co-working spaces and meetup zones. Enjoy coffee, snacks, SWAG, and daily tech challenges for a chance to win AWS credits. Kiro experts will be at the builders loft Monday-Thursday:

  • 8:00 AM – 12:00 PM: Co-working space for one-on-one consultations
  • 12:00 PM – 1:00 PM: Daily meetup in the meetup space
  • 4:50 PM – 5:00 PM: Q&A in the whiteboard section

Isometric 3D rendering of an AWS re:Invent expo floor layout featuring purple and pink branded kiosks, blue seating areas with round tables, interactive display stations, and workspace zones in a modern conference environment.

Hands-On Challenges

Kiro’s Labyrinth

Stop by the Kiro kiosk in the Venetian Expo to participate in Kiro’s Labyrinth, a coding challenge where you’ll help Kiro escape from a spooky Halloween maze and win prizes. The Kiro code champions will be crowned in DVT221 at Mandalay Bay on Thursday at 11:30 AM.

Atmospheric 3D render of a medieval dungeon or castle interior with dramatic red and orange lighting from wall-mounted torches, featuring stone archways, staircases, cobblestone floors, and blue accent lighting creating a moody gaming environment.

Kiroween Hackathon

Build something wicked for Kiroween, the annual hackathon that started on Halloween and ends on Friday, December 5th—the last day of re:Invent. Need help? Visit us at in the Builder Loft in Mandalay Bay: Monday-Friday, 8:30 AM – 12:00 PM or the Developer Pavilion in Venetian whenever the Expo is open.

Purple banner with "KIROWEEN" text in white, flanked by three ghost characters including the Kiro ghost mascot, a mummy ghost, and a skeleton ghost, creating aHalloween-themed branding element.

Conclusion

Make the most of your re:Invent experience by attending these sessions, connecting with experts at the AWS Village and Builders Loft, and participating in hands-on challenges. Whether you’re interested in CI/CD, infrastructure as code, AI-powered development, or just want to network with fellow builders, the Developer Tools track has something for everyone. See you in Vegas!

Simplified developer access to AWS with ‘aws login’

Post Syndicated from Shreya Jain original https://aws.amazon.com/blogs/security/simplified-developer-access-to-aws-with-aws-login/

Getting credentials for local development with AWS is now simpler and more secure. A new AWS Command Line Interface (AWS CLI) command, aws login, lets you start building immediately after signing up for AWS without creating and managing long-term access keys. You use the same sign-in method you already use for the AWS Management Console.

In this blog, we’ll show you how to get temporary credentials to your workstation for use with the AWS CLI, AWS Software Development Kits (AWS SDKs), and tools or applications built using them with the new aws login command.

Getting started with programmatic access to AWS

You can use the aws login command with your AWS Management Console sign-in method, as described in the following sections.

Scenario 1: Using IAM credentials (root or IAM user)

To obtain programmatic credentials using your root or IAM user username and password:

  1. Install the latest AWS CLI (version 2.32.0 or later).
  2. Run the aws login command.
  3. If you have not set a default Region, the CLI prompts you to specify the AWS Region of your choice (e.g., us-east-2, eu-central-1). The CLI remembers which Region you set once you enter it into this prompt.
    Figure 1: CLI Region prompt

    Figure 1: CLI Region prompt

  4. The CLI opens your default browser.
  5. Follow the instructions in the browser window:
    1. If you have already signed into the AWS Management Console, you will see a screen that says, “Continue with an active session.”
      Figure 2: Sign in to AWS - active session selection

      Figure 2: Sign in to AWS – active session selection

    2. If you haven’t signed into the AWS Management Console, you will see the sign-in options page. Select “Continue with Root or IAM user” and log in to your AWS account.
      Figure 3: AWS Sign in to AWS - Sign-in options

      Figure 3: AWS Sign in to AWS – Sign-in options

  6. Success! You’re ready to run AWS CLI commands. Try the aws sts get-caller-identity command to verify the identity you’re currently using.
    Figure 4: Sign in to AWS - completion

    Figure 4: Sign in to AWS – completion

Scenario 2: Using federated sign-in

This scenario applies when you authenticate through your organization’s identity provider. To retrieve programmatic credentials for roles you assumed with federation:

  1. Complete steps 1–4 from Scenario 1, then continue with the following instructions.
  2. Follow the instructions in the browser window:
    1. If you have already signed into the AWS Management Console, the browser provides you with the option to select your active IAM role session from federated sign-in to the console. This enables you to switch between 5 active AWS sessions if you have multi-session support enabled on your AWS Management Console.
      Figure 5: Sign in to AWS - active IAM role session selection

      Figure 5: Sign in to AWS – active IAM role session selection

    2. If you have not signed into the AWS Management Console or want to get temporary credentials for a different IAM role, sign into your AWS account using your current authentication mechanism in another browser tab. Upon successful login, switch back to this tab and select the “Refresh” button. Your console session should now be available under the active sessions.
  3. Return to the AWS CLI once you have successfully completed the aws login process.

Regardless of the console sign-in method you choose, the temporary credentials issued by the aws login command are automatically rotated by the AWS CLI, AWS Tools for PowerShell and AWS SDKs every 15 minutes. They are valid up to the set session duration of the IAM principal (maximum of 12 hours). After reaching the session duration limit, you will be prompted to log in again.

Figure 6: AWS Sign in - session expiration

Figure 6: AWS Sign in – session expiration

Accessing AWS using local developer tools

The aws login command supports switching between multiple AWS accounts and roles using profiles. You can configure a profile with aws login --profile <PROFILE_NAME> and run AWS commands with the profile using: aws sts get-caller-identity --profile <PROFILE_NAME>. The short-term credentials issued by aws login work with more than the AWS CLI. You can also use them with:

  • AWS SDKs: If you use AWS SDKs for development, the SDK clients can use these temporary credentials to authenticate with AWS.
  • AWS Tools for PowerShell: Use the Invoke-AWSLogin command.
  • Remote development servers: Use aws login --remote on a remote server without browser access, to deliver temporary credentials from your device with browser access to the AWS console.
  • Older versions of AWS SDKs that do not support the new console credentials provider: Any software written using these older SDKs can support credentials delivered by aws login by using the credential_process provider with the AWS CLI.

Controlling access to aws login with IAM policies

The aws login command is controlled by two IAM actions: signin:AuthorizeOAuth2Access and signin:CreateOAuth2Token. Use the SignInLocalDevelopmentAccess managed policy or add these actions to your IAM policies to allow IAM users and IAM roles with console access to use this feature.

AWS Organizations customers looking to control the usage of this login feature on member accounts can deny the two actions above using Service Control Policies (SCPs). These IAM actions and their resources are usable in all relevant IAM policies.

AWS recommends using centralized root access management in AWS Organizations to eliminate long-term root credentials from member accounts. This feature allows security teams to perform privileged tasks through short-term, task-scoped root sessions from a central management account. After you enable centralized root management and delete root credentials on member accounts, root login to member accounts is denied, which also prevents programmatic access with root credentials using aws login. For developers using root credentials or IAM users, aws login delivers short-lived credentials to development tools, providing a secure alternative to long-term static access keys.

Logging and security of programmatic access using aws login

AWS Sign-In logs API activity through AWS CloudTrail, which now includes two new events specific to aws login. The service logs two new event names called AuthorizeOAuth2Access and CreateOauth2Token in the AWS Region where the user logs in.

Here’s a CloudTrail sample for an AuthorizeOAuth2Access event:

{
    "eventVersion": "1.11",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROATJHQDX737YZP72NTF:testuser”,
        "arn": "arn:aws:sts::225989345271:assumed-role/Admin/testuser,
        "accountId": “111111111111”,
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROATJHQDX737YZP72NTF",
                "arn": "arn:aws:iam::111111111111:role/Admin",
                "accountId": “11111111111”,
                "userName": "Admin"
            },
            "attributes": {
                "creationDate": "2025-11-17T22:50:14Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2025-11-17T22:51:32Z",
    "eventSource": "signin.amazonaws.com",
    "eventName": "AuthorizeOAuth2Access",
    "awsRegion": "us-east-1",
    "sourceIPAddress": “192.0.2.2”,
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36",
    "requestParameters": {
        "scope": "openid",
        "redirect_uri": "http://127.0.0.1:53037/oauth/callback",
        "code_challenge_method": "SHA-256",
        "client_id": "arn:aws:signin:::devtools/same-device"
    },
    "responseElements": null,
    "additionalEventData": {
        "success": "true",
        "x-amzn-vpce-id": ""
    },
    "requestID": "e2854c76-1cba-4360-9fd1-5037b591466b",
    "eventID": "59e1720d-3deb-44ff-933d-6828be2a860a",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": “111111111111”,
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "us-east-1.signin.aws.amazon.com"
    }
}

Here’s a CloudTrail sample for a CreateOAuth2Token event:

{
    "eventVersion": "1.11",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROATJHQDX737YZP72NTF:testuser-Isengard",
        "arn": "arn:aws:sts::111111111111:assumed-role/Admin/testuser-Isengard",
        "accountId": "111111111111",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROATJHQDX737YZP72NTF",
                "arn": "arn:aws:iam::111111111111:role/Admin",
                "accountId": "111111111111",
                "userName": "Admin"
            },
            "attributes": {
                "creationDate": "2025-11-18T20:38:10Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2025-11-18T20:38:44Z",
    "eventSource": "signin.amazonaws.com",
    "eventName": "CreateOAuth2Token",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "192.0.2.2",
    "userAgent": "aws-cli/2.32.0 md/awscrt#0.28.4 ua/2.1 os/macos#24.6.0 md/arch#arm64 lang/python#3.13.9 md/pyimpl#CPython m/b,AA,Z,E cfg/retry-mode#standard md/installer#exe sid/35033f4ca1bd md/prompt#off md/command#login",
    "requestParameters": {
        "client_id": "arn:aws:signin:::devtools/same-device"
    },
    "responseElements": null,
    "additionalEventData": {
        "success": "true",
        "x-amzn-vpce-id": ""
    },
    "requestID": "94562943-c85b-4dc1-bf72-43b0fd42d6de",
    "eventID": "0b338fac-6a10-4740-b34d-1bb6923e799e",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111111111111",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "us-east-1.signin.aws.amazon.com"
    }
}

The aws login command uses the OAuth 2.0 authorization code flow with PKCE (Proof Key for Code Exchange) to protect against authorization code interception attacks. This provides a secure alternative to setting up IAM user access keys for getting started with development on AWS. For guidance on additional modern authentication approaches and alternatives to long-term IAM access keys, see the AWS Security Blog post “Beyond IAM access keys: Modern authentication approaches for AWS.”

Conclusion

The login for AWS local development feature is a secure-by-default enhancement that helps customers eliminate the use of long-term credentials for programmatic access with AWS. With aws login, you can start building immediately using the same credentials you use to sign in to the AWS Management Console. This feature is now available across all AWS commercial Regions (excluding China and GovCloud) at no additional cost to customers.

For more information, visit the authentication and access section in the CLI user guide.

If you have feedback about this post, submit comments in the Comments section below.

Shreya Jain

Shreya Jain

Shreya is a Senior Technical Product Manager in AWS Identity. She is energized by bringing clarity and simplicity to complex ideas. When she’s not applying her creative energy at work, you’ll find her at Pilates, dancing, or discovering her next favorite coffee shop.

Sowjanya Rajavaram

Sowjanya Rajavaram

Sowjanya is a Sr Solutions Architect who specializes in Identity and Security in AWS. She works on helping customers of all sizes solve their identity and access management problems. She enjoys traveling and exploring new cultures and food.

Accelerate workflow development with enhanced local testing in AWS Step Functions

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/accelerate-workflow-development-with-enhanced-local-testing-in-aws-step-functions/

Today, I’m excited to announce enhanced local testing capabilities for AWS Step Functions through the TestState API, our testing API.

These enhancements are available through the API, so you can build automated test suites that validate your workflow definitions locally on your development machines, test error handling patterns, data transformations, and mock service integrations using your preferred testing frameworks. This launch introduces an API-based approach for local unit testing, providing programmatic access to comprehensive testing capabilities without deploying to Amazon Web Services (AWS).

There are three key capabilities introduced in this enhanced TestState API:

  • Mocking support – Mock state outputs and errors without invoking downstream services, enabling true unit testing of state machine logic. TestState validates mocked responses against AWS API models with three validation modes: STRICT (this is the default and validates all required fields), PRESENT (validates field types and names), and NONE (no validation), providing high-fidelity testing.

  • Support for all state types – All state types, including advanced states such as Map states (inline and distributed), Parallel states, activity-based Task states, .sync service integration patterns, and .waitForTaskToken service integration patterns, can now be tested. This means you can use TestState API across your entire workflow definition and write unit tests to verify control flow logic, including state transitions, error handling, and data transformations.

  • Testing individual states – Test specific states within a full state machine definition using the new stateName parameter. You can provide the complete state machine definition one time and test each state individually by name. You can control execution context to test specific retry attempts, Map iteration positions, and error scenarios.

Getting started with enhanced TestState
Let me walk you through these new capabilities in enhanced TestState.

Scenario 1: Mock successful results

The first capability is mocking support, which you can use to test your workflow logic without invoking actual AWS services or even external HTTP requests. You can either mock service responses for fast unit testing or test with actual AWS services for integration testing. When using mocked responses, you don’t need AWS Identity and Access Management (IAM) permissions.

Here’s how to mock a successful AWS Lambda function response:

aws stepfunctions test-state --region us-east-1 \
--definition '{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {"FunctionName": "process-order"},
  "End": true
}' \
--mock '{"result":"{\"orderId\":\"12345\",\"status\":\"processed\"}"}' \
--inspection-level DEBUG

This command tests a Lambda invocation state without actually calling the function. TestState validates your mock response against the Lambda service API model so your test data matches what the real service would return.

The response shows the successful execution with detailed inspection data (when using DEBUG inspection level):

{
    "output": "{\"orderId\":\"12345\",\"status\":\"processed\"}",
    "inspectionData": {
        "input": "{}",
        "afterInputPath": "{}",
        "afterParameters": "{\"FunctionName\":\"process-order\"}",
        "result": "{\"orderId\":\"12345\",\"status\":\"processed\"}",
        "afterResultSelector": "{\"orderId\":\"12345\",\"status\":\"processed\"}",
        "afterResultPath": "{\"orderId\":\"12345\",\"status\":\"processed\"}"
    },
    "status": "SUCCEEDED"
}

When you specify a mock response, TestState validates it against the AWS service’s API model so your mocked data conforms to the expected schema, maintaining high-fidelity testing without requiring actual AWS service calls.

Scenario 2: Mock error conditions
You can also mock error conditions to test your error handling logic:

aws stepfunctions test-state --region us-east-1 \
--definition '{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {"FunctionName": "process-order"},
  "End": true
}' \
--mock '{"errorOutput":{"error":"Lambda.ServiceException","cause":"Function failed"}}' \
--inspection-level DEBUG

This simulates a Lambda service exception so you can verify how your state machine handles failures without triggering actual errors in your AWS environment.

The response shows the failed execution with error details:

{
    "error": "Lambda.ServiceException",
    "cause": "Function failed",
    "inspectionData": {
        "input": "{}",
        "afterInputPath": "{}",
        "afterParameters": "{\"FunctionName\":\"process-order\"}"
    },
    "status": "FAILED"
}

Scenario 3: Test Map states
The second capability adds support for previously unsupported state types. Here’s how to test a Distributed Map state:

aws stepfunctions test-state --region us-east-1 \
--definition '{
  "Type": "Map",
  "ItemProcessor": {
    "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
    "StartAt": "ProcessItem",
    "States": {
      "ProcessItem": {
        "Type": "Task", 
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": "process-item"},
        "End": true
      }
    }
  },
  "End": true
}' \
--input '[{"itemId":1},{"itemId":2}]' \
--mock '{"result":"[{\"itemId\":1,\"status\":\"processed\"},{\"itemId\":2,\"status\":\"processed\"}]"}' \
--inspection-level DEBUG

The mock result represents the complete output from processing multiple items. In this case, the mocked array must match the expected Map state output format.

The response shows successful processing of the array input:

{
    "output": "[{\"itemId\":1,\"status\":\"processed\"},{\"itemId\":2,\"status\":\"processed\"}]",
    "inspectionData": {
        "input": "[{\"itemId\":1},{\"itemId\":2}]",
        "afterInputPath": "[{\"itemId\":1},{\"itemId\":2}]",
        "afterResultSelector": "[{\"itemId\":1,\"status\":\"processed\"},{\"itemId\":2,\"status\":\"processed\"}]",
        "afterResultPath": "[{\"itemId\":1,\"status\":\"processed\"},{\"itemId\":2,\"status\":\"processed\"}]"
    },
    "status": "SUCCEEDED"
}

Scenario 4: Test Parallel states
Similarly, you can test Parallel states that execute multiple branches concurrently:

aws stepfunctions test-state --region us-east-1 \
--definition '{
  "Type": "Parallel",
  "Branches": [
    {"StartAt": "Branch1", "States": {"Branch1": {"Type": "Pass", "End": true}}},
    {"StartAt": "Branch2", "States": {"Branch2": {"Type": "Pass", "End": true}}}
  ],
  "End": true
}' \
--mock '{"result":"[{\"branch1\":\"data1\"},{\"branch2\":\"data2\"}]"}' \
--inspection-level DEBUG

The mock result must be an array with one element per branch. By using TestState, your mock data structure matches what a real Parallel state execution would produce.

The response shows the parallel execution results:

{
    "output": "[{\"branch1\":\"data1\"},{\"branch2\":\"data2\"}]",
    "inspectionData": {
        "input": "{}",
        "afterResultSelector": "[{\"branch1\":\"data1\"},{\"branch2\":\"data2\"}]",
        "afterResultPath": "[{\"branch1\":\"data1\"},{\"branch2\":\"data2\"}]"
    },
    "status": "SUCCEEDED"
}

Scenario 5: Test individual states within complete workflows
You can test specific states within a full state machine definition using the stateName parameter. Here’s an example testing a single state, though you would typically provide your complete workflow definition and specify which state to test:

aws stepfunctions test-state --region us-east-1 \
--definition '{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {"FunctionName": "validate-order"},
  "End": true
}' \
--input '{"orderId":"12345","amount":99.99}' \
--mock '{"result":"{\"orderId\":\"12345\",\"validated\":true}"}' \
--inspection-level DEBUG

This tests a Lambda invocation state with specific input data, showing how TestState processes the input and transforms it through the state execution.

The response shows detailed input processing and validation:

{
    "output": "{\"orderId\":\"12345\",\"validated\":true}",
    "inspectionData": {
        "input": "{\"orderId\":\"12345\",\"amount\":99.99}",
        "afterInputPath": "{\"orderId\":\"12345\",\"amount\":99.99}",
        "afterParameters": "{\"FunctionName\":\"validate-order\"}",
        "result": "{\"orderId\":\"12345\",\"validated\":true}",
        "afterResultSelector": "{\"orderId\":\"12345\",\"validated\":true}",
        "afterResultPath": "{\"orderId\":\"12345\",\"validated\":true}"
    },
    "status": "SUCCEEDED"
}

These enhancements bring the familiar local development experience to Step Functions workflows, helping me to get instant feedback on changes before deploying to my AWS account. I can write automated test suites to validate all Step Functions features with the same reliability as cloud execution, providing confidence that my workflows will work as expected when deployed.

Things to know
Here are key points to note:

  • Availability – Enhanced TestState capabilities are available in all AWS Regions where Step Functions is supported.
  • Pricing – TestState API calls are included with AWS Step Functions at no additional charge.
  • Framework compatibility – TestState works with any testing framework that can make HTTP requests, including Jest, pytest, JUnit, and others. You can write test suites that validate your workflows automatically in your continuous integration and continuous delivery (CI/CD) pipeline before deployment.
  • Feature support – Enhanced TestState supports all Step Functions features including Distributed Map, Parallel states, error handling, and JSONata expressions.
  • Documentation – For detailed options for different configurations, refer to the TestState documentation and API reference for the updated request and response model.

Get started today with enhanced local testing by integrating TestState into your development workflow.

Happy building!
Donnie

Announcing CloudFormation IDE Experience: End-to-End Development in Your IDE

Post Syndicated from Damola Oluyemo original https://aws.amazon.com/blogs/devops/announcing-cloudformation-ide-experience-end-to-end-development-in-your-ide/

If you’ve developed AWS CloudFormation templates, you know the drill; write YAML(YAML Ain’t Markup Language) in your IDE(Integrated Development Environment), switch to the AWS Management Console to validate, jump to documentation to verify property names. Then run CFN Lint(Cloudformation Linter) in your terminal, deploy and wait, then troubleshoot failures back in the console. This constant context switching between your IDE, AWS Console, documentation pages, and validation tools fragments your workflow and kills productivity. What should take 30 minutes often stretches into hours of iteration cycles.

Today, we’re excited to introduce the CloudFormation IDE Experience, a comprehensive solution that brings the entire CloudFormation development lifecycle into your IDE. No more context switching. No more fragmented workflows. Just one unified, intelligent development experience from authoring to deployment.

In this post, you’ll learn how the Cloudformation IDE Experience transforms your workflow with intelligent authoring, real-time validation, AWS integration, and more.

What is the CloudFormation IDE Experience?

The CloudFormation IDE Experience reimagines how you build infrastructure as code by creating an end-to-end development loop entirely within your IDE. Unlike generic YAML or JSON editors, this is a CloudFormation-first solution built specifically for infrastructure developers.

This solution covers the complete lifecycle; from intelligent authoring with smart code completion and navigation that understands CloudFormation semantics, to real-time multi-layer validation that catches issues before deployment. It provides direct AWS integration for seamless resource imports and stack visibility, monitors configuration drift between your templates and deployed resources, and includes server-side pre-deployment checks that prevent common deployment failures.
The result? A development environment that understands your infrastructure code as deeply as your IDE understands your application code.

Core Features

Quick Project Setup with CFN Init

CFN Init streamlines project setup by creating a structured CloudFormation project with environment configurations in seconds. Run “CFN Init: Initialize Project” from the Command Palette, configure your environments (dev, staging, production), and associate each with an AWS profile.

The CloudFormation Explorer displays your environments, letting you switch between them with a single click. Each environment maintains its own deployment settings and parameter values, eliminating manual configuration and ensuring consistent deployments across your infrastructure lifecycle.

Intelligent Authoring with Intelligent Code Completion

The IDE understands CloudFormation semantics and provides context-aware suggestions as you type. Only required properties appear automatically, while optional properties surface on hover, so when you add a Properties section to an EC2 VPC resource, nothing appears because it has no required properties. Create a subnet, however, and VpcId appears immediately because it’s required.

When you use !GetAtt or !Ref, the IDE knows exactly which attributes and resources are available. Navigation features like go-to-definition for logical IDs and hover tooltips let you explore complex templates without losing context. The IDE also provides full support for CloudFormation intrinsic functions and pseudo parameters.

Multi-Layer Validation System

The IDE provides comprehensive validation at multiple levels:

Static Validation (Real-time)

  • CloudFormation Guard Integration: Security and compliance checks using AWS Security pillar rules. For example, it automatically flags insecure configurations like MapPublicIpOnLaunch: true on subnets
  • CFN Lint Integration: Advanced syntax and logic validation, including overlapping CIDR block detection, resource dependency validation, and property checks beyond basic schema validation

Interactive Error Resolution
When errors occur, the IDE doesn’t just highlight them, it helps you fix them. Contextual error messages explain what’s wrong and why it matters, while one-click quick fixes automatically correct common issues like missing required properties or invalid reference formats. If you reference a non-existent resource, the IDE suggests valid alternatives from your template. Reference an invalid attribute with !GetAtt, the IDE immediately shows which attributes are actually available for that resource type.

AWS Resource Integration (CCAPI)

Import existing AWS resources directly into your templates using the Cloud Control API (CCAPI). Browse live resources and view all CloudFormation stacks in your AWS account from within the IDE. Pull resource configurations directly into your template with one click, complete with accurate property values. This transforms existing infrastructure into Infrastructure-as-Code without manual reconstruction or switching to the console to look up property values.

Server-Side Validation

Before you deploy, the IDE performs comprehensive server-side validation through AWS’s intelligent validation service that analyzes your CloudFormation templates against real-world deployment patterns and catches issues static analysis can’t detect.

The AWS’s intelligent validation service uses AWS-managed hooks to analyze your change sets before execution across three categories. Enhanced template validation covers CFN Lint blind spots like transforms and parameter values. Primary identifier conflict detection finds existing resources with the same identifiers before you attempt deployment. Resource state validation checks resource readiness ensuring, for example, that Amazon Simple Storage Service(S3) buckets are empty before deletion attempts.

This validation is based on analysis of the top CloudFormation failure patterns, helping you catch issues before they cause rollbacks or failed states.

Getting Started

Getting started with the CloudFormation IDE Experience is straightforward:

Prerequisite:

  1. Install an IDE that supports the CloudFormation extension, such as Visual Studio Code, Kiro
  2. Download the CloudFormation extension for your platform (available through the AWS Toolkit)
  3. Install the extension following the standard VS Code extension installation process

No complex dependency management or schema updates required—all configuration and updates are handled automatically.

Let’s See How It Works

Let’s walk through a practical example that demonstrates the IDE experience in action. We’ll build a simple Amazon Virtual Private Cloud (Amazon VPC) infrastructure with subnets and an S3 bucket.

Setting Up Your Project

Start by initializing a new CloudFormation project. Open the Command Palette, run “CFN Init: Initialize Project”, choose your project location, and set up environments. For this example, create a “beta” environment and associate it with your AWS development profile. The IDE creates your project structure with configuration files ready to use. You can now select your “beta” environment from the CloudFormation Explorer to ensure all deployments use the correct settings.

Figure 1: Initializing a CloudFormation project with environment configuration

Starting with Intelligent Authoring

Create a new CloudFormation template and start typing AWS::EC2::VPC. The IDE provides intelligent completions as you type.

Cloudformation IDE extension intelligent completion

Figure 2.0: Resource type auto-completion with CloudFormation-aware IntelliSense

When you add the Properties section, notice something interesting: nothing appears automatically. That’s because Amazon Elastic Compute Cloud (Amazon EC2) VPC has no required properties.

Cloudformation IDE extension doesn't suggest optional properties
Figure 2.1: No automatic suggestions for VPC properties since none are required

Hover over Properties to see all available options with their types and documentation links.

Hover information displaying optional properties and their documentation

Figure 2.2: Hover information displaying optional properties and their documentation

Add a CIDR block, then create a subnet. This time, when you type Properties, VpcId appears immediately because it’s required.

Required properties VpcID automatically suggested for EC2 Subnet
Figure 2.3: Required properties VpcID automatically suggested for EC2 Subnet

The IDE provides the resource names in your template, and when you use !GetAtt or !Ref, it knows which attributes are available for each resource type.

Type-aware completions for intrinsic functions like !GetAtt & !Ref

Figure 2.4: Type-aware completions for intrinsic functions like !GetAtt & !Ref

Real-Time Validation in Action

As you continue building, add MapPublicIpOnLaunch: true to make a public subnet. Immediately, a blue squiggly line appears.

CloudFormation Guard warning highlighted in real-time

Figure 3: CloudFormation Guard warning highlighted in real-time

Hovering reveals a CloudFormation Guard warning from the AWS Security pillar rules: this configuration isn’t recommended for security compliance.

Security compliance warning with detailed explanation

Figure 3.1: Security compliance warning with detailed explanation

Create a second subnet by copying the first, but now red squiggly lines appear. CFN Lint has detected overlapping CIDR blocks between your two subnets – an issue that would fail during deployment. You can fix it immediately with the contextual information provided.

CFN Lint error detection for overlapping CIDR blocks providing detailed error information helping you resolve the issue quickly
Figure 3.2: CFN Lint error detection for overlapping CIDR blocks providing detailed error information helping you resolve the issue quickly

Importing Existing Resources

Now you need an S3 bucket. Instead of writing it from scratch, open the Resource Explorer panel on the left. Using CCAPI integration, you can see all your existing AWS resources. Select an S3 bucket and click “Import resource state”. The IDE pulls in the complete resource configuration with all properties already set. You can now iterate on this resource without needing to remember or look up all the configuration details.

Automatically imported resource configuration from live AWS resources

Figure 4: Automatically imported resource configuration from live AWS resources

Developer Experience Benefits

The CloudFormation IDE Experience delivers measurable improvements across productivity and quality:

Productivity Gains:

  • Reduced context switching: Keep your entire workflow in one place
  • Faster iteration cycles: Catch and fix issues in seconds, not minutes or hours
  • Shift-left validation: Identify problems before deployment, not after
  • Intelligent assistance: Spend less time in documentation, more time building

Quality Improvements:

  • Proactive error prevention: Multi-layer validation catches issues early
  • Security by default: Built-in compliance checks from CloudFormation Guard
  • Best practice enforcement: Automated guidance aligned with AWS recommendations
  • Deployment confidence: Pre-deployment validation reduces rollback scenarios

What previously took hours of troubleshooting and multiple deployment attempts now becomes a confident 30-minute development cycle.

“I will definitely use these features; they help to reduce the feedback loop and speed up the development of IaC templates.” – AWS Community Builder

Things to Know

Platform Support

The CloudFormation IDE Experience is available for:

  • Visual Studio Code: Full feature support
  • Kiro: Full feature support
  • Cursor: Full feature support
  • JetBrains IDEs: Complete integration across the IntelliJ family (Fast Follow)
  • Operating Systems: macOS (ARM), Linux (x64) and Windows(…)

Conclusion

The CloudFormation IDE Experience eliminates the context switching that fragments your workflow. Write, validate, and deploy all from one environment. What used to take hours of iteration now takes minutes.

Ready to get started? Install the CloudFormation extension from the AWS Toolkit for VS Code and experience the difference. For detailed setup instructions and feature documentation, see the CloudFormation IDE Experience guide.

About the Authors:

Damola Oluyemo

Damola Oluyemo is a Solutions Architect at Amazon Web Services focused on Enterprise customers. He helps customers design cloud solutions while exploring the potential of Infrastructure as Code and generative AI in software development.

Jehu Gray

Jehu Gray is a Prototyping Architect at Amazon Web Services where he helps customers design solutions that fits their needs. He enjoys exploring what’s possible with IaC.

Safely Handle Configuration Drift with CloudFormation Drift-Aware Change Sets

Post Syndicated from JJ Lei original https://aws.amazon.com/blogs/devops/safely-handle-configuration-drift-with-cloudformation-drift-aware-change-sets/

Introduction

Is configuration drift preventing you from accessing the speed, safety, and governance benefits of AWS CloudFormation for infrastructure management? Configuration drift occurs when cloud resources are modified outside of CloudFormation, leading to a mismatch in the actual state and template definition of resources. Drift tends to accumulate from infrastructure changes that engineers make via the AWS Management Console to resolve production incidents or troubleshoot malfunctioning applications. Drift can cause unexpected changes during subsequent IaC deployments or leave resources in a non-compliant state. Unresolved drift can lead to cost increases when resources are over-provisioned outside of template definitions, or compliance violations that may result in audit penalties. Additionally, drift makes it hard to reproduce applications for testing or disaster recovery.

CloudFormation now offers drift-aware change sets that allow you to safely handle configuration drift and keep your infrastructure in sync with your templates. In this post, we will explore the process of leveraging drift-aware change sets to resolve common scenarios in which drift impacts the availability or security of your application.

Solution Overview

Drift-aware change sets are a type of CloudFormation change sets that can bring drifted resources in line with template definitions and preview the required changes to actual infrastructure states before deployment. Drift-aware change sets surface a three-way comparison of your new template, actual resource states, and previous template before deployment, allowing you to prevent unexpected overwrites of drift. Additionally, drift-aware change sets offer you a systematic mechanism to restore drifted resources to approved template definitions, strengthening the reproducibility and compliance posture of applications. You can create drift-aware change sets either from the CloudFormation Management Console or from the AWS CLI or SDK by passing the --deployment-mode REVERT_DRIFT parameter to the CreateChangeSet API.

Prerequisites

AWS CLI latest version with CloudFormation permissions configured.

AWS Identity and Access Management (IAM) permissions required: Permissions to create and manage CloudFormation stacks, AWS Lambda functions, Security Groups, Amazon Simple Storage Service (Amazon S3) buckets, and IAM roles. PowerUserAccess or Administrator access recommended for testing.

• Test environment (non-production AWS account recommended)

• Basic CloudFormation knowledge (stacks, templates, change sets)

Important Note: These sample templates are provided for educational purposes only and should not be used in production environments without proper security review and testing. You are responsible for testing, securing, and optimizing these templates based on your specific quality control practices and standards. Deploying these templates may incur AWS charges for creating or using AWS resources. Work with your security and legal teams to meet your organizational security, regulatory, and compliance requirements before any production deployment.

Scenario 1: Prevent Dangerous Overwrites

This scenario demonstrates how drift-aware change sets prevent dangerous overwrites when Lambda function memory is increased outside of CloudFormation during an outage, and a subsequent template update could accidentally reduce memory, causing performance issues.

Story: Your team deploys a Lambda function with 128 MB memory via CloudFormation. During a production outage, an engineer increases the memory to 512 MB through the Lambda Console to resolve performance issues. Later, another developer updates the template to 256 MB for a code change, unaware of the console modification. Without drift-aware change sets, CloudFormation would unexpectedly reduce memory from 512 MB to 256 MB—potentially causing the outage to recur.

User journey: Create stack with 128MB => Increase memory to 512MB via console during outage => Create drift-aware change set with 256MB template => Review three-way comparison showing dangerous memory reduction => Cancel change set to prevent outage => Update template to match production state (512MB) => Create and execute drift-aware change set with updated template (512MB) to resolve drift

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with Lambda function (128 MB memory).

Figure 1

CloudFormation stack “lambda-memory-drift-test” successfully deployed with CREATE_COMPLETE status

2. Emergency Memory Increase (Console)

Manually increase Lambda memory to 512 MB through AWS Console (simulating emergency performance fix during outage).

Figure 2

Initial Lambda function showing 128 MB memory as configured in template

Figure 3

Lambda memory increased to 512 MB through console during outage, creating drift from template

3. Create Drift-Aware Change Set

Create change set with 256 MB template using drift-aware mode to reveal the dangerous memory reduction.

Figure 4

CloudFormation console showing the new “Drift aware change set” option selected. This compares the new template with the live state of your stack and shows changes to drifted resources before deployment, unlike standard change sets that only compare templates.

aws cloudformation create-change-set \
--stack-name lambda-memory-drift-test \
--change-set-name detect-memory-overwrite \
--template-body file://lambda-memory-drift-scenario-256mb.yaml \
--deployment-mode REVERT_DRIFT \
--capabilities CAPABILITY_IAM \
--region us-east-1

4. Review Change Set – The Critical Three-Way Comparison

Examine the drift-aware change set to see the dangerous memory reduction that would occur.

Figure 5

Critical insight revealed: The change set shows Live resource state (512 MB) vs Proposed resource state (256 MB), revealing a dangerous memory reduction that would impact performance.

Figure 6: view drift

Drift analysis: Clicking “View drift” reveals the complete picture – Previous template (128 MB) vs Live resource state (512 MB). This shows the live state has 4x more memory than the original template, indicating emergency changes were made during the outage that must be preserved.

Key Insight: The drift-aware change set reveals that:

  • Previous template: 128 MB (original deployment)
  • Live resource state: 512 MB (emergency change during outage)
  • Proposed template: 256 MB (new deployment)

This would cause a dangerous reduction from 512 MB to 256 MB, potentially recreating the original performance issue. Without drift-aware change sets, this critical information would be hidden.

5. Recreate Drift-aware Change Set with Updated Template (512MB) to Resolve Drift

Update the template to match the live production state (512 MB) and create a new drift-aware change set to safely resolve the drift.

Figure 7

Resolution confirmed: The drift-aware change set shows both Live resource state and Proposed resource state at 512 MB, with change set action ” Sync with live”. This verifies that the updated template now matches production, preventing the dangerous memory reduction and safely resolving the drift without impacting performance.

CloudFormation Templates

Initial Template (128 MB):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 128
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Updated Template (256 MB – lambda-memory-drift-scenario-256mb.yaml):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 256
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name lambda-memory-drift-test --template-body file://lambda-memory-drift-scenario.yaml --capabilities CAPABILITY_IAM --region us-east-1
  1. Get function name:
aws cloudformation describe-stack-resources --stack-name lambda-memory-drift-test --logical-resource-id DriftTestFunction --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name lambda-memory-drift-test --change-set-name detect-memory-overwrite --template-body file://lambda-memory-drift-scenario-256mb.yaml --deployment-mode REVERT_DRIFT --capabilities CAPABILITY_IAM --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name detect-memory-overwrite --stack-name lambda-memory-drift-test --region us-east-1

Scenario 2: Remediate Unauthorized Changes

This scenario demonstrates how drift-aware change sets systematically remediate unauthorized changes when a developer adds temporary debugging rules to a security group but forgets to remove them, creating a compliance violation.

Story: Your team deploys a security group with only HTTP access via CloudFormation for compliance. During debugging, a developer adds SSH access (port 22) through the AWS Console for their IP address to troubleshoot an application issue. They forget to remove this rule after debugging. Later, security compliance requires reverting to the original template state. A standard change set shows no changes since the template is unchanged, but a drift-aware change set can detect and systematically remove the unauthorized SSH rule.

User journey: Create stack with HTTP-only access => Add SSH rule via console for debugging => Forget to remove SSH rule => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing SSH rule removal => Execute change set to restore compliance

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with security group allowing only HTTP traffic.

Figure 8

CloudFormation stack “sg-revert-drift-test” successfully deployed with DriftTestSecurityGroup resource

2. Make Unauthorized Changes (Console)

Manually add SSH ingress rule through AWS Console (simulating developer debugging access that wasn’t removed).

Figure 9: http only

Initial security group showing only HTTP (port 80) access as configured in template – compliant state

Figure 10: ssh-added

Security group now shows 2 permission entries: SSH (port 22) for specific IP and HTTP (port 80) for all traffic. The SSH rule creates drift and a compliance violation that needs systematic removal.

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to systematically remove the unauthorized SSH rule.

Figure 11

Creating drift-aware change set for security group compliance restoration. Note the “Drift aware change set” option is selected to compare with live state and detect unauthorized changes.

aws cloudformation create-change-set \
--stack-name sg-revert-drift-test \
--change-set-name revert-ssh-drift \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Systematic Compliance Restoration

Examine the drift-aware change set to see systematic removal of unauthorized SSH rule.

Figure 12

Compliance violation detected: The drift -aware change set shows that the SSH rule in the live resource state (rule 232 for IP 15.248.7.53/32 on port 22) is not present in the proposed resource state derived from the template. This unauthorized SSH rule violates security policy and will be systematically removed

Key Insight: The drift-aware change set enables systematic compliance restoration by:

  • Previous template: Only HTTP (port 80) access – compliant state
  • Live resource state: HTTP + SSH (port 22) for 15.248.7.53/32 – compliance violation
  • Action: Remove unauthorized SSH rule to restore compliance

This provides a systematic, auditable way to remove unauthorized changes rather than manual cleanup.

Figure 13

Stack events showing successful execution of the drift-aware change set – SSH rule removed

CloudFormation Templates

security-group-drift-scenario.yaml:

Resources:
  DriftTestSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Security group for drift testing"
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
          Description: "Allow HTTP traffic for demo purposes"
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: 0.0.0.0/0
          Description: "Allow all outbound traffic"

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name sg-revert-drift-test --template-body file://security-group-drift-scenario.yaml --region us-east-1
  1. Get security group ID:
aws ec2 describe-security-groups --filters "Name=tag:aws:cloudformation:stack-name,Values=sg-revert-drift-test" --query 'SecurityGroups[0].GroupId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name sg-revert-drift-test --change-set-name revert-ssh-drift --template-body file://security-group-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name revert-ssh-drift --stack-name sg-revert-drift-test --region us-east-1

Scenario 3: Recreate Deleted Resources

This scenario demonstrates drift detection when a dependent resource (logs bucket) is accidentally deleted outside of CloudFormation during troubleshooting. The main application bucket depends on this logs bucket for access logging. You need to recreate the deleted resource while maintaining the existing infrastructure dependencies.

Story: Your team deploys a main S3 bucket with a dependent logs bucket for access logging via CloudFormation. During troubleshooting, an operator accidentally deletes the logs bucket through the AWS Console. The main bucket still exists but its logging configuration now references a non-existent bucket. You need to recreate the deleted logs bucket while maintaining the dependency relationship.

User journey: Create stack with main and logs buckets => Accidentally delete logs bucket => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing LogBucket will be recreated => Execute change set to restore deleted resource

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with main S3 bucket and dependent logs bucket.

Figure 14

CloudFormation stack “s3-deletion-drift-test” successfully deployed with both LogBucket and MainBucket resources in CREATE_COMPLETE status

2. Accidental Deletion (Console)

Manually delete the logs bucket through AWS Console (simulating accidental deletion during troubleshooting).

Figure 15

LogBucket accidentally deleted outside of CloudFormation during troubleshooting, creating drift – the MainBucket still exists but its logging configuration now references a non-existent bucket

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to recreate the deleted LogBucket.

Figure 16

Creating drift-aware change set with “Drift aware change set” option selected to detect and recreate the deleted resource by comparing template with live state

aws cloudformation create-change-set \
--stack-name s3-deletion-drift-test \
--change-set-name recreate-deleted-bucket \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Resource Recreation

Examine change set to see LogBucket recreation while preserving MainBucket dependencies.

Figure 17

Change set preview showing LogBucket will be recreated to restore the deleted resource and MainBucket updated to maintain infrastructure dependencies

Key Insight: The drift-aware change set detects that:

  • Template expectation: Both LogBucket and MainBucket should exist
  • Live resource state: Only MainBucket exists, LogBucket is missing
  • Action: Recreate LogBucket with original configuration to restore logging functionality

This enables systematic recovery of accidentally deleted resources while maintaining infrastructure dependencies.

CloudFormation Templates

s3-drift-scenario.yaml:

Resources:
  LogBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
  
  MainBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
      LoggingConfiguration:
        DestinationBucketName: !Ref LogBucket

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name s3-deletion-drift-test --template-body file://s3-drift-scenario.yaml --region us-east-1
  1. Get LogBucket name:
aws cloudformation describe-stack-resources --stack-name s3-deletion-drift-test --logical-resource-id LogBucket --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name s3-deletion-drift-test --change-set-name recreate-deleted-bucket --template-body file://s3-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name recreate-deleted-bucket --stack-name s3-deletion-drift-test --region us-east-1

Best Practices

When working with drift-aware change sets, consider these best practices:

Always review three-way comparisons before executing change sets to understand the full impact

Use REVERT_DRIFT deployment mode when you want to bring resources back to template compliance

Document emergency changes made outside of CloudFormation to inform future template updates

Implement change management processes to minimize unauthorized drift

Regular drift detection helps identify configuration changes before they become problematic

Test drift-aware change sets in non-production environments first

Cleanup

Important: Execute these cleanup commands promptly after completing the scenarios to avoid incurring unnecessary AWS charges. Resources such as Lambda functions, S3 buckets (even if empty), and security groups may incur costs if left running. Ensure all stacks are successfully deleted by verifying the DELETE_COMPLETE status.

Commands to delete all test resources:

# Scenario 1: Lambda Memory Drift
aws cloudformation delete-stack --stack-name lambda-memory-drift-test --region us-east-1

# Scenario 2: Security Group Drift
aws cloudformation delete-stack --stack-name sg-revert-drift-test --region us-east-1

# Scenario 3: S3 Bucket Deletion Drift
aws cloudformation delete-stack --stack-name s3-deletion-drift-test --region us-east-1

# Verify all stacks are deleted
aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE --region us-east-1

Note: CloudFormation will automatically clean up all resources created by the stacks, including Lambda functions, security groups, and S3 buckets.

Conclusion

Drift-aware change sets enable you to mitigate the operational and security risks of configuration drift, allowing you to confidently automate and govern your infrastructure updates with CloudFormation. Through the scenarios described in this post, you have seen how you can leverage drift-aware change sets to prevent outages in production environments, maintain the integrity of your test environments, and manage the compliance posture of all environments. Remember to thoroughly review the infrastructure changes previewed by drift-aware change sets before executing deployments.

Available Now

Drift-aware change sets are available in AWS Regions where CloudFormation is available. Please refer to the AWS Region table to learn more.

Accelerate infrastructure development with CloudFormation pre-deployment validation and simplified troubleshooting

Post Syndicated from Idriss Laouali Abdou original https://aws.amazon.com/blogs/devops/accelerate-infrastructure-development-with-cloudformation-pre-deployment-validation-and-simplified-troubleshooting/

AWS CloudFormation makes it easy to model and provision your cloud application infrastructure as code. CloudFormation templates can be written directly in JSON or YAML, or they can be generated by tools like the AWS Cloud Development Kit (CDK). Resources are created and managed by CloudFormation as units called Stacks. Additionally, change set enable you to preview the stack changes before deployment.

CloudFormation now offers powerful new features that transform how you develop and troubleshoot infrastructure as code, pre-deployment validation that catches errors in seconds, enhanced operation tracking, and simplified failure debugging. These capabilities shift-left infrastructure code validation, helping you prevent infrastructure deployment failures that impacts development velocity.

In this blog post, we’ll explore how these new features accelerate development cycles by catching common errors during change set creation and providing precise troubleshooting through operation tracking and failure filtering. Whether you’re a platform engineer managing complex multi-service deployments or a developer iterating on infrastructure templates, we’ll show you how to:

  • Validate resource properties and detect naming conflicts before deployment
  • Prevent deployment failures by checking S3 bucket emptiness before deletion operations
  • Track operations with unique IDs for focused troubleshooting
  • Quickly identify root causes using the new describe-events API

This comprehensive guide will walk through real-world scenarios demonstrating how these capabilities can reduce infrastructure deployment failures from hours of debugging to seconds of validation, helping you deliver cloud infrastructure faster and more reliably.

Key Capabilities

  • Pre-deployment Validation: Catch template errors instantly instead of discovering them after resource provisioning attempts. These include pre-deployment validation for resource property syntax errors, resource naming conflicts for existing resources in your account, and S3 bucket emptiness constraint violations on delete operations.
  • Operation Tracking: Say goodbye to long debugging sessions. Each stack action now comes with a unique Operation ID, transforming the “needle in haystack” troubleshooting experience into precise, targeted problem-solving.
  • Streamlined Events API for simplified Debugging: Use the new describe-events API and FailedEvents=true filter to instantly pinpoint issues. One command tells you exactly what went wrong, eliminating the need to scroll through endless logs.
  • Immediate Feedback: Transform your CI/CD pipeline from a potential bottleneck into a rapid iteration engine. Get immediate feedback on common deployment issues, allowing your team to fix and deploy faster than ever before.

How It works

Pre-deployment Validation

The following scenarios show how you can leverage CloudFormation pre-deployment validation to detect property syntax errors, resource naming conflicts, and constraint violations during change set creation.

Understanding Validation Modes
CloudFormation pre-deployment validation operates in two modes that determine how validation failures are handled.

  • FAIL mode prevents change set execution when validation detects errors, ensuring problematic templates cannot proceed to deployment. This applies to property syntax errors and resource naming conflicts.
  • WARN mode allows change set creation to succeed despite validation failures, providing warnings that developers can review and address before execution. This applies to constraint violations like S3 bucket emptiness that may be resolvable through manual intervention.

Understanding these modes helps you anticipate whether validation issues will block your deployment workflow or simply require attention before execution.

Let’s walk you through practical scenarios:

Scenario 1: Validate Resource Property Syntax

CloudFormation evaluates each resource property definition or value before provisioning begins. The following example illustrates several common resource property errors:

  1. The “AWS::Lambda::Function” Role property requires an ARN pattern.
  2. The “AWS::Lambda::Function” Timeout property expects an integer instead of a string.
  3. The “AWS::Lambda::Function” TracingConfig.Mode nested property ENUM value is invalid.
  4. The “AWS::Lambda::Alias” Name property is required but not defined.
  5. The “AWS::Lambda::Alias” the extra property Description in a nested path RoutingConfig.AdditionalVersionWeights.0 is not supported.

Prior to this launch, these resource configuration errors would be detected at the resource provisioning time only. However, with the pre-deployment validations feature, these errors can be identified ahead of the deployment phase, streamlining the development-test lifecycle efficiency and minimizing rollbacks during deployments.

Template

AWSTemplateFormatVersion: "2010-09-09"

Description: This template demonstrates how pre-deployment validation and enhanced troubleshooting work

Resources:
  MyLambdaFunction:
    Type: "AWS::Lambda::Function"
    Properties:
      FunctionName: "dev-test"
      Role: 'MyRole'          #1. Non-matching pattern
      Runtime: "python3.11"
      Handler: "index.lambda_handler"
      Code:
        ZipFile: |
          import json
          
          def lambda_handler(event, context):
              return {
                  'statusCode': 200,
                  'body': json.dumps('Hello from Lambda!')
              }
      Timeout: "30s"          #2. Type mismatch
      MemorySize: 128
      TracingConfig:
        Mode: "DISABLED"       #3. Invalid ENUM

  MyCandidateReleaseVersion:
    Type: "AWS::Lambda::Version"
    Properties:
      FunctionName: !Ref "MyLambdaFunction"
      Description: "v2"

  MyLambdaAlias:
    Type: AWS::Lambda::Alias
    Properties:
                              #4. Missing required property "Name"
      FunctionName: !Ref "MyLambdaFunction"
      FunctionVersion: "$LATEST"
      RoutingConfig:
        AdditionalVersionWeights:
          - FunctionVersion: !GetAtt "MyCandidateReleaseVersion.Version"
            FunctionWeight: 0.1
            Description: "10% traffic to the new version" #5. Unsupported property


Step 1: Create Change Set

Console
Create a new stack using the change set creation flow, provide the template and all required parameters.

CloudFormation Create change set

Figure 1: Create a change set view

CLI Command

aws cloudformation create-change-set \
    --stack-name "dev-lambda-stack" \
    --change-set-name "updateAlias" \
    --change-set-type "CREATE" \
    --template-body file://lambda-with-alias-template.yaml

Step 2: Check Change Set Status
To review the status of your change set

Console

Figure 2: Describe change set status

Figure 2: Describe change set status

CLI command

aws cloudformation describe-change-set \
  --change-set-name "arn:aws:cloudformation:us-west-2:123456789012:changeSet/updateAlias/94498df5-1afb-43b1-9869-9f82b2d877ac"
{
  "ChangeSetName": "updateAlias",
  "ChangeSetId": "arn:aws:cloudformation:us-west-2:123456789012:changeSet/updateAlias/94498df5-1afb-43b1-9869-9f82b2d877ac",
  "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
  "StackName": "dev-lambda-stack",
  "CreationTime": "2025-11-06T21:40:13.333000+00:00",
 <strong> "ExecutionStatus": "UNAVAILABLE",
  "Status": "FAILED",
  "StatusReason": "The following hook(s)/validation failed: [AWS::EarlyValidation::PropertyValidation]. To troubleshoot Early Validation errors, use the DescribeEvents API for detailed failure information.",
  "NotificationARNs": [],</strong>
  "RollbackConfiguration": {},
  "Capabilities": [],
  "Changes": [
    {
      "Type": "Resource",
      "ResourceChange": {
        "Action": "Add",
        "LogicalResourceId": "MyCandidateReleaseVersion",
        "ResourceType": "AWS::Lambda::Version",
        "Scope": [],
        "Details": []
      }
    },
    {
      "Type": "Resource",
      "ResourceChange": {
        "Action": "Add",
        "LogicalResourceId": "MyLambdaAlias",
        "ResourceType": "AWS::Lambda::Alias",
        "Scope": [],
        "Details": []
      }
    },
    {
      "Type": "Resource",
      "ResourceChange": {
        "Action": "Add",
        "LogicalResourceId": "MyLambdaFunction",
        "ResourceType": "AWS::Lambda::Function",
        "Scope": [],
        "Details": []
      }
    }
  ],
  "IncludeNestedStacks": false
}

You can see the status of the change set is failed with a detailed status reason. You can now proceed to review the change set validation results.

Step 3: Review validation results

Console

With the console, you can review multiple validation errors in a single interface. When you click on a validation, CloudFormation pinpoints the location of the invalid property error in your template.

Figure 3: Pre-deployment validations view

Figure 3: Pre-deployment validations view

Use Case: Invalid ENUM value for nested property
Catching invalid configuration values before deployment. This demonstrates validation of nested properties like TracingConfig.Mode. The tool helpfully shows the supported values “Active” & “Pass through” as well as the provided invalid value “DISABLED”.

Figure 4: CloudFormation Validation of Invalid ENUM value for nested property
Figure 4: Validation of Invalid ENUM value for nested property

Use Case: Lambda Function Timeout property type mismatch
Preventing type-related deployment failures. Shows how validation catches string values (“30s”) where integers are required, saving developers from runtime errors.

Figure 5: Validation of Lambda Function Timeout property type mismatch
Figure 5: Validation of Lambda Function Timeout property type mismatch

Use Case: Lambda Function Role property pattern mismatch
Validating ARN format requirements. Demonstrates pattern validation ensuring Role properties match required ARN format.

Figure 6: Lambda Function Role property pattern mismatch

Figure 6: Lambda Function Role property pattern mismatch

Use Case: Undefined required Lambda Alias Name property
Catching missing required properties. Shows validation detecting absent mandatory fields, preventing incomplete resource definitions from reaching deployment.

Figure 7: Validation of undefined required Lambda Alias Name property
Figure 7: Validation of undefined required Lambda Alias Name property

Notice how the validation Path field (e.g., “/Resources/MyLambdaFunction/Properties/TracingConfig/Mode”) pinpoints the exact template location of each error. This eliminates manual searching through hundreds of lines of infrastructure code – a common time sink that can take minutes in complex templates.

Use case: Unsupported property
Shows how CloudFormation validation catches unsupported properties. In this example, the AWS::Lambda::Alias resource had an unsupported extra property Description in a nested path RoutingConfig.AdditionalVersionWeights.0.

Figure 8: CloudFormation validation of unsupported resource property

Figure 8: CloudFormation validation of unsupported resource property

CLI command
You can also use the new describe-events API to review the validation responses.

aws cloudformation describe-events \
  --change-set-id "arn:aws:cloudformation:us-west-2:123456789012:changeSet/updateAlias/94498df5-1afb-43b1-9869-9f82b2d877ac"
{
  "OperationEvents": [
    {
      "EventId": "d3221796-d6a4-40c3-a987-93b103e7fcc1",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "94498df5-1afb-43b1-9869-9f82b2d877ac",
      "OperationType": "CREATE_CHANGESET",
      "OperationStatus": "FAILED",
      "EventType": "STACK_EVENT",
      "Timestamp": "2025-11-06T21:40:18.428000+00:00",
      "StartTime": "2025-11-06T21:40:13.399000+00:00",
      "EndTime": "2025-11-06T21:40:18.428000+00:00"
    },
    {
      "EventId": "87b628b4-fbcb-42b0-bf07-779007bf0d85",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "94498df5-1afb-43b1-9869-9f82b2d877ac",
      "OperationType": "CREATE_CHANGESET",
      "EventType": "VALIDATION_ERROR",
      "LogicalResourceId": "MyLambdaFunction",
      "PhysicalResourceId": "",
      "ResourceType": "AWS::Lambda::Function",
      "Timestamp": "2025-11-06T21:40:18.163000+00:00",
      "ValidationFailureMode": "FAIL", "ValidationName": "PROPERTY_VALIDATION", "ValidationStatus": "FAILED", "ValidationStatusReason": "DISABLED is not a valid enum value. Supported values: [Active, PassThrough]", "ValidationPath": "/Resources/MyLambdaFunction/Properties/TracingConfig/Mode" },
    {
      "EventId": "2f89cf64-e810-4285-8936-b77f7b72228c",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "94498df5-1afb-43b1-9869-9f82b2d877ac",
      "OperationType": "CREATE_CHANGESET",
      "EventType": "VALIDATION_ERROR",
      "LogicalResourceId": "MyLambdaFunction",
      "PhysicalResourceId": "",
      "ResourceType": "AWS::Lambda::Function",
      "Timestamp": "2025-11-06T21:40:18.163000+00:00",
      "ValidationFailureMode": "FAIL", "ValidationName": "PROPERTY_VALIDATION", "ValidationStatus": "FAILED", "ValidationStatusReason": "Property [Timeout] expected type: Integer, found: String", "ValidationPath": "/Resources/MyLambdaFunction/Properties/Timeout"    },
    {
      "EventId": "b2448484-4e41-4c53-b19e-6355dafeac6b",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "94498df5-1afb-43b1-9869-9f82b2d877ac",
      "OperationType": "CREATE_CHANGESET",
      "EventType": "VALIDATION_ERROR",
      "LogicalResourceId": "MyLambdaAlias",
      "PhysicalResourceId": "",
      "ResourceType": "AWS::Lambda::Alias",
      "Timestamp": "2025-11-06T21:40:18.134000+00:00",
     "ValidationFailureMode": "FAIL", "ValidationName": "PROPERTY_VALIDATION", "ValidationStatus": "FAILED", "ValidationStatusReason": "Required property [Name] not found", "ValidationPath": "/Resources/MyLambdaAlias/Properties"   },
    {
      "EventId": "694e94f0-a2f1-49fd-8045-545a9cb41ca9",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "94498df5-1afb-43b1-9869-9f82b2d877ac",
      "OperationType": "CREATE_CHANGESET",
      "EventType": "VALIDATION_ERROR",
      "LogicalResourceId": "MyLambdaAlias",
      "PhysicalResourceId": "",
      "ResourceType": "AWS::Lambda::Alias",
      "Timestamp": "2025-11-06T21:40:18.132000+00:00",
      "ValidationFailureMode": "FAIL",
      "ValidationName": "PROPERTY_VALIDATION",
      "ValidationStatus": "FAILED",
      "ValidationStatusReason": "Unsupported property [Description]",
      "ValidationPath": "/Resources/MyLambdaAlias/Properties/RoutingConfig/AdditionalVersionWeights/0"
    },
    {
      "EventId": "935cbd72-a637-4ad5-908d-e2ce241022ad",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "94498df5-1afb-43b1-9869-9f82b2d877ac",
      "OperationType": "CREATE_CHANGESET",
      "EventType": "VALIDATION_ERROR",
      "LogicalResourceId": "MyLambdaFunction",
      "PhysicalResourceId": "",
      "ResourceType": "AWS::Lambda::Function",
      "Timestamp": "2025-11-06T21:40:18.126000+00:00",
     "ValidationFailureMode": "FAIL", "ValidationName": "PROPERTY_VALIDATION", "ValidationStatus": "FAILED", "ValidationStatusReason": "Property value [MyRole] does not match pattern: ^arn:(aws[a-zA-Z-]*)?:iam::\\d{12}:role/?[a-zA-Z_0-9+=,.@\\-_/]+$", "ValidationPath": "/Resources/MyLambdaFunction/Properties/Role"    },
    {
      "EventId": "c4d25b22-9e8f-42f9-bd2e-3391b9bdacbd",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "94498df5-1afb-43b1-9869-9f82b2d877ac",
      "OperationType": "CREATE_CHANGESET",
      "OperationStatus": "IN_PROGRESS",
      "EventType": "STACK_EVENT",
      "Timestamp": "2025-11-06T21:40:13.399000+00:00",
      "StartTime": "2025-11-06T21:40:13.399000+00:00"
    }
  ]
}

Scenario 2: Resource Name Conflict Validation
Resource name conflict validation makes sure that new resources added to a template are not already present in your AWS account or globally (e.g: Amazon S3, Amazon Route 53 DNS), preventing deployment errors caused due to resource name conflicts

After reviewing the property validation exceptions, let’s assume that you resolved all the issues and successfully deployed the stack. Next, the you have decided to include a S3 bucket resource in the template. You name the bucket “dev-thumbnails” but didn’t verify if the bucket with this name already exists. If a bucket with this name already exists, the CreateChangeSet operation will fail, reporting to the developer that the bucket already exists.

...

  MyDevThumbnailsBucket:
    Type: "AWS::S3::Bucket"
    Properties:
      BucketName: "dev-thumbnails"

Step 1: Create Change Set

aws cloudformation create-change-set \                  
    --stack-name "dev-lambda-stack" \
    --change-set-name "addBucket" \ 
    --template-body file://lambda-with-alias-template.yaml | jq .

Step 2: Review Deployment Validations
Use CloudFormation change set console to review validations response or use the new DescribeEvents API in the CLi.

Figure 8: Resource name conflict validation
Figure 9: Resource name conflict validation

CLI Command

aws cloudformation describe-events \
    --change-set-name "arn:aws:cloudformation:us-west-2:123456789012:changeSet/addBucket/eafcdb2b-e018-4e0f-9e87-86b251f4eac5"
{
  "OperationEvents": [
    {
      "EventId": "e6049394-30e4-466d-9fb4-b5f525144058",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "eafcdb2b-e018-4e0f-9e87-86b251f4eac5",
      "OperationType": "CREATE_CHANGESET",
      "OperationStatus": "FAILED",
      "EventType": "STACK_EVENT",
      "Timestamp": "2025-11-06T21:58:49.872000+00:00",
      "StartTime": "2025-11-06T21:58:44.252000+00:00",
      "EndTime": "2025-11-06T21:58:49.872000+00:00"
    },
    {
      "EventId": "bca310c3-61e6-4478-9b0a-3a89f816aec0",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "eafcdb2b-e018-4e0f-9e87-86b251f4eac5",
      "OperationType": "CREATE_CHANGESET",
      "EventType": "VALIDATION_ERROR",
      "LogicalResourceId": "MyDevThumbnailsBucket",
      "PhysicalResourceId": "",
      "ResourceType": "AWS::S3::Bucket",
      "Timestamp": "2025-11-06T21:58:49.606000+00:00",
      "ValidationFailureMode": "FAIL", "ValidationName": "NAME_CONFLICT_VALIDATION", "ValidationStatus": "FAILED", "ValidationStatusReason": "Resource of type 'AWS::S3::Bucket' with identifier 'dev-thumbnails' already exists.", "ValidationPath": "/Resources/MyDevThumbnailsBucket"   },
    {
      "EventId": "8158f79f-ee58-4c3b-b3eb-3beace064139",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "eafcdb2b-e018-4e0f-9e87-86b251f4eac5",
      "OperationType": "CREATE_CHANGESET",
      "OperationStatus": "IN_PROGRESS",
      "EventType": "STACK_EVENT",
      "Timestamp": "2025-11-06T21:58:44.252000+00:00",
      "StartTime": "2025-11-06T21:58:44.252000+00:00"
    }
  ]
}

Scenario 3: S3 bucket not empty
Since AWS S3 service does not allow customers to delete S3 Buckets when there are objects in them, the new pre-deployment validations will warn you if you try to delete a bucket that is not empty.

Resuming our journey, let’s assume that you fix the name conflict issue by renaming the bucket to “dev-test-tumbnails”, and then updates the stack. After testing the lambda function’s integration with S3, the dev-cycle generated a few thumbnail objects in the S3 bucket.

Later, you decide to fix the bucket name because you notice a typo: “dev-test-tumbnails” should be “dev-test-thumbnails” (missing “h”). When you update the template to use the corrected name, CloudFormation will need to create the new bucket then delete the old one during the clean-up phase.

Step 1: Create Change Set

aws cloudformation create-change-set \                  
    --stack-name "dev-lambda-stack" \
    --change-set-name "renameBucket" \ 
    --template-body file://lambda-with-alias-template.yaml | jq .

Step 2: Review Validation

Use CloudFormation change set console to review validations response or use the new DescribeEvents API in the CLI.

Figure 9: S3 bucket emptiness on delete operation validation

Figure 10: S3 bucket emptiness on delete operation validation

CLI Command

aws cloudformation describe-events \
    --change-set-name "arn:aws:cloudformation:us-west-2:123456789012:changeSet/addBucket/eafcdb2b-e018-4e0f-9e87-86b251f4eac5"
{
  "OperationEvents": [
    {
      "EventId": "24920e0f-1941-45a5-9177-786bc805b724",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "8fef2b60-b411-4d0e-920e-7ec7c7aa39f2",
      "OperationType": "CREATE_CHANGESET",
      "OperationStatus": "SUCCEEDED",
      "EventType": "STACK_EVENT",
      "Timestamp": "2025-11-06T22:52:26.355000+00:00",
      "StartTime": "2025-11-06T22:52:21.071000+00:00",
      "EndTime": "2025-11-06T22:52:26.355000+00:00"
    },
    {
      "EventId": "c117e02d-a652-4755-9586-6d4ccb0f6504",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "8fef2b60-b411-4d0e-920e-7ec7c7aa39f2",
      "OperationType": "CREATE_CHANGESET",
      "EventType": "VALIDATION_ERROR",
      "LogicalResourceId": "MyDevThumbnailsBucket",
      "PhysicalResourceId": "",
      "ResourceType": "AWS::S3::Bucket",
      "Timestamp": "2025-11-06T22:52:25.960000+00:00",
      "ValidationFailureMode": "WARN", "ValidationName": "BUCKET_EMPTINESS_VALIDATION", "ValidationStatus": "FAILED", "ValidationStatusReason": "The bucket 'dev-tumbnails' is not empty. You must either delete all objects and versions or use the deletion policy to retain it, otherwise the delete operation will fail.", "ValidationPath": "/Resources/MyDevThumbnailsBucket"
    },
    {
      "EventId": "6c66ff53-6751-4b4c-96b8-d1a33fc43b4f",
      "StackId": "arn:aws:cloudformation:us-west-2:123456789012:stack/dev-lambda-stack/2d2c3240-bb59-11f0-b080-0613dc96740d",
      "OperationId": "8fef2b60-b411-4d0e-920e-7ec7c7aa39f2",
      "OperationType": "CREATE_CHANGESET",
      "OperationStatus": "IN_PROGRESS",
      "EventType": "STACK_EVENT",
      "Timestamp": "2025-11-06T22:52:21.071000+00:00",
      "StartTime": "2025-11-06T22:52:21.071000+00:00"
    }
  ]
}

Bucket emptiness validation uses WARN mode, which allows change set creation to succeed even when the validation check fails. This gives you time to review and empty the bucket before execution. However, if you execute the change set without emptying the bucket, the delete operation will fail.

Notice in the output above:

  • ValidationStatus: "FAILED" – The emptiness check detected objects in the bucket
  • ValidationFailureMode: "WARN" – This is a warning, not a blocking error
  • OperationStatus: "SUCCEEDED" – Change set creation completed successfully despite the warning

This design allows you to review the warning, take corrective action (such as emptying the bucket), and then proceed with execution.

Beyond catching errors early, these capabilities also transform how you troubleshoot failed deployments with enhanced operation tracking and filtering.

New DescribeEvents API with Operation IDs and root cause filtering

The new DescribeEvents API retrieves CloudFormation events based on flexible query criteria. It groups stack operations by operation ID, enabling you to focus specifically on individual stack operations involved during your stack deployment.

Operation: An operation is any action performed on a stack, including stack lifecycle actions (Create, Update, Delete, Rollback), change set creation, nested stack creation, and automatic rollbacks triggered by failures. Each operation has a unique identifier and represents a discrete change attempt on the stack.

Figure 10: Stack Events grouped by Operation Id

 Figure 11: Stack Events grouped by Operation Id

Scenario
When an update operation on an existing stack fails and results in a rollback, and you want to understand the reason behind the update stack failure. Using the operation ID obtained from the update stack response or from the describe stacks response, you can call describe events to get details on the failure.

Step 1: Update Stack

aws cloudformation update-stack \
 --stack-name test-1106 \
 --template-body file://test-1106-update.yaml
Output:
{
    "StackId": "arn:aws:cloudformation:us-west-2:012345678901:stack/test-1106/07580010-bb79-11f0-8f6c-0289bb5c804f",
    "OperationId": "1c211b5a-4538-4dc9-bfed-e07734371e57"
}

Step 2: Review stack status with describe stacks

The stack description available via describe-stacks API now includes LastOperations information showing recent operation IDs and their types. This enables you to quickly identify which operations occurred and their current status without parsing through event logs.

Figure 11: CloudFormation Stack Info page showing new operation IDs
Figure 11: CloudFormation Stack Info page showing new operation IDs

CLI Command

aws cloudformation describe-stacks \
 --stack-name test-1106
{
    "Stacks": [
        {
            "StackId": "arn:aws:cloudformation:us-west-2:012345678901:stack/test-1106/07580010-bb79-11f0-8f6c-0289bb5c804f",
            "StackName": "test-1106",
            "Description": "A simple CloudFormation template to create an S3 bucket.",
            "CreationTime": "2025-11-07T01:28:13.778000+00:00",
            "LastUpdatedTime": "2025-11-07T01:43:39.838000+00:00",
            "RollbackConfiguration": {},
            "StackStatus": "UPDATE_ROLLBACK_COMPLETE",
            "DisableRollback": false,
            "NotificationARNs": [],
            "Tags": [],
            "EnableTerminationProtection": false,
            "DriftInformation": {
                "StackDriftStatus": "NOT_CHECKED"
            },
            "LastOperations": [ { "OperationType": "ROLLBACK", "OperationId": "d0f12313-7bdb-414d-a879-828a99b36f29" }, { "OperationType": "UPDATE_STACK", "OperationId": "1c211b5a-4538-4dc9-bfed-e07734371e57" }
            ]
        }
    ]
}

Step 3: Review operation status with describe events API and operation id
Using the operation ID from the previous step, you can now query specific operation events to understand exactly what happened during that operation. This targeted approach eliminates the need to search through all stack events to find relevant information.

Figure 12: New CloudFormation stack operation pageFigure 12: New CloudFormation stack operation page

CLI Command

aws cloudformation describe-stacks \
 --stack-name test-1106
{
    "OperationEvents": [
        {
            "EventId": "76358afe-01ff-45e1-bf4d-8b89109aca57",
            "StackId": "arn:aws:cloudformation:us-west-2:012345678901:stack/test-1106/07580010-bb79-11f0-8f6c-0289bb5c804f",
 "OperationId": "1c211b5a-4538-4dc9-bfed-e07734371e57",             "OperationType": "UPDATE_STACK",
            "OperationStatus": "FAILED",
            "EventType": "STACK_EVENT",
            "Timestamp": "2025-11-07T01:43:44.322000+00:00",
            "StartTime": "2025-11-07T01:43:39.820000+00:00",
            "EndTime": "2025-11-07T01:43:44.322000+00:00"
        },
        {
            "EventId": "01fcd898-38f3-477d-891d-e950d964d594",
            "StackId": "arn:aws:cloudformation:us-west-2:012345678901:stack/test-1106/07580010-bb79-11f0-8f6c-0289bb5c804f",
 "OperationId": "1c211b5a-4538-4dc9-bfed-e07734371e57",             "EventType": "PROVISIONING_ERROR",
            "LogicalResourceId": "MyS3Bucket",
            "PhysicalResourceId": "test-1106-bucket",
            "ResourceType": "AWS::S3::Bucket",
            "Timestamp": "2025-11-07T01:43:43.561000+00:00",
            "ResourceStatus": "UPDATE_FAILED",
            "ResourceStatusReason": "The target bucket for logging does not exist (Service: Amazon S3; Status Code: 400; Error Code: InvalidTargetBucketForLogging; Request ID: ZQAPTT7646A9GQ0H; S3 Extended Request ID: 5Cl/xSAfQgs8UJ7rdq4EvsJT8pxnYLZlc3FzTgpQCxZlukoIiWYXkuds6xDzkmpurH+6epy2s9g7Ro7XN4ZFoQ==; Proxy: null)",
            "ResourceProperties": "{\"BucketName\":\"test-1106-bucket\",\"LoggingConfiguration\":{\"LogFilePrefix\":\"access-logs/\",\"DestinationBucketName\":\"logs-1106-bucket\"},\"LifecycleConfiguration\":{\"Rules\":[{\"Status\":\"Enabled\",\"ExpirationInDays\":\"90\",\"Id\":\"DeleteOldVersions\"}]},\"Tags\":[{\"Value\":\"Development\",\"Key\":\"Environment\"},{\"Value\":\"CloudFormationDemo\",\"Key\":\"Project\"}]}"
        },
        {
            "EventId": "2976d65e-44cc-4674-b771-a22d86a7d3f8",
            "StackId": "arn:aws:cloudformation:us-west-2:012345678901:stack/test-1106/07580010-bb79-11f0-8f6c-0289bb5c804f",
 "OperationId": "1c211b5a-4538-4dc9-bfed-e07734371e57",             "EventType": "PROGRESS",
            "LogicalResourceId": "MyS3Bucket",
            "PhysicalResourceId": "test-1106-bucket",
            "ResourceType": "AWS::S3::Bucket",
            "Timestamp": "2025-11-07T01:43:43.034000+00:00",
            "ResourceStatus": "UPDATE_IN_PROGRESS",
            "ResourceProperties": "{\"BucketName\":\"test-1106-bucket\",\"LoggingConfiguration\":{\"LogFilePrefix\":\"access-logs/\",\"DestinationBucketName\":\"logs-bucket\"},\"LifecycleConfiguration\":{\"Rules\":[{\"Status\":\"Enabled\",\"ExpirationInDays\":\"90\",\"Id\":\"DeleteOldVersions\"}]},\"Tags\":[{\"Value\":\"Development\",\"Key\":\"Environment\"},{\"Value\":\"CloudFormationDemo\",\"Key\":\"Project\"}]}"
        },
        {
            "EventId": "daf7e299-df02-4eab-b3e9-11a4659f789f",
            "StackId": "arn:aws:cloudformation:us-west-2:012345678901:stack/test-1106/07580010-bb79-11f0-8f6c-0289bb5c804f",
 "OperationId": "1c211b5a-4538-4dc9-bfed-e07734371e57",             "EventType": "PROGRESS",
            "LogicalResourceId": "test-1106",
            "PhysicalResourceId": "arn:aws:cloudformation:us-west-2:012345678901:stack/test-1106/07580010-bb79-11f0-8f6c-0289bb5c804f",
            "ResourceType": "AWS::CloudFormation::Stack",
            "Timestamp": "2025-11-07T01:43:39.838000+00:00",
            "ResourceStatus": "UPDATE_IN_PROGRESS",
            "ResourceStatusReason": "User Initiated"
        },
        {
            "EventId": "0b1ebf05-4496-4a8c-978e-7c081def3e4d",
            "StackId": "arn:aws:cloudformation:us-west-2:012345678901:stack/test-1106/07580010-bb79-11f0-8f6c-0289bb5c804f",
 "OperationId": "1c211b5a-4538-4dc9-bfed-e07734371e57",             "OperationType": "UPDATE_STACK",
            "OperationStatus": "IN_PROGRESS",
            "EventType": "STACK_EVENT",
            "Timestamp": "2025-11-07T01:43:39.820000+00:00",
            "StartTime": "2025-11-07T01:43:39.820000+00:00"
        }
    ]
}

Step 4: Identify failure root cause(s) with FailedEvents filter
The new failure root cause filter instantly surfaces only the events that caused the operation to fail. This eliminates the need to manually scan through progress events to identify the root cause of deployment failures.

Figure 13: Filter operation failure root causes
Figure 13: Filter operation failure root causes

CLI Command

aws cloudformation describe-events \
 --operation-id 1c211b5a-4538-4dc9-bfed-e07734371e57 \
 --filter FailedEvents=true
{
    "OperationEvents": [
        {
            "EventId": "01fcd898-38f3-477d-891d-e950d964d594",
            "StackId": "arn:aws:cloudformation:us-west-2:012345678901:stack/test-1106/07580010-bb79-11f0-8f6c-0289bb5c804f",
            "OperationId": "1c211b5a-4538-4dc9-bfed-e07734371e57",
            "EventType": "PROVISIONING_ERROR",
            "LogicalResourceId": "MyS3Bucket",
            "PhysicalResourceId": "test-1106-bucket",
            "ResourceType": "AWS::S3::Bucket",
            "Timestamp": "2025-11-07T01:43:43.561000+00:00",
            "ResourceStatus": "UPDATE_FAILED",
            "ResourceStatusReason": "The target bucket for logging does not exist (Service: Amazon S3; Status Code: 400; Error Code: InvalidTargetBucketForLogging; Request ID: ZQAPTT7646A9GQ0H; S3 Extended Request ID: 5Cl/xSAfQgs8UJ7rdq4EvsJT8pxnYLZlc3FzTgpQCxZlukoIiWYXkuds6xDzkmpurH+6epy2s9g7Ro7XN4ZFoQ==; Proxy: null)",
            "ResourceProperties": "{\"BucketName\":\"test-1106-bucket\",\"LoggingConfiguration\":{\"LogFilePrefix\":\"access-logs/\",\"DestinationBucketName\":\"logs-bucket\"},\"LifecycleConfiguration\":{\"Rules\":[{\"Status\":\"Enabled\",\"ExpirationInDays\":\"90\",\"Id\":\"DeleteOldVersions\"}]},\"Tags\":[{\"Value\":\"Development\",\"Key\":\"Environment\"},{\"Value\":\"CloudFormationDemo\",\"Key\":\"Project\"}]}"
        }
    ]
}

The FailedEvents=true filter transforms troubleshooting from parsing dozens of progress events to instantly seeing only what matters. This can make diagnosis of issues during an incident much easier..

Real-World Impact
These features improve your Infrastructure development experience with CloudFormation:

  • Template syntax errors: Previously discovered after minutes of provisioning, now caught in seconds
  • Resource conflicts: No more failed deployments due to existing resources
  • Debugging complexity: Transform troubleshooting sessions into faster targeted fixes
  • CI/CD reliability: Reduce pipeline failures and improve deployment confidence

Getting Started

These capabilities are available today in all AWS Regions where CloudFormation is supported. Pre-deployment validation is automatically enabled for all change set operations, no configuration required.

Try it now:

  1. Create any change set from the CloudFormation console or via SDK or CLI with aws cloudformation create-change-set
  2. Use `aws cloudformation describe-events –change-set-name <your-changeset-arn>` to see validation results
  3. Filter failure root causes instantly: via console or CLI with aws cloudformation describe-events –operation-id <id> –filter FailedEvents=true

Best Practices

  • Always use change sets: Even for simple updates, change sets now provide validation feedback
  • Leverage Operation IDs: Use the unique identifiers for focused troubleshooting
  • Filter events strategically: Use –filters FailedEvents=true to focus on problems
  • Automate validation: Integrate the describe-events API into your CI/CD pipelines
  • Use Console: CloudFormation console provides a visual experience with error source mapping to the specific line on your template.

Conclusion

Start using these features today in your development workflow. Whether you’re building new infrastructure or maintaining existing stacks, early validation and enhanced troubleshooting will accelerate your deployment cycles and make it easier to manage infrastructure.

Ready to experience faster CloudFormation development? Create your first change set and see validation in action.

Blog Authors Bio:

Idriss Laouali Abdou

Idriss is a Sr. Product Manager Technical on the AWS Infrastructure-as-Code team based in Seattle. He focuses on improving developer productivity through AWS CloudFormation and StackSets Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Olivia Biswas

Olivia is a Software Development Manager on the AWS Infrastructure-as-Code team based in Seattle, where she leads developer productivity initiatives through CloudFormation. During her tenure at Amazon, she has built several customer-obsessed software solutions within Alexa and Buy With Prime. Outside of work, she is a globe trotter who enjoys baking, dancing, reading, and watching documentaries.

Marcus Ramos

Marcus is a Software Engineer on the AWS Infrastructure-as-Code team. He’s passionate about building features that minimize customers’ effort, and improving efficiency. Outside of work, he enjoys traveling, spending time with his family, and playing PC games.

Subha Velayuthams

Subha is a Senior Software Engineer on the AWS Infrastructure-as-Code team, where she builds features to improve developer productivity. Outside of work, she enjoys reading, traveling, and experimenting with new creative hobbies.

Introducing AWS Capabilities by Region for easier Regional planning and faster global deployments

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/introducing-aws-capabilities-by-region-for-easier-regional-planning-and-faster-global-deployments/

At AWS, a common question we hear is: “Which AWS capabilities are available in different Regions?” It’s a critical question whether you’re planning Regional expansion, ensuring compliance with data residency requirements, or architecting for disaster recovery.

Today, I’m excited to introduce AWS Capabilities by Region, a new planning tool that helps you discover and compare AWS services, features, APIs, and AWS CloudFormation resources across Regions. You can explore service availability through an interactive interface, compare multiple Regions side-by-side, and view forward-looking roadmap information. This detailed visibility helps you make informed decisions about global deployments and avoid project delays and costly rework.

Getting started with Regional comparison
To get started, go to AWS Builder Center and choose AWS Capabilities and Start Exploring. When you select Services and features, you can choose the AWS Regions you’re most interested in from the dropdown list. You can use the search box to quickly find specific services or features. For example, I chose US (N. Virginia), Asia Pacific (Seoul), and Asia Pacific (Taipei) Regions to compare Amazon Simple Storage Service (Amazon S3) features.

Now I can view the availability of services and features in my chosen Regions and also see when they’re expected to be released. Select Show only common features to identify capabilities consistently available across all selected Regions, ensuring you design with services you can use everywhere.

The result will indicate availability using the following states: Available (live in the region); Planning (evaluating launch strategy); Not Expanding (will not launch in region); and 2026 Q1 (directional launch planning for the specified quarter).

In addition to exploring services and features, AWS Capabilities by Region also helps you explore available APIs and CloudFormation resources. As an example, to explore API operations, I added Europe (Stockholm) and Middle East (UAE) Regions to compare Amazon DynamoDB features across different geographies. The tool lets you view and search the availability of API operations in each Region.

The CloudFormation resources tab helps you verify Regional support for specific resource types before writing your templates. You can search by Service, Type, Property, and Config.For instance, when planning an Amazon API Gateway deployment, you can check the availability of resource types like AWS::ApiGateway::Account.

You can also search detailed resources such as Amazon Elastic Compute Cloud (Amazon EC2) instance type availability, including specialized instances such as Graviton-based, GPU-enabled, and memory-optimized variants. For example, I searched 7th generation compute-optimized metal instances and could find c7i.metal-24xl and c7i.metal-48xl instances are available across all targeted Regions.

Beyond the interactive interface, the AWS Capabilities by Region data is also accessible through the AWS Knowledge MCP Server. This allows you to automate Region expansion planning, generate AI-powered recommendations for Region and service selection, and integrate Regional capability checks directly into your development workflows and CI/CD pipelines.

Now available
You can begin exploring AWS Capabilities by Region in AWS Builder Center immediately. The Knowledge MCP server is also publicly accessible at no cost and does not require an AWS account. Usage is subject to rate limits. Follow the getting started guide for setup instructions.

We would love to hear your feedback, so please send us any suggestions through the Builder Support page.

Channy

Supercharging the ML and AI Development Experience at Netflix

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/supercharging-the-ml-and-ai-development-experience-at-netflix-b2d5b95c63eb

Supercharging the ML and AI Development Experience at Netflix with Metaflow

Shashank Srikanth, Romain Cledat

Metaflow — a framework we started and open-sourced in 2019 — now powers a wide range of ML and AI systems across Netflix and at many other companies. It is well loved by users for helping them take their ML/AI workflows from prototype to production, allowing them to focus on building cutting-edge systems that bring joy and entertainment to audiences worldwide.

Metaflow allows users to:

  1. Iterate and ship quickly by minimizing friction
  2. Operate systems reliably in production with minimal overhead, at Netflix scale.

Metaflow works with many battle-hardened tooling to address the second point — among them Maestro, our newly open-sourced workflow orchestrator that powers nearly every ML and AI system at Netflix and serves as a backbone for Metaflow itself.

In this post, we focus on the first point and introduce a new Metaflow functionality, Spin, that helps users accelerate their iterative development process. By the end, you’ll have a solid understanding of Spin’s capabilities and learn how to try it out yourself with Metaflow 2.19.

Iterative development in ML and AI workflows

Developing a Metaflow flow with cards in VSCode

To understand our approach to improving the ML and AI development experience, it helps to consider how these workflows differ from traditional software engineering.

ML and AI development revolves not just around code but also around data and models, which are large, mutable, and computationally expensive to process. Iteration cycles can involve long-running data transformations, model training, and stochastic processes that yield slightly different results from run to run. These characteristics make fast, stateful iteration a critical part of productive development.

This is where notebooks — such as Jupyter, Observable, or Marimo — shine. Their ability to preserve state in memory allows developers to load a dataset once and iteratively explore, transform, and visualize it without reloading or recomputing from scratch. This persistent, interactive environment turns what would otherwise be a slow, rigid loop into a fluid, exploratory workflow — perfectly suited to the needs of ML and AI practitioners.

Because ML and AI development is computationally intensive, stochastic, and data- and model-centric, tools that optimize iteration speed must treat state management as a first-class design concern. Any system aiming to improve the development experience in this domain must therefore enable quick, incremental experimentation without losing continuity between iterations.

New: rapid, iterative development with spin

At first glance, Metaflow code looks like a workflow — similar to Airflow — but there’s another way to look at it: each Metaflow @step serves as a checkpoint boundary. At the end of every step, Metaflow automatically persists all instance variables as artifacts, allowing the execution to resume seamlessly from that point onward. The below animation shows this behavior in action:

An animated GIF showing how resume can be used in Metaflow. The GIF shows how using `flow.py resume join` makes Metaflow clone previously executed steps and resumes the computation from the `join` step and continues executing till the end of the flow.
Using resume in Metaflow

In a sense, we can consider a @step similar to a notebook cell: it is the smallest unit of execution that updates state upon completion. It does have a few differences that address the issues with notebook cells:

  • The execution order is explicit and deterministic: no surprises due to out-of-order cell execution;
  • The state is not hidden: state is explicitly stored as self. variables as shared state, which can be discovered and inspected;
  • The state is versioned and persisted making results more reproducible.

While Metaflow’s resume feature can approximate the incremental and iterative development approach of notebooks, it restarts execution from the selected step onward, introducing more latency between iterations. In contrast, a notebook allows near-instant feedback by letting users tweak and rerun individual cells while seamlessly reusing data from earlier cells held in memory.

The new spin command in Metaflow 2.19 addresses this gap. Similar to executing a single notebook cell, it quickly executes a single Metaflow @step — with all the state carried over from the parent step. As a result, users can develop and debug Metaflow steps as easily as a cell in a notebook.

The effect becomes clear when considering the three complementary execution modes — run, resume, and spin — side by side, mapping them to the corresponding notebook behavior:

Diagram showing the various modes of execution in Metaflow: Run, Resume and Spin
Run, Resume and Spin “modes”

Another major difference isn’t just what gets executed, but what gets recorded. Both run and resume create a full, versioned run with complete metadata and artifacts, while spin skips tracking altogether. It’s built for fast, throw-away iterations during development.

The one-minute clip below illustrates a typical iterative development workflow that alternates between run and spin. In this example, we are building a flow that reads a dataset from a Parquet file and trains a separate model for each product category, focusing on computer-related categories.

As shown in the video, we start by creating a flow from scratch and running a minimal version of it to persist test artifacts — in this case, a Parquet dataset. From there, we can use spin to iterate on one step at a time, incrementally building out the flow, for example, by adding the parallel training steps demonstrated in the clip.

Once the flow has been iterated on locally, it can be seamlessly deployed to production orchestrators like Maestro or Argo, and scaled up on compute platforms such as AWS Batch, Titus, Kubernetes and more. Thus, the experience is as smooth as developing in a notebook, but the outcome is a production-ready, scalable workflow, implemented as an idiomatic Python project!

Spin up smooth development in VSCode/Cursor

Instead of typing run and spin manually in the terminal, we can bind them to keyboard shortcuts. For example, the simple metaflow-dev VS Code extension (works with Cursor as well) maps Ctrl+Opt+R to run and Ctrl+Opt+S to spin. Just hack away, hit Ctrl+Opt+S, and the extension will save your file and spin the step you are currently editing.

One area where spin truly shines is in creating mini-dashboards and reports with Metaflow Cards. Visualization is another strong point of notebooks but the combination of spin and cards makes Metaflow a very compelling alternative for developing real-time and post-execution visualizations. Developing cards is inherently iterative and visual (much like building web pages) where you want to tweak code and see the results instantly. This workflow is readily available with the combination of VSCode/Cursor, which includes a built-in web-view, the local card viewer, and spin.

To see the trio of tools — along with the VS Code extension — in action, in this short clip we add observability to the train step that we built in the earlier example:

A major benefit of Metaflow Cards is that we don’t need to deploy any extra services, data streams, and databases for observability. Just develop visual outputs as above, deploy the flow, and wehave a complete system in production with reporting and visualizations included.

Spin to the next level: injecting inputs, inspecting outputs

Spin does more than just run code — it also lets us take full control of a spun @step’s inputs and outputs, enabling a range of advanced patterns.

In contrast to notebooks, we can spin any arbitrary @step in a flow using state from any past run, making it easy to test functions with different inputs. For example, if we have multiple models produced by separate runs, we could spin an inference step, supplying a different model run each time.

We can also override artifact values or inject arbitrary Python objects — similar to a notebook cell — for spin. Simply specify a Python module with an ARTIFACTS dictionary:

ARTIFACTS = {
"model": "kmeans",
"k": 15
}

and point spin at the module:

spin train --artifacts-module artifacts.py

By default spin doesn’t persist artifacts, but we can easily change this by adding –persist. Even in this case, artifacts are not persisted in the usual Metaflow datastore but to a directory-specific location which you can easily clean up after testing. We can access the results with the Client API as usual — just specify the directory you want to inspect with inspect_spin:

from metaflow import inspect_spin

inspect_spin(".")
Flow("TrainingFlow").latest_run["train"].task["model"].data

Being able to inspect and modify a step’s inputs and outputs on the fly unlocks a powerful use case: unit testing individual steps. We can use spin programmatically through the Runner API and assert the results:

from metaflow import Runner

with Runner("flow.py").spin("train", persist=True) as spin:
assert spin.task["model"].data == "kmeans"

Making AI agents spin

In addition to speeding up development for humans, spin turns out to be surprisingly handy for coding agents too. There are two major advantages to teaching AI how to spin:

  1. It accelerates the development loop. Agents don’t naturally understand what’s slow, or why speed matters, so they need to be nudged to favor faster tools over slower ones.
  2. It helps surface errors faster and contextualizes them to a specific piece of code, increasing the chance that the agent is able to fix errors by itself.

Metaflow users are already using Claude Code; spin makes this even easier. In the example below, we added the following section in a CLAUDE.md file:

## Developing Metaflow code
Follow this incremental development workflow that ensures quick iterations
and correct results. You must create a flow incrementally, step by step
following this process:
1. Create a flow skeleton with empty `@step`s.
2. Add a data loading step.
3. `run` the flow.
4. Populate the next step and use `spin` to test it with the correct inputs.
5. `run` the flow to record outputs from the new step.
5. Iterate on (4–5) until all steps have been implemented and work correctly.
6. `run` the whole flow to ensure final correctness.

To test a flow, run the flow as follows
```
python flow.py - environment=pypi run
```

Do this once before running `spin`.
As you are building the flow, you `spin` to test steps quickly.
For instance
```
python flow.py - environment=pypi spin train
```

Just based on these quick instructions, the agent is able to use spin effectively. Take a look at the following inspirational example that one-shots Claude to create a flow, along the lines of our earlier examples, which trains a classifier to predict product categories:

In the video, we can see Claude using spin around the 45-second mark to test a preprocess step. The step initially fails due to a classic data science pitfall: during testing, Claude samples only a small subset of data, causing some classes to be underrepresented. The first spin surfaces the issue, which Claude then fixes by switching to stratified sampling — and finally does another spin to confirm the fix, before proceeding to complete the task.

The inner loop of end-to-end ML/AI

To circle back to where we started, our motivation for adding spin — and for creating Metaflow in the first place — is to accelerate development cycles so we can deliver more joy to our subscribers, faster. Ultimately, we believe there’s no single magic feature that makes this possible. It takes all parts of an ML/AI platform working together coherently — spin included.

From this perspective, it’s useful to place spin in the context of other Metaflow features. It’s designed for the innermost loop of model and business-logic development, with the added benefit of supporting unit testing during deployment, as shown in the overall blueprint of the Metaflow toolchain below.

Metaflow tool-chain.
Metaflow tool-chain

In this diagram, the solid blue boxes represent different Metaflow commands, while the blue text denotes decorators and other features. In particular, note the Shared Functionality box — another key focus area for us over the past year — which includes configuration management and custom decorators. These capabilities let domain-specific teams and platform providers tailor Metaflow to their own use cases. Following our ethos of composability, all of these features integrate seamlessly with spin as well.

Another key design philosophy of Metaflow is to let projects start small and simple, adding complexity only when it becomes necessary. So don’t be overwhelmed by the diagram above. To get started, install Metaflow easily with

pip install metaflow

and take your first baby @steps for a spin! Check out the docs and for questions, support, and feedback, join the friendly Metaflow Community Slack.

Acknowledgments

We would like to thank our partners at Outerbounds, and particularly Ville Tuulos, Savin Goyal, and Madhur Tandon, for their collaboration on this feature, from initial ideation to review, testing and documentation. We would also like to acknowledge the rest of the Model Development and Management team (Maria Alder, David J. Berg, Shaojing Li, Rui Lin, Nissan Pow, Chaoying Wang, Regina Wang, Seth Yang, Darin Yu) for their input and comments.


Supercharging the ML and AI Development Experience at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Streamlining Multi-Account Infrastructure with AWS CloudFormation StackSets and AWS CDK

Post Syndicated from Franco Abregu original https://aws.amazon.com/blogs/devops/streamlining-multi-account-infrastructure-with-aws-cloudformation-stacksets-and-aws-cdk/

Introduction

Organizations operating at scale on AWS often need to manage resources across multiple accounts and regions. Whether it’s deploying security controls, compliance configurations, or shared services, maintaining consistency can be challenging.

AWS CloudFormation StackSets (StackSets) has been helping organizations deploy resources across multiple accounts and regions since its launch. While the service is powerful on its own, combining it with Infrastructure as Code (IaC) tools and implementing automated deployments can significantly enhance its capabilities.

In this post, we’ll show you how to leverage AWS CloudFormation StackSets at scale using AWS CDK and implement a robust CI/CD pipeline for automated deployments with AWS CodePipeline.

StackSets key concepts

AWS CloudFormation StackSets allows you to create, update, or delete CloudFormation stacks across multiple AWS accounts and regions with a single operation. It’s essentially a way to manage infrastructure at scale across your AWS organization. Using an administrator account, you define and manage a CloudFormation template, and use the template as the basis for provisioning stacks into selected target accounts across specified AWS Regions:

StackSets Overview

Figure 1. StackSets overview.

The Administrator Account is the AWS account where you create and manage StackSets and the Target Accounts are the AWS accounts where the stack instances are deployed.

The Stack Instances are individual stacks created from the StackSet template deployed to specific account-region combinations.

 You can make the following operations using StackSets: Create, update, and delete actions performed on stack instances. These operations can be applied in concurrent or sequential way.

Sequential Deployment:

  • Account-by-account deployment
  • Region-by-region within accounts
  • Configurable failure thresholds

Parallel Deployment:

  • Concurrent account deployments
  • Maximum concurrent account setting
  • Region priority configuration

Hybrid Deployment:

  • Combine sequential and parallel
  • Account group-based deployment
  • Regional deployment strategies

The power of StackSets

The use of StackSets allows us to extend AWS CloudFormation’s capabilities in several important ways:

Governance

It provides you with Centralized Management as a single point of control while including consistent deployment patterns and automated stack instance management across AWS accounts and regions.

With Drift Detection feature, you can identify if any of the stack instances of your StackSet have configuration differences according to its expected configuration. You detect changes made outside CloudFormation and changes made to an instance stack through CloudFormation directly without using the StackSet.

Flexible Deployment

You also have flexible deployment options with controlled rollout. For example, with Concurrent Deployments you can deploy to multiple accounts within each region simultaneously while controlling deployment order. It also includes failure tolerance with automated retry failed operations.

Operational Efficiency

It reduces manual effort in managing multi-account and multi-region environments while minimizes human error in deployments.

Cost Management

It delivers comprehensive resource organization and streamlined tracking of resources across accounts and regions containing instance stacks. Using centralized management, simplifies the resource tracking and organization enabling you you to have:

  • unified visibility: view all related stacks from a single StackSet console (with their deployment status)
  • consistent tagging: apply standardized tags across all stack instances for cost allocation and resource grouping
  • drift detection: run drift detection across all stack instances simultaneously
  • operations tracking: track all operations (create, update and delete) across account/regions from one place

Built-in Safety

You can establish maximum concurrent operation limits, failure tolerance thresholds and automatic retry mechanisms. You also have recovery capabilities through update operations. All these features make a built-in safety mechanisms that prevent widespread failures.

Let’s say you have 100 target accounts, with the maximum concurrent limits, you can for example deploy a change to only 10 accounts. Also, with a failure threshold you can set how many failures do you allow before automatically stopping the process (e.g., stop if more than 5 accounts fail). This way you can gradually deploy and test your templates with a little group, establishing failure thresholds, instead of affecting the stacks preventing mass failures.

When an operation fails, AWS CloudFormation performs a rollback in the stack instances deploying the previous working template. You will still need to correct the template and apply it again in all the stack instances. With StackSets, you can fix the issues in the template and run again an update across all the stacks including the concurrent limit and failure threshold mentioned before to safety test the fix.

Security and Compliance management

This security-focused approach with StackSets helps organizations maintain a strong security posture across their AWS environment while reducing the operational overhead of managing security at scale.

You can use StackSets to deploy standardized security policies across accounts, enforce security baselines automatically and implement security guardrails organization-wide. For example, you can deploy detective control resource and its configuration in all your accounts like Amazon GuardDuty or Amazon Macie. You can also deploy preventive controls like SCPs, AWS Firewall Manager or AWS Shield Advanced. For example you can deploy through StackSets the following CloudFormation template en each target account to block certain actions in a region:

<code>AWSTemplateFormatVersion: '2010-09-09'</code><br /><code>Description: 'Service Control Policy to block access to specific AWS regions'</code><br /><br /><code>Parameters:</code><br /><code>  PolicyName:</code><br /><code>    Type: String</code><br /><code>    Default: 'RegionDenyPolicy'</code><br /><code>    Description: 'Name for the Service Control Policy'</code><br /><code>    </code><br /><code>  PolicyDescription:</code><br /><code>    Type: String</code><br /><code>    Default: 'Blocks access to Singapore region (ap-southeast-1) while allowing global services'</code><br /><code>    Description: 'Description for the Service Control Policy'</code><br /><code>    </code><br /><code>  BlockedRegion:</code><br /><code>    Type: String</code><br /><code>    Default: 'ap-southeast-1'</code><br /><code>    Description: 'AWS Region to block access to'</code><br /><code>    AllowedValues:</code><br /><code>      - 'ap-southeast-1'</code><br /><code>      - 'ap-southeast-2'</code><br /><code>      - 'eu-west-3'</code><br /><code>      - 'us-west-1'</code><br /><code>      - 'ca-central-1'</code><br /><code>    </code><br /><code>  TargetOUId:</code><br /><code>    Type: String</code><br /><code>    Description: 'Organizational Unit ID to attach the policy to (e.g., ou-root-xxxxxxxxxx)'</code><br /><code>    </code><br /><code>Resources:</code><br /><code>  RegionDenySCP:</code><br /><code>    Type: AWS::Organizations::Policy</code><br /><code>    Properties:</code><br /><code>      Name: !Ref PolicyName</code><br /><code>      Description: !Ref PolicyDescription</code><br /><code>      Type: SERVICE_CONTROL_POLICY</code><br /><code>      Content:</code><br /><code>        Version: '2012-10-17'</code><br /><code>        Statement:</code><br /><code>          - Sid: DenyAccessToSpecificRegion</code><br /><code>            Effect: Deny</code><br /><code>            NotAction:</code><br /><code>              - 'route53:*'</code><br /><code>              - 'cloudfront:*'</code><br /><code>              - 'sts:*'</code><br /><code>            Resource: '*'</code><br /><code>            Condition:</code><br /><code>              StringEquals:</code><br /><code>                'aws:RequestedRegion':</code><br /><code>                  - !Ref BlockedRegion</code><br /><code>      TargetIds:</code><br /><code>        - !Ref TargetOUId</code><br /><code>      Tags:</code><br /><code>        - Key: Purpose</code><br /><code>          Value: RegionCompliance</code><br /><code>        - Key: ManagedBy</code><br /><code>          Value: CloudFormation</code><br /><br /><code>Outputs:</code><br /><code>  PolicyId:</code><br /><code>    Description: 'ID of the created Service Control Policy'</code><br /><code>    Value: !Ref RegionDenySCP</code><br /><code>    Export:</code><br /><code>      Name: !Sub '${AWS::StackName}-PolicyId'</code><br /><code>      </code><br /><code>  PolicyArn:</code><br /><code>    Description: 'ARN of the created Service Control Policy'</code><br /><code>    Value: !GetAtt RegionDenySCP.Arn</code><br /><code>    Export:</code><br /><code>      Name: !Sub '${AWS::StackName}-PolicyArn'</code>

Other capabilities include compliance-related resources consistently, maintain audit trails of security configurations and ensure regulatory requirements are met across all accounts. For example, you can enable CouldTrail and deploy AWS Config rules across all the instance stacks managed by the StackSet.

For both Security and Compliance incidents you can use StackSets to deploy automated response workflows, configure event notifications and implement remediation actions across your accounts and regions.

Import existing stacks into StackSets

A stack import operation can import existing stacks into new or existing StackSets, so that you can migrate existing stacks to a StackSet in one operation.

Solution Overview

This solution includes an AWS CodePipeline stack that creates a CI/CD pipeline to deploy our StackSet. This pipeline deploys an application stack containing the AWS CloudFormation StackSet with a monitoring dashboard in AWS CloudWatch.

Solution overview

Figure 2. Solution overview

The following Amazon CloudWatch dashboard is an example of what you will in the target accounts after the StackSet is deployed:

Dashboard example

Figure 3. Dashboard example

In the CI/CD pipeline, before running the deployment commands, it applies python security and quality code checks to ensure code quality and security and cdk-nag to ensure AWS Well Architected best practices. You can find more details about these checks in the solution repository in README.md file.

The solution includes 2 AWS CloudFormation stacks defined by in the AWS CDK application and a template for the StackSet that will be deployed in the target accounts and regions. This stack contains the monitoring dashboard that will be deployed en the target regions of each target account as a single unit.

The idea of using AWS CodePipeline with IaC is that development teams can define and share “pipelines-as-code” patterns for deploying their applications making it easy to add stages. This way, security and quality code testing can run any time you change the source code.

Pipeline overview

Figure 4. Pipeline overview

The best practice is to ensure shift-left: adding this checks to the earlier stages of the SDLC. You can accomplish this complementing your CI/CD pipeline with githooks or IDE Plugins. For example with Amazon Q Developer IDE extension you can use the review function to analyze the security of your code locally.

Walkthrough

If you’d like to try this solution out yourself, visit the walkthrough in the corresponding GitHub repo: https://github.com/aws-cloudformation/aws-cloudformation-templates/tree/main/CloudFormation/StackSets-CDK

To use the CI/CD pipeline just create a repository using any of the AWS CodeConnection git supported providers and add the contents of the folder. All details are included in the README.md so you can always get the latest version of the code and how it works.

Conclusion

In this post, we showed how to use AWS CDK to deploy AWS CloudFormation StackSets to reduce operational overhead and ensure consistency, compliance and security across multiple regions and accounts. We also learned how to create a CI/CD pipeline to guarantee a robust DevSecOps cycle for our Infrastructure as Code.

Now that we’ve explored the main concepts together, you can clone the example repository from the walkthrough section, follow the setup instructions, and customize the implementation to enhance AWS resources management across accounts and regions. Whether you’re managing a single account or multiple organizations, these practices can be adapted to your specific needs. Now that you learned the main concepts, go ahead and clone the example repository from walkthrough section, follow the setup instructions and customize the implementation to improve the AWS resources management across your accounts and regions.

Franco Abregu

Franco Abregu is a Sr. Delivery Consultant – DevOps at AWS Professional Services based in Argentina. Franco focuses on transforming customers DevOps culture to improve developer productivity, operations, deployments and process standardization. His expertise includes CI/CD, Infrastructure as Code, software development and organizational adoption of DevOps culture.

Idriss Laouali Abdou

Idriss Laouali Abdou is a Sr. Product Manager Technical for AWS Infrastructure-as-Code based in Seattle. He focuses on improving developer productivity through StackSets and CloudFormation Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing..

How to Simplify Multi-Account Deployments Monitoring: Centralized Logs for AWS CloudFormation StackSets

Post Syndicated from Idriss Laouali Abdou original https://aws.amazon.com/blogs/devops/how-to-simplify-multi-account-deployments-monitoring-centralized-logs-for-aws-cloudformation-stacksets/

Introduction

As organizations adopt multi-account strategies for improved security features and governance, AWS CloudFormation StackSets enables organizations to deploy infrastructure across multiple accounts and regions. However, monitoring and tracking these distributed deployments across multiple accounts presents operational challenges. When a critical security baseline deployed across 50 accounts suddenly starts failing, teams face the daunting task of logging into each account individually to understand what went wrong and which accounts were affected.

This operational overhead scales exponentially with organization growth, requiring platform teams to spend countless hours switching between accounts and manually correlating deployment events. The lack of centralized visibility slows incident response and makes it difficult to identify patterns or implement proactive monitoring. In this blog post, we’ll explore a solution that centralizes AWS CloudFormation logs from multiple accounts into a single management account, making it easier to monitor and troubleshoot StackSets deployments.

Solution Architecture

Our solution creates a centralized logging system that collects AWS CloudFormation events from all target accounts and forwards them to a central management account. This approach provides a single pane of glass for monitoring and troubleshooting AWS CloudFormation deployments across your entire organization.

Figure 1. Architecture diagram showing event flow from member accounts to management account through EventBridge and CloudWatch Logs

Figure 1. Architecture diagram showing event flow from member accounts to management account through EventBridge and CloudWatch Logs.

The architecture consists of four main components:

  1. Management Account Setup: Creates a central event bus, log group, and necessary permissions in the organization’s management account.
  2. Target Account Configuration: Deployed via StackSets to configure event rules that forward AWS CloudFormation events to the management account.
  3. Resource Deployment: Uses StackSets to deploy common resources across target accounts, generating the events we want to monitor.
  4. Monitoring and Visualization: Provides dashboards and queries for operational insights.

How It Works

The solution follows this event flow:

  1. Event Generation: AWS CloudFormation operations in target accounts generate events (stack creation, updates, deletions, resource changes).
  2. Event Capture: Amazon EventBridge rules in each target account capture these AWS CloudFormation events based on defined patterns.
  3. Cross-Account Forwarding: Events are forwarded to a custom event bus in the management account using cross-account permissions.
  4. Centralized Logging: The central event bus routes all events to a Amazon CloudWatch Log Group with structured logging.
  5. Monitoring and Alerting: Administrators can view consolidated logs, create custom queries, and set up alerts from a single location.

Prerequisites

Before implementing this solution, ensure you have the following prerequisites in place:

  • AWS account: Ensure you have valid AWS account.
  • AWS Organizations: You must have an AWS Organization structure set up with a primary management account and several member accounts under the management account.
  • Trusted Access: Enable trusted access for AWS CloudFormation StackSets from the management account (this allows StackSets to assume roles in member accounts).
  • Appropriate Permissions: You must have access to the management account or be configured as a delegated administrator to create and manage StackSets. For detailed information about permissions and security considerations when using StackSets with AWS Organizations, please review the Prerequisites in the AWS CloudFormation StackSets documentation.

Implementation Deep Dive

The solution is implemented using two AWS CloudFormation templates that work together to create a comprehensive monitoring system:

1. Management Account Logging Setup (log-setup-management.yaml)

This template establishes the central logging infrastructure in the management account by creating a custom Amazon EventBridge event bus with cross-account access policies and an encrypted Amazon CloudWatch Log Group using a customer-managed AWS Key Management Service (AWS KMS) key. A key feature is the included stack set resource that automatically deploys the target account configuration to all member accounts, eliminating manual setup and ensuring consistent configuration across the entire organization.

2. Stack set Deployment Template (common-resources-stackset.yaml)

This template creates a service-managed stack set that deploys common resources to all accounts in specified organizational units. The StackSet is configured with auto-deployment enabled to automatically provision new accounts added to the organization and includes operation preferences for parallel regional deployment with fault tolerance settings.

Step-by-Step Deployment Guide

Step 1: Download the templates:

Step 2: Deploy the Management Account Infrastructure

Deploy the centralized logging infrastructure to your management account.

Using CLI:

aws cloudformation deploy \
  --template-file log-setup-management.yaml \
  --stack-name log-setup-management \
  --parameter-overrides \
    OUID=your-organizational-unit-id \
    OrgID=your-organization-id \
  --capabilities CAPABILITY_IAM \
  --region us-east-1

AWS CLI command execution for stack deployment

Using AWS Console:

  1. Open the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation.
  2. On the Stacks page, choose Create stack at top right, and then choose With new resources (standard).
  3. On the Create stack page, Upload a template file, choose Choose File to choose a template file from your local computer.
  4. Choose Next to continue and to validate the template.
  5. On the Specify stack details page, type a stack name in the Stack name box.
  6. In the Parameters section, specify values for the parameters that were defined in the template.
  7. Choose Next to continue creating the stack.
  8. Acknowledge capabilities and transforms.
  9. Choose Next to continue.
  10. Choose Submit to launch your stack.

This single deployment:

  1. Creates the central logging infrastructure in the management account.
  2. Automatically deploys Amazon EventBridge rules to all accounts in the specified OU.
  3. Sets up the necessary IAM roles and policies for cross-account access.

Figure 2: Screenshot showing successful deployment of log-setup-management.yaml template in the management account

Figure 2.1: Screenshot showing successful deployment of log-setup-management.yaml template in the management account

Figure 2.2: Screenshot showing deployment timeline of log-setup-management.yaml template in the management account

Figure 2.2: Deployment timeline view of log-setup-management.yaml template in the management account

Step 3: Deploy Common Resources

Deploy the sample common resources to demonstrate the logging functionality.

Using CLI:

aws cloudformation deploy \
  --template-file common-resources-stackset.yaml \
  --stack-name common-resources-stackset \
  --parameter-overrides \
    OUID=your-organizational-unit-id \
  --capabilities CAPABILITY_IAM \
  --region us-east-1

AWS CLI command execution for stack deployment

Using AWS Console:

  1. Open the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation.
  2. On the Stacks page, choose Create stack at top right, and then choose With new resources (standard).
  3. On the Create stack page, Upload a template file, choose Choose File to choose a template file from your local computer.
  4. Choose Next to continue and to validate the template.
  5. On the Specify stack details page, type a stack name in the Stack name box.
  6. In the Parameters section, specify values for the parameters that were defined in the template.
  7. Choose Next to continue creating the stack.
  8. Acknowledge capabilities and transforms.
  9. Choose Next to continue.
  10. Choose Submit to launch your stack.

This creates a stack set that deploys Amazon Simple Storage Service (Amazon S3) infrastructure to all target accounts, generating AWS CloudFormation events that will be captured by your centralized logging system.

Screenshot showing successful deployment of common-resources-stackset.yaml template for target accounts

Figure 3: Screenshot showing successful deployment of common-resources-stackset.yaml template for target accounts

Step 4: Validation and Testing

Confirm event flow and monitoring functionality by viewing the log streams in the ‘central-cloudformation-logs’ log group.

Monitoring and Visualization

The centralized logging solution provides advanced monitoring capabilities through Amazon CloudWatch Logs Insights and custom dashboards.

You can customize your queries to get:

  • Recent AWS CloudFormation events across all accounts.
  • Failed stack operations for quick troubleshooting.
  • Successful deployments for verification.
  • Event distribution by account and region.
  • Status breakdown of all AWS CloudFormation operations.

The following query helps you analyze CloudFormation events across your organization by showing:

  • Timestamp of events
  • Account ID where the event occurred
  • Region of deployment
  • Resource types being deployed
  • Deployment status
  • Logical resource identifiers

fields @timestamp, account, region
| parse @message /"resource-type":"(?<resource_type>[^"]+)"/ 
| parse @message /"status":"(?<status>[^"]+)"/ 
| parse @message /"logical-resource-id":"(?<logical_resource_id>[^"]+)"/ 
| sort @timestamp desc

Figure 4: CloudWatch Logs Insights query results showing CloudFormation events across accounts

Figure 4: CloudWatch Logs Insights query results showing CloudFormation events across accounts

You can customize your queries to filter for specific conditions such as failed deployment status, particular resource types, or specific accounts to quickly identify and troubleshoot issues across your organization’s AWS CloudFormation deployments.

Cost Implications

When implementing this centralized monitoring solution, you should consider the following cost components:

Clean up

To clean up the resources created in this solution, follow these steps:

  1. First, delete the common resources stack set (common-resources-stackset) from the AWS CloudFormation console in your management account. This will remove all the resources deployed across your member accounts.
  2. After the stack set operations are complete, delete the management account logging setup stack (log-setup-management) to remove the centralized logging infrastructure, including the event bus, log groups, and associated IAM roles.

Note: Make sure all stack set operations are complete before deleting the management account logging setup to ensure proper cleanup of all resources.

Conclusion

Managing infrastructure across multiple AWS accounts doesn’t have to be complex. By centralizing AWS CloudFormation logs, you can gain visibility into your multi-account deployments, troubleshoot issues more efficiently, and help achieve consistent resource deployment across your organization.

This solution demonstrates how AWS services like AWS CloudFormation StackSets, Amazon EventBridge, and Amazon CloudWatch Logs can be combined to create a powerful monitoring system for your infrastructure as code deployments.

Get started today by implementing this solution in your AWS Organization to gain immediate visibility into your multi-account deployments. Download the templates from our GitHub repository and follow the step-by-step guide to enhance your cloud operations.

Authors:

Fatima Bzioui

Fatima Bzioui is a Cloud Support Engineer with a focus on DevOps best practices and cloud-native solutions. Fatima’s expertise includes Infrastructure as Code and CI/CD implementations, which she uses to help organizations overcome complex technical challenges and achieve their cloud goals.

Idriss Laouali Abdou

Idriss Laouali Abdou is a Sr. Product Manager Technical for AWS Infrastructure-as-Code based in Seattle. He focuses on improving developer productivity through StackSets and CloudFormation Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Infrastructure as Code at Thomson Reuters with AWS CDK

Post Syndicated from Vu San Ha Huynh original https://aws.amazon.com/blogs/devops/infrastructure-as-code-at-thomson-reuters-with-aws-cdk/

This post is cowritten by Danilo Tommasina and Lalit Kumar B from Thomson Reuters.

Large organizations often struggle with infrastructure management challenges including compliance issues, development bottlenecks and errors from inconsistent AWS resource creation across teams. Without standardized naming, tagging and policy enforcement, teams face repeated boilerplate code and difficulty accessing centrally-managed resources.

In this post, we will show you how Thomson Reuters developed an extension of the AWS Cloud Development Kit (CDK) to automate compliance, standardization and policy enforcement in Infrastructure as Code (IaC) scripts. We will explore the strategic reasoning behind this initiative, outline foundational design principles, and provide technical details on TR’s journey from concept to implementation. The solution accelerates and standardizes cloud infrastructure deployment and management through seamless integration between TR’s custom library and AWS CDK.

Thomson Reuters (TR) is one of the world’s leading information organizations for businesses and professionals. TR provides companies with the intelligence, technology, and human expertise they need to find trusted answers, enabling them to make better decisions more quickly. TR’s customers span the financial, risk, legal, tax, accounting, and media industries.

Overview

In a large organization that offers a variety of customer products, it is essential to manage numerous cloud resources effectively. This involves overseeing multiple AWS accounts, implementing access control or addressing financial tracking challenges. These tasks require the application of centrally defined standards and conventions, with additional requirements tailored to specific sub-organizations.

Infrastructure as Code (IaC) is an effective method for managing cloud resources. However, utilizing vanilla AWS CloudFormation for extensive and intricate infrastructure can pose challenges. It requires careful attention to naming conventions, tagging standards, security, and best practices for infrastructure deployments. Additionally, repeating infrastructure patterns across various services and products often leads to excessive use of copy-paste and dealing with boilerplate code. When projects require configurable and dynamic components – including conditionals, loops, repeatable patterns, and distribution to a large user base – delivering CloudFormation scripts can become quite cumbersome and prone to errors.

AWS CDK addresses these challenges by enabling IaC development in high-level programming languages like TypeScript, JavaScript, Python, Java. AWS CDK Level 2 and 3 constructs simplify and reduce the amount of code to be written to manage complex infrastructure. It allows TR to create custom libraries that extend the vanilla AWS CDK with additional patterns and utilities. The extension libraries can also be distributed for multiple programming languages and package managers thanks to JSII. JSII enables TypeScript libraries to be automatically compiled and packaged for native consumption in each target language, allowing CDK libraries to be written once but used in many different programming environments.

Solution to optimize the process

In a medium to large company, different teams provide the fundamental infrastructure services (e.g. authentication and authorization, networking, security, financial tracking and optimization, base infrastructure provisioning, etc.) to enable use of the cloud for a large community of developers.

Figure 1 illustrates the conventional method involving teams producing documentation that outlines the usage of pre-deployed infrastructure. This includes naming and tagging standards, required security boundaries, default settings and other relevant guidelines. Subsequently, the implementation team reviews these documents and integrates the established rules into their tool chain consistently, often working in isolation. This results in inefficiencies, misinterpretation risks and maintenance challenges when specifications change.

Figure 1. The traditional approach with separate documentation and implementation teams.

Figure 1: The traditional approach

TR’s optimized approach replaces documentation with working code as shown in Figure 2.

Figure 2: The optimized approach with shared CDK extension library

Figure 2: The optimized approach

Infrastructure teams contribute their specifications into an extension library for AWS CDK, while the implementation teams can also contribute common patterns back into the central extension. The central extension library is released as polyglot packages allowing the implementation teams to pick the programming language that fits best to their knowledge.

With this approach, TR introduce a “shift-left” in the development and delivery lifecycle. Standards and best practices are introduced early, things are done right by default, and TR minimizes the risks of getting inappropriately configured resources to be deployed, which leads to a reduction in the number of governance and security incidents.
Implementation delivery teams can share well architected patterns for re-use by other teams to improve overall effectiveness.

Implementation

Design principles

Key factors for the adoption of a framework are:

  • Simplicity, ease-of-use, self-service, and fast onboarding
  • Low maintenance effort and cost
  • Controlled roll-out, ability to quickly roll-back

With the above in mind, TR delivered a minimally invasive framework that can be enabled with a tiny set of custom code on top of vanilla AWS CDK code.

Using the TR-AWS CDK core library is straightforward – users simply import the package and adapt their entry point. From there, they can leverage standard AWS CDK code and documentation for most development tasks. There’s no need to learn custom construct classes or follow extensive specialized tutorials – vanilla AWS CDK knowledge is sufficient for most requirements. Additionally, developers can quickly incorporate open-source construct libraries through standard package managers. These third-party libraries integrate seamlessly with the TR implementation, automatically conforming to company standards without requiring additional configuration.

By managing distribution of the library following standard software packaging and release procedures TR enable consumers to adopt new capabilities in a controlled way, with the ability to roll-back to previous versions if something goes wrong during an update.

All this together allows TR to tick off the key factors listed above.

The monorepo approach

TR created a monorepo (monolithic repository) which is a version control strategy where multiple projects or packages are stored in a single repository. This approach offers several advantages over maintaining separate repositories for each package: unified versioning, simplified dependency management, consistent tooling, atomic changes across packages and improved collaboration.

This setup mirrors the configuration used by AWS CDK itself.

TR organized their monorepo following this structure:

  • repo/package.json: Defines dev dependencies and global scripts used by all packages
  • repo/packages: contains the different modules
  • repo/packages/core/package.json: deps of core module and scripts for core module
  • repo/packages/core/lib/*: typescript code that composes the core module
  • repo/packages/core/lib/augmentation/*: module augmentations for AWS CDK core components
  • repo/packages/constructs-pattern-X: define multiple reusable and independent level 3 constructs
  • repo/packages/tr-cdk-lib/package.json: assembly module that defines scripts to assemble the final mono package that will be shared via a npm repository

Figure 3. The monorepo structure

Figure 3: Repo structure

This structure enables TR to maintain a collection of related, but distinct CDK constructs while making sure they work together seamlessly.

The modules are assembled and released into one single versioned package which simplifies the end-user’s consumption.

The core module: Foundation of TR AWS CDK library

The core module is the foundation of TR’s CDK extension library, it consists of several key components that work together to “TR-ify” AWS resources and offer simplified access to centrally managed infrastructure resources that are provided by TR’s AWS landing zone teams.

TR refers to “TR-ification”, as the process of dynamically adapting AWS CDK constructs to meet their standards and best practices. From a user perspective, the process happens in a minimally invasive way, for most of the time the user is coding with vanilla AWS CDK components, while having access to short-cuts to a variety of TR specific resources.

The core module serves several critical purposes:

  1. Standardization: makes sure the AWS resources follow TR naming conventions and tagging standards
  2. Simplification: abstracts away complex configurations required for TR compliance
  3. Integration: provides seamless access to TR-managed resources like VPCs, security groups, and Route53 hosted zones
  4. Policy Enforcement: automatically applies custom security and financial optimization policies

The “TR-ification” process happens on every construct following a consistent order, for each construct it will:

  1. If applicable, set a name following a consistent pattern
  2. Apply custom initialization logic (e.g. set IAM permission boundary)
  3. Apply security and financial optimization defaults (if not set)
  4. Perform custom validations
  5. Verify security and financial optimization policies
  6. Tag resources

TR uses a single root-level Aspect instead of multiple Aspects to avoid complex resource type checking and improve maintainability:

// This is the entrypoint that triggers the trification process on all CDK constructs
// we apply all TR specific transformations at this point
Aspects.of(this).add({
  visit: (node: IConstruct) => {
    node.getTRifier().trify();
  },
});

The careful readers at this point will scream:
Wait a moment! node.getTRifier().trify() won’t compile!

Which is absolutely correct… unless you know a topic in TypeScript called module augmentation, in TR’s case, they augment the IConstruct interface and Construct class as follows:

/** Defines the set of functionality needed when trifying resources */
export interface ITRifier {
    trify(): void;
    readonly name: string | undefined;
    readonly nameFromTree: string;
}

declare module 'constructs/lib/construct' {
    interface IConstruct {
        /** Obtain the ITRifier responsible to add TR specific features to this CDK IConstruct */
        getTRifier(): ITRifier;
        
        trContext(): AppContext | StageContext | StackContext;
    }
    
    interface Construct extends IConstruct {
        /** Build the ITRifier responsible to add TR specific features to this CDK IConstruct */
        buildTRifier(): ITRifier;
    }
}

Then provide default implementations for the generic Construct:

Construct.prototype.getTRifier = function () {
    // Lazy getter, build the TRifier only when needed and cache it
    return ObjectUtils.lazyGetFrom(this, 'trifier', () => this.buildTRifier());
};

Construct.prototype.buildTRifier = function () {
    return new ConstructTRifier(this); // Default dummy implementation
};

Construct.prototype.trContext = function (): StackContext {
    return Stack.of(this).trContext() as StackContext;
};

Since AWS CDK constructs implement the IConstruct interface, respectively extend the Construct class automatically, the “TR-ification” process becomes available for many types of constructs.
All you need to do now is inject your custom logic for all resources you need customization and make sure the module is loaded, e.g. in case of a Lambda function, it uses:

lambda.CfnFunction.prototype.buildTRifier = function () {
    return new CfnResourceTRifierLambda.CfnFunction(
        this,
        () => { // Accessor for retrieving the lambda function name
            return this.functionName;
        },
        (name: string) => { // Accessor for setting the lambda function name
                this.functionName = name;
        },
        () => {
            // Our own stuff to set defaults for financial optimizations
            const policyChecker = FinOps.Lambda.Defaults.apply(this);
            
            this.node.addValidation({
                validate: () => {
                    // Inject a custom validation logic to check compliance with financial policies
                    return policyChecker.addErrorIfNotCompliant(this);
                }
            });
        }
    );
};

TR targets L1 (Cfn) constructs like CfnFunction because the higher-level L2 and L3 constructs internally create L1 constructs during synthesis. This architectural decision makes sure TR-ification is applied universally, whether users write new lambda.Function() or new lambda.CfnFunction(), both will be TR-ified. This approach provides complete coverage with a single implementation point while remaining completely transparent to library users who can continue using their preferred abstraction level without awareness of this internal mechanism.

Naming standardization

TR uses standardized naming to support IAM policy filtering and consistent resource management. In order to support a broad range of use-cases, TR defined the resource name pattern as follows:
<segregationPrefix>[-appPrefix]-<resourceName>[-region]-<envSuffix>
where the elements mean:

  • segregationPrefix: A prefix used for grouping resources for a specific asset, it implies that a segregated administrative group is responsible for this resource, where applicable it is used for ARN based IAM resource filtering.
  • appPrefix: Optional, a prefix used to map a resource to a specific application or service, this is shared across stacks within a CDK app.
  • resourceName: The name of a resource indicating its purpose.
  • region: Optional, applied only to resources that are global but are part of a CDK stack that is bound to a specific region.
  • envSuffix: A suffix used to segregate different deployment environments, e.g. development, continuous integration, quality assurance, production.

Traditional approaches require developers to manually construct these names, propagating prefixes and suffixes throughout their code:

new lambda.Function(stack, 'foo', {
    runtime: lambda.Runtime.NODEJS_LATEST,
    handler: 'index.handler',
    code: new lambda.InlineCode('bar'),
    functionName: `\${segregationPrefix}-\${appPrefix}-compute-stats-\${envSuffix}`,
});

With TR AWS CDK extension, the code is simplified to:

new lambda.Function(stack, 'MyFunction', {
  runtime: lambda.Runtime.NODEJS_LATEST,
  handler: 'index.handler',
  code: new lambda.InlineCode('foo'),
  functionName: 'compute-stats',
});

The functionName describes what the function does without “noise”, TR AWS CDK will transparently generate and inject the name into the synthetized CloudFormation script, matching the specification. Note that functionName is optional and TR-CDK will either TR-ify a provided name or automatically generate a valid one if the user omits it, making sure CloudFormation receives a properly formatted name.

Access to “Landing Zone” resources

TR’s central AWS Landing Zone team is responsible of inflating a set of standard resources (e.g. VPC, subnets, security groups, Route 53 zones, golden AMIs, etc.) into AWS accounts that are made available to application development teams.

Through module augmentation (shown earlier), the TR-ifier defines the function trContext() which provides access to a context-aware utility. When calling this function on a resource that resides within a Stack, it will return an object that implements StackContext interface.

export interface StackContext extends StageContext {
  /** Get access to the TR IVpc */
  readonly vpc: IVpc;

  /** Provides access to standard security groups that are available in all TR accounts */
  readonly securityGroups: trparams.ISecurityGroupsResolver;

  /** Provides access to private and public hosted zones (with numeric digits) that are available in all TR accounts */
  readonly route53: trparams.IRoute53Resolver;

  /** Provides access to TR golden AMIs that are available in all TR accounts */
  readonly goldenAMI: TRGoldenAMI;
}

The readonly attributes are accessors for the AWS Landing Zones resources listed above. With calls like the following examples, you have a simple way to obtain access to the standard VPC, subnets selections, route 53 private hosted zone, …

// Get the IVpc:
const trVpc: IVpc = stack.trContext().vpc;

// Get the private subnets as array
const privateSubnets: ISubnet[] = trVpc.privateSubnets;

// Get the private subnets as SubnetSelection
const privateSubSel: SubnetSelection = trVpc.selectSubnets({
    subnetType: SubnetType.PRIVATE_WITH_EGRESS,
});

// Get the private Route53 hosted zone
const privateHZ = stack.trContext().route53.privateHostedZone;

You might now wonder how TR resolves the resources and obtain objects implementing IVpc, ISubnet, ISecurityGroup, …

Instead of using hard-coded resource attributes (e.g. Id, ARN, …) or complex lookups, TR uses CloudFormation’s ability to resolve Systems Manager parameters at execution time, as part of the AWS account initial inflation along with the resources, Systems Manager parameters are registered as well. The parameter names are the same across TR’s AWS accounts, the value contains e.g. the id of the matching AWS Landing Zone standard resource, e.g. /landing-zone/vpc/vpc-id, /landing-zone/vpc/subnets/private-1-id, /landing-zone/vpc/subnets/private-2-id, …

TR then defined custom IVpc, ISubnet, IHostedZone… implementations and for each function they implemented dynamic resolution of resource attributes via Systems Manager parameters. With this approach, TR obtains portable code that runs on AWS accounts initialized via TR inflation process. There are no hard-coded resource identifiers, and there is no need for lookups via AWS SDK during synthesis.

As a user of the TR AWS CDK library, TR developers interact with an object implementing the IVpc interface and do not have to care about how to obtain e.g. the VPC-id and subnet ids. The same principle applies to Route53 hosted zones, Golden AMI ids, etc.

Application initialization

As mentioned previously, one key design principle is to minimize the custom code that a user of TR AWS CDK is required to use compared to using vanilla AWS CDK. This approach leverages existing AWS CDK and reduces the learning curve for developers.

This is how TR developers initialize an App with vanilla CDK, compared to how they initialize it with TR AWS CDK.

// Initialize a vanilla AWS CDK application
const app = new cdk.App()

// Initialize a TR CDK application
const app = TRCdk.newApp({
  segregationId: '123456',
  resourceOwner: '[email protected]',
  namingProps: { prefix: 'myapp' },
  deploymentEnv: TRDeploymentEnv.DEV
});

From this point on, the developers can continue using vanilla AWS CDK code, the value returned by TRCdk.newApp(…) is an instance of an extension of CDK’s App class and is fully compatible with it. It, however, injects the TR-ification aspect, manages the tagging process, and initializes contextual information.

Here and there, e.g. when they need to pass the VPC into a construct, they will need to call TR AWS CDK code via the trContext() entry point that is exposed on CDK constructs through TypeScript’s module augmentation feature, but that’s it! 99% of the code is vanilla AWS CDK code.

The segregationId, namingProps, and deploymentEnv attributes are used for multiple purposes like formatting resource names and tagging resources.

Standardized Tagging

TR defines tagging standards, there are mandatory tags (e.g. for attribution to a specific product asset and for tracking resource ownership), and there are optional tags (e.g. for specifying resources that belong to different services within the same product asset).

The segregationId, the resourceOwner, and deploymentEnv attributes are used to set mandatory tags using CDK’s built-in functionality for tagging.
TR also defines a standardized set of optional tags that can be passed into the application context or set ad-hoc on individual constructs.

// Initialize a vanilla AWS CDK Application
const app = new cdk.App()

// Initialize a TR CDK application
const app = TRCdk.newApp({
  segregationId: '123456',
  resourceOwner: '[email protected]',
  namingProps: { prefix: 'myapp' },
  deploymentEnv: TRDeploymentEnv.DEV
  optionalTRTags: {
    financialId: '123456789',
    projectName: 'my-project',
    serviceName: 'ServiceX',
    environmentName: 'Dev environment for ServiceX'
  }

This approach maintains consistency in the use of tag names and setting the values, it happens automatically behind the scenes and will be applied to the taggable constructs. No copy-pasting of tag definitions like in AWS CloudFormation, no issues dealing with CloudFormation’s inconsistent syntax for tag declarations, no forgetting of tagging resources.

Conclusion

In this post, we discussed how the monorepo approach to AWS CDK development, centered around the core module, has significantly improved the infrastructure management at Thomson Reuters. By providing well-architected L3 constructs, standardizing and simplifying AWS resource creation, they’ve reduced errors, enhanced governance, and accelerated development.

The core module’s ability to enforce policies, standardize naming and tagging, and provide access to TR-managed resources makes it an invaluable tool for teams working with AWS infrastructure at Thomson Reuters.

To get started with AWS CDK and build your CDK solutions, check out the AWS CDK Developer Guide.

Danilo Tommasina is a Distinguished Engineer at Thomson Reuters. With over 25 years of experience working in technology roles ranging from Software Engineer, over Director of Engineering and now as Distinguished Engineer. As a passionate generalist, proficient in multiple programming languages, cloud technologies, DevOps practices and with engineering knowledge in the ML space, he contributed to the scaling of TR Labs’ engineering organization. He is also a big fan of automation including but not limited to MLOps, LLMOps processes and Infrastructure as Code principles.

Lalit Kumar B is an Associate Cloud & AI Solutions Architect at Thomson Reuters with over 15 years of experience in various technology roles, including Software Engineer, Database Engineer, DevOps Architect, and Solutions Architect, and now as an AI Architect in Platform Engineering. He helped scaling AWS CDK within TR through the ‘tr-cdk-lib’ solution which is an enterprise-grade centralized library of patterns. He enjoys tackling complex challenges and prioritizing effectiveness over efficiency.

Vu San Ha Huynh is a Solutions Architect at AWS with a PhD in Computer Science. He helps large Enterprise customers drive innovation across different domains with a focus on AI/ML and Generative AI solutions.

Paul Wright is a Senior Technical Account Manger, with over 20 years experience in the IT industry and over 7 years of dedicated cloud focus. Paul has helped some of the largest enterprise customers grow their business and improve their operational excellence. In his spare time Paul is a huge football and NFL fan.

Boosting Unit Test Automation at Audible with Amazon Q Developer

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/devops/boosting-unit-test-automation-at-audible-with-amazon-q-developer/

Audible, an Amazon company, is a leading producer and provider of audio storytelling. With a vast library of over 1,000,000 titles including audiobooks, podcasts, and Audible Originals with specific curated offerings available in each marketplace, Audible makes it easy to transform everyday moments into extraordinary opportunities for learning, imagination, and entertainment through immersive audio experiences. Robust testing is critical to ensure millions of end users enjoy a seamless experience across devices.

Remember the last time you inherited a software application codebase with minimal test coverage? Or perhaps you’ve written code in a rush to meet a deadline, promising yourself you’d add tests “later”? We’ve all been there. Testing is crucial but can often gets deprioritized when deadlines loom. That’s where Amazon Q Developer‘s agentic workflows come in, transforming the way developers approach test generation. This blog explores how Audible used Amazon Q Developer to boost their unit test coverage.

Business Use Case for Software Testing

In high velocity development environments, testing cycles can often times get compressed under tight deadlines, increasing quality risks. Amazon Q Developer transforms this paradigm by accelerating testing while maintaining comprehensive standards. Through automated test generation, edge case identification, and fix suggestions, teams execute thorough testing in reduced timeframes, delivering expedited releases, optimized QA resources, and enhanced production readiness.

Each function that does not have the appropriate testing implemented, represents the potential for a rework, bugs, and maintenance challenges. Additionally, inherited codebases present particular challenges: developers must choose between spending weeks writing tests for existing functionality or continuing the cycle of untested code.

Amazon Q Developer addresses these challenges by reducing the time and effort required for proper test coverage, transforming testing from a burdensome chore into a streamlined process that allows teams to focus on delivering new features while helping to ensure code quality.

Amazon Q Developer: Expanding test coverage for your codebase

Amazon Q Developer introduces an advanced approach to software testing generation through its agentic workflows. Unlike traditional test generation tools that produce generic tests, Amazon Q Developer analyzes your code’s intent, business logic, and edge cases. It doesn’t just generate tests; it creates meaningful test suites that validate your code’s behavior comprehensively.

Beyond the dedicated test generation workflow we’ll explore today, Amazon Q Developer offers multiple ways to assist with testing. You can use conversational prompts for test plan generation, request test improvements for existing code, or even pair-program with Amazon Q Developer as you write tests. The flexibility to integrate AI assistance throughout your testing workflow makes Amazon Q Developer a versatile companion for developers.

Amazon Q Developer workflow architecture

The following architecture diagram illustrates how Audible leveraged Amazon Q Developer for both test generation and code transformation:

A flowchart showing Amazon Q Developer's test generation and code transformation process. The diagram illustrates three main workflows: Test Generation: Shows how legacy Java codebase (11 packages) is analyzed using AI capabilities to generate comprehensive test suites including unit tests and edge cases. Code Transformation: Depicts the process of migrating JDK 8 code to JDK 17, including JUnit 4 to JUnit 5 conversion, with over 5,000 tests transformed. Results: Demonstrates outcomes including improved test coverage, successful JDK migration, and 50+ hours saved through code transformation. The diagram uses icons and color-coded sections to show the flow between different stages, with blue sections for initial inputs, green for processing, and grey for outputs.

The Amazon Q Developer workflow demonstrates two key capabilities:

  • Test Generation: Amazon Q Developer analyzes Java classes and creates comprehensive test suites including unit tests, edge case tests, and exception handling tests.
  • Code Transformation: Amazon Q Developer performs automated migration tasks including JDK 8 to JDK 17/21 upgrades, handling language version compatibility, JUnit 4 to JUnit 5 conversion, modernizing test framework syntax and annotations, syntax migration, updating deprecated APIs and code patterns.

What makes this workflow particularly powerful is how it combines AI capabilities with human expertise, allowing expert developers to leverage AI in their day-to-day workflow. Amazon Q Developer analyzes your codebase and uses it as a context, identifies edge cases, and performs automated transformations, while developers apply their domain knowledge to ensure the outputs align with business requirements and expected behavior.

Audible’s Approach to harness the potential of Amazon Q Developer

The Audible teams followed the below steps to harness Amazon Q Developer to boost test coverage.

Code Submission: The Audible team leveraged Amazon Q Developer to enhance their test coverage by generating additional unit tests for Java classes, including static methods and methods with existing test cases. This approach complemented their robust testing strategy. Amazon Q Developer has the ability to examine classes, methods, parameters, return types, and exceptions. Amazon Q Developer is helpful in automatically identifying unit tests to cover edge cases that can easily be overlooked, such as null input checks and empty string checks.

Targeted Requests: The Audible team specifically asked Amazon Q Developer to provide:

  • Suggestions for unit tests to cover the given method within a Java class
  • Recommendations for unit tests targeting untested edge cases
  • Recommendations for test cases addressing error handling and exception scenarios

The Audible team achieved significant improvements using Amazon Q Developer for both test generation and code transformation. The key to their success was providing rich context along with targeted prompts in a systematic workflow.

Developer Workflow

A horizontal workflow diagram titled 'Developer Context Creation Workflow' showing 5 sequential steps: Developer opens class file Developer selects method and adds prompt Selected method and prompt are submitted to Amazon Q Developer Amazon Q Developer generates tests Developer reviews and integrates the generated tests Each step is connected by colored arrows and represented by simple icons including user figures, file symbols, and tool icons. The process flows from left to right, illustrating the complete cycle from initial file access to final test integration.

Audible adopts a human in the loop approach to review the output from automation tools. The above workflow shows the complete process: (1) open a class file in their IDE, (2) select a specific method and add their prompt, (3) submit this combined context to Amazon Q Developer, (4) receive generated tests, and (5) review and integrate the tests into their codebase.

Effective Prompts and Approach

The Audible team followed a structured approach, using targeted requests that Amazon Q Developer could act upon:

Code Submission: The team provided Java classes to Amazon Q Developer with code to generate tests for individual methods, including static methods and those that already had some tests but lacked full coverage. Amazon Q Developer examined classes, methods, parameters, return types, and exceptions, automatically identifying unit tests to cover edge cases like null input checks and empty string checks.

Below are generic Sample Prompts for Specific Requests:

Basic Test Generation:

Generate unit tests for the following Java method. Focus on covering all possible input scenarios and edge cases:

[method code here]

Please include tests for:
- Valid input scenarios
- Null input checks
- Empty string validations
- Exception handling

Edge Case Focus:

I have this method that processes user input. Can you suggest unit tests that cover edge cases I might have missed? Pay special attention to boundary conditions and error scenarios:

[method code here]

Manual Framework Migration (via Q Developer Chat):

Convert this JUnit 4 test to JUnit 5 format. Make sure to update annotations and use modern JUnit 5 features where appropriate:

[JUnit 4 test code here]

Note: While Amazon Q Developer’s code transformation feature can handle JUnit4 to JUnit5 migration automatically across entire codebases, Audible also used the conversational interface for manual, targeted conversions as shown above. Both approaches are available. Refer to documentation for automated transformation details.

Test Generation: Based on the team’s requests, Amazon Q Developer generated specific test suggestions addressing these areas with appropriate assertions and test methods.

Implementation: The development team implemented the suggested tests after review.

Documentation: Amazon Q Developer has the ability to add comments to explain the purpose of the test, area of the functionality that the test is covering. In addition, Amazon Q Developer also has the ability to generate documentation related to other aspects like read-me files and project documentation.

Quantifiable Results

By leveraging Amazon Q Developer, the Audible team achieved:

  • Over 10 key packages received comprehensive unit test coverage
  • ~1 hour saved per test class (typically containing 8-10 individual tests)
  • 5,000+ test cases successfully migrated from JUnit4 to JUnit5 using both Amazon Q Developer’s code transformation and manual conversational assistance
  • 50+ hours of manual work saved during the JDK8 to JDK17 migration using Amazon Q Developer’s code transformation
  • Reduced human errors through AI-assisted transformations

Key Capabilities Demonstrated

Amazon Q Developer excelled in several areas that can be overlooked in manual testing:

Comprehensive Exception Testing: Beyond standard null input checks and empty string validations, it automatically suggested tests for IllegalArgumentException, NullPointerException, and custom business exceptions, including verification of both exception throwing and specific error messages. This systematic approach made test coverage more complete and error handling more robust.

Automated Edge Case Detection: Amazon Q Developer made inline suggestions for null pointer exception handling without prompting, making the process smoother and faster.

Manual Framework Migration with AI Assistance: Amazon Q Developer’s pattern recognition accelerated the migration process through conversational assistance. The team could ask Amazon Q Developer through the chat to convert test syntax from JUnit4 to JUnit5 manually. For example, their previous setup had JUnit4 syntax with @UseDataProvider and @DataProvider annotations. All they had to do was highlight the code block, Send to Prompt, and ask Amazon Q Developer to make the test JUnit5 compatible. Within seconds, it generated a reliable JUnit5 test with ParameterizedTest annotation and Stream of Arguments that they could manually implement.

Contextual Analysis: Amazon Q Developer analyzes the existing codebase and recognized patterns and generated tests that matched the team’s coding style and testing conventions.

Conclusion

Amazon Q Developer transforms the test generation process from a time-consuming chore into a streamlined workflow, enabling teams to achieve comprehensive test coverage with minimal effort. This allows developers to focus on higher-value activities while improving code quality and reliability.

The business impact is substantial: As testing becomes less burdensome, teams naturally adopt better testing practices, creating a positive feedback loop that enhances overall code quality, and creates an opportunity for faster development cycles, and reduced time spent on maintenance.

To learn more about Amazon Q Developer’s features and pricing details, visit the Amazon Q Developer product page.

About the Authors

kirankumar.jpeg

Kirankumar Chandrashekar is a Generative AI Specialist Solutions Architect at AWS, focusing on Next Generation Developer Experience tools like Q Developer, Kiro and Developer Productivity using AI. Bringing deep expertise in AWS cloud services, DevOps, modernization, and infrastructure as code, he helps customers accelerate their development cycles and elevate developer productivity through innovative AI-powered solutions. By leveraging Amazon Q Developer, he enables teams to build applications faster, automate routine tasks, and streamline development workflows. Kirankumar is dedicated to enhancing developer efficiency while solving complex customer challenges, and enjoys music, cooking, and traveling.

alex-torres.jpeg

Alex Torres is a Senior Solutions Architect at AWS, supporting Amazon.com in architecting, designing, and building applications on AWS. With deep expertise in security, governance, and Agentic AI for developers, he helps customers leverage cutting-edge cloud technology to create products that shape people’s lives. Passionate about empowering teams to solve complex challenges through innovative AWS solutions, Alex is dedicated to driving customer success while maintaining the highest standards of security and governance. Outside of work, he enjoys cooking and hiking.

GK is a Senior Customer Solutions Manager and strategic customer advisor supporting Amazon as a customer of AWS. Over her four years at AWS, she has focused on improving developer productivity and advocating for Amazon’s needs across AWS services to enhance user experience and drive deeper alignment between the two organizations. Her work with advanced Amazon teams helps deliver solutions that ultimately benefit both internal and external AWS customers. GK is particularly interested in how GenAI is bridging the gap between developers and non-developers, and she spends much of her time solving challenges in GenAI and security. She is based in the San Francisco Bay Area and enjoys hiking and camping.

Aditi Joshi is a Software Engineer at Audible, working on expanding Audible’s presence across Amazon platforms. As a full-stack developer, she primarily works with web technologies, cloud services, and programming languages like JavaScript and Java to build and enhance cross-platform integration features, including recent projects like introducing Audible purchase capabilities in the Amazon iOS app. With expertise in user interface development, responsive design, and web technologies, she focuses on showcasing Audible offers and growing Audible’s visibility across Amazon’s ecosystem. Aditi is passionate about software architecture and user experience, focusing on building scalable systems with clean, efficient code. When not coding, Aditi enjoys traveling, practicing yoga, and listening to music.

Sam Park is a Software Development Engineer at Audible, focused on building Audible features across Amazon platforms. He has played a key role in enabling Audible purchases through Amazon Cart, as well as expanding Audible’s visibility within the Amazon iOS and Android apps. His work spans multiple touchpoints within the Amazon ecosystem, including Search, Product pages, Checkout, and Cart experiences. Sam is passionate about developing solutions that create intuitive customer experiences and leveraging GenAI to boost development efficiency and productivity. Outside of work, he enjoys traveling, playing basketball, and cheering on the Cleveland Cavaliers.

StackSets Deployment Strategies: Balancing Speed, Safety, and Scale to Optimize Deployments for Different Organizational Needs

Post Syndicated from Amar Meriche original https://aws.amazon.com/blogs/devops/stacksets-deployment-strategies-balancing-speed-safety-and-scale-to-optimize-deployments-for-different-organizational-needs/

AWS CloudFormation StackSets enables organizations to deploy infrastructure consistently across multiple AWS accounts and regions. However, success depends on choosing the right deployment strategy that balances three critical factors: deployment speed, operational safety, and organizational scale. This guide explores proven StackSets deployment strategies specifically designed for multi-account infrastructure management.

Understanding StackSets Deployment Fundamentals

What are StackSets Actually Used For?

Unlike single-account AWS CloudFormation templates, StackSets are specifically designed for multi-account infrastructure governance. Common use cases include Security baselines (deploying IAM policies, security groups, and access controls across all accounts), Compliance controls (rolling out AWS Config rules, AWS CloudTrail configurations, and audit requirements), Organizational standards (establishing consistent VPC configurations, tagging policies, and naming conventions), Shared services (deploying monitoring solutions, logging infrastructure, and backup policies) or Cost management (implementing budget controls, cost allocation tags, and resource optimization policies)

The Multi-Account Challenge

Managing infrastructure across dozens or hundreds of AWS accounts presents unique challenges:

Single Account (CFN Template)     Multi-Account (StackSets)
      App A                           Org Unit A (50 accounts)
        |                                     |
   [Deploy Once]               [Deploy consistently across all]
        |                                     |
    Success/Fail                Complex success/failure matrix

Multi account and multi region Cloudformation deployment complexity

The Speed-Safety-Scale Triangle

Every StackSets deployment strategy involves trade-offs: Speed (how quickly changes propagate across your organization), Safety (risk mitigation and failure containment) and Scale (ability to manage hundreds of accounts efficiently)

Prerequisites

Before implementing any of the deployment strategies described in this guide, ensure you have:

  1. AWS CLI Installation
    1. Install the latest version of AWS CLI by following the AWS CLI installation guide
    2. Verify installation with: aws –version
  2. AWS Profile Configuration
    1. Configure your AWS credentials using: aws configure
    2. For details on configuration, see AWS CLI configuration basics
    3. Ensure your profile has appropriate permissions for CloudFormation StackSets operations as described in AWS StackSets prerequisites
  3. Proper Account Access The commands in this guide must be executed from either:
    1. The management account of your AWS Organization
    2. OR a delegated administrator account for CloudFormation

For information on setting up a delegated administrator, see Register a delegated administrator

Note: StackSets deployments using service-managed permissions cannot be performed from standalone accounts.

Verify you’re using the correct account with:

bash
# For management account
aws organizations describe-organization
# For delegated admin
aws cloudformation list-stack-sets —call-as DELEGATED_ADMIN

AWS CLI to check the usage of an Organization and not a Standalone account

Core Deployment Strategies

As explained in the StackSet documentation:

  • “For a more conservative deployment, set Maximum Concurrent Accounts to 1, and Failure Tolerance to 0. Set your lowest-impact region to be first in the Region Order Start with one region.”
  • “For a faster deployment, increase the values of Maximum Concurrent Accounts and Failure Tolerance as needed. ”

Based on the above, we are proposing below several deployment strategies, depending on the speed, safety and scale you want to achieve.

1. Sequential Deployment: Maximum Safety

Use Case : Critical security updates, compliance requirements, first-time organizational rollouts

Below are listed some possible use cases:

  • Security baseline updates: New IAM policies affecting root access
  • Compliance rollouts: SOX, HIPAA, or PCI-DSS control implementations
  • Critical infrastructure changes: VPC security group modifications
  • Organizational policy changes: New AWS Config rules for audit compliance

Implementation Example:

For this example, we will download the following template ConfigRuleCloudtrailEnabled.yml from the Cloudformation sample library in the AWS documentation to configure an AWS Config rule to determine if AWS CloudTrail is enabled and follow the next steps:

Step 1: Create the StackSet

With the AWS CLI:

# Create Stackset for security baseline
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
  --stack-set-name security-baseline \
  --template-body file://ConfigRuleCloudtrailEnabled.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
  --region us-east-1

AWS CLI to create a security-baseline Stackset

The expected response should be similar to the following :

{"StacksetId": "security-baseline: ...."}

Step 2: Create Stack Instances

Before you launch the below command, you need to adjust the values of the following parameters:

  • OrganizationalUnitIds: you must change the value “ou-test” in the below command line to the name of the target OU you want to deploy to. I recommend creating a new test OU in the console or via the CLI for the purpose of this test.
  • regions: if needed, change the “us-east-1 eu-west-1” value, here you need to list all the regions you want to deploy to. AWS Config must be active in the accounts/regions that you choose, otherwise you’ll get an error when deploying the Stack.

# Deploy security baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1 and eu-west-1
# SEQUENTIAL = One region at a time, sequentially
# MaxConcurrentPercentage = Deploy to 5% of accounts at once
# FailureTolerancePercentage = Stop on first failure
aws cloudformation create-stack-instances \
  --stack-set-name security-baseline \
  --deployment-targets OrganizationalUnitIds=ou-test\
  --regions us-east-1 eu-west-1 \
  --region us-east-1 \
  --operation-preferences RegionConcurrencyType=SEQUENTIAL,MaxConcurrentPercentage=5,FailureTolerancePercentage=0

AWS CLI to create security-baseline Stack Instances sequentially for maximum safety

The CLI output should look like the following:

{"OperationId": ....}

Or create the StackSet and add the Stacks with the AWS Console:

In the CloudFormation Console, click “Create StackSet”

AWS CloudFormation Console: create a security-baseline Stackset

AWS CloudFormation Console: create a security-baseline Stackset

Upload your template from S3 or from your computer and click Next:

AWS CloudFormation Console: specify a template

AWS CloudFormation Console: specify a template

Specify the StackSet name and parameters and click Next:

AWS CloudFormation Console: specify the StackSet name and parameters

AWS CloudFormation Console: specify the StackSet name and parameters

Configure StackSet options and click Next:

AWS CloudFormation Console: configure the StackSet options

AWS CloudFormation Console: configure the StackSet options

Set deployment options and click Next:

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set more deployment options

Then Review and Submit.

Not to overweight this blog, we’ll provide only this example of CLI output and Console screenshot, but the “Parallel Deployment” and “Balanced Approach” will be similar to this example. You just need to update the parameters for the different StackSet Operations options.

A real-world example would be a financial services company deploying new MFA requirements across 200 production accounts. They could use sequential deployment with 5 concurrency to ensure each batch was validated before proceeding.

2. Parallel Deployment: Maximum Speed

The Parallel Deployment is best for non-critical updates, development environments, routine maintenance

Here are some possible use cases:

  • Development account standardization: Rolling out new development tools
  • Monitoring infrastructure: Deploying Amazon CloudWatch dashboards and alarms
  • Cost optimization: Implementing automated resource cleanup policies
  • Non-production updates: Updating development and staging environments

Implementation Example:

For this example, we will copy paste the .yml template from this Re:Post article about monitoring IAM events in a file called “monitoring-baseline.yml”, and use it in the following command lines.

Step 1: Create the StackSet

# Create Stackset for monitoring baseline
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name monitoring-baseline \
--template-body file://monitoring-baseline.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a monitoring-baseline Stackset

Step 2: Create Stack Instances

Just like in the previous example, before you launch the below command, you need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to dev and sandbox accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1 and eu-west-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 80% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 20% of accounts
aws cloudformation create-stack-instances \
--stack-set-name monitoring-baseline \
--deployment-targets OrganizationalUnitIds=ou-development,ou-sandbox \
--regions us-east-1 eu-west-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=80,FailureTolerancePercentage=20

AWS CLI to create monitoring-baseline Stack Instances in parallel with high value for max concurrent percentage for maximum speed

3. Progressive Deployment: Balanced Approach or Multi Phase Approach (Recommended)

For most production scenarios with moderate risk tolerance, it is recommended to use a Balanced Approach, or Multi-Phase Implementation.

Balanced Approach

For this example, to make it easier, you can create a copy of “monitoring-baseline.yml” created previously, and name it “balanced-template.yml”.

cp monitoring-baseline.yml balanced-template.yml

bash command to copy the monitoring-baseline.yml file to balanced-template.yml

Then you can use it in the following command lines.

Step 1: Create the StackSet

# Create Stackset for a balanced creation
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name balanced-deployment \
--template-body file://balanced-template.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a balanced-deployment Stackset

Step 2: Create Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1 and ap-southeast-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 25% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 8% of accounts
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-development,ou-sandbox \
--regions us-east-1 eu-west-1 ap-southeast-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=8

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment

Multi-Phase Implementation:

Step 1: Create the StackSet

# Create Stackset for a balanced creation
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name balanced-deployment \
--template-body file://balanced-template.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a balanced-deployment Stackset

Phase 1: Pilot Accounts (10% of target)

Phase 1: Create Pilot Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1
# SEQUENTIAL = Deployment in sequence
# MaxConcurrentPercentage = 100% Deploy full speed for small pilot
# FailureTolerancePercentage = Zero tolerance in pilot
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets Accounts=pilot-account-1,pilot-account-2 \
--regions us-east-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=SEQUENTIAL,MaxConcurrentPercentage=100,FailureTolerancePercentage=0

AWS CLI to create balanced-deployment Stack Instances sequentially for maximum safety in Pilot accounts

Wait for Pilot validation before proceeding to Phase 2

Phase 2: Early Adopter OUs (30% of target)

Phase 2: Create Early Adopter Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 25% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 5% of accounts
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-early-adopter \
--regions us-east-1 \
--region us-east-1 eu-west-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=5

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment in Early Adopter OU

Wait for Early Adopter validation before proceeding to Phase 3

Phase 3: Full Deployment (Remaining 60%)

Phase 3: Full Deployment

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1 and ap-southeast-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 40% of accounts at once for higher speed after validation
# FailureTolerancePercentage = Tolerate failures in 10% of accounts for moderate tolerance
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-standard-prod,ou-legacy-prod \
--regions us-east-1 \
--region us-east-1 eu-west-1 ap-southeast-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=5

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment in the remaining OUs

Using Step Functions for Orchestration

AWS Step Functions provides a serverless workflow service that can orchestrate StackSets deployments with advanced control flow, error handling, and state management capabilities. This approach enhances your multi-account deployments with features not available through standard StackSets operations alone.

Some of the Key Benefits include:

  • Advanced Deployment Orchestration: Coordinate multi-phase rollouts with validation gates
  • Human Approval Workflows: Implement manual approval steps for critical changes
  • Enhanced Error Handling: Define sophisticated retry policies and fallback mechanisms
  • Visual Monitoring: Track deployment progress through the Step Functions visual console

Real-World Use Case: Compliance Control Rollout

In regulated industries, AWS Step Functions enables a phased approach that combines automation with necessary governance. For instance, you can:

  1. Deploy compliance controls to test accounts
  2. Run automated validation and generate compliance reports
  3. Obtain manual approval from compliance team
  4. Deploy to production accounts with comprehensive monitoring

This approach ensures consistent governance while maintaining the complete audit trail required for regulatory compliance.

Monitoring and Optimization

AWS CloudFormation StackSets do not have extensive built-in Amazon CloudWatch metrics specifically designed for monitoring StackSet operations and health. This is actually why the monitoring implementation in our blog post is valuable.

Here’s what AWS does and doesn’t provide out of the box:

What AWS provides natively:

  • Basic AWS API call metrics via AWS CloudTrail (which show that operations happened but don’t track success rates or performance)
  • General service quotas and throttling metrics for CloudFormation as a whole
  • CloudFormation provides some metrics for individual stacks, but not consolidated StackSet-specific metrics

What requires custom implementation (as in our blog post):

  • Success rate metrics for StackSet operations across accounts
  • Deployment completion time tracking
  • Configuration drift detection and monitoring
  • Account-specific failure analysis
  • Comprehensive dashboards that show StackSet health across your organization

The code in our blog post demonstrates how to implement the success rate custom metrics by:

  1. Gathering data from the CloudFormation API about StackSet operations
  2. Calculating the success rate metrics for StackSet deployments
  3. Creating custom Amazon CloudWatch metrics in a custom namespace (like “StackSetMonitoring”)
  4. Setting up alerts for issues

This explains why organizations need to implement custom monitoring solutions like the one shown in our blog post rather than relying solely on built-in metrics.

Automated Monitoring Implementation: example of a custom metric to monitor the StackSet operations success rate

The following AWS Cloudformation template provides real-time monitoring and alerting for AWS CloudFormation StackSet operations through automated infrastructure deployment. This solution creates a complete monitoring system using a AWS Lambda function, Amazon EventBridge rules, Amazon SNS notifications, and Amazon CloudWatch dashboards to track StackSet success and failure rates. The core Lambda function named StackSetMonitor continuously monitors all active StackSets in your account, calculating success rates and publishing custom metrics to Amazon CloudWatch under the StackSetMonitoring namespace.

Below you’ll find a few example of possible custom metrics that could be implemented based on this AWS Cloudformation template:

  • Count of all operations (CREATE, UPDATE, DELETE) per StackSet over time periods
  • Number of stack instances with configuration drift (requires additional API calls)
  • Average time taken for StackSet operations to complete
  • Rate of StackSet operations to identify peak usage times
  • Number of individual stack instances that failed during operations
  • Number of retried operations (indicates infrastructure issues)

Here’s the StackSetMonitor.yml CloudFormation Template:

# StackSetMonitor.yml 
# CFN template for monitoring AWS CloudFormation StackSet operations with real-time alerts, metrics, and dashboards.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudFormation template for StackSet operation monitoring using CloudWatch and SNS'

Parameters:
  StackSetName:
    Type: String
    Description: 'Name of the StackSet to monitor'
    Default: 'security-baseline'
    MinLength: 1
    MaxLength: 128
    AllowedPattern: '[a-zA-Z][-a-zA-Z0-9]*'
    ConstraintDescription: 'Must be a valid StackSet name (1-128 characters, alphanumeric and hyphens, must start with a letter)'
  
  VpcId:
    Type: String
    Description: 'VPC ID where the Lambda function will be deployed (leave empty to create new VPC)'
    Default: ''
  
  SubnetIds:
    Type: CommaDelimitedList
    Description: 'List of subnet IDs for the Lambda function (leave empty to create new subnets)'
    Default: ''
    
  SecurityGroupIds:
    Type: CommaDelimitedList
    Description: 'List of security group IDs for the Lambda function (leave empty to create new security group)'
    Default: ''

Conditions:
  CreateVPC: !Equals [!Ref VpcId, '']
  CreateVPCAndSubnets: !And [!Equals [!Ref VpcId, ''], !Equals [!Join [',', !Ref SubnetIds], '']]
  HasCustomSecurityGroups: !Not [!Equals [!Join [',', !Ref SecurityGroupIds], '']]
  
Resources:
  # KMS Key for CloudWatch Logs encryption
  LogsKMSKey:
    Type: AWS::KMS::Key
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      Description: 'KMS Key for StackSet Monitor CloudWatch Logs and Lambda environment variable encryption'
      EnableKeyRotation: true
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Sid: Enable IAM User Permissions
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:root'
            Action: 'kms:*'
            Resource: '*'
          - Sid: Allow CloudWatch Logs
            Effect: Allow
            Principal:
              Service: !Sub 'logs.${AWS::Region}.amazonaws.com'
            Action:
              - 'kms:Encrypt'
              - 'kms:Decrypt'
              - 'kms:ReEncrypt*'
              - 'kms:GenerateDataKey*'
              - 'kms:DescribeKey'
            Resource: '*'
            Condition:
              ArnEquals:
                'kms:EncryptionContext:aws:logs:arn': 
                  - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor'
                  - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets'
          - Sid: Allow Lambda Service
            Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action:
              - 'kms:Encrypt'
              - 'kms:Decrypt'
              - 'kms:ReEncrypt*'
              - 'kms:GenerateDataKey*'
              - 'kms:DescribeKey'
            Resource: '*'

  LogsKMSKeyAlias:
    Type: AWS::KMS::Alias
    Properties:
      AliasName: alias/stackset-monitor-logs
      TargetKeyId: !Ref LogsKMSKey

  # VPC Resources (created when no existing VPC is provided)
  StackSetMonitorVPC:
    Type: AWS::EC2::VPC
    Condition: CreateVPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: StackSetMonitor-VPC
        - Key: Purpose
          Value: VPC for StackSet Monitor Lambda function


  PrivateSubnet1:
    Type: AWS::EC2::Subnet
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-Subnet-1
        - Key: Purpose
          Value: Private subnet for StackSet Monitor Lambda

  PrivateSubnet2:
    Type: AWS::EC2::Subnet
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: !Select [1, !GetAZs '']
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-Subnet-2
        - Key: Purpose
          Value: Private subnet for StackSet Monitor Lambda

  PrivateRouteTable1:
    Type: AWS::EC2::RouteTable
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-RT-1

  PrivateRouteTable2:
    Type: AWS::EC2::RouteTable
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-RT-2

  PrivateSubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: CreateVPC
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      SubnetId: !Ref PrivateSubnet1

  PrivateSubnet2RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: CreateVPC
    Properties:
      RouteTableId: !Ref PrivateRouteTable2
      SubnetId: !Ref PrivateSubnet2

  # VPC Endpoints for AWS Services (no internet access needed)
  CloudFormationVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.cloudformation
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - cloudformation:ListStackSets
              - cloudformation:ListStackSetOperations
              - cloudformation:ListStackInstances
              - cloudformation:DescribeStackInstance
              - cloudformation:DescribeStacks
              - cloudformation:GetTemplate
            Resource: '*'

  CloudWatchVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.monitoring
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - cloudwatch:PutMetricData
            Resource: '*'

  SNSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sns
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sns:Publish
            Resource: '*'

  EventsVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.events
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - events:PutEvents
            Resource: '*'

  LogsVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.logs
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: '*'

  SQSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sqs
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sqs:SendMessage
            Resource: '*'

  STSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sts
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sts:AssumeRole
              - sts:GetCallerIdentity
              - sts:AssumeRoleWithWebIdentity
            Resource: '*'

  # Security Group for Lambda function
  LambdaSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for StackSet Monitor Lambda function
      VpcId: !If
        - CreateVPC
        - !Ref StackSetMonitorVPC
        - !Ref VpcId
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 10.0.0.0/16
          Description: HTTPS to VPC Endpoints
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS TCP to VPC for name resolution
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS UDP to VPC for name resolution
      Tags:
        - Key: Name
          Value: StackSetMonitor-Lambda-SG
        - Key: Purpose
          Value: Security group for StackSet Monitor Lambda

  VPCEndpointSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Condition: CreateVPC
    Properties:
      GroupDescription: Security group for VPC Endpoints
      VpcId: !Ref StackSetMonitorVPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: HTTPS from Lambda security group
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: DNS TCP from Lambda security group
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: DNS UDP from Lambda security group
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 10.0.0.0/16
          Description: HTTPS outbound within VPC
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS TCP outbound within VPC
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS UDP outbound within VPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-VPCEndpoint-SG
        - Key: Purpose
          Value: Security group for VPC Endpoints

  # Dead Letter Queue for Lambda function
  StackSetMonitorDLQ:
    Type: AWS::SQS::Queue
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      QueueName: StackSetMonitor-DLQ
      MessageRetentionPeriod: 1209600  # 14 days
      KmsMasterKeyId: alias/aws/sqs
      Tags:
        - Key: Purpose
          Value: Dead Letter Queue for StackSet Monitor Lambda

  StackSetAlertsTopic:
    Type: AWS::SNS::Topic
    Properties: 
      TopicName: StackSetAlerts
      DisplayName: StackSet Monitoring Alerts
      KmsMasterKeyId: alias/aws/sns
  
  StackSetLogGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties: 
      LogGroupName: /aws/cloudformation/stacksets
      RetentionInDays: 30
      KmsKeyId: !GetAtt LogsKMSKey.Arn

  LambdaLogGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      LogGroupName: /aws/lambda/StackSetMonitor
      RetentionInDays: 30
      KmsKeyId: !GetAtt LogsKMSKey.Arn
  
  StackSetMonitoringDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: StackSetMonitoring
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "width": 24,
              "height": 8,
              "properties": {
                "metrics": [
                  [ "StackSetMonitoring", "SuccessRate", "StackSetName", "${StackSetName}" ]
                ],
                "region": "${AWS::Region}",
                "title": "StackSet Operations",
                "period": 300,
                "stat": "Average"
              }
            },
            {
              "type": "log",
              "width": 24,
              "height": 6,
              "properties": {
                "query": "SOURCE '/aws/lambda/StackSetMonitor' | fields @timestamp, @message\n| sort @timestamp desc\n| limit 20",
                "region": "${AWS::Region}",
                "title": "Latest StackSet Monitor Logs",
                "view": "table"
              }
            }
          ]
        }
  
  # Consolidated rule to catch ALL StackSet events for comprehensive monitoring
  AllStackSetOperationsRule:
    Type: AWS::Events::Rule
    Properties:
      Name: AllStackSetOperationsRule
      Description: "Rule for monitoring all CloudFormation StackSet operations with failure notifications"
      EventPattern: {source: ["aws.cloudformation"], detail-type: ["CloudFormation StackSet Operation Status Change"]}
      State: ENABLED
      Targets:
        - Id: ProcessAllEvents
          Arn: !GetAtt StackSetMonitorLambda.Arn
        - Id: NotifyFailure
          Arn: !Ref StackSetAlertsTopic
          InputTransformer:
            InputPathsMap:
              "stackSetId": "$.detail.stack-set-id"
              "operationId": "$.detail.operation-id"
              "status": "$.detail.status"
              "time": "$.time"
            InputTemplate: '"StackSet Event: ID: <stackSetId>, Op: <operationId>, Status: <status>, Time: <time>"'

  StackSetMonitorLambda:
    Type: AWS::Lambda::Function
    DependsOn: LambdaLogGroup
    Properties:
      FunctionName: StackSetMonitor
      Handler: index.lambda_handler
      Role: !GetAtt StackSetMonitorRole.Arn
      Runtime: python3.12
      Timeout: 300
      MemorySize: 512
      ReservedConcurrentExecutions: 1
      DeadLetterConfig:
        TargetArn: !GetAtt StackSetMonitorDLQ.Arn
      VpcConfig:
        SecurityGroupIds: !If
          - HasCustomSecurityGroups
          - !Ref SecurityGroupIds
          - - !Ref LambdaSecurityGroup
        SubnetIds: !If
          - CreateVPCAndSubnets
          - - !Ref PrivateSubnet1
            - !Ref PrivateSubnet2
          - !Ref SubnetIds
      KmsKeyArn: !GetAtt LogsKMSKey.Arn
      Code:
        ZipFile: |
          import boto3
          import json
          import os
          import logging
          import time
          import datetime
          from typing import Dict, Any, Optional
          
          # Custom JSON encoder to handle datetime objects
          class DateTimeEncoder(json.JSONEncoder):
              def default(self, obj):
                  if isinstance(obj, datetime.datetime):
                      return obj.isoformat()
                  return super().default(obj)
          
          # Set up logging with more details
          logger = logging.getLogger()
          logger.setLevel(logging.INFO)
          
          # Log initialization to verify Lambda is loading correctly
          print("StackSetMonitor Lambda initializing...")
          
          def validate_event(event: Dict[str, Any]) -> bool:
              """Validate the incoming event structure"""
              if not isinstance(event, dict):
                  logger.error("Event must be a dictionary")
                  return False
              
              # If it's an EventBridge event, validate required fields
              if 'detail' in event:
                  detail = event.get('detail', {})
                  if not isinstance(detail, dict):
                      logger.error("Event detail must be a dictionary")
                      return False
                  
                  # Validate StackSet event structure
                  if 'stack-set-id' in detail:
                      stack_set_id = detail.get('stack-set-id')
                      if not isinstance(stack_set_id, str) or not stack_set_id.strip():
                          logger.error("stack-set-id must be a non-empty string")
                          return False
                      
                      # Validate operation-id if present
                      operation_id = detail.get('operation-id')
                      if operation_id is not None and not isinstance(operation_id, str):
                          logger.error("operation-id must be a string if provided")
                          return False
                      
                      # Validate status if present
                      status = detail.get('status')
                      if status is not None and not isinstance(status, str):
                          logger.error("status must be a string if provided")
                          return False
              
              return True
          
          def validate_context(context: Any) -> bool:
              """Validate the Lambda context object"""
              if context is None:
                  logger.error("Context cannot be None")
                  return False
              
              # Check for required context attributes
              required_attrs = ['function_name', 'function_version', 'invoked_function_arn', 'memory_limit_in_mb']
              for attr in required_attrs:
                  if not hasattr(context, attr):
                      logger.error(f"Context missing required attribute: {attr}")
                      return False
              
              return True
          
          def sanitize_string(value: str, max_length: int = 255) -> str:
              """Sanitize and truncate string inputs"""
              if not isinstance(value, str):
                  return str(value)[:max_length]
              return value.strip()[:max_length]
          
          def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
              """Main Lambda handler function for StackSet monitoring with input validation"""
              
              # Input validation
              if not validate_event(event):
                  return {
                      "statusCode": 400,
                      "body": json.dumps({
                          "status": "error",
                          "message": "Invalid event structure"
                      }, cls=DateTimeEncoder)
                  }
              
              if not validate_context(context):
                  return {
                      "statusCode": 400,
                      "body": json.dumps({
                          "status": "error",
                          "message": "Invalid context object"
                      }, cls=DateTimeEncoder)
                  }
              
              # Log the validated event for debugging
              logger.info(f"Event received: {json.dumps(event, cls=DateTimeEncoder)}")
              logger.info(f"Function: {context.function_name}, Version: {context.function_version}")
              
              try:
                  cf = boto3.client('cloudformation')
                  cw = boto3.client('cloudwatch')
                  
                  # Log that we're starting processing
                  logger.info(f"Starting StackSet monitoring at {time.time()}")
                  
                  # Check if this is an event from EventBridge
                  if 'detail' in event and 'stack-set-id' in event.get('detail', {}):
                      detail = event['detail']
                      stack_set_id = sanitize_string(detail['stack-set-id'])
                      operation_id = sanitize_string(detail.get('operation-id', 'N/A'))
                      status = sanitize_string(detail.get('status', 'N/A'))
                      
                      # Validate stack_set_id format
                      if not stack_set_id or len(stack_set_id) > 128:
                          logger.error(f"Invalid stack_set_id: {stack_set_id}")
                          return {
                              "statusCode": 400,
                              "body": json.dumps({
                                  "status": "error",
                                  "message": "Invalid stack_set_id format"
                              }, cls=DateTimeEncoder)
                          }
                      
                      # Log the StackSet operation with additional context
                      logger.info(f"Processing StackSet event - ID: {stack_set_id}, Op: {operation_id}, Status: {status}")
                      
                      # Extract stack set name from the ID
                      stack_set_name = stack_set_id.split('/')[-1] if '/' in stack_set_id else stack_set_id
                      stack_set_name = sanitize_string(stack_set_name, 128)
                      logger.info(f"Extracted StackSet name: {stack_set_name}")
                  
                  # Always gather metrics regardless of event type
                  # Get all active StackSets
                  stack_sets_response = cf.list_stack_sets(Status='ACTIVE')
                  stack_sets = stack_sets_response.get('Summaries', [])
                  
                  if not isinstance(stack_sets, list):
                      logger.error("Invalid response from list_stack_sets")
                      return {
                          "statusCode": 500,
                          "body": json.dumps({
                              "status": "error",
                              "message": "Invalid CloudFormation API response"
                          }, cls=DateTimeEncoder)
                      }
                  
                  logger.info(f"Found {len(stack_sets)} active StackSets")
                  
                  for stack_set in stack_sets:
                      if not isinstance(stack_set, dict) or 'StackSetName' not in stack_set:
                          logger.warning(f"Skipping invalid stack_set entry: {stack_set}")
                          continue
                      
                      stack_set_name = sanitize_string(stack_set['StackSetName'], 128)
                      logger.info(f"Processing StackSet: {stack_set_name}")
                      
                      try:
                          operations = cf.list_stack_set_operations(StackSetName=stack_set_name, MaxResults=5)
                          
                          # Validate operations response
                          if not isinstance(operations, dict):
                              logger.error(f"Invalid operations response for {stack_set_name}")
                              continue
                          
                          # Calculate success rate
                          successes = 0
                          operations_list = operations.get('Summaries', [])
                          
                          if not isinstance(operations_list, list):
                              logger.error(f"Invalid operations list for {stack_set_name}")
                              continue
                          
                          total_ops = len(operations_list)
                          logger.info(f"Found {total_ops} recent operations for {stack_set_name}")
                          
                          for op in operations_list:
                              if isinstance(op, dict) and op.get('Status') == 'SUCCEEDED':
                                  successes += 1
                          
                          success_rate = (successes / total_ops * 100) if total_ops > 0 else 100
                          
                          # Validate success_rate is within expected bounds
                          if not (0 <= success_rate <= 100):
                              logger.error(f"Invalid success_rate calculated: {success_rate}")
                              continue
                          
                          # Publish metrics to CloudWatch
                          cw.put_metric_data(
                              Namespace='StackSetMonitoring',
                              MetricData=[
                                  {'MetricName': 'SuccessRate', 'Value': success_rate, 
                                   'Dimensions': [{'Name': 'StackSetName', 'Value': stack_set_name}]}
                              ]
                          )
                          
                          logger.info(f"Published metrics for {stack_set_name}: Success Rate = {success_rate}%")
                      except Exception as e:
                          logger.error(f"Error processing StackSet {stack_set_name}: {str(e)}")
                  
                  return {
                      "statusCode": 200,
                      "body": json.dumps({
                          "status": "completed",
                          "message": f"Processed {len(stack_sets)} StackSets"
                      }, cls=DateTimeEncoder)
                  }
                  
              except Exception as e:
                  logger.error(f"Error in Lambda function: {str(e)}")
                  # Return a proper response even on error
                  return {
                      "statusCode": 500,
                      "body": json.dumps({
                          "status": "error",
                          "message": str(e)
                      }, cls=DateTimeEncoder)
                  }
  
  # Managed IAM Policies
  CloudFormationAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for CloudFormation and CloudWatch access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - cloudformation:ListStackSets
              - cloudformation:ListStackSetOperations
              - cloudformation:ListStackInstances
              - cloudformation:DescribeStackInstance
            Resource: 
              - !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stackset/*"
              - !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stackset-target/*"
          - Effect: Allow
            Action:
              - cloudwatch:PutMetricData
            Resource: "*"
            Condition:
              StringEquals:
                "cloudwatch:namespace": "StackSetMonitoring"
          - Effect: Allow
            Action:
              - sns:Publish
            Resource: !Ref StackSetAlertsTopic

  EventsAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for EventBridge access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - events:PutEvents
            Resource: !Sub "arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:event-bus/default"

  LogsAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for CloudWatch Logs access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: 
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor:*"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets:*"

  DLQAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for Dead Letter Queue access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - sqs:SendMessage
            Resource: !GetAtt StackSetMonitorDLQ.Arn

  StackSetMonitorRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole
        - !Ref CloudFormationAccessPolicy
        - !Ref EventsAccessPolicy
        - !Ref LogsAccessPolicy
        - !Ref DLQAccessPolicy

  # Permissions for event rules to invoke Lambda
  AllOperationsRuleLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref StackSetMonitorLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt AllStackSetOperationsRule.Arn
  
  # Using a one minute schedule for testing, but you can change this value
  StackSetMonitorSchedule:
    Type: AWS::Events::Rule
    Properties:
      Name: RegularStackSetMonitoring
      Description: "Triggers Lambda function every 1 minute to check StackSet operations"
      ScheduleExpression: "rate(1 minute)"
      State: ENABLED
      Targets:
        - Id: RunMonitor
          Arn: !GetAtt StackSetMonitorLambda.Arn
  
  ScheduleLambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref StackSetMonitorLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt StackSetMonitorSchedule.Arn
  
  StackSetSuccessRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "Alarm when StackSet operation success rate is low"
      MetricName: SuccessRate
      Namespace: "StackSetMonitoring"
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      DatapointsToAlarm: 2
      Threshold: 80
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref StackSetAlertsTopic]
      Dimensions: [{Name: StackSetName, Value: !Ref StackSetName}]

Outputs:
  SNSTopicArn: 
    Description: The ARN of the SNS topic for alerts
    Value: !Ref StackSetAlertsTopic
  DashboardURL: 
    Description: URL to the CloudWatch Dashboard
    Value: !Sub https://console.aws.amazon.com/cloudwatch/home?region=${AWS::Region}#dashboards:name=StackSetMonitoring
  LambdaLogGroupName:
    Description: Name of the CloudWatch Log Group for Lambda logs
    Value: !Ref LambdaLogGroup
  DeadLetterQueueArn:
    Description: ARN of the Dead Letter Queue for Lambda function failures
    Value: !GetAtt StackSetMonitorDLQ.Arn
  DeadLetterQueueURL:
    Description: URL of the Dead Letter Queue for monitoring failed Lambda executions
    Value: !Ref StackSetMonitorDLQ
  TestLambdaCommand:
    Description: Command to manually test the Lambda function
    Value: !Sub "aws lambda invoke --function-name ${StackSetMonitorLambda} --payload '{}' response.json && cat response.json"
  LambdaFunctionArn:
    Description: ARN of the Lambda function configured with VPC
    Value: !GetAtt StackSetMonitorLambda.Arn
  LambdaSecurityGroupId:
    Description: Security Group ID created for the Lambda function
    Value: !Ref LambdaSecurityGroup
  VpcConfiguration:
    Description: VPC configuration summary for the Lambda function
    Value: !Sub 
      - "VPC: ${VpcId}, Subnets: ${SubnetList}, Security Groups: ${LambdaSecurityGroup}"
      - SubnetList: !Join [',', !Ref SubnetIds]

You need to run the following CLI command to deploy the CloudFormation stacks. You can change the ParameterValue of StackSetName“your-stackset-name” by the name of the StackSet you want to monitor. The default value is “security-baseline”. Your CLI profile should use region=“us-east-1“.

aws cloudformation create-stack --stack-name stackset-monitor --template-body file://StackSetMonitor.yml --parameters ParameterKey=StackSetName,ParameterValue="security-baseline" --capabilities CAPABILITY_IAM

AWS CLI to deploy the StackSetMonitor.yml CloudFormation template

The CLI output should look like the following:

{"StackId": "arn:aws:cloudformation:...."}

Here’s the expected output for the CloudFormation template:

StackSetMonitor Console output

StackSetMonitor Console output

And an example of Amazon CloudWatch Dashboard and Alarm screen:

Amazon CloudWatch Dashboard screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Dashboard screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Alarm screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Alarm screenshot for StackSetMonitor stack to track StackSet operations success rate

SNS subscription setup involves retrieving the topic ARN from stack outputs and configuring notifications for email or SMS endpoints (below example CLI for email subscription):

aws sns subscribe --topic-arn $SNS_TOPIC_ARN --protocol email --notification-endpoint [email protected]

AWS CLI to subscribe to the topic providing the user email

Cost:

The estimated monthly expenses ranges between 5 and 15 USD depending on StackSet activity levels, with approximately 2,880 Lambda executions per day (each minute) under the default monitoring schedule.

The solution supports customization of monitoring frequency by modifying the ScheduleExpression from the default one-minute interval. The cost will decrease if the monitoring is less frequent.

Cleanup:

For cleanup, you can run the following command lines:

  • To cleanup the Stack Instances and StackSets created in the Core Deployment Strategies section:

aws cloudformation delete-stack-instances --stack-set-name security-baseline --deployment-targets OrganizationalUnitIds=ou-xxx --regions us-east-1 eu-west-1 --region us-east-1 --no-retain-stack

AWS CLI to delete the Stack Instances

You need to change the parameter OrganizationalUnitIds value with the name of the OU, the parameter regions with the list of regions where you want to delete your stack instances, and the value of the stack-set-name parameter (security-baseline, monitoring-baseline, balanced-deployment…).

Then you can delete the StackSet:

aws cloudformation delete-stack-set --stack-set-name security-baseline

AWS CLI to delete the StackSet

You can change the value of the stack-set-name parameter.

  • To cleanup the stackset-monitor stack

aws cloudformation delete-stack --stack-name stackset-monitor

AWS CLI to delete the stackset-monitor Stack

You can also remove any IAM roles/policies that you specifically created for this blog that you might not need anymore

Conclusion

Throughout this guide, we’ve explored the nuanced approaches to AWS CloudFormation StackSets deployments across large-scale environments. The key takeaways include:

  • Balance is Critical: Every deployment strategy requires careful consideration of the trade-offs between speed, safety, and scale based on your organizational needs.
  • Progressive Adoption Works: For most organizations, a progressive deployment approach with validation gates provides the optimal balance of safety and efficiency.
  • Organizational Context Matters: Enterprise, startup, and regulated industry patterns demonstrate that deployment strategies should be tailored to your specific business requirements and risk tolerance.
  • Monitoring is Essential: As organizations scale to hundreds of accounts, comprehensive monitoring becomes critical for maintaining visibility and ensuring compliance.

These different approaches will help you adopt the right strategy for your AWS CloudFormation Stacksets deployments in your AWS Organization.

You can now test these different approaches on your sandbox environment, before adapting them for your specific needs, in order to balance Speed, Safety and Scale to optimize your deployments.

Amar Meriche

Amar is a Sr Cloud Operations Architect at AWS in Paris. He helps his customers improve their operational posture through advocacy and guidance, and is an active member of the DevOps and IaC community at AWS. He’s passionate about helping customers use the various IaC tools available at AWS following best practices. When he’s not working with customers, Amar can be found on the mountain trails with his family or playing basketball with his team.

Idriss Laouali Abdou

Idriss is a Sr. Product Manager Technical for AWS Infrastructure-as-Code based in Seattle. He focuses on improving developer productivity through StackSets and CloudFormation Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Reduce Docker image build time on AWS CodeBuild using Amazon ECR as a remote cache

Post Syndicated from Kirubakaran Sundaramoorthy original https://aws.amazon.com/blogs/devops/reduce-docker-image-build-time-on-aws-codebuild-using-amazon-ecr-as-a-remote-cache/

In modern software development, containerization with Docker has revolutionized how we build and deploy applications. While Docker enables packaging applications into portable containers, the continuous need to update these images can be resource intensive. AWS CodeBuild addresses this challenge by providing a managed build service that eliminates infrastructure maintenance overhead. In this blog post, we’ll explore how AWS CodeBuild integration with Amazon Elastic Container Registry (Amazon ECR) as a cache backend can significantly accelerate our Docker image build process, making development more efficient and streamlined.

AWS CodeBuild creates isolated environments for each build, which means build artifacts cannot be permanently stored on the host system. While CodeBuild does offer a native local caching feature, it provides only temporary storage and is most effective for builds that occur in quick succession.

This local caching mechanism, however, is not reliable when builds are triggered at varying intervals, as it operates on a best-effort basis. To address this limitation, we recommend using Amazon Elastic Container Registry as a persistent cache for Docker layers. This solution offers several advantages:

  • It provides a reliable, long-term storage solution for build caches
  • The cached layers can be reused across multiple builds regardless of timing
  • The cache remains valid and accessible at any point in time

This post shows how to implement a simple, effective, and durable Docker layer cache for CodeBuild using Amazon ECR repository as a cache backend to significantly reduce image build runtime.

Solution Overview

The following diagram illustrates the high-level architecture of this solution. We describe implementing each stage in more detail in the following paragraphs.

Solution Flow Diagram

Figure 1: Solution Flow Diagram

To use an Amazon ECR registry as a backend for caching, we must first enable the containerd image store in our Docker driver. This feature is not enabled in the default Docker driver configuration. Therefore, we create a new docker driver using docker buildx command with containerd (docker-container driver) image store enabled.

When CodeBuild runs for the first time, it will attempt to retrieve cache data from the Amazon ECR repository. Since this is the first run, no cache will be available. CodeBuild will then proceed to build the Docker image from scratch, generate cache data during this initial build and export both the newly built image and its associated cache to the Amazon ECR repository.

In each subsequent build, CodeBuild will import the previously stored cache from Amazon ECR. This cached data will be used to speed up the image building process, as only the changed layers will need to be rebuilt. Finally, the updated cache and image will be stored back in Amazon ECR.

Prerequisites

Before we begin the walk-through, we must have an AWS account. If you don’t have one, sign up at https://aws.amazon.com.

Walk-through

Launch the following AWS CloudFormation template to create Amazon ECR repository and AWS CodeBuild project including CodeBuild service role and required permission as a managed policy.

AWSTemplateFormatVersion: "2010-09-09"

Description: 'AWS CloudFormation template to create infrastructure which demo using Amazon ECR as a remote cache for AWS CodeBuild'

Parameters:
  CodeBuildProjectName:
    Type: String
    Default: CBECRCacheDemoProject
    Description: "Enter name for your CodeBuild project"
  CodeBuildServiceRolePolicyName:
    Type: String
    Default: CodeBuildDockerCachePolicy
    Description: "Enter name for the IAM policy"
  ECRRepoName:
    Type: String
    Default: amazon_linux_codebuild_image
    Description: "Enter name for Amazon ECR repository"
  GitHubLocation:
    Type:  String
    Default: "https://github.com/aws/aws-codebuild-docker-images"
    Description: "Enter your source code GitHub URL"
  ImageTag:
    Type: String
    Default: demo
    Description: "Enter Tag name for your application docker image"
  CacheTag:
    Type: String
    Default: demo-cache
    Description: "Enter tag name for the cache image"

Resources:

  CodeBuildServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument: |
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": "codebuild.amazonaws.com"
               },
              "Action": "sts:AssumeRole"
            }
         ]
        }
      Path: /

  CodeBuildServiceRolePolicy:
    Type: AWS::IAM::RolePolicy
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - ecr:BatchGetImage
              - ecr:BatchCheckLayerAvailability
              - ecr:InitiateLayerUpload
              - ecr:UploadLayerPart
              - ecr:CompleteLayerUpload
              - ecr:PutImage
              - ecr:GetDownloadUrlForLayer
            Resource: !GetAtt ECRRepository.Arn
          - Effect: Allow
            Action: 
              - ecr:GetAuthorizationToken
            Resource: '*'
          - Effect: Allow
            Action:
              - codeconnections:UseConnection
              - codeconnections:GetConnectionToken
              - codeconnections:GetConnection
              - codestar-connections:GetConnectionToken
              - codestar-connections:GetConnection
            Resource: '*'
          - Effect: Allow
            Action: 
              - logs:CreateLogStream
              - logs:CreateLogGroup
              - logs:PutLogEvents
            Resource: 
              - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/${CodeBuildProjectName}'
              - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/${CodeBuildProjectName}:*'
          - Effect: Allow
            Action:
              - s3:PutObject
              - s3:GetObject
              - s3:GetObjectVersion
              - s3:GetBucketAcl
              - s3:GetBucketLocation
            Resource: 
              - !Sub 'arn:${AWS::Partition}:s3:::codepipeline-${AWS::Region}-*'
      PolicyName: !Ref CodeBuildServiceRolePolicyName
      RoleName: !Ref CodeBuildServiceRole

  ECRRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Ref ECRRepoName
      ImageScanningConfiguration:
        ScanOnPush: true

  CodeBuildProject:
    Type: AWS::CodeBuild::Project
    Properties:
      Name: !Ref CodeBuildProjectName
      Source:
        Type: GITHUB
        Location: !Ref GitHubLocation
        BuildSpec: !Sub |
          version: 0.2

          phases:
            install:
             commands:
               - docker buildx create --name containerd --driver=docker-container --driver-opt default-load=true

            pre_build:
             commands:
               - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${AWS::AccountId}.dkr.ecr.$AWS_REGION.amazonaws.com

            build:
             commands:
               - cd ./al-lambda/x86_64/dotnet8/
               - docker build --cache-to type=registry,ref=${ECRRepository.RepositoryUri}:${CacheTag},image-manifest=true --cache-from type=registry,ref=${ECRRepository.RepositoryUri}:${CacheTag} --tag ${ECRRepository.RepositoryUri}:${ImageTag} --builder=containerd .

            post_build:
             commands:
               - docker push ${ECRRepository.RepositoryUri}:${ImageTag}
      ServiceRole: !GetAtt CodeBuildServiceRole.Arn
      Artifacts:
        Type: NO_ARTIFACTS
      Environment:
        Type: LINUX_CONTAINER
        Image: aws/codebuild/amazonlinux-x86_64-standard:5.0
        ComputeType: BUILD_GENERAL1_SMALL
        PrivilegedMode: true
      Cache:
        Type: LOCAL
        Modes:
          - LOCAL_DOCKER_LAYER_CACHE

Specify the following parameters while creating the CloudFormation stack (see Figure 2):

  1.  Set the cache image tag (CacheTag) to “demo-cache“.
  2. Name the CodeBuild project (CodeBuildProjectName) as “CBECRCacheDemoProject“.
  3. Specify the IAM policy name (CodeBuildServiceRolePolicyName) as “CodeBuildDockerCachePolicy“.
  4. Define the ECR repository name (ECRRepoName) as “amazon_al_lambda_codebuild_image“.
  5. Enter the GitHub repository URL (GitHubLocation) as “https://github.com/aws/aws-codebuild-docker-images“.
  6. Set the Docker image tag (ImageTag) to “demo
CloudFormation Stack parameter

Figure 2: CloudFormation Stack parameter

The CloudFormation stack will set up a comprehensive development environment for our project. It will create a CodeBuild project equipped with all necessary IAM roles and permissions, ensuring smooth and secure build processes. Additionally, the stack will create an Amazon ECR repository. This repository is configured to automatically scan Docker images for vulnerabilities upon upload. The ECR will serve as a secure storage location for both our Docker images and cache images.

The CodeBuild project will be created with a buildspec file which will instruct CodeBuild to do the following:

  • Creates a new driver called “containerd” using buildx since the default Docker driver supports registry cache backend only when the containerd image store is enabled.
  • To pull and push both Docker images and cache, authentication with the Amazon ECR repository is required.
  • During the image build process:
    • We use the --cache-from parameter to force Docker to check for and use any existing cache from the repository.
    • The image-manifest option is set to true to enable cache storage in the Amazon ECR repository.
    • The --cache-to parameter is used to push or update the cache to the Amazon ECR repository.
  • After the build is complete, in the post-build phase, the image is pushed to Amazon ECR. The cache is automatically uploaded to the Amazon ECR repository as part of the Docker build command execution.

Testing the solution

After having successfully created the CloudFormation stack, we can proceed to test and evaluate how it performs.

The initial build process took approximately 10 minutes to complete. Since this was the first execution, no cache was available in Amazon ECR, requiring the system to build the image entirely from scratch. Although the cache import operation failed as expected during this initial run, the system continued with the build process without any cached layers. This first build served a dual purpose: it not only created the application image but also generated a cache, which was then exported to Amazon ECR as a separate image. This cached data would become available for future builds, setting the foundation for more efficient subsequent builds.

To verify the effectiveness of the solution, we triggered a second build after introducing a minor modification – adding an echo command in the middle of the Dockerfile’s active commands (excluding commented line). During this subsequent build, Docker intelligently utilized the cached layers up until the point of modification, after which it rebuilt only the necessary layers. This smart caching strategy resulted in a build time of approximately 6 minutes, clearly demonstrating how the caching system optimizes the build process even when changes are introduced. Further validation across multiple large-scale projects confirmed the effectiveness of this approach, consistently achieving build time reductions of up to 25%.

We enabled CodeBuild’s built-in Docker layer caching feature on a best-effort basis. This approach is recommended as it uses cached layers in the local when available instead of downloading them from the repository, which will further improve the overall build speed.

Cleaning up

When we finished testing, we should de-provision the following resources to avoid incurring further charges and keep the account clean from unused resources:

  • Delete the docker images from the Amazon ECR repository amazon_al_lambda_codebuild_image.
  • Delete the CloudFormation stack which has been created in the “Launch the AWS CloudFormation template” section.

Conclusion

In this discussion, we explored an efficient and straightforward solution for implementing external Docker caching in CodeBuild using Amazon ECR as a backend storage system. This approach offers several key benefits:

The solution reduces Docker build times in CodeBuild up to 25% and is versatile enough to handle most scenarios, including complex multi-stage builds. A particularly valuable advantage is that Amazon ECR stores the cache separately in its repository, making it reusable across different projects.

The business impact is substantial: shorter build times lead directly to reduced compute costs. More importantly, this optimization results in a more streamlined development lifecycle, enabling faster feature releases at lower operational costs.

In essence, this caching solution not only improves technical efficiency but also delivers tangible business value through reduced costs and accelerated development cycles.

About the author

Kirubakaran Sundaramoorthy

Kirubakaran Sundaramoorthy is a Cloud Support Engineer specializing in DevOps practices and AWS architecture, with expertise in AWS CloudFormation and CI/CD implementations. He builds efficient cloud infrastructure solutions using automation processes, cloud deployment strategies, infrastructure as code, and DevOps best practices to help businesses succeed.

Multi-Cloud Code Deployments using Amazon Q Developer with Echo3D

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/multi-cloud-code-deployments-using-amazon-q-developer-with-echo3d/

Banner showing echo3D logo and Amazon Q Developer logo

Image showing 87& speed up in development tasks completion, 41% of code written by Amazon Q Developer, and 60% development productivity increasedn

Overview

Founded in 2018, echo3D built a revolutionary 3D digital asset management (DAM) platform to address the surging demand for immersive content across industries. The company’s platform enables enterprises to seamlessly store, secure, optimize, and share 3D content, serving over 200,000 professionals across energy, healthcare, gaming, retail, and beyond.

echo3D’s platform has become the go-to solution for managing complex 3D assets at scale, supporting major enterprises across multiple sectors. With their technology operating within clients’ own AWS accounts, echo3D delivers critical infrastructure that powers real-time 3D content management for organizations worldwide.

As customer demand grew, echo3D faced increasing pressure to maintain rapid innovation while ensuring stable multi-cloud deployments. With a streamlined development team managing expanding cross-platform requirements, the company needed an efficient solution to accelerate their build and debug processes. This led them to explore Amazon Q Developer as a way to enhance their development capabilities and meet growing market demands.

Opportunity | Building for a Multi-Cloud Reality through Amazon Q Developer

echo3D specializes in 3D digital asset management, with a critical focus on multi-cloud deployments to serve their diverse enterprise client base. The company’s commitment to cross-platform functionality isn’t optional—it’s fundamental to their business model, with many clients specifically requiring AWS compatibility.

The company’s existing cloud infrastructure needed to support seamless migrations while maintaining robust performance across different environments. “For many of our clients, AWS is the ultimate destination,” explains Ben Pedazur, CTO at echo3D. “Amazon Q Developer has proven to be an indispensable guide for these migrations, both for our infrastructure and for the solutions we build for customers.”

After evaluating various solutions, echo3D identified Amazon Q Developer as their key tool for standardizing cross-platform development. “We needed a solution that could generate consistent code across different cloud environments while resolving platform-specific challenges,” notes Pedazur. This capability became particularly crucial during a recent customer migration project, which served as a perfect test case for Amazon Q Developer’s capabilities.

Solution | Streamlining the Journey to AWS with Amazon Q Developer

To streamline their cloud migration process, echo3D implemented Amazon Q Developer across their entire development workflow. The team utilized Amazon Q Developer to handle a critical migration from Azure Cosmos DB to Amazon DynamoDB, leveraging the AI assistant to generate comprehensive migration blueprints that included code modifications, configuration changes, and testing strategies.

Developers used detailed prompts to generate migration plans and receive context-aware guidance throughout the process. Amazon Q Developer provided not just code snippets, but complete architectural solutions that considered both the source and target platforms. During implementation, the team integrated Amazon Q Developer directly into their workflow, receiving real-time suggestions for code optimization and platform-specific adjustments.

The impact of Amazon Q Developer was immediate and measurable, with 41% of the new codebase being generated or auto-completed by the tool. “Amazon Q Developer has transformed our migration efficiency,” says Pedazur. “Our development time for cloud migrations has decreased by 87%, while significantly improving code quality.”

Amazon Q Developer assists throughout the entire development lifecycle, generating test cases, deployment scripts, and documentation. This comprehensive support has led to remarkable improvements: platform-specific bugs decreased by 75%, deployment success rates reached 99.8% across multiple clouds, and code review cycles shortened by 60%.

Beyond code generation, echo3D uses Amazon Q Developer to enhance team collaboration and knowledge sharing. The tool has cut onboarding time for new engineers in half, reducing it from four weeks to two weeks. Support tickets related to deployment errors have dropped by 68%, indicating improved code stability and reliability.

The new multi-cloud infrastructure, built with AWS services including DynamoDB, enables echo3D to scale efficiently while maintaining high performance across different cloud environments. The combination of Amazon Q Developer and AWS services has empowered echo3D to accelerate their development cycle while ensuring consistent quality across platforms.

“Amazon Q Developer isn’t just about coding faster—it’s about building better,” explains Pedazur. “We’ve seen improvements across every metric, from development speed to code quality, allowing our team to focus on innovation rather than troubleshooting.”

Outcome | Reimagining Development Through AI-Powered Workflows

With Amazon Q Developer, echo3D plans to further leverage Amazon Q Developer across their product lifecycle, from rapid prototyping to ongoing code maintenance and enhancement.

“Amazon Q Developer has revolutionized our approach to multi-cloud development,” says Pedazur. “It’s not just about automating tasks; it’s about reimagining our entire workflow. We’re now able to prototype, test, and deploy across cloud platforms with unprecedented speed and accuracy.”

Authors

Headshot of Lilly McDermott, Account Manager, AWS

Lilly McDermott

Lilly McDermott is an AWS account manager specializing in supporting gaming companies and game tech. As a trusted advisor, she guides customers through their cloud journey, helping them implement scalable solutions that drive innovation and growth in their games and services. Lilly is dedicated to guiding her customers in transforming their creative ideas into executable plans, empowering them to thrive in the competitive gaming market.

Headshot of Kevon Mayers, Infrastructure as Code Focus Area Lead and Games Solutions Architect, AWS

Kevon Mayers

Kevon Mayers is a Games Solutions Architect at AWS and is the Infrastructure as Code (IaC) Focus Area Lead for the NextGen Developer Experience Technical Field Community at AWS. Kevon is a Core Contributor for Terraform and has led multiple Terraform initiatives within AWS. Prior to joining AWS, he was working as a DevOps engineer and developer, and before that was working with the GRAMMYs/The Recording Academy as a studio manager, music producer, and audio engineer. He also owns a professional production company, MM Productions.

Headshot of Ben Pedazur, echo3D CTO

Ben Pedazur

Ben Pedazur (CTO at echo3D) holds a MSc in Electrical Engineering from Tel Aviv University specializing in computer vision and network communication, a BSc in Electrical Engineering from Afeka Academic College of Engineering specializing in image processing, is a former engineering manager at Cisco Systems, founder of an AR+Drones startup, and algorithm engineer at AdiMap. Ben is skilled in agile leadership, engineering management, and product research & development.

Headshot of Alon Grinshpoon, echo3D CEO

Alon Grinshpoon

Alon Grinshpoon (CEO at echo3D) holds MS in Computer Science from Columbia University specializing in 3D/AR/VR and human-computer interaction (HCI), BS in Computer Science and Electrical Engineering specializing in cloud technology, former NVIDIA engineer, published 3D UI researcher, a frequent speaker at CES, SXSW, Augmented World Expo (AWE), NYVR, Slush, and more. Alon has published papers in top engineering journals such as SIGGRAPH 2018 Emerging Technologies and IEEE Conference on Virtual Reality and 3D User Interfaces (VR) on AR system design and 3D interaction techniques in AR.

Accelerating AWS Infrastructure Deployment: A Practical Guide to Console-to-Code

Post Syndicated from Damola Oluyemo original https://aws.amazon.com/blogs/devops/accelerating-aws-infrastructure-deployment-a-practical-guide-to-console-to-code/

In today’s cloud-first environment, Infrastructure as Code (IaC) has become crucial for managing cloud resources effectively. However, organizations often face significant challenges in adopting IaC practices, including steep learning curves, complex syntax requirements, and difficulty translating manual operations into code.  Amazon Q Developer‘s Console-to-Code feature addresses these challenges by providing an intuitive bridge between manual AWS Console operations and infrastructure as code. This innovative solution helps organizations accelerate their automation journey while maintaining consistency and reliability in AWS deployments.

Understanding Amazon Q Developer and Console-to-Code

Console-to-Code is a feature of Amazon Q Developer that helps automate AWS infrastructure by recording manual actions performed in the AWS Management Console and converting them into infrastructure-as-code (IaC). It leverages generative AI to generate automation-ready code, allowing users to transition from manual operations to repeatable deployments effortlessly. Console-to-Code provides multi-language support, offering code generation in AWS Cloud Development Kit (CDK) formats such as Java, Python, and TypeScript, as well as AWS CloudFormation in JSON and YAML formats.

Console-to-Code records your console actions, then uses generative AI to suggest code in your preferred language and format.

In this blog post, we’ll explore how Console-to-Code can help you:

  1. Transform manual console actions into reusable infrastructure code
  2. Improve operational efficiency and reduce human error
  3. Accelerate the transition from manual to automated deployments

Supported AWS Services

Console-to-Code currently supports automation for several essential AWS services. These include Amazon EC2, which allows for the provisioning and management of virtual machines; Amazon VPC, which enables configuration of networking components such as subnets, route tables, and gateways; and Amazon RDS, which facilitates the management of database instances, configurations, and scaling options

Potential Use Cases:

Now that we’ve covered the basics of Console-to-Code and its supported services, let’s explore some potential use cases for this feature.

DevOps and Agile Development

In the fast-paced world of DevOps and agile development, Console-to-Code enables teams to rapidly prototype and iterate on infrastructure configurations. The ability to quickly create and replicate consistent environments across development, staging, and production stages ensures infrastructure reliability while maintaining agile velocity.

Compliance-Focused Industries

Organizations in regulated industries benefit from Console-to-Code’s systematic approach to implementing and maintaining compliant infrastructure. By recording proven, compliant configurations, organizations ensure that all subsequent deployments maintain the same level of security and compliance, creating an automatic audit trail for regulatory requirements.

Step-by-Step Guide to Using Console-to-Code

Follow these steps to automate AWS services using Amazon Q Developer’s Console-to-Code feature:

Prerequisites

Before getting started with Amazon Q Developer’s Console-to-Code feature, ensure you have the following:

  • Service Access:
    • AWS Management Console access
    • Access and permissions to supported AWS services such as: Amazon EC2, Amazon VPC, and Amazon RDS
  • Required Permissions:
    • The q:GenerateCodeFromCommands permission for Amazon Q Developer to use Console-to-Code (added by default, no additional requirement from the user)
  • Subscription Tiers:

Free Tier:

    • No fixed monthly limit for recording console actions
    • No fixed monthly limit for generating CLI commands
    • Monthly limit applies to AWS CDK and CloudFormation code generation

Pro Tier:

    • Requires IAM Identity Center authentication
    • IAM Identity Center identity must be subscribed to Amazon Q Developer Pro
    • No fixed monthly limit for AWS CDK or CloudFormation code generation

For this demonstration the Free Tier would suffice.

  1. Supported Code Formats:

Console-to-Code can generate infrastructure-as-code in the following formats:

  • AWS CDK: Java, Python, and TypeScript
  • AWS CloudFormation: JSON and YAML

Getting Started

Step 1: Start Recording

To start recording with Console-to-Code, follow these steps:

  1. Sign in to the AWS Management Console.
  2. Navigate to the console of one of the supported services (Amazon VPC, Amazon RDS, or Amazon EC2)
  3. On the right edge of the browser window, choose the Console-to-Code icon.
  4. Click Start recording.

Note: While recording actions is free, you will still be charged for any AWS resources created during the recording process.

This gif shows the user opening the Console-to-Code panel and starting a recording session in the VPC console

Figure 1: Opening the Console-to-Code panel and starting a recording session in the VPC console.

Step 2: Perform Actions in AWS Console

  1. Go to the AWS service (e.g., EC2, S3) you want to automate.
  2. Perform desired actions such as launching an EC2 instance or creating an S3 bucket.

This gif demonstrates creating and configuring resources while Console-to-Code records all actions in real-time

Figure 2: Demonstration of creating and configuring resources while Console-to-Code records all actions in real-time.

The Console-to-Code panel will record all of these actions as you perform them. You can move between different service consoles (such as VPC and EC2) during a single recording session, allowing you to create a comprehensive recording that involves actions across multiple supported services

Step 3: Generate Code & Stop Recording

  1. In the Console-to-Code panel, review your recorded actions. You can filter the recorded actions using the dropdown, search box, or filter widget at the top of the Console-to-Code panel.
  2. Select the actions that you want to convert into code. Only the actions with checked boxes will be used in the following steps.
  3. Indicate the type of code that you want to generate. From the reverse dropdown menu at the lower right of the Console-to-Code panel, select the language and (if applicable) format of the code to be generated.
  4. Choose Generate chosen language. The generated code will appear, along with the equivalent CLI commands.

This gif shows the user stopping the recording, selecting desired actions, and generating infrastructure code in your preferred language through the Console-to-Code panel

Figure 3: Stopping the recording, selecting desired actions, and generating infrastructure code in your preferred language through the Console-to-Code panel.

Benefits of Using Console-to-Code

The implementation of Console-to-Code offers numerous advantages to AWS users. It increases time efficiency by reducing manual effort on repetitive console tasks and ensures consistency and compliance with organizational security and governance policies. The tool minimizes human errors through the generation of syntactically accurate infrastructure code, enables rapid prototyping for quick transitions from experimentation to production, and serves as a valuable learning resource for new AWS users to understand infrastructure-as-code best practices

Best Practices

Planning and Organization

Success with Console-to-Code requires thorough planning and organization. Document your infrastructure requirements comprehensively, establish clear naming conventions and tagging strategies, and maintain a systematic approach to version control for generated code.

Maintenance and Updates

Regular review and testing of generated code ensure continued reliability and efficiency. Implement a code review process for infrastructure changes and maintain comprehensive documentation of your deployment patterns and configurations.

Troubleshooting Guidelines

Common issues during recording sessions can often be resolved by using a single browser tab and ensuring proper permissions are in place. For code generation issues, validate service compatibility and review action sequences carefully. Clear browser cache and verify IAM permissions when encountering persistent issues.

Conclusion

Amazon Q Developer’s Console-to-Code represents a significant advancement in infrastructure automation, making IaC accessible to teams of all skill levels. By following the strategies and best practices outlined in this guide, organizations can effectively leverage this tool to accelerate their cloud journey while maintaining security, compliance, and operational excellence.

The future of infrastructure automation looks promising with tools like Console-to-Code, enabling organizations to focus more on innovation and less on manual operations. As AWS continues to enhance this feature, users can expect even more capabilities and integrations to support their infrastructure automation needs.

Ready to accelerate your infrastructure automation journey? Start exploring Console-to-Code today by signing into your AWS Management Console and recording your first infrastructure deployment. For additional resources and documentation, visit the AWS Console-to-Code documentation page.

About the Authors:

Adeogo Olajide

Adeogo is a Solutions Architect at AWS, where he supports GovTech customers and other public sector customers in their cloud transformation journey. He specializes in designing secure, scalable, and compliant architectures that help public sector organizations modernize their digital services. Outside of work, he enjoys playing and watching soccer.

Jehu Gray

Jehu Gray is a Prototyping Architect at Amazon Web Services where he helps customers design solutions that fits their needs. He enjoys exploring what’s possible with IaC.

Damola Oluyemo

Damola Oluyemo is a Solutions Architect at Amazon Web Services focused on Enterprise customers. He helps customers design cloud solutions while exploring the potential of Infrastructure as Code and generative AI in software development.

Abiola Olanrewaju

Abiola Olanrewaju is a Solutions Architect at AWS, specializing in helping Financial services customers implement scalable solutions that drive business outcomes. He has a keen interest in Data Analytics, Security and Generative AI.

Ibraheem Ojelade

Ibraheem Ojelade is a Solutions Architect at Amazon Web Services, focused on supporting Independent Software Vendors (ISVs). He partners with customers to accelerate their cloud adoption, optimize performance, and strengthen security. With a strong background in cybersecurity and cloud solutions, he helps ISVs design scalable architectures while exploring emerging technologies to drive innovation and growth.