Tag Archives: software

Building with AI-DLC using Amazon Q Developer

Post Syndicated from Will Matos original https://aws.amazon.com/blogs/devops/building-with-ai-dlc-using-amazon-q-developer/

The AI-Driven Development Life Cycle (AI-DLC) methodology marks a significant change in software development by strategically assigning routine tasks to AI while maintaining human oversight for critical decisions. Amazon Q Developer, a generative AI coding assistant, supports the entire software development lifecycle and offers the Project Rules feature, allowing users to tailor their development practices within the platform.

Recently, AWS made its AI-DLC workflow open-source, enabling developers to create software using this methodology. This workflow is implemented in Amazon Q Developer through its Project Rules customization feature. In this post, we will demonstrate how the AI-DLC workflow operates in Amazon Q Developer using an example use case.

AI-DLC Workflow Overview

The AI-DLC workflow is the practical implementation of the AI-DLC methodology for executing software development tasks. As outlined in the AI-DLC Method Definition Paper, the workflow has three phases. These phases are Inception, Construction, and Operations. Inception involves planning and architecture. Construction focuses on design and implementation. Operations cover deployment and monitoring. Each phase includes distinct stages. These stages address specific software development life cycle functions. The workflow adapts to project requirements. It analyzes requests, codebases, and complexity. This analysis determines the necessary stages. Simple bug fixes skip planning. They go directly to code generation. Complex features need requirements analysis. They also require architectural design and detailed testing.

The workflow maintains quality and control through structured milestones and transparent decision-making. At each phase, AI-DLC asks clarifying questions, creates execution plans, and waits for approval. Every decision, input, and response is logged in an audit trail for traceability. Whether building a new microservice, refactoring legacy code, or fixing a production bug, AI-DLC scales its rigor to match needs—comprehensive when complex, efficient when simple, and always in control. Figure 1 shows the phases and stages within the adaptive AI-DLC workflow. The stages shown in green boxes are mandatory, while those in yellow boxes are conditional.

AI-DLC workflow diagram showing three phases: Inception Phase (blue) with mandatory steps for Workspace Detection, Requirements Analysis, and Workflow Planning, plus conditional steps for Reverse Engineering, User Stories, Application Design, and Units Generation; Construction Phase (green) with conditional steps for Functional Design, NFR Requirements, NFR Design, and Infrastructure Design, followed by mandatory Code Generation and Build and Test steps that loop for each unit; and Operations Phase (orange) with an Operations step. The workflow flows from User Request at the top to Complete at the bottom.

Figure 1. SDLC phases and stages in AI-DLC workflow

Prerequisites

Before we begin the walk-through, we must have an AWS account or AWS Builder Id for authenticating Amazon Q Developer. If you don’t have one, sign up for AWS account or create an AWS builder id. You can use any of the Integrated Development Environments (IDEs) supported by Amazon Q Developer and install the extension as per the AWS documentation. In this post, we’ll be using the Amazon Q Developer extension in VS Code IDE. Once the plug-in is installed, you’ll need to authenticate Q Developer with the AWS cloud backend. Refer to the AWS documentation for Q Developer authentication instructions.

The AI-DLC workflow generates various Mermaid diagrams in markdown files. To view these diagrams within your IDE, you can install a Mermaid viewer plugin.

Let’s Begin Building!

Let’s construct a simple River Crossing Puzzle as a web UI app using AI-DLC. By choosing a straightforward app, we can concentrate more on learning the AI-DLC workflow and less on the project’s technical intricacies.

The sections below outline the individual steps in the AI-DLC development process using Amazon Q Developer. We’ll showcase screenshots of our IDE with the Amazon Q Developer plug-in and demonstrate how to interact with the workflow.

Although we’ve used the Amazon Q Developer IDE plug-in in this blog post, you can also use Kiro Command Line Interface (CLI) to build with AI-DLC without any additional setup. The workflow remains the same, except that you’ll interact through the command line instead of the graphical interface in the IDE.

As we progress through the workflow, your AI-DLC experience will be tailored to your specific problem statement. You’ll also notice the probabilistic nature of large language models (LLMs), as the questions and artifacts generated by them will vary from one run to another for the same problem statement. For example, if you attempt to replicate the same problem statement we used in this blog post, your experience will likely differ. This is expected and desirable. Despite these minor variations, we’ll eventually find a solution to the problem we initially set out to address.

Step 1: Clone GitHub repo containing the AI-DLC Q Developer Rules

Clone the GitHub repo containing the AI-DLC Q Developer Rules:

git clone https://github.com/awslabs/aidlc-workflows.git

Step 2: Load Q Developer Rules in your project workspace

Follow the README.md instructions in the GitHub repo to copy the rules files over to your project folder.

Step 3: Install and authenticate Amazon Q Developer Extension in IDE

Open the project folder you created in Step 2 in VS Code. Open the Amazon Q Chat Panel in the IDE and ensure that the AI-DLC workflow rules are loaded in Q Developer, as shown in Figure 2. If you don’t see what’s shown in Figure 2, please double-check the steps you performed in Step 2.

Screenshot showing four steps to access AI-DLC rules in Amazon Q: Step 1 shows opening Amazon Q Chat Panel from the left sidebar; Step 2 shows opening a chat session in Amazon Q at the top; Step 3 shows clicking on the Rules button in the chat interface; Step 4 shows ensuring AI-DLC rules are loaded in the rules panel on the right side of the screen.

Figure 2: AI-DLC rules enabling in Amazon Q Developer

Step 4: Start the AI-DLC workflow by entering a high-level problem statement

Our development environment is now set up, and we’re ready to begin application development using AI-DLC. In our Q Developer chat session, we enter the following problem statement:

Using AI-DLC let's build a web application to solve the river crossing puzzle.

Notice that we’ve prefixed our problem statement with “Using AI-DLC …” to ensure that Q Developer engages the AI-DLC workflow. Figure 3 shows what happens next. The AI-DLC workflow is triggered within Q Developer. It greets us with a welcome message and provides a brief overview of the AI-DLC methodology.

Figure 3 shows an expanded view of the AI-DLC workflow rules folder structure on the left. You’ll notice that a single aws-aidlc-rules/core-workflow.md is placed in the designated .amazonq/rules folder, while the rest of the rules files are placed in an ordinary aws-aidlc-rule-details folder. This arrangement is designed to optimize model efficiency. By placing the aws-aidlc-rules/core-workflow.md file in the .amazonq/rules folder, , it serves as additional context, ensuring that the core workflow structure is always accessible without incurring additional token consumption. Conversely, the detailed phase and stage-level behavior rules are stored in the aws-aidlc-rule-details folder and are dynamically loaded as required. This approach conserves Amazon Q’s context window and token usage by retaining only the necessary information within the context at any given time, thereby enhancing model efficiency.

The rules files under the aws-aidlc-rule-details folder are organized into three sub-folders, each representing a phase of AI-DLC. Within each phase, there are stage-specific files. A common folder houses cross-cutting rules applicable to all AI-DLC phases and stages such as the “human-in-the-loop”.

The AI-DLC workflow is self-guided and provides us with a clear understanding of what to expect next. It informs us that it will enter the AI-DLC Inception phase next, starting with the Workspace Detection stage within it.

Screenshot of Amazon Q interface showing AI-DLC workflow initialization. The left sidebar displays a file tree with various workflow stages and configuration files. The main chat area shows a problem statement input box at the top with placeholder text 'Using AI-DLC let's build a web application to solve the most pressing problem.' Below is a welcome message explaining AI-DLC (AI-Driven Development Life Cycle) and its capabilities. Three callout annotations highlight: 1) The core workflow dynamically utilizes detailed instructions for different phases and stages, loading and unloading them as required; 2) AI-DLC workflow kicks off with a welcome message and precise overview; 3) The workflow is structured to load a single Q Developer Rule file, one workflow.md, which then dynamically loads and unloads the stage definitions housed in the 'aws-aidlc-rule-details' folder as needed.

Figure 3: User enters high level problem statement in Amazon Q. AI-DLC workflow is triggered.

Step 5: Workspace Detection

We enter the Workspace Detection stage within the Inception phase. In this stage, AI-DLC analyzes the current workspace and determines whether it’s a greenfield (new) or brownfield (existing) application. Since AI-DLC is an adaptive workflow, it decides whether the next stage will be Reverse Engineering (for brownfield projects) or Requirements Analysis (for greenfield projects).

Since we’re building a greenfield application and there’s no existing code in the workspace to reverse engineer, the workflow will guide us to Requirements Analysis next. If we were working on a brownfield application, the workflow would have performed Reverse Engineering first and then moved on to Requirements Analysis. This demonstrates the adaptive nature of the workflow.

Figure 4 illustrates the process in our IDE when we enter this stage. The workflow requests our permission to create an aidlc-docs folder under the project root. This folder will serve as the repository for all the artifacts generated by AI-DLC during the workflow execution. Subsequently, the workflow generates two files within this folder: aidlc-state.md and audit.md. The purpose of these files is explained in Figure 4.

Screenshot of AI-DLC workspace detection phase showing the Amazon Q chat interface. The left sidebar displays the file tree with an 'aidlc-doc' folder highlighted. The main chat area shows the Inception Phase - Workspace Detection stage with explanatory text about analyzing the workspace. Five callout annotations explain: 1) Workflow creates aidlc-doc directory for storing AI-DLC generated artifacts; 2) The workflow tracks its progress in aidlc-metadata.json for error recovery and session continuity; 3) The audit.md file stores user's prompts; 4) Workflow highlights the AI-DLC phase and stage name with a clear heading for easy tracking; 5) Workflow loads detailed stage-level behavior files dynamically such that they don't consume the context window statically. At the bottom, a user approval prompt shows 'mkdir -p /Users/[...]/NewConsumerPortal/aidlc-docs' with the user asked to approve the 'mkdir' command.

Figure 4: Workspace Detection

The Workspace Detection will quickly finish as this is a greenfield project. The workflow will guide us into Requirements Analysis stage within the Inception phase next.

Step 6: Requirements Analysis

The workflow has progressed to the Requirements Analysis stage, where we will define the application requirements. The AI-DLC workflow presented our high-level problem statement to the Q Developer, which then responded with several requirements clarifications questions, as illustrated in Figure 4.

Several AI-DLC rules came into play at this stage. One rule instructed Amazon Q to avoid making assumptions on the user’s behalf and instead ask clarifying questions. Since LLMs tend to make assumptions and rush towards outcomes, they must be explicitly instructed to align with the engineering rigor of the AI-DLC methodology. To achieve this, the Q Developer presented several requirements clarification questions in requirement-verification-questions.md file and asked us to answer them inline in the file.

Another AI-DLC rule instructed the Q Developer to present questions in multiple-choice format and always include an open-ended option (“Other”) to enhance user convenience and provide flexibility in answering.

As shown in Figure 5, Amazon Q has asked us about the desired puzzle variant, such as the Classic Farmer, Fox, Chicken, and Grain puzzle or other popular variations. Additionally, it has asked us questions about user interaction methods, score persistence across multiple players, and the creation of a leaderboard.

These questions are essential for achieving our desired application outcome. Our responses to these questions will determine the final product. While we didn’t explicitly specify this level of detail in our high-level problem statement, AI-DLC has delegated detailed requirements elaboration to Amazon Q, but we still retain control over what gets built.

Screenshot of AI-DLC Requirements Analysis phase showing a split view. The left side displays a requirements clarification questions markdown file with multiple-choice questions about the Kuer Crossing Portal, including sections about user crossing portal variants, primary user interaction methods, and data storage preferences. The right side shows the Amazon Q chat interface with the Inception Phase - Requirements Analysis heading. Two callout annotations highlight: 1) AI-DLC asks questions in multiple choice format, with an 'Other' option that leaves an open-ended fill-in-the-blank when the answer doesn't match the predefined options; 2) AI-DLC generates config.requirements-clarification-questions.md file containing requirements clarification questions, with questions placed in an MD file where the user can respond inline in the file, using 'Answered' to indicate completion. The chat shows instructions for answering questions to clarify requirements.

Figure 5: Requirements Analysis

We answer all the questions in requirement-verification-questions.md file and enter “Done” in the chat window.

Amazon Q processes our responses. The AI-DLC workflow is designed to identify human errors. It checks if we’ve answered all the questions and identifies any contradictions or ambiguities in our answers. Any confusions, contradictions, or ambiguities will be flagged for follow-up questions. AI-DLC adheres to high standards and ensures that we don’t proceed to the next step until we’re fully in agreement on the requirements between us and Amazon Q.

Since we answered all the questions and there were no contradictions in our answers, the workflow continues and generates a comprehensive requirements.md document, as shown in figure 6.

Screenshot showing AI-DLC requirements review phase with split view. The left side displays a Requirements Document for the River Crossing Puzzle Web Application, including Intent Analysis Summary, User Request details, Request Type, and Functional Requirements with Core Puzzle Functionality items (FR-001 through FR -006) describing game features like classic farmer puzzle, timer display, move tracking, puzzle state validation, and victory messages. The right side shows the Amazon Q chat interface with 'Requirements Analysis Complete' heading, displaying project details including Puzzle Type (Classic Farmer, Fox, Chicken, and Grain river crossing puzzle), Technology (React-based modern web application), and Target Devices (web browsers only). Three callout annotations highlight: 1) Requirements Analysis phase complete; 2) Requirements document generated; 3) User may Request Changes, Add User Stories for Approval, or Approve & Continue, with a REVIEW SAFETY note warning users to review requirements and approve to continue, with options to request changes or add modifications if required.

Figure 6: Requirements Review

The workflow prompts us to review the requirements.md document and decide on the next step. If we’re not aligned on the requirements, we can prompt Amazon Q to help us achieve alignment. We can then iterate on the requirements until we’re fully aligned. Once we’re fully aligned, we prompt AI-DLC to progress to the next stage.

Given the adaptive nature of the AI-DLC workflow, Amazon Q has recommended that this application is simple enough, and we can skip the User Stories stage. If we felt otherwise, we would have overridden the model’s recommendation. In this case, we agree with Q’s recommendation and will therefore enter “Continue” in the chat window.

The workflow will enter Workflow Planning stage next.

Step 7: Workflow Planning

With our requirements established, we proceed to the Workflow Planning stage. In this phase, we leverage the requirements context and the workflow’s intelligence to plan the execution of specific stages of AI-DLC within the workflow to build our application as per the requirements specification.

Figure 7 illustrates the workflow planning stage in Q Developer. The workflow has generated an execution-plan.md file that outlines the recommended stages for execution and those that should be skipped.

The workflow planning process is highly contextual to the requirements. During requirements analysis, we decided to develop a simple river crossing puzzle application, consisting of a single HTML file, without a backend, leaderboard, or persistence. Consequently, Amazon Q recommends that we skip all the conditional stages, such as User Stories, Application Design, Units of Work Planning, and so on, and proceed directly to the Code Generation Planning stage in the Construction phase.

Figure 7 visually represents the recommended workflow graphically, indicating the stages that will be executed and those that will be skipped.

Screenshot of AI-DLC Workflow Planning phase showing an Execution Plan document on the left with Detailed Analysis Summary including user-facing changes, brownfield changes, API changes, and NFR changes, plus Risk Assessment. Below is a Workflow Visualization flowchart diagram showing the workflow stages from Inception through Construction to Operations phases. The right side shows the Amazon Q chat with 'Workflow Planning Complete' heading. Three callout annotations highlight: 1) AI -DLC workflow has analyzed requirements and based on the problem complexity has proposed a set of stages to execute in the workflow; 2) Problem is simple enough that AI-DLC is proposing to skip the detailed optional stages; 3) User may Request Changes, Add back skipped stages or Approve & Continue.

Figure 7: Workflow Planning

Since we’ve opted for a straightforward web UI app in this blog post for brevity, the workflow execution plan suggested by AI-DLC aligns seamlessly with our objectives. Should we not be aligned with the AI-DLC recommended workflow execution plan, we would request Q Developer to modify the plan to suit our preferences.

Since we’ve agreed on the workflow plan, we’ll enter “Continue” in Q’s chat session. If we weren’t aligned with the recommended workflow execution plan, we’d have prompted Q with our concerns and iterated over the revised execution plan until it aligned with our preferences. Following the recommended execution plan, the workflow will transition into the Construction phase and directly into the Code Generation Plan stage in the phase.

Step 8: Code Generation Planning

AI-DLC prioritizes planning over rushing to outcomes. This approach aligns with the concept of human-in-the-loop behavior, allowing us to detect issues early on, provide feedback on the plan, and prevent wrong assumptions from propagating further. Before we proceed with actual Code Generation, we undergo Code Generation Planning.

During Code Generation Planning, AI-DLC creates a detailed, numbered plan. It analyzes the requirements and design artifacts, breaking down the process into explicit steps for generating business logic, the API layer, the data layer, tests, documentation, and deployment files.

The plan is documented in a {unit-name}-code-generation-plan.md file, complete with check boxes. This ensures transparency, allowing users to see what will be built. It also provides control, enabling users to modify the plan. Additionally, it maintains quality by ensuring comprehensive coverage of code, tests, and documentation.

Figure 8 illustrates the AI-DLC’s code generation plan. The proposed workflow comprises eight steps, starting with creating an HTML structure and progressing to adding styling, game logic, and concluding with testing and documentation.

Screenshot of AI-DLC Code Generation Planning showing a Code Generation Plan document for River Crossing Puzzle on the left, with Unit Context listing HTML Structure, CSS Styling, and JavaScript files, followed by Unit Generation Steps including Step 1: HTML Structure Generation, Step 2: CSS Styling Generation, and Step 3 : Core Game Logic Generation with detailed checkboxes for each step. The right side shows Amazon Q chat with code generation plan details. Three callout annotations highlight: 1) The plan doc contains to-do items for AI-DLC to execute. These checkboxes get completed when the task is done; 2) This is how AI-DLC workflow persists and tracks progress state; 3) AI-DLC has proposed an 8-step code generation plan with checkboxes and review prompts, and User may Request Changes or Approve & Continue.

Figure 8: Code Generation Planning

The code generation plan appears reasonable to us. We will proceed to the Code Generation stage by entering “Continue” in Q’s chat session.

Step 9: Code Generation

The Code Generation stage executes the Code Generation Plan we approved in the previous step. It generates actual code artifacts step-by-step, including business logic, APIs, data layers, tests, and documentation. Completed steps are marked with check boxes, progress is tracked, and story traceability is ensured before presenting the generated code for user approval.

Figure 9 illustrates that the Code Generation stage has been completed. We are now reviewing a single index.html file generated with embedded styling and JavaScript consistent with our preference specified in requirements.md.

The workflow provides a summary of the activities performed during the Code Generation phase.

Screenshot of AI-DLC Code Generation phase showing generated HTML code on the left with embedded styling and JavaScript for the River Crossing Puzzle application. The right side shows Amazon Q chat with 'Code Generation Complete - river-crossing-puzzle' heading and a list of generated artifacts including HTML file, CSS interface, drag-and-drop interface, game logic, and testing services. Two callout annotations highlight: 1) The generated code is an HTML file with embedded styling and JavaScript; 2) We have specified during requirements analysis phase that we want a single-file index.html file implementation; 3) Code generation has been completed, and a summary of the generated artifacts is provided.

Figure 9: Code Generation

We’re about to test our newly created application soon. While it may be straightforward to test this simple puzzle app right now, for complex applications, we generate build and test instructions using AI-DLC.

We’ll enter “Continue” in the workflow and enter the final Build and Test stage in the Construction phase.

Step 10: Build and Test

These questions are essential for achieving our desired application.
We’ve reached the final stage of the AI-DLC Construction Phase, known as the Build and Test stage. During this stage, we create comprehensive instruction files that guide the build and packaging of the project, and document the necessary testing layers. These layers include unit tests (validating generated code), integration tests (checking unit interactions), performance tests (load/stress testing), and additional tests as required (security, contract, e2e).

The generated build instructions include dependencies and commands, test execution steps with expected results, and a summary document that provides an overview of the overall build/test status and the project’s readiness for deployment.

Figure 10 illustrates the documentation generated during this stage.

Screenshot of AI-DLC Build and Test phase showing a Build and Test Summary document on the left with Build Status (Build Tool, Build Status, Build Artifacts, Build Warnings) and Test Execution Summary including Unit Tests, Integration Tests, and Performance Tests sections with checkmarks and failure indicators. The right side shows Amazon Q chat with build and test completion status and project summary. Two callout annotations highlight: 1) Build and Test Complete! Build and Test instructions have been documented; 2) The AI-DLC workflow has concluded with a comprehensive summary of all completed stages and generated artifacts.

Figure 10: Build and Test

The AI-DLC workflow has now concluded.

Let’s Solve the Puzzle!

We open index.html in a web browser to access our newly created River Crossing Puzzle application. As shown in figure 11, we see our graphical web UI.

During requirements assessment, we chose a straightforward user interface using HTML, CSS, and JavaScript (without any frameworks), as evident in the display shown in Figure 11. Your display may vary due to the probabilistic nature of LLMs and the choices you made for requirements.

We attempt to solve the puzzle and find that it works as expected.

Side-by-side screenshots of the River Crossing Puzzle web application showing two game states. The left screenshot shows the initial state with a farmer on the left bank, and fox, chicken, and grain items listed below, with a blue river in the center and right bank on the right. The right screenshot shows a game state after moves with the farmer on the right bank and a success message 'Congratulations! You won in 7 moves!' displayed at the bottom. Both screens have a yellow 'Start Over' button and show move counts.

Figure 11: River Crossing Puzzle Web App

Conclusion

This post shows how AWS’s open-source AI-DLC workflow, guided by Amazon Q Developer’s Project Rules feature, helps developers build applications with structured oversight and transparency.

Using a River Crossing Puzzle web application as an example, the walk-through illustrates how AI-DLC methodology adapts its rigor based on project complexity, skipping unnecessary stages for simple applications while maintaining comprehensive processes for complex projects. Throughout each stage, AI-DLC enforces “human-in-the-loop” behavior, requiring user approval at critical checkpoints, asking clarifying questions, and maintaining complete audit trails for traceability.

The exercise successfully demonstrates how AI-DLC balances AI automation with human oversight, enhancing productivity without sacrificing quality or control. By following this structured, repeatable methodology, development teams can leverage generative AI’s capabilities while ensuring humans remain in charge of architectural decisions and implementation approaches. This framework provides the necessary guardrails for responsible and effective AI-assisted software development across projects of varying complexity.

Cleanup

We did not create any AWS resources in this walk-through, so no AWS cleanup is needed. You may cleanup your project workspace at your discretion.

Ready to get started? Visit our GitHub repository to download the AI-DLC workflow and join the AI-Native Builders Community to contribute to the future of software development.

About the authors:

Raja SP

Raja is a Principal Solutions Architect at AWS, where he leads Developer Transformation Programs. He has worked with more than 100 large customers, helping them design and deliver mission critical systems built on modern architectures, platform engineering practices, and Amazon inspired operating models. As generative AI reshapes the software development landscape, Raja and his team created the AI Driven Development Lifecycle (AI-DLC) — an end to end, AI native methodology that re-imagines how large teams collaboratively build production-grade software in the AI era.

Raj Jain

Raj is a Senior Solutions Architect, Developer Specialist at AWS. Prior to this role, Raj worked as a Senior Software Development Engineer at Amazon, where he helped build the security infrastructure underlying the Amazon platform. Raj is a published author in the Bell Labs Technical Journal, and has also authored IETF standards, AWS Security blogs, and holds twelve patents

Siddhesh Jog

Siddhesh is a Senior Solutions Architect at AWS. He has worked in multiple industries in a wide variety of roles and is passionate about all things technology. At AWS Siddhesh is most excited to help customers transition to the AI Driven Development Lifecycle and enable them to build applications rapidly in a secure, complaint and cost efficient cloud environment.

Will Matos

Will Matos is a Principal Specialist Solutions Architect with AWS’s Next Generation Developer Experience (NGDE) team, revolutionizing developer productivity through Generative AI, AI-powered chat interfaces, and code generation. With 27 years of technology, AI, and software development experience, he collaborates with product teams and customers to create intelligent solutions that streamline workflows and accelerate software development cycles. A thought leader engaging early adopters, Will bridges innovation and real-world needs .

Open-Sourcing Adaptive Workflows for AI-Driven Development Life Cycle (AI-DLC)

Post Syndicated from Will Matos original https://aws.amazon.com/blogs/devops/open-sourcing-adaptive-workflows-for-ai-driven-development-life-cycle-ai-dlc/

AI-Driven Development Life Cycle (AI-DLC) holds the promise of unlocking the full potential of AI in software development. By emphasizing AI-led workflows and human-centric decision-making, AI-DLC can deliver velocity and quality. However, realizing these gains hinges on how organizations effectively integrate AI into their engineering workflows.

Through our work with engineering teams across industries, we have identified three recurring challenges. These challenges consistently limit the effectiveness of AI in accelerating modern software development. The first challenge is one-size-fits-all workflows. These workflows force every project through the same rigid sequence of steps. The second challenge is the lack of flexible depth in workflow stages. This leads to over-engineering or insufficient rigor. The third challenge is tools that over-automate. These tools unintentionally divert humans away from critical validation and oversight responsibilities.

Achieving true, sustainable productivity requires the process and AI coding agents to become adaptive to context, flexible in depth, and collaborative by design. In this blog, we’ll show you how AI-DLC’s core principles address these three challenges, transforming them from productivity blockers into opportunities for adaptive, human-centered development. We’ll describe how AI-DLC enables workflows that adapt to the problem at hand by intelligently selecting stages, modulating depth, and embedding human oversight at every critical decision point.

We will also introduce our open-source Amazon Q Developer/Kiro Rules implementation, which brings AI-DLC principles to life through adaptive workflow scaffolds. This allows you to start applying these principles in your own projects and experience AI-native development that accelerates delivery without compromising engineering discipline or human judgment.

How does AI-DLC address these challenges?

Let’s explore how AI-DLC addresses these challenges.

1. The “One-Size-Fits-All” Workflow Problem

Software development has never been a linear process. In practice, different projects follow distinct pathways with their own checkpoints and deliverables. Consider these examples:

  • A simple defect fix doesn’t require elaborate requirements analysis and planning
  • A pure infrastructure porting project doesn’t warrant application design with domain modeling
  • A new feature or service addition demands different steps than applying a security patch

Yet, many modern Agentic coding tools provide hard-wired, opinionated workflows that ignore this diversity. Regardless of intent or scope, every project is forced through the same rigid sequence of steps—even when some add little or no value. This rigidity introduces friction, wastes time, and reduces productivity. The result: artificial ceremonies, unnecessary artifacts, redundant approvals, and process overhead that impede velocity.

How AI-DLC addresses this challenge:
AI-DLC addresses this challenge through the Principle 10 (No Hard-Wired, Opinionated SDLC Workflows) as defined in the AI-DLC Method Definition Paper.

AI-DLC avoids prescribing opinionated workflows for different development pathways (such as new system development, refactoring, defect fixes, or microservice scaling). Instead, it adopts a truly AI-First approach where AI recommends the Level 1 Plan based on the given pathway intention.

2. Lack of Flexible Depth Within Each Stage

True adaptivity must go beyond the breadth of a workflow and extend into its depth and intensity. This is how human experts intuitively plan software projects today.

Even when workflows are flexible, many tools fail to modulate the depth of engagement at each stage. For example, building a lightweight utility function doesn’t require full-scale Domain-Driven Design or detailed architectural modeling. When an AI coding agent compels teams to follow these steps regardless of need, the consequence is wasted effort and an over-engineered product. Developers spend cycles reviewing artifacts as the tools dictate rather than delivering business value.

How AI-DLC addresses this challenge:
Through the same principle 10, AI-DLC adapts both the breadth (choice of stages) and the depth of each stage to match the complexity of the intent and context. For example, the complexity of the requirements determines whether a conceptual design is sufficient or whether a full architectural deep dive is required in the Design stage.

Humans validate and adjust this AI-proposed breadth and depth, ensuring that each stage’s rigor matches the scope of the challenge. This elasticity—balancing breadth and depth—is essential for sustaining true velocity without sacrificing engineering discipline.

3. Tools that Reduce the Emphasis on Human Oversight

As AI tools automate more of the Software Development Life Cycle (SDLC), a new risk has emerged: process atrophy. Developers, excited by automation, often drift into passive execution—allowing AI to “decide everything.” The result is a loss of reflection, weakened oversight, and erosion of shared understanding. AI tools must not only automate work but also amplify the significance of human judgment. They should remind practitioners that “human in the loop” is not a checkbox—it is the cornerstone of trust, accountability, and correctness in AI-native development. Equally critical are the rituals and rhythms that sustain collaborative engineering.

How AI-DLC addresses this challenge:
AI-DLC addresses this challenge by requiring a collaborative human-in-the-loop cycle at every stage of the workflow. In this loop, AI generates a plan to execute a task, and relevant stakeholders assemble, review, and validate it.

These rituals, defined as Mob Elaboration and Mob Construction in AI-DLC, ensure that AI’s suggestions are not blindly accepted. Approved plans are executed, and stakeholders again review and validate the final artifacts. The AI-DLC workflow records every human action and approval, embedding reflection to ensure that humans remain the compass, guiding AI’s acceleration.

Circular workflow diagram showing AI-DLC collaboration cycle. Starting at top: Humans Provide Task (orange person icon) , arrow to AI Creates Plan and Seeks Clarification (blue brain icon), arrow to Humans Provide Clarification (orange person icon), arrow to AI Refines Plan (blue brain icon), arrow to Humans Approve Plan (orange person icon), arrow to AI Executes Plan (blue brain icon), arrow to Humans Verify Outcome (orange person icon), completing the cycle back to the start. The diagram illustrates iterative human-AI collaboration with humans making decisions and AI performing execution tasks.

Figure 1: AI-DLC workflow: Humans decide and validate, AI plans and executes.

Effective tooling must therefore emphasize:

  • Promoting for stakeholder collaboration: The system should explicitly call for collaborative rituals involving stakeholders
  • Auditability: Every AI-generated plan and artifact should surface rationale and invite review, recording every human oversight and interaction
  • Flow awareness: Tools should detect when automation races ahead of human validation and deliberately slow down to emphasize critical checkpoints

The goal is not to suppress automation but to embed critical human ownership.

From Principles to Practice

The ideas we outlined — adaptive workflows, flexible depth, and embedded human oversight — are compelling in theory and validated by all engineering teams we’ve engaged. The critical question is: How do we operationalize these ideas into practice without reintroducing the rigidity we seek to eliminate?

One approach is manual prompt engineering: crafting structured prompts that guide AI assistants through the AI-DLC workflow step by step. Each prompt encodes the role AI should assume, the task at hand, the governance requirements, and the audit trail expectations. This structured approach transforms a simple AI interaction into a disciplined workflow that embodies AI-DLC principles.

This approach, while promising, faces its own limitations. Crafting intricate prompts demands discipline and expertise, posing barriers to widespread adoption. Moreover, humans become responsible for maintaining workflow adaptability, selecting the appropriate prompt at the right moment, and ensuring collaborative checkpoints are honored. This places the burden of orchestration back on practitioners, diverging from our core principle of truly AI-native development, where AI itself drives adaptive decision-making.

The question arises: How can we embed AI-DLC principles directly into the execution layer, making adaptivity and collaboration inherent properties of the system rather than manual responsibilities?

Steering for Productivity

The answer lies in workflow scaffolds. These are Rules or Steering customizations for AI Coding Agents. They operationalize AI-DLC principles within the tools. This is done while maintaining transparency, audibility, and modifiability. Our implementation uses Rules/Steering Files. These serve as the foundation of this execution layer. It transforms AI from a passive assistant into an adaptive decision engine.

Rather than requiring developers to craft elaborate prompts, AI-Driven development begins with a simple statement of intent. From there, the workflow scaffolds evaluate context, assess complexity, and dynamically construct an appropriate development pathway. The core workflow definition, including a library of stages and decision heuristics for when and how to apply them, empowers AI to continuously tailor the development process to the nature of the work at hand.

Each AI-DLC phase (Inception, Construction, Operations) evaluates the depth at which it should execute, resulting in a process that adapts to the problem rather than forcing the problem to adapt to the process. This approach yields several critical outcomes:

  1. Adaptive decisioning: The workflow conforms to the problem’s shape, intelligently skipping or deepening stages based on contextual assessment rather than predetermined rules.
  2. Transparent checkpoints: Human approvals are embedded at every decision gate, preserving oversight while maintaining velocity. The system doesn’t just automate; it orchestrates collaboration.
  3. End-to-end traceability: Every artifact, decision, and conversation is logged, creating a continuous, inspectable trail of reasoning that supports both accountability and continuous improvement.

The result is a process that is context-aware, scalable, and self-correcting – capable of supporting everything from a single-line defect fix to a comprehensive system modernization, all while maintaining the rigor and human judgment that define engineering excellence.

Build, Test, and Evolve with Us

We’re open-sourcing the AI-DLC workflow, implemented as Amazon Q Rules and Kiro Steering Files, so organizations everywhere can experience AI-DLC in practice and build production-grade systems. We invite developers, architects, and engineering leaders to:

  1. Apply the steering rules in real-world projects, whether brownfield or greenfield. Refer to our companion AI-DLC workflow walkthrough blog for step-by-step instructions on how to build using AI-DLC in Amazon Q Developer.
  2. Observe how the process adapts to your project’s size, scope, and intent.
  3. Share your experience through our GitHub repository, where you can open issues, propose improvements, and contribute ideas.

Your feedback will help evolve this into a foundation for AI-native software development – one that accelerates delivery without sacrificing rigor or human judgment. Together, we can redefine what software engineering looks like in the age of AI: not scripted but steered.

Conclusion

AI-DLC addresses multiple challenges limiting AI’s effectiveness in software development such as rigid workflows, inflexible workflow depth, and tools that reduce human oversight. AI-DLC enables adaptive workflows that intelligently select stages, modulate depth, and embed human oversight at critical decision points. This approach, implemented through open-source tools like Amazon Q Developer Rules and Kiro Steering, accelerates delivery while maintaining engineering discipline and human judgment.

AI-DLC emphasizes human oversight and collaboration in AI-driven software development. Workflow scaffolds, embed AI-DLC principles into the execution layer, enabling adaptive decision-making, transparent checkpoints, and end-to-end traceability. Open-sourcing the AI-DLC workflow allows organizations to experience AI-DLC in practice and contribute to its evolution.

Ready to get started? Visit our GitHub repository to download the AI-DLC workflow and join the AI-Native Builders Community to contribute to the future of software development.

 

About the authors:

Raja SP

Raja is a Principal Solutions Architect at AWS, where he leads Developer Transformation Programs. He has worked with more than 100 large customers, helping them design and deliver mission critical systems built on modern architectures, platform engineering practices, and Amazon inspired operating models. As generative AI reshapes the software development landscape, Raja and his team created the AI Driven Development Lifecycle (AI-DLC) — an end to end, AI native methodology that re-imagines how large teams collaboratively build production-grade software in the AI era.

Raj Jain

Raj is a Senior Solutions Architect, Developer Specialist at AWS. Prior to this role, Raj worked as a Senior Software Development Engineer at Amazon, where he helped build the security infrastructure underlying the Amazon platform. Raj is a published author in the Bell Labs Technical Journal, and has also authored IETF standards, AWS Security blogs, and holds twelve patents

Siddhesh Jog

Siddhesh is a Senior Solutions Architect at AWS. He has worked in multiple industries in a wide variety of roles and is passionate about all things technology. At AWS Siddhesh is most excited to help customers transition to the AI Driven Development Lifecycle and enable them to build applications rapidly in a secure, complaint and cost efficient cloud environment.

Will Matos

Will Matos is a Principal Specialist Solutions Architect with AWS’s Next Generation Developer Experience (NGDE) team, revolutionizing developer productivity through Generative AI, AI-powered chat interfaces, and code generation. With 27 years of technology, AI, and software development experience, he collaborates with product teams and customers to create intelligent solutions that streamline workflows and accelerate software development cycles. A thought leader engaging early adopters, Will bridges innovation and real-world needs.

Multi Agent Collaboration with Strands

Post Syndicated from Aaron Sempf original https://aws.amazon.com/blogs/devops/multi-agent-collaboration-with-strands/

In the evolving landscape of autonomous systems, multi-agent collaboration is becoming not only feasible but necessary. As agents gain more capabilities, like advanced reasoning, adaptation, and tool use, the challenge shifts from individual performance to effective coordination. The question is no longer “can an agent solve a task?” but “how do we organize execution across many intelligent agents?”

A foundational step toward answering this came with the Supervisor pattern, introduced in our article on creating asynchronous AI agents with Amazon Bedrock. The Supervisor addresses the first generation of coordination challenges by acting as a centralized orchestrator, monitoring and delegating tasks across agents in a structured, serverless workflow. It provides asynchronous orchestration, fallback handling, and state tracking across loosely coupled agents, giving organizations a reliable way to move from single-agent prototypes to multi-agent systems.

Yet as agentic systems scale and become more dynamic, the limitations of static supervision become clear. The Supervisor model assumes a relatively stable set of agents and predictable workflows; but modern systems face constantly shifting tasks, emergent capabilities, and the need for adaptive coordination. This is where the Arbiter pattern emerges as the natural evolution: a next-generation supervisory model that extends the Supervisor with dynamic agent generation, semantic task routing, and blackboard-model-based coordination. By addressing the unpredictability and fluidity of large, evolving agent ecosystems, the Arbiter pattern enables systems not only to manage complexity but to thrive in it.

The Arbiter pattern builds directly on this by adding three key capabilities:

  1. Semantic Capability Matching: Instead of only assigning known tasks to known agents, the Arbiter reasons about what kind of agent should exist for a task—even if that agent doesn’t exist yet.
  2. Delegated Agent Creation: If no suitable agent is found, the Arbiter escalates the request to a Fabricator agent that dynamically generates a task-specific agent on demand. This moves beyond delegation to true adaptive generation.
  3. Task Planning + Contextual Memory: Building on the Supervisors task coordination capability, Arbiter decomposes complex inputs into structured task plans, and uses contextual memory to track execution, retry logic, and agent performance.

In short, the Arbiter transforms static orchestration into adaptive coordination.

The Blackboard Model Revisited

To enable loose, extensible collaboration across agents, the Arbiter Pattern incorporates principles from the blackboard model – a classic architecture from distributed AI. In this model, agents contribute opportunistically to a shared data space (the “blackboard”), reacting to changes and collectively solving problems.

Reference: See “The Blackboard Model of Control” (Hayes-Roth et al.), and early applications like Hearsay-II for foundational research.

In our extended Arbiter Pattern, the blackboard becomes a semantic event substrate. Agents, including the Arbiter, publish and consume task-relevant state, enabling loosely coupled, event-driven collaboration.

How It Works

When an event enters the system, the Arbiter takes on the supervisory role but extends it with greater dynamism and adaptability. Like the Supervisor pattern, it begins by interpreting the event and identifying the required objectives and sub-tasks. It then performs a capability assessment, using a local index or peer-published manifests, much like the Supervisor querying an Agents config table.

  1. Interpretation: The Arbiter uses LLM-based reasoning to extract task objectives and sub-tasks.
  2. Capability Assessment: It evaluates which agents can handle each sub-task using a local index or peer capability manifests.
  3. Delegation or Generation:
    • If a suitable agent exists, the task is routed accordingly.
    • If not, the Arbiter sends a generation request to the fabricator agent.
  4. Blackboard Coordination: All agents involved read/write to a shared semantic blackboard, contributing as needed based on observed task state.
  5. Reflection and Adaptation: Performance data is logged and used to inform future agent creation, adaptation, or deprecation.

Arbiter Pattern Architecture

Unlike the Supervisor, which maintains orchestration through a static config list, the Arbiter introduces a shared semantic blackboard that allows all participating agents to read, write, and coordinate based on evolving task state. This blackboard serves as a dynamic collaboration space, enabling mid-task adaptation and richer multi-agent coordination.

The following Diagram 1: Agentic AI Arbiter pattern implemented as a code example can be downloaded here

Architecture diagram of the Arbiter Pattern for Agentic AI. The diagram illustrates the components and flow of the pattern, showing how multiple AI agents interact with an arbiter to coordinate tasks and decision-making in a structured system

Diagram 1: Agentic AI Arbiter pattern

The following sequence describes the Arbiter pattern, according to the numbered steps in the diagram 1: Agentic AI Arbiter pattern

  1. Events entering the system trigger the Supervisor function
  2. Supervisor queries Agents Config table for agent capabilities
  3. Supervisor uses Agents config list as context to plan orchestration of tasks

Option: New Agent:

If no capable agent is found, the Arbiter goes further than the basic supervisor pattern: it issues a generation request to a fabricator agent, which synthesizes new worker code, stores it for runtime access, and updates the capability registry so the agentic system can immediately benefit from the new skill.

  1. Task cannot be completed, request create new capability
  2. Request to fabricate triggers Fabrication agent instance
  3. Fabrication agent queries resources register for available tools (capabilities)
  4. Fabricator generates worker agent code
  5. Worker agent code stored in bucket for runtime access
  6. New worker added to Agents config list with agent capabilities description
  7. Result of fabrication posted to message bus

Repeat steps 1, 2 & 3

Option: Orchestrate workflow:

If a suitable agent exists, the Arbiter orchestrates the workflow by invoking the appropriate worker agents, tracking progress and state as in the Supervisor model.

  1. Orchestration of tasks is stored for tracking end-to-end process
  2. Request to invoke worker agent, by name/id. Add workflow state for agent invocation.
  3. Request to invoke worker agent triggers worker agent wrapper instance
  4. Worker agent wrapper loads agent code
  5. Worker agent reasons and takes action
  6. Worker agent sends response to message bus
  7. Supervisor agent updates workflow state and tracks against orchestration

The Arbiter incorporates a reflection and adaptation loop: performance data from task execution is logged, analyzed, and fed back into the fabricator and coordination logic. This ensures that not only are tasks completed in the moment, but the system continuously adapts, retires underperforming agents, and evolves toward greater efficiency.

The Arbiter Agent: Event Orchestration Engine

The Supervisor Agent (Arbiter Agent) serves as the central coordinator component, managing complex event-driven workflows through intelligent task delegation.

Event Processing Workflow:

The Arbiter pattern follows a structured approach to handle incoming events

  1. Configuration Loading: Loads available agent configurations from Amazon DynamoDB via load_config_from_dynamodb()
  2. LLM Invocation: Invokes Amazon Bedrock LLM with event context and available tool specifications
  3. Decision Analysis: LLM analyzes the event and returns tool invocation decisions with parameters
  4. Task Dispatch: For each specified tool call:
    • Extracts tool name, input parameters, and tool use ID
    • Dispatches message to corresponding Amazon Simple Queue Service (SQS) queue via process_tool_call()
    • Maintains tool invocation list for workflow tracking

Workflow State Management:

The system maintains comprehensive state tracking throughout execution

  • Creates workflow tracking record in DynamoDB with create_workflow_tracking_record()
  • Initializes all invoked agents as incomplete
  • Associates unique request ID with orchestration instance
  • Persists orchestration state including conversation history and request mapping

Completion Coordination:

The Arbiter coordinates task completion through a systematic process

  1. Event Reception: Receives agent completion events via Amazon EventBridge
  2. Status Updates: Updates workflow tracking with update_workflow_tracking()
  3. Completion Check: Performs completion check across all tracked agents
  4. Result Aggregation: When all agents complete:
    • Aggregates results from DynamoDB data field
    • Appends tool results to conversation as user messages
    • Re-invokes orchestration with updated context
  5. Continuation: Continues until LLM provides final response without tool calls

The Fabricator Agent: Dynamic Capability Generation

The Fabricator Agent implements just-in-time agent development using the Strands agents framework, creating new capabilities when required functionality doesn’t exist in the system.

Agent Development Architecture:

The Fabricator operates as a specialized Strands Agent with specific characteristics

  • Implemented as a Strands Agent with specialized system prompt for code generation
  • Triggered by “New worker agent” events from the Arbiter
  • Receives capability requirements through prompt augmentation with agent directive
  • System prompt includes:
    • Strands Agent implementation examples
    • Complete catalog of available Strands Tools
    • Code generation patterns and conventions
    • Standardized handler() function requirements

Code Generation Process:

The agent follows a structured development workflow

  1. Requirement Analysis: LLM analyzes capability requirements and generates Python implementation
  2. Tool Selection: Prioritizes use of existing Strands Tools over custom @tool implementations
  3. Code Structure: Creates agents following standardized patterns:
    • Bedrock model initialization with models.BedrockModel()
    • Agent instantiation with appropriate tool selection
    • Standardized handler() function interface
    • Event-driven completion signaling
  4. File Creation: Writes generated code to /tmp/ directory for immediate availability

Capability Registration Pipeline:

New capabilities are registered through a multi-step process

  1. File Storage: File upload to Amazon Simple Storage Service (S3) via upload_file_to_s3() tool
  2. Metadata Registration: Registration in DynamoDB via store_agent_config_dynamo():
    • toolId: Unique capability identifier
    • filename: S3 object reference
    • schema: OpenAPI specification for LLM tool calling
    • description: Human-readable capability documentation
    • action: SQS queue routing configuration for Generic Wrapper
  3. Completion Notification: Completion event publication to Arbiter via complete_task() tool

Testing Considerations:

The original implementation revealed important insights about testing approaches

  • Previous Approach: Agent testing within the Fabricator resulted in:
    • Unstructured testing leading to false negatives
    • Overzealous optimization of generated agents
  • Recommendation: Separate testing agent with standardized harness for validation feedback

The Generic Wrapper: Dynamic Execution Runtime

The Generic Wrapper implements a hot-loading pattern that enables unlimited agent creation without infrastructure scaling, providing a universal execution environment for Fabricator-generated agents.

This hot-loading approach is critical because it decouples capability growth from infrastructure scaling. Instead of provisioning and maintaining new infrastructure components for every new agent, which could be dozens or even hundreds of agents, the system reuses a single execution wrapper that can dynamically load and execute arbitrary agent code.

This not only makes agent creation effectively limitless but also ensures infrastructure efficiency, cost optimization, and simplified operations, allowing the Arbiter and Fabricator to evolve system capabilities without operational bottlenecks.

In the AWS Samples code, found here, the Hot-loading handler is implemented as am AWS Lambda function, represented in the following code snippet:

def process_event(event, context):
    orchestration_id = event["orchestration_id"]
    tool_use_id = event["tool_use_id"]
    request = event["tool_input"]
    tool_name = event['node']

    # Based on the tool from the event, load the details from DDB
    tool = load_config_from_dynamodb(tool_name)
    config = tool['config']

    if isinstance(config, str):
        config = json.loads(config)

    file_name = config['filename']

    load_file_from_s3_into_tmp(os.environ["AGENT_BUCKET_NAME"], file_name)

    # Hot load the module from the tmp directory
    spec = importlib.util.spec_from_file_location("module.name", "/tmp/loaded_module.py")
    loaded_module = importlib.util.module_from_spec(spec)
    sys.modules["module.name"] = loaded_module
    spec.loader.exec_module(loaded_module)

    # Invoke the generic handler with whatever args were passed in by the Arbiter
    try:
        print("attempting to use module")
        response = loaded_module.handler(**request)
        print(f"response: {response}")
    except Exception as e:
        print(f"error running module: {e}")
        response = "The task could not be completed, this agent has issues, please ignore for now."

    # Finally. report back to the Arbiter. Handled by the wrapper. To avoid the Frabricator from attempting to code this part itself
    post_task_complete(response, tool_use_id, tool_name, orchestration_id)

Although this example is demonstrated through a lambda function, the Hot-Loading code can be executed in Amazon Bedrock AgentCore Runtime, or AWS native container services, such as Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS)

Hot-Loading Architecture:

The wrapper implements several key architectural principles

  • Single infrastructure component handles execution of all dynamically created agents
  • Eliminates need for separate infrastructure provisioning per agent
  • Implements runtime code loading from S3 storage
  • Accepts latency trade-off for infrastructure efficiency in non-ultra-low-latency environment

Dynamic Loading Process:

The system follows a precise loading sequence

  1. Message Processing: Extracts agent identifier from incoming SQS message
  2. Configuration Retrieval: Queries DynamoDB for agent configuration via load_config_from_dynamodb()
  3. Code Download: Downloads agent implementation from S3 to /tmp/ directory
  4. Runtime Loading: Module loading using importlib.util:
    • spec_from_file_location() creates module specification
    • module_from_spec() instantiates module object
    • exec_module() performs actual code loading and execution

Execution Management:

The wrapper provides comprehensive execution oversight

  • Invokes standardized handler() function with provided parameters
  • Captures execution results and handles error conditions gracefully
  • Maintains execution isolation between different agent invocations
  • Implements resource cleanup after agent execution completion

Standardized Communication Protocol:

Communication follows strict standardization to ensure system reliability, which is critical in multi-agent environments where dozens or even hundreds of dynamically generated agents may interact. Without consistent message formats, routing rules, and completion signals, orchestration would become brittle, errors would propagate unpredictably, and debugging would be nearly impossible. Standardization guarantees that every agent, no matter when it was created, can interoperate seamlessly, enabling the Arbiter to maintain end-to-end visibility, traceability, and fault-tolerance across the entire system.

Event Handling Principles:

  • Event posting handled exclusively by Generic Wrapper, not individual agents
  • Ensures consistent event-driven communication patterns across all agents

Completion Event Structure:

  • orchestration_id: Workflow context linkage
  • tool_use_id: LLM tool invocation mapping
  • node: Agent identifier for tracking
  • data: Execution results or error information

Reliability Measures:

  • Publishes completion events to EventBridge for Arbiter processing
  • Guarantees workflow tracking receives completion signals regardless of execution outcome

Scalability Characteristics:

The hot-loading approach provides significant scalability benefits

  • Enables agent scaling creation without minimal infrastructure impact
  • S3 download latency acceptable within overall system performance profile
  • Single wrapper instance can execute multiple agent types
  • Memory and resource management handled at container level

Conclusion

The Arbiter Pattern represents a significant evolution beyond the Supervisor architecture, delivering the flexibility required for truly autonomous agentic systems. By introducing semantically rich, context-aware orchestration, it enables dynamic scalability, where agent capabilities grow in step with task demands. The architecture is resilient, redistributing or regenerating tasks when agents fail, and it achieves loose coupling by having agents interact through semantically meaningful events rather than rigid APIs. Most importantly, it embeds continuous adaptation through Arbiter-guided feedback loops, allowing systems to learn and evolve over time. This marks a shift from pre-programmed logic to generative, blackboard-model-based coordination, paving the way for decentralized, intelligent systems that can learn, adapt, and collaborate effectively at scale.

The system delivers several critical capabilities

  • Asynchronous Processing: SQS-based message passing for scalable execution
  • Persistent State Management (Short-term memory): DynamoDB-based workflow tracking
  • Scalability: Hot-loading architecture for unlimited agent creation
  • Intelligent Orchestration: LLM-driven task decomposition and sequencing
  • Self-Expanding Capabilities: Strands-based agent creation on demand
  • Standardized Communication: Reliable event-driven protocols

This architecture enables processing of arbitrary event types by dynamically creating necessary processing capabilities and coordinating their execution through LLM-driven workflow orchestration, while maintaining infrastructure efficiency through hot-loading patterns.


About the Authors

aaron sempfAaron Sempf is Next Gen Tech Lead for the AWS Partner Organization in Asia-Pacific and Japan. With over 20 years in distributed system engineering design and development, he focuses on solving for large scale complex integration and event driven systems. In his spare time, he can be found coding prototypes for autonomous robots, IoT devices, distributed solutions, and designing agentic architecture patterns for generative AI assisted business automation.

josh tothJoshua Toth is a Senior Prototyping Engineer with over a decade of experience in software engineering and distributed systems. He specializes in solving complex business challenges through technical prototypes, demonstrating the art of the possible. With deep expertise in proof of concept development, he focuses on bridging the gap between emerging technologies and practical business applications. In his spare time, he can be found developing next-generation interactive demonstrations and exploring cutting-edge technological innovations.

Delivering Malware Through Abandoned Amazon S3 Buckets

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2025/02/delivering-malware-through-abandoned-amazon-s3-buckets.html

Here’s a supply-chain attack just waiting to happen. A group of researchers searched for, and then registered, abandoned Amazon S3 buckets for about $400. These buckets contained software libraries that are still used. Presumably the projects don’t realize that they have been abandoned, and still ping them for patches, updates, and etc.

The TL;DR is that this time, we ended up discovering ~150 Amazon S3 buckets that had previously been used across commercial and open source software products, governments, and infrastructure deployment/update pipelines—and then abandoned.

Naturally, we registered them, just to see what would happen—”how many people are really trying to request software updates from S3 buckets that appear to have been abandoned months or even years ago?”, we naively thought to ourselves.

Turns out they got eight million requests over two months.

Had this been an actual attack, they would have modified the code in those buckets to contain malware and watch as it was incorporated in different software builds around the internet. This is basically the SolarWinds attack, but much more extensive.

But there’s a second dimension to this attack. Because these update buckets are abandoned, the developers who are using them also no longer have the power to patch them automatically to protect them. The mechanism they would use to do so is now in the hands of adversaries. Moreover, often—but not always—losing the bucket that they’d use for it also removes the original vendor’s ability to identify the vulnerable software in the first place. That hampers their ability to communicate with vulnerable installations.

Software supply-chain security is an absolute mess. And it’s not going to be easy, or cheap, to fix. Which means that it won’t be. Which is an even worse mess.

Първо пътуване с електронен билет в Пловдив

Post Syndicated from Йовко Ламбрев original https://yovko.net/e-tickets-plovdiv/

Първата четвърт на 21-ви век почти отминава, но… вече и в Пловдив става възможно човек да си купи електронен билет за градския транспорт и да го използва. Проверено и потвърдено! Е, засега само по един маршрут – автобус 25.

Скрийншоти от приложението MPass с електронен билет за пътуване в Пловдив

Приложението MPass е семпло, изглежда доста добре и е лесно за ползване, което е важно. След регистрацията си купих билет за еднократно пътуване срещу 1 лев, който платих с дебитна карта (Revolut) без никакви проблеми. Изненадата беше, че е валиден до края на денонощието (по-точно в рамките на работното време на градския транспорт), но в интерес на истината това е разписано зад бутон Детайли на всеки билет. Там пише също и че трябва да е купен поне минута, преди да се качим на автобуса и да се опитаме да го валидираме. Та не си купувайте билети за следващия ден – няма да можете да ги ползвате.

На същото място пише, че валидирането на билета става като поставите смартфона си, с QR-кода на билета на разстояние около една педя (10-15cm) под валидатора. Това е полезна информация, защото QR-кода очевидно следва да се сканира оптично, а на самия валидатор не личи къде му е сензора. Логично е да отдолу, но ако телефонът е поставен твърде близо, разчитането на QR-кода няма да се получи. Аз успях от втори опит, виждайки светлината на скенера върху дисплея на телефона си, която подсказа колко да го отдалеча и как да го поставя. Не бях прочел това предварително като купувах билета, а шофьорът, разбира се, не знаеше – само промърмори, че не разбирал от електронни неща. Всъщност не е и негова работа, но в първоначалния период на въвеждане на нова система е добре да има някоя табелка, стрелкичка, пиктограма… за всички ще е по-лесно. У нас продължава да е в сила методологията „Оправяй се!“.

След валидиране на устройството в автобуса билетът се маркира с червена лентичка с надпис "За контрол".

И още една особеност. С един акаунт човек може да се логне на няколко устройства (вкл. в браузър) – аз си купих билета, докато бях в офиса на компютъра си, а не през мобилното приложение. Билетът обаче остава в браузъра/устройството, с което е купен, а аз възнамерявах да го ползвам от смартфона си (в автобуса). Няма драма, билетът може да се премести (само веднъж!) – с допълнителна стъпка, която изисква човек да има достъп до електронната си поща, където да получи и използва един код. Иначе и това е лесно и бързо.

С две думи, нещата засега изглеждат доста обещаващо. Остава всичко това да стане възможно и във всички останали линии и автобуси от пловдивския градски транспорт. И разбира се, този град да се сдобие с обществен транспорт, на който да може да се разчита.

Първо пътуване с електронен билет в Пловдив

Post Syndicated from Йовко Ламбрев original https://yovko.net/e-tickets-plovdiv/

Първо пътуване с електронен билет в Пловдив

Първата четвърт на 21-ви век почти отминава, но… вече и в Пловдив става възможно човек да си купи електронен билет за градския транспорт и да го използва. Проверено и потвърдено! Е, засега само по един маршрут – автобус 25.

Приложението MPass е семпло, изглежда доста добре и е лесно за ползване, което е важно. След регистрацията си купих билет за еднократно пътуване срещу 1 лев, който платих с дебитна карта (Revolut) без никакви проблеми. Изненадата беше, че е валиден до края на денонощието (по-точно в рамките на работното време на градския транспорт), но в интерес на истината това е разписано зад бутон Детайли на всеки билет. Там пише също и че трябва да е купен поне минута, преди да се качим на автобуса и да се опитаме да го валидираме. Та не си купувайте билети oт днес за следващия ден – ще „изветреят“ преди да ги използвате.

На същото място пише, че валидирането на билета става като поставите смартфона си с QR-кода на билета на разстояние 10-15cm (около една педя) под валидатора. Това е полезно знание, защото QR-кода очевидно трябва да се сканира оптично, а на самия валидатор не личи къде му е сензора. Логично е да отдолу, но ако телефонът е поставен твърде близо, разчитането на QR-кода няма да се получи. Аз успях от втори опит, виждайки светлината на скенера върху дисплея на телефона си, която подсказа колко да го отдалеча и как да го поставя. Не бях прочел това предварително като купувах билета, а шофьорът, разбира се, не знаеше – само промърмори, че не разбирал от електронни неща. Всъщност не е и негова работа, но в първоначалния период на въвеждане на нова система е добре да има някоя табелка, стрелкичка, пиктограма… за всички ще е по-лесно. Но у нас продължава да е в сила методологията „Оправяй се!“.

След валидиране на устройството в автобуса, билетът се маркира с червена лентичка с надпис "За контрол".

И още една особеност. С един акаунт човек може да се логне на няколко устройства (вкл. в браузър) – аз си купих билета, докато бях в офиса на компютъра си, а не през мобилното приложение. Билетът обаче остава в браузъра/устройството, с което е купен, а аз възнамерявах да го ползвам от смартфона си (в автобуса). Няма драма, билетът може да се премести (макар и само веднъж!) – с допълнителна стъпка, която изисква човек да има достъп до електронната си поща, където да получи един код. Иначе и това е лесно и бързо.

С две думи, нещата засега изглеждат доста обещаващо. Остава всичко това да стане възможно и в останалите линии и автобуси от пловдивския градски транспорт. Да се появят и по-удобни варианти на билети. И разбира се, този град да се сдобие с обществен транспорт, на който да може да се разчита.

P. S. Възможно е и да не се ползва приложението, а да се плати директно с карта (през валидатора), но понеже не съм редовен ползвател на градския транспорт, а и автобус 25 не е сред тези, които ползвам обичайно… не съм тествал тази опция.

Education in Secure Software Development

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/08/education-in-secure-software-development.html

The Linux Foundation and OpenSSF released a report on the state of education in secure software development.

…many developers lack the essential knowledge and skills to effectively implement secure software development. Survey findings outlined in the report show nearly one-third of all professionals directly involved in development and deployment ­ system operations, software developers, committers, and maintainers ­ self-report feeling unfamiliar with secure software development practices. This is of particular concern as they are the ones at the forefront of creating and maintaining the code that runs a company’s applications and systems.

Providing Security Updates to Automobile Software

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/07/providing-security-updates-to-automobile-software.html

Auto manufacturers are just starting to realize the problems of supporting the software in older models:

Today’s phones are able to receive updates six to eight years after their purchase date. Samsung and Google provide Android OS updates and security updates for seven years. Apple halts servicing products seven years after they stop selling them.

That might not cut it in the auto world, where the average age of cars on US roads is only going up. A recent report found that cars and trucks just reached a new record average age of 12.6 years, up two months from 2023. That means the car software hitting the road today needs to work­—and maybe even improve—­beyond 2036. The average length of smartphone ownership is just 2.8 years.

I wrote about this in 2018, in Click Here to Kill Everything, talking about patching as a security mechanism:

This won’t work with more durable goods. We might buy a new DVR every 5 or 10 years, and a refrigerator every 25 years. We drive a car we buy today for a decade, sell it to someone else who drives it for another decade, and that person sells it to someone who ships it to a Third World country, where it’s resold yet again and driven for yet another decade or two. Go try to boot up a 1978 Commodore PET computer, or try to run that year’s VisiCalc, and see what happens; we simply don’t know how to maintain 40-year-old [consumer] software.

Consider a car company. It might sell a dozen different types of cars with a dozen different software builds each year. Even assuming that the software gets updated only every two years and the company supports the cars for only two decades, the company needs to maintain the capability to update 20 to 30 different software versions. (For a company like Bosch that supplies automotive parts for many different manufacturers, the number would be more like 200.) The expense and warehouse size for the test vehicles and associated equipment would be enormous. Alternatively, imagine if car companies announced that they would no longer support vehicles older than five, or ten, years. There would be serious environmental consequences.

We really don’t have a good solution here. Agile updates is how we maintain security in a world where new vulnerabilities arise all the time, and we don’t have the economic incentive to secure things properly from the start.

За данните на градския транспорт на Пловдив в Google Maps

Post Syndicated from Йовко Ламбрев original https://yovko.net/plovdiv-google-maps/

За данните на градския транспорт на Пловдив в Google Maps

От лятото на 2019 г. в блога ми стои чернова, която така и не публикувах. Идеята ми зад въпросната статия беше да разкажа как се появи разписанието на градския транспорт на Пловдив в Google Maps. Но това сега няма никакво значение, защото от тази седмица там вече няма такива данни.

Съвсем кратка предистория: Преди вече пет години трима приятели предложихме на Община Пловдив да свършим необходимото, за да се появят маршрутите и разписанията на градския транспорт в Google Maps. Напълно безвъзмездно. Речено – сторено. Общината не трябваше да прави нищо друго, освен от време на време да ни уведомява за промените в маршрутите и разписанията, за да поддържаме нещата актуални и коректни. Това обаче не се случи. В началото на 2022 г. се уморихме да се борим с вятърни мелници и преустановихме поддръжката, като маркирахме, че данните не са актуални и няма да се обновяват.

Преди няколко дни човек от екипа на Google Transit (така се нарича услугата, която добавя информация за обществения транспорт в Maps) се свърза с мен да провери защо вече две години не сме актуализирали данните за Пловдив. Отговорих му, че за съжаление, няма как да го направим, тъй като разчитахме да получаваме тези данни от Община Пловдив, но се оказва, че всеки път трябва да им се молим за данните или да си ги събираме, както намерим за добре. Което определено не е начинът, по който трябва да се прави това. Нито може да гарантира нужната прецизност. Колегата от Google сподели, че разбира ситуацията, и ще преценят как да постъпят. След още два дни ми писа, че са взели решение да прекратят услугата за Пловдив.

Днес в Общинския съвет е имало дискусия по темата, провокирана от публикация в „Капитал“. Журналистите от медията писаха и преди година по темата и дори заедно прогнозирахме това развитие, но тишината откъм Община Пловдив си остана все така оглушителна.

Пиша това, защото сегашната ресорна кметица, г-жа Савина Петкова, е казала пред общинските съветници няколко изречения, които от стремеж към акуратност не мога да отмина без коментар. Позовавам се на публикации в регионалната преса, тъй като не съм присъствал лично.

Започвам с нейното твърдение, че градският транспорт на Пловдив има далеч по-важни проблеми, с което няма как да не се съглася. Аз самият съм се питал какви данни се опитваме да събираме и предоставяме на хората, когато ситуацията с обществения транспорт в града е толкова безобразна.

Вярно е и че общината никога не е имала договор с Google Maps. Просто защото няма как да има договор. Google не подписва такива. Платформата предоставя напълно безплатно възможността да визуализира данни за обществен или частен транспорт, ако те са предоставени от достоверен източник и в изисквания от тях формат. Иначе казано, Община Пловдив може сама да обобщава такива данни и да ги изпраща към Google. Това могат да правят и самите транспортни фирми. И има няколко в България, чиито разписания са в Google Maps, включително и БДЖ.

Проблемите тук са два. Първо, Google да се довери, че източникът на данни е достоверен. И второ, тези данни да бъдат в изисквания от тях формат – което не е задача с грандиозна трудност, но не е и тривиално начинание. През 2019 г. се наложи част от информацията да събираме от сканирани PDF-и, MS Word документи с вградени в тях таблици, които всъщност бяха растерни картинки. А данните за маршрутите, по които се движат автобусите, рисувахме точка по точка на ръка, защото „Организация и контрол по транспорта“ (ОКТ) разполагаше с GPS координати на спирките, но не и на пътя, по който автобусите се движат между тези спирки. Имаше още купчина технически засади, за които не подозирахме и в най-песимистичните си очаквания. Но някак ги разрешихме.

Иначе по темата защо Google все пак реши да публикува данни за Пловдив, това стана възможно, след като не общината, а ние, доброволците, се ангажирахме да съберем всичко и да го подредим. За да направим нещо полезно за града и гостите му. И защото ни е грижа за средата, в която живеем.

Община Пловдив потвърди с официално писмо пред Google, че ние ще вършим тази работа с подреждането и преобразуването на данните от нейно име.

За последвалите две години и половина, през които поддържахме данните живи, само веднъж получихме актуализация от общината. След двукратни мои настоявания пред тогавашния ресорен кмет Тодор Чонов, на които и до днес нямам отговор, в крайна сметка просто загубих търпение, обадих се на колегите в ОКТ по телефона и си изпросих нови данни. Те, между другото, винаги са откликвали, но така и никой не им възложи да изпращат промените, когато има такива, или примерно веднъж на три или дори шест месеца. Това щеше да е повече от достатъчно.

Не отговаря на истината обаче твърдението на г-жа Петкова, че данните, които са показвани досега от Google, са идвали от системата на „Индра“. Те не бяха особено полезни, защото в тях имаше грешки и ги получихме едва след като вече се бяхме преборили с дигитализацията на множество „хартиени данни“, които впрочем също бяха само частично полезни. Единственото улеснение на двата комплекта данни беше, че можехме да ги съпоставим и така да проследим къде се разминават, което помогна да отстраним някакво количество грешки.

Би било чудесно, ако системата на „Индра“ изобщо работеше и имаше как да интегрираме данните от нея за информация в реално време с Google. Това беше в плановете ни за бъдещето, когато градският транспорт на Пловдив евентуално щеше да има адекватна информационна система.

А приложението, което г-жа Петкова съжалява, че не е достатъчно популярно, е ето това. Моля, проследете линка и ще си отговорите сами защо „не е достатъчно популярно“. То работи (слава богу, нещо работи!), но е кошмарно остаряло и е адски неудобно за ползване, особено в движение и от мобилен телефон. Честно, г-жо Петкова, Вие колко пъти сте го ползвала?

И в заключение: искрено се надявам, че колегите от „Тикси“, които правят поредната за Пловдив нова система за управление на градския транспорт, ще имат добрата воля да предоставят отворено API или друг публично достъпен и документиран начин за интеграция с външни системи. Това ще позволи не само нова и по-лесна интеграция с Google Maps (защо не и в реално време този път?), но и създаването на други мобилни приложения и дигитални удобства за Пловдив. Защото ние още продължаваме да живеем в този град и да ни пука за него.

За данните на градския транспорт на Пловдив в Google Maps

Post Syndicated from Йовко Ламбрев original https://yovko.net/plovdiv-google-maps/

За данните на градския транспорт на Пловдив в Google Maps

От лятото на 2019 г. в блога ми стои чернова, която така и не публикувах. Идеята ми зад въпросната статия беше да разкажа как се появи разписанието на градския транспорт на Пловдив в Google Maps. Но това сега няма никакво значение, защото от тази седмица там вече няма такива данни.

Съвсем кратка предистория: Преди вече пет години трима приятели предложихме на Община Пловдив да свършим необходимото, за да се появят маршрутите и разписанията на градския транспорт в Google Maps. Напълно безвъзмездно. Речено – сторено. Общината не трябваше да прави нищо друго, освен от време на време да ни уведомява за промените в маршрутите и разписанията, за да поддържаме нещата актуални и коректни. Това обаче не се случи. В началото на 2022 г. се уморихме да се борим с вятърни мелници и преустановихме поддръжката, като маркирахме, че данните не са актуални и няма да се обновяват.

Преди няколко дни човек от екипа на Google Transit (така се нарича услугата, която добавя информация за обществения транспорт в Maps) се свърза с мен да провери защо вече две години не сме актуализирали данните за Пловдив. Отговорих му, че за съжаление, няма как да го направим, тъй като разчитахме да получаваме тези данни от Община Пловдив, но се оказва, че всеки път трябва да им се молим за това или да си ги събираме, както намерим за добре. Което определено не е начинът, по който трябва да се прави това. Нито може да гарантира нужната прецизност. Колегата от Google сподели, че разбира ситуацията, и ще преценят как да постъпят. След още два дни ми писа, че са взели решение да прекратят услугата за Пловдив.

Днес в Общинския съвет е имало дискусия по темата, провокирана от публикация в „Капитал“. Журналистите от медията писаха и преди година по темата и дори заедно прогнозирахме това развитие, но тишината откъм Община Пловдив си остана все така оглушителна.

Пиша това, защото сегашната ресорна кметица, г-жа Савина Петкова, е казала пред общинските съветници няколко изречения, които от стремеж към прецизност не мога да отмина без коментар. Позовавам се на публикации в регионалната преса, тъй като не съм присъствал лично.

Започвам с нейното твърдение, че градският транспорт на Пловдив има далеч по-важни проблеми, с което няма как да не се съглася. Аз самият съм се питал какви данни се опитваме да събираме и предоставяме на хората, когато ситуацията с обществения транспорт в града е толкова безобразна.

Вярно е и че общината никога не е имала договор с Google Maps. Просто защото няма как да има такъв договор. Google не подписва такива. Платформата предоставя напълно безплатно възможността да визуализира данни за обществен или частен транспорт, ако те са предоставени от достоверен източник и в изисквания от тях формат. Иначе казано, Община Пловдив може сама да обобщава такива данни и да ги изпраща към Google. Това могат да правят и самите транспортни фирми. И има няколко в България, чиито разписания са в Google Maps, включително и БДЖ.

Проблемите тук са два. Първо, Google да се довери, че източникът на данни е достоверен. И второ, тези данни да бъдат в изисквания от тях формат – което не е задача с грандиозна трудност, но не е и тривиално начинание. През 2019 г. се наложи част от информацията да събираме от сканирани PDF-и, MS Word документи с вградени в тях таблици, които всъщност бяха растерни картинки. А данните за маршрутите, по които се движат автобусите, рисувахме точка по точка на ръка, защото „Организация и контрол по транспорта“ (ОКТ) разполагаше с GPS координати на спирките, но не и на пътя, по който автобусите се движат между тези спирки. Имаше още купчина технически засади, за които не подозирахме и в най-песимистичните си очаквания. Но някак ги разрешихме.

Иначе по темата защо Google все пак реши да публикува данни за Пловдив, това стана възможно, след като не общината, а ние, доброволците, се ангажирахме да съберем всичко и да го подредим. За да направим нещо полезно за града и гостите му. И защото ни е грижа за средата, в която живеем.

Община Пловдив потвърди с официално писмо пред Google, че ние ще вършим тази работа с подреждането и преобразуването на данните от нейно име.

За последвалите две години и половина, през които поддържахме данните живи, само веднъж получихме актуализация от общината. След двукратни мои настоявания пред тогавашния ресорен кмет Тодор Чонов, на които и до днес нямам отговор, в крайна сметка просто загубих търпение, обадих се на колегите в ОКТ по телефона и си изпросих нови данни. Те, между другото, винаги са откликвали, но така и никой не им възложи да изпращат промените, когато има такива, или примерно веднъж на три или дори шест месеца. Това щеше да е повече от достатъчно.

Не отговаря на истината обаче твърдението на г-жа Петкова, че данните, които са показвани досега от Google, са идвали от системата на „Индра“. Те не бяха особено полезни, защото в тях имаше грешки и ги получихме едва след като вече се бяхме преборили с дигитализацията на множество „хартиени данни“, които впрочем също бяха само частично полезни. Единственото улеснение на двата комплекта данни беше, че можехме да ги съпоставим и така да проследим къде се разминават, което помогна да отстраним някакво количество грешки.

Би било чудесно, ако системата на „Индра“ изобщо работеше и имаше как да интегрираме данните от нея за информация в реално време с Google. Това беше в плановете ни за бъдещето, когато градският транспорт на Пловдив евентуално щеше да има адекватна информационна система.

А приложението, което г-жа Петкова съжалява, че не е достатъчно популярно, е ето това. Моля, проследете линка и ще си отговорите сами защо „не е достатъчно популярно“. То работи (слава богу, нещо работи!), но е кошмарно остаряло и е адски неудобно за ползване, особено в движение и от мобилен телефон. Честно, г-жо Петкова, Вие колко пъти сте го ползвала?

И в заключение: искрено се надявам, че колегите от „Тикси“, които правят поредната за Пловдив нова система за управление на градския транспорт, ще имат добрата воля да предоставят API или друг публично достъпен начин за интеграция с външни системи. Това ще позволи не само интеграция с Google Maps (защо не и в реално време този път?), но и създаването на други мобилни приложения и дигитални удобства за Пловдив. Защото ние още продължаваме да живеем в този град и да ни пука за него.

Easily deploy SaaS products with new Quick Launch in AWS Marketplace

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/easily-deploy-saas-products-with-new-quick-launch-in-aws-marketplace/

Today we are excited to announce the general availability of SaaS Quick Launch, a new feature in AWS Marketplace that makes it easy and secure to deploy SaaS products.

Before SaaS Quick Launch, configuring and launching third-party SaaS products could be time-consuming and costly, especially in certain categories like security and monitoring. Some products require hours of engineering time to manually set up permissions policies and cloud infrastructure. Manual multistep configuration processes also introduce risks when buyers rely on unvetted deployment templates and instructions from third-party resources.

SaaS Quick Launch helps buyers make the deployment process easy, fast, and secure by offering step-by-step instructions and resource deployment using preconfigured AWS CloudFormation templates. The software vendor and AWS validate these templates to ensure that the configuration adheres to the latest AWS security standards.

Getting started with SaaS Quick Launch
It’s easy to find which SaaS products have Quick Launch enabled when you are browsing in AWS Marketplace. Products that have this feature configured have a Quick Launch tag in their description.

Quick Launch tag in AWS Marketplace

After completing the purchase process for a Quick Launch–enabled product, you will see a button to set up your account. That button will take you to the Configure and launch page, where you can complete the registration to set up your SaaS account, deploy any required AWS resources, and launch the SaaS product.

Step 1 - set permissions

The first step ensures that your account has the required AWS permissions to configure the software.

Step 1 - set permissions

The second step involves configuring the vendor account, either to sign in to an existing account or to create a new account on the vendor website. After signing in, the vendor site may pass essential keys and parameters that are needed in the next step to configure the integration.

Step 2 - Log into the vendor account

The third step allows you to configure the software and AWS integration. In this step, the vendor provides one or more CloudFormation templates that provision the required AWS resources to configure and use the product.

Step 3 - Configure your software and AWS integration

The final step is to launch the software once everything is configured.

Step 6 - Launch your software

Availability
Sellers can enable this feature in their SaaS product. If you are a seller and want to learn how to set this up in your product, check the Seller Guide for detailed instructions.

To learn more about SaaS in AWS Marketplace, visit the service page and view all the available SaaS products currently in AWS Marketplace.

Marcia

Streaming Android games from cloud to mobile with AWS Graviton-based Amazon EC2 G5g instances

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/streaming-android-games-from-cloud-to-mobile-with-aws-graviton-based-amazon-ec2-g5g-instances/

This blog post is written by Vincent Wang, GCR EC2 Specialist SA, Compute.

Streaming games from the cloud to mobile devices is an emerging technology that allows less powerful and less expensive devices to play high-quality games with lower battery consumption and less storage capacity. This technology enables a wider audience to enjoy high-end gaming experiences from their existing devices, such as smartphones, tablets, and smart TVs.

To load games for streaming on AWS, it’s necessary to use Android environments that can utilize GPU acceleration for graphics rendering and optimize for network latency. Cloud-native products, such as the Anbox Cloud Appliance or Genymotion available on the AWS Marketplace, can provide a cost-effective containerized solution for game streaming workloads on Amazon Elastic Compute Cloud (Amazon EC2).

For example, Anbox Cloud’s virtual device infrastructure can run games with low latency and high frame rates. When combined with the AWS Graviton-based Amazon EC2 G5g instances, which offer a cost reduction of up to 30% per-game stream per-hour compared to x86-based GPU instances, it enables companies to serve millions of customers in a cost-efficient manner.

In this post, we chose the Anbox Cloud Appliance to demonstrate how you can use it to stream a resource-demanding game called Genshin Impact. We use a G5g instance along with a mobile phone to run the streamed game inside of a Firefox browser application.

Overview

Graviton-based instances utilize fewer compute resources than x86-based instances due to the 64-bit architecture of Arm processors used in AWS Graviton servers. As shown in the following diagram, Graviton instances eliminate the need for cross-compilation or Android emulation. This simplifies development efforts and reduces time-to-market, thereby lowering the cost-per-stream. With G5g instances, customers can now run their Android games natively, encode CPU or GPU-rendered graphics, and stream the game over the network to multiple mobile devices.

Architecture difference when running Android on X86-based instance and Graviton-based instance.

Figure 1: Architecture difference when running Android on X86-based instance and Graviton-based instance.

Real-time ray-traced rendering is required for most modern games to deliver photorealistic objects and environments with physically accurate shadows, reflections, and refractions. The G5g instance, which is powered by AWS Graviton2 processors and NVIDIA T4G Tensor Core GPUs, provides a cost-effective solution for running these resource-intensive games.

Architecture

Architecture of Android Streaming Game.

Figure 2: Architecture of Android Streaming Game.

When streaming games from a mobile device, only input data (touchscreen, audio, etc.) is sent over the network to the game streaming server hosted on a G5g instance. Then, the input is directed to the appropriate Android container designated for that particular client. The game application running in the container processes the input and updates the game state accordingly. Then, the resulting rendered image frames are sent back to the mobile device for display on the screen. In certain games, such as multiplayer games, the streaming server must communicate with external game servers to reflect the full game state. In these cases, additional data is transferred to and from game servers and back to the mobile client. The communication between clients and the streaming server is performed using the WebRTC network protocol to minimize latency and make sure that users’ gaming experience isn’t affected.

The Graviton processor handles compute-intensive tasks, such as the Android runtime and I/O transactions on the streaming server. However, for resource-demanding games, the Nvidia GPU is utilized for graphics rendering. To scale effortlessly, the Anbox Cloud software can be utilized to manage and execute several game sessions on the same instance.

Prerequisites

First, you need an Ubuntu single sign-on (SSO) account. If you don’t have one yet, you may create one from Ubuntu One website. Then you need an Android mobile phone with Firefox or Chrome browser installed to play the streaming games.

Setup

We can install Anbox Cloud Appliance in the AWS Marketplace. Select the Arm variant so that it works on Graviton-based instances. If the subscription doesn’t work on the first try, then you receive an email which guides you to a page where you can try again.

Figure 3: Subscribe Anbox Cloud Appliance in AWS Marketplace.

Figure 3: Subscribe Anbox Cloud Appliance in AWS Marketplace.

In this demonstration, we select G5g.xlarge in the Instance type section and leave all settings with default values, except the storage as per the following:

  1. A root disk with minimum 50 GB (required)
  2. An additional Amazon Elastic Block Store (Amazon EBS) volume with at least 100 GB (recommended)

For the Genshin Impact demo, we recommend a specific amount of storage. However, when deploying your Android applications, you must select an appropriate storage size based on the package size. Additionally, you should choose an instance size based on the resources that you plan to utilize for your gaming sessions, such as CPU, memory, and networking. In our demo, we launched only one session from a single mobile device.

Launch the instance and wait until it reaches running status. Then you can secure shell (SSH) to the instance to configure the Android environment.

Install Anbox cloud

To make sure of the security and reliability of some of the package repositories used, we update the CUDA Linux GPG Repository Key. View this Nvidia blog post for more details on this procedure.

$ sudo apt-key del 7fa2af80

$ wget

https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/cuda keyring_1.0-1_all.deb

$ sudo dpkg -i cuda-keyring_1.0-1_all.deb

As the Android in Anbox Cloud Appliance is running in an LXD container environment, upgrade LXD to the latest version.

  $ sudo snap refresh –channel=5.0/stable lxd

Install the Anbox Cloud Appliance software using the following command and selecting the default answers:

  $ sudo anbox-cloud-appliance init

Watch the status page at https://$(ec2_public_DNS_name) for progress information.

Figure 4: The status of deploying Anbox Cloud.

Figure 4: The status of deploying Anbox Cloud.

The initialization process takes approximately 20 minutes. After it’s complete, register the Ubuntu SSO account previously created, then follow the instructions provided to finalize the process.

  $ anbox-cloud-appliance dashboard register <your Ubuntu SSO email address>

Stream an Android game application

Use the sample from the following repo to setup the service on the streaming server:

  $ git clone https://github.com/anbox-cloud/cloud-gaming-demo.git

Build the Flutter web UI:

$ sudo snap install flutter –classic

$ cd cloud-gaming-demo/ui && flutter build web && cd ..

$ mkdir -p backend/service/static

$ cp -av ui/build/web/* backend/service/static

Then build the backend service which processes requests and interacts with the Anbox Stream Gateway to create instances of game applications. Start by preparing the environment:

$ sudo apt-get install python3-pip

$ sudo pip3 install virtualenv

$ cd backend && virtualenv venv

Create the configuration file for the backend service so that it can access the Anbox Stream Gateway. There are two parameters to set: gateway-URL and gateway-token. The gateway token can be obtained from the following command:

$ anbox-cloud-appliance gateway account create <account-name>

Create a file called config.yaml that contains the two values:

gateway-url: https:// <EC2 public DNS name>

gateway-token: <gateway_token>

Add the following line to the activate hook in the backend/venv/bin/ directory so that the backend service can read config.yaml on its startup:

$ export CONFIG_PATH=<path_to_config_yaml>

Now we can launch the backend service which will be served by default on TCP port 8002.

$./run.sh

In the next steps, we download a game and build it via Anbox Cloud. We need an Android APK and a configuration file. Create a folder under the HOME directory and create a manifest.yaml file in the folder. In this example, we must add the following details in the file. You can refer to the Anbox Cloud documentation for more information on the format.

name: genshin

instance-type: g10.3

resources:

cpus: 10

memory: 25GB

disk-size: 50GB

gpu-slots: 15

features: [“enable_virtual_keyboard”]

Select an APK for the arm64-v8a architecture which is natively supported on Graviton. In this example, we download Genshin Impact, an action role-playing game developed and published by miHoYo. You must supply your own Android APK if you want to try these steps. Download the APK into the folder and rename it to app.apk. Overall, the final layout of the game folder should look as follows:

.

├── app.apk

└── manifest.yaml

Run the following command from the folder to create the application:

$ amc application create  .

Wait until the application status changes to ready. You can monitor the status with the following command:

$ amc application ls

Edit the following:

  1. Update the gameids variable defined in the ui/lib/homepage.dart file to include the name of the game (as declared in the manifest file).
  2. Insert a new key/value pair to the static appNameMap and appDesMap variables defined in the lib/api/application.dart file.
  3. Provide a screenshot of the game (in jpeg format), rename it to <game-name>.jpeg, and put it into the ui/lib/assets directory.

Then, re-build the web UI, copy the contents from the ui/build/web folder to the backend/service/static directory, and refresh the webpage.

Test the game

Using your mobile phone, open the Firefox browser or another browser that supports WebRTC. Type the public DNS name of the G5g instance with the 8002 TCP port, and you should see something similar to the following:

Figure 5: The webpage of the Android streaming game portal.

Figure 5: The webpage of the Android streaming game portal.

Select the Play now button, wait a moment for the application to be setup on the server side, and then enjoy the game.

Figure 6: The screen capture of playing Android streaming game.

Figure 6: The screen capture of playing Android streaming game.

Clean-up

Please cancel the subscription of the Anbox Cloud Appliance in the AWS Marketplace, you can follow the AWS Marketplace Buyer Guide for more details, then terminate the G5g.xlarge instance to avoid incurring future costs.

Conclusion

In this post, we demonstrated how a resource-intensive Android game runs natively on a Graviton-based G5g instance and is streamed to an Arm-based mobile device. The benefits include better price-performance, reduced development effort, and faster time-to-market. One way to run your games efficiently on the cloud is through software available on the AWS Marketplace, such as the Anbox Cloud Appliance, which was showcased as an example method.

To learn more about AWS Graviton, visit the official product page and the technical guide.

Let’s Architect! Architecting for sustainability

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecting-for-sustainability/

Sustainability is an important topic in the tech industry, as well as society as a whole, and defined as the ability to continue to perform a process or function over an extended period of time without depletion of natural resources or the environment.

One of the key elements to designing a sustainable workload is software architecture. Think about how event-driven architecture can help reduce the load across multiple microservices, leveraging solutions like batching and queues. In these cases, the main traffic is absorbed at the entry-point of a cloud workload and ease inside your system. On top of architecture, think about data patterns, hardware optimizations, multi-environment strategies, and many more aspects of a software development lifecycle that can contribute to your sustainable posture in the Cloud.

The key takeaway: designing with sustainability in mind can help you build an application that is not only durable but also flexible enough to maintain the agility your business requires.

In this edition of Let’s Architect!, we share hands-on activities, case studies, and tips and tricks for making your Cloud applications more sustainable.

Architecting sustainably and reducing your AWS carbon footprint

Amazon Web Services (AWS) launched the Sustainability Pillar of the AWS Well-Architected Framework to help organizations evaluate and optimize their use of AWS services, and built the customer carbon footprint tool so organizations can monitor, analyze, and reduce their AWS footprint.

This session provides updates on these programs and highlights the most effective techniques for optimizing your AWS architectures. Find out how Amazon Prime Video used these tools to establish baselines and drive significant efficiencies across their AWS usage.

Take me to this re:Invent 2022 video!

Prime Video case study for understanding how the architecture can be designed for sustainability

Prime Video case study for understanding how the architecture can be designed for sustainability

Optimize your modern data architecture for sustainability

The modern data architecture is the foundation for a sustainable and scalable platform that enables business intelligence. This AWS Architecture Blog series provides tips on how to develop a modern data architecture with sustainability in mind.

Comprised of two posts, it helps you revisit and enhance your current data architecture without compromising sustainability.

Take me to Part 1! | Take me to Part 2!

An AWS data architecture; it’s now time to account for sustainability

An AWS data architecture; it’s now time to account for sustainability

AWS Well-Architected Labs: Sustainability

This workshop introduces participants to the AWS Well-Architected Framework, a set of best practices for designing and operating high-performing, highly scalable, and cost-efficient applications on AWS. The workshop also discusses how sustainability is critical to software architecture and how to use the AWS Well-Architected Framework to improve your application’s sustainability performance.

Take me to this workshop!

Sustainability implementation best practices and monitoring

Sustainability implementation best practices and monitoring

Sustainability in the cloud with Rust and AWS Graviton

In this video, you can learn about the benefits of Rust and AWS Graviton to reduce energy consumption and increase performance. Rust combines the resource efficiency of programming languages, like C, with memory safety of languages, like Java. The video also explains the benefits deriving from AWS Graviton processors designed to deliver performance- and cost-optimized cloud workloads. This resource is very helpful to understand how sustainability can become a driver for cost optimization.

Take me to this re:Invent 2022 video!

Discover how Rust and AWS Graviton can help you make your workload more sustainable and performant

Discover how Rust and AWS Graviton can help you make your workload more sustainable and performant

See you next time!

Thanks for joining us to discuss sustainability in the cloud! See you in two weeks when we’ll talk about tools for architects.

To find all the blogs from this series, you can check the Let’s Architect! list of content on the AWS Architecture Blog.

Let’s Architect! Designing event-driven architectures

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-designing-event-driven-architectures/

During the design of distributed systems, we have to identify a communication strategy to exchange information between different services while keeping the evolutionary nature of the architecture in mind. Event-driven architectures are based on events (facts that happened in a system), which are asynchronously exchanged to implement communication across different services while having a high degree of decoupling. This paradigm also allows us to run code in response to events, with benefits like cost optimization and sustainability for the entire infrastructure.

In this edition of Let’s Architect!, we share architectural resources to introduce event-driven architectures, how to build them on AWS, and how to approach the design phase.

AWS re:Invent 2022 – Keynote with Dr. Werner Vogels

re:Invent 2022 may be finished, but the keynote given by Amazon’s Chief Technology Officer, Dr. Werner Vogels, will not be forgotten. Vogels not only covered the announcements of new services but also event-driven architecture foundations in conjunction with customers’ stories on how this architecture helped to improve their systems.

Take me to this re:Invent 2022 video!

Dr. Werner Vogels presenting an example of architecture where Amazon EventBridge is used as event bus

Dr. Werner Vogels presenting an example of architecture where Amazon EventBridge is used as event bus

Benefits of migrating to event-driven architecture

In this blog post, we enumerate clearly and concisely the benefits of event-driven architectures, such as scalability, fault tolerance, and developer velocity. This is a great post to start your journey into the event-driven architecture style, as it explains the difference from request-response architecture.

Take me to this Compute Blog post!

Two common options when building applications are request-response and event-driven architecture

Two common options when building applications are request-response and event-driven architectures

Building next-gen applications with event-driven architectures

When we build distributed systems or migrate from a monolithic to a microservices architecture, we need to identify a communication strategy to integrate the different services. Teams who are building microservices often find that integration with other applications and external services can make their workloads tightly coupled.

In this re:Invent 2022 video, you learn how to use event-driven architectures to decouple and decentralize application components through asynchronous communication. The video introduces the differences between synchronous and asynchronous communications before drilling down into some key concepts for designing and building event-driven architectures on AWS.

Take me to this re:Invent 2022 video!

How to use choreography to exchange information across services plus implement orchestration for managing operations within the service boundaries

How to use choreography to exchange information across services plus implement orchestration for managing operations within the service boundaries

Designing events

When starting on the journey to event-driven architectures, a common challenge is how to design events: “how much data should an event contain?” is a typical first question we encounter.

In this pragmatic post, you can explore the different types of events, watch a video that explains even further how to use event-driven architectures, and also go through the new event-driven architecture section of serverlessland.com.

Take me to Serverless Land!

An example of events with sparse and full state description

An example of events with sparse and full state description

See you next time!

Thanks for reading our first blog of 2023! Join us next time, when we’ll talk about architecture and sustainability.

To find all the blogs from this series, visit the Let’s Architect! section of the AWS Architecture Blog.

The tale of a single register value

Post Syndicated from Jakub Sitnicki original https://blog.cloudflare.com/the-tale-of-a-single-register-value/

The tale of a single register value

“Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.” — Sherlock Holmes

Intro

The tale of a single register value

It’s not every day that you get to debug what may well be a packet of death. It was certainly the first time for me.

What do I mean by “a packet of death”? A software bug where the network stack crashes in reaction to a single received network packet, taking down the whole operating system with it. Like in the well known case of Windows ping of death.

Challenge accepted.

It starts with an oops

Around a year ago we started seeing kernel crashes in the Linux ipv4 stack. Servers were crashing sporadically, but we learned the hard way to never ignore cases like that — when possible we always trace crashes. We also couldn’t tie it to a particular kernel version, which could indicate a regression which hopefully could be tracked down to a single faulty change in the Linux kernel.

The crashed servers were leaving behind only a crash report, affectionately known as a “kernel oops”. Let’s take a look at it and go over what information we have there.

The tale of a single register value

Parts of the oops, like offsets into functions, need to be decoded in order to be human-readable. Fortunately Linux comes with the decode_stacktrace.sh script that did the work for us.

All we need is to install a kernel debug and source packages before running the script. We will use the latest version of the script as it has been significantly improved since Linux v5.4 came out.

$ RELEASE=`uname -r`
$ apt install linux-image-$RELEASE-dbg linux-source-$RELEASE
$ curl -sLO https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/scripts/decode_stacktrace.sh
$ curl -sLO https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/scripts/decodecode
$ chmod +x decode_stacktrace.sh decodecode
$ ./decode_stacktrace.sh -r 5.4.14-cloudflare-2020.1.11 < oops.txt > oops-decoded.txt

When decoded, the oops report is even longer than before! But that is a good thing. There is new information there that can help us.

The tale of a single register value

What has happened?

With this much input we can start sketching a picture of what could have happened. First thing to check is where exactly did we crash?

The report points at line 5160 in the skb_gso_transport_seglen() function. If we take a look at the source code, we can get a rough idea of what happens there. We are processing a Generic Segmentation Offload (GSO) packet carrying an encapsulated TCP packet. What is a GSO packet? In this context it’s a batch of consecutive TCP segments, travelling through the network stack together to amortize the processing cost. We will look more at the GSO later.

net/core/skbuff.c:
5150) static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
5151) {
          …
5155)     if (skb->encapsulation) {
                  …
5159)             if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
5160)                     thlen += inner_tcp_hdrlen(skb); 👈
5161)     } else if (…) {
          …
5172)     return thlen + shinfo->gso_size;
5173) }

The exact line where we crashed belongs to an if-branch that handles tunnel traffic. It calculates the length of the TCP header of the inner packet, that is the encapsulated one. We do that to compute the length of the outer L4 segment, which accounts for the inner packet length:

The tale of a single register value

To understand how the length of the inner TCP header is computed we have to peel off a few layers of inlined function calls:

inner_tcp_hdrlen(skb)
⇓
inner_tcp_hdr(skb)->doff * 4
⇓
((struct tcphdr *)skb_inner_transport_header(skb))->doff * 4
⇓
((struct tcphdr *)(skb->head + skb->inner_transport_header))->doff * 4

Now it is clear that inner_tcp_hdrlen(skb) simply reads the Data Offset field (doff) inside the inner TCP header. Because Data Offset carries the number of 32-bit words in the TCP header, we multiply it by 4 to get the TCP header length in bytes.

From the memory access point of view, to read the Data Offset value we need to:

  1. load skb->head value from address skb + offsetof(struct sk_buff, head)
  2. load skb->inner_transport_header value from address skb + offsetof(struct sk_buff, inner_transport_header),
  3. load the TCP Data Offset from skb->head + skb->inner_transport_header + offsetof(struct tcphdr, doff)

Potentially, any of these loads could trigger a page fault. But it’s unlikely that skb contains an invalid address since we accessed the skb->encapsulation field without crashing just a few lines earlier. Our main suspect is the last load.

The invalid memory address we attempt to load from should be in one of the CPU registers at the time of the exception. And we have the CPU register snapshot in the oops report. Which register holds the address? That has been decided by the compiler. We will need to take a look at the instruction stream to discover that.

Remember the disassembly in the decoded kernel oops? Now is the time to go back to it. Hint, it’s in AT&T syntax. But to give everyone a fair chance to follow along, here’s the same disassembly but in Intel syntax. (Alright, alright. You caught me. I just can’t read the AT&T syntax.)

All code
========
   0:   c0 41 83 e0             rol    BYTE PTR [rcx-0x7d],0xe0
   4:   11 f6                   adc    esi,esi
   6:   87 81 00 00 00 20       xchg   DWORD PTR [rcx+0x20000000],eax
   c:   74 30                   je     0x3e
   e:   0f b7 87 aa 00 00 00    movzx  eax,WORD PTR [rdi+0xaa]
  15:   0f b7 b7 b2 00 00 00    movzx  esi,WORD PTR [rdi+0xb2]
  1c:   48 01 c1                add    rcx,rax
  1f:   48 29 f0                sub    rax,rsi
  22:   45 85 c0                test   r8d,r8d
  25:   48 89 c6                mov    rsi,rax
  28:   74 0d                   je     0x37
  2a:*  0f b6 41 0c             movzx  eax,BYTE PTR [rcx+0xc]           <-- trapping instruction
  2e:   c0 e8 04                shr    al,0x4
  31:   0f b6 c0                movzx  eax,al
  34:   8d 04 86                lea    eax,[rsi+rax*4]
  37:   0f b7 52 04             movzx  edx,WORD PTR [rdx+0x4]
  3b:   01 d0                   add    eax,edx
  3d:   c3                      ret
  3e:   45                      rex.RB
  3f:   85                      .byte 0x85

Code starting with the faulting instruction
===========================================
   0:   0f b6 41 0c             movzx  eax,BYTE PTR [rcx+0xc]
   4:   c0 e8 04                shr    al,0x4
   7:   0f b6 c0                movzx  eax,al
   a:   8d 04 86                lea    eax,[rsi+rax*4]
   d:   0f b7 52 04             movzx  edx,WORD PTR [rdx+0x4]
  11:   01 d0                   add    eax,edx
  13:   c3                      ret
  14:   45                      rex.RB
  15:   85                      .byte 0x85

When the trapped page fault happened, we tried to load from address %rcx + 0xc, or 12 bytes from whatever memory location %rcx held. Which is hardly a coincidence since the Data Offset field is 12 bytes into the TCP header.

This means that %rcx holds the computed skb->head + skb->inner_transport_header address. Let’s take a look at it:

RSP: 0018:ffffa4740d344ba0 EFLAGS: 00010202
RAX: 000000000000feda RBX: ffff9d982becc900 RCX: ffff9d9624bbaffc
RDX: ffff9d9624babec0 RSI: 000000000000feda RDI: ffff9d982becc900
…

The RCX value doesn’t look particularly suspicious. We can say that:

  1. it’s in a kernel virtual address space because it is greater than 0xffff000000000000 – expected, and
  2. it is very close to the 4 KiB page boundary (0xffff9d9624bbb000 – 4),

… and not much more.

We must go back further in the instruction stream. Where did the value in %rcx come from? What I like to do is try to correlate the machine code leading up to the crash with pseudo source code:

<function entry>                # %rdi = skb
…
movzx  eax,WORD PTR [rdi+0xaa]  # %eax = skb->inner_transport_header
movzx  esi,WORD PTR [rdi+0xb2]  # %esi = skb->transport_header
add    rcx,rax                  # %rcx = skb->head + skb->inner_transport_header
sub    rax,rsi                  # %rax = skb->inner_transport_header - skb->transport_header
test   r8d,r8d
mov    rsi,rax                  # %rsi = skb->inner_transport_header - skb->transport_header
je     0x37
movzx  eax,BYTE PTR [rcx+0xc]   # %eax = *(skb->head + skb->inner_transport_header + offsetof(struct tcphdr, doff))

How did I decode that assembly snippet? We know that the skb address was passed to our function in the %rdi register because the System V AMD64 ABI calling convention dictates that. If the %rdi register hasn’t been clobbered by any function calls, or reused because the compiler decided so, then maybe, just maybe, it still holds the skb address.

If 0xaa and 0xb2 are offsets into an sk_buff structure, then pahole tool can tell us which fields they correspond to:

$ pahole --hex -C sk_buff /usr/lib/debug/vmlinux-5.4.14-cloudflare-2020.1.11 | grep '\(head\|inner_transport_header\|transport_header\);'
        __u16                      inner_transport_header; /*  0xaa   0x2 */
        __u16                      transport_header;     /*  0xb2   0x2 */
        unsigned char *            head;                 /*  0xc0   0x8 */

To confirm our guesswork, we can disassemble the whole function in gdb.

It would be great to find out the value of the inner_transport_header and transport_header offsets. But the registers that were holding them, %rax and %rsi, respectively, were reused after the offset values were loaded.

However, we can still examine the difference between inner_transport_header and transport_header that both %rax and %rsi hold. Let’s take a look.

The suspicious offset

Here are the register values from the oops as a reminder:

RAX: 000000000000feda RBX: ffff9d982becc900 RCX: ffff9d9624bbaffc
RDX: ffff9d9624babec0 RSI: 000000000000feda RDI: ffff9d982becc900

From the register snapshot we can tell that:

%rax = %rsi = skb->inner_transport_header - skb->transport_header = 0xfeda = 65242

That is clearly suspicious. We expect that skb->transport_header < skb->inner_transport_header, so either

  1. skb->inner_transport_header > 0xfeda, which would mean that between outer and inner L4 packets there is 65k+ bytes worth of headers – unlikely, or
  2. 0xfeda is a garbage value, perhaps an effect of an underflow if skb->inner_transport_header < skb->transport_header.

Let’s entertain the theory that an underflow has occurred.

Any other scenario, be it an out-of-bounds write or a use-after-free that corrupted the memory, is a scary prospect where we don’t stand much chance of debugging it without help from tools like KASAN report.

But if we assume for a moment that it’s an underflow, then the task is simple 😉. We “just” need to audit all places where skb->inner_transport_header or skb->transport_header offsets could have been updated while the skb buffer travelled through the network stack.

That raises the question — what path did the packet take through the network stack before it brought the machine down?

Packet path

It is time to take a look at the call trace in the oops report. If we walk through it, it is apparent that a veth device received a packet. The packet then got routed and forwarded to some other network device. The kernel crashed before the egress device transmitted the packet out.

The tale of a single register value

What immediately draws our attention is the veth_poll() function in the call trace. Polling inside a virtual device that acts as a simple pipe joining two network namespaces together? Puzzling!

The regular operation mode of a veth device is that transmission of a packet from one side of a veth-pair results in immediate, in-line, on the same CPU, reception of the packet by the other side of the pair. There shouldn’t be any polling, context switches or such.

However, in Linux v4.19 veth driver gained support for native mode eXpress Data Path (XDP). XDP relies on NAPI, an interface between the network drivers and the Linux network stack. NAPI requires that drivers register a poll() callback for fetching received packets.

The NAPI receive path in the veth driver is taken only when there is an XDP program attached. The fork occurs in veth_forward_skb, where the TX path ends and a RX path on the other side begins.

The tale of a single register value

This is an important observation because only on the NAPI/XDP path in the veth driver, received packets might get aggregated by the Generic Receive Offload.

Super-packets

Early on we’ve noted that the crash happens when processing a GSO packet. I’ve promised we will get back to it and now is the time.

Generic Segmentation Offload (GSO) is all about delaying the L4 segmentation process until the very last moment. So called super-packets, that exceed the egress route MTU in size, travel all the way through the network stack, only to be cut into MTU-sized segments just before handing the data over to the network driver for transmission. This way we process just one big packet on the transmit path, instead of a few smaller ones and save on CPU cycles in all the IP-level stack functions like routing, nftables, traffic control

Where do these super-packets come from? They can be a result of large write to a socket, or as is our case, they can be received from one network and forwarded to another network.

The latter case, that is forwarding a super-packet, happens when Generic Receive Offload (GRO) kicks in during receive. GRO is the opposite process of GSO. Smaller, MTU-sized packets get merged to form a super-packet early on the receive path. The goal is the same — process less by pushing just one packet through the network stack layers.

Not just any packets can be fused together by GRO. Loosely speaking, any two packets to be merged must form a logical sequence in the network flow, and carry the same metadata in protocol headers. It is critical that no information is lost in the aggregation process. Otherwise, GSO won’t be able to reconstruct the segment stream when serializing packets in the network card transmission code.

To this end, each network protocol that supports GRO provides a callback which signals whether the above conditions hold true. GRO implementation (dev_gro_receive()) then walks through the packet headers, the outer as well as the inner ones, and delegates the pre-merge check to the right protocol callback. If all stars align, the packets get spliced at the end of the callback chain (skb_gro_receive()).

I will be frank. The code that performs GRO is pretty complex, and I spent a significant amount of time staring into it. Hat tip to its authors. However, for our little investigation it will be enough to understand that a TCP stream encapsulated with GRE1 would trigger callback chain like so:

The tale of a single register value

Armed with basic GRO/GSO understanding we are ready to take a shot at reproducing the crash.

The reproducer

Let’s recap what we know:

  1. a super-packet was received from a veth device,
  2. the veth device had an XDP program attached,
  3. the packet was forwarded to another device,
  4. the egress device was transmitting a GSO super-packet,
  5. the packet was encapsulated,
  6. the super-packet must have been produced by GRO on ingress.

This paints a pretty clear picture on what the setup should look like:

The tale of a single register value

We can work with that. A simple shell script will be our setup machinery.

We will be sending traffic from 10.1.1.1 to 10.2.2.2. Our traffic pattern will be a TCP stream consisting of two consecutive segments so that GRO can merge something. A Scapy script will be great for that. Let’s call it send-a-pair.py and give it a run:

$ { sleep 5; sudo ip netns exec A ./send-a-pair.py; } &
[1] 1603
$ sudo ip netns exec B tcpdump -i BA -n -nn -ttt 'ip and not arp'
…
 00:00:00.020506 IP 10.1.1.1 > 10.2.2.2: GREv0, length 1480: IP 192.168.1.1.12345 > 192.168.2.2.443: Flags [.], seq 0:1436, ack 1, win 8192, length 1436
 00:00:00.000082 IP 10.1.1.1 > 10.2.2.2: GREv0, length 1480: IP 192.168.1.1.12345 > 192.168.2.2.443: Flags [.], seq 1436:2872, ack 1, win 8192, length 1436

Where is our super-packet? Look at the packet sizes, the GRO didn’t merge anything.

Turns out NAPI is just too fast at fetching the packets from the Rx ring. We need a little buffering on transmit to increase our chances of GRO batching:

# Help GRO
ip netns exec A tc qdisc add dev AB root netem delay 200us slot 5ms 10ms packets 2 bytes 64k

With the delay in place, things look better:

 00:00:00.016972 IP 10.1.1.1 > 10.2.2.2: GREv0, length 2916: IP 192.168.1.1.12345 > 192.168.2.2.443: Flags [.], seq 0:2872, ack 1, win 8192, length 2872

8192 bytes shown by tcpdump clearly indicate GRO in action. And we are even hitting the crash point:

$ sudo bpftrace -e 'kprobe:skb_gso_transport_seglen { print(kstack()); }' -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 1 probe...

        skb_gso_transport_seglen+1
        skb_gso_validate_network_len+17
        __ip_finish_output+293
        ip_output+113
        ip_forward+876
        ip_rcv+188
        __netif_receive_skb_one_core+128
        netif_receive_skb_internal+47
        napi_gro_flush+151
        napi_complete_done+183
        veth_poll+1697
        net_rx_action+314
        …

^C

…but we are not crashing. We will need to dig deeper.

We know what packet metadata skb_gso_transport_seglen() looks at — the header offsets, then encapsulation flag, and GSO info. Let’s dump all of it:

$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 2 probes...
DEV  LEN  NH  TH  ENC INH ITH GSO SIZE SEGS TYPE FUNC
sink 2936 270 290 1   294 254  |  1436 2    0x41 skb_gso_transport_seglen

Since the skb->encapsulation flag (ENC) is set, both outer and inner header offsets should be valid. Are they?

The outer network / L3 header (NH) looks sane. When XDP is enabled, it reserves 256 bytes of headroom before the headers. 14 byte long Ethernet header follows the headroom. The IPv4 header should then start at 270 bytes into the packet buffer.

The outer transport / L4 header offset is as expected as well. The IPv4 header takes 20 bytes, and the GRE header follows it.

The inner network header (INH) begins at the offset of 294 bytes. This makes sense because the GRE header in its most basic form is 4 bytes long.

The surprise comes last. The inner transport header offset points somewhere near the end of headroom which XDP reserves. Instead, it should start at 314, following the inner IPv4 header.

The tale of a single register value

Is this the smoking gun we were looking for?

The bug

skb_gso_transport_seglen() calculates the length of the outer L4 segment when given a GSO packet. If the inner_transport_header offset is off, then the result of the calculation might be off as well. Worth checking.

We know that our segments are 1500 bytes long. That makes the L4 part 1480 bytes long. What does skb_gso_transport_seglen() say though?

$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1460

Seems that we don’t agree. But if skb_gso_transport_seglen() is getting garbage on input we can’t really blame it.

If inner_transport_header is not correct, that TCP Data Offset read that we know happens inside the function cannot end well.

If we map it out, it looks like we are loading part of the source MAC address (upper 4 bits of the 5th byte, to be precise) and interpreting it as TCP Data Offset.

The tale of a single register value

Are we? There is an easy way to check.

If we ask nicely, tcpdump will tell us what the MAC addresses are:

The tale of a single register value

Plugging that into the calculations that skb_gso_transport_seglen()

thlen = inner_transport_header(skb) - transport_header(skb) = 254 - 290 = -36
thlen += inner_transport_header(skb)->doff * 4 = -36 + (0xf * 4) = -36 + 60 = 24
retval = gso_size + thlen = 1436 + 24 = 1460

…checks out!

Does this mean that I can control the return value by setting source MAC address?!

                                               👇
$ sudo ip -n A link set AB address be:d6:07:5e:05:11 # Change the MAC address 
$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1400

Yes! 1436 + (-36) + (0 * 4) = 1400. This is it.

However, how does all this tie it to the original crash? The badly calculated L4 segment length will make GSO emit shorter segments on egress. But that’s all.

Remember the suspicious offset from the crash report?

%rax = %rsi = skb->inner_transport_header - skb->transport_header = 0xfeda = 65242

We now know that skb->transport_header should be 290. That makes skb->inner_transport_header = 65242 + 290 = 65532 = 0xfffc.

Which means that when we triggered the page fault we were trying to load memory from a location at

skb->head + skb->inner_transport_header + offsetof(tcphdr, doff) = skb->head + 0xfffc + 12 = 0xffff9d9624bbb008

Solving it for skb->head yields 0xffff9d9624bbb008 - 0xfffc - 12 = 0xffff9d9624bab000.

And this makes sense. The skb->head buffer is page-aligned, meaning it’s a multiple of 4 KiB on x86-64 platforms — the 12 least significant bits the address are 0.

However, the address we were trying to read was (0xfffc+12)/4096 ~= 16 pages (or 64 KiB) past the skb->head page boundary (0xffff9d9624babfff).

The tale of a single register value

Who knows if there was memory mapped to this address?! Looks like from time to time there wasn’t anything mapped there and the kernel page fault handling code went “oops!”.

The fix

It is finally time to understand who sets the header offsets in a super-packet.

Once GRO is done merging segments, it flushes the super-packet down the pipe by kicking off a chain of gro_complete callbacks:

napi_gro_complete → inet_gro_complete → gre_gro_complete → inet_gro_complete → tcp4_gro_complete → tcp_gro_complete

These callbacks are responsible for updating the header offsets and populating the GSO-related fields in skb_shared_info struct. Later on the transmit side will consume this data.

Let’s see how the packet metadata changes as it travels through the gro_complete callbacks2 by adding a few more tracepoint to our bpftrace script:

$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 7 probes...
DEV  LEN  NH  TH  ENC INH ITH GSO SIZE SEGS TYPE FUNC
BA   2936 294 314 0   254 254  |  1436 0    0x00 napi_gro_complete
BA   2936 294 314 0   254 254  |  1436 0    0x00 inet_gro_complete
BA   2936 294 314 0   254 254  |  1436 0    0x00 gre_gro_complete
BA   2936 294 314 1   254 254  |  1436 0    0x40 inet_gro_complete
BA   2936 294 314 1   294 254  |  1436 0    0x40 tcp4_gro_complete
BA   2936 294 314 1   294 254  |  1436 0    0x41 tcp_gro_complete
sink 2936 270 290 1   294 254  |  1436 2    0x41 skb_gso_transport_seglen

As the packet travels through the gro_complete callbacks, the inner network header (INH) offset gets updated after we have processed the inner IPv4 header.

However, the same thing did not happen to the inner transport header (ITH) that is causing trouble. We need to fix that.

--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -298,6 +298,9 @@ int tcp_gro_complete(struct sk_buff *skb)
        if (th->cwr)
                skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;

+       if (skb->encapsulation)
+               skb->inner_transport_header = skb->transport_header;
+
        return 0;
 }
 EXPORT_SYMBOL(tcp_gro_complete);

With the patch in place, the header offsets are finally all sane and skb_gso_transport_seglen() return value is as expected:

$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 2 probes...
DEV  LEN  NH  TH  ENC INH ITH GSO SIZE SEGS TYPE FUNC
sink 2936 270 290 1   294 314  |  1436 2    0x41 skb_gso_transport_seglen

$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1480

Don’t worry, though. The fix is already likely in your kernel long time ago. Patch d51c5907e980 (“net, gro: Set inner transport header offset in tcp/udp GRO hook”) has been merged into Linux v5.14, and backported to v5.10.58 and v5.4.140 LTS kernels. The Linux kernel community has got you covered. But please, keep on updating your production kernels.

Outro

What a journey! We have learned a ton and fixed a real bug in the Linux kernel. In the end it was not a Packet of Death. Maybe next time we can find one 😉

Enjoyed the read? Why not join Cloudflare and help us fix the remaining bugs in the Linux kernel? We are hiring in Lisbon, London, and Austin.

And if you would like to see more kernel blog posts, please let us know!


1Why GRE and not some other type of encapsulation? If you follow our blog closely, you might already know that Cloudflare Magic Transit uses veth pairs to route traffic into and out of network namespaces. It also happens to use GRE encapsulation. If you are curious why we chose network namespaces linked with veth pairs, be sure to watch the How we built Magic Transit talk.
2Just turn off GRO on all other network devices in use to get a clean output (sudo ethtool -K enp0s5 gro off).