Tag Archives: AWS Well-Architected Tool

Know before you go – AWS re:Invent 2025 guide to Well-Architected and Cloud Optimization sessions

2025-11-14 Anitha Selvan

Post Syndicated from Anitha Selvan original https://aws.amazon.com/blogs/architecture/know-before-you-go-aws-reinvent-2025-guide-to-well-architected-and-cloud-optimization-sessions/

Are you ready to maximize your Well-Architected and Cloud Optimization learning and networking time at re:Invent 2025? We have put together this comprehensive guide to help you plan your schedule and make the most of the Well-Architected and cloud optimization sessions available this year. These sessions will deliver the practical guidance your teams need to lead strategic cloud initiatives, design next-generation architectures, optimize costs, or secure AI-powered systems.

Key themes at re:Invent for Well-Architected and Cloud Optimization – You can expect to see the following themes at re:Invent 2025

AI-powered architecture and governance

The sessions in this theme showcase how AWS is integrating AI technologies to transform traditional architectural practices. From using AI services for automated Well-Architected reviews to implementing self-evolving systems with agentic AI, these sessions demonstrate how you can use AI to automate architectural decisions, streamline governance processes, and scale best practices across the enterprise.

Sessions: ARC324-R, ARC317-R, SPS320, ARC302-R (session details are posted in the following section)

Well-Architected Framework evolution and implementation

These sessions highlight how the AWS Well-Architected Framework has evolved beyond its original scope to address modern architectural challenges. Attendees will learn how to implement the framework principles across different domains—from IoT security to backup strategies—while focusing on enterprise-scale governance and compliance.

Sessions: ARC204, SEC337, STG313-R, ARC323-R (session details are posted in the following section)

Cost optimization and FinOps

The cost optimization track focuses on innovative approaches to cloud financial management, with a strong emphasis on AI-powered FinOps solutions. Sessions range from hands-on workshops like the Frugal Architect GameDay to chalk talks on establishing effective cost governance models.

Sessions: ARC318-R, COP309-R, ARC309, DEV318 (session details are posted in the following section)

Session formats to fit your learning style

This year’s catalog features an exciting mix of content across different formats: from breakout, chalk talks, workshops, builder sessions to code talks.

Breakout sessions – Stay in the know

Sit back and enjoy these presentations to stay current with the latest solution enhancements and practical applications. AWS experts and guest speakers will share valuable insights and real-world examples.

From ideas to impact: Architecting with cloud best practices

ARC204 | Breakout session | December 1, 8:30 AM

Discover how foundational frameworks like the AWS Well-Architected Framework, AWS Cloud Adoption Framework, and AWS Cloud Operating Model evolved through customer feedback and real-world learnings from thousands of organizations, transforming from structured guidance into dynamic insights for optimizing cloud environments. Learn practical strategies for applying unified best practices to accelerate cloud transformation while managing large-scale architectural changes and maintaining operational excellence.

Build a well-architected foundation for scaling generative AI and agentic apps

AIM310 | Breakout session | December 1, 10:00 AM

Move beyond proof-of-concepts to build a production-ready foundation supporting all AI applications across your organization, addressing the critical transition from experimentation to enterprise-scale AI deployment. Navigate model access and management, tool discovery, memory and state handling, and observability at scale while building foundations that seamlessly integrate model access, orchestration workflows, agents, and tools with enterprise-grade governance controls.

AI-Powered Enterprise Architecture with ServiceNow & AWS

ARC337-S | Breakout session | December 2, 3:00 PM

Enterprises face a core challenge: translating architectural vision into resilient cloud reality. See how integrating ServiceNow’s Enterprise Architecture Workspace with the AWS Well-Architected Tool transforms traditional design processes. Through elegant “shift-left” methodologies, architects gain contextual insights that seamlessly blend enterprise modeling with cloud best practices. This presentation is brought to you by ServiceNow, an AWS Partner.

The AI revolution in customer support: Building predictive service systems

SPS315 | Breakout session | December 3, 5:30 PM

Discover how AWS is using generative AI to transform customer support from reactive to proactive. We’ll show how large language models and AI agents are improving customer satisfaction and efficiency. Topics include smart case routing, context-aware support, early problem detection, and responsible AI use. We’ll share real results and discuss balancing AI capabilities with human touch.

Optimize AWS Costs: Developer Tools and Techniques

DEV318 | Breakout session | December 1, 3:00 PM

As cloud applications grow in complexity, optimizing costs becomes crucial for developers. This session explores AWS native tools and coding practices that reduce expenses without compromising performance or scalability.

Chalk talks

AWS speakers set the stage at the beginning of the talk and then open up for discussion. Bring your questions and dive deep into the topic with AWS experts and other customers.

Architecting agentic systems: Self-evolving patterns with AWS AI

ARC324-R | Chalk talk | December 2, 1:30 PM

Learn to architect self-evolving systems using agentic AI that align with AWS Well-Architected principles, exploring cutting-edge patterns for systems that adapt, heal, and optimize themselves autonomously while maintaining architectural integrity. Implement autonomous monitoring and self-healing capabilities with Amazon Bedrock Agents, design AI-driven security controls and automated recovery mechanisms and create systems that continuously adapt to workload patterns while maintaining reliability and performance standards.

Building Well-Architected agentic AI applications

ARC317-R/R1 | Chalk talk | December 2, 3:00 PM and December 4, 1:00 PM

Navigate generative AI agent development with robust architectural practices for security and compliance, focusing on proven patterns for building production-ready agentic AI applications that meet enterprise requirements. Design agent architectures with guardrails, monitoring systems, and access controls using the AWS Well-Architected Generative AI Lens while implementing governance patterns that ensure regulatory compliance and enable systems to scale from prototype to enterprise-wide deployment.

Using generative AI to automate architectural guidance

ARC315 | Chalk talk | December 1, 4:30 PM

Replace time-intensive manual processes with AI-powered systems that generate strategic recommendations, design principles, and best practices at scale while maintaining quality and consistency. Generate organization-specific design principles using AI analysis of architectural patterns, implement AI-driven guidance systems with effective quality control mechanisms, and build knowledge bases that feed AI-powered architectural guidance while maintaining human oversight and addressing ethical considerations.

Agentic architecting: From prototype to production-ready systems

ARC330-R/R1 | Chalk talk | December 2, 5:30PM and December 4, 2:30 PM

Transform prototypes into production-ready systems by incorporating security, monitoring, and CI/CD through agentic architecting, focusing on practical challenges of moving from experimental AI systems to production-grade architectures. Use AI agents to generate and optimize AWS CDK infrastructure and application code, implement automated security improvements and CI/CD pipeline creation, and maintain AWS Well-Architected principles while enabling teams to focus on business logic as AI handles infrastructure complexity.

AI-powered FinOps: Agent-based cloud cost management

ARC318-R/R1 | Chalk talk | December 1, 4:00 PM and December 3, 4:00 PM

Learn how intelligent agents tackle fragmented cost data and optimization processes in complex multi-account environments, moving beyond traditional FinOps approaches to autonomous, intelligent financial optimization. Architect solutions using Amazon OpenSearch Service for data aggregation and Amazon Bedrock for contextual reasoning to design secure, scalable FinOps solutions that continuously optimize costs while delivering measurable business outcomes.

Supercharge your well-architected reviews with AWS Generative AI

SPS320 | Chalk talk | December 3, 4:00 PM

Discover how Koch Industries revolutionized AWS Well-Architected reviews using generative AI, transforming weeks-long manual processes into automated, intelligent systems. Automate architectural assessments using Amazon Bedrock Knowledge Bases and Model Context Protocol (MCP) to scale best practice reviews and optimize workloads in minutes instead of days while achieving more comprehensive, consistent, and actionable recommendations through proven change management and organizational adoption strategies.

Architecting enterprise-scale governance beyond AWS Control Tower

ARC323-R/R1 | Chalk talk | December 3, 11:30 AM and December 4, 2:00PM

Discover advanced governance strategies that build upon AWS Control Tower for enterprise-scale environments requiring sophisticated compliance, security, and operational controls. Implement infrastructure across six Well-Architected Foundations capabilities with critical trade-off understanding, build efficient multi-account structures balancing security requirements with innovation needs, and architect automated compliance checks and policy enforcement at scale while enabling self-service capabilities with centralized governance and security controls.

Securing IoT Workloads with AWS IoT Lens and AWS Security Reference Architecture

SEC337 | Chalk talk | December 3, 11:30 AM

Industrial environments are reaching new levels of connectivity, automation, efficiency, and real-time data insights. However, this increased connectivity also introduces significant security challenges. Unaddressed security concerns can expose vulnerabilities and slow down companies looking to accelerate digital transformation using IoT and cloud. This chalk talk explores relevant techniques, architecture patterns, best practices and AWS security services for securing complex OT/IT environments, IoT devices, edge and cloud using the AWS Well-Architected IoT Lens and AWS Security Reference Architecture (SRA).

Establishing effective cost governance

COP309-R/R1 | Chalk talk | December 3, 3:00 PM and December 4, 12:30 PM

Generative AI agent development demands robust architectural practices for security and compliance. This chalk talk explores proven patterns for architecting secure, efficient AI agents using the AWS Well-Architected Generative AI Lens. Through collaborative discussion and whiteboarding, examine architectural governance and best practices for production environments. Learn to design agent architectures incorporating guardrails, monitoring systems, access controls, and sustainable deployment practices. Gain actionable insights for building secure, efficient, and cost-effective agentic AI applications that scale.

Break down monoliths, modernizing applications on Amazon ECS

CNS346 | Chalk talk | December 2, 4:30 PM

Join this interactive chalk talk to solve a common challenge where monolithic applications take months to deploy new features, and scaling becomes increasingly difficult. We’ll start with a real scenario, an application running on servers with a shared database. Together, we’ll design the modernization path using Amazon ECS and Well-Architected Framework principles. You’ll explore common architecture patterns, containerization strategies, CI/CD automation, and blue/green deployment approaches for ECS. After this session, you’ll walk away with a practical roadmap to transform your monolithic application into scalable microservices. Bring your curiosity and help us build the architecture live.

Hands-on workshop and Builders’ sessions

AWS speakers will introduce the use-case and tools designed to tackle the challenge. You will follow instructions, complete the tasks, and walk away with better understanding of the capabilities.

AI-powered Well-Architected reviews: Building automated governance

ARC302-R | Builders’ session | December 1, 9:00 AM; December 2, 11:30 AM and December 3, 4:00 PM

Build intelligent systems that automate AWS Well-Architected Framework reviews using generative AI, transforming manual architectural assessments into continuous, intelligent governance processes. Evaluate architecture against Well-Architected pillars while incorporating organization-specific requirements, implement continuous analysis of architecture and infrastructure as code templates, and enhance AI understanding of architectural context using Model Context Protocol servers to transform time-intensive reviews into scalable, automated processes with consistent governance.

AI-powered troubleshooting: From chaos to Well-Architected

ARC301-R | Builders’ session | December 1, 8:30 AM; December 2, 3:00 PM and December 3, 10:00 AM

Tackle complex scenarios using AI-powered tools to diagnose and resolve architectural problems, gaining practical experience using AI to transform poorly designed systems into well-architected solutions. Troubleshoot and optimize architectures with scaling bottlenecks and database inefficiencies using Amazon Q, apply Well-Architected principles to enhance performance and security under pressure, and accelerate problem identification and resolution while building AWS optimization expertise and learning to identify architectural anti-patterns before they become critical issues.

The Frugal Architect GameDay: Building cost-aware architectures

ARC309 | Workshop | December 1, 8:00 AM

Compete to implement cost efficiency improvements across multiple AWS services in this interactive GameDay, applying the Laws of the Frugal Architect to real-world scenarios for practical experience in transforming high-cost infrastructure into efficient, sustainable architectures. Address challenges spanning compute, networking, storage, serverless, and observability domains while learning to reduce cloud unit costs and improve profitability without compromising service quality through gamified scenarios that build rapid cost optimization decision-making skills.

Optimize AWS Backup using AI evaluation and Well-Architected best practices

STG313-R | Builders’ session | December 2, 1:30 PM and December 3, 8:30 AM

Enhance AWS Backup implementation using the AWS Backup Evaluator Solution, an AI agent that synthesizes data from multiple sources to provide intelligent backup optimization recommendations. Assess backup environments against the Well-Architected AWS Backup lens using Strands Agents SDK, create unified visibility across backup landscapes to identify optimization opportunities, and implement AI agents that continuously monitor backup efficiency while aligning with AWS best practices to enhance efficiency and cost-effectiveness.

Visit the AWS Cloud Support kiosk in the Venetian

Important notes:

Session dates, times, and locations listed in the post are subject to change as we continue to optimize the schedule on session popularity and venue capacity. Please check this blog post and the re:Invent session catalog regularly for the most up-to-date information about your registered sessions and newly added activities. For a full view of Well-architected content, including sessions with partners, explore the AWS re:Invent catalog and filter on the Well-Architected Framework area of interest.

Remember to reserve your seats early as popular sessions fill up quickly and bring your laptop for hands-on builders’ sessions and workshops. Register today

About the authors

Maximizing Business Value Through Strategic Cloud Optimization

2025-08-01 Ryan Dsouza

Post Syndicated from Ryan Dsouza original https://aws.amazon.com/blogs/architecture/maximizing-business-value-through-strategic-cloud-optimization/

As cloud adoption continues to accelerate, organizations are realizing that the journey to the cloud is just the beginning. The real challenge—and opportunity—lies in optimizing cloud usage to drive maximum business value. At AWS, we’re committed to helping our customers navigate this journey successfully. Let’s explore some key insights and best practices for cloud optimization from the recent MIT Technology Review publication, Driving business value by optimizing the cloud.

The cloud optimization imperative

Recent data shows that global cloud infrastructure spending reached $84 billion in Q3 2024, marking a 23% year-over-year increase. This growth underscores the critical role of the cloud in driving business agility and innovation. However, to truly harness the power of the cloud, organizations must strike the right balance between cost, security, resilience, and innovation.

André Dufour, AWS Director and General Manager for AWS Cloud Optimization, emphasizes that cloud optimization involves making cloud spending efficient so that freed-up resources can be redirected to fund new innovations, such as generative AI initiatives.

Cloud optimization should be viewed as a continuous process rather than a one-time event, requiring regular assessment whenever business conditions or technical requirements change significantly. The approach should be comprehensive, addressing not just costs but all six pillars (operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability), while also recognizing that different workloads require tailored optimization strategies rather than a one-size-fits-all approach.

Best practices for success

Consider the following best practices:

Upskill your team – Empower your employees with cloud, cost management, and optimization skills. As Dufour notes, “Every engineer or builder plays a role in cloud optimization.”
Establish a cloud center of excellence – Create a centralized body responsible for developing and distributing cloud best practices throughout your organization.
Align finance and business – Make cloud KPIs business-centric rather than purely technical, so cloud optimization efforts support overall business goals.
Embrace automation – Use tools to automate cloud provisioning, monitoring, and optimization, reducing human error and effort.
Use AI services and solutions for efficiency – Use AI technologies to automate visualization, enhance decision-making, and optimize resource utilization.

Real-world success stories

Our customers are already seeing significant benefits from strategic cloud optimization:

DreamCasino achieved 30% cost savings and a 50% reduction in API response times, enabling expansion into new markets
BMC Software reduced cloud costs by 25% while improving security and reliability, reinvesting savings into new business opportunities
Even within AWS, our use of Amazon Q for application modernization saved an estimated 4,500 years of development work and $260 million in performance benefits

Business impact

Effective cloud optimization delivers more than just cost savings. It enables the following:

Faster innovation through reinvestment of saved resources
Enhanced security and operational efficiency
Improved ability to scale and adapt to business needs
Better customer experiences and faster time-to-market
The capability to make informed architecture and design decisions by balancing trade-offs across AWS Well-Architected pillars

AWS resources for your optimization journey

To help you accelerate your cloud optimization efforts, AWS provides several tools and resources:

AWS Cloud Adoption Framework – Assess your cloud readiness
AWS Well-Architected Framework – Detailed guidance across all six pillars
AWS Well-Architected Tool – Evaluate your workloads against AWS best practices
AWS Well-Architected Lenses – Extend AWS Well-Architected Framework to specific industry and technology domains, including the Generative AI lens
Well-Architected IaC (Infrastructure as Code) Analyzer tool – Automatically assess IaC templates such as AWS CloudFormation and Terraform against the AWS Well-Architected Framework

Getting started

To get started, consider the following steps:

Download the MIT executive report
Assess your cloud readiness using the Cloud Adoption Framework
Begin with a Well-Architected Review to assess your current state
Use AWS Trusted Advisor to optimize costs, improve performance, and address security gaps
Subscribe to AWS Health events to learn how service and resource changes might affect your applications running on AWS
Consider AWS Managed Services to extend your team with operational capabilities such as monitoring, incident management, AWS Incident Detection and Response, security, patch, backup, and cost optimization
Use AWS re:Post as an authoritative, knowledge-sharing service designed to help you quickly remove technical roadblocks, accelerate innovation, and operate more efficiently

Additionally, you can engage with the AWS Cloud Optimization Success (COS) team for more detailed guidance and to help identify what to do next in your cloud optimization journey. The COS team has Solutions Architects who specialize in the Cloud Adoption Framework and Well-Architected Framework and deliver workshops and training sessions though customer and partner engagements. The team can help drive adoption of AWS services through the use of the Well-Architected and Cloud Adoption Frameworks and support other services like AWS Trusted Advisor and AWS Health to optimize cost and cloud architectures. Whether you’re just starting or looking to enhance existing implementations, the AWS COS team provides the guidance, tools, and expertise you need to succeed.

Conclusion

At AWS, we’re dedicated to helping you optimize your cloud journey. By implementing these strategies and best practices, you can unlock the full potential of the cloud, driving innovation and growth while maintaining security and operational excellence.

Ready to take your cloud optimization to the next level? Refer to the resources included in this post and contact your AWS COS team to learn how we can help you maximize the value of your cloud investments.

About the Authors

Build and operate an effective architecture review board

2025-04-14 Darrin Weber

Post Syndicated from Darrin Weber original https://aws.amazon.com/blogs/architecture/build-and-operate-an-effective-architecture-review-board/

The rapid change of pace in computing landscapes because of cloud, artificial intelligence, and technology innovation has challenged organizations to keep up while making sure that their initiatives and projects remain compliant with enterprise guidelines and policies. An effective architecture review board (ARB) can help an organization maintain compliance with enterprise guardrails while accelerating implementation of initiatives in their project pipeline.

In this post, we identify the components of an efficient architecture review process, define what an ARB is, and describe how to build and operate an effective enterprise ARB.

What is an architecture review board?

An ARB is a multi-disciplinary team responsible for reviewing solution architectures to help ensure compliance with enterprise guidelines, best practices, and supportability. Team members include stakeholders from different disciplines throughout your organization, which typically include Security, Development, Enterprise Architecture, Infrastructure, and Operations. Including a broad set of stakeholders reduces the amount of project recycle that happens when stakeholder representation is overlooked.

An ARB isn’t a standalone group, it operates within the context of your project implementation process, reviewing solution architectures, custom development, and purchased solutions to maintain enterprise compliance and alignment with goals. As shown in the following diagram, architecture review typically occurs after the design phase—before a build or purchase decision—and again before deployment to validate that the reviewed architecture matches the solution that was built.

Project implementation process with architecture review checkpoints

Most organizations recognize the benefits and value of establishing an ARB. However, they often struggle to define and operate one in a manner that maximizes the benefits, integrates with overall project execution processes, and satisfies the needs of all the stakeholders. An efficient architecture review process imparts organizational benefits such as reduced costs, minimized security events, and diminished technical debt.

Life without a formal architecture review process

One of the most pronounced issues with implementing and maintaining software architecture is the difficulty in achieving human consensus. In any organization, you’ll find a diverse range of team members—each with their own priorities, perspectives, and pain points. Without a formalized review process, these differences can lead to prolonged debates and stalled projects. We often find that many members tend to fall into one of these personas:

	The Not Invented Here – This individual doesn’t trust any software unless it was built and operated by members of their company. They’re generally wary of any cloud solution and will expend development time to avoid capital expenditure.
	The Wait a Minute – This individual has good feedback and their input is welcome, but they tend to wait until the last minute before providing any feedback, making it difficult to have productive conversations and act on any constructive criticism.
	The Bottleneck – This individual craves control and insists that all reviews, decisions, and conversations go through them. This makes scaling the architecture review process very challenging and decisions will often come down to the whim of this one person.
	The Creative – This individual has passion for software and for creating things, but will often choose complexity over simplicity and turn their architectures into art projects.
	The Perfectionist – This individual tends to let the perfect be the enemy of the good. While their intentions are pure, this approach can result in delayed decision making and debates on topics that might not be worth the time of the board.
	The Historian – This individual has been at the company for a long time and remembers every success and failure along the way. While the context this individual brings to the table is invaluable, teams must guard against only looking to the past as they try to shape the future.

Benefits of an architecture review board

Establishing an ARB within your organization can yield substantial benefits, enhancing both the quality and efficiency of your architecture. Some key advantages are:

Improved compliance

By systematically reviewing architectural decisions, the ARB helps ensure that designs adhere to company best practices, open standards, and regulatory requirements as set forth by your enterprise architecture.

Reduced technical debt

Technical debt—taking shortcuts in the development process that lead to future complications—is a common issue in software development. The ARB helps identify and mitigate technical debt early in the design phase. By enforcing architectural standards and promoting best practices, the board helps ensure that decisions are made with long-term sustainability in mind. This approach results in more robust, maintainable codebases and reduces the likelihood of future rework.

Efficiency with lowered costs

While a formal architecture review sounds like it might have the potential for increased red tape and lowered efficiency, the ARB instead contributes to operational efficiency by standardizing architectural practices across the organization. This uniformity allows for better resource allocation, faster deployment cycles, and more predictable project timelines. By catching potential issues early in the design phase, the ARB helps avoid costly rewrites and rework, which can lead to significant cost savings over time.

Supportability

Designing for supportability is crucial for the long-term success of any application. The ARB makes sure that architectures are built with maintainability in mind, making it easier for operations teams to manage and troubleshoot systems. This focus on supportability leads to fewer downtime incidents, faster resolution times, and overall higher system reliability. By making sure the composition of the ARB crosses all parts of the organization, supportability concerns can be surfaced earlier and help ensure that changes are properly socialized.

Security

Above all, security is the most critical output of an effective ARB. The ARB plays a pivotal role in embedding security into the architectural fabric from the outset. By conducting thorough security reviews and incorporating security best practices into every design, the ARB makes sure that applications are resilient against unintended disclosure, inadvertent access, and threat actors. This proactive approach not only protects sensitive data, but also builds trust with your customers and stakeholders.

Steps for effective architecture review boards

Whether looking to establish a new architecture review process or improve the effectiveness of a current ARB, we’ve identified eight key steps to make sure that an ARB operates in a way which realizes the benefits of a robust architecture review process while maintaining enterprise compliance. With the exception of leadership support, the steps aren’t presented in a particular order and can be implemented in parallel or in whatever order fits your organization and resource availability.

Leadership support

Identifying a sponsor on the executive leadership team is crucial to the success of the ARB. An executive sponsor fosters participation from stakeholders, representing key organizations such as Security, Development, and Operations, along with gaining their commitment to the review processes. The executive sponsor helps embed the ARB function within the enterprise’s project implementation process. Supported by the executive sponsor, the ARB’s reviews serve as a formal gate within the project process, reducing attempts to bypass the review processes.

Single source for guidance, policies, and best practices

Establish a single, well-known repository or index so that the entire enterprise has a single source of truth that establishes the basis for designing and reviewing architecture. A common repository doesn’t need to be complex. It can be a central document location, wiki, or file share that’s quickly discoverable. Commonly, an enterprise’s collection of guidelines and policies are dispersed and managed by each organization using different mechanisms and repositories. Best practices are often treated as folklore passed between team members. Project teams and ARB stakeholders need to share a common understanding of the enterprise’s collective intelligence consisting of guidelines, policies, and best practices.

As the project community’s collective understanding of the enterprise guidelines and policies grows, initial solution designs are better aligned, and reviews through the ARB accelerate. After a common repository is established, consider using generative AI to create a natural language chatbot, a design chatbot, to simplify access to the collective guidelines, policies, and best practices. See Amazon Bedrock or Amazon Q – Generative AI Assistant.

Defined stakeholders

Make sure that your disciplines have defined stakeholders on the ARB. A good starting point is to identify stakeholders from the Security, Enterprise Architecture, Development, Infrastructure, and Operations teams. Broad representation on the board minimizes recycles and delays later in the project, which can occur when stakeholders aren’t engaged in the review process from the beginning. A stakeholder’s responsibility is to focus on their area of subject matter expertise and commit a portion of their time to the ARB. Consider rotating stakeholders periodically to distribute knowledge and workload through the organization.

Gated process with documented decisions

As previously described, architecture reviews typically occur after design and before solution implementation or purchase. Optionally, another architecture review takes place before deployment to validate that the solution matches what was reviewed and approved. It’s important to complete the review before implementation or the purchase decision and to get stakeholder sign off. Otherwise, projects risk rework and delay later in the process, often impacting cost or schedule to a greater degree. Document each ARB action, including approvals, reasons for recycles, exceptions required, follow-ups needed, and so on. Documented decisions should be added to the project’s overall lifecycle documentation to benefit future inspection of project or similar solution architectures.

Establish an exception process

There will always be exceptions to your enterprise guidelines or policies. Plan for exceptions with a defined process for reviewing, escalating, and gaining approval. Include leadership from both IT and business areas in the assessment and sign-off on an exception. Most importantly, set expiration dates on the exceptions–they should not be granted indefinitely. Exceptions are typically granted to accommodate a temporary nonconformance to provide time to plan for and implement a better, long-term solution.

Architecture central repository

Establish a well known, central repository for solution architecture documents. Solution documentation should be treated as living artifacts that are maintained for the lifecycle of the use case. A central architecture repository benefits teams responsible for operating and maintaining solutions, along with design teams chartered with new solution design. After a repository is established, consider including your architecture documentation in the generative AI design chatbot mentioned previously.

Automate review process

Employ automated architecture review processes wherever possible. Automated review processes allow stakeholders to focus their time on their subject matter expertise instead of administrative tasks. Consider separate review processes based on an initiative’s complexity, cost, and impact. Schedule live meetings with the ARB for the most complex and impactful solutions, and use offline mechanisms, such as email, for other efforts. Define a universal architecture template to capture areas of interest for review and automate the Q&A and sign-off processes. Consider using generative AI to do initial automated design reviews against enterprise core best practices and policies to further streamline stakeholder review processes.

Architecture review process shepherd

Identify a shepherd to help ensure that solution architectures are reviewed and the ARB review processes are broadly understood. The shepherd functions as a liaison with executive sponsors for exceptions. While the shepherd can also be a stakeholder on the board, the shepherd is not the single overall decision maker. The shepherd champions the continuous improvement of the architecture review process and mechanisms.

Conclusion

In this post, we explored the benefits of establishing an architecture review board within an organization, emphasizing its role in maintaining compliance, reducing technical debt, and enhancing operational efficiency. We discussed the challenges organizations face in setting up an effective ARB and provided guidance on the essential components and steps required to build and operate a successful ARB. By following the outlined steps, organizations can maximize the benefits of an ARB, making sure that architectural decisions align with enterprise goals and standards while fostering a culture of continuous improvement and stakeholder collaboration.

For additional guidance on garnering the leadership support necessary for an effective ARB, see Well-Architected Framework: Provide executive sponsorship. For more details on the review process, see Well-Architected Framework: The review process and AWS Well-Architected Tool, an AWS Management Console-based service that provides a consistent process for measuring your architecture using AWS best practices. If you’re interested in establishing a natural language chatbot interface for your enterprise architecture information, see Amazon Bedrock, Amazon Q Business, or Build a contextual chatbot application using Amazon Bedrock Knowledge Bases.

About the authors

Announcing updates to the AWS Well-Architected Framework guidance

2024-11-06 Haleh Najafzadeh

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-guidance-3/

We are excited to announce the availability of enhanced and expanded guidance for the AWS Well-Architected Framework with the following six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability

This release includes new best practices and improved prescriptive implementation guidance for the existing best practices. This includes enhanced recommendations and steps on reusable architecture patterns focused on specific business outcomes.

A brief history

The Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.

Figure 1. 2024 AWS Well-Architected guidance timeline

In 2012, we published the first version of the Framework. In 2015, we released the AWS Well-Architected Framework whitepaper. We added the Operational Excellence pillar in 2016. We released the pillar-specific whitepapers and AWS Well-Architected Lenses in 2017. The following year, the AWS Well-Architected Tool was launched.

In 2020, we released the new version of the Well-Architected Framework guidance, more lenses, and an API integration with the AWS Well-Architected Tool. We added the sixth pillar, Sustainability in 2021. In 2022, dedicated pages were introduced for each consolidated best practices across all six pillars, with several best practices updated with improved prescriptive guidance. By December 2023, we improved more than 75% of the Framework’s best practices. As of November 2024, we’ve refreshed 100% of the Framework’s best practices at least once since October 2022.

What’s new

The Well-Architected Framework supports customers as they mature in their cloud journey by providing guidance to help achieve more operable, secure, sustainable, scalable, and resilient environment and workload solutions.

The content updates and prescriptive guidance improvements in this release provide more complete coverage across AWS, helping customers make informed decisions when developing implementation plans. We added or expanded on guidance for the following services in this update: Amazon API Gateway, Amazon CloudFront, Amazon CloudWatch, Amazon CodeGuru, Amazon Cognito, Amazon GuardDuty, Amazon Inspector, Amazon Macie, Amazon Q Business, Amazon Q Developers, Amazon Redshift, Amazon S3, AWS Certificate Manager, AWS CloudFormation, AWS CloudTrail, AWS CodeBuild, AWS CodeDeploy, AWS CodePipeline, AWS Config, AWS Control Tower, AWS Customer Carbon Footprint Tool, AWS Glue, AWS Health, AWS Identity and Access Management (IAM), AWS Key Management Service (KMS), AWS OpenSearch, AWS Organizations, AWS Resource Access Manager, AWS Secrets Manager, AWS Security Hub, AWS Step Functions, AWS Systems Manager, AWS Trusted Advisor, AWS Verified Access, and AWS WAF.

Pillar updates

Operational Excellence

In the Operational Excellence Pillar, we updated five best practices across four questions. This includes OPS02, OPS05, OPS09, and OPS10. The updates in this release include improved prescriptive guidance on multiple AWS services. OPS02-BP02 updates leverage Amazon Q Business for improving workforce collaboration and productivity. OPS05-BP08 updates demonstrate AWS Organizations and AWS Control Tower capabilities that enable updates to a multi-environment setup while meeting governance and policy requirements. OPS09-BP01 and OPS09-BP02 have updated guidance and resources for developing operational key performance indicators (KPIs). OPS10-BP02 has been updated with information on AWS Health, including its planned lifecycle events feature, for integrating into an incident management workflow.

Security

In the Security Pillar, we updated 43 best practices across nine questions. This includes SEC02, SEC03, SEC04, SEC06, SEC07, SEC08, SEC09, SEC10, and SEC11. All best practices in SEC03 (Permissions management) were revised, with updates to guidance on Attribute Based Access Control (ABAC), AWS IAM Access Analyzer, and emergency access processes. SEC02 (Identity management) also saw updates to all six of its best practices, including refinements to guidance on identity federation and secrets management. SEC07 through SEC11 received updates to guidance on data protection, incident response, and application security. Key updates include replacing the security information and event management SIEM solution on AWS OpenSearch recommendation with AWS CloudTrail Lake in SEC04 (Detection), expanded guidance on AWS S3 Object Lock and AWS S3 Glacier Vault Lock in SEC08 (Protecting data at rest), and the addition of recommendations for Mutual Transport Layer Security (mTLS) and private certificates in SEC09 (Protecting data in transit). Overall, these changes reflect AWS’s commitment to providing up-to-date, comprehensive security guidance in line with evolving best practices and new service capabilities.

Reliability

In the Reliability Pillar, we updated 14 best practices across nine questions. This includes REL01, REL02, REL04, REL06, REL07, REL08, REL10, REL12, and REL13. We expanded and clarified our guidance throughout the Pillar and added detailed implementation steps to each best practice that did not previously have them. We refreshed our multi-location deployment guidance by merging REL10-BP02 into REL10-BP01, while improving the prescriptive guidance of this best practice with a new title of Deploy the workload to multiple locations. We updated our idempotent operations guidance in REL04-BP04 to provide detailed technical guidance for builders who wish to provide idempotent APIs and updated the title to Make mutating operations idempotent. We merged functional testing guidance by migrating the content previously published under REL12-BP03 to REL08-BP02 (Integrate functional testing as part of your deployment) and expanded our guidance on testing in CI/CD pipelines. We refreshed REL07-BP01 to emphasize infrastructure as code (IaC) as a cornerstone of automated resource management and scaling. We improved our guidance in REL06-BP02 on how to use system and application logs to improve workload observability. We also refreshed our links to relevant resources including documents, videos, and presentations.

Performance Efficiency

In the Performance Efficiency Pillar, we updated the resources of PERF03-BP04 with the latest services.

Sustainability

In the Sustainability Pillar, we updated 10 best practices across six questions. This includes SUS01, SUS03, SUS04, SUS05, and SUS06. Best practices SUS01-BP01, SUS03-BP02, SUS03-BP05, SUS04-BP03, SUS04-BP05, SUS04-BP06, SUS04-BP07, SUS04-BP08, SUS05-BP04, and SUS06-BP02 now offer improved prescriptive guidance. Additionally, we added a new best practice, SUS06-BP01 Communicate and cascade your sustainability goals, which highlights the critical role of the central IT team in cascading sustainability goals and objectives across the broader organization. By strategically leveraging the cloud, implementing resource-efficient practices, and employing sustainability-focused tools and analytics, IT teams can play a pivotal role in driving meaningful reductions in the organization’s environmental impact.

Conclusion

This release includes updates and improvements to the Framework guidance totaling 78 best practices. As of this release, we’ve updated 100% of the existing Framework best practices at least once since October 2022. With this release, we have refreshed 100% of all the pillars of the Framework including the Reliability Pillar, with 14 of its best practice updated for the first time since major Framework improvements started in 2022.

Updates in this release will be incorporated into the AWS Well-Architected Tool in future releases, which you can use to review your workloads, address important design considerations, and help you follow the AWS Well-Architected Framework guidance.

The content will be available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.

Ready to get started? Review the updated AWS Well-Architected Framework Pillar best practices and pillar-specific whitepapers.

Have questions about some of the new best practices or most recent updates? Join our growing community on AWS re:Post.

Let’s Architect! Designing Well-Architected systems

2024-07-31 Vittorio Denti

Post Syndicated from Vittorio Denti original https://aws.amazon.com/blogs/architecture/lets-architect-well-architected-systems/

The design of cloud workloads can be a complex task, where a perfect and universal solution doesn’t exist. We should balance all the different trade-offs and find an optimal solution based on our context. But how does it work in practice? Which guiding principles should we follow? Which are the most important areas we should focus on?

In this blog, we will try to answer some of these questions by sharing a set of resources related to the AWS Well-Architected Framework. The Framework shares a set of methods to help you understand the pros and cons of decisions you make while building cloud systems. By following this resource, you will learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems in the cloud. The framework is constantly updated; it evolves as the technology landscape changes. Check out the latest updates from June 2024.

Build secure applications on AWS the Well-Architected way

The AWS Well-Architected Framework is constantly updated across all six pillars. The security pillar added a new best practice area: application security (AppSec). In this session, you can learn about the best practices highlighted in this area. Review four key domains: organization and culture, security of the pipeline, security in the pipeline, and dependency management. Each area provides a set of principles that you can implement and provides a complete view of how you design, develop, build, deploy, and operate secure workloads in the cloud.

Security should be part of the end-to-end development process, and implementing best practices both in the application code as well as in the underlying infrastructure components.

Figure 1. Security should be part of the end-to-end development process, and implementing best practices both in the application code as well as in the underlying infrastructure components.

Take me to this video

Announcing the AWS Well-Architected Mergers and Acquisitions Lens

How can we integrate different systems as a consequence of an acquisition? Mergers and acquisitions operations bring different people with different backgrounds together, with a need of driving systems convergence. Both organization and technical challenges can arise in this scenario. The Mergers and Acquisitions (M&A) Lens is a collection of customer-proven design principles, best practices, and prescriptive guidance to help you integrate the IT systems of two or more organizations. This lens helps companies follow AWS prescribed best practices during technical integration, drive cost optimization, and expedite merger and acquisition value realization.

If the seller company runs on another cloud platform or on-premises, the acquirer should plan a cloud migration while guaranteeing continuity of service.

Figure 2. If the seller company runs on another cloud platform or on-premises, the acquirer should plan a cloud migration while guaranteeing continuity of service.

Take me to this blog

AWS Well-Architected Labs

One of the best ways to become familiar with new concepts and methodologies consist of doing hands-on work to absorb the techniques properly. For each Let’s Architect! blog, we tend to share at least one workshop associated with the topic. The AWS Well-Architected Framework covers six different pillars, so today we share the AWS Well-Architected Labs to cover each area of the framework. Feel free to jump across the different workshops and start building!

Sustainability is one of the pillars in the framework. Asynchronous and scheduled processing are key techniques for improving the sustainability and costs of cloud architectures.

Figure 3. Sustainability is one of the pillars in the framework. Asynchronous and scheduled processing are key techniques for improving the sustainability and costs of cloud architectures.

Take me to this workshop

Gain confidence in system correctness and resilience with formal methods

Distributed systems are difficult to design. It’s even more difficult to test them and prove they are working. Formal methods enable the early discovery of design bugs that can escape the guardrails of design reviews and automated testing only to get uncovered in production. This video shows how AWS uses P, an open source, state machine–based programming language for formal modelling and analysis of distributed systems.

You can learn from AWS engineers and architects how to use P for your own applications to find bugs early in the development process and increase developer velocity. This tool is used in AWS to reason out the correctness of cloud services (for example, Amazon Simple Storage Service and Amazon DynamoDB).

Figure 4. An example of a distributed system for processing transactions.

Take me to this video

See you next time!

Thanks for reading! Hopefully, you got interesting insights into the methodologies for designing Well-Architected systems. In the next blog, we will talk about multi-region architectures. We will understand when they are actually needed, and which design principles should be applied.

To revisit any of our previous posts or explore the entire series, visit the Let’s Architect! page.

Announcing updates to the AWS Well-Architected Framework guidance

2024-06-27 Haleh Najafzadeh

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-guidance-2/

We are excited to announce the availability of an enhanced AWS Well-Architected Framework. In this update, you’ll find expanded guidance across all six pillars of the Framework: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.

In this release, we updated the implementation guidance for the new and existing best practices to be more prescriptive. This includes enhanced recommendations and steps on reusable architecture patterns focused on specific business outcomes.

A brief history

The Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.

Figure 1. 2024 AWS Well-Architected guidance timeline

In 2012, we published the first version of the Framework. In 2015, we released the AWS Well-Architected Framework whitepaper. We added the Operational Excellence pillar in 2016. We released pillar-specific whitepapers and AWS Well-Architected Lenses in 2017. The following year, the AWS Well-Architected Tool was launched.

In 2020, we released the new version of the Well-Architected Framework guidance, more lenses, and an API integration with the AWS Well-Architected Tool. We added the sixth pillar, Sustainability, in 2021. In 2022, dedicated HTML pages were introduced for each consolidated best practice across all six pillars, with several best practices updated with improved prescriptive guidance. By December 2023, we improved more than 75% of the Framework’s best practices. As of June 2024, more than 95% of the Framework’s best practices have been refreshed at least once.

What’s new

The Well-Architected Framework supports customers as they mature in their cloud journey by providing guidance to help achieve accurate business, environment, and workload solutions. Well-Architected is committed to providing such information to customers by continually evolving and updating our guidance.

The content updates and prescriptive guidance improvements in this release provide more complete coverage across AWS, helping customers make informed decisions when developing implementation plans. We added or expanded on guidance for the following services in this update: Amazon Chime, Amazon CloudWatch, Amazon CodeGuru Security, Amazon CodeWhisperer, Amazon Devops Guru, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon ElastiCache, Amazon EventBridge, Amazon GuardDuty, Amazon Q, Amazon Route 53, Amazon Security Lake, Amazon Simple Notification Service (Amazon SNS), AWS Billing and Cost Management, AWS Budgets, AWS Compute Optimizer, AWS Config, AWS Control Tower, AWS Cost Optimization Hub, AWS Customer Carbon Footprint Tool, AWS Data Exports, AWS Data Lifecycle Manager, AWS Elastic Disaster Recovery, AWS Fault Injection Service, AWS Global Accelerator, AWS Health, AWS Local Zones, AWS Organizations, AWS Outposts, AWS Resilience Hub, AWS Security Hub, AWS Systems Manager, AWS Trusted Advisor, and AWS Wickr.

Pillar updates

Operational Excellence

In the Operational Excellence Pillar, we updated 30 best practices across six questions. This includes OPS01, OPS02, OPS03, OPS07, OPS10, and OPS11. This update includes a refreshed structure and improved prescriptive guidance with updates on observability, generative AI capabilities, operating models, and the evolution of operational practices.

As part of this update, we consolidated four best practices into two (OPS01-BP07 merged into OPS01-BP06, OPS03-BP08 merged into OPS03-BP04) and changed the titles of seven best practices. Additionally, we added one new design principal to highlight the importance of aligning operating models to business outcomes and reordered design principles according to their priority from foundational to specialized. We updated three design principles and changed the title of one design principle. We’ve also updated the operating model guidance section of the pillar to be more prescriptive, showcasing pathways to evolving operating models.

The implementation guidance in best practices includes guidance on implementing generative AI capabilities with Amazon Q (Q Developer, Q Business, Q in QuickSight), the latest capabilities from Amazon CloudWatch Network Monitor, Amazon CloudWatch Internet Monitor, Amazon CloudWatch Logs, Amazon CloudWatch best practice alarms, cross-account observability, log-based alarms, log data protection, and AWS Health.

Security

In the Security Pillar, we updated 28 best practices across 10 questions. This includes SEC01, SEC02, SEC03, SEC04, SEC05, SEC06, SEC07, SEC08, SEC09, and SEC10. Best practice updates include removing duplication, clarifying desired outcomes, and providing robust prescriptive implementation guidance. As part of this update, we merged SEC01-BP05 into SEC01-BP04. We deleted two practices, SEC08-BP05 and SEC09-BP03, to remove the duplication of guidance covered across other existing practices. We updated the titles for 14 practices and changed the order of nine practices to improve clarity and flow.

Reliability

In the Reliability Pillar, we updated 11 best practices across six questions. This includes REL02, REL04, REL05, REL06, REL07, and REL08, with three best practices changing titles including REL04-BP01, REL05-BP06, and REL06-BP05. We improved resources available in best practices to include more recent blog posts, technical talks, and presentations. We also improved the prescriptive guidance by expanding on implementation steps. New services and service features added to the best practices guidance for AWS Resilience Hub, Amazon Route 53, Amazon Route53 Application Recovery Controller, AWS Fault Injection Service, and Amazon CloudWatch Synthetics.

Performance Efficiency

In the Performance Efficiency Pillar, we updated nine best practices across three questions. This includes PERF01, PERF03, and PERF05. We improved the prescriptive guidance on these best practices and added pillar-specific guidance on services including Amazon Devops Guru and Amazon ElastiCache Serverless. We’ve updated the resources section of all best practices with new and relevant resources.

Cost Optimization

In the Cost Optimization Pillar, we updated eight best practices across five questions. This includes COST01, COST02, COST03, COST05, and COST11. One new best practice added in COST06 highlights the benefits of using shared resources for organizational cost optimization. The improved best practices include guidance on AWS services and features including the AWS Cost Optimization Hub, AWS Billing and Cost Management features, and AWS Data Exports. These updates also cover sample key performance indicators (KPIs) for tracking optimization efforts, elaborate on the use of cost allocation tags, and discuss the split cost allocation for Amazon EKS and Amazon ECS to separate costs of containerized workloads. Additionally, the updates offer improved prescriptive and clear guidance on budgeting and forecasting. Finally, you’ll find guidance on using automations to reduce costs.

Sustainability

In the Sustainability Pillar, we updated 18 best practices across five questions. This includes SUS01, SUS02, SUS03, SUS04, SUS05, and SUS06. We improved the prescriptive guidance on these best practices, and added Pillar-specific guidance on services, including AWS Local Zones, AWS Outposts, Amazon Chime, AWS Wickr, Amazon CodeWhisperer, and AWS Customer Carbon Footprint Tool. We’ve expanded lists of resources across all best practices with new and relevant resources.

Conclusion

This release includes updates and improvements to the Framework guidance totaling 105 best practices. As of this release, we’ve updated 95% of the existing Framework best practices at least once since October 2022. With this release, we have refreshed 100% of the Operational Excellence, Security, Performance Efficiency, Cost Optimization, and Sustainability Pillars, as well as 79% of Reliability Pillar best practices. Best practice updates in this release across Operational Excellence, Security, and Reliability (a total of 66) are first-time updates since major Framework improvements started in 2022.

The content is available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.

Updates in this release are also available in the AWS Well-Architected Tool, which you can use to review your workloads, address important design considerations, and help you follow the AWS Well-Architected Framework guidance.

Ready to get started? Review the updated AWS Well-Architected Framework Pillar best practices and pillar-specific whitepapers.

Have questions about some of the new best practices or most recent updates? Join our growing community on AWS re:Post.

Announcing updates to the AWS Well-Architected Framework guidance

2023-10-03 Haleh Najafzadeh

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-guidance/

We are excited to announce the availability of improved AWS Well-Architected Framework guidance. In this update, we have made changes across all six pillars of the framework: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.

In this release, we have made the implementation guidance for the new and updated best practices more prescriptive, including enhanced recommendations and steps on reusable architecture patterns targeting specific business outcomes in the Amazon Web Services (AWS) Cloud.

A brief history

The Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.

In 2012, the first version of the framework was published, leading to the 2015 release of the guidance whitepaper. We added the Operational Excellence pillar in 2016. The pillar-specific whitepapers and AWS Well-Architected Lenses were released in 2017, and the following year, the AWS Well-Architected Tool was launched.

In 2020, Well-Architected Framework guidance had a new release, along with more lenses, as well as API integration with the AWS Well-Architected Tool. The sixth pillar, Sustainability, was added in 2021. In 2022, dedicated pages were introduced for each consolidated best practice across all six pillars, with several best practices updated with improved prescriptive guidance. By April 2023, more than 50% of the Framework’s best practices have had their prescriptive guidance improved.

A brief history of the AWS Well-Architected Framework

What’s new

As customers mature in their journey, they are seeking guidance to achieve accurate solutions that is prescriptive to their business, environments, and workloads. AWS Well-Architected is committed to providing such information to customers by continually evolving and updating our guidance.

The content updates and improvements in this release focus on having more complete coverage across the AWS service portfolio, helping customers make more informed decisions when developing implementation plans. Services that were added or expanded in coverage include: AWS Elastic Disaster Recovery, AWS Trusted Advisor, AWS Resilience Hub, AWS Config, AWS Security Hub, Amazon GuardDuty, AWS Organizations, AWS Control Tower, AWS Compute Optimizer, AWS Budgets, Amazon CodeWhisperer, Amazon CodeGuru, Amazon EventBridge, Amazon CloudWatch, Amazon Simple Notification Service, AWS Systems Manager, Amazon ElastiCache, and AWS Global Accelerator.

Pillar updates

Operational Excellence

The Operational Excellence Pillar has received updates to two of the five Design Principles and has a new Design Principle on observability, which highlights its importance and relevance throughout the pillar content. All 10 best practices in OPS05 have been updated, and we have consolidated 28 best practices into 16, across four questions (OPS04, OPS06, OPS08, and OPS09), as well as improving prescriptive guidance.

Security

In the Security Pillar, the Incident response in SEC10 underwent an update to align with the AWS Security Incident Response Guide, while introducing one new best practice, and improving the prescriptive guidance for others. Two best practices in SEC08 and SEC09 have received improved prescriptive guidance on securing workloads at rest and in transit.

Reliability

The Reliability Pillar has received prescriptive guidance improvements to one best practice in REL06, and six best practices in REL11, focused on how to best monitor, failover, remediate, and limit impacts of failures. The update addresses a wide variety of managed services and designs, including multi-Region-based resilience.

Performance Efficiency

The Performance Efficiency Pillar has been completely restructured, consolidating and merging guidance to reduce the number of best practices by 10 and the number of questions by three. We have added best practices around efficient caching and optimizing hardware acceleration. We have also improved the implementation guidance in all 32 best practices of the newly restructured Pillar.

Cost Optimization

The Cost Optimization Pillar has 10 best practices with improved implementation prescriptive guidance.

Sustainability

The Sustainability Pillar has received updates to the risk levels of seven best practices.

Conclusion

This Well-Architected release includes updates and improvements to 90 best practices: Operational Excellence (26), Security (8), Reliability (7), Performance Efficiency (32), Cost Optimization (10), and Sustainability (7). These changes are in addition to the 151 improved best practices released in 2023 (127 in April 10, 2023; and 24 in July 13, 2023), resulting in more than 73% of the existing Framework best practices updated at least once in the last year.

As of this release, 100% of Performance Efficiency, Cost Optimization, and Sustainability; 63% of Operational Excellence; 60% of Security; and 50% of Reliability Pillar content have been refreshed at least once since October 2022.

The content is available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.

Updates in this release are also available in the AWS Well-Architected Tool, which can be used to review your workloads, address important design considerations, and help ensure that you follow the best practices and guidance of the AWS Well-Architected Framework.

Ready to get started? Review the updated AWS Well-Architected Framework Pillar best practices, as well as pillar-specific whitepapers.

Have questions about some of the new best practices or most recent updates? Join our growing community on AWS re:Post.

Let’s Architect! Monitoring production systems at scale

2023-04-12 Vittorio Denti

Post Syndicated from Vittorio Denti original https://aws.amazon.com/blogs/architecture/lets-architect-monitoring-production-systems-at-scale/

“Everything fails, all the time” is a famous quote from Amazon’s Chief Technology Officer Werner Vogels. This means that software and distributed systems may eventually fail because something can always go wrong. We have to accept this and design our systems accordingly, test our software and services, and think about all the possible edge cases.

With this in mind, we should also set our teams up for success by providing visibility in every environment for a quick turnaround when incidents happen. When a system serves traffic in production, we need to monitor it to make sure it behaves as expected and that all components are healthy. But questions arise such as:

How do we monitor a system?
What is monitoring?
What are some architectural and engineering approaches to implement in order to design a successful monitoring strategy?

All of these questions require complex answers. It’s not possible to cover everything in a blog post, but let’s start exploring the topic and sharing resources to guide you through this domain.

In this edition of Let’s Architect! we share some practices for monitoring used at Amazon and AWS, as well as more resources to discover how to build monitoring solutions for the workloads running on AWS.

Observability best practices at Amazon

Observability and monitoring are engineering tasks that also require putting a suitable cultural mindset in place. At Amazon, if a service doesn’t run as expected, the team writes a CoE (Correction of Errors) document to analyze the issue and answer critical questions to learn from it. There are also weekly operations meetings to analyze operational and performance dashboards for each service.

The session introduced here covers the full range of monitoring at Amazon, from how teams assess system health at a high level to how they understand the details of a single request. Use this resource to learn some best practices for metrics, logs, and tracing, and using these signals to achieve operational excellence.

Take me to this re:Invent video!

Observability is an iterative process which requires us to establish a feedback loop and improve based on the signals coming from the system.

Build an observability solution using managed AWS services and the OpenTelemetry standard

Visibility of what’s happening in a distributed system is key to operationalize workloads at scale. OpenTelemetry is the standard for observability and AWS services are fully integrated with that. The blog post introduced in this section shows you how AWS Distro for OpenTelemetry (ADOT) works under the hood and how to use it with a Kubernetes cluster. But keep in mind, this is just one of the many implementations available for AWS compute services and OpenTelemetry—so even if you’re not using Kubernetes right now, we’ve still got you covered!

Want more? Watch this re:Invent video for an understanding of how to think about logging, tracing, metrics, and monitoring with AWS services, and the possibilities to provide the observability your distributed systems need. This is a great learning resource with many demos and examples.

Take me to this blog post!

Flow of metrics and traces from Application services to the Observability Platform.

Optimizing your AWS Batch architecture for scale with observability dashboards

We’ve explored the mental models and strategies for monitoring in previous resources. Now let’s see how these principles can be applied in a scenario where we run batch and ML computing jobs at scale. In the blog post introduced in this section, you can learn how to use runtime metrics to understand an architecture designed on AWS Batch for running batch computing jobs. AWS Batch is a fully managed service enabling you to run jobs at any scale without needing to manage underlying compute resources. This blog explains how AWS Batch works and guides you through the process used to design a monitoring framework.

Since the solution is open-source, you are free to add other custom metrics you find useful. To get started with the AWS Batch open-source observability solution, visit the project page on GitHub. Several customers have used this monitoring tool to optimize their workload for scale by reshaping their jobs, refining their instance selection, and tuning their AWS Batch architecture.

Take me to this blog!

High-level structure of AWS Batch resources and interactions. This diagram depicts a user submitting jobs based on a job definition template to a job queue, which then communicates to a compute environment that resources are needed.

Observability workshop

This resource provides a hands-on experience for you on the variety of toolsets AWS offers to set up monitoring and observability on your applications. Whether your workload is on-premises or on AWS—or your application is a giant monolith or based on modern microservices-based architecture—the observability tools can provide deeper insights into application performance and health.

The monitoring tools covered in this workshop provide powerful capabilities that enable you to identify bottlenecks, issues, and defects without having to manually sift through various logs, metrics, and trace data.

Take me to this workshop!

The diagram illustrates the various components of the PetAdoptions architecture. In the workshop you will learn how to monitor this application.

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about containers on AWS.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Announcing updates to the AWS Well-Architected Framework

2023-04-11 Haleh Najafzadeh

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-2/

A brief history

The AWS Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.

In 2020, the content for the Well-Architected Framework received a major update, as well as more lenses, and API integration with the AWS Well-Architected Tool. The sixth pillar, Sustainability, was added in 2021. In 2022, dedicated pages were introduced for each consolidated best practice across all six pillars, with several best practices updated with improved prescriptive guidance.

AWS Well-Architected timeline

What’s new

Well-Architected Framework content is consistently updated and improved in order to adapt to the constantly changing and innovating AWS environment, with new and evolved emerging services and technologies. This ensures cloud architects can build and operate secure, high-performing, resilient, efficient, and sustainable systems in the AWS Cloud.

The content updates and improvements in this release focus on providing more complete coverage across the AWS service portfolio to help customers make more informed decisions when developing implementation plans. Services that were added or expanded in coverage include: AWS Elastic Disaster Recovery, AWS Trusted Advisor, AWS Resilience Hub, AWS Config, AWS Security Hub, Amazon GuardDuty, AWS Organizations, AWS Control Tower, AWS Compute Optimizer, AWS Budgets, Amazon CodeWhisperer, and Amazon CodeGuru.

Pillar updates

The Operational Excellence Pillar has a new best practice on enabling support plans for production workloads. This Pillar also has a major update on defining a customer communication plan for outages.

In the Security Pillar, we added a new best practice area, Application Security (AppSec). AppSec is complete with eight new best practices to guide customers as they develop, test, and release software, providing guidance on how to consider the tools, testing, and organizational approach used to develop software.

The Reliability Pillar has a new best practice on architecting workloads to meet availability targets and uptime service-level agreements (SLAs). We also added the resilience shared responsibility model to its introduction section.

The Cost Optimization Pillar has new best practices on automating operations as a part of cost-optimization efforts and enforcing data-retention policies.

In the Sustainability Pillar, we introduced a clear process for selecting Regions, as well as tools for right-sizing services and improving the overall utilization of resources in the AWS Cloud.

Best practice updates

The implementation guidance and best practices have been updated in this release to be more prescriptive, including enhanced recommendations and steps on reusable architecture patterns targeting specific business outcomes in the AWS Cloud.

As many as 113 best practices are updated with more prescriptive guidance in Operational Excellence (22), Security (18), Reliability (14), Performance Efficiency (10), Cost Optimization (22), and Sustainability (27). Fourteen new best practices have been introduced in Operational Excellence (1), Security (9), Reliability (1), Cost Optimization (2), and Sustainability (1).

From a total of 127 new/updated best practices, 78% include explicit implementation steps as part of making them more prescriptive. The remaining 22% have been updated by improving their existing implementation steps. These changes are in addition to the 51 improved best practices released in 2022 (18 in Q3 2022, and 33 in Q4 2022), resulting in more than 50% of the existing Framework best practices having been updated recently.

The content is available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.

Here is the list of best practices that are new or updated in this release:

Operational Excellence: OPS01-BP03, OPS01-BP04, OPS02-BP01, OPS02-BP06, OPS02-BP07, OPS03-BP04, OPS03-BP05, OPS04-BP01, OPS04-BP03, OPS04-BP04, OPS04-BP05, OPS05-BP02, OPS05-BP06, OPS05-BP07, OPS07-BP01, OPS07-BP05, OPS07-BP06, OPS08-BP02, OPS08-BP03, OPS08-BP04, OPS10-BP05, OPS11-BP01, OPS11-BP04
Security: SEC01-BP01, SEC01-BP02, SEC01-BP07, SEC02-BP01, SEC02-BP02, SEC02-BP03, SEC02-BP05, SEC03-BP02, SEC03-BP04, SEC03-BP07, SEC03-BP09, SEC04-BP01, SEC05-BP01, SEC06-BP01, SEC07-BP01, SEC08-BP04, SEC08-BP02, SEC09-BP02, SEC03-BP08, SEC11-BP01, SEC11-BP02, SEC11-BP03, SEC11-BP04, SEC11-BP05, SEC11-BP06, SEC11-BP07, SEC11-BP08
Reliability: REL01-BP01, REL01-BP02, REL01-BP03, REL01-BP04, REL01-BP06, REL02-BP01, REL09-BP01, REL09-BP02, REL09-BP03, REL09-BP04, REL10_BP04, REL10-BP03, REL11-BP07, REL13-BP02, REL13-BP03
Performance Efficiency: PERF02-BP06, PERF05_BP03, PERF05-BP02, PERF05-BP04, PERF05-BP05, PERF05-BP06, PERF05-BP07, PFRF04-BP04, PERF02_BP04, PERF02_BP05
Cost Optimization: COST02_BP01, COST02_BP02, COST02_BP03, COST02_BP05, COST03_BP02, COST03_BP04, COST03_BP05, COST04_BP01, COST04_BP02, COST04_BP03, COST04_BP04, COST04_BP05, COST05_BP03, COST05_BP05, COST05_BP06, COST06_BP01, COST06_BP03, COST07_BP01, COST07_BP02, COST07_BP05, COST09_BP03, COST10_BP01, COST10_BP02, COST11_BP01
Sustainability: SUS01_BP01, SUS02_BP01, SUS02_BP02, SUS02_BP03, SUS02_BP04, SUS02_BP05, SUS02_BP06, SUS03_BP01, SUS03_BP02, SUS03_BP03, SUS03_BP04, SUS03_BP05, SUS04_BP01, SUS04_BP02, SUS04_BP03, SUS04_BP04, SUS04_BP05, SUS04_BP06, SUS04_BP07, SUS04_BP08, SUS05_BP01, SUS05_BP02, SUS05_BP03, SUS05_BP04, SUS06_BP01, SUS06_BP02, SUS06_BP03, SUS06_BP04

Ready to get started? Review the updated AWS Well-Architected Framework Pillar best practices, as well as pillar-specific whitepapers.

Have questions about some of the new best practices or most recent updates? Join our growing community on AWS re:Post.

AWS Week in Review: Public Preview of Amazon DataZone and AWS DataSync Updates – April 3, 2023

2023-04-03 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-public-preview-of-amazon-datazone-and-aws-datasync-updates-april-3-2023/

Last weekend, I enjoyed the spring vibes at Seoul Forest, a large park in the middle of Seoul city, where cherry blossoms are in full bloom.

Compared to last year, there were crowds of people, so I realized that it was really back to normal after the pandemic. I hope you all enjoy the season of spring or fall with your family.

Last Week’s Launches
Like an April Fool’s Day joke, there were 65 launches last week, far more than usual. AWS product teams are working hard with a customer obsession.

So, I had a lot of trouble choosing the important ones. Other than the ones I’ve picked out, there may be important feature releases that fit your needs. Be sure to take a look at the full launches list in the last week.

First, here is a list of the general availability of AWS services and features treated by AWS News Blog:

Let’s take a look at some launches from the last week that I want to remind you of:

The Preview of Amazon DataZone – At AWS re:Invent 2022, we preannounced Amazon DataZone, a new data management service to catalog, discover, analyze, share, and govern data between data producers and consumers in the organization. You can now try out the public preview of Amazon DataZone.

Data producers populate the business data catalog from AWS Glue Data Catalog and Amazon Redshift tables. Data consumers search for and subscribe to data assets in the data catalog and analyze with tools such as Amazon Athena query editors in the Amazon DataZone portal. To get started with Amazon DataZone, see our Quick Start Guide to include sample datasets to implement a complete use case.

AWS DataSync Supports Azure Blob Storage in Preview – AWS DataSync supports copying your object data at scale from Azure Blob Storage to AWS storage services such as Amazon S3. AWS DataSync supports all blob types within Azure Blob Storage and can also be used with Azure Data Lake Storage (ADLS) Gen 2.

In addition to Azure Blob Storage, DataSync supports Google Cloud Storage and Azure Files storage locations as well as various general storage systems and AWS storage services. To learn more, see Migrating Azure Blob Storage to Amazon S3 using AWS DataSync in the AWS Storage Blog.

On-call schedules with AWS Systems Manager Incident Manager – You can now configure or change on-call rotation schedules with a group of contacts and have 24/7 coverage and responsiveness for critical issues in the Incident Manager console.

AWS Incident Manager helps you bring the right people and information together when a critical issue is detected, activating preconfigured response plans to engage responders using SMS, phone calls, and chat channels, as well as to run AWS Systems Manager Automation runbooks. To learn how to get started with an-call schedules in Incident Manager, see our Working with on-call schedules in Incident Manager in the AWS documentation.

AWS CloudShell Colsone Toolbar – You can now use AWS Cloudshell Console Toolbar with AWS Management Console in a single view. The Console Toolbar maintains its state (e.g., open, closed) and commands will continue to run in CloudShell as you navigate between services in the Console. For example, it allows you to run a command in CloudShell and view a CloudWatch alarm in the Console at the same time.

After signing into the Console, you can access CloudShell in the lower left of the Console by selecting the CloudShell icon in the Console Toolbar.

New Features of AWS Well-Architected Tool – The Consolidated Report and Enhanced Search enable customers to quickly identify risk themes across their workloads and scale improvements across their organization. This macro-level view helps executive stakeholders understand where common issues lie and prioritize team resources to drive widespread improvement. To learn more, see AWS Well-Architected Tool Dashboard in the AWS documentation.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting from the last week:

Welcome to the .NET on AWS Blog – We launched a new blog channel for millions of .NET developers across the world. Blog posts will also cover built-for-the-cloud development, modernizing .NET Framework applications, and how to deploy .NET workloads on different AWS services. We will use this channel to share news on the work we’ve done with the .NET open-source community, post follow-ups from important events, and post announcements about upcoming presentations from our .NET developer advocates. To learn more, visit our .NET on AWS website and follow us on Twitter at @dotnetonAWS.

AWS Knowledge Center in AWS re:Post – You can now access trusted, authoritative articles and videos of AWS Knowledge Center on AWS re:Post to get answers to technical questions. Knowledge Center content is produced by an AWS team and covers the most frequent questions and requests from AWS customers. These articles are available in 10 localized languages: English, French, German, Italian, Japanese, Korean, Portuguese, Simplified Chinese, Spanish, and Traditional Chinese.

TF1’s FIFA Worldcup Digital Broadcasting Story – Sébastien shared an awesome story about how the French broadcaster TF1 use AWS Cloud technology and expertise to bring the FIFA World Cup to millions of people. He shared the history of redesigning its digital broadcasting architecture on AWS, testing the new platform on large-scale sporting events. For the preparation of the FIFA Worldcup event, TF1 enhanced monitoring to detect anomalies during the event and established the backup plan in a “war room” for the worst scenario. Even if you’re not a fan of football, I recommend reading the behind-the-scenes of the FIFA Worldcup Finals. It’s long but really fun!

Upcoming AWS Events
Check your calendars and sign up for these AWS-led events:

AWS re:Inforce 2023 – Now register AWS re:Inforce, in Anaheim, California, June 13–14. AWS Chief Information Security Officer CJ Moses will share the latest innovations in cloud security and what AWS Security is focused on. The breakout sessions will provide real-world examples of how security is embedded into the way businesses operate. To learn more and get the limited discount code to register, see CJ’s blog post of Gain insights and knowledge at AWS re:Inforce 2023 in the AWS Security Blog.

AWS Global Summits – Check your calendars and sign up for the AWS Summit closest to your city: Paris and Sydney (April 4), Seoul (May 3-4), Berlin and Singapore (May 4), Stockholm (May 11), Hong Kong (May 23), Amsterdam (June 1), London (June 7), Madrid (June 15), and Milano (June 22).

AWS Community Day – Join community-led conferences driven by AWS user group leaders closest to your city: Peru (April 15), Helsinki (April 20), Chicago (June 15), Philippines (June 29–30), and Munich (September 14). Recently, we are bringing together AWS user groups from around the world into Meetup Pro accounts. Find your group and its meetups in your city!

You can browse all upcoming AWS-led in-person and virtual events, and developer-focused events such as AWS DevDay.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Let’s Architect! Streamlining business with migration and modernization

2023-03-29 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-streamlining-business-with-migration-and-modernization/

Many customers migrate their systems to Amazon Web Services (AWS) to increase their competitive edge and drive business value. To maximize the benefits of a cloud migration, companies tend to move their applications in conjunction with modernization initiatives. These joined efforts help your applications gain more agility, scalability, and resilience. Modernizing the portfolio of workloads with AWS means that you can re-platform, refactor, or replace these workloads by using containers, serverless technologies, purpose-built data stores, and software automation. These functionalities allow you to benefit from the best of the AWS agility and total cost optimization (TCO) benefits.

In this edition of Let’s Architect! we share hands-on activities, customer stories, and tips and tricks to migrate and modernize your applications with AWS.

Migrating to the cloud: What is the cost of doing nothing?

Would you think that small companies always migrate faster than large enterprises? Actually, cloud migration speed doesn’t necessarily depend on the size of the business! Company size is not a clear indicator of migration and modernization success, but a shift of culture and mindset is essential for successful company evolution.

When it comes to migration, the cost of doing nothing is not just financial: Businesses can also expect a slower pace of innovation and a higher security burden. This video analyzes the financial benefits of migration and shares mental models for approaching an AWS cloud migration, and Marriott team members explain how they planned their migration and the lessons learned along the way.

Take me to this re:Invent 2022 video!

Benefits of an early migration start

Modernization pathways for a legacy .NET Framework monolithic application on AWS

Organizations aim to deliver the best technological solutions based on customer needs. At any stage in their cloud adoption journey, businesses often end up managing and building monolithic applications. Let’s explore a migration path for a monolithic .NET Framework application to a modern microservices-based stack on AWS, and discuss AWS tools to break the monolith into microservices and containerize applications.

Cost optimization is another key factor for modernizing your workloads and solutions include moving to Linux-based systems or using open-source database engines. This Migrate and Modernize enterprise workloads with AWS video walks you through the process of migrating and modernizing enterprise workloads with AWS.

Take me to this blog post with more detail!

A modernized microservices-based rearchitecture

Implementing a serverless-first strategy in an enterprise

Organizations of all sizes want to benefit from the agility, cost savings, and developer experience that serverless architectures can provide on AWS. For large enterprises, the return on investment (ROI) can be massive, but overcoming architecture inertia while ensuring security best practices and governance stay in place is a hurdle that many struggle with. In this lightning talk, learn how your organization can implement a serverless-first strategy to overcome these obstacles. Delta Air Lines shares the story of making serverless-first a reality as part of their AWS journey.

Take me to this video

Benefits of serverless

Application Migration with AWS

This workshop shows you how to migrate and modernize a fictional application to the AWS Cloud by:

Performing a database migration
Migrating and modernizing your web server using different migration strategies (for example, breaking down the monolith into containers)
Teaching you how to improve Operation excellence, Security, Performance efficiency, and Cost optimization of the deployed architecture by following these pillars of the AWS Well-Architected Framework.

Take me to this workshop!

Different migration strategies for web servers

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about distributed systems with containers.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Let’s Architect! Architecting a data mesh

2023-03-08 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecting-a-data-mesh/

Data architectures were mainly designed around technologies rather than business domains in the past. This changed in 2019, when Zhamak Dehghani introduced the data mesh. Data mesh is an application of the Domain-Driven-Design (DDD) principles to data architectures: Data is organized into data domains and the data is the product that the team owns and offers for consumption.

A data mesh architecture unites the disparate data sources within an organization through centrally managed data-sharing and governance guidelines. Business functions can maintain control over how shared data is accessed because data mesh also solves advanced data security challenges through distributed, decentralized ownership.

This edition of Let’s Architect! introduces data mesh, highlights the foundational concepts of data architectures, and covers the patterns for designing a data mesh in the AWS cloud with supporting resources.

Data lakes, lake houses and data mesh: what, why, and how?

Let’s explore a video introduction to data lakes, lake houses, and data mesh. This resource explains how to leverage those concepts to gain greater data insights across different business segments, with a special focus on best practices to build a well-architected, modern data architecture on AWS. It also gives an overview of the AWS cloud services that can be used to create such architectures and describes the fundamental pillars of designing them.

Take me to this intro to data lakes, lake houses, and data mesh video!

Data mesh is an architecture pattern where data are organized into domains and seen as products to expose for consumption

Building data mesh architectures on AWS

Knowing what a data mesh architecture is, here is a step-by-step video from re:Invent 2022 on designing one. It covers a use case on how GoDaddy considered and implemented data mesh, in addition to:

The fundamental pillars behind a well-architected data mesh in the cloud
Finding an approach to build a data mesh architecture using native AWS services
Reasons for considering a data mesh architecture where data lakes provide limitations in some scenarios
How data mesh can be applied in practice to overcome them
The mental models to apply during the data mesh design process

Take me to this re:Invent 2022 video!

In the data mesh architecture the producers expose their data for consumption to the consumers. Access is regulated through a centralized governance layer.

Amazon DataZone: Democratize data with governance

Now let’s explore data accessibility as it relates to data mesh architectures.

Amazon DataZone is a new AWS business data catalog allowing you to unlock data across organizational boundaries with built-in governance. This service provides a unified environment where everyone in an organization—from data producers to data consumers—can access, share, and consume data in a governed manner.

Here is a video to learn how to apply AWS analytics services to discover, access, and share data across organizational boundaries within the context of a data mesh architecture.

Take me to this re:Invent 2022 video!

Amazon DataZone accelerates the adoption of the data mesh pattern by making it scalable to high number of producers and consumers.

Build a data mesh on AWS

Feeling inspired to build? Hands-on experience is a great way to learn and see how the theoretical concepts apply in practice.

This workshop teaches you a data mesh architecture building approach on AWS. Many organizations are interested in implementing this architecture to:

Move away from centralized data lakes to decentralized ownership
Deliver analytics solutions across business units

Learn how a data mesh architecture can be implemented with AWS native services.

Take me to this workshop!

The diagrams shows how to separate the producers, consumers and governance components through a multi-account strategy.

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about monitoring and observability.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Introducing AWS Lambda Powertools for .NET

2023-02-28 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-aws-lambda-powertools-for-net/

This blog post is written by Amir Khairalomoum, Senior Solutions Architect.

Modern applications are built with modular architectural patterns, serverless operational models, and agile developer processes. They allow you to innovate faster, reduce risk, accelerate time to market, and decrease your total cost of ownership (TCO). A microservices architecture comprises many distributed parts that can introduce complexity to application observability. Modern observability must respond to this complexity, the increased frequency of software deployments, and the short-lived nature of AWS Lambda execution environments.

The Serverless Applications Lens for the AWS Well-Architected Framework focuses on how to design, deploy, and architect your serverless application workloads in the AWS Cloud. AWS Lambda Powertools for .NET translates some of the best practices defined in the serverless lens into a suite of utilities. You can use these in your application to apply structured logging, distributed tracing, and monitoring of metrics.

Following the community’s continued adoption of AWS Lambda Powertools for Python, Java, and TypeScript, AWS Lambda Powertools for .NET is now generally available.

This post shows how to use the new open source Powertools library to implement observability best practices with minimal coding. It walks through getting started, with the provided examples available in the Powertools GitHub repository.

About Powertools

Powertools for .NET is a suite of utilities that helps with implementing observability best practices without needing to write additional custom code. It currently supports Lambda functions written in C#, with support for runtime versions .NET 6 and newer. Powertools provides three core utilities:

Tracing provides a simpler way to send traces from functions to AWS X-Ray. It provides visibility into function calls, interactions with other AWS services, or external HTTP requests. You can add attributes to traces to allow filtering based on key information. For example, when using the Tracing attribute, it creates a ColdStart annotation. You can easily group and analyze traces to understand the initialization process.
Logging provides a custom logger that outputs structured JSON. It allows you to pass in strings or more complex objects, and takes care of serializing the log output. The logger handles common use cases, such as logging the Lambda event payload, and capturing cold start information. This includes appending custom keys to the logger.
Metrics simplifies collecting custom metrics from your application, without the need to make synchronous requests to external systems. This functionality allows capturing metrics asynchronously using Amazon CloudWatch Embedded Metric Format (EMF) which reduces latency and cost. This provides convenient functionality for common cases, such as validating metrics against CloudWatch EMF specification and tracking cold starts.

Getting started

The following steps explain how to use Powertools to implement structured logging, add custom metrics, and enable tracing with AWS X-Ray. The example application consists of an Amazon API Gateway endpoint, a Lambda function, and an Amazon DynamoDB table. It uses the AWS Serverless Application Model (AWS SAM) to manage the deployment.

When you send a GET request to the API Gateway endpoint, the Lambda function is invoked. This function calls a location API to find the IP address, stores it in the DynamoDB table, and returns it with a greeting message to the client.

Example application

The AWS Lambda Powertools for .NET utilities are available as NuGet packages. Each core utility has a separate NuGet package. It allows you to add only the packages you need. This helps to make the Lambda package size smaller, which can improve the performance.

To implement each of these core utilities in a separate example, use the Globals sections of the AWS SAM template to configure Powertools environment variables and enable active tracing for all Lambda functions and Amazon API Gateway stages.

Sometimes resources that you declare in an AWS SAM template have common configurations. Instead of duplicating this information in every resource, you can declare them once in the Globals section and let your resources inherit them.

Logging

The following steps explain how to implement structured logging in an application. The logging example shows you how to use the logging feature.

To add the Powertools logging library to your project, install the packages from NuGet gallery, from Visual Studio editor, or by using following .NET CLI command:

dotnet add package AWS.Lambda.Powertools.Logging

Use environment variables in the Globals sections of the AWS SAM template to configure the logging library:

  Globals:
    Function:
      Environment:
        Variables:
          POWERTOOLS_SERVICE_NAME: powertools-dotnet-logging-sample
          POWERTOOLS_LOG_LEVEL: Debug
          POWERTOOLS_LOGGER_CASE: SnakeCase

Decorate the Lambda function handler method with the Logging attribute in the code. This enables the utility and allows you to use the Logger functionality to output structured logs by passing messages as a string. For example:

[Logging]
public async Task<APIGatewayProxyResponse> FunctionHandler
         (APIGatewayProxyRequest apigProxyEvent, ILambdaContext context)
{
  ...
  Logger.LogInformation("Getting ip address from external service");
  var location = await GetCallingIp();
  ...
}

Lambda sends the output to Amazon CloudWatch Logs as a JSON-formatted line.

{
  "cold_start": true,
  "xray_trace_id": "1-621b9125-0a3b544c0244dae940ab3405",
  "function_name": "powertools-dotnet-tracing-sampl-HelloWorldFunction-v0F2GJwy5r1V",
  "function_version": "$LATEST",
  "function_memory_size": 256,
  "function_arn": "arn:aws:lambda:eu-west-2:286043031651:function:powertools-dotnet-tracing-sample-HelloWorldFunction-v0F2GJwy5r1V",
  "function_request_id": "3ad9140b-b156-406e-b314-5ac414fecde1",
  "timestamp": "2022-02-27T14:56:39.2737371Z",
  "level": "Information",
  "service": "powertools-dotnet-sample",
  "name": "AWS.Lambda.Powertools.Logging.Logger",
  "message": "Getting ip address from external service"
}

Another common use case, especially when developing new Lambda functions, is to print a log of the event received by the handler. You can achieve this by enabling LogEvent on the Logging attribute. This is disabled by default to prevent potentially leaking sensitive event data into logs.

[Logging(LogEvent = true)]
public async Task<APIGatewayProxyResponse> FunctionHandler
         (APIGatewayProxyRequest apigProxyEvent, ILambdaContext context)
{
  ...
}

With logs available as structured JSON, you can perform searches on this structured data using CloudWatch Logs Insights. To search for all logs that were output during a Lambda cold start, and display the key fields in the output, run following query:

fields coldStart='true'
| fields @timestamp, function_name, function_version, xray_trace_id
| sort @timestamp desc
| limit 20

CloudWatch Logs Insights query for cold starts

Tracing

Using the Tracing attribute, you can instruct the library to send traces and metadata from the Lambda function invocation to AWS X-Ray using the AWS X-Ray SDK for .NET. The tracing example shows you how to use the tracing feature.

When your application makes calls to AWS services, the SDK tracks downstream calls in subsegments. AWS services that support tracing, and resources that you access within those services, appear as downstream nodes on the service map in the X-Ray console.

You can instrument all of your AWS SDK for .NET clients by calling RegisterXRayForAllServices before you create them.

public class Function
{
  private static IDynamoDBContext _dynamoDbContext;
  public Function()
  {
    AWSSDKHandler.RegisterXRayForAllServices();
    ...
  }
  ...
}

To add the Powertools tracing library to your project, install the packages from NuGet gallery, from Visual Studio editor, or by using following .NET CLI command:

dotnet add package AWS.Lambda.Powertools.Tracing

Use environment variables in the Globals sections of the AWS SAM template to configure the tracing library.

  Globals:
    Function:
      Tracing: Active
      Environment:
        Variables:
          POWERTOOLS_SERVICE_NAME: powertools-dotnet-tracing-sample
          POWERTOOLS_TRACER_CAPTURE_RESPONSE: true
          POWERTOOLS_TRACER_CAPTURE_ERROR: true

Decorate the Lambda function handler method with the Tracing attribute to enable the utility. To provide more granular details for your traces, you can use the same attribute to capture the invocation of other functions outside of the handler. For example:

[Tracing]
public async Task<APIGatewayProxyResponse> FunctionHandler
         (APIGatewayProxyRequest apigProxyEvent, ILambdaContext context)
{
  ...
  var location = await GetCallingIp().ConfigureAwait(false);
  ...
}

[Tracing(SegmentName = "Location service")]
private static async Task<string?> GetCallingIp()
{
  ...
}

Once traffic is flowing, you see a generated service map in the AWS X-Ray console. Decorating the Lambda function handler method, or any other method in the chain with the Tracing attribute, provides an overview of all the traffic flowing through the application.

AWS X-Ray trace service view

You can also view the individual traces that are generated, along with a waterfall view of the segments and subsegments that comprise your trace. This data can help you pinpoint the root cause of slow operations or errors within your application.

AWS X-Ray waterfall trace view

You can also filter traces by annotation and create custom service maps with AWS X-Ray Trace groups. In this example, use the filter expression annotation.ColdStart = true to filter traces based on the ColdStart annotation. The Tracing attribute adds these automatically when used within the handler method.

View trace attributes

Metrics

CloudWatch offers a number of included metrics to help answer general questions about the application’s throughput, error rate, and resource utilization. However, to understand the behavior of the application better, you should also add custom metrics relevant to your workload.

The metrics utility creates custom metrics asynchronously by logging metrics to standard output using the Amazon CloudWatch Embedded Metric Format (EMF).

In the sample application, you want to understand how often your service is calling the location API to identify the IP addresses. The metrics example shows you how to use the metrics feature.

To add the Powertools metrics library to your project, install the packages from the NuGet gallery, from the Visual Studio editor, or by using the following .NET CLI command:

dotnet add package AWS.Lambda.Powertools.Metrics

Use environment variables in the Globals sections of the AWS SAM template to configure the metrics library:

  Globals:
    Function:
      Environment:
        Variables:
          POWERTOOLS_SERVICE_NAME: powertools-dotnet-metrics-sample
          POWERTOOLS_METRICS_NAMESPACE: AWSLambdaPowertools

To create custom metrics, decorate the Lambda function with the Metrics attribute. This ensures that all metrics are properly serialized and flushed to logs when the function finishes its invocation.

You can then emit custom metrics by calling AddMetric or push a single metric with a custom namespace, service and dimensions by calling PushSingleMetric. You can also enable the CaptureColdStart on the attribute to automatically create a cold start metric.

[Metrics(CaptureColdStart = true)]
public async Task<APIGatewayProxyResponse> FunctionHandler
         (APIGatewayProxyRequest apigProxyEvent, ILambdaContext context)
{
  ...
  // Add Metric to capture the amount of time
  Metrics.PushSingleMetric(
        metricName: "CallingIP",
        value: 1,
        unit: MetricUnit.Count,
        service: "lambda-powertools-metrics-example",
        defaultDimensions: new Dictionary<string, string>
        {
            { "Metric Type", "Single" }
        });
  ...
}

Conclusion

CloudWatch and AWS X-Ray offer functionality that provides comprehensive observability for your applications. Lambda Powertools .NET is now available in preview. The library helps implement observability when running Lambda functions based on .NET 6 while reducing the amount of custom code.

It simplifies implementing the observability best practices defined in the Serverless Applications Lens for the AWS Well-Architected Framework for a serverless application and allows you to focus more time on the business logic.

You can find the full documentation and the source code for Powertools in GitHub. We welcome contributions via pull request, and encourage you to create an issue if you have any feedback for the project. Happy building with AWS Lambda Powertools for .NET.

For more serverless learning resources, visit Serverless Land.

Let’s Architect! Architecture tools

2023-02-22 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecture-tools/

Tools, such as diagramming software, low-code applications, and frameworks, make it possible to experiment quickly. They are essential in today’s fast-paced and technology-driven world. From improving efficiency and accuracy, to enhancing collaboration and creativity, a well-defined set of tools can make a significant impact on the quality and success of a project in the area of software architecture.

As an architect, you can take advantage of a wide range of resources to help you build solutions that meet the needs of your organization. For example, with tools in the likes of the Amazon Web Services (AWS) Solutions Library and Serverless Land, you can boost your knowledge and productivity while working on event-driven architectures, microservices, and stateless computing.

In this Let’s Architect! edition, we explore how to incorporate these patterns into your architecture, and which tools to leverage to build solutions that are scalable, secure, and cost-effective.

How AWS Application Composer helps your team build great apps

In this re:Invent 2022 session, Chase Douglas, Principal Engineer at AWS, speaks about AWS Application Composer, a newly launched service.

This service has the potential to change the way architects design solutions—without writing a single line of code! The service is user-friendly, intuitive, and requires no prior coding experience. It allows users to scaffold a serverless architecture, defining a CloudFormation template visually with drag-and-drop. A detailed AWS Compute Blog post takes readers through the process of using AWS Application Composer.

Take me to this re:Invent 2022 video!

How an architecture can be designed with AWS Application Composer

AWS design + build tools

When migrating to the cloud, we suggest referencing these four tried-and-true AWS resources that can be used to design and build projects.

AWS Workshops are created by AWS teams to provide opportunities for hands-on learning to develop practical skills. Workshops are available in multiple categories and for skill levels 100-400.
AWS Architecture Center contains a collection of best practices and architectural patterns for designing and deploying cloud-based solutions using AWS services. Furthermore, it includes detailed architecture diagrams, whitepapers, case studies, and other resources that provide a wealth of information on how to design and implement cloud solutions.
Serverless Land (an Amazon property) brings together various patterns, workflows, code snippets, and blog posts pertaining to AWS serverless architectures.
AWS Solutions Library provides customers with templates, tools, and automated workflows to easily deploy, operate, and manage common use cases on the AWS Cloud.

Inside event-driven architectures designed by David Boyne on Serverless Land

The Well-Architected way

In this session, the AWS Well-Architected provides guidance on how to implement the architectural models reported in the AWS Well-Architected Framework within your organization at scale.

Discover a customer story and understand how to use the features of the AWS Well-Architected Tool and APIs to receive recommendations based on your workload and measure your architectural metrics. In the Framework whitepaper, you can explore the six pillars of Well-Architected (operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability) and best practices to achieve them.

Understanding the key design pillars can help architects make informed design decisions, leading to more robust and efficient solutions. This knowledge also enables architects to identify potential problems early on in the design process and find appropriate patterns to address those issues.

Take me to the Well-Architected video!

Discover how the AWS Well-Architected Framework can help you design scalable, maintainable, and reusable solutions

See you next time!

Thanks for exploring architecture tools and resources with us!

Join us next time when we’ll talk about data mesh architecture!

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Accelerating Well-Architected Framework reviews using integrated AWS Trusted Advisor insights

2022-11-09 Stephen Salim

Post Syndicated from Stephen Salim original https://aws.amazon.com/blogs/architecture/accelerating-well-architected-framework-reviews-using-integrated-aws-trusted-advisor-insights/

In this blog, we will explain how the new AWS Well-Architected integration with AWS Trusted Advisor can give you insights that help you create a flywheel effect to accelerate your cloud optimization. Customers that have the most success in their cloud adoption recognize that optimizing their cloud architecture and operations is not a one-time effort. Optimization is a continuous improvement virtuous cycle based on learning architectural and operational best practices, measuring workloads against these best practices, and implementing improvements based on opportunities recognized from measurement.

Customers can use the AWS Well-Architected Framework to build a “learn, measure, and improve” continuous improvement virtuous cycle (Figure 1). With the AWS Well-Architected Tool, customers can measure their workloads against these AWS best practices to identify improvement opportunities or risks they should address. After customers complete Well-Architected Framework Reviews (WAFRs) they can generate improvement plans with prioritized guidance and resources for improvement. They can also track the improvements made over time using the milestones feature in the Well-Architected Tool.

Figure 1. Continuous optimization of workloads based on AWS best practices

Amazon uses the term flywheel to describe a virtuous cycle that has additional drivers to add momentum, which accelerates the cycle and the value it delivers. Figure 2 is the often-referenced Amazon retail flywheel, which shows how Amazon’s focus on customer experience drives growth. It is accelerated by creating a lower cost structure, which allows Amazon to pass lower prices to its customers, improving customer experience and driving faster growth.

Figure 2. The Amazon Flywheel concept of scaling growth

Customers can add momentum to an AWS Well-Architected “learn, measure, and improve” virtuous cycle using tools that give more insights while measuring workloads. Improved insights result in consistent measurements, that are more efficient and more accurate. This accelerates the optimization cycle by reducing the time required to measure workloads. Collecting information on AWS resources using Trusted Advisor checks allows customers to validate if a workload’s state is aligned with AWS best practices. The new AWS Well-Architected Tool integration with AWS Trusted Advisor makes it easier and faster to gain insights during WAFRs. The Trusted Advisor checks that are relevant to a specific set of best practices have been mapped to the corresponding questions in Well-Architected. The new feature now shows the mapped Trusted Advisor checks directly in the Well-Architected Tool. These insights help customers run WAFRs in less time, with more accuracy, creating a flywheel effect (Figure 3).

Figure 3. Insights from AWS Trusted Advisor create acceleration in achieving improved outcomes

AWS Well-Architected Tool integration with AWS Trusted Advisor: feature example

In the following sections, we detail an example scenario on how to use the integration with Trusted Advisor to gain insights when measuring your workloads.

Enabling the AWS Well-Architected Tool integration with AWS Trusted Advisor

How to enable the new feature in your workload:

Create a new workload in the AWS Well-Architected Console. Refer to the user guide for detailed instructions.

Optional: When defining a workload, within the “Application” section of workload definition, you can now also specify the AWS Service Catalog AppRegistry AWS Resource Name (ARN). This field is to indicate a relationship between the AWS Well-Architected Tool workload and the AWS resources in an AppRegistry Application when performing a Well-Architected Framework Review (Figure 4).

Figure 4. Application field to select AWS Service Catalog AppRegistry ARN

This is another new AWS Well-Architected Tool feature that launched along with the integration with Trusted Advisor feature. You can find out more details about the integration with AWS Service Catalog AppRegistry in the What’s New post and on the feature documentation page. For details on how to create an AWS Service Catalog AppRegistry Application refer to Creating applications.
To enable the integration with Trusted Advisor, after the necessary workload information has been entered, within the “AWS Trusted Advisor” section, tick on “Activate Trusted Advisor” (Figure 5).

Figure 5. Enabling the AWS Trusted Advisor feature

Optional: Once the workload is created, note the workload ARN. You can find the workload ARN in the Properties section of the workload resource you created (Figure 6). For steps on how to identify your workload, refer to Well-Architected Tool User Guide on viewing a workload.

Figure 6. AWS Well-Architected Tool showing workload ARN
To collect Trusted Advisor checks from accounts other than the account where the workload you are reviewing exists, you must perform two steps. You need to ensure the account IDs are listed in the workload properties for the workload you are reviewing. You must then create an IAM role in the account from which Trusted Advisor checks will be collected with the following permission and trust relationship (Figures 7 and 8). For more information on how to setup this permission, refer to the feature documentation.

Figure 7. Permissions needed by AWS Well-Architected Tool to interrogate AWS Trusted Advisor

Figure 8. The trust relationship allowing AWS Well-Architected Tool to assume policy on behalf of the workload

Using integration with AWS Trusted Advisor for insights during reviews

Once the feature is enabled, additional insights will be noticeable about the resources in your workload using Trusted Advisor checks. Let’s explore an example question. In this case, we will use Question 9 from the Reliability Pillar, as there are Trusted Advisor checks related to the best practices in it: How do you back up data?

AWS Well-Architected Reliability Question 9 includes best practices that are related to how workload backup is performed to support the ability for the workload to recover from failure. Current findings using Trusted Advisor checks indicates the workload may not be configured based on the “Perform data backup automatically” best practice in the Reliability Pillar (Figure 9).

Figure 9. “Perform data backup automatically” best practices
To access Trusted Advisor checks as insights, you can select a question in the Well-Architected Tool (Figure 10). If there are related Trusted Advisor checks available for a question, there will be a “View checks” button like the screenshot below. You can also select the “Trusted Advisor checks” tab.

Figure 10. AWS Trusted Advisor checks that map to best practices
Trusted Advisor checks are available, which provide insights related to the best practice in the question. You will also notice the state of resources recommendations and the count of resources. Trusted Advisor checks that relate to the best practice “Perform data backup automatically” are displayed. One of the Trusted Advisor checks identified with a x in a circle (denoting “Action recommended”) status is on the Amazon Elastic Block Storage (Amazon EBS) snapshots availability to recover your EBS volume from in the event of disaster (Figure 11).

Figure 11. AWS Trusted Advisor check for Amazon EBS snapshots with “Action recommended”
Exploring the Trusted Advisor Console, you can identify the EBS volume ID that has been detected with no snapshot in this us-west-2 region (Figure 12).

Figure 12. An EBS volume that does not have snapshots
With the insights from Trusted Advisor, we can quickly determine that the “Perform data backup automatically” best practice is not in place, as we do not have Amazon EBS snapshots enabled. Through the “helpful resources” section, instructions can be found to help automate the snapshot creation of Amazon EBS volume (Figure 13). One method to achieve this is to use AWS Backup.

Figure 13. Resources with details about best practices, including links to learn more
Using AWS Backup you can define a backup plan to automate snapshots creation of the EBS volume. Using this plan, you adjust the frequency of the backup to help achieve your recovery time objective and recovery point objective (Figure 14). For more information on how to configure EBS volume backup plan, refer to the Developer Guide on creating a backup plan.

Figure 14. Setup automatic Amazon EBS volume snapshots
Once this improvement is implemented and the related EBS volume snapshot is taken, Trusted Advisor will reflect the changes to the resource (Figure 15).

Figure 15. Amazon EBS volume with a snapshot
The next time we perform a Well-Architected Framework Review on this workload, the related AWS Trusted Advisor Check will show no action required with a check-mark status (Figure 16).

Figure 16. AWS Trusted Advisor checks that represent improvements that have been implemented

Optional: For access to the list of Trusted Advisor checks in .csv format, you can click on the “Download check details” button on each question to download the resources that were checked in relation to the specified best practices (Figure 17).

Figure 17. “Download check details” button
Once implemented, this improvement ensures a means to recover the EBS volume data in the event of disaster. This makes the resources in the workload better aligned to the AWS Reliability Pillar Design principle of “Automatically recover from failure”. To reflect this alignment in the Well-Architected Tool, you can tick on the best practice check items under the related questions (Figure 18).

Figure 18. A milestone with updated best practices based on improvements that have been implemented
Finally, you can create a milestone to capture a point in time state of your workload WAFR. As you continuously optimize with more WAFRs and improvements, the number of high- and medium-risk items identified within each review will decrease. You will notice the continuous optimization of your workload over time, as in Figure 19.

Figure 19. The history of improvements being made over time

Conclusion

Using the AWS Well-Architected integration with AWS Trusted Advisor, customers have a mechanism to accelerate the “learn, measure, and improve” Well-Architected virtuous cycle, creating an optimization flywheel. We have demonstrated the value of creating acceleration through the insights from Trusted Advisor checks. You now know how to enable the integration with Trusted Advisor and have seen an example of how the insights can accelerate your review cycle. You will notice the improvements you make over time will reflect in the Trusted Advisor checks as you review the milestones for your workloads. Enable this feature on your next Well-Architected Framework Review (WAFR) to measure the impact that data-driven insights from Trusted Advisor can have on reducing the time-to-value for your reviews. For more information consider these additional resources. You can contact your account team for support in running WAFRs or check out the AWS Well-Architected Partner Program to find a partner that can help you run a review. Additionally, running a WAFR with a partner assisting you in remediating risks may also provide funding credits to offset the costs required to make the improvements.

“Perform data backup automatically” is part of the Reliability Pillar of the AWS Well-Architected Framework. AWS Well-Architected is a set of guiding design principles developed by AWS to help organizations build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Use the AWS Well-Architected Tool to review your workloads periodically to address important design considerations and ensure that they follow the best practices and guidance of the AWS Well-Architected Framework. For follow up questions or comments, join our growing community on AWS re:Post.

Verify the resilience of your workloads using Chaos Engineering

2022-10-26 Seth Eliot

Post Syndicated from Seth Eliot original https://aws.amazon.com/blogs/architecture/verify-the-resilience-of-your-workloads-using-chaos-engineering/

The following is an early preview of new guidance to be published as part of updates to the AWS Well-Architected content:

Chaos Engineering enables us to find shortcomings before our customers find them and therefore, provides us with the opportunity to create a better customer experience. Chaos Engineering does not introduce chaos into your systems, instead, it finds the chaos that is already there. By definition, chaos experiments should be fail-safe and tolerated by the system. It is therefore key that you use tools that allow for controlled experiments. A controlled experiment has a clear scope of impact, built in rollback mechanisms, and tight integration with monitoring that provides deep insights to the impact of the experiment in real-time. Chaos Engineering allows you to inject real-world cloud provider faults that give you insights on what you need to improve in regards to observability, incident response, and architecture to be resilient against faults that you cannot predict. To help you with this journey, we have adjusted our guidance in the Well-Architected Reliability Pillar, enabling you to build more robust and resilient workloads on AWS.

Well-Architected Reliability best practice: verify the resilience of your workloads using Chaos Engineering

Chaos Engineering provides your teams with capabilities to continuously inject real world disruptions (simulations) in a controlled way at the service provider, infrastructure, workload, and component levels, with minimal to no impact to your customers. It allows your teams to learn from faults and observe, measure, and improve the resilience of your workloads, as well as validate that alerts fire and teams get notified in the case of an event. When run continuously, Chaos Engineering can highlight deficiencies in your workloads that, if left unaddressed, could negatively affect availability and operation.

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. – Principles of Chaos Engineering

If a system is able to withstand these disruptions, the chaos experiment should be maintained as an automated regression test. In this way, chaos experiments should be run as part of your software development lifecycle (SDLC) and as part of your CI/CD pipeline.

To ensure that your workload can survive component failure, inject real-world events as part of your experiments. For example, experiment with the loss of EC2 instances or failover of the primary Amazon RDS database instance, and verify that your workload is not impacted (or only minimally impacted). Use a combination of component faults to simulate events that may be caused by a disruption in an Availability Zone.

For application-level faults (such as crashes), you can start with stressors such as memory and CPU exhaustion.

To validate fallback or failover mechanisms for external dependencies due to intermittent network disruptions, your components should simulate such an event by blocking access to the third-party providers for a specified duration that might last from seconds to hours.

Other modes of degradation might cause reduced functionality and slow responses, often resulting in a disruption of your services. Common sources of this type of degradation are increased latency on critical services and unreliable network communication (dropped packets). Experiments with these faults, including networking effects such as latency, dropped messages, and DNS failures, could include the inability to resolve a name, reach the DNS service, or establish connections to dependent services.

Chaos Engineering tools

AWS Fault Injection Simulator (AWS FIS) is a fully managed service for running fault injection experiments that can be used as part of your CD pipeline, or outside of the pipeline. AWS FIS is a good choice to use during Chaos Engineering game days. It supports simultaneously introducing faults across different types of resources including Amazon EC2, Amazon ECS, Amazon EKS, and Amazon RDS. These faults include termination of resources, forcing failovers, stressing CPU or memory, throttling, latency, and packet loss. Since it is integrated with Amazon CloudWatch alarms, you can set up stop conditions as guardrails to rollback an experiment if it causes an unexpected impact (Figure 1).

Figure 1. AWS Fault Injection Simulator integrates with AWS resources to enable you to run fault injection experiments for your workloads

To expand the scope of faults that can be injected on AWS, AWS FIS integrates with Chaos Mesh and Litmus Chaos, enabling you to coordinate fault injection workflows among multiple tools. For example, you can run a stress test on a pod’s CPU using Chaos Mesh or Litmus faults while terminating a randomly selected percentage of cluster nodes using AWS FIS fault actions.

Implementation steps

1. Determine which faults to use for experiments

Assess the design of your workload for resiliency. Such designs (created using the best practices of the Well-Architected Framework) consider risks based on critical dependencies, past events, known issues, and compliance requirements. List each element of the design intended to maintain resilience and the faults it is designed to mitigate. For more information about creating such lists, see the Operational Readiness Review whitepaper, which guides you on how to create a process to prevent reoccurrence of previous incidents. The Failure Modes & Effects Analysis (FMEA) process provides a framework for performing a component-level analysis of failures and how they impact your workload. FMEA is outlined in more detail in Failure Modes and Continuous Resilience by Adrian Cockcroft.

2. Assign a priority to each fault

To assess priority, consider the frequency of the fault and the impact of failure to the overall workload. It is fine to start with a coarse categorization, such as high, medium, or low, and refine it.

When considering frequency of a given fault, analyze past data for this workload when available. If not available, use data from other workloads running in a similar environment.

When considering impact of a given fault, the larger the scope of the fault, generally the larger the impact. Also consider the workload design and purpose. For example, the ability to access the source data stores is critical for a workload doing data transformation and analysis. In this case, you would prioritize experiments for access faults, as well as throttled access and latency insertion.

Post-incident analyses are a good source of data to understand both frequency and impact of failure modes.

Use the assigned priority to determine which faults to experiment with first and the order with which to develop new fault injection experiments.

3. For each experiment that you will execute, follow the Chaos Engineering/continuous resilience flywheel (Figure 2)

Figure 2. Chaos Engineering/continuous resilience flywheel, using the scientific method by Adrian Hornsby

3A. Define steady state as some measurable output of a workload that indicates normal behavior

Your workload exhibits steady state if it is operating reliably and as expected. Therefore, validate that your workload is healthy before defining steady state. Steady state does not necessarily mean that there is no impact to the workload when a fault occurs, as a certain percentage in faults could be within acceptable limits. The steady state is your baseline that you will observe during the experiment, which will highlight anomalies if your hypothesis defined in the next step does not turn out as expected.

For example, a steady state of a payments system can be defined as the processing of 300 transactions per second (TPS) with a 99% success rate and round-trip time of 500 ms.

3B. Form a hypothesis about how the workload will react to the fault

A good hypothesis is based on how the workload is expected to mitigate the fault to maintain the steady state. The hypothesis states that given the fault of a specific type, the system or workload will continue steady state, because the workload was designed with specific mitigations. The specific type of fault and mitigations should be specified in the hypothesis.

The following template can be used for the hypothesis (but other wording is also acceptable):

If [specific fault] occurs the [workload name] workload will [describe mitigating controls] to maintain [business or technical metric].

For example:

If 20% of the nodes in the EKS node-group are taken down, the Transaction Create API continues to serve the 99th percentile of requests in under 100 ms (steady state). The EKS nodes will recover within five minutes, and pods will get scheduled and process traffic within eight minutes after the initiation of the experiment. Alerts will fire within three minutes.
If a single EC2 instance failure occurs, the order system’s Elastic Load Balancer (ELB) health check will cause the ELB to only send requests to the remaining healthy instances while the EC2 Auto scaling replaces the failed instance, maintaining a less than 0.01% increase in server-side (5xx) errors (steady state).
If the primary RDS database instance fails, the supply chain data collection workload will failover and connect to the standby RDS database instance to maintain less than one minute of database read/write errors (steady state).

3C. Run the experiment by injecting the fault

An experiment should, by default, be fail-safe and tolerated by the workload. If you know that the workload will fail, do not run the experiment. Chaos Engineering should be used to find known-unknowns or unknown-unknowns. Known-unknowns are things you are aware of but don’t fully understand, and unknown-unknowns are things you are neither aware of nor fully understand. Experimenting against a workload that you know is broken won’t provide you with new insights. Your experiment should be carefully planned, have a clear scope of impact, and provide a roll back mechanism that can be run in case of unexpected turbulence. If your due diligence shows that your workload should survive the experiment, move forward with running the experiment. There are several options for injecting the faults. For workloads on AWS, AWS FIS provides many pre-defined fault simulations called actions. You can also define custom actions that run in AWS FIS using AWS Systems Manager documents.

We discourage the use of custom scripts for chaos experiments, unless the scripts have the capabilities to understand current state of the workload, are able to emit logs, and provide mechanisms for roll backs and stop conditions where possible.

An effective framework or toolset that supports Chaos Engineering should track the current state of an experiment, emit logs, and provide rollback mechanisms, to support the controlled running of an experiment. Start with an established service like AWS FIS that allows you to run experiments with a clearly defined scope and safety mechanisms that rollback the experiment if the experiment introduces unexpected turbulence. To learn about a wider variety of experiments using AWS FIS, see the Resilient and Well-Architected Apps with Chaos Engineering lab. Also, AWS Resilience Hub will analyze your workload and create experiments that you can choose to implement and run in AWS FIS.

For every experiment, clearly understand its scope and its impact. We recommend that faults should be simulated first on a non-production environment before being run in production.

It is ideal to ultimately run in production under real-world load via canary deployments that spin up both a control and experimental system deployment, where feasible. Running experiments during off-peak times is a good practice to mitigate potential impact when first experimenting in production. Also, if using actual customer traffic poses too much risk, you can run experiments using synthetic traffic on production infrastructure against the control and experimental deployments. When using production is not possible, run experiments in pre-production environments that are as close to production as possible.

You must establish and monitor guardrails to ensure that the experiment does not impact production traffic or other systems beyond acceptable limits. Establish stop conditions to stop an experiment if it reaches a threshold on a guardrail metric that you define. This should include the metrics for steady state for the workload, as well as the metric against the components into which you’re injecting the fault. A synthetic monitor (also known as a “user canary”) is one metric you should usually include as a user proxy. Stop conditions for AWS FIS are supported as part of the experiment template, enabling up to five stop-conditions per template.

One of the Principles of Chaos Engineering is to minimize the scope of the experiment and its impact, specifically “While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments are minimized and contained”. A method to verify the scope and potential impact is to run the experiment in a non-production environment first, verifying that thresholds for stop conditions occur as expected during an experiment and observability is in place to catch an exception, instead of directly experimenting in production.

When running fault injection experiments, verify that all responsible parties are well informed. Communicate with appropriate teams, such as the operations teams, service reliability teams, and customer support, to let them know when experiments will be run and what to expect. Give these teams communication tools to inform those running the experiment if they see any adverse effects.

You must restore the workload and its underlying systems back to the original known-good state. Often, the resilient design of the workload will self-heal. But some fault designs or failed experiments can leave your workload in an unexpected failed state. By the end of the experiment, you must be aware of this and restore the workload and systems. With AWS FIS, you can set a rollback configuration (also called a post action) within the action parameters. A post action returns the target to the state that it was in before the action was run. Whether automated (such as using AWS FIS) or manual, these post actions should be part of a playbook that describes how to detect and handle failures.

3D. Verify the hypothesis

The Principles of Chaos Engineering gives this guidance on how to verify steady state of your workload: “Focus on the measurable output of a system, rather than internal attributes of the system. Measurements of that output over a short period of time constitute a proxy for the system’s steady state. The overall system’s throughput, error rates, latency percentiles, etc. could all be metrics of interest representing steady state behavior. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.”

In our two examples from Step 3B, we include the steady state metrics:

Less than 0.01% increase in server-side (5xx) errors
Less than 1 minute of database read/write errors

The 5xx errors are a good metric because they are a consequence of the failure mode that a client of the workload will experience directly. The database errors measurement is good as a direct consequence of the fault, but should also be supplemented with a client impact measurement such as failed customer requests or errors surfaced to the client. Additionally, include a synthetic monitor (also known as a “user canary”) on any APIs or URIs directly accessed by the client of your workload.

3E. Improve the workload design for resilience

If steady state was not maintained, then investigate how the workload design can be improved to mitigate the fault, applying the best practices of the AWS Well-Architected Reliability Pillar. Additional guidance and resources can be found in the AWS Builder’s Library, which hosts articles about how to improve your health checks and employ retries with backoff in your application code, among others.

After these changes have been implemented, run the experiment again (shown by the dotted line in Figure 2) to determine their effectiveness. If the verify step indicates the hypothesis holds true, then the workload will be in steady state, and the cycle in Figure 2 continues.

4. Run experiments regularly

A chaos experiment is a cycle, and experiments should be run regularly as part of Chaos Engineering. After a workload meets the experiment’s hypothesis, the experiment should be automated to run continuously as a regression part of your CI/CD pipeline. To learn how to do this, explore this blog on how to run AWS FIS experiments using AWS CodePipeline. This lab on recurrent AWS FIS experiments in a CI/CD pipeline enables you to work hands-on with this.

Fault injection experiments are also a part of game days. Game days simulate a failure or event to verify systems, processes, and team responses. The purpose of game days is to actually perform the actions that the team would perform as if an exceptional event happened.

5. Capture and store experiment results

Results for fault injection experiments must be captured and persisted. Include all necessary data necessary (such as time, workload, and conditions) to be able to later analyze experiment results and trends. Examples of results might include screenshots of dashboards, CSV dumps from your metrics database, or a hand-recorded record of events and observations from the experiment. Experiment logging with AWS FIS can be part of this data capture.

This blog post gives early access to the updated implementation guidance on Chaos Engineering we are publishing as part of updates to the AWS Well-Architected content. Using the implementation steps described in this post, you can begin using Chaos Engineering to verify the resilience of your workloads.

Announcing updates to the AWS Well-Architected Framework

2022-10-24 Haleh Najafzadeh

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework/

We are excited to announce the availability of improved AWS Well-Architected Framework content. In this update, we have made changes across all six pillars of the framework: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.

A brief history

The Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads and organizations in the cloud.

In 2012, the first version of the framework was published, leading to the 2015 release of the guidance whitepaper. We added the operational excellence pillar in 2016. The pillar-specific whitepapers and AWS Well-Architected Lenses were released in 2017, and, the following year, the AWS Well-Architected Tool was launched. In 2020, the content for the framework received a major update, more lenses, and API integration with the Well-Architected Tool. The sixth pillar, sustainability, was added in late 2021.

AWS Well-Architected timeline

What’s new

Updates to the Well-Architected content include:

For each of the six pillars, content was consolidated between the Well-Architected Framework whitepapers and the best practice descriptions. The updated content is now available both in the pillars’ whitepapers and in the Framework whitepaper.
For all six pillars, dedicated pages for each best practice were created, with options to receive feedback.
All best practices across all pillars now follow a consistent template. The template includes (as applicable):
- Information on common anti-patterns
- Benefits
- Risk levels
- Implementation guidance
- Related resources (such as documents and videos)
Prescriptive guidance was added to 18 best practices within the Reliability, Performance Efficiency, and Operational Excellence pillars, which assist with remediation of high-risk issues identified as part of Well-Architected Framework reviews. The best practices that were updated are:
The Framework is now available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.

Learn, measure, improve, and iterate

Best practices include regularly reviewing your workloads—even those that have not had major changes. We encourage you to assess your existing workloads as your architecture evolves or business needs change, and create milestones for your workloads as they develop. Use the Well-Architected Framework to guide your design and architecture of new workloads, or of workloads that you are planning on moving to the cloud.

Taking best practices into account early in your process can yield high success rates. In effective organizations, each best practice is considered and prioritized with respect to the goal they are trying to achieve.

AWS Well-Architected helps cloud architects build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. The Framework is built around six pillars—operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability.

Want to partner with us? Sign up!
Want to work with us? Visit Amazon Careers and search for “AWS Well-Architected” to find opportunities.

Let’s Architect! Designing Well-Architected systems

2022-07-27 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-designing-well-architected-systems/

Amazon’s CTO Werner Vogels says, “Everything fails, all the time”. This means we should design with failure in mind and assume that something unpredictable could happen.

The AWS Well-Architected Framework is designed to help you prepare your workload for failure. It describes key concepts, design principles, and architectural best practices for designing and running workloads in the cloud. Using this tool regularly will help you gain awareness of the status of your workloads and is in place to improve any workload deployed inside your AWS accounts.

In this edition of Let’s Architect!, we’ve collected solutions and articles that will help you understand the value behind the Well-Architected Framework and how to implement it in your software development lifecycle.

AWS Well-Architected Framework

AWS Well-Architected (AWS WA) helps cloud architects build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Built around six pillars—operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability—AWS WA provides a consistent approach for customers and partners to evaluate architectures and implement scalable designs.

The AWS WA Framework includes domain-specific lenses, hands-on labs, and the AWS Well-Architected Tool. The AWS Well-Architected Tool (AWS WA Tool), available at no cost in the AWS Management Console, provides a mechanism for regularly evaluating workloads, identifying high-risk issues, and recording improvements.

The 6 pillars that composes the AWS Well-Architected framework

Use templated answers to perform Well-Architected reviews at scale

For larger customers, performing AWS WA reviews often involves a combination of different teams. Coordinating participants from each team in order to perform a review increases the time taken and is expensive. In a large organization, there are often hundreds of AWS accounts where teams can store review documents, which means there is no way to quickly identify risks or spot common issues or trends that could influence improvements.

To address this, this blog post offers a solution to help you perform reviews easier and faster. It allows workload owners to automatically populate their reviews with templated answers to questions in the AWS WA Tool. These answers may be a shared responsibility between an application team and a centralized team such as platform, security, or finance. This way, application teams have fewer questions to answer and centralized team members have fewer reviews to attend, because answers that are common to all workloads are pre-populated in workload reviews. The solution also provides centralized reporting to provide a centralized view of AWS WA reviews conducted across the organization.

The components of the solution and the steps in the workflow

Machine Learning Lens

Machine learning (ML) is used to solve specific business problems and influence revenue. However, moving from experimentation (where scientists design ML models and explore applications) to a production scenario (where ML is used to generate value for the business) can present some challenges. For example, how do you create repeatable experiments? How do you increase automation in the deployment process? How do you deploy my model and monitor the performance?

This blog post and its companion whitepaper provide best practices based on AWS WA for each phase of putting ML into production, including formulating the problem and approaches for monitoring a model’s performance.

ML lifecycle phases with expanded components

Establishing Feedback Loops Based on the AWS Well-Architected Framework Review

When you perform an AWS WA review using the AWS WA Tool, you’ll answer a set of questions. The tool then provides gives recommendations to improve your workloads.

To apply these recommendations effectively, you must 1) define how you’ll apply them, 2) create systems to define what is monitored and which kind of metrics or logs are required, 3) establish automatic or manual process and for reporting, and 4) improve them through iteration. This process is called a feedback loop.

This blog post shows you how to iteratively improve your overall architecture with feedback loops based on the results of the AWS WA review.

Feedback loop based on the AWS WA review

See you next time!

Thanks for reading! See you in a couple of weeks when we discuss strategies for running serverless applications on AWS.

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

AI-powered architecture and governance

Well-Architected Framework evolution and implementation

Cost optimization and FinOps

Session formats to fit your learning style

Breakout sessions – Stay in the know

Chalk talks

Hands-on workshop and Builders’ sessions

Visit the AWS Cloud Support kiosk in the Venetian

About the authors

The cloud optimization imperative

Best practices for success

Real-world success stories

Business impact

AWS resources for your optimization journey

Getting started

Conclusion

About the Authors

What is an architecture review board?

Life without a formal architecture review process

Benefits of an architecture review board

Improved compliance

Reduced technical debt

Efficiency with lowered costs

Supportability

Security

Steps for effective architecture review boards

Leadership support

Single source for guidance, policies, and best practices

Defined stakeholders

Gated process with documented decisions

Establish an exception process

Architecture central repository

Automate review process

Architecture review process shepherd

Conclusion

About the authors

#10 Deploy Stable Diffusion ComfyUI on AWS elastically and efficiently

#9 Let’s Architect! Designing Well-Architected systems

#8 Let’s Architect! Learn About Machine Learning on AWS

#7 Creating an organizational multi-Region failover strategy

#6 Building a three-tier architecture on a budget

#5 Announcing updates to the AWS Well-Architected Framework guidance

#4 Let’s Architect! Serverless developer experience in AWS

#3 London Stock Exchange Group uses chaos engineering on AWS to improve resilience

#2 Achieving Frugal Architecture using the AWS Well-Architected Framework Guidance

#1 How an insurance company implements disaster recovery of 3-tier applications

Thank you!

A brief history

What’s new

Pillar updates

Operational Excellence

Security

Reliability

Performance Efficiency

Sustainability

Conclusion

See you next time!

A brief history

What’s new

Pillar updates

Operational Excellence

Security

Reliability

Performance Efficiency

Cost Optimization

Sustainability

Conclusion

#10: Build a serverless retail solution for endless aisle on AWS

#9: Optimizing data with automated intelligent document processing solutions

#8: Disaster Recovery Solutions with AWS managed services, Part 3: Multi-Site Active/Passive

#7: Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

#6: Let’s Architect! Designing event-driven architectures

#5: Use a reusable ETL framework in your AWS lake house architecture

#4: Invoking asynchronous external APIs with AWS Step Functions

#3: Announcing updates to the AWS Well-Architected Framework

#2: Let’s Architect! Designing architectures for multi-tenancy

#1: Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Bonus! Three older special mentions

A brief history

What’s new