Tag Archives: AWS Well-Architected

Announcing the updated AWS Well-Architected Generative AI Lens

Post Syndicated from Dan Ferguson original https://aws.amazon.com/blogs/architecture/announcing-the-updated-aws-well-architected-generative-ai-lens/

We are delighted to announce an update to the AWS Well-Architected Generative AI Lens. This update features several new sections of the Well-Architected Generative AI Lens, including new best practices, advanced scenario guidance, and improved preambles on responsible AI, data architecture, and agentic workflows.

The AWS Well-Architected Framework provides architectural best practices for designing and operating generative AI workloads on AWS. The Generative AI Lens uses the Well-Architected Framework to outline the steps for performing a Well-Architected Framework review for your generative AI workloads.

The Generative AI Lens provides a consistent approach for customers to evaluate architectures that use large language models (LLMs) to achieve their business goals. This lens addresses common considerations relevant to model selection, prompt engineering, model customization, workload integration, and continuous improvement. Specifically excluded from the Generative AI Lens are best practices associated with model training and advanced model customization techniques. We identify best practices that help you architect your cloud-based applications and workloads according to AWS Well-Architected design principles gathered from supporting thousands of customer implementations.

The Generative AI Lens joins a collection of Well-Architected lenses published under AWS Well-Architected Lenses. For more information on the lens itself, check out the launch announcement post.

What has changed in the updated Generative AI Lens?

The updated Generative AI Lens incorporates several new additions for customers to review. These additions keep the lens on-pace with the rapidly growing area of generative AI, helping customers stay up to date with architectural best practices.

Amazon SageMaker HyperPod guidance

The updated lens features additional guidance for users of Amazon SageMaker HyperPod. SageMaker HyperPod is a highly resilient model training and hosting service that you can use to orchestrate complex, long-running generative AI workflows in the cloud. These workflows could be foundation model pre-training or serving model inference at scale.

We are excited to announce additional guidance for customers using SageMaker HyperPod in the Generative AI Lens. This guidance is built into the existing best practices, expanding the guidance for covered services to include SageMaker capabilities. This guidance joins the existing guidance on Amazon Bedrock, Amazon Q Business, Amazon Q Developer, and Amazon SageMaker AI.

Responsible AI preamble

The updated responsible AI preamble now includes a detailed discussion on the eight core dimensions of responsible AI as described by AWS. Customers can now learn more about the eight dimensions of responsibly developed AI systems directly within the lens. This is required reading for customers in all stages of their generative AI journey.

Data architecture preamble

The updated data architecture preamble reviews strategic considerations associated with a modern data architecture supporting generative AI workloads. This section provides customers with a view into the high-level decisions and considerations needed to architect a data system that services generative AI workloads.

Agentic AI preamble

New to the generative AI lens is the agentic AI preamble. Agentic systems, while technically classified as a subset of distributed computing, play an important role in modern generative AI workloads. This preamble introduces customers to a sampling of architecture paradigms common in agentic systems powered by foundation models.

Scenarios

The Generative AI Lens now includes eight architecture scenarios. These scenarios cover a range of common generative AI powered business applications, including autonomous call centers, knowledge worker co-pilots, and multi-tenant generative AI service systems. The scenario section provides specific guidance for applying generative AI technologies to common business problems. The following image is an example of one of the new scenarios now included in the Generative AI Lens.

Who should use the Generative AI Lens?

The Generative AI Lens is useful to many roles. Business leaders can use this lens to acquire a broader appreciation of the end-to-end implementation and benefits of generative AI. Data scientists and engineers can read this lens to understand how to use, secure, and gain insights from their data at scale. Risk and compliance leaders can understand how generative AI is implemented responsibly by providing compliance with regulatory and governance requirements.

Next steps

The updated Well-Architected Generative AI Lens is available now. Use the lens as a framework to verify that your generative AI workloads are architected with operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability in mind.

If you require support on the implementation or assessment of your generative AI workloads, please contact your AWS Solutions Architect or Account Representative.

Special thanks to everyone across the AWS Solution Architecture, AWS Professional Services, and Machine Learning communities who contributed to the updated Generative AI Lens. These contributions encompassed diverse perspectives, expertise, backgrounds, and experiences in developing the new AWS Well-Architected Generative AI Lens.

For additional reading, refer to the AWS Well-Architected Framework and pillar whitepapers, or use the AWS Well-Architected Machine Learning Lens and its custom lens accessible from the AWS Well-Architected Tool.


About the authors

Announcing the updated AWS Well-Architected Machine Learning Lens

Post Syndicated from Steven DeVries original https://aws.amazon.com/blogs/architecture/announcing-the-updated-aws-well-architected-machine-learning-lens/

We are excited to announce the updated AWS Well-Architected Machine Learning Lens, now enhanced with the latest capabilities and best practices for building machine learning (ML) workloads on AWS.

The AWS Well-Architected Framework provides architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud. The Machine Learning Lens uses the Well-Architected Framework to outline the steps for performing a comprehensive review of your ML architectures.

The updated Machine Learning Lens provides a consistent approach for customers to evaluate architectures across ML workloads, from traditional supervised and unsupervised learning to modern AI applications. This lens addresses common considerations relevant to the complete ML lifecycle, including business goal identification, problem framing, data processing, model development, deployment, and monitoring. The lens incorporates the latest AWS ML services and capabilities introduced since 2023, providing access to current best practices and implementation guidance.

The Machine Learning Lens is part of a collection of Well-Architected lenses published under AWS Well-Architected Lenses.

What is the Machine Learning Lens?

The Well-Architected Machine Learning Lens focuses on the six pillars of the Well-Architected Framework across six phases of the ML lifecycle.

The six phases are:

  1. Business goal identification: Establishing clear business objectives and success criteria for your ML initiative.
  2. ML problem framing: Translating business problems into well-defined ML problems with appropriate metrics.
  3. Data processing: Collecting, preparing, and engineering features from your data sources.
  4. Model development: Building, training, tuning, and evaluating ML models with proper experimentation tracking.
  5. Model deployment: Deploying models into production environments with appropriate infrastructure and monitoring.
  6. Model monitoring: Continuously monitoring model performance and maintaining model quality over time.

Unlike the traditional waterfall approach, an iterative approach is required to achieve a working prototype based on the six phases of the ML lifecycle. The lens provides you with a set of established cloud-agnostic best practices in the form of Well-Architected Framework pillars for each ML lifecycle phase.

You can also use the Well-Architected Machine Learning Lens wherever you are on your cloud journey. You can choose to apply this guidance either during the design of your ML workloads or after your workloads have entered production as a part of the continuous improvement process.

Machine Learning Lens components

The lens includes four focus areas:

  1. Well-Architected ML design principles: Ten design principles that frame the presented best practices, including assign ownership, enable reproducibility, optimize resources, and enable continuous improvement.
  2. The ML lifecycle and the Well-Architected Framework pillars: This considers all aspects of the ML lifecycle and reviews design strategies aligned to the pillars of the overall Well-Architected Framework:
    • Operational excellence: Ability to support ongoing development, run ML workloads effectively, gain insight into operations, and continuously improve processes.
    • Security: Ability to protect data, models, and ML infrastructure while taking advantage of cloud technologies to improve security posture.
    • Reliability: Ability of ML workloads to perform their intended function correctly and consistently, with automatic recovery from failure situations.
    • Performance efficiency: Ability to use computing resources efficiently for ML workloads and maintain efficiency as demand and technologies evolve.
    • Cost optimization: Ability to run ML systems to deliver business value at the lowest price point through resource optimization and automation.
    • Sustainability: Addresses the environmental impact of ML workloads, focusing on energy consumption and resource efficiency.
  3. Cloud-agnostic best practices: 100+ comprehensive best practices covering each ML lifecycle phase across the Well-Architected Framework pillars. Each best practice includes:
    • Implementation guidance: Detailed AWS implementation plans with references to current AWS ML services and capabilities.
    • Resources: Curated links to AWS documentation, blogs, videos, and code examples supporting the best practices.
  4. Related ML architecture considerations: Discussions on advanced topics including MLOps patterns, data architecture for ML, model governance strategies, and considerations for responsible AI implementation.

What else is discussed in the Machine Learning Lens?

The Machine Learning Lens also discusses the following key topics:

  • Responsible AI: Comprehensive guidance on implementing fair, explainable, and unbiased ML systems throughout the development lifecycle.
  • MLOps and automation: Best practices for implementing continuous integration, continuous deployment, and continuous training for ML workloads.
  • Data architecture for ML: Guidance on building robust data pipelines, feature stores, and data governance practices that support ML workloads at scale.
  • Model governance and lineage: Strategies for tracking model versions, maintaining audit trails, and ensuring compliance with regulatory requirements.

What’s new in the updated Machine Learning Lens?

The updated Machine Learning Lens incorporates the latest AWS ML capabilities and best practices introduced since 2023, including:

  • Enhanced data and AI collaborative workflows: Integrated development through Amazon SageMaker Unified Studio – MLOPS02-BP01, MLOPS01-BP01, MLOPS03-BP01, and MLOPS02-BP04.
  • AI-assisted development lifecycle: Code generation and productivity enhancement using Kiro and Amazon Q Developer – MLCOST01-BP02, MLOPS01-BP01, MLCOST03-BP02, and MLSUS05-BP02.
  • Distributed training infrastructure: Large-scale foundation model development and fine-tuning with Amazon SageMaker HyperPod – MLCOST04-BP02, MLCOST04-BP07, MLPERF06-BP05, MLSEC03-BP02, MLCOST04-BP06, MLPERF06-BP07, and MLSUS05-BP02.
  • Model customization capabilities: Knowledge distillation and fine-tuning for domain-specific applications using Amazon Bedrock with Kiro and Amazon Q Developer integration and model hub with Amazon SageMaker Jumpstart – MLCOST01-BP02, MLCOST01-BP01, MLCOST03-BP02, MLSUS04-BP02, MLCOST05-BP01, and MLSUS05-BP02.
  • No-code ML development: Natural language support for building models using SageMaker Canvas with Amazon Q Developer integration – MLCOST03-BP02, MLCOST03-BP03, MLOPS01-BP01, and MLSUS05-BP02.
  • Improved bias detection: Enhanced fairness metrics in SageMaker Clarify with Model Monitor for drift detection – MLREL02-BP01, MLREL03-BP04, MLREL02-BP04, MLREL02-BP05, and MLREL02-BP02.
  • Modular inference architecture: Flexible deployment with SageMaker Inference Components and Multi-Model Endpoints – MLCOST05-BP01, MLREL01-BP01, MLSUS05-BP01, MLCOST05-BP03, and MLREL01-BP02.
  • Advanced observability: Improved debugging with SageMaker Debugger, Model Monitor, and CloudWatch across the ML lifecycle – MLOPS06-BP02, MLOPS05-BP02, MLOPS06-BP01, and MLOPS02-BP04.
  • Enhanced cost optimization: Resource management through SageMaker Training Plans, Savings Plans, and Spot Instance support – MLCOST05-BP03, MLOPS05-BP02, MLCOST06-BP01, MLCOST06-BP02, and MLCOST04-BP06.

Who should use the Machine Learning Lens?

The Machine Learning Lens is valuable for many roles across your organization. Business leaders can use this lens to understand the end-to-end implementation and business value of ML initiatives. Data scientists and ML engineers can leverage the lens to understand how to build, deploy, and maintain ML systems at scale. DevOps and platform engineers can learn how to create reliable, secure infrastructure for ML workloads. Risk and compliance leaders can understand how ML systems are implemented responsibly while adhering to regulatory and governance requirements.

Next steps

If you require support on the implementation or assessment of your ML workloads, please contact your AWS Solutions Architect or Account Representative.

Special thanks to everyone across the AWS Solution Architecture, AWS Professional Services, and Machine Learning communities who contributed to the updated Machine Learning Lens. These contributions encompassed diverse perspectives, expertise, backgrounds, and experiences in developing comprehensive guidance for ML workloads on AWS.

For additional reading, refer to the AWS Well-Architected Framework, or explore the AWS Well-Architected Generative AI Lens for guidance specific to generative AI workloads.


About the authors

Architecting for AI excellence: AWS launches three Well-Architected Lenses at re:Invent 2025

Post Syndicated from Anitha Selvan original https://aws.amazon.com/blogs/architecture/architecting-for-ai-excellence-aws-launches-three-well-architected-lenses-at-reinvent-2025/

At re:Invent 2025, we introduce one new lens and two significant updates to the AWS Well-Architected Lenses specifically focused on AI workloads: the Responsible AI Lens, the Machine Learning (ML) Lens, and the Generative AI Lens. Together, these lenses provide comprehensive guidance for organizations at different stages of their AI journey, whether you’re just starting to experiment with machine learning or already deploying complex AI applications at scale.

The AWS Well-Architected Framework provides the best architectural practices for designing and operating reliable, secure, performance efficient, cost-optimized, and sustainable workloads in the cloud.

The Responsible AI Lens: Embedding trust in AI systems

The Responsible AI Lens offers a structured approach for developers to assess and track their AI workloads against established best practices, identify potential gaps in their AI implementation and receive actionable guidance to improve their AI systems’ quality and alignment with responsible AI principles. By using the Responsible AI Lens you can make informed decisions that balance business and technical requirements, accelerating your path from AI experimentation to production-ready solutions.

Key takeaways from the Responsible AI Lens:

  • Every AI system has a Responsible AI consideration: Whether intentionally designed or not, AI systems inherently carry Responsible AI implications that need to be actively managed rather than left to chance.
  • AI systems can be used beyond original intent and may have unintended impacts: Applications often get utilized in ways developers didn’t anticipate, and due to their probabilistic nature, AI systems can produce unexpected outcomes even within intended use cases, making robust Responsible AI decisions essential from the start.
  • Responsible AI is an enabler to innovation and trust: Rather than being a constraint, Responsible AI practices can accelerate innovation by proactively building stakeholder and customer trust and reducing downstream risks.

The Responsible AI Lens serves as the foundational guidance for AI development activities, providing critical guidelines that inform both the Machine Learning Lens and the Generative AI Lens implementations.

The Machine Learning Lens: Foundation for ML workloads

The Machine Learning Lens provides you with a set of established cloud-agnostic best practices in the form of Well-Architected Framework pillars for each machine learning (ML) lifecycle phase. The updated Machine Learning Lens provides a consistent approach for designing, building, and operating machine learning workloads on AWS. It addresses the full spectrum of ML workloads, from traditional supervised and unsupervised learning to modern AI applications.

The updated Machine Learning Lens incorporates the latest AWS ML capabilities (evolved since their introduction in 2023). What’s new in the updated ML Lens:

  • Enhanced data and AI collaborative workflows through Amazon SageMaker Unified Studio.
  • AI-assisted development for code generation and productivity enhancement.
  • Distributed training infrastructure for foundation model development and fine-tuning with Amazon SageMaker HyperPod.
  • Model customization capabilities such as knowledge distillation and fine-tuning domain-specific applications using Amazon Bedrock with Kiro and Amazon Q Developer.
  • No-code ML development using Amazon SageMaker Canvas with Amazon Q integration.
  • Improved bias detection with enhanced fairness metrics and Responsible AI capabilities in Amazon SageMaker Clarify.
  • Automated dashboard creation for business insights through Amazon Quick Sight.
  • Modular inference architecture for flexible model deployment with Inference Components.
  • Advanced observability with improved debugging and monitoring capabilities across the ML lifecycle.
  • Enhanced cost optimization for resource management through Amazon SageMaker Training Plans, Savings Plans, and Spot Instance support.

You can use the ML Lens wherever you are on your cloud journey. You can choose to apply this guidance either during the design of your ML workloads or after your workloads have entered production as part of the continuous improvement process. These improvements are powered by key AWS services including Amazon SageMaker Unified Studio, Amazon Q, Amazon SageMaker HyperPod, and Amazon Bedrock.

The Generative AI Lens: Specialized guidance for foundation models

The Generative AI Lens provides a consistent approach for customers to evaluate architectures that use large language models (LLMs) to achieve their business goals. This lens addresses common considerations relevant to model selection, prompt engineering, model customization, workload integration, and continuous improvement. We identify best practices that help you architect your cloud-based applications and workloads according to AWS Well-Architected design principles gathered from supporting thousands of customer implementations. While the Machine Learning (ML) Lens covers the broad spectrum of ML workloads, the Generative AI Lens focuses specifically on foundation models and generative AI applications. The Generative AI Lens provides the best architectural practices for designing and operating generative AI workloads on AWS.

The updated Generative AI Lens includes several new additions:

  • Amazon SageMaker HyperPod guidance for orchestrating complex, long-running generative AI workflows that includes additional service capabilities.
  • Enhanced Responsible AI preamble with detailed discussion on the eight core dimensions of Responsible AI as described by AWS.
  • Updated data architecture preamble with strategic considerations needed to architect a data system for generative AI workloads.
  • New agentic AI preamble introducing architecture paradigms for agentic systems.
  • Eight architecture scenarios covering common generative AI-powered business applications such as autonomous call centers, knowledge worker co-pilots, and multi-tenant generative AI service systems.

The Generative AI Lens builds upon the foundation established by the ML Lens, providing specialized guidance for the unique challenges and opportunities presented by foundation models and generative AI applications.

Implementation strategy for Well-Architected AI/ML guidance: A unified approach

The new lenses – Responsible AI Lens, Machine Learning Lens, and Generative AI Lens – work together to provide comprehensive guidance for AI development. The Responsible AI Lens guides safe, fair, and secure AI development. It helps balance business needs with technical requirements, streamlining the transition from experimentation to production. The Machine Learning Lens guides organizations in evaluating workloads across both modern AI and traditional machine learning approaches. Recent updates focus on key areas including enhanced data and AI collaborative workflows, AI-assisted development capabilities, large-scale infrastructure provisioning, and customizable model deployment. The Generative AI Lens helps customers evaluate large language model (LLM) based architectures and its updates include guidance for Amazon SageMaker HyperPod users, new insights on agentic AI, and updated architectural scenarios.

What are the next steps?

The launch of these new lenses at re:Invent 2025 helps organizations build AI systems that are responsible, trustworthy, powerful, and effective. By providing comprehensive guidance across the full spectrum of AI workloads, AWS supports organizations to accelerate their AI initiatives while maintaining the highest standards of responsible AI and technical excellence.

Learn more about the AWS Well-Architected Framework and implement the best practice guidance provided using the GitHub repository. These lenses are practical tools designed to help you build AI systems that deliver real business value while maintaining the highest standards of ethics, security, and operational excellence.

For additional reading, refer to the AWS Well-Architected Framework and pillar whitepapers, or contact your AWS Solutions Architect or Account Representative for support on implementing these lenses in your organization.


About the authors

Know before you go – AWS re:Invent 2025 guide to Well-Architected and Cloud Optimization sessions

Post Syndicated from Anitha Selvan original https://aws.amazon.com/blogs/architecture/know-before-you-go-aws-reinvent-2025-guide-to-well-architected-and-cloud-optimization-sessions/

Are you ready to maximize your Well-Architected and Cloud Optimization learning and networking time at re:Invent 2025? We have put together this comprehensive guide to help you plan your schedule and make the most of the Well-Architected and cloud optimization sessions available this year. These sessions will deliver the practical guidance your teams need to lead strategic cloud initiatives, design next-generation architectures, optimize costs, or secure AI-powered systems.

Key themes at re:Invent for Well-Architected and Cloud Optimization – You can expect to see the following themes at re:Invent 2025

AI-powered architecture and governance

The sessions in this theme showcase how AWS is integrating AI technologies to transform traditional architectural practices. From using AI services for automated Well-Architected reviews to implementing self-evolving systems with agentic AI, these sessions demonstrate how you can use AI to automate architectural decisions, streamline governance processes, and scale best practices across the enterprise.

Sessions: ARC324-R, ARC317-R, SPS320, ARC302-R (session details are posted in the following section)

Well-Architected Framework evolution and implementation

These sessions highlight how the AWS Well-Architected Framework has evolved beyond its original scope to address modern architectural challenges. Attendees will learn how to implement the framework principles across different domains—from IoT security to backup strategies—while focusing on enterprise-scale governance and compliance.

Sessions: ARC204, SEC337, STG313-R, ARC323-R (session details are posted in the following section)

Cost optimization and FinOps

The cost optimization track focuses on innovative approaches to cloud financial management, with a strong emphasis on AI-powered FinOps solutions. Sessions range from hands-on workshops like the Frugal Architect GameDay to chalk talks on establishing effective cost governance models.

Sessions: ARC318-R, COP309-R, ARC309, DEV318 (session details are posted in the following section)

Session formats to fit your learning style

This year’s catalog features an exciting mix of content across different formats: from breakout, chalk talks, workshops, builder sessions to code talks.

Breakout sessions – Stay in the know

Sit back and enjoy these presentations to stay current with the latest solution enhancements and practical applications. AWS experts and guest speakers will share valuable insights and real-world examples.

From ideas to impact: Architecting with cloud best practices

ARC204 | Breakout session | December 1, 8:30 AM

Discover how foundational frameworks like the AWS Well-Architected Framework, AWS Cloud Adoption Framework, and AWS Cloud Operating Model evolved through customer feedback and real-world learnings from thousands of organizations, transforming from structured guidance into dynamic insights for optimizing cloud environments. Learn practical strategies for applying unified best practices to accelerate cloud transformation while managing large-scale architectural changes and maintaining operational excellence.

Build a well-architected foundation for scaling generative AI and agentic apps

AIM310 | Breakout session | December 1, 10:00 AM

Move beyond proof-of-concepts to build a production-ready foundation supporting all AI applications across your organization, addressing the critical transition from experimentation to enterprise-scale AI deployment. Navigate model access and management, tool discovery, memory and state handling, and observability at scale while building foundations that seamlessly integrate model access, orchestration workflows, agents, and tools with enterprise-grade governance controls.

AI-Powered Enterprise Architecture with ServiceNow & AWS 

ARC337-S | Breakout session | December 2, 3:00 PM

Enterprises face a core challenge: translating architectural vision into resilient cloud reality. See how integrating ServiceNow’s Enterprise Architecture Workspace with the AWS Well-Architected Tool transforms traditional design processes. Through elegant “shift-left” methodologies, architects gain contextual insights that seamlessly blend enterprise modeling with cloud best practices. This presentation is brought to you by ServiceNow, an AWS Partner.

The AI revolution in customer support: Building predictive service systems

SPS315 | Breakout session | December 3, 5:30 PM

Discover how AWS is using generative AI to transform customer support from reactive to proactive. We’ll show how large language models and AI agents are improving customer satisfaction and efficiency. Topics include smart case routing, context-aware support, early problem detection, and responsible AI use. We’ll share real results and discuss balancing AI capabilities with human touch.

Optimize AWS Costs: Developer Tools and Techniques

DEV318 | Breakout session | December 1, 3:00 PM

As cloud applications grow in complexity, optimizing costs becomes crucial for developers. This session explores AWS native tools and coding practices that reduce expenses without compromising performance or scalability.

Chalk talks

AWS speakers set the stage at the beginning of the talk and then open up for discussion. Bring your questions and dive deep into the topic with AWS experts and other customers.

Architecting agentic systems: Self-evolving patterns with AWS AI

ARC324-R | Chalk talk | December 2, 1:30 PM

Learn to architect self-evolving systems using agentic AI that align with AWS Well-Architected principles, exploring cutting-edge patterns for systems that adapt, heal, and optimize themselves autonomously while maintaining architectural integrity. Implement autonomous monitoring and self-healing capabilities with Amazon Bedrock Agents, design AI-driven security controls and automated recovery mechanisms and create systems that continuously adapt to workload patterns while maintaining reliability and performance standards.

Building Well-Architected agentic AI applications

ARC317-R/R1 | Chalk talk | December 2, 3:00 PM and December 4, 1:00 PM

Navigate generative AI agent development with robust architectural practices for security and compliance, focusing on proven patterns for building production-ready agentic AI applications that meet enterprise requirements. Design agent architectures with guardrails, monitoring systems, and access controls using the AWS Well-Architected Generative AI Lens while implementing governance patterns that ensure regulatory compliance and enable systems to scale from prototype to enterprise-wide deployment.

Using generative AI to automate architectural guidance

ARC315 | Chalk talk | December 1, 4:30 PM

Replace time-intensive manual processes with AI-powered systems that generate strategic recommendations, design principles, and best practices at scale while maintaining quality and consistency. Generate organization-specific design principles using AI analysis of architectural patterns, implement AI-driven guidance systems with effective quality control mechanisms, and build knowledge bases that feed AI-powered architectural guidance while maintaining human oversight and addressing ethical considerations.

Agentic architecting: From prototype to production-ready systems

ARC330-R/R1 | Chalk talk | December 2, 5:30PM and December 4, 2:30 PM

Transform prototypes into production-ready systems by incorporating security, monitoring, and CI/CD through agentic architecting, focusing on practical challenges of moving from experimental AI systems to production-grade architectures. Use AI agents to generate and optimize AWS CDK infrastructure and application code, implement automated security improvements and CI/CD pipeline creation, and maintain AWS Well-Architected principles while enabling teams to focus on business logic as AI handles infrastructure complexity.

AI-powered FinOps: Agent-based cloud cost management

ARC318-R/R1 | Chalk talk | December 1, 4:00 PM and December 3, 4:00 PM

Learn how intelligent agents tackle fragmented cost data and optimization processes in complex multi-account environments, moving beyond traditional FinOps approaches to autonomous, intelligent financial optimization. Architect solutions using Amazon OpenSearch Service for data aggregation and Amazon Bedrock for contextual reasoning to design secure, scalable FinOps solutions that continuously optimize costs while delivering measurable business outcomes.

Supercharge your well-architected reviews with AWS Generative AI

SPS320 | Chalk talk | December 3, 4:00 PM

Discover how Koch Industries revolutionized AWS Well-Architected reviews using generative AI, transforming weeks-long manual processes into automated, intelligent systems. Automate architectural assessments using Amazon Bedrock Knowledge Bases and Model Context Protocol (MCP) to scale best practice reviews and optimize workloads in minutes instead of days while achieving more comprehensive, consistent, and actionable recommendations through proven change management and organizational adoption strategies.

Architecting enterprise-scale governance beyond AWS Control Tower

ARC323-R/R1 | Chalk talk | December 3, 11:30 AM and December 4, 2:00PM

Discover advanced governance strategies that build upon AWS Control Tower for enterprise-scale environments requiring sophisticated compliance, security, and operational controls. Implement infrastructure across six Well-Architected Foundations capabilities with critical trade-off understanding, build efficient multi-account structures balancing security requirements with innovation needs, and architect automated compliance checks and policy enforcement at scale while enabling self-service capabilities with centralized governance and security controls.

Securing IoT Workloads with AWS IoT Lens and AWS Security Reference Architecture

SEC337 | Chalk talk | December 3, 11:30 AM

Industrial environments are reaching new levels of connectivity, automation, efficiency, and real-time data insights. However, this increased connectivity also introduces significant security challenges. Unaddressed security concerns can expose vulnerabilities and slow down companies looking to accelerate digital transformation using IoT and cloud. This chalk talk explores relevant techniques, architecture patterns, best practices and AWS security services for securing complex OT/IT environments, IoT devices, edge and cloud using the AWS Well-Architected IoT Lens and AWS Security Reference Architecture (SRA).

Establishing effective cost governance

COP309-R/R1 | Chalk talk | December 3, 3:00 PM and December 4, 12:30 PM

Generative AI agent development demands robust architectural practices for security and compliance. This chalk talk explores proven patterns for architecting secure, efficient AI agents using the AWS Well-Architected Generative AI Lens. Through collaborative discussion and whiteboarding, examine architectural governance and best practices for production environments. Learn to design agent architectures incorporating guardrails, monitoring systems, access controls, and sustainable deployment practices. Gain actionable insights for building secure, efficient, and cost-effective agentic AI applications that scale.

Break down monoliths, modernizing applications on Amazon ECS

CNS346 | Chalk talk | December 2, 4:30 PM

Join this interactive chalk talk to solve a common challenge where monolithic applications take months to deploy new features, and scaling becomes increasingly difficult. We’ll start with a real scenario, an application running on servers with a shared database. Together, we’ll design the modernization path using Amazon ECS and Well-Architected Framework principles. You’ll explore common architecture patterns, containerization strategies, CI/CD automation, and blue/green deployment approaches for ECS. After this session, you’ll walk away with a practical roadmap to transform your monolithic application into scalable microservices. Bring your curiosity and help us build the architecture live.

Hands-on workshop and Builders’ sessions

AWS speakers will introduce the use-case and tools designed to tackle the challenge. You will follow instructions, complete the tasks, and walk away with better understanding of the capabilities.

AI-powered Well-Architected reviews: Building automated governance

ARC302-R | Builders’ session | December 1, 9:00 AM; December 2, 11:30 AM and December 3, 4:00 PM

Build intelligent systems that automate AWS Well-Architected Framework reviews using generative AI, transforming manual architectural assessments into continuous, intelligent governance processes. Evaluate architecture against Well-Architected pillars while incorporating organization-specific requirements, implement continuous analysis of architecture and infrastructure as code templates, and enhance AI understanding of architectural context using Model Context Protocol servers to transform time-intensive reviews into scalable, automated processes with consistent governance.

AI-powered troubleshooting: From chaos to Well-Architected

ARC301-R | Builders’ session | December 1, 8:30 AM; December 2, 3:00 PM and December 3, 10:00 AM

Tackle complex scenarios using AI-powered tools to diagnose and resolve architectural problems, gaining practical experience using AI to transform poorly designed systems into well-architected solutions. Troubleshoot and optimize architectures with scaling bottlenecks and database inefficiencies using Amazon Q, apply Well-Architected principles to enhance performance and security under pressure, and accelerate problem identification and resolution while building AWS optimization expertise and learning to identify architectural anti-patterns before they become critical issues.

The Frugal Architect GameDay: Building cost-aware architectures

ARC309 | Workshop | December 1, 8:00 AM

Compete to implement cost efficiency improvements across multiple AWS services in this interactive GameDay, applying the Laws of the Frugal Architect to real-world scenarios for practical experience in transforming high-cost infrastructure into efficient, sustainable architectures. Address challenges spanning compute, networking, storage, serverless, and observability domains while learning to reduce cloud unit costs and improve profitability without compromising service quality through gamified scenarios that build rapid cost optimization decision-making skills.

Optimize AWS Backup using AI evaluation and Well-Architected best practices

STG313-R | Builders’ session | December 2, 1:30 PM and December 3, 8:30 AM

Enhance AWS Backup implementation using the AWS Backup Evaluator Solution, an AI agent that synthesizes data from multiple sources to provide intelligent backup optimization recommendations. Assess backup environments against the Well-Architected AWS Backup lens using Strands Agents SDK, create unified visibility across backup landscapes to identify optimization opportunities, and implement AI agents that continuously monitor backup efficiency while aligning with AWS best practices to enhance efficiency and cost-effectiveness.


Visit the AWS Cloud Support kiosk in the Venetian

Important notes:

Session dates, times, and locations listed in the post are subject to change as we continue to optimize the schedule on session popularity and venue capacity. Please check this blog post and the re:Invent session catalog regularly for the most up-to-date information about your registered sessions and newly added activities. For a full view of Well-architected content, including sessions with partners, explore the AWS re:Invent catalog and filter on the Well-Architected Framework area of interest.

Remember to reserve your seats early as popular sessions fill up quickly and bring your laptop for hands-on builders’ sessions and workshops. Register today


About the authors

Build resilient generative AI agents

Post Syndicated from Yiwen Zhang original https://aws.amazon.com/blogs/architecture/build-resilient-generative-ai-agents/

Generative AI agents in production environments demand resilience strategies that go beyond traditional software patterns. AI agents make autonomous decisions, consume substantial computational resources, and interact with external systems in unpredictable ways. These characteristics create failure modes that conventional resilience approaches might not address.

This post presents a framework for AI agent resilience risk analysis that applies to most AI developments and deployment architectures. We also explore practical strategies to help prevent, detect, and mitigate the most common resilience challenges when deploying and scaling AI agents.

Generative AI agent resilience risk dimensions

To identify resilience risks, we break down the generative AI agent systems into seven dimensions:

  • Foundation models – Foundation models (FMs) provide core reasoning and planning capabilities. Your deployment choice determines your resilience responsibilities and costs. The three deployment approaches are fully self-managed such as using Amazon Elastic Compute Cloud (Amazon EC2), server-based managed services such as using Amazon SageMaker AI, or serverless managed services such as Amazon Bedrock.
  • Agent orchestration – This component controls how multiple AI agents and tools coordinate to achieve complex goals, containing logic for tool selection, human escalation triggers, and multi-step workflow management.
  • Agent deployment infrastructure – The infrastructure encompasses the underlying hardware and system where agents run. The infrastructure options include using fully self-managed EC2 instances, managed services such as Amazon Elastic Container Services (Amazon ECS), and specialized managed services designed specifically for agent deployment, such as Amazon Bedrock AgentCore Runtime.
  • Knowledge base – The knowledge base includes vector database storage, embedding models, and data pipelines that create vector embeddings, forming the foundation for Retrieval Augmented Generation (RAG) applications. Amazon Bedrock Knowledge Bases supports fully managed RAG workflows.
  • Agent tools – This includes API tools, Model Context Protocol (MCP) servers, memory management, and prompt caching features that extend agent capabilities.
  • Security and compliance – This component encompasses user and agent security controls as well as content compliance monitoring, supporting proper authentication, authorization, and content validation. Security includes inbound authentication that manages users’ access to agents, and outbound authentication and authorization that manages agents’ access to other resources. Outbound authorization is more complex because agents might require their own identity. Amazon Bedrock AgentCore Identity is the identity and credential management service designed specifically for AI agents, providing inbound and outbound authentication and authorization capabilities. To help prevent compliance violations, organizations should establish comprehensive responsible AI policies. Amazon Bedrock Guardrails provides configurable safeguards for responsible AI policy implementation.
  • Evaluation and observability – These systems track metrics from basic infrastructure statistics to detailed AI-specific traces, including ongoing performance evaluation and detection of behavioral deviations. Agent evaluation and observability requires a combination of traditional system metrics and agent-specific signals, such as reasoning traces and tool invocation results.

The following diagram illustrates these dimensions.

This configuration provides visibility into agent applications, enabling subsequent sessions to deliver targeted resilience analysis and mitigation recommendations.

Top 5 resilience problems for agents and mitigation plans

The Resilience Analysis Framework defines fundamental failure modes that production systems should avoid. In this post, we identify generative AI agents’ five primary failure modes and provide strategies that can help establish resilient properties.

Shared fate

Shared fate occurs when a failure in one agent component cascades across system boundaries, affecting the entire agent. Fault isolation is the desired property. To achieve fault isolation, you must understand how agent components interact and identify their shared dependencies.

The relationship between FMs, knowledge bases, and agent orchestration requires clear isolation boundaries. For example, in RAG applications, knowledge bases might return irrelevant search results. Implementing guardrails with relevance checks can help prevent these query errors from cascading through the rest of the agent workflow.

Tools should align with fault isolation boundaries to contain impact in case of failure. When building custom tools, design each tool as its own containment domain. When using MCP servers or existing tools, make sure you use strict, versioned request/response schemas and validate them at the boundary. Add semantic validations such as date ranges, cross-field rules, and data freshness checks. Internal tools can also be deployed across different AWS Availability Zones for additional resilience.

At the orchestration dimension, implement circuit breakers that monitor failure rates and latency, activating when dependencies become unavailable. Set bounded retry limits with exponential backoff and jitter to control cost and contention. For connectivity resilience, implement robust JSON-RPC error mapping and per-call timeouts, and maintain healthy connection pools to tools, MCP servers, and downstream services. The orchestration dimension should also manage contract-compatible fallbacks—routing from a failed tool or MCP server to alternatives—while maintaining consistent schemas and providing degraded functionality.

When isolation boundaries fail, you can implement graceful degradation that maintains core functionality while advanced features become unavailable. Conduct resilience testing with AI-specific failure injection, such as simulating model inference failures or knowledge base inconsistencies, to test your isolation boundaries before problems occur in production.

Insufficient capacity

Excessive load can overwhelm even well-provisioned systems, potentially leading to performance degradation or system failure. Sufficient capacity makes sure your systems have the resources needed to handle both expected traffic patterns and unexpected surges in demand.

AI agent capacity planning involves demand forecasting, resource assessment, and quota analysis. The primary consideration when planning capacity is estimating Requests Per Minute (RPM) and Tokens Per Minute (TPM). However, estimating RPM and TPM presents unique challenges due to the stochastic nature of agents. AI agents typically use recursive processing, where the agent’s reasoning engine repeatedly calls the FMs until reaching final answers. This creates two major planning difficulties. First, the number of iterative calls is hard to predict because it’s based on task complexity and reasoning paths. Second, each call’s token length is also hard to predict because it includes the user prompt, system instructions, agent-generated reasoning steps, and conversation history. This compounding effect makes capacity planning for agents difficult.

Through heuristic analysis during development, teams can set a reasonable recursion limit to help prevent redundant loops and runaway resource consumption. Additionally, because agent outputs become inputs for subsequent recursions, managing maximum completion tokens helps control one component of the growing token consumption in recursive reasoning chains.

The following equations help translate agent configurations to these capacity estimates:

RPM = Average agent level thread per minute * average FM invocation per minute in one thread 
    = Average agent level thread per minute * (1 + 60/(max_completion_tokens/TPS))

Token per second (TPS) is different for each model, and can be found in model release documentation and open source benchmark results, such as artificial analysis.

TPM = RPM * Average input token length
    = RPM * (system prompt length + user prompt length + max_completion_tokens * (recursion_limit -1)/recursion_limit)

This calculation is assuming no prompt caching feature is implemented.

Unlike external tools where resilience is managed by third-party providers, internally developed tools rely on proper configuration by the development team to scale based on demand. When resource needs spike unexpectedly, only the affected tools require scaling.

For example, AWS Lambda functions can be converted to MCP-compatible tools using Amazon Bedrock AgentCore Gateway. If popular tools cause Lambda functions to reach capacity limits, you can increase the account-level concurrent execution limit or implement provisioned concurrency to handle the increased load.

For scenarios involving multiple action groups executing simultaneously, Lambda functions’ reserved concurrency controls provide essential resource isolation by allocating dedicated capacity to each action group. This helps prevent a single tool from consuming all available resources during orchestrated invocations, facilitating resource availability for high-priority functions.

When capacity limits are reached, you can use intelligent request queuing with priority-based allocation to make sure essential services continue operating. Implementing graceful degradation during high-load periods can be helpful. This maintains core functionality while temporarily reducing non-essential features.

Excessive latency

Excessive latency compromises user experience, reduces throughput, and undermines the practical value of AI agents in production. Agentic workload development requires balancing speed, cost, and accuracy. Accuracy is the cornerstone for AI agents to gain user trust. Achieving high accuracy requires allowing agents to perform multiple reasoning iterations, which inevitably creates latency challenges.

Managing user expectations becomes critical—establishing service level objective (SLO) metrics before project initiation sets realistic targets for agent response times. Teams should define specific latency thresholds for different agent capabilities, such as subsecond responses for simple queries vs. longer windows for analytical tasks requiring multiple tool interactions or extensive reasoning chains. Clear communication of the expected response times helps prevent user frustration and allows for appropriate system design decisions.

Prompt engineering offers the greatest opportunity for latency improvement by reducing unnecessary reasoning loops. Vague prompts take agents into extensive deliberation cycles, whereas clear instructions accelerate decision-making. Asking an agent to “approve if the use case is of strategic value” creates a complex reasoning chain. The agent must first define strategic value criteria, then evaluate which criteria apply, and finally determine significance thresholds. Conversely, clearly stating the criteria in the system prompt can largely reduce agent iterations. The following examples illustrate the difference between ambiguous and clear instructions.

The following is an example of an ambiguous agent instruction:

You are a generative AI use case approver. 
Your role is to evaluate GenAI agent build requests by carefully analyzing user-provided 
information and make approval decisions. Please follow the following instructions: 
<instructions>
1. Carefully analyze the information provided by the user, and collect use case information, 
such as use case sponsor, significance of the use case, and potential values that it can bring. 
2. Approve the use case if it has a senior sponsor and is of strategic value. 
</instructions>

The following is an example of a clear, well-defined agent instruction:

You are a generative AI use case approver. 
Your role is to evaluate Gen AI agent build requests by carefully analyzing user-provided 
information and make approval decisions based on specific criteria. 
Please strictly follow the following instructions: 
<instructions>
1. Carefully analyze the information provided by the user. Collect answers to the following questions:
<question_1>Does the use case have a business sponsor that is VP level and above? </question_1>
<question_2>What value is this agent expected to deliver? The answer can be in the form of 
number of hours per month saved on certain tasks, or additional revenue values.</question_2>
<question_3>If the use case is external customer facing, please provide supporting information 
on the demand. </question_3>
2. Evaluate the request against these approval criteria:
<criteria_1>The use case has business sponsor at VP level and above. This is a hard criteria. </criteria_1>
<criteria_2>The use case can bring significant $ value, calculated by productivity gain or 
revenue increase. This is a soft criteria. </criteria_2>
<criteria_3>Have strong proof that the use case/feature is demanded by customers. This is a 
soft criteria. </criteria_3>
3. Based on the evaluation, make a decision to approve or deny the use case.
- Approve: If the hard criterion is met, and at least one of the soft criteria is met. 
- Deny: The hard criterion is not met, or neither of the soft criteria is met. 
</instructions>

Prompt caching delivers substantial latency reductions by storing repeated prompt prefixes between requests. Amazon Bedrock prompt caching can reduce latency by up to 85% for supported models, particularly benefiting agents with long system prompts and contextual information that remains stable across sessions.

Asynchronous processing for agents and tools reduces latency by enabling parallel execution. Multi-agent workflows achieve dramatic speedups when independent agents execute in parallel rather than waiting for sequential completion. For agents with tools, asynchronous processing enables continued reasoning and preparation of subsequent actions while tools execute in the background, optimizing workflow by overlapping cognitive processing with I/O operations.

Security and compliance checks must minimize latency impact while maintaining protection across dimensions. Content moderation agents implement streaming compliance scanning that evaluates agent outputs during generation rather than waiting for complete responses, flagging potentially problematic content in real time while allowing safe content to flow through immediately.

Incorrect agent response

Correct output makes sure your AI agent performs reliably within its defined scope, delivering accurate and consistent responses that meet user expectations and business requirements. However, misconfiguration, software bugs, and model hallucinations can compromise output quality, leading to incorrect responses that undermine user trust.

To improve accuracy, use deterministic orchestration flows whenever possible. Letting agents rely on LLMs to improvise their way through tasks creates opportunities to deviation from your intended path. Instead, define explicit workflows that specify how agents should interact and sequence their operations. This structured approach reduces both inter-agent calling errors and tool-calling mistakes. Additionally, implementing input and output guardrails significantly enhances agent accuracy. Amazon Bedrock Guardrails can scan user input for compliance checks before model invocations, and provide output validation to detect hallucinations, harmful responses, sensitive information, and blocked topics.

When response quality issues occur, you can deploy human-in-the-loop validation for high-stakes decisions where accuracy is essential, and implement automatic retry mechanisms with refined prompts when initial responses don’t meet quality standards.

Single point of failure

Redundancy creates multiple paths to success by minimizing single points of failure that can cause system-wide impairments. Single points of failure undermine redundancy when multiple components depend on a single resource or service, creating vulnerabilities that bypass protective boundaries. Effective redundancy requires both redundant components and redundant pathways, making sure that if one component fails, alternative components can take over, and if one pathway becomes unavailable, traffic can flow through different routes.

Agents require coordinated redundancy for their FMs. If the models are self-managed, you can implement multi-Region model deployment with automated failover. When using managed services, Amazon Bedrock offers cross-Region inference to provide built-in redundancy for supported models, automatically routing requests to alternative AWS Regions when primary endpoints experience issues.

The agent tools dimension must coordinate tool redundancy to facilitate graceful degradation when primary tools become unavailable. Rather than failing entirely, the system should automatically route to alternative tools that provide similar functionality, even if they’re less sophisticated. For example, when the internal chat assistant’s knowledge base fails, it can fall back to a search tool to deliver alternative output to users.

Maintaining permission consistency across redundant environments is essential. This helps prevent security gaps during failover scenarios. Because overly permissive access controls pose significant security risks, it’s critical to validate that both end-user permissions and tool-level access rights are identical between primary and failover components. This consistency makes sure security boundaries are maintained regardless of which environment is actively serving requests, helping prevent privilege escalation or unauthorized access that could occur when systems switch between different permission models during operational transitions.

Operational excellence: Integrating traditional and AI-specific practices

Operational excellence in agentic AI integrates proven DevOps practices with AI-specific requirements for running agentic systems reliably in production. Continuous integration and continuous delivery (CI/CD) pipelines orchestrate the full agent lifecycle, and infrastructure as code (IaC) standardizes deployments across environments, reducing manual error and improving reproducibility.

Agent observability requires a combination of traditional metrics and agent-specific signals such as reasoning traces and tool invocation results. Although traditional system metrics and logs can be obtained from Amazon CloudWatch, agent-level tracing requires additional software build. The recently announced Amazon Bedrock AgentCore Observability (preview) supports OpenTelemetry to integrate agent telemetry data with existing observability services, including CloudWatch, Datadog, LangSmith, and Langfuse. For more details the Amazon Bedrock AgentCore Observability features, see Launching Amazon CloudWatch generative AI observability  (Preview).

Beyond monitoring, testing and validation of agents also extend beyond conventional software practices. Automated test suites such as promptfoo help development teams configure tests to evaluate reasoning quality, task completion, and dialogue coherence. Pre-deployment checks confirm tool connectivity and knowledge access, and fault injection simulates tool outages, API failures, and data inconsistencies to surface reasoning flaws before they affect users.

When issues arise, mitigation relies on playbooks covering both infrastructure-level and agent-specific issues. These playbooks support live sessions, enabling seamless handoffs to fallback agents or human operators without losing context.

Summary

In this post, we introduced a seven-dimension architecture model to map your AI agents and analyze where resilience risks emerge. We also identified five common failure modes related to AI agents, and their mitigation strategies.

These strategies illustrated how resilience principles apply to common agentic workloads, but they are not exhaustive. Each AI system has unique characteristics and dependencies. You must analyze your specific architecture across the seven risk dimensions to identify the resilience challenges within your own workloads, prioritizing areas based on user impact and business criticality rather than technical complexity.

Resilience represents an ongoing journey rather than a destination. As your AI agents evolve and handle new use cases, your resilience strategies must evolve accordingly. You can establish regular testing, monitoring, and improvement processes to make sure your AI systems remain resilient as they scale. For more information about generative AI agents and resilience on AWS, refer to the following resources:

Maximizing Business Value Through Strategic Cloud Optimization

Post Syndicated from Ryan Dsouza original https://aws.amazon.com/blogs/architecture/maximizing-business-value-through-strategic-cloud-optimization/

As cloud adoption continues to accelerate, organizations are realizing that the journey to the cloud is just the beginning. The real challenge—and opportunity—lies in optimizing cloud usage to drive maximum business value. At AWS, we’re committed to helping our customers navigate this journey successfully. Let’s explore some key insights and best practices for cloud optimization from the recent MIT Technology Review publication, Driving business value by optimizing the cloud.

The cloud optimization imperative

Recent data shows that global cloud infrastructure spending reached $84 billion in Q3 2024, marking a 23% year-over-year increase. This growth underscores the critical role of the cloud in driving business agility and innovation. However, to truly harness the power of the cloud, organizations must strike the right balance between cost, security, resilience, and innovation.

André Dufour, AWS Director and General Manager for AWS Cloud Optimization, emphasizes that cloud optimization involves making cloud spending efficient so that freed-up resources can be redirected to fund new innovations, such as generative AI initiatives.

Cloud optimization should be viewed as a continuous process rather than a one-time event, requiring regular assessment whenever business conditions or technical requirements change significantly. The approach should be comprehensive, addressing not just costs but all six pillars (operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability), while also recognizing that different workloads require tailored optimization strategies rather than a one-size-fits-all approach.

Best practices for success

Consider the following best practices:

  • Upskill your team – Empower your employees with cloud, cost management, and optimization skills. As Dufour notes, “Every engineer or builder plays a role in cloud optimization.”
  • Establish a cloud center of excellence – Create a centralized body responsible for developing and distributing cloud best practices throughout your organization.
  • Align finance and business – Make cloud KPIs business-centric rather than purely technical, so cloud optimization efforts support overall business goals.
  • Embrace automation – Use tools to automate cloud provisioning, monitoring, and optimization, reducing human error and effort.
  • Use AI services and solutions for efficiency – Use AI technologies to automate visualization, enhance decision-making, and optimize resource utilization.

Real-world success stories

Our customers are already seeing significant benefits from strategic cloud optimization:

  • DreamCasino achieved 30% cost savings and a 50% reduction in API response times, enabling expansion into new markets
  • BMC Software reduced cloud costs by 25% while improving security and reliability, reinvesting savings into new business opportunities
  • Even within AWS, our use of Amazon Q for application modernization saved an estimated 4,500 years of development work and $260 million in performance benefits

Business impact

Effective cloud optimization delivers more than just cost savings. It enables the following:

  • Faster innovation through reinvestment of saved resources
  • Enhanced security and operational efficiency
  • Improved ability to scale and adapt to business needs
  • Better customer experiences and faster time-to-market
  • The capability to make informed architecture and design decisions by balancing trade-offs across AWS Well-Architected pillars

AWS resources for your optimization journey

To help you accelerate your cloud optimization efforts, AWS provides several tools and resources:

Getting started

To get started, consider the following steps:

Additionally, you can engage with the AWS Cloud Optimization Success (COS) team for more detailed guidance and to help identify what to do next in your cloud optimization journey. The COS team has Solutions Architects who specialize in the Cloud Adoption Framework and Well-Architected Framework and deliver workshops and training sessions though customer and partner engagements. The team can help drive adoption of AWS services through the use of the Well-Architected and Cloud Adoption Frameworks and support other services like AWS Trusted Advisor and AWS Health to optimize cost and cloud architectures. Whether you’re just starting or looking to enhance existing implementations, the AWS COS team provides the guidance, tools, and expertise you need to succeed.

Conclusion

At AWS, we’re dedicated to helping you optimize your cloud journey. By implementing these strategies and best practices, you can unlock the full potential of the cloud, driving innovation and growth while maintaining security and operational excellence.

Ready to take your cloud optimization to the next level? Refer to the resources included in this post and contact your AWS COS team to learn how we can help you maximize the value of your cloud investments.


About the Authors

Embracing event driven architecture to enhance resilience of data solutions built on Amazon SageMaker

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/embracing-event-driven-architecture-to-enhance-resilience-of-data-solutions-built-on-amazon-sagemaker/

Amazon Web Services (AWS) customers value business continuity while building modern data governance solutions. A resilient data solution helps maximize business continuity by minimizing solution downtime and making sure that critical information remains accessible to users. This post provides guidance on how you can use event driven architecture to enhance the resiliency of data solutions built on the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI. SageMaker is a managed service with high availability and durability. If customers want to build a backup and recovery system on their end, we show you how to do this in this blog. It provides three design principles to improve the data solution resiliency of your organization. In addition, it contains guidance to formulate a robust disaster recovery strategy based on event driven architecture. It contains code samples to back up the system metadata of your data solution built on SageMaker, enabling disaster recovery.

The AWS Well-Architected Framework defines resilience as the ability of a system to recover from infrastructure or service disruptions. You can enhance the resiliency of your data solution by adopting three design principles that are highlighted in this post and by establishing a robust disaster recovery strategy. Recovery point objective (RPO) and recovery time objective (RTO) are industry standard metrics to measure the resilience of a system. RPO indicates how much data loss your organization can accept in case of solution failure. RTO refers to the time for the solution to recover after failure. You can measure these metrics in seconds, minutes, hours, or days. The next section discusses how you can align your data solution resiliency strategy to meet the needs of your organization.

Formulating a strategy to enhance data solution resilience

To develop a robust resiliency strategy for your data solution built on SageMaker, start with how users interact with the data solution. The user interaction influences the data solution architecture, the degree of automation, and determines your resiliency strategy. Here are a few aspects you might consider while designing the resiliency of your data solution.

  • Data solution architecture – The data solution of your organization might follow a centralized, decentralized, or hybrid architecture. This architecture pattern reflects the distribution of responsibilities of the data solution based on the data strategy of your organization. This shift in responsibilities is reflected in the structure of the teams that perform activities in the Amazon DataZone data portal, SageMaker Unified Studio portal, AWS Management Console, and underlying infrastructure. Examples of such activities include configuring and running the data sources, publishing data assets in the data catalog, subscribing to data assets, and assigning members to projects.
  • User persona – The user persona, their data, and cloud maturity influence their preferences for interacting with the data solution. The users of a data governance solution fall into two categories: business users and technical users. Business users of your organization might include data owners, data stewards, and data analysts. They might find the Amazon DataZone data portal and SageMaker Unified Studio portal more convenient for tasks such as approving or rejecting subscription requests and performing one-time queries. Technical users such as data solution administrators, data engineers, and data scientists might opt for automation when making system changes. Examples of such activities include publishing data assets, managing glossary and metadata forms in the Amazon DataZone data portal or in SageMaker Unified Studio portal. A robust resiliency strategy accounts for tasks performed by both user groups.
  • Empowerment of self-service – The data strategy of your organization determines autonomy granted to the users. Increased user autonomy demands a high level of abstraction of the cloud infrastructure powering the data solution. SageMaker empowers self-service by enabling users to perform regular data management activities in the Amazon DataZone data portal and in the SageMaker Unified Studio portal. The level of self-service maturity of the data solution depends on the data strategy and user maturity of your organization. At an early stage, you might limit the self-service features to the use cases for onboarding the data solution. As the data solution scales, consider increasing the self-service capabilities. See Data Mesh Strategy Framework to learn about the different phases of a data mesh-based data solution.

Adopt the following design principles to enhance the resiliency of your data solution:

  • Choose serverless services – Use serverless AWS services to build your data solution. Serverless services scale automatically with increasing system load, provide fault isolation, and have built-in high-availability. Serverless services minimize the need for infrastructure management, reducing the need to design resiliency into the infrastructure. SageMaker seamlessly integrates with several serverless services such Amazon Simple Storage Service (Amazon S3), AWS Glue, AWS Lake Formation, and Amazon Athena.
  • Document system metadata – Document the system metadata of your data solution using infrastructure-as-code (IaC) and automation. Consider how users interact with the data solution. If the users prefer to perform certain activities through the Amazon DataZone data portal and SageMaker Unified Studio portal, implement automation to capture and store the metadata that’s relevant for disaster recovery. Use Amazon Relational Database Service (Amazon RDS) and Amazon DynamoDB to store the system metadata of your data solution.
  • Monitor system health – Implement a monitoring and alerting solution for your data solution so that you can respond to service interruptions and initiate the recovery process. Make sure that system activities are logged so that you can troubleshoot the system interruption. Amazon CloudWatch helps you monitor AWS resources and the applications you run on AWS in real time.

The next section presents disaster recovery strategies to recover your data solution built on SageMaker.

Disaster recovery strategies

Disaster recovery focuses on one-time recovery objectives in response to natural disasters, large-scale technical failures, or human threats such as attack or error. Disaster recovery is a crucial part of your business continuity plan. As shown in the following figure, AWS offers the following options for disaster recovery: Backup and restore, pilot light, warm standby, and multi-site active/active.

The business continuity requirements and cost of recovery should guide your organization’s disaster recovery strategy. As a general guideline, the recovery cost of your data solution increases with reduced RPO and RTO requirements. The next section provides architecture patterns to implement a robust backup and recovery solution for a data solution built on SageMaker.

Solution overview

This section provides event-driven architecture patterns following the backup and restore approach to enhance resiliency of your data solution. This active/passive strategy-based solution stores the system metadata in a DynamoDB table. You can use the system metadata to restore your data solution. The following architecture patterns provide regional resilience. You can simplify the architecture of this solution to restore data in a single AWS Region.

Pattern 1: Point-in-time backup

The point-in-time backup captures and stores system metadata of a data solution built on SageMaker when a user or an automation performs an action. In this pattern, a user activity or an automation initiates an event that captures the system metadata. This pattern is suited for low RPO requirements, ranging from seconds to minutes. The following architecture diagram shows the solution for the point-in-time backup process.

Architecture point-in-time-backup

The steps comprise the following.

  1. User or automation performs an activity on an Amazon DataZone domain or Amazon Unified Studio domain.
  2. This activity creates a new event in AWS CloudTrail.
  3. The CloudTrail event is sent to Amazon EventBridge. Alternatively, you can use Amazon DataZone as the event source for the EventBridge rule.
  4. AWS Lambda transforms and stores this event in a DynamoDB global table where the Amazon DataZone domain is hosted.
  5. The information is replicated into the replica DynamoDB table in a secondary Region. The replica DynamoDB table can be used to restore the data solution based on SageMaker in the secondary Region.

Pattern 2: Scheduled backup

The scheduled backup captures and stores system metadata of a data solution built on SageMaker at regular intervals. In this pattern, an event is initiated based on a defined time schedule. This pattern is suited for RPO requirements in the order of hours. The following architecture diagram displays the solution for point-in-time backup process.

The steps comprise the following.

  1. EventBridge triggers an event at regular interval and sends this event to AWS Step Functions.
  2. The Step Functions state machine contains multiple Lambda functions. These Lambda functions get the system metadata from either a SageMaker Unified Studio domain or an Amazon DataZone domain.
  3. The system metadata is stored in an DynamoDB global table in the primary Region where the Amazon DataZone domain is hosted.
  4. The information is replicated into the replica DynamoDB table in a secondary Region. The data solution can be restored in the secondary Region using the replica DynamoDB table.

The next section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern. This code sample stores asset information of a data solution built on a SageMaker Unified Studio domain and an Amazon DataZone domain in an DynamoDB global table. The data in the DynamoDB table is encrypted at rest using a customer managed key stored in AWS Key Management Service (AWS KMS). A multi-Region replica key encrypts the data in the secondary Region. The asset uses the data lake blueprint that contains the definition for launching and configuring a set of services (AWS Glue, Lake Formation, and Athena) to publish and use data lake assets in the business data catalog. The code sample uses the AWS Cloud Development Kit (AWS CDK) to deploy the cloud infrastructure.

Prerequisites

  • An active AWS account.
  • AWS administrator credentials for the central governance account in your development environment
  • AWS Command Line Interface (AWS CLI) installed to manage your AWS services from the command line (recommended)
  • Node.js and Node Package Manager (npm) installed to manage AWS CDK applications
  • AWS CDK Toolkit installed globally in your development environment by using npm, to synthesize and deploy AWS CDK applications
npm install -g aws-cdk
  • TypeScript installed in your development environment or installed globally by using npm compiler:
npm install -g typescript
  • Docker installed in your development environment (recommended)
  • An integrated development environment (IDE) or text editor with support for Python and TypeScript (recommended)

Walkthrough for data solutions built on a SageMaker Unified Studio domain

This section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on a SageMaker Unfied Studio domain.

Set up SageMaker Unified Studio

  1. Sign into the IAM console. Create an IAM role that trusts Lambda with the following policy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "datazone:Search",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "dynamodb:PutItem"
            ],
            "Resource": "arn:aws:dynamodb:<AWS_REGION>:<AWS_ACCOUNT>:table/*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:GenerateDataKey",
                "kms:ReEncrypt*",
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:<AWS_REGION>:<AWS_ACCOUNT>:key/<KMS_KEY_ID>"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*:log-stream:*",
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*"
            ]
        }
    ]
}
  1. Note down the Amazon Resource Name (ARN) of the Lambda role. Navigate to SageMaker and choose Create a Unified Studio domain.
  2. Select Quick setup and expand the Quick setup settings section. Enter a domain name, for example, CORP-DEV-SMUS. Select the Virtual private cloud (VPC) and Subnets. Choose Continue.
  3. Enter the email address of the SageMaker Unified Studio user in the Create IAM Identity Center user section. Choose Create domain.
  4. After the domain is created, choose Open unified studio in the top right corner. Screenshot open-smus
  5. Sign in to SageMaker Unified Studio using the single sign-on (SSO) credentials of your user. Choose Create project at the top right corner. Enter a project name and description, choose Continue twice, and choose Create project. Wait unti project creation is complete. Screenshot create-smus-project
  6. After the project is created, go into the project by selecting the project name. Select Query Editor from the Build drop-down menu on the top left. Paste the following create table as select (CTAS) query script in the query editor window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.
CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561Screenshot create-smus-asset
  1. Navigate to Data sources from the Project. Choose Run in the Actions section next to the project.default_lakehouse connection. Wait until the run is complete.Screeshot run-smus-data-source
  2. Navigate to Assets in the left side bar. Select the mkt_sls_table in the Inventory section and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.Screenshot smus-assets
  3. Choose Publish Asset to publish the mkt_sls_table table to the business data catalog, making it discoverable and understandable across your organization.
  4. Choose Members in the navigation pane. Choose Add members and select the IAM role you created in Step 1. Add the role as a Contributor in the project.

Deployment steps

After setting up SageMaker Unified Studio, use the AWS CDK stack provided on GitHub to deploy the solution to back up the asset metadata that is created in the previous section.

  1. Clone the repository from GitHub to your preferred integrated development environment (IDE) using the following commands.
git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd sample-event-driven-resilience-data-solutions-sagemaker
  1. Export AWS credentials and the primary Region to your development environment for the IAM role with administrative permissions, use the following format
export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use AWS Secrets Manager or AWS Systems Manager Parameter Store to manage credentials. Automate the deployment process using a continuous integration and delivery (CI/CD) pipeline.

  1. Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command.
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd unified-studio
  1. Modify the following parameters in the config/Config.ts file.
SMUS_APPLICATION_NAME – Name of the application.
SMUS_SECONDARY_REGION – Secondary AWS region for backup.
SMUS_BACKUP_INTERVAL_MINUTES – Minutes before each backup interval. 
SMUS_STAGE_NAME – Name of the stage. 
SMUS_DOMAIN_ID – Domain identifier of the Amazon SageMaker Unified Studio. 
SMUS_PROJECT_ID – Project identifier of the Amazon SageMaker Unified Studio. 
SMUS_ASSETS_REGISTRAR_ROLE_ARN – ARN of the AWS Lambda role created in step 1 of the preceding section. 
  1. Install the dependencies by running the following command:

npm install

  1. Synthesize the CloudFormation template by running the following command.

cdk synth

  1. Deploy the solution by running the following command.

cdk deploy –all

  1. After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.

When deployment is complete, wait for the duration of DZ_BACKUP_INTERVAL_MINUTES. Navigate to the <DZ_APPLICATION_NAME >AssetsInfo DynamoDB table. Retrieve the data from the DynamoDB table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region.Screenshot smus-dynamodb

Clean up

Use the following steps to clean up the resources deployed.

  1. Empty the S3 buckets that were created as part of this deployment.
  2. In your local development environment (Linux or macOS):
  3. Navigate to the unified-studio directory of your repository.
  4. Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
  5. To destroy the cloud resources, run the following command:

cdk destroy --all

  1. Go to the SageMaker Unified Studio and delete the published data assets that were created in the project.
  2. Use the console to delete the SageMaker Unified Studio domain.

Walkthrough for data solutions built on an Amazon DataZone domain

This section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on an Amazon DataZone domain.

Deployment steps

After completing the prerequisites, use the AWS CDK stack provided on GitHub to deploy the solution to backup system metadata of the data solution built on Amazon DataZone domain

  1. Clone the repository from GitHub to your preferred IDE using the following commands.
git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd event-driven-resilience-sagemaker
  1. Export AWS credentials and the primary Region information to your development environment for the AWS Identity and Access Management (IAM) role with administrative permissions, use the following format:
export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use Secrets Manager or Systems Manager Parameter Store to manage credentials. Automate the deployment process using a CI/CD pipeline.

  1. Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command:
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd datazone
  1. From the console for IAM, note the Amazon Resource Name (ARN) of the CDK execution role. Update the trust relationship of the IAM role so that Lambda can assume the role.
  1. Modify the following parameters in the config/Config.ts file.
DZ_APPLICATION_NAME – Name of the application.
DZ_SECONDARY_REGION – Secondary Region for backup.
DZ_BACKUP_INTERVAL_MINUTES – Minutes before each backup interval.
DZ_STAGE_NAME – Name of the stage (dev, qa, or prod).
DZ_DOMAIN_NAME – Name of the Amazon DataZone domain
DZ_DOMAIN_DESCRIPTION – Description of the Amazon DataZone domain
DZ_DOMAIN_TAG – Tag of the Amazon DataZone domain
DZ_PROJECT_NAME – Name of the Amazon DataZone project
DZ_PROJECT_DESCRIPTION – Description of the Amazon DataZone project
CDK_EXEC_ROLE_ARN – ARN of the CDK execution role
DZ_ADMIN_ROLE_ARN – ARN of the administrator role
  1. Install the dependencies by running the following command:

npm install

  1. Synthesize the AWS CloudFormation template by running the following command:

cdk synth

  1. Deploy the solution by running the following command:

cdk deploy --all

  1. After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.

Document system metadata

This section provides instructions to create an asset and demonstrates how you can retrive the metadata of the asset. Perform the following steps to retrieve the systems metadata.

  1. Sign in to the Amazon DataZone data portal from the console. Select the project and choose Query data at the upper right.

Screenshot datazone-open-query

  1. Choose Open Athena and make sure that <DZ_PROJECT_NAME>_DataLakeEnvironment is selected in the Amazon DataZone environment dropdown at the upper right and that on the left, and that <DZ_PROJECT_NAME>_datalakeenvironment_pub_db is selected as the Database.
  2. Create a new AWS Glue table for publishing to Amazon DataZone. Paste the following create table as select (CTAS) query script in the Query window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.
CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561Screenshot datazone-run-query
  1. Go to the Tables and Views section and verify that the mkt_sls_table table was successfully created.
  2. In the Amazon DataZone Data Portal, go to Data sources, select the <DZ_PROJECT_NAME>-DataLakeEnvironment-default-datasource, and choose Run. The mkt_sls_table will be listed in the inventory and available to publish.Screenshot run-data-source
  3. Select the mkt_sls_table table and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.Screeshot publish-data-asset
  4. Choose Publish Asset and the mkt_sls_table table will be published to the business data catalog, making it discoverable and understandable across your organization.
  5. After the table is published, wait for the duration of DZ_BACKUP_INTERVAL_MINUTES. Navigate to the <DZ_APPLICATION_NAME >AssetsInfo DynamoDB table and retrieve the data from the table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region.Screenshot datazone-dynamodb

Clean up

Use the following steps to clean up the resources deployed.

  1. Empty the Amazon Simple Storage Service (Amazon S3) buckets that were created as part of this deployment.
  2. Go to the Amazon DataZone domain portal and delete the published data assets that were created in the Amazon DataZone project.
  3. In your local development environment (Linux or macOS):
  • Navigate to the datazone directory of your repository.
  • Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
  • To destroy the cloud resources, run the following command:

cdk destroy --all

Conclusion

This post explores how to build a resilient data governance solution on Amazon SageMaker. Resilient design principles and a robust disaster recovery strategy are central to the business continuity of AWS customers. The code samples included in this post implement a backup process of the data solution at regular time interval. They store the Amazon SageMaker asset information in Amazon DynamoDB Global tables. You can extend the backup solution by identifying the system metadata that is relevant for the data solution of your organization and by using Amazon SageMaker APIs to capture and store the metadata. The DynamoDB Global table replicates the changes in the DynamoDB table in the primary region to the secondary region in an asynchronous manner. Consider Implementing an additional layer of resiliency by using AWS Backup to back up the DynamoDB table at regular interval. In the next post, we show how you can use the system metadata to restore your data solution in the secondary region.

Adopt the resiliency features offered by Amazon DataZone and Amazon SageMaker Unified Studio. Use AWS Resilience Hub to assess the resilience of your data solution. AWS Resilience Hub helps you to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the AWS Well-Architected Framework.

To build a data mesh based data solution using Amazon DataZone domain, see our GitHub repository. This open source project provides a step-by-step blueprint for constructing a data mesh architecture using the powerful capabilities of Amazon SageMaker, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.


About the authors

BDB-4558-DhrubaDhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data governance, and artificial intelligence at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure cloud solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Empower your teams with modern architecture governance

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/empower-your-teams-with-modern-architecture-governance/

Agile product teams thrive on autonomy and rapid iteration, especially in the cloud where they can quickly deploy and test systems. However, traditional architecture governance often stands in their way, because many enterprises still impose centralized, one-off architecture signoffs early in the process. Historically, these signoffs verified design compliance with corporate standards in a slower, on-premises world. In cloud environments, such signoffs quickly become obsolete—along with their associated architectural documents—and discourage teams from considering new insights.

Modern cloud architectures demand a new governance approach. In this post, we show how collaborative architecture oversight can transform team performance through automation, self-service platforms, and distributed decision-making. We explore how key stakeholders (developers, architects, security specialists, and shared services teams) can participate in architectural decisions through asynchronous approval workflows, while making sure non-negotiable controls such as encryption at rest and in transit are consistently enforced through automation and policy as code. This approach empowers teams to experiment and adapt quickly while maintaining robust enterprise standards.

The promise of traditional architecture signoffs

Traditional architecture governance centers around formal reviews where teams submit detailed design documents to a central architecture board. These artifacts often include comprehensive diagrams, technology selections, security plans, and integration specifications. Architects and a variety of stakeholders such as security specialists, compliance officers, quality assurance, and operations teams review these documents in scheduled meetings before issuing a signoff. These approvals represent point-in-time validations against enterprise standards, assuming minimal deviation during implementation.

Why traditional signoffs fall short

Signoffs can create challenges in modern cloud architectures:

  • Substituting for continuous compliance when automated verification is missing, creating false assurance through one-off reviews
  • Creating a restrictive “check-the-box” mentality where meeting minimum documented requirements becomes the goal instead of exploring the best solutions
  • Removing decision authority from implementation teams who often have the most contextual knowledge
  • Delaying implementation feedback loops and reducing organizational agility

Consider an agile team responsible for a strategic cloud application that’s moved beyond minimal viable product and is now scaling to support growing business demands. The system architecture must evolve to handle increasing data volumes, performance requirements, and unanticipated integrations. However, corporate stakeholders insist on rigid adherence to the originally approved architecture documents. Although governance is essential for production systems, this inflexible approach with an early architecture signoff prevents the team from implementing architectural improvements as they go. What appears as meaningful control to stakeholders becomes a stifling constraint for builders, ultimately compromising the system stability the governance process aims to protect.

Modern cloud architecture support

A modern architecture function operates around evolving capabilities across three core areas: preapproved blueprints, distributed governance, and automated insights—with traditional signoffs reserved as an exception path for unique use cases.

A modern architecture function operates around evolving capabilities across three core areas: preapproved blueprints, distributed governance, and automated insights—with traditional signoffs reserved as an exception path for unique use cases.

Preapproved blueprints

Preapproved blueprints like reference architectures and code templates enable your teams to move faster while maintaining corporate standards. This approach supports a use-case-focused assessment model, where architects can concentrate on evaluating specific workflows or threats relevant to a unique use case—rather than having to understand the entire system or threat model from scratch. In this way, the architecture function shifts to managing by exception and refocus reviews on deviations from the standard. Blueprints should have gravity, guiding teams towards standardized patterns while preventing fragmentation through too many tools, databases, middleware options, or SDKs. Consider the following:

  • Pattern-based reference architectures – These set clear principles for security and resilience without micromanaging. These standards align teams while allowing innovation within a reliable framework. The cloud-driven enterprise transformation at BMW Group exemplifies how moving from signoffs to enablement through pattern-based architectures can be successful.
  • Self-service platforms – These provide standardized resources that empower teams to build independently. A self-service platform with preapproved templates for deployment toolchains and infrastructure code enables confident and rapid development. Most companies host these on internal developer platforms like Backstage or AWS Service Catalog. This also allows controlled changes to the blueprints and to track their adoption.
  • Blueprint lifecycle – Blueprints require their own approval process. Although this creates significant efficiency by reducing individual system reviews, it introduces the challenge of managing existing deployments when patterns are updated. Include versioning and migration strategies when introducing new blueprints.

Distributed governance

Distributed governance treats architectural decision-making as a continuous, collaborative process with clear accountability, delegating decision-making and empowering your builders within established blueprints. Consider the following:

  • Architecture decision records (ADRs) – These replace formal, one-off signoffs by documenting decisions for each build iteration. This approach promotes transparency and maintains team agility without compromising accountability for decisions and approvals with key stakeholders. It also allows teams to defer decisions until they are most relevant. For practical implementation guidance, see Using architectural decision records to streamline decision-making for a software development project. To learn about how to write concise ADRs and avoid duplication, consult the ADR templates GitHub repository and When Should I Write an Architecture Decision Record.
  • Community-driven consultations – Architecture departments can foster self-organization by creating architecture communities of practice for peer knowledge-sharing. These communities enable collaboration on best practices, challenges, and standards, cultivating a culture of distributed decision-making, without eliminating the lines of responsibility and accountability for final decisions. This approach works because deep architectural expertise often resides with builders who have hands-on experience with specific use cases and technologies. The role of the architecture function shifts towards providing the necessary infrastructure and identifying thought leaders in the organization.

Automated insights

Automated insights enable compliance with corporate standards through real-time monitoring and adaptation:

Comparison

The modern view improves the overall value-add and support model of your architecture function. The following table compares the traditional and modern views.

Aspect Traditional View Modern View
Purpose Centralized signoff to enforce control and reduce risk Empower teams with preapproved standards to prevent sprawl and manage their distribution such as through Backstage
Architecture approach Fixed, one-time design Evolving, treated as a parametrized, reusable code product refined through feedback
Team empowerment Limited, decisions approved by centralized authority High, teams make decisions within clear standards
Team speed and agility Slower, due to dependency on signoff Faster, continuous iteration without waiting for approvals
Risk management Early signoff to lock in decisions and reduce uncertainty Risk managed through continuous control validation with automated evidence collection, providing stronger assurance for second and third lines of defense than point-in-time assessments
Compliance Manual checks by experts Automated through policy as code and AI tools
Transparency Limited, focused on approval documentation High, lightweight decision records for technical stakeholders and visualizations or dashboards for non-technical oversight functions
Collaboration Centralized control, limited cross-team collaboration Peer-led communities (collective governance) such as Security Guardians
Innovation Restricted, focus on following signed-off designs Encouraged, teams explore within a standards-based framework

Despite the benefits, many organizations struggle to let go of signoffs for a number of reasons, including:

  • Cultural resistance – In risk-averse cultures where failing fast is not accepted, leaders hesitate to let go of centralized control mechanisms.
  • Compliance concerns – In regulated industries, centralized approvals serve as control gates. The modern view replaces point-in-time trust with continuous compliance mechanisms—automated guardrails, real-time monitoring, and evidence collection—enabling even highly regulated environments to achieve compliance with small, autonomous teams operating within clear boundaries (“two-pizza team”).
  • Lack of infrastructure – Some organizations lack self-service platforms, automated compliance, or observability, so they fall back on signoff to manage risk.
  • Governance concerns – Traditional teams often view distributed decision-making as no governance rather than transformed governance.

The modern view offers significant benefits, though with governance considerations:

  • Speed and flexibility – Teams move faster without waiting for approvals, deploying AWS resources iteratively.
  • Empowerment and ownership – Builders using standards and ADRs feel accountable and actively shape architecture.
  • Innovation and experimentation – Self-service tools and AI guidance foster experimentation without delays.

Conclusion

You can empower your builders by rethinking your architecture signoff. In the modern view discussed in this post, architecture governance aligns with the pace and flexibility of the cloud, allowing teams to innovate within a shared framework. This approach values standards and autonomy over control, and transforms your architecture function into a strategic partner in a fast-evolving landscape.

To learn how to establish and maintain cloud-centered principles and patterns, refer to the platform architecture chapter of the AWS Cloud Adoption Framework and the AWS Culture of Security resources.

Related resources


About the author

Announcing the AWS Well-Architected Generative AI Lens

Post Syndicated from Dan Ferguson original https://aws.amazon.com/blogs/architecture/announcing-the-aws-well-architected-generative-ai-lens/

We are delighted to introduce the new AWS Well-Architected Generative AI Lens. The AWS Well-Architected Framework provides architectural best practices for designing and operating generative AI workloads on AWS. The Generative AI Lens uses the Well-Architected Framework to outline the steps for performing a Well-Architected Framework Review for your generative AI workloads.

The Generative AI Lens provides a consistent approach for customers to evaluate architectures that use large language models (LLMs) to achieve their business goals. This lens addresses common considerations relevant to model selection, prompt engineering, model customization, workload integration, and continuous improvement. Specifically excluded from this lens are best practices associated with model training and advanced model customization techniques. We identify best practices that help you architect your cloud-based applications and workloads according to AWS Well-Architected design principles gathered from supporting thousands of customer implementations.

The Generative AI Lens joins a collection of Well-Architected lenses published under AWS Well-Architected Lenses.

What is the Generative AI Lens?

The Well-Architected Generative AI Lens focuses on the six pillars of the Well-Architected Framework across six phases of the generative AI lifecycle, as illustrated in the following figure.

The six phases are:

  1. Scoping the impact of generative AI in solving your problem.
  2. Selecting a model that sufficiently addresses the task.
  3. Customizing the model with prompts, data sources, or updated weights to improve performance.
  4. Integrating the model into your existing applications.
  5. Deploying the new generative AI capability into your environment.
  6. Iterating and improving on the generative AI capabilities you have released.

Unlike the traditional waterfall approach, an iterative approach is required to achieve a working prototype based on the six phases of the generative AI lifecycle. The lens provides you with a set of established cloud-agnostic best practices in the form of Well-Architected Framework pillars for each generative AI lifecycle phase.

You can also use the Well-Architected Generative AI Lens wherever you are on your cloud journey. You can choose to apply this guidance either during the design of your generative AI workloads or after your workloads have entered production as a part of the continuous improvement process.

What’s else is discussed in the Generative AI Lens?

The Generative AI Lens also discusses the following key topics:

  • Responsible AI – Responsible implementation of generative AI workloads is discussed in this paper. We describe some of the common considerations facing customers as they address the responsible implementation and deployment of generative AI.
  • Data architecture for generative AI – At the core of any AI workload is data. We feature a brief survey on the nuances of data architectures with regards to generative AI workloads.

Who should use the Generative AI Lens?

The Generative AI Lens is of use to many roles. Business leaders can use this lens to acquire a broader appreciation of the end-to-end implementation and benefits of generative AI. Data scientists and engineers can read this lens to understand how to use, secure, and gain insights from their data at scale. Risk and compliance leaders can understand how generative AI is implemented responsibly by providing compliance with regulatory and governance requirements.

Generative AI Lens components

The lens includes four focus areas:

  • The Well-Architected Generative AI Lens design principles – Design principles are the guidelines and value statements that frame the presented best practices.
  • The Generative AI lifecycle and the Well Architected Framework pillars – This considers all aspects of the generative AI lifecycle and reviews design strategies to align to the pillars of the overall Well-Architected Framework:
    • Operational excellence – Ability to support ongoing development, run operational workloads effectively, gain insight into your operations, and continuously improve supporting processes and procedures to deliver business value.
    • Security – Ability to protect data, systems, and assets, and to take advantage of cloud technologies to improve your security.
    • Reliability – Ability of a workload to perform its intended function correctly and consistently, and to automatically recover from failure situations.
    • Performance efficiency – Ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as system demand changes and technologies evolve.
    • Cost optimization – Ability to run systems to deliver business value at the lowest price point.
    • Sustainability – Addresses the long-term environmental, economic, and societal impact of your business activities.
  • Cloud-agnostic best practices – These are best practices for each generative AI lifecycle phase across the Well-Architected Framework pillars irrespective of your technology setting. The best practices are accompanied by:
    • Implementation guidance – The AWS implementation plans for each best practice with references to AWS technologies and resources.
    • Resources – A set of links to AWS documents, blogs, videos, and code examples as supporting resources to the best practices and their implementation plans.
  • Related generative AI architecture considerations – This includes discussions on the generative AI application lifecycle, and where the listed best practices in this lens could fit into the lifecycle. Additionally, we discuss elements of data architecture for generative AI workloads, and Well-Architected considerations for responsible AI.

What are the next steps?

The new Well-Architected Generative AI Lens is available now. Use the lens to make sure that your generative AI workloads are architected with operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability in mind.

If you require support on the implementation or assessment of your generative AI workloads, please contact your AWS Solutions Architect or Account Representative.

Special thanks to everyone across the AWS Solution Architecture, AWS Professional Services, and Machine Learning communities who contributed to the Generative AI Lens. These contributions encompassed diverse perspectives, expertise, backgrounds, and experiences in developing the new AWS Well-Architected Generative AI Lens.

For additional reading, refer to the AWS Well-Architected Framework and pillar whitepapers, or use the AWS Well-Architected Machine Learning Lens and its custom lens accessible from the AWS Well-Architected Tool.


About the authors

Build and operate an effective architecture review board

Post Syndicated from Darrin Weber original https://aws.amazon.com/blogs/architecture/build-and-operate-an-effective-architecture-review-board/

The rapid change of pace in computing landscapes because of cloud, artificial intelligence, and technology innovation has challenged organizations to keep up while making sure that their initiatives and projects remain compliant with enterprise guidelines and policies. An effective architecture review board (ARB) can help an organization maintain compliance with enterprise guardrails while accelerating implementation of initiatives in their project pipeline.

In this post, we identify the components of an efficient architecture review process, define what an ARB is, and describe how to build and operate an effective enterprise ARB.

What is an architecture review board?

An ARB is a multi-disciplinary team responsible for reviewing solution architectures to help ensure compliance with enterprise guidelines, best practices, and supportability. Team members include stakeholders from different disciplines throughout your organization, which typically include Security, Development, Enterprise Architecture, Infrastructure, and Operations. Including a broad set of stakeholders reduces the amount of project recycle that happens when stakeholder representation is overlooked.

An ARB isn’t a standalone group, it operates within the context of your project implementation process, reviewing solution architectures, custom development, and purchased solutions to maintain enterprise compliance and alignment with goals. As shown in the following diagram, architecture review typically occurs after the design phase—before a build or purchase decision—and again before deployment to validate that the reviewed architecture matches the solution that was built.

Project implementation process with architecture review checkpoints

Most organizations recognize the benefits and value of establishing an ARB. However, they often struggle to define and operate one in a manner that maximizes the benefits, integrates with overall project execution processes, and satisfies the needs of all the stakeholders. An efficient architecture review process imparts organizational benefits such as reduced costs, minimized security events, and diminished technical debt.

Life without a formal architecture review process

One of the most pronounced issues with implementing and maintaining software architecture is the difficulty in achieving human consensus. In any organization, you’ll find a diverse range of team members—each with their own priorities, perspectives, and pain points. Without a formalized review process, these differences can lead to prolonged debates and stalled projects. We often find that many members tend to fall into one of these personas:

The Not Invented Here The Not Invented Here – This individual doesn’t trust any software unless it was built and operated by members of their company. They’re generally wary of any cloud solution and will expend development time to avoid capital expenditure.
The Wait a Minute The Wait a Minute – This individual has good feedback and their input is welcome, but they tend to wait until the last minute before providing any feedback, making it difficult to have productive conversations and act on any constructive criticism.
The Bottleneck The Bottleneck – This individual craves control and insists that all reviews, decisions, and conversations go through them. This makes scaling the architecture review process very challenging and decisions will often come down to the whim of this one person.
The Creative The Creative – This individual has passion for software and for creating things, but will often choose complexity over simplicity and turn their architectures into art projects.
The Perfectionist The Perfectionist – This individual tends to let the perfect be the enemy of the good. While their intentions are pure, this approach can result in delayed decision making and debates on topics that might not be worth the time of the board.
The Historian The Historian – This individual has been at the company for a long time and remembers every success and failure along the way. While the context this individual brings to the table is invaluable, teams must guard against only looking to the past as they try to shape the future.

Benefits of an architecture review board

Establishing an ARB within your organization can yield substantial benefits, enhancing both the quality and efficiency of your architecture. Some key advantages are:

Improved compliance

By systematically reviewing architectural decisions, the ARB helps ensure that designs adhere to company best practices, open standards, and regulatory requirements as set forth by your enterprise architecture.

Reduced technical debt

Technical debt—taking shortcuts in the development process that lead to future complications—is a common issue in software development. The ARB helps identify and mitigate technical debt early in the design phase. By enforcing architectural standards and promoting best practices, the board helps ensure that decisions are made with long-term sustainability in mind. This approach results in more robust, maintainable codebases and reduces the likelihood of future rework.

Efficiency with lowered costs

While a formal architecture review sounds like it might have the potential for increased red tape and lowered efficiency, the ARB instead contributes to operational efficiency by standardizing architectural practices across the organization. This uniformity allows for better resource allocation, faster deployment cycles, and more predictable project timelines. By catching potential issues early in the design phase, the ARB helps avoid costly rewrites and rework, which can lead to significant cost savings over time.

Supportability

Designing for supportability is crucial for the long-term success of any application. The ARB makes sure that architectures are built with maintainability in mind, making it easier for operations teams to manage and troubleshoot systems. This focus on supportability leads to fewer downtime incidents, faster resolution times, and overall higher system reliability. By making sure the composition of the ARB crosses all parts of the organization, supportability concerns can be surfaced earlier and help ensure that changes are properly socialized.

Security

Above all, security is the most critical output of an effective ARB. The ARB plays a pivotal role in embedding security into the architectural fabric from the outset. By conducting thorough security reviews and incorporating security best practices into every design, the ARB makes sure that applications are resilient against unintended disclosure, inadvertent access, and threat actors. This proactive approach not only protects sensitive data, but also builds trust with your customers and stakeholders.

Steps for effective architecture review boards

Whether looking to establish a new architecture review process or improve the effectiveness of a current ARB, we’ve identified eight key steps to make sure that an ARB operates in a way which realizes the benefits of a robust architecture review process while maintaining enterprise compliance. With the exception of leadership support, the steps aren’t presented in a particular order and can be implemented in parallel or in whatever order fits your organization and resource availability.

Leadership support

Identifying a sponsor on the executive leadership team is crucial to the success of the ARB. An executive sponsor fosters participation from stakeholders, representing key organizations such as Security, Development, and Operations, along with gaining their commitment to the review processes. The executive sponsor helps embed the ARB function within the enterprise’s project implementation process. Supported by the executive sponsor, the ARB’s reviews serve as a formal gate within the project process, reducing attempts to bypass the review processes.

Single source for guidance, policies, and best practices

Establish a single, well-known repository or index so that the entire enterprise has a single source of truth that establishes the basis for designing and reviewing architecture. A common repository doesn’t need to be complex. It can be a central document location, wiki, or file share that’s quickly discoverable. Commonly, an enterprise’s collection of guidelines and policies are dispersed and managed by each organization using different mechanisms and repositories. Best practices are often treated as folklore passed between team members. Project teams and ARB stakeholders need to share a common understanding of the enterprise’s collective intelligence consisting of guidelines, policies, and best practices.

As the project community’s collective understanding of the enterprise guidelines and policies grows, initial solution designs are better aligned, and reviews through the ARB accelerate. After a common repository is established, consider using generative AI to create a natural language chatbot, a design chatbot, to simplify access to the collective guidelines, policies, and best practices. See Amazon Bedrock or Amazon Q – Generative AI Assistant.

Defined stakeholders

Make sure that your disciplines have defined stakeholders on the ARB. A good starting point is to identify stakeholders from the Security, Enterprise Architecture, Development, Infrastructure, and Operations teams. Broad representation on the board minimizes recycles and delays later in the project, which can occur when stakeholders aren’t engaged in the review process from the beginning. A stakeholder’s responsibility is to focus on their area of subject matter expertise and commit a portion of their time to the ARB. Consider rotating stakeholders periodically to distribute knowledge and workload through the organization.

Gated process with documented decisions

As previously described, architecture reviews typically occur after design and before solution implementation or purchase. Optionally, another architecture review takes place before deployment to validate that the solution matches what was reviewed and approved. It’s important to complete the review before implementation or the purchase decision and to get stakeholder sign off. Otherwise, projects risk rework and delay later in the process, often impacting cost or schedule to a greater degree. Document each ARB action, including approvals, reasons for recycles, exceptions required, follow-ups needed, and so on. Documented decisions should be added to the project’s overall lifecycle documentation to benefit future inspection of project or similar solution architectures.

Establish an exception process

There will always be exceptions to your enterprise guidelines or policies. Plan for exceptions with a defined process for reviewing, escalating, and gaining approval. Include leadership from both IT and business areas in the assessment and sign-off on an exception. Most importantly, set expiration dates on the exceptions–they should not be granted indefinitely. Exceptions are typically granted to accommodate a temporary nonconformance to provide time to plan for and implement a better, long-term solution.

Architecture central repository

Establish a well known, central repository for solution architecture documents. Solution documentation should be treated as living artifacts that are maintained for the lifecycle of the use case. A central architecture repository benefits teams responsible for operating and maintaining solutions, along with design teams chartered with new solution design. After a repository is established, consider including your architecture documentation in the generative AI design chatbot mentioned previously.

Automate review process

Employ automated architecture review processes wherever possible. Automated review processes allow stakeholders to focus their time on their subject matter expertise instead of administrative tasks. Consider separate review processes based on an initiative’s complexity, cost, and impact. Schedule live meetings with the ARB for the most complex and impactful solutions, and use offline mechanisms, such as email, for other efforts. Define a universal architecture template to capture areas of interest for review and automate the Q&A and sign-off processes. Consider using generative AI to do initial automated design reviews against enterprise core best practices and policies to further streamline stakeholder review processes.

Architecture review process shepherd

Identify a shepherd to help ensure that solution architectures are reviewed and the ARB review processes are broadly understood. The shepherd functions as a liaison with executive sponsors for exceptions. While the shepherd can also be a stakeholder on the board, the shepherd is not the single overall decision maker. The shepherd champions the continuous improvement of the architecture review process and mechanisms.

Conclusion

In this post, we explored the benefits of establishing an architecture review board within an organization, emphasizing its role in maintaining compliance, reducing technical debt, and enhancing operational efficiency. We discussed the challenges organizations face in setting up an effective ARB and provided guidance on the essential components and steps required to build and operate a successful ARB. By following the outlined steps, organizations can maximize the benefits of an ARB, making sure that architectural decisions align with enterprise goals and standards while fostering a culture of continuous improvement and stakeholder collaboration.

For additional guidance on garnering the leadership support necessary for an effective ARB, see Well-Architected Framework: Provide executive sponsorship. For more details on the review process, see Well-Architected Framework: The review process and AWS Well-Architected Tool, an AWS Management Console-based service that provides a consistent process for measuring your architecture using AWS best practices. If you’re interested in establishing a natural language chatbot interface for your enterprise architecture information, see Amazon Bedrock, Amazon Q Business, or Build a contextual chatbot application using Amazon Bedrock Knowledge Bases.


About the authors

Announcing the Well-Architected Data Residency with Hybrid Cloud Services Lens

Post Syndicated from Enrico Liguori original https://aws.amazon.com/blogs/architecture/announcing-the-well-architected-data-residency-with-hybrid-cloud-services-lens/

As organizations increasingly take advantage of the benefits of hybrid cloud architectures, they are often faced with the challenge of balancing data residency requirements with the flexibility of the cloud. To help you navigate this complex landscape, we are excited to introduce the AWS Well-Architected Data Residency with Hybrid Cloud Services (DRHC) Lens paper.

The AWS Well-Architected Framework provides a consistent approach for evaluating architectures and applying best practices to build reliable, secure, efficient, and cost-effective systems in the cloud. The framework is based on six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.

What’s in the Well-Architected Data Residency with Hybrid Cloud Services Lens?

The DRHC Lens identifies a set of key design principles to help you in the development and deployment of Well-Architected hybrid cloud workloads:

  • Classify data and workloads – Determine which data sources and workloads must remain on premises or in a specific geography to comply with data residency requirements, and which can be moved to an AWS Region.
  • Establish operational practices for data sovereignty – Develop and implement distinct processes and procedures for data and workloads that are subject to data residency or data sovereignty regulations.
  • Use in-Region services whenever possible – Data residency and data sovereignty regulations often apply only to specific elements of an application, or to specific data. For those elements that aren’t in scope for these regulations, use the broad set of services available in Regions, such as identity, automation, and monitoring services.
  • Automate infrastructure – Build orchestration and automation frameworks that adapt to data residency or data sovereignty classifications to deploy compliant, repeatable, and verifiable application stacks. Automation is critical for eliminating manual processes that might introduce mistakes that can lead to compliance failures.
  • Implement robust security measures – The protection and integrity of data is critical to complying with corporate, data-residency, and data-sovereignty regulations. Use all of the tools and services available across AWS Regions, AWS Local Zones, and on AWS Outposts, including specific preventative controls to meet data residency regualations, when developing and deploying your applications.

In addition to these design principles, the lens provides detailed guidance across the six pillars of the Well-Architected Framework:

  • Operational Excellence – Achieve effective system operations, gain operational insights, and implement continuous process improvement for hybrid cloud workloads.
  • Security – Establish control objectives, implement access controls and governance, and configure detection mechanisms to protect data across on-premises and cloud environments.
  • Reliability – Deploy multiple Outposts or Local Zones for high availability, plan for disaster recovery, and maintain operations during on-premises maintenance.
  • Performance Efficiency – Carefully select services, Regions, and configurations that align with latency, bandwidth, and data residency requirements.
  • Cost Optimization – Implement a tagging strategy, monitor and manage Outposts capacity, and optimize network configurations to reduce costs.
  • Sustainability – Anchor Outposts and Local Zones to sustainable Regions, scale infrastructure to match demand, and optimize storage to reduce energy consumption.

Who should use the DRHC Lens?

All AWS customers with data residency requirements, including government agencies, healthcare providers, financial institutions, and public sector industries who are looking for guidance on these architectural patterns, can use the DRHC Lens. When designing and maintaining your architecture with data residency in mind, the following roles can benefit from this lens:

  • Business and technology leaders — For strategic decision-making around data sovereignty requirements
  • Chief technology officers — For aligning technology choices with data residency compliance needs
  • Solutions architects and data residency specialists — For designing compliant hybrid architectures following Well-Architected principles
  • Security and compliance teams — For making sure data handling meets regulatory requirements across Regions
  • Operations teams — For maintaining daily compliance with data residency controls
  • Data engineers and edge computing specialists — For implementing technical controls that enforce data locality requirements

Conclusion

The new Well-Architected Data Residency with Hybrid Cloud Services Lens is available now. Use the lens whitepaper to adopt your hybrid cloud workloads according to the tenants of the Well-Architected Framework while maintaining data sovereignty requirements.

Applying the Data Residency with Hybrid Cloud Services Lens to your architecture helps validate your data handling practices and provides actionable recommendations to address gaps. AWS will continue to update the whitepaper as new services and features are released. Using the lens can help you consistently apply the latest patterns to meet your requirements. This allows you to deploy applications to meet your business objectives and requirements.


About the Authors

Realizing twelve-factors with the AWS Well-Architected Framework

Post Syndicated from Michael Phorn original https://aws.amazon.com/blogs/architecture/realizing-twelve-factors-with-the-aws-well-architected-framework/

Organizations that are interested in improving their development velocity that follow the principles of the twelve-factor app might find benefits in understanding how to realize those concepts on Amazon Web Services (AWS). In this post, I will help you correlate the twelve-factors app concepts as you architect solutions on AWS.

Twelve-factors

Let’s start with a quick recap of twelve-factors. The Twelve-Factor App was published in 2011 by Adam Wiggins as a collaboration between developers at Heroku. He published it at a time when developers were shifting from a paradigm of writing software-as-a-service (SaaS) applications in their own cloud environments to having the applications hosted on a cloud provider, such as AWS. Their intent was to provide “a broad set of conceptual solutions” for building applications that were portable and resilient. The principles centered around reducing the software lifecycle burden, including application introduction, maintenance and operations, through sunsetting. These principles were captured in the following 12 factors:

  1. Codebase
  2. Dependencies
  3. Config
  4. Backing services
  5. Build, release, run
  6. Processes
  7. Port binding
  8. Concurrency
  9. Disposability
  10. Development and production environment parity
  11. Logs
  12. Admin process

These principles are portable and resilient application best practices.

At AWS, we have the Well-Architected Framework to capture cloud and architecture best practices, which contains similar practices to twelve-factors. The Framework comes from years of AWS Solutions Architect collective experience of building solutions across business verticals and use cases. The results are architectures that support secure, high-performing, resilient, and cost-effective systems in the cloud. If you’re responsible for the underlying infrastructure or the application, the Framework helps you, the CTO, the architect, the developer, or operations team member, understand the benefits and trade-offs of decisions that have to be made.

A brief history of the AWS Well-Architected Framework

AWS published the first version of the Framework in 2012, and we released the AWS Well-Architected Framework whitepaper in 2015. Following the initial introduction, we added the Operational Excellence pillar in 2016 and released pillar-specific whitepapers and AWS Well-Architected Lenses in 2017. The following year, the AWS Well-Architected Tool was launched.

While twelve-factors focuses on application characteristics, the AWS Well-Architected Framework provides architectural guidance. When your architecture undergoes a Well-Architected review, you can meet the guidance for a twelve-factors application more easily. With some factors, the Framework helps the application developer delegate some responsibility from the application to the infrastructure. Both frameworks aim to help you deliver applications and services that are robust, scalable, and cloud centered. The AWS Well-Architected Framework helps you reinforce these mechanisms.

The six pillars of the AWS Well-Architected Framework

Let’s explore the six pillars of the AWS Well-Architected Framework, what each aims to achieve, and where the twelve-factors concepts intersect with AWS guidance.

The following figures shows the twelve factors and how they map to processes in AWS, which are described in this section.

1. Operational excellence

The operational excellence pillar helps you review your organization’s ability to support development and run workloads efficiently. You can use the topics in this pillar to evaluate how you operate your solutions. The pillar guides you through inspection of organizational structure, inspection of your mechanisms, and identification of obstacles and roadblocks that might slow your ability to innovate. The results include a feedback loop of continuous improvement for operating the infrastructure and solutions.

The factors you capture through operational excellence are codebase (I) and development and production environment parity (X). Codebase prescribes that there is exactly one codebase used to deploy everywhere, which echoes the purpose of reducing the operational burden of maintaining your software. The argument for a single code base is for consistency, traceability, and efficiency across a unified development lifecycle. The second factor is development and production environment parity, which encourages developers to create smaller but more frequent deployments. It also encourages developers to maintain parity not just of the core software, but also the backing services between environments. Parity of environments is conducive to smoother development and deployment processes. Additionally, this parity helps developers catch issues in a non-production environment more consistently.

AWS services that can help you achieve operational excellence are AWS CodeConnections, AWS CodePipeline, AWS CloudFormation, AWS Systems Manager, Amazon CloudWatch, AWS Config, AWS CloudTrail, Amazon EventBridge, AWS X-Ray, AWS Organizations, AWS Control Tower, AWS Trusted Advisor, AWS Service Catalog, AWS Proton, Amazon CodeGuru (Preview), AWS Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and AWS StepFunctions

2. Security

The security pillar describes how to use AWS Cloud technologies to protect data, systems, and assets that improve your security posture. At AWS, we advocate the shared responsibility model, which applies to the security pillar. AWS is responsible for providing a secure environment for managing and operating your systems and solutions, but it is your responsibility to implement those best practices in the context of your requirements. The security pillar describes best practices such as reviewing how you manage identities for people and machines, which helps you store secrets securely.

The config (III) factor can be mapped to the security pillar, which advises you to store variables and items that depend upon the environment as environment variables. This allows you to move between deployments without having to update your code. Configuration settings such as database connection strings, API keys, credentials, and other sensitive information should be separated from the application code. At AWS, we provide services that can be used to securely meet this requirement, including AWS Secrets ManagerAWS Systems Manager Parameter Store, AWS Certificate Manager, and AWS Key Management Service (KMS).

AWS services that can help you achieve security are AWS Identity and Access Management (IAM), Amazon GuardDuty, AWS Shield, AWS Web Application Firewall (WAF), Amazon Inspector, AWS CloudHSM, Amazon Macie, AWS Security Hub, AWS Config, AWS CloudTrail, Amazon VPC (Virtual Private Cloud), AWS Direct Connect, Amazon Cognito, AWS Firewall Manager, AWS Network Firewall, and AWS IAM Access Analyzer.

3. Reliability

The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. Reliability means that your architecture and systems:

  • Appropriately scale resources to meet demands
  • Mitigate disruptions caused by misconfiguration or transient network issues
  • Recover when disruptions do occur

Automation of scaling and recovery are best practices within the reliability pillar.

Because twelve-factors helps developers deliver a reliable application, multiple factors are categorized under the reliability pillar of the AWS Well-Architected Framework. Backing services (IV) explains that you should have flexibility for integrating services. This way, when your system experiences issues with availability, the application can replace the troubled service without code changes. You should choose the right resource that provides scalability while optimizing costs. Dependencies (III) describes that applications declare and isolate dependencies to become modular and self-contained. This speeds up recovery by simplifying the setup for handlers of the application code. Applications that adhere to the processes (VI) factor run as a collection of stateless processes to support scaling. This is equivalent to creating microservices that can scale up or down depending upon the workload or bring in additional instances when one fails. Disposability (IX) suggests that an application’s processes can be started and stopped rapidly, which makes the application resilient to failures and capable of being adapted to elastically scale.

AWS services that can help you achieve reliability are Amazon EC2 Auto Scaling, Elastic Load Balancing (ELB), Amazon RDS Multi-AZ, Amazon Simple Storage Service (Amazon S3), AWS CloudFormation, Amazon Route53, AWS Shield, AWS Backup, Amazon CloudWatch, AWS Systems Manager, AWS Global Accelerator, Amazon Aurora, AWS Lambda, Amazon DynamoDB, and AWS Transit Gateway.

4. Performance efficiency

The principles under the performance efficiency pillar focus on using computing resources to build architectures on AWS that efficiently deliver sustained performance as demand changes and technologies evolve. Topics in this pillar include simplifying the consumption of technologies that align with your goals, the ability to go global in minutes, and reducing the time and effort needed to deliver a service.

The concurrency (VIII) factor prioritizes management of processes, which should be stateless and allow for horizontal scaling, promoting performance efficiency. The backing services (IV) factor also falls under this category because it dictates flexibility in integration. This flexibility enables the application to maximize performance by using the right resource that meets scalability and performance requirements.

AWS services that can help you achieve performance efficiency are Amazon Elastic Compute Cloud (Amazon EC2), Amazon EC2 AutoScaling, Amazon Elastic Block Store (Amazon EBS), Amazon S3, Amazon Aurora, Amazon DynamoDB, Amazon ElastiCache, Amazon CloudFront,Application Auto ScalingElastic Load Balancing (ELB), AWS Lambda, Amazon API GatewayAWS Step Functions, Amazon SQS, Amazon Kinesis, AWS Global Accelerator, Amazon Aurora, AWS X-Ray, Amazon CloudWatch, and AWS Compute Optimizer.

5. Cost optimization

The cost optimization pillar provides guidance for the architecture’s ability to operate systems and deliver the business value at the lowest price point. The cost optimization reviews help you avoid unnecessary costs, analyze and attribute expenditure, and use appropriate pricing models.

The relationship of the build, release, run (V) factor advocates for the process separation and strict discipline around efficient handling of application deployments. This aligns to the cost optimization pillar because cost effective operations are typically a result of well-designed processes and mechanisms. AWS services that can support the build, release, run factor are, AWS CodeBuild and AWS CodeDeploy.

Other AWS services that can help you with cost optimization are AWS Cost Explorer, AWS Budgets, AWS Data Exports, AWS Trusted Advisor, AWS Compute Optimizer, EC2 Spot Instances, AWS Savings Plans, Amazon S3 Intelligent-Tiering, AWS Lambda, Amazon Aurora, Application Auto Scaling, AWS Organizations, AWS Resource Groups, Tag Editor, AWS Marketplace, AWS License Manager, AWS Glue, and Amazon Athena.

6. Sustainability

The sustainability pillar focuses on minimizing the environmental impact of running workloads in the cloud. Topics in this include reviewing the lifecycle of your data and retention policies as a methodology to use only what is needed.

The disposability (IX) factor is aligned to sustainability because it highlights an application’s ability to rapidly start and shut down at a moment’s notice. This provides agility and optimized use of resources during the life of the application.

AWS services that can help you achieve sustainability are AWS Customer Carbon Footprint ToolAmazon EC2 AutoScalingAWS Lambda, Amazon EC2 Spot Instances, Amazon EBS gp3 volumes, Amazon S3 Intelligent-Tiering, Amazon S3 Lifecycle configurations, AWS Graviton processors, Amazon Aurora Serverless, Amazon RDS Multi-AZ deployments, AWS Compute Optimizer, AWS Well Architected Tool, and Amazon CloudWatch.

Remaining factors

Port binding, logs, and admin processes aren’t specifically categorized into the pillars of the AWS Well-Architected Framework. However, these factors can be addressed as an essential part of the services that AWS delivers.

The seventh factor: Port binding

The port binding factor says that an application should be bound to a specific port when it’s hosted as a web application with the intention of making the application completely self-contained. In the context of AWS, we offer you different ways to achieve this principle, which are dependent upon the way your application is deployed on AWS. When implementing port binding on AWS, we offer features such as security, service discovery, and dynamic port mapping to simplify and secure your applications through services like Amazon EC2, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Elastic Beanstalk, AWS App Runner.

The eleventh factor: Logs

The logs factor dictates that an application should treat its running process as an event stream out to files that are managed completely by the execution environment. AWS offers many types of logging to capture different aspects of your application and the supporting infrastructure. CloudWatch is a centralized logging management service that monitors, stores, and provides access to log files from AWS services. For more detail, see AWS services for logging and monitoring.

The twelfth factor: Admin processes

The admin processes factor advises application developers to perform administrative tasks in an isolated manner to minimize the impact on the main application. At AWS, this factor is realized as a separation of the control plane and the data plane. The control plane is responsible for managing, configuring, and controlling the network or system infrastructure, while the data plane is responsible for the handling the actual user data or traffic. This separation is an inherent part of AWS services. We believe this separation allows AWS to deliver services that are scalable, highly available, secure, and efficient.

Applying the AWS Well-Architected Framework

The Framework shouldn’t be treated as a checklist that you review after development is complete. Instead, a review should be explored during the design phase to help you learn and apply architectural best practices. By the end of development, architects should have built a solution that facilitates faster, lower-risk service building and deployment. The Framework is not a static document, and as AWS evolves, architects continue to learn from working with customers and refine the definition of well-architected.

Conclusion

If you are familiar with twelve-factors or want to develop a twelve-factors app on AWS, read more about the AWS Well-Architected Framework. Consider starting a review project on your own to explore the detailed questions underneath each category or if you have specific workload that you’re already working on. You can use one of the many AWS Well-Architected Tool lenses to focus on applying these best practices to the services that you’re using. To get started on a lens review, see AWS Well-Architected Tool, which is accessible at no charge through the AWS Management Console.


About the author

How MuleSoft achieved cloud excellence through an event-driven Amazon Redshift lakehouse architecture

Post Syndicated from Sean Zou original https://aws.amazon.com/blogs/big-data/how-mulesoft-achieved-cloud-excellence-through-an-event-driven-amazon-redshift-lakehouse-architecture/

This post is cowritten with Sean Zou, Terry Quan and Audrey Yuan from MuleSoft.

In our previous thought leadership blog post Why a Cloud Operating Model we defined a COE Framework and showed why MuleSoft implemented it and the benefits they received from it. In this post, we’ll dive into the technical implementation describing how MuleSoft used Amazon EventBridge, Amazon Redshift, Amazon Redshift Spectrum, Amazon S3, & AWS Glue to implement it.

Solution overview

MuleSoft’s solution was to build a lakehouse built on top of AWS services, illustrated in the following diagram, supporting a portal. To provide near real-time analytics we used an event-driven strategy that which would trigger AWS Glue jobs an refresh materialized views.  We also implemented a layered approach that included collection, preparation, and enrichment making it straightforward to identify areas that affect data accuracy.

For MuleSoft’s lakehouse end-to-end solution, the following phases are key:

  • Preparation phase
  • Enrichment phase
  • Action phase

In the following sections, we discuss these phases in more detail.

Preparation phase

Using the COE Framework, we engaged with the stakeholders in the preparation phase to determine the business goals and identify the data sources to ingest. Examples of data sources were cloud assets inventory, AWS Cost and Usage Reports, and AWS Trusted Advisor data. The ingested data is processed in the lakehouse to implement the Well-Architected pillars, utilization, security, and compliance status checks and measures.

How you configure the CUR data and the Trusted Advisor data to land into S3?

The configuration process involves multiple components for both CUR and Trusted Advisor data storage. For CUR setup, customers need to configure an S3 bucket where the CUR report will be delivered, either by selecting an existing bucket or creating a new one. The S3 bucket requires a policy to be applied and customers must specify an S3 path prefix which creates a subfolder for CUR file delivery .

Trusted Advisor data is configured to use Kinesis Firehose to deliver customer summary data to the Support Data lake S3 bucket .The data ingestion process uses firehose buffer parameters (1MB buffer size and 60-second buffer time) to manage data flow to the S3 bucket .

The Trusted Advisor data is stored in JSON and GZIP format, following a specific folder structure with hourly partitions using the “YYYY-MM-DD-HH” format .

The S3 partition structure for Trusted Advisor customer summary data includes separate paths for success and error data, and the data is encrypted using a KMS key specific to Trusted Advisor data .

MuleSoft used AWS managed services and data ingestion tools to pull from multiple data sources and that can support customizations. Cloudquery is used tool to gather cloud infrastructure information, which can connect many infrastructure data sources out of the box and land it into an Amazon S3 bucket. The MuleSoft Anypoint Platform provides an integration layer to integrate infrastructure tools, accommodating many data sources like on-premise, SaaS, and commercial off-the-shelf (COTS) software. Cloud Custodian  was used for its capability of managing cloud resources and auto-remediation with customizations.

Enrichment phase

The enrichment phase includes ingesting raw data aligning with our business goals into the lakehouse through our pipelines, and consolidating the data to create a single pane of glass.

The pipelines adopt the event-driven architecture consisting of EventBridge, Amazon Simple Queue Service (Amazon SQS), and Amazon S3 Event Notifications to provide near real-time data for analysis. When new data arrives in the source bucket, new object creation is captured by the EventBridge rule, which invokes the AWS Glue workflow, consisting of an AWS Glue crawler and AWS Glue extract, transform, and load (ETL) jobs. We also configured S3 Event Notifications to send messages to the SQS queue to make sure the pipeline will only process the new data.

The AWS Glue ETL job cleanses and standardizes the data, so that it’s ready to be analyzed using Amazon Redshift. To tackle data with complex structures, additional processing is performed to flatten the nested data formats into a relational model. The flattening step also extracts the tags of AWS assets out of the nested JSON objects and pivots them into individual columns, enabling tagging enforcement controls and ownership attribution.  The ownership attribution of the infrastructure data provides accountability and holds teams responsible for the costs, utilization, security, compliance, and remediation of their cloud assets.  One important tag is asset ownership which is from the tags extracted from the flattening step, this data can be attributed to the corresponding owners by SQL scripts.

When the workflow is complete, the raw data from different sources and with various structures is now  centralized in the data warehouse.  From there, disjointed data with different purposes is ready to be consolidated and translated into actionable intelligence in the Well-Architected Pillars by coding out the business logic.

 Solutions for the enrichment phase

In the enrichment phase, we faced a number of storage, efficiency, and scalability challenges given the sheer volume of data. We used three techniques (file partitioning, Redshift Spectrum, and materialized views) to address these issues and scale without compromising performance.

File partitioning

MuleSoft’s infrastructure data is stored in folder structure: year, month, day, hour, account, and Region in an S3 bucket, so AWS Glue crawlers are able to automatically identify and add partitions to the tables in the AWS Glue Data Catalog. Partitioning helps improve query performance significantly because it optimizes parallel processing for queries. The amount of data scanned by each query is restricted based on the partition keys, helping reduce overall data transfers, processing time, and computation costs. Although partitioning is an optimization technique that helps improve query efficiency, it’s important to keep in mind two key points while using this technique:

  • The Data Catalog has a maximum cap of 10 million partitions per table
  • Query performance gets compromised as partitions grow rapidly

Therefore, balancing the number of partitions in the Data Catalog tables and query efficiency is essential. We decided on a data retention policy of 3 months and configured a lifecycle rule to expire any data older than that.

Our event-driven architecture–AWS Eventbridge event is invoked when objects are put into or removed from an S3 bucket, event messages are published to the SQS queue using S3 Event Notifications, which invokes an AWS Glue crawler to either add new partitions or removes old partitions from the Data Catalog based on the messages handling the partition cleanup.

Amazon Redshift and concurrency scaling

MuleSoft uses Amazon Redshift to query the data in S3 because it provides large scale compute and minimized data redundancy. MuleSoft also used Amazon Redshift concurrency scaling to run concurrent queries with consistently fast query performance. Amazon Redshift automatically added query processing power in seconds to process a high number of concurrent queries without any delays.

Materialized views

Another technique we used is Amazon Redshift materialized views. Materialized views store preset query results that future similar queries can use, so many computation steps can be skipped. Therefore, relevant data can be accessed efficiently, which leads to query optimization. Additionally, materialized views can be automatically and incrementally refreshed. Therefore, we can achieve a single pane of glass in our cloud infrastructure with the most up-to-date projections, trends, and actionable insights to our organization with improved query performance.

Amazon Redshift Materialized Views (MVs) are used extensively for reporting in MuleSoft’s Cloud Central portal, but if users needed to drill down into a granular view they could reference external tables.

Mulesoft is currently manually refreshing the materialized views through the event-driven architecture, but is evaluating a switch to automatic refresh.

Action phase

Using materialized views in Amazon Redshift, we developed a self-serve Cloud Central portal in Tableau to provide a display portal for each team, engineer, and manager offering guidance and recommendations to help them operate in a way that aligns with the organization’s requirements, standards, and budget. Managers are empowered with monitoring and decision-making information for their teams. Engineers are able to identify and tag assets with missing mandatory tagging information, as well as remediate non-compliant resources. A key feature of the portal is personalization, meaning that the portal is enabled to populate visualizations and analysis based on the relevant data associated with a manager’s or engineer’s login information.

Cloud Central also helps engineering teams improve their cloud maturity in the six Well-Architecture pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. The team proved out the “art of possible” by poc’ing Amazon Q to assist with 100 and 200 Well-Architected pillar inquiries and how to’s. The following screenshot illustrates the MuleSoft implementation of the portal, Cloud Central. Other companies will design portals that are more bespoke to their own use cases and requirements.

Conclusion

The technical and business impact of MuleSoft’s COE Framework enables an optimization strategy and a cloud usage show back approach which helps MuleSoft continue to grow with a scalable and sustainable cloud infrastructure. The framework also drives continual maturity and benefits in cloud infrastructure centered around the six Well-Architecture pillars shown in the following figure.

The framework helps organizations with expanded public cloud infrastructure achieve their business goals guided by the Well-Architected benefits powered by an event-driven architecture.

The event-driven Amazon Redshift lakehouse architecture solution offers near real-time actionable insights on decision-making, control, and accountability. The event-driven architecutre can be distilled into modules which can be added or deleted depending on your technical/business goals.

The team is exploring new ways to lower the total cost of ownership. They are evaluating Amazon Redshift Serverless for transient database workloads as well as exploring Amazon DataZone to aggregate and correlate data sources into a data catalog to share among teams, applications, and lines of businesses in a democratized way. We can increase visibility, productivity, and scalability with a well-thought-out lakehouse solution.

We invite organizations and enterprises to take a holistic approach to understand their cloud resources, infrastructure, and applications. You can enable and educate your teams through a single pane of glass, while running on a data modernization lakehouse applying Well-Architected concepts, best practices, and cloud-centric principles. This solution can ultimately enable near real-time streaming, leveling up a COE Framework well into the future.


About the Authors

Sean Zou is a Cloud Operations leader with MuleSoft at Salesforce. Sean has been involved in many aspects of MuleSoft’s Cloud Operations, and helped drive MuleSoft’s cloud infrastructure to scale more than tenfold in 7 years. He built the Oversight Engineering function at MuleSoft from scratch.

Terry Quan focuses on FinOps issues. He works on MuleSoft Engineering on cloud computing budgets and forecasting, cost reduction efforts, costs-to-serve, and coordinates with Salesforce Finance. Terry is a FinOps Practitioner and Professional Certified.

Audrey Yuan is a Software Engineer with MuleSoft at Salesforce. Audrey works on data lakehouse solutions to help drive cloud maturity across the six pillars of the Well-Architected Framework.

Rueben Jimenez is a Senior Solutions Architect at AWS, designing and implementing complex data analytics, AI/ML, and cloud infrastructure solutions.

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open source solutions. Outside of his work, Avijit likes to travel, hike, watch sports, and listen to music.

Top Architecture Blog Posts of 2024

Post Syndicated from Andrea Courtright original https://aws.amazon.com/blogs/architecture/top-architecture-blog-posts-of-2024/

Well, it’s been another historic year! We’ve watched in awe as the use of real-world generative AI has changed the tech landscape, and while we at the Architecture Blog happily participated, we also made every effort to stay true to our channel’s original scope, and your readership this last year has proven that decision was the right one.

AI/ML carries itself in the top posts this year, but we’re also happy to see that foundational topics like resiliency and cost optimization are still of great interest to our audience.

(By the way, if you were hoping for more AI/ML content, head on over to our sister channel, the AWS Machine Learning Blog!).

Without further ado, here are our top posts from 2024!

#10 Deploy Stable Diffusion ComfyUI on AWS elastically and efficiently

This post helps you get started using ComfyUI, and was so successful that we followed it up later in the year with How to build custom nodes workflow with ComfyUI on EKS!

Architecture for deploying stable diffusion on ComfyUI

Figure 1. Architecture for deploying stable diffusion on ComfyUI

#9 Let’s Architect! Designing Well-Architected systems

In keeping with Let’s Architect! series, we have our first of three favorites for the year. This set of resources helps you apply Well-Architected standards in practice.

Let's Architect

Figure 2. Let’s Architect

#8 Let’s Architect! Learn About Machine Learning on AWS

As I said, Let’s Architect! has a winning series, and they’ve got a finger on the pulse of the tech world. This post about machine learning showcases some of the most exciting things happening at AWS.

Let's Architect

Figure 3. Let’s Architect

If you’re more interested in generative AI, you can also take a look at another post from 2024: Let’s Architect! GenAI

#7 Creating an organizational multi-Region failover strategy

Preparedness is another common theme in this year’s favorites. Michael, John, and Saurabh are well-versed in multi-Region architecture, and they’re here to share some strategies to contain failure impact.

When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.

Figure 4. When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.

#6 Building a three-tier architecture on a budget

Let’s talk cost optimization. This post about a three-tier architecture that relies on the AWS Free Tier is a must-read for anyone looking for tips to help them avoid unnecessary costs (and that’s everyone).

Example of a three-tier architecture on AWS

Figure 5. Example of a three-tier architecture on AWS

#5 Announcing updates to the AWS Well-Architected Framework guidance

As usual, Haleh & team are pros at making sure the Well-Architected Framework is current and relevant. Take a look at the enhanced and expanded guidance in all six pillars.

Well-Architected logo

Figure 6. Well-Architected logo

#4 Let’s Architect! Serverless developer experience in AWS

One more winning post from Luca, Federica, Vittorio, and Zamira! This collection of developer resources includes new ideas in AWS Lambda, Amazon Q Developer, and Amazon DynamoDB.

Let's Architect

Figure 7. Let’s Architect

#3 London Stock Exchange Group uses chaos engineering on AWS to improve resilience

This post from April 1 was not an April Fool’s joke! See how LSEG designed failure scenarios to test their resilience and observability.

Chaos engineering pattern for hybrid architecture (3-tier application)

Figure 8. Chaos engineering pattern for hybrid architecture (3-tier application)

#2 Achieving Frugal Architecture using the AWS Well-Architected Framework Guidance

Frugality AND Well-Architected? What a winning combo! This post, inspired by the 2023 re:Invent keynote, outlines the seven laws of Frugal Architecture.

Well-Architected logo

Figure 9. Well-Architected logo

#1 How an insurance company implements disaster recovery of 3-tier applications

And finally, our number one post of the year! Amit and Luiz showcase a customer solution with real-world applications that builds on the guidelines of other posts in this list! Well done!

The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions

Figure 10. The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions

Thank you!

As always, thanks to our contributors for their dedication and desire to share, and to you, our readers! We would be nothing with you. Literally.

For other top post lists, see our Top 10 and Top 5 posts from previous years.

Enhance the resilience of critical workloads by architecting with multiple AWS Regions

Post Syndicated from John Formento original https://aws.amazon.com/blogs/architecture/enhance-the-resilience-of-critical-workloads-by-architecting-with-multiple-aws-regions/

In this post, we will share how you can use multi-Region as an architectural approach to achieve higher resilience on Amazon Web Services (AWS). This approach relies on first operating a workload across multiple Availability Zones within an AWS Region, before expanding to achieve even higher resilience by using multiple Regions. This is because within a Region there are multiple Availability Zones, which are physically separated by many miles but still close enough together (60 miles or less) to allow for single-digit millisecond latency. Each Availability Zone features one or more data centers, each housed in its own facility with its own redundant networking, connectivity, and power. Availability Zones provide fundamental building blocks that can help you achieve your resilience goals for your applications. First, you can benefit from the separation between Availability Zones by using Zonal services to specify which Availability Zone a resource is in, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance. This means that if you build your application with redundant replicas of your application resources in each Availability Zone, you can gain excellent resilience to infrastructure events impacting any one Availability Zone.

A multi-Region approach is a reliable way to achieve a bounded recovery time for critical applications in the rare event of a service failure in a Region that is impacting your application. Each Region has strict logical and physical separation from other Regions. This purposeful design helps avoid service and infrastructure disruptions in one Region affecting another Region. This unique property of Regions can be used to build multi-Region applications with predictable fault domains.

While a multi-Region approach can improve your application’s resilience to failures, it can be challenging to build and operate such an application. It requires careful work to take advantage of the isolation between Regions, with care taken to not remove this isolation benefit at the application level. For example, if you fail over an application between Regions, you need to maintain strict separation between your application stacks in each Region, be aware of all the application dependencies, and fail over all parts of the application together. This kind of system requires planning and coordination amongst many engineering and business teams, especially with a complex, microservices-based architecture that could have several dependencies between applications.

If you’re replicating data between Regions using an asynchronous approach, you should be aware of the risk that not all your data has been replicated to the standby Region when you fail over. Because there’s a finite time needed to copy data over between Regions, data might be out of sync between the primary and standby Regions. If you use a synchronously replicated database across Regions to support your applications running from more than one Region concurrently, you avoid issues with data being out of sync when starting your application in the new Region. However, this introduces higher latency characteristics into your application’s resources. This is because writes need to commit to more than one Region, and the Regions can span hundreds or thousands of miles from one another. This latency characteristic needs to be accounted for in your application design. In addition, synchronous replication can increase the chance for correlated failures because writes need to be committed to more than one Region to be successful. If there is an impairment within one Region, you’ll need to form a quorum for writes to be successful, which typically involves having your database in three Regions and having a quorum of two out of three.

Finally, you need to practice the failover and simulate Region impairments to know that it works when you need it. It’s a substantial time and resource investment to regularly rotate your application between Regions to practice failover, but it’s a recommended practice if you plan to build a multi-Region application.

Given these additional considerations when implementing a multi-Region approach, for most AWS customers, multi-AZ is the right approach for building and operating resiliently in the cloud. This approach helps mitigate most infrastructure failures, which are usually contained within an Availability Zone. A multi-Region approach is most common in the following scenarios.

Meet regulatory and compliance requirements and enhance disaster recovery capabilities

Regulated industries like financial services and healthcare and life sciences can require that applications be multi-Region. Healthcare providers and pharmaceutical companies, for example, often deploy electronic health records (EHR), clinical trial management systems, and other applications across multiple Regions for enhanced data redundancy, disaster recovery, and compliance with regional data privacy regulations (like HIPAA in the US or GDPR in the EU). Epic on AWS, for example, is typically deployed across multiple Availability Zones and multiple Regions to increase the resilience of customers’ EHR and integrated application environment, making full use of the resources and geographic diversity of the AWS Cloud.

Banks and financial institutions, including Fidelity and Vanguard, also deploy many of their core trading and investment platforms and customer-facing applications across multiple Regions for enhanced business continuity and compliance with local data protection regulations.

Achieve a bounded recovery time to support highly available business-critical workloads

With growing demand for always-on applications and services, companies are increasingly reliant on cloud-based services and infrastructure for day-to-day operations and business continuity. While a single Region supports highly available and resilient applications, distributing workloads across multiple Regions enables a bounded recovery time in the rare event of a disruption to the application. The physical and logical separation of Regions provides a well-defined fault isolation boundary that you can use to create predictable fault boundaries for your applications. If their application experiences issues in one Region, the workloads can continue operating in another Region, which minimizes downtime for customers and users.

Streaming platforms like Netflix, NBCUniversal, and Disney, for example, deploy their content delivery networks (CDNs) and video streaming infrastructure across multiple Regions to provide a seamless media experience for their customers. In many cases, video streaming and video gaming companies deploy their infrastructure across multiple Regions to offer lower-latency gaming experiences for players worldwide.

Automotive companies such as Honda deploy their connected vehicle platforms across multiple Regions to scale globally. They use geo-location routing that identifies the closest broker the vehicle should communicate with based on customer-configured rules that govern how vehicles connect to the cloud infrastructure. This allows them to reliably connect millions of vehicles to the cloud while supporting high availability.

Conclusion

No matter the industry or scenario, AWS is the definitive choice for organizations that want to build and run highly available, resilient applications in the cloud, with resilience built into its infrastructure, operational models, and comprehensive capabilities across Regions. To learn how to choose between the different options for building resilience into your application, see the Well-Architected reliability pillar, and for a detailed framework for choosing multi-Region, see AWS Multi-Region Fundamentals.


Announcing updates to the AWS Well-Architected Framework guidance

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-guidance-3/

We are excited to announce the availability of enhanced and expanded guidance for the AWS Well-Architected Framework with the following six pillars: Operational ExcellenceSecurityReliabilityPerformance EfficiencyCost Optimization, and Sustainability

This release includes new best practices and improved prescriptive implementation guidance for the existing best practices. This includes enhanced recommendations and steps on reusable architecture patterns focused on specific business outcomes.

A brief history

The Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.

2024 AWS Well-Architected guidance timeline

Figure 1. 2024 AWS Well-Architected guidance timeline

In 2012, we published the first version of the Framework. In 2015, we released the AWS Well-Architected Framework whitepaper. We added the Operational Excellence pillar in 2016. We released the pillar-specific whitepapers and AWS Well-Architected Lenses in 2017. The following year, the AWS Well-Architected Tool was launched.

In 2020, we released the new version of the Well-Architected Framework guidance, more lenses, and an API integration with the AWS Well-Architected Tool. We added the sixth pillar, Sustainability in 2021. In 2022, dedicated pages were introduced for each consolidated best practices across all six pillars, with several best practices updated with improved prescriptive guidance. By December 2023, we improved more than 75% of the Framework’s best practices. As of November 2024, we’ve refreshed 100% of the Framework’s best practices at least once since October 2022.

What’s new

The Well-Architected Framework supports customers as they mature in their cloud journey by providing guidance to help achieve more operable, secure, sustainable, scalable, and resilient environment and workload solutions.

The content updates and prescriptive guidance improvements in this release provide more complete coverage across AWS, helping customers make informed decisions when developing implementation plans. We added or expanded on guidance for the following services in this update: Amazon API Gateway, Amazon CloudFront, Amazon CloudWatch, Amazon CodeGuru, Amazon Cognito, Amazon GuardDuty, Amazon Inspector, Amazon Macie, Amazon Q Business, Amazon Q Developers, Amazon Redshift, Amazon S3, AWS Certificate Manager, AWS CloudFormation, AWS CloudTrail, AWS CodeBuild, AWS CodeDeploy, AWS CodePipeline, AWS Config, AWS Control Tower, AWS Customer Carbon Footprint Tool, AWS Glue, AWS Health, AWS Identity and Access Management (IAM), AWS Key Management Service (KMS), AWS OpenSearch, AWS Organizations, AWS Resource Access Manager, AWS Secrets Manager, AWS Security Hub, AWS Step Functions, AWS Systems Manager, AWS Trusted Advisor, AWS Verified Access, and AWS WAF.

Pillar updates

Operational Excellence

In the Operational Excellence Pillar, we updated five best practices across four questions. This includes OPS02, OPS05, OPS09, and OPS10. The updates in this release include improved prescriptive guidance on multiple AWS services. OPS02-BP02 updates leverage Amazon Q Business for improving workforce collaboration and productivity. OPS05-BP08 updates demonstrate AWS Organizations and AWS Control Tower capabilities that enable updates to a multi-environment setup while meeting governance and policy requirements. OPS09-BP01 and OPS09-BP02 have updated guidance and resources for developing operational key performance indicators (KPIs). OPS10-BP02 has been updated with information on AWS Health, including its planned lifecycle events feature, for integrating into an incident management workflow.

Security

In the Security Pillar, we updated 43 best practices across nine questions. This includes SEC02, SEC03, SEC04, SEC06, SEC07, SEC08, SEC09, SEC10, and SEC11. All best practices in SEC03 (Permissions management) were revised, with updates to guidance on Attribute Based Access Control (ABAC), AWS IAM Access Analyzer, and emergency access processes. SEC02 (Identity management) also saw updates to all six of its best practices, including refinements to guidance on identity federation and secrets management. SEC07 through SEC11 received updates to guidance on data protection, incident response, and application security. Key updates include replacing the security information and event management SIEM solution on AWS OpenSearch recommendation with AWS CloudTrail Lake in SEC04 (Detection), expanded guidance on AWS S3 Object Lock and AWS S3 Glacier Vault Lock in SEC08 (Protecting data at rest), and the addition of recommendations for Mutual Transport Layer Security (mTLS) and private certificates in SEC09 (Protecting data in transit). Overall, these changes reflect AWS’s commitment to providing up-to-date, comprehensive security guidance in line with evolving best practices and new service capabilities.

Reliability

In the Reliability Pillar, we updated 14 best practices across nine questions. This includes REL01, REL02, REL04, REL06, REL07, REL08, REL10, REL12, and REL13. We expanded and clarified our guidance throughout the Pillar and added detailed implementation steps to each best practice that did not previously have them. We refreshed our multi-location deployment guidance by merging REL10-BP02 into REL10-BP01, while improving the prescriptive guidance of this best practice with a new title of Deploy the workload to multiple locations. We updated our idempotent operations guidance in REL04-BP04 to provide detailed technical guidance for builders who wish to provide idempotent APIs and updated the title to Make mutating operations idempotent. We merged functional testing guidance by migrating the content previously published under REL12-BP03 to REL08-BP02 (Integrate functional testing as part of your deployment) and expanded our guidance on testing in CI/CD pipelines. We refreshed REL07-BP01 to emphasize infrastructure as code (IaC) as a cornerstone of automated resource management and scaling. We improved our guidance in REL06-BP02 on how to use system and application logs to improve workload observability. We also refreshed our links to relevant resources including documents, videos, and presentations.

Performance Efficiency

In the Performance Efficiency Pillar, we updated the resources of PERF03-BP04 with the latest services.

Sustainability

In the Sustainability Pillar, we updated 10 best practices across six questions. This includes SUS01, SUS03, SUS04, SUS05, and SUS06. Best practices SUS01-BP01, SUS03-BP02, SUS03-BP05, SUS04-BP03, SUS04-BP05, SUS04-BP06, SUS04-BP07, SUS04-BP08, SUS05-BP04, and SUS06-BP02 now offer improved prescriptive guidance. Additionally, we added a new best practice, SUS06-BP01 Communicate and cascade your sustainability goals, which highlights the critical role of the central IT team in cascading sustainability goals and objectives across the broader organization. By strategically leveraging the cloud, implementing resource-efficient practices, and employing sustainability-focused tools and analytics, IT teams can play a pivotal role in driving meaningful reductions in the organization’s environmental impact.

Conclusion

This release includes updates and improvements to the Framework guidance totaling 78 best practices. As of this release, we’ve updated 100% of the existing Framework best practices at least once since October 2022. With this release, we have refreshed 100% of all the pillars of the Framework including the Reliability Pillar, with 14 of its best practice updated for the first time since major Framework improvements started in 2022.

Updates in this release will be incorporated into the AWS Well-Architected Tool in future releases, which you can use to review your workloads, address important design considerations, and help you follow the AWS Well-Architected Framework guidance.

The content will be available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.

Ready to get started? Review the updated AWS Well-Architected Framework Pillar best practices and pillar-specific whitepapers.

Have questions about some of the new best practices or most recent updates? Join our growing community on AWS re:Post.

Let’s Architect! Building multi-tenant SaaS systems

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-building-multi-tenant-saas-systems/

Software as a Service (SaaS) applications offer a transformative solution for businesses worldwide, delivering on-demand software solutions to a global audience. However, building a successful SaaS platform demands on meticulous architectural planning, especially given the inherent challenges of multi-tenancy. It’s also essential to ensure that each tenant’s data remains isolated and protected from unauthorized access and that multi-tenant systems are cost-optimized and can sustain the scaling of the SaaS business provider.

In this blog post, we will explore some of the key elements and best practices for designing and deploying secure and efficient SaaS systems on AWS.

Building cost-optimized multi-tenant SaaS architectures

Cost is a key factor to consider when we design new systems. Multi-tenancy requires teams to think beyond the basics of auto scaling, adopting strategies to allow their architecture to support a complex cost-scaling challenges. In this session, the speaker covers some design patterns for distributed systems to support the continually evolving scale needs of the environment, while optimizing the cost of the infrastructure.

The architectural model chosen for deploying multi-tenant systems—pooled, siloed, or mixed—significantly influences the cost optimization strategy. Each approach offers distinct trade-offs in terms of resource allocation, scalability, and cost efficiency.

Figure 1. The architectural model chosen for deploying multi-tenant systems—pooled, siloed, or mixed—significantly influences the cost-optimization strategy. Each approach offers distinct trade-offs in terms of resource allocation, scalability, and cost efficiency.

Take me to this video

Well-Architected SaaS Lens

The SaaS Lens for the AWS Well-Architected Framework empowers customers to assess and enhance their cloud-based architectures, fostering a deeper understanding of the business implications of their design choices. By bringing together technical leadership and diverse teams to discuss strategies for improving various aspects of the system, the AWS Well-Architected Framework facilitates collaborative decision-making. Moreover, the AWS account team can provide valuable support in conducting these assessments, offering expert guidance and insights. The AWS SaaS Lens specifically focuses on how to design, deploy, and architect multi-tenant SaaS application workloads within the AWS Cloud.

The microservices running in a multi-tenant environment must be able to reference and apply tenant context within each service. At the same time, it’s also our goal to limit the degree to which developers need to introduce any tenant awareness into their code.

Figure 2. The microservices running in a multi-tenant environment must be able to reference and apply tenant context within each service. At the same time, it’s also our goal to limit the degree to which developers need to introduce any tenant awareness into their code.

Take me to this well-architected framework

SaaS anywhere: Designing distributed multi-tenant architectures

Not every SaaS provider has the luxury of running all the moving parts of their solution within their own infrastructure. SaaS teams might support a range of diverse system models, where architectures might include customer-hosted data, edge deployment for parts of the application, and on-premises components. In this session, you can learn the strategies to support the complexities of this distributed model without undermining the resilience, operational efficiency, and agility goals of your solution. The video covers how this influences the onboarding, deployment, and profile management of the SaaS environment.

In this architectural pattern, tenants are demanding to have the ML workload in their environment. So, the SaaS provider only manages the SaaS Control plane where tenants deploy the application plane in their environment, including the ML workload and the necessary components around it.

Figure 3. In this architectural pattern, tenants are demanding to have the ML workload in their environment. So, the SaaS provider only manages the SaaS control plane where tenants deploy the application plane in their environment, including the ML workload and the necessary components around it.

Take me to this video

Deploying multi-tenant SaaS applications on Amazon ECS and AWS Fargate

Containers are frequently employed in multi-tenant SaaS environments to enhance scalability, isolation, and resource efficiency. Developing such systems requires addressing multiple challenges, including tenant isolation, tenant on-boarding, tenant-specific metering, monitoring, and other factors related to multi-tenancy. This session explores how to effectively manage all of these aspects when deploying solutions on AWS Fargate.

Microservices architecture can enhance security isolation by dividing applications into smaller, independent services, reducing the potential impact of a breach.

Figure 4. Microservices architecture can enhance security isolation by dividing applications into smaller, independent services, reducing the potential impact of a breach.

Take me to this video

AWS Serverless SaaS Workshop

Serverless helps to create multi-tenant architectures thanks to services like AWS Lambda that isolate your business logic per request, making them the perfect companion to run a SaaS platform. This workshop provides a hands-on introduction to creating serverless multi-tenant SaaS applications, helping you get started and gain practical experience.

This is the high level architecture of the web application you will use in the AWS Serverless SaaS Workshop. In the labs, you will use this web application to add features that are needed to build this final SaaS application.

Figure 5. This is the high-level architecture of the web application you will use in the AWS Serverless SaaS Workshop. In the labs, you will use this web application to add features that are needed to build this final SaaS application.

Take me to this workshop

See you next time!

Thanks for reading! Multi-tenant SaaS architectures require a careful design of your system. In this post, you have discovered key elements for properly designing your next SaaS workloads. In the next blog, we will talk about modern data architectures.

To revisit any of our previous posts or explore the entire series, visit the Let’s Architect! page.

Achieving Frugal Architecture using the AWS Well-Architected Framework guidance

Post Syndicated from Ashley DeLoach original https://aws.amazon.com/blogs/architecture/achieving-frugal-architecture-using-the-aws-well-architected-framework-guidance/

As part of the re:Invent 2023 keynote, Dr. Werner Vogels introduced the Frugal Architect mindset. This mindset emphasizes the importance of continuous learning, curiosity, and regular revision of architectural choices with a focus on cost and sustainability. Cost and sustainability should be treated as critical non-functional requirements, alongside factors like security, compliance, and performance. The Frugal Architect approach involves measuring and optimizing cost at every stage of the development process, which allows for innovation in parallel with promoting responsible resource usage. In the rapidly-evolving technology landscape, builders should adopt the Frugal Architect mindset to balance innovation with cost efficiency and environmental sustainability.

This blog discusses how the six pillars of the AWS Well-Architected Framework (operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability) align with the seven Frugal Architect laws. It demonstrates how adhering to the principles and best practices outlined in these pillars can help architects and builders effectively implement the Frugal Architect laws in their projects. The Well-Architected Framework provides a comprehensive set of guidelines that embed the concepts of frugality, efficiency, and cost effectiveness, which are the core tenets of the Frugal Architect laws. By following the Framework’s pillars, architects can build secure, reliable, efficient, and cost-optimized systems and promote sustainability.

Make Cost a Non-functional Requirement (Law 1)

Non-functional requirements are criteria that evaluate a system’s operation instead of its specific features or functionality. This includes aspects like accessibility, availability, scalability, security, portability, maintainability, and compliance. However, one crucial non-functional requirement that is often overlooked is cost. Consider implications early on and throughout the design, development, and operation of your systems. Organizations can strike a balance between desired features, time-to-market, and operational efficiency through early prioritization of cost considerations. The Frugal Architect argues that you should treat cost as a fundamental non-functional requirement that should be given upfront consideration when planning and initiating system development projects.

The Cost Optimization Pillar of the AWS Well-Architected Framework provides guidance on how to optimize costs when using AWS Cloud services. It emphasizes treating cost as a key requirement, not an afterthought. The main principles focus on the importance of a robust financial management processes, adoption of a cloud consumption model that allows for flexible scaling and pay-per-use billing, continual measurement of outputs against costs to optimize efficiency, use of managed services to minimize operational overhead, and implementation of transparent cost attribution to tie cloud spending to revenue sources and workloads. Organizations that follow these practices can effectively manage and optimize their costs and benefit from the scalability and agility of cloud computing.

These cost optimization principles can help organizations maximize the financial benefits of using the AWS Cloud and avoid wasteful spending. Cost optimization is an ongoing process that includes rightsizing, higher output for the same cost, and use of the most cost-effective AWS services. The pillar promotes a disciplined approach to evaluate trade-offs between cost and other optimization areas like performance or reliability. Overall, you can use this pillar to make informed decisions to provision and operate AWS services cost-effectively.

Systems that Last Align Cost to Business (Law 2)

The durability and longevity of a system are closely tied to how well its costs align with the underlying business model. During the creation of a system, consider revenue sources and profit drivers. The key is to identify the primary dimension or aspect that generates revenue, and then verify that the system architecture supports and optimizes for that revenue-generating dimension. Essentially, revenue and profitability considerations should be the primary forces behind cost decisions in system design.

The AWS Well-Architected Cost Optimization Pillar provides practices and guidance for organizations to accurately monitor their AWS costs and usage. This visibility helps users understand the profitability of different business units and products, which facilitates informed decisions on resource allocation across the organization. Organizations can implement these practices to gain insights into their AWS spending patterns, which aids in development of effective cost optimization strategies. Overall, accurate expenditure analysis and attribution are crucial for organizations to optimize cloud costs, measure ROI, and make data-driven resource allocation decisions.

It’s important to accurately identify and attribute cloud costs to specific workloads. The cloud allows for transparent cost attribution, which helps organizations link costs to individual revenue streams and workload owners. This granular cost attribution data empowers workload owners to measure return on investment (ROI) for their workloads. With detailed cost information, workload owners can optimize resource utilization and reduce costs by rightsizing resources, eliminating waste, and making informed decisions. Organizations must use accurate cost attribution to understand where their cloud spending is going and verify that resources are being used efficiently across different workloads and revenue streams.

Architecting is a Series of Trade-Offs (Law 3)

Architectural decisions involve trade-offs, particularly between cost, resilience, and performance. Systems will inevitably fail, so investment in resilience is important but may impact performance. It’s important to find the right balance between technical requirements and business needs and align with risk tolerance and budget constraints. Frugality is about maximizing value, not just minimizing spend. Frugality means that you determine what you’re can pay for based on your priorities and make informed trade-off decisions. Ultimately, architectural choices require careful consideration of the tensions between different non-functional requirements.

The AWS Well-Architected Framework helps you make architectural trade-offs through its design principles and practices across its six pillars with your business requirements in mind. As you architect workloads, you make trade-offs between pillars based on your business context. You might optimize to improve the sustainability impact and reduce cost at the expense of reliability in development environments, or for mission-critical solutions. You might optimize reliability with increased costs and sustainability impact. In ecommerce solutions, performance can affect revenue and customer propensity to buy. Security generally is not a viable trade-off against the other pillars.

Rather than optimizing for any single pillar, the Framework guides a holistic evaluation across all pillars to determine the right architectural approach. Organizations can use AWS best practices while they find the optimal balance that aligns with their unique requirements. The key is making intentional trade-off decisions instead of following any uniform approach.

Unobserved Systems Lead to Unknown Costs (Law 4)

Without proper observation and measurement, the true operational costs of a system remain hidden, and wasteful practices can persist unnoticed. Just as exposing a utility meter prompts more mindful usage, visibility increases into costs can drive more sustainable behaviors. While implementing comprehensive monitoring requires upfront investment, the long-term benefits of conserving resources and optimizing efficiency make it a worthwhile endeavor. Ultimately, you should maintain cost awareness to foster a culture of responsible, sustainable practices.

The Operational Excellence Pillar of the AWS Well-Architected Framework emphasizes the importance of observability to gain actionable insights into workloads. This involves creation of key performance indicators (KPIs) and use of observability data telemetry to comprehensively understand workload behavior, performance, reliability, cost, and health. Organizations can implement observability best practices to make informed decisions and take prompt action when business outcomes are at risk due to issues with workload operation. Observability data provides visibility into the current state and helps identify areas for improvement. This means that organizations can be proactive in performance optimization, reliability enhancement, and cost reduction based on the actionable insights derived from observability telemetry data. Overall, observability is crucial for maintenance of operational excellence through the use of data-driven decision-making and continuous improvement of workloads.

Overall, monitoring guidance is a core component across multiple pillars of the Well-Architected Framework, as it helps organizations effectively manage and optimize their cloud workloads. For more detail on the monitoring principles of the AWS Well-Architected Framework, see Cost-Aware Architectures Implement Cost Controls (Law 5).

Cost-Aware Architectures Implement Cost Controls (Law 5)

The key aspects of frugal architecture combine granular controls with robust monitoring to identify areas for optimization. This helps you optimize costs and maintain a good user experience. With a robust monitoring system, you can take action where improvements are needed.

The AWS Well-Architected Framework aligns with the concept of frugality, which focuses on maximizing value rather than just minimizing spending. The Framework helps businesses achieve maximum value by making architectural choices that meet their specific requirements.

The Cost Optimization Pillar emphasizes the continual monitoring of usage and costs to identify opportunities for efficiency improvements and cost savings. This includes expenditure analysis, adoption of consumption-based models, and implementation of cloud financial management practices.

The Security Pillar, Reliability Pillar, and Performance Efficiency Pillar reinforce the importance of monitoring systems, workloads, and costs in real-time to maintain security, automatically recover from failures, and optimize performance relative to cost.

The Sustainability Pillar focuses on measurement of a workload’s current and forecasted environmental impact. It recommends continual evaluation of new hardware and software offerings that can reduce the environmental footprint.

Overall, monitoring guidance spans multiple Well-Architected pillars to maximize value through optimization of cost, performance, security, reliability, and sustainability.

Cost Optimization is Incremental (Law 6)

Cost efficiency is a continuous process, not a one-time goal. Regularly monitor your systems to identify inefficient patterns and areas for optimization. Revisit and refine systems periodically to find additional opportunities for improvement and further reduce costs over time.

The Cost Optimization Pillar covers principles like analysis and attribution of expenditure, measurement of overall efficiency, adoption of a consumption model, and implementation of cloud financial management practices.

Additionally, the Operational Excellence Pillar provides principles that apply not just to cost optimization but all pillars. These include observability for actionable insights, safe automation where possible, frequent small reversible changes, frequent refinement of operations procedures, anticipation of failure, and documentation and distribution of learning from operational events and metrics.

Organizations can follow these AWS Well-Architected Framework principles and their practices to continuously improve their cloud architectures and operations and optimize costs effectively.

Unchallenged Success Leads to Assumptions (Law 7)

We should continue to reevaluate past approaches, even those that were previously successful. Just because something worked before does not mean that it is still the best method. Grace Hopper, a computer scientist, mathematician, and United States Navy rear admiral, cautioned against blind adherence to tradition, saying that “we’ve always done it this way” is a dangerous mindset. We must be willing to question the old ways and explore new and potentially better methods.

The AWS Well-Architected Framework advocates for an evolutionary architecture approach to system design. Traditional architectures are often designed as static, with only a few major version updates during the system’s lifetime. However, as businesses and requirements change over time, initial architectural decisions can limit the ability to adapt and evolve the system. Cloud computing enables capabilities like automated testing and lower-risk design changes, which allows systems to evolve continually rather than being constrained by the original design. An evolutionary architecture positions businesses to take advantage of new innovations and changes as part of standard practice. Rather than being locked into original architectural choices, an evolutionary approach fosters ongoing adaptation and modernization as requirements shift. This contrasts with traditional fixed architectures that make it difficult to evolve over time and provides greater flexibility to evolve systems iteratively.

The Operational Excellence Pillar includes implementation of observability to understand system behavior, safe automation of processes, frequent but reversible changes, regular refinement of operations procedures, proactive anticipation potential failures proactively, and distribution of learnings from operational events and metrics to drive continuous improvement.

Overall, the Well-Architected Framework provides guidance on evolutionary architecture and operations processes to effectively manage increasing software complexity over time.

Conclusion

Frugality is about maximizing value, rather than just minimizing costs. Following AWS Well-Architected Framework best practices regarding security, reliability, and operational excellence can help realize frugal yet robust architectures. True frugality involves optimizing costs by aligning spending with areas that deliver the highest business value and impact. The Well-Architected Framework provides guidance for making architectural decisions that increase efficiency, lower risks, and maximize return on cloud investments. This involves determining priorities, understanding sources of value, and making informed trade-off decisions based on those priorities. It’s important to avoid indiscriminate cost-cutting and instead focus on resources on what matters most to drive value for the organization. By following Well-Architected best practices, companies can practice frugality in a strategic way that balances optimization with business goals.

Start your Frugal Architecture journey with AWS Well-Architected today by reading the documentation or visiting the AWS Well-Architected Tool in the console.

How Amazon GTTS runs large-scale ETL jobs on AWS using Amazon MWAA

Post Syndicated from Louis Hourcade original https://aws.amazon.com/blogs/big-data/how-amazon-gtts-runs-large-scale-etl-jobs-on-aws-using-amazon-mwaa/

The Amazon Global Transportation Technology Services (GTTS) team owns a set of products called INSITE (Insights Into Transportation Everywhere). These products are user-facing applications that solve specific business problems across different transportation domains: network topology management, capacity management, and network monitoring. As of this writing, GTTS serves around 10,000 customers globally on a monthly basis, managing the outbound transportation network.

INSITE applications are in general data intensive. They ingest and transform large volumes of data in different formats and processing patterns (such as batch and near real time) from various sources internal and external to Amazon. Datasets are often shared between applications both within domains and across domains, and are consumed in complex data pipelines that run under tight SLAs. To enable and meet these requirements, GTTS built its own data platform.

A critical component of the data platform is the data pipeline orchestrator. GTTS built its own orchestrator named Langley in 2018, and used it to schedule and monitor extract, transform, and load (ETL) jobs on a variety of compute platforms, such as Amazon EMR, Amazon Redshift, Amazon Relational Database Service (Amazon RDS).

As the Langley user base grew, GTTS engineers faced a couple of challenges on key dimensions, such as maintainability, scalability, multi-tenancy, observability, and interoperability.

Amazon GTTS partnered with AWS Professional Services to modernize their orchestration platform, relying as much as possible on managed services with auto scaling capabilities. After analyzing candidate solutions, the team decided to build a target solution relying on Amazon Managed Workflows for Apache Airflow (Amazon MWAA). This post elaborates on the drivers of the migration and its achieved benefits.

Legacy platform

Amazon GTTS works with diverse and distributed data stores, storing petabytes of data. Data engineers need a tool to define ETL jobs which run on various compute environments, as illustrated in the following diagram.

Amazon GTTS orchestration platfrom - high-level diagram

GTTS built Langley as their custom orchestrator in 2018, and have been operating it ever since. At a high level, the core of Langley’s architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata. It also uses AWS Data Pipeline to run SQL-based workloads, Amazon Simple Storage Service (Amazon S3) to store configuration files, and Amazon CloudWatch for alarming on failures. Every day, Langley handles the lifecycle of more than 17,000 ETL jobs in Europe and 5,000 ETL jobs in North America.

The following diagram illustrates the Langley architecture.

Langley architecture diagram

Business challenges

Langley started as a simple solution to a team-internal problem, but its growth over the years surfaced key issues:

  • The maintenance of this custom solution requires considerable time from engineers, which increased over the years with the release of new features, increasing the overall complexity.
  • The Langley user base grew continuously and eventually became a key orchestration platform for multiple teams and products across Amazon. However, it wasn’t created with multi-tenancy in mind and therefore it didn’t provide the robustness and the appropriate level of isolation to guard each tenant from impacting others on the shared platform.
  • In 2023, AWS announced the upcoming deprecation of Data Pipeline, one of the core services used by Langley.

GTTS partnered with AWS to design and implement a solution to overcome those challenges. AWS used the following evaluation matrix to build a durable solution:

Maintainability The level of effort required to maintain the orchestrating system in a functional state, encompassing updates, patches, bug fixes, and routine checks for optimal performance.
Costs The overall expenditure associated with the orchestrator, including infrastructure costs, licensing fees, personnel expenses, and other relevant costs. This criterion particularly assesses the system’s ability to effectively control and reduce costs.
Scheduling The capabilities related to running and scheduling jobs, including the ability to resume an ETL job from a failed step.
User experience The overall satisfaction and usability of a system from the end-users’ perspective, considering factors such as responsiveness, accessibility, interoperability, and ease of use.
Security Mechanisms in place to safeguard data and applications from unauthorized access at all times.
Monitoring and alerting The continuous observation and analysis of system components and performance metrics to detect and address issues, optimize resource usage, and provide overall health and reliability.
Scalability The orchestrator’s capacity to efficiently adapt its resources to handle increased workload or demand, providing sustained performance.

Among the explored solutions, Amazon MWAA was finally determined as the best overall performer across this matrix.

The next section is a dive deep into the rationales that led GTTS and AWS Professional Services to choose Amazon MWAA as the best performer.

Benefits of migrating to Amazon MWAA

Amazon GTTS and AWS Professional Services worked together to release a Minimum Viable Product (MVP) of the solution described earlier, which showcases the benefits on the agreed decision criteria.

Maintainability

With their legacy system, Amazon GTTS had to manage the orchestrator database, web servers, activity queue, dispatch functions, and worker nodes.

Amazon MWAA eliminates the need for underlying infrastructure management. It takes care of provisioning and maintenance of the Apache Airflow web server, scheduler, worker nodes, and relational database, allowing GTTS teams to focus on building their ETL jobs.

Amazon MWAA offers one-click updates of the infrastructure for minor versions, like moving from Airflow version x.4.z to x.5.z. During the upgrade process, Amazon MWAA captures a snapshot of your environment metadata; upgrades the workers, schedulers, and web server to the new Airflow version; and finally restores the metadata database using the snapshot, backing it with an automated rollback mechanism.

Costs

Amazon MWAA contributes to a more cost-effective solution by automatically scaling workers depending on the workload. This dynamic scaling in and out avoids over-provisioning and allows the organization to pay for the compute they actually use, without the risk of downtime during activity spikes. Because this is an AWS-managed solution, it also reduced GTTS’s Total Cost of Ownership (TCO) by freeing up time from engineers that were managing the legacy system.

Scheduling

Amazon MWAA supports all the trigger mechanisms that the Amazon orchestrator needed:

  • Manual trigger – The users can simply invoke a Direct Acyclic Graph (DAG) using the Airflow API or even more simply via the User Interface (UI).
  • Scheduler – A scheduler can be defined as code, together with the DAG definition, to make sure it will run at specific rates (from hourly to yearly) or on specific cron schedules.
  • Event-driven trigger – Airflow provides native operators that enable invoking a downstream DAG from another DAG or from a dataset update (push approach). It also includes sensors that listen for the completion of a task external to the DAG (pull approach).
  • Partial runs on DAG failures – Another key feature for GTTS was the possibility the recover from partial DAG failures without having to rerun the whole DAG. Airflow provides task-level controls that makes this operation straightforward to implement.

User experience

In this section, we discuss three aspects of the user experience: the web UI, the interoperability, and the programming interface.

Web UI

Amazon MWAA comes with a managed web server that hosts the Airflow UI. As a result, and without any maintenance needed, you can use it to quickly run DAGs, check run history, visualize dependencies between DAGs, troubleshoot with a direct access to task logs, manage variables and database connections, and define granular permissions. The following screenshot shows an example of the UI.

Amazon MWAA User Interface - console screenshot

Interoperability

One of the most important features evaluated was the ability for the new orchestrator to effortlessly integrate with GTTS multiple data storage services, compute components, and monitoring services.

Amazon MWAA comes with a wide variety of providers preinstalled, such as apache-airflow-providers-amazon, apache-airflow-providers-postgres, and apache-airflow-providers-common-sql. This allowed GTTS to connect with those services using multiple connection methodologies, including AWS IAM Identity Center or AWS Secrets Manager password-based authentications, without having to write a single custom Airflow operator.

Amazon MWAA also makes it straightforward to upgrade providers version and install new ones. By providing a requirements.txt file, GTTS was able to change the major version of apache-airflow-providers-amazon and install the apache-airflow-providers-mysql provider.

Programming interface

Airflow is an orchestrator with a low barrier to entry, especially for those familiar with the Python programming language. Its workflow management is defined in Python scripts, with a well-documented set of native operators and external providers, making it straightforward for Python developers to get started with Airflow and create complex data pipelines.

The following are two key Airflow features:

  • TaskFlow API – The TaskFlow API removes a lot of the boilerplate code required by traditional operators by using Python decorators while simplifying the DAG editing process DAG with cleaner and more concise DAG files.
  • Dynamic DAG generation – The dynamic DAG generation capability allowed us to generate DAGs from the original legacy orchestrator’s configuration files. This enabled the platform team to build a centralized framework consumed by multiple teams to keep the code DRY (Don’t Repeat Yourself), providing a seamless migration journey from the legacy orchestrator.

The following screenshot shows an example of these features.

Airflow dynamic DAG definition - code sample

Security

The new Amazon MWAA-based architecture improves GTTS’s posture by introducing granular access control. Amazon MWAA integrates with AWS services such as AWS Key Management Service (AWS KMS), Secrets Manager, and IAM Identity Center to keep data safely encrypted at all times, both at rest and in transit using TLS-based communications. Airflow also includes a role-based access control (RBAC) model to determine what users can do on the platform and enforce the principle of least privilege. Amazon MWAA also natively integrates with AWS CloudTrail for auditing purposes.

The Airflow RBAC model enables administrators to define roles with specific privileges to access Airflow system settings and DAGs themselves. This granular access control reduces the risk of data breaches and malicious activities by limiting access to critical DAGs and sensitive Airflow environment variables. Airflow includes five default roles with different sets of permissions (as shown in the following screenshot), but it is possible to create new roles depending on your security requirements.

Airflow roles - console screenshot

GTTS used the Airflow RBAC model to restrict permissions of certain teams and consumers of the application. They also used priority weights and Airflow pools to prioritize tasks and control run concurrency. However, if you want to run a multi-tenant orchestration platform, it’s recommended to use a separate environment for each team. You can assume that everything accessible by the Amazon MWAA role is also accessible to users who can write DAGs to the environment.

To ease authentication in Amazon MWAA, GTTS federated their identity provider (IdP) through Amazon Cognito and SAML. With this integration, users log in to the Amazon MWAA UI using the same identity as in other internal systems, which removes the need for new credentials. The user’s group membership is retrieved from the IdP through Amazon Cognito, and a Lambda function redirects the user to Amazon MWAA with the appropriate Airflow role. This process is illustrated in the following architecture, and is abstracted from the user and attached to a public Application Load Balancer that redirects at the end of the process to an Amazon MWAA private cluster, making the authentication workflow seamless and secure. Refer to Accessing a private Amazon MWAA environment using federated identities to implement it using your own IdP.

Amazon MWAA federation - architecture diagram

Monitoring and alerting

Amazon MWAA integrates with CloudWatch, which manages all infrastructure logs for you. When creating an Amazon MWAA environment, you can configure what level of logs should be saved. GTTS enabled CloudWatch logging for all of the five types of components: Airflow task logs, Airflow web server logs, Airflow scheduler logs, Airflow worker logs, and Airflow DAG processing logs.

Amazon MWAA logging configuration - console screenshot

These logs are all accessible in CloudWatch for continuous monitoring, but Amazon MWAA users can also access task logs directly from the Airflow UI by looking at the DAG run history. The following screenshot shows an example of task-level logs in Airflow 2.5.1.

Amazon MWAA task-level logs - console screenshot

You can also build CloudWatch monitoring dashboards to keep an eye on the state of your environment and alert administrators when required. Amazon MWAA natively provides Airflow environment metrics and Amazon MWAA infrastructure-related metrics.

Scalability

Each Amazon MWAA environment includes the schedulers, web server, and worker nodes. Scheduler nodes are responsible for the overall orchestration and parsing of DAG files. These tasks happen in worker nodes that Amazon MWAA auto scales up and down according to system load. When creating a new Amazon MWAA environment, you need to specify the type of worker nodes, the minimum and maximum number of worker nodes, and the scheduler count, as shown in the following screenshot.

Amazon MWAA environment classes - console screenshot

There are notably two ways GTTS controlled how Amazon MWAA scales to handle the load:

  • Minimum and maximum worker count – Amazon MWAA automatically adds or deletes workers within the boundaries you set, depending on the number of tasks that are waiting to be processed. As indicated in the AWS documentation, it is possible to request a quota increase to run up to 50 workers in a single environment.
  • Size of the node – Larger worker nodes can run more concurrent tasks. For example, mw1.small instances run 5 concurrent tasks by default, whereas mw1.large instances run 20 concurrent tasks by default. The following figure shows the specification for each instance type.

Amazon MWAA environment sizes - console screenshot

With Amazon MWAA, GTTS can therefore run up to 4,000 concurrent tasks in a single Amazon MWAA environment (50 worker nodes x 80 tasks per node with mw1.2xlarge). This remains an order of magnitude for the load that can fit into the workers vCPUs and RAM, but it is possible to edit the default configuration to add even more tasks per worker. For more information regarding Amazon MWAA automatic scaling, see Configuring Amazon MWAA automatic scaling.

The Amazon MWAA based orchestration platform

After selecting Amazon MWAA as the core service for their orchestrating system, Amazon GTTS and AWS worked together to develop an end-to-end data platform with automation capabilities, access management, monitoring, and integration with downstream systems. The following diagram illustrates the solution architecture.

MWAA-based platform - architecture diagram

The following are notable components of the architecture:

  1. DAG update – GTTS Developers manage the creation, update, and deletion of Amazon MWAA DAGs through a dedicated code repository. When a developer edits DAG definitions and commits changes to the code repository, a CI/CD pipeline automatically packages the DAG definition and stores it in Amazon S3, which automatically updates DAGs in Amazon MWAA.
  2. Infrastructure as code – The entire stack is defined as IaC with the AWS CDK, which eases the process of updating components, and makes it repeatable if GTTS wants to extend the solution and redeploy the stack in multiple AWS Regions.
  3. Authentication, authorizations, and Permissions – Permissions are centrally managed with AWS Identity and Access Management (IAM) together with Airflow roles. GTTS integrated their identity provider with Amazon Cognito and Amazon MWAA, so Amazon employees can connect to the Amazon MWAA UI with the same authentication tool they are used to, and see only the DAGs they are allowed to access.
  4. UI and DAG runs – Amazon MWAA includes an AWS-managed web server that exposes the Airflow UI. Amazon employees can connect to this UI to list DAGs, run DAGs, and track their status. In addition, GTTS used the native Amazon MWAA scheduler to automatically invoke DAGs at a specific time.
  5. Airflow workers – The users can use Airflow native providers to run custom Shell or Python code directly on the workers nodes. For compute-intensive jobs, the Amazon MWAA worker can delegate the compute to a more suitable AWS service, such as Apache Spark running on Amazon EMR on Amazon EKS, which will provide compute resources only for the duration of the job, helping in optimizing costs.
  6. Data stores and external computes services – Amazon MWAA comes also with the AWS provider preloaded, allowing a seamless connectivity with more than 23 AWS compute and data services. GTTS can extend the connectivity to other AWS or external services by using Boto3 with the PythonOperator or creating dedicated custom operators.
  7. Logging and alerting – Amazon MWAA is seamlessly integrated with CloudWatch and CloudTrail to publish DAG logs, audit logs, and metrics. This enables GTTS to track completion, troubleshoot, and create an automated alerting and notifications system so DAGs owners can take remediation actions as fast as possible.

Conclusion

Amazon GTTS partnered with AWS Professional Services to overcome the challenges faced by their legacy custom orchestrator against various dimensions such as maintainability, cost efficiency, security, scalability, and observability.

The new Amazon MWAA-based architecture offers significant improvements in the context of the AWS Well-Architected Framework compared to their former system. In terms of operational excellence, the new orchestration platform is built with evolutivity in mind and enables the GTTS team to use the most adapted ETL service to run their jobs. Regarding performance efficiency, GTTS observed up to 70% improvement in end-to-end runtime on their jobs running in Amazon MWAA. In terms of security, the new solution implements best practices such as the deployment in private subnets, authentication of users through Amazon internal federation systems, and data encryption at rest and in transit. Reliability is achieved with Multi-AZ failover and built-in auto scaling to meet the workload demand at all times. Finally, cost is reduced because Amazon MWAA is an AWS-managed service, which decreases the human effort from GTTS to maintain the orchestration platform.

Amazon GTTS is now bringing the MVP into production, where it is planned to handle petabytes of data and host more than 2,000 jobs migrated from the legacy system. Additionally, the migration to Amazon MWAA has empowered GTTS to enhance its operational scalability, paving the way for the integration of new jobs and further expansion with greater efficiency and confidence.

To learn more, refer to the following resources:


About the Authors

Béntor Bautista is a Senior Data Engineer at Amazon GTTS
Louis Hourcade is a Solutions Architect at AWS
Raphael Ducay is a Senior DataOps Architect at AWS
Konstantin Zarudaev is a DevOps Consultant at AWS
Dorra Elboukari is a DevOps Architect at AWS
Marcin Zapal is an Engagement Manager at AWS
Grigorios Pikoulas is a Strategic Program Lead at AWS
Antonio Cennamo is a Senior Customer Practice Manager at AWS

Introducing the Māori Data Lens for the Well-Architected Framework

Post Syndicated from Craig Hind original https://aws.amazon.com/blogs/architecture/introducing-the-maori-data-lens-for-the-well-architected-framework/

In Aotearoa New Zealand, we have been listening and learning to better understand Māori aspirations when using cloud technology. We have been learning from Māori customers, partners, and advisors who have helped us on this journey. A common theme was how to safeguard Māori data in a digital world. Together with a group of Māori advisers, we are excited to introduce the first iteration of a Māori Data Lens for the AWS Well-Architected Framework. This lens is the first of its kind for AWS globally that focuses on indigenous data, specifically Māori data considerations.

An AWS Well-Architected Framework lens is designed to provide a technology, industry, or domain specific perspective aligned with the AWS Well-Architected Framework. The Māori Data Lens allows customers to apply important Māori data considerations when designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. This lens is designed to be a living resource that can grow and adapt alongside the evolving questions and considerations Māori have about how to secure and protect their data as a taonga (treasure). We hope this lens will be valuable in empowering individuals and organisations to design, build, and operate applications and workloads in the AWS Cloud in ways that can align with Māori values and expectations across the six pillars of the Well-Architected Framework: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. These are not a rigid set of rules, but instead a set of guiding principles.

This lens is designed to complement invaluable te ao Māori knowledge and expertise. It’s important to consult, build trust, and reflect Māori voices in digital and technology choices. AWS customers can consult and partner with their Māori customers to build systems in a way that responsibly interact with their Māori data. This lens is a framework of practical questions and considerations. When combined with Māori knowledge and expertise, AWS customers can begin to use cloud technology in a way that empowers adherence to important cultural and ethical dimensions for safeguarding Māori data. As a starting point, we have sought to align the insights shared with us to our AWS best practices for architecting secure, reliable, and cost-effective applications in the AWS Cloud. Together, these guidelines support the durability and protection of Māori data.

At AWS, we have always believed it is essential that our customers have control over their data. We strive to give customers choice in how to secure and manage their data in the cloud in accordance with their needs. This has been true from the very beginning, when we were the only major cloud provider to allow our customers to control the geographic location of their data, never moving customer data without explicit instruction from the customer.

The launch of an AWS Local Zone in Auckland in 2023 and the coming AWS Region in Aotearoa gives all New Zealanders the choice to store their data onshore in Aotearoa New Zealand in the AWS Cloud without compromising on performance, innovation, scale, or security. We also know that some customers have needs that go beyond where their data is stored. We’re committed to expanding our understanding and our capability to help all customers meet their particular needs and best serve their own customers, to protect their data, and to meet legal and regulatory requirements.

Advice and feedback to date has been instrumental in helping to shape this resource, and we are deeply grateful for the ongoing partnership and insights. We recognise there are different perspectives, and that tikanga (protocol/practice) and experience among Māori on this topic continues to evolve. We welcome feedback on enhancing this resource to better serve the needs of our Māori customers and partners. To provide feedback, reach out to us using the feedback feature on the lens document or your local AWS account team.

We’d like to thank AWS partner HTK Group, as well as Māori technology and data experts who advised us on this work including Renata Hakiwai, Lee Timutimu, Nikora Ngaropo, Ngapera Riley, Wade Reweti, Atawhai Tibble and Eli Pohio. In their words:

“In this rapidly evolving digital landscape, the importance of understanding, organising, and harnessing data cannot be overstated. For our Māori communities, this holds even greater significance, as data can be a taonga or a treasure that represents the collective wisdom and knowledge passed down through generations.

Nā tō rourou, nā taku rourou, ka ora ai te iwi. With your food basket and my food basket, the people will thrive.

Mauri ora!”

Read the Māori Data Lens, or contact your AWS account team for more information.