Tag Archives: Architecture

Field Notes: Perform Automations in Ungoverned Regions During Account Launch Using AWS Control Tower Lifecycle Events

Post Syndicated from Amit Kumar original https://aws.amazon.com/blogs/architecture/field-notes-perform-automations-in-ungoverned-regions-during-account-launch-using-aws-control-tower-lifecycle-events/

This post was co-authored by Amit Kumar, Partner Solutions Architect at AWS; Pavan Kumar Alladi, Senior Cloud Architect at Tech Mahindra; and Thooyavan Arumugam, Senior Cloud Architect at Tech Mahindra.

Organizations use AWS Control Tower to set up and govern secure, multi-account AWS environments. Enterprises with a global presence frequently want to run automations during account creation, including in AWS Regions where the AWS Control Tower service is not available. To review the current list of Regions where AWS Control Tower is available, visit the AWS Regional Services List.

This blog post shows how to use AWS Control Tower lifecycle events, AWS Service Catalog, and AWS Lambda to perform automation in a Region where the AWS Control Tower service is unavailable. The solution covers a single-Region scenario and needs to be adapted to work with multiple Regions.

As an example, we use an AWS CloudFormation template that creates a virtual private cloud (VPC) with a subnet and an internet gateway. The template is published as an AWS Service Catalog product and shared at the organization level so that it is available in child accounts. Every time an AWS Control Tower lifecycle event related to account creation occurs, a Lambda function is initiated to perform automation activities in AWS Regions that are not governed by AWS Control Tower.
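The lifecycle event that triggers the automation carries the new account's details. The sketch below shows one way to pull them out, using the same event keys that the Lambda function later in this post reads. The function name and the sample account values are hypothetical.

```python
from typing import Optional, Tuple


def extract_new_account(event: dict) -> Optional[Tuple[str, str]]:
    """Return (account_id, account_name) for a successful CreateManagedAccount event.

    Returns None when the event has another shape or the account creation
    did not succeed, so the caller can skip the automation.
    """
    details = event.get("detail", {}).get("serviceEventDetails", {})
    status = details.get("createManagedAccountStatus")
    if not status or status.get("state") != "SUCCEEDED":
        return None
    account = status["account"]
    return account["accountId"], account["accountName"]


# Trimmed sample event; the account values are made up for illustration.
sample_event = {
    "detail": {
        "eventName": "CreateManagedAccount",
        "serviceEventDetails": {
            "createManagedAccountStatus": {
                "state": "SUCCEEDED",
                "account": {"accountId": "111122223333", "accountName": "workload-dev"},
            }
        },
    }
}

print(extract_new_account(sample_event))
```

Returning None for anything other than a successful creation keeps the handler from acting on failed or unrelated events.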

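The solution in this blog post uses the following AWS services:

  • AWS Control Tower
  • AWS Service Catalog
  • AWS CloudFormation
  • AWS Lambda
  • AWS Identity and Access Management (IAM)
  • Amazon Simple Notification Service (Amazon SNS)
  • Amazon EventBridge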

Figure 1. Solution architecture



For this walkthrough, you need the following prerequisites:

  • AWS Control Tower configured with AWS Organizations defined and registered within AWS Control Tower. For this blog post, AWS Control Tower is deployed in the AWS Mumbai Region, with an AWS Organizations structure as depicted in Figure 2.
  • Working knowledge of AWS Control Tower.
Figure 2. AWS Organizations structure


Create an AWS Service Catalog product and portfolio, and share at the AWS Organizations level

  1. Sign in to AWS Control Tower management account as an administrator, and select an AWS Region which is not governed by AWS Control Tower (for this blog post, we will use AWS us-west-1 (N. California) as the Region because at this time it is unavailable in AWS Control Tower).
  2. In the AWS Service Catalog console, in the left navigation menu, choose Products.
  3. Choose Upload new product. For Product Name, enter customvpcautomation, and for Owner, enter organizationabc. For Method, choose Use a template file.
  4. In Upload a template file, select Choose file, and then select the CloudFormation template you are going to use for automation. In this example, we are going to use a CloudFormation template which creates a VPC with CIDR, Public Subnet, and Internet Gateway.
Figure 3. AWS Service Catalog product


CloudFormation template: save this as a YAML file before selecting this in the console.

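AWSTemplateFormatVersion: 2010-09-09
Description: Template to create a VPC with a Public Subnet and Internet Gateway.

Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      # Placeholder: the CIDR value is missing here; set the VPC CIDR block for your environment.
      CidrBlock: <VPC CIDR block>
      EnableDnsSupport: true
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: VPC

  IGW:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: IGW

  VPCtoIGWConnection:
    Type: AWS::EC2::VPCGatewayAttachment
    DependsOn:
      - IGW
      - VPC
    Properties:
      InternetGatewayId: !Ref IGW
      VpcId: !Ref VPC

  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    DependsOn: VPC
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: Public Route Table

  # Logical ID assumed for illustration.
  PublicRoute:
    Type: AWS::EC2::Route
    DependsOn:
      - PublicRouteTable
      - VPCtoIGWConnection
    Properties:
      # Assumption: default route to the internet gateway.
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref IGW
      RouteTableId: !Ref PublicRouteTable

  PublicSubnet:
    Type: AWS::EC2::Subnet
    DependsOn: VPC
    Properties:
      VpcId: !Ref VPC
      MapPublicIpOnLaunch: true
      # Placeholder: set a subnet CIDR block inside the VPC range.
      CidrBlock: <public subnet CIDR block>
      AvailabilityZone: !Select
        - 0
        - !GetAZs
          Ref: AWS::Region
      Tags:
        - Key: Name
          Value: Public Subnet

  # Logical ID assumed for illustration.
  PublicRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    DependsOn:
      - PublicRouteTable
      - PublicSubnet
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref PublicSubnet

Outputs:
  # Output logical IDs assumed for illustration.
  PublicSubnetId:
    Description: Public subnet ID
    Value:
      Ref: PublicSubnet
    Export:
      Name:
        'Fn::Sub': '${AWS::StackName}-SubnetID'

  VpcId:
    Description: The VPC ID
    Value:
      Ref: VPC
    Export:
      Name:
        'Fn::Sub': '${AWS::StackName}-VpcID'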
  5. After the CloudFormation template is selected, choose Review, and then choose Create Product.
Figure 4. AWS Service Catalog product

  6. In the AWS Service Catalog console, in the left navigation menu, choose Portfolios, and then choose Create portfolio.
  7. For Portfolio name, enter customvpcportfolio, for Owner, enter organizationabc, and then choose Create.
Figure 5. AWS Service Catalog portfolio

  8. After the portfolio is created, select customvpcportfolio. In the Actions dropdown, select Add product to portfolio. Then select the customvpcautomation product, and choose Add Product to Portfolio.
  9. Navigate back to customvpcportfolio, and select the portfolio name to see all the details. On the portfolio details page, expand the Groups, roles, and users tab, and choose Add groups, roles, users. Next, select the Roles tab, search for the AWSControlTowerAdmin role, and choose Add access.
Figure 6. AWS Service Catalog portfolio role selection

  10. Navigate to the Share section in the portfolio details, and choose the Share option. Select AWS Organization, and choose Share.

Note: If you get a warning stating “AWS Organizations sharing is not enabled”, choose Enable and select the organizational unit (OU) with which you want to share this portfolio. In this case, we shared the portfolio with the Workload OU, where all workload accounts are created.

Figure 7. AWS Service Catalog portfolio sharing
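The console sharing step can also be scripted. The following is a minimal sketch, assuming a boto3 Service Catalog client is passed in; the helper name and the OU ID in the usage comment are hypothetical, while `create_portfolio_share` and its `OrganizationNode` argument are part of the boto3 Service Catalog client.

```python
def share_portfolio_with_ou(sc_client, portfolio_id: str, ou_id: str) -> dict:
    """Share a Service Catalog portfolio with one organizational unit.

    sc_client is expected to be boto3.client("servicecatalog", region_name=...)
    created in the management account, in the Region that holds the portfolio.
    """
    return sc_client.create_portfolio_share(
        PortfolioId=portfolio_id,
        OrganizationNode={"Type": "ORGANIZATIONAL_UNIT", "Value": ou_id},
    )


# Example call (not executed here; the OU ID is a made-up placeholder):
#   import boto3
#   sc = boto3.client("servicecatalog", region_name="us-west-1")
#   share_portfolio_with_ou(sc, "port-tbmq6ia54yi6w", "ou-abcd-12345678")
```

Passing the client in, rather than creating it inside the helper, keeps the Region choice explicit and makes the function easy to exercise with a stub.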


Create an AWS Identity and Access Management (IAM) role

  1. Sign in to AWS Control Tower management account as an administrator and navigate to IAM Service.
  2. In the IAM console, choose Policies in the navigation pane, then choose Create Policy.
  3. Click on Choose a service, and select STS. In the Actions menu, choose All STS Actions, in Resources, choose All resources, and then choose Next: Tags.
  4. Skip the Tag section, go to the Review section, and for Name enter lambdacrossaccountSTS, and then choose Create policy.
  5. In the navigation pane of the IAM console, choose Roles, and then choose Create role. For the use case, select Lambda, and then choose Next: Permissions.
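  6. Select AWSServiceCatalogAdminFullAccess, AmazonSNSFullAccess, and the lambdacrossaccountSTS policy you created in the earlier steps (the Lambda function needs it to assume a role in the member account). Then choose Next: Tags (skip the tag screen if needed), and then choose Next: Review.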
  7. For Role name, enter Automationnongovernedregions, and then choose Create role.
Figure 8. AWS IAM role permissions
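The steps above grant all STS actions on all resources. The Lambda function in this post only calls AssumeRole on the AWSControlTowerExecution role, so a narrower policy is one possible alternative. The sketch below builds such a policy document; it is not the policy the steps create, and the helper name is hypothetical.

```python
import json


def sts_policy(assume_role_arn: str = "*") -> dict:
    """Build an IAM policy document that allows sts:AssumeRole on one resource."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": assume_role_arn,
            }
        ],
    }


# Scoped variant: only the Control Tower execution role, in any account.
scoped = sts_policy("arn:aws:iam::*:role/AWSControlTowerExecution")
print(json.dumps(scoped, indent=2))
```

The JSON printed here could be pasted into the JSON tab of the policy editor in place of the visual selections.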


Create an Amazon Simple Notification Service (Amazon SNS) topic

  1. Sign in to AWS Control Tower management account as an administrator and select AWS Mumbai Region (Home Region for AWS CT). Navigate to Amazon SNS Service, and on the navigation panel, choose Topics.
  2. On the Topics page, choose Create topic. On the Create topic page, in the Details section, for Type select Standard, and for Name enter ControlTowerNotifications. Keep the defaults for the other options, and then choose Create topic.
  3. In the Details section, in the left navigation pane, choose Subscriptions.
  4. On the Subscriptions page, choose Create subscription. For Protocol, choose Email. For Endpoint, enter the email address that should receive the notifications, and then choose Create subscription.

You will receive an email stating that the subscription is in pending status. Follow the email instructions to confirm the subscription. Check in the Amazon SNS Service console to verify subscription confirmation.

Figure 9. Amazon SNS topic creation and subscription
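For repeatable setups, the topic and subscription can be created from code as well. This is a sketch under the assumption that a boto3 SNS client for the home Region is passed in; the helper name is hypothetical, and the email subscription still has to be confirmed from the inbox as described above.

```python
def create_email_topic(sns_client, topic_name: str, email: str) -> str:
    """Create a standard SNS topic and subscribe one email address to it.

    sns_client is expected to be boto3.client("sns", region_name=...).
    Returns the topic ARN, which the Lambda function needs later.
    """
    topic_arn = sns_client.create_topic(Name=topic_name)["TopicArn"]
    sns_client.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn


# Example call (not executed here; the address is a placeholder):
#   import boto3
#   sns = boto3.client("sns", region_name="ap-south-1")
#   create_email_topic(sns, "ControlTowerNotifications", "ops@example.com")
```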


Create an AWS Lambda function

  1. Sign in to AWS Control Tower management account as an administrator and select AWS Mumbai Region (Home Region for AWS Control Tower). Open the Functions page on the Lambda console, and choose Create function.
  2.  In the Create function section, choose Author from scratch.
  3. In the Basic information section:
    1. For Function name, enter NonGovernedCrossAccountAutomation.
    2. For Runtime, choose Python 3.8.
    3. For Role, select Choose an existing role.
    4. For Existing role, select the Lambda role that you created earlier.
  4. Choose Create function.
  5. Copy and paste the following code into the Lambda editor (replace the existing code).
  6. In the File menu, choose Save.

Lambda function code: The Lambda function launches the AWS Service Catalog product, which is shared at the Organizations level from the AWS Control Tower management account, in the member accounts in a hub-and-spoke model. Key activities performed by the Lambda function are:

    • Assume role – Assumes the AWSControlTowerExecution role in the child account.
    • Launch product – Launches the AWS Service Catalog product shared in the non-governed Region with the member account.
    • Email notification – Sends notifications to the subscribed recipients.

When this Lambda function is invoked by the AWS Control Tower lifecycle event, it performs the activity of provisioning the AWS Service Catalog products in the Region which is not governed by AWS Control Tower.
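As part of the launch activity, the function in this section provisions the most recently created active version of the product. That selection can be sketched in isolation as follows, assuming each dictionary has the `Id`, `Active`, and `CreatedTime` fields that the function reads from `ProvisioningArtifactDetail`. The helper name and the sample version IDs are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional


def latest_active_artifact_id(artifact_details: Iterable[dict]) -> Optional[str]:
    """Return the Id of the most recently created active version, or None."""
    active = [detail for detail in artifact_details if detail.get("Active")]
    if not active:
        return None
    return max(active, key=lambda detail: detail["CreatedTime"])["Id"]


# Made-up versions: the newest one is inactive, so it must be skipped.
versions = [
    {"Id": "pa-old", "Active": True, "CreatedTime": datetime(2021, 1, 5)},
    {"Id": "pa-new", "Active": True, "CreatedTime": datetime(2021, 3, 9)},
    {"Id": "pa-off", "Active": False, "CreatedTime": datetime(2021, 4, 1)},
]
print(latest_active_artifact_id(versions))  # pa-new
```

Returning None when no version is active lets the caller report a clear error instead of failing on an empty list.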

# Description: This Lambda function launches Service Catalog products in Regions that are
# not governed by AWS Control Tower during the creation of AWS accounts
# Environment: Control Tower Env
# Version 1.0

import boto3
import os
import time

SSM_Master = boto3.client('ssm')
STS_Master = boto3.client('sts')
SC_Master = boto3.client('servicecatalog',region_name = 'us-west-1')
SNS_Master = boto3.client('sns')

def lambda_handler(event, context):
    if event['detail']['serviceEventDetails']['createManagedAccountStatus']['state'] == 'SUCCEEDED':
        account_name = event['detail']['serviceEventDetails']['createManagedAccountStatus']['account']['accountName']
        account_id = event['detail']['serviceEventDetails']['createManagedAccountStatus']['account']['accountId']
        ##Assume role to member account
        assume_role(account_id)
        try:
            print("-- Executing Service Catalog Product in the account: ", account_name)
            ##Launch Product in member account
            launch_product(os.environ['ProductName'], SC_Member)
            sendmail(f'-- Product Launched successfully ')

        except Exception as err:
            print(f'-- Error in Executing Service Catalog Product in the account: {err}')
            sendmail(f'-- Error in Executing Service Catalog Product in the account: {err}')

##Function to Assume Role and create session in the Member account.
def assume_role(account_id):
    global SC_Member, IAM_Member, role_arn
    ## Assume the Member account role to execute the SC product.
    role_arn = "arn:aws:iam::$ACCOUNT_NUMBER$:role/AWSControlTowerExecution".replace("$ACCOUNT_NUMBER$", account_id)
    ##Assuming Member account Service Catalog.
    Assume_Member_Acc = STS_Master.assume_role(RoleArn=role_arn,RoleSessionName="Member_acc_session")
    ##Temporary credentials returned by STS for the member account.
    aws_access_key_id = Assume_Member_Acc['Credentials']['AccessKeyId']
    aws_secret_access_key = Assume_Member_Acc['Credentials']['SecretAccessKey']
    aws_session_token = Assume_Member_Acc['Credentials']['SessionToken']

    #Session to Connect to IAM and Service Catalog in Member Account
    IAM_Member = boto3.client('iam',aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key,aws_session_token=aws_session_token)
    SC_Member = boto3.client('servicecatalog', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key,aws_session_token=aws_session_token,region_name = "us-west-1")
    ##Accepting the portfolio share in the Member account.
    print("-- Accepting the portfolio share in the Member account.")
    length = 0
    while length == 0:
        try:
            search_product = SC_Member.search_products()
            length = len(search_product['ProductViewSummaries'])
        except Exception as err:
            print(err)
        if length == 0:
            print("The shared product is still not available. Hence waiting..")
            ##Pause before retrying (the 10-second interval is an assumption).
            time.sleep(10)
            ##Accept portfolio share in member account
            Accept_portfolio = SC_Member.accept_portfolio_share(PortfolioId=os.environ['portfolioID'],PortfolioShareType='AWS_ORGANIZATIONS')
            Associate_principal = SC_Member.associate_principal_with_portfolio(PortfolioId=os.environ['portfolioID'],PrincipalARN=role_arn, PrincipalType='IAM')
        else:
            print("The products are listed in account.")
    print("-- The portfolio share has been accepted and has been assigned the IAM Role principal.")
    return SC_Member

##Function to execute product in the Member account.
def launch_product(ProductName, session):
    describe_product = SC_Master.describe_product_as_admin(Name=ProductName)
    created_time = []
    version_ID = []
    ##Collect the creation time and ID of every active product version.
    for version in describe_product['ProvisioningArtifactSummaries']:
        describe_provisioning_artifacts = SC_Master.describe_provisioning_artifact(ProvisioningArtifactId=version['Id'],Verbose=True,ProductName=ProductName,)
        if describe_provisioning_artifacts['ProvisioningArtifactDetail']['Active'] == True:
            created_time.append(describe_provisioning_artifacts['ProvisioningArtifactDetail']['CreatedTime'])
            version_ID.append(describe_provisioning_artifacts['ProvisioningArtifactDetail']['Id'])
    ##Provision the most recently created active version.
    latest_version = dict(zip(created_time, version_ID))
    latest_time = max(created_time)
    ##The Key/Value pair below is a placeholder; replace it with the parameters your template expects.
    launch_provisioned_product = session.provision_product(ProductName=ProductName,ProvisionedProductName=ProductName,ProvisioningArtifactId=latest_version[latest_time],ProvisioningParameters=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ])
    print("-- The provisioned product ID is : ", launch_provisioned_product['RecordDetail']['ProvisionedProductId'])

##Function to publish a notification to the SNS topic set in the SNSTopicARN environment variable.
def sendmail(message):
    SNS_Master.publish(
        TopicArn=os.environ['SNSTopicARN'],
        Message=message,
        Subject="Alert - Attention Required",
    )
  1. Choose Configuration, then choose Environment variables.
  2. Choose Edit, and then choose Add environment variable for each of the following:
    1. Variable 1: Key as ProductName, and Value as “customvpcautomation” (name of the product created in the previous step).
    2. Variable 2: Key as SNSTopicARN, and Value as “arn:aws:sns:ap-south-1:<accountid>:ControlTowerNotifications” (ARN of the Amazon SNS topic created in the previous step).
    3. Variable 3: Key as portfolioID, and Value as “port-tbmq6ia54yi6w” (ID for the portfolio which was created in the previous step).
Figure 10. AWS Lambda function environment variable


  1. Choose Save.
  2. On the function configuration page, on the General configuration pane, choose Edit.
  3. Change the Timeout value to 5 min.
  4. Go to Code Section, and choose the Deploy option to deploy all the changes.
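Because the function timeout is set to 5 minutes, any wait for the shared product to appear in the member account has to finish within that window. The sketch below is one way to bound such a wait. The helper name, the 240-second default (chosen to leave headroom under the 5-minute timeout), and the 10-second interval are all assumptions.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout_s: float = 240.0,
               interval_s: float = 10.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or the time budget is used up.

    Returns True as soon as check() succeeds, and False when another
    sleep would overrun the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)


# Example use inside the Lambda function (sc_member is the member-account client):
#   ready = wait_until(lambda: len(sc_member.search_products()["ProductViewSummaries"]) > 0)
#   if not ready:
#       raise TimeoutError("Shared product did not become available in time")
```

A False result can then be reported through the notification path instead of letting the invocation run into the Lambda timeout.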

Create an Amazon EventBridge rule and initiate with a Lambda function

  1. Sign in to AWS Control Tower management account as an administrator, and select AWS Mumbai Region (Home Region for AWS Control Tower).
  2. On the navigation bar, choose Services, select Amazon EventBridge, and in the left navigation pane, select Rules.
  3. Choose Create rule, and for Name enter NonGovernedRegionAutomation.
  4. Choose Event pattern, and then choose Pre-defined pattern by service.
  5. For Service provider, choose AWS.
  6. For Service name, choose Control Tower.
  7. For Event type, choose AWS Service Event via CloudTrail.
  8. Choose Specific event(s) option, and select CreateManagedAccount.
  9. In Select targets, for Target, choose Lambda. In the Function dropdown, select NonGovernedCrossAccountAutomation, the Lambda function you created earlier.
  10. Choose Create.
Figure 11. Amazon EventBridge rule initiated with AWS Lambda
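The console selections above correspond to an event pattern along the following lines. The pattern is shown as a Python dictionary together with a deliberately simplified matcher, so that a sample event can be checked locally before the rule is relied on. The matcher is an illustration only and covers just exact-value lists and nested objects, not the full EventBridge pattern syntax.

```python
import json

# Pattern for the CreateManagedAccount lifecycle event.
EVENT_PATTERN = {
    "source": ["aws.controltower"],
    "detail-type": ["AWS Service Event via CloudTrail"],
    "detail": {"eventName": ["CreateManagedAccount"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: exact-value lists and nested dicts only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


sample = {
    "source": "aws.controltower",
    "detail-type": "AWS Service Event via CloudTrail",
    "detail": {"eventName": "CreateManagedAccount"},
}
print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample))  # True
```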


Solution walkthrough

    1. Sign in to AWS Control Tower management account as an administrator, and select AWS Mumbai Region (Home Region for AWS Control Tower).
    2. Navigate to the AWS Control Tower Account Factory page, and select Enroll account.
    3. Create a new account and complete the Account Details section. Enter the Account email, Display name, AWS SSO email, and AWS SSO user name, and select the Organizational Unit dropdown. Choose Enroll account.
Figure 12. AWS Control Tower new account creation


    4. Wait for account creation and enrollment to succeed.
Figure 13. AWS Control Tower new account enrollment

    5. Sign out of the AWS Control Tower management account, and log in to the new account. Select the AWS us-west-1 (N. California) Region. Navigate to AWS Service Catalog and then to Provisioned products. Set the Access filter to Account, and you will observe that one provisioned product is created and available.
Figure 14. AWS Service Catalog provisioned product

    6. Go to the VPC service to verify that the AWS Service Catalog product created a new VPC with the CIDR defined in the template.
Figure 15. AWS VPC creation validation

    7. Steps 5 and 6 validate that you are able to perform the automation during account creation through the AWS Control Tower lifecycle events in non-governed Regions.
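The Service Catalog check in the walkthrough can also be done from code. This is a minimal sketch, assuming a boto3 Service Catalog client for us-west-1 created with credentials for the new account. The helper name is hypothetical, and pagination is ignored for brevity.

```python
from typing import Optional


def provisioned_product_status(sc_client, product_name: str) -> Optional[str]:
    """Return the status of the provisioned product with this name, or None."""
    response = sc_client.search_provisioned_products(
        AccessLevelFilter={"Key": "Account", "Value": "self"}
    )
    for product in response.get("ProvisionedProducts", []):
        if product.get("Name") == product_name:
            return product.get("Status")
    return None


# Example call (not executed here):
#   import boto3
#   sc = boto3.client("servicecatalog", region_name="us-west-1")
#   print(provisioned_product_status(sc, "customvpcautomation"))
```

The Lambda function in this post uses the product name as the provisioned product name, which is why the lookup is by `customvpcautomation`.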

Cleaning up

To avoid incurring future charges, clean up the resources created as part of this blog post.

  • Delete the AWS Service Catalog product and portfolio you created.
  • Delete the IAM role, Amazon SNS topic, Amazon EventBridge rule, and AWS Lambda function you created.
  • Delete the AWS Control Tower setup (if created).


In this blog post, we demonstrated how to use AWS Control Tower lifecycle events to perform automation tasks during account creation in Regions not governed by AWS Control Tower. AWS Control Tower provides a way to set up and govern a secure, multi-account AWS environment. With this solution, customers can use AWS Control Tower to automate various tasks during account creation in a Region, regardless of whether AWS Control Tower is available in that Region.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Pavan Kumar Alladi

Pavan Kumar Alladi is a Senior Cloud Architect with Tech Mahindra and is based out of Chennai, India. He has worked with AWS technologies for the past 10 years as a specialist in designing and architecting solutions on AWS Cloud. He is keen on learning and implementing cutting-edge, cloud-based solutions and is passionate about applying cloud services to resolve complex real-world business problems. Currently, he leads customer engagements to deliver solutions for Platform Engineering, Cloud Migrations, Cloud Security, and DevOps.


Thooyavan Arumugam

Thooyavan Arumugam is a Senior Cloud Architect at Tech Mahindra’s AWS Practice team. He has over 16 years of industry experience in Cloud infrastructure, network, and security. He is passionate about learning new technologies and helping customers solve complex technical problems by providing solutions using AWS products and services. He provides advisory services to customers and solution design for Cloud Infrastructure (Security, Network), new platform design and Cloud Migrations.

Architecture Monthly Magazine: Aerospace

Post Syndicated from Bonnie McClure original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-aerospace/

The aerospace and space industries have changed considerably since the early days of jet travel and the Apollo missions. New technology is making space and sky more accessible than ever: startups are reaching skyward and toward the stars with new, innovative ideas seemingly every day.

This month’s issue of Architecture Monthly brings you curated content to highlight some of these ideas and innovations from our commercial aerospace, aerospace defense, and space leaders, including the newly launched AWS for Aerospace and Satellite organization. We highlight new and updated technology, practical solutions and their applications, and innovative companies that are working towards exploring and learning about the final frontier. Get inspired!

We’d like to thank our experts, Scott Eberhardt, Worldwide Tech Leader, Aerospace; Shayn Hawthorne, Sr. Mgr., Aerospace Tech Leader; and Buffy Wajvoda, Head of SA – Space & Satellite for their contributions.

Please give us your feedback! Include your comments on the Amazon Kindle page. You can view past issues and reach out to [email protected] anytime with your questions and comments.

In this month’s issue:

  • Ask an Expert: Scott Eberhardt, Worldwide Tech Lead for Aerospace at AWS; Shayn Hawthorne, Space Technology Leader for AWS Aerospace and Satellite Solutions Division; and Buffy Wajvoda, Worldwide Leader for Aerospace and Satellite Solutions Architecture at AWS
  • Website: Introduction to AWS for Aerospace and Satellite
  • Blog: Capella uses space to bring you closer to Earth
  • Reference Architecture: Run Machine Learning Algorithms with Satellite Data
  • Blog: UAE Mars mission uses AWS to advance scientific discoveries
  • Solution: AWS re:Invent 2020: Detecting extreme weather events from space
  • Blog: Announcing the AWS Space Accelerator for startups
  • Reference Architecture: Electro-Optical Imagery Reference Architecture
  • Case Study: Avio Aero Accelerates Business Growth with HPC Solution on AWS
  • Reference Architecture: Connected Aircraft
  • Case Study: Joby Aviation Uses AWS to Revolutionize Transportation
  • Whitepaper: Model Based Systems Engineering (MBSE) on AWS: From Migration to Innovation
  • Reference Architecture: Using Computer Vision for Product Quality Analysis in Plants
  • Videos:
    • AWS re:Invent 2020: Advancing the future of space in the cloud
    • AWS Connected Aircraft Overview
    • AWS Vision for Model-based Engineering in Aerospace
    • Avio Aero, a GE Aviation Business: Serverless Application to Manage Expense Purchase Approvals
    • Cybersecurity and Compliance for Aerospace
    • Product Lifecycle Management for Aerospace

Download the magazine here

How to access the magazine

  • View and download past issues as PDFs on the AWS Architecture Monthly webpage.
  • Readers in the US, UK, Germany, and France can subscribe to the Kindle version of the magazine at Kindle Newsstand.
  • Visit Flipboard, a personalized mobile magazine app that you can also read on your computer.

We hope you’re enjoying Architecture Monthly, and we’d like to hear from you—leave us a star rating and comment on the Kindle Newsstand page or contact us anytime at [email protected].

Introducing the new AWS Well-Architected Machine Learning Lens

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/introducing-the-new-aws-well-architected-machine-learning-lens/

The AWS Well-Architected Framework provides you with a formal approach to compare your workloads against best practices. It also includes guidance on how to make improvements.

Machine learning (ML) algorithms discover and learn patterns in data, and construct mathematical models to predict future data. These solutions can revolutionize lives through better diagnoses of diseases, environmental protections, products and services transformation, and more.

Your ML models depend on the quality of input data to generate accurate results. As data changes over time, monitoring is required to continuously detect, correct, and mitigate issues. This improves accuracy and performance. It also may require you to retrain your model with the latest refined data.

Application workloads rely on step-by-step instructions to solve a problem. ML workloads enable algorithms to learn from data through an iterative and continuous cycle. We are announcing a brand-new version of the AWS Well-Architected Machine Learning Lens whitepaper. It complements and builds upon the Well-Architected Framework to address this difference between these two types of workloads.

The whitepaper provides you with a set of established cloud and technology agnostic best practices. You can apply this guidance and architectural principles when designing your ML workloads, or after your workloads have entered production as part of continuous improvement. The paper includes guidance and resources to help you implement these best practices on AWS.

The Well-Architected Machine Learning Lens components

The Lens includes four focus areas:

1. The Well-Architected Machine Learning Design Principles — A set of considerations that are used as the basis for a Well-Architected ML workload. These design principles are the guiding light for the collection of the best practices in the ML Lens.

2. The Well-Architected Machine Learning Lifecycle — This integrates the Well-Architected Framework into the Machine Learning Lifecycle as can be seen in figure 1.

    • The Well-Architected Framework pillars include:
      1. Operational Excellence
      2. Security
      3. Reliability
      4. Performance Efficiency
      5. Cost Optimization
    • The Machine Learning Lifecycle phases referenced in the ML Lens include:
      1. Business goal identification
      2. ML problem framing
      3. Data processing (data collection, data pre-processing, feature engineering)
      4. Model development (training, tuning, evaluation)
      5. Model deployment (prediction, inference)
      6. Model monitoring
Figure 1. Well-Architected Machine Learning Lifecycle


In the Well-Architected ML Lens whitepaper, the Well-Architected Machine Learning Lifecycle applies the Well-Architected Framework pillars to each of the lifecycle phases.

3. Cloud and technology agnostic best practices — These are best practices for each ML lifecycle phase across the Well-Architected Framework pillars. Best practices are accompanied by:

    • Implementation guidance that provides AWS implementation plans for each best practice with references to AWS technologies and resources.
    • Resources as a set of links to AWS documents, blogs, videos, and code examples as supporting resources to the best practices and their implementation plans.

4. ML Lifecycle architecture diagrams — These illustrate processes, technologies, and components that support many of the best practices, shown in Figure 2. They include: Feature stores, Model Registry, lineage tracker, alarm manager, scheduler, and more. Different pipeline technologies are illustrated using these architecture diagrams.

Figure 2. Machine Learning Lifecycle phases with expanded components


Where should you apply the Well-Architected Machine Learning Lens?

Use the Well-Architected ML Lens to:

  • Make informed decisions — Plan early and make informed decisions by reviewing best practices before a new workload design begins.
  • Build and deploy faster — Use the best practices to guide you through building new Well-Architected workloads across the ML lifecycle.
  • Lower or mitigate risks — Evaluate existing workloads regularly to identify, mitigate, and address potential issues early.
  • Learn AWS best practices — Use the provided implementation plans as guidance on implementing the best practices on AWS.


The new Well-Architected Machine Learning Lens whitepaper is available now. Use the Lens to help ensure that your ML workloads are architected with operational excellence, security, reliability, performance efficiency, and cost optimization in mind.

Special thanks to everyone across the AWS Solution Architecture and Machine Learning communities. Their contributions brought diverse perspectives, expertise, and experiences to the development of the new AWS Well-Architected Machine Learning Lens.

Offloading SQL for Amazon RDS using the Heimdall Proxy

Post Syndicated from Antony Prasad Thevaraj original https://aws.amazon.com/blogs/architecture/offloading-sql-for-amazon-rds-using-the-heimdall-proxy/

Getting the maximum scale from your database often requires fine-tuning the application. This fine-tuning takes time and incurs cost, effort that could go toward other strategic initiatives. The Heimdall Proxy was designed to intelligently manage SQL connections to help you get the most out of your database.

In this blog post, we demonstrate two SQL offload features offered by this proxy:

  1. Automated query caching
  2. Read/Write split for improved database scale

By leveraging the solution shown in Figure 1, you can save on development costs and accelerate the onboarding of applications into production.

Figure 1. Heimdall Proxy distributed, auto-scaling architecture


Why query caching?

For ecommerce websites with high read calls and infrequent data changes, query caching can drastically improve your Amazon Relational Database Service (RDS) scale. You can use Amazon ElastiCache to serve results. Retrieving data from cache has a shorter access time, which reduces latency and improves I/O operations.

It can take developers considerable effort to create, maintain, and adjust TTLs for cache subsystems. The proxy technology covered in this article automates result caching in a caching grid chosen by the user, without code changes. What makes this solution unique is its distributed, scalable architecture. As your traffic grows, you scale by adding proxies. Multiple proxies work together as a cohesive unit for caching and invalidation.

View video: Heimdall Data: Query Caching Without Code Changes
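To make the idea of TTL-based result caching with invalidation concrete, here is a small in-memory sketch. It is not how the Heimdall Proxy is implemented: the class and method names are hypothetical, a Python dictionary stands in for the cache store, and the caller names the tables a query reads instead of the SQL being parsed.

```python
import time
from typing import Any, Callable, Dict, Set, Tuple


class QueryCache:
    """Tiny in-memory result cache with a TTL and per-table invalidation."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        # sql text -> (expiry time, cached result, tables the query reads)
        self._entries: Dict[str, Tuple[float, Any, Set[str]]] = {}

    def get_or_run(self, sql: str, tables: Set[str], run: Callable[[str], Any]) -> Any:
        """Serve a fresh cached result, or run the query and cache its result."""
        entry = self._entries.get(sql)
        if entry is not None and entry[0] > self.clock():
            return entry[1]  # cache hit
        result = run(sql)
        self._entries[sql] = (self.clock() + self.ttl_s, result, set(tables))
        return result

    def invalidate(self, table: str) -> None:
        """Drop every cached result that read from the written table."""
        self._entries = {
            sql: entry for sql, entry in self._entries.items() if table not in entry[2]
        }
```

A write path would call `invalidate("products")` after an INSERT or UPDATE on that table, so that the next read goes back to the database.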

Why Read/Write splitting?

It can be fairly straightforward to configure a primary and read replica instance on the AWS Management Console. But it may be challenging for the developer to implement such a scale-out architecture.

Some of the issues they might encounter include:

  • Replication lag. A query read-after-write may result in data inconsistency due to replication lag. Many applications require strong consistency.
  • DNS dependencies. Due to the DNS cache, many connections can be routed to a single replica, creating uneven load distribution across replicas.
  • Network latency. When deploying Amazon RDS globally using the Amazon Aurora Global Database, it’s difficult to determine how the application intelligently chooses the optimal reader.

The Heimdall Proxy streamlines the ability to elastically scale out read-heavy database workloads. The Read/Write splitting supports:

  • ACID compliance. Determines the replication lag and knows when it is safe to access a database table, ensuring data consistency.
  • Database load balancing. Tracks the health of each DB instance and evenly distributes connections without relying on DNS.
  • Intelligent routing. Chooses the optimal reader to access based on the lowest latency to create local-like response times. Check out our Aurora Global Database blog.
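The three capabilities above can be pictured with a toy routing function. This is only an illustration of the general technique, not the Heimdall Proxy's algorithm: the names are hypothetical, any statement that starts with SELECT is treated as a read, and a replica counts as safe only when its measured lag is shorter than the time since the last write.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    lag_s: float        # measured replication lag in seconds
    latency_ms: float   # measured round-trip latency in milliseconds
    healthy: bool = True


def route(sql: str, replicas: List[Replica], seconds_since_last_write: float) -> str:
    """Return the name of the node that should run this statement."""
    is_read = sql.lstrip().lower().startswith("select")
    if not is_read:
        return "primary"
    # A replica is safe only if it has already replayed the latest write.
    safe = [r for r in replicas if r.healthy and r.lag_s < seconds_since_last_write]
    if not safe:
        return "primary"
    # Among the safe replicas, prefer the one with the lowest latency.
    return min(safe, key=lambda r: r.latency_ms).name


replicas = [Replica("replica-1", lag_s=0.1, latency_ms=5.0),
            Replica("replica-2", lag_s=2.0, latency_ms=1.0)]
print(route("SELECT * FROM orders", replicas, seconds_since_last_write=0.5))  # replica-1
```

Falling back to the primary when no replica has caught up trades some read offload for read-after-write consistency.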

View video: Heimdall Data: Scale-Out Amazon RDS with Strong Consistency

Customer use case: Tornado

Hayden Cacace, Director of Engineering at Tornado

Tornado is a modern web and mobile brokerage that empowers anyone who aspires to become a better investor.

Our engineering team was tasked to upgrade our backend such that it could handle a massive surge in traffic. With a 3-month timeline, we decided to use read replicas to reduce the load on the main database instance.

First, we migrated from Amazon RDS for PostgreSQL to Aurora for Postgres since it provided better data replication speed. But we still faced a problem – the amount of time it would take to update server code to use the read replicas would be significant. We wanted the team to stay focused on user-facing enhancements rather than server refactoring.

Enter the Heimdall Proxy: We evaluated a handful of options for a database proxy that could automatically do Read/Write splits for us with no code changes, and it became clear that Heimdall was our best option. It had the Read/Write splitting “out of the box” with zero application changes required. And it also came with database query caching built-in (integrated with Amazon ElastiCache), which promised to take additional load off the database.

Before the Tornado launch date, our load testing showed the new system handling several times more load than we were able to previously. We were using a primary Aurora Postgres instance and read replicas behind the Heimdall proxy. When the Tornado launch date arrived, the system performed well, with some background jobs averaging around a 50% hit rate on the Heimdall cache. This has really helped reduce the database load and improve the runtime of those jobs.

Using this solution, we now have a data architecture with additional room to scale. This allows us to continue to focus on enhancing the product for all our customers.

Download a free trial from the AWS Marketplace.


Heimdall Data, based in the San Francisco Bay Area, is an AWS Advanced Tier ISV partner. They have Amazon Service Ready designations for Amazon RDS and Amazon Redshift. Heimdall Data offers a database proxy that offloads SQL, improving database scale. Deployment does not require code changes. For other proxy options, consider the Amazon RDS Proxy, PgBouncer, PgPool-II, or ProxySQL.

Field Notes: Building a Multi-Region Architecture for SQL Server using FCI and Distributed Availability Groups

Post Syndicated from Yogi Barot original https://aws.amazon.com/blogs/architecture/field-notes-building-a-multi-region-architecture-for-sql-server-using-fci-and-distributed-availability-groups/

A multiple-Region architecture for Microsoft SQL Server is often a topic of interest that comes up when working with our customers. The main reasons customers adopt a multiple-Region architecture approach for SQL Server deployments are:

  • Business continuity and disaster recovery (DR)
  • Geographically distributed customer base, and improved latency for end users

We will explain the architecture patterns that you can follow to effectively design a highly available SQL Server deployment, which spans two or more AWS Regions. You will also learn how to use the multiple-Region approach to scale out the read workloads, and improve the latency for your globally distributed end users.

This blog post explores SQL Server DR architecture using SQL Server Failover Cluster with Amazon FSx for Windows File Server, for primary site and secondary DR site, and describes how to set up a multiple-Region Always On distributed availability group.

Architecture overview

The architecture diagram in Figure 1 depicts two SQL Server clusters (multiple Availability Zones) in two separate Regions, and uses distributed availability group for replication and DR. This will also serve as the reference architecture for this solution.

Figure 1. Two SQL Server clusters (multiple Availability Zones) in two separate Regions


In Figure 1, there are two separate clusters in different Regions. The primary cluster in Region_01 is initially configured with SQL Server Failover Cluster Instance (FCI) using Amazon FSx for its shared storage. Always On is enabled on both nodes, and is configured to use FCI SQL Network Name (SQLFCI01) as the single replica for local Availability Group (AG01). Region_02 has an identical configuration to Region_01, but with different hostnames, listeners, and SQL Network Name to avoid possible collisions.

Highlighted in Figure 1, the Always On distributed availability group is then configured to use both listener endpoints (AG01 and AG02). Depending on what type of authentication infrastructure you have, you can either use certificates (no domain and trust dependency), or just AWS Directory Service for Microsoft Active Directory authentication to build the local mirroring endpoint that will be used by the distributed availability group.

With Amazon FSx, you get a fully managed shared file storage solution that automatically replicates the underlying storage synchronously across multiple Availability Zones. Amazon FSx provides high availability with automatic failure detection, and automatic failover if there are any hardware or storage issues. The service fully supports continuously available shares, a feature that allows SQL Server uninterrupted access to shared file data.

There is an asynchronous replication setup using a distributed availability group from Region_01 to Region_02. In this type of configuration, because there is only one availability group replica, it also serves as the forwarder for the local FCI cluster. The concept of a forwarder is new, and it’s one of the core functionalities for the distributed availability group. Because Windows Failover Cluster1 and Windows Failover Cluster2 are standalone and independent clusters, you don’t need to open a large set of ports between them, which minimizes security risk.

In this solution, because FCI is our primary high availability solution, users and applications should connect through the FCI SQL Server Network Name with the latest supported drivers and key parameters (such as MultiSubnetFailover=True, if supported). This facilitates failover and makes sure that applications seamlessly connect to the new replica without errors or timeouts.



Following are the steps required to configure SQL Server DR using SQL Server Failover Cluster with Amazon FSx for Windows File Server for primary site and secondary DR site. We also show how to set up a multiple-Region Always On distributed availability group.

Assumed Variables


WSFC Cluster Name: SQLCluster1
FCI Virtual Network Name: SQLFCI01
Local Availability Group: SQLAG01


WSFC Cluster Name: SQLCluster2
FCI Virtual Network Name: SQLFCI02
Local Availability Group: SQLAG02

  • Make sure to configure network connectivity between your clusters. In this solution, we are using two VPCs in two separate Regions.
    • VPC peering is configured to enable network traffic between both VPCs.
    • The domain controllers (AWS Managed Microsoft AD) in both VPCs are configured with forest trust and conditional forwarding (this enables DNS resolution between the two VPCs).
  • Create a local availability group, using FCI SQL Network Name as the replica. Because we will be setting up a domain-independent distributed availability group between the two clusters, we will be setting up certificates to authenticate between the two separate clusters.
  1. Create master key and endpoint for SQLCluster1

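-- Values in angle brackets are placeholders; the certificate and endpoint names are assumed.
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>';
CREATE CERTIFICATE [SQLAG01-Cert] WITH SUBJECT = 'SQLAG01 Endpoint Cert';
BACKUP CERTIFICATE [SQLAG01-Cert] TO FILE = N'\\<FileShare>\SQLAG01-Cert.crt';
CREATE ENDPOINT [SQLAG01-Endpoint] STATE = STARTED AS TCP (LISTENER_PORT = 5022) FOR DATABASE_MIRRORING (AUTHENTICATION = CERTIFICATE [SQLAG01-Cert], ROLE = ALL);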
  1. Create master key and endpoint for SQLCluster2

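-- Values in angle brackets are placeholders; the certificate and endpoint names are assumed.
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>';
CREATE CERTIFICATE [SQLAG02-Cert] WITH SUBJECT = 'SQLAG02 Endpoint Cert';
BACKUP CERTIFICATE [SQLAG02-Cert] TO FILE = N'\\<Fileshare>\SQLAG02-Cert.crt';
CREATE ENDPOINT [SQLAG02-Endpoint] STATE = STARTED AS TCP (LISTENER_PORT = 5022) FOR DATABASE_MIRRORING (AUTHENTICATION = CERTIFICATE [SQLAG02-Cert], ROLE = ALL);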
    • Make sure to place all exported certificates in a location that you can easily access from each FCI instance.
    • Create a SQL Server login and user in the master database on each FCI instance.
  1. Create database login in SQLCluster1

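USE master;  -- login and user names below are assumed placeholders
CREATE LOGIN [<SQLAG02Login>] WITH PASSWORD = '<StrongPassword>';
CREATE USER [<SQLAG02User>] FOR LOGIN [<SQLAG02Login>];
CREATE CERTIFICATE [SQLAG02-Cert] AUTHORIZATION [<SQLAG02User>] FROM FILE = N'\\<Fileshare>\SQLAG02-Cert.crt';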
  1. Create database login in SQLCluster2

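USE master;  -- login and user names below are assumed placeholders
CREATE LOGIN [<SQLAG01Login>] WITH PASSWORD = '<StrongPassword>';
CREATE USER [<SQLAG01User>] FOR LOGIN [<SQLAG01Login>];
CREATE CERTIFICATE [SQLAG01-Cert] AUTHORIZATION [<SQLAG01User>] FROM FILE = N'\\<Fileshare>\SQLAG01-Cert.crt';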
    • Now grant the newly created user endpoint access to the local mirroring endpoint in each FCI instance.
  1. Grant permission on endpoint – SQLCluster1

  1. Grant permission on endpoint – SQLCluster2

  1. Create distributed Always On availability group on SQLCluster1

Next, create the distributed availability group on the primary cluster.

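CREATE AVAILABILITY GROUP [<DistributedAGName>] WITH (DISTRIBUTED)  -- placeholder name
AVAILABILITY GROUP ON
    'SQLAG01' WITH (
            LISTENER_URL = 'tcp://SQLFCI01.DEMOSQL.COM:5022',
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
            FAILOVER_MODE = MANUAL,
            SEEDING_MODE = AUTOMATIC),  -- seeding mode is assumed
    'SQLAG02' WITH (
            LISTENER_URL = 'tcp://SQLFCI02.SQLDEMO.COM:5022',
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
            FAILOVER_MODE = MANUAL,
            SEEDING_MODE = AUTOMATIC);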
    • Note that we are using the SQL Network Name of the FCI cluster as our listener URL.
    • Now, join our secondary WSFC FCI cluster to the distributed availability group.
  1. Join secondary cluster on SQLCluster2 to distributed availability group

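ALTER AVAILABILITY GROUP [<DistributedAGName>] JOIN AVAILABILITY GROUP ON  -- same placeholder name as on SQLCluster1
      'SQLAG01' WITH (
         LISTENER_URL = 'tcp://SQLFCI01.DEMOSQL.COM:5022',
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, FAILOVER_MODE = MANUAL, SEEDING_MODE = AUTOMATIC),
      'SQLAG02' WITH (
         LISTENER_URL = 'tcp://SQLFCI02.SQLDEMO.COM:5022',
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, FAILOVER_MODE = MANUAL, SEEDING_MODE = AUTOMATIC);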
    • After you run the join script, you should be able to see the database from the primary FCI cluster’s local availability group populate the secondary FCI cluster.
    • To do a distributed availability group failover, it is best practice to synchronize both clusters first.
  1. Synchronize primary cluster

    • You can check synchronization progress and confirm that synchronization_state_desc shows “SYNCHRONIZED”:
SELECT ag.name
       , drs.database_id
       , db_name(drs.database_id) as database_name
       , drs.group_id
       , drs.replica_id
       , drs.synchronization_state_desc
       , drs.last_hardened_lsn  
FROM sys.dm_hadr_database_replica_states drs 
INNER JOIN sys.availability_groups ag on drs.group_id = ag.group_id;
  1. Perform failover at primary cluster

    After everything is ready, perform failover by first changing the DAG role on the global primary.

  1. Perform failover at secondary cluster

After which, initiate the actual failover by running this script on the secondary cluster.

  1. Change sync mode on primary and secondary clusters

    Then make sure to change Sync mode on both clusters back to Asynchronous:



A multiple-Region strategy for your mission critical SQL Server deployments is key for business continuity and disaster recovery. This blog post focused on how to achieve that optimally by using distributed availability groups. You also learned about other benefits such as read scale outs by using distributed availability groups.

To learn more, check out Simplify your Microsoft SQL Server high availability deployments using Amazon FSx for Windows File Server.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Building Multi-Region and Multi-Account Tools with AWS Organizations

Post Syndicated from Cody Penta original https://aws.amazon.com/blogs/architecture/field-notes-building-multi-region-and-multi-account-tools-with-aws-organizations/

It’s common to start with a single AWS account when you are beginning your cloud journey with AWS. Running operations such as creating, reading, updating, and deleting resources in a single AWS account can be straightforward with AWS application programming interfaces (APIs). As an organization grows, so does its account strategy, often splitting workloads across multiple accounts. Fortunately, AWS customers can use AWS Organizations to group these accounts into logical units, also known as organizational units (OUs), to apply common policies and deploy standard infrastructure. However, this makes it harder to run an API against every account and, moreover, against every Region each account could use. How does an organization answer these questions:

  • What is every Amazon FSx backup I own?
  • How can I do an on-demand batch job that will apply to my entire organization?
  • What is every internet access point across my organization?

This blog post shows us how we can use Organizations, AWS Single Sign-On (AWS SSO), AWS CloudFormation StackSets, and various AWS APIs to effectively build multi-account and multi-region tools that can address use cases like the ones above.

Running an AWS API sequentially across hundreds of accounts, and potentially many Regions, could take hours, depending on the API you call. Throughout this solution, we will emphasize the role that concurrency plays in these types of tools.

Overview of solution

For this solution, we have created a fictional organization called Tavern that is set up with multiple organizational units (OUs), accounts, and Regions, to reflect a real-world scenario.

Figure 1. Organization configuration example


We will set up a user with multi-factor authentication (MFA) enabled so we can sign in and access an admin user in the root account. Using this admin user, we will deploy a stack set across the organization that enables this user to assume limited permissions into each child account.

Next, we will use the Go programming language because of its native concurrency capabilities. More specifically, we will implement the pipeline concurrency pattern to build a multi-account and multi-region tool that will run APIs across our entire AWS footprint.

Additionally, we will add two common edge cases:

  • We block mass API actions to an account in a suspended OU (not pictured) and the root account.
  • We block API actions in disabled regions.

This will show us how to implement robust error handling in equally powerful tooling.
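
The two edge cases can be expressed as simple filters that run before any API call is made. The following is a minimal sketch, not the code from the repository: the `Account` type and both function names are hypothetical, and it assumes you have already looked up each account’s OU and each Region’s opt-in status (EC2 reports `not-opted-in` for a Region that is disabled in an account).

```go
package main

import "fmt"

// Account is a hypothetical, simplified view of a member account.
type Account struct {
    ID string
    OU string // name of the OU that directly contains the account
}

// eligibleAccounts returns the IDs that are safe to target. It skips the
// root account and every account in the suspended OU.
func eligibleAccounts(accounts []Account, rootID, suspendedOU string) []string {
    var ids []string
    for _, a := range accounts {
        if a.ID == rootID || a.OU == suspendedOU {
            continue
        }
        ids = append(ids, a.ID)
    }
    return ids
}

// enabledRegions keeps candidate Regions that have a known opt-in status
// other than "not-opted-in".
func enabledRegions(status map[string]string, candidates []string) []string {
    var out []string
    for _, r := range candidates {
        if s, ok := status[r]; ok && s != "not-opted-in" {
            out = append(out, r)
        }
    }
    return out
}

func main() {
    accounts := []Account{
        {ID: "000000000000", OU: "Root"},
        {ID: "111111111111", OU: "Workloads"},
        {ID: "222222222222", OU: "Suspended"},
    }
    fmt.Println(eligibleAccounts(accounts, "000000000000", "Suspended")) // [111111111111]

    status := map[string]string{"us-east-1": "opt-in-not-required", "af-south-1": "not-opted-in"}
    fmt.Println(enabledRegions(status, []string{"us-east-1", "af-south-1"})) // [us-east-1]
}
```

Filtering up front keeps the pipeline stages free of special cases, so an error inside a stage points to a genuine problem and not to an account or Region that should never have been targeted.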


Let us separate the solution into distinct steps:

  • Create an automation user through AWS SSO.
    • This user can optionally be an IAM user or a role assumed through a third-party identity provider (such as Azure Active Directory). Note the ARN of this identity because that is the key piece of information we will use for crafting a policy document.
  • Deploy a CloudFormation stack set across the organization that enables this user to assume limited access into each account.
    • For this blog post, we will deploy an organization-wide role with `ec2:DescribeRouteTables` permissions. Feel free to expand or change the permission set based on the type of tool you build.
  • Using Go, AWS Command Line Interface (CLI) v2, and AWS SDK for Go v2:
    1. Authenticate using AWS SSO.
    2. List every account in the organization.
    3. Assume permissions into that account.
    4. Run an API across every Region in that account.
    5. Aggregate results for every Region.
    6. Aggregate results for every account.
    7. Report back the result.

For additional context, review this GitHub repository that contains all code and assets for this blog post.


For this walkthrough, you should have the following prerequisites:

  • Multiple AWS accounts
  • AWS Organizations
  • AWS SSO (optional)
  • AWS SDK for Go v2
  • AWS CLI v2
  • Go programming knowledge (preferred), especially Go’s concurrency model
  • General programming knowledge

Create an automation user in AWS SSO

The first thing we need to do is create an identity to sign into. This can either be an AWS Identity and Access Management (IAM) user, an IAM role integrated with a third-party identity provider, or—in this case—an AWS SSO user.

  1. Log into the AWS SSO user console.
  2. Press Add user button.
  3. Fill in the appropriate information.

Figure 2. AWS SSO create user

  4. Assign the user to the appropriate group. In this case, we will assign this user to AWSControlTowerAdmins.

Figure 3. Assigning SSO user to a group

  5. Verify the user was created. (Optionally, enable MFA.)

Figure 4. Verifying User Creation and MFA

Deploy a stack set across your organization

To effectively run any API across the organization, we need to deploy a common role that our AWS SSO user can assume across every account. We can use AWS CloudFormation StackSets to deploy this role at scale.

  1. Write the IAM role and associated policy document. The following is an example AWS Cloud Development Kit (AWS CDK) code for such a role. Note that orgAccount, roleName, and ssoUser in the below code will have to be replaced with your own values.
    const role = new iam.Role(this, 'TavernAutomationRole', {
      roleName: 'TavernAutomationRole',
      assumedBy: new iam.ArnPrincipal(`arn:aws:sts::${orgAccount}:assumed-role/${roleName}/${ssoUser}`),
    });
    role.addToPolicy(new PolicyStatement({
      actions: ['ec2:DescribeRouteTables'],
      resources: ['*'],
    }));
  2. Log into the CloudFormation StackSets console.
  3. Press Create StackSet button.
  4. Upload the CloudFormation template containing the common role to be deployed to the organization, using your preferred method.
  5. Specify a name and an optional description.
  6. Add any standard organization tags, and choose the Service-managed permissions option.
  7. Choose Deploy to organization, and decide whether to enable or disable automatic deployment and which account removal behavior to use. For this blog post, we enable automatic deployment and choose to delete stacks when accounts are removed from the target OU.
  8. For Specify regions, choose US East (N. Virginia). Because this stack contains only an IAM role, and IAM is a global service, the Region choice has no effect.
  9. For Maximum concurrent accounts, choose Percent, and enter 100 (this stack is not dependent on order).
  10. For Failure tolerance, choose Number, and enter 5. This is the number of account deployment failures tolerated before a total rollback happens.
  11. For Region Concurrency, choose Sequential.
  12. Review your choices, note the deployment target (should be r-*), and acknowledge that CloudFormation might create IAM resources with custom names.
  13. Press the Submit button to deploy the stack.

Configure AWS SSO for the AWS CLI

To use our organization tools, we must first configure AWS SSO locally. With the AWS CLI v2, we can run:

aws configure sso

To configure credentials:

  1. Run the preceding command in your terminal.
  2. Follow the prompted steps.
    1. Specify your AWS SSO Start URL:
    2. AWS SSO Region:
  3. Authenticate through the pop-up browser window.
  4. Navigate back to the CLI, and choose the root account (this is where our IAM principal originates).
  5. Specify the default client region.
  6. Specify the default output format.

Note the CLI profile name. Whether you choose the autogenerated name or a custom one, we need this profile name for our upcoming code.

Start coding to utilize the AWS SSO shared profile

After AWS SSO is configured, we can start coding the beginning part of our multi-account tool. Our first step is to list every account belonging to our organization.

package main

import (
    "context"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "log"
    "strings"
    "sync"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/credentials/stscreds"
    "github.com/aws/aws-sdk-go-v2/service/ec2"
    "github.com/aws/aws-sdk-go-v2/service/organizations"
    "github.com/aws/aws-sdk-go-v2/service/sts"
    // Also import the repository's local models package, which defines InternetRoute.
)

var (
    stsc    *sts.Client
    orgc    *organizations.Client
    ec2c    *ec2.Client
    regions []string
)

// init initializes common AWS SDK clients and pulls in all enabled regions
func init() {
    cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithSharedConfigProfile("tavern-automation"))
    if err != nil {
        log.Fatal("ERROR: Unable to resolve credentials for tavern-automation: ", err)
    }

    stsc = sts.NewFromConfig(cfg)
    orgc = organizations.NewFromConfig(cfg)
    ec2c = ec2.NewFromConfig(cfg)

    // NOTE: By default, only describes regions that are enabled in the root org account, not all Regions
    resp, err := ec2c.DescribeRegions(context.TODO(), &ec2.DescribeRegionsInput{})
    if err != nil {
        log.Fatal("ERROR: Unable to describe regions: ", err)
    }

    for _, region := range resp.Regions {
        regions = append(regions, *region.RegionName)
    }
    fmt.Println("INFO: Listing all enabled regions:")
    fmt.Println(regions)
}

// main constructs a concurrent pipeline that pushes every account ID down
// the pipeline, where an action is concurrently run on each account and
// results are aggregated into a single json file
func main() {
    var accounts []string

    paginator := organizations.NewListAccountsPaginator(orgc, &organizations.ListAccountsInput{})
    for paginator.HasMorePages() {
        resp, err := paginator.NextPage(context.TODO())
        if err != nil {
            log.Fatal("ERROR: Unable to list accounts in this organization: ", err)
        }

        for _, account := range resp.Accounts {
            accounts = append(accounts, *account.Id)
        }
    }

Implement concurrency into our code

With a slice of every AWS account, it’s time to concurrently run an API across all accounts. We will use some familiar Go concurrency patterns, as well as fan-out and fan-in.

// ... continued in main

    // Begin pipeline by calling gen with a list of every account
    in := gen(accounts...)

    // Fan out and create individual goroutines handling the requested action (getRoute)
    var out []<-chan models.InternetRoute
    for range accounts {
        c := getRoute(in)
        out = append(out, c)
    }

    // Fan in and collect the routing information from all goroutines
    var allRoutes []models.InternetRoute
    for n := range merge(out...) {
        allRoutes = append(allRoutes, n)
    }

In the preceding code, we called a gen() function that started construction of our pipeline. Let’s take a deeper look into this function.

// gen primes the pipeline, creating a single separate goroutine
// that will sequentially put a single account id down the channel
// gen returns the channel so that we can plug it into the next
// stage
func gen(accounts ...string) <-chan string {
    out := make(chan string)
    go func() {
        for _, account := range accounts {
            out <- account
        }
        // Closing the channel tells downstream stages that no more accounts are coming
        close(out)
    }()
    return out
}

We see that gen just initializes the pipeline, and then starts pushing account IDs down the pipeline one by one.

The next two functions are where all the heavy lifting is done. First, let’s investigate `getRoute()`.

// getRoute queries every route table in an account, across every enabled region, for
// routes (such as a default route) that target an internet gateway
func getRoute(in <-chan string) <-chan models.InternetRoute {
    out := make(chan models.InternetRoute)
    go func() {
        // Close the output channel once the upstream channel is drained
        defer close(out)
        for account := range in {
            role := fmt.Sprintf("arn:aws:iam::%s:role/TavernAutomationRole", account)
            creds := stscreds.NewAssumeRoleProvider(stsc, role)

            for _, region := range regions {
                localCfg := aws.Config{
                    Region:      region,
                    Credentials: aws.NewCredentialsCache(creds),
                }

                localEc2Client := ec2.NewFromConfig(localCfg)

                paginator := ec2.NewDescribeRouteTablesPaginator(localEc2Client, &ec2.DescribeRouteTablesInput{})
                for paginator.HasMorePages() {
                    resp, err := paginator.NextPage(context.TODO())
                    if err != nil {
                        fmt.Println("WARNING: Unable to retrieve route tables from account: ", account, err)
                        out <- models.InternetRoute{Account: account}
                        // resp is not usable after an error, so move on to the next region
                        break
                    }

                    for _, routeTable := range resp.RouteTables {
                        for _, r := range routeTable.Routes {
                            if r.GatewayId != nil && strings.Contains(*r.GatewayId, "igw-") {
                                // aws.ToString avoids a nil dereference for routes without an IPv4 destination
                                fmt.Println(
                                    "Account: ", account,
                                    " Region: ", region,
                                    " DestinationCIDR: ", aws.ToString(r.DestinationCidrBlock),
                                    " GatewayId: ", *r.GatewayId,
                                )
                                out <- models.InternetRoute{
                                    Account:         account,
                                    Region:          region,
                                    Vpc:             routeTable.VpcId,
                                    RouteTable:      routeTable.RouteTableId,
                                    DestinationCidr: r.DestinationCidrBlock,
                                    InternetGateway: r.GatewayId,
                                }
                            }
                        }
                    }
                }
            }
        }
    }()
    return out
}

A couple of key points to highlight are as follows:

for account := range in

When iterating over a channel, the current goroutine blocks, meaning we wait here until we get an account ID passed to us before continuing. We’ll keep doing this until our upstream closes the channel. In our case, our upstream closes the channel once it pushes every account ID down the channel.

role := fmt.Sprintf("arn:aws:iam::%s:role/TavernAutomationRole", account)
creds := stscreds.NewAssumeRoleProvider(stsc, role)

Here, we can reference our existing role that we deployed to every account and assume into that role with AWS Security Token Service (STS).

for _, region := range regions {

Lastly, when we have credentials into that account, we need to iterate over every region in that account to ensure we are capturing the entire global presence.

These three key areas are how we build organization-level tools. The remaining code is calling the desired API and delivering the result down to the next stage in our pipeline, where we merge all of the results.

// merge takes every goroutine and "plugs" it into a common out channel
// then blocks until every input channel closes, signaling that all goroutines
// are done in the previous stage
func merge(cs ...<-chan models.InternetRoute) <-chan models.InternetRoute {
    var wg sync.WaitGroup
    out := make(chan models.InternetRoute)

    // output copies one input channel into out and marks it done when it closes
    output := func(c <-chan models.InternetRoute) {
        for n := range c {
            out <- n
        }
        wg.Done()
    }

    wg.Add(len(cs))
    for _, c := range cs {
        go output(c)
    }

    // Close out only after every input channel has been drained
    go func() {
        wg.Wait()
        close(out)
    }()
    return out
}
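
To experiment with the three stages without AWS credentials, the same pipeline shape can be exercised with a stand-in worker. This is a minimal sketch, not the code from the repository: `work` is a hypothetical replacement for `getRoute` that only labels each account ID, and `runPipeline` is a hypothetical helper that wires the stages together.

```go
package main

import (
    "fmt"
    "sort"
    "sync"
)

// gen pushes every item down a channel, then closes it.
func gen(items ...string) <-chan string {
    out := make(chan string)
    go func() {
        defer close(out)
        for _, it := range items {
            out <- it
        }
    }()
    return out
}

// work is a stand-in for getRoute: it consumes IDs until the input closes.
func work(in <-chan string) <-chan string {
    out := make(chan string)
    go func() {
        defer close(out)
        for id := range in {
            out <- "checked " + id
        }
    }()
    return out
}

// merge fans every worker channel into one and closes it when all are done.
func merge(cs ...<-chan string) <-chan string {
    var wg sync.WaitGroup
    out := make(chan string)
    wg.Add(len(cs))
    for _, c := range cs {
        go func(c <-chan string) {
            defer wg.Done()
            for v := range c {
                out <- v
            }
        }(c)
    }
    go func() {
        wg.Wait()
        close(out)
    }()
    return out
}

// runPipeline fans out to the given number of workers and returns sorted results.
func runPipeline(accounts []string, workers int) []string {
    in := gen(accounts...)
    outs := make([]<-chan string, 0, workers)
    for i := 0; i < workers; i++ {
        outs = append(outs, work(in))
    }
    var results []string
    for r := range merge(outs...) {
        results = append(results, r)
    }
    sort.Strings(results) // arrival order is nondeterministic, so sort for stable output
    return results
}

func main() {
    fmt.Println(runPipeline([]string{"111111111111", "222222222222", "333333333333"}, 2))
}
```

Because all workers read from the same input channel, each account ID is handled by exactly one worker, and the number of workers can be chosen independently of the number of accounts.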

At the end of the main function, we take our in-memory data structures representing our internet entry points and marshal them into a JSON file.

    // ... continued in main

    savedRoutes, err := json.MarshalIndent(allRoutes, "", "\t")
    if err != nil {
        log.Fatal("ERROR: Unable to marshal internet routes to JSON: ", err)
    }
    if err := ioutil.WriteFile("routes.json", savedRoutes, 0644); err != nil {
        log.Fatal("ERROR: Unable to write routes.json: ", err)
    }
}

With the code in place, run it with `go run main.go` in your preferred terminal. The command will generate results like the following:

    // ... routes.json
    {
        "Account": "REDACTED",
        "Region": "eu-north-1",
        "Vpc": "vpc-1efd6c77",
        "RouteTable": "rtb-1038a979",
        "DestinationCidr": "",
        "InternetGateway": "igw-c1b125a8"
    },
    {
        "Account": "REDACTED",
        "Region": "eu-north-1",
        "Vpc": "vpc-de109db7",
        "RouteTable": "rtb-e042ce89",
        "DestinationCidr": "",
        "InternetGateway": "igw-cbd457a2"
    },
    // ...
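
Once the results are in a JSON file, they are easy to post-process. The sketch below counts internet routes per account. The `InternetRoute` struct here is an assumed shape, inferred from the fields set in `getRoute` and the keys in the sample output; the actual `models.InternetRoute` definition lives in the repository and may differ. `countByAccount` is a hypothetical helper.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// InternetRoute is an assumed shape for models.InternetRoute, inferred from
// the fields set in getRoute and the keys in routes.json.
type InternetRoute struct {
    Account         string
    Region          string
    Vpc             *string
    RouteTable      *string
    DestinationCidr *string
    InternetGateway *string
}

// countByAccount parses a routes.json payload and returns the number of
// entries recorded for each account ID.
func countByAccount(data []byte) (map[string]int, error) {
    var routes []InternetRoute
    if err := json.Unmarshal(data, &routes); err != nil {
        return nil, err
    }
    counts := make(map[string]int)
    for _, r := range routes {
        counts[r.Account]++
    }
    return counts, nil
}

func main() {
    // Made-up sample data in the same shape as routes.json.
    data := []byte(`[
        {"Account": "111111111111", "Region": "eu-north-1", "InternetGateway": "igw-aaaa"},
        {"Account": "111111111111", "Region": "us-east-1", "InternetGateway": "igw-bbbb"},
        {"Account": "222222222222", "Region": "eu-north-1", "InternetGateway": "igw-cccc"}
    ]`)
    counts, err := countByAccount(data)
    if err != nil {
        fmt.Println("ERROR:", err)
        return
    }
    fmt.Println(counts["111111111111"], counts["222222222222"]) // 2 1
}
```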

Cleaning up

To avoid incurring future charges, delete the following resources:

  • Stack set through the CloudFormation console
  • AWS SSO user (if you created one)


Creating organization tools that answer difficult questions such as, “show me every internet entry point in our organization,” is possible using Organizations APIs and CloudFormation StackSets. We also learned how to use Go’s native concurrency features to build tools that scale across hundreds of accounts.

Further steps you might explore include:

  • Visiting the GitHub repository to capture the full picture.
  • Taking our sequential solution for iterating over Regions and making it concurrent.
  • Exploring the possibility of accepting functions and interfaces in stages to generalize specific pipeline features.
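
As a starting point for the second item, the per-Region loop can be fanned out with a `sync.WaitGroup`. This is a minimal sketch under stated assumptions: `forEachRegion` is a hypothetical helper, and the `query` callback stands in for the paginated `DescribeRouteTables` call. A real tool would likely also cap the number of concurrent calls to stay within API rate limits.

```go
package main

import (
    "fmt"
    "sort"
    "sync"
)

// forEachRegion runs query once per Region, each in its own goroutine, and
// returns the combined results after every Region has finished.
func forEachRegion(regions []string, query func(region string) []string) []string {
    var (
        wg      sync.WaitGroup
        mu      sync.Mutex
        results []string
    )
    for _, region := range regions {
        wg.Add(1)
        go func(region string) {
            defer wg.Done()
            found := query(region) // the slow call happens outside the lock
            mu.Lock()
            results = append(results, found...)
            mu.Unlock()
        }(region)
    }
    wg.Wait()
    sort.Strings(results) // completion order varies, so sort for stable output
    return results
}

func main() {
    // fake stands in for a real per-Region API call.
    fake := func(region string) []string { return []string{region + "/igw-example"} }
    fmt.Println(forEachRegion([]string{"us-east-1", "eu-north-1"}, fake))
}
```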

Thanks for taking the time to read, and feel free to leave comments.


Top 5 Architecture Blog Posts for Q3 2021

Post Syndicated from Bonnie McClure original https://aws.amazon.com/blogs/architecture/top-5-architecture-blog-posts-for-q3-2021/

The goal of the AWS Architecture Blog is to highlight best practices and provide architectural guidance. We publish thought leadership pieces that encourage readers to discover other technical documentation such as solutions and managed solutions, other AWS blogs, videos, reference architectures, whitepapers, and guides, training and certification, case studies, and the AWS Architecture Monthly Magazine. We welcome your contributions!

Field Notes is a series of posts within the Architecture Blog that provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers based on their experiences in the field solving real-world business problems for customers.

A big thank you to you, our readers, for spending time on our blog this past quarter. Of course, we wouldn’t have content for you to read without our hard-working AWS Solutions Architects and other blog post writers either, so thank you to them as well! Without further ado, the following five posts were the top Architecture Blog and Field Notes blog posts published in Q3 (July through September 2021).

#5: Choosing Your VPC Endpoint Strategy for Amazon S3

by Jeff Harman and Gilles-Kuessan Satchivi

In this blog post, Jeff and Gilles-Kuessan guide you through selecting the right virtual private cloud (VPC) endpoint type to access Amazon Simple Storage Service (Amazon S3). A VPC endpoint allows workloads in an Amazon Virtual Private Cloud (Amazon VPC) to connect to supported public AWS services or third-party applications over the AWS network.

#4: Using VPC Endpoints in Multi-Region Architectures with Route 53 Resolver

by Michael Haken

You want a straightforward way to use VPC endpoints and endpoint policies for all Regions uniformly and consistently. In this post, Michael shows you how Route 53 Resolver solves this challenge using DNS. This solution ensures that requests to AWS services that support VPC endpoints stay within the VPC network, regardless of their Region.

#3: Architecting a Highly Available, Serverless Microservices-Based Ecommerce Site

by Senthil Kumar and Ajit Puthiyavettle

The number of ecommerce vendors is growing globally, and these vendors often handle large traffic at different times of the day and on different days of the year. This, in addition to building, managing, and maintaining IT infrastructure in on-premises data centers, can present challenges to businesses’ scalability and growth. In this blog post, Senthil and Ajit provide you a Serverless on AWS solution that offloads the undifferentiated heavy lifting of managing resources and ensures your business’ architecture can handle peak traffic.

#2: Data Caching Across Microservices in a Serverless Architecture

by Irfan Saleem, Pallavi Nargund, and Peter Buonora

In this blog post, Irfan, Pallavi, and Peter discuss a couple of customer use cases that use Serverless on AWS offerings to maintain a cache close to the microservices layer. This improves performance by reducing or eliminating the need for the real-time backend calls and by reducing latency and service-to-service communication.

#1: Overview of Data Transfer Charges for Common Architectures 

With over 35,000 views and rising, this post is vastly outpacing all other contenders this quarter. In this post, Birender, Sebastian, and Dennis discuss how data transfer charges are often overlooked while architecting solutions in AWS. This post will help you identify potential data transfer charges you may encounter while operating your workload on AWS.

Thank you!

Thanks again to all our readers and blog post writers! We look forward to continuing to learn and build amazing things together in 2021.

Other blog posts in this series

Field Notes: How to Build an AWS Glue Workflow using the AWS Cloud Development Kit

Post Syndicated from Michael Hamilton original https://aws.amazon.com/blogs/architecture/field-notes-how-to-build-an-aws-glue-workflow-using-the-aws-cloud-development-kit/

Many customers use AWS Glue workflows to build and orchestrate their ETL (extract-transform-load) pipelines directly in the AWS Glue console using the visual tool to author workflows. This can be time consuming, harder to version control, and error prone due to manual configurations, when compared to managing your workflows as code. To improve your operational excellence, consider deploying the entire AWS Glue ETL pipeline using the AWS Cloud Development Kit (AWS CDK).

In this blog post, you will learn how to build an AWS Glue workflow using Amazon Simple Storage Service (Amazon S3), various components of AWS Glue, AWS Secrets Manager, Amazon Redshift, and the AWS CDK.

Architecture overview

In this architecture, you will use the AWS CDK to deploy your data sources, ETL scripts, AWS Glue workflow components, and an Amazon Redshift cluster for analyzing the transformed data.

Figure 1. AWS Glue workflow architecture

It is common for customers to pre-aggregate data before sending it downstream to analytical engines, like Amazon Redshift, because table joins and aggregations are computationally expensive. The AWS Glue workflow joins COVID-19 case data and COVID-19 hiring data on their date columns in order to run correlation analysis on the final dataset. The datasets may seem arbitrary, but we wanted to offer a way to better understand the impacts COVID-19 had on jobs in the United States. The takeaway here is to use this as a blueprint for automating the deployment of data analytic pipelines for the data of interest to your business.
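To make the join-then-correlate idea concrete, here is a minimal plain-Python sketch. It is not the AWS Glue job from the repository: the function names and the sample values are made up for illustration, and a real job would operate on the crawled tables rather than on small dictionaries.

```python
from math import sqrt


def join_on_date(cases, hiring):
    """Inner-join two {date: value} mappings on the dates they share."""
    shared_dates = sorted(set(cases) & set(hiring))
    return [(d, cases[d], hiring[d]) for d in shared_dates]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values, not real COVID-19 or hiring figures.
cases = {"2020-04-01": 10, "2020-04-02": 20, "2020-04-03": 30}
hiring = {"2020-04-01": 10, "2020-04-02": 8, "2020-04-03": 6, "2020-04-04": 5}

joined = join_on_date(cases, hiring)  # 2020-04-04 is dropped: no matching case row
r = pearson([c for _, c, _ in joined], [h for _, _, h in joined])
```

Joining first and loading only the joined result is what keeps the expensive join out of the analytical engine.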

After the AWS CDK application is deployed, it will begin creating all of the resources required to build the complete workflow. When it completes, the components in the architecture will be created, and the AWS Glue workflow will be ready to start. In this blog post, you start workflows manually, but they can be configured to start on a scheduled time or from a workflow trigger.

The workflow is programmed to dynamically pull the raw data from the Registry of Open Data on AWS, where you can find both the COVID-19 case data and the hiring data.


This blog post uses an AWS CDK stack written in TypeScript and AWS Glue jobs written in Python. Follow the instructions in the AWS CDK Getting Started guide to set up your environment, before you proceed to deployment.

In addition to setting up your environment, you need to clone the Git repository, which contains the AWS CDK scripts and Python ETL scripts used by AWS Glue. The ETL scripts will be deployed to Amazon S3 by the AWS CDK stack as assets, and referenced by the AWS Glue jobs as part of the AWS Glue Workflow.

You should have the following prerequisites:


After you have cloned the repository, navigate to the glue-cdk-blog/lib folder and open the blog-glue-workflow-stack.ts file. This is the AWS CDK script used to deploy all necessary resources to build your AWS Glue workflow. The blog-redshift-vpc-stack.ts file contains the necessary resources to deploy the Amazon Redshift cluster, connections, and permissions. The glue-cdk-blog/lib/assets folder also contains the AWS Glue job scripts. The AWS CDK uploads these files to the Amazon S3 bucket created by the bootstrap step when you deploy the stacks.

You won’t review the individual lines of code in the script in this blog post, but if you are unfamiliar with any of the AWS CDK level 1 or level 2 constructs used in the sample, you can review what each construct does with the AWS CDK documentation. Familiarize yourself with the script you cloned and anticipate what resources will be deployed. Then, deploy both stacks and verify your initial findings.

After your environment is configured, and the packages and modules installed, deploy the AWS CDK stack and assets in two commands.

  1. Bootstrap the AWS CDK stack to create an S3 bucket in the predefined account that will contain the assets.

cdk bootstrap

  2. Deploy the AWS CDK stacks.

cdk deploy --all

Verify that both of these commands have completed successfully, and remediate any failures returned. Upon successful completion, you’re ready to start the AWS Glue workflow that was just created. You can find the AWS CDK commands reference in the AWS CDK Toolkit commands documentation, and help with Troubleshooting common AWS CDK issues you may encounter.


Prior to initiating the AWS Glue workflow, explore the resources the AWS CDK stacks just deployed to your account.

  1. Log in to the AWS Management Console and the AWS CDK account.
  2. Navigate to Amazon S3 in the AWS console (you should see an S3 bucket with the name prefix of cdktoolkit-stagingbucket-xxxxxxxxxxxx).
  3. Review the objects stored in the bucket in the assets folder. These are the .py files used by your AWS Glue jobs. They were uploaded to the bucket as AWS CDK assets during deployment, and are referenced within the AWS CDK script as the scripts to use for the AWS Glue jobs. When retrieving data from multiple sources, you cannot always control the naming convention of the sourced files. To solve this and create better standardization, you will use a job within the AWS Glue workflow to copy these scripts to another folder and rename them with a more meaningful name.
  4. Navigate to Amazon Redshift in the AWS console and verify your new cluster. You can use the Amazon Redshift Query Editor within the console to connect to the cluster and see that you have an empty database called db-covid-hiring. The redshift_vpc_stack created the Amazon Redshift cluster and networking resources, which are listed here:
    • VPC, subnet and security group for Amazon Redshift
    • Secrets Manager secret
    • AWS Glue connection and S3 endpoint
    • Amazon Redshift cluster
  5. Navigate to AWS Glue in the AWS console and review the following new resources created by the workflow_stack CDK stack:
    • Two crawlers to crawl the data in S3
    • Three AWS Glue jobs used within the AWS Glue workflow
    • Five triggers to initiate AWS Glue jobs and crawlers
    • One AWS Glue workflow to manage the ETL orchestration
  6. All of these resources could have been deployed within a single stack, but this is intended to be a simple example of how to share resources across multiple stacks. The AWS Identity and Access Management (IAM) role that AWS Glue uses to run the ETL jobs in the workflow_stack is also used by Secrets Manager for Amazon Redshift in the redshift_vpc_stack. Inspect the /bin/blog-glue-workflow-stack.ts file to further understand cross-stack resource sharing.

By performing these steps, you have deployed all of the AWS Glue resources necessary to perform common ETL tasks. You then combined the resources to create an orchestration of tasks using an AWS Glue workflow. All of this was done using IaC with AWS CDK. Your workflow should look like Figure 2.

Figure 2. AWS Glue console showing the workflow created by the CDK

As mentioned earlier, the workflow could have been started by a scheduled cron trigger, but in this walkthrough you start it manually so that you have time to review the resources the workflow_stack CDK stack deployed before the workflow runs. Now that you have reviewed the resources, validate your workflow by initiating it and verifying it runs successfully.

  1. From within the AWS Glue console, select Workflows under ETL.
  2. Select the workflow named glue-workflow, and then select Run from the actions listbox.
  3. You can verify the status of the workflow by viewing the run details under the History tab.
  4. Your job will take approximately 15 minutes to successfully complete, and your history should look like Figure 3.

Figure 3. AWS Glue console showing the workflow as completed after the run

The workflow performs the following tasks:

  1. Prepares the ETL scripts by copying the files in the S3 asset bucket to a new folder and renaming them with more relevant names.
  2. Initiates a crawler to crawl the raw source data as .csv files and adds tables to the Glue Data Catalog.
  3. Runs a Python script to perform some ETL tasks on the .csv files and converts them to Parquet files.
  4. Crawls the Parquet files and adds them to the Glue Data Catalog.
  5. Loads the Parquet files into a DynamicFrame and runs an Amazon Redshift COPY command to load the data into the Amazon Redshift database.

After the workflow completes, you can query and perform analytics on the data that was populated in Amazon Redshift. Open the Amazon Redshift Query Editor and run a simple SELECT statement to query the covid_hiring_table, which holds the joined COVID-19 case data and hiring data (see Figure 4).

Figure 4. Amazon Redshift query editor showing the data that the workflow loaded into the Redshift tables

Cleaning up

Some resources, like S3 buckets and Amazon DynamoDB tables, must be manually emptied and deleted through the console to be fully removed. To clean up the deployment, delete all objects in the AWS CDK asset bucket in Amazon S3 by using the AWS console to empty the bucket, and then run cdk destroy --all to delete the resources the AWS CDK stacks created in your account. Finally, if you don’t plan on using AWS CloudFormation assets in this account in the future, you will need to delete the AWS CDK asset stack within the CloudFormation console to remove the AWS CDK asset bucket.


In this blog post, you learned how to automate the deployment of AWS Glue workflows using the AWS CDK. This further enhances your continuous integration and delivery (CI/CD) data pipelines by automating the deployment of the ETL jobs and AWS Glue workflow orchestration, providing an efficient, fast, and repeatable way to build and deploy AWS Glue workflows at scale.

Although AWS CDK primarily supports level 1 constructs for most AWS Glue resources, new constructs are added continually. See the AWS CDK API Reference for updates, prior to authoring your stacks, for AWS Glue level 2 construct support. You can find the code used in this blog post in this GitHub repository, and the AWS CDK in TypeScript reference to the AWS CDK namespace.

We hope this blog post helps you automate the creation of AWS Glue workflows, so that you can quickly build and deploy your own ETL pipelines and run the analytical models that power your business.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Build a Cross-Validation Machine Learning Model Pipeline at Scale with Amazon SageMaker

Post Syndicated from Wei Teh original https://aws.amazon.com/blogs/architecture/field-notes-build-a-cross-validation-machine-learning-model-pipeline-at-scale-with-amazon-sagemaker/

When building a machine learning algorithm, such as a regression or classification algorithm, a common goal is to produce a generalized model, one that performs well on new data that the model has not seen before. Overfitting and underfitting are two fundamental causes of poor performance for machine learning models. A model is overfitted when it performs well on known data, but generalizes poorly on new data. In contrast, an underfit model performs poorly on both training data and new data. A reliable model validation technique gives a better estimate of how a model will perform in practice, and provides insight for training models to achieve the best accuracy.

Cross-validation is a standard model validation technique commonly used for assessing performance of machine learning algorithms. In general, it works by first sampling the dataset into groups of similar sizes, where each group contains a subset of data dedicated for training and a subset for model evaluation. After the data has been grouped, a machine learning algorithm fits and scores a model using the data in each group independently. The final score of the model is the average of the performance metric across all the trained models.
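The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions: the folds are contiguous (no shuffling), and the threshold "model" and toy data are stand-ins chosen only to keep the example self-contained.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs so that each sample is validated exactly once."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validate(fit, score, X, y, k):
    """Fit and score one model per fold, then average the fold scores."""
    fold_scores = []
    for train_idx, val_idx in kfold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        fold_scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return sum(fold_scores) / len(fold_scores), fold_scores


# Toy stand-in model: classify a 1-D feature by comparing it with the training mean.
def fit_threshold(X_train, y_train):
    return sum(X_train) / len(X_train)


def accuracy(threshold, X_val, y_val):
    predictions = [1 if x > threshold else 0 for x in X_val]
    return sum(p == t for p, t in zip(predictions, y_val)) / len(y_val)


X = [1, 2, 3, 4, 10, 11, 12, 13]
y = [0, 0, 0, 0, 1, 1, 1, 1]
mean_score, fold_scores = cross_validate(fit_threshold, accuracy, X, y, k=4)
```

Because every fold is fitted and scored independently, the loop body is the natural unit to hand off to separate workers.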

Several cross-validation methods are commonly used, including k-fold, stratified k-fold, and leave-p-out. Although there are well-defined data science frameworks that can help simplify cross-validation processes, such as the Python scikit-learn library, these frameworks are designed to work in a monolithic, single compute environment. When it comes to training machine learning algorithms with large volumes of data, these frameworks become bottlenecked with limited scalability and reliability.

In this blog post, we are going to walk through the steps for building a highly scalable, high-accuracy, machine learning pipeline, with the k-fold cross-validation method, using Amazon Simple Storage Service (Amazon S3), Amazon SageMaker Pipelines, SageMaker automatic model tuning, and SageMaker training at scale.

Overview of solution

To operate the k-fold cross-validation training pipeline at scale, we built an end-to-end machine learning pipeline using SageMaker native features. This solution implements the k-fold data processing, model training, and model selection processes as individual components to maximize parallelism. The pipeline is orchestrated through SageMaker Pipelines in a distributed manner to achieve scalability and performance efficiency. Let’s dive into the high-level architecture of the solution in the following section.

Figure 1. Solution architecture

The overall solution architecture is shown in Figure 1. There are four main building blocks in the k-fold cross-validation model pipeline:

  1. Preprocessing – Sample and split the entire dataset into k groups.
  2. Model training – Fit the SageMaker training jobs in parallel with hyperparameters optimized through the SageMaker automatic model tuning job.
  3. Model selection – Fit a final model, using the best hyperparameters obtained in step 2, with the entire dataset.
  4. Model registration – Register the final model with SageMaker Model Registry, for model lifecycle management and deployment.

The final output from the pipeline is a model that represents best performance and accuracy for the given dataset. The pipeline can be orchestrated easily using a workflow management tool, such as Pipelines.

Amazon SageMaker is a fully managed service that enables data scientists and developers to develop, train, tune, and deploy machine learning models quickly and at scale. When it comes to choosing the right machine learning and data processing frameworks to solve problems, SageMaker gives you the flexibility to use prebuilt containers bundled with the supported common machine learning frameworks (such as TensorFlow, PyTorch, and MXNet) or to bring your own container images with custom scripts and libraries that fit your use cases to train on the highly available SageMaker model training environment. Additionally, Pipelines enables users to develop complete machine learning workflows using the Python SDK, and manage these workflows in SageMaker Studio.

For simplicity, we will use the public Iris flower dataset as the train and test dataset to build a multivariate classification model using a support vector machine (SVM) algorithm. The pipeline architecture is agnostic to the data and model; hence, it can be modified to adopt a different dataset or algorithm.


To deploy the solution, you require the following:

  • SageMaker Studio
  • A command line (terminal) that supports building Docker images (for instance, AWS Cloud9)

Solution walkthrough

In this section, we are going to walk through the steps to create a cross-validation model training pipeline using Pipelines. The main components are as follows.

  1. Pipeline parameters
    Pipelines parameters are variables that allow predefined values to be overridden at runtime. Pipelines supports the following parameter types: String, Integer, and Float (expressed as ParameterString, ParameterInteger, and ParameterFloat). The following are some examples of the parameters used in the cross-validation model training pipeline:
    • K-Fold – Value of k to be used in k-fold cross-validation
    • ProcessingInstanceCount – Number of instances for SageMaker processing job
    • ProcessingInstanceType – Instance type used for SageMaker processing job
    • TrainingInstanceType – Instance type used for SageMaker training job
    • TrainingInstanceCount – Number of instances for SageMaker training job
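The override behavior can be pictured with a small plain-Python stand-in. This is only a sketch of the idea, not the SageMaker SDK: `resolve_parameters` is a hypothetical helper, the parameter names are taken from the list above, and the default values are illustrative guesses.

```python
# Parameter names come from the list above; the default values are illustrative guesses.
DEFAULTS = {
    "K-Fold": 3,
    "ProcessingInstanceCount": 1,
    "ProcessingInstanceType": "ml.m5.large",
    "TrainingInstanceType": "ml.m5.large",
    "TrainingInstanceCount": 1,
}


def resolve_parameters(overrides=None):
    """Return the parameter values for one run: defaults with runtime overrides applied."""
    resolved = dict(DEFAULTS)
    for name, value in (overrides or {}).items():
        if name not in DEFAULTS:
            raise KeyError(f"Unknown pipeline parameter: {name}")
        expected = type(DEFAULTS[name])
        # Mirror the typed parameters: an Integer parameter rejects a string, and so on.
        if not isinstance(value, expected):
            raise TypeError(f"{name} expects {expected.__name__}, got {type(value).__name__}")
        resolved[name] = value
    return resolved


# Override two values for this run; everything else keeps its default.
run_parameters = resolve_parameters({"K-Fold": 5, "TrainingInstanceCount": 2})
```

Declaring the type with each parameter lets a bad override fail before any job is started.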
  2. Preprocessing

In this step, the original dataset is split into k equal-sized samples. One of the k samples is retained as the validation data for model evaluation, with the remaining k-1 samples to be used as training data. This process is repeated k times, with each of the k samples used as the validation set only one time. The k sample collections are uploaded to an S3 bucket, each under a prefix that corresponds to its index (0 to k-1), so that the prefix can be passed as the input path to the matching training job in the next step of the pipeline. The cross-validation split is submitted as a SageMaker processing job orchestrated through the Pipelines processing step. The processing flow is shown in Figure 2.

Figure 2. K-fold cross-validation: original data is split into k equal-sized samples uploaded to S3 bucket
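As a rough sketch of the index-based prefix scheme, the helper below lists the input locations each of the k training jobs would read. The function name, the bucket, and the base prefix are hypothetical; only the train/<index> and test/<index> layout follows the description above.

```python
def fold_channels(base_uri, k):
    """Return the train and test input locations for each fold index 0..k-1."""
    base = base_uri.rstrip("/")
    return [
        {"fold": i, "train": f"{base}/train/{i}", "test": f"{base}/test/{i}"}
        for i in range(k)
    ]


# Hypothetical bucket and prefix, used only for illustration.
channels = fold_channels("s3://example-bucket/cross-validation/", 3)
```

Keeping the fold index in the path means a training job needs only its index to locate its data.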

The following code snippet splits the k-fold dataset in the preprocessing script:

import os

import numpy as np
from sklearn.model_selection import KFold

# Assumed output root; the original snippet does not show where base_dir is defined.
base_dir = "/opt/ml/processing"


def save_kfold_datasets(X, y, k):
    """Splits the datasets (X, y) into k folds and saves the output from
    each fold into separate directories.

        X : numpy array that represents the features
        y : numpy array that represents the target
        k : int value that represents the number of folds to split the given datasets
    """
    # Shuffle and split the dataset into k folds.
    kf = KFold(n_splits=k, random_state=23, shuffle=True)

    fold_idx = 0
    for train_index, test_index in kf.split(X, y=y, groups=None):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Write the training split for this fold under train/<fold index>.
        os.makedirs(f'{base_dir}/train/{fold_idx}', exist_ok=True)
        np.savetxt(f'{base_dir}/train/{fold_idx}/train_x.csv', X_train, delimiter=',')
        np.savetxt(f'{base_dir}/train/{fold_idx}/train_y.csv', y_train, delimiter=',')

        # Write the validation split for this fold under test/<fold index>.
        os.makedirs(f'{base_dir}/test/{fold_idx}', exist_ok=True)
        np.savetxt(f'{base_dir}/test/{fold_idx}/test_x.csv', X_test, delimiter=',')
        np.savetxt(f'{base_dir}/test/{fold_idx}/test_y.csv', y_test, delimiter=',')
        fold_idx += 1
  3. Cross-validation training with SageMaker automatic model tuning

In a typical cross-validation training scenario, a chosen algorithm is trained k times, each time with the specific training and validation datasets sampled through the k-fold technique mentioned in the previous step. Traditionally, the cross-validation model training process is performed sequentially on the same server. This method is inefficient and doesn’t scale well for models with large volumes of data. Because all the samples are uploaded to an S3 bucket, we can now run k training jobs in parallel. Each training job consumes the input samples in the bucket location that corresponds to the index (between 0 and k-1) given to the training job. Additionally, the hyperparameter values must be the same for all k jobs because cross-validation estimates the true out-of-sample performance of a model trained with this specific set of hyperparameters.

Although the cross-validation technique helps generalize the models, hyperparameter tuning for the model is typically performed manually. In this blog post, we are going to take a heuristic approach of finding the most optimized hyperparameters using SageMaker automatic model tuning.

We start by defining a training script that accepts the hyperparameters as input for the specified model algorithm, and then implement the model training and evaluation steps.

The steps involved in the training script are summarized as follows:

    1. Parse hyperparameters from the input.
    2. Fit the model using the parsed hyperparameters.
    3. Evaluate model performance (score).
    4. Save the trained model.

import argparse
import os

from joblib import dump  # assumed source of dump(); the original snippet omits its imports

# train() and evaluate() are defined elsewhere in the training script; train() is
# assumed to read the parsed hyperparameters (c, gamma, kernel) from args.
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hyperparameters passed in by the training job.
    parser.add_argument('-c', '--c', type=float, default=1.0)
    parser.add_argument('--gamma', type=float)
    parser.add_argument('--kernel', type=str)
    # SageMaker-specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    args = parser.parse_args()
    # Fit, score, and persist the model.
    model = train(train=args.train, test=args.test)
    evaluate(test=args.test, model=model)
    dump(model, os.path.join(args.model_dir, "model.joblib"))

Next, we create a Python script that performs cross-validation model training by submitting k SageMaker training jobs in parallel with the given hyperparameters. Additionally, the script monitors the progress of the training jobs, and calculates the objective metric by averaging the scores across the completed jobs.
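The submit-in-parallel-then-average pattern might look roughly like the following sketch. It does not call SageMaker: `run_training_job` is a hypothetical stand-in that returns a made-up score, where a real script would start a training job for the given fold index and wait for its metric.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_training_job(fold_index, hyperparameters):
    """Hypothetical stand-in for submitting one training job and waiting for its score."""
    # A real implementation would start a training job that reads fold `fold_index`
    # and would poll until it completes. The score below is made up.
    return 0.90 + 0.01 * fold_index


def cross_validation_objective(k, hyperparameters):
    """Run k jobs concurrently with identical hyperparameters and average their scores."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        # pool.map returns results in fold order, regardless of completion order.
        scores = list(pool.map(lambda i: run_training_job(i, hyperparameters), range(k)))
    return {"scores": scores, "objective": mean(scores)}


result = cross_validation_objective(k=3, hyperparameters={"c": 1.0, "kernel": "rbf"})
```

Threads are sufficient here because each worker would spend its time waiting on a remote job rather than computing locally.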

Now we create a Python script that uses a SageMaker automatic model tuning job to find the optimal hyperparameters for the trained models. The hyperparameter tuner works by running a specified number of training jobs using the ranges of hyperparameters specified. The number of training jobs and ranges of hyperparameters are given in the input parameters to the script. After the tuning job completes, the objective metrics and the hyperparameters from the best cross-validation model training job are captured and written in JSON format, to be used in the next steps of the workflow. Figure 3 illustrates cross-validation training with automatic model tuning.

Figure 3. In cross-validation training step, a SageMaker HyperparameterTuner job invokes n training jobs. The metrics and hyperparameters are captured for downstream processes.
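To illustrate the shape of that loop, the sketch below runs a fixed number of trials over hyperparameter ranges and keeps the best result as JSON. It deliberately swaps in plain random search as the search strategy, which is not necessarily what the managed tuning job uses, and `toy_objective` is a made-up function standing in for one full cross-validation run.

```python
import json
import random


def tune(objective, ranges, max_jobs, seed=0):
    """Random search: sample hyperparameters from the ranges and keep the best trial."""
    rng = random.Random(seed)  # seeded so repeated runs pick the same candidates
    best = None
    for _ in range(max_jobs):
        candidate = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        score = objective(candidate)
        if best is None or score > best["objective"]:
            best = {"objective": score, "hyperparameters": candidate}
    return best


def toy_objective(hp):
    """Made-up objective that peaks at c=1.0 and gamma=0.1."""
    return 1.0 - (hp["c"] - 1.0) ** 2 - (hp["gamma"] - 0.1) ** 2


best = tune(toy_objective, {"c": (0.1, 10.0), "gamma": (0.01, 1.0)}, max_jobs=20)
# Serialize the winning trial so that later steps can read it.
report = json.dumps(best, indent=2)
```

Writing the best trial to JSON gives the downstream evaluation and model selection steps a single artifact to read.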

Finally, the training and cross-validation scripts are packaged and built as a custom container image, available for the SageMaker automatic model tuning job for submission. The following code snippet is for building the custom image:

FROM python:3.7
RUN apt-get update && pip install sagemaker boto3 numpy sagemaker-training
COPY cv.py /opt/ml/code/train.py
COPY scikit_learn_iris.py /opt/ml/code/scikit_learn_iris.py
  4. Model evaluation
    The objective metrics in the cross-validation training and tuning steps define the model quality. To evaluate the model performance, we created a conditional step that compares the metrics against a baseline to determine the next step in the workflow. The following code snippet illustrates the conditional step in detail. Specifically, this step first extracts the objective metrics from the evaluation report uploaded in the previous step, and then compares the value with the baseline_model_objective_value provided in the pipeline job. The workflow continues if the model objective metric is greater than or equal to the baseline value, and stops otherwise.
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import (
cond_gte = ConditionGreaterThanOrEqualTo(
step_cond = ConditionStep(
    if_steps=[step_model_selection, step_register_model],
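The decision made by the conditional step can be expressed in plain Python as below. This is a sketch of the logic only: the report layout (a top-level "objective" key) and the step names in the returned list are assumptions for illustration, not the format used by the pipeline.

```python
import json


def next_steps(evaluation_report_json, baseline_model_objective_value):
    """Return the steps to run next, or an empty list if the model misses the baseline."""
    report = json.loads(evaluation_report_json)
    objective = report["objective"]  # assumed key; the real report layout may differ
    if objective >= baseline_model_objective_value:
        return ["ModelSelection", "RegisterModel"]
    return []


passing = next_steps('{"objective": 0.95}', 0.9)
failing = next_steps('{"objective": 0.80}', 0.9)
```

A score equal to the baseline still passes, matching the greater-than-or-equal comparison described above.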
  5. Model selection
    At this stage of the pipeline, we’ve completed cross-validation and hyperparameter optimization steps to identify the best performing model trained with the specific hyperparameter values. In this step, we are going to fit a model using the same algorithm used in cross-validation training by providing the entire dataset and the hyperparameters from the best model. The trained model will be used for serving predictions for downstream applications. The following code snippet illustrates a Pipelines training step for model selection:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
from sagemaker.sklearn.estimator import SKLearn
sklearn_estimator = SKLearn("scikit_learn_iris.py", 
step_model_selection = TrainingStep(
        "train": TrainingInput(
        "jobinfo": TrainingInput(
  6. Model registration
    As the cross-validation model training pipeline evolves, it’s important to have a mechanism for managing the versions of model artifacts over time, so that the team responsible for the project can manage the model lifecycle, including tracking, deploying, or rolling back a model based on its version. Building your own model registry, with lifecycle management capabilities, can be complicated and challenging to maintain and operate. SageMaker Model Registry simplifies model lifecycle management by enabling model catalog, versioning, metrics association, model approval workflow, and model deployment automation.

In the final step of the pipeline, we are going to register the trained model with Model Registry by associating model objective metrics, the model artifact location in the S3 bucket, the estimator object used in the model selection step, model training and inference metadata, and approval status. The following code snippet illustrates the model registry step using ModelMetrics and RegisterModel.

from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.step_collections import RegisterModel
model_metrics = ModelMetrics(
step_register_model = RegisterModel(
    inference_instances=["ml.t2.medium", "ml.m5.xlarge"],

Figure 4 shows a model version registered in SageMaker Model Registry upon a successful pipeline job through Studio.

Figure 4. Model version registered successfully in SageMaker

  7. Putting everything together
    Now that we’ve defined a cross-validation training pipeline, we can track, visualize, and manage the pipeline job directly from within Studio. The following code snippet and Figure 5 depict our pipeline definition:
from sagemaker.workflow.pipeline_experiment_config import PipelineExperimentConfig
from sagemaker.workflow.execution_variables import ExecutionVariables
pipeline_name = f"CrossValidationTrainingPipeline"
pipeline = Pipeline(
    steps=[step_process, step_cv_train_hpo, step_cond],

Figure 5. SageMaker Pipelines definition shown in SageMaker Studio

Finally, to kick off the pipeline, invoke the pipeline.start() function, with optional parameters specific to the job run:

execution = pipeline.start(

You can track the pipeline job from within Studio, or use SageMaker application programming interfaces (APIs). Figure 6 shows a screenshot of a pipeline job in progress from Studio.

Figure 6. SageMaker Pipelines job progress shown in SageMaker Studio


In this blog post, we showed you an architecture that orchestrates a complete workflow for cross-validation model training. We implemented the workflow using SageMaker Pipelines, incorporating preprocessing, hyperparameter tuning, model evaluation, model selection, and model registration. The solution addresses the common challenge of orchestrating cross-validation model pipelines at scale. The entire pipeline implementation, including a Jupyter notebook that defines the pipeline, a Dockerfile, and the Python scripts described in this blog post, can be found in the GitHub project.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Connect Amazon S3 File Gateway using AWS PrivateLink for Amazon S3

Post Syndicated from Xiaozang Li original https://aws.amazon.com/blogs/architecture/connect-amazon-s3-file-gateway-using-aws-privatelink-for-amazon-s3/

AWS Storage Gateway is a set of services that provides on-premises access to virtually unlimited cloud storage. You can extend your on-premises storage capacity, and move on-premises backups and archives to the cloud. It provides low-latency access to cloud storage by caching frequently accessed data on premises, while storing data securely and durably in the cloud. This simplifies storage management and reduces costs for hybrid cloud storage use.

You may have privacy and security concerns with sending and receiving data across the public internet. In this case, you can use AWS PrivateLink, which provides private connectivity between Amazon Virtual Private Cloud (VPC) and other AWS services.

In this blog post, we will demonstrate how to take advantage of Amazon S3 interface endpoints to connect your on-premises Amazon S3 File Gateway directly to AWS over a private connection. We will also review the steps for implementation using the AWS Management Console.

AWS Storage Gateway on-premises

Storage Gateway offers four different types of gateways to connect on-premises applications with cloud storage.

  • Amazon S3 File Gateway – Provides a file interface for applications to seamlessly store files as objects in Amazon S3. These files can be accessed using open standard file protocols.
  • Amazon FSx File Gateway – Optimizes on-premises access to Windows file shares on Amazon FSx.
  • Tape Gateway – Replaces on-premises physical tapes with virtual tapes in AWS without changing existing backup workflows.
  • Volume Gateway – Presents cloud-backed iSCSI block storage volumes to your on-premises applications.

We will illustrate the use of Amazon S3 File Gateway in this blog.

VPC endpoints for Amazon S3

AWS PrivateLink provides two types of VPC endpoints that you can use to connect to Amazon S3: interface endpoints and gateway endpoints. An interface endpoint is an elastic network interface with a private IP address. It serves as an entry point for traffic destined to a supported AWS service or a VPC endpoint service. A gateway VPC endpoint uses prefix lists as the IP route target in a VPC route table and supports routing traffic privately to Amazon S3 or Amazon DynamoDB. Both of these endpoints securely connect to Amazon S3 over the Amazon network, and your network traffic does not traverse the internet.

Solution architecture for PrivateLink connectivity between AWS Storage Gateway and Amazon S3

Previously, AWS Storage Gateway did not support PrivateLink for Amazon S3 and Amazon S3 Access Points. Customers had to build and manage an HTTP proxy infrastructure within their VPC to connect their on-premises applications privately to S3 (see Figure 1). This infrastructure acted as a proxy for all the traffic originating from on-premises gateways to Amazon S3 through Amazon S3 Gateway endpoints. This setup would result in additional configuration and operational overhead. The HTTP proxy could also become a network performance bottleneck.

Figure 1. Connect to Amazon S3 Gateway endpoint using an HTTP proxy

AWS Storage Gateway recently added support for AWS PrivateLink for Amazon S3 and Amazon S3 Access Points. Customers can now connect their on-premises Amazon S3 File Gateway directly to Amazon S3 through a private connection. This uses an Amazon S3 interface endpoint and doesn’t require an HTTP proxy. Additionally, customers can use Amazon S3 Access Points instead of bucket names to map file shares. This enables more granular access controls for applications connecting to AWS Storage Gateway file shares (see Figure 2).

Figure 2. AWS Storage Gateway now supports AWS PrivateLink for Amazon S3 endpoints and Amazon S3 Access Points

Implement AWS PrivateLink between AWS Storage Gateway and an Amazon S3 endpoint

Let’s look at how to create an Amazon S3 File Gateway file share, which is associated with a Storage Gateway. This file share stores data in an Amazon S3 bucket. It uses AWS PrivateLink for Amazon S3 to transfer data to the S3 endpoint.

  1. Create an Amazon S3 bucket in your preferred Region.
  2. Create and configure an Amazon S3 File Gateway.
  3. Create an Interface endpoint for Amazon S3. Ensure that the S3 interface endpoint is created in the same Region as the S3 bucket.
  4. Customize the File share settings (see Figure 3).
Figure 3. Create file share using VPC endpoints for Amazon S3

Best practices:

  • Select the AWS Region where the Amazon S3 bucket is located. This ensures that the VPC endpoint and the storage bucket are in the same Region.
  • When creating the file share with PrivateLink for S3 enabled, you can either select the S3 VPC endpoint ID from the dropdown menu, or manually input the S3 VPC endpoint DNS name.
  • Note that the dropdown list of VPC endpoint IDs contains only the VPC endpoints created by the current AWS account administrator. If you are using a shared VPC in an AWS Organization, you can manually enter the DNS name of the VPC endpoint created in the management account.

Be aware of PrivateLink pricing when using an S3 interface endpoint. The cost for each interface endpoint is based on usage per hour, the number of Availability Zones used, and the volume of data transferred over the endpoint. Additionally, each Amazon S3 VPC interface endpoint can be shared among multiple S3 File Gateways. Each file share associated with the Storage Gateway can be configured with or without PrivateLink. For workloads that do not need the private network connectivity, you can save on interface endpoints costs by creating a file share without PrivateLink.
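
The cost drivers named above can be combined into a rough estimate. The following is a minimal sketch, not a pricing calculator: the function name and the rates passed to it are hypothetical placeholders, and actual rates vary by Region, so check the AWS PrivateLink pricing page before relying on any number.

```python
def interface_endpoint_cost(hours, availability_zones, gb_transferred,
                            hourly_rate_per_az, rate_per_gb):
    """Estimate the charge for one interface endpoint (all rates are inputs)."""
    # Hourly charge applies per Availability Zone in which the endpoint exists.
    endpoint_hours = hours * availability_zones * hourly_rate_per_az
    # Data processing charge applies per GB that flows through the endpoint.
    data_processing = gb_transferred * rate_per_gb
    return endpoint_hours + data_processing


# Hypothetical rates for illustration only (not actual AWS prices).
estimate = interface_endpoint_cost(
    hours=730, availability_zones=2, gb_transferred=500,
    hourly_rate_per_az=0.01, rate_per_gb=0.01,
)
print(f"Estimated monthly cost: ${estimate:.2f}")
```

Because one endpoint can be shared among several S3 File Gateways, the hourly part of this estimate is paid once per endpoint rather than once per gateway.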

Verify PrivateLink communication

Once you have set up an S3 File Gateway file share using PrivateLink for S3, you can verify that traffic is flowing over your private connectivity as follows:

1. Enable VPC Flow Logs for the VPC hosting the S3 interface endpoint. This VPC also hosts the Virtual Private Gateway (VGW), which connects to the on-premises environment.

2. From your workstation, connect to your on-premises File Gateway over SMB or NFS protocol and upload a new file (see Figure 4).

Figure 4. Upload a sample file to on-premises Storage Gateway

3. Navigate to the S3 bucket associated with the file share.  After a few seconds, you should see that the new file has been successfully uploaded and appears in the S3 bucket (see Figure 5).

Figure 5. Verify that the sample file is uploaded to storage bucket

4. In the VPC flow log, look for the generated log events. You’ll see your S3 interface endpoint elastic network interface, your file gateway IP, Amazon S3 private IP, and port number, as shown in Figure 6. This verifies that the file was transferred over the private connection. If you do not see an entry, verify that VPC Flow Logs are enabled on the correct VPC and elastic network interface.

Figure 6. VPC Flow Log entry to verify connectivity using Private IPs
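
The check in step 4 can also be scripted. The sketch below assumes the flow log uses the default (version 2) record format and that the gateway talks to the endpoint over HTTPS on port 443; the sample record, ENI ID, and IP addresses are hypothetical.

```python
import ipaddress

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def parse_record(line):
    """Split one default-format flow log line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def is_private_https_flow(record, endpoint_eni):
    """True if the flow used the endpoint ENI, private IPs only, and port 443."""
    src = ipaddress.ip_address(record["srcaddr"])
    dst = ipaddress.ip_address(record["dstaddr"])
    return (record["interface_id"] == endpoint_eni
            and src.is_private and dst.is_private
            and "443" in (record["srcport"], record["dstport"])
            and record["action"] == "ACCEPT")


# Hypothetical record: file gateway 10.0.1.25 to endpoint ENI 10.10.2.40:443.
sample = ("2 123456789012 eni-0abc1234def567890 10.0.1.25 10.10.2.40 "
          "49152 443 6 20 4249 1634567890 1634567950 ACCEPT OK")
record = parse_record(sample)
print(is_private_https_flow(record, "eni-0abc1234def567890"))
```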


In this blog post, we have demonstrated how to use Amazon S3 File Gateway to transfer files to Amazon S3 buckets over AWS PrivateLink. Use this solution to securely copy your application data and files to cloud storage. This will also provide low latency access to that data from your on-premises applications.

Thanks for reading this blog post. If you have any feedback or questions, please add them in the comments section.

Further Reading:

Top 5: Featured Architecture Content for September

Post Syndicated from Elyse Lopez original https://aws.amazon.com/blogs/architecture/top-5-featured-architecture-content-for-september/

The AWS Architecture Center provides new and notable reference architecture diagrams, vetted architecture solutions, AWS Well-Architected best practices, whitepapers, and more. This blog post features some of our best picks from the new and newly updated content we released in the past month.

1. AWS Best Practices for DDoS Resiliency

Prioritizing the availability and responsiveness of your application helps you maintain customer trust. That’s why it’s crucial to protect your business from the impact of distributed denial of service (DDoS) and other cyberattacks. This whitepaper provides you prescriptive guidance to improve the resiliency of your applications and best practices for how to manage different attack types.

2. Predictive Modeling for Automotive Retail

Automotive retailers use data to better understand how their incentives are helping to sell cars. This new reference architecture diagram shows you how to design a modeling system that provides granular return on investment (ROI) predictions for automotive sales incentives.

3. AWS Graviton Performance Testing – Tips for Independent Software Vendors

If you’re deciding whether to phase in AWS Graviton processors for your workload, this whitepaper covers best practices and common pitfalls for defining test approaches to evaluate Amazon Elastic Compute Cloud (Amazon EC2) instance performance and how to set success factors and compare different test methods and their implementation.

4. Text Analysis with Amazon OpenSearch Service and Amazon Comprehend

This AWS Solutions Implementation was recently updated with new guidance related to Amazon OpenSearch Service, the successor to Amazon Elasticsearch Service. Learn how Amazon OpenSearch Service and Amazon Comprehend work together to deploy a cost-effective, end-to-end solution to extract meaningful insights from unstructured text-based data such as customer calls, support tickets, and online customer feedback.

5. Back to Basics: Hosting a Static Website on AWS

In this episode of Back to Basics, join SA Readiness Specialist Even Zhang as he breaks down the AWS services you can use to host and scale your static website without a single server. You’ll also learn how to use additional functionalities to enhance your observability and security posture or run A/B tests.

Figure 1. CloudFront Edge Locations and Caches from Back to Basics video


Journey to Adopt Cloud-Native Architecture Series: #4 – Governing Security at Scale and IAM Baselining

Post Syndicated from Anuj Gupta original https://aws.amazon.com/blogs/architecture/journey-to-adopt-cloud-native-architecture-series-4-governing-security-at-scale-and-iam-baselining/

In Part 3 of this series, Improved Resiliency and Standardized Observability, we talked about design patterns that you can adopt to improve resiliency, achieve minimum business continuity, and scale applications with lengthy transactions (more than 3 minutes).

As a refresher from previous blogs in this series, our example ecommerce company’s “Shoppers” application runs in the cloud. The company experienced hypergrowth, which posed a number of platform and technology challenges, namely, they needed to scale on the backend without impacting users.

Because of this hypergrowth, distributed denial of service (DDoS) attacks on the ecommerce company’s services increased 10 times in 6 months. Some of these attacks led to downtime and loss of revenue. This blog post shows you how we addressed these threats by implementing a multi-account strategy and applying AWS Identity and Access Management (IAM) best practices.

A multi-account strategy ensures security at scale

Originally, the company’s production and non-production services were running in a single account. This meant non-production vulnerabilities like frequently changing code or privileged access could impact the production environment. Additionally, the application experienced issues due to unexpectedly reaching service quotas. These include (but are not limited to) number of read replicas per master in Amazon Relational Database Service (Amazon RDS) and total storage for all DB instances in Auto Scaling Service Quotas for Amazon Elastic Compute Cloud (Amazon EC2).

To address these issues, we followed multi-account strategy best practices. We established the multi-account hierarchy shown in Figure 1 that includes the following eight organizational units (OUs) to meet business requirements:

  1. Security PROD OU
  2. Security SDLC OU
  3. Infrastructure PROD OU
  4. Infrastructure SDLC OU
  5. Workload PROD OU
  6. Workload SDLC OU
  7. Sandbox OU
  8. Transitional OU

To identify the right fit for our needs, we evaluated AWS Landing Zone and AWS Control Tower. To reduce the operational overhead of maintaining a solution, we used AWS Control Tower to deploy guardrails as service control policies (SCPs). These guardrails were then separated into production and non-production environments, creating the hierarchy shown in Figure 1.

We created a new Payer (or Management) Account with Sandbox OU and Transitional OU under Root OU. We then moved existing AWS accounts under the Transitional OU and Sandbox OU. We provisioned new accounts with Account Factory and gradually migrated services from existing AWS accounts into the newly formed Log Archive Account, Security Account, Network Account, and Shared Services Account and applied appropriate guardrails. We then registered Sandbox OU with Control Tower. Additionally, we migrated the centralized logging solution from Part 3 of this blog series to the Security Account. We moved non-production applications into the Dev and Test Accounts, respectively, to isolate workloads. We then moved existing accounts that had production services from the Transitional OU to Workload PROD OU.

Figure 1. Multi-account hierarchy

Implementing a multi-account strategy alleviated service quota challenges. It isolated variable demand non-production environments from more consistent production environments, which reduced the downtime caused by unplanned scaling events. The multi-account strategy enforces governance at scale, but also promotes innovation by allocating separate accounts with distinct security requirements for proof of concepts and experimentation. This reduces impact risks to production accounts and allows the required guardrails to be automatically applied.

Improving access management and least privilege access

When the company experienced hypergrowth, they not only had to scale their application’s infrastructure, but they also had to increase how often they release their code. They also hired and onboarded new internal teams.

To strengthen new and existing employees’ credentials, we used the AWS Trusted Advisor check for IAM Access Key Rotation. This check identifies IAM users whose access keys have not been rotated for more than 90 days. We then created an automated way to rotate those keys. Next, we generated an IAM credential report to identify IAM users that don’t need console access or that don’t need access keys. We gradually moved these users to role-based access instead of IAM access keys.
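
The 90-day rule can be illustrated against an IAM credential report. This is a minimal sketch, not the automation we built: it assumes the report has been downloaded as CSV and reads only the `user`, `access_key_N_active`, and `access_key_N_last_rotated` columns. The sample rows and user names are hypothetical.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)


def stale_access_keys(report_csv, now):
    """Return (user, key_slot) pairs whose active key was rotated > 90 days ago."""
    stale = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        for slot in ("access_key_1", "access_key_2"):
            if row.get(f"{slot}_active") != "true":
                continue  # inactive or missing keys are not a rotation concern
            rotated = row.get(f"{slot}_last_rotated", "N/A")
            if rotated == "N/A":
                continue
            if now - datetime.fromisoformat(rotated) > MAX_KEY_AGE:
                stale.append((row["user"], slot))
    return stale


# Hypothetical excerpt containing only the columns this sketch reads.
REPORT = """user,password_enabled,access_key_1_active,access_key_1_last_rotated,access_key_2_active,access_key_2_last_rotated
alice,true,true,2021-01-15T10:00:00+00:00,false,N/A
build-bot,false,true,2021-09-20T08:30:00+00:00,false,N/A
"""

now = datetime(2021, 10, 1, tzinfo=timezone.utc)
print(stale_access_keys(REPORT, now))
```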

During a Well-Architected Security Pillar review, we identified some applications that used hardcoded passwords that hadn’t been updated for more than 90 days. We re-factored these applications to get passwords from AWS Secrets Manager and followed best practices for performance.

Additionally, we set up a system to automatically change passwords for RDS databases and wrote an AWS Lambda function to update passwords for third-party integration. Some applications on Amazon EC2 were using IAM access keys to access AWS services. We re-factored them to get permissions from the EC2 instance role attached to the EC2 instances, which reduced operational burden of rotating access keys.

Using IAM Access Analyzer, we analyzed AWS CloudTrail logs and generated policies for IAM roles. This helped us determine the least privilege permissions required for the roles, as described in the blog post “IAM Access Analyzer makes it easier to implement least privilege permissions by generating IAM policies based on access activity.”

To streamline access for internal users, we migrated users to AWS Single Sign-On (AWS SSO) federated access. We enabled all features in AWS Organizations to use AWS SSO and created permission sets to define access boundaries for different functions. We assigned permission sets to different user groups and assigned users to user groups based on their job function. This allowed us to reduce the number of IAM policies and use tag-based control when defining AWS SSO permissions policies.

We followed the guidance in the Attribute-based Access Control with AWS SSO blog post to map user attributes and use tags to define permissions boundaries for user groups. This allowed us to provide access to users based on specific teams, projects, and departments. We enforced multi-factor authentication (MFA) for all AWS SSO users by configuring MFA settings to allow sign in only when an MFA device has been registered.
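
A tag-based permissions policy of this kind might look like the following sketch. It is an illustration under assumptions, not the policy we deployed: the `Department` tag key and the two EC2 actions are hypothetical choices, and the policy assumes that a matching user attribute is passed as a session tag and that resources carry the same tag.

```python
import json

# Hypothetical tag key; assumes a "Department" user attribute is mapped to a
# session tag and that resources carry a matching "Department" tag.
TAG_KEY = "Department"

abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:StartInstances", "ec2:StopInstances"],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    # Allow only when the resource tag equals the caller's tag.
                    f"aws:ResourceTag/{TAG_KEY}": f"${{aws:PrincipalTag/{TAG_KEY}}}"
                }
            },
        }
    ],
}

print(json.dumps(abac_policy, indent=2))
```

One policy of this shape can serve every department, because the comparison is resolved per user at request time rather than hardcoded per team.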

These improvements ensure that only the right people have access to the required resources for the right time. They reduce the risk of compromised security credentials by using AWS Security Token Service (AWS STS) to generate temporary credentials when needed. System passwords are better protected from unwanted access and automatically rotated for improved security. AWS SSO also allows us to enforce permissions at scale when people’s job functions change within or across teams.


In this blog post, we described design patterns we used to implement security governance at scale using multi-account strategy and AWS SSO integrations. We also talked about patterns you can adopt for IAM baselining that allow least privilege access, checking for IAM best practices, and proactively detecting unwanted access.

This blog post also covers why you need to refresh your threat model during hyperscale growth and how different services can make it easier to enforce security controls. In the next blog, we will talk about more security design patterns to improve infrastructure security and incident response during hyperscale.

Find out more

Other blogs in this series

Related information

Serverless Architecture for a Structured Data Mining Solution

Post Syndicated from Uri Rotem original https://aws.amazon.com/blogs/architecture/serverless-architecture-for-a-structured-data-mining-solution/

Many businesses have an essential need for structured data stored in their own database for business operations and offerings. For example, a company that produces electronics may want to store a structured dataset of parts. Each part has properties such as color, weight, and connector type.

This data may already be available from external sources. In many cases, one source is sufficient. But often, multiple data sources from different vendors must be incorporated. Each data source might have a different structure for the same data field, which is problematic. Achieving one unified structure from variable sources can be difficult, and is a classic data mining problem.

We will break the problem into two main challenges:

  1. Locate and collect data. Collect from multiple data sources and load data into a data store.
  2. Unify the collected data. Since the collected data has no constraints, it might be stored in different structures and file formats. To use the collected data, it must be unified by performing an extract, transform, load (ETL) process. This matches the different data sources and creates one unified data store.

In this post, we demonstrate a pipeline of services, built on top of a serverless architecture that will handle the preceding challenges. This architecture supports large-scale datasets. Because it is a serverless solution, it is also secure and cost effective.

We use Amazon SageMaker Ground Truth as a tool for classifying the data, so that no custom code is needed to classify different data sources.

Data mining and structuring

There are three main steps to explore in order to solve these challenges:

  1. Collect the data – Data mine from different sources
  2. Map the data – Construct a dictionary of key-value pairs without writing code
  3. Structure the collected data – Enrich your dataset with a unified collection of data that was collected and mapped in steps 1 and 2

Following is an example of a use case and solution flow using this architecture:

  • In this scenario, a company must enrich an empty database with items and properties (see Figure 1).
Figure 1. Company data before data mining

  • Data will then be collected from multiple data sources, and stored in the cloud, as shown in Figure 2.
Figure 2. Collecting the data by SKU from different sources

  • To unify different property names, SageMaker Ground Truth is used to label the property names with a list of properties. The results are stored in Amazon DynamoDB, shown in Figure 3.
Figure 3. Mapping the property names to match a unified name

  • Finally, the database is populated and enriched by the mapped properties from the different data sources. This can be iterated with new sources to further enrich the database (see Figure 4).
Figure 4. Company data after data mining, mapping, and structuring

1. Collect the data

Using the serverless architecture illustrated in Figure 5, your teams can minimize effort and cost while collecting and storing the large-scale datasets your business requires.

Figure 5. Serverless architecture for parallel data collection

We use Amazon S3, a highly scalable and durable object storage service, to store the original dataset. Uploading the dataset initiates an event that invokes a Lambda function to start a state machine, using the original dataset as its input.

AWS Step Functions is used to orchestrate the process of preparing the dataset for parallel scraping of the items. It automatically manages the queue of items to be processed when the dataset is large. Step Functions ensures visibility of the process, reports errors, and decouples the compute-intensive scraping operation per item.

The state machine has two steps:

  1. ETL the data to clean and standardize it. Store each item in Amazon DynamoDB, a fast and flexible NoSQL database service for any scale. The ETL function will create an array of all the item identifiers. The identifier is a unique descriptor of the item, such as manufacturer ID and SKU.
  2. Using the Map functionality of Step Functions, a Lambda function will be invoked for each item. This runs all your scrapers for that item and stores the results in an S3 bucket.

This solution requires custom implementation of only these two functions, according to your own dataset and scraping sources. The ETL Lambda function will contain logic needed to transform your input into an array of identifiers. The scraper Lambda function will contain logic to locate the data in the source and then store it.
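
The two-step state machine could be defined along these lines. This is a sketch in Amazon States Language, built as a Python dictionary, and not the exact definition used in the solution: the state names, the Lambda ARNs, the concurrency limit, and the assumption that the ETL function returns an object with an `identifiers` array are all hypothetical.

```python
import json

# Hypothetical Lambda ARNs; replace with your own ETL and scraper functions.
ETL_FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:etl"
SCRAPER_FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:scraper"

state_machine = {
    "StartAt": "EtlDataset",
    "States": {
        # Step 1: clean the dataset and return {"identifiers": [...]}.
        "EtlDataset": {
            "Type": "Task",
            "Resource": ETL_FUNCTION_ARN,
            "ResultPath": "$.etl",
            "Next": "ScrapeEachItem",
        },
        # Step 2: invoke the scraper function once per identifier.
        "ScrapeEachItem": {
            "Type": "Map",
            "ItemsPath": "$.etl.identifiers",
            "MaxConcurrency": 10,
            "Iterator": {
                "StartAt": "ScrapeItem",
                "States": {
                    "ScrapeItem": {
                        "Type": "Task",
                        "Resource": SCRAPER_FUNCTION_ARN,
                        "End": True,
                    }
                },
            },
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```

Setting `MaxConcurrency` is one way to keep the scrapers from overwhelming an external source; the value 10 here is arbitrary.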

Scraper function flow

For each data source, write your own scraper. The Lambda function can run them sequentially.

  1. Use the identifier input to locate the item in each one of the external sources. The data source can be an API, a webpage, a PDF file, or other source.
    • API: Collecting this data will be specific to the interface provided.
    • Webpages: Data is collected with custom code. There are open source libraries that are popular for this task, such as Beautiful Soup.
    • PDF files: Consider using Amazon Textract. Amazon Textract will give you key-value pairs and table analysis.
  2. Transform the response to key-value pairs as part of the scraper logic.
  3. Store the key-value pairs in a sub folder of the scraper responses S3 bucket, and name it after that data source.
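
The flow above can be sketched as follows. The two scraper functions, the source names, and the item identifier are hypothetical stand-ins, and the sketch returns the objects it would write instead of calling Amazon S3, so that it runs without AWS credentials.

```python
import json


def scrape_vendor_api(identifier):
    """Hypothetical scraper; a real one would call the vendor's API."""
    return {"Colour": "red", "Weight (g)": "12"}


def scrape_vendor_page(identifier):
    """Hypothetical scraper; a real one might parse HTML with Beautiful Soup."""
    return {"color": "Red", "connector": "USB-C"}


# One entry per data source; the key doubles as the S3 sub folder name.
SCRAPERS = {"vendor-api": scrape_vendor_api, "vendor-page": scrape_vendor_page}


def run_scrapers(identifier):
    """Run every scraper sequentially; return {object key: JSON body} to store."""
    objects = {}
    for source, scraper in SCRAPERS.items():
        pairs = scraper(identifier)  # each scraper returns key-value pairs
        objects[f"{source}/{identifier}.json"] = json.dumps(pairs, sort_keys=True)
    return objects


for key, body in run_scrapers("ACME-1234").items():
    print(key, body)
```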

2. Mapping the responses

Figure 6. Pipeline for property mapping

This pipeline is initiated after the data is collected. It creates a named entity recognition labeling job with a predefined set of labels. The labeling work is split among your workforce. When the job is completed, the output manifest file for named entity recognition is passed to the final ETL Lambda function. This function locates the label keys and values identified by your workforce and places the results in a reusable mapping table in DynamoDB.
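
The final ETL step might be sketched as below. Everything about the manifest shape here is an assumption made for illustration: the sketch assumes one JSON object per line with a `source` text and a list of labeled entities carrying `startOffset`, `endOffset`, and `label`, treats the offsets as Python slice bounds, and uses a hypothetical label attribute name. Check the output of your own labeling job before adapting it.

```python
import json

LABEL_ATTRIBUTE = "property-mapping"  # hypothetical labeling job attribute name


def mapping_from_manifest(manifest_text):
    """Return {raw property name: unified label} from labeled text spans."""
    mapping = {}
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        text = item["source"]
        for entity in item[LABEL_ATTRIBUTE]["annotations"]["entities"]:
            # The labeled span is the property name as the source spelled it.
            raw_name = text[entity["startOffset"]:entity["endOffset"]]
            mapping[raw_name] = entity["label"]
    return mapping


# Two hypothetical manifest lines, one per labeled line item.
MANIFEST = "\n".join([
    json.dumps({"source": "Colour: red", LABEL_ATTRIBUTE: {"annotations": {
        "entities": [{"startOffset": 0, "endOffset": 6, "label": "color"}]}}}),
    json.dumps({"source": "Weight (g): 12", LABEL_ATTRIBUTE: {"annotations": {
        "entities": [{"startOffset": 0, "endOffset": 10, "label": "weight"}]}}}),
])

print(mapping_from_manifest(MANIFEST))
```

The resulting dictionary corresponds to the rows that would be written to the mapping table.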

Services used:

Amazon SageMaker Ground Truth is a fully managed data labeling service that helps you build highly accurate training datasets for machine learning (ML). By using Ground Truth, your teams can unify different data sources to match each other, so they can be identified and used in your applications.

Figure 7. Example of one line item being labeled by one of the Workforce team members

3. Structure the collected data

Figure 8. Architecture diagram of entire data collection and classification process

Using another Lambda function (see “populate items properties” in Figure 8), we use the collected data (1) and the mapping (2) to populate the unified dataset into the original data DynamoDB table (3).
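
The core of that function can be sketched in a few lines. The item, scraper output, and mapping below are hypothetical sample data, and the rule that the first source wins on conflicts is an assumption, not something the architecture prescribes.

```python
def populate_item(item, collected, mapping):
    """Merge scraped key-value pairs into an item using unified property names."""
    enriched = dict(item)
    for source, pairs in collected.items():
        for raw_name, value in pairs.items():
            unified = mapping.get(raw_name)
            if unified is None:
                continue  # property name was never labeled; leave it out
            enriched.setdefault(unified, value)  # assumption: first source wins
    return enriched


# (3) A row from the original data table.
item = {"sku": "ACME-1234"}
# (1) Collected data, grouped by data source.
collected = {
    "vendor-api": {"Colour": "red", "Weight (g)": "12"},
    "vendor-page": {"color": "Red", "connector": "USB-C"},
}
# (2) Mapping from source-specific names to unified names.
mapping = {"Colour": "color", "color": "color",
           "Weight (g)": "weight", "connector": "connector_type"}

print(populate_item(item, collected, mapping))
```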


In this blog, we showed a solution to automatically collect and structure data. We used a serverless architecture that requires minimal effort, to build a reusable asset that can unify different property definitions from different data sources. Minimal effort is involved in structuring this data, as we use Amazon SageMaker Ground Truth to match and reconcile the new data sources.

For further reading:

Field Notes: How to Scale Your Networks on Amazon Web Services

Post Syndicated from Androski Spicer original https://aws.amazon.com/blogs/architecture/field-notes-how-to-scale-your-networks-on-amazon-web-services/

As AWS adoption increases throughout an organization, the number of networks and virtual private clouds (VPCs) to support them also increases. Customers can see growth upwards of tens, hundreds, or in the case of the enterprise, thousands of VPCs.

Generally, this increase in VPCs is driven by the need to:

  • Simplify routing, connectivity, and isolation boundaries
  • Reduce network infrastructure cost
  • Reduce management overhead

Overview of solution

This blog post discusses the guidance customers require to achieve their desired outcomes. Guidance is provided through a series of real-world scenarios customers encounter on their journey to building a well-architected network environment on AWS. These challenges range from the need to centralize networking resources, to reduce complexity and cost, to implementing security techniques that help workloads to meet industry and customer specific operational compliance.

The scenarios presented here form the foundation and starting point from which the intended guidance is provided. These scenarios start as simple, but gradually increase in complexity. Each scenario tackles different questions customers ask AWS solutions architects, service teams, professional services, and other AWS professionals, on a daily basis.

Some of these questions are:

  • What does centralized DNS look like on AWS, and how should I approach and implement it?
  • How do I reduce the cost and complexity associated with Amazon Virtual Private Cloud (Amazon VPC) interface endpoints for AWS services that are spread across many AWS accounts by centralizing them?
  • What does centralized packet inspection look like on AWS, and how should we approach it?

This blog post will answer these questions, and more.


This blog post assumes that the reader has some understanding of AWS networking basics outlined in the blog post One to Many: Evolving VPC Design. It also assumes that the reader understands industry-wide networking basics.

Simplify routing, connectivity, and isolation boundaries

Simplification in routing starts with selecting the correct layer 3 technology. In the past, customers used a combination of VPC peering, Virtual Gateway configurations, and the Transit VPC Solution to achieve inter–VPC routing, and routing to on-premises resources. These solutions presented challenges in configuration and management complexity, as well as security and scaling.

To solve these challenges, AWS introduced AWS Transit Gateway. Transit Gateway is a regional virtual router to which customers can attach their VPCs, site-to-site virtual private networks (VPNs), Transit Gateway Connect, AWS Direct Connect gateways, and cross-region transit gateway peering connections, and then configure routing between them. Transit Gateway scales up to 5,000 attachments, so a customer can start with one VPC attachment and scale up to thousands of attachments across thousands of accounts. Each VPC, Direct Connect gateway, and peer transit gateway connection receives up to 50 Gbps of bandwidth.

Routing happens at layer 3 through a transit gateway. A transit gateway comes with a default route table with which attachments are associated by default. If route propagation and association are enabled at transit gateway creation time, AWS will create a transit gateway with a default route table to which attachments are automatically associated and their routes automatically propagated. This creates a network where all attachments can route to each other.

Adding VPN or Direct Connect gateway attachments to on-premises networks will allow all attached VPCs and networks to easily route to on-premises networks. Some customers require isolation boundaries between routing domains. This can be achieved with Transit Gateway.

Let’s review a use case where a customer with two spoke VPCs and a shared services VPC (shared-services-vpc-A) would like to:

  • Allow all spoke VPCs to access the shared services VPC
  • Disallow access between spoke VPCs

Figure 1. Transit Gateway Deployment

To achieve this, the customer needs to:

  1. Create a transit gateway with the name tgw-A and two route tables with the names spoke-tgw-route-table and shared-services-tgw-route-table.
    1. When creating the transit gateway, disable automatic association and propagation to the default route table.
    2. Enable equal-cost multi-path routing (ECMP) and use a unique Border Gateway Protocol (BGP) autonomous system number (ASN).
  2. Associate all spoke VPCs with the spoke-tgw-route-table.
    1. Do not propagate their routes to the spoke-tgw-route-table.
    2. Propagate their routes to the shared-services-tgw-route-table.
  3. Associate the shared services VPC with the shared-services-tgw-route-table, and propagate or statically add its routes to the spoke-tgw-route-table.
  4. Add a default and summarized route with a next hop of the transit gateway to the route tables of the shared services and spoke VPCs.
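
To see why this configuration isolates the spokes, it can help to model it. The sketch below is a simplified model, not an AWS API call: it treats an association as "which route table an attachment's traffic is looked up in" and a propagation as "which attachments contribute routes to a table". The VPC names other than shared-services-vpc-A and all CIDR ranges are hypothetical.

```python
from ipaddress import ip_address, ip_network

# Hypothetical CIDR ranges for the three VPC attachments.
ATTACHMENTS = {
    "spoke-vpc-1": "10.1.0.0/16",
    "spoke-vpc-2": "10.2.0.0/16",
    "shared-services-vpc-A": "10.100.0.0/16",
}
# The route table each attachment is associated with (used for lookups).
ASSOCIATIONS = {
    "spoke-vpc-1": "spoke-tgw-route-table",
    "spoke-vpc-2": "spoke-tgw-route-table",
    "shared-services-vpc-A": "shared-services-tgw-route-table",
}
# The attachments that propagate their routes into each route table.
PROPAGATIONS = {
    "spoke-tgw-route-table": ["shared-services-vpc-A"],
    "shared-services-tgw-route-table": ["spoke-vpc-1", "spoke-vpc-2"],
}


def can_route(source, destination_ip):
    """True if the source's associated route table has a route to the address."""
    table = ASSOCIATIONS[source]
    address = ip_address(destination_ip)
    return any(address in ip_network(ATTACHMENTS[a]) for a in PROPAGATIONS[table])


print(can_route("spoke-vpc-1", "10.100.0.10"))          # spoke to shared services
print(can_route("spoke-vpc-1", "10.2.0.10"))            # spoke to other spoke
print(can_route("shared-services-vpc-A", "10.1.0.10"))  # shared services to spoke
```

In this model, spoke traffic finds a route only to the shared services VPC, because no spoke propagates into the table that spokes are associated with.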

After successfully deploying this configuration, the customer decides to:

  1. Allow all VPCs access to on-premises resources through AWS site-to-site VPNs.
  2. Require an aggregated bandwidth of 10 Gbps across this VPN.
Figure 2. Transit Gateway hub and spoke architecture, with VPCs and multiple AWS site-to-site VPNs

To achieve this, the customer needs to:

  1. Create four site-to-site VPNs between the transit gateway and the on-premises routers with BGP as the routing protocol.
    1. Each AWS site-to-site VPN has two VPN tunnels, and each tunnel supports up to 1.25 Gbps of bandwidth. With ECMP, four VPNs (eight tunnels) provide the required 10 Gbps in aggregate.
    2. Read more on how to configure ECMP for site-to-site VPNs.
  2. Create a third transit gateway route table with the name WAN-connections-route-table.
  3. Associate all four VPNs with the WAN-connections-route-table.
  4. Propagate the routes from the spoke and shared services VPCs to WAN-connections-route-table.
  5. Propagate VPN attachment routes to the spoke-tgw-route-table and shared-services-tgw-route-table.
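
The choice of four VPNs follows from simple arithmetic on the figures above, sketched here. The function name is illustrative; the two constants are the tunnel count and per-tunnel bandwidth quoted in the steps.

```python
import math

TUNNELS_PER_VPN = 2
GBPS_PER_TUNNEL = 1.25  # per-tunnel figure quoted in the steps above


def vpn_connections_needed(target_gbps):
    """Smallest number of VPN connections whose tunnels add up to the target."""
    per_connection = TUNNELS_PER_VPN * GBPS_PER_TUNNEL
    return math.ceil(target_gbps / per_connection)


print(vpn_connections_needed(10))
```

Note that this is aggregate bandwidth spread across flows by ECMP; a single flow still travels over one tunnel.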

Building on this progress, the customer has decided to deploy another transit gateway and shared services VPC in another AWS Region. They would like both shared service VPCs to be connected.

Figure 3. Transit Gateway peering connection architecture

To accomplish these requirements, the customer needs to:

  1. Create a transit gateway with the name tgw-B in the new region.
  2. Create a transit gateway peering connection between tgw-A and tgw-B. Ensure peering requests are accepted.
  3. Statically add a route to the shared-services-tgw-route-table in region A that has the transit-gateway-peering attachment as the next hop for traffic destined to the VPC Classless Inter-Domain Routing (CIDR) range for shared-services-vpc-B. Then, in region B, add a route to the shared-services-tgw-route-table that has the transit-gateway-peering attachment as the next hop for traffic destined to the VPC CIDR range for shared-services-vpc-A.

Reduce network infrastructure cost

It is important to design your network to eliminate unnecessary complexity and management overhead, and to optimize cost. To achieve this, use centralization. Instead of creating network infrastructure that is needed by every VPC inside each VPC, deploy these resources in a type of shared services VPC and share them throughout your entire network. This results in the creation of this infrastructure only one time, which reduces the cost and management overhead.

Some VPC components that can be centralized are network address translation (NAT) gateways, VPC interface endpoints, and AWS Network Firewall. Third-party firewalls can also be centralized.

Let’s take a look at a few use cases that build on the previous use cases.

Figure 4. Centralized interface endpoint architecture

The customer has made the decision to allow access to AWS Key Management Service (AWS KMS) and AWS Secrets Manager from their VPCs.

The customer should employ the strategy of centralizing their VPC interface endpoints to reduce the potential proliferation of cost, management overhead, and complexity that can occur when working with this VPC feature.

To centralize these endpoints, the customer should:

  1. Deploy AWS VPC interface endpoints for AWS KMS and Secrets Manager inside shared-services-vpc-A and shared-services-vpc-B.
    1. Disable each Private DNS.

Figure 5. Centralized interface endpoint step-by-step guide (Step 1)

  2. Use the AWS default DNS name for AWS KMS and Secrets Manager to create an Amazon Route 53 private hosted zone (PHZ) for each of these services. These are:
    1. kms.<region>.amazonaws.com
    2. secretsmanager.<region>.amazonaws.com
Figure 6. Centralized interface endpoint step-by-step guide (Step 2)

  3. Authorize each spoke VPC to associate with the PHZ in their respective region. This can be done from the AWS Command Line Interface (AWS CLI) by using the command aws route53 create-vpc-association-authorization --hosted-zone-id <hosted-zone-id> --vpc VPCRegion=<region>,VPCId=<vpc-id> --region <AWS-REGION>.
  4. Create an A record for each PHZ. In the creation process, for the Route to option, select the VPC Endpoint Alias. Add the respective VPC interface endpoint DNS hostname that is not Availability Zone specific (for example, vpce-0073b71485b9ad255-mu7cd69m.ssm.ap-south-1.vpce.amazonaws.com).
Figure 7. Centralized interface endpoint step-by-step guide (Step 3)

  5. Associate each spoke VPC with the available PHZs. Use the CLI command aws route53 associate-vpc-with-hosted-zone --hosted-zone-id <hosted-zone-id> --vpc VPCRegion=<region>,VPCId=<vpc-id> --region <AWS-REGION>.

This concludes the configuration for centralized VPC interface endpoints for AWS KMS and Secrets Manager. You can learn more about cross-account PHZ association configuration.
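
With many spoke VPCs, the authorize and associate commands are repetitive, so generating them can reduce typing errors. The helper below is a hypothetical convenience that only prints the two AWS CLI commands from the steps above; it does not run them. The hosted zone ID, VPC IDs, and Region are placeholders.

```python
import shlex


def phz_association_commands(hosted_zone_id, vpc_id, vpc_region):
    """Return the authorize and associate commands for one spoke VPC."""
    vpc_arg = f"VPCRegion={vpc_region},VPCId={vpc_id}"
    common = ["--hosted-zone-id", hosted_zone_id,
              "--vpc", vpc_arg,
              "--region", vpc_region]
    authorize = ["aws", "route53", "create-vpc-association-authorization", *common]
    associate = ["aws", "route53", "associate-vpc-with-hosted-zone", *common]
    return authorize, associate


# Hypothetical hosted zone and spoke VPC IDs.
SPOKE_VPCS = ["vpc-0aaa1111bbb22222c", "vpc-0ddd3333eee44444f"]
for vpc_id in SPOKE_VPCS:
    for command in phz_association_commands("Z0123456789EXAMPLE", vpc_id, "ap-south-1"):
        print(shlex.join(command))
```

In a cross-account setup, the authorize command is run with credentials for the account that owns the PHZ, and the associate command with credentials for the account that owns the spoke VPC.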

After successfully implementing centralized VPC interface endpoints, the customer has decided to centralize:

  1. Internet access.
  2. Packet inspection for East-West and North-South internet traffic using a pair of firewalls that support the Geneve protocol.

To achieve this, the customer should use the AWS Gateway Load Balancer (GWLB), Amazon VPC endpoint services, GWLB endpoints, and transit gateway route table configurations.

Figure 8. Illustrated security-egress VPC infrastructures and route table configuration

To accomplish these centralization requirements, the customer should create:

  1. A VPC with the name security-egress VPC.
  2. A GWLB and an Auto Scaling group with at least two instances of the customer’s firewall, evenly distributed across multiple private subnets in different Availability Zones.
  3. A target group for use with the GWLB. Associate the autoscaling group with this target group.
  4. A VPC endpoint service using the GWLB as the entry point. Then create GWLB endpoints for this endpoint service inside the same set of private subnets, or create a set of /28 subnets for the endpoints.
  5. Two AWS NAT gateways spread across two public subnets in multiple Availability Zones.
  6. A transit gateway attachment request from the security-egress VPC and ensure that:
    1. Transit gateway appliance mode is enabled for this attachment as it ensures bidirectional traffic forwarding to the same transit gateway attachments.
    2. Transit gateway–specific subnets are used to host the attachment interfaces.
  1. In the security-egress VPC, configure the route tables accordingly.
    1. Private subnet route table.
      1. Add default route to the NAT gateway.
      2. Add summarized routes with a next-hop of Transit Gateway for all networks you intend to route to that are connected to the Transit Gateway.
    1. Public subnet route table.
      1. Add default route to the internet gateway.
      2. Add summarized routes with a next-hop of the GWLB endpoints you intend to route to for all private networks.

Transit Gateway configuration

  1. Create a new transit gateway route table with the name transit-gateway-egress-route-table.
    1. Propagate all spoke and shared services VPCs routes to it.
    2. Associate the security-egress VPC with this route table.
  1. Add a default route to both the spoke-tgw-route-table and the shared-services-tgw-route-table that points to the security-egress VPC attachment, and remove all VPC attachment routes from both route tables.
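As an illustration of the appliance mode setting and the default routes described above, here is a minimal Boto3 sketch. It assumes an EC2 client created with boto3.client("ec2") and that the attachment and route table IDs are already known; the function name and all IDs are hypothetical.

```python
def route_spokes_through_egress(ec2, egress_attachment_id, route_table_ids):
    """Enable appliance mode on the security-egress attachment and point
    the default route of each given transit gateway route table at it."""
    # Appliance mode keeps both directions of a flow on the same attachment
    # interface, which stateful firewalls behind the GWLB rely on.
    ec2.modify_transit_gateway_vpc_attachment(
        TransitGatewayAttachmentId=egress_attachment_id,
        Options={"ApplianceModeSupport": "enable"},
    )

    # Default route in the spoke and shared services route tables.
    for route_table_id in route_table_ids:
        ec2.create_transit_gateway_route(
            DestinationCidrBlock="0.0.0.0/0",
            TransitGatewayRouteTableId=route_table_id,
            TransitGatewayAttachmentId=egress_attachment_id,
        )


# Example usage (hypothetical IDs; requires boto3 and valid credentials):
#   import boto3
#   route_spokes_through_egress(
#       boto3.client("ec2"),
#       "tgw-attach-0123456789abcdef0",
#       ["tgw-rtb-spoke-example", "tgw-rtb-shared-example"],
#   )
```

Removing the existing VPC attachment routes from both route tables is left out of this sketch, because which routes exist depends on how propagation was configured earlier.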

Figure 9. Illustrated routing configuration for the transit gateway route tables and VPC route tables

Figure 10. Illustrated North-South traffic flow from spoke VPC to the internet

Figure 11. Illustrated East-West traffic flow between spoke VPC and shared services VPC


In this blog post, we went on a network architecture journey that started with a use case of routing domain isolation. This is a scenario most customers confront when getting started with Transit Gateway. Gradually, we built upon this use case and increased its complexity by exploring other real-world scenarios that customers confront when designing multi-Region networks across multiple AWS accounts.

Regardless of the complexity, these use cases were accompanied by guidance that helps customers achieve a reduction in cost and complexity throughout their entire network on AWS.

When designing your networks, design for scale. Use AWS services that let you achieve scale without the complexity of managing the underlying infrastructure.

Also, simplify your network through the technique of centralizing repeatable resources. If more than one VPC requires access to the same resource, then find ways to centralize access to this resource which reduces the proliferation of these resources. DNS, packet inspection, and VPC interface endpoints are good examples of things that should be centralized.

Thank you for reading. Hopefully you found this blog post useful.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Improving Performance and Reducing Cost Using Availability Zone Affinity

Post Syndicated from Michael Haken original https://aws.amazon.com/blogs/architecture/improving-performance-and-reducing-cost-using-availability-zone-affinity/

One of the best practices for building resilient systems in Amazon Virtual Private Cloud (VPC) networks is using multiple Availability Zones (AZ). An AZ is one or more discrete data centers with redundant power, networking, and connectivity. Using multiple AZs allows you to operate workloads that are more highly available, fault tolerant, and scalable than would be possible from a single data center. However, transferring data across AZs adds latency and cost.

This blog post demonstrates an architectural pattern called “Availability Zone Affinity” that improves performance and reduces costs while still maintaining the benefits of Multi-AZ architectures.

Cross Availability Zone effects

AZs are physically separated by a meaningful distance from other AZs in the same AWS Region, although they are all within 60 miles (100 kilometers) of each other. This produces roundtrip latencies usually under 1-2 milliseconds (ms) between AZs in the same Region. Roundtrip latency between two instances in the same AZ is closer to 100-300 microseconds (µs) when using enhanced networking (see footnote 1). This can be even lower when the instances use cluster placement groups. Additionally, when data is transferred between two AZs, data transfer charges apply in both directions.

To better understand these effects, we’ll analyze a fictitious workload, the “foo service,” shown in Figure 1. The foo service provides a storage platform for other workloads in AWS to redundantly store data. Requests are first processed by an Application Load Balancer (ALB). ALBs always use cross-zone load balancing to evenly distribute requests to all targets. Next, the request is sent from the load balancer to a request router. The request router performs a few operations, like authorization checks and input validation, before sending it to the storage tier. The storage tier replicates the data sequentially from the lead node, to the middle node, and finally the tail node. Once the data has been written to all three nodes, it is considered committed. The response is sent from the tail node back to the request router, back through the load balancer, and finally returned to the client.

Figure 1. An example system that transfers data across AZs

We can see in Figure 1 that, in the worst case, the request traversed an AZ boundary eight times. Let’s calculate the fastest possible, zeroth percentile (p0), latency. We’ll assume the best time for non-network processing of the request in the load balancer, request router, and storage tier is 4 ms. If we consider 1 ms as the minimum network latency added for each AZ traversal, in the worst-case scenario of eight AZ traversals, the total processing time can be no faster than 12 ms. At the 50th percentile (p50), meaning the median, let’s assume the cross-AZ latency is 1.5 ms and non-network processing is 8 ms, resulting in a total of 20 ms for overall processing. Additionally, if this system is processing millions of requests, the data transfer charges could become substantial over time. Now, let’s imagine that a workload using the foo service must operate with p50 latency under 20 ms. How can the foo service change their system design to meet this goal?

Availability Zone affinity

The AZ Affinity architectural pattern reduces the number of times an AZ boundary is crossed. In the example system we looked at in Figure 1, AZ Affinity can be implemented with two changes.

  1. First, the ALB is replaced with a Network Load Balancer (NLB). NLBs provide an elastic network interface per AZ that is configured with a static IP. NLBs also have cross-zone load balancing disabled by default. This ensures that requests are only sent to targets that are in the same AZ as the elastic network interface that receives the request.
  2. Second, DNS entries are created for each elastic network interface to provide an AZ-specific record using the AZ ID, which is consistent across accounts. Clients use that DNS record to communicate with a load balancer in the AZ they select. So instead of interacting with a Region-wide service using a DNS name like foo.com, they would instead use use1-az1.foo.com.

Figure 2 shows the system with AZ Affinity. We can see that each request, in the worst case, only traverses an AZ boundary four times. Data transfer costs are reduced by approximately 40 percent compared to the previous implementation. If we use 300 µs as the p50 latency for intra-AZ communication, we now get (4 × 300 µs) + (4 × 1.5 ms) = 7.2 ms. Using the median 8 ms processing time, this brings the overall median latency to 15.2 ms. This represents a 40 percent reduction in median network latency. When thinking about p90, p99, or even p99.9 latencies, this reduction could be even more significant.
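To make the arithmetic easy to check or to repeat with your own measurements, here is a small Python sketch of the median latency model used in this post. The hop counts and latencies are the ones quoted above; the function and variable names are ours.

```python
def latency_ms(cross_az_hops, intra_az_hops, cross_az_ms, intra_az_ms, processing_ms):
    """Return (network latency, total latency) in milliseconds."""
    network = cross_az_hops * cross_az_ms + intra_az_hops * intra_az_ms
    return network, network + processing_ms


# Median (p50) figures from the post: 1.5 ms per cross-AZ hop,
# 0.3 ms (300 microseconds) per intra-AZ hop, 8 ms of processing.
before_network, before_total = latency_ms(8, 0, 1.5, 0.3, 8.0)
after_network, after_total = latency_ms(4, 4, 1.5, 0.3, 8.0)
network_reduction = (before_network - after_network) / before_network

print(f"before: network {before_network:.1f} ms, total {before_total:.1f} ms")
print(f"after:  network {after_network:.1f} ms, total {after_total:.1f} ms")
print(f"median network latency reduction: {network_reduction:.0%}")
```

This reproduces the 20 ms and 15.2 ms totals and the 40 percent reduction in network latency. Substituting p90 or p99 measurements from your own environment shows how the gap changes at higher percentiles.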

Figure 2. The system now implements AZ Affinity

Figure 3 shows how you could take this approach one step further using service discovery. Instead of requiring the client to remember AZ-specific DNS names for load balancers, we can use AWS Cloud Map for service discovery. AWS Cloud Map is a fully managed service that allows clients to look up IP address and port combinations of service instances using DNS and dynamically retrieve abstract endpoints, like URLs, over the HTTP-based service discovery API. Service discovery can reduce the need for load balancers, removing their cost and added latency.

The client first retrieves details about the service instances in their AZ from the AWS Cloud Map registry. The results are filtered to the client’s AZ by specifying an optional parameter in the request. Then they use that information to send requests to the discovered request routers.
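A lookup like this could be sketched with Boto3 as follows. The sketch assumes a Cloud Map client created with boto3.client("servicediscovery") and assumes that each request router was registered with a custom attribute named AvailabilityZoneId; that attribute name, the function name, and the namespace and service names are hypothetical.

```python
def discover_local_routers(cloud_map, namespace, service, az_id):
    """Return (ip, port) pairs for healthy instances, preferring the client's AZ."""
    response = cloud_map.discover_instances(
        NamespaceName=namespace,
        ServiceName=service,
        HealthStatus="HEALTHY",
        # Optional filter: the custom attribute name is an assumption.
        OptionalParameters={"AvailabilityZoneId": az_id},
    )
    return [
        (
            instance["Attributes"]["AWS_INSTANCE_IPV4"],
            int(instance["Attributes"]["AWS_INSTANCE_PORT"]),
        )
        for instance in response["Instances"]
    ]


# Example usage (hypothetical names; requires boto3 and valid credentials):
#   import boto3
#   routers = discover_local_routers(
#       boto3.client("servicediscovery"), "foo.internal", "request-router", "use1-az1"
#   )
```

As we understand the DiscoverInstances API, optional parameters act as a preference rather than a hard filter: if no instance matches them, instances that match the required query parameters are returned instead. That gives the client a fallback when its own AZ has no healthy request routers, but you should confirm this behavior against the current API reference.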

Figure 3. AZ Affinity implemented using AWS Cloud Map for service discovery

Workload resiliency

In the new architecture using AZ Affinity, the client has to select which AZ they communicate with. Since they are “pinned” to a single AZ and not load balanced across multiple AZs, they may see impact during an event affecting the AWS infrastructure or foo service in that AZ.

During this kind of event, clients can choose to use retries with exponential backoff or send requests to the other AZs that aren’t impacted. Alternatively, they could implement a circuit breaker to stop making requests from the client in the affected AZ and only use clients in the others. Both approaches allow them to use the resiliency of Multi-AZ systems while taking advantage of AZ Affinity during normal operation.
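The first approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production client: send is a hypothetical callable that raises ConnectionError on failure, and endpoints is a list of AZ-specific names (such as use1-az1.foo.com) ordered with the client's own AZ first.

```python
import random
import time


def call_with_az_failover(send, endpoints, attempts_per_az=3, base_delay=0.05,
                          sleep=time.sleep):
    """Try the preferred AZ first with exponential backoff, then fail over."""
    last_error = None
    for endpoint in endpoints:
        for attempt in range(attempts_per_az):
            try:
                return send(endpoint)
            except ConnectionError as error:
                last_error = error
                if attempt < attempts_per_az - 1:
                    # Exponential backoff with full jitter: the upper bound
                    # doubles on every failed attempt against this AZ.
                    sleep(random.uniform(0, base_delay * 2 ** attempt))
        # All attempts against this AZ failed; move on to the next one.
    raise RuntimeError("all AZ endpoints failed") from last_error
```

A circuit breaker would add state on top of this: after repeated failures it would skip the affected AZ entirely for a cooling-off period instead of retrying it on every request.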

Client libraries

The easiest way to achieve the process of service discovery, retries with exponential backoff, circuit breakers, and failover is to provide a client library/SDK. The library handles all of this logic for users and makes the process transparent, like what the AWS SDK or CLI does. Users then get two options, the low-level API and the high-level library.


This blog demonstrated how the AZ Affinity pattern helps reduce latency and data transfer costs for Multi-AZ systems while providing high availability. If you want to investigate your data transfer costs, check out the Using AWS Cost Explorer to analyze data transfer costs blog for an approach using AWS Cost Explorer.

For investigating latency in your workload, consider using AWS X-Ray and Amazon CloudWatch for tracing and observability in your system. AZ Affinity isn’t the right solution for every workload, but if you need to reduce inter-AZ data transfer costs or improve latency, it’s definitely an approach to consider.

  1. This estimate was made using t4g.small instances sending ping requests across AZs. The tests were conducted in the us-east-1, us-west-2, and eu-west-1 Regions. These results represent the p0 (fastest) and p50 (median) intra-AZ latency in those Regions at the time they were gathered, but are not a guarantee of the latency between two instances in any location. You should perform your own tests to calculate the performance enhancements AZ Affinity offers.

Migrate Resources Between AWS Accounts

Post Syndicated from Ashok Srirama original https://aws.amazon.com/blogs/architecture/migrate-resources-between-aws-accounts/

Have you ever wondered how to move resources between Amazon Web Services (AWS) accounts? You can really view this as a migration of resources. Migrating resources from one AWS account to another may be desired or required due to your business needs. Following are a few scenarios where this may be of benefit:

  1. When you acquire, sell, or merge overseas operations from other businesses.
  2. When you move regional operations from one Managed Service Provider (MSP) to another.
  3. When you reorganize your AWS account and organizational structure.

This process may involve migrating the infrastructure either partially or completely.

In this blog, we will discuss various approaches to migrating resources based on type, configuration, and workload needs. Usually, the first consideration is infrastructure. What’s in your environment? What are the interdependencies? How will you migrate each resource?

Using this information, you can outline a plan on how you will approach migrating each of the resources in your portfolio, and in what order.

Here are some considerations to address for a typical migration:

Let’s look at each of these considerations in detail.

Migrating infrastructure

To migrate infrastructure that includes ephemeral resources, you can use one of the following Infrastructure as Code (IaC) approaches, shown in Figure 1. IaC templates are like programming scripts that automate the provisioning of IT resources.

Figure 1. Approaches to migrate infrastructure using IaC

1. If you are already using AWS CloudFormation templates, you can easily import the existing templates to the target AWS account.

AWS CloudFormation simplifies provisioning and management on AWS. You can create templates for quick and reliable provisioning of services or applications (called “stacks”).

2. You can use tools like Former2 to templatize your existing resources in the source AWS account and deploy them in the target account.

Former2 is an open-source project that allows you to generate IaC templates. For example, AWS CloudFormation or HashiCorp Terraform can be generated from the existing resources within your AWS account. Read Accelerate infrastructure as code development with open source Former2 for step-by-step guidance.

Migrating compute resources

To migrate compute resources that have a persistent state, you can use one of the following approaches, shown in Figure 2. These provide a virtual computing environment, allowing you to launch instances with a variety of operating systems.

Figure 2. Approaches to migrate compute resources

1. If you are already using the AWS Backup service and AWS Organizations to centrally manage backup policies, you can enable the AWS Backup cross-account management feature. This lets you manage and monitor your backup, restore, and copy jobs across AWS accounts. Ensure that both accounts are in the same organization in AWS Organizations. Once the backups are available in the target account, you can restore EC2 instances. Follow the detailed instructions at Creating backup copies across AWS accounts.

AWS Backup is a fully managed data protection service that centralizes and automates data protection across AWS services, in the cloud, and on-premises. You can configure backup policies and monitor activity for your AWS resources. You can automate and consolidate backup tasks that were previously performed service-by-service. This removes the need to create custom scripts and use manual processes.

2. Create an Amazon Machine Image of your EC2 instances and share it with the target account. You can launch new EC2 instances using the shared AMI. Follow step-by-step instructions: How do I transfer an Amazon EC2 instance or AMI to a different AWS account?

Amazon Machine Image (AMI) provides the information required to launch an instance. Specify an AMI and then launch multiple instances from a single AMI with the same configuration. You can use different AMIs to launch instances when you need instances with different configurations.
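Sharing an AMI with the target account can be scripted. The following minimal Boto3 sketch assumes an EC2 client for the source account created with boto3.client("ec2"); the function name and the IDs in the usage comment are hypothetical. If the AMI is backed by encrypted snapshots, the target account also needs access to the KMS key, which this sketch does not cover.

```python
def share_ami(ec2, image_id, target_account_id):
    """Grant the target account permission to launch instances from the AMI."""
    ec2.modify_image_attribute(
        ImageId=image_id,
        LaunchPermission={"Add": [{"UserId": target_account_id}]},
    )


# Example usage (hypothetical IDs; requires boto3 and valid credentials):
#   import boto3
#   share_ami(boto3.client("ec2"), "ami-0123456789abcdef0", "111122223333")
```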

For migrating non-persistent compute resources, refer to the Migrating infrastructure section.

Migrating storage resources

AWS offers various storage services including object, file, and block storage. To migrate objects from an S3 bucket, you can take the following approaches, shown in Figure 3a.

Figure 3a. Approaches to migrate S3 buckets

1. Use Amazon S3 command line interface (CLI) commands to copy the initial load of objects from the source account to the target account. Read How can I copy S3 objects from another AWS account? After the initial copy, you can enable the Amazon S3 replication feature to continuously replicate object changes across accounts. Add a bucket policy to the destination bucket that grants the source account permission to replicate objects into it. Read this walkthrough on how to configure replications.

2. If the S3 bucket contains a large number of objects, use Amazon S3 Batch Operations to copy objects across AWS accounts in bulk. Read Cross-account bulk transfer of files using Amazon S3 Batch Operations.
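For the replication step in the first approach, the destination bucket policy could look like the one built by the sketch below. It assumes that replication in the source account runs under an IAM role whose ARN you pass in; the function name, bucket name, and role ARN are hypothetical, and your setup may need additional actions (for example, to change object ownership).

```python
import json


def replication_destination_policy(destination_bucket, source_role_arn):
    """Build a destination bucket policy that lets the source account's
    replication role write replicas into the bucket."""
    bucket_arn = f"arn:aws:s3:::{destination_bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReplicaWrites",
                "Effect": "Allow",
                "Principal": {"AWS": source_role_arn},
                "Action": ["s3:ReplicateObject", "s3:ReplicateDelete"],
                "Resource": f"{bucket_arn}/*",
            },
            {
                "Sid": "AllowVersioningChecks",
                "Effect": "Allow",
                "Principal": {"AWS": source_role_arn},
                "Action": ["s3:GetBucketVersioning", "s3:PutBucketVersioning"],
                "Resource": bucket_arn,
            },
        ],
    }


policy = replication_destination_policy(
    "example-destination-bucket",
    "arn:aws:iam::111122223333:role/example-replication-role",
)
print(json.dumps(policy, indent=2))
```

The resulting JSON can be attached in the target account through the console or with the put_bucket_policy API.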

To migrate files from an Amazon EFS file system, you can take the following approach, shown in Figure 3b.

Figure 3b. Approach to migrate EFS file systems

Use an AWS DataSync agent to transfer data from one EFS file system to another. AWS DataSync is an online transfer service that simplifies moving, copying, and synchronizing large amounts of data between on-premises storage systems and AWS storage services. Read Transferring file data across AWS Regions and accounts using AWS DataSync for step-by-step guidance.

Migrating database resources

AWS offers various purpose-built database engines. These include relational, key-value, document, in-memory, graph, time series, wide column, and ledger databases. To migrate relational databases, you can take one of the following approaches, shown in Figure 4.

Figure 4. Approaches to migrate relational database resources

1. If you want to continuously replicate data changes, use AWS Database Migration Service (AWS DMS) to replicate your data across AWS accounts with high availability. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. You can set up a DMS task for either one-time migration or on-going replication. An on-going replication task keeps your source and target databases in sync. Once set up, the on-going replication task will continuously apply source changes to the target with minimal latency. Learn how to Set Up AWS DMS for Cross-Account Migration.

AWS DMS is a cloud service that streamlines the migration of relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups.

2. Use RDS Snapshots to create and share database backups across AWS accounts. Use the shared snapshots to launch new Amazon Relational Database Service (RDS) instances in the target account. Read step-by-step instructions: How can I share an encrypted Amazon RDS DB snapshot with another account?

3. Use AWS Backup to create backup policies that automatically back up your AWS resources. Use the AWS Backup cross-account management feature to manage and monitor your backup, restore, and copy jobs across AWS accounts. Once the backups are available in the target account, you can restore RDS instances. Learn about Creating backup copies across AWS accounts.
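For the snapshot-based approach, sharing a manual DB snapshot with the target account could be scripted as in the following Boto3 sketch. It assumes an RDS client for the source account created with boto3.client("rds"); the function name and identifiers are hypothetical. For an encrypted snapshot, the target account also needs access to the customer managed KMS key, as described in the linked instructions.

```python
def share_db_snapshot(rds, snapshot_id, target_account_id):
    """Allow the target account to restore from a manual DB snapshot."""
    rds.modify_db_snapshot_attribute(
        DBSnapshotIdentifier=snapshot_id,
        AttributeName="restore",
        ValuesToAdd=[target_account_id],
    )


# Example usage (hypothetical IDs; requires boto3 and valid credentials):
#   import boto3
#   share_db_snapshot(boto3.client("rds"), "example-db-snapshot", "111122223333")
```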

In this section, we discussed relational database migration. You can also use AWS DMS for migrating other databases. Read supported AWS DMS source and target databases.


In this blog post, we discussed various approaches you can take to migrate resources from one account to another depending upon the resource type and configuration. Additionally, you can also explore CloudEndure Migration for continuous data replication. Learn more about Migrating workloads across AWS Regions with CloudEndure Migration.

Field Notes: How to Prepare Large Text Files for Processing with Amazon Translate and Amazon Comprehend

Post Syndicated from Veeresh Shringari original https://aws.amazon.com/blogs/architecture/field-notes-how-to-prepare-large-text-files-for-processing-with-amazon-translate-and-amazon-comprehend/

Biopharmaceutical manufacturing is a highly regulated industry where deviation documents are used to optimize manufacturing processes. Deviation documents in biopharmaceutical manufacturing processes are geographically diverse, spanning multiple countries and languages. The document corpus is complex, with additional requirements for complete encryption. Therefore, to reduce downtime and increase process efficiency, it is critical to automate the ingestion and understanding of deviation documents. For this workflow, a large biopharma customer needed to translate and classify documents at their manufacturing site.

The customer’s challenge included translating and classifying paragraph-sized text documents into statement types. First, the tokenizer previously used was failing for certain languages. Second, after tokenization, large paragraphs needed to be sliced into pieces smaller than 5,000 bytes so that Amazon Translate and Amazon Comprehend could consume them. Because sentences and paragraphs were of differing sizes, the customer needed to slice them so that no sentence or paragraph lost its context and meaning.

This blog post describes a solution to tokenize text documents into appropriate-sized chunks for easy consumption by Amazon Translate and Amazon Comprehend.

Overview of solution

The solution is divided into the following steps. Text data coming from the AWS Glue output is transformed and stored in Amazon Simple Storage Service (Amazon S3) in a .txt file. This transformed data is passed into the sentence tokenizer with slicing and encryption using AWS Key Management Service (AWS KMS). This data is now ready to be fed into Amazon Translate and Amazon Comprehend, and then to a Bidirectional Encoder Representations from Transformers (BERT) model for clustering. All of the models are developed and managed in Amazon SageMaker.


For this walkthrough, you should have the following prerequisites:

The architecture in Figure 1 shows a complete document classification and clustering workflow running the sentence tokenizer solution (step 4) as an input to Amazon Translate and Amazon Comprehend. The complete architecture also uses AWS Glue crawlers, Amazon Athena, Amazon S3, AWS KMS, and SageMaker.

Figure 1. Higher level architecture describing use of the tokenizer in the system

Solution steps

  1. Ingest the streaming data from the daily pharma supply chain incidents from the AWS Glue crawlers and Athena-based view tables. AWS Glue is used for ETL (extract, transform, and load), while Athena helps to analyze the data in Amazon S3 for its integrity.
  2. Ingest the streaming data into Amazon S3, which is AWS KMS encrypted. This limits any unauthorized access to the secured files, as required for the healthcare domain.
  3. Enable the CloudWatch logs. CloudWatch logs help to store, monitor, and access error messages logged by SageMaker.
  4. Open the SageMaker notebook using AWS console, and navigate to the integrated development environment (IDE) with Python notebook.

Solution description

Initialize the Amazon S3 client, and retrieve the SageMaker execution role with get_execution_role.

Figure 2. Code sample to initialize Amazon S3 Client execution roles

Figure 3 shows the code for tokenizing large paragraphs into sentences. This helps feed chunks of at most 5,000 bytes to Amazon Translate and Amazon Comprehend. Additionally, in the regulated environment, data at rest and in transit is encrypted using AWS KMS (using an S3 IO object) before it is chunked into 5,000-byte files using a last-in-first-out (LIFO) process.

Figure 3. Code sample with file chunking function and AWS KMS encryption

Figure 4 shows the function for writing the file chunks to objects in Amazon S3, and objects are AWS KMS encrypted.

Figure 4. Code sample for writing chunked 5,000-byte sized data to Amazon S3
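To give a feel for the chunking and writing logic that the figures describe, here is a minimal, independent Python sketch. It is not the customer's code: it assumes a naive regular-expression sentence splitter instead of a language-aware tokenizer, it packs sentences greedily in document order rather than using the LIFO process mentioned above, and the S3 client (boto3.client("s3")), bucket, key prefix, and KMS key ID are hypothetical.

```python
import re

MAX_BYTES = 5000  # chunk size limit used throughout this post


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_oversized(sentence, max_bytes):
    """Cut a sentence into pieces of at most max_bytes UTF-8 bytes without
    splitting a character. A sentence within the limit comes back whole."""
    pieces, current, size = [], [], 0
    for char in sentence:
        width = len(char.encode("utf-8"))
        if current and size + width > max_bytes:
            pieces.append("".join(current))
            current, size = [], 0
        current.append(char)
        size += width
    if current:
        pieces.append("".join(current))
    return pieces


def chunk_text(text, max_bytes=MAX_BYTES):
    """Greedily pack whole sentences into chunks of at most max_bytes."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        for piece in split_oversized(sentence, max_bytes):
            candidate = piece if not current else current + " " + piece
            if len(candidate.encode("utf-8")) <= max_bytes:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def write_chunks(s3, bucket, prefix, chunks, kms_key_id):
    """Write each chunk as a KMS-encrypted object (key layout is illustrative)."""
    for index, chunk in enumerate(chunks):
        s3.put_object(
            Bucket=bucket,
            Key=f"{prefix}/chunk-{index:05d}.txt",
            Body=chunk.encode("utf-8"),
            ServerSideEncryption="aws:kms",
            SSEKMSKeyId=kms_key_id,
        )
```

Measuring the size in encoded bytes rather than characters matters for multi-language input, because many non-Latin characters take two to four bytes in UTF-8.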

Code sample

The following example code details the tokenizer and chunking tool which we subsequently run through SageMaker:

Cleaning up

To avoid incurring future charges, delete the resources (like S3 objects) used for the practice files after you have completed implementation of the solution.


In this blog post, we presented a solution which incorporates sentence-level tokenization with rules governing expected sentence size. The solution includes automation scripts to reduce larger files into smaller chunks of 5,000 bytes for processing by Amazon Translate and Amazon Comprehend. The solution is effective for tokenizing and chunking multi-language files in complex environments. Furthermore, the solution secures file exchange by using AWS KMS, as required by regulated industries.


Get Started with Amazon S3 Event Driven Design Patterns

Post Syndicated from Micah Walter original https://aws.amazon.com/blogs/architecture/get-started-with-amazon-s3-event-driven-design-patterns/

Event driven programs use events to initiate succeeding steps in a process. For example, the completion of an upload job may then initiate an image processing job. This allows developers to create complex architectures by using the principle of decoupling. Decoupling is preferable for many workflows, as it allows each component to perform its tasks independently, which improves efficiency. Examples are ecommerce order processing, image processing, and other long running batch jobs.

Amazon Simple Storage Service (S3) is an object-based storage solution from Amazon Web Services (AWS) that allows you to store and retrieve any amount of data, at any scale. Amazon S3 Event Notifications provides users a mechanism for initiating events when certain actions take place inside an S3 bucket.

In this blog post, we will illustrate how you can use Amazon S3 Event Notifications in combination with a powerful suite of Amazon messaging services. This will allow you to implement an event driven architecture for a variety of common use cases.

Setting up Amazon S3 Event Notifications

We first must understand the types of events that can be initiated with Amazon S3 Event Notifications. Events can be initiated by uploading, modifying, or deleting an object, or by other actions. When an event is initiated, a payload is created containing the event metadata. This includes information about the object that initiated the event itself.

To enable notifications, you must first add a notification configuration that identifies the events you want Amazon S3 to publish. Specify the destinations where you want Amazon S3 to send the notifications. This configuration is stored in the notification subresource, which you can find under the Properties tab within your S3 bucket, see Figure 1.

Figure 1. Properties tab showing S3 Event Notifications subresource

An event notification can be initiated anytime an object is uploaded, modified, or deleted, depending on your configuration details. You can create multiple notification configurations for different scenarios, shown in Figure 2. For example, one configuration can handle new or modified objects, and another configuration can handle deletions. You can specify that events will only be initiated when objects contain a specific prefix, or following the restoration of an object. For a complete listing of all the configuration options and event types, read documentation on supported event types.

Figure 2. S3 Event Notifications subresource details and options

When all of the conditions in your configuration have been met, a new event will be initiated and sent to the destination you specify. An S3 event destination can be an AWS Lambda function, an Amazon Simple Queue Service (SQS) queue, or an Amazon Simple Notification Service (SNS) topic, see Figure 3.
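The same settings can be expressed in code. The following sketch builds a notification configuration with two rules: object creations under a prefix go to an SQS queue, and deletions go to a Lambda function. The ARNs, prefix, rule IDs, and function name are hypothetical. The dictionary follows the shape that Boto3's put_bucket_notification_configuration expects.

```python
def build_notification_configuration(queue_arn, function_arn, prefix):
    """Return an S3 notification configuration with two example rules."""
    return {
        "QueueConfigurations": [
            {
                "Id": "created-objects-to-queue",
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*"],
                # Only objects whose key starts with the prefix match.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ],
        "LambdaFunctionConfigurations": [
            {
                "Id": "removed-objects-to-function",
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectRemoved:*"],
            }
        ],
    }


config = build_notification_configuration(
    "arn:aws:sqs:us-east-1:111122223333:example-queue",
    "arn:aws:lambda:us-east-1:111122223333:function:example-function",
    "uploads/",
)

# Applying it (requires boto3, valid credentials, and destination policies
# that allow Amazon S3 to send to the queue and invoke the function):
#   import boto3
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="example-bucket", NotificationConfiguration=config
#   )
```

Note that this call replaces the bucket's entire notification configuration, so include every rule you want to keep.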

Figure 3. S3 Event Notifications subresource destination settings

Event driven design patterns

There are many common design patterns for building event driven programs with Amazon S3 Event Notifications. Once you have set up your notification configuration, the next step is to consume the event. The following describes a few typical architectures you might consider, depending on the needs of your application.

Synchronous and reliable point-to-point processing

Figure 4. Point-to-point processing with S3 and Lambda as a destination

One common use case for event driven processing is when synchronous and reliable information is required. For example, a mobile application processes images uploaded by users and automatically tags the images with the detected objects using Artificial Intelligence/Machine Learning (AI/ML). From an architectural perspective (Figure 4), an image is uploaded to an S3 bucket, which generates an event notification. This initiates a Lambda function that sends the details of the uploaded image to Amazon Rekognition for tagging. Results from Amazon Rekognition could be further processed by the Lambda function and stored in a database like Amazon DynamoDB.

With this type of architecture, there is no contingency for dealing with multiple images arriving simultaneously in the S3 bucket. If this application sends too many requests to Lambda, events can start to pile up. This can cause a failure to process some of the images. To make our program more fault tolerant, adding an Amazon SQS queue would help, as shown in Figure 5.

Asynchronous and queued point-to-point processing 

Figure 5. Queued point-to-point processing with S3, SQS, and Lambda

Architectures that require the processing of information in an asynchronous fashion can use this pattern. Building off the first example, a mobile application might provide a solution to allow end users to bulk upload thousands of images simultaneously. It can then use AWS Lambda to send the images to Amazon Rekognition for tagging.

By providing a queue-based asynchronous solution, the Lambda function can retrieve work from the SQS queue at its own pace. This allows it to control the processing flow by processing files sequentially without risk of being overloaded. This is especially useful if the application must handle incomplete or partial uploads when a connection is temporarily lost.

Currently, Amazon S3 Event Notifications only work with standard SQS queues, and first-in-first-out (FIFO) SQS queues are not supported. Read more about how to configure S3 event notification with an SQS queue as a destination. Your Lambda function in this architecture must be adjusted to handle the message payload arriving from SQS. This is because it will have a slightly different form than the original event notification body generated from S3.
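To make the payload difference concrete, here is a minimal Python sketch of how a Lambda function might unwrap S3 records when they arrive through an SQS queue. It assumes the standard Lambda SQS event shape, in which each message carries the original S3 notification as a JSON string in its body field; the function name is our own.

```python
import json


def s3_records_from_sqs(sqs_event):
    """Collect the S3 event records wrapped inside an SQS-triggered Lambda event."""
    records = []
    for message in sqs_event.get("Records", []):
        # The S3 notification is delivered as a JSON string in the message body.
        body = json.loads(message["body"])
        # S3 sends an "s3:TestEvent" message when the notification is first
        # configured; it has no "Records" key, so it is skipped here.
        records.extend(body.get("Records", []))
    return records
```

Each returned record has the same structure as a record delivered directly from S3, so the rest of the processing code can stay unchanged.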

Parallel processing with “Fan Out” architecture

Figure 6. Fan out design pattern with S3, SNS, and SQS before sending to a Lambda function

To create a “fan out” style architecture where a single event is propagated to many destinations in parallel, SNS is combined with SQS. Configure your S3 event notification to use an SNS topic as its destination, as shown in Figure 6. You can then direct multiple subsequent processes to act on the same event. This is especially useful if you aim to do parallel processing on the same object in S3.

For example, to process a source image into multiple target resolutions, you could create a Lambda function for each resolution and use the “fan-out” pattern so that all resolutions are processed at the same time. You could then subscribe an SQS queue to your SNS topic in front of each function. A notification sent to SNS then stays in its queue until your Lambda function has processed it successfully.

Figure 7. Fan out design pattern including secondary pipeline for deleting images

To extend the use case of image processing even further, you could create multiple SNS topics to handle different types of events from the same S3 bucket. As depicted in Figure 7, this architecture would allow your program to handle creations and updates differently than deletions. You could also process images differently based on their S3 prefix.
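One way to picture the split in Figure 7 is as a bucket notification configuration with one SNS topic per event type. The sketch below only builds the configuration as a Python dictionary; the topic ARNs and the prefix are hypothetical values supplied by the caller, and the helper names are our own.

```python
def topic_config(topic_arn, events, prefix=None):
    """Build one topic entry of an S3 bucket notification configuration."""
    config = {"TopicArn": topic_arn, "Events": list(events)}
    if prefix:
        # Optional filter so only keys under this prefix trigger the topic.
        config["Filter"] = {
            "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
        }
    return config


def build_notification_configuration(created_topic_arn, removed_topic_arn, prefix=None):
    """Send creations/updates and deletions from one bucket to different topics."""
    return {
        "TopicConfigurations": [
            topic_config(created_topic_arn, ["s3:ObjectCreated:*"], prefix),
            topic_config(removed_topic_arn, ["s3:ObjectRemoved:*"], prefix),
        ]
    }
```

With boto3, a dictionary of this shape could be passed as the NotificationConfiguration argument of the S3 client's put_bucket_notification_configuration call, provided each topic's access policy allows Amazon S3 to publish to it.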

Adjust your Lambda code to handle messages making their way through SNS and SQS. Their payloads will be slightly different than the original S3 Event Notification payload.
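As a sketch of that adjustment, the following Python function unwraps S3 records that have travelled through SNS and then SQS. It assumes the default SNS-to-SQS delivery format, where the SQS body is a JSON envelope whose Message field holds the S3 notification as a string, and it also tolerates raw message delivery, where the body is the S3 notification itself. The function name is our own.

```python
import json


def s3_records_from_sns_sqs(sqs_event):
    """Collect S3 event records from SQS messages that were published via SNS."""
    records = []
    for message in sqs_event.get("Records", []):
        body = json.loads(message["body"])
        # Default delivery: SNS wraps the payload in a "Notification" envelope
        # and the original S3 event is a JSON string under "Message".
        if body.get("Type") == "Notification" and "Message" in body:
            body = json.loads(body["Message"])
        # With raw message delivery enabled, body is already the S3 event.
        records.extend(body.get("Records", []))
    return records
```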

Real-time notifications

Figure 8. Event driven design pattern for real-time notifications

In addition to application-to-application messaging, Amazon SNS provides application-to-person (A2P) communication (see Figure 8). Amazon SNS can send SMS text messages to mobile subscribers in over 100 countries. It can also send push notifications to Android and Apple devices and emails over SMTP. Using A2P, uploading an image to an Amazon S3 bucket can generate a notification to a group of users via their choice of Amazon SNS A2P platform.


In this blog post, we’ve shown you the basic design patterns for developing an event driven architecture using Amazon S3 Event Notifications. You can create many more complicated architecture patterns to suit your needs. By using Amazon SQS, Amazon SNS, and AWS Lambda, you can design an event driven program that is fault tolerant, scalable, and smartly decoupled. But don’t stop there! Consider expanding your program further by utilizing AWS Lambda destinations. Or combine parallel image processing with highly scalable A2P notifications, which will alert your users when a task is complete.

Field Notes: Set Up a Highly Available Database on AWS with IBM Db2 Pacemaker

Post Syndicated from Sai Parthasaradhi original https://aws.amazon.com/blogs/architecture/field-notes-set-up-a-highly-available-database-on-aws-with-ibm-db2-pacemaker/

Many AWS customers need to run mission-critical workloads, such as traffic control systems and online booking systems, using the IBM Db2 LUW database server. Typically, these workloads require the right high availability (HA) solution to make sure that the database is available in the event of a host or Availability Zone failure.

Traditionally, an HA solution for the Db2 LUW database with automatic failover is managed using IBM Tivoli System Automation for Multiplatforms (Tivoli SA MP) technology with the IBM Db2 high availability instance configuration utility (db2haicu). However, this solution is not supported on AWS Cloud deployments because the automatic failover may not work as expected.

In this blog post, we will go through the steps to set up an HA two-host Db2 cluster with automatic failover managed by IBM Db2 Pacemaker, with a quorum device set up on a third EC2 instance. We will also set up an overlay IP as a virtual IP that initially points to the primary instance. Clients connect through this IP, and in case of failover, the overlay IP automatically points to the new primary instance.

IBM Db2 Pacemaker is HA cluster manager software integrated with Db2 Advanced Edition and Standard Edition on Linux (RHEL 8.1 and SLES 15). Pacemaker can provide HA and disaster recovery capabilities on AWS, and it is an alternative to Tivoli SA MP technology.

Note: The IBM Db2 v11.5.5 database server implemented in this blog post is a fully featured 90-day trial version. After the trial period ends, you can select the required Db2 edition when purchasing and installing the associated license files. Advanced Edition and Standard Edition are supported by this implementation.

Overview of solution

For this solution, we will go through the steps to install and configure IBM Db2 Pacemaker along with overlay IP as virtual IP for the clients to connect to the database. This blog post also includes prerequisites, and installation and configuration instructions to achieve an HA Db2 database on Amazon Elastic Compute Cloud (Amazon EC2).

Figure 1. Cluster management using IBM Db2 Pacemaker

Prerequisites for installing Db2 Pacemaker

To set up IBM Db2 Pacemaker on a two-node HADR (high availability disaster recovery) cluster, the following prerequisites must be met.

  • Set up instance user ID and group ID.

The instance user ID and group ID must be set up as part of the Db2 server installation. You can verify them as follows:

grep db2iadm1 /etc/group
grep db2inst1 /etc/group

  • Set up host names for all the hosts in /etc/hosts file on all the hosts in the cluster.

For both of the hosts in the HADR cluster, ensure that the host names are set up as follows.

Format: ipaddress fully_qualified_domain_name alias

  • Install kornshell (ksh) on both of the hosts.

sudo yum install ksh -y

  • Ensure that all instances have TCP/IP connectivity between their ethernet network interfaces.
  • Enable passwordless secure shell (ssh) for the root and instance user IDs across both instances. After passwordless root ssh is enabled, verify it using the “ssh <host name> -l root ls” command (the host name is either an alias or a fully qualified domain name).

ssh <host name> -l root ls

  • Activate HADR for the Db2 database cluster.
  • Make the IBM Db2 Pacemaker binaries available in the /tmp folder on both hosts for installation. The binaries can be downloaded from the IBM download location (login required).

Installation steps

After completing all prerequisites, run the following commands as the root user to install IBM Db2 Pacemaker on both the primary and standby hosts.

cd /tmp
tar -zxf Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64.tar.gz
cd Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/RPMS/

dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm -y
dnf install */*.rpm -y

cp /tmp/Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/Db2/db2cm /home/db2inst1/sqllib/adm

chmod 755 /home/db2inst1/sqllib/adm/db2cm

Run the following command, replacing the -host parameter value with the alias name you set up in the prerequisites.

/home/db2inst1/sqllib/adm/db2cm -copy_resources /tmp/Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/Db2agents -host <host>

After the installation is complete, verify that all required resources are created as shown in Figure 2.

ls -alL /usr/lib/ocf/resource.d/heartbeat/db2*

Figure 2. List of heartbeat resources

Configuring Pacemaker

After the IBM Db2 Pacemaker is installed on both primary and standby hosts, initiate the following configuration commands from only one of the hosts (either primary or standby hosts) as root user.

  1. Create the Pacemaker cluster using the db2cm utility. Before running the following command, replace the -domain and -host values appropriately.

/home/db2inst1/sqllib/adm/db2cm -create -cluster -domain <anydomainname> -publicEthernet eth0 -host <primary host alias> -publicEthernet eth0 -host <standby host alias>

Note: Run ifconfig to get the -publicEthernet value and replace it in the preceding command.

  2. Create the instance resource model using the following commands. Modify the -instance and -host parameter values before running them.

/home/db2inst1/sqllib/adm/db2cm -create -instance db2inst1 -host <primary host alias>
/home/db2inst1/sqllib/adm/db2cm -create -instance db2inst1 -host <standby host alias>

  3. Create the database resource using the db2cm utility. Modify the -db parameter value accordingly.

/home/db2inst1/sqllib/adm/db2cm -create -db TESTDB -instance db2inst1

After configuring Pacemaker, run the crm status command from both the primary and standby hosts to check that Pacemaker is running with automatic failover activated.

Figure 3. Pacemaker cluster status

Quorum device setup

Next, we set up a third lightweight EC2 instance to act as a quorum device (QDevice), which serves as a tie breaker and avoids a potential split-brain scenario. We need to install only the corosync-qnetd* package from the Db2 Pacemaker cluster software.

Prerequisites (quorum device setup)

  1. Update /etc/hosts file on Db2 primary and standby instances to include the host details of QDevice EC2 instance.
  2. Set up passwordless root ssh access between the Db2 instances and the QDevice instance.
  3. Ensure TCP/IP connectivity between the Db2 instances and the QDevice instance on port 5403.

Steps to set up quorum device

Run the following commands on the quorum device EC2 instance.

cd /tmp
tar -zxf Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64.tar.gz
cd Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/RPMS/
dnf install */corosync-qnetd* -y

  1. Run the following command from one of the Db2 instances to join the quorum device to the cluster by replacing the QDevice value appropriately.

/home/db2inst1/sqllib/adm/db2cm -create -qdevice <hostnameofqdevice>

  2. Verify the setup using the following commands.

From either Db2 server:

/home/db2inst1/sqllib/adm/db2cm -list

From QDevice instance:

corosync-qnetd-tool -l

Figure 4. Quorum device status

Setting up overlay IP as virtual IP

For HADR-activated databases, a virtual IP provides a common connection point for clients, so in case of a failover there is no need to update the connection strings with the actual IP address of the hosts. Furthermore, clients can continue to establish connections to the new primary instance.

We can use the overlay IP address routing on AWS to send the network traffic to HADR database servers within Amazon Virtual Private Cloud (Amazon VPC) using a route table so that the clients can connect to the database using the overlay IP from the same VPC (any Availability Zone) where the database exists. aws-vpc-move-ip is a resource agent from AWS which is available along with the Pacemaker software that helps to update the route table of the VPC.

If you need to connect to the database using overlay IP from on-premises or outside of the VPC (different VPC than database servers), then additional setup is needed using either AWS Transit Gateway or Network Load Balancer.

Prerequisites (setting up overlay IP as virtual IP)

  • Choose the overlay IP address that needs to be configured. This IP must not be in use anywhere in the VPC or on premises, and it should be part of the private IP address ranges defined in RFC 1918. Choose it from an RFC 1918 range other than the one the VPC is configured in. The commands in this section use the eth0 network interface.

  • To route traffic through the overlay IP, we need to disable source/destination checks on the primary and standby EC2 instances.

aws ec2 modify-instance-attribute --profile <AWS CLI profile> --instance-id <EC2-instance-id> --no-source-dest-check
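Before settling on an overlay IP, it can help to check a candidate against the first prerequisite. The following Python sketch is one way to do that with the standard ipaddress module: it accepts a candidate only if it falls within an RFC 1918 private range and outside every CIDR block of the VPC. The function name and the addresses in the usage note are hypothetical examples, not values from this setup.

```python
import ipaddress

# Private address ranges defined in RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_valid_overlay_ip(candidate, vpc_cidrs):
    """Return True if candidate is private (RFC 1918) and outside all VPC CIDRs."""
    ip = ipaddress.ip_address(candidate)
    in_private_range = any(ip in net for net in RFC1918_NETWORKS)
    inside_vpc = any(ip in ipaddress.ip_network(cidr) for cidr in vpc_cidrs)
    return in_private_range and not inside_vpc
```

For instance, with a hypothetical VPC CIDR of 10.0.0.0/16, a candidate of 192.168.10.10 passes the check, while 10.0.1.5 is rejected because it lies inside the VPC range. The check does not cover on-premises usage, which still needs to be confirmed separately.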

Steps to configure overlay IP

The following commands can be run as root user on the primary instance.

  1. Create the following AWS Identity and Access Management (IAM) policy and attach it to the instance profile. Update the region, account_id, and routetableid values.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt0",
      "Effect": "Allow",
      "Action": "ec2:ReplaceRoute",
      "Resource": "arn:aws:ec2:<region>:<account_id>:route-table/<routetableid>"
    },
    {
      "Sid": "Stmt1",
      "Effect": "Allow",
      "Action": "ec2:DescribeRouteTables",
      "Resource": "*"
    }
  ]
}

  2. Add the overlay IP on the primary instance.

ip address add <overlayip>/32 dev eth0

  3. Update the route table (used in Step 1) with the overlay IP, specifying the node that hosts the Db2 primary instance. The following command returns True.

aws ec2 create-route --route-table-id <routetableid> --destination-cidr-block <overlayip>/32 --instance-id <primarydb2instanceid>

  4. Create a file named overlayip.txt containing the following commands, which create the resource manager for the overlay IP.


primitive db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP ocf:heartbeat:aws-vpc-move-ip \
  params ip=<overlayip> routing_table=<routetableid> interface=<ethernet> profile=<AWS CLI profile name> \
  op start interval=0 timeout=180s \
  op stop interval=0 timeout=180s \
  op monitor interval=30s timeout=60s

colocation db2_db2inst1_db2inst1_TESTDB_AWS_primary-colocation inf:

order order-rule-db2_db2inst1_db2inst1_TESTDB-then-primary-oip Mandatory: db2_db2inst1_db2inst1_TESTDB-clone db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP

location prefer-node1_db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP 100: <primaryhostname>
location prefer-node2_db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP 100: <standbyhostname>

The following parameters must be replaced in the resource manager create command in the file.

    • Name of the database resource agent (This can be found through crm config show | grep primitive | grep DBNAME command. For this example, we will use: db2_db2inst1_db2inst1_TESTDB)
    • Overlay IP address (created earlier)
    • Routing table ID (used earlier)
    • AWS command-line interface (CLI) profile name
    • Primary and standby host names
  5. After the file with commands is ready, run the following command to create the overlay IP resource manager.

crm configure load update overlayip.txt

  6. The VIP resource manager is created in an unmanaged state. Run the following command to manage and start the resource.

crm resource manage db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP

  7. Validate the setup with the crm status command.

Figure 5. Pacemaker cluster status along with overlay IP resource

Test failover with client connectivity

For the purpose of this testing, launch another EC2 instance with Db2 client installed, and catalog the Db2 database server using overlay IP.

Figure 6. Database directory list

Establish a connection with the Db2 primary instance using the alias that was cataloged earlier with the overlay IP address.

Figure 7. Connect to database

If we connect to the primary instance and check the applications connected, we can see the active connection from the client’s IP as shown in Figure 8.

Figure 8. Check client connections before failover

Next, let’s stop the primary Db2 instance and check if the Pacemaker cluster promoted the standby to primary and we can still connect to the database using the overlay IP, which now points to the new primary instance.

If we check the CRM status from the new primary instance, we can see that the Pacemaker cluster has promoted the standby database to new primary database as shown in Figure 9.

Figure 9. Automatic failover to standby

Let’s go back to our client and reestablish the connection using the cataloged DB alias created using overlay IP.

Figure 10. Database reconnection after failover

If we connect to the new promoted primary instance and check the applications connected, we can see the active connection from the client’s IP as shown in Figure 11.

Figure 11. Check client connections after failover

Cleaning up

To avoid incurring future charges, terminate all EC2 instances which were created as part of the setup referencing this blog post.


In this blog post, we set up automatic failover using IBM Db2 Pacemaker with an overlay (virtual) IP that routes traffic to the secondary database instance during failover, which lets clients reconnect to the database without any manual intervention. In addition, we can enable automatic client reroute using the overlay IP address to achieve seamless failover connectivity to the database for mission-critical workloads.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Build Your Own Game Day to Support Operational Resilience

Post Syndicated from Lewis Taylor original https://aws.amazon.com/blogs/architecture/build-your-own-game-day-to-support-operational-resilience/

Operational resilience is your firm’s ability to provide continuous service through people, processes, and technology that are aware of and adaptive to constant change. Downtime of your mission-critical applications can not only damage your reputation, but can also make you liable to multi-million-dollar financial fines.

One way to test operational resilience is to simulate life-like system failures. An effective way to do this is by running events in your organization known as game days. Game days test systems, processes, and team responses and help evaluate your readiness to react to and recover from operational issues. The AWS Well-Architected Framework recommends game days as a key strategy to develop and operate highly resilient systems because they focus not only on technology resilience issues but also identify people and process gaps.

This blog post will explain how you can apply game day concepts to your workloads to help achieve a highly resilient workload.

Why does operational resilience matter from a regulatory perspective?

In March 2021, the Bank of England, Prudential Regulation Authority, and Financial Conduct Authority published their Building operational resilience: Feedback to CP19/32 and final rules policy. In this policy, operational resilience refers to a firm’s ability to prevent disruptions, adapt and respond to them, and return to a steady system state when they occur. Further, firms are expected to learn and implement process improvements from prior disruptions.

This policy will not apply to everyone. However, if you don’t establish operational resilience strategies, you are likely operating at increased risk regardless of your industry. If you have a service disruption, you may incur lost revenue and reputational damage.

What does it mean to be operationally resilient?

The final policy provides guidance on how firms should achieve operational resilience, which includes but is not limited to the following:

  • Identify and prioritize services based on the potential of intolerable harm to end consumers or risk to market integrity.
  • Define appropriate maximum impact tolerance of an important business service. This is reviewed annually using metrics to measure impact tolerance and answers questions like, “How long (in hours) can a service be offline before causing intolerable harm to end consumers?”
  • Document a complete view of all the aspects required to deliver each important service. This includes people, processes, technology, facilities, and information (resources). Firms should also test their ability to remain within the impact tolerances and provide assurance of resilience along with areas that need to be addressed.
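To illustrate the impact tolerance idea in the list above, here is a small Python sketch that compares the outage observed for each important business service against its maximum tolerable time offline. The class, function, service names, and hour values are hypothetical and only meant to show the comparison, not to represent a regulatory calculation.

```python
from dataclasses import dataclass


@dataclass
class ImportantService:
    name: str
    impact_tolerance_hours: float  # maximum tolerable time offline


def tolerance_breaches(services, observed_outage_hours):
    """Return the names of services whose observed outage exceeded their tolerance."""
    breaches = []
    for service in services:
        # Services with no recorded outage are treated as zero hours offline.
        outage = observed_outage_hours.get(service.name, 0.0)
        if outage > service.impact_tolerance_hours:
            breaches.append(service.name)
    return breaches
```

For example, a service with a 2-hour tolerance that was offline for 3.5 hours would be reported as a breach, which marks it as an area that needs to be addressed.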

What is a game day?

The AWS Well-Architected Framework defines a game day as follows:

“A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened. These should be conducted regularly so that your team builds “muscle memory” on how to respond. Your game days should cover the areas of operations, security, reliability, performance, and cost.

In AWS, your game days can be carried out with replicas of your production environment using AWS CloudFormation. This enables you to test in a safe environment that resembles your production environment closely.”

Running game days that simulate system failure helps your organization evaluate and build operational resilience.

How can game days help build operational resilience?

Running a game day alone is not sufficient to ensure operational resilience. However, by navigating the following process to set up and perform a game day, you will establish a best practice-based approach for operating resilient systems.

Stage 1 – Identify key services

As part of setting up a game day event, you will catalog and identify business-critical services.

Game days are performed to test services where operational failure could result in significant financial, customer, and/or reputational impact to the firm. Game days can also evaluate other key factors, like the impact of a failure on the wider market where your firm operates.

For example, a firm may identify its digital banking mobile application from which their customers can initiate payments as one of its important business services.

Stage 2 – Map people, process, and technology supporting the business service

Game days are holistic events. To get a full picture of how the different aspects of your workload operate together, you’ll generate a detailed map of people and processes as they interact and operate the technical and non-technical components of the system. This mapping also helps your end consumers understand how you will provide them reliable support during a failure.

Stage 3 – Define and perform failure scenarios

Systems fail, and failures often happen when a system is operating at scale because various services working together can introduce complexity. To ensure operational resilience, you must understand how systems react and adapt to failures. To do this, you’ll identify and perform failure scenarios so you can understand how your systems will react and adapt and build “muscle memory” for actual events.

AWS builds to guard against outages and incidents, and accounts for them in the design of AWS services—so when disruptions do occur, their impact on customers and the continuity of services is as minimal as possible. At AWS, we employ compartmentalization throughout our infrastructure and services. We have multiple constructs that provide different levels of independent, redundant components.

Stage 4 – Observe and document people, process, and technology reactions

In running a failure scenario, you’ll observe how technological and non-technological components react to and recover from failure. This helps you identify failures and fix them as they cascade through impacted components across your workload. This also helps identify technical and operational challenges that might not otherwise be obvious.

Stage 5 – Conduct lessons learned exercises

Game days generate information on people, processes, and technology and also capture data on customer impact, incident response and remediation timelines, contributing factors, and corrective actions. By incorporating these data points into the system design process, you can implement continuous resilience for critical systems.

How to run your own game day in AWS

You may have heard of AWS GameDay events. This is an AWS-organized event for our customers. In this team-based event, AWS provides temporary AWS accounts running fictional systems. Failures are injected into these systems, and teams work together on completing challenges and improving the system architecture.

However, the methods, tooling, and principles we use to conduct AWS GameDays are agnostic and can be applied to your systems using the following services:

  • AWS Fault Injection Simulator is a fully managed service that runs fault injection experiments on AWS, which makes it easier to improve an application’s performance, observability, and resiliency.
  • Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health.
  • AWS X-Ray helps you analyze and debug production and distributed applications (such as those built using a microservices architecture). X-Ray helps you understand how your application and its underlying services are performing to identify and troubleshoot the root cause of performance issues and errors.

Please note you are not limited to the tools listed for simulating failure scenarios. For complete coverage of failure scenarios, we encourage you to explore additional tools and strategies.

Figure 1 shows a reference architecture example that demonstrates conducting a game day for an Open Banking implementation.

Figure 1. Game day reference architecture example

Game day operators use Fault Injection Simulator to catalog and perform failure scenarios to be included in your game day. For example, in our Open Banking use case in Figure 1, a failure scenario might be for the business API functions servicing Open Banking requests to abruptly stop working. You can also combine such simple failure scenarios into a more complex one with failures injected across multiple components of the architecture.
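As a sketch of what cataloging one such failure scenario might look like, the following Python function builds a Fault Injection Simulator experiment template as a plain dictionary. It assumes, purely for illustration, that the business API runs on tagged EC2 instances and that stopping them simulates the API abruptly going away; the role ARN, tag, and the target and action names are hypothetical, and the architecture in Figure 1 may use different compute services.

```python
def build_experiment_template(role_arn, tag_key, tag_value):
    """Describe a simple failure scenario: stop all instances carrying a tag.

    Assumption: the API under test runs on EC2 instances tagged tag_key=tag_value.
    """
    return {
        "description": "Game day: stop the instances behind the business API",
        "roleArn": role_arn,  # IAM role that the experiment assumes (hypothetical)
        # No automatic stop condition in this sketch; a CloudWatch alarm
        # is a common choice for real experiments.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "api-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stop-api": {
                "actionId": "aws:ec2:stop-instances",
                # Bind the action's "Instances" target to the target defined above.
                "targets": {"Instances": "api-instances"},
            }
        },
    }
```

A dictionary of this shape could be passed, together with a client token, to the create_experiment_template call of the boto3 fis client. More complex scenarios can be described by adding further targets and actions to the same template.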

Game day participants use CloudWatch, X-Ray, and their own custom observability and monitoring tooling to identify failures as they cascade through systems.

As you go through the process of identifying, communicating, and fixing issues, you’ll also document impact of failures on end-users. From there, you’ll generate lessons learned to holistically improve your workload’s resilience.


In this blog post, we discussed the significance of ensuring operational resilience. We demonstrated how to set up game days and how they can supplement your efforts to ensure operational resilience. We also discussed how AWS services such as Fault Injection Simulator, X-Ray, and CloudWatch can be used to facilitate and implement game day failure scenarios.

Ready to get started? For more information, check out our AWS Fault Injection Simulator User Guide.
