Tag Archives: Amazon EC2

Deploying CIS Level 1 hardened AMIs with Amazon EC2 Image Builder

Post Syndicated from Joseph Keating original https://aws.amazon.com/blogs/devops/deploying-cis-level-1-hardened-amis-with-amazon-ec2-image-builder/

The NFL, an AWS Professional Services partner, is collaborating with NFL’s Player Health and Safety team to build the Digital Athlete Program. The Digital Athlete Program is working to drive progress in the prevention, diagnosis, and treatment of injuries; enhance medical protocols; and further improve the way football is taught and played. The NFL, in conjunction with AWS Professional Services, delivered an Amazon EC2 Image Builder pipeline for automating the production of Amazon Machine Images (AMIs). Following similar practices from the Digital Athlete Program, this post demonstrates how to deploy an automated Image Builder pipeline.

“AWS Professional Services faced unique environment constraints, but was able to deliver a modular pipeline solution leveraging EC2 Image Builder. The framework serves as a foundation to create hardened images for future use cases. The team also provided documentation and knowledge transfer sessions to ensure our team was set up to successfully manage the solution.”

—Joseph Steinke, Director, Data Solutions Architect, National Football League

A common scenario AWS customers face is how to build processes that configure secure AWS resources that can be leveraged throughout the organization. You need to move fast in the cloud without compromising security best practices. Amazon Elastic Compute Cloud (Amazon EC2) allows you to deploy virtual machines in the AWS Cloud. EC2 AMIs provide the configuration utilized to launch an EC2 instance. You can use AMIs for several use cases, such as configuring applications, applying security policies, and configuring development environments. Developers and system administrators can deploy configuration AMIs to bring up EC2 resources that require little-to-no setup. Often times, multiple patterns are adopted for building and deploying AMIs. Because of this, you need the ability to create a centralized, automated pattern that can output secure, customizable AMIs.

In this post, we demonstrate how to create an automated process that builds and deploys Center for Internet Security (CIS) Level 1 hardened AMIs. The pattern that we deploy includes Image Builder, a CIS Level 1 hardened AMI, an application running on EC2 instances, and Amazon Inspector for security analysis. You deploy the AMI configured with the Image Builder pipeline to an application stack. The application stack consists of EC2 instances running Nginx. Lastly, we show you how to re-hydrate your application stack with a new AMI utilizing AWS CloudFormation and Amazon EC2 launch templates. You use Amazon Inspector to scan the EC2 instances launched from the Image Builder-generated AMI against the CIS Level 1 Benchmark.

After going through this exercise, you should understand how to build, manage, and deploy AMIs to an application stack. The infrastructure deployed with this pipeline includes a basic web application, but you can use this pattern to fit many needs. After running through this post, you should feel comfortable using this pattern to configure an AMI pipeline for your organization.

The project we create in this post addresses the following use case: you need a process for building and deploying CIS Level 1 hardened AMIs to an application stack running on Amazon EC2. In addition to demonstrating how to deploy the AMI pipeline, we also illustrate how to refresh a running application stack with a new AMI. You learn how to deploy this configuration with the AWS Command Line Interface (AWS CLI) and AWS CloudFormation.

AWS services used
Image Builder allows you to develop an automated workflow for creating AMIs to fit your organization’s needs. You can streamline the creation and distribution of secure images, automate your patching process, and define security and application configuration into custom AWS AMIs. In this post, you use the following AWS services to implement this solution:

  • AWS CloudFormation – AWS CloudFormation allows you to use domain-specific languages or simple text files to model and provision, in an automated and secure manner, all the resources needed for your applications across all Regions and accounts. You can deploy AWS resources in a safe, repeatable manner, and automate the provisioning of infrastructure.
  • AWS KMSAmazon Key Management Service (AWS KMS) is a fully managed service for creating and managing cryptographic keys. These keys are natively integrated with most AWS services. You use a KMS key in this post to encrypt resources.
  • Amazon S3Amazon Simple Storage Service (Amazon S3) is an object storage service utilized for storing and encrypting data. We use Amazon S3 to store our configuration files.
  • AWS Auto ScalingAWS Auto Scaling allows you to build scaling plans that automate how groups of different resources respond to changes in demand. You can optimize availability, costs, or a balance of both. We use Auto Scaling to manage Nginx on Amazon EC2.
  • Launch templatesLaunch templates contain configurations such as AMI ID, instance type, and security group. Launch templates enable you to store launch parameters so that they don’t have to be specified every time instances are launched.
  • Amazon Inspector – This automated security assessment service improves the security and compliance of applications deployed on AWS. Amazon Inspector automatically assesses applications for exposures, vulnerabilities, and deviations from best practices.

Architecture overview
We use Ansible as a configuration management component alongside Image Builder. The CIS Ansible Playbook applies a Level 1 set of rules to the local host of which the AMI is provisioned on. For more information about the Ansible Playbook, see the GitHub repo. Image Builder offers AMIs with Security Technical Implementation Guides (STIG) levels low-high as part of its pipeline build.

The following diagram depicts the phases of the Image Builder pipeline for building a Nginx web server. The numbers 1–6 represent the order of when each phase runs in the build process:

  1. Source
  2. Build components
  3. Validate
  4. Test
  5. Distribute
  6. AMI

Figure: Shows the EC2 Image Builder steps

The workflow includes the following steps:

  1. Deploy the CloudFormation templates.
  2. The template creates an Image Builder pipeline.
  3. AWS Systems Manager completes the AMI build process.
  4. Amazon EC2 starts an instance to build the AMI.
  5. Systems Manager starts a test instance build after the first build is successful.
  6. The AMI starts provisioning.
  7. The Amazon Inspector CIS benchmark starts.

CloudFormation templates
You deploy the following CloudFormation templates. These CloudFormation templates have a great deal of configurations. They deploy the following resources:

  • vpc.yml – Contains all the core networking configuration. It deploys the VPC, two private subnets, two public subnets, and the route tables. The private subnets utilize a NAT gateway to communicate to the internet. The public subnets have full outbound access to the IGW.
  • kms.yml – Contains the AWS KMS configuration that we use for encrypting resources. The KMS key policy is also configured in this template.
  • s3-iam-config.yml – Contains the launch configuration and autoscaling groups for the initial Nginx launch. For updates and patching to Nginx, we use Image Builder to build those changes.
  • infrastructure-ssm-params.yml – Contains the Systems Manager parameter store configuration. The parameters are populated by using outputs from other CloudFormation templates.
  • nginx-config.yml – Contains the configuration for Nginx. Additionally, this template contains the network load balancer, target groups, security groups, and EC2 instance AWS Identity and Access Management (IAM) roles.
  • nginx-image-builder.yml – Contains the configuration for the Image Builder pipeline that we use to build AMIs.

To follow the steps to provision the pipeline deployment, you must have the following prerequisites:

Deploying the CloudFormation templates
To deploy your templates, complete the following steps:

1. Clone the source code repository found in the following location:

git clone https://github.com/aws-samples/deploy-cis-level-1-hardened-ami-with-ec2-image-builder-pipeline.git

You now use the AWS CLI to deploy the CloudFormation templates. Make sure to leave the CloudFormation template names as we have written in this post.

2. Deploy the VPC CloudFormation template:

aws cloudformation create-stack \
--stack-name vpc-config \
--template-body file://Templates/vpc.yml \
--parameters file://Parameters/vpc-params.json  \
--capabilities CAPABILITY_IAM \
--region us-east-1

The output should look like the following code:


    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/vpc-config/7faaab30-247f-11eb-8712-0e65b6fb18f9"


3. Open the Parameters/kms-params.json file and update the UserARN parameter with your account ID:

      "ParameterKey": "KeyName",
      "ParameterValue": "DemoKey"
    "ParameterKey": "UserARN",
    "ParameterValue": "arn:aws:iam::<input_your_account_id>:root"


4. Deploy the KMS key CloudFormation template:

aws cloudformation create-stack \
--stack-name kms-config \
--template-body file://Templates/kms.yml \
--parameters file://Parameters/kms-params.json \
--capabilities CAPABILITY_IAM \
--region us-east-1

The output should look like the following:

"StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/kms-config/f65aca80-08ff-11eb-8795-12275bc6e1ef"


5. Open the Parameters/s3-iam-config.json file and update the DemoConfigS3BucketName parameter to a unique name of your choosing:

    "ParameterKey" : "Environment",
    "ParameterValue" : "dev"
    "ParameterKey": "NetworkStackName",
    "ParameterValue" : "vpc-config"
    "ParameterKey" : "KMSStackName",
    "ParameterValue" : "kms-config"
    "ParameterKey": "DemoConfigS3BucketName",
    "ParameterValue" : "<input_your_unique_bucket_name>"
    "ParameterKey" : "EC2InstanceRoleName",
    "ParameterValue" : "EC2InstanceRole"


6. Deploy the IAM role configuration template:

aws cloudformation create-stack \
--stack-name s3-iam-config \
--template-body file://Templates/s3-iam-config.yml \
--parameters file://Parameters/s3-iam-config.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

The output should look like the following:

"StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/s3-iam-config/9be9f990-0909-11eb-811c-0a78092beb51"


Configuring IAM roles and policies

This solution uses a couple of service-linked roles. Let’s generate these roles using the AWS CLI.


1. Run the following commands:

aws iam create-service-linked-role --aws-service-name autoscaling.amazonaws.com
aws iam create-service-linked-role --aws-service-name imagebuilder.amazonaws.com

If you see a message similar to following code, it means that you already have the service-linked role created in your account and you can move on to the next step:

An error occurred (InvalidInput) when calling the CreateServiceLinkedRole operation: Service role name AWSServiceRoleForImageBuilder has been taken in this account, please try a different suffix.

Now that you have generated the IAM roles used in this post, you add them to the KMS key policy. This allows the roles to encrypt and decrypt the KMS key.


2. Open the Parameters/kms-params.json file:

      "ParameterKey": "KeyName",
      "ParameterValue": "DemoKey"
    "ParameterKey": "UserARN",
    "ParameterValue": "arn:aws:iam::12345678910:root"


3. Add the following values as a comma-separated list to the UserARN parameter key:



When finished, the file should look similar to the following:

      "ParameterKey": "KeyName",
      "ParameterValue": "DemoKey"
    "ParameterKey": "UserARN",
    "ParameterValue": "arn:aws:iam::123456789012:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling,arn:aws:iam::<input_your_aws_account_id>:role/NginxS3PutLambdaRole,arn:aws:iam::123456789012:role/aws-service-role/imagebuilder.amazonaws.com/AWSServiceRoleForImageBuilder,arn:aws:iam::12345678910:role/EC2InstanceRole,arn:aws:iam::12345678910:role/EC2ImageBuilderRole,arn:aws:iam::12345678910:root"

Updating the CloudFormation stack

Now that the AWS KMS parameter file has been updated, you update the AWS KMS CloudFormation stack.

1. Run the following command to update the kms-config stack:

aws cloudformation update-stack \
--stack-name kms-config \
--template-body file://Templates/kms.yml \
--parameters file://Parameters/kms-params.json \
--capabilities CAPABILITY_IAM \
--region us-east-1


The output should look like the following:

"StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/kms-config/6e84b750-0905-11eb-b543-0e4dccb471bf"


2. Open the AnsibleConfig/component-nginx.yml file and update the <input_s3_bucket_name> value with the bucket name you generated from the s3-iam-config stack:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
name: 'Ansible Playbook Execution on Amazon Linux 2'
description: 'This is a sample component that demonstrates how to download and execute an Ansible playbook against Amazon Linux 2.'
schemaVersion: 1.0
  - s3bucket:
      type: string
      value: <input_s3_bucket_name>
  - name: build
      - name: InstallAnsible
        action: ExecuteBash
           - sudo amazon-linux-extras install -y ansible2
      - name: CreateDirectory
        action: ExecuteBash
            - sudo mkdir -p /ansibleloc/roles
      - name: DownloadLinuxCis
        action: S3Download
          - source: 's3://{{ s3bucket }}/components/linux-cis.zip'
            destination: '/ansibleloc/linux-cis.zip'
      - name: UzipLinuxCis
        action: ExecuteBash
            - unzip /ansibleloc/linux-cis.zip -d /ansibleloc/roles
            - echo "unzip linux-cis file"
      - name: DownloadCisPlaybook
        action: S3Download
          - source: 's3://{{ s3bucket }}/components/cis_playbook.yml'
            destination: '/ansibleloc/cis_playbook.yml'
      - name: InvokeCisAnsible
        action: ExecuteBinary
          path: ansible-playbook
            - '{{ build.DownloadCisPlaybook.inputs[0].destination }}'
            - '--tags=level1'
      - name: DeleteCisPlaybook
        action: ExecuteBash
            - rm '{{ build.DownloadCisPlaybook.inputs[0].destination }}'
      - name: DownloadNginx
        action: S3Download
          - source: s3://{{ s3bucket }}/components/nginx.zip'
            destination: '/ansibleloc/nginx.zip'
      - name: UzipNginx
        action: ExecuteBash
            - unzip /ansibleloc/nginx.zip -d /ansibleloc/roles
            - echo "unzip Nginx file"
      - name: DownloadNginxPlaybook
        action: S3Download
          - source: 's3://{{ s3bucket }}/components/nginx_playbook.yml'
            destination: '/ansibleloc/nginx_playbook.yml'
      - name: InvokeNginxAnsible
        action: ExecuteBinary
          path: ansible-playbook
            - '{{ build.DownloadNginxPlaybook.inputs[0].destination }}'
      - name: DeleteNginxPlaybook
        action: ExecuteBash
            - rm '{{ build.DownloadNginxPlaybook.inputs[0].destination }}'

  - name: validate
      - name: ValidateDebug
        action: ExecuteBash
            - sudo echo "ValidateDebug section"

  - name: test
      - name: TestDebug
        action: ExecuteBash
            - sudo echo "TestDebug section"
      - name: Download_Inspector_Test
        action: S3Download
          - source: 's3://ec2imagebuilder-managed-resources-us-east-1-prod/components/inspector-test-linux/1.0.1/InspectorTest'
            destination: '/workdir/InspectorTest'
      - name: Set_Executable_Permissions
        action: ExecuteBash
            - sudo chmod +x /workdir/InspectorTest
      - name: ExecuteInspectorAssessment
        action: ExecuteBinary
          path: '/workdir/InspectorTest'


Adding files to your S3 buckets

Now you assume a role you generated from one of the previous CloudFormation stacks. This allows you to add files to the encrypted S3 bucket.

1. Run the following command and make sure to update the role to use your AWS account ID number:

aws sts assume-role --role-arn "arn:aws:iam::<input_your_aws_account_id>:role/EC2ImageBuilderRole" --role-session-name AWSCLI-Session

You see an output similar to the following:

    "Credentials": {
        "AccessKeyId": "<AWS_ACCESS_KEY_ID>",
        "SecretAccessKey": "<AWS_SECRET_ACCESS_KEY_ID>",
        "SessionToken": "<AWS_SESSION_TOKEN>",
        "Expiration": "2020-11-20T02:54:17Z"
    "AssumedRoleUser": {
        "AssumedRoleId": "ACPATGCCLSNJCNSJCEWZ:AWSCLI-Session",
        "Arn": "arn:aws:sts::123456789012:assumed-role/EC2ImageBuilderRole/AWSCLI-Session"

You now assume the EC2ImageBuilderRole IAM role from the command line. This role allows you to create objects in the S3 bucket generated from the s3-iam-config stack. Because this bucket is encrypted with AWS KMS, any user or IAM role requires specific permissions to decrypt the key. You have already accounted for this in a previous step by adding the EC2ImageBuilderRole IAM role to the KMS key policy.


2. Create the following environment variable to use the EC2ImageBuilderRole role. Update the values with the output from the previous step:

export AWS_ACCESS_KEY_ID=AccessKeyId
export AWS_SECRET_ACCESS_KEY=SecretAccessKey
export AWS_SESSION_TOKEN=SessionToken


3. Check to make sure that you have actually assumed the role EC2ImageBuilderRole:

aws sts get-caller-identity

You should see an output similar to the following:

    "Account": "123456789012",
    "Arn": "arn:aws:sts::123456789012:assumed-role/EC2ImageBuilderRole/AWSCLI-Session"


4. Create a folder inside of the encrypted S3 bucket generated in the s3-iam-config stack:

aws s3api put-object --bucket <input_your_bucket_name> --key components


5. Zip the configuration files that you use in the Image Builder pipeline process:

zip -r linux-cis.zip LinuxCis/
zip -r nginx.zip Nginx/


6. Upload the configuration files to S3 bucket. Update the bucket name with the S3 bucket name you generated in the s3-iam-config stack.

aws s3 cp linux-cis.zip s3://<input_your_bucket_name>/components/

aws s3 cp nginx.zip s3://<input_your_bucket_name>/components/

aws s3 cp AnsibleConfig/cis_playbook.yml s3://<input_your_bucket_name>/components/

aws s3 cp AnsibleConfig/nginx_playbook.yml s3://<input_your_bucket_name>/components/

aws s3 cp AnsibleConfig/component-nginx.yml s3://<input_your_bucket_name>/components/

Deploying your pipeline

You’re now ready to deploy your pipeline.

1. Switch back to the original IAM user profile you used before assuming the EC2ImageBuilderRole. For instructions, see How do I assume an IAM role using the AWS CLI?


2. Open the Parameters/nginx-image-builder-params.json file and update the ImageBuilderBucketName parameter with the S3 bucket name generated in the s3-iam-config stack:

    "ParameterKey": "Environment",
    "ParameterValue": "dev"
    "ParameterKey": "ImageBuilderBucketName",
    "ParameterValue": "<input_your_bucket_name>"
    "ParameterKey": "NetworkStackName",
    "ParameterValue": "vpc-config"
    "ParameterKey": "KMSStackName",
    "ParameterValue": "kms-config"
    "ParameterKey": "S3ConfigStackName",
    "ParameterValue": "s3-iam-config"


3. Deploy the nginx-image-builder.yml template:

aws cloudformation create-stack \
--stack-name cis-image-builder \
--template-body file://Templates/nginx-image-builder.yml \
--parameters file://Parameters/nginx-image-builder-params.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

The template takes around 35 minutes to complete. Deploying this template starts the Image Builder pipeline.


Monitoring the pipeline

You can get more details about the pipeline on the AWS Management Console.

1. On the Image Builder console, choose Image pipelines to see the status of the pipeline.

Figure: Shows the EC2 Image Builder Pipeline status


2. Choose the pipeline (for this post, cis-image-builder-LinuxCis-Pipeline)

On the pipeline details page, you can view more information and make updates to its configuration.

Figure: Shows the Image Builder Pipeline metadata

At this point, the Image Builder pipeline has started running the automation document in Systems Manager. Here you can monitor the progress of the AMI build.


3. On the Systems Manager console, choose Automation.


4. Choose the execution ID of the arn:aws:ssm:us-east-1:123456789012:document/ImageBuilderBuildImageDocument document.

Figure: Shows the Image Builder Pipeline Systems Manager Automation steps


5. Choose the step ID to see what is happening in each step.

At this point, the Image Builder pipeline is bringing up an Amazon Linux 2 EC2 instance. From there, we run Ansible playbooks that configure the security and application settings. The automation is pulling its configuration from the S3 bucket you deployed in a previous step. When the Ansible run is complete, the instance stops and an AMI is generated from this instance. When this is complete, a cleanup is initiated that ends the EC2 instance. The final result is a CIS Level 1 hardened Amazon Linux 2 AMI running Nginx.


Updating parameters

When the stack is complete, you retrieve some new parameter values.

1. On the Systems Manager console, choose Automation.


2. Choose the execution ID of the arn:aws:ssm:us-east-1:123456789012:document/ImageBuilderBuildImageDocument document.


3. Choose step 21.

The following screenshot shows the output of this step.

Figure: Shows step of EC2 Image Builder Pipeline


4. Open the Parameters/nginx-config.json file and update the AmiId parameter with the AMI ID generated from the previous step:

    "ParameterKey" : "Environment",
    "ParameterValue" : "dev"
    "ParameterKey": "NetworkStackName",
    "ParameterValue" : "vpc-config"
    "ParameterKey" : "S3ConfigStackName",
    "ParameterValue" : "s3-iam-config"
    "ParameterKey": "AmiId",
    "ParameterValue" : "<input_the_cis_hardened_ami_id>"
    "ParameterKey": "ApplicationName",
    "ParameterValue" : "Nginx"
    "ParameterKey": "NLBName",
    "ParameterValue" : "DemoALB"
    "ParameterKey": "TargetGroupName",
    "ParameterValue" : "DemoTG"


5. Deploy the nginx-config.yml template:

aws cloudformation create-stack \
--stack-name nginx-config \
--template-body file://Templates/nginx-config.yml \
--parameters file://Parameters/nginx-config.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

The output should look like the following:

    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/nginx-config/fb2b0f30-24f6-11eb-ad7c-0a3238f55eb3"


6. Deploy the infrastructure-ssm-params.yml template:

aws cloudformation create-stack \
--stack-name ssm-params-config \
--template-body file://Templates/infrastructure-ssm-params.yml \
--parameters file://Parameters/infrastructure-ssm-parameters.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1


Verifying Nginx is running

Let’s verify that our Nginx service is up and running properly. You use Session Manager to connect to a testing instance.

1. On the Amazon EC2 console, choose Instances.

You should see three instances, as in the following screenshot.

Figure: Shows the Nginx EC2 instances

You can connect to either one of the Nginx instances.


2. Select the testing instance.


3. On the Actions menu, choose Connect.


4. Choose Session Manager.


5. Choose Connect.

A terminal on the EC2 instance opens, similar to the following screenshot.

Figure: Shows the Session Manager terminal

6. Run the following command to ensure that Nginx is running properly:

curl localhost:8080

You should see an output similar to the following screenshot.

Figure: Shows Nginx output from terminal

Reviewing resources and configurations

Now that you have deployed the core services that for the solution, take some time to review the services that you have just deployed.


IAM roles

This project creates several IAM roles that are used to manage AWS resources. For example, EC2ImageBuilderRole is used to configure new AMIs with the Image Builder pipeline. This role contains only the permissions required to manage the Image Builder process. Adopting this pattern enforces the practice of least privilege. Additionally, many of the IAM polices attached to the IAM roles are restricted down to specific AWS resources. Let’s look at a couple of examples of managing IAM permissions with this project.


The following policy restricts Amazon S3 access to a specific S3 bucket. This makes sure that the role this policy is attached to can only access this specific S3 bucket. If this role needs to access any additional S3 buckets, the resource has to be explicitly added.

  - PolicyName: GrantS3Read
        - Sid: GrantS3Read
          Effect: Allow
            - s3:List*
            - s3:Get*
            - s3:Put*
          Resource: !Sub 'arn:aws:s3:::${S3Bucket}*'

Let’s look at the EC2ImageBuilderRole. A common scenario that occurs is when you need to assume a role locally in order to perform an action on a resource. In this case, because you’re using AWS KMS to encrypt the S3 bucket, you need to assume a role that has access to decrypt the KMS key so that artifacts can be uploaded to the S3 bucket. In the following AssumeRolePolicyDocument, we allow Amazon EC2 and Systems Manager services to be assumed by this role. Additionally, we allow IAM users to assume this role as well.

  Version: 2012-10-17
    - Effect: Allow
          - ec2.amazonaws.com
          - ssm.amazonaws.com
          - imagebuilder.amazonaws.com
        AWS: !Sub 'arn:aws:iam::${AWS::AccountId}:root'
        - sts:AssumeRole

The principle !Sub 'arn:aws:iam::${AWS::AccountId}:root allows for any IAM user in this account to assume this role locally. Normally, this role should be scoped down to specific IAM users or roles. For the purpose of this post, we grant permissions to all users of the account.


Nginx configuration

The AMI built from the Image Builder pipeline contains all of the application and security configurations required to run Nginx as a web application. When an instance is launched from this AMI, no additional configuration is required.

We use Amazon EC2 launch templates to configure the application stack. The launch templates contain information such as the AMI ID, instance type, and security group. When a new AMI is provisioned, you simply update the launch template CloudFormation parameter with the new AMI and update the CloudFormation stack. From here, you can start an Auto Scaling group instance refresh to update the application stack to use the new AMI. The Auto Scaling group is updated with instances running on the updated AMI by bringing down one instance at a time and replacing it.


Amazon Inspector configuration

Amazon Inspector is an automated security assessment service that helps improve the security and compliance of applications deployed on AWS. With Amazon Inspector, assessments are generated for exposure, vulnerabilities, and deviations from best practices.

After performing an assessment, Amazon Inspector produces a detailed list of security findings prioritized by level of severity. These findings can be reviewed directly or as part of detailed assessment reports that are available via the Amazon Inspector console or API. We can use Amazon Inspector to assess our security posture against the CIS Level 1 standard that we use our Image Builder pipeline to provision. Let’s look at how we configure Amazon Inspector.

A resource group defines a set of tags that, when queried, identify the AWS resources that make up the assessment target. Any EC2 instance that is launched with the tag specified in the resource group is in scope for Amazon Inspector assessment runs. The following code shows our configuration:

  Type: "AWS::Inspector::ResourceGroup"
      - Key: "ResourceGroup"
        Value: "Nginx"

  Type: AWS::Inspector::AssessmentTarget
    AssessmentTargetName : "NginxAssessmentTarget"
    ResourceGroupArn : !Ref ResourceGroup

In the following code, we specify the tag set in the resource group, which makes sure that when an instance is launched from this AMI, it’s under the scope of Amazon Inspector:

  Type: AWS::ImageBuilder::Image
    ImageRecipeArn: !Ref Recipe
    InfrastructureConfigurationArn: !Ref Infrastructure
    DistributionConfigurationArn: !Ref Distribution
      ImageTestsEnabled: false
      TimeoutMinutes: 60
      ResourceGroup: 'Nginx'


Building and deploying a new image with Amazon Inspector tests enabled

For this final portion of this post, we build and deploy a new AMI with an Amazon Inspector evaluation.

1. In your text editor, open Templates/nginx-image-builder.yml and update the pipeline and IBImage resource property ImageTestsEnabled to true.

The updated configuration should look like the following:

  Type: AWS::ImageBuilder::Image
    ImageRecipeArn: !Ref Recipe
    InfrastructureConfigurationArn: !Ref Infrastructure
    DistributionConfigurationArn: !Ref Distribution
      ImageTestsEnabled: true
      TimeoutMinutes: 60
      ResourceGroup: 'Nginx'


2. Update the stack with the new configuration:

aws cloudformation update-stack \
--stack-name cis-image-builder \
--template-body file://Templates/nginx-image-builder.yml \
--parameters file://Parameters/nginx-image-builder-params.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

This starts a new AMI build with an Amazon Inspector evaluation. The process can take up to 2 hours to complete.

3. On the Amazon Inspector console, choose Assessment Runs.

Figure: Shows Amazon Inspector Assessment Run

4. Under Reports, choose Download report.

5. For Select report type, select Findings report.

6. For Select report format, select PDF.

7. Choose Generate report.

The following screenshot shows the findings report from the Amazon Inspector run.

This report generates an assessment against the CIS Level 1 standard. Any policies that don’t comply with the CIS Level 1 standard are explicitly called out in this report.

Section 3.1 lists any failed policies.


Figure: Shows Inspector findings

These failures are detailed later in the report, along with suggestions for remediation.

In section 4.1, locate the entry 1.3.2 Ensure filesystem integrity is regularly checked. This section shows the details of a failure from the Amazon Inspector findings report. You can also see suggestions on how to remediate the issue. Under Recommendation, the findings report suggests a specific command that you can use to remediate the issue.


Figure: Shows Inspector findings issue

You can use the Image Builder pipeline to simply update the Ansible playbooks with this setting, then run the Image Builder pipeline to build a new AMI, deploy the new AMI to an EC2 Instance, and run the Amazon Inspector report to ensure that the issue has been resolved. Finally, we can see the specific instances that have been assessed that have this issue.

Organizations often customize security settings based off of a given use case. Your organization may choose CIS Level 1 as a standard but elect to not apply all the recommendations. For example, you might choose to not use the FirewallD service on your Linux instances, because you feel that using Amazon EC2 security groups gives you enough networking security in place that you don’t need an additional firewall. Disabling FirewallD causes a high severity failure in the Amazon Inspector report. This is expected and can be ignored when evaluating the report.


In this post, we showed you how to use Image Builder to automate the creation of AMIs. Additionally, we also showed you how to use the AWS CLI to deploy CloudFormation stacks. Finally, we walked through how to evaluate resources against CIS Level 1 Standard using Amazon Inspector.


About the Authors


Joe Keating is a Modernization Architect in Professional Services at Amazon Web Services. He works with AWS customers to design and implement a variety of solutions in the AWS Cloud. Joe enjoys cooking with a glass or two of wine and achieving mediocrity on the golf course.




Virginia Chu is a Sr. Cloud Infrastructure Architect in Professional Services at Amazon Web Services. She works with enterprise-scale customers around the globe to design and implement a variety of solutions in the AWS Cloud.


Creating a cross-region Active Directory domain with AWS Launch Wizard for Microsoft Active Directory

Post Syndicated from AWS Admin original https://aws.amazon.com/blogs/compute/creating-a-cross-region-active-directory-domain-with-aws-launch-wizard-for-microsoft-active-directory/

AWS Launch Wizard is a console-based service to quickly and easily size, configure, and deploy third party applications, such as Microsoft SQL Server Always On and HANA based SAP systems, on AWS without the need to identify and provision individual AWS resources. AWS Launch Wizard offers an easy way to deploy enterprise applications and optimize costs. Instead of selecting and configuring separate infrastructure services, you go through a few steps in the AWS Launch Wizard and it deploys a ready-to-use application on your behalf. It reduces the time you need to spend on investigating how to provision, cost and configure your application on AWS.

You can now use AWS Launch Wizard to deploy and configure self-managed Microsoft Windows Server Active Directory Domain Services running on Amazon Elastic Compute Cloud (EC2) instances. With Launch Wizard, you can have fully-functioning, production-ready domain controllers within a few hours—all without having to manually deploy and configure your resources.

You can use AWS Directory Service to run Microsoft Active Directory (AD) as a managed service, without the hassle of managing your own infrastructure. If you need to run your own AD infrastructure, you can use AWS Launch Wizard to simplify the deployment and configuration process.

In this post, I walk through creation of a cross-region Active Directory domain using Launch Wizard. First, I deploy a single Active Directory domain spanning two regions. Then, I configure Active Directory Sites and Services to match the network topology. Finally, I create a user account to verify replication of the Active Directory domain.

Diagram of Resources deployed in this post

Figure 1: Diagram of resources deployed in this post


  1. You must have a VPC in your home. Additionally, you must have remote regions that have CIDRs that do not overlap with each other. If you need to create VPCs and subnets that do not overlap, please refer here.
  2. Each subnet used must have outbound internet connectivity. Feel free to either use a NAT Gateway or Internet Gateway.
  3. The VPCs must be peered in order to complete the steps in this post. For information on creating a VPC Peering connection between regions, please refer here.
  4. If you choose to deploy your Domain Controllers to a private subnet, you must have an RDP jump / bastion instance setup to allow you to RDP to your instance.

Deploy Your Domain Controllers in the Home Region using Launch Wizard

In this section, I deploy the first set of domain controllers into the us-east-1 the home region using Launch Wizard. I refer to US-East-1 as the home region, and US-West-2 as the remote region.

  1. In the AWS Launch Wizard Console, select Active Directory in the navigation pane on the left.
  2. Select Create deployment.
  3. In the Review Permissions page, select Next.
  4. In the Configure application settings page set the following:
    • General:
      • Deployment name: UsEast1AD
    • Active Directory (AD) installation
      • Installation type: Active Directory on EC2
    • Domain Settings:
      • Number of domain controllers: 2
      • AMI installation type: License-included AMI
    • License-included AMI: ami-################# | Windows_Server-2019-English-Full-Base-202#-##-##
    • Connection type: Create new Active Directory
    • Domain DNS name: corp.example.com
    • Domain NetBIOS Name: CORP
    • Connectivity:
      • Key Pair Name: Choose and exiting Key pair or select and existing one.
      • Virtual Private Cloud (VPC): Select Virtual Private Cloud (VPC)
    • VPC: Select your home region VPC
    • Availability Zone (AZ) and private subnets:
      • Select 2 Availability Zones
      • Choose the proper subnet in each subnet
      • Assign a Controller IP address for each domain controller
    • Remote Desktop Gateway preferences: Disregard for now, this is set up later.
    • Check the I confirm that a public subnet has been set up. Each of the selected private subnets have outbound connectivity enabled check box.
  1. Select Next.
  2. In the Define infrastructure requirements page, set the following inputs.
    • Storage and compute: Based on infrastructure requirements
    • Number of AD users: Up to 5000 users
  3. Select Next.
  4. In the Review and deploy page, review your selections. Then, select Deploy.

Note that it may take up to 2 hours for your domain to be deployed. Once the status has changed to Completed, you can proceed to the next section. In the next section, I prepare Active Directory Sites and Services for the second set of domain controller in my other region.

Configure Active Directory Sites and Services

In this section, I configure the Active Directory Sites and Services topology to match my network topology. This step ensures proper Active Directory replication routing so that domain clients can find the closest domain controller. For more information on Active Directory Sites and Services, please refer here.

Retrieve your Administrator Credentials from Secrets Manager

  1. From the AWS Secrets Manager Console in us-east-1, select the Secret that begins with LaunchWizard-UsEast1AD.
  2. In the middle of the Secret page, select Retrieve secret value.
    1. This will display the username and password key with their values.
    2. You need these credentials when you RDP into one of the domain controllers in the next steps.

Rename the Default First Site

  1. Log in to the one of the domain controllers in us-east-1.
  2. Select Start, type dssite and hit Enter on your keyboard.
  3. The Active Directory Sites and Services MMC should appear.
    1. Expand Sites. There is a site named Default-First-Site-Name.
    2. Right click on Default-First-Site-Name select Rename.
    3. Enter us-east-1 as the name.
  4. Leave the Active Directory Sites and Services MMC open for the next set of steps.

Create a New Site and Subnet Definition for US-West-2

  1. Using the Active Directory Sites and Services MMC from the previous steps, right click on Sites.
  2. Select New Site… and enter the following inputs:
    • Name: us-west-2
  3.  Select OK.
  4. A pop up will appear telling you there will need to be some additional configuration. Select OK.
  5. Expand Sites and right click on Subnets and select New Subnet.
  6. Enter the following information:
    • Prefix: the CIDR of your us-west-2 VPC. An example would be 1.0.0/24
    • Site: select us-west-2
  7. Select OK.
  8. Leave the Active Directory Sites and Services MMC open for the following set of steps.

Configure Site Replication Settings

Using the Active Directory Sites and Services MMC from the previous steps, expand Sites, Inter-Site Transports, and select IP. You should see an object named DEFAULTIPSITELINK,

  1. Right click on DEFAULTIPSITELINK.
  2. Select Properties. Set or verify the following inputs on the General tab:
  3. Select Apply.
  4. In the DEFAULTIPSITELINK Properties, select the Attribute Editor tab and modify the following:
    • Scroll down and double click on Enter 1 for the Value, then select OK twice.
      • For more information on these settings, please refer here.
  5. Close the Active Directory Sites and Services MMC, as it is no longer needed.

Prepare Your Home Region Domain Controllers Security Group

In this section, I modify the Domain Controllers Security Group in us-east-1. This allows the domain controllers deployed in us-west-2 to communicate with each other.

  1. From the Amazon Elastic Compute Cloud (Amazon EC2) console, select Security Groups under the Network & Security navigation section.
  2. Select the Domain Controllers Security Group that was created with Launch Wizard Active Directory.
  3. Select Edit inbound rules. The Security Group should start with LaunchWizard-UsEast1AD-.
  4. Choose Add rule and enter the following:
    • Type: Select All traffic
    • Protocol: All
    • Port range: All
    • Source: Select Custom
    • Enter the CIDR of your remote VPC. An example would be 1.0.0/24
  5. Select Save rules.

Create a Copy of Your Administrator Secret in Your Remote Region

In this section, I create a Secret in Secrets Manager that contains the Administrator credentials when I created a home region.

  1. Find the Secret that being with LaunchWizard-UsEast1AD from the AWS Secrets Manager Console in us-east-1.
  2. In the middle of the Secret page, select Retrieve secret value.
    • This displays the username and password key with their values. Make note of these keys and values, as we need them for the next steps.
  3. From the AWS Secrets Manager Console, change the region to us-west-2.
  4. Select Store a new secret. Then, enter the following inputs:
    • Select secret type: Other type of secrets
    • Add your first keypair
    • Select Add row to add the second keypair
  5. Select Next, then enter the following inputs.
    • Secret name: UsWest2AD
    • Select Next twice
    • Select Store

Deploy Your Domain Controllers in the Remote Region using Launch Wizard

In this section, I deploy the second set of domain controllers into the us-west-1 region using Launch Wizard.

  1. In the AWS Launch Wizard Console, select Active Directory in the navigation pane on the left.
  2. Select Create deployment.
  3. In the Review Permissions page, select Next.
  4. In the Configure application settings page, set the following inputs.
    • General
      • Deployment name: UsWest2AD
    • Active Directory (AD) installation
      • Installation type: Active Directory on EC2
    • Domain Settings:
      • Number of domain controllers: 2
      • AMI installation type: License-included AMI
      • License-included AMI: ami-################# | Windows_Server-2019-English-Full-Base-202#-##-##
    • Connection type: Add domain controllers to existing Active Directory
    • Domain DNS name: corp.example.com
    • Domain NetBIOS Name: CORP
    • Domain Administrator secret name: Select you secret you created above.
    • Add permission to secret
      • After you verified the Secret you created above has the policy listed. Check the checkbox confirming the secret has the required policy.
    • Domain DNS IP address for resolution: The private IP of either domain controller in your home region
    • Connectivity:
      • Key Pair Name: Choose an existing Key pair
      • Virtual Private Cloud (VPC): Select Virtual Private Cloud (VPC)
    • VPC: Select your home region VPC
    • Availability Zone (AZ) and private subnets:
      • Select 2 Availability Zones
      • Choose the proper subnet in each subnet
      • Assign a Controller IP address for each domain controller
    • Remote Desktop Gateway preferences: disregard for now, as I set this later.
    • Check the I confirm that a public subnet has been set up. Each of the selected private subnets have outbound connectivity enabled check box
  1. In the Define infrastructure requirements page set the following:
    • Storage and compute: Based on infrastructure requirements
    • Number of AD users: Up to 5000 users
  2. In the Review and deploy page, review your selections. Then, select Deploy.

Note that it may take up to 2 hours to deploy domain controllers. Once the status has changed to Completed, proceed to the next section. In this next section, I prepare Active Directory Sites and Services for the second set of domain controller in another region.

Prepare Your Remote Region Domain Controllers Security Group

In this section, I modify the Domain Controllers Security Group in us-west-2. This allows the domain controllers deployed in us-west-2 to communicate with each other.

  1. From the Amazon Elastic Compute Cloud (Amazon EC2) console, select Security Groups under the Network & Security navigation section.
  2. Select the Domain Controllers Security Group that was created by your Launch Wizard Active Directory.
  3. Select Edit inbound rules. The Security Group should start with LaunchWizard-UsWest2AD-EC2ADStackExistingVPC-
  4. Choose Add rule and enter the following:
    • Type: Select All traffic
    • Protocol: All
    • Port range: All
    • Source: Select Custom
    • Enter the CIDR of your remote VPC. An example would be 0.0.0/24
  5. Choose Save rules.

Create an AD User and Verify Replication

In this section, I create a user in one region and verify that it replicated to the other region. I also use AD replication diagnostics tools to verify that replication is working properly.

Create a Test User Account

  1. Log in to one of the domain controllers in us-east-1.
  2. Select Start, type dsa and press Enter on your keyboard. The Active Directory Users and Computers MMC should appear.
  3. Right click on the Users container and select New > User.
  4. Enter the following inputs:
    • First name: John
    • Last name: Doe
    • User logon name: jdoe and select Next
    • Password and Confirm password: Your choice of complex password
    • Uncheck User must change password at next logon
  5. Select Next.
  6. Select Finish.

Verify Test User Account Has Replicated

  1. Log in to the one of the domain controllers in us-west-2.
  2. Select Start and type dsa.
  3. Then, press Enter on your keyboard. The Active Directory Users and Computers MMC should appear.
  4. Select Users. You should see a user object named John Doe.

Note that if the user is not present, it may not have been replicated yet. Replication should not take longer than 60 seconds from when the item was created.


Congratulations, you have created a cross-region Active Directory! In this post you:

  1. Launched a new Active Directory forest in us-east-1 using AWS Launch Wizard.
  2. Configured Active Directory Sites and Service for a multi-region configuration.
  3. Launched a set of new domain controllers in the us-west-2 region using AWS Launch Wizard.
  4. Created a test user and verified replication.

This post only touches on a couple of features that are available in the AWS Launch Wizard Active Directory deployment. AWS Launch Wizard also automates the creation of a Single Tier PKI infrastructure or trust creation. One of the prime benefits of this solution is the simplicity in deploying a fully functional Active Directory environment in just a few clicks. You no longer need to do the undifferentiated heavy lifting required to deploy Active Directory.  For more information, please refer to AWS Launch Wizard documentation.

Automate domain join for Amazon EC2 instances from multiple AWS accounts and Regions

Post Syndicated from Sanjay Patel original https://aws.amazon.com/blogs/security/automate-domain-join-for-amazon-ec2-instances-multiple-aws-accounts-regions/

As organizations scale up their Amazon Web Services (AWS) presence, they are faced with the challenge of administering user identities and controlling access across multiple accounts and Regions. As this presence grows, managing user access to cloud resources such as Amazon Elastic Compute Cloud (Amazon EC2) becomes increasingly complex. AWS Directory Service for Microsoft Active Directory (also known as an AWS Managed Microsoft AD) makes it easier and more cost-effective for you to manage this complexity. AWS Managed Microsoft AD is built on highly available, AWS managed infrastructure. Each directory is deployed across multiple Availability Zones, and monitoring automatically detects and replaces domain controllers that fail. In addition, data replication and automated daily snapshots are configured for you. You don’t have to install software, and AWS handles all patching and software updates. AWS Managed Microsoft AD enables you to leverage your existing on-premises user credentials to access cloud resources such as the AWS Management Console and EC2 instances.

This blog post describes how EC2 resources launched across multiple AWS accounts and Regions can automatically domain-join a centralized AWS Managed Microsoft AD. The solution we describe in this post is implemented for both Windows and Linux instances. Removal of Computer objects from Active Directory upon instance termination is also implemented. The solution uses Amazon DynamoDB to centrally store account and directory information in a central security account. We also provide AWS CloudFormation templates and platform-specific domain join scripts for you to use with AWS Lambda as a quick start solution.


The following diagram shows the domain-join process for EC2 instances across multiple accounts and Regions using AWS Managed Microsoft AD.

Figure 1: EC2 domain join architecture

Figure 1: EC2 domain join architecture

The event flow works as follows:

  1. An EC2 instance is launched in a peered virtual private cloud (VPC) of a workload or security account. VPCs that are hosting EC2 instances need to be peered with the VPC that contains AWS Managed Microsoft AD to enable network connectivity with Active Directory.
  2. An Amazon CloudWatch Events rule detects an EC2 instance in the “running” state.
  3. The CloudWatch event is forwarded to a regional CloudWatch event bus in the security account.
  4. If the CloudWatch event bus is in the same Region as AWS Managed Microsoft AD, it delivers the event to an Amazon Simple Queue Service (Amazon SQS) queue, referred to as the domain-join queue in this post.
  5. If the CloudWatch event bus is in a different Region from AWS Managed Microsoft AD, it delivers the event to an Amazon Simple Notification Service (Amazon SNS) topic. The event is then delivered to the domain-join queue described in step 4, through the Amazon SNS topic subscription.
  6. Messages in the domain-join queue are held for five minutes to allow for EC2 instances to stabilize after they reach the “running” state. This delay allows time for installation of additional software components and agents through the use of EC2 user data and AWS Systems Manager Distributor.
  7. After the holding period is over, messages in the domain-join queue invoke the AWS AD Join/Leave Lambda function. The Lambda function does the following:
    1. Retrieves the AWS account ID that originated the event from the message and retrieves account-specific configurations from a DynamoDB table. This configuration identifies AWS Managed Microsoft AD domain controller IPs, credentials required to perform EC2 domain join, and an AWS Identity and Access Management (IAM) role that can be assumed by the Lambda function to invoke AWS Systems Manager Run Command.
    2. If needed, uses AWS Security Token Service (AWS STS) and prepares a cross-account access session.
    3. Retrieves EC2 instance information, such as the instance state, platform, and tags, and validates the instance state.
    4. Retrieves platform-specific domain-join scripts that are deployed with the Lambda function’s code bundle, and configures invocation of those scripts by using data read from the DynamoDB table (bash script for Linux instances and PowerShell script for Windows instances).
    5. Uses AWS Systems Manager Run Command to invoke the domain-join script on the instance. Run Command enables you to remotely and securely manage the configuration of your managed instances.
    6. The domain-join script runs on the instance. It uses script parameters and instance attributes to configure the instance and perform the domain join. The adGroupName tag value is used to configure the Active Directory user group that will have permissions to log in to the instance. The instance is rebooted to complete the domain join process. Various software components are installed on the instance when the script runs. For the Linux instance, sssd, realmd, krb5, samba-common, adcli, unzip, and packageit are installed. For the Windows instance, the RDS-RD-Server feature is installed.

Removal of EC2 instances from AWS Managed Microsoft AD upon instance termination follows a similar sequence of steps. Each instance that is domain joined creates an Active Directory domain object under the “Computer” hierarchy. This domain object needs to be removed upon instance termination so that a new instance that uses the same private IP address in the subnet (at a future time) can successfully domain join and enable instance access with Active Directory credentials. Removal of the Active Directory Computer object is done by running the leaveDomaini.ps1 script (included with this blog) through Run Command on the Active Directory Tools instance identified in Figure 1.

Prerequisites and setup

To build the solution outlined in this post, you need:

  • AWS Managed Microsoft AD with an appropriate DNS name (for example, example.com). For more information about getting started with AWS Managed Microsoft AD, see Create Your AWS Managed Microsoft AD directory.
  • AD Tools. To install AD Tools and use it to create the required users:
    1. Launch a Windows EC2 instance in the same account and Region, and domain-join it with the directory you created in the previous step. Log in to the instance through Remote Desktop Protocol (RDP) and install AD Tools as described in Installing the Active Directory Administration Tools.
    2. After the AD Tools are installed, launch the AD Users & Computers application to create domain users, and assign those users to an Active Directory security group (for example, my_UserGroup) that has permission to access domain-joined instances.
    3. Create a least-privileged user for performing domain joins as described in Delegate Directory Join Privileges for AWS Managed Microsoft AD. The identity of this user is stored in the DynamoDB table and read by the AD Join Lambda function to invoke Active Directory join scripts.
    4. Store the password for the least-privileged user in an encrypted Systems Manager parameter. The password for this user is stored in the secure string System Manager parameter and read by the AD Join Lambda function at runtime while processing Amazon SQS messages.
    5. Assign a unique tag key and value to identify the AD Tools instance. This instance will be invoked by the Lambda function to delete Computer objects from Active Directory upon termination of domain-joined instances.
  • All VPCs that are hosting EC2 instances to be domain joined must be peered with the VPC that hosts the relevant AWS Managed Microsoft AD. Alternatively, AWS Transit Gateway could be used to establish this connectivity.
  • In addition to having network connectivity to the AWS Managed Microsoft AD domain controllers, domain join scripts that run on EC2 instances must be able to resolve relevant Active Directory resource records. In this solution, we leverage Amazon Route 53 Outbound Resolver to forward DNS queries to the AWS Managed Microsoft AD DNS servers, while still preserving the default DNS capabilities that are available to the VPC. Learn more about deploying Route 53 Outbound Resolver and resolver rules to resolve your directory DNS name to DNS IPs.
  • Each domain-join EC2 instance must have a Systems Manager Agent (SSM Agent) installed and an IAM role that provides equivalent permissions as provided by the AmazonEC2RoleforSSM built-in policy. The SSM Agent is used to allow domain-join scripts to run automatically. See Working with SSM Agent for more information on installing and configuring SSM Agents on EC2 instances.

Solution deployment

The steps in this section deploy AD Join solution components by using the AWS CloudFormation service.

The CloudFormation template provided with this solution (mad_auto_join_leave.json) deploys resources that are identified in the security account’s AWS Region that hosts AWS Managed Microsoft AD (the top left quadrant highlighted in Figure 1). The template deploys a DynamoDB resource with 5 read and 5 write capacity units. This should be adjusted to match your usage. DynamoDB also provides the ability to auto-scale these capacities. You will need to create and deploy additional CloudFormation stacks for cross-account, cross-Region scenarios.

To deploy the solution

  1. Create a versioned Amazon Simple Storage Service (Amazon S3) bucket to store a zip file (for example, adJoinCode.zip) that contains Python Lambda code and domain join/leave bash and PowerShell scripts. Upload the source code zip file to an S3 bucket and find the version associated with the object.
  2. Navigate to the AWS CloudFormation console. Choose the appropriate AWS Region, and then choose Create Stack. Select With new resources.
  3. Choose Upload a template file (for this solution, mad_auto_join_leave.json), select the CloudFormation stack file, and then choose Next.
  4. Enter the stack name and values for the other parameters, and then choose Next.
    Figure 2: Defining the stack name and parameters

    Figure 2: Defining the stack name and parameters

    The parameters are defined as follows:

  • S3CodeBucket: The name of the S3 bucket that holds the Lambda code zip file object.
  • adJoinLambdaCodeFileName: The name of the Lambda code zip file that includes Lambda Python code, bash, and Powershell scripts.
  • adJoinLambdaCodeVersion: The S3 Version ID of the uploaded Lambda code zip file.
  • DynamoDBTableName: The name of the DynamoDB table that will hold account configuration information.
  • CreateDynamoDBTable: The flag that indicates whether to create a new DynamoDB table or use an existing table.
  • ADToolsHostTagKey: The tag key of the Windows EC2 instance that has AD Tools installed and that will be used for removal of Active Directory Computer objects upon instance termination.
  • ADToolsHostTagValue: The tag value for the key identified by the ADToolsHostTagKey parameter.
  • Acknowledge creation of AWS resources and choose to continue to deploy AWS resources through AWS CloudFormation.The CloudFormation stack creation process is initiated, and after a few minutes, upon completion, the stack status is marked as CREATE_COMPLETE. The following resources are created when the CloudFormation stack deploys successfully:
    • An AD Join Lambda function with associated scripts and IAM role.
    • A CloudWatch Events rule to detect the “running” and “terminated” states for EC2 instances.
    • An SQS event queue to hold the EC2 instance “running” and “terminated” events.
    • CloudWatch event mapping to the SQS event queue and further to the Lambda function.
    • A DynamoDB table to hold the account configuration (if you chose this option).

The DynamoDB table hosts account-level configurations. Account-specific configuration is required for an instance from a given account to join the Active Directory domain. Each DynamoDB item contains the account-specific configuration shown in the following table. Storing account-level information in the DynamoDB table provides the ability to use multiple AWS Managed Microsoft AD directories and group various accounts accordingly. Additional account configurations can also be stored in this table for implementation of various centralized security services (instance inspection, patch management, and so on).

accountIdAWS account number
adJoinUserNameUser ID with AD Join permissions
adJoinUserPwParamEncrypted Systems Manager parameter containing the AD Join user’s password
dnsIP1Domain controller 1 IP address2
dnsIP2Domain controller 2 IP address
assumeRoleARNAmazon Resource Name (ARN) of the role assumed by the AD Join Lambda function

Following is an example of how you could insert an item (row) in a DynamoDB table for an account.

aws dynamodb put-item --table-name <DynamoDB-Table-Name> --item file://itemData.json

where itemData.json is as follows.

    "accountId": { "S": "123412341234" },
    " adJoinUserName": { "S": "ADJoinUser" },
    " adJoinUserPwParam": { "S": "ADJoinUser-PwParam" },
    "dnsName": { "S": "example.com" },
    "dnsIP1": { "S": "" },
    "dnsIP2": { "S": "" },
    "assumeRoleARN": { "S": "arn:aws:iam::111122223333:role/adJoinLambdaRole" }

(Update with your own values as appropriate for your environment.)

In the preceding example, adJoinLambdaRole is assumed by the AD Join Lambda function (if needed) to establish cross-account access using AWS Security Token Service (AWS STS). The role needs to provide sufficient privileges for the AD Join Lambda function to retrieve instance information and run cross-account Systems Manager commands.

adJoinUserName identifies a user with the minimum privileges to do the domain join; you created this user in the prerequisite steps.

adJoinUserPwParam identifies the name of the encrypted Systems Manager parameter that stores the password for the AD Join user. You created this parameter in the prerequisite steps.

Solution test

After you successfully deploy the solution using the steps in the previous section, the next step is to test the deployed solution.

To test the solution

  1. Navigate to the AWS EC2 console and launch a Linux instance. Launch the instance in a public subnet of the available VPC.
  2. Choose an IAM role that gives at least AmazonEC2RoleforSSM permissions to the instance.
  3. Add an adGroupName tag with the value that identifies the name of the Active Directory security group whose members should have access to the instance.
  4. Make sure that the security group associated with your instance has permissions for your IP address to log in to the instance by using the Secure Shell (SSH) protocol.
  5. Wait for the instance to launch and perform the Active Directory domain join. You can navigate to the AWS SQS console and observe a delayed message that represents the CloudWatch instance “running” event. This message is processed after five minutes; after that you can observe the Lambda function’s message processing log in CloudWatch logs.
  6. Log in to the instance with Active Directory user credentials. This user must be the member of the Active Directory security group identified by the adGroupName tag value. Following is an example login command.
    ssh ‘[email protected]’@<public-dns-name|public-ip-address>

  7. Similarly, launch a Windows EC2 instance to validate the Active Directory domain join by using Remote Desktop Protocol (RDP).
  8. Terminate domain-joined instances. Log in to the AD Tools instance to validate that the Active Directory Computer object that represents the instance is deleted.

The AD Join Lambda function invokes Systems Manager commands to deliver and run domain join scripts on the EC2 instances. The AWS-RunPowerShellScript command is used for Microsoft Windows instances, and the AWS-RunShellScript command is used for Linux instances. Systems Manager command parameters and execution status can be observed in the Systems Manager Run Command console.

The AD user used to perform the domain join is a least-privileged user, as described in Delegate Directory Join Privileges for AWS Managed Microsoft AD. The password for this user is passed to instances by way of SSM Run Commands, as described above. The password is visible in the SSM Command history log and in the domain join scripts run on the instance. Alternatively, all script parameters can be read locally on the instance through the “adjoin” encrypted SSM parameter. Refer to the domain join scripts for details of the “adjoin” SSM parameter.

Additional information

Directory sharing

AWS Managed Microsoft AD can be shared with other AWS accounts in the same Region. Learn how to use this feature and seamlessly domain join Microsoft Windows EC2 instances and Linux instances.

autoadjoin tag

Launching EC2 instances with an autoadjoin tag key with a “false” value excludes the instance from the automated Active Directory join process. You might want to do this in scenarios where you want to install additional agent software before or after the Active Directory join process. You can invoke domain join scripts (bash or PowerShell) by using user data or other means. However, you’ll need to reboot the instance and re-run scripts to complete the domain join process.


In this blog post, we demonstrated how you could automate the Active Directory domain join process for EC2 instances to AWS Managed Microsoft AD across multiple accounts and Regions, and also centrally manage this configuration by using AWS DynamoDB. By adopting this model, administrators can centrally manage Active Directory–aware applications and resources across their accounts.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Sanjay Patel

Sanjay is a Senior Cloud Application Architect with AWS Professional Services. He has a diverse background in software design, enterprise architecture, and API integrations. He has helped AWS customers automate infrastructure security. He enjoys working with AWS customers to identify and implement the best fit solution.


Vaibhawa Kumar

Vaibhawa is a Senior Cloud Infrastructure Architect with AWS Professional Services. He helps customers with the architecture, design, and automation to build innovative, secured, and highly available solutions using various AWS services. In his free time, you can find him spending time with family, sports, and cooking.


Kevin Higgins

Kevin is a Senior Cloud Infrastructure Architect with AWS Professional Services. He helps customers with the architecture, design, and development of cloud-optimized infrastructure solutions. As a member of the Microsoft Global Specialty Practice, he collaborates with AWS field sales, training, support, and consultants to help drive AWS product feature roadmap and go-to-market strategies.

Now in Preview – Larger & Faster io2 Block Express EBS Volumes with Higher Throughput

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-in-preview-larger-faster-io2-ebs-volumes-with-higher-throughput/

Amazon Elastic Block Store (EBS) volumes have been an essential EC2 component since they were launched in 2008. Today, you can choose between six types of HDD and SSD volumes, each designed to serve a particular use case and to deliver a specified amount of performance.

Earlier this year we launched io2 volumes with 100x higher durability and 10x more IOPS/GiB than the earlier io1 volumes. The io2 volumes are a great fit for your most I/O-hungry and latency-sensitive applications, including high-performance, business-critical workloads.

Even More
Today we are opening up a preview of io2 Block Express volumes that are designed to deliver even higher performance!

Built on our new EBS Block Express architecture that takes advantage of some advanced communication protocols implemented as part of the AWS Nitro System, the volumes will give you up to 256K IOPS & 4000 MBps of throughput and a maximum volume size of 64 TiB, all with sub-millisecond, low-variance I/O latency. Throughput scales proportionally at 0.256 MB/second per provisioned IOPS, up to a maximum of 4000 MBps per volume. You can provision 1000 IOPS per GiB of storage, twice as many as before. The increased volume size & higher throughput means that you will no longer need to stripe multiple EBS volumes together, reducing complexity and management overhead.

Block Express is a modular storage system that is designed to increase performance and scale. Scalable Reliable Datagrams (as described in A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC) are implemented using custom-built, dedicated hardware, making communication between Block Express volumes and Nitro-powered EC2 instances fast and efficient. This is, in fact, the same technology that the Elastic Fabric Adapter (EFA) uses to support high-end HPC and Machine Learning workloads on AWS,

Putting it all together, these volumes are going to deliver amazing performance for your SAP HANA, Microsoft SQL Server, Oracle, and Apache Cassandra workloads, and for your mission-critical transaction processing applications such as airline reservation systems and banking that once mandated the use of an expensive and inflexible SAN (Storage Area Network).

Join the Preview
The preview is currently available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Tokyo), and Europe (Frankfurt) Regions. During the preview, we support the use of R5b instances, with support for other Nitro-powered instances in the works.

You can opt-in to the preview on a per-account, per-region basis, create new io2 Block Express volumes, and then attach them to R5b instances. All newly created io2 volumes in that account/region will then make use of Block Express, and will perform as described above.

This is still a work in progress. We’re still adding support for a couple of features (Multi-Attach, Elastic Volumes, and Fast Snapshot Restore) and we’re building a new I/O fencing feature so that you can attach the same volume to multiple instances while ensuring consistent access and protecting shared data.

The volumes support encryption, but you can’t create encrypted volumes from unencrypted AMIs or snapshots, or from encrypted AMIs or snapshots that were shared from another AWS account. We expect to take care of all of these items during the preview. To learn more, visit the io2 page and read the io2 documentation.

To get started, opt-in to the io2 Block Express Preview today!



Coming Soon – Amazon EC2 G4ad Instances Featuring AMD GPUs for Graphics Workloads

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/new-amazon-ec2-g4ad-instances-featuring-amd-gpus-for-graphics-workloads/

Customers with high performance graphic workloads, such as game streaming, animation, and video rendering for example, are always looking for higher performance with less cost. Today, I’m happy to announce new Amazon Elastic Compute Cloud (EC2) instances in the G4 instance family are in the works and will be available soon, to improve performance and reduce cost for graphics-intensive workloads. The new G4ad instances feature AMD’s latest Radeon Pro V520 GPUs and 2nd generation EPYC processors, and are the first in EC2 to feature AMD GPUs.

G4dn instances, released in 2019 and featuring NVIDIA T4 GPUs, were previously the most cost-effective GPU-based instances in EC2. G4dn instances are ideal for deploying machine learning models in production and also graphics-intensive applications. However, when compared to G4dn the new G4ad instances enable up to 45% better price performance for graphics-intensive workloads, including the aforementioned game streaming, remote graphics workstations, and rendering scenarios. Compared to an equally-sized G4dn instance, G4ad instances offer up to 40% improvement in performance.

G4dn instances will continue to be the best option for small-scale machine learning (ML) training and GPU-based ML inference due to included hardware optimizations like Tensor Cores. Additionally, G4dn instances are still best suited for graphics applications that need access to NVIDIA libraries such as CUDA, CuDNN, and NVENC. However, when there is no dependency on NVIDIA’s libraries, we recommend customers try the G4ad instances to benefit from the improved price and performance.

AMD Radeon Pro V520 GPUs in G4ad instances support DirectX 11/12, Vulkan 1.1, and OpenGL 4.5 APIs. For operating systems, customers can choose from Windows Server 2016/2019, Amazon Linux 2, Ubuntu 18.04.3, and CentOS 7.7. Instances using G4ad can be purchased as On-Demand, Savings Plan, Reserved Instances, or Spot Instances. Three instance sizes are available, from G4ad.4xlarge, with 1 GPU, to G4ad.16xlarge with 4 GPUs, as shown below.

Instance SizeGPUsGPU Memory (GB)vCPUsMemory (GB)StorageEBS Bandwidth (Gbps)Network Bandwidth (Gbps)
g4ad.4xlarge181664600Up to 3Up to 10

The new G4ad instances will be available soon in US East (N. Virginia), US West (Oregon), and Europe (Ireland).

Learn more about G4ad instances.

— Steve

New EC2 M5zn Instances – Fastest Intel Xeon Scalable CPU in the Cloud

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-ec2-m5zn-instances-fastest-intel-xeon-scalable-cpu-in-the-cloud/

We launched the compute-intensive z1d instances in mid-2018 for customers who asked us for extremely high per-core performance and a high memory-to-core ratio to power their front-end Electronic Design Automation (EDA), actuarial, and CPU-bound relational database workloads.

In order to address a complementary set of use cases, customers have asked us for an EC2 instance that will give them high per-core performance like z1d, with no local NVMe storage, higher networking throughput, and a reduced memory-to-vCPU ratio. They have indicated if we built an instance with this set of attributes, it would be an excellent fit for workloads such as gaming, financial applications, simulation modeling applications such as those used in the automobile, aerospace, energy and telecommunication industries, and High Performance Computing (HPC).

Introducing M5zn
Building on the success of the z1d instances, we are launching M5zn instances in seven sizes today. These instances use 2nd generation custom Intel® Xeon® Scalable (Cascade Lake) processors with a sustained all-core turbo clock frequency of up to 4.5 GHz. M5zn instances feature high frequency processing, are a variant of the general-purpose M5 instances, and are built on the AWS Nitro System. These instances also feature low latency 100 Gbps networking and the Elastic Fabric Adapter (EFA), in order to improve performance for HPC and communication-intensive applications.

Here are the M5zn instances (all VPC-only, HVM-only, and EBS-Optimized, with support for Optimize vCPU). As you can see, the memory-to-vCPU ratio on these instances is half that of the existing z1d instances:

Instance NamevCPUs
Network BandwidthEBS-Optimized Bandwidth
m5zn.large28 GiBUp to 25 GbpsUp to 3.170 Gbps
m5zn.xlarge416 GiBUp to 25 GbpsUp to 3.170 Gbps
m5zn.2xlarge832 GiBUp to 25 Gbps3.170 Gbps
m5zn.3xlarge1248 GiBUp to 25 Gbps4.750 Gbps
m5zn.6xlarge2496 GiB50 Gbps9.500 Gbps
m5zn.12xlarge48192 GiB100 Gbps19 Gbps
m5zn.metal48192 GiB100 Gbps19 Gbps

The Nitro Hypervisor allows M5zn instances to deliver performance that is just about indistinguishable from bare metal. Other AWS Nitro System components such as the Nitro Security Chip and hardware-based processing for EBS increase performance, while VPC encryption provides greater security.

Things To Know
Here are a couple of “fun facts” about the M5zn instances:

Placement Groups – M5zn instances can be used in Cluster (for low latency and high network throughput), Spread (to keep critical instances separate from each other), and Partition (to reduce correlated failures) placement groups.

Networking – M5zn instances support the Elastic Network Adapter (ENA) with dedicated 100 Gbps network connections and a dedicated 19 Gbps connection to EBS. If you are building distributed ML or HPC applications for use on a cluster of M5zn instances, be sure to take a look at the Elastic Fabric Adapter (EFA). Your HPC applications can use the Message Passing Interface (MPI) to communicate efficiently at high speed while scaling to thousands of nodes.

C-State Control – You can configure CPU Power Management on m5zn.6xlarge and m5zn.12xlarge instances. This is definitely an advanced feature, but one worth exploring in those situations where you need to squeeze every possible cycle of available performance from the instance.

NUMA – You can make use of Non-Uniform Memory Access on m5zn.12xlarge instances. This is also an advanced feature, but worth exploring in situations where you have an in-depth understanding of your application’s memory access patterns.

To learn more about these and other features, visit the EC2 M5 Instances page.

Available Now
As you can see, the M5zn instances are a great fit for gaming, HPC and simulation modeling workloads such as those used by the financial, automobile, aerospace, energy, and telecommunications industries.

You can launch M5zn instances today in the US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Ireland), Europe (Frankfurt), and Asia Pacific (Tokyo) Regions in On-Demand, Reserved Instance, Savings Plan, and Spot form. Dedicated Instances and Dedicated Hosts are also available.

Support is available in the EC2 Forum or via your usual AWS Support contact. The EC2 team is interested in your feedback and you can contact them at [email protected].




Coming Soon – EC2 C6gn Instances – 100 Gbps Networking with AWS Graviton2 Processors

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/coming-soon-ec2-c6gn-instances-100-gbps-networking-with-aws-graviton2-processors/

Based on the amazing feedback from customers such as Snap, NextRoll, Intuit, SmugMug, and Honeycomb who are running their workloads on Amazon Elastic Compute Cloud (EC2) instances powered by AWS Graviton2, today we are announcing an addition to our broad Arm-based Graviton2 portfolio with C6gn instances that deliver up to 100 Gbps network bandwidth, up to 38 Gbps Amazon Elastic Block Store (EBS) bandwidth, up to 40% higher packet processing performance, and up to 40% better price/performance versus comparable current generation x86-based network optimized instances.

Compared to C6g instances, this new instance type provides 4x higher network bandwidth, 4x higher packet processing performance, and 2x higher EBS bandwidth. This means that customers with workloads that need high networking bandwidth such as high performance computing (HPC), network appliance, real-time video communications, and data analytics, will be able to bring their biggest and most challenging applications to Arm and take advantage of the performance and cost-optimization.

C6gn instances will be available in 8 sizes:

Network Bandwidth
EBS Throughput
c6gn.medium12Up to 25Up to 9.5
c6gn.large24Up to 25Up to 9.5
c6gn.xlarge48Up to 25Up to 9.5
c6gn.2xlarge816Up to 25Up to 9.5

The new instances are built on the AWS Nitro System, a collection of AWS-designed hardware and software innovations that maximize resource efficiency. C6gn instances support Elastic Fabric Adapter (EFA) on the c6gn.16xlarge sizes for workloads that can take advantage of lower network latency (such as HPC and video processing) and use Message Passing Interface (MPI) for highly scalable clusters. These new instances also fully support network frameworks like Data Plane Development Kit (DPDK), making it easier to migrate network appliance workloads.

Coming Soon
EC2 C6gn instances will be available later this month and make it easier to optimize costs for HPC and workloads that require high network bandwidth and low latency. Let me know what you are going to build with them!

To get practice with the AWS Graviton2 architecture, you can try t4g.micro instances for free for up to 750 hours per month until March 31st, 2021.

Learn more about EC2 C6gn instances today.


New – Amazon EC2 R5b Instances Provide 3x Higher EBS Performance

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-amazon-ec2-r5b-instances-providing-3x-higher-ebs-performance/

In July 2018, we announced memory-optimized R5 instances for the Amazon Elastic Compute Cloud (Amazon EC2). R5 instances are designed for memory-intensive applications such as high-performance databases, distributed web scale in-memory caches, in-memory databases, real time big data analytics, and other enterprise applications.

R5 instances offer two different block storage options. R5d instances offer up to 3.6TB of NMVe instance storage for applications that need access to high-speed, low latency local storage. In addition, all R5b instances work with Amazon Elastic Block Store. Amazon EBS is an easy-to-use, high-performance and highly available block storage service designed for use with Amazon EC2 for both throughput- and transaction-intensive workloads at any scale. A broad range of workloads, such as relational and non-relational databases, enterprise applications, containerized applications, big data analytics engines, file systems, and media workflows are widely deployed on Amazon EBS.

Today, we are happy to announce the availability of R5b, a new addition to the R5 instance family. The new R5b instance is powered by the AWS Nitro System to provide the best network-attached storage performance available on EC2. This new instance offers up to 60Gbps of EBS bandwidth and 260,000 I/O operations per second (IOPS).

Amazon EC2 R5b Instance
Many customers use R5 instances with EBS for large relational database workloads such as commerce platforms, ERP systems, and health record systems, and they rely on EBS to provide scalable, durable, and high availability block storage. These instances provide sufficient storage performance for many use cases, but some customers require higher EBS performance on EC2.

R5 instances provide bandwidth up to 19Gbps and maximum EBS performance of 80K IOPS, while the new R5b instances support bandwidth up to 60Gbps and EBS performance of 260K IOPS, providing 3x higher EBS-Optimized performance compared to R5 instances, enabling customers to lift and shift large relational databases applications to AWS. R5b and R5 vCPU to memory ratio and network performance are the same.

Instance NamevCPUsMemoryEBS Optimized Bandwidth (Mbps)EBS Optimized [email protected] (IO/s)
r5b.large216 GiBUp to 10,000Up to 43,333
r5b.xlarge432 GiBUp to 10,000Up to 43,333
r5b.2xlarge864 GiBUp to 10,000Up to 43,333
r5b.4xlarge16128 GiB10,00043,333
r5b.8xlarge32256 GiB20,00086,667
r5b.12xlarge48384 GiB30,000130,000
r5b.16xlarge64512 GiB40,000173,333
r5b.24xlarge96768 GiB60,000260,000
r5b.metal96768 GiB60,000260,000

Customers operating storage performance sensitive workloads can migrate from R5 to R5b to consolidate their existing workloads into fewer or smaller instances. This can reduce the cost of both infrastructure and licensed commercial software working on those instances. R5b instances are supported by Amazon RDS for Oracle and Amazon RDS for SQL Server, simplifying the migration path for large commercial database applications and improving storage performance for current RDS customers by up to 3x.

All Nitro compatible AMIs support R5b instances, and the EBS-backed HVM AMI must have NVMe 1.0e and ENA drivers installed at R5b instance launch. R5b supports io1, io2 Block Express (in preview), gp2, gp3, sc1, st1 and standard volumes. R5b does not support io2 volumes and io1 volumes that have multi-attach enabled, which are coming soon.

Available Today

R5b instances are available in the following regions: US West (Oregon), Asia Pacific (Tokyo), US East (N. Virginia), US East (Ohio), Asia Pacific (Singapore), and Europe (Frankfurt). RDS on r5b is available in US East (Ohio), Asia Pacific (Singapore), and Europe (Frankfurt), and support in other regions is coming soon.

Learn more about EC2 R5 instances and get started with Amazon EC2 today.

– Kame;

EC2 Update – D3 / D3en Dense Storage Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-update-d3-d3en-dense-storage-instances/

We have launched several generations of EC2 instances with dense storage including the HS1 in 2012 and the D2 in 2015. As you can guess from the name, our customers use these instances when they need massive amounts of very economical on-instance storage for their data warehouses, data lakes, network file systems, Hadoop clusters, and the like. These workloads demand plenty of I/O and network throughput, but work fine with a high ratio of storage to compute power.

New D3 and D3en Instances
Today we are launching the D3 and D3en instances. Like their predecessors, they give you access to massive amounts of low-cost on-instance HDD storage. The D3 instances are available in four sizes, with up to 32 vCPUs and 48 TB of storage. Here are the specs:

Instance NamevCPUsRAMHDD StorageAggregate Disk Throughput
(128 KiB Blocks)
Network BandwidthEBS-Optimized Bandwidth
d3.xlarge432 GiB6 TB (3 x 2 TB) 580 MiBpsUp to 15 Gbps850 Mbps
d3.2xlarge864 GiB12 TB (6 x 2 TB)1,100 MiBpsUp to 15 Gbps1,700 Mbps
d3.4xlarge16128 GiB24 TB (12 x 2 TB)2,300 MiBpsUp to 15 Gbps2,800 Mbps
d3.8xlarge32256 GiB48 TB (24 x 2 TB)4,600 MiBps25 Gbps5,000 Mbps

As you can see from the table above, the D3 instances are available in the same configurations as the D2 instances for easy migration. You’ll get 5% more memory per vCPU, a 30% boost in compute power, and 2.5x higher network performance if you migrate from D2 to D3. The instances provide low-cost dense storage that delivers high performance sequential access to large data sets. They are perfect for distributed file systems such as HDFS and MapR FS, big data analytical workloads, data warehouses, log processing, and data processing.

The D3en instances are available in six sizes, with up to 48 vCPUs and 336 TB of storage. Here are the specs:

Instance NamevCPUsRAMHDD StorageAggregate Disk Throughput
(128 KiB Blocks)
Network BandwidthEBS-Optimized Bandwidth
d3en.xlarge416 GiB28 TB (2 x 14 TB)500 MiBpsUp to 25 Gbps850 Mbps
d3en.2xlarge832 GiB56 TB (4 x 14 TB)1,000 MiBpsUp to 25 Gbps1,700 Mbps
d3en.4xlarge1664 GiB112 TB (8 x 14 TB)2,000 MiBps25 Gbps2,800 Mbps
d3en.6xlarge2496 GiB168 TB (12 x 14 TB)3,100 MiBps40 Gbps4,000 Mbps
d3en.8xlarge32128 GiB 224 TB (16 x 14 TB)4,100 MiBps50 Gbps5,000 Mbps
d3en.12xlarge48192 GiB336 TB (24 x 14 TB)6,200 MiBps75 Gbps7,000 Mbps

The D3en instances have a high ratio of storage to vCPU, and are optimized for high throughput and high sequential I/O to very large data sets, with a cost-per-TB that is 80% lower than on D2 instances. D3en instances can host Lustre, BeeGFS, GPFS, and other distributed file systems, they can store your data lakes, and they can run your Amazon EMR, Spark, and Hadoop analytical workloads.

Both of the instance types are built on the AWS Nitro System and are powered by custom Intel® Second Generation Scalable Xeon® (Cascade Lake) processors that can deliver all-core turbo performance of up to 3.1 GHz. The HDD storage is encrypted at rest using AES-256-XTS; traffic between D3 or D3en instances in the same VPC or within peered VPCs is encrypted using a 256-bit key.

Things to Know
Here are a couple of things that you should keep in mind regarding the D3 and D3en instances:

Regions – D3en instances are available in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions; D3en instances are available in all of those regions and also in the US East (Ohio) Region, with more regions coming soon.

Purchase Options – You can purchase D3 and D3 instances in On-Demand, Savings Plan, Reserved Instance, Spot, and Dedicated Instance form.

AMIs – You must use AMIs that include the Elastic Network Adapter (ENA) and NVMe drivers.

Now Available
D3 and D3en instances are available now and you can start using them today!


New – Use Amazon EC2 Mac Instances to Build & Test macOS, iOS, ipadOS, tvOS, and watchOS Apps

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-use-mac-instances-to-build-test-macos-ios-ipados-tvos-and-watchos-apps/

Throughout the course of my career I have done my best to stay on top of new hardware and software. As a teenager I owned an Altair 8800 and an Apple II. In my first year of college someone gave me a phone number and said “call this with modem.” I did, it answered “PENTAGON TIP,” and I had access to ARPANET!

I followed the emerging PC industry with great interest, voraciously reading every new issue of Byte, InfoWorld, and several other long-gone publications. In early 1983, rumor had it that Apple Computer would soon introduce a new system that was affordable, compact, self-contained, and very easy to use. Steve Jobs unveiled the Macintosh in January 1984 and my employer ordered several right away, along with a pair of the Apple Lisa systems that were used as cross-development hosts. As a developer, I was attracted to the Mac’s rich collection of built-in APIs and services, and still treasure my phone book edition of the Inside Macintosh documentation!

New Mac Instance
Over the last couple of years, AWS users have told us that they want to be able to run macOS on Amazon Elastic Compute Cloud (EC2). We’ve asked a lot of questions to learn more about their needs, and today I am pleased to introduce you to the new Mac instance!

The original (128 KB) Mac

Powered by Mac mini hardware and the AWS Nitro System, you can use Amazon EC2 Mac instances to build, test, package, and sign Xcode applications for the Apple platform including macOS, iOS, iPadOS, tvOS, watchOS, and Safari. The instances feature an 8th generation, 6-core Intel Core i7 (Coffee Lake) processor running at 3.2 GHz, with Turbo Boost up to 4.6 GHz. There’s 32 GiB of memory and access to other AWS services including Amazon Elastic Block Store (EBS), Amazon Elastic File System (EFS), Amazon FSx for Windows File Server, Amazon Simple Storage Service (S3), AWS Systems Manager, and so forth.

On the networking side, the instances run in a Virtual Private Cloud (VPC) and include ENA networking with up to 10 Gbps of throughput. With EBS-Optimization, and the ability to deliver up to 55,000 IOPS (16KB block size) and 8 Gbps of throughput for data transfer, EBS volumes attached to the instances can deliver the performance needed to support I/O-intensive build operations.

Mac instances run macOS 10.14 (Mojave) and 10.15 (Catalina) and can be accessed via command line (SSH) or remote desktop (VNC). The AMIs (Amazon Machine Images) for EC2 Mac instances are EC2-optimized and include the AWS goodies that you would find on other AWS AMIs: An ENA driver, the AWS Command Line Interface (CLI), the CloudWatch Agent, CloudFormation Helper Scripts, support for AWS Systems Manager, and the ec2-user account. You can use these AMIs as-is, or you can install your own packages and create custom AMIs (the homebrew-aws repo contains the additional packages and documentation on how to do this).

You can use these instances to create build farms, render farms, and CI/CD farms that target all of the Apple environments that I mentioned earlier. You can provision new instances in minutes, giving you the ability to quickly & cost-effectively build code for multiple targets without having to own & operate your own hardware. You pay only for what you use, and you get to benefit from the elasticity, scalability, security, and reliability provided by EC2.

EC2 Mac Instances in Action
As always, I asked the EC2 team for access to an instance in order to put it through its paces. The instances are available in Dedicated Host form, so I started by allocating a host:

$ aws ec2 allocate-hosts --instance-type mac1.metal \
  --availability-zone us-east-1a --auto-placement on \
  --quantity 1 --region us-east-1

Then I launched my Mac instance from the command line (console, API, and CloudFormation can also be used):

$ aws ec2 run-instances --region us-east-1 \
  --instance-type mac1.metal \
  --image-id  ami-023f74f1accd0b25b \
  --key-name keys-jbarr-us-east  --associate-public-ip-address

I took Luna for a very quick walk, and returned to find that my instance was ready to go. I used the console to give it an appropriate name:

Then I connected to my instance:

From here I can install my development tools, clone my code onto the instance, and initiate my builds.

I can also start a VNC server on the instance and use a VNC client to connect to it:

Note that the VNC protocol is not considered secure, and this feature should be used with care. I used a security group that allowed access only from my desktop’s IP address:

I can also tunnel the VNC traffic over SSH; this is more secure and would not require me to open up port 5900.

Things to Know
Here are a couple of fast-facts about the Mac instances:

AMI Updates – We expect to make new AMIs available each time Apple releases major or minor versions of each supported OS. We also plan to produce AMIs with updated Amazon packages every quarter.

Dedicated Hosts – The instances are launched as EC2 Dedicated Hosts with a minimum tenancy of 24 hours. This is largely transparent to you, but it does mean that the instances cannot be used as part of an Auto Scaling Group.

Purchase Models – You can run Mac instances On-Demand and you can also purchase a Savings Plan.

Apple M1 Chip – EC2 Mac instances with the Apple M1 chip are already in the works, and planned for 2021.

Launch one Today
You can start using Mac instances in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Singapore) Regions today, and check out this video for more information!



Automatically update security groups for Amazon CloudFront IP ranges using AWS Lambda

Post Syndicated from Yeshwanth Kottu original https://aws.amazon.com/blogs/security/automatically-update-security-groups-for-amazon-cloudfront-ip-ranges-using-aws-lambda/

Amazon CloudFront is a content delivery network that can help you increase the performance of your web applications and significantly lower the latency of delivering content to your customers. For CloudFront to access an origin (the source of the content behind CloudFront), the origin has to be publicly available and reachable. Anyone with the origin domain name or IP address could request content directly and bypass CloudFront. In this blog post, I describe an automated solution that uses security groups to permit only CloudFront to access the origin.

Amazon Simple Storage Service (Amazon S3) origins provide a feature called Origin Access Identity, which blocks public access to selected buckets, making them accessible only through CloudFront. When you use CloudFront to secure your web applications, it’s important to ensure that only CloudFront can access your origin (such as Amazon Elastic Cloud Compute (Amazon EC2) or Application Load Balancer (ALB)) and any direct access to origin is restricted. This blog post shows you how to create an AWS Lambda function to automatically update Amazon Virtual Private Cloud (Amazon VPC) security groups with CloudFront service IP ranges to permit only CloudFront to access the origin.

AWS publishes the IP ranges in JSON format for CloudFront and other AWS services. If your origin is an Elastic Load Balancer or an Amazon EC2 instance, you can use VPC security groups to allow only CloudFront IP ranges to access your applications. The IP ranges in the list are separated by service and Region, and you must specify only the IP ranges that correspond to CloudFront.

The IP ranges that AWS publishes change frequently and without an automated solution, you would need to retrieve this document frequently to understand the current IP ranges for CloudFront. Frequent polling is inefficient because there is no notice of when the IP ranges change, and if these IP ranges aren’t modified immediately, your client might see 504 errors when they access CloudFront. Additionally, there are numerous IP ranges for each service, performing the change manually isn’t an efficient way of updating these ranges. This means you need infrastructure to support the task. However, in that case you end up with another host to manage—complete with the typical patching, deployment, and monitoring. As you can see, a small task could quickly become more complicated than the problem you intended to solve.

An Amazon Simple Notification Service (Amazon SNS) message is sent to a topic whenever the AWS IP ranges change. Enabling you to build an event-driven, serverless solution that updates the IP ranges for your security groups, as needed by using a Lambda function that is triggered in response to the SNS notification.

Here are the steps we are going to take to implement the solution:

  1. Create your resources
    1. Create an IAM policy and execution role for the Lambda function
    2. Create your Lambda function
  2. Test your Lambda function
  3. Configure your Lambda function’s trigger

Create your resources

The first thing you need to do is create a Lambda function execution role and policy. Lambda function uses execution role to access or create AWS resources. This Lambda function is triggered by an SNS notification whenever there’s a change in the IP ranges document. Based on the number of IP ranges present for CloudFront and also the number of ports (for example, 80,443) that you want to whitelist on the origin, this Lambda function creates the required security groups. These security groups will allow only traffic from CloudFront to your ELB load balancers or EC2 instances.

Create an IAM policy and execution role for the Lambda function

When you create a Lambda function, it’s important to understand and properly define the security context for the Lambda function. Using AWS Identity and Access Management (IAM), you can create the Lambda execution role that determines the AWS service calls that the function is authorized to complete. (Learn more about the Lambda permissions model.)

To create the IAM policy for your role

  1. Log in to the IAM console with the user account that you will use to manage the Lambda function. This account must have administrator permissions.
  2. In the navigation pane, choose Policies.
  3. In the content pane, choose Create policy.
  4. Choose the JSON tab and copy the text from the following JSON policy document. Paste this text into the JSON text box.
      "Version": "2012-10-17",
      "Statement": [
          "Sid": "CloudWatchPermissions",
          "Effect": "Allow",
          "Action": [
          "Resource": "arn:aws:logs:*:*:*"
          "Sid": "EC2Permissions",
          "Effect": "Allow",
          "Action": [
          "Resource": "*"

  5. When you’re finished, choose Review policy.
  6. On the Review page, enter a name for the policy name (e.g. LambdaExecRolePolicy-UpdateSecurityGroupsForCloudFront). Review the policy Summary to see the permissions granted by your policy, and then choose Create policy to save your work.

To understand what this policy allows, let’s look closely at both statements in the policy. The first statement allows the Lambda function to create and write to CloudWatch Logs, which is vital for debugging and monitoring our function. The second statement allows the function to get information about existing security groups, get existing VPC information, create security groups, and authorize and revoke ingress permissions. It’s an important best practice that your IAM policies be as granular as possible, to support the principal of least privilege.

Now that you’ve created your policy, you can create the Lambda execution role that will use the policy.

To create the Lambda execution role

  1. In the navigation pane of the IAM console, choose Roles, and then choose Create role.
  2. For Select type of trusted entity, choose AWS service.
  3. Choose the service that you want to allow to assume this role. In this case, choose Lambda.
  4. Choose Next: Permissions.
  5. Search for the policy name that you created earlier and select the check box next to the policy.
  6. Choose Next: Tags.
  7. (Optional) Add metadata to the role by attaching tags as key-value pairs. For more information about using tags in IAM, see Tagging IAM Users and Roles.
  8. Choose Next: Review.
  9. For Role name (e.g. LambdaExecRole-UpdateSecurityGroupsForCloudFront), enter a name for your role.
  10. (Optional) For Role description, enter a description for the new role.
  11. Review the role, and then choose Create role.

Create your Lambda function

Now, create your Lambda function and configure the role that you created earlier as the execution role for this function.

To create the Lambda function

  1. Go to the Lambda console in N. Virginia region and choose Create function. On the next page, choose Author from scratch. (I’ll be providing the code for your Lambda function, but for other functions, the Use a blueprint option can be a great way to get started.)
  2. Give your Lambda function a name (e.g UpdateSecurityGroupsForCloudFront) and description, and select Python 3.8 from the Runtime menu.
  3. Choose or create an execution role: Select the execution role you created earlier by selecting the option Use an Existing Role.
  4. After confirming that your settings are correct, choose Create function.
  5. Paste the Lambda function code from here.
  6. Select Save.

Additionally, in the Basic Settings of the Lambda function, increase the timeout to 10 seconds.

To set the timeout value in the Lambda console

  1. In the Lambda console, choose the function you just created.
  2. Under Basic settings, choose Edit.
  3. For Timeout, select 10s.
  4. Choose Save.

By default, the Lambda function has these settings:

  • The Lambda function is configured to create security groups in the default VPC.
  • CloudFront IP ranges are updated as inbound rules on port 80.
  • The created security groups are tagged with the name prefix AUTOUPDATE.
  • Debug logging is turned off.
  • The service for which IP ranges are extracted is set to CloudFront.
  • The SDK client in the Lambda function set to us-east-1(N. Virginia).

If you want to customize these settings, set the following environment variables for the Lambda function. For more details, see Using AWS Lambda environment variables.

ActionKey-value data
To create security groups in a specific VPCKey: VPC_ID
Value: vpc-id
To create security groups rules for a different port or multiple ports
The solution in this example supports a total of two ports. One can be used for HTTP and another for HTTPS.

Value: portnumber
Value: portnumber,portnumber
To customize the prefix name tag of your security groupsKey: PREFIX_NAME
Value: custom-name
To enable debug logging to CloudWatchKey: DEBUG
Value: true
To extract IP ranges for a different service other than CloudFrontKey: SERVICE
Value: servicename
To configure the Region for the SDK client used in the Lambda function
If the CloudFront origin is present in a different Region than N. Virginia, the security groups must be created in that region.
Value: regionname

To set environment variables in the Lambda console

  1. In the Lambda console, choose the function you created.
  2. Under Environment variables, choose Edit.
  3. Choose Add environment variable.
  4. Enter a key and value.
  5. Choose Save.

Test your Lambda function

Now that you’ve created your function, it’s time to test it and initialize your security group.

To create your test event for the Lambda function

  1. In the Lambda console, on the Functions page, choose your function. In the drop-down menu next to Actions, choose Configure test events.
  2. Enter an Event Name (e.g. TriggerSNS)
  3. Replace the following as your sample event, which will represent an SNS notification and then select Create.
        "Records": [
                "EventVersion": "1.0",
                "EventSubscriptionArn": "arn:aws:sns:EXAMPLE",
                "EventSource": "aws:sns",
                "Sns": {
                    "SignatureVersion": "1",
                    "Timestamp": "1970-01-01T00:00:00.000Z",
                    "Signature": "EXAMPLE",
                    "SigningCertUrl": "EXAMPLE",
                    "MessageId": "95df01b4-ee98-5cb9-9903-4c221d41eb5e",
                    "Message": "{\"create-time\": \"yyyy-mm-ddThh:mm:ss+00:00\", \"synctoken\": \"0123456789\", \"md5\": \"7fd59f5c7f5cf643036cbd4443ad3e4b\", \"url\": \"https://ip-ranges.amazonaws.com/ip-ranges.json\"}",
                    "Type": "Notification",
                    "UnsubscribeUrl": "EXAMPLE",
                    "TopicArn": "arn:aws:sns:EXAMPLE",
                    "Subject": "TestInvoke"

  4. After you’ve added the test event, select Save and then select Test. Your Lambda function is then invoked, and you should see log output at the bottom of the console in Execution Result section, similar to the following.
    Updating from https://ip-ranges.amazonaws.com/ip-ranges.json
    MD5 Mismatch: got 2e967e943cf98ae998efeec05d4f351c expected 7fd59f5c7f5cf643036cbd4443ad3e4b: Exception
    Traceback (most recent call last):
      File "/var/task/lambda_function.py", line 29, in lambda_handler
        ip_ranges = json.loads(get_ip_groups_json(message['url'], message['md5']))
      File "/var/task/lambda_function.py", line 50, in get_ip_groups_json
        raise Exception('MD5 Missmatch: got ' + hash + ' expected ' + expected_hash)
    Exception: MD5 Mismatch: got 2e967e943cf98ae998efeec05d4f351c expected 7fd59f5c7f5cf643036cbd4443ad3e4b

  5. Edit the sample event again, and this time change the md5 value in the sample event to be the first MD5 hash provided in the log output. In this example, you would update the md5 value in the sample event configured earlier with the hash value seen in the error ‘2e967e943cf98ae998efeec05d4f351c’. Lambda code successfully executes only when the original hash of the IP ranges document and the hash received from the event trigger match. After you modify the hash value from the error message received earlier, the test event matches the hash of the IP ranges document.
  6. Select Save and test. This invokes your Lambda function.

After the function is invoked the second time with updated md5 has Lambda function should execute without any errors. You should be able to see the new security groups created and the IP ranges of CloudFront updated in the rules in the EC2 console, as shown in Figure 1.

Figure 1: EC2 console showing the security groups created

Figure 1: EC2 console showing the security groups created

In the initial successful run of this function, it created the total number of security groups required to update all the IP ranges of CloudFront for the ports mentioned. The function creates security groups based on the maximum number of rules that can be added to individual security groups. The new security groups can be identified from the EC2 console by the name AUTOUPDATE_random if you used the default configuration, or a custom name if you provided a PREFIX_NAME.

You can now attach these security groups to your Elastic LoadBalancer or EC2 instances. If your log output is different from what is described here, the output should help you identify the issue.

Configure your Lambda function’s trigger

After you’ve validated that your function is executing properly, it’s time to connect it to the SNS topic for IP changes. To do this, use the AWS Command Line Interface (CLI). Enter the following command, making sure to replace Lambda ARN with the Amazon Resource Name (ARN) of your Lambda function. You can find this ARN at the top right when viewing the configuration of your Lambda function.

aws sns subscribe - -topic-arn "arn:aws:sns:us-east-1:806199016981:AmazonIpSpaceChanged" - -region us-east-1 - -protocol lambda - -notification-endpoint "Lambda ARN"

You should receive the ARN of your Lambda function’s SNS subscription.

Now add a permission that allows the Lambda function to be invoked by the SNS topic. The following command also adds the Lambda trigger.

aws lambda add-permission - -function-name "Lambda ARN" - -statement-id lambda-sns-trigger - -region us-east-1 - -action lambda:InvokeFunction - -principal sns.amazonaws.com - -source-arn "arn:aws:sns:us-east-1:806199016981:AmazonIpSpaceChanged"

When AWS changes any of the IP ranges in the document, an SNS notification is sent and your Lambda function will be triggered. This Lambda function verifies the modified ranges in the document and efficiently updates the IP ranges on the existing security groups. Additionally, the function dynamically scales and creates additional security groups if the number of IP ranges for CloudFront is increased in future. Any newly created security groups are automatically attached to the network interface where the previous security groups are attached in order to avoid service interruption.


As you followed this blog post, you created a Lambda function to create a security groups and update the security group’s rules dynamically whenever AWS publishes new internal service IP ranges. This solution has several advantages:

  • The solution isn’t designed as a periodic poll, so it only runs when it needs to.
  • It’s automatic, so you don’t need to update security groups manually which lowers the operational cost.
  • It’s simple, because you have no extra infrastructure to maintain as the solution is completely serverless.
  • It’s cost effective, because the Lambda function runs only when triggered by the AmazonIpSpaceChanged SNS topic and only runs for a few seconds, this solution costs only pennies to operate.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon CloudFront forum. If you have any other use cases for using Lambda functions to dynamically update security groups, or even other networking configurations such as VPC route tables or ACLs, we’d love to hear about them!

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Yeshwanth Kottu

Yeshwanth is a Systems Development Engineer at AWS in Cupertino, CA. With a focus on CloudFront and [email protected], he enjoys helping customers tackle challenges through cloud-scale architectures. Yeshwanth has an MS in Computer Systems Networking and Telecommunications from Northeastern University. Outside of work, he enjoys travelling, visiting national parks, and playing cricket.

Tracking the latest server images in Amazon EC2 Image Builder pipelines

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/tracking-the-latest-server-images-in-amazon-ec2-image-builder-pipelines/

This post courtesy of Anoop Rachamadugu, Cloud Architect at AWS

The Amazon EC2 Image Builder service helps users to build and maintain server images. The images created by EC2 Image Builder can be used with Amazon Elastic Compute Cloud (EC2) and on-premises. Image Builder reduces the effort of keeping images up-to-date and secure by providing a graphical interface, built-in automation, and AWS-provided security settings. Customers have told us that they manage multiple server images and are looking for ways to track the latest server images created by the pipelines.

In this blog post, I walk through a solution that uses AWS Lambda and AWS Systems Manager (SSM) Parameter Store. It tracks and updates the latest Amazon Machine Image (AMI) IDs every time an Image Builder pipeline is run. With Lambda, you pay only for what you use. You are charged based on the number of requests for your functions and the time it takes for your code to run. In this case, the Lambda function is invoked upon the completion of the image builder pipeline. Standard SSM parameters are available at no additional charge.

Users can reference the SSM parameters in automation scripts and AWS CloudFormation templates providing access to the latest AMI ID for your EC2 infrastructure. Consider the use case of updating Amazon Machine Image (AMI) IDs for the EC2 instances in your CloudFormation templates. Normally, you might map AMI IDs to specific instance types and Regions. Then to update these, you would manually change them in each of your templates. With the SSM parameter integration, your code remains untouched and a CloudFormation stack update operation automatically fetches the latest Parameter Store value.


This solution uses a Lambda function written in Python that subscribes to an Amazon Simple Notification Service (SNS) topic. The Lambda function and the SNS topic are deployed using AWS SAM CLI. Once deployed, the SNS topic must be configured in an existing Image Builder pipeline. This results in the Lambda function being invoked at the completion of the Image Builder pipeline.

When a Lambda function subscribes to an SNS topic, it is invoked with the payload of the published messages. The Lambda function receives the message payload as an input parameter. The Lambda function first checks the message payload to see if the image status is available. If the image state is available, it retrieves the AMI ID from the message payload and updates the SSM parameter.

EC2 Image builder architecture diagram

EC2 Image builder architecture diagram


To get started with this solution, the following is required:

Deploying the solution

The solution consists of two files, which can be downloaded from the amazon-ec2-image-builder GitHub repository.

  1. The Python file image-builder-lambda-update-ssm.py contains the code for the Lambda function. It first checks the SNS message payload to determine if the image is available. If it’s available, it extracts the AMI ID from the SNS message payload and updates the SSM parameter specified.The ‘ssm_parameter_name’ variable specifies the SSM parameter path where the AMI ID should be stored and updated. The Lambda function finishes by adding tags to the SSM parameter.
  2. The template.yaml file is an AWS SAM template. It deploys the Lambda function, SNS topic, and IAM role required for the Lambda function. I use Python 3.7 as the runtime and assign a memory of 256 MB for the Lambda function. The IAM policy gives the Lambda function permissions to retrieve and update SSM parameters. Deploy this application using the AWS SAM CLI guided deploy:
    sam deploy --guided

After deploying the application, note the ARN of the created SNS topic. Next, update the infrastructure settings of an existing Image Builder pipeline with this newly created SNS topic. This results in the Lambda function being invoked upon the completion of the image builder pipeline.

Configuration details

Configuration details

Verifying the solution

After the completion of the image builder pipeline, use the AWS CLI or check the AWS Management Console to verify the updated SSM parameter. To verify via AWS CLI, run the following commands to retrieve and list the tags attached to the SSM parameter:

aws ssm get-parameter --name ‘/ec2-imagebuilder/latest’
aws ssm list-tags-for-resource --resource-type "Parameter" --resource-id ‘/ec2-imagebuilder/latest’

To verify via the AWS Management Console, navigate to the Parameter Store under AWS Systems Manager. Search for the parameter /ec2-imagebuilder/latest:

AWS Systems Manager: Parameter Store

AWS Systems Manager: Parameter Store

Select the Tags tab to view the tags attached to the SSM parameter:

Image builder tags list

Image builder tags list

Referencing the SSM Parameter in CloudFormation templates

Users can reference the SSM parameters in automation scripts and AWS CloudFormation templates providing access to the latest AMI ID for your EC2 infrastructure. This sample code shows how to reference the SSM parameter in a CloudFormation template.

Parameters :
  LatestAmiId :
    Type : 'AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>'
    Default: ‘/ec2-imagebuilder/latest’

Resources :
  Instance :
    Type : 'AWS::EC2::Instance'
    Properties :
      ImageId : !Ref LatestAmiId


In this blog post, I demonstrate a solution that allows users to track and update the latest AMI ID created by the Image Builder pipelines. The Lambda function retrieves the AMI ID of the image created by a pipeline and update an AWS Systems Manager parameter. This Lambda function is triggered via an SNS topic configured in an Image Builder pipeline.

The solution is deployed using AWS SAM CLI. I also note how users can reference Systems Manager parameters in AWS CloudFormation templates providing access to the latest AMI ID for your EC2 infrastructure.

The amazon-ec2-image-builder-samples GitHub repository provides a number of examples for getting started with EC2 Image Builder. Image Builder can make it easier for you to build virtual machine (VM) images.

Snowflake: Running Millions of Simulation Tests with Amazon EKS

Post Syndicated from Keith Joelner original https://aws.amazon.com/blogs/architecture/snowflake-running-millions-of-simulation-tests-with-amazon-eks/

This post was co-written with Brian Nutt, Senior Software Engineer and Kao Makino, Principal Performance Engineer, both at Snowflake.

Transactional databases are a key component of any production system. Maintaining data integrity while rows are read and written at a massive scale is a major technical challenge for these types of databases. To ensure their stability, it’s necessary to test many different scenarios and configurations. Simulating as many of these as possible allows engineers to quickly catch defects and build resilience. But the Holy Grail is to accomplish this at scale and within a timeframe that allows your developers to iterate quickly.

Snowflake has been using and advancing FoundationDB (FDB), an open-source, ACID-compliant, distributed key-value store since 2014. FDB, running on Amazon Elastic Cloud Compute (EC2) and Amazon Elastic Block Storage (EBS), has proven to be extremely reliable and is a key part of Snowflake’s cloud services layer architecture. To support its development process of creating high quality and stable software, Snowflake developed Project Joshua, an internal system that leverages Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Registry (ECR), Amazon EC2 Spot Instances, and AWS PrivateLink to run over one hundred thousand of validation and regression tests an hour.

About Snowflake

Snowflake is a single, integrated data platform delivered as a service. Built from the ground up for the cloud, Snowflake’s unique multi-cluster shared data architecture delivers the performance, scale, elasticity, and concurrency that today’s organizations require. It features storage, compute, and global services layers that are physically separated but logically integrated. Data workloads scale independently from one another, making it an ideal platform for data warehousing, data lakes, data engineering, data science, modern data sharing, and developing data applications.

Snowflake architecture

Developing a simulation-based testing and validation framework

Snowflake’s cloud services layer is composed of a collection of services that manage virtual warehouses, query optimization, and transactions. This layer relies on rich metadata stored in FDB.

Prior to the creation of the simulation framework, Project Joshua, FDB developers ran tests on their laptops and were limited by the number they could run. Additionally, there was a scheduled nightly job for running further tests.

Joshua at Snowflake

Amazon EKS as the foundation

Snowflake’s platform team decided to use Kubernetes to build Project Joshua. Their focus was on helping engineers run their workloads instead of spending cycles on the management of the control plane. They turned to Amazon EKS to achieve their scalability needs. This was a crucial success criterion for Project Joshua since at any point in time there could be hundreds of nodes running in the cluster. Snowflake utilizes the Kubernetes Cluster Autoscaler to dynamically scale worker nodes in minutes to support a tests-based queue of Joshua’s requests.

With the integration of Amazon EKS and Amazon Virtual Private Cloud (Amazon VPC), Snowflake is able to control access to the required resources. For example: the database that serves Joshua’s test queues is external to the EKS cluster. By using the Amazon VPC CNI plugin, each pod receives an IP address in the VPC and Snowflake can control access to the test queue via security groups.

To achieve its desired performance, Snowflake created its own custom pod scaler, which responds quicker to changes than using a custom metric for pod scheduling.

  • The agent scaler is responsible for monitoring a test queue in the coordination database (which, coincidentally, is also FDB) to schedule Joshua agents. The agent scaler communicates directly with Amazon EKS using the Kubernetes API to schedule tests in parallel.
  • Joshua agents (one agent per pod) are responsible for pulling tests from the test queue, executing, and reporting results. Tests are run one at a time within the EKS Cluster until the test queue is drained.

Achieving scale and cost savings with Amazon EC2 Spot

A Spot Fleet is a collection—or fleet—of Amazon EC2 Spot instances that Joshua uses to make the infrastructure more reliable and cost effective. ​ Spot Fleet is used to reduce the cost of worker nodes by running a variety of instance types.

With Spot Fleet, Snowflake requests a combination of different instance types to help ensure that demand gets fulfilled. These options make Fleet more tolerant of surges in demand for instance types. If a surge occurs it will not significantly affect tasks since Joshua is agnostic to the type of instance and can fall back to a different instance type and still be available.

For reservations, Snowflake uses the capacity-optimized allocation strategy to automatically launch Spot Instances into the most available pools by looking at real-time capacity data and predicting which are the most available. This helps Snowflake quickly switch instances reserved to what is most available in the Spot market, instead of spending time contending for the cheapest instances, at the cost of a potentially higher price.

Overcoming hurdles

Snowflake’s usage of a public container registry posed a scalability challenge. When starting hundreds of worker nodes, each node needs to pull images from the public registry. This can lead to a potential rate limiting issue when all outbound traffic goes through a NAT gateway.

For example, consider 1,000 nodes pulling a 10 GB image. Each pull request requires each node to download the image across the public internet. Some issues that need to be addressed are latency, reliability, and increased costs due to the additional time to download an image for each test. Also, container registries can become unavailable or may rate-limit download requests. Lastly, images are pulled through public internet and other services in the cluster can experience pulling issues.

​For anything more than a minimal workload, a local container registry is needed. If an image is first pulled from the public registry and then pushed to a local registry (cache), it only needs to pull once from the public registry, then all worker nodes benefit from a local pull. That’s why Snowflake decided to replicate images to ECR, a fully managed docker container registry, providing a reliable local registry to store images. Additional benefits for the local registry are that it’s not exclusive to Joshua; all platform components required for Snowflake clusters can be cached in the local ECR Registry. For additional security and performance Snowflake uses AWS PrivateLink to keep all network traffic from ECR to the workers nodes within the AWS network. It also resolved rate-limiting issues from pulling images from a public registry with unauthenticated requests, unblocking other cluster nodes from pulling critical images for operation.


Project Joshua allows Snowflake to enable developers to test more scenarios without having to worry about the management of the infrastructure. ​ Snowflake’s engineers can schedule thousands of test simulations and configurations to catch bugs faster. FDB is a key component of ​the Snowflake stack and Project Joshua helps make FDB more stable and resilient. Additionally, Amazon EC2 Spot has provided non-trivial cost savings to Snowflake vs. running on-demand or buying reserved instances.

If you want to know more about how Snowflake built its high performance data warehouse as a Service on AWS, watch the This is My Architecture video below.

Majority of Alexa Now Running on Faster, More Cost-Effective Amazon EC2 Inf1 Instances

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/majority-of-alexa-now-running-on-faster-more-cost-effective-amazon-ec2-inf1-instances/

Today, we are announcing that the Amazon Alexa team has migrated the vast majority of their GPU-based machine learning inference workloads to Amazon Elastic Compute Cloud (EC2) Inf1 instances, powered by AWS Inferentia. This resulted in 25% lower end-to-end latency, and 30% lower cost compared to GPU-based instances for Alexa’s text-to-speech workloads. The lower latency allows Alexa engineers to innovate with more complex algorithms and to improve the overall Alexa experience for our customers.

AWS built AWS Inferentia chips from the ground up to provide the lowest-cost machine learning (ML) inference in the cloud. They power the Inf1 instances that we launched at AWS re:Invent 2019. Inf1 instances provide up to 30% higher throughput and up to 45% lower cost per inference compared to GPU-based G4 instances, which were, before Inf1, the lowest-cost instances in the cloud for ML inference.

Alexa is Amazon’s cloud-based voice service that powers Amazon Echo devices and more than 140,000 models of smart speakers, lights, plugs, smart TVs, and cameras. Today customers have connected more than 100 million devices to Alexa. And every month, tens of millions of customers interact with Alexa to control their home devices (“Alexa, increase temperature in living room,” “Alexa, turn off bedroom’“), to listen to radios and music (“Alexa, start Maxi 80 on bathroom,” “Alexa, play Van Halen from Spotify“), to be informed (“Alexa, what is the news?” “Alexa, is it going to rain today?“), or to be educated, or entertained with 100,000+ Alexa Skills.

If you ask Alexa where she lives, she’ll tell you she is right here, but her head is in the cloud. Indeed, Alexa’s brain is deployed on AWS, where she benefits from the same agility, large-scale infrastructure, and global network we built for our customers.

How Alexa Works
When I’m in my living room and ask Alexa about the weather, I trigger a complex system. First, the on-device chip detects the wake word (Alexa). Once detected, the microphones record what I’m saying and stream the sound for analysis in the cloud. At a high level, there are two phases to analyze the sound of my voice. First, Alexa converts the sound to text. This is known as Automatic Speech Recognition (ASR). Once the text is known, the second phase is to understand what I mean. This is Natural Language Understanding (NLU). The output of NLU is an Intent (what does the customer want) and associated parameters. In this example (“Alexa, what’s the weather today ?”), the intent might be “GetWeatherForecast” and the parameter can be my postcode, inferred from my profile.

This whole process uses Artificial Intelligence heavily to transform the sound of my voice to phonemes, phonemes to words, words to phrases, phrases to intents. Based on the NLU output, Alexa routes the intent to a service to fulfill it. The service might be internal to Alexa or external, like one of the skills activated on my Alexa account. The fulfillment service processes the intent and returns a response as a JSON document. The document contains the text of the response Alexa must say.

The last step of the process is to generate the voice of Alexa from the text. This is known as Text-To-Speech (TTS). As soon as the TTS starts to produce sound data, it is streamed back to my Amazon Echo device: “The weather today will be partly cloudy with highs of 16 degrees and lows of 8 degrees.” (I live in Europe, these are Celsius degrees 🙂 ). This Text-To-Speech process also heavily involves machine learning models to build a phrase that sounds natural in terms of pronunciations, rhythm, connection between words, intonation etc.

Alexa is one of the most popular hyperscale machine learning services in the world, with billions of inference requests every week. Of Alexa’s three main inference workloads (ASR, NLU, and TTS), TTS workloads initially ran on GPU-based instances. But the Alexa team decided to move to the Inf1 instances as fast as possible to improve the customer experience and reduce the service compute cost.

What is AWS Inferentia?
AWS Inferentia is a custom chip, built by AWS, to accelerate machine learning inference workloads and optimize their cost. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, dramatically reducing latency and increasing throughput.

AWS Inferentia can be used natively from popular machine-learning frameworks like TensorFlow, PyTorch, and MXNet, with AWS Neuron. AWS Neuron is a software development kit (SDK) for running machine learning inference using AWS Inferentia chips. It consists of a compiler, run-time, and profiling tools that enable you to run high-performance and low latency inference.

Who Else is Using Amazon EC2 Inf1?
In addition to Alexa, Amazon Rekognition is also adopting AWS Inferentia. Running models such as object classification on Inf1 instances resulted in 8x lower latency and doubled throughput compared to running these models on GPU instances.

Customers, from Fortune 500 companies to startups, are using Inf1 instances for machine learning inference. For example, Snap Inc.​ incorporates machine learning (ML) into many aspects of Snapchat, and exploring innovation in this field is a key priority for them. Once they heard about AWS Inferentia, they collaborated with AWS to adopt Inf1 instances to help with ML deployment, including around performance and cost. They started with their recommendation models inference, and are now looking forward to deploying more models on Inf1 instances in the future.

Conde Nast, one of the world’s leading media companies, saw a 72% reduction in cost of inference compared to GPU-based instances for its recommendation engine. And Anthem, one of the leading healthcare companies in the US, observed 2x higher throughput compared to GPU-based instances for its customer sentiment machine learning workload.

How to Get Started with Amazon EC2 Inf1
You can start using Inf1 instances today.

If you prefer to manage your own machine learning application development platforms, you can get started by either launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or you can use Inf1 instances via Amazon Elastic Kubernetes Service or Amazon ECS for containerized machine learning applications. To learn more about running containers on Inf1 instances, read this blog to get started on ECS and this blog to get started on EKS.

The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service that enables developers to build, train, and deploy machine learning models quickly.

Get started with Inf1 on Amazon SageMaker today.

— seb

PS: The team just released this video, check it out!

Integrating AWS Outposts with existing security zones

Post Syndicated from Shubha Kumbadakone original https://aws.amazon.com/blogs/compute/integrating-aws-outposts-with-existing-security-zones/

This post is contributed by Santiago Freitas and Matt Lehwess.

AWS Outposts is a fully managed service that extends AWS infrastructure, services, APIs, and tools to your on-premises facility. This blog post explains how the resources created on an Outpost can be integrated with security zones of an existing infrastructure. Integrating Outposts with existing security zones is important for customers that segment their environment into multiple domains based on their security policy.


It’s common for customers to have different security zones within their infrastructure. For example, a customer might have a DMZ security zone where internet facing systems are located, an extra-net security zone where systems used to communicate with business partners are hosted, and an internal security zone where the systems accessible only by employees operate.

As illustrated in the following figure, customers often use firewalls and network ACLs to perform packet inspection and filtering for traffic that flows between security zones. Customers also usually use similar security controls for traffic flowing between different application tiers.



Example of firewall being used for security zone segmentation

Enabling connectivity from an Outposts VPC to your on-premises network

During your Outpost deployment, AWS works with you to establish connectivity to your local network. During this installation process, AWS creates an address pool based on an IP range you assign to the Outpost out of your local network’s IP ranges. This address pool is known as a customer-owned IP address pool or CoIP pool.

When resources running on Outposts (like an EC2 instance) need to communicate with existing on-premises systems, you must assign an Elastic IP address to them. This Elastic IP address comes from the CoIP pool. The resources in your local area network (LAN), including firewalls and network ACLs, see the Elastic IP address as the source IP of the packets sent from the resources running on Outposts. This Elastic IP address is also used as the destination IP for traffic from your on-premises systems to the instance on the Outpost.

How to enable connectivity from an Outposts VPC to your on-premises network

With a typical Outpost deployment, as shown in the following diagram, the following items are required to enable connectivity from the Outpost’s VPC to your on-premises network:

  1. VPC routing: Routes to your on-premises network in the route table of the VPC’s Outpost subnet.
  2. CoIP pool: Assigned by you, the customer, and created by AWS at time of install, for example, in the diagram. It’s used to allocate Elastic IP addresses that can then be associated with instances or other resources on the Outpost.
  3. Elastic IP address association: Customer configured, association of Elastic IP addresses from the CoIP pool with EC2 instances and other resources on the Outpost. For example, VPC IP from instance Y has Elastic IP address associated with it.
  4. Routing advertisement: The Outpost advertises the assigned CoIP range to the local devices within your on-premises network via Border Gateway Protocol (BGP). This is configured by AWS during installation and based on the CoIP pool IP range you provide.


AWS Outposts – typical network topology

Integrating Outposts with existing security zones

CoIP pools, and AWS Resource Access Manager (RAM) are used to facilitate the integration of Outposts with your existing security zones. AWS RAM lets you share your resources with AWS account or through AWS Organizations.

When AWS deploys your Outposts, you can ask AWS to create multiple CoIP pools. Each CoIP pool is configured with an IP range assigned by you during initial installation. If after the initial installation you need AWS to create additional CoIP pools for an existing Outpost, you can open a case with AWS Support and request the creation of additional pools.

CoIP pools are owned initially by the AWS account that owns the Outpost. In a multi-account Outpost deployment, you can share your customer-owned IP pool(s) with other AWS accounts in your AWS Organization using AWS RAM.

VPCs that have subnets on the Outpost must be created in the same account that owns the Outpost. After creation of these subnets within the Outpost account, AWS RAM can then be used to share these subnets with other accounts or organizational units that are in the same organization from AWS Organizations.

After you’ve shared the CoIP pool with the account where you’d like to run your workloads, and you’ve shared an Outpost subnet with that same account, users of that account can deploy resources in the shared Outpost subnet. For the Outposts resources that must communicate with resources in your existing infrastructure, users can assign Elastic IP addresses from one or more of the shared CoIP pools to those Outposts resources.

Multiple CoIP pools and the ability to share them and Outpost subnets with a particular account, gives you granular control over the IP range used by Outpost resources requiring connectivity to your on-premises LAN.

The following figure illustrates the sharing of a subnet and CoIP pool from Account 1, which is the account where the Outpost was initially deployed, with Account 2. As the CoIP pool was shared with Account 2, the users in Account 2 are able to assign an Elastic IP address to the EC2 instance running within Account 2.

AWS Outposts and VPC Sharing

Creating an AWS RAM resource share

The following screenshots demonstrate how to use AWS RAM to share a CoIP pool and a subnet with another account in the same organization from AWS Organizations.

Step 1. Navigate to the AWS RAM service page within the AWS Management Console, and click Create Resource Share. This step is done within the AWS account that owns the Outpost.

Step 2. Next, give the share an appropriate name.

Step 3. Select one or more subnets you’d like to share with your application owner’s account. In this case, select a subnet that exists in your VPC on the Outpost.

Step 4. In this case, share the CoIP range. You can find that in the resources type list.

Step 5. Select the Customer-owned Ipv4Pool resource type, and then select the CoIP to be shared. Selecting the the CoIP adds it to the list of shared resources along with the subnet selected earlier.

Step 6. Add the account number for the account you’d like to share the Outposts subnet and CoIP pool with.

Step 7. Click Create Resource. After clicking the Create Resource share button, you should see the resource share listed and be in the state “active.”

Now that you’ve successfully shared the subnet and CoIP pool with your application account (that’s in the same organization), you can now go in to that account and allocate an Elastic IP address from the CoIP pool. You can then assign the Elastic IP address to an instance launched in the subnet previously shared, which is on the Outpost.

Allocating and associating the Elastic IP address from your CoIP pool

Step 1. From within the application account, allocate an Elastic IP address from the CoIP pool. This is done by navigating to the VPC console, then to Elastic IP addresses, and then click Allocate Elastic IP address.

Step 2. Select Customer owned pool of IPv4 addresses, select the CoIP pool that’s been shared with this account previously, then click Allocate. You can see this step in the following image.

Step 3. You can now see the new Elastic IP address that has been allocated from the shared CoIP pool. You can now associate that Elastic IP address with an instance in our shared subnet.

Note: When using the AWS Management Console to allocate an Elastic IP address, the ElasticIP address is automatically allocated from the CoIPpool. If you’re using the AWS CLI, you have the option to select a specific Elastic IP address from the CoIP pool by using the --customer-owned-ipv4-pool <value> option.

Step 4. You can now associate the Elastic IP address of type CoIP with an instance. Click Associate after selecting the instance that is running on your Outpost in the shared subnet. This step is represented in the following image.

Step 5 – Once the operation is complete, you can see the CoIP Elastic IP address in the Elastic IP address console, and that it’s assigned to the instance that’s running in your shared subnet.

The steps preceding demonstrated how to share the Outpost with other accounts through the use of AWS Resource Access Manager, subnet sharing, and CoIP sharing.

Use case

Let’s apply the capabilities described prior into a design. A customer is running a 3-tier web application and would like to host the web and application tiers on Outposts. The database tier is hosted on the existing infrastructure. The application’s user traffic comes from the internet, through an external firewall towards the web tier hosted on the Outpost. Traffic between web and application tiers is secured with security groups and remains within the Outpost. Application servers connect to the database servers via the existing internal firewalls. The customer uses independent external and internal firewalls.

During the Outposts installation, the customer asks AWS to create two CoIP pools. The first CoIP pool, referenced in the following image as CoIP Web, is assigned the IP range The second CoIP pool, referenced in the following image as CoIP App, is assigned the IP range The customer then creates two additional AWS accounts, one for the web tier and another for the app tier. Within the account that owns the Outpost, the customer creates two VPCs and within each of those VPCs a subnet is created.

The subnet with IP address is shared with the web tier account and the subnet with IP address is shared with the app tier account. The customer then configures VPC peering between the VPCs to enable communication between the web and app tiers. VPC peering configuration is performed within the account that owns the Outposts.

Note: With VPC peering traffic between the VPCs remains within the Outpost. However, data transfer charges for VPC peering applies. This use-case could also be built with the web tier and app tier using different subnets within the same VPC to save on data transfer costs over VPC peering.

The following image shows initial setup with two CoIP pools, two accounts, and two VPCs.


Initial setup with two CoIP pools, two accounts and two VPCs

After the initial setup the customer then shares each CoIP pool with the appropriate account using AWS RAM. The customer can then create their web and application servers within the respective accounts. As shown in the prior image, the web server is assigned IP address and the application server is assigned IP address Each IP address is part of their respective VPC IP range and are reachable only within the Outpost at this point.

To enable the integration of the web and application servers with the existing infrastructure, the customer assigns an Elastic IP address from the CoIP pool shared with each account to their respective Amazon EC2 instances.

With this configuration, the customer can integrate the web and application servers into the existing security zones. To integrate the web servers, path “A” in the following image, the customer creates a rule in their external firewall that allows communication from any source (internet) to Elastic IP address on port 443 (HTTPS) which has been assigned to the web server.

For the communication between the web and application servers, path “B” in the following image, the customer has configured VPC peering between the two VPCs and created the required security groups.

To integrate the application servers into the Internal Security Zone, path “C” in the following image, the customer has assigned Elastic IP address to the application server and has configured a rule on the Internal Firewall allowing the IP range that is assigned to the app CoIP pool to communicate with the database server.


End to end traffic flow from users to web tier and from app tier to database tier

In addition to setting up their Outposts as covered in the preceding details, to enable the communication between the Outposts hosted resources and the existing infrastructure you must create a custom route table and associate it with an Outpost subnet.


AWS Outposts extends AWS infrastructure, services, APIs, and tools to your on-premises facility. During deployment of your Outposts, AWS works with you to establish connectivity to your local network. This blog post builds upon this initial deployment of an Outpost to allow the Outpost’s resources to integrate into your existing security zones. The design is applicable for customers that segment their environment into multiple security zones.


Santiago Freitas is AWS Head of Technology for Central Eastern Europe (CEE), Middle East, North Africa (MENA), Sub Saharan Africa (SSA), Turkey (TUR), and Russia and Commonwealth of Independent States (RUS-CIS). Previously he was an AWS Global Solutions Architect for financial services. Before joining AWS, Santiago was a Distinguished Engineer at Cisco. He is based in Dubai, United Arab Emirates.

Matt is a Principal Developer Advocate for Amazon Web Services where he has spent the last 7 years working with AWS customers and partners, helping them to build best-in-class AWS networking solutions. He is co-author of the AWS Certified Advanced Networking Official Study Guide, as well as an Amazon Re:Invent speaker extraordinaire (5 years and counting!). His primary focus at AWS is to help evangelize AWS networking solutions and more recently, AWS Outposts.

Using shared memory for low-latency, intra-node communication in AWS Batch

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-shared-memory-for-low-latency-intra-node-communication-in-aws-batch/

This post is courtesy of Dario La Porta, Senior Consultant, HPC.

AWS Batch enables developers, scientists, and engineers to run hundreds of thousands of HPC jobs in AWS. By managing the provisioning of computing resources, this allows you to focus on your core business. Shared memory support is a new feature that can help improve overall performance.

This post explains the shared memory paradigm and how it can help you improve the performance of your single and multi-node applications. Performance gains can also help you to reduce the total runtime of your jobs and therefore reduce the overall cost.

The second part of the post shows you how to use shared memory in AWS Batch both in the AWS Management Console and the AWS CLI. Finally, I show the performance gains that are made possible with shared memory usage by walking through a benchmarking analysis with OSU Micro-Benchmarks and GROMACS.

Shared memory paradigm

Advanced, compute-intensive workloads require high-performance hardware to use scalability to deliver results. The Amazon EC2 C5n instance type provides cost-efficient, high-performance hardware with a configurable number of cores.

HPC workloads use algorithms that require parallelization and a low latency communication between the different processes. The two main technologies used for the parallel communications are message-passing with distributed memory and shared memory.

Message Passing Interface (MPI) is a message-passing standard used for the communication in a parallel distributed environment. Elastic Fabric Adapter (EFA) enables your MPI applications to use low-latency, inter-node communication.

The shared memory paradigm allows multiple processors in the same system to communicate using a memory (RAM) portion that is shared between the processes. This method takes advantage of the high-speed memory bus.

Shared memory paradigm

MPI with intra-node shared memory communication

The two main MPI implementations, OpenMPI and Intel MPI, enable an intra-node shared memory communication in a distributed compute environment. When configured, you take advantage of the EFA libfabic implementation having consistent and reduced latency. This results in higher throughput than the TCP transport for the intra-node communication. From libfabric 1.9 onwards, the shared memory support has been directly added to the EFA provider. You no longer need to perform any modification to the OFI MTL.

MPI jobs in AWS Batch

AWS Batch enables the execution of MPI jobs using a multi-node configuration. First, a job definition is created that enables the execution of the job in multiple nodes. To learn how to create this definition, see Creating a Multi-node Parallel Job Definition.

To take advantage of the EFA capabilities, select a supported instance type and read Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch. This post shows how to create the necessary resources in AWS Batch and run your first job with EFA.

Shared memory in AWS Batch

The new AWS Batch console interface enables you to configure the shared memory of the container inside the Job Definition. To see this, expand the Additional configuration in the Container properties section:

Container properties

The Linux parameters section contains the Shared memory size parameter in MB.

Linux parameters

You can set the same configuration in the AWS CLI by passing JSON parameters to the RegisterJobDefinition API:

"linuxParameters": {
    "sharedMemorySize": integer

When you run the job, it creates a shared memory area on each node that uses two or more processes. The shared memory area cannot be changed during the execution of the job. The size of the shared memory area is determined by the number of cores available in the node and the application requirements. For most jobs, a suggested initial value is 4096 MB.

Modern Linux kernels support a POSIX shared memory API. You can inspect the size of the container shared memory using the df -h /dev/shm command. The output can help you determine the shared memory space needed for your job.


The following section compares the different performance using shared memory with Intel MPI 2019 update 7 and EFA.

The instance type used for the benchmark is the c5n.18xlarge and, for the multi-node use case, a cluster placement group. This compares the performance increase from shared memory versus using pure EFA communication. The first benchmark focuses on the latency of the communication in a single node use case.

OSU Micro-Benchmarks is a suite of benchmarks for measuring and evaluating the performance of MPI operations. The specific test case used is the osu_latency, measuring the minimum, maximum and average latency of a ping-pong communication between a sender and a receiver. Specifically, this is where the message sender waits for the reply from the receiver. The benchmark uses a variety of data sizes to report the average one-way latency.

OSU benchmark results

The chart shows the latency in μs on the horizontal axis and packet size in bits on the vertical axis. The result shows a decrease in the communication latency using shared memory for the intra-node communication compared with using only EFA. The following chart shows the latency improvement:

Latency improvement graph

The next benchmark demonstrates how shared memory can also increase the performance in a multi-node configuration. The test application is GROMACS, a versatile package to perform molecular dynamics. The overall performance of the application is susceptible to communication latency variance.

Gromacs performance

The code for the test has been downloaded from the Unified European Applications Benchmark Suite. The specific use case is named lignocellulose-rf and it uses the Reaction field for electrostatics. The details and the download link can be found in the UEABS repository.

The benchmark uses one thread per core and the following mdrun parameters:

-maxh 0.50 -resethway -noconfout -nsteps 10000 -dlb yes -nstlist 100 -pin on

The compilation options and the parameters configuration are explained in the README file. The test is run on a c5n.18xlarge instance, instead of a GPU instance, to focus on measuring the performance improvement caused specifically by increasing the number of total cores of the simulation. The chart explains the performance gain (measure in ns/day) that are achieved by increasing the number of cores. This is possible by using the shared memory for the intra-node communication during the simulation instead of using only EFA networking.

The following chart illustrates a significant percentage performance improvement by using the shared memory:

Shared memory performance improvement


In this post, I show how the new shared memory support in AWS Batch is able to improve performance while decreasing the latency of the intra-node communication. This performance gain can also lower the cost of running jobs overall.

I show how to enable the usage of the shared memory in AWS Batch from the AWS Management Console or the AWS CLI. I also highlight the performance gain from using shared memory with the high-speed memory bus of the c5n.18xlarge instance, using benchmarking analysis with OSU Micro-Benchmarks and GROMACS.

AWS Batch multi-node parallel jobs are now even more performant with EFA and shared memory configurations, enabling you to focus more on your applications and less on tuning. In addition, the Elastic Fabric Adapter (EFA) has a more consistent latency and higher throughput than the TCP transport for the inter-node communication.

To learn more about using this feature, visit the Getting Started guide.

Proactively manage the Spot Instance lifecycle using the new Capacity Rebalancing feature for EC2 Auto Scaling

Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/proactively-manage-spot-instance-lifecycle-using-the-new-capacity-rebalancing-feature-for-ec2-auto-scaling/

By Deepthi Chelupati and Chad Schmutzer

AWS now offers Capacity Rebalancing for Amazon EC2 Auto Scaling, a new feature for proactively managing the Amazon EC2 Spot Instance lifecycle in an Auto Scaling group. Capacity Rebalancing complements the capacity optimized allocation strategy (designed to help find the most optimal spare capacity) and the mixed instances policy (designed to enhance availability by deploying across multiple instance types running in multiple Availability Zones). Capacity Rebalancing increases the emphasis on availability by automatically attempting to replace Spot Instances in an Auto Scaling group before they are interrupted by Amazon EC2.

In order to proactively replace Spot Instances, Capacity Rebalancing leverages the new EC2 Instance rebalance recommendation, a signal that is sent when a Spot Instance is at elevated risk of interruption. The rebalance recommendation signal can arrive sooner than the existing two-minute Spot Instance interruption notice, providing an opportunity to proactively rebalance a workload to new or existing Spot Instances that are not at elevated risk of interruption.

Capacity Rebalancing for EC2 Auto Scaling provides a seamless and automated experience for maintaining desired capacity through the Spot Instance lifecycle. This includes monitoring for rebalance recommendations, attempting to proactively launch replacement capacity for existing Spot Instances when they are at elevated risk of interruption, detaching from Elastic Load Balancing if necessary, and running lifecycle hooks as configured. This post provides an overview of using Capacity Rebalancing in EC2 Auto Scaling to manage your Spot Instance backed workloads, and dives into an example use case for taking advantage of Capacity Rebalancing in your environment.

EC2 Auto Scaling and Spot Instances – a classic love story

First, let’s review what Spot Instances are and why EC2 Auto scaling provides an optimal platform to manage your Spot Instance backed workloads. This will help illustrate how Capacity Rebalancing can benefit these workloads.

Spot Instances are spare EC2 compute capacity in the AWS Cloud available for steep discounts off On-Demand prices. In exchange for the discount, Spot Instances come with a simple rule – they are interruptible and must be returned when EC2 needs the capacity back. Where does this spare capacity come from? Since AWS builds capacity for unpredictable demand at any given time (think all 350+ instance types across 77 Availability Zones and 24 Regions), there is often excess capacity. Rather than let that spare capacity sit idle and unused, it is made available to be purchased as Spot Instances.

As you can imagine, the location and amount of spare capacity available at any given moment is dynamic and continually changes in real time. This is why it is extremely important for Spot customers to only run workloads that are truly interruption tolerant. Additionally, Spot workloads should be flexible, meaning they can be shifted in real time to where the spare capacity currently is (or otherwise be paused until spare capacity is available again). In practice, being flexible means qualifying a workload to run on multiple EC2 instance types (think big: multiple families, sizes, and generations), and in multiple Availability Zones, at any given time.

This is where EC2 Auto Scaling comes in. EC2 Auto Scaling is designed to help you maintain application availability. It also allows you to automatically add or remove EC2 instances according to conditions you define. We’ve continued to innovate on behalf of our customers by adding new features to EC2 Auto Scaling to natively support flexible configurations for EC2 workloads. One of these innovations is the mixed instances policy (launched in 2018), which supports multiple instance types and purchase options in a single Auto Scaling group. Another innovation is the capacity optimized allocation strategy (launched in 2019), an allocation strategy designed to locate optimal spare capacity for Spot Instances backed workloads. These features are aimed at supporting flexible workload best practices, and reacting to the dynamic shifts in capacity automatically.

The next level – moving from reactive to proactive Spot Capacity Rebalancing in EC2 Auto Scaling

The default behavior for EC2 Auto Scaling is to take a reactive approach to Spot Instance interruptions. This means that EC2 Auto Scaling attempts to replace an interrupted Spot Instance with another Spot Instance only after the instance has been shut down by EC2 and the health check fails. The reactive approach to interruptions works fine for many workloads. However, we have received feedback from customers requesting that EC2 Auto Scaling take a more proactive approach to handling Spot Instance interruptions.

Capacity Rebalancing in EC2 Auto Scaling is the answer to this request. Capacity Rebalancing is designed to take a proactive approach in handling the dynamic nature of EC2 capacity. It does this by monitoring for the EC2 Instance rebalance recommendation signal in addition to the “final” two-minute Spot Instance interruption notice. When a rebalance recommendation signal is detected, it automatically attempts to get a head start in replacing Spot Instances with new Spot Instances before they are shut down. In addition to attempting to maintain desired capacity through interruptions by launching replacement Spot Instances, Capacity Rebalancing gives customers the opportunity to gracefully remove Spot Instances from an Auto Scaling group by taking Spot Instances through the normal shut down process, such as deregistering from a load balancer and running terminating lifecycle hooks.

Capacity Rebalancing in EC2 Auto Scaling works best when combined with a few best practices. Let’s quickly review them:

  1. Be flexible. Capacity Rebalancing thrives on flexibility, and works best when using the EC2 Auto Scaling mixed instances policy and as many instance types and Availability Zones as possible. Remember to think big and qualify multiple families, sizes, and generations for your workload, and use all Availability Zones if possible.
  2. Use the capacity optimized allocation strategy. Capacity rebalance works optimally when combined with the capacity optimized allocation strategy and a flexible list of instance types and Availability Zones, because the goal is to find the optimal spare capacity to rebalance your workload on.
  3. Take advantage of termination lifecycle hooks (optional). Termination lifecycle hooks are powerful in case you need to perform any final tasks before shutdown.

Example tutorial – Web application workload

Now that you understand the best practices for taking advantage of Capacity Rebalancing in EC2 Auto Scaling, let’s dive into the example workload. In this scenario, we have a web application powered by 75% Spot Instances and 25% On-Demand Instances in an Auto Scaling group, running behind an Application Load Balancer. We’d like to maintain availability, and have the Auto Scaling group automatically handle Spot Instance interruptions and rebalancing of capacity.

The Auto Scaling group configuration looks like this (note the best practices of instance type and Availability Zone flexibility combined with the capacity optimized allocation strategy in the mixed instances policy):

   "AutoScalingGroupName": "myAutoScalingGroup",
   "CapacityRebalance": true,
   "DesiredCapacity": 12,
   "MaxSize": 15,
   "MinSize": 12,
   "MixedInstancesPolicy": {
      "InstancesDistribution": {
         "OnDemandBaseCapacity": 0,
         "OnDemandPercentageAboveBaseCapacity": 25,
         "SpotAllocationStrategy": "capacity-optimized"
      "LaunchTemplate": {
         "LaunchTemplateSpecification": {
            "LaunchTemplateName": "myLaunchTemplate",
            "Version": "$Default"
         "Overrides": [
               "InstanceType": "c5.large"
               "InstanceType": "c5a.large"
               "InstanceType": "m5.large"
               "InstanceType": "m5a.large"
               "InstanceType": "c4.large"
               "InstanceType": "m4.large"
               "InstanceType": "c3.large"
               "InstanceType": "m3.large"
   "TargetGroupARNs": [
   "VPCZoneIdentifier": "mySubnet1,mySubnet2,mySubnet3"

Next, create the Auto Scaling group as follows:

aws autoscaling create-auto-scaling-group \
  --cli-input-json file://myAutoScalingGroup.json

We also use a lifecycle hook to download logs before an instance is shut down:

aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name myTerminatingHook \
  --auto-scaling-group-name myAutoScalingGroup \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300

In this example scenario, let’s say that the above config results in nine Spot Instances and three On-Demand instances being deployed in the Auto Scaling group, three Spot Instances, and one On-Demand instance in each Availability Zone. With Capacity Rebalancing enabled, if any of the nine Spot Instances receive the EC2 Instance rebalance recommendation signal, EC2 Auto Scaling will automatically request a replacement Spot Instance according to the allocation strategy (capacity optimized), resulting in 10 running Spot Instances. When the new Spot Instance passes EC2 health checks, it is joined to the load balancer and placed into service. Upon placing the new Spot Instance in service, EC2 Auto Scaling then proceeds with the shutdown process for the Spot Instance that has received the rebalance recommendation signal. It detaches the instance from the load balancer, drains connections, and then carries out the terminating lifecycle hook. Once the terminating lifecycle hook is complete, EC2 Auto Scaling shuts down the instance, bringing capacity back to nine Spot Instances.


Consider using the new Capacity Rebalancing feature for EC2 Auto Scaling in your environment to proactively manage Spot Instance lifecycle. Capacity Rebalancing attempts to maintain workload availability by automatically rebalancing capacity as necessary, providing a seamless and hands-off experience for managing Spot Instance interruptions. Capacity Rebalancing works best when combined with instance type flexibility and the capacity optimized allocation strategy, and may be especially useful for workloads that can easily rebalance across shifting capacity, including:

  • Containerized workloads
  • Big data and analytics
  • Image and media rendering
  • Batch processing
  • Web applications

To learn more about Capacity Rebalancing for EC2 Auto Scaling, please visit the documentation.

To learn more about the new EC2 Instance rebalance recommendation, please visit the documentation.

Announcing Outposts and local gateway sharing for multi-account access

Post Syndicated from Shubha Kumbadakone original https://aws.amazon.com/blogs/compute/announcing-outposts-and-local-gateway-sharing-for-multi-account-access/

This post was contributed by James Devine, Sr. Outposts SA

AWS Outposts enables customers to run AWS services in their on-premises environments. With the release of Outposts and local gateway (LGW) sharing, customers can now configure multi-account access and sharing within an AWS Organization.

Prior to this release, an Outpost was only viable within a single AWS account. VPC sharing was the main way to enable multiple accounts to use Outposts capacity. With the release of Outposts and LGW sharing support, there is now additional functionality to enable multi-account access Outpost capacity within an AWS Organization.

Outposts and LGW sharing is facilitated through AWS Resource Access Manager (RAM). It enables Outposts and LGWs to be shared with AWS accounts within the same AWS Organization. The account that orders Outposts is the owner account that can create resource shares. The accounts that have access to the share are called consumer accounts. Each consumer account can create its own VPCs with subnets that reside on the shared Outpost.

This post will discuss how to start using this new functionality and considerations to take into account.

Use cases

Per AWS best practices, customers typically deploy a number of AWS accounts. Utilizing multiple accounts allows for reduced blast radius and the ability to provide infrastructure isolation by line of business, environment type, and even down to individual workloads. Outposts sharing enables customers to extend their existing AWS account structures to seamlessly integrate with Outposts.

Getting started – creating a resource share

Before any resources can be shared, the first step is to configure an AWS Organization (if one does not already exist). Outposts resources can only be shared with accounts under the same AWS Organization. The Outposts can reside in any account under an organization. For centralized management of Outposts, it is recommend to create a dedicated account, or set of accounts, to host Outposts.

Once an organization is created with member AWS accounts, resources shares can then be created. It’s possible to place multiple resources into a resource share. To facilitate Outposts, LGW, and customer-owned IP (CoIP) sharing, a single resource share can be created that includes all three resources. Principals can then be added to the resource share. The principals can be both organizational units (OUs) and individual AWS account IDs within the AWS Organization. In this case, I’ve shared all three resources with a consumer account ID as a principal, as demonstrated in the following screenshot.

Screen shot of resource sharing.

Sharing an Outpost

After an Outposts is provisioned, the logical Outpost ID can be shared with any account under the AWS Organization. The consumer account then has access to provision resources on the Outposts, such as Amazon EBS volumes and Outposts subnets, as well as launching instances on the shared Outpost.

From the AWS Management Console in the consumer account, I can see the shared Outposts ID, its associated Availability Zone, and the owner account ID.

From the AWS Management Console in the consumer account, I can see the shared Outposts ID, its associated Availability Zone, and the owner account ID.

Once the Outposts ID is selected, I can use the Actions drop down menu to create Outposts subnets and EBS volumes. I can also select Launch instance to provision instances on the Outpost.

Once the Outposts ID is selected, I can use the Actions drop down menu to create Outposts subnets and EBS volumes. I can also select Launch instance to provision instances on the Outpost.

Sharing an LGW

Each consumer account can create their own Outposts subnets within their own VPCs. LGW sharing enables the consumer account to create routes an Outposts subnet route table that has a shared LGW as the destination. This enables Outposts subnets in the consumer account to have communication with the on-premises network through the shared LGW.

The consumer account view shared LGWs, as shown in the following screenshot.

The consumer account view shared LGWs, as shown in the following screenshot.

The consumer account can then select VPCs within the account to associate with the LGW route table. This enables routing to on-premises if a CoIP is assigned to an instance.

This enables routing to on-premises if a CoIP is assigned to an instance.


LGW and Outposts sharing is meant to enable sharing of resources between various accounts within a larger organizational structure. It is not suitable for multi-tenancy outside of an AWS Organization. Additional considerations around capacity planning, access, and local network connectivity should be taken into account.

Resources created in the consumer account are only visible from within the consumer account. The AWS account that owns the Outpost does not have the ability to view instances, EBS volumes, VPCs, subnets, or any other resource created within the consumer account. Since the consumer account is part of an AWS Organization, it is possible to use the default OrganizationAccountAccessRole role that is created by AWS Organizations. This allows for visibility and management of Outpost resources across the AWS Organization.

Capacity information is not shared with the consumer account. However, it is possible to use cross-account CloudWatch metric sharing. Outposts utilization metrics from the account that owns the Outpost can be shared with the consumer account. This allows the consumer account to see what capacity is available on the shared Outposts. I’ve configured the cross-account sharing, and from my consumer account I can see that there is ample c5.xlarge capacity on the shared Outposts.

I’ve configured the cross-account sharing, and from my consumer account I can see that there is ample c5.xlarge capacity on the shared Outposts.

If a principal (consumer account or organizational unit) no longer requires access to Outposts capacity, the resource share can be deleted through RAM in the primary Outposts account. It is important to note that this does not delete subnets, EBS volumes, instances, or other resources running on the shared Outposts. Proper cleanup of Outposts resources within the consumer account (EBS volumes, instances, subnets, etc.) should be planned for whenever removing principals from a resource share to ensure that the capacity is released.


In the blog post, I described the Outposts and LGW sharing capabilities and demonstrated how they can be used to enable multi-account sharing of an Outpost within an AWS Organization. These new capabilities unlock even more customer use cases and allow for stronger blast-radius and account isolation. It’s exciting to see continued functionality come to Outposts! You can start using LGW and Outposts sharing today. There’s no need to upgrade or modify your Outposts in any way to take advantage of this new and exciting functionality.

Architecting for Reliable Scalability

Post Syndicated from Marwan Al Shawi original https://aws.amazon.com/blogs/architecture/architecting-for-reliable-scalability/

Cloud solutions architects should ideally “build today with tomorrow in mind,” meaning their solutions need to cater to current scale requirements as well as the anticipated growth of the solution. This growth can be either the organic growth of a solution or it could be related to a merger and acquisition type of scenario, where its size is increased dramatically within a short period of time.

Still, when a solution scales, many architects experience added complexity to the overall architecture in terms of its manageability, performance, security, etc. By architecting your solution or application to scale reliably, you can avoid the introduction of additional complexity, degraded performance, or reduced security as a result of scaling.

Generally, a solution or service’s reliability is influenced by its up time, performance, security, manageability, etc. In order to achieve reliability in the context of scale, take into consideration the following primary design principals.


Modularity aims to break a complex component or solution into smaller parts that are less complicated and easier to scale, secure, and manage.

Monolithic architecture vs. modular architecture

Figure 1: Monolithic architecture vs. modular architecture

Modular design is commonly used in modern application developments. where an application’s software is constructed of multiple and loosely coupled building blocks (functions). These functions collectively integrate through pre-defined common interfaces or APIs to form the desired application functionality (commonly referred to as microservices architecture).


Scalable modular applications

Figure 2: Scalable modular applications

For more details about building highly scalable and reliable workloads using a microservices architecture, refer to Design Your Workload Service Architecture.

This design principle can also be applied to different components of the solution’s architecture. For example, when building a cloud solution on a single Amazon VPC, it may reach certain scaling limits and make it harder to introduce changes at scale due to the higher level of dependencies. This single complex VPC can be divided into multiple smaller and simpler VPCs. The architecture based on multiple VPCs can vary. For example, the VPCs can be divided based on a service or application building block, a specific function of the application, or on organizational functions like a VPC for various departments. This principle can also be leveraged at a regional level for very high scale global architectures. You can make the architecture modular at a global level by distributing the multiple VPCs across different AWS Regions to achieve global scale (facilitated by AWS Global Infrastructure).

In addition, modularity promotes separation of concerns by having well-defined boundaries among the different components of the architecture. As a result, each component can be managed, secured, and scaled independently. Also, it helps you avoid what is commonly known as “fate sharing,” where a vertically scaled server hosts a monolithic application, and any failure to this server will impact the entire application.

Horizontal scaling

Horizontal scaling, commonly referred to as scale-out, is the capability to automatically add systems/instances in a distributed manner in order to handle an increase in load. Examples of this increase in load could be the increase of number of sessions to a web application. With horizontal scaling, the load is distributed across multiple instances. By distributing these instances across Availability Zones, horizontal scaling not only increases performance, but also improves the overall reliability.

In order for the application to work seamlessly in a scale-out distributed manner, the application needs to be designed to support a stateless scaling model, where the application’s state information is stored and requested independently from the application’s instances. This makes the on-demand horizontal scaling easier to achieve and manage.

This principle can be complemented with a modularity design principle, in which the scaling model can be applied to certain component(s) or microservice(s) of the application stack. For example, only scale-out Amazon Elastic Cloud Compute (EC2) front-end web instances that reside behind an Elastic Load Balancing (ELB) layer with auto-scaling groups. In contrast, this elastic horizontal scalability might be very difficult to achieve for a monolithic type of application.

Leverage the content delivery network

Leveraging Amazon CloudFront and its edge locations as part of the solution architecture can enable your application or service to scale rapidly and reliably at a global level, without adding any complexity to the solution. The integration of a CDN can take different forms depending on the solution use case.

For example, CloudFront played an important role to enable the scale required throughout Amazon Prime Day 2020 by serving up web and streamed content to a worldwide audience, which handled over 280 million HTTP requests per minute.

Go serverless where possible

As discussed earlier in this post, modular architectures based on microservices reduce the complexity of the individual component or microservice. At scale it may introduce a different type of complexity related to the number of these independent components (microservices). This is where serverless services can help to reduce such complexity reliably and at scale. With this design model you no longer have to provision, manually scale, maintain servers, operating systems, or runtimes to run your applications.

For example, you may consider using a microservices architecture to modernize an application at the same time to simplify the architecture at scale using Amazon Elastic Kubernetes Service (EKS) with AWS Fargate.

Example of a serverless microservices architecture

Figure 3: Example of a serverless microservices architecture

In addition, an event-driven serverless capability like AWS Lambda is key in today’s modern scalable cloud solutions, as it handles running and scaling your code reliably and efficiently. See How to Design Your Serverless Apps for Massive Scale and 10 Things Serverless Architects Should Know for more information.

Secure by design

To avoid any major changes at a later stage to accommodate security requirements, it’s essential that security is taken into consideration as part of the initial solution design. For example, if the cloud project is new or small, and you don’t consider security properly at the initial stages, once the solution starts to scale, redesigning the entire cloud project from scratch to accommodate security best practices is usually not a simple option, which may lead to consider suboptimal security solutions that may impact the desired scale to be achieved. By leveraging CDN as part of the solution architecture (as discussed above), using Amazon CloudFront, you can minimize the impact of distributed denial of service (DDoS) attacks as well as perform application layer filtering at the edge. Also, when considering serverless services and the Shared Responsibility Model, from a security lens you can delegate a considerable part of the application stack to AWS so that you can focus on building applications. See The Shared Responsibility Model for AWS Lambda.

Design with security in mind by incorporating the necessary security services as part of the initial cloud solution. This will allow you to add more security capabilities and features as the solution grows, without the need to make major changes to the design.

Design for failure

The reliability of a service or solution in the cloud depends on multiple factors, the primary of which is resiliency. This design principle becomes even more critical at scale because the failure impact magnitude typically will be higher. Therefore, to achieve a reliable scalability, it is essential to design a resilient solution, capable of recovering from infrastructure or service disruptions. This principle involves designing the overall solution in such a way that even if one or more of its components fail, the solution is still be capable of providing an acceptable level of its expected function(s). See AWS Well-Architected Framework – Reliability Pillar for more information.


Designing for scale alone is not enough. Reliable scalability should be always the targeted architectural attribute. The design principles discussed in this blog act as the foundational pillars to support it, and ideally should be combined with adopting a DevOps model.

Amazon EC2 P4d instances deep dive

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/

This post is contributed by Amr Ragab, Senior Solutions Architect, Amazon EC2


AWS is excited to announce that the new Amazon EC2 P4d instances are now generally available. This instance type brings additional benefits with 2.5x higher deep learning performance; adding to the accelerated instances portfolio, new features, and technical breakthroughs that our customers can benefit from with this latest technology. This blog post details some of those key features and how to integrate them into your current workloads and architectures.


P4d instances

As you can see from the generalized block diagram above, the p4d comes with dual socket Intel Cascade Lake 8275CL processors totaling 96 vCPUs at 3.0 GHz with 1.1 TB of RAM and 8 TB of NVMe local storage. P4d also comes with 8 x 40 GB NVIDIA Tesla A100 GPUs with NVSwitch and 400 Gbps Elastic Fabric Adapter (EFA) enabled networking. This instance configuration represents the latest generation of computing for our customers spanning Machine Learning (ML), High Performance Computing (HPC), and analytics.

One of the improvements of the p4d is in the networking stack.  This new instance type has 400 Gbps with support for EFA and GPUDirect RDMA. Now, on AWS, you can take advantage of point-to-point GPU to GPU communication (across nodes), bypassing the CPU. Look out for additional blogs and webinars detailing use cases of GPUDirect and how this feature helps decrease latency and improve performance for certain workloads.

Let’s look at some new features and performance metrics for the P4d instances.


Local ephemeral NVMe storage
The p4d instance type comes with 8 TB of local NVMe storage. Each device has a maximum read/write throughput of 2.7 GB/s. To create a local namespace and staging area for input into the GPUs, you can create a local RAID 0 of all the drives. This results in aggregate read throughput of about 16 GB/s. The following table summarizes the I/O tests on the NVMe drives in this configuration.

FIO – TestBlock SizeThreadsBandwidth
1Sequential Read128k9616.4 GiB/s
2Sequential Write128k968.2 GiB/s
3Random Read128k9616.3 GiB/s
4Random Write128k968.1 GiB/s


Introduced with the p4d instance type is NVSwitch. Every GPU in the node is connected to each other in a full mesh topology up to 600 GB/s bidirectional bandwidth. ML frameworks and HPC applications that use NVIDIA communication collectives library (NCCL) can take full advantage of this all-to-all communication layer.

P4d GPU to GPU bandwidth

P3 GPU to GPU bandwidth

P4d uses a full mesh NVLink topology for optimized all-to-all communication, compared to the previous generation P3/P3dn instances, which have all-to-all communication across various data path domains (NUMA, PCIe switch, NVLink).  This new topology accessed via NCCL will improve performance for multiGPU workloads.
To make optimal use of the NVSwitch ensure that in your instance, all GPUs application boost clocks are set to its maximum values:

sudo nvidia-smi -ac 1215,1410

Multi-Instance GPU (MIG)

It’s now possible, at the user level, to have control of fractionating a GPU into multiple GPU slices, with each GPU slice isolated from each other. This enables multiple users to run different workloads on the same GPU without impacting performance. I walk you through an example implementation of MIG in the following steps:

With every newly launched instance, MIG is disabled. So, you must enable it with the following command:

[email protected]:~# sudo nvidia-smi -mig 1 

Enabled MIG Mode for GPU 00000000:10:1C.0
You can get a list of supported MIG profiles.
Next, you can create seven slices, and create compute instances for each slice.
[email protected]:~# sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 
Successfully created GPU instance ID 9 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 7 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 8 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 11 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 12 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 13 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 14 on GPU 0 using profile MIG 1g.5gb (ID 19)
[email protected]:~# nvidia-smi mig -cci -gi 7,8,9,11,12,13,14 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 7 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 8 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 9 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 11 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 12 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 13 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 14 using profile MIG 1g.5gb (ID 0)

You can split a GPU into a maximum of seven slices. To pass the GPU through into a docker container, you can specify the index pair at runtime:

docker run -it --gpus '"device=1:0"' nvcr.io/nvidia/tensorflow:20.09-tf1-py3

With MIG, you can run multiple smaller workloads on the same GPU without compromising performance. We will follow up with additional blogs on this feature as we integrate it with additional AWS services.


For workloads optimized for multiGPU capabilities, we introduced GPUDirect over EFA fabric. This allows direct GPU-GPU communication across multiple p4d nodes for decreased latency and improved performance. Follow this user guide to get started with installing the EFA driver and setting up the environment. The code sample below can be used as a template to use GPUDirect RDMA over EFA.

/opt/amazon/openmpi/bin/mpirun \
     -n ${NUM_PROCS} -N ${NUM_PROCS_NODE} \
     -x RDMAV_FORK_SAFE=1 -x NCCL_DEBUG=info \
     --hostfile ${HOSTS_FILE} \
     --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
     $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1 -c 1 -n 100

Machine learning Optimizations

You can quickly get started with the all benefits mentioned earlier for the p4d by using our latest Deep Learning AMI (DLAMI). The DLAMI now comes with CUDA11 and the latest NVLink and cuDNN SDKs and drivers to take advantage of the p4d.

TensorFloat32 – TF32

TF32 is a new 19 bit precision datatype from NVIDIA introduced for the first time for the p4d.24xlarge instance. This datatype improves performance with little to no loss of training and validation accuracy for most mainstream models. We have more detailed benchmarks for individual algorithms. But, on the p4d.24xlarge you can achieve approximately a 2.5 fold increase compared to FP32 on the p3dn.24xlarge for mainstream deep learning models.

We have updated our machine learning models here to show examples (see the following chart) of popular algorithms our customers are using today including general DNNs and Bert.

DNNP3dn FP32 (imgs/sec)P3dn FP16 (imgs/sec)P4d Throughput TF32 (imgs/sec)P4d Throughput FP16 (imgs/sec)P4d over p3dn TF32/FP32P4d over P3dn FP16

BERT Large – Wikipedia/Books Corpus

GPUsSequence LengthBatch size / GPU: mixed precision, TF32Gradient Accumulation: mixed precision, TF32Throughput – mixed precision

You can find other code examples at github.com/NVIDIA/DeepLearningExamples.

If you want to builld your own AMI or extend an AMI maintained by your organization you can use the github repo, which provides Packer scripts to build AMIs for Amazon Linux 2 or Ubuntu 18.04 versions.


The stack includes the following components:

  • NVIDIA Driver 450.80.02
  • CUDA 11
  • NVIDIA Fabric Manager
  • cuDNN 8
  • NCCL 2.7.8
  • EFA latest driver
  • FSx kernel and client driver and utilities
  • Intel OneDNN
  • NVIDIA-runtime Docker


Get started with the new P4d instances with support on Amazon EKS, AWS Batch, and Amazon Sagemaker. We are excited to hear about what you develop and run with the new P4d instances. If you have any questions please reach out to your account team. Now, go power up your ML and HPC workloads with NVIDIA Tesla A100s and the P4d instances.