Tag Archives: Technical How-to

Field Notes: Migrating a Self-managed Kubernetes Cluster on Amazon EC2 to Amazon EKS

Post Syndicated from Ahmed Bham original https://aws.amazon.com/blogs/architecture/field-notes-migrating-a-self-managed-kubernetes-cluster-on-ec2-to-amazon-eks/

AWS customers from startups to enterprises have been successfully running Kubernetes clusters on Amazon EC2 instances since 2015, well before Amazon Elastic Kubernetes Service (Amazon EKS), was launched in 2018. As a fully managed Kubernetes service, Amazon EKS customers can run Kubernetes on AWS without needing to install, operate, and maintain their own Kubernetes control plane. Since its launch, many existing and new customers are building and running their Kubernetes clusters on Amazon EKS.

At re:Invent 2019, AWS announced AWS Fargate for Amazon EKS, which is serverless compute for containers. Then, in January 2020, AWS announced a price reduction of Amazon EKS  by 50% to $0.10 per hour, per cluster. These developments, coupled with realization of management and cost overhead of Kubernetes control plane operations at scale, made more customers look into migrating their self-managed Kubernetes clusters to Amazon EKS.

The “how” of this migration is the focus of this blog.

Overview of Solution

For most customers migrating from self-managed Kubernetes clusters on Amazon EC2 to Amazon EKS can usually be accomplished with minimal or no downtime. However, for large clusters involving hundreds of nodes and thousands of pods, this requires more planning and testing, and it is recommended to engage AWS Support for guidance.

Kubernetes control plane


There are certain considerations to ensure a successful Amazon EKS migration and operational excellence.


  • Access Control
    • Amazon EKS uses AWS Identity and Access Management (IAM) to provide authentication to your Kubernetes cluster, but it still relies on native Kubernetes Role Based Access Control (RBAC) for authorization. It’s important to plan for the creation and governance of IAM users, roles, or groups, for Kubernetes cluster administration.
    • You can enable private access to the Kubernetes API server so that all communication between your nodes and the API server stays within your VPC. You can limit the IP addresses that can access your API server from the internet, or completely disable internet access to the API server.
  • IAM Role for Service Account
    • With IAM roles for service accounts on Amazon EKS clusters, you can associate an IAM role with a Kubernetes service account. This service account can then provide AWS permissions to the containers in any pod that uses that service account. With this feature, you no longer need to provide extended permissions to the node IAM role so that pods on that node can call AWS APIs.
  • Security groups for pods
    • Security groups for pods integrate Amazon EC2 security groups with Kubernetes pods. You can use Amazon EC2 security groups to define rules that allow inbound and outbound network traffic to and from pods. These pods are deployed to nodes running on many Amazon EC2 instance types.


  • Amazon EKS supports native VPC networking with the Amazon VPC Container Network Interface (CNI) plugin for Kubernetes. Using this plugin allows Kubernetes pods to have the same IP address inside the pod as they do on the VPC network.
  •  VPC CNI plugin uses IP addresses for the pods from the VPC CIDR ranges, and specifically from the subnet where the worker node is hosted. Therefore, customers must ensure the VPC and subnets that will host their Amazon EKS cluster have sufficient IP addresses available for the expected number of pods running at a time. Additionally, IP addresses are allocated to the Elastic Network Interface (ENI) attached to the EC2 instances.  The EC2 instance selection for the worker nodes should take into account the number of ENI attachments supported by the instance type.

Compute Options

Kubernetes Versions

  • Amazon EKS supports four major Kubernetes versions at a time,  which you can review in the available AWS documentation, along with a calendar for future Amazon EKS releases.
  • If you are currently running a non-supported Kubernetes cluster, or would like to migrate to a newer version on Amazon EKS, consider the following:
    1. Review release notes for specific Kubernetes version you want to migrate to, and make necessary updates to Kubernetes manifest files.
    2. Update your Kubernetes add-ons (CNI plugin, CoreDNS, Kube-Proxy) to compatible versions, as listed in the Updating an Amazon EKS Cluster guide.


  1. Create an IAM Role for the creator and/or administrator of the Amazon EKS cluster. Specify this role when creating the Amazon EKS cluster.
  2. If using an existing VPC and subnets to host Amazon EKS cluster:
    • You will need subnets in at least two Availability Zones
    • All public subnets should have the property MapPublicIpOnLaunch enabled (that is, Auto-assign public IPv4 address in the AWS Management Console) to host self-managed and managed node groups.

3. If your pods are currently accessing AWS resources, and if you would like to use IAM roles for service accounts, then:

    • Create service accounts in Kubernetes to be used by your pods.
    • Follow these steps to create IAM roles, and assign to service account created.
    • Update your pod manifest files to specify the newly created service account and role ARN annotation. Remove any existing code for storing or passing IAM credentials.

4. If you are planning to use AWS Fargate to run your pods, you need to create the appropriate Fargate Profile and pod execution role.

Application and Data Migration

  • For stateless workloads, apply your resource definitions (YAML or Helm) to the new cluster, and make sure everything works as expected. This includes the connection to resources external to the cluster.
  • For stateful workloads:
    1. You will need to carefully plan your migration to avoid data loss or unexpected downtime.
    2. If you are currently using shared persistent file storage based on Amazon EFS or Amazon FSx for Lustre, they can be mounted to Amazon EKS pods concurrently. Just make sure that pods don’t write to the same files concurrently.
    3. For pods using EBS volumes, and for other persistent storage types, you can use a Kubernetes backup and restore tool, Velero.

Traffic Migration

If you have an entire domain migration that you would like to smoothly migrate, you can take advantage of Amazon Route 53’s Weighted Routing (as shown in the following diagram). With Weighted Routing, you are able to have a progressive transition from your existing cluster to the new one with zero downtime by splitting the traffic at the DNS level.

Your customers are slowly being transferred to your new cluster as their cached TTL expires. The split could start with a small share of your customers, for example, 10% being pointed to the new Amazon EKS cluster and 90% still on the old one. As soon as traffic is confirmed to be working as expected on the new cluster, that percentage of clients pointed to the new one can be increased.



This implementation is flexible, it can be tied to Load Balancers, EC2 Instances, and even to external on-premises infrastructure.


In this blog post, we showed how to migrate your live-traffic serving self-hosted Kubernetes Cluster to Amazon EKS. Amazon EKS offers a cost-effective and highly available option for running Kubernetes clusters in the cloud. Since Amazon EKS is upstream Kubernetes compliant, you can migrate existing self-managed Kubernetes workloads to Amazon EKS, with multiple options to minimize or avoid service disruption. To create your first Amazon EKS cluster, visit Getting started with Amazon EKS.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.



Fire Dynamics Simulation CFD workflow using AWS ParallelCluster, Elastic Fabric Adapter, Amazon FSx for Lustre and NICE DCV

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/fire-dynamics-simulation-cfd-workflow-using-aws-parallelcluster-elastic-fabric-adapter-amazon-fsx-for-lustre-and-nice-dcv/

This post was written by By Kevin Tuil, AWS HPC consultant 

Modeling fires is key for many industries, from the design of new buildings, defining evacuation procedures for trains, planes and ships, and even the spread of wildfires. Modeling these fires is complex. It involves both the need to model the three-dimensional unsteady turbulent flow of the fire and the many potential chemical reactions. To achieve this, the fire modeling community has moved to higher-fidelity turbulence modeling approaches such as the Large Eddy Simulation, which requires both significant temporal and spatial resolution. It means that the computational cost for these simulations is typically in the order of days to weeks on a single workstation.
While there are a number of software packages, one of the most popular is the open-source code: Fire Dynamics Simulation (FDS) developed by National Institute of Standards and Technology (NIST).

In this blog, I focus on how AWS High Performance Computing (HPC) resources (e.g AWS ParallelCluster, Amazon FSx for Lustre, Elastic Fabric Adapter (EFA), and Amazon S3) allow FDS users to scale up beyond a single workstation to hundreds of cores to achieve simulation times of hours rather than days or weeks. In this blog, I outline the architecture needed, providing scripts and templates to compile FDS and run your simulation.

Service and solution overview

AWS ParallelCluster

AWS ParallelCluster is an open source cluster management tool that simplifies deploying and managing HPC clusters with Amazon FSx for Lustre, EFA, a variety of job schedulers, and the MPI library of your choice. AWS ParallelCluster simplifies cluster orchestration on AWS so that HPC environments become easy-to-use, even if you are new to the cloud. AWS released AWS ParallelCluster 2.9.1 and its user guide – which is the version I use in this blog.

These three AWS HPC resources are optimal for Fire Dynamics Simulation. Together, they provide easy deployment of HPC systems on AWS, low latency network communication for MPI workloads, and a fast, parallel file system.

Elastic Fabric Adapter

EFA is a critical service that provides low latency and high-bandwidth 100 Gbps network communication. EFA allows applications to scale at the level of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS Cloud. Computational Fluid Dynamics (CFD), among other tightly coupled applications, is an excellent candidate for the use of EFA.

Amazon FSx for Lustre

Amazon FSx for Lustre is a fully managed, high-performance file system, optimized for fast processing workloads, like HPC. Amazon FSx for Lustre allows users to access and alter data from either Amazon S3 or on-premises seamlessly and exceptionally fast. For example, you can launch and run a file system that provides sub-millisecond latency access to your data. Additionally, you can read and write data at speeds of up to hundreds of gigabytes per second of throughput, and millions of IOPS. This speed and low-latency unleash innovation at an unparalleled pace. This blog post uses the latest version of Amazon FSx for Lustre, which recently added a new API for moving data in and out of Amazon S3. This API also includes POSIX support, which allows files to mount with the same user id. Additionally, the latest version also includes a new backup feature that allows you to back up your files to an S3 bucket.

Solution and steps

The overall solution that I deploy in this blog is represented in the following diagram:

solution overview diagram

Step 1: Access to AWS Cloud9 terminal and upload data

There are two ways to start using AWS ParallelCluster. You can either install AWS CLI or turn on AWS Cloud9, which is a cloud-based integrated development environment (IDE) that includes a terminal. For simplicity, I use AWS Cloud9 to create the HPC cluster. Please refer to this link to proceed to AWS Cloud9 set up and to this link for AWS CLI setup.

Once logged into your AWS Cloud9 instance, the first thing you want to create is the S3 bucket. This bucket is key to exchange user data in and out from the corporate data center and the AWS HPC cluster. Please make sure that your bucket name is unique globally, meaning there is only one worldwide across all AWS Regions.

aws s3 mb s3://fds-smv-bucket-unique
make_bucket: fds-smv-bucket-unique

Download the latest FDS-SMV Linux version package from the official NIST website. It looks something like: FDS6.7.4_SMV6.7.14_lnx.sh

For the geometry, it should be renamed to “geometry.fds”, and must be uploaded to your AWS Cloud9 or directly to your S3 bucket.

Please note that once the FDS-SMV package has been downloaded locally to the instance, you must upload it to the S3 bucket using the following command.

aws s3 cp FDS6.7.4_SMV6.7.14_lnx.sh s3://fds-smv-bucket-unique
aws s3 cp geometry.fds s3://fds-smv-bucket-unique

You use the same S3 bucket to install FDS-SMV later on with the Amazon FSx for Lustre File System.

Step 2: Set up AWS ParallelCluster

You can install AWS ParallelCluster running the following command from your AWS Cloud9 instance:

sudo pip install aws-parallelcluster

Once it is installed, you can run the following command to check the version:

pcluster version 

At the time of writing this blog, 2.9.1 is the most up-to-date version.

Then use the text editor of your choice and open the configuration file as follows:

vim ~/.parallelcluster/config

Replace the bolded section, if not yet filled in, by your own information and save the configuration file.

aws_region_name = <AWS-REGION>

sanity_check = true
cluster_template = fds-smv-cluster
update_check = true

[vpc public]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<SUBNET-ID>

[cluster fds-smv-cluster]
key_name = <Key-Name>
vpc_settings = public
initial_queue_size = 0
max_queue_size = 100
cluster_type = ondemand
placement_group = DYNAMIC
placement = compute
base_os = alinux2
tags = {"Name" : "fds-smv"}
disable_hyperthreading = true
fsx_settings = fsxshared
enable_efa = compute
dcv_settings = hpc-dcv

[dcv hpc-dcv]
enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://fds-smv-bucket-unique
imported_file_chunk_size = 1024
export_path = s3://fds-smv-bucket-unique

ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Let’s review the different sections of the configuration file and explain their role:

  • scheduler: Supported job schedulers are SGE, TORQUE, SLURM and AWS Batch. I have selected SLURM for this example.
  • cluster_type: You have the choice between On-Demand (ondemand) or Spot Instances (spot) for your compute instances. For On-Demand, instances are available for use without condition (if available in the Region selected) at a certain price per hour with the pay-as-you-go model, meaning that as soon as they are started, they are reserved for your utilization. For Spot Instances, you can take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. You can use Spot Instances for various stateless, fault-tolerant, or flexible applications such as HPC, for more information about Spot Instances, feel free to visit this webpage.
  • s3_read_write_resource: This parameter allows you to read and write objects directly on your S3 bucket from the cluster you created without additional permissions. It acts as a role for your cluster, allowing you access to your specified S3 bucket.  
  • placement_groupUse DYNAMIC to ensure that your instances are located as physically close to one another as possible. Close placement minimizes the latency between compute nodes and takes advantage of EFA’s low latency networking.
  • placement: By selecting compute you only enforce compute instances to be placed within the same placement group, leaving the head node placement free.
  • compute_instance_type:Select C5n.18xlarge because it is optimized for compute-intensive workloads and supports EFA for better scaling of HPC applications. Note that EFA is supported only for specific instance types. Please visit currently supported instances for more information.
  • master_instance_type:This can be any instance type. As traffic between head and compute nodes is relatively small, and the head node runs during the entire lifetime of the cluster, I use c5.xlarge because it is inexpensive and is a good fit for this use case.
  • initial_queue_size:You start with no compute instances after the HPC cluster is up. This means that any new job submitted has some delay (time for the nodes to be powered on) before they are seen as available by the job scheduler. This helps you pay for what you use and keeps costs as low as possible.
  • max_queue_size:Limit the maximum compute fleet to 100 instances. This allows you room to scale your jobs up to a large number of cores, while putting a limit on the number of compute nodes to help control costs.
  • base_osFor this blog, select Amazon Linux 2 (alinux2) as a base OS. Currently we also support Amazon Linux (alinux), CentOS 7 (centos7), Ubuntu 16.04 (ubuntu1604), and Ubuntu 18.04 (ubuntu1804) with EFA.
  • disable_hyperthreading: This setting turns off hyperthreading (true) on your cluster, which is the right configuration in this use case.[fsx fsxshared]: This section contains the settings to define your FSx for Lustre parallel file system, including the location where the shared directory is mounted, the storage capacity for the file system, the chunk size for files to be imported, and the location from which the data will be imported. You can read more about FSx for Lustre here.
  • enable_efa: Mark as (true) in this use case since it is a tightly coupled CFD simulation use case.
  • dcv_settings:With AWS ParallelCluster, you can use NICE DCV to support your remote visualization needs.
  • [dcv hpc-dcv]:This section contains the settings to define your remote visualization setup. You can read more about DCV with AWS ParallelCluster here.
  • import_path: This parameter enables all the objects on the S3 bucket available when creating the cluster to be seen directly from the FSx for Lustre file system. In this case, you are able to access the FDS-SMV package and the geometry under the /fsx mounted folder.
  • export_path: This parameter is useful for backup purposes using the Data Repository Tasks. I share more details about this in step 7 (optional).

Step 3: Create the HPC cluster and log in

Now, you can create the HPC cluster, named fds-smv. It takes around 10 minutes to complete and you can see the status changing (going through the different AWS CloudFormation template steps). At the end of creation, two IP addresses are prompted, a public IP and/or a private IP depending on your network choice.

pcluster create fds-smv
Creating stack named: parallelcluster-fds-smv
Status: parallelcluster-fds-smv - CREATE_COMPLETE                               
MasterPublicIP: X.X.X.X
ClusterUser: ec2-user
MasterPrivateIP: X.X.X.X

In order to log in, you must use the key you specified in the AWS ParallelCluster configuration file before creating the cluster:

pcluster ssh fds-smv -i <Key-Name>

You should now be logged in as an ec2-user (since we are using Amazon Linux 2 base OS).

Step 4: Install FDS-SMV package

Now that the HPC cluster using AWS ParallelCluster is set up, it is time to install the FDS-SMV package.  In the prior steps, you uploaded both the FDS-SMV package and the geometry to your S3 bucket. Since you enabled “import_path” to that bucket, they are already available on the Amazon FSx for Lustre storage under /fsx.

Run the script as follows and select /fsx/fds-smv as final target for installation:

cd /fsx
[[email protected] fsx]$ ./FDS6.7.4_SMV6.7.14_lnx.sh 

Installing FDS and Smokeview  for Linux

  1) Press <Enter> to begin installation [default]
  2) Type "extract" to copy the installation files to:

FDS install options:
  Press 1 to install in /home/ec2-user/FDS/FDS6 [default]
  Press 2 to install in /opt/FDS/FDS6
  Press 3 to install in /usr/local/bin/FDS/FDS6
  Enter a directory path to install elsewhere

It is important to source the following scripts as part of the installed packages to check if the installation is successful with the correct versions. Here is the correct output you should get:

[[email protected] ~]$ source /fsx/fds-smv/bin/SMV6VARS.sh 
[[email protected] ~]$ source /fsx/fds-smv/bin/FDS6VARS.sh 
[[email protected] ~]$ fds -version
FDS revision       : FDS6.7.4-0-gbfaa110-release
MPI library version: Intel(R) MPI Library 2019 Update 4 for Linux* OS

[[email protected] ~]$ smokeview -version

Smokeview  SMV6.7.14-0-g568693b-release - Mar  9 2020

Revision         : SMV6.7.14-0-g568693b-release
Revision Date    : Wed Mar 4 23:13:42 2020 -0500
Compilation Date : Mar  9 2020 16:31:22
Compiler         : Intel C/C++
Checksum(SHA1)   : e801eace7c6597dc187739e51ba6f546bfde4e48
Platform         : LINUX64

Important notes:

The way FDS-SMV package has been installed is the default installation. Binaries are already compiled and Intel MPI libraries are embedded as part of the installation package. It is what one would call a self-contained application. For further builds and source codes, please visit this webpage.

Step 5: Running the fire dynamics simulation using FDS

Now that everything is installed, it is time to create the SLURM submission script. In this step, you take advantage of the FSx for Lustre File System, the compute-optimized instance, and the EFA network to maximize simulation performance.

cd /fsx/
vi fds-smv.sbatch

Here is the information you should specify in your submission script:

#SBATCH --job-name=fds-smv-job
#SBATCH --ntasks=<Total number of MPI processes>
#SBATCH --ntasks-per-node=36
#SBATCH --output=%x_%j.out

source /fsx/fds-smv/bin/FDS6VARS.sh
source /fsx/fds-smv/bin/SMV6VARS.sh

module load intelmpi 

export I_MPI_PIN_DOMAIN=omp

cd /fsx/<results>

time mpirun -ppn 36 -np <Total number of MPI processes>  fds geometry.fds

Replace the <results> with the one of your choice, and don’t forget to copy the geometry.fds file in it before submitting your job. Once ready, save the file and submit the job using the following command:

sbatch fds-smv.sbatch 

If you decided to build your HPC cluster with c5n.18xlarge instances, the number of MPI processes per node is 36 since you turned off the hyperthreading, and that the instance has 36 physical cores. That is the meaning of the “#SBATCH --ntasks-per-node=36” line.

For any run exceeding 36 MPI processes, the job is split among multiple instances and take advantage of EFA for internode communication.

It is important to note that FDS only allows the number of MPI processes to be equal to the number of meshes in the input geometry (geometry.fds in this scenario). In case the number of meshes in the input geometry cannot be modified, OpenMP threads can be enabled and efficiently increase performance. Do this using up to four OpenMP Threads across four CPU cores attached to one MPI process.

Please read best practices provided by NIST for that topic on their user guide.

In order to take advantage of the distributed computing capability of FDS, it is mandatory to work first on the input geometry, and divide it into the appropriate number of meshes. It is also highly advised to evenly distribute the number of cells/elements per mesh across all meshes. This best practice optimizes the load balancing for each CPU core.

Step 6: Visualizing the results using NICE DCV and SMV

In order to visualize results, you must connect to the head node using NICE DCV streaming protocol.

As a reminder, the current instance type for the head node is a c5.xlarge, which is not a graphics-accelerated instance. For heavy and GPU intensive visualization, it is important to set up a more appropriate instance such as the G4 instance group.

Go back to your AWS Cloud9 instance, open a new terminal side by side to your session connected to your AWS HPC cluster, and enter the following command in the terminal:

pcluster dcv connect fds-smv -k <Key-Name>

You are provided a one-time HTTPS URL available for a short period of time in order to connect to your head node using the NICE DCV protocol.

Once connected, open the terminal inside your session and source the FDS-SMV scripts as before:

source /fsx/fds-smv/bin/FDS6VARS.sh
source /fsx/fds-smv/bin/SMV6VARS.sh

Navigate to your <results> folder and start SMV with your result.

I have selected one of the geometries named fire_whirl_pool.fds in the Examples folder, part of the default FDS-SMV installation package located here:


You can find other scenarios under the Examples folder to run some more use cases if you did not already choose your geometry.fds file.

Now you can run SMV and visualize your results:

smokeview fire_whirl_pool.smv

SMV (smokeview) takes as an input .smv extension files, please replace with your appropriate file. If you have already chosen your geometry.fds, then run the following command:

smokeview geometry.smv

The application then open as follows, and you can visualize the results. The following image is an output of the SOOT DENSITY of the 3D smoke.

fire simulation picture

Step 7 (optional): Back up your FDS-SMV results to an S3 bucket

First update the AWS CLI to its most recent version. It is compatible with 1.16.309 and above.

After running your FDS-SMV simulation, you can back up your data in /fsx to the S3 bucket you used earlier to upload the installation package, and input files using Data Repository Tasks.

Data Repository Tasks represent bulk operations between your Amazon FSx for Lustre file system and your S3 bucket. One of the jobs is to export your changed file system contents back to its linked S3 bucket.

Open your AWS Cloud9 terminal and exit the HPC head node cluster. Retrieve your Amazon FSx for Lustre ID using:

aws fsx describe-file-systems

It looks something like, fs-0533eebf1148fc8dd. Then create a backup of the data as follows:

aws fsx create-data-repository-task --file-system-id fs-0533eebf1148fc8dd --type EXPORT_TO_REPOSITORY --paths results --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://fds-smv-bucket-unique/

The following are definitions about the command parameters:

  • file-system-id: Your file system ID.
  • type EXPORT_TO_REPOSITORY: Exports the data back to the S3 bucket.
  • paths results: The directory you want to export to your S3 bucket. If you have more than one folder to back up, use a comma-separated notation such as: results1,results2,…
  • Format=REPORT_CSV_20191124: Note this is only the name the Amazon FSx Lustre supports. Please keep it the same.

You can check the backup status by running:

aws fsx describe-data-repository-tasks

Please wait for the copy to be achieved, once finished you should see on the Lifecycle line "Lifecycle": "SUCCEEDED"

Also go back to your S3 bucket, and your folder(s) should appear with all the files correctly uploaded from your /fsx folder you specified.

In terms of data management, Amazon S3 is an important service. You started by uploading installation package and geometry files from an external source, such as your laptop or an on-premises system. Then made these files available to the AWS HPC cluster under the Amazon FSx for Lustre file system and ran the simulation. Finally, you backed up the results from the Amazon FSx for Lustre to Amazon S3. You can also decide to download the results on Amazon S3 back to your local system if needed.

Step 8: Delete your AWS resources created during the deployment of this blog

After your run is completed and your data backed up successfully (Step 7 is optional) on your S3 bucket, you can then delete your cluster by using the following command in your Cloud9 terminal:

pcluster delete fds-smv


If you run the command above all resources you created during this blog are automatically deleted beside your Cloud9 session and your data on your S3 bucket you created earlier.

Your S3 bucket still contains your input “geometry.fds” and your installation package “FDS6.7.4_SMV6.7.14_lnx.sh” files.

If you selected to back up your data during Step 7 (optional), then your S3 bucket also contains that data on top of the two previous files mentioned above.

If you want to delete your S3 bucket and all data mentioned above, go to your AWS Management Console, select S3 service then select your S3 bucket and hit delete on the top section.

If you want to terminate your Cloud9 session, go to your AWS Management Console, select Cloud9 service then select your session and hit delete on the top right section.

After performing these operations, there will be no more resources running on AWS related to this blog.


I showed that AWS ParallelCluster, Amazon FSx for Lustre, EFA, and Amazon S3 are key AWS services and features for HPC workloads such as CFD and in particular for FDS.

You can achieve simulation times of hours on AWS rather than days or weeks on a single workstation.

Please visit this workshop  for a more in-depth tutorial on running Fire Dynamics Simulation on AWS and our HPC dedicated homepage.


Event-driven architecture for using third-party Git repositories as source for AWS CodePipeline

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/devops/event-driven-architecture-for-using-third-party-git-repositories-as-source-for-aws-codepipeline/

In the post Using Custom Source Actions in AWS CodePipeline for Increased Visibility for Third-Party Source Control, we demonstrated using custom actions in AWS CodePipeline and a worker that periodically polls for jobs and processes further to get the artifact from the Git repository. In this post, we discuss using an event-driven architecture to trigger an AWS CodePipeline pipeline that has a third-party Git repository within the source stage that is part of a custom action.

Instead of using a worker to periodically poll for available jobs across all pipelines, we can define a custom source action on a particular pipeline to trigger an Amazon CloudWatch Events rule when the webhook on CodePipeline receives an event and puts it into an In Progress state. This works exactly like how CodePipeline works with natively supported Git repositories like AWS CodeCommit or GitHub as a source.

Solution architecture

The following diagram shows how you can use an event-driven architecture with a custom source stage that is associated with a third-party Git repository that isn’t supported by CodePipeline natively. For our use case, we use GitLab, but you can use any Git repository that supports Git webhooks.


The architecture includes the following steps:

1. A user commits code to a Git repository.

2. The commit invokes a Git webhook.

3. This invokes a CodePipeline webhook.

4. The CodePipeline source stage is put into In Progress status.

5. The source stage action triggers a CloudWatch Events rule that indicates the stage started.

6. The CloudWatch event triggers an AWS Lambda function.

7. The function polls for the job details of the custom action.

8. The function also triggers AWS CodeBuild and passes all the job-related information.

9. CodeBuild gets the public SSH key stored in AWS Secrets Manager (or user name and password, if using HTTPS Git access).

10. CodeBuild clones the repository for a particular branch.

11. CodeBuild zips and uploads the archive to the CodePipeline artifact store Amazon Simple Storage Service (Amazon S3) bucket.

12. A Lambda function sends a success message to the CodePipeline source stage so it can proceed to the next stage.

Similarly, with the same setup, if you chose a release change for the pipeline that has custom source stage, a CloudWatch event is triggered, which triggers a Lambda function, and the same process repeats until it gets the artifact from the Git repository.

Solution overview

To set up the solution, you complete the following steps:

1. Create an SSH key pair for authenticating to the Git repository.

2. Publish the key to Secrets Manager.

3. Launch the AWS CloudFormation stack to provision resources.

4. Deploy a sample CodePipeline and test the custom action type.

5. Retrieve the webhook URL.

6. Create a webhook and add the webhook URL.

Creating an SSH key pair
You first create an SSH key pair to use for authenticating to the Git repository using ssh-keygen on your terminal. See the following code:

ssh-keygen -t rsa -b 4096 -C "[email protected]"

Follow the prompt from ssh-keygen and give a name for the key, for example codepipeline_git_rsa. This creates two new files in the current directory: codepipeline_git_rsa and codepipeline_git_rsa.pub.

Make a note of the contents of codepipeline_git_rsa.pub and add it as an authorized key for your Git user. For instructions, see Adding an SSH key to your GitLab account.

Publishing the key
Publish this key to Secrets Manager using the AWS Command Line Interface (AWS CLI):

export SecretsManagerArn=$(aws secretsmanager create-secret --name codepipeline_git \
--secret-string file://codepipeline_git_rsa --query ARN --output text)

Make a note of the ARN, which is required later.

Alternative, you can create a secret on the Secrets Manager console.

Make sure that the lines in the private key codepipeline_git are the same when the value to the secret is added.

Launching the CloudFormation stack

Clone the git repository aws-codepipeline-third-party-git-repositories that contains the AWS CloudFormation templates and AWS Lambda function code using the below command:

git clone https://github.com/aws-samples/aws-codepipeline-third-party-git-repositories.git .

Now you should have the below files in the cloned repository


Launch the CloudFormation stack using the template third_party_git_custom_action.yaml from the cfn directory. The main resources created by this stack are:

1. CodePipeline Custom Action Type. ResourceType: AWS::CodePipeline::CustomActionType
2. Lambda Function. ResourceType: AWS::Lambda::Function
3. CodeBuild Project. ResourceType: AWS::CodeBuild::Project
4. Lambda Execution Role. ResourceType: AWS::IAM::Role
5. CodeBuild Service Role. ResourceType: AWS::IAM::Role

These resources help uplift the logic for connecting to the Git repository, which for this post is GitLab.

Upload the Lambda function code to any S3 bucket in the same Region where the stack is being deployed. To create a new S3 bucket, use the following code (make sure to provide a unique name):

export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export S3_BUCKET_NAME=codepipeline-git-custom-action-${ACCOUNT_ID} 
aws s3 mb s3://${S3_BUCKET_NAME} --region us-east-1

Then zip the contents of the function and upload to the S3 bucket (substitute the appropriate bucket name):

export ZIP_FILE_NAME="codepipeline_git.zip"
zip -jr ${ZIP_FILE_NAME} ./lambda/lambda_function.py && \
aws s3 cp codepipeline_git.zip \

If you don’t have a VPC and subnets that Lambda and CodeBuild require, you can create those by launching the following CloudFormation stack.

Run the following AWS CLI command to deploy the third-party Git source solution stack:

export vpcId="vpc-123456789"
export subnetId1="subnet-12345"
export subnetId2="subnet-54321"
export GIT_SOURCE_STACK_NAME="thirdparty-codepipeline-git-source"
aws cloudformation create-stack \
--stack-name ${GIT_SOURCE_STACK_NAME} \
--template-body file://$(pwd)/cfn/third_party_git_custom_action.yaml \
--parameters ParameterKey=SourceActionVersion,ParameterValue=1 \
ParameterKey=SourceActionProvider,ParameterValue=CustomSourceForGit \
ParameterKey=GitPullLambdaSubnet,ParameterValue=${subnetId1}\\,${subnetId2} \
ParameterKey=GitPullLambdaVpc,ParameterValue=${vpcId} \
ParameterKey=LambdaCodeS3Bucket,ParameterValue=${S3_BUCKET_NAME} \
ParameterKey=LambdaCodeS3Key,ParameterValue=${ZIP_FILE_NAME} \
--capabilities CAPABILITY_IAM

Alternatively, launch the stack by choosing Launch Stack:


For more information about the VPC requirements for Lambda and CodeBuild, see Internet and service access for VPC-connected functions and Use AWS CodeBuild with Amazon Virtual Private Cloud, respectively.

A custom source action type is now available on the account where you deployed the stack. You can check this on the CodePipeline console by attempting to create a new pipeline. You can see your source type listed under Source provider.


Testing the pipeline

We now deploy a sample pipeline and test the custom action type using the template sample_pipeline_custom.yaml from the cfn directory . You can run the following AWS CLI command to deploy the CloudFormation stack:

Note: Please provide the GitLab repository url to SSH_URL environment variable that you have access to or create a new GitLab project and repository. The example url "[email protected]:kirankumar15/test.git" is for illustration purposes only.

export SSH_URL="[email protected]:kirankumar15/test.git"
export SAMPLE_STACK_NAME="third-party-codepipeline-git-source-test"
aws cloudformation create-stack \
--stack-name ${SAMPLE_STACK_NAME}\
--template-body file://$(pwd)/cfn/sample_pipeline_custom.yaml \
--parameters ParameterKey=Branch,ParameterValue=master \
ParameterKey=GitUrl,ParameterValue=${SSH_URL} \
ParameterKey=SourceActionVersion,ParameterValue=1 \
ParameterKey=SourceActionProvider,ParameterValue=CustomSourceForGit \
ParameterKey=CodePipelineName,ParameterValue=sampleCodePipeline \
ParameterKey=SecretsManagerArnForSSHPrivateKey,ParameterValue=${SecretsManagerArn} \
ParameterKey=GitWebHookIpAddress,ParameterValue= \
--capabilities CAPABILITY_IAM

Alternatively, choose Launch stack:


Retrieving the webhook URL
When the stack creation is complete, retrieve the CodePipeline webhook URL from the stack outputs. Use the following AWS CLI command:

aws cloudformation describe-stacks \
--stack-name ${SAMPLE_STACK_NAME}\ 
--output text \
--query "Stacks[].Outputs[?OutputKey=='CodePipelineWebHookUrl'].OutputValue"

For more information about stack outputs, see Outputs.

Creating a webhook

You can use an existing GitLab repository or create a new GitLab repository and follow the below steps to add a webhook to it.
To create your webhook, complete the following steps:

1. Navigate to the Webhooks Settings section on the GitLab console for the repository that you want to have as a source for CodePipeline.

2. For URL, enter the CodePipeline webhook URL you retrieved in the previous step.

3. Select Push events and optionally enter a branch name.

4. Select Enable SSL verification.

5. Choose Add webhook.


For more information about webhooks, see Webhooks.

We’re now ready to test the solution.

Testing the solution

To test the solution, we make changes to the branch that we passed as the branch parameter in the GitLab repository. This should trigger the pipeline. On the CodePipeline console, you can see the Git Commit ID on the source stage of the pipeline when it succeeds.

Note: Please provide the GitLab repository url that you have access to or create a new GitLab repository. and make sure that it has buildspec.yml in the contents to execute in AWS CodeBuild project in the Build stage. The example url "[email protected]:kirankumar15/test.git" is for illustration purposes only.

Enter the following code to clone your repository:

git clone [email protected]:kirankumar15/test.git .

Add a sample file to the repository with the name sample.txt, then commit and push it to the repository:

echo "adding a sample file" >> sample_text_file.txt
git add ./
git commit -m "added sample_test_file.txt to the repository"
git push -u origin master

The pipeline should show the status In Progress.


After few minutes, it changes to Succeeded status and you see the Git Commit message on the source stage.


You can also view the Git Commit message by choosing the execution ID of the pipeline, navigating to the timeline section, and choosing the source action. You should see the Commit message and Commit ID that correlates with the Git repository.



If the CodePipeline fails, check the Lambda function logs for the function with the name GitLab-CodePipeline-Source-${ACCOUNT_ID}. For instructions on checking logs, see Accessing Amazon CloudWatch logs for AWS Lambda.

If the Lambda logs has the CodeBuild build ID, then check the CodeBuild run logs for that build ID for errors. For instructions, see View detailed build information.

Cleaning up

Delete the CloudFormation stacks that you created. You can use the following AWS CLI commands:

aws cloudformation delete-stack --stack-name ${SAMPLE_STACK_NAME}

aws cloudformation delete-stack --stack-name ${GIT_SOURCE_STACK_NAME}

Alternatively, delete the stack on the AWS CloudFormation console.

Additionally, empty the S3 bucket and delete it. Locate the bucket in the ${SAMPLE_STACK_NAME} stack. Then use the following AWS CLI command:

aws s3 rb s3://${S3_BUCKET_NAME} --force

You can also delete and empty the bucket on the Amazon S3 console.


You can use the architecture in this post for any Git repository that supports webhooks. This solution also works if the repository is reachable only from on premises, and if the endpoints can be accessed from a VPC. This event-driven architecture works just like using any natively supported source for CodePipeline.


About the Author

kirankumar.jpegKirankumar Chandrashekar is a DevOps consultant at AWS Professional Services. He focuses on leading customers in architecting DevOps technologies. Kirankumar is passionate about Infrastructure as Code and DevOps. He enjoys music, as well as cooking and travelling.


Building a cross-account CI/CD pipeline for single-tenant SaaS solutions

Post Syndicated from Rafael Ramos original https://aws.amazon.com/blogs/devops/cross-account-ci-cd-pipeline-single-tenant-saas/

With the increasing demand from enterprise customers for a pay-as-you-go consumption model, more and more independent software vendors (ISVs) are shifting their business model towards software as a service (SaaS). Usually this kind of solution is architected using a multi-tenant model. It means that the infrastructure resources and applications are shared across multiple customers, with mechanisms in place to isolate their environments from each other. However, you may not want or can’t afford to share resources for security or compliance reasons, so you need a single-tenant environment.

To achieve this higher level of segregation across the tenants, it’s recommended to isolate the environments on the AWS account level. This strategy brings benefits, such as no network overlapping, no account limits sharing, and simplified usage tracking and billing, but it comes with challenges from an operational standpoint. Whereas multi-tenant solutions require management of a single shared production environment, single-tenant installations consist of dedicated production environments for each customer, without any shared resources across the tenants. When the number of tenants starts to grow, delivering new features at a rapid pace becomes harder to accomplish, because each new version needs to be manually deployed on each tenant environment.

This post describes how to automate this deployment process to deliver software quickly, securely, and less error-prone for each existing tenant. I demonstrate all the steps to build and configure a CI/CD pipeline using AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, and AWS CloudFormation. For each new version, the pipeline automatically deploys the same application version on the multiple tenant AWS accounts.

There are different caveats to build such cross-account CI/CD pipelines on AWS. Because of that, I use AWS Command Line Interface (AWS CLI) to manually go through the process and demonstrate in detail the various configuration aspects you have to handle, such as artifact encryption, cross-account permission granting, and pipeline actions.

Single-tenancy vs. multi-tenancy

One of the first aspects to consider when architecting your SaaS solution is its tenancy model. Each brings their own benefits and architectural challenges. On multi-tenant installations, each customer shares the same set of resources, including databases and applications. With this mode, you can use the servers’ capacity more efficiently, which generally leads to significant cost-saving opportunities. On the other hand, you have to carefully secure your solution to prevent a customer from accessing sensitive data from another. Designing for high availability becomes even more critical on multi-tenant workloads, because more customers are affected in the event of downtime.

Because the environments are by definition isolated from each other, single-tenant solutions are simpler to design when it comes to security, networking isolation, and data segregation. Likewise, you can customize the applications per customer, and have different versions for specific tenants. You also have the advantage of eliminating the noisy-neighbor effect, and can plan the infrastructure for the customer’s scalability requirements. As a drawback, in comparison with multi-tenant, the single-tenant model is operationally more complex because you have more servers and applications to maintain.

Which tenancy model to choose depends ultimately on whether you can meet your customer needs. They might have specific governance requirements, be bound to a certain industry regulation, or have compliance criteria that influences which model they can choose. For more information about modeling your SaaS solutions, see SaaS on AWS.

Solution overview

To demonstrate this solution, I consider a fictitious single-tenant ISV with two customers: Unicorn and Gnome. It uses one central account where the tools reside (Tooling account), and two other accounts, each representing a tenant (Unicorn and Gnome accounts). As depicted in the following architecture diagram, when a developer pushes code changes to CodeCommit, Amazon CloudWatch Events  triggers the CodePipeline CI/CD pipeline, which automatically deploys a new version on each tenant’s AWS account. It ensures that the fictitious ISV doesn’t have the operational burden to manually re-deploy the same version for each end-customers.

Architecture diagram of a CI/CD pipeline for single-tenant SaaS solutions

For illustration purposes, the sample application I use in this post is an AWS Lambda function that returns a simple JSON object when invoked.


Before getting started, you must have the following prerequisites:

Setting up the Git repository

Your first step is to set up your Git repository.

  1. Create a CodeCommit repository to host the source code.

The CI/CD pipeline is automatically triggered every time new code is pushed to that repository.

  1. Make sure Git is configured to use IAM credentials to access AWS CodeCommit via HTTP by running the following command from the terminal:
git config --global credential.helper '!aws codecommit credential-helper [email protected]'
git config --global credential.UseHttpPath true
  1. Clone the newly created repository locally, and add two files in the root folder: index.js and application.yaml.

The first file is the JavaScript code for the Lambda function that represents the sample application. For our use case, the function returns a JSON response object with statusCode: 200 and the body Hello!\n. See the following code:

exports.handler = async (event) => {
    const response = {
        statusCode: 200,
        body: `Hello!\n`,
    return response;

The second file is where the infrastructure is defined using AWS CloudFormation. The sample application consists of a Lambda function, and we use AWS Serverless Application Model (AWS SAM) to simplify the resources creation. See the following code:

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Description: Sample Application.

        Type: String
        Type: String
        Type: String
        Type: 'AWS::Serverless::Function'
            FunctionName: !Ref ApplicationName
            Handler: index.handler
            Runtime: nodejs12.x
                Bucket: !Ref S3Bucket
                Key: !Ref S3Key
            Description: Hello Lambda.
            MemorySize: 128
            Timeout: 10
  1. Push both files to the remote Git repository.

Creating the artifact store encryption key

By default, CodePipeline uses server-side encryption with an AWS Key Management Service (AWS KMS) managed customer master key (CMK) to encrypt the release artifacts. Because the Unicorn and Gnome accounts need to decrypt those release artifacts, you need to create a customer managed CMK in the Tooling account.

From the terminal, run the following command to create the artifact encryption key:

aws kms create-key --region <YOUR_REGION>

This command returns a JSON object with the key ARN property if run successfully. Its format is similar to arn:aws:kms:<YOUR_REGION>:<TOOLING_ACCOUNT_ID>:key/<KEY_ID>. Record this value to use in the following steps.

The encryption key has been created manually for educational purposes only, but it’s considered a best practice to have it as part of the Infrastructure as Code (IaC) bundle.

Creating an Amazon S3 artifact store and configuring a bucket policy

Our use case uses Amazon Simple Storage Service (Amazon S3) as artifact store. Every release artifact is encrypted and stored as an object in an S3 bucket that lives in the Tooling account.

To create and configure the artifact store, follow these steps in the Tooling account:

  1. From the terminal, create an S3 bucket and give it a unique name:
aws s3api create-bucket \
    --bucket <BUCKET_UNIQUE_NAME> \
    --region <YOUR_REGION> \
    --create-bucket-configuration LocationConstraint=<YOUR_REGION>
  1. Configure the bucket to use the customer managed CMK created in the previous step. This makes sure the objects stored in this bucket are encrypted using that key, replacing <KEY_ARN> with the ARN property from the previous step:
aws s3api put-bucket-encryption \
    --bucket <BUCKET_UNIQUE_NAME> \
    --server-side-encryption-configuration \
            "Rules": [
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": "<KEY_ARN>"
  1. The artifacts stored in the bucket need to be accessed from the Unicorn and Gnome Configure the bucket policies to allow cross-account access:
aws s3api put-bucket-policy \
    --bucket <BUCKET_UNIQUE_NAME> \
    --policy \
            "Version": "2012-10-17",
            "Statement": [
                    "Action": [
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": [
                    "Resource": [
                    "Action": [
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": [
                    "Resource": [

This S3 bucket has been created manually for educational purposes only, but it’s considered a best practice to have it as part of the IaC bundle.

Creating a cross-account IAM role in each tenant account

Following the security best practice of granting least privilege, each action declared on CodePipeline should have its own IAM role.  For this use case, the pipeline needs to perform changes in the Unicorn and Gnome accounts from the Tooling account, so you need to create a cross-account IAM role in each tenant account.

Repeat the following steps for each tenant account to allow CodePipeline to assume role in those accounts:

  1. Configure a named CLI profile for the tenant account to allow running commands using the correct access keys.
  2. Create an IAM role that can be assumed from another AWS account, replacing <TENANT_PROFILE_NAME> with the profile name you defined in the previous step:
aws iam create-role \
    --role-name CodePipelineCrossAccountRole \
    --profile <TENANT_PROFILE_NAME> \
    --assume-role-policy-document \
            "Version": "2012-10-17",
            "Statement": [
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": "arn:aws:iam::<TOOLING_ACCOUNT_ID>:root"
                    "Action": "sts:AssumeRole"
  1. Create an IAM policy that grants access to the artifact store S3 bucket and to the artifact encryption key:
aws iam create-policy \
    --policy-name CodePipelineCrossAccountArtifactReadPolicy \
    --profile <TENANT_PROFILE_NAME> \
    --policy-document \
            "Version": "2012-10-17",
            "Statement": [
                    "Action": [
                    "Resource": [
                    "Effect": "Allow"
                    "Action": [
                    "Resource": [
                    "Effect": "Allow"
                    "Action": [ 
                    "Resource": "<KEY_ARN>",
                    "Effect": "Allow"
  1. Attach the CodePipelineCrossAccountArtifactReadPolicy IAM policy to the CodePipelineCrossAccountRole IAM role:
aws iam attach-role-policy \
    --profile <TENANT_PROFILE_NAME> \
    --role-name CodePipelineCrossAccountRole \
    --policy-arn arn:aws:iam::<TENANT_ACCOUNT_ID>:policy/CodePipelineCrossAccountArtifactReadPolicy
  1. Create an IAM policy that allows to pass the IAM role CloudFormationDeploymentRole to CloudFormation and to perform CloudFormation actions on the application Stack:
aws iam create-policy \
    --policy-name CodePipelineCrossAccountCfnPolicy \
    --profile <TENANT_PROFILE_NAME> \
    --policy-document \
            "Version": "2012-10-17",
            "Statement": [
                    "Action": [
                    "Resource": "arn:aws:iam::<TENANT_ACCOUNT_ID>:role/CloudFormationDeploymentRole",
                    "Effect": "Allow"
                    "Action": [
                    "Resource": "arn:aws:cloudformation:<YOUR_REGION>:<TENANT_ACCOUNT_ID>:stack/SampleApplication*/*",
                    "Effect": "Allow"
  1. Attach the CodePipelineCrossAccountCfnPolicy IAM policy to the CodePipelineCrossAccountRole IAM role:
aws iam attach-role-policy \
    --profile <TENANT_PROFILE_NAME> \
    --role-name CodePipelineCrossAccountRole \
    --policy-arn arn:aws:iam::<TENANT_ACCOUNT_ID>:policy/CodePipelineCrossAccountCfnPolicy

Additional configuration is needed in the Tooling account to allow access, which you complete later on.

Creating a deployment IAM role in each tenant account

After CodePipeline assumes the CodePipelineCrossAccountRole IAM role into the tenant account, it triggers AWS CloudFormation to provision the infrastructure based on the template defined in the application.yaml file. For that, AWS CloudFormation needs to assume an IAM role that grants privileges to create resources into the tenant AWS account.

Repeat the following steps for each tenant account to allow AWS CloudFormation to create resources in those accounts:

  1. Create an IAM role that can be assumed by AWS CloudFormation:
aws iam create-role \
    --role-name CloudFormationDeploymentRole \
    --profile <TENANT_PROFILE_NAME> \
    --assume-role-policy-document \
            "Version": "2012-10-17",
            "Statement": [
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "cloudformation.amazonaws.com"
                    "Action": "sts:AssumeRole"
  1. Create an IAM policy that grants permissions to create AWS resources:
aws iam create-policy \
    --policy-name CloudFormationDeploymentPolicy \
    --profile <TENANT_PROFILE_NAME> \
    --policy-document \
            "Version": "2012-10-17",
            "Statement": [
                    "Action": "iam:PassRole",
                    "Resource": "arn:aws:iam::<TENANT_ACCOUNT_ID>:role/*",
                    "Effect": "Allow"
                    "Action": [
                    "Resource": "arn:aws:iam::<TENANT_ACCOUNT_ID>:role/*",
                    "Effect": "Allow"
                    "Action": "lambda:*",
                    "Resource": "*",
                    "Effect": "Allow"
                    "Action": "codedeploy:*",
                    "Resource": "*",
                    "Effect": "Allow"
                    "Action": [
                    "Resource": [
                    "Effect": "Allow"
                    "Action": [
                    "Resource": "<KEY_ARN>",
                    "Effect": "Allow"
                    "Action": [
                    "Resource": "arn:aws:cloudformation:<YOUR_REGION>:<TENANT_ACCOUNT_ID>:stack/SampleApplication*/*",
                    "Effect": "Allow"
                    "Action": [
                    "Resource": "arn:aws:cloudformation:<YOUR_REGION>:aws:transform/Serverless-2016-10-31",
                    "Effect": "Allow"

The granted permissions in this IAM policy depend on the resources your application needs to be provisioned. Because the application in our use case consists of a simple Lambda function, the IAM policy only needs permissions over Lambda. The other permissions declared are to access and decrypt the Lambda code from the artifact store, use AWS CodeDeploy to deploy the function, and create and attach the Lambda execution role.

  1. Attach the IAM policy to the IAM role:
aws iam attach-role-policy \
    --profile <TENANT_PROFILE_NAME> \
    --role-name CloudFormationDeploymentRole \
    --policy-arn arn:aws:iam::<TENANT_ACCOUNT_ID>:policy/CloudFormationDeploymentPolicy

Configuring an artifact store encryption key

Even though the IAM roles created in the tenant accounts declare permissions to use the CMK encryption key, that’s not enough to have access to the key. To access the key, you must update the CMK key policy.

From the terminal, run the following command to attach the new policy:

aws kms put-key-policy \
    --key-id <KEY_ARN> \
    --policy-name default \
    --region <YOUR_REGION> \
    --policy \
             "Id": "TenantAccountAccess",
             "Version": "2012-10-17",
             "Statement": [
                    "Sid": "Enable IAM User Permissions",
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": "arn:aws:iam::<TOOLING_ACCOUNT_ID>:root"
                    "Action": "kms:*",
                    "Resource": "*"
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": [
                    "Action": [
                    "Resource": "*"

Provisioning the CI/CD pipeline

Each CodePipeline workflow consists of two or more stages, which are composed by a series of parallel or serial actions. For our use case, the pipeline is made up of four stages:

  • Source – Declares CodeCommit as the source control for the application code.
  • Build – Using CodeBuild, it installs the dependencies and builds deployable artifacts. In this use case, the sample application is too simple and this stage is used for illustration purposes.
  • Deploy_Dev – Deploys the sample application on a sandbox environment. At this point, the deployable artifacts generated at the Build stage are used to create a CloudFormation stack and deploy the Lambda function.
  • Deploy_Prod – Similar to Deploy_Dev, at this stage the sample application is deployed on the tenant production environments. For that, it contains two actions (one per tenant) that are run in parallel. CodePipeline uses CodePipelineCrossAccountRole to assume a role on the tenant account, and from there, CloudFormationDeploymentRole is used to effectively deploy the application.

To provision your resources, complete the following steps from the terminal:

  1. Download the CloudFormation pipeline template:
curl -LO https://cross-account-ci-cd-pipeline-single-tenant-saas.s3.amazonaws.com/pipeline.yaml
  1. Deploy the CloudFormation stack using the pipeline template:
aws cloudformation deploy \
    --template-file pipeline.yaml \
    --region <YOUR_REGION> \
    --stack-name <YOUR_PIPELINE_STACK_NAME> \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides \
        ArtifactBucketName=<BUCKET_UNIQUE_NAME> \
        ArtifactEncryptionKeyArn=<KMS_KEY_ARN> \
        UnicornAccountId=<UNICORN_TENANT_ACCOUNT_ID> \
        GnomeAccountId=<GNOME_TENANT_ACCOUNT_ID> \
        SampleApplicationRepositoryName=<YOUR_CODECOMMIT_REPOSITORY_NAME> \

This is the list of the required parameters to deploy the template:

    • ArtifactBucketName – The name of the S3 bucket where the deployment artifacts are to be stored.
    • ArtifactEncryptionKeyArn – The ARN of the customer managed CMK to be used as artifact encryption key.
    • UnicornAccountId – The AWS account ID for the first tenant (Unicorn) where the application is to be deployed.
    • GnomeAccountId – The AWS account ID for the second tenant (Gnome) where the application is to be deployed.
    • SampleApplicationRepositoryName – The name of the CodeCommit repository where source changes are detected.
    • RepositoryBranch – The name of the CodeCommit branch where source changes are detected. The default value is master in case no value is provided.
  1. Wait for AWS CloudFormation to create the resources.

When stack creation is complete, the pipeline starts automatically.

For each existing tenant, an action is declared within the Deploy_Prod stage. The following code is a snippet of how these actions are configured to deploy the application on a different account:

RoleArn: !Sub arn:aws:iam::${UnicornAccountId}:role/CodePipelineCrossAccountRole
    ActionMode: CREATE_UPDATE
    StackName: !Sub SampleApplication-unicorn-stack-${AWS::Region}
    RoleArn: !Sub arn:aws:iam::${UnicornAccountId}:role/CloudFormationDeploymentRole
    TemplatePath: CodeCommitSource::application.yaml
    ParameterOverrides: !Sub | 
            "ApplicationName": "SampleApplication-Unicorn",
            "S3Bucket": { "Fn::GetArtifactAtt" : [ "ApplicationBuildOutput", "BucketName" ] },
            "S3Key": { "Fn::GetArtifactAtt" : [ "ApplicationBuildOutput", "ObjectKey" ] }

The code declares two IAM roles. The first one is the IAM role assumed by the CodePipeline action to access the tenant AWS account, whereas the second is the IAM role used by AWS CloudFormation to create AWS resources in the tenant AWS account. The ParameterOverrides configuration declares where the release artifact is located. The S3 bucket and key are in the Tooling account and encrypted using the customer managed CMK. That’s why it was necessary to grant access from external accounts using a bucket and KMS policies.

Besides the CI/CD pipeline itself, this CloudFormation template declares IAM roles that are used by the pipeline and its actions. The main IAM role is named CrossAccountPipelineRole, which is used by the CodePipeline service. It contains permissions to assume the action roles. See the following code:

    "Action": "sts:AssumeRole",
    "Effect": "Allow",
    "Resource": [

When you have more tenant accounts, you must add additional roles to the list.

After CodePipeline runs successfully, test the sample application by invoking the Lambda function on each tenant account:

aws lambda invoke --function-name SampleApplication --profile <TENANT_PROFILE_NAME> --region <YOUR_REGION> out

The output should be:

    "StatusCode": 200,
    "ExecutedVersion": "$LATEST"

Cleaning up

Follow these steps to delete the components and avoid future incurring charges:

  1. Delete the production application stack from each tenant account:
aws cloudformation delete-stack --profile <TENANT_PROFILE_NAME> --region <YOUR_REGION> --stack-name SampleApplication-<TENANT_NAME>-stack-<YOUR_REGION>
  1. Delete the dev application stack from the Tooling account:
aws cloudformation delete-stack --region <YOUR_REGION> --stack-name SampleApplication-dev-stack-<YOUR_REGION>
  1. Delete the pipeline stack from the Tooling account:
aws cloudformation delete-stack --region <YOUR_REGION> --stack-name <YOUR_PIPELINE_STACK_NAME>
  1. Delete the customer managed CMK from the Tooling account:
aws kms schedule-key-deletion --region <YOUR_REGION> --key-id <KEY_ARN>
  1. Delete the S3 bucket from the Tooling account:
aws s3 rb s3://<BUCKET_UNIQUE_NAME> --force
  1. Optionally, delete the IAM roles and policies you created in the tenant accounts


This post demonstrated what it takes to build a CI/CD pipeline for single-tenant SaaS solutions isolated on the AWS account level. It covered how to grant cross-account access to artifact stores on Amazon S3 and artifact encryption keys on AWS KMS using policies and IAM roles. This approach is less error-prone because it avoids human errors when manually deploying the exact same application for multiple tenants.

For this use case, we performed most of the steps manually to better illustrate all the steps and components involved. For even more automation, consider using the AWS Cloud Development Kit (AWS CDK) and its pipeline construct to create your CI/CD pipeline and have everything as code. Moreover, for production scenarios, consider having integration tests as part of the pipeline.

Rafael Ramos

Rafael Ramos

Rafael is a Solutions Architect at AWS, where he helps ISVs on their journey to the cloud. He spent over 13 years working as a software developer, and is passionate about DevOps and serverless. Outside of work, he enjoys playing tabletop RPG, cooking and running marathons.

Integrating AWS CloudFormation Guard into CI/CD pipelines

Post Syndicated from Sergey Voinich original https://aws.amazon.com/blogs/devops/integrating-aws-cloudformation-guard/

In this post, we discuss and build a managed continuous integration and continuous deployment (CI/CD) pipeline that uses AWS CloudFormation Guard to automate and simplify pre-deployment compliance checks of your AWS CloudFormation templates. This enables your teams to define a single source of truth for what constitutes valid infrastructure definitions, to be compliant with your company guidelines and streamline AWS resources’ deployment lifecycle.

We use the following AWS services and open-source tools to set up the pipeline:

Solution overview

The CI/CD workflow includes the following steps:

  1. A code change is committed and pushed to the CodeCommit repository.
  2. CodePipeline automatically triggers a CodeBuild job.
  3. CodeBuild spins up a compute environment and runs the phases specified in the buildspec.yml file:
  4. Clone the code from the CodeCommit repository (CloudFormation template, rule set for CloudFormation Guard, buildspec.yml file).
  5. Clone the code from the CloudFormation Guard repository on GitHub.
  6. Provision the build environment with necessary components (rust, cargo, git, build-essential).
  7. Download CloudFormation Guard release from GitHub.
  8. Run a validation check of the CloudFormation template.
  9. If the validation is successful, pass the control over to CloudFormation and deploy the stack. If the validation fails, stop the build job and print a summary to the build job log.

The following diagram illustrates this workflow.

Architecture Diagram

Architecture Diagram of CI/CD Pipeline with CloudFormation Guard


For this walkthrough, complete the following prerequisites:

Creating your CodeCommit repository

Create your CodeCommit repository by running a create-repository command in the AWS CLI:

aws codecommit create-repository --repository-name cfn-guard-demo --repository-description "CloudFormation Guard Demo"

The following screenshot indicates that the repository has been created.

CodeCommit Repository

CodeCommit Repository has been created

Populating the CodeCommit repository

Populate your repository with the following artifacts:

  1. A buildspec.yml file. Modify the following code as per your requirements:
version: 0.2
    # Definining CloudFormation Teamplate and Ruleset as variables - part of the code repo
    CF_TEMPLATE: "cfn_template_file_example.yaml"
    CF_ORG_RULESET:  "cfn_guard_ruleset_example"
      - apt-get update
      - apt-get install build-essential -y
      - apt-get install cargo -y
      - apt-get install git -y
      - echo "Setting up the environment for AWS CloudFormation Guard"
      - echo "More info https://github.com/aws-cloudformation/cloudformation-guard"
      - echo "Install Rust"
      - curl https://sh.rustup.rs -sSf | sh -s -- -y
       - echo "Pull GA release from github"
       - echo "More info https://github.com/aws-cloudformation/cloudformation-guard/releases"
       - wget https://github.com/aws-cloudformation/cloudformation-guard/releases/download/1.0.0/cfn-guard-linux-1.0.0.tar.gz
       - echo "Extract cfn-guard"
       - tar xvf cfn-guard-linux-1.0.0.tar.gz .
       - echo "Validate CloudFormation template with cfn-guard tool"
       - echo "More information https://github.com/aws-cloudformation/cloudformation-guard/blob/master/cfn-guard/README.md"
       - cfn-guard-linux/cfn-guard check --rule_set $CF_ORG_RULESET --template $CF_TEMPLATE --strict-checks
    - cfn_template_file_example.yaml
  name: guard_templates
  1. An example of a rule set file (cfn_guard_ruleset_example) for CloudFormation Guard. Modify the following code as per your requirements:
#CFN Guard rules set example

#List of multiple references
let allowed_azs = [us-east-1a,us-east-1b]
let allowed_ec2_instance_types = [t2.micro,t3.nano,t3.micro]
let allowed_security_groups = [sg-08bbcxxc21e9ba8e6,sg-07b8bx98795dcab2]

#EC2 Policies
AWS::EC2::Instance AvailabilityZone IN %allowed_azs
AWS::EC2::Instance ImageId == ami-0323c3dd2da7fb37d
AWS::EC2::Instance InstanceType IN %allowed_ec2_instance_types
AWS::EC2::Instance SecurityGroupIds == ["sg-07b8xxxsscab2"]
AWS::EC2::Instance SubnetId == subnet-0407a7casssse558

#EBS Policies
AWS::EC2::Volume AvailabilityZone == us-east-1a
AWS::EC2::Volume Encrypted == true
AWS::EC2::Volume Size == 50 |OR| AWS::EC2::Volume Size == 100
AWS::EC2::Volume VolumeType == gp2
  1. An example of a CloudFormation template file (.yaml). Modify the following code as per your requirements:
AWSTemplateFormatVersion: "2010-09-09"
Description: "EC2 instance with encrypted EBS volume for AWS CloudFormation Guard Testing"


    Type: AWS::EC2::Instance
      ImageId: 'ami-0323c3dd2da7fb37d'
      AvailabilityZone: 'us-east-1a'
      KeyName: "your-ssh-key"
      InstanceType: 't3.micro'
      SubnetId: 'subnet-0407a7xx68410e558'
        - 'sg-07b8b339xx95dcab2'
          Device: '/dev/sdf'
          VolumeId: !Ref EBSVolume
       - Key: Name
         Value: cfn-guard-ec2

   Type: AWS::EC2::Volume
     Size: 100
     AvailabilityZone: 'us-east-1a'
     Encrypted: true
     VolumeType: gp2
       - Key: Name
         Value: cfn-guard-ebs
   DeletionPolicy: Snapshot

    Description: The Instance ID
    Value: !Ref EC2Instance
    Description: The Volume ID
    Value: !Ref  EBSVolume
AWS CodeCommit

Optional CodeCommit Repository Structure

The following screenshot shows a potential CodeCommit repository structure.

Creating a CodeBuild project

Our CodeBuild project orchestrates around CloudFormation Guard and runs validation checks of our CloudFormation templates as a phase of the CI process.

  1. On the CodeBuild console, choose Build projects.
  2. Choose Create build projects.
  3. For Project name, enter your project name.
  4. For Description, enter a description.
AWS CodeBuild

Create CodeBuild Project

  1. For Source provider, choose AWS CodeCommit.
  2. For Repository, choose the CodeCommit repository you created in the previous step.
AWS CodeBuild

Define the source for your CodeBuild Project

To setup CodeBuild environment we will use managed image based on Ubuntu 18.04

  1. For Environment Image, select Managed image.
  2. For Operating system, choose Ubuntu.
  3. For Service role¸ select New service role.
  4. For Role name, enter your service role name.
CodeBuild Environment

Setup the environment, the OS image and other settings for the CodeBuild

  1. Leave the default settings for additional configuration, buildspec, batch configuration, artifacts, and logs.

You can also use CodeBuild with custom build environments to help you optimize billing and improve the build time.

Creating IAM roles and policies

Our CI/CD pipeline needs two AWS Identity and Access Management (IAM) roles to run properly: one role for CodePipeline to work with other resources and services, and one role for AWS CloudFormation to run the deployments that passed the validation check in the CodeBuild phase.

Creating permission policies

Create your permission policies first. The following code is the policy in JSON format for CodePipeline:

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*",
            "Condition": {
                "StringEqualsIfExists": {
                    "iam:PassedToService": [

To create your policy for CodePipeline, run the following CLI command:

aws iam create-policy --policy-name CodePipeline-Cfn-Guard-Demo --policy-document file://CodePipelineServiceRolePolicy_example.json

Capture the policy ARN that you get in the output to use in the next steps.

The following code is the policy in JSON format for AWS CloudFormation:

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:AWSServiceName": [
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
            "Resource": "*"

Create the policy for AWS CloudFormation by running the following CLI command:

aws iam create-policy --policy-name CloudFormation-Cfn-Guard-Demo --policy-document file://CloudFormationRolePolicy_example.json

Capture the policy ARN that you get in the output to use in the next steps.

Creating roles and trust policies

The following code is the trust policy for CodePipeline in JSON format:

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Principal": {
        "Service": "codepipeline.amazonaws.com"
      "Action": "sts:AssumeRole"

Create your role for CodePipeline with the following CLI command:

aws iam create-role --role-name CodePipeline-Cfn-Guard-Demo-Role --assume-role-policy-document file://RoleTrustPolicy_CodePipeline.json

Capture the role name for the next step.

The following code is the trust policy for AWS CloudFormation in JSON format:

  "Version": "2012-10-17",
  "Statement": [
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudformation.amazonaws.com"
      "Action": "sts:AssumeRole"

Create your role for AWS CloudFormation with the following CLI command:

aws iam create-role --role-name CF-Cfn-Guard-Demo-Role --assume-role-policy-document file://RoleTrustPolicy_CloudFormation.json

Capture the role name for the next step.


Finally, attach the permissions policies created in the previous step to the IAM roles you created:

aws iam attach-role-policy --role-name CodePipeline-Cfn-Guard-Demo-Role  --policy-arn "arn:aws:iam::<AWS Account Id >:policy/CodePipeline-Cfn-Guard-Demo"

aws iam attach-role-policy --role-name CF-Cfn-Guard-Demo-Role  --policy-arn "arn:aws:iam::<AWS Account Id>:policy/CloudFormation-Cfn-Guard-Demo"

Creating a pipeline

We can now create our pipeline to assemble all the components into one managed, continuous mechanism.

  1. On the CodePipeline console, choose Pipelines.
  2. Choose Create new pipeline.
  3. For Pipeline name, enter a name.
  4. For Service role, select Existing service role.
  5. For Role ARN, choose the service role you created in the previous step.
  6. Choose Next.
CodePipeline Setup

Setting Up CodePipeline environment

  1. In the Source section, for Source provider, choose AWS CodeCommit.
  2. For Repository name¸ enter your repository name.
  3. For Branch name, choose master.
  4. For Change detection options, select Amazon CloudWatch Events.
  5. Choose Next.
AWS CodePipeline Source

Adding CodeCommit to CodePipeline

  1. In the Build section, for Build provider, choose AWS CodeBuild.
  2. For Project name, choose the CodeBuild project you created.
  3. For Build type, select Single build.
  4. Choose Next.
CodePipeline Build Stage

Adding Build Project to Pipeline Stage

Now we will create a deploy stage in our CodePipeline to deploy CloudFormation templates that passed the CloudFormation Guard inspection in the CI stage.

  1. In the Deploy section, for Deploy provider, choose AWS CloudFormation.
  2. For Action mode¸ choose Create or update stack.
  3. For Stack name, choose any stack name.
  4. For Artifact name, choose BuildArtifact.
  5. For File name, enter the CloudFormation template name in your CodeCommit repository (In case of our demo it is cfn_template_file_example.yaml).
  6. For Role name, choose the role you created earlier for CloudFormation.
CodePipeline - Deploy Stage

Adding deploy stage to CodePipeline

22. In the next step review your selections for the pipeline to be created. The stages and action providers in each stage are shown in the order that they will be created. Click Create pipeline. Our CodePipeline is ready.

Validating the CI/CD pipeline operation

Our CodePipeline has two basic flows and outcomes. If the CloudFormation template complies with our CloudFormation Guard rule set file, the resources in the template deploy successfully (in our use case, we deploy an EC2 instance with an encrypted EBS volume).

CloudFormation Deployed

CloudFormation Console

If our CloudFormation template doesn’t comply with the policies specified in our CloudFormation Guard rule set file, our CodePipeline stops at the CodeBuild step and you see an error in the build job log indicating the resources that are non-compliant:

[EBSVolume] failed because [Encrypted] is [false] and the permitted value is [true]
[EC2Instance] failed because [t3.2xlarge] is not in [t2.micro,t3.nano,t3.micro] for [InstanceType]
Number of failures: 2

Note: To demonstrate the above functionality I changed my CloudFormation template to use unencrypted EBS volume and switched the EC2 instance type to t3.2xlarge which do not adhere to the rules that we specified in the Guard rule set file

Cleaning up

To avoid incurring future charges, delete the resources that we have created during the walkthrough:

  • CloudFormation stack resources that were deployed by the CodePipeline
  • CodePipeline that we have created
  • CodeBuild project
  • CodeCommit repository


In this post, we covered how to integrate CloudFormation Guard into CodePipeline and fully automate pre-deployment compliance checks of your CloudFormation templates. This allows your teams to have an end-to-end automated CI/CD pipeline with minimal operational overhead and stay compliant with your organizational infrastructure policies.

Field Notes: Building an Autonomous Driving and ADAS Data Lake on AWS

Post Syndicated from Junjie Tang original https://aws.amazon.com/blogs/architecture/field-notes-building-an-autonomous-driving-and-adas-data-lake-on-aws/

Customers developing self-driving car technology are continuously challenged by the amount of data captured and created during the development lifecycle. This is accelerated by the need to design and launch incremental feature improvements on advanced driver-assistance systems (ADAS). Efforts to advance ADAS functionality have led to new approaches for storing, cataloging, and analyzing driving data captured from vehicles on the road today. Combining data from connected vehicle fleets transmitted over cellular networks in combination with manually ingesting data from vehicle data loggers requires a complex architecture and elastic data lake capability that only AWS provides.

This blog explains how to build an Autonomous Driving Data Lake using this Reference Architecture. We cover the workflow from how to ingest the data, prepare it for machine learning, catalog the output from ADAS systems and vehicle sensors, label it, automatically detect scenarios, and manage the various workflows required for moving it into an organized data lake construct. The AWS Autonomous Driving and ADAS Data Lake Reference Architecture was developed after working with numerous customers on the challenges they faced in achieving this. Using multiple AWS services and solution best practices, we outlined an approach we found helpful to others.

In the blog post, Autonomous Vehicle and ADAS development on AWS Part 1: Achieving Scale, we outlined the benefits having a data lake in the cloud, as opposed to building your own on-premises data lake solution.

Before we dive into the details of the reference architecture, let’s review the typical workflow of autonomous and ADAS development. The diagram below shows the following steps, much of which are common in any machine learning project:

  • data acquisition and ingest,
  • data processing and analytics,
  • labeling, map development,
  • model and algorithm development,
  • simulation and validation, and
  • orchestration and deployment.

This blog focuses on data ingest, data processing and analytics, labeling, and the data lake itself, as shown in the following workflow diagram:

driving dev workflow

Why build a data lake for ADAS and Autonomous Driving System development?

At AWS re:Invent 2019, BMW spoke at this session; (AUT306): Creating a data-driven, cloud-native ecosystem at BMW Group where they explained the motivation for developing an enterprise-wide cloud data lake, or the Cloud Data Hub as they call it. The Cloud Data Hub ingests information from multiple lines of business including sources like manufacturing systems, logistics, customer service, after-sales, and connected vehicle sensor telemetry data.

BMW created a global organization to drive DevOps culture with a strong emphasis on tooling, data quality, and end-to-end data lineage. Their journey began in the Hadoop eco-system (Hive, HBase) with an on-premises, heterogeneous, and difficult-to-scale environment, then moved to cloud-native building blocks with a focus on data quality. Examples of benefits derived from the AWS Cloud implementation are multi-region deployments, high security standards, and compliance with local regulations. The goal is to democratize the most valuable information assets so they can be used across a broad community globally.

The BMW Cloud Data Hub is just one example of how customers manage information assets for sharing across the enterprise. Using a similar approach, the AWS Autonomous Driving and ADAS Data Lake Reference Architecture extends the data lake pattern to address specific challenges around:

  1. metadata and the data catalog including automated scenario detection;
  2. data lineage from source to semantic layer;
  3. data sharing with external consumers and third party service providers via Amazon SageMaker Ground Truth.


Now let’s review the details of the Autonomous Driving Data Lake Reference Architecture:

1. Ingest data from autonomous fleet with AWS Outposts for local data processing. 

Sensor data is captured and written to data loggers containing multiple SSD hard drives. Once at the garage or customer facility, the hard drives are removed and inserted into copy stations.  From there the data is copied to Amazon S3 or to a local storage system and AWS Outposts for pre-processing. AWS Outposts is a fully managed service that extends AWS infrastructure with AWS Services like Amazon Elastic Kubernetes Service, Amazon Relational Database Service, Amazon EMR on Amazon S3 (EMR supports EMRFS and HDFS which both natively support S3). These services are used to run data integrity checks, compress the data to remove redundant information and prepare the data for downstream AD workloads.

Using AWS DataSync, the data is synchronized at high data rates and securely between on-premises network attached storage (NAS) sources and Amazon S3. This is done over a high bandwidth connection provided by AWS Direct Connect.

2. Ingest vehicle telemetry data in real time using AWS IoT Core and Amazon Kinesis Data Firehose.

Vehicle telemetry is captured and published to the cloud using a number of different technologies, typically over HTTPS or MQTT. In this architecture, AWS IoT Greengrass provides an intelligent edge runtime in the vehicle with application logic running in Lambda functions deployed locally to filter vehicle network signals like CAN data, GPS location, ADAS system output, road condition metadata derived from cameras, and other vehicle sensor information.

AWS IoT Greengrass allows customers to deploy containers, machine learning inference models, create multiple data streams and prioritize them based on business logic you define. Eventually, the data ends up in S3 to be combined with the sensor data captured from the data logger process defined in step one above.

3. Remove and transform low quality data.

Autonomous vehicles produce terabytes of data per hour. In this trove of information, there may be redundant as well as corrupted data coming from the vehicle telemetry stream and raw sensor stack.  This data needs to be normalized for optimal downstream processing. Customers use a number of technologies to do this. For example, Amazon EMR provides a runtime for high volume, complex data processing using open source Apache big data processing engines like Spark.  There are a few common steps in the data transformation including;

  • checking if the driving is complete by combining batch files and streaming data;
  • parsing the log files based on recording formats (rosbag, mdf4, etc.);
  • decoding the signals from binary formats to readable text;
  • filtering inconsistent data files; and
  • synchronize the timestamp of signals

EMR Launch is an open source framework developed by AWS Labs available for customers to accelerate and simplify defining, deploying, managing, and using Amazon EMR clusters with the following features:

  • separating the definition of cluster security configurations (EMR Profile) and cluster resource configurations (Cluster Configuration) into reusable and shareable constructs; and
  • provide a suite of tools to simplify the construction of orchestration pipelines using Amazon Step Functions (EMR Launch Function).

4. Schedule the extract, transform, load (ETL) jobs using Apache Airflow.

To create trustworthy insights and empower your next machine learning use case with a reliable data foundation you need to bring the data creation process under design control. Fundamental to this approach is a centralized, governed workflow system powered by Apache Airflow. With Airflow you can establish trust in your data processing pipelines by making the workflow part of your code base to enable transparent, repeatable, pipeline executions.

The following solution diagram shows how radar and video data processing in MDF4 format achieves the highest scalability by leveraging AWS Fargate for Amazon ECS.

  • To ensure data pipeline integrity, the solution is deployed securely in an Amazon Virtual Private Cloud with end-to-end TLS and is only accessed from a private bastion host.
  • The containers for Airflow Webserver, Scheduler and Worker are deployed in multiple Availability Zones for high availability.
  • The communication between the components are decoupled via Amazon ElastiCache for Redis.
  • The status of running jobs is stored in Amazon Aurora.
  • To learn more about how to leverage complex workflows and model training jobs on AWS, a future blog post will describe the detailed architecture of Apache Airflow.

radar and video data processing in MDF4 format

5. Enrich data with map information and weather conditions based on GPS location and timestamp.

Data-sets are enriched using Amazon EMR with map or weather information from external geospatial and weather service providers and stored in Amazon S3, or database services like Amazon DynamoDB. Sensors like cameras and LiDAR could malfunction or even fail in adverse weather conditions. Advanced sensor fusing could perform the weather perception in real-time. The result of the weather perception could be verified with the real weather conditions.

6. Extract metadata into Amazon DynamoDB and Amazon Elasticsearch Service. 

Using drive logs that contain the telemetry, telematics, perception, and sensor data, a catalog is built to create searchable quantitative metadata that includes speed, turning angles, location as well as simple and complex semantic descriptions of scene snippets such as “high velocity,” “left turn,” or “pedestrian.”  The data lake catalog is updated with scenario data and indexed in Amazon Elastic Search for discovery by analysts and ADS engineers.  The extraction process is ideally fully automated, but many higher order behavioral descriptions may require human annotations from later processing steps.

The majority of systems leverage quantitative and simple semantic descriptions for the drive data, with a clear trend towards needing higher order behaviors extracted in the data lake catalog.  These more complex behaviors are ideal for enhanced searching capabilities and better validating coverage mapping.  ASAM OpenSCENARIO defines a scenario description language that provides a common ontology and hierarchy for detailing the behavior of the vehicle and surroundings.

This standard provides an open approach for describing complex, synchronized maneuvers that involve multiple entities including other vehicles, vulnerable road users (VRUs) like pedestrians, bikers, construction workers, and other traffic participants. The description of a maneuver may be based on driver actions like performing a lane change by the ego, or based on the actions of others in the scenario such as a cut in from another driver. OpenSCENARIO also accounts for the appearance/description of the participant in the scene.

7. Store data lineage in Amazon Neptune and catalog data using AWS Glue Data Catalog.

Amazon Neptune is a fully managed graph database service. It’s useful to catalog data lineage in a graph model to visualize file and object dependencies. AWS Glue is a fully managed service that provides a data catalog making assets in the data lake discoverable. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

The following diagram shows how we parse data from the Ford Autonomous Vehicle Dataset in Rosbag format using Amazon EMR. We store it in Amazon S3 in parquet format, use AWS Glue Crawler to read the file schema and create the tables in the AWS Glue Data Catalog, and finally use Amazon Athena to query the velocity data.

Ford Autonomous Vehicle Dataset

8. Process drive data and perform deep signal validation. 

Deploy your drive data signal validation code in Amazon Elastic Kubernetes Service (Amazon EKS). EKS is a managed service that makes it easy for you to run Kubernetes without needing to stand up or maintain your own Kubernetes control plane. Only a subset of the signals from the Rosbag or MDF4 files will be extracted for KPI calculation and aggregation which potentially reduces the stored data volume from gigabytes to megabytes.

9. Perform automated labeling using Amazon SageMaker Ground Truth.

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. Ground Truth offers automatic data labeling and/or annotation which uses a machine learning model to label your data.

In addition, the service helps you create custom workflows for data labeling that leverage human workers from Amazon Mechanical Turk, the AWS Partner Network, or your own private workforce to improve the automated labeling accuracy. Ground Truth now supports 3D Point Cloud Labeling for task types like Object Detection, Object Tracking and Semantic Segmentation. This blog shows how to use the service with open data sets from Audi A2D2 and KITTI. Alternatively, customers can run a custom container on Amazon EKS for Ground Truth generation and labeling.

10. Provide a search function for particular scenarios using AWS AppSync.

Developers and data scientists can search for a particular scenario and all of the associated metadata related to it. AWS AppSync is a managed service that uses GraphQL to make it easy for applications to get data from a range of data sources such as Amazon DynamoDB, Amazon ES and AWS Lambda.

Additional Aspects to consider

  • China: The collection of raw data that includes video, lidar, radar, and GPS data is defined as a controlled activity by the government (geographic information surveying and mapping). This is a regulated activity and must be done under the governance of local certified map providers with navigation surveying licenses.
  • Data Encryption and Anonymization: Some ADS/ADAS use cases include sensitive or personal information. AWS Key Management Service (KMS) supports Customer Master Keys (CMKs). Vehicle Identification Numbers (VIN) can be anonymized by Amazon EMR jobs. This blog shows how to anonymize personal data like faces from the video using Amazon Rekognition.
  • Exchange Data with partners: AWS Data Exchange is a service that makes it easy for AWS customers to securely exchange file-based data sets in the AWS Cloud. Providers in AWS Data Exchange have a secure, transparent, and reliable channel to reach AWS customers and grant existing customers their subscriptions more efficiently.
  • Data Lake as code: AWS provides a full stack of DevOps tooling including AWS CodeCommit, AWS CodeBuild and AWS CodePipeline to simplify the provisioning and management infrastructure, to deploy application code, automate software release processes, and monitor application and infrastructure performance. Third party CI/CD tools like Jenkins and Zuul can be integrated as well.


In this post, we discussed the steps outlined in this Reference architecture to build an Autonomous Driving and ADAS Data Lake on AWS. We hope you found this interesting and helpful and invite your comments on the architecture.

Also, check out the Automotive issue of the AWS Architecture Monthly Magazine.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Implementing Hardware-in-the-Loop for Autonomous Driving Development on AWS

Post Syndicated from Bryan Berezdivin original https://aws.amazon.com/blogs/architecture/field-notes-implementing-hardware-in-the-loop-for-autonomous-driving-development-on-aws/

Automotive customers use AWS as their platform for advanced driving assistance systems (ADAS) and autonomous driving (AD) development to accelerate their development cycles and experience faster time-to-market.  In the blog post, Autonomous Vehicle and ADAS development on AWS Part 1: Achieving Scale, we illustrated how software in the loop (SiL) and hardware in the loop (HiL) simulations are part of the workflow used to develop and validate safe AD and ADAS functionality. In this post, I run through some of the more common questions and patterns for implementing HiL on AWS, while looking at some of the differences from running your development all on-premises.

HiL simulations leverage test drive data and derived synthetic data to develop and validate various functions in the AD software stack.  Test drive log data is ingested and stored in Amazon S3 for use for HiL simulations in parallel with other AD development workloads including visualization, processing, labeling, analysis, and model and algorithm development.

As such, we see customers exist in a hybrid context with their HiL workloads running on-premises to support customized equipment. For ADAS and ADS customers, this poses a few questions and considerations:

  • What are the recommendations to deploy HiL for AD on AWS?
  • How is this different from what customers were used to on-premises?

For hybrid customers, there are assumptions and misconceptions:

  • Do I need to replicate the test drive data locally, and if so, what are the considerations and consequences?

For the purposes of brevity, the remainder of this blog post will use the term AD development to encompass ADAS and AD unless specifically called out.

HiL Building Blocks

Simulations and validations make up an important aspect of AD development. According to Rand’s analysis, there is a need to demonstrate safe driving on billions of miles for an autonomous vehicle to have a lower failure rate than a human driver. While this analysis is statistically derived for fully autonomous driving (SAE Level 5), it demonstrates a need for further validation on millions of miles. This pattern is reflected in most ADAS and AD development projects, where software in the loop (SiL) and HiL are used for verification and validation.

The following diagram is an illustration of the ISO 26262 V-Model, a product development approach for matching requirements with corresponding tests, where HiL simulation is required for much of system level testing and validation phases on the right side of the overlaid V-Model.

Figure 1: V-Model as defined by ISO-26262

Figure 1: V-Model as defined by ISO-26262

HiL simulations require a few key elements. The main component is the device under test (DUT), such as one or more electronic control units (ECUs) running the AD software stack. HiL simulations allow customers to put the device under test (DUT) under the rigor of real-world signals found in a vehicle. By providing more accurate environments and scenarios for the DUT, it can be fine-tuned for key performance indicators (KPIs) such as power utilization, response time, and accuracy.

The DUT is connected to a “HIL Rig,” a high performance server with multiple expansion boards to connect to various components of the AD system. The various interfaces are identified in Figure 2:  High Level Hardware-in-the-loop (HiL) System and Interfaces and include Controller Area Network (CAN), Automotive Ethernet, Low Voltage Differential Signal (LVDS), and PCIExpress. These interfaces emulate the vehicular topology for testing purposes and allow system level validation of the DUT.

The HiL Rigs and the corresponding software tooling are offered by companies like Elektrobit, dSpace, National Instruments, and Opal-RT. The HiL systems facilitate time synchronized inputs and outputs to the DUT and measures system performance. These solutions have optimizations for large-scale operations aligned with faster validation cycles for customers. This latter point is relevant when a project requires validation across a large number of miles. Elektrobit provides the ability to orchestrate and deploy large HiL server farms that can be deployed to work in parallel. The larger server farms allow parallel HiL simulations to reduce the time to validate thousands of miles of drive time and assess the key performance indicators (KPIs) for the feature sets. Results can be acted on more quickly and reduce the overall development time.

Figure 2: High Level Hardware-in-the-loop (HiL) System and Interfaces

Figure 2: High Level Hardware-in-the-loop (HiL) System and Interfaces

The HIL system loads sensor data derived from test drive logs. These logs vary in format, but often are captured and stored as MDF4, ADTF, rosbag, or other data logger proprietary formats. These are then processed for HIL simulations to implement open-loop and closed-loop simulations.

  • Open loop simulations refer to replay of log data from test drives.
  • Closed-loop simulations rely on the behavior of the system as inputs vary based on new outputs of the simulation.

Both open loop and closed loop simulations are part of autonomous driving development, but open-loop simulations require the largest datasets (multiple petabytes on average) due to reliance on the log data from test drives making them a primary concern for deploying HiL in a hybrid manner.

Overview of Solution

Architectures for supporting HIL simulations with AWS for AD development vary based primarily on the networking available at HiL locations. A common pattern for AWS customers is to have the HiL systems directly interfacing with Amazon S3 over high-bandwidth network links leveraging AWS Direct Connect. This is the simplest approach to deploying HiL and avoids hybrid data management of the petabytes of data in Amazon S3 to a local storage system.

AWS Direct Connect provides customers options to deploy their HIL rigs at their data center or in AWS colocation facilities with low latency connections. AWS has the largest number of Direct Connection locations and points-of-presence (POPs) to enable low latency connectivity to any of the >24 AWS Regions. The following diagram illustrates a reference architecture leveraging a direct interface from the HiL systems and Amazon S3.

Figure 3 : Reference Architecture for Hardware-in-the-Loop (HiL) Direct to Amazon S3

As shown in Figure 3, we illustrate the common interfaces, topology, and AWS services used for autonomous driving customers.

  • Amazon S3 is used to store and analyze the test drive logs used by the HiL simulations and also the results from the simulation runs for further analysis.
  • Metadata of test drive data is populated in various database and analytics services, referred to as the data catalogues, with metadata crawlers and processing pipelines that extract from the drive log and test result data on Amazon S3.
  • The data catalogues provide flexible search interfaces for developers and validation engineers or advanced analytics tools. These systems provide keyword search in Amazon Elasticsearch or SQL queries in Amazon Redshift or Amazon RDS and noSQL interfaces using Amazon DynamoDB. Amazon Partner Network solutions for these database and analysis tools are common as well, such as those in AWS Marketplace.
  • Validation engineer, data scientists, or developers use these data catalogues to find scenarios for testing. These personas also use the HiL management interfaces to configure and orchestrate the HiL simulation runs on the scenarios identified and ensure traceability.
  • HiL management systems control the HiL Rigs that interface to the DUT and implement the HiL simulations using the test drive logs. The HiL management system then writes results back to S3 for further analysis via various tool chains.

A common question AWS customers have is how to determine an optimal hybrid architecture using this approach. The primary factors are properly sized network links to accommodate data sets used by the HiL simulations as well as low latency network links between Amazon S3 and the HiL rigs. As a result, a key factor is ensuring use of an AWS region for your AWS storage that is in close proximity to your HiL testing site(s).

Based on current HiL implementations, open-loop simulations can sustain latencies of 30-50 ms RTT. AWS has numerous AWS Direct Connect locations in co-location facilities with latencies <5ms RTT. Sizing for these network links can be calculated based on the expected dataset sizes and the interval of time targeted for simulation run. We show a basic formula used for network sizing.

Average_Throughput (Gbps) = Average_Dataset_Size(GB)*8 / Time_Interval (seconds)

As an example, for a scenario where an average of 20PB is needed by the HIL rig every 2 weeks, we require ~200Gbps for the AWS Direct Connect bandwidth.

Figure 4 shows an example of a high-level architecture supported by Elektrobit with multiple EB 9101 test racks grouped together. This architecture supports multiple ECUs to be tested at once, leveraging drive log data in Amazon S3. This system is controlled with a central management software that allows optimal orchestration to keep the Elektrobit HiL system running optimally.

Use cases include:

  • The automated replay of all relevant sensor data with high time precision to ECU
  • Capture of ECU responses including debug data
  • Integration of customer components inside the HiL rack for visualization or post-processing.

Another common question from AWS customers is whether this architecture is supported for their HiL implementation. Many HiL providers are adding AWS functionality to their software and hardware stacks in response to customers transitioning to cloud for the development platforms.  Some vendors still require Amazon S3 as a supported interface in their HiL Rigs. The work needed to accommodate Amazon S3 is usually a small level of effort for any developer by using Amazon SDKs on the HiL rig software stack. If there is a project where this is needed, contact the AWS account teams and your HiL vendors to ensure a successful and cost efficient project implementation.

Figure 4: Elektrobit HiL Architecture with AWS

Figure 4: Elektrobit HiL Architecture with AWS

An alternative HiL solution shown in Figure 5 includes Amazon S3 as the primary storage for drive log data and the scale out NAS storage system is located on-premises operating as a cache for the HIL rigs. This is common when the networking options at the HIL site are limited in bandwidth or latency to handle the target datasets and time windows.

AWS customers calculate the size of the cache to transfer the entire dataset over the intended time interval. Following is a simple calculation to demonstrate this.

Cache_Size(GB) = Average_Dataset_Size(GB) - Average_Throughput (Gbps) /8 * Time_Interval (seconds)

In this example, a customer has 40 Gbps AWS Direct Connect available and a 10PB dataset needed for HIL simulations every 2 weeks. Using the preceding formula there is a need for local cache of four PB capable of high read-rates.

Figure 5: Reference Architecture for Hardware-in-the Loop (HiL) with Local Cache to Amazon S3

Figure 5: Reference Architecture for Hardware-in-the Loop (HiL) with Local Cache to Amazon S3

In this hybrid architecture there is a need to orchestrate the data movement in line with the needs of the HIL simulation data set. This requires third party software generally or built in functionality into workflow orchestration tools like Apache Airflow. At CES 2019, Dell EMC and AWS illustrated a solution for this hybrid architecture documented in this short solution brief using Isilon as the scale out NAS storage system and DataIQ as the data movement and orchestration mechanism.

Any of these architectures can be cost-optimized, and AWS has programs and pricing options for Amazon Direct Connect as well as the other AWS services involved. There are Enterprise Agreements and Migration Acceleration Programs (MAP)  in line with the holistic AD development platform needs, that reduce the costs for hybrid architecture functionality needed in the HiL solutions. One common need is support for AWS Direct Connect “flat rate” pricing option to accommodate the data transfer out (DTO) needs for the HiL workload. If you need details on these programs for your AD development project, contact your AWS account team.


In this blog post, we discussed two common architectural patterns for supporting HiL simulations for ADAS and Autonomous Driving development. These help customers decide on the right networking, storage, and hybrid topologies for these systems.

HiL systems directly interfacing with Amazon S3 is the most common pattern as you see with Elektrobit HiL solutions, but for customers with limited network links the use of a local cache is an option. Autonomous driving customers looking to increase velocity in their SAE Level 2-5 development programs with HiL simulations have achieved success with AWS as the development platform using these patterns. AWS has a team dedicated to autonomous driving, so contact your AWS account team to get a more prescriptive solution for your HiL or related ADAS and AD development needs.

Also, check out the Automotive issue of the AWS Architecture Monthly Magazine.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Powering the Connected Vehicle with Amazon Alexa

Post Syndicated from Amit Kumar original https://aws.amazon.com/blogs/architecture/field-notes-powering-the-connected-vehicle-with-amazon-alexa/

Alexa has improved the in-home experience and has potential to greatly enhance the in-car experience. This blog is a continuation of my previous blog: Field Notes: Implementing a Digital Shadow of a Connected Vehicle with AWS IoT. Multiple OEMs (Original Equipment Manufacturers) have showcased this capability during CES 2020. Use cases include; a person seating at the rear seat can play a song, control HVAC (Heating, ventilation, and air conditioning), pay for gas/coffee, all while using Alexa. In this blog, I cover how you create a connected vehicle using Alexa, to initiate a command, such as; ‘Alexa, open my trunk’.

Solution Architecture

“Alexa, open my trunk”

The preceding architecture shows a message flowing in the following example:

  1. A user of a connected vehicle wants to open their trunk using an Alexa voice command. Alexa will identify the right intent based on utterances and invoke a Lambda function. The Lambda function updates the device shadow with (desired {““trunk””: ““open””}).
  2. Vehicle TCU registered the callback function shadowRegisterDeltaCallback(). Listen to delta topics for the device shadow by subscribing to delta topics. Whenever there is a difference between the desired and reported state, the registered callback will be called.  The delta payload will be available in the callback. Update performed in #1 will be received in delta callback.
  3. Now, the vehicle must act on the desired state. In this case, it acts on the trunk status change. After performing the required action for the trunk change, the vehicle TCU will update the device shadow with the reported state (reported : { “trunk”: “open”} )
  4. The web/mobile app subscribed to the topic $aws/things/tcu/shadow/update/accepted”. Therefore, as soon as the vehicle TCU updates the shadow, the Web/Mobile app received the update and synchronized the UI state.

As part of the previous blog, we implemented #2, #3 and #4. Lets implement #1 and incorporate into the solution.

The source code (vehicle-command) of this blog is available in this code repository.

The Alexa voice command required the implementation of three key areas:

  1. Configure Alexa – which will listen to utterances and identify the right intent and invoke a Lambda function.
  2. Set up the Lambda function – which will interpret the command and invoke the AWS IoT Core device shadow API.
  3. Handle Command at Vehicle tcu and App – Vehicle tcu must register shadowRegisterDeltaCallback so any update in the device shadow will receive a call message to perform the  actual command by the vehicle and synchronize the state with a web/mobile app.

Let’s ‘Open a trunk’ using Alexa voice command. First set up the environment:

  • Open AWS Cloud9 IDE created in an earlier lab and run the following command:

Set up permanent credentials. Note: Alexa doesn’t work with temporary credentials.  Configure it with permanent credentials for ASK command line interface (CLI).

  1. Open Cloud9 Preferences by clicking AWS Cloud9 > Preference or  by clicking on the “gear” icon in the upper right corner of the  Cloud9 window
  2. Select “AWS  Settings”
  3. Disable “AWS  managed temporary credentials”
  4. $ aws  configure
  5. Enter the Access Key  and Secret Access Key of a user that has required access credentials
  6. Use us-east-1 as the region. It will store in ~/.aws/config

Verify that everything worked by examining the file ~/.aws/credentials. It should resemble the following:

 aws_access_key_id = <access_key>
 aws_secret_access_key = <secrect_key>

*Remove aws_session_token line from credentials file.

Next, install the Alexa CLI:

$ npm install ask-cli --global

Initialize ASK CLI by issuing the following command. This will initialize the ASK CLI with a profile associated with your Amazon developer credentials.

$ ask configure --no-browser

Check you are linking AWS account with Alexa:

Do you want to link your AWS account in order to host your Alexa skills? Yes

#At the end output should look as follows:

------------------------- Initialization Complete -------------------------
Here is the summary for the profile setup:
ASK Profile: default
AWS Profile: default

As part of the previous blog, you have already cloned the following git repository in AWS Cloud9 IDE. It has a baseline code to jump start.

$ git clone

Configure Alexa Skills

The Alexa Developer console GUI can be used but we are doing it programmatically so it can be done at scale and allows versioning.

1. Open connected-vehicle-lab/vehicle-command/skill-package/skill.json . We have 2 locale en-US, en-IN are defined in the base code for Alexa command. Let’s add en-GB locale in the json file located at “manifest”/”publishingInformation”/”locales”.  Similarly, you can add locale for your preferred language:

"en-GB": {
"name": "vehicle-command",
"summary": "Control Vehicle using voice command",
"description": "Allow you to control vehicle using voice command",
"examplePhrases": [
    "Alexa open genie",
    "ask genie to lower window",
    "window up"
"keywords": []

If you are inserting into the middle then make sure it is separated by a comma.

2. Let’s create a copy of models connected-vehicle-lab/vehicle-command/skill-package/interactionModels/custom/en-US.json and rename it to en-GB.json and add our intent

  • We have “invocationName”: “genie”.  Here, we  are using “genie” as a command to invoke our Alexa skill. You  can change if needed
  • The key elements in this json file is intent, slots, sample utterance and slot types. Let’s define the  slot types t_action_type for ‘open’, ‘close’, ‘lock’, ‘unlock’. under “types”: [].
        "name": "t_action_type",
        "values": [
                "name": {
                "value": "unlock"
                "name": {
                "value": "lock"
                "name": {
                "value": "close"
                "name": {
                "value": "open"
  • Let’s add intent under “intents”: [] for trunk  ‘TrunkCommandIntent’ and define the sample utterance speech like ‘lock my trunk’,  ‘open trunk’. We are using slot types to simplify the utterance and  understand the operation requested by a user.
            "name": "TrunkCommandIntent",
            "slots": [
                "name": "t_action",
                "type": "t_action_type"
            "samples": [
                "{t_action} trunk",
                "trunk {t_action}",
                "{t_action} my trunk",
                "{t_action} trunk"
  • Now add the same intent, slots, slot type and sample utterances  for other locales files (en-US.json and en-IN.json) as well.

3. Let’s add response message under languageString.js (available at /connected-vehicle-lab/vehicle-command/lambda/custom).

TRUNK_OPEN: 'Trunk Open',
TRUNK_CLOSE: 'Trunk Close' 

If you are inserting into the middle then make sure it is separated by a comma.

Set up the Lambda function

1. Add a Lambda function which will get invoked by Alexa. This Lambda function will handle  the intent and invoke IoT Core Device Shadow API and execute the actual command of ‘Trunk open/unlock or lock/close’.

  • Open /connected-vehicle-lab/vehicle-command/lambda/custom/index.js  and add our TrunkCommandIntent
const TrunkCommandIntentHandler = {
                canHandle(handlerInput) {
                return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
                && Alexa.getIntentName(handlerInput.requestEnvelope) === 'TrunkCommandIntent';
                    handle(handlerInput) {
                    var t_action_value = handlerInput.requestEnvelope.request.intent.slots.t_action.value;
                    var speakOutput;
                    const obj = "trunk";
                    if (t_action_value == "lock" || t_action_value == "open")
                        updateDeviceShadow(obj, "open");
                        speakOutput = handlerInput.t('TRUNK_OPEN')
                        updateDeviceShadow(obj, "close");
                        speakOutput = handlerInput.t('TRUNK_CLOSE')
                    return handlerInput.responseBuilder
                    //.reprompt('add a reprompt if you want to keep the session open for the user to respond')
  • We have  UpdateDeviceShadow(“vehicle_part”, “command”) function  which actually invokes the IoT core Device Shadow API
 function updateDeviceShadow (obj, command)
        shadowMessage.state.desired[obj] = command;
        var iotdata = new AWS.IotData({endpoint: ioT_EndPoint});
        var params = {
        payload: JSON.stringify(shadowMessage) , /* required */
        thingName: deviceName /* required */ 
        iotdata.updateThingShadow(params, function(err, data) {
            if (err) 
            console.log(err, err.stack); // an error occurred
            //reset the shadow 
            shadowMessage.state.desired = {}

2. Update the value of ioT_EndPoint from AWS IoT Core > Settings > Custom Endpoint

3.  Add Trunk CommandIntent in request handler

exports.handler = Alexa.SkillBuilders.custom()

4. Deploy Alexa Skills

$ cd ~/environment/connected-vehicle-lab/vehicle-command
$ ask deploy 

Handle Command at Vehicle tcu and App

For more detail on this section, refer to part 1 of this blog: Field Notes: Implementing a Digital Shadow of a Connected Vehicle with AWS IoT.

@ Vehicle tcu – tcuShadowRead.py has trunk_handle() function to receive a message from device shadow

def trunk_handle(status):
  if status is not None:
    shadowClient.reportedShadowMessage['state']['reported']['trunk'] = status
    print ('Perform action on trunk status change : ' + str(status))

@web App – demo-car/js/websocket.js has handleTrunkCommand function receive callback message as soon any update happened on Device Shadow

//this function will be called by onMessageArrive
function handleTrunkCommand(trunkStatus) {
    obj = document.getElementsByClassName("action trunk")[0];
    obj.checked = trunkStatus == "open" ? true : false;
    console.log(obj.getAttribute("data-text") + " : " + obj.checked);

demo-car/js/demo-car.js has handleTrunkCommand function to handle UI input and invoke IoT Core Device Gateway API to update the desired state.

//this function will be called when user will click on trunk checkbox
    handleTrunkCommand: function(obj) {
        obj.checked ? demoCar.shadowMessage.state.desired.trunk = "open" : demoCar.shadowMessage.state.desired.trunk = "close";
        console.log(obj.getAttribute("data-text") + " : " + demoCar.shadowMessage.state.desired.trunk);

Use Alexa skill to invoke a command

Let’s test or command ‘Alexa, open my trunk’. We can use a command line and execute:

$ask dialog --locale "en-GB" 

Using Alexa GUI, provides an interesting visualization, as shown in the following screenshot.

  1. Open the Alexa GUI,  Select ‘vehicle command’ skill and select test tab. Allow “developer.amazon.com” to use your microphone?
  2. Open a demo.html web app side by side of the Alexa GUI to check an actual operation happened at the Vehicle tcu and synchronize the  status with virtual car model.
  3. Now test the Alexa skill. You can use an audio command as well. You can ask or write ‘ask genie’.

Alexa developer console

Clean Up

What a fun exploration this has been! Now clean up AWS resources created for this and the previous post to avoid incurring any future AWS services costs. Resources created by CDK can be deleted by deleting the stack on the CloudFormation console. Resources created manually need to be deleted individually.


In this blog post, I showed how you can enable voice command for a connected vehicle and enhance in-vehicle user experience.  Similarly, you can also extend this solution for the use cases like Alexa ‘open my garage’. AWS IoT Core Device Shadow API does all the heavy-lifting in this case. Any update in device shadow allows both device and user application to act. Alexa skill is acting as an interface to capture the user command and invoke the lambda function.

Since these are all serverless services, that means this implementation can scale without making any change in the application and you only pay when someone invokes a command. Creating an engaging, high-quality interaction with Alexa in the vehicle is critical. You can refer to Alexa Automotive Documentation for an Alexa Built-in automotive experience.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Field Notes: Implementing a Digital Shadow of a Connected Vehicle with AWS IoT

Post Syndicated from Amit Sinha original https://aws.amazon.com/blogs/architecture/field-notes-implementing-a-digital-shadow-of-a-connected-vehicle-with-aws-iot/

Innovations in connected vehicle technology are expected to improve the quality and speed of vehicle communications and create a safer driving experience. As connected vehicles are becoming part of the mainstream, OEMs (Original Equipment Manufacturers) are broadening the capabilities of their products and dramatically improving the in-vehicle experience for customers.

An important feature in a connected vehicle is its ability to execute a remote command and synchronize the state of the vehicle between a web/mobile app in real time.

This blog demonstrates how to:

  • secure two-way communication between a device (vehicle telematics control unit) and the AWS Cloud using AWS IoT
  • execute command at vehicle
  • execute a remote command
  • and test with a vehicle virtual model

You can watch a quick animation of a remote command execution in the following GIF:

Animated car GIF

Solution Overview

In a traditional connected vehicle approach, there are many processes running on multiple servers. These processes are subscribing to one another, coordinating with each other, and polling for an update. This makes scalability and availability a challenge. We use AWS IoT Core and AWS IoT Device Shadow service as primary components for this solution.

This solution has three building blocks:

  1. a vehicle TCU (telematics control unit),
  2. the AWS Cloud (with connection via AWS IoT Core) and
  3. a virtual Model (e.g.; web/mobile app to send/receive commands to TCU). These three building blocks together reflect the current state of a vehicle.

Alexa Solution Overview

The previous diagram shows a message flowing in the following example:

  1. A user of a connected vehicle wants to open their door using a web/mobile app. The app updates the device shadow with (desired {““door””: ““open””}). The app will always request the vehicle to execute the command; therefore, it will always update the device shadow with the desired state.
  2. Vehicle TCU registered the callback function shadowRegisterDeltaCallback(). Listen on delta topics for the device shadow by subscribing to delta topics. Whenever there is a difference between the desired and reported state, the registered callback is called and the delta payload will be available in the callback. Update performed in #1 will received in delta callback.
  3. Now, the vehicle needs to act on the desired state. In this case, ‘act on’ is the door status change. After performing the required action for the door change, the vehicle TCU will update the device shadow with the reported state (reported : { “door”: “open”} )
  4. Now, the vehicle is closing the door. The vehicle will always perform the action; therefore, it will always update device shadow with reports state (reported: {“door” : “close”})
  5. The Web/Mobile app subscribed to topic $aws/things/tcu/shadow/update/accepted”. Therefore, as soon as the vehicle TCU updates the shadow, the Web/Mobile app received the update and synchronized the UI state.
  6. You can also build an Amazon Alexa skill to control your vehicle (“Alexa, raise my window”). After identifying the utterance, Alexa can invoke the Lambda function to update the device shadow and perform the requested action.

Note: For the Web/Mobile app developments for production, it is recommended to use AWS AppSync and AWS Amplify SDK for building a flexible and decoupled application from the API. Refer to this code sample for more detail.


First, you need to set up the code. Refer to the directions in this code sample.

Create device

In AWS IoT Core, name a device ‘TCU’ (created by connected-vehicle-app-cdk-stack). Create a new certificate (download files) and attach the policy generated by cdk.

create a certificate

Next, deploy the certificate key and pem file on your device so it can connect with the AWS Cloud using the X.509 certificate. For more detail, refer to the directions in the code sample.

Execute Command at Vehicle

AWS IoT Device Shadow is an important feature of AWS IoT core for remote command execution because it allows you to decouple the vehicle and the app which controls and commands the vehicle. A device’s shadow is a JSON document that is used to store and retrieve current state information for a device. Primarily we use state.desired and state.reported. properties of a device’s shadow document.

The device shadow (Device SKD and APIs) enables applications to interact with devices even when they are offline and allow:

  • Cloud representation of device state
  • Query last known state for offline devices
  • Real-time state changes
  • Track last known device state
  • Control devices via change of state
  • Automatic synchronization once devices connect to the cloud
  • APIs for applications to discover and interact with devices

The rich features of a device shadow allows the app to interact with the vehicle TCU even when there is no connectivity. Once connectivity is established, the device gateway pushes the changes to device and vice versa.

We need to deploy a program (tcuShadowWrite.py) on the vehicle TCU device to update the device shadow and send the update to the AWS Cloud. This program is available in this code repository.

Let’s assume that after reaching their home, the vehicle’s user closes the door, switches off the headlights, and rolls up the windows. The same state of the vehicle should be reflected on their web/mobile app in real time. The vehicle TCU has to update the “reported” state in the device shadow JSON document.

shadow message

AWSIoTMQTTShadowClient library has a method called shadowUpdate that needs to be called from the vehicle TCU to update the device shadow. Essentially, it is publishing the shadow reported state on topic $aws/things/<thingName>/shadow/update.

If you run tcuShadowWrite.py script, you should be able to see the output as described in the following image.


  • Open the AWS IoT Core console.
  • Select Manage -> Things -> Select tcu, and then choose Shadow. You should be able to see the shadow message sent from the device described in the following image.

shadow document

Execute Remote Command

We need to deploy a program (tcuShadowRead.py) on the vehicle TCU to receive updates from the AWS Cloud. It is available in this code sample.

Let’s assume the vehicle owner uses the mobile app to open the door, switch on the headlight and roll down the windows. The vehicle TCU should receive this command and instruct the Electronic Control Unit (ECU) to execute the command. The web/mobile app will update the “desire” state in the device shadow JSON document.

shadow message2

In tcuShadowRead.py, AWSIoTMQTTShadowClient has a method shadowRegisterDeltaCallback. It listens on delta topics for this device shadow by subscribing to delta topics. Whenever there is a difference between the desired and reported state, the registered callback is called and the delta payload will be available in the callback.


The callback function has a code to handle the state change request. In an actual implementation, a function like door_handle() would be calling the ECU to execute the door open command.

door open command

If you make changes in Device Shadow on AWS IoT for the tcu device, you should receive the output in the following image.

Device shadow

Test with a Virtual Vehicle Model

To help you test this solution, you can deploy the virtual vehicle model shown in the following image. Detailed steps for the deployment of the virtual vehicle is available in this code sample.

virtual vehicle model

Any changes in the model state should be reflected on the virtual demo vehicle and vice versa.

Here, we use open-source Paho-mqtt library.  and Developers can use this to write JavaScript applications that access AWS IoT using MQTT or MQTT over the WebSocket protocol without using AWS IoT SDK. This implementation is made simpler by using AWS IoT Device SDK for JavaScript v2 Readme.

Review the JavaScript file named webSocketApp.js:

websocket app

  • onMessageArrived() function will be invoked whenever the device will change the shadow state.
  • handle<object>Command functions (such as handleDoorCommand) will be called with the current state.  Call this function if the device has received any status change.

We have another JavaScript file demo-car.js in the demo-car folder. This includes the functions that our simulated vehicle will use in order to change the device shadow.

Let’s review the following code:

democar javascript

  • We have 3 handle command function defined (e.g., handleDoorCommand) to take the user’s input and access AWS IoT Core services.
  • connectDevice is an actual function to invoke updateThingsShadow function to send the desired state
  • accessIoTDevice uses Amazon Cognito Identity to get the authenticated identities to access AWS IoT Core securely without exposing the access key or secret key.

Now, keep demo.html side by side to your code and run the tcuShadowRead.py script. Any change made at the virtual model will reflect at the command output. Similarly, any change made by tcuShadowWrite.py will reflect the state update on the virtual model.


In this blog, we showed how to implement a digital shadow of a connected vehicle using AWS IoT. This solution removes complexity from running multiple processes in parallel and ensures a successful outcome. AWS IoT Core enables scalable, secure, low-latency, low-overhead, bi-directional communication between connected devices, tolerate and recover from slow/brittle connection, the AWS Cloud and customer-facing applications.

The Device Shadow in AWS IoT Core enables the AWS Cloud and applications to easily and accurately receive data from connected vehicles and send commands to the vehicles. The Device Shadow’s uniform and always-available interface simplifies the implementation of time-sensitive use cases. These include, remote command execution and two-way state synchronization between a device and app where the cloud is acting as a broker. This solution enables you to shift operational responsibilities of a connected vehicle infrastructure to the AWS Cloud while paying only for what you use, with no minimum fees or mandatory service usage.

For more information about how AWS can help you build connected vehicle solutions, refer to the AWS Connected Vehicle solution page.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Building resilient serverless patterns by combining messaging services

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/building-resilient-no-code-serverless-patterns-by-combining-messaging-services/

In “Choosing between messaging services for serverless applications”, I explain the features and differences between the core AWS messaging services. Amazon SQS, Amazon SNS, and Amazon EventBridge provide queues, publish/subscribe, and event bus functionality for your applications. Individually, these are robust, scalable services that are fundamental building blocks of serverless architectures.

However, you can also combine these services to solve specific challenges in distributed architectures. By doing this, you can use specific features of each service to build sophisticated patterns with little code. These combinations can make your applications more resilient and scalable, and reduce the amount of custom logic and architecture in your workload.

In this blog post, I highlight several important patterns for serverless developers. I also show how you use and deploy these integrations with the AWS Serverless Application Model (AWS SAM).

Examples in this post refer to code that can be downloaded from this GitHub repo. The README.md file explains how to deploy and run each example.

SNS to SQS: Adding resilience and throttling to message throughput

SNS has a robust retry policy that results in up to 100,010 delivery attempts over 23 days. If a downstream service is unavailable, it may be overwhelmed by retries when it comes back online. You can solve this issue by adding an SQS queue.

Adding an SQS queue between the SNS topic and its subscriber has two benefits. First, it adds resilience to message delivery, since the messages are durably stored in a queue. Second, it throttles the rate of messages to the consumer, helping smooth out traffic bursts caused by the service catching up with missed messages.

To build this in an AWS SAM template, you first define the two resources, and the SNS subscription:

    Type: AWS::SQS::Queue

    Type: AWS::SNS::Topic
        - Protocol: sqs
          Endpoint: !GetAtt MySqsQueue.Arn

Finally, you provide permission to the SNS topic to publish to the queue, using the AWS::SQS::QueuePolicy resource:

    Type: AWS::SQS::QueuePolicy
        Version: "2012-10-17"
          - Sid: "Allow SNS publish to SQS"
            Effect: Allow
            Principal: "*"
            Resource: !GetAtt MySqsQueue.Arn
            Action: SQS:SendMessage
                aws:SourceArn: !Ref MySnsTopic
        - Ref: MySqsQueue

To test this, you can publish a message to the SNS topic and then inspect the SQS queue length using the AWS CLI:

aws sns publish --topic-arn "arn:aws:sns:us-east-1:123456789012:sns-sqs-MySnsTopic-ABC123ABC" --message "Test message"
aws sqs get-queue-attributes --queue-url "https://sqs.us-east-1.amazonaws.com/123456789012/sns-sqs-MySqsQueue- ABC123ABC " --attribute-names ApproximateNumberOfMessages

This results in the following output:

CLI output

Another usage of this pattern is when you want to filter messages in architectures using an SQS queue. By placing the SNS topic in front of the queue, you can use the message filtering capabilities of SNS. This ensures that only the messages you need are published to the queue. To use message filtering in AWS SAM, use the AWS:SNS:Subcription resource:

    Type: 'AWS::SNS::Subscription'
      TopicArn: !Ref MySnsTopic
      Endpoint: !GetAtt MySqsQueue.Arn
      Protocol: sqs
        - orders
        - payments 
      RawMessageDelivery: 'true'

EventBridge to SNS: combining features of both services

Both SNS and EventBridge have different characteristics in terms of targets, and integration with broader features. This table compares the major differences between the two services:

Amazon SNSAmazon EventBridge
Number of targets10 million (soft)5
Limits100,000 topics. 12,500,000 subscriptions per topic.100 event buses. 300 rules per event bus.
Input transformationNoYes – see details.
Message filteringYes – see details.Yes, including IP address matching – see details.
FormatRaw or JSONJSON
Receive events from AWS CloudTrailNoYes
TargetsHTTP(S), SMS, SNS Mobile Push, Email/Email-JSON, SQS, Lambda functions15 targets including AWS LambdaAmazon SQSAmazon SNSAWS Step FunctionsAmazon Kinesis Data StreamsAmazon Kinesis Data Firehose.
SaaS integrationNoYes – see integration partners.
Schema Registry integrationNoYes – see details.
Dead-letter queues supportedYesNo
Public visibilityCan create public topicsCannot create public buses
Cross-RegionYou can subscribe your AWS Lambda functions to an Amazon SNS topic in any Region.Targets must be same Region. You can publish across Region to another event bus.

In this pattern, you configure an SNS topic as a target of an EventBridge rule:

SNS topic as a target for an EventBridge rule

In the AWS SAM template, you declare the resources in the preceding diagram as follows:

    Type: AWS::SNS::Topic

    Type: AWS::Events::Rule
      Description: "EventRule"
          - !Sub '${AWS::AccountId}'
          - "demo.cli"
        - Arn: !Ref MySnsTopic
          Id: "SNStopic"

The default bus already exists in every AWS account, so there is no need to declare it. For the event bus to publish matching events to the SNS topic, you define permissions using the AWS::SNS::TopicPolicy resource:

    Type: AWS::SNS::TopicPolicy
        - Effect: Allow
            Service: events.amazonaws.com
          Action: sns:Publish
          Resource: !Ref MySnsTopic
        - !Ref MySnsTopic       

EventBridge has a limit of five targets per rule. In cases where you must send events to hundreds or thousands of targets, publishing to SNS first and then subscribing those targets to the topic works around this limit. Both services have different targets, and this pattern allows you to deliver EventBridge events to SMS, HTTP(s), email and SNS mobile push.

You can transform and filter the message using these services, often without needing an AWS Lambda function. SNS does not support input transformation but you can do this in an EventBridge rule. Message filtering is possible in both services but EventBridge provides richer content filtering capabilities.

AWS CloudTrail can log and monitor activity across services in your AWS account. It can be a useful source for events, allowing you to respond dynamically to objects in Amazon S3 or react to changes in your environment, for example. This natively integrates with EventBridge, allowing you to ingest events at scale from dozens of services.

Using EventBridge enables you to source events from outside your AWS account, offering integrations with a list of software as a service (SaaS) providers. This capability allows you to receive events from your accounts with SaaS providers like Zendesk, PagerDuty, and Auth0. These events are delivered to a partner event bus in your account, and can then be filtered and routed to an SNS topic.

Additionally, this pattern allows you to deliver events to Lambda functions in other AWS accounts and in other AWS Regions. You can invoke Lambda from SNS topics in other Regions and accounts. It’s also possible to make SNS topics publicly read-only, making them extensible endpoints that other third parties can consume from. SNS has comprehensive access control, which you can incorporate into this pattern.

Cross-account publishing

EventBridge to SQS: Building fault-tolerant microservices

EventBridge can route events to targets such as microservices. In the case of downstream failures, the service retries events for up to 24 hours. For workloads where you need a longer period of time to store and retry messages, you can deliver the events to an SQS queue in each microservice. This durably stores those events until the downstream service recovers. Additionally, this pattern protects the microservice from large bursts of traffic by throttling the delivery of messages.

Fault-tolerant microservices architecture

The resources declared in the AWS SAM template are similar to the previous examples, but it uses the AWS::SQS::QueuePolicy resource to grant the appropriate permission to EventBridge:

    Type: AWS::SQS::QueuePolicy
        - Effect: Allow
            Service: events.amazonaws.com
          Action: SQS:SendMessage
          Resource:  !GetAtt MySqsQueue.Arn
        - Ref: MySqsQueue


You can combine these services in your architectures to implement patterns that solve complex challenges, often with little code required. This blog post shows three examples that implement message throttling and queueing, integrating SNS and EventBridge, and building fault tolerant microservices.

To learn more building decoupled architectures, see this Learning Path series on EventBridge. For more serverless learning resources, visit https://serverlessland.com.

Custom logging with AWS Batch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/custom-logging-with-aws-batch/

This post was written by Christian Kniep, Senior Developer Advocate for HPC and AWS Batch. 

For HPC workloads, visibility into the logs of jobs is important to debug a job which failed, but also to have insights into a running job and track its trajectory to influence the configuration of the next job or terminate the job because it went off track.

With AWS Batch, customers are able to run batch workloads at scale, reliably and with ease as this managed serves takes out the undifferentiated heavy lifting. The customer can then focus on submitting jobs and getting work done. Customers told us that at a certain scale, the single logging driver available within AWS Batch made it hard to separate logs as they were all ending up in the same log group in Amazon CloudWatch.

With the new release of customer logging driver support, customers are now able to adjust how the job output is logged. Not only customize the Amazon CloudWatch setting, but enable the use of external logging frameworks such as splunk, fluentd, json-files, syslog, gelf, journald.

This allow AWS Batch jobs to use the existing systems they are accustom to, with fine-grained control of the log data for debugging and access control purposes.

In this blog, I show the benefits of custom logging with AWS Batch by adjusting the log targets for jobs. The first example will customize the Amazon CloudWatch log group, the second will log to Splunk, an external logging service.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (CLI) to setup the following:

  1. IAM roles, policies, and profiles to grant access and permissions
  2. A compute environment to provide the compute resources to run jobs
  3. A job queue, which supervises the job execution and schedules jobs on a compute environment
  4. A job definition, which uses a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit a job and send logs to a customized CloudWatch log-group and Splunk.


To make things easier, I first set a couple of environment variables to have the information handy for later use. I use the following code to set up the environment variables.

# in case it is not already installed
sudo yum install -y jq 
export MD_URL=
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')


When using the AWS Management Console, you must create IAM roles manually.

Trust Policies

IAM Roles are defined to be used by a certain service. In the simplest case, you want a role to be used by Amazon EC2 – the service that provides the compute capacity in the cloud. This defines which entity is able to use an IAM Role, called Trust Policy. To set up a trust policy for an IAM role, use the following code snippet.

cat > ec2-trust-policy.json << EOF
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    "Action": "sts:AssumeRole"

Instance role

With the IAM trust policy, I now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service Role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role.  You can set up this role with the following logic.

cat > svc-trust-policy.json << EOF
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    "Action": "sts:AssumeRole"
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

In addition to dealing with Amazon ECS, the instance role can create and write to Amazon CloudWatch log groups, to control which log group names are used, a condition is attached.

While the compute environment is coming up, let us create and attach a policy to make a new log-group possible.

cat > policy.json << EOF
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
    "Resource": "*",
    "Condition": {
      "StringEqualsIfExists": {
        "batch:LogDriver": ["awslogs"],
        "batch:AWSLogsGroup": ["/aws/batch/custom/*"]
aws iam create-policy --policy-name batch-awslog-policy \
    --policy-document file://policy.json
aws iam attach-role-policy --policy-arn arn:aws:iam::${AWS_ACCT_ID}:policy/batch-awslog-policy --role-name ecsInstanceRole

At this point, I created the IAM roles and policies so that the instance and service are able to interact with the AWS APIs, including trust-policies to define which services are meant to use them. EC2 for the ecsInstanceRole and the AWSBatchServiceRole for the AWS Batch service itself.

Compute environment

Now, I am going to create a compute environment, which is going to spin up an instance (one vCPU target) to run the example job in.

cat > compute-environment.json << EOF
  "computeEnvironmentName": "od-ce",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 1,
    "maxvCpus": 8,
    "desiredvCpus": 1,
    "instanceTypes": ["m5.xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-compute"},
    "bidPercentage": 0
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
aws batch create-compute-environment --cli-input-json file://compute-environment.json  

Once this section is complete, a compute environment is being spun up in the back. This will take a moment. You can use the following command to check on the status of the compute environment.

aws batch  describe-compute-environments

Once it is enabled and valid we can continue by setting up the job queue.

Job Queue

Now that I have a compute environment up and running, I will create a job queue which accepts job submissions and schedules the jobs on the compute environment.

cat > job-queue.json << EOF
  "jobQueueName": "jq",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "od-ce"
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. This example runs a plain container and prints the environment variables. With the new release of AWS Batch, the logging driver awslogs now allows you to change the log group configuration within the job definition.

cat > job-definition.json << EOF
  "jobDefinitionName": "alpine-env",
  "type": "container",
  "containerProperties": {
  "image": "alpine",
  "vcpus": 1,
  "memory": 128,
  "command": ["env"],
  "readonlyRootFilesystem": true,
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": { 
      "awslogs-region": "${AWS_REGION}", 
      "awslogs-group": "/aws/batch/custom/env-queue",
      "awslogs-create-group": "true"}
aws batch register-job-definition --cli-input-json file://job-definition.json

Job Submission

Using the above job definition, you can now submit a job.

aws batch submit-job \
  --job-name test-$(date +"%F_%H-%M-%S") \
  --job-queue arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-queue/jq \
  --job-definition arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-definition/alpine-env:1

Now, you can check the ‘Log Group’ in CloudWatch. Go to the CloudWatch console and find the ‘Log Group’ section on the left.

log groups in cloudwatch

Now, click on the log group defined above, and you should see the output of the job which allows for debugging if something within the container went wrong or processing logs and create alarms and reports.

cloudwatch log events


Splunk is an established log engine for a broad set of customers. You can use the Docker container to set up a Splunk server quickly. More information can be found in the Splunk documentation. You need to configure the HTTP Event Collector, which provides you with a link and a token.

To send logs to Splunk, create an additional job-definition with the Splunk token and URL. Please adjust the splunk-url and splunk-token to match your Splunk setup.

  "jobDefinitionName": "alpine-splunk",
  "type": "container",
  "containerProperties": {
    "image": "alpine",
    "vcpus": 1,
    "memory": 128,
    "command": ["env"],
    "readonlyRootFilesystem": false,
    "logConfiguration": {
      "logDriver": "splunk",
      "options": {
        "splunk-url": "https://<splunk-url>",
        "splunk-token": "XXX-YYY-ZZZ"

This forwards the logs to Splunk, as you can see in the following image.

forward to splunk


This blog post showed you how to apply custom logging to AWS Batch using the awslog and Splunk logging driver. While these are two important logging drivers, please head over to the documentation to find out about fluentd, syslog, json-file and other drivers to find the best driver to match your current logging infrastructure.


Field Notes: Gaining Insights into Labeling Jobs for Machine Learning

Post Syndicated from Michael Graumann original https://aws.amazon.com/blogs/architecture/field-notes-gaining-insights-into-labeling-jobs-for-machine-learning/

In an era where more and more data is generated, it becomes critical for businesses to derive value from it. With the help of supervised learning, it is possible to generate models to automatically make predictions or decisions by leveraging historical data. For example, image recognition for self-driving cars, predicting anomalies on X-rays, fraud detection in finance and more. With supervised learning, these models learn from labeled data. The success of those models is highly dependent on readily available, high quality labeled data.

However, you might encounter cases where a high percentage of your pre-existing data is unlabeled. In these situations, providing correct labeling to previously unlabeled data points would directly translate to higher model accuracy.

Amazon SageMaker Ground Truth helps you with exactly that. It lets you build highly accurate training datasets for machine learning quickly. SageMaker Ground Truth provides your labelers with built-in workflows and interfaces for common labeling tasks. This process could take several hours or more depending on the size of your unlabeled dataset, and you might have a need to track the progress easily, preferably in the form of a dashboard.

In this blogpost we show how to gain deep insights into the progress of labeling and the performance of the workers by using Amazon Athena and Amazon QuickSight. We use Amazon Athena former to set up several views with specific insights into the labeling progress. Finally we will reference these views in Amazon QuickSight to visualize the data in a dashboard.

This approach also works for combining multiple AWS services in general. AWS provides many building blocks than you can mix-and-match to create a unique, integrated solution with cohesive insights. In this blog post we use data produced by one service (Ground Truth), prepare it with another (Athena) and visualize with a third (QuickSight). The following diagram shows this architecture.

Solution Architecture

ML Solution Architecture

Mapping a JSON structure to a table structure

Ground Truth creates several directories in your Amazon S3 output path. These directories contain the results of your labeling job and other artifacts of the job. The top-level directory for a labeling job has the same name as your labeling job, while the output directories are placed inside it. We will create all insights from what SageMaker Ground Truth calls worker responses.

All respective JSON files reside in the path s3://bucket/<job-name>/annotations/worker-response/.

To analyze the labeling data with Amazon Athena we need to understand the structure of the underlying JSON files. Let’s review the example below. For each item that was labeled, we see the label itself, followed by the submission time and a workerId pointing to an identity. This identity lives in Amazon Cognito, a fully managed service that provides the user directory for our labelers.

    "answers": [
            "answerContent": {
                "crowd-classifier": {
                    "label": "Compute"
            "submissionTime": "2020-03-27T10:31:04.210Z",
            "workerId": "private.eu-west-1.1111111111111111",
            "workerMetadata": {
                "identityData": {
                    "identityProviderType": "Cognito",
                    "issuer": "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_111111111",
                    "sub": "11111111-1111-1111-1111-111111111111"

Although the data is stored in Amazon S3 object storage, we are able to use SQL to access the data by using Amazon Athena. Since we now understand the JSON structure from shown in the preceding code, we use Athena and define how to interpret the data that is relevant to us. We do so by first creating a database using the Athena Query Editor:

CREATE DATABASE analyze_labels_db;

Once inside the database, we add the table schema. The actual files remain on Amazon S3, but using the metadata catalog, Athena then knows where the data lies and how to interpret it. The AWS Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. For a given dataset, you can store its table definition, physical location, add business relevant attributes, in addition to track how this data has changed over time. Besides, Athena the AWS Glue Data Catalog also provides out-of-box integration with Amazon EMR and Amazon Redshift Spectrum. Once you add your table definitions to the Glue Data Catalog, they are available for ETL. They are also readily available for querying in Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum so that you can have a common view of your data between these services.

When going from JSON to SQL, we are crossing format boundaries. To further facilitate how to read the JSON formatted data we are using SerDe Properties to replace the hyphen in crowd-classifier with an underscore due to DDL constraints. Finally we point the location to our Amazon S3 bucket containing the single worker responses. Recognize in the following script that we translate the nested structure of the JSON file itself into a hierarchical, nested data structure in the schema definition. Also, we could leave out the workerMetadata as we don’t need it at this time. The data would still stay in the files on Amazon S3, so that we could later change and add the workerMetadata STRUCT into the table definition for our analysis.

CREATE EXTERNAL TABLE annotations_raw (
  answers array<
        struct<label: string>
      submissionTime: string,
      workerId: string,
          struct<identityProviderType: string, issuer: string, sub: string>
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR_BUCKET>/<JOB_NAME>/annotations/worker-response/'

Creating Views in Athena

Now, we have nested data in our annotations_raw table. For many use cases, especially for analytical uses, representing data in a tabular fashion—as rows—is more natural. This is also the standard way when using SQL and business intelligence tools. To unnest the hierarchical data into flattened rows, we create the following view which will serve as foundation for the other views we create. For an in-depth look into unnesting data with Amazon Athena, read this blog post.

Some of the information we’re interested in might not be part of the document, but is encoded in the path. We use a trick in Athena by using the $path variable from the Presto Hive Connector. This determines which Amazon S3 file contains data that is returned by a specific row in an Athena table. This way we can find out which data object an annotation belongs to. Since Athena is built on top of Presto, we are able to use Presto’s built-in regexp_extract function to find out the iteration as well as the data object id per labeling result. We also cast the submission time in date format to later determine the labeling progress per day.

CREATE OR REPLACE VIEW annotations_view AS
  regexp_extract("$path", 'iteration-[0-9]*') as iteration,
  regexp_extract("$path", '(iteration-[0-9]*\/([0-9]*))',2) as dataRecord,
  cast(from_iso8601_timestamp(answer.submissionTime) as timestamp) as submissionTime,
  cast(from_iso8601_timestamp(answer.submissionTime) as date) as submissionDay,
  "$path" path
CROSS JOIN UNNEST(answers) AS t(answer)

This view, annotations_view, will be the starting point for the other views we will be creating in further in this post.

Visualizing with QuickSight

In this section, we explore a way to visualize the views we build in Athena by pointing Amazon QuickSight to the respective view. Amazon QuickSight lets you create and publish interactive dashboards that include ML Insights. Dashboards can then be accessed from any device, and embedded into your applications, portals, and websites.

Thanks to the tight integration between Athena and QuickSight, we are able to map one dataset in QuickSight to one Athena view. In order to further optimize the performance of the dashboard, we can optionally import the datasets into the in-memory optimized calculation engine for Amazon QuickSight called SPICE. With the datasets in place we can now create an analysis in order to interact with the visuals we’re going to add. You can think of an analysis as a container for a set of related visuals. You can use multiple datasets in an analysis, although any given visual can only use one of those datasets. After you create an analysis and an initial visual, you can expand the analysis. You can do this for example by adding datasets and visuals.

Let’s start with our first insight.

Annotations per worker

We’d like to gain insights not only into the total number of labeled items but also on the level of contributions of each individual workers. This could give us an indication whether the labels were created by a diverse crowd of labelers or by a few productive ones. A largely disproportionate amount of contributions from a handful of workers who may have brought along their biases.

SageMaker Ground Truth calls labeled data objects annotations, which is the result of a single workers labeling task.

Luckily we encapsulated all the heavy lifting of format conversion in the annotations_view, so that it is now easy to create a view for the annotations per user:

CREATE OR REPLACE VIEW annotations_per_user AS
SELECT COUNT(sub) AS LabeledItems,
sub AS User
FROM annotations_view
ORDER BY LabeledItems DESC

Next we visualize this view in QuickSight. We add a visual to our analysis, select the respective dataset for the view and use the AutoGraph feature, which chooses the most appropriate visual type. Since we already arranged our view in Athena by the number of labeled items in descending order, there is no need now to sort the data in QuickSight. In the following screenshot, worker c4ef78e4... contributed more labels compared to their peers.

Annotations per worker

This view gives you an indicator to check for a bias that the leading worker might have brought along.

Annotations per label

One thing we want to be aware of is potential imbalances between classes in our dataset. Especially simple machine learning models, which may learn to frequently predict a label that is massively over represented in the dataset. If we can identify an imbalance, we can apply mitigation actions such as upsampling data of underrepresented classes. With the following view we list the total number of annotations per label.

CREATE OR REPLACE VIEW annotations_per_label AS
SELECT Count(dataRecord) AS TotalLabels, label As Label 
FROM annotations_view
GROUP BY label
ORDER BY TotalLabels DESC, Label;

As before, we create a dataset in QuickSight pointing to the annotations_per_label view, open the analysis, add a new visual and leverage the AutoGraph functionality. The result is the following visual representation:

Annotations per worker 2

One can clearly see that the Analytics & AI/ML class is massively underrepresented. At this point, you might want to try getting more data or think about upsampling data for that class.

Annotations per day

Seeing the total number of annotations per label and per worker is good, but we are also interested in how the labeling progress changes over time. This way we might see spikes related to labeller activations. We can also or estimate how long it takes to reach a certain goal of annotations given the current pace. For this purpose we create the following view aggregating the total annotations per day.

CREATE OR REPLACE VIEW annotations_per_day AS
SELECT COUNT(datarecord) AS LabeledItems,
FROM annotations_view
GROUP BY submissionDay
ORDER BY submissionDay, LabeledItems DESC

This time the QuickSight AutoGraph provides us with the following line chart. You might have noticed that the axis labels do not match the column names in Athena. That is because we renamed them in QuickSight for better readability.

Total annotations per day

In the preceding chart we see that there is no consistent pace of labeling, which makes it hard to predict when a certain amount of labeled data will be reached. In this example, after starting strong the progress immediately went down. Knowing this, we might want to take action into motivating our workers to contribute more and validate the effectiveness of these actions with the help of this chart. The spikes indicate an effective short-term action.

Distribution of total annotations by user

We already have insights into annotations per worker, per label and per day. Let us now now see what insights we can get from aggregating some of this information.

The bigger your labeling workforce gets, the harder it can become to see the whole picture. For that reason we will now create a histogram consisting of five buckets. Each bucket represents an interval of total annotations (for example, 0-25 annotations) mapped to the number of users whose amount of total annotations lies in that interval. This allows us to get a sense of what kind of bias might be introduced by the majority of annotations being contributed by a small amount of workers.

To do that, we use the Presto function width_bucket which returns the number of labeled data objects according to the five buckets we defined with a size of 25 each. We define these buckets by creating an Array with 5 elements that specify the boundaries.

CREATE OR REPLACE VIEW users_per_bucket_annotations AS
   WHEN bucket=5 THEN 'B' || cast(bucket AS VARCHAR(10)) || ': ' || cast(((bucket-1) * 25) AS VARCHAR(10)) || '+'
   ELSE 'B' || cast(bucket AS VARCHAR(10)) || ': ' || cast(((bucket-1) * 25) AS VARCHAR(10)) || '-' || cast((bucket * 25) AS VARCHAR(10))
END AS NumberOfAnnotations
(SELECT width_bucket(labeleditems,ARRAY[0,25,50,75,100]) AS bucket,
 count(user) AS numberOfUsers
FROM annotations_per_user
ORDER BY bucket)

A SELECT * FROM users_per_bucket_annotations produces the following result:

A SELECT FROM users_per_bucket_annotations

Let’s now investigate the same data via QuickSight:

Annotations per User in buckets of Size 25

Now that we can look at the data visually it becomes clear that we have a bimodal distribution, with many labelers having done very little, and many labelers doing quite a lot. This may warrant interviewing some labelers to find out if there is something holding back users from progressing, or if we can keep engagement high over time.

Putting it all together in QuickSight

Since we created all previous visuals into one analysis, we can now utilize it as a central place to consume our insights in a user-friendly way. Moreover, we can share our insights with others as a read-only snapshot which QuickSight calls a dashboard. User who are dashboard viewers can view and filter the dashboard data as below:

Groundtruth dashboard

Furthermore, you can generate a report and let QuickSight send it either once or on a schedule (daily, weekly or monthly) to your peers. This way users do not have to sign in and they can get reminders to check the progress of the labeling job. Lastly, sending out those reports is an opportunity to stay in touch with the labelers and keep the engagement high.


In this blogpost, we have shown one example of combining multiple AWS services in order to build a solution tailored to your needs. We took the Amazon S3 output generated by SageMaker Ground Truth and showed how it can be further processed and analyzed with Athena. Finally, we created a central place to consume our insights in a user-friendly way with QuickSight. By putting it all together in a dashboard we were able to share our insights with our peers.

You can take the same pattern and apply it to other situations: take some of the many building blocks AWS provides and mix-and-match them to create a unique, integrated solution with cohesive insights just as we did with Ground Truth, Athena, and QuickSight.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Monitoring the Java Virtual Machine Garbage Collection on AWS Lambda

Post Syndicated from Steffen Grunwald original https://aws.amazon.com/blogs/architecture/field-notes-monitoring-the-java-virtual-machine-garbage-collection-on-aws-lambda/

When you want to optimize your Java application on AWS Lambda for performance and cost the general steps are: Build, measure, then optimize! To accomplish this, you need a solid monitoring mechanism. Amazon CloudWatch and AWS X-Ray are well suited for this task since they already provide lots of data about your AWS Lambda function. This includes overall memory consumption, initialization time, and duration of your invocations. To examine the Java Virtual Machine (JVM) memory you require garbage collection logs from your functions. Instances of an AWS Lambda function have a short lifecycle compared to a long-running Java application server. It can be challenging to process the logs from tens or hundreds of these instances.

In this post, you learn how to emit and collect data to monitor the JVM garbage collector activity. Having this data, you can visualize out-of-memory situations of your applications in a Kibana dashboard like in the following screenshot. You gain actionable insights into your application’s memory consumption on AWS Lambda for troubleshooting and optimization.

The lifecycle of a JVM application on AWS Lambda

Let’s first revisit the lifecycle of the AWS Lambda Java runtime and its JVM:

  1. A Lambda function is invoked.
  2. AWS Lambda launches an execution context. This is a temporary runtime environment based on the configuration settings you provide, like permissions, memory size, and environment variables.
  3. AWS Lambda creates a new log stream in Amazon CloudWatch Logs for each instance of the execution context.
  4. The execution context initializes the JVM and your handler’s code.

You typically see the initialization of a fresh execution context when a Lambda function is invoked for the first time, after it has been updated, or it scales up in response to more incoming events.

AWS Lambda maintains the execution context for some time in anticipation of another Lambda function invocation. In effect, the service freezes the execution context after a Lambda function completes. It thaws the execution context when the Lambda function is invoked again if AWS Lambda chooses to reuse it.

During invocations, the JVM also maintains garbage collection as usual. Outside of invocations, the JVM and its maintenance processes like garbage collection are also frozen.

Garbage collection and indicators for your application’s health

The purpose of JVM garbage collection is to clean up objects in the JVM heap, which is the space for an application’s objects. It finds objects which are unreachable and deletes them. This frees heap space for other objects.

You can make the JVM log garbage collection activities to get insights into the health of your application. One example for this is the free heap after each garbage collection. If this metric keeps shrinking, it is an indicator for a memory leak – eventually turning into an OutOfMemoryError. If there is not enough of free heap, the JVM might be too busy with garbage collection instead of running your application code. Otherwise, a heap that is too big does indicate that there’s potential to decrease the memory configuration of your AWS Lambda function. This keeps garbage collection pauses low and provides a consistent response time.

The garbage collection logging can be configured via an environment variable as part of the AWS Lambda function configuration. The environment variable JAVA_TOOL_OPTIONS is considered by both the Java 8 and 11 JVMs. You use it to pass options that you would usually add to the command line when launching the JVM. The options to configure garbage collection logging and the output is specific to the Java version.

Java 11 uses the Unified Logging System (JEP 158 and JEP 271) which has been introduced in Java 9. Logging can be configured with the environment variable:


The Serial Garbage Collector will output the logs:

[<TIMESTAMP>][gc] GC(4) Pause Full (Allocation Failure) 9M->9M(11M) 3.941ms (D)
[<TIMESTAMP>][gc,heap] GC(3) DefNew: 3063K->234K(3072K) (A)
[<TIMESTAMP>][gc,heap] GC(3) Tenured: 6313K->9127K(9152K) (B)
[<TIMESTAMP>][gc,metaspace] GC(3) Metaspace: 762K->762K(52428K) (C)
[<TIMESTAMP>][gc] GC(3) Pause Young (Allocation Failure) 9M->9M(21M) 23.559ms (D)

Prior to Java 9, including Java 8, you configure the garbage collection logging as follows:

JAVA_TOOL_OPTIONS=-XX:+PrintGCDetails -XX:+PrintGCDateStamps

The Serial garbage collector output in Java 8 is structured differently:

<TIMESTAMP>: [GC (Allocation Failure)
    <TIMESTAMP>: [DefNew: 131042K->131042K(131072K), 0.0000216 secs] (A)
    <TIMESTAMP>: [Tenured: 235683K->291057K(291076K), 0.2213687 secs] (B)
    366725K->365266K(422148K), (D)
    [Metaspace: 3943K->3943K(1056768K)], (C)
    0.2215370 secs]
    [Times: user=0.04 sys=0.02, real=0.22 secs]
<TIMESTAMP>: [Full GC (Allocation Failure)
    <TIMESTAMP>: [Tenured: 297661K->36658K(297664K), 0.0434012 secs] (B)
    431575K->36658K(431616K), (D)
    [Metaspace: 3943K->3943K(1056768K)], 0.0434680 secs] (C)
    [Times: user=0.02 sys=0.00, real=0.05 secs]

Independent of the Java version, the garbage collection activities are logged to standard out (stdout) or standard error (stderr). Logs appear in the AWS Lambda function’s log stream of Amazon CloudWatch Logs. The log contains the size of memory used for:

  • A: the young generation
  • B: the old generation
  • C: the metaspace
  • D: the entire heap

The notation is before-gc -> after-gc (committed heap). Read the JVM Garbage Collection Tuning Guide for more details.

Visualizing the logs in Amazon Elasticsearch Service

It is hard to fully understand the garbage collection log by just reading it in Amazon CloudWatch Logs. You must visualize it to gain more insight. This section describes the solution to achieve this.

Solution Overview

Java Solution Overview

Amazon CloudWatch Logs have a feature to stream CloudWatch Logs data to Amazon Elasticsearch Service via an AWS Lambda function. The AWS Lambda function for log transformation is subscribed to the log group of your application’s AWS Lambda function. The subscription filters for a pattern that matches the one of the garbage collection log entries. The log transformation function processes the log messages and puts it to a search cluster. To make the data easy to digest for the search cluster, you add code to transform and convert the messages to JSON. Having the data in a search cluster, you can visualize it with Kibana dashboards.

Get Started

To start, launch the solution architecture described above as a prepackaged application from the AWS Serverless Application Repository. It contains all resources ready to visualize the garbage collection logs for your Java 11 AWS Lambda functions in a Kibana dashboard. The search cluster consists of a single t2.small.elasticsearch instance with 10GB of EBS storage. It is protected with Amazon Cognito User Pools so you only need to add your user(s). The T2 instance types do not support encryption of data at rest.

Read the source code for the application in the aws-samples repository.

1. Spin up the application from the AWS Serverless Application Repository:

launch stack button

2. As soon as the application is deployed completely, the outputs of the AWS CloudFormation stack provide the links for the next steps. You will find two URLs in the AWS CloudFormation console called createUserUrl and kibanaUrl.

search stack

3. Use the createUserUrl link from the outputs, or navigate to the Amazon Cognito user pool in the console to create a new user in the pool.

a. Enter an email address as username and email. Enter a temporary password of your choice with at least 8 characters.

b. Leave the phone number empty and uncheck the checkbox to mark the phone number as verified.

c. If necessary, you can check the checkboxes to send an invitation to the new user or to make the user verify the email address.

d. Choose Create user.

create user dialog of Amazon Cognito User Pools

4. Access the Kibana dashboard with the kibanaUrl link from the AWS CloudFormation stack outputs, or navigate to the Kibana link displayed in the Amazon Elasticsearch Service console.

a. In Kibana, choose the Dashboard icon in the left menu bar

b. Open the Lambda GC Activity dashboard.

You can test that new events appear by using the Kibana Developer Console:

POST gc-logs-2020.09.03/_doc
  "@timestamp": "2020-09-03T15:12:34.567+0000",
  "@gc_type": "Pause Young",
  "@gc_cause": "Allocation Failure",
  "@heap_before_gc": "2",
  "@heap_after_gc": "1",
  "@heap_size_gc": "9",
  "@gc_duration": "5.432",
  "@owner": "123456789012",
  "@log_group": "/aws/lambda/myfunction",
  "@log_stream": "2020/09/03/[$LATEST]123456"

5. When you go to the Lambda GC Activity dashboard you can see the new event. You must select the right timeframe with the Show dates link.

Lambda GC activity

The dashboard consists of six tiles:

  • In the Filters you optionally select the log group and filter for a specific AWS Lambda function execution context by the name of its log stream.
  • In the GC Activity Count by Execution Context you see a heatmap of all filtered execution contexts by garbage collection activity count.
  • The GC Activity Metrics display a graph for the metrics for all filtered execution contexts.
  • The GC Activity Count shows the amount of garbage collection activities that are currently displayed.
  • The GC Duration show the sum of the duration of all displayed garbage collection activities.
  • The GC Activity Raw Data at the bottom displays the raw items as ingested into the search cluster for a further drill down.

Configure your AWS Lambda function for garbage collection logging

1. The application that you want to monitor needs to log garbage collection activities. Currently the solution supports logs from Java 11. Add the following environment variable to your AWS Lambda function to activate the logging.


The environment variables must reflect this parameter like the following screenshot:

environment variables

2. Go to the streamLogs function in the AWS Lambda console that has been created by the stack, and subscribe it to the log group of the function you want to monitor.

streamlogs function

3. Select Add Trigger.

4. Select CloudWatch Logs as Trigger Configuration.

5. Input a Filter name of your choice.

6. Input "[gc" (including quotes) as the Filter pattern to match all garbage collection log entries.

7. Select the Log Group of the function you want to monitor. The following screenshot subscribes to the logs of the application’s function resize-lambda-ResizeFn-[...].

add trigger

8. Select Add.

9. Execute the AWS Lambda function you want to monitor.

10. Refresh the dashboard in Amazon Elasticsearch Service and see the datapoint added manually before appearing in the graph.

Troubleshooting examples

Let’s look at an example function and draw some useful insights from the Java garbage collection log. The following diagrams show the Sample Amazon S3 function code for Java from the AWS Lambda documentation running in a Java 11 function with 512 MB of memory.

  • An S3 event from a new uploaded image triggers this function.
  • The function loads the image from S3, resizes it, and puts the resized version to S3.
  • The file size of the example image is close to 2.8MB.
  • The application is called 100 times with a pause of 1 second.

Memory leak

For the demonstration of a memory leak, the function has been changed to keep all source images in memory as a class variable. Hence the memory of the function keeps growing when processing more images:

GC activity metrics

In the diagram, the heap size drops to zero at timestamp 12:34:00. The Amazon CloudWatch Logs of the function reveal an error before the next call to your code in the same AWS Lambda execution context with a fresh JVM:

Java heap space: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
 at java.desktop/java.awt.image.DataBufferByte.<init>(Unknown Source)

The JVM crashed and was restarted because of the error. You leverage primarily the Amazon CloudWatch Logs of your function to detect errors. The garbage collection log and its visualization provide additional information for root cause analysis:

Did the JVM run out of memory because a single image to resize was too large?

Or was the memory issue growing over time?

The latter could be an indication that you have a memory leak in your code.

The Heap size is too small

For the demonstration of a heap that was chosen too small, the memory leak from the preceding image has been resolved, but the function was configured to 128MB of memory. From the baseline of the heap to the maximum heap size, there are only approximately 5 MB used.

GC activity metrics

This will result in a high management overhead of your JVM. You should experiment with a higher memory configuration to find the optimal performance also taking cost into account. Check out AWS Lambda power tuning open source tool to do this in an automated fashion.

Finetuning the initial heap size

If you review the development of the heap size at the start of an execution context, this indicates that the heap size is continuously increased. Each heap size change is an expensive operation consuming time of your function. Over time, the heap size is changed as well. The garbage collector logs 502 activities, which take almost 17 seconds overall.

GC activity metrics

This on-demand scaling is useful on a local workstation where the physical memory is shared with other applications. On AWS Lambda, the configured memory is dedicated to your function, so you can use it to its full extent.

You can do so by setting the minimum and maximum heap size to a fixed value by appending the -Xms and -Xmx parameters to the environment variable we introduced before.

The heap is not the only part of the JVM that consumes memory, so you must experiment with this setting and closely monitor the performance.

Start with the heap size that you observe to be working from the garbage collection log. If you set the heap size too large, your function will not initialize at all or break unexpectedly. Remember that the ability to tweak JVM parameters might change with future service features.

Let’s set 400 MB of the 512 MB memory and examine the results:

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags -Xms400m -Xmx400m

GC activity metrics

The preceding dashboard shows that the overall garbage collection duration was reduced by about 95%. The garbage collector had 80% fewer activities.

The garbage collection log entries displayed in the dashboard reveal that exclusively minor garbage collection (Pause Young) activities were triggered instead of major garbage collections (Pause Full). This is expected as the images are immediately discarded after the download, resize, upload operation. The effect on the overall function durations of 100 invocations, is a 5% decrease on average in this specific case.

Lambda duration

Cost estimation and clean up

Cost for the processing and transformation of your function’s Amazon CloudWatch Logs incurs when your function is called. This cost depends on your application and how often garbage collection activities are triggered. Read an estimate of the monthly cost for the search cluster. If you do not need the garbage collection monitoring anymore, delete the subscription filter from the log group of your AWS Lambda function(s). Also, delete the stack of the solution above in the AWS CloudFormation console to clean up resources.


In this post, we examined further sources of data to gain insights about the health of your Java application. We also demonstrated a pipeline to ingest, transform, and visualize this information continuously in a Kibana dashboard. As a next step, launch the application from the AWS Serverless Application Repository and subscribe it to your applications’ logs. Feel free to submit enhancements to the application in the aws-samples repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Creating an EC2 instance in the AWS Wavelength Zone

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/creating-an-ec2-instance-in-the-aws-wavelength-zone/

Creating an EC2 instance in the AWS Wavelength Zone

This blog post is contributed by Saravanan Shanmugam, Lead Solution Architect, AWS Wavelength

AWS announced Wavelength at re:Invent 2019 in partnership with Verizon in US, SK Telecom in South Korea, KDDI in Japan, and Vodafone in UK and Europe. Following the re:Invent 2019 announcement, on August 6, 2020, AWS announced GA of one Wavelength Zone with Verizon in Boston connected to US East (N.Virginia) Region and one in San Francisco connected to the US West (Oregon) Region.

In this blog, I walk you through the steps required to create an Amazon EC2 instance in an AWS Wavelength Zone from the AWS Management console. We also address the questions asked by our customers regarding the different protocol traffic allowed into and out of a AWS Wavelength Zones.

Customers who want to access AWS Wavelength Zones and deploy their applications to the Wavelength Zone can sign up using this link. Customers that opted in to access the AWS Wavelength Zone can confirm the status on the EC2 console Account Attribute section as shown in the following image.

 Services and features

AWS Wavelength Zones are Availability Zones inside the Carrier Service Provider network closer to the Edge of the Mobile Network. Wavelength Zones bring the AWS core compute and storage services like Amazon EC2 and Amazon EBS that can be used by other services like Amazon EKS and Amazon ECS. We look at Wavelength Zone(s) as a hub and spoke model, where developers can deploy latency sensitive, high-bandwidth applications at the Edge and non-latency sensitive and data persistent applications in the Region.

Wavelength Zones supports three Nitro based Amazon EC2 instance types t3 (t3.medium, t3.xlarge) r5 (r5.2xlarge) and g4 (g4dn.2xlarge) with EBS volume types gp2. Customers can also use Amazon ECS and Amazon EKS to deploy container applications at the Edge. Other AWS Services, like AWS CloudFormation templates, CloudWatch, IAM resources, and Organizations, continue to work as expected, providing you a consistent experience. You can also leverage the full suite of services like Amazon S3 in the parent Region over AWS’s private network backbone. Now that we have reviewed AWS wavelength, the services and features associated with it, let us talk about the steps to launch an EC2 instance in the AWS Wavelength zone.

Creating a Subnet in the Wavelength Zone

Once the Wavelength Zone is enabled for your AWS Account, you can extend your existing VPC from the parent Region to a Wavelength Zone by creating a new VPC subnet assigned to the AWS Wavelength Zone. Customers can also create a new VPC and then a Subnet to deploy their applications in the Wavelength zone. The following image shows the Subnet creation step, where you pick the Wavelength Zone as the Availability zone for the subnet

Carrier Gateway

We have introduced a new gateway type called Carrier Gateway, which allows you to route traffic from the Wavelength Zone subnet to the CSP network and to the Internet. Carrier Gateways are similar to the Internet gateway in the Region. Carrier Gateway is also responsible for NAT’ing the traffic from/to the Wavelength Zone subnets mapping it to the carrier ip address assigned to the instances.

Creating a Carrier Gateway

In the VPC console, you can now create Carrier Gateway and attach it to your VPC.

You select the VPC to which the Carrier Gateway must be attached. There is also option to select “Route subnet traffic to the Carrier Gateway” in the Carrier Gateway creation step. By selecting this option, you can pick the Wavelength subnets you want to default route to the Carrier Gateway. This option automatically deletes the existing route table to the subnets, creates a new route table, creates a default route entry, and attaches the new route table to the Subnets you selected. The following picture captures the necessary input required while creating a Carrier Gateway


Creating an EC2 instance in a Wavelength Zone with Private IP Address

Once a VPC subnet is created for the AWS Wavelength Zone, you can launch an EC2 instance with a Private address using the EC2 Launch Wizard. In the configure instance details step, you can select the Wavelength Zone Subnet that you created in the “Creating a Subnet” section.

Attach a IAM profile with SSM role included, which allows you to SSH into the console of the instance through SSM. This is a recommended practice for Wavelength Zone instances as there is no direct SSH access allowed from Public internet.

 Creating an EC2 instance in a Wavelength Zone with Carrier IP Address

The instances running in the Wavelength Zone subnets can obtain a Carrier IP address, which is allocated from a pool of IP addresses called Network Border group (NBG). To create an EC2 instance in the Wavelength Zone with a carrier routable IP address, you can use AWS CLI. You can use the following command to create EC2 instance in a Wavelength Zone subnet. Note the additional network interface (NIC) option “AssociateCarrierIpAddress: as part of the EC2 run instance command, as shown in the following command.

aws ec2 --region us-west-2 run-instances --network-interfaces '[{"DeviceIndex":0, "AssociateCarrierIpAddress": true, "SubnetId": "<subnet-0d3c2c317ac4a262a>"}]' --image-id <ami-0a07be880014c7b8e> --instance-type t3.medium --key-name <san-francisco-wavelength-sample-key>

 *To use “AssociateCarrierIpAddress” option in the ec2 run-instance command use the latest aws cli v2.

The carrier IP assigned to the EC2 instance can be obtained by running the following command.

 aws ec2 describe-instances --instance-ids <replace-with-your-instance-id> --region us-west-2

 Make necessary changes to the default security group that is attached to the EC2 instance after running the run-instance command to allow the necessary protocol traffic. If you allow ICMP traffic to your EC2 instance, you can test ICMP connectivity to your instance from the public internet.

The different protocols allowed in and out of the Wavelength Zone are captured in the following table.


TCP Connection FROMTCP Connection TOResult*
Region ZonesWL ZonesAllowed
Wavelength ZonesRegionAllowed
Wavelength ZonesInternetAllowed
Internet (TCP SYN)WL ZonesBlocked
Internet (TCP EST)WL ZonesAllowed
Wavelength ZonesUE (Radio)Allowed
UE(Radio)WL ZonesAllowed


UDP Packets FROMUDP Packets TOResult*
Wavelength ZonesWL ZonesAllowed
Wavelength ZonesRegionAllowed
Wavelength ZonesInternetAllowed
Wavelength ZonesUE (Radio)Allowed
UE(Radio)WL ZonesAllowed


Wavelength ZonesWL ZonesAllowed
Wavelength ZonesRegionAllowed
Wavelength ZonesInternetAllowed
Wavelength ZonesUE (Radio)Allowed
UE(Radio)WL ZonesAllowed


We have covered how to create and run an EC2 instance in the AWS Wavelength Zone, the core foundation for application deployments. We will continue to publish blogs helping customers to create ECS and EKS clusters in the AWS Wavelength Zones and deploy container applications at the Mobile Carriers Edge. We are really looking forward to seeing what all you can do with them. AWS would love to get your advice on additional local services/features or other interesting use cases, so feel free to leave us your comments!


Field Notes: Requirements for Successfully Installing CloudEndure

Post Syndicated from Daniel Covey original https://aws.amazon.com/blogs/architecture/field-notes-requirements-for-successfully-installing-cloudendure/

Customers have been using CloudEndure for their Migration and Disaster Recovery needs for many years. In 2019, CloudEndure was acquired by AWS, and provided the licensing for CloudEndure to all of their users free of charge for migration. During this time, AWS has identified the requirements for replication to complete successfully after initial agent install. Customers can use the following tips to facilitate a smooth transition to AWS.

In this blog, we look at four sections of the CloudEndure configuration process required for a successful installation:

  1. CloudEndure Port configuration
  2. CloudEndure JSON Policy Options
  3. CloudEndure Staging Area Configuration
  4. CloudEndure Configuration for Proxies

Required CloudEndure Ports

There are 2 required ports that CloudEndure uses. TCP Ports 1500 and 443 have particular configurations based on source or staging area. TCP 1500 is used for replication of data, and 443 is for agent communication with the CloudEndure Console.

Architecture Overview

The following graphic is a high level overview of the required ports for CloudEndure, both from the source infrastructure, and the staging subnet you will be replicating to.

network architecture


  1. Is 443 outbound open to console.cloudendure.com on the source infrastructure?
  • Check OS level firewall
  • Check proxy settings
  • Ensure there is no SSL intercept or Deep Packet Inspection being done to packets from that machine

        2. Is 443 outbound open to console.cloudendure.com in the AWS Security Group assigned to the replication subnet?

  • Check no NACLs are in place to prevent SSL traffic outbound from the subnet
  • Check the machines on this subnet can reach the EC2 endpoint for the region.
  • If you have any restrictions on accessing Amazon S3 buckets, you can have CloudEndure use a CloudFront distribution instead.
  • Review The CloudEndure documentation for how to do this

       3. Is 1500 outbound open to the Staging Subnet from the source infrastructure?

  • Check OS level firewall
  • Check proxy settings

4. Is 1500 inbound from the source infrastructure open on the Security Group assigned to the replication subnet?

  • Check no NACLs are in place preventing traffic.

CloudEndure JSON Policies 

CloudEndure uses one of these JSON policies attached to IAM Users. These policies give CloudEndure specific access to your AWS account resources. This launches specific resources needed to ensure the tool is working properly. CloudEndure JSON policies use tag filtering to limit the creation and deletion of resources.

For the JSON policy, CloudEndure expects a specific set of permissions, even in the case where we may not be using them. CloudEndure does a policy check first, to ensure all permissions are available. It is not recommended to change the JSON policies, as it can cause CloudEndure to fail initial replication configuration. Use one of the following three policies.


  • Best policy to use if you are doing Inter-AWS replication, such as Region-to-Region, or AZ-to-AZ replication


  • Default JSON policy. Allows for access to any of the resources needed by CloudEndure

Tagging based

  • A more restrictive policy, for customers that need a more secure solution.

Staging Area Configuration

CloudEndure replicates to a “Staging Area”, where you control the Replication Server and the AWS EBS volumes attached to that server. You define which VPC and Subnet you want CloudEndure to replicate to, with the following considerations.

staging area

  1. Default Subnet
    • You designate your specific AWS Subnet to use for replication here. Leaving the option as “Default” uses the default subnet for the VPC, which is usually deleted by customers when first configuring their VPCs.

2. Default Security group

    • This is often created by the Cloudendure tool, cannot be changed, and will be added if replication disconnects. Any changes made to this SG will be reverted back to default rules.
    • If utilizing a proxy, it is advised to add a Security Group that also allows access to the proxy

Proxy Servers

Some customers utilize Proxy servers within their environment. Review the following guidance regarding specific changes to configurations within your environment needed for CloudEndure to operate effectively.

  1. Make sure to set proxy in replication settings
    • This can be either IP address, or an FQDN

       2. Note the following for either Windows or Linux

    • Windows – CloudEndure agent runs as System, so please ensure the System account is part of the allow list in the proxy.
    • Linux – CloudEndure Agent creates a linux user to run commands (named cloudendure), so this user will need to be part of the allow list in the proxy

       3. Make sure environment variables are set for the machines

  • Windows Steps
    • Control Panel > System and Security > System > Advanced system settings.
    • In the Advanced Tab of the System Properties dialog box, select Environment Variables 
    • On the System Variables section of the Environment Variables pane, select New to add the https_proxy environment variable or Edit if the variable already exists.
    • Enter https://PROXY_ADDR:PROXY_PORT/ in the Variable value field. Select OK.
    • If the agent was already installed, restart the service
  • Alternatively, you can open CMD as Administrator and enter the following command:

setx https_proxy https://<proxy ip>:<proxy port>/ /m

  • Linux Steps
    • Complete one of the following lines in the terminal
    • $ export http_proxy=http://server-ip:port/
    • $ export http_proxy=
    • $ export http_proxy=http://proxy-server.mycorp.com:3128/ Please note to include the trailing /

Cleaning Up

After you have finished utilizing the CloudEndure tool, remove any resources you may no longer need.


In conclusion, I have showed how best to prepare your environment for installation of the CloudEndure tool. CloudEndure is utilized to protect your business, and mitigate downtime during your move to the cloud. By following the preceding steps, you set up the configuration for success. Visit the AWS landing page for CloudEndure, to get a deeper understanding of the tool, get started with CloudEndure, or take the free online technical training. Should you need assistance with other configurations, visit the CloudEndure Documentation Library, which includes every aspect of the tooling, as well as a helpful FAQ.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Using Lambda layers to simplify your development process

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-lambda-layers-to-simplify-your-development-process/

Serverless developers frequently import libraries and dependencies into their AWS Lambda functions. While you can zip these dependencies as part of the build and deployment process, in many cases it’s easier to use layers instead. In this post, I explain how layers work, and how you can build and include layers in your own applications.

This blog post references the Happy Path application, which shows how to build a flexible backend to a photo-processing web application. To learn more, refer to Using serverless backends to iterate quickly on web apps – part 1. This code in this post is available at this GitHub repo.

Overview of Lambda layers

A Lambda layer is an archive containing additional code, such as libraries, dependencies, or even custom runtimes. When you include a layer in a function, the contents are extracted to the /opt directory in the execution environment. You can include up to five layers per function, which count towards the standard Lambda deployment size limits.

Layers are deployed as immutable versions, and the version number increments each time you publish a new layer. When you include a layer in a function, you specify the layer version you want to use. Layers are automatically set as private, but they can be shared with other AWS accounts, or shared publicly. Permissions only apply to a single version of a layer.

Using layers can make it faster to deploy applications with the AWS Serverless Application Model (AWS SAM) or the Serverless framework. By moving runtime dependencies from your function code to a layer, this can help reduce the overall size of the archive uploaded during a deployment.

Creating a layer containing the AWS SDK

The AWS SDK allows you to interact programmatically with AWS services using one of the supported runtimes. The Lambda service includes the AWS SDK so you can use it without explicitly importing in your deployment package.

However, there is no guarantee of the version provided in the execution environment. The SDK is upgraded frequently to support new AWS services and features. As a result, the version may change at any time. You can see the current version used by Lambda by declaring an instance of the SDK and logging out the version method:

Logging out the version method

For production workloads, it’s best practice to lock the version of the AWS SDK used in your functions. You can achieve this by including the SDK with your code package. Once you include this library, your code always uses the version in the deployment package and not the version included in the Lambda service.

A serverless application may consist of many functions, which all use a common SDK version. Instead of bundling the SDK with each function deployment, you can create a layer containing the SDK. The effect of this is to reduce the size of the uploaded archive, which makes your deployments faster.

To create an AWS SDK layer:

  1. First, clone this blog post’s GitHub repo. From a terminal window, execute:
    git clone https://github.com/aws-samples/aws-lambda-layers-aws-sam-examples
    cd ./aws-sdk-layer
  2. This directory contains an AWS SAM template and Node.js package.json file. Install the package.json contents:
    npm install
  3. Create the layer directory defined in the AWS SAM template and the nodejs directory required by Lambda. Next, move the node_modules directory:
    mkdir -p ./layer/nodejs
    mv ./node_modules ./layer/nodejs
  4. Next, deploy the AWS SAM template to create the layer:
    sam deploy --guided
  5. For the Stack name, enter “aws-sdk-layer”. Enter your preferred AWS Region and accept the other defaults.
  6. After the deployment completes, the new Lambda layer is available to use. Run this command to see the available layers:aws lambda list-layersaws lambda list-layers output

After adding a layer to a function, you can use console.log to log out the AWS SDK version. This shows that the function is now using the SDK version in the layer instead of the version provided by the Lambda service:

Use the SDK layer instead of the bundled layer

Creating layers with OS-specific binaries

Many code libraries include binaries that are operating-system specific. When you build packages on your local development machine, by default the binaries for that operating system are used. These may not be the right binaries for Lambda, which runs on Amazon Linux. If you are not using a compatible operating system, you must ensure you include Linux binaries in the layer.

The simplest way to package these libraries correctly is to use AWS Cloud9. This is an IDE in the AWS Cloud, which runs on Amazon EC2. After creating an environment, you can clone a git repository directly to the local storage of the instance, and run the necessary build scripts.

The Happy Path application resizes images using the Sharp npm library. This library uses libvips, which is written in C, so the compilation is operating system-specific. By creating a layer containing this library, it simplifies the packaging and deployment of the consuming Lambda function.

To create a Sharp layer using AWS Cloud9:

  1. Navigate to the AWS Cloud9 console.
  2. Choose Create environment.
  3. Enter the name “My IDE” and choose Next step.
  4. Accept all the default and choose Next step.
  5. Review the settings and choose Create environment.
  6. In the terminal panel, enter:
    git clone https://github.com/aws-samples/aws-lambda-layers-aws-sam-examples
    cd ./aws-lambda-layers-aws-sam-examples/sharp-layer
    npm installCreating a layer in Cloud9
  7. From a terminal window, ensure you are in the directory where you cloned this post’s GitHub repo. Execute the following commands:cd ./sharp-layer
    npm install
    mkdir -p ./layer/nodejs
    mv ./node_modules ./layer/nodejsCreating the layer in Cloud9
  8. Next, deploy the AWS SAM template to create the layer:
    sam deploy --guided
  9. For the Stack name, enter “sharp-layer”. Enter your preferred AWS Region and accept the other defaults. After the deployment completes, the new Lambda layer is available to use.

In some runtimes, you can specify a local set of packages for development, and another set for production. For example, in Node.js, the package.json file allows you to specify two sections for dependencies. If your development machine uses a different operating system to Lambda, and therefore uses different binaries, you can use package.json to resolve this. In the Happy Path Resizer function, which uses the Sharp layer, the package.json refers to a local binary for development.

Adding development dependencies to package.json

AWS SAM defines Lambda functions with the AWS::Serverless::Function resource. Layers are defined as a property of functions, as a list of layer ARNs including the version:

    Type: AWS::Serverless::Function 
      CodeUri: myFunction/
      Handler: app.handler
      MemorySize: 128
        - !Ref SharpLayerARN

Sharing a layer

Layers are private to your account by default but you can optionally share with other AWS accounts or make a layer public. You cannot share layers via the AWS Management Console but instead use the AWS CLI.

To share a layer, use add-layer-version-permission, specifying the layer name, version, AWS Region, and principal:

aws lambda add-layer-version-permission \
  --layer-name node-sharp \
  --principal '*' \
  --action lambda:GetLayerVersion \
  --version-number 3 
  --statement-id public 
  --region us-east-1

In the principal parameter, specify an individual account ID or use an asterisk to make the layer public. The CLI responds with a RevisionId containing the current revision of the policy:

add-layer-version output

You can check the permissions associated with a layer version by calling get-layer-version-policy with the layer name and version:

aws lambda get-layer-version-policy \
  --layer-name node-sharp \
  --version-number 3 \
  --region us-east-1

get-layer-version-policy output

Similarly, you can delete permissions associated with a layer version by calling remove-layer-vesion-permission with the layer name, statement ID, and version:

aws lambda remove-layer-version-permission \
 -- layer-name node-sharp \
 -- statement-id public \
 -- version-number 3

Once the permissions are removed, calling get-layer-version-policy results in an error:

Error invoking after removal


Lambda layers provide a convenient and effective way to package code libraries for sharing with Lambda functions in your account. Using layers can help reduce the size of uploaded archives and make it faster to deploy your code.

Layers can contain packages using OS-specific binaries, providing a convenient way to distribute these to developers. While layers are private by default, you can share with other accounts or make a layer public. Layers are published as immutable versions, and deleting a layer has no effect on deployed Lambda functions already using that layer.

To learn more about using Lambda layers, visit the documentation, or see how layers are used in the Happy Path web application.

Federated multi-account access for AWS CodeCommit

Post Syndicated from Steven David original https://aws.amazon.com/blogs/devops/federated-multi-account-access-for-aws-codecommit/

As a developer working in a large enterprise or for a group that supports multiple products, you may often find yourself accessing Git repositories from different organizations. Currently, to securely access multiple Git repositories in other popular tools, you need SSH keys, GPG keys, a Git credential helper, and a significant amount of setup by the developer hoping to commit to the repository. In addition, administrators must be aware of the various ways to remove all the permissions granted to the developer.

AWS CodeCommit is a managed source control service. Combined with AWS Single Sign-On (AWS SSO) and git-remote-codecommit, you can quickly and easily switch between repositories owned by different groups or even managed in separate AWS accounts. You can control those permissions with AWS Identity and Access Management (IAM) roles to allow for the automated removal of the user’s permission as part of their off-boarding procedure for the company.

This post demonstrates how to grant access to various CodeCommit repositories without access keys.

Solution overview

In this solution, the user’s access is controlled with federated login via AWS SSO. You can grant that access using AWS native authentication, which eliminates the need for a Git credential helper, SSH, and GPG keys. In addition, this allows the administrator to control access by adding or removing the user’s IAM role access.

The following diagram shows the code access pattern you can achieve by using AWS SSO and git-remote-codecommit to access CodeCommit across multiple accounts.

git-remote-codecommit overview diagram


To complete this tutorial, you must have the following prerequisites:

  • CodeCommit repositories in two separate accounts. For instructions, see Create an AWS CodeCommit repository.
  • AWS SSO set up to handle access federation. For instructions, see Enable AWS SSO.
  • Python 3.6 or higher installed on the developer’s local machine. To download and install the latest version of Python, see the Python website.
    • On a Mac, it can be difficult to ensure that you’re using Python 3.6, because 2.7 is installed and required by the OS. For more information about checking your version of Python, see the following GitHub repo.
  • Git installed on your local machine. To download Git, see Git Downloads.
  • PIP version 9.0.3 or higher installed on your local machine. For instructions, see Installation on the PIP website.

Configuring AWS SSO role permissions

As your first step, you should make sure each AWS SSO role has the correct permissions to access the CodeCommit repositories.

  1. On the AWS SSO console, choose AWS Accounts.
  2. On the Permissions Sets tab, choose Create permission set.
  3. On the Create a new permission set page, select Create a custom permission set.
  4. For Name, enter CodeCommitDeveloperAccess.
  5. For Description, enter This permission set gives the user access to work with CodeCommit for common developer tasks.
  6. For Session duration, choose 12 hours.

Create new permissions

  1. For Relay state, leave blank.
  2. For What policies do you want to include in your permissions set?, select Create a custom permissions policy.
  3. Use the following policy:
    "Version": "2012-10-17",
    "Statement": [
             "Sid": "CodeCommitDeveloperAccess",
             "Effect": "Allow",
             "Action": [
             "Resource": "*"

The preceding code grants access to all the repositories in the account. You could limit to a specific list of repositories, if needed.

  1. Choose Create.

Creating your AWS SSO group

Next, we need to create the SSO Group we want to assign the permissions.

  1. On the AWS SSO console, choose Groups.
  2. Choose create group.
  3. For Group name, enter CodeCommitAccessGroup.
  4. For Description, enter Users assigned to this group will have access to work with CodeCommit.

Create Group

  1. Choose Create.

Assigning your group and permission sets to your accounts

Now that we have our group and permission sets created, we need to assign them to the accounts with the CodeCommit repositories.

  1. On the AWS SSO console, choose AWS Accounts.
  2. Choose the account you want to use in your new group.
  3. On the account Details page, choose Assign Users.
  4. On the Select users or groups page, choose Group.
  5. Select CodeCommitGroup.
  6. Choose NEXT: Permission Sets.
  7. Choose the CodeCommitDeveloperAccess permission set and choose Finish

Assign Users

  1. Choose Proceed to Accounts to return to the AWS SSO console.
  2. Repeat these steps for each account that has a CodeCommit repository.

Assigning a user to the group

To wrap up our AWS SSO configuration, we need to assign the user to the group.

  1. On the AWS SSO console, choose Groups.
  2. Choose CodeCommitAccessGroup.
  3. Choose Add user.
  4. Select all the users you want to add to this group.
  5. Choose Add user(s).
  6. From the navigation pane, choose Settings.
  7. Record the user portal URL to use later.

Enabling AWS SSO login

The second main feature we want to enable is AWS SSO login from the AWS Command Line Interface (AWS CLI) on our local machine.

  1. Run the following command from the AWS CLI. You need to enter the user portal URL from the previous step and tell the CLI what Region has your AWS SSO deployment. The following code example has AWS SSO deployed in us-east-1:
aws configure sso 
SSO start URL [None]: https://my-sso-portal.awsapps.com/start 
SSO region [None]:us-east-1

You’re redirected to your default browser.

  1. Sign in to AWS SSO.

When you return to the CLI, you must choose your account. See the following code:

There are 2 AWS accounts available to you.
> DeveloperResearch, [email protected] (123456789123)
DeveloperTrading, [email protected] (123456789444)
  1. Choose the account with your CodeCommit repository.

Next, you see the permissions sets available to you in the account you just picked. See the following code:

Using the account ID 123456789123
There are 2 roles available to you.
> ReadOnly
  1. Choose the CodeCommitDeveloperAccess permissions.

You now see the options for the profile you’re creating for these AWS SSO permissions:

CLI default client Region [None]: us-west-2<ENTER>
CLI default output format [None]: json<ENTER>
CLI profile name [123456789011_ReadOnly]: DevResearch-profile<ENTER>
  1. Repeat these steps for each AWS account you want to access.

For example, I create DevResearch-profile for my DeveloperResearch account and DevTrading-profile for the DeveloperTrading account.

Installing git-remote-codecommit

Finally, we want to install the recently released git-remote-codecommit and start working with our Git repositories.

  1. Install git-remote-codecommit with the following code:
pip install git-remote-codecommit

With some operating systems, you might need to run the following code instead:

sudo pip install git-remote-codecommit
  1. Clone the code from one of your repositories. For this use case, my CodeCommit repository is named MyDemoRepo. See the following code:
git clone codecommit://[email protected] my-demo-repo
  1. After that solution is cloned locally, you can copy code from another federated profile by simply changing to that profile and referencing the repository in that account named MyDemoRepo2. See the following code:
git clone codecommit://[email protected] my-demo-repo2

Cleaning up

At the end of this tutorial, complete the following steps to undo the changes you made to your local system and AWS:

  1. On the AWS SSO console, remove the user from the group you created, so any future access requests fail.
  2. To remove the AWS SSO login profiles, open the local config file with your preferred tool and remove the profile.
    1. The config file is located at %UserProfile%/.aws/config for Windows and $HOME/.aws/config for Linux or Mac.
  3. To remove git-remote-codecommit, run the PIP uninstall command:
pip uninstall git-remote-codecommit

With some operating systems, you might need to run the following code instead:

sudo pip uninstall git-remote-codecommit


This post reviewed an approach to securely switch between repositories and work without concerns about one Git repository’s security credentials interfering with the other Git repository. User access is controlled by the permissions assigned to the profile via federated roles from AWS SSO. This allows for access control to CodeCommit without needing access keys.

About the Author

Steven David
Steven David

Steven David is an Enterprise Solutions Architect at Amazon Web Services. He helps customers build secure and scalable solutions. He has background in application development and containers.

Field Notes: Deploying UiPath RPA Software on AWS

Post Syndicated from Yuchen Lin original https://aws.amazon.com/blogs/architecture/field-notes-deploying-uipath-rpa-software-on-aws/

Running UiPath RPA software on AWS leverages the elasticity of the AWS Cloud, to set up, operate, and scale robotic process automation. It provides cost-efficient and resizable capacity, and scales the robots to meet your business workload. This reduces the need for administration tasks, such as hardware provisioning, environment setup, and backups. It frees you to focus on business process optimization by automating more processes.

This blog post guides you in deploying UiPath robotic processing automation (RPA) software on AWS. RPA software uses the user interface to capture data and manipulate applications just like humans do. It runs as a software robot to interpret, and trigger responses, as well as communicate with other systems to perform a variety of repetitive tasks.

UiPath Enterprise RPA Platform provides the full automation lifecycle including discover, build, manage, run, engage, and measure with different products. This blog post focuses on the Platform’s core products: build with UiPath Studio, manage with UiPath Orchestrator and run with UiPath Robots.

About UiPath software

UiPath Enterprise RPA Platform’s core products are:

UiPath Studio and UiPath Robot are individual products, you can deploy each on a standalone machine.

UiPath Orchestrator contains Web Servers, SQL Server and Indexer Server (Elasticsearch), you can use Single Machine deployment, or Multi-Node deployment, depends on the workload capacity and availability requirements.

For information on UiPath platform offerings, review UiPath platform products.

UiPath on AWS

You can deploy all UiPath products on AWS.

  • UiPath Studio is needed for automation design jobs and runs on single machine. You deploy it with Amazon EC2.
  • UiPath Robots are needed for automation tasks, runs on a single machine, and scales with the business workload. You deploy it with Amazon EC2 and scale with Amazon EC2 Auto Scaling.
  • UiPath Orchestrator is needed for automation administration jobs and contains three logical components that run on multiple machines. You deploy Web Server with Amazon EC2, SQL Server with Amazon RDS, and Indexer Server with Amazon Elasticsearch Service. For Multi-Node deployment, you deploy High Availability Add-On with Amazon EC2.

The architecture of UiPath Enterprise RPA Platform on AWS looks like the following diagram:

Figure 1 - UiPath Enterprise RPA Platform on AWS

Figure 1 – UiPath Enterprise RPA Platform on AWS

By deploying the UiPath Enterprise RPA Platform on AWS, you can set up, operate, and scale workloads. This controls the infrastructure cost to meet process automation workloads.


For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • AWS resources
  • UiPath Enterprise RPA Platform software
  • Basic knowledge of Amazon EC2, EC2 Auto Scaling, Amazon RDS, Amazon Elasticsearch Service.
  • Basic knowledge to set up Windows Server, IIS, SQL Server, Elasticsearch.
  • Basic knowledge of Redis Enterprise to set up High Availability Add-on.
  • Basic knowledge of UiPath Studio, UiPath Robot, UiPath Orchestrator.

Deployment Steps

Deploy UiPath Studio
UiPath Studio deploys on a single machine. Amazon EC2 instances provide secure and resizable compute capacity in the cloud, and the ability to launch applications when needed without upfront commitments.

  1. Download the UiPath Enterprise RPA Platform. UiPath Studio is integrated in the installation package.
  2. Launch an EC2 instance with a Windows OS-based Amazon Machine Image (AMI) that meets the UiPath Studio hardware requirements and software requirements.
  3. Install the UiPath Studio software. For UiPath Studio installation steps, review the UiPath Studio Guide.

Optionally, you can save the installation and pre-configuration work completed for UiPath Studio as a custom Amazon Machine Image (AMI). Then, you can launch more UiPath Studio instances from this AMI. For details, visit Launch an EC2 instance from a custom AMI tutorial.

UiPath Robot Deployment

Each UiPath Robot deploys one single machine with Amazon EC2. Amazon EC2 Auto Scaling helps you add or remove Robots to meet automation workload changes in demand.

  1. Download the UiPath Enterprise RPA Platform. The UiPath Robot is integrated in the installation package.
  2. Launch an EC2 instance with a Windows OS based Amazon Machine Image (AMI) that meets the UiPath Robot hardware requirements and software requirements.
  3. Install the business application (Microsoft Office, SAP, etc.) required for your business processes. Alternatively, select the business application AMI from the AWS Marketplace.
  4. Install the UiPath Robot software. For UiPath Robot installation steps, review Installing the Robot.

Optionally, you can save the installation and pre-configuration work completed for UiPath Robot as a custom Amazon Machine Image (AMI). Then you can create Launch templates with instance configuration information. With launch template, you can create Auto Scaling groups from launch templates and scale the Robots.

Scale the Robots’ Capacity

Amazon EC2 Auto Scaling groups help you use scaling policies to scale compute capacity based on resource use. By monitoring the process queue and creating a customized scaling policy, the UiPath Robot can automatically scale based on the workload. For details, review Scaling the size of your Auto Scaling group.

Use the Robot Logs

UiPath Robot generates multiple diagnostic and execution logs. Amazon CloudWatch provides the log collection, storage, and analysis, and enables the complete visibility of the Robots and automation tasks. For CloudWatch agent setup on Robot, review Quick Start: Enable Your Amazon EC2 Instances Running Windows Server to Send logs to CloudWatch Logs.

Monitor the Automation Jobs

UiPath Robot uses the user interface to capture data and manipulate applications. When UiPath Robot runs, it is important to capture processing screens for troubleshooting and auditing usage. This screen capture activity can be integrated with process in conjunction with UiPath Studio.

Amazon S3 provides cost-effective storage for retaining all Robot logs and processing screen captures. Amazon S3 Object Lifecycle Management automates the transition between different storage classes, and helps you manage the screenshots so that they are stored cost effectively throughout their lifecycle. For lifecycle policy creation, review How Do I Create a Lifecycle Policy for an S3 Bucket?.

UiPath Orchestrator Deployment

Deployment Components
UiPath Orchestrator Server Platform has many logical components, grouped in three layers:

  • presentation layer
  • web service layer
  • persistence layer

The presentation layer and web service layer are built into one ASP.NET website. The persistence layer contains SQL Server and Elasticsearch. There are three deployment components to be set up:

  • web application
  • SQL Server
  • Elasticsearch

The Web Server, SQL Server, and Elasticsearch Server require multiple different environments. Review the hardware requirements and software requirements for more details.

Note: set up the Web Server, SQL Server, Elasticsearch Server environments before running the UiPath Enterprise Platform installation wizard.

Set up Web Server with Amazon EC2

UiPath Orchestrator Web Server deploys on Windows Server with IIS 7.5 or later. For details, review the software requirements.

AWS provides various AMIs for Windows Server that can help you set up the environment required for the Web Server.

The Microsoft Windows Server 2019 Base AMI includes most prerequisites for installation except some features of Web Server (IIS) to be enabled. For configuration steps, review Server Roles and Features.

The Web Server should be put in correct subnet (Public or Private) and have proper security group (HTTPS visits) according to the business requirements. Review Allow user to connect EC2 on HTTP or HTTPS.

Set up SQL Server with Amazon RDS

Amazon Relational Database Service (Amazon RDS) provides you with a managed database service. With a few clicks, you can set up, operate, and scale a relational database in the AWS Cloud.

Amazon RDS support SQL Server Engine. For UiPath Orchestrator, both Standard Edition and Enterprise Edition are supported. For details, review software requirements.

Amazon RDS can be set up in multiple Available Zones to meet requirements for high availability.

UiPath Orchestrator can connect to the created Amazon RDS database with SQL Server Authentication.

Set up Elasticsearch Server with Amazon Elasticsearch Service (Amazon ES)

Amazon ES is a fully managed service for you to deploy, secure, and operate Elasticsearch at scale with generally zero down time.

Elasticsearch Service provides a managed ELS stack, with no upfront costs or usage requirements, and without the operational overhead.

All messages logged by UiPath Robots are sent through the Logging REST endpoint to the Indexer Server where they are indexed for future utilization.

Install UiPath Orchestrator on the Web Server

After Web Server, SQL Server, Elasticsearch Server environment are ready, download the UiPath Enterprise RPA Platform, and install it on the Web Server.

The UiPath Enterprise Platform installation wizard guides you in configuring and setting up each environment, including connecting to SQL Server and configuring the Elasticsearch API URL.

After you complete setup, the UiPath Orchestrator Portal is available for you to visit and manage processes, jobs, and robots.

The UiPath Orchestrator dashboard appears like in the following screenshot:

Figure UiPath Orchestrator Portal

Figure 2- UiPath Orchestrator Portal

Set up Orchestrator High Availability Architecture

One Orchestrator can handle many robots in a typical configuration, but any product running on a single server is vulnerable to failure if something happens to that server.

The High Availability add-on (HAA) enables you to add a second Orchestrator server to your environment that is generally fully synchronized with the first server.

To set up multi-node deployment, launch Amazon EC2 instances with a Linux OS-based Amazon Machine Image (AMI) that meets the HAA hardware and software requirements. Follow the installation guide to set up HAA.

Elastic Load Balancing automatically distributes incoming application traffic across multiple targets. Network Load Balancer should be set up to allow Robots to communicate with multi-node Orchestrators.

Cleaning up

To avoid incurring future charges, delete all the resources.


In this post, I showed you how to deploy the UiPath Enterprise RPA Platform on AWS to further optimize and automate your business processes. AWS Managed Services like Amazon EC2, Amazon RDS, and Amazon Elasticsearch Service help you set up the environment with high availability. This reduces the maintenance effort of backend services, as well as scaling Orchestrator capabilities. Amazon EC2 Auto Scaling helps you add or remove robots to meet automation workload changes in demand.

Learn more about how to integrate UiPath with AWS services, check out The UiPath and AWS partnership.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Building a Shared Account Structure Using AWS Organizations

Post Syndicated from Abhijit Vaidya original https://aws.amazon.com/blogs/architecture/field-notes-building-a-shared-account-structure-using-aws-organizations/

For customers considering the AWS Solution Provider Program, there are challenges to mitigate when building a shared account model with SI partners. AWS Organizations make it possible to build the right account structure to support a resale arrangement. In this engagement model, the end customer gets an AWS invoice from an AWS authorized partner instead of AWS directly.

Partners and customers who want to engage in this service resale arrangement need to build a new account structure. This process includes linking or transferring existing customer accounts to the partner master account. This is so that all the billing data from customer accounts is consolidated into the partner master account.

While linking or transferring existing customer accounts to the new master account, the partner must check the new master account structure. It should not compromise any customer security controls and continue to provide full control of linked accounts to the customer.  The new account structure must fulfill the following requirements for both the AWS customer and partner:

  • The customer maintains full access to the AWS organization and able to perform all critical security-related tasks except access to billing data.
  • The Partner is able to control only billing information and not able to perform any other task in the root account (master payer account) without approval from the customer.
  • In case of contract breach / termination, the customer is able to gain back full control of all accounts including the Master.

In this post, you will learn about how partners can create a master account with shared ownership. We also show how to link or transfer customer organization accounts to the new organization master account and set up policies that would provide appropriate access to both partner and customer.

Account Structure

The following architecture represents the account structure setup that can fulfill customer and partner requirements as a part of a service resale arrangement.

As illustrated in the preceding diagram, the following list includes the key architectural components:

Architectural Components

As a part of resale arrangement, the customer’s existing AWS organization and related accounts are linked to the partner’s master payer account. The customer can continue to maintain their existing master root account, while all child accounts are linked to the master account (as shown in the list).

Customers may have valid concerns about linking/transferring accounts to the owned master payee account and may come up with many ‘what-if’ scenarios for example “What if the partner shuts down environment/servers?”  or, “What if partner blocks access to child accounts?”.

This account structure provides the right controls to both customer and partner that would address customer concerns around the security of their accounts. This includes the following benefits:

  • Starting with access to the root account, neither customer nor partner can access the root account without the other party’s involvement.
  • The partner controls the id/password for the root account while the customer maintains the MFA token for the account. The customer also controls the phone number, security questions associated with the root account. That way, the partner cannot replace the MFA token on their own.
  • The partner only has billing access and does not control any other parts of account including child accounts. Anytime the root account access is needed, both customer and partner team need to collaborate and access the root account.
  • The customer or partner cannot assign new IAM roles to themselves, therefore protecting the initial account setup.

Security considerations in the shared account setup

The following table highlights both customer and partner responsibilities and access controls provided by the architecture in the previous section.

The following points highlight security recommendations to provide adequate access rights to both partner and customers.

  • New master payer/ root account has a joint ownership between the Partner and the Customer.
  • AWS account root users (user id/password) would be with Partner and MFA (multi-factor authentication) device with Customer.
  • IAM (AWS Identity and Access Management) role to be created under the master payer with policies “FullOrganizationAccess”, “Amazon S3” (Amazon Simple Storage Service), “CloudTrail” (AWS CloudTrail), “CloudWatch”(Amazon CloudWatch) for the Customer.
  • Security team to log in and manage security of the account. Additional permissions to be added to this role as needed in future. This role does not have ANY billing permissions.
  • Only the partner has access to AWS billing and usage data.
  • IAM role / user would be created under master payer with just billing permission for the Partner team to log in and download all invoices. This role does not have any other permissions except billing and usage reports.
  • Any root login attempts to master payer triggers a notification to Customer’s SOC team and the Partner’s Customer account management team.
  • The Partner’ email address is used to create an account so invoices can be emailed to the partner’s email. The Customer cannot see these invoices.
  • The Customer phone number is used to create a master account and the customer maintains security questions/answers. This prevents replacement of MFA token by the Partner team without informing customer.  The Customer wouldn’t need the Partner’s help or permission to login and manage any security.
  • No aspect of  the Master Payer / Root Partner team can login to the master payer/Root without the Customer providing an MFA token.

Setting up the shared master AWS account structure

Create a playbook for the account transfer activity based on the following tasks. For each task, identify the owner. Make sure that owners have the right permissions to perform the tasks.

Part I – Setting up new partner master account

  1. Create new Partner Master payee Account
  2. Update payment details section with the required details for payment in Partner Master payee Account
  3. Enable MFA in the Partner Master payee Account
  4. Update contact for security and operations in the Partner Master payee Account
  5. Update demographics -Address and contact details in Partner Master payee Account
  6. Create an IAM role for Customer Team in Partner Master Payee account. IAM role is created under master payer with “FullOrganizationAccess”, “Amazon S3”, “CloudTrail”, “CloudWatch” “CloudFormationFullAccess” for the Customer SOC team to login and manage security of the account. Additional permissions can be added to this role in future if needed.

Select the roles:

create role

7. Create an IAM role/user for Partner billing role in the Partner Master Payee account.

Part II – Setting up customer master account

1.      Create an IAM user in the customer’s master account. This user assumes role into the new master payer/root account.

aws management console


2.      Confirm that when the IAM user from the customer account assumes a role in the new master account, and that the user does not have Billing Access.

Billing and cost management dashboard

Part III – Creating an organization structure in partner account

  1. Create an Organization in the Partner Master Payee Account
  2. Create Multiple Organizational Units (OU) in the Partner Master Payee Account


3. Enable Service Control Policies from AWS Organization’s Policies menu.

service control policies

5. Create/Copy Multiple in to Partner Master Payee Account from Customer root Account.  Any service control policies from the customer root account should be manually copied to new partner account.

6. If customer root account has any special software installed for example, security, install same software in Partner Master Payee Account.

7. Set alerts in Partner Master Payee root account. Any login to the root account would send alerts to customer and partner teams.

8. It is recommended to keep a copy of all billing history invoices for all accounts to be transferred to partner organization. This could be achieved by either downloading CSV or printing all invoices and storing files in Amazon S3 for long term archival. Billing history and invoices are found by clicking Orders and Invoices on Billing & Cost Management Dashboard. After accounts are transferred to new organization, historic billing data will not be available for those accounts.

9. Remove all the Member Accounts from the current Customer Root Account/ Organization. This step is performed by customer account admin and required before account can be transferred to Partner Account organization.

10. Send an invite from the Partner Master Payee Account to the delinked Member Account

master payee account

11.      Member Accounts to accept the invite from the Partner Master Payee Account.


12.      Move the Customer member account to the appropriate OU in the Partner Master Payee Account.

Setting the shared security model between partner and customer contact

While setting up the master account, three contacts need to be updated for notification.

  • Billing –  this is owned by the Partner
  • Operations – this is owned by the Customer
  • Security – this is owned by the Customer.

This will trigger a notification of any activity on the root account. The contact details contain Name, Title, Email Address and Phone number.  It is recommended to use the Customer’s SOC team distribution email for security and operations, and a phone number that belongs to the organization, and not the individual.

Alternate contacts

Additionally, before any root account activity takes place, AWS Support will verify using the security challenge questionnaire. These questions and answers are owned by the Customer’s SOC team.

security challenge questions

If a customer is not able to access the AWS account, alternate support options are available at Contact us by expanding the “I’m an AWS customer and I’m looking for billing or account support” menu. While contacting AWS Support, all the details that are listed on the account are needed, including full name, phone number, address, email address, and the last four digits of the credit card.

Clean Up

After recovering the account, the Customer should close any accounts that are not in use. It’s a good idea not to have open accounts in your name that could result in charges. For more information, review Closing an Account in the Billing and Cost Management User Guide.

The Shared master root account should be only used for selected activities referred to in the following document.


In this post, you learned how AWS Organizations features can be used to create a shared master account structure.  This helps both customer and partner engage in a service resale business engagement. Using AWS Organizations and cross account access, this solution allows customers to control all key aspects of managing the AWS Organization (Security / Logging / Monitoring) and also allows partners to control any billing related data.

Additional Resources

Cross Account Access

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Integrating a Multi-forest Source Environment with AWS SSO

Post Syndicated from Sudhir Amin original https://aws.amazon.com/blogs/architecture/field-notes-integrating-a-multi-forest-source-environment-with-aws-sso/

During re:Invent 2019, AWS announced a new way to integrate external identity sources such as Azure Active Directory with auto provisioning of identities and groups in AWS Single Sign-On (AWS SSO). In March 2020, AWS SSO afforded customers the possibility to connect their Okta Identity Cloud to AWS Single Sign-On (SSO) in order to manage access to AWS centrally in AWS SSO.

AWS Single Sign-On service helps to centralize management access to multiple AWS accounts and some cases tying back to corporate identities. This provides ready access to business applications and services. With this feature, companies can leverage AWS Single Sign-On for allowing federated access to multiple AWS accounts and cloud applications.

In this blog post, I discuss the challenges faced by customers running multi-forest environments or multiple Azure tenant subscriptions with this feature.  I also provide a different approach to solving this challenge with a brief overview of each solution presented.

Large Enterprise companies often require their security team to build centralized identity solutions that work across different Active Directory forests environments. This is commonly due to a merger, acquisition or partnership. Challenges include complex networking with different IP routes, DNS forwarding configurations, firewall rules to enable trust relationships between different Active Directory forests to support compliance of a single identity to manage the account lifecycle and password policies. This becomes even more challenging when your organization is working in multiple cloud platforms within a centralized Identity solution, with hybrid networking connectivity.

Customer Example

To illustrate my point, I use the following example of a real life customer scenario, under the fictitious name of ‘Acme Corporation’.

Acme Corporation is a capital wealth management company operating in three countries: USA, Canada, and Brazil. Business is growing and they are exploring cloud services.

Their corporate headquarters is located in NY, USA and they have established offices (branches) in Canada and Brazil. The organization operates in a decentralized model, which consists of different governance over their identity structure. An Active Directory Forest is established per Region with a cross-forest trust relationship. The company is looking to adopt cloud technologies and needed a common identity solution across on-premises and cloud services with Azure Active Directory and AWS.

We’ve outlined the solution in the following diagram:

Figure 1 - Solution Overview

Figure 1 – Solution Overview

Options to source identities into AWS Single Sign-On

AWS Single Sign-On offers the following 3 options to establish as an identity source:

  • Active Directory
  • External Identity Provider
Figure 2 - Identity Source Options

Figure 2 – Identity Source Options

The first option; “AWS SSO” is a default native identity store. You can create and delete users and groups.

The second option; “Active Directory” allows administrators to source users and groups from Active Directory running On-Premises Active Directory, or Active Directory in EC2 (using AD Connector as the directory gateway) or AWS Managed Microsoft AD directory hosted in the AWS Cloud.

The third option; “External Identity Provider” enables administrators to provision users and groups from external identity providers (IdPs) through the Security Assertion Markup Language (SAML) 2.0 standard.

Note: AWS Single Sign-On allows only one identity source at any given time. In this post, we focus on two options that help integrate a multi-forest environment with AWS Single Sign-On and Azure Active Directory.


Option 1. Federating with Active Directory 

In the hub-and-spoke model, the AWS Managed Microsoft Active Directory is the hub and the spoke is the Active Directory forests.

  1. Provision a AWS Managed Microsoft Active Directory.
    • If you already have an AWS Managed Microsoft Active Directory for a hub, continue to the next step.
  2. Setup hybrid network connectivity, and firewall rules allowing trust traffic
  3. DNS, conditional forwarding allows to resolve the trusting forests. We need an Outbound Endpoint with Forwarding Rules to the different forests so the VPC resolves the names and an inbound endpoint so the forests can resolve the AWS Managed Microsoft AD names.
  4. Check the name resolution is working for the hybrid environment.
  5. Establish a Forest trust relationship and validate the trust.

The following snapshot shows how your trust relationship will be displayed on the console.

Figure 3 - Trust Relationships

Note 1: You cannot use the transitive trust relationship of a child domain in a forest or cross forest relationship. In that case, you have to create an explicit trust or a domain trust to the AWS Managed Microsoft AD domain for AWS Single Sign-On. This enables you to see the user and groups required to provision the permission sets and Accounts.

Note 2: AWS Managed Microsoft Active Directory in this example does not require you to host any users or groups, as this domain is only being used for the domain trust relationships. In short, this can be an empty forest.

Configure AWS Single Sign-On to use your AWS Managed Microsoft Active Directory for Active Directory option.

The following snapshot shows how to assign a group to an account in preparation for AWS Singles Sign-On enablement.

Figure 4 - Selecting Users or Groups

The following snapshot shows how to assign a group to an account in preparation for AWS Singles Sign-On enablement and selecting a group.

Figure 5 - Assigning Users

The following is a conceptual diagram of Acme corporation, after successful integration.

Figure 4 - Conceptual Diagram

Figure 3 – Option 1 – Conceptual Diagram

Option 2a. Federating with Azure Active Directory Single Tenant

If you have multiple-forests and would like to use a single tenant, here are the steps:

  1. Setup a single Azure AD Connect in any forest, to consolidate users from different forests to a single Azure Tenant.
  2. Configure AWS Single Sign-On to use your Single Azure Active Directory Tenant for External Identity Provider option.

The following is a conceptual diagram of Acme corporation, after successful integration.

Figure 5 - Option 2a - Conceptual Diagram

Figure 5 – Option 2a – Conceptual Diagram

Option 2b. Federating with Azure Active Directory Multiple Tenants

If option 2a is not feasible and you are using multiple Azure AD Connect sync servers and multiple Azure Active Directory tenants (as per the following diagram) then, you can nominate one of the Azure Active Directory tenants to connect with AWS SSO. Through B2B invitation, selectively invite users from other tenants into the nominated tenant.

Note: This is not a scalable solution, as it requires administrative overhead. This should be ideal for a small set of users requiring access to AWS API or console for administrative work.

The following is a conceptual diagram of Acme corporation, after successful integration.

Figure 6 - Option 2b - Conceptual Diagram

Figure 6 – Option 2b – Conceptual Diagram


In this post, we discussed the options for connecting AWS SSO to your preferred Identity Provider, with a multi-forest infrastructure. Customers running multi-forest environments or multiple Azure tenant subscriptions now have a guide to offer their users a continued way of centralizing management and enforcing least privilege access on cloud resources. To learn more, review our AWS Single Sign-On service content.

Additional Content:

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.