Tag Archives: Field Notes

Field Notes: Accelerate Research with Managed Jupyter on Amazon SageMaker

Post Syndicated from Mrudhula Balasubramanyan original https://aws.amazon.com/blogs/architecture/field-notes-accelerate-research-with-managed-jupyter-on-amazon-sagemaker/

Research organizations across industry verticals have unique needs. These include facilitating stakeholder collaboration, setting up compute environments for experimentation, handling large datasets, and more. In essence, researchers want the freedom to focus on their research, without the undifferentiated heavy-lifting of managing their environments.

In this blog, I show you how to set up a managed Jupyter environment using custom tools used in Life Sciences research. I show you how to transform the developed artifacts into scripted components that can be integrated into research workflows. Although this solution uses Life Sciences as an example, it is broadly applicable to any vertical that needs customizable managed environments at scale.

Overview of solution

This solution has two parts. First, the System administrator of an organization’s IT department sets up a managed environment and provides researchers access to it. Second, the researchers access the environment and conduct interactive and scripted analysis.

This solution uses AWS Single Sign-On (AWS SSO), Amazon SageMaker, Amazon ECR, and Amazon S3. These services are architected to build a custom environment, provision compute, conduct interactive analysis, and automate the launch of scripts.


The architecture and detailed walkthrough are presented from both an admin and researcher perspective.

Architecture from an admin perspective

Architecture from admin perspective


In order of tasks, the admin:

  1. authenticates into AWS account as an AWS Identity and Access Management (IAM) user with admin privileges
  2. sets up AWS SSO and users who need access to Amazon SageMaker Studio
  3. creates a Studio domain
  4. assigns users and groups created in AWS SSO to the Studio domain
  5. creates a SageMaker notebook instance shown generically in the architecture as Amazon EC2
  6. launches a shell script provided later in this post to build and store custom Docker image in a private repository in Amazon ECR
  7. attaches the custom image to Studio domain that the researchers will later use as a custom Jupyter kernel inside Studio and as a container for the SageMaker processing job.

Architecture from a researcher perspective

Architecture from a researcher perspective

In order of tasks, the researcher:

  1. authenticates using AWS SSO
  2. SSO authenticates researcher to SageMaker Studio
  3. researcher performs interactive analysis using managed Jupyter notebooks with custom kernel, organizes the analysis into script(s), and launches a SageMaker processing job to execute the script in a managed environment
  4. the SageMaker processing job reads data from S3 bucket and writes data back to S3. The user can now retrieve and examine results from S3 using Jupyter notebook.


For this walkthrough, you should have:

  • An AWS account
  • Admin access to provision and delete AWS resources
  • Researchers’ information to add as SSO users: full name and email

Set up AWS SSO

To facilitate collaboration between researchers, internal and external to your organization, the admin uses AWS SSO to onboard to Studio.

For admins: follow these instructions to set up AWS SSO prior to creating the Studio domain.

Onboard to SageMaker Studio

Researchers can use just the functionality they need in Amazon SageMaker Studio. Studio provides managed Jupyter environments with sharable notebooks for interactive analysis, and managed environments for script execution.

When you onboard to Studio, a home directory is created for you on Amazon Elastic File System (Amazon EFS) which provides reliable, scalable storage for large datasets.

Once AWS SSO has been setup, follow these steps to onboard to Studio via SSO. Note the Studio domain id (ex. d-2hxa6eb47hdc) and the IAM execution role (ex. AmazonSageMaker-ExecutionRole-20201156T214222) in the Studio Summary section of Studio. You will be using these in the following sections.

Provision custom image

At the core of research is experimentation. This often requires setting up playgrounds with custom tools to test out ideas. Docker images are an effective[CE1] [BM2]  way to package those tools and dependencies and deploy them quickly. They also address another critical need for researchers – reproducibility.

To demonstrate this, I picked a Life Sciences research problem that requires custom Python packages to be installed and made available to a team of researchers as Jupyter kernels inside Studio.

For the custom Docker image, I picked a Python package called Pegasus. This is a tool used in genomics research for analyzing transcriptomes of millions of single cells, both interactively as well as in cloud-based analysis workflows.

In addition to Python, you can provision Jupyter kernels for languages such as R, Scala, Julia, in Studio using these Docker images.

Launch an Amazon SageMaker notebook instance

To build and push custom Docker images to ECR, you use an Amazon SageMaker notebook instance. Note that this is not part of SageMaker Studio and unrelated to Studio notebooks. It is a fully managed machine learning (ML) Amazon EC2 instance inside the SageMaker service that runs the Jupyter Notebook application, AWS CLI, and Docker.

  • Use these instructions to launch a SageMaker notebook instance.
  • Once the notebook instance is up and running, select the instance and navigate to the IAM role attached to it. This role comes with IAM policy ‘AmazonSageMakerFullAccess’ as a default. Your instance will need some additional permissions.
  • Create a new IAM policy using these instructions.
  • Copy the IAM policy below to paste into the JSON tab.
  • Fill in the values for <region-id> (ex. us-west-2), <AWS-account-id>, <studio-domain-id>, <studio-domain-iam-role>. Name the IAM policy ‘sagemaker-notebook-policy’ and attach it to the notebook instance role.
    "Version": "2012-10-17",
    "Statement": [
            "Sid": "additionalpermissions",
            "Effect": "Allow",
            "Action": [
            "Resource": [
  • Start a terminal session in the notebook instance.
  • Once you are done creating the Docker image and attaching to Studio in the next section, you will be shutting down the notebook instance.

Create private repository, build, and store custom image, attach to SageMaker Studio domain

This section has multiple steps, all of which are outlined in a single bash script.

  • First the script creates a private repository in Amazon ECR.
  • Next, the script builds a custom image, tags, and pushes to Amazon ECR repository. This custom image will serve two purposes: one as a custom Python Jupyter kernel used inside Studio, and two as a custom container for SageMaker processing.
  • To use as a custom kernel inside SageMaker Studio, the script creates a SageMaker image and attaches to the Studio domain.
  • Before you initiate the script, fill in the following information: your AWS account ID, Region (ex. us-east-1), Studio IAM execution role, and Studio domain id.
  • You must create four files: bash script, Dockerfile, and two configuration files.
  • Copy the following bash script to a file named ‘pegasus-docker-images.sh’ and fill in the required values.

# Pegasus python packages from Docker hub



executionrole=<fill-in-execution-role ex. AmazonSageMaker-ExecutionRole-xxxxx>

domainid=<fill-in-Studio-domain-id ex. d-xxxxxxx>

if aws ecr describe-repositories | grep 'sagemaker-custom'
    echo 'repo already exists! Skipping creation'
    aws ecr create-repository – repository-name sagemaker-custom

aws ecr get-login-password – region $region | docker login – username AWS – password-stdin $accountid.dkr.ecr.$region.amazonaws.com

docker build -t sagemaker-custom:pegasus-1.0 .

docker tag sagemaker-custom:pegasus-1.0 $accountid.dkr.ecr.$region.amazonaws.com/sagemaker-custom:pegasus-1.0

docker push $accountid.dkr.ecr.$region.amazonaws.com/sagemaker-custom:pegasus-1.0

if aws sagemaker list-images | grep 'pegasus-1'
    echo 'Image already exists! Skipping creation'
    aws sagemaker create-image – image-name pegasus-1 – role-arn arn:aws:iam::$accountid:role/service-role/$executionrole
    aws sagemaker create-image-version – image-name pegasus-1 – base-image $accountid.dkr.ecr.$region.amazonaws.com/sagemaker-custom:pegasus-1.0

if aws sagemaker list-app-image-configs | grep 'pegasus-1-config'
    echo 'Image config already exists! Skipping creation'
   aws sagemaker create-app-image-config – cli-input-json file://app-image-config-input.json

aws sagemaker update-domain – domain-id $domainid – cli-input-json file://default-user-settings.json

Copy the following to a file named ‘Dockerfile’.

FROM cumulusprod/pegasus-terra:1.0

USER root

Copy the following to a file named ‘app-image-config-input.json’.

    "AppImageConfigName": "pegasus-1-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
                "Name": "python3",
                "DisplayName": "Pegasus 1.0"
        "FileSystemConfig": {
            "MountPath": "/root",
            "DefaultUid": 0,
            "DefaultGid": 0

Copy the following to a file named ‘default-user-settings.json’.

    "DefaultUserSettings": {
        "KernelGatewayAppSettings": { 
           "CustomImages": [ 
                 "ImageName": "pegasus-1",
                 "ImageVersionNumber": 1,
                 "AppImageConfigName": "pegasus-1-config"

Launch ‘pegasus-docker-images.sh’ in the directory with all four files, in the terminal of the notebook instance. If the script ran successfully, you should see the custom image attached to the Studio domain.

Amazon SageMaker dashboard


Perform interactive analysis

You can now launch the Pegasus Python kernel inside SageMaker . If this is your first time using Studio, you can get a quick tour of its UI.

For interactive analysis, you can use publicly available notebooks in Pegasus tutorial from this GitHub repository. Review the license before proceeding.

To clone the repository in Studio, open a system terminal using these instructions. Initiate $ git clone https://github.com/klarman-cell-observatory/pegasus

  • In the directory ‘pegasus’, select ‘notebooks’ and open ‘pegasus_analysis.ipynb’.
  • For kernel choose ‘Pegasus 1.0 (pegasus-1/1)’.
  • You can now run through the notebook and examine the output generated. Feel free to work through the other notebooks for deeper analysis.

Pagasus tutorial

At any point during experimentation, you can share your analysis along with results with your colleagues using these steps. The snapshot that you create also captures the notebook configuration such as instance type and kernel, to ensure reproducibility.

Formalize analysis and execute scripts

Once you are done with interactive analysis, you can consolidate your analysis into a script to launch in a managed environment. This is an important step, if you want to later incorporate this script as a component into a research workflow and automate it.

Copy the following script to a file named ‘pegasus_script.py’.

BSD 3-Clause License

Copyright (c) 2018, Broad Institute
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.



import pandas as pd
import pegasus as pg

if __name__ == "__main__":
    BASE_DIR = "/opt/ml/processing"
    data = pg.read_input(f"{BASE_DIR}/input/MantonBM_nonmix_subset.zarr.zip")
    pg.qc_metrics(data, percent_mito=10)
    df_qc = pg.get_filter_stats(data)
    pd.DataFrame(df_qc).to_csv(f"{BASE_DIR}/output/qc_metrics.csv", header=True, index=False)

The Jupyter notebook following provides an example of launching a processing job using the script in SageMaker.

  • Create a notebook in SageMaker Studio in the same directory as the script.
  • Copy the following code to the notebook and name it ‘sagemaker_pegasus_processing.ipynb’.
  • Select ‘Python 3 (Data Science)’ as the kernel.
  • Launch the cells.
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

prefix = 'pegasus'

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'research-custom'
tag = ':pegasus-1.0'

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
processing_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)

script_processor = ScriptProcessor(command=['python3'],
!wget https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip

local_path = "MantonBM_nonmix_subset.zarr.zip"

s3 = boto3.resource("s3")

base_uri = f"s3://{bucket}/{prefix}"
input_data_uri = sagemaker.s3.S3Uploader.upload(

code_uri = sagemaker.s3.S3Uploader.upload(

                      inputs=[ProcessingInput(source=input_data_uri, destination='/opt/ml/processing/input'),],
                      outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination=f"{base_uri}/output")]
script_processor_job_description = script_processor.jobs[-1].describe()

output_path = f"{base_uri}/output"

The ‘output_path’ is the S3 prefix where you will find the results from SageMaker processing. This will be printed as the last line after execution. You can examine the results either directly in S3 or by copying the results back to your home directory in Studio.

Cleaning up

To avoid incurring future charges, shut down the SageMaker notebook instance. Detach image from the Studio domain, delete image in Amazon ECR, and delete data in Amazon S3.


In this blog, I showed you how to set up and use a unified research environment using Amazon SageMaker. Although the example pertained to Life Sciences, the architecture and the framework presented are generally applicable to any research space. They strive to address the broader research challenges of custom tooling, reproducibility, large datasets, and price predictability.

As a logical next step, take the scripted components and incorporate them into research workflows and automate them. You can use SageMaker Pipelines to incorporate machine learning into your workflows and operationalize them.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Enroll Existing AWS Accounts into AWS Control Tower

Post Syndicated from Kishore Vinjam original https://aws.amazon.com/blogs/architecture/field-notes-enroll-existing-aws-accounts-into-aws-control-tower/

Originally published 21 April 2020 to the Field Notes blog, and updated in August 2020 with new prechecks to the account enrollment script. 

Since the launch of AWS Control Tower, customers have been asking for the ability to deploy AWS Control Tower in their existing AWS Organizations and to extend governance to those accounts in their organization.

We are happy that you can now deploy AWS Control Tower in your existing AWS Organizations. The accounts that you launched before deploying AWS Control Tower, what we refer to as unenrolled accounts, remain outside AWS Control Towers’ governance by default. These accounts must be enrolled in the AWS Control Tower explicitly.

When you enroll an account into AWS Control Tower, it deploys baselines and additional guardrails to enable continuous governance on your existing AWS accounts. However, you must perform proper due diligence before enrolling in an account. Refer to the Things to Consider section below for additional information.

In this blog, I show you how to enroll your existing AWS accounts and accounts within the unregistered OUs in your AWS organization under AWS Control Tower programmatically.


Here’s a quick review of some terms used in this post:

  • The Python script provided in this post. This script interacts with multiple AWS services, to identify, validate, and enroll the existing unmanaged accounts into AWS Control Tower.
  • An unregistered organizational unit (OU) is created through AWS Organizations. AWS Control Tower does not manage this OU.
  • An unenrolled account is an existing AWS account that was created outside of AWS Control Tower. It is not managed by AWS Control Tower.
  • A registered organizational unit (OU) is an OU that was created in the AWS Control Tower service. It is managed by AWS Control Tower.
  • When an OU is registered with AWS Control Tower, it means that specific baselines and guardrails are applied to that OU and all of its accounts.
  • An AWS Account Factory account is an AWS account provisioned using account factory in AWS Control Tower.
  • Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud.
  • AWS Service Catalog allows you to centrally manage commonly deployed IT services. In the context of this blog, account factory uses AWS Service Catalog to provision new AWS accounts.
  • AWS Organizations helps you centrally govern your environment as you grow and scale your workloads on AWS.
  • AWS Single Sign-On (SSO) makes it easy to centrally manage access to multiple AWS accounts. It also provides users with single sign-on access to all their assigned accounts from one place.

Things to Consider

Enrolling an existing AWS account into AWS Control Tower involves moving an unenrolled account into a registered OU. The Python script provided in this blog allows you to enroll your existing AWS accounts into AWS Control Tower. However, it doesn’t have much context around what resources are running on these accounts. It assumes that you validated the account services before running this script to enroll the account.

Some guidelines to check before you decide to enroll the accounts into AWS Control Tower.

  1. An AWSControlTowerExecution role must be created in each account. If you are using the script provided in this solution, it creates the role automatically for you.
  2. If you have a default VPC in the account, the enrollment process tries to delete it. If any resources are present in the VPC, the account enrollment fails.
  3. If AWS Config was ever enabled on the account you enroll, a default config recorder and delivery channel were created. Delete the configuration-recorder and delivery channel for the account enrollment to work.
  4. Start with enrolling the dev/staging accounts to get a better understanding of any dependencies or impact of enrolling the accounts in your environment.
  5. Create a new Organizational Unit in AWS Control Tower and do not enable any additional guardrails until you enroll in the accounts. You can then enable guardrails one by one to check the impact of the guardrails in your environment.
  6. As an additional option, you can apply AWS Control Tower’s detective guardrails to an existing AWS account before moving them under Control Tower governance. Instructions to apply the guardrails are discussed in detail in AWS Control Tower Detective Guardrails as an AWS Config Conformance Pack blog.


Before you enroll your existing AWS account in to AWS Control Tower, check the prerequisites from AWS Control Tower documentation.

This Python script provided part of this blog, supports enrolling all accounts with in an unregistered OU in to AWS Control Tower. The script also supports enrolling a single account using both email address or account-id of an unenrolled account. Following are a few additional points to be aware of about this solution.

  • Enable trust access with AWS Organizations for AWS CloudFormation StackSets.
  • The email address associated with the AWS account is used as AWS SSO user name with default First Name Admin and Last Name User.
  • Accounts that are in the root of the AWS Organizations can be enrolled one at a time only.
  • While enrolling an entire OU using this script, the AWSControlTowerExecution role is automatically created on all the accounts on this OU.
  • You can enroll a single account with in an unregistered OU using the script. It checks for AWSControlTowerExecution role on the account. If the role doesn’t exist, the role is created on all accounts within the OU.
  • By default, you are not allowed to enroll an account that is in the root of the organization. You must pass an additional flag to launch a role creation stack set across the organization
  • While enrolling a single account that is in the root of the organization, it prompts for additional flag to launch role creation stack set across the organization.
  • The script uses CloudFormation Stack Set Service-Managed Permissions to create the AWSControlTowerExecution role in the unenrolled accounts.

How it works

The following diagram shows the overview of the solution.

Account enrollment

  1. In your AWS Control Tower environment, access an Amazon EC2 instance running in the master account of the AWS Control Tower home Region.
  2. Get temporary credentials for AWSAdministratorAccess from AWS SSO login screen
  3. Download and execute the enroll_script.py script
  4. The script creates the AWSControlTowerExecution role on the target account using Automatic Deployments for a Stack Set feature.
  5. On successful validation of role and organizational units that are given as input, the script launches a new product in Account Factory.
  6. The enrollment process creates an AWS SSO user using the same email address as the AWS account.

Setting up the environment

It takes up to 30 minutes to enroll each AWS account in to AWS Control Tower. The accounts can be enrolled only one at a time. Depending on number of accounts that you are migrating, you must keep the session open long enough. In this section, you see one way of keeping these long running jobs uninterrupted using Amazon EC2 using the screen tool.

Optionally you may use your own compute environment where the session timeouts can be handled. If you go with your own environment, make sure you have python3, screen and a latest version of boto3 installed.

1. Prepare your compute environment:

  • Log in to your AWS Control Tower with AWSAdministratorAccess role.
  • Switch to the Region where you deployed your AWS Control Tower if needed.
  • If necessary, launch a VPC using the stack here and wait for the stack to COMPLETE.
  • If necessary, launch an Amazon EC2 instance using the stack here. Wait for the stack to COMPLETE.
  • While you are on master account, increase the session time duration for AWS SSO as needed. Default is 1 hour and maximum is 12 hours.

2. Connect to the compute environment (one-way):

  • Go to the EC2 Dashboard, and choose Running Instances.
  • Select the EC2 instance that you just created and choose Connect.
  • In Connect to your instance screen, under Connection method, choose EC2InstanceConnect (browser-based SSH connection) and Connect to open a session.
  • Go to AWS Single Sign-On page in your browser. Click on your master account.
  • Choose command line or programmatic access next to AWSAdministratorAccess.
  • From Option 1 copy the environment variables and paste them in to your EC2 terminal screen in step 5 below.

3. Install required packages and variables. You may skip this step, if you used the stack provided in step-1 to launch a new EC2 instance:

  • Install python3 and boto3 on your EC2 instance. You may have to update boto3, if you use your own environment.
$ sudo yum install python3 -y 
$ sudo pip3 install boto3
$ pip3 show boto3
Name: boto3
Version: 1.12.39
  • Change to home directory and download the enroll_account.py script.
$ cd ~
$ wget https://raw.githubusercontent.com/aws-samples/aws-control-tower-reference-architectures/master/customizations/AccountFactory/EnrollAccount/enroll_account.py
  • Set up your home Region on your EC2 terminal.
export AWS_DEFAULT_REGION=<AWSControlTower-Home-Region>

4. Start a screen session in daemon mode. If your session gets timed out, you can open a new session and attach back to the screen.

$ screen -dmS SAM
$ screen -ls
There is a screen on:
        585.SAM (Detached)
1 Socket in /var/run/screen/S-ssm-user.
$ screen -dr 585.SAM 

5. On the screen terminal, paste the environmental variable that you noted down in step 2.

6. Identify the accounts or the unregistered OUs to migrate and run the Python script provide with below mentioned options.

  • Python script usage:
usage: enroll_account.py -o -u|-e|-i -c 
Enroll existing accounts to AWS Control Tower.

optional arguments:
  -h, – help            show this help message and exit
  -o OU, – ou OU        Target Registered OU
  -u UNOU, – unou UNOU  Origin UnRegistered OU
  -e EMAIL, – email EMAIL
                        AWS account email address to enroll in to AWS Control Tower
  -i AID, – aid AID     AWS account ID to enroll in to AWS Control Tower
  -c, – create_role     Create Roles on Root Level
  • Enroll all the accounts from an unregistered OU to a registered OU
$ python3 enroll_account.py -o MigrateToRegisteredOU -u FromUnregisteredOU 
Creating cross-account role on 222233334444, wait 30 sec: RUNNING 
Executing on AWS Account: 570395911111, [email protected] 
Launching Enroll-Account-vinjak-unmgd3 
Status: UNDER_CHANGE. Waiting for 6.0 min to check back the Status 
Status: UNDER_CHANGE. Waiting for 5.0 min to check back the Status 
. . 
Status: UNDER_CHANGE. Waiting for 1.0 min to check back the Status 
SUCCESS: 111122223333 updated Launching Enroll-Account-vinjakSCchild 
Status: UNDER_CHANGE. Waiting for 6.0 min to check back the Status 
ERROR: 444455556666 
Launching Enroll-Account-Vinjak-Unmgd2 
Status: UNDER_CHANGE. Waiting for 6.0 min to check back the Status 
. . 
Status: UNDER_CHANGE. Waiting for 1.0 min to check back the Status 
SUCCESS: 777788889999 updated
  • Use AWS account ID to enroll a single account that is part of an unregistered OU.
$ python3 enroll_account.py -o MigrateToRegisteredOU -i 111122223333
  • Use AWS account email address to enroll a single account from an unregistered OU.
$ python3 enroll_account.py -o MigrateToRegisteredOU -e [email protected]

You are not allowed by default to enroll an AWS account that is in the root of the organization. The script checks for the AWSControlTowerExecution role in the account. If role doesn’t exist, you are prompted to use -c | – create-role. Using -c flag adds the stack instance to the parent organization root. Which means an AWSControlTowerExecution role is created in all the accounts with in the organization.

Note: Ensure installing AWSControlTowerExecution role in all your accounts in the organization, is acceptable in your organization before using -c flag.

If you are unsure about this, follow the instructions in the documentation and create the AWSControlTowerExecution role manually in each account you want to migrate. Rerun the script.

  • Use AWS account ID to enroll a single account that is in root OU (need -c flag).
$ python3 enroll_account.py -o MigrateToRegisteredOU -i 111122223333 -c
  • Use AWS account email address to enroll a single account that is in root OU (need -c flag).
$ python3 enroll_account.py -o MigrateToRegisteredOU -e [email protected] -c

Cleanup steps

On successful completion of enrolling all the accounts into your AWS Control Tower environment, you could clean up the below resources used for this solution.

If you have used templates provided in this blog to launch VPC and EC2 instance, delete the EC2 CloudFormation stack and then VPC template.


Now you can deploy AWS Control Tower in an existing AWS Organization. In this post, I have shown you how to enroll your existing AWS accounts in your AWS Organization into AWS Control Tower environment. By using the procedure in this post, you can programmatically enroll a single account or all the accounts within an organizational unit into an AWS Control Tower environment.

Now that governance has been extended to these accounts, you can also provision new AWS accounts in just a few clicks and have your accounts conform to your company-wide policies.

Additionally, you can use Customizations for AWS Control Tower to apply custom templates and policies to your accounts. With custom templates, you can deploy new resources or apply additional custom policies to the existing and new accounts. This solution integrates with AWS Control Tower lifecycle events to ensure that resource deployments stay in sync with your landing zone. For example, when a new account is created using the AWS Control Tower account factory, the solution ensures that all resources attached to the account’s OUs are automatically deployed.

Field Notes: Stopping an Automatically Started Database Instance with Amazon RDS

Post Syndicated from Islam Ghanim original https://aws.amazon.com/blogs/architecture/field-notes-stopping-an-automatically-started-database-instance-with-amazon-rds/

Customers needing to keep an Amazon Relational Database Service (Amazon RDS) instance stopped for more than 7 days, look for ways to efficiently re-stop the database after being automatically started by Amazon RDS. If the database is started and there is no mechanism to stop it; customers start to pay for the instance’s hourly cost. Moreover, customers with database licensing agreements could incur penalties for running beyond their licensed cores/users.

Stopping and starting a DB instance is faster than creating a DB snapshot, and then restoring the snapshot. However, if you plan to keep the Amazon RDS instance stopped for an extended period of time, it is advised to terminate your Amazon RDS instance and recreate it from a snapshot when needed.

This blog provides a step-by-step approach to automatically stop an RDS instance once the auto-restart activity is complete. This saves any costs incurred once the instance is turned on. The proposed architecture is fully serverless and requires no management overhead. It relies on AWS Step Functions and a set of Lambda functions to monitor RDS instance state and stop the instance when required.

Architecture overview

Given the autonomous nature of the architecture and to avoid management overhead, the architecture leverages serverless components.

  • The architecture relies on RDS event notifications. Once a stopped RDS instance is started by AWS due to exceeding the maximum time in the stopped state; an event (RDS-EVENT-0154) is generated by RDS.
  • The RDS event is pushed to a dedicated SNS topic rds-event-notifications-topic.
  • The Lambda function start-statemachine-execution-lambda is subscribed to the SNS topic rds-event-notifications-topic.
    • The function filters messages with event code: RDS-EVENT-0154. In order to restrict the ‘force shutdown’ activity further, the function validates that the RDS instance is tagged with auto-restart-protection and that the tag value is set to ‘yes’.
    • Once all conditions are met, the Lambda function starts the AWS Step Functions state machine execution.
  • The AWS Step Functions state machine integrates with two Lambda functions in order to retrieve the instance state, as well as attempt to stop the RDS instance.
    • In case the instance state is not ‘available’, the state machine waits for 5 minutes and then re-checks the state.
    • Finally, when the Amazon RDS instance state is ‘available’; the state machine will attempt to stop the Amazon RDS instance.


In order to implement the steps in this post, you need an AWS account as well as an IAM user with permissions to provision and delete resources of the following AWS services:

  • Amazon RDS
  • AWS Lambda
  • AWS Step Functions
  • AWS CloudFormation

Architecture implementation

You can implement the architecture using the AWS Management Console or AWS CLI.  For faster deployment, the architecture is available on GitHub. For more information on the repo, visit GitHub.

The steps below explain how to build the end-to-end architecture from within the AWS Management Console:

Create an SNS topic

  • Open the Amazon SNS console.
  • On the Amazon SNS dashboard, under Common actions, choose Create Topic.
  • In the Create new topic dialog box, for Topic name, enter a name for the topic (rds-event-notifications-topic).
  • Choose Create topic.
  • Note the Topic ARN for the next task (for example, arn:aws:sns:us-east-1:111122223333:my-topic).

Configure RDS event notifications

Amazon RDS uses Amazon Simple Notification Service (Amazon SNS) to provide notification when an Amazon RDS event occurs. These notifications can be in any notification form supported by Amazon SNS for an AWS Region, such as an email, a text message, or a call to an HTTP endpoint.

For this architecture, RDS generates an event indicating that instance has automatically restarted because it exceed the maximum duration to remain stopped. This specific RDS event (RDS-EVENT-0154) belongs to ‘notification’ category. For more information, visit Using Amazon RDS Event Notification.

To subscribe to an RDS event notification

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In the navigation pane, choose Event subscriptions.
  • In the Event subscriptions pane, choose Create event subscription.
  • In the Create event subscription dialog box, do the following:
    • For Name, enter a name for the event notification subscription (RdsAutoRestartEventSubscription).
    • For Send notifications to, choose the SNS topic created in the previous step (rds-event-notifications-topic).
    • For Source type, choose ‘Instances’. Since our source will be RDS instances.
    • For Instances to include, choose ‘All instances’. Instances are included or excluded based on the tag, auto-restart-protection. This is to keep the architecture generic and to avoid regular configurations moving forward.
    • For Event categories to include, choose ‘Select specific event categories’.
    • For Specific event, choose ‘notification’. This is the category under which the RDS event of interest falls. For more information, review Using Amazon RDS Event Notification.
    •  Choose Create.
    • The Amazon RDS console indicates that the subscription is being created.

Create Lambda functions

Following are the three Lambda functions required for the architecture to work:

  1. start-statemachine-execution-lambda, the function will subscribe to the newly created SNS topic (rds-event-notifications-topic) and starts the AWS Step Functions state machine execution.
  2. retrieve-rds-instance-state-lambda, the function is triggered by AWS Step Functions state machine to retrieve an RDS instance state (example, available or stopped)
  3. stop-rds-instance-lambda, the function is triggered by AWS Step Functions state machine in order to attempt to stop an RDS instance.

First, create the Lambda functions’ execution role.

To create an execution role

  • Open the roles page in the IAM console.
  • Choose Create role.
  • Create a role with the following properties.
    • Trusted entity – Lambda.
    • Permissions – AWSLambdaBasicExecutionRole.
    • Role namerds-auto-restart-lambda-role.
    • The AWSLambdaBasicExecutionRole policy has the permissions that the function needs to write logs to CloudWatch Logs.

Now, create a new policy and attach to the role in order to allow the Lambda function to: start an AWS StepFunctions state machine execution, stop an Amazon RDS instance, retrieve RDS instance status, list tags and add tags.

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
    "Version": "2012-10-17",
    "Statement": [
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name (rds-auto-restart-lambda-policy) and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.

To link the new policy to the AWS Lambda execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role created for the three Lambda functions. After selecting the identities, choose Attach policy.

Given the principle of least privilege, it is recommended to create 3 different roles restricting a function’s access to the needed resources only. 

Repeat the following step 3 times to create 3 new Lambda functions. Differences between the 3 Lambda functions are: (1) code and (2) triggers:

  • Open the Lambda console.
  • Choose Create function.
  • Configure the following settings:
    • Name
      • start-statemachine-execution-lambda
      • retrieve-rds-instance-state-lambda
      • stop-rds-instance-lambda
    • Runtime – Python 3.8.
    • Role – Choose an existing role.
    • Existing role – rds-auto-restart-lambda-role.
    • Choose Create function.
    • To configure a test event, choose Test.
    • For Event name, enter test.
  • Choose Create.
  • For the Lambda function —  start-statemachine-execution-lambda, use the following Python 3.8 sample code:
import json
import boto3
import logging
import os

LOGGER = logging.getLogger()

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')

def lambda_handler(event, context):

    #log input event
    LOGGER.info("RdsAutoRestart Event Received, now checking if event is eligible. Event Details ==> ", event)

    #Input event from the SNS topic originated from RDS event notifications
    snsMessage = json.loads(event['Records'][0]['Sns']['Message'])
    rdsInstanceId = snsMessage['Source ID']
    stepFunctionInput = {"rdsInstanceId": rdsInstanceId}
    rdsEventId = snsMessage['Event ID']

    #Retrieve RDS instance ARN
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceArn = db_instance['DBInstanceArn']

    # Filter on the Auto Restart RDS Event. Event code: RDS-EVENT-0154. 

    if 'RDS-EVENT-0154' in rdsEventId:

        #log input event
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        #Verify that instance is tagged with auto-restart-protection tag. The tag is used to classify instances that are required to be terminated once started. 

        tagCheckPass = 'false'
        rdsInstanceTags = rdsClient.list_tags_for_resource(ResourceName=rdsInstanceArn)
        for rdsInstanceTag in rdsInstanceTags["TagList"]:
            if 'auto-restart-protection' in rdsInstanceTag["Key"]:
                if 'yes' in rdsInstanceTag["Value"]:
                    tagCheckPass = 'true'
                    #log instance tags
                    LOGGER.info("RdsAutoRestart verified that the instance is tagged auto-restart-protection = yes, now starting the Step Functions Flow")
                    tagCheckPass = 'false'

        #log instance tags
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        if 'true' in tagCheckPass:

            #Initialise StepFunctions Client
            stepFunctionsClient = boto3.client('stepfunctions')

            # Start StepFunctions WorkFlow
            # StepFunctionsArn is stored in an environment variable
            stepFunctionsArn = os.environ['STEPFUNCTION_ARN']
            stepFunctionsResponse = stepFunctionsClient.start_execution(
            stateMachineArn= stepFunctionsArn,
            input= json.dumps(stepFunctionInput)



        LOGGER.info("RdsAutoRestart Event detected, and event is not eligible")

    return {
            'statusCode': 200

And then, configure an SNS source trigger for the function start-statemachine-execution-lambda. RDS event notifications will be published to this SNS topic:

  • In the Designer pane, choose Add trigger.
  • In the Trigger configurations pane, select SNS as a trigger.
  • For SNS topic, choose the SNS topic previously created (rds-event-notifications-topic)
  • For Enable trigger, keep it checked.
  • Choose Add.
  • Choose Save.

For the Lambda function — retrieve-rds-instance-state-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

LOGGER = logging.getLogger()

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')

def lambda_handler(event, context):

    #log input event
    #rdsInstanceId is passed as input to the lambda function from the AWS StepFunctions state machine.  
    rdsInstanceId = event['rdsInstanceId']
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceState = db_instance['DBInstanceStatus']
    return {
        'statusCode': 200,
        'rdsInstanceState': rdsInstanceState,
        'rdsInstanceId': rdsInstanceId

Choose Save.

For the Lambda function, stop-rds-instance-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

LOGGER = logging.getLogger()

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')

def lambda_handler(event, context):
    #log input event
    rdsInstanceId = event['rdsInstanceId']
    #Stop RDS instance
    return {
        'statusCode': 200,
        'rdsInstanceId': rdsInstanceId

Choose Save.

Create a Step Function

AWS Step Functions will execute the following service logic:

  1. Retrieve RDS instance state by calling Lambda function, retrieve-rds-instance-state-lambda. The Lambda function then returns the parameter, rdsInstanceState.
  2. If the rdsInstanceState parameter value is ‘available’, then the state machine will step into the next action calling the Lambda function, stop-rds-instance-lambda. If the rdsInstanceState is not ‘available’, the state machine will then wait for 5 minutes and then re-check the RDS instance state again.
  3. Stopping an RDS instance is an asynchronous operation and accordingly the state machine will keep polling the instance state once every 5 minutes until the rdsInstanceState parameter value becomes ‘stopped’. Only then, the state machine execution will complete successfully.

  • An RDS instance path to ‘available’ state may vary depending on the various maintenance activities scheduled for the instance.
  • Once the RDS notification event is generated, the instance will go through multiple states till it becomes ‘available’.
  • The use of the 5 minutes timer is to make sure that the automation flow will keep attempting to stop the instance once it becomes available.
  • The second part will make sure that the flow doesn’t end till the instance status is changed to ‘stopped’ and hence notifying the system administrator.

To create an AWS Step Functions state machine

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In the navigation pane, choose State machines.
  • In the State machines pane, choose Create state machine.
  • On the Define state machine page, choose Author with code snippets. For Type, choose Standard.
  • Enter a Name for your state machine, stop-rds-instance-statemachine.
  • In the State machine definition pane, add the following state machine definition using the ARNs of the two Lambda function created earlier, as shown in the following code sample:
  "Comment": "stop-rds-instance-statemachine: Automatically shutting down RDS instance after a forced Auto-Restart",
  "StartAt": "retrieveRdsInstanceState",
  "States": {
    "retrieveRdsInstanceState": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceAvailable"
    "isInstanceAvailable": {
      "Type": "Choice",
      "Choices": [
          "Variable": "$.rdsInstanceState",
          "StringEquals": "available",
          "Next": "stopRdsInstance"
      "Default": "waitFiveMinutes"
    "waitFiveMinutes": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRdsInstanceState"
    "stopRdsInstance": {
      "Type": "Task",
      "Resource": "stop-rds-instance-lambda Arn",
      "Next": "retrieveRDSInstanceStateStopping"
    "retrieveRDSInstanceStateStopping": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceStopped"
    "isInstanceStopped": {
      "Type": "Choice",
      "Choices": [
          "Variable": "$.rdsInstanceState",
          "StringEquals": "stopped",
          "Next": "notifyDatabaseAdmin"
      "Default": "waitFiveMinutesStopping"
    "waitFiveMinutesStopping": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRDSInstanceStateStopping"
    "notifyDatabaseAdmin": {
      "Type": "Pass",
      "Result": "World",
      "End": true

This is a definition of the state machine written in Amazon States Language which is used to describe the execution flow of an AWS Step Function.

Choose Next.

  • In the Name pane, enter a name for your state machine, stop-rds-instance-statemachine.
  • In the Permissions pane, choose Create new role. Take note of the the new role’s name displayed at the bottom of the page (example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd).
  • Choose Create state machine
  • By default, the created role only grants the state machine access to CloudWatch logs. Since the state machine will have to make Lambda calls, then another IAM policy has to be associated with the new role.

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
"Version": "2012-10-17",
"Statement": [
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": "lambda:InvokeFunction",
"Resource": "*"
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name rds-auto-restart-stepfunctions-policy and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy.
  • Choose Create policy to save your work.

To link the new policy to the AWS Step Functions execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role create for the state machine (example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd). After selecting the identities, choose Attach policy.


Testing the architecture

In order to test the architecture, create a test RDS instance, tag it with auto-restart-protection tag and set the tag value to yes. While the RDS instance is still in creation process, test the Lambda function —  start-statemachine-execution-lambda with a sample event that simulates that the instance was started as it exceeded the maximum time to remain stopped (RDS-EVENT-0154).

To invoke a function

  • Sign in to the AWS Management Console and open the Lambda console.
  • In navigation pane, choose Functions.
  • In Functions pane, choose start-statemachine-execution-lambda.
  • In the upper right corner, choose Test.
  • In the Configure test event page, choose Create new test event and in Event template, leave the default Hello World option.
    "Records": [
        "EventSource": "aws:sns",
        "EventVersion": "1.0",
        "EventSubscriptionArn": "<RDS Event Subscription Arn>",
        "Sns": {
            "Type": "Notification",
            "MessageId": "10001-2d55da-9a73-5e42d46748c0",
            "TopicArn": "<SNS Topic Arn>",
            "Subject": "RDS Notification Message",
            "Message": "{\"Event Source\":\"db-instance\",\"Event Time\":\"2020-07-09 15:15:03.031\",\"Identifier Link\":\"https://console.aws.amazon.com/rds/home?region=<region>#dbinstance:id=<RDS instance id>\",\"Source ID\":\"<RDS instance id>\",\"Event ID\":\"http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0154\",\"Event Message\":\"DB instance started\"}",
            "Timestamp": "2020-07-09T15:15:03.991Z",
            "SignatureVersion": "1",
            "Signature": "YsuM+L6N8rk+pBPBWoWeRcSuYqo/BN5v9D2lyoSg0B0uS46Q8NZZSoZWaIQi25TXfHY3RYXCXF9WbVGXiWa4dJs2Mjg46anM+2j6z9R7BDz0vt25qCrCyWhmWtc7yeETrlwa0jCtR/wxXFFexRwynqlZeDfvQpf/x+KNLrnJlT61WZ2FMTHYs124RwWU8NY3pm1Os0XOIvm8rfv3ywm1ccZfP4rF7Lfn+2EK6a0635Z/5aiyIlldNZxbgRYTODJYroO9INTlF7NPzVV1Y/K0E9aaL/wQgLZNquXQGCAxPFWy5lxJKeyUocOWcG48KJGIBUC36JJaqVdIilbZ9HvxTg==",
            "SigningCertUrl": "https://sns.<region>.amazonaws.com/SimpleNotificationService-a86cb10b4e1f29c941702d737128f7b6.pem",
            "UnsubscribeUrl": "https://sns.<region>.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=<arn>",
            "MessageAttributes": {}
start-statemachine-execution-lambda uses the SNS MessageId parameter as name for the AWS Step Functions execution. The name field is unique for a certain period of time, accordingly, with every test run the MessageId parameter value must be changed. 
  • Choose Create and then choose Test. Each user can create up to 10 test events per function. Those test events are not available to other users.
  • AWS Lambda executes your function on your behalf. The handler in your Lambda function receives and then processes the sample event.
  • Upon successful execution, view results in the console.
  • The Execution result section shows the execution status as succeeded and also shows the function execution results, returned by the return statement. Following is a sample response of the test execution:

Now, verify the execution of the AWS Step Functions state machine:

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In navigation pane, choose State machines.
  • In the State machine pane, choose stop-rds-instance-statemachine.
  • In the Executions pane, choose the execution with the Name value passed in the test event MessageId parameter.
  • In the Visual workflow pane, the real-time execution status is displayed:

  • Under the Step details tab, all details related to inputs, outputs and exceptions are displayed:


It is recommended to use Amazon CloudWatch to monitor all the components in this architecture. You can use AWS Step Functions to log the state of the execution, inputs and outputs of each step in the flow. So when things go wrong, you can diagnose and debug problems quickly.


When you build the architecture using serverless components, you pay for what you use with no upfront infrastructure costs. Cost will depend on the number of RDS instances tagged to be protected against an automatic start.

Architectural considerations

This architecture has to be deployed per AWS Account per Region.


The blog post demonstrated how to build a fully serverless architecture that monitors and stops RDS instances restarted by AWS. This helps to avoid falling behind on any required maintenance updates. This architecture helps you save cost incurred by started instances’ running hours and licensing implications.  Feel free to submit enhancements to the GitHub repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers

Field Notes: Running a Stateful Java Service on Amazon EKS

Post Syndicated from Tom Cheung original https://aws.amazon.com/blogs/architecture/field-notes-running-a-stateful-java-service-on-amazon-eks/

This post was co-authored  by Tom Cheung, Cloud Infrastructure Architect, AWS Professional Services and Bastian Klein, Solutions Architect at AWS.

Containerization helps to create secure and reproducible runtime environments for applications. Container orchestrators help to run containerized applications by providing extended deployment and scaling capabilities, among others. Because of this, many organizations are installing such systems as a platform to run their applications on. Organizations often start their container adaption with new workloads that are well suited for the way how orchestrators manage containers.

After they gained their first experiences with containers, organizations start migrating their existing applications to the same container platform to simplify the infrastructure landscape and unify their deployment mechanisms.  Migrations come with some challenges, as the applications were not designed to run in a container environment. Many of the existing applications work in a stateful manner. They are persisting files to the local storage and make use of stateful sessions. Both requirements need to be met for the application to properly work in the container environment.

This blog post shows how to run a stateful Java service on Amazon EKS with the focus on how to handle stateful sessions You will learn how to deploy the service to Amazon EKS and how to save the session state in an Amazon ElastiCache Redis database. There is a GitHub Repository that provides all sources that are mentioned in this article. It contains AWS CloudFormation templates that will setup the required infrastructure, as well as the Java application code along with the Kubernetes resource templates.

The Java code used in this blog post and the GitHub Repository are based on a Blog Post from Java In Use: Spring Boot + Session Management Example Using Redis. Our thanks for this content contributed under the MIT-0 license to the Java In Use author.

Overview of architecture

Kubernetes is a popular Open Source container orchestrator that is widely used. Amazon EKS is the managed Kubernetes offering by AWS and used in this example to run the Java application. Amazon EKS manages the Control Plane for you and gives you the freedom to choose between self-managed nodes, managed nodes or AWS Fargate to run your compute.

The following architecture diagram shows the setup that is used for this article.

Container reference architecture


  • There is a VPC composed of three public subnets, three subnets used for the application and three subnets reserved for the database.
  • For this application, there is an Amazon ElastiCache Redis database that stores the user sessions and state.
  • The Amazon EKS Cluster is created with a Managed Node Group containing three t3.micro instances per default. Those instances run the three Java containers.
  • To be able to access the website that is running inside the containers, Elastic Load Balancing is set up inside the public subnets.
  • The Elastic Load Balancing (Classic Load Balancer) is not part of the CloudFormation templates, but will automatically be created by Amazon EKS, when the application is deployed.


Here are the high-level steps in this post:

  • Deploy the infrastructure to your AWS Account
  • Inspect Java application code
  • Inspect Kubernetes resource templates
  • Containerization of the Java application
  • Deploy containers to the Amazon EKS Cluster
  • Testing and verification


If you do not want to set this up on your local machine, you can use AWS Cloud9.

Deploying the infrastructure

To deploy the infrastructure, you first need to clone the Github repository.

git clone https://github.com/aws-samples/amazon-eks-example-for-stateful-java-service.git

This repository contains a set of CloudFormation Templates that set up the required infrastructure outlined in the architecture diagram. This repository also contains a deployment script deploy.sh that issues all the necessary CLI commands. The script has one required argument -p that reflects the aws cli profile that should be used. Review the Named Profiles documentation to set up a profile before continuing.

If the profile is already present, the deployment can be started using the following command:

./deploy.sh -p <profile name>

The creation of the infrastructure will roughly take 30 minutes.

The below table shows all configurable parameters of the CloudFormation template:

parameter name table

This template is initiating several steps to deploy the infrastructure. First, it validates all CloudFormation templates. If the validation was successful, an Amazon S3 Bucket is created and the CloudFormation Templates are uploaded there. This is necessary because nested stacks are used. Afterwards the deployment of the main stack is initiated. This will automatically trigger the creation of all nested stacks.

Java application code

The following code is a Java web application implemented using Spring Boot. The application will persist session data at Amazon ElastiCache Redis, which enables the app to become stateless. This is a crucial part of the migration, because it allows you to use Kubernetes horizontal scaling features with Kubernetes resources like Deployments, without the need to use sticky load balancer sessions.

This is the Java ElastiCache Redis implementation by Spring Data Redis and Spring Boot. It allows you to configure the host and port of the deployed Redis instance. Because this is environment-specific information, it is not configured in the properties file It is injected as environment variables during runtime.


public class Config {

    private String host;
    private Integer port;

    public String getHost() {
        return host;

    public void setHost(String host) {
        this.host = host;

    public Integer getPort() {
        return port;

    public void setPort(Integer port) {
        this.port = port;

    public LettuceConnectionFactory redisConnectionFactory() {

        return new LettuceConnectionFactory(new RedisStandaloneConfiguration(this.host, this.port));



Containerization of Java application


FROM openjdk:8-jdk-alpine

MAINTAINER Tom Cheung <email address>, Bastian Klein<email address>
VOLUME /target

RUN addgroup -S spring && adduser -S spring -G spring
USER spring:spring
ARG DEPENDENCY=target/dependency
COPY ${DEPENDENCY}/org /app/org

ENTRYPOINT ["java","-Djava.security.egd=file:/dev/./urandom","-cp","app:app/lib/*", "com/amazon/aws/SpringBootSessionApplication"]


This is the Dockerfile to build the container image for the Java application. OpenJDK 8 is used as the base container image. Because of the way Docker images are built, this sample explicitly does not use a so-called ‘fat jar’. Therefore, you have separate image layers for the dependencies and the application code. By leveraging the Docker caching mechanism, optimized build and deploy times can be achieved.

Kubernetes Resources

After reviewing the application specifics, we will now see which Kubernetes Resources are required to run the application.

Kubernetes uses the concept of config maps to store configurations as a resource within the cluster. This allows you to define key value pairs that will be stored within the cluster and which are accessible from other resources.


apiVersion: v1
kind: ConfigMap
  name: java-ms
  namespace: default
  host: "***.***.0001.euc1.cache.amazonaws.com"
  port: "6379"

In this case, the config map is used to store the connection information for the created Redis database.

To be able to run the application, Kubernetes Deployments are used in this example. Deployments take care to maintain the state of the application (e.g. number of replicas) with additional deployment capabilities (e.g. rolling deployments).


apiVersion: apps/v1
kind: Deployment
  name: java-ms
  # labels so that we can bind a Service to this Pod
    app: java-ms
  replicas: 3
      app: java-ms
        app: java-ms
      - name: java-ms
        image: bastianklein/java-ms:1.2
        imagePullPolicy: Always
            cpu: "500m" #half the CPU free: 0.5 Core
            memory: "256Mi"
            cpu: "1000m" #max 1.0 Core
            memory: "512Mi"
          - name: SPRING_REDIS_HOST
                name: java-ms
                key: host
          - name: SPRING_REDIS_PORT
                name: java-ms
                key: port
        - containerPort: 8080
          name: http
          protocol: TCP

Deployments are also the place for you to use the configurations stored in config maps and map them to environment variables. The respective configuration can be found under “env”. This setup relies on the Spring Boot feature that is able to read environment variables and write them into the according system properties.

Now that the containers are running, you need to be able to access those containers as a whole from within the cluster, but also from the internet. To be able to route traffic cluster internally Kubernetes has a resource called Service. Kubernetes Services get a Cluster internal IP and DNS name assigned that can be used to access all containers that belong to that Service. Traffic will, by default, be distributed evenly across all replicas.


apiVersion: v1
kind: Service
  name: java-ms
  type: LoadBalancer
    - protocol: TCP
      port: 80 # Port for LB, AWS ELB allow port 80 only  
      targetPort: 8080 # Port for Target Endpoint
    app: java-ms

The “selector“ defines which Pods belong to the services. It has to match the labels assigned to the pods. The labels are assigned in the “metadata” section in the deployment.

Deploy the Java service to Amazon EKS

Before the deployment can start, there are some steps required to initialize your local environment:

  1. Update the local kubeconfig to configure the kubectl with the created cluster
  2. Update the k8s-resources/config-map.yaml to the created Redis Database Address
  3. Build and package the Java Service
  4. Build and push the Docker image
  5. Update the k8s-resources/deployment.yaml to use the newly created image

These steps can be automatically executed using the init.sh script located in the repository. The script needs following parameter:

  1.  -u – Docker Hub User Name
  2.  -r – Repository Name
  3.  -t – Docker image version tag

A sample invocation looks like this: ./init.sh -u bastianklein -r java-ms -t 1.2

This information is used to concatenate the full docker repository string. In the preceding example this would resolve to bastianklein/java-ms:1.2, which will automatically be pushed to your Docker Hub repository. If you are not yet logged in to docker on the command line execute docker login and follow the displayed steps before executing the init.sh script.

As everything is set up, it is time to deploy the Java service. The below list of commands first deploys all Kubernetes resources and then lists pods and services.

kubectl apply -f k8s-resources/

This will output:

configmap/java-ms created
deployment.apps/java-ms created
service/java-ms created


Now, list the freshly created pods by issuing kubectl get pods.

NAME                                                READY       STATUS                             RESTARTS   AGE

java-ms-69664cc654-7xzkh   0/1     ContainerCreating   0          1s

java-ms-69664cc654-b9lxb   0/1     ContainerCreating   0          1s


Let’s also review the created service kubectl get svc.

NAME            TYPE                   CLUSTER-IP         EXTERNAL-IP                                                        PORT(S)                   AGE            SELECTOR

java-ms          LoadBalancer         ***-***.eu-central-1.elb.amazonaws.com         80:32300/TCP       33s               app=java-ms

kubernetes     ClusterIP                 <none>                                                                      443/TCP                 2d1h            <none>


What we can see here is that the Service with name java-ms has an External-IP assigned to it. This is the DNS Name of the Classic Loadbalancer that is created behind the scenes. If you open that URL, you should see the Website (this might take a few minutes for the ELB to be provisioned).

Testing and verification

The webpage that opens should look similar to the following screenshot. In the text field you can enter text that is saved on clicking the “Save Message” button. This text will be listed in the “Messages” as shown in the following screenshot. These messages are saved as session data and now persists at Amazon ElastiCache Redis.

screenboot session example

By destroying the session, you will lose the saved messages.

Cleaning up

To avoid incurring future charges, you should delete all created resources after you are finished with testing. The repository contains a destroy.sh script. This script takes care to delete all deployed resources.

The script requires one parameter -p that requires the aws cli profile name that should be used: ./destroy.sh -p <profile name>


This post showed you the end-to-end setup of a stateful Java service running on Amazon EKS. The service is made scalable by saving the user sessions and the according session data in a Redis database. This solution requires changing the application code, and there are situations where this is not an option. By using StatefulSets as Kubernetes Resource in combination with an Application Load Balancer and sticky sessions, the goal of replicating the service can still be achieved.

We chose to use a Kubernetes Service in combination with a Classic Load Balancer. For a production workload, managing incoming traffic with a Kubernetes Ingress and an Application Load Balancer might be the better option. If you want to know more about Kubernetes Ingress with Amazon EKS, visit our Application Load Balancing on Amazon EKS documentation.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Protecting Domain-Joined Workloads with CloudEndure Disaster Recovery

Post Syndicated from Daniel Covey original https://aws.amazon.com/blogs/architecture/field-notes-protecting-domain-joined-workloads-with-cloudendure-disaster-recovery/

Co-authored by Daniel Covey, Solutions Architect, at CloudEndure, an AWS Company and Luis Molina, Senior Cloud Architect at AWS. 

When designing a Disaster Recovery plan, one of the main questions we are asked is how Microsoft Active Directory will be handled during a test or failover scenario. In this blog, we go through some of the options for IT professionals who are using the CloudEndure Disaster Recovery (DR) tool, and how to best architect it in certain scenarios.

Overview of architecture

In the following architecture, we show how you can protect domain-joined workloads in the case of a disaster. You can instruct CloudEndure Disaster Recovery to automatically launch thousands of your machines in their fully provisioned state in minutes.

CloudEndure DR Architecture diagram

Scenario 1: Full Replication Failover


In this scenario, we are performing a full stack Region to Region recovery including Microsoft Active Directory services.

Using CloudEndure Disaster Recovery  to protect Active Directory in Amazon EC2.

This will be a lift-and-shift style implementation. You take the on-premises Active Directory, and failover to another Region. Although not shown in this blog, this can be done from either on-premises, Cross-Region, or Cross-Cloud during DR or Testing.


For this walkthrough, you should have the following:

  • An AWS account
  • A CloudEndure Account
  • A CloudEndure project configured, with agents installed and replicating in ‘Continuous Data Replication’ Mode
  • A CloudEndure Recovery Plan configured to boot the Active Directory Domain controller first, followed by remaining servers
  • An understanding of Active Directory
  • Two separate VPCs, with matching CIDR ranges, and no connection to the source infrastructure.

Configuration and Launch of Recovery Plan

1. Log in to the CloudEndure Console
2. Ensure the blueprint settings for each machine are configured to boot either in the Test VPC or Failover VPC, depending on the reason for booting,
a. These changes can be done either through the console, or by using the CloudEndure API operations.
b. To change blueprints on a mass scale, use the mass blueprint setter scripts (Zip file with instructions).
3. Open “Recovery Plans” section for the project
a. Create a new Recovery Plan following these steps
b. Tip: Add in a delay between the launch of the Active Directory server, and the following servers, to allow Active Directory services to come up before the rest of the infrastructure.
4. Once you have created the Recovery Plan, you can either launch it from the CloudEndure console, or use the CloudEndure API Operations.

*Note: there is full CloudEndure failover and failback documentation.

There are different ways to clean up resources, depending on whether this was a test launch, or true failover.

  • Test Launch – You can choose the “Delete x target machines” under the “Machines” tab.
    • This will delete all machines created by CloudEndure in the VPC they were launched into.
  • True failover – At this time, you can choose to failback as needed.
    • Once failback is completed, you can use the same preceding steps as to delete the infrastructure spun up by CloudEndure.

Scenario 2: Warm Site Recovery


In this scenario, we perform a failover/recovery into a Region with a fully writeable and online Active Directory domain controller. This domain controller is running as an EC2 instance and is an extension of the on-premises, or cross cloud/region Active Directory infrastructure.


For this walkthrough, you should have the following:

  • An AWS account
  • A CloudEndure Account
  • A CloudEndure project configured, with agents installed and replicating in Continuous Data Replication Mode
  • An understanding of Active Directory
  • A deployment of Active Directory with online writeable domain controller(s)

Preparing AWS and Active Directory:

For our example us-west-1 (California) will be the  source environment CloudEndure is protecting. We have specified us-east-1 (N.Virginia) as the target recovery Region aka “warm site”.

  • The source Region will consist of a VPC configured with public and private (AD domain) subnets and security groups
  • AD Domain Controllers are deployed in the source environment (DC1 and DC2)


1.     Set up a target recovery site/VPC in a Region of your choice. We refer to this as the warm site.

2.     Configure connectivity between the source environment you are protecting, and the warm site.

a.     This can be accomplished in multiple ways depending on whether your source environment is on-premises (VPN or Direct connect), an alternate cloud provider (VPN tunnel), or a different AWS Region (VPC peering). For our example the source environment we are protecting is in us-west-1, and the warm recovery site is in us-east-1, both regions VPCs are connected via VPC peering.

3.     Establish connectivity between the source environment and the warm site. This ensures that the appropriate routes, subnets and ACLs are configured to allow AD authentication and replication traffic to flow between the source and warm recovery site.

4.     Extend your Active Directory into the warm recovery site by deploying a domain controller (DC3) into the warm site. This domain controller will handle Active Directory authentication and DNS for machines that get recovered into the warm site.

5.     Next, create a new Active Directory site. Use the Active Directory Sites and Services MMC for the warm recovery site prepared in us-east-1, and DC3 will be its associated domain controller.

a.     Once the site is created, associate the warm recovery site VPC networks to it. This will enforce local Active Directory client affinity to DC3 so that any machines recovered into the warm site use DC3 rather than the source environment domain controllers.  Otherwise, this could introduce recovery delays if the source environment domain controllers are unreachable.

Screenshot of Active Directory sites

6.     Now, you set DHCP options for the warm site recovery VPC. This sets the warm site domain controller (DC3) as the primary DNS server for any machines that get recovered into the warm site, allowing for a seamless recovery/failover.

Screenshot of DHCP options

Test or Failover procedure:

Review the “Configuration and Launch of Recovery Plan” as provided earlier in this blog post.

Cleaning up

To avoid incurring future charges, delete all resources used in both scenarios.


In this blog, we have provided you a few ways to successfully configure and test domain-joined servers, with their Active Directory counterpart. Going forward, you can test and fine tune the CloudEndure Recovery Plans to limit the down time needed for failover. Further blog posts will go into other ways to failover domain-joined servers.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Streaming VR to Wireless Headsets Using NVIDIA CloudXR

Post Syndicated from William Cannady original https://aws.amazon.com/blogs/architecture/field-notes-streaming-vr-to-wireless-headsets-using-nvidia-cloudxr/

It’s exciting to see many consumer-grade virtual reality (VR) hardware options, but setting up hardware can be cumbersome, expensive and complicated. Wired headsets require high-powered graphics workstations, and a solution to prevent you from tripping over the wires. Many room-scale headsets require two external peripherals (or ‘light towers’) to be installed so the headset can position itself in a room. These setups can take days to tune, and need resetting if the light towers are moved.

With the release of the Oculus Quest, users of virtual reality were delighted with a wireless, room-scale headset with dual hand tracking. They could enjoy VR without worrying about light towers or a high-powered graphics workstation. However, because the Quest was battery powered, it used an inherently low-powered central processing unit (CPU) and graphics processing unit (GPU). As a result, VR content had to be simplified to run on the Quest. This prevented customers from using the Quest for the most demanding graphics experiences, such as Computer Aided Design (CAD) review, or playing games such as Half-Life: Alyx.

Customers were faced with a difficult choice: expensive, complicated setups, or reduced-fidelity experiences.

In this blog post, we show you how to stream a full-fidelity VR experience from a computer on AWS to a wireless headset such as the Quest.

Overview of architecture

NVIDIA CloudXR takes NVIDIA’s experience in GPU encoding and decoding, and streams pixels to a remotely connected VR headset. By doing this, the rendering and compute requirements of visually intensive applications take place on a remote server instead of a local headset. This makes mobile headsets work with any application, regardless of their visual complexity and density.


Figure 1: architecture for streaming VR experiences from the AWS Cloud to a VR headset using NVIDIA’s CloudXR server running on EC2.

Figure 1: architecture for streaming VR experiences from the AWS Cloud to a VR headset

To provide global scalability,  NVIDIA announced the CloudXR platform will be available on G4 and P3 EC2 instances. It provides the following benefits:

  • At a global scale, customers can stream remote AR/VR experiences from Regions that are close them.
  • It enables centrally managed and deployed software experiences on Amazon Elastic Compute Cloud (Amazon EC2) instances. Previously these required physical transportation and implementation of devices and server hardware.
  • Lastly, IT administrators can now centrally manage content that may be sensitive or require frequent changes.


Using CloudXR on AWS requires EC2 instances with NVIDIA GPUs (that is, the P3 or G4 instance types) running within your virtual private cloud (VPC). The instance must be network accessible to a remote CloudXR client running on a VR headset. Connections are 1:1, meaning that each CloudXR client is connected to a dedicated EC2 instance. If needs require multiple CloudXR clients, you can deploy multiple EC2 instances.

Note that the process outlined is accurate as of January 2021. CloudXR, and X Reality (XR) overall, is rapidly changing. Consult the latest information about CloudXR from NVIDIA. Using CloudXR within your AWS account requires you setup P3 or G4 EC2 instances, as you would within an Amazon VPC. You must also add a security group that allows the ports required for CloudXR communication. These specific ports can be found in the CloudXR documentation, available from NVIDIA.

We have created a CloudFormation template that deploys an EC2 instance with CloudXR configured for reference, linked in the prerequisites. Because it makes reference to a private AMI, it must be shared with your account in order to deploy successfully. If you’re interested in using this template, contact your AWS Account team.


The following steps describe how to configure the EC2 instance manually. CloudXR streaming requires using a connection other than Windows RDP to connect to the remote EC2 instance. We use NICE DCV, which is provided at no cost to EC2 instances for remote connectivity.

For this walkthrough, you should have the following prerequisites:

Deploy CloudXR Server onto Amazon EC2

It’s important to note the steps outlined are for configuring a G4 instance. If you’d prefer to use a P3 instance, manually deploy your P3 instance and install NICE DCV as described in the documentation.

  1. Log into your AWS Account and navigate to the AWS Marketplace to install an EC2 instance with NICE DCV configured.
  2. Create a new security group during deployment that matches the CloudXR port settings and apply it to your instance. Consult the CloudXR documentation for the latest port settings.
  3. Wait 5 minutes for everything to initialize properly. Make note of the issues public IP address (or attach an Elastic IP address to the instance).
  4. Navigate to https://<IP-OF-INSTANCE>:8443 to connect to NICE DCV web-browser client. b.Use the credentials created during EC2 initialization to Log in

NICE DCV login screen

NICE DCV login screen on Web Browser Client

5. Once logged into your EC2 instance, install SteamVR and CloudXR onto the remote EC2 instance. SteamVR is used as an OpenVR/XR proxy between your VR application and CloudXR. CloudXR is used to stream the SteamVR experience to a remote CloudXR Client.

6. Verify installation of the CloudXR plugin into SteamVR by navigating to the Manage Add-ons page within the Advanced Settings option. Make sure it lists CloudXRRemoteHMD as an addon and is set to ON.

Verification of CloudXR Installation

Verification of CloudXR Installation

7. Add an allow entry to the Windows Firewall Entry for VRServer.exe. This allows SteamVR to use the CloudXR to stream properly. By default this file is located at %ProgramFiles% (x86)\Steam\steamapps\common\SteamVR\bin\win64\vrserver.exe\

Enabling the VRSERVER.EXE application through the Windows Application Firewall.


8. Install a CloudXR client onto your VR headset. If using an android-powered headset (that is, the Oculus Quest), you can use the sample APK within the CloudXR SDK

9. Select Finish.

Connect to your CloudXR Server and start streaming

1. Launch SteamVR on your remote EC2 instance by logging into your Steam account or configuring a no-login link following the Installation/use of SteamVR in an environment without internet access instructions.

2. When loaded, it will report a headset cannot be detected. This is OK.

SteamVR will display Headset not Detected—this is OK

SteamVR will display Headset not Detected—this is OK.

3. Within your Client headset, load the CloudXR Client application you recently installed.

4. Once connected, the headset will start display the SteamVR “void”. You also should see a view of your headset if SteamVR mirroring is enabled. The status box on the SteamVR server application will show a headset and 2 controllers attached as well.

SteamVR “Void” and Headset Connected icons

SteamVR “Void” and Headset Connected icons

5. Congratulations. You’re now connected to an AWS EC2 instance using NVIDIA CloudXR! Any VR application you now run on the EC2 server that uses OpenVR will be streamed to your VR headset!

Cleaning up

EC2 instances are billed only when they’re being used. You’ll want to make sure to stop your instance or shut it down when you are finished with your session. Terminating your instance is not necessary.


In this blog post, we showed how to stream a full-fidelity VR experience from a computer on AWS to a wireless headset. Having the ability to remotely connect to GPU-powered servers to run graphic workloads is not necessarily new, but connecting to a remote server with a VR headset and having full interactivity certainly is. With this architecture, you realize the benefits of CloudXR combined with the agility and scalability available on AWS. It becomes less challenging to manage content played on VR headsets because content doesn’t reside on the VR headset—it lives on the EC2 server.

Deploying to any AWS region where GPU instances are available allows you to offer CloudXR to your users at global scale.  As networks get faster and closer through services like AWS Outposts and AWS Wavelength, remote VR work will become possible for more customers. We’re excited to see what new workloads come next as this way of working grows.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Leo Chan

Leo Chan

Leo Chan is the Worldwide Tech Lead for Spatial Computing at AWS. He loves working at the intersection of Art and Technology and lives with his family in a rain forest just off the coast of Vancouver, Canada.

Top 15 Architecture Blog Posts of 2020

Post Syndicated from Jane Scolieri original https://aws.amazon.com/blogs/architecture/top-15-architecture-blog-posts-of-2020/

The goal of the AWS Architecture Blog is to highlight best practices and provide architectural guidance. We publish thought leadership pieces that encourage readers to discover other technical documentation, such as solutions and managed solutions, other AWS blogs, videos, reference architectures, whitepapers, and guides, Training & Certification, case studies, and the AWS Architecture Monthly Magazine. We welcome your contributions!

Field Notes is a series of posts within the Architecture blog channel which provide hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

We would like to thank you, our readers, for spending time on our blog this last year. Much appreciation also goes to our hard-working AWS Solutions Architects and other blog post writers. Below are the top 15 Architecture & Field Notes blog posts written in 2020.

#15: Field Notes: Choosing a Rehost Migration Tool – CloudEndure or AWS SMS

by Ebrahim (EB) Khiyami

In this post, Ebrahim provides some considerations and patterns where it’s recommended based on your migration requirements to choose one tool over the other.

Read Ebrahim’s post.

#14: Architecting for Reliable Scalability

by Marwan Al Shawi

In this post, Marwan explains how to architect your solution or application to reliably scale, when to scale and how to avoid complexity. He discusses several principles including modularity, horizontal scaling, automation, filtering and security.

Read Marwan’s post.

#13: Field Notes: Building an Autonomous Driving and ADAS Data Lake on AWS

by Junjie Tang and Dean Phillips

In this post, Junjie and Dean explain how to build an Autonomous Driving Data Lake using this Reference Architecture. They cover all steps in the workflow from how to ingest the data, to moving it into an organized data lake construct.

Read Junjie’s and Dean’s post.

#12: Building a Self-Service, Secure, & Continually Compliant Environment on AWS

by Japjot Walia and Jonathan Shapiro-Ward

In this post, Jopjot and Jonathan provide a reference architecture for highly regulated Enterprise organizations to help them maintain their security and compliance posture. This blog post provides an overview of a solution in which AWS Professional Services engaged with a major Global systemically important bank (G-SIB) customer to help develop ML capabilities and implement a Defense in Depth (DiD) security strategy.

Read Jopjot’s and Jonathan’s post.

#11: Introduction to Messaging for Modern Cloud Architecture

by Sam Dengler

In this post, Sam focuses on best practices when introducing messaging patterns into your applications. He reviews some core messaging concepts and shows how they can be used to address challenges when designing modern cloud architectures.

Read Sam’s post.

#10: Building a Scalable Document Pre-Processing Pipeline

by Joel Knight

In this post, Joel presents an overview of an architecture built for Quantiphi Inc. This pipeline performs pre-processing of documents, and is reusable for a wide array of document processing workloads.

Read Joel’s post.

#9: Introducing the Well-Architected Framework for Machine Learning

by by Shelbee Eigenbrode, Bardia Nikpourian, Sireesha Muppala, and Christian Williams

In the Machine Learning Lens whitepaper, the authors focus on how to design, deploy, and architect your machine learning workloads in the AWS Cloud. The whitepaper describes the general design principles and the five pillars of the Framework as they relate to ML workloads.

Read the post.

#8: BBVA: Helping Global Remote Working with Amazon AppStream 2.0

by Jose Luis Prieto

In this post, Jose explains why BBVA chose Amazon AppStream 2.0 to accommodate the remote work experience. BBVA built a global solution reducing implementation time by 90% compared to on-premises projects, and is meeting its operational and security requirements.

Read Jose’s post.

#7: Field Notes: Serverless Container-based APIs with Amazon ECS and Amazon API Gateway

by Simone Pomata

In this post, Simone guides you through the details of the option based on Amazon API Gateway and AWS Cloud Map, and how to implement it. First you learn how the different components (Amazon ECS, AWS Cloud Map, API Gateway, etc.) work together, then you launch and test a sample container-based API.

Read Simone’s post.

#6: Mercado Libre: How to Block Malicious Traffic in a Dynamic Environment

by Gaston Ansaldo and Matias Ezequiel De Santi

In this post, readers will learn how to architect a solution that can ingest, store, analyze, detect and block malicious traffic in an environment that is dynamic and distributed in nature by leveraging various AWS services like Amazon CloudFront, Amazon Athena and AWS WAF.

Read Gaston’s and Matias’ post.

#5: Announcing the New Version of the Well-Architected Framework

by Rodney Lester

In this post, Rodney announces the availability of a new version of the AWS Well-Architected Framework, and focuses on such issues as removing perceived repetition, adding content areas to explicitly call out previously implied best practices, and revising best practices to provide clarity.

Read Rodney’s post.

#4: Serverless Stream-Based Processing for Real-Time Insights

by Justin Pirtle

In this post, Justin provides an overview of streaming messaging services and AWS Serverless stream processing capabilities. He shows how it helps you achieve low-latency, near real-time data processing in your applications.

Read Justin’s post.

#3: Field Notes: Working with Route Tables in AWS Transit Gateway

by Prabhakaran Thirumeni

In this post, Prabhakaran explains the packet flow if both source and destination network are associated to the same or different AWS Transit Gateway Route Table. He outlines a scenario with a substantial number of VPCs, and how to make it easier for your network team to manage access for a growing environment.

Read Prabhakaran’s post.

#2: Using VPC Sharing for a Cost-Effective Multi-Account Microservice Architecture

by Anandprasanna Gaitonde and Mohit Malik

Anand and Mohit present a cost-effective approach for microservices that require a high degree of interconnectivity and are within the same trust boundaries. This approach requires less VPC management while still using separate accounts for billing and access control, and does not sacrifice scalability, high availability, fault tolerance, and security.

Read Anand’s and Mohit’s post.

#1: Serverless Architecture for a Web Scraping Solution

by Dzidas Martinaitis

You may wonder whether serverless architectures are cost-effective or expensive. In this post, Dzidas analyzes a web scraping solution. The project can be considered as a standard extract, transform, load process without a user interface and can be packed into a self-containing function or a library.

Read Dzidas’ post.

Thank You

Thanks again to all our readers and blog post writers! We look forward to learning and building amazing things together in 2021.

Field Notes: Speed Up Redaction of Connected Car Data by Multiprocessing Video Footage with Amazon Rekognition

Post Syndicated from Sandeep Kulkarni original https://aws.amazon.com/blogs/architecture/field-notes-speed-up-redaction-of-connected-car-data-by-multiprocessing-video-footage-with-amazon-rekognition/

In the blog, Redacting Personal Data from Connected Cars Using Amazon Rekognition, we demonstrated how you can redact personal data such as human faces using Amazon Rekognition. Traversing the video, frame by frame, and identifying personal information in each frame takes time. This solution is great for small video clips, where you do not need a near real-time response. However, in some use cases like object detection, real time traffic monitoring, you may need to process this information in near real-time and keep up with the input video stream.

In this blog post, we introduce how to leverage “multiprocessing” to speed up the redaction process and provide a response in near real time. We also compare the process run times using a variety of Amazon SageMaker instances to give users various options to process video using Amazon Rekognition.

For example, the ml.c5.4xlarge instance has 16 vCPUs, so we could theoretically have 16 processes, working in parallel, to process the video stream, which will significantly reduce the processing time. Our test against the sample video shows that we reduce the process run time by a factor of 11x, using the ml.c5.4xlarge instance.

Architecture Overview

Video Redaction - Multiprocessing

Walkthrough: 6 Steps

1. We will assume that the video data from the car was ingested and is stored in a “Raw” Amazon S3 bucket. (For real time analytics, video data will likely be ingested from the connected vehicles into an Amazon Kinesis Video Stream)

2.  In this architecture we will use an Amazon SageMaker notebook instance, which is a machine learning (ML) compute instance running the Jupyter Notebook App.

3. Additionally an AWS Identity and Access Management (IAM) role created with appropriate permissions is leveraged to provide temporary security credentials required for this program.

4. The individual frames are analyzed by calling the “DetectFaces” Amazon Rekognition API, which analyzes and provides metadata about the frame. If a face is detected in the frame, then Amazon Rekognition returns a bounding box per face.

5.  We write a function multi_process_video to blur the detected face for each frame and distribute the processing job equally among all available CPUs in the SageMaker instance

6. We run the multi_process function for the input video and write the output video to S3 bucket for further analysis.

Detailed Steps

For the 5 steps mentioned previously, we provide the input video, code samples and the corresponding output video.

Step 1: Login to the AWS console with your user credentials.

  • Upload the sample video to your S3 bucket.
    Name it face1.mp4. I’ve included the following example of the video input.

Step 2: In this block, we will create a SageMaker notebook.

Notebook instance:

  • Notebook instance name: VideoRedaction
    Notebook instance class: choose “ml.t3.large” from drop down
    Elastic inference: None


  • IAM role: Select Create a new role from the drop-down menu. This will open a new screen, click next and the new role will be created. The role name will start with AmazonSageMaker-ExecutionRole-xxxxxxxx.
  • Root access: Select Enable
  • Assume defaults for the rest, and select the orange “Create notebook instance” button at the bottom.

This will take you to the next screen, which shows that your notebook instance is being created. It will take a few minutes and you can monitor the status, which will show a green “InService” state, when the notebook is ready.

Step 3:  Next, we need to provide additional permissions to the new role that you created in Step 2.

  • Select the VideoRedaction notebook.
    This will open a new screen. Scroll down to the 3 block – “Permissions and encryption” and click on the IAM role ARN link.

This will open a screen where you can attach additional policies. It will already be populated with “AmazonSageMakerFullAccess”

  • Select the blue Attach policies button.
  • This will open a new screen, which will allow you to add permissions to your execution role.
    • Under “Filter policies” search for S3full. AmazonS3FullAccess. Check the box next to it.
    • Under “Filter policies” search for Rekognition. Check the box next to AmazonRekognitionFullAccess and AmazonRekognitionServiceRole.
    • Click blue Attach Policies button at the bottom. This will populate a screen which will show you the five policies attached as follows:

Permissions policies

  • Click on the Add inline policy link on the right and then click on the JSON tab on the next screen. Paste the following policy replacing the <account number> with your AWS account number:
    "Version": "2012-10-17",
    "Statement": [
            "Sid": "MySid",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<accountnumber>:role/serviceRekognition"

On the next screen enter VideoInlinePolicy for the name and select the blue Create Policy button at the bottom.

Permissions Policies - 6 Policies Applied

Step 3a:  Navigate to SageMaker in the console:

  • Select “Notebook instances” in the menu on left. This will show your VideoRedaction notebook.
  • Select Open Jupyter blue link under Actions. This will open a new tab titled, Jupyter.

Step 3b: In the upper right corner, click on drop down arrow next to “New” and choose conda_tensorflow_p36 as the kernel for your notebook.

Your screen will look at follows:


Install ffmpeg

First, we need to install ffmpeg for multiprocessing video. It’s a free and open-source software project consisting of a large suite of libraries and programs for handling video, audio, and other multimedia files and streams. We use it to concatenate all the subset videos processed by each vCPU and generate the final output.

Install ffmpeg using the following command:

!conda install x264=='1!152.20180717' ffmpeg=4.0.2 -c conda-forge --yes  

Import libraries – We import additional libraries to help with multi-processing capability.

import cv2  
import os  
from PIL import ImageFilter  
import boto3  
import io  
from PIL import Image, ImageDraw, ExifTags, ImageColor  
import numpy as np  
from os.path import isfile, join  
import time  
import sys  
import time  
import subprocess as sp  
import multiprocessing as mp  
from os import remove  

Step 4: Identify personal data (faces) in the individual frames

Amazon Rekognition “Detect_Faces” detects the 100 largest faces in the image. For each face detected, the operation returns face details. These details include a bounding box of the face, a confidence value (that the bounding box contains a face), and a fixed set of attributes such as facial landmarks (for example, coordinates of eye and mouth), presence of beard, sunglasses, and so on.

You pass the input image either as base64-encoded image bytes or as a reference to an image in an Amazon S3 bucket. In this code, we pass the image as jpg to Amazon Rekognition since we want to see each frame of this video. We also show how you can expand the bounding boxes returned by Amazon Rekognition, if required, to blur an enlarged portion of the face.

	def detect_blur_face_local_file(photo,blurriness):      
	    # Call DetectFaces      
	    with open(photo, 'rb') as image:      
	        response = client.detect_faces(Image={'Bytes': image.read()})      
	    imgWidth, imgHeight = image.size        
	    draw = ImageDraw.Draw(image)         
	    # Calculate and display bounding boxes for each detected face             
	    for faceDetail in response['FaceDetails']:      
	        box = faceDetail['BoundingBox']      
	        left = imgWidth * box['Left']      
	        top = imgHeight * box['Top']      
	        width = imgWidth * box['Width']      
	        height = imgHeight * box['Height']      
	        #blur faces inside the enlarged bounding boxes
	        #you can also keep the original bounding boxes    
	        mask = Image.new('L', image.size, 0)      
	        draw = ImageDraw.Draw(mask)      
	        draw.rectangle([ (x1,y1), (x2,y2) ], fill=255)      
	        blurred = image.filter(ImageFilter.GaussianBlur(blurriness))      
	        image.paste(blurred, mask=mask)      
	    return image      

Step 5: Redact the face bounding box and distribute the processing among all CPUs

By passing the group_number of the multi_process_video function, you can distribute the video processing job among all available CPUs of the instance equally and therefore largely reduce the process time.

	def multi_process_video(group_number):  
	    cap = cv2.VideoCapture(input_file)  
	    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_jump_unit * group_number)  
	    proc_frames = 0  
	    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))  
	    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))  
	    fps = cap.get(cv2.CAP_PROP_FPS)  
	    out = cv2.VideoWriter(  
	        "{}.{}".format(group_number, 'mp4'),  
	        (width, height),  
	    while proc_frames < frame_jump_unit:  
	        ret, frame = cap.read()  
	        if ret == False:  
	        #Define the blurriness  
	        blurred_frame=cv2.cvtColor(np.array(blurred_img), cv2.COLOR_BGR2RGB)    
	        proc_frames += 1  
	        print('Group '+str(group_number)+' finished processing!')  
	    return None  

Step 6: Run multi-processing video function and write the redacted video to the output bucket

  • Then we multi-process the video and generate the output using multiprocessing function and ffmpeg in python.
  • We take a record of each video processed by a CPU in the format of ‘1.mp4’, ‘2.mp4’ … in a file called multiproc_files and then use subprocess to call ffmpeg to concatenate these videos based on these videos’ order in multiproc_files.
  • After the final video is generated, we remove all the intermediate results and upload the face-blurred result to a S3 bucket.
	start_time = time.time()  
	# Connect to S3  
	s3_client = boto3.client('s3')  
	# Download S3 video to local. Enter your bucketname and file name below
	s3_client.download_file(bucket, file, './'+file)  
	num_processes = mp.cpu_count()  
	cap = cv2.VideoCapture(input_file)  
	frame_jump_unit = cap.get(cv2.CAP_PROP_FRAME_COUNT) // num_processes  
	# Multiprocessing video across all vCPUs    
	p = mp.Pool(num_processes)  
	p.map(multi_process_video, range(num_processes))  
	# Generate multiproc_files to record the subset videos in the right order    
	multiproc_files = ["{}.{}".format(i, 'mp4') for i in range(num_processes)]  
	with open("multiproc_files.txt", "w") as f:  
	    for t in multiproc_files:  
	        f.write("file {} \n".format(t))  
	# Use ffmpeg to concatenate all the subset videos according to multiproc_files   
	ffmpeg_command="ffmpeg -f concat -safe 0 -i multiproc_files.txt -c copy "  
	ffmpeg_command += local_filename  
	cmd = sp.Popen(ffmpeg_command, stdout=sp.PIPE, stderr=sp.PIPE, shell=True)  
	# Remove all the intermediate results    
	for f in multiproc_files:  
	filelist = [ f for f in os.listdir(mydir) if f.endswith(".jpg") ]  
	for f in filelist:  
	    os.remove(os.path.join(mydir, f))  
	# Upload face-blurred video to s3  
	response = s3_client.upload_file(local_filename, bucket, s3_filename)   
	finish_time = time.time()  
	print( "Total Process Time:",finish_time-start_time,'s')  


Group 13 finished processing!

Group 15 finished processing!

Group 14 finished processing!

Group 12 finished processing!

Group 11 finished processing!

Group 9 finished processing!

Group 10 finished processing!

Group 1 finished processing!

Group 3 finished processing!

Group 4 finished processing!

Group 8 finished processing!

Group 5 finished processing!

Group 2 finished processing!

Group 7 finished processing!

Group 6 finished processing!

Group 0 finished processing!

Total Process Time: 15.709482431411743 s

Using the same instance, we reduce the process time from 168s to 15.7s. As we mentioned, ml.c5.4xlarge has 16 vCPUs and you can even further reduce the process time if you have an instance that has 32 or 64 CPUs.

Note: Choosing the right instance will depend on your requirement for process time and cost. As this result demonstrates, multiprocessing video using Amazon Rekognition is an efficient way to leverage the benefits of Amazon Rekognition state-of-the-art ML model and powerful multi-core Amazon SageMaker instances.

Comparison of Amazon SageMaker Instances in Terms of Process Time and Cost

Here is the comparison table generated when processing a 6.5 seconds video with multiple faces on different SageMaker instances. Following is a video screenshot:

Video screenshot with faces of 5 people blurred

Based on the following table, you learn that instances with 16 vCPU (4xlarge) are better options in terms of faster processing capability, while optimized for cost.

Table with SageMaker Instance Types

Depending on the size of your input video file and the requirements for real-time processing, you can break the input video file into smaller chunks and then scale instances to process those chunks in parallel. While this example is focused on blurring faces, you can also use AWS Rekognition for other use cases like someone wielding a gun, smoking a cigarette, suggestive content and the like.  These and many other moderation activities are all supported by Rekognition content moderation APIs.


In this blog post, we showed how you can leverage multiple cores in large machine learning instances, along with Amazon Rekognition. Doing this can significantly speed up the process of redacting personally identifiable information from videos collected by connected vehicles. The ability to provide near-real-time information unlocks additional value from the video that is ingested. For example, in smart cities, information is collected about the environment, such as road traffic and weather. This data can be visualized in near-real-time to help city management make decisions that can optimize traffic and improve residents’ quality of life.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: How FactSet Uses ‘microAccounts’ to Reduce Developer Friction and Maintain Security at Scale

Post Syndicated from Tarik Makota original https://aws.amazon.com/blogs/architecture/field-notes-how-factset-uses-microaccounts-to-reduce-developer-friction-and-maintain-security-at-scale/

This is post was co-written by FactSet’s Cloud Infrastructure team, Gaurav Jain, Nathan Goodman, Geoff Wang, Daniel Cordes, Sunu Joseph and AWS Solution Architects, Amit Borulkar and Tarik Makota.

FactSet considers developer self-service and DevOps essential for realizing cloud benefits.  As part of their cloud adoption journey, they wanted developers to have a frictionless infrastructure provisioning experience while maintaining standardization and security of their cloud environment.  To achieve their objectives, they use what they refer to as a ‘microAccounts approach’. In their microAccount approach, each AWS account is allocated for one project and is owned by a single team.

In this blog, we describe how FactSet manages 1000+ AWS accounts at scale using the microAccounts approach. First, we cover the core concepts of their approach. Then we outline how they manage access and permissions. Finally, we show how they manage their networking implementation and how they use automation to manage their AWS Cloud infrastructure.

How FactSet started with AWS

They started their cloud adoption journey with what they now call a ‘macroAccounts’ approach. In the early days they would set up a handful of AWS accounts. These macroAccounts were then shared across several different application teams and projects.   They have hundreds of application teams along with thousands of developers and they quickly experienced the challenges of a macroAccounts approach. These include the following:

  1. AWS Identity and Access Management (IAM) policies and resource tagging were complex to design in order to maintain least privilege. For example, if a developer desired the ability to start/stop Amazon EC2 instances, they would need to ensure that they are limited to starting/stopping only their own instances.  This complexity kept increasing as developers wanted to automate their workflows using constructs such as AWS Lambda functions, and containers.
  2. They had difficulty in properly attributing cloud costs across departments.  More importantly they kept going back and forth on: how do we establish accountability and transparency around spends by groups, projects, or teams?
  3. It was difficult to track and manage impact of infrastructure change to FactSet applications. For example, how is maintenance off underlying security group or IAM policy affecting FactSet applications?
  4. Significant effort was required in managing service quotas and limits across various applications being under single AWS account.

FactSet’s solution – microAccounts

Recognizing the issues, they decided to take a different approach to AWS account management. Instead of creating a few shared macro-accounts, they decided to create one AWS account per project (microAccounts) with clearly defined ownership and product allocation.  An analogy might be that macro-accounts were like leaving the main door of a house open but locking individual closets and rooms to limit access. This is opposed to safeguarding the entry to the house but largely leaving individual closets and rooms open for the tenant to manage.

Benefits of microAccounts

They have been operating their AWS Cloud infrastructure using microAccounts for about two years now. Benefits of the microAccount approach include:

1.      Access & Permissions: By associating an account with a project they simplified which services are allowed, which resources that development team can access, and are able to ensure that those permissions cascade properly to underlying resources.  The following diagram shows their microAccount strategy.


Tagging versus microAccount strategy

Figure 1 – Tagging versus microAccount strategy

2.      Service Quotas & Limits: Given most service quotas are account specific, microAccounts allow their developers to plan limits based on their application needs.  In a shared account configuration, there was no mechanism to limit separate teams from using up a larger portion of the service quota, leaving other teams with less.  These limits extend beyond infrastructure provisioning to run time tasks like Lambda concurrency, API throttling limits on parameter store and more.

3.      AWS Service Permissions: microAccounts allowed FactSet to easily implement least privilege across services. By using IAM service control policies (SCPs) they limit what AWS services an account can access.  They start with a default set of services and based on business need we can grant a specific account access to other non-common services without having to worry about those services creeping into other use cases.  For example, they disable storage gateway by default, but can allow access for a specific account if needed.

4.      Blast Radius Containment:  microAccounts provides the ability to create safety boundaries. This is in the event of any stability and security issues, they stay isolated within that specific application (AWS account) and they don’t affect operations of other applications.

5.     Cost Attributions:  Clearly defined account ownership provides a simple and straightforward way to attribute costs to a specific team, project, or product.  They don’t have to enforce the tagging individual resources for cost purposes. AWS account acts like an application resource group so all resources in the account are implicitly tagged.

6.      Account Notifications & Operations:  Single threaded account ownership allows FactSet to automatically relay any required notification to right developers.  Moreover, given that account ownership is fundamental in defining who is allowed access to the account, there is a high level of confidence in the validity of this mapping as opposed to relying on just tagging.

7.      Account Standards & Extensions: we manage microAccounts through a CI/CD pipeline which allows us to standardize and extend without interruptions.  For example, all their microAccounts are provisioned with a standard AWS Key Management Service (AWS KMS) key, an AWS Backup Vault & policy, private Amazon Route 53 zone, AWS Systems Manager Parameter Store with network information for Terraform or AWS CloudFormation templates.

8.      Developer Experience: microAccount automation and guardrails allow developers to get started quickly instead of spending time debugging things like correct SCP/IAM permissions and more. Developers tend to work across multiple applications and their experience has improved as they have a standard set of expectations for their AWS environment. This is particularly useful as they move from application to application.

Access and permissions for microAccounts

FactSet creates every AWS account with a standard set of IAM roles and permissions. Furthermore, each account has its own SCP which defines the list of services allowed in the account.  Based on application needs, they can extend the permissions.  Interactive roles are mapped to an ActiveDirectory (AD) group, and membership of the AD group is managed by the development teams themselves.  Standard roles are:

  • DevOps Role – Interactive role used to provision and manage infrastructure.
  • Developer Role – Interactive role used to read/write data (and some infrastructure)
  • ReadOnly Role – Interactive role with read-only access to the account.  This can be granted to account supervisors, product developers, and other similar roles.
  • Support Roles – Interactive roles for certain admin teams to assist account owners if needed
  • ServiceExecutionRole – Role that can be attached to entities such as Lambda functions, CodeBuild, EC2 instances, and has similar permissions to a developer role.
IAM Role Privileges

Figure 2 – IAM Role Privileges

Networking for microAccounts

  • FactSet leverages AWS Resource Access Manager (RAM) to share appropriate subnets with each account.  Each microAccount provisioned has access to subnets by sing AWS Shared VPCs.  They create a single VPC per business unit per environment (Dev, Prod, UAT, and Shared Services) in each region.  RAM enabled them to easily and securely share AWS resources with any AWS account within their AWS Organization.  When an account is created they allocate appropriate subnets to that account.
  • They use AWS Transit Gateway to manage inter-VPC routing and communication across multiple VPCs in a region.  They didn’t want to limit our ability to scale up quickly.  AWS Transit Gateway is a single place to land their AWS Direct Connect circuits in each Region.  It provides them with a consolidated place to manage routing tables that propagated to each VPC when they are attached.


VPC Sharing for microAccounts

Figure 3 – VPC Sharing for microAccounts

Automation & Config Management for microAccounts

To create frictionless self-service cloud infrastructure early on, FactSet realized that automation is a must.  Their infrastructure automation uses source-control as a source of truth for defining each microAccount. This helps them ensure repeatable and standardized account provisioning process, as well as flexibility to adjust specific settings and permissions on per account needs.

Account provisioning flow

Figure 4 – Account provisioning flow

By default, their accounts are only enabled in a small set of Regions.  They control it via the following policy block.  If they add new Region(s), they would implement that change in source-control and automated enforcement checks would add it to SCP.

    "Sid": "DenyOtherRegions",
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": {
        "StringNotEquals": {
            "aws:RequestedRegion": ["us-east-1","eu-west-2"]
        "ForAllValues:StringNotLike": {
            "aws:PrincipalArn": [

Lessons Learned

During their journey to adopt microAccounts, FactSet came across some new challenges that are worth highlighting:

  1. IAM role creation: Their DevOps Role can create new IAM roles within the account.  To ensure that newly created role complies with least-privilege principles, they attach a standard permission boundary which limits its permissions to not extend beyond DevOps level.
  2. Account Deletion: While AWS provides APIs for account creation, currently there is no API to delete or rename an account.  This is not an issue since only a small percentage of accounts had to be deleted because of a cancelled project for example.
  3. Account Creation / Service Activation: Although automation is used to provision accounts it can still take time for all services in account to be fully activated.  Some services like Amazon EC2 have asynchronous processes to be activated in a new account.
  4. Account Email, Root Password, and MFA: Upon account creation, they don’t set up a root password or MFA.  That is only setup on the primary (master) account.  Given each account requires a unique email address, they leverage Amazon Simple Email Service (Amazon SES) to create a new email address with cloud administrator team as the recipients.  When they need to log in as root (very unusual), they go through the process of password reset before logging in.
  5. Service Control Policies: There were two primary challenges related to SCPs:
    • SCP is a property in the primary (master) account that is attached to a child microAccount.  However, they also wanted to manage SCP like any other account config and store it in source-control along with other account configuration.  This required IAM role used by our automation to have special permissions to be able to create/attach/detach SCPs in the primary (master) account.
    • There is a hard limit of 1000 SCPs in the primary (master) account.  If you have a SCP per account, this would limit you to 1000 microAccounts.  They solved this by re-using SCPs across accounts with same policies.  Content of a policy is hashed to create a unique SCP identifier, and accounts with same hashes are attached to same SCP.
  6. Sharing data (typically S3) across microAccounts: they leverage a concept of “trusted-accounts” to allow other accounts access to an account’s resources including S3 and KMS keys.
  7. It may feel like an anti-pattern to have resources with static costs like Application Load Balancers (ALB) and KMS for individual projects as opposed to a shared pool.  The list of resources with a base cost is small as most of the services are largely priced based on usage.  For FactSet, resource isolation is a key benefit of microAccounts, and therefore outweighs some of these added costs.
  8. Central Inventory & Logging: With 100s of accounts, it is worth investing in a more centralized inventory and AWS CloudTrail logs collection system.
  9. Costs, Reserved Instances (RI), and Savings Plans: FactSet found AWS Cost Explorer at the level of your primary (master) account to be a great tool for cost-transparency.  They leverage AWS Cost Explorer’s API to import that data into their internal cost transparency tools.  RIs and Savings Plans are managed centrally and leverage automatic sharing between accounts within the same master (primary) organization.


The microAccounts approach provides FactSet with the agility to operate according to specific needs of different teams and projects in the enterprise. They are currently deploying in twelve AWS Regions with automated AWS account provisioning happening in minutes and drift checks executing multiple times throughout the day. This frees up their developers to focus on solving business problems to maximize the benefits of cloud computing, so that their business can innovate and accelerate their clients’ digital transformations.

Their experience operating regulated infrastructure in the cloud demonstrated that microAccounts are pivotal for managing cloud at scale. With microAccounts they were able to accelerate projects onboarded to cloud by 5X, reduce number of IAM permission tickets by 10X, and experienced 3X fewer stability issues. We hope that this blog post provided useful insights to help determine if the microAccount strategy is a good fit for you.

In their own words, FactSet creates flexible, open data and software solutions for tens of thousands of investment professionals around the world, which provides instant access to financial data and analytics that investors use to make crucial decisions. At FactSet, we are always working to improve the value that our products provide.

Recommended Reading:

Defining an AWS Multi-Account Strategy for telecommunications companies

Why should I set up a multi-account AWS environment?

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Field Notes: Improving Call Center Experiences with Iterative Bot Training Using Amazon Connect and Amazon Lex

Post Syndicated from Marius Cealera original https://aws.amazon.com/blogs/architecture/field-notes-improving-call-center-experiences-with-iterative-bot-training-using-amazon-connect-and-amazon-lex/

This post was co-written by Abdullah Sahin, senior technology architect at Accenture, and Muhammad Qasim, software engineer at Accenture. 

Organizations deploying call-center chat bots are interested in evolving their solutions continuously, in response to changing customer demands. When developing a smart chat bot, some requests can be predicted (for example following a new product launch or a new marketing campaign). There are however instances where this is not possible (following market shifts, natural disasters, etc.)

While voice and chat bots are becoming more and more ubiquitous, keeping the bots up-to-date with the ever-changing demands remains a challenge.  It is clear that a build>deploy>forget approach quickly leads to outdated AI that lacks the ability to adapt to dynamic customer requirements.

Call-center solutions which create ongoing feedback mechanisms between incoming calls or chat messages and the chatbot’s AI, allow for a programmatic approach to predicting and catering to a customer’s intent.

This is achieved by doing the following:

  • applying natural language processing (NLP) on conversation data
  • extracting relevant missed intents,
  • automating the bot update process
  • inserting human feedback at key stages in the process.

This post provides a technical overview of one of Accenture’s Advanced Customer Engagement (ACE+) solutions, explaining how it integrates multiple AWS services to continuously and quickly improve chatbots and stay ahead of customer demands.

Call center solution architects and administrators can use this architecture as a starting point for an iterative bot improvement solution. The goal is to lead to an increase in call deflection and drive better customer experiences.

Overview of Solution

The goal of the solution is to extract missed intents and utterances from a conversation and present them to the call center agent at the end of a conversation, as part of the after-work flow. A simple UI interface was designed for the agent to select the most relevant missed phrases and forward them to an Analytics/Operations Team for final approval.

Figure 1 – Architecture Diagram

Amazon Connect serves as the contact center platform and handles incoming calls, manages the IVR flows and the escalations to the human agent. Amazon Connect is also used to gather call metadata, call analytics and handle call center user management. It is the platform from which other AWS services are called: Amazon Lex, Amazon DynamoDB and AWS Lambda.

Lex is the AI service used to build the bot. Lambda serves as the main integration tool and is used to push bot transcripts to DynamoDB, deploy updates to Lex and to populate the agent dashboard which is used to flag relevant intents missed by the bot. A generic CRM app is used to integrate the agent client and provide a single, integrated, dashboard. For example, this addition to the agent’s UI, used to review intents, could be implemented as a custom page in Salesforce (Figure 2).

Figure 2 – Agent feedback dashboard in Salesforce. The section allows the agent to select parts of the conversation that should be captured as intents by the bot.

A separate, stand-alone, dashboard is used by an Analytics and Operations Team to approve the new intents, which triggers the bot update process.


The typical use case for this solution (Figure 4) shows how missing intents in the bot configuration are captured from customer conversations. These intents are then validated and used to automatically build and deploy an updated version of a chatbot. During the process, the following steps are performed:

  1. Customer intents that were missed by the chatbot are automatically highlighted in the conversation
  2. The agent performs a review of the transcript and selects the missed intents that are relevant.
  3. The selected intents are sent to an Analytics/Ops Team for final approval.
  4. The operations team validates the new intents and starts the chatbot rebuild process.

Figure 3 – Use case: the bot is unable to resolve the first call (bottom flow). Post-call analysis results in a new version of the bot being built and deployed. The new bot is able to handle the issue in subsequent calls (top flow)

During the first call (bottom flow) the bot fails to fulfil the request and the customer is escalated to a Live Agent. The agent resolves the query and, post call, analyzes the transcript between the chatbot and the customer, identifies conversation parts that the chatbot should have understood and sends a ‘missed intent/utterance’ report to the Analytics/Ops Team. The team approves and triggers the process that updates the bot.

For the second call, the customer asks the same question. This time, the (trained) bot is able to answer the query and end the conversation.

Ideally, the post-call analysis should be performed, at least in part, by the agent handling the call. Involving the agent in the process is critical for delivering quality results. Any given conversation can have multiple missed intents, some of them irrelevant when looking to generalize a customer’s question.

A call center agent is in the best position to judge what is or is not useful and mark the missed intents to be used for bot training. This is the important logical triage step. Of course, this will result in the occasional increase in the average handling time (AHT). This should be seen as a time investment with the potential to reduce future call times and increase deflection rates.

One alternative to this setup would be to have a dedicated analytics team review the conversations, offloading this work from the agent. This approach avoids the increase in AHT, but also introduces delay and, possibly, inaccuracies in the feedback loop.

The approval from the Analytics/Ops Team is a sign off on the agent’s work and trigger for the bot building process.


The following section focuses on the sequence required to programmatically update intents in existing Lex bots. It assumes a Connect instance is configured and a Lex bot is already integrated with it. Navigate to this page for more information on adding Lex to your Connect flows.

It also does not cover the CRM application, where the conversation transcript is displayed and presented to the agent for intent selection.  The implementation details can vary significantly depending on the CRM solution used. Conceptually, most solutions will follow the architecture presented in Figure1: store the conversation data in a database (DynamoDB here) and expose it through an (API Gateway here) to be consumed by the CRM application.

Lex bot update sequence

The core logic for updating the bot is contained in a Lambda function that triggers the Lex update. This adds new utterances to an existing bot, builds it and then publishes a new version of the bot. The Lambda function is associated with an API Gateway endpoint which is called with the following body:

	“intent”: “INTENT_NAME”,
	“utterances”: [“UTTERANCE_TO_ADD_1”, “UTTERANCE_TO_ADD_2” …]

Steps to follow:

  1. The intent information is fetched from Lex using the getIntent API
  2. The existing utterances are combined with the new utterances and deduplicated.
  3. The intent information is updated with the new utterances
  4. The updated intent information is passed to the putIntent API to update the Lex Intent
  5. The bot information is fetched from Lex using the getBot API
  6. The intent version present within the bot information is updated with the new intent

Figure 4 – Representation of Lex Update Sequence


7. The update bot information is passed to the putBot API to update Lex and the processBehavior is set to “BUILD” to trigger a build. The following code snippet shows how this would be done in JavaScript:

const updateBot = await lexModel
        processBehavior: "BUILD"

9. The last step is to publish the bot, for this we fetch the bot alias information and then call the putBotAlias API.

const oldBotAlias = await lexModel
        name: config.botAlias,
        botName: updatedBot.name

return lexModel
        name: config.botAlias,
        botName: updatedBot.name,
        botVersion: updatedBot.version,
        checksum: oldBotAlias.checksum,


In this post, we showed how a programmatic bot improvement process can be implemented around Amazon Lex and Amazon Connect.  Continuously improving call center bots is a fundamental requirement for increased customer satisfaction. The feedback loop, agent validation and automated bot deployment pipeline should be considered integral parts to any a chatbot implementation.

Finally, the concept of a feedback-loop is not specific to call-center chatbots. The idea of adding an iterative improvement process in the bot lifecycle can also be applied in other areas where chatbots are used.

Accelerating Innovation with the Accenture AWS Business Group (AABG)

By working with the Accenture AWS Business Group (AABG), you can learn from the resources, technical expertise, and industry knowledge of two leading innovators, helping you accelerate the pace of innovation to deliver disruptive products and services. The AABG helps customers ideate and innovate cloud solutions with customers through rapid prototype development.

Connect with our team at [email protected] to learn and accelerate how to use machine learning in your products and services.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Abdullah Sahin

Abdullah Sahin

Abdullah Sahin is a senior technology architect at Accenture. He is leading a rapid prototyping team bringing the power of innovation on AWS to Accenture customers. He is a fan of CI/CD, containerization technologies and IoT.

Muhammad Qasim

Muhammad Qasin

Muhammad Qasim is a software engineer at Accenture and excels in development of voice bots using services such as Amazon Connect. In his spare time, he plays badminton and loves to go for a run.

Field Notes: Applying Machine Learning to Vegetation Management using Amazon SageMaker

Post Syndicated from Sameer Goel original https://aws.amazon.com/blogs/architecture/field-notes-applying-machine-learning-to-vegetation-management-using-amazon-sagemaker/

This post was co-written by Soheil Moosavi, a data scientist consultant in Accenture Applied Intelligence (AAI) team, and Louis Lim, a manager in Accenture AWS Business Group. 

Virtually every electric customer in the US and Canada has, at one time or another, experienced a sustained electric outage as a direct result of a tree and power line contact. According to the report from Federal Energy Regulatory Commission (FERC.gov), Electric utility companies actively work to mitigate these threats.

Vegetation Management (VM) programs represent one of the largest recurring maintenance expenses for electric utility companies in North America. Utilities and regulators generally agree that keeping trees and vegetation from conflicting with overhead conductors. It is a critical and expensive responsibility of all utility companies concerned about electric service reliability.

Vegetation management such as tree trimming and removal is essential for electricity providers to reduce unwanted outages and be rated with a low System Average Interruption Duration Index (SAIDI) score. Electricity providers are increasingly interested in identifying innovative practices and technologies to mitigate outages, streamline vegetation management activities, and maintain acceptable SAIDI scores. With the recent democratization of machine learning leveraging the power of cloud, utility companies are identifying unique ways to solve complex business problems on top of AWS. The Accenture AWS Business Group, a strategic collaboration by Accenture and AWS, helps customers accelerate their pace of innovation to deliver disruptive products and services. Learning how to machine learn helps enterprises innovate and disrupt unlocking business value.

In this blog post, you learn how Accenture and AWS collaborated to develop a machine learning solution for an electricity provider using Amazon SageMaker.  The goal was to improve vegetation management and optimize program cost.

Overview of solution 

VM is generally performed on a cyclical basis, prioritizing circuits solely based on the number of outages in the previous years. A more sophisticated approach is to use Light Detection and Ranging (LIDAR) and imagery from aircraft and low earth orbit (LEO) satellites with Machine Learning  models, to determine where VM is needed. This provides the information for precise VM plans, but is more expensive due to cost to acquire the LIDAR and imagery data.

In this blog, we show how a machine learning (ML) solution can prioritize circuits based on the impacts of tree-related outages on the coming year’s SAIDI without using imagery data.

We demonstrate how to implement a solution that cross-references, cleans, and transforms time series data from multiple resources. This then creates features and models that predict the number of outages in the coming year, and sorts and prioritizes circuits based on their impact on the coming year’s SAIDI. We show how you use an interactive dashboard designed to browse circuits and the impact of performing VM on SAIDI reduction based on your budget.


  • Source data is first transferred into an Amazon Simple Storage Service (Amazon S3) bucket from the client’s data center.
  • Next, AWS Glue Crawlers are used to crawl the data from the  source bucket. Glue Jobs were used to cross-reference data files to create features for modeling and data for the dashboards.
  • We used Jupyter Notebooks on Amazon SageMaker to train and evaluate models. The best performing model was saved as a pickle file on Amazon S3 and Glue was used to add the predicted number of outages for each circuit to the data prepared for the dashboards.
  • Lastly, Operations users were granted access to Amazon QuickSight dashboards, sourced data from Athena, to browse the data and graphs, while VM users were additionally granted access to directly edit the data prepared for dashboards, such as the latest VM cost for each circuit.
  • We used Amazon QuickSight to create interactive dashboards for the VM team members to visualize analytics and predictions. These predictions are a list of circuits prioritized based on their impact on SAIDI in the coming year. The solution allows our team to analyze the data and experiment with different models in a rapid cycle.


We were provided with 6 years worth of data across 127 circuits. Data included VM (VM work start and end date, number of trees trimmed and removed, costs), asset (pole count, height, and materials, wire count, length, and materials, and meter count and voltage), terrain (elevation, landcover, flooding frequency, wind erodibility, soil erodibility, slope, soil water absorption, and soil loss tolerance from GIS ESRI layers), and outages (outage coordinated, dates, duration, total customer minutes, total customers affected). In addition, we collected weather data from NOAA and DarkSky datasets, including wind, gust, precipitation, temperature.

Starting with 762 records (6 years * 127 circuits) and 226 features, we performed a series of data cleaning and feature engineering tasks including:

  • Dropped sparse, non-variant, and non-relevant features
  • Capped selected outliers based on features’ distributions and percentiles
  • Normalized imbalanced features
  • Imputed missing values
    • Used “0” where missing value meant zero (for example, number of trees removed)
    • Used 3650 (equivalent to 10 years) where missing values are days for VM work (for example, days since previous tree trimming job)
    • Used average of values for each circuit when applicable, and average of values across all circuits for circuits with no existing values (for example, pole mean height)
  • Merged conceptually relevant features
  • Created new features such as ratios (for example, tree trim cost per trim) and combinations(for example, % of land cover for low and medium intensity areas combined)

After further dropping highly correlated features to remove multi-collinearity for our models, we were left with 72 features for model development. The following diagram shows a high-level overview data partitioning and number of outages prediction.

Our best performing model out of Gradient Boosting Trees, Random Forest, and Feed Forward Neural Networks was Elastic Net, with Mean Absolute Error of 6.02 when using a combination of only 10 features. Elastic Net is appropriate for smaller sample for this dataset, good at feature selection, likely to generalize on a new dataset, and consistently showed a lower error rate. Exponential expansion of features showed small improvements in predictions, but we kept the non-expanded version due to better interpretability.

When analyzing the model performance, predictions were more accurate for circuits with lower outage count, and models suffered from under-predicting when the number of outages was high. This is due to having few circuits with a high number of outages for the model to learn from.

The following chart below shows the importance of each feature used in the model. An error of 6.02 means on average we over or under predict six outages for each circuit.


We designed two types of interactive dashboards for the VM team to browse results and predictions. The first set of dashboards show historical or predicted outage counts for each circuit on a geospatial map. Users can further filter circuits based on criteria such as the number of days since VM, as shown in the following screenshot.

The second type of dashboard shows predicted post-VM SAIDI on the y-axis and VM cost on the x-axis. This dashboard is used by the client to determine the reduction in SAIDI based on available VM budget for the year and dispatch the VM crew accordingly. Clients can also upload a list of update VM cost for each circuit, and the graph will automatically readjust.


This solution for Vegetation management demonstrates how we used Amazon SageMaker to train and evaluate machine learning models. Using this solution an Electric Utility can save time and cost, and scale easily to include more circuits within a set VM budget. We demonstrated how a utility can leverage machine learning to predict unwanted outages and also maintain vegetation, without incurring the cost of high-resolution imagery.

Further, to improve these predictions we recommend:

  1. A yearly collection of asset and terrain data (if data is only available for the most recent year, it is impossible for models to learn from each years’ changes),
  2. Collection of VM data per month per location (if current data is collected only at the end of each VM cycle and only per circuit, monthly, and subcircuit modeling is impossible), and
  3. Purchasing LiDAR imagery or tree inventory data to include features such as tree density, height, distance to wires, and more.

Accelerating Innovation with the Accenture AWS Business Group (AABG)

By working with the Accenture AWS Business Group (AABG), you can learn from the resources, technical expertise, and industry knowledge of two leading innovators, helping you accelerate the pace of innovation to deliver disruptive products and services. The AABG helps customers ideate and innovate cloud solutions with customers through rapid prototype development.

Connect with our team at [email protected] to learn and accelerate how to use machine learning in your products and services.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Soheil Moosavi

Soheil Moosavi is a data scientist consultant and part of Accenture Applied Intelligence (AAI) team. He comes with vast experience in Machine Learning and architecting analytical solutions to solve and improve business problems.


Soheil Moosavi

Louis Lim is a manager in Accenture AWS Business Group, his team focus on helping enterprises to explore the art of possible through rapid prototyping and cloud-native solution.


Field Notes: Comparing Algorithm Performance Using MLOps and the AWS Cloud Development Kit

Post Syndicated from Moataz Gaber original https://aws.amazon.com/blogs/architecture/field-notes-comparing-algorithm-performance-using-mlops-and-the-aws-cloud-development-kit/

Comparing machine learning algorithm performance is fundamental for machine learning practitioners, and data scientists. The goal is to evaluate the appropriate algorithm to implement for a known business problem.

Machine learning performance is often correlated to the usefulness of the model deployed. Improving the performance of the model typically results in an increased accuracy of the prediction. Model accuracy is a key performance indicator (KPI) for businesses when evaluating production readiness and identifying the appropriate algorithm to select earlier in model development. Organizations benefit from reduced project expenses, accelerated project timelines and improved customer experience. Nevertheless, some organizations have not introduced a model comparison process into their workflow which negatively impacts cost and productivity.

In this blog post, I describe how you can compare machine learning algorithms using Machine Learning Operations (MLOps). You will learn how to create an MLOps pipeline for comparing machine learning algorithms performance using AWS Step Functions, AWS Cloud Development Kit (CDK) and Amazon SageMaker.

First, I explain the use case that will be addressed through this post. Then, I explain the design considerations for the solution. Finally, I provide access to a GitHub repository which includes all the necessary steps for you to replicate the solution I have described, in your own AWS account.

Understanding the Use Case

Machine learning has many potential uses and quite often the same use case is being addressed by different machine learning algorithms. Let’s take Amazon Sagemaker built-in algorithms. As an example, if you are having a “Regression” use case, it can be addressed using (Linear Learner, XGBoost and KNN) algorithms. Another example for a “Classification” use case you can use algorithm such as (XGBoost, KNN, Factorization Machines and Linear Learner). Similarly for “Anomaly Detection” there are (Random Cut Forests and IP Insights).

In this post, it is a “Regression” use case to identify the age of the abalone which can be calculated based on the number of rings on its shell (age equals to number of rings plus 1.5). Usually the number of rings are counted through microscopes examinations.

I use the abalone dataset in libsvm format which contains 9 fields [‘Rings’, ‘Sex’, ‘Length’,’ Diameter’, ‘Height’,’ Whole Weight’,’ Shucked Weight’,’ Viscera Weight’ and ‘Shell Weight’] respectively.

The features starting from Sex to Shell Weight are physical measurements that can be measured using the correct tools. Therefore, using the machine learning algorithms (Linear Learner and XGBoost) to address this use case, the complexity of having to examine the abalone under microscopes to understand its age can be improved.

Benefits of the AWS Cloud Development Kit (AWS CDK)

The AWS Cloud Development Kit (AWS CDK) is an open source software development framework to define your cloud application resources.

The AWS CDK uses the jsii which is an interface developed by AWS that allows code in any language to naturally interact with JavaScript classes. It is the technology that enables the AWS Cloud Development Kit to deliver polyglot libraries from a single codebase.

This means that you can use the CDK and define your cloud application resources in typescript language for example. Then by compiling your source module using jsii, you can package it as modules in one of the supported target languages (e.g: Javascript, python, Java and .Net). So if your developers or customers prefer any of those languages, you can easily package and export the code to their preferred choice.

Also, the cdk tf provides constructs for defining Terraform HCL state files and the cdk8s enables you to use constructs for defining kubernetes configuration in TypeScript, Python, and Java. So by using the CDK you have a faster development process and easier cloud onboarding. It makes your cloud resources more flexible for sharing.


Overview of solution

This architecture serves as an example of how you can build a MLOps pipeline that orchestrates the comparison of results between the predictions of two algorithms.

The solution uses a completely serverless environment so you don’t have to worry about managing the infrastructure. It also deletes resources not needed after collecting the predictions results, so as not to incur any additional costs.

Figure 1: Solution Architecture


In the preceding diagram, the serverless MLOps pipeline is deployed using AWS Step Functions workflow. The architecture contains the following steps:

  1. The dataset is uploaded to the Amazon S3 cloud storage under the /Inputs directory (prefix).
  2. The uploaded file triggers AWS Lambda using an Amazon S3 notification event.
  3. The Lambda function then will initiate the MLOps pipeline built using a Step Functions state machine.
  4. The starting lambda will start by collecting the region corresponding training images URIs for both Linear Learner and XGBoost algorithms. These are used in training both algorithms over the dataset. It will also get the Amazon SageMaker Spark Container Image which is used for running the SageMaker processing Job.
  5. The dataset is in libsvm format which is accepted by the XGBoost algorithm as per the Input/Output Interface for the XGBoost Algorithm. However, this is not supported by the Linear Learner Algorithm as per Input/Output interface for the linear learner algorithm. So we need to run a processing job using Amazon SageMaker Data Processing with Apache Spark. The processing job will transform the data from libsvm to csv and will divide the dataset into train, validation and test datasets. The output of the processing job will be stored under /Xgboost and /Linear directories (prefixes).

Figure 2: Train, validation and test samples extracted from dataset

6. Then the workflow of Step Functions will perform the following steps in parallel:

    • Train both algorithms.
    • Create models out of trained algorithms.
    • Create endpoints configurations and deploy predictions endpoints for both models.
    • Invoke lambda function to describe the status of the deployed endpoints and wait until the endpoints become in “InService”.
    • Invoke lambda function to perform 3 live predictions using boto3 and the “test” samples taken from the dataset to calculate the average accuracy of each model.
    • Invoke lambda function to delete deployed endpoints not to incur any additional charges.

7. Finally, a Lambda function will be invoked to determine which model has better accuracy in predicting the values.

The following shows a diagram of the workflow of the Step Functions:

Figure 3: AWS Step Functions workflow graph

The code to provision this solution along with step by step instructions can be found at this GitHub repo.

Results and Next Steps

After waiting for the complete execution of step functions workflow, the results are depicted in the following diagram:

Figure 4: Comparison results

This doesn’t necessarily mean that the XGBoost algorithm will always be the better performing algorithm. It just means that the performance was the result of these factors:

  • the hyperparameters configured for each algorithm
  • the number of epochs performed
  • the amount of dataset samples used for training

To make sure that you are getting better results from the models, you can run hyperparameters tuning jobs which will run many training jobs on your dataset using the algorithms and ranges of hyperparameters that you specify. This helps you allocate which set of hyperparameters which are giving better results.

Finally, you can use this comparison to determine which algorithm is best suited for your production environment. Then you can configure your step functions workflow to update the configuration of the production endpoint with the better performing algorithm.

Figure 5: Update production endpoint workflow


This post showed you how to create a repeatable, automated pipeline to deliver the better performing algorithm to your production predictions endpoint. This helps increase the productivity and reduce the time of manual comparison.  You also learned to provision the solution using AWS CDK and to perform regular cleaning of deployed resources to drive down business costs. If this post helps you or inspires you to solve a problem, share your thoughts and questions in the comments. You can use and extend the code on the GitHub repo.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers

Field Notes: Ingest and Visualize Your Flat-file IoT Data with AWS IoT Services

Post Syndicated from Paul Ramsey original https://aws.amazon.com/blogs/architecture/field-notes-ingest-and-visualize-your-flat-file-iot-data-with-aws-iot-services/

Customers who maintain manufacturing facilities often find it challenging to ingest, centralize, and visualize IoT data that is emitted in flat-file format from their factory equipment. While modern IoT-enabled industrial devices can communicate over standard protocols like MQTT, there are still some legacy devices that generate useful data but are only capable of writing it locally to a flat file. This results in siloed data that is either analyzed in a vacuum without the broader context, or it is not available to business users to be analyzed at all.

AWS provides a suite of IoT and Edge services that can be used to solve this problem. In this blog, I walk you through one method of leveraging these services to ingest hard-to-reach data into the AWS cloud and extract business value from it.

Overview of solution

This solution provides a working example of an edge device running AWS IoT Greengrass with an AWS Lambda function that watches a Samba file share for new .csv files (presumably containing device or assembly line data). When it finds a new file, it will transform it to JSON format and write it to AWS IoT Core. The data is then sent to AWS IoT Analytics for processing and storage, and Amazon QuickSight is used to visualize and gain insights from the data.

Samba file share solution diagram

Since we don’t have an actual on-premises environment to use for this walkthrough, we’ll simulate pieces of it:

  • In place of the legacy factory equipment, an EC2 instance running Windows Server 2019 will generate data in .csv format and write it to the Samba file share.
    • We’re using a Windows Server for this function to demonstrate that the solution is platform-agnostic. As long as the flat file is written to a file share, AWS IoT Greengrass can ingest it.
  • An EC2 instance running Amazon Linux will act as the edge device and will host AWS IoT Greengrass Core and the Samba share.
    • In the real world, these could be two separate devices, and the device running AWS IoT Greengrass could be as small as a Raspberry Pi.


For this walkthrough, you should have the following prerequisites:

  • An AWS Account
  • Access to provision and delete AWS resources
  • Basic knowledge of Windows and Linux server administration
  • If you’re unfamiliar with AWS IoT Greengrass concepts like Subscriptions and Cores, review the AWS IoT Greengrass documentation for a detailed description.


First, we’ll show you the steps to launch the AWS IoT Greengrass resources using AWS CloudFormation. The AWS CloudFormation template is derived from the template provided in this blog post. Review the post for a detailed description of the template and its various options.

  1. Create a key pair. This will be used to access the EC2 instances created by the CloudFormation template in the next step.
  2. Launch a new AWS CloudFormation stack in the N. Virginia (us-east-1) Region using iot-cfn.yml, which represents the simulated environment described in the preceding bullet.
    •  Parameters:
      • Name the stack IoTGreengrass.
      • For EC2KeyPairName, select the EC2 key pair you just created from the drop-down menu.
      • For SecurityAccessCIDR, use your public IP with a /32 CIDR (i.e.
      • You can also accept the default of if you can have SSH and RDP open to all sources on the EC2 instances in this demo environment.
      • Accept the defaults for the remaining parameters.
  •  View the Resources tab after stack creation completes. The stack creates the following resources:
    • A VPC with two subnets, two route tables with routes, an internet gateway, and a security group.
    • Two EC2 instances, one running Amazon Linux and the other running Windows Server 2019.
    • An IAM role, policy, and instance profile for the Amazon Linux instance.
    • A Lambda function called GGSampleFunction, which we’ll update with code to parse our flat-files with AWS IoT Greengrass in a later step.
    • An AWS IoT Greengrass Group, Subscription, and Core.
    • Other supporting objects and custom resource types.
  • View the Outputs tab and copy the IPs somewhere easy to retrieve. You’ll need them for multiple provisioning steps below.

3. Review the AWS IoT Greengrass resources created on your behalf by CloudFormation:

    • Search for IoT Greengrass in the Services drop-down menu and select it.
    • Click Manage your Groups.
    • Click file_ingestion.
    • Navigate through the SubscriptionsCores, and other tabs to review the configurations.

Leveraging a device running AWS IoT Greengrass at the edge, we can now interact with flat-file data that was previously difficult to collect, centralize, aggregate, and analyze.

Set up the Samba file share

Now, we set up the Samba file share where we will write our flat-file data. In our demo environment, we’re creating the file share on the same server that runs the Greengrass software. In the real world, this file share could be hosted elsewhere as long as the device that runs Greengrass can access it via the network.

  • Follow the instructions in setup_file_share.md to set up the Samba file share on the AWS IoT Greengrass EC2 instance.
  • Keep your terminal window open. You’ll need it again for a later step.

Configure Lambda Function for AWS IoT Greengrass

AWS IoT Greengrass provides a Lambda runtime environment for user-defined code that you author in AWS Lambda. Lambda functions that are deployed to an AWS IoT Greengrass Core run in the Core’s local Lambda runtime. In this example, we update the Lambda function created by CloudFormation with code that watches for new files on the Samba share, parses them, and writes the data to an MQTT topic.

  1. Update the Lambda function:
    • Search for Lambda in the Services drop-down menu and select it.
    • Select the file_ingestion_lambda function.
    • From the Function code pane, click Actions then Upload a .zip file.
    • Upload the provided zip file containing the Lambda code.
    • Select Actions > Publish new version > Publish.

2. Update the Lambda Alias to point to the new version.

    • Select the Version: X drop-down (“X” being the latest version number).
    • Choose the Aliases tab and select gg_file_ingestion.
    • Scroll down to Alias configuration and select Edit.
    • Choose the newest version number and click Save.
    • Do NOT use $LATEST as it is not supported by AWS IoT Greengrass.

3. Associate the Lambda function with AWS IoT Greengrass.

    • Search for IoT Greengrass in the Services drop-down menu and select it.
    • Select Groups and choose file_ingestion.
    • Select Lambdas > Add Lambda.
    • Click Use existing Lambda.
    • Select file_ingestion_lambda > Next.
    • Select Alias: gg_file_ingestion > Finish.
    • You should now see your Lambda associated with the AWS IoT Greengrass group.
    • Still on the Lambda function tab, click the ellipsis and choose Edit configuration.
    • Change the following Lambda settings then click Update:
      • Set Containerization to No container (always).
      • Set Timeout to 25 seconds (or longer if you have large files to process).
      • Set Lambda lifecycle to Make this function long-lived and keep it running indefinitely.

Deploy AWS IoT Greengrass Group

  1. Restart the AWS IoT Greengrass daemon:
    • A daemon restart is required after changing containerization settings. Run the following commands on the Greengrass instance to restart the AWS IoT Greengrass daemon:
 cd /greengrass/ggc/core/

 sudo ./greengrassd stop

 sudo ./greengrassd start

2. Deploy the AWS IoT Greengrass Group to the Core device.

    • Return to the file_ingestion AWS IoT Greengrass Group in the console.
    • Select Actions Deploy.
    • Select Automatic detection.
    • After a few minutes, you should see a Status of Successfully completed. If the deployment fails, check the logs, fix the issues, and deploy again.

Generate test data

You can now generate test data that is ingested by AWS IoT Greengrass, written to AWS IoT Core, and then sent to AWS IoT Analytics and visualized by Amazon QuickSight.

  1. Follow the instructions in generate_test_data.md to generate the test data.
  2. Verify that the data is being written to AWS IoT Core following these instructions (Use iot/data for the MQTT Subscription Topic instead of hello/world).


Setup AWS IoT Analytics

Now that our data is in IoT Cloud, it only takes a few clicks to configure AWS IoT Analytics to process, store, and analyze our data.

  1. Search for IoT Analytics in the Services drop-down menu and select it.
  2. Set Resources prefix to file_ingestion and Topic to iot/data. Click Quick Create.
  3. Populate the data set by selecting Data sets > file_ingestion_dataset >Actions > Run now. If you don’t get data on the first run, you may need to wait a couple of minutes and run it again.

Visualize the Data from AWS IoT Analytics in Amazon QuickSight

We can now use Amazon QuickSight to visualize the IoT data in our AWS IoT Analytics data set.

  1. Search for QuickSight in the Services drop-down menu and select it.
  2. If your account is not signed up for QuickSight yet, follow these instructions to sign up (use Standard Edition for this demo)
  3. Build a new report:
    • Click New analysis > New dataset.
    • Select AWS IoT Analytics.
    • Set Data source name to iot-file-ingestion and select file_ingestion_dataset. Click Create data source.
    • Click Visualize. Wait a moment while your rows are imported into SPICE.
    • You can now drag and drop data fields onto field wells. Review the QuickSight documentation for detailed instructions on creating visuals.
    • Following is an example of a QuickSight dashboard you can build using the demo data we generated in this walkthrough.

Cleaning up

Be sure to clean up the objects you created to avoid ongoing charges to your account.

  • In Amazon QuickSight, Cancel your subscription.
  • In AWS IoT Analytics, delete the datastore, channel, pipeline, data set, role, and topic rule you created.
  • In CloudFormation, delete the IoTGreengrass stack.
  • In Amazon CloudWatch, delete the log files associated with this solution.


Gaining valuable insights from device data that was once out of reach is now possible thanks to AWS’s suite of IoT services. In this walkthrough, we collected and transformed flat-file data at the edge and sent it to IoT Cloud using AWS IoT Greengrass. We then used AWS IoT Analytics to process, store, and analyze that data, and we built an intelligent dashboard to visualize and gain insights from the data using Amazon QuickSight. You can use this data to discover operational anomalies, enable better compliance reporting, monitor product quality, and many other use cases.

For more information on AWS IoT services, check out the overviews, use cases, and case studies on our product page. If you’re new to IoT concepts, I’d highly encourage you to take our free Internet of Things Foundation Series training.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Migrating File Servers to Amazon FSx and Integrating with AWS Managed Microsoft AD

Post Syndicated from Kyaw Soe Hlaing original https://aws.amazon.com/blogs/architecture/field-notes-migrating-file-servers-to-amazon-fsx-and-integrating-with-aws-managed-microsoft-ad/

Amazon FSx provides AWS customers with the native compatibility of third-party file systems with feature sets for workloads such as Windows-based storage, high performance computing (HPC), machine learning, and electronic design automation (EDA).  Amazon FSx automates the time-consuming administration tasks such as hardware provisioning, software configuration, patching, and backups. Since Amazon FSx integrates the file systems with cloud-native AWS services, this makes them even more useful for a broader set of workloads.

Amazon FSx for Windows File Server provides fully managed file storage that is accessible over the industry-standard Server Message Block (SMB) protocol. Built on Windows Server, Amazon FSx delivers a wide range of administrative features such as data deduplication, end-user file restore, and Microsoft Active Directory (AD) integration.

In this post, I explain how to migrate files and file shares from on-premises servers to Amazon FSx with AWS DataSync in a domain migration scenario. Customers are migrating their file servers to Amazon FSx as part of their migration from an on-premises Active Directory to AWS managed Active Directory. Their plan is to replace their file servers with Amazon FSx during Active Directory migration to AWS Managed AD.

Arhictecture diagram


Before you begin, perform the steps outlined in this blog to migrate the user accounts and groups to the managed Active Directory.


There are numerous ways to perform the Active Directory migration. Generally, the following five steps are taken:

  1. Establish two-way forest trust between on-premises AD and AWS Managed AD
  2. Migrate user accounts and group with the ADMT tool
  3. Duplicate Access Control List (ACL) permissions in the file server
  4. Migrate files and folders with existing ACL to Amazon FSx using AWS DataSync
  5. Migrate User Computers

In this post, I focus on duplication of ACL permissions and migration of files and folders using Amazon FSx and AWS DataSync. In order to perform duplication of ACL permission in file servers, I use SubInACL tool, which is available from the Microsoft website.

Duplication of the ACL is required because users want to seamlessly access file shares once their computers are migrated to AWS Managed AD. Thus all migrated files and folders have permission with Managed AD users and group objects. For enterprises, the migration of user computers does not happen overnight. Normally, migration takes place in batches or phases. With ACL duplication, both migrated and non-migrated users can access their respective file shares seamlessly during and after migration.

Duplication of Access Control List (ACL)

Before we proceed with ACL duplication, we must ensure that the migration of user accounts and groups was completed. In my demo environment, I have already migrated on-premises users to the Managed Active Directory. In the meantime, we presume that we are migrating identical users to the Managed Active Directory. There might be a scenario where migrated user accounts have different naming such as samAccount name. In this case, we will need to handle this during ACL duplication with SubInACL. For more information about syntax, refer to the SubInACL documentation.

As indicated in following screenshots, I have two users created in the on-premises Active Directory (onprem.local) and those two identical users have been created in the Managed Active Directory too (corp.example.com).

Screenshot of on-premises Active Directory (onprem.local)


Screenshot of Active Directory

In the following screenshot, I have a shared folder called “HR_Documents” in an on-premises file server. Different users have different access rights to that folder. For example, John Smith has “Full Control” but Onprem User1 only have “Read & Execute”. Our plan is to add same access right to identical users from the Managed Active Directory, here corp.example.com, so that once John Smith is migrated to managed AD, he can access to shared folders in Amazon FSx using his Managed Active Directory credential.

Let’s verify the existing permission in the “HR_Documents” folder. Two users from onprem.local are found with different access rights.

Screenshot of HR docs

Screenshot of HR docs

Now it’s time to install SubInACL.

We install it in our on-premises file server. After the SubInACL tool is installed, it can be found under “C:\Program Files (x86)\Windows Resource Kits\Tools” folder by default. To perform an ACL duplication, run command prompt as administrator and run the following command;

Subinacl /outputlog=C:\temp\HR_document_log.txt /errorlog=C:\temp\HR_document_Err_log.txt /Subdirectories C:\HR_Documents\* /migratetodomain=onprem=corp

There are several parameters that I am using in the command:

  • Outputlog = where log file is saved
  • ErrorLog = where error log file is saved
  • Subdirectories = to apply permissions including subfolders and files
  • Migratetodomain= NetBIOS name of source domain and destination domain

Screenshot windows resources kits

screenshot of windows resources kit

If the command is run successfully, you should able to see a summary of the results. If there is no error or failure, you can verify whether ACL permissions are duplicated as expected by looking at the folders and files. In our case, we can see that there is one ACL entry of identical account from corp.example.com is added.

Note: you will always see two ACL entries, one from onprem.local and another one from corp.example.com domain in all the files and folders that you used during migration.  Permissions are now applied to both at the folder and file level.

screenshot of payroll properties

screenshot of doc 1 properties

Migrate files and folders using AWS DataSync

AWS DataSync is an online data transfer service that simplifies, automates, and accelerates moving data between on-premises storage systems and AWS Storage services such as Amazon S3, Amazon Elastic File System (Amazon EFS), or Amazon FSx for Windows File Server. Manual tasks related to data transfers can slow down migrations and burden IT operations. AWS DataSync reduces or automatically handles many of these tasks, including scripting copy jobs, scheduling and monitoring transfers, validating data, and optimizing network utilization.

Create an AWS DataSync agent

An AWS DataSync agent deploys as a virtual machine in an on-premises data center. An AWS DataSync agent can be run on ESXi, KVM, and Microsoft Hyper-V hypervisors. The AWS DataSync agent is used to access on-premises storage systems and transfer data to the AWS DataSync managed service running on AWS. AWS DataSync always performs incremental copies by comparing from a source to a destination and only copying files that are new or have changed.

AWS DataSync supports the following SMB (Server Message Block) locations to migrate data from:

  • Network File System (NFS)
  • Server Message Block (SMB)

In this blog, I use SMB as the source location, since I am migrating from an on-premises Windows File server. AWS DataSync supports SMB 2.1 and SMB 3.0 protocols.

AWS DataSync saves metadata and special files when copying to and from file systems. When files are copied from a SMB file share and Amazon FSx for Windows File Server, AWS DataSync copies the following metadata:

  • File timestamps: access time, modification time, and creation time
  • File owner and file group security identifiers (SIDs)
  • Standard file attributes
  • NTFS discretionary access lists (DACLs): access control entries (ACEs) that determine whether to grant access to an object

Data Synchronization with AWS DataSync

When a task starts, AWS DataSyc goes through different stages. It begins with examining file system follows by data transfer to destination. Once data transfer is completed, it performs verification for consistency between source and destination file systems. You can review detailed information about the data synchronization stages.

DataSync Endpoints

You can activate your agent by using one of the following endpoint types:

  • Public endpoints – If you use public endpoints, all communication from your DataSync agent to AWS occurs over the public internet.
  • Federal Information Processing Standard (FIPS) endpoints – If you need to use FIPS 140-2 validated cryptographic modules when accessing the AWS GovCloud (US-East) or AWS GovCloud (US-West) Region, use this endpoint to activate your agent. You use the AWS CLI or API to access this endpoint.
  • Virtual private cloud (VPC) endpoints – If you use a VPC endpoint, all communication from AWS DataSync to AWS services occurs through the VPC endpoint in your VPC in AWS. This approach provides a private connection between your self-managed data center, your VPC, and AWS services. It increases the security of your data as it is copied over the network.

In my demo environment, I have implemented AWS DataSync as indicated in following diagram. The DataSync Agent can be run either on VMware or Hyper-V and KVM platform in a customer on-premises data center.

Datasync Agent Arhictecture

Once the AWS DataSync Agent setup is completed and the task that defined the source file servers and destination Amazon FSx server is added, you can verify agent status in the AWS Management Console.

Console screenshot

Select Task and then choose Start to start copying files and folders. This will start the replication task (or you can wait until the task runs hourly). You can check the History tab to see a history of the replication task executions.

Console screenshot

Congratulations! You have replicated the contents of an on-premises file server to Amazon FSx. Let’s look and make sure the ACL permissions are still intact in their destination after migration. As shown in the following screenshots, the ACL permissions in the Payroll folder still remains as is, both on-premises users and Managed AD users are inside. Once the user’s computers are migrated to the Managed AD, they can access the same file share in Amazon FSx server using Managed AD credentials.

Payroll properties screenshot

Payroll properties screenshot

Cleaning up

If you are performing testing by following the preceding steps in your own account, delete the following resources, to avoid incurring future charges:

  • EC2 instances
  • Managed AD
  • Amazon FSx file system
  • AWS Datasync


You have learned how to duplicate ACL permissions and shared folder permissions during migration of file servers to Amazon FSx. This process provides a seamless migration experience for users. Once the user’s computers are migrated to the Managed AD, they only need to remap shared folders from Amazon FSx. This can be automated by pushing down shared folders mapping with a Group Policy. If new files or folders are created in the source file server, AWS Datasync will synchronize to Amazon FSx server.

For customers who are planning to do a domain migration from on-premises to AWS Managed Microsoft AD, migration of resources like file servers are common. Handling ACL permissions plays a vital role in providing a seamless migration experience. The duplication of ACL can be an option, otherwise, the ADMT tool can be used to migrate SID information from the source Domain to destination Domain. To migrate SID history, SID filtering needs to be disabled during migration.

If you want to provide feedback about this post, you are welcome to submit in the comments section below.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Setting Up Disaster Recovery in a Different Seismic Zone Using AWS Outposts

Post Syndicated from Vijay Menon original https://aws.amazon.com/blogs/architecture/field-notes-setting-up-disaster-recovery-in-a-different-seismic-zone-using-aws-outposts/

Recovering your mission-critical workloads from outages is essential for business continuity and providing services to customers with little or no interruption. That’s why many customers replicate their mission-critical workloads in multiple places using a Disaster Recovery (DR) strategy suited for their needs.

With AWS, a customer can achieve this by deploying multi Availability Zone High-Availability setup or a multi-region setup by replicating critical components of an application to another region.  Depending on the RPO and RTO of the mission-critical workload, the requirement for disaster recovery ranges from simple backup and restore, to multi-site, active-active, setup. In this blog post, I explain how AWS Outposts can be used for DR on AWS.

In many geographies, it is possible to set up your disaster recovery for a workload running in one AWS Region to another AWS Region in the same country (for example in US between us-east-1 and us-west-2). For countries where there is only one AWS Region, it’s possible to set up disaster recovery in another country where AWS Region is present. This method can be designed for the continuity, resumption and recovery of critical business processes at an agreed level and limits the impact on people, processes and infrastructure (including IT). Other reasons include to minimize the operational, financial, legal, reputational and other material consequences arising from such events.

However, for mission-critical workloads handling critical user data (PII, PHI or financial data), countries like India and Canada have regulations which mandate to have a disaster recovery setup at a “safe distance” within the same country. This ensures compliance with any data sovereignty or data localization requirements mandated by the regulators. “Safe distance” means the distance between the DR site and the primary site is such that the business can continue to operate in the event of any natural disaster or industrial events affecting the primary site. Depending on the geography, this safe distance could be 50KM or more. These regulations limit the options customers have to use another AWS Region in another country as a disaster recovery site of their primary workload running on AWS.

In this blog post, I describe an architecture using AWS Outposts which helps set up disaster recovery on AWS within the same country at a distance that can meet the requirements set by regulators. This architecture also helps customers to comply with various data sovereignty regulations in a given country. Another advantage of this architecture is the homogeneity of the primary and disaster recovery site. Your existing IT teams can set up and operate the disaster recovery site using familiar AWS tools and technology in a homogenous environment.


Readers of this blog post should be familiar with basic networking concepts like WAN connectivity, BGP and the following AWS services:

Architecture Overview

I explain the architecture using an example customer scenario in India, where a customer is using AWS Mumbai Region for their mission-critical workload. This workload needs a DR setup to comply with local regulation and the DR setup needs to be in a different seismic zone than the one for Mumbai. Also, because of the nature of the regulated business, the user/sensitive data needs to be stored within India.

Following is the architecture diagram showing the logical setup.

This solution is similar to a typical AWS Outposts use case where a customer orders the Outposts to be installed in their own Data Centre (DC) or a CoLocation site (Colo). It will follow the shared responsibility model described in AWS Outposts documentation.

The only difference is that the AWS Outpost parent Region will be the closest Region other than AWS Mumbai, in this case Singapore. Customers will then provision an AWS Direct Connect public VIF locally for a Service Link to the Singapore Region. This ensures that the control plane stays available via the AWS Singapore Region even if there is an outage in AWS Mumbai Region affecting control plane availability. You can then launch and manage AWS Outposts supported resources in the AWS Outposts rack.

For data plane traffic, which should not go out of the country, the following options are available:

  • Provision a self-managed Virtual Private Network (VPN) between an EC2 instances running router AMI in a subnet of AWS Outposts and AWS Transit Gateway (TGW) in the primary Region.
  • Provision a self-managed Virtual Private Network (VPN) between an EC2 instances running router AMI in a subnet of AWS Outposts and Virtual Private Gateway (VGW) in the primary Region.

Note: The Primary Region in this example is AWS Mumbai Region. This VPN will be provisioned via Local Gateway and DX public VIF. This ensures that data plane traffic will not traverse any network out of the country (India) to comply with data localization mandated by the regulators.

Architecture Walkthrough

  1. Make sure your data center (DC) or the choice of collocate facility (Colo) meets the requirements for AWS Outposts.
  2. Create an Outpost and order Outpost capacity as described in the documentation. Make sure that you do this step while logged into AWS Outposts console of the AWS Singapore Region.
  3. Provision connectivity between AWS Outposts and network of your DC/Colo as mentioned in AWS Outpost documentation.  This includes setting up VLANs for service links and Local Gateway (LGW).
  4. Provision an AWS Direct Connect connection and public VIF between your DC/Colo and the primary Region via the closest AWS Direct Connect location.
    • For the WAN connectivity between your DC/Colo and AWS Direct Connect location you can choose any telco provider of your choice or work with one of AWS Direct Connect partners.
    • This public VIF will be used to attach AWS Outposts to its parent Region in Singapore over AWS Outposts service link. It will also be used to establish an IPsec GRE tunnel between AWS Outposts subnet and a TGW or VGW for data plane traffic (explained in subsequent steps).
    • Alternatively, you can provision separate Direct Connect connection and public VIFs for Service Link and data plane traffic for better segregation between the two. You will have to provision sufficient bandwidth on Direct Connect connection for the Service Link traffic as well as the Data Plane traffic (like data replication between primary Region and AWS outposts).
    • For an optimal experience and resiliency, AWS recommends that you use dual 1Gbps connections to the AWS Region. This connectivity can also be achieved over Internet transit; however, I recommend using AWS Direct Connect because it provides private connectivity between AWS and your DC/Colo  environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.
  5. Create a subnet in AWS Outposts and launch an EC2 instance running a router AMI of your choice from AWS Marketplace in this subnet. This EC2 instance is used to establish the IPsec GRE tunnel to the TGW or VGW in primary Region.
  6. Add rules in security group of these EC2 instances to allow ISAKMP (UDP 500), NAT Traversal (UDP 4500), and ESP (IP Protocol 50) from VGW or TGW endpoint public IP addresses.
  7. NAT (Network Address Translation) the EIP assigned in step 5 to a public IP address at your edge router connecting to AWS Direct connect or internet transit. This public IP will be used as the customer gateway to establish IPsec GRE tunnel to the primary Region.
  8. Create a customer gateway using the public IP address used to NAT the EC2 instances step 7. Follow the steps in similar process found at Create a Customer Gateway.
  9. Create a VPN attachment for the transit gateway using the customer gateway created in step 8. This VPN must be a dynamic route-based VPN. For steps, review Transit Gateway VPN Attachments. If you are connecting the customer gateway to VPC using VGW in primary Region then follow the steps mentioned at How do I create a secure connection between my office network and Amazon Virtual Private Cloud?.
  10. Configure the customer gateway (EC2 instance running a router AMI in AWS Outposts subnet) side for VPN connectivity. You can base this configuration suggested by AWS during the creation of VPN in step 9. This suggested sample configuration can be downloaded from AWS console post VPN setup as discussed in this document.
  11. Modify the route table of AWS outpost Subnets to point to the EC2 instance launched in step 5 as the target for any destination in your VPCs in the primary Region, which is AWS Mumbai in this example.

At this point, you will have end-to-end connectivity between VPCs in a primary Region and resources in an AWS Outposts. This connectivity can now be used to replicate data from your primary site to AWS Outposts for DR purposes. This  keeps the setup compliant with any internal or external data localization requirements.


In this blog post, I described an architecture using AWS Outposts for Disaster Recovery on AWS in countries without a second AWS Region. To set up disaster recovery, your existing IT teams can set up and operate the disaster recovery site using the familiar AWS tools and technology in a homogeneous environment. To learn more about AWS Outposts, refer to the documentation and FAQ.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Restricting Amazon WorkSpaces Users to Run Amazon Athena Queries

Post Syndicated from Somdeb Bhattacharjee original https://aws.amazon.com/blogs/architecture/field-notes-restricting-amazon-workspaces-users-to-run-amazon-athena-queries/

One of the use cases we hear from customers is that they want to provide very limited access to Amazon Workspaces users (for example contractors, consultants) in an AWS account. At the same time they want to allow them to query Amazon Simple Storage Service (Amazon S3) data in another account using Amazon Athena over a JDBC connection.

For example, marketing companies might provide private access to the first party data to media agencies through this mechanism.

The restrictions they want to put in place are:

  • For security reasons these Amazon WorkSpaces should not have internet connectivity. So the access to Amazon Athena must be over AWS PrivateLink.
  • Public access to Athena is not allowed using the credentials used for the JDBC connection. This is to prevent the users from leveraging the credentials to query the data from anywhere else.

In this post, we show how to use Amazon Virtual Private Cloud (Amazon VPC) endpoints for Athena, along with AWS Identity and Access Management (AWS IAM) policies. This provides private access to query the Amazon S3 data while preventing users from querying the data from outside their Amazon WorkSpaces or using the Athena public endpoint.

Let’s review the steps to achieve this:

  • Initial setup of two AWS accounts (AccountA and AccountB)
  • Set up Amazon S3 bucket with sample data in AccountA
  • Set up an IAM user with Amazon S3 and Athena access in AccountA
  • Create an Amazon VPC endpoint for Athena in AccountA
  • Set up Amazon WorkSpaces for a user in AccountB
  • Install a SQL client tool (we will use DbVisualizer Free) and Athena JDBC driver in Amazon WorkSpaces in AccountB
  • Use DbVisualizer to the query the Amazon S3 data in AccountA using the Athena public endpoint
  • Update IAM policy for user in AccountA to restrict private only access


To follow the steps in this post, you need two AWS Accounts. The Amazon VPC and subnet requirements are specified in the detailed steps.

Note: The AWS CloudFormation template used in this blog post is for US-EAST-1 (N. Virginia) Region so ensure the Region setting for both the accounts are set to US-EAST-1 (N. Virginia).


The two AWS accounts are:

AccountA – Contains the Amazon S3 bucket where the data is stored. For AccountA you can create a new Amazon VPC or use the default Amazon VPC.

AccountB – Amazon WorkSpaces account. Use the following AWS CloudFormation template for AccountB:

  • The AWS CloudFormation template will create a new Amazon VPC in AccountB with CIDR and set up one public subnet and two private subnets.
  • It will also create a NAT Gateway in the public subnet and create both public and private route tables.
  • Since we will be launching Amazon WorkSpaces in these private subnets and not all Availability Zones (AZ) are supported by Amazon WorkSpaces, it is important to choose the right AZ when creating them.

Review the documentation to learn which AWS Regions/AZ are supported.

We have provided two parameters in the AWS CloudFormation template:

  • AZName1
  • AZName2

Step 1

Before launching the CloudFormation stack:

  • Log in to AccountB
  • Search for AWS Resource Access Manager
  • On the right-hand side, you will notice the AZ ID to AZ Name mapping. Note down the AZ Name corresponding to AZ ID use1-az2 and use1-az4
  • Now launch the CloudFormation template and remember to choose the AZ names you noted down earlier
    • https://athena-workspaces-blogpost.s3.amazonaws.com/vpc.yaml
  • Enter the CloudFormation Stack Name as – ‘AthenaWorkspaces’ and leave everything default.
  • Once the CloudFormation stack creation is complete, create a peering connection from AccountB to AccountA.
  • Update the associated route tables for the private subnets with the new peering connection.

For information on how to create a VPC peering connection, refer to AWS documentation on VPC Peering.

AccountB VPC Route Table:

AccountB VPC Route Table:

AccountA VPC Route Table:

AccountA VPC Route Table

Step 2

  • Create a new Amazon S3 bucket in AccountA with a bucket name that starts with ‘athena-’.
  • Next, you can download a sample file and upload it to the Amazon S3 bucket you just created.
  • Use the following statements to create AWS Glue database. Use an external table for the data in the Amazon S3 bucket so that you can query it from Athena.
  • Go to Athena console and define a new database:


Once the database is created, create a new table in sampledb (by selecting sampledb from the “Database” drop down menu). Replace the <<your bucket name>> with the bucket you just created:

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.amazon_reviews_tsv(
  marketplace string, 
  customer_id string, 
  review_id string, 
  product_id string, 
  product_parent string, 
  product_title string,
  product_category string,
  star_rating int, 
  helpful_votes int, 
  total_votes int, 
  vine string, 
  verified_purchase string, 
  review_headline string, 
  review_body string, 
  review_date string)
  's3://<<your bucket name>>/'
TBLPROPERTIES ("skip.header.line.count"="1")


Step 3

  • In AccountA, create a new IAM user with programmatic access.
  • Save the access key and secret access key.
  • For the same user add an Inline Policy which allows the following actions:

IAM summary

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowAthenaReadActions",
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
            "Sid": "AllowAthenaWorkgroupActions",
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
            "Sid": "AllowGlueActionsViaVPCE",
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
            "Sid": "AllowGlueActionsViaAthena",
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
            "Sid": "AllowS3ActionsViaAthena",
            "Effect": "Allow",
            "Action": [
            "Resource": [


Step 4

  • In this step, we create an Interface VPC endpoint (AWS PrivateLink) for Athena in AccountA. When you use an interface VPC endpoint, communication between your Amazon VPC and Athena is conducted entirely within the AWS network.
  • Each VPC endpoint is represented by one or more Elastic Network Interfaces (ENIs) with private IP addresses in your VPC subnets.
  • To create an Interface VPC endpoint follow the instructions and select Athena in the AWS Services list. Do not select the checkbox for Enable Private DNS Name.
  • Ensure the security group that is attached to the Amazon VPC endpoint is open to inbound traffic on port 443 and 444 for source AccountB VPC CIDR Port 444 is used by Athena to stream query results.
  • Once you create the VPC endpoint, you will get a DNS endpoint name which is in the following format. We are going to use this in JDBC connection from the SQL client.


Step 5

  • In this step we set up Amazon WorkSpaces in AccountB.
  • Each Amazon WorkSpace is associated with the specific Amazon VPC and AWS Directory Service construct that you used to create it. All Directory Service constructs (Simple AD, AD Connector, and Microsoft AD) require two subnets to operate, each in different Availability Zones. This is why we created 2 private subnets at the beginning.
  • For this blog post I have used Simple AD as the directory service for the Amazon WorkSpaces.
  • By default, IAM users don’t have permissions for Amazon WorkSpaces resources and operations.
  • To allow IAM users to manage Amazon WorkSpaces resources, you must create an IAM policy that explicitly grants them permissions
  • Then attach the policy to the IAM users or groups that require those permissions.
  • To start, go to the Amazon WorkSpaces console and select Advanced Setup.
    • Set up a new directory using the SimpleAD option.
    • Use the “small” directory size and choose the Amazon VPC and private subnets you created in Step 1 for AccountB.
    • Once you create the directory, register the directory with Amazon WorkSpaces by selecting “Register” from the “Action” menu.
    • Select private subnets you created in Step 1 for AccountB.

Directory info

  • Next, launch Amazon WorkSpaces by following the Launch WorkSpaces button.
  • Select the directory you created and create a new user.
  • For the bundle, choose Standard with Windows 10 (PCoIP).
  • After the Amazon WorkSpaces is created, you can log in to the Amazon WorkSpaces using a client software. You can download it from https://clients.amazonworkspaces.com/
  • Login to your Amazon WorkSpace, install a SQL Client of your choice. At this point your Amazon WorkSpace still has Internet access via the NAT Gateway
  • I have used DbVisualizer (the free version) as the SQL client. Once you have that installed, install the JDBC driver for Athena following the instructions
  • Now you can set up the JDBC connections to Athena using the access key and secret key you set up for an IAM user in AccountA.

Step 6

To test out both the Athena public endpoint and the Athena VPC endpoint, create two connections using the same credentials.

For the Athena public endpoint, you need to use athena.us-east-1.amazonaws.com service endpoint. (jdbc:awsathena://athena.us-east-1.amazonaws.com:443;S3OutputLocation=s3://<athena-bucket-name>/)

Athena public

For the VPC Endpoint Connection, use the VPC Endpoint you created in Step 4 (jdbc:awsathena://vpce-<>.athena.us-east-1.vpce.amazonaws.com:443;S3OutputLocation=s3://<athena-bucket-name>/)

Database connection Athena

Now run a simple query to select records from the amazon_reviews_tsv table using both the connections.

SELECT * FROM sampledb.amazon_reviews_tsv limit 10

You should be able to see results using both the connections. Since the private subnets are still connected to the internet via the NAT Gateway, you can query using the Athena public endpoint.

Run the AWS Command Line Interface (AWS CLI) command using the credentials used for the JDBC connection from your workstation. You should be able to access the Amazon S3 bucket objects and the Athena query run list using the following commands.

aws s3 ls s3://athena-workspaces-blogpost

aws athena list-query-executions

Step 7

  • Now we lock down the access as described in the beginning of this blog post by taking the following actions:
  • Update the route table for the private subnets by removing the route for the internet so access to the Athena public endpoint is restricted from the Amazon WorkSpaces. The only access will be allowed through the Athena VPC Endpoint.
  • Add conditional checks to the IAM user access policy that will restrict access to the Amazon S3 buckets and Athena only if:
    • The request came in through the VPC endpoint. For this we use the “aws:SourceVpce” check and provide the VPC Endpoint ID value.
    • The request for Amazon S3 data is through Athena. For this we use the condition “aws:CalledVia” and provide a value of “athena.amazonaws.com”.
  • In the IAM access policy below replace <<your vpce id>> with your VPC endpoint id and update the previous inline policy which was added to the IAM user in Step 3.
    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowAthenaReadActions",
            "Effect": "Allow",
            "Action": [
            "Resource": "*",
                     "<<your vpce id>>"
            "Sid": "AllowAthenaWorkgroupActions",
            "Effect": "Allow",
            "Action": [
            "Resource": "*",
                     "<<your vpce id>>"
            "Sid": "AllowGlueActionsViaVPCE",
            "Effect": "Allow",
            "Action": [
            "Resource": "*",
                     "<<your vpce id>>"
            "Sid": "AllowGlueActionsViaAthena",
            "Effect": "Allow",
            "Action": [
            "Resource": "*",
            "Sid": "AllowS3ActionsViaAthena",
            "Effect": "Allow",
            "Action": [
            "Resource": [

Once you applied the changes, try to reconnect using both the Athena VPC endpoint as well Athena public endpoint connections. The Athena VPC endpoint connection should work but the public endpoint connection will time out. Also try the same Amazon S3 and Athena AWS CLI commands. You should get access denied for both the operations.

Clean Up

To avoid incurring costs, remember to delete the resources that you created.

For AWS AccountA:

  • Delete the S3 buckets
  • Delete the database you created in AWS Glue
  • Delete the Amazon VPC endpoint you created for Amazon Athena

For AccountB:

  • Delete the Amazon Workspace you created along with the Simple AD directory. You can review more information on how to delete your Workspaces.


In this blog post, I showed how to leverage Amazon VPC endpoints and IAM policies to privately connect to Amazon Athena from Amazon Workspaces that don’t have internet connectivity.

Give this solution a try and share your feedback in the comments!

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Optimize your Java application for Amazon ECS with Quarkus

Post Syndicated from Sascha Moellering original https://aws.amazon.com/blogs/architecture/field-notes-optimize-your-java-application-for-amazon-ecs-with-quarkus/

In this blog post, I show you an interesting approach to implement a Java-based application and compile it to a native image using Quarkus. This native image is the main application, which is containerized, and runs in an Amazon Elastic Container Service and Amazon Elastic Kubernetes Service cluster on AWS Fargate.

Amazon ECS is a fully managed container orchestration service, Amazon EKS is a fully managed Kubernetes service, both services support Fargate to provide serverless compute for containers. Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design. AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you.

Quarkus is a Supersonic Subatomic Java framework that uses OpenJDK HotSpot as well as GraalVM and over fifty different libraries like RESTEasy, Vertx, Hibernate, and Netty. In a previous blog post, I demonstrated how GraalVM can be used to optimize the size of Docker images. GraalVM is an open source, high-performance polyglot virtual machine from Oracle. I use it to compile native images ahead of time to improve startup performance, and reduce the memory consumption and file size of Java Virtual Machine (JVM)-based applications. The framework that allows ahead-of-time-compilation (AOT) is called Substrate.

Application Architecture

First, review the GitHub repository containing the demo application.

Our application is a simple REST-based Create Read Update Delete (CRUD) service that implements basic user management functionalities. All data is persisted in an Amazon DynamoDB table. Quarkus offers an extension for Amazon DynamoDB that is based on AWS SDK for Java V2. This Quarkus extension supports two different programming models: blocking access and asynchronous programming. For local development, DynamoDB Local is also supported. DynamoDB Local is the downloadable version of DynamoDB that lets you write and test applications without accessing the DynamoDB service. Instead, the database is self-contained on your computer. When you are ready to deploy your application in production, you can make a few minor changes to the code so that it uses the DynamoDB service.

The REST-functionality is located in the class UserResource which uses the JAX-RS implementation RESTEasy. This class invokes the UserService that implements the functionalities to access a DynamoDB table with the AWS SDK for Java. All user-related information is stored in a Plain Old Java Object (POJO) called User.

Building the application

To create a Docker container image that can be used in the task definition of my ECS cluster, follow these three steps: build the application, create the Docker Container Image, and push the created image to my Docker image registry.

To build the application, I used Maven with different profiles. The first profile (default profile) uses a standard build to create an uber JAR – a self-contained application with all dependencies. This is very useful if you want to run local tests with your application, because the build time is much shorter compared to the native-image build. When you run the package command, it also execute all tests, which means you need DynamoDB Local running on your workstation.

$ docker run -p 8000:8000 amazon/dynamodb-local -jar DynamoDBLocal.jar -inMemory -sharedDb

$ mvn package

The second profile uses GraalVM to compile the application into a native image. In this case, you use the native image as base for a Docker container. The Dockerfile can be found under src/main/docker/Dockerfile.native and uses a build-pattern called multi-stage build.

$ mvn package -Pnative -Dquarkus.native.container-build=true

An interesting aspect of multi-stage builds is that you can use multiple FROM statements in your Dockerfile. Each FROM instruction can use a different base image, and begins a new stage of the build. You can pick the necessary files and copy them from one stage to another, thereby limiting the number of files you have to copy. Use this feature to build your application in one stage and copy your compiled artifact and additional files to your target image. In this case, you use ubi-quarkus-native-image:20.1.0-java11 as base image and copy the necessary TLS-files (SunEC library and the certificates) and point your application to the necessary files with JVM properties.

FROM quay.io/quarkus/ubi-quarkus-native-image:20.1.0-java11 as nativebuilder
RUN mkdir -p /tmp/ssl-libs/lib \
  && cp /opt/graalvm/lib/security/cacerts /tmp/ssl-libs \
  && cp /opt/graalvm/lib/libsunec.so /tmp/ssl-libs/lib/

FROM registry.access.redhat.com/ubi8/ubi-minimal
WORKDIR /work/
COPY target/*-runner /work/application
COPY – from=nativebuilder /tmp/ssl-libs/ /work/
RUN chmod 775 /work
CMD ["./application", "-Dquarkus.http.host=", "-Djava.library.path=/work/lib", "-Djavax.net.ssl.trustStore=/work/cacerts"]

In the second and third steps, I have to build and push the Docker image to a Docker registry of my choice which is straight forward:

$ docker build -f src/main/docker/Dockerfile.native -t

$ docker push <repo/image:tag>

Setting up the infrastructure

You’ve compiled the application to a native-image and have built a Docker image. Now, you set up the basic infrastructure consisting of an Amazon Virtual Private Cloud (VPC), an Amazon ECS or Amazon EKS cluster with on AWS Fargate launch type, an Amazon DynamoDB table, and an Application Load Balancer.

Figure 1: Architecture of the infrastructure (for Amazon ECS)

Figure 1: Architecture of the infrastructure (for Amazon ECS)

Codifying your infrastructure allows you to treat your infrastructure just as code. In this case, you use the AWS Cloud Development Kit (AWS CDK), an open source software development framework, to model and provision your cloud application resources using familiar programming languages. The code for the CDK application can be found in the demo application’s code repository under eks_cdk/lib/ecs_cdk-stack.ts or ecs_cdk/lib/ecs_cdk-stack.ts. Set up the infrastructure in the AWS Region us-east-1:

$ npm install -g aws-cdk // Install the CDK
$ cd ecs_cdk
$ npm install // retrieves dependencies for the CDK stack
$ npm run build // compiles the TypeScript files to JavaScript
$ cdk deploy  // Deploys the CloudFormation stack

The output of the AWS CloudFormation stack is the load balancer’s DNS record. The heart of our infrastructure is an Amazon ECS or Amazon EKS cluster with AWS Fargate launch type. The Amazon ECS cluster is set up as follows:

const cluster = new ecs.Cluster(this, "quarkus-demo-cluster", {
      vpc: vpc
    const logging = new ecs.AwsLogDriver({
      streamPrefix: "quarkus-demo"

    const taskRole = new iam.Role(this, 'quarkus-demo-taskRole', {
      roleName: 'quarkus-demo-taskRole',
      assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com')
    const taskDef = new ecs.FargateTaskDefinition(this, "quarkus-demo-taskdef", {
      taskRole: taskRole
    const container = taskDef.addContainer('quarkus-demo-web', {
      image: ecs.ContainerImage.fromRegistry("<repo/image:tag>"),
      memoryLimitMiB: 256,
      cpu: 256,
      containerPort: 8080,
      hostPort: 8080,
      protocol: ecs.Protocol.TCP

    const fargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(this, "quarkus-demo-service", {
      cluster: cluster,
      taskDefinition: taskDef,
      publicLoadBalancer: true,
      desiredCount: 3,
      listenerPort: 8080

Cleaning up

After you are finished, you can easily destroy all of these resources with a single command to save costs.

$ cdk destroy


In this post, I described how Java applications can be implemented using Quarkus, compiled to a native-image, and ran using Amazon ECS or Amazon EKS on AWS Fargate. I also showed how AWS CDK can be used to set up the basic infrastructure. I hope I’ve given you some ideas on how you can optimize your existing Java application to reduce startup time and memory consumption. Feel free to submit enhancements to the sample template in the source repository or provide feedback in the comments.

We also encourage you to explore how you to optimize your Java application for AWS Lambda with Quarkus.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: How to Identify and Block Fake Crawler Bots Using AWS WAF

Post Syndicated from Vijay Menon original https://aws.amazon.com/blogs/architecture/field-notes-how-to-identify-and-block-fake-crawler-bots-using-aws-waf/

In this blog post, we focus on how to identify fake bots using these AWS services: AWS WAF, Amazon Kinesis Data Firehose, Amazon S3 and AWS Lambda. We use fake Google/Bing bots to demonstrate, but the principles can be applied to other popular crawlers like Slurp Bot from Yahoo, DuckDuckBot from DuckDuckGo, Alexa crawler from Alexa internet ranking service.

For industries like media, online retailors, news or social websites, content is critical and often sets them apart from other competitors. These companies put in a significant amount of effort to make the content as visible and accessible as possible. To do that these companies rely on crawler bots, so that legitimate users searching for content can find the content easily. Crawler bots are useful for indexing the site pages and helping make the content more searchable and improve rankings.

However, this capability can be misused. So it is important to distinguish between genuine crawler bots and fake ones that are doing more than just indexing your site. It’s important to properly identify good and bad actors so that you can stop the bad ones without impacting the ability of good ones, and at scale. This helps in driving more traffic, visitors, and more revenue from your websites.

Identifying bots

There are two primary sources of information required to identify a fake bot:

  • HTTP Header User-Agent: Fake bots try to present themselves as real bots, for example as Google or Bing, by using the same user agent string used by Google or Bing.
  • IP Address: You can look at the source IP address of the incoming request and determine if it belongs to the search engine provider network like Google or Bing. You can do this by performing a forward and reverse look up and comparing the results. These methods are well documented by the search engine providers Google and Bing.

Solution Overview

The solution leverages the capabilities of AWS WAF. Our demonstration application is a static website hosted on Amazon S3 fronted by Amazon CloudFront. This means that we can provide permission of access to the S3 bucket only to CloudFront using origin access identity.

The logs are streamed in near real time using Amazon Kinesis Data Firehose, inspected using a Lambda function to help identify fake bots before storing the logs on Amazon S3. The Lambda function does two things:

  1. Inspect the traffic using the rules to identify bad or fake bots. In this case it uses forward and reverse DNS lookup results of the Client IP address of packets with User-Agent string resembling GoogleBot or BingBot.
  2. Once identified as a fake bot, the Lambda function updates AWS WAF IP-Set to permanently block the requests coming from IP addresses of fake bots.

Note: For the sake of this demonstration, we are using a static website hosted on Amazon S3 with CloudFront. This requires the AWS WAF and IP-Set used by AWS WAF to be of scope ‘CLOUDFRONT’. You can modify it to use scope ‘REGIONAL’ if you chose to protect your web properties behind Application Load Balancer or API Gateway.

Solution Architecture

WAF Solution Architecture


Readers of this blog post should be familiar with HTTP and the following AWS services:

For this walkthrough, you should have the following:

  • AWS Account
  • AWS Command Line Interface (AWS CLI): You need AWS CLI installed and configured on the workstation from where you are going to try the steps mentioned below.
  • Credentials configured in AWS CLI should have the required IAM permissions to spin up and modify the resources mentioned in this post.
  • Make sure that you deploy the solution to us-east-1 Region and your AWS CLI default Region is us-east-1. If us-east-1 is not the default Region, reference the Regions explicitly while executing AWS CLI commands using --region us-east-1 switch.
  • Amazon S3 bucket in us-east-1 region


1.     Create an Amazon S3 static Website and put it behind an Amazon CloudFront distribution. You can follow these steps. Note down the CloudFront distribution ID. We will use it in subsequent steps.

2.     Download and unzip the file containing CloudFormation template and lambda function code from here to a folder in your local workstation. You will need to run all of the subsequent commands from this folder.

3.     Zip the lambda function lambda_function.py and upload it to an Amazon S3 bucket of your choice in us-east-1. Note the bucket name as it will be used in the subsequent steps. The blog uses waf-logs-fake-bots-us-east-1 for reference as the S3 bucket name.

$ zip lambda_function.py lambda_function.py.zip
$ aws s3 cp lambda_function.py.zip s3://waf-logs-fake-bots-us-east-1/

4.     Create the resources required for this blog post by deploying the AWS CloudFormation template and running the below command:

aws cloudformation create-stack \
--stack-name FakeBotBlockBlog \
--template-body file://BotBlog.yml \
--parameters ParameterKey=KinesisBufferIntervalSeconds,ParameterValue=900 ParameterKey=KinesisBufferSizeMB,ParameterValue=3 ParameterKey=IPSetName,ParameterValue=BlockFakeBotIPSet ParameterKey=IPSetScope,ParameterValue=CLOUDFRONT ParameterKey=S3BucketWithDeploymentPackage,ParameterValue=waf-logs-fake-bots-us-east-1 ParameterKey=DeploymentPackageZippedFilename,ParameterValue=lambda_function.py.zip \
--capabilities CAPABILITY_IAM \
--region us-east-1

You need to provide the following information, and you can change the parameters based on your specific needs:

a.     KinesisBufferIntervalSeconds and KinesisBufferSizeMB. These will define the interval at which Kinesis Firehose ships the logs to Amazon S3, the Default is 900 seconds and 3MB respectively, whichever is met first.

b.     IPSetName. Name of the IP Set which will be used to record the client IP address of fake bots. Default value is BlockFakeBotIPSet.

c.     IPSetScope. Scope of the IP Set. I am using CLOUDFRONT and associate it with the CloudFront distribution created in step 1. You can choose to make it REGIONAL in which case WebACLassociation will need to be with an ALB or an API Gateway.

d.     S3BucketWithDeploymentPackage. Name of S3 bucket used in step 3. The blog assumes waf-logs-fake-bots-us-east-1.

e.     DeploymentPackageZipedFilename. Lambda function filename without the file extension. For example, the blog assumes lambda_function.py.zip is available on the Amazon S3 bucket and uses this value for this parameter.

Some stack templates might include resources that can affect permissions in your AWS account, for example, by creating new AWS Identity and Access Management (IAM) role. For those stacks, you must explicitly acknowledge this by specifying CAPABILITY_IAM or CAPABILITY_NAMED_IAM value for the –capabilities parameter.

Stack creation will take you approximately 5-7 minutes. Check the status of the stack by executing the below command every few minutes. You should see StackStatus value as CREATE_COMPLETE.


aws cloudformation describe-stacks – stack-name FakeBotBlockBlog | grep StackStatus

The CloudFormation template will create the following resources:

  • IP Set for AWS WAF
  • WebACL with rules to block the client IP addresses of fake bots, and an AWS-managed common rule set.
  • Lambda function to help detect fake bots and modify the AWS WAF IP Set to block them
  • Kinesis Firehose delivery stream, which will use the above Lambda function for processing
  • IAM roles with required permissions for the Lambda function and Kinesis Firehose
  • S3 bucket for AWS WAF logs

5.     Enable logging for the WebACL using AWS CLI. For this you need the ARN of the WebACL and Kinesis Firehose. You can find that information from the output of the CloudFormation stack created in step 4 using the below AWS CLI command

aws cloudformation describe-stacks – stack-name FakeBotBlockBlog

Please note the 2 ARNs and run the following commands by replacing (1) ResourceArn value with WebACL ARN and (2) LogDestinationConfigs value with Kinesis Firehose delivery stream ARN.


aws wafv2 put-logging-configuration –-logging-configuration ResourceArn=arn:aws:wafv2:us-east-1:123456789012:global/webacl/FakeBotWebACL/259ea98f-24ba-4acd-8803-3e7d02e8d482,LogDestinationConfigs=arn:aws:firehose:us-east-1:123456789012:deliverystream/aws-waf-logs-FakeBotBlockBlog – region us-east-1

6.     Associate CloudFront distribution with this WebACL: Sign in to the AWS Management Console and open the AWS WAF and Shield console at https://console.aws.amazon.com/wafv2/homev2/web-acls?region=global .

  • Click on the WebACL created earlier in this procedure
  • Navigate to ‘Associated AWS resources’ tab and select Add AWS resources
  • In the subsequent screen, select the CloudFront distribution created in step 1
  • Select Add

Note: If you are using an ALB or API Gateway for your web property, then you need to use REGIONAL WebACL and IP Set.  Review the procedure to associate an ALB or API Gateway to the WebACL.

You can monitor WebACL performance from the Overview section of WebACL from the AWS WAF and Shield console.


To test, you will need to generate some traffic which will trigger the lambda function to detect and block the fake bots created earlier in this blog. The web traffic can be generated from the local machine or from an EC2 instance with access to the internet using curl. Manually set the user agent to resemble Googlebot by running the following command from shell:

Replace http://www.awsdemodesign.com/ with the URL of your CloudFront distribution you created in step 1 of the walkthrough.

for i in {1..1000}; do curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://www.awsdemodesign.com/; done

Initially you will see HTTP1.1/ 200 OK response. This will trigger Lambda and modify the IP Set to include your public IP address to be blocked. You can verify that by inspecting the IP set from AWS WAF and Shield console.

  • Sign in to the AWS Management Console and open the AWS WAF and Shield console
  • Click on IP Set created earlier in this blog. In the subsequent screen you can see your public IP address in the list of IP addresses.

If you run the curl command again, you will see that the response now is HTTP/1.1 403 Forbidden.

Clean Up

  • Disassociate the CloudFront Distribution from WebACL
  • Delete the S3 bucket and CloudFront Distribution created in Step 1
  • Empty and delete the S3 bucket created by the CloudFormation stack for AWS WAF logs.
  • Delete CloudFormation stack created in Step 4.


In this blog post, we demonstrated how you can set up and inspect incoming web traffic using AWS Lambda, AWS WAF native logging capabilities, and Kinesis Firehose to help detect and block bad or fake bots at scale. Furthermore, the solution outlined in this post provides a framework which can be extended to identify similar unwanted traffic impersonating as other good bots.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Field Notes: Integrating IoT and ITSM using AWS IoT Greengrass and AWS Secrets Manager – Part 2

Post Syndicated from Gary Emmerton original https://aws.amazon.com/blogs/architecture/field-notes-integrating-iot-and-itsm-using-aws-iot-greengrass-and-aws-secrets-manager-part-2/

In part 1 of this blog I introduced the need for organizations to securely connect thousands of IoT devices with many different systems in the hyperconnected world that exists today, and how that can be addressed using AWS IoT Greengrass and AWS Secrets Manager.  We walked through the creation of ServiceNow credentials in AWS Secrets Manager, the creation of IAM roles and the Lambda functions that will run on our edge device (a Raspberry Pi).

In this second part of the blog, we will setup AWS IoT Greengrass, on our Raspberry Pi, and AWS IoT Core so that we can run the AWS Lambda functions and access our ServiceNow credentials, retrieved securely from AWS Secrets Manager.

Setting up AWS IoT Core and AWS IoT Greengrass

The overall sequence for configuring AWS IoT Core and AWS IoT Greengrass is:

  • Create a certificate, and IoT Thing and link them
  • Create AWS IoT Greengrass group
  • Associate IAM role to the AWS IoT Greengrass group
  • Create and attach a policy to the certificate
  • Create an AWS IoT Greengrass Resource Definition for our ‘Secret’
  • Create an AWS IoT Greengrass Function Definition for our Lambda functions
  • Create an AWS IoT Greengrass Subscription Definition for IoT Topics to be used
  • Finally associate our Resource, Function and Subscription Definitions with our AWS IoT Greengrass Core


For this walkthrough, I have selected the AWS region “eu-west-1”, however, feel free to use other Regions where AWS IoT Core and AWS IoT Greengrass are available.

First, let’s install Greengrass on the Raspberry Pi:

  • Follow the instructions to configure the pre-requisites on the Raspberry Pi
  • Then we download the AWS IoT Greengrass software
  • And then we unzip the AWS IoT Greengrass software using the following command (note, this command is for version 1.10.0 of Greengrass and will change as later versions are released):

sudo tar -xzvf greengrass-linux-armv6l-1.10.0.tar.gz -C /

Note that AWS IoT Greengrass must be compatible with the version of the AWS Greengrass SDK installed to identify what versions are compatible and use sudo pip3 install greengrasssdk==<version_number> to install the SDK compatible with the version of AWS IoT Greengrass that we installed.

Our AWS IoT Greengrass core will authenticate with AWS IoT Core in AWS using certificates, so we need to generate these first using the following command:

aws iot create-keys-and-certificate – set-as-active – certificate-pem-outfile "iot-ge.cert.pem" – public-key-outfile "iot-ge.public.key" – private-key-outfile "iot-ge.private.key"

This command will generate three files containing the private key, public key and certificate.  All of these files need to be copied to the /greengrass/certs folder on the Raspberry Pi.  Also, the output of the preceding command will give the ARN of the certificate – we need to make a note of this ARN as we will use it in the next steps.

We also need to download a copy of the Amazon Root CA into the /greegrass/certs folder using the command below:

sudo wget -O root.ca.pem https://www.amazontrust.com/repository/AmazonRootCA1.pem

For the next step we need our AWS account number and IoT Host address unique to our account – we get the IoT Host address using the command:

aws iot describe-endpoint – endpoint-type iot:Data-ATS

Now we need to create a config.json file on the Raspberry Pi in the /greengrass/config folder, with the account number and IoT Host address obtained in the previous step;

  "coreThing" : {
    "caPath" : "root.ca.pem",
    "certPath" : "iot-ge.cert.pem",
    "keyPath" : "iot-ge.private.key",
    "thingArn" : "arn:aws:iot:eu-west-1:<aws_account_number>:thing/IoT-blog_Core",
    "iotHost" : "<endpoint_address>",
    "ggHost" : "greengrass-ats.iot.eu-west-1.amazonaws.com",
    "keepAlive" : 600
  "runtime" : {
    "cgroup" : {
      "useSystemd" : "yes"
    "allowFunctionsToRunAsRoot" : "yes"
  "managedRespawn" : false,
  "crypto" : {
    "principals" : {
      "SecretsManager" : {
        "privateKeyPath" : "file:///greengrass/certs/iot-ge.private.key"
      "IoTCertificate" : {
        "privateKeyPath" : "file:///greengrass/certs/iot-ge.private.key",
        "certificatePath" : "file:///greengrass/certs/iot-ge.cert.pem"
    "caPath" : "file:///greengrass/certs/root.ca.pem"

Note that the line "allowFunctionsToRunAsRoot" : "yes" allows the Lambda functions to easily access the SenseHat on the Raspberry Pi. This configuration should normally be avoided in Production environments for security reasons but has been used here for simplicity.

Next we create the IoT Thing to represent our Raspberry Pi to match the entry we added into the config.json file previously:

aws iot create-thing – thing-name IoT-blog_Core

Now that our config.json file is in place and our IoT ‘thing’ created we can start the AWS IoT Greengrass software using the following commands:

cd /greengrass/ggc/core/
sudo ./greengrassd start

Then we attach the certificate to our new Thing – we need the ARN of the certificate that was noted in the earlier steps when we created the certificates:

aws iot attach-thing-principal – thing-name "IoT-blog_Core" – principal "<certificate_arn>"

Now we create the AWS IoT Greengrass group – make a note of the Group ID in the output of this command as we use it later:

aws greengrass create-group – name IoT-blog-group

Next we create the AWS IoT Greengrass Core definition file – create this using a text editor and save as core-def.json

  "Cores": [
      "CertificateArn": "<certificate_arn>",
      "Id": "<IoT Thing Name>",
      "SyncShadow": true,
      "ThingArn": "<thing_arn>"

Then, using the preceding file we just created, we create the core definition using the following command:

aws greengrass create-core-definition – name "IoT-blog_Core" – initial-version file://core-def.json

Now we associate the AWS IoT Greengrass core with the AWS IoT Greengrass group – we need the LatestVersionARN from the output of the command above and the group ID of your existing AWS IoT Greengrass group (in the output from the command for creation of the group in previous steps):

aws greengrass create-group-version – group-id "<greengrass_group_id>" – core-definition-version-arn "<core_definition_version_arn>"

Then we associate the IAM role (created earlier) to the AWS IoT Greengrass group;

aws greengrass associate-role-to-group – group-id "<greengrass_group_id>" – role-arn "arn:aws:iam::<aws_account_number>:role/IoTGGRole"

We need to create a policy to associate with the certificate so that our AWS IoT Greengrass Core (authenticated/authorized by our certificates) has rights to interact with AWS IoT Core.  To do this we create the policy.json file:

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Action": [
      "Resource": [
      "Effect": "Allow",
      "Action": [
      "Resource": [
      "Effect": "Allow",
      "Action": [
      "Resource": [

Then create the policy using the policy file using the command below:

aws iot create-policy – policy-name myGGPolicy – policy-document file://policy.json

And finally attach our new policy to the certificate – as the certificate is attached to our AWS IoT Greengrass Core, this gives the rights defined in the policy to our AWS IoT Greengrass Core;

aws iot attach-policy – target "<certificate_arn>" – policy-name "myGGPolicy"

Now we have the AWS IoT Greengrass Core and permissions in place, it’s time to add our Secret as a resource for AWS IoT Greengrass.

First, we need to create a resource definition that refers to the ARN of the secret we created earlier.  Get the ARN of the secret using the following command:

aws secretsmanager describe-secret – secret-id "greengrass-snow-creds"

And then we create a text file containing the following and save it as resource.json:

"Resources": [
      "Id": "SNOW-Credentials",
      "Name": "SNOW-Credentials",
      "ResourceDataContainer": {
        "SecretsManagerSecretResourceData": {
          "ARN": "<secret_arn>"

Now we command to create the resource reference in IoT to the Secret:

aws greengrass create-resource-definition – name "MySNOWSecret" – initial-version file://resource.json

Note the Resource ID from the output as it is needed as it has to be added to the Lambda definition json file in the next steps.  The function definition file contains the details of the Lambda function(s) that we will attach to our AWS IoT Greengrass group.  We create a text file with the content below and save as lambda-def.json.

We also specify a couple of variables in the definition file; these are the same as the environment variables that can be specified for Lambda, but they make the variables available in AWS IoT Greengrass.

Note, if we specify environment variables for the functions in the Lambda console then these will NOT be available when the function is running under AWS IoT Greengrass.  We will need our ServiceNow API URL to add to the configuration below, and this will be in the form of https://devXXXXX.service-now.com/api/now/table/incident, where XXXXX is the developer instance number assigned by ServiceNow when our instance is created.

We need the ARNs of the Lambda functions that we created in part 1 of the blog – these appear in the output after successfully creating the functions from the command line, or can be obtained using the aws lambda list-functions command – we need to have the ‘:1’ at the end of the ARN as AWS IoT Greengrass needs to reference published function versions.

  "DefaultConfig": {
    "Execution": {
      "IsolationMode": "NoContainer",
      "RunAs": {
        "Gid": 0,
        "Uid": 0
  "Functions": [
      "FunctionArn": "<lambda_function1_arn>:1",
      "FunctionConfiguration": {
        "EncodingType": "json",
        "Environment": {
          "Execution": {
            "IsolationMode": "NoContainer"
          "Variables": { 
            "tempLimit": "30",
            "humidLimit": "50"
        "ExecArgs": "string",
        "Executable": "lambda_function.lambda_handler",
        "Pinned": true,
        "Timeout": 10
    "Id": "sensorLambda"
      "FunctionArn": "<lambda_function2_arn>:1",
      "FunctionConfiguration": {
        "EncodingType": "json",
        "Environment": {
          "Execution": {
            "IsolationMode": "NoContainer"
          "ResourceAccessPolicies": [
              "Permission": "ro",
              "ResourceId": "SNOW-Credentials"
          "Variables": { 
            "snowUrl": "<service_now_api_url>"
        "ExecArgs": "string",
        "Executable": "lambda_function.lambda_handler",
        "Pinned": false,
        "Timeout": 10
    "Id": "anomalyLambda"

The Lambda function now needs to be registered within our AWS IoT Greengrass core using the definition file just created, using the following command:

aws greengrass create-function-definition – name "IoT-blog-lambda" – initial-version file://lambda-def.json

Create Subscriptions

We now need to create some IoT Topics to pass data between the two Lambda functions and also to submit all sensor data to AWS IoT Core, which gives us visibility of the successful collection of sensor data.cd.

First, let’s create a subscription configuration file (subscriptions.json) for sensor data and anomaly data:

  "Subscriptions": [
      "Id": "SensorData",
      "Source": "<lambda_function1_arn>:1",
      "Subject": "IoTBlog/sensorData",
      "Target": "cloud"
      "Id": "AnomalyData",
      "Source": "<lambda_function1_arn>:1",
      "Subject": "IoTBlog/anomaly",
      "Target": "<lambda_function2_arn>:1"
      "Id": "AnomalyDataB",
      "Source": "<lambda_function1_arn>:1",
      "Subject": "IoTBlog/anomaly",
      "Target": "cloud"

And next, we run the command to create the subscription from this configuration:

aws greengrass create-subscription-definition – name "IoT-sensor-subs" – initial-version file://subscriptions.json

Update AWS IoT Greengrass Group Associations and Deploy

Now that the functions, subscriptions and resources have been defined, we run the following command to update our AWS IoT Greengrass group to the new version with those components included:

aws greengrass create-group-version – group-id <gg_group_id> – core-definition-version-arn "<core_def_version_arn>" – function-definition-version-arn "<function_def_version_arn>" – resource-definition-version-arn "<resource_def_version_arn>" – subscription-definition-version-arn "<subscription_def_version_arn>"

And finally, we can deploy our configuration.  Use the following command to deploy the Greengrass group to our device, using the group-version-id from the output of the previous command and also the group-id:

aws greengrass create-deployment – deployment-type NewDeployment – group-id <gg_group_id> – group-version-id <gg_group_version_id>

Summarized below is the integration between the different functions and components that we have now deployed to get from our sensor data through to an incident being raised in ServiceNow:

Raspberry PI

Create an Incident

Everything is setup now from an IoT perspective, so we can attempt to trigger a threshold breach on the sensors to trigger the creation of an incident in ServiceNow.  In order to trigger the incident creation, let’s raise the humidity around the sensor so that it breaches the threshold defined in the environment variables of the Lambda function.

Under normal conditions we will just see the data published by the first Lambda function in the IoTBlog/sensorData topic:

IoTblog sensordata

However, when a threshold is breached (in our example, humidity above 50%), the data is published to the IoTBlog/anomaly topic as shown below:

ioTblog Anomaly

Via the AWS IoT Greengrass subscriptions created earlier, this message arriving in the anomaly topic also triggers the second Lambda function to create the ticket in ServiceNow.

The log for the second Lambda function on AWS IoT Greengrass (stored in /greengrass/ggc/var/log/user/<region>/<aws_account_number>/ on the Raspberry Pi) will show a ‘201’ return code if the incident is successfully created in ServiceNow.

201 response

Now let’s log on to ServiceNow and check out our new incident.  Good news, our new incident appears correctly:

And when we click on our incident we can see the detail, including the full data from the IoT topic in the Activities section;

This is only a basic use of the ServiceNow API and there are many other parameters that you can use to increase the richness of the incident, refer to the ServiceNow API documentation for more details.

Cleaning up

To avoid incurring future charges, delete the resources that you created in the walkthrough.


We have built an IoT device (Raspberry Pi), running AWS IoT Greengrass, AWS Lambda, and using ServiceNow credentials managed in AWS Secrets Manager.  Using this we have triggered an anomaly event that has created an incident automatically in ServiceNow, directly from the Lambda function running on our Pi.  You can use this architecture as the foundation to integrate your edge devices and ITSM solution to automate ticket generation in your organization.

Look out for follow-up blogs that will extend this solution to provide a real-time dashboard for the sensor data and store the sensor data in a Data Lake for historical visualization.

Find out more about deploying Secrets to AWS IoT Greengrass Core.

Check out the AWS IoT Blog for more examples of how to use AWS to integrate your edge devices with the AWS Cloud.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.