Tag Archives: Architecture

Choosing Your VPC Endpoint Strategy for Amazon S3

Post Syndicated from Jeff Harman original https://aws.amazon.com/blogs/architecture/choosing-your-vpc-endpoint-strategy-for-amazon-s3/

This post was co-written with Anusha Dharmalingam, former AWS Solutions Architect.

Must your Amazon Web Services (AWS) application connect to Amazon Simple Storage Service (S3) buckets, but not traverse the internet to reach public endpoints? Must the connection scale to accommodate bandwidth demands? AWS offers a mechanism called a VPC endpoint to meet these requirements. This blog post provides guidance for selecting the right VPC endpoint type to access Amazon S3. A VPC endpoint enables workloads in an Amazon VPC to connect to supported public AWS services or third-party applications over the AWS network. This approach is used for workloads that should not communicate over public networks.

When a workload architecture uses VPC endpoints, the application benefits from the scalability, resilience, security, and access controls native to AWS services. Amazon S3 can be accessed using an interface VPC endpoint powered by AWS PrivateLink or a gateway VPC endpoint. To determine the right endpoint for your workloads, we’ll discuss selection criteria to consider based on your requirements.

VPC endpoint overview

A VPC endpoint is a virtual scalable networking component you create in a VPC and use as a private entry point to supported AWS services and third-party applications. Currently, two types of VPC endpoints can be used to connect to Amazon S3: interface VPC endpoint and gateway VPC endpoint.

When you configure an interface VPC endpoint, an elastic network interface (ENI) with a private IP address is deployed in your subnet. An Amazon EC2 instance in the VPC can communicate with an Amazon S3 bucket through the ENI and the AWS network. Using the interface endpoint, applications in your on-premises data center can easily query S3 buckets over AWS Direct Connect or Site-to-Site VPN. Interface endpoints support a growing list of AWS services. Consult our documentation to find AWS services compatible with interface endpoints powered by AWS PrivateLink.

Gateway VPC endpoints use prefix lists as the IP route target in a VPC route table. This routes traffic privately to Amazon S3 or Amazon DynamoDB. An EC2 instance in a VPC without internet access can still directly read from and/or write to an Amazon S3 bucket. Amazon DynamoDB and Amazon S3 are the services currently accessible via gateway endpoints.

Your internal security policies may have strict rules against communication between your VPC and the internet. To maintain compliance with these policies, you can use a VPC endpoint to connect to AWS public services like Amazon S3. To control user or application access to the VPC endpoint and the resources it supports, you can use an AWS Identity and Access Management (AWS IAM) resource policy. This secures the VPC endpoint and the accessible resources separately.
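As a concrete sketch, an endpoint policy is a standard IAM-style resource policy document attached to the endpoint. The bucket name and action list below are illustrative placeholders, not values from this post:

```python
import json

# Hypothetical endpoint policy restricting the endpoint to read-only
# access on a single example bucket. Bucket name and actions are
# placeholders; adapt them to your own resources.
endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadOnlyToOneBucket",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*",
            ],
        }
    ],
}

# Serialize for use with the AWS CLI or an infrastructure-as-code template.
policy_json = json.dumps(endpoint_policy, indent=2)
```

A policy like this is applied in addition to bucket policies and IAM user or role policies, so each layer can be tightened independently.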

Selecting gateway or interface VPC endpoints

With both interface endpoint and gateway endpoint available for Amazon S3, here are some factors to consider as you choose one strategy over the other.

  • Cost: Gateway endpoints for S3 are offered at no cost, and the routes are managed through route tables. Interface endpoints are priced at $0.01 per AZ per hour; cost depends on the Region, so check current pricing. Data transferred through the interface endpoint is charged at $0.01 per GB (depending on Region).
  • Access pattern: S3 access through a gateway endpoint is supported only for resources in the specific VPC with which the endpoint is associated. S3 gateway endpoints do not currently support access from resources in a different Region, a different VPC, or an on-premises (non-AWS) environment. However, if you’re willing to manage a complex custom architecture, you can use proxies. In all of these scenarios, where access comes from resources external to the VPC, S3 interface endpoints provide secure access to S3.
  • VPC endpoint architecture: Some customers use centralized VPC endpoint architecture patterns, where the interface endpoints are all managed in a central hub VPC for accessing the service from multiple spoke VPCs. This architecture helps reduce the complexity and maintenance of multiple interface VPC endpoints across different VPCs. When using an S3 interface endpoint, you must consider the amount of network traffic that would flow through your network from spoke VPCs to the hub VPC. If the network connectivity between spoke and hub VPCs is set up using a transit gateway or VPC peering, consider the data processing charges (currently $0.02/GB). If VPC peering is used, there is no charge for data transferred between VPCs in the same Availability Zone. However, data transferred between Availability Zones or between Regions will incur charges as defined in our documentation.
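To make the cost trade-off concrete, here is a rough back-of-the-envelope comparison using the example rates quoted above. Actual rates vary by Region, so check current pricing before relying on these numbers:

```python
# Back-of-the-envelope comparison of endpoint costs, using the example
# rates quoted above ($0.01 per AZ-hour and $0.01 per GB processed).
# Actual rates vary by Region -- check current pricing.
HOURLY_RATE_PER_AZ = 0.01  # USD per AZ per hour (example rate)
DATA_RATE_PER_GB = 0.01    # USD per GB processed (example rate)

def interface_endpoint_monthly_cost(azs, gb_processed, hours=730):
    """Estimated monthly cost of one interface endpoint."""
    return azs * hours * HOURLY_RATE_PER_AZ + gb_processed * DATA_RATE_PER_GB

def gateway_endpoint_monthly_cost(gb_processed):
    """Gateway endpoints for S3 carry no endpoint or data processing charge."""
    return 0.0

# Example: an endpoint in 2 AZs with 500 GB processed per month.
cost = interface_endpoint_monthly_cost(azs=2, gb_processed=500)
```

At these example rates, the interface endpoint in two AZs costs roughly $19.60 per month, while the gateway endpoint is free, which is why gateway endpoints are usually preferred when all access comes from inside the VPC.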

In scenarios where you must access S3 buckets securely from on-premises or from across Regions, we recommend using an interface endpoint. If you choose a gateway endpoint instead, you must install a fleet of proxies in the VPC to address transitive routing.

Figure 1. VPC endpoint architecture

  • Bandwidth considerations: When setting up an interface endpoint, choose multiple subnets across multiple Availability Zones to implement high availability. The number of ENIs equals the number of subnets chosen. Interface endpoints offer a throughput of 10 Gbps per ENI, with a burst capability of 40 Gbps. If your use case requires higher throughput, contact AWS Support.

Gateway endpoints are route table entries that route your traffic directly from the subnet where traffic is originating to the S3 service. Traffic does not flow through an intermediate device or instance. Hence, there is no throughput limit for the gateway endpoint itself. The initial setup for gateway endpoints consists of specifying the VPC route tables you would like to use to access the service. Route table entries for the destination (prefix list) and target (endpoint ID) are automatically added to the route tables.

The two architectural options for creating and managing endpoints are:

Single VPC architecture

Using a single VPC, we can configure:

  • Gateway endpoints for VPC resources to access S3
  • VPC interface endpoint for on-premises resources to access S3

The following architecture shows how both endpoint types can be set up in a single VPC for access. This is useful when access from within AWS is limited to a single VPC while still enabling external (non-AWS) access.

Figure 2. Single VPC architecture

DNS configured on-premises will point to the VPC interface endpoint IP addresses. It will forward all traffic from on-premises to S3 through the VPC interface endpoint. The route table configured in the subnet will ensure that any S3 traffic originating from the VPC will flow to S3 using gateway endpoints.

Multi-VPC centralized architecture

In a hub and spoke architecture that centralizes S3 access for multi-Region, cross-VPC, and on-premises workloads, we recommend using an interface endpoint in the hub VPC. The same pattern would also work in multi-account/multi-region design where multiple VPCs require access to centralized buckets.

Note: Firewall appliances that monitor east-west traffic will experience increased load with the Multi-VPC centralized architecture. It may be necessary to use the single VPC endpoint design to reduce impact to firewall appliances.

Figure 3. Multi-VPC centralized architecture


Based on preceding considerations, you can choose to use a combination of gateway and interface endpoints to meet your specific needs. Depending on the account structure and VPC setup, you can support both types of VPC endpoints in a single VPC by using a shared VPC architecture.

With AWS, you can choose between two VPC endpoint types (gateway endpoint or interface endpoint) to securely access your S3 buckets using a private network. In this blog, we showed you how to select the right VPC endpoint using criteria like VPC architecture, access pattern, and cost. To learn more about VPC endpoints and improve the security of your architecture, read Securely Access Services Over AWS PrivateLink.

CohnReznick Automates Claim Validation Workflow Using AWS AI Services

Post Syndicated from Rajeswari Malladi original https://aws.amazon.com/blogs/architecture/cohnreznick-automates-claim-validation-workflow-using-aws-ai-services/

This post was co-written by Winn Oo and Brendan Byam of CohnReznick and Rajeswari Malladi and Shanthan Kesharaju

CohnReznick is a leading advisory, assurance, and tax firm serving clients around the world. CohnReznick’s government and public sector practice provides claims audit and verification services for state agencies. This process begins with recipients submitting documentation as proof of their claim expenses. The supporting documentation often contains hundreds of filled-out or scanned (sometimes handwritten) PDFs, MS Word files, Excel spreadsheets, and/or pictures, along with a summary form outlining each of the claimed expenses.

Prior to automation with AWS artificial intelligence (AI) services, CohnReznick’s data extraction and validation process was performed manually. Audit professionals had to extract each data point from the submitted documentation, select a population sample for testing, and manually search the documentation for any pages or page sections that validated the information submitted. Validated data points and proof of evidence pages were then packaged into a single document and submitted for claim expense reimbursement.

In this blog post, we’ll show you how CohnReznick implemented Amazon Textract, Amazon Comprehend (with a custom machine learning classification model), and Amazon Augmented AI (Amazon A2I). With this solution, CohnReznick automated nearly 40% of the total claim verification process with focus on data extraction and package creation. This resulted in an estimated cost savings of $500k per year for each project and process.

Automating document processing workflow

Figure 1 shows the newly automated process. Submitted documentation is processed by Amazon Textract, which extracts text from the documents. This text is then submitted to Amazon Comprehend, which employs a custom classification model to classify the documents as claim summaries or evidence documents. All data points are collected from the Amazon Textract output of the claim summary documents. These data points are then validated against the evidence documents.

Finally, a population sample of the extracted data points is selected for testing. Rather than auditors manually searching for specific information in the documentation, the automated process conducts the data search, extracts the validated evidence pages from submitted documentation, and generates the audited package, which can then be submitted for reimbursement.

Figure 1. Architecture diagram

Components in the solution

At a high level, the end-to-end process starts with auditors using a proprietary web application to submit the documentation received for each case to the document processing workflow. The workflow includes three stages, as described in the following sections.

Text extraction

First, the process extracts the text from the submitted documents using the following steps:

  1. For each case, the CohnReznick proprietary web application uploads the documents to the Amazon Simple Storage Service (Amazon S3) upload bucket. Each file has a unique name, and the files have metadata that associates them with the parent case.
  2. The uploaded documents Amazon Simple Queue Service (Amazon SQS) queue is configured to receive notifications for all new objects added to the upload bucket. For every new document added to the upload bucket, Amazon S3 sends a notification to the uploaded documents queue.
  3. The text extraction AWS Lambda function runs every 5 minutes to poll the uploaded documents queue for new messages.
  4. For each message in the uploaded documents queue, the text extraction function submits an Amazon Textract job to process the document asynchronously. This continues until it reaches a predefined maximum allowed limit of concurrent jobs for that AWS account. Concurrency control is implemented by handling LimitExceededException on the StartDocumentAnalysis API call.
  5. After Amazon Textract finishes processing a document, it sends a completion notification to a completed jobs Amazon Simple Notification Service (Amazon SNS) topic.
  6. A process job results Lambda function is subscribed to the completed jobs topic and receives a notification for every completed message sent to the completed jobs topic.
  7. The process job results function then fetches document extraction results from Amazon Textract.
  8. The process job results function stores the document extraction results in the Amazon Textract output bucket.
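The throttling logic in steps 3 and 4 can be sketched as follows. The Textract call is stubbed out so the retry behavior can be shown in isolation; the names and the concurrency limit are illustrative, not taken from CohnReznick's implementation:

```python
# Sketch of the concurrency control in step 4. The real implementation
# calls Textract's StartDocumentAnalysis; here the call is stubbed so
# the limit-handling logic can be shown in isolation.
class LimitExceededException(Exception):
    """Stands in for Textract's LimitExceededException."""

MAX_CONCURRENT_JOBS = 2  # illustrative account limit

_active_jobs = 0

def start_document_analysis(document):
    """Stub for the Textract StartDocumentAnalysis call."""
    global _active_jobs
    if _active_jobs >= MAX_CONCURRENT_JOBS:
        raise LimitExceededException()
    _active_jobs += 1
    return f"job-for-{document}"

def submit_pending_documents(queue_messages):
    """Submit jobs until the concurrency limit is hit; leave the rest queued."""
    submitted, requeued = [], []
    for doc in queue_messages:
        try:
            submitted.append(start_document_analysis(doc))
        except LimitExceededException:
            # Leave the message on the queue for the next 5-minute poll cycle.
            requeued.append(doc)
    return submitted, requeued

submitted, requeued = submit_pending_documents(["a.pdf", "b.pdf", "c.pdf"])
```

Documents that exceed the limit simply stay on the SQS queue and are picked up on a later polling run, which is what makes the 5-minute schedule in step 3 self-throttling.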

Documents classification

Next, the process classifies the documents. The submitted claim documents can consist of up to seven supporting document types. The documents need to be classified into the respective categories. They are primarily classified using automation. Any documents classified with a low confidence score are sent to a human review workflow.

Classification model creation 

The custom classification feature of Amazon Comprehend is used to build a custom model that classifies documents into the seven different document types required by the business process. The model is trained by providing sample data in CSV format. Amazon Comprehend uses multiple algorithms in the training process and picks the model that delivers the highest accuracy for the training data.
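A minimal sketch of what that training CSV might look like, with one label and one document text per row. The label names and texts are invented placeholders for the seven document types:

```python
import csv
import io

# Illustrative training rows for a custom classifier: one label and one
# document text per line. The label names are placeholders standing in
# for the seven document types the business process defines.
rows = [
    ("CLAIM_SUMMARY", "Summary of claimed expenses for Q3 ..."),
    ("EVIDENCE", "Receipt from vendor dated 2021-04-02 ..."),
    ("EVIDENCE", "Invoice number 10045 for services rendered ..."),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(rows)
training_csv = buf.getvalue()  # uploaded to S3 as the training data set
```

In practice, each class needs a representative sample of real documents; the more varied the examples per label, the better the resulting classifier generalizes.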

Classification model invocation and processing

The automated document classification uses the trained model and the classification consists of the following steps:

  1. The business logic in the process job results Lambda function determines text extraction completion for all documents for each case. It then calls the StartDocumentClassificationJob operation on the custom classifier model to start classifying unlabeled documents.
  2. The document classification results from the custom classifier are returned as a single output.tar.gz file in the comprehend results S3 bucket.
  3. At this point, the check confidence scores Lambda function is invoked, which processes the classification results.
  4. The check confidence scores function reviews the confidence scores of classified documents. The results for documents with high confidence scores are saved to the classification results table in Amazon DynamoDB.
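The confidence-score routing in steps 3 and 4 can be sketched as a simple threshold check. The 0.80 threshold is an assumed value for illustration, not one taken from CohnReznick's implementation:

```python
# Sketch of the check confidence scores logic: results at or above a
# threshold are saved to the results table; the rest are routed to
# human review. The 0.80 threshold is an assumed value.
CONFIDENCE_THRESHOLD = 0.80

def route_classifications(results):
    """Split (doc_id, label, score) tuples into auto-accepted and review sets."""
    auto_accepted, needs_review = [], []
    for doc_id, label, score in results:
        if score >= CONFIDENCE_THRESHOLD:
            auto_accepted.append((doc_id, label))
        else:
            needs_review.append(doc_id)
    return auto_accepted, needs_review

accepted, review = route_classifications([
    ("doc-1", "CLAIM_SUMMARY", 0.97),
    ("doc-2", "EVIDENCE", 0.55),
])
```

Tuning this threshold trades automation rate against review workload: a higher threshold sends more documents to humans but reduces the chance of a misclassified document entering the audit package.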

Human review

The documents from the automated classification that have low confidence scores are classified using human review with the following steps:

  1. The check confidence scores Lambda function invokes human review with Amazon Augmented AI for documents with low confidence scores. Amazon A2I is a ready-to-use workflow service for human review of machine learning predictions.
  2. The check confidence scores Lambda function creates human review tasks for each document with a low confidence score. Humans assigned to the classification jobs log into the human review portal and either approve the classification done by the model or reclassify the text with the right labels.
  3. The results from human review are placed in the A2I results bucket.
  4. The update results Lambda function is invoked to process results from the human review.
  5. Finally, the update results function writes the human review document classification results to the classification results table in DynamoDB.

Additional processes

Documents workflow status capturing

The Lambda functions throughout the workflow update the status of their processing and the document/case details in the workflow status table in DynamoDB. The auditor who submitted the case documents can track the workflow status of their case using the data in the workflow status table.

Search and package creation

When the processing is complete for a case, auditors perform the final review and submit the generated packet for downstream processing.

  1. The web application uses the AWS SDK for Java to read the document extraction results from the Textract output S3 bucket and the classification results from the classification results table in DynamoDB. This data is used for the search and package creation process.

Purge data process

After the package creation is complete, the auditor can purge all data in the workflow.

  1. Using the AWS SDK, the data is purged from the S3 buckets and DynamoDB tables.


As seen in this blog post, Amazon Textract, Amazon Comprehend, and Amazon A2I for human review work together with Amazon S3, DynamoDB, and Lambda services. These services have helped CohnReznick automate nearly 40% of their total claim verification process with focus on data extraction and package creation.

You can achieve similar efficiencies and increase scalability by automating your business processes. Get started today by reading additional user stories and using the resources on automated document processing.

Using Cloud Fitness Functions to Drive Evolutionary Architecture

Post Syndicated from Hauke Juhls original https://aws.amazon.com/blogs/architecture/using-cloud-fitness-functions-to-drive-evolutionary-architecture/

“It is not the strongest of the species that survives, nor the most intelligent. It is the one that is most adaptable to change.” – often attributed to Charles Darwin

One common strategy for businesses that operate in dynamic market conditions (and thus need to continuously correct their course) is to aim for smaller, independent development teams. Microservices and two-pizza teams at Amazon are prominent examples of this strategy. But having smaller units is not the only success factor: to reduce organizational bottlenecks and make high-quality decisions quickly, these two-pizza teams need to be autonomous in most of their decision making.

Architects can no longer rely on static upfront design to meet the change rate required to be successful in such an environment.

This blog shows enterprise architects a mechanism to align decentralized architectural decision making with overall architecture goals.

Gathering data from your fitness functions

“Evolutionary architecture” was coined by Neal Ford and his colleagues from AWS Partner ThoughtWorks in their work on Building Evolutionary Architectures. It is defined as “supporting guided, incremental change as a first principle across multiple dimensions.”

Fitness functions help you obtain the necessary data to allow for the planned evolution of your architecture. They set measurable values to assess how close your solution is to achieving your set goals.

Fitness functions can and should be adapted as the architecture evolves to guide a desired change process. This provides architects with a tool to guide their teams while maintaining team autonomy.

Example of a regression fitness function in action

You’ve identified shorter time-to-market as a key non-functional requirement, and you want to lower the risk of regressions and rollbacks after deployments. So, you and your team write automated test cases. To ensure that you have a good set of test cases in place, you measure test coverage: the percentage of code that is tested automatically. This steers the team toward writing tests to mitigate the risk of regressions, so you have fewer rollbacks and a shorter time to market.
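A minimal sketch of such a coverage fitness function, assuming an illustrative 80% target (the target itself is something each team sets, not a value from this post):

```python
# A minimal coverage-based fitness function: given covered and total
# statements, report the coverage and whether the (assumed) 80% goal
# is currently met.
COVERAGE_TARGET = 0.80  # illustrative goal, set per team

def coverage_fitness(covered_statements, total_statements):
    """Return (coverage ratio, whether the target is met)."""
    coverage = covered_statements / total_statements
    return coverage, coverage >= COVERAGE_TARGET

coverage, passing = coverage_fitness(850, 1000)
```

A check like this typically runs in the CI pipeline, so every build produces a fresh data point for the fitness function rather than relying on occasional manual audits.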

Fitness functions like this work best when they’re as automated as possible. But how do you acquire the necessary data points to use this mechanism outside of software architecture? We’ll show you how in the following sections.

AWS Cloud services with built-in fitness functions

AWS Cloud services are highly standardized, fully automated via API operations, and are built with observability in mind. This allows you to generate measurements for fitness functions automatically for areas such as availability, responsiveness, and security.

To start building your evolutionary architecture with fitness functions, use something that can be easily measured. AWS has services that can be used as inputs to fitness functions, including:

  • Amazon CloudWatch aggregates logs and metrics to check for availability, responsiveness, and reliability fitness functions.
  • AWS Security Hub provides a comprehensive view of your security alerts and security posture across your AWS accounts. Security Architects could, for example, define the fitness function of critical and high findings to be zero. Teams then would be guided into reducing the number of these findings, resulting in better security.
  • AWS Cost Explorer ensures your costs stay in line with value generated.
  • AWS Well-Architected Tool evaluates teams’ architectures in a consistent and repeatable way. The number of items acts as your fitness function, which can be queried using the API. To improve your architecture based on the results, review the Establishing Feedback Loops Based on the AWS Well-Architected Framework Review blog post.
  • Amazon SageMaker Model Monitor continuously monitors the quality of SageMaker machine learning models in production. Detecting deviations early allows you to take corrective actions like retraining models, auditing upstream systems, or fixing quality issues.
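The Security Hub example above can be sketched as a small check over findings. The dicts here are a simplified stand-in for the Security Hub API response shape, kept minimal for illustration:

```python
# Sketch of the Security Hub fitness function described above: the goal
# is zero CRITICAL and HIGH findings. The dicts are simplified stand-ins
# for real Security Hub findings.
def security_fitness(findings):
    """Return the number of findings that violate the zero-findings goal."""
    return sum(
        1 for f in findings
        if f["Severity"] in ("CRITICAL", "HIGH")
    )

findings = [
    {"Id": "f-1", "Severity": "HIGH"},
    {"Id": "f-2", "Severity": "LOW"},
]
violations = security_fitness(findings)  # goal is violations == 0
```

Publishing this count as a metric lets each team watch its own trendline toward zero, rather than having a central team police individual findings.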

Using the observability that the cloud provides

Fitness functions can be derived by evaluating the AWS account activity such as configuration changes. AWS CloudTrail is useful for this. It records account activity and service events from most AWS services, which can then be analyzed with Amazon Athena.

Figure 1. Fitness functions provide feedback to engineers via metrics

Example of a cloud fitness function in action

In this example, we implement a fitness function that monitors the operability of your system.

You have had certain outages due to manual tasks in operations, and you have anecdotal evidence that engineers are spending time on manual work during application rollouts. To improve operations, you want to reduce manual interactions via the shell in favor of automation. First, you prevent direct secure shell (SSH) access by blocking SSH traffic via the managed AWS Config rule restricted-ssh. Second, you make use of AWS Systems Manager Session Manager, which provides a secure and auditable way to access Amazon Elastic Compute Cloud (Amazon EC2) instances.

By counting the logged API events in CloudTrail, you can measure the number of shell sessions, as in this sample Athena query:

SELECT DATE(from_iso8601_timestamp(eventTime)) AS event_date,
       count(*) AS session_count
FROM "cloudtrail_logs_partition_projection"
WHERE readonly = 'false'
  AND eventsource = 'ssm.amazonaws.com'
  AND eventname IN ('StartSession')
GROUP BY DATE(from_iso8601_timestamp(eventTime))
ORDER BY DATE(from_iso8601_timestamp(eventTime)) DESC

The number of shell sessions now acts as a fitness function to improve operational excellence through operations as code. Coincidentally, the fitness function you defined also rewards teams moving to serverless compute services such as AWS Fargate or AWS Lambda.
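A sketch of turning those daily counts into a trend-based fitness signal, in line with the advice below to focus on trendlines rather than absolute values (the data is illustrative):

```python
# Turning the query output into a fitness signal: rather than judging an
# absolute number, check that the count of shell sessions is trending
# down over recent periods.
def is_trending_down(daily_counts):
    """True if each period has no more sessions than the one before."""
    return all(b <= a for a, b in zip(daily_counts, daily_counts[1:]))

sessions_per_day = [14, 11, 9, 9, 5]  # illustrative query results
improving = is_trending_down(sessions_per_day)
```

A monotone check like this is deliberately simple; a real implementation might smooth the series or compare rolling averages to tolerate occasional incident-driven spikes.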

Fitness through exercising

Similar to people, your architecture’s fitness can be improved by exercising. It does not take much equipment, but you need to take the first step. To get started, we encourage you to think of the desired outcomes for your architecture that you can measure (and thus guide) through fitness functions. The following lessons learned will help you focus your goals:

  • Requirements and business goals may differ per domain. Thus, your fitness functions might differ. Work closely with your teams when defining fitness functions.
  • Start by taking something that can be easily measured and communicated as a goal.
  • Focus on a positive trendline rather than absolute values.
  • Make sure you and your teams are using the same metrics and the same way to measure them. We have seen examples where central governance departments had access to data the individual teams did not, leading to frustration on all sides.
  • Ensure that your architecture goals fit well into the current context and time horizon.
  • Continuously re-visit the fitness functions to ensure that they evolve with the changing business goals.


Fitness functions help architects focus on building. Once established, teams can use the data points from fitness functions to make decisions and work towards a common and measurable goal. The architects in turn can use the data points they get from fitness functions to confirm their hypothesis of the current state of the architecture. Get started building your fitness functions today by:

  • Gathering the most important system quality attributes.
  • Beginning with approximately three meaningful fitness functions relying on the API operations available.
  • Building a dashboard that shows progress over time, sharing it with your teams, and relying on this data in your daily work.

Field Notes: Launch a Fully Configured AWS Deep Learning Desktop with NICE DCV

Post Syndicated from Ajay Vohra original https://aws.amazon.com/blogs/architecture/field-notes-launch-a-fully-configured-aws-deep-learning-desktop-with-nice-dcv/

You want to start quickly when doing deep learning using GPU-enabled Amazon Elastic Compute Cloud (Amazon EC2) instances in the AWS Cloud. Although AWS provides end-to-end machine learning (ML) in Amazon SageMaker, when working at the deep learning frameworks level, the quickest way to start is with AWS Deep Learning AMIs (DLAMIs), which provide preconfigured Conda environments for most of the popular frameworks.

DLAMIs make it straightforward to launch Amazon EC2 instances, but these instances do not automatically provide the high-performance graphics visualization required during deep learning research. Additionally, they are not preconfigured to use AWS storage services or SageMaker. This post explains how you can launch a fully-configured deep learning desktop in the AWS Cloud. Not only is this desktop preconfigured with the popular frameworks such as TensorFlow, PyTorch, and Apache MXNet, but it is also enabled for high-performance graphics visualization. NICE DCV is the remote display protocol used for visualization. In addition, it is preconfigured to use AWS storage services and SageMaker.

Overview of the Solution

The deep learning desktop described in this solution is ready for research and development of deep neural networks (DNNs) and their visualization. You no longer need to set up low-level drivers, libraries, and frameworks, or configure secure access to AWS storage services and SageMaker. The desktop has preconfigured access to your data in a Simple Storage Service (Amazon S3) bucket, and a shared Amazon Elastic File System (Amazon EFS) is automatically attached to the desktop. It is automatically configured to access SageMaker for ML services, and provides you with the ability to prepare the data needed for deep learning, and to research, develop, build, and debug your DNNs. You can use all the advanced capabilities of SageMaker from your deep learning desktop. The following diagram shows the reference architecture for this solution.

Figure 1 – Architecture overview of the solution to launch a fully configured AWS Deep Learning Desktop with NICE DCV

The deep learning desktop solution discussed in this post is contained in a single AWS CloudFormation template. To launch the solution, you create a CloudFormation stack from the template. Before we provide a detailed walkthrough for the solution, let us review the key benefits.

DNN Research and Development

During the DNN research phase, there is iterative exploration until you choose the preferred DNN architecture. During this phase, you may prefer to work in an isolated environment (for example, a dedicated desktop) with your favorite integrated development environment (IDE) (for example, Visual Studio Code or PyCharm). Developers like the ability to step through code using the IDE debugger. With the increasing support for imperative programming in modern ML frameworks, the ability to step through code in the research phase can accelerate DNN development.

The DLAMIs are preconfigured with NVIDIA GPU drivers, NVIDIA CUDA Toolkit, and low-level libraries such as Deep Neural Network library (cuDNN). Deep learning ML frameworks such as TensorFlow, PyTorch, and Apache MXNet are preconfigured.

After you launch the deep learning desktop, you need to install and open your favorite IDE, clone your GitHub repository, and you can start researching, developing, debugging, and visualizing your DNN. The acceleration of DNN research and development is the first key benefit for the solution described in this post.

Figure 2 – Developing on deep learning desktop with Visual Studio Code IDE

Elasticity in number of GPUs

During the research phase, you typically debug issues using a single GPU. However, as the DNN stabilizes, you scale horizontally across multiple GPUs in a single machine, and then across multiple machines.

Most modern deep learning frameworks support distributed training across multiple GPUs in a single machine, and also across multiple machines. However, when you use a single GPU in an on-premises desktop equipped with multiple GPUs, the idle GPUs are wasted. With the deep learning desktop solution described in this post, you can stop the deep learning desktop instance, change its Amazon EC2 instance type to another compatible type, restart the desktop, and get the exact number of GPUs you need at the moment. The elasticity in the number of GPUs in the deep learning desktop is the second key benefit for the solution described in this post.
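As an illustration of this elasticity, the P3 instance family maps instance types to GPU counts, so a small helper can pick the smallest type that satisfies the current need before the stop/modify/restart cycle:

```python
# Mapping from P3 instance types to their GPU counts (NVIDIA V100),
# used to pick the smallest type that provides the GPUs needed right now.
P3_GPU_COUNT = {
    "p3.2xlarge": 1,
    "p3.8xlarge": 4,
    "p3.16xlarge": 8,
}

def smallest_instance_for(gpus_needed):
    """Return the cheapest-fitting P3 type for the requested GPU count."""
    candidates = [t for t, n in P3_GPU_COUNT.items() if n >= gpus_needed]
    return min(candidates, key=lambda t: P3_GPU_COUNT[t])

instance_type = smallest_instance_for(4)
```

With the desktop stopped, changing the instance type to the returned value and restarting gives you exactly the GPUs you need, with no idle GPUs left paid for but unused.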

Integrated access to storage services 

Since the deep learning desktop is running in the AWS Cloud, you have access to all of the AWS data storage options, including the Amazon S3 object store, Amazon EFS, and Amazon FSx for Lustre. You can build your favorite data pipeline, supported by one or more of these storage options. You can also easily use the ML-IO library, a high-performance data access library for ML tasks with support for multiple data formats. The integrated access to highly durable and scalable object and file system storage services for accessing ML data is the third key benefit of the solution described in this post.

Integrated access to SageMaker

Once you have a stable version of your DNN, you need to find the right hyperparameters that lead to model convergence during training. Having tuned the hyperparameters, you need to run multiple trials over permutations of datasets and hyperparameters to fine-tune your models. Finally, you may need to prune and compile the models to optimize inference. To compress the training time, you may need to do distributed data parallel training across multiple GPUs in multiple machines. For all of these activities, the deep learning desktop is preconfigured to use SageMaker. You can use jupyter-lab notebooks running on the desktop to launch SageMaker training jobs for distributed training in infrastructure automatically managed by SageMaker.

Figure 3 – Submitting a SageMaker training job from deep learning desktop using Jupyter Lab notebook

The SageMaker training logs, TensorBoard summaries, and model checkpoints can be configured to be written to the Amazon EFS attached to the deep learning desktop. You can use the Linux command tail to monitor the logs, or start a TensorBoard server from the Conda environment on the deep learning desktop, and monitor the progress of your SageMaker training jobs. You can use a Jupyter Lab notebook running on the deep learning desktop to load a specific model checkpoint available on the Amazon EFS, and visualize the predictions from the model checkpoint, even while the SageMaker training job is still running.


Figure 4 – Locally monitoring the TensorBoard summaries from SageMaker training job

SageMaker offers many advanced capabilities, such as profiling ML training jobs using Amazon SageMaker Debugger, and these are easily accessible from the deep learning desktop. You can manage the training input data, model checkpoints, training logs, and TensorBoard summaries of your local iterative development, in addition to the distributed SageMaker training jobs, all from your deep learning desktop. The integrated access to SageMaker services is the fourth key benefit of the solution described in this post.


To get started, complete the steps in the following walkthrough. The complete source and reference documentation for this solution is available in the repository accompanying this post.

Create a CloudFormation stack

Create a stack on the CloudFormation console in your selected AWS Region using the CloudFormation template in your cloned GitHub repository. This CloudFormation stack creates IAM resources. When you create a CloudFormation stack using the console, you must confirm: "I acknowledge that AWS CloudFormation might create IAM resources."

To create the CloudFormation stack, you must specify values for the following input parameters (for the rest of the input parameters, default values are recommended):

  • DesktopAccessCIDR – Use the public internet address of your laptop as the base value for the CIDR.
  • DesktopInstanceType – For deep learning, the recommended value for this parameter is p3.2xlarge or larger.
  • DesktopVpcId – Select an Amazon Virtual Private Cloud (VPC) with at least one public subnet.
  • DesktopVpcSubnetId – Select a public subnet in your VPC.
  • DesktopSecurityGroupId – The specified security group must allow inbound access over ports 22 (SSH) and 8443 (NICE DCV) from your DesktopAccessCIDR, and must allow inbound access from within the security group to port 2049 and all network ports required for distributed SageMaker training in your subnet. If you leave it blank, the automatically created security group allows inbound access for SSH and NICE DCV from your DesktopAccessCIDR, and allows inbound access to all ports from within the security group.
  • KeyName – Select your SSH key pair name.
  • S3Bucket – Specify your S3 bucket name. The bucket can be empty.
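For example, the DesktopAccessCIDR value can be derived from your laptop's public IP address. Here is a minimal sketch using Python's standard ipaddress module; the example address and the /32 prefix choice are illustrative:

```python
import ipaddress

def desktop_access_cidr(public_ip: str) -> str:
    """Build a single-address CIDR from a public IP for DesktopAccessCIDR."""
    ip = ipaddress.ip_address(public_ip)  # raises ValueError on bad input
    return f"{ip}/32"  # /32 restricts access to exactly one source address

print(desktop_access_cidr("203.0.113.7"))  # → 203.0.113.7/32
```

A broader prefix (for example, your office network's /24) also works if several machines need access.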

See the documentation for details on all the input parameters.

Connect to the deep learning desktop

  • When the status for the stack in the CloudFormation console is CREATE_COMPLETE, find the deep learning desktop instance launched in your stack in the Amazon EC2 console.
  • Connect to the instance using SSH as user ubuntu, using your SSH key pair. When you connect using SSH, if you see the message, “Cloud init in progress. Machine will REBOOT after cloud init is complete!!”, disconnect and try again in about 15 minutes.
  • The desktop installs the NICE DCV server on first-time startup, and automatically reboots after the install is complete. If instead you see the message, “NICE DCV server is enabled!”, the desktop is ready for use.
  • Before you can connect to the desktop using the NICE DCV client, you need to set a new password for user ubuntu using the Bash command:
    sudo passwd ubuntu 
  • After you successfully set the new password for user ubuntu, exit the SSH connection. You are now ready to connect to the desktop using a suitable NICE DCV client (a non–web browser client is recommended) using the user ubuntu, and the new password.
  • NICE DCV client asks you to specify the server host and port to connect. For the server host, use the public IPv4 DNS address of the desktop Amazon EC2 instance available in Amazon EC2 console.
  • You do not need to specify the port, because the desktop is configured to use the default NICE DCV server port of 8443.
  • When you first login to the desktop using the NICE DCV client, you will be asked if you would like to upgrade the OS version. Do not upgrade the OS version!

Develop on the deep learning desktop

When you are connected to the desktop using the NICE DCV client, use the Ubuntu Software Center to install Visual Studio Code, or your favorite IDE. To view the available Conda environments containing the popular deep learning frameworks preconfigured on the desktop, open a desktop terminal, and run the Bash command:

conda env list

The deep learning desktop instance has secure access to the S3 bucket you specified when you created the CloudFormation stack. You can verify access to the S3 bucket by running the following Bash command (replace your-bucket-name with your S3 bucket name):

aws s3 ls your-bucket-name 

If your bucket is empty, a successful run of the previous command produces no output, which is normal.

An Amazon Elastic Block Store (Amazon EBS) root volume is attached to the instance. In addition, an Amazon EFS is mounted on the desktop at the value of EFSMountPath input parameter, which by default is /home/ubuntu/efs. You can use the Amazon EFS for staging deep learning input and output data.

Use SageMaker from the deep learning desktop

The deep learning desktop is preconfigured to use SageMaker. To get started with SageMaker examples in a JupyterLab notebook, launch the following Bash commands in a desktop terminal:

mkdir ~/git
cd ~/git
git clone https://github.com/aws/amazon-sagemaker-examples.git
jupyter-lab

This will start a JupyterLab notebook server in the terminal, and open a tab in your web browser. You can explore any of the SageMaker example notebooks. We recommend starting with the example Distributed Training of Mask-RCNN in SageMaker using Amazon EFS found at the following path in the cloned repository:


The preceding SageMaker example requires you to specify a subnet and a security group. Use the preconfigured OS environment variables as follows:

security_group_ids = [os.environ['desktop_sg_id']]
subnets = [os.environ['desktop_subnet_id']]

Stopping and restarting the desktop

You may safely reboot, stop, and restart the desktop instance at any time. The desktop will automatically mount the Amazon EFS at restart.

Clean Up

When you no longer need the deep learning desktop, you may delete the CloudFormation stack from the CloudFormation console. Deleting the stack will shut down the desktop instance, and delete the root Amazon EBS volume attached to the desktop. The Amazon EFS is not automatically deleted when you delete the stack.


In this post, we showed how to launch a desktop preconfigured with popular machine learning frameworks for research and development of deep learning neural networks. NICE DCV was used for high-performance visualization related to deep learning. AWS storage services were used for highly scalable access to deep learning data. Finally, Amazon SageMaker was used for distributed training of deep learning models.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Data Caching Across Microservices in a Serverless Architecture

Post Syndicated from Irfan Saleem original https://aws.amazon.com/blogs/architecture/data-caching-across-microservices-in-a-serverless-architecture/

Organizations are re-architecting their traditional monolithic applications to incorporate microservices. This helps them gain agility and scalability and accelerate time-to-market for new features.

Each microservice performs a single function. However, a microservice might need to retrieve and process data from multiple disparate sources. These can include data stores, legacy systems, or other shared services deployed on premises in data centers or in the cloud. These scenarios add latency to the microservice response time because multiple real-time calls are required to the backend systems. The latency often ranges from milliseconds to a few seconds depending on size of the data, network bandwidth, and processing logic. In certain scenarios, it makes sense to maintain a cache close to the microservices layer to improve performance by reducing or eliminating the need for the real-time backend calls.

Caches reduce latency and service-to-service communication of microservice architectures. A cache is a high-speed data storage layer that stores a subset of data. When data is requested from a cache, it is delivered faster than if you accessed the data’s primary storage location.

While working with our customers, we have observed use cases where data caching helps reduce latency in the microservices layer. Caching can be implemented in several ways. In this blog post, we discuss a couple of these use cases that customers have built. In both use cases, the microservices layer is created using Serverless on AWS offerings. It requires data from multiple data sources deployed locally in the cloud or on premises. The compute layer is built using AWS Lambda. Though Lambda functions are short-lived, the cached data can be used by subsequent instances of the same microservice to avoid backend calls.

Use case 1: On-demand cache to reduce real-time calls

In this use case, the Cache-Aside design pattern is used for lazy loading of frequently accessed data. This means that an object is only cached when it is requested by a consumer, and the respective microservice decides if the object is worth saving.

This use case is typically useful when the microservices layer makes multiple real-time calls to fetch and process data. These calls can be greatly reduced by caching frequently accessed data for a short period of time.
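A minimal in-memory sketch of the Cache-Aside pattern follows. The TTLCache class merely stands in for ElastiCache, and the function names, key format, and the 60-second TTL are assumptions for illustration:

```python
import time

class TTLCache:
    """In-memory stand-in for ElastiCache, for illustration only."""
    def __init__(self):
        self._store = {}  # object_key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:  # expired entries count as misses
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def delete(self, key):
        self._store.pop(key, None)

def get_billing_summary(cache, object_key, load_from_backend):
    """Cache-aside: return cached data, or load it, cache it, and return it."""
    cached = cache.get(object_key)
    if cached is not None:       # cache hit: no backend calls
        return cached
    data = load_from_backend()   # cache miss: real-time backend calls
    cache.set(object_key, data, ttl_seconds=60)
    return data
```

On the second request for the same object_key within the TTL, load_from_backend is never invoked, which is exactly the real-time call reduction described above.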

Let’s discuss a real-world scenario. Figure 1 shows a customer portal that provides a list of car loans, their status, and the net outstanding amount for a customer:

  • The Billing microservice gets a request. It then tries to get required objects (for example, the list of car loans, their status, and the net outstanding balance) from the cache using an object_key. If the information is available in the cache, a response is sent back to the requester using cached data.
  • If requested objects are not available in the cache (a cache miss), the Billing microservice makes multiple calls to local services, applications, and data sources to retrieve data. The result is compiled and sent back to the requester. It also resides in the cache for a short period of time.
  • Meanwhile, if a customer makes a payment using the Payment microservice, the balance amount in the cache must be invalidated/deleted. The Payment microservice processes the payment and invokes an asynchronous event (payment_processed) with the respective object key for the downstream processes that will remove respective objects from the cache.
  • The events are stored in the event store.
  • The CacheManager microservice gets the event (payment_processed) and makes a delete request to the cache for the respective object_key. If necessary, the CacheManager can also refresh cached data. It can call a resource within the Billing service or it can refresh data directly from the source system depending on the data refresh logic.

Figure 1. Reducing latency by caching frequently accessed data on demand
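The invalidation path in the last bullets can be sketched as follows. The event shape here is an assumption for illustration; real payloads would arrive via the event store:

```python
def invalidate_on_payment(cache: dict, event: dict) -> None:
    """CacheManager sketch: remove the cached object named in the event.
    A plain dict stands in for the cache; the event keys are hypothetical."""
    if event.get("detail-type") == "payment_processed":
        cache.pop(event["detail"]["object_key"], None)  # no-op if absent
```

A subsequent request for the invalidated object_key then misses the cache and reloads the fresh balance from the backend.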

Figure 2 shows the AWS services for use case 1. The microservices layer (Billing, Payments, and Profile) is created using Lambda. Amazon API Gateway exposes the Lambda functions as API operations to internal or external consumers.


Figure 2. Suggested AWS services for implementing use case 1

All three microservices are connected to the data cache and can save and retrieve objects from it. The cache is maintained in memory using Amazon ElastiCache. The data objects are kept in the cache for a short period of time. Every object has an associated TTL (time to live) value; after that period, the object expires. Custom events (such as payment_processed) are published to Amazon EventBridge for downstream processing.

Use case 2: Proactive caching of massive volumes of data

During large modernization and migration initiatives, not all data sources are colocated for a certain period of time. Some legacy systems, such as mainframes, require a longer decommissioning period. Many legacy backend systems process data through periodic batch jobs. In such scenarios, front-end applications can use cached data for a certain period of time (ranging from a few minutes to a few hours), depending on the nature of the data and its usage. The backend systems cannot handle the front-end application's extensive call volume with real-time calls.

In such scenarios, required data/objects can be identified up front and loaded directly into the cache through an automated process as shown in Figure 3:

  • An automated process loads data/objects in the cache during the initial load. Subsequent changes to the data sources (either in a mainframe database or another system of record) are captured and applied to the cache through an automated CDC (change data capture) pipeline.
  • Unlike use case 1, the microservices layer does not make real-time calls to load data into the cache. In this use case, microservices use data already cached for their processing.
  • However, the microservices layer may create an event if data in the cache is stale or specific objects have been changed by another service (for example, by the Payment service when a payment is made).
  • The events are stored in Event Manager. Upon receiving an event, the CacheManager initiates a backend process to refresh stale data on demand.
  • All data changes are sent directly to the system of record.

Figure 3. Eliminating real-time calls by caching massive data volumes proactively
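The CDC-apply step in the first bullet can be sketched as below. The record shape ({'operation', 'key', 'value'}) is an assumption about the pipeline's output, and a plain dict stands in for the cache:

```python
def apply_cdc_record(cache: dict, record: dict) -> None:
    """Apply one change-data-capture record to the cache.
    Inserts and updates overwrite the cached object; deletes remove it."""
    op = record["operation"]
    if op in ("insert", "update"):
        cache[record["key"]] = record["value"]
    elif op == "delete":
        cache.pop(record["key"], None)
```

Replaying the CDC stream through this function keeps the cache converging toward the system of record without any real-time calls from the microservices layer.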

As shown in Figure 4, the data objects are maintained in Amazon DynamoDB, which provides low-latency data access at any scale. The data retrieval is managed through DynamoDB Accelerator (DAX), a fully managed, highly available, in-memory cache. It delivers up to a 10 times performance improvement, even at millions of requests per second.


Figure 4. Suggested AWS services for implementing use case 2

The data in DynamoDB can be loaded through different methods depending on the customer use case and technology landscape. API Gateway, Lambda, and EventBridge provide similar functionality as described in use case 1.

Use case 2 is also beneficial in scenarios where front-end applications must cache data for an extended period of time, such as a customer’s shopping cart.

In addition to caching, the following best practices can also be used to reduce latency and to improve performance within the Lambda compute layer:


The microservices architecture allows you to build several caching layers depending on your use case. In this blog, we discussed data caching within the compute layer to reduce latency when data is retrieved from disparate sources. The information from use case 1 can help you reduce real-time calls to your back-end system by saving frequently used data to the cache. Use case 2 helps you maintain large volumes of data in caches for extended periods of time when real-time calls to the backend system are not possible.

Digitally Optimize your Factory Issue Resolution with Amazon Virtual Andon

Post Syndicated from Ajay Swamy original https://aws.amazon.com/blogs/architecture/digitally-optimize-your-factory-issue-resolution-with-amazon-virtual-andon/

As a manufacturing enterprise, maximizing your operational efficiency and optimizing output are critical in a competitive global market. Global black swan events such as COVID-19 have necessitated the ability to monitor the factory floor remotely and respond to issues actively.

Amazon Virtual Andon (AVA) is a digital notification system that helps factory personnel raise, route, track, and resolve factory floor issues more quickly. The solution was incubated in the Amazon Fulfillment Centers to monitor and resolve issues. It is now available as a self-deployable solution that manufacturers can use to optimize productivity.

An intuitive UI to deliver quick value

Amazon Virtual Andon comes with an intuitive user interface (UI) to set up your factory asset hierarchy, so you can model your real factory floor equipment (Figure 1). With AVA, you can create events (incidents) that are likely to occur across your factory. You can granularly tie events to the underlying stations and processes. Factory personnel can monitor manufacturing equipment for events, quickly raise issues, route the problem to specialized engineers, and ensure tracking and resolution of those issues.

Setup and configuration

Figure 1. Create your factory hierarchy


With AVA, you can set up your factory configuration quickly (see Figure 2a, 2b following). You can choose to create multiple factory sites, areas, stations, processes, and devices with the setup UI. You can also create users and assign them to Admin, Manager, Engineer, and Associate roles. You can set fine-grained controls on what users can access within the Permissions screen.

Figure 2a. Create new factory areas


Figure 2b. Create factory processes


Client screen

In Figure 3, you can see AVA's graphical client screen, where floor associates can raise issues when breakdowns occur. Associates can raise an issue by clicking on the issue tile. Once issues are raised, they are routed via Amazon Simple Notification Service (SNS) to the subscribed email addresses or phone numbers of the specialized engineers who resolve those specific breakdown events.

Figure 3. Raise issues screen


Observer screen

The issues routed to the engineers are visible on the observer screen, shown in Figure 4. The engineers can confirm that the problem is being addressed. This updates the status on the client screen, and allows for seamless communication between different users on the factory floor. The engineers can then close the issues by specifying a root cause and descriptive text. This enables prescriptive issue resolution, and can assist with future problems.

Figure 4. Issue response screen


Metrics reporting and API operations

AVA also provides robust reporting, shown in Figure 5. This allows managers to view issue metrics for the selected sites and areas over the last seven days. Metric reporting lets you observe trends across different areas of your factory floor and optimize accordingly.

Figure 5. Metrics and reporting


Lastly, AVA comes with a rich set of GraphQL APIs that you can use to create sites and issues, and to monitor and resolve problems programmatically. In addition, the APIs allow for custom integration with external systems such as a factory MES (Manufacturing Execution System), for automated issue creation and resolution.

AVA deployment architecture

Deploying this solution with the default parameters builds the following environment in the AWS Cloud:

Figure 6. Amazon Virtual Andon deployment architecture


  1. An Amazon CloudFront web interface deploys into an Amazon Simple Storage Service (S3) bucket configured for web hosting.
  2. An Amazon Cognito user pool enables the solution’s administrators to register users and groups using the web interface.
  3. AWS AppSync GraphQL APIs and AWS Amplify power the web interface. Amazon DynamoDB tables store the factory data.
  4. An AWS IoT rules engine helps users monitor manufacturing workstations or devices for events. It then routes the events to the correct engineer for resolution in real time.
  5. Authorized users can interact with and receive notifications. Using an AWS Lambda function and Amazon SNS, you can send emails and SMS notifications.
  6. Issues created, acknowledged, and closed in the web interface are recorded and updated using AWS AppSync and DynamoDB.


Amazon Virtual Andon (AVA) is a self-deployable solution that helps you resolve factory floor issues more quickly and accurately. With AVA’s intuitive UI and open APIs, you can optimize issue resolution and focus on optimizing production output. You also can holistically monitor your manufacturing site remotely via an intuitive web application. Get started with Amazon Virtual Andon today.

Perform Chaos Testing on your Amazon Aurora Cluster

Post Syndicated from Anthony Pasquariello original https://aws.amazon.com/blogs/architecture/perform-chaos-testing-on-your-amazon-aurora-cluster/

“Everything fails all the time” – Werner Vogels, AWS CTO

In 2010, Netflix introduced a tool called Chaos Monkey, used for introducing faults in a production environment. Chaos Monkey led to the birth of chaos engineering, where teams test their live applications by purposefully injecting faults. Observations are then used to take corrective action and increase the resiliency of applications.

In this blog, you will learn about the fault injection capabilities available in Amazon Aurora for simulating various database faults.

Chaos Experiments

Chaos experiments consist of:

  • Understand the application baseline: the application’s steady-state behavior
  • Design an experiment: ask “What can go wrong?” to identify failure scenarios
  • Run the experiment: introduce faults in the application environment
  • Observe and correct: redesign apps or infrastructure for fault tolerance

Chaos experiments require fault simulation across distributed components of the application. Amazon Aurora provides a set of fault simulation capabilities that may be used by teams to exercise chaos experiments against their applications.
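The experiment loop above can be sketched as a minimal harness. Every callable here is a placeholder you would wire to your own instrumentation and fault-injection mechanism; the names and the higher-is-worse metric are assumptions:

```python
def run_chaos_experiment(measure_baseline, inject_fault, observe, samples=10):
    """Capture steady state, inject a fault, then collect observations
    to compare against the baseline (e.g., p99 latency in milliseconds)."""
    baseline = measure_baseline()                     # steady-state metric
    inject_fault()                                    # e.g., run a fault-injection query
    observations = [observe() for _ in range(samples)]
    regressions = [o for o in observations if o > baseline]
    return baseline, regressions
```

The returned regressions list tells the team how far, and how often, the application deviated from its steady state during the fault window.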

Amazon Aurora fault injection

Amazon Aurora is a fully managed database service that is compatible with MySQL and PostgreSQL. Aurora is highly fault tolerant due to its six-way replicated storage architecture. In order to test the resiliency of an application built with Aurora, developers can leverage the native fault injection features to design chaos experiments. The outcome of the experiments gives a better understanding of the blast radius, depth of monitoring required, and the need to evaluate event response playbooks.

In this section, we will describe the various fault injection scenarios that you can use for designing your own experiments. We’ll show you how to conduct the experiment and use the results. This will make your application more resilient and prepared for an actual event.

Note that availability of the fault injection feature is dependent on the version of MySQL and PostgreSQL.

Figure 1. Fault injection overview


1. Testing an instance crash

An Aurora cluster can have one primary and up to 15 read replicas. If the primary instance fails, one of the replicas becomes the primary. Applications must be designed to recover from these instance failures as soon as possible to have minimal impact on the end-user experience.

The instance crash fault injection simulates failure of the instance/dispatcher/node in the Aurora database cluster. Fault injection may be carried out on the primary or replicas by running the API against the target instance.

Example: Aurora PostgreSQL for instance crash simulation

The query following will simulate a database instance crash:

SELECT aurora_inject_crash('instance');

Since this is a simulation, it does not lead to a failover to the replica. As an alternative to using this API, you can carry out an actual failover by using the AWS Management Console or AWS CLI.

The team should observe the change in the application’s behavior to understand the impact of the instance failure. Take corrective actions to reduce the impact of such failures on the application.

A long recovery time on the application would require the team to reduce the Domain Name System (DNS) time-to-live (TTL) for the DB connections. As a general best practice, the Aurora database cluster should have at least one replica.
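Recovering quickly from an instance failure usually comes down to reconnecting and retrying against the cluster endpoint. Here is a hedged sketch; run_query is a placeholder for your real database call, and the retry counts and delays are assumptions to tune:

```python
import time

def query_with_retry(run_query, retries=5, base_delay=0.5):
    """Retry a query with exponential backoff so the application recovers
    once failover promotes a replica. run_query should raise ConnectionError
    while the writer endpoint is unavailable."""
    for attempt in range(retries):
        try:
            return run_query()
        except ConnectionError:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise ConnectionError("writer unavailable after failover window")
```

During the chaos experiment, the observed number of retries before success is a direct measure of the application's recovery time.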

2. Testing the replica failure

Aurora manages asynchronous replication between cluster nodes within a cluster. The typical replication lag is under 100 milliseconds. Network slowness or issues on the nodes may lead to an increase in replication lag between writer and replica nodes.

The replica failure fault injection allows you to simulate replication failure across one or more replicas. Note that this type of fault injection applies only to a DB cluster that has at least one read replica.

Replica failure manifests itself as stale data read by the application that is connecting to the replicas. The specific functional impact on the application depends on the sensitivity to the freshness of data. Note that this fault injection mechanism does not apply to the native replication supported mechanisms in PostgreSQL and MySQL databases.

Example: Aurora PostgreSQL for replica failure

The statement following will simulate 100% failure of replica named ‘my-replica’ for 20 seconds.

SELECT aurora_inject_replica_failure(100, 20, 'my-replica');

The team must observe the behavior of the application from the data sensitivity perspective. If the observed lag is unacceptable, the team must evaluate corrective actions such as vertical scaling of database instances and query optimization. As a best practice, the team should monitor the replication lag and take proactive actions to address it.
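One corrective measure, for reads that are sensitive to data freshness, is to route them back to the writer whenever lag exceeds a threshold. A sketch follows; the 100 ms default mirrors the typical lag noted above and is an assumption about your application's tolerance:

```python
def choose_endpoint(replica_lag_ms: float, max_staleness_ms: float = 100.0) -> str:
    """Route freshness-sensitive reads to the writer when replica lag
    exceeds what the application tolerates; otherwise use a reader."""
    return "writer" if replica_lag_ms > max_staleness_ms else "reader"
```

Routing this way trades extra load on the writer for guaranteed freshness, so the threshold should match the data sensitivity observed in the experiment.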

3. Testing the disk failure

Aurora’s storage volume consists of six copies of data across three Availability Zones (refer to the preceding diagram). Aurora has an inherent ability to repair itself for failures in the storage components. This high reliability is achieved by way of a quorum model: reads require only 3 of 6 copies and writes require 4 of 6 copies to be available. However, there may still be a transient impact on the application, depending on how widespread the issue is.
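The quorum arithmetic can be made concrete with a small sketch (illustrative only):

```python
def quorum_status(healthy_copies: int) -> dict:
    """Aurora keeps six copies of data across three AZs; reads need a
    quorum of 3 of 6 healthy copies and writes need 4 of 6."""
    if not 0 <= healthy_copies <= 6:
        raise ValueError("Aurora maintains exactly six copies")
    return {"reads": healthy_copies >= 3, "writes": healthy_copies >= 4}
```

For example, losing an entire Availability Zone (two copies) leaves four healthy copies, so both reads and writes remain available; losing a third copy takes writes below quorum while reads continue.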

The disk failure injection capability allows you to simulate failures of storage nodes and partial failure of disks. The severity of failure can be set as a percentage value. The simulation continues only for the specified amount of time. There is no impact on the actual data on the storage nodes and the disk.

Example: Aurora PostgreSQL for disk failure simulation

You may get the number of disks (for the index parameter) on your cluster using the query:

SELECT disks FROM aurora_show_volume_status()

The query following will simulate 75% failure on disk with index 15. The simulation will end in 20 seconds.

SELECT aurora_inject_disk_failure(75, 15, true, 20)

Applications may experience temporary failures due to this fault injection and should be able to gracefully recover from it. If the recovery time is higher than a threshold, or the application has a complete failure, the team can redesign their application.

4. Disk congestion fault

Disk congestion usually happens because of heavy I/O traffic against the storage devices. The impact may range from degraded application performance to complete application failures.

Aurora provides the capability to simulate disk congestion without synthetic SQL load against the database. With this fault injection mechanism, you can gain a better understanding of the performance characteristics of the application under heavy I/O spikes.

Example: Aurora PostgreSQL for disk congestion simulation

You may get the number of disks (for the index parameter) on your cluster using the query:

SELECT disks FROM aurora_show_volume_status()

The query following will simulate 100% disk congestion for 20 seconds on the disk with index 15. The simulated delay will be between 30 and 40 milliseconds.

SELECT aurora_inject_disk_congestion(100, 15, true, 20, 30, 40)

If the observed behavior is unacceptable, then the team must carefully consider the load characteristics of their application. Depending on the observations, corrective action may include query optimization, indexing, vertical scaling of the database instances, and adding more replicas.


A chaos experiment involves injecting a fault in a production environment and then observing the application behavior. The outcome of the experiment helps the team identify application weaknesses and evaluate event response processes. Amazon Aurora natively provides fault-injection capabilities that can be used by teams to conduct chaos experiments for database failure scenarios. Aurora can be used for simulating instance failure, replication failure, disk failures, and disk congestion. Try out these capabilities in Aurora to make your applications more robust and resilient from database failures.

Field Notes: How Sportradar Accelerated Data Recovery Using AWS Services

Post Syndicated from Mithil Prasad original https://aws.amazon.com/blogs/architecture/field-notes-how-sportradar-accelerated-data-recovery-using-aws-services/

This post was co-written by Mithil Prasad, AWS Senior Customer Solutions Manager, Patrick Gryczkat, AWS Solutions Architect, Ben Burdsall, CTO at Sportradar and Justin Shreve, Director of Engineering at Sportradar. 

Ransomware is a type of malware that encrypts data, effectively locking those affected out of their own data and demanding a payment to decrypt it. The frequency of ransomware attacks has increased over the past year, with local governments, hospitals, and private companies experiencing cases of ransomware.

For Sportradar, providing their customers with access to high-quality sports data and insights is central to their business. Ensuring that their systems are designed securely and in a way that minimizes the possibility of a ransomware attack is a top priority. While ransomware attacks can occur both on premises and in the cloud, AWS services offer increased visibility and native encryption and backup capabilities. These help prevent and minimize the likelihood and impact of a ransomware attack.

Recovery, backup, and the ability to go back to a known good state is best practice. To further expand their defense and diminish the value of ransom, the Sportradar architecture team set out to leverage their AWS Step Functions expertise to minimize recovery time. The team’s strategy centered on achieving a short deployment process. This process commoditized their production environment, allowing them to spin up interchangeable environments in new isolated AWS accounts, pulling in data from external and isolated sources, and diminishing the value of a production environment as a ransom target. This also minimized the impact of a potential data destruction event.

By partnering with AWS, Sportradar was able to build a secure and resilient infrastructure to provide timely recovery of their service in the event of data destruction by an unauthorized third party. Sportradar automated the deployment of their application to a new AWS account and established a new isolation boundary from an account with compromised resources. In this blog post, we show how the Sportradar architecture team used a combination of AWS CodePipeline and AWS Step Functions to automate and reduce their deployment time to less than two hours.

Solution Overview

Sportradar’s solution uses AWS Step Functions to orchestrate the deployment of resources, the recovery of data, and the deployment of application code, and to navigate all necessary dependencies for order of deployment. While deployment can be orchestrated through CodePipeline, Sportradar used their familiarity with Step Functions to create a quick and repeatable deployment process for their environment.
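Dependency-ordered deployment of the kind described here can be sketched with a topological sort. The resource names and dependency graph below are illustrative assumptions, not Sportradar's actual Step Function states:

```python
from graphlib import TopologicalSorter

# Hypothetical resource dependencies: each key deploys only after its
# dependency set has deployed.
DEPENDENCIES = {
    "vpc": set(),
    "security_groups": {"vpc"},
    "secrets": {"security_groups"},
    "in_memory_databases": {"secrets"},
    "application": {"in_memory_databases"},
}

def deployment_order(dependencies):
    """Resolve a dependency-respecting deployment order, as an orchestrator
    Step Function must when navigating order-of-deployment constraints."""
    return list(TopologicalSorter(dependencies).static_order())
```

Independent branches of the graph (resources with no mutual dependency) can also be deployed in parallel, which is how Sportradar shortened their overall recovery time.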

Sportradar’s solution to a ransomware Disaster Recovery scenario has also provided them with a reliable and accelerated process for deploying development and testing environments. Developers are now able to scale testing and development environments up and down as needed.  This has allowed their Development and QA teams to follow the pace of feature development, versus weekly or bi-weekly feature release and testing schedules tied to a single testing environment.

Reference Architecture Showing How Sportradar Accelerated Data Recovery

Figure 1 – Reference Architecture Diagram showing Automated Deployment Flow


The prerequisites for implementing this deployment strategy are:

  • An implemented database backup policy
  • Ideally, data should be backed up to a data bunker AWS account outside the scope of the environment you are looking to protect. That way, in the event of a ransomware attack, your backed-up data is isolated from the affected environment and account
  • Application code within a GitHub repository
  • Separation of duties
  • Access to and responsibility for the backups and the GitHub repository should be assigned to different stakeholders, to reduce the likelihood of both being impacted by a single security breach
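One way to populate a data bunker account is to share database snapshots with it. The sketch below is an illustration only: the account ID is a placeholder, and it uses the boto3 `modify_db_snapshot_attribute` call, which grants another account permission to copy or restore an RDS snapshot.

```python
BUNKER_ACCOUNT_ID = "111122223333"  # placeholder data bunker account ID

def share_snapshot_params(snapshot_id, bunker_account):
    """Parameters granting another account restore access to a DB snapshot."""
    return {
        "DBSnapshotIdentifier": snapshot_id,
        "AttributeName": "restore",
        "ValuesToAdd": [bunker_account],
    }

def share_with_bunker(snapshot_id):
    import boto3  # deferred import; the helper above has no AWS dependency
    rds = boto3.client("rds")
    rds.modify_db_snapshot_attribute(
        **share_snapshot_params(snapshot_id, BUNKER_ACCOUNT_ID)
    )
```

The bunker account would then copy the shared snapshot into its own storage, so the copy survives even if the source account is compromised.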

Step 1: New Account Setup 

Once data destruction is identified, the first step in Sportradar's process is to use a pre-created runbook to create a new AWS account. A new account is created in case the malicious actors who encrypted the application's data have access not just to the application, but also to the AWS account the application resides in.

The runbook sets up a VPC for a selected Region, as well as spinning up the following resources:

  • Security Groups with network connectivity to their git repository (in this case GitLab), IAM Roles for their resources
  • KMS Keys
  • Amazon S3 buckets with CloudFormation deployment templates
  • CodeBuild, CodeDeploy, and CodePipeline

Step 2: Deploying Secrets

It is a security best practice to ensure that no secrets are hard-coded into your application code. So, after account setup is complete, the new AWS account's access keys and the selected AWS Region are passed into CodePipeline variables. The application secrets are then deployed to AWS Systems Manager Parameter Store.
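A hedged sketch of this step, assuming secrets are deployed with boto3 as encrypted SecureString parameters (the path scheme and names are illustrative, not Sportradar's):

```python
def parameter_path(app, stage, name):
    """Hierarchical parameter name, e.g. /myapp/prod/db_password."""
    return f"/{app}/{stage}/{name}"

def deploy_secrets(app, stage, secrets):
    import boto3  # deferred import; the path helper stays AWS-free
    ssm = boto3.client("ssm")
    for name, value in secrets.items():
        ssm.put_parameter(
            Name=parameter_path(app, stage, name),
            Value=value,
            Type="SecureString",  # encrypted at rest with a KMS key
            Overwrite=True,
        )
```

At runtime, the application reads these parameters back instead of carrying secrets in its deployment artifacts.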

Step 3: Deploying Orchestrator Step Function and In-Memory Databases

To optimize deployment time, Sportradar decided to leave the deployment of their in-memory databases running on Amazon EC2 outside of their orchestrator Step Function.  They deployed the database using a CloudFormation template from their CodePipeline. This was in parallel with the deployment of the Step Function, which orchestrates the rest of their deployment.

Step 4: Step Function Orchestrates the Deployment of Microservices and Alarms

The orchestrator Step Function coordinates the deployment of Sportradar's microservices solutions, deploying 10+ Amazon RDS instances and restoring each dataset from DB snapshots. Following that, 80+ producer Amazon SQS queues and S3 buckets for data staging are deployed. After the successful deployment of the SQS queues, the Lambda functions for data ingestion and 15+ data processing Step Functions are deployed to begin pulling in data from various sources into the solution.

Then the API Gateways and Lambda functions that provide the API layer for each of the microservices are deployed in front of the restored RDS instances. Finally, 300+ Amazon CloudWatch alarms are created to monitor the environment and trigger necessary alerts. In total, Sportradar's deployment process brings online: 15+ Step Functions for data processing, 30+ microservices, 10+ Amazon RDS instances with over 150 GB of data, 80+ SQS queues, 180+ Lambda functions, a CDN for the UI, Amazon ElastiCache, and 300+ CloudWatch alarms to monitor the applications. In all, that is over 600 resources deployed, with data restored consistently, in less than 2 hours total.

Reference Architecture Diagram for How Sportradar Accelerated Data Recovery Using AWS Services

Figure 2 – Reference Architecture Diagram of the Recovered Application


In this blog, we showed how Sportradar's team used Step Functions to accelerate their deployments and walked through an example disaster recovery scenario. Step Functions can be used to orchestrate the deployment and configuration of a new environment, allowing complex environments to be deployed in stages, and for those stages to appropriately wait on their dependencies.

For examples of Step Functions being used in different orchestration scenarios, check out how Step Functions acts as an orchestrator for ETLs in Orchestrate multiple ETL jobs using AWS Step Functions and AWS Lambda and Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy. For migrations of Amazon EC2 based workloads, read more about CloudEndure, Migrating workloads across AWS Regions with CloudEndure Migration.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Ben Burdsall


Ben is currently the chief technology officer of Sportradar, a data provider to the sporting industry, where he leads a product and engineering team of more than 800. Before that, Ben was part of the global leadership team of Worldpay.

Justin Shreve


Justin is Director of Engineering at Sportradar, leading an international team to build an innovative enterprise sports analytics platform.

Using Amazon Macie to Validate S3 Bucket Data Classification

Post Syndicated from Bill Magee original https://aws.amazon.com/blogs/architecture/using-amazon-macie-to-validate-s3-bucket-data-classification/

Securing sensitive information is a high priority for organizations for many reasons. At the same time, organizations are looking for ways to empower development teams to stay agile and innovative. Centralized security teams strive to create systems that align to the needs of the development teams, rather than mandating how those teams must operate.

Security teams who create automation for the discovery of sensitive data have some issues to consider. If development teams are able to self-provision data storage, how does the security team protect that data? If teams have a business need to store sensitive data, they must consider how, where, and with what safeguards that data is stored.

Let’s look at how we can set up Amazon Macie to validate data classifications provided by decentralized software development teams. Macie is a fully managed service that uses machine learning (ML) to discover sensitive data in AWS. If you are not familiar with Macie, read New – Enhanced Amazon Macie Now Available with Substantially Reduced Pricing.

Data classification is part of the security pillar of a Well-Architected application. Following the guidelines provided in the AWS Well-Architected Framework, we can develop a resource-tagging scheme that fits our needs.

Overview of decentralized data validation system

In our example, we have multiple levels of data classification that represent different levels of risk associated with each classification. When a software development team creates a new Amazon Simple Storage Service (S3) bucket, they are responsible for labeling that bucket with a tag. This tag represents the classification of data stored in that bucket. The security team must maintain a system to validate that the data in those buckets meets the classification specified by the development teams.

This separation of roles and responsibilities for development and security teams who work independently requires a validation system that’s decoupled from S3 bucket creation. It should automatically detect new buckets or data in the existing buckets, and validate the data against the assigned classification tags. It should also notify the appropriate development teams of misclassified or unclassified buckets in a timely manner. These notifications can be through standard notification channels, such as email or Slack channel notifications.

Validation and alerts with AWS services

Figure 1. Validation system for Data Classification

Figure 1. Validation system for data classification

We assume that teams are permitted to create S3 buckets and we will use AWS Config to enforce the following required tags: DataClassification and SupportSNSTopic. The DataClassification tag indicates what type of data is allowed in the bucket. The SupportSNSTopic tag indicates an Amazon Simple Notification Service (SNS) topic. If there are issues found with the data in the bucket, a message is published to the topic, and Amazon SNS will deliver an alert. For example, if there is personally identifiable information (PII) data in a bucket that is classified as non-sensitive, the system will alert the owners of the bucket.

Macie is configured to scan all S3 buckets on a scheduled basis. This configuration ensures that any new buckets, and any new data placed in existing buckets, are analyzed the next time the Macie job runs.

Macie provides several managed data identifiers for discovering and classifying the data. These include bank account numbers, credit card information, authentication credentials, PII, and more. You can also create custom identifiers (or rules) to gather information not covered by the managed identifiers.

Macie integrates with Amazon EventBridge to allow us to capture data classification events and route them to one or more destinations for reporting and alerting needs. In our configuration, the event invokes an AWS Lambda function. The Lambda function validates the data classification inferred by Macie against the classification specified in the DataClassification tag, using custom business logic. If a data classification violation is found, the Lambda function then sends a message to the Amazon SNS topic specified in the SupportSNSTopic tag.
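The validation logic in such a Lambda function might look like the following sketch. The classification tag values and the allowed-category policy are hypothetical, and the Macie finding shape is abbreviated to the fields used here:

```python
# Hypothetical policy: which Macie sensitive-data categories are
# acceptable for each DataClassification tag value
ALLOWED_CATEGORIES = {
    "public": set(),
    "internal": set(),
    "confidential": {"FINANCIAL_INFORMATION"},
    "restricted": {"FINANCIAL_INFORMATION", "PERSONAL_INFORMATION", "CREDENTIALS"},
}

def find_violations(classification, detected_categories):
    """Return detected categories the bucket's classification does not allow."""
    allowed = ALLOWED_CATEGORIES.get(classification, set())
    return sorted(set(detected_categories) - allowed)

def handle_finding(event):
    import boto3  # deferred import keeps find_violations testable offline
    detail = event["detail"]
    bucket = detail["resourcesAffected"]["s3Bucket"]
    tags = {t["key"]: t["value"] for t in bucket.get("tags", [])}
    categories = [
        d["category"]
        for d in detail["classificationDetails"]["result"]["sensitiveData"]
    ]
    violations = find_violations(tags.get("DataClassification", "public"), categories)
    if violations:
        # Alert the owning team through the topic named in the bucket's tag
        boto3.client("sns").publish(
            TopicArn=tags["SupportSNSTopic"],
            Subject=f"Data classification violation: {bucket['name']}",
            Message=f"Disallowed categories found: {', '.join(violations)}",
        )
```

Keeping the policy table separate from the event plumbing makes the business logic easy to unit test and to adjust as classification levels evolve.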

The Lambda function also creates custom metrics and sends those to Amazon CloudWatch. The metrics are organized by engineering team and severity. This allows the security team to create a dashboard of metrics based on the Macie findings. The findings can also be filtered per engineering team and severity to determine which teams need to be contacted to ensure remediation.


This solution provides a centralized security team with the tools it needs. The team can validate the data classification of an Amazon S3 bucket that is self-provisioned by a development team. New Amazon S3 buckets are automatically included in the Macie jobs, and alerts are only sent out if the data in a bucket does not conform to the classification specified by the development team. The data auditing process is loosely coupled with the Amazon S3 bucket creation process, enabling self-service capabilities for development teams while ensuring proper data classification. Your teams can stay agile and innovative, while maintaining a strong security posture.

Learn more about Amazon Macie and Data Classification.

Architecting a Highly Available Serverless, Microservices-Based Ecommerce Site

Post Syndicated from Senthil Kumar original https://aws.amazon.com/blogs/architecture/architecting-a-highly-available-serverless-microservices-based-ecommerce-site/

The number of ecommerce vendors is growing globally, and they often handle heavy traffic at different times of the day and different days of the year. This, in addition to building, managing, and maintaining IT infrastructure in on-premises data centers, can present challenges to ecommerce businesses' scalability and growth.

This blog provides a Serverless on AWS solution that offloads the undifferentiated heavy lifting of managing resources and ensures your business's architecture can handle peak traffic.

Common architecture set up versus serverless solution

The following sections describe a common monolithic architecture and our suggested alternative approach: setting up microservices-based order submission and product search modules. These modules are independently deployable and scalable.

Typical monolithic architecture

Figure 1 shows how a typical on-premises ecommerce infrastructure with different tiers is set up:

  • Web servers serve static assets and proxy requests to application servers
  • Application servers process ecommerce business logic and authentication logic
  • Databases store user and other dynamic data
  • Firewall and load balancers provide network components for load balancing and network security
Monolithic on-premises ecommerce infrastructure with different tiers

Figure 1. Monolithic on-premises ecommerce infrastructure with different tiers

Monolithic architecture tightly couples different layers of the application. This prevents them from being independently deployed and scaled.

Microservices-based modules

Order submission workflow module

This three-layer architecture can be set up in the AWS Cloud using serverless components:

  • Static content layer (Amazon CloudFront and Amazon Simple Storage Service (Amazon S3)). This layer stores static assets on Amazon S3. By using CloudFront in front of the S3 storage cache, you can deliver assets to customers globally with low latency and high transfer speeds.
  • Authentication layer (Amazon Cognito or customer proprietary layer). Ecommerce sites deliver authenticated and unauthenticated content to the user. With Amazon Cognito, you can manage users’ sign-up, sign-in, and access controls, so this authentication layer ensures that only authenticated users have access to secure data.
  • Dynamic content layer (AWS Lambda and Amazon DynamoDB). All business logic required for the ecommerce site is handled by the dynamic content layer. Using Lambda and DynamoDB ensures that these components are scalable and can handle peak traffic.

As shown in Figure 2, the order submission workflow is split into two sections: synchronous and asynchronous.

By splitting the order submission workflow, you allow users to submit their order details and get an orderId. This makes sure that they don’t have to wait for backend processing to complete. This helps unburden your architecture during peak shopping periods when the backend process can get busy.

Microservices-based order submission workflow

Figure 2. Microservices-based order submission workflow

The details of the order, such as credit card information in encrypted form, shipping information, etc., are stored in DynamoDB. This action invokes an asynchronous workflow managed by AWS Step Functions.
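A minimal sketch of the synchronous half of this flow, assuming a Lambda handler behind API Gateway (the table name and state machine ARN are placeholders, not part of the original solution):

```python
import json
import uuid

def build_order_item(order):
    """Attach a generated orderId and an initial status to the order."""
    return {"orderId": str(uuid.uuid4()), "status": "SUBMITTED", **order}

def submit_order(event, context):
    import boto3  # deferred import; build_order_item stays AWS-free
    item = build_order_item(json.loads(event["body"]))
    boto3.resource("dynamodb").Table("Orders").put_item(Item=item)
    # Hand off to the asynchronous Step Functions workflow (Figure 3)
    boto3.client("stepfunctions").start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:OrderWorkflow",  # placeholder
        input=json.dumps(item),
    )
    # The caller gets an orderId immediately, without waiting on payment
    return {"statusCode": 202, "body": json.dumps({"orderId": item["orderId"]})}
```

Returning `202 Accepted` with the orderId signals that processing continues asynchronously; the client can poll or receive an SNS notification when the status changes.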

Figure 3 shows sample Step Functions states from the asynchronous process. In this scenario, you are using external payment processing and shipping systems. When both systems get busy, Step Functions can manage long-running transactions as well as the required retry logic. It uses a decision-based business workflow: if a payment transaction fails, the order can be canceled; once payment is successful, the order can proceed.

Amazon Simple Notification Service (Amazon SNS) notifies users whenever their order status changes. You can even extend Step Functions to have it react based on status of shipping.

Sample AWS Step Functions asynchronous workflow that uses external payment processing service and shipping system

Figure 3. Sample AWS Step Functions asynchronous workflow that uses external payment processing service and shipping system

Product search module

Our product search module is set up using the following serverless components:

  • Amazon Elasticsearch Service (Amazon ES) stores product data, which is updated whenever product-related data changes.
  • Lambda formats the data.
  • Amazon API Gateway allows users to search without authentication. As shown in Figure 4, searching for products on the ecommerce portal does not require users to log in. All traffic via API Gateway is unauthenticated.
Microservices-based product search workflow module with dynamic traffic through API Gateway

Figure 4. Microservices-based product search workflow module with dynamic traffic through API Gateway
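A hedged sketch of the search Lambda behind API Gateway: the index field names are hypothetical, and the signed HTTP call to the Amazon ES domain endpoint is omitted to keep the example self-contained.

```python
import json

def build_search_query(term, size=10):
    """Elasticsearch query body matching the term across product fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["name", "description", "tags"],  # hypothetical index fields
            }
        },
    }

def search_handler(event, context):
    # Unauthenticated GET /search?q=term routed through API Gateway
    term = event["queryStringParameters"]["q"]
    body = build_search_query(term)
    # A signed request to the Amazon ES domain endpoint would go here;
    # this sketch just echoes the query it would send.
    return {"statusCode": 200, "body": json.dumps(body)}
```

Because the endpoint is unauthenticated, throttling and input validation at the API Gateway layer are worth considering in a real deployment.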

Replicating data across Regions

If your ecommerce application runs on multiple Regions, it may require the content and data to be replicated. This allows the application to handle local traffic from that Region and also act as a failover option if the application fails in another Region. The content and data are replicated using the multi-Region replication features of Amazon S3 and DynamoDB global tables.

Figure 5 shows a multi-Region ecommerce site built on AWS with serverless services. It uses the following features to make sure that data between all Regions is in sync for data/assets that do not need data residency compliance:

  • Amazon S3 multi-Region replication keeps static assets in sync across Regions.
  • DynamoDB global tables keep dynamic data in sync across Regions.

Assets that are specific to a Region are stored in Region-specific buckets.
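S3 replication can be configured with boto3's `put_bucket_replication`. The following is an illustrative sketch rather than a complete multi-Region setup; the role and bucket ARNs are placeholders:

```python
def replication_config(role_arn, destination_bucket_arn):
    """Minimal replication configuration that replicates all new objects."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "replicate-all",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},  # empty filter = the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": destination_bucket_arn},
        }],
    }

def enable_replication(source_bucket, role_arn, destination_bucket_arn):
    import boto3  # deferred; the config builder has no AWS dependency
    boto3.client("s3").put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_config(role_arn, destination_bucket_arn),
    )
```

Both source and destination buckets need versioning enabled, and the IAM role must allow S3 to read from the source and replicate into the destination.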

Data replication for a multi-Region ecommerce website built using serverless components

Figure 5. Data replication for a multi-Region ecommerce website built using serverless components

Amazon Route 53 DNS web service manages traffic failover from one Region to another. Route 53 provides different routing policies, and depending on your business requirement, you can choose the failover routing policy.

Best practices

Now that we’ve shown you how to build these applications, make sure you follow these best practices to effectively build, deploy, and monitor the solution stack:

  • Infrastructure as Code (IaC). A well-defined, repeatable infrastructure is important for managing any solution stack. AWS CloudFormation allows you to treat your infrastructure as code and provides a relatively easy way to model a collection of related AWS and third-party resources.
  • AWS Serverless Application Model (AWS SAM). An open-source framework. Use it to build serverless applications on AWS.
  • Deployment automation. AWS CodePipeline is a fully managed continuous delivery service that automates your release pipelines for fast and reliable application and infrastructure updates.
  • AWS CodeStar. Allows you to quickly develop, build, and deploy applications on AWS. It provides a unified user interface, enabling you to manage all of your software development activities in one place.
  • AWS Well-Architected Framework. Provides a mechanism for regularly evaluating your workloads, identifying high risk issues, and recording your improvements.
  • Serverless Applications Lens. Documents how to design, deploy, and architect serverless application workloads.
  • Monitoring. AWS provides many services that help you monitor and understand your applications, including Amazon CloudWatch, AWS CloudTrail, and AWS X-Ray.


In this blog post, we showed you how to architect a highly available, serverless, and microservices-based ecommerce website that operates in multiple Regions.

We also showed you how to replicate data between Regions for scaling and for failover if your workload fails in one Region. These serverless services reduce the burden of building and managing physical IT infrastructure to help you focus more on building solutions.

Related information

Integrating Amazon Connect and Amazon Lex with Third-party Systems

Post Syndicated from Steven Warwick original https://aws.amazon.com/blogs/architecture/integrating-amazon-connect-and-amazon-lex-with-third-party-systems/

AWS customers who provide software solutions that integrate with AWS often require design patterns that offer some flexibility. They must build, support, and expand products and solutions to meet their end user business requirements. These design patterns must use the underlying services and infrastructure through API operations. As we will show, third-party solutions can integrate with Amazon Connect to initiate customer-specific workflows. You don’t need specific utterances when using Amazon Lex to convert speech to text.

Introduction to Amazon Connect workflows

Platform as a service (PaaS) systems are built to handle a variety of use cases with various inputs. These inputs are provided by upstream systems and sometimes result in complex integrations. For example, when creating a call center management solution, these workflows may require opaque transcription data to be sent to a downstream third-party system. This pattern allows the caller to interact with a third-party system through an Amazon Connect contact flow, and communication can occur multiple times. Opaque transcription data is transferred from the Amazon Connect contact flow, through Amazon Lex, and then to the third-party system. The third-party system can modify and update workflows without affecting the Amazon Connect or Amazon Lex systems.

API workflow use case

AnyCompany Tech (our hypothetical company) is a PaaS company that allows other companies to quickly build workflows with their tools. It provides the ability to take customer calls with a request-response style interaction. AnyCompany built an API into the Amazon Connect flow to allow their end users to return various response types. Examples of the response types are “disconnect,” “speak,” and “failed.”

Using Amazon Connect, AnyCompany allows their end users to build complex workflows using a basic question-response API. Each customer of AnyCompany can build workflows that respond to a voice input. It is processed via the high-quality speech recognition and natural language understanding capabilities of Amazon Lex. The caller is prompted by “What is your question?” Amazon Lex processes the audio input then invokes a Lambda function that connects to AnyCompany Tech. They in turn initiate their customer’s unique workflow. The customer’s workflow may change over time without requiring any further effort from AnyCompany Tech.

Questions graph database use case

AnyCompany Storage is a company with a graph database that stores documents and information correlating the business to its inventory, sales, marketing, and employees. The database is accessed via a question-response API. For example, questions might be: "What are our third quarter earnings?", "Do we have product X in stock?", or "When was Jane Doe hired?" The company wants their employees to be able to call in and, after proper authentication, ask any question of the system and receive a response. Using the architecture in Figure 1, the company can link their Amazon Connect implementation to this API. The output from Amazon Lex is passed into the API, a response is received, and it is then passed to Amazon Connect.

Amazon Connect third-party system architecture

Figure 1. End-customer call flow


  1. User calls Amazon Connect using the telephone number for the connect instance.
  2. Amazon Connect receives the incoming call and starts an Amazon Connect contact flow. This Amazon Connect flow captures the caller’s utterance and forwards it to Amazon Lex.
  3. Amazon Lex starts the requested bot. The Amazon Lex bot translates the caller’s utterance into text and sends it to AWS Lambda via an event.
  4. AWS Lambda accepts the incoming data, transforms or enhances it as needed, and calls out to the external API via some transport.
  5. The external API processes the content sent to it from AWS Lambda.
  6. The external API returns a response back to AWS Lambda.
  7. AWS Lambda accepts the response from the external API, then forwards this response to Amazon Lex.
  8. Amazon Lex returns the response content to Amazon Connect.
  9. Amazon Connect processes the response.

Solution components

Amazon Connect

Amazon Connect allows customer calls to get information from the third-party system. An Amazon Connect contact flow is required to get the callers input, which is then sent to Amazon Lex. The Amazon Lex bot must be granted permission to interact with an Amazon Connect contact flow. As shown in Figure 2, the Get customer input block must call the fallback intent of the Amazon Lex bot. Figure 2 demonstrates a basic flow used to get input, check the response, and perform an action.

Figure 2. Basic Amazon Connect contact flow to integrate with Amazon Lex


Amazon Lex

Amazon Lex converts the voice input from a caller into text, which is then processed by a Lambda function. When setting up the Amazon Lex bot, you will require one fallback intent (Figure 3), one unused intent (Figure 4), and clarification prompts disabled (Figure 5).

Figure 3. Amazon Lex fallback intent


The fallback intent is created by using the pre-defined AMAZON.FallbackIntent and must be set up to call a Lambda function in the Fulfillment section. Using the fallback intent and Lambda fulfillment causes the system to ignore any utterance pattern and pass any translation directly to the Lambda function.

Figure 4. Amazon Lex unused intent


The unused intent is only created to satisfy Amazon Lex’s requirement for a bot to have at least one custom intent with a valid utterance.

Figure 5. Amazon Lex error handling clarification prompts


In the error handling section of the Amazon Lex bot, the clarification prompts must be disabled. Disabling the clarification prompts stops the Amazon Lex bot from asking the caller to clarify the input.

AWS Lambda

AWS Lambda is called by Amazon Lex and is used to interact with the third-party system. The third-party system returns values, which the Lambda function adds to the session attributes of its resulting object. The resulting object from AWS Lambda requires the dialog action to be set up with the type set to "Close," fulfillmentState set to "Fulfilled," and the message contentType set to "CustomPayload." This allows Amazon Lex to pass the values to Amazon Connect without speaking the results to the caller.
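A sketch of such a Lambda function, using the Lex V1 response shape this post describes (the third-party call is stubbed out as a hypothetical placeholder):

```python
import json

def close_response(payload, session_attributes=None):
    """Lex V1 fulfillment response that hands data back without speaking it."""
    return {
        "sessionAttributes": session_attributes or {},
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "CustomPayload",
                "content": json.dumps(payload),
            },
        },
    }

def lambda_handler(event, context):
    utterance = event["inputTranscript"]  # the caller's transcribed speech
    # result = call_third_party_api(utterance)  # hypothetical external call
    result = {"action": "speak", "text": f"You asked: {utterance}"}
    return close_response(result, event.get("sessionAttributes"))
```

The Amazon Connect contact flow then reads the custom payload from the Lex response attributes and decides how to proceed (speak, disconnect, and so on).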


In this blog we showed how Amazon Connect, Amazon Lex, and AWS Lambda functions can be used together to create common interactions with third-party systems. This is a flexible architecture and requires few changes to the Amazon Connect contact flow. You don’t have to set up pre-defined utterances that limit the allowed inputs. Using this solution, AWS customers can provide flexible solutions that interact with third-party systems.

Related information

Intelligently Search Media Assets with Amazon Rekognition and Amazon ES

Post Syndicated from Sridhar Chevendra original https://aws.amazon.com/blogs/architecture/intelligently-search-media-assets-with-amazon-rekognition-and-amazon-es/

Media assets have become increasingly important to industries like media and entertainment, manufacturing, education, social media applications, and retail. This is largely due to innovations in digital marketing, mobile, and ecommerce.

Successfully locating a digital asset like a video, graphic, or image reduces costs related to reproducing or re-shooting. An efficient search engine is critical to quickly delivering something like the latest fashion trends. This in turn increases customer satisfaction, builds brand loyalty, and helps increase businesses’ online footprints, ultimately contributing towards revenue.

This blog post shows you how to build automated indexing and search functions using AWS serverless managed artificial intelligence (AI)/machine learning (ML) services. This architecture provides high scalability, reduces operational overhead, and scales out/in automatically based on the demand, with a flexible pay-as-you-go pricing model.

Automatic tagging and rich metadata with Amazon ES

Asset libraries for images and videos are growing exponentially. With Amazon Elasticsearch Service (Amazon ES), this media is indexed and organized, which is important for efficient search and quick retrieval.

Adding correct metadata to digital assets based on an enterprise-standard taxonomy will help you narrow down search results. This includes information like media formats, but also richer metadata like location, event details, and so forth. With Amazon Rekognition, an advanced ML service, you do not need to manually tag and index these media assets. This automatic tagging and organization frees you up to gain insights like sentiment analysis from social media.

Figure 1 is tagged using Amazon Rekognition. You can see how rich metadata (Apparel, T-Shirt, Person, and Pills) is extracted automatically. Without Amazon Rekognition, you would have to manually add tags and categorize the image. This means you could only do a keyword search on what’s manually tagged. If the image was not tagged, then you likely wouldn’t be able to find it in a search.

Figure 1. An image tagged automatically with Amazon Rekognition


Data ingestion, organization, and storage with Amazon S3

As shown in Figure 2, use Amazon Simple Storage Service (Amazon S3) to store your static assets. It provides high availability and scalability, along with unlimited storage. When you choose Amazon S3 as your content repository, multiple data providers are configured for data ingestion for future consumption by downstream applications. In addition to providing storage, Amazon S3 lets you organize data into prefixes based on the event type and captures S3 object mutations through S3 event notifications.

Figure 2. Solution overview diagram


S3 event notifications are invoked for a specific prefix, suffix, or combination of both. They integrate with Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and AWS Lambda as targets. (Refer to the Amazon S3 Event Notifications user guide for best practices). S3 event notification targets vary across use cases. For media assets, Amazon SQS is used to decouple the new data objects ingested into S3 buckets and downstream services. Amazon SQS provides flexibility over the data processing based on resource availability.
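A Lambda function consuming this queue has to unwrap two layers: the SQS message body contains the S3 notification. A minimal parsing sketch, following the standard S3 and SQS event shapes:

```python
import json
import urllib.parse

def objects_from_sqs_event(sqs_event):
    """Yield (bucket, key) pairs from S3 notifications delivered via SQS."""
    for message in sqs_event["Records"]:
        s3_event = json.loads(message["body"])  # S3 notification is the SQS body
        for record in s3_event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # S3 URL-encodes object keys (spaces arrive as '+')
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            yield bucket, key
```

Each (bucket, key) pair can then be passed to the Rekognition step described in the next section.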

Data processing with Amazon Rekognition

Once media assets are ingested into Amazon S3, they are ready to be processed. Amazon Rekognition determines the entities within each asset. Amazon Rekognition then extracts the entities in JSON format and assigns a confidence score.

If the confidence score is below the defined threshold, use Amazon Augmented AI (A2I) for further review. A2I is an ML service that helps you build the workflows required for human review of ML predictions.

Amazon Rekognition also supports custom modeling to help identify entities within the images for specific business needs. For instance, a campaign may need images of products worn by a brand ambassador at a marketing event. Then they may need to further narrow their search down by the individual’s name or age demographic.

Using our solution, a Lambda function invokes Amazon Rekognition to extract the entities from the ingested assets. Lambda continuously polls the SQS queue for any new messages. Once a message is available, the Lambda function invokes the Amazon Rekognition endpoint to extract the relevant entities.

The following is a sample output from the detect_labels API call in Amazon Rekognition; a transformed version of this output is what gets sent to the downstream search engine:

{'Labels': [{'Name': 'Clothing', 'Confidence': 99.98137664794922, 'Instances': [], 'Parents': []},
            {'Name': 'Apparel', 'Confidence': 99.98137664794922, 'Instances': [], 'Parents': []},
            {'Name': 'Shirt', 'Confidence': 97.00833129882812, 'Instances': [], 'Parents': [{'Name': 'Clothing'}]},
            {'Name': 'T-Shirt', 'Confidence': 76.36670684814453,
             'Instances': [{'BoundingBox': {'Width': 0.7963646650314331, 'Height': 0.6813027262687683,
                                            'Left': 0.09593021124601364, 'Top': 0.1719706505537033},
                            'Confidence': 53.39663314819336}],
             'Parents': [{'Name': 'Clothing'}]}],
 'LabelModelVersion': '2.0',
 'ResponseMetadata': {'RequestId': '3a561e82-badc-4ba0-aa77-39a13f1bb3a6',
                      'HTTPStatusCode': 200,
                      'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Mon, 17 May 2021 18:32:27 GMT',
                                      'x-amzn-requestid': '3a561e82-badc-4ba0-aa77-39a13f1bb3a6',
                                      'content-length': '542',
                                      'connection': 'keep-alive'},
                      'RetryAttempts': 0}}

As shown, the Lambda function submits an API call to Amazon Rekognition, with a T-shirt image in .jpeg format as the input. Based on your confidence score threshold preference, the solution can initiate a human review through Amazon A2I, or use Amazon Rekognition Custom Labels to train custom models. Lambda then identifies and arranges the labels and updates the specified index.
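A simple transformation from the `detect_labels` response into search tags might filter on the confidence score, as in this sketch (the 80% threshold is an arbitrary example, not a recommendation):

```python
def labels_to_tags(response, threshold=80.0):
    """Keep label names whose confidence meets the threshold, highest first."""
    labels = [(l["Name"], l["Confidence"]) for l in response["Labels"]]
    labels.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, confidence in labels if confidence >= threshold]

# With the sample response above, 'T-Shirt' (76.4% confidence) falls below
# an 80% threshold and could be routed to Amazon A2I for human review.
```

The resulting tag list is what gets written to the Amazon ES index in the next section.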

Indexing with Amazon ES

Amazon ES is a managed search engine service that provides enterprise-grade search engine capability for applications. In our solution, assets are searched based on entities that are used as metadata to update the index. Amazon ES is hosted as a public endpoint or a VPC endpoint for secure access within the specified AWS account.

Labels are identified and marked as tags, which are assigned to .jpeg formatted images. The following sample shows a query on one of the tags, issued against an Amazon ES cluster.


curl -XGET https://<ElasticSearch Endpoint>/<_IndexName>/_search?q=T-Shirt
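Before such a query can work, each asset's tags must be written to the index. A document body for that update might be built as follows; the field names and function are illustrative assumptions, not part of the published solution:

```python
import json

def build_index_document(asset_id, tags, asset_format="jpeg"):
    """JSON body for a PUT to https://<endpoint>/<index>/_doc/<asset_id>."""
    return json.dumps({
        "AssetId": asset_id,
        "Format": asset_format,
        "Tags": sorted(set(tags)),  # de-duplicate labels before indexing
    })
```

In production, the request against a secured Amazon ES domain would also need to be signed (for example, with SigV4).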



In addition to photos, Amazon Rekognition also detects labels in videos, and can identify characters and entities. These are then added to Amazon ES to enhance search capability. This allows users to skip to specific parts of a video for quick searchability. For instance, a marketer may need images of cashmere sweaters from a fashion show that was streamed and recorded.

Once the raw video clip is identified, it is then converted using Amazon Elastic Transcoder to play back on mobile devices, tablets, web browsers, and connected televisions. Elastic Transcoder is a highly scalable and cost-effective media transcoding service in the cloud. Segmented output renditions are created for delivery to compatible devices over multiple protocols.


This blog describes AWS services that can be applied to a diverse set of use cases for tagging and efficient search of images and videos. You can build automated indexing and search using AWS serverless managed AI/ML services. They reduce operational overhead, scale out and in automatically based on demand, and offer a flexible pay-as-you-go pricing model.

To get started, use these references to create your own sample architectures:

Architecture Monthly Magazine: Genomics

Post Syndicated from Jane Scolieri original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-genomics/

The field of genomics has made huge strides in the last 20 years.

Genomics organizations and researchers are rising to the many challenges we face today, and seeking improved methods for future needs. Amazon Web Services (AWS) provides an array of services that can help the genomics industry with securely handling and interpreting genomics data, assisting with regulatory compliance, and supporting complex research workloads. In this issue, we have case studies from Lifebit and Fred Hutch, blogs on genomic sequencing and the Registry of Open Data, and some reference architecture and solutions to support your work.

We include videos from the Smithsonian, AstraZeneca, Genomic Discoveries, AMP lab, Illumina, and the University of Sydney.

We hope you’ll find this edition of Architecture Monthly useful. We’d like to thank Kelli Jonakin, PhD, Global Head of Life Sciences & Genomics Marketing, AWS, as well as our Experts, Ryan Ulaszek, Worldwide Tech Leader – Genomics, and Lisa McFerrin, Worldwide Tech Leader – Bioinformatics, for their contribution.

Please give us your feedback! Include your comments on the Amazon Kindle page. You can view past issues and reach out to [email protected] anytime with your questions and comments.

In this month’s Genomics issue:

  • Ask an Expert: Ryan Ulaszek & Lisa McFerrin
  • Executive Brief: Genomics on AWS: Accelerating scientific discoveries and powering business agility
  • Case Study: Fred Hutch Microbiome Researchers Use AWS to Perform Seven Years of Compute Time in Seven Days
  • Quick Start: For rapid deployment
  • Blog: NIH’s Sequence Read Archive, the world’s largest genome sequence repository: Openly accessible on AWS
  • Solutions: Genomics Secondary Analysis Using AWS Step Functions and AWS Batch
  • Reference Architecture: Genomics data transfer, analytics, and machine learning reference architecture
  • Case Study: Lifebit Powers Collaborative Research Environment for Genomics England on AWS
  • Quick Start: Illumina DRAGEN on AWS
  • Executive Brief: Genomic data security and compliance on the AWS Cloud
  • Solutions: Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena
  • Reference Architecture: Genomics report pipeline reference architecture
  • Blog: Broad Institute gnomAD data now accessible on the Registry of Open Data on AWS
  • Quick Start: Workflow orchestration for genomics analysis on AWS
  • Solutions: Genomics Tertiary Analysis and Machine Learning Using Amazon SageMaker
  • Reference Architecture: Research data lake ingestion pipeline reference architecture
  • Videos:
    • The Smithsonian Institution Improves Genome Annotation for Biodiverse Species Using the AWS Cloud
    • AstraZeneca Genomics on AWS: A Journey from Petabytes to New Medicines
    • Accelerate Genomic Discoveries on AWS
    • UC Berkeley AMP Lab Genomics Project on AWS – Customer Success Story
    • Helix Uses Illumina’s BaseSpace Sequence Hub on AWS to Build Their Personal Genomics Platform
    • University of Sydney Accelerate Genomics Research with AWS and Ronin

Download the Magazine

How to access the magazine

View and download past issues as PDFs on the AWS Architecture Monthly webpage.
Readers in the US, UK, Germany, and France can subscribe to the Kindle version of the magazine at Kindle Newsstand.
Visit Flipboard, a personalized mobile magazine app that you can also read on your computer.
We hope you’re enjoying Architecture Monthly, and we’d like to hear from you—leave us a star rating and comment on the Amazon Kindle Newsstand page or contact us anytime at [email protected]

Should I Run my Containers on AWS Fargate, AWS Lambda, or Both?

Post Syndicated from Rob Solomon original https://aws.amazon.com/blogs/architecture/should-i-run-my-containers-on-aws-fargate-aws-lambda-or-both/

Containers have transformed how companies build and operate software. Bundling both application code and dependencies into a single container image improves agility and reduces deployment failures. But what compute platform should you choose to be most efficient, and what factors should you consider in this decision?

With the release of container image support for AWS Lambda functions (December 2020), customers now have an additional option for building serverless applications using their existing container-oriented tooling and DevOps best practices. In addition, a single container image can be configured to run on both of these compute platforms: AWS Lambda (using serverless functions) or AWS Fargate (using containers).

Three key factors can influence the decision of what platform you use to deploy your container: startup time, task runtime, and cost. That decision may vary each time a task is initiated, as shown in the three scenarios that follow.

Design considerations for deploying a container

Total task duration consists of startup time and runtime. The startup time of a containerized task is the time required to provision the container compute resource and deploy the container. Task runtime is the time it takes for the application code to complete.

Startup time: Some tasks must complete quickly. For example, when a user waits for a web response, or when a series of tasks is completed in sequential order. In those situations, the total duration time must be minimal. While the application code may be optimized to run faster, startup time depends on the chosen compute platform as well. AWS Fargate container startup time typically takes from 60 to 90 seconds. AWS Lambda initial cold start can take up to 5 seconds. Following that first startup, the same containerized function has negligible startup time.

Task runtime: The amount of time it takes for a task to complete is influenced by the compute resources allocated (vCPU and memory) and the application code. AWS Fargate lets you select vCPU and memory size. With AWS Lambda, you define the amount of allocated memory, and Lambda provisions a proportional quantity of vCPU. On both platforms, increasing the amount of compute resources may result in faster completion time, although this depends on the application. While the additional compute resources incur greater cost per second, the total duration may be shorter, so the overall cost may also be lower.

AWS Lambda has a maximum runtime limit of 15 minutes. Tasks that may run longer than this shouldn't use Lambda, to avoid the likelihood of timeout errors.

Figure 1 illustrates the proportion of startup time to total duration. The initial steepness of each line shows a rapid decrease in startup overhead. This is followed by a flattening out, showing a diminishing rate of efficiency. Startup time delay becomes less impactful as the total job duration increases. Other factors (such as cost) become more significant.

Figure 1. Ratio of startup time as a function to overall job duration for each service


Cost: When making the choice between Fargate and Lambda, it is important to understand the different pricing models. This way, you can make the appropriate selection for your needs.

Figure 2 shows a cost analysis of Lambda vs. Fargate across the entire range of memory configurations for a running task. For most of the configurable memory range, AWS Lambda is more expensive per second than even the most expensive configuration of Fargate.

Figure 2. Total cost for both AWS Lambda and AWS Fargate based on task duration


From a cost perspective, AWS Fargate is more cost-effective for tasks running for several seconds or longer. If cost is the only factor at play, then Fargate would be the better choice. But the savings gained by using Fargate may be offset by the business value gained from the shorter Lambda function startup time.
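To make the tradeoff concrete, the cost comparison can be sketched in a few lines. The per-unit rates below are illustrative examples only (check current AWS pricing for your Region), and Lambda per-request and Fargate operating-system surcharges are ignored:

```python
# Illustrative example rates in USD; verify against current AWS pricing.
LAMBDA_PER_GB_SECOND = 0.0000166667
FARGATE_PER_VCPU_HOUR = 0.04048
FARGATE_PER_GB_HOUR = 0.004445

def lambda_cost(memory_gb, duration_s):
    """Lambda compute cost: allocated memory times billed duration."""
    return memory_gb * duration_s * LAMBDA_PER_GB_SECOND

def fargate_cost(vcpu, memory_gb, duration_s):
    """Fargate task cost: vCPU and memory are billed separately per second."""
    return duration_s * (vcpu * FARGATE_PER_VCPU_HOUR / 3600
                         + memory_gb * FARGATE_PER_GB_HOUR / 3600)
```

With these example rates, a 60-second task with 1 GB of memory costs about $0.001 on Lambda, while a 0.25 vCPU / 1 GB Fargate task of the same duration costs roughly a quarter of that, consistent with the per-second comparison in Figure 2.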

Dynamically choose your compute platform

In the following scenarios, we show how a single container image can serve multiple use cases. The decision to run a given containerized application on either AWS Lambda or AWS Fargate can be determined at runtime. This decision depends on whether cost, speed, or duration are the priority.

In Figure 3, an image-processing AWS Batch job runs on a nightly schedule, processing tens of thousands of images to extract location information. When run as a batch job, image processing may take 1–2 hours. The job pulls images stored in Amazon Simple Storage Service (S3) and writes the location metadata to Amazon DynamoDB. In this case, AWS Fargate provides a good combination of compute and cost efficiency. An added benefit is that it also supports tasks that exceed 15 minutes. If a single image is submitted for real-time processing, response time is critical. In that case, the same image-processing code can be run on AWS Lambda, using the same container image. Rather than waiting for the next batch process to run, the image is processed immediately.

Figure 3. One-off invocation of a typically long-running batch job


In Figure 4, a SaaS application uses an AWS Lambda function to allow customers to submit complex text search queries for files stored in an Amazon Elastic File System (EFS) volume. The task should return results quickly, which is an ideal condition for AWS Lambda. However, a small percentage of jobs run much longer than the average, exceeding the maximum duration of 15 minutes.

A straightforward approach to avoid job failure is to initiate an Amazon CloudWatch alarm when the Lambda function times out. CloudWatch alarms can automatically retry the job using Fargate. An alternate approach is to capture historical data and use it to create a machine learning model in Amazon SageMaker. When a new job is initiated, the SageMaker model can predict the time it will take the job to complete. Lambda can use that prediction to route the job to either AWS Lambda or AWS Fargate.

Figure 4. Short duration tasks with occasional outliers running longer than 15 minutes


In Figure 5, a customer runs a containerized legacy application that encompasses many different kinds of functions, all related to a recurring data processing workflow. Each function performs a task of varying complexity and duration, such as processing data files, updating a database, or submitting machine learning jobs.

Using a container image, one code base can be configured to contain all of the individual functions. Longer running functions, such as data preparation and big data analytics, are routed to Fargate. Shorter duration functions like simple queries can be configured to run using the container image in AWS Lambda. By using AWS Step Functions as an orchestrator, the process can be automated. In this way, a monolithic application can be broken up into a set of “Units of Work” that operate independently.
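A sketch of that orchestration in Amazon States Language might use a Choice state to route each unit of work. All state names, ARNs, and the 720-second threshold below are hypothetical placeholders:

```json
{
  "StartAt": "RouteUnitOfWork",
  "States": {
    "RouteUnitOfWork": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.expectedDurationSeconds",
          "NumericLessThan": 720,
          "Next": "RunOnLambda"
        }
      ],
      "Default": "RunOnFargate"
    },
    "RunOnLambda": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "unit-of-work-fn", "Payload.$": "$" },
      "End": true
    },
    "RunOnFargate": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "Parameters": {
        "LaunchType": "FARGATE",
        "Cluster": "arn:aws:ecs:us-east-1:111122223333:cluster/unit-of-work",
        "TaskDefinition": "arn:aws:ecs:us-east-1:111122223333:task-definition/unit-of-work"
      },
      "End": true
    }
  }
}
```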

Figure 5. Heterogeneous function orchestration



If your job lasts milliseconds and requires a fast response to provide a good customer experience, use AWS Lambda. If your function is not time-sensitive and runs on the scale of minutes, use AWS Fargate. For tasks that have a total duration of under 15 minutes, customers must decide based on impacts to both business and cost. Select the service that is the most effective serverless compute environment to meet your requirements. The choice can be made manually when a job is scheduled or by using retry logic to switch to the other compute platform if the first option fails. The decision can also be based on a machine learning model trained on historical data.

Field Notes: How to Set Up Your Cross-Account and Cross-Region Database for Amazon Aurora

Post Syndicated from Ashutosh Pateriya original https://aws.amazon.com/blogs/architecture/field-notes-how-to-set-up-your-cross-account-and-cross-region-database-for-amazon-aurora/

This post was co-written by Ashutosh Pateriya, Solution Architect at AWS and Nirmal Tomar, Principal Consultant at Infosys Technologies Ltd. 

Various organizations have stringent regulatory compliance obligations or business requirements that call for an effective cross-account and cross-Region database setup. We recommend establishing a Disaster Recovery (DR) environment in different AWS accounts and Regions for a system design that can handle AWS Region failures.

Account-level separation is important for isolating production environments from other environments, such as development, test, UAT, and DR environments, and is often driven by external compliance requirements such as PCI DSS or HIPAA. Cross-account replication helps an organization recover data from a replicated account when the primary AWS account is compromised and access to the account is lost. Using multiple Regions gives you greater control over recovery time if there is a hard dependency failure on a Regional AWS service.

You can use Amazon Aurora Global Database to replicate your RDS database across Regions, within the same AWS Account. In this blog, we show how to build a custom solution for Cross-Account, Cross-Region database replication with configurable Recovery Time Objective (RTO)/ Recovery Point Objective (RPO).

Solution Overview


Figure 1 – Architecture of a Cross-Account, Cross-Region database replication with configurable Recovery Time Objective (RTO)/ Recovery Point Objective (RPO)

Step 1: Automated Amazon Aurora snapshots cannot be shared with other AWS accounts. Hence, create a manual copy of the automated snapshot.

Step 2: An Amazon Aurora cluster cannot be restored from a snapshot created in a different AWS Region. To overcome this, copy the snapshot to the target AWS Region, encrypting it with AWS KMS during the copy.

Step 3: Share the DB cluster snapshot with the target account, in order to restore the cluster.

Step 4: Restore Aurora Cluster from the shared snapshot.

Step 5: Restore the required Aurora Instances.

Step 6: Swap the cluster and instance names, so the restored database is accessible from the previous Reader/Writer endpoints.

Step 7: Perform a sanity test and shut down the old cluster.

Prerequisites


  • Two AWS accounts with permission to spin up required resources.
  • A running Aurora instance in the source AWS account.


The solution outlined is custom built and automated using AWS Step Functions, AWS Lambda, Amazon Simple Notification Service (SNS), AWS Key Management Service (AWS KMS), AWS CloudFormation templates, Amazon Aurora, and AWS Identity and Access Management (IAM), by following these four steps:

  1. Launch a one-time set up of the AWS KMS in the AWS source account to share the key with destination account. This key will be used by the destination account to restore the Aurora cluster from a KMS encrypted snapshot.
  2. Set up the step function workflow in the AWS source account and source Region. On a scheduled interval, this workflow creates a snapshot in the source account and source Region, copies it to the target Region, and pushes a notification to the SNS topic; a Lambda function is set up as a target for the SNS topic. Refer to the documentation to learn more about Using Lambda with Amazon SNS.
  3. Set up the step function workflow in the AWS target account and in the target Region. This workflow is initiated from the previous workflow after sharing the snapshot. It restores the Aurora cluster from the shared snapshot in the target account and in the target Region.
  4. Provide access to the Lambda function in the target account and target Region. Set it up to listen to the SNS topic in the source account and source Region.

Figure 2 – Architecture showing AWS Step Function in Source Account and Source Region

We can implement Lambda functions in any supported programming language. For this solution, we selected Python to implement the various Lambda functions used in both step function workflows.

Note: We are using us-west-2 as the source Region and us-west-1 as the target Region for this demo. You can choose Regions as per your design need.

Part 1: One Time AWS KMS key setup in the Source Account and Target Region for cross-account access of the Encrypted Snapshot

We can share a Customer Master Key (CMK) across multiple AWS accounts in several ways, such as with AWS CloudFormation templates, the AWS CLI, or the AWS Management Console.

Following are one-time steps that allow you to create an AWS KMS key in the source AWS account, then grant the destination account cross-account usage access to that key for restoring the Aurora cluster. You can achieve this with either the AWS Console or an AWS CloudFormation template.

Note: Several AWS services being used do not fall under the AWS Free Tier. To avoid extra charges, make sure to delete resources once the demo is completed.

Using the AWS Console

You can set up cross-account AWS KMS keys using the AWS Console. For more information on the setup, refer to the guide: Allowing users in other accounts to use a CMK.

Using the AWS CloudFormation Template

You can set up cross account AWS KMS keys using CloudFormation templates by following these steps:

  • Launch the template:


  • Use the source account and destination AWS Region.
  • Type the appropriate stack name and destination account number and select Create stack.


  • Click on the Resources tab to confirm if the KMS key and its alias name have been created.


Part 2: Snapshot creation and sharing Target Account in Target Region using the Step Function workflow

Now, we schedule the AWS Step Functions workflow to run one or more times per day, week, or month, using Amazon EventBridge or Amazon CloudWatch Events, according to your Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

This workflow first creates a snapshot in the AWS source account and source Region. It then copies the snapshot to the target Region after KMS encryption, shares it with the target account, and sends a notification to an SNS topic. A Lambda function in the target account is subscribed to the topic.



Figure 3 – Architecture showing the AWS Step Functions Workflow

The Lambda functions involved in this workflow are:

a. CreateRDSSnapshotInSourceRegion: This Lambda function creates a snapshot in the source account and Region for an existing Aurora cluster.

b. FetchSourceRDSSnapshotStatus: This Lambda function fetches the snapshot status and waits for its availability before moving to the next step.

c. EncrytAndCopySnapshotInTargetRegion: This Lambda function copies the source snapshot to the target AWS Region after encrypting it with AWS KMS keys.

d. FetchTargetRDSSnapshotStatus: This Lambda function fetches the snapshot status in the target Region and waits until the status changes to available before moving to the next step.

e. ShareSnapshotStateWithTargetAccount: This Lambda function shares the snapshot with the target account and target Region.

f. SendSNSNotificationToTargetAccount: This Lambda function sends a notification to the SNS topic to which a Lambda function in the destination AWS account is subscribed.
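As a rough sketch of functions (a) and (c) above, the snapshot identifier and the cross-Region copy parameters might be built like this. The function names and naming scheme are illustrative assumptions; the actual Lambda code deployed by the CloudFormation stack may differ:

```python
from datetime import datetime, timezone

def manual_snapshot_id(cluster_id):
    """Timestamped identifier for the manual snapshot copy. RDS identifiers
    allow letters, numbers, and hyphens, so colons in the timestamp are avoided."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M")
    return f"{cluster_id}-dr-{stamp}"

def snapshot_copy_params(source_snapshot_arn, target_snapshot_id,
                         kms_key_arn, source_region):
    """Parameters for rds.copy_db_cluster_snapshot, called in the target Region.

    With boto3, passing SourceRegion lets the client generate the presigned
    URL that a cross-Region, KMS-encrypted copy requires.
    """
    return {
        "SourceDBClusterSnapshotIdentifier": source_snapshot_arn,
        "TargetDBClusterSnapshotIdentifier": target_snapshot_id,
        "KmsKeyId": kms_key_arn,
        "SourceRegion": source_region,
    }
```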

Using the following AWS CloudFormation Template, we set up the required services/workflow in the source account and source Region:

a.     Launch the template in the source account and source Region:



b.     Fill out the stack parameters with the following details and select Next.

Stack name: The name of the stack to create.

DestinationAccountNumber: The destination AWS account number where you want to restore the Aurora cluster.

KMSKey: The ARN of the KMS key set up in the previous steps.

ScheduleExpression: A cron expression for when you want to run this workflow. Sample expressions are available in this guide, Schedule expressions using rate or cron.

SourceDBClusterIdentifier: The identifier of the DB cluster for which a snapshot is created. It must match the identifier of an existing DBCluster. This parameter isn't case-sensitive.

SourceDBSnapshotName: The snapshot name to use in this workflow.

TargetAWSRegion: The destination AWS Region where you want to restore the Aurora cluster.


c.     Select an IAM role to launch this template, per best practices.


d.     Acknowledge that the template creates various resources, including IAM roles and policies, and select Create stack.

Part 3: Workflow to Restore the Aurora Cluster in the AWS Target Account and Target Region

Once AWS Lambda receives a cross-account SNS notification for the shared snapshot, this workflow begins:

  • The Lambda function starts the AWS Step Function workflow in an asynchronous mode.
  • The workflow checks for the snapshot availability and starts restoring the Aurora cluster based on the configured engine, version, and other parameters.
  • Depending on the availability of the cluster, it adds instances based on a predefined configuration.
  • Once the instances are available, it swaps cluster names so that the updated database is available at the same reader and writer endpoints.
  • Once the databases are restored, the old Aurora cluster is shut down.
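The restore step with mandatory attributes only might be sketched as follows. The parameter names follow the RDS restore_db_cluster_from_snapshot API; the function name and defaults are illustrative assumptions:

```python
def restore_cluster_params(temp_cluster_id, shared_snapshot_arn, engine,
                           engine_version=None):
    """Mandatory parameters for rds.restore_db_cluster_from_snapshot.

    Only required attributes are set here; VPC, subnet group, and
    parameter groups fall back to defaults unless added explicitly.
    """
    params = {
        "DBClusterIdentifier": temp_cluster_id,    # the temporary cluster name
        "SnapshotIdentifier": shared_snapshot_arn,
        "Engine": engine,                          # must be compatible with the source
    }
    if engine_version:
        params["EngineVersion"] = engine_version
    return params
```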


a) The database is restored in the default VPC and Availability Zone, with the default cluster version and option group. If you want to restore with specific configurations, you can edit the restore cluster function with details available in the same section.

b) Database credentials will be the same as the source account. If you want to override credentials or change the authentication mechanism, you can edit the restore cluster function with details available in the same section.


Figure 4 – Architecture showing the AWS Step Functions Workflow

The Lambda functions involved in this workflow are:

a) FetchRDSSnapshotStatus: This Lambda function fetches the snapshot status and moves to the RestoreCluster step once the snapshot is available.

b) RestoreRDSCluster: This Lambda function restores the cluster from the shared snapshot with the mandatory attributes only.

– You can edit the Lambda function to add specific parameters like VPC, Subnets, AvailabilityZones, DBSubnetGroupName, Auto Scaling for reader capacity, DBClusterParameterGroupName, BacktrackWindow, and more.

c) RestoreInstance: This Lambda function creates an instance for the restored cluster.

d) SwapClusterName: This Lambda function swaps the cluster names, so that the updated database is available at the previous endpoints.

e) TerminateOldCluster: This Lambda function shuts down the previous cluster.

Using the following AWS CloudFormation Template, we can set up the required services/workflow in the target account and target Region:

a)     Launch the template in target account and target Region:



b)     Fill out the stack parameters with the following details and select Next.

Stack name: The name of the stack to create.

Engine: The database engine to be used for the new DB cluster, such as aurora-mysql. It must be compatible with the engine of the source.

DBInstanceClass: The compute and memory capacity of the Amazon RDS DB instance, for example, db.t3.medium.

Not all DB instance classes are available in all AWS Regions, or for all database engines. For the full list of DB instance classes, and availability for your engine, see DB Instance Class in the Amazon RDS User Guide.

ExistingDBClusterIdentifier: Name of the existing DB cluster. In the final step, if the old cluster is available with this name, it will be deleted and a new cluster will be available with this name, so the reader/writer endpoint will not change. This parameter isn’t case-sensitive. It must contain from 1 to 63 letters, numbers, or hyphens. The first character must be a letter. DBClusterIdentifier can’t end with a hyphen or contain two consecutive hyphens.

TempDBClusterIdentifier: Temporary name of the DB cluster created from the DB snapshot or DB cluster snapshot. In the final step, this cluster name will be swapped with the original cluster and the previous cluster will be deleted. This parameter isn’t case-sensitive. Must contain from 1 to 63 letters, numbers, or hyphens. The first character must be a letter. DBClusterIdentifier can’t end with a hyphen or contain two consecutive hyphens.

ExistingDBInstanceIdentifier: Name of the existing DB instance identifier. In the final step, if a previous instance is available with this name, it will be deleted and a new instance will be created with this name. This parameter is stored as a lowercase string. It must contain from 1 to 63 letters, numbers, or hyphens; the first character must be a letter, and it can't end with a hyphen or contain two consecutive hyphens.

TempDBInstanceIdentifier: Temporary name of the existing DB instance. In the final step, if a previous instance is available with this name, it will be deleted and a new instance will be available. This parameter is stored as a lowercase string. Must contain from 1 to 63 letters, numbers, or hyphens. The first character must be a letter, can’t end with a hyphen or contain two consecutive hyphens.



c)     Select an IAM role to launch this template, per best practices, and select Next.



d)     Acknowledge that the template creates various resources, including IAM roles and policies, and select Create stack.

After launching the CloudFormation steps, all resources like IAM roles, step function workflow and Lambda functions will be set up in the target account.

Part 4: Cross Account Role for Lambda Invocation to restore the Aurora Cluster

In this step, we provide access to the Lambda function set up in the target account and target Region so it can listen to the SNS topic set up in the source account and source Region. An SNS notification is initiated by the first workflow after the snapshot sharing steps. The Lambda function then initiates a workflow to restore the Aurora cluster in the target account and target Region.

Using the following code, we can configure a Lambda function in the target account to listen to the SNS topic created in the source account and Region. More details are available in this Tutorial: Using AWS Lambda with Amazon Simple Notification Service.

a)     Configure the SNS topic in the source account to allow subscriptions from the target account, using the following AWS CLI command in the source account.

aws sns add-permission \
  --label lambda-access --aws-account-id <<targetAccountID>> \
  --topic-arn <<snsTopicinSourceAccount>> \
  --action-name Subscribe ListSubscriptionsByTopic

Replace <<targetAccountID>> with the target AWS account number, and <<snsTopicinSourceAccount>> with the ARN of the SNS topic created as an output of the Part 2 CloudFormation template in the source account.

b)     Configure the InvokeStepFunction Lambda function in the target account to allow invocation from the SNS topic in the source account, using the following AWS CLI command in the target account. This policy allows the specific SNS topic in the source account to invoke the Lambda function.

aws lambda add-permission \
  --function-name InvokeStepFunction \
  --source-arn <<snsTopicinSourceAccount>> \
  --statement-id sns-x-account \
  --action "lambda:InvokeFunction" \
  --principal sns.amazonaws.com

Replace <<snsTopicinSourceAccount>> with the ARN of the SNS topic created as an output of the Part 2 CloudFormation template.

c)     Subscribe the Lambda function in the target account to the SNS topic in the source account by using the following AWS CLI command in the target account.

aws sns subscribe \
  --topic-arn <<snsTopicinSourceAccount>> \
   --protocol "lambda" \
  --notification-endpoint << InvokeStepFunctionArninTargetAccount>>

Run this CLI command in the source Region, because the SNS topic was created there. Replace <<snsTopicinSourceAccount>> with the ARN of the SNS topic created as an output of the Part 2 CloudFormation template, and <<InvokeStepFunctionArninTargetAccount>> with the ARN of the InvokeStepFunction Lambda function in the target account, created as an output of the Part 3 CloudFormation template.

Security Considerations

1. We can share snapshots with other accounts in either public or private mode.

  • Public mode permits all AWS accounts to restore a DB instance from the shared snapshot.
  • Private mode permits only AWS accounts that you specify to restore a DB instance from the shared snapshot. We recommend sharing the snapshot with the destination account in private mode, as we are using the same in our solution.


2. Sharing a manual DB cluster snapshot (whether encrypted or unencrypted) enables authorized AWS accounts to directly restore a DB cluster from the snapshot, instead of taking a copy and restoring from that. We recommend sharing an encrypted snapshot with the destination account, as we are doing in our solution.

Cleaning up

Delete the resources to avoid incurring future charges.


In this blog, you learned how to replicate databases across Regions and across different accounts. This helps in designing your DR environment and fulfilling compliance requirements. This solution can be customized for any RDS-based database or Aurora Serverless, and you can achieve the desired level of RTO and RPO.


Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Nirmal Tomar

Nirmal is a principal consultant with Infosys, assisting vertical industries on application migration and modernization to the public cloud. He is part of the Infosys Cloud CoE team and leads various initiatives with AWS. Nirmal has a keen interest in industry and cloud trends, and in working with hyperscalers to build differentiated offerings. Nirmal also specializes in cloud-native development, managed containerization solutions, data analytics, data lakes, IoT, AI, and machine learning solutions.

ERGO Breaks New Frontiers for Insurance with AI Factory on AWS

Post Syndicated from Piotr Klesta original https://aws.amazon.com/blogs/architecture/ergo-breaks-new-frontiers-for-insurance-with-ai-factory-on-aws/

This post is co-authored with Piotr Klesta, Robert Meisner and Lukasz Luszczynski of ERGO

Artificial intelligence (AI) and related technologies are already finding applications in our homes, cars, industries, and offices. The insurance business is no exception to this. When AI is implemented correctly, it adds a major competitive advantage. It enhances the decision-making process, improves efficiency in operations, and provides hassle-free customer assistance.

At ERGO Group, we realized early on that innovation using AI required more flexibility in data integration than most of our legacy data architectures allowed. Our internal governance, data privacy processes, and IT security requirements posed additional challenges towards integration. We had to resolve these issues in order to use AI at the enterprise level, and allow for sensitive data to be used in a cloud environment.

We aimed for a central system that introduces ‘intelligence’ into other core application systems, and thus into ERGO’s business processes. This platform would support the process of development, training, and testing of complex AI models, in addition to creating more operational efficiency. The goal of the platform is to take the undifferentiated heavy lifting away from our data teams so that they focus on what they do best – harness data insights.

Building ERGO AI Factory to power AI use cases

Our quest for this central system led to the creation of AI Factory built on AWS Cloud. ERGO AI Factory is a compliant platform for running production-ready AI use cases. It also provides a flexible model development and testing environment. Let’s look at some of the capabilities and services we offer to our advanced analytics teams.

Figure 1: AI Factory imperatives

  • Compliance: Enforcing security measures (for example, authentication, encryption, and least privilege) was one of our top priorities for the platform. We worked closely with the security teams to meet strict domain and geo-specific compliance requirements.
  • Data governance: Data lineage and deep metadata extraction are important because they support proper data governance and auditability. They also allow our users to navigate a complex data landscape. Our data ingestion frameworks include a mixture of third party and AWS services to capture and catalog both technical and business metadata.
  • Data storage and access: AI Factory stores data in Amazon Simple Storage Service (S3) in a secure and compliant manner. Access rights are only granted to individuals working on the corresponding projects. Roles are defined in Active Directory.
  • Automated data pipelines: We sought to provide a flexible and robust data integration solution. An ETL pipeline using Apache Spark, Apache Airflow, and Kubernetes pods is central to our data ingestion. We use this for AI model development and subsequent data preparation for operationalization and model integration.
  • Monitoring and security: AI Factory relies on open-source cloud monitoring solutions like Grafana to detect security threats and anomalies. It does this by collecting service and application logs, tracking metrics, and generating alarms.
  • Feedback loop: We store model inputs/outputs and use BI tools, such as Amazon QuickSight, to track the behavior and performance of productive AI models. It’s important to share such information with our business partners so we can build their trust and confidence with AI.
  • Developer-friendly environment: Creating AI models is possible in a notebook-style or integrated development environment. Because our data teams use a variety of machine learning (ML) frameworks and libraries, we keep our platform extensible and framework-agnostic. We support Python/R, Apache Spark, PyTorch, TensorFlow, and more. All this is bolstered by CI/CD processes that accelerate delivery and reduce errors.
  • Business process integration: AI Factory offers services to integrate ML models into existing business processes. We focus on standardizing processes and close collaboration with business and technical stakeholders. Our overarching goal is to operationalize the AI model in the shortest possible timeframe, while preserving high quality and security standards.

AI Factory architecture

So far, we have looked at the functional building blocks of the AI Factory. Let’s take an architectural view of the platform using a five-step workflow:

Figure 2: AI Factory high-level architecture

  1. Data ingestion environment: We use this environment to ingest data from the prominent on-premises ERGO data sources. We can schedule batch or delta data transfers to various cloud destinations using multiple Kubernetes-hosted microservices. Once ingested, data is persisted and cataloged in ERGO’s data lake on Amazon S3, where it is prepared for processing by the downstream environments.
  2. Model development environment: This environment is used primarily by data scientists and data engineers. We use Amazon EMR and Amazon SageMaker extensively for data preparation, data wrangling, experimentation with predictive models, and development through rapid iterations.
  3. Model operationalization environment: Trained models with satisfactory KPIs are promoted from the model development to the operationalization environment. This is where we integrate AI models in business processes. The team focuses on launching and optimizing the operation of services and algorithms.
    • Integration with ERGO business processes is achieved using a Kubernetes-hosted ‘Model Service.’ This allows us to infuse AI models provided by data scientists into existing business processes.
    • An essential part of model operationalization is to continuously monitor the quality of the deployed ML models using the ‘feedback loop service.’
  4. Model insights environment: This environment is used for displaying information about platform performance, processes, and analytical data. Data scientists use its services to check for unexpected bias or performance drifts that the model could exhibit. Feedback coming from the business through the ‘feedback loop service’ allows them to identify problems fast and retrain the model.
  5. Shared services: Though shown as the fifth step of the workflow, the shared services environment supports almost every step in the process. It provides common, shared components between different parts of the platform managing CI/CD and orchestration processes within the AI factory. Additional services like platform logging and monitoring, authentication, and metadata management are also delivered from the shared services environment.

A binding theme across the various subplatforms is that all provisioning and deployment activities are automated using Infrastructure as Code (IaC) practices. This reduces the potential for human error, provides architectural flexibility, and greatly speeds up software development and our infrastructure-related operations.

All components of the AI factory are run in the AWS Cloud and can be scaled and adapted as needed. The connection between model development and operationalization happens at well-defined interfaces to prevent unnecessary coupling of components.

Lessons learned

Security first

  • Align with security early and often
  • Understand all the regulatory obligations and document them as critical, non-functional requirements

Modular approach

  • Combine modern data science technology and professional IT with a cross-functional, agile way of working
  • Apply loosely coupled services with an API-first approach

Data governance

  • Tracking technical metadata is important but not sufficient; you need business attributes too
  • Determine data ownership in operational systems to map upstream data governance workflows
  • Establish solutions to data masking as the data moves across sub-platforms
  • Define access rights and permissions boundaries among various personas

FinOps strategy

  • Carefully track platform cost
  • Assign owners responsible for monitoring and cost improvements
  • Provide regular feedback to platform stakeholders on usage patterns and associated expenses

Working with our AWS team

  • Establish cadence for architecture review and new feature updates
  • Plan cloud training and enablement

The future for the AI factory

The creation of the AI Factory was an essential building block of ERGO’s strategy. Now we are ready to embrace the next chapter in our advanced analytics journey.

We plan to focus on important use cases that will deliver the highest business value. We want to make the AI Factory available to ERGO’s international subsidiaries. We are also enhancing and scaling its capabilities. We are creating an ‘analytical content hub’ based on automated text extraction, improving speech to text, and developing translation processes for all unstructured and semistructured data using AWS AI services.

Field Notes: Orchestrating and Monitoring Complex, Long-running Workflows Using AWS Step Functions

Post Syndicated from Max Winter original https://aws.amazon.com/blogs/architecture/field-notes-orchestrating-and-monitoring-complex-long-running-workflows-using-aws-step-functions/


IHS Markit’s Wall Street Office (WSO) offers financial reports to hundreds of clients worldwide. When IHS Markit completed the migration of WSO’s SaaS software to AWS, it unlocked the power and agility to deliver new product features monthly, as opposed to a multi-year release cycle. This migration also presented a great opportunity to further enhance the customer experience by automating the WSO reporting team’s own Continuous Integration and Continuous Deployment (CI/CD) workflow. WSO then offered the same migration workflow to its on-prem clients, who needed the ability to upgrade quickly in order to meet a regulatory LIBOR reporting deadline. This rapid upgrade was enabled by fully automating regression testing of new software versions.

In this blog post, I outline the architectures created in collaboration with WSO to orchestrate and monitor the complex, long-running reconciliation workflows in their environment by leveraging the power of AWS Step Functions. To enable each client’s migration to AWS, WSO needed to ensure that the new, AWS version of the reporting application produced identical outputs to the previous, on-premises version. For a single migration, the process is as follows:

  • spin up the old version of the SQL Server and reporting engine on Windows servers,
  • run reports,
  • repeat the process with the new version,
  • compare the outputs and review the differences.

The problem came with scaling this process. IHS Markit provides financial solutions and tools to numerous clients. To enable these clients to transition away from LIBOR, the WSO team was tasked with migrating over 80 instances of the application, and reconciling hundreds of reports for each migration. During upgrades, customers must manually validate custom extracts created in the WSO Reporting application against the current and next versions, which limits upgrade frequency and increases the resourcing cost of these validations. Without automation, upgrading all clients would have taken an entirely new Operations team, and cost the firm over 700 developer-hours to meet the regulatory LIBOR cessation deadline.

WSO was able to save over 4,000 developer-hours by making this process repeatable, so it can be used as an automated regression test as part of the regular Systems Development Lifecycle process. The following diagram shows the reconciliation workflow steps enabled as part of this automated process.

Figure 1 – Reconciliation Workflow Steps


The team quickly realized that a Serverless and event-driven solution would be required to make this process manageable. The initial approach was to use AWS Lambda functions to call PowerShell scripts to perform each step in the reconciliation process. They also used Amazon SNS to invoke the next Lambda function when the previous step completed.

The problem came when the Operations team tried to monitor these Lambda functions, with multiple parallel reconciliations running concurrently. The Lambda outputs became mixed together in shared Amazon CloudWatch log groups, and there was no way to quickly see the overall progress of any given reconciliation workflow. It was also difficult to figure out how to recover from errors.

Furthermore, the team found that some steps in this process, such as database restoration, ran longer than the 15-minute Lambda timeout limit. As a result, they were forced to look for alternatives to manage these long-running steps. The following architecture diagram shows the serverless components used to automate and scale the process.

Figure 2 – Initial Orchestration Architecture


Enter Step Functions and AWS Systems Manager (formerly known as SSM) Automation. To address the problem of orchestrating the many sequential and parallel steps in our workflow, AWS Solutions Architects suggested replacing Amazon SNS with AWS Step Functions.

The Step Functions state machine controls the order in which the steps are invoked, including successful and error state transitions. The service is integrated with 14 other AWS services (Lambda, SSM Automation, Amazon ECS, and more), and can invoke them, as well as manual actions. These calls can be synchronous or run via steps that wait for an event. A state machine instance is long-lived and can support processes that take up to a year to complete.
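As an illustration of what such a state machine definition looks like, here is a minimal Amazon States Language (ASL) sketch of a two-step workflow with error transitions. The state names and ARNs are hypothetical, not WSO's actual definition; a long-running step like database restoration would use an SSM Automation service integration rather than a Lambda task.

```python
import json

# Minimal, illustrative ASL definition: two sequential Task states,
# each routing any failure to a manual-review state. All names and
# ARNs are hypothetical.
definition = {
    "Comment": "Reconciliation workflow (illustrative sketch)",
    "StartAt": "RestoreDatabase",
    "States": {
        "RestoreDatabase": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:function:RestoreDatabase",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ManualReview"}],
            "Next": "RunReports",
        },
        "RunReports": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:function:RunReports",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ManualReview"}],
            "Next": "Done",
        },
        "ManualReview": {"Type": "Fail", "Cause": "Operator investigation required"},
        "Done": {"Type": "Succeed"},
    },
}

# This JSON document is what you would pass when creating the state machine.
print(sorted(definition["States"]))
```

Uploading a definition like this is what produces the diagram shown in the Step Function Designer UI.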

Figure 3 – Step Function Designer UI

This immediately gave the Development team a holistic, visual way to design their workflow, and offered Operations a graphical user interface (UI) to monitor ongoing reconciliations in real time. The Step Functions console lists all running and past reconciliations, including their status, and allows the operator to drill down into the detailed state diagram of any given reconciliation. The operator can then see how far it has progressed or where it encountered an error.

The UI also provides Amazon CloudWatch links for any given step, isolating the logs of that particular Lambda execution and eliminating the need to search through the CloudWatch log group manually. The screenshot below shows what an in-progress Step Function looks like to an operator, with each step listed with its own status and a link to its log.

Figure 4 – Step Function Execution Monitor


Figure 5 – Step Function Execution Detail Viewer


Figure 6 – Step Function Log Group in Amazon CloudWatch

The team also used the Step Function state machine as a container for metadata about each particular reconciliation process instance (like the environment ID and the database and Amazon EC2 instances associated with that environment), reducing the need to pass this data between Lambda functions.

To solve the problem of long-running PowerShell scripts, AWS Solutions Architects suggested using SSM Automation. Unlike Lambda functions, SSM Automation is meant to run operational scripts, with no maximum time limit. It also has native PowerShell integration, which you can use to call the existing scripts and capture their output.


Figure 7 – SSM Automation UI for Monitoring Long-running Tasks and Manual Approval Steps

To save time running hundreds of reports, the team looked into the ‘Map State’ feature of Step Functions. Map takes an array of input data, then creates an instance of the step (in this case a Lambda call) for each item in the array, and waits for them all to complete before proceeding.

This implements a fan-out pattern with almost no orchestration code. The Map State step also gives Operations users the option to limit the level of parallelism, in this case letting only five reports run simultaneously. This prevents overloading the reporting applications and databases.

Figure 8 – Map State is used to Fan Out the “CompareReport” Step as Multiple Parallel Steps
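The fan-out described above can be sketched as an ASL ‘Map’ state; the item path, state name, and Lambda ARN below are hypothetical:

```python
# Illustrative Map state: creates one CompareReport task per item in
# the input array, with at most 5 iterations running in parallel.
map_state = {
    "Type": "Map",
    "ItemsPath": "$.reportsToCompare",
    "MaxConcurrency": 5,  # caps parallelism to protect the reporting databases
    "Iterator": {
        "StartAt": "CompareReport",
        "States": {
            "CompareReport": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:111111111111:function:CompareReport",
                "End": True,
            }
        },
    },
    "End": True,
}

print(map_state["MaxConcurrency"])
```

With `MaxConcurrency` set to 0 the service would run all iterations as concurrently as possible; setting it to 5 gives the throttling behavior described above.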

To deal with errors in any of the workflow steps, the Development team introduced a manual review step, which you can model in Step Functions. The manual step notifies a mailing list of the error, then waits for a reply to tell it whether to retry or abort the workflow.
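One way to model such a manual step (a sketch, not necessarily the team's exact implementation) is Step Functions' wait-for-callback integration: the task passes a task token to a notification Lambda function, and the execution pauses until someone calls SendTaskSuccess or SendTaskFailure with that token. The function and state names below are hypothetical.

```python
# Illustrative ASL state using the waitForTaskToken callback pattern.
# "NotifyReviewers" is a hypothetical Lambda function that would email
# the error details and the task token to the operations mailing list.
manual_review_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "NotifyReviewers",
        "Payload": {
            "taskToken.$": "$$.Task.Token",  # the reviewer's reply must return this token
            "error.$": "$.error",
        },
    },
    "Next": "RetryFailedStep",
}

print(manual_review_state["Resource"])
```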

The only challenge the Development team found was the mechanism for re-running an individual failed step. At this time, any failure needs to have an explicit state transition within the Step Function’s state diagram. While the Step Function can auto-retry a step, the team wanted to insert a wait-for-human-investigation step before retrying the more expensive and complex steps.

This presented two options:

  1. add wait-and-loop-back steps around every step we may want to retry,
  2. route all failures to a single wait-for-investigation step.

The former added significant complexity to the state machine, so AWS Solutions Architects raised this as a product feature that should be added to the Step Functions UI.

The proposed enhancement would allow any failed step to be manually rerun or skipped via UI controls, without adding explicit steps to each state machine to model this. In the meantime, the Dev team went with the latter approach, and had the human error review step loop back to the top of the state machine to retry the entire workflow. To avoid re-running long steps, they created a check within the step Lambda function to query the Step Functions API and determine whether that step had already succeeded before the loop-back, and complete it instantly if it had.
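The skip-if-already-succeeded check can be sketched as follows. In the real workflow the set of completed steps would be derived from the Step Functions API (for example, the execution history); here it is passed in directly, and all names are illustrative:

```python
def run_step(step_name, completed_steps, do_work):
    """Run a workflow step only if it has not already succeeded.

    completed_steps stands in for the result of querying the Step
    Functions execution history before the loop-back; do_work is the
    step's real (potentially expensive) logic.
    """
    if step_name in completed_steps:
        # Completed on a previous pass: finish instantly instead of redoing work.
        return {"step": step_name, "status": "skipped"}
    return {"step": step_name, "status": "succeeded", "output": do_work()}

# After a loop-back retry, the expensive restore step completes instantly,
# while the steps that still need to run execute normally.
completed = {"RestoreDatabase"}
print(run_step("RestoreDatabase", completed, lambda: "restored"))
print(run_step("CompareReports", completed, lambda: "0 differences"))
```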


Figure 9 – Human Intervention Steps Used to Allow Time to Review Results or Resolve Errors Before Retrying Failed Steps


Within 6 weeks, WSO was able to run the first reconciliations and begin the LIBOR migration on time. The Step Function Designer instantly gave the Development team an operator UI and workflow orchestration engine. Normally, this would have required building an entire 3-tier stack, scheduler, and logging infrastructure.

Instead, using Step Functions allowed the developers to spend their time on the reconciliation logic that makes their application unique. The report compare tool developed by the WSO team provides clients with automated artifacts confirming that customer report data remained identical between current version and next version of WSO. The new testing artifacts provide clients with robust and comprehensive testing of critical data extracts.

Figure 10 – The Completed Step Function which Orchestrates an Entire Reconciliation Process


We hope that this blog post provided useful insights to help you determine whether AWS Step Functions is a good fit for you.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Vertical Integration Strategy Powered by Amazon EventBridge

Post Syndicated from Tiago Oliveira original https://aws.amazon.com/blogs/architecture/vertical-integration-strategy-powered-by-amazon-eventbridge/

Over the past few years, midsize and large enterprises have adopted vertical integration as part of their strategy to optimize operations and profitability. Vertical integration consists of separating different stages of the production line from other related departments, such as marketing and logistics. Enterprises implement such strategy to gain full control of their value chain: from the raw material production to the assembly lines and end consumer.

To achieve operational efficiency, enterprises must keep a level of independence between departments. However, this can lead to unstandardized operations and communication issues. Moreover, with this kind of autonomy for independent and dynamic verticals, the enterprise may lose some measure of visibility and control. As a result, it becomes challenging to generate a basic report from multiple departments. This blog post provides a high-level solution to integrate your different business verticals, using an event-driven architecture on top of Amazon EventBridge.

Event-driven architecture

Event-driven architecture is an architectural pattern to model communication between services while decoupling applications from each other. Applications scale and fail independently, and a central event bus facilitates the communication between the services in the enterprise. Instead of a particular application sending a request directly to another, it produces an event. The central event router captures it and forwards the message to the proper destinations.

For instance, when a customer places a new order on the retail website, the application sends the event to the event bus. The event bus then forwards the message to the ERP system and the fulfillment center for dispatch. In this scenario, the application sending the event is called an event publisher, and the applications receiving the event are called event consumers.
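The decoupling can be illustrated with a toy in-memory event bus (a drastic simplification of what Amazon EventBridge provides; all names are hypothetical):

```python
from collections import defaultdict

class EventBus:
    """Toy central event bus: publishers emit events by type, and every
    consumer subscribed to that type receives a copy."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, consumer):
        self._subscribers[event_type].append(consumer)

    def publish(self, event_type, detail):
        for consumer in self._subscribers[event_type]:
            consumer(detail)

bus = EventBus()
received = []
bus.subscribe("OrderPlaced", lambda d: received.append(("erp", d)))
bus.subscribe("OrderPlaced", lambda d: received.append(("fulfillment", d)))

# The retail website publishes once; both consumers react independently,
# and neither the publisher nor the consumers know about each other.
bus.publish("OrderPlaced", {"orderId": "A-1001"})
print(received)
```

Adding a third consumer (for example, a data lake writer) requires only one more `subscribe` call, with no change to the publisher.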

Because all messages are going through the central event bus, there is clear independence between the applications within the enterprise. Here are some benefits:

  • Application independence occurs even if they belong to the same business workflow
  • You can plug in more event consumers to receive the same event type
  • You can add a data lake to receive all new order events from the retail website
  • You can receive all the events from the payment system and the customer relations department

This ensures you can integrate independent departments, increase overall visibility, and make sense of specific processes happening in the organization using the right tools.

Implementing event-driven architecture with Amazon EventBridge

Each vertical organically generates lifecycle events. Enterprises can use the event-driven architecture paradigm to make the information flow between the departments by asynchronously exchanging events through the event bus. This way, each department can react to events generated by other departments and initiate processes or actions depending on its business needs.

Such an approach creates a dynamic and flexible choreography between the different participants, which is unique to the enterprise. Such choreography can be followed and monitored using analytics and fine-grained event data collected on the data lake. Read Using AWS X-Ray tracing with Amazon EventBridge to learn how to debug and analyze this kind of distributed application.

Figure 1. Architecture diagram depicting enterprise vertical integration with Amazon EventBridge

In Figure 1, Amazon EventBridge works as the central event bus, the core component of this event-driven architecture. Through Amazon EventBridge, each event publisher sends or receives lifecycle events to and from all the other participants. Amazon EventBridge has an advanced routing mechanism using the concept of rules. Each rule defines up to five targets for the event arriving on the bus. Events are selected based on the event pattern. You can set up routing rules to determine where to send your data to build application architectures. These will react in real time to your data sources, with event publisher and consumer decoupled.
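To show how a rule's event pattern selects events, here is a minimal matcher covering only the simplest subset of EventBridge pattern semantics (exact-value lists and nested fields; real patterns also support prefix, numeric, anything-but, and other operators). The pattern and events are hypothetical:

```python
def matches(pattern, event):
    """Return True if the event satisfies the pattern: every field in the
    pattern must exist in the event, and the event's value must be one of
    the pattern's allowed values (nested dicts are matched recursively)."""
    for field, allowed in pattern.items():
        value = event.get(field)
        if isinstance(allowed, dict):
            if not isinstance(value, dict) or not matches(allowed, value):
                return False
        elif value not in allowed:
            return False
    return True

# A rule's event pattern, and two events arriving on the bus.
rule_pattern = {"source": ["retail.website"], "detail-type": ["OrderPlaced"]}

order_event = {"source": "retail.website", "detail-type": "OrderPlaced",
               "detail": {"orderId": "A-1001"}}
payment_event = {"source": "payments", "detail-type": "PaymentCaptured"}

print(matches(rule_pattern, order_event))    # → True: routed to this rule's targets
print(matches(rule_pattern, payment_event))  # → False: ignored by this rule
```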

In addition to handling the routing and distribution of events, Amazon EventBridge can also give real-time insights into how the business runs. Using metrics automatically sent to Amazon CloudWatch, it is possible to see which kinds of events are arriving, and at which rate. You can also see how those events are distributed across the registered targets, and any failures that occur during this distribution. Every event can also be archived using the Amazon EventBridge event archiving feature.

Amazon Simple Storage Service (S3) is the backend storage, or data lake, for all the events that have ever transited via the event bus. With Amazon S3, customers have a cost-efficient storage service at any scale, with 11 9s of durability. To help customers manage and secure their data, S3 provides features such as Amazon S3 Lifecycle to optimize costs, and S3 Object Lock, which enables the write-once-read-many (WORM) model. You can expand this data and transform it into information: using services like Amazon Athena, Amazon Redshift, and Amazon EMR, those events can be transformed, correlated, and aggregated to generate insights on the business. The Amazon S3 data lake can also be the input to a data warehouse, machine learning models, and real-time analytics. Learn more about how to use Amazon S3 as the data lake storage.

A critical feature of this solution is the ability to initiate complex queries on top of the data lake. Amazon API Gateway provides a single flexible and elastic API entry point to retrieve data from the data lake, and it can also publish events directly to the event bus. For complex queries, Amazon API Gateway can be integrated with an AWS Lambda function, which coordinates the execution of standard SQL queries using Amazon Athena as the query engine. You can read about a fully functional example of such an API called athena-express.

After collecting data from multiple departments, third-party entities, and shop floors, you can use the data to derive business value using cross-organization dashboards. In this way, you can increase visibility over the different entities and make sense of the data from the distributed systems. Even though this design allows you to use your favorite BI tool, we are using Amazon QuickSight for this solution. For example, with QuickSight, you can author your interactive dashboards, which include machine learning-powered insights. Those dashboards can then connect the marketing campaigns data with the sales data. You can measure how effective those campaigns were and forecast the demand on the production lines.


In this blog post, we showed you how to use Amazon EventBridge as an event bus to allow event-driven architectures. This architecture pattern streamlines the adoption of vertical integration. Enterprises can decouple IT systems from each other while retaining visibility into the data they generate. Integrating those systems can happen asynchronously using a choreography approach instead of having an orchestrator as a central component. There are technical challenges to implement this kind of solution, such as maintaining consistency in distributed applications and transactions spanning multiple microservices. Refer to the saga pattern for microservices-based architecture, and how to implement it using AWS Step Functions.

With a data lake in place to collect all the data produced by IT systems, you can create BI dashboards that provide a holistic view of multiple departments. Moreover, it allows organizations to get better insights into their valuable data and explore other use cases, such as machine learning. To support the data lake creation and management, refer to AWS Lake Formation and a series of other blog posts.

To learn more about Amazon EventBridge from a hands-on perspective, refer to this EventBridge workshop.

Top 5 Architecture Blog Posts for Q2 2021

Post Syndicated from Bonnie McClure original https://aws.amazon.com/blogs/architecture/top-5-architecture-blog-posts-for-q2-2021/

The goal of the AWS Architecture Blog is to highlight best practices and provide architectural guidance. We publish thought leadership pieces that encourage readers to discover other technical documentation such as solutions and managed solutions, other AWS blogs, videos, reference architectures, whitepapers, and guides, training and certification, case studies, and the AWS Architecture Monthly Magazine. We welcome your contributions!

Field Notes is a series of posts within the Architecture Blog channel that provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers based on their experiences in the field solving real-world business problems for customers.

A big thank you to you, our readers, for spending time on our blog this past quarter. Of course, we wouldn’t have content for you to read without our hard-working AWS Solutions Architects and other blog post writers, so thank you to them as well! Without further ado, the following five posts were the top Architecture Blog and Field Notes blog posts published in Q2 (April through June 2021).

Honorable Mention: Managing Asynchronous Workflows with a REST API

by Scott Gerring

With 3,400 views since mid-May, Scott’s post is worth mentioning in our leaderboard. In this post, Scott shows you common patterns for handling REST API operations, their advantages/disadvantages, and their typical AWS serverless implementations. Understanding the options you have to build such a system with AWS serverless solutions is important to choosing the right tool for your problem.

 Serverless polling architecture

#5: Scaling RStudio/Shiny using Serverless Architecture and AWS Fargate

by Chayan Panda, Michael Hsieh, and Mukosi Mukwevho

In this post, Chayan, Michael, and Mukosi discuss serverless architecture that addresses common challenges of hosting RStudio/Shiny servers. They show you best practices adapted from AWS Well-Architected. This architecture provides data science teams a secure, scalable, and highly available environment while reducing infrastructure management overhead.

RStudio-Shiny Open Source Deployment on AWS Serverless Infrastructure

#4: Disaster Recovery (DR) Architecture on AWS, Part III: Pilot Light and Warm Standby

by Seth Eliot

You’ll notice a recurring theme in this Top 5 post—Seth’s four-part DR series is really popular! Throughout the series, Seth shows you different strategies to prepare your workload for disaster events: natural disasters such as earthquakes or floods, technical failures such as power or network loss, and human actions such as inadvertent or unauthorized modifications. Part III discusses two strategies to prepare your workload for a disaster event: pilot light and warm standby. The post shows you how to implement these strategies to limit data loss and downtime, and how to get the most out of your setup.

Pilot light DR strategy

#3: Disaster Recovery (DR) Architecture on AWS, Part II: Backup and Restore with Rapid Recovery

by Seth Eliot

Part II covers the backup and restore strategy, the easiest and least expensive strategy to implement in the series. It shows you how, by using automation, you can minimize recovery time objectives and therefore lessen the impacts of downtime in the event of a disaster.

Backup and restore DR strategy

#2: Disaster Recovery (DR) Architecture on AWS, Part I: Strategies for Recovery in the Cloud | AWS Architecture Blog

by Seth Eliot

Part I gives you an overview of each strategy in the series (backup and restore, pilot light, warm standby, and multi-site active/active) and how to select the best strategy for your business needs. Disaster events pose a threat to your workload availability, but by using AWS Cloud services you can mitigate or remove these threats.

DR strategies – trade-offs between RTO/RPO and costs

#1: Issues to Avoid When Implementing Serverless Architecture with AWS Lambda

by Andrei Maksimov

With almost 8,000 views as of this Top 5 post’s publication date, Andrei’s post has been a hit this quarter! In the post, he highlights eight common anti-patterns (solutions that may look like the right solution but end up being less effective than intended). He provides recommendations to avoid these patterns to ensure that your system is performing at its best.

Your contributions to the blog are immensely valuable to all our customers! Keep on writing!

Field Notes: Use AWS Cloud9 to Power Your Visual Studio Code IDE

Post Syndicated from Nick Ragusa original https://aws.amazon.com/blogs/architecture/field-notes-use-aws-cloud9-to-power-your-visual-studio-code-ide/

Everyone has their favorite integrated development environment, or IDE, as it’s more commonly known. For many of us, it’s a tool that we rely on for our day-to-day activities. In some instances, it’s a tool we’ve spent years getting set up just the way we want – from the theme that looks the best to the most productive plugins and extensions that help us optimize our workflows.

After many iterations of trying different IDEs, I’ve chosen Visual Studio Code as my daily driver. One of the things I appreciate most about Visual Studio Code is its vast ecosystem of extensions, which lets me extend its core functionality to exactly what I need. I’ve spent hours installing (and sometimes subsequently removing) extensions and figuring out keyboard shortcut combinations, but it’s the theme and syntax highlighting that I find the most appealing. Despite all of this time invested in building the perfect IDE, it can still fall somewhat short.

One of the hardest challenges to overcome, regardless of IDE, is the fact that it’s confined to the hardware that it’s installed on. Some of these challenges include running out of local disk space, RAM exhaustion, or requiring more CPU cores to process a build. In this blog post, I show you how to overcome the limitations of your laptop or desktop by using AWS Cloud9 to power your Visual Studio Code IDE.

Solution Overview

One way to overcome the limitations of your laptop or desktop is to use a remote machine. You’ve probably tried to mount some remote file system over SSH on your machine so you could continue to use your IDE, but timeouts and connection resets made that a frustrating and unusable solution.

Imagine a solution where you can:

  • Use SSH to connect to a remote instance, install all of your favorite extensions on the remote instance, and take advantage of the remote machine’s resources including CPU, disk, RAM, and more.
  • Access the instance without needing to open security groups and ACLs from your (often changing) desktop IP.
  • Leverage your existing AWS infrastructure including your VPC, IAM user / role and policies, and take advantage of AWS Cloud9 to power your remote instance.

With AWS Cloud9, you start with an environment pre-packaged with essential tools for popular programming languages, coupled with the power of Amazon EC2. As an added benefit, you can use AWS Cloud9’s built-in auto-shutdown capability, which powers off your instance when you’re not actively connected to it. The following diagram shows the architecture you can build to use AWS Cloud9 to power your Visual Studio Code IDE.

Visual Studio Ref Architecture


I will walk you through setting up Visual Studio Code with the Remote SSH extension. Once it’s installed, I’ll show you how to use the features of AWS Cloud9 to power your Visual Studio Code IDE, including how to:

  • Automatically power on your instance when you connect to it
  • Configure your SSH client to use AWS Systems Manager Session Manager to connect to your AWS Cloud9 instance
  • Modify your AWS Cloud9 instance to shut down after you disconnect

Before you get started, launch the following AWS CloudFormation template in your AWS account. This is the easiest way to get up and running so you can evaluate the solution.

launch stack button

Once launched, a second CloudFormation stack is deployed named aws-cloud9-VS-Code-Remote-SSH-Demo-<unique id>.

  • Select this stack from your console.
  • Select the Resources tab, and take note of the instance Physical ID.
  • Save this value for use later on.

CloudFormation Screenshot


Finally, you may wish to encrypt your AWS Cloud9 instance’s EBS volume for an additional layer of security. Follow this procedure to encrypt your AWS Cloud9 instance’s volume.


For this walkthrough, you should have the following available to you: an AWS account, Visual Studio Code, and the AWS CLI installed and configured.

Install Remote – SSH Extension

First, install the Remote SSH extension in Visual Studio Code. You can either click the Install button from the preceding link, or open the Extensions panel (View -> Extensions) within Visual Studio Code and search for Remote SSH.

Install Remote – SSH Extension

Once installed, there are a few extension settings I suggest modifying for this solution. To access them, click the gear icon for the extension in the Extensions panel, and go to Extension Settings.

Here, you will land on a screen similar to the following. Adjust:

  • Remote.SSH: Connect Timeout – I suggest putting a value such as 60 here. Doing so will prevent a timeout from happening in the event your AWS Cloud9 instance is powered off. This gives it ample time to power on and get ready.
  • Remote.SSH: Default Extensions – For your essential extensions, specify them on this screen. Whenever you connect to a new remote host, these extensions will be installed by default.
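In your user settings.json, those two options look like the following sketch (the extension IDs in the default list are examples only; substitute the extensions you actually use):

```json
{
  // Give a powered-off AWS Cloud9 instance time to boot before SSH gives up.
  "remote.SSH.connectTimeout": 60,
  // Extensions installed automatically on every new remote host (example IDs).
  "remote.SSH.defaultExtensions": [
    "ms-python.python",
    "dbaeumer.vscode-eslint"
  ]
}
```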

Screen Settings screenshot

Install the Session Manager plugin for the AWS CLI

Next, install the Session Manager plugin, which lets the AWS CLI start and stop sessions that connect to your EC2 instances. This plugin can be installed on supported versions of Microsoft Windows, macOS, Linux, and Ubuntu Server.

Note: the plugin requires AWS CLI version 1.16.12 or later to work.
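You can gate on that minimum with a quick shell comparison. This is a sketch: CLI_VERSION is hard-coded here, where in practice you would parse it from `aws --version` output.

```shell
# Minimum-version check sketch for the AWS CLI.
# In practice, populate CLI_VERSION from the CLI itself, e.g.:
#   aws --version 2>&1 | cut -d/ -f2 | cut -d' ' -f1
MIN_VERSION=1.16.12
CLI_VERSION=1.18.69   # stand-in value for illustration

# sort -V orders version strings; if ours sorts last, we meet the minimum.
newest=$(printf '%s\n%s\n' "$MIN_VERSION" "$CLI_VERSION" | sort -V | tail -n 1)
if [ "$newest" = "$CLI_VERSION" ]; then
  echo "AWS CLI $CLI_VERSION meets the $MIN_VERSION minimum"
else
  echo "AWS CLI $CLI_VERSION is older than $MIN_VERSION; upgrade first"
fi
```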

Create an SSH key pair

An SSH key pair is required to access our instance over SSH. Public key authentication provides far greater cryptographic strength than even the most complex password. My macOS environment includes the ssh-keygen utility, which I will use to create the key pair. From a terminal, run:

$ ssh-keygen -b 4096 -C 'VS Code Remote SSH user' -t rsa

To avoid any confusion with existing SSH keys, I chose to save my key to /Users/nickragusa/.ssh/id_rsa-cloud9 for this example.

Now that we have a key pair created, we need to copy the public key to the authorized_keys file on the AWS Cloud9 instance. Since I saved my key to /Users/nickragusa/.ssh/id_rsa-cloud9, the corresponding public key was saved to /Users/nickragusa/.ssh/id_rsa-cloud9.pub.
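The public half is what you paste into authorized_keys, and it can always be re-derived from the private key with `ssh-keygen -y` if you misplace the .pub file. The snippet below demonstrates this with a throwaway key in a temporary directory (your real key lives at ~/.ssh/id_rsa-cloud9):

```shell
# Create a throwaway key pair, then confirm the stored .pub file
# matches what ssh-keygen derives from the private key.
dir=$(mktemp -d)
ssh-keygen -q -t rsa -b 2048 -N '' -C 'demo key' -f "$dir/id_demo"
derived=$(ssh-keygen -y -f "$dir/id_demo" | awk '{print $1, $2}')
stored=$(awk '{print $1, $2}' "$dir/id_demo.pub")
[ "$derived" = "$stored" ] && echo "public key matches private key"
rm -rf "$dir"
```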

Open the AWS Cloud9 service console in the same region you deployed the CloudFormation stack. Your environment should be named VS Code Remote SSH Demo. Select Open IDE:


AWS Cloud9 console

Now that you’re in your AWS Cloud9 IDE, edit ~/.ssh/authorized_keys and paste your public key, being sure to paste it below the existing lines. Following is an example of using the AWS Cloud9 editor to first reveal the file and then edit it:

AWS Cloud9 Editor

Save the file and move on to the next steps.

Modify the shutdown script

As an optional cost-saving measure, AWS Cloud9 lets you specify a timeout value in minutes; when it is reached, the instance automatically powers down. This works perfectly when you are connected to your AWS Cloud9 IDE through your browser. In our use case, however, we connect to the instance strictly via SSH from our Visual Studio Code IDE, so we modify the shutdown logic on the instance itself to check for connectivity from our IDE; if we’re actively connected, the automatic shutdown is prevented.

To prevent the instance from shutting down while connected from Visual Studio Code, download this script and place it in ~/.c9/stop-if-inactive.sh. From your AWS Cloud9 instance, run:

# Save a copy of the script first
$ sudo mv ~/.c9/stop-if-inactive.sh ~/.c9/stop-if-inactive.sh-SAVE
$ curl https://raw.githubusercontent.com/aws-samples/cloud9-to-power-vscode-blog/main/scripts/stop-if-inactive.sh -o ~/.c9/stop-if-inactive.sh
$ sudo chown root:root ~/.c9/stop-if-inactive.sh
$ sudo chmod 755 ~/.c9/stop-if-inactive.sh

That’s all of the configuration changes that are needed on your AWS Cloud9 instance. The remainder of the work will be done on your laptop or desktop.

Two pieces on your local machine tie the solution together:

  • ssm-proxy.sh – This script launches each time you SSH to the AWS Cloud9 instance. Upon invocation, the script checks the current state of your instance. If your instance is powered off, it runs the aws ec2 start-instances CLI command to power on the instance and waits for it to start. Once the instance is started, the script uses the aws ssm start-session command, along with the Session Manager plugin installed earlier, to create an SSH session with the instance via AWS Systems Manager Session Manager.
  • ~/.ssh/config – For OpenSSH clients, this file defines any user-specific SSH configurations you need. In this file, you specify the EC2 instance ID of your AWS Cloud9 instance, the private SSH key to use, the user to connect as, and the location of the ssm-proxy.sh script.

Start by downloading the ssm-proxy.sh script to your machine. To keep things organized, I downloaded this script to my ~/.ssh/ folder.

$ curl https://raw.githubusercontent.com/aws-samples/cloud9-to-power-vscode-blog/main/scripts/ssm-proxy.sh -o ~/.ssh/ssm-proxy.sh
$ chmod +x ~/.ssh/ssm-proxy.sh

Next, edit this script to match your environment. Specifically, there are four variables at the top of ssm-proxy.sh that you may need to edit:

  • AWS_PROFILE – If you use named profiles, you specify the appropriate profile here. If you do not use named profiles, specify a value of default here.
  • AWS_REGION – The Region you launched your AWS Cloud9 environment into.
  • MAX_ITERATION – The number of times this script will check if the environment has successfully powered on.
  • SLEEP_DURATION – The number of seconds to wait between MAX_ITERATION checks.
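The start-and-wait behavior these variables control can be sketched as a plain polling loop. This is a simplified illustration, not the script itself: check_running is a stub standing in for the real script's aws ec2 state query.

```shell
#!/bin/sh
# Simplified sketch of the wait loop inside ssm-proxy.sh.
MAX_ITERATION=5     # how many times to check before giving up
SLEEP_DURATION=1    # seconds between checks (the real script waits longer)

calls=0
check_running() {
  # Stub for the EC2 state query; reports "running" on the third call.
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}

i=0
while [ "$i" -lt "$MAX_ITERATION" ]; do
  if check_running; then
    echo "instance running after $((i + 1)) checks"
    break
  fi
  sleep "$SLEEP_DURATION"
  i=$((i + 1))
done
```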

Once this script is set up, you need to edit your SSH configuration. You can use the example provided to get you started.

Edit ~/.ssh/config and paste the contents of the example to the bottom of the file. Edit any values that are unique to your environment, specifically:

  • IdentityFile – This should be the full path to the private key you created earlier.
  • HostName – Specify the EC2 instance ID of your AWS Cloud9 environment. This is the instance Physical ID you saved from the CloudFormation stack’s Resources tab earlier.
  • ProxyCommand – Make sure to specify the full path of the location where you downloaded the ssm-proxy.sh script from the previous step.
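Putting those values together, a finished entry looks something like the following sketch. The instance ID and paths are placeholders, and the ec2-user login name assumes an Amazon Linux-based AWS Cloud9 instance; mirror the repository's example file for the exact form.

```
Host cloud9
    HostName i-0123456789abcdef0
    User ec2-user
    Port 22
    IdentityFile ~/.ssh/id_rsa-cloud9
    ProxyCommand sh -c "~/.ssh/ssm-proxy.sh %h %p"
```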

Test the solution

Your AWS Cloud9 environment is now up, and you have configured your local SSH configuration to use AWS Systems Manager Session Manager for connectivity. Time to try it all out!

From the Visual Studio Code editor, open the command palette, start typing remote-ssh to bring up the correct command, and select Remote-SSH: Connect Current Window to Host.

Command Palette screenshot

A list of your SSH hosts should appear as defined in ~/.ssh/config. Select the host called cloud9 and wait for the connection to establish.

Now your Visual Studio Code environment is powered by your AWS Cloud9 instance! As some next steps, I suggest:

  • Open the integrated terminal so you can explore the file system and check what tools are already installed.
  • Browse the extension gallery and install any extensions you frequently use. Remember that extensions are installed locally – so any extensions you may have already installed on your desktop / laptop are not automatically available in this remote environment.

Cleaning up

To avoid incurring any future charges, delete the resources you created. I recommend deleting the CloudFormation stack you launched at the beginning of this walkthrough.


Conclusion

In this post, I covered how you can use the Remote SSH extension within Visual Studio Code to connect to your AWS Cloud9 instance. By making some adjustments to your local OpenSSH configuration, you can use AWS Systems Manager Session Manager to establish an SSH connection to your instance. This is done without the need to open an SSH specific port in the instance’s security group.

Now that you can power your Visual Studio Code IDE with AWS Cloud9, how will you take your IDE to the next level? Feel free to share some of your favorite extensions, keyboard shortcuts, or your vastly improved build times in the comments!

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.