Tag Archives: Amazon Machine Learning

Gaining operational insights with AIOps using Amazon DevOps Guru

Post Syndicated from Nikunj Vaidya original https://aws.amazon.com/blogs/devops/gaining-operational-insights-with-aiops-using-amazon-devops-guru/

Amazon DevOps Guru offers a fully managed AIOps platform service that enables developers and operators to improve application availability and resolve operational issues faster. It minimizes manual effort by leveraging machine learning (ML) powered recommendations. Its ML models take advantage of AWS’s expertise in operating highly available applications for the world’s largest e-commerce business for over 20 years. DevOps Guru automatically detects operational issues, predicts impending resource exhaustion, details likely causes, and recommends remediation actions.

This post walks you through how to enable DevOps Guru for your account in a typical serverless environment and observe the insights and recommendations generated for various activities. These insights are generated for operational events that could pose a risk to your application availability. DevOps Guru uses AWS CloudFormation stacks as the application boundary to detect resource relationships and co-relate with deployment events.

Solution overview

The goal of this post is to demonstrate the insights generated for anomalies detected by DevOps Guru from DevOps operations. If you don’t have a test environment and want to build out infrastructure to test the generation of insights, then you can follow along through this post. However, if you have a test or production environment (preferably spawned from CloudFormation stacks), you can skip the first section and jump directly to section, Enabling DevOps Guru and injecting traffic.

The solution includes the following steps:

1. Deploy a serverless infrastructure.

2. Enable DevOps Guru and inject traffic.

3. Review the generated DevOps Guru insights.

4. Inject another failure to generate a new insight.

As depicted in the following diagram, we use a CloudFormation stack to create a serverless infrastructure, comprising of Amazon API Gateway, AWS Lambda, and Amazon DynamoDB, and inject HTTP requests at a high rate towards the API published to list records.

Serverless infrastructure monitored by DevOps Guru

When DevOps Guru is enabled to monitor your resources within the account, it uses a combination of vended Amazon CloudWatch metrics and specific patterns from its ML models to detect anomalies. When an anomaly is detected, it generates an insight with the recommendations.

The generation of insights is dependent upon several factors. Although this post provides a canned environment to reproduce insights, the results may vary depending upon traffic pattern, timings of traffic injection, and so on.


To complete this tutorial, you should have access to an AWS Cloud9 environment and the AWS Command Line Interface (AWS CLI).

Deploying a serverless infrastructure

To deploy your serverless infrastructure, you complete the following high-level steps:

1.   Create your IDE environment.

2.   Launch the CloudFormation template to deploy the serverless infrastructure.

3.   Populate the DynamoDB table.


1. Creating your IDE environment

We recommend using AWS Cloud9 to create an environment to get access to the AWS CLI from a bash terminal. AWS Cloud9 is a browser-based IDE that provides a development environment in the cloud. While creating the new environment, ensure you choose Linux2 as the operating system. Alternatively, you can use your bash terminal in your favorite IDE and configure your AWS credentials in your terminal.

When access is available, run the following command to confirm that you can see the Amazon Simple Storage Service (Amazon S3) buckets in your account:

aws s3 ls

Install the following prerequisite packages and ensure you have Python3 installed:

sudo yum install jq -y

export AWS_REGION=$(curl -s \ | jq -r '.region')

sudo pip3 install requests

Clone the git repository to download the required CloudFormation templates:

git clone https://github.com/aws-samples/amazon-devopsguru-samples
cd amazon-devopsguru-samples/generate-devopsguru-insights/

2. Launching the CloudFormation template to deploy your serverless infrastructure

To deploy your infrastructure, complete the following steps:

  • Run the CloudFormation template using the following command:
aws cloudformation create-stack – stack-name myServerless-Stack \
--template-body file:///$PWD/cfn-shops-monitoroper-code.yaml \

The AWS CloudFormation deployment creates an API Gateway, a DynamoDB table, and a Lambda function with sample code.

  • When it’s complete, go to the Outputs tab of the stack on the AWS CloudFormation console.
  • Record the links for the two APIs: one of them to list the table contents and other one to populate the contents.

3. Populating the DynamoDB table

Run the following commands (simply copy-paste) to populate the DynamoDB table. The below three commands will identify the name of the DynamoDB table from the CloudFormation stack and populate the name in the populate-shops-dynamodb-table.json file.

dynamoDBTableName=$(aws cloudformation list-stack-resources \
--stack-name myServerless-Stack | \
jq '.StackResourceSummaries[]|select(.ResourceType == "AWS::DynamoDB::Table").PhysicalResourceId' | tr -d '"')
sudo sed -i s/"<YOUR-DYNAMODB-TABLE-NAME>"/$dynamoDBTableName/g \
aws dynamodb batch-write-item \
--request-items file://populate-shops-dynamodb-table.json

The command gives the following output:

"UnprocessedItems": {}

This populates the DynamoDB table with a few sample records, which you can verify by accessing the ListRestApiEndpointMonitorOper API URL published on the Outputs tab of the CloudFormation stack. The following screenshot shows the output.

Screenshot showing the API output

Enabling DevOps Guru and injecting traffic

In this section, you complete the following high-level steps:

1.   Enable DevOps Guru for the CloudFormation stack.

2.   Wait for the serverless stack to complete.

3.   Update the stack.

4.   Inject HTTP requests into your API.


1. Enabling DevOps Guru for the CloudFormation stack

To enable DevOps Guru for CloudFormation, complete the following steps:

  • Run the CloudFormation template to enable DevOps Guru for this CloudFormation stack:
aws cloudformation create-stack \
--stack-name EnableDevOpsGuruForServerlessCfnStack \
--template-body file:///$PWD/EnableDevOpsGuruForServerlessCfnStack.yaml \
--parameters ParameterKey=CfnStackNames,ParameterValue=myServerless-Stack \
--region ${AWS_REGION}
  • When the stack is created, navigate to the Amazon DevOps Guru console.
  • Choose Settings.
  • Under CloudFormation stacks, locate myServerless-Stack.

If you don’t see it, your CloudFormation stack has not been successfully deployed. You may remove and redeploy the EnableDevOpsGuruForServerlessCfnStack stack.

Optionally, you can configure Amazon Simple Notification Service (Amazon SNS) topics or enable AWS Systems Manager integration to create OpsItem entries for every insight created by DevOps Guru. If you need to deploy as a stack set across multiple accounts and Regions, see Easily configure Amazon DevOps Guru across multiple accounts and Regions using AWS CloudFormation StackSets.

2. Waiting for baselining of resources

This is a necessary step to allow DevOps Guru to complete baselining the resources and benchmark the normal behavior. For our serverless stack with 3 resources, we recommend waiting for 2 hours before carrying out next steps. When enabled in a production environment, depending upon the number of resources selected to monitor, it can take up to 24 hours for it to complete baselining.

Note: Unlike many monitoring tools, DevOps Guru does not expect the dashboard to be continuously monitored and thus under normal system health, the dashboard would simply show zero’ed counters for the ongoing insights. It is only when an anomaly is detected, it will raise an alert and display an insight on the dashboard.

3. Updating the CloudFormation stack

When enough time has elapsed, we will make a configuration change to simulate a typical operational event. As shown below, update the CloudFormation template to change the read capacity for the DynamoDB table from 5 to 1.

CloudFormation showing read capacity settings to modify

Run the following command to deploy the updated CloudFormation template:

aws cloudformation update-stack – stack-name myServerless-Stack \
--template-body file:///$PWD/cfn-shops-monitoroper-code.yaml \

4. Injecting HTTP requests into your API

We now inject ingress traffic in the form of HTTP requests towards the ListRestApiEndpointMonitorOper API, either manually or using the Python script provided in the current directory (sendAPIRequest.py). Due to reduced read capacity, the traffic will oversubscribe the dynamodb tables, thus inducing an anomaly. However, before triggering the script, populate the url parameter with the correct API link for the  ListRestApiEndpointMonitorOper API, listed in the CloudFormation stack’s output tab.

After the script is saved, trigger the script using the following command:

python sendAPIRequest.py

Make sure you’re getting an output of status 200 with the records that we fed into the DynamoDB table (see the following screenshot). You may have to launch multiple tabs (preferably 4) of the terminal to run the script to inject a high rate of traffic.

Terminal showing script executing and injecting traffic to API

After approximately 10 minutes of the script running in a loop, an operational insight is generated in DevOps Guru.


Reviewing DevOps Guru insights

While these operations are being run, DevOps Guru monitors for anomalies, logs insights that provide details about the metrics indicating an anomaly, and prints actionable recommendations to mitigate the anomaly. In this section, we review some of these insights.

Under normal conditions, DevOps Guru dashboard will show the ongoing insights counter to be zero. It monitors a high number of metrics behind the scenes and offloads the operator from manually monitoring any counters or graphs. It raises an alert in the form of an insight, only when anomaly occurs.

The following screenshot shows an ongoing reactive insight for the specific CloudFormation stack. When you choose the insight, you see further details. The number under the Total resources analyzed last hour may vary, so for this post, you can ignore this number.

DevOps Guru dashboard showing an ongoing reactive insight

The insight is divided into four sections: Insight overview, Aggregated metrics, Relevant events, and Recommendations. Let’s take a closer look into these sections.

The following screenshot shows the Aggregated metrics section, where it shows metrics for all the resources that it detected anomalies in (DynamoDB, Lambda, and API Gateway). Note that depending upon your traffic pattern, lambda settings, baseline traffic, the list of metrics may vary. In the example below, the timeline shows that an anomaly for DynamoDB started first and was followed by API Gateway and Lambda. This helps us understand the cause and symptoms, and prioritize the right anomaly investigation.

The listing of metrics inside an Insight

Initially, you may see only two metrics listed, however, over time, it populates more metrics that showed anomalies. You can see the anomaly for DynamoDB started earlier than the anomalies for API Gateway and Lambda, thus indicating them as after effects. In addition to the information in the preceding screenshot, you may see Duration p90 and IntegrationLatency p90 (for Lambda and API Gateway respectively, due to increased backend latency) metrics also listed.

Now we check the Relevant events section, which lists potential triggers for the issue. The events listed here depend on the set of operations carried out on this CloudFormation stack’s resources in this Region. This makes it easy for the operator to be reminded of a change that may have caused this issue. The dots (representing events) that are near the Insight start portion of timeline are of particular interest.

Related Events shown inside the Insight

If you need to delve into any of these events, just click of any of these points, and it provides more details as shown below screenshot.

Delving into details of the related event listed in Insight

You can choose the link for an event to view specific details about the operational event (configuration change via CloudFormation update stack operation).

Now we move to the Recommendations section, which provides prescribed guidance for mitigating this anomaly. As seen in the following screenshot, it recommends to roll back the configuration change related to the read capacity for the DynamoDB table. It also lists specific metrics and the event as part of the recommendation for reference.

Recommendations provided inside the Insight

Going back to the metric section of the insight, when we select Graphed anomalies, it shows you the graphs of all related resources. Below screenshot shows a snippet showing anomaly for DynamoDB ReadThrottleEvents metrics. As seen in the below screenshot of the graph pattern, the read operations on the table are exceeding the provisioned throughput of read capacity. This clearly indicates an anomaly.

Graphed anomalies in DevOps Guru

Let’s navigate to the DynamoDB table and check our table configuration. Checking the table properties, we notice that our read capacity is reduced to 1. This is our root cause that led to this anomaly.

Checking the DynamoDB table capacity settings

If we change it to 5, we fix this anomaly. Alternatively, if the traffic is stopped, the anomaly moves to a Resolved state.

The ongoing reactive insight takes a few minutes after resolution to move to a Closed state.

Note: When the insight is active, you may see more metrics get populated over time as we detect further anomalies. When carrying out the preceding tests, if you don’t see all the metrics as listed in the screenshots, you may have to wait longer.


Injecting another failure to generate a new DevOps Guru insight

Let’s create a new failure and generate an insight for that.

1.   After you close the insight from the previous section, trigger the HTTP traffic generation loop from the AWS Cloud9 terminal.

We modify the Lambda functions’s resource-based policy by removing the permissions for API Gateway to access this function.

2.   On the Lambda console, open the function ScanFunctionMonitorOper.

3.   On the Permissions tab, access the policy.

Accessing the permissions tab for the Lambda


4.   Save a copy of the policy offline as a backup before making any changes.

5.   Note down the “Sid” values for the “AWS:SourceArn” that ends with prod/*/ and prod/*/*.

Checking the Resource-based policy for the Lambda

6.   Run the following command to remove the “Sid” JSON statements in your Cloud9 terminal:

aws lambda remove-permission – function-name ScanFunctionMonitorOper \
--statement-id <Sid-value-ending-with-prod/*/>

7.   Run the same command for the second Sid value:

aws lambda remove-permission – function-name ScanFunctionMonitorOper \
--statement-id <Sid-value-ending-with-prod/*/*>

You should see several 5XX errors, as in the following screenshot.

Terminal output now showing 500 errors for the script output

After less than 8 minutes, you should see a new ongoing reactive insight on the DevOps Guru dashboard.

Let’s take a closer look at the insight. The following screenshot shows the anomalous metric 5XXError Average of API Gateway and its duration. (This insight shows as closed because I had already restored permissions.)

Insight showing 5XX errors for API-Gateway and link to OpsItem

If you have configured to enable creating OpsItem in Systems Manager, you would see the link to OpsItem ID created in the insight, as shown above. This is an optional configuration, which will enable you to track the insights in the form of open tickets (OpsItems) in Systems Manager OpsCenter.

The recommendations provide guidance based upon the related events and anomalous metrics.

After the insight has been generated, reviewed, and verified, restore the permissions by running the following command:

aws lambda add-permission – function-name ScanFunctionMonitorOper  \
--statement-id APIGatewayProdPerm – action lambda:InvokeFunction \
--principal apigateway.amazonaws.com

If needed, you can insert the condition to point to the API Gateway ARN to allow only specific API Gateways to access the Lambda function.


Cleaning up

After you walk through this post, you should clean up and un-provision the resources to avoid incurring any further charges.

1.   To un-provision the CloudFormation stacks, on the AWS CloudFormation console, choose Stacks.

2.   Select each stack (EnableDevOpsGuruForServerlessCfnStack and myServerless-Stack) and choose Delete.

3.   Check to confirm that the DynamoDB table created with the stacks is cleaned up. If not, delete the table manually.

4.   Un-provision the AWS Cloud9 environment.



This post reviewed how DevOps Guru can continuously monitor the resources in your AWS account in a typical production environment. When it detects an anomaly, it generates an insight, which includes the vended CloudWatch metrics that breached the threshold, the CloudFormation stack in which the resource existed, relevant events that could be potential triggers, and actionable recommendations to mitigate the condition.

DevOps Guru generates insights that are relevant to you based upon the pre-trained machine-learning models, removing the undifferentiated heavy lifting of manually monitoring several events, metrics, and trends.

I hope this post was useful to you and that you would consider DevOps Guru for your production needs.


Top 15 Architecture Blog Posts of 2020

Post Syndicated from Jane Scolieri original https://aws.amazon.com/blogs/architecture/top-15-architecture-blog-posts-of-2020/

The goal of the AWS Architecture Blog is to highlight best practices and provide architectural guidance. We publish thought leadership pieces that encourage readers to discover other technical documentation, such as solutions and managed solutions, other AWS blogs, videos, reference architectures, whitepapers, and guides, Training & Certification, case studies, and the AWS Architecture Monthly Magazine. We welcome your contributions!

Field Notes is a series of posts within the Architecture blog channel which provide hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

We would like to thank you, our readers, for spending time on our blog this last year. Much appreciation also goes to our hard-working AWS Solutions Architects and other blog post writers. Below are the top 15 Architecture & Field Notes blog posts written in 2020.

#15: Field Notes: Choosing a Rehost Migration Tool – CloudEndure or AWS SMS

by Ebrahim (EB) Khiyami

In this post, Ebrahim provides some considerations and patterns where it’s recommended based on your migration requirements to choose one tool over the other.

Read Ebrahim’s post.

#14: Architecting for Reliable Scalability

by Marwan Al Shawi

In this post, Marwan explains how to architect your solution or application to reliably scale, when to scale and how to avoid complexity. He discusses several principles including modularity, horizontal scaling, automation, filtering and security.

Read Marwan’s post.

#13: Field Notes: Building an Autonomous Driving and ADAS Data Lake on AWS

by Junjie Tang and Dean Phillips

In this post, Junjie and Dean explain how to build an Autonomous Driving Data Lake using this Reference Architecture. They cover all steps in the workflow from how to ingest the data, to moving it into an organized data lake construct.

Read Junjie’s and Dean’s post.

#12: Building a Self-Service, Secure, & Continually Compliant Environment on AWS

by Japjot Walia and Jonathan Shapiro-Ward

In this post, Jopjot and Jonathan provide a reference architecture for highly regulated Enterprise organizations to help them maintain their security and compliance posture. This blog post provides an overview of a solution in which AWS Professional Services engaged with a major Global systemically important bank (G-SIB) customer to help develop ML capabilities and implement a Defense in Depth (DiD) security strategy.

Read Jopjot’s and Jonathan’s post.

#11: Introduction to Messaging for Modern Cloud Architecture

by Sam Dengler

In this post, Sam focuses on best practices when introducing messaging patterns into your applications. He reviews some core messaging concepts and shows how they can be used to address challenges when designing modern cloud architectures.

Read Sam’s post.

#10: Building a Scalable Document Pre-Processing Pipeline

by Joel Knight

In this post, Joel presents an overview of an architecture built for Quantiphi Inc. This pipeline performs pre-processing of documents, and is reusable for a wide array of document processing workloads.

Read Joel’s post.

#9: Introducing the Well-Architected Framework for Machine Learning

by by Shelbee Eigenbrode, Bardia Nikpourian, Sireesha Muppala, and Christian Williams

In the Machine Learning Lens whitepaper, the authors focus on how to design, deploy, and architect your machine learning workloads in the AWS Cloud. The whitepaper describes the general design principles and the five pillars of the Framework as they relate to ML workloads.

Read the post.

#8: BBVA: Helping Global Remote Working with Amazon AppStream 2.0

by Jose Luis Prieto

In this post, Jose explains why BBVA chose Amazon AppStream 2.0 to accommodate the remote work experience. BBVA built a global solution reducing implementation time by 90% compared to on-premises projects, and is meeting its operational and security requirements.

Read Jose’s post.

#7: Field Notes: Serverless Container-based APIs with Amazon ECS and Amazon API Gateway

by Simone Pomata

In this post, Simone guides you through the details of the option based on Amazon API Gateway and AWS Cloud Map, and how to implement it. First you learn how the different components (Amazon ECS, AWS Cloud Map, API Gateway, etc.) work together, then you launch and test a sample container-based API.

Read Simone’s post.

#6: Mercado Libre: How to Block Malicious Traffic in a Dynamic Environment

by Gaston Ansaldo and Matias Ezequiel De Santi

In this post, readers will learn how to architect a solution that can ingest, store, analyze, detect and block malicious traffic in an environment that is dynamic and distributed in nature by leveraging various AWS services like Amazon CloudFront, Amazon Athena and AWS WAF.

Read Gaston’s and Matias’ post.

#5: Announcing the New Version of the Well-Architected Framework

by Rodney Lester

In this post, Rodney announces the availability of a new version of the AWS Well-Architected Framework, and focuses on such issues as removing perceived repetition, adding content areas to explicitly call out previously implied best practices, and revising best practices to provide clarity.

Read Rodney’s post.

#4: Serverless Stream-Based Processing for Real-Time Insights

by Justin Pirtle

In this post, Justin provides an overview of streaming messaging services and AWS Serverless stream processing capabilities. He shows how it helps you achieve low-latency, near real-time data processing in your applications.

Read Justin’s post.

#3: Field Notes: Working with Route Tables in AWS Transit Gateway

by Prabhakaran Thirumeni

In this post, Prabhakaran explains the packet flow if both source and destination network are associated to the same or different AWS Transit Gateway Route Table. He outlines a scenario with a substantial number of VPCs, and how to make it easier for your network team to manage access for a growing environment.

Read Prabhakaran’s post.

#2: Using VPC Sharing for a Cost-Effective Multi-Account Microservice Architecture

by Anandprasanna Gaitonde and Mohit Malik

Anand and Mohit present a cost-effective approach for microservices that require a high degree of interconnectivity and are within the same trust boundaries. This approach requires less VPC management while still using separate accounts for billing and access control, and does not sacrifice scalability, high availability, fault tolerance, and security.

Read Anand’s and Mohit’s post.

#1: Serverless Architecture for a Web Scraping Solution

by Dzidas Martinaitis

You may wonder whether serverless architectures are cost-effective or expensive. In this post, Dzidas analyzes a web scraping solution. The project can be considered as a standard extract, transform, load process without a user interface and can be packed into a self-containing function or a library.

Read Dzidas’ post.

Thank You

Thanks again to all our readers and blog post writers! We look forward to learning and building amazing things together in 2021.

Field Notes: Improving Call Center Experiences with Iterative Bot Training Using Amazon Connect and Amazon Lex

Post Syndicated from Marius Cealera original https://aws.amazon.com/blogs/architecture/field-notes-improving-call-center-experiences-with-iterative-bot-training-using-amazon-connect-and-amazon-lex/

This post was co-written by Abdullah Sahin, senior technology architect at Accenture, and Muhammad Qasim, software engineer at Accenture. 

Organizations deploying call-center chat bots are interested in evolving their solutions continuously, in response to changing customer demands. When developing a smart chat bot, some requests can be predicted (for example following a new product launch or a new marketing campaign). There are however instances where this is not possible (following market shifts, natural disasters, etc.)

While voice and chat bots are becoming more and more ubiquitous, keeping the bots up-to-date with the ever-changing demands remains a challenge.  It is clear that a build>deploy>forget approach quickly leads to outdated AI that lacks the ability to adapt to dynamic customer requirements.

Call-center solutions which create ongoing feedback mechanisms between incoming calls or chat messages and the chatbot’s AI, allow for a programmatic approach to predicting and catering to a customer’s intent.

This is achieved by doing the following:

  • applying natural language processing (NLP) on conversation data
  • extracting relevant missed intents,
  • automating the bot update process
  • inserting human feedback at key stages in the process.

This post provides a technical overview of one of Accenture’s Advanced Customer Engagement (ACE+) solutions, explaining how it integrates multiple AWS services to continuously and quickly improve chatbots and stay ahead of customer demands.

Call center solution architects and administrators can use this architecture as a starting point for an iterative bot improvement solution. The goal is to lead to an increase in call deflection and drive better customer experiences.

Overview of Solution

The goal of the solution is to extract missed intents and utterances from a conversation and present them to the call center agent at the end of a conversation, as part of the after-work flow. A simple UI interface was designed for the agent to select the most relevant missed phrases and forward them to an Analytics/Operations Team for final approval.

Figure 1 – Architecture Diagram

Amazon Connect serves as the contact center platform and handles incoming calls, manages the IVR flows and the escalations to the human agent. Amazon Connect is also used to gather call metadata, call analytics and handle call center user management. It is the platform from which other AWS services are called: Amazon Lex, Amazon DynamoDB and AWS Lambda.

Lex is the AI service used to build the bot. Lambda serves as the main integration tool and is used to push bot transcripts to DynamoDB, deploy updates to Lex and to populate the agent dashboard which is used to flag relevant intents missed by the bot. A generic CRM app is used to integrate the agent client and provide a single, integrated, dashboard. For example, this addition to the agent’s UI, used to review intents, could be implemented as a custom page in Salesforce (Figure 2).

Figure 2 – Agent feedback dashboard in Salesforce. The section allows the agent to select parts of the conversation that should be captured as intents by the bot.

A separate, stand-alone, dashboard is used by an Analytics and Operations Team to approve the new intents, which triggers the bot update process.


The typical use case for this solution (Figure 4) shows how missing intents in the bot configuration are captured from customer conversations. These intents are then validated and used to automatically build and deploy an updated version of a chatbot. During the process, the following steps are performed:

  1. Customer intents that were missed by the chatbot are automatically highlighted in the conversation
  2. The agent performs a review of the transcript and selects the missed intents that are relevant.
  3. The selected intents are sent to an Analytics/Ops Team for final approval.
  4. The operations team validates the new intents and starts the chatbot rebuild process.

Figure 3 – Use case: the bot is unable to resolve the first call (bottom flow). Post-call analysis results in a new version of the bot being built and deployed. The new bot is able to handle the issue in subsequent calls (top flow)

During the first call (bottom flow) the bot fails to fulfil the request and the customer is escalated to a Live Agent. The agent resolves the query and, post call, analyzes the transcript between the chatbot and the customer, identifies conversation parts that the chatbot should have understood and sends a ‘missed intent/utterance’ report to the Analytics/Ops Team. The team approves and triggers the process that updates the bot.

For the second call, the customer asks the same question. This time, the (trained) bot is able to answer the query and end the conversation.

Ideally, the post-call analysis should be performed, at least in part, by the agent handling the call. Involving the agent in the process is critical for delivering quality results. Any given conversation can have multiple missed intents, some of them irrelevant when looking to generalize a customer’s question.

A call center agent is in the best position to judge what is or is not useful and mark the missed intents to be used for bot training. This is the important logical triage step. Of course, this will result in the occasional increase in the average handling time (AHT). This should be seen as a time investment with the potential to reduce future call times and increase deflection rates.

One alternative to this setup would be to have a dedicated analytics team review the conversations, offloading this work from the agent. This approach avoids the increase in AHT, but also introduces delay and, possibly, inaccuracies in the feedback loop.

The approval from the Analytics/Ops Team is a sign off on the agent’s work and trigger for the bot building process.


The following section focuses on the sequence required to programmatically update intents in existing Lex bots. It assumes a Connect instance is configured and a Lex bot is already integrated with it. Navigate to this page for more information on adding Lex to your Connect flows.

It also does not cover the CRM application, where the conversation transcript is displayed and presented to the agent for intent selection.  The implementation details can vary significantly depending on the CRM solution used. Conceptually, most solutions will follow the architecture presented in Figure1: store the conversation data in a database (DynamoDB here) and expose it through an (API Gateway here) to be consumed by the CRM application.

Lex bot update sequence

The core logic for updating the bot is contained in a Lambda function that triggers the Lex update. This adds new utterances to an existing bot, builds it and then publishes a new version of the bot. The Lambda function is associated with an API Gateway endpoint which is called with the following body:

	“intent”: “INTENT_NAME”,
	“utterances”: [“UTTERANCE_TO_ADD_1”, “UTTERANCE_TO_ADD_2” …]

Steps to follow:

  1. The intent information is fetched from Lex using the getIntent API
  2. The existing utterances are combined with the new utterances and deduplicated.
  3. The intent information is updated with the new utterances
  4. The updated intent information is passed to the putIntent API to update the Lex Intent
  5. The bot information is fetched from Lex using the getBot API
  6. The intent version present within the bot information is updated with the new intent

Figure 4 – Representation of Lex Update Sequence


7. The update bot information is passed to the putBot API to update Lex and the processBehavior is set to “BUILD” to trigger a build. The following code snippet shows how this would be done in JavaScript:

const updateBot = await lexModel
        processBehavior: "BUILD"

9. The last step is to publish the bot, for this we fetch the bot alias information and then call the putBotAlias API.

const oldBotAlias = await lexModel
        name: config.botAlias,
        botName: updatedBot.name

return lexModel
        name: config.botAlias,
        botName: updatedBot.name,
        botVersion: updatedBot.version,
        checksum: oldBotAlias.checksum,


In this post, we showed how a programmatic bot improvement process can be implemented around Amazon Lex and Amazon Connect.  Continuously improving call center bots is a fundamental requirement for increased customer satisfaction. The feedback loop, agent validation and automated bot deployment pipeline should be considered integral parts to any a chatbot implementation.

Finally, the concept of a feedback-loop is not specific to call-center chatbots. The idea of adding an iterative improvement process in the bot lifecycle can also be applied in other areas where chatbots are used.

Accelerating Innovation with the Accenture AWS Business Group (AABG)

By working with the Accenture AWS Business Group (AABG), you can learn from the resources, technical expertise, and industry knowledge of two leading innovators, helping you accelerate the pace of innovation to deliver disruptive products and services. The AABG helps customers ideate and innovate cloud solutions with customers through rapid prototype development.

Connect with our team at [email protected] to learn and accelerate how to use machine learning in your products and services.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Abdullah Sahin

Abdullah Sahin

Abdullah Sahin is a senior technology architect at Accenture. He is leading a rapid prototyping team bringing the power of innovation on AWS to Accenture customers. He is a fan of CI/CD, containerization technologies and IoT.

Muhammad Qasim

Muhammad Qasin

Muhammad Qasim is a software engineer at Accenture and excels in development of voice bots using services such as Amazon Connect. In his spare time, he plays badminton and loves to go for a run.

Automating Recommendation Engine Training with Amazon Personalize and AWS Glue

Post Syndicated from Alexander Spivak original https://aws.amazon.com/blogs/architecture/automating-recommendation-engine-training-with-amazon-personalize-and-aws-glue/

Customers from startups to enterprises observe increased revenue when personalizing customer interactions. Still, many companies are not yet leveraging the power of personalization, or, are relying solely on rule-based strategies. Those strategies are effort-intensive to maintain and not effective. Common reasons for not launching machine learning (ML) based personalization projects include: the complexity of aggregating and preparing the datasets, gaps in data science expertise and the lack of trust regarding the quality of ML recommendations.

This blog post demonstrates an approach for product recommendations to mitigate those concerns using historical datasets. To get started with your personalization journey, you don’t need ML expertise or a data lake. The following serverless end-to-end architecture involves aggregating and transforming the required data, as well as automatically training an ML-based recommendation engine.

I will outline the architectural production-ready setup for personalized product recommendations based on historical datasets. This is of interest to data analysts who look for ways to bring an existing recommendation engine to production, as well as solutions architects new to ML.

Solution Overview

The two core elements to create a proof-of-concept for ML-based product recommendations are:

  1. the recommendation engine and,
  2. the data set to train the recommendation engine.

Let’s start with the recommendation engine first, and work backwards to the corresponding data needs.

Product recommendation engine

To create the product recommendation engine, we use Amazon Personalize. Amazon Personalize differentiates three types of input data:

  1. user events called interactions (user events like views, signups or likes),
  2. item metadata (description of your items: category, genre or availability), and
  3. user metadata (age, gender, or loyalty membership).

An interactions dataset is typically the minimal requirement to build a recommendation system. Providing user and item metadata datasets improves recommendation accuracy, and enables cold starts, item discovery and dynamic recommendation filtering.

Most companies already have existing historical datasets to populate all three input types. In the case of retail companies, the product order history is a good fit for interactions. In the case of the media and entertainment industry, the customer’s consumption history maps to the interaction dataset. The product and media catalogs map to the items dataset and the customer profiles to the user dataset.

Amazon Personalize: from datasets to a recommendation API

Amazon Personalize: from datasets to a recommendation API

The Amazon Personalize Deep Dive Series provides a great introduction into the service and explores the topics of training, inference and operations. There are also multiple blog posts available explaining how to create a recommendation engine with Amazon Personalize and how to select the right metadata for the engine training. Additionally, the Amazon Personalize samples repository in GitHub showcases a variety of topics: from getting started with Amazon Personalize, up to performing a POC in a Box using existing datasets, and, finally, automating the recommendation engine with MLOps. In this post, we focus on getting the data from the historical data sources into the structure required by Amazon Personalize.

Creating the dataset

While manual data exports are a quick way to get started with one-time datasets for experiments, we use AWS Glue to automate this process. The automated approach with AWS Glue speeds up the proof of concept (POC) phase and simplifies the process to production by:

  • easily reproducing dataset exports from various data sources. This are used to iterate with other feature sets for recommendation engine training.
  • adding additional data sources and using those to enrich existing datasets
  • efficiently performing transformation logic like column renaming and fuzzy matching out of the box with code generation support.

AWS Glue is a serverless data integration service that is scalable and simple to use. It provides all of the capabilities needed for data integration and supports a wide variety of data sources: Amazon S3 buckets, JDBC connectors, MongoDB databases, Kafka, and Amazon Redshift, the AWS data warehouse. You can even make use of data sources living outside of your AWS environment, e.g. on-premises data centers or other services outside of your VPC. This enables you to perform a data-driven POC even when the data is not yet in AWS.

Modern application environments usually combine multiple heterogeneous database systems, like operational relational and NoSQL databases, in addition to, the BI-powering data warehouses. With AWS Glue, we orchestrate the ETL (extract, transform, and load) jobs to fetch the relevant data from the corresponding data sources. We then bring it into a format that Amazon Personalize understands: CSV files with pre-defined column names hosted in an Amazon S3 bucket.

Each dataset consists of one or multiple CSV files, which can be uniquely identified by an Amazon S3 prefix. Additionally, each dataset must have an associated schema describing the structure of your data. Depending on the dataset type, there are required and pre-defined fields:

  • USER_ID (string) and one additional metadata field for the users dataset
  • ITEM_ID (string) and one additional metadata field for the items dataset
  • USER_ID (string), ITEM_ID (string), TIMESTAMP (long; as Epoch time) for the interactions dataset

The following graph presents a high-level architecture for a retail customer, who has a heterogeneous data store landscape.

Using AWS Glue to export datasets from heterogeneous data sources to Amazon S3

Using AWS Glue to export datasets from heterogeneous data sources to Amazon S3

To understand how AWS Glue connects to the variety of data sources and allows transforming the data into the required format, we need to drill down into the AWS Glue concepts and components.

One of the key components of AWS Glue is the AWS Glue Data Catalog: a persistent metadata store containing table definitions, connection information, as well as, the ETL job definitions.
The tables are metadata definitions representing the structure of the data in the defined data sources. They do not contain any data entries from the sources but solely the structure definition. You can create a table either manually or automatically by using AWS Glue Crawlers.

AWS Glue Crawlers scan the data in the data sources, extract the schema information from it, and store the metadata as tables in the AWS Glue Data Catalog. This is the preferred approach for defining tables. The crawlers use AWS Glue Connections to connect to the data sources. Each connection contains the properties that are required to connect to a particular data store. The connections will be also used later by the ETL jobs to fetch the data from the data sources.

AWS Glue Crawlers also help to overcome a challenge frequently appearing in microservice environments. Microservice architectures are frequently operated by fully independent and autonomous teams. This means that keeping track of changes to the data source format becomes a challenge. Based on a schedule, the crawlers can be triggered to update the metadata for the relevant data sources in the AWS Glue Data Catalog automatically. To detect cases when a schema change would break the ETL job logic, you can combine the CloudWatch Events emitted by AWS Glue on updating the Data Catalog tables with an AWS Lambda function or a notification send via the Amazon Simple Notification Service (SNS).

The AWS Glue ETL jobs use the defined connections and the table information from the Data Catalog to extract the data from the described sources, apply the user-defined transformation logic and write the results into a data sink. AWS Glue can automatically generate code for ETL jobs to help perform a variety of useful data transformation tasks. AWS Glue Studio makes the ETL development even simpler by providing an easy-to-use graphical interface that accelerates the development and allows designing jobs without writing any code. If required, the generated code can be fully customized.

AWS Glue supports Apache Spark jobs, written either in Python or in Scala, and Python Shell jobs. Apache Spark jobs are optimized to run in a highly scalable, distributed way dealing with any amount of data and are a perfect fit for data transformation jobs. The Python Shell jobs provide access to the raw Python execution environment, which is less scalable but provides a cost-optimized option for orchestrating AWS SDK calls.

The following diagram visualizes the interaction between the components described.

The basic concepts of populating your Data Catalog and processing ETL dataflow in AWS Glue

The basic concepts of populating your Data Catalog and processing ETL dataflow in AWS Glue

For each Amazon Personalize dataset type, we create a separate ETL job. Since those jobs are independent, they also can run in parallel. After all jobs have successfully finished, we can start the recommendation engine training. AWS Glue Workflows allow simplifying data pipelines by visualizing and monitoring complex ETL activities involving multiple crawlers, jobs, and triggers, as a single entity.

The following graph visualizes a typical dataset export workflow for training a recommendation engine, which consists of:

  • a workflow trigger being either manual or scheduled
  • a Python Shell job to remove the results of the previous export workflow from S3
  • a trigger firing when the removal job is finished and initiating the parallel execution of the dataset ETL jobs
  • the three Apache Spark ETL jobs, one per dataset type
  • a trigger firing when all three ETL jobs are finished and initiating the training notification job
  • a Python Shell job to initiate a new dataset import or a full training cycle in Amazon Personalize (e.g. by triggering the MLOps pipeline using the AWS SDK)


AWS Glue workflow for extracting the three datasets and triggering the training workflow of the recommendation engine

AWS Glue workflow for extracting the three datasets and triggering the training workflow of the recommendation engine

Combining the data export and the recommendation engine

In the previous sections, we discussed how to create an ML-based recommendation engine and how to create the datasets for the training of the engine. In this section, we combine both parts of the solution leveraging an adjusted version of the MLOps pipeline solution available on GitHub to speed up the iterations on new solution versions by avoiding manual steps. Moreover, automation means new items can be put faster into production.

The MLOps pipeline uses a JSON file hosted in an S3 bucket to describe the training parameters for Amazon Personalize. The creation of a new parameter file version triggers a new training workflow orchestrated in a serverless manner using AWS Step Functions and AWS Lambda.

To integrate the Glue data export workflow described in the previous section, we also enable the Glue workflow to trigger the training pipeline. Additionally, we manipulate the pipeline to read the parameter file as the first pipeline step. The resulting architecture enables an automated end-to-end set up from dataset export up to the recommendation engine creation.

End-to-end architecture combining the data export with AWS Glue, the MLOps training workflow and Amazon Personalize

End-to-end architecture combining the data export with AWS Glue, the MLOps training workflow, and Amazon Personalize

The architecture for the end-to-end data export and recommendation engine creation solution is completely serverless. This makes it highly scalable, reliable, easy to maintain, and cost-efficient. You pay only for what you consume. For example, in the case of the data export, you pay only for the duration of the AWS Glue crawler executions and ETL jobs. These are only need to run to iterate with a new dataset.

The solution is also flexible in terms of the connected data sources. This architecture is also recommended for use cases with a single data source. You can also start with a single data store and enrich the datasets on-demand with additional data sources in future iterations.

Testing the quality of the solution

A common approach to validate the quality of the solution is the A/B testing technique, which is widely used to measure the efficacy of generated recommendations. Based on the testing results, you can iterate on the recommendation engine by optimizing the underlying datasets and models. The high degree of automation increases the speed of iterations and the resiliency of the end-to-end process.


In this post, I presented a typical serverless architecture for a fully automated, end-to-end ML-based recommendation engine leveraging available historical datasets. As you begin to experiment with ML-based personalization, you will unlock value currently hidden in the data. This helps mitigate potential concerns like the lack of trust in machine learning and you can put the resulting engine into production.

Start your personalization journey today with the Amazon Personalize code samples and bring the engine to production with the architecture outlined in this blog. As a next step, you can involve recording real-time events to update the generated recommendations automatically based on the event data.


Field Notes: Comparing Algorithm Performance Using MLOps and the AWS Cloud Development Kit

Post Syndicated from Moataz Gaber original https://aws.amazon.com/blogs/architecture/field-notes-comparing-algorithm-performance-using-mlops-and-the-aws-cloud-development-kit/

Comparing machine learning algorithm performance is fundamental for machine learning practitioners, and data scientists. The goal is to evaluate the appropriate algorithm to implement for a known business problem.

Machine learning performance is often correlated to the usefulness of the model deployed. Improving the performance of the model typically results in an increased accuracy of the prediction. Model accuracy is a key performance indicator (KPI) for businesses when evaluating production readiness and identifying the appropriate algorithm to select earlier in model development. Organizations benefit from reduced project expenses, accelerated project timelines and improved customer experience. Nevertheless, some organizations have not introduced a model comparison process into their workflow which negatively impacts cost and productivity.

In this blog post, I describe how you can compare machine learning algorithms using Machine Learning Operations (MLOps). You will learn how to create an MLOps pipeline for comparing machine learning algorithms performance using AWS Step Functions, AWS Cloud Development Kit (CDK) and Amazon SageMaker.

First, I explain the use case that will be addressed through this post. Then, I explain the design considerations for the solution. Finally, I provide access to a GitHub repository which includes all the necessary steps for you to replicate the solution I have described, in your own AWS account.

Understanding the Use Case

Machine learning has many potential uses and quite often the same use case is being addressed by different machine learning algorithms. Let’s take Amazon Sagemaker built-in algorithms. As an example, if you are having a “Regression” use case, it can be addressed using (Linear Learner, XGBoost and KNN) algorithms. Another example for a “Classification” use case you can use algorithm such as (XGBoost, KNN, Factorization Machines and Linear Learner). Similarly for “Anomaly Detection” there are (Random Cut Forests and IP Insights).

In this post, it is a “Regression” use case to identify the age of the abalone which can be calculated based on the number of rings on its shell (age equals to number of rings plus 1.5). Usually the number of rings are counted through microscopes examinations.

I use the abalone dataset in libsvm format which contains 9 fields [‘Rings’, ‘Sex’, ‘Length’,’ Diameter’, ‘Height’,’ Whole Weight’,’ Shucked Weight’,’ Viscera Weight’ and ‘Shell Weight’] respectively.

The features starting from Sex to Shell Weight are physical measurements that can be measured using the correct tools. Therefore, using the machine learning algorithms (Linear Learner and XGBoost) to address this use case, the complexity of having to examine the abalone under microscopes to understand its age can be improved.

Benefits of the AWS Cloud Development Kit (AWS CDK)

The AWS Cloud Development Kit (AWS CDK) is an open source software development framework to define your cloud application resources.

The AWS CDK uses the jsii which is an interface developed by AWS that allows code in any language to naturally interact with JavaScript classes. It is the technology that enables the AWS Cloud Development Kit to deliver polyglot libraries from a single codebase.

This means that you can use the CDK and define your cloud application resources in typescript language for example. Then by compiling your source module using jsii, you can package it as modules in one of the supported target languages (e.g: Javascript, python, Java and .Net). So if your developers or customers prefer any of those languages, you can easily package and export the code to their preferred choice.

Also, the cdk tf provides constructs for defining Terraform HCL state files and the cdk8s enables you to use constructs for defining kubernetes configuration in TypeScript, Python, and Java. So by using the CDK you have a faster development process and easier cloud onboarding. It makes your cloud resources more flexible for sharing.


Overview of solution

This architecture serves as an example of how you can build a MLOps pipeline that orchestrates the comparison of results between the predictions of two algorithms.

The solution uses a completely serverless environment so you don’t have to worry about managing the infrastructure. It also deletes resources not needed after collecting the predictions results, so as not to incur any additional costs.

Figure 1: Solution Architecture


In the preceding diagram, the serverless MLOps pipeline is deployed using AWS Step Functions workflow. The architecture contains the following steps:

  1. The dataset is uploaded to the Amazon S3 cloud storage under the /Inputs directory (prefix).
  2. The uploaded file triggers AWS Lambda using an Amazon S3 notification event.
  3. The Lambda function then will initiate the MLOps pipeline built using a Step Functions state machine.
  4. The starting lambda will start by collecting the region corresponding training images URIs for both Linear Learner and XGBoost algorithms. These are used in training both algorithms over the dataset. It will also get the Amazon SageMaker Spark Container Image which is used for running the SageMaker processing Job.
  5. The dataset is in libsvm format which is accepted by the XGBoost algorithm as per the Input/Output Interface for the XGBoost Algorithm. However, this is not supported by the Linear Learner Algorithm as per Input/Output interface for the linear learner algorithm. So we need to run a processing job using Amazon SageMaker Data Processing with Apache Spark. The processing job will transform the data from libsvm to csv and will divide the dataset into train, validation and test datasets. The output of the processing job will be stored under /Xgboost and /Linear directories (prefixes).

Figure 2: Train, validation and test samples extracted from dataset

6. Then the workflow of Step Functions will perform the following steps in parallel:

    • Train both algorithms.
    • Create models out of trained algorithms.
    • Create endpoints configurations and deploy predictions endpoints for both models.
    • Invoke lambda function to describe the status of the deployed endpoints and wait until the endpoints become in “InService”.
    • Invoke lambda function to perform 3 live predictions using boto3 and the “test” samples taken from the dataset to calculate the average accuracy of each model.
    • Invoke lambda function to delete deployed endpoints not to incur any additional charges.

7. Finally, a Lambda function will be invoked to determine which model has better accuracy in predicting the values.

The following shows a diagram of the workflow of the Step Functions:

Figure 3: AWS Step Functions workflow graph

The code to provision this solution along with step by step instructions can be found at this GitHub repo.

Results and Next Steps

After waiting for the complete execution of step functions workflow, the results are depicted in the following diagram:

Figure 4: Comparison results

This doesn’t necessarily mean that the XGBoost algorithm will always be the better performing algorithm. It just means that the performance was the result of these factors:

  • the hyperparameters configured for each algorithm
  • the number of epochs performed
  • the amount of dataset samples used for training

To make sure that you are getting better results from the models, you can run hyperparameters tuning jobs which will run many training jobs on your dataset using the algorithms and ranges of hyperparameters that you specify. This helps you allocate which set of hyperparameters which are giving better results.

Finally, you can use this comparison to determine which algorithm is best suited for your production environment. Then you can configure your step functions workflow to update the configuration of the production endpoint with the better performing algorithm.

Figure 5: Update production endpoint workflow


This post showed you how to create a repeatable, automated pipeline to deliver the better performing algorithm to your production predictions endpoint. This helps increase the productivity and reduce the time of manual comparison.  You also learned to provision the solution using AWS CDK and to perform regular cleaning of deployed resources to drive down business costs. If this post helps you or inspires you to solve a problem, share your thoughts and questions in the comments. You can use and extend the code on the GitHub repo.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers

New – Store, Discover, and Share Machine Learning Features with Amazon SageMaker Feature Store

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-store-discover-and-share-machine-learning-features-with-amazon-sagemaker-feature-store/

Today, I’m extremely happy to announce Amazon SageMaker Feature Store, a new capability of Amazon SageMaker that makes it easy for data scientists and machine learning engineers to securely store, discover and share curated data used in training and prediction workflows.

For all the importance of selecting the right algorithm to train machine learning (ML) models, experienced practitioners know how crucial it is to feed it with high-quality data. Cleaning data is a good first step, and ML workflows routinely include steps to fill missing values, remove outliers, and so on. Then, they often move on to transforming data, using a mix of common and arcane techniques known as “feature engineering.”

Simply put, the purpose of feature engineering is to transform your data and to increase its expressiveness so that the algorithm may learn better. For instance, many columnar datasets include strings, such as street addresses. To most ML algorithms, strings are meaningless, and they need to be encoded in a numerical representation. Thus, you could replace street addresses with GPS coordinates, a much more expressive way to learn the concept of location. In other words, if data is the new oil, then feature engineering is the refining process that turns it into high-octane jet fuel that helps models get to stratospheric accuracy.

Indeed, ML practitioners spend a lot of time crafting feature engineering code, applying it to their initial datasets, training models on the engineered datasets, and evaluating model accuracy. Given the experimental nature of this work, even the smallest project will lead to multiple iterations. The same feature engineering code is often run again and again, wasting time and compute resources on repeating the same operations. In large organizations, this can cause an even greater loss of productivity, as different teams often run identical jobs, or even write duplicate feature engineering code because they have no knowledge of prior work.

There’s another hard problem that ML teams have to solve. As models are trained on engineered datasets, it’s imperative to apply the same transformations to data sent for prediction. This often means rewriting feature engineering code, sometimes in a different language, integrating it in your prediction workflow, and running it at prediction time. This whole process is not only time-consuming, it can also introduce inconsistencies, as even the tiniest variation in a data transform can have a large impact on predictions.

In order to solve these problems, ML teams sometimes build a feature store, a central repository where they can keep and retrieve engineered data used in their training and predictions jobs. As useful as feature stores are, building and managing your own involves a lot of engineering, infrastructure, and operational effort that takes valuable time away from actual ML work. Customers asked us for a better solution, and we got to work.

Introducing Amazon SageMaker Feature Store
Amazon SageMaker Feature Store is a fully managed centralized repository for your ML features, making it easy to securely store and retrieve features without having to manage any infrastructure. It’s part of Amazon SageMaker, our fully managed service for ML, and supports all algorithms. It’s also integrated with Amazon SageMaker Studio, our web-based development environment for ML.

Features stored in SageMaker Feature Store are organized in groups, and tagged with metadata. Thanks to this, you can quickly discover which features are available, and whether they’re suitable for your models. Multiple teams can also easily share and re-use features, reducing the cost of development and accelerating innovation.

Once stored, features can be retrieved and used in your SageMaker workflows: model training, batch transform, and real-time prediction with low latency. Not only do you avoid duplicating work, you also build consistent workflows that use the same consistent features stored in the offline and online stores.

The Climate Corporation (Climate) is a subsidiary of Bayer, and the industry leader in bringing digital innovation to farmers. Says Daniel McCaffrey, Vice President, Data and Analytics, Climate: “At Climate, we believe in providing the world’s farmers with accurate information to make data driven decisions and maximize their return on every acre. To achieve this, we have invested in technologies such as machine learning tools to build models using measurable entities known as features, such as yield for a grower’s field. With Amazon SageMaker Feature Store, we can accelerate the development of ML models with a central feature store to access and reuse features across multiple teams easily. SageMaker Feature Store makes it easy to access features in real-time using the online store, or run features on a schedule using the offline store for different use cases, and we can develop ML models faster.”

Care.com, the world’s leading platform for finding and managing high-quality family care, is also using Amazon SageMaker Feature Store. This is what Clemens Tummeltshammer, Data Science Manager, Care.com, told us: “A strong care industry where supply matches demand is essential for economic growth from the individual family up to the nation’s GDP. We’re excited about Amazon SageMaker Feature Store and Amazon SageMaker Pipelines , as we believe they will help us scale better across our data science and development teams, by using a consistent set of curated data that we can use to build scalable end-to-end machine learning model pipelines from data preparation to deployment. With the newly announced capabilities of Amazon SageMaker, we can accelerate development and deployment of our ML models for different applications, helping our customers make better informed decisions through faster real-time recommendations.

Now, let’s see how you can get started.

Storing and Retrieving Features with Amazon SageMaker Feature Store
Once you’ve run your feature engineering code on your data, you can organize and store your engineered features in SageMaker Feature Store, by grouping them in feature groups. A feature group is a collection of records, similar to rows in a table. Each record has a unique identifier, and holds the engineered feature values for one of the data instances in your original data source. Optionally, you can choose to encrypt the data at rest using your own AWS Key Management Service (KMS) key that is unique for each feature group.

How you define feature groups is up to you. For example, you could create one per data source (CSV files, database tables, and so on), and use a convenient unique column as the record identifier (primary key, customer id, transaction id, and so on).

Once you’ve got your groups figured out, you should repeat the following steps for each group:

  1. Create feature definitions, with the name and the type of each feature in a record (Fractional, Integral, or String).
  2. Create each feature group with the create_feature_group() API:
         # The name of the feature group
         # The name of the column acting as the record identifier
         # The name of the column action as the feature timestamp
         EventTimeFeatureName = event_time_feature_name,
         # A list of feature names and types
         # The S3 location for the offline feature store
         # Optionally, enable the online feature store
         # An IAM role
  3. In each feature group, store records containing a collection of feature name/feature value pairs, using the put_record() API:

    For faster ingestion, you could create multiple threads and parallelize this operation.

At this point, features will be available in Amazon SageMaker Feature Store. Thanks to the offline store, you can use services such as Amazon Athena, AWS Glue, or Amazon EMR to build datasets for training: fetch the corresponding JSON objects in S3, select the features that you need, and save them in S3 in the format expected by your ML algorithm. From then on, it’s SageMaker business as usual!

In addition, you can use the get_record() API to access individual records stored in the online store, passing the group name and the unique identifier of the record you want to access, like so:

record = sm_feature_store.get_record(
    RecordIdentifierValue={"IntegralValue": 5962}

Amazon SageMaker Feature Store is designed for fast and efficient access for real time inference, with a P95 latency lower than 10ms for a 15-kilobyte payload. This makes it possible to query for engineered features at prediction time, and to replace raw features sent by the upstream application with the exact same features used to train the model. Feature inconsistencies are eliminated by design, letting you focus on building the best models instead of chasing bugs.

Finally, as SageMaker Feature Store includes feature creation timestamps, you can retrieve the state of your features at a particular point in time.

As Amazon SageMaker Feature Store is integrated with SageMaker Studio, I can see my two features groups there.

SageMaker screenshot

Right-clicking on “Open feature group detail”, I open the identity feature group.

SageMaker screenshot

I can see feature definitions.

SageMaker screenshot

Finally, I can generate queries for the offline store, which I could add to a Amazon SageMaker Data Wrangler workflow to load features prior to training.

SageMaker screenshot

How to Get Started with Amazon SageMaker Feature Store
As you can see, SageMaker Feature Store makes it easy to store, retrieve, and share features required by your training and prediction workflows.

SageMaker Feature Store is available in all regions where SageMaker is available. Pricing is based on feature reads and writes, and on the total amount of data stored.

Here are sample notebooks that will help you get started right away. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

New – Profile Your Machine Learning Training Jobs With Amazon SageMaker Debugger

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/profile-your-machine-learning-training-jobs-with-amazon-sagemaker-debugger/

Today, I’m extremely happy to announce that Amazon SageMaker Debugger can now profile machine learning models, making it much easier to identify and fix training issues caused by hardware resource usage.

Despite its impressive performance on a wide range of business problems, machine learning (ML) remains a bit of a mysterious topic. Getting things right is an alchemy of science, craftsmanship (some would say wizardry), and sometimes luck. In particular, model training is a complex process whose outcome depends on the quality of your dataset, your algorithm, its parameters, and the infrastructure you’re training on.

As ML models become ever larger and more complex (I’m looking at you, deep learning), one growing issue is the amount of infrastructure required to train them. For instance, training BERT on the publicly available COCO dataset takes well over six hours on a single p3dn.24xlarge instance, even with its eight NVIDIA V100 GPUs. Some customers like autonomous vehicle companies deal with much larger datasets, and train object detection models for several days.

When a complex training job takes this long, the odds that something goes wrong and ruins it are pretty high, not only wasting time and money but also causing lots of frustration. Important work needs to be put on the back burner while you investigate, figure out the root cause, try to fix it, and then run your training job again. Often, you’ll have to iterate quite a few times to nail the problem.

Depending on the ML framework that you use, and sometimes on its version, you may or may not be able to use existing framework-specific tools. Often, you’ll have to build and maintain your own bespoke tools. Even for experienced practitioners, this is a lot of hard work. For regular developers like me, this is an utterly daunting task.

Introducing Model Profiling in Amazon SageMaker Debugger
Launched last year at AWS re:Invent, Amazon SageMaker Debugger is a capability of Amazon SageMaker that automatically identifies complex issues developing in ML training jobs. These include loss not decreasing, exploding gradients, and more.

Now, SageMaker Debugger can also monitor hardware resource usage, and allows you to profile your training job to help you correlate resource usage to ML operations in your training script. Thus, you’ll be able to resolve performance issues much quicker, and iterate through your training job much faster.

Chaim Rand, ML Algorithm Developer at Mobileye, an Intel company building automated driving and driver assistance systems, had the opportunity to work with the new profiling capabilities, and here’s what he told us: “Many of the assisted driving and autonomous vehicle technologies that we develop at Mobileye, rely on training deep neural network models to detect a wide variety of road artifacts, including vehicles, pedestrians, speed bumps, road signs and more. Often, these models train on extremely large datasets, on multiple machines, and for periods of up to several days. For us, at Mobileye, it is imperative that we have a toolkit of advanced performance profiling capabilities, for analyzing the flow of data across the network, CPU, and GPU resources, and for pinpointing performance issues. The profiling functionality in SageMaker Debugger provides just that, taking performance profiling out of the domain of a few specialized experts, and empowering our algorithm developers to maximize training resource utilization, accelerate model convergence, and reduce cost.

At launch, the new profiling capability of SageMaker Debugger is available for TensorFlow 2.x and PyTorch 1.x. All you have to do is to train with the corresponding built-in frameworks in Amazon SageMaker. Distributed training is supported out of the box.

Setting a single parameter in your SageMaker estimator, and without any change to your training code, you can enable the collection of infrastructure and model metrics such as:

  • CPU and GPU,
  • RAM and GPU RAM,
  • Network I/O,
  • Storage I/O (local storage and Pipe Mode),
  • Python metrics,
  • Data loading time,
  • Time spent in ML operators running on CPU and GPU,
  • Distributed training metrics for Horovod,
  • and many more.

In addition, you can visualize how much time is spent in different phases, such as preprocessing, training loop, and postprocessing. If needed, you can drill down on each training epoch, and even on each function in your training script.

By default, metrics are collected every 500ms, and you can also set this value to 100ms, 200ms, 1s, 5s, and 1min. For finer-grained analysis, you can also enable and disable profiling explicitly in your training code, only capturing metrics for specific parts.

While your training job is running, you can easily visualize these metrics in Amazon SageMaker Studio, our web-based integrated development environment for ML. As you would expect, all data is also available through the SageMaker Debugger API, and you can retrieve it to build your own graphs.

Running in parallel of the training job, an Amazon SageMaker Processing analyzes captured data, builds graphs, and generates a report providing insights on potential problems. This doesn’t require any work on your part, as this analysis runs inside a built-in container on fully managed infrastructure.

Now, let’s run a demo with PyTorch, where we’ll profile a ResNet-50 image classification model training on the CIFAR-10 dataset.

Profiling a Training Job with Amazon SageMaker Debugger
All it takes to enable profiling on your training job is an extra parameter in your SageMaker estimator. You don’t need to change a line in your training code. By default, SageMaker Debugger uses a set of built-in profiling rules looking for unwanted conditions that could develop during training, such as low GPU utilization. On top of reporting these conditions, SageMaker Debugger also triggers events in CloudWatch Events. For example, I could use them to run a AWS Lambda function that automatically stops inefficient training jobs.

First, I create a profiling configuration capturing data every 500ms. Optionally, I could select a training step interval if I wanted to profile only a certain portion of the job.

import sagemaker
from sagemaker.profiler import ProfilerConfig 
profiler_config = ProfilerConfig(profiling_interval_millis=500)

Then, I pass this configuration to my PyTorch Estimator, training on an ml.p3.8xlarge instance equipped with 4 NVIDIA V100 GPUs.

from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    hyperparameters={"batch_size":512, "epochs":10},

Then, I launch the training job as usual. Once the job is running, profiling data is captured and stored in S3.

path = estimator.latest_job_profiler_artifacts_path()

Using the SageMaker SDK, I can retrieve and count profiling events.

from smdebug.profiler.system_metrics_reader import S3SystemMetricsReader
system_metrics_reader = S3SystemMetricsReader(path)
last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()
events = system_metrics_reader.get_events(0, last_timestamp)
print("Found", len(events), "recorded system metric events. Latest recorded event:", last_timestamp)
Found 411853 recorded system metric events. Latest recorded event: 1605620280000000

Of course, I could parse and analyze these profiling events, build my own graphs, and so on. Instead, let’s visualize them in near real-time in SageMaker Studio.

While my training job is still running, I locate it in SageMaker Studio, and I right-click “Open Debugger for insights”.

SageMaker screenshot

This opens a new tab, and I select the “Nodes” panel where I can see details statistics for each instance in my training job. So, how’s my training job doing? Feel free to click on the image below to zoom in.

SageMaker screenshot

Apparently, this job isn’t going great. GPU utilization and GPU memory utilization are desperately flat at around 10%. I’m definitely not pushing my multi-GPU instance hard enough. Maybe GPUs are not receiving data fast enough because the CPU can’t keep up? Let’s check the system utilization heatmap.

SageMaker screenshot

The CPU is taking a nap here, hardly ever exceeding 20% usage. This instance is definitely not busy enough. Is there anything I could do to fix this?

Switching to the “Overview” panel, I see that some of the built-in profiling rules have been triggered.

SageMaker screenshot

LowGPUUtilization confirms what I saw on the graphs above. BatchSize is very interesting, as it suggests increasing the size of mini-batches sent to the GPUs by the training script running on the CPU. This should definitely help fill GPU memory, put more GPU cores to work, speed up my training job, and improve infrastructure usage.

At this point, I should decide to stop my inefficient training job, and to relaunch it with a larger batch size. Here, I’ll let it run to completion to show you the report generated by the SageMaker Processing job running in parallel of your training job.

Once the training job is complete, I can see its summary in the “Overview” panel.

SageMaker screenshot

Clicking on the “Download report” button, I get a very detailed report that includes additional metrics, for example the ratio between the different phases of the training job, or the ratio between the forward and backward pass.

SageMaker screenshot

I can also see information on the most time-consuming CPU and GPU operators, which is really important if I wanted to optimize my code. For example, the graph below tells me that the most time-consuming GPU operations in my training job are backward pass convolution operators.

SageMaker screenshot

There’s much more to read in the report (rules summary, training loop analysis, and more). A companion notebook is also available to understand how graphs have been built, and how you can tailor them to your own needs.

Getting Started
We’ve just scratched the surface, and there are many more features in Amazon SageMaker Debugger that make it easy to gather, analyze and visualize model profiling information. You can start using it today in all regions where Amazon SageMaker is available. You won’t be charged for any compute used to run built-in profiling rules.

You’ll find sample notebooks on Github, so give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

re:Invent 2020 Liveblog: Machine Learning Keynote

Post Syndicated from AWS News Blog Team original https://aws.amazon.com/blogs/aws/reinvent-2020-liveblog-machine-learning-keynote/

Swami Sivasubramanian speaks on stage at AWS re:InventFollow along as AWS Chief Evangelist Jeff Barr and Developer Advocates Martin Beeby and Steve Roberts liveblog the first-ever Machine Learning Keynote. Swami Sivasubramanian, VP of Amazon ML/AI will share the latest developments and launches in AWS machine learning, as well as demos of new technology, and insights from customers.

Join us here from 7:45-10 AM (PST), Tuesday, Dec. 8, 2020! 


AWS Panorama Appliance: Bringing Computer Vision Applications to the Edge

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/using-computer-vision-applications-at-the-edge/

At AWS re:Invent today, we gave a preview of the AWS Panorama Appliance. Also, we announced the AWS Panorama SDK is coming soon. These allow organizations to bring computer vision to their on-premises cameras and make automated predictions with high accuracy and low latency.

Over the past couple of decades, computer vision has gone from a topic discussed by academics to a tool used by businesses worldwide. Cloud has been critical in enabling this growth, and we have had an explosion of services and infrastructure capabilities that have made the previously impossible possible.

Customers face challenges with their physical systems, whether it be inspecting parts on a manufacturing line, ensuring workers wear hard hats in hazardous areas or analyzing customer traffic in retail stores. Customers often solve these problems by manually monitoring live video feeds or reviewing recorded footage after an issue or incident has occurred. These solutions are manual, error-prone, and difficult to scale.

Computer Vision is increasingly being used to perform these inspection tasks using models running in the cloud. Still, there are circumstances when relying exclusively on the cloud is not optimal due to latency requirements or intermittent connectivity that make a round trip to the cloud unfeasible.

What Was Announced Today
You can now develop a computer vision model using Amazon SageMaker and then deploy it to a Panorama Appliance that can then run the model on video feeds from multiple network and IP cameras. The Panorama Appliance and its associated console are now in preview.

Coming soon, we have the Panorama SDK, which is a Software Development Kit (SDK) that can be used by third-party device manufacturers to build Panorama-enabled devices. The Panorama SDK is flexible, with a small footprint, making it easy for hardware vendors to build new devices in various form factors and sensors to satisfy use cases across different industries and environments, including industrial sites, low light scenarios, and the outdoors.

Unboxing the Appliance
Jeff was sent a Panorama Appliance a few weeks before AWS re:Invent so that we could write this blog; here is a picture of the device set up in Jeff’s office.

To set up the Panorama Appliance, I head over to the console and click on Get Started.

The console presents me with a three-step guide to getting a computer vision model running on the Panorama Appliance. In this blog, I will only look at Step 1, which helps me set up the Panorama Appliance.

I plugged the Panorama Appliance into the local network with an ethernet cable, and so in the configure step, I choose the Ethernet option.

The console creates a configuration archive, based on my input, that can set up the device. I download the file and transfer it to a USB key, which I will then plug into the Panorama Appliance.

I power on the Panorama Appliance and plug in the USB key; lights begin to flash as the Panorama Appliance connects to AWS. A green message appears on the console after a little while, saying that the Panorama Appliance is connected and online.

The guide then asks me to Add camera streams.

There are two ways to Add cameras: Automatic and Manual. I select Automatic, which will search the subnet for available cameras.

Some of the network cameras are password-protected. I add the Username and Password so that the Panorama Appliance can connect.

The Panorama Appliance is now connected to the network cameras, and my Panorama Appliance is ready to use.

The next step is to create an application and then deploy it to the Panorama Appliance. Over the next few weeks, I will produce a demo application, and when the service becomes Generally Available, I look forward to talking about it on this blog and sharing with you what I have created.

Get Started
To get started with the AWS Panorama Appliance, visit the product page. To get started with the AWS Panorama SDK, check out the documentation. We are excited to see what our customers will develop with the Panorama Appliance, and the sort of products third-party device manufacturers will build with the Panorama SDK.

Happy Automating!

— Martin







Amazon EC2 P4d instances deep dive

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/

This post is contributed by Amr Ragab, Senior Solutions Architect, Amazon EC2


AWS is excited to announce that the new Amazon EC2 P4d instances are now generally available. This instance type brings additional benefits with 2.5x higher deep learning performance; adding to the accelerated instances portfolio, new features, and technical breakthroughs that our customers can benefit from with this latest technology. This blog post details some of those key features and how to integrate them into your current workloads and architectures.


P4d instances

As you can see from the generalized block diagram above, the p4d comes with dual socket Intel Cascade Lake 8275CL processors totaling 96 vCPUs at 3.0 GHz with 1.1 TB of RAM and 8 TB of NVMe local storage. P4d also comes with 8 x 40 GB NVIDIA Tesla A100 GPUs with NVSwitch and 400 Gbps Elastic Fabric Adapter (EFA) enabled networking. This instance configuration represents the latest generation of computing for our customers spanning Machine Learning (ML), High Performance Computing (HPC), and analytics.

One of the improvements of the p4d is in the networking stack.  This new instance type has 400 Gbps with support for EFA and GPUDirect RDMA. Now, on AWS, you can take advantage of point-to-point GPU to GPU communication (across nodes), bypassing the CPU. Look out for additional blogs and webinars detailing use cases of GPUDirect and how this feature helps decrease latency and improve performance for certain workloads.

Let’s look at some new features and performance metrics for the P4d instances.


Local ephemeral NVMe storage
The p4d instance type comes with 8 TB of local NVMe storage. Each device has a maximum read/write throughput of 2.7 GB/s. To create a local namespace and staging area for input into the GPUs, you can create a local RAID 0 of all the drives. This results in aggregate read throughput of about 16 GB/s. The following table summarizes the I/O tests on the NVMe drives in this configuration.

FIO – TestBlock SizeThreadsBandwidth
1Sequential Read128k9616.4 GiB/s
2Sequential Write128k968.2 GiB/s
3Random Read128k9616.3 GiB/s
4Random Write128k968.1 GiB/s


Introduced with the p4d instance type is NVSwitch. Every GPU in the node is connected to each other in a full mesh topology up to 600 GB/s bidirectional bandwidth. ML frameworks and HPC applications that use NVIDIA communication collectives library (NCCL) can take full advantage of this all-to-all communication layer.

P4d GPU to GPU bandwidth

P3 GPU to GPU bandwidth

P4d uses a full mesh NVLink topology for optimized all-to-all communication, compared to the previous generation P3/P3dn instances, which have all-to-all communication across various data path domains (NUMA, PCIe switch, NVLink).  This new topology accessed via NCCL will improve performance for multiGPU workloads.
To make optimal use of the NVSwitch ensure that in your instance, all GPUs application boost clocks are set to its maximum values:

sudo nvidia-smi -ac 1215,1410

Multi-Instance GPU (MIG)

It’s now possible, at the user level, to have control of fractionating a GPU into multiple GPU slices, with each GPU slice isolated from each other. This enables multiple users to run different workloads on the same GPU without impacting performance. I walk you through an example implementation of MIG in the following steps:

With every newly launched instance, MIG is disabled. So, you must enable it with the following command:

[email protected]:~# sudo nvidia-smi -mig 1 

Enabled MIG Mode for GPU 00000000:10:1C.0
You can get a list of supported MIG profiles.
Next, you can create seven slices, and create compute instances for each slice.
[email protected]:~# sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 
Successfully created GPU instance ID 9 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 7 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 8 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 11 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 12 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 13 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 14 on GPU 0 using profile MIG 1g.5gb (ID 19)
[email protected]:~# nvidia-smi mig -cci -gi 7,8,9,11,12,13,14 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 7 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 8 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 9 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 11 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 12 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 13 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 14 using profile MIG 1g.5gb (ID 0)

You can split a GPU into a maximum of seven slices. To pass the GPU through into a docker container, you can specify the index pair at runtime:

docker run -it --gpus '"device=1:0"' nvcr.io/nvidia/tensorflow:20.09-tf1-py3

With MIG, you can run multiple smaller workloads on the same GPU without compromising performance. We will follow up with additional blogs on this feature as we integrate it with additional AWS services.


For workloads optimized for multiGPU capabilities, we introduced GPUDirect over EFA fabric. This allows direct GPU-GPU communication across multiple p4d nodes for decreased latency and improved performance. Follow this user guide to get started with installing the EFA driver and setting up the environment. The code sample below can be used as a template to use GPUDirect RDMA over EFA.

/opt/amazon/openmpi/bin/mpirun \
     -n ${NUM_PROCS} -N ${NUM_PROCS_NODE} \
     -x RDMAV_FORK_SAFE=1 -x NCCL_DEBUG=info \
     --hostfile ${HOSTS_FILE} \
     --mca pml ^cm --mca btl tcp,self – mca btl_tcp_if_exclude lo,docker0 – bind-to none \
     $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1 -c 1 -n 100

Machine learning Optimizations

You can quickly get started with the all benefits mentioned earlier for the p4d by using our latest Deep Learning AMI (DLAMI). The DLAMI now comes with CUDA11 and the latest NVLink and cuDNN SDKs and drivers to take advantage of the p4d.

TensorFloat32 – TF32

TF32 is a new 19 bit precision datatype from NVIDIA introduced for the first time for the p4d.24xlarge instance. This datatype improves performance with little to no loss of training and validation accuracy for most mainstream models. We have more detailed benchmarks for individual algorithms. But, on the p4d.24xlarge you can achieve approximately a 2.5 fold increase compared to FP32 on the p3dn.24xlarge for mainstream deep learning models.

We have updated our machine learning models here to show examples (see the following chart) of popular algorithms our customers are using today including general DNNs and Bert.

DNNP3dn FP32 (imgs/sec)P3dn FP16 (imgs/sec)P4d Throughput TF32 (imgs/sec)P4d Throughput FP16 (imgs/sec)P4d over p3dn TF32/FP32P4d over P3dn FP16

BERT Large – Wikipedia/Books Corpus

GPUsSequence LengthBatch size / GPU: mixed precision, TF32Gradient Accumulation: mixed precision, TF32Throughput – mixed precision

You can find other code examples at github.com/NVIDIA/DeepLearningExamples.

If you want to builld your own AMI or extend an AMI maintained by your organization you can use the github repo, which provides Packer scripts to build AMIs for Amazon Linux 2 or Ubuntu 18.04 versions.


The stack includes the following components:

  • NVIDIA Driver 450.80.02
  • CUDA 11
  • NVIDIA Fabric Manager
  • cuDNN 8
  • NCCL 2.7.8
  • EFA latest driver
  • FSx kernel and client driver and utilities
  • Intel OneDNN
  • NVIDIA-runtime Docker


Get started with the new P4d instances with support on Amazon EKS, AWS Batch, and Amazon Sagemaker. We are excited to hear about what you develop and run with the new P4d instances. If you have any questions please reach out to your account team. Now, go power up your ML and HPC workloads with NVIDIA Tesla A100s and the P4d instances.

AWS Architecture Monthly Magazine: Robotics

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-robotics/

Architecture Monthly: RoboticsSeptember’s issue of AWS Architecture Monthly issue is all about robotics. Discover why iRobot, the creator of your favorite (though maybe not your pet’s favorite) little robot vacuum, decided to move its mission-critical platform to the serverless architecture of AWS. Learn how and why you sometimes need to test in a virtual environment instead of a physical one. You’ll also have the opportunity to hear from technical experts from across the robotics industry who came together for the AWS Cloud Robotics Summit in August.

Our expert this month, Matt Hansen (who has dreamed of building robots since he was a teen), gives us his outlook for the industry and explains why cloud will be an essential part of that.

In September’s Robotics issue

  • Ask an Expert: Matt Hansen, Principle Solutions Architect
  • Blog: Testing a PR2 Robot in a Simulated Hospital
  • Case Study: iRobot
  • Blog: Introduction to Automatic Testing of Robotics Applications
  • Case Study: Multiply Labs Uses AWS RoboMaker to Manufacture Individualized Medicines
  • Demos & Videos: AWS Cloud Robotics Summit (August 18-19, 2020)
  • Related Videos: iRobot and ZS Associates

Survey opportunity

This month, we’re also asking you to take a 10-question survey about your experiences with this magazine. The survey is hosted by an external company (Qualtrics), so the below survey button doesn’t lead to our website. Please note that AWS will own the data gathered from this survey, and we will not share the results we collect with survey respondents. Your responses to this survey will be subject to Amazon’s Privacy Notice. Please take a few moments to give us your opinions.

How to access the magazine

We hope you’re enjoying Architecture Monthly, and we’d like to hear from you—leave us star rating and comment on the Amazon Kindle Newsstand page or contact us anytime at [email protected].

AWS announces AWS Contact Center Intelligence solutions

Post Syndicated from Alejandra Quetzalli original https://aws.amazon.com/blogs/aws/aws-announces-aws-contact-center-intelligence-solutions/

What was announced?

We’re announcing the availability of AWS Contact Center Intelligence (CCI) solutions, a combination of services that empowers customers to easily integrate AI into contact centers, made available through AWS Partner Network (APN) partners.

AWS CCI has solutions for self-service, live-call analytics & agent assist, and post-call analytics, making it possible for customers to quickly deploy AI into their existing workflows or build completely new ones.

Pricing and regional availability correspond to the underlying services (Amazon Comprehend, Amazon Kendra, Amazon Lex, Amazon Transcribe, Amazon Translate, and Amazon Polly) used.

What is AWS Contact Center Intelligence?

We mentioned that AWS CCI brings solutions to contact centers powered by AI for before, during, and after customer interactions.

My colleague Swami Sivasubramanian (VP, Amazon Machine Learning, AWS) said: “We want to make it easy for our customers with contact centers to benefit from machine learning capabilities even if they have no machine learning expertise. By partnering with APN technology and consulting partners to bring AWS Contact Center Intelligence solutions to market, we are making it easier for customers to realize the benefits of cloud-based machine learning services while removing the heavy lifting and the need to hire specialized developers to integrate the ML capabilities in to their existing contact centers.

But what does that mean? 🤔

AWS CCI solutions lets you leverage machine learning (ML) functionality such as text-to-speech, translation, enterprise search, chatbots, business intelligence, and language comprehension into current contact center environments. Customers can now implement contact center intelligence ML solutions to aid self-service, live-call analytics & agent assist, and post-call analytics. Currently, AWS CCI solutions are available through partners such as Genesys, Vonage, and UiPath for easy integration into existing enterprise contact center systems.

“We’re proud Genesys customers will be among the first to benefit from the off-the-shelf machine learning capabilities of AWS Contact Center Intelligence solutions. It’s now simpler and more cost-effective for organizations to combine AWS’s AI capabilities, including search, text-to-speech and natural language understanding, with the advanced contact center capabilities of Genesys Cloud to give customers outstanding self-service experiences.” ~ Olivier Jouve (Executive Vice President and General Manager of Genesys Cloud)

“More and more consumers are relying on automated methods to interact with brands, especially in today’s retail environment where online shopping is taking a front seat. The Genesys Cloud and Amazon Web Services (AWS) integration will make it easier to leverage conversational AI so we can provide more effective self-service experiences for our customers.” ~ Aarde Cosseboom (Senior Director of Global Member Services Technology, Analytics and Product at TechStyle Fashion Group)


How it works and who it’s for…

AWS Contact Center Intelligence solutions offer a variety of ways that organizations can quickly and cost-effectively add machine learning-based intelligence to their contact centers, via AWS pre-trained AI Services. AWS CCI is currently available through participating APN partners, and it is focused on three stages of the contact center workflow: Self-Service, Live Call Analytics and Agent Assist, and Post-Call Analytics. Let’s break each one of these up.

The Self-Service solution helps with creation of chatbots and ML-driven IVRs (Interactive voice response) to address the most common queries a contact center workforce often gets. This now allows actual call center employees to focus on higher value work. To implement this solution, you’ll want to work with either Amazon Lex and/or Amazon Kendra. The novelty of this solution is that Lex + Kendra not only fulfills transactional queries (i.e. book a hotel room or reset my password), but also addresses the long tail of customers questions whose answers live in enterprises knowledge systems. Before, these Q&A had to be hard coded in Lex, making it harder to implement and maintain. Today, you can implement this solution directly from your existing contact center platform with AWS CCI partners, such as Genesys.

The Live Call Analytics & Agent Assist solution enables the creation of real-time ML capabilities to increase staff productivity and engagement. Here, Amazon Transcribe is used to perform real-time speech transcription, while Amazon Comprehend can analyze interactions, detect the sentiment of the caller, and identify key words and phrases in the conversation. Amazon Translate can even be added to translate the conversation into a preferred language! Now, you can implement this solution directly from several leading contact center platforms with AWS CCI partners, like SuccessKPI.

The Post-Call Analytics solution is an automatic analysis of contact center conversations, which tend to leave actionable data for product and service feedback loops. Similar to live call analytics, this solution combines Amazon Transcribe to perform speech recognition and creates a high-quality text transcription of each call, with Amazon Comprehend to analyze the interaction. Amazon Translate can be added to translate the conversation into your preferred language, and Amazon Kendra can be used for contextual natural language queries. Today, you can implement this solution directly from several leading contact center platforms with AWS CCI partners, such as Acqueon.

AWS helps partners integrate these solutions into their products. Some solutions also have a Quick Start, which includes CloudFormation templates and deployment guide, to automate the deployments. The good news is that our AWS Partners landing pages will also provide additional implementation information specific to their products. 👌

Let’s see a demo…

For today’s post, we chose to focus on diving deeper into the Self-Service and Post-Call Analytics solutions, so let’s begin with Self-Service.

We have a public GitHub repository that has a complete Quick Start template plus a detailed deployment guide with architecture diagrams. (And the good news is that our APN partner landing pages will also reference this repo!)

This GitHub repo talks about the Amazon Lex chatbot integration with Amazon Kendra. The main idea here is that the customer can bring their own document repository through Amazon Kendra, which can be sourced through Amazon Lex when customers are interacting with this Lex chatbot.

The main thing we want to notice in this architecture is that customers can bring their existing documents and allow their chatbot to search that document whenever someone interacts with said chatbot. The architecture below assumes our docs are in an S3 bucket, but it’s worth noting that Amazon Kendra can integrate with multiple kinds of data sources. If using an S3 bucket, customers must provide their own S3 bucket name, the one that has their document repository. This is a prerequisite for deployment.

Let’s follow the instructions under the repo’s Deployment Steps, skipping ahead to Step #2, “Click Deploy to launch the CloudFormation template.”

Since this is a Quick Start template, you can see how everything is already filled out for us. We click Next and move on to Step 2, Specify stack details.

Notice how the S3 bucket section is blank. You can provide your own S3 bucket name if you want to test this out with your own docs. For today, I am going to use the S3 bucket name that was provided to us in the GitHub doc.

The next part to configure will be the Cross account role configuration section. For my demo, I will add my own AWS account ID under “Assuming Account ID.”

We click Next and move on to Step 3, Configure Stack options.

Nothing to configure here, so we can click Next again and move on to Step 4, Review. We click to accept these final acknowledgements and click Create Stack.

If we were to navigate over to our deployed AWS CloudFormation stacks, we can go to Outputs of this stack and see our Kendra index name and Lex bot name.

Now if we head over to Amazon Lex, we should be able to easily find our chatbot.

We click into it and we can see that our chatbot is ready. At this point, we can start interacting with it!

We can something like “Hi” for example.

Eventually we would also get a response that details the reply source. What this means is that it will tell you if this came from Amazon Lex or from Amazon Kendra and the documents we saved in our S3 bucket.


Live Call Analytics & Agent Assist
We have two public GitHub repositories for this solution too, and both have detailed deployment guide with architecture diagrams as well.

This GitHub repo provides us a code example and a fully functional AWS Lambda function to get you started with capturing and transcribing Amazon Chime Voice Connector phone calls using Amazon Kinesis Video Streams and Amazon Transcribe. This solution gives us the ability to see how to use AI and ML services to talk to the customer’s existent environment, to drive agent assistance or analytics. We can take a real-time voice feed, transcribe that information, and then use Amazon Comprehend to pull that information out to provide the key action and sentiment.

We now also provide the Chime SIP req connector (a chime component that allows you to connect voice over an IP compatible environment with Amazon voice services) to stream voice in Amazon Transcribe from virtually any contact center. Our partner Vonage can do the same through websocket.

👉🏽 Check out the GitHub developer docs:

And as we mentioned above, for today’s post, we chose to focus on diving deeper into the Self-Service and Post-Call Analytics solutions. So let’s move on to show an example for Post-Call Analytics.


Post-Call Analytics

We have a public GitHub repository for this solution too, with another complete Quick Start template and detailed deployment guide with architecture diagrams. This solution is used after the call has ended, so that our customers can review the analytics of those calls.

This GitHub repo talks about how to look for insights and information about calls that have already happened. We call this, Quality Management. We can use Amazon Transcribe and Amazon Comprehend to pull out key words, information, and data, in order to know how to better drive what is happening in our contact center calls. We can then review these insights on Amazon QuickSight.

Let’s look at the architecture diagram for this solution too. Our call recording gets stored in an S3 bucket, which is then picked up by a Lambda function which does a transcription using Amazon Transcribe. It puts the result in a different bucket and then that call’s metadata gets stored in DynamoDB. Now Amazon Comprehend can conduct text analysis on the call’s metadata, and stores the result in a Text analysis Output bucket. Eventually, QuickSight is used to provide dashboards showing the resulting call analytics.

Just like in the previous example, we move down to the Deployment steps section. Just like before, we have a pre-made CloudFormation template that is ready to be deployed.

Step 1, Specify template is good to go, so we click Next.

In Step 2, Specify stack details, something important to note is that the User Pool Domain Name must be globally unique.

We click Next and move on to Step 3, Configure Stack options. Nothing additional to configure here either, so we can click Next again and move on to Step 4, Review.

We click to accept these final acknowledgements and click Create Stack.

And if we were to navigate over to our deployed AWS CloudFormation stacks again, we can go to Outputs of this stack and see the PortalEndpoint key. After the stack creation has completed successfully, and portal website is available at CloudFront distribution endpoint. This key is what will allow us to find the portal URL.

We will need to have user created in Amazon Cognito for the next steps to work. (If you have never created one, visit this how-to guide.)

⚠ NOTE: Make sure to open the portal URL endpoint in a different Incognito Window as the portal attaches a QuickSight User Role that can interfere with your actual role.

We go to the portal URL and login with our created Cognito user. We’re prompted to change the temporary password and are eventually directed to the QuickSight homepage.

Now we want to upload the audio files of our calls and we can do so with the Upload button.

After successfully uploading our audio files, the audio processing will run through transcription and text analysis. At this point we can click on the Call Analytics logo in the top left of the Navigation Bar to return to home page.

Now we can drill down into a call to see Amazon Comprehend’s result of the call classifications and turn-by-turn sentiments.


🌎 Lastly…

Regional availability for AWS Contact Center Intelligence (CCI) solutions correspond to the underlying services (Amazon Comprehend, Amazon Kendra, Amazon Lex, Amazon Transcribe, Amazon Translate) used.

We are announcing AWS CCI availability with 12 APN partners: Genesys, UiPath, Vonage, Acqueon, SuccessKPI, and Inference Solutions (Technology partners), and Slalom, Onica/Rackspace, TensorIoT, Quantiphi, Accenture, and HGS Digital (Consulting partners).

Ready to get started? Contact one of the AWS CCI launch partners listed on the AWS CCI web page.


You may also want to see…

👉🏽AWS Quick Start links from post:


¡Gracias por tu tiempo!
~Alejandra 💁🏻‍♀️🤖 y Canela 🐾

AWS Architecture Monthly Magazine: Agriculture

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/aws-architecture-monthly-magazine-agriculture/

Architecture Monthly Magazine cover - AgricultureIn this month’s issue of AWS Architecture Monthly, Worldwide Tech Lead for Agriculture, Karen Hildebrand (who’s also a fourth generation farmer) refers to agriculture as “the connective tissue our world needs to survive.” As our expert for August’s Agriculture issue, she also talks about what role cloud will play in future development efforts in this industry and why developing personal connections with our AWS agriculture customers is one of the most important aspects of our jobs.

You’ll also buzz through the world of high tech beehives, milk the information about data analytics-savvy cows, and see what the reference architecture of a Smart Farm looks like.

In August’s issue Agriculture issue

  • Ask an Expert: Karen Hildebrand, AWS WW Agriculture Tech Leader
  • Customer Success Story: Tine & Crayon: Revolutionizing the Norwegian Dairy Industry Using Machine Learning on AWS
  • Blog Post: Beewise Combines IoT and AI to Offer an Automated Beehive
  • Reference Architecture:Smart Farm: Enabling Sensor, Computer Vision, and Edge Inference in Agriculture
  • Customer Success Story: Farmobile: Empowering the Agriculture Industry Through Data
  • Blog Post: The Cow Collar Wearable: How Halter benefits from FreeRTOS
  • Related Videos: DuPont, mPrest & Netafirm, and Veolia

Survey opportunity

This month, we’re also asking you to take a 10-question survey about your experiences with this magazine. The survey is hosted by an external company (Qualtrics), so the below survey button doesn’t lead to our website. Please note that AWS will own the data gathered from this survey, and we will not share the results we collect with survey respondents. Your responses to this survey will be subject to Amazon’s Privacy Notice. Please take a few moments to give us your opinions.

How to access the magazine

We hope you’re enjoying Architecture Monthly, and we’d like to hear from you—leave us star rating and comment on the Amazon Kindle Newsstand page or contact us anytime at [email protected].

Field Notes: Inference C++ Models Using SageMaker Processing

Post Syndicated from Qingwei Li original https://aws.amazon.com/blogs/architecture/field-notes-inference-c-models-using-sagemaker-processing/

Machine learning has existed for decades. Before the prevalence of doing machine learning with Python, many other languages such as Java, and C++ were used to build models. Refactoring legacy models in C++ or Java could be forbiddingly expensive and time consuming. Customers need to know how they can bring their legacy models in C++ to the cloud, so that they can run model inference faster and at a lower cost.

Amazon SageMaker Processing is a new capability of Amazon SageMaker for running processing and model evaluation workloads with a fully managed experience. Amazon SageMaker Processing enables customers to run analytics jobs for data engineering and model evaluation on Amazon SageMaker easily, and at scale. SageMaker Processing allows customers to enjoy the benefits of a fully managed environment with all the security and compliance built into Amazon SageMaker.

In this blog post, we demonstrate inferencing a C++ model using SageMaker Processing. We first explain the C++ program we use to represent a simple linear regression model, and the Python script we use to run inference. Then, we build a custom container that contains the C++ model and Python script. Lastly, we run a SageMaker ScriptProcessor job for inference. The code from this post is available in the GitHub repo.


To run this code, you need to have permissions to access Amazon S3, push a Docker image to Amazon ECR, and create SageMaker Processing jobs.

Prepare a C++ Model

We use a simple C++ test file for demonstration purposes. This C++ program accepts input data as a series of strings separated by a comma. For example, “2,3“ represents a row of input data, labeled 2 and 3 in two separate columns.

We use a simple linear regression model y=x1 + x2 in this blog post for demonstration purposes. Customer can modify the C++ inference code to inference more realistic and sophisticated models.  The C++ code is made up of the following steps:

  • Receives data record for inferencing from C++ command line parameters.
  • Parses out data columns and stores data in a C++ vector. We use “,” to separate data columns.
  • Loops through data columns and calculates the sum.
  • Prints out the result to standard output stream.

We can compile the C++ program to an executable file using g++. The complete C++ script is shown in the following code:

#include <sstream>
#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <iostream>
using namespace std;

void print(std::vector<int> const &input)
    for (int i = 0; i < input.size(); i++)
        std::cout << input.at(i);
        if (i!=input.size()-1)
            cout<< ',';

std::vector<std::string> split(const std::string& s, char delimiter)
   std::vector<std::string> tokens;
   std::string token;
   std::istringstream tokenStream(s);
   while (std::getline(tokenStream, token, delimiter))
   return tokens;

int main(int argc, char* argv[])
    vector<int> result;
    int counter = 0;
    int result_temp = 0;
    //assuming one argv
    string t1(argv[1]);
    vector<string> temp_str = split(t1, ',');
    vector<string>::iterator pos; 

    for (pos = temp_str.begin(); pos < temp_str.end(); pos++)
        int temp_int;
        istringstream(*pos) >> temp_int;
        if (counter == 0)
            result_temp += temp_int;
        if (counter == 1)
            result_temp += temp_int;
            result_temp = 0;
            counter = 0;
    return 0;

Create a SageMaker Processing script

This notebook uses the ScriptProcessor class from the Amazon SageMaker Python SDK. The ScriptProcessor class runs a Python script with your own Docker image that processes input data, and saves the processed data in Amazon S3.  For more information, review Run Scripts with Your own Processing Container.

When the processing job starts, the data files are automatically downloaded by SageMaker from S3 to the designated local directory in the processing compute instance.

Your Python script, process_script.py, first finds all data files under /opt/ml/processing/input/ directory. By default, when you use multiple instances, the data files from S3 are duplicated to each processing compute instance. That means every instance gets the full dataset. By setting s3_data_distribution_type='ShardedByS3Key' , each instance gets approximately 1/n of the number of total input date files, where n is the number of compute instances. For more effective parallel processing, partition input data into multiple files to help ensure each node processes a different set of input data.

The Python script reads each data file into memory and converts it into a long string ready for C++ executable to consume. The subprocess module from Python runs the C++ executable and connects to output and error pipes. The output is saved as a CSV file to /opt/ml/processing/output directory. Upon completion, SageMaker Processing uploads output files in this directory from every Processing instance to Amazon S3.

def call_one_exe(a):
    p = subprocess.Popen(["./a.out",
    p_out, err= p.communicate()
    output = p_out.decode("utf-8")
    return output.split(',')

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    #user can pass their own argument from Processor. 
    args, _ = parser.parse_known_args()
    print('Received arguments {}'.format(args))
    files = glob('/opt/ml/processing/input/*.csv')
    for i, f in enumerate(files):
            data = pd.read_csv(f, header=None)
            string = str(list(data.values.flat)).replace(' ','')[1:-1]
            predictions = call_one_exe(string)
            output_path = os.path.join('/opt/ml/processing/output', str(i)+'_out.csv')
            print('Saving training features to {}'.format(output_path))
            pd.DataFrame({'results':predictions}).to_csv(output_path, header=False, index=False)
        except Exception as e:

Build your own SageMaker Processing container

The processing container is defined as shown in the following image. We have Anaconda and Pandas installed into the container. a.out is the C++ executable that contains the model inference logic. process_script.py is the Python script we use to call C++ executable and save results. We explain more about the C++ program and process_script.py in a later paragraph. Now let us build the Docker container and push it to Amazon ECR. The Dockerfile looks like the following code:

FROM ubuntu:16.04

RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl 

RUN curl -LO http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -bfp /miniconda3 && \
    rm Miniconda3-latest-Linux-x86_64.sh

ENV PATH=/miniconda3/bin:${PATH}

RUN conda update -y conda && \
    conda install -c anaconda scipy

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging

RUN pip install – no-cache -I scikit-learn==0.20.0 pandas==1.0.3 boto3 sagemaker retrying
ADD process_script.py /
ADD a.out /

Set up the ScriptProcessor and run your script

We have 10 sample data files included in this demo. Each file contains 5000 rows of arbitrarily generated data. We first upload these files to Amazon S3. We use one  ml.c5.xlarge instance for inference. You can increase the number of instance counts for a bigger dataset. Amazon SageMaker Processing runs the script in similar way as the following command, where EntryPoint is process_script.py and ImageUri is the Docker image we built earlier.

docker run – entry-point [EntryPoint] [ImageUri]

The SageMaker Processing job is set up as following,

role = get_execution_role()
script_processor = ScriptProcessor(command=['python3'],
                image_uri=Account_number + '.dkr.ecr.us-east-1.amazonaws.com/cpp_processing:latest',
                base_job_name = 'run-exe-processing',
output_location = os.path.join('s3://',default_s3_bucket, 'processing_output')

After the processing job starts, Amazon SageMaker displays job progress. Information such as Job Name, input and output locations are reported. Upon completion, we can review a few rows of the output to make sure that the processing job was successful.

print('Top 5 rows from 1_out.csv')
!aws s3 cp $output_location/0_out.csv - | head -n5


In this post, we used Amazon SageMaker Processing to run inference on C++ models. Customers can bring legacy C++ models to SageMaker for faster inference at a lower cost. For more information, review Amazon SageMaker Processing.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Building deep learning inference with AWS Lambda and Amazon EFS

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/building-deep-learning-inference-with-aws-lambda-and-amazon-efs/

This post is courtesy of Giuseppe Angelo Porcelli, Principal ML Specialist SA, and Diego Natali, Solutions Architect.

Amazon EFS for AWS Lambda makes it easier for serverless applications requiring persistent file storage or access to large amounts of reference data. Previously, applications had to download data from an object store or database to local ephemeral storage in 512-MB chunks for processing. This creates more code, causes slower startup behavior, and slower data processing. Customers also faced challenges when loading large code packages and models for ML inference.

Recently, AWS announced Amazon EFS support for AWS Lambda. It enables customers to easily share data across function invocations. It also allows you to read large reference data files, and write function output to a persistent and shared data store. Customers can now use Lambda to build data-intensive applications, and load larger libraries and models. They can process larger amounts of data in a highly distributed manner, and share data across functions, containers, and instances.

In this blog post, we show how you can use EFS to store deep learning (DL) framework libraries and models to load from Lambda to execute inferences. We provide a code example on executing serverless inferences with TensorFlow 2.

Using EFS and Lambda for deep learning inference requires to execute two steps:

  1. Storing the deep learning libraries and model on EFS
  2. Creating a Lambda function for inference, which loads the libraries and model from the EFS file system

In the next sections, we share some best practices to implement these steps, and then discuss a full, working example.


This post assumes experience with Lambda, EFS, plus general knowledge of Python programming, DL, and DL frameworks. To help you get started, read the blog post and documentation.

1. Storing the deep learning libraries and model on Amazon EFS

To populate EFS with DL framework Python libraries and the DL model, there are different options. You can use EC2 instances, third-party tools like lambdci or AWS CodeBuild. AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces software packages for deployment.

This blog post uses an AWS CodeBuild project, configured as follows:

  • The build environment is a Docker container replicating the Lambda runtime environment. To make sure that the packages work in Lambda, it uses the lambci/lambda build container images on Docker Hub.
  • The EFS file system is mounted to the CodeBuild environment.
  • Build commands are used to install the DL framework and download the model to specific paths of the file system.

After the build completes, the EFS file system contains the Python libraries and the model in specific paths. It is attached to the Lambda function for loading those libraries at runtime and execute inference.

For this example, these are the CodeBuild commands to install the TensorFlow 2 framework and an SSD (Single Shot MultiBox Detector) pre-trained object detection model from TensorFlow Hub:

'echo "Downloading and copying model..."',
'mkdir -p $CODEBUILD_EFS1/lambda/model',
'curl https://storage.googleapis.com/tfhub-modules/google/openimages_v4/ssd/mobilenet_v2/1.tar.gz – output /tmp/1.tar.gz',
'tar zxf /tmp/1.tar.gz -C $CODEBUILD_EFS1/lambda/model',

'echo "Installing virtual environment..."',
'mkdir -p $CODEBUILD_EFS1/lambda',
'python3 -m venv $CODEBUILD_EFS1/lambda/tensorflow',
'echo "Installing Tensorflow..."',
'source $CODEBUILD_EFS1/lambda/tensorflow/bin/activate && pip3 install ' +
              (props.installPackages ? props.installPackages : "tensorflow"),

'echo "Changing folder permissions..."',
'chown -R 1000:1000 $CODEBUILD_EFS1/lambda/'


  • The approach described can also work for other ML/DL frameworks
  • The EFS file system can be attached to multiple Lambda functions. This means it can share the DL framework libraries with multiple inference functions (up to 25000 connections for each file system).
  • There are alternatives to using EFS for model storage. If the model size fits in the Lambda package deployment, then you could optimize the first invocation since it doesn’t need to download the model. You can also use the function’s initializer to load the model since the first mount to EFS only takes a few hundred milliseconds.

2. Creating a Lambda function for inference

After attaching the EFS file system, you may structure the Lambda code as follows:

Lambda code structure

The code outside the handler method first adds the local mount path to the Python path. It then imports the frameworks, and loads the model into memory. Executing those operations outside of the function’s handler ensures that those objects remain initialized and reused in subsequent invocations of the same Lambda function instance. The code inside the handler runs the inference flow by reading inputs, executing the actual inference, and returning the results to the caller.

For hosting the TensorFlow 2 object detection model in the example, this is the function code:

import sys
import os

# Setting library paths.
efs_path = "/mnt/python"
python_pkg_path = os.path.join(efs_path, "tensorflow/lib/python3.8/site-packages")

import json
import string
import time
import io
import requests

# Importing TensorFlow
import tensorflow as tf

# Loading model
model_path = os.path.join(efs_path, 'model/')
loaded_model = tf.saved_model.load(model_path)
detector = loaded_model.signatures['default']

def lambda_handler(event, context):
    r = requests.get(event['url'])
    img = tf.image.decode_jpeg(r.content, channels=3)

    # Executing inference.
    converted_img  = tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]
    start_time = time.time()
    result = detector(converted_img)
    end_time = time.time()

    obj = {
        'detection_boxes' : result['detection_boxes'].numpy().tolist(),
        'detection_scores': result['detection_scores'].numpy().tolist(),
        'detection_class_entities': [el.decode('UTF-8') for el in result['detection_class_entities'].numpy()] 

    return {
        'statusCode': 200,
        'body': json.dumps(obj)

When invoked, the response is like:

    "statusCode": 200,
    "body": "{
    \"detection_boxes\": This field contains the relative position of the bounding boxes,
    \"detection_class_entities\": This field returns the class labels,
    \"detection_scores\": This field returns the detection confidences

Running the example

This working example is provided to set up and run ML/AI inference on Lambda using EFS. To run it, you must have the AWS CDK installed. Execute the following commands:

# clone repository
$ git clone https://github.com/aws-samples/lambda-efs-deep-learning-inference.git
$ cd lambda-efs-ml-demo

# Install the CDK and bootstrap the target account (if this was never done before)
$ npm install -g aws-cdk
$ cdk bootstrap aws://{account_id}/{region}

# Install packages for the project, build and deploy
$ cd cdk/
$ npm install
$ npm run build
$ cdk deploy

After deployment, note the output:

LambdaEFSMLDemo.LambdaFunctionName = LambdaEFSMLDemo-LambdaEFSMLExecuteInference17332C2-0546aa45dfXXXXXX

It takes a few minutes for AWS CodeBuild to deploy the libraries and framework to EFS. To test the Lambda function, run this command, replacing the function name:

$ aws lambda invoke \
    --function-name LambdaEFSMLDemo-LambdaEFSMLExecuteInference17332C2-0546aa45dfXXXXXX \
    --region us-east-1 \
    --cli-binary-format raw-in-base64-out \
    --payload '{"url": "https://images.pexels.com/photos/310983/pexels-photo-310983.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=650&w=940"}' \
    --region us-east-1 \

This is the output:

    "StatusCode": 200,
    "ExecutedVersion": "$LATEST"

Here you can check the inference’s result:
$ tail /tmp/return.json

The following image shows the bounding boxes created from the inference output.

Inference result

The following image shows the bounding boxes created from the inference output.

Image with bounding boxes

To generate this image with the bounding boxes, use the Jupyter notebook from the repository. We reduce the number of bounding boxes to the most relevant classes:

  • Bicycle: 91%
  • Wheel: 48%
  • Person: 45%
  • Wheel: 44%
  • Man: 40%
  • Bicycle wheel: 37%
  • Bicycle wheel: 30%

To clean up the deployment, run:

$ cdk destroy

Performance considerations

When planning for ML inference, you must keep three main aspects in mind: the type of compute resources required for inference, model size and memory footprint, function initialization and cold start.

Lambda is best suited for CPU-based inferencing, which meets the needs for most ML/DL inference use cases. Lambda’s memory can be set between 128 MB and 3008 MB. This means that large models (for example, FasterRCNN models) that may require more memory or dedicated GPUs are not a good fit.

It’s important to understand how Lambda invokes affect performance. The first request to a function instance is called a “cold-start”. This is where the function is provisioned, code downloaded, and the initializer is executed to download the code and load libraries. In this example, it takes about 40 seconds to load the full TensorFlow 2 libraries from EFS, and another 8 seconds to load the model into memory.

Subsequent calls to the same Lambda function instance don’t incur cold start latency if the request is handled by an existing execution environment. Customers who want to reduce this one-time cold start can use Provisioned Concurrency. This feature provides customers with greater control over performance of their serverless applications at any scale.

The EFS mount operation only takes a few hundred milliseconds and only happens once during the function provisioning. EFS supports up to 25,000 connections so is ideal for functions that scale up. We recommend you use EFS provisioned throughput with Provisioned Concurrency for better performance. To learn more, read the documentation about Amazon EFS performance and monitoring Amazon EFS.


This post shows how you can use EFS for Lambda to deploy large DL libraries and models into a function for synchronous invocations. The same approach can be applied to asynchronous invokes. For example, you could perform object detection on images stored in Amazon S3, or streaming invokes on data in Amazon Kinesis and Amazon DynamoDB.

EFS for Lambda enables many new use cases. To learn more about how to use EFS for Lambda, see the AWS News Blog post and read the documentation.

Field Notes: Bring your C#.NET skills to Amazon SageMaker

Post Syndicated from Haider Abdullah original https://aws.amazon.com/blogs/architecture/field-notes-bring-your-c-net-skills-to-amazon-sagemaker/

Amazon SageMaker is a fully managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the undifferentiated heavy lifting from each step of the machine learning process to make it easier to develop high-quality models.

Amazon SageMaker Notebooks are one-click Jupyter Notebooks with elastic compute that can be spun up quickly. Notebooks contain everything necessary to run or recreate a machine learning workflow. Notebooks in SageMaker are pre-loaded with all the common CUDA and cuDNN drivers, Anaconda packages, and framework libraries. However, there is a small amount of work required to support C# .NET code in the Notebooks.

This blog post focuses on customizing the Amazon SageMaker Notebooks environments to support C# .NET so C#.NET developers can get started with SageMaker Notebooks and machine learning on AWS. To provide support for C#.NET in our SageMaker Jupyter Notebook environment, we use the .NET interactive tool, which is an evolution of the Try .NET global tool. An installation script is provided for you, which  automatically downloads and installs these components. Note that the components are distributed under a proprietary license from Microsoft.

After we set up the environment, we walk through an end-to-end example of building, training, deploying, and invoking a built-in image classification model provided by SageMaker. The example in this blog post is modeled after the End-to-End Multiclass Image Classification Example but written entirely using the C# .NET APIs!


The following are the high level steps to build out and invoke a fully functioning image classification model in SageMaker using C#.NET:

  • Customize the SageMaker Jupyter Notebook instances by creating a SageMaker lifecycle configuration
  • Launch a Jupyter Notebook using the SageMaker lifecycle configuration
  • Create an Amazon S3 bucket for the validation and training datasets, and the output data
  • Train a model using the sample dataset and built-in SageMaker image classification algorithm
  • Deploy and host the model in SageMaker
  • Create an inference endpoint using the trained model
  • Invoke the endpoint with the trained model to obtain real time inferences for sample images
  • Clean up the used resources used in the example to stop incurring charges

Customize the Notebook instances

We use Amazon SageMaker lifecycle configurations to install the required components for C# .NET support.

Use the following steps to create the Lifecycle configuration.

  1. Sign in to the AWS Management Console.
  2. Navigate to Amazon SageMaker and select Lifecycle configurations from the left menu.
  3. Select Create configuration and provide a name for the configuration.
  4. In the Start notebook section paste the following script:

set -e
wget https://download.visualstudio.microsoft.com/download/pr/d731f991-8e68-4c7c-8ea0-
wget https://download.visualstudio.microsoft.com/download/pr/30ab052d-dbb6-4bce-8a44-
mkdir -p /home/ec2-user/dotnet and&amp;and&amp; tar zxf dotnet-runtime-3.0.1-linux-x64.tar.gz -C /home/ec2-user/dotnet
export DOTNET_ROOT=/home/ec2-user/dotnet
export PATH=$PATH:/home/ec2-user/dotnet
export DOTNET_CLI_HOME=/home/ec2-user/dotnet
export HOME=/home/ec2-user

tar zxf dotnet-sdk-3.1.100-linux-x64.tar.gz -C /home/ec2-user/dotnet
Dotnetdotnet tool install – global Microsoft.dotnet-interactive
export PATH=$PATH:/home/ec2-user/dotnet/.dotnet/tools
Dotnetdotnet interactive jupyter install
jupyter kernelspec list

touch /etc/profile.d/jupyter-env.sh
echo "export PATH='$PATH:/home/ec2-user/dotnet/.dotnet/tools:/home/ec2-user/dotnet'">>

touch /etc/profile.d/dotnet-env.sh
echo "export DOTNET_ROOT='/home/ec2-user/dotnet'">>/etc/profile.d/dotnet-env.sh
sudo chmod -R 777 /home/ec2-user/.dotnet

initctl restart jupyter-server – no-wait

In the above code snippet the following actions are taken:

  • Download of  the .NET SDK and runtime from Microsoft
  • Extraction of the downloads
  • Installation of the .NET interactive global tool
  • Setting of required PATH and DOTNET_ROOT variables so they are available to the Jupyter Notebook instance

Note: Creation of the lifecycle configuration automatically downloads and installs third-party software components that are subject to a proprietary license.

Launch a Jupyter Notebook instance

After the lifecycle configuration is created, we are ready to launch a Notebook instance.

Use the following steps to create the Notebook instance:

1. In the AWS Management Console, navigate to the Notebook instances page from the left menu.

2. Select Create notebook instance.

3. Provide a name for your Notebook and select an instance type (a smaller instance type such as ml.m2.medium suffices for the purposes for this example).

4. Set Elastic interface to none.

5. Expand the Additional configuration menu to expose the Lifecycle configuration drop down list. Then select the configuration you created. Volume size in GB can be set to the default of 5.

Notebook Instance Settings

6. In the Permissions and encryption section, select Create a new IAM Role from the IAM role dropdown. This is the role that is used by your Notebook instance, and you can use the provided default permissions for the role. Select Create role.

7. The Network, Git repositories, and tags sections can be left as is.

8. Select Create notebook instance.

Create an S3 bucket to hold the validation and training datasets

It takes a few minutes for the newly launched Jupyter Notebook instance to be ‘InService’.  While it is launching, create an S3 bucket to hold the datasets required for our example. Make sure to create the S3 bucket in the same Region as your Notebook instance.

Open the Jupyter Notebook and begin writing code

After the Notebook instance has launched, the ‘Status’ in the AWS Management Console will report as ‘InService’. Select the Notebook, and choose the Open Jupyter link in the Actions column. To begin authoring from scratch, select New -> .NET (C#) from the drop-down menu on the top right hand side.

A blank ‘Untitled’ file opens and is ready for you to start writing code. Alternatively, if you’d like to follow along with this post, you can download the full Notebook from Github and use the ‘Upload’ button to start stepping through the Notebook.

To run a block of code in the Notebook, click into the block and then select the Run button, or hold down your SHIFT Key and press ‘Enter’ on your keyboard.

Last checkpoint

We start by including the relevant NuGet packages for SageMaker, Amazon S3, and a JSON parser:

#r "nuget:AWSSDK.SageMaker,"
#r "nuget:AWSSDK.SageMakerRuntime,"
#r "nuget:AWSSDK.S3,"
#r "nuget:Newtonsoft.Json, 12.0.3"

Next, we create the service client objects that are used throughout the rest of the code:

static AmazonS3Client s3Client = new AmazonS3Client();
AmazonSageMakerClient smClient = new AmazonSageMakerClient();
AmazonSageMakerRuntimeClient smrClient = new AmazonSageMakerRuntimeClient();

Download the required training and validation datasets and store them in Amazon S3

The next step is to download the required training and validation datasets and store them in the S3 bucket that was previously created so they are accessible for our training job. In this demo, we use Caltech-256 dataset, which contains 30608 images of 256 objects. For the sake of brevity, the C# code to download web files and upload them to Amazon S3 is not shown here but can be found in full in the Github repo.

Train a model using the sample dataset and built-in SageMaker image classification algorithm

After we have the data available in the correct format for training, the next step is to actually train the model using the data. We start by retrieving the IAM role we want SageMaker to use from the currently running Notebook instance dynamically.

DescribeNotebookInstanceRequest dniReq = new DescribeNotebookInstanceRequest() {
   NotebookInstanceName = "dotNetV3-1"
DescribeNotebookInstanceResponse dniResp = await smClient.DescribeNotebookInstanceAsync(dniReq);

We set all the training parameters and kick off the training job. We use the built-in image classification algorithm for our training job. We specify this in the TrainingImage parameter by providing the URI for the docker image for this algorithm from the documentation–there are a number of training Images available that correspond with the desired algorithm and the Region we have chosen.

string jobName = String.Format("DEMO-imageclassification-{0}",DateTime.Now.ToString("yyyy-MM-dd-hh-mmss"));

CreateTrainingJobRequest ctrRequest = new CreateTrainingJobRequest(){
  AlgorithmSpecification = new AlgorithmSpecification(){
  TrainingImage = "433757028032.dkr.ecr.us-west-2.amazonaws.com/image-classification:1",
  TrainingInputMode = "File" 
 RoleArn = dniResp.RoleArn, 
 OutputDataConfig = new OutputDataConfig(){
   S3OutputPath = String.Format(@"Amazon S3s3://{0}/{1}/output",bucketName,jobName)
 ResourceConfig = new ResourceConfig(){
   InstanceCount = 1,
   Instance typeInstanceType = Amazon.SageMaker.TrainingInstanceType.MlP2Xlarge,
   VolumeSizeInGB = 50
 TrainingJobName = jobName,
 HyperParameters = new Dictionary&lt;string,string&gt;() {
 StoppingCondition = new StoppingCondition(){
    MaxRuntimeInSeconds = 360000
 InputDataConfig = new List&lt;Amazon.SageMaker.Model.Channel&gt;(){
   new Amazon.SageMaker.Model.Channel() {
    ChannelName = "train",
    ContentType = "application/x-recordio",
    CompressionType = Amazon.SageMaker.CompressionType.None,
    DatasourceDataSource = new Amazon.SageMaker.Model.DataSource(){
      S3DataSource = new Amazon.SageMaker.Model.S3DataSource(){
      S3DataType = Amazon.SageMaker.S3DataType.S3Prefix,
      S3Uri = s3Train,
      S3DataDistributionType = Amazon.SageMaker.S3DataDistribution.FullyReplicated
 new Amazon.SageMaker.Model.Channel(){
   ChannelName = "validation",
   ContentType = "application/x-recordio",
   CompressionType = Amazon.SageMaker.CompressionType.None,
   DatasourceDataSource = new Amazon.SageMaker.Model.DataSource(){
     S3DataSource = new Amazon.SageMaker.Model.S3DataSource(){
       S3DataType = Amazon.SageMaker.S3DataType.S3Prefix,
       S3Uri = s3Validation,
       S3DataDistributionType = Amazon.SageMaker.S3DataDistribution.FullyReplicated

Poll the job for completion status a few times until the job status reports Completed, then you can proceed to the next step.

Deploy and host the model in SageMaker

After the training job has completed, it is time to build the model. Create the request object with all required parameters and make API call to generate the model.

String modelName = String.Format(“DEMO-full-image-classification-model-{0}”,DateTime.Now.ToString(“yyyy-MM-dd-hh-mmss”));

CreateModelRequest modelRequest = new CreateModelRequest(){
  ModelName = modelName,
  ExecutionRoleArn = dniResp.RoleArn,
  PrimaryContainer = new ContainerDefinition(){
     Image = “433757028032.dkr.ecr.us-west-2.amazonaws.com/image-classification:latest”,
     ModelDataUrl = tjResp.ModelArtifacts.S3ModelArtifacts

CreateModelResponse modelResponse = await smClient.CreateModelAsync(modelRequest);

Create an inference endpoint using the trained model

After deploying the model, we are ready to create the endpoint that will be invoked to get real time inferences for images. This is a two-step process: first, we create an endpoint configuration and use it to create the endpoint itself.

CreateEndpointConfigRequest epConfReq = new CreateEndpointConfigRequest(){
   EndpointConfigName = epConfName,
   ProductionVariants = new List&lt;ProductionVariant&gt;(){
     new ProductionVariant() {
       Instance typeInstanceType = Amazon.SageMaker.ProductionVariantInstanceType.MlP28xlarge,
       InitialInstanceCount = 1,
       ModelName = modelName,
       VariantName = "AllTraffic"

CreateEndpointConfigResponse epConfResp = await smClient.CreateEndpointConfigAsync(epConfReq);

string epName = String.Format("{0}-EndPoint",jobName);

CreateEndpointRequest epReq = new CreateEndpointRequest(){
  EndpointName = epName,
  EndpointConfigName = epConfName

CreateEndpointResponse epResp = await smClient.CreateEndpointAsync(epReq);

Poll the endpoint status a few times until the it reports InService, then proceed to the next step.

Invoke the endpoint with the trained model to obtain real time inferences for sample images

Load the known list of classes/categories into a List (shortened here for brevity).  We compare the inference response to this list.

String[] categoriesArray = new String[]{"ak47", "american-flag", "backpack", "baseball-bat", "baseball-glove", "basketball-hoop", "bat", "bathtub", "bear", "beer-mug", ..... "clutter"};
List&lt;String&gt; categories = categoriesArray.ToList();

Two images from the Caltech dataset have been chosen at random to be tested against the deployed model.  These two images are downloaded locally and loaded into memory so they can be passed as payload to the endpoint. The code below demonstrates one of these in action:

webClient.DownloadFile("http://www.vision.caltech.edu/Image_Datasets/Caltech256/images/008.bathtub/008_0007.jpg", "008_0007.jpg");

MemoryStream data streamdataStream = new MemoryStream(File.ReadAllBytes(@"./008_0007.jpg"));
InvokeEndpointRequest invReq = new InvokeEndpointRequest(){
  EndpointName = epName,
  ContentType = "application/x-image",
  Body = data streamdataStream</p><p>
InvokeEndpointResponse invResp = await smrClient.InvokeEndpointAsync(invReq);

//Read the response stream back into a string so it can be reviewed
StreamReader sr = new StreamReader(invResp.Body);
String responseBody = sr.ReadToEnd();

We now have a response returned from the endpoint and we must inspect this result to determine if it is correct. The response is in the form of a list of probabilities–each item in the list represents the probability that the image provided to the endpoint matches the specific class/category in our previously loaded list.

//Load the values into a List so they can be more easily searched
List&lt;Decimal&gt; probabilities = JsonConvert.DeserializeObject&lt;List&lt;Decimal&gt;&gt;(responseBody);

//Determine which category returned the highest Probability match and print it's value and Index
var indexAtMax = probabilities.IndexOf(probabilities.Max());
Console.WriteLine(String.Format("Index of Max Probability: {0}",indexAtMax));
Console.WriteLine(String.Format("Value of Max Probability: {0}",probabilities[indexAtMax]));

//Print which Category name matches with the image
Console.WriteLine(String.Format("Category of image : {0}",categories[indexAtMax]));

The response indicates that for the given image the highest probabilistic match (~16.9%) to one of our known classes/categories of images, is at Index ‘7’ of the list.  We inspect our list of known classes/categories at the same index value to determine the name, which returns “bathtub”.  We have a successful match!

Index of Max Probability: 7
Value of Max Probability: 0.16936515271663666
Category of image : bathtub

Clean up the used resources

In order to avoid continuing charges, delete the deployed endpoint and stop the Jupyter Notebook.


If you are a C# .NET developer that was previously overwhelmed by the prospect of getting started with machine learning on AWS, following the guidance in this post will get you up and running quickly. The full Jupyter Notebook for this example can be found in the Github repo.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Automated code reviews on Bitbucket repositories and other enhancements in Amazon CodeGuru

Post Syndicated from Nikunj Vaidya original https://aws.amazon.com/blogs/devops/automated-code-reviews-on-bitbucket-repositories-and-other-enhancements-in-amazon-codeguru/

This post covers the support for the Atlassian Bitbucket Cloud source repository for Amazon CodeGuru Reviewer, which was recently announced. It also delves into new functionalities introduced to enhance the developer experience in CodeGuru Reviewer.

CodeGuru Reviewer is a machine learning-based service that scans your pull requests and gives you recommendations against your source code in Bitbucket with a description of what’s causing the issue and how to remediate it. CodeGuru Reviewer identifies code quality issues in five broad categories:

  • AWS best practices
  • Concurrency
  • Resource leaks
  • Sensitive information leaks
  • Common code bugs

You can also use CodeGuru Reviewer to provide code quality or AWS best practice recommendations when you migrate your code base to Java or adopt AWS services to achieve scale and robustness. The CodeGuru Reviewer recommendations offer unique capabilities in static code analysis, including the following:

  • Lower false positives
  • Difficult-to-identify issues like resource leaks and concurrency
  • Machine learning to evolve continuously from Amazon code bases
  • AWS best practices

In short, Amazon CodeGuru (Preview) equips your development team with the tools to maintain a high bar of coding standards in the software development process. For more information about configuring CodeGuru (Preview) for automated code reviews and performance optimization, see Automated code reviews and application profiling with Amazon CodeGuru.


In this blog, we will go over the following items:

  1. Using the Getting Started wizard.
  2. Associating the Bitbucket repository with the CodeGuru Reviewer and generating a pull request to trigger an automated code review.
  3. Using the new pull request code reviews dashboard to keep track of the history of pull requests and associated CodeGuru Reviewer recommendations
  4. Using supported APIs and AWS Command Line Interface (AWS CLI) to carry out CodeGuru Reviewer functions.


Using the Getting Started wizard

If you’re new to CodeGuru (Preview), you should follow the wizard guidance from the Getting started drop-down menu on the CodeGuru (Preview) console. This has been recently introduced to facilitate configuring the service.

Choose CodeGuru Profiler or CodeGuru Reviewer for configuration and follow the guided steps.

Screenshot of Getting Started Wizard

Associating a Bitbucket repository

This section summarizes the high level steps to associate the Bitbucket code repository with CodeGuru Reviewer. For more information, see What is Amazon CodeGuru Reviewer?

1.  On the CodeGuru (Preview) console, choose Reviewer.

2.  Choose Associated repositories.

3.  Choose Associate repository.

4.  Select Bitbucket.

5.  For Connect to Bitbucket, you can choose an existing connection from the drop-down menu or choose Create a Bitbucket connection.



6.  After choosing Create a Bitbucket connection, for Connection name, provide a name for your connection; for example, Bitbucket-Connection.

Screenshot of Create Connection Step1

7.  Choose Connect to Bitbucket Cloud.

A window opens to log in to Bitbucket.

8.  If you’re not already logged in, enter your Bitbucket login credentials.

9.  In the Bitbucket connection settings section, for Connection name, the name you entered in earlier step will be displayed.

10.  For Bitbucket Cloud apps, you can search for an existing app in the search text box or choose Install a new app.

Screenshot to Connect to Bitbucket

After you choose Install a new app, a pop-up window appears asking for authorization to grant access for AWS CodeStar.

11.  Choose Grant Access.

You now see a connection string populated in the Bitbucket Cloud apps field.


12.  Choose Connect.

You return to the earlier screen, which provides you with a drop-down menu of repositories from your Bitbucket account.


13.  Choose an appropriate repository and choose Associate.


You can see your repository is in the status Associating as shown below:

Screenshot for Reviewer connection with Bitbucket in Associating State


When the association is complete, the status shows as Associated. The following screenshot shows two repositories in the Associated state. This indicates that CodeGuru (Preview) is now listening for any pull request notifications from these repositories.

Screenshot showing in Associated state


Now you can go to the Bitbucket site and access your repository.


14.  On the Bitbucket website, create a pull request for the Bitbucket repository.

Screenshot for Pull-Request


CodeGuru (Preview) is notified of any pull requests created on that repository. This triggers the code review for the code referenced in the pull request. You can see generated recommendations on the Activity tab.


The following screenshot shows recommendations generated on the Bitbucket dashboard.

Screenshot for Recommendation


You can now take further actions to address the comments and merge the code.


Using the pull request code reviews dashboard

AWS introduced this dashboard based on early feedback requesting a centralized place to manage details about code review history. The pull request dashboard allows you to view CodeGuru Reviewer recommendations for all code reviews. This page lists all code reviews with accompanying information such as the status of the code review, the repository, the number of recommendations, and more.

PR Dashboard


Every code review is assigned a unique ARN that allows you to see the details of an individual code review, including any recommendations that may have surfaced.


The following screenshot shows the details of a code review.

PR Dashboard with Status



The following are key details:

  • Status – Confirms that the code review is complete. This is especially useful in use cases where there are no generated recommendations.
  • Metered lines of code – Refers to the lines of code scanned for the code request, excluding the lines without significant code (for example, commented code or lines with only opening or closing braces).
  • Pull request Id – Provides a link to navigate back to the code review page on the repository and review the complete code review activity.
  • Recommendations – Provides a text search capability to locate specific recommendations


The following screenshot shows an individual recommendation.

Individual Recommendations from PR Dashboard


You can give feedback on CodeGuru recommendations by choosing either the thumbs up or thumbs down icon below each recommendation. This gives you the opportunity to provide input about whether the recommendation was useful for you. These inputs enable AWS to evolve the service with more relevant recommendations.


Using the supported APIs and AWS CLI

The pull request code review dashboard includes the following APIs:

  • DescribeCodeReview
  • ListCodeReviews
  • ListRecommendations
  • PutRecommendationFeedback
  • DescribeRecommendationFeedback
  • ListRecommendationFeedback

In addition, you can use equivalent AWS CLI commands. To use the AWS CLI, you need to install the AWS CLI version 2. For more information about CodeGuru Reviewer operations, see codeguru-reviewer. For more information about CodeGuru Profiler operations, see codeguruprofiler.


The following are a few examples exercising the above API’s using aws cli’s:

Admin:~/environment $ aws codeguru-reviewer list-repository-associations
    "RepositoryAssociationSummaries": [
            "AssociationArn": "arn:aws:codeguru-reviewer:<Region>:<AcctID>:association:11c1b6a3-638f-4a6c-bfc8-3e286785632a",
            "LastUpdatedTimeStamp": "2020-05-06T04:46:16.742000+00:00",
            "AssociationId": "11c1b6a3-638f-4a6c-bfc8-3e286785632a",
            "Name": "codeguruapp",
            "Owner": "nvaidya1",
            "ProviderType": "Bitbucket",
            "State": "Associated"
            "AssociationArn": "arn:aws:codeguru-reviewer:<Region>:<AcctID>:association:29ec5f42-34b4-448e-909e-76fc98bd8e59",
            "LastUpdatedTimeStamp": "2020-05-06T03:13:55.878000+00:00",
            "AssociationId": "29ec5f42-34b4-448e-909e-76fc98bd8e59",
            "Name": "MyJavaProject",
            "Owner": "nvaidya1",
            "ProviderType": "Bitbucket",
            "State": "Associated"


Admin:~/environment $ aws codeguru-reviewer list-code-reviews – type PullRequest
    "CodeReviewSummaries": [
            "Name": "BITBUCKET-codeguruapp-2-3099f39ad26a2d70e93d0288dfbd8ae301e925d5",
            "CodeReviewArn": "arn:aws:codeguru-reviewer:<Region>:<AcctID>:code-review:PullRequest-BITBUCKET-codeguruapp-2-3099f39ad26a2d70e93d0288dfbd8ae301e925d5",
            "RepositoryName": "codeguruapp",
            "Owner": "<Snip>",
            "ProviderType": "Bitbucket",
            "State": "Completed",
            "CreatedTimeStamp": "2020-05-06T04:47:43.868000+00:00",
            "LastUpdatedTimeStamp": "2020-05-06T04:51:47.492000+00:00",
            "Type": "PullRequest",
            "PullRequestId": "2",
            "MetricsSummary": {
                "MeteredLinesOfCodeCount": 74,
                "FindingsCount": 5

Admin:~/environment $ aws codeguru-reviewer list-recommendations – code-review-arn arn:aws:codeguru-reviewer:<Remaining-ARN-String>
    "RecommendationSummaries": [
            "FilePath": "src/main/java/com/company/sample/application/EventHandler.java",
            "RecommendationId": "573efe3796b8b96f455591aa6eb4b675f77c95b9cf42f5a260ecc0dee0299e53",
            "StartLine": 170,
            "EndLine": 170,
            "Description": "This code is written so that the client cannot be reused across invocations of the Lambda function.\nTo improve the performance of the Lambda function, consider using static initialization/constructor, global/static variables and singletons. It allows to keep alive and reuse HTTP connections that were established during a previous invocation.\nLearn more about [best practices for working with AWS Lambda functions](https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html)."

Admin:~/environment $ aws codeguru-reviewer describe-code-review – code-review-arn arn:aws:codeguru-reviewer:<Remaining-ARN-String>
    "CodeReview": {
        "Name": "BITBUCKET-codeguruapp-2-3099f39ad26a2d70e93d0288dfbd8ae301e925d5",
        "CodeReviewArn": "arn:aws:codeguru-reviewer:<Region>:<AcctID>:code-review:PullRequest-BITBUCKET-codeguruapp-2-3099f39ad26a2d70e93d0288dfbd8ae301e925d5",
        "RepositoryName": "codeguruapp",
        "Owner": "<snip>",
        "ProviderType": "Bitbucket",
        "State": "Completed",
        "StateReason": "CodeGuru Reviewer successfully finished reviewing the pull request source code.",
        "CreatedTimeStamp": "2020-05-06T04:47:43.868000+00:00",
        "LastUpdatedTimeStamp": "2020-05-06T04:51:47.492000+00:00",
        "Type": "PullRequest",
        "PullRequestId": "2",
        "SourceCodeType": {
            "CommitDiff": {
                "SourceCommit": "3099f39ad26a2d70e93d0288dfbd8ae301e925d5",
                "DestinationCommit": "7338cd2fd99e6e663b3a312e80e5ca2b570a6891"
        "Metrics": {
            "MeteredLinesOfCodeCount": 74,
            "FindingsCount": 5

Cleaning up

When you’re finished testing, you should un-provision the service to avoid incurring further charges:

  • CodeGuru Reviewer – Remove the association of CodeGuru (Preview) to the repository, so that any further pull request notifications doesn’t trigger CodeGuru (Preview) to perform an automated code review.
  • CodeGuru Profiler – If configured, remove the profiling group.



This post reviewed CodeGuru (Preview) support for Bitbucket repositories for CodeGuru Reviewer. It also reviewed the pull request dashboard and supported APIs and AWS CLI functionalities. You can take advantage of these features to enhance your application development workflow.


Registration for Amazon re:MARS 2020 is OPEN 🎉

Post Syndicated from Alejandra Quetzalli original https://aws.amazon.com/blogs/aws/registration-for-amazon-remars-2020-is-open-%F0%9F%8E%89/

Can you believe it? It’s almost time for re:MARS 2020 🎉!

It’ll be from June 16-19 in Las Vegas, NV, with standard pricing at $1999. This year we have an all-new Amazon re:MARS Developer Day for engineers and developers to engage with Amazon product teams. We’ll also have breakout sessions just like last year where leaders, thinkers, and technical advisors can get inspired by how Amazon leaders and industry peers are applying these technologies today.

You will be able to choose from over 100 main event sessions and activities featuring visionaries working on the future, organizations applying these technologies today, and even take a peek behind the curtain at Amazon’s own implementations. You’ll also see new exhibits and interact with even more peers and experts than last year!

Get ready to hear from Jeff Bezos, Jon Favreau, and more at Amazon’s event dedicated to Machine Learning 🧠, Automation and Voice 🔊, Robotics 🤖, and Space 🛰.

Sad because you missed it last year?

Don’t be! 🙃

You can check out last year’s recorded sessions on YouTube, the Day 1 Blog post, the 2019 Session Catalog, and even Jeff Bezos’s own tweet in the following links provided…

What to get excited about for Amazon re:MARS 2020

⚠New for 2020! Amazon re:MARS Developer Day ⚠

Make sure to register in advance and arrive early for this deep dive on technical content with all kinds of Amazon product leaders. This Amazon re:MARS Developer Day — included in the conference pass — will host a keynote and a series of tracks including diverse AWS Machine Learning services, AWS RoboMaker, Alexa Skills, and Alexa for Device Makers.

🧠Machine Learning
Get ready to get your mind blown away. Come learn about pre-trained AI services for computer vision, language, recommendations, and forecasting. We’ll also teach you how to build, train and deploying machine learning models at scale and even how to build custom models with support for open-source frameworks.

🤖AWS RoboMaker
Hear from AWS Robotics leaders about the AWS RoboMaker service, how customers are using it to advance their robotics initiatives, and learn about the latest features. Follow up with deep dive sessions led by the product team to help you understand the concepts, onboard, and get started building.

🔊Alexa Skills
Have you ever wanted to take a behind-the-scenes look at Alexa developer services? Learn how Alexa can help you innovate in conversational AI and engage customers through voice; get the latest voice enablement and conversational interface features; and attend deep dives on technologies like Alexa Conversations.

🛠Alexa for Device Makers
If you enjoy tinkering with hardware, you’re going to love our sessions on building Alexa into appliances, cameras, computers, headsets, TVs and more! A voice interface makes it natural to control and access content and services, and joins your device to the expanding network of Alexa devices, skills and services. Learn from Alexa leaders about the latest innovations and break out into deep dive sessions on the tech.

🗣Our Keynote Speakers

  • Jeff Bezos, Founder & CEO of Amazon
  • Jeff Wilke, CEO, Worldwide Consumer, Amazon
  • Dr. Cynthia Breazeal, Professor of Media Arts and Sciences, Head of Personal Robots Group, MIT
  • Dr. Kate Darling, Leading Expert in Social Robotics and MIT Media Lab Research Specialist
  • Dr. Bethany Ehlmann, Professor of Planetary Science, Caltech, Research Scientist at Jet Propulsion Laboratory
  • Dr. Ayanna Howard, Roboticist and Chair, School of Interactive Computing, Georgia Tech
  • Dr. Maja Matarić, Chaired and Distinguished Professor, Computer Science, Neuroscience, and Pediatrics, USC
  • Dr. Sara Seager, Planetary Scientist and Astrophysicist, MIT
  • Boyan Slat, CEO and Founder of The Ocean Cleanup
  • Dr. Eric Topol, Executive Vice President, Scripps Research, Founder and Director, Scripps Research Translational Institute
  • Jon Favreau, Director, Producer, Writer, and Actor with Ben Grossman, CEO, Magnopus, and Mixed Reality Director

👩🏿‍💻Ready to register for Amazon re:MARS 2020?

Ready to join the fun? Check out the re:MARS 2020 website and register. (Psst!🤫Academics and students registering with a .edu email address can use discount code ACAD20REMARS.)

🌗Partner Sponsorship

Are you an Amazon partner (APN) looking to showcase your expertise in Machine Learning, Automation, Robotics or Space? The re:MARS sponsorship program gives sponsors access to an experiential (i.e. learning through reflection on doing) Tech Showcase that highlights their areas of expertise. Select sponsors may also offer executive thought leadership in sponsor-led breakout sessions. To learn more about this program and view all of the available options, please view the sponsorship prospectus. Questions? 📧Email our team📧 for more information.

#AstronautsAttendForFree 👩‍🚀

Will I see you there?

I hope so! All you have to do is register. 🥳



¡Gracias por tu tiempo!

~Alejandra 💁🏻‍♀️ &  Canela🐾

Build machine learning-powered business intelligence analyses using Amazon QuickSight

Post Syndicated from Osemeke Isibor original https://aws.amazon.com/blogs/big-data/build-machine-learning-powered-business-intelligence-analyses-using-amazon-quicksight/

Imagine you can see the future—to know how many customers will order your product months ahead of time so you can make adequate provisions, or to know how many of your employees will leave your organization several months in advance so you can take preemptive actions to encourage staff retention. For an organization that sees the future, the possibilities are limitless. Machine learning (ML) makes it possible to predict the future with a higher degree of accuracy.

Amazon SageMaker provides every developer and data scientist the ability to build, train, and deploy ML models quickly, but for business users who usually work on creating business intelligence dashboards and reports rather than ML models, Amazon QuickSight is the service of choice. With Amazon QuickSight, you can still use ML to forecast the future. This post goes through how to create business intelligence analyses that use ML to forecast future data points and detect anomalies in data, with no technical expertise or ML experience needed.

Overview of solution

Amazon QuickSight ML Insights uses AWS-proven ML and natural language capabilities to help you gain deeper insights from your data. These powerful, out-of-the-box features make it easy to discover hidden trends and outliers, identify key business drivers, and perform powerful what-if analysis and forecasting with no technical or ML experience. You can use ML insights in sales reporting, web analytics, financial planning, and more. You can detect insights buried in aggregates, perform interactive what-if analysis, and discover what activities you need to meet business goals.

This post imports data from Amazon S3 into Amazon QuickSight and creates ML-powered analyses with the imported data. The following diagram illustrates this architecture.


In this walkthrough, you create an Amazon QuickSight analysis that contains ML-powered visuals that forecast the future demand for taxis in New York City. You also generate ML-powered insights to detect anomalies in your data. This post uses the New York City Taxi and Limousine Commission (TLC) Trip Record Data on the Registry of Open Data on AWS.

The walkthrough includes the following steps:

  1. Set up and import data into Amazon QuickSight
  2. Create an ML-powered visual to forecast the future demand for taxis
  3. Generate an ML-powered insight to detect anomalies in the data set


For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Amazon Quicksight Enterprise edition
  • Basic knowledge of AWS

Setting up and importing data into Amazon QuickSight

Set up Amazon Quicksight as an individual user. Complete the following steps:

  1. On the AWS Management Console, in the Region list, select US East (N. Virginia) or any Region of your choice that Amazon QuickSight
  2. Under Analytics, for Services, choose Amazon QuickSight.If you already have an existing Amazon QuickSight account, make sure it is the Enterprise edition; if it is not, upgrade to Enterprise edition. For more information, see Upgrading your Amazon QuickSight Subscription from Standard Edition to Enterprise Edition.If you do not have an existing Amazon QuickSight account, proceed with the setup and make sure you choose Enterprise Edition when setting up the account. For more information, see Setup a Free Standalone User Account in Amazon QuickSight.After you complete the setup, a Welcome Wizard screen appears.
  1. Choose Next on each of the Welcome Wizard screens.
  2. Choose Get Started.

Before you import the data set, make sure that you have at least 3GB of SPICE capacity. For more information, see Managing SPICE Capacity.

Importing the NYC Taxi data set into Amazon QuickSight

The NYC Taxi data set is in an S3 bucket. To import S3 data into Amazon QuickSight, use a manifest file. For more information, see Supported Formats for Amazon S3 Manifest Files. To import your data, complete the following steps:

  1. In a new text file, copy and paste the following code:
    "fileLocations": [
                "URIs": [
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-01.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-02.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-03.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-04.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-05.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-06.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-07.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-08.csv", 
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-09.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-10.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-11.csv", 
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-12.csv"
        "globalUploadSettings": {
            "textqualifier": "\""

  2. Save the text file as nyc-taxi.json.
  3. On the Amazon QuickSight console, choose New analysis.
  4. Choose New data set.
  5. For data source, choose S3.
  6. Under New S3 data source, for Data source name, enter a name of your choice.
  7. For Upload a manifest file field, select Upload.
  8. Choose the nyc-taxi.json file you created earlier.
  9. Choose Connect.
    The S3 bucket this post uses is a public bucket that contains a public data set and open to the public. When using S3 buckets in your account with Amazon QuickSight, it is highly recommended that the buckets are not open to the public; you need to configure authentication to access your S3 bucket from Amazon QuickSight. For more information about troubleshooting, see I Can’t Connect to Amazon S3.After you choose Connect, the Finish data set creation screen appears.
  10. Choose Visualize.
  11. Wait for the import to complete.

You can see the progress on the top right corner of the screen. When the import is complete, the result shows the number of rows imported successfully and the number of rows skipped.

Creating an ML-powered visual

After you import the data set into Amazon QuickSight SPICE, you can start creating analyses and visuals. Your goal is to create an ML-powered visual to forecast the future demand for taxis. For more information, see Forecasting and Creating What-If Scenarios with Amazon Quicksight.

To create your visual, complete the following steps:

  1. From the Data source details pop screen, choose Visualize.
  2. From the field list, select Lpep_pickup_datetime.
  3. Under Visual types, select the first visual.Amazon QuickSight automatically uses the best visual based on the number and data type of fields you selected. From your selection, Amazon Quicksight displays a line chart visual.From the preceding graph, you can see that the bulk of your data clusters are around December 31, 2017, to January 1, 2019, for the Lpep_pickup_datetime field. There are a few data points with date ranges up to June 2080. These values are incorrect and can impact your ML forecasts.To clean up your data set, filter out the incorrect data using the data in the Lpep_pickup_datetime. This post only uses data in which Lpep_pickup_datetime falls between January 1, 2018, and December 18, 2018, because there is a more consistent amount of data within this date range.
  4. Use the filter menu to create a filter using the Lpep_pickup_datetime
  5. Under Filter type, choose Time range and Between.
  6. For Start date, enter 2018-01-01 00:00.
  7. Select Include start date.
  8. For End date, enter 2018-12-18 00:00.
  9. Select Include end date.
  10. Choose Apply.

The line chart should now contain only data with Lpep_pickup_datetime from January 1, 2018, and December 18, 2018. You can now add the forecast for the next 31 days to the line chart visual.

Adding the forecast to your visual

To add your forecast, complete the following steps:

  1. On the visual, choose the arrow.
  2. From the drop-down menu, choose Add forecast.
  3. Under Forecast properties, for Forecast length, for Periods forward, enter 31.
  4. For Periods backwards, enter 0.
  5. For Prediction interval, leave at the default value 90.
  6. For Seasonality, leave at the default selection Automatic.
  7. Choose Apply.

You now see an orange line on your graph, which is the forecasted pickup quantity per day for the next 31 days after December 18, 2018. You can explore the different dates by hovering your cursor over different points on the forecasted pickup line. For example, hovering your cursor over January 10, 2019, shows that the expected forecasted number of pickups for that day is approximately 22,000. The forecast also provides an upper bound (maximum number of pickups forecasted) of about 26,000 and a lower bound (minimum number of pickups forecasted) of about 18,000.

You can create multiple visuals with forecasts and combine them into a sharable Amazon QuickSight dashboard. For more information, see Working with Dashboards.

Generating an ML-powered insight to detect anomalies

In Amazon QuickSight, you can add insights, autonarratives, and ML-powered anomaly detection to your analyses without ML expertise or knowledge. Amazon QuickSight generates suggested insights and autonarratives automatically, but for ML-powered anomaly detection, you need to perform additional steps. For more information, see Using ML-Powered Anomaly Detection.

This post checks if there are any anomalies in the total fare amount over time from select locations. For example, if the total fare charged for taxi rides is about $1,000 and above from the first pickup location (for example, the airport) for most dates in your data set, an anomaly is when the total fare charged deviates from the standard pattern. Anomalies are not necessarily negative, but rather abnormalities that you can choose to investigate further.

To create an anomaly insight, complete the following steps:

  1. From the top right corner of the analysis creation screen, click the Add drop-down menu and choose Add insight.
  2. On the Computation screen, for Computation type, select Anomaly detection.
  3. Under Fields list, choose the following fields:
  • fare_amount
  • lpep_pickup_datetime
  • PULocationID
  1. Choose Get started.
  2. The Configure anomaly detection
  3. Choose Analyze all combinations of these categories.
  4. Leave the other settings as their default.You can now perform a contribution analysis and discover how the drop-off location contributed to the anomalies. For more information, see Viewing Top Contributors.
  5. Under Contribution analysis, choose DOLocationID.
  6. Choose Save.
  7. Choose Run Now.The anomaly detection can take up to 10 minutes to complete. If it is still running after about 10 minutes, your browser may have timed out. Refresh your browser and you should see the anomalies displayed in the visual.
  8. Choose Explore anomalies.

By default, the anomalies you see are for the last date in your data set. You can explore the anomalies across the entire date range of your data set by choosing SHOW ANOMALIES BY DATE and dragging the slider at the bottom of the visual to display the entire date range from January 1, 2018, to December 30, 2018.

This graph shows that March 21, 2018, has the highest number of anomalies of the fare charged in the entire data set. For example, the total fare amount charged by taxis that picked up passengers from location 74 on March 21, 2018, was 7,181. This is -64% (about 19,728.5) of the total fare charged by taxis for the same pickup location on March 20, 2018. When you explore the anomalies of other pickup locations for that same date, you can see that they all have similar drops in the total fare charged. You can also see the top DOLocationID contributors to these anomalies.

What happened in New York City on March 21, 2018, to cause this drop? A quick online search reveals that New York City experienced a severe weather condition on March 21, 2018.

Publishing your analyses to a dashboard, sharing the dashboard, and setting up email alerts

You can create additional visuals to your analyses, publish the analyses as a dashboard, and share the dashboard with other users. QuickSight anomaly detection allows you to uncover hidden insights in your data by continuously analyzing billions of data points. You can subscribe to receive alerts to your inbox if an anomaly occurs in your business metrics. The email alert also indicates the factors that contribute to these anomalies. This allows you to act immediately on the business metrics that need attention.

From the QuickSight dashboard, you can configure an anomaly alert to be sent to your email with Severity set to High and above and Direction set to Lower than expected. Also make sure to schedule a data refresh so that the anomaly detection runs on your most recent data. For more information, see Refreshing Data.

Cleaning up

To avoid incurring future charges, you need to cancel your Amazon Quicksight subscription.


This post walked you through how to use ML-powered insights with Amazon QuickSight to forecast future data points, detect anomalies, and derive valuable insights from your data without needing prior experience or knowledge of ML. If you want to do more forecasting without ML experience, check out Amazon Forecast.

If you have questions or suggestions, please leave a comment.


About the Author

Osemeke Isibor is a partner solutions architect at AWS. He works with AWS Partner Network (APN) partners to design secure, highly available, scalable and cost optimized solutions on AWS. He is a science-fiction enthusiast and a fan of anime.




Optimizing deep learning on P3 and P3dn with EFA

Post Syndicated from whiteemm original https://aws.amazon.com/blogs/compute/optimizing-deep-learning-on-p3-and-p3dn-with-efa/

This post is written by Rashika Kheria, Software Engineer, Purna Sanyal, Senior Solutions Architect, Strategic Account and James Jeun, Sr. Product Manager

The Amazon EC2 P3dn.24xlarge instance is the latest addition to the Amazon EC2 P3 instance family, with upgrades to several components. This high-end size of the P3 family allows users to scale out to multiple nodes for distributed workloads more efficiently.  With these improvements to the instance, you can complete training jobs in a shorter amount of time and iterate on your Machine Learning (ML) models faster.


This blog reviews the significant upgrades with p3dn.24xlarge, walks you through deployment, and shows an example ML use case for these upgrades.


Overview of P3dn instance upgrades

The most notable upgrade to the p3dn.24xlarge instance is the 100-Gbps network bandwidth and the new EFA network interface that allows for highly scalable internode communication. This means you can scale runs on applications to use thousands of GPUs, which reduces time to get results. EFA’s operating system bypasses networking mechanisms and the underlying Scalable Reliable Protocol that is built in to the Nitro Controllers. The Nitro controllers enable a low-latency, low-jitter channel for inter-instance communication. EFA has been adopted in the mainline Linux and integrated with LibFabric and various distributions. AWS worked with NVIDIA for EFA to support NVIDIA Collective Communication Library (NCCL). NCCL optimizes multi-GPU and multi-node communication primitives and helps achieve high throughput over NVLink interconnects.


The following diagram shows the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.

the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.


The following table summarizes the full set of differences between p3.16xlarge and p3dn.24xlarge.

ProcessorIntel Xeon E5-2686 v4Intel Skylake 8175 (w/ AVX 512)
GPU8x 16 GB NVIDIA Tesla V1008x 32 GB NVIDIA Tesla V100
RAM488 GB768 GB
Network25 Gbps ENA100 Gbps ENA + EFA
GPU InterconnectNVLink – 300 GB/sNVLink – 300 GB/s


P3dn.24xl offers more networking bandwidth than p3.16xl. Paired with EFA’s communication library, this feature increases scaling efficiencies drastically for large-scale, distributed training jobs. Other improvements include double the GPU memory for large datasets and batch sizes, increased system memory, and more vCPUs. This upgraded instance is the most performant GPU compute option on AWS.


The upgrades also improve your workload around distributed deep learning. The GPU memory improvement enables higher intranode batch sizes. The newer Layer-wise Adaptive Rate Scaling (LARS) has been tested with ResNet50 and other deep neural networks (DNNs) to allow for larger batch sizes. The increased batch sizes reduce wall-clock time per epoch with minimal loss of accuracy. Additionally, using 100-Gbps networking with EFA heightens performance with scale. Greater networking performance is beneficial when updating weights for a large number of parameters. You can see high scaling efficiency when running distributed training on GPUs for ResNet50 type models that primarily use images for object recognition. For more information, see Scalable multi-node deep learning training using GPUs in the AWS Cloud.


Natural language processing (NLP) also presents large compute requirements for model training. This large compute requirement is especially present with the arrival of large Transformer-based models like BERT and GPT-2, which have up to a billion parameters. The following describes how to set up distributed model trainings with scalability for both image and language-based models, and also notes how the AWS P3 and P3dn instances perform.


Optimizing your P3 family

First, optimize your P3 instances with an important environmental update. This update runs traditional TCP-based networking and is in the latest release of NCCL 2.4.8 as of this writing.


Two new environmental variables are available, which allow you to take advantage of multiple TCP sockets per thread: NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD.


These environmental variables allow the NCCL backend to exceed the 10-Gbps TCP single stream bandwidth limitation in EC2.


Enter the following command:

/opt/openmpi/bin/mpirun -n 16 -N 8 – hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_NSOCKS_PERTHREAD=4 -x NCCL_SOCKET_NTHREADS=4 – mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100


The following graph shows the synthetic NCCL tests and their increased performance with the additional directives.

synthetic NCCL tests and their increased performance with the additional directives

You can achieve a two-fold increase in throughput after a threshold in the synthetic payload size (around 1 MB).



Deploying P3dn


The following steps walk you through spinning up a cluster of p3dn.24xlarge instances in a cluster placement group. This allows you to take advantage of all the new performance features within the P3 instance family. For more information, see Cluster Placement Groups in the Amazon EC2 User Guide.

This post deploys the following stack:


  1. On the Amazon EC2 console, create a security group.

 Make sure that both inbound and outbound traffic are open on all ports and protocols within the security group.


  1. Modify the user variables in the packer build script so that the variables are compatible with your environment.

The following is the modification code for your variables:



  "variables": {

    "Region": "us-west-2",

    "flag": "compute",

    "subnet_id": "<subnet-id>",

    "sg_id": "<security_group>",

    "build_ami": "ami-0e434a58221275ed4",

    "iam_role": "<iam_role>",

    "ssh_key_name": "<keyname>",

    "key_path": "/path/to/key.pem"


3. Build and Launch the AMI by running the following packer script:

Packer build nvidia-efa-fsx-al2.yml

This entire workflow takes care of setting up EFA, compiling NCCL, and installing the toolchain. After building it, you have an AMI ID that you can launch in the EC2 console. Make sure to enable the EFA.

  1. Launch a second instance in a cluster placement group so you can run two node tests.
  2. Enter the following code to make sure that all components are built correctly:


  1. The following output of the commend will confirm that the build is using EFA :

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa

INFO: Function: main Line: 49: NET/OFI Process rank 8 started. NCCLNet device used on ip-172-0-1-161 is AWS Libfabric.

INFO: Function: main Line: 53: NET/OFI Received 1 network devices

INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa


Synthetic two-node performance

This blog includes the NCCL-tests GitHub as part of the deployment stack. This shows synthetic benchmarking of the communication layer over NCCL and the EFA network.

When launching the two-node cluster, complete the following steps:

  1. Place the instances in the cluster placement group.
  2. SSH into one of the nodes.
  3. Fill out the hosts file.
  4. Run the two-node test with the following code:

/opt/openmpi/bin/mpirun -n 16 -N 8 – hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x FI_PROVIDER="efa" -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100

This test makes sure that the node performance works the way it is supposed to.

The following graph compares the NCCL bandwidth performance using -x FI_PROVIDER="efa" vs. -x FI_PROVIDER="tcp“. There is a three-fold increase in bus bandwidth when using EFA.


 -x FI_PROVIDER="efa" vs. -x FI_PROVIDER="tcp". There is a three-fold increase in bus bandwidth when using EFA. 

Now that you have run the two node tests, you can move on to a deep learning use case.

FAIRSEQ ML training on a P3dn cluster

Fairseq(-py) is a sequence modeling toolkit that allows you to train custom models for translation, summarization, language modeling, and other text-generation tasks. FAIRSEQ MACHINE TRANSLATION distributed training requires a fast network to support the Allreduce algorithm. Fairseq provides reference implementations of various sequence-to-sequence models, including convolutional neural networks (CNN), long short-term memory (LSTM) networks, and transformer (self-attention) networks.


After you receive consistent 10 GB/s bus-bandwidth on the new P3dn instance, you are ready for FAIRSEQ distributed training.

To install fairseq from source and develop locally, complete the following steps:

  1. Copy FAIRSEQ source code to one of the P3dn instance.
  2. Copy FAIRSEQ Training data in the data folder.
  3. Copy FAIRSEQ Test Data in the data folder.


git clone https://github.com/pytorch/fairseq

cd fairseq

pip install – editable . 

Now that you have FAIRSEQ installed, you can run the training model. Complete the following steps:

  1. Run FAIRSEQ Training in 1 node/8 GPU p3dn instance to check the performance and the accuracy of FAIRSEQ operations.
  2. Create a custom AMI.
  3. Build the other 31 instances from the custom AMI.


Use the following scripts for distributed All Reduce FAIRSEQ Training :


export RANK=$1 # the rank of this process, from 0 to 127 in case of 128 GPUs
export LOCAL_RANK=$2 # the local rank of this process, from 0 to 7 in case of 8 GPUs per mac
export FI_PROVIDER="efa";

export LD_LIBRARY_PATH=/opt/amazon/efa/lib64/:/home/ec2-user/aws-ofi-nccl/install/lib/:/home/ec2-user/nccl/build/lib:$LD_LIBRARY_PATH;
python train.py data-bin/wmt18_en_de_bpej32k \
   – clip-norm 0.0 -a transformer_vaswani_wmt_en_de_big \
   – lr 0.0005 – source-lang en – target-lang de \
   – label-smoothing 0.1 – upsample-primary 16 \
   – attention-dropout 0.1 – dropout 0.3 – max-tokens 3584 \
   – log-interval 100  – weight-decay 0.0 \
   – criterion label_smoothed_cross_entropy – fp16 \
   – max-update 500000 – seed 3 – save-interval-updates 16000 \
   – share-all-embeddings – optimizer adam – adam-betas '(0.9, 0.98)' \
   – lr-scheduler inverse_sqrt – warmup-init-lr 1e-07 \
   – warmup-updates 4000 – min-lr 1e-09 \
   – distributed-port 12597 – distributed-world-size 32 \
   – distributed-init-method 'tcp://' – distributed-rank $RANK \
   – device-id $LOCAL_RANK \
   – max-epoch 3 \
   – no-progress-bar  – no-save

Now that you have completed and validated your base infrastructure layer, you can add additional components to the stack for various workflows. The following charts show time-to-train improvement factors when scaling out to multiple GPUs for FARSEQ model training.

time-to-train improvement factors when scaling out to multiple GPUs for FARSEQ model training



EFA on p3dn.24xlarge allows you to take advantage of additional performance at scale with no change in code. With this updated infrastructure, you can decrease cost and time to results by using more GPUs to scale out and get more done on complex workloads like natural language processing. This blog provides much of the undifferentiated heavy lifting with the DLAMI integrated with EFA. Go power up your ML workloads with EFA!