Tag Archives: Events

Getting started with testing serverless applications

Post Syndicated from Talia Nassi original https://aws.amazon.com/blogs/compute/getting-started-with-testing-serverless-applications/

Testing is an essential step in the software development lifecycle. Through the different types of tests, you validate user experience, performance, and detect bugs in your code. Features should not be considered done until all of the corresponding tests are written.

The distributed nature of serverless architectures separates your application logic from other concerns like state management, request routing, workflow orchestration, and queue polling.

In this post, I cover the three main types of testing developers do when building applications. I also go through what changes and what stays the same when building serverless applications with AWS Lambda, in addition to the challenges of testing serverless applications.

The challenges of testing serverless applications

To test your code fully using managed services, you need to emulate the cloud environment on your local machine. However, this is usually not practical.

Secondly, using many managed services for event-driven architecture means you must also account for external resources like queues, database tables, and event buses. This means you write more integration tests than unit tests, altering the standard testing pyramid. Building more integration tests can impact the maintenance of your tests or slow your testing speed.

Lastly, with synchronous workloads, such as a traditional web service, you make a request and assert on the response. The test doesn’t need to do anything special because the thread is blocked until the response returns.

However, in the case of event-driven architectures, state changes are driven by events flowing from one resource to another. Your tests must detect side effects in downstream components and these might not be immediate. This means that the tests must tolerate asynchronous behaviors, which can make for more complicated and slower-running tests.

Unit testing

Unit tests validate individual units of your code, independent from any other components. Unit tests check the smallest unit of functionality and should only have one reason to fail – the unit is not correctly implemented.

Unit tests generally cover the smallest units of functionality although the size of each unit can vary. For example, a number of functions may provide a coherent piece of behavior and you may want to test them as a single unit. In this case, your unit test might call an entry-point function that invokes several others to do its job. You test these functions together as a single unit.

unit testing

Integration Testing

One good practice to test how services interact with each other is to write integration tests that mock the behavior of your services in the cloud.

The point of integration tests is to make sure that two components of your application work together properly. Integration tests are important in serverless applications because they rely heavily on integrations of different services. Unless you are testing in production, the most efficient way to run automated integration tests is to emulate your services in the cloud.

This can be done with tools like moto. Moto mocks calls to AWS automatically without requiring any other dependencies. Another useful tool is localstack. Localstack allows you to mock certain AWS service APIs on your local machine that you can use for testing the integration of two or more services.

You can also configure test events and manually test directly from the Lambda console. Remember that when you test a Lambda function, you are not only testing the business logic. You must also mock its payload and call a function invoke. There are over 200 event sources that can trigger Lambda functions. Each service has its own unique event format, and contains data about the resource or request that invoked the function. Find the full list of test events in the AWS documentation.

To configure a test event for AWS Lambda:

  1. Navigate to the Lambda console and choose the function. From the Test dropdown, choose Configure Test Event.step 1
  2. Choose Create a new Test Event and select the template for the service you want to act as the trigger for your Lambda function. In this example, you choose Amazon DynamoDB Update.
    step 2
  3. Save the test event and choose Test in the Code source section. Each user can create up to 10 test events per function. Those test events are private to you. Lambda runs the function on your behalf. The function handler receives and then processes the sample event.
    step 3
  4. The Execution result shows the execution status as succeeded.

End-to-end testing

When testing your serverless applications end-to-end, it’s important to understand your user and data flows. The most important business-critical flows of your software are what should be tested end-to-end in your UI.

From a business perspective, these should be the most valuable user and data flows that occur in your product. Another resource to utilize is data from your customers. From your analytics platform, find the actions that users are doing the most in production.

End-to-end tests should be running in your build pipeline and act as blockers if one of them fails. They should also be updated as new features are added to your product.

The testing pyramid

testing pyramid

The standard testing pyramid above on the left indicates that systems should have more unit tests than any other type of test, then a medium number of integration tests, and the least number of end-to-end tests.

However, when testing serverless applications, this standard shifts to a hexagonal structure on the right because it’s mostly made up of two or more AWS services talking to each other. You can mock out those integrations with tools such as moto or localstack.

Add automated tests to your CI/CD pipeline

As serverless applications scale, having automated tests is essential in getting fast feedback on the current state of your product. It is not scalable to test everything manually, so investing in an automation tool to run your tests is essential.

All of the tests in your build pipeline, including unit, integration, and end-to-end tests should be blocking in your CI/CD pipeline. This means if one of them fails, it should block the promotion of that code into production. And remember – there’s no such thing as a flakey test. Either the test does what it’s supposed to do, or it doesn’t.

Narrowly scope your tests

Testing asynchronous processes can be tricky. Not only must you monitor different parts of your system, you also need to know when to stop waiting and end the test. When there are multiple asynchronous steps, the delays add up to a longer-running test. It’s also more difficult to estimate how long we should wait before ending. There are two approaches to mitigate these issues.

Firstly, write separate, more narrowly-scoped tests over each asynchronous step. This limits the possible causes of asynchronous test failure you need to investigate. Also, with fewer asynchronous steps, these tests will run quicker and it will be easier to estimate how long to wait before timing out.

Secondly, verify as much of your system as possible using synchronous tests. Then, you only need asynchronous tests to verify residual concerns that aren’t already covered. Synchronous tests are also easier to diagnose when they fail, so you want to catch as many issues with them as possible before running your asynchronous tests.


In this blog post, you learn the three types of testing – unit testing, integration testing, and end-to-end testing. Then you learn how to configure test events with Lambda. I then cover the shift from the standard testing pyramid to the hexagonal testing pyramid for serverless, and why more integration tests are necessary. Then you learn a few best practices to keep in mind for getting started with testing your serverless applications.

For more information on serverless, head to Serverless Land.

Welcome to AWS Storage Day 2021

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/welcome-to-aws-storage-day-2021/

Welcome to the third annual AWS Storage Day 2021! During Storage Day 2020 and the first-ever Storage Day 2019 we made many impactful announcements for our customers and this year will be no different. The one-day, free AWS Storage Day 2021 virtual event will be hosted on the AWS channel on Twitch. You’ll hear from experts about announcements, leadership insights, and educational content related to AWS Storage services.

AWS Storage DayThe first part of the day is the leadership track. Wayne Duso, VP of Storage, Edge, and Data Governance, will be presenting a live keynote. He’ll share information about what’s new in AWS Cloud Storage and how these services can help businesses increase agility and accelerate innovation. The keynote will be followed by live interviews with the AWS Storage leadership team, including Mai-Lan Tomsen Bukovec, VP of AWS Block and Object Storage.

The second part of the day is a technical track in which you’ll learn more about Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (EBS), Amazon Elastic File System (Amazon EFS), AWS Backup, Cloud Data Migration, AWS Transfer Family and Amazon FSx.

To register for the event, visit the AWS Storage Day 2021 event page.

Now as Jeff Barr likes to say, let’s get into the announcements.

Amazon FSx for NetApp ONTAP
Today, we are pleased to announce Amazon FSx for NetApp ONTAP, a new storage service that allows you to launch and run fully managed NetApp ONTAP file systems in the cloud. Amazon FSx for NetApp ONTAP joins Amazon FSx for Lustre and Amazon FSx for Windows File Server as the newest file system offered by Amazon FSx.

Amazon FSx for NetApp ONTAP provides the full ONTAP experience with capabilities and APIs that make it easy to run applications that rely on NetApp or network-attached storage (NAS) appliances on AWS without changing your application code or how you manage your data. To learn more, read New – Amazon FSx for NetApp ONTAP.

Amazon S3
Amazon S3 Multi-Region Access Points is a new S3 feature that allows you to define global endpoints that span buckets in multiple AWS Regions. Using this feature, you can now build multi-region applications without adding complexity to your applications, with the same system architecture as if you were using a single AWS Region.

S3 Multi-Region Access Points is built on top of AWS Global Accelerator and routes S3 requests over the global AWS network. S3 Multi-Region Access Points dynamically routes your requests to the lowest latency copy of your data, so the upload and download performance can increase by 60 percent. It’s a great solution for applications that rely on reading files from S3 and also for applications like autonomous vehicles that need to write a lot of data to S3. To learn more about this new launch, read How to Accelerate Performance and Availability of Multi-Region Applications with Amazon S3 Multi-Region Access Points.

Creating a multi-region access point

There’s also great news about the Amazon S3 Intelligent-Tiering storage class! The conditions of usage have been updated. There is no longer a minimum storage duration for all objects stored in S3 Intelligent-Tiering, and monitoring and automation charges for objects smaller than 128 KB have been removed. Smaller objects (128 KB or less) are not eligible for auto-tiering when stored in S3 Intelligent-Tiering. Now that there is no monitoring and automation charge for small objects and no minimum storage duration, you can use the S3 Intelligent-Tiering storage class by default for all your workloads with unknown or changing access patterns. To learn more about this announcement, read Amazon S3 Intelligent-Tiering – Improved Cost Optimizations for Short-Lived and Small Objects.

Amazon EFS
Amazon EFS Intelligent Tiering is a new capability that makes it easier to optimize costs for shared file storage when access patterns change. When you enable Amazon EFS Intelligent-Tiering, it will store the files in the appropriate storage class at the right time. For example, if you have a file that is not used for a period of time, EFS Intelligent-Tiering will move the file to the Infrequent Access (IA) storage class. If the file is accessed again, Intelligent-Tiering will automatically move it back to the Standard storage class.

To get started with Intelligent-Tiering, enable lifecycle management in a new or existing file system and choose a lifecycle policy to automatically transition files between different storage classes. Amazon EFS Intelligent-Tiering is perfect for workloads with changing or unknown access patterns, such as machine learning inference and training, analytics, content management and media assets. To learn more about this launch, read Amazon EFS Intelligent-Tiering Optimizes Costs for Workloads with Changing Access Patterns.

AWS Backup
AWS Backup Audit Manager allows you to simplify data governance and compliance management of your backups across supported AWS services. It provides customizable controls and parameters, like backup frequency or retention period. You can also audit your backups to see if they satisfy your organizational and regulatory requirements. If one of your monitored backups drifts from your predefined parameters, AWS Backup Audit Manager will let you know so you can take corrective action. This new feature also enables you to generate reports to share with auditors and regulators. To learn more, read How to Monitor, Evaluate, and Demonstrate Backup Compliance with AWS Backup Audit Manager.

Amazon EBS
Amazon EBS direct APIs now support creating 64 TB EBS Snapshots directly from any block storage data, including on-premises. This was increased from 16 TB to 64 TB, allowing customers to create the largest snapshots and recover them to Amazon EBS io2 Block Express Volumes. To learn more, read Amazon EBS direct API documentation.

AWS Transfer Family
AWS Transfer Family Managed Workflows is a new feature that allows you to reduce the manual tasks of preprocessing your data. Managed Workflows does a lot of the heavy lifting for you, like setting up the infrastructure to run your code upon file arrival, continuously monitoring for errors, and verifying that all the changes to the data are logged. Managed Workflows helps you handle error scenarios so that failsafe modes trigger when needed.

AWS Transfer Family Managed Workflows allows you to configure all the necessary tasks at once so that tasks can automatically run in the background. Managed Workflows is available today in the AWS Transfer Family Management Console. To learn more, read Transfer Family FAQ.

Storage Day 2021 Join us online for more!
Don’t forget to register and join us for the AWS Storage Day 2021 virtual event. The event will be live at 8:30 AM Pacific Time (11:30 AM Eastern Time) on September 2. The event will immediately re-stream for the Asia-Pacific audience with live Q&A moderators on Friday, September 3, at 8:30 AM Singapore Time. All sessions will be available on demand next week.

We look forward to seeing you there!


Summer at the Raspberry Pi Store with The Centre for Computing History

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/raspberry-pi-store-summer-2021/

A whole lot of super free hands-on activities are happening at the Raspberry Pi Store this summer.

We have teamed up with the Centre for Computing History to create an interactive learning space that’s accessible to all ages and abilities. Best of all, everything is free. It’s all happening in a big space new space we’ve borrowed a few doors down from the Raspberry Pi Store in the Grand Arcade in Cambridge, UK.

What is Raspberry Pi doing?

Everyone aged seven to 107 can get hands-on and creative with our free beginner-friendly workshops. You can make games with Scratch on Raspberry Pi, learn simple electronics for beginners, or get hands-on with the Raspberry Pi camera and Python programming.

Learners of all ages can have a go

If you don’t know anything about coding, don’t worry: there are friendly people on hand to help you learn.

The workshops take place every Monday, Wednesday and Friday until 3 September. Pre-booking is highly advisable. If the one you want is fully booked, it’s well worth dropping by if you’re in the neighbourhood, because spaces often become available at the last minute. And if you book and find you can no longer come along, please do make sure you cancel, because there will be lots of people who would love to take your space!

Book your place at one of our workshops.

Not sure what you’re doing? We can help!

What is the Centre for Computing History doing?

Come and celebrate thirty years of the World Wide Web and see how things have changed over the last three decades.

This interactive exhibition celebrates the years since Tim Berners-Lee changed the world forever by publishing the very first website at CERN in 1991. You can trace the footsteps of the early web, and have a go on some original hardware.

centre for computing history web at 30
So much retro hardware to get your hands on

Here are some of the things you can do:

  • Browse the very first website from 1991
  • Search the web with Archie, the first search engine
  • Enjoy the very first web comic
  • Order a pizza on the first transactional website
  • See the first webcam site
  • See a recreation of the trailblazing Trojan Room Coffee Cam

But I don’t live near the Raspberry Pi Store!

While we would love to have a Raspberry Pi store in every town in every country all over the world (cackles maniacally), we are sticking with just the one in our hometown for now. But we make lots of cool stuff you can access online to relieve the FOMO.

The Raspberry Pi Foundation’s livestreamed Digital Making at Home videos are all still available for young people to watch and learn along with. You can chat, code together, hear from cool people, and see amazing digital making projects from kids who love making with technology.

Taking a scroll through our FutureLearn courses

There are also more than thirty Raspberry Pi courses available for free on FutureLearn. There’s something for every type of user and level of learner, from coders looking to move from Scratch to Python programming, to people looking to start up their own CoderDojo. Plus tons of materials for teachers sharing practical resources for the classroom.

Raspberry Pi books

If you like to tinker away in your own time, there are loads of books for all abilities available from the Raspberry Pi Press online store. The Official Raspberry Pi Beginner’s Guide comes in five languages. Game designers can Code the Classics. And fashion-forward makers can create Wearable Tech Projects.

More books than the library in Beauty and the Beast

The post Summer at the Raspberry Pi Store with The Centre for Computing History appeared first on Raspberry Pi.

AWS Contact Center Day – July 2021

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-contact-center-day-july-2021/

AWS Contact Center Days

Earlier this week, I ordered from Amazon.fr a box of four toothpaste tubes, but only one was in the box. I called Amazon’s customer center. The agent immediately found my order without me having to share the long order number. She issued a refund and told me I even can keep the one tube I received, no return was needed.  As a customer, I can’t ask for better customer service.

Amazon strives to be the earth’s most customer-centric company, not only because it is the right thing to do for customers, but because over the long term, it’s good for the business. According to a Salesforce study, 80% of customers believe the experience a company provides is as important as its product or services. Over 90% of customers believe a positive customer service experience makes them more likely to make another purchase.

Just like me, you might have been delighted by Amazon customer service already. We know that you want fast, convenient support and it’s what makes you loyal.

This is why we created the AWS Contact Center Day conference. To learn from industry experts how to create your contact center of the future in a free on-demand video conference.

Amazon’s Contact Centers
Amazon’s contact centers are critical to our mission to be focused on customer experience. Our contact centers have more than 100,000 customer-service associates in 32 countries who support millions of customers in dozens of languages. Given the scale and our strict requirements for security, resiliency, flexibility, agility, or automation, we couldn’t purchase an off-the-shelf solution. We decided to build our own.

Everything that Amazon learned from our customer service organization, while looking to the future, has helped us bring to market Amazon Connect, an easy-to-use omnichannel cloud contact center that helps businesses provide superior customer service at a lower cost. As the notion of the contact center has evolved, so have the expectations of customers. The contact center of the future isn’t a collection of disparate point solutions for taking a call or a chat, and it isn’t just an application that consolidates those technologies. It’s a platform that makes it easy to integrate with your enterprise applications or system of record. The contact center of the future makes it easy to use customer data in real time to personalize and contextualize all customer experiences.

Contact Centers Best Practices
To further support business looking to improve their contact centers, Amazon designed our first contact center focused event, AWS Contact Contact Center Day, a way to share best practices, customer experience, and contact center technology, and to learn how to use Amazon Connect to accelerate the modernization of your contact centers.

The one-day conference took place on July 13, 2021 and brought together some of the most influential people invested in the future of contact centers including: Becky Ploeger, Global Head of Hilton Reservations and Customer Care, and member of the Customer Contact Week advisory board; Matt Dixon, Chief Research and Innovation Officer at Tethr, and author of multiple bestsellers, including The Challenger Sale; Customer service expert; author Shep Hyken, author of I’ll Be Back: How to Get Customers to Come Back Again and Again; Brian Solis, Global Innovation Evangelist, Salesforce, and best-selling author; and Mark Honeycutt, Director of Consumer Operations, Amazon.

At Amazon, we have gone through years of trial and error to get to where our customer experience stands today. This is why we wanted to share our experiences with you so that you can learn from our progress:

  • Customers want super-human service. You can now automate routine customer experience and agent tasks. When I call my airline for a rebooking after a delayed flight, I expect to be greeted by name. I expect the system to know my flight was delayed and to offer rebooking suggestions. This can happen automatically today, without involving a customer agent. These automatic chatbot systems are personalized per customer. They are dynamic because they answer customer questions before they ask, and they are natural because they are based on voice interactions, like conversations between humans.
  • Customers expect personalized engagement. Amazon Connect allows for fast and secure interactions with real-time voice biometric authentication. There is no need to go through a lengthy authentication questionnaire anymore. After the customer is authenticated, the customer service agent has a 360-degree view of the customer’s profile, integrating and displaying data from across the enterprise and using machine learning to provide the right information at the right moment.
  • Contact centers must evolve quickly to answer changing needs. Contact center interactions must take action based on real-time data or customer sentiment. Leaders want to experiment, learn, and improve using customer analytics and data.

Learn more
If you’re interested in learning more about contact center excellence, the entire Contact Center Day conference is now available on demand.

Check out the full agenda and watch a session now or learn more about Amazon Connect.

I am looking forward my next delightful customer experience using your contact centers.

— seb

re:Invent 2020 Liveblog: Werner Vogels Keynote

Post Syndicated from AWS News Blog Team original https://aws.amazon.com/blogs/aws/reinvent-2020-liveblog-werner-vogels-keynote/

Join us Tuesday, Dec. 15 for Dr. Werner Vogels’ Keynote as he shares how Amazon is solving today’s hardest technology problems. Jeff Barr, Martin Beeby, Steve Roberts and Channy Yun will liveblog the event, sharing all the highlights, insights and major announcements from this final keynote of re:Invent 2020.

See you here Tuesday, 7:30-10:00 AM (PST)!

New – VPC Reachability Analyzer

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-vpc-insights-analyzes-reachability-and-visibility-in-vpcs/

With Amazon Virtual Private Cloud (VPC), you can launch a logically isolated customer-specific virtual network on the AWS Cloud. As customers expand their footprint on the cloud and deploy increasingly complex network architectures, it can take longer to resolve network connectivity issues caused by misconfiguration. Today, we are happy to announce VPC Reachability Analyzer, a network diagnostics tool that troubleshoots reachability between two endpoints in a VPC, or within multiple VPCs.

Ensuring Your Network Configuration is as Intended
You have full control over your virtual network environment, including choosing your own IP address range, creating subnets, and configuring route tables and network gateways. You can also easily customize the network configuration of your VPC. For example, you can create a public subnet for a web server that has access to the Internet with Internet Gateway. Security-sensitive backend systems such as databases and application servers can be placed on private subnets that do not have internet access. You can use multiple layers of security, such as security groups and network access control list (ACL), to control access to entities of each subnet by protocol, IP address, and port number.

You can also combine multiple VPCs via VPC peering or AWS Transit Gateway for region-wide, or global network connections that can route traffic privately. You can also use VPN Gateway to connect your site with your AWS account for secure communication. Many AWS services that reside outside the VPC, such as AWS Lambda, or Amazon S3, support VPC endpoints or AWS PrivateLink as entities inside the VPC and can communicate with those privately.

When you have such rich controls and feature set, it is not unusual to have unintended configuration that could lead to connectivity issues. Today, you can use VPC Reachability Analyzer for analyzing reachability between two endpoints without sending any packets. VPC Reachability analyzer looks at the configuration of all the resources in your VPCs and uses automated reasoning to determine what network flows are feasible. It analyzes all possible paths through your network without having to send any traffic on the wire. To learn more about how these algorithms work checkout this re:Invent talk or read this paper.

How VPC Reachability Analyzer Works
Let’s see how it works. Using VPC Reachability Analyzer is very easy, and you can test it with your current VPC. If you need an isolated VPC for test purposes, you can run the AWS CloudFormation YAML template at the bottom of this article. The template creates a VPC with 1 subnet, 2 security groups and 3 instances as A, B, and C. Instance A and B can communicate with each other, but those instances cannot communicate with instance C because the security group attached to instance C does not allow any incoming traffic.

You see Reachability Analyzer in the left navigation of the VPC Management Console.

Click Reachability Analyzer, and also click Create and analyze path button, then you see new windows where you can specify a path between a source and destination, and start analysis.

You can specify any of the following endpoint types: VPN Gateways, Instances, Network Interfaces, Internet Gateways, VPC Endpoints, VPC Peering Connections, and Transit Gateways for your source and destination of communication. For example, we set instance A for source and the instance B for destination. You can choose to check for connectivity via either the TCP or UDP protocols. Optionally, you can also specify a port number, or source, or destination IP address.

Configuring test path

Finally, click the Create and analyze path button to start the analysis. The analysis can take up to several minutes depending on the size and complexity of your VPCs, but it typically takes a few seconds.

You can now see the analysis result as Reachable. If you click the URL link of analysis id nip-xxxxxxxxxxxxxxxxx, you can see the route hop by hop.

The communication from instance A to instance C is not reachable because the security group attached to instance C does not allow any incoming traffic.

If you click nip-xxxxxxxxxxxxxxxxx for more detail, you can check the Explanations for details.

Result Detail

Here we see the security group that blocked communication. When you click on the security group listed in the upper right corner, you can go directly to the security group editing window to change the security group rules. In this case adding a properly scoped ingress rule will allow the instances to communicate.

Available Today
This feature is available for all AWS commercial Regions except for China (Beijing), and China (Ningxia) regions. More information is available in our technical documentation, and remember that to use this feature your IAM permissions need to be set up as documented here.

– Kame

CloudFormation YAML template for test

Description: An AWS VPC configuration with 1 subnet, 2 security groups and 3 instances. When testing ReachabilityAnalyzer, this provides both a path found and path not found scenario.
AWSTemplateFormatVersion: 2010-09-09

      execution: ami-0915e09cc7ceee3ab
      ecs: ami-08087103f9850bddd

  # VPC
    Type: AWS::EC2::VPC
      EnableDnsSupport: true
      EnableDnsHostnames: true
      InstanceTenancy: default

  # Subnets
    Type: AWS::EC2::Subnet
      VpcId: !Ref VPC
      MapPublicIpOnLaunch: false

  # SGs
    Type: AWS::EC2::SecurityGroup
      GroupDescription: Allow all ingress and egress traffic
      VpcId: !Ref VPC
        - CidrIp:
          IpProtocol: "-1" # -1 specifies all protocols

    Type: AWS::EC2::SecurityGroup
      GroupDescription: Allow all egress traffic
      VpcId: !Ref VPC

  # Instances
  # Instance A and B should have a path between them since they are both in SecurityGroup 1
    Type: AWS::EC2::Instance
          - RegionMap
          - Ref: AWS::Region
          - execution
      InstanceType: 't3.nano'
        Ref: Subnet1
        - Ref: SecurityGroup1

  # Instance A and B should have a path between them since they are both in SecurityGroup 1
    Type: AWS::EC2::Instance
          - RegionMap
          - Ref: AWS::Region
          - execution
      InstanceType: 't3.nano'
        Ref: Subnet1
        - Ref: SecurityGroup1

  # This instance should not be reachable from Instance A or B since it is in SecurityGroup 2
    Type: AWS::EC2::Instance
          - RegionMap
          - Ref: AWS::Region
          - execution
      InstanceType: 't3.nano'
        Ref: Subnet1
        - Ref: SecurityGroup2


PennyLane on Braket + Progress Toward Fault-Tolerant Quantum Computing + Tensor Network Simulator

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/pennylane-on-braket-progress-toward-fault-tolerant-quantum-computing-tensor-network-simulator/

I first wrote about Amazon Braket last year and invited you to Get Started with Quantum Computing! Since that launch we have continued to push forward, and have added several important & powerful new features to Amazon Braket:

August 2020 – General Availability of Amazon Braket with access to quantum computing hardware from D-Wave, IonQ, and Rigetti.

September 2020 – Access to D-Wave’s Advantage Quantum Processing Unit (QPU), which includes more than 5,000 qubits and 15-way connectivity.

November 2020 – Support for resource tagging, AWS PrivateLink, and manual qubit allocation. The first two features make it easy for you to connect your existing AWS applications to the new ones that you build with Amazon Braket, and should help you to envision what a production-class cloud-based quantum computing application will look like in the future. The last feature is particularly interesting to researchers; from what I understand, certain qubits within a given piece of quantum computing hardware can have individual physical and connectivity properties that might make them perform somewhat better when used as part of a quantum circuit. You can read about Allocating Qubits on QPU Devices to learn more (this is somewhat similar to the way that a compiler allocates CPU registers to frequently used variables).

In my initial blog post I also announced the formation of the AWS Center for Quantum Computing adjacent to Caltech.

As I write this, we are in the Noisy Intermediate Scale Quantum (NISQ) era. This description captures the state of the art in quantum computers: each gate in a quantum computing circuit introduces a certain amount of accuracy-destroying noise, and the cumulative effect of this noise imposes some practical limits on the scale of the problems.

Update Time
We are working to address this challenge, as are many others in the quantum computing field. Today I would like to give you an update on what we are doing at the practical and the theoretical level.

Similar to the way that CPUs and GPUs work hand-in-hand to address large scale classical computing problems, the emerging field of hybrid quantum algorithms joins CPUs and QPUs to speed up specific calculations within a classical algorithm. This allows for shorter quantum executions that are less susceptible to the cumulative effects of noise and that run well on today’s devices.

Variational quantum algorithms are an important type of hybrid quantum algorithm. The classical code (in the CPU) iteratively adjusts the parameters of a parameterized quantum circuit, in a manner reminiscent of the way that a neural network is built by repeatedly processing batches of training data and adjusting the parameters based on the results of an objective function. The output of the objective function provides the classical code with guidance that helps to steer the process of tuning the parameters in the desired direction. Mathematically (I’m way past the edge of my comfort zone here), this is called differentiable quantum computing.

So, with this rather lengthy introduction, what are we doing?

First, we are making the PennyLane library available so that you can build hybrid quantum-classical algorithms and run them on Amazon Braket. This library lets you “follow the gradient” and write code to address problems in computational chemistry (by way of the included Q-Chem library), machine learning, and optimization. My AWS colleagues have been working with the PennyLane team to create an integrated experience when PennyLane is used together with Amazon Braket.

PennyLane is pre-installed in Braket notebooks and you can also install the Braket-PennyLane plugin in your IDE. Once you do this, you can train quantum circuits as you would train neural networks, while also making use of familiar machine learning libraries such as PyTorch and TensorFlow. When you use PennyLane on the managed simulators that are included in Amazon Braket, you can train your circuits up to 10 times faster by using parallel circuit execution.

Second, the AWS Center for Quantum Computing is working to address the noise issue in two different ways: we are investigating ways to make the gates themselves more accurate, while also working on the development of more efficient ways to encode information redundantly across multiple qubits. Our new paper, Building a Fault-Tolerant Quantum Computer Using Concatenated Cat Codes speaks to both of these efforts. While not light reading, the 100+ page paper proposes the construction of a 2-D grid of micron-scale electro-acoustic qubits that are coupled via superconducting circuits:

Interestingly, this proposed qubit design was used to model a Toffoli gate, and then tested via simulations that ran for 170 hours on c5.18xlarge instances. In a very real sense, the classical computers are being used to design and then simulate their future quantum companions.

The proposed hybrid electro-acoustic qubits are far smaller than what is available today, and also offer a > 10x reduction in overhead (measured in the number of physical qubits required per error-corrected qubit and the associated control lines). In addition to working on the experimental development of this architecture based around hybrid electro-acoustic qubits, the AWS CQC team will also continue to explore other promising alternatives for fault-tolerant quantum computing to bring new, more powerful computing resources to the world.

And Third, we are expanding the choice of managed simulators that are available on Amazon Braket. In addition to the state vector simulator (which can simulate up to 34 qubits), you can use the new tensor network simulator that can simulate up to 50 qubits for certain circuits. This simulator builds a graph representation of the quantum circuit and uses the graph to find an optimized way to process it.

Help Wanted
If you are ready to help us to push the state of the art in quantum computing, take a look at our open positions. We are looking for Quantum Research Scientists, Software Developers, Hardware Developers, and Solutions Architects.

Time to Learn
It is still Day One (as we often say at Amazon) when it comes to quantum computing and now is the time to learn more and to get some experience with. Be sure to check out the Braket Tutorials repository and let me know what you think.


PS – If you are ready to start exploring ways that you can put quantum computing to work in your organization, be sure to take a look at the Amazon Quantum Solutions Lab.

Amazon SageMaker JumpStart Simplifies Access to Pre-built Models and Machine Learning Solutions

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-jumpstart-simplifies-access-to-prebuilt-models-and-machine-learning-models/

Today, I’m extremely happy to announce the availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that accelerates your machine learning workflows with one-click access to popular model collections (also known as “model zoos”), and to end-to-end solutions that solve common use cases.

In recent years, machine learning (ML) has proven to be a valuable technique in improving and automating business processes. Indeed, models trained on historical data can accurately predict outcomes across a wide range of industry segments: financial services, retail, manufacturing, telecom, life sciences, and so on. Yet, working with these models requires skills and experience that only a subset of scientists and developers have: preparing a dataset, selecting an algorithm, training a model, optimizing its accuracy, deploying it in production, and monitoring its performance over time.

In order to simplify the model building process, the ML community has created model zoos, that is to say, collections of models built with popular open source libraries, and often pretrained on reference datasets. For example, the TensorFlow Hub and the PyTorch Hub provide developers with a long list of models ready to be downloaded, and integrated in applications for computer vision, natural language processing, and more.

Still, downloading a model is just part of the answer. Developers then need to deploy it for evaluation and testing, using either a variety of tools, such as the TensorFlow Serving and TorchServe model servers, or their own bespoke code. Once the model is running, developers need to figure out the correct format that incoming data should have, a long-lasting pain point. I’m sure I’m not the only one regularly pulling my hair out here!

Of course, a full-ML application usually has a lot of moving parts. Data needs to be preprocessed, enriched with additional data fetched from a backend, and funneled into the model. Predictions are often postprocessed, and stored for further analysis and visualization. As useful as they are, model zoos only help with the modeling part. Developers still have lots of extra work to deliver a complete ML solution.

Because of all this, ML experts are flooded with a long backlog of projects waiting to start. Meanwhile, less experienced practitioners struggle to get started. These barriers are incredibly frustrating, and our customers asked us to remove them.

Introducing Amazon SageMaker JumpStart
Amazon SageMaker JumpStart is integrated in Amazon SageMaker Studio, our fully integrated development environment (IDE) for ML, making it intuitive to discover models, solutions, and more. At launch, SageMaker JumpStart includes:

  • 15+ end-to-end solutions for common ML use cases such as fraud detection, predictive maintenance, and so on.
  • 150+ models from the TensorFlow Hub and the PyTorch Hub, for computer vision (image classification, object detection), and natural language processing (sentence classification, question answering).
  • Sample notebooks for the built-in algorithms available in Amazon SageMaker.

SageMaker JumpStart also provides notebooks, blogs, and video tutorials designed to help you learn and remove roadblocks. Content is easily accessible within Amazon SageMaker Studio, enabling you to get started with ML faster.

It only takes a single click to deploy solutions and models. All infrastructure is fully managed, so all you have to do is enjoy a nice cup of tea or coffee while deployment takes place. After a few minutes, you can start testing, thanks to notebooks and sample prediction code that are readily available in Amazon SageMaker Studio. Of course, you can easily modify them to use your own data.

SageMaker JumpStart makes it extremely easy for experienced practitioners and beginners alike to quickly deploy and evaluate models and solutions, saving days or even weeks of work. By drastically shortening the path from experimentation to production, SageMaker JumpStart accelerates ML-powered innovation, particularly for organizations and teams that are early on their ML journey, and haven’t yet accumulated a lot of skills and experience.

Now, let me show you how SageMaker JumpStart works.

Deploying a Solution with Amazon SageMaker JumpStart
Opening SageMaker Studio, I select the “JumpStart” icon on the left. This opens a new tab showing me all available content (solutions, models, and so on).

Let’s say that I’m interested in using computer vision to detect defects in manufactured products. Could ML be the answer?

Browsing the list of available solutions, I see one for product defect detection.

Opening it, I can learn more about the type of problems that it solves, the sample dataset used in the demo, the AWS services involved, and more.

SageMaker screenshot

A single click is all it takes to deploy this solution. Under the hood, AWS CloudFormation uses a built-in template to provision all appropriate AWS resources.

A few minutes later, the solution is deployed, and I can open its notebook.

SageMaker screenshot

The notebook opens immediately in SageMaker Studio. I run the demo, and understand how ML can help me detect product defects. This is also a nice starting point for my own project, making it easy to experiment with my own dataset (feel free to click on the image below to zoom in).

SageMaker screenshot

Once I’m done with this solution, I can delete all its resources in one click, letting AWS CloudFormation clean up without having to worry about leaving idle AWS resources behind.

SageMaker screenshot

Now, let’s look at models.

Deploying a Model with Amazon SageMaker JumpStart
SageMaker JumpStart includes a large collection of models available in the TensorFlow Hub and the PyTorch Hub. These models are pre-trained on reference datasets, and you can use them directly to handle a wide range of computer vision and natural language processing tasks. You can also fine-tune them on your own datasets for greater accuracy, a technique called transfer learning.

SageMaker screenshot
Here, I pick a version of the BERT model trained on question answering. I can either deploy it as is, or fine-tune it. For the sake of brevity, I go with the former here, and I just click on the “Deploy” button.

SageMaker screenshot

A few minutes later, the model has been deployed to a real-time endpoint powered by fully managed infrastructure.

SageMaker screenshot

Time to test it! Clicking on “Open Notebook” launches a sample notebook that I run right away to test the model, without having to change a line of code (again, feel free to click on the image below to zoom in). Here, I’m asking two questions (“What is Southern California often abbreviated as?” and “Who directed Spectre?“), passing some context containing the answer. In both cases, the BERT model gives the correct answer, respectively “socal” and “Sam Mendes“.

SageMaker screenshot

When I’m done testing, I can delete the endpoint in one click, and stop paying for it.

Getting Started
As you can see, it’s extremely easy to deploy models and solutions with SageMaker JumpStart in minutes, even if you have little or no ML skills.

You can start using this capability today in all regions where SageMaker Studio is available, at no additional cost.

Give it a try and let us know what you think.

As always, we’re looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleague Jared Heywood for his precious help during early testing.

New – Amazon SageMaker Pipelines Brings DevOps Capabilities to your Machine Learning Projects

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-pipelines-brings-devops-to-machine-learning-projects/

Today, I’m extremely happy to announce Amazon SageMaker Pipelines, a new capability of Amazon SageMaker that makes it easy for data scientists and engineers to build, automate, and scale end to end machine learning pipelines.

Machine learning (ML) is intrinsically experimental and unpredictable in nature. You spend days or weeks exploring and processing data in many different ways, trying to crack the geode open to reveal its precious gemstones. Then, you experiment with different algorithms and parameters, training and optimizing lots of models in search of highest accuracy. This process typically involves lots of different steps with dependencies between them, and managing it manually can become quite complex. In particular, tracking model lineage can be difficult, hampering auditability and governance. Finally, you deploy your top models, and you evaluate them against your reference test sets. Finally? Not quite, as you’ll certainly iterate again and again, either to try out new ideas, or simply to periodically retrain your models on new data.

No matter how exciting ML is, it does unfortunately involve a lot of repetitive work. Even small projects will require hundreds of steps before they get the green light for production. Over time, not only does this work detract from the fun and excitement of your projects, it also creates ample room for oversight and human error.

To alleviate manual work and improve traceability, many ML teams have adopted the DevOps philosophy and implemented tools and processes for Continuous Integration and Continuous Delivery (CI/CD). Although this is certainly a step in the right direction, writing your own tools often leads to complex projects that require more software engineering and infrastructure work than you initially anticipated. Valuable time and resources are diverted from the actual ML project, and innovation slows down. Sadly, some teams decide to revert to manual work, for model management, approval, and deployment.

Introducing Amazon SageMaker Pipelines
Simply put, Amazon SageMaker Pipelines brings in best-in-class DevOps practices to your ML projects. This new capability makes it easy for data scientists and ML developers to create automated and reliable end-to-end ML pipelines. As usual with SageMaker, all infrastructure is fully managed, and doesn’t require any work on your side.

Care.com is the world’s leading platform for finding and managing high-quality family care. Here’s what Clemens Tummeltshammer, Data Science Manager, Care.com, told us: “A strong care industry where supply matches demand is essential for economic growth from the individual family up to the nation’s GDP. We’re excited about Amazon SageMaker Feature Store and Amazon SageMaker Pipelines, as we believe they will help us scale better across our data science and development teams, by using a consistent set of curated data that we can use to build scalable end-to-end machine learning (ML) model pipelines from data preparation to deployment. With the newly announced capabilities of Amazon SageMaker, we can accelerate development and deployment of our ML models for different applications, helping our customers make better informed decisions through faster real-time recommendations.

Let me tell you more about the main components in Amazon SageMaker Pipelines: pipelines, model registry, and MLOps templates.

Pipelines – Model building pipelines are defined with a simple Python SDK. They can include any operation available in Amazon SageMaker, such as data preparation with Amazon SageMaker Processing or Amazon SageMaker Data Wrangler, model training, model deployment to a real-time endpoint, or batch transform. You can also add Amazon SageMaker Clarify to your pipelines, in order to detect bias prior to training, or once the model has been deployed. Likewise, you can add Amazon SageMaker Model Monitor to detect data and prediction quality issues.

Once launched, model building pipelines are executed as CI/CD pipelines. Every step is recorded, and detailed logging information is available for traceability and debugging purposes. Of course, you can also visualize pipelines in Amazon SageMaker Studio, and track their different executions in real time.

Model Registry – The model registry lets you track and catalog your models. In SageMaker Studio, you can easily view model history, list and compare versions, and track metadata such as model evaluation metrics. You can also define which versions may or may not be deployed in production. In fact, you can even build pipelines that automatically trigger model deployment once approval has been given. You’ll find that the model registry is very useful in tracing model lineage, improving model governance, and strengthening your compliance posture.

MLOps TemplatesSageMaker Pipelines includes a collection of built-in CI/CD templates for popular pipelines (build/train/deploy, deploy only, and so on). You can also add and publish your own templates, so that your teams can easily discover them and deploy them. Not only do templates save lots of time, they also make it easy for ML teams to collaborate from experimentation to deployment, using standard processes and without having to manage any infrastructure. Templates also let Ops teams customize steps as needed, and give them full visibility for troubleshooting.

Now, let’s do a quick demo!

Building an End-to-end Pipeline with Amazon SageMaker Pipelines
Opening SageMaker Studio, I select the “Components” tab and the “Projects” view. This displays a list of built-in project templates. I pick one to build, train, and deploy a model.

SageMaker screenshot

Then, I simply give my project a name, and create it.

A few seconds later, the project is ready. I can see that it includes two Git repositories hosted in AWS CodeCommit, one for model training, and one for model deployment.

SageMaker screenshot

The first repository provides scaffolding code to create a multi-step model building pipeline: data processing, model training, model evaluation, and conditional model registration based on accuracy. As you’ll see in the pipeline.py file, this pipeline trains a linear regression model using the XGBoost algorithm on the well-known Abalone dataset. This repository also includes a build specification file, used by AWS CodePipeline and AWS CodeBuild to execute the pipeline automatically.

Likewise, the second repository contains code and configuration files for model deployment, as well as test scripts required to pass the quality gate. This operation is also based on AWS CodePipeline and AWS CodeBuild, which run a AWS CloudFormation template to create model endpoints for staging and production.

Clicking on the two blue links, I clone the repositories locally. This triggers the first execution of the pipeline.

SageMaker screenshot

A few minutes later, the pipeline has run successfully. Switching to the “Pipelines” view, I can visualize its steps.

SageMaker screenshot

Clicking on the training step, I can see the Root Mean Square Error (RMSE) metrics for my model.

SageMaker screenshot

As the RMSE is lower than the threshold defined in the conditional step, my model is added to the model registry, as visible below.

SageMaker screenshot

For simplicity, the registration step sets the model status to “Approved”, which automatically triggers its deployment to a real-time endpoint in the same account. Within seconds, I see that the model is being deployed.

SageMaker screenshot

Alternatively, you could register your model with a “Pending manual approval” status. This will block deployment until the model has been reviewed and approved manually. As the model registry supports cross-account deployment, you could also easily deploy in a different account, without having to copy anything across accounts.

A few minutes later, the endpoint is up, and I could use it to test my model.

SageMaker screenshot

Once I’ve made sure that this model works as expected, I could ping the MLOps team, and ask them to deploy the model in production.

Putting my MLOps hat on, I open the AWS CodePipeline console, and I see that my deployment is indeed waiting for approval.

SageMaker screenshot

I then approve the model for deployment, which triggers the final stage of the pipeline.

SageMaker screenshot

Reverting to my Data Scientist hat, I see in SageMaker Studio that my model is being deployed. Job done!

SageMaker screenshot

Getting Started
As you can see, Amazon SageMaker Pipelines makes it really easy for Data Science and MLOps teams to collaborate using familiar tools. They can create and execute robust, automated ML pipelines that deliver high quality models in production quicker than before.

You can start using SageMaker Pipelines in all commercial regions where SageMaker is available. The MLOps capabilities are available in the regions where CodePipeline is also available.

Sample notebooks are available to get you started. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleague Urvashi Chowdhary for her precious assistance during early testing.

Amazon HealthLake Stores, Transforms, and Analyzes Health Data in the Cloud

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-amazon-healthlake-to-store-transform-and-analyze-petabytes-of-health-and-life-sciences-data-in-the-cloud/

Healthcare organizations collect vast amounts of patient information every day, from family history and clinical observations to diagnoses and medications. They use all this data to try to compile a complete picture of a patient’s health information in order to provide better healthcare services. Currently, this data is distributed across various systems (electronic medical records, laboratory systems, medical image repositories, etc.) and exists in dozens of incompatible formats.

Emerging standards, such as Fast Healthcare Interoperability Resources (FHIR), aim to address this challenge by providing a consistent format for describing and exchanging structured data across these systems. However, much of this data is unstructured information contained in medical records (e.g., clinical records), documents (e.g., PDF lab reports), forms (e.g., insurance claims), images (e.g., X-rays, MRIs), audio (e.g., recorded conversations), and time series data (e.g., heart electrocardiogram) and it is challenging to extract this information.

It can take weeks or months for a healthcare organization to collect all this data and prepare it for transformation (tagging and indexing), structuring, and analysis. Furthermore, the cost and operational complexity of doing all this work is prohibitive for most healthcare organizations.

Many data to analyze

Today, we are happy to announce Amazon HealthLake, a fully managed, HIPAA-eligible service, now in preview, that allows healthcare and life sciences customers to aggregate their health information from different silos and formats into a centralized AWS data lake. HealthLake uses machine learning (ML) models to normalize health data and automatically understand and extract meaningful medical information from the data so all this information can be easily searched. Then, customers can query and analyze the data to understand relationships, identify trends, and make predictions.

How It Works
Amazon HealthLake supports copying your data from on premises to the AWS Cloud, where you can store your structured data (like lab results) as well as unstructured data (like clinical notes), which HealthLake will tag and structure in FHIR. All the data is fully indexed using standard medical terms so you can quickly and easily query, search, analyze, and update all of your customers’ health information.

Overview of HealthLake

With HealthLake, healthcare organizations can collect and transform patient health information in minutes and have a complete view of a patients medical history, structured in the FHIR industry standard format with powerful search and query capabilities.

From the AWS Management Console, healthcare organizations can use the HealthLake API to copy their on-premises healthcare data to a secure data lake in AWS with just a few clicks. If your source system is not configured to send data in FHIR format, you can use a list of AWS partners to easily connect and convert your legacy healthcare data format to FHIR.

HealthLake is Powered by Machine Learning
HealthLake uses specialized ML models such as natural language processing (NLP) to automatically transform raw data. These models are trained to understand and extract meaningful information from unstructured health data.

For example, HealthLake can accurately identify patient information from medical histories, physician notes, and medical imaging reports. It then provides the ability to tag, index, and structure the transformed data to make it searchable by standard terms such as medical condition, diagnosis, medication, and treatment.

Queries on tens of thousands of patient records are very simple. For example, a healthcare organization can create a list of diabetic patients based on similarity of medications by selecting “diabetes” from the standard list of medical conditions, selecting “oral medications” from the treatment menu, and refining the gender and search.

Healthcare organizations can use Juypter Notebook templates in Amazon SageMaker to quickly and easily run analysis on the normalized data for common tasks like diagnosis predictions, hospital re-admittance probability, and operating room utilization forecasts. These models can, for example, help healthcare organizations predict the onset of disease. With just a few clicks in a pre-built notebook, healthcare organizations can apply ML to their historical data and predict when a diabetic patient will develop hypertension in the next five years. Operators can also build, train, and deploy their own ML models on data using Amazon SageMaker directly from the AWS management console.

Let’s Create Your Own Data Store and Start to Test
Starting to use HealthLake is simple. You access AWS Management Console, and click select Create a datastore.

If you click Preload data, HealthLake will load test data and you can start to test its features. You can also upload your own data if you already have FHIR 4 compliant data. You upload it to S3 buckets, and import it to set its bucket name.

Once your Data Store is created, you can perform a Search, Create, Read, Update or Delete FHIR Query Operation. For example, if you need a list of every patient located in New York, your query setting looks like the screenshots below. As per the FHIR specification, deleted data is only hidden from analysis and results; it is not deleted from the service, only versioned.

Creating Query


You can choose Add search parameter for more nested conditions of the query as shown below.

Amazon HealthLake is Now in Preview
Amazon HealthLake is in preview starting today in US East (N. Virginia). Please check our web site and technical documentation for more information.

– Kame

Amazon SageMaker Edge Manager Simplifies Operating Machine Learning Models on Edge Devices

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-edge-manager-simplifies-operating-machine-learning-models-on-edge-devices/

Today, I’m extremely happy to announce Amazon SageMaker Edge Manager, a new capability of Amazon SageMaker that makes it easier to optimize, secure, monitor, and maintain machine learning models on a fleet of edge devices.

Edge computing is certainly one of the most exciting developments in information technology. Indeed, thanks to continued advances in compute, storage, networking, and battery technology, organizations routinely deploy large numbers of embedded devices anywhere on the planet for a wide range of industry applications: manufacturing, energy, agriculture, healthcare, and more. Ranging from simple sensors to large industrial machines, the devices have a common purpose: capture data, analyze it, and act on it, for example send an alert if an unwanted condition is detected.

As machine learning (ML) demonstrated its ability to solve a wide range of business problems, customers tried to apply it to edge applications, training models in the cloud and deploying them at the edge in an effort to extract deeper insights from local data. However, given the remote and constrained nature of edge devices, deploying and managing models at the edge is often quite difficult.

For example, a complex model can be too large to fit, forcing customers to settle for a smaller and less accurate model. Also, predicting with several models on the same device (say, to detect different types of anomalies) may require additional code to load and unload models on demand, in order to conserve hardware resources. Finally, monitoring prediction quality is a major concern, as the real world will always be more complex and unpredictable than any training set can anticipate.

Customers asked us to help them solve these challenges, and we got to work.

Announcing Amazon SageMaker Edge Manager
Amazon SageMaker Edge Manager makes it easy for ML edge developers to use the same familiar tools in the cloud or on edge devices. It reduces the time and effort required to get models to production, while continuously monitoring and improving model quality across your device fleet.

Starting from a model that you trained or imported in Amazon SageMaker, SageMaker Edge Manager first optimizes it for your hardware platform using Amazon SageMaker Neo. Launched two years ago, Neo converts models into an efficient common format which is executed on the device by a low footprint runtime. Neo currently supports devices based on chips manufactured by Ambarella, ARM, Intel, NVIDIA, NXP, Qualcomm, TI, and Xilinx.

Then, SageMaker Edge Manager packages the model, and stores it in Amazon Simple Storage Service (S3), where it can be deployed to your devices. In fact, you can deploy multiple models, loading and predicting with a runtime optimized for your hardware of choice.

On-device models are managed by the SageMaker Edge Manager Manager Agent, which communicates with the AWS Cloud for model deployment, and with your application for model management. Indeed, you can integrate this agent with your application, so that it may automatically load and unload models according to your prediction requests. This enables a variety of scenarios, such as freeing all resources for a large model whenever needed, or working with a collection of smaller models that cohabit in memory.

Lenovo, the #1 global PC maker, recently incorporated Amazon SageMaker into its latest predictive maintenance offering. Igor Bergman, Lenovo Vice President, Cloud & Software of PCs and Smart Devices, told us: “At Lenovo, we’re more than a hardware provider and are committed to being a trusted partner in transforming customers’ device experience and delivering on their business goals. Lenovo Device Intelligence is a great example of how we’re doing this with the power of machine learning, enhanced by Amazon SageMaker. With Lenovo Device Intelligence, IT administrators can proactively diagnose PC issues and help predict potential system failures before they occur, helping to decrease downtime and increase employee productivity. By incorporating Amazon SageMaker Neo, we’ve already seen a substantial improvement in the execution of our on-device predictive models – an encouraging sign for the new Amazon SageMaker Edge Manager that will be added in the coming weeks. SageMaker Edge Manager will help eliminate the manual effort required to optimize, monitor, and continuously improve the models after deployment. With it, we expect our models will run faster and consume less memory than with other comparable machine learning platforms. As we extend AI to new applications across the Lenovo services portfolio, we will continue to require a high-performance pipeline that is flexible and scalable both in the cloud and on millions of edge devices. That’s why we selected the Amazon SageMaker platform. With its rich edge-to-cloud and CI/CD workflow capabilities, we can effectively bring our machine learning models to any device workflow for much higher productivity.

Getting Started
As you can see, SageMaker Edge Manager makes it easier to work with ML models deployed on edge devices. It’s available today in the US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), Europe (Frankfurt), and Asia Pacific (Tokyo) regions.

Sample notebooks are available to get you started right away. Give them a try, and let us know what you think.

We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

New – Amazon SageMaker Clarify Detects Bias and Increases the Transparency of Machine Learning Models

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-amazon-sagemaker-clarify-detects-bias-and-increases-the-transparency-of-machine-learning-models/

Today, I’m extremely happy to announce Amazon SageMaker Clarify, a new capability of Amazon SageMaker that helps customers detect bias in machine learning (ML) models, and increase transparency by helping explain model behavior to stakeholders and customers.

As ML models are built by training algorithms that learn statistical patterns present in datasets, several questions immediately come to mind. First, can we ever hope to explain why our ML model comes up with a particular prediction? Second, what if our dataset doesn’t faithfully describe the real-life problem we were trying to model? Could we even detect such issues? Would they introduce some sort of bias in imperceptible ways? As we will see, these are not speculative questions at all. They are very real, and their implications can be far-reaching.

Let’s start with the bias problem. Imagine that you’re working on a model detecting fraudulent credit card transactions. Fortunately, the huge majority of transactions are legitimate, and they make up 99.9% of your dataset, meaning that you only have 0.1% fraudulent transactions, say 100 out of 100,000. Training a binary classification model (legitimate vs. fraudulent), there’s a strong chance that it would be strongly influenced or biased by the majority group. In fact, a trivial model could simply decide that transactions are always legitimate: as useless as this model would be, it would still be right 99.9% of the time! This simple example shows how careful we have to be about the statistical properties of our data, and about the metrics that we use to measure model accuracy.

There are many variants of this under-representation problem. As the number of classes, features, and unique feature values increase, your dataset may only contain a tiny number of training instances for certain groups. In fact, some of these groups may correspond to various socially sensitive features such as gender, age range, or nationality. Under-representation for such groups could result in a disproportionate impact on their predicted outcomes.

Unfortunately, even with the best of intentions, bias issues may exist in datasets and be introduced into models with business, ethical, and regulatory consequences. It is thus important for model administrators to be aware of potential sources of bias in production systems.

Now, let’s discuss the explainability problem. For simple and well-understood algorithms like linear regression or tree-based algorithms, it’s reasonably easy to crack the model open, inspect the parameters that it learned during training, and figure out which features it predominantly uses. You can then decide whether this process is consistent with your business practices, basically saying: “yes, this is how a human expert would have done it.”

However, as models become more and more complex (I’m staring at you, deep learning), this kind of analysis becomes impossible. Just like the prehistoric tribes in Stanley Kubrick’s “2001: A Space Odyssey,” we’re often left staring at an impenetrable monolith and wondering what it all means. Many companies and organizations may need ML models to be explainable before they can be used in production. In addition, some regulations may require explainability when ML models are used as part of consequential decision making, and closing the loop, explainability can also help detect bias.

Thus, our customers asked us for help on detecting bias in their datasets and their models, and on understanding how their models make predictions. We got to work, and came up with SageMaker Clarify.

Introducing Amazon SageMaker Clarify
SageMaker Clarify is a new set of capabilities for Amazon SageMaker, our fully managed ML service. It’s integrated with SageMaker Studio, our web-based integrated development environment for ML, as well as with other SageMaker capabilities like Amazon SageMaker Data Wrangler, Amazon SageMaker Experiments, and Amazon SageMaker Model Monitor.

Thanks to SageMaker Clarify, data scientists are able to:

  • Detect bias in datasets prior to training, and in models after training.
  • Measure bias using a variety of statistical metrics.
  • Explain how feature values contribute to the predicted outcome, both for the model overall and for individual predictions.
  • Detect bias drift and feature importance drift over time, thanks to the integration with Amazon SageMaker Model Monitor.

Let’s look at each of these capabilities.

Detecting dataset bias: This is an important first step. Indeed, a heavily biased dataset may well be unsuitable for training. Knowing this early on certainly saves you time, money, and frustration! Looking at bias metrics computed by SageMaker Clarify on your dataset, you can then add your own bias reduction techniques to your data processing pipeline. Once the dataset has been revised and processed, you can measure bias again, and check if it has actually decreased.

Detecting model bias: After you’ve trained your model, you can run a SageMaker Clarify bias analysis, which includes automatic deployment to a temporary endpoint, and computation of bias metrics using your model and dataset. By computing these metrics, you can figure out if your trained model has similar predictive behavior across groups.

Measuring bias: SageMaker Clarify lets you pick from many different bias metrics. I’ll just give you a few examples here.

  • Difference in positive proportions in labels (DPL): Are labels in the dataset correlated or not with specific sensitive feature values? For example, do people living in a certain city have a better chance of getting a positive answer?
  • Difference in positive proportions in predicted labels (DPPL): Do we overpredict positive labels for a certain group?
  • Accuracy difference (AD): Are the predictions by the model more accurate for one group than the other?
  • Counterfactuals – Fliptest (FT): Suppose we look at each member of one group, and compare with similar members from the other group. Do they get different model predictions?

Explaining predictions – to explain how your model predicts, SageMaker Clarify supports a popular technique called SHapley Additive exPlanations (SHAP). Originating in game theory, SHAP analyzes for each data instance the individual contribution of feature values to the predicted output, and represents them as a positive or negative value. For example, predicting with a credit application model, you could see that Alice’s application is approved with a score of 87.5%, that her employment status (+27.2%) and her credit score (+32.4%) are the strongest contributors to this score, and that her income level has a slight negative impact (-5%). Such insights are crucial in building trust that the model is working as expected, and in explaining to customers and regulators why it comes up with a particular prediction. Further analysis of the SHAP values for your complete dataset can also help identify the relative importance of features and feature values, potentially leading to the discovery of prediction issues and biases.

As you can see, SageMaker Clarify has some pretty powerful features for bias detection and explainability. Fortunately, it also makes them very easy to use. First, you should upload a clean and pre-processed copy of your tabular dataset (CSV or JSON) to Amazon Simple Storage Service (S3). Then, using a built-in container, you just launch an Amazon SageMaker Processing job on your dataset, passing a short configuration file defining the name of the target attribute, the name and values of the sensitive columns to analyze for bias, and the bias metrics that you want to compute. As you would expect, this job runs on fully managed infrastructure. For post-training analysis, a temporary endpoint is also automatically created and deleted by the job. Once the job is complete, results are available in S3 and in SageMaker Studio, and include an auto-generated report that summarizes the results.

Now, I’d like to show you how to get started with SageMaker Clarify.

Exploring Datasets and Models with Amazon SageMaker Clarify
The German Credit Data dataset contains 1,000 labeled credit applications, which I’ve used to train a binary classification model with XGBoost. Each data instance has 20 features, such as credit purpose, credit amount, housing status, employment history, and more. Categorical features have been encoded with Axx values. For example, here’s how the credit history feature is encoded: A30 means ‘no credits taken’, A31 means ‘all credits at this bank paid back duly’, and so on.

In particular, the dataset includes a feature telling us if a customer is a foreign worker. In fact, a quick look at the dataset hints at a large imbalance in favor of foreign workers. Could bias be hiding there? What about the model? Did XGBoost increase or decrease the bias? Which features contribute most to the predicted output? Let’s find out.

After training the model, my next step is to run a SageMaker Clarify bias analysis job on the dataset, using a built-in container image that will compute bias metrics. The job inputs are the dataset, and a JSON configuration file that defines:

  • The name of the target attribute (Class1Good2Bad), and the value for the positive answer (1).
  • The sensitive features to analyze (called “facets”), and their value. Here, we want to focus on instances where ForeignWorker is set to 0, as they seem to be under-represented in the dataset.
  • The bias metrics that the job should compute. As I already have a model, I pass its name so that post-training metrics can be computed on a temporary endpoint.

Here’s the relevant snippet in the configuration file:

"label": "Class1Good2Bad",
"label_values_or_threshold": [1],
"facet": [
         "name_or_index" : "ForeignWorker",
         "value_or_threshold": [0]
. . .
"methods": {
    "pre_training_bias": {"methods": "all"}
    "post_training_bias": {"methods": "all"}
"predictor": {
        "model_name": "xgboost-german-model",
        "instance_type": "ml.m5.xlarge",
        "initial_instance_count": 1

Then, I configure the job inputs (the dataset and the configuration file) and output (the report), passing all appropriate paths in S3:

config_input = ProcessingInput(
data_input = ProcessingInput(
result_output = ProcessingOutput(

Finally, I run the processing job.

from sagemaker.processing import Processor, ProcessingInput, ProcessingJob, ProcessingOutput
analyzer_image_uri = f'678264136642.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xai-analyzer:latest'
analyzer = Processor(base_job_name='analyzer',
analyzer.run(inputs=[ data_input, config_input], outputs=[result_output])

Once the processing job is complete, I can retrieve the report. Let’s look at bias metrics.

Detecting Bias with Amazon SageMaker Clarify
Here are some of the pre-training bias metrics:

"ForeignWorker": [
     "value_or_threshold": "0",
     "metrics": [
       "name": "CI",
       "description": "Class Imbalance (CI)",
       "value": 0.9225
       "name": "DPL",
       "description": "Difference in Positive Proportions in Labels (DPL)",
       "value": -0.21401904442300435
. . . 

The class imbalance metric confirms our visual impression. The dataset has about 92% more foreign workers than it has domestic workers to assess. Whether this imbalance is responsible or not, we can also see that the difference in positive proportion for domestic workers is quite negative. In other words, there’s a smaller proportion of domestic workers with positive labels. This statistical pattern could be picked up by an ML algorithm, leading to a larger proportion of domestic workers getting negative answers. Figuring out whether this is actually legitimate or not would require further analysis, and in any case, it’s great that SageMaker Clarify warned us about this potential issue.

As I provided a trained model, post-training metrics are also available. Comparing the DPPL and the DPL, I can see that XGBoost has slightly reduced bias on positive proportions (-18.8% vs -21.4%). We also see that DAR is negative, indicating that the model achieves higher precision for domestic workers compared to foreign workers.

"ForeignWorker": [
     "value_or_threshold": "0",
     "metrics": [
       "name": "DPPL",
       "description": "\"Difference in Positive Proportions in Predicted Labels (DPPL)\")",
       "value": -0.18801124208230213
       "name": "DAR",
       "description": "Difference in Acceptance Rates (DAR)",
       "value": -0.050909090909090904
       "name": "DRR",
       "description": "Difference in Rejection Rates (DRR)",
       "value": 0.0365296803652968
. . .

As SageMaker Clarify is integrated with SageMaker Studio, I can visualize bias metrics there. All I have to do is find the processing job in the list of trials, right-click “Open in trial details”, and select the “Bias report” view.

SageMaker screenshot

Finally, deciding whether high value of a certain bias metric is problematic involves domain-specific considerations. This needs to be guided by ethical, social, regulatory, and business considerations. Similarly, interventions for removing bias may often need a careful analysis of the entire ML lifecycle, from problem formulation to feedback loops in deployment.

Now, let’s see how SageMaker Clarify helps us understand what features the models base their predictions on.

Explaining Predictions with Amazon SageMaker Clarify
The report includes global SHAP values, showing the relative importance of all the features in the dataset. On the feature importance graph available in SageMaker Studio, I see that the three most important features are credit duration, not having a checking account (A14), and the loan amount. All things being equal, the bank probably sees you as a safer customer if you’re borrowing a small amount over a short period of time, and without the possibility to write checks!

SageMaker screenshot

In S3, I can also find a CSV file with SHAP values for individual data instances, giving me a complete picture of feature and feature value importance.

Getting Started
As you can see, SageMaker Clarify is a powerful tool to detect bias and to understand how your model works. You can start using it today in all regions where Amazon SageMaker is available, at no additional cost.

Sample notebooks are available to get you started quickly. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleagues Sanjiv Das, Michele Donini, Jason Gelman, Krishnaram Kenthapadi, Pinar Yilmaz, and Bilal Zafar for their precious help.

Dataset credits: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

New – Managed Data Parallelism in Amazon SageMaker Simplifies Training on Large Datasets

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/managed-data-parallelism-in-amazon-sagemaker-simplifies-training-on-large-datasets/

Today, I’m particularly happy to announce that Amazon SageMaker now supports a new data parallelism library that makes it easier to train models on datasets that may be as large as hundreds or thousands of gigabytes.

As data sets and models grow larger and more sophisticated, machine learning (ML) practitioners working on large distributed training jobs have to face increasingly long training times, even when using powerful instances such as the Amazon Elastic Compute Cloud (EC2) p3 and p4 instances. For example, using a ml.p3dn.24xlarge instance equipped with 8 NVIDIA V100 GPUs, it takes over 6 hours to train advanced object detection models such as Mask RCNN and Faster RCNN on the publicly available COCO dataset. Likewise, training BERT, a state of the art natural language processing model, takes over 100 hours on the same instance. Some of our customers, such as autonomous vehicle companies, routinely deal with even larger training jobs that run for days on large GPU clusters.

As you can imagine, these long training times are a severe bottleneck for ML projects, hurting productivity and slowing down innovation. Customers asked us for help, and we got to work.

Introducing Data Parallelism in Amazon SageMaker
Amazon SageMaker now helps ML teams reduce distributed training time and cost, thanks to the SageMaker Data Parallelism (SDP) library. Available for TensorFlow and PyTorch, SDP implements a more efficient distribution of computation, optimizes network communication, and fully utilizes our fastest p3 and p4 GPU instances.

Up to 90% of GPU resources can now be used for training, not for data transfer. Distributed training jobs can achieve up near-liner scaling efficiency, regardless of the number of GPUs involved. In other words, if a training job runs for 8 hours on a single instance, it will only take approximately 1 hour on 8 instances, with minimal cost increase. SageMaker effectively eliminates any trade-off between training cost and training time, allowing ML teams to get results sooner, iterate faster, and accelerate innovation.

During his keynote at AWS re:Invent 2020, Swami Sivasubramanian demonstrated the fastest training times to date for T5-3B and Mask-RCNN.

  • The T5-3B model has 3 billion parameters, achieves state-of-the-art accuracy on natural language processing benchmarks, and usually takes weeks of effort to train and tune for performance. We trained this model in 6 days on 256 ml.p4d.24xlarge instances.
  • Mask-RCNN continues to be a popular instance segmentation model used by our customers. Last year at re:Invent, we trained Mask-RCNN in 26 minutes on PyTorch, and in 27 minutes on TensorFlow. This year, we recorded the fastest training time to date for Mask-RCNN at 6:12 minutes on TensorFlow, and 6:45 minutes on PyTorch.

Before we explain how Amazon SageMaker is able to achieve such speedups, let’s first explain how data parallelism works, and why it’s hard to scale.

A Primer on Data Parallelism
If you’re training a model on a single GPU, its full internal state is available locally: model parameters, optimizer parameters, gradients (parameter updates computed by backpropagation), and so on. However, things are different when you distribute a training job to a cluster of GPUs.

Using a technique named “data parallelism,” the training set is split in mini-batches that are evenly distributed across GPUs. Thus, each GPU only trains the model on a fraction of the total data set. Obviously, this means that the model state will be slightly different on each GPU, as they will process different batches. In order to ensure training convergence, the model state needs to be regularly updated on all nodes. This can be done synchronously or asynchronously:

  • Synchronous training: all GPUs report their gradient updates either to all other GPUs (many-to-many communication), or to a central parameter server that redistributes them (many-to-one, followed by one-to-many). As all updates are applied simultaneously, the model state is in sync on all GPUs, and the next mini-batch can be processed.
  • Asynchronous training: gradient updates are sent to all other nodes, or to a central server. However, they are applied immediately, meaning that model state will differ from one GPU to the next.

Unfortunately, these techniques don’t scale very well. As the number of GPUs increases, a parameter server will inevitably become a bottleneck. Even without a parameter server, network congestion soon becomes a problem, as n GPUs need to exchange n*(n-1) messages after each iteration, for a total amount of n*(n-1)*model size bytes. For example, ResNet-50 is a popular model used in computer vision applications. With its 26 million parameters, each 32-bit gradient update takes about 100 megabytes. With 8 GPUs, each iteration requires sending and receiving 56 updates, for a total of 5.6 gigabytes. Even with a fast network, this will cause some overhead, and slow down training.

A significant step forward was taken in 2017 thanks to the Horovod project. Horovod implemented an optimized communication algorithm for distributed training named “ring-allreduce,” which was soon integrated with popular deep learning libraries.

In a nutshell, ring-allreduce is a decentralized asynchronous algorithm. There is no parameter server: nodes are organized in a directed cycle graph (to put it simply, a one-way ring). For each iteration, a node receives a gradient update from its predecessor. Once a node has processed its own batch, it applies both updates (its own and the one it received), and sends the results to its neighbor. With n GPUs, each GPU processes 2*(n-1) messages before all GPUs have been updated. Accordingly, the total amount of data exchanged per GPU is 2*(n-1)*model size, which is much better than n*(n-1)*model size.

Still, as datasets keep growing, the network bottleneck issue often rises again. Enter SageMaker and its new AllReduce algorithm.

A New Data Parallelism Algorithm in Amazon SageMaker
With the AllReduce algorithm, GPUs don’t talk to one another any more. Each GPU stores its gradient updates in GPU memory. When a certain threshold is exceeded, these updates are sharded, and sent to parameter servers running on the CPUs of the GPU instances. This removes the need for dedicated parameter servers.

Each CPU is responsible for a subset of the model parameters, and it receives updates coming from all GPUs. For example, with 3 training instances equipped with a single GPU, each GPU in the training cluster would send a third of its gradient updates to each one of the three CPUs.


Then, each CPU would apply all the gradient updates that it received, and it would distributes the consolidated result back to all GPUs.


Now that we understand how this algorithm works, let’s see how you can use it with your own code, without having to manage any infrastructure.

Training with Data Parallelism in Amazon SageMaker
The SageMaker Data Parallelism API is designed for ease of use, and should provide seamless integration with existing distributed training toolkits. In most cases, all you have to change in your training code is the import statement for Horovod (TensorFlow), or for Distributed Data Parallel (PyTorch).

For PyTorch, this would look like this.

import smdistributed.dataparallel.torch.parallel.distributed as dist

Then, I need to pin each GPU to a single SDP process.


Then, I define my model as usual, for example:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

Finally, I instantiate my model, and use it to create a DistributedDataParallel object like so:

import torch
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP
device = torch.device("cuda")
model = DDP(Net().to(device))

The rest of the code is vanilla PyTorch, and I can train it using the PyTorch estimator available in the SageMaker SDK. Here, I’m using an ml.p3.16xlarge instance with 8 NVIDIA V100 GPUs.

from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    distribution={'smdistributed':{'dataparallel':{enabled': True}}}

From then on, SageMaker takes over and provisions all required infrastructure. You can focus on other tasks while your training job runs.

Getting Started
If your training jobs last for hours or days on multiple GPUs, we believe that the SageMaker Data Parallelism library can save you time and money, and help you experiment and innovate quicker. It’s available today at in all regions where SageMaker is available, at no additional cost.

Examples are available to get you started quickly. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Amazon SageMaker Simplifies Training Deep Learning Models With Billions of Parameters

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-simplifies-training-deep-learning-models-with-billions-of-parameters/

Today, I’m extremely happy to announce that Amazon SageMaker simplifies the training of very large deep learning models that were previously difficult to train due to hardware limitations.

In the last 10 years, a subset of machine learning named deep learning (DL) has taken the world by storm. Based on neural networks, DL algorithms have an extraordinary ability to extract information patterns hidden in vast amounts of unstructured data, such as images, videos, speech, or text. Indeed, DL has quickly achieved impressive results on a variety of complex human-like tasks, especially on computer vision and natural language processing. In fact, innovation has never been faster, as DL keeps improving its results on reference tasks like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the General Language Understanding Evaluation (GLUE), or the Stanford Question Answering Dataset (SQUAD).

In order to tackle ever more complex tasks, DL researchers are designing increasingly sophisticated models, adding more neuron layers and more connections to improve pattern extraction and prediction accuracy, with a direct impact on model size. For example, you would get very good results on image classification with a 100-megabyte ResNet-50 model. For more difficult tasks such as object detection or instance segmentation, you would have to use larger models such as Mask R-CNN or YOLO v4, weighing in at about 250 megabytes.

As you can guess, model growth also impacts the amount of time and hardware resources required for model training, which is why Graphical Processing Units (GPU) have long been the preferred option to train and fine-tune large DL models. Thanks to their massively parallel architecture and their large on-board memory, they make it possible to use a technique called mini-batch training. By sending several data samples at once to the GPU, instead of sending them one by one, communication overhead is reduced, and training jobs are greatly accelerated. For example, the NVIDIA A100 available on the Amazon Elastic Compute Cloud (EC2) p4 family has over 7,000 compute cores and 40 gigabytes of fast onboard memory. Surely, that should be enough to train large batches of data on very large models, shouldn’t it?

Well, it’s not. Natural language processing behemoths such as OpenAI GPT-2 (1.5 billion parameters), T5-3B (3 billion parameters) and GPT-3 (175 billion parameters) consume tens or even hundreds of gigabytes of GPU memory. Likewise, state-of-the-art models working on high-resolution 3D images can be too large to fit in GPU memory, even with a batch size of 1…

Trying to square the circle, DL researchers use a combination of techniques, such as the following:

  • Buy more powerful GPUs, although we just saw that this is simply not an option for some models.
  • Work with less powerful models, and sacrifice accuracy.
  • Implement gradient checkpointing, a technique that relies on saving intermediate training results to disk instead of keeping everything in memory, at the expense of a 20-30% training slowdown.
  • Implement model parallelism, that is to say split the model manually, and train its (smaller) pieces on different GPUs. Needless to say, this is an extremely difficult, time-consuming, and uncertain task, even for expert practitioners.

Customers have told us that none of the above was a satisfactory solution to working with very large models. They asked us for a simpler and more cost-effective solution, and we got to work.

Introducing Model Parallelism in Amazon SageMaker
Model parallelism in SageMaker automatically and efficiently partitions models across several GPUs, eliminating the need for accuracy compromises or for complex manual work. In addition, thanks to this scale-out approach to model training, not only can you work with very large models without any memory bottleneck, you can also leverage a large number of smaller and more cost-effective GPUs.

At launch, this is supported for TensorFlow and PyTorch, and it only requires minimal changes in your code. When you launch a training job, you can specify whether your model should be optimized for speed or for memory usage. Then, Amazon SageMaker runs an initial profiling job on your behalf in order to analyze the compute and memory requirements of your model. This information is then fed to a partitioning algorithm which decides how to split the model and how to map model partitions to GPUs, while minimizing communication. The outcome of the partitioning decision is saved to a file, which is passed as input to the actual training job.

As you can see, SageMaker takes care of everything. If you’d like, you could also manually profile and partition the model, then train on SageMaker.

Before we look at the code, I’d like to give you a quick overview of the internals.

Training with Model Partitions and Microbatches
As model partitions running on different GPUs expect forward pass inputs from each other (activation values), processing training mini-batches across a sequence of partitions would only keep one partition busy at all times, while stalling the other ones.

To avoid this inefficient behavior, mini-batches are split into microbatches that are processed in parallel on the different GPUs. For example, GPU #1 could be forward propagating microbatch n, while GPU #2 could do the same for microbatch n+1. Activation values can be stored, and passed to the next partition whenever it’s ready to accept them.

For back propagation, partitions also expect input values from each other (gradients). As a partition can’t simultaneously run forward and backward propagation, we could wait for all GPUs to complete the forward pass on their own microbatch, before letting them run the corresponding backward pass. This simple mode is available in Amazon SageMaker.

There’s an even more efficient option, called interleaved mode. Here, SageMaker replicates partitions according to the number of microbatches. For example, working with 2 microbatches, each GPU would run two copies of the partition it has received. Each copy would collaborate with partitions running on other GPUs, either for forward or backpropagation.

Here’s how things could look like, with 4 different microbatches being processed by 2 duplicated partitions.


To sum things up, interleaving the forward and backward passes of different microbatches is how SageMaker maximimes GPU utilization.

Now, let’s see how we can put this to work with TensorFlow.

Implementing Model Parallelism in Amazon SageMaker
Thanks to the SageMaker Model Parallelism (SMP) library, you can easily implement model parallelism in your own TensorFlow code (the process is similar for PyTorch). Here’s what you need to do:

  • Define and initialize the partitioning configuration.
  • Make your model a subclass of the DistributedModel class, using standard Keras subclassing.
  • Write and decorate with @smp.step a training function that represents a forward and backward step for the model. This function will be pipelined according to the architecture described in the previous section.
  • Optionally, do the same for an evaluation function that will also be pipelined.

Let’s apply this to a simple convolution network training on the MNIST dataset, using an ml.p3.8xlarge instance equipped with 4 NVIDIA V100 GPUs.

First, I initialize the SMP API.

import smdistributed.modelparallel.tensorflow as smp

Then, I subclass DistributedModel and build my model.

class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv = Conv2D(32, 3, activation="relu")
        self.flatten = Flatten()
        self.dense1 = Dense(128)
        self.dense2 = Dense(10)
. . .

This is what the training function looks like.

def forward_backward(images, labels):
    predictions = model(images, training=True)
    loss = loss_obj(labels, predictions)
    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss

Then, I can train as usual with the TensorFlow estimator available in the SageMaker SDK. I only need to add the model parallelism configuration: 2 partitions (hence training on 2 GPUs), and 2 microbatches (hence 2 copies of each partition) with interleaving.

smd_mp_estimator = TensorFlow(
        "smdistributed": {
            "modelparallel": {
                "parameters": {
                    "microbatches": 2,
                    "partitions": 2,
                    "pipeline": "interleaved",
                    "optimize": "memory",
                    "horovod": True, 
        "mpi": {
            "enabled": True,
            "processes_per_host": 2, # Pick your processes_per_host
            "custom_mpi_options": mpioptions

Getting Started
As you can see, model parallelism makes it easier to train very large state-of-the-art deep learning models. It’s available today in all regions where Amazon SageMaker is available, at no additional cost.

Examples are available to get you started right away. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien


New – Fully Serverless Batch Computing with AWS Batch Support for AWS Fargate

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-fully-serverless-batch-computing-with-aws-batch-support-for-aws-fargate/

We launched AWS Batch on December 2016 as a fully managed batch computing service that enables developers, scientists and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. With AWS Batch, you no longer need to install and manage batch computing software or server clusters to run your jobs. AWS Batch is designed to remove the heavy lifting of batch workload management by creating compute environments, managing queues, and launching the appropriate compute resources to run your jobs quickly and efficiently.

Today, we are happy to introduce the ability to specify AWS Fargate as a computing resource for AWS Batch jobs. AWS Fargate is a serverless computing engine for containers that eliminates the need to provision and manage your own servers. With this enhancement, customers will now have a way to run their jobs on serverless computing resources: Simply submit your analysis, ML inference, map reduce analysis, and other batch workloads, and let Batch and Fargate handle the rest.

Basic Concept
Customers running batch workloads in the cloud have a variety of orchestration needs: for example, workloads need to be queued, submitted to a compute resource, given priorities, dependencies and retries need to be handled, compute needs to be scalable and available, and users need to account for utilization and resource management. While AWS Batch simplifies all the queuing, scheduling, and lifecycle management for customers, and even provisions and manages compute in the customer account, customers are looking for even more simplicity where they can get up and running in minutes. Time spent on image maintenance, right-sizing of compute, and monitoring is time not spent on applications. These customer needs have led us to develop Fargate integration, which we are pleased to announce today.

How It Works
Simply specify Fargate or Fargate Spot as the resource type in Batch and submit a Fargate job definition, and customers can now take advantage of the benefits of serverless computing without the need for image patching, isolation of VM boundaries, and calculation of the correct size.

To start, access the AWS Management Console of AWS Batch. Select Compute environments and Create.Getting startWe now have 2 new options for Provisioning model: Fargate and Fargate Spot.

Selecting FargateWith Fargate or Fargate Spot, you don’t need to worry about Amazon EC2 instances or Amazon Machine Images. Just set Fargate or Fargate Spot, your subnets, and the maximum total vCPU of the jobs running in the compute environment, and you have a ready-to-go Fargate computing environment. With Fargate Spot, you can take advantage of up to 70% discount for your fault-tolerant, time-flexible jobs.

vCPU fro FargateSelect Create compute environment. Then, Batch will create your Fargate-based compute environment.

Created Computing environment

Next step is to create the Job Queue, which is where your jobs live when waiting to be run. Then, Connect that to your Fargate compute environment.

After you finished setting up the job queue, next step is to create Job definitions for your Fargate jobs. Select Job definitions from the left pane, and click the Create button.

Setting up job definitionOnce you’ve selected Fargate for the job definition, you are now ready to submit your job. Batch will handle queueing, submission, and job lifecycle for you! You can access Job definitions by clicking Job definitions in the left pane. After selecting Job Definition, click Submit new job.

Submitting JobYou need to select the Job queue previously set up for your Fargate compute environment.

Submitting new job

You can now submit your new job by pressing the Submit button at the bottom.

Follow the steps below to set up your Fargate-based compute environment using the AWS CLI.

1. Creating Compute Environment

aws batch create-compute-environment --cli-input-json file://below_sample.json

    "computeEnvironmentName": "FargateComputeEnvironment",
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "FARGATE", # or FARGATE_SPOT
        "maxvCpus": 40,
        "subnets": [
        "securityGroupIds": ["sg-xxxxxxxxxxxxxxxx"],
        "tags": {
            "KeyName": "fargate"
"serviceRole": "arn:aws:iam::xxxxxxxxxxxx:role/service-role/AWSBatchServiceRole"

2.Creating Job Queue

aws batch create-job-queue --cli-input-json file://below_job_queue.json

  "jobQueueName": "FargateJobQueue",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [
      "order": 1,
      "computeEnvironment": "FargateComputeEnvironment"

3.Creating and Registering Job Definitions
aws batch-fargate register-job-definition --cli-input-json file://below_job_definition.json

    "jobDefinitionName": "FargateJobDefinition",
    "type": "container",
    "propagateTags": true,
     "containerProperties": {
        "image": "xxxxxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/test:latest",
        "networkConfiguration": {
            "assignPublicIp": "ENABLED"
        "fargatePlatformConfiguration": {
            "platformVersion": "LATEST"
        "resourceRequirements": [
                "value": "0.25",
                "type": "VCPU"
                "value": "512",
                "type": "MEMORY"
        "jobRoleArn": "arn:aws:iam::xxxxxxxxxxxx:role/ecsTaskExecutionRole",
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
            "awslogs-group": "/ecs/sleepenv",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "ecs"
   "platformCapabilities": [
    "tags": {
    "Service": "Batch",
    "Name": "JobDefinitionTag",
    "Expected": "MergeTag"

You can also use other container image registries like Docker Hub in addition to Amazon Elastic Container Registry.

4.Submitting Job
aws batch submit-job --job-name faragteJob --job-queue FargateJobQueue --job-definition FargateJobDefinition

Generally Available Today
AWS Batch support for AWS Fargate is generally available today for all AWS Regions where AWS Batch and AWS Fargate are available. Please visit the AWS Batch page and technical documentation for more details.

– Kame

Managed Entitlements in AWS License Manager Streamlines License Tracking and Distribution for Customers and ISVs

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/managed-entitlements-for-aws-license-manager-streamlines-license-management-for-customers-and-isvs/

AWS License Manager is a service that helps you easily manage software licenses from vendors such as Microsoft, SAP, Oracle, and IBM across your Amazon Web Services (AWS) and on-premises environments. You can define rules based on your licensing agreements to prevent license violations, such as using more licenses than are available. You can set the rules to help prevent licensing violations or notify you of breaches. AWS License Manager also offers automated discovery of bring your own licenses (BYOL) usage that keeps you informed of all software installations and uninstallations across your environment and alerts you of licensing violations.

License Manager can manage licenses purchased in AWS Marketplace, a curated digital catalog where you can easily find, purchase, deploy, and manage third-party software, data, and services to build solutions and run your business. Marketplace lists thousands of software listings from independent software vendors (ISVs) in popular categories such as security, networking, storage, machine learning, business intelligence, database, and DevOps.

Managed entitlements for AWS License Manager
Starting today, you can use managed entitlements, a new feature of AWS License Manager that lets you distribute licenses across your AWS Organizations, automate software deployments quickly and track licenses – all from a single, central account. Previously, each of your users would have to independently accept licensing terms and subscribe through their own individual AWS accounts. As your business grows and scales, this becomes increasingly inefficient.

Customers can use managed entitlements to manage more than 8,000 listings available for purchase from more than 1600 vendors in the AWS Marketplace. Today, AWS License Manager automates license entitlement distribution for Amazon Machine Image, Containers and Machine Learning products purchased in the Marketplace with a variety of solutions.

How It Works
Managed entitlements provides built-in controls that allow only authorized users and workloads to consume a license within vendor-defined limits. This new license management mechanism also eliminates the need for ISVs to maintain their own licensing systems and conduct costly audits.


Each time a customer purchases licenses from AWS Marketplace or a supported ISV, the license is activated based on AWS IAM credentials, and the details are registered to License Manager.

list of granted license

Administrators distribute licenses to AWS accounts. They can manage a list of grants for each license.

list of grants

Benefits for ISVs
AWS License Manager managed entitlements provides several benefits to ISVs to simplify the automatic license creation and distribution process as part of their transactional workflow. License entitlements can be distributed to end users with and without AWS accounts. Managed entitlements streamlines upgrades and renewals by removing expensive license audits and provides customers with a self-service tracking tool with built-in license tracking capabilities. There are no fees for this feature.

Managed entitlements provides the ability to distribute licenses to end users who do not have AWS accounts. In conjunction with the AWS License Manager, ISVs create a unique long-term token to identify the customer. The token is generated and shared with the customer. When the software is launched, the customer enters the token to activate the license. The software exchanges the long-term customer token for a short-term token that is passed to the API and the setting of the license is completed. For on-premises workloads that are not connected to the Internet, ISVs can generate a host-specific license file that customers can use to run the software on that host.

Now Available
This new enhancement to AWS License Manager is available today for US East (N. Virginia), US West (Oregon), and Europe (Ireland) with other AWS Regions coming soon.

Licenses purchased on AWS Marketplace are automatically created in AWS License Manager and no special steps are required to use managed entitlements. For more details about the new feature, see the managed entitlement pages on AWS Marketplace, and the documentation. For ISVs to use this new feature, please visit our getting started guide.

Get started with AWS License Manager and the new managed entitlements feature today.

– Kame

New – Amazon Lookout for Equipment Analyzes Sensor Data to Help Detect Equipment Failure

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-amazon-lookout-for-equipment-analyzes-sensor-data-to-help-detect-equipment-failure/

Companies that operate industrial equipment are constantly working to improve operational efficiency and avoid unplanned downtime due to component failure. They invest heavily and repeatedly in physical sensors (tags), data connectivity, data storage, and building dashboards over the years to monitor the condition of their equipment and get real-time alerts. The primary data analysis methods are single-variable threshold and physics-based modeling approaches, and while these methods are effective in detecting specific failure types and operating conditions, they can often miss important information detected by deriving multivariate relationships for each piece of equipment.

With machine learning, more powerful technologies have become available that can provide data-driven models that learn from an equipment’s historical data. However, implementing such machine learning solutions is time-consuming and expensive owing to capital investment and training of engineers.

Today, we are happy to announce Amazon Lookout for Equipment, an API-based machine learning (ML) service that detects abnormal equipment behavior. With Lookout for Equipment, customers can bring in historical time series data and past maintenance events generated from industrial equipment that can have up to 300 data tags from components such as sensors and actuators per model. Lookout for Equipment automatically tests the possible combinations and builds an optimal machine learning model to learn the normal behavior of the equipment. Engineers don’t need machine learning expertise and can easily deploy models for real-time processing in the cloud.

Customers can then easily perform ML inference to detect abnormal behavior of the equipment. The results can be integrated into existing monitoring software or AWS IoT SiteWise Monitor to visualize the real-time output or to receive alerts if an asset tends toward anomalous conditions.

How Lookout for Equipment Works
Lookout for Equipment reads directly from Amazon S3 buckets. Customers can publish their industrial data in S3 and leverage Lookout for Equipment for model development. A user determines the value or time period to be used for training and assigns an appropriate label. Given this information, Lookout for Equipment launches a task to learn and creates the best ML model for each customer.

Because Lookout for Equipment is an automated machine learning tool, it gets smarter over time as users use Lookout for Equipment to retrain their models with new data. This is useful for model re-creation when new invisible failures occur, or when the model drifts over time. Once the model is complete and can be inferred, Lookout for Equipment provides real-time analysis.

With the equipment data being published to S3, the user can scheduled inference that ranges from 5 minutes to one hour. When the user data arrives in S3, Lookout for Equipment fetches the new data on the desired schedule, performs data inference, and stores the results in another S3 bucket.

Set up Lookout for Equipment with these simply steps:

  1. Upload data to S3 buckets
  2. Create datasets
  3. Ingest data
  4. Create a model
  5. Schedule inference (if you need real-time analysis)

1. Upload data
You need to upload tag data from equipment to any S3 bucket.

2. Create Datasets

Select Create dataset, and set Dataset name, and set Data Schema. Data schema is like a data design document that defines the data to be fed in later. Then select Create.

creating datasets console

3. Ingest data
After a dataset is created, the next step is to ingest data. If you are familiar with Amazon Personalize or Amazon Forecast, doesn’t this screen feel familiar? Yes, Lookout for Equipment is as easy to use as those are.

Select Ingest data.

Ingesting data consoleSpecify the S3 bucket location where you uploaded your data, and an IAM role. The IAM role has to have a trust relationship to “lookoutequipment.amazonaws.com” You can use the following policy file for the test.

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Principal": {
        "Service": "lookoutequipment.amazonaws.com"
      "Action": "sts:AssumeRole"

The data format in the S3 bucket has to match the Data Schema you set up in step 2. Please check our technical documents for more detail. Ingesting data takes a few minutes to tens of minutes depending on your data volume.

4. Create a model
After data ingest is completed, you can train your own ML model now. Select Create new model. Fields show us a list of fields in the ingested data. By default, no field is selected. You can select fields you want Lookout for Equipment to learn. Lookout for Equipment automatically finds and trains correlations from multiple specified fields and creates a model.

Image illustrates setting up fields.

If you are sure that your data has some unusual data included, you can optionally set the windows to exclude that data.

setting up maintenance windowOptionally, you can divide ingested data for training and then for evaluation. The data specified during the evaluation period is checked compared to the trained model.

setting up evaluation window

Once you select Create, Lookout for Equipment starts to train your model. This process takes minutes to hours depending on your data volume. After training is finished, you can evaluate your model with the evaluation period data.

model performance console

5. Schedule Inference
Now it is time to analyze your real time data. Select Schedule Inference, and set up your S3 buckets for input.

setting up input S3 bucket

You can also set Data upload frequency, which is actually the same as inferencing frequency, and Offset delay time. Then, you need to set up Output data as Lookout for Equipment outputs the result of inference.

setting up inferenced output S3 bucket

Amazon Lookout for Equipment is In Preview Today
Amazon Lookout for Equipment is in preview today at US East (N. Virginia), Asia Pacific (Seoul), and Europe (Ireland) and you can see the documentation here.

– Kame

Amazon Monitron, a Simple and Cost-Effective Service Enabling Predictive Maintenance

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-monitron-a-simple-cost-effective-service-enabling-predictive-maintenance/

Today, I’m extremely happy to announce Amazon Monitron, a condition monitoring service that detects potential failures and allows user to track developing faults enabling you to implement predictive maintenance and reduce unplanned downtime.

True story: A few months ago, I bought a new washing machine. As the delivery man was installing it in my basement, we were chatting about how unreliable these things seemed to be nowadays; never lasting more than a few years. As the gentleman made his way out, I pointed to my aging and poorly maintained water heater, telling him that I had decided to replace it in the coming weeks and that he’d be back soon to install a new one. Believe it or not, it broke down the next day. You can laugh at me, it’s OK. I deserve it for not planning ahead.

As annoying as this minor domestic episode was, it’s absolutely nothing compared to the tremendous loss of time and money caused by the unexpected failure of machines located in industrial environments, such as manufacturing production lines and warehouses. Any proverbial grain of sand can cause unplanned outages, and Murphy’s Law has taught us that they’re likely to happen in the worst possible configuration and at the worst possible time, resulting in severe business impacts.

To avoid breakdowns, reliability managers and maintenance technicians often combine three strategies:

  1. Run to failure: where equipment is operated without maintenance until it no longer operates reliably. When the repair is completed, equipment is returned to service; however, the condition of the equipment is unknown and failure is uncontrolled.
  2. Planned maintenance: where predefined maintenance activities are performed on a periodic or meter basis, regardless of condition. The effectiveness of planned maintenance activities is dependent on the quality of the maintenance instructions and planned cycle. It risks equipment being both over- and under-maintained, incurring unnecessary cost or still experiencing breakdowns.
  3. Condition-based maintenance: where maintenance is completed when the condition of a monitored component breaches a defined threshold. Monitoring physical characteristics such as tolerance, vibration or temperature is a more optimal strategy, requiring less maintenance and reducing maintenance costs.
  4. Predictive maintenance: where the condition of components is monitored, potential failures detected and developing faults tracked. Maintenance is planned at a time in the future prior to expected failure and when the total cost of maintenance is most cost-effective.

Condition-based maintenance and predictive maintenance require sensors to be installed on critical equipment. These sensors measure and capture physical quantities such as temperature and vibration, whose change is a leading indicator of a potential failure or a deteriorating condition.

As you can guess, building and deploying such maintenance systems can be a long, complex, and costly project involving bespoke hardware, software, infrastructure, and processes. Our customers asked us for help, and we got to work.

Introducing Amazon Monitron
Amazon Monitron is an easy and cost-effective condition monitoring service that allows you to monitor the condition of equipment in your facilities, enabling the implementation of a predictive maintenance program.


Setting up Amazon Monitron is extremely simple. You first install Monitron sensors that capture vibration and temperature data from rotating machines, such as bearings, gearboxes, motors, pumps, compressors, and fans. Sensors send vibration and temperature measurements hourly to a nearby Monitron gateway, using Bluetooth Low Energy (BLE) technology allowing the sensors to run for at least three years. The Monitron gateway is itself connected to your WiFi network, and sends sensor data to AWS, where it is stored and analyzed using machine learning and ISO 20816 vibration standards.

As communication is infrequent, up to 20 sensors can be connected to a single gateway, which can be located up to 30 meters away (depending on potential interference). Thanks to the scalability and cost efficiency of Amazon Monitron, you can deploy as many sensors as you need, including on pieces of equipment that until now weren’t deemed critical enough to justify the cost of traditional sensors. As with any data-driven application, security is our No. 1 priority. The Monitron service authenticates the gateway and the sensors to make sure that they’re legitimate. Data is also encrypted end-to-end, without any decryption taking place on the gateway.

Setting up your gateways and sensors only requires installing the Monitron mobile application on an Android mobile device with Bluetooth support for gateway setup, and NFC support for sensor setup. This is an extremely simple process, and you’ll be monitoring in minutes. Technicians will also use the mobile application to receive alerts indicating abnormal machine conditions. They can acknowledge these alerts and provide feedback to improve their accuracy (say, to minimize false alerts and missed anomalies).

Customers are already using Amazon Monitron today, and here are a couple of examples.

Fender Musical Instruments Corporation is an iconic brand and a leading manufacturer of stringed instruments and amplifiers. Here’s what Bill Holmes, Global Director of Facilities at Fender, told us: “Over the past year we have partnered with AWS to help develop a critical but sometimes overlooked part of running a successful manufacturing business which is knowing the condition of your equipment. For manufacturers worldwide, uptime of equipment is the only way we can remain competitive with a global market. Ensuring equipment is up and running and not being surprised by sudden breakdowns helps get the most out of our equipment. Unplanned downtime is costly both in loss of production and labor due to the firefighting nature of the breakdown. The Amazon Monitron condition monitoring system has the potential of giving both large industry as well as small ‘mom and pop shops’ the ability to predict failures of their equipment before a catastrophic breakdown shuts them down. This will allow for a scheduled repair of failing equipment before it breaks down.

GE Gas Power is a leading provider of power generation equipment, solutions and services. It operates many manufacturing sites around the world, in which much of the manufacturing equipment is not connected nor monitored for health. Magnus Akesson, CIO at GE Gas Power Manufacturing says: “Naturally, we can reduce both maintenance costs and downtime, if we can easily and cheaply connect and monitor these assets at scale. Additionally, we want to take advantage of advanced algorithms to look forward, to know not just the current state but also predict future health and to detect abnormal behaviors. This will allow us to transition from time-based to predictive and prescriptive maintenance practices. Using Amazon Monitron, we are now able to quickly retrofit our assets with sensors and connecting them to real- time analytics in the AWS cloud. We can do this without having to require deep technical skills or having to configure our own IT and OT networks. From our initial work on vibration-prone tumblers, we are seeing this vision come to life at an amazing speed: the ease-of-use for the operators and maintenance team, the simplicity, and the ability to implement at scale is extremely attractive to GE. During our pilot, we were also delighted to see one-click capabilities for updating the sensors via remote Over the Air (OTA) firmware upgrades, without having to physically touch the sensors. As we grow in scale, this is a critical capability in order to be able to support and maintain the fleet of sensors.

Now, let me show you how to get started with Amazon Monitron.

Setting up Amazon Monitron
First, I open the Monitron console. In just a few clicks, I create a project, and an administrative user allowed to manage it. Using a link provided in the console, I download and install the Monitron mobile application on my Android phone. Opening the app, I log in using my administrative credentials.

The first step is to create a site describing assets, sensors, and gateways. I name it “my-thor-project.”

Application screenshot

Let’s add a gateway. Enabling BlueTooth on my phone, I press the pairing button on the gateway.

Application screenshot

The name of the gateway appears immediately.

Application screenshot

I select the gateway, and I configure it with my WiFi credentials to let it connect to AWS. A few seconds later, the gateway is online.

Application screenshot

My next step is to create an asset that I’d like to monitor, say a process water pump set, with a motor and a pump that I would like to monitor. I first create the asset itself, simply defining its name, and the appropriate ISO 20816 class (a standard for measurement and evaluation of machine vibration).

Application screenshot

Then, I add a sensor for the motor.

Application screenshot

I start by physically attaching the sensor to the motor using the suggested adhesive. Next, I specify a sensor position, enable the NFC on my smartphone, and tap the Monitron sensor that I attached to the motor with my phone. Within seconds, the sensor is commissioned.

Application screenshot

I repeat the same operation for the pump. Looking at my asset, I see that both sensors are operational.

Application screenshot

They are now capturing temperature and vibration information. Although there isn’t much to see for the moment, graphs are available in the mobile app.

Application screenshot

Over time, the gateway will keep sending this data securely to AWS, where it will be analyzed for early signs of failure. Should either of my assets exhibit these, I would receive an alert in the mobile application, where I could visualize historical data, and decide what the best course of action would be.

Getting Started
As you can see, Monitron makes it easy to deploy sensors enabling predictive maintenance applications. The service is available today in the US East (N. Virginia) region, and using it costs $50 per sensor per year.

If you’d like to evaluate the service, the Monitron Starter Kit includes everything you need (a gateway with a mounting kit, five sensors, and a power supply), and it’s available for $715. Then, you can scale your deployment with additional sensors, which you can buy in 5-packs for $575.

Starter kit picture

Give Amazon Monitron a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for Monitron.

– Julien

Special thanks to my colleague Dave Manley for taking the time to educate me on industrial maintenance operations.

New – Amazon QuickSight Q Answers Natural-Language Questions About Business Data

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/amazon-quicksight-q-to-answer-ad-hoc-business-questions/

We launched Amazon QuickSight as the first Business Intelligence (BI) service with Pay-per-Session pricing. Today, we are happy to announce the preview of Amazon QuickSight Q, a Natural Language Query (NLQ) feature powered by machine learning (ML). With Q, business users can now use QuickSight to ask questions about their data using everyday language and receive accurate answers in seconds.

For example, in response to questions such as, “What is my year-to-date year-over-year sales growth?” or “Which products grew the most year-over-year?” Q automatically parses the questions to understand the intent, retrieves the corresponding data and returns the answer in the form of a number, chart, or table in QuickSight. Q uses state-of-the art ML algorithms to understand the relationships across your data and build indexes to provide accurate answers. Also, since Q does not require BI teams to pre-build data models on specific datasets, you can ask questions across all your data.

The Need for Q
Traditionally, BI engineers and analysts create dashboards to make it easier for business users to view and monitor key metrics. When a new business question arises and no answers are found in the data displayed on an existing dashboard, the business user must submit a data request to the BI Team, which is often thinly staffed, and wait several weeks for the question to be answered and added to the dashboard.

A sales manager looking at a dashboard that outlines daily sales trends may want to know what their overall sales were for last week, in comparison to last month, the previous quarter, or the same time last year. They may want to understand how absolute sales compare to growth rates, or how growth rates are broken down by different geographies, product lines, or customer segments to identify new opportunities for growth. This may require a BI team to reconstruct the data, create new data models, and answer additional questions. This process can take from a few days to a few weeks. Such specific data requests increase the workload for BI teams that may be understaffed, increases the time spent waiting for answers, and frustrates business users and executives who need the data to make timely decisions.

How Q Works
To ask a question, you simply type your question into the QuickSight Q search bar. Once you start typing in your question, Q provides autocomplete suggestions with key phrases and business terms to speed up the process. It also automatically performs spell check, and acronym and synonym matching, so you don’t have to worry about typos or remember the exact business terms in the data. Q uses natural language understanding techniques to extract business terms (e.g., revenue, growth, allocation, etc.) and intent from your questions, retrieves the corresponding data from the source, and returns the answers in the form of numbers and graphs.

Q further learns from user interactions from within the organization to continually improve accuracy. For example, if Q doesn’t understand a phrase in a question, such as what “my product” refers to, Q prompts the user to choose from a drop-down menu of suggested options in the search bar. Q then remembers the phrase for next time, thus improving accuracy with use. If you ask a question about all your data, Q provides an answer using that data. Users are not limited to asking questions that are confined to a pre-defined dashboard and can ask any questions relevant to your business.

Let’s see a demo. We assume that there is a dashboard of sales for a company.

Dashboard of Quicksight

The business users of the dashboard can drill down and slice and dice the data simply by typing their questions on the Q search bar above.

Let’s use the Q search bar to ask a question, “Show me last year’s weekly sales in California.” Q generates numbers and a graph within seconds.

Generated dashboard

You can click “Looks good” or “Not quite right” on the answer. When clicking “Not quite right,” you can submit your feedback to your BI team to help improve Q. You can also investigate the answer further. Let’s add “versus New York” to the end of the question and hit enter. A new answer will pop up.

Generated new graph

Next, let’s investigate further in California. Type in “What are the best selling categories in California.

categories detail

With Q, you can easily change the presentation. Let’s see another diagram for the same question.

line start

Next, let’s take a look at the biggest industry, “Finance.” Type in “Show me the sales growth % week over week in the Finance sector” to Q, and specify “Line chart” to check weekly sales revenue growth.

The sales revenue shows growth, but it has peak and off-peak spikes. With these insights, you might now consider how to stabilize for a better profit structure.

Getting Started with Amazon QuickSight Q
A new “Q Topics” link will appear on the left navigation bar. Topics are a collection of one or more datasets and are meant to represent a subject area that users can ask questions about. For example, a marketing team may have Q Topics for “Ad Spending,” “Email Campaign,” “Website Analytics,” and others. Additionally, as an author, you can:

  • Add friendly names, synonyms, and descriptions to datasets and columns to improve Q’s answers.
  • Share the Topic to your users so they can ask questions about the Topic.
  • See questions your users are asking, how Q answered these questions, and improve upon the answer.

topic tool barSelect Topics, and set Topic name and its Description.

setting up topics

After clicking the Continue button, you can add datasets to a topic in two ways: You can add one or more datasets directly to your topic by selecting Add datasets, or you can import all the datasets in an existing dashboard into your topic by selecting Import dashboard.

The next step is to make your datasets natural-language friendly. Generally, names of datasets and columns are based on technical naming conventions and do not reflect how they are referred to by end users. Q relies heavily on names to match the right dataset and column with the terms used in questions. Therefore, such technical names must be converted to user-friendly names to ensure that they can be mapped correctly. Below are examples:

  • Dataset Name – D_CUST_DLY_ORD_DTL → Friendly Name: Customer Daily Order Details.
  • Column Name: pdt_cd Column → Friendly name: Product Code

Also, you can set up synonyms for each column so users can use the terms they are most comfortable with. For example, some users might input the term “client” or “segment” instead of “industry.” Q provides a feature to correct to the right name when typing the query, but BI operators can also set up synonyms for frequently used words. Click “Topics” in the left pane and choose the dashboard where you want to set synonyms.

Then, choose “datasets.

Now, we can set a Friendly Name or synonyms as Aliases, such as “client” for “Customer,” or “Segment” for “Industry.”

setting up Friendly Name

After adding synonyms, a user can save the changes and start asking questions in the Q search bar.

Amazon QuickSight Q Preview Available Today
Q is available in preview for US East (N. Virginia), US West (Oregon), US East (Ohio) and Europe (Ireland). Getting started with Q is just a few clicks away from QuickSight. You can use Q with AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3, or third-party commercial sources such as SQL Server, Teradata, and Snowflake. Salesforce, ServiceNow, and Adobe automatically integrate with all data sources supported by QuickSight, including business applications such as Analytics or Excel.

Learn more about Q and get started with the preview today.

– Kame


New- Amazon DevOps Guru Helps Identify Application Errors and Fixes

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/amazon-devops-guru-machine-learning-powered-service-identifies-application-errors-and-fixes/

Today, we are announcing Amazon DevOps Guru, a fully managed operations service that makes it easy for developers and operators to improve application availability by automatically detecting operational issues and recommending fixes. DevOps Guru applies machine learning informed by years of operational excellence from Amazon.com and Amazon Web Services (AWS) to automatically collect and analyze data such as application metrics, logs, and events to identify behavior that deviates from normal operational patterns.

Once a behavior is identified as an operational problem or risk, DevOps Guru alerts developers and operators to the details of the problem so they can quickly understand the scope of the problem and possible causes. DevOps Guru provides intelligent recommendations for fixing problems, saving you time resolving them. With DevOps Guru, there is no hardware or software to deploy, and you only pay for the data analyzed; there is no upfront cost or commitment.

Distributed/Complex Architecture and Operational Excellence
As applications become more distributed and complex, operators need more automated practices to maintain application availability and reduce the time and effort spent on detecting, debugging, and resolving operational issues. Application downtime, for example, as caused by misconfiguration, unbalanced container clusters, or resource depletion, can result in significant revenue loss to an enterprise.

In many cases, companies must invest developer time in deploying and managing multiple monitoring tools, such as metrics, logs, traces, and events, and storing them in various locations for analysis. Developers or operators also spend time developing and maintaining custom alarms to alert them to issues such as sudden spikes in load balancer errors or unusual drops in application request rates. When a problem occurs, operators receive multiple alerts related to the same issue and spend time combining alerts to prioritize those that need immediate attention.

How DevOps Guru Works
The DevOps Guru machine learning models leverages AWS expertise in running highly available applications for the world’s largest e-commerce business for the past 20 years. DevOps Guru automatically detects operational problems, details the possible causes, and recommends remediation actions. DevOps Guru provides customers with a single console experience to search and visualize operational data by integrating data across multiple sources supporting Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS CloudFormation, and AWS X-Ray and reduces the need to use multiple tools.

Getting Started with DevOps Guru
Activating DevOps Guru is as easy as accessing the AWS Management Console and clicking Enable. When enabling DevOps Guru, you can select the IAM role. You’ll then choose the AWS resources to analyze, which may include all resources in your AWS account or just specified CloudFormation StackSets. Finally, you can set an Amazon SNS topic if you want to send notifications from DevOps Guru via SNS.

DevOps Guru starts to accumulate logs and analyze your environment; it can take up to several hours. Let’s assume we have a simple serverless architecture as shown in this illustration.

When the system has an error, the operator needs to investigate if the error came from Amazon API Gateway, AWS Lambda, or AWS DynamoDB. They must then determine the root cause and how to fix the issue. With DevOps Guru, the process is now easy and simple.

When a developer accesses the management console of DevOps Guru, they will see a list of insights which is a collection of anomalies that are created during the analysis of the AWS resources configured within your application. In this case, Amazon API Gateway, AWS Lambda, and Amazon DynamoDB. Each insight contains observations, recommendations, and contextual data you can use to better understand and resolve the operational problem.

The list below shows the insight name, the status (closed or ongoing), severity, and when the insight was created. Without checking any logs, you can immediately see that in the most recent issue (line1), a problem with a Lambda function within your stack was the cause of the issue, and it was related to duration. If the issue was still occurring, the status would be listed as Ongoing. Since this issue was temporary, the status is showing Closed.


Let’s look deeper at the most recent anomaly by clicking through the first insight link. There are two tabs: Aggregated metrics and Graphed anomalies.

Aggregated metrics display metrics that are related to the insight. Operators can see which AWS CloudFormation stack created the resource that emitted the metric, the name of the resource, and its type. The red lines on a timeline indicate spans of time when a metric emitted unusual values. In this case, the operator can see the specific time of day on Nov 24 when the anomaly occurred for each metric.

Graphed anomalies display detailed graphs for each of the insight’s anomalies. Operators can investigate and look at an anomaly at the resource level and per statistic. The graphs are grouped by metric name.


By reviewing aggregated and graphed anomalies, an operator can see when the issue occurred, whether it is still ongoing, as well as the resources impacted. It appears the increased Lambda duration had a corresponding impact on API Gateway causing timeouts and resulted in 5XX errors in API Gateway.

Dev Ops Guru also provides Relevant events which are related to activities that changed your application’s configuration as illustrated below.


We can now see that a configuration change happened 2 hours before this issue occurred. If we click the point on the graph at 20:30 on 11/24, we can learn more and see the details of that change.

If you click through to the Ops event, the AWS CloudTrail logs would show that the configuration change was twofold: 1) a change in the concurrency provisioned capacity on a Lambda function and 2) the reduction in the integration timeout on an API integration latency.

recommendations to fix

The recommendations tell the operator to evaluate the provisioned concurrency for Lambda and how to troubleshoot errors in API Gateway. After further evaluation, the operator will discover this is exactly correct. The root cause is a mismatch between the Lambda provisioned concurrency setting and the API Gateway integration latency timeout. When the Lambda configuration was updated in the last deployment, it altered how this application responded to burst traffic, and it no longer fit within the API Gateway timeout window. This error is unlikely to have been found in unit testing and will occur repeatedly if the configurations are not updated.

DevOps Guru can send alerts of anomalies to operators via Amazon SNS, and it is integrated with AWS Systems Manager OpsCenter, enabling customers to receive insights directly within OpsCenter as quickly diagnose and remediate issues.

Available for Preview Today
Amazon DevOps Guru is available for preview in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). To learn more about DevOps Guru, please visit our web site and technical documentation, and get started today.

– Kame