
New – Amazon SageMaker Pipelines Brings DevOps Capabilities to your Machine Learning Projects

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-pipelines-brings-devops-to-machine-learning-projects/

Today, I’m extremely happy to announce Amazon SageMaker Pipelines, a new capability of Amazon SageMaker that makes it easy for data scientists and engineers to build, automate, and scale end-to-end machine learning pipelines.

Machine learning (ML) is experimental and unpredictable by nature. You spend days or weeks exploring and processing data in many different ways, trying to crack the geode open to reveal its precious gemstones. Then, you experiment with different algorithms and parameters, training and optimizing lots of models in search of the highest accuracy. This process typically involves many different steps with dependencies between them, and managing it manually can become quite complex. In particular, tracking model lineage can be difficult, hampering auditability and governance. Finally, you deploy your top models and evaluate them against your reference test sets. Finally? Not quite, as you’ll certainly iterate again and again, either to try out new ideas or simply to retrain your models periodically on new data.

No matter how exciting ML is, it does unfortunately involve a lot of repetitive work. Even small projects will require hundreds of steps before they get the green light for production. Over time, not only does this work detract from the fun and excitement of your projects, it also creates ample room for oversight and human error.

To alleviate manual work and improve traceability, many ML teams have adopted the DevOps philosophy and implemented tools and processes for Continuous Integration and Continuous Delivery (CI/CD). Although this is certainly a step in the right direction, writing your own tools often leads to complex projects that require more software engineering and infrastructure work than you initially anticipated. Valuable time and resources are diverted from the actual ML project, and innovation slows down. Sadly, some teams decide to revert to manual work for model management, approval, and deployment.

Introducing Amazon SageMaker Pipelines
Simply put, Amazon SageMaker Pipelines brings best-in-class DevOps practices to your ML projects. This new capability makes it easy for data scientists and ML developers to create automated and reliable end-to-end ML pipelines. As usual with SageMaker, all infrastructure is fully managed, and doesn’t require any work on your side.

Care.com is the world’s leading platform for finding and managing high-quality family care. Here’s what Clemens Tummeltshammer, Data Science Manager, Care.com, told us: “A strong care industry where supply matches demand is essential for economic growth from the individual family up to the nation’s GDP. We’re excited about Amazon SageMaker Feature Store and Amazon SageMaker Pipelines, as we believe they will help us scale better across our data science and development teams, by using a consistent set of curated data that we can use to build scalable end-to-end machine learning (ML) model pipelines from data preparation to deployment. With the newly announced capabilities of Amazon SageMaker, we can accelerate development and deployment of our ML models for different applications, helping our customers make better informed decisions through faster real-time recommendations.”

Let me tell you more about the main components in Amazon SageMaker Pipelines: pipelines, model registry, and MLOps templates.

Pipelines – Model building pipelines are defined with a simple Python SDK. They can include any operation available in Amazon SageMaker, such as data preparation with Amazon SageMaker Processing or Amazon SageMaker Data Wrangler, model training, model deployment to a real-time endpoint, or batch transform. You can also add Amazon SageMaker Clarify to your pipelines, in order to detect bias prior to training, or once the model has been deployed. Likewise, you can add Amazon SageMaker Model Monitor to detect data and prediction quality issues.
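
To make this more concrete, here is a minimal sketch of a two-step pipeline (processing followed by training) written with the SageMaker Python SDK. The IAM role, S3 paths, script name, and instance types are placeholder assumptions, and step arguments may vary slightly across SDK versions, so treat this as an illustration rather than a ready-made pipeline.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

region = "us-east-1"                                      # placeholder region
role = "arn:aws:iam::123456789012:role/MySageMakerRole"   # placeholder IAM role

# Pipeline parameter that can be overridden for each execution
input_data = ParameterString(name="InputDataUrl", default_value="s3://my-bucket/raw/data.csv")

# Step 1: data preparation with SageMaker Processing
processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
step_process = ProcessingStep(
    name="PrepareData",
    processor=processor,
    code="preprocess.py",   # hypothetical local preprocessing script
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

# Step 2: model training with the built-in XGBoost container
image_uri = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.2-1")
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path="s3://my-bucket/models/",
)
step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        )
    },
)

# Assemble the pipeline, register its definition, and launch an execution
pipeline = Pipeline(
    name="MyModelBuildingPipeline",
    parameters=[input_data],
    steps=[step_process, step_train],
)
pipeline.upsert(role_arn=role)
execution = pipeline.start()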

Once launched, model building pipelines are executed as CI/CD pipelines. Every step is recorded, and detailed logging information is available for traceability and debugging purposes. Of course, you can also visualize pipelines in Amazon SageMaker Studio, and track their different executions in real time.

Model Registry – The model registry lets you track and catalog your models. In SageMaker Studio, you can easily view model history, list and compare versions, and track metadata such as model evaluation metrics. You can also define which versions may or may not be deployed in production. In fact, you can even build pipelines that automatically trigger model deployment once approval has been given. You’ll find that the model registry is very useful in tracing model lineage, improving model governance, and strengthening your compliance posture.
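
For illustration, here is a small, hedged boto3 sketch of how you might inspect the registry and flip a version’s approval status programmatically; the model package group name is a made-up placeholder.

import boto3

sm = boto3.client("sagemaker")

# List the versions registered in a (hypothetical) model package group
versions = sm.list_model_packages(
    ModelPackageGroupName="my-model-group",
    SortBy="CreationTime",
    SortOrder="Descending",
)
for version in versions["ModelPackageSummaryList"]:
    print(version["ModelPackageVersion"], version["ModelApprovalStatus"])

# Approve the latest version; a pipeline can use this status change as its deployment trigger
sm.update_model_package(
    ModelPackageArn=versions["ModelPackageSummaryList"][0]["ModelPackageArn"],
    ModelApprovalStatus="Approved",
)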

MLOps Templates – SageMaker Pipelines includes a collection of built-in CI/CD templates for popular pipelines (build/train/deploy, deploy only, and so on). You can also add and publish your own templates, so that your teams can easily discover them and deploy them. Not only do templates save lots of time, they also make it easy for ML teams to collaborate from experimentation to deployment, using standard processes and without having to manage any infrastructure. Templates also let Ops teams customize steps as needed, and give them full visibility for troubleshooting.

Now, let’s do a quick demo!

Building an End-to-end Pipeline with Amazon SageMaker Pipelines
Opening SageMaker Studio, I select the “Components” tab and the “Projects” view. This displays a list of built-in project templates. I pick one to build, train, and deploy a model.

SageMaker screenshot

Then, I simply give my project a name, and create it.

A few seconds later, the project is ready. I can see that it includes two Git repositories hosted in AWS CodeCommit, one for model training, and one for model deployment.

SageMaker screenshot

The first repository provides scaffolding code to create a multi-step model building pipeline: data processing, model training, model evaluation, and conditional model registration based on accuracy. As you’ll see in the pipeline.py file, this pipeline trains a regression model using the XGBoost algorithm on the well-known Abalone dataset. This repository also includes a build specification file, used by AWS CodePipeline and AWS CodeBuild to execute the pipeline automatically.
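
To give an idea of how such a conditional registration step is typically expressed, here is a hedged sketch using the SageMaker Python SDK. It assumes an evaluation step (step_evaluate) that writes an evaluation report to a PropertyFile (evaluation_report), plus the estimator and training step from a pipeline definition like the one sketched earlier; the model package group name and RMSE threshold are placeholders, not the actual contents of pipeline.py.

from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.step_collections import RegisterModel

# Register the trained model in the model registry
step_register = RegisterModel(
    name="RegisterRegressionModel",
    estimator=estimator,   # the XGBoost estimator used by the training step
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="AbaloneModelGroup",   # hypothetical group name
)

# Only register the model if its RMSE is below a threshold
condition = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step_name=step_evaluate.name,        # evaluation step, defined elsewhere
        property_file=evaluation_report,     # PropertyFile describing evaluation.json
        json_path="regression_metrics.rmse.value",
    ),
    right=6.0,                               # hypothetical RMSE threshold
)
step_condition = ConditionStep(
    name="CheckRMSE",
    conditions=[condition],
    if_steps=[step_register],
    else_steps=[],
)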

Likewise, the second repository contains code and configuration files for model deployment, as well as test scripts required to pass the quality gate. This operation is also based on AWS CodePipeline and AWS CodeBuild, which run an AWS CloudFormation template to create model endpoints for staging and production.

Clicking on the two blue links, I clone the repositories locally. This triggers the first execution of the pipeline.

SageMaker screenshot

A few minutes later, the pipeline has run successfully. Switching to the “Pipelines” view, I can visualize its steps.

SageMaker screenshot

Clicking on the training step, I can see the Root Mean Square Error (RMSE) metrics for my model.

SageMaker screenshot

As the RMSE is lower than the threshold defined in the conditional step, my model is added to the model registry, as visible below.

SageMaker screenshot

For simplicity, the registration step sets the model status to “Approved”, which automatically triggers its deployment to a real-time endpoint in the same account. Within seconds, I see that the model is being deployed.

SageMaker screenshot

Alternatively, you could register your model with a “Pending manual approval” status. This will block deployment until the model has been reviewed and approved manually. As the model registry supports cross-account deployment, you could also easily deploy in a different account, without having to copy anything across accounts.

A few minutes later, the endpoint is up, and I could use it to test my model.

SageMaker screenshot

Once I’ve made sure that this model works as expected, I could ping the MLOps team, and ask them to deploy the model in production.

Putting my MLOps hat on, I open the AWS CodePipeline console, and I see that my deployment is indeed waiting for approval.

SageMaker screenshot

I then approve the model for deployment, which triggers the final stage of the pipeline.

SageMaker screenshot

Reverting to my Data Scientist hat, I see in SageMaker Studio that my model is being deployed. Job done!

SageMaker screenshot

Getting Started
As you can see, Amazon SageMaker Pipelines makes it really easy for Data Science and MLOps teams to collaborate using familiar tools. They can create and execute robust, automated ML pipelines that deliver high-quality models to production faster than before.

You can start using SageMaker Pipelines in all commercial regions where SageMaker is available. The MLOps capabilities are available in the regions where CodePipeline is also available.

Sample notebooks are available to get you started. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleague Urvashi Chowdhary for her precious assistance during early testing.

Introducing Amazon SageMaker Data Wrangler, a Visual Interface to Prepare Data for Machine Learning

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/introducing-amazon-sagemaker-data-wrangler-a-visual-interface-to-prepare-data-for-machine-learning/

Today, I’m extremely happy to announce Amazon SageMaker Data Wrangler, a new capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications by using a visual interface.

Whenever I ask a group of data scientists and ML engineers how much time they actually spend studying ML problems, I often hear a collective sigh, followed by something along the lines of, “20%, if we’re lucky.” When I ask them why, the answer is invariably the same, “data preparation consistently takes up to 80% of our time!”

Indeed, preparing data for training is a crucial step of the ML process, and no one would think about botching it up. Typical tasks include:

  • Locating data: finding where raw data is stored, and getting access to it
  • Data visualization: examining statistical properties for each column in the dataset, building histograms, studying outliers
  • Data cleaning: removing duplicates, dropping or filling entries with missing values, removing outliers
  • Data enrichment and feature engineering: processing columns to build more expressive features, selecting a subset of features for training

In the early stage of a new ML project, this is a highly manual process, where intuition and experience play a large part. Using a mix of bespoke tools and open source tools such as pandas or PySpark, data scientists often experiment with different combinations of data transformations, and use them to process datasets before training models. Then, they analyze prediction results and iterate. As important as this is, looping through this process again and again can be time-consuming, tedious, and error-prone.

At some point, you will hit the right level of accuracy (or whatever other metric you’ve picked), and you’ll then want to train on the full dataset in your production environment. However, you’ll first have to reproduce and automate the exact data preparation steps that you experimented with in your sandbox. Unfortunately, there’s always room for error given the interactive nature of this work, even if you carefully document it.

Last but not least, you’ll have to manage and scale your data processing infrastructure before you get to the finish line. Now that I think of it, 80% of your time may not be enough to do all of this!

Introducing Amazon SageMaker Data Wrangler
Amazon SageMaker Data Wrangler is integrated in Amazon SageMaker Studio, our fully managed integrated development environment (IDE) for ML. With just a few clicks, you can connect to data sources, explore and visualize data, apply built-in transformations as well as your own, export the resulting code to an auto-generated script, and run it on managed infrastructure. Let’s look at each step in more detail.

Obviously, data preparation starts with locating and accessing data. Out of the box, SageMaker Data Wrangler lets you easily and quickly connect to Amazon Simple Storage Service (S3), Amazon Athena, Amazon Redshift, and AWS Lake Formation. You can also import data from Amazon SageMaker Feature Store. As with all things AWS, access management is governed by AWS Identity and Access Management (IAM), based on the permissions attached to your SageMaker Studio instance.

Once you’ve connected to your data sources, you’ll probably want to visualize your data. Using the SageMaker Data Wrangler user interface, you can view table summaries, histograms, and scatter plots in seconds. You can also build your own custom graphs by simply copying and running code written with the popular Altair open source library.
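
As an example, a small custom Altair graph could look like the sketch below. It assumes the working dataset is available to your code as a pandas DataFrame named df, and the column names match the Titanic-style dataset used later in this post; both are assumptions for illustration.

import altair as alt

# Histogram of passenger age, split by survival status
chart = alt.Chart(df).mark_bar(opacity=0.7).encode(
    x=alt.X("Age:Q", bin=alt.Bin(maxbins=20)),
    y="count()",
    color="Survived:N",
)
chart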

Once you’ve got a good grasp on what your data looks like, it’s time to start preparing it. SageMaker Data Wrangler includes 300+ built-in transformations, such as finding and replacing data, splitting/renaming/dropping columns, scaling numerical values, encoding categorical values, and so on. All you have to do is select the transformation in a drop-down list, and fill in the parameters it may require. You can then preview the change, and decide whether you’d like to add it or not to the list of preparation steps for this dataset. If you’d like, you can also add your own code to implement custom transformations, using either pandas, PySpark, or PySpark SQL.
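
For instance, a custom transform written with pandas might look like the following short sketch; it assumes the working dataset is exposed to custom code as a DataFrame named df, and the column names come from the Titanic dataset used in the demo below.

# Fill missing ages with the median value and derive a simple family-size feature
df["Age"] = df["Age"].fillna(df["Age"].median())
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1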

As you add transformation steps to your processing pipeline, you can view its graphical summary in SageMaker Studio. You can also add new stages to the pipeline, for example a new data source, or another group of transformation steps (say, a data cleaning group, followed by a feature engineering group). Thanks to the intuitive user interface, your data preparation pipeline will take shape in front of your eyes, and you’ll instantly be able to check that processed data looks the way that it should.

Early on, you’d certainly love to check your data preparation steps, and also get a sense of their predictive power, wouldn’t you? Good news, then! For regression and classification problem types, the “Quick model” capability lets you select a subset of your data, train a model, and determine which features are contributing most to the predicted outcome. Looking at the model, you can easily diagnose and fix data preparation issues as early as possible, and determine whether additional feature engineering is needed to improve your model performance.

Once you’re happy with your pipeline, you can export it in one click to a Python script that faithfully reproduces your manual steps. You won’t waste any time chasing discrepancies, and you can directly add this code to your ML project.

In addition, you can also export your processing code to:

  • Amazon SageMaker Processing
  • Amazon SageMaker Pipelines
  • Amazon SageMaker Feature Store

Now, let’s do a quick demo, and show you how easy it is to work with SageMaker Data Wrangler.

Using Amazon SageMaker Data Wrangler
Opening SageMaker Studio, I create a new data flow in order to process the Titanic dataset, which contains information on passengers, and labels showing whether they survived the wreck or not.

SageMaker screenshot

My dataset is stored as a CSV file in Amazon Simple Storage Service (S3), and I select the appropriate data source.

SageMaker screenshot

Using the built-in tool, I quickly navigate my S3 buckets, and I locate the CSV file containing my data. For larger datasets, SageMaker Data Wrangler also supports the Parquet format.

As I select my file, SageMaker Data Wrangler shows me the first few rows.

SageMaker screenshot

I import the dataset, and I’m presented with an initial view of the data flow. Right-clicking on the dataset, I select “Edit data types” to make sure that SageMaker Data Wrangler has correctly detected the type of each column in the dataset.

SageMaker screenshot

Checking each column, it looks like all types are correct.

SageMaker screenshot

Moving back to the data flow view, I select “Add analysis” this time. This opens a new view where I can visualize data using histograms, scatterplots, and more. For example, I build a histogram showing the age distribution of passengers according to their survival status, coloring the bins by gender. Of course, I can save it for future use.

SageMaker screenshot

Moving back to the data flow view once again, I select “Add transform” in order to start processing the dataset. This opens a new view, showing me the first lines of the dataset, as well as a list of 300+ built-in transforms.

SageMaker screenshot

Pclass, the passenger class, is a categorical variable, and I decide to encode it using one-hot encoding. This creates 3 new columns representing different dimensions, and I can preview them. As this is exactly what I wanted, I apply this transform for good. Likewise, I apply the same transform to the Sex column.

SageMaker screenshot

Then, I drop the original Pclass column. Using the same transform, I also drop the Name column.

SageMaker screenshot
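
For reference, a rough pandas equivalent of these interactive steps is shown below, assuming the Titanic data is loaded in a DataFrame named df; the script that Data Wrangler generates may be structured differently, and pandas’ get_dummies drops the original columns itself.

import pandas as pd

# One-hot encode Pclass and Sex (the original columns are replaced by the
# encoded ones), then drop the Name column
df = pd.get_dummies(df, columns=["Pclass", "Sex"], prefix=["Pclass", "Sex"])
df = df.drop(columns=["Name"])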

In order to get a quick idea of whether these transformations increase or decrease the accuracy of the model, I can create an analysis that trains a model on the spot. As my problem is a binary classification problem, SageMaker Data Wrangler uses a metric called the F1 score. 0.749 is a good start, and additional processing would certainly improve it. I can also see which features contribute most to the predicted outcome: sex, age, and being a third-class passenger.

SageMaker screenshot

Then, moving to the “Export” view, I select all the transforms I’ve created so far, in order to add them to my ML project.

SageMaker screenshot

Here, I select “Python Code” to generate a Python script. Other options are available for Amazon SageMaker Processing, Amazon SageMaker Pipelines, and Amazon SageMaker Feature Store.

SageMaker screenshot

A few seconds later, the script is available. I could add it as is to my ML project, and rest assured that my data preparation steps would be consistent with the interactive transforms that I’ve created above.

SageMaker screenshot

Getting Started
As you can see, Amazon SageMaker Data Wrangler makes it really easy to work interactively on data preparation steps, before transforming them into code that can be used immediately for experimentation and production.

You can start using this capability today in all regions where SageMaker Studio is available.

Give it a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleague Peter Liu for his precious help during early testing.

New – Store, Discover, and Share Machine Learning Features with Amazon SageMaker Feature Store

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-store-discover-and-share-machine-learning-features-with-amazon-sagemaker-feature-store/

Today, I’m extremely happy to announce Amazon SageMaker Feature Store, a new capability of Amazon SageMaker that makes it easy for data scientists and machine learning engineers to securely store, discover and share curated data used in training and prediction workflows.

For all the importance of selecting the right algorithm to train machine learning (ML) models, experienced practitioners know how crucial it is to feed it with high-quality data. Cleaning data is a good first step, and ML workflows routinely include steps to fill missing values, remove outliers, and so on. Then, they often move on to transforming data, using a mix of common and arcane techniques known as “feature engineering.”

Simply put, the purpose of feature engineering is to transform your data and to increase its expressiveness so that the algorithm may learn better. For instance, many columnar datasets include strings, such as street addresses. To most ML algorithms, strings are meaningless, and they need to be encoded in a numerical representation. Thus, you could replace street addresses with GPS coordinates, a much more expressive way to learn the concept of location. In other words, if data is the new oil, then feature engineering is the refining process that turns it into high-octane jet fuel that helps models get to stratospheric accuracy.

Indeed, ML practitioners spend a lot of time crafting feature engineering code, applying it to their initial datasets, training models on the engineered datasets, and evaluating model accuracy. Given the experimental nature of this work, even the smallest project will lead to multiple iterations. The same feature engineering code is often run again and again, wasting time and compute resources on repeating the same operations. In large organizations, this can cause an even greater loss of productivity, as different teams often run identical jobs, or even write duplicate feature engineering code because they have no knowledge of prior work.

There’s another hard problem that ML teams have to solve. As models are trained on engineered datasets, it’s imperative to apply the same transformations to data sent for prediction. This often means rewriting feature engineering code, sometimes in a different language, integrating it in your prediction workflow, and running it at prediction time. This whole process is not only time-consuming, it can also introduce inconsistencies, as even the tiniest variation in a data transform can have a large impact on predictions.

In order to solve these problems, ML teams sometimes build a feature store, a central repository where they can keep and retrieve engineered data used in their training and predictions jobs. As useful as feature stores are, building and managing your own involves a lot of engineering, infrastructure, and operational effort that takes valuable time away from actual ML work. Customers asked us for a better solution, and we got to work.

Introducing Amazon SageMaker Feature Store
Amazon SageMaker Feature Store is a fully managed centralized repository for your ML features, making it easy to securely store and retrieve features without having to manage any infrastructure. It’s part of Amazon SageMaker, our fully managed service for ML, and supports all algorithms. It’s also integrated with Amazon SageMaker Studio, our web-based development environment for ML.

Features stored in SageMaker Feature Store are organized in groups, and tagged with metadata. Thanks to this, you can quickly discover which features are available, and whether they’re suitable for your models. Multiple teams can also easily share and re-use features, reducing the cost of development and accelerating innovation.

Once stored, features can be retrieved and used in your SageMaker workflows: model training, batch transform, and real-time prediction with low latency. Not only do you avoid duplicating work, you also build consistent workflows that use the same features stored in the offline and online stores.

The Climate Corporation (Climate) is a subsidiary of Bayer, and the industry leader in bringing digital innovation to farmers. Says Daniel McCaffrey, Vice President, Data and Analytics, Climate: “At Climate, we believe in providing the world’s farmers with accurate information to make data driven decisions and maximize their return on every acre. To achieve this, we have invested in technologies such as machine learning tools to build models using measurable entities known as features, such as yield for a grower’s field. With Amazon SageMaker Feature Store, we can accelerate the development of ML models with a central feature store to access and reuse features across multiple teams easily. SageMaker Feature Store makes it easy to access features in real-time using the online store, or run features on a schedule using the offline store for different use cases, and we can develop ML models faster.”

Care.com, the world’s leading platform for finding and managing high-quality family care, is also using Amazon SageMaker Feature Store. This is what Clemens Tummeltshammer, Data Science Manager, Care.com, told us: “A strong care industry where supply matches demand is essential for economic growth from the individual family up to the nation’s GDP. We’re excited about Amazon SageMaker Feature Store and Amazon SageMaker Pipelines, as we believe they will help us scale better across our data science and development teams, by using a consistent set of curated data that we can use to build scalable end-to-end machine learning model pipelines from data preparation to deployment. With the newly announced capabilities of Amazon SageMaker, we can accelerate development and deployment of our ML models for different applications, helping our customers make better informed decisions through faster real-time recommendations.”

Now, let’s see how you can get started.

Storing and Retrieving Features with Amazon SageMaker Feature Store
Once you’ve run your feature engineering code on your data, you can organize and store your engineered features in SageMaker Feature Store, by grouping them in feature groups. A feature group is a collection of records, similar to rows in a table. Each record has a unique identifier, and holds the engineered feature values for one of the data instances in your original data source. Optionally, you can choose to encrypt the data at rest using your own AWS Key Management Service (KMS) key that is unique for each feature group.

How you define feature groups is up to you. For example, you could create one per data source (CSV files, database tables, and so on), and use a convenient unique column as the record identifier (primary key, customer id, transaction id, and so on).

Once you’ve got your groups figured out, you should repeat the following steps for each group:

  1. Create feature definitions, with the name and the type of each feature in a record (Fractional, Integral, or String).
  2. Create each feature group with the create_feature_group() API:
    sm_feature_store.create_feature_group(
         # The name of the feature group
         FeatureGroupName=my_feature_group_name,
         # The name of the column acting as the record identifier
         RecordIdentifierName=record_identifier_name,
         # The name of the column acting as the feature timestamp
         EventTimeFeatureName=event_time_feature_name,
         # A list of feature names and types
         FeatureDefinitions=my_feature_definitions,
         # Optionally, enable the online feature store
         OnlineStoreConfig=online_store_config,
         # The S3 location for the offline feature store
         OfflineStoreConfig=offline_store_config,
         # An IAM role
         RoleArn=role
    )
  3. In each feature group, store records containing a collection of feature name/feature value pairs, using the put_record() API:
    sm_feature_store.put_record(
       FeatureGroupName=feature_group_name,
       Record=record,
       EventTime=event_time
    )

    For faster ingestion, you could create multiple threads and parallelize this operation.

At this point, features will be available in Amazon SageMaker Feature Store. Thanks to the offline store, you can use services such as Amazon Athena, AWS Glue, or Amazon EMR to build datasets for training: fetch the corresponding JSON objects in S3, select the features that you need, and save them in S3 in the format expected by your ML algorithm. From then on, it’s SageMaker business as usual!
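
As an illustration, here is a hedged sketch of querying the offline store with Athena through the SageMaker Python SDK; the feature group name and the S3 output location for query results are assumptions.

import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
feature_group = FeatureGroup(name="my-feature-group", sagemaker_session=session)

# Run an Athena query against the Glue table backing the offline store
query = feature_group.athena_query()
query.run(
    query_string=f'SELECT * FROM "{query.table_name}"',
    output_location="s3://my-bucket/athena-results/",
)
query.wait()

# Load the results into a pandas DataFrame, ready to be saved in the
# format expected by your training algorithm
training_df = query.as_dataframe()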

In addition, you can use the get_record() API to access individual records stored in the online store, passing the group name and the unique identifier of the record you want to access, like so:

record = sm_feature_store.get_record(
    FeatureGroupName=my_feature_group_name,
    RecordIdentifierValue={"IntegralValue": 5962}
)

Amazon SageMaker Feature Store is designed for fast and efficient access for real time inference, with a P95 latency lower than 10ms for a 15-kilobyte payload. This makes it possible to query for engineered features at prediction time, and to replace raw features sent by the upstream application with the exact same features used to train the model. Feature inconsistencies are eliminated by design, letting you focus on building the best models instead of chasing bugs.

Finally, as SageMaker Feature Store includes feature creation timestamps, you can retrieve the state of your features at a particular point in time.

As Amazon SageMaker Feature Store is integrated with SageMaker Studio, I can see my two feature groups there.

SageMaker screenshot

Right-clicking on a feature group and selecting “Open feature group detail”, I open the identity feature group.

SageMaker screenshot

I can see feature definitions.

SageMaker screenshot

Finally, I can generate queries for the offline store, which I could add to an Amazon SageMaker Data Wrangler workflow to load features prior to training.

SageMaker screenshot

How to Get Started with Amazon SageMaker Feature Store
As you can see, SageMaker Feature Store makes it easy to store, retrieve, and share features required by your training and prediction workflows.

SageMaker Feature Store is available in all regions where SageMaker is available. Pricing is based on feature reads and writes, and on the total amount of data stored.

Here are sample notebooks that will help you get started right away. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Amazon HealthLake Stores, Transforms, and Analyzes Health Data in the Cloud

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-amazon-healthlake-to-store-transform-and-analyze-petabytes-of-health-and-life-sciences-data-in-the-cloud/

Healthcare organizations collect vast amounts of patient information every day, from family history and clinical observations to diagnoses and medications. They use all this data to try to compile a complete picture of a patient’s health information in order to provide better healthcare services. Currently, this data is distributed across various systems (electronic medical records, laboratory systems, medical image repositories, etc.) and exists in dozens of incompatible formats.

Emerging standards, such as Fast Healthcare Interoperability Resources (FHIR), aim to address this challenge by providing a consistent format for describing and exchanging structured data across these systems. However, much of this data is unstructured information contained in medical records (e.g., clinical records), documents (e.g., PDF lab reports), forms (e.g., insurance claims), images (e.g., X-rays, MRIs), audio (e.g., recorded conversations), and time series data (e.g., heart electrocardiogram) and it is challenging to extract this information.

It can take weeks or months for a healthcare organization to collect all this data and prepare it for transformation (tagging and indexing), structuring, and analysis. Furthermore, the cost and operational complexity of doing all this work is prohibitive for most healthcare organizations.

Many types of data to analyze

Today, we are happy to announce Amazon HealthLake, a fully managed, HIPAA-eligible service, now in preview, that allows healthcare and life sciences customers to aggregate their health information from different silos and formats into a centralized AWS data lake. HealthLake uses machine learning (ML) models to normalize health data and automatically understand and extract meaningful medical information from the data so all this information can be easily searched. Then, customers can query and analyze the data to understand relationships, identify trends, and make predictions.

How It Works
Amazon HealthLake supports copying your data from on premises to the AWS Cloud, where you can store your structured data (like lab results) as well as unstructured data (like clinical notes), which HealthLake will tag and structure in FHIR. All the data is fully indexed using standard medical terms so you can quickly and easily query, search, analyze, and update all of your customers’ health information.

Overview of HealthLake

With HealthLake, healthcare organizations can collect and transform patient health information in minutes and have a complete view of a patient’s medical history, structured in the FHIR industry standard format with powerful search and query capabilities.

From the AWS Management Console, healthcare organizations can use the HealthLake API to copy their on-premises healthcare data to a secure data lake in AWS with just a few clicks. If your source system is not configured to send data in FHIR format, you can use a list of AWS partners to easily connect and convert your legacy healthcare data format to FHIR.

HealthLake is Powered by Machine Learning
HealthLake uses specialized ML models such as natural language processing (NLP) to automatically transform raw data. These models are trained to understand and extract meaningful information from unstructured health data.

For example, HealthLake can accurately identify patient information from medical histories, physician notes, and medical imaging reports. It then provides the ability to tag, index, and structure the transformed data to make it searchable by standard terms such as medical condition, diagnosis, medication, and treatment.

Queries on tens of thousands of patient records are very simple. For example, a healthcare organization can create a list of diabetic patients based on similarity of medications by selecting “diabetes” from the standard list of medical conditions, selecting “oral medications” from the treatment menu, and refining the results by gender before running the search.

Healthcare organizations can use Jupyter notebook templates in Amazon SageMaker to quickly and easily run analysis on the normalized data for common tasks like diagnosis predictions, hospital re-admittance probability, and operating room utilization forecasts. These models can, for example, help healthcare organizations predict the onset of disease. With just a few clicks in a pre-built notebook, healthcare organizations can apply ML to their historical data and predict when a diabetic patient will develop hypertension in the next five years. Operators can also build, train, and deploy their own ML models on data using Amazon SageMaker directly from the AWS Management Console.

Let’s Create Your Own Data Store and Start to Test
Starting to use HealthLake is simple. You access the AWS Management Console, and select Create a datastore.

If you click Preload data, HealthLake will load test data so you can start to explore its features. You can also upload your own data if you already have FHIR 4 compliant data: upload it to an S3 bucket, and import it by specifying the bucket name.
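
If you prefer to script this step, the equivalent call through boto3 could look roughly like the sketch below; the data store name is a placeholder, and the preload option mirrors the “Preload data” choice described above.

import boto3

healthlake = boto3.client("healthlake", region_name="us-east-1")

# Create a FHIR R4 data store preloaded with synthetic test data
response = healthlake.create_fhir_datastore(
    DatastoreName="my-test-datastore",
    DatastoreTypeVersion="R4",
    PreloadDataConfig={"PreloadDataType": "SYNTHEA"},
)
print(response["DatastoreId"], response["DatastoreStatus"])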

Once your Data Store is created, you can perform a Search, Create, Read, Update or Delete FHIR Query Operation. For example, if you need a list of every patient located in New York, your query setting looks like the screenshots below. As per the FHIR specification, deleted data is only hidden from analysis and results; it is not deleted from the service, only versioned.

Creating Query

 

You can choose Add search parameter for more nested conditions of the query as shown below.

Amazon HealthLake is Now in Preview
Amazon HealthLake is in preview starting today in US East (N. Virginia). Please check our web site and technical documentation for more information.

– Kame

Bringing machine learning to more builders through databases and analytics services

Post Syndicated from Swami Sivasubramanian original https://aws.amazon.com/blogs/big-data/bringing-machine-learning-to-more-builders-through-databases-and-analytics-services/

Machine learning (ML) is becoming more mainstream, but even with the increasing adoption, it’s still in its infancy. For ML to have the broad impact that we think it can have, it has to get easier to do and easier to apply. We launched Amazon SageMaker in 2017 to remove the challenges from each stage of the ML process, making it radically easier and faster for everyday developers and data scientists to build, train, and deploy ML models. SageMaker has made ML model building and scaling more accessible to more people, but there’s a large group of database developers, data analysts, and business analysts who work with databases and data lakes where much of the data used for ML resides. These users still find it too difficult and involved to extract meaningful insights from that data using ML.

This group is typically proficient in SQL but not Python, and must rely on data scientists to build the models needed to add intelligence to applications or derive predictive insights from data. And even when you have the model in hand, there’s a long and involved process to prepare and move data to use the model. The result is that ML isn’t being used as much as it can be.

To meet the needs of this large and growing group of builders, we’re integrating ML into AWS databases, analytics, and business intelligence (BI) services.

AWS customers generate, process, and collect more data than ever to better understand their business landscape, market, and customers. And you don’t just use one type of data store for all your needs. You typically use several types of databases, data warehouses, and data lakes, to fit your use case. Because all these use cases could benefit from ML, we’re adding ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can train models on their data or add inference results right from their database, without having to export and process their data or write large amounts of ETL code.

Machine Learning for database developers

At re:Invent last year, we announced ML integrated inside Amazon Aurora for developers working with relational databases. Previously, adding ML using data from Aurora to an application was a very complicated process. First, a data scientist had to build and train a model, then write the code to read data from the database. Next, you had to prepare the data so it can be used by the ML model. Then, you called an ML service to run the model, reformat the output for your application, and finally load it into the application.

Now, with a simple SQL query in Aurora, you can add ML to an enterprise application. When you run an ML query in Aurora using SQL, it can directly access a wide variety of ML models from Amazon SageMaker and Amazon Comprehend. The integration between Aurora and each AWS ML service is optimized, delivering up to 100 times better throughput when compared to moving data between Aurora and SageMaker or Amazon Comprehend without this integration. Because the ML model is deployed separately from the database and the application, each can scale up or scale out independently of the other.

In addition to making ML available in relational databases, combining ML with certain types of non-relational database models can also lead to better predictions. For example, database developers use Amazon Neptune, a purpose-built, high-performance graph database, to store complex relationships between data in a graph data model. You can query these graphs for insights and patterns and apply the results to implement capabilities such as product recommendations or fraud detection.

However, human intuition and analyzing individual queries is not enough to discover the full breadth of insights available from large graphs. ML can help, but as was the case with relational databases it requires you to do a significant amount of heavy lifting upfront to prepare the graph data and then select the best ML model to run against that data. The entire process can take weeks.

To help with this, today we announced the general availability of Amazon Neptune ML to provide database developers access to ML purpose-built for graph data. This integration is powered by SageMaker and uses the Deep Graph Library (DGL), a framework for applying deep learning to graph data. It does the hard work of selecting the graph data needed for ML training, automatically choosing the best model for the selected data, exposing ML capabilities via simple graph queries, and providing templates to allow you to customize ML models for advanced scenarios. The following diagram illustrates this workflow.

And because the DGL is purpose-built to run deep learning on graph data, you can improve the accuracy of most predictions by over 50% compared to that of traditional ML techniques.

Machine Learning for data analysts

At re:Invent last year, we announced ML integrated inside Amazon Athena for data analysts. With this integration, you can access more than a dozen built-in ML models or use your own models in SageMaker directly from ad-hoc queries in Athena. As a result, you can easily run ad-hoc queries in Athena that use ML to forecast sales, detect suspicious logins, or sort users into customer cohorts.

Similarly, data analysts also want to apply ML to the data in their Amazon Redshift data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day. These Amazon Redshift users want to run ML on their data in Amazon Redshift without having to write a single line of Python. Today we announced the preview of Amazon Redshift ML to do just that.

Amazon Redshift now enables you to run ML algorithms on Amazon Redshift data without manually selecting, building, or training an ML model. Amazon Redshift ML works with Amazon SageMaker Autopilot, a service that automatically trains and tunes the best ML models for classification or regression based on your data while allowing full control and visibility.

When you run an ML query in Amazon Redshift, the selected data is securely exported from Amazon Redshift to Amazon Simple Storage Service (Amazon S3). SageMaker Autopilot then performs data cleaning and preprocessing of the training data, automatically creates a model, and applies the best model. All the interactions between Amazon Redshift, Amazon S3, and SageMaker are abstracted away and automatically occur. When the model is trained, it becomes available as a SQL function for you to use. The following diagram illustrates this workflow.

Rackspace Technology, a leading end-to-end multicloud technology services company, and Slalom, a modern consulting firm focused on strategy, technology, and business transformation, are both using Amazon Redshift ML in preview.

Nihar Gupta, General Manager for Data Solutions at Rackspace Technology, says: “At Rackspace Technology, we help companies elevate their AI/ML operations… the seamless integration with Amazon SageMaker will empower data analysts to use data in new ways, and provide even more insight back to the wider organization.”

And Marcus Bearden, Practice Director at Slalom, shared: “We hear from our customers that they want to have the skills and tools to get more insight from their data, and Amazon Redshift is a popular cloud data warehouse that many of our customers depend on to power their analytics. The new Amazon Redshift ML feature will make it easier for SQL users to get new types of insight from their data with machine learning, without learning new skills.”

Machine Learning for business analysts

To bring ML to business analysts, we launched new ML capabilities in Amazon QuickSight earlier this year called ML Insights. ML Insights uses SageMaker Autopilot to enable business analysts to perform ML inference on their data and visualize it in BI dashboards with just a few clicks. You can use it for use cases that require ML, such as anomaly detection to uncover hidden insights by continuously analyzing billions of data points, and forecasting to predict growth and other business trends. In addition, QuickSight can also give you an automatically generated summary in plain language (a capability we call auto-narratives), which interprets and describes what the data in your dashboard means. See the following screenshot for an example.

Customers like Expedia Group, Tata Consultancy Services, and Ricoh Company are already benefiting from ML out of the box with QuickSight. These human-readable narratives enable you to quickly interpret the data in a shared dashboard and focus on the insights that matter most.

In addition, customers have also been interested in asking questions of their business data in plain language and receiving answers in near-real time. Although some BI tools and vendors have attempted to solve this challenge with Natural Language Query (NLQ), the existing approaches require that you first spend months in advance preparing and building a model on a pre-defined set of data, and even then, you still have no way of asking ad hoc questions when those questions require a new calculation that wasn’t pre-defined in the data model. For example, the question “What is our year-over-year growth rate?” requires that “growth rate” be pre-defined as a calculation in the model. With today’s BI tools, you need to work with your BI teams to create and update the model to account for any new calculation or data, which can take days or weeks of effort.

Last week, we announced Amazon QuickSight Q. ‘Q’ gives business analysts the ability to ask any question of all their data and receive an accurate answer in seconds. To ask a question, you simply type it into the QuickSight Q search bar using natural language and business terminology that you’re familiar with. Q uses ML (natural language processing, schema understanding, and semantic parsing for SQL code generation) to automatically generate a data model that understands the meaning of and relationships between business data, so you can get answers to your business questions without waiting weeks for a data model to be built. Because Q eliminates the need to build a data model, you’re also not limited to asking only a specific set of questions. See the following screenshot for an example.

Best Western Hotels & Resorts is a privately-held hotel brand with a global network of approximately 4,700 hotels in over 100 countries and territories worldwide. “With Amazon QuickSight Q, we look forward to enabling our business partners to self-serve their ad hoc questions while reducing the operational overhead on our team for ad hoc requests,” said Joseph Landucci, Senior Manager of Database and Enterprise Analytics at Best Western Hotels & Resorts. “This will allow our partners to get answers to their critical business questions quickly by simply typing and searching their questions in plain language.”

Summary

For ML to have a broad impact, we believe it has to get easier to do and easier to apply. Database developers, data analysts, and business analysts who work with databases and data lakes have found it too difficult and involved to extract meaningful insights from their data using ML. To meet the needs of this large and growing group of builders, we’ve added ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can all use ML more easily without the need to be an ML expert. These capabilities put ML in the hands of every data professional so that they can get the most value from their data.


About the Authors

Swami Sivasubramanian is Vice President at AWS in charge of all Amazon AI and Machine Learning services. His team’s mission is “to put machine learning capabilities in the hands of every developer and data scientist.” Swami and the AWS AI and ML organization work on all aspects of machine learning, from ML frameworks (TensorFlow, Apache MXNet, and PyTorch) and infrastructure, to Amazon SageMaker (an end-to-end service for building, training and deploying ML models in the cloud and at the edge), and finally AI services (Transcribe, Translate, Personalize, Forecast, Rekognition, Textract, Lex, Comprehend, Kendra, etc.) that make it easier for app developers to incorporate ML into their apps with no ML experience required.

Previously, Swami managed AWS’s NoSQL and big data services. He managed the engineering, product management, and operations for AWS database services that are the foundational building blocks for AWS: DynamoDB, Amazon ElastiCache (in-memory engines), Amazon QuickSight, and a few other big data services in the works. Swami has been awarded more than 250 patents, authored 40 refereed scientific papers and journal articles, and participates in several academic circles and conferences.

 

Herain Oberoi leads Product Marketing for AWS’s Databases, Analytics, BI, and Blockchain services. His team is responsible for helping customers learn about, adopt, and successfully use AWS services. Prior to AWS, he held various product management and marketing leadership roles at Microsoft and a successful startup that was later acquired by BEA Systems. When he’s not working, he enjoys spending time with his family, gardening, and exercising.

Preview: Amazon Lookout for Metrics, an Anomaly Detection Service for Monitoring the Health of Your Business

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/preview-amazon-lookout-for-metrics-anomaly-detection-service-monitoring-health-business/

We are excited to announce Amazon Lookout for Metrics, a new service that uses machine learning (ML) to detect anomalies in your metrics, helping you proactively monitor the health of your business, diagnose issues, and find opportunities quickly – with no ML experience required.

Lookout for Metrics uses the same technology used by Amazon to detect anomalous changes in data that are otherwise hard to find, while reducing the number of false detections. It also groups similar findings together, ranks them by severity, and provides information to determine the root cause of the anomalies.

It can be used across a wide variety of metrics such as revenue, web page views, daily active users, churn rate, transaction volume, mobile app installations, and more. Lookout for Metrics is available in preview starting today.

Why Use Amazon Lookout for Metrics for Anomaly Detection?
Organizations across all industries are looking to improve efficiency in their business through technology and automation. While challenges may vary, what’s common is that being able to identify defects and opportunities early and often can lead to material cost savings, higher margins, and better customer experience. Traditionally, organizations rely on manual audits of large amounts of data, which is not scalable and is prone to human error. Others use rule-based methods based on arbitrary ranges, which are often static, do not easily adapt to seasonality changes, and lead to too many false detections.

When anomalies are detected, developers, analysts, and business owners can spend weeks trying to identify the root cause of the change. These are situations where ML can be an effective and transformational tool. However, ML algorithms need to be carefully selected, trained, tested, and deployed for each type of data – requiring a skilled team of ML experts.

Amazon has a long history of being a data-driven company, with a growing number of businesses that need to stay on top of the health of their business, operations, and customer experience. A key part of this effort over the years has involved building and improving ML technology to detect anomalies in key performance indicators (KPI) such as website visits from different traffic channels, number of products added to the shopping cart, number of orders placed, revenue for every product category, and more.

Amazon Lookout for Metrics puts the same ML technology used by Amazon in the hands of every developer. It finds anomalies in your data, groups them intelligently, helps you visualize aggregated results, and automates alerts.

Because it’s a fully managed service, it takes care of the whole ML process so you can get started quickly and focus on your core business. And most importantly, the service improves model performance continually by incorporating your real-time feedback on the accuracy and relevance of the anomalies and root cause analysis.

How Amazon Lookout for Metrics Works
You can get started with Lookout for Metrics with just a few clicks in the AWS Management Console. Without having to write any code, you connect your data to the service through the built-in data source integrations; next Lookout for Metrics trains a custom model for your data; and finally, it begins detecting anomalies for you to review and start taking action on.

Lookout for Metrics continuously monitors data stored in Amazon Simple Storage Service (S3), Amazon Relational Database Service (RDS), Amazon Redshift, Amazon CloudWatch, or SaaS integrations supported by Amazon AppFlow such as Salesforce, Marketo, Google Analytics, Slack, Zendesk, and many more.

During this phase, you can flag each field in your dataset as a measure (or KPI), dimension, or timestamp. For example, if you want to monitor abnormal changes in page views for every device type separately, then you would select page_views as the measure and device_type as the dimension.

Once your data source is configured and connected, Lookout for Metrics inspects and prepares the data for analysis and selects the right algorithm to build the most accurate anomaly detection model. This detector runs on your data at a configurable cadence (every few minutes, hourly, daily, and so on) and provides a threshold dial that allows you to adjust its sensitivity.

When detecting an anomaly, Lookout for Metrics helps you focus on what matters the most by assigning a severity score to aid prioritization. To help you find the root cause, it intelligently groups anomalies that may be related to the same incident and summarizes the different sources of impact (as shown below).

Moreover, you can configure an automatic action such as sending a notification via Amazon Simple Notification Service (SNS), Datadog, PagerDuty, Webhooks, or Slack. Or you can trigger a Lambda function, for example to temporarily hide a product on your e-commerce site when a potential pricing error is detected.

Domain knowledge and expertise can often play an important role in determining if a sudden change in a metric is expected or is an anomaly. Lookout for Metrics allows you to provide real-time feedback on the relevance of the detected anomalies, enabling a powerful human-in-the-loop mechanism. This information is fed back to the anomaly detection model to improve its accuracy.

Who’s Using Amazon Lookout for Metrics Today?
Digitata Networks offers intelligent customer, network and site-centric solutions that assist Mobile Network Operators to monitor, audit, control and automate different aspects of their network. Company CTO Nico Kruger has been pleased with the results he’s seen so far from using Lookout for Metrics.

“We discovered the improved accuracy and insights that Lookout for Metrics can bring to our existing solution and we are thrilled to use the service…we can quickly identify opportunities in addition to finding issues,” he said.

Playrix, one of the leading mobile game developers in the world, known for high-quality games such as Township, Fishdom, and Gardenscapes, is another customer that’s been working with the new service. “We experimented with our user acquisition data to understand how the service works and it quickly identified and grouped anomalies enabling us to work faster and better,” said Mikhail Artyugin, Playrix technical director.

“Lookout for Metrics has saved our team many hours of manual investigation and now notifications are viewed as actionable rather than noise, allowing our teams to easily focus on strategic priorities with less technical overhead,” he added.

“Working with almost a billion impressions every day to capture insights and intent for our customers, we need quick feedback on real data anomalies,” said Brian Ecker, a senior staff engineer at NextRoll, a marketing and data technology company with the mission to provide innovative solutions to companies to keep them growing.

“After working with the Lookout for Metrics team, we saw the improved accuracy that the new service can bring to our existing anomaly detection process and we are thrilled to start using it.”

It’s also worth noting that APN partners such as TensorIoT, Quantiphi, and Provectus have expertise in Lookout for Metrics and can help customers leverage its functionalities.

Available in Preview
Amazon Lookout for Metrics is now available in preview in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland).

You can interact with the service using the AWS Management Console, the AWS SDKs, and the AWS CLI. Find out more in the technical documentation, and get started quickly by joining the preview at the following link.

Request preview access to Amazon Lookout for Metrics here.

Alex

Amazon SageMaker Edge Manager Simplifies Operating Machine Learning Models on Edge Devices

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-edge-manager-simplifies-operating-machine-learning-models-on-edge-devices/

Today, I’m extremely happy to announce Amazon SageMaker Edge Manager, a new capability of Amazon SageMaker that makes it easier to optimize, secure, monitor, and maintain machine learning models on a fleet of edge devices.

Edge computing is certainly one of the most exciting developments in information technology. Indeed, thanks to continued advances in compute, storage, networking, and battery technology, organizations routinely deploy large numbers of embedded devices anywhere on the planet for a wide range of industry applications: manufacturing, energy, agriculture, healthcare, and more. Ranging from simple sensors to large industrial machines, the devices have a common purpose: capture data, analyze it, and act on it, for example by sending an alert if an unwanted condition is detected.

As machine learning (ML) demonstrated its ability to solve a wide range of business problems, customers tried to apply it to edge applications, training models in the cloud and deploying them at the edge in an effort to extract deeper insights from local data. However, given the remote and constrained nature of edge devices, deploying and managing models at the edge is often quite difficult.

For example, a complex model can be too large to fit on the device, forcing customers to settle for a smaller and less accurate model. Also, predicting with several models on the same device (say, to detect different types of anomalies) may require additional code to load and unload models on demand, in order to conserve hardware resources. Finally, monitoring prediction quality is a major concern, as the real world will always be more complex and unpredictable than any training set can anticipate.

Customers asked us to help them solve these challenges, and we got to work.

Announcing Amazon SageMaker Edge Manager
Amazon SageMaker Edge Manager makes it easy for ML edge developers to use the same familiar tools in the cloud or on edge devices. It reduces the time and effort required to get models to production, while continuously monitoring and improving model quality across your device fleet.

Starting from a model that you trained or imported in Amazon SageMaker, SageMaker Edge Manager first optimizes it for your hardware platform using Amazon SageMaker Neo. Launched two years ago, Neo converts models into an efficient common format which is executed on the device by a low footprint runtime. Neo currently supports devices based on chips manufactured by Ambarella, ARM, Intel, NVIDIA, NXP, Qualcomm, TI, and Xilinx.

Then, SageMaker Edge Manager packages the model, and stores it in Amazon Simple Storage Service (S3), where it can be deployed to your devices. In fact, you can deploy multiple models, loading and predicting with a runtime optimized for your hardware of choice.

On-device models are managed by the SageMaker Edge Manager agent, which communicates with the AWS Cloud for model deployment, and with your application for model management. Indeed, you can integrate this agent with your application, so that it may automatically load and unload models according to your prediction requests. This enables a variety of scenarios, such as freeing all resources for a large model whenever needed, or working with a collection of smaller models that cohabit in memory.

Lenovo, the #1 global PC maker, recently incorporated Amazon SageMaker into its latest predictive maintenance offering. Igor Bergman, Lenovo Vice President, Cloud & Software of PCs and Smart Devices, told us: “At Lenovo, we’re more than a hardware provider and are committed to being a trusted partner in transforming customers’ device experience and delivering on their business goals. Lenovo Device Intelligence is a great example of how we’re doing this with the power of machine learning, enhanced by Amazon SageMaker. With Lenovo Device Intelligence, IT administrators can proactively diagnose PC issues and help predict potential system failures before they occur, helping to decrease downtime and increase employee productivity. By incorporating Amazon SageMaker Neo, we’ve already seen a substantial improvement in the execution of our on-device predictive models – an encouraging sign for the new Amazon SageMaker Edge Manager that will be added in the coming weeks. SageMaker Edge Manager will help eliminate the manual effort required to optimize, monitor, and continuously improve the models after deployment. With it, we expect our models will run faster and consume less memory than with other comparable machine learning platforms. As we extend AI to new applications across the Lenovo services portfolio, we will continue to require a high-performance pipeline that is flexible and scalable both in the cloud and on millions of edge devices. That’s why we selected the Amazon SageMaker platform. With its rich edge-to-cloud and CI/CD workflow capabilities, we can effectively bring our machine learning models to any device workflow for much higher productivity.

Getting Started
As you can see, SageMaker Edge Manager makes it easier to work with ML models deployed on edge devices. It’s available today in the US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), Europe (Frankfurt), and Asia Pacific (Tokyo) regions.

Sample notebooks are available to get you started right away. Give them a try, and let us know what you think.

We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

New – Amazon SageMaker Clarify Detects Bias and Increases the Transparency of Machine Learning Models

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-amazon-sagemaker-clarify-detects-bias-and-increases-the-transparency-of-machine-learning-models/

Today, I’m extremely happy to announce Amazon SageMaker Clarify, a new capability of Amazon SageMaker that helps customers detect bias in machine learning (ML) models, and increase transparency by helping explain model behavior to stakeholders and customers.

As ML models are built by training algorithms that learn statistical patterns present in datasets, several questions immediately come to mind. First, can we ever hope to explain why our ML model comes up with a particular prediction? Second, what if our dataset doesn’t faithfully describe the real-life problem we were trying to model? Could we even detect such issues? Would they introduce some sort of bias in imperceptible ways? As we will see, these are not speculative questions at all. They are very real, and their implications can be far-reaching.

Let’s start with the bias problem. Imagine that you’re working on a model detecting fraudulent credit card transactions. Fortunately, the huge majority of transactions are legitimate, and they make up 99.9% of your dataset, meaning that you only have 0.1% fraudulent transactions, say 100 out of 100,000. Training a binary classification model (legitimate vs. fraudulent), there’s a strong chance that it would be strongly influenced or biased by the majority group. In fact, a trivial model could simply decide that transactions are always legitimate: as useless as this model would be, it would still be right 99.9% of the time! This simple example shows how careful we have to be about the statistical properties of our data, and about the metrics that we use to measure model accuracy.
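To see this concretely, here's a minimal sketch using scikit-learn metrics: a "model" that always predicts "legitimate" scores 99.9% accuracy on such a dataset, yet catches zero fraudulent transactions.

from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 99900 + [1] * 100  # 0 = legitimate, 1 = fraudulent
y_pred = [0] * 100000             # trivial model: everything is legitimate

print(accuracy_score(y_true, y_pred))  # 0.999
print(recall_score(y_true, y_pred))    # 0.0 -- no fraud detected at all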

There are many variants of this under-representation problem. As the number of classes, features, and unique feature values increase, your dataset may only contain a tiny number of training instances for certain groups. In fact, some of these groups may correspond to various socially sensitive features such as gender, age range, or nationality. Under-representation for such groups could result in a disproportionate impact on their predicted outcomes.

Unfortunately, even with the best of intentions, bias issues may exist in datasets and be introduced into models with business, ethical, and regulatory consequences. It is thus important for model administrators to be aware of potential sources of bias in production systems.

Now, let’s discuss the explainability problem. For simple and well-understood algorithms like linear regression or tree-based algorithms, it’s reasonably easy to crack the model open, inspect the parameters that it learned during training, and figure out which features it predominantly uses. You can then decide whether this process is consistent with your business practices, basically saying: “yes, this is how a human expert would have done it.”

However, as models become more and more complex (I’m staring at you, deep learning), this kind of analysis becomes impossible. Just like the prehistoric tribes in Stanley Kubrick’s “2001: A Space Odyssey,” we’re often left staring at an impenetrable monolith and wondering what it all means. Many companies and organizations may need ML models to be explainable before they can be used in production. In addition, some regulations may require explainability when ML models are used as part of consequential decision making, and closing the loop, explainability can also help detect bias.

Thus, our customers asked us for help on detecting bias in their datasets and their models, and on understanding how their models make predictions. We got to work, and came up with SageMaker Clarify.

Introducing Amazon SageMaker Clarify
SageMaker Clarify is a new set of capabilities for Amazon SageMaker, our fully managed ML service. It’s integrated with SageMaker Studio, our web-based integrated development environment for ML, as well as with other SageMaker capabilities like Amazon SageMaker Data Wrangler, Amazon SageMaker Experiments, and Amazon SageMaker Model Monitor.

Thanks to SageMaker Clarify, data scientists are able to:

  • Detect bias in datasets prior to training, and in models after training.
  • Measure bias using a variety of statistical metrics.
  • Explain how feature values contribute to the predicted outcome, both for the model overall and for individual predictions.
  • Detect bias drift and feature importance drift over time, thanks to the integration with Amazon SageMaker Model Monitor.

Let’s look at each of these capabilities.

Detecting dataset bias: This is an important first step. Indeed, a heavily biased dataset may well be unsuitable for training. Knowing this early on certainly saves you time, money, and frustration! Looking at bias metrics computed by SageMaker Clarify on your dataset, you can then add your own bias reduction techniques to your data processing pipeline. Once the dataset has been revised and processed, you can measure bias again, and check if it has actually decreased.

Detecting model bias: After you’ve trained your model, you can run a SageMaker Clarify bias analysis, which includes automatic deployment to a temporary endpoint, and computation of bias metrics using your model and dataset. By computing these metrics, you can figure out if your trained model has similar predictive behavior across groups.

Measuring bias: SageMaker Clarify lets you pick from many different bias metrics. I'll just give you a few examples here, with a small hand-rolled sketch after the list.

  • Difference in positive proportions in labels (DPL): Are labels in the dataset correlated or not with specific sensitive feature values? For example, do people living in a certain city have a better chance of getting a positive answer?
  • Difference in positive proportions in predicted labels (DPPL): Do we overpredict positive labels for a certain group?
  • Accuracy difference (AD): Are the predictions by the model more accurate for one group than the other?
  • Counterfactuals – Fliptest (FT): Suppose we look at each member of one group, and compare with similar members from the other group. Do they get different model predictions?
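To make the first two metrics concrete, here's that hand-rolled sketch of the underlying idea (not SageMaker Clarify's implementation), assuming a pandas DataFrame with label, prediction, and city columns:

import pandas as pd

def positive_proportion(series):
    # Fraction of positive (1) outcomes in a group
    return (series == 1).mean()

def dpl(df, facet_col, outcome_col, advantaged_value):
    adv = df[df[facet_col] == advantaged_value][outcome_col]
    dis = df[df[facet_col] != advantaged_value][outcome_col]
    return positive_proportion(adv) - positive_proportion(dis)

df = pd.DataFrame({
    "city":       ["A", "A", "A", "B", "B", "B"],
    "label":      [1,   1,   0,   1,   0,   0],
    "prediction": [1,   1,   1,   0,   0,   0],
})
print(dpl(df, "city", "label", "A"))       # DPL, computed on the labels
print(dpl(df, "city", "prediction", "A"))  # DPPL, same formula on the predictions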

Explaining predictions – to explain how your model predicts, SageMaker Clarify supports a popular technique called SHapley Additive exPlanations (SHAP). Originating in game theory, SHAP analyzes for each data instance the individual contribution of feature values to the predicted output, and represents them as a positive or negative value. For example, predicting with a credit application model, you could see that Alice’s application is approved with a score of 87.5%, that her employment status (+27.2%) and her credit score (+32.4%) are the strongest contributors to this score, and that her income level has a slight negative impact (-5%). Such insights are crucial in building trust that the model is working as expected, and in explaining to customers and regulators why it comes up with a particular prediction. Further analysis of the SHAP values for your complete dataset can also help identify the relative importance of features and feature values, potentially leading to the discovery of prediction issues and biases.
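If you'd like to experiment with SHAP outside of SageMaker Clarify, the open-source shap package implements the same technique. A minimal sketch, where model is assumed to be your trained XGBoost model and X a pandas DataFrame of features:

import shap

# Tree-based models (XGBoost, LightGBM, ...) have a fast, exact explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one row of contributions per instance

# Contribution of each feature to the first prediction
print(dict(zip(X.columns, shap_values[0])))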

As you can see, SageMaker Clarify has some pretty powerful features for bias detection and explainability. Fortunately, it also makes them very easy to use. First, you should upload a clean and pre-processed copy of your tabular dataset (CSV or JSON) to Amazon Simple Storage Service (S3). Then, using a built-in container, you just launch an Amazon SageMaker Processing job on your dataset, passing a short configuration file defining the name of the target attribute, the name and values of the sensitive columns to analyze for bias, and the bias metrics that you want to compute. As you would expect, this job runs on fully managed infrastructure. For post-training analysis, a temporary endpoint is also automatically created and deleted by the job. Once the job is complete, results are available in S3 and in SageMaker Studio, and include an auto-generated report that summarizes the results.

Now, I’d like to show you how to get started with SageMaker Clarify.

Exploring Datasets and Models with Amazon SageMaker Clarify
The German Credit Data dataset contains 1,000 labeled credit applications, which I’ve used to train a binary classification model with XGBoost. Each data instance has 20 features, such as credit purpose, credit amount, housing status, employment history, and more. Categorical features have been encoded with Axx values. For example, here’s how the credit history feature is encoded: A30 means ‘no credits taken’, A31 means ‘all credits at this bank paid back duly’, and so on.

In particular, the dataset includes a feature telling us if a customer is a foreign worker. In fact, a quick look at the dataset hints at a large imbalance in favor of foreign workers. Could bias be hiding there? What about the model? Did XGBoost increase or decrease the bias? Which features contribute most to the predicted output? Let’s find out.

After training the model, my next step is to run a SageMaker Clarify bias analysis job on the dataset, using a built-in container image that will compute bias metrics. The job inputs are the dataset, and a JSON configuration file that defines:

  • The name of the target attribute (Class1Good2Bad), and the value for the positive answer (1).
  • The sensitive features to analyze (called “facets”), and their value. Here, we want to focus on instances where ForeignWorker is set to 0, as they seem to be under-represented in the dataset.
  • The bias metrics that the job should compute. As I already have a model, I pass its name so that post-training metrics can be computed on a temporary endpoint.

Here’s the relevant snippet in the configuration file:

"label": "Class1Good2Bad",
"label_values_or_threshold": [1],
"facet": [
        {
         "name_or_index" : "ForeignWorker",
         "value_or_threshold": [0]
        }
    ],
. . .
"methods": {
    "pre_training_bias": {"methods": "all"}
    "post_training_bias": {"methods": "all"}
},
"predictor": {
        "model_name": "xgboost-german-model",
        "instance_type": "ml.m5.xlarge",
        "initial_instance_count": 1
    }

Then, I configure the job inputs (the dataset and the configuration file) and output (the report), passing all appropriate paths in S3:

config_input = ProcessingInput(
                    input_name="analysis_config",
                    source=analysis_config_s3_path,
                    destination="/opt/ml/processing/input/config")
data_input = ProcessingInput(
                    input_name="dataset",
                    source=train_data_s3_path,
                    destination="/opt/ml/processing/input/data")
result_output = ProcessingOutput(
                    source="/opt/ml/processing/output",
                    destination=analysis_result_s3_path,
                    output_name="analysis_result")

Finally, I run the processing job.

import sagemaker
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

analyzer_image_uri = '678264136642.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xai-analyzer:latest'
analyzer = Processor(base_job_name='analyzer',
                     image_uri=analyzer_image_uri,
                     role=sagemaker.get_execution_role(),
                     instance_count=1,
                     instance_type='ml.c5.xlarge')
analyzer.run(inputs=[data_input, config_input], outputs=[result_output])

Once the processing job is complete, I can retrieve the report. Let’s look at bias metrics.

Detecting Bias with Amazon SageMaker Clarify
Here are some of the pre-training bias metrics:

"ForeignWorker": [
    {
     "value_or_threshold": "0",
     "metrics": [
      {
       "name": "CI",
       "description": "Class Imbalance (CI)",
       "value": 0.9225
      },
      {
       "name": "DPL",
       "description": "Difference in Positive Proportions in Labels (DPL)",
       "value": -0.21401904442300435
      },
. . . 

The class imbalance metric confirms our visual impression. With a CI value of 0.92 (the normalized difference between the sizes of the two groups), foreign workers make up the vast majority of the applications, leaving very few domestic workers to assess. Whether this imbalance is responsible or not, we can also see that the difference in positive proportions for domestic workers is quite negative. In other words, there's a smaller proportion of domestic workers with positive labels. This statistical pattern could be picked up by an ML algorithm, leading to a larger proportion of domestic workers getting negative answers. Figuring out whether this is actually legitimate or not would require further analysis, and in any case, it's great that SageMaker Clarify warned us about this potential issue.

As I provided a trained model, post-training metrics are also available. Comparing the DPPL and the DPL, I can see that XGBoost has slightly reduced bias on positive proportions (-18.8% vs -21.4%). We also see that DAR is negative, indicating that the model achieves higher precision for domestic workers compared to foreign workers.

"ForeignWorker": [
    {
     "value_or_threshold": "0",
     "metrics": [
      {
       "name": "DPPL",
       "description": "\"Difference in Positive Proportions in Predicted Labels (DPPL)\")",
       "value": -0.18801124208230213
      },
      {
       "name": "DAR",
       "description": "Difference in Acceptance Rates (DAR)",
       "value": -0.050909090909090904
      },
      {
       "name": "DRR",
       "description": "Difference in Rejection Rates (DRR)",
       "value": 0.0365296803652968
      },
. . .

As SageMaker Clarify is integrated with SageMaker Studio, I can visualize bias metrics there. All I have to do is find the processing job in the list of trials, right-click it, select “Open in trial details”, and open the “Bias report” view.

SageMaker screenshot

Finally, deciding whether a high value of a certain bias metric is problematic involves domain-specific considerations. This needs to be guided by ethical, social, regulatory, and business considerations. Similarly, interventions for removing bias may often need a careful analysis of the entire ML lifecycle, from problem formulation to feedback loops in deployment.

Now, let’s see how SageMaker Clarify helps us understand what features the models base their predictions on.

Explaining Predictions with Amazon SageMaker Clarify
The report includes global SHAP values, showing the relative importance of all the features in the dataset. On the feature importance graph available in SageMaker Studio, I see that the three most important features are credit duration, not having a checking account (A14), and the loan amount. All things being equal, the bank probably sees you as a safer customer if you’re borrowing a small amount over a short period of time, and without the possibility to write checks!

SageMaker screenshot

In S3, I can also find a CSV file with SHAP values for individual data instances, giving me a complete picture of feature and feature value importance.
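For example, you could load that file into a notebook for further analysis. In the sketch below, the bucket and key are placeholders; use the output location of your own job (reading s3:// URLs with pandas requires the s3fs package):

import pandas as pd

shap_df = pd.read_csv("s3://my-bucket/clarify-output/explanations_shap/out.csv")
print(shap_df.head())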

Getting Started
As you can see, SageMaker Clarify is a powerful tool to detect bias and to understand how your model works. You can start using it today in all regions where Amazon SageMaker is available, at no additional cost.

Sample notebooks are available to get you started quickly. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleagues Sanjiv Das, Michele Donini, Jason Gelman, Krishnaram Kenthapadi, Pinar Yilmaz, and Bilal Zafar for their precious help.

Dataset credits: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

New – Profile Your Machine Learning Training Jobs With Amazon SageMaker Debugger

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/profile-your-machine-learning-training-jobs-with-amazon-sagemaker-debugger/

Today, I’m extremely happy to announce that Amazon SageMaker Debugger can now profile machine learning models, making it much easier to identify and fix training issues caused by hardware resource usage.

Despite its impressive performance on a wide range of business problems, machine learning (ML) remains a bit of a mysterious topic. Getting things right is an alchemy of science, craftsmanship (some would say wizardry), and sometimes luck. In particular, model training is a complex process whose outcome depends on the quality of your dataset, your algorithm, its parameters, and the infrastructure you’re training on.

As ML models become ever larger and more complex (I'm looking at you, deep learning), one growing issue is the amount of infrastructure required to train them. For instance, training advanced object detection models such as Mask R-CNN on the publicly available COCO dataset takes well over six hours on a single p3dn.24xlarge instance, even with its eight NVIDIA V100 GPUs. Some customers like autonomous vehicle companies deal with much larger datasets, and train object detection models for several days.

When a complex training job takes this long, the odds that something goes wrong and ruins it are pretty high, not only wasting time and money but also causing lots of frustration. Important work needs to be put on the back burner while you investigate, figure out the root cause, try to fix it, and then run your training job again. Often, you’ll have to iterate quite a few times to nail the problem.

Depending on the ML framework that you use, and sometimes on its version, you may or may not be able to use existing framework-specific tools. Often, you’ll have to build and maintain your own bespoke tools. Even for experienced practitioners, this is a lot of hard work. For regular developers like me, this is an utterly daunting task.

Introducing Model Profiling in Amazon SageMaker Debugger
Launched last year at AWS re:Invent, Amazon SageMaker Debugger is a capability of Amazon SageMaker that automatically identifies complex issues developing in ML training jobs. These include loss not decreasing, exploding gradients, and more.

Now, SageMaker Debugger can also monitor hardware resource usage, and allows you to profile your training job to help you correlate resource usage to ML operations in your training script. Thus, you’ll be able to resolve performance issues much quicker, and iterate through your training job much faster.

Chaim Rand, ML Algorithm Developer at Mobileye, an Intel company building automated driving and driver assistance systems, had the opportunity to work with the new profiling capabilities, and here’s what he told us: “Many of the assisted driving and autonomous vehicle technologies that we develop at Mobileye, rely on training deep neural network models to detect a wide variety of road artifacts, including vehicles, pedestrians, speed bumps, road signs and more. Often, these models train on extremely large datasets, on multiple machines, and for periods of up to several days. For us, at Mobileye, it is imperative that we have a toolkit of advanced performance profiling capabilities, for analyzing the flow of data across the network, CPU, and GPU resources, and for pinpointing performance issues. The profiling functionality in SageMaker Debugger provides just that, taking performance profiling out of the domain of a few specialized experts, and empowering our algorithm developers to maximize training resource utilization, accelerate model convergence, and reduce cost.

At launch, the new profiling capability of SageMaker Debugger is available for TensorFlow 2.x and PyTorch 1.x. All you have to do is to train with the corresponding built-in frameworks in Amazon SageMaker. Distributed training is supported out of the box.

Setting a single parameter in your SageMaker estimator, and without any change to your training code, you can enable the collection of infrastructure and model metrics such as:

  • CPU and GPU utilization,
  • RAM and GPU memory usage,
  • Network I/O,
  • Storage I/O (local storage and Pipe Mode),
  • Python metrics,
  • Data loading time,
  • Time spent in ML operators running on CPU and GPU,
  • Distributed training metrics for Horovod,
  • and many more.

In addition, you can visualize how much time is spent in different phases, such as preprocessing, training loop, and postprocessing. If needed, you can drill down on each training epoch, and even on each function in your training script.

By default, metrics are collected every 500ms, and you can also set this value to 100ms, 200ms, 1s, 5s, and 1min. For finer-grained analysis, you can also enable and disable profiling explicitly in your training code, only capturing metrics for specific parts.

While your training job is running, you can easily visualize these metrics in Amazon SageMaker Studio, our web-based integrated development environment for ML. As you would expect, all data is also available through the SageMaker Debugger API, and you can retrieve it to build your own graphs.

Running in parallel with the training job, an Amazon SageMaker Processing job analyzes captured data, builds graphs, and generates a report providing insights on potential problems. This doesn't require any work on your part, as this analysis runs inside a built-in container on fully managed infrastructure.

Now, let’s run a demo with PyTorch, where we’ll profile a ResNet-50 image classification model training on the CIFAR-10 dataset.

Profiling a Training Job with Amazon SageMaker Debugger
All it takes to enable profiling on your training job is an extra parameter in your SageMaker estimator. You don't need to change a line in your training code. By default, SageMaker Debugger uses a set of built-in profiling rules looking for unwanted conditions that could develop during training, such as low GPU utilization. On top of reporting these conditions, SageMaker Debugger also triggers events in Amazon CloudWatch Events. For example, I could use them to run an AWS Lambda function that automatically stops inefficient training jobs.
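For instance, here's a minimal sketch of such a Lambda function. The boto3 stop_training_job call is real, but the way the training job name and rule status are read from the incoming event is an assumption you would adapt to the actual event payload:

import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Assumed fields: adjust to the actual CloudWatch event structure you receive
    job_name = event["detail"]["TrainingJobName"]
    rule_status = event["detail"].get("RuleEvaluationStatus", "")
    print(f"Rule status for {job_name}: {rule_status}")
    # Stop the job if a profiling rule flagged it as inefficient
    sm.stop_training_job(TrainingJobName=job_name)
    return {"stopped": job_name}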

First, I create a profiling configuration capturing data every 500ms. Optionally, I could select a training step interval if I wanted to profile only a certain portion of the job.

import sagemaker
from sagemaker.profiler import ProfilerConfig 
profiler_config = ProfilerConfig(profiling_interval_millis=500)

Then, I pass this configuration to my PyTorch Estimator, training on an ml.p3.8xlarge instance equipped with 4 NVIDIA V100 GPUs.

from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    entry_point='train_pt.py',
    framework_version='1.5.0',
    hyperparameters={"batch_size":512, "epochs":10},
    profiler_config=profiler_config)

Then, I launch the training job as usual. Once the job is running, profiling data is captured and stored in S3.

path = estimator.latest_job_profiler_artifacts_path()
print(path)
s3://sagemaker-us-east-2-123456789012/pt-train-pt-2020-11-17-17-38-43-231/profiler-output

Using the SageMaker SDK, I can retrieve and count profiling events.

from smdebug.profiler.system_metrics_reader import S3SystemMetricsReader
system_metrics_reader = S3SystemMetricsReader(path)
system_metrics_reader.refresh_event_file_list()
last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()
events = system_metrics_reader.get_events(0, last_timestamp)
print("Found", len(events), "recorded system metric events. Latest recorded event:", last_timestamp)
Found 411853 recorded system metric events. Latest recorded event: 1605620280000000

Of course, I could parse and analyze these profiling events, build my own graphs, and so on. Instead, let’s visualize them in near real-time in SageMaker Studio.

While my training job is still running, I locate it in SageMaker Studio, and I right-click “Open Debugger for insights”.

SageMaker screenshot

This opens a new tab, and I select the “Nodes” panel, where I can see detailed statistics for each instance in my training job. So, how's my training job doing? Feel free to click on the image below to zoom in.

SageMaker screenshot

Apparently, this job isn’t going great. GPU utilization and GPU memory utilization are desperately flat at around 10%. I’m definitely not pushing my multi-GPU instance hard enough. Maybe GPUs are not receiving data fast enough because the CPU can’t keep up? Let’s check the system utilization heatmap.

SageMaker screenshot

The CPU is taking a nap here, hardly ever exceeding 20% usage. This instance is definitely not busy enough. Is there anything I could do to fix this?

Switching to the “Overview” panel, I see that some of the built-in profiling rules have been triggered.

SageMaker screenshot

LowGPUUtilization confirms what I saw on the graphs above. BatchSize is very interesting, as it suggests increasing the size of mini-batches sent to the GPUs by the training script running on the CPU. This should definitely help fill GPU memory, put more GPU cores to work, speed up my training job, and improve infrastructure usage.

At this point, I should decide to stop my inefficient training job and relaunch it with a larger batch size. Here, I'll let it run to completion to show you the report generated by the SageMaker Processing job running in parallel with your training job.
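If I did relaunch it, the only change would be the batch size hyperparameter, for example like this (the new value of 2048 is just an illustration; you would tune it for your own model and instance):

# Same estimator as before, only the batch size hyperparameter changes
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    entry_point='train_pt.py',
    framework_version='1.5.0',
    hyperparameters={"batch_size": 2048, "epochs": 10},  # was 512
    profiler_config=profiler_config)
estimator.fit()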

Once the training job is complete, I can see its summary in the “Overview” panel.

SageMaker screenshot

Clicking on the “Download report” button, I get a very detailed report that includes additional metrics, for example the ratio between the different phases of the training job, or the ratio between the forward and backward pass.

SageMaker screenshot

I can also see information on the most time-consuming CPU and GPU operators, which is really important if I wanted to optimize my code. For example, the graph below tells me that the most time-consuming GPU operations in my training job are backward pass convolution operators.

SageMaker screenshot

There’s much more to read in the report (rules summary, training loop analysis, and more). A companion notebook is also available to understand how graphs have been built, and how you can tailor them to your own needs.

Getting Started
We’ve just scratched the surface, and there are many more features in Amazon SageMaker Debugger that make it easy to gather, analyze and visualize model profiling information. You can start using it today in all regions where Amazon SageMaker is available. You won’t be charged for any compute used to run built-in profiling rules.

You’ll find sample notebooks on Github, so give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

New – Managed Data Parallelism in Amazon SageMaker Simplifies Training on Large Datasets

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/managed-data-parallelism-in-amazon-sagemaker-simplifies-training-on-large-datasets/

Today, I’m particularly happy to announce that Amazon SageMaker now supports a new data parallelism library that makes it easier to train models on datasets that may be as large as hundreds or thousands of gigabytes.

As data sets and models grow larger and more sophisticated, machine learning (ML) practitioners working on large distributed training jobs have to face increasingly long training times, even when using powerful instances such as the Amazon Elastic Compute Cloud (EC2) p3 and p4 instances. For example, using a ml.p3dn.24xlarge instance equipped with 8 NVIDIA V100 GPUs, it takes over 6 hours to train advanced object detection models such as Mask RCNN and Faster RCNN on the publicly available COCO dataset. Likewise, training BERT, a state of the art natural language processing model, takes over 100 hours on the same instance. Some of our customers, such as autonomous vehicle companies, routinely deal with even larger training jobs that run for days on large GPU clusters.

As you can imagine, these long training times are a severe bottleneck for ML projects, hurting productivity and slowing down innovation. Customers asked us for help, and we got to work.

Introducing Data Parallelism in Amazon SageMaker
Amazon SageMaker now helps ML teams reduce distributed training time and cost, thanks to the SageMaker Data Parallelism (SDP) library. Available for TensorFlow and PyTorch, SDP implements a more efficient distribution of computation, optimizes network communication, and fully utilizes our fastest p3 and p4 GPU instances.

Up to 90% of GPU resources can now be used for training, not for data transfer. Distributed training jobs can achieve near-linear scaling efficiency, regardless of the number of GPUs involved. In other words, if a training job runs for 8 hours on a single instance, it will only take approximately 1 hour on 8 instances, with minimal cost increase. SageMaker effectively eliminates any trade-off between training cost and training time, allowing ML teams to get results sooner, iterate faster, and accelerate innovation.

During his keynote at AWS re:Invent 2020, Swami Sivasubramanian demonstrated the fastest training times to date for T5-3B and Mask-RCNN.

  • The T5-3B model has 3 billion parameters, achieves state-of-the-art accuracy on natural language processing benchmarks, and usually takes weeks of effort to train and tune for performance. We trained this model in 6 days on 256 ml.p4d.24xlarge instances.
  • Mask-RCNN continues to be a popular instance segmentation model used by our customers. Last year at re:Invent, we trained Mask-RCNN in 26 minutes on PyTorch, and in 27 minutes on TensorFlow. This year, we recorded the fastest training time to date for Mask-RCNN at 6:12 minutes on TensorFlow, and 6:45 minutes on PyTorch.

Before we explain how Amazon SageMaker is able to achieve such speedups, let’s first explain how data parallelism works, and why it’s hard to scale.

A Primer on Data Parallelism
If you’re training a model on a single GPU, its full internal state is available locally: model parameters, optimizer parameters, gradients (parameter updates computed by backpropagation), and so on. However, things are different when you distribute a training job to a cluster of GPUs.

Using a technique named “data parallelism,” the training set is split in mini-batches that are evenly distributed across GPUs. Thus, each GPU only trains the model on a fraction of the total data set. Obviously, this means that the model state will be slightly different on each GPU, as they will process different batches. In order to ensure training convergence, the model state needs to be regularly updated on all nodes. This can be done synchronously or asynchronously:

  • Synchronous training: all GPUs report their gradient updates either to all other GPUs (many-to-many communication), or to a central parameter server that redistributes them (many-to-one, followed by one-to-many). As all updates are applied simultaneously, the model state is in sync on all GPUs, and the next mini-batch can be processed.
  • Asynchronous training: gradient updates are sent to all other nodes, or to a central server. However, they are applied immediately, meaning that model state will differ from one GPU to the next.

Unfortunately, these techniques don’t scale very well. As the number of GPUs increases, a parameter server will inevitably become a bottleneck. Even without a parameter server, network congestion soon becomes a problem, as n GPUs need to exchange n*(n-1) messages after each iteration, for a total amount of n*(n-1)*model size bytes. For example, ResNet-50 is a popular model used in computer vision applications. With its 26 million parameters, each 32-bit gradient update takes about 100 megabytes. With 8 GPUs, each iteration requires sending and receiving 56 updates, for a total of 5.6 gigabytes. Even with a fast network, this will cause some overhead, and slow down training.

A significant step forward was taken in 2017 thanks to the Horovod project. Horovod implemented an optimized communication algorithm for distributed training named “ring-allreduce,” which was soon integrated with popular deep learning libraries.

In a nutshell, ring-allreduce is a decentralized asynchronous algorithm. There is no parameter server: nodes are organized in a directed cycle graph (to put it simply, a one-way ring). For each iteration, a node receives a gradient update from its predecessor. Once a node has processed its own batch, it applies both updates (its own and the one it received), and sends the results to its neighbor. With n GPUs, each GPU processes 2*(n-1) messages before all GPUs have been updated. Accordingly, the total amount of data exchanged per GPU is 2*(n-1)*model size, which is much better than n*(n-1)*model size.

Still, as datasets keep growing, the network bottleneck issue often rises again. Enter SageMaker and its new AllReduce algorithm.

A New Data Parallelism Algorithm in Amazon SageMaker
With SageMaker's new AllReduce algorithm, GPUs don't talk to one another any more. Each GPU stores its gradient updates in GPU memory. When a certain threshold is exceeded, these updates are sharded, and sent to parameter servers running on the CPUs of the GPU instances. This removes the need for dedicated parameter servers.

Each CPU is responsible for a subset of the model parameters, and it receives updates coming from all GPUs. For example, with 3 training instances equipped with a single GPU, each GPU in the training cluster would send a third of its gradient updates to each one of the three CPUs.

Illustration

Then, each CPU applies all the gradient updates that it received, and distributes the consolidated result back to all GPUs.

Illustration

Now that we understand how this algorithm works, let’s see how you can use it with your own code, without having to manage any infrastructure.

Training with Data Parallelism in Amazon SageMaker
The SageMaker Data Parallelism API is designed for ease of use, and should provide seamless integration with existing distributed training toolkits. In most cases, all you have to change in your training code is the import statement for Horovod (TensorFlow), or for Distributed Data Parallel (PyTorch).

For PyTorch, this would look like this.

import smdistributed.dataparallel.torch.parallel.distributed as dist
dist.init_process_group()

Then, I need to pin each GPU to a single SDP process.

torch.cuda.set_device(dist.get_local_rank())

Then, I define my model as usual, for example:

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
...

Finally, I instantiate my model, and use it to create a DistributedDataParallel object like so:

import torch
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP
device = torch.device("cuda")
model = DDP(Net().to(device))

The rest of the code is vanilla PyTorch, and I can train it using the PyTorch estimator available in the SageMaker SDK. Here, I’m using an ml.p3.16xlarge instance with 8 NVIDIA V100 GPUs.

import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train_pytorch.py',
    role=sagemaker.get_execution_role(),
    framework_version='1.6.0',
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}}
)
estimator.fit()

From then on, SageMaker takes over and provisions all required infrastructure. You can focus on other tasks while your training job runs.

Getting Started
If your training jobs last for hours or days on multiple GPUs, we believe that the SageMaker Data Parallelism library can save you time and money, and help you experiment and innovate quicker. It's available today in all regions where SageMaker is available, at no additional cost.

Examples are available to get you started quickly. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Amazon SageMaker Simplifies Training Deep Learning Models With Billions of Parameters

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-simplifies-training-deep-learning-models-with-billions-of-parameters/

Today, I’m extremely happy to announce that Amazon SageMaker simplifies the training of very large deep learning models that were previously difficult to train due to hardware limitations.

In the last 10 years, a subset of machine learning named deep learning (DL) has taken the world by storm. Based on neural networks, DL algorithms have an extraordinary ability to extract information patterns hidden in vast amounts of unstructured data, such as images, videos, speech, or text. Indeed, DL has quickly achieved impressive results on a variety of complex human-like tasks, especially on computer vision and natural language processing. In fact, innovation has never been faster, as DL keeps improving its results on reference tasks like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the General Language Understanding Evaluation (GLUE), or the Stanford Question Answering Dataset (SQUAD).

In order to tackle ever more complex tasks, DL researchers are designing increasingly sophisticated models, adding more neuron layers and more connections to improve pattern extraction and prediction accuracy, with a direct impact on model size. For example, you would get very good results on image classification with a 100-megabyte ResNet-50 model. For more difficult tasks such as object detection or instance segmentation, you would have to use larger models such as Mask R-CNN or YOLO v4, weighing in at about 250 megabytes.

As you can guess, model growth also impacts the amount of time and hardware resources required for model training, which is why Graphical Processing Units (GPU) have long been the preferred option to train and fine-tune large DL models. Thanks to their massively parallel architecture and their large on-board memory, they make it possible to use a technique called mini-batch training. By sending several data samples at once to the GPU, instead of sending them one by one, communication overhead is reduced, and training jobs are greatly accelerated. For example, the NVIDIA A100 available on the Amazon Elastic Compute Cloud (EC2) p4 family has over 7,000 compute cores and 40 gigabytes of fast onboard memory. Surely, that should be enough to train large batches of data on very large models, shouldn’t it?

Well, it’s not. Natural language processing behemoths such as OpenAI GPT-2 (1.5 billion parameters), T5-3B (3 billion parameters) and GPT-3 (175 billion parameters) consume tens or even hundreds of gigabytes of GPU memory. Likewise, state-of-the-art models working on high-resolution 3D images can be too large to fit in GPU memory, even with a batch size of 1…

Trying to square the circle, DL researchers use a combination of techniques, such as the following:

  • Buy more powerful GPUs, although we just saw that this is simply not an option for some models.
  • Work with less powerful models, and sacrifice accuracy.
  • Implement gradient checkpointing, a technique that relies on saving intermediate training results to disk instead of keeping everything in memory, at the expense of a 20-30% training slowdown.
  • Implement model parallelism, that is to say split the model manually, and train its (smaller) pieces on different GPUs. Needless to say, this is an extremely difficult, time-consuming, and uncertain task, even for expert practitioners.

Customers have told us that none of the above was a satisfactory solution to working with very large models. They asked us for a simpler and more cost-effective solution, and we got to work.

Introducing Model Parallelism in Amazon SageMaker
Model parallelism in SageMaker automatically and efficiently partitions models across several GPUs, eliminating the need for accuracy compromises or for complex manual work. In addition, thanks to this scale-out approach to model training, not only can you work with very large models without any memory bottleneck, you can also leverage a large number of smaller and more cost-effective GPUs.

At launch, this is supported for TensorFlow and PyTorch, and it only requires minimal changes in your code. When you launch a training job, you can specify whether your model should be optimized for speed or for memory usage. Then, Amazon SageMaker runs an initial profiling job on your behalf in order to analyze the compute and memory requirements of your model. This information is then fed to a partitioning algorithm which decides how to split the model and how to map model partitions to GPUs, while minimizing communication. The outcome of the partitioning decision is saved to a file, which is passed as input to the actual training job.

As you can see, SageMaker takes care of everything. If you’d like, you could also manually profile and partition the model, then train on SageMaker.

Before we look at the code, I’d like to give you a quick overview of the internals.

Training with Model Partitions and Microbatches
As model partitions running on different GPUs expect forward pass inputs from each other (activation values), processing training mini-batches across a sequence of partitions would only keep one partition busy at all times, while stalling the other ones.

To avoid this inefficient behavior, mini-batches are split into microbatches that are processed in parallel on the different GPUs. For example, GPU #1 could be forward propagating microbatch n, while GPU #2 could do the same for microbatch n+1. Activation values can be stored, and passed to the next partition whenever it’s ready to accept them.

For back propagation, partitions also expect input values from each other (gradients). As a partition can’t simultaneously run forward and backward propagation, we could wait for all GPUs to complete the forward pass on their own microbatch, before letting them run the corresponding backward pass. This simple mode is available in Amazon SageMaker.
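To build a little intuition, here is a toy scheduling sketch, not SageMaker code, that simply prints how 4 microbatches would flow through 2 partitions in this simple mode, with forward passes pipelined first and backward passes following in reverse partition order:

# Toy sketch: visualize simple-mode pipelining of microbatches across partitions
def simple_pipeline(partitions=2, microbatches=4):
    schedule = {}
    # Forward pass: microbatch m reaches partition p at time step m + p
    for m in range(microbatches):
        for p in range(partitions):
            schedule.setdefault(m + p, []).append(f"fwd mb{m}/part{p}")
    # Backward passes start once all forward passes are done, last partition first
    start = microbatches + partitions - 1
    for m in range(microbatches):
        for p in reversed(range(partitions)):
            step = start + m + (partitions - 1 - p)
            schedule.setdefault(step, []).append(f"bwd mb{m}/part{p}")
    for step in sorted(schedule):
        print(f"t={step}: " + ", ".join(schedule[step]))

simple_pipeline()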

There’s an even more efficient option, called interleaved mode. Here, SageMaker replicates partitions according to the number of microbatches. For example, working with 2 microbatches, each GPU would run two copies of the partition it has received. Each copy would collaborate with partitions running on other GPUs, either for forward or backpropagation.

Here's how things could look, with 4 different microbatches being processed by 2 duplicated partitions.

Illustration

To sum things up, interleaving the forward and backward passes of different microbatches is how SageMaker maximizes GPU utilization.

Now, let’s see how we can put this to work with TensorFlow.

Implementing Model Parallelism in Amazon SageMaker
Thanks to the SageMaker Model Parallelism (SMP) library, you can easily implement model parallelism in your own TensorFlow code (the process is similar for PyTorch). Here’s what you need to do:

  • Define and initialize the partitioning configuration.
  • Make your model a subclass of the DistributedModel class, using standard Keras subclassing.
  • Write and decorate with @smp.step a training function that represents a forward and backward step for the model. This function will be pipelined according to the architecture described in the previous section.
  • Optionally, do the same for an evaluation function that will also be pipelined.

Let’s apply this to a simple convolution network training on the MNIST dataset, using an ml.p3.8xlarge instance equipped with 4 NVIDIA V100 GPUs.

First, I initialize the SMP API.

import smdistributed.modelparallel.tensorflow as smp
smp.init()

Then, I subclass DistributedModel and build my model.

from tensorflow.keras.layers import Conv2D, Dense, Flatten

class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv = Conv2D(32, 3, activation="relu")
        self.flatten = Flatten()
        self.dense1 = Dense(128)
        self.dense2 = Dense(10)
. . .

This is what the training function looks like.

@smp.step
def forward_backward(images, labels):
    predictions = model(images, training=True)
    loss = loss_obj(labels, predictions)
    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss

Then, I can train as usual with the TensorFlow estimator available in the SageMaker SDK. I only need to add the model parallelism configuration: 2 partitions (hence training on 2 GPUs), and 2 microbatches (hence 2 copies of each partition) with interleaving.

from sagemaker.tensorflow import TensorFlow

smd_mp_estimator = TensorFlow(
    entry_point="tf2.py",
    role=role,
    framework_version='2.3.1',
    py_version='py3',
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled":True,
                "parameters": {
                    "microbatches": 2,
                    "partitions": 2,
                    "pipeline": "interleaved",
                    "optimize": "memory",
                    "horovod": True, 
                }
             }
         },
        "mpi": {
            "enabled": True,
            "processes_per_host": 2, # Pick your processes_per_host
            "custom_mpi_options": mpioptions
        },
    }
)

Getting Started
As you can see, model parallelism makes it easier to train very large state-of-the-art deep learning models. It’s available today in all regions where Amazon SageMaker is available, at no additional cost.

Examples are available to get you started right away. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

 

re:Invent 2020 Liveblog: Machine Learning Keynote

Post Syndicated from AWS News Blog Team original https://aws.amazon.com/blogs/aws/reinvent-2020-liveblog-machine-learning-keynote/

Swami Sivasubramanian speaks on stage at AWS re:Invent

Follow along as AWS Chief Evangelist Jeff Barr and Developer Advocates Martin Beeby and Steve Roberts liveblog the first-ever Machine Learning Keynote. Swami Sivasubramanian, VP of Amazon ML/AI, will share the latest developments and launches in AWS machine learning, as well as demos of new technology and insights from customers.

Join us here from 7:45-10 AM (PST), Tuesday, Dec. 8, 2020! 

 


Raising code quality for Python applications using Amazon CodeGuru

Post Syndicated from Ran Fu original https://aws.amazon.com/blogs/devops/raising-code-quality-for-python-applications-using-amazon-codeguru/

We are pleased to announce the launch of Python support for Amazon CodeGuru, a service for automated code reviews and application performance recommendations. CodeGuru is powered by program analysis and machine learning, and trained on best practices and hard-learned lessons across millions of code reviews and thousands of applications profiled on open-source projects and internally at Amazon.

Amazon CodeGuru has two services:

  • Amazon CodeGuru Reviewer – Helps you improve source code quality by detecting hard-to-find defects during application development and recommending how to remediate them.
  • Amazon CodeGuru Profiler – Helps you find the most expensive lines of code, helps reduce your infrastructure cost, and fine-tunes your application performance.

The launch of Python support extends CodeGuru beyond its original Java support. Python is a widely used language for various use cases, including web app development and DevOps. Python’s growth in data analysis and machine learning areas is driven by its rich frameworks and libraries. In this post, we discuss how to use CodeGuru Reviewer and Profiler to improve your code quality for Python applications.

CodeGuru Reviewer for Python

CodeGuru Reviewer now allows you to analyze your Python code through pull requests and full repository analysis. For more information, see Automating code reviews and application profiling with Amazon CodeGuru. We analyzed large code corpuses and Python documentation to source hard-to-find coding issues and trained our detectors to provide best practice recommendations. We expect such recommendations to benefit beginners as well as expert Python programmers.

CodeGuru Reviewer generates recommendations in the following categories:

  • AWS SDK APIs best practices
  • Data structures and control flow, including exception handling
  • Resource leaks
  • Secure coding practices to protect from potential shell injections

In the following sections, we provide real-world examples of bugs that can be detected in each of the categories:

AWS SDK API best practices

AWS has hundreds of services and thousands of APIs. Developers can now benefit from CodeGuru Reviewer recommendations related to AWS APIs. AWS recommendations in CodeGuru Reviewer cover a wide range of scenarios such as detecting outdated or deprecated APIs, warning about API misuse, authentication and exception scenarios, and efficient API alternatives.

Consider the pagination trait, implemented by over 1,000 APIs from more than 150 AWS services. The trait is commonly used when the response object is too large to return in a single response. To get the complete set of results, iterated calls to the API are required, until the last page is reached. If developers were not aware of this, they would write the code as the following (this example is patterned after actual code):

def sync_ddb_table(source_ddb, destination_ddb):
    response = source_ddb.scan(TableName="table1")
    for item in response['Items']:
        ...
        destination_ddb.put_item(TableName="table2", Item=item)
    ...

Here the scan API is used to read items from one Amazon DynamoDB table and the put_item API to save them to another DynamoDB table. The scan API implements the Pagination trait. However, the developer missed iterating on the results beyond the first scan, leading to only partial copying of data.

The following screenshot shows what CodeGuru Reviewer recommends:

The following screenshot shows the CodeGuru Reviewer recommendation on the need for pagination

The developer fixed the code based on this recommendation and added complete handling of paginated results by checking the LastEvaluatedKey value in the response object of the paginated scan API, as follows:

def sync_ddb_table(source_ddb, destination_ddb):
    response = source_ddb.scan(TableName="table1")
    for item in response['Items']:
        ...
        destination_ddb.put_item(TableName="table2", Item=item)
    # Keeps scanning until LastEvaluatedKey is no longer present
    while "LastEvaluatedKey" in response:
        response = source_ddb.scan(
            TableName="table1",
            ExclusiveStartKey=response["LastEvaluatedKey"]
        )
        for item in response['Items']:
            destination_ddb.put_item(TableName="table2", Item=item)
    ...

The CodeGuru Reviewer recommendation is rich and offers multiple options for implementing a paginated scan. We can also initialize the ExclusiveStartKey value to None and iteratively update it with the LastEvaluatedKey value from each scan response. The fix below conforms to the usage shown in the official documentation.

def sync_ddb_table(source_ddb, destination_ddb):
    table = source_ddb.Table("table1")
    scan_kwargs = {
                  ...
    }
    done = False
    start_key = None
    while not done:
        if start_key:
            scan_kwargs['ExclusiveStartKey'] = start_key
        response = table.scan(**scan_kwargs)
        for item in response['Items']:
            destination_ddb.put_item(TableName="table2", Item=item)
        start_key = response.get('LastEvaluatedKey', None)
        done = start_key is None
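Beyond the options shown in the recommendation, boto3 clients also expose built-in paginators that handle LastEvaluatedKey for you. The following is an illustrative aside rather than one of the documented options; the table names are placeholders:

import boto3

def sync_ddb_table(source_table, destination_table):
    # get_paginator is available on the low-level DynamoDB client
    ddb = boto3.client("dynamodb")
    paginator = ddb.get_paginator("scan")
    for page in paginator.paginate(TableName=source_table):
        for item in page["Items"]:
            ddb.put_item(TableName=destination_table, Item=item)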

Data structures and control flow

Python’s coding style differs from that of many other languages. For code that does not conform to Python idioms, CodeGuru Reviewer provides a variety of suggestions for efficient and correct handling of data structures and control flow in the Python 3 standard library:

  • Using collections.defaultdict for compact handling of missing dictionary keys instead of the setdefault() method or catching KeyError (see the short sketch after this list)
  • Using the subprocess module over outdated APIs for subprocess handling
  • Detecting improper exception handling, such as catching and passing generic exceptions that can hide latent issues
  • Detecting simultaneous iteration and modification of loops, which can lead to unexpected bugs because the iterator expression is only evaluated one time and does not account for subsequent index changes
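To make the first item concrete, here is a minimal sketch (not taken from the original post) contrasting setdefault() with collections.defaultdict:

from collections import defaultdict

words = ["red", "green", "red", "blue"]

# With a plain dict, you typically reach for setdefault() or handle KeyError
counts = {}
for word in words:
    counts.setdefault(word, 0)
    counts[word] += 1

# defaultdict expresses the same intent more compactly
counts = defaultdict(int)
for word in words:
    counts[word] += 1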

The following code is a specific example that can confuse novice developers.

def list_sns(region, creds, sns_topics=[]):
    sns = boto_session('sns', creds, region)
    response = sns.list_topics()
    for topic_arn in response["Topics"]:
        sns_topics.append(topic_arn["TopicArn"])
    return sns_topics
  
def process():
    ...
    for region, creds in jobs["auth_config"]:
        arns = list_sns(region, creds)
        ... 

The process() method iterates over different AWS Regions and collects Regional ARNs by calling the list_sns() method. The developer might expect each call to list_sns() with a Region parameter to return only the corresponding Regional ARNs. However, the preceding code actually leaks the ARNs from prior calls into subsequent Regions. This happens because of an idiosyncrasy of Python relating to the use of mutable objects as default argument values: default values are created exactly once, and if that object is mutated, subsequent calls see the mutated value instead of a freshly initialized one.
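To make this idiosyncrasy concrete, here is a minimal standalone demonstration (not part of the original example):

def append_item(item, items=[]):   # the default list is created only once
    items.append(item)
    return items

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2] -- the list from the first call leaks through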

The following screenshot shows what CodeGuru Reviewer recommends:

The following screenshot shows the CodeGuru Reviewer recommendation about mutable objects used as default argument values

The developer accepted the recommendation and applied the following fix.

def list_sns(region, creds, sns_topics=None):
    sns = boto_session('sns', creds, region)
    response = sns.list_topics()
    if sns_topics is None: 
        sns_topics = [] 
    for topic_arn in response["Topics"]:
        sns_topics.append(topic_arn["TopicArn"])
    return sns_topics

Resource leaks

A Pythonic practice for resource handling is to use context managers. Our analysis shows that resource leaks are rampant in Python code: a developer may open external files or windows and forget to close them eventually. A resource leak can slow down or crash your system. Even when a resource is closed explicitly, using a context manager is the more Pythonic approach. For example, CodeGuru Reviewer detects a resource leak in the following code:

def read_lines(file):
    lines = []
    f = open(file, 'r')
    for line in f:
        lines.append(line.strip('\n').strip('\r\n'))
    return lines

The following screenshot shows that CodeGuru Reviewer recommends that the developer either use a with statement (context manager) or use a try-finally block to explicitly close the resource.

The following screenshot shows the CodeGuru Reviewer recommendation about fixing the potential resource leak

The developer accepted the recommendation and fixed the code as shown below.

def read_lines(file):
    lines = []
    with open(file, 'r') as f:
        for line in f:
            lines.append(line.strip('\n').strip('\r\n'))
    return lines

Secure coding practices

Python is often used for scripting, and an integral part of such scripts is the use of subprocesses. As of this writing, CodeGuru Reviewer makes a limited but important set of recommendations to make sure that your use of eval functions or subprocesses is secure from potential shell injection. It issues a warning if it detects that the command used in eval or subprocess scenarios might be influenced by external input. For example, see the following code:

def execute(cmd):
    try:
        retcode = subprocess.call(cmd, shell=True)
        ...
    except OSError as e:
        ...

The following screenshot shows the CodeGuru Reviewer recommendation:

The following screenshot shows the CodeGuru Reviewer recommendation about a potential shell injection vulnerability

The developer accepted this recommendation and made the following fix.

import shlex
import subprocess

def execute(cmd):
    try:
        retcode = subprocess.call(shlex.quote(cmd), shell=True)
        ...
    except OSError as e:
        ...
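As an aside that goes beyond the recommendation shown above, when the command does not actually need shell features, a common way to avoid shell injection entirely is to pass the arguments as a list and not use a shell at all. The following is a minimal sketch with a hypothetical command:

import subprocess

def execute(args):
    # args is a list such as ["ls", "-l", user_supplied_path]; because no
    # shell is involved, shell metacharacters in user input are not interpreted
    try:
        return subprocess.call(args)
    except OSError:
        return None

# The semicolon is printed literally instead of being executed
print(execute(["echo", "hello; rm -rf /tmp/should-not-run"]))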

As shown in the preceding recommendations, not only are the code issues detected, but a detailed recommendation is also provided on how to fix the issues, along with a link to the Python official documentation. You can provide feedback on recommendations in the CodeGuru Reviewer console or by commenting on the code in a pull request. This feedback helps improve the performance of Reviewer so that the recommendations you see get better over time.

Now let’s take a look at CodeGuru Profiler.

CodeGuru Profiler for Python

Amazon CodeGuru Profiler analyzes your application’s performance characteristics and provides interactive visualizations that show you where your application spends its time. These visualizations, also known as flame graphs, are a powerful tool to help you troubleshoot which code paths have high latency or are overutilizing your CPU.

Thanks to the new Python agent, you can now use CodeGuru Profiler on your Python applications to investigate performance issues.

The following list summarizes the supported versions as of this writing.

  • AWS Lambda functions: Python 3.8, Python 3.7, Python 3.6
  • Other environments: Python 3.9, Python 3.8, Python 3.7, Python 3.6

Onboarding your Python application

For this post, let’s assume you have a Python application running on Amazon Elastic Compute Cloud (Amazon EC2) hosts that you want to profile. To onboard your Python application, complete the following steps:

1. Create a new profiling group called ProfilingGroupForMyApplication in the CodeGuru Profiler console. Give your Amazon EC2 execution role permission to submit to this profiling group. See the documentation for details about how to create a profiling group.

2. Install the codeguru_profiler_agent module:

pip3 install codeguru_profiler_agent

3. Start the profiler in your application.

An easy way to profile your application is to start your script through the codeguru_profiler_agent module. If you have an app.py script, use the following command:

python -m codeguru_profiler_agent -p ProfilingGroupForMyApplication app.py

Alternatively, you can start the agent manually inside your code. This only needs to be done once, preferably in your startup code:

from codeguru_profiler_agent import Profiler

if __name__ == "__main__":
    Profiler(profiling_group_name='ProfilingGroupForMyApplication')
    start_application()    # your application code runs here

Onboarding your Python Lambda function

Onboarding for an AWS Lambda function is quite similar.

  1. Create a profiling group called ProfilingGroupForMyLambdaFunction, this time selecting Lambda for the compute platform. Give your Lambda function role permission to submit to this profiling group. See the documentation for details about how to create a profiling group.
  2. Include the codeguru_profiler_agent module in your Lambda function code.
  3. Add the with_lambda_profiler decorator to your handler function:
from codeguru_profiler_agent import with_lambda_profiler

@with_lambda_profiler(profiling_group_name='ProfilingGroupForMyLambdaFunction')
def handler_function(event, context):
    # Your code here

Alternatively, you can profile an existing Lambda function without updating the source code by adding a layer and changing the configuration. For more information, see Profiling your applications that run on AWS Lambda.

Profiling a Lambda function helps you see what is slowing down your code so you can reduce the duration, which reduces the cost and improves latency. You need to have continuous traffic on your function in order to produce a usable profile.

Viewing your profile

After running your profile for some time, you can view it on the CodeGuru console.

Screenshot of Flame graph visualization by CodeGuru Profiler

Each frame in the flame graph shows how much that function contributes to latency. In this example, an outbound call that crosses the network accounts for most of the Lambda function’s duration; caching its result would improve latency.

For more information, see Investigating performance issues with Amazon CodeGuru Profiler.

Supported environments and runtime versions for CodeGuru Profiler are documented here.

If you don’t have an application to try CodeGuru Profiler on, you can use the demo application in the following GitHub repo.

Conclusion

This post introduced how to use CodeGuru Reviewer to identify hard-to-find code defects in various issue categories and how to onboard your Python applications or Lambda functions onto CodeGuru Profiler for CPU profiling. Combining both services can help you improve code quality for Python applications. CodeGuru is now available for you to try. For pricing information, see Amazon CodeGuru pricing.

 

About the Authors

Neela Sawant is a Senior Applied Scientist in the Amazon CodeGuru team. Her background is building AI-powered solutions to customer problems in a variety of domains such as software, multimedia, and retail. When she isn’t working, you’ll find her exploring the world anew with her toddler and hacking away at AI for social good.

 

 

Pierre Marieu is a Software Development Engineer in the Amazon CodeGuru Profiler team in London. He loves building tools that help the day-to-day life of other software engineers. Previously, he worked at Amadeus IT, building software for the travel industry.

 

 

 

Ran Fu is a Senior Product Manager in the Amazon CodeGuru team. He has a deep customer empathy, and love exploring who are the customers, what are their needs, and why those needs matter. Besides work, you may find him snowboarding in Keystone or Vail, Colorado.

 

Preparing data for ML models using AWS Glue DataBrew in a Jupyter notebook

Post Syndicated from Zayd Simjee original https://aws.amazon.com/blogs/big-data/preparing-data-for-ml-models-using-aws-glue-databrew-in-a-jupyter-notebook/

AWS Glue DataBrew is a new visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning (ML). In this post, we examine a sample ML use case and show how to use DataBrew and a Jupyter notebook to upload a dataset, clean and normalize the data, and train and publish an ML model. We look for anomalies by applying the Amazon SageMaker Random Cut Forest (RCF) anomaly detection algorithm on a public dataset that records power consumption for more than 300 random households.

Deploying your resources

To make it easier for you to get started, we created an AWS CloudFormation template that automatically configures a Jupyter notebook instance with the required libraries and installs the plugin. We used the AWS Deep Learning AMI to configure the out-of-the-box Jupyter server. This easy deployment is intended to get you started with DataBrew from within a Jupyter environment. The source code for the DataBrew plugin and the CloudFormation template are available in the GitHub repo.

To deploy the solution, you must have a subnet that has internet access and an Amazon Simple Storage Service (Amazon S3) bucket where you want to store the data for DataBrew. Select the VPC, subnet, security group, and the S3 bucket that you want to use to store the data for DataBrew processing. Provide the Amazon Elastic Compute Cloud (Amazon EC2) key pair if you plan to SSH to the instance.

  1. Launch the following stack:
  2. When the template deployment is complete, on the Outputs tab, choose the URL to open JupyterLab.

Because the Jupyter server is configured with a self-signed SSL certificate, your browser warns you and prompts you not to continue to this website. But because you set up this server yourself, it’s safe to continue.

  1. Choose Advanced.
  2. Choose Proceed.
  3. Use the password databrew_demo to log in.

For more information about securing and configuring your Jupyter server, see Set up a Jupyter Notebook Server.

  1. In the Jupyter environment’s left panel, choose the DataBrew logo.
  2. Choose Launch AWS Glue DataBrew.
  3. When the extension loads, choose the Datasets tab in the navigation bar.

Preparing your data using DataBrew

Now that the DataBrew extension is ready to go, we can begin to explore how DataBrew can make data preparation easy. Our source dataset contains data points at 15-minute intervals, and is organized as a series of columns for each household. The dataset is really wide, and the RCF algorithm expects data tuples of date/time, client ID, and consumption value. Additionally, we want to normalize our data to 1-hour intervals. All of this is achieved through DataBrew.

Setting up your dataset and project

To get started, you set up your dataset, import your data, and create a new project.

  1. Download power.consumption.csv from the GitHub repo.
  2. On the Datasets page, choose Connect new dataset.
  3. For Dataset name, enter a name.
  4. In the Connect to new dataset section, choose File upload.
  5. Upload power.consumption.csv.
  6. For Enter S3 destination, enter an S3 path where you can save the file.
  7. Choose Create dataset.

The file may take a few minutes to upload, depending on your internet speed.

  1. On the Datasets page, filter for your created dataset.
  2. Select your dataset and choose Create project with this dataset.

  1. In the Create project wizard, give your project a name.
  2. In the Permissions section, choose the AWS Identity and Access Management (IAM) role created from the CloudFormation template.

You can find the role on the CloudFormation stack’s Resources tab. If you use the default stack name, the role should begin with databrew-jupyter-plugin-demo.

After you create the project, the project view loads, and you’re ready to prepare your data.

Building a recipe for data transformation

A recipe is a series of steps that prepare your data for the RCF algorithm. The algorithm requires three columns: date, client ID, and an integer value. To transform our dataset to contain those three columns, we configure our recipe to do the following:

  1. Unpivot the data to collapse measurements from multiple clients into one column.
  2. Apply the window function to average the 15-minute data points into 1-hour data points.
  3. Filter to keep only values at each hour.
  4. Multiply and floor the results.

Unpivoting the data

To unpivot the data and collapse measurements, complete the following steps:

  1. On the toolbar, choose Pivot.

  1. In the Pivot wizard, select Unpivot: Columns to rows.
  2. For Unpivot columns, choose MT_012, MT_013, MT_131, and MT_132.
  3. For Column name, enter client_id.
  4. For Value column name, enter quarter_hour_consumption.

In the Recipe pane, you can see the action that unpivots the columns. This action can be revisited later and changed. The new columns may not be visible immediately.

  1. To see them and narrow down the visible data to only the relevant columns, choose the arrow next to Viewing.
  2. Deselect all items and select only _c0 and our two new columns, client_id and quarter_hour_consumption.

Applying the window function

To apply the window function to average the 15-minute data points into 1-hour data points, complete the following steps:

  1. Choose the quarter_hour_consumption column.
  2. Choose Functions.
  3. Choose Window functions.
  4. Choose Rolling average.

  1. In the Create column pane, for Number of rows before, enter 0.
  2. For Number of rows after, enter 3.
  3. For Name of column to order by with, choose client_id.
  4. For Destination column, enter hourly_consumption_raw.
  5. Choose Apply.

Filtering to keep only values at each hour

In this step, you rename the date/time column, convert it to string type so that you can do simple filtering, and filter the dataset on the string column for times ending in :00:00.

  1. For the _c0 column, choose the ellipsis icon (…) and choose Rename.

  1. Rename the column to timestamp.
  2. Choose the clock icon and choose string.

  1. With the column selected, choose Filter.
  2. Choose By condition.
  3. Choose Ends with.
  4. Enter the value :00:00.
  5. Choose Apply.

Filtering the column for only values that end with :00:00 leaves you with hourly averages of power consumption per client for every hour.

Multiplying and flooring the results

In this step, you multiply the data by 100 to increase precision and floor the data so that it can be accepted by the RCF algorithm, which only accepts integers.

  1. Choose Functions.
  2. Choose Math functions.
  3. Choose Multiply.
  4. For Value using, choose Source columns and value.
  5. For Source column, choose hourly_consumption_raw.
  6. For Destination column, enter hourly_consumption_raw_times_a_hundred.
  7. Choose Apply.

  1. Choose Functions.
  2. Choose Math functions.
  3. Choose Floor.
  4. For Source column, choose hourly_consumption_raw_times_a_hundred.
  5. For Destination column, enter hourly_consumption.
  6. Choose Apply.

This column contains the final, normalized data.

Running the job to transform the data

You’re now ready to transform the data.

  1. Choose Create job.

  1. Enter a job name and choose the dataset we created.
  2. Specify the S3 bucket you provided in the CloudFormation template.
  3. Choose the IAM role that AWS CloudFormation created (which we used earlier to create the project).
  4. Choose Create and run job.

The job may take up to 5 minutes to complete.

On the Job run history page for the job, view the output on the Amazon S3 console by choosing the link in the table.

That’s it! The data is now ready to use for training and deploying our ML model.

Training and deploying the ML model using prepared data

The data was already prepared using DataBrew via the plugin, so the next step is to train an ML model using that data. We provided a sample anomaly detection notebook that you can download.

In this sample notebook, you need to specify the S3 location where you stored the output data from DataBrew. The notebook uses the IAM role attached to the EC2 instance profile that AWS CloudFormation created. After you provide the right S3 paths, the first step in the notebook is to filter the specific columns we’re interested in and visualize the time series power consumption data.

The next step is to train a sample anomaly detection model using the SageMaker Random Cut Forest algorithm. We pick one of the time series available in the input Pandas DataFrame and train the anomaly detection model with the hyperparameter feature_dim set to 1, leaving the default values for other hyperparameters. We then create an estimator for Random Cut Forest and fit the model. In a few minutes, the training should be complete. In the next step, we create a predictor and deploy the model to a SageMaker endpoint.
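The following is a condensed sketch of that training and deployment flow, assuming the SageMaker Python SDK v2. The CSV path, role ARN, instance types, and hyperparameter values are placeholders and are not taken from the sample notebook; the client_id and hourly_consumption columns come from the DataBrew recipe built earlier.

import pandas as pd
from sagemaker import RandomCutForest

role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role ARN

# Load the DataBrew job output (placeholder path) and pick one client's series
df = pd.read_csv("databrew_output.csv")
series = df[df["client_id"] == "MT_012"]["hourly_consumption"]
train_data = series.to_numpy().astype("float32").reshape(-1, 1)

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=512,
    num_trees=50,
)

# feature_dim is derived from the record set (1 here, matching the post)
rcf.fit(rcf.record_set(train_data))

# Deploy the trained model to a real-time SageMaker endpoint
predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.m5.large")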

Using the prepared data, we run the prediction and plot the results.

We use an anomaly detection baseline of two standard deviations away from the mean score. The data shows an anomaly toward the end of the time series. With this information, that timeframe can be investigated further.
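As a quick illustration of that baseline (with made-up scores, not the notebook’s actual output), the threshold and flagged points can be computed as follows:

import numpy as np

# Hypothetical anomaly scores returned by the RCF endpoint, one per hour
scores = np.array([0.71, 0.69, 0.73, 0.70, 1.42, 0.68])

# Flag points whose score is more than two standard deviations above the mean
threshold = scores.mean() + 2 * scores.std()
anomalous_hours = np.where(scores > threshold)[0]
print(threshold, anomalous_hours)  # flags index 4, the spike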

Finally, we clean up by deleting the SageMaker endpoint to prevent any ongoing charges.

Conclusion

We’ve walked you through the process of setting up the AWS Glue DataBrew Jupyter plugin in a Jupyter notebook environment. We used the plugin to prepare data, then trained and deployed an ML model in the same Jupyter environment using SageMaker.

Although we used a DLAMI Jupyter environment in this post, the DataBrew Jupyter extension also works on SageMaker notebooks. For installation instructions, see the GitHub repo.

DataBrew makes it easy to iterate through data preparation workflows. The resulting recipes and jobs are reusable and can be run over large, discrete datasets. The DataBrew Jupyter plugin allows you to prepare your data seamlessly, in context, within your Jupyter notebook.


About the Authors

Zayd Simjee is a software engineer at Amazon. He’s interested in distributed systems, big data, and simplifying developer experience on the Cloud. He’s worked on a few Big Data services at AWS, and most recently completed work on the AWS Glue DataBrew release.

 

 

 

 

As a Principal Solutions Architect at Amazon Web Services, Karthik Sonti works with GSI partners to help enterprises realize transformational business outcomes using artificial intelligence, machine learning, and data analytics.

Incorporating security in code-reviews using Amazon CodeGuru Reviewer

Post Syndicated from Nikunj Vaidya original https://aws.amazon.com/blogs/devops/incorporating-security-in-code-reviews-using-amazon-codeguru-reviewer/

Today, software development practices are constantly evolving to empower developers with tools to maintain a high bar of code quality. Amazon CodeGuru Reviewer offers this capability by carrying out automated code reviews for developers, based on trained machine learning models that can detect complex defects and provide intelligent, actionable recommendations to mitigate them. A quick overview of CodeGuru is covered in this blog post.

Security analysis is a critical part of a code review, and CodeGuru Reviewer offers this capability with a new set of security detectors. These detectors are geared towards identifying security risks from the OWASP top 10 categories and ensuring that your code follows best practices for AWS Key Management Service (AWS KMS), the Amazon Elastic Compute Cloud (Amazon EC2) API, and common Java crypto and TLS/SSL libraries. As of this writing, CodeGuru security analysis supports the Java language, so we use a Java application as our example.

In this post, we walk through the onboarding workflow to carry out a security analysis of a code repository and generate recommendations for a Java application.

 

Security workflow overview:

The new security workflow, introduced for CodeGuru Reviewer, uses both the source code and build artifacts to generate recommendations. The security detector evaluates build artifacts to generate security-related recommendations, whereas the other detectors continue to scan the source code. Because the detector uses build artifacts for evaluation, it can carry out whole-program, inter-procedural analysis to discover issues that span your code (for example, hard-coded credentials in one file that are passed to an API in another) and can reduce false positives by checking whether an execution path is valid. You must provide both the source code .zip file and the build artifact .zip file for a complete analysis.

Customers can run a security scan when they create a repository analysis. CodeGuru Reviewer provides an additional option to get both code and security recommendations. As explained in the following sections, CodeGuru Reviewer creates an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account for that Region, where you upload or copy your source code and build artifacts for the analysis. This repository analysis option can be run on Java code from any repository.

 

Prerequisites

Prepare the source code and artifact .zip files: If you do not have your Java code locally, download the source code that you want to evaluate for security and zip it. Similarly, if needed, download the build artifact .jar file for your source code and zip it. You must upload the source code and build artifact as separate .zip files, as described in the following sections; even if it is a single file (for example, a single .jar file), you still need to zip it. If the .zip file includes multiple files, CodeGuru discovers and analyzes the right ones. For our sample test, we use src.zip and jar.zip files saved locally.
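If you prefer to script the packaging step, a small sketch like the following can produce the two .zip files; the directory paths are hypothetical and not part of the original walkthrough:

import shutil

# Zip the source tree and the built .jar file separately, producing
# src.zip and jar.zip next to the script
shutil.make_archive("src", "zip", root_dir="path/to/java/source")
shutil.make_archive("jar", "zip", root_dir="path/to/build/libs")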

Creating an S3 bucket repository association:

This section summarizes the high-level steps to create the association of your S3 bucket repository.

1. On the CodeGuru console, choose Code reviews.

2. On the Repository analysis tab, choose Create repository analysis.

Screenshot of initiating the repository analysis

Figure: Screenshot of initiating the repository analysis

 

3. For the source code analysis, select Code and security recommendations.

4. For Repository name, enter a name for your repository.

5. Under Additional settings, for Code review name, enter a name for trackability purposes.

6. Choose Create S3 bucket and associate.

Screenshot to show selection of Security Code Analysis

Figure: Screenshot to show selection of Security Code Analysis

It takes a few seconds to create a new S3 bucket in the current Region. When it completes, you see the following screen.

Screenshot for Create repository analysis showing S3 bucket created

Figure: Screenshot for Create repository analysis showing S3 bucket created

 

7. Choose the Upload to the S3 bucket option, then choose Upload source code zip file and select the .zip file (src.zip) from your local machine to upload.

Screenshot of popup to upload code and artifacts from S3 bucket

Figure: Screenshot of popup to upload code and artifacts from S3 bucket

 

8. Similarly, choose Upload build artifacts zip file, select the .zip file (jar.zip) from your local machine, and upload it.

 

Screenshot for Create repository analysis showing S3 paths populated

Figure: Screenshot for Create repository analysis showing S3 paths populated

 

Alternatively, you can upload the source code and build artifacts as .zip files from any of your existing S3 buckets, as shown below.

9. Choose Browse S3 buckets for existing artifacts and upload from there as shown below:

 

Screenshot to upload code and artifacts from S3 bucket

Figure: Screenshot to upload code and artifacts from an existing S3 bucket

 

10. Choose Create repository analysis to trigger the code review.

A new pending entry is created as shown below.

 

Screenshot of code review in Pending state

Figure: Screenshot of code review in Pending state

After a few minutes, the recommendations are generated, including the security analysis. In the following case, 10 recommendations were generated.

Screenshot of repository analysis being completed

Figure: Screenshot of repository analysis being completed

 

For subsequent code reviews, you can use the same repository and upload new files, or create a new repository as shown below:

 

Screenshot of subsequent code review making repository selection

Figure: Screenshot of subsequent code review making repository selection

 

Recommendations

Apart from detecting security risks from the OWASP top 10 categories, the security detector detects deep security issues by analyzing data flow across multiple methods, procedures, and files.

Security-related recommendations are labeled Security. In the following example, we see a recommendation to remove hard-coded credentials and a non-security-related recommendation about refactoring code for better maintainability.

Screenshot of Recommendations generated

Figure: Screenshot of Recommendations generated

 

Below is another example of recommendations: one pointing out a potential resource leak, and a security recommendation pointing to the potential risk of a path traversal attack.

Screenshot of deep security recommendations

Figure: More examples of deep security recommendations

 

Because this post focuses on onboarding, we will cover the recommendations themselves in more detail in a separate post.

Disassociation of Repository (optional):

To remove the association between CodeGuru and the S3 bucket repository, navigate to the Repositories page, select the repository, and choose Disassociate repository.

Screenshot of disassociating the S3 bucket repo with CodeGuru

Figure: Screenshot of disassociating the S3 bucket repo with CodeGuru

 

Conclusion

This post reviewed the onboarding workflow for carrying out security analysis in CodeGuru Reviewer. We initiated a full repository analysis for Java code using a separate UI workflow and generated recommendations.

We hope this post was useful and enables you to conduct code analysis using Amazon CodeGuru Reviewer.

 

About the Author

Author's profile photo

 

Nikunj Vaidya is a Sr. Solutions Architect with Amazon Web Services, focusing in the area of DevOps services. He builds technical content for the field enablement and offers technical guidance to the customers on AWS DevOps solutions and services that would streamline the application development process, accelerate application delivery, and enable maintaining a high bar of software quality.

Tightening application security with Amazon CodeGuru

Post Syndicated from Brian Farnhill original https://aws.amazon.com/blogs/devops/tightening-application-security-with-amazon-codeguru/

Amazon CodeGuru is a developer tool powered by machine learning (ML) that provides intelligent recommendations for improving code quality and identifies an application’s most expensive lines of code. To help you find and remediate potential security issues in your code, Amazon CodeGuru Reviewer now includes an expanded set of security detectors. In this post, we discuss the new types of security issues CodeGuru Reviewer can detect.

Time to read: 9 minutes
Services used: Amazon CodeGuru

The new security detectors are now a feature in CodeGuru Reviewer for Java applications. These detectors focus on finding security issues in your code before you deploy it. They extend CodeGuru Reviewer by providing additional security-specific recommendations to the existing set of application improvements it already recommends. When an issue is detected, a remediation recommendation and explanation is generated. This allows you to find and remediate issues before the code is deployed. These findings can help in addressing the OWASP top 10 web application security risks, with many of the recommendations being based on specific issues customers have had in this space.

You can run a security scan by creating a repository analysis. CodeGuru Reviewer now provides an additional option to get both code and security recommendations for Java codebases. Selecting this option enables you to find potential security vulnerabilities before they are promoted to production, and helps keep users secure when they use your service.

Types of security issues CodeGuru Reviewer detects

Previously, CodeGuru Reviewer helped address security by detecting potential sensitive information leaks (such as personally identifiable information or credit card numbers). The additional CodeGuru Reviewer security detectors expand on this by addressing:

  • AWS API security best practices – Helps you follow security best practices when using AWS APIs, such as avoiding hard-coded credentials in API calls
  • Java crypto library best practices – Identifies when you’re not using best practices for common Java cryptography libraries, such as avoiding outdated cryptographic ciphers
  • Secure web applications – Inspects code for insecure handling of untrusted data, such as not sanitizing user-supplied input to protect against cross-site scripting, SQL injection, LDAP injection, path traversal injection, and more
  • AWS Security best practices – Developed in collaboration with AWS Security, these best practices help bring our internal expertise to customers

Examples of new security findings

The following are examples of findings that CodeGuru Reviewer security detectors can now help you identify and resolve.

AWS API security best practices

AWS API security best practice detectors inspect your code to identify issues that can be caused by not following best practices related to AWS SDKs and APIs. An example of a detected issue in this category is using hard-coded AWS credentials. Consider the following code:

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;

static String myKeyId = "AKIAX742FUDUQXXXXXXX";
static String mySecretKey = "MySecretKey";

public static void main(String[] args) {
    AWSCredentials creds = getCreds(myKeyId, mySecretKey);
}

static AWSCredentials getCreds(String id, String key) {
    return new BasicAWSCredentials(id, key);
}

In this code, the variables myKeyId and mySecretKey are hard-coded in the application. This may have been done to move quickly, but it can also lead to these values being discovered and misused.

In this case, CodeGuru Reviewer recommends using environment variables or an AWS profile to store these values, because these can be retrieved at runtime and aren’t stored inside the application (or its source code). Here you can see an example of what this finding looks like in the console:

An example of the CodeGuru reviewer finding for IAM credentials in the AWS console

The recommendation suggests using environment variables or an AWS profile instead, and that after you delete or rotate the affected key you monitor it with CloudWatch for any attempted use. Following the learn more link, you’ll see additional detail and recommended approaches for remediation, such as using the DefaultAWSCredentialsProviderChain. An example of how to remediate this in the preceding code is to update the getCreds() function:

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;

static AWSCredentials getCreds() {
    DefaultAWSCredentialsProviderChain creds =
        new DefaultAWSCredentialsProviderChain();
    return creds.getCredentials();
}

Java crypto library best practices

When working with data that must be protected, cryptography provides mechanisms to encrypt and decrypt the information. However, to ensure the security of this data, the application must use a strong and modern cipher. Consider the following code:

import javax.crypto.Cipher;

static final String CIPHER = "DES";

public void run() {
    cipher = Cipher.getInstance(CIPHER);
}

A cipher object is created with the DES algorithm. CodeGuru Reviewer recommends a stronger cipher to help protect your data. This is what the recommendation looks like in the console:

An example of the CodeGuru reviewer finding for encryption ciphers in the AWS console

Based on this, one example of how to address this is to substitute a different cipher:

static final String CIPHER = "RSA/ECB/OAEPPadding";

This is just one option for how it could be addressed. The CodeGuru Reviewer recommendation text suggests several options, and a link to documentation to help you choose the best cipher.

Secure web applications

When working with sensitive information in cookies, such as temporary session credentials, those values must be protected from interception. This is done by flagging the cookies as secure, which prevents them from being sent over an unsecured HTTP connection. Consider the following code:

import javax.servlet.http.Cookie;

public static void createCookie() {
    Cookie cookie = new Cookie("name", "value");
}

In this code, a new cookie is created that is not marked as secure. CodeGuru Reviewer notifies you that you could make a correction by adding:

cookie.setSecure(true);

This screenshot shows you an example of what the finding looks like.

An example CodeGuru finding that shows how to ensure cookies are secured.

AWS Security best practices

This category of detectors has been built in collaboration with AWS Security and assists in detecting many other issue types. Consider the following code, which illustrates how a string can be re-encrypted with a new key from AWS Key Management Service (AWS KMS):

import java.nio.ByteBuffer;
import com.amazonaws.services.kms.AWSKMS;
import com.amazonaws.services.kms.AWSKMSClientBuilder;
import com.amazonaws.services.kms.model.DecryptRequest;
import com.amazonaws.services.kms.model.EncryptRequest;

AWSKMS client = AWSKMSClientBuilder.standard().build();
ByteBuffer sourceCipherTextBlob = ByteBuffer.wrap(new byte[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 0});

DecryptRequest req = new DecryptRequest()
                         .withCiphertextBlob(sourceCipherTextBlob);
ByteBuffer plainText = client.decrypt(req).getPlaintext();

EncryptRequest res = new EncryptRequest()
                         .withKeyId("NewKeyId")
                         .withPlaintext(plainText);
ByteBuffer ciphertext = client.encrypt(res).getCiphertextBlob();

This approach puts the decrypted value at risk by decrypting and re-encrypting it locally. CodeGuru Reviewer recommends using the ReEncrypt method—performed on the server side within AWS KMS—to avoid exposing your plaintext outside AWS KMS. A solution that uses the ReEncrypt object looks like the following code:

import com.amazonaws.services.kms.model.ReEncryptRequest;

ReEncryptRequest req = new ReEncryptRequest()
                           .withCiphertextBlob(sourceCipherTextBlob)
                           .withDestinationKeyId("NewKeyId");

client.reEncrypt(req).getCiphertextBlob();

This screenshot shows you an example of what the finding looks like.

An example CodeGuru finding to show how to avoid decrypting and encrypting locally when it's not needed

Detecting issues deep in application code

Detecting security issues is harder when the contributing code is spread across multiple methods, procedures, and files. This separation of code keeps it manageable for humans to work with, but it obscures the end-to-end view of what is happening, making it harder, or even impossible, for a person inspecting the code to find complex security issues. CodeGuru Reviewer can see issues regardless of these boundaries, deeply assessing code and the flow of the application to find security issues throughout the application. An example of this depth exists in the following code:

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

private String decode(final String val, final String enc) {
    try {
        return java.net.URLDecoder.decode(val, enc);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return "";
}

public void pathTraversal(HttpServletRequest request) throws IOException {
    javax.servlet.http.Cookie[] theCookies = request.getCookies();
    String path = "";
    if (theCookies != null) {
        for (javax.servlet.http.Cookie theCookie : theCookies) {
            if (theCookie.getName().equals("thePath")) {
                path = decode(theCookie.getValue(), "UTF-8");
                break;
            }
        }
    }
    if (!path.equals("")) {
        String fileName = path + ".txt";
        String decStr = new String(org.apache.commons.codec.binary.Base64.decodeBase64(
            org.apache.commons.codec.binary.Base64.encodeBase64(fileName.getBytes())));
        java.io.FileOutputStream fileOutputStream = new java.io.FileOutputStream(decStr);
        java.io.FileDescriptor fd = fileOutputStream.getFD();
        System.out.println(fd.toString());
    }
}

This code presents an issue around path traversal, specifically relating to the Broken Access Control rule in the OWASP top 10 (specifically CWE-22). The issue is that a FileOutputStream is created using an external input (in this case, a cookie), and the input is not checked for invalid values that could traverse the file system. To add to the complexity of this sample, the input is encoded and decoded from Base64 so that the cookie value isn’t passed directly to the FileOutputStream constructor, and the parsing of the cookie happens in a different function. This is not something you would do in the real world, as it is needlessly complex, but it shows the need for tools that can deeply analyze the flow of data in an application. Here the value passed to the FileOutputStream isn’t an external value; it is the result of the encode/decode line and as such is a new object. However, CodeGuru Reviewer follows the flow of the application to understand that the input still came from a cookie, and as such it should be treated as an external value that needs to be validated. An example of a fix for the issue here would be to replace the pathTraversal function with the sample shown below:

static final String VALID_PATH1 = "./test/file1.txt";
static final String VALID_PATH2 = "./test/file2.txt";
static final String DEFAULT_VALID_PATH = "./test/file3.txt";

public void pathTraversal(HttpServletRequest request) throws IOException {
    javax.servlet.http.Cookie[] theCookies = request.getCookies();
    String path = "";
    if (theCookies != null) {
        for (javax.servlet.http.Cookie theCookie : theCookies) {
            if (theCookie.getName().equals("thePath")) {
                path = decode(theCookie.getValue(), "UTF-8");
                break;
            }
        }
    }
    String fileName = "";
    if (!path.equals("")) {
        if (path.equals(VALID_PATH1)) {
            fileName = VALID_PATH1;
        } else if (path.equals(VALID_PATH2)) {
            fileName = VALID_PATH2;
        } else {
            fileName = DEFAULT_VALID_PATH;
        }
        String decStr = new String(org.apache.commons.codec.binary.Base64.decodeBase64(
            org.apache.commons.codec.binary.Base64.encodeBase64(fileName.getBytes())));
        java.io.FileOutputStream fileOutputStream = new java.io.FileOutputStream(decStr);
        java.io.FileDescriptor fd = fileOutputStream.getFD();
        System.out.println(fd.toString());
    }
}

The main difference in this sample is that the path variable is tested against known good values that would prevent path traversal, and if one of the two valid path options isn’t provided, the third default option is used. In all cases the externally provided path is validated to ensure that there isn’t a path through the code that allows for path traversal to occur in the subsequent call. As with the first sample, the path is still encoded/decoded to make it more complicated to follow the flow through, but the deep analysis performed by CodeGuru Reviewer can follow this and provide meaningful insights to help ensure the security of your applications.

Extending the value of CodeGuru Reviewer

CodeGuru Reviewer already recommends different types of fixes for your Java code, such as concurrency and resource leaks. With these new categories, CodeGuru Reviewer can let you know about security issues as well, bringing further improvements to your applications’ code. The new security detectors operate in the same way that the existing detectors do, using static code analysis and ML to provide high confidence results. This can help avoid signaling non-issue findings to developers, which can waste time and erode trust in the tool.

You can provide feedback on recommendations in the CodeGuru Reviewer console or by commenting on the code in a pull request. This feedback helps improve the performance of the reviewer, so the recommendations you see get better over time.

Conclusion

Security issues can be difficult to identify and can impact your applications significantly. CodeGuru Reviewer security detectors help make sure you’re following security best practices while you build applications.

CodeGuru Reviewer is available for you to try. For full repository analysis, the first 30,000 lines of code analyzed each month per payer account are free. For pull request analysis, we offer a 90 day free trial for new customers. Please check the pricing page for more details. For more information, see Getting started with CodeGuru Reviewer.

About the author

Brian Farnhill

Brian Farnhill is a Developer Specialist Solutions Architect in the Australian Public Sector team. His background is building solutions and helping customers improve DevOps tools and processes. When he isn’t working, you’ll find him either coding for fun or playing online games.

Amazon Lookout for Vision – New ML Service Simplifies Defect Detection for Manufacturing

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/amazon-lookout-for-vision-new-machine-learning-service-that-simplifies-defect-detection-for-manufacturing/

Today, I’m excited to announce Amazon Lookout for Vision, a new machine learning (ML) service that helps customers in industrial environments to detect visual defects on production units and equipment in an easy and cost-effective way.

Can you spot the circuit board with the defect in these images?

Image of 3 circuit boards - one is faulty

Maybe you can if you are familiar with circuit boards, but I have to say that it took me a while to discover the error. Humans, when properly trained and well rested, are great at finding anomalies in a set of objects. However, when they are tired or not properly trained – like me in this example – they can be slow, prone to errors, and inconsistent.

That’s why many companies use machine vision technologies to detect anomalies. However, these technologies need to be calibrated with controlled lighting and camera viewpoints. In addition, you need to specify hard-coded rules that define what is a defect and what is not, making the technologies very specialized and complex to build.

Lookout for Vision is a new machine learning service that helps increase industrial product quality and reduce operational costs by automating visual inspection of product defects across production processes. Lookout for Vision uses deep learning models to replace hard-coded rules and handles the differences in camera angle, lighting and other challenges that arise from the operational environment. With Lookout for Vision, you can reduce the need for carefully controlled environments.

Using Lookout for Vision, you can detect damages to manufactured parts, identify missing components or parts, and uncover underlying process-related issues in your manufacturing lines.

How to Get Started With Lookout for Vision
The first thing I want to mention is that to use Lookout for Vision, you don’t need to be a machine learning expert. Lookout for Vision is a fully managed service and comes with anomaly detection models that can be optimized for your use case and your data.

There are several steps for using Lookout for Vision. The first is preparing the dataset, which includes creating a dataset of images and adding labels to the images. Then, Lookout for Vision uses this dataset to automatically train the ML model that learns to detect anomalies in your product. The final part is using the model in production. You can keep evaluating the performance of your trained model and improve it at any time using tools that Lookout for Vision provides.

Service console tutorial for getting started

Preparing the Data
To get started with the model, you first need a set of images of your product. For better results, include images with normal (no defects) and anomalous content (includes defects). To get started with training, you will need at least 20 normal images and 10 anomalous images.

There are many ways of importing images into Lookout for Vision from the AWS Management Console: You can provide manifests for annotated images using the Amazon SageMaker Ground Truth service, provide images from an S3 bucket or upload directly from your computer.

Different ways to import your images.

After you upload the images, you need to add labels to classify the images in your dataset as normal or anomalous. Labeling is a very important step, as this is the key information that Lookout for Vision uses to train the model for your use case.

For this demo, I import the images from an S3 bucket. If you’ve organized the images in your S3 bucket by folder name (/anomaly/01.jpeg), Lookout for Vision will automatically import the folder structure into corresponding labels.

Training the Model
When our dataset is ready, we need to train our model with it. The training button is enabled once you have the minimum number of labeled images: 20 normal and 10 anomalous.

Depending on the size of the dataset, training may take a while to complete: for me, it took around an hour to train the model with 100 images. Note that you will begin incurring costs when Lookout for Vision starts to actually train the model. After training is complete, your model is ready to detect anomalies in new images.

Screenshot of a model in training.

Evaluating the Model
There are a couple of ways to evaluate whether your model is ready to be deployed to production. The first is to review the performance metrics of the model, and the second is to run some production-like tests that help you verify whether the model is ready to be deployed.

There are three main performance metrics: precision, recall, and the F1 score. Precision measures the percentage of the model’s anomaly predictions that are correct, and recall measures the percentage of true defects the model identified. The F1 score combines precision and recall into a single measure of overall model performance.
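As a quick reference, the F1 score is the harmonic mean of precision and recall; the numbers below are illustrative and not taken from the screenshot:

def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.90, 0.80))  # ~0.847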

Screenshot of model performance metrics

If you want to run some production-like tests to verify if your model is ready, use the run trial detection feature. This will enable you to run your Lookout for Vision model and predict anomalies on new images. You can further improve the model by manually verifying the results and adding new training images.

Create a new job to predict anomalies.

I used the three images that appear at the beginning of this post for my trial detection. The trial detection job ran for 15-20 minutes, and after that Lookout for Vision used the trained model to classify the images as “Normal” or “Anomaly.” When Lookout for Vision finalizes the trial detection job, you can verify the results as correct or incorrect and add these images to the dataset.

Screenshot verifying the results of the trial

Using the Model in Production
To use Lookout for Vision, you need to integrate the AWS SDKs or CLI in the systems that are processing the images of the products in the manufacturing line, and internet connectivity is required for this to work. The first thing you need to do is to start the model. When using Lookout for Vision, you are billed for the time your model is running and making inferences. For example, if you start your model at 8 a.m. and stop it at 5 p.m., you will be billed for 9 hours.

# Example CLI
aws lookoutvision start-model \
  --project-name circuitBoard \
  --model-version 1 \
  --additional-output-config "Bucket=<OUTPUT_BUCKET>,Prefix=<PREFIX_KEY>" \
  --min-anomaly-detection-units 10

# Example response
{ "status" : "STARTING_HOSTING" }

When your model is ready, you can call the detect-anomalies API from Lookout for Vision.

# Example CLI
aws lookoutvision detect-anomalies \
  --project-name circuitBoard \
  --model-version 1

And this API will return a JSON response that shows if the image is an anomaly or not, along with the confidence level of that prediction.

{
    "DetectAnomalyResult": {
        "Source": {
            "Type": "direct"
        },
        "IsAnomalous": true,
        "Confidence": 0.97
    }
}

When you are done with detecting anomalies for the day, use the stop-model API. In the Lookout for Vision console, you can find code snippets that show how to use these APIs.
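For reference, here is a rough boto3 equivalent of those calls; the project name, image file, and parameter values are placeholders, so double-check them against the API reference for your SDK version:

import boto3

lookoutvision = boto3.client("lookoutvision")

# Send one image to the running model and read the prediction
with open("circuit_board.jpg", "rb") as image:
    response = lookoutvision.detect_anomalies(
        ProjectName="circuitBoard",
        ModelVersion="1",
        Body=image.read(),
        ContentType="image/jpeg",
    )

result = response["DetectAnomalyResult"]
print(result["IsAnomalous"], result["Confidence"])

# Stop the model when you are done to avoid ongoing inference charges
lookoutvision.stop_model(ProjectName="circuitBoard", ModelVersion="1")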

When you are using Lookout for Vision in production, you’ll find a dashboard that helps you sort and track production lines by the most defective line, the line with the most recent defects, and the line with the highest anomaly ratio.

Available Today
Lookout for Vision is available in all AWS Regions.

To get started with Amazon Lookout for Vision, visit the service page today.

Marcia

New – Amazon Lookout for Equipment Analyzes Sensor Data to Help Detect Equipment Failure

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-amazon-lookout-for-equipment-analyzes-sensor-data-to-help-detect-equipment-failure/

Companies that operate industrial equipment are constantly working to improve operational efficiency and avoid unplanned downtime due to component failure. They invest heavily and repeatedly in physical sensors (tags), data connectivity, data storage, and building dashboards over the years to monitor the condition of their equipment and get real-time alerts. The primary data analysis methods are single-variable threshold and physics-based modeling approaches, and while these methods are effective in detecting specific failure types and operating conditions, they can often miss important information that could be detected by deriving multivariate relationships for each piece of equipment.

With machine learning, more powerful technologies have become available that can provide data-driven models that learn from an equipment’s historical data. However, implementing such machine learning solutions is time-consuming and expensive owing to capital investment and training of engineers.

Today, we are happy to announce Amazon Lookout for Equipment, an API-based machine learning (ML) service that detects abnormal equipment behavior. With Lookout for Equipment, customers can bring in historical time series data and past maintenance events generated from industrial equipment that can have up to 300 data tags from components such as sensors and actuators per model. Lookout for Equipment automatically tests the possible combinations and builds an optimal machine learning model to learn the normal behavior of the equipment. Engineers don’t need machine learning expertise and can easily deploy models for real-time processing in the cloud.

Customers can then easily perform ML inference to detect abnormal behavior of the equipment. The results can be integrated into existing monitoring software or AWS IoT SiteWise Monitor to visualize the real-time output or to receive alerts if an asset tends toward anomalous conditions.

How Lookout for Equipment Works
Lookout for Equipment reads directly from Amazon S3 buckets. Customers can publish their industrial data in S3 and leverage Lookout for Equipment for model development. A user determines the value or time period to be used for training and assigns an appropriate label. Given this information, Lookout for Equipment launches a task to learn and creates the best ML model for each customer.

Because Lookout for Equipment is an automated machine learning tool, it gets smarter over time as users retrain their models with new data. This is useful for re-creating the model when previously unseen failures occur, or when the model drifts over time. Once the model is complete and ready for inference, Lookout for Equipment provides real-time analysis.

With the equipment data being published to S3, the user can schedule inference at a frequency ranging from 5 minutes to one hour. When the data arrives in S3, Lookout for Equipment fetches the new data on the desired schedule, performs inference, and stores the results in another S3 bucket.

Set up Lookout for Equipment with these simple steps:

  1. Upload data to S3 buckets
  2. Create datasets
  3. Ingest data
  4. Create a model
  5. Schedule inference (if you need real-time analysis)

1. Upload data
You need to upload tag data from equipment to any S3 bucket.
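
If you prefer to script this step, here is a minimal boto3 sketch; the bucket name, key, and local file name are hypothetical placeholders.

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and file names; replace them with your own.
s3.upload_file(
    Filename="pump_sensor_data.csv",                      # local CSV with your tag data
    Bucket="my-lookout-equipment-data",
    Key="training-data/pump/pump_sensor_data.csv",
)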

2. Create datasets

Select Create dataset, set the Dataset name, and set the Data schema. The data schema is like a data design document that defines the data to be fed in later. Then select Create.

creating datasets console
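
The console is the easiest way to do this, but if you want to automate dataset creation, here is a minimal sketch using the boto3 lookoutequipment client. The dataset name and the component/column layout of the schema are hypothetical assumptions; check the technical documentation for the exact schema format your data requires.

import json
import boto3

lookout = boto3.client("lookoutequipment")

# Hypothetical schema: one component ("pump") with a timestamp column and two sensor tags.
schema = {
    "Components": [
        {
            "ComponentName": "pump",
            "Columns": [
                {"Name": "Timestamp", "Type": "DATETIME"},
                {"Name": "Sensor1", "Type": "DOUBLE"},
                {"Name": "Sensor2", "Type": "DOUBLE"},
            ],
        }
    ]
}

lookout.create_dataset(
    DatasetName="my-pump-dataset",
    DatasetSchema={"InlineDataSchema": json.dumps(schema)},
)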

3. Ingest data
After a dataset is created, the next step is to ingest data. If you are familiar with Amazon Personalize or Amazon Forecast, doesn’t this screen feel familiar? Yes, Lookout for Equipment is as easy to use as those are.

Select Ingest data.

Ingesting data console

Specify the S3 bucket location where you uploaded your data, and an IAM role. The IAM role has to have a trust relationship to “lookoutequipment.amazonaws.com”. You can use the following trust policy for the test.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lookoutequipment.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The data format in the S3 bucket has to match the data schema you set up in step 2. Please check our technical documentation for more detail. Ingesting data takes from a few minutes to tens of minutes, depending on your data volume.
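
The same ingestion can also be started programmatically. Here is a minimal boto3 sketch; the bucket, prefix, and role ARN are hypothetical, and the role must carry the trust policy above plus read access to the bucket.

import boto3

lookout = boto3.client("lookoutequipment")

lookout.start_data_ingestion_job(
    DatasetName="my-pump-dataset",
    IngestionInputConfiguration={
        "S3InputConfiguration": {
            "Bucket": "my-lookout-equipment-data",
            "Prefix": "training-data/",
        }
    },
    RoleArn="arn:aws:iam::123456789012:role/LookoutEquipmentS3Access",
)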

4. Create a model
After data ingestion is completed, you can now train your ML model. Select Create new model. Fields shows a list of the fields in the ingested data; by default, no field is selected. Select the fields you want Lookout for Equipment to learn from. Lookout for Equipment automatically finds correlations across the specified fields and trains a model.

Image illustrates setting up fields.

If you know that your data includes periods of unusual behavior, you can optionally set maintenance windows to exclude that data from training.

setting up maintenance window

Optionally, you can split the ingested data into a training period and an evaluation period. The data in the evaluation period is checked against the trained model.

setting up evaluation window

Once you select Create, Lookout for Equipment starts to train your model. This process takes minutes to hours depending on your data volume. After training is finished, you can evaluate your model with the evaluation period data.

model performance console
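
If you would rather drive training from code, here is a minimal boto3 sketch. The model name and the training/evaluation time ranges are hypothetical placeholders that mirror the split configured in the console above.

from datetime import datetime

import boto3

lookout = boto3.client("lookoutequipment")

lookout.create_model(
    ModelName="my-pump-model",
    DatasetName="my-pump-dataset",
    # Hypothetical training and evaluation windows.
    TrainingDataStartTime=datetime(2020, 1, 1),
    TrainingDataEndTime=datetime(2020, 9, 30),
    EvaluationDataStartTime=datetime(2020, 10, 1),
    EvaluationDataEndTime=datetime(2020, 11, 30),
)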

5. Schedule inference
Now it is time to analyze your real-time data. Select Schedule inference, and set up your S3 bucket for input.

setting up input S3 bucket

You can also set the Data upload frequency, which is effectively the inference frequency, and the Offset delay time. Then, you need to set up the Output data location where Lookout for Equipment writes the inference results.

setting up inferenced output S3 bucket
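
The scheduler can also be created through the API. Here is a minimal boto3 sketch; the scheduler name, buckets, prefixes, and role ARN are hypothetical, and the upload frequency is assumed to be expressed as an ISO 8601 duration (for example PT15M), mapping to the 5-minute to 1-hour range described above.

import boto3

lookout = boto3.client("lookoutequipment")

lookout.create_inference_scheduler(
    ModelName="my-pump-model",
    InferenceSchedulerName="my-pump-scheduler",
    DataUploadFrequency="PT15M",        # how often new data is uploaded and inference runs
    DataDelayOffsetInMinutes=5,         # wait a few minutes for late-arriving data
    DataInputConfiguration={
        "S3InputConfiguration": {
            "Bucket": "my-lookout-equipment-data",
            "Prefix": "inference-input/",
        }
    },
    DataOutputConfiguration={
        "S3OutputConfiguration": {
            "Bucket": "my-lookout-equipment-data",
            "Prefix": "inference-output/",
        }
    },
    RoleArn="arn:aws:iam::123456789012:role/LookoutEquipmentS3Access",
)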

Amazon Lookout for Equipment Is in Preview Today
Amazon Lookout for Equipment is in preview today in US East (N. Virginia), Asia Pacific (Seoul), and Europe (Ireland), and you can see the documentation here.

– Kame

Amazon Monitron, a Simple and Cost-Effective Service Enabling Predictive Maintenance

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-monitron-a-simple-cost-effective-service-enabling-predictive-maintenance/

Today, I’m extremely happy to announce Amazon Monitron, a condition monitoring service that detects potential failures and allows you to track developing faults, enabling you to implement predictive maintenance and reduce unplanned downtime.

True story: A few months ago, I bought a new washing machine. As the delivery man was installing it in my basement, we were chatting about how unreliable these things seemed to be nowadays; never lasting more than a few years. As the gentleman made his way out, I pointed to my aging and poorly maintained water heater, telling him that I had decided to replace it in the coming weeks and that he’d be back soon to install a new one. Believe it or not, it broke down the next day. You can laugh at me, it’s OK. I deserve it for not planning ahead.

As annoying as this minor domestic episode was, it’s absolutely nothing compared to the tremendous loss of time and money caused by the unexpected failure of machines located in industrial environments, such as manufacturing production lines and warehouses. Any proverbial grain of sand can cause unplanned outages, and Murphy’s Law has taught us that they’re likely to happen in the worst possible configuration and at the worst possible time, resulting in severe business impacts.

To avoid breakdowns, reliability managers and maintenance technicians often combine four strategies:

  1. Run to failure: where equipment is operated without maintenance until it no longer operates reliably. When the repair is completed, equipment is returned to service; however, the condition of the equipment is unknown and failure is uncontrolled.
  2. Planned maintenance: where predefined maintenance activities are performed on a periodic or meter basis, regardless of condition. The effectiveness of planned maintenance activities is dependent on the quality of the maintenance instructions and planned cycle. It risks equipment being both over- and under-maintained, incurring unnecessary cost or still experiencing breakdowns.
  3. Condition-based maintenance: where maintenance is completed when the condition of a monitored component breaches a defined threshold. Monitoring physical characteristics such as tolerance, vibration, or temperature is a more effective strategy, requiring less maintenance and reducing maintenance costs.
  4. Predictive maintenance: where the condition of components is monitored, potential failures detected and developing faults tracked. Maintenance is planned at a time in the future prior to expected failure and when the total cost of maintenance is most cost-effective.

Condition-based maintenance and predictive maintenance require sensors to be installed on critical equipment. These sensors measure and capture physical quantities such as temperature and vibration, whose change is a leading indicator of a potential failure or a deteriorating condition.

As you can guess, building and deploying such maintenance systems can be a long, complex, and costly project involving bespoke hardware, software, infrastructure, and processes. Our customers asked us for help, and we got to work.

Introducing Amazon Monitron
Amazon Monitron is an easy and cost-effective condition monitoring service that allows you to monitor the condition of equipment in your facilities, enabling the implementation of a predictive maintenance program.

Illustration

Setting up Amazon Monitron is extremely simple. You first install Monitron sensors that capture vibration and temperature data from rotating machines, such as bearings, gearboxes, motors, pumps, compressors, and fans. Sensors send vibration and temperature measurements hourly to a nearby Monitron gateway, using Bluetooth Low Energy (BLE) technology, which allows the sensors to run for at least three years. The Monitron gateway is itself connected to your WiFi network, and sends sensor data to AWS, where it is stored and analyzed using machine learning and ISO 20816 vibration standards.

As communication is infrequent, up to 20 sensors can be connected to a single gateway, which can be located up to 30 meters away (depending on potential interference). Thanks to the scalability and cost efficiency of Amazon Monitron, you can deploy as many sensors as you need, including on pieces of equipment that until now weren’t deemed critical enough to justify the cost of traditional sensors. As with any data-driven application, security is our No. 1 priority. The Monitron service authenticates the gateway and the sensors to make sure that they’re legitimate. Data is also encrypted end-to-end, without any decryption taking place on the gateway.

Setting up your gateways and sensors only requires installing the Monitron mobile application on an Android mobile device with Bluetooth support for gateway setup, and NFC support for sensor setup. This is an extremely simple process, and you’ll be monitoring in minutes. Technicians will also use the mobile application to receive alerts indicating abnormal machine conditions. They can acknowledge these alerts and provide feedback to improve their accuracy (say, to minimize false alerts and missed anomalies).

Customers are already using Amazon Monitron today, and here are a couple of examples.

Fender Musical Instruments Corporation is an iconic brand and a leading manufacturer of stringed instruments and amplifiers. Here’s what Bill Holmes, Global Director of Facilities at Fender, told us: “Over the past year we have partnered with AWS to help develop a critical but sometimes overlooked part of running a successful manufacturing business which is knowing the condition of your equipment. For manufacturers worldwide, uptime of equipment is the only way we can remain competitive with a global market. Ensuring equipment is up and running and not being surprised by sudden breakdowns helps get the most out of our equipment. Unplanned downtime is costly both in loss of production and labor due to the firefighting nature of the breakdown. The Amazon Monitron condition monitoring system has the potential of giving both large industry as well as small ‘mom and pop shops’ the ability to predict failures of their equipment before a catastrophic breakdown shuts them down. This will allow for a scheduled repair of failing equipment before it breaks down.”

GE Gas Power is a leading provider of power generation equipment, solutions and services. It operates many manufacturing sites around the world, in which much of the manufacturing equipment is not connected nor monitored for health. Magnus Akesson, CIO at GE Gas Power Manufacturing says: “Naturally, we can reduce both maintenance costs and downtime, if we can easily and cheaply connect and monitor these assets at scale. Additionally, we want to take advantage of advanced algorithms to look forward, to know not just the current state but also predict future health and to detect abnormal behaviors. This will allow us to transition from time-based to predictive and prescriptive maintenance practices. Using Amazon Monitron, we are now able to quickly retrofit our assets with sensors and connecting them to real-time analytics in the AWS cloud. We can do this without having to require deep technical skills or having to configure our own IT and OT networks. From our initial work on vibration-prone tumblers, we are seeing this vision come to life at an amazing speed: the ease-of-use for the operators and maintenance team, the simplicity, and the ability to implement at scale is extremely attractive to GE. During our pilot, we were also delighted to see one-click capabilities for updating the sensors via remote Over the Air (OTA) firmware upgrades, without having to physically touch the sensors. As we grow in scale, this is a critical capability in order to be able to support and maintain the fleet of sensors.”

Now, let me show you how to get started with Amazon Monitron.

Setting up Amazon Monitron
First, I open the Monitron console. In just a few clicks, I create a project, and an administrative user allowed to manage it. Using a link provided in the console, I download and install the Monitron mobile application on my Android phone. Opening the app, I log in using my administrative credentials.

The first step is to create a site describing assets, sensors, and gateways. I name it “my-thor-project.”

Application screenshot

Let’s add a gateway. After enabling Bluetooth on my phone, I press the pairing button on the gateway.

Application screenshot

The name of the gateway appears immediately.

Application screenshot

I select the gateway, and I configure it with my WiFi credentials to let it connect to AWS. A few seconds later, the gateway is online.

Application screenshot

My next step is to create an asset that I’d like to monitor, say a process water pump set with a motor and a pump. I first create the asset itself, simply defining its name and the appropriate ISO 20816 class (a standard for measurement and evaluation of machine vibration).

Application screenshot

Then, I add a sensor for the motor.

Application screenshot

I start by physically attaching the sensor to the motor using the suggested adhesive. Next, I specify a sensor position, enable NFC on my smartphone, and tap my phone against the Monitron sensor that I attached to the motor. Within seconds, the sensor is commissioned.

Application screenshot

I repeat the same operation for the pump. Looking at my asset, I see that both sensors are operational.

Application screenshot

They are now capturing temperature and vibration information. Although there isn’t much to see for the moment, graphs are available in the mobile app.

Application screenshot

Over time, the gateway will keep sending this data securely to AWS, where it will be analyzed for early signs of failure. Should either of my assets exhibit these, I would receive an alert in the mobile application, where I could visualize historical data, and decide what the best course of action would be.

Getting Started
As you can see, Monitron makes it easy to deploy sensors enabling predictive maintenance applications. The service is available today in the US East (N. Virginia) region, and using it costs $50 per sensor per year.

If you’d like to evaluate the service, the Monitron Starter Kit includes everything you need (a gateway with a mounting kit, five sensors, and a power supply), and it’s available for $715. Then, you can scale your deployment with additional sensors, which you can buy in 5-packs for $575.

Starter kit picture

Give Amazon Monitron a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for Monitron.

– Julien

Special thanks to my colleague Dave Manley for taking the time to educate me on industrial maintenance operations.

New – Amazon DevOps Guru Helps Identify Application Errors and Fixes

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/amazon-devops-guru-machine-learning-powered-service-identifies-application-errors-and-fixes/

Today, we are announcing Amazon DevOps Guru, a fully managed operations service that makes it easy for developers and operators to improve application availability by automatically detecting operational issues and recommending fixes. DevOps Guru applies machine learning informed by years of operational excellence from Amazon.com and Amazon Web Services (AWS) to automatically collect and analyze data such as application metrics, logs, and events to identify behavior that deviates from normal operational patterns.

Once a behavior is identified as an operational problem or risk, DevOps Guru alerts developers and operators to the details of the problem so they can quickly understand the scope of the problem and possible causes. DevOps Guru provides intelligent recommendations for fixing problems, saving you time resolving them. With DevOps Guru, there is no hardware or software to deploy, and you only pay for the data analyzed; there is no upfront cost or commitment.

Distributed/Complex Architecture and Operational Excellence
As applications become more distributed and complex, operators need more automated practices to maintain application availability and reduce the time and effort spent on detecting, debugging, and resolving operational issues. Application downtime, caused for example by misconfiguration, unbalanced container clusters, or resource depletion, can result in significant revenue loss to an enterprise.

In many cases, companies must invest developer time in deploying and managing multiple monitoring tools for metrics, logs, traces, and events, and in storing that data in various locations for analysis. Developers or operators also spend time developing and maintaining custom alarms to alert them to issues such as sudden spikes in load balancer errors or unusual drops in application request rates. When a problem occurs, operators receive multiple alerts related to the same issue and spend time combining alerts to prioritize those that need immediate attention.

How DevOps Guru Works
The DevOps Guru machine learning models leverage AWS expertise gained from running highly available applications for the world’s largest e-commerce business over the past 20 years. DevOps Guru automatically detects operational problems, details the possible causes, and recommends remediation actions. It provides customers with a single console experience to search and visualize operational data by integrating data from multiple sources, including Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS CloudFormation, and AWS X-Ray, reducing the need to use multiple tools.

Getting Started with DevOps Guru
Activating DevOps Guru is as easy as accessing the AWS Management Console and clicking Enable. When enabling DevOps Guru, you can select the IAM role. You’ll then choose the AWS resources to analyze, which may include all resources in your AWS account or just the resources in specified AWS CloudFormation stacks. Finally, you can set an Amazon SNS topic if you want to send notifications from DevOps Guru via SNS.
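
If you prefer to set up the same configuration programmatically, here is a minimal boto3 sketch. The stack name and SNS topic ARN are hypothetical placeholders; it assumes you want DevOps Guru to analyze specific CloudFormation stacks rather than the whole account.

import boto3

guru = boto3.client("devops-guru")

# Analyze only the resources in a specific CloudFormation stack.
guru.update_resource_collection(
    Action="ADD",
    ResourceCollection={"CloudFormation": {"StackNames": ["my-serverless-stack"]}},
)

# Optional: send notifications to an SNS topic.
guru.add_notification_channel(
    Config={"Sns": {"TopicArn": "arn:aws:sns:us-east-1:123456789012:devops-guru-alerts"}}
)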

DevOps Guru then starts to accumulate logs and analyze your environment; this can take up to several hours. Let’s assume we have a simple serverless architecture, as shown in this illustration.

When the system has an error, the operator needs to investigate whether the error came from Amazon API Gateway, AWS Lambda, or Amazon DynamoDB. They must then determine the root cause and how to fix the issue. With DevOps Guru, this process is now simple.

When a developer accesses the DevOps Guru management console, they will see a list of insights. Each insight is a collection of anomalies created during the analysis of the AWS resources configured within your application; in this case, Amazon API Gateway, AWS Lambda, and Amazon DynamoDB. Each insight contains observations, recommendations, and contextual data you can use to better understand and resolve the operational problem.

The list below shows the insight name, the status (closed or ongoing), the severity, and when the insight was created. Without checking any logs, you can immediately see that in the most recent issue (line 1), a problem with a Lambda function within your stack was the cause of the issue, and that it was related to duration. If the issue were still occurring, the status would be listed as Ongoing. Since this issue was temporary, the status shows Closed.

Insights
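
The same list is available through the API. Here is a minimal boto3 sketch that lists ongoing reactive insights and drills into the first one; the response keys shown are assumptions based on the service’s insight model, so check the API reference for the exact shapes.

import boto3

guru = boto3.client("devops-guru")

# List reactive insights that are still ongoing.
response = guru.list_insights(StatusFilter={"Ongoing": {"Type": "REACTIVE"}})

for insight in response.get("ReactiveInsights", []):
    print(insight["Name"], insight["Severity"], insight["Status"])

# Drill into the first insight, if there is one.
if response.get("ReactiveInsights"):
    detail = guru.describe_insight(Id=response["ReactiveInsights"][0]["Id"])
    print(detail["ReactiveInsight"]["InsightTimeRange"])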

Let’s look deeper at the most recent anomaly by clicking through the first insight link. There are two tabs: Aggregated metrics and Graphed anomalies.

Aggregated metrics display metrics that are related to the insight. Operators can see which AWS CloudFormation stack created the resource that emitted the metric, the name of the resource, and its type. The red lines on a timeline indicate spans of time when a metric emitted unusual values. In this case, the operator can see the specific time of day on Nov 24 when the anomaly occurred for each metric.

Graphed anomalies display detailed graphs for each of the insight’s anomalies. Operators can investigate and look at an anomaly at the resource level and per statistic. The graphs are grouped by metric name.

metrics

By reviewing aggregated and graphed anomalies, an operator can see when the issue occurred, whether it is still ongoing, and which resources were impacted. It appears that the increased Lambda duration had a corresponding impact on API Gateway, causing timeouts and resulting in 5XX errors.

DevOps Guru also provides Relevant events, which are related to activities that changed your application’s configuration, as illustrated below.

Events

We can now see that a configuration change happened 2 hours before this issue occurred. If we click the point on the graph at 20:30 on 11/24, we can learn more and see the details of that change.

If you click through to the Ops event, the AWS CloudTrail logs would show that the configuration change was twofold: 1) a change in the provisioned concurrency on a Lambda function and 2) a reduction of the integration timeout on an API Gateway integration.

recommendations to fix

The recommendations tell the operator to evaluate the provisioned concurrency for Lambda and how to troubleshoot errors in API Gateway. After further evaluation, the operator will discover this is exactly correct. The root cause is a mismatch between the Lambda provisioned concurrency setting and the API Gateway integration latency timeout. When the Lambda configuration was updated in the last deployment, it altered how this application responded to burst traffic, and it no longer fit within the API Gateway timeout window. This error is unlikely to have been found in unit testing and will occur repeatedly if the configurations are not updated.

DevOps Guru can send alerts about anomalies to operators via Amazon SNS, and it is integrated with AWS Systems Manager OpsCenter, enabling customers to receive insights directly within OpsCenter to quickly diagnose and remediate issues.

Available for Preview Today
Amazon DevOps Guru is available for preview in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). To learn more about DevOps Guru, please visit our web site and technical documentation, and get started today.

– Kame