Tag Archives: SageMaker

Amazon SageMaker Continues to Lead the Way in Machine Learning and Announces up to 18% Lower Prices on GPU Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-leads-way-in-machine-learning/

Since 2006, Amazon Web Services (AWS) has been helping millions of customers build and manage their IT workloads. From startups to large enterprises to public sector, organizations of all sizes use our cloud computing services to reach unprecedented levels of security, resiliency, and scalability. Every day, they’re able to experiment, innovate, and deploy to production in less time and at lower cost than ever before. Thus, business opportunities can be explored, seized, and turned into industrial-grade products and services.

As Machine Learning (ML) became a growing priority for our customers, they asked us to build an ML service infused with the same agility and robustness. The result was Amazon SageMaker, a fully managed service launched at AWS re:Invent 2017 that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly.

Today, Amazon SageMaker is helping tens of thousands of customers in all industry segments build, train and deploy high quality models in production: financial services (Euler Hermes, Intuit, Slice Labs, Nerdwallet, Root Insurance, Coinbase, NuData Security, Siemens Financial Services), healthcare (GE Healthcare, Cerner, Roche, Celgene, Zocdoc), news and media (Dow Jones, Thomson Reuters, ProQuest, SmartNews, Frame.io, Sportograf), sports (Formula 1, Bundesliga, Olympique de Marseille, NFL, Guiness Six Nations Rugby), retail (Zalando, Zappos, Fabulyst), automotive (Atlas Van Lines, Edmunds, Regit), dating (Tinder), hospitality (Hotels.com, iFood), industry and manufacturing (Veolia, Formosa Plastics), gaming (Voodoo), customer relationship management (Zendesk, Freshworks), energy (Kinect Energy Group, Advanced Microgrid Systems), real estate (Realtor.com), satellite imagery (Digital Globe), human resources (ADP), and many more.

When we asked our customers why they decided to standardize their ML workloads on Amazon SageMaker, the most common answer was: “SageMaker removes the undifferentiated heavy lifting from each step of the ML process.” Zooming in, we identified five areas where SageMaker helps them most.

#1 – Build Secure and Reliable ML Models, Faster
As many ML models are used to serve real-time predictions to business applications and end users, making sure that they stay available and fast is of paramount importance. This is why Amazon SageMaker endpoints have built-in support for load balancing across multiple AWS Availability Zones, as well as built-in Auto Scaling to dynamically adjust the number of provisioned instances according to incoming traffic.

For even more robustness and scalability, Amazon SageMaker relies on production-grade open source model servers such as TensorFlow Serving, the Multi-Model Server, and TorchServe. A collaboration between AWS and Facebook, TorchServe is available as part of the PyTorch project, and makes it easy to deploy trained models at scale without having to write custom code.

In addition to resilient infrastructure and scalable model serving, you can also rely on Amazon SageMaker Model Monitor to catch prediction quality issues that could happen on your endpoints. By saving incoming requests as well as outgoing predictions, and by comparing them to a baseline built from a training set, you can quickly identify and fix problems like missing features or data drift.

Says Aude Giard, Chief Digital Officer at Veolia Water Technologies: “In 8 short weeks, we worked with AWS to develop a prototype that anticipates when to clean or change water filtering membranes in our desalination plants. Using Amazon SageMaker, we built a ML model that learns from previous patterns and predicts the future evolution of fouling indicators. By standardizing our ML workloads on AWS, we were able to reduce costs and prevent downtime while improving the quality of the water produced. These results couldn’t have been realized without the technical experience, trust, and dedication of both teams to achieve one goal: an uninterrupted clean and safe water supply.” You can learn more in this video.

#2 – Build ML Models Your Way
When it comes to building models, Amazon SageMaker gives you plenty of options. You can visit AWS Marketplace, pick an algorithm or a model shared by one of our partners, and deploy it on SageMaker in just a few clicks. Alternatively, you can train a model using one of the built-in algorithms, or your own code written for a popular open source ML framework (TensorFlow, PyTorch, and Apache MXNet), or your own custom code packaged in a Docker container.

You could also rely on Amazon SageMaker AutoPilot, a game-changing AutoML capability. Whether you have little or no ML experience, or you’re a seasoned practitioner who needs to explore hundreds of datasets, SageMaker AutoPilot takes care of everything for you with a single API call. It automatically analyzes your dataset, figures out the type of problem you’re trying to solve, builds several data processing and training pipelines, trains them, and optimizes them for maximum accuracy. In addition, the data processing and training source code is available in auto-generated notebooks that you can review, and run yourself for further experimentation. SageMaker Autopilot also now creates machine learning models up to 40% faster with up to 200% higher accuracy, even with small and imbalanced datasets.

Another popular feature is Automatic Model Tuning. No more manual exploration, no more costly grid search jobs that run for days: using ML optimization, SageMaker quickly converges to high-performance models, saving you time and money, and letting you deploy the best model to production quicker.

NerdWallet relies on data science and ML to connect customers with personalized financial products“, says Ryan Kirkman, Senior Engineering Manager. “We chose to standardize our ML workloads on AWS because it allowed us to quickly modernize our data science engineering practices, removing roadblocks and speeding time-to-delivery. With Amazon SageMaker, our data scientists can spend more time on strategic pursuits and focus more energy where our competitive advantage is—our insights into the problems we’re solving for our users.” You can learn more in this case study.
Says Tejas Bhandarkar, Senior Director of Product, Freshworks Platform: “We chose to standardize our ML workloads on AWS because we could easily build, train, and deploy machine learning models optimized for our customers’ use cases. Thanks to Amazon SageMaker, we have built more than 30,000 models for 11,000 customers while reducing training time for these models from 24 hours to under 33 minutes. With SageMaker Model Monitor, we can keep track of data drifts and retrain models to ensure accuracy. Powered by Amazon SageMaker, Freddy AI Skills is constantly-evolving with smart actions, deep-data insights, and intent-driven conversations.

#3 – Reduce Costs
Building and managing your own ML infrastructure can be costly, and Amazon SageMaker is a great alternative. In fact, we found out that the total cost of ownership (TCO) of Amazon SageMaker over a 3-year horizon is over 54% lower compared to other options, and developers can be up to 10 times more productive. This comes from the fact that Amazon SageMaker manages all the training and prediction infrastructure that ML typically requires, allowing teams to focus exclusively on studying and solving the ML problem at hand.

Furthermore, Amazon SageMaker includes many features that help training jobs run as fast and as cost-effectively as possible: optimized versions of the most popular machine learning libraries, a wide range of CPU and GPU instances with up to 100GB networking, and of course Managed Spot Training which lets you save up to 90% on your training jobs. Last but not least, Amazon SageMaker Debugger automatically identifies complex issues developing in ML training jobs. Unproductive jobs are terminated early, and you can use model information captured during training to pinpoint the root cause.

Amazon SageMaker also helps you slash your prediction costs. Thanks to Multi-Model Endpoints, you can deploy several models on a single prediction endpoint, avoiding the extra work and cost associated with running many low-traffic endpoints. For models that require some hardware acceleration without the need for a full-fledged GPU, Amazon Elastic Inference lets you save up to 90% on your prediction costs. At the other end of the spectrum, large-scale prediction workloads can rely on AWS Inferentia, a custom chip designed by AWS, for up to 30% higher throughput and up to 45% lower cost per inference compared to GPU instances.

Lyft, one of the largest transportation networks in the United States and Canada, launched its Level 5 autonomous vehicle division in 2017 to develop a self-driving system to help millions of riders. Lyft Level 5 aggregates over 10 terabytes of data each day to train ML models for their fleet of autonomous vehicles. Managing ML workloads on their own was becoming time-consuming and expensive. Says Alex Bain, Lead for ML Systems at Lyft Level 5: “Using Amazon SageMaker distributed training, we reduced our model training time from days to couple of hours. By running our ML workloads on AWS, we streamlined our development cycles and reduced costs, ultimately accelerating our mission to deliver self-driving capabilities to our customers.

#4 – Build Secure and Compliant ML Systems
Security is always priority #1 at AWS. It’s particularly important to customers operating in regulated industries such as financial services or healthcare, as they must implement their solutions with the highest level of security and compliance. For this purpose, Amazon SageMaker implements many security features, making it compliant with the following global standards: SOC 1/2/3, PCI, ISO, FedRAMP, DoD CC SRG, IRAP, MTCS, C5, K-ISMS, ENS High, OSPAR, and HITRUST CSF. It’s also HIPAA BAA eligible.

Says Ashok Srivastava, Chief Data Officer, Intuit: “With Amazon SageMaker, we can accelerate our Artificial Intelligence initiatives at scale by building and deploying our algorithms on the platform. We will create novel large-scale machine learning and AI algorithms and deploy them on this platform to solve complex problems that can power prosperity for our customers.”

#5 – Annotate Data and Keep Humans in the Loop
As ML practitioners know, turning data into a dataset requires a lot of time and effort. To help you reduce both, Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to annotate and build highly accurate training datasets at any scale (text, image, video, and 3D point cloud datasets).

Says Magnus Soderberg, Director, Pathology Research, AstraZeneca: “AstraZeneca has been experimenting with machine learning across all stages of research and development, and most recently in pathology to speed up the review of tissue samples. The machine learning models first learn from a large, representative data set. Labeling the data is another time-consuming step, especially in this case, where it can take many thousands of tissue sample images to train an accurate model. AstraZeneca uses Amazon SageMaker Ground Truth, a machine learning-powered, human-in-the-loop data labeling and annotation service to automate some of the most tedious portions of this work, resulting in reduction of time spent cataloging samples by at least 50%.

Amazon SageMaker is Evaluated
The hundreds of new features added to Amazon SageMaker since launch are testimony to our relentless innovation on behalf of customers. In fact, the service was highlighted in February 2020 as the overall leader in Gartner’s Cloud AI Developer Services Magic Quadrant. Gartner subscribers can click here to learn more about why we have an overall score of 84/100 in their “Solution Scorecard for Amazon SageMaker, July 2020”, the highest rating among our peer group. According to Gartner, we met 87% of required criteria, 73% of preferred, and 85% of optional.

Announcing a Price Reduction on GPU Instances

To thank our customers for their trust and to show our continued commitment to make Amazon SageMaker the best and most cost-effective ML service, I’m extremely happy to announce a significant price reduction on all ml.p2 and ml.p3 GPU instances. It will apply starting October 1st for all SageMaker components and across the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), EU (London), Canada (Central), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Seoul), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and AWS GovCloud (US-Gov-West).

Instance NamePrice Reduction
ml.p2.xlarge-11%
ml.p2.8xlarge-14%
ml.p2.16xlarge-18%
ml.p3.2xlarge-11%
ml.p3.8xlarge-14%
ml.p3.16xlarge-18%
ml.p3dn.24xlarge-18%

Getting Started with Amazon SageMaker
As you can see, there are a lot of exciting features in Amazon SageMaker, and I encourage you to try them out! Amazon SageMaker is available worldwide, so chances are you can easily get to work on your own datasets. The service is part of the AWS Free Tier, letting new users work with it for free for hundreds of hours during the first two months.

If you’d like to kick the tires, this tutorial will get you started in minutes. You’ll learn how to use SageMaker Studio to build, train, and deploy a classification model based on the XGBoost algorithm.

Last but not least, I just published a book named “Learn Amazon SageMaker“, a 500-page detailed tour of all SageMaker features, illustrated by more than 60 original Jupyter notebooks. It should help you get up to speed in no time.

As always, we’re looking forward to your feedback. Please share it with your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Optimize Python ETL by extending Pandas with AWS Data Wrangler

Post Syndicated from Satoshi Kuramitsu original https://aws.amazon.com/blogs/big-data/optimize-python-etl-by-extending-pandas-with-aws-data-wrangler/

Developing extract, transform, and load (ETL) data pipelines is one of the most time-consuming steps to keep data lakes, data warehouses, and databases up to date and ready to provide business insights. You can categorize these pipelines into distributed and non-distributed, and the choice of one or the other depends on the amount of data you need to process.

Apache Spark is widely used to build distributed pipelines, whereas Pandas is preferred for lightweight, non-distributed pipelines. With the second use case in mind, the AWS Professional Service team created AWS Data Wrangler, aiming to fill the integration gap between Pandas and several AWS services, such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, AWS Glue, Amazon Athena, Amazon Aurora, Amazon QuickSight, and Amazon CloudWatch Log Insights.

AWS Data Wrangler is an open-source Python library that enables you to focus on the transformation step of ETL by using familiar Pandas transformation commands and relying on abstracted functions to handle the extraction and load steps.

You can use AWS Data Wrangler in different environments on AWS and on premises (for more information, see Install). This post focuses on data preparation for a data science project on Jupyter. By the end of this walkthrough, you will be able to set up AWS Data Wrangler on your Amazon SageMaker notebook.

Use case overview

In the following walkthrough, you use data stored in the NOAA public S3 bucket. For more information, see NOAA Global Historical Climatology Network Daily. The objective is to convert 10 CSV files (approximately 240 MB total) to a partitioned Parquet dataset, store its related metadata into the AWS Glue Data Catalog, and query the data using Athena to create a data analysis.

Configuring Amazon S3

Your first step is to create an S3 bucket to store the Parquet dataset.

  1. On the Amazon S3 console, choose Create bucket.
  2. For Bucket name, enter a name for your bucket.

  1. Choose Create.

Creating a new database in the Data Catalog

The Data Catalog is an Apache Hive-compatible managed metadata storage that lets you store, annotate, and share metadata on AWS.

For this use case, you use it to store the metadata associated with your Parquet dataset. The Data Catalog is integrated with many analytics services, including Athena, Amazon Redshift Spectrum, and Amazon EMR (Apache Spark, Apache Hive, and Presto).

  1. On the AWS Glue console, choose Databases.
  2. Choose Add database.
  3. For Database name, enter awswrangler_test.
  4. Choose Create.

Launching an Amazon SageMaker notebook

An Amazon SageMaker notebook is a managed instance running the Jupyter Notebook app. For this use case, you use it to write and run your code.

  1. On the Amazon SageMaker console, choose Notebook instance.
  2. Choose Create a notebook instance.
  3. For Notebook instance name, enter a name.
  4. For IAM role, choose an existing AWS Identity and Access Management (IAM) role or create a role that allows you to run Amazon SageMaker and grants access to Amazon S3, Athena, and AWS Glue for the related resources.

  1. Wait for the notebook status to show as InService.
  2. Choose Open Jupyter from the notebook instance you created.

Exploring the data

This section walks you through several notebook paragraphs to expose how to install and use AWS Data Wrangler.

  1. On Jupyter console, under New, choose conda_python3.
  2. To install AWS Data Wrangler, enter the following code:
    !pip install awswrangler

  3. To avoid dependency conflicts, restart the notebook kernel by choosing kernel -> Restart.
  4. Import the library given the usual alias wr:
    import awswrangler as wr

  5. List all files in the NOAA public bucket from the decade of 1880:
    wr.s3.list_objects("s3://noaa-ghcn-pds/csv/188")

The following screenshot shows the output.

  1. Load the whole decade (10 files) into a Pandas DataFrame using the Amazon S3 prefix s3://noaa-ghcn-pds/csv/188:
    col_names = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
    
    df = wr.s3.read_csv(
        path="s3://noaa-ghcn-pds/csv/188",
        names=col_names,
        parse_dates=["dt", "obs_time"]  # Hint to parse these columns as date instead of strings
    )

    The following screenshot shows the output.

  1. Create a new column extracting the year from the dt column (the new column is useful for creating partitions in the Parquet dataset):
    df["year"] = df["dt"].dt.year

The following screenshot shows the output.

  1. Store the Pandas DataFrame in the S3 bucket you created in the beginning of this post (replace the [BUCKET] placeholder in the code with your bucket name):
    wr.s3.to_parquet(
        df=df,
        path="s3://[BUCKET]/noaa/",
        dataset=True,
        database="awswrangler_test",
        table="noaa",
        partition_cols=["year"]
    );

The preceding code creates the table noaa in the awswrangler_test database in the Data Catalog.

  1. After processing this, you can confirm the Parquet files exist in Amazon S3 and the table noaa is in AWS Glue data catalog. See the following code:
    wr.s3.list_objects("s3://[BUCKET]/noaa/")
    wr.catalog.table(database="awswrangler_test", table="noaa")

The following screenshot shows the output.

  1. Run a SQL query from Athena that filters only the US maximum temperature measurements of the last 3 years (1887–1889) and receive the result as a Pandas DataFrame:
    sql = """
    SELECT
        dt,
        (value / 10.0) AS temperature  -- Converting tenths of degrees C to regular degrees C
    FROM noaa
    WHERE year BETWEEN 1887 AND 1889  -- Only last 3 years (PARTITION filter)
    AND substr(id, 1, 2)='US'  -- Only U.S. stations
    AND element='TMAX'  -- Only Maximum temperature elements
    AND q_flag is NULL  -- Only HIGH quality measurement
    """
    
    df = wr.athena.read_sql_query(sql, database="awswrangler_test")

The following screenshot shows the output.

The following two queries illustrate how you can visualize the data.

  1. To plot the average maximum temperature measured in the tracked station, enter the following code:
    %matplotlib inline
    df.groupby("dt").mean().plot();

The following screenshot shows the output.

  1. To plot a moving average of the previous metric with a 30-day window, enter the following code:
    %matplotlib inline
    df.groupby("dt").mean().rolling(window=30, center=True).mean().plot();

The following screenshot shows the output.

Cleaning up

To avoid incurring future charges, delete the resources from the following services:

  1. AWS Glue database
    • On the AWS Glue console, choose the database you created.
    • From the Actions drop-down menu, choose Delete database.
    • Choose Delete.
  2. Amazon SageMaker notebook
    • On the Amazon SageMaker console, choose the notebook instance you created.
    • From the Actions drop-down menu, choose Stop.
    • When the status shows as Stopped, choose Database.
    • Choose Delete.
  3. S3 bucket
    • On the Amazon S3 console, choose Buckets.
    • Choose the bucket you created.
    • Choose Empty and enter your bucket name.
    • Choose Confirm.
    • Choose Delete and enter your bucket name.
    • Choose Delete bucket.
  4. IAM Role
    • On the IAM console, choose Roles.
    • Choose the role you attached to Amazon SageMaker.
    • Choose Delete role.
    • Choose Yes.

Conclusion

Installing AWS Data Wrangler is a breeze. With a single command, you can connect ETL tasks to multiple data sources and different data services. The library is a work in progress, with new features and enhancements added regularly. For more tutorials, see the GitHub repo.

 


About the Authors

Satoshi Kuramitsu is a Solutions Architect in AWS. His favorite AWS services are AWS Glue, Amazon Kinesis, and Amazon S3.

 

 

 

 

 

Igor Tavares is a Data & Machine Learning Engineer in the AWS Professional Services team and the original creator of AWS Data Wrangler.

 

 

 

Field Notes: Inference C++ Models Using SageMaker Processing

Post Syndicated from Qingwei Li original https://aws.amazon.com/blogs/architecture/field-notes-inference-c-models-using-sagemaker-processing/

Machine learning has existed for decades. Before the prevalence of doing machine learning with Python, many other languages such as Java, and C++ were used to build models. Refactoring legacy models in C++ or Java could be forbiddingly expensive and time consuming. Customers need to know how they can bring their legacy models in C++ to the cloud, so that they can run model inference faster and at a lower cost.

Amazon SageMaker Processing is a new capability of Amazon SageMaker for running processing and model evaluation workloads with a fully managed experience. Amazon SageMaker Processing enables customers to run analytics jobs for data engineering and model evaluation on Amazon SageMaker easily, and at scale. SageMaker Processing allows customers to enjoy the benefits of a fully managed environment with all the security and compliance built into Amazon SageMaker.

In this blog post, we demonstrate inferencing a C++ model using SageMaker Processing. We first explain the C++ program we use to represent a simple linear regression model, and the Python script we use to run inference. Then, we build a custom container that contains the C++ model and Python script. Lastly, we run a SageMaker ScriptProcessor job for inference. The code from this post is available in the GitHub repo.

Prerequisites

To run this code, you need to have permissions to access Amazon S3, push a Docker image to Amazon ECR, and create SageMaker Processing jobs.

Prepare a C++ Model

We use a simple C++ test file for demonstration purposes. This C++ program accepts input data as a series of strings separated by a comma. For example, “2,3“ represents a row of input data, labeled 2 and 3 in two separate columns.

We use a simple linear regression model y=x1 + x2 in this blog post for demonstration purposes. Customer can modify the C++ inference code to inference more realistic and sophisticated models.  The C++ code is made up of the following steps:

  • Receives data record for inferencing from C++ command line parameters.
  • Parses out data columns and stores data in a C++ vector. We use “,” to separate data columns.
  • Loops through data columns and calculates the sum.
  • Prints out the result to standard output stream.

We can compile the C++ program to an executable file using g++. The complete C++ script is shown in the following code:

#include <sstream>
#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <iostream>
using namespace std;

void print(std::vector<int> const &input)
{
    for (int i = 0; i < input.size(); i++)
    {
        std::cout << input.at(i);
        if (i!=input.size()-1)
            cout<< ',';
    }
}


std::vector<std::string> split(const std::string& s, char delimiter)
{
   std::vector<std::string> tokens;
   std::string token;
   std::istringstream tokenStream(s);
   while (std::getline(tokenStream, token, delimiter))
   {
      tokens.push_back(token);
   }
   return tokens;
}


int main(int argc, char* argv[])
{
    vector<int> result;
    int counter = 0;
    int result_temp = 0;
    
    //assuming one argv
    string t1(argv[1]);
    vector<string> temp_str = split(t1, ',');
    vector<string>::iterator pos; 

    for (pos = temp_str.begin(); pos < temp_str.end(); pos++)
    {
        int temp_int;
        istringstream(*pos) >> temp_int;
        
        if (counter == 0)
        {
            result_temp += temp_int;
            counter++;
            continue;
        }
        if (counter == 1)
            result_temp += temp_int;
            result.push_back(result_temp);
            result_temp = 0;
            counter = 0;
    }    
    print(result);
    return 0;
}

Create a SageMaker Processing script

This notebook uses the ScriptProcessor class from the Amazon SageMaker Python SDK. The ScriptProcessor class runs a Python script with your own Docker image that processes input data, and saves the processed data in Amazon S3.  For more information, review Run Scripts with Your own Processing Container.

When the processing job starts, the data files are automatically downloaded by SageMaker from S3 to the designated local directory in the processing compute instance.

Your Python script, process_script.py, first finds all data files under /opt/ml/processing/input/ directory. By default, when you use multiple instances, the data files from S3 are duplicated to each processing compute instance. That means every instance gets the full dataset. By setting s3_data_distribution_type='ShardedByS3Key' , each instance gets approximately 1/n of the number of total input date files, where n is the number of compute instances. For more effective parallel processing, partition input data into multiple files to help ensure each node processes a different set of input data.

The Python script reads each data file into memory and converts it into a long string ready for C++ executable to consume. The subprocess module from Python runs the C++ executable and connects to output and error pipes. The output is saved as a CSV file to /opt/ml/processing/output directory. Upon completion, SageMaker Processing uploads output files in this directory from every Processing instance to Amazon S3.

def call_one_exe(a):
    p = subprocess.Popen(["./a.out",
 a],stdout=subprocess.PIPE)
    p_out, err= p.communicate()
    output = p_out.decode("utf-8")
    return output.split(',')

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    #user can pass their own argument from Processor. 
    
    args, _ = parser.parse_known_args()
    print('Received arguments {}'.format(args))
    
    files = glob('/opt/ml/processing/input/*.csv')
    for i, f in enumerate(files):
        try:
            print(f)
            data = pd.read_csv(f, header=None)
            string = str(list(data.values.flat)).replace(' ','')[1:-1]
            predictions = call_one_exe(string)
            output_path = os.path.join('/opt/ml/processing/output', str(i)+'_out.csv')
            print('Saving training features to {}'.format(output_path))
            pd.DataFrame({'results':predictions}).to_csv(output_path, header=False, index=False)
        except Exception as e:
            print(str(e))
            

Build your own SageMaker Processing container

The processing container is defined as shown in the following image. We have Anaconda and Pandas installed into the container. a.out is the C++ executable that contains the model inference logic. process_script.py is the Python script we use to call C++ executable and save results. We explain more about the C++ program and process_script.py in a later paragraph. Now let us build the Docker container and push it to Amazon ECR. The Dockerfile looks like the following code:

FROM ubuntu:16.04

RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl 

RUN curl -LO http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -bfp /miniconda3 && \
    rm Miniconda3-latest-Linux-x86_64.sh

ENV PATH=/miniconda3/bin:${PATH}

RUN conda update -y conda && \
    conda install -c anaconda scipy

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1 PYTHONIOENCODING=UTF-8 LANG=C.UTF-8 LC_ALL=C.UTF-8

RUN pip install --no-cache -I scikit-learn==0.20.0 pandas==1.0.3 boto3 sagemaker retrying
ADD process_script.py /
ADD a.out /

Set up the ScriptProcessor and run your script

We have 10 sample data files included in this demo. Each file contains 5000 rows of arbitrarily generated data. We first upload these files to Amazon S3. We use one  ml.c5.xlarge instance for inference. You can increase the number of instance counts for a bigger dataset. Amazon SageMaker Processing runs the script in similar way as the following command, where EntryPoint is process_script.py and ImageUri is the Docker image we built earlier.

docker run --entry-point [EntryPoint] [ImageUri]

The SageMaker Processing job is set up as following,

role = get_execution_role()
script_processor = ScriptProcessor(command=['python3'],
                image_uri=Account_number + '.dkr.ecr.us-east-1.amazonaws.com/cpp_processing:latest',
                role=role,
                instance_count=1,
                base_job_name = 'run-exe-processing',
                instance_type='ml.c5.xlarge')
output_location = os.path.join('s3://',default_s3_bucket, 'processing_output')
script_processor.run(code='process_script.py',
                     inputs=[ProcessingInput(
                        source=input_data,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(source='/opt/ml/processing/output',
                                               destination=output_location)]
                    )

After the processing job starts, Amazon SageMaker displays job progress. Information such as Job Name, input and output locations are reported. Upon completion, we can review a few rows of the output to make sure that the processing job was successful.

print('Top 5 rows from 1_out.csv')
!aws s3 cp $output_location/0_out.csv - | head -n5

Conclusion

In this post, we used Amazon SageMaker Processing to run inference on C++ models. Customers can bring legacy C++ models to SageMaker for faster inference at a lower cost. For more information, review Amazon SageMaker Processing.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Secure deployment of Amazon SageMaker resources

Post Syndicated from Rajesh Ramchander original https://aws.amazon.com/blogs/security/secure-deployment-of-amazon-sagemaker-resources/

Amazon SageMaker, like other services in Amazon Web Services (AWS), includes security-related parameters and configurations that you can use to improve the security posture of resources as you deploy them. However, many of these security-related parameters are optional, allowing you to deploy resources without them. While this might be acceptable in the initial exploration stage, customers want resources to be deployed more securely in production.

In this post I will discuss three approaches for deploying Amazon SageMaker resources more securely and highlight some pros and cons with each approach.

Before you begin

This post assumes general familiarity with machine learning and Amazon SageMaker. In addition, it assumes knowledge of the services used to implement security controls, including:

Approaches

Amazon SageMaker contains security-related parameters for the secure deployment of resources within it. For example, when creating an Amazon SageMaker notebook instance, root access on the notebook instance can be disabled. Another example is when creating an Amazon SageMaker training job, it can be set up to access other services like Amazon Simple Storage Service (Amazon S3) through an endpoint in the customer’s Amazon Virtual Private Cloud (Amazon VPC).

However, these and most other security-related parameters and configurations are optional. As examples of less-secure configuration, Amazon SageMaker notebook instances can be created with root access enabled, and training jobs can access Amazon S3 over the public endpoints.

There are two main methods of implementing controls to improve the security of AWS services during deployment. One of them is preventive and uses controls to stop an event from occurring. The other is responsive, and uses controls that are applied in response to events.

Preventive controls protect workloads and mitigate threats and vulnerabilities. A couple of approaches to implement preventive controls are:

  • Use IAM condition keys supported by the service to ensure that resources without necessary security controls cannot be deployed.
  • Use the AWS Service Catalog to invoke AWS CloudFormation templates that deploy resources with all the necessary security controls in place.

Responsive controls drive remediation of potential deviations from security baselines. An approach to implement responsive controls is:

  • Use CloudWatch Events to catch resource creation events, then use a Lambda function to validate that resources were deployed with the necessary security controls, or terminate resources any if the necessary security controls aren’t present.

The next few sections talk about each of these approaches in respect to Amazon SageMaker.

IAM condition keys approach

IAM condition keys can be used to improve security by preventing resources from being created without security controls. When a principal makes an API request to AWS to create a resource, the request information is gathered into a request context. This request context is compared to conditions in the principal’s policy. If the conditions pass, the API request is allowed to proceed and the resource will be created. However, if the conditions fail, the API request is stopped and the resource won’t be created.

The optional Condition element (or block) in an IAM policy is where expressions are built using condition operators (such as StringEquals or NumericLessThan). These condition expressions match the condition keys and values in the policy to the keys and values in the request context. The condition key specified in a condition element can be global or service-specific.

A condition element has the following syntax:


"Condition": {
   "{condition-operator}": {
      "{condition-key}": "{condition-value}"
   }
}

The following condition element contains an Amazon SageMaker service-specific condition key to ensure that when an Amazon SageMaker notebook instance is created, the request must ask that root access on the notebook instance be disabled. If a request is made, and the request doesn’t ask that root access on the notebook instance be disabled, it will be denied.


"Condition": {
   "StringEquals": {
      "sagemaker:RootAccess": "Disabled"
   }
}

The AWS User Guide topic IAM JSON Policy Elements: Condition provides more information.

Amazon SageMaker IAM condition keys

Amazon SageMaker supports a few global condition context keys and adds several Amazon SageMaker service-specific condition keys.

Global condition context keys

Global condition context keys are documented in AWS Global Condition Context Keys. Global condition context keys start with an aws: prefix. The following global condition context keys are applicable to Amazon SageMaker.

  • aws:RequestTag/${TagKey} – This key is used to compare the tag key-value pair that was passed in the request with the tag pair specified in the policy.
  • aws:ResourceTag/${TagKey} – This key is used to compare the tag key-value pair that is specified in the policy with the key-value pair attached to the resource.
  • aws:SourceIp – This key is used to compare the requester’s IP address with the IP address specified in the policy.
  • aws:SourceVpc – This key is used to check whether the request comes from the Amazon VPC specified in the policy.
  • aws:SourceVpce – This key is used to compare the Amazon VPC endpoint identifier of the request with the endpoint ID specified in the policy.
  • aws:TagKeys – This key is used to compare the tag keys in the request with the keys specified in the policy.

Amazon SageMaker service-specific condition keys

Amazon SageMaker service-specific condition keys are documented in Actions, Resources, and Condition Keys for Amazon SageMaker and Amazon SageMaker Identity-Based Policy Examples. They have a sagemaker: prefix.

  • sagemaker:AcceleratorTypes – This key is used to use a specific Amazon Elastic Inference accelerator when creating or updating notebook instances and when creating endpoint configurations. Elastic Inference allows addition of inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.
  • sagemaker:DirectInternetAccess – This key is used to control direct internet access from notebook instances. Allowed values are Enabled and Disabled. The default behavior is to allow direct internet access. Direct internet access should be disabled to prevent unfettered internet access after connecting notebook instances to the customer’s Amazon VPC. This can be done by using the sagemaker:VPCSubnets and sagemaker:VPCSecurityGroupIds parameters.
  • sagemaker:FileSystemAccessMode – Amazon SageMaker can be used in conjunction with Amazon Elastic File System (Amazon EFS) or Amazon FSx file systems for training jobs and hyperparameter tuning jobs. This key is used to specify the access mode of the directory associated with the input data channel. The directory can be mounted either in read-only or read-write mode.
  • sagemaker:FieSystemDirectoryPath – This key is used to specify the file system directory path associated with the resource in the training and hyperparameter tuning request.
  • sagemaker:FileSystemId – This key is used to specify the file system ID associated with the resource in the training and hyperparameter tuning request.
  • sagemaker:FileSystemType – This key is used to specify the file system type associated with the resource in the training and hyperparameter tuning request.
  • sagemaker:InstanceTypes – This key is used to specify the list of all instance types for notebook instances, training jobs, hyperparameter tuning jobs, batch transform jobs, and endpoint configurations for hosting real-time inferencing. Restricting instance types can be done to only allow enhanced-security Nitro instances or to control costs by not allowing GPU instances.
  • sagemaker:InterContainerTrafficEncryption – This key is used to control inter-container traffic encryption for distributed training and hyperparameter tuning jobs. Allowed values are true and false. The default value is false. Distributed machine learning frameworks and algorithms usually transmit information that is directly related to the model such as weights, not the training dataset. This parameter should be set to true to comply with regulatory requirements with the understanding that it could increase the training time of distributed jobs.
  • sagemaker:MaxRuntimeInSeconds – This key is used to control costs by specifying the maximum length of time, in seconds, that the training, hyperparameter tuning, or compilation job can run. If a job doesn’t complete within this time, Amazon SageMaker ends the job. If a value isn’t specified, the default value is 1 day. The maximum value that can be specified is 28 days.
  • sagemaker:ModelArn – This key is used to specify the Amazon Resource Name (ARN) of the model associated for batch transform jobs and endpoint configurations for hosting real-time inferencing. When creating a batch transform job or endpoint configuration, a model name is passed in the API request. The name of the model must be associated with model ARN specified in the policy.
  • sagemaker:NetworkIsolation – This key is used to enable network isolation when creating training, hyperparameter tuning, and inference jobs. Allowed values are true and false. The default value is false. This parameter should be set to true to prevent containers from making any outbound network calls, even to other AWS services such as Amazon S3. Network isolation is required for training jobs and models run using resources from AWS Marketplace.
  • sagemaker:OutputKmsKey – This key is used to specify the AWS KMS key to encrypt output data stored in Amazon S3. Either the KMS key ID or key ARN can be specified. This key shouldn’t be confused with the key to encryption storage volumes specified in sagemaker:VolumeKmsKey.
  • sagemaker:RequestTag/${TagKey} – This key is used to compare the tag key-value pair that was passed in the request with the tag pair that is specified in the policy. This could be used to ensure that a particular tag is always used.
  • sagemaker:ResourceTag/${TagKey} – This key is used to compare the tag key-value pair that is specified in the policy with the key-value pair that is attached to the resource. This could be used to ensure that a particular tag and value pair is always used.
  • sagemaker:RootAccess – This key is used to control root access on the notebook instances. Allowed values are Enabled and Disabled. The default behavior is to allow root access. Root access is usually not a best practice and should be disabled. Disabling root access prevents notebook users from deleting system-level software, installing new software, and modifying essential components.
  • sagemaker:VolumeKmsKey – This key is used to specify an AWS KMS key to encrypt storage volumes when creating notebook instances, training jobs, hyperparameter tuning jobs, batch transform jobs, and endpoint configurations for hosting real-time inferencing. Either the KMS key ID or key ARN can be specified. This key shouldn’t be confused with the key to encrypt output data in Amazon S3 specified in sagemaker:OutputKmsKey.
  • sagemaker:VPCSecurityGroupIds – The list of all Amazon VPC security group IDs associated with the elastic network interface (ENI) that Amazon SageMaker creates in the Amazon VPC subnet.
  • sagemaker:VPCSubnets – The list of all Amazon VPC subnets where Amazon SageMaker creates ENIs to communicate with other resources like Amazon S3.

AWS Service Catalog approach

AWS Service Catalog allows organizations to create and manage catalogs of IT services that are approved for use on AWS. You can use it to create a preventive approach to improving security by invoking templates with security controls already in place. These IT services can include everything from virtual machine images, servers, software, and databases to complete multi-tier application architectures. AWS Service Catalog allows for the central management of commonly deployed IT services, helps achieve consistent governance, and supports you in meeting your compliance requirements. It does this while enabling users to quickly deploy only the IT services they need and that are approved by their organization.

AWS Service Catalog products are created by importing AWS CloudFormation templates that provision the resources in services. CloudFormation provides a common language for the description and provisioning of all the infrastructure resources in a cloud environment. CloudFormation lets you use programming languages or a simple text file to model and provision all the resources needed for applications across all regions and accounts in an automated and secure manner. This gives a single source of truth for the AWS resources.

AWS CloudFormation templates implement resources in various services as resource types. For example, there is a CloudFormation resource type called AWS::SageMaker::NotebookInstance that models an Amazon SageMaker notebook instance. When a CloudFormation stack with this resource type is created, the notebook instance is provisioned based on the template parameters.

Since AWS CloudFormation is typically used to provision infrastructure as opposed to executing workflows, CloudFormation models Amazon SageMaker notebook instances but not Amazon SageMaker training jobs. In situations like this, a custom resource can be used. Custom resource providers, typically implemented as Lambda functions, are invoked when a CloudFormation stack with a custom resource is created. The Lambda function can use the AWS SDKs—which are available in several programming languages—to create the resource. In the case of Amazon SageMaker training jobs, when the CloudFormation stack is created, it will call a Lambda function that can use the Boto3 Python SDK to create a training job.

The following Amazon SageMaker resource types are supported by AWS CloudFormation. All other Amazon SageMaker resources need to be created using the custom resource approach.

  • AWS::SageMaker::CodeRepository creates a Git repository that can be used for source control.
  • AWS::SageMaker::Endpoint creates an endpoint for inferencing.
  • AWS::SageMaker::EndpointConfig creates a configuration for endpoints for inferencing.
  • AWS::SageMaker::Model creates a model for inferencing.
  • AWS::SageMaker::NotebookInstance creates a notebook instance for development.
  • AWS::SageMaker::NotebookInstanceLifecycleConfig creates shell scripts that run when notebook instances are created and/or started.
  • AWS::SageMaker::Workteam creates a work team for labeling data.

AWS CloudFormation collects user input in the form of parameters that can be defined in the template. The value for the security related parameters should fall into one of the following three categories.

Security parameters that shouldn’t be exposed

In the AWS CloudFormation templates, don’t implement these settings as parameters. Instead, automatically set them without providing a choice to the user:

  • DirectInternetAccess – Set this to Disabled when creating notebook instances after connecting to the customer’s Amazon VPC using VpcConfig.Subnets and VpcConfig.SecurityGroupIds.
  • EnableInterContainerTrafficEncryption – Set this to true when creating distributed training and hyperparameter tuning jobs. Note that it might increase the training time.
  • EnableNetworkIsolation – Set this to true when creating training, hyperparameter tuning, and inference jobs to prevent situations like malicious code being accidentally installed and transferring data to a remote host.
  • MaxRuntimeInSeconds – Set this to a reasonable value.
  • RootAccess – Set this to Disabled when creating notebook instances as it is generally not a best practice to permit root access. Disabling root access prevents notebook users from deleting system-level software, installing new software, and modifying essential environment components.
  • VpcConfig.SecurityGroupIds – Set this to a pre-created security group that has been configured with the necessary controls.

Security parameters that should be restricted

For the following parameters, require the user to select a value from a dropdown list either by hard-coding the list or by using a supported AWS-specific parameter.

  • AcceleratorTypes – The dropdown list has to be hard coded. Elastic Inference accelerators allow the addition of inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.
  • InstanceTypes – The dropdown list has to be hard coded. Restricting instance types can be used to only allow enhanced-security Nitro instances or to control costs by not allowing GPU instances.
  • VpcConfig.Subnets – The dropdown list can be built by using an AWS::EC2::Subnet::Id parameter type.

Security parameters that should be validated

Require the user to input values for the following parameters by using the AllowedPattern property for the parameter with a regular expression of “+”:

  • OutputKmsKey
  • VolumeKmsKey

CloudWatch Events approach

Amazon CloudWatch and CloudWatch Events can be used to implement responsive controls to improve security. CloudWatch is a service that provides data and actionable insights to monitor applications, respond to system-wide performance changes, optimize resource utilization, and provide a unified view of operational health. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events. CloudWatch uses the data to provide a unified view of AWS resources, applications, and services that run on AWS and on-premises servers. CloudWatch can be used to detect anomalous behavior in environments, set alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to keep applications running smoothly.

CloudWatch Events—a subsystem within CloudWatch—delivers a near real-time stream of events that describe changes in AWS resources. Using simple rules, events can be matched and routed to one or more target functions or streams. The target of a CloudWatch Events rule might be, for example, a Lambda function that will be invoked with an event every time the rule matches.

The following figure shows how to create a CloudWatch Events rule that will match events pertaining to state changes of Amazon SageMaker training jobs. The steps using the AWS Management Console are:

  1. Go to Amazon CloudWatch and select Rules under Events.
  2. Select SageMaker from the Service Name dropdown.
  3. Select SageMaker Training Job State Change from the Event Type dropdown. The Event Pattern Preview is automatically populated.
  4. Select Lambda function from the Targets dropdown.
  5. Select the AWS Lambda function that you have implemented from the Function dropdown.

    Figure 1: Create a CloudWatch event rule

    Figure 1: Create a CloudWatch event rule

Amazon SageMaker provides the following training job statuses:

  • InProgress – The training is in progress.
  • Completed – The training job has completed.
  • Failed – The training job has failed. To see the reason for the failure, see the FailureReason field in the response to a DescribeTrainingJob call.
  • Stopping – The training job is stopping.
  • Stopped – The training job has stopped.

The Lambda function that is configured as the target for the CloudWatch Events rule should inspect the event and retrieve the Amazon SageMaker training job status. If the status is InProgress, the Lambda function will do the following:

  1. Call the DescribeTrainingJob API, and pass in the training job name.
  2. From the response, check to see if the training job has all of the necessary security controls.
  3. If the training job is deemed insecure, call the StopTrainingJob API to stop it.

Similar CloudWatch events rules can be set up for other Amazon SageMaker events.

Discussing the approaches

The IAM condition keys approach doesn’t involve coding. All it requires is adding condition elements to IAM policies. When this approach is used, users deploying resources are free to choose any approach including the console, CLI, and SDKs. Additionally, Amazon SageMaker also has a higher-level Python SDK implemented on top of Boto3 that makes deploying Amazon SageMaker resources easy.

However, there are a few caveats with this approach. First, not all AWS services support IAM condition keys. Fortunately, Amazon SageMaker has comprehensive support for IAM condition keys. Second, since this approach involves IAM policies, IAM service limits, some of which are documented below (a full list of IAM limits can be found at IAM and STS Limits), need to be taken into consideration.

  • An IAM user, group, or role can have a maximum of 10 managed policies.
  • The size of each managed policy cannot exceed 6,144 characters (not counting white spaces).

All of the conditions should be documented clearly. Otherwise, users deploying resources might have to use trial-and-error to successfully deploy them.

The AWS Service Catalog approach involves coding of AWS CloudFormation templates. When the necessary resource types aren’t supported by CloudFormation, custom resource Lambda functions have to be implemented by the customer. This approach is always available without any special support needed from the service. This approach also takes guesswork out of the equation when deploying resources as the CloudFormation templates can guide the user with providing proper security parameters.

Finally, the CloudWatch Events approach also involves coding by the customer. Because it’s a responsive control, the resource will start to be deployed before it will be stopped or terminated, if users create it without the necessary security controls. CloudWatch Events are available very soon after the resource provisioning starts. Amazon SageMaker resources typically take a couple of minutes after a resource is requested before it becomes available. It should be noted that users don’t get any direct feedback when resources are stopped or terminated in response to CloudWatch Events. They need to review CloudWatch Logs or notifications sent by the Lambda function code to figure out why a resource was terminated. Customers can implement this approach could be used in combination with one of the preventive approaches to enhance security.

Summary

In this post we discussed three different approaches for deploying Amazon SageMaker securely – IAM condition keys, Service Catalog, and CloudWatch Events. You can use each of these methods to improve the security of your AWS resources as you deploy them. After reading this post, you should now have a better understanding of the pros and cons of each of the approaches and how you can use them to deploy your Amazon SageMaker resources more securely in production.

Contributors

Contributors to this document include:

  • Paco Hope, Principal Security Consultant, AWS ProServe
  • Jeff Puchalski, Technical Training Specialist, AWS Security
  • Kumar Venkateswar, Principal PM, Amazon SageMaker, AI Platforms
  • Ross Warren, Senior Solutions Architect, Security

Additional resources

For additional information, see:

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Rajesh Ramchander

Rajesh is a Senior Big Data Consultant in Professional Services at AWS. He helps customers migrate big data and machine learning/artificial intelligence workloads to AWS using Amazon EMR, AWS Glue, and Amazon SageMaker. Before joining AWS, Rajesh was a member of senior management of software development teams. He holds an MS in Computer Science and an MS in Electrical Engineering.

Building a Self-Service, Secure, & Continually Compliant Environment on AWS

Post Syndicated from Japjot Walia original https://aws.amazon.com/blogs/architecture/building-a-self-service-secure-continually-compliant-environment-on-aws/

Introduction

If you’re an enterprise organization, especially in a highly regulated sector, you understand the struggle to innovate and drive change while maintaining your security and compliance posture. In particular, your banking customers’ expectations and needs are changing, and there is a broad move away from traditional branch and ATM-based services towards digital engagement.

With this shift, customers now expect personalized product offerings and services tailored to their needs. To achieve this, a broad spectrum of analytics and machine learning (ML) capabilities are required. With security and compliance at the top of financial service customers’ agendas, being able to rapidly innovate and stay secure is essential. To achieve exactly that, AWS Professional Services engaged with a major Global systemically important bank (G-SIB) customer to help develop ML capabilities and implement a Defense in Depth (DiD) security strategy. This blog post provides an overview of this solution.

The machine learning solution

The following architecture diagram shows the ML solution we developed for a customer. This architecture is designed to achieve innovation, operational performance, and security performance in line with customer-defined control objectives, as well as meet the regulatory and compliance requirements of supervisory authorities.

Machine learning solution developed for customer

This solution is built and automated using AWS CloudFormation templates with pre-configured security guardrails and abstracted through the service catalog. AWS Service Catalog allows you to quickly let your users deploy approved IT services ensuring governance, compliance, and security best practices are enforced during the provisioning of resources.

Further, it leverages Amazon SageMaker, Amazon Simple Storage Service (S3), and Amazon Relational Database Service (RDS) to facilitate the development of advanced ML models. As security is paramount for this workload, data in S3 is encrypted using client-side encryption and column-level encryption on columns in RDS. Our customer also codified their security controls via AWS Config rules to achieve continual compliance

Compute and network isolation

To enable our customer to rapidly explore new ML models while achieving the highest standards of security, separate VPCs were used to isolate infrastructure and accessed control by security groups. Core to this solution is Amazon SageMaker, a fully managed service that provides the ability to rapidly build, train, and deploy ML models. Amazon SageMaker notebooks are managed Juypter notebooks that:

  1. Prepare and process data
  2. Write code to train models
  3. Deploy models to SageMaker hosting
  4. Test or validate models

In our solution, notebooks run in an isolated VPC with no egress connectivity other than VPC endpoints, which enable private communication with AWS services. When used in conjunction with VPC endpoint policies, you can use notebooks to control access to those services. In our solution, this is used to allow the SageMaker notebook to communicate only with resources owned by AWS Organizations through the use of the aws:PrincipalOrgID condition key. AWS Organizations helps provide governance to meet strict compliance regulation and you can use the aws:PrincipalOrgID condition key in your resource-based policies to easily restrict access to Identity Access Management (IAM) principals from accounts.

Data protection

Amazon S3 is used to store training data, model artifacts, and other data sets. Our solution uses server-side encryption with customer master keys (CMKs) stored in AWS Key Management Service (SSE-KMS) encryption to protect data at rest. SSE-KMS leverages KMS and uses an envelope encryption strategy with CMKs. Envelop encryption is the practice of encrypting data with a data key and then encrypting that data key using another key – the CMK. CMKs are created in KMS and never leave KMS unencrypted. This approach allows fine-grained control around access to the CMK and the logging of all access and attempts to access the key to Amazon CloudTrail. In our solution, the age of the CMK is tracked by AWS Config and is regularly rotated. AWS Config enables you to assess, audit, and evaluate the configurations of deployed AWS resources by continuously monitoring and recording AWS resource configurations. This allows you to automate the evaluation of recorded configurations against desired configurations.

Amazon S3 Block Public Access is also used at an account level to ensure that existing and newly created resources block bucket policies or access-control lists (ACLs) don’t allow public access. Service control policies (SCPs) are used to prevent users from modifying this setting. AWS Config continually monitors S3 and remediates any attempt to make a bucket public.

Data in the solution are classified according to their sensitivity that corresponds to your customer’s data classification hierarchy. Classification in the solution is achieved through resource tagging, and tags are used in conjunction with AWS Config to ensure adherence to encryption, data retention, and archival requirements.

Continuous compliance

Our solution adopts a continuous compliance approach, whereby the compliance status of the architecture is continuously evaluated and auto-remediated if a configuration change attempts to violate the compliance posture. To achieve this, AWS Config and config rules are used to confirm that resources are configured in compliance with defined policies. AWS Lambda is used to implement a custom rule set that extends the rules included in AWS Config.

Data exfiltration prevention

In our solution, VPC Flow Logs are enabled on all accounts to record information about the IP traffic going to and from network interfaces in each VPC. This allows us to watch for abnormal and unexpected outbound connection requests, which could be an indication of attempts to exfiltrate data. Amazon GuardDuty analyzes VPC Flow Logs, AWS CloudTrail event logs, and DNS logs to identify unexpected and potentially malicious activity within the AWS environment. For example, GuardDuty can detect compromised Amazon Elastic Cloud Compute (EC2) instances communicating with known command-and-control servers.

Conclusion

Financial services customers are using AWS to develop machine learning and analytics solutions to solve key business challenges while ensuring security and compliance needs. This post outlined how Amazon SageMaker, along with multiple security services (AWS Config, GuardDuty, KMS), enables building a self-service, secure, and continually compliant data science environment on AWS for a financial service use case.

 

New – Label 3D Point Clouds with Amazon SageMaker Ground Truth

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-label-3d-point-clouds-with-amazon-sagemaker-ground-truth/

Launched at AWS re:Invent 2018, Amazon Sagemaker Ground Truth is a capability of Amazon SageMaker that makes it easy to annotate machine learning datasets. Customers can efficiently and accurately label image and text data with built-in workflows, or any other type of data with custom workflows. Data samples are automatically distributed to a workforce (private, 3rd party or MTurk), and annotations are stored in Amazon Simple Storage Service (S3). Optionally, automated data labeling may also be enabled, reducing both the amount of time required to label the dataset, and the associated costs.

About a year ago, I met with Automotive customers who expressed interest in labeling 3-dimensional (3D) datasets for autonomous driving. Captured by LIDAR sensors, these datasets are particularly large and complex. Data is stored in frames that typically contain 50,000 to 5 million points, and can weigh up to hundreds of Megabytes each. Frames are either stored individually, or in sequences that make it easier to track moving objects.

As you can imagine, labeling these datasets is extremely time-consuming, as workers need to navigate complex 3D scenes and annotate many different object classes. This often requires building and managing very complex tools. Always looking to help customers build simpler and more efficient workflows, the Ground Truth team gathered more feedback, and got to work.

Today, I’m extremely happy to announce that you can use Amazon Sagemaker Ground Truth to label 3D point clouds using a built-in editor, and state-of-the-art assistive labeling features.

Introducing 3D Point Cloud Labeling
Just like for other Ground Truth tasks types, input data for 3D point clouds has to be stored in an S3 bucket. It also needs to be described by a manifest file, a JSON file containing both the location of frames in S3 and their attributes. A dataset may contain either single-frame data, or multi-frame sequences.

Optionally, the dataset may also include image data captured by on-board cameras. Using a feature called “sensor fusion”, Ground Truth can synchronize a 3D point cloud with up to 8 cameras. Thanks to this, workers get a real-life view of the scene, and they can also interchangeably apply labels to 2D images and 3D point clouds.

Once the manifest file is ready, Ground Truth lets you create the following task types:

  • Object Detection: identify objects of interest within a 3D point cloud frame.
  • Object Tracking: track objects of interest across a sequence of 3D point cloud frames.
  • Semantic Segmentation: segment the points of a 3D point cloud frame into predefined categories.

These can either be labeling jobs where workers annotate new frames, or adjustment jobs where they review and fine-tune existing annotations. Jobs may be distributed either to a private workforce or to a vendor workforce you picked on AWS Marketplace.

Using the built-in graphical user interface (GUI) and its shortcuts for navigation and labeling, workers can quickly and accurately apply labels, boxes and categories to 3D objects (“car”, “pedestrian”, and so on). They can also add user-defined attributes, such as the color of a car, or whether an object is fully or partially visible.

The GUI includes many assistive labeling features that significantly simplify labeling work, save time, and improve the quality of annotations. Here are a few examples:

  • Snapping: Ground Truth infers a tight-fitting box around the object.
  • Interpolation: the labeler annotates an object in the first and last frames of a sequence. Ground Truth automatically annotates it in the middle frames.
  • Ground detection and removal: Ground Truth can automatically detect and remove 3D points belonging to the ground from object boxes.

Even with assistive labeling, it may take a while to annotate complex frames and sequences, so work is saved periodically to avoid any data loss.

Preparing 3D Point Cloud Datasets
As previously mentioned, you have to provide a manifest file describing your 3D dataset. The format of this file is defined in the Ground Truth documentation. Of course, the steps required to build it will vary from one dataset to the next. For example, the Audi A2D2 dataset contains almost 400,000 frames, with 360-degree 3D LIDAR data and 2D images. KITTI, another popular choice for autonomous driving research, includes a 3D dataset with 15,000 images and their corresponding point clouds, for a total of 80,256 labeled objects. This notebook shows you how to convert KITTI data to the Ground Truth format.

When datasets contain both 3D LIDAR data and 2D camera images, one challenge is to synchronize them. This allows us to project 3D points to 2D coordinates, map them on the pictures captured by on-board cameras, and vice versa. Another challenge is that data captured by a given device uses coordinates local to this device. Fortunately, we know where the device is located on the car, and where it’s pointed to. All of this can be solved by building a global coordinate system, also known as a World Coordinate System (WCS). Using matrix operations (which I’ll spare you), we can compute the coordinates of all data points inside the WCS.

Once frames have been processed, their information is saved in the manifest file: the position of the vehicle, the location of LIDAR data in S3, the location of associated pictures in S3, and so on. For large datasets, the whole process is a significant workload, and you could run it on a managed service such as Amazon SageMaker Processing, Amazon EMR or AWS Glue.

Labeling 3D Point Clouds with Amazon SageMaker Ground Truth
Let’s do a quick demo, based on this notebook. Starting from pre-processed sample frames, it streamlines the process of creating a 3D point cloud labeling job for each of the six task types (Object Detection, Object Tracking, Semantic Segmentation, and the associated adjustment task types). You can easily make yourself a private worker, and start labeling frames with the worker GUI and its labeling tools.

A picture is worth a thousand words, and a video even more! In this first video, I annotate a couple of cars using two assistive labeling features. First, I fit the box to the ground, which helps me capture object points that are close to the ground without actually capturing the ground itself. Second, I fit the box to the object, which ensures a tight fit without any blank space.

Amazon SageMaker Ground Truth

In this second video, I annotate a third car using the same technique. It’s quite harder to “see” than the previous ones, but I still manage to fit a tight box around it. Playing the next nine frames, I see that this car is actually moving. Jumping directly to the tenth frame, I adjust the bounding box to the new location of the car. Ground Truth automatically labels the eight middle frames, another assistive labeling feature called interpolation.

Amazon SageMaker Ground Truth

I’ve barely scratched the surface, and there’s plenty more to learn. Now it’s your turn!

Getting Started
You can start labeling 3D point clouds with Amazon Sagemaker Ground Truth today in the following regions:

  • US East (N. Virginia), US East (Ohio), US West (Oregon),
  • Canada (Central),
  • Europe (Ireland), Europe (London), Europe (Frankfurt),
  • Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Tokyo).

We’re looking forward to reading your feedback. You can send it through your usual support contacts, or in the AWS Forum for Amazon SageMaker.

– Julien

Building a Scalable Document Pre-Processing Pipeline

Post Syndicated from Joel Knight original https://aws.amazon.com/blogs/architecture/building-a-scalable-document-pre-processing-pipeline/

In a recent customer engagement, Quantiphi, Inc., a member of the Amazon Web Services Partner Network, built a solution capable of pre-processing tens of millions of PDF documents before sending them for inference by a machine learning (ML) model. While the customer’s use case—and hence the ML model—was very specific to their needs, the pipeline that does the pre-processing of documents is reusable for a wide array of document processing workloads. This post will walk you through the pre-processing pipeline architecture.

Pre-processing pipeline architecture-SM

Architectural goals

Quantiphi established the following goals prior to starting:

  • Loose coupling to enable independent scaling of compute components, flexible selection of compute services, and agility as the customer’s requirements evolved.
  • Work backwards from business requirements when making decisions affecting scale and throughput and not simply because “fastest is best.” Scale components only where it makes sense and for maximum impact.
  •  Log everything at every stage to enable troubleshooting when something goes wrong, provide a detailed audit trail, and facilitate cost optimization exercises by identifying usage and load of every compute component in the architecture.

Document ingestion

The documents are initially stored in a staging bucket in Amazon Simple Storage Service (Amazon S3). The processing pipeline is kicked off when the “trigger” Amazon Lambda function is called. This Lambda function passes parameters such as the name of the staging S3 bucket and the path(s) within the bucket which are to be processed to the “ingestion app.”

The ingestion app is a simple application that runs a web service to enable triggering a batch and lists documents from the S3 bucket path(s) received via the web service. As the app processes the list of documents, it feeds the document path, S3 bucket name, and some additional metadata to the “ingest” Amazon Simple Queue Service (Amazon SQS) queue. The ingestion app also starts the audit trail for the document by writing a record to the Amazon Aurora database. As the document moves downstream, additional records are added to the database. Records are joined together by a unique ID and assigned to each document by the ingestion app and passed along throughout the pipeline.

Chunking the documents

In order to maximize grip and control, the architecture is built to submit single-page files to the ML model. This enables correlating an inference failure to a specific page instead of a whole document (which may be many pages long). It also makes identifying the location of features within the inference results an easier task. Since the documents being processed can have varied sizes, resolutions, and page count, a big part of the pre-processing pipeline is to chunk a document up into its component pages prior to sending it for inference.

The “chunking orchestrator” app repeatedly pulls a message from the ingest queue and retrieves the document named therein from the S3 bucket. The PDF document is then classified along two metrics:

  • File size
  • Number of pages

We use these metrics to determine which chunking queue the document is sent to:

  • Large: Greater than 10MB in size or greater than 10 pages
  • Small: Less than or equal to 10MB and less than or equal to 10 pages
  • Single page: Less than or equal to 10MB and exactly one page

Each of these queues is serviced by an appropriately sized compute service that breaks the document down into smaller pieces, and ultimately, into individual pages.

  • Amazon Elastic Cloud Compute (EC2) processes large documents primarily because of the high memory footprint needed to read large, multi-gigabyte PDF files into memory. The output from these workers are smaller PDF documents that are stored in Amazon S3. The name and location of these smaller documents is submitted to the “small documents” queue.
  • Small documents are processed by a Lambda function that decomposes the document into single pages that are stored in Amazon S3. The name and location of these single page files is sent to the “single page” queue.

The Dead Letter Queues (DLQs) are used to hold messages from their respective size queue which are not successfully processed. If messages start landing in the DLQs, it’s an indication that there is a problem in the pipeline. For example, if messages start landing in the “small” or “single page” DLQ, it could indicate that the Lambda function processing those respective queues has reached its maximum run time.

An Amazon CloudWatch Alarm monitors the depth of each DLQ. Upon seeing DLQ activity, a notification is sent via Amazon Simple Notification Service (Amazon SNS) so an administrator can then investigate and make adjustments such as tuning the sizing thresholds to ensure the Lambda functions can finish before reaching their maximum run time.

In order to ensure no documents are left behind in the active run, there is a failsafe in the form of an Amazon EC2 worker that retrieves and processes messages from the DLQs. This failsafe app breaks a PDF all the way down into individual pages and then does image conversion.

For documents that don’t fall into a DLQ, they make it to the “single page” queue. This queue drives each page through the “image conversion” Lambda function which converts the single page file from PDF to PNG format. These PNG files are stored in Amazon S3.

Sending for inference

At this point, the documents have been chunked up and are ready for inference.

When the single-page image files land in Amazon S3, an S3 Event Notification is fired which places a message in a “converted image” SQS queue which in turn triggers the “model endpoint” Lambda function. This function calls an API endpoint on an Amazon API Gateway that is fronting the Amazon SageMaker inference endpoint. Using API Gateway with SageMaker endpoints avoided throttling during Lambda function execution due to high volumes of concurrent calls to the Amazon SageMaker API. This pattern also resulted in a 2x inference throughput speedup. The Lambda function passes the document’s S3 bucket name and path to the API which in turn passes it to the auto scaling SageMaker endpoint. The function reads the inference results that are passed back from API Gateway and stores them in Amazon Aurora.

The inference results as well as all the telemetry collected as the document was processed can be queried from the Amazon Aurora database to build reports showing number of documents processed, number of documents with failures, and number of documents with or without whatever feature(s) the ML model is trained to look for.

Summary

This architecture is able to take PDF documents that range in size from single page up to thousands of pages or gigabytes in size, pre-process them into single page image files, and then send them for inference by a machine learning model. Once triggered, the pipeline is completely automated and is able to scale to tens of millions of pages per batch.

In keeping with the architectural goals of the project, Amazon SQS is used throughout in order to build a loosely coupled system which promotes agility, scalability, and resiliency. Loose coupling also enables a high degree of grip and control over the system making it easier to respond to changes in business needs as well as focusing tuning efforts for maximum impact. And with every compute component logging everything it does, the system provides a high degree of auditability and introspection which facilitates performance monitoring, and detailed cost optimization.

Exploring the public AWS COVID-19 data lake

Post Syndicated from Jason Berkowitz original https://aws.amazon.com/blogs/big-data/exploring-the-public-aws-covid-19-data-lake/

The AWS COVID-19 data lake—a centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19—is now publicly available. For more information, see A public data lake for analysis of COVID-19 data. Globally, there are several efforts underway to gather this data, and AWS is working with partners to make this crucial data freely available and keep it up-to-date.

This data is readily available for you to ask questions, blend it with your own datasets, and create new insights in your own data lake. AWS is supporting Northwestern University in performing research developing pandemic-surveillance methods. Ariel Chandler, health informatics PhD candidate, says, “The AWS COVID-19 data lake provided me access to public data easily so I didn’t have to do the heavy lifting to get access to information that should be at everyone’s fingertips. Access to the AWS Data Exchange and these processing tools are helping to track, report, and visualize the spread of COVID-19 across the state to aid with the Illinois public health response. The data lake uses a wide range of data sources, including consumer and location data, to inform which communities are most at risk. That information is used to guide the provision of medical and social services to those who need them the most during this crisis.”

You can also produce new ways to query the information and publish those insights back into the data lake. Data may come from public websites, data purchased via data providers on AWS Data Exchange, or internal systems.

This post walks you through accessing the AWS COVID-19 data lake through the AWS Glue Data Catalog via Amazon SageMaker or Jupyter and using the open-source AWS Data Wrangler library. AWS Data Wrangler is an open-source Python package that extends the power of Pandas library to AWS and connects DataFrames and AWS data-related services (such as Amazon Redshift, Amazon S3, AWS Glue, Amazon Athena, and Amazon EMR). For more information about what you can build by using this data lake, see the associated public Jupyter notebook on GitHub.

The data for this post is from the following sources:

This data lake is comprised of data in a publicly readable Amazon S3 bucket. For a complete selection of COVID-19 data, see Data related to COVID-19 available for Research & Development. For instructions on subscribing to data products, see AWS Data Exchange – Find, Subscribe To, and Use Data Products.

Solution overview

This walkthrough includes the following steps:

  1. Installing the AWS CLI
  2. Configuring Amazon SageMaker
  3. Exploring the data through the Data Catalog

You also explore four analyses and their visualizations:

  • County-level percent changes
  • Foot traffic to public venues
  • Impact of number of cases on hospital beds
  • Impact of population density on hospital beds

Prerequisites

This post assumes that you have configured access to the data using an AWS CloudFormation template. For instructions, see A public data lake for analysis of COVID-19 data.

You also need access to an AWS account with permissions to do the following:

  • Create a CloudFormation stack
  • Create AWS Glue resources (catalog databases and tables)
  • Launch Amazon SageMaker notebooks

Installing the AWS CLI

Your first step is to install the AWS CLI and configure it for the us-east-2 Region. This is where the COVID-19 public data lake exists.

If you plan to work locally in Jupyter, you should set up a virtual environment for installing Python packages. Make sure that the following Python packages are installed: plotly, pandas, numpy, and awswrangler.

Configuring Amazon SageMaker

To configure Amazon SageMaker, complete the following steps:

  1. Create your Amazon SageMaker notebook instance in us-east-2 (the database and tables you created in the post A public data lake for analysis of COVID-19 data are in that Region).
  2. Record the IAM role you use for the notebook instance.
  3. Modify the IAM role assigned to the notebook instance to add the policies AmazonAthenaFullAccess and AWSDataExchangeSubscriberFullAccess.
  4. Create a Jupyter notebook on your new notebook instance.

Make sure the following Python packages are installed: plotly, pandas, numpy, awswrangler. For more information about installing external Python packages, see Install External Libraries and Kernels in Notebook Instances.

Exploring the data through the Data Catalog

When the CloudFormation stack shows the status CREATE_COMPLETE, you can view the tables the template created. You’re now ready to explore the data and its visualizations. This post provides four examples of visualizations.

County-level percent changes

Enigma – Global Coronavirus (COVID-19) Data (Johns Hopkins) tracks the global number of cases, recoveries, and deaths per day. Data sources include the World Health Organization (WHO), the US Centers for Disease Control and Prevention (CDC), and the National Health Commission of the People’s Republic of China (NHC). The data is collected by Johns Hopkins University and supported by the ESRI Living Atlas Team.

You can visualize the percent increase of a US county’s infected population over a day with this data. For example, if a county has a population of 1,000 and its infected population increases from 10 to 100 from Monday to Tuesday, then its infected population increased from 1% to 10%.

The following visualizations show the increase in the percent of population infected from March 29, 2020, to March 30, 2020, in New York City and surrounding areas. The more yellow a county, the larger increase of cases occurred. Gray counties increased less than 0.01% from March 29 to March 30.

The following visualization zooms in on New York counties. The yellow county is New York County (the borough of Manhattan), with an increase of 0.23% of its population with COVID-19 from March 29 to March 30. The blue counties to its east are Nassau County and Suffolk County, with 0.07% and 0.05% increases, respectively. The blue-green county north of New York County is Westchester County, with an increase of 0.08%.

The accompanying notebook allows you to vary the date parameter to visualize various rates of increase and zoom out of the map to visualize the entire United States.

Foot traffic to public venues

Foursquare – COVID-19 Foot Traffic Data is a daily aggregated and anonymized percentage dataset that demonstrates how foot traffic to various venues (such as airports, gyms, and grocery stores) has changed since February 19, 2020, in different metro areas. To obtain the following visualizations, you download the data from AWS Data Exchange and use Amazon SageMaker notebooks to visualize.

The following visualization uses the foot traffic data to plot the change of foot traffic to various venues after February 19. The plot shows how public traffic to shopping malls, clothing stores, casual dining chains, and airports have a sharp decline after the National Emergency Declaration on March 13. On the other hand, traffic to grocery stores, warehouse stores, and drug stores have sharp increases in the same period.

You can make similar plots for various metro areas, including New York City, San Francisco/Oakland, Los Angeles, Seattle, and 19 different venues with the accompanying notebook.

Impact of number of cases on hospital beds

The following visualizations use Enigma – Global Coronavirus (COVID-19) Data (Johns Hopkins) and Rearc – USA Hospital Beds – COVID-19 | Definitive Healthcare to analyze how the growing number of COVID-19 cases affects local hospitals. The hospital bed dataset is a dataset of the numbers of licensed beds, staffed beds, ICU beds, and the bed utilization rate for hospitals in the United States.

The first plot shows the growth in the number of hospitalized cases over 10 days in New York County. The hospitalized cases are calculated with a 10% hospitalization rate, which means 10% of all COVID-19 cases for Manhattan result in hospitalization. In the accompanying notebook, the hospitalization rate is a parameter, so you can visualize how various hospitalization rates have different healthcare needs. To generate this plot, you use the daily COVID-19 case information from Johns Hopkins to simulate the number of hospitalized cases and use the Definitive Healthcare hospital bed data to calculate the total hospital capacity for Manhattan.

The second visualization is a hospital utilization plot by county for the entire United States. The more yellow a county, the more its healthcare resources are burdened with a 20% COVID-19 hospitalization rate. Counties in gray have less than a 5% hospitalization rate.

As with the previous visualization, you can simulate various hospitalization rates in the accompanying notebook to visualize how COVID-19 burdens health care resources around the country. You can also change the data parameter to visualize how healthcare resource requirements change over time.

Impact of population density on hospital beds

The following visualization uses Enigma – Global Coronavirus (COVID-19) Data (Johns Hopkins), Rearc – USA Hospital Beds – COVID-19 | Definitive Healthcare, and US Census County-Based Data to compare the confirmed cases and available hospital beds per square kilometer for two different counties. The Enigma dataset provides case data; the Rearc dataset provides hospital bed information throughout the country, which is aggregated at the county level. The number of cases and beds is normalized by the land area in square kilometers using the US Census County data.

In the accompanying notebook, you can change the scope of visualization, the number of entities, and the bed resources. The following visualization provides cases and licensed beds per square kilometer at the county level for Alameda and San Diego counties.

These examples are a few of the innumerable analyses you can run on the public data lake.

Cleaning up

You incur no additional cost for accessing the AWS COVID-19 data lake beyond the standard charges for the AWS services that you use. For example, if you use Athena, you incur the costs for running queries and the data storage for the query result in Amazon S3, but incur no costs for accessing the data lake. Depending on the Amazon SageMaker instance you choose, you may incur Amazon SageMaker fees. For more information, see Amazon SageMaker Pricing.

To avoid recurring charges, shut down and delete the Amazon SageMaker instance, any S3 buckets you created, and disable auto-subscriptions for AWS Data Exchange.

Conclusion

Combining our efforts across organizations and scientific disciplines can help us win the fight against the COVID-19 pandemic. With the AWS COVID-19 data lake, you can experiment with and analyze curated data related to the virus, and share your own data and results. We believe that through an open and collaborative effort that combines data, technology, and science, we can inspire insights and foster breakthroughs necessary to contain, curtail, and ultimately cure COVID-19.

For more information about the public AWS COVID-19 data lake visit: https://aws.amazon.com/covid-19-data-lake/.

 


About the Authors

Jason Berkowitz is the Americas Data & Analytics Professional Services Practice Lead. He comes from a background in Machine Learning, Data Lake Architectures and helping customers become data-driven. He is currently working helping customers shape their data lakes and analytic journeys on AWS within Professional Services.

 

 

Colby Wise is a senior data scientist and manager at Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

 

 

 

 

 

Ninad Kulkarni is a data scientist in the Amazon Machine Learning Solutions Lab. He helps customers adopt ML and AI solutions by building solutions to address their business problems. Most recently, he has built predictive models for sports customers for on-screen consumption to improve fan engagement.

 

 

Formula 1: Using Amazon SageMaker to Deliver Real-Time Insights to Fans

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/formula-1-using-amazon-sagemaker-to-deliver-real-time-insights-to-fans-live/

The Formula one Group (F1) is responsible for the promotion of the FIA Formula One World Championship, a series of auto racing events in 21 countries where professional drivers race single-seat cars on custom tracks or through city courses in pursuit of the World Championship title.

Formula 1 works with AWS to enhance its race strategies, data tracking systems, and digital broadcasts through a wide variety of AWS services—including Amazon SageMaker, AWS Lambda, and AWS analytics services—to deliver new race metrics that change the way fans and teams experience racing.

In this special live segment of This is My Architecture, you’ll get a look at what’s under the hood of Formula 1’s F1 Insights. Hear about the machine learning algorithms the company trains on Amazon SageMaker and how inferences are made during races to deliver insights to fans.

For more content like this, subscribe to our YouTube channels This is My Architecture, This is My Code, and This is My Model, or visit the This is My Architecture AWS website, which has search functionality and the ability to filter by industry, language, and service.

Delve into the Forumla 1 case study to learn more about how AWS fuels analytics through machine learning.

Build machine learning-powered business intelligence analyses using Amazon QuickSight

Post Syndicated from Osemeke Isibor original https://aws.amazon.com/blogs/big-data/build-machine-learning-powered-business-intelligence-analyses-using-amazon-quicksight/

Imagine you can see the future—to know how many customers will order your product months ahead of time so you can make adequate provisions, or to know how many of your employees will leave your organization several months in advance so you can take preemptive actions to encourage staff retention. For an organization that sees the future, the possibilities are limitless. Machine learning (ML) makes it possible to predict the future with a higher degree of accuracy.

Amazon SageMaker provides every developer and data scientist the ability to build, train, and deploy ML models quickly, but for business users who usually work on creating business intelligence dashboards and reports rather than ML models, Amazon QuickSight is the service of choice. With Amazon QuickSight, you can still use ML to forecast the future. This post goes through how to create business intelligence analyses that use ML to forecast future data points and detect anomalies in data, with no technical expertise or ML experience needed.

Overview of solution

Amazon QuickSight ML Insights uses AWS-proven ML and natural language capabilities to help you gain deeper insights from your data. These powerful, out-of-the-box features make it easy to discover hidden trends and outliers, identify key business drivers, and perform powerful what-if analysis and forecasting with no technical or ML experience. You can use ML insights in sales reporting, web analytics, financial planning, and more. You can detect insights buried in aggregates, perform interactive what-if analysis, and discover what activities you need to meet business goals.

This post imports data from Amazon S3 into Amazon QuickSight and creates ML-powered analyses with the imported data. The following diagram illustrates this architecture.

Walkthrough

In this walkthrough, you create an Amazon QuickSight analysis that contains ML-powered visuals that forecast the future demand for taxis in New York City. You also generate ML-powered insights to detect anomalies in your data. This post uses the New York City Taxi and Limousine Commission (TLC) Trip Record Data on the Registry of Open Data on AWS.

The walkthrough includes the following steps:

  1. Set up and import data into Amazon QuickSight
  2. Create an ML-powered visual to forecast the future demand for taxis
  3. Generate an ML-powered insight to detect anomalies in the data set

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Amazon Quicksight Enterprise edition
  • Basic knowledge of AWS

Setting up and importing data into Amazon QuickSight

Set up Amazon Quicksight as an individual user. Complete the following steps:

  1. On the AWS Management Console, in the Region list, select US East (N. Virginia) or any Region of your choice that Amazon QuickSight
  2. Under Analytics, for Services, choose Amazon QuickSight.If you already have an existing Amazon QuickSight account, make sure it is the Enterprise edition; if it is not, upgrade to Enterprise edition. For more information, see Upgrading your Amazon QuickSight Subscription from Standard Edition to Enterprise Edition.If you do not have an existing Amazon QuickSight account, proceed with the setup and make sure you choose Enterprise Edition when setting up the account. For more information, see Setup a Free Standalone User Account in Amazon QuickSight.After you complete the setup, a Welcome Wizard screen appears.
  1. Choose Next on each of the Welcome Wizard screens.
  2. Choose Get Started.

Before you import the data set, make sure that you have at least 3GB of SPICE capacity. For more information, see Managing SPICE Capacity.

Importing the NYC Taxi data set into Amazon QuickSight

The NYC Taxi data set is in an S3 bucket. To import S3 data into Amazon QuickSight, use a manifest file. For more information, see Supported Formats for Amazon S3 Manifest Files. To import your data, complete the following steps:

  1. In a new text file, copy and paste the following code:
    "fileLocations": [
            {
                "URIs": [
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-01.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-02.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-03.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-04.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-05.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-06.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-07.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-08.csv", 
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-09.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-10.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-11.csv", 
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-12.csv"
    
                ]
            }
        ],
        "globalUploadSettings": {
            "textqualifier": "\""
        }
    }

  2. Save the text file as nyc-taxi.json.
  3. On the Amazon QuickSight console, choose New analysis.
  4. Choose New data set.
  5. For data source, choose S3.
  6. Under New S3 data source, for Data source name, enter a name of your choice.
  7. For Upload a manifest file field, select Upload.
  8. Choose the nyc-taxi.json file you created earlier.
  9. Choose Connect.
    The S3 bucket this post uses is a public bucket that contains a public data set and open to the public. When using S3 buckets in your account with Amazon QuickSight, it is highly recommended that the buckets are not open to the public; you need to configure authentication to access your S3 bucket from Amazon QuickSight. For more information about troubleshooting, see I Can’t Connect to Amazon S3.After you choose Connect, the Finish data set creation screen appears.
  10. Choose Visualize.
  11. Wait for the import to complete.

You can see the progress on the top right corner of the screen. When the import is complete, the result shows the number of rows imported successfully and the number of rows skipped.

Creating an ML-powered visual

After you import the data set into Amazon QuickSight SPICE, you can start creating analyses and visuals. Your goal is to create an ML-powered visual to forecast the future demand for taxis. For more information, see Forecasting and Creating What-If Scenarios with Amazon Quicksight.

To create your visual, complete the following steps:

  1. From the Data source details pop screen, choose Visualize.
  2. From the field list, select Lpep_pickup_datetime.
  3. Under Visual types, select the first visual.Amazon QuickSight automatically uses the best visual based on the number and data type of fields you selected. From your selection, Amazon Quicksight displays a line chart visual.From the preceding graph, you can see that the bulk of your data clusters are around December 31, 2017, to January 1, 2019, for the Lpep_pickup_datetime field. There are a few data points with date ranges up to June 2080. These values are incorrect and can impact your ML forecasts.To clean up your data set, filter out the incorrect data using the data in the Lpep_pickup_datetime. This post only uses data in which Lpep_pickup_datetime falls between January 1, 2018, and December 18, 2018, because there is a more consistent amount of data within this date range.
  4. Use the filter menu to create a filter using the Lpep_pickup_datetime
  5. Under Filter type, choose Time range and Between.
  6. For Start date, enter 2018-01-01 00:00.
  7. Select Include start date.
  8. For End date, enter 2018-12-18 00:00.
  9. Select Include end date.
  10. Choose Apply.

The line chart should now contain only data with Lpep_pickup_datetime from January 1, 2018, and December 18, 2018. You can now add the forecast for the next 31 days to the line chart visual.

Adding the forecast to your visual

To add your forecast, complete the following steps:

  1. On the visual, choose the arrow.
  2. From the drop-down menu, choose Add forecast.
  3. Under Forecast properties, for Forecast length, for Periods forward, enter 31.
  4. For Periods backwards, enter 0.
  5. For Prediction interval, leave at the default value 90.
  6. For Seasonality, leave at the default selection Automatic.
  7. Choose Apply.

You now see an orange line on your graph, which is the forecasted pickup quantity per day for the next 31 days after December 18, 2018. You can explore the different dates by hovering your cursor over different points on the forecasted pickup line. For example, hovering your cursor over January 10, 2019, shows that the expected forecasted number of pickups for that day is approximately 22,000. The forecast also provides an upper bound (maximum number of pickups forecasted) of about 26,000 and a lower bound (minimum number of pickups forecasted) of about 18,000.

You can create multiple visuals with forecasts and combine them into a sharable Amazon QuickSight dashboard. For more information, see Working with Dashboards.

Generating an ML-powered insight to detect anomalies

In Amazon QuickSight, you can add insights, autonarratives, and ML-powered anomaly detection to your analyses without ML expertise or knowledge. Amazon QuickSight generates suggested insights and autonarratives automatically, but for ML-powered anomaly detection, you need to perform additional steps. For more information, see Using ML-Powered Anomaly Detection.

This post checks if there are any anomalies in the total fare amount over time from select locations. For example, if the total fare charged for taxi rides is about $1,000 and above from the first pickup location (for example, the airport) for most dates in your data set, an anomaly is when the total fare charged deviates from the standard pattern. Anomalies are not necessarily negative, but rather abnormalities that you can choose to investigate further.

To create an anomaly insight, complete the following steps:

  1. From the top right corner of the analysis creation screen, click the Add drop-down menu and choose Add insight.
  2. On the Computation screen, for Computation type, select Anomaly detection.
  3. Under Fields list, choose the following fields:
  • fare_amount
  • lpep_pickup_datetime
  • PULocationID
  1. Choose Get started.
  2. The Configure anomaly detection
  3. Choose Analyze all combinations of these categories.
  4. Leave the other settings as their default.You can now perform a contribution analysis and discover how the drop-off location contributed to the anomalies. For more information, see Viewing Top Contributors.
  5. Under Contribution analysis, choose DOLocationID.
  6. Choose Save.
  7. Choose Run Now.The anomaly detection can take up to 10 minutes to complete. If it is still running after about 10 minutes, your browser may have timed out. Refresh your browser and you should see the anomalies displayed in the visual.
  8. Choose Explore anomalies.

By default, the anomalies you see are for the last date in your data set. You can explore the anomalies across the entire date range of your data set by choosing SHOW ANOMALIES BY DATE and dragging the slider at the bottom of the visual to display the entire date range from January 1, 2018, to December 30, 2018.

This graph shows that March 21, 2018, has the highest number of anomalies of the fare charged in the entire data set. For example, the total fare amount charged by taxis that picked up passengers from location 74 on March 21, 2018, was 7,181. This is -64% (about 19,728.5) of the total fare charged by taxis for the same pickup location on March 20, 2018. When you explore the anomalies of other pickup locations for that same date, you can see that they all have similar drops in the total fare charged. You can also see the top DOLocationID contributors to these anomalies.

What happened in New York City on March 21, 2018, to cause this drop? A quick online search reveals that New York City experienced a severe weather condition on March 21, 2018.

Publishing your analyses to a dashboard, sharing the dashboard, and setting up email alerts

You can create additional visuals to your analyses, publish the analyses as a dashboard, and share the dashboard with other users. QuickSight anomaly detection allows you to uncover hidden insights in your data by continuously analyzing billions of data points. You can subscribe to receive alerts to your inbox if an anomaly occurs in your business metrics. The email alert also indicates the factors that contribute to these anomalies. This allows you to act immediately on the business metrics that need attention.

From the QuickSight dashboard, you can configure an anomaly alert to be sent to your email with Severity set to High and above and Direction set to Lower than expected. Also make sure to schedule a data refresh so that the anomaly detection runs on your most recent data. For more information, see Refreshing Data.

Cleaning up

To avoid incurring future charges, you need to cancel your Amazon Quicksight subscription.

Conclusion

This post walked you through how to use ML-powered insights with Amazon QuickSight to forecast future data points, detect anomalies, and derive valuable insights from your data without needing prior experience or knowledge of ML. If you want to do more forecasting without ML experience, check out Amazon Forecast.

If you have questions or suggestions, please leave a comment.

 


About the Author

Osemeke Isibor is a partner solutions architect at AWS. He works with AWS Partner Network (APN) partners to design secure, highly available, scalable and cost optimized solutions on AWS. He is a science-fiction enthusiast and a fan of anime.

 

 

 

Binge-Watch Live This is My Architecture Videos from AWS re:Invent

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/binge-watch-live-videos-from-aws-reinvent-2019/

AWS re:Invent 2019 was a whirlwind of activity, especially in the Expo Hall, where the AWS team spent four days filming 12 live This is My Architecture videos for Twitch. Watch one a day for the next two weeks…or eat them all in one sitting. Whichever you do, you’re guaranteed to learn something new.

Accolade

Discover security and operational excellence in healthcare with Accolade.

AWS Solution Builders

Get multi-region Availability with Amazon DynamoDB, Amazon S3, and Amazon Cognito.

EcoFit

EcoFit offers responsive, AWS Lambda-based microservices at scale.

Splunk

Splunk explains data at scale by decoupling compute and storage.

Crownpeak

Crownpeak uses AWS Lambda for its decoupled content deployment architecture.

Formula One Group

Learn how Formula One Group is using Amazon SageMaker to deliver real-time insights to fans.

Adobe

Adobe is simplifying networking across thousands of AWS accounts with AWS Transit Gateway.

The Trade Desk

The Trade Desk offers real-time ad bidding in the cloud with AWS Global Accelerator.

Mueller Water Products

Learn all about about scalable ingestion of sensor data for municipal water conservation with Mueller Water Products.

NextRoll

NextRoll is driving OpEx efficiency for ad bidding engines.

Pason Systems

Explore petabyte-scale drilling datamart on AWS with Pason Systems.

UltraServe

Application vending machine with runtime event control at UltraServe.

 

Be sure to visit the AWS channel on Twitch for more in-depth videos and interviews.

Amazon SageMaker Studio: The First Fully Integrated Development Environment For Machine Learning

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-studio-the-first-fully-integrated-development-environment-for-machine-learning/

Today, we’re extremely happy to launch Amazon SageMaker Studio, the first fully integrated development environment (IDE) for machine learning (ML).

We have come a long way since we launched Amazon SageMaker in 2017, and it is shown in the growing number of customers using the service. However, the ML development workflow is still very iterative, and is challenging for developers to manage due to the relative immaturity of ML tooling. Many of the tools which developers take for granted when building traditional software (debuggers, project management, collaboration, monitoring, and so forth) have yet been invented for ML.

For example, when trying a new algorithm or tweaking hyper parameters, developers and data scientists typically run hundreds and thousands of experiments on Amazon SageMaker, and they need to manage all this manually. Over time, it becomes much harder to track the best performing models, and to capitalize on lessons learned during the course of experimentation.

Amazon SageMaker Studio unifies at last all the tools needed for ML development. Developers can write code, track experiments, visualize data, and perform debugging and monitoring all within a single, integrated visual interface, which significantly boosts developer productivity.

In addition, since all these steps of the ML workflow are tracked within the environment, developers can quickly move back and forth between steps, and also clone, tweak, and replay them. This gives developers the ability to make changes quickly, observe outcomes, and iterate faster, reducing the time to market for high quality ML solutions.

Introducing Amazon SageMaker Studio
Amazon SageMaker Studio lets you manage your entire ML workflow through a single pane of glass. Let me give you the whirlwind tour!

With Amazon SageMaker Notebooks (currently in preview), you can enjoy an enhanced notebook experience that lets you easily create and share Jupyter notebooks. Without having to manage any infrastructure, you can also quickly switch from one hardware configuration to another.

With Amazon SageMaker Experiments, you can organize, track and compare thousands of ML jobs: these can be training jobs, or data processing and model evaluation jobs run with Amazon SageMaker Processing.

With Amazon SageMaker Debugger, you can debug and analyze complex training issues, and receive alerts. It automatically introspects your models, collects debugging data, and analyzes it to provide real-time alerts and advice on ways to optimize your training times, and improve model quality. All information is visible as your models are training.

With Amazon SageMaker Model Monitor, you can detect quality deviations for deployed models, and receive alerts. You can easily visualize issues like data drift that could be affecting your models. No code needed: all it takes is a few clicks.

With Amazon SageMaker Autopilot, you can build models automatically with full control and visibility. Algorithm selection, data preprocessing, and model tuning are taken care automatically, as well as all infrastructure.

Thanks to these new capabilities, Amazon SageMaker now covers the complete ML workflow to build, train, and deploy machine learning models, quickly and at any scale.

These services mentioned above, except for Amazon SageMaker Notebooks, are covered in individual blog posts (see below) showing you how to quickly get started, so keep your eyes peeled and read on!

Now Available!
Amazon SageMaker Studio is available today in US East (Ohio).

Give it a try, and please send us feedback either in the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

– Julien

Amazon SageMaker Debugger – Debug Your Machine Learning Models

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-debugger-debug-your-machine-learning-models/

Today, we’re extremely happy to announce Amazon SageMaker Debugger, a new capability of Amazon SageMaker that automatically identifies complex issues developing in machine learning (ML) training jobs.

Building and training ML models is a mix of science and craft (some would even say witchcraft). From collecting and preparing data sets to experimenting with different algorithms to figuring out optimal training parameters (the dreaded hyperparameters), ML practitioners need to clear quite a few hurdles to deliver high-performance models. This is the very reason why be built Amazon SageMaker : a modular, fully managed service that simplifies and speeds up ML workflows.

As I keep finding out, ML seems to be one of Mr. Murphy’s favorite hangouts, and everything that may possibly go wrong often does! In particular, many obscure issues can happen during the training process, preventing your model from correctly extracting and learning patterns present in your data set. I’m not talking about software bugs in ML libraries (although they do happen too): most failed training jobs are caused by an inappropriate initialization of parameters, a poor combination of hyperparameters, a design issue in your own code, etc.

To make things worse, these issues are rarely visible immediately: they grow over time, slowly but surely ruining your training process, and yielding low accuracy models. Let’s face it, even if you’re a bona fide expert, it’s devilishly difficult and time-consuming to identify them and hunt them down, which is why we built Amazon SageMaker Debugger.

Let me tell you more.

Introducing Amazon SageMaker Debugger
In your existing training code for TensorFlow, Keras, Apache MXNet, PyTorch and XGBoost, you can use the new SageMaker Debugger SDK to save internal model state at periodic intervals; as you can guess, it will be stored in Amazon Simple Storage Service (S3).

This state is composed of:

  • The parameters being learned by the model, e.g. weights and biases for neural networks,
  • The changes applied to these parameters by the optimizer, aka gradients,
  • The optimization parameters themselves,
  • Scalar values, e.g. accuracies and losses,
  • The output of each layer,
  • Etc.

Each specific set of values – say, the sequence of gradients flowing over time through a specific neural network layer – is saved independently, and referred to as a tensor. Tensors are organized in collections (weights, gradients, etc.), and you can decide which ones you want to save during training. Then, using the SageMaker SDK and its estimators, you configure your training job as usual, passing additional parameters defining the rules you want SageMaker Debugger to apply.

A rule is a piece of Python code that analyses tensors for the model in training, looking for specific unwanted conditions. Pre-defined rules are available for common problems such as exploding/vanishing tensors (parameters reaching NaN or zero values), exploding/vanishing gradients, loss not changing, and more. Of course, you can also write your own rules.

Once the SageMaker estimator is configured, you can launch the training job. Immediately, it fires up a debug job for each rule that you configured, and they start inspecting available tensors. If a debug job detects a problem, it stops and logs additional information. A CloudWatch Events event is also sent, should you want to trigger additional automated steps.

So now you know that your deep learning job suffers from say, vanishing gradients. With a little brainstorming and experience, you’ll know where to look: maybe the neural network is too deep? Maybe your learning rate is too small? As the internal state has been saved to S3, you can now use the SageMaker Debugger SDK to explore the evolution of tensors over time, confirm your hypothesis and fix the root cause.

Let’s see SageMaker Debugger in action with a quick demo.

Debugging Machine Learning Models with Amazon SageMaker Debugger
At the core of SageMaker Debugger is the ability to capture tensors during training. This requires a little bit of instrumentation in your training code, in order to select the tensor collections you want to save, the frequency at which you want to save them, and whether you want to save the values themselves or a reduction (mean, average, etc.).

For this purpose, the SageMaker Debugger SDK provides simple APIs for each framework that it supports. Let me show you how this works with a simple TensorFlow script, trying to fit a 2-dimension linear regression model. Of course, you’ll find more examples in this Github repository.

Let’s take a look at the initial code:

import argparse
import numpy as np
import tensorflow as tf
import random

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, help="S3 path for the model")
parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001)
parser.add_argument('--steps', type=int, help="Number of steps to run", default=100)
parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0)

args = parser.parse_args()

with tf.name_scope('initialize'):
    # 2-dimensional input sample
    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
    # Initial weights: [10, 10]
    w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
    # True weights, i.e. the ones we're trying to learn
    w0 = [[1], [1.]]
with tf.name_scope('multiply'):
    # Compute true label
    y = tf.matmul(x, w0)
    # Compute "predicted" label
    y_hat = tf.matmul(x, w)
with tf.name_scope('loss'):
    # Compute loss
    loss = tf.reduce_mean((y_hat - y) ** 2, name="loss")

optimizer = tf.train.AdamOptimizer(args.lr)
optimizer_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(args.steps):
        x_ = np.random.random((10, 2)) * args.scale
        _loss, opt = sess.run([loss, optimizer_op], {x: x_})
        print (f'Step={i}, Loss={_loss}')

Let’s train this script using the TensorFlow Estimator. I’m using SageMaker local mode, which is a great way to quickly iterate on experimental code.

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000}

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='local',
    entry_point='script-v1.py',
    framework_version='1.13.1',
    py_version='py3',
    script_mode=True,
    hyperparameters=bad_hyperparameters)

Looking at the training log, things did not go well.

Step=0, Loss=7.883463958023267e+23
algo-1-hrvqg_1 | Step=1, Loss=9.502028841062608e+23
algo-1-hrvqg_1 | Step=2, Loss=nan
algo-1-hrvqg_1 | Step=3, Loss=nan
algo-1-hrvqg_1 | Step=4, Loss=nan
algo-1-hrvqg_1 | Step=5, Loss=nan
algo-1-hrvqg_1 | Step=6, Loss=nan
algo-1-hrvqg_1 | Step=7, Loss=nan
algo-1-hrvqg_1 | Step=8, Loss=nan
algo-1-hrvqg_1 | Step=9, Loss=nan

Loss does not decrease at all, and even goes to infinity… This looks like an exploding tensor problem, which is one of the built-in rules defined in SageMaker Debugger. Let’s get to work.

Using the Amazon SageMaker Debugger SDK
In order to capture tensors, I need to instrument the training script with:

  • A SaveConfig object specifying the frequency at which tensors should be saved,
  • A SessionHook object attached to the TensorFlow session, putting everything together and saving required tensors during training,
  • An (optional) ReductionConfig object, listing tensor reductions that should be saved instead of full tensors,
  • An (optional) optimizer wrapper to capture gradients.

Here’s the updated code, with extra command line arguments for SageMaker Debugger parameters.

import argparse
import numpy as np
import tensorflow as tf
import random
import smdebug.tensorflow as smd

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, help="S3 path for the model")
parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001 )
parser.add_argument('--steps', type=int, help="Number of steps to run", default=100 )
parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0 )
parser.add_argument('--debug_path', type=str, default='/opt/ml/output/tensors')
parser.add_argument('--debug_frequency', type=int, help="How often to save tensor data", default=10)
feature_parser = parser.add_mutually_exclusive_group(required=False)
feature_parser.add_argument('--reductions', dest='reductions', action='store_true', help="save reductions of tensors instead of saving full tensors")
feature_parser.add_argument('--no_reductions', dest='reductions', action='store_false', help="save full tensors")
args = parser.parse_args()
args = parser.parse_args()

reduc = smd.ReductionConfig(reductions=['mean'], abs_reductions=['max'], norms=['l1']) if args.reductions else None

hook = smd.SessionHook(out_dir=args.debug_path,
                       include_collections=['weights', 'gradients', 'losses'],
                       save_config=smd.SaveConfig(save_interval=args.debug_frequency),
                       reduction_config=reduc)

with tf.name_scope('initialize'):
    # 2-dimensional input sample
    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
    # Initial weights: [10, 10]
    w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
    # True weights, i.e. the ones we're trying to learn
    w0 = [[1], [1.]]
with tf.name_scope('multiply'):
    # Compute true label
    y = tf.matmul(x, w0)
    # Compute "predicted" label
    y_hat = tf.matmul(x, w)
with tf.name_scope('loss'):
    # Compute loss
    loss = tf.reduce_mean((y_hat - y) ** 2, name="loss")
    hook.add_to_collection('losses', loss)

optimizer = tf.train.AdamOptimizer(args.lr)
optimizer = hook.wrap_optimizer(optimizer)
optimizer_op = optimizer.minimize(loss)

hook.set_mode(smd.modes.TRAIN)

with tf.train.MonitoredSession(hooks=[hook]) as sess:
    for i in range(args.steps):
        x_ = np.random.random((10, 2)) * args.scale
        _loss, opt = sess.run([loss, optimizer_op], {x: x_})
        print (f'Step={i}, Loss={_loss}')

I also need to modify the TensorFlow Estimator, to use the SageMaker Debugger-enabled training container and to pass additional parameters.

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1}

from sagemaker.debugger import Rule, rule_configs
estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='ml.c5.2xlarge',
    image_name=cpu_docker_image_name,
    entry_point='script-v2.py',
    framework_version='1.15',
    py_version='py3',
    script_mode=True,
    hyperparameters=bad_hyperparameters,
    rules = [Rule.sagemaker(rule_configs.exploding_tensor())]
)

estimator.fit()
2019-11-27 10:42:02 Starting - Starting the training job...
2019-11-27 10:42:25 Starting - Launching requested ML instances
********* Debugger Rule Status *********
*
* ExplodingTensor: InProgress 
*
****************************************

Two jobs are running: the actual training job, and a debug job checking for the rule defined in the Estimator. Quickly, the debug job fails!

Describing the training job, I can get more information on what happened.

description = client.describe_training_job(TrainingJobName=job_name)
print(description['DebugRuleEvaluationStatuses'][0]['RuleConfigurationName'])
print(description['DebugRuleEvaluationStatuses'][0]['RuleEvaluationStatus'])

ExplodingTensor
IssuesFound

Let’s take a look at the saved tensors.

Exploring Tensors
I can easily grab the tensors saved in S3 during the training process.

s3_output_path = description["DebugConfig"]["DebugHookConfig"]["S3OutputPath"]
trial = create_trial(s3_output_path)

Let’s list available tensors.

trial.tensors()

['loss/loss:0',
'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0',
'initialize/weight1:0']

All values are numpy arrays, and I can easily iterate over them.

tensor = 'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0'
for s in list(trial.tensor(tensor).steps()):
    print("Value: ", trial.tensor(tensor).step(s).value)

Value:  [[1.1508383e+23] [1.0809098e+23]]
Value:  [[1.0278440e+23] [1.1347468e+23]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]

As tensor names include the TensorFlow scope defined in the training code, I can easily see that something is wrong with my matrix multiplication.

# Compute true label
y = tf.matmul(x, w0)
# Compute "predicted" label
y_hat = tf.matmul(x, w)

Digging a little deeper, the x input is modified by a scaling parameter, which I set to 100000000000 in the Estimator. The learning rate doesn’t look sane either. Bingo!

x_ = np.random.random((10, 2)) * args.scale

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1}

As you probably knew all along, setting these hyperpameteres to more reasonable values will fix the training issue.

Now Available!
We believe Amazon SageMaker Debugger will help you find and solve training issues quicker, so it’s now your turn to go bug hunting.

Amazon SageMaker Debugger is available today in all commercial regions where Amazon SageMaker is available. Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

– Julien

 

 

Amazon SageMaker Model Monitor – Fully Managed Automatic Monitoring For Your Machine Learning Models

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-model-monitor-fully-managed-automatic-monitoring-for-your-machine-learning-models/

Today, we’re extremely happy to announce Amazon SageMaker Model Monitor, a new capability of Amazon SageMaker that automatically monitors machine learning (ML) models in production, and alerts you when data quality issues appear.

The first thing I learned when I started working with data is that there is no such thing as paying too much attention to data quality. Raise your hand if you’ve spent hours hunting down problems caused by unexpected NULL values or by exotic character encodings that somehow ended up in one of your databases.

As models are literally built from large amounts of data, it’s easy to see why ML practitioners spend so much time caring for their data sets. In particular, they make sure that data samples in the training set (used to train the model) and in the validation set (used to measure its accuracy) have the same statistical properties.

There be monsters! Although you have full control over your experimental data sets, the same can’t be said for real-life data that your models will receive. Of course, that data will be unclean, but a more worrisome problem is “data drift”, i.e. a gradual shift in the very statistical nature of the data you receive. Minimum and maximum values, mean, average, variance, and more: all these are key attributes that shape assumptions and decisions made during the training of a model. Intuitively, you can surely feel that any significant change in these values would impact the accuracy of predictions: imagine a loan application predicting higher amounts because input features are drifting or even missing!

Detecting these conditions is pretty difficult: you would need to capture data received by your models, run all kinds of statistical analysis to compare that data to the training set, define rules to detect drift, send alerts if it happens… and do it all over again each time you update your models. Expert ML practitioners certainly know how to build these complex tools, but at the great expense of time and resources. Undifferentiated heavy lifting strikes again…

To help all customers focus on creating value instead, we built Amazon SageMaker Model Monitor. Let me tell you more.

Introducing Amazon SageMaker Model Monitor
A typical monitoring session goes like this. You first start from a SageMaker endpoint to monitor, either an existing one, or a new one created specifically for monitoring purposes. You can use SageMaker Model Monitor on any endpoint, whether the model was trained with a built-in algorithm, a built-in framework, or your own container.

Using the SageMaker SDK, you can capture a configurable fraction of the data sent to the endpoint (you can also capture predictions if you’d like), and store it in one of your Amazon Simple Storage Service (S3) buckets. Captured data is enriched with metadata (content type, timestamp, etc.), and you can secure and access it just like any S3 object.

Then, you create a baseline from the data set that was used to train the model deployed on the endpoint (of course, you can reuse an existing baseline too). This will fire up a Amazon SageMaker Processing job where SageMaker Model Monitor will:

  • Infer a schema for the input data, i.e. type and completeness information for each feature. You should review it, and update it if needed.
  • For pre-built containers only, compute feature statistics using Deequ, an open source tool based on Apache Spark that is developed and used at Amazon (blog post and research paper). These statistics include KLL sketches, an advanced technique to compute accurate quantiles on streams of data, that we recently contributed to Deequ.

Using these artifacts, the next step is to launch a monitoring schedule, to let SageMaker Model Monitor inspect collected data and prediction quality. Whether you’re using a built-in or custom container, a number of built-in rules are applied, and reports are periodically pushed to S3. The reports contain statistics and schema information on the data received during the latest time frame, as well as any violation that was detected.

Last but not least, SageMaker Model Monitor emits per-feature metrics to Amazon CloudWatch, which you can use to set up dashboards and alerts. The summary metrics from CloudWatch are also visible in Amazon SageMaker Studio, and of course all statistics, monitoring results and data collected can be viewed and further analyzed in a notebook.

For more information and an example on how to use SageMaker Model Monitor using AWS CloudFormation, refer to the developer guide.

Now, let’s do a demo, using a churn prediction model trained with the built-in XGBoost algorithm.

Enabling Data Capture
The first step is to create an endpoint configuration to enable data capture. Here, I decide to capture 100% of incoming data, as well as model output (i.e. predictions). I’m also passing the content types for CSV and JSON data.

data_capture_configuration = {
    "EnableCapture": True,
    "InitialSamplingPercentage": 100,
    "DestinationS3Uri": s3_capture_upload_path,
    "CaptureOptions": [
        { "CaptureMode": "Output" },
        { "CaptureMode": "Input" }
    ],
    "CaptureContentTypeHeader": {
       "CsvContentTypes": ["text/csv"],
       "JsonContentTypes": ["application/json"]
}

Next, I create the endpoint using the usual CreateEndpoint API.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m5.xlarge',
        'InitialInstanceCount':1,
        'InitialVariantWeight':1,
        'ModelName':model_name,
        'VariantName':'AllTrafficVariant'
    }],
    DataCaptureConfig = data_capture_configuration)

On an existing endpoint, I would have used the UpdateEndpoint API to seamlessly update the endpoint configuration.

After invoking the endpoint repeatedly, I can see some captured data in S3 (output was edited for clarity).

$ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/datacapture/DEMO-xgb-churn-pred-model-monitor-2019-11-22-07-59-33/
AllTrafficVariant/2019/11/22/08/24-40-519-9a9273ca-09c2-45d3-96ab-fc7be2402d43.jsonl
AllTrafficVariant/2019/11/22/08/25-42-243-3e1c653b-8809-4a6b-9d51-69ada40bc809.jsonl

Here’s a line from one of these files.

    "endpointInput":{
        "observedContentType":"text/csv",
        "mode":"INPUT",
        "data":"132,25,113.2,96,269.9,107,229.1,87,7.1,7,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1",
        "encoding":"CSV"
     },
     "endpointOutput":{
        "observedContentType":"text/csv; charset=utf-8",
        "mode":"OUTPUT",
        "data":"0.01076381653547287",
        "encoding":"CSV"}
     },
    "eventMetadata":{
        "eventId":"6ece5c74-7497-43f1-a263-4833557ffd63",
        "inferenceTime":"2019-11-22T08:24:40Z"},
        "eventVersion":"0"}

Pretty much what I expected. Now, let’s create a baseline for this model.

Creating A Monitoring Baseline
This is a very simple step: pass the location of the baseline data set, and the location where results should be stored.

from processingjob_wrapper import ProcessingJob

processing_job = ProcessingJob(sm_client, role).
   create(job_name, baseline_data_uri, baseline_results_uri)

Once that job is complete, I can see two new objects in S3: one for statistics, and one for constraints.

aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/baselining/results/
constraints.json
statistics.json

The constraints.json file tells me about the inferred schema for the training data set (don’t forget to check it’s accurate). Each feature is typed, and I also get information on whether a feature is always present or not (1.0 means 100% here). Here are the first few lines.

{
  "version" : 0.0,
  "features" : [ {
    "name" : "Churn",
    "inferred_type" : "Integral",
    "completeness" : 1.0
  }, {
    "name" : "Account Length",
    "inferred_type" : "Integral",
    "completeness" : 1.0
  }, {
    "name" : "VMail Message",
    "inferred_type" : "Integral",
    "completeness" : 1.0
  }, {
    "name" : "Day Mins",
    "inferred_type" : "Fractional",
    "completeness" : 1.0
  }, {
    "name" : "Day Calls",
    "inferred_type" : "Integral",
    "completeness" : 1.0

At the end of that file, I can see configuration information for CloudWatch monitoring: turn it on or off, set the drift threshold, etc.

"monitoring_config" : {
    "evaluate_constraints" : "Enabled",
    "emit_metrics" : "Enabled",
    "distribution_constraints" : {
      "enable_comparisons" : true,
      "min_domain_mass" : 1.0,
      "comparison_threshold" : 1.0
    }
  }

The statistics.json file shows different statistics for each feature (mean, average, quantiles, etc.), as well as unique values received by the endpoint. Here’s an example.

"name" : "Day Mins",
    "inferred_type" : "Fractional",
    "numerical_statistics" : {
      "common" : {
        "num_present" : 2333,
        "num_missing" : 0
      },
      "mean" : 180.22648949849963,
      "sum" : 420468.3999999996,
      "std_dev" : 53.987178959901556,
      "min" : 0.0,
      "max" : 350.8,
      "distribution" : {
        "kll" : {
          "buckets" : [ {
            "lower_bound" : 0.0,
            "upper_bound" : 35.08,
            "count" : 14.0
          }, {
            "lower_bound" : 35.08,
            "upper_bound" : 70.16,
            "count" : 48.0
          }, {
            "lower_bound" : 70.16,
            "upper_bound" : 105.24000000000001,
            "count" : 130.0
          }, {
            "lower_bound" : 105.24000000000001,
            "upper_bound" : 140.32,
            "count" : 318.0
          }, {
            "lower_bound" : 140.32,
            "upper_bound" : 175.4,
            "count" : 565.0
          }, {
            "lower_bound" : 175.4,
            "upper_bound" : 210.48000000000002,
            "count" : 587.0
          }, {
            "lower_bound" : 210.48000000000002,
            "upper_bound" : 245.56,
            "count" : 423.0
          }, {
            "lower_bound" : 245.56,
            "upper_bound" : 280.64,
            "count" : 180.0
          }, {
            "lower_bound" : 280.64,
            "upper_bound" : 315.72,
            "count" : 58.0
          }, {
            "lower_bound" : 315.72,
            "upper_bound" : 350.8,
            "count" : 10.0
          } ],
          "sketch" : {
            "parameters" : {
              "c" : 0.64,
              "k" : 2048.0
            },
            "data" : [ [ 178.1, 160.3, 197.1, 105.2, 283.1, 113.6, 232.1, 212.7, 73.3, 176.9, 161.9, 128.6, 190.5, 223.2, 157.9, 173.1, 273.5, 275.8, 119.2, 174.6, 133.3, 145.0, 150.6, 220.2, 109.7, 155.4, 172.0, 235.6, 218.5, 92.7, 90.7, 162.3, 146.5, 210.1, 214.4, 194.4, 237.3, 255.9, 197.9, 200.2, 120, ...

Now, let’s start monitoring our endpoint.

Monitoring An Endpoint
Again, one API call is all that it takes: I simply create a monitoring schedule for my endpoint, passing the constraints and statistics file for the baseline data set. Optionally, I could also pass preprocessing and postprocessing functions, should I want to tweak data and predictions.

ms = MonitoringSchedule(sm_client, role)
schedule = ms.create(
   mon_schedule_name, 
   endpoint_name, 
   s3_report_path, 
   # record_preprocessor_source_uri=s3_code_preprocessor_uri, 
   # post_analytics_source_uri=s3_code_postprocessor_uri,
   baseline_statistics_uri=baseline_results_uri + '/statistics.json',
   baseline_constraints_uri=baseline_results_uri+ '/constraints.json'
)

Then, I start sending bogus data to the endpoint, i.e. samples constructed from random values, and I wait for SageMaker Model Monitor to start generating reports. The suspense is killing me!

Inspecting Reports
Quickly, I see that reports are available in S3.

mon_executions = sm_client.list_monitoring_executions(MonitoringScheduleName=mon_schedule_name, MaxResults=3)
for execution_summary in mon_executions['MonitoringExecutionSummaries']:
    print("ProcessingJob: {}".format(execution_summary['ProcessingJobArn'].split('/')[1]))
    print('MonitoringExecutionStatus: {} \n'.format(execution_summary['MonitoringExecutionStatus']))

ProcessingJob: model-monitoring-201911221050-df2c7fc4
MonitoringExecutionStatus: Completed 

ProcessingJob: model-monitoring-201911221040-3a738dd7
MonitoringExecutionStatus: Completed 

ProcessingJob: model-monitoring-201911221030-83f15fb9
MonitoringExecutionStatus: Completed 

Let’s find the reports for one of these monitoring jobs.

desc_analytics_job_result=sm_client.describe_processing_job(ProcessingJobName=job_name)
report_uri=desc_analytics_job_result['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']
print('Report Uri: {}'.format(report_uri))

Report Uri: s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209

Ok, so what do we have here?

aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209/

constraint_violations.json
constraints.json
statistics.json

As you would expect, the constraints.json and statistics.json contain schema and statistics information on the data samples processed by the monitoring job. Let’s open directly the third one, constraints_violations.json!

violations" : [ {
    "feature_name" : "State_AL",
    "constraint_check_type" : "data_type_check",
    "description" : "Value: 0.8 does not meet the constraint requirement! "
  }, {
    "feature_name" : "Eve Mins",
    "constraint_check_type" : "baseline_drift_check",
    "description" : "Numerical distance: 0.2711598746081505 exceeds numerical threshold: 0"
  }, {
    "feature_name" : "CustServ Calls",
    "constraint_check_type" : "baseline_drift_check",
    "description" : "Numerical distance: 0.6470588235294117 exceeds numerical threshold: 0"
  }

Oops! It looks like I’ve been assigning floating point values to integer features: surely that’s not going to work too well!

Some features are also exhibiting drift, that’s not good either. Maybe something is wrong my data ingestion process, or maybe the distribution of data has actually changed, and I need to retrain the model. As all this information is available as CloudWatch metrics, I could define thresholds, set alarms and even trigger new training jobs automatically.

Now Available!
As you can see, Amazon SageMaker Model Monitor is easy to set up, and helps you quickly know about quality issues in your ML models.

Now it’s your turn: you can start using Amazon SageMaker Model Monitor today in all commercial regions where Amazon SageMaker is available. This capability is also integrated in Amazon SageMaker Studio, our workbench for ML projects. Last but not least, all information can be viewed and further analyzed in a notebook.

Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

– Julien

Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/

Today, we’re extremely happy to launch Amazon SageMaker Processing, a new capability of Amazon SageMaker that lets you easily run your preprocessing, postprocessing and model evaluation workloads on fully managed infrastructure.

Training an accurate machine learning (ML) model requires many different steps, but none is potentially more important than preprocessing your data set, e.g.:

  • Converting the data set to the input format expected by the ML algorithm you’re using,
  • Transforming existing features to a more expressive representation, such as one-hot encoding categorical features,
  • Rescaling or normalizing numerical features,
  • Engineering high level features, e.g. replacing mailing addresses with GPS coordinates,
  • Cleaning and tokenizing text for natural language processing applications,
  • And more!

These tasks involve running bespoke scripts on your data set, (beneath a moonless sky, I’m told) and saving the processed version for later use by your training jobs. As you can guess, running them manually or having to build and scale automation tools is not an exciting prospect for ML teams. The same could be said about postprocessing jobs (filtering, collating, etc.) and model evaluation jobs (scoring models against different test sets).

Solving this problem is why we built Amazon SageMaker Processing. Let me tell you more.

Introducing Amazon SageMaker Processing
Amazon SageMaker Processing introduces a new Python SDK that lets data scientists and ML engineers easily run preprocessing, postprocessing and model evaluation workloads on Amazon SageMaker.

This SDK uses SageMaker’s built-in container for scikit-learn, possibly the most popular library one for data set transformation.

If you need something else, you also have the ability to use your own Docker images without having to conform to any Docker image specification: this gives you maximum flexibility in running any code you want, whether on SageMaker Processing, on AWS container services like Amazon ECS and Amazon Elastic Kubernetes Service, or even on premise.

How about a quick demo with scikit-learn? Then, I’ll briefly discuss using your own container. Of course, you’ll find complete examples on Github.

Preprocessing Data With The Built-In Scikit-Learn Container
Here’s how to use the SageMaker Processing SDK to run your scikit-learn jobs.

First, let’s create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements.

from sagemaker.sklearn.processing import SKLearnProcessor
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_count=1,
                                     instance_type='ml.m5.xlarge')

Then, we can run our preprocessing script (more on this fellow in a minute) like so:

  • The data set (dataset.csv) is automatically copied inside the container under the destination directory (/input). We could add additional inputs if needed.
  • This is where the Python script (preprocessing.py) reads it. Optionally, we could pass command line arguments to the script.
  • It preprocesses it, splits it three ways, and saves the files inside the container under /opt/ml/processing/output/train, /opt/ml/processing/output/validation, and /opt/ml/processing/output/test.
  • Once the job completes, all outputs are automatically copied to your default SageMaker bucket in S3.
from sagemaker.processing import ProcessingInput, ProcessingOutput
sklearn_processor.run(
    code='preprocessing.py',
    # arguments = ['arg1', 'arg2'],
    inputs=[ProcessingInput(
        source='dataset.csv',
        destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
        ProcessingOutput(source='/opt/ml/processing/output/validation'),
        ProcessingOutput(source='/opt/ml/processing/output/test')]
)

That’s it! Let’s put everything together by looking at the skeleton of the preprocessing script.

import pandas as pd
from sklearn.model_selection import train_test_split
# Read data locally 
df = pd.read_csv('/opt/ml/processing/input/dataset.csv')
# Preprocess the data set
downsampled = apply_mad_data_science_skills(df)
# Split data set into training, validation, and test
train, test = train_test_split(downsampled, test_size=0.2)
train, validation = train_test_split(train, test_size=0.2)
# Create local output directories
try:
    os.makedirs('/opt/ml/processing/output/train')
    os.makedirs('/opt/ml/processing/output/validation')
    os.makedirs('/opt/ml/processing/output/test')
except:
    pass
# Save data locally
train.to_csv("/opt/ml/processing/output/train/train.csv")
validation.to_csv("/opt/ml/processing/output/validation/validation.csv")
test.to_csv("/opt/ml/processing/output/test/test.csv")
print('Finished running processing job')

A quick look to the S3 bucket confirms that files have been sucessfully processed and saved. Now I could use them directly as input for a SageMaker training job .

$ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker-scikit-learn-2019-11-20-13-57-17-805/output
2019-11-20 15:03:22 19967 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/test.csv
2019-11-20 15:03:22 64998 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/train.csv
2019-11-20 15:03:22 18058 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/validation.csv

Now what about using your own container?

Processing Data With Your Own Container
Let’s say you’d like to preprocess text data with the popular spaCy library. Here’s how you could define a vanilla Docker container for it.

FROM python:3.7-slim-buster
# Install spaCy, pandas, and an english language model for spaCy.
RUN pip3 install spacy==2.2.2 && pip3 install pandas==0.25.3
RUN python3 -m spacy download en_core_web_md
# Make sure python doesn't buffer stdout so we get logs ASAP.
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

Then, you would build the Docker container, test it locally, and push it to Amazon Elastic Container Registry, our managed Docker registry service.

The next step would be to configure a processing job using the ScriptProcessor object, passing the name of the container you built and pushed.

from sagemaker.processing import ScriptProcessor
script_processor = ScriptProcessor(image_uri='123456789012.dkr.ecr.us-west-2.amazonaws.com/sagemaker-spacy-container:latest',
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge')

Finally, you would run the job just like in the previous example.

script_processor.run(code='spacy_script.py',
    inputs=[ProcessingInput(
        source='dataset.csv',
        destination='/opt/ml/processing/input_data')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/processed_data')],
    arguments=['tokenizer', 'lemmatizer', 'pos-tagger']
)

The rest of the process is exactly the same as above: copy the input(s) inside the container, copy the output(s) from the container to S3.

Pretty simple, don’t you think? Again, I focused on preprocessing, but you can run similar jobs for postprocessing and model evaluation. Don’t forget to check out the examples in Github.

Now Available!
Amazon SageMaker Processing is available today in all commercial regions where Amazon SageMaker is available.

Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

Julien

Amazon SageMaker Autopilot – Automatically Create High-Quality Machine Learning Models With Full Control And Visibility

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-autopilot-fully-managed-automatic-machine-learning/

Today, we’re extremely happy to launch Amazon SageMaker Autopilot to automatically create the best classification and regression machine learning models, while allowing full control and visibility.

In 1959, Arthur Samuel defined machine learning as the ability for computers to learn without being explicitly programmed. In practice, this means finding an algorithm than can extract patterns from an existing data set, and use these patterns to build a predictive model that will generalize well to new data. Since then, lots of machine learning algorithms have been invented, giving scientists and engineers plenty of options to choose from, and helping them build amazing applications.

However, this abundance of algorithms also creates a difficulty: which one should you pick? How can you reliably figure out which one will perform best on your specific business problem? In addition, machine learning algorithms usually have a long list of training parameters (also called hyperparameters) that need to be set “just right” if you want to squeeze every bit of extra accuracy from your models. To make things worse, algorithms also require data to be prepared and transformed in specific ways (aka feature engineering) for optimal learning… and you need to pick the best instance type.

If you think this sounds like a lot of experimental, trial and error work, you’re absolutely right. Machine learning is definitely of mix of hard science and cooking recipes, making it difficult for non-experts to get good results quickly.

What if you could rely on a fully managed service to solve that problem for you? Call an API and get the job done? Enter Amazon SageMaker Autopilot.

Introducing Amazon SageMaker Autopilot
Using a single API call, or a few clicks in Amazon SageMaker Studio, SageMaker Autopilot first inspects your data set, and runs a number of candidates to figure out the optimal combination of data preprocessing steps, machine learning algorithms and hyperparameters. Then, it uses this combination to train an Inference Pipeline, which you can easily deploy either on a real-time endpoint or for batch processing. As usual with Amazon SageMaker, all of this takes place on fully-managed infrastructure.

Last but not least, SageMaker Autopilot also generate Python code showing you exactly how data was preprocessed: not only can you understand what SageMaker Autopilot did, you can also reuse that code for further manual tuning if you’re so inclined.

As of today, SageMaker Autopilot supports:

  • Input data in tabular format, with automatic data cleaning and preprocessing,
  • Automatic algorithm selection for linear regression, binary classification, and multi-class classification,
  • Automatic hyperparameter optimization,
  • Distributed training,
  • Automatic instance and cluster size selection.

Let me show you how simple this is.

Using AutoML with Amazon SageMaker Autopilot
Let’s use this sample notebook as a starting point: it builds a binary classification model predicting if customers will accept or decline a marketing offer. Please take a few minutes to read it: as you will see, the business problem itself is easy to understand, and the data set is neither large nor complicated. Yet, several non-intuitive preprocessing steps are required, and there’s also the delicate matter of picking an algorithm and its parameters… SageMaker Autopilot to the rescue!

First, I grab a copy of the data set, and take a quick look at the first few lines.

Then, I upload it in Amazon Simple Storage Service (S3) without any preprocessing whatsoever.

sess.upload_data(path="automl-train.csv", key_prefix=prefix + "/input")

's3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-automl-dm/input/automl-train.csv'

Now, let’s configure the AutoML job:

  • Set the location of the data set,
  • Select the target attribute that I want the model to predict: in this case, it’s the ‘y’ column showing if a customer accepted the offer or not,
  • Set the location of training artifacts.
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/input'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'y'
    }
  ]

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix)
  }

That’s it! Of course, SageMaker Autopilot has a number of options that will come in handy as you learn more about your data and your models, e.g.:

  • Set the type of problem you want to train on: linear regression, binary classification, or multi-class classification. If you’re not sure, SageMaker Autopilot will figure it out automatically by analyzing the values of the target attribute.
  • Use a specific metric for model evaluation.
  • Define completion criteria: maximum running time, etc.

One thing I don’t have to do is size the training cluster, as SageMaker Autopilot uses a heuristic based on data size and algorithm. Pretty cool!

With configuration out of the way, I can fire up the job with the CreateAutoMl API.

auto_ml_job_name = 'automl-dm-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=role)

AutoMLJobName: automl-dm-28-10-17-49

A job runs in four steps (you can use the DescribeAutoMlJob API to view them).

  1. Splitting the data set into train and validation sets,
  2. Analyzing data, in order to recommend pipelines that should be tried out on the data set,
  3. Feature engineering, where transformations are applied to the data set and to individual features,
  4.  Pipeline selection and hyperparameter tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm.

Once the maximum number of candidates – or one of the stopping conditions – has been reached, the job is complete. I can get detailed information on all candidates using the ListCandidatesForAutoMlJob API , and also view them in the AWS console.

candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']
index = 1
for candidate in candidates:
  print (str(index) + "  " + candidate['CandidateName'] + "  " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))
  index += 1

1 automl-dm-28-tuning-job-1-fabb8-001-f3b6dead 0.9186699986457825
2 automl-dm-28-tuning-job-1-fabb8-004-03a1ff8a 0.918304979801178
3 automl-dm-28-tuning-job-1-fabb8-003-c443509a 0.9181839823722839
4 automl-dm-28-tuning-job-1-ed07c-006-96f31fde 0.9158779978752136
5 automl-dm-28-tuning-job-1-ed07c-004-da2d99af 0.9130859971046448
6 automl-dm-28-tuning-job-1-ed07c-005-1e90fd67 0.9130859971046448
7 automl-dm-28-tuning-job-1-ed07c-008-4350b4fa 0.9119930267333984
8 automl-dm-28-tuning-job-1-ed07c-007-dae75982 0.9119930267333984
9 automl-dm-28-tuning-job-1-ed07c-009-c512379e 0.9119930267333984
10 automl-dm-28-tuning-job-1-ed07c-010-d905669f 0.8873512744903564

For now, I’m only interested in the best trial: 91.87% validation accuracy. Let’s deploy it to a SageMaker endpoint, just like we would deploy any model:

model_arn = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

ep_config = sm.create_endpoint_config(EndpointConfigName = epc_name,
                                      ProductionVariants=[{'InstanceType':'ml.m5.2xlarge',
                                                           'InitialInstanceCount':1,
                                                           'ModelName':model_name,
                                                           'VariantName':variant_name}])

create_endpoint_response = sm.create_endpoint(EndpointName=ep_name,
                                              EndpointConfigName=epc_name)

After a few minutes, the endpoint is live, and I can use it for prediction. SageMaker business as usual!

Now, I bet you’re curious about how the model was built, and what the other candidates are. Let me show you.

Full Visibility And Control with Amazon SageMaker Autopilot
SageMaker Autopilot stores training artifacts in S3, including two auto-generated notebooks!

job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_data_notebook = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
job_candidate_notebook = job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']

print(job_data_notebook)
print(job_candidate_notebook)

s3://<PREFIX_REMOVED>/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb
s3://<PREFIX_REMOVED>/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb

The first one contains information about the data set.

The second one contains full details on the SageMaker Autopilot job: candidates, data preprocessing steps, etc. All code is available, as well as ‘knobs’ you can change for further experimentation.

As you can see, you have full control and visibility on how models are built.

Now Available!
I’m very excited about Amazon SageMaker Autopilot, because it’s making machine learning simpler and more accessible than ever. Whether you’re just beginning with machine learning, or whether you’re a seasoned practitioner, SageMaker Autopilot will help you build better models quicker using either one of these paths:

Now it’s your turn. You can start using SageMaker Autopilot today in the following regions:

  • US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon),
  • Canada (Central), South America (São Paulo),
  • Europe (Ireland), Europe (London), Europe (Paris), Europe (Frankfurt),
  • Middle East (Bahrain),
  • Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo).

Please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

Julien

Amazon SageMaker Experiments – Organize, Track And Compare Your Machine Learning Trainings

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-experiments-organize-track-and-compare-your-machine-learning-trainings/

Today, we’re extremely happy to announce Amazon SageMaker Experiments, a new capability of Amazon SageMaker that lets you organize, track, compare and evaluate machine learning (ML) experiments and model versions.

ML is a highly iterative process. During the course of a single project, data scientists and ML engineers routinely train thousands of different models in search of maximum accuracy. Indeed, the number of combinations for algorithms, data sets, and training parameters (aka hyperparameters) is infinite… and therein lies the proverbial challenge of finding a needle in a haystack.

Tools like Automatic Model Tuning and Amazon SageMaker Autopilot help ML practitioners explore a large number of combinations automatically, and quickly zoom in on high-performance models. However, they further add to the explosive growth of training jobs. Over time, this creates a new difficulty for ML teams, as it becomes near-impossible to efficiently deal with hundreds of thousands of jobs: keeping track of metrics, grouping jobs by experiment, comparing jobs in the same experiment or across experiments, querying past jobs, etc.

Of course, this can be solved by building, managing and scaling bespoke tools: however, doing so diverts valuable time and resources away from actual ML work. In the spirit of helping customers focus on ML and nothing else, we couldn’t leave this problem unsolved.

Introducing Amazon SageMaker Experiments
First, let’s define core concepts:

  • A trial is a collection of training steps involved in a single training job. Training steps typically includes preprocessing, training, model evaluation, etc. A trial is also enriched with metadata for inputs (e.g. algorithm, parameters, data sets) and outputs (e.g. models, checkpoints, metrics).
  • An experiment is simply a collection of trials, i.e. a group of related training jobs.

The goal of SageMaker Experiments is to make it as simple as possible to create experiments, populate them with trials, and run analytics across trials and experiments. For this purpose, we introduce a new Python SDK containing logging and analytics APIs.

Running your training jobs on SageMaker or SageMaker Autopilot, all you have to do is pass an extra parameter to the Estimator, defining the name of the experiment that this trial should be attached to. All inputs and outputs will be logged automatically.

Once you’ve run your training jobs, the SageMaker Experiments SDK lets you load experiment and trial data in the popular pandas dataframe format. Pandas truly is the Swiss army knife of ML practitioners, and you’ll be able to perform any analysis that you may need. Go one step further by building cool visualizations with matplotlib, and you’ll be well on your way to taming that wild horde of training jobs!

As you would expect, SageMaker Experiments is nicely integrated in Amazon SageMaker Studio. You can run complex queries to quickly find the past trial you’re looking for. You can also visualize real-time model leaderboards and metric charts.

How about a quick demo?

Logging Training Information With Amazon SageMaker Experiments
Let’s start from a PyTorch script classifying images from the MNIST data set, using a simple two-layer convolution neural network (CNN). If I wanted to run a single job on SageMaker, I could use the PyTorch estimator like so:

estimator = PyTorch(
        entry_point='mnist.py',
        role=role,
        sagemaker_session=sess
        framework_version='1.1.0',
        train_instance_count=1,
        train_instance_type='ml.p3.2xlarge')
    
    estimator.fit(inputs={'training': inputs})

Instead, let’s say that I want to run multiple versions of the same script, changing only one of the hyperparameters (the number of convolution filters used by the two convolution layers, aka number of hidden channels) to measure its impact on model accuracy. Of course, we could run these jobs, grab the training logs, extract metrics with fancy text filtering, etc. Or we could use SageMaker Experiments!

All I need to do is:

  • Set up an experiment,
  • Use a tracker to log experiment metadata,
  • Create a trial for each training job I want to run,
  • Run each training job, passing parameters for the experiment name and the trial name.

First things first, let’s take care of the experiment.

from smexperiments.experiment import Experiment
mnist_experiment = Experiment.create(
    experiment_name="mnist-hand-written-digits-classification", 
    description="Classification of mnist hand-written digits", 
    sagemaker_boto_client=sm)

Then, let’s add a few things that we want to keep track of, like the location of the data set and normalization values we applied to it.

from smexperiments.tracker import Tracker
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
     tracker.log_input(name="mnist-dataset", media_type="s3/uri", value=inputs)
     tracker.log_parameters({
        "normalization_mean": 0.1307,
        "normalization_std": 0.3081,
    })

Now let’s run a few jobs. I simply loop over the different values that I want to try, creating a new trial for each training job and adding the tracker information to it.

for i, num_hidden_channel in enumerate([2, 5, 10, 20, 32]):
    trial_name = f"cnn-training-job-{num_hidden_channel}-hidden-channels-{int(time.time())}"
    cnn_trial = Trial.create(
        trial_name=trial_name, 
        experiment_name=mnist_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )
    cnn_trial.add_trial_component(tracker.trial_component)

Then, I configure the estimator, passing the value for the hyperparameter I’m interested in, and leaving the other ones as is. I’m also passing regular expressions to extract metrics from the training log. All these will push stored in the trial: in fact, all parameters (passed or default) will be.

    estimator = PyTorch(
        entry_point='mnist.py',
        role=role,
        sagemaker_session=sess,
        framework_version='1.1.0',
        train_instance_count=1,
        train_instance_type='ml.p3.2xlarge',
        hyperparameters={
            'hidden_channels': num_hidden_channels
        },
        metric_definitions=[
            {'Name':'train:loss', 'Regex':'Train Loss: (.*?);'},
            {'Name':'test:loss', 'Regex':'Test Average loss: (.*?),'},
            {'Name':'test:accuracy', 'Regex':'Test Accuracy: (.*?)%;'}
        ]
    )

Finally, I run the training job, associating it to the experiment and the trial.

    cnn_training_job_name = "cnn-training-job-{}".format(int(time.time()))
    
    estimator.fit(
        inputs={'training': inputs}, 
        job_name=cnn_training_job_name,
        experiment_config={
            "ExperimentName": mnist_experiment.experiment_name, 
            "TrialName": cnn_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        }
    )
# end of loop

Once all jobs are complete, I can run analytics. Let’s find out how we did.

Analytics with Amazon SageMaker Experiments
All information on an experiment can be easily exported to a Pandas DataFrame.

from sagemaker.analytics import ExperimentAnalytics
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=mnist_experiment.experiment_name
)
analytic_table = trial_component_analytics.dataframe()

If I want to drill down, I can specify additional parameters, e.g.:

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=mnist_experiment.experiment_name,
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
    metric_names=['test:accuracy'],
    parameter_names=['hidden_channels', 'epochs', 'dropout', 'optimizer']
)
analytic_table = trial_component_analytics.dataframe()

This builds a DataFrame where trials are sorted by decreasing test accuracy, and showing only some of the hyperparameters for each trial.

for col in analytic_table.columns: 
    print(col) 

TrialComponentName
DisplayName
SourceArn
dropout
epochs
hidden_channels
optimizer
test:accuracy - Min
test:accuracy - Max
test:accuracy - Avg
test:accuracy - StdDev
test:accuracy - Last
test:accuracy - Count

From here on, your imagination is the limit. Pandas is the Swiss army knife of data analysis, and you’ll be able to compare trials and experiments in every possible way.

Last but not least, thanks to the integration with Amazon SageMaker Studio, you’ll be able to visualize all this information in real-time with predefined widgets. To learn more about Amazon SageMaker Studio, visit this blog post.

Now Available!
I just scratched the surface of what you can do with Amazon SageMaker Experiments, and I believe it will help you tame the wild horde of jobs that you have to deal with everyday.

The service is available today in all commercial AWS Regions where Amazon SageMaker is available.

Give it a try and please send us feedback, either in the AWS forum for Amazon SageMaker, or through your usual AWS contacts.

– Julien

 

Now Available on Amazon SageMaker: The Deep Graph Library

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-on-amazon-sagemaker-the-deep-graph-library/

Today, we’re happy to announce that the Deep Graph Library, an open source library built for easy implementation of graph neural networks, is now available on Amazon SageMaker.

In recent years, Deep learning has taken the world by storm thanks to its uncanny ability to extract elaborate patterns from complex data, such as free-form text, images, or videos. However, lots of datasets don’t fit these categories and are better expressed with graphs. Intuitively, we can feel that traditional neural network architectures like convolution neural networks or recurrent neural networks are not a good fit for such datasets, and a new approach is required.

A Primer On Graph Neural Networks
Graph neural networks (GNN) are one of the most exciting developments in machine learning today, and these reference papers will get you started.

GNNs are used to train predictive models on datasets such as:

  • Social networks, where graphs show connections between related people,
  • Recommender systems, where graphs show interactions between customers and items,
  • Chemical analysis, where compounds are modeled as graphs of atoms and bonds,
  • Cybersecurity, where graphs describe connections between source and destination IP addresses,
  • And more!

Most of the time, these datasets are extremely large and only partially labeled. Consider a fraud detection scenario where we would try to predict the likelihood that an individual is a fraudulent actor by analyzing his connections to known fraudsters. This problem could be defined as a semi-supervised learning task, where only a fraction of graph nodes would be labeled (‘fraudster’ or ‘legitimate’). This should be a better solution than trying to build a large hand-labeled dataset, and “linearizing” it to apply traditional machine learning algorithms.

Working on these problems requires domain knowledge (retail, finance, chemistry, etc.), computer science knowledge (Python, deep learning, open source tools), and infrastructure knowledge (training, deploying, and scaling models). Very few people master all these skills, which is why tools like the Deep Graph Library and Amazon SageMaker are needed.

Introducing The Deep Graph Library
First released on Github in December 2018, the Deep Graph Library (DGL) is a Python open source library that helps researchers and scientists quickly build, train, and evaluate GNNs on their datasets.

DGL is built on top of popular deep learning frameworks like PyTorch and Apache MXNet. If you know either one or these, you’ll find yourself quite at home. No matter which framework you use, you can get started easily thanks to these beginner-friendly examples. I also found the slides and code for the GTC 2019 workshop very useful.

Once you’re done with toy examples, you can start exploring the collection of cutting edge models already implemented in DGL. For example, you can train a document classification model using a Graph Convolution Network (GCN) and the CORA dataset by simply running:

$ python3 train.py --dataset cora --gpu 0 --self-loop

The code for all models is available for inspection and tweaking. These implementations have been carefully validated by AWS teams, who verified performance claims and made sure results could be reproduced.

DGL also includes a collection of graph datasets, that you can easily download and experiment with.

Of course, you can install and run DGL locally, but to make your life simpler, we added it to the Deep Learning Containers for PyTorch and Apache MXNet. This makes it easy to use DGL on Amazon SageMaker, in order to train and deploy models at any scale, without having to manage a single server. Let me show you how.

Using DGL On Amazon SageMaker
We added complete examples in the Github repository for SageMaker examples: one of them trains a simple GNN for molecular toxicity prediction using the Tox21 dataset.

The problem we’re trying to solve is figuring it the potential toxicity of new chemical compounds with respect to 12 different targets (receptors inside biological cells, etc.). As you can imagine, this type of analysis is crucial when designing new drugs, and being able to quickly predict results without having to run in vitro experiments helps researchers focus their efforts on the most promising drug candidates.

The dataset contains a little over 8,000 compounds: each one is modeled as a graph (atoms are vertices, atomic bonds are edges), and labeled 12 times (one label per target). Using a GNN, we’re going to build a multi-label binary classification model, allowing us to predict the potential toxicity of candidate molecules.

In the training script, we can easily download the dataset from the DGL collection.

from dgl.data.chem import Tox21
dataset = Tox21()

Similarly, we can easily build a GNN classifier using the DGL model zoo.

from dgl import model_zoo
model = model_zoo.chem.GCNClassifier(
    in_feats=args['n_input'],
    gcn_hidden_feats=[args['n_hidden'] for _ in range(args['n_layers'])],
    n_tasks=dataset.n_tasks,
    classifier_hidden_feats=args['n_hidden']).to(args['device'])

The rest of the code is mostly vanilla PyTorch, and you should be able to find your bearings if you’re familiar with this library.

When it comes to running this code on Amazon SageMaker, all we have to do is use a SageMaker Estimator, passing the full name of our DGL container, and the name of the training script as a hyperparameter.

estimator = sagemaker.estimator.Estimator(container,
    role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    hyperparameters={'entrypoint': 'main.py'},
    sagemaker_session=sess)
code_location = sess.upload_data(CODE_PATH,
bucket=bucket,
key_prefix=custom_code_upload_location)
estimator.fit({'training-code': code_location})

<output removed>
epoch 23/100, batch 48/49, loss 0.4684

epoch 23/100, batch 49/49, loss 0.5389
epoch 23/100, training roc-auc 0.9451
EarlyStopping counter: 10 out of 10
epoch 23/100, validation roc-auc 0.8375, best validation roc-auc 0.8495
Best validation score 0.8495
Test score 0.8273
2019-11-21 14:11:03 Uploading - Uploading generated training model
2019-11-21 14:11:03 Completed - Training job completed
Training seconds: 209
Billable seconds: 209

Now, we could grab the trained model in S3, and use it to predict toxicity for large number of compounds, without having to run actual experiments. Fascinating stuff!

Now Available!
You can start using DGL on Amazon SageMaker today.

Give it a try, and please send us feedback in the DGL forum, in the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

Julien

 

New for Amazon Aurora – Use Machine Learning Directly From Your Databases

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-aurora-use-machine-learning-directly-from-your-databases/

Machine Learning allows you to get better insights from your data. But where is most of the structured data stored? In databases! Today, in order to use machine learning with data in a relational database, you need to develop a custom application to read the data from the database and then apply the machine learning model. Developing this application requires a mix of skills to be able to interact with the database and use machine learning. This is a new application, and now you have to manage its performance, availability, and security.

Can we make it easier to apply machine learning to data in a relational database? Even for existing applications?

Starting today, Amazon Aurora is natively integrated with two AWS machine learning services:

  • Amazon SageMaker, a service providing you with the ability to build, train, and deploy custom machine learning models quickly.
  • Amazon Comprehend, a natural language processing (NLP) service that uses machine learning to find insights in text.

Using this new functionality, you can use a SQL function in your queries to apply a machine learning model to the data in your relational database. For example, you can detect the sentiment of a user comment using Comprehend, or apply a custom machine learning model built with SageMaker to estimate the risk of “churn” for your customers. Churn is a word mixing “change” and “turn” and is used to describe customers that stop using your services.

You can store the output of a large query including the additional information from machine learning services in a new table, or use this feature interactively in your application by just changing the SQL code run by the clients, with no machine learning experience required.

Let’s see a couple of examples of what you can do from an Aurora database, first by using Comprehend, then SageMaker.

Configuring Database Permissions
The first step is to give the database permissions to access the services you want to use: Comprehend, SageMaker, or both. In the RDS console, I create a new Aurora MySQL 5.7 database. When it is available, in the Connectivity & security tab of the regional endpoint, I look for the Manage IAM roles section.

There I connect Comprehend and SageMaker to this database cluster. For SageMaker, I need to provide the Amazon Resource Name (ARN) of the endpoint of a deployed machine learning model. If you want to use multiple endpoints, you need to repeat this step. The console takes care of creating the service roles for the Aurora database to access those services in order for the new machine learning integration to work.

Using Comprehend from Amazon Aurora
I connect to the database using a MySQL client. To run my tests, I create a table storing comments for a blogging platform and insert a few sample records:

CREATE TABLE IF NOT EXISTS comments (
       comment_id INT AUTO_INCREMENT PRIMARY KEY,
       comment_text VARCHAR(255) NOT NULL
);

INSERT INTO comments (comment_text)
VALUES ("This is very useful, thank you for writing it!");
INSERT INTO comments (comment_text)
VALUES ("Awesome, I was waiting for this feature.");
INSERT INTO comments (comment_text)
VALUES ("An interesting write up, please add more details.");
INSERT INTO comments (comment_text)
VALUES ("I don’t like how this was implemented.");

To detect the sentiment of the comments in my table, I can use the aws_comprehend_detect_sentiment and aws_comprehend_detect_sentiment_confidence SQL functions:

SELECT comment_text,
       aws_comprehend_detect_sentiment(comment_text, 'en') AS sentiment,
       aws_comprehend_detect_sentiment_confidence(comment_text, 'en') AS confidence
  FROM comments;

The aws_comprehend_detect_sentiment function returns the most probable sentiment for the input text: POSITIVE, NEGATIVE, or NEUTRAL. The aws_comprehend_detect_sentiment_confidence function returns the confidence of the sentiment detection, between 0 (not confident at all) and 1 (fully confident).

Using SageMaker Endpoints from Amazon Aurora
Similarly to what I did with Comprehend, I can access a SageMaker endpoint to enrich the information stored in my database. To see a practical use case, let’s implement the customer churn example mentioned at the beginning of this post.

Mobile phone operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct a machine learning model. As input for the model, we’re looking at the current subscription plan, how much the customer is speaking on the phone at different times of day, and how often has called customer service.

Here’s the structure of my customer table:

SHOW COLUMNS FROM customers;

To be able to identify customers at risk of churn, I train a model following this sample SageMaker notebook using the XGBoost algorithm. When the model has been created, it’s deployed to a hosted endpoint.

When the SageMaker endpoint is in service, I go back to the Manage IAM roles section of the console to give the Aurora database permissions to access the endpoint ARN.

Now, I create a new will_churn SQL function giving input to the endpoint the parameters required by the model:

CREATE FUNCTION will_churn (
       state varchar(2048), acc_length bigint(20),
       area_code bigint(20), int_plan varchar(2048),
       vmail_plan varchar(2048), vmail_msg bigint(20),
       day_mins double, day_calls bigint(20),
       eve_mins double, eve_calls bigint(20),
       night_mins double, night_calls bigint(20),
       int_mins double, int_calls bigint(20),
       cust_service_calls bigint(20))
RETURNS varchar(2048) CHARSET latin1
       alias aws_sagemaker_invoke_endpoint
       endpoint name 'estimate_customer_churn_endpoint_version_123';

As you can see, the model looks at the customer’s phone subscription details and service usage patterns to identify the risk of churn. Using the will_churn SQL function, I run a query over my customers table to flag customers based on my machine learning model. To store the result of the query, I create a new customers_churn table:

CREATE TABLE customers_churn AS
SELECT *, will_churn(state, acc_length, area_code, int_plan,
       vmail_plan, vmail_msg, day_mins, day_calls,
       eve_mins, eve_calls, night_mins, night_calls,
       int_mins, int_calls, cust_service_calls) will_churn
  FROM customers;

Let’s see a few records from the customers_churn table:

SELECT * FROM customers_churn LIMIT 7;

I am lucky the first 7 customers are apparently not going to churn. But what happens overall? Since I stored the results of the will_churn function, I can run a SELECT GROUP BY statement on the customers_churn table.

SELECT will_churn, COUNT(*) FROM customers_churn GROUP BY will_churn;

Starting from there, I can dive deep to understand what brings my customers to churn.

If I create a new version of my machine learning model, with a new endpoint ARN, I can recreate the will_churn function without changing my SQL statements.

Available Now
The new machine learning integration is available today for Aurora MySQL 5.7, with the SageMaker integration generally available and the Comprehend integration in preview. You can learn more in the documentation. We are working on other engines and versions: Aurora MySQL 5.6 and Aurora PostgreSQL 10 and 11 are coming soon.

The Aurora machine learning integration is available in all regions in which the underlying services are available. For example, if both Aurora MySQL 5.7 and SageMaker are available in a region, then you can use the integration for SageMaker. For a complete list of services availability, please see the AWS Regional Table.

There’s no additional cost for using the integration, you just pay for the underlying services at your normal rates. Pay attention to the size of your queries when using Comprehend. For example, if you do sentiment analysis on user feedback in your customer service web page, to contact those who made particularly positive or negative comments, and people are making 10,000 comments a day, you’d pay $3/day. To optimize your costs, remember to store results.

It’s never been easier to apply machine learning models to data stored in your relational databases. Let me know what you are going to build with this!

Danilo

Provisioning the Intuit Data Lake with Amazon EMR, Amazon SageMaker, and AWS Service Catalog

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/provisioning-the-intuit-data-lake-with-amazon-emr-amazon-sagemaker-and-aws-service-catalog/

This post shares Intuit’s learnings and recommendations for running a data lake on AWS. The Intuit Data Lake is built and operated by numerous teams in Intuit Data Platform. Thanks to Tristan Baker (Chief Architect), Neil Lamka (Principal Product Manager), Achal Kumar (Development Manager), Nicholas Audo, and Jimmy Armitage for their feedback and support.

A data lake is a centralized repository for storing structured and unstructured data at any scale. At Intuit, creating such a pile of raw data is easy. However, more interesting challenges present themselves:

  1. How should AWS accounts be organized?
  2. What ingestion methods will be used? How will analysts find the data they need?
  3. Where should data be stored? How should access be managed?
  4. What security measures are needed to protect Intuit’s sensitive data?
  5. Which parts of this ecosystem can be automated?

This post outlines the approach taken by Intuit, though it is important to remember that there are many ways to build a data lake (for example, AWS Lake Formation).

We’ll cover the technologies and processes involved in creating the Intuit Data Lake at a high level, including the overall structure and the automation used in provisioning accounts and resources. Watch this space in the future for more detailed blog posts on specific aspects of the system, from the other teams and engineers who worked together to build the Intuit Data Lake.

Architecture

Account Structure

Data lakes typically follow a hub-and-spoke model, with the hub account containing shared services that control access to data sources. For the purposes of this post, we’ll refer to the hub account as Central Data Lake.

In this pattern, access to Central Data Lake is apportioned to spoke accounts called Processing Accounts. This model maintains separation between end users and allows for division of billing among distinct business units.

 

 

It is common to maintain two ecosystems: pre-production (Pre-Prod) and production (Prod). This allows data lake administrators to silo access to data by preventing connectivity between Pre-Prod and Prod.

To enable experimentation and testing, it may also be advisable to maintain separate VPC-based environments within Pre-Prod accounts, such as dev, qa, and e2e. Processing Account VPCs would then be connected to the corresponding VPC in Central Data Lake.

Note that at first, we connected accounts via VPC Peering. However, as we scaled we quickly approached the hard limit of 125 VPC peering connections, requiring us to migrate to AWS Transit Gateway. As of this writing, we connect multiple new Processing Accounts weekly.

 

 

Central Data Lake

There may be numerous services running in a hub account, but we’ll focus on the aspects that are most relevant to this blog: ingestion, sanitization, storage, and a data catalog.

 

 

Ingestion, Sanitization, and Storage

A key component to Central Data Lake is a uniform ingestion pattern for streaming data. One example is an Apache Kafka cluster running on Amazon EC2. (You can read about how Intuit engineers do this in another AWS blog.) As we deal with hundreds of data sources, we’ve enabled access to ingestion mechanisms via AWS PrivateLink.

Note: Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an alternative for running Apache Kafka on Amazon EC2, but was not available at the start of Intuit’s migration.

In addition to stream processing, another method of ingestion is batch processing, such as jobs running on Amazon EMR. After data is ingested by one of these methods, it can be stored in Amazon S3 for further processing and analysis.

Intuit deals with a large volume of customer data, and each field is carefully considered and classified with a sensitivity level. All sensitive data that enters the lake is encrypted at the source. The ingestion systems retrieve the encrypted data and move it into the lake. Before it is written to S3, the data is sanitized by a proprietary RESTful service. Analysts and engineers operating within the data lake consume this masked data.

Data Catalog

A data catalog is a common way to give end users information about the data and where it lives. One example is a Hive Metastore backed by Amazon Aurora. Another alternative is the AWS Glue Data Catalog.

Processing Accounts

When Processing Accounts are delivered to end users, they include an identical set of resources. We’ll discuss the automation of Processing Accounts below, but the primary components are as follows:

 

 

                           Processing Account structure upon delivery to the customer

 

Data Storage Mechanisms

One reasonable question is whether all data should reside in Central Data Lake, or if it’s acceptable to distribute data across multiple accounts. A data lake might employ a combination of the two approaches, and classify data locations as primary or secondary.

The primary location for data is Central Data Lake, and it arrives there via the ingestion pipelines discussed previously. Processing Accounts can read from the primary source, either directly from the ingestion pipelines or from S3. Processing Accounts can contribute their transformed data back into Central Data Lake (primary), or store it in their own accounts (secondary). The proper storage location depends on the type of data, and who needs to consume it.

One rule worth enforcing is that no cross-account writes should be permitted. In other words, the IAM principal (in most cases, an IAM role assumed by EC2 via an instance profile) must be in the same account as the destination S3 bucket. This is because cross-account delegation is not supported—specifically, S3 bucket policies in Central Data Lake cannot grant Processing Account A access to objects written by a role in Processing Account B.

Another possibility is for EMR to assume different IAM roles via a custom credentials provider (see this AWS blog), but we chose not to go down this path at Intuit because it would have required many EMR jobs to be rewritten.

 

 

Data Access Patterns

The majority of end users are interested in the data that resides in S3. In Central Data Lake and some Processing Accounts, there may be a set of read-only S3 buckets: any account in the data lake ecosystem can read data from this type of bucket.

To facilitate management of S3 access for read-only buckets, we built a mechanism to control S3 bucket policies, administered entirely via code. Our deployment pipelines use account metadata to dynamically generate the correct S3 bucket policy based on the type of account (Pre-Prod or Prod). These policies are committed back into our code repository for auditability and ease of management.

We employ the same method for managing KMS key policies, as we use KMS with customer managed customer master keys (CMKs) for at-rest encryption in S3.

Here’s an example of a generated S3 bucket policy for a read-only bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProcessingAccountReadOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111111111111:root",
                    "arn:aws:iam::222222222222:root",
                    "arn:aws:iam::333333333333:root",
                    "arn:aws:iam::444444444444:root",
                    "arn:aws:iam::555555555555:root",
                    ...
                    ...
                    ...
                    "arn:aws:iam::999999999999:root",
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::intuit-data-lake-example/*",
                "arn:aws:s3:::intuit-data-lake-example"
            ]
        }
    ]
}

Note that we grant access at the account level, rather than using explicit IAM principal ARNs. Because the reads are cross-account, permissions are also required on the IAM principals in Processing Accounts. Maintaining these policies—with automation, at that level of granularity—is untenable at scale. Furthermore, using specific IAM principal ARNs would create an external dependency on foreign accounts. For example, if a Processing Account deletes an IAM role that is referenced in an S3 bucket policy in Central Data Lake, the bucket policy can no longer be saved, causing interruptions to deployment pipelines.

Security

Security is mission critical for any data lake. We’ll mention a subset of the controls we use, but not dive deep.

Encryption

Encryption can be enforced both in transit and at rest, using multiple methods:

  1. Traffic within the lake should use the latest version of TLS (1.2 as of this writing)
  2. Data can be encrypted with application-level (client-side) encryption
  3. KMS keys can used for at-rest encryption of S3, EBS, and RDS

Ingress and Egress

There’s nothing out of the ordinary in our approach to ingress and egress, but it’s worth mentioning the standard patterns we’ve found important:

Policies restricting ingress and egress are the primary points at which a data lake can guarantee quality (ingress) and prevent loss (egress).

Authorization

Access to the Intuit Data Lake is controlled via IAM roles, meaning no IAM users (with long-term credentials) are created. End users are granted access via an internal service that manages role-based, federated access to AWS accounts. Regular reviews are conducted to remove nonessential users.

Configuration Management

We use an internal fork of Cloud Custodian, which is a suite of preventative, detective, and responsive controls consisting of Amazon CloudWatch Events and AWS Config rules. Some of the violations it reports and (optionally) mitigates include:

  • Unauthorized CIDRs in inbound security group rules
  • Public S3 bucket policies and ACLs
  • IAM user console access
  • Unencrypted S3 buckets, EBS volumes, and RDS instances

Lastly, Amazon GuardDuty is enabled in all Intuit Data Lake accounts and is monitored by Intuit Security.

Automation

If there is one thing we’ve learned building the Intuit Data Lake, it is to automate everything.

There are four areas of automation we’ll discuss in this blog:

  1. Creation of Processing Accounts
  2. Processing Account Orchestration Pipeline
  3. Processing Account Terraform Pipeline
  4. EMR and SageMaker deployment via Service Catalog

Creation of Processing Accounts

The first step in creating a Processing Account is to make a request through an internal tool. This triggers automation that provisions an Intuit-stamped AWS account under the correct business unit.

 

Note: AWS Control Tower’s Account Factory was not available at the start of our journey, but it can be leveraged to provision new AWS accounts in a secured, best practice, self-service way.

Account setup also includes automated VPC creation (with optional VPN), fully automated using Service Catalog. End users simply specify subnet sizes.

It’s worth noting that Intuit leverages Service Catalog for self-service deployment of other common patterns, including ingress security groups, VPC endpoints, and VPC peering. Here’s an example portfolio:

Processing Account Orchestration Pipeline

After account creation and VPC provisioning, the Processing Account Orchestration Pipeline runs. This pipeline executes one-time tasks required for Processing Accounts. These tasks include:

  • Bootstrapping an IAM role for use in further configuration management
  • Creation of KMS keys for S3, EBS, and RDS encryption
  • Creation of variable files for the new account
  • Updating the master configuration file with account metadata
  • Generation of scripts to orchestrate the Terraform pipeline discussed below
  • Sharing Transit Gateways via Resource Access Manager

Processing Account Terraform Pipeline

This pipeline manages the lifecycle of dynamic, frequently-updated resources, including IAM roles, S3 buckets and bucket policies, KMS key policies, security groups, NACLs, and bastion hosts.

There is one pipeline for every Processing Account, and each pipeline deploys a series of layers into the account, using a set of parameterized deployment jobs. A layer is a logical grouping of Terraform modules and AWS resources, providing a way to shrink Terraform state files and reduce blast radius if redeployment of specific resources is required.

EMR and SageMaker Deployment via Service Catalog

AWS Service Catalog facilitates the provisioning of Amazon EMR and Amazon SageMaker, allowing end users to launch EMR clusters and SageMaker notebook instances that work out of the box, with embedded security.

Service Catalog allows data scientists and data engineers to launch EMR clusters in a self-service fashion with user-friendly parameters, and provides them with the following:

  • Bootstrap action to enable connectivity to services in Central Data Lake
  • EC2 instance profile to control S3, KMS, and other granular permissions
  • Security configuration that enables at-rest and in-transit encryption
  • Configuration classifications for optimal EMR performance
  • Encrypted AMI with monitoring and logging enabled
  • Custom Kerberos connection to LDAP

For SageMaker, we use Service Catalog to launch notebook instances with custom lifecycle configurations that set up connections or initialize the following: Hive Metastore, Kerberos, security, Splunk logging, and OpenDNS. You can read more about lifecycle configurations in this AWS blog. Launching a SageMaker notebook instance with best-practice configuration is as easy as follows:

 

 

Conclusion

This post illustrates the building blocks we used in creating the Intuit Data Lake. Our solution isn’t wholly unique, but comprised of common-sense approaches we’ve gleaned from dozens of engineers across Intuit, representing decades of experience. These practices have enabled us to push petabytes of data into the lake, and serve hundreds of Processing Accounts with varying needs. We are still building, but we hope our story helps you in your data lake journey.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

 

 

 

 

Ben Covi is a staff software engineer at Intuit. At any given moment, he’s probably losing a game of Catan.