Tag Archives: Amazon SageMaker

Field Notes: Develop Data Pre-processing Scripts Using Amazon SageMaker Studio and an AWS Glue Development Endpoint

Post Syndicated from Sam Mokhtari original https://aws.amazon.com/blogs/architecture/field-notes-develop-data-pre-processing-scripts-using-amazon-sagemaker-studio-and-an-aws-glue-development-endpoint/

This post was co-written with Marcus Rosen, a Principal  – Machine Learning Operations with Rio Tinto, a global mining company. 

Data pre-processing is an important step in setting up Machine Learning (ML) projects for success. Many AWS customers use Apache Spark on AWS Glue or Amazon EMR to run data pre-processing scripts while using Amazon SageMaker to build ML models. To develop Spark scripts in AWS Glue, you can create an environment called an AWS Glue Development (Dev) Endpoint that lets you author and test your data pre-processing scripts iteratively. When you’re satisfied with the results of your development, you can create an AWS Glue ETL job that runs the final script as part of your automation framework.

With the introduction of Amazon SageMaker Studio at AWS re:Invent 2019, you can now use a single web-based IDE to spin up a notebook and perform all ML development steps. These include data pre-processing, ML model training, ML model deployment, and monitoring.

This post walks you through how to connect a SageMaker Studio notebook to an AWS Glue Dev Endpoint, so you can use a single tool to iteratively develop both data pre-processing scripts and ML models.

Solution Overview

The following diagram shows the components that are used in this solution.

  • First, we use an AWS CloudFormation template to set up the required networking components (for example, VPC, subnets).
  • Then, we create an AWS Glue Dev Endpoint and use a security group to allow SageMaker Studio to securely access the endpoint.
  • Finally, we create a Studio domain and use a SparkMagic kernel to connect to the AWS Glue Dev Endpoint and run Spark scripts.

In the Amazon SageMaker Studio notebook, SparkMagic will call a REST API against a Livy server running on the AWS Glue Dev Endpoint. Apache Livy is a service that enables interaction with a remote Spark cluster over a REST API.
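You don’t need to call Livy yourself — SparkMagic handles the REST calls for you. Purely for illustration, the following minimal Python sketch shows what that interaction looks like once the Livy port is reachable on localhost:8998 (the port forwarding is set up later in this post); the requests library and the sample statement are assumptions made for the sketch, not part of the solution.

import time
import requests

LIVY = "http://localhost:8998"  # forwarded Livy port (set up later in this post)

# Create a PySpark session on the remote AWS Glue Dev Endpoint.
session = requests.post(LIVY + "/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY}/sessions/{session['id']}"

# Wait for the session to become idle before submitting code.
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(5)

# Submit a statement and poll for its result.
stmt = requests.post(session_url + "/statements",
                     json={"code": "spark.sql('show databases').show()"}).json()
result = requests.get(f"{session_url}/statements/{stmt['id']}").json()
print(result)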

 


Set up the VPC

You can use the following CloudFormation template to set up the environment needed for this solution.

launch stack button

This template deploys the following resources in your account:

  • A new VPC, with both public and private subnets.
  • VPC endpoints for the AWS services used in this solution.
  • Security groups for SageMaker Studio, the AWS Glue Dev Endpoint, and the VPC endpoints.
  • A SageMaker Service IAM role.
  • An AWS Glue Dev Endpoint IAM role.

Set up AWS Glue Dev Endpoint

Review the Developer Guide topic Adding a Development Endpoint for instructions on creating an AWS Glue Dev Endpoint.

Note: you must use the AWS Glue Dev Endpoint IAM role provisioned by the CloudFormation template.

  • In the Networking section, select Choose a VPC, subnet, and security groups.

Then choose the AWS Glue security group that you provisioned through the CloudFormation template.

The AWS Glue Dev Endpoint needs to be secured with an SSH public key, which you generate in your local environment. You can generate an SSH key pair (public/private) using ssh-keygen on Linux or PuTTYgen on Windows.

Glue Dev Endpoint screenshot

The final review page looks similar to the following screenshot.

Final review page

Once the AWS Glue Dev Endpoint is in the Ready status, take note of its private IP address (Glue -> ETL -> Dev Endpoints). You will use this IP address for the Livy port forwarding.

Set up SageMaker Studio

We recommend launching the SageMaker Studio resource by following the instructions in Securing Amazon SageMaker Studio connectivity using a private VPC.

Follow these steps when you provision the SageMaker Studio resources:

  • Select Standard setup with the AWS Identity and Access Management (IAM) authentication method.
  • Attach a SageMaker Service IAM role, created by the CloudFormation template, to SageMaker Studio.
  • Under Network and storage, select the same VPC and private subnet as the AWS Glue endpoint.
  • For the Network Access for Studio option, select VPC Only — SageMaker Studio will use your VPC. Direct internet access is disabled.

Then ensure that the security group with the self-referencing rule is attached. Also check that the other security groups listed in the CloudFormation template output are attached to SageMaker Studio.

Connect the SageMaker Studio notebook to the AWS Glue Dev Endpoint

Once you have launched SageMaker Studio and added your users, follow these steps to connect the SageMaker Studio notebook to the AWS Glue Dev Endpoint:

  1. Open Studio and go to the launcher page (by pressing the “+” icon on the top-left of the page).
  2. Under Notebooks and compute resources, select SparkMagic in the dropdown menu and select Notebook.
  3. Then open another launcher page, select SparkMagic in the same dropdown menu, and select Image terminal. Note that the SparkMagic app takes some time to initialize; proceed once the apps are in Ready status (2-3 minutes).

Notebooks and compute resources screenshot

4. Upload the private key into the SparkMagic image terminal; that is, copy the private key to the “.ssh” directory and update its permissions using “chmod 400”.

Note: the private key corresponds to the public key you used when you created the AWS Glue Dev Endpoint.

5. Now, set up port forwarding for the Livy service so that the SparkMagic kernel can connect to the AWS Glue Dev Endpoint. Run the following command in the image terminal:

/usr/bin/ssh -4 -N -o ServerAliveInterval=60 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -i /root/.ssh/{PRIVATE_KEY} -L 8998:169.254.76.1:8998 glue@{GLUE_ENDPOINT_PRIVATE_IP_ADDRESS}

The command uses the following values:

  • {PRIVATE_KEY} is the private key file name that you copied into the .ssh directory.
  • {GLUE_ENDPOINT_PRIVATE_IP_ADDRESS} is the private IP address of the AWS Glue Dev Endpoint.
  • “8998” is the Livy port we are using for port forwarding.
  • “169.254.76.1” is the remote IP address defined by AWS Glue; this IP address does not change.

Note: Keep this terminal open and the SSH command running in order to keep the Livy session active.

6. Go to the SparkMagic notebook and restart the kernel, by going to the top menu and selecting Kernel > Restart Kernel.

7. Once the notebook kernel is restarted, the connection between the Studio Notebook and the AWS Glue Dev Endpoint is ready. To test the integration, you can run the following example command to list the tables in the AWS Glue Data Catalog.

spark.sql("show tables").show()
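Beyond listing tables, you can use the same kernel to iteratively develop your pre-processing logic against tables registered in the AWS Glue Data Catalog. The following is a minimal sketch only; the database, table, and column names (sales_db.raw_orders, order_id, amount) are hypothetical placeholders for your own Data Catalog entries.

# Read a table registered in the AWS Glue Data Catalog into a Spark DataFrame.
df = spark.table("sales_db.raw_orders")          # hypothetical database.table

# Typical pre-processing steps: drop incomplete rows, derive a feature,
# and inspect the result before turning this into an AWS Glue ETL job.
from pyspark.sql import functions as F

cleaned = (
    df.dropna(subset=["order_id", "amount"])     # hypothetical columns
      .withColumn("amount_log", F.log1p(F.col("amount")))
)
cleaned.printSchema()
cleaned.show(5)

Once the logic looks right, you can move the same code into an AWS Glue ETL job script with minimal changes.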


Cleaning up

To avoid incurring future charges, delete the resources you created, including the AWS Glue Dev Endpoint, the SageMaker Studio domain, and the CloudFormation stack.

Conclusion

Our customers needed a single web-based IDE to spin up a notebook and perform all ML development steps, including data pre-processing, ML model training, ML model deployment, and monitoring. This blog post demonstrated how you can configure a SageMaker Studio notebook and connect it to an AWS Glue Dev Endpoint. This provides a framework for you to use when developing both data pre-processing scripts and ML models.

To learn more about how to develop data pre-processing scripts and ML models in Amazon SageMaker, you can check out the examples in this repository.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

 

Building a Cloud-based OLAP Cube and ETL Architecture with AWS Managed Services

Post Syndicated from Peter Chung original https://aws.amazon.com/blogs/architecture/building-a-cloud-based-olap-cube-and-etl-architecture-with-aws-managed-services/

For decades, enterprises used online analytical processing (OLAP) workloads to answer complex questions about their business by filtering and aggregating their data. These complex queries were compute and memory-intensive. This required teams to build and maintain complex extract, transform, and load (ETL) pipelines to model and organize data, oftentimes with commercial-grade analytics tools.

In this post, we discuss building a cloud-based OLAP cube and ETL architecture that will yield faster results at lower costs without sacrificing performance by:

  • Connecting your on-premises database to the cloud for data profiling, discovery, and transformation
  • Running OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate data
  • Using AWS Glue Data Catalog, Amazon Athena, Amazon QuickSight, and Amazon SageMaker to catalog and visualize data with machine learning (ML)

Data analytics pipeline with AWS Managed Services

The proposed architecture in Figure 1 relies on AWS Managed Services. AWS Glue DataBrew is a no-code data transformation service that you can use to quickly build your transformation jobs. AWS Glue crawlers collect metadata from the transformed data and catalog it for analytics and visualization using Athena and QuickSight. SageMaker will build, train, and deploy ML models.

This architecture will help you get answers from your data to your users as fast as possible without needing to migrate your data to AWS. There is no coding required, so you can leverage data transformation, cataloging, analytics, and ML quickly.


Figure 1. Example architecture using AWS Managed Services

Benefits of AWS Managed Services for data analytics

Immediate connectivity to on-premises databases

The example architecture in Figure 1 begins with an online transaction processing (OLTP) database running in your corporate data center. Figure 2 shows how you can establish a Java database connectivity (JDBC) connection from the OLTP database to DataBrew running in AWS to run OLAP workloads. DataBrew supports data sources using JDBC for common data stores such as Microsoft SQL Server, MySQL, Oracle, and PostgreSQL.


Figure 2. DataBrew – JDBC connection to data source

Automatic data discovery

Figures 3 through 6 show how DataBrew summarizes your data for discovery. You can profile your data to understand patterns and detect anomalies. You can also run transformations called “jobs” in DataBrew without writing any code using over 250 built-in transforms.


Figure 3. DataBrew – dataset profiling overview

 


Figure 4. DataBrew – data correlation patterns

 


Figure 5. DataBrew – data points distribution

No-code data transformation and cataloging

To run OLAP-type transformations, you can create jobs based on the transformation steps shown in Figure 6. Collectively, these steps are referred to as a DataBrew recipe. A recipe can be run as a job, with the results written to an Amazon Simple Storage Service (Amazon S3) bucket.


Figure 6. A DataBrew project user interface view with sample data and transformation functions

Scheduled DataBrew jobs act similarly to scheduled ETL pipelines in OLAP. Based on data refresh and business requirements, DataBrew can run a job on a recurring basis (for example, every 12 hours). This can be run at a particular time of day, or as defined by a valid CRON expression. This helps you automate your transformation workflows.
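As a sketch of what automating such a schedule looks like, the following code attaches a recurring schedule to an existing DataBrew job using boto3; the job name, schedule name, and cron expression (every 12 hours) are hypothetical examples, not values created elsewhere in this post.

import boto3

databrew = boto3.client("databrew")

# Attach a recurring schedule (every 12 hours) to an existing DataBrew job.
# 'olap-transform-job' and 'olap-transform-schedule' are hypothetical names.
databrew.create_schedule(
    JobNames=["olap-transform-job"],
    CronExpression="cron(0 */12 * * ? *)",
    Name="olap-transform-schedule",
)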

The OLAP catalog is a set of metadata that sits between the actual OLAP data stored and applications. To create a Data Catalog, you can use AWS Glue crawlers to automatically classify your data and determine the data’s format, schema, and associated properties. Figure 7 shows a crawler’s results written to the Data Catalog as metadata that helps data users find the data they need.


Figure 7. AWS Glue crawler metadata table output of column names and data types

Data analytics without third-party software licenses

You can run analytics on your data by referring to the metadata definitions in the Data Catalog as references to the actual data in Amazon S3 using Athena. Athena is well suited for running one-time queries using standard SQL to query the transformed data directly in Amazon S3 without having to move data around. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
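For example, once the data has been cataloged, you can submit a standard SQL query through the Athena API. The following is a minimal sketch; the database, table, and results-bucket names are hypothetical placeholders rather than resources created in this post.

import boto3

athena = boto3.client("athena")

# Run a standard SQL aggregation directly against the cataloged data in S3.
# 'olap_catalog', 'sales_curated', and the results bucket are hypothetical.
response = athena.start_query_execution(
    QueryString="""
        SELECT region, SUM(revenue) AS total_revenue
        FROM sales_curated
        GROUP BY region
    """,
    QueryExecutionContext={"Database": "olap_catalog"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/olap/"},
)
print(response["QueryExecutionId"])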

Enterprises often supplement their OLAP workloads with separate visualization and business intelligence (BI) tools. These tools often come with their own licensing, server management, and security considerations.

You can visualize curated data using QuickSight, a scalable, serverless, embeddable, ML-powered BI service. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights, as shown in Figure 8. These dashboards can be shared with other users and embedded within your own applications.


Figure 8. A sample of data visualization options with Amazon QuickSight

Finally, you can incorporate ML workloads into OLAP workloads using SageMaker. In the past, ML workloads were often expensive, resource-intensive, and inaccessible. SageMaker provides a fully managed ML service to quickly and easily build and train ML models and directly deploy them into a production-ready hosted environment.

Conclusion

In this post, we show you how to connect your on-premises database using a JDBC connection to DataBrew for data profiling, discovery, and transformation. We looked at how you can use DataBrew recipes and jobs to run OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate any data. We also looked at AWS capabilities in data cataloging, visualization, and machine learning using Data Catalog, Athena, QuickSight, and SageMaker without having to manage any servers.

Laying the foundation to modernize an analytics workflow is critical for many enterprises that are looking to reduce the time it takes to understand their business. With AWS, you can perform enterprise-scale analytics with our portfolio of analytics services.

 

Amazon SageMaker Named as the Outright Leader in Enterprise MLOps Platforms

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-named-as-the-outright-leader-in-enterprise-mlops-platforms/

Over the last few years, Machine Learning (ML) has proven its worth in helping organizations increase efficiency and foster innovation. As ML matures, the focus naturally shifts from experimentation to production. ML processes need to be streamlined, standardized, and automated to build, train, deploy, and manage models in a consistent and reliable way. Perennial IT concerns such as security, high availability, scaling, monitoring, and automation also become critical. Great ML models are not going to do much good if they can’t serve fast and accurate predictions to business applications, 24/7 and at any scale.

In November 2017, we launched Amazon SageMaker to help ML Engineers and Data Scientists not only build the best models, but also operate them efficiently. Striving to give our customers the most comprehensive service, we’ve since then added hundreds of features covering every step of the ML lifecycle, such as data labeling, data preparation, feature engineering, bias detection, AutoML, training, tuning, hosting, explainability, monitoring, and automation. We’ve also integrated these features in our web-based development environment, Amazon SageMaker Studio.

Thanks to the extensive ML capabilities available in SageMaker, tens of thousands of AWS customers across all industry segments have adopted ML to accelerate business processes, create innovative user experiences, improve revenue, and reduce costs. Examples include Engie (energy), Deliveroo (food delivery), SNCF (railways), Nerdwallet (financial services), Autodesk (computer-aided design), Formula 1 (auto racing), as well as our very own Amazon Fulfillment Technologies and Amazon Robotics.

Today, we’re happy to announce that in his latest report on Enterprise MLOps Platforms, Bradley Shimmin, Chief Analyst at Omdia, paid SageMaker this compliment: “AWS is the outright leader in the Omdia comparative review of enterprise MLOps platforms. Across almost every measure, the company significantly outscored its rivals, delivering consistent value across the entire ML lifecycle. AWS delivers highly differentiated functionality that targets highly impactful areas of concern for enterprise AI practitioners seeking to not just operationalize but also scale AI across the business.”


You can download the full report to learn more.

Getting Started
Curious about Amazon SageMaker? The developer guide will show you how to set it up and start running your notebooks in minutes.

As always, we look forward to your feedback. You can send it through your usual AWS Support contacts or post it on the AWS Forum for Amazon SageMaker.

– Julien

Amazon Redshift ML Is Now Generally Available – Use SQL to Create Machine Learning Models and Make Predictions from Your Data

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-redshift-ml-is-now-generally-available-use-sql-to-create-machine-learning-models-and-make-predictions-from-your-data/

With Amazon Redshift, you can use SQL to query and combine exabytes of structured and semi-structured data across your data warehouse, operational databases, and data lake. Now that AQUA (Advanced Query Accelerator) is generally available, you can improve the performance of your queries by up to 10 times with no additional costs and no code changes. In fact, Amazon Redshift provides up to three times better price/performance than other cloud data warehouses.

But what if you want to go a step further and process this data to train machine learning (ML) models and use these models to generate insights from data in your warehouse? For example, to implement use cases such as forecasting revenue, predicting customer churn, and detecting anomalies? In the past, you would need to export the training data from Amazon Redshift to an Amazon Simple Storage Service (Amazon S3) bucket, and then configure and start a machine learning training process (for example, using Amazon SageMaker). This process required many different skills and usually more than one person to complete. Can we make it easier?

Today, Amazon Redshift ML is generally available to help you create, train, and deploy machine learning models directly from your Amazon Redshift cluster. To create a machine learning model, you use a simple SQL query to specify the data you want to use to train your model, and the output value you want to predict. For example, to create a model that predicts the success rate for your marketing activities, you define your inputs by selecting the columns (in one or more tables) that include customer profiles and results from previous marketing campaigns, and the output column you want to predict. In this example, the output column could be one that shows whether a customer has shown interest in a campaign.

After you run the SQL command to create the model, Redshift ML securely exports the specified data from Amazon Redshift to your S3 bucket and calls Amazon SageMaker Autopilot to prepare the data (pre-processing and feature engineering), select the appropriate pre-built algorithm, and apply the algorithm for model training. You can optionally specify the algorithm to use, for example XGBoost.

Architectural diagram.

Redshift ML handles all of the interactions between Amazon Redshift, S3, and SageMaker, including all the steps involved in training and compilation. When the model has been trained, Redshift ML uses Amazon SageMaker Neo to optimize the model for deployment and makes it available as a SQL function. You can use the SQL function to apply the machine learning model to your data in queries, reports, and dashboards.

Redshift ML now includes many new features that were not available during the preview, including Amazon Virtual Private Cloud (VPC) support. For example:

Architectural diagram.

  • You can also create SQL functions that use existing SageMaker endpoints to make predictions (remote inference). In this case, Redshift ML batches calls to the endpoint to speed up processing.

Before looking into how to use these new capabilities in practice, let’s see the difference between Redshift ML and similar features in AWS databases and analytics services.

Amazon Redshift ML
  • Data: data warehouse, federated relational databases, and S3 data lake (with Redshift Spectrum)
  • Training from SQL: yes, using Amazon SageMaker Autopilot
  • Predictions using SQL functions: yes; a model can be imported and executed inside the Amazon Redshift cluster, or invoked using a SageMaker endpoint

Amazon Aurora ML
  • Data: relational database (compatible with MySQL or PostgreSQL)
  • Training from SQL: no
  • Predictions using SQL functions: yes, using a SageMaker endpoint; a native integration with Amazon Comprehend for sentiment analysis is also available

Amazon Athena ML
  • Data: S3 data lake; other data sources can be used through Athena Federated Query
  • Training from SQL: no
  • Predictions using SQL functions: yes, using a SageMaker endpoint

Building a Machine Learning Model with Redshift ML
Let’s build a model that predicts if customers will accept or decline a marketing offer.

To manage the interactions with S3 and SageMaker, Redshift ML needs permissions to access those resources. I create an AWS Identity and Access Management (IAM) role as described in the documentation. I use RedshiftML for the role name. Note that the trust policy of the role allows both Amazon Redshift and SageMaker to assume the role to interact with other AWS services.

From the Amazon Redshift console, I create a cluster. In the cluster permissions, I associate the RedshiftML IAM role. When the cluster is available, I load the same dataset used in this super interesting blog post that my colleague Julien wrote when SageMaker Autopilot was announced.

The file I am using (bank-additional-full.csv) is in CSV format. Each line describes a direct marketing activity with a customer. The last column (y) describes the outcome of the activity (if the customer subscribed to a service that was marketed to them).

Here are the first few lines of the file. The first line contains the headers.

age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no

I store the file in one of my S3 buckets. The S3 bucket is used to unload data and store SageMaker training artifacts.

Then, using the Amazon Redshift query editor in the console, I create a table to load the data.

CREATE TABLE direct_marketing (
	age DECIMAL NOT NULL, 
	job VARCHAR NOT NULL, 
	marital VARCHAR NOT NULL, 
	education VARCHAR NOT NULL, 
	credit_default VARCHAR NOT NULL, 
	housing VARCHAR NOT NULL, 
	loan VARCHAR NOT NULL, 
	contact VARCHAR NOT NULL, 
	month VARCHAR NOT NULL, 
	day_of_week VARCHAR NOT NULL, 
	duration DECIMAL NOT NULL, 
	campaign DECIMAL NOT NULL, 
	pdays DECIMAL NOT NULL, 
	previous DECIMAL NOT NULL, 
	poutcome VARCHAR NOT NULL, 
	emp_var_rate DECIMAL NOT NULL, 
	cons_price_idx DECIMAL NOT NULL, 
	cons_conf_idx DECIMAL NOT NULL, 
	euribor3m DECIMAL NOT NULL, 
	nr_employed DECIMAL NOT NULL, 
	y BOOLEAN NOT NULL
);

I load the data into the table using the COPY command. I can use the same IAM role I created earlier (RedshiftML) because I am using the same S3 bucket to import and export the data.

COPY direct_marketing 
FROM 's3://my-bucket/direct_marketing/bank-additional-full.csv' 
DELIMITER ',' IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
REGION 'us-east-1';

Now, I create the model straight from the SQL interface using the new CREATE MODEL statement:

CREATE MODEL direct_marketing
FROM direct_marketing
TARGET y
FUNCTION predict_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

In this SQL command, I specify the parameters required to create the model:

  • FROM – I select all the rows in the direct_marketing table, but I can replace the name of the table with a nested query (see example below).
  • TARGET – This is the column that I want to predict (in this case, y).
  • FUNCTION – The name of the SQL function to make predictions.
  • IAM_ROLE – The IAM role assumed by Amazon Redshift and SageMaker to create, train, and deploy the model.
  • S3_BUCKET – The S3 bucket where the training data is temporarily stored, and where model artifacts are stored if you choose to retain a copy of them.

Here I am using a simple syntax for the CREATE MODEL statement. For more advanced users, other options are available, such as:

  • MODEL_TYPE – To use a specific model type for training, such as XGBoost or multilayer perceptron (MLP). If I don’t specify this parameter, SageMaker Autopilot selects the appropriate model class to use.
  • PROBLEM_TYPE – To define the type of problem to solve: regression, binary classification, or multiclass classification. If I don’t specify this parameter, the problem type is discovered during training, based on my data.
  • OBJECTIVE – The objective metric used to measure the quality of the model. This metric is optimized during training to provide the best estimate from data. If I don’t specify a metric, the default behavior is to use mean squared error (MSE) for regression, the F1 score for binary classification, and accuracy for multiclass classification. Other available options are F1Macro (to apply F1 scoring to multiclass classification) and area under the curve (AUC). More information on objective metrics is available in the SageMaker documentation.
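As a sketch of how these options fit together, the following statement sets the model type, problem type, and objective explicitly. Here it is submitted through the Amazon Redshift Data API with boto3, although you can just as easily run the same SQL in the query editor; the cluster identifier, database, and user names are placeholders, not values used elsewhere in this post.

import boto3

redshift_data = boto3.client("redshift-data")

# CREATE MODEL with explicit advanced options; the cluster, database, and
# user names below are hypothetical placeholders.
sql = """
CREATE MODEL direct_marketing_xgb
FROM direct_marketing
TARGET y
FUNCTION predict_direct_marketing_xgb
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
MODEL_TYPE XGBOOST
PROBLEM_TYPE BINARY_CLASSIFICATION
OBJECTIVE 'F1'
SETTINGS (
  S3_BUCKET 'my-bucket'
);
"""

redshift_data.execute_statement(
    ClusterIdentifier="redshift-ml-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=sql,
)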

Depending on the complexity of the model and the amount of data, it can take some time for the model to be available. I use the SHOW MODEL command to see when it is available:

SHOW MODEL direct_marketing

When I execute this command using the query editor in the console, I get the following output:

Console screenshot.

As expected, the model is currently in the TRAINING state.

When I created this model, I selected all the columns in the table as input parameters. I wonder what happens if I create a model that uses fewer input parameters? I am in the cloud and I am not slowed down by limited resources, so I create another model using a subset of the columns in the table:

CREATE MODEL simple_direct_marketing
FROM (
        SELECT age, job, marital, education, housing, contact, month, day_of_week, y
 	  FROM direct_marketing
)
TARGET y
FUNCTION predict_simple_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

After some time, my first model is ready, and I get this output from SHOW MODEL. The actual output in the console spans multiple pages; I merged the results here to make it easier to follow:

Console screenshot.

From the output, I see that the model has been correctly recognized as BinaryClassification, and F1 has been selected as the objective. The F1 score is a metric that considers both precision and recall. It returns a value between 0 (lowest possible score) and 1 (perfect precision and recall). The final score for the model (validation:f1) is 0.79. In this table, I also find the name of the SQL function (predict_direct_marketing) that has been created for the model, its parameters and their types, and an estimate of the training costs.

When the second model is ready, I compare the F1 scores. The F1 score of the second model is lower (0.66) than the first one. However, with fewer parameters the SQL function is easier to apply to new data. As is often the case with machine learning, I have to find the right balance between complexity and usability.

Using Redshift ML to Make Predictions
Now that the two models are ready, I can make predictions using SQL functions. Using the first model, I check how many false positives (wrong positive predictions) and false negatives (wrong negative predictions) I get when applying the model on the same data used for training:

SELECT predict_direct_marketing, y, COUNT(*)
  FROM (SELECT predict_direct_marketing(
                   age, job, marital, education, credit_default, housing,
                   loan, contact, month, day_of_week, duration, campaign,
                   pdays, previous, poutcome, emp_var_rate, cons_price_idx,
                   cons_conf_idx, euribor3m, nr_employed), y
          FROM direct_marketing)
 GROUP BY predict_direct_marketing, y;

The result of the query shows that the model is better at predicting negative rather than positive outcomes. In fact, even though the number of true negatives is much larger than the number of true positives, there are many more false positives than false negatives. I added some comments in green and red to the following screenshot to clarify the meaning of the results.

Console screenshot.

Using the second model, I see how many customers might be interested in a marketing campaign. Ideally, I should run this query on new customer data, not the same data I used for training.

SELECT COUNT(*)
  FROM direct_marketing
 WHERE predict_simple_direct_marketing(
           age, job, marital, education, housing,
           contact, month, day_of_week) = true;

Wow, looking at the results, there are more than 7,000 prospects!

Console screenshot.

Availability and Pricing
Redshift ML is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), US West (N. California), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (Paris), Europe (Stockholm), Asia Pacific (Hong Kong), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (São Paulo). For more information, see the AWS Regional Services list.

With Redshift ML, you pay only for what you use. When training a new model, you pay for the Amazon SageMaker Autopilot and S3 resources used by Redshift ML. When making predictions, there is no additional cost for models imported into your Amazon Redshift cluster, as in the example I used in this post.

Redshift ML also allows you to use existing Amazon SageMaker endpoints for inference. In that case, the usual SageMaker pricing for real-time inference applies. Here you can find a few tips on how to control your costs with Redshift ML.

To learn more, you can see this blog post from when Redshift ML was announced in preview and the documentation.

Start getting better insights from your data with Redshift ML.

Danilo

Field Notes: Accelerate Research with Managed Jupyter on Amazon SageMaker

Post Syndicated from Mrudhula Balasubramanyan original https://aws.amazon.com/blogs/architecture/field-notes-accelerate-research-with-managed-jupyter-on-amazon-sagemaker/

Research organizations across industry verticals have unique needs. These include facilitating stakeholder collaboration, setting up compute environments for experimentation, handling large datasets, and more. In essence, researchers want the freedom to focus on their research, without the undifferentiated heavy-lifting of managing their environments.

In this blog, I show you how to set up a managed Jupyter environment using custom tools used in Life Sciences research. I show you how to transform the developed artifacts into scripted components that can be integrated into research workflows. Although this solution uses Life Sciences as an example, it is broadly applicable to any vertical that needs customizable managed environments at scale.

Overview of solution

This solution has two parts. First, the System administrator of an organization’s IT department sets up a managed environment and provides researchers access to it. Second, the researchers access the environment and conduct interactive and scripted analysis.

This solution uses AWS Single Sign-On (AWS SSO), Amazon SageMaker, Amazon ECR, and Amazon S3. These services are architected to build a custom environment, provision compute, conduct interactive analysis, and automate the launch of scripts.

Walkthrough

The architecture and detailed walkthrough are presented from both an admin and researcher perspective.

Architecture from an admin perspective

Architecture from admin perspective

 

In order of tasks, the admin:

  1. authenticates into AWS account as an AWS Identity and Access Management (IAM) user with admin privileges
  2. sets up AWS SSO and users who need access to Amazon SageMaker Studio
  3. creates a Studio domain
  4. assigns users and groups created in AWS SSO to the Studio domain
  5. creates a SageMaker notebook instance shown generically in the architecture as Amazon EC2
  6. launches a shell script provided later in this post to build and store custom Docker image in a private repository in Amazon ECR
  7. attaches the custom image to Studio domain that the researchers will later use as a custom Jupyter kernel inside Studio and as a container for the SageMaker processing job.

Architecture from a researcher perspective

Architecture from a researcher perspective

In order of tasks, the researcher:

  1. authenticates using AWS SSO
  2. SSO authenticates researcher to SageMaker Studio
  3. researcher performs interactive analysis using managed Jupyter notebooks with custom kernel, organizes the analysis into script(s), and launches a SageMaker processing job to execute the script in a managed environment
  4. the SageMaker processing job reads data from S3 bucket and writes data back to S3. The user can now retrieve and examine results from S3 using Jupyter notebook.

Prerequisites

For this walkthrough, you should have:

  • An AWS account
  • Admin access to provision and delete AWS resources
  • Researchers’ information to add as SSO users: full name and email

Set up AWS SSO

To facilitate collaboration between researchers, internal and external to your organization, the admin uses AWS SSO to onboard to Studio.

For admins: follow these instructions to set up AWS SSO prior to creating the Studio domain.

Onboard to SageMaker Studio

Researchers can use just the functionality they need in Amazon SageMaker Studio. Studio provides managed Jupyter environments with sharable notebooks for interactive analysis, and managed environments for script execution.

When you onboard to Studio, a home directory is created for you on Amazon Elastic File System (Amazon EFS) which provides reliable, scalable storage for large datasets.

Once AWS SSO has been set up, follow these steps to onboard to Studio via SSO. Note the Studio domain id (ex. d-2hxa6eb47hdc) and the IAM execution role (ex. AmazonSageMaker-ExecutionRole-20201156T214222) in the Studio Summary section of Studio. You will be using these in the following sections.

Provision custom image

At the core of research is experimentation. This often requires setting up playgrounds with custom tools to test out ideas. Docker images are an effective way to package those tools and dependencies and deploy them quickly. They also address another critical need for researchers – reproducibility.

To demonstrate this, I picked a Life Sciences research problem that requires custom Python packages to be installed and made available to a team of researchers as Jupyter kernels inside Studio.

For the custom Docker image, I picked a Python package called Pegasus. This is a tool used in genomics research for analyzing transcriptomes of millions of single cells, both interactively as well as in cloud-based analysis workflows.

In addition to Python, you can provision Jupyter kernels for languages such as R, Scala, and Julia in Studio using these Docker images.

Launch an Amazon SageMaker notebook instance

To build and push custom Docker images to ECR, you use an Amazon SageMaker notebook instance. Note that this is not part of SageMaker Studio and unrelated to Studio notebooks. It is a fully managed machine learning (ML) Amazon EC2 instance inside the SageMaker service that runs the Jupyter Notebook application, AWS CLI, and Docker.

  • Use these instructions to launch a SageMaker notebook instance.
  • Once the notebook instance is up and running, select the instance and navigate to the IAM role attached to it. This role comes with IAM policy ‘AmazonSageMakerFullAccess’ as a default. Your instance will need some additional permissions.
  • Create a new IAM policy using these instructions.
  • Copy the IAM policy below to paste into the JSON tab.
  • Fill in the values for <region-id> (ex. us-west-2), <AWS-account-id>, <studio-domain-id>, <studio-domain-iam-role>. Name the IAM policy ‘sagemaker-notebook-policy’ and attach it to the notebook instance role.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "additionalpermissions",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole",
                "sagemaker:UpdateDomain"
            ],
            "Resource": [
                "arn:aws:sagemaker:<region-id>:<AWS-account-id>:domain/<studio-domain-id>",
                "arn:aws:iam::<AWS-account-id>:role/<studio-domain-iam-role>"
            ]
        }
    ]
}
  • Start a terminal session in the notebook instance.
  • Once you are done creating the Docker image and attaching to Studio in the next section, you will be shutting down the notebook instance.

Create private repository, build, and store custom image, attach to SageMaker Studio domain

This section has multiple steps, all of which are outlined in a single bash script.

  • First the script creates a private repository in Amazon ECR.
  • Next, the script builds a custom image, tags, and pushes to Amazon ECR repository. This custom image will serve two purposes: one as a custom Python Jupyter kernel used inside Studio, and two as a custom container for SageMaker processing.
  • To use as a custom kernel inside SageMaker Studio, the script creates a SageMaker image and attaches to the Studio domain.
  • Before you initiate the script, fill in the following information: your AWS account ID, Region (ex. us-east-1), Studio IAM execution role, and Studio domain id.
  • You must create four files: bash script, Dockerfile, and two configuration files.
  • Copy the following bash script to a file named ‘pegasus-docker-images.sh’ and fill in the required values.
#!/bin/bash

# Pegasus python packages from Docker hub

accountid=<fill-in-account-id>

region=<fill-in-region>

executionrole=<fill-in-execution-role ex. AmazonSageMaker-ExecutionRole-xxxxx>

domainid=<fill-in-Studio-domain-id ex. d-xxxxxxx>

if aws ecr describe-repositories | grep 'sagemaker-custom'
then
    echo 'repo already exists! Skipping creation'
else
    aws ecr create-repository --repository-name sagemaker-custom
fi

aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $accountid.dkr.ecr.$region.amazonaws.com

docker build -t sagemaker-custom:pegasus-1.0 .

docker tag sagemaker-custom:pegasus-1.0 $accountid.dkr.ecr.$region.amazonaws.com/sagemaker-custom:pegasus-1.0

docker push $accountid.dkr.ecr.$region.amazonaws.com/sagemaker-custom:pegasus-1.0

if aws sagemaker list-images | grep 'pegasus-1'
then
    echo 'Image already exists! Skipping creation'
else
    aws sagemaker create-image --image-name pegasus-1 --role-arn arn:aws:iam::$accountid:role/service-role/$executionrole
    aws sagemaker create-image-version --image-name pegasus-1 --base-image $accountid.dkr.ecr.$region.amazonaws.com/sagemaker-custom:pegasus-1.0
fi

if aws sagemaker list-app-image-configs | grep 'pegasus-1-config'
then
    echo 'Image config already exists! Skipping creation'
else
   aws sagemaker create-app-image-config --cli-input-json file://app-image-config-input.json
fi

aws sagemaker update-domain --domain-id $domainid --cli-input-json file://default-user-settings.json

Copy the following to a file named ‘Dockerfile’.

FROM cumulusprod/pegasus-terra:1.0

USER root

Copy the following to a file named ‘app-image-config-input.json’.

{
    "AppImageConfigName": "pegasus-1-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
            {
                "Name": "python3",
                "DisplayName": "Pegasus 1.0"
            }
        ],
        "FileSystemConfig": {
            "MountPath": "/root",
            "DefaultUid": 0,
            "DefaultGid": 0
        }
    }
}

Copy the following to a file named ‘default-user-settings.json’.

{
    "DefaultUserSettings": {
        "KernelGatewayAppSettings": { 
           "CustomImages": [ 
              { 
                 "ImageName": "pegasus-1",
                 "ImageVersionNumber": 1,
                 "AppImageConfigName": "pegasus-1-config"
              }
           ]
        }
    }
}

Run ‘pegasus-docker-images.sh’ from the directory containing all four files, in the terminal of the notebook instance. If the script ran successfully, you should see the custom image attached to the Studio domain.

Amazon SageMaker dashboard

 

Perform interactive analysis

You can now launch the Pegasus Python kernel inside SageMaker Studio. If this is your first time using Studio, you can get a quick tour of its UI.

For interactive analysis, you can use publicly available notebooks in Pegasus tutorial from this GitHub repository. Review the license before proceeding.

To clone the repository in Studio, open a system terminal using these instructions and run: $ git clone https://github.com/klarman-cell-observatory/pegasus

  • In the directory ‘pegasus’, select ‘notebooks’ and open ‘pegasus_analysis.ipynb’.
  • For kernel choose ‘Pegasus 1.0 (pegasus-1/1)’.
  • You can now run through the notebook and examine the output generated. Feel free to work through the other notebooks for deeper analysis.

Pegasus tutorial

At any point during experimentation, you can share your analysis along with results with your colleagues using these steps. The snapshot that you create also captures the notebook configuration such as instance type and kernel, to ensure reproducibility.

Formalize analysis and execute scripts

Once you are done with interactive analysis, you can consolidate your analysis into a script to launch in a managed environment. This is an important step, if you want to later incorporate this script as a component into a research workflow and automate it.

Copy the following script to a file named ‘pegasus_script.py’.

"""
BSD 3-Clause License

Copyright (c) 2018, Broad Institute
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

"""

import pandas as pd
import pegasus as pg

if __name__ == "__main__":
    BASE_DIR = "/opt/ml/processing"
    data = pg.read_input(f"{BASE_DIR}/input/MantonBM_nonmix_subset.zarr.zip")
    pg.qc_metrics(data, percent_mito=10)
    df_qc = pg.get_filter_stats(data)
    pd.DataFrame(df_qc).to_csv(f"{BASE_DIR}/output/qc_metrics.csv", header=True, index=False)

The following Jupyter notebook provides an example of launching a SageMaker processing job that uses this script.

  • Create a notebook in SageMaker Studio in the same directory as the script.
  • Copy the following code to the notebook and name it ‘sagemaker_pegasus_processing.ipynb’.
  • Select ‘Python 3 (Data Science)’ as the kernel.
  • Launch the cells.
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

prefix = 'pegasus'

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-custom'
tag = ':pegasus-1.0'

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
processing_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)
print(processing_repository_uri)

script_processor = ScriptProcessor(command=['python3'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge')
!wget https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip

local_path = "MantonBM_nonmix_subset.zarr.zip"

s3 = boto3.resource("s3")

base_uri = f"s3://{bucket}/{prefix}"
input_data_uri = sagemaker.s3.S3Uploader.upload(
    local_path=local_path, 
    desired_s3_uri=base_uri,
)
print(input_data_uri)

code_uri = sagemaker.s3.S3Uploader.upload(
    local_path="pegasus_script.py", 
    desired_s3_uri=base_uri,
)
print(code_uri)

script_processor.run(code=code_uri,
                      inputs=[ProcessingInput(source=input_data_uri, destination='/opt/ml/processing/input'),],
                      outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination=f"{base_uri}/output")]
                     )
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)

output_path = f"{base_uri}/output"
print(output_path)

The ‘output_path’ is the S3 prefix where you will find the results from SageMaker processing. This will be printed as the last line after execution. You can examine the results either directly in S3 or by copying the results back to your home directory in Studio.
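If you prefer to examine the results inside Studio, the following cell is a minimal sketch that copies the processing output from Amazon S3 back to your home directory and loads the QC metrics; it assumes the output_path variable from the previous cell and a local directory name chosen for illustration.

# Copy the processing job output from S3 into the Studio home directory.
# 'output_path' comes from the previous cell.
from sagemaker.s3 import S3Downloader

S3Downloader.download(s3_uri=output_path, local_path="./pegasus_output")

import pandas as pd
qc = pd.read_csv("./pegasus_output/qc_metrics.csv")
qc.head()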

Cleaning up

To avoid incurring future charges, shut down the SageMaker notebook instance. Detach the image from the Studio domain, delete the image in Amazon ECR, and delete the data in Amazon S3.

Conclusion

In this blog, I showed you how to set up and use a unified research environment using Amazon SageMaker. Although the example pertained to Life Sciences, the architecture and the framework presented are generally applicable to any research space. They strive to address the broader research challenges of custom tooling, reproducibility, large datasets, and price predictability.

As a logical next step, take the scripted components and incorporate them into research workflows and automate them. You can use SageMaker Pipelines to incorporate machine learning into your workflows and operationalize them.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Mitigate data leakage through the use of AppStream 2.0 and end-to-end auditing

Post Syndicated from Chaim Landau original https://aws.amazon.com/blogs/security/mitigate-data-leakage-through-the-use-of-appstream-2-0-and-end-to-end-auditing/

Customers want to use AWS services to operate on their most sensitive data, but they want to make sure that only the right people have access to that data. Even when the right people are accessing data, customers want to account for what actions those users took while accessing the data.

In this post, we show you how you can use Amazon AppStream 2.0 to grant isolated access to sensitive data and decrease your attack surface. In addition, we show you how to achieve end-to-end auditing, which is designed to provide full traceability of all activities around your data.

To demonstrate this idea, we built a sample solution that provides a data scientist with access to an Amazon SageMaker Studio notebook using AppStream 2.0. The solution deploys a new Amazon Virtual Private Cloud (Amazon VPC) with isolated subnets, where the SageMaker notebook and AppStream 2.0 instances are set up.

Why AppStream 2.0?

AppStream 2.0 is a fully-managed, non-persistent application and desktop streaming service that provides access to desktop applications from anywhere by using an HTML5-compatible desktop browser.

Each time you launch an AppStream 2.0 session, a freshly-built, pre-provisioned instance is provided, using a prebuilt image. As soon as you close your session and the disconnect timeout period is reached, the instance is terminated. This allows you to carefully control the user experience and helps to ensure a consistent, secure environment each time. AppStream 2.0 also lets you enforce restrictions on user sessions, such as disabling the clipboard, file transfers, or printing.

Furthermore, AppStream 2.0 uses AWS Identity and Access Management (IAM) roles to grant fine-grained access to other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon SageMaker, and other AWS services. This gives you both control over the access as well as an accounting, via Amazon CloudTrail, of what actions were taken and when.

These features make AppStream 2.0 uniquely suitable for environments that require high security and isolation.

Why SageMaker?

Developers and data scientists use SageMaker to build, train, and deploy machine learning models quickly. SageMaker does most of the work of each step of the machine learning process to help users develop high-quality models. SageMaker access from within AppStream 2.0 provides your data scientists and analysts with a suite of common and familiar data-science packages to use against isolated data.

Solution architecture overview

This solution allows a data scientist to work with a data set while connected to an isolated environment that doesn’t have an outbound path to the internet.

First, you build an Amazon VPC with isolated subnets and with no internet gateways attached. This ensures that any instances stood up in the environment don’t have access to the internet. To provide the resources inside the isolated subnets with a path to commercial AWS services such as Amazon S3, SageMaker, and AWS Systems Manager, you build VPC endpoints and attach them to the VPC, as shown in Figure 1.


Figure 1: Network Diagram

You then build an AppStream 2.0 stack and fleet, and attach a security group and IAM role to the fleet. The purpose of the IAM role is to provide the AppStream 2.0 instances with access to downstream AWS services such as Amazon S3 and SageMaker. The IAM role design follows the least privilege model, to ensure that only the access required for each task is granted.

During the building of the stack, you will enable AppStream 2.0 Home Folders. This feature builds an S3 bucket where users can store files from inside their AppStream 2.0 session. The bucket is designed with a dedicated prefix for each user, where only they have access. We use this prefix to store the user’s pre-signed SageMaker URLs, ensuring that no user can access another user’s SageMaker notebook.

You then deploy a SageMaker notebook for the data scientist to use to access and analyze the isolated data.

To confirm that the user ID on the AppStream 2.0 session hasn’t been spoofed, you create an AWS Lambda function that compares the user ID of the data scientist against the AppStream 2.0 session ID. If the user ID and session ID match, this indicates that the user ID hasn’t been impersonated.

Once the session has been validated, the Lambda function generates a pre-signed SageMaker URL that gives the data scientist access to the notebook.
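The following is a minimal Python sketch of what such a Lambda function could look like, not the exact code deployed by the AWS CDK application in this post. The stack, fleet, and notebook names are hypothetical placeholders, and the function assumes it is triggered by the S3 PUT of session.json described later in the process flow.

import json
import boto3

appstream = boto3.client("appstream")
sm = boto3.client("sagemaker")
s3 = boto3.client("s3")

# Hypothetical resource names; the CDK stacks create their own.
STACK_NAME = "data-sandbox-stack"
FLEET_NAME = "data-sandbox-fleet"
NOTEBOOK_NAME = "data-sandbox-notebook"

def handler(event, context):
    # Triggered by the PUT of session.json into the user's home-folder prefix.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    session_info = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    user_name = session_info["AppStream_UserName"]
    session_id = session_info["AppStream_Session_ID"]

    # Ask AppStream 2.0 which sessions exist for this user and verify that the
    # session ID reported by the instance really belongs to that user.
    sessions = appstream.describe_sessions(
        StackName=STACK_NAME, FleetName=FLEET_NAME, UserId=user_name
    )["Sessions"]
    if not any(s["Id"] == session_id for s in sessions):
        # Possible spoofing attempt: do not generate a URL.
        return {"status": "session mismatch"}

    # Session is valid: generate a pre-signed notebook URL and write it back to
    # the user's home-folder prefix for the PowerShell script to pick up.
    url = sm.create_presigned_notebook_instance_url(
        NotebookInstanceName=NOTEBOOK_NAME,
        SessionExpirationDurationInSeconds=1800,
    )["AuthorizedUrl"]
    s3.put_object(
        Bucket=bucket,
        Key=key.rsplit("/", 1)[0] + "/session_url.txt",
        Body=url.encode("utf-8"),
    )
    return {"status": "url generated"}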

Finally, you enable AppStream 2.0 usage reports to ensure that you have end-to-end auditing of your environment.

To help you easily deploy this solution into your environment, we’ve built an AWS Cloud Development Kit (AWS CDK) application and stacks, using Python. To deploy this solution, you can go to the Solution deployment section in this blog post.

Note: this solution was built with all resources in a single AWS Region. Multi-Region support is possible but isn’t part of this blog post.

Solution requirements

Before you build a solution, you must know your security requirements. The solution in this post assumes a set of standard security requirements that you typically find in an enterprise environment:

  • User authentication is provided by a Security Assertion Markup Language (SAML) identity provider (IdP).
  • IAM roles are used to access AWS services such as Amazon S3 and SageMaker.
  • AWS IAM access keys and secret keys are prohibited.
  • IAM policies follow the least privilege model so that only the required access is granted.
  • Windows clipboard, file transfer, and printing to local devices is prohibited.
  • Auditing and traceability of all activities is required.

Note: before you can integrate SAML with AppStream 2.0, you need to follow the AppStream 2.0 Integration with SAML 2.0 guide. There are quite a few steps and it will take some time to set up. SAML authentication is optional, however. If you just want to prototype the solution and see how it works, you can do that without enabling SAML integration.

Solution components

This solution uses the following technologies:

  • Amazon VPC – provides an isolated network where the solution will be deployed.
  • VPC endpoints – provide access from the isolated network to commercial AWS services such as Amazon S3 and SageMaker.
  • AWS Systems Manager – stores parameters such as S3 bucket names.
  • AppStream 2.0 – provides hardened instances to run the solution on.
  • AppStream 2.0 home folders – store users’ session information.
  • Amazon S3 – stores application scripts and pre-signed SageMaker URLs.
  • SageMaker notebook – provides data scientists with tools to access the data.
  • AWS Lambda – runs scripts to validate the data scientist’s session, and generates pre-signed URLs for the SageMaker notebook.
  • AWS CDK – deploys the solution.
  • PowerShell – processes scripts on AppStream 2.0 Microsoft Windows instances.

Solution high-level design and process flow

The following figure is a high-level depiction of the solution and its process flow.


Figure 2: Solution process flow

The process flow—illustrated in Figure 2—is:

  1. A data scientist clicks on an AppStream 2.0 federated or a streaming URL.
    1. If it’s a federated URL, the data scientist authenticates using their corporate credentials, as well as MFA if required.
    2. If it’s a streaming URL, no further authentication is required.
  2. The data scientist is presented with a PowerShell application that’s been made available to them.
  3. After starting the application, it starts the PowerShell script on an AppStream 2.0 instance.
  4. The script then:
    1. Downloads a second PowerShell script from an S3 bucket.
    2. Collects local AppStream 2.0 environment variables:
      1. AppStream_UserName
      2. AppStream_Session_ID
      3. AppStream_Resource_Name
    3. Stores the variables in the session.json file and copies the file to the home folder of the session on Amazon S3.
  5. The PUT event of the JSON file into the Amazon S3 bucket triggers an AWS Lambda function that performs the following:
    1. Reads the session.json file from the user’s home folder on Amazon S3.
    2. Performs a describe action against the AppStream 2.0 API to ensure that the session ID and the user ID match. This helps to prevent the user from manipulating the local environment variable to pretend to be someone else (spoofing), and potentially gain access to unauthorized data.
    3. If the session ID and user ID match, a pre-signed SageMaker URL is generated and stored in session_url.txt, and copied to the user’s home folder on Amazon S3.
    4. If the session ID and user ID do not match, the Lambda function ends without generating a pre-signed URL.
  6. When the PowerShell script detects the session_url.txt file, it opens the URL, giving the user access to their SageMaker notebook.

Code structure

To help you deploy this solution in your environment, we’ve built a set of code that you can use. The code consists of an AWS CDK application written in Python, plus some PowerShell scripts.

Note: We have chosen the default settings on many of the AWS resources our code deploys. Before deploying the code, you should conduct a thorough code review to ensure the resources you are deploying meet your organization’s requirements.

AWS CDK application – ./app.py

To make this application modular and portable, we’ve structured it in separate AWS CDK nested stacks; a minimal sketch of how they fit together follows this list:

  • vpc-stack – deploys a VPC with two isolated subnets, along with three VPC endpoints.
  • s3-stack – deploys an S3 bucket, copies the AppStream 2.0 PowerShell scripts, and stores the bucket name in an SSM parameter.
  • appstream-service-roles-stack – deploys AppStream 2.0 service roles.
  • appstream-stack – deploys the AppStream 2.0 stack and fleet, along with the required IAM roles and security groups.
  • appstream-start-fleet-stack – builds a custom resource that starts the AppStream 2.0 fleet.
  • notebook-stack – deploys a SageMaker notebook, along with IAM roles, security groups, and an AWS Key Management Service (AWS KMS) encryption key.
  • saml-stack – deploys a SAML role as a placeholder for SAML authentication.
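
The following is a minimal, illustrative sketch (AWS CDK v1 for Python) of how app.py might compose these nested stacks. The class names, construct IDs, and VPC settings are assumptions made for illustration; the actual repository defines far more resources in each stack.

from aws_cdk import core
from aws_cdk import aws_ec2 as ec2


class VpcStack(core.NestedStack):
    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Isolated subnets only: AWS services are reached through VPC endpoints
        self.vpc = ec2.Vpc(
            self, "DataSandboxVpc",
            subnet_configuration=[
                ec2.SubnetConfiguration(name="Isolated", subnet_type=ec2.SubnetType.ISOLATED)
            ],
        )


class DataSandbox(core.Stack):
    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        vpc_stack = VpcStack(self, "vpc-stack")
        # s3-stack, appstream-stack, notebook-stack, and so on would follow here,
        # each receiving the resources it depends on (for example, vpc_stack.vpc).


app = core.App()
DataSandbox(app, "DataSandbox")
app.synth()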

PowerShell scripts

The solution uses the following PowerShell scripts inside the AppStream 2.0 instances:

  • sagemaker-notebook-launcher.ps1 – This script is part of the AppStream 2.0 image and downloads the sagemaker-notebook.ps1 script.
  • sagemaker-notebook.ps1 – starts the process of validating the session and generating the SageMaker pre-signed URL.

Note: Having the second script reside on Amazon S3 provides flexibility. You can modify this script without having to create a new AppStream 2.0 image.

Deployment Prerequisites

To deploy this solution, your deployment environment must meet the following prerequisites:

Note: We used AWS Cloud9 with Amazon Linux 2 to test this solution, as it comes preinstalled with most of the prerequisites for deploying this solution.

Deploy the solution

Now that you know the design and components, you’re ready to deploy the solution.

Note: In our demo solution, we deploy two stream.standard.small AppStream 2.0 instances, using Windows Server 2019. This gives you a reasonable example to work from. In your own environment you might need more instances, a different instance type, or a different version of Windows. Likewise, we deploy a single SageMaker notebook instance of type ml.t3.medium. To change the AppStream 2.0 and SageMaker instance types, you will need to modify the stacks/data_sandbox_appstream.py and stacks/data_sandbox_notebook.py respectively.

Step 1: AppStream 2.0 image

An AppStream 2.0 image contains applications that you can stream to your users. It’s what allows you to curate the user experience by preconfiguring the settings of the applications you stream to your users.

To build an AppStream 2.0 image:

  1. Build an image following the Create a Custom AppStream 2.0 Image by Using the AppStream 2.0 Console tutorial.

    Note: In Step 1: Install Applications on the Image Builder in this tutorial, you will be asked to choose an Instance family. For this example, we chose General Purpose. If you choose a different Instance family, you will need to make sure the appstream_instance_type specified under Step 2: Code modification is of the same family.

    In Step 6: Finish Creating Your Image in this tutorial, you will be asked to provide a unique image name. Note down the image name as you will need it in Step 2 of this blog post.

  2. Copy notebook-launcher.ps1 to a location on the image. We recommend that you copy it to C:\AppStream.
  3. In Step 2—Create an AppStream 2.0 Application Catalog—of the tutorial, use C:\Windows\System32\Windowspowershell\v1.0\powershell.exe as the application, and the path to notebook-launcher.ps1 as the launch parameter.

Note: While testing your application during the image building process, the PowerShell script will fail because the underlying infrastructure is not present. You can ignore that failure during the image building process.

Step 2: Code modification

Next, you must modify some of the code to fit your environment.

Make the following changes in the cdk.json file (an illustrative example follows this list):

  • vpc_cidr – Supply your preferred CIDR range to be used for the VPC.

    Note: VPC CIDR ranges are your private IP space and thus can consist of any valid RFC 1918 range. However, if the VPC you are planning on using for AppStream 2.0 needs to connect to other parts of your private network (on premise or other VPCs), you need to choose a range that does not conflict or overlap with the rest of your infrastructure.

  • appstream_Image_name – Enter the image name you chose when you built the AppStream 2.0 image in Step 1.a.
  • appstream_environment_name – The environment name is strictly cosmetic and drives the naming of your AppStream 2.0 stack and fleet.
  • appstream_instance_type – Enter the AppStream 2.0 instance type. The instance type must be part of the same instance family you used in Step 1 of the To build an AppStream 2.0 image section. For a list of AppStream 2.0 instances, visit https://aws.amazon.com/appstream2/pricing/.
  • appstream_fleet_type – Enter the fleet type. Allowed values are ALWAYS_ON or ON_DEMAND.
  • Idp_name – If you have integrated SAML with this solution, you will need to enter the IdP name you chose when creating the SAML provider in the IAM Console.
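
For reference, a cdk.json using these keys might look like the following. Every value shown is a placeholder to replace with your own, and the exact file layout (for example, whether the keys live under a context block) depends on the repository you deploy from:

{
  "app": "python3 app.py",
  "context": {
    "vpc_cidr": "10.10.0.0/16",
    "appstream_Image_name": "data-sandbox-image",
    "appstream_environment_name": "data-sandbox",
    "appstream_instance_type": "stream.standard.small",
    "appstream_fleet_type": "ON_DEMAND",
    "Idp_name": "my-saml-idp"
  }
}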

Step 3: Deploy the AWS CDK application

The CDK application deploys the CDK stacks.

The stacks include:

  • VPC with isolated subnets
  • VPC Endpoints for S3, SageMaker, and Systems Manager
  • S3 bucket
  • AppStream 2.0 stack and fleet
  • Two AppStream 2.0 stream.standard.small instances
  • A single SageMaker ml.t2.medium notebook

Run the following commands to deploy the AWS CDK application:

  1. Install the AWS CDK Toolkit.
    npm install -g aws-cdk
    

  2. Create and activate a virtual environment.
    python -m venv .datasandbox-env
    
    source .datasandbox-env/bin/activate
    

  3. Change directory to the root folder of the code repository.
  4. Install the required packages.
    pip install -r requirements.txt
    

  5. If you haven’t used AWS CDK in your account yet, run:
    cdk bootstrap
    

  6. Deploy the AWS CDK stack.
    cdk deploy DataSandbox
    

Step 4: Test the solution

After the stack has successfully deployed, allow approximately 25 minutes for the AppStream 2.0 fleet to reach a running state. Testing will fail if the fleet isn’t running.

Without SAML

If you haven’t added SAML authentication, use the following steps to test the solution.

  1. In the AWS Management Console, go to AppStream 2.0 and then to Stacks.
  2. Select the stack, and then select Action.
  3. Select Create streaming URL.
  4. Enter any user name and select Get URL.
  5. Enter the URL in another tab of your browser and test your application.

With SAML

If you are using SAML authentication, you will have a federated login URL that you need to visit.

If everything is working, your SageMaker notebook will be launched as shown in Figure 3.

Figure 3: SageMaker Notebook

Figure 3: SageMaker Notebook

Note: if you receive a web browser timeout, verify that the SageMaker notebook instance “Data-Sandbox-Notebook” is currently in InService status.

Auditing

Auditing for this solution is provided through AWS CloudTrail and AppStream 2.0 Usage Reports. Though CloudTrail is enabled by default, to collect and store the CloudTrail logs, you must create a trail for your AWS account.

The following logs are available for you to use to provide auditing.

Connecting the dots

To get an accurate idea of your users’ activity, you have to correlate some logs from different services. First, you collect the login information from CloudTrail. This gives you the user ID of the user who logged in. You then collect the Amazon S3 put from CloudTrail, which gives you the IP address of the AppStream 2.0 instance. And finally, you collect the AppStream 2.0 usage report which gives you the IP address of the AppStream 2.0 instance, plus the user ID. This allows you to connect the user ID to the activity on Amazon S3. For auditing & controlling exploration activities with SageMaker, please visit this GitHub repository.

Though the logs are automatically being collected, what we have shown you here is a manual way of sifting through those logs. For a more robust solution on querying and analyzing CloudTrail logs, visit Querying AWS CloudTrail Logs.

Costs of this Solution

The cost for running this solution will depend on a number of factors like the instance size, the amount of data you store, and how many hours you use the solution. AppStream 2.0 is charged per instance hour and there is one instance in this example solution. You can see details on the AppStream 2.0 pricing page. VPC endpoints are charged by the hour and by how much data passes through them. There are three VPC endpoints in this solution (S3, System Manager, and SageMaker). VPC endpoint pricing is described on the Privatelink pricing page. SageMaker Notebooks are charged based on the number of instance hours and the instance type. There is one SageMaker instance in this solution, which may be eligible for free tier pricing. See the SageMaker pricing page for more details. Amazon S3 storage pricing depends on how much data you store, what kind of storage you use, and how much data transfers in and out of S3. The use in this solution may be eligible for free tier pricing. You can see details on the S3 pricing page.

Before deploying this solution, make sure to calculate your cost using the AWS Pricing Calculator, and the AppStream 2.0 pricing calculator.

Conclusion

Congratulations! You have deployed a solution that provides your users with access to sensitive and isolated data in a secure manner using AppStream 2.0. You have also implemented a mechanism that is designed to prevent user impersonation, and enabled end-to-end auditing of all user activities.

To learn about how Amazon is using AppStream 2.0, visit the blog post How Amazon uses AppStream 2.0 to provide data scientists and analysts with access to sensitive data.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Chaim Landau

As a Senior Cloud Architect at AWS, Chaim works with large enterprise customers, helping them create innovative solutions to address their cloud challenges. Chaim is passionate about his work, enjoys the creativity that goes into building solutions in the cloud, and derives pleasure from passing on his knowledge. In his spare time, he enjoys outdoor activities, spending time in nature, and immersing himself in his books.

Author

JD Braun

As a Data and Machine Learning Engineer, JD helps organizations design and implement modern data architectures to deliver value to their internal and external customers. In his free time, he enjoys exploring Minneapolis with his fiancée and black lab.

Field Notes: Speed Up Redaction of Connected Car Data by Multiprocessing Video Footage with Amazon Rekognition

Post Syndicated from Sandeep Kulkarni original https://aws.amazon.com/blogs/architecture/field-notes-speed-up-redaction-of-connected-car-data-by-multiprocessing-video-footage-with-amazon-rekognition/

In the blog post Redacting Personal Data from Connected Cars Using Amazon Rekognition, we demonstrated how you can redact personal data such as human faces using Amazon Rekognition. Traversing the video frame by frame and identifying personal information in each frame takes time. This solution is great for small video clips, where you do not need a near real-time response. However, in some use cases, such as object detection or real-time traffic monitoring, you may need to process this information in near real time and keep up with the input video stream.

In this blog post, we introduce how to leverage “multiprocessing” to speed up the redaction process and provide a response in near real time. We also compare the process run times using a variety of Amazon SageMaker instances to give users various options to process video using Amazon Rekognition.

For example, the ml.c5.4xlarge instance has 16 vCPUs, so we could theoretically have 16 processes, working in parallel, to process the video stream, which will significantly reduce the processing time. Our test against the sample video shows that we reduce the process run time by a factor of 11x, using the ml.c5.4xlarge instance.
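
As a rough illustration of that figure, the arithmetic below uses the measured run times reported later in this post; the gap to the theoretical 16x is expected, since some of the work (for example, writing each frame to disk and the final video concatenation) does not parallelize.

serial_runtime_s = 168.0    # single-process run time reported later in this post
parallel_runtime_s = 15.7   # multiprocessing run time on the same ml.c5.4xlarge (16 vCPUs)

speedup = serial_runtime_s / parallel_runtime_s
print(f"Observed speedup: {speedup:.1f}x")   # ~10.7x, in other words roughly 11x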

Architecture Overview

Video Redaction - Multiprocessing

Walkthrough: 6 Steps

1. We will assume that the video data from the car was ingested and is stored in a “Raw” Amazon S3 bucket. (For real time analytics, video data will likely be ingested from the connected vehicles into an Amazon Kinesis Video Stream)

2.  In this architecture we will use an Amazon SageMaker notebook instance, which is a machine learning (ML) compute instance running the Jupyter Notebook App.

3. Additionally, an AWS Identity and Access Management (IAM) role with appropriate permissions provides the temporary security credentials required for this program.

4. The individual frames are analyzed by calling the “DetectFaces” Amazon Rekognition API, which analyzes and provides metadata about the frame. If a face is detected in the frame, then Amazon Rekognition returns a bounding box per face.

5.  We write a function multi_process_video to blur the detected faces in each frame and distribute the processing job equally among all available CPUs in the SageMaker instance.

6. We run the multi_process_video function on the input video and write the output video to an S3 bucket for further analysis.

Detailed Steps

For the six steps mentioned previously, we provide the input video, code samples, and the corresponding output video.

Step 1: Log in to the AWS console with your user credentials.

  • Upload the sample video to your S3 bucket.
    Name it face1.mp4. We’ve included the following example of the video input.

Step 2: In this step, we create a SageMaker notebook instance.

Notebook instance:

  • Notebook instance name: VideoRedaction
  • Notebook instance class: choose “ml.t3.large” from the drop-down menu
  • Elastic inference: None

Permissions:

  • IAM role: Select Create a new role from the drop-down menu. This opens a new screen; choose Next, and the new role is created. The role name will start with AmazonSageMaker-ExecutionRole-xxxxxxxx.
  • Root access: Select Enable
  • Assume defaults for the rest, and select the orange “Create notebook instance” button at the bottom.

This will take you to the next screen, which shows that your notebook instance is being created. It will take a few minutes, and you can monitor the status, which will show a green “InService” state when the notebook is ready.

Step 3:  Next, we need to provide additional permissions to the new role that you created in Step 2.

  • Select the VideoRedaction notebook.
    This opens a new screen. Scroll down to the third section, “Permissions and encryption”, and click the IAM role ARN link.

This will open a screen where you can attach additional policies. It will already be populated with “AmazonSageMakerFullAccess”.

  • Select the blue Attach policies button.
  • This will open a new screen, which allows you to add permissions to your execution role.
    • Under “Filter policies”, search for S3Full and check the box next to AmazonS3FullAccess.
    • Under “Filter policies”, search for Rekognition. Check the boxes next to AmazonRekognitionFullAccess and AmazonRekognitionServiceRole.
    • Click the blue Attach policies button at the bottom. This will populate a screen showing the five policies attached, as follows:

Permissions policies

  • Click the Add inline policy link on the right and then click the JSON tab on the next screen. Paste the following policy, replacing <accountnumber> with your AWS account number:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MySid",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<accountnumber>:role/serviceRekognition"
        }
    ]
}

On the next screen, enter VideoInlinePolicy for the name and select the blue Create policy button at the bottom.

Permissions Policies - 6 Policies Applied

Step 3a:  Navigate to SageMaker in the console:

  • Select “Notebook instances” in the menu on left. This will show your VideoRedaction notebook.
  • Select the blue Open Jupyter link under Actions. This will open a new tab titled Jupyter.

Step 3b: In the upper right corner, click the drop-down arrow next to “New” and choose conda_tensorflow_p36 as the kernel for your notebook.

Your screen will look as follows:

Jupyter

Install ffmpeg

First, we need to install ffmpeg for multiprocessing video. It’s a free and open-source software project consisting of a large suite of libraries and programs for handling video, audio, and other multimedia files and streams. We use it to concatenate all the subset videos processed by each vCPU and generate the final output.

Install ffmpeg using the following command:

!conda install x264=='1!152.20180717' ffmpeg=4.0.2 -c conda-forge --yes  

Import libraries – We import additional libraries to help with multi-processing capability.

import os
import io
import sys
import time
import subprocess as sp
import multiprocessing as mp
from os import remove
from os.path import isfile, join

import boto3
import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ExifTags, ImageColor

Step 4: Identify personal data (faces) in the individual frames

Amazon Rekognition DetectFaces detects the 100 largest faces in the image. For each face detected, the operation returns face details. These details include a bounding box of the face, a confidence value (that the bounding box contains a face), and a fixed set of attributes such as facial landmarks (for example, coordinates of eye and mouth), presence of beard, sunglasses, and so on.

You pass the input image either as base64-encoded image bytes or as a reference to an image in an Amazon S3 bucket. In this code, we pass the image as jpg to Amazon Rekognition since we want to see each frame of this video. We also show how you can expand the bounding boxes returned by Amazon Rekognition, if required, to blur an enlarged portion of the face.

	def detect_blur_face_local_file(photo, blurriness):
	    """Detect faces in a local image and return the image with each face blurred."""
	    client = boto3.client('rekognition')
	
	    # Call DetectFaces with the raw image bytes
	    with open(photo, 'rb') as image_file:
	        response = client.detect_faces(Image={'Bytes': image_file.read()})
	
	    image = Image.open(photo)
	    imgWidth, imgHeight = image.size
	
	    # Blur an enlarged bounding box for each detected face
	    for faceDetail in response['FaceDetails']:
	        box = faceDetail['BoundingBox']
	        left = imgWidth * box['Left']
	        top = imgHeight * box['Top']
	        width = imgWidth * box['Width']
	        height = imgHeight * box['Height']
	
	        # Enlarge the bounding box by 10% on each side;
	        # you can also keep the original bounding box
	        x1 = left - 0.1 * width
	        y1 = top - 0.1 * height
	        x2 = left + width + 0.1 * width
	        y2 = top + height + 0.1 * height
	
	        # Build a mask over the enlarged box and paste a blurred copy through it
	        mask = Image.new('L', image.size, 0)
	        draw = ImageDraw.Draw(mask)
	        draw.rectangle([(x1, y1), (x2, y2)], fill=255)
	        blurred = image.filter(ImageFilter.GaussianBlur(blurriness))
	        image.paste(blurred, mask=mask)
	
	    return image
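
Before wiring this function into the multiprocessing pipeline, you can sanity-check it on a single extracted frame. The file name frame.jpg below is only a placeholder for any local JPEG you want to test with:

# Blur faces in one local frame and save the result for inspection
blurred = detect_blur_face_local_file('frame.jpg', blurriness=20)
blurred.save('frame_blurred.jpg')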

Step 5: Redact the face bounding box and distribute the processing among all CPUs

By passing a group_number to the multi_process_video function, you can distribute the video processing job equally among all available CPUs of the instance and therefore greatly reduce the process time.

	def multi_process_video(group_number):  
	    cap = cv2.VideoCapture(input_file)  
	    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_jump_unit * group_number)  
	    proc_frames = 0  
	    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))  
	    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))  
	    fps = cap.get(cv2.CAP_PROP_FPS)  
	    out = cv2.VideoWriter(  
	        "{}.{}".format(group_number, 'mp4'),  
	        cv2.VideoWriter_fourcc(*'MP4V'),  
	        fps,  
	        (width, height),  
	    )  
	  
	    while proc_frames < frame_jump_unit:  
	        ret, frame = cap.read()  
	        if ret == False:  
	            break  
	          
	        f=str(group_number)+'_'+str(proc_frames)+'.jpg'  
	        cv2.imwrite(f,frame)  
	        #Define the blurriness  
	        blurriness=20  
	        blurred_img=detect_blur_face_local_file(f,blurriness)  
	        blurred_frame=cv2.cvtColor(np.array(blurred_img), cv2.COLOR_BGR2RGB)    
	          
	        out.write(blurred_frame)  
	        proc_frames += 1  
	    else:  
	        print('Group '+str(group_number)+' finished processing!')  
	          
	    cap.release()
	    out.release()  
	    return None  

Step 6: Run multi-processing video function and write the redacted video to the output bucket

  • Then we multiprocess the video and generate the output using the multiprocessing module and ffmpeg in Python.
  • We record the subset video produced by each CPU, in the format ‘0.mp4’, ‘1.mp4’, and so on, in a file called multiproc_files.txt, and then use subprocess to call ffmpeg to concatenate these videos in the order listed in that file.
  • After the final video is generated, we remove all the intermediate results and upload the face-blurred result to an S3 bucket.
	start_time = time.time()  
	# Connect to S3  
	s3_client = boto3.client('s3')  
	      
	# Download S3 video to local. Enter your bucketname and file name below
	bucket='yourbucketname'  
	file='face1.mp4'    
	s3_client.download_file(bucket, file, './'+file)  
	      
	input_file='face1.mp4'    
	num_processes = mp.cpu_count()  
	cap = cv2.VideoCapture(input_file)  
	frame_jump_unit = cap.get(cv2.CAP_PROP_FRAME_COUNT) // num_processes  
	  
	# Multiprocessing video across all vCPUs    
	p = mp.Pool(num_processes)  
	p.map(multi_process_video, range(num_processes))  
	  
	# Generate multiproc_files to record the subset videos in the right order    
	multiproc_files = ["{}.{}".format(i, 'mp4') for i in range(num_processes)]  
	with open("multiproc_files.txt", "w") as f:  
	    for t in multiproc_files:  
	        f.write("file {} \n".format(t))  
	  
	# Use ffmpeg to concatenate all the subset videos according to multiproc_files   
	local_filename='blurface_multiproc_827.mp4'  
	  
	ffmpeg_command="ffmpeg -f concat -safe 0 -i multiproc_files.txt -c copy "  
	ffmpeg_command += local_filename  
	  
	cmd = sp.Popen(ffmpeg_command, stdout=sp.PIPE, stderr=sp.PIPE, shell=True)  
	cmd.communicate()  
	  
	# Remove all the intermediate results    
	for f in multiproc_files:  
	    remove(f)  
	remove("multiproc_files.txt")  
	  
	mydir=os.getcwd()  
	filelist = [ f for f in os.listdir(mydir) if f.endswith(".jpg") ]  
	for f in filelist:  
	    os.remove(os.path.join(mydir, f))  
	  
	# Upload face-blurred video to s3  
	s3_filename='blurface_multiproc_827.mp4'  
	response = s3_client.upload_file(local_filename, bucket, s3_filename)   
	  
	finish_time = time.time()  
	print( "Total Process Time:",finish_time-start_time,'s')  

Output:

Group 13 finished processing!
Group 15 finished processing!
Group 14 finished processing!
Group 12 finished processing!
Group 11 finished processing!
Group 9 finished processing!
Group 10 finished processing!
Group 1 finished processing!
Group 3 finished processing!
Group 4 finished processing!
Group 8 finished processing!
Group 5 finished processing!
Group 2 finished processing!
Group 7 finished processing!
Group 6 finished processing!
Group 0 finished processing!

Total Process Time: 15.709482431411743 s

Using the same ml.c5.4xlarge instance, we reduce the process time from 168 s to 15.7 s. As mentioned, the ml.c5.4xlarge has 16 vCPUs, and you can reduce the process time even further with an instance that has 32 or 64 vCPUs.

Note: Choosing the right instance will depend on your requirements for process time and cost. As this result demonstrates, multiprocessing video using Amazon Rekognition is an efficient way to combine Amazon Rekognition’s state-of-the-art ML models with powerful multi-core Amazon SageMaker instances.

Comparison of Amazon SageMaker Instances in Terms of Process Time and Cost

Here is the comparison table generated when processing a 6.5-second video with multiple faces on different SageMaker instances. The following is a video screenshot:

Video screenshot with faces of 5 people blurred

Based on the following table, instances with 16 vCPUs (4xlarge) are the better option in terms of faster processing capability while remaining optimized for cost.

Table with SageMaker Instance Types

Depending on the size of your input video file and the requirements for real-time processing, you can break the input video file into smaller chunks and then scale instances to process those chunks in parallel. While this example is focused on blurring faces, you can also use Amazon Rekognition for other use cases, such as detecting someone wielding a gun, smoking a cigarette, or suggestive content. These and many other moderation activities are all supported by the Amazon Rekognition content moderation APIs.

Conclusion

In this blog post, we showed how you can leverage multiple cores in large machine learning instances, along with Amazon Rekognition. Doing this can significantly speed up the process of redacting personally identifiable information from videos collected by connected vehicles. The ability to provide near-real-time information unlocks additional value from the video that is ingested. For example, in smart cities, information is collected about the environment, such as road traffic and weather. This data can be visualized in near-real-time to help city management make decisions that can optimize traffic and improve residents’ quality of life.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Applying Machine Learning to Vegetation Management using Amazon SageMaker

Post Syndicated from Sameer Goel original https://aws.amazon.com/blogs/architecture/field-notes-applying-machine-learning-to-vegetation-management-using-amazon-sagemaker/

This post was co-written by Soheil Moosavi, a data scientist consultant in Accenture Applied Intelligence (AAI) team, and Louis Lim, a manager in Accenture AWS Business Group. 

Virtually every electric customer in the US and Canada has, at one time or another, experienced a sustained electric outage as a direct result of a tree and power line contact. According to the Federal Energy Regulatory Commission (FERC.gov), electric utility companies actively work to mitigate these threats.

Vegetation Management (VM) programs represent one of the largest recurring maintenance expenses for electric utility companies in North America. Utilities and regulators generally agree that keeping trees and vegetation from conflicting with overhead conductors is a critical and expensive responsibility for all utility companies concerned about electric service reliability.

Vegetation management such as tree trimming and removal is essential for electricity providers to reduce unwanted outages and be rated with a low System Average Interruption Duration Index (SAIDI) score. Electricity providers are increasingly interested in identifying innovative practices and technologies to mitigate outages, streamline vegetation management activities, and maintain acceptable SAIDI scores. With the recent democratization of machine learning leveraging the power of the cloud, utility companies are identifying unique ways to solve complex business problems on top of AWS. The Accenture AWS Business Group, a strategic collaboration by Accenture and AWS, helps customers accelerate their pace of innovation to deliver disruptive products and services. Learning how to apply machine learning helps enterprises innovate, disrupt, and unlock business value.

In this blog post, you learn how Accenture and AWS collaborated to develop a machine learning solution for an electricity provider using Amazon SageMaker.  The goal was to improve vegetation management and optimize program cost.

Overview of solution 

VM is generally performed on a cyclical basis, prioritizing circuits solely based on the number of outages in previous years. A more sophisticated approach is to use Light Detection and Ranging (LIDAR) and imagery from aircraft and low earth orbit (LEO) satellites with machine learning models to determine where VM is needed. This provides the information for precise VM plans, but is more expensive due to the cost of acquiring the LIDAR and imagery data.

In this blog, we show how a machine learning (ML) solution can prioritize circuits based on the impacts of tree-related outages on the coming year’s SAIDI without using imagery data.

We demonstrate how to implement a solution that cross-references, cleans, and transforms time series data from multiple sources. This then creates features and models that predict the number of outages in the coming year, and sorts and prioritizes circuits based on their impact on the coming year’s SAIDI. We also show how to use an interactive dashboard designed to browse circuits and see the impact of performing VM on SAIDI reduction based on your budget.

Walkthrough

  • Source data is first transferred into an Amazon Simple Storage Service (Amazon S3) bucket from the client’s data center.
  • Next, AWS Glue crawlers crawl the data from the source bucket, and AWS Glue jobs cross-reference the data files to create features for modeling and data for the dashboards.
  • We used Jupyter notebooks on Amazon SageMaker to train and evaluate models. The best performing model was saved as a pickle file on Amazon S3, and AWS Glue was used to add the predicted number of outages for each circuit to the data prepared for the dashboards.
  • Lastly, operations users were granted access to Amazon QuickSight dashboards, which source data from Amazon Athena, to browse the data and graphs, while VM users were additionally granted access to directly edit the data prepared for the dashboards, such as the latest VM cost for each circuit.
  • We used Amazon QuickSight to create interactive dashboards for the VM team members to visualize analytics and predictions. These predictions are a list of circuits prioritized based on their impact on SAIDI in the coming year. The solution allows our team to analyze the data and experiment with different models in a rapid cycle.

Modeling

We were provided with 6 years’ worth of data across 127 circuits. Data included VM (VM work start and end date, number of trees trimmed and removed, costs), asset (pole count, height, and materials, wire count, length, and materials, and meter count and voltage), terrain (elevation, landcover, flooding frequency, wind erodibility, soil erodibility, slope, soil water absorption, and soil loss tolerance from GIS ESRI layers), and outages (outage coordinates, dates, duration, total customer minutes, total customers affected). In addition, we collected weather data from the NOAA and DarkSky datasets, including wind, gust, precipitation, and temperature.

Starting with 762 records (6 years * 127 circuits) and 226 features, we performed a series of data cleaning and feature engineering tasks, a few of which are sketched in the example after this list:

  • Dropped sparse, non-variant, and non-relevant features
  • Capped selected outliers based on features’ distributions and percentiles
  • Normalized imbalanced features
  • Imputed missing values
    • Used “0” where missing value meant zero (for example, number of trees removed)
    • Used 3650 (equivalent to 10 years) where missing values are days for VM work (for example, days since previous tree trimming job)
    • Used average of values for each circuit when applicable, and average of values across all circuits for circuits with no existing values (for example, pole mean height)
  • Merged conceptually relevant features
  • Created new features such as ratios (for example, tree trim cost per trim) and combinations (for example, % of land cover for low and medium intensity areas combined)
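
The following is a minimal pandas sketch of a few of these steps (outlier capping, domain-specific imputation, and a ratio feature). The file name, column names, and thresholds are illustrative only and do not reflect the client's actual schema:

import numpy as np
import pandas as pd

df = pd.read_csv('circuit_features.csv')   # illustrative input file

# Cap selected outliers at the 99th percentile of the feature's distribution
cap = df['total_customer_minutes'].quantile(0.99)
df['total_customer_minutes'] = df['total_customer_minutes'].clip(upper=cap)

# Impute missing values using domain knowledge:
# a missing tree-removal count means no trees were removed, and a missing
# "days since previous trim" is treated as 10 years (3650 days)
df['trees_removed'] = df['trees_removed'].fillna(0)
df['days_since_trim'] = df['days_since_trim'].fillna(3650)

# Impute pole height with the per-circuit mean, falling back to the overall mean
df['pole_mean_height'] = (
    df.groupby('circuit_id')['pole_mean_height']
      .transform(lambda s: s.fillna(s.mean()))
      .fillna(df['pole_mean_height'].mean())
)

# Example ratio feature: tree trim cost per trim
df['trim_cost_per_trim'] = df['tree_trim_cost'] / df['trees_trimmed'].replace(0, np.nan)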

After further dropping highly correlated features to remove multicollinearity from our models, we were left with 72 features for model development. The following diagram shows a high-level overview of the data partitioning and the outage-count prediction.

Out of Gradient Boosting Trees, Random Forest, Feed Forward Neural Networks, and Elastic Net, our best performing model was Elastic Net, with a Mean Absolute Error (MAE) of 6.02 when using a combination of only 10 features. Elastic Net is appropriate for the smaller sample size of this dataset, is good at feature selection, is likely to generalize to a new dataset, and consistently showed a lower error rate. Exponential expansion of features showed small improvements in predictions, but we kept the non-expanded version for better interpretability.
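
A minimal scikit-learn sketch of this kind of model fit and evaluation is shown below. The data here is a synthetic stand-in with the same shape as the real dataset (762 circuit-years, 72 features), and the hyperparameters are illustrative; the actual notebook compared several model families and tuned them before settling on Elastic Net:

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and next-year outage counts
rng = np.random.default_rng(42)
X = rng.normal(size=(762, 72))
y = rng.poisson(lam=6.0, size=762)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)   # illustrative hyperparameters
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, predictions))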

When analyzing the model performance, predictions were more accurate for circuits with lower outage count, and models suffered from under-predicting when the number of outages was high. This is due to having few circuits with a high number of outages for the model to learn from.

The following chart shows the importance of each feature used in the model. An error of 6.02 means that, on average, we over- or under-predict by about six outages for each circuit.

Dashboard

We designed two types of interactive dashboards for the VM team to browse results and predictions. The first set of dashboards show historical or predicted outage counts for each circuit on a geospatial map. Users can further filter circuits based on criteria such as the number of days since VM, as shown in the following screenshot.

The second type of dashboard shows predicted post-VM SAIDI on the y-axis and VM cost on the x-axis. This dashboard is used by the client to determine the reduction in SAIDI based on the available VM budget for the year and to dispatch the VM crew accordingly. Clients can also upload a list of updated VM costs for each circuit, and the graph will automatically readjust.

Conclusion

This solution for vegetation management demonstrates how we used Amazon SageMaker to train and evaluate machine learning models. Using this solution, an electric utility can save time and cost, and scale easily to include more circuits within a set VM budget. We demonstrated how a utility can leverage machine learning to predict unwanted outages and maintain vegetation without incurring the cost of high-resolution imagery.

Further, to improve these predictions we recommend:

  1. A yearly collection of asset and terrain data (if data is only available for the most recent year, it is impossible for models to learn from each year’s changes),
  2. Collection of VM data per month per location (if data is collected only at the end of each VM cycle and only per circuit, monthly and subcircuit-level modeling is impossible), and
  3. Purchasing LiDAR imagery or tree inventory data to include features such as tree density, height, distance to wires, and more.

Accelerating Innovation with the Accenture AWS Business Group (AABG)

By working with the Accenture AWS Business Group (AABG), you can learn from the resources, technical expertise, and industry knowledge of two leading innovators, helping you accelerate the pace of innovation to deliver disruptive products and services. The AABG helps customers ideate and innovate cloud solutions through rapid prototype development.

Connect with our team at [email protected] to learn and accelerate how to use machine learning in your products and services.


Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.
Soheil

Soheil Moosavi

Soheil Moosavi is a data scientist consultant and part of Accenture Applied Intelligence (AAI) team. He comes with vast experience in Machine Learning and architecting analytical solutions to solve and improve business problems.

Louis

Louis Lim

Louis Lim is a manager in the Accenture AWS Business Group. His team focuses on helping enterprises explore the art of the possible through rapid prototyping and cloud-native solutions.

 

Testing data quality at scale with PyDeequ

Post Syndicated from Calvin Wang original https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/

You generally write unit tests for your code, but do you also test your data? Incoming data quality can make or break your application. Incorrect, missing, or malformed data can have a large impact on production systems. Examples of data quality issues include the following:

  • Missing values can lead to failures in production systems that require non-null values (NullPointerException)
  • Changes in the distribution of data can lead to unexpected outputs of machine learning (ML) models
  • Aggregations of incorrect data can lead to wrong business decisions

In this post, we introduce PyDeequ, an open-source Python wrapper over Deequ (an open-source tool developed and used at Amazon). Deequ is written in Scala, whereas PyDeequ allows you to use its data quality and testing capabilities from Python and PySpark, the language of choice of many data scientists. PyDeequ democratizes and extends the power of Deequ by allowing you to use it alongside the many data science libraries that are available in that language. Furthermore, PyDeequ allows for a fluid interface with pandas DataFrames, as opposed to being restricted to Apache Spark DataFrames.

Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (billions of rows) that typically live in a data lake, distributed file system, or a data warehouse. PyDeequ gives you access to this capability, but also allows you to use it from the familiar environment of your Python Jupyter notebook.

Deequ at Amazon

Deequ is used internally at Amazon to verify the quality of many large production datasets. Dataset producers can add and edit data quality constraints. The system computes data quality metrics on a regular basis (with every new version of a dataset), verifies constraints defined by dataset producers, and publishes datasets to consumers in case of success. In error cases, dataset publication can be stopped, and producers are notified to take action. Data quality issues don’t propagate to consumer data pipelines, reducing their blast radius.

Deequ is also used within Amazon SageMaker Model Monitor. Now, with the availability of PyDeequ, you can use it from a broader set of environments: Amazon SageMaker notebooks, AWS Glue, Amazon EMR, and more.

Overview of PyDeequ

Let’s look at PyDeequ’s main components, and how they relate to Deequ (shown in the following diagram):

  • Metrics computation – Deequ computes data quality metrics, that is, statistics such as completeness, maximum, or correlation. Deequ uses Spark to read from sources such as Amazon Simple Storage Service (Amazon S3) and compute metrics through an optimized set of aggregation queries. You have direct access to the raw metrics computed on the data.
  • Constraint verification – As a user, you focus on defining a set of data quality constraints to be verified. Deequ takes care of deriving the required set of metrics to be computed on the data. Deequ generates a data quality report, which contains the result of the constraint verification.
  • Constraint suggestion – You can choose to define your own custom data quality constraints or use the automated constraint suggestion methods that profile the data to infer useful constraints.
  • Python wrappers – You can call each Deequ function using Python syntax. The wrappers translate the commands to the underlying Deequ calls and return their response.


Use case overview

As a running example, we use a customer review dataset provided by Amazon on Amazon S3. We intentionally follow the example in the post Test data quality at scale with Deequ to show the similarity in functionality and implementation. We begin the way many data science projects do: with initial data exploration and assessment in a Jupyter notebook.

If you’d like to follow along with a live Jupyter notebook, check out the notebook on our GitHub repo.

During the data exploration phase, you want to easily answer some basic questions about the data:

  • Are the fields that are supposed to contain unique values really unique? Are there fields that are missing values?
  • How many distinct categories are there in the categorical fields?
  • Are there correlations between some key features?
  • If there are two supposedly similar datasets (such as different categories or different time periods), are they really similar?

We also show you how to scale this approach to large-scale datasets, using the same code on an Amazon EMR cluster. This is how you’d likely do your ML training, and later as you move into a production setting.

Starting a PySpark session in a SageMaker notebook

To follow along with this post, open up a SageMaker notebook instance, clone the PyDeequ GitHub repository onto the SageMaker notebook instance, and run the test_data_quality_at_scale.ipynb notebook from the tutorials directory of the PyDeequ repository.

Let’s install our dependencies first in a terminal window:

$ pip install pydeequ

Next, in a cell of our SageMaker notebook, we need to create a PySpark session:

import sagemaker_pyspark
import pydeequ
from pyspark.sql import SparkSession

classpath = ":".join(sagemaker_pyspark.classpath_jars())

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

Loading data

Load the dataset containing reviews for the category Electronics into our Jupyter notebook:

df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

After you load the DataFrame, you can run df.printSchema() to view the schema of the dataset:

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
 |-- year: integer (nullable = true)

Data analysis

Before we define checks on the data, we want to calculate some statistics on the dataset; we call them metrics. As with Deequ, PyDeequ supports a rich set of metrics. For more information, see Test data quality at scale with Deequ or the GitHub repo. In the following example, we use the AnalysisRunner to capture the metrics you’re interested in:

from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
                    .onData(df) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("review_id")) \
                    .addAnalyzer(ApproxCountDistinct("review_id")) \
                    .addAnalyzer(Mean("star_rating")) \
                    .addAnalyzer(Compliance("top star_rating", "star_rating >= 4.0")) \
                    .addAnalyzer(Correlation("total_votes", "star_rating")) \
                    .addAnalyzer(Correlation("total_votes", "helpful_votes")) \
                    .run()
                    
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()

The following table summarizes our findings.

Name Instance Value
ApproxCountDistinct review_id 3010972
Completeness review_id 1
Compliance top star_rating 0.74941
Correlation helpful_votes,total_votes 0.99365
Correlation total_votes,star_rating -0.03451
Mean star_rating 4.03614
Size * 3120938

From this, we learn the following:

  • review_id has no missing values and approximately 3,010,972 unique values
  • Approximately 74.9% of reviews have a star_rating of 4 or higher
  • total_votes and star_rating are not correlated
  • helpful_votes and total_votes are strongly correlated
  • The average star_rating is 4.0
  • The dataset contains 3,120,938 reviews

Defining and running tests for data

After analyzing and understanding the data, we want to verify that the properties we have derived also hold for new versions of the dataset. By defining assertions on the data distribution as part of a data pipeline, we can ensure that every processed dataset is of high quality, and that any application consuming the data can rely on it.

For writing tests on data, we start with the VerificationSuite and add checks on attributes of the data. In this example, we test for the following properties of our data:

  • At least 3 million rows in total
  • review_id is never NULL
  • review_id is unique
  • star_rating has a minimum of 1.0 and maximum of 5.0
  • marketplace only contains US, UK, DE, JP, or FR
  • year does not contain negative values

This is the code that reflects the previous statements. For information about all available checks, see the GitHub repo. You can run this directly in the Spark shell as previously explained:

from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Warning, "Amazon Electronics Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x >= 3000000) \
        .hasMin("star_rating", lambda x: x == 1.0) \
        .hasMax("star_rating", lambda x: x == 5.0)  \
        .isComplete("review_id")  \
        .isUnique("review_id")  \
        .isComplete("marketplace")  \
        .isContainedIn("marketplace", ["US", "UK", "DE", "JP", "FR"]) \
        .isNonNegative("year")) \
    .run()
    
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

After calling run(), PyDeequ translates your test description into Deequ, which translates it into a series of Spark jobs that are run to compute metrics on the data. Afterwards, it invokes your assertion functions (for example, lambda x: x == 1.0 for the minimum star rating check) on these metrics to see if the constraints hold on the data. The following table summarizes our findings.

Constraint constraint_status constraint_message
SizeConstraint(Size(None)) Success
MinimumConstraint(Minimum(star_rating,None)) Success
MaximumConstraint(Maximum(star_rating,None)) Success
CompletenessConstraint(Completeness(review_id,None)) Success
UniquenessConstraint(Uniqueness(List(review_id))) Failure Value: 0.9926566948782706 does not meet the constraint requirement!
CompletenessConstraint(Completeness(marketplace,None)) Success
ComplianceConstraint(Compliance(marketplace contained in US,UK,DE,JP,FR,marketplace IS NULL OR marketplace IN (‘US’,’UK’,’DE’,’JP’,’FR’),None)) Success
ComplianceConstraint(Compliance(year is non-negative,COALESCE(year, 0.0) >= 0,None)) Success

Interestingly, the review_id column isn’t unique, which resulted in a failure of the check on uniqueness. We can also look at all the metrics that Deequ computed for this check by running the following:

checkResult_df = VerificationResult.successMetricsAsDataFrame(spark, checkResult)
checkResult_df.show()

The following table summarizes our findings.

Name Instance Value
Completeness review_id 1
Completeness marketplace 1
Compliance marketplace contained in US,UK,DE,JP,FR 1
Compliance year is non-negative 1
Maximum star_rating 5
Minimum star_rating 1
Size * 3120938
Uniqueness review_id 0.99266

Automated constraint suggestion

If you own a large number of datasets or if your dataset has many columns, it may be challenging for you to manually define appropriate constraints. Deequ can automatically suggest useful constraints based on the data distribution. Deequ first runs a data profiling method and then applies a set of rules on the result. For more information about how to run a data profiling method, see the GitHub repo.

import json
from pydeequ.suggestions import *

suggestionResult = ConstraintSuggestionRunner(spark) \
             .onData(df) \
             .addConstraintRule(DEFAULT()) \
             .run()

# Constraint Suggestions in JSON format
print(json.dumps(suggestionResult, indent=2))

The result contains a list of constraints with descriptions and Python code, so that you can directly apply it in your data quality checks. Call print(json.dumps(suggestionResult, indent=2)) to inspect the suggested constraints; the following table shows a subset, and the short sketch after the table shows one way to iterate over the suggestions programmatically.

Column Constraint Python code
customer_id customer_id is not null .isComplete("customer_id")
customer_id customer_id has type Integral .hasDataType("customer_id", ConstrainableDataTypes.Integral)
customer_id customer_id has no negative values .isNonNegative("customer_id")
helpful_votes helpful_votes is not null .isComplete("helpful_votes")
helpful_votes helpful_votes has no negative values .isNonNegative("helpful_votes")
marketplace marketplace has value range “US”, “UK”, “DE”, “JP”, “FR” .isContainedIn("marketplace", ["US", "UK", "DE", "JP", "FR"])
product_title product_title is not null .isComplete("product_title")
star_rating star_rating is not null .isComplete("star_rating")
star_rating star_rating has no negative values .isNonNegative("star_rating")
vine vine has value range “N”, “Y” .isContainedIn("vine", ["N", "Y"])
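
If you want to work with the suggestions programmatically rather than copying the code by hand, you can iterate over the returned dictionary. The key names below (constraint_suggestions, column_name, code_for_constraint) match the JSON structure printed earlier, but verify them against your installed PyDeequ version:

# Print one suggested check per line, keyed by column
for suggestion in suggestionResult['constraint_suggestions']:
    print(f"{suggestion['column_name']}: {suggestion['code_for_constraint']}")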

You can explore the other tutorials in the PyDeequ GitHub repo.

Scaling to production

So far, we’ve shown you how to use these capabilities in the context of data exploration using a Jupyter notebook running on a SageMaker notebook instance. As your project matures, you need to use the same capabilities on larger and larger datasets, and in a production environment. With PyDeequ, it’s easy to make that transition. The following diagram illustrates deployment options for local and production purposes on AWS.


Amazon EMR and AWS Glue interface with PyDeequ through the PySpark drivers that PyDeequ uses as its main engine. PyDeequ can run as a PySpark application in both contexts when the Deequ JAR is added to the Spark context. You can run PyDeequ’s data validation toolkit after the Spark context and drivers are configured and your data is loaded into a DataFrame. We describe the Amazon EMR configuration options and use cases in this section (configurations 2 and 3 in the diagram).

Data exploration from a SageMaker notebook via an EMR cluster

As shown in configuration 2 in the diagram, you can connect to an EMR cluster from a SageMaker notebook to run PyDeequ. This enables you to explore much larger volumes of data than you can using a single notebook. Your Amazon EMR cluster must be running Spark v2.4.6, available with Amazon EMR version 5.31 or higher, in order to work with PyDeequ. After you have a running cluster that has those components and a SageMaker notebook, you configure a SparkSession object using the following template to connect to your cluster. For more information about connecting a SageMaker notebook to Amazon EMR or the necessary IAM permissions, see Submitting User Applications with spark-submit.

In the SageMaker notebook, run the following JSON in a cell before you start your SparkSession to configure your EMR cluster:

%%configure -f
{ "conf":{
          "spark.pyspark.python": "python3",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type":"native",
          "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
          "spark.jars.packages": "com.amazon.deequ:deequ:1.0.3",
          "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all"
         }
}

Start your SparkSession object in a cell after the preceding configuration by running spark. Then install PyDeequ onto your EMR cluster using the SparkContext (default named sc) with the following command:

sc.install_pypi_package('pydeequ')

Now you can start using PyDeequ from your notebook to run the same statements as before, but with much larger volumes of data.

Running a transient EMR cluster

Another way to leverage the power of an EMR cluster is to treat it as a transient cluster and run it in a headless configuration, as shown in configuration 3 in the diagram. We use spark-submit in an EMR add-step to run PyDeequ on Amazon EMR. For each of the following steps, make sure to replace the values in brackets accordingly.

  1. Create a bootstrap shell script and upload it to an S3 bucket. The following code is an example of pydeequ-emr-bootstrap.sh:
    #!/bin/bash
    
    sudo python3 -m pip install --no-deps pydeequ
    sudo python3 -m pip install pandas 

  2. Create an EMR cluster via the AWS Command Line Interface (AWS CLI):
    $ aws emr create-cluster \
    --name 'my-pydeequ-cluster' \
    --release-label emr-5.31.0 --applications Name=Spark Name=Hadoop Name=Hive Name=Livy Name=Pig Name=Hue \
    --use-default-roles \
    --instance-type m5.xlarge \
    --instance-count 2 \
    --bootstrap-actions \
        Path="s3://<S3_PATH_TO_BOOTSTRAP>/pydeequ-emr-bootstrap.sh",Name='install_pydeequ' \
    --visible-to-all-users \
    --enable-debugging \
    --ec2-attributes KeyName="<MY_SSH_KEY>",SubnetId="<MY_SUBNET>" \
    --auto-scaling-role EMR_AutoScaling_DefaultRole

  3. Create your PySpark PyDeequ run script and upload it to Amazon S3. The following code is our example of pydeequ-test.py:
    import sys
    import pydeequ
    from pydeequ.checks import *
    from pydeequ.verification import *
    from pyspark.sql import SparkSession, Row
    
    if __name__ == "__main__":
    
        with SparkSession.builder.appName("pydeequ").getOrCreate() as spark:
    
            df = spark.sparkContext.parallelize([
                Row(a="foo", b=1, c=5),
                Row(a="bar", b=2, c=6),
                Row(a="baz", b=3, c=None)]).toDF()
    
            check = Check(spark, CheckLevel.Error, "Integrity checks")
    
            checkResult = VerificationSuite(spark) \
                .onData(df) \
                .addCheck(
                    check.hasSize(lambda x: x >= 3) \
                    .hasMin("b", lambda x: x == 0) \
                    .isComplete("c")  \
                    .isUnique("a")  \
                    .isContainedIn("a", ["foo", "bar", "baz"]) \
                    .isNonNegative("b")) \
                .run()
    
            checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
            checkResult_df.repartition(1).write.csv("s3a://<PATH_TO_OUTPUT>/pydeequ-out.csv", sep='|')

  4. When your cluster is running and in the WAITING state, submit your Spark job to Amazon EMR via the AWS CLI:
    $ aws emr add-steps \
    --cluster-id <MY_CLUSTER_ID> \
    --steps Type=Spark,Name="pydeequ-spark-submit",Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,--packages,com.amazon.deequ:deequ:1.0.3,--exclude-packages,net.sourceforge.f2j:arpack_combined_all,s3a://pydeequ-emr/setup/pydeequ-test.py],ActionOnFailure=CANCEL_AND_WAIT

Congratulations, you have now submitted a PyDeequ PySpark job to Amazon EMR. Give the job a few minutes to run, after which you can view your results at the S3 output path specified on the last line of pydeequ-test.py.

Afterwards, remember to clean up your results and spin down the EMR cluster using the following command:

$ aws emr terminate-clusters --cluster-ids <MY_CLUSTER_ID>

You can now plug PyDeequ into your pipelines and use Amazon EMR to run scalable data quality tests on large datasets in batch.

More examples on GitHub

You can find examples of more advanced features on the Deequ GitHub page:

  • Deequ provides more than data quality checks with fixed thresholds. Learn how to use anomaly detection on data quality metrics to apply tests on metrics that change over time.
  • Deequ offers support for storing and loading metrics. Learn how to use the MetricsRepository for this use case.
  • If your dataset grows over time or is partitioned, you can use Deequ’s incremental metrics computation. For each partition, Deequ stores a state for each computed metric. To compute metrics for the union of partitions, Deequ can use these states to efficiently derive overall metrics without reloading the data.

Conclusion

This post showed you how to use PyDeequ for calculating data quality metrics, verifying data quality metrics, and profiling data to automate the configuration of data quality checks. PyDeequ is available via pip install and on GitHub now for you to build your own data quality management pipeline.

Learn more about the inner workings of Deequ in the VLDB 2018 paper Automating large-scale data quality verification.

Stay tuned for another post demonstrating production workflows on AWS Glue.


About the Authors

Calvin Wang is a Data Scientist at AWS AI/ML. He holds a B.S. in Computer Science from UC Santa Barbara and loves using machine learning to build cool stuff.

Chris Ghyzel is a Data Engineer for AWS Professional Services. Currently, he is working with customers to integrate machine learning solutions on AWS into their production pipelines.

Veronika Megler, PhD, is Principal Data Scientist for Amazon.com Consumer Packaging. Until recently she was the Principal Data Scientist for AWS Professional Services. She enjoys adapting innovative big data, AI, and ML technologies to help companies solve new problems, and to solve old problems more efficiently and effectively. Her work has lately been focused more heavily on economic impacts of ML models and exploring causality.

Building a Controlled Environment Agriculture Platform

Post Syndicated from Ashu Joshi original https://aws.amazon.com/blogs/architecture/building-a-controlled-environment-agriculture-platform/

This post was co-written by Michael Wirig, Software Engineering Manager at Grōv Technologies.

A substantial percentage of the world’s habitable land is used for livestock farming for dairy and meat production. The dairy industry has leveraged technology to gain insights that have led to drastic improvements, and those improvements continue to accelerate. A gallon of milk in 2017 involved 30% less water, 21% less land, a 19% smaller carbon footprint, and 20% less manure than it did in 2007 (US Dairy, 2019). By focusing on smarter water usage and sustainable land usage, livestock farming can grow to provide sustainable and nutrient-dense food for consumers and livestock alike.

Grōv Technologies (Grōv) has pioneered the Olympus Tower Farm, a fully automated Controlled Environment Agriculture (CEA) system. Unique amongst vertical farming startups, Grōv is growing cattle feed to improve the sustainable use of land for livestock farming while increasing the economic margins for dairy and beef producers.

The challenges of CEA

The set of growing conditions for a CEA is called a “recipe,” which is a combination of ingredients like temperature, humidity, light, carbon dioxide levels, and water. The optimal recipe is dynamic and is sensitive to its ingredients. Crops must be monitored in near-real time, and CEAs should be able to self-correct in order to maintain the recipe. Building a system with these capabilities requires answers to the following questions:

  • Which parameters need to be measured for indoor cattle feed production?
  • Which sensors provide the right accuracy and price trade-offs at scale?
  • Where do you place the sensors to ensure a consistent crop?
  • How do you correlate the data from sensors to the nutrient value?

To progress from a passively monitored system to a self-correcting, autonomous one, the CEA platform also needs to address:

  • How to maintain optimum crop conditions
  • How the system can learn and adapt to new seed varieties
  • How to communicate key business drivers such as yield and dry matter percentage

Grōv partnered with AWS Professional Services (AWS ProServe) to build a digital CEA platform addressing the challenges posed above.

Olympus Tower - Grov Technologies

Tower automation and edge platform

The Olympus Tower is instrumented for measuring recipe ingredients by combining the mechanical, electrical, and domain expertise of the Grōv team with the IoT edge and sensor expertise of the AWS ProServe team. The teams identified a primary set of features such as height, weight, and evenness of the growth to be measured at multiple stages within the Tower. Sensors were also added to measure secondary features such as water level, water pH, temperature, humidity, and carbon dioxide.

The teams designed and developed a purpose-built modular and industrial sensor station. Each sensor station has sensors for direct measurement of the features identified. The sensor stations are extended to support indirect measurement of features using a combination of Computer Vision and Machine Learning (CV/ML).

The trays with the growing cattle feed circulate through the Olympus Tower. A growth cycle starts on a tray with seeding, circulates through the tower over the cycle, and returns to the starting position to be harvested. The sensor station at the seeding location on the Olympus Tower tags each new growth cycle in a tray with a unique “Grow ID.” As trays pass by, each sensor station in the Tower collects the feature data. The firmware, jointly developed for the sensor station, uses the AWS IoT SDK to stream the sensor data along with the Grow ID and metadata that’s specific to the sensor station. This information is sent every five minutes to an on-site edge gateway powered by AWS IoT Greengrass. Dedicated AWS Lambda functions manage the lifecycle of the Grow IDs and the sensor data processing at the edge.
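
As an illustration only, the following sketch shows how firmware like this might publish a reading with the AWS IoT Device SDK for Python. The endpoint, certificate paths, topic name, and payload fields are hypothetical placeholders, not Grōv’s actual implementation:

import json
import time
from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient

# Hypothetical connection details: replace with your own IoT endpoint and certificates
client = AWSIoTMQTTClient("sensor-station-01")
client.configureEndpoint("<YOUR_IOT_ENDPOINT>", 8883)
client.configureCredentials("<ROOT_CA_PATH>", "<PRIVATE_KEY_PATH>", "<CERTIFICATE_PATH>")
client.connect()

# Example payload: a Grow ID plus a few sensor readings and station metadata
payload = {
    "grow_id": "GROW-0001",
    "station_id": "station-07",
    "temperature_c": 21.4,
    "humidity_pct": 64.0,
    "timestamp": int(time.time())
}

# Publish one reading; in practice the firmware sends this every five minutes
client.publish("growtower/sensor-data", json.dumps(payload), 1)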

The Grōv team developed AWS IoT Greengrass Lambda functions that run at the edge to ingest critical metrics from the operational automation software running the Olympus Towers. This information makes it possible not only to monitor operational efficiency, but also to provide the hooks needed to control the feedback loop.

The two sources of data were augmented with site-level data by installing sensor stations at the building level or site level to capture environmental data such as weather and energy consumption of the Towers.

All three sources of data are streamed to AWS IoT Greengrass and are processed by AWS Lambda functions. The edge software also fuses the data and correlates all categories of data together. This enables two major actions for the Grōv team – operational capability in real-time at the edge and enhanced data streamed into the cloud.

Grov Technologies - Architecture

Cloud pipeline/platform: analytics and visualization

The data is streamed to AWS IoT Core via AWS IoT Greengrass. AWS IoT rules are used to route the ingested data to Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB for storage. The data pipeline also includes Amazon Kinesis Data Streams for batching and additional processing of the incoming data.

A ReactJS-based dashboard application is powered using Amazon API Gateway and AWS Lambda functions to report relevant metrics such as daily yield and machine uptime.

A data pipeline is deployed to analyze data using Amazon QuickSight. AWS Glue is used to create a dataset from the data stored in Amazon S3, and Amazon Athena is used to query the dataset and make it available to Amazon QuickSight. This gives the extended Grōv tech team of research scientists the ability to perform a series of what-if analyses on the data coming in from the Tower systems, beyond what is available in the ReactJS-based dashboard.
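
These what-if queries can also be issued programmatically. The sketch below uses boto3 to run an Athena query against the dataset cataloged by AWS Glue; the database, table, and bucket names are hypothetical placeholders:

import boto3

athena = boto3.client("athena")

# Hypothetical database and table created by AWS Glue over the S3 data
response = athena.start_query_execution(
    QueryString=(
        "SELECT grow_id, AVG(temperature_c) AS avg_temp "
        "FROM tower_sensor_data GROUP BY grow_id"
    ),
    QueryExecutionContext={"Database": "grov_cea"},
    ResultConfiguration={"OutputLocation": "s3://<MY_QUERY_RESULTS_BUCKET>/athena/"},
)
print(response["QueryExecutionId"])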

Data pipeline - Grov Technologies

Completing the data-driven loop

Now that the data has been collected from all sources and stored in a data lake architecture, the Grōv CEA platform has established a strong foundation for harnessing insights and delivering customer outcomes using machine learning.

The integrated and fused data from the edge (sourced from the Olympus Tower instrumentation, the Olympus automation software, and the site-level sensors) is correlated with the lab analysis performed by the Grōv Research Center (GRC). Harvest samples are routinely collected and sent to the lab, which performs wet chemistry and microbiological analysis. The lab results for each sampled tray are associated with the corresponding sensor data through the tray’s Grow ID. This serves as a mechanism for labeling and correlating the recipe data with the parameters used by dairy and beef producers – dry matter percentage, micro and macronutrients, and the presence of mycotoxins.

Grōv has chosen Amazon SageMaker to build a machine learning pipeline on its comprehensive dataset, which will enable fine-tuning the growing protocols in near real time. Historical data collection also unlocks machine learning use cases for future detection of anomalous sensor readings and for sensor health monitoring.

Because the solution is flexible, the Grōv team plans to integrate data from animal studies on their health and feed efficiency into the CEA platform. Machine learning on the data from animal studies will enhance the tuning of recipe ingredients that impact the animals’ health. This will give the farmer an unprecedented view of the impact of feed nutrition on the end product and consumer.

Conclusion

Grōv Technologies and AWS ProServe have built a strong foundation for an extensible and scalable CEA platform that will nourish animals for better health and yield, produce healthier foods, and enable continued research into dairy production, rumination, and animal health to empower sustainable farming practices.

Amazon SageMaker JumpStart Simplifies Access to Pre-built Models and Machine Learning Solutions

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-jumpstart-simplifies-access-to-prebuilt-models-and-machine-learning-models/

Today, I’m extremely happy to announce the availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that accelerates your machine learning workflows with one-click access to popular model collections (also known as “model zoos”), and to end-to-end solutions that solve common use cases.

In recent years, machine learning (ML) has proven to be a valuable technique in improving and automating business processes. Indeed, models trained on historical data can accurately predict outcomes across a wide range of industry segments: financial services, retail, manufacturing, telecom, life sciences, and so on. Yet, working with these models requires skills and experience that only a subset of scientists and developers have: preparing a dataset, selecting an algorithm, training a model, optimizing its accuracy, deploying it in production, and monitoring its performance over time.

In order to simplify the model building process, the ML community has created model zoos, that is to say, collections of models built with popular open source libraries, and often pretrained on reference datasets. For example, the TensorFlow Hub and the PyTorch Hub provide developers with a long list of models ready to be downloaded, and integrated in applications for computer vision, natural language processing, and more.

Still, downloading a model is just part of the answer. Developers then need to deploy it for evaluation and testing, using either a variety of tools, such as the TensorFlow Serving and TorchServe model servers, or their own bespoke code. Once the model is running, developers need to figure out the correct format that incoming data should have, a long-lasting pain point. I’m sure I’m not the only one regularly pulling my hair out here!

Of course, a full-ML application usually has a lot of moving parts. Data needs to be preprocessed, enriched with additional data fetched from a backend, and funneled into the model. Predictions are often postprocessed, and stored for further analysis and visualization. As useful as they are, model zoos only help with the modeling part. Developers still have lots of extra work to deliver a complete ML solution.

Because of all this, ML experts are flooded with a long backlog of projects waiting to start. Meanwhile, less experienced practitioners struggle to get started. These barriers are incredibly frustrating, and our customers asked us to remove them.

Introducing Amazon SageMaker JumpStart
Amazon SageMaker JumpStart is integrated in Amazon SageMaker Studio, our fully integrated development environment (IDE) for ML, making it intuitive to discover models, solutions, and more. At launch, SageMaker JumpStart includes:

  • 15+ end-to-end solutions for common ML use cases such as fraud detection, predictive maintenance, and so on.
  • 150+ models from the TensorFlow Hub and the PyTorch Hub, for computer vision (image classification, object detection), and natural language processing (sentence classification, question answering).
  • Sample notebooks for the built-in algorithms available in Amazon SageMaker.

SageMaker JumpStart also provides notebooks, blogs, and video tutorials designed to help you learn and remove roadblocks. Content is easily accessible within Amazon SageMaker Studio, enabling you to get started with ML faster.

It only takes a single click to deploy solutions and models. All infrastructure is fully managed, so all you have to do is enjoy a nice cup of tea or coffee while deployment takes place. After a few minutes, you can start testing, thanks to notebooks and sample prediction code that are readily available in Amazon SageMaker Studio. Of course, you can easily modify them to use your own data.

SageMaker JumpStart makes it extremely easy for experienced practitioners and beginners alike to quickly deploy and evaluate models and solutions, saving days or even weeks of work. By drastically shortening the path from experimentation to production, SageMaker JumpStart accelerates ML-powered innovation, particularly for organizations and teams that are early on their ML journey, and haven’t yet accumulated a lot of skills and experience.

Now, let me show you how SageMaker JumpStart works.

Deploying a Solution with Amazon SageMaker JumpStart
Opening SageMaker Studio, I select the “JumpStart” icon on the left. This opens a new tab showing me all available content (solutions, models, and so on).

Let’s say that I’m interested in using computer vision to detect defects in manufactured products. Could ML be the answer?

Browsing the list of available solutions, I see one for product defect detection.

Opening it, I can learn more about the type of problems that it solves, the sample dataset used in the demo, the AWS services involved, and more.

SageMaker screenshot

A single click is all it takes to deploy this solution. Under the hood, AWS CloudFormation uses a built-in template to provision all appropriate AWS resources.

A few minutes later, the solution is deployed, and I can open its notebook.

SageMaker screenshot

The notebook opens immediately in SageMaker Studio. I run the demo, and understand how ML can help me detect product defects. This is also a nice starting point for my own project, making it easy to experiment with my own dataset (feel free to click on the image below to zoom in).

SageMaker screenshot

Once I’m done with this solution, I can delete all its resources in one click, letting AWS CloudFormation clean up without having to worry about leaving idle AWS resources behind.

SageMaker screenshot

Now, let’s look at models.

Deploying a Model with Amazon SageMaker JumpStart
SageMaker JumpStart includes a large collection of models available in the TensorFlow Hub and the PyTorch Hub. These models are pre-trained on reference datasets, and you can use them directly to handle a wide range of computer vision and natural language processing tasks. You can also fine-tune them on your own datasets for greater accuracy, a technique called transfer learning.

SageMaker screenshot
Here, I pick a version of the BERT model trained on question answering. I can either deploy it as is, or fine-tune it. For the sake of brevity, I go with the former here, and I just click on the “Deploy” button.

SageMaker screenshot

A few minutes later, the model has been deployed to a real-time endpoint powered by fully managed infrastructure.

SageMaker screenshot

Time to test it! Clicking on “Open Notebook” launches a sample notebook that I run right away to test the model, without having to change a line of code (again, feel free to click on the image below to zoom in). Here, I’m asking two questions (“What is Southern California often abbreviated as?” and “Who directed Spectre?“), passing some context containing the answer. In both cases, the BERT model gives the correct answer, respectively “socal” and “Sam Mendes“.
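
Outside the sample notebook, you could also query such an endpoint through the SageMaker runtime API. The following is a rough sketch only; the endpoint name is a placeholder, and the exact JSON payload format depends on the specific JumpStart model, so check the sample notebook for the real request shape:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical question-answering payload
payload = {
    "question": "Who directed Spectre?",
    "context": "Spectre is a 2015 spy film directed by Sam Mendes."
}

response = runtime.invoke_endpoint(
    EndpointName="<MY_JUMPSTART_BERT_ENDPOINT>",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))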

SageMaker screenshot

When I’m done testing, I can delete the endpoint in one click, and stop paying for it.

Getting Started
As you can see, it’s extremely easy to deploy models and solutions with SageMaker JumpStart in minutes, even if you have little or no ML skills.

You can start using this capability today in all regions where SageMaker Studio is available, at no additional cost.

Give it a try and let us know what you think.

As always, we’re looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleague Jared Heywood for his precious help during early testing.

New – Amazon SageMaker Pipelines Brings DevOps Capabilities to your Machine Learning Projects

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-pipelines-brings-devops-to-machine-learning-projects/

Today, I’m extremely happy to announce Amazon SageMaker Pipelines, a new capability of Amazon SageMaker that makes it easy for data scientists and engineers to build, automate, and scale end to end machine learning pipelines.

Machine learning (ML) is intrinsically experimental and unpredictable in nature. You spend days or weeks exploring and processing data in many different ways, trying to crack the geode open to reveal its precious gemstones. Then, you experiment with different algorithms and parameters, training and optimizing lots of models in search of highest accuracy. This process typically involves lots of different steps with dependencies between them, and managing it manually can become quite complex. In particular, tracking model lineage can be difficult, hampering auditability and governance. Finally, you deploy your top models, and you evaluate them against your reference test sets. Finally? Not quite, as you’ll certainly iterate again and again, either to try out new ideas, or simply to periodically retrain your models on new data.

No matter how exciting ML is, it does unfortunately involve a lot of repetitive work. Even small projects will require hundreds of steps before they get the green light for production. Over time, not only does this work detract from the fun and excitement of your projects, it also creates ample room for oversight and human error.

To alleviate manual work and improve traceability, many ML teams have adopted the DevOps philosophy and implemented tools and processes for Continuous Integration and Continuous Delivery (CI/CD). Although this is certainly a step in the right direction, writing your own tools often leads to complex projects that require more software engineering and infrastructure work than you initially anticipated. Valuable time and resources are diverted from the actual ML project, and innovation slows down. Sadly, some teams decide to revert to manual work, for model management, approval, and deployment.

Introducing Amazon SageMaker Pipelines
Simply put, Amazon SageMaker Pipelines brings in best-in-class DevOps practices to your ML projects. This new capability makes it easy for data scientists and ML developers to create automated and reliable end-to-end ML pipelines. As usual with SageMaker, all infrastructure is fully managed, and doesn’t require any work on your side.

Care.com is the world’s leading platform for finding and managing high-quality family care. Here’s what Clemens Tummeltshammer, Data Science Manager, Care.com, told us: “A strong care industry where supply matches demand is essential for economic growth from the individual family up to the nation’s GDP. We’re excited about Amazon SageMaker Feature Store and Amazon SageMaker Pipelines, as we believe they will help us scale better across our data science and development teams, by using a consistent set of curated data that we can use to build scalable end-to-end machine learning (ML) model pipelines from data preparation to deployment. With the newly announced capabilities of Amazon SageMaker, we can accelerate development and deployment of our ML models for different applications, helping our customers make better informed decisions through faster real-time recommendations.

Let me tell you more about the main components in Amazon SageMaker Pipelines: pipelines, model registry, and MLOps templates.

Pipelines – Model building pipelines are defined with a simple Python SDK. They can include any operation available in Amazon SageMaker, such as data preparation with Amazon SageMaker Processing or Amazon SageMaker Data Wrangler, model training, model deployment to a real-time endpoint, or batch transform. You can also add Amazon SageMaker Clarify to your pipelines, in order to detect bias prior to training, or once the model has been deployed. Likewise, you can add Amazon SageMaker Model Monitor to detect data and prediction quality issues.

Once launched, model building pipelines are executed as CI/CD pipelines. Every step is recorded, and detailed logging information is available for traceability and debugging purposes. Of course, you can also visualize pipelines in Amazon SageMaker Studio, and track their different executions in real time.
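
To give you a flavor of the Python SDK mentioned above, here is a minimal, hypothetical pipeline with a single training step. The container image, role, bucket, and estimator settings are placeholders, and the exact classes and arguments may vary with your version of the SageMaker Python SDK:

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.pipeline import Pipeline

# Pipeline parameter for the training data location (placeholder default)
input_data = ParameterString(name="InputData", default_value="s3://<MY_BUCKET>/train/")

# A bare-bones estimator; in practice you would also set hyperparameters, output paths, etc.
estimator = Estimator(
    image_uri="<TRAINING_IMAGE_URI>",
    role="<MY_SAGEMAKER_EXECUTION_ROLE_ARN>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# A single training step fed by the pipeline parameter
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data=input_data)},
)

pipeline = Pipeline(
    name="my-demo-pipeline",
    parameters=[input_data],
    steps=[train_step],
)

# Create or update the pipeline definition, then start an execution
pipeline.upsert(role_arn="<MY_SAGEMAKER_EXECUTION_ROLE_ARN>")
execution = pipeline.start()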

Model Registry – The model registry lets you track and catalog your models. In SageMaker Studio, you can easily view model history, list and compare versions, and track metadata such as model evaluation metrics. You can also define which versions may or may not be deployed in production. In fact, you can even build pipelines that automatically trigger model deployment once approval has been given. You’ll find that the model registry is very useful in tracing model lineage, improving model governance, and strengthening your compliance posture.
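
The registry is also exposed through the SageMaker API, so approval can be automated. As a rough sketch, listing the versions in a (hypothetical) model package group and approving one might look like this:

import boto3

sm = boto3.client("sagemaker")

# List the model versions registered in a model package group (name is a placeholder)
packages = sm.list_model_packages(ModelPackageGroupName="my-model-group")
for package in packages["ModelPackageSummaryList"]:
    print(package["ModelPackageArn"], package.get("ModelApprovalStatus"))

# Approve a specific version so that a deployment pipeline can pick it up
sm.update_model_package(
    ModelPackageArn="<MODEL_PACKAGE_ARN>",
    ModelApprovalStatus="Approved",
)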

MLOps Templates – SageMaker Pipelines includes a collection of built-in CI/CD templates for popular pipelines (build/train/deploy, deploy only, and so on). You can also add and publish your own templates, so that your teams can easily discover them and deploy them. Not only do templates save lots of time, they also make it easy for ML teams to collaborate from experimentation to deployment, using standard processes and without having to manage any infrastructure. Templates also let Ops teams customize steps as needed, and give them full visibility for troubleshooting.

Now, let’s do a quick demo!

Building an End-to-end Pipeline with Amazon SageMaker Pipelines
Opening SageMaker Studio, I select the “Components” tab and the “Projects” view. This displays a list of built-in project templates. I pick one to build, train, and deploy a model.

SageMaker screenshot

Then, I simply give my project a name, and create it.

A few seconds later, the project is ready. I can see that it includes two Git repositories hosted in AWS CodeCommit, one for model training, and one for model deployment.

SageMaker screenshot

The first repository provides scaffolding code to create a multi-step model building pipeline: data processing, model training, model evaluation, and conditional model registration based on accuracy. As you’ll see in the pipeline.py file, this pipeline trains a regression model using the XGBoost algorithm on the well-known Abalone dataset. This repository also includes a build specification file, used by AWS CodePipeline and AWS CodeBuild to execute the pipeline automatically.

Likewise, the second repository contains code and configuration files for model deployment, as well as test scripts required to pass the quality gate. This operation is also based on AWS CodePipeline and AWS CodeBuild, which run an AWS CloudFormation template to create model endpoints for staging and production.

Clicking on the two blue links, I clone the repositories locally. This triggers the first execution of the pipeline.

SageMaker screenshot

A few minutes later, the pipeline has run successfully. Switching to the “Pipelines” view, I can visualize its steps.

SageMaker screenshot

Clicking on the training step, I can see the Root Mean Square Error (RMSE) metrics for my model.

SageMaker screenshot

As the RMSE is lower than the threshold defined in the conditional step, my model is added to the model registry, as visible below.

SageMaker screenshot

For simplicity, the registration step sets the model status to “Approved”, which automatically triggers its deployment to a real-time endpoint in the same account. Within seconds, I see that the model is being deployed.

SageMaker screenshot

Alternatively, you could register your model with a “Pending manual approval” status. This will block deployment until the model has been reviewed and approved manually. As the model registry supports cross-account deployment, you could also easily deploy in a different account, without having to copy anything across accounts.

A few minutes later, the endpoint is up, and I could use it to test my model.

SageMaker screenshot

Once I’ve made sure that this model works as expected, I could ping the MLOps team, and ask them to deploy the model in production.

Putting my MLOps hat on, I open the AWS CodePipeline console, and I see that my deployment is indeed waiting for approval.

SageMaker screenshot

I then approve the model for deployment, which triggers the final stage of the pipeline.

SageMaker screenshot

Reverting to my Data Scientist hat, I see in SageMaker Studio that my model is being deployed. Job done!

SageMaker screenshot

Getting Started
As you can see, Amazon SageMaker Pipelines makes it really easy for Data Science and MLOps teams to collaborate using familiar tools. They can create and execute robust, automated ML pipelines that deliver high quality models in production quicker than before.

You can start using SageMaker Pipelines in all commercial regions where SageMaker is available. The MLOps capabilities are available in the regions where CodePipeline is also available.

Sample notebooks are available to get you started. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleague Urvashi Chowdhary for her precious assistance during early testing.

Introducing Amazon SageMaker Data Wrangler, a Visual Interface to Prepare Data for Machine Learning

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/introducing-amazon-sagemaker-data-wrangler-a-visual-interface-to-prepare-data-for-machine-learning/

Today, I’m extremely happy to announce Amazon SageMaker Data Wrangler, a new capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications by using a visual interface.

Whenever I ask a group of data scientists and ML engineers how much time they actually spend studying ML problems, I often hear a collective sigh, followed by something along the lines of, “20%, if we’re lucky.” When I ask them why, the answer is invariably the same, “data preparation consistently takes up to 80% of our time!”

Indeed, preparing data for training is a crucial step of the ML process, and no one would think about botching it up. Typical tasks include:

  • Locating data: finding where raw data is stored, and getting access to it
  • Data visualization: examining statistical properties for each column in the dataset, building histograms, studying outliers
  • Data cleaning: removing duplicates, dropping or filling entries with missing values, removing outliers
  • Data enrichment and feature engineering: processing columns to build more expressive features, selecting a subset of features for training

In the early stage of a new ML project, this is a highly manual process, where intuition and experience play a large part. Using a mix of bespoke tools and open source tools such as pandas or PySpark, data scientists often experiment with different combinations of data transformations, and use them to process datasets before training models. Then, they analyze prediction results and iterate. As important as this is, looping through this process again and again can be time-consuming, tedious, and error-prone.

At some point, you will hit the right level of accuracy (or whatever other metric you’ve picked), and you’ll then want to train on the full dataset in your production environment. However, you’ll first have to reproduce and automate the exact data preparation steps that you experimented with in your sandbox. Unfortunately, there’s always room for error given the interactive nature of this work, even if you carefully document it.

Last but not least, you’ll have to manage and scale your data processing infrastructure before you get to the finish line. Now that I think of it, 80% of your time may not be enough to do all of this!

Introducing Amazon SageMaker Data Wrangler
Amazon SageMaker Data Wrangler is integrated in Amazon SageMaker Studio, our fully managed integrated development environment (IDE) for ML. With just a few clicks, you can connect to data sources, explore and visualize data, apply built-in transformations as well as your own, export the resulting code to an auto-generated script, and run it on managed infrastructure. Let’s look at each step in more detail.

Obviously, data preparation starts with locating and accessing data. Out of the box, SageMaker Data Wrangler lets you easily and quickly connect to Amazon Simple Storage Service (S3), Amazon Athena, Amazon Redshift, and AWS Lake Formation. You can also import data from Amazon SageMaker Feature Store. As with all things AWS, access management is governed by AWS Identity and Access Management (IAM), based on the permissions attached to your SageMaker Studio instance.

Once you’ve connected to your data sources, you’ll probably want to visualize your data. Using the SageMaker Data Wrangler user interface, you can view table summaries, histograms, and scatter plots in seconds. You can also build your own custom graphs by simply copying and running code written with the popular Altair open source library.

Once you’ve got a good grasp on what your data looks like, it’s time to start preparing it. SageMaker Data Wrangler includes 300+ built-in transformations, such as finding and replacing data, splitting/renaming/dropping columns, scaling numerical values, encoding categorical values, and so on. All you have to do is select the transformation in a drop-down list, and fill in the parameters it may require. You can then preview the change, and decide whether you’d like to add it or not to the list of preparation steps for this dataset. If you’d like, you can also add your own code to implement custom transformations, using either pandas, PySpark, or PySpark SQL.
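
For instance, a custom pandas transform might look like the short sketch below. It assumes, as in Data Wrangler’s own sample snippets, that the current dataset is exposed as a DataFrame named df; the Cabin column is a hypothetical example:

# Custom transform: derive a binary feature from a sparse column, then drop the original
df["CabinKnown"] = df["Cabin"].notna().astype(int)
df = df.drop(columns=["Cabin"])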

As you add transformation steps to your processing pipeline, you can view its graphical summary in SageMaker Studio. You can also add new stages to the pipeline, for example a new data source, or another group of transformation steps (say, a data cleaning group, followed by a feature engineering group). Thanks to the intuitive user interface, your data preparation pipeline will take shape in front of your eyes, and you’ll instantly be able to check that processed data looks the way that it should.

Early on, you’d certainly love to check your data preparation steps, and also get a sense of their predictive power, wouldn’t you? Good news, then! For regression and classification problem types, the “Quick model” capability lets you select a subset of your data, train a model, and determine which features are contributing most to the predicted outcome. Looking at the model, you can easily diagnose and fix data preparation issues as early as possible, and determine whether additional feature engineering is needed to improve your model performance.

Once you’re happy with your pipeline, you can export it in one click to a Python script that faithfully reproduces your manual steps. You won’t waste any time chasing discrepancies, and you can directly add this code to your ML project.

In addition, you can also export your processing code to:

  • Amazon SageMaker Processing
  • Amazon SageMaker Pipelines
  • Amazon SageMaker Feature Store

Now, let’s do a quick demo, and show you how easy it is to work with SageMaker Data Wrangler.

Using Amazon SageMaker Data Wrangler
Opening SageMaker Studio, I create a new data flow in order to process the Titanic dataset, which contains information on passengers, and labels showing whether they survived the wreck or not.

SageMaker screenshot

My dataset is stored as a CSV file in Amazon Simple Storage Service (S3), and I select the appropriate data source.

SageMaker screenshot

Using the built-in tool, I quickly navigate my S3 buckets, and I locate the CSV file containing my data. For larger datasets, SageMaker Data Wrangler also supports the Parquet format.

As I select my file, SageMaker Data Wrangler shows me the first few rows.

SageMaker screenshot

I import the dataset, and I’m presented with an initial view of the data flow. Right-clicking on the dataset, I select “Edit data types” to make sure that SageMaker Data Wrangler has correctly detected the type of each column in the dataset.

SageMaker screenshot

Checking each column, it looks like all types are correct.

SageMaker screenshot

Moving back to the data flow view, I select “Add analysis” this time. This opens a new view where I can visualize data using histograms, scatter plots, and more. For example, I build a histogram showing the age distribution of passengers according to their survival status, with the bins colored by gender. Of course, I can save it for future use.

SageMaker screenshot

Moving back to the data flow view once again, I select “Add transform” in order to start processing the dataset. This opens a new view, showing me the first lines of the dataset, as well as a list of 300+ built-in transforms.

SageMaker screenshot

Pclass, the passenger class, is a categorical variable, and I decide to encode it using one-hot encoding. This creates 3 new columns representing different dimensions, and I can preview them. As this is exactly what I wanted, I apply this transform for good. Likewise, I apply the same transform to the Sex column.

SageMaker screenshot

Then, I drop the original Pclass column. Using the same transform, I also drop the Name column.

SageMaker screenshot

In order to get a quick idea of whether these transformations increase or decrease the accuracy of the model, I can create an analysis that trains a model on the spot. As my problem is a binary classification problem, SageMaker Data Wrangler uses a metric called the F1 score. 0.749 is a good start, and additional processing would certainly improve it. I can also see which features contribute most to the predicted outcome: sex, age, and being a third class passenger.

SageMaker screenshot

Then, moving to the “Export” view, I select all the transforms I’ve created so far, in order to add them to my ML project.

SageMaker screenshot

Here, I select “Python Code” to generate a Python script. Other options are available for Amazon SageMaker Processing, Amazon SageMaker Pipelines, and Amazon SageMaker Feature Store.

SageMaker screenshot

A few seconds later, the script is available. I could add it as is to my ML project, and rest assured that my data preparation steps would be consistent with the interactive transforms that I’ve created above.

SageMaker screenshot

Getting Started
As you can see, Amazon SageMaker Data Wrangler makes it really easy to work interactively on data preparation steps, before transforming them into code that can be used immediately for experimentation and production.

You can start using this capability today in all regions where SageMaker Studio is available.

Give it a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleague Peter Liu for his precious help during early testing.

New – Store, Discover, and Share Machine Learning Features with Amazon SageMaker Feature Store

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-store-discover-and-share-machine-learning-features-with-amazon-sagemaker-feature-store/

Today, I’m extremely happy to announce Amazon SageMaker Feature Store, a new capability of Amazon SageMaker that makes it easy for data scientists and machine learning engineers to securely store, discover and share curated data used in training and prediction workflows.

For all the importance of selecting the right algorithm to train machine learning (ML) models, experienced practitioners know how crucial it is to feed it with high-quality data. Cleaning data is a good first step, and ML workflows routinely include steps to fill missing values, remove outliers, and so on. Then, they often move on to transforming data, using a mix of common and arcane techniques known as “feature engineering.”

Simply put, the purpose of feature engineering is to transform your data and to increase its expressiveness so that the algorithm may learn better. For instance, many columnar datasets include strings, such as street addresses. To most ML algorithms, strings are meaningless, and they need to be encoded in a numerical representation. Thus, you could replace street addresses with GPS coordinates, a much more expressive way to learn the concept of location. In other words, if data is the new oil, then feature engineering is the refining process that turns it into high-octane jet fuel that helps models get to stratospheric accuracy.

Indeed, ML practitioners spend a lot of time crafting feature engineering code, applying it to their initial datasets, training models on the engineered datasets, and evaluating model accuracy. Given the experimental nature of this work, even the smallest project will lead to multiple iterations. The same feature engineering code is often run again and again, wasting time and compute resources on repeating the same operations. In large organizations, this can cause an even greater loss of productivity, as different teams often run identical jobs, or even write duplicate feature engineering code because they have no knowledge of prior work.

There’s another hard problem that ML teams have to solve. As models are trained on engineered datasets, it’s imperative to apply the same transformations to data sent for prediction. This often means rewriting feature engineering code, sometimes in a different language, integrating it in your prediction workflow, and running it at prediction time. This whole process is not only time-consuming, it can also introduce inconsistencies, as even the tiniest variation in a data transform can have a large impact on predictions.

In order to solve these problems, ML teams sometimes build a feature store, a central repository where they can keep and retrieve engineered data used in their training and predictions jobs. As useful as feature stores are, building and managing your own involves a lot of engineering, infrastructure, and operational effort that takes valuable time away from actual ML work. Customers asked us for a better solution, and we got to work.

Introducing Amazon SageMaker Feature Store
Amazon SageMaker Feature Store is a fully managed centralized repository for your ML features, making it easy to securely store and retrieve features without having to manage any infrastructure. It’s part of Amazon SageMaker, our fully managed service for ML, and supports all algorithms. It’s also integrated with Amazon SageMaker Studio, our web-based development environment for ML.

Features stored in SageMaker Feature Store are organized in groups, and tagged with metadata. Thanks to this, you can quickly discover which features are available, and whether they’re suitable for your models. Multiple teams can also easily share and re-use features, reducing the cost of development and accelerating innovation.

Once stored, features can be retrieved and used in your SageMaker workflows: model training, batch transform, and real-time prediction with low latency. Not only do you avoid duplicating work, you also build consistent workflows that use the same features stored in the offline and online stores.

The Climate Corporation (Climate) is a subsidiary of Bayer, and the industry leader in bringing digital innovation to farmers. Says Daniel McCaffrey, Vice President, Data and Analytics, Climate: “At Climate, we believe in providing the world’s farmers with accurate information to make data driven decisions and maximize their return on every acre. To achieve this, we have invested in technologies such as machine learning tools to build models using measurable entities known as features, such as yield for a grower’s field. With Amazon SageMaker Feature Store, we can accelerate the development of ML models with a central feature store to access and reuse features across multiple teams easily. SageMaker Feature Store makes it easy to access features in real-time using the online store, or run features on a schedule using the offline store for different use cases, and we can develop ML models faster.”

Care.com, the world’s leading platform for finding and managing high-quality family care, is also using Amazon SageMaker Feature Store. This is what Clemens Tummeltshammer, Data Science Manager, Care.com, told us: “A strong care industry where supply matches demand is essential for economic growth from the individual family up to the nation’s GDP. We’re excited about Amazon SageMaker Feature Store and Amazon SageMaker Pipelines , as we believe they will help us scale better across our data science and development teams, by using a consistent set of curated data that we can use to build scalable end-to-end machine learning model pipelines from data preparation to deployment. With the newly announced capabilities of Amazon SageMaker, we can accelerate development and deployment of our ML models for different applications, helping our customers make better informed decisions through faster real-time recommendations.

Now, let’s see how you can get started.

Storing and Retrieving Features with Amazon SageMaker Feature Store
Once you’ve run your feature engineering code on your data, you can organize and store your engineered features in SageMaker Feature Store, by grouping them in feature groups. A feature group is a collection of records, similar to rows in a table. Each record has a unique identifier, and holds the engineered feature values for one of the data instances in your original data source. Optionally, you can choose to encrypt the data at rest using your own AWS Key Management Service (KMS) key that is unique for each feature group.

How you define feature groups is up to you. For example, you could create one per data source (CSV files, database tables, and so on), and use a convenient unique column as the record identifier (primary key, customer id, transaction id, and so on).

Once you’ve got your groups figured out, you should repeat the following steps for each group:

  1. Create feature definitions, with the name and the type of each feature in a record (Fractional, Integral, or String).
  2. Create each feature group with the create_feature_group() API:
    sm_feature_store.create_feature_group(
         # The name of the feature group
         FeatureGroupName=my_feature_group_name,
         # The name of the column acting as the record identifier
         RecordIdentifierName=record_identifier_name,
         # The name of the column acting as the feature timestamp
         EventTimeFeatureName=event_time_feature_name,
         # A list of feature names and types
         FeatureDefinitions=my_feature_definitions,
         # Optionally, enable the online feature store
         OnlineStoreConfig=online_store_config,
         # The S3 location for the offline feature store
         OfflineStoreConfig=offline_store_config,
         # An IAM role
         RoleArn=role
    )
  3. In each feature group, store records containing a collection of feature name/feature value pairs, using the put_record() API:
    sm_feature_store.put_record(
       FeatureGroupName=feature_group_name,
       Record=record,
       EventTime=event_time
    )

    For faster ingestion, you could create multiple threads and parallelize this operation.
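
As an illustration, here is one way to fan the put_record() calls above out over a small pool of worker threads. The records list, the sm_feature_store client, and the thread count are assumptions carried over from the previous snippets:

from concurrent.futures import ThreadPoolExecutor

# 'records' is assumed to be a list of (record, event_time) tuples prepared earlier
def ingest(item):
    record, event_time = item
    sm_feature_store.put_record(
        FeatureGroupName=feature_group_name,
        Record=record,
        EventTime=event_time
    )

# Fan the calls out over several worker threads for faster ingestion
with ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(ingest, records)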

At this point, features will be available in Amazon SageMaker Feature Store. Thanks to the offline store, you can use services such as Amazon Athena, AWS Glue, or Amazon EMR to build datasets for training: fetch the corresponding JSON objects in S3, select the features that you need, and save them in S3 in the format expected by your ML algorithm. From then on, it’s SageMaker business as usual!

In addition, you can use the get_record() API to access individual records stored in the online store, passing the group name and the unique identifier of the record you want to access, like so:

record = sm_feature_store.get_record(
    FeatureGroupName=my_feature_group_name,
    RecordIdentifierValue={"IntegralValue": 5962}
)

Amazon SageMaker Feature Store is designed for fast and efficient access for real time inference, with a P95 latency lower than 10ms for a 15-kilobyte payload. This makes it possible to query for engineered features at prediction time, and to replace raw features sent by the upstream application with the exact same features used to train the model. Feature inconsistencies are eliminated by design, letting you focus on building the best models instead of chasing bugs.

Finally, as SageMaker Feature Store includes feature creation timestamps, you can retrieve the state of your features at a particular point in time.

As Amazon SageMaker Feature Store is integrated with SageMaker Studio, I can see my two feature groups there.

SageMaker screenshot

Right-clicking on one of them and selecting “Open feature group detail”, I open the identity feature group.

SageMaker screenshot

I can see feature definitions.

SageMaker screenshot

Finally, I can generate queries for the offline store, which I could add to an Amazon SageMaker Data Wrangler workflow to load features prior to training.

SageMaker screenshot

How to Get Started with Amazon SageMaker Feature Store
As you can see, SageMaker Feature Store makes it easy to store, retrieve, and share features required by your training and prediction workflows.

SageMaker Feature Store is available in all regions where SageMaker is available. Pricing is based on feature reads and writes, and on the total amount of data stored.

Here are sample notebooks that will help you get started right away. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Amazon SageMaker Edge Manager Simplifies Operating Machine Learning Models on Edge Devices

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-edge-manager-simplifies-operating-machine-learning-models-on-edge-devices/

Today, I’m extremely happy to announce Amazon SageMaker Edge Manager, a new capability of Amazon SageMaker that makes it easier to optimize, secure, monitor, and maintain machine learning models on a fleet of edge devices.

Edge computing is certainly one of the most exciting developments in information technology. Indeed, thanks to continued advances in compute, storage, networking, and battery technology, organizations routinely deploy large numbers of embedded devices anywhere on the planet for a wide range of industry applications: manufacturing, energy, agriculture, healthcare, and more. Ranging from simple sensors to large industrial machines, the devices have a common purpose: capture data, analyze it, and act on it, for example send an alert if an unwanted condition is detected.

As machine learning (ML) demonstrated its ability to solve a wide range of business problems, customers tried to apply it to edge applications, training models in the cloud and deploying them at the edge in an effort to extract deeper insights from local data. However, given the remote and constrained nature of edge devices, deploying and managing models at the edge is often quite difficult.

For example, a complex model can be too large to fit, forcing customers to settle for a smaller and less accurate model. Also, predicting with several models on the same device (say, to detect different types of anomalies) may require additional code to load and unload models on demand, in order to conserve hardware resources. Finally, monitoring prediction quality is a major concern, as the real world will always be more complex and unpredictable than any training set can anticipate.

Customers asked us to help them solve these challenges, and we got to work.

Announcing Amazon SageMaker Edge Manager
Amazon SageMaker Edge Manager makes it easy for ML edge developers to use the same familiar tools in the cloud or on edge devices. It reduces the time and effort required to get models to production, while continuously monitoring and improving model quality across your device fleet.

Starting from a model that you trained or imported in Amazon SageMaker, SageMaker Edge Manager first optimizes it for your hardware platform using Amazon SageMaker Neo. Launched two years ago, Neo converts models into an efficient common format which is executed on the device by a low footprint runtime. Neo currently supports devices based on chips manufactured by Ambarella, ARM, Intel, NVIDIA, NXP, Qualcomm, TI, and Xilinx.

Then, SageMaker Edge Manager packages the model, and stores it in Amazon Simple Storage Service (S3), where it can be deployed to your devices. In fact, you can deploy multiple models, loading and predicting with a runtime optimized for your hardware of choice.

On-device models are managed by the SageMaker Edge Manager agent, which communicates with the AWS Cloud for model deployment, and with your application for model management. Indeed, you can integrate this agent with your application, so that it may automatically load and unload models according to your prediction requests. This enables a variety of scenarios, such as freeing all resources for a large model whenever needed, or working with a collection of smaller models that cohabit in memory.

Lenovo, the #1 global PC maker, recently incorporated Amazon SageMaker into its latest predictive maintenance offering. Igor Bergman, Lenovo Vice President, Cloud & Software of PCs and Smart Devices, told us: “At Lenovo, we’re more than a hardware provider and are committed to being a trusted partner in transforming customers’ device experience and delivering on their business goals. Lenovo Device Intelligence is a great example of how we’re doing this with the power of machine learning, enhanced by Amazon SageMaker. With Lenovo Device Intelligence, IT administrators can proactively diagnose PC issues and help predict potential system failures before they occur, helping to decrease downtime and increase employee productivity. By incorporating Amazon SageMaker Neo, we’ve already seen a substantial improvement in the execution of our on-device predictive models – an encouraging sign for the new Amazon SageMaker Edge Manager that will be added in the coming weeks. SageMaker Edge Manager will help eliminate the manual effort required to optimize, monitor, and continuously improve the models after deployment. With it, we expect our models will run faster and consume less memory than with other comparable machine learning platforms. As we extend AI to new applications across the Lenovo services portfolio, we will continue to require a high-performance pipeline that is flexible and scalable both in the cloud and on millions of edge devices. That’s why we selected the Amazon SageMaker platform. With its rich edge-to-cloud and CI/CD workflow capabilities, we can effectively bring our machine learning models to any device workflow for much higher productivity.

Getting Started
As you can see, SageMaker Edge Manager makes it easier to work with ML models deployed on edge devices. It’s available today in the US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), Europe (Frankfurt), and Asia Pacific (Tokyo) regions.

Sample notebooks are available to get you started right away. Give them a try, and let us know what you think.

We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

New – Amazon SageMaker Clarify Detects Bias and Increases the Transparency of Machine Learning Models

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-amazon-sagemaker-clarify-detects-bias-and-increases-the-transparency-of-machine-learning-models/

Today, I’m extremely happy to announce Amazon SageMaker Clarify, a new capability of Amazon SageMaker that helps customers detect bias in machine learning (ML) models, and increase transparency by helping explain model behavior to stakeholders and customers.

As ML models are built by training algorithms that learn statistical patterns present in datasets, several questions immediately come to mind. First, can we ever hope to explain why our ML model comes up with a particular prediction? Second, what if our dataset doesn’t faithfully describe the real-life problem we were trying to model? Could we even detect such issues? Would they introduce some sort of bias in imperceptible ways? As we will see, these are not speculative questions at all. They are very real, and their implications can be far-reaching.

Let’s start with the bias problem. Imagine that you’re working on a model detecting fraudulent credit card transactions. Fortunately, the huge majority of transactions are legitimate, and they make up 99.9% of your dataset, meaning that you only have 0.1% fraudulent transactions, say 100 out of 100,000. Training a binary classification model (legitimate vs. fraudulent), there’s a strong chance that it would be strongly influenced or biased by the majority group. In fact, a trivial model could simply decide that transactions are always legitimate: as useless as this model would be, it would still be right 99.9% of the time! This simple example shows how careful we have to be about the statistical properties of our data, and about the metrics that we use to measure model accuracy.

There are many variants of this under-representation problem. As the number of classes, features, and unique feature values increase, your dataset may only contain a tiny number of training instances for certain groups. In fact, some of these groups may correspond to various socially sensitive features such as gender, age range, or nationality. Under-representation for such groups could result in a disproportionate impact on their predicted outcomes.

Unfortunately, even with the best of intentions, bias issues may exist in datasets and be introduced into models with business, ethical, and regulatory consequences. It is thus important for model administrators to be aware of potential sources of bias in production systems.

Now, let’s discuss the explainability problem. For simple and well-understood algorithms like linear regression or tree-based algorithms, it’s reasonably easy to crack the model open, inspect the parameters that it learned during training, and figure out which features it predominantly uses. You can then decide whether this process is consistent with your business practices, basically saying: “yes, this is how a human expert would have done it.”

However, as models become more and more complex (I’m staring at you, deep learning), this kind of analysis becomes impossible. Just like the prehistoric tribes in Stanley Kubrick’s “2001: A Space Odyssey,” we’re often left staring at an impenetrable monolith and wondering what it all means. Many companies and organizations may need ML models to be explainable before they can be used in production. In addition, some regulations may require explainability when ML models are used as part of consequential decision making, and closing the loop, explainability can also help detect bias.

Thus, our customers asked us for help on detecting bias in their datasets and their models, and on understanding how their models make predictions. We got to work, and came up with SageMaker Clarify.

Introducing Amazon SageMaker Clarify
SageMaker Clarify is a new set of capabilities for Amazon SageMaker, our fully managed ML service. It’s integrated with SageMaker Studio, our web-based integrated development environment for ML, as well as with other SageMaker capabilities like Amazon SageMaker Data Wrangler, Amazon SageMaker Experiments, and Amazon SageMaker Model Monitor.

Thanks to SageMaker Clarify, data scientists are able to:

  • Detect bias in datasets prior to training, and in models after training.
  • Measure bias using a variety of statistical metrics.
  • Explain how feature values contribute to the predicted outcome, both for the model overall and for individual predictions.
  • Detect bias drift and feature importance drift over time, thanks to the integration with Amazon SageMaker Model Monitor.

Let’s look at each of these capabilities.

Detecting dataset bias: This is an important first step. Indeed, a heavily biased dataset may well be unsuitable for training. Knowing this early on certainly saves you time, money, and frustration! Looking at bias metrics computed by SageMaker Clarify on your dataset, you can then add your own bias reduction techniques to your data processing pipeline. Once the dataset has been revised and processed, you can measure bias again, and check if it has actually decreased.

Detecting model bias: After you’ve trained your model, you can run a SageMaker Clarify bias analysis, which includes automatic deployment to a temporary endpoint, and computation of bias metrics using your model and dataset. By computing these metrics, you can figure out if your trained model has similar predictive behavior across groups.

Measuring bias: SageMaker Clarify lets you pick from many different bias metrics. I’ll just give you a few examples here, along with a small hand-computed sketch of the first one after the list.

  • Difference in positive proportions in labels (DPL): Are labels in the dataset correlated or not with specific sensitive feature values? For example, do people living in a certain city have a better chance of getting a positive answer?
  • Difference in positive proportions in predicted labels (DPPL): Do we overpredict positive labels for a certain group?
  • Accuracy difference (AD): Are the predictions by the model more accurate for one group than the other?
  • Counterfactuals – Fliptest (FT): Suppose we look at each member of one group, and compare with similar members from the other group. Do they get different model predictions?
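
To make the first of these metrics concrete, here’s a small hand computation of DPL on a made-up toy dataset. This is just the textbook definition (the difference in the proportion of positive labels between two groups), not SageMaker Clarify code; Clarify computes it for you from the configuration shown later.

import pandas as pd

# Hypothetical toy data: 'city' is the sensitive feature (facet), and
# 'approved' is the binary label (1 = positive outcome).
df = pd.DataFrame({
    "city":     ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "approved": [ 1,   1,   1,   0,   1,   0,   0,   0,   1,   0 ],
})

# Proportion of positive labels in each group.
q_a = df.loc[df.city == "A", "approved"].mean()
q_b = df.loc[df.city == "B", "approved"].mean()

# DPL: difference in positive proportions in labels between the two groups.
print(f"DPL = {q_a:.2f} - {q_b:.2f} = {q_a - q_b:.2f}")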

Explaining predictions – to explain how your model predicts, SageMaker Clarify supports a popular technique called SHapley Additive exPlanations (SHAP). Originating in game theory, SHAP analyzes for each data instance the individual contribution of feature values to the predicted output, and represents them as a positive or negative value. For example, predicting with a credit application model, you could see that Alice’s application is approved with a score of 87.5%, that her employment status (+27.2%) and her credit score (+32.4%) are the strongest contributors to this score, and that her income level has a slight negative impact (-5%). Such insights are crucial in building trust that the model is working as expected, and in explaining to customers and regulators why it comes up with a particular prediction. Further analysis of the SHAP values for your complete dataset can also help identify the relative importance of features and feature values, potentially leading to the discovery of prediction issues and biases.
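
Outside of Clarify, the same idea can be sketched with the open-source shap package. The snippet below assumes you already have a trained XGBoost model and a pandas DataFrame X holding the features; it only illustrates the concept, since Clarify computes SHAP values for you as part of the processing job described next.

import shap

# `model` is a trained XGBoost classifier, `X` a DataFrame of features
# (both assumed to exist already).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# One row per data instance, one column per feature: positive values push
# the prediction up, negative values push it down.
print(shap_values[0])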

As you can see, SageMaker Clarify has some pretty powerful features for bias detection and explainability. Fortunately, it also makes them very easy to use. First, you should upload a clean and pre-processed copy of your tabular dataset (CSV or JSON) to Amazon Simple Storage Service (S3). Then, using a built-in container, you just launch an Amazon SageMaker Processing job on your dataset, passing a short configuration file defining the name of the target attribute, the name and values of the sensitive columns to analyze for bias, and the bias metrics that you want to compute. As you would expect, this job runs on fully managed infrastructure. For post-training analysis, a temporary endpoint is also automatically created and deleted by the job. Once the job is complete, results are available in S3 and in SageMaker Studio, and include an auto-generated report that summarizes the results.

Now, I’d like to show you how to get started with SageMaker Clarify.

Exploring Datasets and Models with Amazon SageMaker Clarify
The German Credit Data dataset contains 1,000 labeled credit applications, which I’ve used to train a binary classification model with XGBoost. Each data instance has 20 features, such as credit purpose, credit amount, housing status, employment history, and more. Categorical features have been encoded with Axx values. For example, here’s how the credit history feature is encoded: A30 means ‘no credits taken’, A31 means ‘all credits at this bank paid back duly’, and so on.

In particular, the dataset includes a feature telling us if a customer is a foreign worker. In fact, a quick look at the dataset hints at a large imbalance in favor of foreign workers. Could bias be hiding there? What about the model? Did XGBoost increase or decrease the bias? Which features contribute most to the predicted output? Let’s find out.

After training the model, my next step is to run a SageMaker Clarify bias analysis job on the dataset, using a built-in container image that will compute bias metrics. The job inputs are the dataset, and a JSON configuration file that defines:

  • The name of the target attribute (Class1Good2Bad), and the value for the positive answer (1).
  • The sensitive features to analyze (called “facets”), and their value. Here, we want to focus on instances where ForeignWorker is set to 0, as they seem to be under-represented in the dataset.
  • The bias metrics that the job should compute. As I already have a model, I pass its name so that post-training metrics can be computed on a temporary endpoint.

Here’s the relevant snippet in the configuration file:

"label": "Class1Good2Bad",
"label_values_or_threshold": [1],
"facet": [
        {
         "name_or_index" : "ForeignWorker",
         "value_or_threshold": [0]
        }
    ],
. . .
"methods": {
    "pre_training_bias": {"methods": "all"}
    "post_training_bias": {"methods": "all"}
},
"predictor": {
        "model_name": "xgboost-german-model",
        "instance_type": "ml.m5.xlarge",
        "initial_instance_count": 1
    }

Then, I configure the job inputs (the dataset and the configuration file) and output (the report), passing all appropriate paths in S3:

from sagemaker.processing import ProcessingInput, ProcessingOutput

config_input = ProcessingInput(
                    input_name="analysis_config",
                    source=analysis_config_s3_path,
                    destination="/opt/ml/processing/input/config")
data_input = ProcessingInput(
                    input_name="dataset",
                    source=train_data_s3_path,
                    destination="/opt/ml/processing/input/data")
result_output = ProcessingOutput(
                    source="/opt/ml/processing/output",
                    destination=analysis_result_s3_path,
                    output_name="analysis_result")

Finally, I run the processing job.

import sagemaker
from sagemaker.processing import Processor

analyzer_image_uri = '678264136642.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xai-analyzer:latest'
analyzer = Processor(base_job_name='analyzer',
                     image_uri=analyzer_image_uri,
                     role=sagemaker.get_execution_role(),
                     instance_count=1,
                     instance_type='ml.c5.xlarge')
analyzer.run(inputs=[data_input, config_input], outputs=[result_output])

Once the processing job is complete, I can retrieve the report. Let’s look at bias metrics.
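
To pull the output down from S3 for a closer look, one option is the S3Downloader helper in the SageMaker Python SDK (analysis_result_s3_path is the output path configured earlier):

from sagemaker.s3 import S3Downloader

# Download the JSON results and the auto-generated report locally.
S3Downloader.download(analysis_result_s3_path, "clarify_output")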

Detecting Bias with Amazon SageMaker Clarify
Here are some of the pre-training bias metrics:

"ForeignWorker": [
    {
     "value_or_threshold": "0",
     "metrics": [
      {
       "name": "CI",
       "description": "Class Imbalance (CI)",
       "value": 0.9225
      },
      {
       "name": "DPL",
       "description": "Difference in Positive Proportions in Labels (DPL)",
       "value": -0.21401904442300435
      },
. . . 

The class imbalance metric confirms our visual impression: the proportion of foreign workers in the dataset exceeds that of domestic workers by about 92 percentage points. Whether or not this imbalance is responsible, we can also see that the difference in positive proportions for domestic workers is quite negative. In other words, there’s a smaller proportion of domestic workers with positive labels. This statistical pattern could be picked up by an ML algorithm, leading to a larger proportion of domestic workers getting negative answers. Figuring out whether this is actually legitimate or not would require further analysis, and in any case, it’s great that SageMaker Clarify warned us about this potential issue.
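
For reference, CI for a binary facet follows a simple formula; the sketch below uses hypothetical group counts chosen to land near the reported value, not the actual counts from the dataset.

# Class Imbalance (CI) = (n_a - n_d) / (n_a + n_d), where n_a is the number
# of instances in the advantaged group and n_d in the disadvantaged group.
def class_imbalance(n_a: int, n_d: int) -> float:
    return (n_a - n_d) / (n_a + n_d)

# Hypothetical counts, roughly consistent with the 0.9225 reported above.
print(class_imbalance(n_a=961, n_d=39))   # 0.922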

As I provided a trained model, post-training metrics are also available. Comparing the DPPL and the DPL, I can see that XGBoost has slightly reduced bias on positive proportions (-18.8% vs -21.4%). We also see that DAR is negative, indicating that the model achieves higher precision for domestic workers compared to foreign workers.

"ForeignWorker": [
    {
     "value_or_threshold": "0",
     "metrics": [
      {
       "name": "DPPL",
       "description": "\"Difference in Positive Proportions in Predicted Labels (DPPL)\")",
       "value": -0.18801124208230213
      },
      {
       "name": "DAR",
       "description": "Difference in Acceptance Rates (DAR)",
       "value": -0.050909090909090904
      },
      {
       "name": "DRR",
       "description": "Difference in Rejection Rates (DRR)",
       "value": 0.0365296803652968
      },
. . .

As SageMaker Clarify is integrated with SageMaker Studio, I can visualize bias metrics there. All I have to do is find the processing job in the list of trials, right-click it, select “Open in trial details”, and then choose the “Bias report” view.

SageMaker screenshot

Finally, deciding whether a high value of a certain bias metric is problematic involves domain-specific considerations, guided by ethical, social, regulatory, and business concerns. Similarly, interventions for removing bias often require a careful analysis of the entire ML lifecycle, from problem formulation to feedback loops in deployment.

Now, let’s see how SageMaker Clarify helps us understand what features the models base their predictions on.

Explaining Predictions with Amazon SageMaker Clarify
The report includes global SHAP values, showing the relative importance of all the features in the dataset. On the feature importance graph available in SageMaker Studio, I see that the three most important features are credit duration, not having a checking account (A14), and the loan amount. All things being equal, the bank probably sees you as a safer customer if you’re borrowing a small amount over a short period of time, and without the possibility to write checks!

SageMaker screenshot

In S3, I can also find a CSV file with SHAP values for individual data instances, giving me a complete picture of feature and feature value importance.

Getting Started
As you can see, SageMaker Clarify is a powerful tool to detect bias and to understand how your model works. You can start using it today in all regions where Amazon SageMaker is available, at no additional cost.

Sample notebooks are available to get you started quickly. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Special thanks to my colleagues Sanjiv Das, Michele Donini, Jason Gelman, Krishnaram Kenthapadi, Pinar Yilmaz, and Bilal Zafar for their precious help.

Dataset credits: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

New – Profile Your Machine Learning Training Jobs With Amazon SageMaker Debugger

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/profile-your-machine-learning-training-jobs-with-amazon-sagemaker-debugger/

Today, I’m extremely happy to announce that Amazon SageMaker Debugger can now profile machine learning models, making it much easier to identify and fix training issues caused by hardware resource usage.

Despite its impressive performance on a wide range of business problems, machine learning (ML) remains a bit of a mysterious topic. Getting things right is an alchemy of science, craftsmanship (some would say wizardry), and sometimes luck. In particular, model training is a complex process whose outcome depends on the quality of your dataset, your algorithm, its parameters, and the infrastructure you’re training on.

As ML models become ever larger and more complex (I’m looking at you, deep learning), one growing issue is the amount of infrastructure required to train them. For instance, training an object detection model on the publicly available COCO dataset takes well over six hours on a single p3dn.24xlarge instance, even with its eight NVIDIA V100 GPUs. Some customers, like autonomous vehicle companies, deal with much larger datasets and train such models for several days.

When a complex training job takes this long, the odds that something goes wrong and ruins it are pretty high, not only wasting time and money but also causing lots of frustration. Important work needs to be put on the back burner while you investigate, figure out the root cause, try to fix it, and then run your training job again. Often, you’ll have to iterate quite a few times to nail the problem.

Depending on the ML framework that you use, and sometimes on its version, you may or may not be able to use existing framework-specific tools. Often, you’ll have to build and maintain your own bespoke tools. Even for experienced practitioners, this is a lot of hard work. For regular developers like me, this is an utterly daunting task.

Introducing Model Profiling in Amazon SageMaker Debugger
Launched last year at AWS re:Invent, Amazon SageMaker Debugger is a capability of Amazon SageMaker that automatically identifies complex issues developing in ML training jobs. These include loss not decreasing, exploding gradients, and more.

Now, SageMaker Debugger can also monitor hardware resource usage, and allows you to profile your training job to help you correlate resource usage to ML operations in your training script. Thus, you’ll be able to resolve performance issues much quicker, and iterate through your training job much faster.

Chaim Rand, ML Algorithm Developer at Mobileye, an Intel company building automated driving and driver assistance systems, had the opportunity to work with the new profiling capabilities, and here’s what he told us: “Many of the assisted driving and autonomous vehicle technologies that we develop at Mobileye rely on training deep neural network models to detect a wide variety of road artifacts, including vehicles, pedestrians, speed bumps, road signs and more. Often, these models train on extremely large datasets, on multiple machines, and for periods of up to several days. For us, at Mobileye, it is imperative that we have a toolkit of advanced performance profiling capabilities, for analyzing the flow of data across the network, CPU, and GPU resources, and for pinpointing performance issues. The profiling functionality in SageMaker Debugger provides just that, taking performance profiling out of the domain of a few specialized experts, and empowering our algorithm developers to maximize training resource utilization, accelerate model convergence, and reduce cost.”

At launch, the new profiling capability of SageMaker Debugger is available for TensorFlow 2.x and PyTorch 1.x. All you have to do is to train with the corresponding built-in frameworks in Amazon SageMaker. Distributed training is supported out of the box.

Setting a single parameter in your SageMaker estimator, and without any change to your training code, you can enable the collection of infrastructure and model metrics such as:

  • CPU and GPU utilization,
  • RAM and GPU memory usage,
  • Network I/O,
  • Storage I/O (local storage and Pipe Mode),
  • Python metrics,
  • Data loading time,
  • Time spent in ML operators running on CPU and GPU,
  • Distributed training metrics for Horovod,
  • and many more.

In addition, you can visualize how much time is spent in different phases, such as preprocessing, training loop, and postprocessing. If needed, you can drill down on each training epoch, and even on each function in your training script.

By default, metrics are collected every 500ms; you can also set this value to 100ms, 200ms, 1s, 5s, or 1min. For finer-grained analysis, you can also enable and disable profiling explicitly in your training code, capturing metrics only for specific parts.

While your training job is running, you can easily visualize these metrics in Amazon SageMaker Studio, our web-based integrated development environment for ML. As you would expect, all data is also available through the SageMaker Debugger API, and you can retrieve it to build your own graphs.

Running in parallel with the training job, an Amazon SageMaker Processing job analyzes captured data, builds graphs, and generates a report providing insights into potential problems. This doesn’t require any work on your part, as the analysis runs inside a built-in container on fully managed infrastructure.

Now, let’s run a demo with PyTorch, where we’ll profile a ResNet-50 image classification model training on the CIFAR-10 dataset.

Profiling a Training Job with Amazon SageMaker Debugger
All it takes to enable profiling on your training job is an extra parameter in your SageMaker estimator. You don’t need to change a line in your training code. By default, SageMaker Debugger uses a set of built-in profiling rules looking for unwanted conditions that could develop during training, such as low GPU utilization. On top of reporting these conditions, SageMaker Debugger also emits events to Amazon CloudWatch Events. For example, I could use them to run an AWS Lambda function that automatically stops inefficient training jobs.
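
Such a function could be as simple as the sketch below. It assumes the CloudWatch Events rule forwards the training job name and the rule evaluation statuses in the event detail, so check the exact event shape against your rule configuration before relying on these field names.

import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Assumed event fields; verify them against your CloudWatch Events rule.
    job_name = event["detail"]["TrainingJobName"]
    rule_statuses = event["detail"].get("ProfilerRuleEvaluationStatuses", [])

    # Stop the job only if one of the profiling rules actually found issues.
    if any(r.get("RuleEvaluationStatus") == "IssuesFound" for r in rule_statuses):
        sm.stop_training_job(TrainingJobName=job_name)
        return {"stopped": job_name}
    return {"stopped": None}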

First, I create a profiling configuration capturing data every 500ms. Optionally, I could select a training step interval if I wanted to profile only a certain portion of the job.

import sagemaker
from sagemaker.debugger import ProfilerConfig

profiler_config = ProfilerConfig(system_monitor_interval_millis=500)
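
If I did want to profile only a certain portion of the job, a minimal sketch using the FrameworkProfile helper in the SageMaker Python SDK could look like this (the step values are illustrative):

from sagemaker.debugger import FrameworkProfile, ProfilerConfig

# Collect detailed framework metrics only between steps 5 and 15.
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10))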

Then, I pass this configuration to my PyTorch Estimator, training on an ml.p3.8xlarge instance equipped with 4 NVIDIA V100 GPUs.

from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    entry_point='train_pt.py',
    framework_version='1.5.0',
    hyperparameters={"batch_size":512, "epochs":10},
    profiler_config=profiler_config)

Then, I launch the training job as usual. Once the job is running, profiling data is captured and stored in S3.

path = estimator.latest_job_profiler_artifacts_path()
print(path)
s3://sagemaker-us-east-2-123456789012/pt-train-pt-2020-11-17-17-38-43-231/profiler-output

Using the SageMaker SDK, I can retrieve and count profiling events.

from smdebug.profiler.system_metrics_reader import S3SystemMetricsReader
system_metrics_reader = S3SystemMetricsReader(path)
system_metrics_reader.refresh_event_file_list()
last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()
events = system_metrics_reader.get_events(0, last_timestamp)
print("Found", len(events), "recorded system metric events. Latest recorded event:", last_timestamp)
Found 411853 recorded system metric events. Latest recorded event: 1605620280000000

Of course, I could parse and analyze these profiling events, build my own graphs, and so on. Instead, let’s visualize them in near real-time in SageMaker Studio.
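
For reference, a manual analysis could start with a sketch like this one; it assumes each event exposes name, dimension, value, and timestamp attributes, as in the smdebug sample notebooks.

import pandas as pd

# Turn the captured system metric events into a DataFrame for custom analysis.
df = pd.DataFrame(
    [{"timestamp": e.timestamp,
      "dimension": e.dimension,
      "name": e.name,
      "value": e.value} for e in events])

# Average utilization per metric, e.g. CPU and GPU usage across the job.
print(df.groupby(["dimension", "name"])["value"].mean())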

While my training job is still running, I locate it in SageMaker Studio, right-click it, and select “Open Debugger for insights”.

SageMaker screenshot

This opens a new tab, and I select the “Nodes” panel where I can see detailed statistics for each instance in my training job. So, how’s my training job doing? Feel free to click on the image below to zoom in.

SageMaker screenshot

Apparently, this job isn’t going great. GPU utilization and GPU memory utilization are desperately flat at around 10%. I’m definitely not pushing my multi-GPU instance hard enough. Maybe GPUs are not receiving data fast enough because the CPU can’t keep up? Let’s check the system utilization heatmap.

SageMaker screenshot

The CPU is taking a nap here, hardly ever exceeding 20% usage. This instance is definitely not busy enough. Is there anything I could do to fix this?

Switching to the “Overview” panel, I see that some of the built-in profiling rules have been triggered.

SageMaker screenshot

LowGPUUtilization confirms what I saw on the graphs above. BatchSize is very interesting, as it suggests increasing the size of mini-batches sent to the GPUs by the training script running on the CPU. This should definitely help fill GPU memory, put more GPU cores to work, speed up my training job, and improve infrastructure usage.
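
Acting on that recommendation would only require a larger batch size in the estimator used earlier; here’s a sketch, where the batch size value is illustrative and the right setting depends on GPU memory and on its effect on accuracy.

import sagemaker
from sagemaker.pytorch import PyTorch

# Same training job as before, with only the batch size increased.
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    entry_point='train_pt.py',
    framework_version='1.5.0',
    hyperparameters={"batch_size": 2048, "epochs": 10},
    profiler_config=profiler_config)
estimator.fit()   # relaunch the training job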

At this point, I would normally stop this inefficient training job and relaunch it with a larger batch size. Here, I’ll let it run to completion to show you the report generated by the SageMaker Processing job running in parallel with the training job.

Once the training job is complete, I can see its summary in the “Overview” panel.

SageMaker screenshot

Clicking on the “Download report” button, I get a very detailed report that includes additional metrics, for example the ratio between the different phases of the training job, or the ratio between the forward and backward pass.

SageMaker screenshot

I can also see information on the most time-consuming CPU and GPU operators, which is really important if I wanted to optimize my code. For example, the graph below tells me that the most time-consuming GPU operations in my training job are backward pass convolution operators.

SageMaker screenshot

There’s much more to read in the report (rules summary, training loop analysis, and more). A companion notebook is also available to understand how graphs have been built, and how you can tailor them to your own needs.

Getting Started
We’ve just scratched the surface, and there are many more features in Amazon SageMaker Debugger that make it easy to gather, analyze and visualize model profiling information. You can start using it today in all regions where Amazon SageMaker is available. You won’t be charged for any compute used to run built-in profiling rules.

You’ll find sample notebooks on Github, so give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

Amazon SageMaker Simplifies Training Deep Learning Models With Billions of Parameters

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-simplifies-training-deep-learning-models-with-billions-of-parameters/

Today, I’m extremely happy to announce that Amazon SageMaker simplifies the training of very large deep learning models that were previously difficult to train due to hardware limitations.

In the last 10 years, a subset of machine learning named deep learning (DL) has taken the world by storm. Based on neural networks, DL algorithms have an extraordinary ability to extract information patterns hidden in vast amounts of unstructured data, such as images, videos, speech, or text. Indeed, DL has quickly achieved impressive results on a variety of complex human-like tasks, especially on computer vision and natural language processing. In fact, innovation has never been faster, as DL keeps improving its results on reference tasks like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the General Language Understanding Evaluation (GLUE), or the Stanford Question Answering Dataset (SQUAD).

In order to tackle ever more complex tasks, DL researchers are designing increasingly sophisticated models, adding more neuron layers and more connections to improve pattern extraction and prediction accuracy, with a direct impact on model size. For example, you would get very good results on image classification with a 100-megabyte ResNet-50 model. For more difficult tasks such as object detection or instance segmentation, you would have to use larger models such as Mask R-CNN or YOLO v4, weighing in at about 250 megabytes.

As you can guess, model growth also impacts the amount of time and hardware resources required for model training, which is why Graphical Processing Units (GPU) have long been the preferred option to train and fine-tune large DL models. Thanks to their massively parallel architecture and their large on-board memory, they make it possible to use a technique called mini-batch training. By sending several data samples at once to the GPU, instead of sending them one by one, communication overhead is reduced, and training jobs are greatly accelerated. For example, the NVIDIA A100 available on the Amazon Elastic Compute Cloud (EC2) p4 family has over 7,000 compute cores and 40 gigabytes of fast onboard memory. Surely, that should be enough to train large batches of data on very large models, shouldn’t it?

Well, it’s not. Natural language processing behemoths such as OpenAI GPT-2 (1.5 billion parameters), T5-3B (3 billion parameters) and GPT-3 (175 billion parameters) consume tens or even hundreds of gigabytes of GPU memory. Likewise, state-of-the-art models working on high-resolution 3D images can be too large to fit in GPU memory, even with a batch size of 1…

Trying to square the circle, DL researchers use a combination of techniques, such as the following:

  • Buy more powerful GPUs, although we just saw that this is simply not an option for some models.
  • Work with less powerful models, and sacrifice accuracy.
  • Implement gradient checkpointing, a technique that trades compute for memory by discarding some intermediate activations during the forward pass and recomputing them during the backward pass, at the expense of a 20-30% training slowdown.
  • Implement model parallelism, that is to say split the model manually, and train its (smaller) pieces on different GPUs. Needless to say, this is an extremely difficult, time-consuming, and uncertain task, even for expert practitioners.

Customers have told us that none of the above was a satisfactory solution to working with very large models. They asked us for a simpler and more cost-effective solution, and we got to work.

Introducing Model Parallelism in Amazon SageMaker
Model parallelism in SageMaker automatically and efficiently partitions models across several GPUs, eliminating the need for accuracy compromises or for complex manual work. In addition, thanks to this scale-out approach to model training, not only can you work with very large models without any memory bottleneck, you can also leverage a large number of smaller and more cost-effective GPUs.

At launch, this is supported for TensorFlow and PyTorch, and it only requires minimal changes in your code. When you launch a training job, you can specify whether your model should be optimized for speed or for memory usage. Then, Amazon SageMaker runs an initial profiling job on your behalf in order to analyze the compute and memory requirements of your model. This information is then fed to a partitioning algorithm which decides how to split the model and how to map model partitions to GPUs, while minimizing communication. The outcome of the partitioning decision is saved to a file, which is passed as input to the actual training job.

As you can see, SageMaker takes care of everything. If you’d like, you could also manually profile and partition the model, then train on SageMaker.

Before we look at the code, I’d like to give you a quick overview of the internals.

Training with Model Partitions and Microbatches
As model partitions running on different GPUs expect forward pass inputs from each other (activation values), processing training mini-batches across a sequence of partitions would only keep one partition busy at all times, while stalling the other ones.

To avoid this inefficient behavior, mini-batches are split into microbatches that are processed in parallel on the different GPUs. For example, GPU #1 could be forward propagating microbatch n, while GPU #2 could do the same for microbatch n+1. Activation values can be stored, and passed to the next partition whenever it’s ready to accept them.

For back propagation, partitions also expect input values from each other (gradients). As a partition can’t simultaneously run forward and backward propagation, we could wait for all GPUs to complete the forward pass on their own microbatch, before letting them run the corresponding backward pass. This simple mode is available in Amazon SageMaker.

There’s an even more efficient option, called interleaved mode. Here, SageMaker replicates partitions according to the number of microbatches. For example, working with 2 microbatches, each GPU would run two copies of the partition it has received. Each copy would collaborate with partitions running on other GPUs, either for forward or backpropagation.

Here’s how things could look, with 4 different microbatches being processed by 2 duplicated partitions.

Illustration

To sum things up, interleaving the forward and backward passes of different microbatches is how SageMaker maximizes GPU utilization.

Now, let’s see how we can put this to work with TensorFlow.

Implementing Model Parallelism in Amazon SageMaker
Thanks to the SageMaker Model Parallelism (SMP) library, you can easily implement model parallelism in your own TensorFlow code (the process is similar for PyTorch). Here’s what you need to do:

  • Define and initialize the partitioning configuration.
  • Make your model a subclass of the DistributedModel class, using standard Keras subclassing.
  • Write and decorate with @smp.step a training function that represents a forward and backward step for the model. This function will be pipelined according to the architecture described in the previous section.
  • Optionally, do the same for an evaluation function that will also be pipelined.

Let’s apply this to a simple convolution network training on the MNIST dataset, using an ml.p3.8xlarge instance equipped with 4 NVIDIA V100 GPUs.

First, I initialize the SMP API.

import smdistributed.modelparallel.tensorflow as smp
smp.init()

Then, I subclass DistributedModel and build my model.

from tensorflow.keras.layers import Conv2D, Dense, Flatten

class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv = Conv2D(32, 3, activation="relu")
        self.flatten = Flatten()
        self.dense1 = Dense(128)
        self.dense2 = Dense(10)
. . .

This is what the training function looks like.

@smp.step
def forward_backward(images, labels):
    predictions = model(images, training=True)
    loss = loss_obj(labels, predictions)
    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss
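
To show how the pipelined microbatch outputs come back together, here’s a sketch of an outer training step. The accumulate() and reduce_mean() helpers on the returned StepOutput objects come from the SMP documentation and sample notebooks, so treat the exact names as assumptions to verify against the library version you use; model, optimizer, and loss_obj are the objects referenced by forward_backward above.

import tensorflow as tf

@tf.function
def train_step(images, labels):
    grads, loss = forward_backward(images, labels)
    # Sum the gradients over all microbatches before applying them once.
    grads = [g.accumulate() for g in grads]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Average the per-microbatch losses for reporting.
    return loss.reduce_mean()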

Then, I can train as usual with the TensorFlow estimator available in the SageMaker SDK. I only need to add the model parallelism configuration: 2 partitions (hence training on 2 GPUs), and 2 microbatches (hence 2 copies of each partition) with interleaving.

from sagemaker.tensorflow import TensorFlow

smd_mp_estimator = TensorFlow(
    entry_point="tf2.py",
    role=role,
    framework_version='2.3.1',
    py_version='py3',
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled":True,
                "parameters": {
                    "microbatches": 2,
                    "partitions": 2,
                    "pipeline": "interleaved",
                    "optimize": "memory",
                    "horovod": True, 
                }
             }
         },
        "mpi": {
            "enabled": True,
            "processes_per_host": 2, # Pick your processes_per_host
            "custom_mpi_options": mpioptions
        },
    }
)

Getting Started
As you can see, model parallelism makes it easier to train very large state-of-the-art deep learning models. It’s available today in all regions where Amazon SageMaker is available, at no additional cost.

Examples are available to get you started right away. Give them a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.

– Julien

 

Real-Time In-Stream Inference with AWS Kinesis, SageMaker & Apache Flink

Post Syndicated from Shawn Sachdev original https://aws.amazon.com/blogs/architecture/realtime-in-stream-inference-kinesis-sagemaker-flink/

As businesses race to digitally transform, they must cope with ever-growing volumes of data whose value diminishes over time. The challenge is to analyze, learn, and infer from real-time data to predict future states, as well as to detect anomalies and get accurate results. In this blog post, we’ll explain the architecture for a solution that can achieve real-time inference on streaming data. We’ll also cover the integration of Amazon Kinesis Data Analytics (KDA) with Apache Flink to asynchronously invoke any underlying services (or databases).

Managed real-time in-stream data inference is quite a mouthful; let’s break it up:

  • In-stream data processing refers to the capability of collecting, processing, and analyzing data as it flows through a stream.
  • Real-time inference refers to the ability to use the data in the stream to predict future states as the data arrives.

Consider a streaming application that captures credit card transactions along with other parameters (such as the source IP, which captures the geographic details of the transaction, and the transaction amount). This data can then be used to infer fraudulent transactions instantaneously. Compare that to a traditional batch-oriented approach that identifies fraudulent transactions at the end of every business day and generates a report when it’s too late, after bad actors have already committed fraud.

Architecture overview

In this post, we discuss how you can use Amazon Kinesis Data Analytics for Apache Flink (KDA), Amazon SageMaker, Apache Flink, and Amazon API Gateway to address challenges such as real-time fraud detection on a stream of credit card transaction data. We explore how to build a managed, reliable, scalable, and highly available streaming architecture based on managed services that substantially reduce the operational overhead compared to a self-managed environment. Our particular focus is on how to prepare and run Flink applications with KDA for Apache Flink.

The following diagram illustrates this architecture:

Run Apache Flink applications with KDA for Apache Flink

In the above architecture, data is ingested into Amazon Kinesis Data Streams (KDS) using the Amazon Kinesis Producer Library (KPL), and you can use any ingestion pattern supported by KDS. KDS then streams the data to an Apache Flink-based KDA application. KDA manages the required infrastructure for Flink, scales the application in response to changing traffic patterns, and automatically recovers from underlying failures. The Flink application is configured to call an API Gateway endpoint using Asynchronous I/O. Residing behind the API Gateway is an Amazon SageMaker endpoint, but any endpoint can be used based on your data enrichment needs. Flink distributes the data across one or more stream partitions, and user-defined operators can transform the data stream.

Let’s talk about some of the key pieces of this architecture.

What is Apache Flink?

Apache Flink is an open source distributed processing framework that is tailored to stateful computations over unbounded and bounded datasets. The architecture uses KDA with Apache Flink to run in-stream analytics and uses Asynchronous I/O operator to interact with external systems.

KDA and Apache Flink

KDA for Apache Flink is a fully managed AWS service that enables you to use an Apache Flink application to process streaming data. With KDA for Apache Flink, you can use Java or Scala to process and analyze streaming data. The service enables you to author and run code against streaming sources. KDA provides the underlying infrastructure for your Flink applications. It handles core capabilities like provisioning compute resources, parallel computation, automatic scaling, and application backups (implemented as checkpoints and snapshots).

Flink Asynchronous I/O Operator

Flink Asynchronous I/O Operator

Flink’s Asynchronous I/O operator allows you to use asynchronous request clients for external systems to enrich stream events or perform computation. Asynchronous interaction with the external system means that a single parallel function instance can handle multiple requests and receive the responses concurrently. In most cases this leads to higher streaming throughput. Asynchronous I/O API integrates well with data streams, and handles order, event time, fault tolerance, etc. You can configure this operator to call external sources like databases and APIs. The architecture pattern explained in this post is configured to call API Gateway integrated with SageMaker endpoints.

Please refer to the code at kda-flink-ml, a sample Flink application that implements the Asynchronous I/O operator to call an external SageMaker endpoint via API Gateway. Below is a snippet from StreamingJob.java in the sample application.

DataStream<HttpResponse<RideRequest>> predictFareResponse =
            // Asynchronously call predictFare Endpoint
            AsyncDataStream.unorderedWait(
                predictFareRequests,
                new Sig4SignedHttpRequestAsyncFunction<>(predictFareEndpoint, apiKeyHeader),
                30, TimeUnit.SECONDS, 20
            )
            .returns(newTypeHint<HttpResponse<RideRequest>() {});

The operator code above requires following inputs:

  1. An input data stream
  2. An implementation of AsyncFunction that dispatches the requests to the external system
  3. Timeout, which defines how long an asynchronous request may take before it is considered failed
  4. Capacity, which defines how many asynchronous requests may be in progress at the same time

How Amazon SageMaker fits into this puzzle

In our architecture, we propose a SageMaker endpoint for inference, invoked via API Gateway, that can detect fraudulent transactions.

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process, making it easier to develop high-quality models. You can use these trained models in an ingestion pipeline to make real-time inferences.

You can set up persistent endpoints to get predictions from your models that are deployed on SageMaker hosting services. For an overview on deploying a single model or multiple models with SageMaker hosting services, see Deploy a Model on SageMaker Hosting Services.
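
For context, the request that API Gateway ultimately forwards boils down to invoking the deployed endpoint, for example with boto3. The endpoint name and CSV payload below are placeholders for whatever your fraud-detection model expects.

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="fraud-detection-endpoint",   # placeholder endpoint name
    ContentType="text/csv",
    Body="120.50,840,0,3")                     # placeholder transaction features

print(response["Body"].read().decode("utf-8"))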

Ready for a test drive

To help you get started, we would like to introduce an AWS Solution: AWS Streaming Data Solution for Amazon Kinesis (Option 4), which is available as a single-click AWS CloudFormation template to assist you in quickly provisioning resources to get your real-time in-stream inference pipeline up and running in a few minutes. In this solution we use AWS Lambda, but it can be swapped for a SageMaker endpoint to achieve the architecture discussed earlier in this post. You can also leverage the pre-built AWS Solutions Construct, which implements an Amazon API Gateway connected to an Amazon SageMaker endpoint pattern that can replace AWS Lambda in the below solution. See the implementation guide for this solution.

The following diagram illustrates the architecture for the solution:

architecture for the solution

Conclusion

In this post we explained the architecture to build a managed, reliable, scalable, and highly available application that is capable of real-time inferencing on a data stream. The architecture was built using KDS, KDA for Apache Flink, Apache Flink, and Amazon SageMaker. The architecture also illustrates how you can use managed services so that you don’t need to spend time provisioning, configuring, and managing the underlying infrastructure. Instead, you can spend your time creating insights and inference from your data.

We also talked about the AWS Streaming Data Solution for Amazon Kinesis, which is an AWS vetted solution that provides implementations for applications you can automatically deploy directly into your AWS account. The solution automatically configures the AWS services necessary to easily capture, store, process, and infer from streaming data.