
Accelerate machine learning with AWS Data Exchange and Amazon Redshift ML


Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. Redshift ML allows you to use your data in Amazon Redshift with Amazon SageMaker, a fully managed ML service, without requiring you to become an expert in ML.

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. With AWS Data Exchange for Amazon Redshift, you can access and query tables in Amazon Redshift without extracting, transforming, and loading (ETL) the data.

As a subscriber, you can browse the AWS Data Exchange catalog, find data products relevant to your business whose data is stored in Amazon Redshift, and subscribe to the data directly from the providers, with no further processing and no ETL pipeline required.

If the provider data is not already available in Amazon Redshift, many providers will add the data to Amazon Redshift upon request.

In this post, we show you the process of subscribing to datasets through AWS Data Exchange without ETL, running ML algorithms on an Amazon Redshift cluster, and performing local inference to generate predictions.

Solution overview

The use case for the solution in this post is to predict ticket sales for worldwide events based on historical ticket sales data using a regression model. The data or ETL engineer can build the data pipeline by subscribing to the Worldwide Event Attendance product on AWS Data Exchange without ETL. You can then create the ML model in Redshift ML using the time series ticket sales data and predict future ticket sales.

To implement this solution, you complete the following high-level steps:

  1. Subscribe to datasets using AWS Data Exchange for Amazon Redshift.
  2. Connect to the datashare in Amazon Redshift.
  3. Create the ML model using the SQL notebook feature of the Amazon Redshift query editor V2.

The following diagram illustrates the solution architecture.

Prerequisites

Before starting this walkthrough, you must complete the following prerequisites:

  1. Make sure you have an existing Amazon Redshift cluster with an RA3 node type. If not, you can create a provisioned Amazon Redshift cluster.
  2. Make sure the Amazon Redshift cluster is encrypted, because the provider's cluster is encrypted and Amazon Redshift data sharing requires homogeneous encryption configurations. For more details on homogeneous encryption, refer to Data sharing considerations in Amazon Redshift.
  3. Create an AWS Identity and Access Management (IAM) role with access to SageMaker and Amazon Simple Storage Service (Amazon S3) and attach it to the Amazon Redshift cluster. Refer to Cluster setup for using Amazon Redshift ML for more details.
  4. Create an S3 bucket for storing the training data and model output.

Note that some of the AWS resources in this walkthrough incur charges. Remember to delete the resources when you're finished.

Subscribe to an AWS Data Exchange product with Amazon Redshift data

To subscribe to an AWS Data Exchange public dataset, complete the following steps:

  1. On the AWS Data Exchange console, choose Explore available data products.
  2. In the navigation pane, under Data available through, select Amazon Redshift to filter products with Amazon Redshift data.
  3. Choose Worldwide Event Attendance (Test Product).
  4. Choose Continue to subscribe.
  5. Confirm the subscription by checking that the product is listed on the Subscriptions page.

Predict tickets sold using Redshift ML

To set up prediction using Redshift ML, complete the following steps:

  1. On the Amazon Redshift console, choose Datashares in the navigation pane.
  2. On the Subscriptions tab, confirm that the AWS Data Exchange datashare is available.
  3. In the navigation pane, choose Query editor v2.
  4. Connect to your Amazon Redshift cluster in the navigation pane.

Amazon Redshift provides a feature to create notebooks and run your queries; the notebook feature is currently in preview and available in all Regions. For the remainder of this tutorial, we run queries in a notebook and add a markdown cell describing each step.

  5. Choose the plus sign and choose Notebook.
  6. Choose Add markdown.
  7. Enter Show available data share in the cell.
  8. Choose Add SQL.
  9. Enter and run the following command to see the available datashares for the cluster:
    SHOW datashares;

You should be able to see worldwide_event_test_data, as shown in the following screenshot.
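
If you prefer to retrieve these values with SQL rather than reading them from the SHOW output, you can also query the SVV_DATASHARES system view. The following is a minimal sketch; the view and its producer_account and producer_namespace columns come from the Amazon Redshift system catalog and aren't part of the original steps:
    -- Retrieve the producer account and namespace for the subscribed datashare
    SELECT share_name, producer_account, producer_namespace
    	FROM svv_datashares
    	WHERE share_name = 'worldwide_event_test_data';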

  10. Note the producer_namespace and producer_account values from the output, which we use in the next step.
  11. Choose Add markdown and enter Create database from datashare with producer_namespace and producer_account.
  12. Choose Add SQL and enter the following code to create a database to access the datashare. Use the producer_namespace and producer_account values you copied earlier.
    CREATE DATABASE ml_blog_db FROM DATASHARE worldwide_event_test_data OF ACCOUNT 'producer_account' NAMESPACE 'producer_namespace';

  13. Choose Add markdown and enter Create new table to consolidate features.
  14. Choose Add SQL and enter the following code to create a new table called event consisting of the event sales by date and assign a running serial number to split into training and validation datasets:
    CREATE TABLE event AS
    	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday, ROW_NUMBER() OVER (ORDER BY RANDOM()) r
    	FROM "ml_blog_db"."public"."sales" s
    	INNER JOIN "ml_blog_db"."public"."event" e
    	ON s.eventid = e.eventid
    	INNER JOIN "ml_blog_db"."public"."date" d
    	ON s.dateid = d.dateid;

Event name, quantity sold, sale time, day, week, month, quarter, year, and holiday are columns in the dataset from AWS Data Exchange that are used as features in the ML model creation.
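
To verify the consolidated table before splitting it, you can run a quick preview such as the following (a simple sanity check that isn't part of the original walkthrough):
    -- Preview the feature columns and the random row number used for the split
    SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday, r
    	FROM event
    	ORDER BY r
    	LIMIT 10;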

  15. Choose Add markdown and enter Split the dataset into training dataset and validation dataset.
  16. Choose Add SQL and enter the following code to split the dataset into training and validation datasets:
    CREATE TABLE training_data AS 
    	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday
    	FROM event
    	WHERE r >
    	(SELECT COUNT(1) * 0.2 FROM event);
    
    	CREATE TABLE validation_data AS 
    	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday
    	FROM event
    	WHERE r <=
    	(SELECT COUNT(1) * 0.2 FROM event);
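
To confirm that the split is roughly 80/20, you can compare the row counts of the two tables (an optional check using the tables created above):
    -- Compare row counts of the training and validation datasets
    SELECT 'training_data' AS dataset, COUNT(*) AS row_count FROM training_data
    	UNION ALL
    	SELECT 'validation_data', COUNT(*) FROM validation_data;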

  17. Choose Add markdown and enter Create ML model.
  18. Choose Add SQL and enter the following command to create the model. Replace the your_s3_bucket parameter with your bucket name.
    CREATE MODEL predict_ticket_sold
    	FROM training_data
    	TARGET qtysold
    	FUNCTION predict_ticket_sold
    	IAM_ROLE 'default'
    	PROBLEM_TYPE regression
    	OBJECTIVE 'mse'
    	SETTINGS (s3_bucket 'your_s3_bucket',
    	s3_garbage_collect off,
    	max_runtime 5000);

Note: It can take up to two hours to create and train the model.
The following screenshot shows the example output from adding our markdown and SQL.

  19. Choose Add markdown and enter Show model creation status. Continue to the next step once the Model State has changed to Ready.
  20. Choose Add SQL and enter the following command to get the status of the model creation:
    SHOW MODEL predict_ticket_sold;

Move to the next step after the Model State has changed to READY.
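
Because the model is trained with the mse objective, one optional way to gauge its quality is to compute the mean squared error of its predictions on the validation dataset created earlier. This is a sketch that assumes the model is in the Ready state; it isn't part of the original walkthrough:
    -- Mean squared error of predictions on the held-out validation data
    SELECT AVG(POWER(qtysold - predict_ticket_sold(
    	eventname, saletime, day, week, month, qtr, year, holiday), 2)) AS mse
    	FROM validation_data;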

  21. Choose Add markdown and enter Run the inference for eventname Jason Mraz.
  22. When the model is ready, you can use the SQL function to apply the ML model to your data. The following is sample SQL code to predict the tickets sold for a particular event using the predict_ticket_sold function created in the previous step:
    SELECT eventname,
    	predict_ticket_sold(
    	eventname, saletime, day, week, month, qtr, year, holiday ) AS predicted_qty_sold,
    	day, week, month
    	FROM event
    	WHERE eventname = 'Jason Mraz';

The following is the output received by applying the ML function predict_ticket_sold to the original dataset. The output of the ML function is captured in the predicted_qty_sold field, which is the predicted quantity of tickets sold.
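
You can also aggregate the predictions, for example to compare predicted and actual ticket sales by month for the same event. The following query is illustrative only and reuses the columns and function shown above:
    -- Compare predicted and actual ticket sales by month for one event
    SELECT month,
    	SUM(qtysold) AS actual_qty_sold,
    	SUM(predict_ticket_sold(
    	eventname, saletime, day, week, month, qtr, year, holiday)) AS predicted_qty_sold
    	FROM event
    	WHERE eventname = 'Jason Mraz'
    	GROUP BY month;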

Share notebooks

To share the notebooks, complete the following steps:

  1. Create an IAM role with the managed policy AmazonRedshiftQueryEditorV2FullAccess attached to the role.
  2. Add a principal tag to the role with the tag name sqlworkbench-team.
  3. Set the value of this tag to the principal (user, group, or role) you’re granting access to.
  4. After you configure these permissions, navigate to the Amazon Redshift console and choose Query editor v2 in the navigation pane. If you haven’t used the query editor v2 before, please configure your account to use query editor v2.
  5. Choose Notebooks in the left pane and navigate to My notebooks.
  6. Right-click on the notebook you want to share and choose Share with my team.
  7. You can confirm that the notebook is shared by choosing Shared to my team and checking that the notebook is listed.

Summary

In this post, we showed you how to build an end-to-end pipeline by subscribing to a public dataset through AWS Data Exchange, simplifying data integration and processing, and then running prediction using Redshift ML on the data.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Yadgiri Pottabhathini is a Sr. Analytics Specialist Solutions Architect. His role is to assist customers in their cloud data warehouse journey and help them evaluate and align their data analytics business objectives with Amazon Redshift capabilities.

Ekta Ahuja is an Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.