Preparing data for ML models using AWS Glue DataBrew in a Jupyter notebook

AWS Glue DataBrew is a new visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning (ML). In this post, we examine a sample ML use case and show how to use DataBrew and a Jupyter notebook to upload a dataset, clean and normalize the data, and train and publish an ML model. We look for anomalies by applying the Amazon SageMaker Random Cut Forest (RCF) anomaly detection algorithm on a public dataset that records power consumption for more than 300 random households.

Deploying your resources

To make it easier for you to get started, we created an AWS CloudFormation template that automatically configures a Jupyter notebook instance with the required libraries and installs the plugin. We used the AWS Deep Learning AMI to configure the out-of-the-box Jupyter server. This easy deployment is intended to get you started with DataBrew from within a Jupyter environment. The source code for the DataBrew plugin and the CloudFormation template are available in the GitHub repo.

To deploy the solution, you must have a subnet that has internet access and an Amazon Simple Storage Service (Amazon S3) bucket where you want to store the data for DataBrew. Select the VPC, subnet, security group, and the S3 bucket that you want DataBrew to use for data processing. Provide an Amazon Elastic Compute Cloud (Amazon EC2) key pair if you plan to SSH to the instance.

  1. Launch the following stack:
  2. When the template deployment is complete, on the Outputs tab, choose the URL to open JupyterLab.
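If you prefer to launch the stack programmatically instead of through the console, you can do so with boto3. The following is a minimal sketch only; the template URL and parameter keys are placeholders, so substitute the template and parameter names from the GitHub repo.

```python
import boto3

cfn = boto3.client("cloudformation")

# Placeholder template URL and parameter keys -- use the template and
# parameter names defined in the GitHub repo's CloudFormation template.
cfn.create_stack(
    StackName="databrew-jupyter-plugin-demo",
    TemplateURL="https://<your-bucket>.s3.amazonaws.com/databrew-jupyter-plugin-demo.yaml",
    Parameters=[
        {"ParameterKey": "VpcId", "ParameterValue": "vpc-0123456789abcdef0"},
        {"ParameterKey": "SubnetId", "ParameterValue": "subnet-0123456789abcdef0"},
        {"ParameterKey": "S3Bucket", "ParameterValue": "my-databrew-bucket"},
        {"ParameterKey": "KeyName", "ParameterValue": "my-ec2-keypair"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the template creates an IAM role
)

# Block until the stack is ready, then read the JupyterLab URL from its outputs.
cfn.get_waiter("stack_create_complete").wait(StackName="databrew-jupyter-plugin-demo")
outputs = cfn.describe_stacks(StackName="databrew-jupyter-plugin-demo")["Stacks"][0]["Outputs"]
print(outputs)
```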

Because the Jupyter server is configured with a self-signed SSL certificate, your browser warns you and advises against continuing to this website. But because you set this server up yourself, it’s safe to continue.

  1. Choose Advanced.
  2. Choose Proceed.
  3. Use the password databrew_demo to log in.

For more information about securing and configuring your Jupyter server, see Set up a Jupyter Notebook Server.

  1. In the Jupyter environment’s left panel, choose the DataBrew logo.
  2. Choose Launch AWS Glue DataBrew.
  3. When the extension loads, choose the Datasets tab in the navigation bar.

Preparing your data using DataBrew

Now that the DataBrew extension is ready to go, we can begin to explore how DataBrew makes data preparation easy. Our source dataset contains data points at 15-minute intervals and is organized as one column per household. The dataset is very wide, whereas the RCF algorithm expects tuples of date/time, client ID, and consumption value. Additionally, we want to normalize our data to 1-hour intervals. All of this can be done in DataBrew.

Setting up your dataset and project

To get started, you set up your dataset, import your data, and create a new project.

  1. Download power.consumption.csv from the GitHub repo.
  2. On the Datasets page, choose Connect new dataset.
  3. For Dataset name, enter a name.
  4. In the Connect to new dataset section, choose File upload.
  5. Upload power.consumption.csv.
  6. For Enter S3 destination, enter an S3 path where you can save the file.
  7. Choose Create dataset.

The file may take a few minutes to upload, depending on your internet speed.

  1. On the Datasets page, filter for your created dataset.
  2. Select your dataset and choose Create project with this dataset.

  1. In the Create project wizard, give your project a name.
  2. In the Permissions section, choose the AWS Identity and Access Management (IAM) role created from the CloudFormation template.

You can find the role on the CloudFormation stack’s Resources tab. If you use the default stack name, the role should begin with databrew-jupyter-plugin-demo.

After you create the project, the project view loads, and you’re ready to prepare your data.

Building a recipe for data transformation

A recipe is a series of steps that prepare your data for the RCF algorithm. The algorithm requires three columns: date, client ID, and an integer value. To transform our dataset to contain those three columns, we configure our recipe to do the following:

  1. Unpivot the data to collapse measurements from multiple clients into one column.
  2. Apply the window function to average the 15-minute data points into 1-hour data points.
  3. Filter to keep only values at each hour.
  4. Multiply and floor the results.

Unpivoting the data

To unpivot the data and collapse measurements, complete the following steps:

  1. On the toolbar, choose Pivot.

  1. In the Pivot wizard, select Unpivot: Columns to rows.
  2. For Unpivot columns, choose MT_012, MT_013, MT_131, and MT_132.
  3. For Column name, enter client_id.
  4. For Value column name, enter quarter_hour_consumption.

In the Recipe pane, you can see the action that unpivots the columns. You can revisit and change this action later. The new columns may not be visible immediately.

  1. To see them and narrow down the visible data to only the relevant columns, choose the arrow next to Viewing.
  2. Deselect all items and select only _c0 and our two new columns, client_id and quarter_hour_consumption.
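The console steps above are all you need, but if it helps to see the transformation in code, the unpivot is roughly equivalent to a pandas melt. This is an illustrative sketch, not the actual recipe, and assumes you’re working with the downloaded power.consumption.csv locally.

```python
import pandas as pd

df = pd.read_csv("power.consumption.csv")
# The timestamp column is unnamed in the source file; DataBrew shows it as _c0,
# so rename it the same way here.
df = df.rename(columns={df.columns[0]: "_c0"})

# Unpivot the four selected household columns into client_id/value pairs.
long_df = df.melt(
    id_vars=["_c0"],
    value_vars=["MT_012", "MT_013", "MT_131", "MT_132"],
    var_name="client_id",
    value_name="quarter_hour_consumption",
)
print(long_df.head())
```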

Applying the window function

To apply the window function to average the 15-minute data points into 1-hour data points, complete the following steps:

  1. Choose the quarter_hour_consumption column.
  2. Choose Functions.
  3. Choose Window functions.
  4. Choose Rolling average.

  1. In the Create column pane, for Number of rows before, enter 0.
  2. For Number of rows after, enter 3.
  3. For Name of column to order by with, choose client_id.
  4. For Destination column, enter hourly_consumption_raw.
  5. Choose Apply.
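As a rough pandas equivalent (continuing the sketch above), a window of 0 rows before and 3 rows after averages each 15-minute reading with the next three, so each row that falls on the top of an hour now holds that hour’s average. The per-client grouping and chronological sort below are assumptions of the sketch, not part of the DataBrew recipe itself.

```python
# Sort each client's readings chronologically, then compute a forward-looking
# rolling mean over four 15-minute readings (the current row plus the next three).
long_df = long_df.sort_values(["client_id", "_c0"])
long_df["hourly_consumption_raw"] = long_df.groupby("client_id")[
    "quarter_hour_consumption"
].transform(lambda s: s.rolling(window=4, min_periods=4).mean().shift(-3))
```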

Filtering to keep only values at each hour

In this step, you rename the date/time column, convert it to string type so that you can do simple filtering, and filter the dataset on the string column for times ending in :00:00.

  1. For the _c0 column, choose the ellipsis icon (…) and choose Rename.

  1. Rename the column to timestamp.
  2. Choose the clock icon and choose string.

  1. With the column selected, choose Filter.
  2. Choose By condition.
  3. Choose Ends with.
  4. Enter the value :00:00.
  5. Choose Apply.

Filtering the column for only values that end with :00:00 leaves you with hourly averages of power consumption per client for every hour.
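Continuing the pandas sketch, the same rename-and-filter looks like this:

```python
# Rename _c0 to timestamp, treat it as a string, and keep only the rows that
# end with ":00:00" -- the reading taken at the top of each hour.
long_df = long_df.rename(columns={"_c0": "timestamp"})
long_df["timestamp"] = long_df["timestamp"].astype(str)
hourly_df = long_df[long_df["timestamp"].str.endswith(":00:00")].copy()
```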

Multiplying and flooring the results

In this step, you multiply the data by 100 to retain precision and floor it so that it can be accepted by the RCF algorithm, which accepts only integers.

  1. Choose Functions.
  2. Choose Math functions.
  3. Choose Multiply.
  4. For Value using, choose Source columns and value.
  5. For Source column, choose hourly_consumption_raw.
  6. For Destination column, enter hourly_consumption_raw_times_a_hundred.
  7. Choose Apply.

  1. Choose Functions.
  2. Choose Math functions.
  3. Choose Floor.
  4. For Source column, choose hourly_consumption_raw_times_a_hundred.
  5. For Destination column, enter hourly_consumption.
  6. Choose Apply.

This column contains the final, normalized data.
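In the pandas sketch, the same two steps look like the following; dropping incomplete windows at the end of each series is an assumption of the sketch.

```python
import numpy as np

# Drop trailing rows where the forward-looking window was incomplete, then
# scale by 100 to keep two decimal places and floor to the integers RCF expects.
hourly_df = hourly_df.dropna(subset=["hourly_consumption_raw"])
hourly_df["hourly_consumption_raw_times_a_hundred"] = (
    hourly_df["hourly_consumption_raw"] * 100
)
hourly_df["hourly_consumption"] = np.floor(
    hourly_df["hourly_consumption_raw_times_a_hundred"]
).astype(int)
```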

Running the job to transform the data

You’re now ready to transform the data.

  1. Choose Create job.

  1. Enter a job name and choose the dataset you created earlier.
  2. Specify the S3 bucket you provided in the CloudFormation template.
  3. Choose the IAM role that AWS CloudFormation created (which we used earlier to create the project).
  4. Choose Create and run job.

The job may take up to 5 minutes to complete.
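If you’d rather start and monitor the job from the notebook instead of the console, boto3 also exposes the DataBrew API. A minimal sketch, with a placeholder job name:

```python
import boto3

databrew = boto3.client("databrew")

# "power-consumption-job" is a placeholder -- use the job name you entered above.
run = databrew.start_job_run(Name="power-consumption-job")
state = databrew.describe_job_run(
    Name="power-consumption-job", RunId=run["RunId"]
)["State"]
print(state)  # STARTING, RUNNING, SUCCEEDED, ...
```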

On the Job run history page for the job, view the output on the Amazon S3 console by choosing the link in the table.

That’s it! The data is now ready to use for training and deploying our ML model.

Training and deploying the ML model using prepared data

The data was already prepared using DataBrew via the plugin, so the next step is to train an ML model using that data. We provided a sample anomaly detection notebook that you can download.

In this sample notebook, you need to specify the S3 location where DataBrew stored the output data. The notebook uses the IAM role attached to the EC2 instance profile that AWS CloudFormation created. After you provide the right S3 paths, the first step is to filter for the specific columns we’re interested in and visualize the time series power consumption data.
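As an illustrative sketch (not the notebook itself), loading the job output and plotting one household might look like the following. The S3 path is a placeholder, and reading s3:// URLs directly with pandas assumes the s3fs package and matplotlib are available on the instance.

```python
import pandas as pd

# Placeholder path -- point this at the CSV the DataBrew job wrote to your bucket.
output_path = "s3://<your-bucket>/<databrew-job-output>/<part-file>.csv"

df = pd.read_csv(output_path)
df = df[["timestamp", "client_id", "hourly_consumption"]]
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Visualize one household's hourly power consumption.
df[df["client_id"] == "MT_012"].plot(x="timestamp", y="hourly_consumption")
```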

The next step is to train a sample anomaly detection model using the SageMaker Random Cut Forest algorithm. We pick one of the time series available in the input Pandas DataFrame and train the anomaly detection model with the hyperparameter feature_dim set to 1, leaving the default values for other hyperparameters. We then create an estimator for Random Cut Forest and fit the model. In a few minutes, the training should be complete. In the next step, we create a predictor and deploy the model to a SageMaker endpoint.
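A minimal sketch of that flow with the SageMaker Python SDK (v2) follows, continuing from the DataFrame loaded above. The role ARN, instance types, and tree hyperparameters are placeholder assumptions; the sample notebook remains the authoritative version.

```python
import sagemaker
from sagemaker import RandomCutForest

# Placeholder role ARN -- the notebook uses the role attached to the EC2
# instance profile that CloudFormation created.
role_arn = "arn:aws:iam::<account-id>:role/<databrew-jupyter-plugin-demo-role>"

# Pick a single household's hourly series to train on (a feature dimension of 1).
series = df[df["client_id"] == "MT_012"]["hourly_consumption"].to_numpy()

rcf = RandomCutForest(
    role=role_arn,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    sagemaker_session=sagemaker.Session(),
    num_samples_per_tree=512,
    num_trees=50,
)
rcf.fit(rcf.record_set(series.reshape(-1, 1)))

# Deploy the trained model to a real-time endpoint.
predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
```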

Using the prepared data, we run the prediction and plot the results.

We flag data points whose anomaly score is more than two standard deviations from the mean score. The data shows an anomaly toward the end of the time series, and with this information, that timeframe can be investigated further.
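Continuing the sketch, scoring the series and flagging points more than two standard deviations from the mean score might look like this:

```python
import numpy as np
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Send the data as CSV and read the anomaly scores back as JSON.
predictor.serializer = CSVSerializer()
predictor.deserializer = JSONDeserializer()

results = predictor.predict(series.reshape(-1, 1))
scores = np.array([record["score"] for record in results["scores"]])

# Flag points whose score is more than two standard deviations above the mean.
threshold = scores.mean() + 2 * scores.std()
anomalies = np.where(scores > threshold)[0]
print(f"{len(anomalies)} anomalous hours detected")
```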

Finally, we clean up by deleting the SageMaker endpoint to prevent any ongoing charges.
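In the sketch, that’s a single call on the predictor:

```python
# Delete the SageMaker endpoint so it stops incurring charges.
predictor.delete_endpoint()
```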

Conclusion

We’ve walked you through the process of setting up the AWS Glue DataBrew Jupyter plugin in a Jupyter notebook environment. We used the plugin to prepare data, then trained and deployed an ML model in the same Jupyter environment using SageMaker.

Although we used a DLAMI Jupyter environment in this post, the DataBrew Jupyter extension also works on SageMaker notebooks. For installation instructions, see the GitHub repo.

DataBrew makes it easy to iterate on data preparation workflows. The resulting recipes and jobs are reusable and can be run over separate, large datasets. The DataBrew Jupyter plugin lets you prepare your data seamlessly, in context, within your Jupyter notebook.


About the Authors

Zayd Simjee is a software engineer at Amazon. He’s interested in distributed systems, big data, and simplifying the developer experience in the cloud. He has worked on several big data services at AWS, and most recently completed work on the AWS Glue DataBrew release.

As a Principal Solutions Architect at Amazon Web Services, Karthik Sonti works with GSI partners to help enterprises realize transformational business outcomes using artificial intelligence, machine learning, and data analytics.