
Trigger an AWS Glue DataBrew job based on an event generated from another DataBrew job

Post Syndicated from Nipun Chagari original https://aws.amazon.com/blogs/big-data/trigger-an-aws-glue-databrew-job-based-on-an-event-generated-from-another-databrew-job/

Organizations today have continuous incoming data, and analyzing it in a timely fashion is becoming a common requirement for data analytics and machine learning (ML) use cases. Clean, well-prepared data is essential for gaining insights that help enterprises get the most out of their data for business growth and profitability. You can now use AWS Glue DataBrew, a visual data preparation tool that makes it easy to transform and prepare datasets for analytics and ML workloads.

As we build these data analytics pipelines, we can decouple the jobs by building event-driven analytics and ML workflow pipelines. In this post, we walk through how to trigger a DataBrew job automatically on an event generated from another DataBrew job using Amazon EventBridge and AWS Step Functions.

Overview of solution

The following diagram illustrates the architecture of the solution. We use AWS CloudFormation to deploy an EventBridge rule, an Amazon Simple Queue Service (Amazon SQS) queue, and Step Functions resources to trigger the second DataBrew job.

The steps in this solution are as follows:

  1. Import your dataset to Amazon Simple Storage Service (Amazon S3).
  2. DataBrew queries the data from Amazon S3 by creating a recipe and performing transformations.
  3. The first DataBrew recipe job writes the output to an S3 bucket.
  4. When the first recipe job is complete, it triggers an EventBridge event.
  5. A Step Functions state machine is invoked based on the event, which in turn invokes the second DataBrew recipe job for further processing (the EventBridge rule that connects these two steps is sketched after this list).
  6. The event is delivered to the dead-letter queue if the rule in EventBridge can’t invoke the state machine successfully.
  7. DataBrew queries data from an S3 bucket by creating a recipe and performing transformations.
  8. The second DataBrew recipe job writes the output to the same S3 bucket.
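The CloudFormation stack described later in this post creates the EventBridge rule for you. For illustration only, the following boto3 sketch shows roughly what such a rule looks like; the rule name, ARNs, and the exact event fields that DataBrew publishes are assumptions you should verify against the DataBrew event reference for your Region.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical names and ARNs for illustration; the CloudFormation stack in this
# post creates its own rule, role, and state machine.
RULE_NAME = "databrew-job1-succeeded"
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:trigger-databrew-job2"
EVENTS_ROLE_ARN = "arn:aws:iam::111122223333:role/eventbridge-invoke-stepfunctions"

# Match successful runs of the first recipe job on the default event bus.
event_pattern = {
    "source": ["aws.databrew"],
    "detail-type": ["DataBrew Job State Change"],
    "detail": {
        "jobName": ["marketing-campaign-job1"],
        "state": ["SUCCEEDED"],
    },
}

events.put_rule(Name=RULE_NAME, EventPattern=json.dumps(event_pattern), State="ENABLED")

# Send matched events to the state machine that starts the second job.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{
        "Id": "start-job2-state-machine",
        "Arn": STATE_MACHINE_ARN,
        "RoleArn": EVENTS_ROLE_ARN,
    }],
)
```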

Prerequisites

To use this solution, you need the following prerequisites:

  • An AWS account
  • An S3 bucket to store the sample dataset and the DataBrew job output
  • An AWS Identity and Access Management (IAM) role that grants DataBrew read and write access to the S3 bucket (referred to later as the role created during the prerequisites)

Load the dataset into Amazon S3

For this post, we use the Credit Card customers sample dataset from Kaggle. This data consists of 10,000 customers, including their age, salary, marital status, credit card limit, credit card category, and more. Download the sample dataset and upload it to the S3 bucket you created as a prerequisite. We recommend creating all your resources in the same account and Region.
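If you prefer to load the file programmatically instead of through the Amazon S3 console, a minimal boto3 sketch looks like the following; the local file name, bucket, and key are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload the downloaded Kaggle CSV to your input bucket (names are placeholders).
s3.upload_file(
    Filename="credit_card_customers.csv",   # local path to the downloaded file
    Bucket="my-databrew-input-bucket",
    Key="credit-card-customers/credit_card_customers.csv",
)
```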

Create a DataBrew project

To create a DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Projects and choose Create project.
  2. For Project name, enter marketing-campaign-project-1.
  3. For Select a dataset, select New dataset.
  4. Under Data lake/data store, choose Amazon S3.
  5. For Enter your source from S3, enter the S3 path of the sample dataset.
  6. Select the dataset CSV file.
  7. Under Permissions, for Role name, choose an existing IAM role created during the prerequisites or create a new role.
  8. If you create a new role, for New IAM role suffix, enter a suffix.
  9. Choose Create project.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Create the DataBrew jobs

Now we can create the recipe jobs.

  1. On the DataBrew console, in the navigation pane, choose Projects.
  2. On the Projects page, select the project marketing-campaign-project-1.
  3. Choose Open project and choose Add step.
  4. In this step, we choose Delete to drop the columns from our dataset that aren’t required for this exercise.

You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

  5. Select the columns to delete and choose Apply.
  6. Choose Create job.
  7. For Job name, enter marketing-campaign-job1.
  8. Under Job output settings, for File type, choose your final storage format (for this post, we choose CSV).
  9. For S3 location, enter your final S3 output bucket path.
  10. Under Settings, for File output storage, select Replace output files for each job run.
  11. Choose Save.
  12. Under Permissions, for Role name, choose an existing role created during the prerequisites or create a new role.
  13. Choose Create job.

Now we repeat the same steps to create another DataBrew project and DataBrew job.

  1. For this post, we name the second project marketing-campaign-project2 and the job marketing-campaign-job2.
  2. When you create the new project, this time use the job1 output file location as the new dataset.
  3. For this job, we deselect Unknown and Uneducated in the Education_Level column.

Deploy your resources using CloudFormation

For a quick start of this solution, we deploy the resources with a CloudFormation stack. The stack creates the EventBridge rule, SQS queue, and Step Functions state machine in your account to trigger the second DataBrew job when the first job runs successfully.
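The heart of the stack is a small state machine that starts the second job when the EventBridge rule invokes it. As a sketch only (the template's actual resource names, role, and structure will differ), it is roughly equivalent to the following, assuming the Step Functions service integration for DataBrew (arn:aws:states:::databrew:startJobRun):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Start the second recipe job whenever the state machine is invoked.
# ARNs and names below are placeholders.
definition = {
    "Comment": "Start the second DataBrew job after the first one succeeds",
    "StartAt": "StartSecondDataBrewJob",
    "States": {
        "StartSecondDataBrewJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::databrew:startJobRun",
            "Parameters": {"Name": "marketing-campaign-job2"},
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="trigger-marketing-campaign-job2",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/stepfunctions-databrew-start-job",
)
```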

  1. Choose Launch Stack.
  2. For DataBrew source job name, enter marketing-campaign-job1.
  3. For DataBrew target job name, enter marketing-campaign-job2.
  4. For both IAM role configurations, make the following choice:
    1. If you choose Create a new Role, the stack automatically creates a role for you.
    2. If you choose Attach an existing IAM role, you must enter the IAM role ARN in the corresponding field; otherwise, the stack creation fails.
  5. Choose Next.
  6. Select the two acknowledgement check boxes.
  7. Choose Create stack.

Test the solution

To test the solution, complete the following steps:

  1. On the DataBrew console, choose Jobs.
  2. Select the job marketing-campaign-job1 and choose Run job.

This action automatically triggers the second job, marketing-campaign-job2, via EventBridge and Step Functions.

  3. When both jobs are complete, open the output link for marketing-campaign-job2.

You’re redirected to the Amazon S3 console to access the output file.

In this solution, we created a workflow that required minimal code. The first job triggers the second job, and both jobs deliver the transformed data files to Amazon S3.
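You can also run the same test from code instead of the console. The following boto3 sketch starts the first job and then polls the most recent run of each job until it reaches a terminal state; it is an illustration only and assumes the job names used in this post.

```python
import time
import boto3

databrew = boto3.client("databrew")

# Kick off the first job; EventBridge and Step Functions start the second one.
run = databrew.start_job_run(Name="marketing-campaign-job1")
print("Started marketing-campaign-job1, run id:", run["RunId"])

# Poll the latest run of each job until it finishes (simple illustration only;
# assumes ListJobRuns returns the most recent run first).
for job_name in ["marketing-campaign-job1", "marketing-campaign-job2"]:
    while True:
        runs = databrew.list_job_runs(Name=job_name).get("JobRuns", [])
        state = runs[0]["State"] if runs else "NOT_STARTED"
        print(job_name, state)
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(30)
```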

Clean up

To avoid incurring future charges, delete all the resources created during this walkthrough:

  • IAM roles
  • DataBrew projects and their associated recipe jobs
  • S3 bucket
  • CloudFormation stack

Conclusion

In this post, we walked through how to use DataBrew along with EventBridge and Step Functions to run a DataBrew job that automatically triggers another DataBrew job. We encourage you to use this pattern to build event-driven pipelines that chain multiple DataBrew jobs in sequence.


About the Authors

Nipun Chagari is a Senior Solutions Architect at AWS, where he helps customers build highly available, scalable, and resilient applications on the AWS Cloud. He is passionate about helping customers adopt serverless technology to meet their business objectives.

Prarthana Angadi is a Software Development Engineer II at AWS, where she has been expanding what is possible with code in order to make life more efficient for AWS customers.

Transform data and create dashboards using AWS Glue DataBrew and Tableau

Post Syndicated from Nipun Chagari original https://aws.amazon.com/blogs/big-data/transform-data-and-create-dashboards-using-aws-glue-databrew-and-tableau/

Before you can create visuals and dashboards that convey useful information, you need to transform and prepare the underlying data. With AWS Glue DataBrew, you can now easily transform and prepare datasets from Amazon Simple Storage Service (Amazon S3), an Amazon Redshift data warehouse, Amazon Aurora, and other Amazon Relational Database Service (Amazon RDS) databases, and upload the results to Amazon S3. From there, you can visualize the transformed data in a dashboard using Amazon QuickSight or other business intelligence (BI) tools like Tableau.

DataBrew now also supports writing prepared data into Tableau Hyper format, allowing you to easily take prepared datasets from Amazon S3 and upload them into Tableau for further visualization and analysis. Hyper is Tableau’s in-memory data engine technology optimized for fast data ingest and analytical query processing on large or complex datasets.

In this post, we use DataBrew to extract data from Amazon Redshift, cleanse and transform it without writing any code, and store the output in Amazon S3 in Tableau Hyper format.

Overview of solution

The following diagram illustrates the architecture of the solution.

The solution workflow includes the following steps:

  1. You create a JDBC connection for Amazon Redshift and a DataBrew project on the DataBrew console.
  2. DataBrew queries data from Amazon Redshift by creating a recipe and performing transformations.
  3. The DataBrew job writes the final output to an S3 bucket in Tableau Hyper format.
  4. You can now upload the file into Tableau for further visualization and analysis.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An Amazon Redshift cluster loaded with the demo sales data
  • An S3 bucket for the DataBrew job output and intermediate results
  • An IAM role that grants DataBrew access to Amazon Redshift and the S3 bucket (referred to later as the role created during the prerequisites)

The following screenshots show the configuration for creating an Amazon Redshift cluster using the Amazon Redshift console with demo sales data. For more information about network security for the cluster, see Setting Up a VPC to Connect to JDBC Data Stores.

For this post, we use the sample data that comes with the Amazon Redshift cluster.

In this post, we only demonstrate how to transform your Amazon Redshift data to Hyper format; uploading the file for further analysis is out of scope.

Create an Amazon Redshift connection

In this step, you use the DataBrew console to create an Amazon Redshift connection.

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.
  3. For Connection name, enter a name (for example, tickit-sales-db-connection).
  4. For Connection type, select Amazon Redshift.
  5. In the Connection access section, provide details like cluster name, database name, user name, and password.
  6. Choose Create connection.

Create your dataset

To create a new dataset, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Datasets tab, choose Connect new dataset.
  3. For Dataset name, enter sales.
  4. For Connect to new dataset, select Amazon Redshift.
  5. Choose the connection you created (AwsGlueDataBrew-tickit-sales-db-connection).
  6. Select the public schema and the sales table.
  7. In the Additional configurations section, for Enter S3 destination, enter the S3 bucket you created as a prerequisite.

DataBrew uses this bucket to store the intermediate results.

  8. Choose Create dataset.

If your query takes too long to run, add a LIMIT clause to your SELECT statement.
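For reference, the console steps above roughly correspond to the following boto3 call; the connection name, table identifier, and temp bucket are placeholders based on the values used in this post.

```python
import boto3

databrew = boto3.client("databrew")

# Dataset backed by the Amazon Redshift connection created earlier. DataBrew
# stages intermediate query results in the TempDirectory bucket.
databrew.create_dataset(
    Name="sales",
    Input={
        "DatabaseInputDefinition": {
            "GlueConnectionName": "AwsGlueDataBrew-tickit-sales-db-connection",
            "DatabaseTableName": "public.sales",   # schema.table selected in the console
            "TempDirectory": {"Bucket": "my-databrew-temp-bucket"},
        }
    },
)
```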

Create a project using the dataset

To create a new project, complete the following steps:

  1. On the DataBrew console, choose Projects and choose Create project.
  2. For Project name, enter sales-project.
  3. For Attached recipe, choose Create new recipe.
  4. For Recipe name, enter sales-project-recipe.
  5. For Select a dataset, select My datasets.
  6. Select the sales dataset.
  7. Under Permissions, for Role name, choose an existing IAM role created during the prerequisites or create a new role.
  8. Choose Create project.

When the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

When you connect a dataset to an Amazon Redshift cluster in your VPC, DataBrew provisions an elastic network interface in your VPC without a public IPv4 address. Because of this, you need to provision a NAT gateway in your VPC, as well as an appropriate subnet route table configured for the subnets associated with the AWS Glue network interfaces. To use DataBrew with a VPC subnet without a NAT gateway, you must have a gateway VPC endpoint to Amazon S3 and a VPC endpoint for the AWS Glue interface in your VPC. For more information, see Create a gateway endpoint and Interface VPC endpoints (AWS PrivateLink).
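As a sketch of the endpoint option (the VPC, subnet, security group, and route table IDs, and the Region in the service names, are placeholders), you can create the two endpoints with boto3 as follows:

```python
import boto3

ec2 = boto3.client("ec2")

# Gateway endpoint so traffic to Amazon S3 stays inside the VPC (no NAT needed).
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)

# Interface endpoint for the AWS Glue service used by DataBrew.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.glue",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
```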

Build a transformation recipe

In this step, we perform some feature engineering techniques (transformations) to prepare our dataset and drop the columns that aren’t required for this exercise.

  1. On the DataBrew console, choose Column.
  2. Choose Delete.
  3. For Source columns, select the columns pricepaid and commission.
  4. Choose Apply.

Add a logical condition

With DataBrew, you can now use IF, AND, OR, and CASE logical conditions to create transformations based on functions. With this feature, you have the flexibility to use custom values or reference other columns within the expressions, and you can create adaptable transformations for your specific use cases.

To add a logical condition to your transformation recipe, complete the following steps:

  1. On the DataBrew console, choose Conditions.
  2. Choose IF.
  3. For Matching conditions, select Match all conditions.
  4. For Source, choose the value qtysold.
  5. For Enter a value, select Enter a custom value and enter 2.
  6. For Destination column, enter opportunity.
  7. Choose Apply.

The following screenshot shows the full recipe that we applied to our dataset.
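To make the effect of the condition concrete, the logic is roughly equivalent to the following pandas snippet. This is an illustration only: the comparison operator (greater than) and the boolean-style result are assumptions, because those details come from the console rather than the text above; only the qtysold source column, the value 2, and the opportunity destination column are taken from the steps.

```python
import pandas as pd

# Placeholder file standing in for the sampled sales data.
sales = pd.read_csv("sales_sample.csv")

# IF qtysold (assumed: is greater than) 2 -> write the result to a new "opportunity" column.
sales["opportunity"] = sales["qtysold"] > 2
```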

Create the DataBrew job

Now that we have built the recipe, we can create and run the DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter sales-opportunities.
  3. For File type, choose TABLEAU HYPER as the output format (a boto3 equivalent of these steps follows the list).
  4. For S3 location, enter the previously created S3 bucket.
  5. For Role name, choose an existing role created during the prerequisites or create a new role.
  6. Choose Create and run job.

  7. Navigate to the Jobs page and wait for the sales-opportunities job to complete.
  8. Choose the output link to navigate to the Amazon S3 console to access the job output.
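The console steps above map closely to the DataBrew API. The following boto3 sketch creates and starts an equivalent job; the role ARN and output bucket are placeholders, it assumes the recipe has been published (here as version 1.0), and the TABLEAUHYPER format value should be verified against the current CreateRecipeJob reference.

```python
import boto3

databrew = boto3.client("databrew")

# Recipe job that writes Tableau Hyper output to Amazon S3 (placeholder names).
databrew.create_recipe_job(
    Name="sales-opportunities",
    DatasetName="sales",
    RecipeReference={"Name": "sales-project-recipe", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::111122223333:role/databrew-job-role",
    Outputs=[{
        "Format": "TABLEAUHYPER",   # assumed format value; check the API reference
        "Location": {"Bucket": "my-databrew-output-bucket", "Key": "sales-opportunities/"},
    }],
)

databrew.start_job_run(Name="sales-opportunities")
```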

Clean up

To avoid incurring future charges, delete the resources you created:

  • Amazon Redshift cluster
  • Recipe job
  • Job output stored in the S3 bucket
  • IAM roles created as part of this exercise
  • DataBrew project sales-project and its associated recipe sales-project-recipe
  • DataBrew datasets

Conclusion

In this post, we showed you how to connect to an Amazon Redshift cluster and create a DataBrew dataset.

We saw how easy it is to get data from Amazon Redshift into DataBrew and apply transformations without any coding. We then ran a recipe job to convert this dataset to a Tableau Hyper format file and store it in Amazon S3 for visualization using Tableau. Learn more about all the products and service integrations that AWS Glue DataBrew supports.


About the Authors

Nipun Chagari is a Senior Solutions Architect at AWS, where he helps customers build highly available, scalable, and resilient applications on the AWS Cloud. He is currently focused on helping customers leverage serverless technology to meet their business objectives.

Mohit Malik is a Senior Solutions Architect at Amazon Web Services who specializes in compute, networking, and serverless technologies. He enjoys helping customers learn how to operate efficiently and effectively in the cloud. In his spare time, Mohit enjoys spending time with his family, reading books, and watching movies.