Tag Archives: AWS Glue DataBrew

Extract, prepare, and analyze Salesforce.com data using Amazon AppFlow, AWS Glue DataBrew, and Amazon Athena

Post Syndicated from Ramkumar Nottath original https://aws.amazon.com/blogs/big-data/extract-prepare-and-analyze-salesforce-com-data-using-amazon-appflow-aws-glue-databrew-and-amazon-athena/

As organizations embark on their data modernization journey, big data analytics and machine learning (ML) use cases are becoming even more integral parts of business. Ease of data preparation and seamless integration with third-party data sources are of paramount importance for gaining insights quickly and making critical business decisions faster.

AWS Glue DataBrew is a visual data preparation tool that cleans and normalizes data without writing code. This reduces the time it takes to prepare data for analytics and ML by up to 80% compared to traditional approaches to data preparation.

Software as a service (SaaS) applications play a pivotal role in organizations’ analytics pipelines. This data is essential to include when performing analytics to get insights to make better business decisions. Amazon AppFlow is a fully managed integration service that helps you transfer SaaS data to your data lake securely.

Recently, DataBrew announced native console integration with Amazon AppFlow to connect to data from applications like Salesforce, Zendesk, Slack, ServiceNow, and other SaaS applications, and AWS services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift. With native integration with Amazon AppFlow, DataBrew is addressing both the challenges with data preparation and seamless integration with SaaS applications.

Salesforce is a popular and widely used customer relationship management (CRM) platform. It lets you store and manage prospect and customer information—like contact info, accounts, leads, and sales opportunities—in one central location. You can derive a lot of useful information by combining the prospect information stored in Salesforce with other structured and unstructured data in your data lake.

In this post, we walk you through how to extract data from Salesforce.com using the native integration that DataBrew has with Amazon AppFlow, prepare the data for your analytical use cases using DataBrew, store it in Amazon S3, and query it with Amazon Athena.

Architecture overview

The following diagram represents the flow described in this post. With the visual point-and-click interface in Amazon AppFlow, you create a new flow with Salesforce as source. You can either use an existing connection to Salesforce or create a new one. In DataBrew, while creating a dataset, you can choose the Amazon AppFlow flow as one of the sources to import the data for data preparation. After you perform the data preparation steps on sample data, you can save the steps as a recipe and automate the flow by creating a DataBrew job by selecting the dataset for the source and the newly created recipe for transformation. The transformed data is published to an S3 bucket. You can use an AWS Glue crawler to catalog that data and use Athena to query the data.

The workflow includes the following steps:

  1. Create an S3 bucket for the raw and transformed data.
  2. Create a connection to Salesforce.
  3. Create a flow to extract the data from Salesforce.com.
  4. Create a dataset and project.
  5. Prepare the data and create a recipe.
  6. Create a job to pull the data from the Amazon AppFlow flow, run the transformations, and load the data in Amazon S3.
  7. Create a crawler to catalog the data.
  8. Analyze data using Athena.

Prerequisites

To implement this solution, you need the following prerequisites:

Now that we have discussed the architecture of our solution, we present the step-by-step instructions.

Create an S3 bucket

To create an S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.
  2. Enter a name for your bucket; for example, databrew-appflow-data-prep-<your name>.
  3. Choose Create bucket.
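
If you prefer to script this step instead of using the console, the following minimal sketch creates the same bucket with boto3. The bucket name is a placeholder that you must replace with a globally unique, valid name:

import boto3

s3 = boto3.client("s3")

# Placeholder name; S3 bucket names must be globally unique.
bucket_name = "databrew-appflow-data-prep-<your name>"

# In us-east-1, no LocationConstraint is needed; for any other Region, also pass
# CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=bucket_name)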

Create a connection to Salesforce

If you already have a Salesforce connection created in Amazon AppFlow, you can skip this step. To create a new connection, complete the following steps:

  1. On the Amazon AppFlow console, choose Connections in the navigation pane.
  2. From the list of connectors, choose Salesforce.
  3. Choose Create connection.
  4. For Connection name, enter a name (for example, sfdc-appflow).
  5. Choose Continue.
  6. You’re redirected to a sign-in screen where you can log in to your Salesforce account. If you don’t have a Salesforce account, you can sign up for a developer account.
  7. Choose Allow to allow Amazon AppFlow to access your Salesforce account.

You can now see the new connection that was created.

Create a flow in Amazon AppFlow to extract data from Salesforce.com

To create a flow in Amazon AppFlow, complete the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. Choose Create flow.
  3. On the Specify flow details page, enter a name for the flow (for example, salesforce-data).
  4. Optionally, provide a description for the flow and tags.
  5. Choose Next.
  6. On the Configure flow page, for Source name, choose Salesforce.
  7. Choose the connection we created in the previous step.
  8. For Choose Salesforce object, choose the object you want to work with (for this post, we choose Opportunity).
  9. For Destination name, choose Amazon S3.
  10. For Bucket details, choose the bucket you created earlier.
  11. Optionally, provide a prefix (folder) where you want the data to land within the bucket.
  12. Under Additional settings, for Data transfer preference, select Aggregate all records.
  13. For Choose how to trigger the flow, select Run on demand.
  14. Choose Next.
  15. On the Source to destination field mapping page, for Source field name, select the fields you want to work with and choose Map fields directly.

Alternatively, you can choose Map all fields directly to map all the fields from the object.

  1. On the Add filters page, add any filters for the data you’re pulling.
  2. On the Review and create page, review all the details and choose Create flow.
  3. After the flow is created, choose Run flow to run the flow and verify whether the flow ran successfully.
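
The Run flow button corresponds to the Amazon AppFlow StartFlow API, so you can also trigger and verify the on-demand flow programmatically. A sketch, assuming the flow name salesforce-data used in this post:

import boto3

appflow = boto3.client("appflow")

# Start the on-demand flow created above.
appflow.start_flow(flowName="salesforce-data")

# Check the most recent execution records to verify the run succeeded.
records = appflow.describe_flow_execution_records(flowName="salesforce-data")
for execution in records["flowExecutions"]:
    print(execution["executionId"], execution["executionStatus"])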

Create a dataset and project in DataBrew

To create a dataset and DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Choose Create project.
  3. For Project name, enter a name (for example, databrew-appflow-integration).
  4. For Select a dataset, select New dataset.
  5. For Dataset name, enter a name (for example, databrew-sfdc).
  6. From the list of sources, choose Amazon AppFlow.
  7. Select the flow you created.
  8. Under Permissions, for Role name, choose Create new IAM role.
  9. For New IAM role suffix, enter a suffix (for example, appflow-databrew).
  10. Choose Create project.

After you create the project, data is loaded to DataBrew so that you can perform data preparation activities.

Prepare data and create a recipe

With DataBrew, you can choose from over 250 pre-built transformations to automate data preparation tasks, all without the need to write any code. In this post, we only discuss a few of them. For the full list of transformations, see Recipe step and function reference.

In this step, we split the CloseDate column into CloseYear, CloseMonth, and CloseDay columns. Then we flag the outliers in the Amount column.

  1. Duplicate the column CloseDate by choosing the column and choosing Duplicate.
  2. For Duplicate column name, enter a name.
  3. Choose Apply.
  4. Select the column you created and on the Clean menu, choose Replace value or pattern.
  5. For Value to be replaced, select Enter custom value and enter -.
  6. Choose Apply.

This replaces the hyphens (-) in the date with empty values.

  1. Select the modified column, and on the options menu, choose Split and then At positions from beginning.
  2. For Position from the beginning¸ enter 4.
  3. Choose Apply.

This creates a new column with the year value.

Next, we split the column at position 2 from the beginning in CloseDate_copy_2 to extract month and day.

  1. On the options menu, choose Split column.
  2. Rename the CloseDate_copy_1 column to CloseYear and choose Apply.
  3. Repeat the steps to rename the other two newly created columns to CloseMonth and CloseDay.
  4. Select the Amount column and on the Outliers menu, choose Flag outliers.
  5. For Standard deviation threshold, enter 3.
  6. Under Outlier actions, select Flag outliers.
  7. Choose Apply.

You can see that an additional column got added and the outliers are flagged.

All the steps that we performed so far are recorded under Recipe.

  1. Under Recipe, choose Publish.
  2. For Version description, enter a description.
  3. Choose Publish.

This saves the recipe for future use.
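
You can also publish the recipe with the DataBrew PublishRecipe API. A minimal sketch, assuming the recipe name databrew-appflow-integration-recipe from this post:

import boto3

databrew = boto3.client("databrew")

# Publish the working version of the recipe so that jobs can reference it.
databrew.publish_recipe(
    Name="databrew-appflow-integration-recipe",
    Description="Split CloseDate and flag outliers in Amount",
)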

Create a DataBrew job

To create a DataBrew job, complete the following steps:

  1. On the DataBrew console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. For Job name, enter a name (for example, databrew-appflow-job).
  4. For Select a dataset, choose the dataset we created (databrew-sfdc).
  5. For Select a recipe, choose the recipe we created (databrew-appflow-integration-recipe).
  6. In the Job output settings section, for Output to, choose Amazon S3.
  7. For S3 location, enter the S3 path for the data (for example, s3://databrew-appflow-data-prep-<your name>/processed/).
  8. For Role name, choose the role with the suffix appflow-databrew.
  9. Choose Create and run job.
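
The equivalent API calls are CreateRecipeJob and StartJobRun. The following sketch mirrors the console settings above; the role ARN and bucket name are placeholders, and CSV is used here as an example output format:

import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="databrew-appflow-job",
    DatasetName="databrew-sfdc",
    RecipeReference={
        "Name": "databrew-appflow-integration-recipe",
        "RecipeVersion": "LATEST_PUBLISHED",
    },
    RoleArn="arn:aws:iam::111122223333:role/<your-databrew-role>",  # placeholder
    Outputs=[
        {
            "Location": {
                "Bucket": "databrew-appflow-data-prep-<your name>",  # placeholder
                "Key": "processed/",
            },
            "Format": "CSV",
        }
    ],
)

# Run the job and capture the run ID for monitoring.
run = databrew.start_job_run(Name="databrew-appflow-job")
print(run["RunId"])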

Create a crawler to catalog the data

To create your AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers.
  2. Choose Add crawler.
  3. For Crawler name, enter a name (for example, databrew-opportunity-data).
  4. Choose Next.
  5. Under Specify crawler source type, keep the default options and choose Next.
  6. Under Add a data store, for Include path, choose the S3 bucket that we used for the processed data (for example, s3://databrew-appflow-data-prep-<your name>/processed).
  7. Choose Next.
  8. For Add another data store, choose No.
  9. Choose Next.
  10. Select Create an IAM role and provide a suffix for the role (for example, databrew).
  11. For Frequency, choose Run on demand.
  12. On the next page, choose Add database.
  13. Enter a database name (for example, databrew-appflow).
  14. Choose Create.
  15. For Prefix, enter opportunity_.
  16. Choose Next.
  17. Review the details and choose Finish.
  18. After the crawler is created, select it and choose Run crawler.

The crawler catalogs the data that we uploaded to Amazon S3 after processing using DataBrew.
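
You can create and run the same crawler through the AWS Glue API. A sketch, with the IAM role and bucket name as placeholders (the databrew-appflow database must already exist, or be created with glue.create_database):

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="databrew-opportunity-data",
    Role="<your-glue-crawler-role>",  # placeholder IAM role name or ARN
    DatabaseName="databrew-appflow",
    TablePrefix="opportunity_",
    Targets={
        "S3Targets": [
            {"Path": "s3://databrew-appflow-data-prep-<your name>/processed/"}
        ]
    },
)

glue.start_crawler(Name="databrew-opportunity-data")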

Analyze data using Athena

When the crawler run is complete, we can analyze the data with Athena.

  1. On the AWS Glue console, choose the database we created.
  2. Under Data catalog, choose Tables in databrew-appflow.

You can see a table named opportunity_processed, which points to the Amazon S3 location where the processed data landed.

  1. Select the table name.
  2. On the Action menu, choose View data.

A pop-up may appear to let you know that you’re going to the Athena console.

  1. Choose Preview data.

If this is the first time using Athena in this account, you have to set the query result location.

  1. Run a query in Athena.

You should be able to see the data in Amazon S3.

You can perform further analysis by running more queries on the data. The following query returns the expected revenue based on various closeyear and closemonth combinations of opportunities:

SELECT closeyear, closemonth, SUM(expectedrevenue) FROM "AwsDataCatalog"."databrew-appflow"."opportunity_processed"
GROUP BY closeyear, closemonth;
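
You can also submit the same query through the Athena API instead of the console. A sketch, with the results bucket as a placeholder:

import boto3

athena = boto3.client("athena")

query = """
SELECT closeyear, closemonth, SUM(expectedrevenue)
FROM "AwsDataCatalog"."databrew-appflow"."opportunity_processed"
GROUP BY closeyear, closemonth
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "databrew-appflow"},
    ResultConfiguration={"OutputLocation": "s3://<your-athena-results-bucket>/"},  # placeholder
)
print(execution["QueryExecutionId"])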

Clean up

You may want to clean up the demo environment when you are done. To do so, delete the following resources that were created as part of this post:

  • S3 bucket (databrew-appflow-data-prep-<your name>)
  • Connection to Salesforce in Amazon AppFlow (sfdc-appflow)
  • Flow in Amazon AppFlow (salesforce-data)
  • Dataset (databrew-sfdc), project (databrew-appflow-integration), and job (databrew-appflow-job) in DataBrew
  • AWS Glue Data Catalog database (databrew-appflow)
  • IAM role (AWSGlueDataBrewServiceRole-appflow-databrew)

Conclusion

In this post, we walked you through how to extract data from Salesforce.com using the native integration that DataBrew has with Amazon AppFlow. We also demonstrated how to prepare the data for analytical use cases using DataBrew, operationalize the data preparation steps by creating a recipe and use that in a DataBrew job, store the job result in Amazon S3, and query it with Athena.

If you have any questions or suggestions, please leave a comment.


About the Authors

Ramkumar Nottath is a Sr. Solutions Architect at AWS focusing on Analytics services. He enjoys working with various customers to help them build scalable, reliable big data and analytics solutions. His interests extend to various technologies such as analytics, data warehousing, streaming, and machine learning. He loves spending time with his family and friends.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Prepare, transform, and orchestrate your data using AWS Glue DataBrew, AWS Glue ETL, and AWS Step Functions

Post Syndicated from Durga Mishra original https://aws.amazon.com/blogs/big-data/prepare-transform-and-orchestrate-your-data-using-aws-glue-databrew-aws-glue-etl-and-aws-step-functions/

Data volumes in organizations are increasing at an unprecedented rate, exploding from terabytes to petabytes and in some cases exabytes. As data volume increases, it attracts more and more users and applications to use the data in many different ways, sometimes referred to as data gravity. As data gravity increases, we need to find tools and services that allow us to prepare and process a large amount of data with ease to make it ready for consumption by a variety of applications and users. In this post, we look at how to use AWS Glue DataBrew and AWS Glue extract, transform, and load (ETL) along with AWS Step Functions to simplify the orchestration of a data preparation and transformation workflow.

DataBrew is a visual data preparation tool that exposes data in spreadsheet-like views to make it easy for data analysts and data scientists to enrich, clean, and normalize data to prepare it for analytics and machine learning (ML) without writing a single line of code. With more than 250 pre-built transformations, it helps reduce the time it takes to prepare the data by about 80% compared to traditional data preparation approaches.

AWS Glue is a fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage.

Step Functions is a serverless orchestration service that makes it easy to build an application workflow by combining many different AWS services like AWS Glue, DataBrew, AWS Lambda, Amazon EMR, and more. Through the Step Functions graphical console, you see your application’s workflow as a series of event-driven steps. Step Functions is based on state machines and tasks. A state machine is a workflow, and a task is a state in a workflow that represents a single unit of work that another AWS service performs. Each step in a workflow is a state.

Overview of solution

DataBrew is a service we introduced at AWS re:Invent 2020 in the self-service data preparation space and is focused on the data analyst, data scientist, and self-service audience. We understand that some organizations may have use cases where self-service data preparation needs to be integrated with a standard corporate data pipeline for advanced data transformation and operational reasons. This post provides a solution for customers who are looking for a mechanism to integrate data preparation done by analysts and scientists into the standard AWS Glue ETL pipeline using Step Functions. The following diagram illustrates this workflow.

Architecture overview

To demonstrate the solution, we prepare and transform the publicly available New York Citi Bike trip data to analyze bike riding patterns. The dataset has the following attributes.

Field Name Description
starttime Start time of bike trip
stoptime End time of bike trip
start_station_id Station ID where bike trip started
start_station_name Station name where bike trip started
start_station_latitude Station latitude where bike trip started
start_station_longitude Station longitude where bike trip started
end_station_id Station ID where bike trip ended
end_station_name Station name where bike trip ended
end_station_latitude Station latitude where bike trip ended
end_station_longitude Station longitude where bike trip ended
bikeid ID of the bike used in bike trip
usertype User type (customer = 24-hour pass or 3-day pass user; subscriber = annual member)
birth_year Birth year of the user on bike trip
gender Gender of the user (0 = unknown; 1 = male; 2 = female)

We use DataBrew to prepare and clean the most recent data and then use Step Functions for advanced transformation in AWS Glue ETL.

For the DataBrew steps, we clean up the dataset and remove invalid trips where either the start time or stop time is missing, or the rider’s gender isn’t specified.

After DataBrew prepares the data, we use AWS Glue ETL tasks to add a new column tripduration and populate it with values by subtracting starttime from stoptime.
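
The actual ETL script is deployed by the CloudFormation stack later in this post, but the core of that transformation could look roughly like the following PySpark sketch; the S3 paths and the timestamp handling are illustrative assumptions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the cleaned trips written by DataBrew (placeholder path).
trips = spark.read.option("header", "true").csv("s3://<artifact-bucket>/cleaned/")

# tripduration in seconds = stoptime - starttime; adjust the timestamp parsing
# if your source files use a different format.
trips = trips.withColumn(
    "tripduration",
    F.unix_timestamp("stoptime") - F.unix_timestamp("starttime"),
)

# Write the result in Parquet, the columnar format used later for analytics.
trips.write.mode("overwrite").parquet("s3://<artifact-bucket>/processed/")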

After we perform the ETL transforms and store the data in our Amazon Simple Storage Service (Amazon S3) target location, we use Amazon Athena to run interactive queries on the data to find the most used bikes to schedule maintenance, and the start stations with the most trips to make sure enough bikes are available at these stations.

We also create an interactive dashboard using Amazon QuickSight to gain insights and visualize the data to compare trip count by different rider age groups and user type.

The following diagram shows our solution architecture.

Prerequisites

To follow along with this walkthrough, you must have an AWS account. Your account should have permission to run an AWS CloudFormation script to create the services mentioned in solution architecture.

Your AWS account should also have an active subscription to QuickSight to create the visualization on the processed data. If you don’t have a QuickSight account, you can sign up for one.

Create the required infrastructure using AWS CloudFormation

To get started, complete the following steps:

  1. Choose Launch Stack to launch the CloudFormation stack to configure the required resources in your AWS account.
  2. On the Create stack page, choose Next.
  3. On the Specify stack details page, enter values for Stack name and Parameters, and choose Next.
  4. Follow the remaining instructions using all the defaults and complete the stack creation.

The CloudFormation stack takes approximately 2 minutes to complete.

  1. When the stack is in the CREATE_COMPLETE state, go to the Resources tab of the stack and verify that you have 18 new resources.

In the following sections, we go into more detail of a few of these resources.

Source data and script files S3 bucket

The stack created an S3 bucket with the name formatted as <Stack name>-<SolutionS3BucketNameSuffix>.

On the Amazon S3 console, verify the creation of the bucket with the following folders:

  • scripts – Contains the Python script for the ETL job to process the cleaned data
  • source – Contains the source Citi Bike data to be processed by DataBrew and the ETL job

DataBrew dataset, project, recipe, and job

The CloudFormation stack also created the DataBrew dataset, project, recipe, and job for the solution. Complete the following steps to verify that these resources are available:

  1. On the DataBrew console, choose Projects to see the list of projects.

You should see a new project, associated dataset, attached recipe, and job in the Projects list.

  1. To review the source data, choose the dataset, and choose View dataset on the resulting popup.

You’re redirected to the Dataset preview page.

DataBrew lets you create a dynamic dataset using custom parameters and conditions. This feature helps you automatically process the latest files available in your S3 buckets with a user-friendly interface. For example, you can choose the latest 10 files or files that were created in the last 24 hours that match specific conditions to be automatically included in your dynamic dataset.

  1. Choose the Data profile overview tab to examine and collect summaries of statistics about your data by running the data profile.
  2. Choose Run data profile to create and run a profile job.
  3. Follow the instructions to create and run the job to profile the source data.

The job takes 2–3 minutes to complete.
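
If you prefer to create the profile job programmatically rather than through the console, a minimal sketch using the DataBrew CreateProfileJob API follows; the job, dataset, role, and bucket names are placeholders:

import boto3

databrew = boto3.client("databrew")

databrew.create_profile_job(
    Name="citibike-source-profile",                             # placeholder job name
    DatasetName="<your-citibike-dataset>",                      # placeholder dataset name
    RoleArn="arn:aws:iam::111122223333:role/<databrew-role>",   # placeholder role ARN
    OutputLocation={"Bucket": "<your-bucket>", "Key": "profile-output/"},
)

databrew.start_job_run(Name="citibike-source-profile")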

When the profiling job is complete, you should see Summary, Correlations, Value distribution, and Column Summary sections with more insight statistics about the data, including data quality, on the Data profile overview tab.

  1. Choose the Column statistics tab to see more detailed statistics on individual data columns.

DataBrew provides a user-friendly way to prepare, profile, and visualize the lineage of the data. Appendix A at the end of this post provides details on some of the widely used data profiling and statistical features provided by DataBrew out of the box.

  1. Choose the Data Lineage tab to see the information on data lineage.
  2. Choose PROJECTS in the navigation pane and choose the project name to review the data in the project and the transformations applied through the recipe.

DataBrew provides over 250 transform functions to prepare and transform the dataset. Appendix B at the end of this post reviews some of the most commonly used transformations.

We don’t need to run the DataBrew job manually; we trigger it using a Step Functions state machine in subsequent steps.

AWS Glue ETL job

The CloudFormation stack also created an AWS Glue ETL job for the solution. Complete the following steps to review the ETL job:

  1. On the AWS Glue console, choose Jobs in the navigation pane to see the new ETL job.
  2. Select the job and on the Script tab, choose Edit script to inspect the Python script for the job.

We don’t need to run the ETL job manually; we trigger it using a Step Functions state machine in subsequent steps.

Start the Step Functions state machine

The CloudFormation stack created a Step Functions state machine to orchestrate running the DataBrew job and AWS Glue ETL job. A Lambda function starts the state machine whenever the daily data files are uploaded into the source data folder of the S3 bucket. For this post, we start the state machine manually.

  1. On the Step Functions console, choose State machines in the navigation pane to see the list of state machines.
  2. Choose the state machine to see the state machine details.
  3. Choose Start execution to run the state machine.

The details of the state machine run are displayed on the Details tab.
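
The same run can also be started outside the console with the Step Functions API; the state machine ARN below is a placeholder:

import boto3

sfn = boto3.client("stepfunctions")

execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:<your-state-machine>",
    input="{}",
)

# Poll the execution status (RUNNING, SUCCEEDED, FAILED, and so on).
status = sfn.describe_execution(executionArn=execution["executionArn"])["status"]
print(status)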

  1. Review the Graph inspector section to observe the different states of the state machine. As each step completes, it turns green.
  2. Choose the Definition tab to review the definition of the state machine.

When the state machine is complete, it should have run the DataBrew job to clean the data and the AWS Glue ETL job to process the cleaned data.

The DataBrew job removed all trips in the source dataset with a missing starttime or stoptime, or an unspecified gender. It also copied the cleaned data to the cleaned folder of the S3 bucket. We review the cleaned data in subsequent steps with Athena.

The ETL job processed the data in the cleaned folder to add a calculated column tripduration, which is calculated by subtracting the starttime from the stoptime. It also converted the processed data into columnar format (Parquet), which is more optimized for analytical processing, and copied it to the processed folder. We review the processed data in subsequent steps with Athena and also use it with QuickSight to get some insight into rider behavior.

Run an AWS Glue crawler to create tables in the Data Catalog

The CloudFormation stack also added three AWS Glue crawlers to crawl through the data stored in the source, cleaned, and processed folders. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Complete the following steps to run these crawlers to create AWS Glue tables for the data in each of the S3 folders.

  1. On the AWS Glue console, choose Crawlers in the navigation pane to see the list of crawlers created by the CloudFormation stack. If you have AWS Lake Formation enabled in the Region in which you’re implementing this solution, you may get an insufficient Lake Formation permissions error. Follow the steps in Appendix C to grant the required permissions to the IAM role that the AWS Glue crawlers use to create tables in the citibike database.
  2. Select each crawler one by one and choose Run crawler.

After the crawlers successfully run, you should see one table added by each crawler.

We run crawlers manually in this post, but you can trigger the crawlers whenever a new file is added to their respective S3 bucket folder.

  1. To verify the AWS Glue database citibike, created by the CloudFormation script, choose Databases in the navigation pane.
  2. Select the citibike database and choose View tables.

You should now see the three tables created by the crawlers.

Use Athena to run analytics queries

In the following steps, we use Athena for ad-hoc analytics queries on the cleaned and processed data in the processed_citibike table of the Data Catalog. For this post, we find the 20 most used bikes to schedule maintenance for them, and find the top 20 start stations with the most trips to make sure enough bikes are available at these stations.

  1. On the Athena console, for Data source, choose AwsDataCatalog.
  2. For Database, choose citibike.

The three new tables are listed under Tables.

If you haven’t used Athena before in your account, you receive a message to set up a query result location for Athena to store the results of queries.

  1. Run the following query on the New query1 tab to find the 20 most used bikes:
SELECT bikeid AS BikeID, count(*) AS TripCount FROM "citibike"."processed_citibike" GROUP BY bikeid ORDER BY 2 DESC LIMIT 20;

If you have AWS Lake Formation enabled in the Region in which you’re implementing this solution, you may get an error when running queries. To resolve the permission issues, follow the steps in Appendix D to grant Select permission on all tables in the citibike database to the logged-in user using Lake Formation.

  1. Run the following query on the New query1 tab to find the top 20 start stations with the most trips:
SELECT start_station_name, count(*) AS trip_count FROM "citibike"."processed_citibike" GROUP BY start_station_name ORDER BY 2 DESC LIMIT 20;

Visualize the processed data on a QuickSight dashboard

As the final step, we visualize the following data using a QuickSight dashboard:

  • Compare trip count by different rider age groups
  • Compare trip count by user type (customer = 24-hour pass or 3-day pass user; subscriber = annual member)

As noted in the prerequisites, your AWS account needs an active QuickSight subscription to create the visualization on the processed data.

  1. On the QuickSight console, create a dataset with Athena as your data source.
  2. Follow the instructions to complete your dataset creation.
  3. Add a calculated field rider_age using the following formula:
dateDiff(parseDate({birth_year},"YYYY"), truncDate( "YYYY", now() ) ,"YYYY")

Your dataset should look like the following screenshot.

Now you can create visualizations to compare weekly trip count by user type and total trip count by rider age group.

Clean up

To avoid incurring future charges, delete the resources created for the solution.

  1. Delete the DataBrew profile job created for profiling the source data.
  2. Delete the CloudFormation stack to delete all the resources created by the stack.

Conclusion

In this post, we discussed how to use DataBrew to prepare your data and then further process the data using AWS Glue ETL to integrate it in a standard operational ETL flow to gather insights from your data.

We also walked through how you can use Athena to perform SQL analysis on the dataset and visualize and create business intelligence reports through QuickSight.

We hope this post provides a good starting point for you to orchestrate your DataBrew job with your existing or new data processing ETL pipelines.

For more details on using DataBrew with Step Functions, see Manage AWS Glue DataBrew Jobs with Step Functions.

For more information on DataBrew jobs, see Creating, running, and scheduling AWS Glue DataBrew jobs.

Appendix A

The following table lists the widely used data profiling and statistical features provided by DataBrew out of the box.

Type Data type of the column
missingValuesCount The number of missing values. Null and empty strings are considered as missing.
distinctValuesCount The number of distinct values that appear at least once.
entropy A measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
mostCommonValues A list of the top 50 most common values.
mode The value that appears most often in the column.
min/max/range The minimum, maximum, and range values in the column.
mean The mean or average value of the column.
kurtosis A measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution. A dataset with high kurtosis has more outliers compared to a dataset with low kurtosis.
skewness A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Correlation The Pearson correlation coefficient, a measure of whether one column’s values correlate with the values of another column.
Percentile95 The element in the list that represents the 95th percentile (95% of numbers fall below this and 5% of numbers fall above it).
interquartileRange The range between the 25th percentile and 75th percentile of numbers.
standardDeviation The unbiased sample standard deviation of values in the column.
min/max Values A list of the five minimum and maximum values in a column.
zScoreOutliersSample A list of the top 50 outliers that have the largest or smallest Z-score. -/+3 is the default threshold.
valueDistribution The measure of the distribution of values by range.

Appendix B

DataBrew provides the ability to prepare and transform your dataset using over 250 transforms. In this section, we discuss some of the most commonly used transformations:

  • Combine datasets – You can combine datasets in the following ways:
    • Join – Combine several datasets by joining them with other datasets using a join type like inner join, outer join, or excluding join.
    • Union operation – Combine several datasets using a union operation.
    • Multiple files for input datasets – While creating a dataset, you can use a parameterized Amazon S3 path or a dynamic parameter and select multiple files.
  • Aggregate data – You can aggregate the dataset using a group by clause and use standard and advanced aggregation functions like Sum, Count, Min, Max, mode, standard deviation, variance, skewness, kurtosis, as well as cumulative functions like cumulative sum and cumulative count.
  • Pivot a dataset – You can pivot the dataset in the following ways:
    • Pivot operation – Convert all rows to columns.
    • Unpivot operation – Convert all columns to rows.
    • Transpose operation – Convert all selected rows to columns and columns to rows.
  • Unnest top-level value – You can extract values from arrays into new rows and columns. This only operates on top-level values.
  • Outlier detection and handling – This transformation works with outliers in your data and performs advanced transformations on them like flag outliers, rescale outliers, and replace or remove outliers. You can use several strategies like Z-score, modified Z-score, and interquartile range (IQR) to detect outliers.
  • Delete duplicate values – You can delete any row that is an exact match to an earlier row in the dataset.
  • Handle or impute missing values – You have the following options:
    • Remove invalid records – Delete an entire row if an invalid value is encountered in a column of that row.
    • Replace missing values – Replace missing values with custom values, most frequent value, last valid value, or numeric aggregate values.
  • Filter data – You can filter the dataset based on a custom condition, validity of a column, or missing values.
  • Split or merge columns – You can split a column into multiple columns based on a custom delimiter, or merge multiple columns into a single column.
  • Create columns based on functions – You can create new columns using different functions like mathematical functions, aggregate functions, date functions, text functions, window functions like next and previous, and rolling aggregation functions like rolling sum, rolling count, and rolling mean.

Appendix C

If you have AWS Lake Formation enabled in the Region in which you’re implementing this solution, you may get an insufficient Lake Formation permissions error during the AWS Glue crawler run.


Grant Create table permission on the GlueDatabase (citibike) to the GlueCrawlersRole (you can find its ARN in the CloudFormation stack’s Resources section) so that the crawlers can create the required tables.

Appendix D

If you have AWS Lake Formation enabled in the Region in which you’re implementing this solution, you may get an error similar to the following when running queries on tables in the citibike database.

Grant Select permission on all tables in the GlueDatabase (citibike) to the logged-in user in Lake Formation so that the user can query the tables created by the solution.


About the Authors

Narendra Gupta is a solutions architect at AWS, helping customers on their cloud journey with a focus on AWS analytics services. Outside of work, Narendra enjoys learning new technologies, watching movies, and visiting new places.

Durga Mishra is a solutions architect at AWS. Outside of work, Durga enjoys spending time with family and loves to hike on Appalachian trails and spend time in nature.

Jay Palaniappan is a Sr. Analytics Specialist Solutions architect at AWS, helping media & entertainment customers adopt and run AWS Analytics services.

Centralize feature engineering with AWS Step Functions and AWS Glue DataBrew

Post Syndicated from Gayatri Ghanakota original https://aws.amazon.com/blogs/big-data/centralize-feature-engineering-with-aws-step-functions-and-aws-glue-databrew/

One of the key phases of a machine learning (ML) workflow is data preprocessing, which involves cleaning, exploring, and transforming the data. AWS Glue DataBrew, announced at AWS re:Invent 2020, is a visual data preparation tool that enables you to develop common data preparation steps without writing any code or installing anything.

In this post, we show how to integrate the standard data preparation steps with training an ML model and running inference on a pre-trained model via DataBrew and AWS Step Functions. The solution is architected as an ML pipeline that trains a model on the publicly available Air Quality dataset to predict CO levels in New York City.

Overview of solution

The following architecture diagram shows an overview of the ML workflow, which employs DataBrew for data preparation and scheduling jobs, and uses AWS Lambda and Step Functions to orchestrate ML model training and inference using the AWS Step Functions Data Science SDK. We use Amazon EventBridge to trigger the Step Functions state machine when the DataBrew job is complete.


The steps in this solution are as follows:

  1.  Import your dataset to Amazon Simple Storage Service (Amazon S3).
  2.  Launch the AWS CloudFormation stack, which deploys the following:
    1. DataBrew recipes for training and inference data.
    2.  The DataBrew job’s schedule for training and inference.
    3.  An EventBridge rule.
    4. A Lambda function that triggers the Step Functions state machine, which in turn orchestrates the states.
    5. The training state includes the following steps:
      1. Runs an Amazon SageMaker processing job to remove column headers.
      2. Performs SageMaker model training.
      3. Outputs the data to an S3 bucket to store the trained model.
    6. The inference state includes the following steps:
      1. Runs a SageMaker processing job to remove column headers.
      2. Performs a SageMaker batch transform.
      3. Outputs the data to an S3 bucket to store the predictions.

Prerequisites

For this solution, you should have the following prerequisites:

Load the dataset to Amazon S3

In this first step, we load our air quality dataset into Amazon S3.

  1. Download the Outdoor Air Quality Dataset for the years 2018, 2019, and 2020, limiting to the following options:
    1. Pollutant – CO
    2. Geographic Area – New York
    3. Monitor Site – All Sites
  2. For each year of data, split by year, month, and day, and use the data for 2018–2019 to train the model and the 2020 data to run inference.
  3. Run the following script, which stores the output into the NY_XXXX folder:
import os
import pandas as pd

def split_data(root_folder, df):
    # Create year, month, and day columns from the Date column
    df["year"] = pd.DatetimeIndex(df["Date"]).year
    df["month"] = pd.DatetimeIndex(df["Date"]).month
    df["day"] = pd.DatetimeIndex(df["Date"]).day
    if not os.path.exists(root_folder):
        os.mkdir(root_folder)
    # Write one CSV file per month/day combination under root_folder/MM/DD/
    for m, x1 in df.groupby("month"):
        month_dir = "{:02}".format(m)
        if not os.path.exists(root_folder + "/" + month_dir):
            os.mkdir(root_folder + "/" + month_dir)
        for d, x2 in x1.groupby("day"):
            day_dir = "{:02}".format(d)
            if not os.path.exists(root_folder + "/" + month_dir + "/" + day_dir):
                os.mkdir(root_folder + "/" + month_dir + "/" + day_dir)
            p = os.path.join(root_folder + "/" + month_dir + "/" + day_dir, "{:02}.csv".format(d))
            x2.to_csv(p, index=False)

ny_data_2018 = pd.read_csv("<path to downloaded 2018 data file>")
ny_data_2019 = pd.read_csv("<path to downloaded 2019 data file>")
ny_data_2020 = pd.read_csv("<path to downloaded 2020 data file>")

split_data("NY_2018", ny_data_2018)
split_data("NY_2019", ny_data_2019)
split_data("NY_2020", ny_data_2020)
  1. Create an S3 bucket in the us-east-1 Region and upload the folders NY_2018 and NY_2019 to the path s3://<artifactbucket>/train_raw_data/.


  1. Upload the folder NY_2020 to s3://<artifactbucket>/inference_raw_data/.


Deploy your resources

For a quick start of this solution, you can deploy the provided AWS CloudFormation stack. This creates all the required resources in your account (us-east-1 Region), including the DataBrew datasets, jobs, projects, and recipes; the Step Functions train and inference state machines (which include SageMaker processing, model training, and batch transform jobs); an EventBridge rule; and the Lambda function to deploy an end-to-end ML pipeline for a predefined S3 bucket.

  1. Launch the following stack:
  2. For ArtifactBucket, enter the name of the S3 bucket you created in the previous step.


  1. Select the three acknowledgement check boxes.
  2. Choose Create stack.


Test the solution

As part of the CloudFormation template, the DataBrew job km-mlframework-trainingfeatures-job was created, which is scheduled to run every Monday at 10:00 AM UTC. This job creates the features required to train the model.

When the template deployment is successfully completed, you can manually activate the training pipeline. For this, navigate to the DataBrew console, select the DataBrew job km-mlframework-trainingfeatures-job, and choose Run job.


The job writes the features to s3://<artifactbucket>/train_features/.

When the job is complete, an EventBridge rule invokes the Lambda function, which orchestrates the SageMaker training jobs via Step Functions.
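
The stack’s Lambda function is not reproduced here, but a handler for this pattern could look like the following sketch. The event detail fields and the STATE_MACHINE_ARN environment variable are assumptions for illustration, not the stack’s actual implementation:

import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # EventBridge delivers the DataBrew job state-change event; the detail
    # field names used here are assumptions and should be checked against
    # the events your rule actually matches.
    detail = event.get("detail", {})
    if detail.get("state") != "SUCCEEDED":
        return {"started": False}

    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # hypothetical env var
        input=json.dumps({"databrewJob": detail.get("jobName", "")}),
    )
    return {"started": True, "executionArn": response["executionArn"]}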


When the job is complete, the output of the model is stored in s3://<artifactbucket>/artifact-repo/model/.

In the next step, we trigger the DataBrew job km-mlframework-inferencefeatures-job, which is scheduled to run every Tuesday at 10:00 AM UTC. This job creates the inference features that are used to run inference on the trained model.

You can also activate the inference pipeline by manually triggering the DataBrew job on the DataBrew console.


The job writes the features to s3://<artifactbucket>/inference_features/.

When the job is complete, an EventBridge rule invokes the Lambda function, which orchestrates the SageMaker batch transform job via Step Functions.


When the job is complete, the predictions are written to s3://<artifactbucket>/predictions/.

For more information on DataBrew steps and building a DataBrew recipe, see Preparing data for ML models using AWS Glue DataBrew in a Jupyter notebook.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Wait for any currently running activity to complete, or manually stop it (DataBrew, Step Functions, SageMaker).
  2. Delete the scheduled DataBrew jobs km-mlframework-trainingfeatures-job and km-mlframework-inferencefeatures-job. This ensures the jobs aren’t started by the schedule.
  3. Delete the S3 bucket created to store data and model artifacts.
  4. Delete the CloudFormation stack created earlier.

Conclusion

DataBrew is designed to help data engineers and data scientists experiment with data preparation steps via a visual interface. With more than 250 built-in transformations, DataBrew can be a strong tool to accelerate your ML lifecycle for development and production stages.

In this post, we walked through the process of creating an end-to-end ML framework with DataBrew, which you can use to train an ML model as well as run inferences on a schedule. You can use the same framework with your own DataBrew recipe prepared using any dataset.

To learn more on applying the most frequently used transformations from within DataBrew, see 7 most common data preparation transformations in AWS Glue DataBrew.


About the Authors

Gayatri Ghanakota is a Machine Learning Engineer with AWS Professional Services, where she helps customers build machine learning solutions on AWS. She is passionate about developing, deploying, and explaining ML models.

Surbhi Dangi is a product and design leader at Amazon Web Services. She focuses on providing ease of use and rich functionality for both of her analytics and monitoring products, Amazon CloudWatch Synthetics and AWS Glue DataBrew. When not working, she mentors aspiring product managers, hikes, and travels the world.

How to Accelerate Building a Lake House Architecture with AWS Glue

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/how-to-accelerate-building-a-lake-house-architecture-with-aws-glue/

Customers are building databases, data warehouses, and data lake solutions in isolation from each other, each having its own separate data ingestion, storage, management, and governance layers. Often these disjointed efforts to build separate data stores end up creating data silos, data integration complexities, excessive data movement, and data consistency issues. These issues are preventing customers from getting deeper insights. To overcome these issues and easily move data around, a Lake House approach on AWS was introduced.

In this blog post, we illustrate the AWS Glue integration components that you can use to accelerate building a Lake House architecture on AWS. We will also discuss how to derive persona-centric insights from your Lake House using AWS Glue.

Components of the AWS Glue integration system

AWS Glue is a serverless data integration service that facilitates the discovery, preparation, and combination of data. It can be used for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration. So you can start analyzing your data and putting it to use in minutes, rather than months.

The following diagram illustrates the various components of the AWS Glue integration system.

Figure 1. AWS Glue integration components

Connect – AWS Glue allows you to connect to various data sources anywhere

Glue connector: AWS Glue provides built-in support for the most commonly used data stores. You can use Amazon Redshift, Amazon RDS, Amazon Aurora, Microsoft SQL Server, MySQL, MongoDB, or PostgreSQL using JDBC connections. AWS Glue also allows you to use custom JDBC drivers in your extract, transform, and load (ETL) jobs. For data stores that are not natively supported such as SaaS applications, you can use connectors. You can also subscribe to several connectors offered in the AWS Marketplace.

Glue crawlers: You can use a crawler to populate the AWS Glue Data Catalog with tables. A crawler can crawl multiple data stores in a single pass. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

Catalog – AWS Glue simplifies data discovery and governance

Glue Data Catalog: The Data Catalog serves as the central metadata catalog for the entire data landscape.

Glue Schema Registry: The AWS Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. With AWS Glue Schema Registry, you can manage and enforce schemas on your data streaming applications.

Data quality – AWS Glue helps you author and monitor data quality rules

Glue DataBrew: AWS Glue DataBrew allows data scientists and data analysts to clean and normalize data. You can use a visual interface, reducing the time it takes to prepare data by up to 80%. With Glue DataBrew, you can visualize, clean, and normalize data directly from your data lake, data warehouses, and databases.

Curate data: You can use either an AWS Glue development endpoint or AWS Glue Studio to curate your data.

AWS Glue development endpoint is an environment that you can use to develop and test your AWS Glue scripts. You can choose either Amazon SageMaker notebook or Apache Zeppelin notebook as an environment.

AWS Glue Studio is a new visual interface for AWS Glue that supports extract-transform-and-load (ETL) developers. You can author, run, and monitor AWS Glue ETL jobs. You can now use a visual interface to compose jobs that move and transform data, and run them on AWS Glue.

AWS Data Exchange makes it easy for AWS customers to securely exchange and use third-party data in AWS. This is for data providers who want to structure their data across multiple datasets or enrich their products with additional data. You can publish additional datasets to your products using the AWS Data Exchange.

Deequ is an open-source data quality library developed internally at Amazon. It provides multiple features such as automatic constraint suggestions and verification, metrics computation, and data profiling.

Build a Lake House architecture faster, using AWS Glue

Figure 2 illustrates how you can build a Lake House using AWS Glue components.

Figure 2. Building Lake House architectures with AWS Glue

The architecture flow follows these general steps:

  1. Glue crawlers scan the data from various data sources and populate the Data Catalog for your Lake House.
  2. The Data Catalog serves as the central metadata catalog for the entire data landscape.
  3. Once data is cataloged, fine-grained access control is applied to the tables through AWS Lake Formation.
  4. Curate your data with business and data quality rules by using Glue Studio, Glue development endpoints, or Glue DataBrew. Place the transformed data in a curated Amazon S3 bucket for purpose-built analytics downstream.
  5. Facilitate data movement with AWS Glue to and from your data lake, databases, and data warehouse by using Glue connections. Use AWS Glue Elastic Views to replicate the data across the Lake House.

Derive persona-centric insights from your Lake House using AWS Glue

Many organizations want to gather observations from increasingly larger volumes of acquired data. These insights help them make data-driven decisions with speed and agility. They must use a central data lake, a ring of purpose-built data services, and data warehouses based on persona or job function.

Figure 3 illustrates the Lake House inside-out data movement with AWS Glue DataBrew, Amazon Athena, Amazon Redshift, and Amazon QuickSight to perform persona-centric data analytics.

Figure 3. Lake House persona-centric data analytics using AWS Glue

This shows how Lake House components serve various personas in an organization:

  1. Data ingestion: Data is ingested to Amazon Simple Storage Service (S3) from different sources.
  2. Data processing: Data curators and data scientists use DataBrew to validate, clean, and enrich the data. Amazon Athena is also used to run ad hoc queries to analyze the data in the lake. The transformation is shared with data engineers to set up batch processing.
  3. Batch data processing: Data engineers or developers set up batch jobs in AWS Glue and AWS Glue DataBrew. Jobs can be initiated by an event, or can be scheduled to run periodically.
  4. Data analytics: Data and business analysts can now analyze the prepared dataset in Amazon Redshift or in Amazon S3 using Athena.
  5. Data visualizations: Business analysts can create visuals in QuickSight. Data curators can enrich data from multiple sources. Admins can enforce security and data governance. Developers can embed QuickSight dashboard in applications.

Conclusion

Using a Lake House architecture will help you get persona-centric insights quickly from all of your data based on user role or job function. In this blog post, we describe several AWS Glue components and AWS purpose-built services that you can use to build Lake House architectures on AWS. We have also presented persona-centric Lake House analytics architecture using AWS Glue, to help you derive insights from your Lake House.

Read more and get started on building Lake House Architectures on AWS.

Data preparation using an Amazon RDS for MySQL database with AWS Glue DataBrew

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/data-preparation-using-an-amazon-rds-for-mysql-database-with-aws-glue-databrew/

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, or Amazon Aurora and Amazon Relational Database Service (Amazon RDS) databases. You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an RDS database, store the cleaned data in an S3 data lake, and build a business intelligence (BI) report.

Use case overview

For our use case, we use three datasets:

  • A school dataset that contains school details like school ID and school name
  • A student dataset that contains student details like student ID, name, and age
  • A student study details dataset that contains student study time, health, country, and more

The following diagram shows the relation of these tables.

For our use case, this data is collected by a survey organization after an annual exam, and updates are made in Amazon RDS for MySQL using a JavaScript-based frontend application. We join the tables to create a single view and create aggregated data through a series of data preparation steps, and the business team uses the output data to create BI reports.

Solution overview

The following diagram illustrates our solution architecture. We use Amazon RDS to store data, DataBrew for data preparation, Amazon Athena for data analysis with standard SQL, and Amazon QuickSight for business reporting.

The workflow includes the following steps:
  1. Create a JDBC connection for RDS and a DataBrew project. DataBrew does the transformation to find the top performing students across all the schools considered for analysis.
  2. The DataBrew job writes the final output to our S3 output bucket.
  3. After the output data is written, we can create external tables on top of it with Athena CREATE TABLE statements and load partitions with MSCK REPAIR TABLE commands (see the sketch after this list).
  4. Business users can use QuickSight for BI reporting, which fetches data through Athena. Data analysts can also use Athena to analyze the complete refreshed dataset.
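
As a rough illustration of step 3, the following sketch submits a CREATE EXTERNAL TABLE statement and an MSCK REPAIR TABLE command through the Athena API. The database, table definition, and S3 locations are assumptions to adapt to your own job output; the student_db database must already exist in the Data Catalog, and the repair command only matters if the output is partitioned:

import boto3

athena = boto3.client("athena")

# Illustrative DDL; align the columns and LOCATION with your DataBrew job output.
create_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS student_db.top_performer_student (
  student_id INT,
  student_name STRING,
  school_name STRING,
  marks INT
)
PARTITIONED BY (year STRING)
STORED AS PARQUET
LOCATION 's3://<output-bucket>/top-performer-student/'
"""

for statement in (create_table, "MSCK REPAIR TABLE student_db.top_performer_student"):
    athena.start_query_execution(
        QueryString=statement,
        ResultConfiguration={"OutputLocation": "s3://<output-bucket>/athena-results/"},  # placeholder
    )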

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use three mock datasets. You can download the DDL code and data files from GitHub.

  1. Create the RDS for MySQL instance to capture the student health data.
  2. Make sure you have set up the correct security group for Amazon RDS. For more information, see Setting Up a VPC to Connect to JDBC Data Stores.
  3. Create three tables: student_tbl, study_details_tbl, and school_tbl. You can use the DDL SQL file to create the database objects.
  4. Upload the student.csv, study_details.csv, and school.csv files in their respective tables. You can use student.sql, study_details.sql, and school.sql to insert the data in the tables.

Create an Amazon RDS connection

To create your Amazon RDS connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.

  1. For Connection name, enter a name (for example, student_db-conn).
  2. For Connection type, select JDBC.
  3. For Database type, choose MySQL.

  1. Provide other parameters like RDS endpoint, port, database name, and database login credentials.

  1. In the Network options section, choose the VPC, subnet, and security group of your RDS instance.
  2. Choose Create connection.

Create your datasets

We have three tables in Amazon RDS: school_tbl, student_tbl, and study_details_tbl. To use these tables, we first need to create a dataset for each table.

To create the datasets, complete the following steps (we walk you through creating the school dataset):

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.

  1. For Dataset name, enter school-dataset.
  2. Choose the connection you created (AwsGlueDatabrew-student-db-conn).
  3. For Table name, enter school_tbl.
  4. Choose Create dataset.

  1. Repeat these steps for the student_tbl and study_details_tbl tables, and name the new datasets student-dataset and study-detail-dataset, respectively.

All three datasets are available to use on the Datasets page.

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project Name, enter my-rds-proj.
  4. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  1. For Select a dataset, select My datasets.
  2. For Dataset name, select study-detail-dataset.

  1. For Role name, choose your AWS Identity and Access Management (IAM) role to use with DataBrew.
  2. Choose Create project.

You can see a success message along with our RDS study_details_tbl table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Open an Amazon RDS project and build a transformation recipe

In a DataBrew interactive session, you can cleanse and normalize your data using over 250 built-in transforms. In this post, we use DataBrew to identify top performing students by performing a few transforms and finding students who got marks greater than or equal to 60 in the last annual exam.

First, we use DataBrew to join all three RDS tables. To do this, we perform the following steps:

  1. Navigate to the project you created.
  2. Choose Join.

  1. For Select dataset, choose student-dataset.
  2. Choose Next.

  1. For Select join type, select Left join.
  2. For Join keys, choose student_id for Table A and deselect student_id for Table B.
  3. Choose Finish.

Repeat the steps for school-dataset based on the school_id key.

  1. Choose MERGE to merge first_name and last_name.
  2. Enter a space as a separator.
  3. Choose Apply.

We now filter the rows to keep students with a marks value greater than or equal to 60 and add the condition as a recipe step.

  1. Choose FILTER.

  1. Provide the source column and filter condition and choose Apply.

The final data shows the top-performing students, those who scored marks greater than or equal to 60.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter top-performer-student.

For this post, we use Parquet as the output format.

  1. For File type, choose PARQUET.
  2. For S3 location, enter the S3 path of the output folder.

  1. For Role name, choose an existing role or create a new one.
  2. Choose Create and run job.

  1. Navigate to the Jobs page and wait for the top-performer-student job to complete.

  1. Choose the Destination link to navigate to Amazon S3 to access the job output.
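You can also create and start the same recipe job with the AWS SDK for Python (Boto3). A minimal sketch, assuming the recipe has been published and using placeholder bucket, key, and role values:

import boto3

databrew = boto3.client("databrew")

# Create a recipe job that writes Parquet output to Amazon S3.
databrew.create_recipe_job(
    Name="top-performer-student",
    DatasetName="study-detail-dataset",
    RecipeReference={
        "Name": "my-rds-proj-recipe",          # recipe created with the project
        "RecipeVersion": "LATEST_PUBLISHED",   # publish the recipe first
    },
    Outputs=[
        {
            "Location": {"Bucket": "<output-bucket>", "Key": "top-performer-student/"},
            "Format": "PARQUET",
        }
    ],
    RoleArn="arn:aws:iam::<account-id>:role/<databrew-role>",  # placeholder role ARN
)

# Start the job and capture the run ID so you can poll its status later.
run = databrew.start_job_run(Name="top-performer-student")
print(run["RunId"])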

Run an Athena query

Let’s validate the aggregated table output in Athena by running a simple SELECT query. The following screenshot shows the output.
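If you want to script that validation, a Boto3 sketch follows. The database name, table name, and query result location are assumptions based on how you cataloged the job output, and MSCK REPAIR TABLE only applies if the table is partitioned:

import boto3

athena = boto3.client("athena")

# Load any new partitions, then preview the aggregated output.
# Database, table, and result location are placeholders.
for sql in [
    "MSCK REPAIR TABLE top_performer_student",          # only needed for partitioned tables
    "SELECT * FROM top_performer_student LIMIT 10",
]:
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "student_db"},
        ResultConfiguration={"OutputLocation": "s3://<athena-results-bucket>/queries/"},
    )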

Create reports in QuickSight

Now let’s do our final step of the architecture, which is creating BI reports through QuickSight by connecting to the Athena aggregated table.

  1. On the QuickSight console, choose Athena as your data source.

  1. Choose the database and catalog you have in Athena.
  2. Select your table.
  3. Choose Select.

Now you can create a quick report to visualize your output, as shown in the following screenshot.

If QuickSight uses SPICE storage, you need to refresh the dataset in QuickSight after the DataBrew job refreshes the underlying data. We recommend using SPICE storage for better performance.
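One way to trigger that SPICE refresh programmatically is the QuickSight CreateIngestion API; a Boto3 sketch, with a placeholder account ID and dataset ID:

import boto3
import uuid

quicksight = boto3.client("quicksight")

# Kick off a full SPICE refresh for the QuickSight dataset.
quicksight.create_ingestion(
    AwsAccountId="<account-id>",             # placeholder
    DataSetId="<quicksight-dataset-id>",     # placeholder
    IngestionId=str(uuid.uuid4()),           # any unique ID for this refresh
)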

Clean up

Delete the following resources that might accrue cost over time:

  • The RDS instance
  • The recipe job top-performer-student
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project my-rds-proj and its associated recipe my-rds-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to create a JDBC connection for an RDS database. We learned how to use this connection to create a DataBrew dataset for each table, and how to reuse this connection multiple times. We also saw how we can bring data from Amazon RDS into DataBrew and seamlessly apply transformations and run recipe jobs that refresh transformed data for BI reporting.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Data preparation using Amazon Redshift with AWS Glue DataBrew

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/data-preparation-using-amazon-redshift-with-aws-glue-databrew/

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, Amazon Aurora, and other Amazon Relational Database Service (Amazon RDS) databases. You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an Amazon Redshift table, and transform and use different feature engineering techniques to prepare data to build a machine learning (ML) model. Finally, we store the transformed data in an S3 data lake to build the ML model in Amazon SageMaker.

Use case overview

For our use case, we use mock student datasets that contain student details like school, student ID, name, age, student study time, health, country, and marks. The following screenshot shows an example of our data.

For our use case, the data scientist uses this data to build an ML model to predict a student’s score in the upcoming annual exam. However, this raw data requires cleaning and transformation. A data engineer must perform the required data transformation so the data scientist can use the transformed data to build the model in SageMaker.

Solution overview

The following diagram illustrates our solution architecture.

The workflow includes the following steps:

  1. Create a JDBC connection for Amazon Redshift and a DataBrew project.
  2. DataBrew queries sample student performance data from Amazon Redshift and performs the transformation and feature engineering needed to prepare the data for building the ML model.
  3. The DataBrew job writes the final output to our S3 output bucket.
  4. The data scientist builds the ML model in SageMaker to predict student marks in an upcoming annual exam.

We cover steps 1–3 in this post.

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use a mock dataset. You can download the DDL and data files from GitHub.

  1. Create the Amazon Redshift cluster to capture the student performance data.
  2. Set up a security group for Amazon Redshift.
  3. Create a schema called student_schema and a table called study_details. You can use the downloaded DDL script to create the database objects.
  4. We recommend using the COPY command to load a table in parallel from data files on Amazon S3 (see the sketch after this list). However, for this post, you can use study_details.sql to insert the data in the tables.
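A minimal sketch of that COPY load using the Amazon Redshift Data API with Boto3 follows; the cluster identifier, database, user, IAM role, and S3 path are placeholders:

import boto3

redshift_data = boto3.client("redshift-data")

# Load the study_details table in parallel from a CSV file on Amazon S3.
copy_sql = """
    COPY student_schema.study_details
    FROM 's3://<data-bucket>/study_details.csv'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-s3-role>'
    CSV
    IGNOREHEADER 1;
"""

redshift_data.execute_statement(
    ClusterIdentifier="<redshift-cluster-id>",  # placeholder
    Database="<database-name>",                 # placeholder
    DbUser="<db-user>",                         # or use SecretArn instead
    Sql=copy_sql,
)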

Create an Amazon Redshift connection

To create your Amazon Redshift connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.
  3. For Connection name, enter a name (for example, student-db-connection).
  4. For Connection type, select JDBC.
  5. Provide other parameters like the JDBC URL and login credentials.
  6. In the Network options section, choose the VPC, subnet, and security groups of your Amazon Redshift cluster.
  7. Choose Create connection.

Create datasets

To create the datasets, complete the following steps:

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.
  2. For Dataset name, enter a name (for example, student).
  3. For Your JDBC source, choose the connection you created (AwsGlueDatabrew-student-db-connection).
  4. Select the study_details table.
  5. For Enter S3 destination, enter an S3 bucket for Amazon Redshift to store the intermediate result.
  6. Choose Create dataset.

You can also configure a lifecycle rule to automatically clean up old files from the S3 bucket.
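For example, the following Boto3 sketch expires objects under an assumed temporary prefix after seven days; the bucket name and prefix are placeholders:

import boto3

s3 = boto3.client("s3")

# Expire the intermediate files written under the temporary prefix.
s3.put_bucket_lifecycle_configuration(
    Bucket="<intermediate-bucket>",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-databrew-temp",
                "Filter": {"Prefix": "databrew-temp/"},  # assumed prefix
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)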

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, on the Projects page, choose Create project.
  2. For Project Name, enter student-proj.
  3. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  1. For Select a dataset, select My datasets.
  2. Select the student dataset.
  3. For Role name, choose the AWS Identity and Access Management (IAM) role to be used with DataBrew.
  4. Choose Create project.

You can see a success message along with our Amazon Redshift study_details table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Create a profiling job

DataBrew helps you evaluate the quality of your data by profiling it to understand data patterns and detect anomalies.

To create your profiling job, complete the following steps:

  1. On the DataBrew console, choose Jobs in the navigation pane.
  2. On the Profile jobs tab, choose Create job.
  3. For Job name, enter student-profile-job.
  4. Choose the student dataset.
  5. Provide the S3 location for job output.
  6. For Role name, choose the role to be used with DataBrew.
  7. Choose Create and run job.

Wait for the job to complete.

  1. Choose the Columns statistics tab.

You can see that the age column has some missing values.

You can also see that the study_time_in_hr column has two outliers.
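If you prefer to automate profiling, the same job can be created and started with the AWS SDK for Python (Boto3); a minimal sketch, with placeholder bucket and role values:

import boto3

databrew = boto3.client("databrew")

# Create a profile job for the student dataset and start a run.
databrew.create_profile_job(
    Name="student-profile-job",
    DatasetName="student",
    OutputLocation={"Bucket": "<profile-output-bucket>", "Key": "profiles/"},
    RoleArn="arn:aws:iam::<account-id>:role/<databrew-role>",  # placeholder role ARN
)

databrew.start_job_run(Name="student-profile-job")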

Build a transformation recipe

All ML algorithms use input data to generate outputs. Input data comprises features usually in structured columns. To work properly, the features need to have specific characteristics. This is where feature engineering comes in. In this section, we perform some feature engineering techniques to prepare our dataset to build the model in SageMaker.

Let’s drop the columns from our dataset that aren’t required for model building.

  1. Choose Column and choose Delete.
  2. For Source columns, choose the columns school_name, first_name, and last_name.
  3. Choose Apply.

We know from the profiling report that the age value is missing in two records. Let’s fill in the missing value with the median age of other records.

  1. Choose Missing and choose Fill with numeric aggregate.
  2. For Source column, choose age.
  3. For Numeric aggregate, choose Median.
  4. For Apply transform to, select All rows.
  5. Choose Apply.

We know from the profiling report that the study_time_in_hr column has two outliers, which we can remove.

  1. Choose Outliers and choose Remove outliers.
  2. For Source column, choose study_time_in_hr.
  3. Select Z-score outliers.
  4. For Standard deviation threshold, choose 3.
  5. Select Remove outliers.
  6. Under Remove outliers, select All outliers.
  7. Under Outlier removal options, select Delete outliers.
  8. Choose Apply.
  9. Choose Delete rows, then choose Apply.

The next step is to convert the categorical value to a numerical value for the gender column.

  1. Choose Mapping and choose Categorical mapping.
  2. For Source column, choose gender.
  3. For Mapping options, select Map top 1 values.
  4. For Map values, select Map values to numeric values.
  5. For M, choose 1.
  6. For Others, choose 2.
  7. For Destination column, enter gender_mapped.
  8. For Apply transform to, select All rows.
  9. Choose Apply.

ML algorithms often can’t work on categorical (label) data directly; they require the input variables to be numeric. One-hot encoding is one technique that converts categorical data without an ordinal relationship into numeric data.

To apply one-hot encoding, complete the following steps:

  1. Choose Encode and choose One-hot encode column.
  2. For Source column, choose health.
  3. For Apply transform to, select All rows.
  4. Choose Apply.

The following screenshot shows the full recipe that we applied to our dataset before we can use it to build our model in SageMaker.
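If it helps to see the logic as code, the following pandas sketch is roughly equivalent to the recipe steps above. It is purely illustrative (DataBrew itself requires no code): the source file is a placeholder, the column names follow this section, and details such as one-hot column naming may differ from DataBrew’s output.

import pandas as pd

# Placeholder source; DataBrew reads the table from Amazon Redshift instead.
df = pd.read_csv("study_details.csv")

# Drop columns that aren't needed for model building.
df = df.drop(columns=["school_name", "first_name", "last_name"])

# Fill missing ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())

# Remove rows where study_time_in_hr is more than 3 standard deviations from the mean.
z = (df["study_time_in_hr"] - df["study_time_in_hr"].mean()) / df["study_time_in_hr"].std()
df = df[z.abs() <= 3]

# Map the categorical gender column to numeric values (M -> 1, others -> 2).
df["gender_mapped"] = df["gender"].map(lambda g: 1 if g == "M" else 2)

# One-hot encode the health column.
df = pd.get_dummies(df, columns=["health"])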

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter student-performance.

We use CSV as the output format.

  1. For File type, choose CSV.
  2. For Role name, choose an existing role or create a new one.
  3. Choose Create and run job.
  4. Navigate to the Jobs page and wait for the student-performance job to complete.
  5. Choose the Destination link to navigate to Amazon S3 to access the job output.

Clean up

Delete the following resources that might accrue cost over time:

  • The Amazon Redshift cluster
  • The recipe job student-performance
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project student-proj and its associated recipe student-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to create a JDBC connection for an Amazon Redshift data warehouse. We learned how to use this connection to create a DataBrew dataset for an Amazon Redshift table. We also saw how easily we can bring data from Amazon Redshift into DataBrew, seamlessly apply transformations and feature engineering techniques, and run recipe jobs that refresh the transformed data for ML model building in SageMaker.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.


Building a Showback Dashboard for Cost Visibility with Serverless Architectures

Post Syndicated from Peter Chung original https://aws.amazon.com/blogs/architecture/building-a-showback-dashboard-for-cost-visibility-with-serverless-architectures/

Enterprises with centralized IT organizations and multiple lines of business frequently use showback or chargeback mechanisms to hold their departments accountable for their technology usage and costs. Chargeback involves actually billing a department for the cost of their division’s usage. Showback focuses on visibility to make the department more cost conscious and to encourage operational efficiency.

Building a showback mechanism can be potentially challenging for business and financial analysts of an AWS Organization. You may not have the scripting or data engineering skills needed to coordinate workflows and build reports at scale. Although you can use AWS Cost Explorer as a starting point, you may want greater customizability, larger datasets beyond a one-year period, and more of a business intelligence (BI) experience.

In this post, we discuss the benefits of building a showback dashboard using the AWS Cost and Usage Report (AWS CUR). You can track costs by cost center, business unit, or project using managed services. Using a showback strategy, you can consolidate and present costs to a business unit to show resource use over a set period of time for your entire AWS Organization. Building this solution with managed services allows you to spend time understanding your costs rather than maintaining the underlying infrastructure.

This solution highlights AWS Glue DataBrew to prepare your data into the appropriate format for dashboards. We recommend DataBrew because it provides a no-code environment for data transformation. It allows anyone to create dashboards similar to those built in the Cloud Intelligence Dashboards Workshop for your Organization.

Figure 1. QuickSight showback dashboard using CUR data transformed by Glue DataBrew and leveraging QuickSight insights

Tags for cost allocation

The success of your showback dashboard partially depends on your cost allocation tagging strategy. Typically, customers use business tags such as cost center, business unit, or project to associate AWS costs with traditional financial reporting dimensions within their organization.

The CUR supports the ability to break down AWS costs by tag. For example, if a group of resources are labeled with the same tag, you’ll be able to see the total cost and usage of that group of resources. Read more about Tagging Best Practices to develop a tagging strategy for your organization.

A serverless data workflow for showback dashboards

You can build showback dashboards with managed services such as Amazon QuickSight, without the need to write any code or manage any servers.

Figure 2. A serverless architecture representing data workflow

AWS automatically delivers the data you need for showback dashboards through the CUR. Once this data arrives in an Amazon Simple Storage Service (Amazon S3) bucket, you can transform the data without writing any code by using DataBrew. You can also automatically identify the data schema and catalog the data’s properties to run queries using Amazon Athena. Lastly, you can visualize the results by publishing and sharing dashboards with key stakeholders within your organization using Amazon QuickSight.

The key benefits of this approach are:

  • Automatic data delivery
  • No-code data transformation
  • Automatic cataloging and querying
  • Serverless data visualization

Let’s take a look at each in more detail.

Automatic data delivery

The CUR is the source for historical cost and usage data. The CUR provides the most comprehensive set of cost and usage data available and will include your defined cost allocation tags for your entire Organization. You configure CUR to deliver your billing data to an Amazon S3 bucket at the payer account level. This will consolidate data for all linked accounts. After delivery starts, Amazon updates the CUR files at least once a day.
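If you haven’t already configured a report, you can define one with the Cost and Usage Reports API; the following Boto3 sketch shows one possible configuration, with the report name, bucket, prefix, and Region as placeholders (you can also create the report in the Billing console):

import boto3

# The Cost and Usage Reports API is served from us-east-1.
cur = boto3.client("cur", region_name="us-east-1")

cur.put_report_definition(
    ReportDefinition={
        "ReportName": "showback-cur",                # placeholder
        "TimeUnit": "HOURLY",
        "Format": "Parquet",
        "Compression": "Parquet",
        "AdditionalSchemaElements": ["RESOURCES"],
        "S3Bucket": "<cur-bucket>",                  # placeholder
        "S3Prefix": "cur/",                          # placeholder
        "S3Region": "us-east-1",                     # placeholder
        "AdditionalArtifacts": ["ATHENA"],           # Athena-friendly layout
        "RefreshClosedReports": True,
        "ReportVersioning": "OVERWRITE_REPORT",      # required with the ATHENA artifact
    }
)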

No-code data transformation

You can use DataBrew to transform the data in the Amazon S3 bucket aggregating cost and usage according to your tagging strategy. DataBrew summarizes your data for discovery. You can also run transformations called “jobs” in DataBrew without writing any code, using over 250 built-in transforms. Figures 3 through 5 show several job examples.

Figure 3. DataBrew recipe action: rename column

Figure 4. DataBrew recipe action: Create column from function

Figure 5. DataBrew recipe action: fill missing values

For a full list of columns available in CUR, review the CUR Data Dictionary. Following is a list of relevant columns for an executive summary showback dashboard:

  • bill_billing_period_start_date
  • line_item_usage_account_id
  • line_item_line_item_type
  • product_product_name
  • product_product_family
  • product_product_transfer_type
  • savings_plan_savings_plan_effective_cost
  • reservation_effective_cost
  • line_item_unblended_cost

Based on data refresh and business requirements, DataBrew can run a job on a recurring basis (for example, every 12 hours). This can be run at a particular time of day, or as defined by a valid CRON expression. This helps you automate your transformation workflows.
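One way to set up that recurrence is a DataBrew schedule. A Boto3 sketch follows; the schedule name, job name, and cron expression are placeholders:

import boto3

databrew = boto3.client("databrew")

# Run the transformation job every 12 hours.
databrew.create_schedule(
    Name="cur-showback-schedule",           # placeholder schedule name
    JobNames=["cur-showback-job"],          # placeholder recipe job name
    CronExpression="Cron(0 0/12 * * ? *)",  # DataBrew cron syntax: minute hour day month day-of-week year
)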

Automatic cataloging and querying

You can use a Glue crawler to automatically classify your data to determine the data’s format, schema, and associated properties. The crawlers write metadata to an AWS Glue Data Catalog to help data users find the data they need.

With the results in Amazon S3 and the metadata in the Glue Data Catalog, you can run standard SQL queries with Athena. This helps you make more informed business decisions by tracking financial metrics and optimizing costs, and the data is queried directly in Amazon S3 without having to move it around. Using standard SQL, you can create views that aggregate cost and usage by your defined tags.
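For example, a view like the following aggregates unblended cost by billing period, account, and a cost allocation tag. The database name, table name, and tag column are assumptions (a tag such as CostCenter typically surfaces in CUR as a resource_tags_user_* column):

import boto3

athena = boto3.client("athena")

create_view_sql = """
    CREATE OR REPLACE VIEW showback_by_cost_center AS
    SELECT
        bill_billing_period_start_date,
        line_item_usage_account_id,
        resource_tags_user_cost_center,          -- assumed tag column name
        SUM(line_item_unblended_cost) AS unblended_cost
    FROM cur_database.cur_table                  -- placeholder database.table
    GROUP BY 1, 2, 3
"""

athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "cur_database"},  # placeholder
    ResultConfiguration={"OutputLocation": "s3://<athena-results-bucket>/queries/"},
)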

Serverless data visualization

You can use Amazon QuickSight to create and share dashboards with your teams for cost visibility. QuickSight provides native integration with Athena and S3, and lets you easily create and publish interactive BI dashboards that include ML-powered insights. When building a showback dashboard such as the example in Figure 1, QuickSight authors create visuals and publish interactive dashboards.

Readers log in using your preferred authentication mechanism to view the shared dashboard. You can then filter data based on billing periods, account number, or cost allocation tags. You can also drill down to details using a web browser or mobile app.

Conclusion

In this blog, we’ve discussed designing and building a data transformation process and a showback dashboard. This gives you highly granular cost visualization without having to provision and manage any servers. You can use managed services such as AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight to crawl, catalog, analyze, and visualize your data.

We recommend defining your organization tagging strategy to be able to view costs by tags. You can then get started by creating Cost and Usage Reports. With the data in Amazon S3, you can use the services described in this post to transform the data that works for your business. Additionally, you can get started today by experimenting with the Cloud Intelligence Dashboards Workshop. This workshop provides examples of visualizations that you can build using native AWS services on top of your Cost and Usage Report. You will be able to get cost, usage, and operational insights about your AWS Cloud usage.