Tag Archives: AWS Glue DataBrew

Enforce customized data quality rules in AWS Glue DataBrew

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/enforce-customized-data-quality-rules-in-aws-glue-databrew/

GIGO (garbage in, garbage out) is a concept common to computer science and mathematics: the quality of the output is determined by the quality of the input. In modern data architecture, you bring data from different data sources, which creates challenges around volume, velocity, and veracity. You might write unit tests for applications, but it’s equally important to ensure the data veracity of these applications, because incoming data quality can make or break your application. Incorrect, missing, or malformed data can have a large impact on production systems. Examples of data quality issues include but are not limited to the following:

  • Missing or incorrect values can cause failures in production systems that require non-null values
  • Changes in the distribution of data can lead to unexpected outputs of machine learning (ML) models
  • Aggregations of incorrect data can lead to wrong business decisions
  • Incorrect data types have a big impact on financial or scientific institutions

In this post, we introduce data quality rules in AWS Glue DataBrew. DataBrew is a visual data preparation tool that makes it easy to profile and prepare data for analytics and ML. We demonstrate how to use DataBrew to define a list of rules in a new entity called a ruleset. A ruleset is a set of rules that compare different data metrics against expected values.

The post describes the implementation process and provides a step-by-step guide to build data quality checks in DataBrew.

Solution overview

To illustrate our data quality use case, we use a human resources dataset. This dataset contains the following attributes:

Emp ID, Name Prefix, First Name, Middle Initial, Last Name, Gender, E Mail, Father's Name, Mother's Name, Mother's Maiden Name, Date of Birth, Time of Birth, Age in Yrs., Weight in Kgs., Date of Joining, Quarter of Joining, Half of Joining, Year of Joining, Month of Joining, Month Name of Joining, Short Month, Day of Joining, DOW of Joining, Short DOW, Age in Company (Years), Salary, Last % Hike, SSN, Phone No., Place Name, County, City, State, Zip, Region, User Name, Password

For this post, we downloaded data with 5 million records, but feel free to use a smaller dataset to follow along with this post.

The following diagram illustrates the architecture for our solution.

The steps in this solution are as follows:

  1. Create a sample dataset.
  2. Create a ruleset.
  3. Create data quality rules.
  4. Create a profile job.
  5. Inspect the data quality rules validation results.
  6. Clean the dataset.
  7. Create a DataBrew job.
  8. Validate the data quality check with the updated dataset.

Prerequisites

Before you get started, complete the following prerequisites:

  1. Have an AWS account.
  2. Download the sample dataset.
  3. Extract the CSV file.
  4. Create an Amazon Simple Storage Service (Amazon S3) bucket with three folders: input, output, and profile.
  5. Upload the sample data in input folder to your S3 bucket (for example, s3://<s3 bucket name>/input/).

Create a sample dataset

To create your dataset, complete the following steps:

  1. On the DataBrew console, in the navigation pane, choose Datasets.
  2. Choose Connect new dataset.
  3. For Dataset name, enter a name (for example, human-resource-dataset).
  4. Under Data lake/data store, choose Amazon S3 as your source.
  5. For Enter your source from Amazon S3, enter the S3 bucket location where you uploaded your sample files (for example, s3://<s3 bucket name>/input/).
  6. Under Additional configurations, keep the selected file type CSV and CSV delimiter comma (,).
  7. Scroll to the bottom of the page and choose Create dataset.

The dataset is now available on the Datasets page.

Create a ruleset

We now define data quality rulesets against the dataset created in the previous step.

  1. On the DataBrew console, in the navigation pane, choose DQ Rules.
  2. Choose Create data quality ruleset.
  3. For Ruleset name, enter a name (for example, human-resource-dataquality-ruleset).
  4. Under Associated dataset, choose the dataset you created earlier.

Create data quality rules

You can add multiple rules to a ruleset, and within each rule, you can define multiple data quality checks.

For this post, we create the following data quality rules and data quality checks within the rules:

  • Row count is correct
  • No duplicate rows
  • Employee ID, email address, and SSN are unique
  • Employee ID and phone number are not null
  • Employee ID and employee age in years have no negative values
  • SSN data format is correct (123-45-6789)
  • Phone number string length is correct
  • Region column only has the specified region
  • Employee ID is an integer

Row count is correct

To check the total row count, complete the following steps:

  1. Add a new rule.
  2. For Rule name, enter a name (for example, Check total record count).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. For Data quality checks, choose Number of rows.
  6. For Condition, choose Is equals.
  7. For Value, enter 5000000.

No duplicate rows

To check the dataset for duplicate rows, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check dataset for duplicate rows).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Duplicate rows.
  6. For Condition, choose Is equals.
  7. For Value, enter 0 and choose rows on the drop-down menu.

Employee ID, email address, and SSN are unique

To check that the employee ID, email, and SSN are unique, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check dataset for Unique Values).
  3. For Data quality check scope, choose Common checks for selected columns.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. For Selected columns, select Selected columns.
  6. Choose the columns Emp ID, E Mail, and SSN.
  7. Under Check 1, for Data quality check, choose Unique values.
  8. For Condition, choose Is equals.
  9. For Value, enter 100 and choose %(percent) rows on the drop-down menu.

Employee ID and phone number are not null

To check that employee IDs and phone numbers aren’t null, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check Dataset for NOT NULL).
  3. For Data quality check scope, choose Common checks for selected columns.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. For Selected columns, select Selected columns.
  6. Choose the columns Emp ID and Phone No.
  7. Under Check 1, for Data quality check, choose Value is not missing.
  8. For Condition, choose Greater than equals.
  9. For Threshold, enter 100 and choose %(percent) rows on the drop-down menu.

Employee ID and age in years have no negative values

To check the employee ID and age for positive values, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check emp ID and age for positive values).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Numeric values.
  6. Choose Emp ID on the drop-down menu.
  7. For Condition, choose Greater than equals.
  8. For Value, select Custom value and enter 0.
  9. Choose Add another quality check and repeat the same steps for age in years.

SSN data format is correct

To check the SSN data format, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check dataset format).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose String values.
  6. Choose SSN on the drop-down menu.
  7. For Condition, choose Matches (RegEx pattern).
  8. For RegEx value, enter ^[0-9]{3}-[0-9]{2}-[0-9]{4}$.

Phone number string length is correct

To check the length of the phone number, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check Dataset Phone no. for string length).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Value string length.
  6. Choose Phone No on the drop-down menu.
  7. For Condition, choose Greater than equals.
  8. For Value, select Custom value and enter 9.
  9. Under Check 2, for Data quality check, choose Value string length.
  10. Choose Phone No on the drop-down menu.
  11. For Condition, choose Less than equals.
  12. For Value, select Custom value and enter 12.

Region column only has the specified region

To check the Region column, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check Region column only for specific region).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Value is exactly.
  6. Choose Region on the drop-down menu.
  7. For Value, select Custom value.
  8. Choose the values Midwest, South, West, and Northeast.

Employee ID is an integer

To check that the employee ID is an integer, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Validate Emp ID is an Integer).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose String values.
  6. Choose Emp ID on the drop-down menu.
  7. For Condition, choose Matches (RegEx pattern).
  8. For RegEx value, enter ^[0-9]+$.
  9. After you create all the rules, choose Create ruleset.

Your ruleset is now listed on the Data quality rulesets page.
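
Optionally, before running the profile job, you can sanity-check the two RegEx patterns used in the ruleset (the SSN format and the integer employee ID) with a quick local test. The following standalone Python snippet is purely illustrative and isn't part of the DataBrew console flow; the sample values are hypothetical:

import re

# The two patterns entered in the ruleset above
ssn_pattern = re.compile(r'^[0-9]{3}-[0-9]{2}-[0-9]{4}$')
emp_id_pattern = re.compile(r'^[0-9]+$')

# Hypothetical sample values to confirm the patterns behave as expected
assert ssn_pattern.match('123-45-6789')
assert not ssn_pattern.match('123456789')
assert emp_id_pattern.match('198429')
assert not emp_id_pattern.match('EMP-198429')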

Create a profile job

To create a profile job with your new ruleset, complete the following steps:

  1. On the Data quality rulesets page, select the ruleset you just created.
  2. Choose Create profile job with ruleset.
  3. For Job name, keep the prepopulated name or enter a new one.
  4. For Data sample, select Full dataset.

The default sample size is important for data quality rules validation, because it matters whether you validate all the rows or only a limited sample.

  1. Under Job output settings, for S3 location, enter the path to the profile bucket.

If you enter a new bucket name, the folder is created automatically.

  1. Keep the default settings for the remaining optional sections: Data profile configurations, Data quality rules, Advanced job settings, Associated schedules, and Tags.

The next step is to choose or create the AWS Identity and Access Management (IAM) role that grants DataBrew access to read from the input S3 bucket and write to the job output bucket.

  1. For Role name, choose an existing role or choose Create a new IAM role and enter an IAM role suffix.
  2. Choose Create and run job.

For more information about configuring and running DataBrew jobs, see Creating, running, and scheduling AWS Glue DataBrew jobs.
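
If you prefer to script this step instead of using the console, the following Boto3 sketch creates and starts an equivalent profile job with the ruleset attached. The job name, S3 paths, and ARNs are placeholders, and the ValidationConfigurations shape should be verified against the current DataBrew API documentation:

import boto3

databrew = boto3.client('databrew')

# Create a profile job that validates the ruleset against the dataset.
# All names, ARNs, and S3 paths below are placeholders for illustration.
databrew.create_profile_job(
    Name='human-resource-dq-profile-job',
    DatasetName='human-resource-dataset',
    RoleArn='arn:aws:iam::<account-id>:role/<databrew-role>',
    OutputLocation={
        'Bucket': '<s3 bucket name>',
        'Key': 'profile/',
    },
    ValidationConfigurations=[
        {
            # ARN of the ruleset created in the previous section
            'RulesetArn': 'arn:aws:databrew:<region>:<account-id>:ruleset/human-resource-dataquality-ruleset',
            'ValidationMode': 'CHECK_ALL',
        }
    ],
)

# Start a run of the profile job
databrew.start_job_run(Name='human-resource-dq-profile-job')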

Inspect data quality rules validation results

To inspect the data quality rules, we need to let the profile job complete.

  1. On the Jobs page of the DataBrew console, choose the Profile jobs tab.
  2. Wait until the profile job status changes to Succeeded.
  3. When the job is complete, choose View data profile.

You’re redirected to the Data profile overview tab on the Datasets page.

  1. Choose the Data quality rules tab.

Here you can review the status of your data quality rules. As shown in the following screenshot, eight of the nine data quality rules defined were successful, and one rule failed.

Our failed data quality rule indicates that we found duplicate values for employee ID, SSN, and email.

  1. To confirm that the data has duplicate values, on the Column statistics tab, choose the Emp ID column.
  2. Scroll down to the section Top distinct values.

Similarly, you can check the E Mail and SSN columns to find that those columns also have duplicate values.

Now we have confirmed that our data has duplicate values. The next step is to clean up the dataset and rerun the quality rules validation.

Clean the dataset

To clean the dataset, we first need to create a project.

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project name, enter a name (for this post, human-resource-project-demo).
  4. For Select a dataset, select My datasets.
  5. Select the human-resource-dataset dataset.
  6. Keep the sampling size at its default.
  7. Under Permissions, for Role name, choose the IAM role that we created previously for our DataBrew profile job.
  8. Choose Create project.

The project takes a few minutes to open. When it’s complete, you can see your data.

Next, we delete the duplicate value from the Emp ID column.

  1. Choose the Emp ID column.
  2. Choose the more options icon (three dots) to view all the transforms available for this column.
  3. Choose Remove duplicate values.
  4. Repeat these steps for the SSN and E Mail columns.

You can now see the three applied steps in the Recipe pane.

Create a DataBrew job

The next step is to create a DataBrew job to run these transforms against the full dataset.

  1. On the project details page, choose Create job.
  2. For Job name, enter a name (for example, human-resource-after-dq-check).
  3. Under Job output settings, for File type, choose your final storage format to be CSV.
  4. For S3 location, enter your output S3 bucket location (for example, s3://<s3 bucket name>/output/).
  5. For Compression, choose None.
  6. Under Permissions, for Role name, choose the same IAM role we used previously.
  7. Choose Create and run job.
  8. Wait for the job to complete; you can monitor the job on the Jobs page.

Validate the data quality check with the corrected dataset

To perform the data quality checks with the corrected dataset, complete the following steps:

  1. Follow the steps outlined earlier to create a new dataset, using the corrected data from the previous section.
  2. Choose the Amazon S3 location of the job output.
  3. Choose Create dataset.
  4. Choose DQ Rules and select the ruleset you created earlier.
  5. On the Actions menu, choose Duplicate.
  6. For Ruleset name, enter a name (for example, human-resource-dataquality-ruleset-on-corrected-dataset).
  7. Select the newly created dataset.
  8. Choose Create data quality ruleset.
  9. After the ruleset is created, select it and choose Create profile job with ruleset.
  10. Create a new profile job.
  11. Choose Create and run job.
  12. When the job is complete, repeat the steps from earlier to inspect the data quality rules validation results.

This time, under Data quality rules, all the rules pass except Check total record count, because removing the duplicate rows reduced the total row count below 5 million.

On the Column statistics page, under Top distinct values for the Emp ID column, you can see the distinct values.

You can find similar results for the SSN and E Mail columns.

Clean up

To avoid incurring future charges, we recommend you delete the resources you created during this walkthrough.

Conclusion

As demonstrated in this post, you can use DataBrew to create data quality rules, which can help you identify discrepancies in your data. You can also use DataBrew to clean the data and validate it going forward. You can learn more about AWS Glue DataBrew here and about AWS Glue DataBrew pricing here.


About the Authors

Navnit Shukla is an AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Harsh Vardhan Singh Gaur is an AWS Solutions Architect, specializing in Analytics. He has over 5 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Introducing PII data identification and handling using AWS Glue DataBrew

Post Syndicated from Harsh Vardhan Singh Gaur original https://aws.amazon.com/blogs/big-data/introducing-pii-data-identification-and-handling-using-aws-glue-databrew/

AWS Glue DataBrew, a visual data preparation tool, can now identify and handle sensitive data by applying advanced transformations like redaction, replacement, encryption, and decryption on your personally identifiable information (PII) data. With exponential growth of data, companies are handling huge volumes and a wide variety of data coming into their platform, including PII data. Identifying and protecting sensitive data at scale has become increasingly complex, expensive, and time-consuming. Organizations have to adhere to data privacy, compliance, and regulatory needs such as GDPR and CCPA. They need to identify sensitive data, including PII such as name, SSN, address, email, driver’s license, and more. Even after identification, it’s cumbersome to implement redaction, masking, or encryption of sensitive personal information at scale.

To enable data privacy and protection, DataBrew has launched PII statistics, which identifies PII columns and provides their data statistics when you run a profile job on your dataset. Furthermore, DataBrew has introduced PII data handling transformations, which enable you to apply data masking, encryption, decryption, and other operations on your sensitive data.

In this post, we walk through a solution in which we run a data profile job to identify and suggest potential PII columns present in a dataset. Next, we target PII columns in a DataBrew project and apply various transformations to handle the sensitive columns existing in the dataset. Finally, we run a DataBrew job to apply the transformations on the entire dataset and store the processed, masked, and encrypted data securely in Amazon Simple Storage Service (Amazon S3).

Solution overview

We use a public dataset that is available for download at Synthetic Patient Records with COVID-19. The data hosted within SyntheticMass has been generated by Synthea™, an open-source patient population simulation made available by The MITRE Corporation.

Download the zipped file 10k_synthea_covid19_csv.zip for this solution and unzip it locally. The solution uses the dummy data in the file patients.csv to demonstrate data redaction and encryption capabilities. The file contains 10,000 synthetic patient records in CSV format, including PII columns like driver’s license, birth date, address, SSN, and more.

The following diagram illustrates the architecture for our solution.

The steps in this solution are as follows:

  1. The sensitive data is stored in an S3 bucket. You create a DataBrew dataset by connecting to the data in Amazon S3.
  2. Run a DataBrew profile job to identify the PII columns present in the dataset by enabling PII statistics.
  3. After identification of PII columns, apply transformations to redact or encrypt column values as a part of your recipe.
  4. A DataBrew job runs the recipe steps on the entire data and generates output files with sensitive data redacted or encrypted.
  5. After the output data is written to Amazon S3, we create an external table on top of it in Amazon Athena. Data consumers can use Athena to query the processed and cleaned data.

Prerequisites

For this walkthrough, you need an AWS account. Use us-east-1 as your AWS Region to implement this solution.

Set up your source data in Amazon S3

Create an S3 bucket called databrew-clean-pii-data-<Your-Account-ID> in us-east-1 with the following prefixes:

  • sensitive_data_input
  • cleaned_data_output
  • profile_job_output

Upload the patients.csv file to the sensitive_data_input prefix.
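
If you prefer the AWS SDK to the console for this step, a minimal upload sketch looks like the following (the bucket name and local file path are placeholders you need to adjust):

import boto3

s3 = boto3.client('s3')

# Upload the extracted patients.csv file to the sensitive_data_input prefix
s3.upload_file(
    Filename='patients.csv',
    Bucket='databrew-clean-pii-data-<Your-Account-ID>',
    Key='sensitive_data_input/patients.csv',
)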

Create a DataBrew dataset

To create a DataBrew dataset, complete the following steps:

  1. On the DataBrew console, in the navigation pane, choose Datasets.
  2. Choose Connect new dataset.
  3. For Dataset name, enter a name (for this post, Patients).
  4. Under Connect to new dataset, select Amazon S3 as your source.
  5. For Enter your source from S3, enter the S3 path to the patients.csv file. In our case, this is s3://databrew-clean-pii-data-<Account-ID>/sensitive_data_input/patients.csv.
  6. Scroll to the bottom of the page and choose Create dataset.

Run a data profile job

You’re now ready to create your profile job.

  1. In the navigation pane, choose Datasets.
  2. Select the Patients dataset.
  3. Choose Run data profile and choose Create profile job.
  4. Name the job Patients - Data Profile Job.
  5. We run the data profile on the entire dataset, so for Data sample, select Full dataset.
  6. In the Job output settings section, point to the profile_job_output S3 prefix where the data profile output is stored when the job is complete.
  7. Expand Data profile configurations, and select Enable PII statistics to identify PII columns when running the data profile job.

This option is disabled by default; you must enable it manually before running the data profile job.

  1. For PII categories, select All categories.
  2. Keep the remaining settings at their default.
  3. In the Permissions section, create a new AWS Identity and Access Management (IAM) role that is used by the DataBrew job to run the profile job, and use PII-DataBrew-Role as the role suffix.
  4. Choose Create and run job.

The job runs on the sample data and takes a few minutes to complete.

Now that we’ve run our profile job, we can review data profile insights about our dataset by choosing View data profile. We can also review the results of the profile through the visualizations on the DataBrew console and view the PII widget. This section provides a list of identified PII columns mapped to PII categories with column statistics. Furthermore, it suggests potential PII data that you can review.

Create a DataBrew project

After we identify the PII columns in our dataset, we can focus on handling the sensitive data in our dataset. In this solution, we perform redaction and encryption in our DataBrew project using the Sensitive category of transformations.

To create a DataBrew project for handling our sensitive data, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project name, enter a name (for this post, patients-pii-handling).
  4. For Select a dataset, select My datasets.
  5. Select the Patients dataset.
  6. Under Permissions, for Role name, choose the IAM role that we created previously for our DataBrew profile job, AWSGlueDataBrewServiceRole-PII-DataBrew-Role.
  7. Choose Create project.

The dataset takes a few minutes to load. When the dataset is loaded, we can start performing redactions. Let’s start with the SSN column.

  1. For the SSN column, on the Sensitive menu, choose Redact data.
  2. Under Apply redaction, select Full string value.
  3. We redact all the non-alphanumeric characters and replace them with #.
  4. Choose Preview changes to compare the redacted values.
  5. Choose Apply.

On the Sensitive menu, all the data masking transformations—redact, replace, and hash data—are irreversible. After we finalize our recipe and run the DataBrew job, the job output to Amazon S3 is permanently redacted and we can’t recover it.

  1. Now, let’s apply redaction to multiple columns, assuming the following columns must not be consumed by any downstream users such as data analysts, BI engineers, and data scientists:
    1. DRIVERS
    2. PASSPORT
    3. BIRTHPLACE
    4. ADDRESS
    5. LAT
    6. LON

In special cases, when we need to recover our sensitive data, instead of masking, we can encrypt our column values and when needed, decrypt the data to bring it back to its original format. Let’s assume we require a column value to be decrypted by a downstream application; in that case, we can encrypt our sensitive data.

We have two encryption options: deterministic and probabilistic. For use cases when we want to join two datasets on the same encrypted column, we should apply deterministic encryption. Deterministic encryption ensures that a given distinct value always encrypts to the same value across DataBrew projects, as long as we use the same AWS secret. Additionally, keep in mind that when you apply deterministic encryption on your PII columns, you can only use DataBrew to decrypt those columns.

For our use case, let’s assume we want to perform deterministic encryption on a few of our columns.

  1. On the Sensitive menu, choose Deterministic encryption.
  2. For Source columns, select BIRTHDATE, DEATHDATE, FIRST, and LAST.
  3. For Encryption option, select Deterministic encryption.
  4. For Select secret, choose the databrew!default AWS secret.
  5. Choose Apply.
  6. After you finish applying all your transformations, choose Publish.
  7. Enter a description for the recipe version and choose Publish.

Create a DataBrew job

Now that our recipe is ready, we can create a job to apply the recipe steps to the Patients dataset.

  1. On the DataBrew console, choose Jobs.
  2. Choose Create a job.
  3. For Job name, enter a name (for example, Patient PII Masking and Encryption).
  4. Select the Patients dataset and choose patients-pii-handling-recipe as your recipe.
  5. Under Job output settings, for File type, choose your final storage format to be Parquet.
  6. For S3 location, enter your S3 output as s3://databrew-clean-pii-data-<Account-ID>/cleaned_data_output/.
  7. For Compression, choose None.
  8. For File output storage, select Replace output files for each job run.
  9. Under Permissions, for Role name, choose the same IAM role we used previously.
  10. Choose Create and run job.

Create an Athena table

You can create tables by writing the DDL statement in the Athena query editor. If you’re not familiar with Apache Hive, you should review Creating Tables in Athena to learn how to create an Athena table that references the data residing in Amazon S3.

To create an Athena table, use the query editor and enter the following DDL statement:

CREATE EXTERNAL TABLE patient_masked_encrypted_data (
   `id` string, 
  `birthdate` string, 
  `deathdate` string, 
  `ssn` string, 
  `drivers` string, 
  `passport` string, 
  `prefix` string, 
  `first` string, 
  `last` string, 
  `suffix` string, 
  `maiden` string, 
  `marital` string, 
  `race` string, 
  `ethnicity` string, 
  `gender` string, 
  `birthplace` string, 
  `address` string, 
  `city` string, 
  `state` string, 
  `county` string, 
  `zip` int, 
  `lat` string, 
  `lon` string, 
  `healthcare_expenses` double, 
  `healthcare_coverage` double 
)
STORED AS PARQUET
LOCATION 's3://databrew-clean-pii-data-<Account-ID>/cleaned_data_output/'

Let’s validate the table output in Athena by running a simple SELECT query. The following screenshot shows the output.

We can clearly see the encrypted and redacted column values in our query output.
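
If you want to run the validation query programmatically instead of in the query editor, a minimal sketch using the Athena API might look like the following. The database name ('default') and the query result location are assumptions you need to adjust for your environment:

import boto3

athena = boto3.client('athena')

# Submit a simple validation query against the table created above
response = athena.start_query_execution(
    QueryString='SELECT id, ssn, drivers, birthdate FROM patient_masked_encrypted_data LIMIT 10',
    QueryExecutionContext={'Database': 'default'},
    ResultConfiguration={
        'OutputLocation': 's3://databrew-clean-pii-data-<Account-ID>/athena-results/'
    },
)
print(response['QueryExecutionId'])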

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

Conclusion

As demonstrated in this post, you can use DataBrew to help identify, redact, and encrypt PII data. With these new PII transformations, you can streamline and simplify customer data management across industries such as financial services, government, retail, and much more.

Now that you can protect your sensitive data workloads to meet regulatory and compliance best practices, you can use this solution to build de-identified data lakes in AWS. Sensitive data fields remain protected throughout their lifecycle, whereas non-sensitive data fields remain in the clear. This approach can allow analytics or other business functions to operate on data without exposing sensitive data.


About the Authors

Harsh Vardhan Singh Gaur is an AWS Solutions Architect, specializing in Analytics. He has over 5 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Navnit Shukla is an AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Integrate AWS Glue DataBrew and Amazon PinPoint to launch marketing campaigns

Post Syndicated from Suraj Shivananda original https://aws.amazon.com/blogs/big-data/integrate-aws-glue-databrew-and-amazon-pinpoint-to-launch-marketing-campaigns/

Marketing teams often rely on data engineers to provide a consumer dataset that they can use to launch marketing campaigns. This can delay campaign launches and consume data engineers’ bandwidth. Campaigns are often launched using complex solutions that are either code heavy or rely on licensed tools. Both the extract, transform, and load (ETL) process and the campaign launches need engineers who know how to code, take time to build, and require ongoing maintenance.

You can now simplify this process by integrating AWS services like Amazon PinPoint, AWS Glue DataBrew, and AWS Lambda. We can use DataBrew (a visual data preparation service) to perform ETL, Amazon PinPoint (an outbound and inbound marketing communications service) to launch campaigns, and Lambda functions to achieve end-to-end automation. This solution helps reduce time to production, can be implemented by less-technical folks because it doesn’t require coding, and has no licensing costs involved.

In this post, we walk through the end-to-end workflow and how to implement this solution.

Solution overview

In this solution, the source datasets are pushed to Amazon Simple Storage Service (Amazon S3) using SFTP (batch data) or Amazon API Gateway (streaming data) services. DataBrew jobs perform data transformations and prepare the data for Amazon PinPoint to launch campaigns. We use Lambda to achieve end-to-end automation, which includes an alert system via Amazon Simple Notification Service (Amazon SNS) to alert relevant teams of anomalies or errors.

The workflow includes the following steps:

  1. We can load or push data to Amazon S3 through AWS Command Line Interface (AWS CLI) commands, AWS access keys, AWS Transfer Family (SFTP), or an API Gateway service.
  2. We use DataBrew to perform ETL jobs.
  3. These ETL jobs are either triggered (using Lambda functions) or scheduled.
  4. The processed dataset from DataBrew is imported to Amazon PinPoint as segments using trigger-based Lambda functions.
  5. Marketing teams use Amazon PinPoint to launch campaigns.
  6. We can also perform data profiling using DataBrew.
  7. Finally, we export Amazon PinPoint metrics data to Amazon S3 using Amazon Kinesis Data Firehose. We can use the data for further analysis using Amazon Athena and Amazon QuickSight.

The following diagram illustrates our solution architecture.

As a bonus step, we can create a simple web portal using AWS Amplify that makes API calls to Amazon PinPoint to launch campaigns in case we want to restrict users from launching campaigns using Amazon PinPoint from the AWS Management Console. This web portal can also provide basic metrics generated by Amazon PinPoint. You can also use it as a product or platform that anyone can use to launch campaigns.

To implement the solution, we complete the following steps:

  1. Create the source datasets using both batch ingestion and Amazon Kinesis streaming services.
  2. Build an automated data ingestion pipeline that transforms the source data and makes it campaign ready.
  3. Build a DataBrew data profile job, after which you can view profile metrics and alert teams in case of any anomalies in the source data.
  4. Launch a campaign using Amazon PinPoint.
  5. Export Amazon PinPoint project events to Amazon S3 using Kinesis Data Firehose.

Create the source datasets

In this step, we create the source datasets from both an SFTP server and Kinesis in Amazon S3.

First, we create an S3 bucket to store the source, processed, and campaign-ready data.

We can ingest data into AWS through many different methods. For this post, we consider methods for batch and streaming ingestion.

For batch ingestion, we create an SFTP-enabled server for source data ingestion. We can configure SFTP servers to store data on either Amazon S3 or Amazon Elastic File System (Amazon EFS). For this post, we configure an SFTP server to store data on Amazon S3.

We use API Gateway and Kinesis for stream ingestion. We can also push data to Kinesis directly, but using API Gateway is preferred because it’s easy to handle cross-account and authentication issues. For instructions on integrating API Gateway and a Kinesis stream, see Tutorial: Create a REST API as an Amazon Kinesis proxy in API Gateway.
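
For reference, pushing a record directly to a Kinesis data stream (without API Gateway in front) is a single Boto3 call. The stream name and payload fields below are hypothetical:

import json
import boto3

kinesis = boto3.client('kinesis')

# Push one sample consumer record to the stream
kinesis.put_record(
    StreamName='consumer-data-stream',
    Data=json.dumps({'email': 'user@example.com', 'first_name': 'Jane'}).encode('utf-8'),
    PartitionKey='user@example.com',
)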

Each dataset should be in its own S3 folder: s3://<bucket-name>/src-files/datasource1/, s3://<bucket-name>/src-files/datasource2/, and so on.

Build the automated data ingestion pipeline

In this step, we build the end-to-end automated data ingestion pipeline that transforms the source data and makes it campaign ready.

We use DataBrew for data preparation and data quality, and Lambda and Amazon SNS for automation and alerting. Amazon PinPoint can then use this campaign data to launch campaigns.

Our pipeline performs the following functions:

  1. Run transformations on source datasets.
  2. Merge the datasets to make the data campaign ready.
  3. Perform data quality checks.
  4. Alert relevant teams in case of anomalies in the data quality.
  5. Alert relevant teams in case of DataBrew job failures.
  6. Import the campaign-ready dataset as a segment in Amazon PinPoint.

We can either have one DataBrew project to perform necessary transformations on each dataset and merge all the datasets into one final dataset, or have one DataBrew project for each dataset and another project that merges all the transformed datasets. The advantage of having individual projects for each dataset is that we’re decoupling all data sources, so an issue in one data source doesn’t impact another. We use this latter approach in this post.

For instructions on building the DataBrew job, see Creating and working with AWS Glue DataBrew recipe jobs.

DataBrew provides more than 250 transformations. For this post, we add the following transformations to clean the source datasets:

  • Validate column values and add default values if missing
  • Convert column values to a standard format, such as standardizing date values or converting lowercase to uppercase (Amazon PinPoint is case-sensitive)
  • Remove special characters if needed
  • Split values if needed
  • Check for duplicates
  • Add audit columns

You could also add a few data quality recipe steps to the recipe.

DataBrew jobs can be scheduled or trigger based (through a Lambda function). For this post, we configure jobs processing individual data sources to be trigger based, and the final job merging all datasets is scheduled. The advantage of having trigger-based DataBrew jobs is that it only triggers if you have a source file, which helps reduce costs.

We first configure an S3 event that triggers a Lambda function. The function triggers the DataBrew job based on the S3 key value. See the following code (Boto3) for our function:

import boto3

client = boto3.client('databrew')

def lambda_handler(event, context):
    # Read the S3 object key from the event that triggered this function
    record = event['Records'][0]
    bucketname = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    print('key: ' + key)

    # Start the DataBrew job that corresponds to the source folder
    if key.find('SourceFolder1') != -1:
        print('processing source1 file')
        response = client.start_job_run(Name='databrew-process-data-job1')
    if key.find('SourceFolder2') != -1:
        print('processing source2 file')
        response = client.start_job_run(Name='databrew-process-data-job2')

The processed data appears in s3://<bucket-name>/src-files/process-data/data-source1/.

Next we create the job that merges the processed data files into a final, campaign-ready dataset. We can configure the job to merge only those files that have been dropped in the last 24 hours.

The campaign-ready data is located in s3://<bucket-name>/src-files/campaign-ready/. This dataset is now ready to serve as input to Amazon PinPoint.

Build a DataBrew data profile job

We can use DataBrew to run a data profile job on any of the datasets defined in the previous steps. When you profile your data, DataBrew creates a report called a data profile. This report displays statistics such as the number of rows in the sample and the distribution of unique values in each column. In this post, we use Lambda functions to read the report and detect anomalies and send alerts using Amazon SNS to respective teams for further action.

  1. On the DataBrew console, on the Datasets page, select your dataset.
  2. Choose Run data profile.
  3. For Job name, enter a name for your job.
  4. For Data sample, select either Full dataset or Custom sample (for this post, we sample 20,000 rows).
  5. In the Job output settings section, for S3 location, enter your output bucket.
  6. Optionally, select Enable encryption for job output file to encrypt your data.
  7. Configure optional settings, such as profile configurations; number of nodes, job timeouts, and retries; schedules; and tags.
  8. Choose an AWS Identity and Access Management (IAM) role for the profile job.
  9. Choose Create and run job.

After you run the job, DataBrew provides you with job metrics. For example, the following screenshot shows the Dataset preview tab.

The following screenshot shows an example of the Data profile overview tab.

This tab also includes a summary of the column details.

The following screenshot shows an example of the Column statistics tab.

Next, we set up alerts using the profile output file.

  1. Configure an S3 event to trigger a Lambda function.

The function reads the output and checks for anomalies in the data. You define the anomalies; for example, when the number of missing values for a column is greater than 10. The function can then raise an SNS alert if it detects an anomaly.
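
The following is a minimal sketch of such a function. The keys used to read the profile output ('columns' and 'missingValuesCount') and the SNS topic ARN are assumptions for illustration; inspect an actual profile output file for the exact schema before relying on this:

import json
import boto3

s3 = boto3.client('s3')
sns = boto3.client('sns')

# Placeholder values for illustration
SNS_TOPIC_ARN = 'arn:aws:sns:<region>:<account-id>:databrew-profile-alerts'
MISSING_VALUES_THRESHOLD = 10

def lambda_handler(event, context):
    # Read the profile output file that triggered this event
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    profile = json.loads(s3.get_object(Bucket=bucket, Key=key)['Body'].read())

    # 'columns' and 'missingValuesCount' are assumed key names; verify them
    # against your own profile output before using this in production
    anomalies = [
        col['name']
        for col in profile.get('columns', [])
        if col.get('missingValuesCount', 0) > MISSING_VALUES_THRESHOLD
    ]

    if anomalies:
        sns.publish(
            TopicArn=SNS_TOPIC_ARN,
            Subject='DataBrew profile anomaly detected',
            Message='Columns with too many missing values: ' + ', '.join(anomalies),
        )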

Launch a campaign using Amazon PinPoint

Before you create segments, create an Amazon PinPoint project. To create segments, we use S3 events to trigger a Lambda function that creates a new base segment whenever the final DataBrew job loads the campaign-ready data. We can either create or update base segments; for this post, we create a new segment. See the following code (Boto3) for the Lambda function:

import boto3
import time
from datetime import datetime

now = datetime.now()  # current date and time
date_time = now.strftime("%Y_%m_%d")

# Name the new base segment with the current date
segname = 'segment_name_' + date_time

# Brief pause to allow the campaign-ready output to finish landing in S3
time.sleep(60)

client = boto3.client('pinpoint')

def lambda_handler(event, context):
    # Import the campaign-ready CSV from S3 as a new Amazon PinPoint segment
    response = client.create_import_job(
        ApplicationId='xxx-project-id-from-pinpoint',
        ImportJobRequest={
            'DefineSegment': True,
            'Format': 'CSV',
            'RoleArn': 'arn:aws:iam::xxx-accountid:role/service-role/xxx_pinpoint_role',
            'S3Url': 's3://<xxx-bucketname>/campaign-ready',
            'SegmentName': segname
        }
    )

    print(response)

Use this base segment to create dynamic segments and launch a campaign. For more information, see Amazon PinPoint campaigns.
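
If you want to launch a campaign programmatically rather than through the console, a rough sketch might look like the following. The project ID, segment ID, and message fields are placeholders, and the WriteCampaignRequest shape should be checked against the Amazon PinPoint API reference:

import boto3

pinpoint = boto3.client('pinpoint')

# All IDs and message content below are placeholders for illustration
response = pinpoint.create_campaign(
    ApplicationId='xxx-project-id-from-pinpoint',
    WriteCampaignRequest={
        'Name': 'marketing-campaign-example',
        'SegmentId': 'xxx-segment-id',
        'Schedule': {'StartTime': 'IMMEDIATE', 'Frequency': 'ONCE'},
        'MessageConfiguration': {
            'EmailMessage': {
                'Title': 'Example subject line',
                'HtmlBody': '<p>Example message body</p>',
                'FromAddress': 'verified-sender@example.com',
            }
        },
    },
)
print(response['CampaignResponse']['Id'])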

Export Amazon PinPoint project events to Amazon S3 using Kinesis Data Firehose

You can track and push events related to your project, such as messages sent, delivered, and opened, to either Amazon Kinesis Data Streams or Kinesis Data Firehose, which stream this data to AWS data stores such as Amazon S3. For this post, we use Kinesis Data Firehose. We create our delivery stream prior to enabling event streams on the Amazon PinPoint project.

  1. Create your Firehose delivery stream.

The event stream is disabled by default.

  1. Choose Edit.
  2. Select Stream to Amazon Kinesis and select Send events to an Amazon Kinesis Data Firehose delivery stream.
  3. Choose the delivery stream you created.
  4. For IAM role, you can allow Amazon PinPoint to automatically create a new role or use an existing role.
  5. Choose Save.

The events are now sent to Amazon S3. You could either create an Athena table or use Amazon Kinesis Data Analytics to analyze events and build a dashboard using QuickSight.

Security best practices

Consider the following best practices in order to mitigate security threats:

  • Be mindful while creating IAM roles to provide access to only necessary services
  • If sending emails through Amazon SNS, make sure you send email alerts only to verified or subscribed recipients to minimize the possibility of automated emails being used to target victims’ external email addresses
  • Use IAM roles rather than user keys
  • Have logs written to Amazon CloudWatch, and set CloudWatch alarms in case of failures
  • Take regular backups of DataBrew jobs and Amazon Pinpoint campaigns (if needed)
  • Restrict network access for inbound and outbound traffic to least privilege
  • Enable the lifecycle policy to retain only necessary data, and delete unnecessary data
  • Enable server-side encryption using AWS KMS (SSE-KMS) or Amazon S3 (SSE-S3)
  • Enable cross-Region replication of data in case you feel backing up the source data is necessary

Clean up

To avoid ongoing charges, clean up the resources you created as part of this post:

  • S3 bucket
  • SFTP server
  • DataBrew resources
  • PinPoint resources
  • Firehose delivery stream, if applicable
  • Athena tables and QuickSight dashboards, if applicable

Conclusion

In this post, we walked through how to implement an automated workflow using DataBrew to perform ETL, Amazon PinPoint to launch campaigns, and Lambda to automate the process. This solution helps reduce time to production, is easy to implement because it doesn’t require coding, and has no licensing costs involved. Try this solution today for your own datasets, and leave any comments or questions in the comments section.


About the Authors

Suraj Shivananda is a Solutions Architect at AWS. He has over a decade of experience in Software Engineering, Data and Analytics, DevOps specifically for data solutions, automating and optimizing cloud based solutions. He’s a trusted technical advisor and helps customers build Well Architected solutions on the AWS platform.

Surbhi Dangi is a Sr Manager, Product Management at AWS. Her work includes building user experiences for Database, Analytics & AI AWS consoles, launching new database and analytics products, working on new feature launches for existing products, and building broadly adopted internal tools for AWS teams. She enjoys traveling to new destinations to discover new cultures, trying new cuisines, and teaching product management 101 to aspiring PMs.

Extract, prepare, and analyze Salesforce.com data using Amazon AppFlow, AWS Glue DataBrew, and Amazon Athena

Post Syndicated from Ramkumar Nottath original https://aws.amazon.com/blogs/big-data/extract-prepare-and-analyze-salesforce-com-data-using-amazon-appflow-aws-glue-databrew-and-amazon-athena/

As organizations embark on their data modernization journey, big data analytics and machine learning (ML) use cases are becoming even more integral parts of business. The ease of data preparation and seamless integration with third-party data sources are of paramount importance to gain insights quickly and make critical business decisions faster.

AWS Glue DataBrew is a visual data preparation tool that cleans and normalizes data without writing code. This reduces the time it takes to prepare data for analytics and ML by up to 80% compared to traditional approaches to data preparation.

Software as a service (SaaS) applications play a pivotal role in organizations’ analytics pipelines. This data is essential to include when performing analytics to get insights to make better business decisions. Amazon AppFlow is a fully managed integration service that helps you transfer SaaS data to your data lake securely.

Recently, DataBrew announced native console integration with Amazon AppFlow to connect to data from applications like Salesforce, Zendesk, Slack, ServiceNow, and other SaaS applications, and AWS services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift. With native integration with Amazon AppFlow, DataBrew is addressing both the challenges with data preparation and seamless integration with SaaS applications.

Salesforce is a popular and widely used customer relationship management (CRM) platform. It lets you store and manage prospect and customer information—like contact info, accounts, leads, and sales opportunities—in one central location. You can derive a lot of useful information by combining the prospect information stored in Salesforce with other structured and unstructured data in your data lake.

In this post, we walk you through how to extract data from Salesforce.com using the native integration that DataBrew has with Amazon AppFlow, prepare the data for your analytical use cases using DataBrew, store it in Amazon S3, and query it with Amazon Athena.

Architecture overview

The following diagram represents the flow described in this post. With the visual point-and-click interface in Amazon AppFlow, you create a new flow with Salesforce as source. You can either use an existing connection to Salesforce or create a new one. In DataBrew, while creating a dataset, you can choose the Amazon AppFlow flow as one of the sources to import the data for data preparation. After you perform the data preparation steps on sample data, you can save the steps as a recipe and automate the flow by creating a DataBrew job by selecting the dataset for the source and the newly created recipe for transformation. The transformed data is published to an S3 bucket. You can use an AWS Glue crawler to catalog that data and use Athena to query the data.

The workflow includes the following steps:

  1. Create an S3 bucket for the raw and transformed data.
  2. Create a connection to Salesforce.
  3. Create a flow to extract the data from Salesforce.com.
  4. Create a dataset and project.
  5. Prepare the data and create a recipe.
  6. Create a job to pull the data from the Amazon AppFlow flow, run the transformations, and load the data in Amazon S3.
  7. Create a crawler to catalog the data.
  8. Analyze data using Athena.

Prerequisites

To implement this solution, you need the following prerequisites:

Now that we have discussed the architecture of our solution, we present the step-by-step instructions.

Create an S3 bucket

To create an S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.
  2. Enter a name for your bucket; for example, databrew-appflow-data-prep-<your name>.
  3. Choose Create bucket.

Create a connection to Salesforce

If you already have a Salesforce connection created in Amazon AppFlow, you can skip this step. To create a new connection, complete the following steps:

  1. On the Amazon AppFlow console, choose Connections in the navigation pane.
  2. From the list of connectors, choose Salesforce.
  3. Choose Create connection.
  4. For Connection name, enter a name (for example, sfdc-appflow).
  5. Choose Continue.
  6. You’re redirected to a sign-in screen where you can log in to your Salesforce account. If you don’t have a Salesforce account, you can sign up for a developer account.
  7. Choose Allow to allow Amazon AppFlow to access your Salesforce account.

You can now see the new connection that was created.

Create a flow in Amazon AppFlow to extract data from Salesforce.com

To create a flow in Amazon AppFlow, complete the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. Choose Create flow.
  3. On the Specify flow details page, enter a name for the flow (for example, salesforce-data).
  4. Optionally, provide a description for the flow and tags.
  5. Choose Next.
  6. On the Configure flow page, for Source name, choose Salesforce.
  7. Choose the connection we created in the previous step.
  8. For Choose Salesforce object, choose the object you want to work with (for this post, we choose Opportunity).
  9. For Destination name, choose Amazon S3.
  10. For Bucket details, choose the bucket you created earlier.
  11. Optionally, provide a prefix (folder) where you want the data to land within the bucket.
  12. Under Additional settings, for Data transfer preference, select Aggregate all records.
  13. For Choose how to trigger the flow, select Run on demand.
  14. Choose Next.
  15. On the Source to destination field mapping page, for Source field name, select the fields you want to work with and choose Map fields directly.

Alternatively, you can choose Map all fields directly to map all the fields from the object.

  1. On the Add filters page, add any filters for the data you’re pulling.
  2. On the Review and create page, review all the details and choose Create flow.
  3. After the flow is created, choose Run flow to run the flow and verify whether the flow ran successfully.
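
Alternatively, you can run the on-demand flow programmatically. The following minimal Boto3 sketch uses the flow name from the previous steps; the response fields shown reflect our reading of the API and are worth verifying:

import boto3

appflow = boto3.client('appflow')

# Start the on-demand flow created above
response = appflow.start_flow(flowName='salesforce-data')
print(response.get('flowStatus'), response.get('executionId'))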

Create a dataset and project in DataBrew

To create a dataset and DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Choose Create project.
  3. For Project name, enter a name (for example, databrew-appflow-integration).
  4. For Select a dataset, select New dataset.
  5. For Dataset name, enter a name (for example, databrew-sfdc).
  6. From the list of sources, choose Amazon AppFlow.
  7. Select the flow you created.
  8. Under Permissions, for Role name, choose Create new IAM role.
  9. For New IAM role suffix, enter a suffix (for example, appflow-databrew).
  10. Choose Create project.

After you create the project, data is loaded to DataBrew so that you can perform data preparation activities.

Prepare data and create a recipe

With DataBrew, you can choose from over 250 pre-built transformations to automate data preparation tasks, all without the need to write any code. In this post, we only discuss a few of them. For the full list of transformations, see Recipe step and function reference.

In this step, we split the CloseDate column into CloseYear, CloseMonth, and CloseDay. Then we flag the outliers in the Amount column.

  1. Duplicate the column CloseDate by choosing the column and choosing Duplicate.
  2. For Duplicate column name, enter a name.
  3. Choose Apply.
  4. Select the column you created and on the Clean menu, choose Replace value or pattern.
  5. For Value to be replaced, select Enter custom value and enter -.
  6. Choose Apply.

This replaces - with empty values.

  1. Select the modified column and on the options menu, choose Split and then At positions from beginning.
  2. For Position from the beginning, enter 4.
  3. Choose Apply.

This creates a new column with the year value.

Next, we split the column at position 2 from the beginning in CloseDate_copy_2 to extract month and day.

  1. On the options menu, choose Split column.
  2. Rename the CloseDate_copy_1 column to CloseYear and choose Apply.
  3. Repeat the steps to rename the other two newly created columns to CloseMonth and CloseDay.
  4. Select the Amount column and on the Outliers menu, choose Flag outliers.
  5. For Standard deviation threshold, enter 3.
  6. Under Outlier actions, select Flag outliers.
  7. Choose Apply.

You can see that an additional column got added and the outliers are flagged.

All the steps that we performed so far are recorded under Recipe.

  1. Under Recipe, choose Publish.
  2. For Version description, enter a description.
  3. Choose Publish.

This saves the recipe for future use.

Create a DataBrew job

To create a DataBrew job, complete the following steps:

  1. On the DataBrew console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. For Job name, enter a name (for example, databrew-appflow-job).
  4. For Select a dataset, choose the dataset we created (databrew-sfdc).
  5. For Select a recipe, choose the recipe we created (databrew-appflow-integration-recipe).
  6. In the Job output settings section, for Output to, choose Amazon S3.
  7. For S3 location, enter the S3 path for the data (for example, s3://databrew-appflow-data-prep-<your name>/processed/).
  8. For Role name, choose the role with the suffix appflow-databrew that we created earlier.
  9. Choose Create and run job.
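
If you'd rather script this step, a rough Boto3 equivalent might look like the following. The role ARN and recipe version are placeholders; verify the parameter shapes against the DataBrew API reference:

import boto3

databrew = boto3.client('databrew')

# Create a recipe job that applies the published recipe to the dataset
# and writes CSV output to S3; role ARN and recipe version are placeholders
databrew.create_recipe_job(
    Name='databrew-appflow-job',
    DatasetName='databrew-sfdc',
    RecipeReference={
        'Name': 'databrew-appflow-integration-recipe',
        'RecipeVersion': '1.0',
    },
    RoleArn='arn:aws:iam::<account-id>:role/AWSGlueDataBrewServiceRole-appflow-databrew',
    Outputs=[
        {
            'Location': {
                'Bucket': 'databrew-appflow-data-prep-<your name>',
                'Key': 'processed/',
            },
            'Format': 'CSV',
        }
    ],
)

# Start a run of the recipe job
databrew.start_job_run(Name='databrew-appflow-job')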

Create a crawler to catalog the data

To create your AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers.
  2. Choose Add crawler.
  3. For Crawler name, enter a name (for example, databrew-opportunity-data).
  4. Choose Next.
  5. Under Specify crawler source type, keep the default options and choose Next.
  6. Under Add a data store, for Include path, choose the S3 bucket that we used for the processed data (for example, s3://databrew-appflow-data-prep-<your name>/processed).
  7. Choose Next.
  8. For Add another data, select No.
  9. Choose Next.
  10. Select Create an IAM role and provide a suffix for the role (for example, databrew).
  11. For Frequency, choose Run on demand.
  12. On the next page, choose Add database.
  13. Enter a database name (for example, databrew-appflow).
  14. Choose Create.
  15. For Prefix, enter opportunity_.
  16. Choose Next.
  17. Review the details and choose Finish.
  18. After the crawler is created, select it and choose Run crawler.

The crawler catalogs the data that we uploaded to Amazon S3 after processing using DataBrew.
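
For completeness, an equivalent crawler can be created and started with the AWS Glue API. The following sketch uses placeholder names matching the console steps; the role name assumes the role the console creates with the databrew suffix:

import boto3

glue = boto3.client('glue')

# Create a crawler over the processed data and start it
glue.create_crawler(
    Name='databrew-opportunity-data',
    Role='AWSGlueServiceRole-databrew',
    DatabaseName='databrew-appflow',
    TablePrefix='opportunity_',
    Targets={
        'S3Targets': [
            {'Path': 's3://databrew-appflow-data-prep-<your name>/processed/'}
        ]
    },
)

glue.start_crawler(Name='databrew-opportunity-data')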

Analyze data using Athena

When the crawler is complete, we can analyze the data with Athena.

  1. On the AWS Glue console, choose the database we created.
  2. Under Data catalog, choose Tables in databrew-appflow.

You can see a table named opportunity_processed, which points to the Amazon S3 location where the processed data was landed.

  1. Select the table name.
  2. On the Action menu, choose View data.

A pop-up may appear to let you know that you’re going to the Athena console.

  1. Choose Preview data.

If this is the first time using Athena in this account, you have to set the query result location.

  1. Run a query in Athena.

You should be able to see the data in Amazon S3.

You can perform further analysis by running more queries on the data. The following query returns the expected revenue for opportunities grouped by closeyear and closemonth:

SELECT closeyear, closemonth, SUM(expectedrevenue) FROM  "AwsDataCatalog"."databrew-appflow"."opportunity_processed" 
GROUP BY  closeyear, closemonth;

Clean up

You may want to clean up the demo environment when you are done. To do so, delete the following resources that were created as part of this post:

  • S3 bucket (databrew-appflow-data-prep-<your name>)
  • Connection to Salesforce in Amazon AppFlow (sfdc-appflow)
  • Flow in Amazon AppFlow (salesforce-data)
  • Dataset (databrew-sfdc), project (databrew-appflow-integration), and job (databrew-appflow-job) in DataBrew
  • AWS Glue Data Catalog database (databrew-appflow)
  • IAM role (AWSGlueDataBrewServiceRole-appflow-databrew)

Conclusion

In this post, we walked you through how to extract data from Salesforce.com using the native integration that DataBrew has with Amazon AppFlow. We also demonstrated how to prepare the data for analytical use cases using DataBrew, operationalize the data preparation steps by creating a recipe and use that in a DataBrew job, store the job result in Amazon S3, and query it with Athena.

If you have any questions or suggestions, please leave a comment.


About the Authors

Ramkumar Nottath is a Sr. Solutions Architect at AWS focusing on Analytics services. He enjoys working with various customers to help them build scalable, reliable big data and analytics solutions. His interests extend to various technologies such as analytics, data warehousing, streaming, and machine learning. He loves spending time with his family and friends.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Prepare, transform, and orchestrate your data using AWS Glue DataBrew, AWS Glue ETL, and AWS Step Functions

Post Syndicated from Durga Mishra original https://aws.amazon.com/blogs/big-data/prepare-transform-and-orchestrate-your-data-using-aws-glue-databrew-aws-glue-etl-and-aws-step-functions/

Data volumes in organizations are increasing at an unprecedented rate, exploding from terabytes to petabytes and in some cases exabytes. As data volume increases, it attracts more and more users and applications to use the data in many different ways—sometimes referred to as data gravity. As data gravity increases, we need to find tools and services that allow us to prepare and process a large amount of data with ease to make it ready for consumption by a variety of applications and users. In this post, we look at how to use AWS Glue DataBrew and AWS Glue extract, transform, and load (ETL) along with AWS Step Functions to simplify the orchestration of a data preparation and transformation workflow.

DataBrew is a visual data preparation tool that exposes data in spreadsheet-like views to make it easy for data analysts and data scientists to enrich, clean, and normalize data to prepare it for analytics and machine learning (ML) without writing a single line of code. With more than 250 pre-built transformations, it helps reduce the time it takes to prepare the data by about 80% compared to traditional data preparation approaches.

AWS Glue is a fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage.

Step Functions is a serverless orchestration service that makes it easy to build an application workflow by combining many different AWS services like AWS Glue, DataBrew, AWS Lambda, Amazon EMR, and more. Through the Step Functions graphical console, you see your application’s workflow as a series of event-driven steps. Step Functions is based on state machines and tasks. A state machine is a workflow, and a task is a state in a workflow that represents a single unit of work that another AWS service performs. Each step in a workflow is a state.

Overview of solution

DataBrew is a new service we introduced at AWS re:Invent 2020 in the self-service data preparation space and is focused on the data analyst, data scientist, and self-service audience. We understand that some organizations may have use cases where self-service data preparation needs to be integrated with a standard corporate data pipeline for advanced data transformation and operational reasons. This post provides a solution for customers who are looking for a mechanism to integrate data preparation done by analysts and scientists to the standard AWS Glue ETL pipeline using Step Functions. The following diagram illustrates this workflow.

Architecture overview

To demonstrate the solution, we prepare and transform the publicly available New York Citi Bike trip data to analyze bike riding patterns. The dataset has the following attributes.

  • starttime – Start time of bike trip
  • stoptime – End time of bike trip
  • start_station_id – Station ID where bike trip started
  • start_station_name – Station name where bike trip started
  • start_station_latitude – Station latitude where bike trip started
  • start_station_longitude – Station longitude where bike trip started
  • end_station_id – Station ID where bike trip ended
  • end_station_name – Station name where bike trip ended
  • end_station_latitude – Station latitude where bike trip ended
  • end_station_longitude – Station longitude where bike trip ended
  • bikeid – ID of the bike used in bike trip
  • usertype – User type (customer = 24-hour pass or 3-day pass user; subscriber = annual member)
  • birth_year – Birth year of the user on bike trip
  • gender – Gender of the user (0 = unknown; 1 = male; 2 = female)

We use DataBrew to prepare and clean the most recent data and then use Step Functions for advanced transformation in AWS Glue ETL.

For the DataBrew steps, we clean up the dataset and remove invalid trips where either the start time or stop time is missing, or the rider’s gender isn’t specified.

After DataBrew prepares the data, we use AWS Glue ETL tasks to add a new column tripduration and populate it with values by subtracting starttime from stoptime.

After we perform the ETL transforms and store the data in our Amazon Simple Storage Service (Amazon S3) target location, we use Amazon Athena to run interactive queries on the data to find the most used bikes to schedule maintenance, and the start stations with the most trips to make sure enough bikes are available at these stations.

We also create an interactive dashboard using Amazon QuickSight to gain insights and visualize the data to compare trip count by different rider age groups and user type.

The following diagram shows our solution architecture.

Prerequisites

To follow along with this walkthrough, you must have an AWS account. Your account should have permission to run an AWS CloudFormation script to create the services mentioned in solution architecture.

Your AWS account should also have an active subscription to QuickSight to create the visualization on processed data. If you don’t have a QuickSight account, you can sign up for an account.

Create the required infrastructure using AWS CloudFormation

To get started, complete the following steps:

  1. Choose Launch Stack to launch the CloudFormation stack to configure the required resources in your AWS account.
  2. On the Create stack page, choose Next.
  3. On the Specify stack details page, enter values for Stack name and Parameters, and choose Next.
  4. Follow the remaining instructions using all the defaults and complete the stack creation.

The CloudFormation stack takes approximately 2 minutes to complete.

  1. When the stack is in the CREATE_COMPLETE state, go to the Resources tab of the stack and verify that you have 18 new resources.

In the following sections, we go into more detail of a few of these resources.

Source data and script files S3 bucket

The stack created an S3 bucket with the name formatted as <Stack name>-<SolutionS3BucketNameSuffix>.

On the Amazon S3 console, verify the creation of the bucket with the following folders:

  • scripts – Contains the Python script for the ETL job to process the cleaned data
  • source – Has the source Citi Bike data to be processed by DataBrew and the ETL job

DataBrew dataset, project, recipe, and job

The CloudFormation stack also created the DataBrew dataset, project, recipe, and job for the solution. Complete the following steps to verify that these resources are available:

  1. On the DataBrew console, choose Projects to see the list of projects.

You should see a new project, associated dataset, attached recipe, and job in the Projects list.

  1. To review the source data, choose the dataset, and choose View dataset on the resulting popup.

You’re redirected to the Dataset preview page.

DataBrew lets you create a dynamic dataset using custom parameters and conditions. This feature helps you automatically process the latest files available in your S3 buckets with a user-friendly interface. For example, you can choose the latest 10 files or files that are created in the last 24 hours that match specific conditions to be automatically included in your dynamic dataset.

  1. Choose the Data profile overview tab to examine and collect summaries of statistics about your data by running the data profile.
  2. Choose Run data profile to create and run a profile job.
  3. Follow the instructions to create and run the job to profile the source data.

The job takes 2–3 minutes to complete.

When the profiling job is complete, you should see Summary, Correlations, Value distribution, and Column Summary sections with more insight statistics about the data, including data quality, on the Data profile overview tab.

  1. Choose the Column statistics tab to see more detailed statistics on individual data columns.

DataBrew provides a user-friendly way to prepare, profile, and visualize the lineage of the data. Appendix A at the end of this post provides details on some of the widely used data profiling and statistical features provided by DataBrew out of the box.

  1. Choose the Data Lineage tab to see the information on data lineage.
  2. Choose PROJECTS in the navigation pane and choose the project name to review the data in the project and the transformation applied through the recipe.

DataBrew provides over 250 transform functions to prepare and transform the dataset. Appendix B at the end of this post reviews some of the most commonly used transformations.

We don’t need to run the DataBrew job manually—we trigger it using a Step Functions state machine in subsequent steps.

AWS Glue ETL job

The CloudFormation stack also created an AWS Glue ETL job for the solution. Complete the following steps to review the ETL job:

  1. On the AWS Glue console, choose Jobs in the navigation pane to see the new ETL job.
  2. Select the job and on the Script tab, choose Edit script to inspect the Python script for the job.

We don’t need to run the ETL job manually—we trigger it using a Step Functions state machine in subsequent steps.

Start the Step Functions state machine

The CloudFormation stack created a Step Functions state machine to orchestrate running the DataBrew job and AWS Glue ETL job. A Lambda function starts the state machine whenever the daily data files are uploaded into the source data folder of the S3 bucket. For this post, we start the state machine manually.
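The stack provisions this function for you; as a rough sketch, such a handler could look like the following (the STATE_MACHINE_ARN environment variable name is an assumption for illustration):

import json
import os
import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # Start the state machine that runs the DataBrew job and the AWS Glue ETL job
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # assumed variable name
        input=json.dumps({"s3Event": event}),
    )
    return {"executionArn": response["executionArn"]}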

  1. On the Step Functions console, choose State machines in the navigation pane to see the list of state machines.
  2. Choose the state machine to see the state machine details.
  3. Choose Start execution to run the state machine.

The details of the state machine run are displayed on the Details tab.

  1. Review the Graph inspector section to observe the different states of the state machine. As each step completes, it turns green.
  2. Choose the Definition tab to review the definition of the state machine.

When the state machine is complete, it should have run the DataBrew job to clean the data and the AWS Glue ETL job to process the cleaned data.

The DataBrew job removed all trips in the source dataset with a missing starttime or stoptime, or an unspecified gender. It also copied the cleaned data to the cleaned folder of the S3 bucket. We review the cleaned data in subsequent steps with Athena.

The ETL job processed the data in the cleaned folder to add a calculated column tripduration, which is calculated by subtracting the starttime from the stoptime. It also converted the processed data into columnar format (Parquet), which is more optimized for analytical processing, and copied it to the processed folder. We review the processed data in subsequent steps with Athena and also use it with QuickSight to get some insight into rider behavior.
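The exact script is provided in the scripts folder created by the stack; a minimal PySpark sketch of the same logic could look like the following (the bucket name and folder paths are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp

spark = SparkSession.builder.appName("citibike-tripduration").getOrCreate()

# Read the cleaned CSV data written by the DataBrew job
trips = spark.read.option("header", "true").csv("s3://<bucket>/cleaned/")

# tripduration (in seconds) = stoptime - starttime
trips = trips.withColumn(
    "tripduration",
    unix_timestamp(col("stoptime")) - unix_timestamp(col("starttime")),
)

# Store the result in Parquet format in the processed folder
trips.write.mode("overwrite").parquet("s3://<bucket>/processed/")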

Run an AWS Glue crawler to create tables in the Data Catalog

The CloudFormation stack also added three AWS Glue crawlers to crawl through the data stored in the source, cleaned, and processed folders. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Complete the following steps to run these crawlers to create AWS Glue tables for the data in each of the S3 folders.

  1. On the AWS Glue console, choose Crawlers in the navigation pane to see the list of crawlers created by the CloudFormation stack. If you have AWS Lake Formation enabled in the Region in which you’re implementing this solution, you may get an insufficient Lake Formation permissions error. Follow the steps in Appendix C to grant the required permissions to the IAM role used by the AWS Glue crawler to create tables in the citibike database.
  2. Select each crawler one by one and choose Run crawler.

After the crawlers successfully run, you should see one table added by each crawler.

We run crawlers manually in this post, but you can trigger the crawlers whenever a new file is added to their respective S3 bucket folder.
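For example, an S3 event notification could invoke a small Lambda function that starts the matching crawler. The following sketch assumes hypothetical crawler names and prefixes; use the names created by the CloudFormation stack in your account:

import boto3

glue = boto3.client("glue")

# Map S3 prefixes to the crawler that should process them (names are assumptions)
PREFIX_TO_CRAWLER = {
    "source/": "citibike-source-crawler",
    "cleaned/": "citibike-cleaned-crawler",
    "processed/": "citibike-processed-crawler",
}

def lambda_handler(event, context):
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        for prefix, crawler in PREFIX_TO_CRAWLER.items():
            if key.startswith(prefix):
                glue.start_crawler(Name=crawler)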

  1. To verify the AWS Glue database citibike, created by the CloudFormation script, choose Databases in the navigation pane.
  2. Select the citibike database and choose View tables.

You should now see the three tables created by the crawlers.

Use Athena to run analytics queries

In the following steps, we use Athena for ad-hoc analytics queries on the cleaned and processed data in the processed_citibike table of the Data Catalog. For this post, we find the 20 most used bikes to schedule maintenance for them, and find the top 20 start stations with the most trips to make sure enough bikes are available at these stations.

  1. On the Athena console, for Data source, choose AwsDataCatalog.
  2. For Database, choose citibike.

The three new tables are listed under Tables.

If you haven’t used Athena before in your account, you receive a message to set up a query result location for Athena to store the results of queries.

  1. Run the following query on the New query1 tab to find the 20 most used bikes:
SELECT bikeid as BikeID, count(*) TripCount FROM "citibike"."processed_citibike" group by bikeid order by 2 desc limit 20;

If you have AWS Lake Formation enabled in the Region in which you’re implementing this solution, you may get an error (similar to the following) when running queries. To resolve the permission issues, follow the steps in Appendix D to grant Select permission on all tables in the citibike database to the logged-in user using Lake Formation.

  1. Run the following query on the New query1 tab to find the top 20 start stations with the most trips:
SELECT start_station_name, count(*) trip_count FROM "citibike"."processed_citibike" group by start_station_name order by 2 desc limit 20;

Visualize the processed data on a QuickSight dashboard

As the final step, we visualize the following data using a QuickSight dashboard:

  • Compare trip count by different rider age groups
  • Compare trip count by user type (customer = 24-hour pass or 3-day pass user; subscriber = annual member)

Your AWS account should also have an active subscription to QuickSight to create the visualization on processed data. If you don’t have a QuickSight account, you can sign up for an account.

  1. On the QuickSight console, create a dataset with Athena as your data source.
  2. Follow the instructions to complete your dataset creation.
  3. Add a calculated field rider_age using the following formula:
dateDiff(parseDate({birth_year},"YYYY"), truncDate( "YYYY", now() ) ,"YYYY")

Your dataset should look like the following screenshot.

Now you can create visualizations to compare weekly trip count by user type and total trip count by rider age group.

Clean up

To avoid incurring future charges, delete the resources created for the solution.

  1. Delete the DataBrew profile job created for profiling the source data.
  2. Delete the CloudFormation stack to delete all the resources created by the stack.

Conclusion

In this post, we discussed how to use DataBrew to prepare your data and then further process the data using AWS Glue ETL to integrate it in a standard operational ETL flow to gather insights from your data.

We also walked through how you can use Athena to perform SQL analysis on the dataset and visualize and create business intelligence reports through QuickSight.

We hope this post provides a good starting point for you to orchestrate your DataBrew job with your existing or new data processing ETL pipelines.

For more details on using DataBrew with Step Functions, see Manage AWS Glue DataBrew Jobs with Step Functions.

For more information on DataBrew jobs, see Creating, running, and scheduling AWS Glue DataBrew jobs.

Appendix A

The following table lists the widely used data profiling and statistical features provided by DataBrew out of the box.

  • Type – Data type of the column.
  • missingValuesCount – The number of missing values. Null and empty strings are considered missing.
  • distinctValuesCount – The number of distinct values that appear at least once.
  • entropy – A measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
  • mostCommonValues – A list of the top 50 most common values.
  • mode – The value that appears most often in the column.
  • min/max/range – The minimum, maximum, and range of values in the column.
  • mean – The mean or average value of the column.
  • kurtosis – A measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution. A dataset with high kurtosis has more outliers compared to a dataset with low kurtosis.
  • skewness – A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
  • Correlation – The Pearson correlation coefficient. This is a measure of whether one column’s values correlate with the values of another column.
  • Percentile95 – The element in the list that represents the 95th percentile (95% of numbers fall below this and 5% of numbers fall above it).
  • interquartileRange – The range between the 25th percentile and 75th percentile of numbers.
  • standardDeviation – The unbiased sample standard deviation of values in the column.
  • min/max values – A list of the five minimum and maximum values in a column.
  • zScoreOutliersSample – A list of the top 50 outliers that have the largest or smallest Z-score. -/+3 is the default threshold.
  • valueDistribution – The measure of the distribution of values by range.

Appendix B

DataBrew provides the ability to prepare and transform your dataset using over 250 transforms. In this section, we discuss some of the most commonly used transformations:

  • Combine datasets – You can combine datasets in the following ways:
    • Join – Combine several datasets by joining them with other datasets using a join type like inner join, outer join, or excluding join.
    • Union operation – Combine several datasets using a union operation.
    • Multiple files for input datasets – While creating a dataset, you can use a parameterized Amazon S3 path or a dynamic parameter and select multiple files.
  • Aggregate data – You can aggregate the dataset using a group by clause and use standard and advanced aggregation functions like Sum, Count, Min, Max, mode, standard deviation, variance, skewness, kurtosis, as well as cumulative functions like cumulative sum and cumulative count.
  • Pivot a dataset – You can pivot the dataset in the following ways:
    • Pivot operation – Convert all rows to columns.
    • Unpivot operation – Convert all columns to rows.
    • Transpose operation – Convert all selected rows to columns and columns to rows.
  • Unnest top level value – You can extract values from arrays into rows and columns into rows. This only operates on top-level values.
  • Outlier detection and handling – This transformation works with outliers in your data and performs advanced transformations on them like flag outliers, rescale outliers, and replace or remove outliers. You can use several strategies like Z-score, modified Z-score, and interquartile range (IQR) to detect outliers.
  • Delete duplicate values – You can delete any row that is an exact match to an earlier row in the dataset.
  • Handle or impute missing values – You have the following options:
    • Remove invalid records – Delete an entire row if an invalid value is encountered in a column of that row.
    • Replace missing values – Replace missing values with custom values, most frequent value, last valid value, or numeric aggregate values.
  • Filter data – You can filter the dataset based on a custom condition, validity of a column, or missing values.
  • Split or merge columns – You can split a column into multiple columns based on a custom delimiter, or merge multiple columns into a single column.
  • Create columns based on functions – You can create new columns using different functions like mathematical functions, aggregated functions, date functions, text functions, window functions like next and previous, and rolling aggregation functions like rolling sum, rolling count, and rolling mean.

Appendix C

If you have AWS Lake Formation enabled in the Region in which you’re implementing this solution, you may get an insufficient Lake Formation permissions error during the AWS Glue crawler run.

Grant Create table permission on the GlueDatabase (citibike) to the GlueCrawlersRole (find the ARN in the CloudFormation stack’s Resources section) so the crawler can create the required tables.

Appendix D

If you have AWS Lake Formation enabled in the Region in which you’re implementing this solution, you may get an error similar to the following when running queries on tables in the citibike database.

Grant Select permission on all tables in the GlueDatabase (citibike) to the logged-in user in Lake Formation so that the user can query the tables created by the solution.


About the Authors

Narendra Gupta is a solutions architect at AWS, helping customers on their cloud journey with focus on AWS analytics services. Outside of work, Narendra enjoys learning new technologies, watching movies, and visiting new places.

Durga Mishra is a solutions architect at AWS. Outside of work, Durga enjoys spending time with family and loves to hike on Appalachian trails and spend time in nature.

Jay Palaniappan is a Sr. Analytics Specialist Solutions architect at AWS, helping media & entertainment customers adopt and run AWS Analytics services.

Centralize feature engineering with AWS Step Functions and AWS Glue DataBrew

Post Syndicated from Gayatri Ghanakota original https://aws.amazon.com/blogs/big-data/centralize-feature-engineering-with-aws-step-functions-and-aws-glue-databrew/

One of the key phases of a machine learning (ML) workflow is data preprocessing, which involves cleaning, exploring, and transforming the data. AWS Glue DataBrew, announced at AWS re:Invent 2020, is a visual data preparation tool that enables you to develop common data preparation steps without having to write any code or install anything.

In this post, we show how to integrate the standard data preparation steps with training an ML model and running inference on a pre-trained model via DataBrew and AWS Step Functions. The solution is architected as an ML pipeline that trains a model on the publicly available Air Quality Dataset to predict CO levels in New York City.

Overview of solution

The following architecture diagram shows an overview of the ML workflow, which employs DataBrew for data preparation and scheduling jobs, and uses AWS Lambda and Step Functions to orchestrate ML model training and inference using the AWS Step Functions Data Science SDK. We use Amazon EventBridge to trigger the Step Functions state machine when the DataBrew job is complete.


The steps in this solution are as follows:

  1.  Import your dataset to Amazon Simple Storage Service (Amazon S3).
  2.  Launch the AWS CloudFormation stack, which deploys the following:
    1. DataBrew recipes for training and inference data.
    2.  The DataBrew job’s schedule for training and inference.
    3.  An EventBridge rule.
    4. A Lambda function that triggers the Step Functions state machine, which in turn orchestrates the states.
    5. The training state includes the following steps:
      1. Runs an Amazon SageMaker processing job to remove column headers.
      2. Performs SageMaker model training.
      3. Outputs the data to an S3 bucket to store the trained model.
    6. The inference state includes the following steps:
      1. Runs a SageMaker processing job to remove column headers.
      2. Performs a SageMaker batch transform.
      3. Outputs the data to an S3 bucket to store the predictions.

Prerequisites

For this solution, you should have the following prerequisites:

Load the dataset to Amazon S3

In this first step, we load our air quality dataset into Amazon S3.

  1. Download the Outdoor Air Quality Dataset for the years 2018, 2019, and 2020, limiting to the following options:
    1. Pollutant – CO
    2. Geographic Area – New York
    3. Monitor Site – All Sites
  2. For each year of data, split by year, month, and day, and use the data for 2018–2019 to train the model and the 2020 data to run inference.
  3. Run the following script, which stores the output into the NY_XXXX folder:
import os
import pandas as pd

def split_data(root_folder, df):
    # Derive year, month, and day columns from the Date column
    df["year"] = pd.DatetimeIndex(df["Date"]).year
    df["month"] = pd.DatetimeIndex(df["Date"]).month
    df["day"] = pd.DatetimeIndex(df["Date"]).day
    # Write one CSV file per month/day under the root folder
    for m, x1 in df.groupby("month"):
        month_dir = "{:02}".format(m)
        for d, x2 in x1.groupby("day"):
            day_dir = "{:02}".format(d)
            out_dir = os.path.join(root_folder, month_dir, day_dir)
            os.makedirs(out_dir, exist_ok=True)
            x2.to_csv(os.path.join(out_dir, "{:02}.csv".format(d)), index=False)

ny_data_2018 = pd.read_csv("<path to downloaded 2018 data file>")
ny_data_2019 = pd.read_csv("<path to downloaded 2019 data file>")
ny_data_2020 = pd.read_csv("<path to downloaded 2020 data file>")

split_data("NY_2018", ny_data_2018)
split_data("NY_2019", ny_data_2019)
split_data("NY_2020", ny_data_2020)
  4. Create an S3 bucket in the us-east-1 Region and upload the NY_2018 and NY_2019 folders to s3://<artifactbucket>/train_raw_data/.


  5. Upload the NY_2020 folder to s3://<artifactbucket>/inference_raw_data/.


Deploy your resources

For a quick start of this solution, you can deploy the provided AWS CloudFormation stack. This creates all the required resources in your account (us-east-1 Region), including the DataBrew datasets, jobs, projects, and recipes; the Step Functions train and inference state machines (which include SageMaker processing, model training, and batch transform jobs); an EventBridge rule; and the Lambda function to deploy an end-to-end ML pipeline for a predefined S3 bucket.

  1. Launch the following stack:
  2. For ArtifactBucket, enter the name of the S3 bucket you created in the previous step.


  3. Select the three acknowledgement check boxes.
  4. Choose Create stack.


Test the solution

As part of the CloudFormation template, the DataBrew job km-mlframework-trainingfeatures-job was created, which is scheduled to run every Monday at 10:00 AM UTC. This job creates the features required to train the model.

When the template deployment is successfully completed, you can manually activate the training pipeline. For this, navigate to the DataBrew console, select the DataBrew job km-mlframework-trainingfeatures-job, and choose Run job.


The job writes the features to s3://<artifactbucket>/train_features/.

When the job is complete, an EventBridge rule invokes the Lambda function, which orchestrates the SageMaker training jobs via Step Functions.
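The CloudFormation stack already creates this rule; purely as an illustration, a comparable rule could be defined with the EventBridge API as follows. The event pattern fields reflect DataBrew’s EventBridge integration as we understand it, so verify them against the rule deployed by the stack before reusing this sketch:

import json
import boto3

events = boto3.client("events")

rule_name = "databrew-training-job-succeeded"  # hypothetical rule name
events.put_rule(
    Name=rule_name,
    EventPattern=json.dumps({
        "source": ["aws.databrew"],
        "detail-type": ["DataBrew Job State Change"],
        "detail": {
            "jobName": ["km-mlframework-trainingfeatures-job"],
            "state": ["SUCCEEDED"],
        },
    }),
)

# Route matching events to the Lambda function that starts the training state machine
events.put_targets(
    Rule=rule_name,
    Targets=[{"Id": "start-training-pipeline", "Arn": "<lambda-function-arn>"}],
)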


When the job is complete, the output of the model is stored in s3://<artifactbucket>/artifact-repo/model/.

In the next step, we trigger the DataBrew job km-mlframework-inferencefeatures-job, which is scheduled to run every Tuesday at 10:00 AM UTC. This job creates the inference features that are used to run inference on the trained model.

You can also activate the inference pipeline by manually triggering the DataBrew job on the DataBrew console.


The job writes the features to s3://<artifactbucket>/inference_features/.

When the job is complete, an EventBridge rule invokes the Lambda function, which orchestrates the SageMaker batch transform job via Step Functions.


When the job is complete, the predictions are written to s3://<artifactbucket>/predictions/.

For more information on DataBrew steps and building a DataBrew recipe, see Preparing data for ML models using AWS Glue DataBrew in a Jupyter notebook.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Wait for any currently running activity to complete, or manually stop it (DataBrew, Step Functions, SageMaker).
  2. Delete the scheduled DataBrew jobs km-mlframework-trainingfeatures-job and km-mlframework-inferencefeatures-job. This ensures the jobs aren’t started by the schedule.
  3. Delete the S3 bucket created to store data and model artifacts.
  4. Delete the CloudFormation stack created earlier.

Conclusion

DataBrew is designed to help data engineers and data scientists experiment with data preparation steps via a visual interface. With more than 250 built-in transformations, DataBrew can be a strong tool to accelerate your ML lifecycle for development and production stages.

In this post, we walked through the process of creating an end-to-end ML framework with DataBrew, which you can use to train an ML model as well as run inferences on a schedule. You can use the same framework with your own DataBrew recipe prepared using any dataset.

To learn more on applying the most frequently used transformations from within DataBrew, see 7 most common data preparation transformations in AWS Glue DataBrew.


About the Authors

Gayatri Ghanakota is a Machine Learning Engineer with AWS Professional Services, where she helps customers build machine learning solutions on AWS. She is passionate about developing, deploying, and explaining ML models.

Surbhi Dangi is a product and design leader at Amazon Web Services. She focuses on providing ease of use and rich functionality for both of her products: Amazon CloudWatch Synthetics and AWS Glue DataBrew. When not working, she mentors aspiring product managers, hikes, and travels the world.

How to Accelerate Building a Lake House Architecture with AWS Glue

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/how-to-accelerate-building-a-lake-house-architecture-with-aws-glue/

Customers are building databases, data warehouses, and data lake solutions in isolation from each other, each having its own separate data ingestion, storage, management, and governance layers. Often these disjointed efforts to build separate data stores end up creating data silos, data integration complexities, excessive data movement, and data consistency issues. These issues are preventing customers from getting deeper insights. To overcome these issues and easily move data around, a Lake House approach on AWS was introduced.

In this blog post, we illustrate the AWS Glue integration components that you can use to accelerate building a Lake House architecture on AWS. We will also discuss how to derive persona-centric insights from your Lake House using AWS Glue.

Components of the AWS Glue integration system

AWS Glue is a serverless data integration service that facilitates the discovery, preparation, and combination of data. It can be used for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration. So you can start analyzing your data and putting it to use in minutes, rather than months.

The following diagram illustrates the various components of the AWS Glue integration system.

Figure 1. AWS Glue integration components

Connect – AWS Glue allows you to connect to various data sources anywhere

Glue connector: AWS Glue provides built-in support for the most commonly used data stores. You can use Amazon Redshift, Amazon RDS, Amazon Aurora, Microsoft SQL Server, MySQL, MongoDB, or PostgreSQL using JDBC connections. AWS Glue also allows you to use custom JDBC drivers in your extract, transform, and load (ETL) jobs. For data stores that are not natively supported such as SaaS applications, you can use connectors. You can also subscribe to several connectors offered in the AWS Marketplace.

Glue crawlers: You can use a crawler to populate the AWS Glue Data Catalog with tables. A crawler can crawl multiple data stores in a single pass. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

Catalog – AWS Glue simplifies data discovery and governance

Glue Data Catalog: The Data Catalog serves as the central metadata catalog for the entire data landscape.

Glue Schema Registry: The AWS Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. With AWS Glue Schema Registry, you can manage and enforce schemas on your data streaming applications.

Data quality – AWS Glue helps you author and monitor data quality rules

Glue DataBrew: AWS Glue DataBrew allows data scientists and data analysts to clean and normalize data. You can use a visual interface, reducing the time it takes to prepare data by up to 80%. With Glue DataBrew, you can visualize, clean, and normalize data directly from your data lake, data warehouses, and databases.

Curate data: You can use either an AWS Glue development endpoint or AWS Glue Studio to curate your data.

AWS Glue development endpoint is an environment that you can use to develop and test your AWS Glue scripts. You can choose either Amazon SageMaker notebook or Apache Zeppelin notebook as an environment.

AWS Glue Studio is a new visual interface for AWS Glue that supports extract-transform-and-load (ETL) developers. You can author, run, and monitor AWS Glue ETL jobs. You can now use a visual interface to compose jobs that move and transform data, and run them on AWS Glue.

AWS Data Exchange makes it easy for AWS customers to securely exchange and use third-party data in AWS. This is for data providers who want to structure their data across multiple datasets or enrich their products with additional data. You can publish additional datasets to your products using the AWS Data Exchange.

Deequ is an open-source data quality library developed internally at Amazon. It provides multiple features such as automatic constraint suggestions and verification, metrics computation, and data profiling.

Build a Lake House architecture faster, using AWS Glue

Figure 2 illustrates how you can build a Lake House using AWS Glue components.

Figure 2. Building Lake House architectures with AWS Glue

The architecture flow follows these general steps:

  1. Glue crawlers scan the data from various data sources and populate the Data Catalog for your Lake House.
  2. The Data Catalog serves as the central metadata catalog for the entire data landscape.
  3. Once data is cataloged, fine-grained access control is applied to the tables through AWS Lake Formation.
  4. Curate your data with business and data quality rules by using Glue Studio, Glue development endpoints, or Glue DataBrew. Place transformed data in a curated Amazon S3 for purpose built analytics downstream.
  5. Facilitate data movement with AWS Glue to and from your data lake, databases, and data warehouse by using Glue connections. Use AWS Glue Elastic Views to replicate the data across the Lake House.

Derive persona-centric insights from your Lake House using AWS Glue

Many organizations want to gather observations from increasingly larger volumes of acquired data. These insights help them make data-driven decisions with speed and agility. To do this, they use a central data lake, a ring of purpose-built data services, and data warehouses based on persona or job function.

Figure 3 illustrates the Lake House inside-out data movement with AWS Glue DataBrew, Amazon Athena, Amazon Redshift, and Amazon QuickSight to perform persona-centric data analytics.

Figure 3. Lake House persona-centric data analytics using AWS Glue

This shows how Lake House components serve various personas in an organization:

  1. Data ingestion: Data is ingested to Amazon Simple Storage Service (S3) from different sources.
  2. Data processing: Data curators and data scientists use DataBrew to validate, clean, and enrich the data. Amazon Athena is also used to run ad hoc queries to analyze the data in the lake. The transformation is shared with data engineers to set up batch processing.
  3. Batch data processing: Data engineers or developers set up batch jobs in AWS Glue and AWS Glue DataBrew. Jobs can be initiated by an event, or can be scheduled to run periodically.
  4. Data analytics: Data and business analysts can now analyze the prepared dataset in Amazon Redshift or in Amazon S3 using Athena.
  5. Data visualizations: Business analysts can create visuals in QuickSight. Data curators can enrich data from multiple sources. Admins can enforce security and data governance. Developers can embed QuickSight dashboard in applications.

Conclusion

Using a Lake House architecture will help you get persona-centric insights quickly from all of your data based on user role or job function. In this blog post, we describe several AWS Glue components and AWS purpose-built services that you can use to build Lake House architectures on AWS. We have also presented persona-centric Lake House analytics architecture using AWS Glue, to help you derive insights from your Lake House.

Read more and get started on building Lake House Architectures on AWS.

Data preparation using an Amazon RDS for MySQL database with AWS Glue DataBrew

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/data-preparation-using-an-amazon-rds-for-mysql-database-with-aws-glue-databrew/

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, or Amazon Aurora and Amazon Relational Database Service (Amazon RDS) databases. You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an RDS database, store the cleaned data in an S3 data lake, and build a business intelligence (BI) report.

Use case overview

For our use case, we use three datasets:

  • A school dataset that contains school details like school ID and school name
  • A student dataset that contains student details like student ID, name, and age
  • A student study details dataset that contains student study time, health, country, and more

The following diagram shows the relation of these tables.

For our use case, this data is collected by a survey organization after an annual exam, and updates are made in Amazon RDS for MySQL using a JavaScript-based frontend application. We join the tables to create a single view and create aggregated data through a series of data preparation steps, and the business team uses the output data to create BI reports.

Solution overview

The following diagram illustrates our solution architecture. We use Amazon RDS to store data, DataBrew for data preparation, Amazon Athena for data analysis with standard SQL, and Amazon QuickSight for business reporting.

The workflow includes the following steps:
  1. Create a JDBC connection for RDS and a DataBrew project. DataBrew does the transformation to find the top performing students across all the schools considered for analysis.
  2. The DataBrew job writes the final output to our S3 output bucket.
  3. After the output data is written, we can create external tables on top of it with Athena create table statements and load partitions with MSCK REPAIR TABLE commands (see the example after this list).
  4. Business users can use QuickSight for BI reporting, which fetches data through Athena. Data analysts can also use Athena to analyze the complete refreshed dataset.
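For illustration, the Athena DDL for step 3 could look like the following; the table name, columns, and S3 path are hypothetical and should be adjusted to match your job output:

CREATE EXTERNAL TABLE IF NOT EXISTS student_db.top_performer_student (
  student_id   int,
  student_name string,
  school_name  string,
  marks        int
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://<output bucket>/top-performer-student/';

MSCK REPAIR TABLE student_db.top_performer_student;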

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use three mock datasets. You can download the DDL code and data files from GitHub.

  1. Create the RDS for MySQL instance to capture the student health data.
  2. Make sure you have set up the correct security group for Amazon RDS. For more information, see Setting Up a VPC to Connect to JDBC Data Stores.
  3. Create three tables: student_tbl, study_details_tbl, and school_tbl. You can use DDLsql to create the database objects.
  4. Load the student.csv, study_details.csv, and school.csv files into their respective tables. You can use student.sql, study_details.sql, and school.sql to insert the data in the tables.

Create an Amazon RDS connection

To create your Amazon RDS connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.

  1. For Connection name, enter a name (for example, student_db-conn).
  2. For Connection type, select JDBC.
  3. For Database type, choose MySQL.

  1. Provide other parameters like RDS endpoint, port, database name, and database login credentials.

  1. In the Network options section, choose the VPC, subnet, and security group of your RDS instance.
  2. Choose Create connection.

Create your datasets

We have three tables in Amazon RDS: school_tbl, student_tbl, and study_details_tbl. To use these tables, we first need to create a dataset for each table.

To create the datasets, complete the following steps (we walk you through creating the school dataset):

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.

  1. For Dataset name, enter school-dataset.
  2. Choose the connection you created (AwsGlueDatabrew-student-db-conn).
  3. For Table name, enter school_tbl.
  4. Choose Create dataset.

  1. Repeat these steps for the student_tbl and study_details_tbl tables, and name the new datasets student-dataset and study-detail-dataset, respectively.

All three datasets are available to use on the Datasets page.

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project Name, enter my-rds-proj.
  4. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  1. For Select a dataset, select My datasets.
  2. For Dataset name, select study-detail-dataset.

  1. For Role name, choose your AWS Identity and Access Management (IAM) role to use with DataBrew.
  2. Choose Create project.

You can see a success message along with our RDS study_details_tbl table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Open an Amazon RDS project and build a transformation recipe

In a DataBrew interactive session, you can cleanse and normalize your data using over 250 built-in transforms. In this post, we use DataBrew to identify top performing students by performing a few transforms and finding students who got marks greater than or equal to 60 in the last annual exam.

First, we use DataBrew to join all three RDS tables. To do this, we perform the following steps:

  1. Navigate to the project you created.
  2. Choose Join.

  1. For Select dataset, choose student-dataset.
  2. Choose Next.

  1. For Select join type, select Left join.
  2. For Join keys, choose student_id for Table A and deselect student_id for Table B.
  3. Choose Finish.

Repeat the steps for school-dataset based on the school_id key.

  1. Choose MERGE to merge first_name and last_name.
  2. Enter a space as a separator.
  3. Choose Apply.

We now filter the rows based on marks value greater than or equal to 60 and add the condition as a recipe step.

  1. Choose FILTER.

  1. Provide the source column and filter condition and choose Apply.

The final data shows the top performing students, who had marks greater than or equal to 60.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter top-performer-student.

For this post, we use Parquet as the output format.

  1. For File type, choose PARQUET.
  2. For S3 location, enter the S3 path of the output folder.

  1. For Role name, choose an existing role or create a new one.
  2. Choose Create and run job.

  1. Navigate to the Jobs page and wait for the top-performer-student job to complete.

  1. Choose the Destination link to navigate to Amazon S3 to access the job output.

Run an Athena query

Let’s validate the aggregated table output in Athena by running a simple SELECT query. The following screenshot shows the output.
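For reference, a simple validation query could look like this (the table name is illustrative and should match the external table you defined in Athena):

SELECT * FROM student_db.top_performer_student LIMIT 10;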

Create reports in QuickSight

Now let’s do our final step of the architecture, which is creating BI reports through QuickSight by connecting to the Athena aggregated table.

  1. On the QuickSight console, choose Athena as your data source.

  1. Choose the database and catalog you have in Athena.
  2. Select your table.
  3. Choose Select.

Now you can create a quick report to visualize your output, as shown in the following screenshot.

If QuickSight is using SPICE storage, you need to refresh the dataset in QuickSight after the underlying data in Athena is updated (for example, after a new run of the DataBrew job). We recommend using SPICE storage to get better performance.
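You can refresh the SPICE dataset on the QuickSight console, or programmatically with the QuickSight API; the following sketch uses placeholder account and dataset IDs:

import time
import boto3

quicksight = boto3.client("quicksight")

# Start a SPICE ingestion so the dashboard reflects the latest job output
quicksight.create_ingestion(
    AwsAccountId="<your-account-id>",
    DataSetId="<your-dataset-id>",
    IngestionId="refresh-{}".format(int(time.time())),
)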

Clean up

Delete the following resources that might accrue cost over time:

  • The RDS instance
  • The recipe job top-performer-student
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project my-rds-proj and its associated recipe my-rds-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to create a JDBC connection for an RDS database. We learned how to use this connection to create a DataBrew dataset for each table, and how to reuse this connection multiple times. We also saw how we can bring data from Amazon RDS into DataBrew and seamlessly apply transformations and run recipe jobs that refresh transformed data for BI reporting.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Data preparation using Amazon Redshift with AWS Glue DataBrew

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/data-preparation-using-amazon-redshift-with-aws-glue-databrew/

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, Amazon Aurora, and other Amazon Relational Database Service (Amazon RDS) databases. You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an Amazon Redshift table, and transform and use different feature engineering techniques to prepare data to build a machine learning (ML) model. Finally, we store the transformed data in an S3 data lake to build the ML model in Amazon SageMaker.

Use case overview

For our use case, we use mock student datasets that contain student details like school, student ID, name, age, student study time, health, country, and marks. The following screenshot shows an example of our data.

For our use case, the data scientist uses this data to build an ML model to predict a student’s score in the upcoming annual exam. However, this raw data requires cleaning and transformation. A data engineer must perform the required data transformation so the data scientist can use the transformed data to build the model in SageMaker.

Solution overview

The following diagram illustrates our solution architecture.

The workflow includes the following steps:

  1. Create a JDBC connection for Amazon Redshift and a DataBrew project.
  2. DataBrew queries sample student performance data from Amazon Redshift and performs the transformation and feature engineering to prepare the data for building the ML model.
  3. The DataBrew job writes the final output to our S3 output bucket.
  4. The data scientist builds the ML model in SageMaker to predict student marks in an upcoming annual exam.

We cover steps 1–3 in this post.

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use a mock dataset. You can download the DDL and data files from GitHub.

  1. Create the Amazon Redshift cluster to capture the student performance data.
  2. Set up a security group for Amazon Redshift.
  3. Create a schema called student_schema and a table called study_details. You can use DDLsql to create database objects.
  4. We recommend using the COPY command to load a table in parallel from data files on Amazon S3. However, for this post, you can use study_details.sql to insert the data in the tables.
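For reference, a COPY command for this table could look like the following; the bucket path and IAM role ARN are placeholders, and the format options should match your data files:

COPY student_schema.study_details
FROM 's3://<your-bucket>/study_details/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
FORMAT AS CSV
IGNOREHEADER 1;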

Create an Amazon Redshift connection

To create your Amazon Redshift connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.
  3. For Connection name, enter a name (for example, student-db-connection).
  4. For Connection type, select JDBC.
  5. Provide other parameters like the JDBC URL and login credentials.
  6. In the Network options section, choose the VPC, subnet, and security groups of your Amazon Redshift cluster.
  7. Choose Create connection.

Create datasets

To create the datasets, complete the following steps:

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.
  2. For Dataset name, enter a name (for example, student).
  3. For Your JDBC source, choose the connection you created (AwsGlueDatabrew-student-db-connection).
  4. Select the study_details table.
  5. For Enter S3 destination, enter an S3 bucket for Amazon Redshift to store the intermediate result.
  6. Choose Create dataset.

You can also configure a lifecycle rule to automatically clean up old files from the S3 bucket.
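A lifecycle rule like the following (shown here with boto3, using placeholder bucket and prefix names) expires the intermediate files after seven days:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="<your-staging-bucket>",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-databrew-redshift-staging",
                "Filter": {"Prefix": "redshift-staging/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)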

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, on the Projects page, choose Create project.
  2. For Project Name, enter student-proj.
  3. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  1. For Select a dataset, select My datasets.
  2. Select the student dataset.
  3. For Role name, choose the AWS Identity and Access Management (IAM) role to be used with DataBrew.
  4. Choose Create project.

You can see a success message along with our Amazon Redshift study_details table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Create a profiling job

DataBrew helps you evaluate the quality of your data by profiling it to understand data patterns and detect anomalies.

To create your profiling job, complete the following steps:

  1. On the DataBrew console, choose Jobs in the navigation pane.
  2. On the Profile jobs tab, choose Create job.
  3. For Job name, enter student-profile-job.
  4. Choose the student dataset.
  5. Provide the S3 location for job output.
  6. For Role name, choose the role to be used with DataBrew.
  7. Choose Create and run job.

Wait for the job to complete.

  1. Choose the Column statistics tab.

You can see that the age column has some missing values.

You can also see that the study_time_in_hr column has two outliers.

Build a transformation recipe

All ML algorithms use input data to generate outputs. Input data comprises features usually in structured columns. To work properly, the features need to have specific characteristics. This is where feature engineering comes in. In this section, we perform some feature engineering techniques to prepare our dataset to build the model in SageMaker.

Let’s drop the unnecessary columns from our dataset that aren’t required for model building.

  1. Choose Column and choose Delete.
  2. For Source columns, choose the columns school_name, first_name, and last_name.
  3. Choose Apply.

We know from the profiling report that the age value is missing in two records. Let’s fill in the missing value with the median age of other records.

  1. Choose Missing and choose Fill with numeric aggregate.
  2. For Source column, choose age.
  3. For Numeric aggregate, choose Median.
  4. For Apply transform to, select All rows.
  5. Choose Apply.

We know from the profiling report that the study_time_in_hr column has two outliers, which we can remove.

  1. Choose Outliers and choose Remove outliers.
  2. For Source column, choose study_time_in_hr.
  3. Select Z-score outliers.
  4. For Standard deviation threshold, choose 3.
  5. Select Remove outliers.
  6. Under Remove outliers, select All outliers.
  7. Under Outlier removal options, select Delete outliers.
  8. Choose Apply.
  9. Choose Delete rows and click Apply.

The next step is to convert the categorical value to a numerical value for the gender column.

  1. Choose Mapping and choose Categorical mapping.
  2. For Source column, choose gender.
  3. For Mapping options, select Map top 1 values.
  4. For Map values, select Map values to numeric values.
  5. For M, choose 1.
  6. For Others, choose 2.
  7. For Destination column, enter gender_mapped.
  8. For Apply transform to, select All rows.
  9. Choose Apply.

ML algorithms often can’t work with categorical data directly, requiring the input variables to be numeric. One-hot encoding is one technique that converts categorical data without an ordinal relationship into numeric data.
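As a quick illustration of the concept only (DataBrew applies this transform without code in the steps that follow), one-hot encoding the health column in pandas looks like this:

import pandas as pd

df = pd.DataFrame({"health": ["good", "fair", "poor", "good"]})

# Each distinct health value becomes its own indicator column
encoded = pd.get_dummies(df, columns=["health"], prefix="health")
print(encoded.columns.tolist())  # ['health_fair', 'health_good', 'health_poor']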

To apply one-hot encoding, complete the following steps:

  1. Choose Encode and choose One-hot encode column.
  2. For Source column, choose health.
  3. For Apply transform to, select All rows.
  4. Choose Apply.

The following screenshot shows the full recipe that we applied to our dataset before we can use it to build our model in SageMaker.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter student-performance.

We use CSV as the output format.

  1. For File type, choose CSV.
  2. For Role name, choose an existing role or create a new one.
  3. Choose Create and run job.
  4. Navigate to the Jobs page and wait for the student-performance job to complete.
  5. Choose the Destination link to navigate to Amazon S3 to access the job output.

Clean up

Delete the following resources that might accrue cost over time:

  • The Amazon Redshift cluster
  • The recipe job student-performance
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project student-proj and its associated recipe student-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to create a JDBC connection for an Amazon Redshift data warehouse. We learned how to use this connection to create a DataBrew dataset for an Amazon Redshift table. We also saw how easily we can bring data from Amazon Redshift into DataBrew, seamlessly apply transformations and feature engineering techniques, and run recipe jobs that refresh the transformed data for ML model building in SageMaker.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.


Building a Showback Dashboard for Cost Visibility with Serverless Architectures

Post Syndicated from Peter Chung original https://aws.amazon.com/blogs/architecture/building-a-showback-dashboard-for-cost-visibility-with-serverless-architectures/

Enterprises with centralized IT organizations and multiple lines of business frequently use showback or chargeback mechanisms to hold their departments accountable for their technology usage and costs. Chargeback involves actually billing a department for the cost of their division’s usage. Showback focuses on visibility to make departments more cost conscious and to encourage operational efficiency.

Building a showback mechanism can be challenging for business and financial analysts of an AWS Organization. You may not have the scripting or data engineering skills needed to coordinate workflows and build reports at scale. Although you can use AWS Cost Explorer as a starting point, you may want greater customizability, larger datasets beyond a one-year period, and more of a business intelligence (BI) experience.

In this post, we discuss the benefits of building a showback dashboard using the AWS Cost and Usage Report (AWS CUR). You can track costs by cost center, business unit, or project using managed services. Using a showback strategy, you can consolidate and present costs to a business unit to show resource use over a set period of time for your entire AWS Organization. Building this solution with managed services allows you to spend time understanding your costs rather than maintaining the underlying infrastructure.

This solution highlights AWS Glue DataBrew to prepare your data in the appropriate format for dashboards. We recommend DataBrew because it provides a no-code environment for data transformation. It allows anyone in your Organization to create dashboards similar to those built in the Cloud Intelligence Dashboards Workshop.

Figure 1. QuickSight showback dashboard using CUR data transformed by Glue DataBrew and leveraging QuickSight insights

Tags for cost allocation

The success of your showback dashboard partially depends on your cost allocation tagging strategy. Typically, customers use business tags such as cost center, business unit, or project to associate AWS costs with traditional financial reporting dimensions within their organization.

The CUR supports the ability to break down AWS costs by tag. For example, if a group of resources is labeled with the same tag, you’ll be able to see the total cost and usage of that group. Read more about Tagging Best Practices to develop a tagging strategy for your organization.

A serverless data workflow for showback dashboards

You can build showback dashboards with managed services such as Amazon QuickSight, without the need to write any code or manage any servers.

Figure 2. A serverless architecture representing data workflow

AWS automatically delivers the data you need for showback dashboards through the CUR. Once this data arrives in an Amazon Simple Storage Service (Amazon S3) bucket, you can transform it without writing any code by using DataBrew. You can also automatically identify the data schema and catalog the data’s properties to run queries using Amazon Athena. Lastly, you can visualize the results by publishing and sharing dashboards with key stakeholders within your organization using Amazon QuickSight.

The key benefits of this approach are:

  • Automatic data delivery
  • No-code data transformation
  • Automatic cataloging and querying
  • Serverless data visualization

Let’s take a look at each in more detail.

Automatic data delivery

The CUR is the source for historical cost and usage data. The CUR provides the most comprehensive set of cost and usage data available and will include your defined cost allocation tags for your entire Organization. You configure CUR to deliver your billing data to an Amazon S3 bucket at the payer account level. This will consolidate data for all linked accounts. After delivery starts, Amazon updates the CUR files at least once a day.
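
If you haven’t configured the CUR yet, you can define it through the console or the Cost and Usage Report API. The following boto3 sketch shows one possible configuration; the report name, bucket, and prefix are placeholders, and Athena integration requires Parquet output with report overwriting.

```python
import boto3

# The Cost and Usage Report API is only available in us-east-1
cur = boto3.client("cur", region_name="us-east-1")

cur.put_report_definition(
    ReportDefinition={
        "ReportName": "showback-cur",          # placeholder report name
        "TimeUnit": "DAILY",
        "Format": "Parquet",
        "Compression": "Parquet",
        "AdditionalSchemaElements": ["RESOURCES"],
        "S3Bucket": "my-cur-bucket",           # placeholder bucket
        "S3Prefix": "cur/",
        "S3Region": "us-east-1",
        "AdditionalArtifacts": ["ATHENA"],
        "RefreshClosedReports": True,
        "ReportVersioning": "OVERWRITE_REPORT",
    }
)
```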

No-code data transformation

You can use DataBrew to transform the data in the Amazon S3 bucket, aggregating cost and usage according to your tagging strategy. DataBrew also summarizes your data for discovery. You can run transformations called “jobs” in DataBrew without writing any code, using over 250 built-in transforms. Figures 3 through 5 show several example recipe actions.

Figure 3. DataBrew recipe action: rename column

Figure 4. DataBrew recipe action: Create column from function

Figure 5. DataBrew recipe action: fill missing values

For a full list of columns available in CUR, review the CUR Data Dictionary. Following is a list of relevant columns for an executive summary showback dashboard:

  • bill_billing_period_start_date
  • line_item_usage_account_id
  • line_item_line_item_type
  • product_product_name
  • product_product_family
  • product_product_transfer_type
  • savings_plan_savings_plan_effective_cost
  • reservation_effective_cost
  • line_item_unblended_cost

Based on your data refresh and business requirements, DataBrew can run a job on a recurring basis (for example, every 12 hours). Jobs can run at a particular time of day, or as defined by a valid cron expression. This helps you automate your transformation workflows.
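
As a sketch, you could attach such a schedule to an existing DataBrew job with boto3; the job and schedule names below are placeholders, and the cron expression runs the job every 12 hours.

```python
import boto3

databrew = boto3.client("databrew")

# Run the transformation job every 12 hours (job and schedule names are placeholders)
databrew.create_schedule(
    Name="cur-transform-every-12-hours",
    JobNames=["cur-showback-transform"],
    CronExpression="cron(0 0/12 * * ? *)",
)
```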

Automatic cataloging and querying

You can use an AWS Glue crawler to automatically classify your data and determine its format, schema, and associated properties. The crawler writes metadata to the AWS Glue Data Catalog to help data users find the data they need.
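
For example, the following boto3 sketch creates and starts a crawler over the transformed data; the crawler name, IAM role, database, and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Crawl the transformed CUR output and register it in the Glue Data Catalog
glue.create_crawler(
    Name="showback-cur-crawler",                             # placeholder name
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",   # placeholder role
    DatabaseName="showback",                                 # placeholder database
    Targets={"S3Targets": [{"Path": "s3://my-cur-bucket/transformed/"}]},  # placeholder path
)

glue.start_crawler(Name="showback-cur-crawler")
```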

With the results in Amazon S3 and the metadata in the Glue Data Catalog, you can run standard SQL queries with Athena. Queries run directly against the data in Amazon S3, without having to move it, which helps you make more informed business decisions by tracking financial metrics and optimizing costs. Using standard SQL, you can create views that aggregate cost and usage by your defined tags.
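
A query like the following illustrates the idea, submitted here through the Athena API with boto3. The database, table, and tag column names are assumptions; for example, a cost_center cost allocation tag typically surfaces in the CUR as a column such as resource_tags_user_cost_center, depending on your setup.

```python
import boto3

athena = boto3.client("athena")

# Aggregate unblended cost by billing period and cost center tag (names are assumptions)
query = """
SELECT
    bill_billing_period_start_date,
    resource_tags_user_cost_center,
    SUM(line_item_unblended_cost) AS total_cost
FROM showback.cur_table
GROUP BY 1, 2
ORDER BY total_cost DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "showback"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
```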

Serverless data visualization

You can use Amazon QuickSight to create and share dashboards with your teams for cost visibility. QuickSight provides native integration with Athena and S3, and lets you easily create and publish interactive BI dashboards that include ML-powered insights. When building a showback dashboard such as the example in Figure 1, QuickSight authors create visuals and publish interactive dashboards.

Readers log in using your preferred authentication mechanism to view the shared dashboard. They can then filter data by billing period, account number, or cost allocation tags, and drill down to details using a web browser or mobile app.

Conclusion

In this blog, we’ve discussed designing and building a data transformation process and a showback dashboard. This gives you highly granular cost visualization without having to provision and manage any servers. You can use managed services such as AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight to crawl, catalog, analyze, and visualize your data.

We recommend defining your organization’s tagging strategy so that you can view costs by tag. You can then get started by creating Cost and Usage Reports. With the data in Amazon S3, you can use the services described in this post to transform the data in a way that works for your business. Additionally, you can experiment with the Cloud Intelligence Dashboards Workshop. This workshop provides examples of visualizations that you can build using native AWS services on top of your Cost and Usage Report. You will be able to get cost, usage, and operational insights about your AWS Cloud usage.