Visualize data quality scores and metrics generated by AWS Glue Data Quality

Post Syndicated from Zack Zhou original https://aws.amazon.com/blogs/big-data/visualize-data-quality-scores-and-metrics-generated-by-aws-glue-data-quality/

AWS Glue Data Quality allows you to measure and monitor the quality of data in your data repositories. It’s important for business users to be able to see quality scores and metrics to make confident business decisions and debug data quality issues. AWS Glue Data Quality generates a substantial amount of operational runtime information during the evaluation of rulesets.

An operational scorecard is a mechanism used to evaluate and measure the quality of data processed and validated by AWS Glue Data Quality rulesets. It provides insights and metrics related to the performance and effectiveness of data quality processes.

In this post, we highlight the seamless integration of Amazon Athena and Amazon QuickSight, which enables the visualization of operational metrics for AWS Glue Data Quality rule evaluation in an efficient and effective manner.

This post is Part 5 of a five-post series to explain how to build dashboards to measure and monitor your data quality:

Solution overview

The solution allows you to build your AWS Glue Data Quality score and metrics dashboard using QuickSight in an easy and straightforward manner. The following architecture diagram shows an overview of the complete pipeline.

These are six main steps in the data pipeline:

  1. Amazon EventBridge triggers an AWS Lambda function when the event pattern for AWS Glue Data Quality matches the defined rule. (Refer to Set up alerts and orchestrate data quality rules with AWS Glue Data Quality)
  2. The Lambda function writes the AWS Glue Data Quality result to an Amazon Simple Storage Service (Amazon S3) bucket.
  3. An AWS Glue crawler crawls the results.
  4. The crawler builds a Data Catalog, so the data can be queried using Athena.
  5. We can analyze the data quality score and metrics using Athena SQL queries.
  6. We can query and submit the Athena data to QuickSight to create visuals for the dashboard.

In the following sections, we discuss these steps in more detail.

Prerequisites

To follow along with this post, complete the following prerequisites:

  1. Have an AWS Identity and Access Management (IAM) role with permissions to extract data from an S3 bucket and write to the AWS Glue Data Catalog.
  2. Similarly, have a Lambda function execution role with access to AWS Glue and  S3 buckets.
  3. Set up the Athena query result location. For more information, refer to Working with Query Results, Output Files, and Query History.
  4. Set up QuickSight permissions and enable Athena table and S3 bucket access.

Set up and deploy the Lambda pipeline

To test the solution, we can use the following AWS CloudFormation template. The CloudFormation template creates the EventBridge rule, Lambda function, and S3 bucket to store the data quality results.

If you deployed the CloudFormation template in the previous post, you don’t need to deploy it again in this step.

The following screenshot shows a line of code in which the Lambda function writes the results from AWS Glue Data Quality to an S3 bucket. As depicted, the data will be stored in JSON format and organized according to the time horizon, facilitating convenient access and analysis of the data over time.

Set up the AWS Glue Data Catalog using a crawler

Complete the following steps to create an AWS Glue crawler and set up the Data Catalog:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. For Name, enter data-quality-result-crawler, then choose Next.
  4. Under Data sources, choose Add a data source.
  5. For Data source, choose S3.
  6. For S3 path, enter the S3 path to your data source. (s3://<AWS CloudFormation outputs key:DataQualityS3BucketNameOutputs>/gluedataqualitylogs/). Refer to Set up alerts and orchestrate data quality rules with AWS Glue Data Quality for details.
  7. Choose Add an S3 data source and choose Next.
  8. For Existing IAM role, choose your IAM role (GlueDataQualityLaunchBlogDemoRole-xxxx). Refer to Set up alerts and orchestrate data quality rules with AWS Glue Data Quality for details. Then choose Next.
  9. For Target database, choose Add database.
  10. For Database name, enter data-quality-result-database, then choose Create.
  11. For Table name prefix, enter dq_, then choose Next.
  12. Choose Create crawler.
  13. On the Crawlers page, select data-quality-result-crawler and choose Run.

When the crawler is complete, you can see the AWS Glue Data Catalog table definition.

After you create the table definition on the AWS Glue Data Catalog, you can use Athena to query the Data Catalog table.

Query the Data Catalog table using Athena

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 and the AWS Glue Data Catalog using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run on datasets at petabyte scale.

The purpose of this step is to understand our data quality statistics at the table level as well as at the ruleset level. Athena provides simple queries to assist you with this task. Use the queries in this section to analyze your data quality metrics and create an Athena view to use to build a QuickSight dashboard in the next step.

Query 1

The following is a simple SELECT query on the Data Catalog table:

SELECT * FROM "data-quality-result-database"."dq_gluedataqualitylogs" limit 10;

The following screenshot shows the output.

Before we run the second query, let’s check the schema for the table dq_gluedataqualitylogs.

The following screenshot shows the output.

The table shows that one of the columns, resultrun, is the array data type. In order to work with this column in QuickSight, we need to perform an additional step to transform it into multiple strings. This is necessary because QuickSight doesn’t support the array data type.

Query 2

Use the following query to review the data in the resultrun column:

SELECT resultrun FROM "data-quality-result-database"."dq_gluedataqualitylogs" limit 10;

The following screenshot shows the output.

Query 3

The following query flattens an array into multiple rows using CROSS JOIN in conjunction with the unnest operator and creates a view on the selected columns:

CREATE OR REPLACE VIEW data_quality_result_view AS
SELECT "databasename","tablename", 
"ruleset_name","runid", "resultid", 
"state", "numrulessucceeded", 
"numrulesfailed", "numrulesskipped", 
"score", "year","month",
"day",runs.name,runs.result,
runs.evaluationmessage,runs.Description
FROM "dq_gluedataqualitylogs"
CROSS JOIN unnest(resultrun) AS t(runs)

The following screenshot shows the output.

Verify the columns that were created using the unnest operator.

The following screenshot shows the output.

Query 4

Verify the Athena view created in the previous query:

SELECT * FROM data_quality_result_view LIMIT 10

The following screenshot shows the output.

Visualize the data with QuickSight

Now that you can query your data in Athena, you can use QuickSight to visualize the results. Complete the following steps:

  1. Sign in to the QuickSight console.
  2. In the upper right corner of the console, choose Admin/username, then choose Manage QuickSight.
  3. Choose Security and permissions.
  4. Under QuickSight access to AWS services, choose Add or remove.
  5. Choose Amazon Athena, then choose Next.
  6. Give QuickSight access to the S3 bucket where your data quality result is stored.

Create your datasets

Before you can analyze and visualize the data in QuickSight, you must create datasets for your Athena view (data_quality_result_view). Complete the following steps:

  1. On the Datasets page, choose New dataset, then choose Athena.
  2. Choose the AWS Glue database that you created earlier.
  3. Select Import to SPICE (alternatively, you can select Directly query your data).
  4. Choose Visualize.

Build your dashboard

Create your analysis with one donut chart, one pivot table, one vertical stacked bar, and one funnel chart that use the different fields in the dataset. QuickSight offers a wide range of charts and visuals to help you create your dashboard. For more information, refer to Visual types in Amazon QuickSight.

Clean up

To avoid incurring future charges, delete the resources created in this post.

Conclusion

In this post, we provide insights into running Athena queries and building customized dashboards in QuickSight to understand data quality metrics. This gives you a great starting point for using this solution with your datasets and applying business rules to build a complete data quality framework to monitor issues within your datasets.

To dive into the AWS Glue Data Quality APIs, refer to Data Quality API. To learn more about AWS Glue Data Quality, see the AWS Glue Data Quality Developer Guide. To learn more about QuickSight dashboards, refer to the Amazon QuickSight Developer Guide.


About the authors

Zack Zhou is a Software Development Engineer on the AWS Glue team.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architecture on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Avik Bhattacharjee is a Senior Partner Solutions Architect at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible, focusing on big data and analytics and AI/ML.

Amit Kumar Panda is a Data Architect at AWS Professional Services who is passionate about helping customers build scalable data analytics solutions to enable making critical business decisions.