All posts by Deenbandhu Prasad

Getting started with AWS Glue Data Quality for ETL Pipelines

Post Syndicated from Deenbandhu Prasad original https://aws.amazon.com/blogs/big-data/getting-started-with-aws-glue-data-quality-for-etl-pipelines/

Today, hundreds of thousands of customers use data lakes for analytics and machine learning. However, data engineers have to cleanse and prepare this data before it can be used. The underlying data must be accurate and recent for customers to make confident business decisions; otherwise, data consumers lose trust in the data and make suboptimal or incorrect decisions. Evaluating whether data is accurate and recent is a routine task for data engineers, and although various data quality tools exist, most of them require manual processes to monitor data quality.

AWS Glue Data Quality is a preview feature of AWS Glue that measures and monitors the data quality of Amazon Simple Storage Service (Amazon S3) data lakes and of AWS Glue extract, transform, and load (ETL) jobs. This is an open preview feature, so it is already enabled in your account in the available Regions. You can define and measure data quality checks in the AWS Glue Studio console without writing code, which simplifies the experience of managing data quality.

This post is Part 2 of a four-post series explaining how AWS Glue Data Quality works. Check out the previous post in this series:

Getting started with AWS Glue Data Quality

In this post, we show how to create an AWS Glue job that measures and monitors the data quality of a data pipeline. We also show how to take action based on the data quality results.

Solution overview

Let’s consider an example use case in which a data engineer needs to build a data pipeline to ingest the data from a raw zone to a curated zone in a data lake. As a data engineer, one of your key responsibilities—along with extracting, transforming, and loading data—is validating the quality of data. Identifying data quality issues upfront helps you prevent placing bad data in the curated zone and avoid arduous data corruption incidents.

In this post, you’ll learn how to easily set up built-in and custom data validation checks in your AWS Glue job to prevent bad data from corrupting the downstream high-quality data.

The dataset used for this post is synthetically generated; the following screenshot shows an example of the data.

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An Amazon Simple Storage Service (Amazon S3) bucket (gluedataqualitystudio-*).
  • The following prefixes and objects in the S3 bucket:
    • datalake/raw/customer/customer.csv
    • datalake/curated/customer/
    • scripts/
    • sparkHistoryLogs/
    • temporary/
  • AWS Identity and Access Management (IAM) users, roles, and policies. The IAM role (GlueDataQualityStudio-*) has permission to read and write from the S3 bucket.
  • AWS Lambda functions and IAM policies required by those functions to create and delete this stack.

To create your resources, complete the following steps:

  1. Sign in to the AWS CloudFormation console in the us-east-1 Region.
  2. Choose Launch Stack:

  3. Select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack and wait for the stack creation step to complete.

Implement the solution

To start configuring your solution, complete the following steps:

  1. On the AWS Glue Studio console, choose Jobs in the navigation pane.
  2. Select Visual with a blank canvas and choose Create.
  3. Choose the Job Details tab to configure the job.
  4. For Name, enter GlueDataQualityStudio.
  5. For IAM Role, choose the role starting with GlueDataQualityStudio-*.
  6. For Glue version, choose Glue 3.0.
  7. For Job bookmark, choose Disable. This allows you to run this job multiple times with the same input dataset.
  8. For Number of retries, enter 0.
  9. In the Advanced properties section, provide the S3 bucket created by the CloudFormation template (starting with gluedataqualitystudio-*).
  10. Choose Save.
  11. After the job is saved, choose the Visual tab and on the Source menu, choose Amazon S3.
  12. On the Data source properties – S3 tab, for S3 source type, select S3 location.
  13. Choose Browse S3 and navigate to prefix /datalake/raw/customer/ in the S3 bucket starting with gluedataqualitystudio-* .
  14. Choose Infer schema.
  15. On the Action menu, choose Evaluate Data Quality.
  16. Choose the Evaluate Data Quality node.

    On the Transform tab, you can now start building data quality rules. The first rule you create is to check if Customer_ID is unique and not null using the isPrimaryKey rule.
  17. On the Rule types tab of the DQDL rule builder, search for isprimarykey and choose the plus sign.
  18. On the Schema tab of the DQDL rule builder, choose the plus sign next to Customer_ID.
  19. In the rule editor, delete id.

    The next rule we add checks that the First_Name column value is present for all the rows.
  20. You can also enter data quality rules directly in the rule editor. After the first rule, add a comma (,) and then enter IsComplete "First_Name".

    Next, you add a custom rule to validate that no row exists without Telephone or Email.
  21. Enter the following custom rule in the rule editor:
    CustomSql "select count(*) from primary where Telephone is null and Email is null" = 0


    The Evaluate Data Quality feature provides actions to manage the outcome of a job based on the data quality results.

  22. For this post, select Fail job when data quality fails and choose the Fail job without loading target data action. In the Data quality output setting section, choose Browse S3 and navigate to the dqresults prefix in the S3 bucket starting with gluedataqualitystudio-*.
  23. On the Target menu, choose Amazon S3.
  24. Choose the Data target – S3 bucket node.
  25. On the Data target properties – S3 tab, for Format, choose Parquet, and for Compression Type, choose Snappy.
  26. For S3 Target Location, choose Browse S3 and navigate to the prefix /datalake/curated/customer/ in the S3 bucket starting with gluedataqualitystudio-*.
  27. Choose Save, then choose Run.
    You can view the job run details on the Runs tab. In our example, the job fails with the error message “AssertionError: The job failed due to failing DQ rules for node: <node>.”
    You can review the data quality result on the Data quality tab. In our example, the custom data quality validation failed because one of the rows in the dataset had no Telephone or Email value. The Evaluate Data Quality results are also written to the S3 bucket in JSON format, based on the data quality result location parameter of the node.
  28. Navigate to the dqresults prefix in the S3 bucket starting with gluedataqualitystudio-*. You will see that the data quality results are partitioned by date.
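The steps above build the job visually in AWS Glue Studio. If you prefer script mode, the following is a minimal sketch of what a comparable PySpark job might look like. The bucket name is a placeholder, and the EvaluateDataQuality transform signature and publishing option names are based on the code Glue Studio generates and may differ in your environment.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsgluedq.transforms import EvaluateDataQuality  # available in Glue jobs with Data Quality enabled
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Placeholder bucket name -- substitute the gluedataqualitystudio-* bucket from the stack.
bucket = "gluedataqualitystudio-EXAMPLE"

# Source: raw customer CSV files.
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [f"s3://{bucket}/datalake/raw/customer/"]},
    format="csv",
    format_options={"withHeader": True},
)

# The same three DQDL rules built in the rule editor.
ruleset = """
Rules = [
    IsPrimaryKey "Customer_ID",
    IsComplete "First_Name",
    CustomSql "select count(*) from primary where Telephone is null and Email is null" = 0
]
"""

# Evaluate the rules; the publishing options mirror the node settings in Glue Studio.
# Glue Studio also generates logic that fails the job when rules fail, omitted here for brevity.
EvaluateDataQuality.apply(
    frame=source,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "EvaluateDataQuality_node",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
        "resultsS3Prefix": f"s3://{bucket}/dqresults/",
    },
)

# Target: curated zone as Snappy-compressed Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": f"s3://{bucket}/datalake/curated/customer/"},
    format="parquet",
    format_options={"compression": "snappy"},
)

job.commit()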

The data quality results are written as JSON files; you can use this output to build custom data quality visualization dashboards.
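To inspect these result files programmatically before wiring up a dashboard, a minimal sketch like the following works. It only lists and prints the JSON objects, because the exact field layout of the result files should be confirmed from a sample file first; the bucket name is a placeholder.

import json

import boto3

s3 = boto3.client("s3")
bucket = "gluedataqualitystudio-EXAMPLE"  # placeholder: your gluedataqualitystudio-* bucket

# The Evaluate Data Quality node writes JSON result files under the configured
# prefix, partitioned by date.
response = s3.list_objects_v2(Bucket=bucket, Prefix="dqresults/")
for obj in response.get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
    result = json.loads(body)
    print(obj["Key"])
    print(json.dumps(result, indent=2))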

You can also monitor the Evaluate Data Quality node through Amazon CloudWatch metrics and set alarms to send notifications about data quality results. To learn more about setting up CloudWatch alarms, refer to Using Amazon CloudWatch alarms.
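As a sketch of that setup, the following creates an alarm with the AWS SDK for Python. The namespace and metric name shown are assumptions; confirm the exact names your job emits (visible in the CloudWatch console after a run) and substitute your own SNS topic ARN.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumptions: replace with the namespace/metric your Evaluate Data Quality node
# actually emits, and with a real SNS topic ARN for notifications.
namespace = "Glue Data Quality"
metric_name = "glue.data.quality.rulesfailed"
sns_topic_arn = "arn:aws:sns:us-east-1:123456789012:data-quality-alerts"

cloudwatch.put_metric_alarm(
    AlarmName="GlueDataQualityStudio-rules-failed",
    Namespace=namespace,
    MetricName=metric_name,
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[sns_topic_arn],
)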

Clean up

To avoid incurring future charges and to clean up unused roles and policies, delete the resources you created:

  1. Delete the GlueDataQualityStudio job you created as part of this post.
  2. On the AWS CloudFormation console, delete the GlueDataQualityStudio stack.

Conclusion

AWS Glue Data Quality offers an easy way to measure and monitor the data quality of your ETL pipeline. In this post, you learned how to take necessary actions based on the data quality results, which helps you maintain high data standards and make confident business decisions.

To learn more, check out the AWS Glue Data Quality documentation.


About the Authors

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architecture on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Yannis Mentekidis is a Senior Software Development Engineer on the AWS Glue team.

Effective data lakes using AWS Lake Formation, Part 4: Implementing cell-level and row-level security

Post Syndicated from Deenbandhu Prasad original https://aws.amazon.com/blogs/big-data/part-4-effective-data-lakes-using-aws-lake-formation-part-4-implementing-cell-level-and-row-level-security/

We announced the preview of AWS Lake Formation transactions, cell-level and row-level security, and acceleration at AWS re:Invent 2020. In Parts 1, 2, and 3 of this series, we explained how to set up governed tables, add streaming and batch data to them, and use ACID transactions. In this post, we focus on cell-level and row-level security and show you how to enforce business needs by restricting access to specific rows and columns.

Effective data lakes using AWS Lake Formation

The goal of modern data lakes is to democratize access to broad datasets to empower data analysts and business users. In these scenarios, data lake security becomes more important than ever. Enterprises want to share their data across groups, departments, and organizations while balancing their compliance and security needs. A common paradigm is to restrict data access based on the user's profile or the organization to which they belong. Previously, you had to enforce this by duplicating the original data or creating materialized and non-materialized views based on filtered datasets. However, these solutions break the concept of a single source of truth and result in write amplification, which doubles or triples storage. The large number of copies also increases management complexity and effort.

Lake Formation supports simple row-level security and cell-level security:

  • Basic row-level security allows you to specify filter expressions that limit access to specific rows of a table to a user.
  • Cell-level security builds on row-level security by allowing you to hide or show specific columns along with providing access to specific rows.

In this post, we consider an example use case in which Company A does business across the United States and Canada. They have built an enterprise customer data hub on Amazon Simple Storage Service (Amazon S3), which sources customers from both countries into this central location for company-wide analytics and marketing.

The marketing department of Company A is responsible for organizing promotion campaigns and developing communications content to promote services and products to prospects. The team consumes the data from this enterprise customer data hub to create targeted campaigns.

The marketing team data analysts are divided by country, and the requirement is to restrict analyst access to customer data from their country—analysts from the United States can see only customers from the United States, and analysts from Canada can only access customers from Canada. Additionally, analysts from Canada aren’t allowed to see the date of birth (DoB column) due to local company policy (this restriction is an example of cell-level security).

The dataset used for this post is synthetically generated; the following screenshot shows an example of the data.

Solution overview

Let’s see how you can use the Lake Formation row-level security feature to enforce Company A’s requirements on the data in the S3 data lake. You can apply row-level security to a governed table or to a standard table in Lake Formation. In this post, you apply row-level security on a standard Lake Formation table; you can follow a similar process for a governed table.

We walk through the following high-level steps:

  1. Create a database (lf_rls_blog) and table (customer).
  2. Grant select (row and column) permissions to the users lf-rls-blog-analyst-us and lf-rls-blog-analyst-ca.
  3. Run queries in Amazon Athena as the US and Canada analysts to verify that you only see rows from the user’s respective country (and the appropriate columns).

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • AWS Identity and Access Management (IAM) users, roles, and policies:
    • The three users include lf-rls-blog-manager (data lake administrator), lf-rls-blog-analyst-us (US data analyst), and lf-rls-blog-analyst-ca (Canada data analyst)
  • Lake Formation data lake settings and resources:
    • This includes an S3 bucket with the prefix lf-rowlevel-security-blog-* and the customer data files used in this post

As of this writing, these Lake Formation preview features are available only in us-east-1 and us-west-2. When following the steps in this post, use the us-east-1 Region. Check for the availability of these features in additional Regions over time.

To create your resources, complete the following steps:

  1. Sign in to the AWS CloudFormation console in the us-east-1 Region.
  2. Choose Launch Stack:
  3. Create passwords for the three users.
  4. Review the details on the page and select I acknowledge that AWS CloudFormation might create IAM resources.
  5. Choose Create stack.

Create a database and table

To create your database and table, complete the following steps:

  1. Sign in to the AWS Management Console as the data lake administrator (lf-rls-blog-manager).
  2. On the Lake Formation console, choose Databases in the navigation pane.
  3. Choose Create database.
  4. For Name, enter lf_rls_blog.
  5. If Use only IAM access control for new tables in this database is selected, uncheck it.
  6. Choose Create database.

Next, you create a new data lake table.

  1. In the navigation pane, choose Tables.
  2. Choose Create table.
  3. For Name, enter customer.
  4. For Database, choose the database you just created (lf_rls_blog).
  5. Browse to the customers folder under datafiles in the S3 bucket (starting with lf-rowlevel-security-blog-*) created by the CloudFormation template.
  6. For Classification, select CSV.
  7. For Delimiter, choose Comma:
  8. Choose Upload Schema.
  9. Enter the following JSON code:
     [
       {"Name": "customer_id", "Type": "bigint"},
       {"Name": "prefix", "Type": "string"},
       {"Name": "first_Name", "Type": "string"},
       {"Name": "middle_Name", "Type": "string"},
       {"Name": "last_Name", "Type": "string"},
       {"Name": "suffix", "Type": "string"},
       {"Name": "gender", "Type": "string"},
       {"Name": "dob", "Type": "string"},
       {"Name": "phone", "Type": "string"},
       {"Name": "building_number", "Type": "bigint"},
       {"Name": "street_Name", "Type": "string"},
       {"Name": "city", "Type": "string"},
       {"Name": "cust_zip", "Type": "bigint"},
       {"Name": "country", "Type": "string"}
     ]

  10. Choose Upload.
  11. Choose Submit.
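If you would rather script this table definition than use the console, the following is a rough equivalent using the AWS Glue create_table API. The bucket name is a placeholder, and the caller needs the Lake Formation CREATE_TABLE permission on the lf_rls_blog database.

import boto3

glue = boto3.client("glue")

# Placeholder: substitute the lf-rowlevel-security-blog-* bucket from the stack.
location = "s3://lf-rowlevel-security-blog-EXAMPLE/datafiles/customers/"

columns = [
    {"Name": "customer_id", "Type": "bigint"},
    {"Name": "prefix", "Type": "string"},
    {"Name": "first_Name", "Type": "string"},
    {"Name": "middle_Name", "Type": "string"},
    {"Name": "last_Name", "Type": "string"},
    {"Name": "suffix", "Type": "string"},
    {"Name": "gender", "Type": "string"},
    {"Name": "dob", "Type": "string"},
    {"Name": "phone", "Type": "string"},
    {"Name": "building_number", "Type": "bigint"},
    {"Name": "street_Name", "Type": "string"},
    {"Name": "city", "Type": "string"},
    {"Name": "cust_zip", "Type": "bigint"},
    {"Name": "country", "Type": "string"},
]

# Standard Hive CSV serde settings for a comma-delimited table.
glue.create_table(
    DatabaseName="lf_rls_blog",
    TableInput={
        "Name": "customer",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": columns,
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)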

Create data filters and grant permissions

To implement column-level, row-level, and cell-level security, first you create data filters. Then you choose that data filter while granting the select Lake Formation permission on tables. For this use case, you create two data filters: one for the US data analyst and one for the Canada data analyst.

Permitted filter expressions are predicates that obey a subset of the WHERE clause grammar in PartiQL. You can use comparison operators to compare columns with constants. The following are the supported operators:

  • Comparison operators – =, >, <, >=, <=, <>, BETWEEN, IN, LIKE
  • Logical operators – AND, OR
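For example, against the customer table used in this post, an expression such as country='US' AND cust_zip >= 90000 combines a comparison with a logical operator to return only US customers above a given ZIP code. The filters created in the following steps use the simplest form, a single equality comparison on the country column.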

Let’s first create the data filter for the US analyst.

  1. On the Lake Formation console, choose Data filters in the navigation pane.
  2. Choose Create new filter.
  3. For Data filter name, enter US Filter.
  4. For Target database, choose the lf_rls_blog database.
  5. For Target table, choose the customer table.
  6. For Column-level access, select Access to all columns.
  7. For Row filter expression, enter country='US'.
  8. Choose Create filter.

The US analyst has access to all the columns of US customers only.

Now let’s create a data filter for the Canada analyst.

  1. On the Data filters page, choose Create new filter.
  2. For Data filter name, enter Canada Filter.
  3. For Target database, choose the lf_rls_blog database.
  4. For Target table, choose the customer table.
  5. For Column-level access, select Exclude columns.
  6. For Select columns, choose the dob column.
  7. For Row filter expression, enter country='Canada'.
  8. Choose Create filter.

The Canada analyst now has access to all the columns except dob (date of birth) of Canadian customers only.
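If you want to script the same two filters instead of using the console, a sketch with the AWS SDK for Python looks like the following. The data cells filter API shown reflects the generally available Lake Formation SDK and may differ from what was exposed during the preview.

import boto3

lakeformation = boto3.client("lakeformation")
catalog_id = boto3.client("sts").get_caller_identity()["Account"]

# US Filter: all columns, only rows where country='US'.
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": catalog_id,
        "DatabaseName": "lf_rls_blog",
        "TableName": "customer",
        "Name": "US Filter",
        "RowFilter": {"FilterExpression": "country='US'"},
        "ColumnWildcard": {"ExcludedColumnNames": []},  # access to all columns
    }
)

# Canada Filter: exclude the dob column, only rows where country='Canada'.
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": catalog_id,
        "DatabaseName": "lf_rls_blog",
        "TableName": "customer",
        "Name": "Canada Filter",
        "RowFilter": {"FilterExpression": "country='Canada'"},
        "ColumnWildcard": {"ExcludedColumnNames": ["dob"]},
    }
)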

Verify both data filters are created by checking the Data filters page.

Now we can grant table and column permissions.

  1. On the Tables page, select the customer table.
  2. On the Actions menu, choose Grant.
  3. For IAM users and roles, choose lf-rls-blog-analyst-us.
  4. Choose Named data catalog resources.
  5. For Databases, choose lf_rls_blog.
  6. For Tables, choose customer.
  7. For Table and column permissions, choose Select.
  8. Under Data permissions, select Advanced cell-level filters.
  9. Select US Filter.
  10. Choose Grant.
  11. Repeat these steps for the lf-rls-blog-analyst-ca user, choosing the lf_rls_blog database and customer table and granting Select permissions.
  12. Select Advanced cell-level filters.
  13. In the Data permissions section, select Canada Filter.
  14. Choose Grant.
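The equivalent grants can also be scripted. The sketch below grants SELECT through each data cells filter using the AWS SDK for Python, again based on the generally available API rather than the preview console flow.

import boto3

lakeformation = boto3.client("lakeformation")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Grant SELECT on the customer table to each analyst through their country filter.
for user, filter_name in [
    ("lf-rls-blog-analyst-us", "US Filter"),
    ("lf-rls-blog-analyst-ca", "Canada Filter"),
]:
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:user/{user}"},
        Resource={
            "DataCellsFilter": {
                "TableCatalogId": account_id,
                "DatabaseName": "lf_rls_blog",
                "TableName": "customer",
                "Name": filter_name,
            }
        },
        Permissions=["SELECT"],
    )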

Run queries to test permission levels

To utilize Lake Formation preview features in Athena, you need to create a new workgroup named AmazonAthenaLakeFormationPreview and switch to that workgroup before running queries. For more information, see Managing Workgroups. Additionally, for preview you use the lakeformation qualifier for the database and table name, as shown in the following example:

select * from lakeformation.<databasename>.<tablename>

Lake Formation implicitly grants all permissions to the table creator. In this use case, lf-rls-blog-manager has SELECT permissions on all rows and columns of the customer table. Let’s first verify permissions for lf-rls-blog-manager by querying the customer table using Athena.

  1. On the Athena console (in Region us-east-1), open the query editor.
  2. Choose set up a query result location in Amazon S3.
  3. Navigate to the S3 bucket starting with lf-rowlevel-security-blog-* and select the folder anthenaqueryresults.
  4. Choose Save.
  5. In the query editor, for Data source, choose AWSDataCatalog.
  6. For Database, choose lf_rls_blog.
  7. Create and switch to the AmazonAthenaLakeFormationPreview workgroup.

You can see the customer table under Tables.

  1. Enter the following query:
    SELECT * FROM lakeformation."lf_rls_blog"."customer"

  2. Choose Run query.

Rows from both countries are displayed to the lf-rls-blog-manager user.

Next, let's verify the permissions of the lf-rls-blog-analyst-us and lf-rls-blog-analyst-ca users on this table.

  1. Sign in to the console as lf-rls-blog-analyst-us.
  2. Repeat the previous steps on the Athena console (in us-east-1) to set up the query result location.
  3. Switch to the AmazonAthenaLakeFormationPreview workgroup.
  4. Run the following query:
    SELECT * FROM lakeformation."lf_rls_blog"."customer"

Only US customers are shown in the result for the US data analyst.

Now you verify the same for the Canada data analyst.

  1. Sign in to the console as lf-rls-blog-analyst-ca.
  2. Repeat the previous steps on the Athena console (in us-east-1) to set up the query result location.
  3. Switch to the AmazonAthenaLakeFormationPreview workgroup.
  4. Run the following query:
    SELECT * FROM lakeformation."lf_rls_blog"."customer"

Only customers from Canada are visible to the Canada data analyst. Furthermore, the lf-rls-blog-analyst-ca user doesn’t have access to the dob column.

Clean up

For the final step, clean up the resources you created:

  1. Sign in to the console as lf-rls-blog-manager.
  2. On the Lake Formation console, choose Databases in the navigation pane.
  3. Select the database lf_rls_blog.
  4. On the Action menu, choose Delete.
  5. Delete the anthenaqueryresults folder from the Amazon S3 bucket with the prefix lf-rowlevel-security-blog-*.
  6. Log in as the user that launched the CloudFormation stack in this post.
  7. On the AWS CloudFormation console, delete the stack lf-rowlevel-security-blog.

When you delete the stack, the resources it created are automatically deleted.

Conclusion

In this post, you learned how to implement fine-grained access control on a data lake table using the new row-level security feature of Lake Formation. This feature allows you to easily enforce privacy regulations or corporate governance data access rules on your data lake.

Learn more about Lake Formation and start using these features to build and secure your data lake on Amazon S3 today, and leave your thoughts and questions in the comments.


About the Authors

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data platforms on the AWS Cloud. Deenbandhu has helped customers of all sizes implement master data management, data warehouse, and data lake solutions.


Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue and AWS Lake Formation. His passion is for implementing software artifacts for building data lakes more effectively and easily. During his spare time, he loves to spend time with his family, especially hunting bugs—not software bugs, but bugs like butterflies, pill bugs, snails, and grasshoppers.