Tag Archives: AWS Lake Formation

Easily manage your data lake at scale using AWS Lake Formation Tag-based access control

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/easily-manage-your-data-lake-at-scale-using-tag-based-access-control-in-aws-lake-formation/

Thousands of customers are building petabyte-scale data lakes on AWS. Many of these customers use AWS Lake Formation to easily build and share their data lakes across the organization. As the number of tables and users increases, data stewards and administrators are looking for ways to manage permissions on data lakes easily at scale. Customers are struggling with “role explosion” and need to manage hundreds or even thousands of user permissions to control data access. For example, for an account with 1,000 resources and 100 principals, the data steward would have to create and manage up to 100,000 policy statements. Furthermore, as new principals and resources get added or deleted, these policies have to be updated to keep the permissions current.

Lake Formation Tag-based access control solves this problem by allowing data stewards to create LF-tags (based on their data classification and ontology) that can then be attached to resources. You can create policies on a smaller number of logical tags instead of specifying policies on named resources. LF-tags enable you to categorize and explore data based on taxonomies, which reduces policy complexity and scales permissions management. You can create and manage policies with tens of logical tags instead of thousands of resources. LF-tag-based access control decouples policy creation from resource creation, which helps data stewards manage permissions on a large number of databases, tables, and columns by removing the need to update policies every time a new resource is added to the data lake. Finally, LF-tag-based access control allows you to create policies even before the resources come into existence. All you have to do is tag the resource with the right LF-tags to ensure it is managed by existing policies.

This post focuses on managing permissions on data lakes at scale using LF-tags in Lake Formation. When it comes to managing data lake catalog tables from AWS Glue and administering permissions in Lake Formation, data stewards within the producing accounts have functional ownership based on the functions they support, and can grant access to various consumers, external organizations, and accounts. You can now define LF-tags; associate them at the database, table, or column level; and then share controlled access across analytic, machine learning (ML), and extract, transform, and load (ETL) services for consumption. LF-tags ensure that governance can be scaled easily by replacing the policy definitions of thousands of resources with a small number of logical tags.

LF-tag-based access control has three main components:

  • Tag ontology and classification – Data stewards can define an LF-tag ontology based on data classification and grant access based on LF-tags to AWS Identity and Access Management (IAM) principals and SAML principals or groups
  • Tagging resources – Data engineers can easily create, automate, implement, and track all LF-tags and permissions against AWS Glue catalogs through the Lake Formation API (a minimal boto3 sketch follows this list)
  • Policy evaluation – Lake Formation evaluates the effective permissions based on LF-tags at query time and allows access to data through consuming services such as Amazon Athena, Amazon Redshift Spectrum, Amazon SageMaker Data Wrangler, and Amazon EMR Studio, based on the effective permissions granted across multiple accounts or organization-level data shares
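The walkthrough that follows drives these components through the Lake Formation console, but the same operations are available through the Lake Formation API. The following is a minimal boto3 sketch (not part of the original walkthrough) that creates an LF-tag and attaches it to a table; the tag key, values, and resource names mirror the ones used later in this post, and it assumes credentials with Lake Formation administrator permissions.

import boto3

lf = boto3.client("lakeformation")

# Define the tag ontology: an LF-tag key with its allowed values.
lf.create_lf_tag(TagKey="Confidential", TagValues=["True", "False"])

# Attach the LF-tag to an existing Data Catalog table.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "tag_database", "Name": "source_data"}},
    LFTags=[{"TagKey": "Confidential", "TagValues": ["True"]}],
)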

Solution overview

The following diagram illustrates the architecture of the solution described in this post.

In this post, we demonstrate how you can set up a Lake Formation table and create Lake Formation tag-based policies using a single account with multiple databases. We walk you through the following high-level steps:

  1. The data steward defines the tag ontology with two LF-tags: Confidential and Sensitive. Data with “Confidential = True” has tighter access controls. Data with “Sensitive = True” requires specific analysis from the analyst.
  2. The data steward assigns different permission levels to the data engineer to build tables with different LF-tags.
  3. The data engineer builds two databases: tag_database and col_tag_database. All tables in tag_database are configured with “Confidential = True”. All tables in the col_tag_database are configured with “Confidential = False”. Some columns of the table in col_tag_database are tagged with “Sensitive = True” for specific analysis needs.
  4. The data engineer grants read permission to the analyst for tables matching the LF-tag expressions “Confidential = True” and “Confidential = False, Sensitive = True”.
  5. With this configuration, the data analyst can focus on performing analysis with the right data.

Provision your resources

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template creates three different personas to perform this exercise and copies the nyc-taxi-data dataset to your local Amazon Simple Storage Service (Amazon S3) bucket.

To create these resources, complete the following steps:

  1. Sign in to the AWS CloudFormation console in the us-east-1 Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. In the User Configuration section, enter passwords for the three personas: DataStewardUserPassword, DataEngineerUserPassword, and DataAnalystUserPassword.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

The stack takes up to 5 minutes and creates all the required resources, including:

  • An S3 bucket
  • The appropriate Lake Formation settings
  • The appropriate Amazon Elastic Compute Cloud (Amazon EC2) resources
  • Three user personas with user ID credentials:
    • Data steward (administrator) – The lf-data-steward user has the following access:
      • Read access to all resources in the Data Catalog
      • Can create LF-tags and associate to the data engineer role for grantable permission to other principals
    • Data engineer – The lf-data-engineer user has the following access:
      • Full read, write, and update access to all resources in the Data Catalog
      • Data location permissions in the data lake
      • Can associate LF-tags with resources in the Data Catalog
      • Can attach LF-tags to resources, which provides access to principals based on any policies created by data stewards
    • Data analyst – The lf-data-analyst user has the following access:
      • Fine-grained access to resources shared by Lake Formation Tag-based access policies

Register your data location and create an LF-tag ontology

We perform this first step as the data steward user (lf-data-steward) to verify the data in Amazon S3 and the Data Catalog in Lake Formation.

  1. Sign in to the Lake Formation console as lf-data-steward with the password used while deploying the CloudFormation stack.
  2. In the navigation pane, under Permissions, choose Administrative roles and tasks.
  3. For IAM users and roles, choose the user lf-data-steward.
  4. Choose Save to add lf-data-steward as a Lake Formation admin.

    Next, we update the Data catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control.
  5. In the navigation pane, under Data catalog, choose Settings.
  6. Uncheck Use only IAM access control for new databases.
  7. Uncheck Use only IAM access control for new tables in new databases.
  8. Choose Save.

    Next, we need to register the data location for the data lake.
  9. In the navigation pane, under Register and ingest, choose Data lake locations.
  10. For Amazon S3 path, enter s3://lf-tagbased-demo-<<Account-ID>>.
  11. For IAM role, leave it as the default value AWSServiceRoleForLakeFormationDataAccess.
  12. Choose Register location.
    Next, we create the ontology by defining an LF-tag.
  13. Under Permissions in the navigation pane, under Administrative roles, choose LF-Tags.
  14. Choose Add LF-tags.
  15. For Key, enter Confidential.
  16. For Values, add True and False.
  17. Choose Add LF-tag.
  18. Repeat the steps to create the LF-tag Sensitive with the value True.
    You have created all the necessary LF-tags for this exercise. Next, we give specific IAM principals the ability to attach newly created LF-tags to resources.
  19. Under Permissions in the navigation pane, under Administrative roles, choose LF-tag permissions.
  20. Choose Grant.
  21. Select IAM users and roles.
  22. For IAM users and roles, search for and choose the lf-data-engineer role.
  23. In the LF-tag permission scope section, add the key Confidential with values True and False, and the key Sensitive with value True.
  24. Under Permissions, select Describe and Associate for LF-tag permissions and Grantable permissions.
  25. Choose Grant.

    Next, we grant permissions to lf-data-engineer to create databases in our catalog and on the underlying S3 bucket created by AWS CloudFormation.
  26. Under Permissions in the navigation pane, choose Administrative roles.
  27. In the Database creators section, choose Grant.
  28. For IAM users and roles, choose the lf-data-engineer role.
  29. For Catalog permissions, select Create database.
  30. Choose Grant.

    Next, we grant permissions on the S3 bucket (s3://lf-tagbased-demo-<<Account-ID>>) to the lf-data-engineer user.
  31. In the navigation pane, choose Data locations.
  32. Choose Grant.
  33. Select My account.
  34. For IAM users and roles, choose the lf-data-engineer role.
  35. For Storage locations, enter the S3 bucket created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>).
  36. Choose Grant.
    Next, we grant lf-data-engineer grantable permissions on resources associated with the LF-tag expression Confidential=True.
  37. In the navigation pane, choose Data permissions.
  38. Choose Grant.
  39. Select IAM users and roles.
  40. Choose the role lf-data-engineer.
  41. In the LF-tag or catalog resources section, select Resources matched by LF-tags.
  42. Choose Add LF-Tag.
  43. Add the key Confidential with the value True.
  44. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  45. In the Table and column permissions section, select Describe, Select, and Alter for both Table permissions and Grantable permissions.
  46. Choose Grant.
    Next, we grant lf-data-engineer grantable permissions on resources associated with the LF-tag expression Confidential=False.
  47. In the navigation pane, choose Data permissions.
  48. Choose Grant.
  49. Select IAM users and roles.
  50. Choose the role lf-data-engineer.
  51. Select Resources matched by LF-tags.
  52. Choose Add LF-tag.
  53. Add the key Confidential with the value False.
  54. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  55. In the Table and column permissions section, do not select anything.
  56. Choose Grant.
    Next, we grant lf-data-engineer grantable permissions on resources associated with the LF-tag expression Confidential=False and Sensitive=True.
  57. In the navigation pane, choose Data permissions.
  58. Choose Grant.
  59. Select IAM users and roles.
  60. Choose the role lf-data-engineer.
  61. Select Resources matched by LF-tags.
  62. Choose Add LF-tag.
  63. Add the key Confidential with the value False.
  64. Choose Add LF-tag.
  65. Add the key Sensitive with the value True.
  66. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  67. In the Table and column permissions section, select Describe, Select, and Alter for both Table permissions and Grantable permissions.
  68. Choose Grant.
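If you prefer to script these grants instead of using the console, the following boto3 sketch mirrors the first of the three grants you just configured (grantable table permissions on resources tagged Confidential=True). It is illustrative only; the account ID is a placeholder, and the call must be made by a principal allowed to grant permissions on the Confidential LF-tag.

import boto3

lf = boto3.client("lakeformation")
engineer_arn = "arn:aws:iam::111122223333:user/lf-data-engineer"  # placeholder account ID

# Grantable Describe, Select, and Alter on any table tagged Confidential=True.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": engineer_arn},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "Confidential", "TagValues": ["True"]}],
        }
    },
    Permissions=["DESCRIBE", "SELECT", "ALTER"],
    PermissionsWithGrantOption=["DESCRIBE", "SELECT", "ALTER"],
)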

Create the Lake Formation databases

Now, sign in as lf-data-engineer with the password used while deploying the CloudFormation stack. We create two databases and attach LF-tags to the databases and specific columns for testing purposes.

Create your database and table for database-level access

We first create the database tag_database, the table source_data, and attach appropriate LF-tags.

  1. On the Lake Formation console, choose Databases.
  2. Choose Create database.
  3. For Name, enter tag_database.
  4. For Location, enter the S3 location created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>/tag_database/).
  5. Deselect Use only IAM access control for new tables in this database.
  6. Choose Create database.

Next, we create a new table within tag_database.

  1. On the Databases page, select the database tag_database.
  2. Choose View tables, and then choose Create table.
  3. For Name, enter source_data.
  4. For Database, choose the database tag_database.
  5. For Data is located in, select Specified path in my account.
  6. For Include path, enter the path to tag_database created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>/tag_database/).
  7. For Data format, select CSV.
  8. Under Upload schema, enter the following schema JSON:
    [
      { "Name": "vendorid", "Type": "string" },
      { "Name": "lpep_pickup_datetime", "Type": "string" },
      { "Name": "lpep_dropoff_datetime", "Type": "string" },
      { "Name": "store_and_fwd_flag", "Type": "string" },
      { "Name": "ratecodeid", "Type": "string" },
      { "Name": "pulocationid", "Type": "string" },
      { "Name": "dolocationid", "Type": "string" },
      { "Name": "passenger_count", "Type": "string" },
      { "Name": "trip_distance", "Type": "string" },
      { "Name": "fare_amount", "Type": "string" },
      { "Name": "extra", "Type": "string" },
      { "Name": "mta_tax", "Type": "string" },
      { "Name": "tip_amount", "Type": "string" },
      { "Name": "tolls_amount", "Type": "string" },
      { "Name": "ehail_fee", "Type": "string" },
      { "Name": "improvement_surcharge", "Type": "string" },
      { "Name": "total_amount", "Type": "string" },
      { "Name": "payment_type", "Type": "string" }
    ]
    

  9. Choose Upload.

After uploading the schema, the table schema should look like the following screenshot.

  1. Choose Submit.

Now we’re ready to attach LF-tags at the database level.

  1. On the Databases page, find and select tag_database.
  2. On the Actions menu, choose Edit LF-tags.
  3. Choose Assign new LF-tag.
  4. For Assigned keys, choose the Confidential LF-tag you created earlier.
  5. For Values, choose True.
  6. Choose Save.

This completes the LF-tag assignment to the tag_database database.
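If you are automating the setup, the same database-level assignment can be made through the Lake Formation API. A minimal boto3 sketch, assuming credentials that hold the Associate permission on the Confidential LF-tag:

import boto3

lf = boto3.client("lakeformation")

# Attach Confidential=True at the database level; tables in tag_database
# inherit this LF-tag unless it is overridden on the table itself.
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "tag_database"}},
    LFTags=[{"TagKey": "Confidential", "TagValues": ["True"]}],
)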

Create your database and table for column-level access

Now we repeat these steps to create the database col_tag_database and table source_data_col_lvl, and attach LF-tags at the column level.

  1. On the Databases page, choose Create database.
  2. For Name, enter col_tag_database.
  3. For Location, enter the S3 location created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>/col_tag_database/).
  4. Deselect Use only IAM access control for new tables in this database.
  5. Choose Create database.
  6. On the Databases page, select your new database (col_tag_database).
  7. Choose View tables, and then choose Create table.
  8. For Name, enter source_data_col_lvl.
  9. For Database, choose your new database (col_tag_database).
  10. For Data is located in, select Specified path in my account.
  11. Enter the S3 path for col_tag_database (s3://lf-tagbased-demo-<<Account-ID>>/col_tag_database/).
  12. For Data format, select CSV.
  13. Under Upload schema, enter the following schema JSON:
    [
      { "Name": "vendorid", "Type": "string" },
      { "Name": "lpep_pickup_datetime", "Type": "string" },
      { "Name": "lpep_dropoff_datetime", "Type": "string" },
      { "Name": "store_and_fwd_flag", "Type": "string" },
      { "Name": "ratecodeid", "Type": "string" },
      { "Name": "pulocationid", "Type": "string" },
      { "Name": "dolocationid", "Type": "string" },
      { "Name": "passenger_count", "Type": "string" },
      { "Name": "trip_distance", "Type": "string" },
      { "Name": "fare_amount", "Type": "string" },
      { "Name": "extra", "Type": "string" },
      { "Name": "mta_tax", "Type": "string" },
      { "Name": "tip_amount", "Type": "string" },
      { "Name": "tolls_amount", "Type": "string" },
      { "Name": "ehail_fee", "Type": "string" },
      { "Name": "improvement_surcharge", "Type": "string" },
      { "Name": "total_amount", "Type": "string" },
      { "Name": "payment_type", "Type": "string" }
    ]
    

  14. Choose Upload.

After uploading the schema, the table schema should look like the following screenshot.

  1. Choose Submit to complete the creation of the table.

Now you associate the Sensitive=True LF-tag with the columns vendorid and fare_amount.

  1. On the Tables page, select the table you created (source_data_col_lvl).
  2. On the Actions menu, choose Edit Schema.
  3. Select the column vendorid and choose Edit LF-tags.
  4. For Assigned keys, choose Sensitive.
  5. For Values, choose True.
  6. Choose Save.

Repeat the steps to update the Sensitive LF-tag for the fare_amount column.

  1. Select the column fare_amount and choose Edit LF-tags.
  2. Add the Sensitive key with value True.
  3. Choose Save.
  4. Choose Save as new version to save the new schema version with the tagged columns. The following screenshot shows the column properties with the LF-tags updated.
    Next, we associate the Confidential=False LF-tag with col_tag_database. This is required for lf-data-analyst to be able to describe the database col_tag_database when logged in from Athena.
  5. On the Databases page, find and select col_tag_database.
  6. On the Actions menu, choose Edit LF-tags.
  7. Choose Assign new LF-tag.
  8. For Assigned keys, choose the Confidential LF-tag you created earlier.
  9. For Values, choose False.
  10. Choose Save.
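The same column-level and database-level assignments can also be made through the API. Here is a minimal boto3 sketch, again assuming credentials with the Associate permission on the Sensitive and Confidential LF-tags:

import boto3

lf = boto3.client("lakeformation")

# Column-level assignment: only vendorid and fare_amount get Sensitive=True.
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "col_tag_database",
            "Name": "source_data_col_lvl",
            "ColumnNames": ["vendorid", "fare_amount"],
        }
    },
    LFTags=[{"TagKey": "Sensitive", "TagValues": ["True"]}],
)

# Database-level assignment: col_tag_database gets Confidential=False.
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "col_tag_database"}},
    LFTags=[{"TagKey": "Confidential", "TagValues": ["False"]}],
)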

Grant table permissions

Now we grant permissions to the data analyst for consumption of the databases tag_database and col_tag_database.

  1. Sign in to the Lake Formation console as lf-data-engineer.
  2. On the Permissions page, choose Data permissions.
  3. Choose Grant.
  4. Under Principals, select IAM users and roles.
  5. For IAM users and roles, choose lf-data-analyst.
  6. Select Resources matched by LF-tags.
  7. Choose Add LF-tag.
  8. For Key, choose Confidential.
  9. For Values, choose True.
  10. For Database permissions, select Describe.
  11. For Table permissions, select Select and Describe.
  12. Choose Grant.

    This grants the lf-data-analyst user Describe permission on the database and Select permission on the tables associated with the LF-tag expression Confidential=True (the tag_database database). Next, we repeat the steps to grant permissions to the data analyst for the LF-tag expression Confidential=False. This expression is used to describe the col_tag_database database and the source_data_col_lvl table when logged in as lf-data-analyst from Athena, so we only grant Describe access through it.
  13. Sign in to the Lake Formation console as lf-data-engineer.
  14. On the Databases page, select the database col_tag_database.
  15. On the Actions menu, choose Grant.
  16. Under Principals, select IAM users and roles.
  17. For IAM users and roles, choose lf-data-analyst.
  18. Select Resources matched by LF-tags.
  19. Choose Add LF-tag.
  20. For Key, choose Confidential.
  21. For Values, choose False.
  22. For Database permissions, select Describe.
  23. For Table permissions, do not select anything.
  24. Choose Grant.

    Next, we repeat the steps to grant permissions to the data analyst for the LF-tag expression Confidential=False and Sensitive=True. This expression grants column-level access to the source_data_col_lvl table in col_tag_database when logged in as lf-data-analyst from Athena.
  25. Sign in to the Lake Formation console as lf-data-engineer.
  26. On the Databases page, select the database col_tag_database.
  27. On the Actions menu, choose Grant.
  28. Under Principals, select IAM users and roles.
  29. For IAM users and roles, choose lf-data-analyst.
  30. Select Resources matched by LF-tags.
  31. Choose Add LF-tag.
  32. For Key, choose Confidential.
  33. For Values, choose False.
  34. Choose Add LF-tag.
  35. For Key, choose Sensitive.
  36. For Values, choose True.
  37. For Database permissions, select Describe.
  38. For Table permissions, select Select and Describe.
  39. Choose Grant.

Run a query in Athena to verify the permissions

For this step, we sign in to the Athena console as lf-data-analyst and run SELECT queries against the two tables (source_data and source_data_col_lvl). We use our S3 path as the query result location (s3://lf-tagbased-demo-<<Account-ID>>/athena-results/).

  1. In the Athena query editor, choose tag_database in the left panel.
  2. Choose the additional menu options icon (three vertical dots) next to source_data and choose Preview table.
  3. Choose Run query.

The query should take a few minutes to run. The following screenshot shows our query results.

The first query displays all the columns in the output because the LF-tag is associated at the database level and the source_data table automatically inherited the LF-tag from the database tag_database.

  1. Run another query using col_tag_database and source_data_col_lvl.

The second query returns just the two tagged columns (vendorid and fare_amount), which match the LF-tag expression Confidential=False and Sensitive=True.

As a thought experiment, you can also check how the Lake Formation Tag-based access policies behave for columns on which the user doesn't have policy grants.

When an untagged column is selected from the table source_data_col_lvl, Athena returns an error. For example, you can run the following query to select the untagged column geolocationid:

SELECT geolocationid FROM "col_tag_database"."source_data_col_lvl" limit 10;
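If you'd rather run these verification queries from code than from the Athena console, a minimal boto3 sketch looks like the following. Run it with the lf-data-analyst credentials and replace <Account-ID> with your own account ID; Lake Formation still evaluates the LF-tag permissions at query time.

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString='SELECT * FROM "col_tag_database"."source_data_col_lvl" LIMIT 10',
    QueryExecutionContext={"Database": "col_tag_database"},
    ResultConfiguration={
        "OutputLocation": "s3://lf-tagbased-demo-<Account-ID>/athena-results/"
    },
)
print(response["QueryExecutionId"])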

Extend the solution to cross-account scenarios

You can extend this solution to share catalog resources across accounts. The following diagram illustrates a cross-account architecture.

We describe this in more detail in a subsequent post.

Clean up

To help prevent unwanted charges to your AWS account, you can delete the AWS resources that you used for this walkthrough.

  1. Sign in as lf-data-engineer and delete the databases tag_database and col_tag_database.
  2. Sign in as lf-data-steward and clean up all the LF-tag permissions, data permissions, and data location permissions that were granted above to lf-data-engineer and lf-data-analyst.
  3. Sign in to the Amazon S3 console as the account owner (the IAM credentials you used to deploy the CloudFormation stack).
  4. Delete the following buckets:
    1. lf-tagbased-demo-accesslogs-<acct-id>
    2. lf-tagbased-demo-<acct-id>
  5. On the AWS CloudFormation console, delete the stack you created.
  6. Wait for the stack status to change to DELETE_COMPLETE.

Conclusion

In this post, we explained how to create Lake Formation tag-based access control policies using an AWS public dataset. In addition, we explained how to query tables, databases, and columns that have Lake Formation tag-based access policies associated with them.

You can generalize these steps to share resources across accounts. You can also use these steps to grant permissions to SAML identities. In subsequent posts, we highlight these use cases in more detail.


About the Authors

Sanjay Srivastava is a principal product manager for AWS Lake Formation. He is passionate about building products, in particular products that help customers get more out of their data. During his spare time, he loves to spend time with his family and engage in outdoor activities including hiking, running, and gardening.

Nivas Shankar is a Principal Data Architect at Amazon Web Services. He helps and works closely with enterprise customers building data lakes and analytical applications on the AWS platform. He holds a master’s degree in physics and is highly passionate about theoretical physics concepts.

Pavan Emani is a Data Lake Architect at AWS, specialized in big data and analytics solutions. He helps customers modernize their data platforms on the cloud. Outside of work, he likes reading about space and watching sports.

Effective data lakes using AWS Lake Formation, Part 5: Securing data lakes with row-level access control

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/effective-data-lakes-using-aws-lake-formation-part-5-secure-data-lakes-with-row-level-access-control/

Increasingly, customers are looking at data lakes as a core part of their strategy to democratize data access across the organization. Data lakes enable you to handle petabytes and exabytes of data coming from a multitude of sources in varying formats, and give users the ability to access it from their choice of analytics and machine learning tools. Fine-grained access controls are needed to ensure data is protected and access is granted to only those who require it.

AWS Lake Formation is a fully managed service that helps you build, secure, and manage data lakes, and provide access control for data in the data lake. Lake Formation row-level permissions allow you to restrict access to specific rows based on data compliance and governance policies. Lake Formation also provides centralized auditing and compliance reporting by identifying which principals accessed what data, when, and through which services.

Effective data lakes using AWS Lake Formation

This post demonstrates how row-level access controls work in Lake Formation, and how to set them up.

If you have large fact tables storing billions of records, you need a way to enable different users and teams to access only the data they're allowed to see. Row-level access control is a simple and performant way to protect data, while giving users access to the data they need to perform their job. In the retail industry, for instance, you may want individual departments to only see their own transactions, but allow regional managers access to transactions from every department.

Traditionally you can achieve row-level access control in a data lake through two common approaches:

  • Duplicate the data, redact sensitive information, and grant coarse-grained permissions on the redacted dataset
  • Load data into a database or a data warehouse, create a view with a WHERE clause to select only specific records, and grant permission on the resulting view

These solutions work well when dealing with a small number of tables, principals, and permissions. However, they make it difficult to audit and maintain because access controls are spread across multiple systems and methods. To make it easier to manage and enforce fine-grained access controls in a data lake, we announced a preview of Lake Formation row-level access controls. With this preview feature, you can create row-level filters and attach them to tables to restrict access to data for AWS Identity and Access Management (IAM) and SAMLv2 federated identities.

How data filters work for row-level security

Granting permissions on a table with row-level security (row filtering) restricts access to only specific rows in the table. The filtering is based on the values of one or more columns. For example, a salesperson analyzing sales opportunities should only be allowed to see those opportunities in their assigned territory and not others. We can define row-level filters to restrict access where the value of the territory column matches the assigned territory of the user.

With row-level security, we introduced the concept of data filters. Data filters make it simpler to manage and assign a large number of fine-grained permissions. You can specify the row filter expression using the WHERE clause syntax described in the PartiQL dialect.

Example use case

In this post, a fictional ecommerce company sells many different products, like books, videos, and toys. Customers can leave reviews and star ratings for each product, so other customers can make informed decisions about what they should buy. We use the Amazon Customer Reviews Dataset, which includes different products and customer reviews.

To illustrate the different roles and responsibilities of a data owner and a data consumer, we assume two personas: a data lake administrator and a data analyst. The administrator is responsible for setting up the data lake, creating data filters, and granting permissions to data analysts. Data analysts residing in different countries (for our use case, the US and Japan) can only analyze product reviews for customers located in their own country and for compliance reasons, shouldn’t be able to see data for customers located in other countries. We have two data analysts: one responsible for the US marketplace and another for the Japanese marketplace. Each analyst uses Amazon Athena to analyze customer reviews for their specific marketplace only.

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An AWS Lambda function (for Lambda-backed AWS CloudFormation custom resources). We use the function to copy sample data files from the public S3 bucket to your Amazon Simple Storage Service (Amazon S3) bucket.
  • An S3 bucket to serve as our data lake.
  • IAM users and policies:
    • DataLakeAdmin
    • DataAnalystUS
    • DataAnalystJP
  • An AWS Glue Data Catalog database, table, and partition.
  • Lake Formation data lake settings and permissions.

When following the steps in this section, use either us-east-1 or us-west-2 Regions (where the preview functionality is currently available).

Before launching the CloudFormation template, you need to ensure that you have disabled Use only IAM access control for new databases and tables by following these steps:

  1. Sign in to the Lake Formation console in the us-east-1 or us-west-2 Region.
  2. Under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  4. Choose Save.

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the CloudFormation console in the same Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For DatalakeAdminUserName and DatalakeAdminUserPassword, enter the user name and password you want for the data lake admin IAM user.
  5. For DataAnalystUsUserName and DataAnalystUsUserPassword, enter the user name and password you want for the data analyst user who is responsible for the US marketplace.
  6. For DataAnalystJpUserName and DataAnalystJpUserPassword, enter the user name and password you want for the data analyst user who is responsible for the Japanese marketplace.
  7. For DataLakeBucketName, enter the name of your data lake bucket.
  8. For DatabaseName and TableName, leave as the default.
  9. Choose Next.
  10. On the next page, choose Next.
  11. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  12. Choose Create.

Stack creation can take about 1 minute.

Query without data filters

After you set up the environment, you can query the product reviews table. Let’s first query the table without row-level access controls to make sure we can see the data. If you’re running queries in Athena for the first time, you need to configure the query result location.

Sign in to the Athena console using the DatalakeAdmin user, and run the following query:

SELECT * 
FROM lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10

The following screenshot shows the query result. This table has only one partition, product_category=Video, so each record is a review comment for a video product.

Let’s run an aggregation query to retrieve the total number of records per marketplace:

SELECT marketplace, count(*) as total_count
FROM lakeformation_tutorial_row_security.amazon_reviews
GROUP BY marketplace

The following screenshot shows the query result. The marketplace column has five different values. In the subsequent steps, we set up row-based filters using the marketplace column.

Set up data filters

Let's start by creating two different data filters, one for the analyst responsible for the US marketplace, and another for the one responsible for the Japanese marketplace. Then we grant the users their respective permissions.

Create a filter for the US marketplace data

Let’s first set up a filter for the US marketplace data.

  1. As the DatalakeAdmin user, open the Lake Formation console.
  2. Choose Data filters.
  3. Choose Create new filter.
  4. For Data filter name, enter amazon_reviews_US.
  5. For Target database, choose the database lakeformation_tutorial_row_security.
  6. For Target table, choose the table amazon_reviews.
  7. For Column-level access, leave as the default.
  8. For Row filter expression, enter marketplace='US'.
  9. Choose Create filter.

Create a filter for the Japanese marketplace data

Let’s create another data filter to restrict access to the Japanese marketplace data.

  1. On the Data filters page, choose Create new filter.
  2. For Data filter name, enter amazon_reviews_JP.
  3. For Target database, choose the database lakeformation_tutorial_row_security.
  4. For Target table, choose the table amazon_reviews.
  5. For Column-level access, leave as the default.
  6. For Row filter expression, enter marketplace='JP'.
  7. Choose Create filter.
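Both filters can also be created programmatically. The following boto3 sketch creates the US filter; it is illustrative only, the account ID is a placeholder, and because row-level security was in preview when this post was written, the API surface may differ from what you see in your account.

import boto3

lf = boto3.client("lakeformation")
account_id = "111122223333"  # placeholder; use your own account ID

# Row filter that exposes only rows where marketplace = 'US', on all columns.
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": account_id,
        "DatabaseName": "lakeformation_tutorial_row_security",
        "TableName": "amazon_reviews",
        "Name": "amazon_reviews_US",
        "RowFilter": {"FilterExpression": "marketplace='US'"},
        "ColumnWildcard": {},  # no column restrictions; only rows are filtered
    }
)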

Grant permissions to the US data analyst

Now we have two data filters. Next, we need to grant permissions using these data filters to our analysts. We start by granting permissions to the DataAnalystUS user.

  1. On the Data permissions page, choose Grant.
  2. For Principals, choose IAM users and roles, and choose the user DataAnalystUS.
  3. For Policy tags or catalog resources, choose Named data catalog resources.
  4. For Database, choose the database lakeformation_tutorial_row_security.
  5. For Table, choose the table amazon_reviews.
  6. For Table permissions, select Select.
  7. For Data permissions, select Advanced cell-level filters.
  8. Select the filter amazon_reviews_US.
  9. Choose Grant.

The following screenshot shows the available data filters you can attach to a table when configuring permissions.

Grant permissions to the Japanese data analyst

Next, complete the following steps to configure permissions for the user DataAnalystJP:

  1. On the Data permissions page, choose Grant.
  2. For Principals, choose IAM users and roles, and choose the user DataAnalystJP.
  3. For Policy tags or catalog resources, choose Named data catalog resources.
  4. For Database, choose the database lakeformation_tutorial_row_security.
  5. For Table, choose the table amazon_reviews.
  6. For Table permissions, select Select.
  7. For Data permissions, select Advanced cell-level filters.
  8. Select the filter amazon_reviews_JP.
  9. Choose Grant.
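As with the filters themselves, these grants can be scripted. A hedged boto3 sketch of the grant to the US analyst follows; the account ID is a placeholder, and the same caveat about the preview API applies.

import boto3

lf = boto3.client("lakeformation")
account_id = "111122223333"  # placeholder; use your own account ID

# Grant SELECT on amazon_reviews through the amazon_reviews_US data filter.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:user/DataAnalystUS"
    },
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": account_id,
            "DatabaseName": "lakeformation_tutorial_row_security",
            "TableName": "amazon_reviews",
            "Name": "amazon_reviews_US",
        }
    },
    Permissions=["SELECT"],
)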

Query with data filters

With the data filters attached to the product reviews table, we’re ready to run some queries and see how permissions are enforced by Lake Formation. Because row-level security is in preview as of this writing, we need to create a special Athena workgroup named AmazonAthenaLakeFormationPreview, and switch to using it. For more information, see Managing Workgroups.

Sign in to the Athena console using the DataAnalystUS user and switch to the AmazonAthenaLakeFormationPreview workgroup. Run the following query to retrieve a few records, which are filtered based on the row-level permissions we defined:

SELECT * 
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10

Note the prefix of lakeformation. before the database name; this is required for the preview only.

The following screenshot shows the query result.

Similarly, run a query to count the total number of records per marketplace:

SELECT marketplace, count(*) as total_count
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
GROUP BY marketplace 

The following screenshot shows the query result. Only the marketplace US shows in the results. This is because our user is only allowed to see rows where the marketplace column value is equal to US.

Switch to the DataAnalystJP user and run the same query:

SELECT * 
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10

The following screenshot shows the query result. All of the records belong to the JP marketplace.

Run the query to count the total number of records per marketplace:

SELECT marketplace, count(*) as total_count
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
GROUP BY marketplace

The following screenshot shows the query result. Again, only the row belonging to the JP marketplace is returned.

Clean up

Now to the final step, cleaning up the resources.

  1. Delete the CloudFormation stack.
  2. Delete the Athena workgroup AmazonAthenaLakeFormationPreview.

Conclusion

In this post, we covered how row-level security in Lake Formation enables you to control data access without needing to duplicate it or manage complicated alternatives such as views. We demonstrated how Lake Formation data filters can make creating, managing, and enforcing row-level permissions simple and easy.

When you want to grant permissions on specific cells, you can include or exclude columns in the data filters in addition to the row filter expression. You can learn more about cell filters in Part 4: Implementing cell-level and row-level security.

You can get started with Lake Formation today by visiting the AWS Lake Formation product page. If you want to try out row-level security, as well as the other exciting new features like ACID transactions and acceleration currently available for preview in the US East (N. Virginia) and the US West (Oregon) Regions, sign up for the preview.


About the Authors

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues and helping guide customer architectures.

Sanjay Srivastava is a Principal Product Manager for AWS Lake Formation. He is passionate about building products, in particular products that help customers get more out of their data. During his spare time, he loves to spend time with his family and engage in outdoor activities including hiking, running, and gardening.

Incremental data matching using AWS Lake Formation and AWS Glue

Post Syndicated from Shehzad Qureshi original https://aws.amazon.com/blogs/big-data/incremental-data-matching-using-aws-lake-formation/

AWS Lake Formation provides a machine learning (ML) capability (FindMatches transform) to identify duplicate or matching records in your dataset, even when the records don’t have a common unique identifier and no fields match exactly. Customers across many industries have come to rely on this feature for linking datasets like patient records, customer databases, and TV shows. The initial release of the FindMatches transform identified matching records within a single dataset. When you had a new dataset, you had to merge it with the existing clean dataset and rerun matching against the complete merged dataset.

We're excited to announce the Lake Formation FindMatches incremental matching feature (Find Incremental Matches), which enables you to effortlessly match incremental records against existing matched datasets.

In this post, you learn how to use the Find Incremental Matches capability to match prospects data with existing customer datasets for the marketing department of a fictional company. The dataset used for this post is synthetically generated.

Overview of solution

The marketing department of our fictional company is responsible for organizing promotion campaigns every month and developing communications content to promote services and products to prospects (potential new customers). A list of prospects is generated by multiple internal business processes and also from multiple third-party services.

At the end of each month, the marketing team ends up with hundreds of thousands of prospects. Now the team has the herculean task of identifying unique prospects by removing duplicates and existing customers from the list.

The prospect list purchased from the third-party service doesn’t have any common unique identifiers like Social Security number (SSN) or driver’s license, which makes these tasks arduous to do manually.

You can use the ML capabilities of Lake Formation to address this challenge. The Find Incremental Matches transform enables you to identify duplicate or matching records in your dataset, even when the records don’t have a common unique identifier and no fields match exactly.

Specifically, the new incremental match capability provides the flexibility to match hundreds of thousands of new prospects with the existing database of prospects and customers without merging the two databases. Moreover, by conducting matches only between the new and existing datasets, the Find Incremental Matches optimization reduces computation time, which also reduces cost.

The following screenshot shows a sample of the existing customers dataset.

The following screenshot shows a sample of the incremental prospect dataset.

In this post, you perform the following steps for incremental matching:

  1. Run an AWS Glue extract, transform, and load (ETL) job for initial matching.
  2. Run an AWS Glue ETL job for incremental matching.
  3. Verify output data from Amazon Simple Storage Service (Amazon S3) with Amazon Athena.

The first step of initial matching is mandatory in order to perform incremental matching.

Prerequisites

To create resources for incremental matching in AWS Glue, launch the following AWS CloudFormation stack in the us-east-1 Region:

This stack creates the following resources:

  • An S3 bucket that stores the inputs and outputs of matching
  • The AWS Glue database marketing-demo
  • AWS Glue tables for existing and incremental customers:
    • existing_customers – Raw customer data
    • cleaned_existing_customers – Matched and cleaned customer data. This is the output generated by the InitialMatching job.
    • incremental_prospects – New incremental prospects data for matching
    • unique_prospects – Final output of unique prospects as required by this post’s use case
  • The AWS Glue ML transform incremental-match-blog-transform
  • AWS Glue Jobs for initial matching and incremental matching:
    • InitialMatching – For matching and transforming existing_customers to cleaned_existing_customers
    • IncrementalMatching – For incrementally matching new prospects data with cleaned_existing_customers and identifying unique prospects
  • IAM roles

Run an AWS Glue ETL job for initial matching

Before we perform the incremental matching, we need to clean the existing customer datasets by running an AWS Glue ETL job:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Select the job InitialMatching.
  3. On the Action menu, choose Run job.

This job uses the FindMatches transformation to identify unique and matched customers from the existing_customers table and writes the results to the cleaned_existing_customers table. The transform adds another column named match_id to identify matching records in the output. Rows with the same match_id are considered matching records.

The cleaned_existing_customers table becomes the primary customer data table and incremental customer data is matched against this table.

Run an AWS Glue ETL job for incremental matching

To perform the incremental matching, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Select the job IncrementalMatching.
  3. On the Action menu, choose Run job.

In comparison to the initial FindMatches script, the following changes are added to read data from the incremental prospects table (lines 24 and 27) and call the incremental matching API (line 30):

L6
import com.amazonaws.services.glue.ml.FindIncrementalMatches

L22
val existingCustomersSource = glueContext.getCatalogSource(database = "marketing-demo", 
							   tableName = "cleaned_existing_customers", 
							   redshiftTmpDir = "", 
							   transformationContext = "existingCustomersSource").getDynamicFrame()

L24
val incrementalProspectsSource = glueContext.getCatalogSource(database = "marketing-demo", 
							      tableName = "incremental_prospects", 
							      redshiftTmpDir = "", 
							      transformationContext = "incrementalProspectsSource").getDynamicFrame()

L26
val existingCustomers = existingCustomersSource.resolveChoice(choiceOption = Some(ChoiceOption("MATCH_CATALOG")), 
							      database = Some("marketing-demo"), 
							      tableName = Some("cleaned_existing_customers"), 
							      transformationContext = "existingCustomers")

L27
val incrementalProspects = incrementalProspectsSource.resolveChoice(choiceOption = Some(ChoiceOption("MATCH_CATALOG")), 
								    database = Some("marketing-demo"), 
								    tableName = Some("incremental_prospects"), 
								    transformationContext = "incrementalProspects")

L30
val incrementalMatchesResult = FindIncrementalMatches.apply(existingFrame = existingCustomers, 
					   		    incrementalFrame = incrementalProspects, 
					   		    transformId = args("tansform_id"), 
					   		    transformationContext = "findIncrementalMatches")

The DynamicFrame incrementalMatchesResult contains both matched and unmatched records from the incremental prospects dataset. Matching is done both within the prospects dataset and against the existing customer dataset. In the script, the DynamicFrame incrementalMatchesResult is further processed to filter and store the unique prospects from the incremental dataset (lines 37–53).

The job takes a few minutes to complete with 10 worker nodes. When the job is complete, you can find the matched records in the target S3 path specified in the script.

Create an AWS Glue job bookmark

Because the incremental matching targets the datasets that are received at certain intervals and joins with the existing dataset to generate output, we highly recommend you enable AWS Glue job bookmarks when you create the job. This way, when the new incremental dataset is available, you can schedule the job to run and don’t need to make any change in the ETL script.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Select the job IncrementalMatching.
  3. On the Action menu, choose Edit job.
  4. Under Advanced properties, for Job bookmark, choose Enable.
  5. Choose Save.

When a new prospect dataset arrives, you only need to upload it to the bucket of incremental dataset and run the incremental matching job you have created. AWS Glue job bookmarks track both the existing and incremental data that has already been processed during your previous job run, so the job automatically reads the cleaned customer dataset generated by the previous job and the newly added incremental prospect dataset. The incremental matching job writes the output to the same target S3 path.
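You can also trigger the job from code and pass the bookmark option at run time instead of editing the job definition. A minimal boto3 sketch, using the IncrementalMatching job created by the CloudFormation stack:

import boto3

glue = boto3.client("glue")

# Start the incremental matching job with job bookmarks enabled for this run,
# so data processed in previous runs is skipped automatically.
response = glue.start_job_run(
    JobName="IncrementalMatching",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
print(response["JobRunId"])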

Verify the output

To review the unique prospects identified by the IncrementalMatching job, complete the following steps:

  1. On the Athena console, make sure you’re in the correct Region.
  2. Choose AwsGlueDataCatalog as your data source and marketing_demo as the database.
  3. Enter the following query:
    SELECT * FROM "marketing_demo"."unique_prospects";

  4. Choose Run query.

The Results window shows all the unique prospects from the incremental prospect dataset.

Pricing

In the us-east-1 Region, the total runtime is approximately 7 minutes for both jobs. We configured these jobs to run with 10 workers of the standard worker type, resulting in a total cost of $1.47. Pricing can vary by Region. For more information, see AWS Glue pricing.

Conclusion

This post showed how you can incrementally match a new prospect dataset against an existing customer dataset using the Lake Formation FindMatches transform in order to identify unique prospects. You can use a similar process to identify duplicates and matched records from the incremental dataset, and it’s especially useful in the use case of product matching and fraud detection.

To learn more, see the AWS Glue PySpark or Scala documentation. Please send any feedback to the AWS Glue Discussion Forums or through your usual AWS Support contacts.


About the Authors

Shehzad Qureshi is a Senior Software Engineer at Amazon Web Services.

Bin Pang is a software development engineer at Amazon Web Services.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data platforms on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Create a secure data lake by masking, encrypting data, and enabling fine-grained access with AWS Lake Formation

Post Syndicated from Shekar Tippur original https://aws.amazon.com/blogs/big-data/create-a-secure-data-lake-by-masking-encrypting-data-and-enabling-fine-grained-access-with-aws-lake-formation/

You can build data lakes with millions of objects on Amazon Simple Storage Service (Amazon S3) and use AWS native analytics and machine learning (ML) services to process, analyze, and extract business insights. You can use a combination of our purpose-built databases and analytics services like Amazon EMR, Amazon Elasticsearch Service (Amazon ES), and Amazon Redshift as the right tool for your specific job and benefit from optimal performance, scale, and cost.

In this post, you learn how to create a secure data lake using AWS Lake Formation for processing sensitive data. The data (simulated patient metrics) is ingested through a serverless pipeline to identify, mask, and encrypt sensitive data before storing it securely in Amazon S3. After the data has been processed and stored, you use Lake Formation to define and enforce fine-grained access permissions to provide secure access for data analysts and data scientists.

Target personas

The proposed solution focuses on the following personas, with each one having different level of access:

  • Cloud engineer – As the cloud infrastructure engineer, you implement the architecture but may not have access to the data itself or to define access permissions
  • secure-lf-admin – As a data lake administrator, you configure the data lake settings and assign data stewards
  • secure-lf-business-analyst – As a business analyst, you shouldn’t be able to access sensitive information
  • secure-lf-data-scientist – As a data scientist, you shouldn’t be able to access sensitive information

Solution overview

We use the following AWS services for ingesting, processing, and analyzing the data:

  • Amazon Athena is an interactive query service that can query data in Amazon S3 using standard SQL, with table definitions from an AWS Glue Data Catalog. The data can be accessed via JDBC for further processing such as displaying in business intelligence (BI) dashboards.
  • Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, and more. The logs from AWS Glue jobs and AWS Lambda functions are saved in CloudWatch logs.
  • Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover information in unstructured data.
  • Amazon DynamoDB is a NoSQL database that delivers single-digit millisecond performance at any scale and is used to avoid processing duplicate files.
  • AWS Glue is a serverless data preparation service that makes it easy to extract, transform, and load (ETL) data. An AWS Glue job encapsulates a script that reads, processes, and writes data to a new schema. This solution uses Python 3.6 AWS Glue jobs for ETL processing.
  • AWS IoT provides the cloud services that connect your internet of things (IoT) devices to other devices and AWS Cloud services.
  • Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services.
  • AWS Lake Formation makes it easy to set up, secure, and manage your data lake. With Lake Formation, you can discover, cleanse, transform, and ingest data into your data lake from various sources; define fine-grained permissions at the database, table, or column level; and share controlled access across analytic, ML, and ETL services.
  • Amazon S3 is a scalable object storage service that hosts the raw data files and processed files in the data lake for millisecond access.

You can enhance the security of your sensitive data with the following methods:

  • Implement encryption at rest using AWS Key Management Service (AWS KMS) and customer managed encryption keys
  • Instrument AWS CloudTrail and audit logging
  • Restrict access to AWS resources based on the least privilege principle

Architecture overview

The solution emulates diagnostic devices sending Message Queuing Telemetry Transport (MQTT) messages onto an AWS IoT Core topic. We use Kinesis Data Firehose to preprocess and stage the raw data in Amazon S3. We then use AWS Glue for ETL to further process the data by calling Amazon Comprehend to identify any sensitive information. Finally, we use Lake Formation to define fine-grained permissions that restrict access to business analysts and data scientists who use Athena to query the data.

The following diagram illustrates the architecture for our solution.

Prerequisites

To follow the deployment walkthrough, you need an AWS account. Use us-east-1 or us-west-2 as your Region.

For this post, make sure you don’t have Lake Formation enabled in your AWS account.

Stage the data

Download the zipped archive file to use for this solution and unzip the files locally. The patient.csv file contains dummy data created to help demonstrate masking, encryption, and granting fine-grained access. The send-messages.sh script randomly generates simulated diagnostic data to represent body vitals. The AWS Glue job uses the glue-script.py script to perform ETL that detects sensitive information, masks and encrypts data, and populates the curated table in the AWS Glue Data Catalog.

Create an S3 bucket called secure-datalake-scripts-<ACCOUNT_ID> via the Amazon S3 console. Upload the scripts and CSV files to this location.
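
If you prefer to stage these files programmatically, the following Python (boto3) sketch creates the bucket and uploads the files; the local file names and the us-east-1 Region are assumptions based on the downloads described above.

import boto3

account_id = boto3.client('sts').get_caller_identity()['Account']
bucket = f'secure-datalake-scripts-{account_id}'

s3 = boto3.client('s3', region_name='us-east-1')
# In us-east-1, create_bucket needs no LocationConstraint; other Regions require one
s3.create_bucket(Bucket=bucket)

for local_file in ['send-messages.sh', 'glue-script.py', 'patient.csv']:
    s3.upload_file(local_file, bucket, local_file)
print(f'Uploaded scripts and data files to s3://{bucket}/')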

Deploy your resources

For this post, we use AWS CloudFormation to create our data lake infrastructure.

  1. Choose Launch Stack:
  2. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names before deploying.

The stack takes approximately 5 minutes to complete.

The following screenshot shows the key-value pairs the stack created. We use the TestUserPassword parameter for the Lake Formation personas to sign in to the AWS Management Console.

Load the simulation data

Sign in to the AWS CloudShell console and wait for the terminal to start.

Stage the send-messages.sh script by running the Amazon S3 copy command:

aws s3 cp s3://secure-datalake-scripts-<ACCOUNT_ID>/send-messages.sh .

Run your script by using the following command:

sh send-messages.sh

The script runs for a few minutes and emits 300 messages. It sends MQTT messages to the secure_iot_device_analytics topic, which are filtered using IoT rules, processed using Kinesis Data Firehose, and converted to Parquet format. After a minute, data starts showing up in the raw bucket.
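
If you want to see roughly what the script does, the following Python (boto3) sketch publishes a few simulated readings to the same topic; the payload fields are hypothetical and may not match the fields send-messages.sh actually emits.

import json
import random
import time
import boto3

# Resolve the account-specific AWS IoT data endpoint
iot = boto3.client('iot', region_name='us-east-1')
endpoint = iot.describe_endpoint(endpointType='iot:Data-ATS')['endpointAddress']
iot_data = boto3.client('iot-data', region_name='us-east-1', endpoint_url=f'https://{endpoint}')

# Publish a handful of simulated body-vitals readings
for _ in range(10):
    payload = {
        'pid': random.randint(1, 50),                            # hypothetical patient ID field
        'heart_rate': random.randint(55, 120),
        'temperature': round(random.uniform(97.0, 102.0), 1),
    }
    iot_data.publish(topic='secure_iot_device_analytics', qos=1, payload=json.dumps(payload))
    time.sleep(1)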

Run the AWS Glue ETL pipeline

Run the AWS Glue workflow (secureGlueWorkflow) from the AWS Glue console; you can also schedule it to run using CloudWatch. It takes approximately 10 minutes to complete.

The AWS Glue job that is triggered as part of the workflow (ProcessSecureData) joins the patient metadata and patient metrics data. See the following code:

# Join patient metadata and patient metrics DynamicFrames
combined_df = Join.apply(patient_metadata, patient_metrics, 'PatientId', 'pid', transformation_ctx="combined_df")

The ensuing dataframe contains sensitive information like FirstName, LastName, DOB, Address1, Address2, and AboutYourself. AboutYourself is freeform text entered by the patient during registration. In the following code snippet, the detect_sensitive_info function calls the Amazon Comprehend API to identify personally identifiable information (PII):

# Apply groupBy to get unique AboutYourself records
group_df = combined_df.toDF().groupBy("pid", "DOB", "FirstName", "LastName", "Address1", "Address2", "AboutYourself").count()
# Apply detect_sensitive_info to get the redacted string after masking PII data
df_with_about_yourself = Map.apply(frame = group_df, f = detect_sensitive_info)
# Apply encryption to the identified fields
df_with_about_yourself_encrypted = Map.apply(frame = group_df, f = encrypt_rows)

Amazon Comprehend returns an object that has information about the entity name and entity type. Based on your needs, you can filter the entity types that need to be masked.
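
The actual implementation lives in glue-script.py and may differ, but the following sketch shows how a detect_sensitive_info-style function could call the Amazon Comprehend DetectPiiEntities API and mask the detected spans; the entity types to mask and the assumption that Map.apply passes each record as a Python dictionary are ours.

import boto3

comprehend = boto3.client('comprehend', region_name='us-east-1')

# Entity types chosen for masking; adjust to your compliance requirements
PII_TYPES_TO_MASK = {'NAME', 'ADDRESS', 'DATE_TIME', 'PHONE', 'SSN'}

def detect_sensitive_info(record):
    """Mask PII found in the AboutYourself free-form text of a single record."""
    text = record.get('AboutYourself', '')
    if not text:
        return record
    response = comprehend.detect_pii_entities(Text=text, LanguageCode='en')
    # Replace detected spans from the end so earlier offsets stay valid
    for entity in sorted(response['Entities'], key=lambda e: e['BeginOffset'], reverse=True):
        if entity['Type'] in PII_TYPES_TO_MASK:
            text = text[:entity['BeginOffset']] + '*****' + text[entity['EndOffset']:]
    record['AboutYourself'] = text
    return record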

These fields are masked, encrypted, and written to their respective S3 buckets where fine-grained access controls are applied via Lake Formation:

  • Masked data – s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-masked-data/
  • Encrypted data – s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-encrypted-data/
  • Curated data – s3://secure-data-lake-<ACCOUNT_ID>/secure-dl-curated-data/

Now that the tables have been defined, we review permissions using Lake Formation.

Enable Lake Formation fine-grained access

To enable fine-grained access, we first add a Lake Formation admin user.

  1. On the Lake Formation console, select Add other AWS users or roles.
  2. On the drop-down menu, choose secure-lf-admin.
  3. Choose Get started.
  4. In the navigation pane, choose Settings.
  5. On the Data Catalog Settings page, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  6. Choose Save.
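
The preceding console steps can also be scripted. The following Python (boto3) sketch adds the secure-lf-admin user as a data lake administrator and turns off the IAM-only defaults; the user ARN is an assumption based on the IAM user created by the CloudFormation stack.

import boto3

account_id = boto3.client('sts').get_caller_identity()['Account']
lf = boto3.client('lakeformation', region_name='us-east-1')

settings = lf.get_data_lake_settings(CatalogId=account_id)['DataLakeSettings']

# Add secure-lf-admin as a data lake administrator
admin = {'DataLakePrincipalIdentifier': f'arn:aws:iam::{account_id}:user/secure-lf-admin'}
if admin not in settings.get('DataLakeAdmins', []):
    settings.setdefault('DataLakeAdmins', []).append(admin)

# Equivalent of deselecting "Use only IAM access control" for new databases and tables
settings['CreateDatabaseDefaultPermissions'] = []
settings['CreateTableDefaultPermissions'] = []

lf.put_data_lake_settings(CatalogId=account_id, DataLakeSettings=settings)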

Grant access to different personas

Before we grant permissions to different user personas, let’s register the S3 locations in Lake Formation so these personas can access S3 data without granting access through AWS Identity and Access Management (IAM).

  1. On the Lake Formation console, choose Register and ingest in the navigation pane.
  2. Choose Data lake locations.
  3. Choose Register location.
  4. Find and select each of the following S3 buckets and choose Register location:
    1. s3://secure-raw-bucket-<ACCOUNT_ID>/temp-raw-table
    2. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-encrypted-data
    3. s3://secure-data-lake-<ACCOUNT_ID>/secure-dl-curated-data
    4. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-masked-data
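
You can also register these locations with the Lake Formation API. The following Python (boto3) sketch registers the same four paths using the service-linked role; run it with data lake administrator credentials.

import boto3

account_id = boto3.client('sts').get_caller_identity()['Account']
lf = boto3.client('lakeformation', region_name='us-east-1')

locations = [
    f'arn:aws:s3:::secure-raw-bucket-{account_id}/temp-raw-table',
    f'arn:aws:s3:::secure-data-lake-masked-{account_id}/secure-dl-encrypted-data',
    f'arn:aws:s3:::secure-data-lake-{account_id}/secure-dl-curated-data',
    f'arn:aws:s3:::secure-data-lake-masked-{account_id}/secure-dl-masked-data',
]

for arn in locations:
    # UseServiceLinkedRole lets Lake Formation manage S3 access on your behalf
    lf.register_resource(ResourceArn=arn, UseServiceLinkedRole=True)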

We’re now ready to grant access to our different users.

Grant read-only access to all the tables to secure-lf-admin

First, we grant read-only access to all the tables for the user secure-lf-admin.

  1. Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
  2. Navigate to the AWS Lake Formation console.
  3. Under Data Catalog, choose Databases.
  4. Select the database secure-db.
  5. On the Actions drop-down menu, choose Grant.
  6. Select IAM users and roles.
  7. Choose the role secure-lf-admin.
  8. Under Policy tags or catalog resources, select Named data catalog resources.
  9. For Database, choose the database secure-db.
  10. For Tables, choose All tables.
  11. Under Permissions, select Table permissions.
  12. For Table permissions, select Super.
  13. Choose Grant.
  14. Choose the secure_dl_curated_data table.
  15. On the Actions drop-down menu, choose View permissions.
  16. Select IAMAllowedPrincipals, choose Revoke, and then choose Revoke again to confirm.

You can confirm your user permissions on the Data Permissions page.
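
The same grant and revoke can be performed with the Lake Formation API. The following Python (boto3) sketch mirrors the preceding console steps; the console's Super permission corresponds to ALL in the API, and the user ARN is an assumption based on the CloudFormation stack.

import boto3

account_id = boto3.client('sts').get_caller_identity()['Account']
lf = boto3.client('lakeformation', region_name='us-east-1')
admin_arn = f'arn:aws:iam::{account_id}:user/secure-lf-admin'

# Grant Super (ALL) on every table in secure-db to secure-lf-admin
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': admin_arn},
    Resource={'Table': {'DatabaseName': 'secure-db', 'TableWildcard': {}}},
    Permissions=['ALL'],
)

# Revoke the default IAMAllowedPrincipals grant on the curated table
lf.revoke_permissions(
    Principal={'DataLakePrincipalIdentifier': 'IAM_ALLOWED_PRINCIPALS'},
    Resource={'Table': {'DatabaseName': 'secure-db', 'Name': 'secure_dl_curated_data'}},
    Permissions=['ALL'],
)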

Grant read-only access to secure-lf-business-analyst

Now we grant read-only access to certain encrypted columns to the user secure-lf-business-analyst.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Select the database secure-db and choose View tables.
  3. Select the table secure_dl_encrypted_data.
  4. On the Actions drop-down menu, choose Grant.
  5. Select IAM users and roles.
  6. Choose the role secure-lf-business-analyst.
  7. Under Permissions, select Column-based permissions.
  8. Choose the following columns:
    1. count
    2. address1_encrypted
    3. firstname_encrypted
    4. address2_encrypted
    5. dob_encrypted
    6. lastname_encrypted
  9. For Grantable permissions, select Select.
  10. Choose Grant.
  11. Choose the secure_dl_encrypted_data table.
  12. On the Actions drop-down menu, choose View permissions.
  13. Select IAMAllowedPrincipals, choose Revoke, and then choose Revoke again to confirm.

You can confirm your user permissions on the Data Permissions page.
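
Column-level grants follow the same pattern through the API. The following Python (boto3) sketch grants Select on only the encrypted columns to secure-lf-business-analyst; the user ARN is an assumption based on the CloudFormation stack.

import boto3

account_id = boto3.client('sts').get_caller_identity()['Account']
lf = boto3.client('lakeformation', region_name='us-east-1')
analyst_arn = f'arn:aws:iam::{account_id}:user/secure-lf-business-analyst'

lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': analyst_arn},
    Resource={'TableWithColumns': {
        'DatabaseName': 'secure-db',
        'Name': 'secure_dl_encrypted_data',
        'ColumnNames': ['count', 'address1_encrypted', 'firstname_encrypted',
                        'address2_encrypted', 'dob_encrypted', 'lastname_encrypted'],
    }},
    Permissions=['SELECT'],
    PermissionsWithGrantOption=['SELECT'],  # mirrors the "Grantable permissions" console step
)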

Grant read-only access to secure-lf-data-scientist

Lastly, we grant read-only access to masked data to the user secure-lf-data-scientist.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Select the database secure-db and choose View tables.
  3. Select the table secure_dl_masked_data.
  4. On the Actions drop-down menu, choose Grant.
  5. Select IAM users and roles.
  6. Choose the role secure-lf-data-scientist.
  7. Under Permissions, select Table permissions.
  8. For Table permissions, select Select.
  9. Choose Grant.
  10. Under Data Catalog, choose Tables.
  11. Choose the secure_dl_masked_data table.
  12. On the Actions drop-down menu, choose View permissions.
  13. Select IAMAllowedPrincipals, choose Revoke, and then choose Revoke again to confirm.

You can confirm your user permissions on the Data Permissions page.

Query the data lake using Athena from different personas

To validate the permissions of different personas, we use Athena to query against the S3 data lake.

Make sure you set the query result location to the location created as part of the CloudFormation stack (secure-athena-query-<ACCOUNT_ID>). The following screenshot shows the location information in the Settings section on the Athena console.

You can see all the tables listed under secure-db.

  1. Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
  2. Navigate to Athena Console.
  3. Run a SELECT query against the secure_dl_curated_data table.

The secure-lf-admin user should see all the columns, including the encrypted and masked columns.
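
You can also run this validation programmatically. The following Python (boto3) sketch runs the same query through Athena and prints the column headers the caller is allowed to see; it assumes the query result location created by the CloudFormation stack.

import time
import boto3

account_id = boto3.client('sts').get_caller_identity()['Account']
athena = boto3.client('athena', region_name='us-east-1')

query = athena.start_query_execution(
    QueryString='SELECT * FROM secure_dl_curated_data LIMIT 10',
    QueryExecutionContext={'Database': 'secure-db'},
    ResultConfiguration={'OutputLocation': f's3://secure-athena-query-{account_id}/'},
)
query_id = query['QueryExecutionId']

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    print([col['VarCharValue'] for col in rows[0]['Data']])  # header row shows visible columns
else:
    print(f'Query finished with state {state}')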

Now let’s validate the permissions of secure-lf-business-analyst user.

  1. Sign in to the console with secure-lf-business-analyst.
  2. Navigate to Athena console.
  3. Run a SELECT query against the secure_dl_encrypted_data table.

The secure-lf-business-analyst user can only view the selected encrypted columns.

Lastly, let’s validate the permissions of secure-lf-data-scientist.

  1. Sign in to the console with secure-lf-data-scientist.
  2. Run a SELECT query against the secure_dl_masked_data table.

The secure-lf-data-scientist user can only view the selected masked columns.

If you try to run a query on different tables, such as secure_dl_curated_data, you get an error message for insufficient permissions.

Clean up

To avoid unexpected future charges, delete the CloudFormation stack.

Conclusion

In this post, we presented a potential solution for processing and storing sensitive data workloads in an S3 data lake. We demonstrated how to build a data lake on AWS to ingest, transform, aggregate, and analyze data from IoT devices in near-real time. This solution also demonstrates how you can mask and encrypt sensitive data, and use fine-grained column-level security controls with Lake Formation, which benefits those with a higher level of security needs.

Lake Formation recently announced the preview for row-level access, and you can sign up for the preview now!


About the Authors

Shekar Tippur is an AWS Partner Solutions Architect. He specializes in machine learning and analytics workloads. He has been helping partners and customers adopt best practices and discover insights from data.

 

 

Ramakant Joshi is an AWS Solution Architect, specializing in the analytics and serverless domain. He has over 20 years of software development and architecture experience, and is passionate about helping customers in their cloud journey.

 

 

Navnit Shukla is an AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Design a data mesh architecture using AWS Lake Formation and AWS Glue

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/

Organizations of all sizes have recognized that data is one of the key enablers to increase and sustain innovation, and drive value for their customers and business units. They are eagerly modernizing traditional data platforms with cloud-native technologies that are highly scalable, feature-rich, and cost-effective. As you look to make business decisions driven by data, you can be agile and productive by adopting a mindset that delivers data products from specialized teams, rather than through a centralized data management platform that provides generalized analytics.

In this post, we describe an approach to implement a data mesh using AWS native services, including AWS Lake Formation and AWS Glue. This approach enables lines of business (LOBs) and organizational units to operate autonomously by owning their data products end to end, while providing central data discovery, governance, and auditing for the organization at large, to ensure data privacy and compliance.

Benefits of a data mesh model

A centralized model is intended to simplify staffing and training by centralizing data and technical expertise in a single place, to reduce technical debt by managing a single data platform, and to reduce operational costs. Data platform groups, often part of central IT, are divided into teams based on the technical functions of the platform they support. For instance, one team may own the ingestion technologies used to collect data from numerous data sources managed by other teams and LOBs. A different team might own data pipelines, writing and debugging extract, transform, and load (ETL) code and orchestrating job runs, while validating and fixing data quality issues and ensuring data processing meets business SLAs. However, managing data through a central data platform can create scaling, ownership, and accountability challenges, because central teams may not understand the specific needs of a data domain, whether due to data types and storage, security, data catalog requirements, or specific technologies needed for data processing.

You can often reduce these challenges by giving ownership and autonomy to the team who owns the data, best allowing them to build data products, rather than only being able to use a common central data platform. For instance, product teams are responsible for ensuring the product inventory is updated regularly with new products and changes to existing ones. They’re the domain experts of the product inventory datasets. If a discrepancy occurs, they’re the only group who knows how to fix it. Therefore, they’re best able to implement and operate a technical solution to ingest, process, and produce the product inventory dataset. They own everything leading up to the data being consumed: they choose the technology stack, operate in the mindset of data as a product, enforce security and auditing, and provide a mechanism to expose the data to the organization in an easy-to-consume way. This reduces overall friction for information flow in the organization, where the producer is responsible for the datasets they produce and is accountable to the consumer based on the advertised SLAs.

This data-as-a-product paradigm is similar to Amazon’s operating model of building services. Service teams build their services, expose APIs with advertised SLAs, operate their services, and own the end-to-end customer experience. This is distinct from the world where someone builds the software, and a different team operates it. The end-to-end ownership model has enabled us to implement faster, with better efficiency, and to quickly scale to meet customers’ use cases. We aren’t limited by centralized teams and their ability to scale to meet the demands of the business. Each service we build stands on the shoulders of other services that provide the building blocks. The analogy in the data world would be the data producers owning the end-to-end implementation and serving of data products, using the technologies they selected based on their unique needs. At AWS, we have been talking about the data-driven organization model for years, which consists of data producers and consumers. This model is similar to those used by some of our customers, and has been eloquently described recently by Zhamak Dehghani of Thoughtworks, who coined the term data mesh in 2019.

Solution overview

In this post, we demonstrate how the Lake House Architecture is ideally suited to help teams build data domains, and how you can use the data mesh approach to bring domains together to enable data sharing and federation across business units. This approach can enable better autonomy and a faster pace of innovation, while building on top of a proven and well-understood architecture and technology stack, and ensuring high standards for data security and governance.

The following are key points when considering a data mesh design:

  • Data mesh is a pattern for defining how organizations can organize around data domains with a focus on delivering data as a product. However, it may not be the right pattern for every customer.
  • A Lake House approach and the data lake architecture provide technical guidance and solutions for building a modern data platform on AWS.
  • The Lake House approach with a foundational data lake serves as a repeatable blueprint for implementing data domains and products in a scalable way.
  • The manner in which you utilize AWS analytics services in a data mesh pattern may change over time, but still remains consistent with the technological recommendations and best practices for each service.

The following are data mesh design goals:

  • Data as a product – Each organizational domain owns their data end to end. They’re responsible for building, operating, serving, and resolving any issues arising from the use of their data. Data accuracy and accountability lies with the data owner within the domain.
  • Federated data governance – Data governance ensures data is secure, accurate, and not misused. The technical implementation of data governance such as collecting lineage, validating data quality, encrypting data at rest and in transit, and enforcing appropriate access controls can be managed by each of the data domains. However, central data discovery, reporting, and auditing is needed to make it simple for users to find data and for auditors to verify compliance.
  • Common access – Data must be easily consumable by subject matter personas like data analysts and data scientists, as well as purpose-built analytics and machine learning (ML) services like Amazon Athena, Amazon Redshift, and Amazon SageMaker. To do that, data domains must expose a set of interfaces that make data consumable while enforcing appropriate access controls and audit tracking.

The following are user experience considerations:

  • Data teams own their information lifecycle, from the application that creates the original data, through to the analytics systems that extract and create business reports and predictions. Through this lifecycle, they own the data model, and determine which datasets are suitable for publication to consumers.
  • Data domain producers expose datasets to the rest of the organization by registering them with a central catalog. They can choose what to share, for how long, and how consumers can interact with it. They’re also responsible for maintaining the data and making sure it’s accurate and current.
  • Data domain consumers or individual users should be given access to data through a supported interface, like a data API, that can ensure consistent performance, tracking, and access controls.
  • All data assets are easily discoverable from a single central data catalog. The data catalog contains the datasets registered by data domain producers, including supporting metadata such as lineage, data quality metrics, ownership information, and business context.
  • All actions taken with data, usage patterns, data transformation, and data classifications should be accessible through a single, central place. Data owners, administrators, and auditors should be able to inspect a company’s data compliance posture in a single place.

Let’s start with a high-level design that builds on top of the data mesh pattern. As seen in the following diagram, it separates consumers, producers, and central governance to highlight the key aspects discussed previously. However, a data domain may represent a data consumer, a data producer, or both.

The objective for this design is to create a foundation for building data platforms at scale, supporting the objectives of data producers and consumers with strong and consistent governance. The AWS approach to designing a data mesh identifies a set of general design principles and services to facilitate best practices for building scalable data platforms, ubiquitous data sharing, and enable self-service analytics on AWS.

Expanding on the preceding diagram, we provide additional details to show how AWS native services support producers, consumers, and governance. Each data domain, whether a producer, consumer, or both, is responsible for its own technology stack. However, using AWS native analytics services with the Lake House Architecture offers a repeatable blueprint that your organization can use as you scale your data mesh design. Having a consistent technical foundation ensures services are well integrated, core features are supported, scale and performance are baked in, and costs remain low.

A data domain: producer and consumer

A data mesh design organizes around data domains. Each data domain owns and operates multiple data products with its own data and technology stack, which is independent from others. Data domains can be purely producers, such as a finance domain that only produces sales and revenue data for other domains to consume, or a consumer domain, such as a product recommendation service that consumes data from other domains to create the product recommendations displayed on an ecommerce website. In addition to sharing, a centralized data catalog can provide users with the ability to more quickly find available datasets, and allows data owners to assign access permissions and audit usage across business units.

A producer domain resides in an AWS account and uses Amazon Simple Storage Service (Amazon S3) buckets to store raw and transformed data. It maintains its own ETL stack using AWS Glue to process and prepare the data before being cataloged into a Lake Formation Data Catalog in their own account. Similarly, the consumer domain includes its own set of tools to perform analytics and ML in a separate AWS account. The central data governance account is used to share datasets securely between producers and consumers. It’s important to note that sharing is done through metadata linking alone. Data isn’t copied to the central account, and ownership remains with the producer. The central catalog makes it easy for any user to find data and to ask the data owner for access in a single place. They can then use their tool of choice inside of their own environment to perform analytics and ML on the data.

The following diagram illustrates the end-to-end workflow.

The workflow from producer to consumer includes the following steps:

  1. Data source locations hosted by the producer are created within the producer’s AWS Glue Data Catalog and registered with Lake Formation.
  2. When a dataset is presented as a product, producers create Lake Formation Data Catalog entities (database, table, columns, attributes) within the central governance account. This makes it easy to find and discover catalogs across consumers. However, this doesn’t grant any permission rights to catalogs or data to all accounts or consumers, and all grants are authorized by the producer.
  3. The central Lake Formation Data Catalog shares the Data Catalog resources back to the producer account with required permissions via Lake Formation resource links to metadata databases and tables.
  4. Lake Formation permissions are granted in the central account to producer role personas (such as the data engineer role) to manage schema changes and perform data transformations (alter, delete, update) on the central Data Catalog.
  5. Producers accept the resource share from the central governance account so they can make changes to the schema at a later time.
  6. Data changes made within the producer account are automatically propagated into the central governance copy of the catalog.
  7. Based on a consumer access request, and the need to make data visible in the consumer’s AWS Glue Data Catalog, the central account owner grants Lake Formation permissions to a consumer account based on direct entity sharing, or based on tag-based access controls, which can be used to administer access via controls like data classification, cost center, or environment (see the scripted example after these steps).
  8. Lake Formation in the consumer account can define access permissions on these datasets for local users to consume. Users in the consumer account, like data analysts and data scientists, can query data using their chosen tool such as Athena and Amazon Redshift.
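
As a concrete illustration of step 7, the following Python (boto3) sketch grants cross-account Select on every table in a shared database from the central governance account to a consumer account; the account IDs and database name are placeholders.

import boto3

CENTRAL_GOVERNANCE_ACCOUNT = '111111111111'   # placeholder
CONSUMER_ACCOUNT = '222222222222'             # placeholder

lf = boto3.client('lakeformation')  # run with credentials in the central governance account

lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': CONSUMER_ACCOUNT},
    Resource={'Table': {
        'CatalogId': CENTRAL_GOVERNANCE_ACCOUNT,
        'DatabaseName': 'sales',    # hypothetical shared database
        'TableWildcard': {},
    }},
    Permissions=['SELECT'],
    PermissionsWithGrantOption=['SELECT'],  # lets the consumer admin grant access locally
)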

Build data products

Data domain producers ingest data into their respective S3 buckets through a set of pipelines that they manage, own, and operate. Producers are responsible for the full lifecycle of the data under their control, and for moving data from raw data captured from applications to a form that is suitable for consumption by external parties. AWS Glue is a serverless data integration and preparation service that offers all the components needed to develop, automate, and manage data pipelines at scale, and in a cost-effective way. It provides a simple-to-use interface that organizations can use to quickly onboard data domains without needing to test, approve, and juggle vendor roadmaps to ensure all required features and integrations are available.

Central data governance

The central data governance account stores a data catalog of all enterprise data across accounts, and provides features allowing producers to register and create catalog entries with AWS Glue from all their S3 buckets. No data (except logs) exists in this account. Lake Formation centrally defines security, governance, and auditing policies in one place, enforces those policies for consumers across analytics applications, and only provides authorization and session token access for data sources to the role that is requesting access. Lake Formation also provides uniform access control for enterprise-wide data sharing through resource shares with centralized governance and auditing.

Common access

Each consumer obtains access to shared resources from the central governance account in the form of resource links. These are available in the consumer’s local Lake Formation and AWS Glue Data Catalog, allowing database and table access that can be managed by consumer admins. After access is granted, consumers can access the account and perform different actions with the following services:

  • Athena acts as a consumer and runs queries on data registered using Lake Formation. Lake Formation verifies that the workgroup AWS Identity and Access Management (IAM) role principal has the appropriate Lake Formation permissions to the database, table, and Amazon S3 location as appropriate for the query. If the principal has access, Lake Formation vends temporary credentials to Athena, and the query runs. Authentication is granted through IAM roles or users, or web federated identities using SAML or OIDC. For more information, see How Athena Accesses Data Registered With Lake Formation.
  • Amazon SageMaker Data Wrangler allows you to quickly select data from multiple data sources, such as Amazon S3, Athena, Amazon Redshift, Lake Formation, and Amazon SageMaker Feature Store. You can also write queries for data sources and import data directly into SageMaker from various file formats, such as CSV files, Parquet files, and database tables. Authentication is granted through IAM roles in the consumer account. For more information, see Prepare ML Data with Amazon SageMaker Data Wrangler.
  • Amazon Redshift Spectrum allows you to register external schemas from Lake Formation, and provides a hierarchy of permissions to control access to Amazon Redshift databases and tables in a Data Catalog. If the consumer principal has access, Lake Formation vends temporary credentials to Redshift Spectrum tables, and the query runs. Authentication is granted through IAM roles or users, or web federated identities using SAML or OIDC. For more information, see Using Redshift Spectrum with AWS Lake Formation.
  • Amazon QuickSight via Athena integrates with Lake Formation permissions. If you’re querying data with Athena, you can use Lake Formation to simplify how you secure and connect to your data from QuickSight. Lake Formation adds to the IAM permissions model by providing its own permissions model that is applied to AWS analytics and ML services. Authentication is granted through IAM roles that are mapped to QuickSight user permissions. For more information, see Authorizing Connections Through AWS Lake Formation.
  • Amazon EMR Studio and EMR notebooks allow running Spark SQL against Lake Formation’s tables backed by a SAML authority. Beginning with Amazon EMR 5.31.0, you can launch a cluster that integrates with Lake Formation. Authentication is granted through IAM roles or users, or web federated identities using SAML or OIDC. For more information, see Integrate Amazon EMR with AWS Lake Formation.

With this design, you can connect multiple data lake houses to a centralized governance account that stores all the metadata from each environment. The strength of this approach is that it integrates all the metadata and stores it in one meta model schema that can be easily accessed through AWS services for various consumers. You can extend this architecture to register new data lake catalogs and share resources across consumer accounts. The following diagram illustrates a cross-account data mesh architecture.

Conclusion

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. This is similar to how microservices turn a set of technical capabilities into a product that can be consumed by other microservices. Implementing a data mesh on AWS is made simple by using managed and serverless services such as AWS Glue, Lake Formation, Athena, and Redshift Spectrum to provide a well-understood, performant, scalable, and cost-effective solution to integrate, prepare, and serve data.

One customer who used this data mesh pattern is JPMorgan Chase. For more information, see How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform.

Lake Formation offers the ability to enforce data governance within each data domain and across domains to ensure data is easily discoverable and secure, and lineage is tracked and access can be audited. The Lake House Architecture provides an ideal foundation to support a data mesh, and provides a design pattern to ramp up delivery of producer domains within an organization. Each domain has autonomy to choose their own tech stack, but is governed by a federated security model that can be administered centrally, providing best practices for security and compliance, while allowing high agility within the domain.

 


About the Authors

Nivas Shankar is a Principal Data Architect at Amazon Web Services. He helps and works closely with enterprise customers building data lakes and analytical applications on the AWS platform. He holds a master’s degree in physics and is highly passionate about theoretical physics concepts.

 

 

Roy Hasson is the Global Business Development Lead of Analytics and Data Lakes at AWS. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan, cheering his team on and hanging out with his family.

 

 

Zach Mitchell is a Sr. Big Data Architect. He works within the product team to enhance understanding between product engineers and their customers while guiding customers through their journey to develop data lakes and other data solutions on AWS analytics services.

 

 

Ian Meyers is a Sr. Principal Product Manager for AWS Database Services. He works with many of AWS’s largest customers on emerging technology needs, and leads several data and analytics initiatives within AWS, including support for Data Mesh.

 

 

The AWS Data Lake Team members are Chanu Damarla, Sanjay Srivastava, Natacha Maheshe, Roy Ben-Alta, Amandeep Khurana, Jason Berkowitz, David Tucker, and Taz Sayed.

Effective data lakes using AWS Lake Formation, Part 4: Implementing cell-level and row-level security

Post Syndicated from Deenbandhu Prasad original https://aws.amazon.com/blogs/big-data/part-4-effective-data-lakes-using-aws-lake-formation-part-4-implementing-cell-level-and-row-level-security/

We announced the preview of AWS Lake Formation transactions, cell-level and row-level security, and acceleration at AWS re:Invent 2020. In Parts 1, 2, and 3 of this series, we explained how to set up governed tables, add streaming and batch data to them, and use ACID transactions. In this post, we focus on cell-level and row-level security and show you how to enforce business needs by restricting access to specific rows.

Effective data lakes using AWS Lake Formation

The goal of modern data lakes is to democratize access to broad datasets to empower data analysts and business users. In these scenarios, data lake security becomes more important than ever. Enterprises want to share their data across groups, departments, and organizations, while balancing their compliance and security needs. A common paradigm used by many enterprises is to restrict data access to limit scope based on the user profile or the organizations to which they belong. Previously, you had to enforce this by duplicating the original data or creating materialized and non-materialized views of the data based on filtered datasets. However, these solutions often break the concept of a single source of truth and result in write amplification, which doubles or triples storage. The large number of copies required also increases the management effort required due to their complexity.

Lake Formation supports simple row-level security and cell-level security:

  • Basic row-level security allows you to specify filter expressions that limit access to specific rows of a table to a user.
  • Cell-level security builds on row-level security by allowing you to hide or show specific columns along with providing access to specific rows.

In this post, we consider an example use case in which Company A does business across the United States and Canada. They have built an enterprise customer data hub on Amazon Simple Storage Service (Amazon S3), which sources customers from both countries into this central location for company-wide analytics and marketing.

The marketing department of Company A is responsible for organizing promotion campaigns and developing communications content to promote services and product to prospects. The team consumes the data from this enterprise customer data hub to create the targeted campaign.

The marketing team data analysts are divided by country, and the requirement is to restrict analyst access to customer data from their country—analysts from the United States can see only customers from the United States, and analysts from Canada can only access customers from Canada. Additionally, analysts from Canada aren’t allowed to see the date of birth (DoB column) due to local company policy (this restriction is an example of cell-level security).

The dataset used for this post is synthetically generated; the following screenshot shows an example of the data.

Solution overview

Let’s see how you can use the Lake Formation row-level security feature to enforce Company A’s requirements on the data in the S3 data lake. You can apply row-level security to a governed table or to a standard table in Lake Formation. In this post, you apply row-level security on a standard Lake Formation table; you can follow a similar process for a governed table.

We walk through the following high-level steps:

  1. Create a database (lf_rls_blog) and table (customer).
  2. Grant select (row and column) permissions to the users lf-rls-blog-analyst-us and lf-rls-blog-analyst-ca.
  3. Run queries in Amazon Athena as the US and Canada analysts to verify that you only see rows from the user’s respective country (and the appropriate columns).

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • AWS Identity and Access Management (IAM) users, roles, and policies:
    • The three users include lf-rls-blog-manager (data lake administrator), lf-rls-blog-analyst-us (US data analyst), and lf-rls-blog-analyst-ca (Canada data analyst)
  • Lake Formation data lake settings and resources:
    • This includes an S3 bucket with the prefix lf-rowlevel-security-blog-* and the customer data files used in this post

As of this writing, these Lake Formation preview features are available only in us-east-1 and us-west-2. When following the steps in this post, use Region us-east-1. Check the availability of the features in other Regions in the future.

To create your resources, complete the following steps:

  1. Sign in to the AWS CloudFormation console in the us-east-1 Region.
  2. Choose Launch Stack:
  3. Create passwords for the three users.
  4. Review the details on the page and select I acknowledge that AWS CloudFormation might create IAM resources.
  5. Choose Create stack.

Create a database and table

To create your database and table, complete the following steps:

  1. Sign in to the AWS Management Console as the data lake administrator (lf-rls-blog-manager).
  2. On the Lake Formation console, choose Databases in the navigation pane.
  3. Choose Create database.
  4. For Name, enter lf_rls_blog.
  5. If Use only IAM access control for new tables in this database is selected, uncheck it.
  6. Choose Create database.

Next, you create a new data lake table.

  1. In the navigation pane, choose Tables.
  2. Choose Create table.
  3. For Name, enter customer.
  4. For Database, choose the database you just created (lf_rls_blog).
  5. Browse to the customers folder under datafiles in the S3 bucket (starting with lf-rowlevel-security-blog-*) created by the CloudFormation template.
  6. For Classification, select CSV.
  7. For Delimiter, choose Comma:
  8. Choose Upload Schema.
  9. Enter the following JSON code:
    [
      {
        "Name": "customer_id",
        "Type": "bigint"
      },
      {
        "Name": "prefix",
        "Type": "string"
      },
      {
        "Name": "first_Name",
        "Type": "string"
      },
      {
        "Name": "middle_Name",
        "Type": "string"
      },
      {
        "Name": "last_Name",
        "Type": "string"
      },
      {
        "Name": "suffix",
        "Type": "string"
      },
      {
        "Name": "gender",
        "Type": "string"
      },
      {
        "Name": "dob",
        "Type": "string"
      },
      {
        "Name": "phone",
        "Type": "string"
      },
      {
        "Name": "building_number",
        "Type": "bigint"
      },
      {
        "Name": "street_Name",
        "Type": "string"
      },
      {
        "Name": "city",
        "Type": "string"
      },
      {
        "Name": "cust_zip",
        "Type": "bigint"
      },
      {
        "Name": "country",
        "Type": "string"
      }
    ]
    

  10. Choose Upload.
  11. Choose Submit.

Create data filters and grant permissions

To implement column-level, row-level, and cell-level security, first you create data filters. Then you choose that data filter while granting the select Lake Formation permission on tables. For this use case, you create two data filters: one for the US data analyst and one for the Canada data analyst.

Permitted filter expressions are predicates that obey a subset of the WHERE clause grammar in PartiQL. You can use comparison operators to compare columns with constants. The following are the supported operators:

  • Comparison operators – =, >, <, >=, <=, <>, BETWEEN, IN, LIKE
  • Logical operators – AND, OR

Let’s first create the data filter for the US analyst.

  1. On the Lake Formation console, choose Data filters in the navigation pane.
  2. Choose Create new filter.
  3. For Data filter name, enter US Filter.
  4. For Target database, choose the lf_rls_blog database.
  5. For Target table, choose the customer table.
  6. For Column-level access, select Access to all columns.
  7. For Row filter expression, enter country='US'.
  8. Choose Create filter.

The US analyst has access to all the columns of US customers only.

Now let’s create a data filter for the Canada analyst.

  1. On the Data filters page, choose Create new filter.
  2. For Data filter name, enter Canada Filter.
  3. For Target database, choose the lf_rls_blog database.
  4. For Target table, choose the customer table.
  5. For Column-level access, select Exclude columns.
  6. For Select columns, choose the dob column.
  7. For Row filter expression, enter country='Canada'.
  8. Choose Create filter.

The Canada analyst now has access to all the columns except dob (date of birth) of Canadian customers only.

Verify both data filters are created by checking the Data filters page.
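
These filters can also be created through the Lake Formation API. At the time of writing this feature was in preview, so the API shape may differ from what you see in the console; the following Python (boto3) sketch uses the CreateDataCellsFilter operation as it exists in the generally available service.

import boto3

catalog_id = boto3.client('sts').get_caller_identity()['Account']
lf = boto3.client('lakeformation', region_name='us-east-1')

# US analysts: all columns, US rows only
lf.create_data_cells_filter(TableData={
    'TableCatalogId': catalog_id,
    'DatabaseName': 'lf_rls_blog',
    'TableName': 'customer',
    'Name': 'US Filter',
    'RowFilter': {'FilterExpression': "country='US'"},
    'ColumnWildcard': {},
})

# Canada analysts: all columns except dob, Canada rows only
lf.create_data_cells_filter(TableData={
    'TableCatalogId': catalog_id,
    'DatabaseName': 'lf_rls_blog',
    'TableName': 'customer',
    'Name': 'Canada Filter',
    'RowFilter': {'FilterExpression': "country='Canada'"},
    'ColumnWildcard': {'ExcludedColumnNames': ['dob']},
})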

Now we can grant table and column permissions.

  1. On the Tables page, select the customer table.
  2. On the Actions menu, choose Grant.
  3. For IAM users and roles, choose lf-rls-blog-analyst-us.
  4. Choose Named data catalog resources.
  5. For Databases, choose lf_rls_blog.
  6. For Tables, choose customer.
  7. For Table and column permissions, choose Select.
  8. Under Data permissions, select Advanced cell-level filters.
  9. Select US Filter.
  10. Choose Grant.
  11. Repeat these steps for the lf-rls-blog-analyst-ca user, choosing the lf_rls_blog database and customer table and granting Select permissions.
  12. Select Advanced cell-level filters.
  13. In the Data permissions section, select Canada Filter.
  14. Choose Grant.
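
The equivalent grant through the API references the data filter as the resource. The following Python (boto3) sketch grants Select through US Filter to the US analyst (repeat with Canada Filter for the Canada analyst); as above, the GA API is shown and may differ from the preview console flow.

import boto3

catalog_id = boto3.client('sts').get_caller_identity()['Account']
lf = boto3.client('lakeformation', region_name='us-east-1')
us_analyst_arn = f'arn:aws:iam::{catalog_id}:user/lf-rls-blog-analyst-us'

lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': us_analyst_arn},
    Resource={'DataCellsFilter': {
        'TableCatalogId': catalog_id,
        'DatabaseName': 'lf_rls_blog',
        'TableName': 'customer',
        'Name': 'US Filter',
    }},
    Permissions=['SELECT'],
)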

Run queries to test permission levels

To utilize Lake Formation preview features in Athena, you need to create a new workgroup named AmazonAthenaLakeFormationPreview and switch to that workgroup before running queries. For more information, see Managing Workgroups. Additionally, for preview you use the lakeformation qualifier for the database and table name, as shown in the following example:

select * from lakeformation.<databasename>.<tablename>
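
If you prefer to create the preview workgroup programmatically, the following Python (boto3) sketch does so; the S3 output location is a placeholder that you should replace with the bucket created by the CloudFormation template.

import boto3

athena = boto3.client('athena', region_name='us-east-1')

athena.create_work_group(
    Name='AmazonAthenaLakeFormationPreview',
    Configuration={
        'ResultConfiguration': {
            # Replace with your bucket, for example the lf-rowlevel-security-blog-* bucket from the stack
            'OutputLocation': 's3://EXAMPLE-BUCKET/anthenaqueryresults/'
        }
    },
)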

Lake Formation implicitly grants all permissions to the table creator. In this use case, lf-rls-blog-manager has SELECT permissions on all rows and columns of the customer table. Let’s first verify permissions for lf-rls-blog-manager by querying the customer table using Athena.

  1. On the Athena console (in Region us-east-1), open the query editor.
  2. Choose set up a query result location in Amazon S3.
  3. Navigate to the S3 bucket starting with lf-rowlevel-security-blog-* and select the folder anthenaqueryresults.
  4. Choose Save.
  5. In the query editor, for Data source, choose AWSDataCatalog.
  6. For Database, choose lf_rls_blog.
  7. Create and switch to the AmazonAthenaLakeFormationPreview workgroup.

You can see the customer table under Tables.

  1. Enter the following query:
    SELECT * FROM lakeformation."lf_rls_blog"."customer"

  2. Choose Run query.

Rows from both countries are displayed to the lf-rls-blog-manager user.

Next, let’s verify the permissions of the lf-rls-blog-analyst-us and lf-rls-blog-analyst-ca users on this table.

  1. Sign in to the console as lf-rls-blog-analyst-us.
  2. Repeat the previous steps on the Athena console (in us-east-1) to set up the query result location.
  3. Switch to the AmazonAthenaLakeFormationPreview workgroup.
  4. Run the following query:
    SELECT * FROM lakeformation."lf_rls_blog"."customer"

Only US customers are shown in the result for the US data analyst.

Now you verify the same for the Canada data analyst.

  1. Sign in to the console as lf-rls-blog-analyst-ca.
  2. Repeat the previous steps on the Athena console (in us-east-1) to set up the query result location.
  3. Switch to the AmazonAthenaLakeFormationPreview workgroup.
  4. Run the following query:
    SELECT * FROM lakeformation."lf_rls_blog"."customer"

Only customers from Canada are visible to the Canada data analyst. Furthermore, the lf-rls-blog-analyst-ca user doesn’t have access to the dob column.

Clean up

For the final step, clean up the resources you created:

  1. Sign in to the console as lf-rls-blog-manager.
  2. On the Lake Formation console, choose Databases in the navigation pane.
  3. Select the database lf_rls_blog.
  4. On the Actions menu, choose Delete.
  5. Delete the anthenaqueryresults folder from the Amazon S3 bucket with prefix lf-rowlevel-security-blog-*.
  6. Log in as the user who launched the CloudFormation stack in this post.
  7. On the AWS CloudFormation console, delete the stack lf-rowlevel-security-blog.

When you delete the stack, the resources it created are automatically deleted.

Conclusion

In this post, you learned how to implement fine-grained access control on a data lake table using the new row-level security feature of Lake Formation. This feature allows you to easily enforce privacy regulations or corporate governance data access rules on your data lake.

Learn more about Lake Formation and start using these features to build and secure your data lake on Amazon S3 using Lake Formation today, and leave your thoughts and questions in the comments.


About the Authors

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data platforms on the AWS Cloud. Deenbandhu has helped customers of all sizes implement master data management, data warehouse, and data lake solutions.

 

 

 

Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue and AWS Lake Formation. His passion is for implementing software artifacts for building data lakes more effectively and easily. During his spare time, he loves to spend time with his family, especially hunting bugs—not software bugs, but bugs like butterflies, pill bugs, snails, and grasshoppers.

 

Design patterns for an enterprise data lake using AWS Lake Formation cross-account access

Post Syndicated from Satish Sarapuri original https://aws.amazon.com/blogs/big-data/design-patterns-for-an-enterprise-data-lake-using-aws-lake-formation-cross-account-access/

In this post, we briefly walk through the most common design patterns that enterprises adopt to build lake house solutions that support business agility in a multi-tenant model. These patterns use the AWS Lake Formation cross-account feature to enable a multi-account strategy, in which line of business (LOB) accounts produce and consume data from your data lake.

A modern data platform enables a community-driven approach for customers across various industries, such as manufacturing, retail, insurance, healthcare, and many more, through a flexible, scalable solution to ingest, store, and analyze customer domain-specific data to generate the valuable insights they need to differentiate themselves. Building a data lake on Amazon Simple Storage Service (Amazon S3), together with AWS analytic services, sets you on a path to become a data-driven organization.

Overview of Lake House Architecture on AWS

You can deploy data lakes on AWS to ingest, process, transform, catalog, and consume analytic insights using the AWS suite of analytics services, including Amazon EMR, AWS Glue, Lake Formation, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Amazon Relational Database Service (Amazon RDS), Amazon SageMaker, and Amazon S3. These services provide the foundational capabilities to realize your data vision, in support of your business outcomes. You can deploy a common data access and governance framework across your platform stack, which aligns perfectly with our own Lake House Architecture.

Large enterprise customers require a scalable data lake with a unified access enforcement mechanism to support their analytics workload. For this, you want to use a single set of single sign-on (SSO) and AWS Identity and Access Management (IAM) mappings to attest individual users, and define a single set of fine-grained access controls across various services. The AWS Lake House Architecture encompasses a single management framework; however, the current platform stack requires that you implement workarounds to meet your security policies without compromising on the ability to drive automation, data proliferation, or scale.

The following diagram illustrates the Lake House architecture.

Lake Formation serves as the central point of enforcement for entitlements, consumption, and governing user access. Furthermore, you may want to minimize data movement (copies) across LOBs and evolve toward data mesh methodologies, which are becoming more and more prominent.

Most typical architectures consist of Amazon S3 for primary storage; AWS Glue and Amazon EMR for data validation, transformation, cataloging, and curation; and Athena, Amazon Redshift, QuickSight, and SageMaker for end users to get insight.

Introduction to Lake Formation

Lake Formation is a fully managed service that makes it easy to build, secure, and manage data lakes. Lake Formation simplifies and automates many of the complex manual steps that are usually required to create data lakes. These steps include collecting, cleansing, moving, and cataloging data, and securely making that data available for analytics and ML.

Lake Formation provides its own permissions model that augments the IAM permissions model. This centrally defined permissions model enables fine-grained access to data stored in data lakes through a simple grant or revoke mechanism, much like a relational database management system (RDBMS). Lake Formation permissions are enforced at the table and column level (row level in preview) across the full portfolio of AWS analytics and ML services, including Athena and Amazon Redshift.

With the new cross-account feature of Lake Formation, you can grant access to other AWS accounts to write and share data to or from the data lake to other LOB producers and consumers with fine-grained access. Data lake data (S3 buckets) and the AWS Glue Data Catalog are encrypted with AWS Key Management Service (AWS KMS) customer master keys (CMKs) for security purposes.

Common lake house design patterns using Lake Formation

A typical lake house infrastructure has three major components:

  • Data producer – Publishes the data into the data lake
  • Data consumer – Consumes that data out from the data lake and runs predictive and business intelligence (BI) insights
  • Data platform – Provides infrastructure and an environment to store data assets in the form of a layer cake such as landing, raw, and curated (conformance) data, and establishes security controls between producers and consumers

Although you can construct a data platform in multiple ways, the most common pattern is a single-account strategy, in which the data producer, data consumer, and data lake infrastructure are all in the same AWS account. There is no consensus on whether a single account or multiple accounts is better in most cases, but because of regulatory, security, and performance trade-offs, we have seen customers adopt a multi-account strategy, in which data producers and data consumers are in different accounts and the data lake is operated from a central, shared account.

This raises the question of how to manage data access controls across the multiple accounts that are part of the data analytics platform, to enable seamless ingestion for producers as well as improved business autonomy and agility for consumers.

With the general availability of the Lake Formation cross-account feature, the ability to manage data-driven access controls is simplified and offers an RDBMS style of managing data lake assets for producers and consumers.

You can drive your enterprise data platform management using Lake Formation as the central location of control for data access management by following various design patterns that balance your company’s regulatory needs and align with your LOB expectation. The following table summarizes different design patterns.

Design Type      Lake Formation    Glue Data Catalog    Storage (Amazon S3)    Compute
Centralized      Centralized       Centralized          Centralized            De-Centralized
De-Centralized   De-Centralized    Centralized          De-Centralized         De-Centralized

We explain each design pattern in more detail, with examples, in the following sections.

Terminology

We use the following terms throughout this post when discussing data lake design patterns:

  • LOB – The line of business, such as inventory, marketing, or manufacturing
  • Enterprise data lake account (EDLA) – A centralized AWS account for data lake storage with a centralized AWS Glue Data Catalog and Lake Formation
  • Producer – The process or application producing data for its LOB
  • Consumer – The consumer of the LOB data via AWS services (such as Athena, AWS Glue, Amazon EMR, Amazon Redshift Spectrum, AWS Lambda, and QuickSight)

Centralized data lake design

In a centralized data lake design pattern, the EDLA is a central place to store all the data in S3 buckets along with a central (enterprise) Data Catalog and Lake Formation. The respective LOB producer and consumer accounts have all the required compute to write and read data in and from the central EDLA data, and required fine-grained access is performed using the Lake Formation cross-account feature. That’s why this architecture pattern (see the following diagram) is called a centralized data lake design pattern.

For this post, we use one LOB as an example, which has an AWS account as a producer account that generates data, which can be from on-premises applications or within an AWS environment. This account uses its compute (in this case, AWS Glue) to write data into its respective AWS Glue database. The database is created in the central EDLA where all S3 data is stored using the database link created with the Lake Formation cross-account feature. The same LOB consumer account consumes data from the central EDLA via Lake Formation to perform advanced analytics using services like AWS Glue, Amazon EMR, Redshift Spectrum, Athena, and QuickSight, using the consumer AWS account compute. The following section provides an example.

Create your database and tables, and register S3 locations

In the EDLA, complete the following steps:

  1. Register the EDLA S3 bucket path in Lake Formation.
  2. Create a database called edla_lob_a, which points to the EDLA S3 bucket for LOB-A.

  3. Create a customer table in this edla_lob_a database, which points to the EDLA S3 bucket.
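
The following Python (boto3) sketch shows how these EDLA-side steps could be scripted; the bucket name and column list are placeholders, and the real table definition would match your LOB-A data.

import boto3

EDLA_BUCKET = 'edla-data-lake-bucket'   # placeholder EDLA S3 bucket name
glue = boto3.client('glue')             # run with credentials in the EDLA account
lf = boto3.client('lakeformation')

# Register the EDLA S3 path with Lake Formation
lf.register_resource(ResourceArn=f'arn:aws:s3:::{EDLA_BUCKET}/lob-a/',
                     UseServiceLinkedRole=True)

# Database for LOB-A, pointing at the EDLA bucket
glue.create_database(DatabaseInput={
    'Name': 'edla_lob_a',
    'LocationUri': f's3://{EDLA_BUCKET}/lob-a/',
})

# Minimal customer table definition (columns are illustrative)
glue.create_table(DatabaseName='edla_lob_a', TableInput={
    'Name': 'customer',
    'TableType': 'EXTERNAL_TABLE',
    'StorageDescriptor': {
        'Location': f's3://{EDLA_BUCKET}/lob-a/customer/',
        'Columns': [{'Name': 'customer_id', 'Type': 'bigint'},
                    {'Name': 'customer_name', 'Type': 'string'}],
        'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
        'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
        'SerdeInfo': {'SerializationLibrary':
                      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'},
    },
})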

The LOB-A producer account can directly write or update data into tables, and create, update, or delete partitions using the LOB-A producer account compute via the Lake Formation cross-account feature.

You can trigger the table creation process from the LOB-A producer AWS account via Lambda cross-account access.
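
If you prefer to script these steps, the following boto3 sketch shows the equivalent API calls from the EDLA. The bucket name, database location, and column definitions are illustrative placeholders, not the exact values used in this walkthrough.

import boto3

lf = boto3.client("lakeformation")
glue = boto3.client("glue")

# 1. Register the EDLA S3 bucket path with Lake Formation
#    (uses the service-linked role; pass RoleArn instead for a custom role).
lf.register_resource(
    ResourceArn="arn:aws:s3:::edla-data-lake-bucket",  # placeholder bucket
    UseServiceLinkedRole=True,
)

# 2. Create the edla_lob_a database pointing at the LOB-A prefix in the EDLA bucket.
glue.create_database(
    DatabaseInput={
        "Name": "edla_lob_a",
        "LocationUri": "s3://edla-data-lake-bucket/lob_a/",
    }
)

# 3. Create the customer table in edla_lob_a (columns shown are illustrative).
glue.create_table(
    DatabaseName="edla_lob_a",
    TableInput={
        "Name": "customer",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "customer_id", "Type": "string"},
                {"Name": "customer_name", "Type": "string"},
            ],
            "Location": "s3://edla-data-lake-bucket/lob_a/customer/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
            },
        },
    },
)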

Grant Lake Formation cross-account access

Grant full access to the LOB-A producer account to write, update, and delete data into the EDLA S3 bucket via AWS Glue tables.

If your EDLA and producer accounts are part of the same AWS organization, you should see the accounts on the list. If not, you need to enter the AWS account number manually as an external AWS account.

The following screenshot shows the granted permissions in the EDLA for the LOB-A producer account.
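
The same grant can also be issued with the Lake Formation API. The following is a hedged sketch; the account ID is a placeholder for the LOB-A producer account, and the permission set is one reasonable interpretation of “full access” for this scenario.

import boto3

lf = boto3.client("lakeformation")

# Grant the LOB-A producer account (placeholder ID) access to every table in
# edla_lob_a, with the ability to pass those permissions on to its own principals.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "111122223333"},  # LOB-A producer account
    Resource={
        "Table": {
            "DatabaseName": "edla_lob_a",
            "TableWildcard": {},
        }
    },
    Permissions=["SELECT", "INSERT", "ALTER", "DELETE", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "INSERT", "ALTER", "DELETE", "DESCRIBE"],
)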


When you grant permissions to another account, Lake Formation creates resource shares in AWS Resource Access Manager (AWS RAM) to authorize all the required IAM layers between the accounts. To validate a share, sign in to the AWS RAM console as the EDLA account and verify that the resources are shared.

The first time you create a share, you see three resources:

  • The AWS Glue Data Catalog in the EDLA
  • The database containing the tables you shared
  • The table resource itself

You only need one share per resource, so multiple database shares only require a single Data Catalog share, and multiple table shares within the same database only require a single database share.

For the share to appear in the catalog of the receiving account (in our case the LOB-A account), the AWS RAM admin must accept the share by opening the share on the Shared With Me page and accepting it.

If both accounts are part of the same AWS organization and the organization admin has enabled automatic acceptance on the Settings page of the AWS Organizations console, then this step is unnecessary.
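
Accepting the share can also be automated from the receiving account. The following sketch simply accepts every pending invitation and is meant as an illustration, not a hardened script.

import boto3

ram = boto3.client("ram")

# List invitations shared with this (LOB-A) account and accept any that are pending.
invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
for invitation in invitations:
    if invitation["status"] == "PENDING":
        ram.accept_resource_share_invitation(
            resourceShareInvitationArn=invitation["resourceShareInvitationArn"]
        )
        print(f"Accepted share: {invitation['resourceShareName']}")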

If your EDLA Data Catalog is encrypted with a KMS customer master key (CMK), make sure to add the LOB-A producer account root user as a key user for this key, so the LOB-A producer account can access the EDLA Data Catalog for read and write permissions with its local IAM and KMS policies. Data encryption keys don’t need any additional permissions, because the LOB accounts use the Lake Formation role associated with the registration to access objects in Amazon S3.

When you sign in with the LOB-A producer account to the AWS RAM console, you should see the EDLA shared database details, as in the following screenshot.

Create a database resource link in the LOB-A producer account

Resource links are pointers to the original resource that allow the consuming account to reference the shared resource as if it were local. Because a resource link is only a pointer, any change to the underlying resource is instantly reflected in every account that references it; no synchronization is required and there is no latency between an update and its appearance in other accounts.

  1. Create a resource link to the shared Data Catalog database from the EDLA called shared_edla_lob_a.
  2. Grant full access to the AWS Glue role in the LOB-A producer account for this newly created shared database link from the EDLA so a producer AWS Glue job can create, update, and delete tables and partitions.

You need to perform two grants: one on the database shared link and one on the target to the AWS Glue job role. Granting on the link allows it to be visible to end-users. Data-level permissions are granted on the target itself.

  3. Create an AWS Glue job using this role to create and write data into the EDLA database and S3 bucket location.
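
The resource link and the two grants described above look roughly like the following boto3 sketch when run from the LOB-A producer account. The EDLA account ID and the AWS Glue job role ARN are placeholders, and the permission sets are illustrative.

import boto3

glue = boto3.client("glue")
lf = boto3.client("lakeformation")

EDLA_ACCOUNT_ID = "444455556666"                                        # placeholder
GLUE_JOB_ROLE = "arn:aws:iam::111122223333:role/lob-a-glue-job-role"    # placeholder

# Create the resource link that points to the shared EDLA database.
glue.create_database(
    DatabaseInput={
        "Name": "shared_edla_lob_a",
        "TargetDatabase": {
            "CatalogId": EDLA_ACCOUNT_ID,
            "DatabaseName": "edla_lob_a",
        },
    }
)

# Grant on the resource link so the role can see it in the local catalog.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": GLUE_JOB_ROLE},
    Resource={"Database": {"Name": "shared_edla_lob_a"}},
    Permissions=["DESCRIBE"],
)

# Grant on the target (the shared EDLA database) for the data-level permissions.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": GLUE_JOB_ROLE},
    Resource={"Database": {"CatalogId": EDLA_ACCOUNT_ID, "Name": "edla_lob_a"}},
    Permissions=["CREATE_TABLE", "ALTER", "DROP", "DESCRIBE"],
)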

The AWS Glue table and S3 data are in a centralized location for this architecture, using the Lake Formation cross-account feature.

This completes the configuration of the LOB-A producer account remotely writing data into the EDLA Data Catalog and S3 bucket. You can create and share the rest of the required tables for this LOB using the Lake Formation cross-account feature.

Because your LOB-A producer created an AWS Glue table and wrote data into the Amazon S3 location of your EDLA, the EDLA admin can access this data and share the LOB-A database and tables to the LOB-A consumer account for further analysis, aggregation, ML, dashboards, and end-user access.

Share the database to the LOB-A consumer account

In the EDLA, you can share the LOB-A AWS Glue database and tables (edla_lob_a, which contains tables created from the LOB-A producer account) to the LOB-A consumer account (in this case, the entire database is shared).

Next, go to the LOB-A consumer account to accept the resource share in AWS RAM.

Accepting the shared database in AWS RAM of the LOB-A consumer account

Sign in with the LOB-A consumer account to the AWS RAM console. You should see the EDLA shared database details.

Accept this resource share request so you can create a resource link in the LOB-A consumer account.

Create a database resource link in the LOB-A consumer account

Create a resource link to a shared Data Catalog database from the EDLA as consumer_edla_lob_a.

Now, grant full access to the AWS Glue role in the LOB-A consumer account for this newly created shared database link from the EDLA so the consumer account AWS Glue job can perform SELECT data queries from those tables. You need to perform two grants: one on the database shared link and one on the target to the AWS Glue job role.

A grant on the resource link allows a user to describe (or see) the resource link, which allows them to point engines such as Athena at it for queries. A grant on the target grants permissions to local users on the original resource, which allows them to interact with the metadata of the table and the data behind it. Permissions of DESCRIBE on the resource link and SELECT on the target are the minimum permissions necessary to query and interact with a table in most engines.
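
Expressed as API calls, those minimum grants look like the following sketch; the consumer role ARN and EDLA account ID are placeholders.

import boto3

lf = boto3.client("lakeformation")

CONSUMER_GLUE_ROLE = "arn:aws:iam::777788889999:role/lob-a-consumer-glue-role"  # placeholder
EDLA_ACCOUNT_ID = "444455556666"                                                # placeholder

# DESCRIBE on the resource link makes it visible to the role in the local catalog.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_GLUE_ROLE},
    Resource={"Database": {"Name": "consumer_edla_lob_a"}},
    Permissions=["DESCRIBE"],
)

# SELECT on the target table (owned by the EDLA) allows the role to read the data.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_GLUE_ROLE},
    Resource={
        "Table": {
            "CatalogId": EDLA_ACCOUNT_ID,
            "DatabaseName": "edla_lob_a",
            "Name": "customer",
        }
    },
    Permissions=["SELECT"],
)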

Create an AWS Glue job using this role to read tables from the consumer database that is shared from the EDLA and for which S3 data is also stored in the EDLA as a central data lake store. This data is accessed via AWS Glue tables with fine-grained access using the Lake Formation cross-account feature.

This completes the process of granting the LOB-A consumer account remote access to data for further analysis.

This data can be accessed via Athena in the LOB-A consumer account. LOB-A consumers can also access this data using QuickSight, Amazon EMR, and Redshift Spectrum for other use cases.

 

De-centralized data lake design

In the de-centralized design pattern, each LOB AWS account has its own local compute, AWS Glue Data Catalog, Lake Formation, and S3 buckets for its LOB datasets. The EDLA still hosts the central Data Catalog for all LOB-related databases and tables, along with a central Lake Formation where all LOB-related S3 buckets are registered.

EDLA manages all data access (read and write) permissions for AWS Glue databases or tables that are managed in EDLA. It grants the LOB producer account write, update, and delete permissions on the LOB database via the Lake Formation cross-account share. It also grants read permissions to the LOB consumer account. The respective LOB’s local data lake admins grant required access to their local IAM principals.

Refer to the earlier details on how to share database, tables, and table columns from EDLA to the producer and consumer accounts via Lake Formation cross-account sharing via AWS RAM and resource links.

Each LOB account (producer or consumer) also has its own local storage, registered in its local Lake Formation, and a local Data Catalog with a set of databases and tables that are managed locally in that LOB account by its Lake Formation admins.

Clean up

To avoid incurring future charges, delete the resources that were created as part of this exercise.

Delete the S3 buckets in the following accounts:

  • Producer account
  • EDLA account
  • Consumer account (if any)

Delete the AWS Glue jobs in the following accounts:

  • Producer account
  • Consumer account (if any)

Lake Formation limitations

This solution has the following limitations:

  • The spark-submit action on Amazon EMR is not currently supported
  • AWS Glue Context does not yet support column-level fine-grained permissions granted via Lake Formation

Conclusion

This post describes how you can design enterprise-level data lakes with a multi-account strategy and control fine-grained access to their data using the Lake Formation cross-account feature. This can help your organization build highly scalable, high-performance, and secure data lakes, while keeping each LOB’s data easy to maintain from a single AWS account with full visibility into access logs and grant details.


About the Authors

Satish Sarapuri is a Data Architect, Data Lake at AWS. He helps enterprise-level customers build high-performance, highly available, cost-effective, resilient, and secure data lakes and analytics platform solutions, which includes streaming and batch ingestions into the data lake. In his spare time, he enjoys spending time with his family and playing tennis.

 

UmaMaheswari Elangovan is a Principal Data Lake Architect at AWS. She helps enterprise and startup customers adopt AWS data lake and analytic services, and increases awareness on building a data-driven community through scalable, distributed, and reliable data lake infrastructure to serve a wide range of data users, including but not limited to data scientists, data analysts, and business analysts. She also enjoys mentoring young girls and youth in technology by volunteering through nonprofit organizations such as High Tech Kids, Girls Who Code, and many more.

 

Zach Mitchell is a Sr. Big Data Architect. He works within the product team to enhance understanding between product engineers and their customers while guiding customers through their journey to develop data lakes and other data solutions on AWS analytics services.

 

 

 

Build secure encrypted data lakes with AWS Lake Formation

Post Syndicated from Daniela Dorneanu original https://aws.amazon.com/blogs/big-data/build-secure-encrypted-data-lakes-with-aws-lake-formation/

Maintaining customer data privacy, protection against intellectual property loss, and compliance with data protection laws are essential objectives of today’s organizations. To protect data against security threats, vulnerabilities within the organization, malicious software, or cyber criminality, organizations are increasingly encrypting their data. Although you can enable server-side encryption in Amazon Simple Storage Service (Amazon S3), you may prefer to manage your own encryption keys. AWS Key Management Service (AWS KMS) makes it easy to create, rotate, and disable cryptographic keys across a wide range of AWS services, including over your data lake in Amazon S3.

AWS Lake Formation is a one-stop service to build and manage your data lake. Among its many features, it allows you to discover and catalog data sources, set up transformation jobs, configure fine-grained data access and security policies, and audit and control access from data lake consumers. You can also apply column-level security, which is an imperative feature when you want to protect personally identifiable information (PII).

Using AWS KMS with Lake Formation requires several steps, which we discuss in this post. We create a complete solution for processing encrypted data using customer managed keys with Lake Formation, Amazon Athena, AWS Glue, and AWS KMS. We use an S3 bucket registered through Lake Formation, which only accepts encrypted data with customer managed keys. Additionally, we demonstrate how to easily restrict access to PII data for data analysis stakeholders.

To demonstrate the solution, we upload an encrypted document into the S3 bucket and run data transformations using AWS Glue. The processed data is stored back to Amazon S3 in encrypted form. We automated this solution using AWS CloudFormation to provide an end-to-end deployment of a data lake that supports encryption.

Solution overview

We use AWS CloudFormation to deploy the data transformation pipeline and explain all the configurations necessary to achieve end-to-end encryption of your data into a data lake.

The following diagram shows a generic infrastructure of a serverless data lake enhanced by encryption. Transformations such as removing duplicated or bad data are required. Afterward, we want to automatically catalog the data to use it with our consumers (through SQL querying, analytics dashboards, or machine learning services).

The reproducible pattern to support customer managed key encryption requires the following steps:

  1. Configure the S3 bucket to use server-side encryption.
  2. Set up a KMS key policy to allow the AWS Identity and Access Management (IAM) role for Lake Formation to use the key for encryption.
  3. Create the AWS Glue security configuration to specify the keys to use for encryption with AWS Glue.

Prerequisites

Before getting started, complete the following prerequisites:

  1. Sign in to the AWS Management Console and choose the US East (N. Virginia) Region for this sample deployment.
  2. Ensure that Lake Formation administrators are set up, and that default permissions for all newly created databases and tables go through Lake Formation.

Deploy the solution

To deploy the solution, complete the following steps:

  1. On the Lake Formation console, choose Add administrators.
  2. Add your current role and user as an administrator.
  3. In the navigation pane, under Data catalog, select Settings.
  4. Deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.

This makes sure that both IAM and Lake Formation permission modules are used.

  5. Choose Save.
  6. Download the content from the following GitHub repository. The repo should contain the following files:
    • The raw data sample file data.json
    • The AWS Glue script sample script.py
    • The CloudFormation template lakeformation_encryption_demo.yaml
  7. Create an S3 bucket in us-east-1 and upload the AWS Glue script.
  8. Record the script path to use as a parameter for the CloudFormation stack.

You now deploy the CloudFormation stack.

  1. Choose Launch Stack:
  2. Leave the default location for the template and choose Next.
  3. On the Specify stack details page, enter a stack name.
  4. For GlueJobScriptBucketPath, enter the bucket containing the AWS Glue script.
  5. For DataLakeBucket, enter the name of the bucket that the stack creates.
  6. On the Configure stack options page, choose Next.
  7. On the Review page, select the check boxes.
  8. Choose Create stack.

At this point, you have successfully created the resources for the Data Lake solution supporting end-to-end encryption.

The stack deploys an S3 bucket in which you upload the file, and registers that bucket within Lake Formation. An AWS Glue job transforms the data into Parquet format, and an AWS Glue crawler detects the schema of the processed data. Additionally, the stack deploys all the AWS KMS resources, which we describe in detail in the next section.

What is happening in the background?

In this section, we describe the encryption and decryption process in more detail. Namely, we talk about how encrypted data is uploaded to the S3 bucket, and the role the AWS Glue security configuration plays in configuring AWS Glue jobs and crawlers to use a particular KMS key.

KMS key

As shown in the following screenshot, the KMS key policy enables access for several IAM roles.

lake-formation-demo-role: Lake Formation is the central service managing access to the data. To enable the Lake Formation service to use the KMS key, we add the IAM role used to register the S3 bucket with Lake Formation to the key policy used within this solution.

demo-lake-formation-glue-job-role: The AWS Glue job role also needs to use the KMS key to encrypt the output data after running the ETL job.

demo-lake-formation-glue-crawler-role: Lastly, the AWS Glue crawler uses the KMS key to decrypt the data and infer the schema of the data.
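
A key policy statement granting these roles use of the key typically looks like the following sketch. The role names match those described above, but the account ID is a placeholder and the exact statement deployed by the CloudFormation template may differ.

import json

# Illustrative key policy statement allowing the Lake Formation, AWS Glue job,
# and AWS Glue crawler roles to use the CMK for encryption and decryption.
key_usage_statement = {
    "Sid": "AllowDataLakeRolesToUseTheKey",
    "Effect": "Allow",
    "Principal": {
        "AWS": [
            "arn:aws:iam::111122223333:role/lake-formation-demo-role",
            "arn:aws:iam::111122223333:role/demo-lake-formation-glue-job-role",
            "arn:aws:iam::111122223333:role/demo-lake-formation-glue-crawler-role",
        ]
    },
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey",
    ],
    "Resource": "*",
}

print(json.dumps(key_usage_statement, indent=2))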

Learn more about registering an S3 location to Lake Formation in the AWS documentation.

Amazon S3 storage uploads only encrypted data

The data lake S3 bucket has a bucket policy enforcing encryption on all the data uploaded to the bucket with the KMS key. This also allows any user to use their own KMS keys to encrypt the data. Additionally, teams within an organization can use different keys when uploading the data, supporting separation of access within an organization.

The following screenshot shows the S3 bucket policy implemented through the CloudFormation stack. The policy denies Amazon S3 Put API calls for objects that aren’t AWS KMS encrypted.
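 
A simplified version of such a policy, applied with boto3, might look like the following sketch. The bucket name is a placeholder, and the stack’s actual policy may contain additional statements.

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake-bucket"  # placeholder for the data lake bucket

# Deny any PutObject request that does not use SSE-KMS encryption.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedPuts",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
            },
        }
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))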

AWS Glue security configuration

An AWS Glue security configuration contains the properties needed when you read and write encrypted data. To create and view all AWS security configurations, on the AWS Glue console, choose Security configurations in the navigation pane.

A security configuration was added to the AWS Glue job and the crawler to configure what encryption key AWS Glue should use when running a job or a crawler.
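
Creating such a security configuration programmatically looks roughly like the following; the configuration name and key ARN are placeholders for the values created by the stack.

import boto3

glue = boto3.client("glue")
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/your-key-id"  # placeholder

# Security configuration that makes AWS Glue jobs and crawlers write SSE-KMS
# encrypted output to Amazon S3 and encrypt their CloudWatch logs.
glue.create_security_configuration(
    Name="lakeformation-demo-security-config",  # placeholder name
    EncryptionConfiguration={
        "S3Encryption": [
            {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": KMS_KEY_ARN}
        ],
        "CloudWatchEncryption": {
            "CloudWatchEncryptionMode": "SSE-KMS",
            "KmsKeyArn": KMS_KEY_ARN,
        },
        "JobBookmarksEncryption": {"JobBookmarksEncryptionMode": "DISABLED"},
    },
)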

Test the solution

In this section, we walk through the steps of the end-to-end encryption pipeline:

  1. Upload sample data to Amazon S3.
  2. Run the AWS Glue job.
  3. Give permissions to the AWS Glue crawler to the Amazon S3 location and run the crawler.
  4. Set up permissions for the new role to query the new table.
  5. Run an Athena query.

Upload sample data to Amazon S3

Use the following command to upload a sample file to Amazon S3:

aws s3 cp data.json s3://<DATA_LAKE_BUCKET_NAME>/raw/ --sse aws:kms --sse-kms-key-id  <LAKE_FORMATION_KMS_DATA_KEY>

For the <LAKE_FORMATION_KMS_DATA_KEY> value, enter the key ID of the KMS key with the alias lakeformation-kms-data-key, which you can find on the AWS KMS console.

In the preceding command, data.json is the file that we upload to Amazon S3, and we specify the prefix raw. While uploading, we provide the KMS key so the object is encrypted with that key.
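
If you prefer the SDK over the CLI, the equivalent upload with boto3 looks like this sketch; the bucket name and key ID are placeholders for your own values.

import boto3

s3 = boto3.client("s3")

# Upload the sample file with SSE-KMS, specifying the data key explicitly.
s3.upload_file(
    Filename="data.json",
    Bucket="my-data-lake-bucket",          # placeholder for <DATA_LAKE_BUCKET_NAME>
    Key="raw/data.json",
    ExtraArgs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": "your-kms-key-id",  # placeholder for <LAKE_FORMATION_KMS_DATA_KEY>
    },
)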

Run the AWS Glue job

We’re now ready to run our AWS Glue job.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Select the job lake-formation-demo-glue-job.
  3. On the Action menu, choose Run job.

When the job is complete, we should see the processed data in the S3 bucket you configured under the prefix processed. When we check the properties of the output file, we should see that the data is encrypted using the KMS key lakeformation-kms-data-key.
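
You can confirm the output encryption programmatically as well; this sketch assumes an illustrative bucket name and object key under the processed prefix.

import boto3

s3 = boto3.client("s3")

# Inspect one of the output objects written by the AWS Glue job.
response = s3.head_object(
    Bucket="my-data-lake-bucket",                # placeholder bucket
    Key="processed/part-00000.snappy.parquet",   # illustrative object key
)

print(response.get("ServerSideEncryption"))  # expected: 'aws:kms'
print(response.get("SSEKMSKeyId"))           # ARN of lakeformation-kms-data-key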

Give permissions to the AWS Glue crawler and run the crawler

We now give permissions to the AWS Glue crawler to access Amazon S3, and then run the crawler.

  1. On the Lake Formation console, under Permissions, choose Data locations.
  2. Choose Grant.
  3. Select My account.
  4. For IAM users and roles, choose demo-lake-formation-glue-crawler-role.
  5. For Storage locations, choose the S3 bucket where your data is stored.
  6. For Registered account location, enter the current account number.
  7. Choose Grant.

This step is required for the crawler to have permissions to the Amazon S3 location where the data to be crawled is stored.

  1. On the AWS Glue console, choose Crawlers.
  2. Select the configured crawler and choose Run crawler.

The crawler infers the schema of the processed data, and a new table is now visible within the database: lakeformation-glue-catalog-db.

This table is also visible on the Lake Formation console.

Set up permissions for the current role to query the table

Next, we configure Athena to have the proper rights to query this newly created table over the encrypted data.

One advantage of using Lake Formation to set up permissions is the ability to restrict access to PII in order to stay compliant and protect the privacy of your customers. For this post, we restrict access to all columns in the processed table other than symbol.

  1. On the Lake Formation console, under Data catalog, choose Tables.
  2. Select the processed table.
  3. On the Actions menu, choose Grant.
  4. Select My account.
  5. For IAM users and roles, choose the current user/role.
  6. For Column-based permissions, choose Include columns.
  7. For Include columns, choose the column symbol.
  8. For Table permissions, select Select.
  9. Choose Grant.

Run an Athena query

We can now query the database with Athena.

  1. On the Athena console, choose the database lakeformation-glue-catalog-db.
  2. Choose the options icon next to the processed table and choose Preview table.
  3. Enter the following query:
    SELECT *
    FROM "lakeformation-glue-catalog-db"."processed" limit 10;

  4. Choose Run query.

The following screenshot shows our output, in which we can see the value of the symbol column. The other columns aren’t visible due to the column-level security configuration.

Further steps

We can also enable encryption at rest for the Athena results, meaning that Athena encrypts the query results in Amazon S3. For more information, see Encrypting Query Results Stored in Amazon S3.

Summary

In this post, we addressed the use case of customers with strict regulatory restrictions that require end-to-end data encryption to comply with their country regulations. Additionally, we set up a data lake to support column-level security to restrict access to PII within tables. We included a step-by-step guide and automated the solution with AWS CloudFormation to deploy it promptly.

If you need any help in building data lakes, please reach out to AWS Professional Services. If you have questions about this post, let us know in the comments section, or start a new thread on the Lake Formation forum.


About the Authors

Daniela Dorneanu is a Data Lake Architect at AWS. As part of Professional Services, Daniela supports customers hands-on to get more value out of their data. Daniela advocates for inclusive and diverse work environments, and she is co-chairing the Software Engineering conference track at the Grace Hopper Celebration, the largest gathering of women in Computing.

 

 

Muhammad Shahzad is a Professional Services consultant who enables customers to implement DevOps by explaining principles, delivering automated solutions and integrating best practices in their journey to the cloud.

Securely analyze your data with AWS Lake Formation and Amazon QuickSight

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/big-data/securely-analyze-your-data-with-aws-lake-formation-and-amazon-quicksight/

Many useful business insights can arise from analyzing customer preferences, behavior, and usage patterns. With this information, businesses can innovate faster and improve the customer experience, leading to better engagement and accelerating product adoption. More and more businesses are looking for ways to securely store and restrict access to customer data, which may include personally identifiable information (PII) and other sensitive information. Accessing customer data for use cases such as analytics and Machine Learning requires careful management of access permissions and the ability to audit usage.

This post explores the ways in which you can securely store, analyze, and visualize sensitive customer data. We start by storing encrypted sample test data in our Amazon Simple Storage Service (Amazon S3) based data lake. We use AWS Lake Formation to configure fine-grained permissions to restrict user access, and finally analyze the data and visualize business insights using Amazon QuickSight.

Prerequisites

For this post, you should be familiar with the following:

The AWS CloudFormation template associated with this post automatically sets up the different architecture components. You then need to take additional steps to configure user permissions directly in Lake Formation. If you already manage access using AWS Glue resource permissions and IAM, launching this CloudFormation template in the same account could result in access being denied to existing users. We recommend launching this template in a new AWS account or one not running business-critical workloads.

Architecture overview

The following diagram illustrates our solution’s architecture.

The workflow includes the following steps:

  1. Use Lake Formation to create a data lake on Amazon S3 that is protected with IAM and encrypted with AWS Key Management Service (AWS KMS).
  2. AWS Glue crawlers scan your datasets and populate the Data Catalog.
  3. The Data Catalog serves as a central repository to store the metadata of the datasets.
  4. Athena accesses the data for ad hoc queries, using the Data Catalog.
  5. You can securely visualize your data with QuickSight.

Use case and business value

The following use case helps illustrate the challenge we’re trying to solve.

Assume you run an ecommerce company and to help improve the customer experience, you need to collect and store customers’ purchasing history. It’s common for the credit card holder to be different than the individual purchasing a product. If you can identify the relationship between the account holder and the cardholder, you may be able to create targeted recommendations.

For example, the account holder and cardholder share the same last name and home address but have a different first name, and the cardholder’s age is greater. You can use this information to deduce with high certainty that the account holder is using the credit card of a family member. You could analyze the account holder’s purchasing history and correlate it with third-party information collected on the cardholder to create a holistic view of the customer and their circle of close family influencers.

With this new information, you can now make personal product recommendations to improve the customer experience. In the following sections, we demonstrate how to accomplish this by setting up a secure data lake, encrypting data at rest, masking sensitive fields, and restricting access to data using fine-grained permissions.

For similar customer use cases related to security and data analytics on AWS, see AWS Cloud Security.

Set up the environment

After you sign in to your test AWS account, launch the CloudFormation template by choosing Launch Stack:  

This template configures the following resources:

  • An encrypted S3 bucket representing the data lake that is registered with Lake Formation
  • Two IAM groups that represent the data lake admin and analyst
  • Two IAM users ([email protected] and [email protected])
  • IAM roles and policies that go along with these services
  • An AWS Glue crawler to populate the Data Catalog with tables
  • An AWS Glue role for running within the data lake
  • An AWS Glue extract, transform, and load (ETL) job to mask the credit card information
  • A KMS customer master key (CMK) for the Amazon S3 server-side encryption (SSE)

When you launch the template, enter values for the following parameters:

  • AdminPassword – An initial password for the [email protected] user
  • AnalystPassword – An initial password for the [email protected] user
  • BucketNameParam – A unique name to be used for the data lake bucket
  • CFNExecRoleOrUserArnParam – The ARN of the principal (user or role) running the CloudFormation stack. You can find the ARN of the logged-in user on the IAM console.

The passwords you provide must comply with your account’s password policy. When you later log in as those users, you’re asked to reset these passwords. Make note of the passwords you choose.

Configure your data lake’s security

This section doesn’t cover all the available security offerings in AWS, but rather focuses on restricting user access and protecting data at rest. You should always start by securing all access to your data and resources. It’s important to create policies that use least privilege from the start, even before any data lands in the data lake. You can do this with IAM and AWS KMS.

Set up IAM access policies for your users

First, consider the types of users that access the AWS account that hosts a data lake and use analytic services to access the data assets in the data lake. For this post, we discuss two types of AWS users:

  • Data lake administrator – Responsible for configuring the data lake and managing Lake Formation permissions to the data lake users
  • Data analyst – Responsible for data visualization

When you work in your production account (not the test account), you should also consider a third user type: the data engineer. The data engineer is responsible for data transformation, loading data into Amazon S3, and making the necessary changes to the data structures so the analysts don’t have access to sensitive data directly. For this post, the data lake administrator acts as the data engineer.

On the IAM console, on the Users page, you can see the following users created by the CloudFormation template.

The recommended best practice is to provide your users with the least access privileges.

For central access control of data in the data lake, we use the Lake Formation permission model. This requires us to designate a data lake administrator who can grant any principal (including themselves) any permission on any Data Catalog resource and data location. The designated data lake administrator can then grant more granular permissions on resources to other principals. The permission scope of the data lake administrator should be limited to managing the data lake within Lake Formation and the AWS Glue Data Catalog. The data lake administrator should be able to add existing IAM users and roles to the data lake, but not actually create the IAM users. The permission scope of the data engineer user should be limited to the IAM policies required for them to successfully do their job, following the principle of least privilege. Specifically, this means running ETL jobs to mask sensitive information from the data records prior to analysis. Also, because the aim of the use case is to maintain compliance, it’s imperative that no individual user has access to cardholder data, either visually or programmatically.

It’s considered a best practice to use AWS managed policies whenever possible to reduce the operational overhead of maintaining your own policies. The data lake admin group role should be granted the AWSLakeFormationDataAdmin, AWSGlueConsoleFullAccess, and AmazonS3ReadOnlyAccess managed policies to administer data pipelines with AWS Glue, configure Lake Formation permissions, run AWS Glue crawlers, and store data on Amazon S3. For more information about IAM managed permissions, see Data Lake Administrator Permissions.

Encryption at rest within the data lake

In addition to using IAM to delegate permissions to your data lake, we recommend encrypting the data at rest. In this post, Amazon S3 acts as the basis for storage of our data lake. Amazon S3 offers several options for encryption based on your preference and compliance requirements. For more information, see Protecting data using encryption.

In this post, we use server-side encryption with a CMK because it provides low operational overhead. SSE-KMS uses CMKs stored in AWS KMS to protect customer data. AWS KMS provides fine-grained audit and access control over requests for encryption keys. SSE-KMS and AWS Glue both use data keys protected by your CMK. You can view the default encryption details by navigating to the Amazon S3 console, choosing the bucket created as part of the CloudFormation template, and viewing its properties.

Encryption in transit within the data lake

After you secure the data at rest, let’s make sure that the S3 service endpoints are protected with SSL. By default, public S3 service endpoints can be accessed via HTTPS or HTTP. Therefore, to ensure that traffic is encrypted in transit, use the default HTTPS endpoint for the Amazon S3 API. To ensure that data in Amazon S3 is protected with SSL, you must force SSL on the bucket using a bucket policy. This policy is enforced on all of your buckets where encryption is required to meet your security needs. For details on encrypting your data in transit as you move your data into your data lake in the AWS Well-Architected Framework, see How do you anticipate, respond to, and recover from incidents?
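
A typical statement for forcing SSL looks like the following sketch (the bucket name is a placeholder); you would attach it, alongside your other statements, to each bucket that must meet this requirement.

# Illustrative bucket policy statement that denies any request made over plain HTTP.
force_ssl_statement = {
    "Sid": "ForceSSLOnlyAccess",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": [
        "arn:aws:s3:::my-data-lake-bucket",     # placeholder bucket
        "arn:aws:s3:::my-data-lake-bucket/*",
    ],
    "Condition": {"Bool": {"aws:SecureTransport": "false"}},
}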

Preprocess data to restrict access to sensitive data

Not all consumers of the data are allowed to see sensitive information, so we need to split the data in a way that gives us more control over how data is exposed to users. To do that, we transform it using PySpark running in AWS Glue. This allows us to aggregate the data for our analysts and mask the sensitive information before saving it back to Amazon S3 for analysis.

Configure Lake Formation

First, we use Lake Formation to create a central data lake repository on Amazon S3 to store and analyze your data.

  1. On the Lake Formation console, under Data Catalog, choose Settings.
  2. Deselect the two check boxes associated with the Data Catalog using only IAM permissions.

To maintain backward compatibility with AWS Glue, these settings are enabled by default for new Data Catalog resources. These settings effectively cause access to Data Catalog resources to be controlled solely by IAM policies. Deselect the permissions so that you can give individual permissions to Data Catalog resources from Lake Formation.

  1. Choose Save.

Next, we revoke generic IAM principal access to tables in our database. This ensures that only permissions granted through Lake Formation apply.

  1. On the Lake Formation console, choose Administrative roles and tasks.
  2. Under Database Creators, select IAMAllowedPrincipals.
  3. Choose Revoke.
  4. For Catalog permissions, select Create database.
  5. Leave all other settings at their default and choose Revoke.

We now need to revoke permissions for IAMAllowedPrincipals.

  1. On the Data permissions page, revoke all grants to the group IAMAllowedPrincipals.
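
These console steps can also be scripted. The following hedged sketch revokes database creation from the IAMAllowedPrincipals virtual group and turns off the IAM-only defaults for new resources; the exact set of existing grants you need to revoke depends on your catalog.

import boto3

lf = boto3.client("lakeformation")

# Revoke the ability of the IAMAllowedPrincipals virtual group to create databases.
lf.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    Resource={"Catalog": {}},
    Permissions=["CREATE_DATABASE"],
)

# Turn off the IAM-only default permissions for newly created databases and tables.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings["CreateDatabaseDefaultPermissions"] = []
settings["CreateTableDefaultPermissions"] = []
lf.put_data_lake_settings(DataLakeSettings=settings)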

Lake Formation is now the single pane of glass for data governance within your data lake. To configure user permissions in Lake Formation, you must be a data lake admin. The CloudFormation template already created [email protected] as our data lake admin. When you’re logged in as the admin, you need to grant them the ability to manage permissions for users.

  1. On the IAM console, choose Users.
  2. Choose the [email protected] user.
  3. On the Security Credentials tab, copy the link for that user to log in.
  4. Open the link in a new browser or private browser window.
  5. Reset the password (on your first login).
  6. On the Lake Formation console, choose Data permissions.
  7. Choose Grant.
  8. Make sure the admin user has both database and grantable Super permissions on the db1 database.
  9. On the Databases page, select the db1 database.
  10. On the Actions menu, choose Edit.
  11. Choose the S3 bucket created as part of the CloudFormation stack as the database location.

The naming convention of the S3 bucket is <account_number><region_name><bucket_name>. For example, it should appear similar to 111122223333virginiamybucketname. Do not choose the bucket name ending with athenaoutput.

  1. Deselect Use only IAM access control for new tables in this database.
  2. Choose Save.

After this step, if you see IAMAllowedPrincipals under Data permissions, follow the steps explained earlier to revoke those permissions.

The next permission we want to grant is the ability for our AWS Glue execution role to create new tables in our db1 database.

  1. On the Data permissions page, choose Grant.
  2. For IAM users and roles, choose the AWS Glue role created as part of the CloudFormation stack.
  3. For Database, choose the db1 database.
  4. For Database permissions, select Create table.
  5. Make sure that no options are selected for Grantable permissions.
  6. Choose Grant.

Now that the data lake admin is set up and Lake Formation is managing permissions, we can work on creating table definitions for the cards, customers, and sales data in the Lake Formation Data Catalog. Let’s verify the files that the CloudFormation template created in the S3 bucket folders.

  1. On the Amazon S3 console, choose the bucket that you chose for the db1 location.

The following CSV files are in their respective folders cards, customers, and sales:

  • cards.csv
  • customers.csv
  • sales.csv

Now that we’ve verified the files, let’s catalog it in the Lake Formation Data Catalog using AWS Glue crawlers.

  1. On the AWS Glue console, choose Crawlers.
  2. Select the crawler security-blog-crawler and choose Run crawler.

This crawler was created by the CloudFormation template. It can crawl multiple data stores like cards, customers, and sales to populate the Data Catalog.

After you run the crawler, you should see the tables in the Data Catalog. To view the tables, switch to the Lake Formation console, choose Databases, select the db1 database, and choose View tables.

Configure data access controls

Now that our datasets are cataloged, it’s important to define access controls to ensure only authorized users are allowed to see sensitive information. We use Lake Formation to define these fine-grained controls for users who need access to data in the data lake.

Configure data encryption for AWS Glue

We use an AWS Glue security configuration to secure data when it’s accessed by AWS Glue crawlers and ETL jobs. The security configuration encrypts the data that the ETL jobs write to Amazon S3 targets, as well as the logs sent to Amazon CloudWatch. The security settings were automatically configured by the CloudFormation template and can be viewed on the AWS Glue console.

Process the result set

Our dataset includes information about our customer demographics and references between customers that share credit cards when making purchases. We develop a simple job using PySpark to combine the purchasing user’s information with the cardholder. You can perform other transformations and enrichment such as masking sensitive fields or looking up additional details in external systems. When the job is complete, it outputs the data in columnar format to give us better performance and reduce cost when we later analyze it. For more information on how this process works and the tools required, see How to extract, transform, and load data for analytic processing using AWS Glue (Part 2).

To demonstrate this capability in action, you run the AWS Glue ETL jobs created by the CloudFormation template. To run the script, you log in as an admin user, but ideally, you should have a data engineer managing the ETL at this point. For the sake of simplicity, we configured the data lake administrator to have these permissions.

Let’s run the ETL jobs to clean the cards and sales data. They create new files under the clean_cards and clean_sales S3 folders with the modifications. We start with cleaning the card data. The job replaces full card numbers with the last four digits of the card numbers and creates a new file in the clean_cards folder.
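
The masking step itself is a small transformation in PySpark. The following sketch shows the general idea with illustrative column names and paths; it is not the exact script deployed by the CloudFormation stack.

from pyspark.context import SparkContext
from pyspark.sql import functions as F
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the raw cards data (illustrative path and column names).
cards = spark.read.csv("s3://my-data-lake-bucket/cards/", header=True)

# Keep only the last four digits of the card number.
masked = cards.withColumn(
    "card_number", F.substring(F.col("card_number"), -4, 4)
)

# Write the masked data to the clean_cards prefix in columnar (Parquet) format.
masked.write.mode("overwrite").parquet("s3://my-data-lake-bucket/clean_cards/")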

  1. Make sure you’re signed in as the data lake admin with username [email protected].
  2. On the AWS Glue console, choose Jobs.
  3. Select the job clean_cards_data and on the Action menu, choose Run job.
  4. Expand Security configuration, script libraries, and job parameters.
  5. Under Job parameters, add the key --output_s3_bucket_name and the value as the bucket name that contains the CSV files.
  6. Choose Run job.

Next, we clean up our sales data. The dollar amounts for the purchase prices are casted as strings with a dollar sign ($) in them. To make analytics easier downstream, we want to have those casted as decimals without the dollar signs.

  1. Follow the same procedure to run the clean_sales_data job.

Now that we have generated our clean cards and clean sales data in the S3 bucket, we run security-blog-crawler to add the clean cards and clean sales tables to our Data Catalog.

  1. In the navigation pane, choose Crawlers.
  2. Select the crawler called security-blog-crawler and choose Run crawler.

Now that we have our new tables with masked card data and cleaned sales data, you grant the analyst user permission to access it in Lake Formation.

  1. On the Lake Formation console, grant the Select permission to the clean_cards and clean_sales tables for the user [email protected].

This completes the permissions scope for the analyst user.

Query and visualize data with QuickSight

Now that our data is ready, we use QuickSight to visualize the data. We first add [email protected] to QuickSight as an author who can create different analyses for viewers. We use Lake Formation fine-grained permissions to grant secure access to the analyst user, who can then prepare analyses in QuickSight.

You need to be logged in as your account administrator, not the analyst or data lake admin. Your account must be subscribed to QuickSight as an Enterprise user to enable integration with Lake Formation fine-grained access control.

Enable fine-grained permission for QuickSight users

Fine-grained permissions defined in Lake Formation are mapped to QuickSight authors or a group of viewers. To enable fine-grained permissions for viewers, you must first add them to a QuickSight group that can be configured in Lake Formation with fine-grained permissions. For this post, we create the QuickSight user [email protected].

  1. Sign in to your AWS account with the admin user (not the Lake Formation admin user).
  2. In a new tab, open the QuickSight console.
  3. Choose the logged-in user and choose Manage QuickSight.
  4. Add the user [email protected] with the author role.

Now we create the QuickSight group Analyst and add the QuickSight user [email protected] to the group. We use the AWS Command Line Interface (AWS CLI) for this purpose.

  1. Run the following command to create the group (provide your account ID):
    aws quicksight create-group --aws-account-id=<account_id> --namespace=default --group-name="Analyst" --description="Data Analyst Group"

  2. Run the following command to add the user to the group:
    aws quicksight create-group-membership --group-name 'Analyst' --aws-account-id <account_id> --namespace default [email protected]

  3. Run the following command to get the ARN of the QuickSight group:
    aws quicksight describe-group --group-name=Analyst --aws-account-id <account_id> --namespace=default

  4. Make a note of this ARN.

We use this ARN to configure access permissions to this QuickSight group in Lake Formation.

Because we configured the data lake bucket and Athena output bucket with CMKs, we need to grant the following key operations to the QuickSight role.

  1. Enter the following AWS CLI command to grant these key operations to the QuickSight service role, which is created when you subscribe to QuickSight (also provide the KMS key ID created by the CloudFormation stack):
    aws kms create-grant --key-id <kms-key> --grantee-principal arn:aws:iam::<accountid>:role/service-role/aws-quicksight-s3-consumers-role-v0 --operations Decrypt Encrypt DescribeKey GenerateDataKey GenerateDataKeyPair

  2. Sign in with the [email protected] user.
  3. On the Lake Formation console, choose Data permissions.
  4. Choose Grant.
  5. For SAML and Amazon QuickSight users and groups, enter the Analyst group ARN you copied earlier.
  6. For Database, choose db1.
  7. For Tables, choose clean_cards and clean_sales.
  8. For Table permissions, select Select.
  9. Choose Grant.

Now let’s grant permissions to the customers table by excluding the address and email fields.

  1. On the Data permissions page, choose Grant.
  2. For SAML and Amazon QuickSight users and groups, enter the Analyst group ARN you copied earlier.
  3. For Database, choose db1.
  4. For Tables, choose customers.
  5. For Columns, choose Exclude columns.
  6. For Exclude columns, choose address and email.
  7. For Table permissions, select Select.
  8. Choose Grant.

Now the Analyst group has the Lake Formation permission for the proper datasets.
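
For reference, the column-excluding grant can also be issued through the Lake Formation API. The group ARN below is a placeholder for the Analyst group ARN you noted earlier.

import boto3

lf = boto3.client("lakeformation")

ANALYST_GROUP_ARN = (
    "arn:aws:quicksight:us-east-1:111122223333:group/default/Analyst"  # placeholder
)

# Grant SELECT on the customers table while excluding the address and email columns.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": ANALYST_GROUP_ARN},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "db1",
            "Name": "customers",
            "ColumnWildcard": {"ExcludedColumnNames": ["address", "email"]},
        }
    },
    Permissions=["SELECT"],
)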

Additional security measures

In addition to configuring permissions for QuickSight, you can implement security measures to ensure that you’re visualizing sensitive data properly.

  1. On the QuickSight console, choose Security & permissions.

Because we’re using Athena to connect to source data in the data lake, we grant write permissions to the S3 bucket for the Athena query result output. The naming convention of the Athena S3 bucket is <account_number><region_name><bucket_name>. For example, it should appear similar to 111122223333virginiamybucketname.

Visualize the user data

This post would be incomplete without some data science that demonstrates how we can create correlations between users. We have secured our infrastructure, scrubbed our sensitive data in preprocessing, output the results to columnar tables in our AWS Glue Data Catalog managed by Lake Formation, and configured security settings in QuickSight. The next step is to import the dataset into QuickSight to analyze the data. For more information, see Working with Datasets.

In our use case, we logically grouped our users and cardholders into a columnar format in preprocessing, which we can now use to sort through and derive value. In QuickSight, we use the heat map visualization to map the buyer to the number of times they used someone else’s credit card. This gives us a visual of which users used different credit cards the most.

  1. Sign in using [email protected].
  2. On the QuickSight console, choose New analysis.
  3. Choose New datasets.
  4. For Create a Dataset, choose Athena.
  5. Enter ABCCompany Analysis as the data source.
  6. Choose Create data source.
  7. Choose the database db1.

As shown in the following screenshot, you’re only shown the table permissions granted to the QuickSight Analyst group.

  1. Choose Use custom SQL.
  2. Enter the query name as Sales-Cards-Query.

We now run queries to verify that the analyst user doesn’t have access to the cards and sales tables.

  1. Enter the following query for the cards table:
    SELECT * from "db1"."cards"

The following screenshot shows that we get a permission error.

  1. Similarly, you can verify the permissions for the sales table by running the following query. You should see the same permission error as for the cards table.
    SELECT * from "db1"."sales"

  2. Enter the following query for the customers table:
    SELECT * from "db1"."customers"

The following screenshot shows that the analyst only has access to customer fields other than address and email.

The analyst only has access to the clean_sales, clean_cards, and customers tables (excluding the address and email fields).

  1. Enter the following SQL and choose Edit/Preview Data:
    SELECT "db1"."clean_sales"."customer_id", COUNT("db1"."clean_sales"."customer_id") as num_diff_card_used FROM "db1"."clean_sales" JOIN "db1"."customers" ON "db1"."clean_sales"."customer_id"="db1"."customers"."customer_id" AND "db1"."clean_sales"."card_id" != "db1"."customers"."card_id" GROUP BY "db1"."clean_sales"."customer_id" ORDER BY num_diff_card_used DESC

The following screenshot shows our query results.

  1. Choose Save & visualize to create a visualization.
  2. Choose + Add and choose Add visual.
  3. Choose the heat map visual type.
  4. Set Rows to customer_id.
  5. Set Columns and Values to num_diff_card_used.
  6. On the Values drop-down menu, choose Aggregate sum.

The following screenshot shows our QuickSight analysis. You can change the color by choosing Format visual.

From this visualization, we can see that several customers are making purchases with more than one card that they don’t own. We can also add further visualizations that add more context to our data like customer IDs and the total number of purchases made with cards that customers don’t own.

The following are some of the additional datasets and visualizations that you can add to your analysis.

This data may provide valuable insights into the relationships between users and also provide a starting point for forensic investigations into customers that may be making fraudulent purchases.

For instructions on creating a similar dataset, see Creating a Dataset Using Amazon Athena Data. For instructions on creating visualizations with your datasets, see Creating an Amazon QuickSight Visual.

Clean up

When you’re done with this exercise, complete the following steps to delete your resources and stop incurring costs:

  1. On the Amazon S3 console, delete the objects in the buckets created by the CloudFormation stack.
  2. On the CloudFormation console, select your stack and choose Delete.

This cleans up all the resources created by the stack.

  1. Cancel your QuickSight account.

Conclusion

In this post, we used an example of credit card purchases to discuss different ways to protect sensitive data, based on separation of responsibilities with least privileges, encryption, and fine-grained access control. With AWS, you gain the control and confidence you need to securely run your business with the most flexible and secure cloud computing environment available today. To show this in action, we created separate IAM users and granted permissions based on the principle of least privilege to allow users to perform their duties. We reviewed the required AWS KMS configuration to protect data at rest using server-side encryption. We used AWS Glue to prepare the data, remove sensitive information, and further protect data access using Lake Formation’s fine-grained access controls. After we applied the security controls, we showed you how the analyst user can safely create different analyses using QuickSight for visualization.




About the Authors

Julia Soscia is a Solutions Architect Manager with Amazon Web Services on the Startup team, based out of New York City. Her main focus is to help startups create well-architected environments on the AWS cloud platform and build their business. She enjoys skiing on the weekends in Vermont and visiting the many art museums across New York City.

 

 

 

Mitesh Patel is a Senior Solutions Architect at AWS. He works with customers in SMB to help them develop scalable, secure and cost effective solutions in AWS. He enjoys helping customers in modernizing applications using microservices and implementing serverless analytics platform.

Effective data lakes using AWS Lake Formation, Part 1: Getting started with governed tables

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-1-effective-data-lakes-using-aws-lake-formation-part-1-getting-started-with-governed-tables/

Thousands of customers are building their data lakes on Amazon Simple Storage Service (Amazon S3). You can use AWS Lake Formation to build your data lakes easily—in a matter of days as opposed to months. However, there are still some difficult challenges to address with your data lakes:

  • Supporting streaming updates and deletes in your data lakes, for example, database replication, and supporting privacy regulations such as GDPR and CCPA
  • Achieving fine-grained secure sharing not only with table- or column-level access control, but with row-level access control
  • Optimizing the layout of various tables and files on Amazon S3 to improve analytics performance

We announced Lake Formation transactions, row-level security, and acceleration for preview at AWS re:Invent 2020. These capabilities are available via new, open, and public update and access APIs for data lakes. These APIs extend the governance capabilities of Lake Formation with row-level security, and provide transaction semantics on data lakes.

In this series of posts, we provide step-by-step instructions for using these new Lake Formation features. In this post, we focus on the first step: setting up governed tables.

Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) AWS Region. To get early access to these capabilities, sign up for the preview. You need to be approved for the preview to gain access to these features.

Governed Table

The Data Catalog supports a new type of metadata table: the governed table. Governed tables are unique to Lake Formation. They are a new Amazon S3 table type that supports atomic, consistent, isolated, and durable (ACID) transactions. Lake Formation transactions simplify ETL script and workflow development, and allow multiple users to concurrently and reliably insert, delete, and modify rows across multiple governed tables. Lake Formation automatically compacts and optimizes storage of governed tables in the background to improve query performance. When you create a table, you can specify whether or not it is governed.

Setting up resources with AWS CloudFormation

In this post, I demonstrate how you can create a new governed table using existing data on Amazon S3. We use the Amazon Customer Reviews Dataset, which is stored in a public S3 bucket as sample data. You don’t need to copy the data to your bucket or worry about Amazon S3 storage costs. You can just set up a governed table pointing to this existing public data to see how it works.

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. If you prefer setting up resources on the AWS Management Console rather than AWS CloudFormation, see the instructions in the appendix at the end of this post.

The CloudFormation template generates the following resources:

To create your resources, complete the following steps:

  1. Sign in to the CloudFormation console in us-east-1 Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For DatalakeAdminUserName and DatalakeAdminUserPassword, enter your IAM user name and password for the data lake admin user.
  5. For DataAnalystUserName and DataAnalystUserPassword, enter your IAM user name and password for the data analyst user.
  6. For DatabaseName, leave as the default.
  7. Choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation can take up to 2 minutes.

Setting up a governed table

Now you can create and configure your first governed table in AWS Lake Formation.

Creating a governed table

To create your governed table, complete the following steps:

  1. Sign in to the Lake Formation console in us-east-1 Region using the DatalakeAdmin1 user.
  2. Choose Tables.
  3. Choose Create table.
  4. For Name, enter amazon_reviews_governed.
  5. For Database, enter lakeformation_tutorial_amazon_reviews.
  6. Select Enable governed data access and management.
  7. Select Enable row based permissions.

  8. For Data is located in, choose Specified path in another account.
  9. Enter the path s3://amazon-reviews-pds/parquet/.
  10. For Classification, choose PARQUET.
  11. Choose Upload Schema.
  12. Enter the following JSON array into the text box:
[
    {
        "Name": "marketplace",
        "Type": "string"
    },
    {
        "Name": "customer_id",
        "Type": "string"
    },
    {
        "Name": "review_id",
        "Type": "string"
    },
    {
        "Name": "product_id",
        "Type": "string"
    },
    {
        "Name": "product_parent",
        "Type": "string"
    },
    {
        "Name": "product_title",
        "Type": "string"
    },
    {
        "Name": "star_rating",
        "Type": "int"
    },
    {
        "Name": "helpful_votes",
        "Type": "int"
    },
    {
        "Name": "total_votes",
        "Type": "int"
    },
    {
        "Name": "vine",
        "Type": "string"
    },
    {
        "Name": "verified_purchase",
        "Type": "string"
    },
    {
        "Name": "review_headline",
        "Type": "string"
    },
    {
        "Name": "review_body",
        "Type": "string"
    },
    {
        "Name": "review_date",
        "Type": "bigint"
    },
    {
        "Name": "year",
        "Type": "int"
    }
]
  13. Choose Upload.
  14. Choose Add column.
  15. For Column name, enter product_category.
  16. For Data type, choose String.
  17. Select Partition Key.
  18. Choose Add.
  19. Choose Submit.

Now you can see that the new governed table has been created.

When you choose the table name, you can see the details of the governed table, and you can also see Governance: Enabled in this view. It means that it’s a Lake Formation governed table. If you have other existing tables, it should show as Governance: Disabled because the tables are not governed tables.

You can also see lakeformation.aso.status: true under Table properties. It means that automatic compaction is enabled for this table. For this post, we use a read-only table and don’t utilize automatic compaction. To disable the automatic compaction, complete the following steps:

  1. Choose Edit table.
  2. Deselect Automatic compaction.
  3. Choose Save.

Currently, no data and no partitions are registered to this governed table. In the next step, we register existing S3 objects to the governed table using Lake Formation manifest APIs.

Even if you locate your data in the table location of the governed table, the data isn’t recognized yet. To make the governed table aware of the data, you need to make a Lake Formation API call, or use an AWS Glue job with Lake Formation transactions.

Configuring Lake Formation permissions

You need to grant Lake Formation permissions on your governed table. Complete the following steps; a scripted sketch of the table-level grants follows the row-level permission steps.

Table-level permissions

  1. Sign in to the Lake Formation console in us-east-1 Region using the DatalakeAdmin1 user.
  2. Under Permissions, choose Data permissions.
  3. Under Data permission, choose Grant.
  4. For Database, choose lakeformation_tutorial_amazon_reviews.
  5. For Table, choose amazon_reviews_governed.
  6. For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the user DatalakeAdmin1.
  7. Select Table permissions.
  8. Under Table permissions, select Alter, Insert, Drop, Delete, Select, and Describe.
  9. Choose Grant.
  10. Under Data permission, choose Grant.
  11. For Database, choose lakeformation_tutorial_amazon_reviews.
  12. For Table, choose amazon_reviews_governed.
  13. For IAM users and roles, choose the user DataAnalyst1.
  14. Under Table permissions, select Select and Describe.
  15. Choose Grant.

Row-level permissions

  1. Under Permissions, choose Data permissions.
  2. Under Data permission, choose Grant.
  3. For Database, choose lakeformation_tutorial_amazon_reviews.
  4. For Table, choose amazon_reviews_governed.
  5. For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the users DatalakeAdmin1 and DataAnalyst1.
  6. Select Row-based permissions.
  7. For Filter name, enter allowAll.
  8. For Choose filter type, select Allow access to all rows.
  9. Choose Grant.
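
If you prefer scripting these grants rather than using the console, the following minimal sketch uses the boto3 Lake Formation client to issue the table-level grants shown above. The account ID is a placeholder, and the row-level allowAll filter from the console steps is part of the preview console experience, so it isn't included in this sketch.

import boto3

# Placeholder account ID; replace with your own.
ACCOUNT_ID = "111122223333"
DATABASE = "lakeformation_tutorial_amazon_reviews"
TABLE = "amazon_reviews_governed"

lf = boto3.client("lakeformation", region_name="us-east-1")

def grant_table_permissions(principal_arn, permissions):
    """Grant Lake Formation table-level permissions to an IAM principal."""
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource={"Table": {"DatabaseName": DATABASE, "Name": TABLE}},
        Permissions=permissions,
    )

# Full permissions for the data lake admin user.
grant_table_permissions(
    f"arn:aws:iam::{ACCOUNT_ID}:user/DatalakeAdmin1",
    ["ALTER", "INSERT", "DROP", "DELETE", "SELECT", "DESCRIBE"],
)

# Read-only permissions for the data analyst user.
grant_table_permissions(
    f"arn:aws:iam::{ACCOUNT_ID}:user/DataAnalyst1",
    ["SELECT", "DESCRIBE"],
)

The same call can be repeated with the ARN of the LFRegisterLocationServiceRole-<CloudFormation stack name> role to mirror the first console grant.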

Adding table objects into the governed table

To register S3 objects to a governed table, you need to call the UpdateTableObjects API for those objects. You can call it using the AWS Command Line Interface (AWS CLI) or SDK, and also the AWS Glue ETL library (the API is called implicitly in the library). For this post, we use the AWS CLI to explain the behavior at the API level. If you don’t have the AWS CLI, see Installing, updating, and uninstalling the AWS CLI. You also need to install the service model file provided in the Lake Formation preview program. Run the following commands using the DatalakeAdmin1 user’s credentials (or an IAM role or user with sufficient permissions).

First, begin a new transaction with the BeginTransaction API:

$ aws lakeformation-preview begin-transaction
{
    "TransactionId": "7e5d506a757f32252ae3402a10191b13bfd1d7aa1c26a099d4a1911241589b8f"
}

Now you can register any files in the table location. For this post, we choose one sample partition, product_category=Camera, from the amazon-reviews-pds bucket, and one file under this partition. Uri, ETag, and Size are required in later steps, so copy them.

$ aws s3 ls s3://amazon-reviews-pds/parquet/product_category=Camera/
2018-04-09 15:37:05   65386769 part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:06   65619234 part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:06   64564669 part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:07   65148225 part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:07   65227429 part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:07   65269357 part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:08   65595867 part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:08   65012056 part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:09   65137504 part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:09   64992488 part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

$ aws s3api head-object --bucket amazon-reviews-pds --key parquet/product_category=Camera/part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
{
    "AcceptRanges": "bytes",
    "LastModified": "Mon, 09 Apr 2018 06:37:07 GMT",
    "ContentLength": 65227429,
    "ETag": "\"980669fcf6ccf31d2d686b9cccdd45e3-8\"",
    "ContentType": "binary/octet-stream",
    "Metadata": {}
}

Create a new file named write-operations1.json and enter the following JSON: (replace Uri, ETag, and Size with the values you copied.)

[
    {
        "AddObject": {
            "Uri": "s3://amazon-reviews-pds/parquet/product_category=Camera/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
            "ETag": "d4c25c40f33071620fb31cf0346ed2ec-8",
            "Size": 65386769,
            "PartitionValues": [
                "Camera"
            ]
        }
    }
]

Let’s register an existing object in the bucket to the governed table by making an UpdateTableObjects API call using the write-operations1.json file you created (replace <transaction-id> with the transaction ID you got from the begin-transaction command).

$ aws lakeformation-preview update-table-objects --database-name lakeformation_tutorial_amazon_reviews --table-name amazon_reviews_governed --transaction-id <transaction-id> --write-operations file://./write-operations1.json

Note the current date and time right after making the UpdateTableObjects API call; we use this timestamp for time travel queries later.

$ date -u
Tue Feb  2 12:12:00 UTC 2021

You can verify the change before committing the transaction by making the GetTableObjects API call with the same transaction ID (replace <transaction-id> with the ID you got from the begin-transaction command):

$ aws lakeformation-preview get-table-objects --database-name lakeformation_tutorial_amazon_reviews --table-name amazon_reviews_governed --transaction-id <transaction-id>
{
    "Objects": [
        {
            "PartitionValues": [
                "Camera"
            ],
            "Objects": [
                {
                    "Uri": "s3://amazon-reviews-pds/parquet/product_category=Camera/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
                    "ETag": "d4c25c40f33071620fb31cf0346ed2ec-8",
                    "Size": 65386769
                }
            ]
        }
    ]
}

To make this data available to other transactions, you need to call the CommitTransaction API (replace <transaction-id> with the transaction ID you got from the begin-transaction command):

$ aws lakeformation-preview commit-transaction --transaction-id <transaction-id>

After running the preceding command, you can see the partition on the Lake Formation console.

Let’s add one more partition into this table. This time we add one file per partition, and add only two partitions as an example. For actual usage, you need to add all the files under all the partitions that you need.

Add the partition with the following commands:

  1. Call the BeginTransaction API to start another Lake Formation transaction:
    $ aws lakeformation-preview begin-transaction
    {
         "TransactionId": "d70c60e859e832b312668723cf48c1b84ef9109c5dbf6e9dbe8834c481c0ec81"
    }

  2. List Amazon S3 objects located on amazon-reviews-pds bucket to choose another sample file:
    $ aws s3 ls s3://amazon-reviews-pds/parquet/product_category=Books/
    2018-04-09 15:35:58 1094842361 part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:35:59 1093295804 part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:00 1095643518 part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:00 1095218865 part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:00 1094787237 part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:33 1094302491 part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:35 1094565655 part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:35 1095288096 part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:35 1092058864 part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:35 1093613569 part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

  3. Call the HeadObject API against one sample file to copy its ETag and Size:
    $ aws s3api head-object --bucket amazon-reviews-pds --key parquet/product_category=Books/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    {
         "AcceptRanges": "bytes",
         "LastModified": "Mon, 09 Apr 2018 06:35:58 GMT",
         "ContentLength": 1094842361,
         "ETag": "\"9805c2c9a0459ccf337e01dc727f8efc-131\"",
         "ContentType": "binary/octet-stream",
         "Metadata": {}
    }

  4. Create a new file named write-operations2.json and enter the following JSON: (Replace Uri, ETag, and Size with the values you copied.)
    [
        {
                "AddObject": {
                "Uri": "s3://amazon-reviews-pds/parquet/product_category=Books/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
                "ETag": "9805c2c9a0459ccf337e01dc727f8efc-131",
                "Size": 1094842361,
                "PartitionValues": [
                    "Books"
               ]
           }
        }
    ]

  5. Call the UpdateTableObjects API using write-operations2.json (replace <transaction-id> with the transaction ID you got from the begin-transaction command):
    $ aws lakeformation-preview update-table-objects --database-name lakeformation_tutorial_amazon_reviews --table-name amazon_reviews_governed --transaction-id <transaction-id> --write-operations file://./write-operations2.json

  6. Call the CommitTransaction API (replace <transaction-id> with the transaction ID you got from the begin-transaction command):

    $ aws lakeformation-preview commit-transaction --transaction-id <transaction-id>

Now the two partitions are visible on the Lake Formation console.
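
If you want to automate this object registration instead of running the preview CLI by hand, the following sketch uses the boto3 lakeformation client with the generally available transaction API, whose operation names differ slightly from the preview CLI shown above (for example, start_transaction instead of begin-transaction). The bucket, key, database, and table names come from this post; treat the snippet as illustrative rather than a drop-in replacement for the preview commands.

import boto3

BUCKET = "amazon-reviews-pds"
KEY = "parquet/product_category=Camera/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet"
DATABASE = "lakeformation_tutorial_amazon_reviews"
TABLE = "amazon_reviews_governed"

s3 = boto3.client("s3")
lf = boto3.client("lakeformation", region_name="us-east-1")

# Look up ETag and Size for the object, as with 'aws s3api head-object'.
head = s3.head_object(Bucket=BUCKET, Key=KEY)

# Begin a transaction, register the object, and commit.
txn_id = lf.start_transaction(TransactionType="READ_AND_WRITE")["TransactionId"]
lf.update_table_objects(
    DatabaseName=DATABASE,
    TableName=TABLE,
    TransactionId=txn_id,
    WriteOperations=[{
        "AddObject": {
            "Uri": f"s3://{BUCKET}/{KEY}",
            "ETag": head["ETag"].strip('"'),
            "Size": head["ContentLength"],
            "PartitionValues": ["Camera"],
        }
    }],
)
lf.commit_transaction(TransactionId=txn_id)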

Querying the governed table using Amazon Athena

Now your governed table is ready! Let’s start querying the governed table using Amazon Athena. Sign in to the Athena console in us-east-1 Region using DataAnalyst1 user.

If it’s your first time running queries on Athena, you need to configure a query result location. For more information, see Specifying a Query Result Location.

To utilize Lake Formation preview features, you need to create a special workgroup named AmazonAthenaLakeFormationPreview, and join the workgroup. For more information, see Managing Workgroups.
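
If you prefer to drive Athena from a script, a short sketch like the following creates the preview workgroup and submits a query through it using boto3. The query results bucket is a placeholder, and create_work_group fails if the workgroup already exists.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Create the preview workgroup (this call fails if the workgroup already exists).
athena.create_work_group(Name="AmazonAthenaLakeFormationPreview")

# Submit a query through the preview workgroup; the output bucket is a placeholder.
response = athena.start_query_execution(
    QueryString=(
        "SELECT * FROM lakeformation.lakeformation_tutorial_amazon_reviews"
        ".amazon_reviews_governed LIMIT 10"
    ),
    WorkGroup="AmazonAthenaLakeFormationPreview",
    ResultConfiguration={"OutputLocation": "s3://your-athena-query-results-bucket/"},
)
print(response["QueryExecutionId"])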

Running a simple query

First, let’s preview 10 records stored in the governed table:

SELECT * 
FROM lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed
LIMIT 10

The following screenshot shows the query results.


Running an analytic query

Next, let’s run an analytic query with aggregation for simulating real-world use cases:

SELECT product_category, count(*) as TotalReviews, avg(star_rating) as AverageRating
FROM lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed 
GROUP BY product_category

The following screenshot shows the results. This query returned the total number of reviews and average rating per product category.


Running an analytic query with time travel

Each governed table maintains a versioned manifest of the Amazon S3 objects that it comprises. You can use previous versions of the manifest for time travel queries. Your queries against governed tables in Athena can include a timestamp to indicate that you want to discover the state of the data at a particular date and time.

To submit a time travel query in Athena, add a WHERE clause that sets the column __asOfDate to the epoch time (long integer) representation of the required date and time. Let’s run the time travel query (replace <epoch-milliseconds> with the timestamp taken right after you made the first UpdateTableObjects call; to retrieve the epoch milliseconds, see the tips introduced after the screenshots in this post):

SELECT product_category, count(*) as TotalReviews, avg(star_rating) as AverageRating
FROM lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed
WHERE __asOfDate = <epoch-milliseconds>
GROUP BY product_category

The following screenshot shows the query results. The result only includes the record of product_category=Camera. This is because the file under product_category=Books was added after this timestamp (1612267920000 ms = 2021/02/02 12:12:00 UTC), which is specified in the time travel column __asOfDate.


To retrieve the epoch time in milliseconds for a given timestamp, you can run one of the following commands.

The following command is for Linux (GNU date command):

$ echo $(($(date -u -d '2021/02/02 12:12:00' +%s%N)/1000000)) 
1612267920000

The following command is for OSX (BSD date command):

$ echo $(($(date -u -j -f "%Y/%m/%d %T" "2021/02/02 12:12:00" +'%s * 1000 + %-N / 1000000')))
1612267920000
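
Alternatively, a short Python snippet gives the same epoch milliseconds on any platform:

from datetime import datetime, timezone

# Epoch milliseconds for 2021/02/02 12:12:00 UTC, for use in the __asOfDate column.
ts = datetime(2021, 2, 2, 12, 12, 0, tzinfo=timezone.utc)
print(int(ts.timestamp() * 1000))  # 1612267920000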

Cleaning up

Now to the final step, cleaning up the resources.

  1. Delete the CloudFormation stack. The governed table you created is automatically deleted with the stack.
  2. Delete the Athena workgroup AmazonAthenaLakeFormationPreview.

Conclusion

In this blog post, we explained how to create a Lake Formation governed table with existing data in an AWS public dataset. In addition, we explained how to query against governed tables and how to run time travel queries for governed tables. With Lake Formation governed tables, you can achieve transactions, row-level security, and query acceleration. In Part 2 of this series, we show you how to create a governed table for streaming data sources and demonstrate how Lake Formation transactions work.

Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) AWS Region. To get early access to these capabilities, please sign up for the preview.


Appendix: Setting up resources via the console

When following the steps in this section, use the us-east-1 Region because, as of this writing, this Lake Formation preview feature is available only in us-east-1.

Configuring IAM roles and IAM users

First, you need to set up two IAM roles: one for AWS Glue ETL jobs and another for the Lake Formation data lake location.

IAM policies

To create your policies, complete the following steps:

  1. On the IAM console, create a new Policy for Amazon S3.
  2. Save the policy as S3DataLakePolicy as follows:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::amazon-reviews-pds/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::amazon-reviews-pds"
                ]
            }
        ]
    }

  3. Create a new IAM policy named LFLocationPolicy with the following statements:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LFPreview1",
                "Effect": "Allow",
                "Action": "execute-api:Invoke",
                "Resource": "arn:aws:execute-api:*:*:*/*/POST/reportStatus"
            },
            {
                "Sid": "LFPreview2",
                "Effect": "Allow",
                "Action": [
                    "lakeformation:BeginTransaction",
                    "lakeformation:CommitTransaction",
                    "lakeformation:AbortTransaction",
                    "lakeformation:GetTableObjects",
                    "lakeformation:UpdateTableObjects"
                ],
                "Resource": "*"
            }
        ]
    }

    
    

  4. Create a new IAM policy named LFQueryPolicy with the following statements:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LFPreview1",
                "Effect": "Allow",
                "Action": "execute-api:Invoke",
                "Resource": "arn:aws:execute-api:*:*:*/*/POST/reportStatus"
            },
            {
                "Sid": "LFPreview2",
                "Effect": "Allow",
                "Action": [
                    "lakeformation:BeginTransaction",
                    "lakeformation:CommitTransaction",
                    "lakeformation:AbortTransaction",
                    "lakeformation:ExtendTransaction",
                    "lakeformation:PlanQuery",
                    "lakeformation:GetTableObjects",
                    "lakeformation:GetQueryState",
                    "lakeformation:GetWorkUnits",
                    "lakeformation:Execute"
                ],
                "Resource": "*"
            }
        ]
    }

IAM role for AWS Lake Formation

To create your IAM role for the Lake Formation data lake location, complete the following steps:

  1. Create a new Lake Formation role called LFRegisterLocationServiceRole with a Lake Formation trust relationship:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": [
              "lakeformation.amazonaws.com"
            ]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    

  2. Attach the customer managed policies S3DataLakePolicy and LFLocationPolicy that you created in the previous step.

This role is used to register locations with Lake Formation, which in turn performs credential vending for Athena at query time.
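
If you would rather script the role creation, a minimal sketch with boto3 might look like the following. The account ID is a placeholder, and the two customer managed policies are assumed to already exist from the previous steps.

import json
import boto3

iam = boto3.client("iam")
ACCOUNT_ID = "111122223333"  # placeholder; replace with your account ID

# Trust policy allowing Lake Formation to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lakeformation.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="LFRegisterLocationServiceRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the customer managed policies created earlier.
for policy_name in ("S3DataLakePolicy", "LFLocationPolicy"):
    iam.attach_role_policy(
        RoleName="LFRegisterLocationServiceRole",
        PolicyArn=f"arn:aws:iam::{ACCOUNT_ID}:policy/{policy_name}",
    )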

IAM users

To create your users, complete the following steps:

  1. Create an IAM user named DatalakeAdmin.
  2. Attach the following AWS managed policies:
    1. AWSLakeFormationDataAdmin
    2. AmazonAthenaFullAccess
    3. IAMReadOnlyAccess
  3. Attach the customer managed policy LFQueryPolicy.
  4. Create an IAM user named DataAnalyst that can use Athena to query data.
  5. Attach the AWS managed policy AmazonAthenaFullAccess.
  6. Attach the customer managed policy LFQueryPolicy.

Configuring Lake Formation

If you’re new to Lake Formation, you can follow the steps below to get started. A scripted equivalent of this configuration follows the list.

  1. On the Lake Formation console, under Permissions, choose Admins and database creators.
  2. In the Data lake administrators section, choose Grant.
  3. For IAM users and roles, choose your IAM user DatalakeAdmin.
  4. Choose Save.
  5. In the Database creators section, choose Grant.
  6. For IAM users and roles, choose the LFRegisterLocationServiceRole.
  7. Select Create Database.
  8. Choose Grant.
  9. Under Register and ingest, choose Data lake locations.
  10. Choose Register location.
  11. For Amazon S3 path, enter your Amazon S3 path to the bucket where your data is stored. This needs to be the same bucket you listed in LFLocationPolicy. Lake Formation uses this role to vend temporary Amazon S3 credentials to query services that need read/write access to the bucket and all prefixes under it.
  12. For IAM role, choose the LFRegisterLocationServiceRole.
  13. Choose Register location.
  14. Under Data catalog, choose Settings.
  15. Make sure that both check boxes for Use only IAM access control for new databases and Use only IAM access control for new tables in new databases are deselected.
  16. Under Data catalog, choose Databases.
  17. Choose Create database.
  18. Select Database.
  19. For Name, enter lakeformation_tutorial_amazon_reviews.
  20. Choose Create database.
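
The same configuration can be scripted. The following sketch adds the data lake administrator, registers the Amazon S3 location with the LFRegisterLocationServiceRole, and creates the database; the account ID and bucket name are placeholders. Note that put_data_lake_settings replaces the existing data lake settings wholesale, so in practice you would read the current settings first and merge your changes.

import boto3

ACCOUNT_ID = "111122223333"        # placeholder
BUCKET = "your-data-lake-bucket"   # placeholder

lf = boto3.client("lakeformation", region_name="us-east-1")
glue = boto3.client("glue", region_name="us-east-1")

# Make the DatalakeAdmin user a data lake administrator.
# Caution: this call overwrites the existing data lake settings.
lf.put_data_lake_settings(
    DataLakeSettings={
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier": f"arn:aws:iam::{ACCOUNT_ID}:user/DatalakeAdmin"}
        ]
    }
)

# Register the S3 location with the role created earlier.
lf.register_resource(
    ResourceArn=f"arn:aws:s3:::{BUCKET}",
    UseServiceLinkedRole=False,
    RoleArn=f"arn:aws:iam::{ACCOUNT_ID}:role/LFRegisterLocationServiceRole",
)

# Create the tutorial database.
glue.create_database(DatabaseInput={"Name": "lakeformation_tutorial_amazon_reviews"})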

About the Author

Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue & Lake Formation. His passion is for implementing software artifacts for building data lakes more effectively and easily. During his spare time, he loves to spend time with his family, especially hunting bugs—not software bugs, but bugs like butterflies, pill bugs, snails, and grasshoppers.

How FanDuel Group secures personally identifiable information in a data lake using AWS Lake Formation

Post Syndicated from Damian Grech original https://aws.amazon.com/blogs/big-data/how-fanduel-group-secures-personally-identifiable-information-in-a-data-lake-using-aws-lake-formation/

This post is co-written with Damian Grech from FanDuel

FanDuel Group is an innovative sports-tech entertainment company that is changing the way consumers engage with their favorite sports, teams, and leagues. The premier gaming destination in the US, FanDuel Group consists of a portfolio of leading brands across gaming, sports betting, daily fantasy sports, advance-deposit wagering, and TV/media, including FanDuel, Betfair US, and TVG. FanDuel Group has a presence across 50 states and over 8.5 million customers. The company is based in New York with offices in California, New Jersey, Florida, Oregon, and Scotland. FanDuel Group is a subsidiary of Flutter Entertainment plc, the world’s largest sports betting and gaming operator with a portfolio of globally recognized brands and a constituent of the FTSE 100 index of the London Stock Exchange.

In this post, we discuss how FanDuel used AWS Lake Formation and Amazon Redshift Spectrum to restrict access to personally identifiable information (PII) in their data lake.

The challenge

In 2018, a series of mergers led to the creation of FanDuel Group, and the combined data engineering team found themselves operating three data warehouses running on Amazon Redshift. The team decided to create a new single platform to replace the three separate warehouses, consisting of a data warehouse containing the core business data model and a data lake to catalog and hold all other types of data. FanDuel’s vision was to create a unified data platform that served their data requirements. This included the ability to ingest and organize real-time and batch datasets, and to secure and govern PII.

Because the end-users of the existing data warehouses were familiar with Amazon Redshift, it was critical that they be able to access the data lake using Amazon Redshift. Other important architecture considerations included a simplified user experience, the ability to scale to huge data volumes, and a robust security model to provision relevant data to analysts and data scientists.

To accomplish the vision, FanDuel decided to modernize the data platform and introduce Amazon Simple Storage Service (Amazon S3)-based data lakes. Data lakes are a logical construct that allows data to be stored in its native format using open data formats. With a data lake architecture, FanDuel can enable data analysts to analyze large volumes of data without significant modeling. Data lakes also allow FanDuel to store structured and unstructured data.

Some of the data to be stored in the data lake was customer PII, so access to this category of data needed to be carefully restricted to only employees who required access to perform their job functions. To address these security challenges, FanDuel first tested a tag-based approach on Amazon S3 to restrict access to the PII data. The idea was to write two copies of each dataset (one with PII and one without), apply tags to the files where PII is stored, and secure the files using AWS Identity and Access Management (IAM) policies. This approach was complex and needed 100–200 hours of development time for every data source that was ingested.

Solution overview

FanDuel decided to use Lake Formation and Redshift Spectrum to solve this challenge. The following architectural diagram shows how FanDuel secured their data lake.

The solution includes the following steps:

  1. The FanDuel team registered the S3 location in Lake Formation.

After the location is registered, Lake Formation takes control of the data lake, thereby eliminating the need to set up complicated policies in IAM.

  1. FanDuel built AWS Glue ETL jobs to extract data from sources, including MySQL databases and flat files. They used AWS Glue to cleanse and transform raw data to form refined datasets stored in Parquet-formatted files. They also used AWS Glue crawlers to register the cleansed datasets in the Data Catalog.
  2. The team used Lake Formation to set up column-based permissions using two roles:
    1. LimitedPIIAnalyst – Granted access to all columns. Only analysts who needed access to PII data were assigned this role.
    2. NonPIIAnalyst – Granted access to non-PII columns. By default, analysts using the data lake were assigned this role.
  3. FanDuel created two external schemas using Redshift Spectrum: one using the NonPIIAnalyst role, and one using the LimitedPIIAnalyst role. The following code is an example of the DDL that uses the role that was set up in Lake Formation:
    CREATE EXTERNAL SCHEMA nonpii_data_lake FROM DATA CATALOG
    DATABASE 'fanduel_data_lake' REGION 'us-east-1'
    IAM_ROLE 'arn:aws:iam::123456789012:role/NonPIIAnalyst';
    
    CREATE EXTERNAL SCHEMA limitedpii_data_lake FROM DATA CATALOG
    DATABASE 'fanduel_data_lake' REGION 'us-east-1'
    IAM_ROLE 'arn:aws:iam::123456789012:role/LimitedPIIAnalyst';
    

FanDuel could already manage access permissions by adding or removing users from a group in Amazon Redshift, so they already had a group consisting of only the analysts who should be permitted access to PII. The following code grants this group access to the limitedpii_data_lake schema, which effectively means only this group can query the data lake using the LimitedPIIAnalyst role:

GRANT USAGE ON SCHEMA nonpii_data_lake TO base_group;
GRANT SELECT ON ALL TABLES IN SCHEMA nonpii_data_lake TO base_group;
GRANT USAGE ON SCHEMA limitedpii_data_lake TO pii_permitted_group;
GRANT SELECT ON ALL TABLES IN SCHEMA limitedpii_data_lake TO pii_permitted_group;
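
The column-level permissions described in step 2 can also be expressed through the Lake Formation API. The following sketch is illustrative only: the account ID, table name, and excluded column names are placeholders rather than FanDuel’s actual configuration. It grants the NonPIIAnalyst role SELECT on every column except the excluded PII columns.

import boto3

lf = boto3.client("lakeformation")
ACCOUNT_ID = "123456789012"  # placeholder, matching the example ARNs above

# Grant SELECT on all columns except the PII columns to the NonPIIAnalyst role.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": f"arn:aws:iam::{ACCOUNT_ID}:role/NonPIIAnalyst"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "fanduel_data_lake",
            "Name": "example_table",  # placeholder table name
            "ColumnWildcard": {
                # Placeholder PII column names to exclude.
                "ExcludedColumnNames": ["email", "phone_number"],
            },
        }
    },
    Permissions=["SELECT"],
)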

Benefits

The ability to extend queries to the data lake with Redshift Spectrum and have column-level access control provides superior control over the S3 tag-based permissions approach that was originally considered. This architecture provided the following benefits for FanDuel:

  • FanDuel could offer new capabilities to data analysts. For example, data analysts could quickly access raw data with PII and combine it with existing data in Amazon Redshift. Lake Formation provided a single view for monitoring the data access patterns.
  • Lake Formation column-level access control allowed them to secure PII data, which otherwise would have taken a complex S3 tag-based approach. This saved 100–200 hours of development time for every new data source and data footprint, because the original approach required creating two files (one with PII and another without PII), tagging files, and setting up permissions based on tags.
  • The ability to extend access from Amazon Redshift to the data lake with appropriate access control has allowed FanDuel to reduce data stored in Amazon Redshift.

Conclusion

FanDuel will leverage its new data platform to ingest additional data sources with real-time data so analysts and data scientists can gain insights and improve customer experience.

Questions or feedback? Send an email to [email protected].


About the Authors

Damian Grech is a Data Engineering Senior Manager at FanDuel. Damian has over 15 years of experience in software delivery and has worked with organizations ranging from large enterprises to start-ups at their infant stages. In his spare time, you can find him either experimenting in the kitchen or trailing the Scottish Highlands.

 

 

Shiv Narayanan is Global Business Development Manager for Data Lakes and Analytics solutions at AWS. He works with AWS customers across the globe to strategize, build, develop and deploy modern data platforms. Shiv loves music, travel, food and trying out new tech.

 

 

 

Sidhanth Muralidhar is a Senior Technical Account Manager at Amazon Web Services. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them in their cloud journey. In his spare time, he loves to play and watch football.

 

 

 

 

 

 

Controlling data lake access across multiple AWS accounts using AWS Lake Formation

Post Syndicated from Rafael Suguiura original https://aws.amazon.com/blogs/big-data/controlling-data-lake-access-across-multiple-aws-accounts-using-aws-lake-formation/

When deploying data lakes on AWS, you can use multiple AWS accounts to better separate different projects or lines of business. In this post, we see how the AWS Lake Formation cross-account capabilities simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control to the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3) locations.

Use case

Keeping each business unit’s resources, such as compute and storage, in its own AWS account allows for easier cost allocation and permissions governance. On the other hand, centralizing your Data Catalog into a single account with Lake Formation removes the overhead of managing multiple catalogs in isolated data silos, simplifying management and data availability.

For this post, we use the example of a company with two separate teams:

  • The Analytics team is responsible for data ingestion, validation, and cleansing. After processing the incoming data, they store it in Amazon S3 and use Lake Formation for the Data Catalog, in a primary AWS account.
  • The Business Analyst team is responsible for generating reports and extracting insight from such data. They use Amazon Athena running in a secondary AWS account.

When a secondary account needs to access data, the data lake administrator uses Lake Formation to share data across accounts, avoiding data duplication and silos and reducing complexity. Data can be shared at the database or table level, and the administrator can define which tables and columns each analyst has access to, establishing centralized and granular access control. The following diagram illustrates this architecture.

Architecture overview

We provide two AWS CloudFormation templates to set up the required infrastructure for this data lake use case. Each template deploys the resources in one of the accounts (primary and secondary).

In the primary account, the CloudFormation template loads the sample data in the S3 bucket. For this post, we use the publicly available dataset of historic taxi trips collected in New York City in the month of June 2020, in CSV format. The dataset is available from the New York City Taxi & Limousine Commission, via Registry of Open Data on AWS, and contains information on the geolocation and collected fares of individual taxi trips.

The template also creates a Data Catalog configuration by crawling the bucket using an AWS Glue crawler, and updating the Lake Formation Data Catalog on the primary account.

Prerequisites

To follow along with this post, you must have two AWS accounts (primary and secondary), with AWS Identity and Access Management (IAM) administrator access.

Deploying the CloudFormation templates

To get started, launch the first CloudFormation template in the primary account:

After that, deploy the second template in the secondary account:

You now have the deployment as depicted in the following architecture, and are ready to set up Lake Formation with cross-account access.

Setting up Lake Formation in the primary account

Now that you have the basic infrastructure provisioned by the template, we can dive deeper into the steps required for Lake Formation configuration. First sign in to the primary account on the AWS Management Console, using the existing IAM administrator role and account.

Assigning a role to our data lake

Lake Formation administrators are IAM users or roles that can grant and delegate Lake Formation permissions on data locations, databases, and tables. The CloudFormation template created an IAM role with the proper IAM permissions, named LakeFormationPrimaryAdmin. Now we need to assign it to our data lake:

  1. On the Lake Formation console, in the Welcome to Lake Formation pop-up window, choose Add administrators.
    1. If the pop-up doesn’t appear, in the navigation pane, under Permissions, choose Admins and database creators.
    2. Under Data Lake Administrators, choose Grant.
  2. For IAM users and roles, choose LakeFormationPrimaryAdmin.
  3. Choose Save.

After we assign the Lake Formation administrator, we can assume this role and start managing our data lake.

  1. On the console, choose your user name and choose Switch Roles.

  1. Enter your primary account number and the role LakeFormationPrimaryAdmin.
  2. Choose Switch Role.

For detailed instructions on changing your role, see Switching to a role (console).

Adding the Amazon S3 location as a storage layer

Now you’re the Lake Formation administrator. For Lake Formation to implement access control on the data lake, we need to include the Amazon S3 location as a storage layer. Let’s register our existing S3 bucket that contains sample data.

  1. On the Lake Formation console, in the navigation pane, under Register and Ingest, choose Data lake locations.
  2. For Amazon S3 path, choose Browse.
  3. Choose the S3 bucket in the primary account, referenced in the CloudFormation template outputs as S3BucketPrimary.
  4. Choose Register location.

Configuring access control

When you created the CloudFormation stack, an AWS Glue crawler populated the Data Catalog with the database and table pointing to our S3 bucket. By default, Lake Formation adds IAMAllowedPrincipals permissions, which aren’t compatible with cross-account sharing. We must disable them on our database and table. For this post, we use Lake Formation access control in conjunction with IAM. For more information, see Change Data Catalog Settings.

  1. On the Lake Formation console, in the navigation pane, under Data Catalog, choose Databases.
  2. Choose gluedatabaseprimary.
  3. Choose Edit.
  4. Deselect Use only IAM access control for new tables in this database.
  5. Choose Save.

  1. On the database details page, on the Actions menu, choose Revoke.
  2. For IAM users and roles, choose IAMAllowedPrincipals.
  3. For Database permissions, select Super.

  1. Choose Revoke.
  2. On the database details page, choose View Tables.
  3. Select the table that starts with lf_table.
  4. On the Actions menu, choose Revoke.
  5. For IAM users and roles, choose IAMAllowedPrincipals.
  6. For Table permissions, select Super.
  7. Choose Revoke.

You can now see the metadata and Amazon S3 data location in the table details. The CloudFormation template ran an AWS Glue crawler that populated the table.

Granting permissions

Now we’re ready to grant permissions to the Business Analyst users. Because they’re in a separate AWS account, we need to share the database across accounts.

  1. On the Lake Formation console, under Data Catalog¸ choose Databases.
  2. Select our database.
  3. On the Actions menu, choose Grant.
  4. Select External account.
  5. For AWS account ID or AWS organization ID, enter the secondary account number.
  6. For Table, choose All tables.
  7. For Table permissions, select Select.
  8. For Grantable permissions, select Select.

Grantable permissions are required to allow the principal to pass this grant to other users and roles. For our use case, the secondary account LakeFormationAdministrator grants access to the secondary account BusinessAnalyst. If this permission is revoked on the primary account in the future, all access granted to BusinessAnalyst and LakeFormationAdministrator on the secondary account is also revoked.

For this post, we share the database with a single account. Lake Formation also allows sharing with an AWS organization.

  1. Choose Grant.
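
The same cross-account grant can be made programmatically. A minimal sketch with boto3, where the secondary account ID is a placeholder, grants SELECT on all tables in the database along with the grant option:

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")
SECONDARY_ACCOUNT_ID = "222233334444"  # placeholder

# Share all tables in the database with the secondary account,
# allowing its data lake administrator to re-grant SELECT.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": SECONDARY_ACCOUNT_ID},
    Resource={
        "Table": {"DatabaseName": "gluedatabaseprimary", "TableWildcard": {}}
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],
)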

Sharing specific tables across accounts

Optionally, instead of sharing the whole database, you can share specific tables across accounts. You don’t need to share the database to share a table underneath it.

  1. On the Lake Formation console, under Data Catalog, choose Tables.
  2. Select the table that starts with lf_table.
  3. On the Actions menu, choose Grant.
  4. Select External account.
  5. For AWS account ID or AWS organization ID, enter the secondary account number.

You can also choose specific columns to share with the secondary account. For this post, we share five columns.

  1. For Columns, choose Include columns.
  2. For Include columns, choose the following columns:
    1. vendorid
    2. lpep_pickup_datetime
    3. lp_dropoff_taketime
    4. store_and_forward_flag
    5. ratecodeid
  3. For Table permissions, select Select.
  4. For Grantable permissions, select Select.
  5. Choose Grant.

Setting up Lake Formation in the secondary account

Now that the primary account setup is complete, let’s configure the secondary account. We access the resource share and create appropriate resource links, pointing to the databases or tables in the primary account. This allows the data lake administrator to grant proper access to the Business Analyst team, who queries the data through Athena. The following diagram illustrates this architecture.

Assigning a role to our data lake

Similar to the primary account, we need to assign an IAM role as the Lake Formation administrator. To better differentiate the roles, this one is named LakeFormationSecondaryAdmin.

  1. On the Lake Formation console, under Permissions, choose Admins and database creators.
  2. Under Data Lake Administrators, choose Grant.
  3. In the pop-up window, choose LakeFormationSecondaryAdmin.
  4. Choose Save.
  5. On the console, switch to the LakeFormationSecondaryAdmin role.

Sharing resources

Lake Formation shares resources (databases and tables) by using AWS Resource Access Manager (AWS RAM). AWS RAM provides a streamlined way to share resources across AWS accounts and also integrates with AWS Organizations. If both the primary and secondary accounts are in the same organization with resource sharing enabled, resource shares are accepted automatically and you can skip this step. If not, complete the following steps:

  1. On the AWS RAM console, in the navigation pane, under Shared with me, choose Resource shares.
  2. Choose the Lake Formation share.
  3. Choose Accept resource share.

The resource status switches to Active.

Creating a resource link

With the share accepted, we can create a resource link in the secondary account. Resource links are Data Catalog virtual objects that link to a shared database or table. The resource link lives in your account and the referenced object it points to can be anywhere else.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Choose Create database.
  3. Select Resource link.
  4. For Resource link name, enter a name, such as lf-primary-database-rl.
  5. For Shared database, choose gluedatabaseprimary.

The shared database’s owner ID is populated automatically.

  1. Choose Create.

You can use this resource link the same way you use database or table references in Lake Formation. The following screenshot shows the resource link listed on the Databases page.
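
Resource links can also be created with the AWS Glue API. The following sketch creates the same lf-primary-database-rl link; the primary account ID is a placeholder.

import boto3

glue = boto3.client("glue", region_name="us-east-1")
PRIMARY_ACCOUNT_ID = "111122223333"  # placeholder

# Create a resource link in the secondary account that points to the
# database shared from the primary account.
glue.create_database(
    DatabaseInput={
        "Name": "lf-primary-database-rl",
        "TargetDatabase": {
            "CatalogId": PRIMARY_ACCOUNT_ID,
            "DatabaseName": "gluedatabaseprimary",
        },
    }
)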

Granting permissions

As the creator of the resource link, at this point only you (IAM role LakeFormationSecondaryAdmin) can view and access this object in the Data Catalog. To grant visibility on the resource link to our Business Analyst users (IAM role LakeFormationSecondaryAnalyst), we need to grant them describe permissions.

  1. On the Lake Formation console, navigate to the database details page.
  2. On the Actions menu, choose Grant.
  3. For IAM users and roles, choose LakeFormationSecondaryAnalyst.
  4. For Resource Link permissions, select Describe and deselect Super.
  5. Choose Grant.

Granting permissions on a resource link doesn’t grant permissions on the target (linked) database or table, so let’s do it now. For our use case, the analysts only need SQL SELECT capabilities, and only to the specific columns of the table.

  1. In the navigation pane, under Data Catalog, choose Databases.
  2. Select lf-primary-database-rl.
  3. On the Actions menu, choose Grant on Target.
  4. In the Grant permissions dialog box, choose My account.
  5. For IAM users and roles, choose LakeFormationSecondaryAnalyst.
  6. Choose the table that starts with lf_table.
  7. Under Columns, select Include Columns and select the first five columns.
  8. For Table permissions, select Select.
  9. Choose Grant.

Accessing the data

With all the Lake Formation grants in place, the users are ready to access the data at the proper level.

  1. In the secondary account, switch to the role LakeFormationSecondaryAnalyst.
  2. On the Athena console, choose Get Started.
  3. On the selection bar, under Workgroup, choose LakeFormationCrossAccount.
  4. Choose Switch workgroup.

The screen refreshes; make sure you are in the right workgroup.

To use Lake Formation cross-account access, you don’t need a separate Athena workgroup. For this post, the CloudFormation template created one to simplify deployment with the proper Athena configuration.

  1. For Data source, choose AwsDataCatalog.
  2. For Database, choose lf-primary-database-rl.
  3. For Tables, choose lf_table_<string>.
  4. On the menu, choose Preview table.

  1. Choose Run query.

You now have a data analyst on the secondary account with access to an S3 bucket in the primary account. The analyst only has access to the five columns we specified earlier.

Data access that is granted by Lake Formation cross-account access is logged in the secondary account AWS CloudTrail log file, and Lake Formation copies the event to the primary account’s log file. For more information and examples of logging messages, see Cross-Account CloudTrail Logging.

Cleaning up

To avoid incurring future charges, delete the CloudFormation templates after you finish testing the solution.

Conclusion

In this post, we went through the process of configuring Lake Formation to share AWS Glue Data Catalog metadata information across AWS accounts.

Large enterprises typically use multiple AWS accounts, and many of those accounts might need access to a data lake managed by a single AWS account. AWS Lake Formation with cross-account access set up enables you to run queries and jobs that can join and query tables across multiple accounts.


About the Authors

Rafael Suguiura is a Principal Solutions Architect at Amazon Web Services. He guides some of the world’s largest financial services companies in their cloud journey. When the weather is nice, he enjoys cycling and finding new hiking trails—and when it’s not, he catches up with sci-fi books, TV series, and video games.

 

 

 

Himanish Kushary is a Senior Big Data Architect at Amazon Web Services. He helps customers across multiple domains build scalable big data analytics platforms. He enjoys playing video games, and watching good movies and TV series.

Creating a source to Lakehouse data replication pipe using Apache Hudi, AWS Glue, AWS DMS, and Amazon Redshift

Post Syndicated from Vishal Pathak original https://aws.amazon.com/blogs/big-data/creating-a-source-to-lakehouse-data-replication-pipe-using-apache-hudi-aws-glue-aws-dms-and-amazon-redshift/

Most customers have their applications backed by various SQL and NoSQL systems on premises and in the cloud. Because the data resides in various independent systems, customers struggle to derive meaningful insights by combining data from all of these sources. Hence, customers create data lakes to bring their data together in a single place.

Typically, a replication tool such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3). When the data is in Amazon S3, customers process it based on their requirements. A typical requirement is to sync the data in Amazon S3 with the updates on the source systems. Although it’s easy to apply updates on a relational database management system (RDBMS) that backs an online source application, it’s tough to apply this change data capture (CDC) process on your data lakes. Apache Hudi is a good way to solve this problem. Currently, you can use Hudi on Amazon EMR to create Hudi tables.

In this post, we use Apache Hudi to create tables in the AWS Glue Data Catalog using AWS Glue jobs. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. This post enables you to take advantage of the serverless architecture of AWS Glue while upserting data in your data lake, hassle-free.

To write to Hudi tables using AWS Glue jobs, we use a JAR file created using open-source Apache Hudi. This JAR file is used as a dependency in the AWS Glue jobs created through the AWS CloudFormation template provided in this post. Steps to create the JAR file are included in the appendix.

The following diagram illustrates the architecture the CloudFormation template implements.

Prerequisites

The CloudFormation template requires you to select an Amazon Elastic Compute Cloud (Amazon EC2) key pair. This key is configured on an EC2 instance that lives in the public subnet. We use this EC2 instance to get to the Aurora cluster that lives in the private subnet. Make sure you have a key in the Region where you deploy the template. If you don’t have one, you can create a new key pair.

Solution overview

The following are the high-level implementation steps:

  1. Create a CloudFormation stack using the provided template.
  2. Connect to the Amazon Aurora cluster used as a source for this post.
  3. Run InitLoad_TestStep1.sql, in the source Amazon Aurora cluster, to create a schema and a table.

AWS DMS replicates the data from the Aurora cluster to the raw S3 bucket. AWS DMS supports a variety of sources.
The CloudFormation stack creates an AWS Glue job (HudiJob) that is scheduled to run at a frequency set in the ScheduleToRunGlueJob parameter of the CloudFormation stack. This job reads the data from the raw S3 bucket, writes to the Curated S3 bucket, and creates a Hudi table in the Data Catalog. The job also creates an Amazon Redshift external schema in the Amazon Redshift cluster created by the CloudFormation stack.

  1. You can now query the Hudi table in Amazon Athena or Amazon Redshift. Visit Creating external tables for data managed in Apache Hudi or Considerations and Limitations to query Apache Hudi datasets in Amazon Athena for details.
  2. Run IncrementalUpdatesAndInserts_TestStep2.sql on the source Aurora cluster.

This incremental data is also replicated to the raw S3 bucket through AWS DMS. HudiJob picks up the incremental data, using AWS Glue bookmarks, and applies it to the Hudi table created earlier.

  1. You can now query the changed data.

Creating your CloudFormation stack

Click on the Launch Stack button to get started and provide the following parameters:

Parameter Description
VpcCIDR CIDR range for the VPC.
PrivateSubnet1CIDR CIDR range for the first private subnet.
PrivateSubnet2CIDR CIDR range for the second private subnet.
PublicSubnetCIDR CIDR range for the public subnet.
AuroraDBMasterUserPassword Primary user password for the Aurora cluster.
RedshiftDWMasterUserPassword Primary user password for the Amazon Redshift data warehouse.
KeyName The EC2 key pair to be configured in the EC2 instance on the public subnet. This EC2 instance is used to get to the Aurora cluster in the private subnet. Select the value from the dropdown.
ClientIPCIDR Your IP address in CIDR notation. The CloudFormation template creates a security group rule that grants ingress on port 22 to this IP address. On a Mac, you can run the following command to get your IP address: curl ipecho.net/plain ; echo /32
EC2ImageId The image ID used to create the EC2 instance in the public subnet to be a jump box to connect to the source Aurora cluster. If you supply your image ID, the template uses it to create the EC2 instance.
HudiStorageType This is used by the AWS Glue job to determine if you want to create a CoW or MoR storage type table. Enter MoR if you want to create MoR storage type tables.
ScheduleToRunGlueJob The AWS Glue job runs on a schedule to pick the new files and load to the curated bucket. This parameter sets the schedule of the job.
DMSBatchUnloadIntervalInSecs AWS DMS batches the inputs from the source and loads the output to the raw bucket. This parameter defines the frequency at which the data is loaded to the raw bucket.
GlueJobDPUs The number of DPUs that are assigned to the two AWS Glue jobs.

To simplify running the template, your account is given permissions on the key used to encrypt the resources in the CloudFormation template. You can restrict that to the role if desired.

Granting Lake Formation permissions

AWS Lake Formation enables customers to set up fine-grained access control for their data lake. Detailed steps to set up AWS Lake Formation can be found here.

Setting up AWS Lake Formation is out of scope for this post. However, if you have Lake Formation configured in the Region where you’re deploying this template, grant Create database permission to the LakeHouseExecuteGlueHudiJobRole role after the CloudFormation stack is successfully created.

This will ensure that you don’t get the following error while running your AWS Glue job.

org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Insufficient Lake Formation permission(s) on global_temp

Similarly, grant the Describe permission to the LakeHouseExecuteGlueHudiJobRole role on the default database.

This will ensure that you don’t get the following error while running your AWS Glue job.

AnalysisException: 'java.lang.RuntimeException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: Insufficient Lake Formation permission(s) on default (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException;
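
If you prefer to script these two grants instead of using the console, a sketch with boto3 might look like the following; the account ID is a placeholder.

import boto3

lf = boto3.client("lakeformation")
ACCOUNT_ID = "111122223333"  # placeholder
ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/LakeHouseExecuteGlueHudiJobRole"

# Allow the AWS Glue job role to create databases in the Data Catalog.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": ROLE_ARN},
    Resource={"Catalog": {}},
    Permissions=["CREATE_DATABASE"],
)

# Allow the AWS Glue job role to describe the default database.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": ROLE_ARN},
    Resource={"Database": {"Name": "default"}},
    Permissions=["DESCRIBE"],
)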

Connecting to source Aurora cluster

To connect to source Aurora cluster using SQL Workbench, complete the following steps:

  1. On SQL Workbench, under File, choose Connect window.

  1. Choose Manage Drivers.

  1. Choose PostgreSQL.
  2. For Library, use the driver JAR file.
  3. For Classname, enter org.postgresql.Driver.
  4. For Sample URL, enter jdbc:postgresql://host:port/name_of_database.

  1. Click the Create a new connection profile button.
  2. For Driver, choose your new PostgreSQL driver.
  3. For URL, enter lakehouse_source_db after port/.
  4. For Username, enter postgres.
  5. For Password, enter the same password that you used for the AuroraDBMasterUserPassword parameter while creating the CloudFormation stack.
  6. Choose SSH.
  7. On the Outputs tab of your CloudFormation stack, copy the IP address next to PublicIPOfEC2InstanceForTunnel and enter it for SSH hostname.
  8. For SSH port, enter 22.
  9. For Username, enter ec2-user.
  10. For Private key file, enter the private key for the public key chosen in the KeyName parameter of the CloudFormation stack.
  11. For Local port, enter any available local port number.
  12. On the Outputs tab of your stack, copy the value next to EndpointOfAuroraCluster and enter it for DB hostname.
  13. For DB port, enter 5432.
  14. Select Rewrite JDBC URL.


Selecting the Rewrite JDBC URL check box automatically fills in the host and port values in the URL text box, as shown below.

  1. Test the connection and make sure that you get a message that the connection was successful.

 

Troubleshooting

Complete the following steps if you receive this message: Could not initialize SSH tunnel: java.net.ConnectException: Operation timed out (Connection timed out)

  1. Go to your CloudFormation stack and search for LakeHouseSecurityGroup under Resources .
  2. Choose the link in the Physical ID.

  1. Select your security group.
  2. From the Actions menu, choose Edit inbound rules.

  1. Look for the rule with the description: Rule to allow connection from the SQL client to the EC2 instance used as jump box for SSH tunnel.
  2. From the Source menu, choose My IP.
  3. Choose Save rules.

  1. Test the connection from your SQL Workbench again and make sure that you get a successful message.

Running the initial load script

You’re now ready to run the InitLoad_TestStep1.sql script to create some test data.

  1. Open InitLoad_TestStep1.sql in your SQL client and run it.

The output shows that 11 statements have been run.

AWS DMS replicates these inserts to your raw S3 bucket at the frequency set in the DMSBatchUnloadIntervalInSecs parameter of your CloudFormation stack.

  1. On the AWS DMS console, choose the lakehouse-aurora-src-to-raw-s3-tgt task:
  2. On the Table statistics tab, you should see that the seven full load rows of employee_details have been replicated.

The lakehouse-aurora-src-to-raw-s3-tgt replication task has the following table mapping with transformation to add a schema name and a table name as additional columns:

{
   "rules":[
      {
         "rule-type":"selection",
         "rule-id":"1",
         "rule-name":"1",
         "object-locator":{
            "schema-name":"human_resources",
            "table-name":"%"
         },
         "rule-action":"include",
         "filters":[
            
         ]
      },
      {
         "rule-type":"transformation",
         "rule-id":"2",
         "rule-name":"2",
         "rule-target":"column",
         "object-locator":{
            "schema-name":"%",
            "table-name":"%"
         },
         "rule-action":"add-column",
         "value":"schema_name",
         "expression":"$SCHEMA_NAME_VAR",
         "data-type":{
            "type":"string",
            "length":50
         }
      },
      {
         "rule-type":"transformation",
         "rule-id":"3",
         "rule-name":"3",
         "rule-target":"column",
         "object-locator":{
            "schema-name":"%",
            "table-name":"%"
         },
         "rule-action":"add-column",
         "value":"table_name",
         "expression":"$TABLE_NAME_VAR",
         "data-type":{
            "type":"string",
            "length":50
         }
      }
   ]
}

These settings put the name of the source schema and table as two additional columns in the output Parquet file of AWS DMS.
These columns are used in the AWS Glue HudiJob to find out the tables that have new inserts, updates, or deletes.

  1. On the Resources tab of the CloudFormation stack, locate RawS3Bucket.
  2. Choose the Physical ID link.

  1. Navigate to human_resources/employee_details.

The LOAD00000001.parquet file is created under human_resources/employee_details. (The name of your raw bucket is different from the following screenshot).

You can also see the time of creation of this file. You should have at least one successful run of the AWS Glue job (HudiJob) after this time for the Hudi table to be created. The AWS Glue job is configured to load this data into the curated bucket at the frequency set in the ScheduleToRunGlueJob parameter of your CloudFormation stack. The default is 5 minutes.

AWS Glue job HudiJob

The following code is the script for HudiJob:

import sys
import os
import json

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import concat, col, lit, to_timestamp

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

import boto3
from botocore.exceptions import ClientError

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

spark = SparkSession.builder.config('spark.serializer','org.apache.spark.serializer.KryoSerializer').getOrCreate()
glueContext = GlueContext(spark.sparkContext)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
logger = glueContext.get_logger()

logger.info('Initialization.')
glueClient = boto3.client('glue')
ssmClient = boto3.client('ssm')
redshiftDataClient = boto3.client('redshift-data')

logger.info('Fetching configuration.')
region = os.environ['AWS_DEFAULT_REGION']

curatedS3BucketName = ssmClient.get_parameter(Name='lakehouse-curated-s3-bucket-name')['Parameter']['Value']
rawS3BucketName = ssmClient.get_parameter(Name='lakehouse-raw-s3-bucket-name')['Parameter']['Value']
hudiStorageType = ssmClient.get_parameter(Name='lakehouse-hudi-storage-type')['Parameter']['Value']

dropColumnList = ['db','table_name','Op']

logger.info('Getting list of schema.tables that have changed.')
changeTableListDyf = glueContext.create_dynamic_frame_from_options(connection_type = 's3', connection_options = {'paths': ['s3://'+rawS3BucketName], 'groupFiles': 'inPartition', 'recurse':True}, format = 'parquet', format_options={}, transformation_ctx = 'changeTableListDyf')

logger.info('Processing starts.')
if(changeTableListDyf.count() > 0):
    logger.info('Got new files to process.')
    changeTableList = changeTableListDyf.toDF().select('schema_name','table_name').distinct().rdd.map(lambda row : row.asDict()).collect()

    for dbName in set([d['schema_name'] for d in changeTableList]):
        spark.sql('CREATE DATABASE IF NOT EXISTS ' + dbName)
        redshiftDataClient.execute_statement(ClusterIdentifier='lakehouse-redshift-cluster', Database='lakehouse_dw', DbUser='rs_admin', Sql='CREATE EXTERNAL SCHEMA IF NOT EXISTS ' + dbName + ' FROM DATA CATALOG DATABASE \'' + dbName + '\' REGION \'' + region + '\' IAM_ROLE \'' + boto3.client('iam').get_role(RoleName='LakeHouseRedshiftGlueAccessRole')['Role']['Arn'] + '\'')

    for i in changeTableList:
        logger.info('Looping for ' + i['schema_name'] + '.' + i['table_name'])
        dbName = i['schema_name']
        tableNameCatalogCheck = ''
        tableName = i['table_name']
        if(hudiStorageType == 'MoR'):
            tableNameCatalogCheck = i['table_name'] + '_ro' #Assumption is that if _ro table exists then _rt table will also exist. Hence we are checking only for _ro.
        else:
            tableNameCatalogCheck = i['table_name'] #The default config in the CF template is CoW. So assumption is that if the user hasn't explicitly requested to create MoR storage type table then we will create CoW tables. Again, if the user overwrites the config with any value other than 'MoR' we will create CoW storage type tables.
        isTableExists = False
        isPrimaryKey = False
        isPartitionKey = False
        primaryKey = ''
        partitionKey = ''
        try:
            glueClient.get_table(DatabaseName=dbName,Name=tableNameCatalogCheck)
            isTableExists = True
            logger.info(dbName + '.' + tableNameCatalogCheck + ' exists.')
        except ClientError as e:
            if e.response['Error']['Code'] == 'EntityNotFoundException':
                isTableExists = False
                logger.info(dbName + '.' + tableNameCatalogCheck + ' does not exist. Table will be created.')
        try:
            table_config = json.loads(ssmClient.get_parameter(Name='lakehouse-table-' + dbName + '.' + tableName)['Parameter']['Value'])
            try:
                primaryKey = table_config['primaryKey']
                isPrimaryKey = True
                logger.info('Primary key:' + primaryKey)
            except KeyError as e:
                isPrimaryKey = False
                logger.info('Primary key not found. An append only glueparquet table will be created.')
            try:
                partitionKey = table_config['partitionKey']
                isPartitionKey = True
                logger.info('Partition key:' + partitionKey)
            except KeyError as e:
                isPartitionKey = False
                logger.info('Partition key not found. Partitions will not be created.')
        except ClientError as e:    
            if e.response['Error']['Code'] == 'ParameterNotFound':
                isPrimaryKey = False
                isPartitionKey = False
                logger.info('Config for ' + dbName + '.' + tableName + ' not found in parameter store. Non partitioned append only table will be created.')

        inputDyf = glueContext.create_dynamic_frame_from_options(connection_type = 's3', connection_options = {'paths': ['s3://' + rawS3BucketName + '/' + dbName + '/' + tableName], 'groupFiles': 'none', 'recurse':True}, format = 'parquet',transformation_ctx = tableName)
        
        inputDf = inputDyf.toDF().withColumn('update_ts_dms',to_timestamp(col('update_ts_dms')))
        
        targetPath = 's3://' + curatedS3BucketName + '/' + dbName + '/' + tableName

        morConfig = {'hoodie.datasource.write.storage.type': 'MERGE_ON_READ', 'hoodie.compact.inline': 'false', 'hoodie.compact.inline.max.delta.commits': 20, 'hoodie.parquet.small.file.limit': 0}

        commonConfig = {'className' : 'org.apache.hudi', 'hoodie.datasource.hive_sync.use_jdbc':'false', 'hoodie.datasource.write.precombine.field': 'update_ts_dms', 'hoodie.datasource.write.recordkey.field': primaryKey, 'hoodie.table.name': tableName, 'hoodie.consistency.check.enabled': 'true', 'hoodie.datasource.hive_sync.database': dbName, 'hoodie.datasource.hive_sync.table': tableName, 'hoodie.datasource.hive_sync.enable': 'true'}

        partitionDataConfig = {'hoodie.datasource.write.partitionpath.field': partitionKey, 'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor', 'hoodie.datasource.hive_sync.partition_fields': partitionKey}
                     
        unpartitionDataConfig = {'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor', 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator'}
        
        incrementalConfig = {'hoodie.upsert.shuffle.parallelism': 20, 'hoodie.datasource.write.operation': 'upsert', 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS', 'hoodie.cleaner.commits.retained': 10}
        
        initLoadConfig = {'hoodie.bulkinsert.shuffle.parallelism': 3, 'hoodie.datasource.write.operation': 'bulk_insert'}
        
        deleteDataConfig = {'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}

        if(hudiStorageType == 'MoR'):
            commonConfig = {**commonConfig, **morConfig}
            logger.info('MoR config appended to commonConfig.')
        
        combinedConf = {}

        if(isPrimaryKey):
            logger.info('Going the Hudi way.')
            if(isTableExists):
                logger.info('Incremental load.')
                outputDf = inputDf.filter("Op != 'D'").drop(*dropColumnList)
                if outputDf.count() > 0:
                    logger.info('Upserting data.')
                    if (isPartitionKey):
                        logger.info('Writing to partitioned Hudi table.')
                        outputDf = outputDf.withColumn(partitionKey,concat(lit(partitionKey+'='),col(partitionKey)))
                        combinedConf = {**commonConfig, **partitionDataConfig, **incrementalConfig}
                        outputDf.write.format('org.apache.hudi').options(**combinedConf).mode('Append').save(targetPath)
                    else:
                        logger.info('Writing to unpartitioned Hudi table.')
                        combinedConf = {**commonConfig, **unpartitionDataConfig, **incrementalConfig}
                        outputDf.write.format('org.apache.hudi').options(**combinedConf).mode('Append').save(targetPath)
                outputDf_deleted = inputDf.filter("Op = 'D'").drop(*dropColumnList)
                if outputDf_deleted.count() > 0:
                    logger.info('Some data got deleted.')
                    if (isPartitionKey):
                        logger.info('Deleting from partitioned Hudi table.')
                        outputDf_deleted = outputDf_deleted.withColumn(partitionKey,concat(lit(partitionKey+'='),col(partitionKey)))
                        combinedConf = {**commonConfig, **partitionDataConfig, **incrementalConfig, **deleteDataConfig}
                        outputDf_deleted.write.format('org.apache.hudi').options(**combinedConf).mode('Append').save(targetPath)
                    else:
                        logger.info('Deleting from unpartitioned Hudi table.')
                        combinedConf = {**commonConfig, **unpartitionDataConfig, **incrementalConfig, **deleteDataConfig}
                        outputDf_deleted.write.format('org.apache.hudi').options(**combinedConf).mode('Append').save(targetPath)
            else:
                outputDf = inputDf.drop(*dropColumnList)
                if outputDf.count() > 0:
                    logger.info('Initial load.')
                    if (isPartitionKey):
                        logger.info('Writing to partitioned Hudi table.')
                        outputDf = outputDf.withColumn(partitionKey,concat(lit(partitionKey+'='),col(partitionKey)))
                        combinedConf = {**commonConfig, **partitionDataConfig, **initLoadConfig}
                        outputDf.write.format('org.apache.hudi').options(**combinedConf).mode('Overwrite').save(targetPath)
                    else:
                        logger.info('Writing to unpartitioned Hudi table.')
                        combinedConf = {**commonConfig, **unpartitionDataConfig, **initLoadConfig}
                        outputDf.write.format('org.apache.hudi').options(**combinedConf).mode('Overwrite').save(targetPath)
        else:
            if (isPartitionKey):
                logger.info('Writing to partitioned glueparquet table.')
                sink = glueContext.getSink(connection_type = 's3', path= targetPath, enableUpdateCatalog = True, updateBehavior = 'UPDATE_IN_DATABASE', partitionKeys=[partitionKey])
            else:
                logger.info('Writing to unpartitioned glueparquet table.')
                sink = glueContext.getSink(connection_type = 's3', path= targetPath, enableUpdateCatalog = True, updateBehavior = 'UPDATE_IN_DATABASE')
            sink.setFormat('glueparquet')
            sink.setCatalogInfo(catalogDatabase = dbName, catalogTableName = tableName)
            outputDyf = DynamicFrame.fromDF(inputDf.drop(*dropColumnList), glueContext, 'outputDyf')
            sink.writeFrame(outputDyf)

job.commit()

Hudi tables need a primary key to perform upserts. Hudi tables can also be partitioned based on a certain key. We get the names of the primary key and the partition key from AWS Systems Manager Parameter Store.

The HudiJob script looks for an AWS Systems Manager Parameter with the naming format lakehouse-table-<schema_name>.<table_name>. It builds the parameter name from the values of the schema_name and table_name columns added by AWS DMS, and reads the primary key and the partition key for the Hudi table from the parameter value.

The CloudFormation template creates the lakehouse-table-human_resources.employee_details AWS Systems Manager Parameter, as shown on the Resources tab.

If you choose the Physical ID link, you can locate the value of the AWS Systems Manager Parameter: {"primaryKey": "emp_no", "partitionKey": "department"}.

Because of this value, the AWS Glue script creates the human_resources.employee_details Hudi table partitioned on the department column for the employee_details table that the InitLoad_TestStep1.sql script created in the source. The HudiJob also uses the emp_no column as the primary key for upserts.

If you reuse this CloudFormation template and create your own table, you have to create an associated AWS Systems Manager Parameter with the naming convention lakehouse-table-<schema_name>.<table_name>. Keep in mind the following:

  • If you don’t create a parameter, the script creates an unpartitioned glueparquet append-only table.
  • If you create a parameter that only has the primaryKey part in the value, the script creates an unpartitioned Hudi table.
  • If you create a parameter that only has the partitionKey part in the value, the script creates a partitioned glueparquet append-only table.

If you have too many tables to replicate, you can also store the primary key and partition key configuration in Amazon DynamoDB or Amazon S3 and change the code accordingly.
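If you script the setup instead of relying on the CloudFormation template, a minimal boto3 sketch for registering such a parameter could look like the following. The schema, table, and key names in it are placeholders for your own table; HudiJob later reads the value back with get_parameter, as shown in the job script.

import json
import boto3

ssm = boto3.client('ssm')

# Placeholder configuration; replace the key names with the columns of your own table.
table_config = {'primaryKey': 'order_id', 'partitionKey': 'order_date'}

# The parameter name follows the convention lakehouse-table-<schema_name>.<table_name>.
ssm.put_parameter(
    Name='lakehouse-table-sales.order_details',
    Type='String',
    Value=json.dumps(table_config),
    Overwrite=True
)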

In the InitLoad_TestStep1.sql script, the replica identity of the human_resources.employee_details table is set to FULL. This makes sure that AWS DMS transfers the full delete record to Amazon S3, which the HudiJob script needs in order to delete the record from the Hudi table. A full delete record from AWS DMS for the human_resources.employee_details table looks like the following:

{ "Op": "D", "update_ts_dms": "2020-10-25 07:57:48.589284", "emp_no": 3, "name": "Jeff", "department": "Finance", "city": "Tokyo", "salary": 55000, "schema_name": "human_resources", "table_name": "employee_details"}

The schema_name and table_name columns are added by AWS DMS because of the task configuration shared previously. update_ts_dms is set as the value for the TimestampColumnName setting on the AWS DMS S3 endpoint. Op is added by AWS DMS for CDC and indicates the source database operation in the migrated S3 data.

We also set spark.serializer in the script. This setting is required for Hudi.

In the HudiJob script, you can also find a few Python dicts that store various Hudi configuration properties. These configurations are for demo purposes only; you have to adjust them based on your workload. For more information about Hudi configurations, see Configurations.

HudiJob is scheduled to run every 5 minutes by default. The frequency is set by the ScheduleToRunGlueJob parameter of the CloudFormation template. Make sure that you run HudiJob successfully at least one time after the source data lands in the raw S3 bucket. The screenshot in Step 6 of the Running the initial load script section confirms that AWS DMS put the LOAD00000001.parquet file in the raw bucket at 11:54:41 AM, and the following screenshot confirms that the job run started at 11:55 AM.

The job creates a Hudi table in the AWS Glue Data Catalog (see the following screenshot). The table is partitioned on the department column.

Granting AWS Lake Formation permissions

If you have AWS Lake Formation enabled, make sure that you grant Select permission on the human_resources.employee_details table to the role or user used to run the Athena query. Similarly, grant Select permission on the same table to the LakeHouseRedshiftGlueAccessRole role so you can query human_resources.employee_details in Amazon Redshift.

Grant Drop permission on the human_resources database to LakeHouseExecuteLambdaFnsRole so that the stack can delete the database when you delete the CloudFormation stack. Also note that the CloudFormation template does not roll back any AWS Lake Formation grants or changes that you apply manually.
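If you prefer to script these grants instead of using the Lake Formation console, a minimal boto3 sketch could look like the following. The account ID and the Athena role name are placeholders; the database, table, and LakeHouseExecuteLambdaFnsRole names come from this walkthrough.

import boto3

lf = boto3.client('lakeformation')

# Placeholder ARN of the role or user that runs the Athena query.
athena_principal = 'arn:aws:iam::111122223333:role/MyAthenaQueryRole'

# Grant Select on the Hudi table registered in the Data Catalog.
# Repeat the same call for LakeHouseRedshiftGlueAccessRole to query from Amazon Redshift.
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': athena_principal},
    Resource={'Table': {'DatabaseName': 'human_resources', 'Name': 'employee_details'}},
    Permissions=['SELECT']
)

# Grant Drop on the database so the stack can remove it during teardown.
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::111122223333:role/LakeHouseExecuteLambdaFnsRole'},
    Resource={'Database': {'Name': 'human_resources'}},
    Permissions=['DROP']
)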

Granting access to KMS key

The curated S3 bucket is encrypted by lakehouse-key, which is an AWS Key Management Service (AWS KMS) customer managed key created by the CloudFormation template.

To run the query in Athena, you have to add the ARN of the role/user used to run the Athena query in the Allow use of the key section in the key policy.

This ensures that you don’t get a com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied) error while running your Athena query.

You might not have to make this key policy change if you kept the default of granting access to the AWS account and the role or user used to run the Athena query already has the necessary KMS-related policies attached.
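For illustration only, the following boto3 sketch appends an "Allow use of the key" statement to the key policy; the key ID and the role ARN are placeholders, and you can make the same change in the AWS KMS console instead.

import json
import boto3

kms = boto3.client('kms')

# Placeholders: the key ID of lakehouse-key and the principal that runs the Athena query.
key_id = '1234abcd-12ab-34cd-56ef-1234567890ab'
athena_principal = 'arn:aws:iam::111122223333:role/MyAthenaQueryRole'

policy = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName='default')['Policy'])

# Append an "Allow use of the key" statement for the Athena principal.
policy['Statement'].append({
    'Sid': 'Allow use of the key for Athena queries',
    'Effect': 'Allow',
    'Principal': {'AWS': athena_principal},
    'Action': ['kms:Decrypt', 'kms:DescribeKey', 'kms:Encrypt',
               'kms:GenerateDataKey*', 'kms:ReEncrypt*'],
    'Resource': '*'
})

kms.put_key_policy(KeyId=key_id, PolicyName='default', Policy=json.dumps(policy))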

Confirming job completion

When HudiJob is complete, you can see the files in the curated bucket.

  1. On the Resources tab, search for CuratedS3Bucket.
  2. Choose the Physical ID link.

The following screenshot shows the timestamp on the initial load.

  3. Navigate to the department=Finance prefix and select the Parquet file.
  4. Choose Select from.
  5. For File format, select Parquet.
  6. Choose Show file preview.

You can see the value of the timestamp in the update_ts_dms column.

Querying the Hudi table

You can now query your data in Amazon Athena or Amazon Redshift.

Querying in Amazon Athena

Query the human_resources.employee_details table in Amazon Athena with the following code:

SELECT emp_no,
         name,
         city,
         salary,
         department,
         from_unixtime(update_ts_dms/1000000,'America/Los_Angeles') update_ts_dms_LA,
         from_unixtime(update_ts_dms/1000000,'UTC') update_ts_dms_UTC         
FROM "human_resources"."employee_details"
ORDER BY emp_no

The timestamp for all the records matches the timestamp in the update_ts_dms column in the earlier screenshot.
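If you want to run the same check programmatically instead of in the Athena console, a boto3 sketch such as the following works. The results bucket is a placeholder, and the timestamp conversion is omitted for brevity.

import time
import boto3

athena = boto3.client('athena')

query = 'SELECT emp_no, name, city, salary, department FROM "human_resources"."employee_details" ORDER BY emp_no'

# Placeholder output location; point it at a bucket you own.
run = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/hudi/'}
)

query_id = run['QueryExecutionId']
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

if state == 'SUCCEEDED':
    # The first row returned is the header row.
    for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])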

Querying in Redshift Spectrum

Before you query your table in Redshift Spectrum, read about Apache Hudi support in Amazon Redshift.

  1. On the Amazon Redshift console, locate lakehouse-redshift-cluster.
  2. Choose Query cluster.

  3. For Database name, enter lakehouse_dw.
  4. For Database user, enter rs_admin.
  5. For Database password, enter the password that you used for the RedshiftDWMasterUserPassword parameter in the CloudFormation template.

  6. Enter the following query for the human_resources.employee_details table:
    SELECT emp_no,
             name,
             city,
             salary,
             department,
             (TIMESTAMP 'epoch' + update_ts_dms/1000000 * interval '1 second') AT TIME ZONE 'utc' AT TIME ZONE 'america/los_angeles' update_ts_dms_LA,
             (TIMESTAMP 'epoch' + update_ts_dms/1000000 * interval '1 second') AT TIME ZONE 'utc' update_ts_dms_UTC
    FROM human_resources.employee_details
    ORDER BY emp_no 

The following screenshot shows the query output.
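You can also submit the same query through the Amazon Redshift Data API, the same client that the HudiJob script uses to create external schemas. The following is a minimal sketch; the timestamp conversion is again omitted for brevity.

import time
import boto3

rsd = boto3.client('redshift-data')

sql = 'SELECT emp_no, name, city, salary, department FROM human_resources.employee_details ORDER BY emp_no'

resp = rsd.execute_statement(
    ClusterIdentifier='lakehouse-redshift-cluster',
    Database='lakehouse_dw',
    DbUser='rs_admin',
    Sql=sql
)

# Wait for the statement to finish, then print the result rows.
statement_id = resp['Id']
while rsd.describe_statement(Id=statement_id)['Status'] not in ('FINISHED', 'FAILED', 'ABORTED'):
    time.sleep(2)

for record in rsd.get_statement_result(Id=statement_id)['Records']:
    print([list(field.values())[0] for field in record])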

Running the incremental load script

We now run the IncrementalUpdatesAndInserts_TestStep2.sql script. The output shows that 6 statements were run.

AWS DMS now shows that it has replicated the new incremental changes. The changes are replicated at the frequency set in the DMSBatchUnloadIntervalInSecs parameter of the CloudFormation stack.

This creates another Parquet file in the raw S3 bucket.

The incremental updates are loaded into the Hudi table according to the chosen frequency to run the job (the ScheduleToRunGlueJob parameter). The HudiJob script uses job bookmarks to identify the incremental load, so it only processes the new files brought in through AWS DMS.

Confirming job completion

Make sure that HudiJob runs successfully at least one time after the incremental file arrives in the raw bucket. The previous screenshot shows that the incremental file arrived in the raw bucket at 1:18:38 PM and the following screenshot shows that the job started at 1:20 PM.

Querying the changed data

You can now check the table in Athena and Amazon Redshift. Both results show that emp_no 3 is deleted, 8 and 9 have been added, and 2 and 5 have been updated.

The following screenshot shows the results in Athena.

The following screenshot shows the results in Redshift Spectrum.

AWS Glue Job HudiMoRCompactionJob

The CloudFormation template also deploys the AWS Glue job HudiMoRCompactionJob. This job is not scheduled; you only use it if you choose the MoR storage type. To run the pipeline with the MoR storage type instead of the CoW storage type, delete the CloudFormation stack and create it again. After creation, replace CoW with MoR in the lakehouse-hudi-storage-type AWS Systems Manager Parameter.

If you use the MoR storage type, the incremental updates are stored in log files. You can’t see the updates in the _ro (read optimized) view, but you can see them in the _rt view. The Amazon Athena and Amazon Redshift documentation give more details about support and considerations for Apache Hudi.

To see the incremental data in the _ro view, run the HudiMoRCompactionJob job. For more information about Hudi storage types and views, see Hudi Dataset Storage Types and Storage Types & Views. The following code is an example of the CLI command used to run the HudiMoRCompactionJob job:

aws glue start-job-run --job-name HudiMoRCompactionJob --arguments="--DB_NAME=human_resources","--TABLE_NAME=employee_details","--IS_PARTITIONED=true"

You can decide on the frequency of running this job. You don’t have to run the job immediately after HudiJob; run it when you want the data to be available in the _ro view. You have to pass the schema name and the table name to this script so it knows which table to compact.
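If you want to start the compaction from Python instead of the AWS CLI, for example from a scheduled AWS Lambda function, a boto3 sketch could look like the following.

import boto3

glue = boto3.client('glue')

# Start the compaction job for the employee_details MoR table.
response = glue.start_job_run(
    JobName='HudiMoRCompactionJob',
    Arguments={
        '--DB_NAME': 'human_resources',
        '--TABLE_NAME': 'employee_details',
        '--IS_PARTITIONED': 'true'
    }
)
print('Started job run: ' + response['JobRunId'])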

Additional considerations

The JAR file we use in this post has not been tested for AWS Glue streaming jobs. Additionally, there are some hardcoded Hudi options in the HudiJob script. These options are set for the sample table that we create for this post. Update the options based on your workload. 

Conclusion

In this post, we created AWS Glue 2.0 jobs that moved the source upserts and deletes into Hudi tables. The code creates tables in the AWS Glue Data Catalog and updates partitions so you don’t have to run crawlers to update them.

This post simplified your LakeHouse code base by giving you the benefits of Apache Hudi along with serverless AWS Glue. We also showed how to create a source-to-LakeHouse replication system using AWS Glue, AWS DMS, and Amazon Redshift with minimal overhead.


Appendix

We can write to Hudi tables because of the hudi-spark.jar file that we downloaded to our DependentJarsAndTempS3Bucket S3 bucket with the CloudFormation template. The path to this file is added as a dependency in both the AWS Glue jobs. This file is based on open-source Hudi. To create the JAR file, complete the following steps:

  1. Get Hudi 0.5.3 and unzip it using the following code:
    wget https://github.com/apache/hudi/archive/release-0.5.3.zip
    unzip release-0.5.3.zip

  2. Edit Hudi pom.xml:
    vi hudi-release-0.5.3/pom.xml

    1. Remove the following code to make the build process faster:
      <module>packaging/hudi-hadoop-mr-bundle</module>
      <module>packaging/hudi-hive-bundle</module>
      <module>packaging/hudi-presto-bundle</module>
      <module>packaging/hudi-utilities-bundle</module>
      <module>packaging/hudi-timeline-server-bundle</module>
      <module>docker/hoodie/hadoop</module>
      <module>hudi-integ-test</module>

    2. Change the versions of all three dependencies of httpcomponents to 4.4.1. The following is the original code:
      <!-- Httpcomponents -->
            <dependency>
              <groupId>org.apache.httpcomponents</groupId>
              <artifactId>fluent-hc</artifactId>
              <version>4.3.2</version>
            </dependency>
            <dependency>
              <groupId>org.apache.httpcomponents</groupId>
              <artifactId>httpcore</artifactId>
              <version>4.3.2</version>
            </dependency>
            <dependency>
              <groupId>org.apache.httpcomponents</groupId>
              <artifactId>httpclient</artifactId>
              <version>4.3.6</version>
            </dependency>

      The following is the replacement code:

      <!-- Httpcomponents -->
            <dependency>
              <groupId>org.apache.httpcomponents</groupId>
              <artifactId>fluent-hc</artifactId>
              <version>4.4.1</version>
            </dependency>
            <dependency>
              <groupId>org.apache.httpcomponents</groupId>
              <artifactId>httpcore</artifactId>
              <version>4.4.1</version>
            </dependency>
            <dependency>
              <groupId>org.apache.httpcomponents</groupId>
              <artifactId>httpclient</artifactId>
              <version>4.4.1</version>
            </dependency>

  3. Build the JAR file:
    mvn clean package -DskipTests -DskipITs -f <Full path of the hudi-release-0.5.3 dir>

  4. You can now get the JAR from the following location:
hudi-release-0.5.3/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.5.3-rc2.jar

The other JAR dependency used in the AWS Glue jobs is spark-avro_2.11-2.4.4.jar.


About the Author

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey at AWS, Vishal helped customers implement BI, data warehouse, and data lake projects in the US and Australia.

AWS serverless data analytics pipeline reference architecture

Post Syndicated from Praful Kava original https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/

Onboarding new data or building new analytics pipelines in traditional analytics architectures typically requires extensive coordination across business, data engineering, and data science and analytics teams to first negotiate requirements, schema, infrastructure capacity needs, and workload management.

For a large number of use cases today, however, business users, data scientists, and analysts are demanding easy, frictionless, self-service options to build end-to-end data pipelines, because it’s hard and inefficient to predefine constantly changing schemas and spend time negotiating capacity slots on shared infrastructure. The exploratory nature of machine learning (ML) and many analytics tasks means you need to rapidly ingest new datasets and clean, normalize, and feature engineer them without worrying about the operational overhead of the infrastructure that runs your data pipelines.

A serverless data lake architecture enables agile and self-service data onboarding and analytics for all data consumer roles across a company. By using AWS serverless technologies as building blocks, you can rapidly and interactively build data lakes and data processing pipelines to ingest, store, transform, and analyze petabytes of structured and unstructured data from batch and streaming sources, all without needing to manage any storage or compute infrastructure.

In this post, we first discuss a layered, component-oriented logical architecture of modern analytics platforms and then present a reference architecture for building a serverless data platform that includes a data lake, data processing pipelines, and a consumption layer that enables several ways to analyze the data in the data lake without moving it (including business intelligence (BI) dashboarding, exploratory interactive SQL, big data processing, predictive analytics, and ML).

Logical architecture of modern data lake centric analytics platforms

The following diagram illustrates the architecture of a data lake centric analytics platform.

You can envision a data lake centric analytics architecture as a stack of six logical layers, where each layer is composed of multiple components. A layered, component-oriented architecture promotes separation of concerns, decoupling of tasks, and flexibility. These in turn provide the agility needed to quickly integrate new data sources, support new analytics methods, and add tools required to keep up with the accelerating pace of changes in the analytics landscape. In the following sections, we look at the key responsibilities, capabilities, and integrations of each logical layer.

Ingestion layer

The ingestion layer is responsible for bringing data into the data lake. It provides the ability to connect to internal and external data sources over a variety of protocols. It can ingest batch and streaming data into the storage layer. The ingestion layer is also responsible for delivering ingested data to a diverse set of targets in the data storage layer (including the object store, databases, and warehouses).

Storage layer

The storage layer is responsible for providing durable, scalable, secure, and cost-effective components to store vast quantities of data. It supports storing unstructured data and datasets of a variety of structures and formats. It supports storing source data as-is without first needing to structure it to conform to a target schema or format. Components from all other layers provide easy and native integration with the storage layer. To store data based on its consumption readiness for different personas across the organization, the storage layer is organized into the following zones:

  • Landing zone – The storage area where components from the ingestion layer land data. This is a transient area where data is ingested from sources as-is. Typically, data engineering personas interact with the data stored in this zone.
  • Raw zone – After the preliminary quality checks, the data from the landing zone is moved to the raw zone for permanent storage. Here, data is stored in its original format. Having all data from all sources permanently stored in the raw zone provides the ability to “replay” downstream data processing in case of errors or data loss in downstream storage zones. Typically, data engineering and data science personas interact with the data stored in this zone.
  • Curated zone – This zone hosts data that is in the most consumption-ready state and conforms to organizational standards and data models. Datasets in the curated zone are typically partitioned, cataloged, and stored in formats that support performant and cost-effective access by the consumption layer. The processing layer creates datasets in the curated zone after cleaning, normalizing, standardizing, and enriching data from the raw zone. All personas across organizations use the data stored in this zone to drive business decisions.

Cataloging and search layer

The cataloging and search layer is responsible for storing business and technical metadata about datasets hosted in the storage layer. It provides the ability to track schema and the granular partitioning of dataset information in the lake. It also supports mechanisms to track versions to keep track of changes to the metadata. As the number of datasets in the data lake grows, this layer makes datasets in the data lake discoverable by providing search capabilities.

Processing layer

The processing layer is responsible for transforming data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. It’s responsible for advancing the consumption readiness of datasets along the landing, raw, and curated zones and registering metadata for the raw and transformed data into the cataloging layer. The processing layer is composed of purpose-built data-processing components to match the right dataset characteristic and processing task at hand. The processing layer can handle large data volumes and support schema-on-read, partitioned data, and diverse data formats. The processing layer also provides the ability to build and orchestrate multi-step data processing pipelines that use purpose-built components for each step.

Consumption layer

The consumption layer is responsible for providing scalable and performant tools to gain insights from the vast amount of data in the data lake. It democratizes analytics across all personas across the organization through several purpose-built analytics tools that support analysis methods, including SQL, batch analytics, BI dashboards, reporting, and ML. The consumption layer natively integrates with the data lake’s storage, cataloging, and security layers. Components in the consumption layer support schema-on-read, a variety of data structures and formats, and use data partitioning for cost and performance optimization.

Security and governance layer

The security and governance layer is responsible for protecting the data in the storage layer and processing resources in all other layers. It provides mechanisms for access control, encryption, network protection, usage monitoring, and auditing. The security layer also monitors activities of all components in other layers and generates a detailed audit trail. Components of all other layers provide native integration with the security and governance layer.

Serverless data lake centric analytics architecture

To compose the layers described in our logical architecture, we introduce a reference architecture that uses AWS serverless and managed services. In this approach, AWS services take over the heavy lifting of the following:

  • Providing and managing scalable, resilient, secure, and cost-effective infrastructural components
  • Ensuring infrastructural components natively integrate with each other

This reference architecture allows you to focus more time on rapidly building data and analytics pipelines. It significantly accelerates new data onboarding and driving insights from your data. The AWS serverless and managed components enable self-service across all data consumer roles by providing the following key benefits:

  • Easy configuration-driven use
  • Freedom from infrastructure management
  • Pay-per-use pricing model

The following diagram illustrates this architecture.

Ingestion layer

The ingestion layer in our serverless architecture is composed of a set of purpose-built AWS services to enable data ingestion from a variety of sources. Each of these services enables simple self-service data ingestion into the data lake landing zone and provides integration with other AWS services in the storage and security layers. Individual purpose-built AWS services match the unique connectivity, data format, data structure, and data velocity requirements of operational database sources, streaming data sources, and file sources.

Operational database sources

Typically, organizations store their operational data in various relational and NoSQL databases. AWS Database Migration Service (AWS DMS) can connect to a variety of operational RDBMS and NoSQL databases and ingest their data into Amazon Simple Storage Service (Amazon S3) buckets in the data lake landing zone. With AWS DMS, you can first perform a one-time import of the source data into the data lake and then replicate ongoing changes happening in the source database. AWS DMS encrypts S3 objects using AWS Key Management Service (AWS KMS) keys as it stores them in the data lake. AWS DMS is a fully managed, resilient service and provides a wide choice of instance sizes to host database replication tasks.

AWS Lake Formation provides a scalable, serverless alternative, called blueprints, to ingest data from AWS native or on-premises database sources into the landing zone in the data lake. A Lake Formation blueprint is a predefined template that generates a data ingestion AWS Glue workflow based on input parameters such as source database, target Amazon S3 location, target dataset format, target dataset partitioning columns, and schedule. A blueprint-generated AWS Glue workflow implements an optimized and parallelized data ingestion pipeline consisting of crawlers, multiple parallel jobs, and triggers connecting them based on conditions. For more information, see Integrating AWS Lake Formation with Amazon RDS for SQL Server.

Streaming data sources

The ingestion layer uses Amazon Kinesis Data Firehose to receive streaming data from internal and external sources. With a few clicks, you can configure a Kinesis Data Firehose API endpoint where sources can send streaming data such as clickstreams, application and infrastructure logs and monitoring metrics, and IoT data such as device telemetry and sensor readings. Kinesis Data Firehose does the following:

  • Buffers incoming streams
  • Batches, compresses, transforms, and encrypts the streams
  • Stores the streams as S3 objects in the landing zone in the data lake

Kinesis Data Firehose natively integrates with the security and storage layers and can deliver data to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES) for real-time analytics use cases. Kinesis Data Firehose is serverless, requires no administration, and has a cost model where you pay only for the volume of data you transmit and process through the service. Kinesis Data Firehose automatically scales to adjust to the volume and throughput of incoming data.
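As an illustration of the configuration involved, the following boto3 sketch creates a delivery stream that buffers, compresses, encrypts, and lands clickstream data in the landing zone. Every name and ARN in it is a placeholder for your own resources.

import boto3

firehose = boto3.client('firehose')

firehose.create_delivery_stream(
    DeliveryStreamName='clickstream-to-landing-zone',
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::111122223333:role/FirehoseLandingZoneRole',
        'BucketARN': 'arn:aws:s3:::my-datalake-landing-zone',
        'Prefix': 'clickstream/',
        # Buffer up to 64 MB or 5 minutes of data per S3 object.
        'BufferingHints': {'SizeInMBs': 64, 'IntervalInSeconds': 300},
        'CompressionFormat': 'GZIP',
        'EncryptionConfiguration': {
            'KMSEncryptionConfig': {
                'AWSKMSKeyARN': 'arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab'
            }
        }
    }
)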

File sources

Many applications store structured and unstructured data in files that are hosted on Network Attached Storage (NAS) arrays. Organizations also receive data files from partners and third-party vendors. Analyzing data from these file sources can provide valuable business insights.

Internal file shares

AWS DataSync can ingest hundreds of terabytes and millions of files from NFS and SMB enabled NAS devices into the data lake landing zone. DataSync automatically handles scripting of copy jobs, scheduling and monitoring transfers, validating data integrity, and optimizing network utilization. DataSync can perform one-time file transfers and monitor and sync changed files into the data lake. DataSync is fully managed and can be set up in minutes.

Partner data files

FTP is the most common method for exchanging data files with partners. The AWS Transfer Family is a serverless, highly available, and scalable service that supports secure FTP endpoints and natively integrates with Amazon S3. Partners and vendors transmit files using the SFTP protocol, and the AWS Transfer Family stores them as S3 objects in the landing zone in the data lake. The AWS Transfer Family supports encryption using AWS KMS and common authentication methods, including AWS Identity and Access Management (IAM) and Active Directory.

Data APIs

Organizations today use SaaS and partner applications such as Salesforce, Marketo, and Google Analytics to support their business operations. Analyzing SaaS and partner data in combination with internal operational application data is critical to gaining 360-degree business insights. Partner and SaaS applications often provide API endpoints to share data.

SaaS APIs

The ingestion layer uses Amazon AppFlow to easily ingest SaaS application data into the data lake. With a few clicks, you can set up serverless data ingestion flows in AppFlow. Your flows can connect to SaaS applications (such as Salesforce, Marketo, and Google Analytics), ingest data, and store it in the data lake. You can schedule AppFlow data ingestion flows or trigger them by events in the SaaS application. Ingested data can be validated, filtered, mapped, and masked before it is stored in the data lake. AppFlow natively integrates with authentication, authorization, and encryption services in the security and governance layer.

Partner APIs

To ingest data from partner and third-party APIs, organizations build or purchase custom applications that connect to APIs, fetch data, and create S3 objects in the landing zone by using AWS SDKs. These applications and their dependencies can be packaged into Docker containers and hosted on AWS Fargate. Fargate is a serverless compute engine for hosting Docker containers without having to provision, manage, and scale servers. Fargate natively integrates with AWS security and monitoring services to provide encryption, authorization, network isolation, logging, and monitoring to the application containers.

AWS Glue Python shell jobs also provide a serverless alternative to build and schedule data ingestion jobs that can interact with partner APIs by using native, open-source, or partner-provided Python libraries. AWS Glue provides out-of-the-box capabilities to schedule singular Python shell jobs or include them as part of a more complex data ingestion workflow built on AWS Glue workflows.
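For example, a boto3 sketch that registers and schedules a hypothetical Python shell ingestion job might look like the following; the script location, role, and names are placeholders.

import boto3

glue = boto3.client('glue')

# Create a Python shell job that calls a partner API and writes objects to the landing zone.
glue.create_job(
    Name='partner-api-ingest',
    Role='arn:aws:iam::111122223333:role/GlueIngestJobRole',
    Command={
        'Name': 'pythonshell',
        'ScriptLocation': 's3://my-glue-scripts/partner_api_ingest.py',
        'PythonVersion': '3'
    },
    MaxCapacity=0.0625  # smallest capacity available for Python shell jobs
)

# Run the ingestion job at the top of every hour.
glue.create_trigger(
    Name='partner-api-ingest-hourly',
    Type='SCHEDULED',
    Schedule='cron(0 * * * ? *)',
    Actions=[{'JobName': 'partner-api-ingest'}],
    StartOnCreation=True
)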

Third-party data sources

Your organization can gain a business edge by combining your internal data with third-party datasets such as historical demographics, weather data, and consumer behavior data. AWS Data Exchange provides a serverless way to find, subscribe to, and ingest third-party data directly into S3 buckets in the data lake landing zone. You can ingest a full third-party dataset and then automate detecting and ingesting revisions to that dataset. AWS Data Exchange is serverless and lets you find and ingest third-party datasets with a few clicks.

Storage layer

Amazon S3 provides the foundation for the storage layer in our architecture. Amazon S3 provides virtually unlimited scalability at low cost for our serverless data lake. Data is stored as S3 objects organized into landing, raw, and curated zone buckets and prefixes. Amazon S3 encrypts data using keys managed in AWS KMS. IAM policies control granular zone-level and dataset-level access to various users and roles. Amazon S3 provides 99.99% availability and 99.999999999% durability, and charges only for the data it stores. To significantly reduce costs, Amazon S3 provides colder tier storage options called Amazon S3 Glacier and S3 Glacier Deep Archive. To automate cost optimizations, Amazon S3 provides configurable lifecycle policies and intelligent tiering options to automate moving older data to colder tiers. AWS services in our ingestion, cataloging, processing, and consumption layers can natively read and write S3 objects. Additionally, hundreds of third-party vendor and open-source products and services provide the ability to read and write S3 objects.
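As an example of such a lifecycle policy, the following boto3 sketch transitions objects under a hypothetical raw-zone prefix to S3 Glacier after 90 days and to S3 Glacier Deep Archive after a year; the bucket name and prefix are placeholders.

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='my-datalake-raw-zone',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-raw-zone',
            'Filter': {'Prefix': 'raw/'},
            'Status': 'Enabled',
            'Transitions': [
                {'Days': 90, 'StorageClass': 'GLACIER'},
                {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
            ]
        }]
    }
)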

Data of any structure (including unstructured data) and any format can be stored as S3 objects without needing to predefine any schema. This enables services in the ingestion layer to quickly land a variety of source data into the data lake in its original source format. After the data is ingested into the data lake, components in the processing layer can define schema on top of S3 datasets and register them in the cataloging layer. Services in the processing and consumption layers can then use schema-on-read to apply the required structure to data read from S3 objects. Datasets stored in Amazon S3 are often partitioned to enable efficient filtering by services in the processing and consumption layers.

Cataloging and search layer

A data lake typically hosts a large number of datasets, and many of these datasets have evolving schema and new data partitions. A central Data Catalog that manages metadata for all the datasets in the data lake is crucial to enabling self-service discovery of data in the data lake. Additionally, separating metadata from data into a central schema enables schema-on-read for the processing and consumption layer components.

In our architecture, Lake Formation provides the central catalog to store and manage metadata for all datasets hosted in the data lake. Organizations manage both technical metadata (such as versioned table schemas, partitioning information, physical data location, and update timestamps) and business attributes (such as data owner, data steward, column business definition, and column information sensitivity) of all their datasets in Lake Formation. Services such as AWS Glue, Amazon EMR, and Amazon Athena natively integrate with Lake Formation and automate discovering and registering dataset metadata into the Lake Formation catalog. Additionally, Lake Formation provides APIs to enable metadata registration and management using custom scripts and third-party products. AWS Glue crawlers in the processing layer can track evolving schemas and newly added partitions of datasets in the data lake, and add new versions of corresponding metadata in the Lake Formation catalog.

Lake Formation provides the data lake administrator a central place to set up granular table- and column-level permissions for databases and tables hosted in the data lake. After Lake Formation permissions are set up, users and groups can access only authorized tables and columns using multiple processing and consumption layer services such as Athena, Amazon EMR, AWS Glue, and Amazon Redshift Spectrum.

Processing layer

The processing layer in our architecture is composed of two types of components:

  • Components used to create multi-step data processing pipelines
  • Components to orchestrate data processing pipelines on schedule or in response to event triggers (such as ingestion of new data into the landing zone)

AWS Glue and AWS Step Functions provide serverless components to build, orchestrate, and run pipelines that can easily scale to process large data volumes. Multi-step workflows built using AWS Glue and Step Functions can catalog, validate, clean, transform, and enrich individual datasets and advance them from landing to raw and raw to curated zones in the storage layer.

AWS Glue is a serverless, pay-per-use ETL service for building and running Python or Spark jobs (written in Scala or Python) without requiring you to deploy or manage clusters. AWS Glue automatically generates the code to accelerate your data transformations and loading processes. AWS Glue ETL builds on top of Apache Spark and provides commonly used out-of-the-box data source connectors, data structures, and ETL transformations to validate, clean, transform, and flatten data stored in many open-source formats such as CSV, JSON, Parquet, and Avro. AWS Glue ETL also provides capabilities to incrementally process partitioned data.

Additionally, you can use AWS Glue to define and run crawlers that can crawl folders in the data lake, discover datasets and their partitions, infer schema, and define tables in the Lake Formation catalog. AWS Glue provides more than a dozen built-in classifiers that can parse a variety of data structures stored in open-source formats. AWS Glue also provides triggers and workflow capabilities that you can use to build multi-step end-to-end data processing pipelines that include job dependencies and running parallel steps. You can schedule AWS Glue jobs and workflows or run them on demand. AWS Glue natively integrates with AWS services in storage, catalog, and security layers.
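For illustration, the following boto3 sketch registers and starts a crawler over a hypothetical curated-zone prefix; the role, database, bucket, and schedule are placeholders.

import boto3

glue = boto3.client('glue')

# Crawl the curated zone nightly and register new tables and partitions in the catalog.
glue.create_crawler(
    Name='curated-zone-crawler',
    Role='arn:aws:iam::111122223333:role/GlueCrawlerRole',
    DatabaseName='curated_db',
    Targets={'S3Targets': [{'Path': 's3://my-datalake-curated-zone/sales/'}]},
    SchemaChangePolicy={'UpdateBehavior': 'UPDATE_IN_DATABASE', 'DeleteBehavior': 'LOG'},
    Schedule='cron(0 2 * * ? *)'
)

glue.start_crawler(Name='curated-zone-crawler')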

Step Functions is a serverless engine that you can use to build and orchestrate scheduled or event-driven data processing workflows. You use Step Functions to build complex data processing pipelines that involve orchestrating steps implemented by using multiple AWS services such as AWS Glue, AWS Lambda, Amazon Elastic Container Service (Amazon ECS) containers, and more. Step Functions provides visual representations of complex workflows and their running state to make them easy to understand. It manages state, checkpoints, and restarts of the workflow for you to make sure that the steps in your data pipeline run in order and as expected. Built-in try/catch, retry, and rollback capabilities deal with errors and exceptions automatically.
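The following is a minimal sketch of a two-step state machine that runs an AWS Glue job and then invokes an AWS Lambda function; the job, function, and role names are placeholders.

import json
import boto3

sfn = boto3.client('stepfunctions')

# Step 1 runs a Glue ETL job synchronously; step 2 invokes a Lambda function that starts a crawler.
definition = {
    'StartAt': 'TransformRawToCurated',
    'States': {
        'TransformRawToCurated': {
            'Type': 'Task',
            'Resource': 'arn:aws:states:::glue:startJobRun.sync',
            'Parameters': {'JobName': 'raw-to-curated-etl'},
            'Retry': [{'ErrorEquals': ['States.ALL'], 'MaxAttempts': 2}],
            'Next': 'UpdateCatalog'
        },
        'UpdateCatalog': {
            'Type': 'Task',
            'Resource': 'arn:aws:lambda:us-east-1:111122223333:function:start-curated-crawler',
            'End': True
        }
    }
}

sfn.create_state_machine(
    name='data-processing-pipeline',
    definition=json.dumps(definition),
    roleArn='arn:aws:iam::111122223333:role/StepFunctionsPipelineRole'
)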

Consumption layer

The consumption layer in our architecture is composed using fully managed, purpose-built, analytics services that enable interactive SQL, BI dashboarding, batch processing, and ML.

Interactive SQL

Athena is an interactive query service that enables you to run complex ANSI SQL against terabytes of data stored in Amazon S3 without needing to first load it into a database. Athena queries can analyze structured, semi-structured, and columnar data stored in open-source formats such as CSV, JSON, XML, Avro, Parquet, and ORC. Athena uses table definitions from Lake Formation to apply schema-on-read to data read from Amazon S3.

Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the amount of data scanned by the queries you run. Athena provides faster results and lower costs by reducing the amount of data it scans by using dataset partitioning information stored in the Lake Formation catalog. You can run queries directly on the Athena console or submit them using Athena JDBC or ODBC endpoints.

Athena natively integrates with AWS services in the security and monitoring layer to support authentication, authorization, encryption, logging, and monitoring. It supports table- and column-level access controls defined in the Lake Formation catalog.

Data warehousing and batch analytics

Amazon Redshift is a fully managed data warehouse service that can host and process petabytes of data and run thousands of highly performant queries in parallel. Amazon Redshift uses a cluster of compute nodes to run very low-latency queries to power interactive dashboards and high-throughput batch analytics to drive business decisions. You can run Amazon Redshift queries directly on the Amazon Redshift console or submit them using the JDBC/ODBC endpoints provided by Amazon Redshift.

Amazon Redshift provides the capability, called Amazon Redshift Spectrum, to perform in-place queries on structured and semi-structured datasets in Amazon S3 without needing to load them into the cluster. Amazon Redshift Spectrum can spin up thousands of query-specific temporary nodes to scan exabytes of data to deliver fast results. Organizations typically load most frequently accessed dimension and fact data into an Amazon Redshift cluster and keep up to exabytes of structured, semi-structured, and unstructured historical data in Amazon S3. Amazon Redshift Spectrum enables running complex queries that combine data in a cluster with data on Amazon S3 in the same query.

Amazon Redshift provides native integration with Amazon S3 in the storage layer, Lake Formation catalog, and AWS services in the security and monitoring layer.

Business intelligence

Amazon QuickSight provides a serverless BI capability to easily create and publish rich, interactive dashboards. QuickSight enriches dashboards and visuals with out-of-the-box, automatically generated ML insights such as forecasting, anomaly detection, and narrative highlights. QuickSight natively integrates with Amazon SageMaker to enable additional custom ML model-based insights to your BI dashboards. You can access QuickSight dashboards from any device using a QuickSight app, or you can embed the dashboard into web applications, portals, and websites.

QuickSight allows you to directly connect to and import data from a wide variety of cloud and on-premises data sources. These include SaaS applications such as Salesforce, Square, ServiceNow, Twitter, GitHub, and JIRA; third-party databases such as Teradata, MySQL, Postgres, and SQL Server; native AWS services such as Amazon Redshift, Athena, Amazon S3, Amazon Relational Database Service (Amazon RDS), and Amazon Aurora; and private VPC subnets. You can also upload a variety of file types including XLS, CSV, JSON, and Presto.

To achieve blazing fast performance for dashboards, QuickSight provides an in-memory caching and calculation engine called SPICE. SPICE automatically replicates data for high availability and enables thousands of users to simultaneously perform fast, interactive analysis while shielding your underlying data infrastructure. QuickSight automatically scales to tens of thousands of users and provides a cost-effective, pay-per-session pricing model.

QuickSight allows you to securely manage your users and content via a comprehensive set of security features, including role-based access control, active directory integration, AWS CloudTrail auditing, single sign-on (IAM or third-party), private VPC subnets, and data backup.

Predictive analytics and ML

Amazon SageMaker is a fully managed service that provides components to build, train, and deploy ML models using an integrated development environment (IDE) called Amazon SageMaker Studio. In Amazon SageMaker Studio, you can upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production, all in one place by using a unified visual interface. Amazon SageMaker also provides managed Jupyter notebooks that you can spin up with just a few clicks. Amazon SageMaker notebooks provide elastic compute resources, git integration, easy sharing, pre-configured ML algorithms, dozens of out-of-the-box ML examples, and AWS Marketplace integration, which enables easy deployment of hundreds of pre-trained algorithms. Amazon SageMaker notebooks are preconfigured with all major deep learning frameworks, including TensorFlow, PyTorch, Apache MXNet, Chainer, Keras, Gluon, Horovod, Scikit-learn, and Deep Graph Library.

ML models are trained on Amazon SageMaker managed compute instances, including highly cost-effective Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances. You can organize multiple training jobs by using Amazon SageMaker Experiments. You can build training jobs using Amazon SageMaker built-in algorithms, your custom algorithms, or hundreds of algorithms you can deploy from AWS Marketplace. Amazon SageMaker Debugger provides full visibility into model training jobs. Amazon SageMaker also provides automatic hyperparameter tuning for ML training jobs.

You can deploy Amazon SageMaker trained models into production with a few clicks and easily scale them across a fleet of fully managed EC2 instances. You can choose from multiple EC2 instance types and attach cost-effective GPU-powered inference acceleration. After the models are deployed, Amazon SageMaker can monitor key model metrics for inference accuracy and detect any concept drift.

Amazon SageMaker provides native integrations with AWS services in the storage and security layers.

Security and governance layer

Components across all layers of our architecture protect data, identities, and processing resources by natively using the following capabilities provided by the security and governance layer.

Authentication and authorization

IAM provides user-, group-, and role-level identity to users and the ability to configure fine-grained access control for resources managed by AWS services in all layers of our architecture. IAM supports multi-factor authentication and single sign-on through integrations with corporate directories and open identity providers such as Google, Facebook, and Amazon.

Lake Formation provides a simple and centralized authorization model for tables hosted in the data lake. After they’re implemented in Lake Formation, authorization policies for databases and tables are enforced by other AWS services such as Athena, Amazon EMR, QuickSight, and Amazon Redshift Spectrum. In Lake Formation, you can grant or revoke database-, table-, or column-level access for IAM users, groups, or roles defined in the same account hosting the Lake Formation catalog or another AWS account. The simple grant/revoke-based authorization model of Lake Formation considerably simplifies the previous IAM-based authorization model that relied on separately securing S3 data objects and metadata objects in the AWS Glue Data Catalog.
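For example, a column-level grant through the Lake Formation API could look like the following boto3 sketch; the principal, database, table, and column names are placeholders.

import boto3

lf = boto3.client('lakeformation')

# Allow an analyst role to select only non-sensitive columns of a curated table.
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::111122223333:role/AnalystRole'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'curated_db',
            'Name': 'customers',
            'ColumnNames': ['customer_id', 'city', 'segment']
        }
    },
    Permissions=['SELECT']
)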

Encryption

AWS KMS provides the capability to create and manage symmetric and asymmetric customer-managed encryption keys. AWS services in all layers of our architecture natively integrate with AWS KMS to encrypt data in the data lake. It supports both creating new keys and importing existing customer keys. Access to the encryption keys is controlled using IAM and is monitored through detailed audit trails in CloudTrail.

Network protection

Our architecture uses Amazon Virtual Private Cloud (Amazon VPC) to provision a logically isolated section of the AWS Cloud (called a VPC) that is isolated from the internet and other AWS customers. Amazon VPC provides the ability to choose your own IP address range, create subnets, and configure route tables and network gateways. AWS services from other layers in our architecture launch resources in this private VPC to protect all traffic to and from these resources.

Monitoring and logging

AWS services in all layers of our architecture store detailed logs and monitoring metrics in Amazon CloudWatch. CloudWatch provides the ability to analyze logs, visualize monitored metrics, define monitoring thresholds, and send alerts when thresholds are crossed.

All AWS services in our architecture also store extensive audit trails of user and service actions in CloudTrail. CloudTrail provides event history of your AWS account activity, including actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. This event history simplifies security analysis, resource change tracking, and troubleshooting. In addition, you can use CloudTrail to detect unusual activity in your AWS accounts. These capabilities help simplify operational analysis and troubleshooting.

Additional considerations

In this post, we talked about ingesting data from diverse sources and storing it as S3 objects in the data lake and then using AWS Glue to process ingested datasets until they’re in a consumable state. This architecture enables use cases needing source-to-consumption latency of a few minutes to hours. In a future post, we will evolve our serverless analytics architecture to add a speed layer to enable use cases that require source-to-consumption latency in seconds, all while aligning with the layered logical architecture we introduced.

Conclusion

With AWS serverless and managed services, you can build a modern, low-cost data lake centric analytics architecture in days. A decoupled, component-driven architecture allows you to start small and quickly add new purpose-built components to one of six architecture layers to address new requirements and data sources.

We invite you to read the following posts that contain detailed walkthroughs and sample code for building the components of the serverless data lake centric analytics architecture:


About the Authors

Praful Kava is a Sr. Specialist Solutions Architect at AWS. He guides customers to design and engineer cloud-scale analytics pipelines on AWS. Outside work, he enjoys travelling with his family and exploring new hiking trails.


Changbin Gong is a Senior Solutions Architect at Amazon Web Services (AWS). He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Changbin enjoys reading, running, and traveling.