All posts by Nivas Shankar

Build a modern data architecture and data mesh pattern at scale using AWS Lake Formation tag-based access control

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/build-a-modern-data-architecture-and-data-mesh-pattern-at-scale-using-aws-lake-formation-tag-based-access-control/

Customers are exploring building a data mesh on their AWS platform using AWS Lake Formation and sharing their data lakes across the organization. A data mesh architecture empowers business units (organized into domains) to have high ownership and autonomy for the technologies they use, while providing technology that enforces data security policies both within and between domains through data sharing. Data consumers request access to these data products, which are approved by producer owners within a framework that provides decentralized governance, but centralized monitoring and auditing of the data sharing process. As the number of tables and users increase, data stewards and administrators are looking for ways to manage permissions on data lakes easily at scale. Customers are struggling with “role explosion” and need to manage hundreds or even thousands of user permissions to control data access. For example, for an account with 1,000 resources and 100 principals, the data steward would have to create and manage up to 100,000 policy statements. As new principals and resources get added or deleted, these policies have to be updated to keep the permissions current.

Lake Formation tag-based access control (TBAC) solves this problem by allowing data stewards to create LF-tags (based on their business needs) that are attached to resources. You can create policies on a smaller number of logical tags instead of specifying policies on named resources. LF-tags enable you to categorize and explore data based on taxonomies, which reduces policy complexity and scales permissions management. You can create and manage policies with tens of logical tags instead of the thousands of resources. Lake Formation TBAC decouples policy creation from resource creation, which helps data stewards manage permissions on many databases, tables, and columns by removing the need to update policies every time a new resource is added to the data lake. Finally, TBAC allows you to create policies even before the resources come into existence. All you have to do is tag the resource with the right LF-tag to make sure existing policies manage it.

This post focuses on managing permissions on data lakes at scale using LF-tags in Lake Formation across accounts. When it comes to managing Data Catalog tables from AWS Glue and administering permissions in Lake Formation, data stewards within the producing accounts have functional ownership based on the functions they support, and can grant access to various consumers, external organizations, and accounts. You can now define LF-tags; associate them at the database, table, or column level; and then share controlled access across analytic, machine learning (ML), and extract, transform, and load (ETL) services for consumption. LF-tags ensure that governance scales easily by replacing the policy definitions of thousands of resources with a few logical tags.

Solution overview

LF-tag access has three key components:

  • Tag ontology and classification – Data stewards can define an LF-tag ontology based on their business needs and grant access based on LF-tags to AWS Identity and Access Management (IAM) principals and SAML principals or groups
  • Tagging resources – Data engineers can easily create, automate, implement, and track all LF-tags and permissions against AWS Glue catalogs through the Lake Formation API
  • Policy evaluation – Lake Formation evaluates the effective permissions based on LF-tags at query time and allows access to data through consuming services such as Amazon Athena, AWS Glue, Amazon Redshift Spectrum, Amazon SageMaker Data Wrangler, and Amazon EMR Studio, based on the effective permissions granted across multiple accounts or organization-level data shares

The following diagram illustrates the relationship between the data producer, data consumer, and central governance accounts.

In the above diagram, the central governance account box shows the tagging ontology that will be used with the associated tag colors. These will be shared with both the producers and consumers, to be used to tag resources.

In this post, we consider two databases, as shown in the following figure, and show how you can set up Lake Formation tables and create Lake Formation tag-based policies.

The solution includes the following high-level steps:

  1. The data mesh owner defines the central tag ontology with LF-tags:
    1. LOB – Classified at the line of business (LOB) level (database)
    2. LOB:Function – Classified at the business function level (table)
    3. Classification – Classification of the functional data level (columns)
  2. The data mesh owner assigns respective permission levels to the product data steward to use centrally defined tags and associates permission to their database and tables with different LF-tags.
  3. The producer steward in the central account owns two databases: LOB = Cards and LOB = Retail.
  4. The producer steward switches to the data producer account to add table metadata using an AWS Glue crawler.
  5. The producer steward associates the column-level classifications Classification = Sensitive and Classification = Non-Sensitive to tables under the Cards database in the central account.
  6. The producer steward associates the table-level tags LOB:Retail = Customer and LOB:Retail = Reviews to tables under the Retail database in the central account.
  7. The consumer admin grants fine-grained access control to different data analysts.

With this configuration, the consumer analyst can focus on performing analysis with the right data.

Set up resources with AWS CloudFormation

We provide three AWS CloudFormation templates in this post: for the producer account, central account, and consumer account. Deploy the CloudFormation templates in the order of producer, central, and consumer, because there are dependencies between the templates.

The CloudFormation template for the central account generates the following resources:

  • Two IAM users:
    • DataMeshOwner
    • ProducerSteward
  • DataMeshOwner granted as the Lake Formation admin
  • One IAM role:
    • LFRegisterLocationServiceRole
  • Two IAM policies:
    • ProducerStewardPolicy
    • S3DataLakePolicy
  • The databases “retail” and “cards”, which ProducerSteward manages in the Data Catalog
  • Data location permissions shared with the producer account to manage the Data Catalog

The CloudFormation template for the producer account generates the following resources:

  • Two Amazon Simple Storage Service (Amazon S3) buckets:
    • RetailBucket, which holds two tables:
      • Customer_Info
      • Customer_Review
    • CardsBucket, which holds one table:
      • Credit_Card
  • Amazon S3 bucket access allowed for the central account’s Lake Formation service role
  • Two AWS Glue crawlers
  • One AWS Glue crawler service role
  • Permissions on the S3 bucket locations tbac-cards-<ProducerAccountID>-<aws-region> and tbac-retail-<ProducerAccountID>-<aws-region> granted to the AWS Glue crawler role
  • One producer steward IAM user

The CloudFormation template for the consumer account generates the following resources:

  • One S3 bucket:
    • <AWS Account ID>-<aws-region>-athena-logs
  • One Athena workgroup:
    • consumer-workgroup
  • Three IAM users:
    • ConsumerAdmin
    • ConsumerAnalyst1
    • ConsumerAnalyst2

Launch the CloudFormation stack in the central account

To create resources in the central account, complete the following steps:

  1. Sign in to the central account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack:   
  3. Choose Next.
  4. For Stack name, enter stack-central.
  5. For DataMeshOwnerUserPassword, enter the password you want for the data lake admin IAM user in the central account.
  6. For ProducerStewardUserPassword, enter the password you want for the producer steward IAM user in the central account.
  7. For ProducerAWSAccount, enter the producer AWS account ID (<ProducerAccountID>).
  8. Choose Next.
  9. On the next page, choose Next.
  10. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create stack.
  12. Collect the value for LFRegisterLocationServiceRole on the stack’s Outputs tab.

Launch the CloudFormation stack in the producer account

To set up resources in the producer account, complete the following steps:

  1. Sign in to the producer account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For Stack name, enter stack-producer.
  5. For CentralAccountID, copy and paste the central account ID (<CentralAccountID>).
  6. For CentralAccountLFServiceRole, copy and paste the value of LFRegisterLocationServiceRole collected from the stack-central Outputs tab.
  7. For LFDatabaseName, keep the default tbac database name.
  8. For ProducerStewardUserPassword, enter the password you want for the data lake admin IAM user on the producer account.
  9. Choose Next.
  10. On the next page, choose Next.
  11. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  12. Choose Create stack.

Launch the CloudFormation stack in the consumer account

To create resources in the consumer account, complete the following steps:

  1. Sign in to the consumer account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For Stack name, enter stack-consumer.
  5. For ConsumerAdminUserName and ConsumerAdminUserPassword, enter the user name and password you want for the data lake admin IAM user.
  6. For ConsumerAnalyst1UserName and ConsumerAnalyst1UserPassword, enter the user name and password you want for the ConsumerAnalyst1 IAM user.
  7. For ConsumerAnalyst2UserName and ConsumerAnalyst2UserPassword, enter the user name and password you want for the ConsumerAnalyst2 IAM user.
  8. Choose Next.
  9. On the next page, choose Next.
  10. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create stack.

Configure Lake Formation cross-account sharing

After you create your resources with AWS CloudFormation, you perform the following steps in the producer and central account to set up Lake Formation cross-account sharing.

Central governance account

In the central account, complete the following steps:

  1. Sign in to the Lake Formation console as admin.
  2. In the navigation pane, choose Permissions, then choose Administrative roles and tasks.

The CloudFormation template added the data mesh owner as the data lake administrator.

Next, we update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control.

  1. In the navigation pane, under Data catalog, choose Settings.
  2. Uncheck Use only IAM access control for new databases.
  3. Uncheck Use only IAM access control for new tables in new databases.
  4. Choose Save.

Next, we need to set up the AWS Glue Data Catalog resource policy to grant cross-account access to Data Catalog resources.

  1. As described in Lake Formation Tag-Based Access Control Cross-Account Prerequisites, before you can use the tag-based access control method to grant cross-account access to resources, you must add the following JSON permissions object to the AWS Glue Data Catalog resource policy in the central account (the grantor account in this architecture). This gives the producer and consumer accounts permission to access the Data Catalog when glue:EvaluatedByLakeFormationTags is true. This condition is true for resources on which you granted permission to those accounts using Lake Formation LF-tags. This policy is required for every AWS account that you’re granting permissions to. We discuss the full IAM policy later in this post. Use the following policy, and replace the account numbers and Region with your own values:
    {
       "PolicyInJson": "{\"Version\" : \"2012-10-17\",\"Statement\" : [ {\"Effect\" : \"Allow\",\"Principal\" : {\"AWS\" : [\"arn:aws:iam::<ProducerAccountID>:root\",\"arn:aws:iam::<ConsumerAccountID>:root\"]},\"Action\" : \"glue:*\",\"Resource\" : [ \"arn:aws:glue:<aws-region>:<CentralAccountID>:table/*\", \"arn:aws:glue:<aws-region>:<CentralAccountID>:database/*\", \"arn:aws:glue:<aws-region>:<CentralAccountID>:catalog\" ],\"Condition\" : {\"Bool\" : {\"glue:EvaluatedByLakeFormationTags\" : \"true\"}}}, {\"Effect\" : \"Allow\",\"Principal\" : {\"Service\" : \"ram.amazonaws.com\"},\"Action\" : \"glue:ShareResource\",\"Resource\" : [ \"arn:aws:glue:<aws-region>:<CentralAccountID>:table/*\", \"arn:aws:glue:<aws-region>:<CentralAccountID>:database/*\", \"arn:aws:glue:<aws-region>:<CentralAccountID>:catalog\" ]} ]}",
       "EnableHybrid": "TRUE"
    }

Replace the <aws-region>, <ProducerAccountID>, <ConsumerAccountID> and <CentralAccountID> values in the above policy as appropriate and save it in a file called policy.json.

  1. Next, run the following AWS Command Line Interface (AWS CLI) command on AWS CloudShell.
aws glue put-resource-policy --region <aws-region> --cli-input-json file://policy.json

For more information about this policy, see put-resource-policy.
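As an optional check (not part of the original walkthrough), you can read the resource policy back to confirm it was applied; the call returns the policy document and its hash:

aws glue get-resource-policy --region <aws-region>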

  1. Next, we verify the two source data S3 buckets are registered as data lake locations in the central account. This is completed by the CloudFormation template.
  2. Under Register and ingest in the navigation pane, choose Data lake locations.

You should see the two S3 buckets registered under the data lake locations.

Configure Lake Formation Data Catalog settings in the central account

After we complete all the prerequisites, we start the data mesh configuration. We log in as DataMeshOwner in the central account.

Define LF-tags

DataMeshOwner creates the tag ontology by defining LF-tags. Complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, under Administrative roles and tasks, choose LF-Tags.
  2. Choose Add LF-tags.
  3. For Key, enter LOB and for Values, choose Retail and Cards.
  4. Choose Add LF-tag.
  5. Repeat these steps to add the key LOB:Retail and values Customer and Reviews, and the key Classification with values Sensitive and Non-Sensitive.

Now we complete the configuration of the tag ontology.
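If you prefer to script the ontology rather than use the console, the same LF-tags can be created through the Lake Formation API. The following AWS CLI sketch shows one way to do it when run as DataMeshOwner in the central account; add --region or --profile flags as your environment requires:

# Create the three LF-tag keys and their allowed values
aws lakeformation create-lf-tag --tag-key LOB --tag-values Retail Cards
aws lakeformation create-lf-tag --tag-key LOB:Retail --tag-values Customer Reviews
aws lakeformation create-lf-tag --tag-key Classification --tag-values Sensitive Non-Sensitive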

Grant permissions

We grant ProducerSteward in the central account Describe and Associate permissions on the preceding tag ontology. This enables ProducerSteward to view the LF-tags and assign them to Data Catalog resources (databases, tables, and columns). ProducerSteward in the central account can further grant the permission to ProducerSteward in the producer account. For more information, see Granting, Revoking, and Listing LF-Tag Permissions. When you have multiple producers, grant the relevant tags to each steward.

  1. Under Permissions in the navigation pane, under Administrative roles and tasks, choose LF-tag permissions.
  2. Choose Grant.
  3. For IAM users and roles, choose the ProducerSteward user.
  4. In the LF-Tags section, add all three key-values:
    1. Key LOB with values Retail and Cards.
    2. Key LOB:Retail with values Customer and Reviews.
    3. Key Classification with values Sensitive and Non-Sensitive.
  5. For Permissions, select Describe and Associate for both LF-tag permissions and Grantable permissions.
  6. Choose Grant.

Next, we grant ProducerSteward tag-based data lake permissions. This enables ProducerSteward to create, alter, and drop tables in the databases with corresponding tags. ProducerSteward in the producer account can further grant the permission across accounts.

  1. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  2. For Principals, choose IAM users and roles, and choose ProducerSteward.
  3. For LF-tags or catalog resources, select Resources matched by LF-Tags (recommended).
  4. Choose Add LF-Tag.
  5. For Key, choose LOB and for Values, choose Cards.
  6. For Database permissions, select the Super permission because ProducerSteward owns the producer databases.

This permission allows a principal to perform every supported Lake Formation operation on the database. Use this admin permission when a principal is trusted with all operations.

  1. Select Super under Grantable permissions so the ProducerSteward user can grant database-level permissions to the producer and consumer accounts.
  2. For Table permissions, select Super.
  3. Select Super permission under Grantable permissions.
  4. Choose Grant.
  5. Repeat these steps for key LOB and value Retail.
  6. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  7. For Principals, choose IAM users and roles, and choose ProducerSteward.
  8. For LF-tags or catalog resources, select Resources matched by LF-Tags (recommended).
  9. Add the key LOB with value Cards, and the key Classification with values Sensitive and Non-Sensitive.
  10. For Database permissions, select Super.
  11. Select Super permission under Grantable permissions.
  12. For Table permissions, select Super.
  13. Select Super under Grantable permissions.
  14. Choose Grant.

This gives ProducerSteward fine-grained permission expression on columns with either Sensitive or Non-Sensitive tags.

  1. Repeat these steps for key LOB and value Retail, and for key LOB:Retail with values Reviews and Customer.

This gives ProducerSteward fine-grained permission expression on tables with either Reviews or Customer tags.
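Each of these console grants maps to a single GrantPermissions API call. As a hedged sketch, the table-level LOB=Cards grant to ProducerSteward could be expressed roughly as follows, where ALL corresponds to the Super permission in the console (replace the account ID with your own):

aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::<CentralAccountID>:user/ProducerSteward \
    --resource '{"LFTagPolicy": {"ResourceType": "TABLE", "Expression": [{"TagKey": "LOB", "TagValues": ["Cards"]}]}}' \
    --permissions "ALL" \
    --permissions-with-grant-option "ALL"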

Producer data steward actions in the central account

Next, we log in as the ProducerSteward user in the central account and create skeleton databases.

  1. Sign in to the Lake Formation console as ProducerSteward.
  2. In the navigation pane, under Data catalog, select Databases.
  3. Choose the cards database.
  4. On the Actions menu, choose Edit LF-tags.
  5. Choose Assign new LF-tag.
  6. For Assigned Keys, enter LOB and for Values, choose Cards.
  7. Choose Save.

This assigns the LOB=Cards tag to the Cards database.

  1. Repeat these steps for the Retail database, assigning the LOB=Retail tag to it.
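The same assignments can be scripted through the AddLFTagsToResource API. The following AWS CLI sketch, run as ProducerSteward in the central account, tags both databases; it assumes the database names created by the CloudFormation template (cards and retail):

aws lakeformation add-lf-tags-to-resource \
    --resource '{"Database": {"Name": "cards"}}' \
    --lf-tags '[{"TagKey": "LOB", "TagValues": ["Cards"]}]'

aws lakeformation add-lf-tags-to-resource \
    --resource '{"Database": {"Name": "retail"}}' \
    --lf-tags '[{"TagKey": "LOB", "TagValues": ["Retail"]}]'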

Next, we share the LF-tags and data lake permissions with the producer account so that ProducerSteward in the producer account can run AWS Glue crawlers and generate tables in the preceding skeleton databases.

  1. Under Permissions in the navigation pane, under Administrative roles and tasks, choose LF-tag permissions.
  2. Choose Grant.
  3. For Principals, select External accounts.
  4. For AWS account or AWS organization, enter the account ID for the producer account.
  5. In the LF-Tags section, we only need to add database-level tags.
  6. For Key, enter LOB and for Values, choose Retail and Cards.
  7. For Permissions, choose Describe and Associate for both LF-tag permissions and Grantable permissions.
  8. Choose Grant.
  9. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  10. For Principals, select External accounts.
  11. For AWS account or AWS organization, enter the account ID for the producer account.
  12. For LF-tags or catalog resources, select Resources matched by LF-Tags (recommended).
  13. Choose Add LF-Tag.
  14. Choose the key LOB and value Cards.
  15. For Database permissions, select Create table and Describe because the ProducerSteward user in the producer account will add tables in the database.
  16. Select Create table and Describe under Grantable permissions so the ProducerSteward user can further grant the permission to the AWS Glue crawler.
  17. For Table permissions, select all the permissions.
  18. Select all the permissions under Grantable permissions.
  19. Choose Grant.
  20. Repeat these steps for LOB=Retail.

Now the Lake Formation administrators on the producer account side have the right permissions to add tables.
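For cross-account grants the principal is simply the target AWS account ID. As a hedged sketch, the database-level part of the LOB=Cards grant to the producer account could look like the following CLI call; the table-level grant follows the same pattern with ResourceType TABLE:

aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=<ProducerAccountID> \
    --resource '{"LFTagPolicy": {"ResourceType": "DATABASE", "Expression": [{"TagKey": "LOB", "TagValues": ["Cards"]}]}}' \
    --permissions "CREATE_TABLE" "DESCRIBE" \
    --permissions-with-grant-option "CREATE_TABLE" "DESCRIBE"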

Crawl source tables in the producer account

Next, we log in as the ProducerSteward user in the producer account to crawl the source tables for the Cards and Retail databases.

  1. Sign in to the Lake Formation console as ProducerSteward.
  2. In the navigation pane, under Permissions, choose Administrative roles and tasks, and verify that ProducerSteward is configured as the data lake administrator.
  3. In the navigation pane, under Permissions, under Administrative roles and tasks, choose LF-Tags.

You can verify the root-level LOB tags that were shared with the producer account.

  1. In the navigation pane, under Data catalog, select Databases.

You can verify the two databases cards and retail that were shared with the producer account from the previous step.

Now, we create resource links in the producer account for these two databases. These links point to the shared databases and are used by the AWS Glue crawlers to create the tables. First, we create a resource link for the cards database.

  1. Select the cards database and on the Actions menu, choose Create resource link.
  2. For Resource link name, enter rl_cards.
  3. Choose Create.
  4. Repeat these steps to create a resource link for the retail database.

After the resource link creation, you should see both the resource link databases as shown in the following screenshot.
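Because resource links are themselves Data Catalog entries, they can also be created with the AWS Glue CreateDatabase API by pointing at the shared database. The following CLI sketch creates rl_cards in the producer account; replace <CentralAccountID> with the central account ID:

aws glue create-database \
    --database-input '{"Name": "rl_cards", "TargetDatabase": {"CatalogId": "<CentralAccountID>", "DatabaseName": "cards"}}'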

Next, we need to grant permissions to the AWS Glue crawler role so that the crawler can crawl the source bucket and create the tables.

  1. Select the rl_cards database and on the Actions menu, choose Grant.
  2. In the Grant data permissions section, select IAM users and roles, and choose the AWS Glue crawler role that was created by the CloudFormation template (for example, stack-producer-AWSGlueServiceRoleDefault-xxxxxx).
  3. For Databases, choose rl_cards.
  4. For Resource link permissions, select Describe.
  5. Choose Grant.
  6. Repeat these steps for rl_retail.
  7. Next, in the navigation pane, choose Data lake Permissions and choose Grant.
  8. For IAM users and roles, choose the role stack-producer-AWSGlueServiceRoleDefault-XXXX.
  9. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  10. Enter the key LOB and values Retail and Cards.
  11. For Database permissions, select Create table and Describe.
  12. For Table permissions, choose Select, Describe, and Alter.
  13. Choose Grant.

Next, we verify that permissions on the S3 bucket locations corresponding to the cards and retail producers have been granted to the AWS Glue crawler role. This is completed by the CloudFormation template.

In the navigation pane, under Permissions, choose Data locations; you should see both locations.

Now we’re ready to run the crawlers. We configure the crawlers that the CloudFormation template created, to point to these resource link databases.

  1. On the AWS Glue console, under Data catalog in the navigation pane, choose Crawlers.

The two crawlers you created should be listed.

  1. Select the crawler for the cards database CardsCrawler-xxxxxxxxxxxx and on the Action menu, choose Edit crawler.
  2. For the input data store, choose the S3 bucket for the cards producer.
  3. For IAM role, choose the AWS Glue service role created by the CloudFormation template.
  4. For Schedule, choose Run on demand.
  5. For the output database, choose the resource link database rl_cards corresponding to the cards database.
  6. Verify all the information and choose Save.
  7. Repeat these steps for the crawler corresponding to the retail producer.
  8. Select both crawlers and choose Run crawler.

When the crawlers finish, they create tables corresponding to each producer in their respective resource link databases. The table schemas are present in the shared database in the central account.
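You can also start the crawlers from the AWS CLI and poll them until they return to the READY state. The crawler names below are illustrative; use the names generated by your CloudFormation stack:

aws glue start-crawler --name CardsCrawler-xxxxxxxxxxxx
aws glue start-crawler --name RetailCrawler-xxxxxxxxxxxx

# Poll until the state returns to READY
aws glue get-crawler --name CardsCrawler-xxxxxxxxxxxx --query 'Crawler.State'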

Configure Lake Formation tags in the central account

Next, we perform fine-grained access control for the tables that the crawlers created to support different consumption use cases using Lake Formation tags.

Tag columns

First, we tag the columns of the cards table in the cards database using the Classification tag that we created earlier.

  1. Log in to central account as IAM user ProducerSteward.
  2. On the Lake Formation console, in the navigation pane, choose Data catalog and then choose Tables.

You should see three tables: the cards table corresponding to cards database, and the reviews and customers tables corresponding to the retail database.

  1. Choose the cards table.
  2. Navigate to the Schema section and choose Edit schema.
  3. Select all the columns and choose Edit tags.
  4. Choose Assign new LF-Tag.
  5. For Assigned keys, enter Classification and for Values, choose Non-Sensitive.
  6. Choose Save.

Next, we selectively tag the sensitive columns.

  1. In the Edit schema section, select columns card number, card holder’s name, cvv/cvv2, and card pin.
  2. Choose Edit tags.
  3. For Assigned keys, enter Classification and for Values, choose Sensitive.
  4. Choose Save.
  5. Choose Save as new version to save the schema.
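Column-level tagging follows the same AddLFTagsToResource pattern, using a TableWithColumns resource. The following CLI sketch retags the sensitive columns; the column names shown are placeholders, so substitute the exact names the crawler produced in your catalog:

aws lakeformation add-lf-tags-to-resource \
    --resource '{"TableWithColumns": {"DatabaseName": "cards", "Name": "cards", "ColumnNames": ["card_number", "card_holders_name", "cvv_cvv2", "card_pin"]}}' \
    --lf-tags '[{"TagKey": "Classification", "TagValues": ["Sensitive"]}]'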

Tag tables

Next, we tag the reviews and customer tables under the retail database using the LOB:Retail tag that we created earlier.

  1. On the Tables page, select the reviews table and on the Actions menu, choose Edit LF-tags.
  2. Choose Assign new LF-Tag.
  3. For Assigned keys, choose LOB:Retail and for Values, choose Reviews.
  4. Choose Save.
  5. Repeat the steps for the customer table. Choose LOB:Retail for the key and Customer for the value.

Grant tag permissions

Next, grant LF-tag permissions to the external consumer account.

  1. On the Lake Formation console, in the navigation pane, choose Permissions, then choose Administrative roles and tasks and choose LF-tag permissions.
  2. Choose Grant.
  3. For Principals, select External accounts.
  4. For AWS account or AWS organization, enter the AWS account number corresponding to the consumer account.
  5. For LF-Tags, choose Add LF-Tag.
  6. For Key, choose LOB and for Values, choose Retail and Cards.
  7. Repeat these steps for key Classification with values Non-Sensitive and Sensitive, and key LOB:Retail with values Reviews and Customer.
  8. For Permissions, choose Describe.
  9. For Grantable permissions, choose Describe.
  10. Choose Grant.

Next, we grant Lake Formation policy tag expression permissions to the external consumer account.

  1. In the navigation pane, choose Data lake permissions and choose Grant.
  2. In the Principals section, select External accounts.
  3. For AWS account or AWS organization, enter the AWS account number corresponding to the consumer account.
  4. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  5. Choose Add LF-Tag.
  6. For Key, choose LOB and for Values, choose Retail.
  7. For Database permissions, select Describe.
  8. For Grantable permissions, select Describe.
  9. Choose Grant.
  10. Repeat these steps to grant permissions on the policy tag expression LOB=Cards.

Next, we grant table permissions.

  1. In the navigation pane, choose Data lake permissions and choose Grant.
  2. For Principals, select External accounts.
  3. For AWS account or AWS organization, enter the AWS account number corresponding to the consumer account.
  4. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  5. Add key LOB with value Retail, and key LOB:Retail with values Reviews and Customer.
  6. For Table Permissions, select Select and Describe.
  7. For Grantable permissions, select Select and Describe.
  8. Choose Grant.
  9. Repeat these steps to grant permissions on the policy tag expressions LOB=Cards and Classification = (Non-Sensitive or Sensitive).
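As with the producer account, these consumer-facing grants can be scripted. A hedged sketch of the retail table-level grant is shown below; combining two keys in one expression means a table must match both tags, and the Cards and Classification grant follows the same pattern:

aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=<ConsumerAccountID> \
    --resource '{"LFTagPolicy": {"ResourceType": "TABLE", "Expression": [{"TagKey": "LOB", "TagValues": ["Retail"]}, {"TagKey": "LOB:Retail", "TagValues": ["Reviews", "Customer"]}]}}' \
    --permissions "SELECT" "DESCRIBE" \
    --permissions-with-grant-option "SELECT" "DESCRIBE"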

Share and consume tables in the consumer account

When you sign in to the Lake Formation console in the consumer account as ConsumerAdmin, you can see all the tags and the corresponding values that were shared by the producer.

In these next steps, we share and consume tables in the consumer account.

Create a resource link to the shared database

On the Databases page on the Lake Formation console, you can see all the databases that were shared to the consumer account. To create a resource link, complete the following steps:

  1. On the Databases page, select the cards database and on the Actions menu, choose Create resource link.
  2. Enter the resource link name as rl_cards.
  3. Leave the shared database and shared database’s owner ID as default.
  4. Choose Create.
  5. Follow the same process to create the rl_retail resource link.

Grant Describe permission to ConsumerAnalyst1

To grant Describe permissions on resource link databases to ConsumerAnalyst1, complete the following steps:

  1. On the Databases page, select the resource database rl_retail and on the Actions menu, choose Grant.
  2. In the Grant data permissions section, select IAM users and roles.
  3. Choose the role ConsumerAnalyst1.
  4. In the Resource link permissions section, select Describe.
  5. Choose Grant.
  6. Follow the same steps to grant rl_cards access to ConsumerAnalyst2.

Grant Tag permissions to ConsumerAnalyst1

To grant Tag permissions on the LOB:Retail=Customer tag to ConsumerAnalyst1 to access the customers table, complete the following steps:

  1. On the Lake Formation console, on the Data permission page, choose Grant.
  2. In the Grant data permissions section, select IAM users and roles.
  3. Choose the role ConsumerAnalyst1.
  4. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  5. Add the key LOB with value Retail, and the key LOB:Retail with value Customer.
  6. For Table permissions, select Select and Describe.
  7. Choose Grant.

Access to the customers table inside the rl_retail database is granted to ConsumerAnalyst1.

Grant Tag permissions to ConsumerAnalyst2

To grant Tag permissions on the Classification=Sensitive tag to ConsumerAnalyst2 to access attributes tagged as Sensitive in the cards table, complete the following steps:

  1. On the Lake Formation console, on the Data permission page, choose Grant.
  2. In the Grant data permissions section, select IAM users and roles.
  3. Choose the role ConsumerAnalyst2.
  4. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  5. Add the key LOB with value Cards, and the key Classification with value Sensitive.
  6. For Table permissions, select Select and Describe.
  7. Choose Grant.

Access to attributes tagged as sensitive in the cards table inside the rl_cards database is granted to ConsumerAnalyst2.

Validate the access to ConsumerAnalyst1

To confirm ConsumerAnalyst1 access, complete the following steps:

  1. On the Athena console, for Workgroup, choose consumer-workgroup.
  2. Choose Acknowledge.
  3. Choose the database rl_retail.

You should be able to see the customers table and query it.

Validate the access to ConsumerAnalyst2

To confirm ConsumerAnalyst2 access, complete the following steps:

  1. On the Athena console, for Workgroup, choose consumer-workgroup.
  2. Choose Acknowledge.
  3. Choose the database rl_cards.

You should be able to see only the sensitive attributes from the cards table.

As a thought experiment, you can also check to see the Lake Formation Tag-based access policy behavior on columns to which the user doesn’t have policy grants.

When a column that ConsumerAnalyst2 has no grant on is selected from the table rl_cards.cards, Athena returns an error. For example, you can run the following query to select the column issuing_bank, which is tagged Non-Sensitive:

SELECT issuing_bank FROM "rl_cards"."cards" limit 10;
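If you prefer the CLI, you can reproduce the same check with the Athena StartQueryExecution API while signed in as ConsumerAnalyst2; the result (or, in this case, the permission error) is then available through GetQueryExecution. This sketch assumes the consumer-workgroup already has a query result location configured, which the CloudFormation template’s Athena logs bucket is intended to provide:

aws athena start-query-execution \
    --work-group consumer-workgroup \
    --query-string 'SELECT issuing_bank FROM "rl_cards"."cards" limit 10'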

Conclusion

In this post, we explained how to create a Lake Formation tag-based access control policy in Lake Formation using an AWS public dataset. In addition, we explained how to query tables, databases, and columns that have Lake Formation tag-based access policies associated with them.

You can generalize these steps to share resources across accounts. You can also use these steps to grant permissions to SAML identities.

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. This is similar to how microservices turn a set of technical capabilities into a product that can be consumed by other microservices. Implementing a data mesh on AWS is made simple by using managed and serverless services such as AWS Glue, Lake Formation, Athena, and Redshift Spectrum to provide a well-understood, performant, scalable, and cost-effective solution to integrate, prepare, and serve data.


About the Authors

Nivas Shankar is a Principal Data Architect at Amazon Web Services. He helps and works closely with enterprise customers building data lakes and analytical applications on the AWS platform. He holds a master’s degree in physics and is highly passionate about theoretical physics concepts.

Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on Data Analytics, AI/ML and DevOps.

Pavan Emani is a Data Lake Architect at AWS, specialized in big data and analytics solutions. He helps customers modernize their data platforms on the cloud. Outside of work, he likes reading about space and watching sports.

Prasanna Sridharan is a Senior Data & Analytics Architect with AWS. He is passionate about building the right big data solution for the AWS customers. He is specialized in the design and implementation of Analytics, Data Management and Big Data systems, mainly for Enterprise and FSI customers.

Easily manage your data lake at scale using AWS Lake Formation Tag-based access control

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/easily-manage-your-data-lake-at-scale-using-tag-based-access-control-in-aws-lake-formation/

Thousands of customers are building petabyte-scale data lakes on AWS. Many of these customers use AWS Lake Formation to easily build and share their data lakes across the organization. As the number of tables and users increase, data stewards and administrators are looking for ways to manage permissions on data lakes easily at scale. Customers are struggling with “role explosion” and need to manage hundreds or even thousands of user permissions to control data access. For example, for an account with 1,000 resources and 100 principals, the data steward would have to create and manage up to 100,000 policy statements. Furthermore, as new principals and resources get added or deleted, these policies have to be updated to keep the permissions current.

Lake Formation Tag-based access control solves this problem by allowing data stewards to create LF-tags (based on their data classification and ontology) that can then be attached to resources. You can create policies on a smaller number of logical tags instead of specifying policies on named resources. LF-tags enable you to categorize and explore data based on taxonomies, which reduces policy complexity and scales permissions management. You can create and manage policies with tens of logical tags instead of the thousands of resources. LF-tags access control decouples policy creation from resource creation, which helps data stewards manage permissions on a large number of databases, tables, and columns by removing the need to update policies every time a new resource is added to the data lake. Finally, LF-tags access allows you to create policies even before the resources come into existence. All you have to do is tag the resource with the right LF-tags to ensure it is managed by existing policies.

This post focuses on managing permissions on data lakes at scale using LF-tags in Lake Formation. When it comes to managing data lake catalog tables from AWS Glue and administering permissions in Lake Formation, data stewards within the producing accounts have functional ownership based on the functions they support, and can grant access to various consumers, external organizations, and accounts. You can now define LF-tags; associate them at the database, table, or column level; and then share controlled access across analytic, machine learning (ML), and extract, transform, and load (ETL) services for consumption. LF-tags ensure that governance can be scaled easily by replacing the policy definitions of thousands of resources with a small number of logical tags.

LF-tags access has three main components:

  • Tag ontology and classification – Data stewards can define a LF-tag ontology based on data classification and grant access based on LF-tags to AWS Identity and Access Management (IAM) principals and SAML principals or groups
  • Tagging resources – Data engineers can easily create, automate, implement, and track all LF-tags and permissions against AWS Glue catalogs through the Lake Formation API
  • Policy evaluation – Lake Formation evaluates the effective permissions based on LF-tags at query time and allows access to data through consuming services such as Amazon Athena, Amazon Redshift Spectrum, Amazon SageMaker Data Wrangler, and Amazon EMR Studio, based on the effective permissions granted across multiple accounts or organization-level data shares

Solution overview

The following diagram illustrates the architecture of the solution described in this post.

In this post, we demonstrate how you can set up a Lake Formation table and create Lake Formation tag-based policies using a single account with multiple databases. We walk you through the following high-level steps:

  1. The data steward defines the tag ontology with two LF-tags: Confidential and Sensitive. Data with “Confidential = True” has tighter access controls. Data with “Sensitive = True” requires specific analysis from the analyst.
  2. The data steward assigns different permission levels to the data engineer to build tables with different LF-tags.
  3. The data engineer builds two databases: tag_database and col_tag_database. All tables in tag_database are configured with “Confidential = True”. All tables in the col_tag_database are configured with “Confidential = False”. Some columns of the table in col_tag_database are tagged with “Sensitive = True” for specific analysis needs.
  4. The data engineer grants read permission to the analyst for tables matching the expression conditions “Confidential = True” and “Confidential = False and Sensitive = True”.
  5. With this configuration, the data analyst can focus on performing analysis with the right data.

Provision your resources

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template creates three different personas to perform this exercise and copies the nyc-taxi-data dataset to your local Amazon Simple Storage Service (Amazon S3) bucket.

To create these resources, complete the following steps:

  1. Sign in to the AWS CloudFormation console in the us-east-1 Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. In the User Configuration section, enter passwords for the three personas: DataStewardUserPassword, DataEngineerUserPassword, and DataAnalystUserPassword.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

The stack takes up to 5 minutes and creates all the required resources, including:

  • An S3 bucket
  • The appropriate Lake Formation settings
  • The appropriate Amazon Elastic Compute Cloud (Amazon EC2) resources
  • Three user personas with user ID credentials:
    • Data steward (administrator) – The lf-data-steward user has the following access:
      • Read access to all resources in the Data Catalog
      • Can create LF-tags and associate to the data engineer role for grantable permission to other principals
    • Data engineer – The lf-data-engineer user has the following access:
      • Full read, write, and update access to all resources in the Data Catalog
      • Data location permissions in the data lake
      • Can associate LF-tags with resources in the Data Catalog
      • Can attach LF-tags to resources, which provides access to principals based on any policies created by data stewards
    • Data analyst – The lf-data-analyst user has the following access:
      • Fine-grained access to resources shared by Lake Formation Tag-based access policies

Register your data location and create an LF-tag ontology

We perform this first step as the data steward user (lf-data-steward) to verify the data in Amazon S3 and the Data Catalog in Lake Formation.

  1. Sign in to the Lake Formation console as lf-data-steward with the password used while deploying the CloudFormation stack.
  2. In the navigation pane, under Permissions, choose Administrative roles and tasks.
  3. For IAM users and roles, choose the user lf-data-steward.
  4. Choose Save to add lf-data-steward as a Lake Formation admin.

    Next, we update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control.
  5. In the navigation pane, under Data catalog, choose Settings.
  6. Uncheck Use only IAM access control for new databases.
  7. Uncheck Use only IAM access control for new tables in new databases.
  8. Choose Save.

    Next, we need to register the data location for the data lake.
  9. In the navigation pane, under Register and ingest, choose Data lake locations.
  10. For Amazon S3 path, enter s3://lf-tagbased-demo-<<Account-ID>>.
  11. For IAM role, leave it as the default value AWSServiceRoleForLakeFormationDataAccess.
  12. Choose Register location.
    Next, we create the ontology by defining a LF-tag.
  13. Under Permissions in the navigation pane, under Administrative roles, choose LF-Tags.
  14. Choose Add LF-tags.
  15. For Key, enter Confidential.
  16. For Values, add True and False.
  17. Choose Add LF-tag.
  18. Repeat the steps to create the LF-tag Sensitive with the value True.
    You have created all the necessary LF-tags for this exercise. Next, we give specific IAM principals the ability to attach newly created LF-tags to resources.
  19. Under Permissions in the navigation pane, under Administrative roles, choose LF-tag permissions.
  20. Choose Grant.
  21. Select IAM users and roles.
  22. For IAM users and roles, search for and choose the lf-data-engineer role.
  23. In the LF-tag permission scope section, add the key Confidential with values True and False, and the key Sensitive with value True.
  24. Under Permissions, select Describe and Associate for LF-tag permissions and Grantable permissions.
  25. Choose Grant.

    Next, we grant permissions to lf-data-engineer to create databases in our catalog and on the underlying S3 bucket created by AWS CloudFormation.
  26. Under Permissions in the navigation pane, choose Administrative roles.
  27. In the Database creators section, choose Grant.
  28. For IAM users and roles, choose the lf-data-engineer role.
  29. For Catalog permissions, select Create database.
  30. Choose Grant.

    Next, we grant permissions on the S3 bucket (s3://lf-tagbased-demo-<<Account-ID>>) to the lf-data-engineer user.
  31. In the navigation pane, choose Data locations.
  32. Choose Grant.
  33. Select My account.
  34. For IAM users and roles, choose the lf-data-engineer role.
  35. For Storage locations, enter the S3 bucket created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>).
  36. Choose Grant.
    Next, we grant lf-data-engineer grantable permissions on resources associated with the LF-tag expression Confidential=True.
  37. In the navigation pane, choose Data permissions.
  38. Choose Grant.
  39. Select IAM users and roles.
  40. Choose the role lf-data-engineer.
  41. In the LF-tag or catalog resources section, Select Resources matched by LF-Tags.
  42. Choose Add LF-Tag.
  43. Add the key Confidential with the values True.
  44. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  45. In the Table and column permissions section, select Describe, Select, and Alter for both Table permissions and Grantable permissions.
  46. Choose Grant.
    Next, we grant lf-data-engineer grantable permissions on resources associated with the LF-tag expression Confidential=False.
  47. In the navigation pane, choose Data permissions.
  48. Choose Grant.
  49. Select IAM users and roles.
  50. Choose the role lf-data-engineer.
  51. Select Resources matched by LF-tags.
  52. Choose Add LF-tag.
  53. Add the key Confidential with the values False.
  54. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  55. In the Table and column permissions section, do not select anything.
  56. Choose Grant.
    Next, we grant lf-data-engineer grantable permissions on resources associated with the LF-tag expression Confidential=False and Sensitive=True.
  57. In the navigation pane, choose Data permissions.
  58. Choose Grant.
  59. Select IAM users and roles.
  60. Choose the role lf-data-engineer.
  61. Select Resources matched by LF-tags.
  62. Choose Add LF-tag.
  63. Add the key Confidential with the values False.
  64. Choose Add LF-tag.
  65. Add the key Sensitive with the values True.
  66. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  67. In the Table and column permissions section, select Describe, Select, and Alter for both Table permissions and Grantable permissions.
  68. Choose Grant.
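Each of these console grants maps to a single GrantPermissions API call. As a hedged sketch, the last of these grants (grantable Describe, Select, and Alter on resources tagged Confidential=False and Sensitive=True for lf-data-engineer) could be scripted roughly as follows, replacing <<Account-ID>> with your account ID:

aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::<<Account-ID>>:user/lf-data-engineer \
    --resource '{"LFTagPolicy": {"ResourceType": "TABLE", "Expression": [{"TagKey": "Confidential", "TagValues": ["False"]}, {"TagKey": "Sensitive", "TagValues": ["True"]}]}}' \
    --permissions "DESCRIBE" "SELECT" "ALTER" \
    --permissions-with-grant-option "DESCRIBE" "SELECT" "ALTER"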

Create the Lake Formation databases

Now, sign in as lf-data-engineer with the password used while deploying the CloudFormation stack. We create two databases and attach LF-tags to the databases and specific columns for testing purposes.

Create your database and table for database-level access

We first create the database tag_database, the table source_data, and attach appropriate LF-tags.

  1. On the Lake Formation console, choose Databases.
  2. Choose Create database.
  3. For Name, enter tag_database.
  4. For Location, enter the S3 location created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>/tag_database/).
  5. Deselect Use only IAM access control for new tables in this database.
  6. Choose Create database.

Next, we create a new table within tag_database.

  1. On the Databases page, select the database tag_database.
  2. Choose View tables, then choose Create table.
  3. For Name, enter source_data.
  4. For Database, choose the database tag_database.
  5. For Data is located in, select Specified path in my account.
  6. For Include path, enter the path to tag_database created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>/tag_database/).
  7. For Data format, select CSV.
  8. Under Upload schema, enter the following schema JSON:
    [
        { "Name": "vendorid", "Type": "string" },
        { "Name": "lpep_pickup_datetime", "Type": "string" },
        { "Name": "lpep_dropoff_datetime", "Type": "string" },
        { "Name": "store_and_fwd_flag", "Type": "string" },
        { "Name": "ratecodeid", "Type": "string" },
        { "Name": "pulocationid", "Type": "string" },
        { "Name": "dolocationid", "Type": "string" },
        { "Name": "passenger_count", "Type": "string" },
        { "Name": "trip_distance", "Type": "string" },
        { "Name": "fare_amount", "Type": "string" },
        { "Name": "extra", "Type": "string" },
        { "Name": "mta_tax", "Type": "string" },
        { "Name": "tip_amount", "Type": "string" },
        { "Name": "tolls_amount", "Type": "string" },
        { "Name": "ehail_fee", "Type": "string" },
        { "Name": "improvement_surcharge", "Type": "string" },
        { "Name": "total_amount", "Type": "string" },
        { "Name": "payment_type", "Type": "string" }
    ]
    

  9. Choose Upload.

After uploading the schema, the table schema should look like the following screenshot.

  1. Choose Submit.

Now we’re ready to attach LF-tags at the database level.

  1. On the Databases page, find and select tag_database.
  2. On the Actions menu, choose Edit LF-tags.
  3. Choose Assign new LF-tag.
  4. For Assigned keys, choose the Confidential LF-tag you created earlier.
  5. For Values, choose True.
  6. Choose Save.

This completes the LF-tag assignment to the tag_database database.

Create your database and table for column-level access

Now we repeat these steps to create the database col_tag_database and table source_data_col_lvl, and attach LF-tags at the column level.

  1. On the Databases page, choose Create database.
  2. For Name, enter col_tag_database.
  3. For Location, enter the S3 location created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>/col_tag_database/).
  4. Deselect Use only IAM access control for new tables in this database.
  5. Choose Create database.
  6. On the Databases page, select your new database (col_tag_database).
  7. Choose View tables, then choose Create table.
  8. For Name, enter source_data_col_lvl.
  9. For Database, choose your new database (col_tag_database).
  10. For Data is located in, select Specified path in my account.
  11. Enter the S3 path for col_tag_database (s3://lf-tagbased-demo-<<Account-ID>>/col_tag_database/).
  12. For Data format, select CSV.
  13. Under Upload schema, enter the following schema JSON:
    [
        { "Name": "vendorid", "Type": "string" },
        { "Name": "lpep_pickup_datetime", "Type": "string" },
        { "Name": "lpep_dropoff_datetime", "Type": "string" },
        { "Name": "store_and_fwd_flag", "Type": "string" },
        { "Name": "ratecodeid", "Type": "string" },
        { "Name": "pulocationid", "Type": "string" },
        { "Name": "dolocationid", "Type": "string" },
        { "Name": "passenger_count", "Type": "string" },
        { "Name": "trip_distance", "Type": "string" },
        { "Name": "fare_amount", "Type": "string" },
        { "Name": "extra", "Type": "string" },
        { "Name": "mta_tax", "Type": "string" },
        { "Name": "tip_amount", "Type": "string" },
        { "Name": "tolls_amount", "Type": "string" },
        { "Name": "ehail_fee", "Type": "string" },
        { "Name": "improvement_surcharge", "Type": "string" },
        { "Name": "total_amount", "Type": "string" },
        { "Name": "payment_type", "Type": "string" }
    ]
    

  14. Choose Upload.

After uploading the schema, the table schema should look like the following screenshot.

  1. Choose Submit to complete the creation of the table.

Now you associate the Sensitive=True LF-tag with the columns vendorid and fare_amount.

  1. On the Tables page, select the table you created (source_data_col_lvl).
  2. On the Actions menu, choose Edit Schema.
  3. Select the column vendorid and choose Edit LF-tags.
  4. For Assigned keys, choose Sensitive.
  5. For Values, choose True.
  6. Choose Save.

Repeat the steps to assign the Sensitive LF-tag to the fare_amount column.

  1. Select the column fare_amount and choose Edit LF-tags.
  2. Add the Sensitive key with value True.
  3. Choose Save.
  4. Choose Save as new version to save the new schema version with tagged columns. The following screenshot shows the column properties with the LF-tags updated.
    Next, we associate the Confidential=False LF-tag with col_tag_database. This is required for lf-data-analyst to be able to describe the database col_tag_database when logged in from Athena.
  5. On the Databases page, find and select col_tag_database.
  6. On the Actions menu, choose Edit LF-tags.
  7. Choose Assign new LF-tag.
  8. For Assigned keys, choose the Confidential LF-tag you created earlier.
  9. For Values, choose False.
  10. Choose Save.

Grant table permissions

Now we grant permissions to data analysts for consumption of the tag_database and col_tag_database databases.

  1. Sign in to the Lake Formation console as lf-data-engineer.
  2. On the Permissions page, select Data permissions.
  3. Choose Grant.
  4. Under Principals, select IAM users and roles.
  5. For IAM users and roles, choose lf-data-analyst.
  6. Select Resources matched by LF-tags.
  7. Choose Add LF-tag.
  8. For Key, choose Confidential.
  9. For Values, choose True.
  10. For Database permissions, select Describe.
  11. For Table permissions, choose Select and Describe.
  12. Choose Grant.

This grants the lf-data-analyst user Describe on the database and Select and Describe on the tables associated with the LF-tag Confidential=True (the database tag_database and its tables).

Next, we repeat the steps to grant permissions to data analysts for the LF-tag expression Confidential=False. This LF-tag is used to describe col_tag_database and the table source_data_col_lvl when logged in as lf-data-analyst from Athena, so we grant only Describe access to the resources through this LF-tag expression.
  13. Sign in to the Lake Formation console as lf-data-engineer.
  14. On the Databases page, select the database col_tag_database.
  15. Choose Action and Grant.
  16. Under Principals, select IAM users and roles.
  17. For IAM users and roles, choose lf-data-analyst.
  18. Select Resources matched by LF-tags.
  19. Choose Add LF-tag.
  20. For Key, choose Confidential.
  21. For Values, choose False.
  22. For Database permissions, select Describe.
  23. For Table permissions, do not select anything.
  24. Choose Grant.

Next, we repeat the steps to grant permissions to data analysts for the LF-tag expression Confidential=False and Sensitive=True. This LF-tag expression provides column-level access to the table source_data_col_lvl in col_tag_database when logged in as lf-data-analyst from Athena.
  25. Sign in to the Lake Formation console as lf-data-engineer.
  26. On the Databases page, select the database col_tag_database.
  27. Choose Action and Grant.
  28. Under Principals, select IAM users and roles.
  29. For IAM users and roles, choose lf-data-analyst.
  30. Select Resources matched by LF-tags.
  31. Choose Add LF-tag.
  32. For Key, choose Confidential.
  33. For Values, choose False.
  34. Choose Add LF-tag.
  35. For Key, choose Sensitive.
  36. For Values, choose True.
  37. For Database permissions, select Describe.
  38. For Table permissions, select Select and Describe.
  39. Choose Grant.
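As referenced earlier, these grants can also be scripted through the Lake Formation GrantPermissions API. The following Boto3 sketch shows the first grant (resources matching Confidential=True); the IAM user ARN is a placeholder for the lf-data-analyst user in your account, and the later grants follow the same pattern with different LF-tag expressions and permissions.

import boto3

lf = boto3.client("lakeformation")

# Placeholder ARN for the analyst principal; substitute your account ID
analyst_arn = "arn:aws:iam::<<Account-ID>>:user/lf-data-analyst"

# Describe on databases matching Confidential=True
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": analyst_arn},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "DATABASE",
            "Expression": [{"TagKey": "Confidential", "TagValues": ["True"]}],
        }
    },
    Permissions=["DESCRIBE"],
)

# Select and Describe on tables matching Confidential=True
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": analyst_arn},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "Confidential", "TagValues": ["True"]}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)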

Run a query in Athena to verify the permissions

For this step, we sign in to the Athena console as lf-data-analyst and run SELECT queries against the two tables (source_data and source_data_col_lvl). We use our S3 path as the query result location (s3://lf-tagbased-demo-<<Account-ID>>/athena-results/).

  1. In the Athena query editor, choose tag_database in the left panel.
  2. Choose the additional menu options icon (three vertical dots) next to source_data and choose Preview table.
  3. Choose Run query.

The query should take a few minutes to run. The following screenshot shows our query results.

The first query displays all the columns in the output because the LF-tag is associated at the database level and the source_data table automatically inherited the LF-tag from the database tag_database.

  1. Run another query using col_tag_database and source_data_col_lvl.

The second query returns just the two tagged columns, vendorid and fare_amount, because the LF-tag expression Confidential=False and Sensitive=True grants column-level access to only those columns.

As a thought experiment, you can also check the behavior of Lake Formation tag-based access policies on columns for which the user has no policy grants.

When an untagged column is selected from the table source_data_col_lvl, Athena returns an error. For example, you can run the following query to select the untagged column geolocationid:

SELECT geolocationid FROM "col_tag_database"."source_data_col_lvl" limit 10;
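If you want to run the same verification from a script rather than the Athena console, the following Boto3 sketch (run with the lf-data-analyst credentials) submits a preview query and polls for its final state. The output location matches the query result location used earlier; everything else is illustrative.

import time
import boto3

athena = boto3.client("athena")

# Preview the column-tagged table; with LF-tag permissions in place,
# SELECT * returns only the columns the principal can access
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "col_tag_database"."source_data_col_lvl" LIMIT 10;',
    ResultConfiguration={
        "OutputLocation": "s3://lf-tagbased-demo-<<Account-ID>>/athena-results/"
    },
)

# Poll until the query reaches a terminal state
query_id = execution["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(status["State"])  # Expect SUCCEEDED; selecting an untagged column would fail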

Extend the solution to cross-account scenarios

You can extend this solution to share catalog resources across accounts. The following diagram illustrates a cross-account architecture.

We describe this in more detail in a subsequent post.

Clean up

To help prevent unwanted charges to your AWS account, you can delete the AWS resources that you used for this walkthrough.

  1. Sign in as lf-data-engineer and delete the databases tag_database and col_tag_database.
  2. Sign in as lf-data-steward and clean up all the LF-tag permissions, data permissions, and data location permissions that were granted above to lf-data-engineer and lf-data-analyst.
  3. Sign in to the Amazon S3 console as the account owner (the IAM credentials you used to deploy the CloudFormation stack).
  4. Delete the following buckets (a scripted sketch for emptying and deleting them follows these steps):
    1. lf-tagbased-demo-accesslogs-<acct-id>
    2. lf-tagbased-demo-<acct-id>
  5. On the AWS CloudFormation console, delete the stack you created.
  6. Wait for the stack status to change to DELETE_COMPLETE.
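As noted above, you can also empty and delete the two demo buckets with a short script. The following Boto3 sketch assumes the buckets are not versioned; replace <acct-id> with your account ID.

import boto3

s3 = boto3.resource("s3")

for name in ["lf-tagbased-demo-accesslogs-<acct-id>", "lf-tagbased-demo-<acct-id>"]:
    bucket = s3.Bucket(name)
    bucket.objects.all().delete()  # remove all objects so the bucket can be deleted
    bucket.delete()                # delete the now-empty bucket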

Conclusion

In this post, we explained how to create a Lake Formation tag-based access control policy using an AWS public dataset. In addition, we explained how to query tables, databases, and columns that have Lake Formation tag-based access policies associated with them.

You can generalize these steps to share resources across accounts. You can also use these steps to grant permissions to SAML identities. In subsequent posts, we highlight these use cases in more detail.


About the Authors

Sanjay Srivastava is a principal product manager for AWS Lake Formation. He is passionate about building products, in particular products that help customers get more out of their data. During his spare time, he loves to spend time with his family and engage in outdoor activities including hiking, running, and gardening.

 

 

 

Nivas Shankar is a Principal Data Architect at Amazon Web Services. He helps and works closely with enterprise customers building data lakes and analytical applications on the AWS platform. He holds a master’s degree in physics and is highly passionate about theoretical physics concepts.

 

 

Pavan Emani is a Data Lake Architect at AWS, specialized in big data and analytics solutions. He helps customers modernize their data platforms on the cloud. Outside of work, he likes reading about space and watching sports.

 

Design a data mesh architecture using AWS Lake Formation and AWS Glue

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/

Organizations of all sizes have recognized that data is one of the key enablers to increase and sustain innovation, and drive value for their customers and business units. They are eagerly modernizing traditional data platforms with cloud-native technologies that are highly scalable, feature-rich, and cost-effective. As you look to make business decisions driven by data, you can be agile and productive by adopting a mindset that delivers data products from specialized teams, rather than through a centralized data management platform that provides generalized analytics.

In this post, we describe an approach to implement a data mesh using AWS native services, including AWS Lake Formation and AWS Glue. This approach enables lines of business (LOBs) and organizational units to operate autonomously by owning their data products end to end, while providing central data discovery, governance, and auditing for the organization at large, to ensure data privacy and compliance.

Benefits of a data mesh model

A centralized model is intended to simplify staffing and training by centralizing data and technical expertise in a single place, to reduce technical debt by managing a single data platform, and to reduce operational costs. Data platform groups, often part of central IT, are divided into teams based on the technical functions of the platform they support. For instance, one team may own the ingestion technologies used to collect data from numerous data sources managed by other teams and LOBs. A different team might own data pipelines, writing and debugging extract, transform, and load (ETL) code and orchestrating job runs, while validating and fixing data quality issues and ensuring data processing meets business SLAs. However, managing data through a central data platform can create scaling, ownership, and accountability challenges, because central teams may not understand the specific needs of a data domain, whether due to data types and storage, security, data catalog requirements, or specific technologies needed for data processing.

You can often reduce these challenges by giving ownership and autonomy to the team who owns the data, best allowing them to build data products, rather than only being able to use a common central data platform. For instance, product teams are responsible for ensuring the product inventory is updated regularly with new products and changes to existing ones. They’re the domain experts of the product inventory datasets. If a discrepancy occurs, they’re the only group who knows how to fix it. Therefore, they’re best able to implement and operate a technical solution to ingest, process, and produce the product inventory dataset. They own everything leading up to the data being consumed: they choose the technology stack, operate in the mindset of data as a product, enforce security and auditing, and provide a mechanism to expose the data to the organization in an easy-to-consume way. This reduces overall friction for information flow in the organization, where the producer is responsible for the datasets they produce and is accountable to the consumer based on the advertised SLAs.

This data-as-a-product paradigm is similar to Amazon’s operating model of building services. Service teams build their services, expose APIs with advertised SLAs, operate their services, and own the end-to-end customer experience. This is distinct from the world where someone builds the software, and a different team operates it. The end-to-end ownership model has enabled us to implement faster, with better efficiency, and to quickly scale to meet customers’ use cases. We aren’t limited by centralized teams and their ability to scale to meet the demands of the business. Each service we build stands on the shoulders of other services that provide the building blocks. The analogy in the data world would be the data producers owning the end-to-end implementation and serving of data products, using the technologies they selected based on their unique needs. At AWS, we have been talking about the data-driven organization model for years, which consists of data producers and consumers. This model is similar to those used by some of our customers, and has been eloquently described recently by Zhamak Dehghani of Thoughtworks, who coined the term data mesh in 2019.

Solution overview

In this post, we demonstrate how the Lake House Architecture is ideally suited to help teams build data domains, and how you can use the data mesh approach to bring domains together to enable data sharing and federation across business units. This approach can enable better autonomy and a faster pace of innovation, while building on top of a proven and well-understood architecture and technology stack, and ensuring high standards for data security and governance.

The following are key points when considering a data mesh design:

  • Data mesh is a pattern for defining how organizations can organize around data domains with a focus on delivering data as a product. However, it may not be the right pattern for every customer.
  • A Lake House approach and the data lake architecture provide technical guidance and solutions for building a modern data platform on AWS.
  • The Lake House approach with a foundational data lake serves as a repeatable blueprint for implementing data domains and products in a scalable way.
  • The manner in which you utilize AWS analytics services in a data mesh pattern may change over time, but still remains consistent with the technological recommendations and best practices for each service.

The following are data mesh design goals:

  • Data as a product – Each organizational domain owns their data end to end. They’re responsible for building, operating, serving, and resolving any issues arising from the use of their data. Data accuracy and accountability lies with the data owner within the domain.
  • Federated data governance – Data governance ensures data is secure, accurate, and not misused. The technical implementation of data governance such as collecting lineage, validating data quality, encrypting data at rest and in transit, and enforcing appropriate access controls can be managed by each of the data domains. However, central data discovery, reporting, and auditing is needed to make it simple for users to find data and for auditors to verify compliance.
  • Common access – Data must be easily consumable by subject matter personas like data analysts and data scientists, as well as purpose-built analytics and machine learning (ML) services like Amazon Athena, Amazon Redshift, and Amazon SageMaker. To do that, data domains must expose a set of interfaces that make data consumable while enforcing appropriate access controls and audit tracking.

The following are user experience considerations:

  • Data teams own their information lifecycle, from the application that creates the original data, through to the analytics systems that extract and create business reports and predictions. Through this lifecycle, they own the data model, and determine which datasets are suitable for publication to consumers.
  • Data domain producers expose datasets to the rest of the organization by registering them with a central catalog. They can choose what to share, for how long, and how consumers can interact with it. They’re also responsible for maintaining the data and making sure it’s accurate and current.
  • Data domain consumers or individual users should be given access to data through a supported interface, like a data API, that can ensure consistent performance, tracking, and access controls.
  • All data assets are easily discoverable from a single central data catalog. The data catalog contains the datasets registered by data domain producers, including supporting metadata such as lineage, data quality metrics, ownership information, and business context.
  • All actions taken with data, usage patterns, data transformation, and data classifications should be accessible through a single, central place. Data owners, administrators, and auditors should be able to inspect a company’s data compliance posture in a single place.

Let’s start with a high-level design that builds on top of the data mesh pattern. As seen in the following diagram, it separates consumers, producers, and central governance to highlight the key aspects discussed previously. However, a data domain may represent a data consumer, a data producer, or both.

The objective for this design is to create a foundation for building data platforms at scale, supporting the objectives of data producers and consumers with strong and consistent governance. The AWS approach to designing a data mesh identifies a set of general design principles and services that facilitate best practices for building scalable data platforms and ubiquitous data sharing, and enable self-service analytics on AWS.

Expanding on the preceding diagram, we provide additional details to show how AWS native services support producers, consumers, and governance. Each data domain, whether a producer, consumer, or both, is responsible for its own technology stack. However, using AWS native analytics services with the Lake House Architecture offers a repeatable blueprint that your organization can use as you scale your data mesh design. Having a consistent technical foundation ensures services are well integrated, core features are supported, scale and performance are baked in, and costs remain low.

A data domain: producer and consumer

A data mesh design organizes around data domains. Each data domain owns and operates multiple data products with its own data and technology stack, which is independent from others. Data domains can be purely producers, such as a finance domain that only produces sales and revenue data for other domains to consume, or purely consumers, such as a product recommendation service that consumes data from other domains to create the product recommendations displayed on an ecommerce website. In addition to sharing, a centralized data catalog can provide users with the ability to more quickly find available datasets, and allows data owners to assign access permissions and audit usage across business units.

A producer domain resides in an AWS account and uses Amazon Simple Storage Service (Amazon S3) buckets to store raw and transformed data. It maintains its own ETL stack using AWS Glue to process and prepare the data before being cataloged into a Lake Formation Data Catalog in their own account. Similarly, the consumer domain includes its own set of tools to perform analytics and ML in a separate AWS account. The central data governance account is used to share datasets securely between producers and consumers. It’s important to note that sharing is done through metadata linking alone. Data isn’t copied to the central account, and ownership remains with the producer. The central catalog makes it easy for any user to find data and to ask the data owner for access in a single place. They can then use their tool of choice inside of their own environment to perform analytics and ML on the data.

The following diagram illustrates the end-to-end workflow.

The workflow from producer to consumer includes the following steps:

  1. Data source locations hosted by the producer are created within the producer’s AWS Glue Data Catalog and registered with Lake Formation.
  2. When a dataset is presented as a product, producers create Lake Formation Data Catalog entities (database, table, columns, attributes) within the central governance account. This makes it easy to find and discover catalogs across consumers. However, this doesn’t grant any permission rights to catalogs or data to all accounts or consumers, and all grants are authorized by the producer.
  3. The central Lake Formation Data Catalog shares the Data Catalog resources back to the producer account with required permissions via Lake Formation resource links to metadata databases and tables.
  4. Lake Formation permissions are granted in the central account to producer role personas (such as the data engineer role) to manage schema changes and perform data transformations (alter, delete, update) on the central Data Catalog.
  5. Producers accept the resource share from the central governance account so they can make changes to the schema at a later time.
  6. Data changes made within the producer account are automatically propagated into the central governance copy of the catalog.
  7. Based on a consumer access request, and the need to make data visible in the consumer’s AWS Glue Data Catalog, the central account owner grants Lake Formation permissions to a consumer account based on direct entity sharing, or based on tag-based access controls, which can be used to administer access via controls like data classification, cost center, or environment (a scripted sketch of such a grant follows these steps).
  8. Lake Formation in the consumer account can define access permissions on these datasets for local users to consume. Users in the consumer account, like data analysts and data scientists, can query data using their chosen tool such as Athena and Amazon Redshift.
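To make step 7 concrete, the following Boto3 sketch shows how the central governance account might grant a consumer account access to all tables matching an LF-tag expression, including the grant option so consumer admins can re-grant to their local principals. The consumer account ID and the LF-tag are illustrative assumptions.

import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    # Hypothetical consumer AWS account ID
    Principal={"DataLakePrincipalIdentifier": "222222222222"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            # Illustrative LF-tag expression used for sharing
            "Expression": [{"TagKey": "classification", "TagValues": ["public"]}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    # Allow consumer account admins to grant these permissions to local users
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)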

Build data products

Data domain producers ingest data into their respective S3 buckets through a set of pipelines that they manage, own, and operate. Producers are responsible for the full lifecycle of the data under their control, and for moving data from raw data captured from applications to a form that is suitable for consumption by external parties. AWS Glue is a serverless data integration and preparation service that offers all the components needed to develop, automate, and manage data pipelines at scale, and in a cost-effective way. It provides a simple-to-use interface that organizations can use to quickly onboard data domains without needing to test, approve, and juggle vendor roadmaps to ensure all required features and integrations are available.

Central data governance

The central data governance account stores a data catalog of all enterprise data across accounts, and provides features allowing producers to register and create catalog entries with AWS Glue from all their S3 buckets. No data (except logs) exists in this account. Lake Formation centrally defines security, governance, and auditing policies in one place, enforces those policies for consumers across analytics applications, and only provides authorization and session token access for data sources to the role that is requesting access. Lake Formation also provides uniform access control for enterprise-wide data sharing through resource shares with centralized governance and auditing.

Common access

Each consumer obtains access to shared resources from the central governance account in the form of resource links. These are available in the consumer’s local Lake Formation and AWS Glue Data Catalog, allowing database and table access that can be managed by consumer admins. After access is granted, consumers can access the account and perform different actions with the following services:

  • Athena acts as a consumer and runs queries on data registered using Lake Formation. Lake Formation verifies that the workgroup AWS Identity and Access Management (IAM) role principal has the appropriate Lake Formation permissions to the database, table, and Amazon S3 location as appropriate for the query. If the principal has access, Lake Formation vends temporary credentials to Athena, and the query runs. Authentication is granted through IAM roles or users, or web federated identities using SAML or OIDC. For more information, see How Athena Accesses Data Registered With Lake Formation.
  • Amazon SageMaker Data Wrangler allows you to quickly select data from multiple data sources, such as Amazon S3, Athena, Amazon Redshift, Lake Formation, and Amazon SageMaker Feature Store. You can also write queries for data sources and import data directly into SageMaker from various file formats, such as CSV files, Parquet files, and database tables. Authentication is granted through IAM roles in the consumer account. For more information, see Prepare ML Data with Amazon SageMaker Data Wrangler.
  • Amazon Redshift Spectrum allows you to register external schemas from Lake Formation, and provides a hierarchy of permissions to control access to Amazon Redshift databases and tables in a Data Catalog. If the consumer principal has access, Lake Formation vends temporary credentials to Redshift Spectrum tables, and the query runs. Authentication is granted through IAM roles or users, or web federated identities using SAML or OIDC. For more information, see Using Redshift Spectrum with AWS Lake Formation.
  • Amazon QuickSight via Athena integrates with Lake Formation permissions. If you’re querying data with Athena, you can use Lake Formation to simplify how you secure and connect to your data from QuickSight. Lake Formation adds to the IAM permissions model by providing its own permissions model that is applied to AWS analytics and ML services. Authentication is granted through IAM roles that are mapped to QuickSight user permissions. For more information, see Authorizing Connections Through AWS Lake Formation.
  • Amazon EMR Studio and EMR notebooks allow running Spark SQL against Lake Formation’s tables backed by a SAML authority. Beginning with Amazon EMR 5.31.0, you can launch a cluster that integrates with Lake Formation. Authentication is granted through IAM roles or users, or web federated identities using SAML or OIDC. For more information, see Integrate Amazon EMR with AWS Lake Formation.

With this design, you can connect multiple data lake houses to a centralized governance account that stores all the metadata from each environment. The strength of this approach is that it integrates all the metadata and stores it in one meta model schema that can be easily accessed through AWS services for various consumers. You can extend this architecture to register new data lake catalogs and share resources across consumer accounts. The following diagram illustrates a cross-account data mesh architecture.

Conclusion

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. This is similar to how microservices turn a set of technical capabilities into a product that can be consumed by other microservices. Implementing a data mesh on AWS is made simple by using managed and serverless services such as AWS Glue, Lake Formation, Athena, and Redshift Spectrum to provide a well-understood, performant, scalable, and cost-effective solution to integrate, prepare, and serve data.

One customer who used this data mesh pattern is JPMorgan Chase. For more information, see How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform.

Lake Formation offers the ability to enforce data governance within each data domain and across domains to ensure data is easily discoverable and secure, and lineage is tracked and access can be audited. The Lake House Architecture provides an ideal foundation to support a data mesh, and provides a design pattern to ramp up delivery of producer domains within an organization. Each domain has autonomy to choose their own tech stack, but is governed by a federated security model that can be administered centrally, providing best practices for security and compliance, while allowing high agility within the domain.

 


About the Authors

Nivas Shankar is a Principal Data Architect at Amazon Web Services. He helps and works closely with enterprise customers building data lakes and analytical applications on the AWS platform. He holds a master’s degree in physics and is highly passionate about theoretical physics concepts.

 

 

Roy Hasson is the Global Business Development Lead of Analytics and Data Lakes at AWS. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan, cheering his team on and hanging out with his family.

 

 

Zach Mitchell is a Sr. Big Data Architect. He works within the product team to enhance understanding between product engineers and their customers while guiding customers through their journey to develop data lakes and other data solutions on AWS analytics services.

 

 

Ian Meyers is a Sr. Principal Product Manager for AWS Database Services. He works with many of AWS largest customers on emerging technology needs, and leads several data and analytics initiatives within AWS including support for Data Mesh.

 

 

The AWS Data Lake Team members are Chanu Damarla, Sanjay Srivastava, Natacha Maheshe, Roy Ben-Alta, Amandeep Khurana, Jason Berkowitz, David Tucker, and Taz Sayed.