All posts by Srividya Parthasarathy

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/efficiently-crawl-your-data-lake-and-improve-data-access-with-aws-glue-crawler-using-partition-indexes/

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems. AWS Glue crawlers provide a straightforward way to catalog data in the AWS Glue Data Catalog that removes the heavy lifting when it comes to schema management and data classification. AWS Glue crawlers extract the data schema and partitions from Amazon S3 to automatically populate the Data Catalog, keeping the metadata current.

But with data growing exponentially over time, the number of partitions in a given table can grow significantly. When analytics services like Amazon Athena query a table containing millions of partitions, the time needed to retrieve partition metadata increases and can lengthen query runtime.

Today, AWS Glue crawler support has been expanded to automatically add partition indexes for newly discovered tables to optimize query processing on the partitioned dataset. Now, when the crawler creates a new Data Catalog table during a crawler run, it also creates a partition index by default, with the largest permutation of all numeric and string type partition columns as keys. The Data Catalog then creates a searchable index based on these keys, reducing the time required to retrieve and filter partition metadata on tables with millions of partitions. The creation of partition indexes benefits the analytics workloads running on Athena, Amazon EMR, Amazon Redshift Spectrum, and AWS Glue.
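The crawler adds these indexes automatically for newly created tables. If you want to add a partition index to a table that already exists in the Data Catalog, you can do so yourself through the AWS Glue API; the following is a minimal boto3 sketch (the index name is illustrative, and the database and table are the ones used later in this post):

import boto3

glue = boto3.client("glue")

# Add a partition index to an existing Data Catalog table.
# Keys must be existing partition columns, listed in partition order.
glue.create_partition_index(
    DatabaseName="blog_partition_index_crawlerdb",
    TableName="highly_partitioned_table",
    PartitionIndex={
        "IndexName": "year_month_day_hour_idx",  # illustrative name
        "Keys": ["year", "month", "day", "hour"],
    },
)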

In this post, we describe how to create partition indexes with an AWS Glue crawler and compare the query performance improvement when accessing the crawled data with and without a partition index from Athena.

Solution overview

We use an AWS CloudFormation template to create our solution resources. In the following steps, we demonstrate how to configure the AWS Glue crawler to create a partition index using either the AWS Glue console or the AWS Command Line Interface (AWS CLI). Then we compare the query performance improvements using Athena.

Prerequisites

To follow along with this post, you must have access to an AWS Identity and Access Management (IAM) administrator role to create resources using AWS CloudFormation.

Set up your solution resources

The CloudFormation template generates the following resources:

  • IAM roles and policies
  • An AWS Glue database to hold the schema
  • An AWS Glue crawler pointing to a highly partitioned dataset
  • An Athena workgroup and bucket to store query results

Complete the following steps to set up the solution resources:

  1. Log in to the AWS Management Console as an IAM administrator.
  2. Choose Launch Stack to deploy the CloudFormation template:
  3. For DatabaseName, keep the default blog_partition_index_crawlerdb.
  4. Choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create stack.
  7. When the stack is complete, on the AWS CloudFormation console, navigate to the Outputs tab of the stack.
  8. Note the values of DatabaseName and GlueCrawlerName.

Some of the resources that this stack deploys incur costs when in use.

Edit and run the AWS Glue crawler

To configure and run the AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Locate the crawler blog-partition-index-crawler and choose Edit.
  3. In the Set output and scheduling section, under Advanced options, select Create partition indexes automatically.
  4. Review and update the crawler settings.

Alternatively, you can configure your crawler using the AWS CLI (provide your IAM role and Region):

aws glue create-crawler --name blog-partition-index-crawler --targets '{ "S3Targets": [{ "Path": "s3://awsglue-datasets/examples/highly-partitioned-table/"}] }' --database-name "blog_partition_index_crawlerdb" --role <Crawler_IAM_role> --configuration "{\"Version\":1.0,\"CreatePartitionIndex\":true}" --region <region_name>
  1. Now run the crawler and verify that the crawler run is complete.

This is a highly partitioned dataset, and the crawl will take approximately 90 minutes to complete.
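Because the crawl is long-running, you can also start it and poll its status programmatically rather than watching the console. The following is a minimal boto3 sketch (the crawler name matches the one created by the CloudFormation stack):

import time
import boto3

glue = boto3.client("glue")
crawler_name = "blog-partition-index-crawler"

glue.start_crawler(Name=crawler_name)

# Poll until the crawler returns to the READY state.
while True:
    state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
    print(f"Crawler state: {state}")
    if state == "READY":
        break
    time.sleep(60)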

Verify the partitioned table

In the AWS Glue database blog_partition_index_crawlerdb, verify that the table highly_partitioned_table is created.

By default, the crawler builds an index on the largest permutation of valid partition columns (numeric or string types), keeping the same order as the table's partition columns. For the table created by the crawler (highly_partitioned_table), we have partition columns year (string), month (string), day (string), and hour (string).

Based on this definition, the crawler created an index on the permutation of year, month, day, and hour. Any partition index the crawler creates by default is prefixed with crawler_.

You can verify this by navigating to the table highly_partitioned_table on the AWS Glue console and choosing the Indexes tab.

The crawler was able to crawl the S3 data source and successfully populate the partition indexes for the table.
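You can also confirm the index through the AWS Glue API; the following is a minimal boto3 sketch:

import boto3

glue = boto3.client("glue")

response = glue.get_partition_indexes(
    DatabaseName="blog_partition_index_crawlerdb",
    TableName="highly_partitioned_table",
)

# Each descriptor lists the index name, its key columns, and its status.
for index in response["PartitionIndexDescriptorList"]:
    keys = [key["Name"] for key in index["Keys"]]
    print(index["IndexName"], keys, index["IndexStatus"])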

Compare the query performance improvements using Athena

First, we query the table in Athena without using the partition index. To verify the tables using Athena, complete the following steps:

  1. On the Athena console, choose crawler-primary-workgroup as the Athena workgroup and choose Acknowledge.
  2. Run the following query:
    select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year='1980' and month='01' and day ='01'

The following screenshot shows that the query took approximately 32 seconds without partition index filtering enabled.

  1. Now we enable partition filtering on the table so Athena can use the partition index:
    ALTER TABLE blog_partition_index_crawlerdb.highly_partitioned_table
    SET TBLPROPERTIES ('partition_filtering.enabled' = 'true')

  2. Run the following query again and note the runtime:
    select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year='1980' and month='01' and day ='01'

The following screenshot shows that the query took only 700 milliseconds, which is much faster with partition index filtering enabled.

Clean up

To avoid unwanted charges to your AWS account, you can delete the AWS resources:

  1. Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
  2. Delete the CloudFormation stack you created.

Conclusion

In this post, we explained how to configure an AWS Glue crawler to create partition indexes and compared the query performance when accessing the crawled data with and without a partition index from Athena.

If no partition indexes are present on the table, AWS Glue loads all the partitions of the table and then filters the loaded partitions, which results in inefficient retrieval of metadata. Analytics services like Redshift Spectrum, Amazon EMR, and AWS Glue ETL Spark DataFrames can now utilize the indexes for fetching partitions, resulting in significant query performance improvements.

For more information on partition indexes and query performance across various analytical engines, refer to Improve Amazon Athena query performance using AWS Glue Data Catalog partition indexes and Improve query performance using AWS Glue partition indexes.

Special thanks to everyone who contributed to this crawler feature launch: Yuhang Chen, Kyle Duong, and Mita Gavade.


About the authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Build an AWS Lake Formation permissions inventory dashboard using AWS Glue and Amazon QuickSight

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/build-an-aws-lake-formation-permissions-inventory-dashboard-using-aws-glue-and-amazon-quicksight/

AWS Lake Formation is an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning (ML). Lake Formation provides a single place to define fine-grained access control on catalog resources. These permissions are granted to the principals by a data lake admin, and integrated engines like Amazon Athena, AWS Glue, Amazon EMR, and Amazon Redshift Spectrum enforce the access controls defined in Lake Formation. It also allows principals to securely share data catalog resources across multiple AWS accounts and AWS organizations through a centralized approach.

As organizations adopt Lake Formation for scaling their permissions, there is a steady increase in the access policies established and managed within the enterprise. However, it becomes more difficult to analyze and understand the permissions for auditing. Therefore, customers are looking for a simple way to collect, analyze, and visualize permissions data so that they can inspect and validate the policies. It also enables organizations to take actions that help them with compliance requirements.

This solution offers the ability to consolidate and create a central inventory of Lake Formation permissions that are registered in the given AWS account and Region. It provides a high-level view of various permissions that Lake Formation manages and aims at answering questions like:

  • Who has select access on a given table?
  • Which tables have delete permission granted?
  • Which databases or tables does a given principal have select access to?

In this post, we walk through how to set up and collect the permissions granted on resources in a given account using the Lake Formation API. AWS Glue makes it straightforward to set up and run jobs for collecting the permission data and creating an external table on the collected data. We use Amazon QuickSight to create a permissions dashboard using an Athena data source and dataset.

Overview of solution

The following diagram illustrates the architecture of this solution.

In this solution, we walk through the following tasks:

  1. Create an AWS Glue job to collect and store permissions data, and create external tables using Boto3.
  2. Verify the external tables created using Athena.
  3. Sign up for a QuickSight Enterprise account and enable Athena access.
  4. Create a dataset using an Athena data source.
  5. Use the datasets for analysis.
  6. Publish the analyses as a QuickSight dashboard.

The collected JSON data is flattened and written into an Amazon Simple Storage Service (Amazon S3) bucket as Parquet files partitioned by account ID, date, and resource type. After the data is stored in Amazon S3, external tables are created on them and filters are added for different types of resource permissions. These datasets can be imported into SPICE, an in-memory query engine that is part of QuickSight, or queried directly from QuickSight to create analyses. Later, you can publish these analyses as a dashboard and share it with other users.
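The provided job script takes care of this step; purely as an illustration, the following sketch shows one way flattened permission records could be written as partitioned Parquet and cataloged using the AWS SDK for pandas (awswrangler). The column names, S3 prefix, and partition keys here are assumptions, not necessarily the script's actual layout:

import awswrangler as wr
import pandas as pd

# Flattened permission records (illustrative columns).
df = pd.DataFrame([{
    "accountid": "111122223333",
    "collecteddate": "2022-11-01",
    "lftype": "TABLE",
    "principal": "arn:aws:iam::111122223333:role/analyst",
    "databasename": "salesdb",
    "tablename": "orders",
    "permission": "SELECT",
}])

# Write partitioned Parquet to Amazon S3 and register the table in the Data Catalog.
wr.s3.to_parquet(
    df=df,
    path="s3://<inventory-bucket>/lfpermissions-data/",  # assumed prefix
    dataset=True,
    partition_cols=["accountid", "collecteddate", "lftype"],
    database="<inventory-database>",
    table="lfpermissions",
    mode="append",
)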

Dashboards are created for the following use cases:

  • Database permissions
  • Table permissions
  • Principal permissions

Prerequisites

You should have the following prerequisites:

  • An S3 bucket to store the permissions inventory data
  • An AWS Glue database for permissions inventory metadata
  • An AWS Identity and Access Management (IAM) role for the AWS Glue job with access to the inventory AWS Glue database and S3 bucket and added as a data lake admin
  • A QuickSight account with access to Athena
  • An IAM role for QuickSight with access to the inventory AWS Glue database and S3 bucket

Set up and run the AWS Glue job

We create an AWS Glue job to collect Lake Formation permissions data for the given account and Region that is provided as job parameters, and the collected data is flattened before storage. Data is partitioned by account ID, date, and permissions type, and is stored as Parquet in an S3 bucket using Boto3. We create external tables on the data and add filters for different types of resource permissions.
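At the heart of the collection step is the Lake Formation ListPermissions API. The following is a minimal boto3 sketch of that call; the actual job script does more, including flattening the results and writing them to Amazon S3:

import boto3

lakeformation = boto3.client("lakeformation")

permissions = []
next_token = None
while True:
    kwargs = {"MaxResults": 100}
    if next_token:
        kwargs["NextToken"] = next_token
    response = lakeformation.list_permissions(**kwargs)
    permissions.extend(response["PrincipalResourcePermissions"])
    next_token = response.get("NextToken")
    if not next_token:
        break

# Each entry holds the principal, the resource, and the granted permissions.
for entry in permissions:
    print(entry["Principal"]["DataLakePrincipalIdentifier"],
          entry["Permissions"], list(entry["Resource"].keys()))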

To create the AWS Glue job, complete the following steps:

  1. Download the Python script file to your local machine.
  2. On the AWS Glue console, under Data Integration and ETL in the navigation pane, choose Jobs.
  3. Under Create job, select Python Shell script editor.
  4. For Options, select Upload and edit an existing script.
  5. For File upload, choose Choose file.
  6. Choose the downloaded file (lf-permissions-inventory.py).
  7. Choose Create.

GlueJob

  1. After the job is created, enter a name for the job (for this post, lf-inventory-builder) and choose Save.

Glue Job save

  1. Choose the Job details tab.
  2. For Name, enter a name for the job.
  3. For IAM Role, choose an IAM role that has access to the inventory S3 bucket and inventory schema and is registered as a data lake admin.
  4. For Type, choose Python Shell.
  5. For Python version, choose Python 3.9.

Glue Job Details

  1. You can leave the other property values at their defaults.
  2. Under Advanced properties, configure the following job parameters and values:
    1. catalog-id: the AWS account ID whose permissions data is collected.
    2. databasename: the AWS Glue database where the inventory-related schema objects are created.
    3. region: the Region where the job is configured and whose permissions data is collected.
    4. s3bucket: the S3 bucket where the collected permissions data is written.
    5. createtable: set to yes, which enables external table creation on the data.

Job Parameters

  1. Choose Save to save the job settings.

Glue Job Save

  1. Choose Run to start the job.

When the job is complete, the run status changes to Succeeded. You can view the log messages in Amazon CloudWatch Logs.

Job Run
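You can also start the job programmatically and pass the same parameters. The following is a minimal boto3 sketch; the account ID, database, and bucket values are placeholders, and the keys assume the conventional -- prefix for AWS Glue job arguments:

import boto3

glue = boto3.client("glue")

glue.start_job_run(
    JobName="lf-inventory-builder",
    Arguments={
        "--catalog-id": "111122223333",          # placeholder account ID
        "--databasename": "lf_inventory_db",      # placeholder database
        "--region": "us-east-1",
        "--s3bucket": "my-lf-inventory-bucket",   # placeholder bucket
        "--createtable": "yes",
    },
)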

Permissions data is collected and stored in the S3 bucket (under lfpermissions-data) that you provided in the job parameters.

S3 Structure

The following external tables are created on the permissions data and can be queried using Athena:

  • lfpermissions – A summary of resource permissions
  • lfpermissionswithgrant – A summary of grantable resource permissions

For both tables, the schema structure is the same and the lftype column indicates what type of permissions the row applies to.

Athena Table Schema

Verify the tables using Athena

You can use Athena to verify the data using the following queries; a boto3 sketch for running them programmatically follows the list.

For more information, refer to Running SQL queries using Amazon Athena.

  • List the database permissions:
Select * from lfpermissions where lftype='DATABASE'
  • List the table permissions:
Select * from lfpermissions where lftype='TABLE'
  • List the data lake permissions:
Select * from lfpermissions where lftype='DATA_LOCATION'
  • List the grantable database permissions:
Select * from lfpermissionswithgrant where lftype='DATABASE'
  • List the grantable table permissions:
Select * from lfpermissionswithgrant where lftype='TABLE'
  • List the grantable data lake permissions:
Select * from lfpermissionswithgrant where lftype='DATA_LOCATION'
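As noted above, you can also run these verification queries programmatically. The following is a minimal boto3 sketch; the database and workgroup names are placeholders, and the workgroup is assumed to have a query result location configured:

import time
import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT * FROM lfpermissions WHERE lftype = 'DATABASE' LIMIT 10",
    QueryExecutionContext={"Database": "lf_inventory_db"},  # placeholder database
    WorkGroup="primary",                                    # placeholder workgroup
)["QueryExecutionId"]

# Wait for the query to finish, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])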

As the next step, we create a QuickSight dashboard with three sheets, each focused on different sets of permissions (database, table, principal) to slice and dice the data.

Sign up for a QuickSight account

If you haven’t signed up for QuickSight, complete the following steps:

  1. Sign in to the AWS Management Console as Admin, search for QuickSight and choose Sign up for QuickSight.

QuickSight signup

  1. For Edition, select Enterprise.
  2. Choose Continue.
  3. For Authentication method, select Use IAM federated identities & QuickSight-managed users.
  4. Under QuickSight Region, choose the same Region as your inventory S3 bucket.
  5. Under Account info, enter a QuickSight account name and email address for notification.

QuickSight Form

  1. In the Quick access to AWS services section, for IAM Role, select Use QuickSight-managed role (default).
  2. Allow access to IAM, Athena, and Amazon S3.
  3. Specify the S3 bucket that contains the permissions data.
  4. Choose Finish to complete the signup process.

QuickSight configuration

Note: If the inventory bucket and database are managed by Lake Formation, grant database and table access to the created QuickSight IAM role. For instructions, refer to Granting and revoking permissions on Data Catalog resources.

Configure your dataset in QuickSight

QuickSight is configured with an Athena data source in the same Region as the S3 bucket. To set up your dataset, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.

Quicksight DataSet

  1. Choose Athena as your data source.

QuickSight Datasource

  1. Enter LF_DASHBOARD_DS as the name of your data source.
  2. Choose Create data source.
  3. For Catalog, leave it as AwsDataCatalog.
  4. For Database, choose the database name provided as a parameter to the job.
  5. For Tables, select lfpermissions.
  6. Choose Select.

QuickSight Catalog Info

  1. Select Directly query your data and choose Visualize to take you to the analysis.

Quicksight data mode

Create analyses

We create three sheets for our dashboard to view different levels of permissions.

Sheet 1: Database permission view

To view database permissions, complete the following steps:

  1. On the QuickSight console, choose the plus sign to create a new sheet.
  2. Choose Add, then choose Add title.

QuickSight Title

  1. Name the sheet Database Permissions.
  2. On the Add menu, choose Add parameter.
  3. Enter a name for the parameter.
  4. Leave the other values as default and choose Create.
  5. Repeat steps 2-4 to add the following parameters:
    • catalogid
    • databasename
    • permission
    • tablename
  6. Choose Insights in the navigation pane, then choose Add control.

QuickSight Control

  1. Add a control for each parameter:
    1. For each parameter, for Style, choose List, and for Values, select Link to a dataset field.
      QuickSight Dependency
    2. Provide additional information for each parameter according to the following table.
Parameter | Display Name | Dataset | Field
catalogid | AccountID | lfpermissions | catalog_id
databasename | DatabaseName | lfpermissions | databasename
permission | Permission | lfpermissions | permission
  1. To add a control dependency, for the Database control, choose the options menu and choose Edit.

QuickSight Dependency

  1. Under Format control, choose Control options.
  2. For the relevant values, choose AccountID, and choose Update.
  3. Similarly, for the Permission control, choose Control options.
  4. For the relevant values, choose AccountID, and choose Update.

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and databasename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.

QuickSight Database View1

  1. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  2. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  3. Add a filter on databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  4. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  5. Choose Actions in the navigation pane.
  6. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, choose Select visuals and select the principal visual.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  4. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Database and Permission drop-down menus are populated based on the relevant attributes and change dynamically.

QuickSight Database View2

Sheet 2: Table permission view

Now that we have created the database permissions sheet, we can add a table permissions sheet.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Table Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for each parameter:
    1. For each parameter, for Style, choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for each parameter according to the following table.
Parameter | Display Name | Dataset | Field
catalogid | AccountID | lfpermissions | catalog_id
databasename | DatabaseName | lfpermissions | databasename
permission | Permission | lfpermissions | permission
tablename | TableName | lfpermissions | tablename

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id, databasename, and tablename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  6. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  8. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  9. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  10. Choose Actions in the navigation pane.
  11. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, choose Select visuals and select the principal visual.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  4. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  7. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Databasename, Tablename, and Permission drop-down menus are populated based on the relevant attributes.

QuickSight Table Permissions

Sheet 3: Principal permission view

Now we add a third sheet for principal permissions.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Principal Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for the catalogid parameter:
    1. For Style, choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for the parameter according to the following table.
Parameter | Display Name | Dataset | Field
catalogid | AccountID | lfpermissions | catalog_id

We create four visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and principal into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop databasename into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  6. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Choose Actions in the navigation pane.
  8. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, choose Select visuals and select the databasename visual.
  9. For the second visual, choose Visualize and choose table as the visual type.
  10. Drag and drop databasename into Group by.
  11. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  12. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  13. For the third visual, choose Visualize and choose pivot table as the visual type.
  14. Drag and drop catalog_id and principal into Rows.
  15. Drag and drop permission into Column.
  16. Drag and drop tablename into Values and change the aggregation to Count distinct.
  17. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  18. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  19. Choose Actions in the navigation pane.
  20. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, choose Select visuals and select the tablename visual.
  21. For the final visual, choose Visualize and choose table as the visual type.
  22. Drag and drop tablename into Group by.
  23. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  24. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.

The following screenshot shows our sheet.

QuickSight Principal View

Create a dashboard

Now that the analysis is ready, you can publish it as a dashboard and share it with other users. For instructions, refer to the tutorial Create an Amazon QuickSight dashboard.

Clean up

To clean up the resources created in this post, complete the following steps:

  1. Delete the AWS Glue job lf-inventory-builder.
  2. Delete the data stored under the bucket provided as the value of the s3bucket job parameter.
  3. Drop the external tables created under the schema provided as the value of the databasename job parameter.
  4. If you signed up for QuickSight to follow along with this post, you can delete the account.
  5. For an existing QuickSight account, delete the following resources:
    1. lfpermissions dataset
    2. lfpermissions analysis
    3. lfpermissions dashboard

Conclusion

In this post, we provided the design and implementation steps for a solution to collect Lake Formation permissions in a given Region of an account and consolidate them for analysis. We also walked through the steps to create a dashboard using Amazon QuickSight. You can utilize other QuickSight visuals to create more sophisticated dashboards based on your requirements.

You can also expand this solution to consolidate permissions for a multi-account setup. You can use a shared bucket across organizations and accounts and configure an AWS Glue job in each account or organization to write their permission data. With this solution, you can maintain a unified dashboard view of all the Lake Formation permissions within your organization, thereby providing a central audit mechanism to comply with business requirements.

Thanks for reading this post! If you have any comments or questions, please don’t hesitate to leave them in the comments section.


About the Author

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/centrally-manage-access-and-permissions-for-amazon-redshift-data-sharing-with-aws-lake-formation/

Today’s global, data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift data warehouse with another Amazon Redshift data warehouse within the same AWS account, across accounts, and across Regions, without needing to copy or move data from one cluster to another.

Some customers share their data with 50–100 data warehouses in different accounts and do a lot of cross-sharing, making it difficult to track who is accessing what data. They have to navigate to an individual account’s Amazon Redshift console to retrieve the access information. Also, many customers have their data lake on Amazon Simple Storage Service (Amazon S3), which is shared within and across various business units. As the organization grows and democratizes the data, administrators want the ability to manage the datashare centrally for governance and auditing, and to enforce fine-grained access control.

Working backward from these customer asks, we are announcing the preview of the following new feature: Amazon Redshift data sharing integration with AWS Lake Formation, which enables Amazon Redshift customers to centrally manage access to their Amazon Redshift datashares using Lake Formation.

Lake Formation has been a popular choice for centrally governing data lakes backed by Amazon S3. Now, with Lake Formation support for Amazon Redshift data sharing, it opens up new design patterns, and broadens governance and security posture across data warehouses. With this integration, you can use Lake Formation to define fine-grained access control on tables and views being shared with Amazon Redshift data sharing for federated AWS Identity and Access Management (IAM) users and IAM roles.

Customers are using the data mesh approach, which provides a mechanism to share data across business units. Customers are also using a modern data architecture to share data from data lake stores and Amazon Redshift purpose-built data stores across business units. Lake Formation provides the ability to enforce data governance within and across business units, which enables secure data access and sharing, easy data discovery, and centralized audit for data access.

United Airlines is in the business of connecting people and uniting the world.

“As a data-driven enterprise, United is trying to create a unified data and analytics experience for our analytics community that will innovate and build modern data-driven applications. We believe we can achieve this by building a purpose-built data mesh architecture using a variety of AWS services like Athena, Aurora, Amazon Redshift, and Lake Formation to simplify management and governance around granular data access and collaboration.”

-Ashok Srinivas, Director of ML Engineering and Sarang Bapat, Director of Data Engineering.

In this post, we show how to centrally manage access and permissions for Amazon Redshift data sharing with Lake Formation.

Solution overview

In this solution, we demonstrate how integration of Amazon Redshift data sharing with Lake Formation for data governance can help you build data domains, and how you can use the data mesh approach to bring data domains together to enable data sharing and federation across business units. The following diagram illustrates our solution architecture.

solution architecture

The data mesh is a decentralized, domain-oriented architecture that emphasizes separating data producers from data consumers via a centralized, federated Data Catalog. Typically, the producers and consumers run within their own account. The details of these data mesh characteristics are as follows:

  • Data producers – Data producers own their data products and are responsible for building their data, maintaining its accuracy, and keeping their data product up to date. They determine what datasets can be published for consumption and share their datasets by registering them with the centralized data catalog in a central governance account. You might have a producer steward or administrator persona for managing the data products with the central governance steward or administrators team.
  • Central governance account – Lake Formation enables fine-grained access management on the shared dataset. The centralized Data Catalog offers consumers the ability to quickly find shared datasets, allows administrators to centrally manage access permissions on shared datasets, and provides security teams the ability to audit and track data product usage across business units.
  • Data consumers – The data consumer obtains access to shared resources from the central governance account. These resources are available inside the consumer’s AWS Glue Data Catalog, allowing fine-grained access on the database and table that can be managed by the consumer’s data stewards and administrators.

The following steps provide an overview of how Amazon Redshift data sharing can be governed and managed by Lake Formation in the central governance pattern of a data mesh architecture:

  1. In the producer account, data objects are created and maintained in the Amazon Redshift producer cluster. A data warehouse admin creates the Amazon Redshift datashare and adds datasets (tables, views) to the share.
  2. The data warehouse admin grants and authorizes access on the datashare to the central governance account’s Data Catalog.
  3. In the central governance account, the data lake admin accepts the datashare and creates the AWS Glue database that points to the Amazon Redshift datashare so that Lake Formation can manage it.
  4. The data lake admin shares the AWS Glue database and tables to the consumer account using Lake Formation cross-account sharing.
  5. In the consumer account, the data lake admin accepts the resource share invitation via AWS Resource Access Manager (AWS RAM) and can view the database listed in the account.
  6. The data lake admin defines the fine-grained access control and grants permissions on databases and tables to IAM users (for this post, consumer1 and consumer2) in the account.
  7. In the Amazon Redshift cluster, the data warehouse admin creates an Amazon Redshift database that points to the Glue database and authorizes usage on the created Amazon Redshift database to the IAM users.
  8. The data analyst as an IAM user can now use their preferred tools like the Amazon Redshift query editor to access the dataset based on the Lake Formation fine-grained permissions.

We use the following account setup for our example in this post:

  • Producer account – 123456789012
  • Central account – 112233445566
  • Consumer account – 665544332211

Prerequisites

Create the Amazon Redshift data share and add datasets

In the data producer account, create an Amazon Redshift cluster using the RA3 node type with encryption enabled. Complete the following steps:

  1. On the Amazon Redshift console, create a cluster subnet group.

For more information, refer to Managing cluster subnet groups using the console.

  1. Choose Create cluster.
  2. For Cluster identifier, provide the cluster name of your choice.
  3. For Preview track, choose preview_2022.

  1. For Node type, choose one of the RA3 node types.

This feature is only supported on the RA3 node type.

  1. For Number of nodes, enter the number of nodes that you need for your cluster.
  2. Under Database configurations, choose the admin user name and admin user password.
  3. Under Cluster permissions, you can select the IAM role and set it as the default.

For more information about the default IAM role, refer to Creating an IAM role as default for Amazon Redshift.

cluster permissions

  1. Turn off the Use defaults option next to Additional configurations to modify the default settings.
  2. Under Network and security, specify the following:
    1. For Virtual private cloud (VPC), choose the VPC you would like to deploy the cluster in.
    2. For VPC security groups, either leave as default or add the security groups of your choice.
    3. For Cluster subnet group, choose the cluster subnet group you created.

additional configurations

  1. Under Database configuration, in the Encryption section, select Use AWS Key Management Service (AWS KMS) or Use a hardware security module (HSM).

Encryption is disabled by default.

  1. For Choose an AWS KMS key, you can either choose an existing AWS Key Management Service (AWS KMS) key, or choose Create an AWS KMS key to create a new key.

For more information, refer to Creating keys.

database configurations

  1. Choose Create cluster.
  2. For this post, create tables and load data into the producer Amazon Redshift cluster using the following script.

Authorize the datashare

Install or update the latest AWS Command Line Interface (AWS CLI) version to run the AWS CLI to authorize the datashare. For instructions, refer to Installing or updating the latest version of the AWS CLI.

Set up Lake Formation permissions

To use the AWS Glue Data Catalog in Lake Formation, complete the following steps in the central governance account to update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control:

  1. Sign in to the Lake Formation console as admin.
  2. In the navigation pane, under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases.
  4. Deselect Use only IAM access control for new tables in new databases.
  5. Choose Version 2 for Cross account version settings.
  6. Choose Save.

data catalog settings

Set up an IAM user as a data lake administrator

If you're using an existing data lake administrator user or role, add the following managed policies (if not already attached) and skip the remaining setup steps:

AWSGlueServiceRole
AmazonRedshiftFullAccess

Otherwise, to set up an IAM user as a data lake administrator, complete the following steps:

  1. On the IAM console, choose Users in the navigation pane.
  2. Select the IAM user who you want to designate as the data lake administrator.
  3. Choose Add an inline policy on the Permissions tab.
  4. Replace <AccountID> with your own account ID and add the following policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:AWSServiceName": "lakeformation.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": ["iam:PutRolePolicy"],
            "Resource": "arn:aws:iam::<AccountID>:role/aws-service-role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ram:AcceptResourceShareInvitation",
                "ram:RejectResourceShareInvitation",
                "ec2:DescribeAvailabilityZones",
                "ram:EnableSharingWithAwsOrganization"
            ],
            "Resource": "*"
        }
    ]
}
  1. Provide a policy name.
  2. Review and save your settings.
  3. Choose Add permissions, and choose Attach existing policies directly.
  4. Add the following policies:
    1. AWSLakeFormationCrossAccountManager
    2. AWSGlueConsoleFullAccess
    3. AWSGlueServiceRole
    4. AWSLakeFormationDataAdmin
    5. AWSCloudShellFullAccess
    6. AmazonRedshiftFullAccess
  5. Choose Next: Review and add permissions.

Data consumer account setup

In the consumer account, follow the steps mentioned previously in the central governance account to set up Lake Formation and a data lake administrator.

  1. In the data consumer account, create an Amazon Redshift cluster using the RA3 node type with encryption (refer to the steps demonstrated to create an Amazon Redshift cluster in the producer account).
  2. Choose Launch stack to deploy an AWS CloudFormation template to create two IAM users with policies.

launch stack

The stack creates the following users under the data analyst persona:

  • consumer1
  • consumer2
  1. After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
  2. Capture the ConsoleIAMLoginURL and LFUsersCredentials values.

createiamusers

  1. Choose the LFUsersCredentials value to navigate to the AWS Secrets Manager console.
  2. In the Secret value section, choose Retrieve secret value.

secret value

  1. Capture the secret value for the password.

Both consumer1 and consumer2 need to use this same password to log in to the AWS Management Console.

secret value

Configure an Amazon Redshift datashare using Lake Formation

Producer account

Create a datashare using the console

Complete the following steps to create an Amazon Redshift datashare in the data producer account and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, choose the cluster to create the datashare.
  2. On the cluster details page, navigate to the Datashares tab.
  3. Under Datashares created in my namespace, choose Connect to database.

connect to database

  1. Choose Create datashare.

create datashare

  1. For Datashare type, choose Datashare.
  2. For Datashare name, enter the name (for this post, demotahoeds).
  3. For Database name, choose the database from where to add datashare objects (for this post, dev).
  4. For Publicly accessible, choose Turn off (or choose Turn on to share the datashare with clusters that are publicly accessible).

datashare information

  1. Under DataShare objects, choose Add to add the schema to the datashare (in this post, the public schema).
  2. Under Tables and views, choose Add to add the tables and views to the datashare (for this post, we add the table customer and view customer_view).

datashare objects

  1. Under Data consumers, choose Publish to AWS Data Catalog.
  2. For Publish to the following accounts, choose Other AWS accounts.
  3. Provide the AWS account ID of the consumer account. For this post, we provide the AWS account ID of the Lake Formation central governance account.
  4. To share within the same account, choose Local account.
  5. Choose Create datashare.

data consumers

  1. After the datashare is created, you can verify by going back to the Datashares tab and entering the datashare name in the search bar under Datashares created in my namespace.
  2. Choose the datashare name to view its details.
  3. Under Data consumers, you will see the consumer status of the consumer data catalog account as Pending Authorization.

data consumers

  1. Select the checkbox next to the consumer data catalog, which enables the Authorize option.

authorize

  1. Choose Authorize to grant datashare access to the consumer account data catalog; the consumer status changes to Authorized.

authorized

Create a datashare using a SQL command

Complete the following steps to create a datashare in the data producer account and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor V2.
  2. Choose (right-click) the cluster name and choose Edit connection or Create Connection.
  3. For Authentication, choose Temporary credentials.

Refer to Connecting to an Amazon Redshift database to learn more about the various authentication methods.

  1. For Database, enter a database name (for this post, dev).
  2. For Database user, enter the user authorized to access the database (for this post, awsuser).
  3. Choose Save to connect to the database.

Connecting to an Amazon Redshift database

  1. Run the following SQL commands to create the datashare and add the data objects to be shared:
create datashare demotahoeds;
ALTER DATASHARE demotahoeds ADD SCHEMA PUBLIC;
ALTER DATASHARE demotahoeds ADD TABLE customer;
ALTER DATASHARE demotahoeds ADD TABLE customer_view;
  1. Run the following SQL command to share the producer datashare to the central governance account:
GRANT USAGE ON DATASHARE demotahoeds TO ACCOUNT '<central-aws-account-id>' VIA DATA CATALOG;

Run the following SQL command

  1. You can verify the datashare created and objects shared by running the following SQL command:
DESC DATASHARE demotahoeds

DESC DATASHARE demotahoeds

  1. Run the following command using the AWS CLI to authorize the datashare to the central data catalog so that Lake Formation can manage them:
aws redshift authorize-data-share \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-identifier DataCatalog/<central-aws-account-id>

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [{
        "ConsumerIdentifier": "DataCatalog/XXXXXXXXXXXX",
        "Status": "AUTHORIZED",
        "CreatedDate": "2022-11-09T21:10:30.507000+00:00",
        "StatusChangeDate": "2022-11-09T21:10:50.932000+00:00"
    }]
}

You can verify the datashare status on the console by following the steps outlined in the previous section.

Central catalog account

The data lake admin accepts and registers the datashare with Lake Formation in the central governance account and creates a database for it. Complete the following steps:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Under Data catalog in the navigation pane, choose Data sharing and view the Amazon Redshift datashare invitations on the Configuration tab.
  4. Select the datashare and choose Review Invitation.

AWS Lake Formation data sharing

A window pops up with the details of the invitation.

  1. Choose Accept to register the Amazon Redshift datashare to the AWS Glue Data Catalog.

accept reject invitation

  1. Provide a name for the AWS Glue database and choose Skip to Review and create.

Skip to Review and create

  1. Review the content and choose Create database.

create database

After the AWS Glue database is created on the Amazon Redshift datashare, you can view it under Shared Databases.

Shared Databases.

You can also use the AWS CLI to register the datashare and create the database. Use the following commands:

  1. Describe the Amazon Redshift datashare that is shared with the central account:
aws redshift describe-data-shares
  1. Accept and associate the Amazon Redshift datashare to Data Catalog:
aws redshift associate-data-share-consumer \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-arn arn:aws:glue:us-east-1:<central-aws-account-id>:catalog

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:123456789012:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:123456789012:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [
        {
            "ConsumerIdentifier": "arn:aws:glue:us-east-1:112233445566:catalog",
            "Status": "ACTIVE",
            "ConsumerRegion": "us-east-1",
            "CreatedDate": "2022-11-09T23:25:22.378000+00:00",
            "StatusChangeDate": "2022-11-09T23:25:22.378000+00:00"
        }
    ]
}
  1. Register the Amazon Redshift datashare in Lake Formation:
aws lakeformation register-resource \
--resource-arn arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds
  1. Create the AWS Glue database that points to the accepted Amazon Redshift datashare:
aws glue create-database --region <central-catalog-region> --cli-input-json '{
    "CatalogId": "<central-aws-account-id>",
    "DatabaseInput": {
        "Name": "demotahoedb",
        "FederatedDatabase": {
            "Identifier": "arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds",
            "ConnectionName": "aws:redshift"
        }
    }
}'

Now the data lake administrator of the central governance account can view and share access on both the database and tables to the data consumer account using the Lake Formation cross-account sharing feature.

Grant datashare access to the data consumer

To grant the data consumer account permissions on the shared AWS Glue database, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the data consumer account ID (for this post, 665544332211).
  5. Under LF-Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. Select Describe for both Database permissions and Grantable permissions.
  8. Choose Grant to apply the permissions.

grant data permissions

To grant the data consumer account permissions on tables, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the consumer account (for this post, we use 665544332211).
  5. Under LF-Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. For Tables, choose All tables.
  8. Select Describe and Select for both Table permissions and Grantable permissions.
  9. Choose Grant to apply the changes.

grant the data consumer account permissions on tables
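Equivalently, these cross-account grants can be made through the Lake Formation API. The following is a minimal boto3 sketch run in the central governance account, using the example account IDs from this post:

import boto3

lakeformation = boto3.client("lakeformation")

consumer_account = "665544332211"
central_account = "112233445566"

# Grant DESCRIBE on the shared database to the consumer account.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": consumer_account},
    Resource={"Database": {"CatalogId": central_account, "Name": "demotahoedb"}},
    Permissions=["DESCRIBE"],
    PermissionsWithGrantOption=["DESCRIBE"],
)

# Grant DESCRIBE and SELECT on all tables in the database to the consumer account.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": consumer_account},
    Resource={"Table": {
        "CatalogId": central_account,
        "DatabaseName": "demotahoedb",
        "TableWildcard": {},
    }},
    Permissions=["DESCRIBE", "SELECT"],
    PermissionsWithGrantOption=["DESCRIBE", "SELECT"],
)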

Consumer account

The consumer admin will receive the shared resources from the central governance account and delegate access to other users in the consumer account as shown in the following table.

IAM User | Object Access | Object Type | Access Level
consumer1 | public.customer | Table | All
consumer2 | public.customer_view | View | Specific columns: c_customer_id, c_birth_country, cd_gender, cd_marital_status, cd_education_status

In the data consumer account, follow these steps to accept the resources shared with the account:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Sign in to the AWS RAM console.
  4. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations. You will receive 2 invitations.

Resource shares to view the pending invitations

  1. Choose the pending invitations and accept the resource share.

Choose the pending invitation and accept the resource share

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases to view the cross-account shared database.

choose Databases to view the cross-account shared database

Grant access to the data analyst and IAM users using Lake Formation

Now the data lake admin in the data consumer account can delegate permissions on the shared database and tables to users in the consumer account.

Grant database permissions to consumer1 and consumer2

To grant the IAM users consumer1 and consumer2 database permissions, follow these steps:

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.

choose Grant database

  1. Under Principals, select IAM users and roles.
  2. Choose the IAM users consumer1 and consumer2.
  3. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  4. Select Describe for Database permissions.
  5. Choose Grant to apply the permissions.

Choose Grant to apply the permissions

Grant table permissions to consumer1

To grant the IAM user consumer1 permissions on table public.customer, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose IAM user consumer1.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer.
  7. Select Describe and Select for Table permissions.
  8. Choose Grant to apply the permissions.

Grant table permissions to consumer1

Grant column permissions to consumer2

To grant the IAM user consumer2 permissions on non-sensitive columns in public.customer_view, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM user consumer2.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer_view.

Grant column permissions to consumer2

  1. Select Select for Table permissions.
  2. Under Data Permissions, select Column-based access.
  3. Select Include columns and choose the non-sensitive columns (c_customer_id, c_birth_country, cd_gender, cd_marital_status, and cd_education_status).
  4. Choose Grant to apply the permissions.

table permissions

Consume the datashare from the data consumer account in the Amazon Redshift cluster

In the Amazon Redshift consumer data warehouse, log in as the admin user using Query Editor V2 and complete the following steps:

  1. Create the Amazon Redshift database from the shared catalog database using the following SQL command:
CREATE DATABASE demotahoedb FROM ARN 'arn:aws:glue:<producer-region>:<producer-aws-account-id>:database/demotahoedb' WITH DATA CATALOG SCHEMA demotahoedb ;
  1. Run the following SQL commands to create and grant usage on the Amazon Redshift database to the IAM users consumer1 and consumer2:
CREATE USER IAM:consumer1 password disable;
CREATE USER IAM:consumer2  password disable;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer1;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer2;

In order to use a federated identity to enforce Lake Formation permissions, follow the next steps to configure Query Editor v2.

  1. Choose the settings icon in the bottom left corner of the Query Editor v2, then choose Account settings.

identity to enforce Lake Formation permissions

  1. Under Connection settings, select Authenticate with IAM credentials.
  2. Choose Save.

Authenticate with IAM credentials

Query the shared datasets as a consumer user

To validate that the IAM user consumer1 has datashare access from Amazon Redshift, perform the following steps:

  1. Sign in to the console as IAM user consumer1.
  2. On the Amazon Redshift console, choose Query Editor V2 in the navigation pane.
  3. To connect to the consumer cluster, choose the consumer cluster in the tree-view pane.
  4. When prompted, for Authentication, select Temporary credentials using your IAM identity.
  5. For Database, enter the database name (for this post, dev).
  6. The user name will be mapped to your current IAM identity (for this post, consumer1).
  7. Choose Save.

edit connection for redshift

  1. Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:
select current_user;

  1. To find the federated databases created on the consumer account, run the following SQL command:
SHOW DATABASES FROM DATA CATALOG [ACCOUNT '<id1>', '<id2>'] [LIKE 'expression'];

federated databases created on the consumer account

  1. To validate permissions for consumer1, run the following SQL command:
select * from demotahoedb.public.customer limit 10;

As shown in the following screenshot, consumer1 is able to successfully access the datashare customer object.

Now let's validate that consumer2 doesn't have access to the datashare table public.customer on the same consumer cluster.

  1. Log out of the console and sign in as IAM user consumer2.
  2. Follow the same steps to connect to the database using the query editor.
  3. Once connected, run the same query:
select * from demotahoedb.public.customer limit 10;

The user consumer2 should get a permission denied error, as in the following screenshot.

should get a permission denied error

Let's validate the column-level access permissions of consumer2 on the public.customer_view view.

  1. Connect to Query Editor v2 as consumer2 and run the following SQL command:
select c_customer_id,c_birth_country,cd_gender,cd_marital_status from demotahoedb.public.customer_view limit 10;

In the following screenshot, you can see consumer2 is only able to access columns as granted by Lake Formation.

access columns as granted by Lake Formation

Conclusion

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. Using Amazon Redshift data sharing with Lake Formation for data governance helps build the data mesh architecture, enabling data sharing and federation across business units with fine-grained access control.

Special thanks to everyone who contributed to the launch of Amazon Redshift data sharing with AWS Lake Formation:

Debu Panda, Michael Chess, Vlad Ponomarenko, Ting Yan, Erol Murtezaoglu, Sharda Khubchandani, Rui Bi



About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Harshida Patel is an Analytics Specialist Principal Solutions Architect with AWS.

Ranjan Burman is an Analytics Specialist Solutions Architect with AWS.

Vikram Sahadevan is a Senior Resident Architect on the AWS Data Lab team. He enjoys efforts that focus on providing prescriptive architectural guidance, sharing best practices, and removing technical roadblocks with joint engineering engagements between customers and AWS technical resources that accelerate data, analytics, artificial intelligence, and machine learning initiatives.

Steve Mitchell is a Senior Solution Architect with a passion for analytics and data mesh. He enjoys working closely with customers as they transition to a modern data architecture.