All posts by Srividya Parthasarathy

Access Databricks Unity Catalog data using catalog federation in the AWS Glue Data Catalog

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/access-databricks-unity-catalog-data-using-catalog-federation-in-the-aws-glue-data-catalog/

AWS has launched the catalog federation capability, enabling direct access to Apache Iceberg tables managed in Databricks Unity Catalog through the AWS Glue Data Catalog. With this integration, you can discover and query Unity Catalog data in Iceberg format using an Iceberg REST API endpoint, while maintaining granular access controls through AWS Lake Formation. This approach significantly reduces operational overhead for managing catalog synchronization and associated costs by alleviating the need to replicate or duplicate datasets between platforms.

In this post, we demonstrate how to set up catalog federation between the Glue Data Catalog and Databricks Unity Catalog, enabling data querying using AWS analytics services.

Use cases and key benefits

This federation capability is particularly valuable if you run multiple data platforms, because you can maintain your existing Iceberg catalog investments while using AWS analytics services. Catalog federation supports read operations and provides the following benefits:

  • Interoperability – You can enable interoperability across different data platforms and tools through Iceberg REST APIs while preserving the value of your established technology investments.
  • Cross-platform analytics – You can connect AWS analytics tools (Amazon Athena, Amazon Redshift, Apache Spark) to query Iceberg and UniForm tables stored in Databricks Unity Catalog. It supports Databricks on AWS integration with the AWS Glue Iceberg REST Catalog for metadata retrieval, while using Lake Formation for permission management.
  • Metadata management – The solution avoids manual catalog synchronization by making Databricks Unity Catalog databases and tables discoverable within the Data Catalog. You can implement unified governance through Lake Formation for fine-grained access control across federated catalog resources.

Solution overview

The solution uses catalog federation in the Data Catalog to integrate with Databricks Unity Catalog. The federated catalog created in AWS Glue mirrors the catalog objects in Databricks Unity Catalog and supports OAuth-based authentication. The solution is represented in the following diagram.

The integration involves three high-level steps:

  1. Set up an integration principal in Databricks Unity Catalog and provide required read access on catalog resources to this principal. Enable OAuth-based authentication for the integration principal.
  2. Set up catalog federation to Databricks Unity Catalog in the Glue Data Catalog:
    1. Create a federated catalog in the Data Catalog using an AWS Glue connection.
    2. Create an AWS Glue connection that uses the credentials of the integration principal (in Step 1) to connect to Databricks Unity Catalog. Configure an AWS Identity and Access Management (IAM) role with permission to Amazon Simple Storage Service (Amazon S3) locations where the Iceberg table data resides. In a cross-account scenario, make sure the bucket policy grants required access to this IAM role.
  3. Discover Iceberg tables in federated catalogs using Lake Formation or AWS Glue APIs. During query operations, Lake Formation manages fine-grained permissions on federated resources and credential vending for access to the underlying data.

In the following sections, we walk through the steps to integrate the Glue Data Catalog with Databricks Unity Catalog on AWS.

Prerequisites

To follow along with the solution presented in this post, you must have the following prerequisites:

  • Databricks Workspace (on AWS) with Databricks Unity Catalog configured.
  • An IAM role that is a Lake Formation data lake administrator in your AWS account. A data lake administrator is an IAM principal that can register S3 locations, access the Data Catalog, grant Lake Formation permissions to other users, and view AWS CloudTrail logs. See Create a data lake administrator for more information.

Configure Databricks Unity Catalog for external access

Catalog federation to a Databricks Unity Catalog uses the OAuth2 credentials of a Databricks service principal configured in the workspace admin settings. This authentication mechanism allows the Data Catalog to access the metadata of various objects (such as catalogs, databases, and tables) within Databricks Unity Catalog, based on the privileges associated with the service principal. For proper functionality, grant the service principal with the necessary permissions (read permission on catalog, schema, and tables) to read the metadata of these objects and allow access from external engines.

Next, catalog federation enables discovery and query of Iceberg tables in your Databricks Unity Catalog. For reading delta tables, enable UniForm on a Delta Lake table in Databricks to generate Iceberg metadata. For more information, refer to Read Delta tables with Iceberg clients.

Follow the Databricks tutorial and documentation to create the service principal and associated privileges in your Databricks workspace. For this post, we use a service principal named integrationprincipal that is configured with required permissions (SELECT, USE CATALOG, USE SCHEMA) on Databricks Unity Catalog objects and will be used for authentication to catalog instance.

Catalog federation supports OAuth2 authentication, so enable OAuth for the service principal and note down the client_id and client_secret for later use.

Set up Data Catalog federation with Databricks Unity Catalog

Now that you have service principal access for Databricks Unity Catalog, you can set up catalog federation in the Data Catalog. To do so, you create an AWS Secrets Manager secret and create an IAM role for catalog federation.

Create secret

Complete the following steps to create a secret:

  1. Sign in to the AWS Management Console using an IAM role with access to Secrets Manager.
  2. On the Secrets Manager console, choose Store a new secret and Other type of secret.
  3. Set the key-value pair:
    1. Key: USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET
    2. Value: The client secret noted earlier
  4. Choose Next.
  5. Enter a name for your secret (for this post, we use dbx).
  6. Choose Store.

Create IAM role for catalog federation

As the catalog owner of a federated catalog in the Data Catalog, you can use Lake Formation to implement comprehensive access controls, including table filters, column filters, and row filters, as well as tag-based access for your data teams.

Lake Formation requires an IAM role with permissions to access the underlying S3 locations of your external catalog.

In this step, you create an IAM role that enables the AWS Glue connection to access Secrets Manager, optional virtual private cloud (VPC) configurations, and Lake Formation to manage credential vending for the S3 bucket and prefix:

  • Secrets Manager access – The AWS Glue connection requires permissions to retrieve secret values from Secrets Manager for OAuth tokens stored for your Databricks Unity service connection.
  • VPC access (optional) – When using VPC endpoints to restrict connectivity to your Databricks Unity account, the AWS Glue connection needs permissions to describe and utilize VPC network interfaces. This configuration provides secure, controlled access to both your stored credentials and network resources while maintaining proper isolation through VPC endpoints.
  • S3 bucket and AWS KMS key permission – The AWS Glue connection requires Amazon S3 permissions to read certificates if used in the connection setup. Additionally, Lake Formation requires read permissions on the bucket and prefix where the remote catalog table data resides. If the data is encrypted using an AWS Key Management Service (AWS KMS) key, additional AWS KMS permissions are required.

Complete the following steps:

  1. Create an IAM role called LFDataAccessRole with the following policies:
    {
     "Version": "2012-10-17",
         "Statement": [
             {
                 "Effect": "Allow",
                 "Action": [
                     "secretsmanager:GetSecretValue",
                     "secretsmanager:DescribeSecret"
                 ],
                 "Resource": [
                     "<secrets manager ARN>"
                 ]
             },
             {
                 "Effect": "Allow",
                 "Action": [
                     "ec2:CreateNetworkInterface",
                     "ec2:DeleteNetworkInterface",
                     "ec2:DescribeNetworkInterfaces"
                 ],
                 "Resource": "*",
                 "Condition": {
                     "ArnEquals": {
                         "ec2:Vpc": "arn:aws:ec2:region:account-id:vpc/<vpc-id>", 
                         "ec2:Subnet": [ 
                             "arn:aws:ec2:region:account-id:subnet/<subnet-id>" 
                         ]
                     }
                 }
             },
             {
                # Required when using custom cert to sign requests.
                 "Effect": "Allow",
                 "Action": [
                     "s3:GetObject"
                 ],
                 "Resource": [
                     "arn:aws:s3
    :::<bucketname>/<certpath>"
                 ]
             },
             { # Required when using customer managed encryption key for s3 
                 "Effect": "Allow",
                 "Action": [
                     "kms:decrypt",
                     "kms:encrypt"
                 ],
                 "Resource": [
                     "<kmsKey>"
                 ]
             }
         ]
     }

  2. Configure the role with the following trust policy:
    {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect":  "Allow",
                  "Principal": {
                       "Service": ["glue.amazonaws.com","lakeformation.amazonaws.com"]
                  },
                  "Action":  "sts:AssumeRole"
              }
          ]
      }

Create federated catalog in Data Catalog

AWS Glue supports the DATABRICKSICEBERGRESTCATALOG connection type for connecting the Data Catalog with managed Databricks Unity Catalog. This AWS Glue connector supports OAuth2 authentication for discovering metadata in Databricks Unity Catalog.

Complete the following steps to create the federated catalog:

  1. Sign in to the console as a data lake admin.
  2. On the Lake Formation console, choose Catalogs in the navigation pane.
  3. Choose Create catalog.
  4. For Name, enter a name for your catalog.
  5. For Catalog name in Databricks, enter the name of a catalog existing in Databricks Unity Catalog.
  6. For Connection name, enter a name for the AWS Glue connection.
  7. For Workspace URL, enter the Unity Iceberg REST API URL (in format https://<workspace-url>/cloud.databricks.com).
  8. For Authentication, provide the following information:
    1. For Authentication type, choose OAuth2. Alternatively, you can choose Custom authentication. For Custom authentication, an access token is created, refreshed, and managed by the customer’s application or system and stored using Secrets Manager.
    2. For Token URL, enter the token authentication server URL.
    3. For OAuth Client ID, enter the client_id for integrationprincipal.
    4. For OAuth Secret, enter the secret ARN that you created in the previous step. Alternatively, you can provide the client_secret directly.
    5. For Token URL parameter map scope, provide the API scope supported.
  9. If you have AWS PrivateLink set up or a proxy set up, you can provide network details under Settings for network configurations.
  10. For Register Glue connection with Lake Formation, choose the IAM role (LFDataAccessRole) created earlier to manage data access using Lake Formation.

When the setup is done using AWS Command Line Interface (AWS CLI) commands, you have options to create two separate IAM roles:

  • IAM role with policies to access network and secrets, which AWS Glue assumes to manage authentication
  • IAM role with access to the S3 bucket, which Lake Formation assumes to manage credential vending for data access

On the console, this setup is simplified with a single role having combined policies. For more details, refer to Federate to Databricks Unity Catalog.

  1. To test the connection, choose Run test.
  2. You can proceed to create the catalog.

After you create the catalog, you can see the databases and tables in Databricks Unity Catalog listed under the federated catalog. You can implement fine-grained access control on the tables by applying row and column filters using Lake Formation. The following video shows the catalog federation setup with Databricks Unity Catalog.

Discover and query the data using Athena

In this post, we show how to use the Athena query editor to discover and query the Databricks Unity Catalog tables. On the Athena console, run the following query to access the federated table:SELECT * FROM "customerschema"."person" limit 10;The following video demonstrates querying the federated table from Athena.

If you use the Amazon Redshift query engine, you must create a resource link on the federated database and grant permission on the resource link to the user or role. This database resource link is automounted under awsdatacatalog based on the permission granted for the user or role and available for querying. For instructions, refer to Creating resource links.

Clean up

To clean up your resources, complete the following steps:

  1. Delete the catalog and namespace in Databricks Unity Catalog for this post.
  2. Drop the resources in the Data Catalog and Lake Formation created for this post.
  3. Delete the IAM roles and S3 buckets used for this post.
  4. Delete any VPC and KMS keys if used for this post.

Conclusion

In this post, we explored the key elements of catalog federation and its architectural design, illustrating the interaction between the AWS Glue Data Catalog and Databricks Unity Catalog through centralized authorization and credential distribution for protected data access. By removing the requirement for complicated synchronization workflows, catalog federation makes it possible to query Iceberg data on Amazon S3 directly at its source using AWS analytics services with data governance across multi-catalog platforms. Try out the solution for your own use case, and share your feedback and questions in the comments.


About the Authors

Srividya Parthasarathy

Srividya Parthasarathy

Srividya is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Venkatavaradhan (Venkat) Viswanathan

Venkatavaradhan (Venkat) Viswanathan

Venkat” is a Global Partner Solutions Architect at Amazon Web Services. Venkat is a Technology Strategy Leader in Data, AI, ML, Generative AI, and Advanced Analytics. Venkat is a Global SME for Databricks and helps AWS customers design, build, secure, and optimize Databricks workloads on AWS.

Access Amazon S3 Iceberg tables from Databricks using AWS Glue Iceberg Rest Catalog in Amazon SageMaker Lakehouse

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/access-amazon-s3-iceberg-tables-from-databricks-using-aws-glue-iceberg-rest-catalog-in-amazon-sagemaker-lakehouse/

Amazon SageMaker Lakehouse enables a unified, open, and secure lakehouse platform on your existing data lakes and warehouses. Its unified data architecture supports data analysis, business intelligence, machine learning, and generative AI applications, which can now take advantage of a single authoritative copy of data. With SageMaker Lakehouse, you get the best of both worlds—the flexibility to use cost effective Amazon Simple Storage Service (Amazon S3) storage with the scalable compute of a data lake, along with the performance, reliability and SQL capabilities typically associated with a data warehouse.

SageMaker Lakehouse enables interoperability by providing open source Apache Iceberg REST APIs to access data in the lakehouse. Customers can now use their choice of tools and a wide range of AWS services such as Amazon Redshift, Amazon EMR, Amazon Athena and Amazon SageMaker, in addition to third-party analytics engines that are compatible with Apache Iceberg REST specifications to query their data in-place.

Finally, SageMaker Lakehouse now provides secure and fine-grained access controls on data in both data warehouses and data lakes. With resource permission controls from AWS Lake Formation integrated into the AWS Glue Data Catalog, SageMaker Lakehouse lets customers securely define and share access to a single authoritative copy of data across their entire organization.

Organizations managing workloads in AWS analytics and Databricks can now use this open and secure lakehouse capability to unify policy administration and oversight of their data lake in Amazon S3. In this post, we will show you how Databricks on AWS general purpose compute can integrate with the AWS Glue Iceberg REST Catalog for metadata access and use Lake Formation for data access. To keep the setup in this post straightforward, the Glue Iceberg REST Catalog and Databricks cluster share the same AWS account.

Solution overview

In this post, we show how tables cataloged in Data Catalog and stored on Amazon S3 can be consumed from Databricks compute using Glue Iceberg REST Catalog with data access secured using Lake Formation. We will show you how the cluster can be configured to interact with Glue Iceberg REST Catalog, use a notebook to access the data using Lake Formation temporary vended credentials, and run analysis to derive insights.

The following figure shows the architecture described in the preceding paragraph.

Prerequisites

To follow along with the solution presented in this post, you need the following AWS prerequisites:

  1. Access to the Lake Formation data lake administrator in your AWS account. A Lake Formation data lake administrator is an IAM principal that can register Amazon S3 locations, access the Data Catalog, grant Lake Formation permissions to other users, and view AWS CloudTrail See Create a data lake administrator for more information.
  2. Enable full table access for external engines to access data in Lake Formation.
    • Sign into Lake Formation console as an IAM administrator and choose Administration in the navigation pane.
    • Choose Application integration settings and select Allow external engines to access data in Amazon S3 locations with full table access.
    • Choose Save.
  3. An existing AWS Glue database and tables. For this post, we will use an AWS Glue database named icebergdemodb, which contains an Iceberg table named person and data is stored in an S3 general purpose bucket named icebergdemodatalake.

  4. A user-defined IAM role that Lake Formation assumes when accessing the data in the above S3 location to vend scoped credentials. Follow the instructions provided in Requirements for roles used to register locations. For this post, we will use the IAM role LakeFormationRegistrationRole.

In addition to the AWS prerequisites, you need access to Databricks Workspace (on AWS) and the ability to create a cluster with No isolation shared access mode.

Set up an instance profile role. For instructions on how to create and set up the role, see Manage instance profiles in Databricks. Create customer managed policy named: dataplane-glue-lf-policy with below policies and attach the same to the instance profile role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
               "Action": [
                "glue:UpdateTable",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetCatalog",
                "glue:GetCatalogs",
                "glue:GetPartitions",
                "glue:GetPartition",
                "glue:GetTable",
                "glue:GetTables"
            ],
            "Resource": [
                "arn:aws:glue:<aws_region>:<accountid>:table/icebergdemodb/*",
                "arn:aws:glue:<aws_region>:<accountid>:database/icebergdemodb",
                "arn:aws:glue:<aws_region>:<accountid>:catalog"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}

For this post, we will use an instance profile role (databricks-dataplane-instance-profile-role), which will be attached to the previously created cluster.

Register the Amazon S3 location as the data lake location

Registering an Amazon S3 location with Lake Formation provides an IAM role with read/write permissions to the S3 location. In this case, you are required to register the icebergdemodatalake bucket location using the LakeFormationRegistrationRole IAM role.

After the location is registered, Lake Formation assumes the LakeFormationRegistrationRole role when it grants temporary credentials to the integrated AWS services/third-party analytics engines that are compatible(prerequisite Step 2) that access data in that S3 bucket location.

To register the Amazon S3 location as the data lake location, complete the following steps:

  1. Sign in to the AWS Management Console for Lake Formation as the data lake administrator .
  2. In the navigation pane, choose Data lake locations under Administration.
  3. Choose Register location.
  4. For Amazon S3 path, enter s3://icebergdemodatalake.
  5. For IAM role, select LakeFormationRegistrationRole.
  6. For Permission mode, select Lake Formation.
  7. Choose Register location.

Grant database and table permissions to the IAM role used within Databricks

Grant DESCRIBE permission on the icebergdemodb database to the Databricks IAM instance role.

  1. Sign in to the Lake Formation console as the data lake administrator.
  2. In the navigation pane, choose Data lake permissions and choose Grant.
  3. In the Principles section, select IAM users and roles and choose databricks-dataplane-instance-profile-role.
  4. In the LF-Tags or catalog resources section, select Named Data Catalog resources. Choose <accountid> for Catalogs and icebergdemodb for Databases.
  5. Select DESCRIBE for Database permissions.
  6. Choose Grant.

Grant SELECT and DESCRIBE permissions on the person table in the icebergdemodb database to the Databricks IAM instance role.

  1. In the navigation pane, choose Data lake permissions and choose Grant.
  2. In the Principles section, select IAM users and roles and choose databricks-dataplane-instance-profile-role.
  3. In the LF-Tags or catalog resources section, select Named Data Catalog resources. Choose <accountid> for Catalogs, icebergdemodb for Databases and person for table.
  4. Select SUPER for Table permissions.
  5. Choose Grant.

Grant data location permissions on the bucket to the Databricks IAM instance role.

  1. In the Lake Formation console navigation pane, choose Data Locations, and then choose Grant.
  2. For IAM users and roles, choose databricks-dataplane-instance-profile-role.
  3. For Storage locations, select the s3://icebergdemodatalake.
  4. Choose Grant.

Databricks workspace

Create a cluster and configure it to connect with a Glue Iceberg REST Catalog endpoint. For this post, we will use a Databricks cluster with runtime version 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12).

  1. In Databricks console, choose Compute in the navigation pane.
  2. Create a cluster with runtime version 15.4 LTS, access mode as ‘No isolation shared‘ and choose databricks-dataplane-instance-profile-role as instance profile role under Configuration section.
  3. Expand the Advanced options section. In the Spark section, for Spark Config include the following details:
    spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions 
    spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.spark_catalog.type rest 
    spark.sql.catalog.spark_catalog.uri https://glue.<aws region>.amazonaws.com/iceberg
    spark.sql.catalog.spark_catalog.warehouse <aws account number> 
    spark.sql.catalog.spark_catalog.rest.sigv4-enabled true 
    spark.sql.catalog.spark_catalog.rest.signing-name glue 
    spark.sql.defaultCatalog spark_catalog 

  4. In the Cluster section, for Libraries include the following jars:
    1. org.apache.iceberg-spark-runtime-3.5_2.12:1.6.1
    2. software.amazon.awssdk:bundle:2.29.5

Create a notebook for analyzing data managed in Data Catalog:

  1. In the workspace browser, create a new notebook and attach it to the cluster created above.
  2. Run the following commands in the notebook cell to query the data.
    #Show Databases
    df= spark.sql(“show databases”)
    display (df)



  3. Further modify the data in the S3 data lake using the AWS Glue Iceberg REST Catalog.

This shows that you can now analyze data in a Databricks cluster using an AWS Glue Iceberg REST Catalog endpoint with Lake Formation managing the data access.

Clean up

To clean up the resources used in this post and avoid possible charges:

  1. Delete the cluster created in Databricks.
  2. Delete the IAM roles created for this post.
  3. Delete the resources created in Data Catalog.
  4. Empty and then delete the S3 bucket.

Conclusion

In this post, we have showed you how to manage a dataset centrally in AWS Glue Data Catalog and make it accessible to Databricks compute using the Iceberg REST Catalog API. The solution also enables you to use Databricks to use existing access control mechanisms with Lake Formation, which is used to manage metadata access and enable underlying Amazon S3 storage access using credential vending.

Try the feature and share your feedback in the comments.


About the authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Venkatavaradhan (Venkat) Viswanathan is a Global Partner Solutions Architect at Amazon Web Services. Venkat is a Technology Strategy Leader in Data, AI, ML, generative AI, and Advanced Analytics. Venkat is a Global SME for Databricks and helps AWS customers design, build, secure, and optimize Databricks workloads on AWS.

Pratik Das is a Senior Product Manager with AWS Lake Formation. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems.

Simplify data access for your enterprise using Amazon SageMaker Lakehouse

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/simplify-data-access-for-your-enterprise-using-amazon-sagemaker-lakehouse/

Organizations are increasingly using data to make decisions and drive innovation. However, building data-driven applications can be challenging. It often requires multiple teams working together and integrating various data sources, tools, and services. For example, creating a targeted marketing app involves data engineers, data scientists, and business analysts using different systems and tools. This complexity leads to several issues: it takes time to learn multiple systems, it’s difficult to manage data and code across different services, and controlling access for users across various systems is complicated. Currently, organizations often create custom solutions to connect these systems, but they want a more unified approach that them to choose the best tools while providing a streamlined experience for their data teams. The use of separate data warehouses and lakes has created data silos, leading to problems such as lack of interoperability, duplicate governance efforts, complex architectures, and slower time to value.

You can use Amazon SageMaker Lakehouse to achieve unified access to data in both data warehouses and data lakes. Through SageMaker Lakehouse, you can use preferred analytics, machine learning, and business intelligence engines through an open, Apache Iceberg REST API to help ensure secure access to data with consistent, fine-grained access controls.

Solution overview

Let’s consider Example Retail Corp, which is facing increasing customer churn. Its management wants to implement a data-driven approach to identify at-risk customers and develop targeted retention strategies. However, the customer data is scattered across different systems and services, making it challenging to perform comprehensive analyses. Today, Example Retail Corp manages sales data in its data warehouse and customer data in Apache Iceberg tables in Amazon Simple Storage Service (Amazon S3). It uses Amazon EMR Serverless for data processing and machine learning. For governance, it uses AWS Glue Data Catalog as the central technical catalog and AWS Lake Formation as the permission store for enforcing fine-grained access controls. Its main objective is to implement a unified data management system that now combines data from varied sources, enables secure access across enterprise, and allow disparate teams to use preferred tools to predict, analyze, and consume customer churn information.

Let’s examine how Example Retail Corp can use SageMaker Lakehouse to achieve its unified data management vision using this reference architecture diagram.

Personas

There are four personas used in this solution.

  • The Data Lake Admin has an AWS Identity and Access Management (IAM) admin role and is a Lake Formation administrator responsible for managing user permissions to catalog objects using Lake Formation.
  • The Data Warehouse Admin has an IAM admin role and manages databases in Amazon Redshift.
  • The Data Engineer has an IAM ETL role and runs the extract, transform, and load (ETL) pipeline using Spark to populate the Lakehouse catalog on RMS.
  • The Data Analyst has an IAM analyst role and performs churn analysis on SageMaker Lakehouse data using Amazon Athena and Amazon Redshift.

Dataset

The following table describes the elements of the dataset.

Schema Table Data source
public customer_churn Lakehouse catalog with storage on RMS
customerdb customer Lakehouse catalog with storage on Amazon S3
sales store_sales Data warehouse

Prerequisites

To follow along on the solution walkthrough, you need to have the following:

  1. Create a user defined IAM role following the instruction in Requirements for roles used to register locations. For this post, we will use IAM role LakeFormationRegistrationRole.
  2. An Amazon Virtual Private Cloud (Amazon VPC) with private and public subnets.
  3. Create an S3 bucket. For this post, we will use customer_data as the bucket name.
  4. Create an Amazon Redshift serverless endpoint called sales_dw which will host store_sales dataset.
  5. Create an Amazon Redshift serverless endpoint called sales_analysis_dw for churn analysis by sales analysts.
  6. Create an IAM role named DataTransferRole following the instructions in Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.
  7. Install or update the latest version of the AWS CLI. For instructions, see Installing or updating to the latest version of the AWS CLI.
  8. Create a data lake admin using the instructions in Create a data lake administrator. For this post, we will use an IAM role called Admin.

Configure Datalake administrators :

Sign in to the AWS Management Console as Admin and go to AWS Lake Formation. In the navigation pane, choose Administration roles and then choose Tasks under Administration. Under Data lake administrators, choose Add:

  1. In the Add administrators page, under Access type, choose Data lake administrator.
  2. Under IAM users and roles, select Admin. Choose Confirm.
  3. On the Add administrators page, for Access type select Read-only administrators. Under IAM users and roles, select AWSServiceRoleForRedshift and choose Conrm. This step enables Amazon Redshift to discover and access catalog objects in AWS Glue Data Catalog.

Solution walkthrough

Create a customer table in the Amazon S3 data lake in AWS Glue Data Catalog

  1. Create an AWS Glue database called customerdb in the default catalog in your account by going to the AWS Lake Formation console and choosing Databases in the navigation pane.
  2. Select the database that you just created and choose Edit.
  3. Clear the checkbox Use only IAM access control for new tables in this database.
  4. Sign in to the Athena console as Admin and select Workgroup that the role has access to. Run the following SQL:
    CREATE EXTERNAL TABLE `tempcustomer`(
      `c_salutation` string, 
      `c_preferred_cust_flag` string, 
      `c_first_sales_date_sk` int, 
      `c_customer_sk` int, 
      `c_login` string, 
      `c_current_cdemo_sk` int, 
      `c_first_name` string, 
      `c_current_hdemo_sk` int, 
      `c_current_addr_sk` int, 
      `c_last_name` string, 
      `c_customer_id` string, 
      `c_last_review_date_sk` int, 
      `c_birth_month` int, 
      `c_birth_country` string, 
      `c_birth_year` int, 
      `c_birth_day` int, 
      `c_first_shipto_date_sk` int, 
      `c_email_address` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://customer_data/tempcustomer'
    
    INSERT INTO customer
    VALUES('Dr.','N',2452077,13251813,'Y',1381546,'Joyce',2645,2255449,'Deaton','AAAAAAAAFOEDKMAA',2452543,1,'GREECE',1987,29,2250667,'[email protected]'),
    ('Dr.','N',2450637,12755125,'Y',1581546,'Daniel',9745,4922716,'Dow','AAAAAAAAFLAKCMAA',2432545,1,'INDIA',1952,3,2450667,'[email protected]'),
    ('Dr.','N',2452342,26009249,'Y',1581536,'Marie',8734,1331639,'Lange','AAAAAAAABKONMIBA',2455549,1,'CANADA',1934,5,2472372,'[email protected]'),
    ('Dr.','N',2452342,3270685,'Y',1827661,'Wesley',1548,11108235,'Harris','AAAAAAAANBIOBDAA',2452548,1,'ROME',1986,13,2450667,'[email protected]'),
    ('Dr.','N',2452342,29033279,'Y',1581536,'Alexandar',8262,8059919,'Salyer','AAAAAAAAPDDALLBA',2952543,1,'SWISS',1980,6,2650667,'[email protected]'),
    ('Miss','N',2452342,6520539,'Y',3581536,'Jerry',1874,36370,'Tracy','AAAAAAAALNOHDGAA',2452385,1,'ITALY',1957,8,2450667,'[email protected]')
    
    CREATE TABLE customer
    WITH (table_type = 'ICEBERG',
    format = 'PARQUET',
    location = 's3://customer_data/customer/',
    is_external = false
    ) as select * from tempcustomer;

  5. Register the S3 bucket with Lake Formation:
    • Sign in to the Lake Formation console as Data Lake Admin.
    • In the navigation pane, choose Administration, and then choose Data lake locations.
    • Choose Register location.
    • For the Amazon S3 path, enter s3://customer_data/.
    • For the IAM role, choose LakeFormationRegistrationRole.
    • For Permission mode, select Lake Formation.
    • Choose Register location.

Create the salesdb database in Amazon Redshift

  1. Sign in to the Redshift endpoint sales_dw as Admin user. Run following script to create a database named salesdb.
    Create database salesdb;

  2. Connect to salesdb. Run the following script to create schema sales and the store_sales table and populate it with data.
    Create schema sales;
    CREATE TABLE sales.store_sales (
        sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
        customer_sk INTEGER NOT NULL,
        sale_date DATE NOT NULL,
        sale_amount DECIMAL(10, 2) NOT NULL,
        product_name VARCHAR(100) NOT NULL,
        last_purchase_date DATE
    );
    
    INSERT INTO sales.store_sales (customer_sk, sale_date, sale_amount, product_name, last_purchase_date)
    VALUES
        (13251813, '2023-01-15', 150.00, 'Widget A', '2023-01-15'),
        (29033279, '2023-01-20', 200.00, 'Gadget B', '2023-01-20'),
        (12755125, '2023-02-01', 75.50, 'Tool C', '2023-02-01'),
        (26009249, '2023-02-10', 300.00, 'Widget A', '2023-02-10'),
        (3270685, '2023-02-15', 125.00, 'Gadget B', '2023-02-15'),
        (6520539, '2023-03-01', 100.00, 'Tool C', '2023-03-01'),
        (10251183, '2023-03-10', 250.00, 'Widget A', '2023-03-10'),
        (10251283, '2023-03-15', 180.00, 'Gadget B', '2023-03-15'),
        (10251383, '2023-04-01', 90.00, 'Tool C', '2023-04-01'),
        (10251483, '2023-04-10', 220.00, 'Widget A', '2023-04-10'),
        (10251583, '2023-04-15', 175.00, 'Gadget B', '2023-04-15'),
        (10251683, '2023-05-01', 130.00, 'Tool C', '2023-05-01'),
        (10251783, '2023-05-10', 280.00, 'Widget A', '2023-05-10'),
        (10251883, '2023-05-15', 195.00, 'Gadget B', '2023-05-15'),
        (10251983, '2023-06-01', 110.00, 'Tool C', '2023-06-01'),
        (10251083, '2023-06-10', 270.00, 'Widget A', '2023-06-10'),
        (10252783, '2023-06-15', 185.00, 'Gadget B', '2023-06-15'),
        (10253783, '2023-07-01', 95.00, 'Tool C', '2023-07-01'),
        (10254783, '2023-07-10', 240.00, 'Widget A', '2023-07-10'),
        (10255783, '2023-07-15', 160.00, 'Gadget B', '2023-07-15');

Create the churn_lakehouse RMS catalog in Glue Data Catalog

This catalog will contain the customer churn table with managed RMS storage, which will be populated using Amazon EMR.

We will manage the customer churn data in an AWS Glue managed catalog with managed RMS storage. This data is produced from an analysis conducted in EMR Serverless and is accessible in the presentation layer to serve to business intelligence (BI) applications.

Create Lakehouse (RMS) catalog

  1. Sign in to the Lake Formation console as Data Lake Admin.
  2. In the left navigation pane, choose Data Catalog, and then Catalogs New. Choose Create catalog.
  1. Provide the details for the catalog:
    • Name: Enter churn_lakehouse.
    • Type: Select Managed catalog.
    • Storage: Select Redshift.
    • Under Access from engines, make sure that Access this catalog from Iceberg compatible engines is selected.
    • Choose Next.
    • Under Principals, select IAM users and roles. Under IAM users and roles, select the Admin Under Catalog permissions, select Super user.
    • Choose Add, and then choose Create catalog.

Access churn_lakehouse RMS catalog from Amazon EMR Spark engine

  1. Set up an EMR Studio.
  2. Create an EMR Serverless application using CLI command.
    aws emr-serverless create-application --region <aws_region> \
    --name 'Churn_Analysis' \
    --type 'SPARK' \
    --release-label emr-7.5.0 \
    --network-configuration '{"subnetIds": ["<subnet2>", "<subnet2>"], "securityGroupIds": [<security_group>]}'

Sign in to EMR Studio and use the EMR Studio Workspace

  1. Sign in to the EMR Studio console and choose Workspaces in the navigation pane, and then choose Create Workspace.
  2. Enter a name and a description for the Workspace.
  3. Choose Create Workspace. A new tab containing JupyterLab will open automatically when the Workspace is ready. Enable pop-ups in your browser if necessary.
  4. Choose the Compute icon in the navigation pane to attach the EMR Studio Workspace with a compute engine.
  5. Select EMR Serverless application for Compute type.
  6. Choose Churn_Analysis for EMR-S Application.
  7. For Runtime role, choose Admin.
  8. Choose Attach.

Download the notebook, import it, choose PySpark kernel and execute the cells that will create the table.

Manage your users’ fine-grained access to catalog objects using AWS Lake Formation

Grant the following permissions to the Analyst role on the resources as shown in the following table.

Catalog Database Table Permission
<account_id>:churn_lakehouse/dev public customer_churn Column permission:
<account_id> customerdb customer Table permission
<account_id>:sales_lakehouse/salesdb sales store_sales All table permission
  1. Sign in to the Lake Formation console as Data Lake Admin. In the navigation pane, choose Data Lake Permissions, and then choose Grant.
  2. For IAM user and roles, choose Analyst IAM role. For resources choose as shown below and grant.
  3. For IAM user and roles, choose Analyst IAM Role. For resource choose as shown below and grant.
  4. For IAM user and roles, choose Analyst IAM Role. For resource choose as shown below and grant.

Perform churn analysis using multiple engines:

Using Athena

Sign in to the Athena console using the IAM Analyst role, select the workgroup that the role has access to. Run the following SQL combining data from the data warehouse and Lake House RMS catalog for churn analysis:

SELECT 
c.c_customer_id,
c.c_first_name,
c.c_last_name,
c.c_email_address,
ss.sale_amount,
cc.is_churned
FROM 
    "customerdb"."customer" c
LEFT JOIN 
    "sales_lakehouse/salesdb"."sales"."store_sales" ss ON c.c_customer_sk = ss.customer_sk
LEFT JOIN 
    "churn_lakehouse/dev"."public"."customer_churn" cc ON c.c_customer_sk  = cc.customer_id
WHERE cc.is_churned = true
;

The following figure shows the results, which include customer IDs, names, and other information.

Using Amazon Redshift

Sign in to the Redshift Sale cluster QEV2 using the IAM Analyst role. Sign in using temporary credentials using your IAM identity and run the following SQL command:

SELECT 
c.c_customer_id,
c.c_first_name,
c.c_last_name,
c.c_email_address,
ss.sale_amount,
cc.is_churned
FROM 
   "awsdatacatalog"."customerdb"."customer" c
LEFT JOIN 
    "salesdb@sales_lakehouse"."sales"."store_sales" ss ON c.c_customer_sk = ss.customer_sk
LEFT JOIN 
    "dev@churn_lakehouse"."public"."customer_churn" cc ON c.c_customer_sk  = cc.customer_id
WHERE cc.is_churned = true
;

The following figure shows the results, which include customer IDs, names, and other information.

Clean up

Complete the following steps to delete the resources you created to avoid unexpected costs:

  1. Deletethe Redshift Serverless workgroups.
  2. Deletethe Redshift Serverless associated namespace.
  3. Delete EMR Studio and Application created.
  4. Delete Glue resources and Lake Formation permissions.
  5. Empty the bucket and delete the bucket.

Conclusion

In this post, we showcased how you can use Amazon SageMaker Lakehouse to achieve unified access to data across your data warehouses and data lakes. With unified access, you can use preferred analytics, machine learning, and business intelligence engines through an open, Apache Iceberg REST API and secure access to data with consistent, fine-grained access controls. Try Amazon SageMaker Lakehouse in your environment and share your feedback with us.


About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with product team and customer to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Harshida Patel is a Analytics Specialist Principal Solutions Architect, with AWS.

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/efficiently-crawl-your-data-lake-and-improve-data-access-with-aws-glue-crawler-using-partition-indexes/

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems. AWS Glue crawlers provide a straightforward way to catalog data in the AWS Glue Data Catalog that removes the heavy lifting when it comes to schema management and data classification. AWS Glue crawlers extract the data schema and partitions from Amazon S3 to automatically populate the Data Catalog, keeping the metadata current.

But with data growing exponentially over time, the number of partitions in a given table can grow significantly. Because analytics services like Amazon Athena query a table containing millions of partitions, the time needed to retrieve the partition increases and can cause query runtime to increase.

Today, AWS Glue crawler support has been expanded to automatically add partition indexes for newly discovered tables to optimize query processing on the partitioned dataset. Now, when the crawler creates a new Data Catalog table during a crawler run, it also creates a partition index by default, with the largest permutation of all numeric and string type partition columns as keys. The Data Catalog then creates a searchable index based on these keys, reducing the time required to retrieve and filter partition metadata on tables with millions of partitions. The creation of partition indexes benefits the analytics workloads running on Athena, Amazon EMR, Amazon Redshift Spectrum, and AWS Glue.

In this post, we describe how to create partition indexes with an AWS Glue crawler and compare the query performance improvement when accessing the crawled data with and without a partition index from Athena.

Solution overview

We use an AWS CloudFormation template to create our solution resources. In the following steps, we demonstrate how to configure the AWS Glue crawler to create a partition index using either the AWS Glue console or the AWS Command Line Interface (AWS CLI). Then we compare the query performance improvements using Athena.

Prerequisites

To follow along with this post, you must have access to an AWS Identity and Access Management (IAM) administrator role to create resources using AWS CloudFormation.

Set up your solution resources

The CloudFormation template generates the following resources:

  • IAM roles and policies
  • An AWS Glue database to hold the schema
  • An AWS Glue crawler pointing to a highly partitioned dataset
  • An Athena workgroup and bucket to store query results

Complete the following steps to set up the solution resources:

  1. Log in to the AWS Management Console as an IAM administrator.
  2. Choose Launch Stack to deploy the CloudFormation template:
  3. For DatabaseName, keep the default blog_partition_index_crawlerdb.
  4. Choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create stack.
  7. When the stack is complete, on the AWS CloudFormation console, navigate to the Outputs tab of the stack.
  8. Note down values of DatabaseName and GlueCrawlerName.

Some of the resources that this stack deploys incur costs when in use.

Edit and run the AWS Glue crawler

To configure and run the AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Locate the crawler blog-partition-index-crawler and choose Edit.
  3. In the Set output and scheduling section, under Advanced options, select Create partition indexes automatically.
  4. Review and update the crawler settings.

Alternatively, you can configure your crawler using the AWS CLI (provide your IAM role and Region):

aws glue create-crawler --name blog-partition-index-crawler --targets '{ "S3Targets": [{ "Path": "s3://awsglue-datasets/examples/highly-partitioned-table/"}] }' --database-name "blog_partition_index_crawlerdb" --role <Crawler_IAM_role> --configuration "{\"Version\":1.0,\"CreatePartitionIndex\":true}" --region <region_name>
  1. Now run the crawler and verify that the crawler run is complete.

This is highly partitioned dataset and will take approximately 90 minutes to complete.

Verify the partitioned table

In the AWS Glue database blog_partition_index_crawlerdb, verify that the table highly_partitioned_table is created.

By default, the crawler determines an index based on the largest permutation of partition columns of valid column types in the same order of partition columns, which are either numeric or string. For the table created by the crawler (highly_partitioned_table), we have partition columns year (string), month (string), day (string), and hour (string).

Based on this definition, the crawler created an index on the permutation of year, month, day, and hour. The crawler created the indexes prefixed with crawler_ on any partition index created by default.

Verify the same by navigating to the table highly_partitioned_table on the AWS Glue console and choosing the Indexes tab.

The crawler was able to crawl the S3 data source and successfully populate the partition indexes for the table.

Compare the query performance improvements using Athena

First, we query the table in Athena without using the partition index. To verify the tables using Athena, complete the following steps:

  1. On the Athena console, choose crawler-primary-workgroup as the Athena workgroup and choose Acknowledge.
  2. Run the following query:
    select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year='1980' and month='01' and day ='01'

The following screenshot shows the query took approximately 32 seconds without filtering enabled using the partition index.

  1. Now we enable the partition index on the Athena query:
    ALTER TABLE blog_partition_index_crawlerdb.highly_partitioned_table
    SET TBLPROPERTIES ('partition_filtering.enabled' = 'true')

  2. Run the following query again and note the runtime:
    select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year=‘1980’ and month=‘01’ and day =‘01’

The following screenshot shows the query took only 700 milliseconds, which is much faster with filtering enabled using the partition index.

Clean up

To avoid unwanted charges to your AWS account, you can delete the AWS resources:

  1. Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
  2. Delete the CloudFormation stack you created.

Conclusion

In this post, we explained how to configure an AWS crawler to create partition indexes and compared the query performance when accessing the data with indexes from Athena.

If no partition indexes are present on the table, AWS Glue loads all the partitions of the table, and then filters the loaded partitions, which results in inefficient retrieval of metadata. Analytics services like Redshift Spectrum, Amazon EMR, and AWS Glue ETL Spark DataFrames can now utilize indexes for fetching partitions, resulting in significant query performance.

For more information on partition indexes and query performance across various analytical engines, refer to Improve Amazon Athena query performance using AWS Glue Data Catalog partition indexes and Improve query performance using AWS Glue partition indexes.

Special thanks to everyone who contributed to this crawler feature launch: Yuhang Chen, Kyle Duong,and Mita Gavade.


About the authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Build an AWS Lake Formation permissions inventory dashboard using AWS Glue and Amazon QuickSight

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/build-an-aws-lake-formation-permissions-inventory-dashboard-using-aws-glue-and-amazon-quicksight/

AWS Lake Formation is an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning (ML). Lake Formation provides a single place to define fine-grained access control on catalog resources. These permissions are granted to the principals by a data lake admin, and integrated engines like Amazon Athena, AWS Glue, Amazon EMR, and Amazon Redshift Spectrum enforce the access controls defined in Lake Formation. It also allows principals to securely share data catalog resources across multiple AWS accounts and AWS organizations through a centralized approach.

As organizations are adopting Lake Formation for scaling their permissions, there is steady increase in the access policies established and managed within the enterprise. However, it becomes more difficult to analyze and understand the permissions for auditing. Therefore, customers are looking for a simple way to collect, analyze, and visualize permissions data so that they can inspect and validate the policies. It also enables organizations to take actions that help them with compliance requirements.

This solution offers the ability to consolidate and create a central inventory of Lake Formation permissions that are registered in the given AWS account and Region. It provides a high-level view of various permissions that Lake Formation manages and aims at answering questions like:

  • Who has select access on given table
  • Which tables have delete permission granted
  • Which databases or tables does the given principal have select access to

In this post, we walk through how to set up and collect the permissions granted on resources in a given account using the Lake Formation API. AWS Glue makes it straightforward to set up and run jobs for collecting the permission data and creating an external table on the collected data. We use Amazon QuickSight to create a permissions dashboard using an Athena data source and dataset.

Overview of solution

The following diagram illustrates the architecture of this solution.

In this solution, we walk through the following tasks:

  1. Create an AWS Glue job to collect and store permissions data, and create external tables using Boto3.
  2. Verify the external tables created using Athena.
  3. Sign up for a QuickSight Enterprise account and enable Athena access.
  4. Create a dataset using an Athena data source.
  5. Use the datasets for analysis.
  6. Publish the analyses as a QuickSight dashboard.

The collected JSON data is flattened and written into an Amazon Simple Storage Service (Amazon S3) bucket as Parquet files partitioned by account ID, date, and resource type. After the data is stored in Amazon S3, external tables are created on them and filters are added for different types of resource permissions. These datasets can be imported into SPICE, an in-memory query engine that is part of QuickSight, or queried directly from QuickSight to create analyses. Later, you can publish these analyses as a dashboard and share it with other users.

Dashboards are created for the following use cases:

  • Database permissions
  • Table permissions
  • Principal permissions

Prerequisites

You should have the following prerequisites:

  • An S3 bucket to store the permissions inventory data
  • An AWS Glue database for permissions inventory metadata
  • An AWS Identity and Access Management (IAM) role for the AWS Glue job with access to the inventory AWS Glue database and S3 bucket and added as a data lake admin
  • A QuickSight account with access to Athena
  • An IAM role for QuickSight with access to the inventory AWS Glue database and S3 bucket

Set up and run the AWS Glue job

We create an AWS Glue job to collect Lake Formation permissions data for the given account and Region that is provided as job parameters, and the collected data is flattened before storage. Data is partitioned by account ID, date, and permissions type, and is stored as Parquet in an S3 bucket using Boto3. We create external tables on the data and add filters for different types of resource permissions.

To create the AWS Glue job, complete the following steps:

  1. Download the Python script file to local.
  2. On the AWS Glue console, under Data Integration and ETL in the navigation pane, choose Jobs.
  3. Under Create job, select Python Shell script editor.
  4. For Options, select Upload and edit an existing script.
  5. For File upload, choose Choose file.
  6. Choose the downloaded file (lf-permissions-inventory.py).
  7. Choose Create.

GlueJob

  1. After the job is created, enter a name for the job (for this post, lf-inventory-builder) and choose Save.

Glue Job save

  1. Choose the Job details tab.
  2. For Name, enter a name for the job.
  3. For IAM Role¸ choose an IAM role that has access to the inventory S3 bucket and inventory schema and registered as data lake admin.
  4. For Type, choose Python Shell.
  5. For Python version, choose Python 3.9.

Glue Job Details

  1. You can leave other property values at their default.
  2. Under Advanced properties¸ configure the following job parameters and values:
    1. catalog-id: with the value as the current AWS account ID whose permissions data are collected.
    2. databasename: with the value as the AWS Glue database where the inventory-related schema objects are created.
    3. region: with the value as the current Region where the job is configured and whose permissions data is collected.
    4. s3bucket: with the value as the S3 bucket where the collected permissions data is written.
    5. createtable: with the value yes, which enables external table creation on the data.

Job Parameters

  1. Choose Save to save the job settings.

Glue Job Save

  1. Choose Run to start the job.

When the job is complete, the run status changes to Succeeded. You can view the log messages in Amazon CloudWatch Logs.

Job Run

Permissions data is collected and stored in the S3 bucket (under lfpermissions-data) that you provided in the job parameters.

S3 Structure

The following external tables are created on the permissions data and can be queried using Athena:

  • lfpermissions – A summary of resource permissions
  • lfpermissionswithgrant – A summary of grantable resource permissions

For both tables, the schema structure is the same and the lftype column indicates what type of permissions the row applies to.

Athena Table Schema

Verify the tables using Athena

You can use Athena to verify the data using the following queries.

For more information, refer to Running SQL queries using Amazon Athena

  • List the database permissions:
Select * from lfpermissions where lftype=’DATABASE’
  • List the table permissions:
Select * from lfpermissions where lftype= ‘TABLE’
  • List the data lake permissions:
Select * from lfpermissions where lftype= ‘DATA_LOCATION’
  • List the grantable database permissions:
Select * from lfpermissionswithgrant where lftype=’DATABASE’
  • List the grantable table permissions:
Select * from lfpermissionswithgrant where lftype= ‘TABLE’
  • List grantable data lake permissions:
Select * from lfpermissionswithgrant where lftype= ‘DATA_LOCATION’

As the next step, we create a QuickSight dashboard with three sheets, each focused on different sets of permissions (database, table, principal) to slice and dice the data.

Sign up for a QuickSight account

If you haven’t signed up for QuickSight, complete the following steps:

  1. Sign in to the AWS Management Console as Admin, search for QuickSight and choose Sign up for QuickSight.

QuickSight signup

  1. For Edition, select Enterprise.
  2. Choose Continue.
  3. For Authentication method, select Use IAM federated identities & QuickSight-managed users.
  4. Under QuickSight Region, choose the same Region as your inventory S3 bucket.
  5. Under Account info, enter a QuickSight account name and email address for notification.

QuickSight Form

  1. In the Quick access to AWS services section, for IAM Role, select Use QuickSight-managed role (default).
  2. Allow access to IAM, Athena, and Amazon S3.
  3. Specify the S3 bucket that contains the permissions data.
  4. Choose Finish to complete the signup process.

QuickSight configuration

Note: If the inventory bucket and database is managed by Lake Formation, grant database and table access to the created QuickSight IAM role. For instructions, refer to Granting and revoking permissions on Data Catalog resources.

Configure your dataset in QuickSight

QuickSight is configured with an Athena data source the same Region as the S3 bucket. To set up your dataset, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.

Quicksight DataSet

  1. Choose Athena as your data source.

QuickSight Datasource

  1. Enter LF_DASHBOARD_DS as the name of your data source.
  2. Choose Create data source.
  3. For Catalog, leave it as AwsDataCatalog.
  4. For Database, choose database name provided as parameter to the Job.
  5. For Tables, select lfpermissions.
  6. Choose Select.

QuickSight Catalog Info

  1. Select Directly query your data and choose Visualize to take you to the analysis.

Quicksight data mode

Create analyses

We create three sheets for our dashboard to view different levels of permissions.

Sheet 1: Database permission view

To view database permissions, complete the following steps:

  1. On the QuickSight console, choose the plus sign to create a new sheet.
  2. Choose Add, then choose Add title.

QuickSight Title

  1. Name the sheet Database Permissions.
  2. Repeat steps (5-7) to add the following parameters:
    • catalogid
    • databasename
    • permission
    • tablename
  3. On the Add menu, choose Add parameter.
  4. Enter a name for the parameter.
  5. Leave the other values as default and choose Create.
  6. Choose Insights in the navigation pane, then choose Add control.

QuickSight Control

  1. Add a control for each parameter:
    1. For each parameter, for Style¸ choose List, and for Values, select Link to a dataset field.
      QuickSight Dependency
    2. Provide additional information for each parameter according to the following table.
Parameter Display Name Dataset Field
catalogid AccountID lfpermissions catalog_id
databasename DatabaseName lfpermissions databasename
permission Permission lfpermissions permission
  1. Add a control dependency and for Database, choose the options menu and choose Edit.

QuickSight Dependency

  1. Under Format control, choose Control options.
  2. Change the relevant values, choose AccountID, and choose Update.
  3. Similarly, under Permission control, choose Control options.
  4. Change the relevant values, choose AccountID, and choose Update.

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and databasename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.

QuickSight Database View1

  1. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  2. Add a filter on catalog_id the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  3. Add a filter on databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  4. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  5. Choose Actions in the navigation pane.
  6. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check principal.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  4. Add a filter on catalog_id the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on databasename the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Database and Permission drop-down menus are populated based on the relevant attributes and changes dynamically.

QuickSight Database View2

Sheet 2: Table permission view

Now that we have created the database permissions sheet, we can add a table permissions sheet.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Table Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for each parameter:
    1. For each parameter, for Style¸ choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for each parameter according to the following table.
Parameter Display Name Dataset Field
catalogid AccountID lfpermissions catalog_id
databasename DatabaseName lfpermissions databasename
permission Permission lfpermissions permission
tablename TableName lfpermissions tablename

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id, databasename, and tablename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  6. Add a filter on catalog_id the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  8. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  9. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  10. Choose Actions in the navigation pane.
  11. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check principal.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  4. Add a filter on catalog_id the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  7. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Databasename, Tablename, and Permission drop-down menus are populated based on the relevant attributes.

QuickSight Table Permissions

Sheet 3: Principal permission view

Now we add a third sheet for principal permissions.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Principal Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for the catalogid parameter:
    1. For Style¸ choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for the parameter according to the following table.
Parameter Display Name Dataset Field
catalogid AccountID lfpermissions catalog_id

We create four visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and principal into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop databasename into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  6. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Choose Actions in the navigation pane.
  8. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check Databasename.
  9. For the second visual, choose Visualize and choose table as the visual type.
  10. Drag and drop databasename into Group by.
  11. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  12. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  13. For the third visual, choose Visualize and choose pivot table as the visual type.
  14. Drag and drop catalog_id and principal into Rows.
  15. Drag and drop permission into Column.
  16. Drag and drop tablename into Values and change the aggregation to Count distinct.
  17. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  18. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  19. Choose Actions in the navigation pane.
  20. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check Tablename.
  21. For the final visual, choose Visualize and choose table as the visual type.
  22. Drag and drop tablename into Group by.
  23. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  24. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.

The following screenshot shows our sheet.

QuickSight Prinicipal View

Create a dashboard

Now that the analysis is ready, you can publish it as a dashboard and share it with other users. For instructions, refer to the tutorial Create an Amazon QuickSight dashboard.

Clean up

To clean up the resources created in this post, complete the following steps:

  1. Delete the AWS Glue job lf-inventory-builder.
  2. Delete the data stored under the bucket provided as the value of the s3bucket job parameter.
  3. Drop the external tables created under the schema provided as the value of the databasename job parameter.
  4. If you signed up for QuickSight to follow along with this post, you can delete the account.
  5. For an existing QuickSight account, delete the following resources:
    1. lfpermissions dataset
    2. lfpermissions analysis
    3. lfpermissions dashboard

Conclusion

In this post, we provided a design and implementation steps for a solution to collect Lake Formation permissions in a given Region of an account and consolidate them for analysis. We also walked through the steps to create a dashboard using Amazon QuickSight. You can utilize other QuickSight visuals to create more sophisticated dashboards based on your requirements.

You can also expand this solution to consolidate permissions for a multi-account setup. You can use a shared bucket across organizations and accounts and configure an AWS Glue job in each account or organization to write their permission data. With this solution, you can maintain a unified dashboard view of all the Lake Formation permissions within your organization, thereby providing a central audit mechanism to comply with business requirements.

Thanks for reading this post! If you have any comments or questions, please don’t hesitate to leave them in the comments section.


About the Author

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/centrally-manage-access-and-permissions-for-amazon-redshift-data-sharing-with-aws-lake-formation/

Today’s global, data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift data warehouse with another Amazon Redshift data warehouse within the same AWS account, across accounts, and across Regions, without needing to copy or move data from one cluster to another.

Some customers share their data with 50–100 data warehouses in different accounts and do a lot of cross-sharing, making it difficult to track who is accessing what data. They have to navigate to an individual account’s Amazon Redshift console to retrieve the access information. Also, many customers have their data lake on Amazon Simple Storage Service (Amazon S3), which is shared within and across various business units. As the organization grows and democratizes the data, administrators want the ability to manage the datashare centrally for governance and auditing, and to enforce fine-grained access control.

Working backward from customer ask, we are announcing the preview of the following new feature: Amazon Redshift data sharing integration with AWS Lake Formation, which enables Amazon Redshift customers to centrally manage access to their Amazon Redshift datashares using Lake Formation.

Lake Formation has been a popular choice for centrally governing data lakes backed by Amazon S3. Now, with Lake Formation support for Amazon Redshift data sharing, it opens up new design patterns, and broadens governance and security posture across data warehouses. With this integration, you can use Lake Formation to define fine-grained access control on tables and views being shared with Amazon Redshift data sharing for federated AWS Identity and Access Management (IAM) users and IAM roles.

Customers are using the data mesh approach, which provides a mechanism to share data across business units. Customers are also using a modern data architecture to share data from data lake stores and Amazon Redshift purpose-built data stores across business units. Lake Formation provides the ability to enforce data governance within and across business units, which enables secure data access and sharing, easy data discovery, and centralized audit for data access.

United Airlines is in the business of connecting people and uniting the world.

“As a data-driven enterprise, United is trying to create a unified data and analytics experience for our analytics community that will innovate and build modern data-driven applications. We believe we can achieve this by building a purpose-built data mesh architecture using a variety of AWS services like Athena, Aurora, Amazon Redshift, and Lake Formation to simplify management and governance around granular data access and collaboration.”

-Ashok Srinivas, Director of ML Engineering and Sarang Bapat, Director of Data Engineering.

In this post, we show how to centrally manage access and permissions for Amazon Redshift data sharing with Lake Formation.

Solution overview

In this solution, we demonstrate how integration of Amazon Redshift data sharing with Lake Formation for data governance can help you build data domains, and how you can use the data mesh approach to bring data domains together to enable data sharing and federation across business units. The following diagram illustrates our solution architecture.

solution architecture

The data mesh is a decentralized, domain-oriented architecture that emphasizes separating data producers from data consumers via a centralized, federated Data Catalog. Typically, the producers and consumers run within their own account. The details of these data mesh characteristics are as follows:

  • Data producers – Data producers own their data products and are responsible for building their data, maintaining its accuracy, and keeping their data product up to date. They determine what datasets can be published for consumption and share their datasets by registering them with the centralized data catalog in a central governance account. You might have a producer steward or administrator persona for managing the data products with the central governance steward or administrators team.
  • Central governance account – Lake Formation enables fine-grained access management on the shared dataset. The centralized Data Catalog offers consumers the ability to quickly find shared datasets, allows administrators to centrally manage access permissions on shared datasets, and provides security teams the ability to audit and track data product usage across business units.
  • Data consumers – The data consumer obtains access to shared resources from the central governance account. These resources are available inside the consumer’s AWS Glue Data Catalog, allowing fine-grained access on the database and table that can be managed by the consumer’s data stewards and administrators.

The following steps provide an overview of how Amazon Redshift data sharing can be governed and managed by Lake Formation in the central governance pattern of a data mesh architecture:

  1. In the producer account, data objects are created and maintained in the Amazon Redshift producer cluster. A data warehouse admin creates the Amazon Redshift datashare and adds datasets (tables, views) to the share.
  2. The data warehouse admin grants and authorizes access on the datashare to the central governance account’s Data Catalog.
  3. In the central governance account, the data lake admin accepts the datashare and creates the AWS Glue database that points to the Amazon Redshift datashare so that Lake Formation can manage it.
  4. The data lake admin shares the AWS Glue database and tables to the consumer account using Lake Formation cross-account sharing.
  5. In the consumer account, the data lake admin accepts the resource share invitation via AWS Resource Access Manager (AWS RAM) and can view the database listed in the account.
  6. The data lake admin defines the fine-grained access control and grants permissions on databases and tables to IAM users (for this post, consumer1 and consumer2) in the account.
  7. In the Amazon Redshift cluster, the data warehouse admin creates an Amazon Redshift database that points to the Glue database and authorizes usage on the created Amazon Redshift database to the IAM users.
  8. The data analyst as an IAM user can now use their preferred tools like the Amazon Redshift query editor to access the dataset based on the Lake Formation fine-grained permissions.

We use the following account setup for our example in this post:

  • Producer account – 123456789012
  • Central account – 112233445566
  • Consumer account – 665544332211

Prerequisites

Create the Amazon Redshift data share and add datasets

In the data producer account, create an Amazon Redshift cluster using the RA3 node type with encryption enabled. Complete the following steps:

  1. On the Amazon Redshift console, create a cluster subnet group.

For more information, refer to Managing cluster subnet groups using the console.

  1. Choose Create cluster.
  2. For Cluster identifier, provide the cluster name of your choice.
  3. For Preview track, choose preview_2022.

  1. For Node type, choose one of the RA3 node types.

This feature is only supported on the RA3 node type.

  1. For Number of nodes, enter the number of nodes that you need for your cluster.
  2. Under Database configurations, choose the admin user name and admin user password.
  3. Under Cluster permissions, you can select the IAM role and set it as the default.

For more information about the default IAM role, refer to Creating an IAM role as default for Amazon Redshift.

cluster permissions

  1. Turn on the Use defaults option next to Additional configurations to modify the default settings.
  2. Under Network and security, specify the following:
    1. For Virtual private cloud (VPC), choose the VPC you would like to deploy the cluster in.
    2. For VPC security groups, either leave as default or add the security groups of your choice.
    3. For Cluster subnet group, choose the cluster subnet group you created.

additional configurations

  1. Under Database configuration, in the Encryption section, select Use AWS Key Management Service (AWS KMS) or Use a hardware security module (HSM).

Encryption is disabled by default.

  1. For Choose an AWS KMS key, you can either choose an existing AWS Key Management Service (AWS KMS) key, or choose Create an AWS KMS key to create a new key.

For more information, refer to Creating keys.

database configurations

  1. Choose Create cluster.
  2. For this post, create tables and load data into the producer Amazon Redshift cluster using the following script.

Authorize the datashare

Install or update the latest AWS Command Line Interface (AWS CLI) version to run the AWS CLI to authorize the datashare. For instructions, refer to Installing or updating the latest version of the AWS CLI.

Set up Lake Formation permissions

To use the AWS Glue Data Catalog in Lake Formation, complete the following steps in the central governance account to update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control:

  1. Sign in to the Lake Formation console as admin.
  2. In the navigation pane, under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases.
  4. Deselect Use only IAM access control for new tables in new databases.
  5. Choose Version 2 for Cross account version settings.
  6. Choose Save.

data catalog settings

Set up an IAM user as a data lake administrator

If you’re using an existing data lake administrator user or role add the following managed policies, if not attached and skip the below setup steps:

AWSGlueServiceRole
AmazonRedshiftFullAccess

Otherwise, to set up an IAM user as a data lake administrator, complete the following steps:

  1. On the IAM console, choose Users in the navigation pane.
  2. Select the IAM user who you want to designate as the data lake administrator.
  3. Choose Add an inline policy on the Permissions tab.
  4. Replace <AccountID> with your own account ID and add the following policy:
{
    "Version": "2012-10-17",
    "Statement": [ {
        "Condition": {"StringEquals": {
            "iam:AWSServiceName":"lakeformation.amazonaws.com"}},
            "Action":"iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Effect": "Allow"},
            {"Action": ["iam:PutRolePolicy"],
            "Resource": "arn:aws:iam::<AccountID>:role/aws-service role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess",
            "Effect": "Allow"
     },{
                  "Effect": "Allow",
                  "Action": [
                    "ram:AcceptResourceShareInvitation",
                    "ram:RejectResourceShareInvitation",
                    "ec2:DescribeAvailabilityZones",
                    "ram:EnableSharingWithAwsOrganization"
                  ],
                  "Resource": "*"
                }]
}
  1. Provide a policy name.
  2. Review and save your settings.
  3. Choose Add permissions, and choose Attach existing policies directly.
  4. Add the following policies:
    1. AWSLakeFormationCrossAccountManager
    2. AWSGlueConsoleFullAccess
    3. AWSGlueServiceRole
    4. AWSLakeFormationDataAdmin
    5. AWSCloudShellFullAccess
    6. AmazonRedshiftFullAccess
  5. Choose Next: Review and add permissions.

Data consumer account setup

In the consumer account, follow the steps mentioned previously in the central governance account to set up Lake Formation and a data lake administrator.

  1. In the data consumer account, create an Amazon Redshift cluster using the RA3 node type with encryption (refer to the steps demonstrated to create an Amazon Redshift cluster in the producer account).
  2. Choose Launch stack to deploy an AWS CloudFormation template to create two IAM users with policies.

launch stack

The stack creates the following users under the data analyst persona:

  • consumer1
  • consumer2
  1. After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
  2. Capture the ConsoleIAMLoginURL and LFUsersCredentials values.

createiamusers

  1. Choose the LFUsersCredentials value to navigate to the AWS Secrets Manager console.
  2. In the Secret value section, choose Retrieve secret value.

secret value

  1. Capture the secret value for the password.

Both consumer1 and consumer2 need to use this same password to log in to the AWS Management Console.

secret value

Configure an Amazon Redshift datashare using Lake Formation

Producer account

Create a datashare using the console

Complete the following steps to create an Amazon Redshift datashare in the data producer account and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, choose the cluster to create the datashare.
  2. On the cluster details page, navigate to the Datashares tab.
  3. Under Datashares created in my namespace, choose Connect to database.

connect to database

  1. Choose Create datashare.

create datashare

  1. For Datashare type, choose Datashare.
  2. For Datashare name, enter the name (for this post, demotahoeds).
  3. For Database name, choose the database from where to add datashare objects (for this post, dev).
  4. For Publicly accessible, choose Turn off (or choose Turn on to share the datashare with clusters that are publicly accessible).

datashare information

  1. Under DataShare objects, choose Add to add the schema to the datashare (in this post, the public schema).
  2. Under Tables and views, choose Add to add the tables and views to the datashare (for this post, we add the table customer and view customer_view).

datashare objects

  1. Under Data consumers, choose Publish to AWS Data Catalog.
  2. For Publish to the following accounts, choose Other AWS accounts.
  3. Provide the AWS account ID of the consumer account. For this post, we provide the AWS account ID of the Lake Formation central governance account.
  4. To share within the same account, choose Local account.
  5. Choose Create datashare.

data consumers

  1. After the datashare is created, you can verify by going back to the Datashares tab and entering the datashare name in the search bar under Datashares created in my namespace.
  2. Choose the datashare name to view its details.
  3. Under Data consumers, you will see the consumer status of the consumer data catalog account as Pending Authorization.

data consumers

  1. Choose the checkbox against the consumer data catalog which will enable the Authorize option.

authorize

  1. Click Authorize to authorize the datashare access to the consumer account data catalog, consumer status will change to Authorized.

authorized

Create a datashare using a SQL command

Complete the following steps to create a datashare in data producer account 1 and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor V2.
  2. Choose (right-click) the cluster name and choose Edit connection or Create Connection.
  3. For Authentication, choose Temporary credentials.

Refer to Connecting to an Amazon Redshift database to learn more about the various authentication methods.

  1. For Database, enter a database name (for this post, dev).
  2. For Database user, enter the user authorized to access the database (for this post, awsuser).
  3. Choose Save to connect to the database.

Connecting to an Amazon Redshift database

  1. Run the following SQL commands to create the datashare and add the data objects to be shared:
create datashare demotahoeds;
ALTER DATASHARE demotahoeds ADD SCHEMA PUBLIC;
ALTER DATASHARE demotahoeds ADD TABLE customer;
ALTER DATASHARE demotahoeds ADD TABLE customer_view;
  1. Run the following SQL command to share the producer datashare to the central governance account:
GRANT USAGE ON DATASHARE demotahoeds TO ACCOUNT '<central-aws-account-id>' via DATA CATALOG

Run the following SQL command

  1. You can verify the datashare created and objects shared by running the following SQL command:
DESC DATASHARE demotahoeds

DESC DATASHARE demotahoeds

  1. Run the following command using the AWS CLI to authorize the datashare to the central data catalog so that Lake Formation can manage them:
aws redshift authorize-data-share \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-identifier DataCatalog/<central-aws-account-id>

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [{
        "ConsumerIdentifier": "DataCatalog/XXXXXXXXXXXX",
        "Status": "AUTHORIZED",
        "CreatedDate": "2022-11-09T21:10:30.507000+00:00",
        "StatusChangeDate": "2022-11-09T21:10:50.932000+00:00"
    }]
}

You can verify the datashare status on the console by following the steps outlined in the previous section.

Central catalog account

The data lake admin accepts and registers the datashare with Lake Formation in the central governance account and creates a database for the same. Complete the following steps:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Under Data catalog in the navigation pane, choose Data sharing and view the Amazon Redshift datashare invitations on the Configuration tab.
  4. Select the datashare and choose Review Invitation.

AWS Lake Formation data sharing

A window pops up with the details of the invitation.

  1. Choose Accept to register the Amazon Redshift datashare to the AWS Glue Data Catalog.

accept reject invitation

  1. Provide a name for the AWS Glue database and choose Skip to Review and create.

Skip to Review and create

  1. Review the content and choose Create database.

create database

After the AWS Glue database is created on the Amazon Redshift datashare, you can view them under Shared Databases.

Shared Databases.

You can also use the AWS CLI to register the datashare and create the database. Use the following commands:

  1. Describe the Amazon Redshift datashare that is shared with the central account:
aws redshift describe-data-shares
  1. Accept and associate the Amazon Redshift datashare to Data Catalog:
aws redshift associate-data-share-consumer \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-arn arn:aws:glue:us-east-1:<central-aws-account-id>:catalog

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:123456789012:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:123456789012:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [
        {
            "ConsumerIdentifier": "arn:aws:glue:us-east-1:112233445566:catalog",
            "Status": "ACTIVE",
            "ConsumerRegion": "us-east-1",
            "CreatedDate": "2022-11-09T23:25:22.378000+00:00",
            "StatusChangeDate": "2022-11-09T23:25:22.378000+00:00"
        }
    ]
}
  1. Register the Amazon Redshift datashare in Lake Formation:
aws lakeformation register-resource \
--resource-arn arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds
  1. Create the AWS Glue database that points to the accepted Amazon Redshift datashare:
aws glue create-database --region <central-catalog-region> --cli-input-json '{
    "CatalogId": "<central-aws-account-id>",
    "DatabaseInput": {
        "Name": "demotahoedb",
        "FederatedDatabase": {
            "Identifier": "arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds",
            "ConnectionName": "aws:redshift"
        }
    }
}'

Now the data lake administrator of the central governance account can view and share access on both the database and tables to the data consumer account using the Lake Formation cross-account sharing feature.

Grant datashare access to the data consumer

To grant the data consumer account permissions on the shared AWS Glue database, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the data consumer account ID (for this post, 665544332211).
  5. Under LF_Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. Select Describe for both Database permissions and Grantable permissions.
  8. Choose Grant to apply the permissions.

grant data permissions

To grant the data consumer account permissions on tables, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the consumer account (for this post, we use 665544332211).
  5. Under LF-Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. For Tables, choose All tables.
  8. Select Describe and Select for both Table permissions and Grantable permissions.
  9. Choose Grant to apply the changes.

grant the data consumer account permissions on tables

Consumer account

The consumer admin will receive the shared resources from the central governance account and delegate access to other users in the consumer account as shown in the following table.

IAM User Object Access Object Type Access Level
consumer1 public.customer Table All
consumer2 public.customer_view View specific columns: c_customer_id, c_birth_country, cd_gender, cd_marital_status, cd_education_status

In the data consumer account, follow these steps to accept the resources shared with the account:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Sign in to the AWS RAM console.
  4. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations. You will receive 2 invitations.

Resource shares to view the pending invitations

  1. Choose the pending invitations and accept the resource share.

Choose the pending invitation and accept the resource share

  1. On the Lake formation console, under Data catalog in the navigation pane, choose Databases to view the cross-account shared database.

choose Databases to view the cross-account shared database

Grant access to the data analyst and IAM users using Lake Formation

Now the data lake admin in the data consumer account can delegate permissions on the shared database and tables to users in the consumer account.

Grant database permissions to consumer1 and consumer2

To grant the IAM users consumer1 and consumer2 database permissions, follow these steps:

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.

choose Grant database

  1. Under Principals, select IAM users and roles.
  2. Choose the IAM users consumer1 and consumer2.
  3. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  4. Select Describe for Database permissions.
  5. Choose Grant to apply the permissions.

Choose Grant to apply the permissions

Grant table permissions to consumer1

To grant the IAM user consumer1 permissions on table public.customer, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose IAM user consumer1.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer.
  7. Select Describe and Select for Table permissions.
  8. Choose Grant to apply the permissions.

Grant table permissions to consumer1

Grant column permissions to consumer2

To grant the IAM user consumer2 permissions on non-sensitive columns in public.customer_view, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM user consumer2.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer_view.

Grant column permissions to consumer2

  1. Select Select for Table permissions.
  2. Under Data Permissions, select Column-based access.
  3. Select Include columns and choose the non-sensitive columns (c_customer_id, c_birth_country, cd_gender, cd_marital_status, and cd_education_status).
  4. Choose Grant to apply the permissions.

table permissions

Consume the datashare from the data consumer account in the Amazon Redshift cluster

In the Amazon Redshift consumer data warehouse, log in as the admin user using Query Editor V2 and complete the following steps:

  1. Create the Amazon Redshift database from the shared catalog database using the following SQL command:
CREATE DATABASE demotahoedb FROM ARN 'arn:aws:glue:<producer-region>:<producer-aws-account-id>:database/demotahoedb' WITH DATA CATALOG SCHEMA demotahoedb ;
  1. Run the following SQL commands to create and grant usage on the Amazon Redshift database to the IAM users consumer1 and consumer2:
CREATE USER IAM:consumer1 password disable;
CREATE USER IAM:consumer2  password disable;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer1;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer2;

In order to use a federated identity to enforce Lake Formation permissions, follow the next steps to configure Query Editor v2.

  1. Choose the settings icon in the bottom left corner of the Query Editor v2, then choose Account settings.

identity to enforce Lake Formation permissions

  1. Under Connection settings, select Authenticate with IAM credentials.
  2. Choose Save.

Authenticate with IAM credentials

Query the shared datasets as a consumer user

To validate that the IAM user consumer1 has datashare access from Amazon Redshift, perform the following steps:

  1. Sign in to the console as IAM user consumer1.
  2. On the Amazon Redshift console, choose Query Editor V2 in the navigation pane.
  3. To connect to the consumer cluster, choose the consumer cluster in the tree-view pane.
  4. When prompted, for Authentication, select Temporary credentials using your IAM identity.
  5. For Database, enter the database name (for this post, dev).
  6. The user name will be mapped to your current IAM identity (for this post, consumer1).
  7. Choose Save.

edit connection for redshift

  1. Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:
select current_user;

  1. To find the federated databases created on the consumer account, run the following SQL command:
SHOW DATABASES FROM DATA CATALOG [ACCOUNT '<id1>', '<id2>'] [LIKE 'expression'];

federated databases created on the consumer account

  1. To validate permissions for consumer1, run the following SQL command:
select * from demotahoedb.public.customer limit 10;

As shown in the following screenshot, consumer1 is able to successfully access the datashare customer object.

Now let’s validate that consumer2 doesn’t have access to the datashare tables “public.customer” on the same consumer cluster.

  1. Log out of the console and sign in as IAM user consumer2.
  2. Follow the same steps to connect to the database using the query editor.
  3. Once connected, run the same query:
select * from demotahoedb.public.customer limit 10;

The user consumer2 should get a permission denied error, as in the following screenshot.

should get a permission denied error

Let’s validate the column-level access permissions of consumer2 on public.customer_view view.

  1. Connect to Query Editor v2 as consumer2 and run the following SQL command:
select c_customer_id,c_birth_country,cd_gender,cd_marital_status from demotahoedb.public.customer_view limit 10;

In the following screenshot, you can see consumer2 is only able to access columns as granted by Lake Formation.

access columns as granted by Lake Formation

Conclusion

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. Using Amazon Redshift data sharing with Lake Formation for data governance helps build the data mesh architecture, enabling data sharing and federation across business units with fine-grained access control.

Special thanks to everyone who contributed to launch Amazon Redshift data sharing with AWS Lake Formation:

Debu Panda, Michael Chess, Vlad Ponomarenko, Ting Yan, Erol Murtezaoglu, Sharda Khubchandani, Rui Bi

References


About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Harshida Patel is a Analytics Specialist Principal Solutions Architect, with AWS.

Ranjan Burman is a Analytics Specialist Solutions Architect, with AWS.

Vikram Sahadevan is a Senior Resident Architect on the AWS Data Lab team. He enjoys efforts that focus around providing prescriptive architectural guidance, sharing best practices, and removing technical roadblocks with joint engineering engagements between customers and AWS technical resources that accelerate data, analytics, artificial intelligence, and machine learning initiatives.

Steve Mitchell is a Senior Solution Architect with a passion for analytics and data mesh. He enjoys working closely with customers as they transition to a modern data architecture.