All posts by Aarthi Srinivasan

Create AWS Glue Data Catalog views using cross-account definer roles

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/create-aws-glue-data-catalog-views-using-cross-account-definer-roles/

With AWS Glue Data Catalog views you can create a SQL view in the Data Catalog that references one or more base tables. These multi-dialect views support various SQL query engines, providing consistent access across multiple Amazon Web Services (AWS) services including Amazon Athena, Amazon Redshift Spectrum, and Apache Spark in both Amazon EMR and AWS Glue 5.0.

You can now create Data Catalog views using a cross-account AWS Identity and Access Management (IAM) definer role. A definer role is an IAM role used to create the Data Catalog view and has SELECT permissions on all columns of the underlying base tables. This definer role is assumed by AWS Glue and AWS Lake Formation service principals to vend credentials to the base tables’ data whenever the view is queried. The definer role allows the Data Catalog view to be shared to principals or AWS accounts so that you can share a filtered subset of data without sharing the base tables.

Previously, Data Catalog views required a definer role within the same AWS account as the base tables. The introduction of cross-account definer roles enables Data Catalog view creation in enterprise data mesh architectures. In this setup, database and table metadata is centralized in a governance account, and individual data owner accounts maintain control over table creation and management through their IAM roles. Data owner accounts can now create and manage Data Catalog views in the central governance accounts using their existing continuous integration and continuous delivery (CI/CD) pipeline roles.

In this post, we show you a cross-account scenario involving two AWS accounts: a central governance account containing the tables and hosting the views and a data owner (producer) account with the IAM role used to create and manage views. We provide implementation details for both SPARK dialect using AWS SDK code samples and ATHENA dialect using SQL commands. Using this approach, you can implement sophisticated data governance models at enterprise scale while maintaining operational efficiency across your AWS environment.

Key benefits

Key benefits for cross-account definer roles are as follows:

  • Enhanced data mesh support – Enterprises with multi-account data lakehouse architectures can now maintain their existing operational model where data owner accounts manage table creation and updates using their established IAM roles. These same roles can now create and manage Data Catalog views across account boundaries.
  • Strengthened security controls – By keeping table and view management within data owner account roles:
    • Security posture is enhanced through proper separation of duties.
    • Audit trails become more comprehensive and meaningful.
    • Access controls follow the principle of least privilege.
  • Elimination of data duplication – Data owner accounts can create views in central accounts that:
    • Provide access to specific data subsets without duplicating tables.
    • Reduce storage costs and management overhead.
    • Maintain a single source of truth while enabling targeted data sharing.

Solution overview

An example customer has a database with two transaction tables in their central account, where the catalog and permissions are maintained. With the database shared to the data owner (producer) account, we create a Data Catalog view in the central account on these two tables, using the producer’s definer role. The view from the central account can be shared to additional consumer accounts and queried. We illustrate creating the SPARK dialect using create-table CLI, and add the ATHENA dialect for the same view from the Athena console. We also provide the AWS SDK sample code for CreateTable() and UpdateTable(), with view definition and a sample pySpark script to read and verify the view in AWS Glue.

The following diagram shows the table, view, and definer IAM role placements between a central governance account and data producer account.

Prerequisites

To perform this solution, you need to have the following prerequisites:

  1. Two AWS accounts with AWS Lake Formation set up. For details, refer to Set up AWS Lake Formation. The Lake Formation setup includes registering your IAM admin role as Lake Formation administrator. In the Data Catalog settings, shown in the following screenshot, Default permissions for newly created databases and tables is set to use Lake Formation permissions only. Cross-account version settings is set to Version 4.

  1. Create an IAM role Data-Analyst in the producer account. For the IAM permissions on this role, refer to Data analyst permissions. This role will also be used as the view definer role. Add the permissions to this definer role from the Prerequisites for creating views.

Create database and tables in the central account

In this step, you create two tables in the central governance account and populate them with few rows of data:

  1. Sign in to the central account as admin user. Open the Athena console and set up the Athena query results bucket.
  2. Run the following queries to create two sample Iceberg tables, representing bank customer transactions data:
/* Check if the Database exists, if not create new database. */
CREATE DATABASE IF NOT EXISTS bankdata_icebergdb;

/*Create transaction_table1*/ Replace the bucket name
CREATE TABLE bankdata_icebergdb.transaction_table1 (
  transaction_id string,
  transaction_type string,
  transaction_amount double)
LOCATION 's3://<bucket-name>/bankdata_icebergdb/transaction-table1'
TBLPROPERTIES (
  'table_type'='iceberg',
  'write_compression'='zstd'
);

/*Create transaction_table2*/
CREATE TABLE bankdata_icebergdb.transaction_table2 (
  transaction_id string,
  transaction_location string,
  transaction_date date)
LOCATION 's3://<bucket-name>/bankdata_icebergdb/transaction-table2'
TBLPROPERTIES (
  'table_type'='iceberg',
  'write_compression'='zstd'
);


INSERT INTO bankdata_icebergdb.transaction_table1 (transaction_id, transaction_type, transaction_amount)
VALUES
  ('T001', 'purchase', 50.0),
  ('T002', 'purchase', 120.0),
  ('T003', 'refund', 200.5),
  ('T004', 'purchase', 80.0),
  ('T005', 'withdrawal', 500.0),
  ('T006', 'purchase', 300.0),
  ('T007', 'deposit', 1000.0),
  ('T008', 'refund', 20.0),
  ('T009', 'purchase', 150.0),
  ('T010', 'withdrawal', 75.0);


INSERT INTO bankdata_icebergdb.transaction_table2 (transaction_id, transaction_location, transaction_date)
VALUES
  ('T001', 'Charlotte', DATE '2024-10-01'),
  ('T002', 'Seattle', DATE '2024-10-02'),
  ('T003', 'Chicago', DATE '2024-10-03'),
  ('T004', 'Miami', DATE '2024-10-04'),
  ('T005', 'New York', DATE '2024-10-05'),
  ('T006', 'Austin', DATE '2024-10-06'),
  ('T007', 'Denver', DATE '2024-10-07'),
  ('T008', 'Boston', DATE '2024-10-08'),
  ('T009', 'San Jose', DATE '2024-10-09'),
  ('T010', 'Phoenix', DATE '2024-10-10');
  1. Verify the created tables in Athena query editor by running a preview.

Share the database and tables from central to producer account

In the central governance account, you share the database and the two tables to the producer account and the Data-Analyst role in producer.

  1. Sign in to the Lake Formation console as the Lake Formation admin role.
  2. In the navigation pane, choose Data permissions.
  3. Choose Grant and provide the following information:
    1. For Principals, select External accounts and enter the producer account ID, as shown in the following screenshot.
    2. For Named Data Catalog Resources, select the default catalog and database bankdata_icebergdb, as shown in the following screenshot.
    3. Under Database permissions, select Describe. For Grantable permissions, select Describe.
    4. Choose Grant.
    5. Repeat the preceding steps to grant access to the producer account definer role Data-Analyst on the database bankdata_icebergdb and the two tables transaction_table1 and transaction_table2 as follows.
    6. Under Database permissions, grant Create table and Describe permissions.
    7. Under Table permissions, grant Select and Describe on all columns.

With these steps, the central governance account data admin steward has shared the database and tables to the producer account definer role.

Steps for producer account

Follow these steps for the producer account:

  1. Sign in to the Lake Formation console on the producer account as the Lake Formation administrator.
  2. In the left navigation pane, choose Databases. A blue banner will appear on the console, showing pending invitations from AWS Resource Access Manager (AWS RAM).
  3. Open the AWS RAM console and review the AWS RAM shares under Shared with me. You will see the AWS RAM shares in pending state. Select the pending AWS RAM share from central account and choose Accept resource share. After the resource share request is accepted, the shared database shows up in the producer account.
  4. On the Lake Formation console, select the database. On the Create dropdown list, choose Resource link. Provide a name rl_bank_iceberg and choose Create.
  5. Let’s grant Describe permission on the resource link to the Data-Analyst role in the producer account in the following steps.
    1. In the left navigation pane, choose Data permissions. Choose the Data-Analyst role. Select the resource link rl_bank_iceberg for the database as shown in the following screenshot.
    2. Grant Describe permission on the resource link.

Note: Cross-account Data Catalog views can’t be created using a resource link, although a resource link is needed for the SDK use of SPARK dialect.

Next, add the central account Data Catalog as a Data Source in Athena from producer account:

  1. Open the Athena console.
  2. On the left navigation pane, choose Data sources and catalogs. Choose Create data source.
    1. Select S3-AWS Glue Data Catalog.
    2. Choose AWS – Glue Data Catalog in another account and name the data source as centraladmin.
    3. Choose Next and then create data source.

After the data source is created, navigate to the Query editor and verify the Data source centraladmin appears, as shown in the following screenshot.

The definer role can also now access and query the central catalog database.

Create SPARK dialect view

In this step, you create a view with SPARK dialect, using AWS Glue CLI command create-table:

  1. Sign in to the AWS console in the producer account as Data-Analyst role. Enter the following command in your CLI environment, such as AWS CloudShell, to create a SPARK DIALECT:
aws glue create-table --cli-input-json '{
   "DatabaseName": "rl_bank_iceberg",
   "TableInput": {
     "Name": "mdv_transaction1",
     "StorageDescriptor": {
       "Columns": [
         {
           "Name": "transaction_id",
           "Type": "string"
         },
         {
           "Name": "transaction_type",
           "Type": "string"
         },
         {
           "Name": "transaction_amount",
           "Type": "float"
         },
         {
           "Name": "transaction_location",
           "Type": "string"
         },
         {
           "Name": "transaction_date",
           "Type": "date"
         }
       ],
       "SerdeInfo": {}
     },
     "ViewDefinition": {
       "SubObjects": [
         "arn:aws:glue:<your-region>:<your-central-account-id>:table/bankdata_icebergdb/transaction_table1",
         "arn:aws:glue:<your-region>:<your-central-account-id>:table/bankdata_icebergdb/transaction_table2"
        ],
       "IsProtected": true,
       "Representations": [
         {
           "Dialect": "SPARK",
           "DialectVersion": "1.0",
           "ViewOriginalText": "SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100;",
           "ViewExpandedText": "SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100;"
         }
       ]
     }
   }
 }'
  1. Open the Lake Formation console and verify if the view is created. Verify the dialect of the view on the SQL definitions tab for the view details.

Add ATHENA dialect

To add ATHENA dialect, follow these steps:

  1. On the Athena console, select centraladmin from the Data source.
  2. Enter the following SQL script to create the ATHENA dialect for the same view:
ALTER VIEW mdv_transaction1 FORCE ADD DIALECT
AS
SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100

We can’t use the resource link rl_bank_iceberg in the Athena query editor to create or alter a view in the central account.

  1. Verify the added dialect by running a preview in Athena. For running the query, you can use either the resource link rl_bank_iceberg from the producer account catalog or use the centraladmin catalog.

The following screenshot shows querying using the resource link of the database in the producer account catalog.

The following screenshot shows querying the view from the producer using the connected catalog centraladmin as the data source.

  1. Verify the dialects on the view by inspecting the table in the Lake Formation console.

You can now query the view as the Data-Analyst role in the producer account, using both Athena and Spark. The view will also show in the central account as shown in the following code example, with access to the Lake Formation admin.

You can also create the view with ATHENA dialect and add the SPARK dialect. The SQL syntax to create the view in ATHENA dialect is shown in the following example:

create protected multi dialect view mdv_transaction1
security definer
as
SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 
JOIN transaction_table2 t2 
ON t1.transaction_id = t2.transaction_id 
WHERE t1.transaction_amount > 100;

The update-table CLI to add the corresponding SPARK dialect is shown in the following example:

aws glue update-table --cli-input-json '{
    "DatabaseName": "rl_bankdatadb",
    "ViewUpdateAction": "ADD",
    "Force": true,
    "TableInput": {
        "Name": " mdv_transaction1",
        "StorageDescriptor": {
            "Columns": [
                {
                  "Name": "transaction_id",
                  "Type": "string"
                },
                {
                  "Name": "transaction_type",
                  "Type": "string"
                },
                {
                  "Name": "transaction_amount",
                  "Type": "float"
                },
                {
                  "Name": "transaction_location",
                  "Type": "string"
                },
                {
                  "Name": "transaction_date",
                  "Type": "date"
                }
             ],
             "SerdeInfo": {}
         },
         "ViewDefinition": {
         "SubObjects": [
               " "arn:aws:glue:<your-region>:<your-central-account-id>:table/bankdata_icebergdb/transaction_table1",
           "arn:aws:glue:<your-region>:<your-central-account-id>:table/bankdata_icebergdb/transaction_table2" 
],
         "IsProtected": true,
         "Representations": [
             {
                 "Dialect": "SPARK",
                 "DialectVersion": "1.0",
                 "ViewOriginalText": " SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100",
                 "ViewExpandedText": " SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100"
              }
           ]
        }
    }
}'

The following is a sample Python script to create a SPARK dialect view: glueview-createtable.py.

The following code block is a sample AWS Glue extract, transfer, and load (ETL) script to access the Spark dialect of the view from AWS Glue 5.0 from the central account. The AWS Glue job execution role should have Lake Formation SELECT permission on the AWS Glue view:

from pyspark.context import SparkContext
from pyspark.sql import SparkSession

aws_region = "<your-region>"
aws_account_id = "<your-central-account-id>"
local_catalogname = "spark_catalog"
warehouse_path = "s3://<your-bucket-name>/bankdata_icebergdb/transaction-table1"

spark = SparkSession.builder.appName('query_glue_view') \
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config(f'spark.sql.catalog.{local_catalogname}', 'org.apache.iceberg.spark.SparkSessionCatalog') \
    .config(f'spark.sql.catalog.{local_catalogname}.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog') \
    .config(f'spark.sql.catalog.{local_catalogname}.client.region', aws_region) \
    .config(f'spark.sql.catalog.{local_catalogname}.glue.account-id', aws_account_id) \
    .config(f'spark.sql.catalog.{local_catalogname}.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
    .config(f'spark.sql.catalog.{local_catalogname}.warehouse',warehouse_path) \
    .getOrCreate()
spark.sql(f"show databases").show()
spark.sql(f"SHOW TABLES IN {local_catalogname}.bankdata_icebergdb").show()
spark.sql(f"SELECT * FROM {local_catalogname}.bankdata_icebergdb. mdv_transaction1").show()

In the AWS Glue job-details, for Lake Formation managed tables and for Iceberg tables, set additional parameters respectively as follows:

--enable-lakeformation-fine-grained-access = true
--datalake-formats = iceberg

Cleanup

To avoid incurring costs, clean up the resources you used for this post:

  1. Revoke the Lake Formation permissions granted to the Data-Analyst role and Producer account
  2. Drop the Athena tables
  3. Delete the Athena query results from your Amazon Simple Storage Service (Amazon S3) bucket
  4. Delete the Data-Analyst role from IAM

Conclusion

In this post, we demonstrated how to use cross-account IAM definer roles with AWS Glue Data Catalog views. We showed how data owner accounts can create and manage views in a central governance account while maintaining security and control over their data assets. This feature enables enterprises to implement sophisticated data mesh architectures without compromising on security or requiring data duplication.

The ability to use cross-account definer roles with Data Catalog views provides several key advantages:

  • Streamlines view management in multi-account environments
  • Maintains existing CI/CD workflows and automation
  • Enhances security through centralized governance
  • Reduces operational overhead by eliminating the need for data duplication

As organizations continue to build and scale their data lakehouse architectures across multiple AWS accounts, cross-account definer roles for Data Catalog views provide a crucial capability for implementing efficient, secure, and well-governed data sharing patterns.


About the authors

Aarthi Srinivasan

Aarthi Srinivasan

Aarthi is a Senior Big Data Architect at Amazon Web Services (AWS). She works with AWS customers and partners to architect data lake solutions, enhance product features, and establish best practices for data governance.

Sundeep Kumar

Sundeep Kumar

Sundeep is a Sr. Specialist Solutions Architect at Amazon Web Services (AWS), helping customers build data lake and analytics platforms and solutions. When not building and designing data lakes, Sundeep enjoys listening to music and playing guitar.

Create and update Apache Iceberg tables with partitions in the AWS Glue Data Catalog using the AWS SDK and AWS CloudFormation

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/create-and-update-apache-iceberg-tables-with-partitions-in-the-aws-glue-data-catalog-using-the-aws-sdk-and-aws-cloudformation/

In recent years, we’ve witnessed a significant shift in how enterprises manage and analyze their ever-growing data lakes. At the forefront of this transformation is Apache Iceberg, an open table format that’s rapidly gaining traction among large-scale data consumers.

However, as enterprises scale their data lake implementations, managing these Iceberg tables at scale becomes challenging. Data teams often need to manage table schema evolution, its partitioning, and snapshots versions. Automation streamlines these operations, provides consistency, reduces human error, and helps data teams focus on higher-value tasks.

The AWS Glue Data Catalog now supports Iceberg table management using the AWS Glue API, AWS SDKs, and AWS CloudFormation. Previously, users had to create Iceberg tables in the Data Catalog without partitions using CloudFormation or SDKs and later add partitions from Amazon Athena or other analytics engines. This prevents the table lineage from being tracked in one place and adds steps outside automation in the continuous integration and delivery (CI/CD) pipeline for table maintenance operations. With the launch, AWS Glue customers can now use their preferred automation or infrastructure as code (IaC) tools to automate Iceberg table creation with partitions and use the same tools to manage schema updates and sort order.

In this post, we show how to create and update Iceberg tables with partitions in the Data Catalog using the AWS SDK and CloudFormation.

Solution overview

In the following sections, we illustrate the AWS SDK for Python (Boto3) and AWS Command Line Interface (AWS CLI) usage of Data Catalog APIs—CreateTable() and UpdateTable()—for Amazon Simple Storage Service (Amazon S3) based Iceberg tables with partitions. We also provide the CloudFormation templates to create and update an Iceberg table with partitions.

Prerequisites

The Data Catalog API changes are made available in the following versions of the AWS CLI and SDK for Python:

  • AWS CLI version of 2.27.58 or above
  • SDK for Python version of 1.39.12 or above

AWS CLI usage

Let’s create an Iceberg table with one partition, using CreateTable() in the AWS CLI:

aws glue create-table --cli-input-json file://createicebergtable.json

The createicebergtable.json is as follows:

{
    "CatalogId": "123456789012",
    "DatabaseName": "bankdata_icebergdb",
    "Name": "transactiontable1",
    "OpenTableFormatInput": { 
      "IcebergInput": { 
         "MetadataOperation": "CREATE",
         "Version": "2",
         "CreateIcebergTableInput": { 
            "Location": "s3://sampledatabucket/bankdataiceberg/transactiontable1/",
            "Schema": {
                "SchemaId": 0,
                "Type": "struct",
                "Fields": [ 
                    { 
                        "Id": 1,
                        "Name": "transaction_id",
                        "Required": true,
                        "Type": "string"
                    },
                    { 
                        "Id": 2,
                        "Name": "transaction_date",
                        "Required": true,
                        "Type": "date"
                    },
                    { 
                        "Id": 3,
                        "Name": "monthly_balance",
                        "Required": true,
                        "Type": "float"
                    }
                ]
            },
            "PartitionSpec": { 
                "Fields": [ 
                    { 
                        "Name": "by_year",
                        "SourceId": 2,
                        "Transform": "year"
                    }
                ],
                "SpecId": 0
            },
            "WriteOrder": { 
                "Fields": [ 
                    { 
                        "Direction": "asc",
                        "NullOrder": "nulls-last",
                        "SourceId": 1,
                        "Transform": "none"
                    }
                ],
                "OrderId": 1
            }  
        }
      }
   }
}

The preceding AWS CLI command creates the metadata folder for the Iceberg table in Amazon S3, as shown in the following screenshot.

Amazon S3 bucket interface showing metadata folder containing single JSON file dated November 6, 2025

You can populate the table with values as follows and verify the table schema using the Athena console:

SELECT * FROM "bankdata_icebergdb"."transactiontable1" limit 10;
insert into bankdata_icebergdb.transactiontable1 values
    ('AFTERCREATE1234', DATE '2024-08-23', 6789.99),
    ('AFTERCREATE5678', DATE '2023-10-23', 1234.99);
SELECT * FROM "bankdata_icebergdb"."transactiontable1";

The following screenshot shows the results.

Amazon Athena query editor showing SQL queries and results for bankdata_icebergdb database with transaction data

After populating the table with data, you can inspect the S3 prefix of the table, which will now have the data folder.

Amazon S3 bucket interface displaying data folder with two subfolders organized by year: 2023 and 2024

The data folders partitioned according to our table definition and Parquet data files created from our INSERT command are available under each partitioned prefix.

Amazon S3 bucket interface showing by_year=2023 folder containing single Parquet file of 575 bytes

Next, we update the Iceberg table by adding a new partition, using UpdateTable():

aws glue update-table --cli-input-json file://updateicebergtable.json

The updateicebergtable.json is as follows.

{
  "CatalogId": "123456789012",
  "DatabaseName": "bankdata_icebergdb",
  "Name": "transactiontable1",
  "UpdateOpenTableFormatInput": {
    "UpdateIcebergInput": {
      "UpdateIcebergTableInput": {
        "Updates": [
          {
            "Location": "s3://sampledatabucket/bankdataiceberg/transactiontable1/",
            "Schema": {
              "SchemaId": 1,
              "Type": "struct",
              "Fields": [
                {
                  "Id": 1,
                  "Name": "transaction_id",
                  "Required": true,
                  "Type": "string"
                },
                {
                  "Id": 2,
                  "Name": "transaction_date",
                  "Required": true,
                  "Type": "date"
                },
                {
                  "Id": 3,
                  "Name": "monthly_balance",
                  "Required": true,
                  "Type": "float"
                }
              ]
            },
            "PartitionSpec": {
              "Fields": [
                {
                  "Name": "by_year",
                  "SourceId": 2,
                  "Transform": "year"
                },
                {
                  "Name": "by_transactionid",
                  "SourceId": 1,
                  "Transform": "identity"
                }
              ],
              "SpecId": 1
            },
            "SortOrder": {
              "Fields": [
                {
                  "Direction": "asc",
                  "NullOrder": "nulls-last",
                  "SourceId": 1,
                  "Transform": "none"
                }
              ],
              "OrderId": 2
            }
          }
        ]
      }
    }
  }
}

UpdateTable() modifies the table schema by adding a metadata JSON file to the underlying metadata folder of the table in Amazon S3.

Amazon S3 bucket interface showing 5 metadata objects including JSON and Avro files with timestamps

We insert values into the table using Athena as follows:

insert into bankdata_icebergdb.transactiontable1 values
    ('AFTERUPDATE1234', DATE '2025-08-23', 4536.00),
    ('AFTERUPDATE5678', DATE '2022-10-23', 23489.00);
SELECT * FROM "bankdata_icebergdb"."transactiontable1";

The following screenshot shows the results.

Amazon Athena query editor with SQL statements and results after iceberg partition update and insert data

Inspect the corresponding changes to the data folder in the Amazon S3 location of the table.

Amazon S3 prefix showing new partitions for the Iceberg table

This example has illustrated how to create and update Iceberg tables with partitions using AWS CLI commands.

SDK for Python usage

The following Python scripts illustrate using CreateTable() and UpdateTable() for an Iceberg table with partitions:

CloudFormation usage

Use the following CloudFormation templates for CreateTable() and UpdateTable(). After the CreateTable template is complete, update the same stack with the UpdateTable template by creating a new changeset for your stack and executing it.

Clean up

To avoid incurring costs on the Iceberg tables created using the AWS CLI, delete the tables from the Data Catalog.

Conclusion

In this post, we illustrated how to use the AWS CLI to create and update Iceberg tables with partitions in the Data Catalog. We also provided the SDK for Python and CloudFormation sample code and templates. We hope this helps you automate the creation and management of your Iceberg tables with partitions in your CI/CD pipelines and production environments. Try it out for your own use case and share your feedback in the comments section.


About the authors

Acknowledgements: A special thanks to everyone who contributed to the development and launch of this feature – Purvaja Narayanaswamy, Sachet Saurabh, Akhil Yendluri and Mohit Chandak.

Aarthi Srinivasan

Aarthi Srinivasan

Aarthi is a Senior Big Data Architect with AWS. She works with AWS customers and partners to architect data lake house solutions, enhance product features, and establish best practices for data governance.

Pratik Das

Pratik Das

Pratik is a Senior Product Manager with AWS. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems in production.

Use trusted identity propagation for Apache Spark interactive sessions in Amazon SageMaker Unified Studio

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/use-trusted-identity-propagation-for-apache-spark-interactive-sessions-in-amazon-sagemaker-unified-studio/

Amazon SageMaker Unified Studio introduces support for running interactive Apache Spark sessions with your corporate identities through trusted identity propagation. These Spark interactive sessions are available using Amazon EMR, Amazon EMR Serverless, and AWS Glue. Enterprises with their workforce corporate identity provider (IdP) integrated with AWS IAM Identity Center can now use their IAM Identity Center user and group identity seamlessly with SageMaker Unified Studio to access AWS Glue Data Catalog databases and tables.

Administrators of AWS services can use trusted identity propagation in IAM Identity Center to grant permissions based on user attributes, such as user ID or group associations. With trusted identity propagation, identity context is added to an IAM role to identify the user requesting access to AWS resources and is further propagated to other AWS services when requests are made. Until now, Spark sessions in SageMaker Unified Studio used the project IAM role for managing data access permissions for all members of the project. This provided fine-grained access control at the project IAM role level and not at the user level. Now, with the trusted identity propagation enabled in the SageMaker Unified Studio domain, the data access can be fine-grained at the user or group level.

The trusted identity propagation support for Spark interactive sessions makes the SageMaker Unified Studio a holistic offering for enterprise data users. Enabling trusted identity propagation in SageMaker Unified Studio saves time by avoiding the repeated permission grants to new project IAM roles and enhances security auditing with the IAM Identity Center user or group ID in the AWS CloudTrail logs.

The following are some of the use cases for trusted identity propagation in Spark sessions for SageMaker Unified Studio:

  • Single sign-on experience with AWS analytics – For customers using enterprise data mesh built using AWS Lake Formation, single sign-on experience with trusted identity propagation is available for Spark applications through EMR Studio attached with Amazon EMR on EC2 and SQL experience through Amazon Athena query editor inside EMR Studio. With the addition of EMR Serverless, Amazon EMR on EC2, and AWS Glue for Spark sessions with trusted identity propagation enabled in SageMaker Unified Studio, the single sign-on experience is expanded to provide easier options for the data scientists and developers.
  • Fine-grained access control based on user identity or group membership– Use a single project within the SageMaker Unified Studio domain across multiple data scientists, with the fine-grained permissions of AWS Lake Formation. When a data scientist accesses the AWS Glue Data Catalog table, the session is now enabled by their IAM Identity Center user or group permissions. Further, each can use their preferred tool, such as EMR Serverless, AWS Glue, or Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), for the Spark sessions inside SageMaker Unified Studio.
  • Isolated user sessions – The Spark interactive sessions in SageMaker Unified Studio are securely isolated for each IAM Identity Center user. With secure sessions, data teams can focus more on business data exploration and faster development cycles, rather than building guardrails.
  • Auditing and reporting – Customers in regulated industries need strict compliance reports showing fine-grained details of their data access. CloudTrail logs provide the additionalContext field with the details of IAM Identity Center user ID or group ID and the analytics engine that accessed the Data Catalog tables from SageMaker Unified Studio.
  • Expand and scale with unified governance model – Customers who are already using Amazon Redshift, Amazon QuickSight and AWS Lake Formation permissions integrated with IAM Identity Center can now expand their ML and data analytics platform to include Spark sessions with EMR Serverless and AWS Glue options in SageMaker Unified Studio. They don’t have to maintain IAM role-based policy permissions. Trusted identity propagation for Spark sessions in SageMaker Unified Studio scales the existing permissions mechanism to a wider community of data scientists and developers.

In this post, we provide step-by-step instructions to set up Amazon EMR on EC2, EMR Serverless, and AWS Glue within SageMaker Unified Studio, enabled with trusted identity propagation. We use the setup to illustrate how different IAM Identity Center users can run their Spark sessions, using each compute setup, within the same project in SageMaker Unified Studio. We show how each user will see only tables or part of tables that they’re granted access to in Lake Formation.

Solution overview

A financial services company processes data from millions of retail banking transactions per day, pooled into their centralized data lake and accessed by traditional corporate identities. Their machine learning (ML) platform team would like to enable thousands of their data scientists, working across different teams, with the right dataset and tools in a secure, scalable and auditable fashion. The platform team chooses to use SageMaker Unified Studio, integrate their IdP with IAM Identity Center, and manage access for their data scientists on the data lake tables using fine-grained Lake Formation permissions.

In our sample implementation, we show how to enable three different data scientists—Arnav, Maria, and Wei—belonging to two different teams, to access the same datasets, but with different levels of access. We use Lake Formation tags to grant column restricted access and have the three data scientists run their Spark sessions within the same SageMaker Unified Studio project. When the individual users sign in to the SageMaker Unified Studio project, their IDC user or group identity context is added to the SageMaker Unified Studio project execution role, and their fine-grained permissions from Lake Formation on the catalog tables are effective. We show how their data exploration is isolated and unique.

The following diagram shows an instance of how an enterprise workforce IdP, integrated with IAM Identity Center, would make the users and groups available for use by AWS services. Here, Lake Formation and SageMaker Unified Studio domain are integrated with IAM Identity Center and trusted identity propagation is enabled. In this setup, (a) data permissions are granted to the IDC user or group identities directly instead of IAM roles (b) the user identity context is available end-to-end (c) data access control is centralized in Lake Formation no matter which analytics service the user uses.

Prerequisites

Working with IAM Identity Center and the AWS services that integrate with IAM Identity Center requires several steps. In this post we use one AWS account with IAM Identity Center enabled and a SageMaker Unified Studio domain created. We recommend that you use a test account to follow along the blog.

You need the following prerequisites:

  • An AWS account setup with an IAM administrator role that has permissions to work with IAM Identity Center, Lake Formation, Amazon Simple Storage Service (Amazon S3), CloudTrail, SageMaker Unified Studio, Amazon EMR on EC2, EMR Serverless, and AWS Glue.
  • Enable IAM Identity Center in the account. For details, refer to Enable IAM Identity Center.
    1. Three IAM Identity Center users (Arnav, Maria, and Wei) and two groups (DataScientists and MarketAnalytics). For instructions on creating IAM Identity Center users, refer to Add users to your Identity Center directory. For instructions on creating groups, refer to Add groups to your Identity Center directory.
    2. Add Arnav and Maria to the DataScientists group and add Wei to the MarketAnalytics group. For instructions on adding users to groups, refer to Add users to groups.

    The following screenshot shows users Maria and Arnav in the DataScientists group.

    following screenshot shows user Wei in the MarketAnalytics group.

  • Configure Lake Formation. For detailed instructions, refer to Data lake administrator permissions and Set up AWS Lake Formation in the Lake Formation documentation.
    1. Integrate Lake Formation with the IAM Identity Center instance. For instructions, refer to Integrating IAM Identity Center.
  • A database and a table created in AWS Glue Data Catalog, with the table data in an S3 bucket.
    1. For the sample dataset and table used in this post, refer to Appendix A.
  • Lake Formation tag-based permissions for the three IAM Identity Center users on the Data Catalog table.
    1. For creating and assigning LF-Tags to Data Catalog tables, refer to Creating LF-Tags, and Assigning LF-Tags to Data Catalog resources.
    2. For granting permissions using LF-Tags, refer to Granting data lake permissions using the LF-TBAC method.
    3. We have shown the sample LF-Tags and permissions for the IAM Identity Center users in Appendix B.
  • A SageMaker Unified Studio domain domain-tip-smus-blog. For instructions to create a SageMaker Unified Studio domain, refer to the quick setup guide in the SageMaker Unified Studio documentation.
    1. The domain should be enabled with trusted identity propagation, following the instructions in Trusted identity propagation.
    2. The domain’s project profile should be enabled with Amazon EMR on EC2. You can choose either General purpose or Memory-Optimized profile. You will have to provide a value for certificateLocation, as shown in the following screenshot. For detailed instructions, refer to Specify PEM certificate for EmrOnEc2 blueprint. For this post, you can use OpenSSL to generate a self-signed X.509 certificate with a 2048-bit RSA private key. Detailed instructions for creating one are at the bottom of Create keys and certificates for data encryption with Amazon EMR.
    3. The two IAM Identity Center groups (DataScientists and MarketAnalytics) should be added to the domain as users. For instructions, refer to Managing users in Amazon SageMaker Unified Studio.

Create a project in SageMaker Unified Studio

Now that DataScientists and MarketAnalytics groups are granted access to the domain, IAM Identity Center users belonging to those two groups can sign in to the SageMaker Unified Studio portal for the next steps. Follow these steps:

  1. Sign in to the SageMaker Unified Studio portal as single sign-on user Arnav.
  2. Create a project blogproject_tip_enabled under the domain, as shown in the following screenshot. For details, follow the instructions in Create a project.
  3. Select All capabilities for Project profile, as shown in the following screenshot. Leave the other parameters to default values.

Arnav would like to collaborate with other team members. After creating the project, he grants access on the project to additional IAM Identity Center groups. He adds the two IAM Identity Center groups, DataScientists and MarketAnalytics, as Members of type Contributor to the project, as shown in the following screenshot.

So far, you’ve set up IAM Identity Center, created users and groups, created a SageMaker Unified Studio domain and project, and added the IAM Identity Center groups as users to the domain and the project. In the rest of the sections, we set up the three types of computes for Spark interactive session and enter a query on the Lake Formation managed tables as individual IAM Identity Center users Arnav, Maria, and Wei.

Set up EMR Serverless

In this section, we set up an EMR Serverless compute and run a Spark interactive session as Arnav.

  1. Sign in to the SageMaker Unified Studio domain as the single sign-on user Arnav. Refer to the domain’s detail page to get the URL.
  2. After signing in as Arnav, select the project blogproject_tip_enabled. From the left navigation pane, choose Compute. On the Data processing tab, choose Add compute.
  3. Under Add compute, choose Create new compute resources, as shown in the following screenshot.
  4. Choose EMR Serverless.
  5. Under Release label, choose minimum version 7.8.0 and choose Fine-grained.
  6. After the EMR Serverless compute is in Created status, on the Actions dropdown list, choose Open JupyterLab IDE. This will open a Jupyter Notebook session.
  7. When the Jupyter notebook opens, you will see a banner to update the SageMaker Distribution image to version 2.9. Follow the instructions in Editing a space and update the space to use version 2.9. Save the space and restart after update.
  8. Open the space after it finishes updating. This will open the Jupyter notebook.

    Now, your environment is ready, and you can run Spark queries and test your access to the table bankdata_icebergtbl.
  9. On the Launcher window, under Notebook, choose Python 3(ipykernel).
  10. On the top part of the notebook cell, choose PySpark from the kernel dropdown list and emr-s.blog_tipspark_emrserverless from the Compute dropdown list.
  11. Run the following query:
    spark.sql(“select * from bankdata_db.bankdata_icebergtbl limit 10”).show()

Because Arnav is part of the DataScientists group, he should see all columns of the table, as shown in the following screenshot.

This verifies LF-Tags based access for Arnav on the bankdata_db.bankdata_icebergtbl using a Spark session in EMR Serverless compute.

Set up AWS Glue 5.0

In this section, we set up AWS Glue compute and run a Spark interactive session as Maria.

  1. Sign in to the SageMaker Unified Studio domain as the single sign-on user Maria.
  2. Choose the project blogproject_tip_enabled. From the left navigation pane, choose Compute. On Data processing tab, you should see two computes created by default in Active status (project.spark.compatibility and project.spark.fineGrained) with Type Glue ETL. For additional details on these compute types, refer to AWS Glue ETL in Amazon SageMaker Unified Studio.
  3. Select the project.spark.fineGrained and launch the Jupyter notebook with the PySpark kernel.
  4. For the notebook cell, choose pySpark for kernel and project.spark.fineGrained for compute. Enter the following query:
    sspark.sql(“select * from bankdata_db.bankdata_icebergtbl limit 10”).show()

Because Maria is part of the DataScientists group, she should see all columns of the table, as shown in the following screenshot.

This verifies LF-Tags based access to Maria on the bankdata_db.bankdata_icebergtbl using Spark session in AWS Glue fine-grained access control (FGAC) compute.

To verify what access Wei has using EMR Serverless and AWS Glue, you can sign out and sign in as user Wei. Enter the Spark SELECT queries on the same table. Wei shouldn’t see the three personally identifiable information (PII) columns transaction_id, bank_account_number, and initiator_name, which were tagged as transactions=secured.

The following screenshot shows the same table for Wei using EMR Serverless.

The following screenshot shows the same table for Wei using AWS Glue FGAC mode.

Set up Amazon EMR on EC2

In this section, we set up an Amazon EMR on EC2 compute and run a Spark interactive session as Wei.

  1. Sign in to the SageMaker Unified Studio domain as the single sign-on user Wei.
  2. Create Amazon EMR on EC2 compute using the steps for EMR Serverless in Setup EMR serverless but choose EMR on EC2 cluster instead of EMR Serverless. For the EMR configuration, choose the MemoryOptimized or GeneralPurpose configuration, depending on which one you chose to upload your PEM certificates to in the project profiles blueprint in the Prerequisites section. Choose an Amazon EMR release label greater than or equal to 7.8.0.
  3. After the cluster is provisioned, locate the instance profile role name in the compute details page, as shown in the following screenshot.
  4. As an admin user who can edit IAM policies in your account, add the following inline policy to the instance profile role. A manual intervention outside SageMaker Unified Studio is required currently to perform this step. This will be addressed in the future.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "IdCPermissions",
                "Effect": "Allow",
                "Action": [
                    "sso-oauth:CreateTokenWithIAM",
                    "sso-oauth:IntrospectTokenWithIAM",
                    "sso-oauth:RevokeTokenWithIAM"
                ],
                "Resource": "*"
            },
            {
                "Sid": "AllowAssumeRole",
                "Effect": "Allow",
                "Action": [
                    "sts:AssumeRole"
                ],
                "Resource": [
                    "<instance profile role ARN>"
                ]
            }
        ]
    }

  5. After updating the role’s policy, you can use the Amazon EMR on EC2 connection to initiate an interactive Spark session. Similar to how you launched a notebook as Arnav and Maria, do the same steps to launch the notebook as user Wei.
    1. On the Build tab, choose JupyterNotebook from the project home page. Choose Python3(ipykernel) to launch the notebook. Choose Configure space to update to version 2.9. Refresh the notebook browser.
    2. Inside the notebook, on top of the cell, choose PySpark for kernel and emr.blog_tip_emronec2 that you launched for the compute.
  6. Enter a select query on the table as follows:
    spark.sql(“select * from bankdata_db.bankdata_icebergtbl limit 10”).show()

This verifies that Wei, as part of the MarketAnalytics group, sees all columns of the table with LF-Tags transactions=accessible but doesn’t have access to the three columns that were overwritten with LF-Tags transactions=secured (transaction_id, bank_account_number, and initiator_name).

You can trace the user access of the table in the CloudTrail logs for EventName=GetDataAccess. In the relevant CloudTrail log shown below, we notice that the UserID for Wei is provided under additionalEventData field, whereas requestParameters has the tableARN.

The user ID for Wei is available in the IAM Identity Center console under General information.

Thus, we were able to sign in as an individual IAM Identity Center user to the SageMaker Unified Studio domain and query the Data Catalog tables using Amazon EMR and AWS Glue compute. These IAM Identity Center users were able to query the tables that they were granted access to, instead of the SageMaker Unified Studio project’s IAM role.

Cleanup

To avoid incurring costs, it’s important to delete the resources launched for this walkthrough. Clean up the resources as follows:

  1. SageMaker Unified Studio by default shuts down idle resources such as JupyterLab after 1 hour. If you’ve created a SageMaker Unified Studio domain for this post, remember to delete the domain.
  2. If you’ve created IAM Identity Center users and groups, delete the users and delete the groups. Further, if you’ve created an IAM Identity Center instance only for this post, delete your IAM Identity Center instance.
  3. Delete the database bankdata_db from Lake Formation. This will also delete the tables and all associated permissions. Delete the LF-Tag transactions and its values.
  4. Delete the table’s corresponding data from your S3 bucket two subfolders bankdata-csv and bankdata-iceberg.

Conclusion

In this post, we walked through how to enable a SageMaker Unified Studio domain with IAM Identity Center trusted identity propagation and query Lake Formation managed tables in Data Catalog using Apache Spark interactive sessions with EMR Serverless, AWS Glue, and Amazon EMR on EC2. We also verified in CloudTrail logs the IAM Identity Center user ID accessing the table.

Amazon SageMaker Unified Studio with trusted identity propagation provides the following benefits.

Business benefits

  • Enhanced data security
  • Improved workforce data access and insights

Technical capabilities

  • Enables data access based on workforce identity
  • Provides unified governance through Lake Formation for Data Catalog tables when accessed through SMUS
  • Ensures isolated and secure sessions for each IAM Identity Center user
  • Supports multiple analytics options:
    • Spark sessions via EMR Serverless, EMR on EC2, and AWS Glue
    • SQL analytics through Athena and Redshift Spectrum

Organizational advantages

  • Direct use of corporate identities for enterprise data access
  • Simplified access to data platforms and meshes built on Data Catalog and Lake Formation
  • Enables various user roles to work with their preferred AWS analytics services
  • Reduces data exploration time for Spark-familiar data scientists

To learn more, refer to the following resources:

We encourage you to check out the new trusted identity propagation enabled SageMaker Unified Studio for Spark sessions. Reach out to us through your AWS account teams or using the comments section.

Acknowledgment: A special thanks to everyone who contributed to the development and launch of this feature: Palani Nagarajan, Karthik Seshadri, Vikrant Kumar, Yijie Yan, Radhika Ravirala and Jerica Nicholls.

APPENDIX A – Table creation in Data Catalog

  1. We’ve created a synthetic bank transactions dataset with 100 rows in CSV format. Download the dataset dummy_bank_transaction_data.csv
  2. In your S3 bucket, create two subfolders: bankdata-csv and bankdata-iceberg and upload the dataset to bankdata-csv.
  3. Open the Athena console, navigate to query editor, and enter the following statements in sequence:
    -- Create database for the blog
    CREATE DATABASE bankdata_db;
    
    -- Create external table from the CSV file. Provide your S3 bucket name for the table location
    
    CREATE EXTERNAL TABLE bankdata_db.bankdata_csvtbl(
     `transaction_id` string, 
      `transaction_date` date, 
      `transaction_type` string,
      `bank_account_number` string,
      `initiator_name` string,
      `transaction_country` string, 
      `transaction_amount` double, 
      `merchant_name` string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://<your-bucket-name>/bankdata-csv/'
    TBLPROPERTIES (
      'areColumnsQuoted'='false', 
      'classification'='csv', 
      'skip.header.line.count'='1',
      'columnsOrdered'='true', 
      'compressionType'='none', 
      'delimiter'=',', 
      'typeOfData'='file');
     
    -- Create Iceberg table for the blog use. Provide your S3 bucket name for the table location
    
    CREATE TABLE bankdata_db.bankdata_icebergtbl WITH (
      table_type='ICEBERG',
      format='parquet',
      write_compression = 'SNAPPY',
      is_external = false,
      partitioning=ARRAY['transaction_type'],
      location='s3://<your-bucket-name>/bankdata-iceberg/'
    ) AS SELECT * FROM bankdata_db.bankdata_csvtbl;

  4. Enter a preview and verify the table data:
    SELECT * FROM bankdata_db.bankdata_icebergtbl limit 10;

APPENDIX B – Creating LF-Tags, attaching tags to the table from Appendix A, and granting permissions to IAM Identity Center users.

We create a Lake Formation tag with Keyname = transactions and Values = secured, accessible. We associate the tag to the table and overwrite a few columns as summarized in the table.

Resource

LF-Tag association

Database

bankdata_db

transactions = accessible

Table

bankdata_icebergtbl

transactions = accessible
Columns transaction_id transactions = secured
bank_account_number transactions = secured
initiator_name transactions = secured

We then grant Lake Formation permissions to the two IAM Identity Center groups using these LF-Tags as follows:

IAM Identity Center group

LF-Tags

Permission

DataScientists

transactions = accessible AND transactions = secured

Database DESCRIBE, Table SELECT

MarketAnalytics

transactions = accessible

Database DESCRIBE, Table SELECT
  1. Sign in to the Lake Formation console and navigate to LF-Tags and permissions. Create an LF-Tag with Keyname = transactions and Values = secured, accessible.
  2. Select the database bankdata_db and associate the LF-Tag transactions=accessible.
  3. Select bankdata_icebergtbl and verify that the LF-Tag transactions=accessible is inherited by the table.
  4. Edit the schema of the table and change the LF-Tag value on the columns transaction_id, bank_account_number, and initiator_name to transactions=secured. After changing, choose Save as new version.


  5. Navigate to the Data permissions page on the Lake Formation console. Choose Grant to grant permissions.
  6. Select the IAM Identity Center group DataScientists for Principals. Select LF-Tags transactions and both the values accessible, secured. Choose Database DESCRIBE and Tables SELECT permissions. Choose Grant.
  7. On the Data permissions page on the Lake Formation console, choose Grant again.
  8. Select the IAM Identity Center group MarketAnalytics for Principals. Select LF-Tags transactions and only one of the values, accessible. Select Database DESCRIBE and Tables SELECT permissions. Choose Grant.
  9. Also grant DESCRIBE permission on the default database to both the IDC groups.
  10. Verify the granted permissions in the Data permissions page, by filtering with expression Principal type = IAM Identity Center group.

Thus, we’ve granted all column access on the table bankdata_icebergtbl to the DataScientists group while securing three PII columns from the MarketAnalytics group.


About the Authors

Aarthi Srinivasan

Aarthi Srinivasan

Aarthi is a Senior Big Data Architect at Amazon Web Services (AWS). She works with AWS customers and partners to architect data lake solutions, enhance product features, and establish best practices for data governance.

Palani Nagarajan

Palani Nagarajan

Palani is a Senior Software Development Engineer with Amazon SageMaker Unified Studio. In his free time, he enjoys playing board games, traveling to new cities, and hiking scenic trails.

Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/using-aws-glue-data-catalog-views-with-apache-spark-in-emr-serverless-and-glue-5-0/

The AWS Glue Data Catalog has expanded its Data Catalog views feature, and now supports Apache Spark environments in addition to Amazon Athena and Amazon Redshift. This enhancement, launched in March 2025, now makes it possible to create, share, and query multi-engine SQL views across Amazon EMR Serverless, Amazon EMR on Amazon EKS, and AWS Glue 5.0 Spark, as well as Athena and Amazon Redshift Spectrum. The multi-dialect views empower data teams to create SQL views one time and query them through supported engines—whether it’s Athena for ad-hoc analytics, Amazon Redshift for data warehousing, or Spark for large-scale data processing. This cross-engine compatibility means data engineers can focus on building data products rather than managing multiple view definitions or complex permission schemes. Using AWS Lake Formation permissions, organizations can share these views within the same AWS account, across different AWS accounts, and with AWS IAM Identity Center users and groups, without granting direct access to the underlying tables. Features of Lake Formation such as fine-grained access control (FGAC) using Lake Formation-tag based access control (LF-TBAC) can be applied to Data Catalog views, enabling scalable sharing and access control across organizations.

In an earlier blog post, we demonstrated the creation of Data Catalog views using Athena, adding a SQL dialect for Amazon Redshift, and querying the view using Athena and Amazon Redshift. In this post, we guide you through the process of creating a Data Catalog view using EMR Serverless, adding the SQL dialect to the view for Athena, sharing it with another account using LF-Tags, and then querying the view in the recipient account using a separate EMR Serverless workspace and AWS Glue 5.0 Spark job and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and access through various AWS analytics services.

Benefits of Data Catalog views

The following are key benefits of Data Catalog views for business solutions:

  • Targeted data sharing and access control – Data Catalog views, combined with the sharing capabilities of Lake Formation, enable organizations to provide specific data subsets to different teams or departments without duplicating data. For example, a retail company can create views that show sales data to regional managers while restricting access to sensitive customer information. By applying LF-TBAC to these views, companies can efficiently manage data access across large, complex organizational structures, maintaining compliance with data governance policies while promoting data-driven decision-making.
  • Multi-service analytics integration – The ability to create a view in one analytics service and query it across Athena, Amazon Redshift, EMR Serverless, and AWS Glue 5.0 Spark breaks down data silos and promotes a unified analytics approach. This feature allows businesses to use the strengths of different services for various analytics needs. For instance, a financial institution could create a view of transaction data and use Athena for ad-hoc queries, Amazon Redshift for complex aggregations, and EMR Serverless for large-scale data processing—all without moving or duplicating the data. This flexibility accelerates insights and improves resource utilization across the analytics stack.
  • Centralized auditing and compliance – With views stored in the central Data Catalog, businesses can maintain a comprehensive audit trail of data access across connected accounts using AWS CloudTrail logs. This centralization is crucial for industries with strict regulatory requirements, such as healthcare or finance. Compliance officers can seamlessly monitor and report on data access patterns, detect unusual activities, and demonstrate adherence to data protection regulations like GDPR or HIPAA. This centralized approach simplifies compliance processes and reduces the risk of regulatory violations.

These capabilities of Data Catalog views provide powerful solutions for businesses to enhance data governance, improve analytics efficiency, and maintain robust compliance measures across their data ecosystem.

Solution overview

An example company has multiple datasets containing details of their customers’ purchase details mixed with personally identifiable information (PII) data. They categorize their datasets based on sensitivity of the information. The data steward wants to share a subset of their preferred customers data for further analysis downstream by their data engineering team.

To demonstrate this use case, we use sample Apache Iceberg tables customer and customer_address. We create a Data Catalog view from these two tables to filter by preferred customers. We then use LF-Tags to share restricted columns of this view to the downstream engineering team. The solution is represented in the following diagram.

arch diagram

Prerequisites

To implement this solution, you need two AWS accounts with an AWS Identity and Access Management (IAM) admin role. We use the role to run the provided AWS CloudFormation templates and also use the same IAM roles added as Lake Formation administrator.

Set up infrastructure in the producer account

We provide a CloudFormation template that deploys the following resources and completes the data lake setup:

  • Two Amazon Simple Storage Service (Amazon S3) buckets: one for scripts, logs, and query results, and one for the data lake storage.
  • Lake Formation administrator and catalog settings. The IAM admin role that you provide is registered as Lake Formation administrator. Cross-account sharing version is set to 4. Default permissions for newly created databases and tables is set to use Lake Formation permissions only.
    data catalog settings
  • An IAM role with read, write, and delete permissions on the data lake bucket objects. The data lake bucket is registered with Lake Formation using this IAM role.
    data lake locations
  • An AWS Glue database for the data lake.
  • Lake Formation tags. These tags are attached to the database.
    lf-tags
  • CSV and Iceberg format tables in the AWS Glue database. The CSV tables are pointing to s3://redshift-downloads/TPC-DS/2.13/10GB/ and the Iceberg tables are stored in the user account’s data lake bucket.
  • An Athena workgroup.
  • An IAM role and an AWS Lambda function to run Athena queries. Athena queries are run in the Athena workgroup to insert data from CSV tables to Iceberg tables. Relevant Lake Formation permissions are granted to the Lambda role.
    lf-tables
  • An EMR Studio and related virtual private cloud (VPC), subnet, routing table, security groups, and EMR Studio service IAM role.
  • An IAM role with policies for the EMR Studio runtime. Relevant Lake Formation permissions are granted to this role on the Iceberg tables. This role will be used as the definer role to create the Data Catalog view. A definer role is the IAM role with necessary permissions to access the referenced tables, and runs the SQL statement that defines the view.

Complete the following steps in your producer AWS account:

  1. Sign in to the AWS Management Console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the CloudFormation has completed launching, proceed with the following instructions.

  1. If you’re using the producer account in Lake Formation for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to runtime role GlueViewBlog-EMRStudio-RuntimeRole.
    data permissions

Create an EMR Serverless application

Complete the following steps to create an EMR Serverless application in your EMR Studio:

  1. On the Amazon EMR console, under EMR Studio in the navigation pane, choose Studios.
  2. Choose GlueViewBlog-emrstudio and choose the URL link of the Studio to open it.
    glueviewblog-emrstudio
  3. On the EMR Studio dashboard, choose Create application.
    emr-studio-dashboard

You will be directed to the Create application page on EMR Studio. Let’s create a Lake Formation enabled EMR Serverless application.

  1. Under Application settings, provide the following information:
    1. For Name, enter a name (for example, emr-glueview-application).
    2. For Type, choose Spark.
    3. For Release version, choose emr-7.8.0.
    4. For Architecture, choose x86_64.
  2. Under Application setup options, select Use custom settings.
  3. Under Interactive endpoint, select Enable endpoint for EMR studio.
  4. Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
  5. Under Network connections, choose emrs-vpc for VPC, enter any two private subnets, and enter emr-serverless-sg for Security groups.
  6. Choose Create and start the application.

Create an EMR Workspace

Complete the following steps to create an EMR Workspace:

  1. On the EMR Studio console, choose Workspaces in the navigation pane and choose Create Workspace.
  2. Enter a Workspace name (for example, emrs-glueviewblog-workspace).
  3. Leave all other settings as default and choose Create Workspace.
  4. Choose Launch Workspace. Your browser might request to allow pop-up permissions for the first time launching the Workspace.
  5. After the Workspace is launched, in the navigation pane, choose Compute.
  6. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-RuntimeRole for Interactive runtime role.
  7. Make sure the kernel attached to the Workspace is PySpark.

Create a Data Catalog view and verify

Complete the following steps:

  1. Download the notebook glueviewblog_producer.ipynb. The code creates a Data Catalog view customer_nonpii_view from the two Iceberg tables, customer_iceberg and customer_address_iceberg, in the database glueviewblog_<account-id>_db.
  2. On your EMR Workspace emrs-glueviewblog-workspace, go to the File browser section and choose Upload files.
  3. Upload glueviewblog_producer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and AWS Region to match your resources.
  5. Update the database_name, table1_name, and table2_name to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel and rerun the notebook.

The Data Catalog view customer_nonpii_view is created and verified.

  1. In the navigation pane on the Lake Formation console, under Data Catalog, choose Views.
  2. Choose the new view customer_nonpii_view.
  3. On the SQL definitions tab, verify EMR with Apache Spark shows up for Engine name.
  4. Choose the tab LF-Tags. The view should show the LF-Tag sensitivity=pii-confidential inherited from the database.
  5. Choose Edit LF-Tags.
  6. On the Values dropdown menu, choose confidential to overwrite the Data Catalog view’s key value of sensitivity from pii-confidential.
  7. Choose Save.

With this, we have created a non-PII view to share with the data engineering team from the datasets that has PII information of customers.

Add Athena SQL dialect to the view

With the view customer_nonpii_view having been created by the EMR runtime role GlueViewBlog-EMRStudio-RuntimeRole, the Admin will have only describe permissions on it as a database creator and Lake Formation administrator. In this step, the Admin will grant itself alter permissions on the view, in order to add the Athena SQL dialect to the view.

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, enter Admin.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential and pii-confidential.
    5. Under Database permissions, select Super for Database permissions and Grantable permissions.
    6. Under Table permissions, select Super for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify the LF-Tags based permissions the Admin.
  4. Open the Athena query editor, choose the Workgroup GlueViewBlogWorkgroup and choose the AWS Glue database glueviewblog_<accountID>_db.
  5. Run the following query. Replace <accountID> with your account ID.
    ALTER VIEW glueviewblog_<accountID>_db.customer_nonpii_view ADD DIALECT
    AS
    select c_customer_id, c_customer_sk, c_last_review_date, ca_country, ca_location_type
    from glueviewblog__<accountID>_db.customer_iceberg, glueviewblog__<accountID>_db.customer_address_iceberg
    where c_current_addr_sk = ca_address_sk and c_preferred_cust_flag='Y';

  6. Verify the Athena dialect by running a preview on the view.
  7. On the Lake Formation console, verify the SQL dialects on the view customer_nonpii_view.

Share the view to the consumer account

Complete the following steps to share the Data Catalog view to the consumer account:

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, select External accounts and enter the consumer account ID.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential.
    5. Under Database permissions, select Describe for Database permissions and Grantable permissions.
    6. Under Table permissions, select Describe and Select for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify granted permissions on the Data permissions page.

With this, the producer account data steward has created a Data Catalog view of a subset of data from two tables in their Data Catalog, using the EMR runtime role as the definer role. They have shared it to their analytics account using LF-Tags to run further processing of the data downstream.

Set up infrastructure in the consumer account

We provide a CloudFormation template to deploy the following resources and set up the data lake as follows:

  • An S3 bucket for Amazon EMR and AWS Glue logs
  • Lake Formation administrator and catalog settings similar to the producer account setup
  • An AWS Glue database for the data lake
  • An EMR Studio and related VPC, subnet, routing table, security groups, and EMR Studio service IAM role
  • An IAM role with policies for the EMR Studio runtime

Complete the following steps in your consumer AWS account:

  1. Sign in to the console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the CloudFormation has completed launching, proceed with the following instructions.

  1. If you’re using the consumer account Lake Formation for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to runtime role GlueViewBlog-EMRStudio-Consumer-RuntimeRole.

Accept AWS RAM shares in the consumer account

You can now log in to the AWS consumer account and accept the AWS RAM invitation:

  1. Open the AWS RAM console with the IAM role that has AWS RAM access.
  2. In the navigation pane, choose Resource shares under Shared with me.

You should see two pending resource shares from the producer account.

  1. Accept both invitations.

Create a resource link for the shared view

To access the view that was shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database, table, or view. After you create a resource link to a view, you can use the resource link name wherever you would use the view name. Furthermore, you can grant permission on the resource link to the job runtime role GlueViewBlog-EMRStudio-Consumer-RuntimeRole to access the view through EMR Serverless Spark.

To create a resource link, complete the following steps:

  1. Open the Lake Formation console as the Lake Formation data lake administrator in the consumer account.
  2. In the navigation pane, choose Tables.
  3. Choose Create and Resource link.
  4. For Resource link name, enter the name of the resource link (for example, customer_nonpii_view_rl).
  5. For Database, choose the glueviewblog_customer_<accountID>_db database.
  6. For Shared table region, choose the Region of the shared table.
  7. For Shared table, choose customer_nonpii_view.
  8. Choose Create.

Grant permissions on the database to the EMR job runtime role

Complete the following steps to grant permissions on the database glueviewblog_customer_<accountID>_db to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Databases.
  2. Select the database glueviewblog_customer_<accountID>_db and on the Actions menu, choose Grant.
  3. In the Principles section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  4. In the Database permissions section, select Describe.
  5. Choose Grant.

Grant permissions on the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant.
  3. In the Principles section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  4. In the Resource link permissions section, select Describe for Resource link permissions.
  5. Choose Grant.

This allows the EMR Serverless job runtime roles to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principles.

Grant permissions on the target for the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the target for the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant on target.
  3. In the Principles section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  4. In the View permissions section, select Select and Describe.
  5. Choose Grant.

Set up an EMR Serverless application and Workspace in the consumer account

Repeat the steps to create an EMR Serverless application in the consumer account.

Repeat the steps to create a Workspace in the consumer account. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-Consumer-RuntimeRole as the runtime role.

Verify access using interactive notebooks from EMR Studio

Complete the following steps to verify access in EMR Studio:

  1. Download the notebook glueviewblog_emr_consumer.ipynb. The code runs a select statement on the view shared from the producer.
  2. In your EMR Workspace emrs-glueviewblog-workspace, navigate to the File browser section and choose Upload files.
  3. Upload glueviewblog_emr_consumer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  5. Update the database to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel with PySpark kernel and rerun the notebook.

Verify access using interactive notebooks from AWS Glue Studio

Complete the following steps to verify access using AWS Glue Studio:

  1. Download the notebook glueviewblog_glue_consumer.ipynb
  2. Open the AWS Glue Studio console.
  3. Choose Notebook and then choose Upload notebook.
  4. Upload the notebook glueviewblog_glue_consumer.ipynb.
  5. For IAM role, choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  6. Choose Create notebook.
  7. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  8. Update the database to match your resources.
  9. Save the notebook.
  10. Run all the cells to verify fine-grained access.

Verify access using the Athena query editor

Because the view from the producer account was shared to the consumer account, the Lake Formation administrator will have access to the view in the producer account. Also, because the lake admin role created the resource link pointing to the view, it will also have access to the resource link. Go to the Athena query editor and run a simple select query on the resource link.

The analytics team in the consumer account was able to access a subset of the data from a business data producer team, using their analytics tools of choice.

Clean up

To avoid incurring ongoing costs, clean up your resources:

  1. In your consumer account, delete AWS Glue notebook, stop and delete the EMR application, and then delete EMR Workspace.
  2. In your consumer account, delete the CloudFormation stack. This should remove the resources launched by the stack.
  3. In your producer account, log in to the Lake Formation console and revoke the LF-Tags based permissions you had granted to the consumer account.
  4. In your producer account, stop and delete the EMR application and then delete the EMR Workspace.
  5. In your producer account, delete the CloudFormation stack. This should delete the resources launched by the stack.
  6. Review and clean up any additional AWS Glue and Lake Formation resources and permissions.

Conclusion

In this post, we demonstrated a powerful, enterprise-grade solution for cross-account data sharing and analysis using AWS services. We walked you through how to do the following key steps:

  • Create a Data Catalog view using Spark in EMR Serverless within one AWS account
  • Securely share this view with another account using LF-TBAC
  • Access the shared view in the recipient account using Spark in both EMR Serverless and AWS Glue ETL
  • Implement this solution with Iceberg tables (it’s also compatible other open table formats like Apache Hudi and Delta Lake)

The solution approach with multi-dialect data catalog views provided in this post is particularly valuable for enterprises looking to modernize their data infrastructure while optimizing costs, improve cross-functional collaboration while enhancing data governance, and accelerate business insights while maintaining control over sensitive information.

Refer to the following information about Data Catalog views with individual analytics services, and try out the solution. Let us know your feedback and questions in the comments section.


About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. As part of the SageMaker Lakehouse team, she works with AWS customers and partners to architect lake house solutions, enhance product features, and establish best practices for data governance.

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Dhananjay Badaya is a Software Developer at AWS, specializing in distributed data processing engines including Apache Spark and Apache Hadoop. As a member of the Amazon EMR team, he focuses on designing and implementing enterprise governance features for EMR Spark.

Configure cross-account access of Amazon SageMaker Lakehouse multi-catalog tables using AWS Glue 5.0 Spark

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/configure-cross-account-access-of-amazon-sagemaker-lakehouse-multi-catalog-tables-using-aws-glue-5-0-spark/

Many organizations build and operate enterprise-wide data mesh architectures using the AWS Glue Data Catalog and AWS Lake Formation for their Amazon Simple Storage Service (Amazon S3) based data lakes. Now, with Amazon SageMaker Lakehouse, these organizations can unify their data analytics and AI/ML workflows while maintaining secure cross-account access without data replication. By centralizing access to a single copy of data and using the secure fine-grained permissions of Lake Formation, enterprises can accelerate their analytics initiatives while reducing operational complexity across business units.

SageMaker Lakehouse organizes data using logical containers called catalogs, enabling teams to seamlessly query and analyze data across their entire ecosystem—from S3 data lakes to Amazon Redshift warehouses—using familiar Apache Iceberg compatible tools. Organizations can either mount their existing data warehouse to the lakehouse or create new catalogs using Amazon Redshift managed storage. Built-in zero-ETL connectors reduce data silos by integrating various data sources, enabling unified analytics across teams. This seamless integration particularly benefits existing AWS customers who already use the Data Catalog and Lake Formation, because they can immediately take advantage of SageMaker Lakehouse capabilities.

AWS Glue is a serverless service that makes data integration simpler, faster, and cheaper. We launched AWS Glue 5.0 with upgraded Apache Spark 3.5.4 and Python 3.11. AWS Glue 5.0 adds support for SageMaker Lakehouse to unify your data across S3 data lakes and Redshift data warehouses.

In our previous blog post, we demonstrated the process of creating tables in both the Amazon Redshift managed catalog and Amazon Redshift federated catalog within a single AWS account. In this post, we show you how to share a Redshift table and Amazon S3 based Iceberg table from the account that owns the data to another account that consumes the data. In the recipient account, we run a join query on the shared data lake and data warehouse tables using Spark in AWS Glue 5.0. We walk you through the complete cross-account setup and provide the Spark configuration in a Python notebook.

Solution overview

To demonstrate the functionality of SageMaker Lakehouse multi-catalog tables using AWS Glue 5.0 Spark, let’s assume the retail company Example Retail Corp launches a campaign to understand their market and drive growth by country of operation. Their infrastructure consists of a Redshift data warehouse for structured data and an S3 data lake for structured and semi-structured data. The marketing team realizes that customer data is spread across those two systems and wants to use the support of their data engineering and analysts to analyze and provide insights. As a company, they prefer unified governance for managing data access while enabling a secure sharing mechanism for business and engineering teams.

Let’s see how they can achieve the goal using SageMaker Lakehouse. The solution is represented in the following diagram.

001-BDB 5089

The setup could be extended to enterprise data meshes where a data producer account will own the Redshift clusters, catalog the tables in a central governance account, and share with any number of consumer accounts from the central account. Multiple consumer accounts could analyze the shared Redshift tables using the SageMaker Lakehouse integrated analytics engines.

The solution also works for cross-Region table access. You would create a resource link for the catalog tables in an AWS Region where you want to run your analyses and create dashboards. For cross-Region resource link setup, refer to Setting up cross-Region table access.

Prerequisites

To implement this solution, you need the following prerequisites:

  • Two AWS accounts with Lake Formation cross-account sharing version 4 and Lake Formation administrator configured. Refer to the Lake Formation data administrator permissions and initial setup of Lake Formation.
  • Permissions from Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog granted to the Lake Formation administrator role on both accounts.
  • An S3 bucket in the producer account to host the sample Iceberg table data.
  • An AWS Identity and Access Management (IAM) role, LakeFormationS3Registration_custom, in the producer account to register your Iceberg table’s Amazon S3 location with Lake Formation. For details, refer to Registering an Amazon S3 location and Requirements for roles used to register locations.
  • An Amazon Redshift Serverless namespace in the producer account. Follow the instructions in Creating a data warehouse with Amazon Redshift Serverless to launch a serverless namespace with default settings.
  • Two sample datasets, orders and returns, in CSV format. This is Example Retail Corp’s data on their customer purchase and return trends. Their marketing team has collected these data in a Redshift table and Amazon S3 from various systems. The instructions to create these tables are provided in the appendix at the end of this post. After completing the steps in the appendix, you should have customerdb.returnstbl_iceberg in your default catalog and ordersdb.orderstbl in your Redshift Serverless application default namespace.
  • An IAM role, Glue-execution-role, in the consumer account, with the following policies:
    1. AWS managed policies AWSGlueServiceRole and AmazonRedshiftDataFullAccess.
    2. Create a new in-line policy with the following permissions and attach it:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "LFandRSserverlessAccess",
                  "Effect": "Allow",
                  "Action": [
                      "lakeformation:GetDataAccess",
                      "redshift-serverless:GetCredentials"
                  ],
                  "Resource": "*"
              },
              {
                  "Effect": "Allow",
                  "Action": "iam:PassRole",
                  "Resource": "*",
                  "Condition": {
                      "StringEquals": {
                          "iam:PassedToService": "glue.amazonaws.com"
                      }
                  }
              }
          ]
      }

    3. Add the following trust policy to Glue-execution-role, allowing AWS Glue to assume this role:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": [
                          "glue.amazonaws.com"
                      ]
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }

    Steps for producer account setup

    For the producer account setup, you can either use your IAM administrator role added as Lake Formation administrator or use a Lake Formation administrator role with permissions added as discussed in the prerequisites. For illustration purposes, we use the IAM admin role Admin added as Lake Formation administrator.

    002-BDB 5089

    Configure your catalog

    Complete the following steps to set up your catalog:

    1. Log in to AWS Management Console as Admin.
    2. On the Amazon Redshift console, follow the instructions in Registering Amazon Redshift clusters and namespaces to the AWS Glue Data Catalog.
    3. After the registration is initiated, you will see the invite from Amazon Redshift on the Lake Formation console.
    4. Select the pending catalog invitation and choose Approve and create catalog.

    003-BDB 5089

    1. On the Set catalog details page, configure your catalog:
      1. For Name, enter a name (for this post, redshiftserverless1-uswest2).
      2. Select Access this catalog from Apache Iceberg compatible engines.
      3. Choose the IAM role you created for the data transfer.
      4. Choose Next.

      004-BDB 5089

    2. On the Grant permissions – optional page, choose Add permissions.
      1. Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
      2. Choose Add.

      005-BDB 5089

    3. Verify the granted permission on the next page and choose Next.
      006-BDB 5089
    4. Review the details on the Review and create page and choose Create catalog.
      007-BDB 5089

    Wait a few seconds for the catalog to show up.

    1. Choose Catalogs in the navigation pane and verify that the redshiftserverless1-uswest2 catalog is created.
      008-BDB 5089
    2. Explore the catalog detail page to verify the ordersdb.public database.
      009-BDB 5089
    3. On the database View dropdown menu, view the table and verify that the orderstbl table shows up.
      010-BDB 5089

    As the Admin role, you can also query the orderstbl in Amazon Athena and confirm the data is available.

    011-BDB 5089

    Grant permissions on the tables from the producer account to the consumer account

    In this step, we share the Amazon Redshift federated catalog database redshiftserverless1-uswest2:ordersdb.public and table orderstbl as well as the Amazon S3 based Iceberg table returnstbl_iceberg and its database customerdb from the default catalog to the consumer account. We can’t share the entire catalog to external accounts as a catalog-level permission; we just share the database and table.

    1. On the Lake Formation console, choose Data permissions in the navigation pane.
    2. Choose Grant.
      012-BDB 5089
    3. Under Principals, select External accounts.
    4. Provide the consumer account ID.
    5. Under LF-Tags or catalog resources, select Named Data Catalog resources.
    6. For Catalogs, choose the account ID that represents the default catalog.
    7. For Databases, choose customerdb.
      013-BDB 5089
    8. Under Database permissions, select Describe under Database permissions and Grantable permissions.
    9. Choose Grant.
      014-BDB 5089
    10. Repeat these steps and grant table-level Select and Describe permissions on returnstbl_iceberg.
    11. Repeat these steps again to grant database- and table-level permissions for the ordertbl table of the federated catalog database redshiftserverless1-uswest2/ordersdb.

    The following screenshots show the configuration for database-level permissions.

    015-BDB 5089

    016-BDB 5089

    The following screenshots show the configuration for table-level permissions.

    017-BDB 5089

    018-BDB 5089

    1. Choose Data permissions in the navigation pane and verify that the consumer account has been granted database- and table-level permissions for both orderstbl from the federated catalog and returnstbl_iceberg from the default catalog.
      019-BDB 5089

    Register the Amazon S3 location of the returnstbl_iceberg with Lake Formation.

    In this step, we register the Amazon S3 based Iceberg table returnstbl_iceberg data location with Lake Formation to be managed by Lake Formation permissions. Complete the following steps:

    1. On the Lake Formation console, choose Data lake locations in the navigation pane.
    2. Choose Register location.
      020-BDB 5089
    3. For Amazon S3 path, enter the path for your S3 bucket that you provided while creating the Iceberg table returnstbl_iceberg.
    4. For IAM role, provide the user-defined role LakeFormationS3Registration_custom that you created as a prerequisite.
    5. For Permission mode, select Lake Formation.
    6. Choose Register location.
      021-BDB 5089
    7. Choose Data lake locations in the navigation pane to verify the Amazon S3 registration.
      022-BDB 5089

    With this step, the producer account setup is complete.

    Steps for consumer account setup

    For the consumer account setup, we use the IAM admin role Admin, added as a Lake Formation administrator.

    The steps in the consumer account are quite involved. In the consumer account, a Lake Formation administrator will accept the AWS Resource Access Manager (AWS RAM) shares and create the required resource links that point to the shared catalog, database, and tables. The Lake Formation admin verifies that the shared resources are accessible by running test queries in Athena. The admin further grants permissions to the role Glue-execution-role on the resource links, database, and tables. The admin then runs a join query in AWS Glue 5.0 Spark using Glue-execution-role.

    Accept and verify the shared resources

    Lake Formation uses AWS RAM shares to enable cross-account sharing with Data Catalog resource policies in the AWS RAM policies. To view and verify the shared resources from producer account, complete the following steps:

    1. Log in to the consumer AWS console and set the AWS Region to match the producer’s shared resource Region. For this post, we use us-west-2.
    2. Open the Lake Formation console. You will see a message indicating there is a pending invite and asking you accept it on the AWS RAM console.
      023-BDB 5089
    3. Follow the instructions in Accepting a resource share invitation from AWS RAM to review and accept the pending invites.
    4. When the invite status changes to Accepted, choose Shared resources under Shared with me in the navigation pane.
    5. Verify that the Redshift Serverless federated catalog redshiftserverless1-uswest2, the default catalog database customerdb, the table returnstbl_iceberg, and the producer account ID under Owner ID column display correctly.
      024-BDB 5089
    6. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
    7. Search by the producer account ID.
      You should see the customerdb and public databases. You can further select each database and choose View tables on the Actions dropdown menu and verify the table names

    025-BDB 5089

    You will not see an AWS RAM share invite for the catalog level on the Lake Formation console, because catalog-level sharing isn’t possible. You can review the shared federated catalog and Amazon Redshift managed catalog names on the AWS RAM console, or using the AWS Command Line Interface (AWS CLI) or SDK.

    Create a catalog link container and resource links

    A catalog link container is a Data Catalog object that references a local or cross-account federated database-level catalog from other AWS accounts. For more details, refer to Accessing a shared federated catalog. Catalog link containers are essentially Lake Formation resource links at the catalog level that reference or point to a Redshift cluster federated catalog or Amazon Redshift managed catalog object from other accounts.

    In the following steps, we create a catalog link container that points to the producer shared federated catalog redshiftserverless1-uswest2. Inside the catalog link container, we create a database. Inside the database, we create a resource link for the table that points to the shared federated catalog table <<producer account id>>:redshiftserverless1-uswest2/ordersdb.public.orderstbl.

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Catalogs.
    2. Choose Create catalog.

    026-BDB 5089

    1. Provide the following details for the catalog:
      1. For Name, enter a name for the catalog (for this post, rl_link_container_ordersdb).
      2. For Type, choose Catalog Link container.
      3. For Source, choose Redshift.
      4. For Target Redshift Catalog, enter the Amazon Resource Name (ARN) of the producer federated catalog (arn:aws:glue:us-west-2:<<producer account id>>:catalog/redshiftserverless1-uswest2/ordersdb).
      5. Under Access from engines, select Access this catalog from Apache Iceberg compatible engines.
      6. For IAM role, provide the Redshift-S3 data transfer role that you had created in the prerequisites.
      7. Choose Next.

    027-BDB 5089

    1. On the Grant permissions – optional page, choose Add permissions.
      1. Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
      2. Choose Add and then choose Next.

    028-BDB 5089

    1. Review the details on the Review and create page and choose Create catalog.

    Wait a few seconds for the catalog to show up.

    029-BDB 5089

    1. In the navigation pane, choose Catalogs.
    2. Verify that rl_link_container_ordersdb is created.

    030-BDB 5089

    Create a database under rl_link_container_ordersdb

    Complete the following steps:

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
    2. On the Choose catalog dropdown menu, choose rl_link_container_ordersdb.
    3. Choose Create database.

    Alternatively, you can choose the Create dropdown menu and then choose Database.

    1. Provide details for the database:
      1. For Name, enter a name (for this post, public_db).
      2. For Catalog, choose rl_link_container_ordersdb.
      3. Leave Location – optional as blank.
      4. Under Default permissions for newly created tables, deselect Use only IAM access control for new tables in this database.
      5. Choose Create database.

    031-BDB 5089

    1. Choose Catalogs in the navigation pane to verify that public_db is created under rl_link_container_ordersdb.

    032-BDB 5089

    Create a table resource link for the shared federated catalog table

    A resource link to a shared federated catalog table can reside only inside the database of a catalog link container. A resource link for such tables will not work if created inside the default catalog. For more details on resource links, refer to Creating a resource link to a shared Data Catalog table.

    Complete the following steps to create a table resource link:

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Tables.
    2. On the Create dropdown menu, choose Resource link.

    033-BDB 5089

    1. Provide details for the table resource link:
      1. For Resource link name, enter a name (for this post, rl_orderstbl).
      2. For Destination catalog, choose rl_link_container_ordersdb.
      3. For Database, choose public_db.
      4. For Shared table’s region, choose US West (Oregon).
      5. For Shared table, choose orderstbl.
      6. After the Shared table is selected, Shared table’s database and Shared table’s catalog ID should get automatically populated.
      7. Choose Create.

    034-BDB 5089

    1. In the navigation pane, choose Databases to verify that rl_orderstbl is created under public_db, inside rl_link_container_ordersdb.

    035-BDB 5089

    036-BDB 5089

    Create a database resource link for the shared default catalog database.

    Now we create a database resource link in the default catalog to query the Amazon S3 based Iceberg table shared from the producer. For details on database resource links, refer Creating a resource link to a shared Data Catalog database.

    Though we are able to see the shared database in the default catalog of the consumer, a resource link is required to query from analytics engines, such as Athena, Amazon EMR, and AWS Glue. When using AWS Glue with Lake Formation tables, the resource link needs to be named identically to the source account’s resource. For additional details on using AWS Glue with Lake Formation, refer to Considerations and limitations.

    Complete the following steps to create a database resource link:

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
    2. On the Choose catalog dropdown menu, choose the account ID to choose the default catalog.
    3. Search for customerdb.

    You should see the shared database name customerdb with the Owner account ID as that of your producer account ID.

    1. Select customerdb, and on the Create dropdown menu, choose Resource link.
    2. Provide details for the resource link:
      1. For Resource link name, enter a name (for this post, customerdb).
      2. The rest of the fields should be already populated.
      3. Choose Create.
    3. In the navigation pane, choose Databases and verify that customerdb is created under the default catalog. Resource link names will show in italicized font.

    037-BDB 5089

    Verify access as Admin using Athena

    Now you can verify your access using Athena. Complete the following steps:

    1. Open the Athena console.
    2. Make sure an S3 bucket is provided to store the Athena query results. For details, refer to Specify a query result location using the Athena console.
    3. In the navigation pane, verify both the default catalog and federated catalog tables by previewing them.
    4. You can also run a join query as follows. Pay attention to the three-point notation for referring to the tables from two different catalogs:
    SELECT
    returns_tb.market as Market,
    sum(orders_tb.quantity) as Total_Quantity
    FROM rl_link_container_ordersdb.public_db.rl_orderstbl as orders_tb
    JOIN awsdatacatalog.customerdb.returnstbl_iceberg as returns_tb
    ON orders_tb.order_id = returns_tb.order_id
    GROUP BY returns_tb.market;

    038-BDB 5089

    This verifies the new capability of SageMaker Lakehouse, which enables accessing Redshift cluster tables and Amazon S3 based Iceberg tables in the same query, across AWS accounts, through the Data Catalog, using Lake Formation permissions.

    Grant permissions to Glue-execution-role

    Now we will share the resources from the producer account with additional IAM principals in the consumer account. Usually, the data lake admin grants permissions to data analysts, data scientists, and data engineers in the consumer account to do their job functions, such as processing and analyzing the data.

    We set up Lake Formation permissions on the catalog link container, databases, tables, and resource links to the AWS Glue job execution role Glue-execution-role that we created in the prerequisites.

    Resource links allow only Describe and Drop permissions. You need to use the Grant on target configuration to provide database Describe and table Select permissions.

    Complete the following steps:

    1. On the Lake Formation console, choose Data permissions in the navigation pane.
    2. Choose Grant.
    3. Under Principals, select IAM users and roles.
    4. For IAM users and roles, enter Glue-execution-role.
    5. Under LF-Tags or catalog resources, select Named Data Catalog resources.
    6. For Catalogs, choose rl_link_container_ordersdb and the consumer account ID, which indicates the default catalog.
    7. Under Catalog permissions, select Describe for Catalog permissions.
    8. Choose Grant.

    039-BDB 5089

    040-BDB 5089

    1. Repeat these steps for the catalog rl_link_container_ordersdb:
      1. On the Databases dropdown menu, choose public_db.
      2. Under Database permissions, select Describe.
      3. Choose Grant.
    2. Repeat these steps again, but after choosing rl_link_container_ordersdb and public_db, on the Tables dropdown menu, choose rl_orderstbl.
      1. Under Resource link permissions, select Describe.
      2. Choose Grant.
    3. Repeat these steps to grant additional permissions to Glue-execution-role.
      1. For this iteration, grant Describe permissions on the default catalog databases public and customerdb.
      2. Grant Describe permission on the resource link customerdb.
      3. Grant Select permission on the tables returnstbl_iceberg and orderstbl.

    The following screenshots show the configuration for database public and customerdb permissions.

    041-BDB 5089

    042-BDB 5089

    The following screenshots show the configuration for resource link customerdb permissions.

    043-BDB 5089

    044-BDB 5089

    The following screenshots show the configuration for table returnstbl_iceberg permissions.

    045-BDB 5089

    046-BDB 5089

    The following screenshots show the configuration for table orderstbl permissions.

    047-BDB 5089

    048-BDB 5089

    1. In the navigation pane, choose Data permissions and verify permissions on Glue-execution-role.

    049-BDB 5089

    Run a PySpark job in AWS Glue 5.0

    Download the PySpark script LakeHouseGlueSparkJob.py. This AWS Glue PySpark script runs Spark SQL by joining the producer shared federated orderstbl table and Amazon S3 based returns table in the consumer account to analyze the data and identify the total orders placed per market.

    Replace <<consumer_account_id>> in the script with your consumer account ID. Complete the following steps to create and run an AWS Glue job:

    1. On the AWS Glue console, in the navigation pane, choose ETL jobs.
    2. Choose Create job, then choose Script editor.

    050-BDB 5089

    1. For Engine, choose Spark.
    2. For Options, choose Start fresh.
    3. Choose Upload script.
    4. Browse to the location where you downloaded and edited the script, select the script, and choose Open.
    5. On the Job details tab, provide the following information:
      1. For Name, enter a name (for this post, LakeHouseGlueSparkJob).
      2. Under Basic properties, for IAM role, choose Glue-execution-role.
      3. For Glue version, select Glue 5.0.
      4. Under Advanced properties, for Job parameters, choose Add new parameter.
      5. Add the parameters --datalake-formats = iceberg and --enable-lakeformation-fine-grained-access = true.
    6. Save the job.
    7. Choose Run to execute the AWS Glue job, and wait for the job to complete.
    8. Review the job run details from the Output logs

    051-BDB 5089

    052-BDB 5089

    Clean up

    To avoid incurring costs on your AWS accounts, clean up the resources you created:

    1. Delete the Lake Formation permissions, catalog link container, database, and tables in the consumer account.
    2. Delete the AWS Glue job in the consumer account.
    3. Delete the federated catalog, database, and table resources in the producer account.
    4. Delete the Redshift Serverless namespace in the producer account.
    5. Delete the S3 buckets you created as part of data transfer in both accounts and the Athena query results bucket in the consumer account.
    6. Clean up the IAM roles you created for the SageMaker Lakehouse setup as part of the prerequisites.

    Conclusion

    In this post, we illustrated how to bring your existing Redshift tables to SageMaker Lakehouse and share them securely with external AWS accounts. We also showed how to query the shared data warehouse and data lakehouse tables in the same Spark session, from a recipient account, using Spark in AWS Glue 5.0.

    We hope you find this useful to integrate your Redshift tables with an existing data mesh and access the tables using AWS Glue Spark. Test this solution in your accounts and share feedback in the comments section. Stay tuned for more updates and feel free to explore the features of SageMaker Lakehouse and AWS Glue versions.

    Appendix: Table creation

    Complete the following steps to create a returns table in the Amazon S3 based default catalog and an orders table in Amazon Redshift:

    1. Download the CSV format datasets orders and returns.
    2. Upload them to your S3 bucket under the corresponding table prefix path.
    3. Use the following SQL statements in Athena. First-time users of Athena should refer to Specify a query result location.
    CREATE DATABASE customerdb;
    CREATE EXTERNAL TABLE customerdb.returnstbl_csv(
      `returned` string, 
      `order_id` string, 
      `market` string)
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY '\;' 
    LOCATION
      's3://<your-S3-bucket>/<prefix-for-returns-table-data>/'
    TBLPROPERTIES (
      'skip.header.line.count'='1'
    );
    
    select * from customerdb.returnstbl_csv limit 10; 
    

    053-BDB 5089

    1. Create an Iceberg format table in the default catalog and insert data from the CSV format table:
    CREATE TABLE customerdb.returnstbl_iceberg(
      `returned` string, 
      `order_id` string, 
      `market` string)
    LOCATION 's3://<your-producer-account-bucket>/returnstbl_iceberg/' 
    TBLPROPERTIES (
      'table_type'='ICEBERG'
    );
    
    INSERT INTO customerdb.returnstbl_iceberg
    SELECT *
    FROM returnstbl_csv;  
    
    SELECT * FROM customerdb.returnstbl_iceberg LIMIT 10; 
    

    054-BDB 5089

    1. To create the orders table in the Redshift Serverless namespace, open the Query Editor v2 on the Amazon Redshift console.
    2. Connect to the default namespace using your database admin user credentials.
    3. Run the following commands in the SQL editor to create the database ordersdb and table orderstbl in it. Copy the data from your S3 location of the orders data to the orderstbl:
    create database ordersdb;
    use ordersdb;
    
    create table orderstbl(
      row_id int, 
      order_id VARCHAR, 
      order_date VARCHAR, 
      ship_date VARCHAR, 
      ship_mode VARCHAR, 
      customer_id VARCHAR, 
      customer_name VARCHAR, 
      segment VARCHAR, 
      city VARCHAR, 
      state VARCHAR, 
      country VARCHAR, 
      postal_code int, 
      market VARCHAR, 
      region VARCHAR, 
      product_id VARCHAR, 
      category VARCHAR, 
      sub_category VARCHAR, 
      product_name VARCHAR, 
      sales VARCHAR, 
      quantity bigint, 
      discount VARCHAR, 
      profit VARCHAR, 
      shipping_cost VARCHAR, 
      order_priority VARCHAR
      );
    
    copy orderstbl
    from 's3://<your-s3-bucket>/ordersdatacsv/orders.csv' 
    iam_role 'arn:aws:iam::<producer-account-id>:role/service-role/<your-Redshift-Role>'
    CSV 
    DELIMITER ';'
    IGNOREHEADER 1
    ;
    
    select * from ordersdb.orderstbl limit 5;
    


    About the Authors

    055-BDB 5089Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. She collaborates with the service team to enhance product features, works with AWS customers and partners to architect lakehouse solutions, and establishes best practices for data governance.

    056-BDB 5089Subhasis Sarkar is a Senior Data Engineer with Amazon. Subhasis thrives on solving complex technological challenges with innovative solutions. He specializes in AWS data architectures, particularly data mesh implementations using AWS CDK components.

Read and write Apache Iceberg tables using AWS Lake Formation hybrid access mode

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/read-and-write-apache-iceberg-tables-using-aws-lake-formation-hybrid-access-mode/

Enterprises are adopting Apache Iceberg table format for its multitude of benefits. The change data capture (CDC), ACID compliance, and schema evolution features cater to representing big datasets that receive new records at a fast pace. In an earlier blog post, we discussed how to implement fine-grained access control in Amazon EMR Serverless using AWS Lake Formation for reads. Lake Formation helps you centrally manage and scale fine-grained data access permissions and share data with confidence within and outside your organization.

In this post, we demonstrate how to use Lake Formation for read access while continuing to use AWS Identity and Access Management (IAM) policy-based permissions for write workloads that update the schema and upsert (insert and update combined) data records into the Iceberg tables. The bimodal permissions are needed to support existing data pipelines that use only IAM and Amazon Simple Storage Service (Amazon) S3 bucket policy-based permissions and to support table operations that are not yet available in the analytics engines. The two-way permission is achieved by registering the Amazon S3 data location of the Iceberg table with Lake Formation in hybrid access mode. Lake Formation hybrid access mode allows you to onboard new users with Lake Formation permissions to access AWS Glue Data Catalog tables with minimal interruptions to existing IAM policy-based users. With this solution, organizations can use the Lake Formation permissions to scale the access of their existing Iceberg tables in Amazon S3 to new readers. You can extend the methodology to other open table formats, such as Linux Foundation Delta Lake tables and Apache Hudi tables.

Key use cases for Lake Formation hybrid access mode

Lake Formation hybrid access mode is useful in the following use cases:

  • Avoiding data replication – Hybrid access mode helps onboard new users with Lake Formation permissions on existing Data Catalog tables. For example, you can enable a subset of data access (coarse vs. fine-grained access) for various user personas, such as data scientists and data analysts, without making multiple copies of the data. This also helps maintain a single source of truth for production and business insights.
  • Minimal interruption to existing IAM policy-based user access – With hybrid access mode, you can add new Lake Formation managed users with minimal disruptions to your existing IAM and Data Catalog policy-based user access. Both access methods can coexist for the same catalog table, but each user can have only one mode of permissions.
  • Transactional table writes – Certain write operations like insert, update, and delete are not supported by Amazon EMR for Lake Formation managed Iceberg tables. Refer to Considerations and limitations for additional details. Although you could use Lake Formation permissions for Iceberg table read operations, you could manage the write operations as the table owners with IAM policy-based access.

Solution overview

An example Enterprise Corp has a large number of Iceberg tables based on Amazon S3. They are currently managing the Iceberg tables manually with IAM policy, Data Catalog resource policy, and S3 bucket policy-based access in their organization. They want to share their transactional data of Iceberg tables across different teams, such as data analysts and data scientists, asking for read access across a few lines of business. While maintaining the ownership of the table’s updates to their single team, they want to provide restricted read access to certain columns of their tables. This is achieved by using the hybrid access mode feature of Lake Formation.

In this post, we illustrate the scenario with a data engineer team and a new data analyst team. The data engineering team owns the extract, transform, and load (ETL) application that will process the raw data to create and maintain the Iceberg tables. The data analyst team will query the tables to gather business insights from those tables. The ETL application will use IAM role-based access to the Iceberg table, and the data analyst gets Lake Formation permissions to query the same tables.

The solution can be visually represented in the following diagram.

Solution Overview

For ease of illustration, we use only one AWS account in this post. Enterprise use cases typically have multiple accounts or cross-account access requirements. The setup of the Iceberg tables, Lake Formation permissions, and IAM based permissions are similar for multiple and cross-account scenarios.

The high-level steps involved in the permissions setup are as follows:

  1. Make sure that IAMAllowedPrincipals has Super access to the database and tables in Lake Formation. IAMAllowedPrincipals is a virtual group that represents any IAM principal permissions. Super access to this virtual group is required to make sure that IAM policy-based permissions to any IAM principal continues to work.
  2. Register the data location with Lake Formation in hybrid access mode.
  3. Grant DATA LOCATION permission to the IAM role that manages the table with IAM policy-based permissions. Without the DATA LOCATION permission, write workloads will fail. Test the access to the table by writing new records to the table as the IAM role.
  4. Add SELECT table permissions to the Data-Analyst role in Lake Formation.
  5. Opt-in the Data-Analyst to the Iceberg table, making the Lake Formation permissions effective for the analyst.
  6. Test access to the table as the Data-Analyst by running SELECT queries in Athena.
  7. Test the table write operations by adding new records to the table as ETL-application-role using EMR Serverless.
  8. Read the latest update, again, as Data-Analyst.

Prerequisites

You should have the following prerequisites:

  • An AWS account with a Lake Formation administrator configured. Refer to Data lake administrator permissions and Set up AWS Lake Formation. You can also refer to Simplify data access for your enterprise using Amazon SageMaker Lakehouse for the Lake Formation admin setup in your AWS account. For ease of demonstration, we have used an IAM admin role added as a Lake Formation administrator.
  • An S3 bucket to host the sample Iceberg table data and metadata.
  • An IAM role to register your Iceberg table Amazon S3 location with Lake Formation. Follow the policy and trust policy details for a user-defined role creation from Requirements for roles used to register locations.
  • An IAM role named ETL-application-role, which will be the runtime role to execute jobs in EMR Serverless. The minimum policy required is shown in the following code snippet. Replace the Amazon S3 data location of the Iceberg table, database name, and AWS Key Management Service (AWS KMS) key ID with your own. For additional details on the role setup, refer to Job runtime roles for Amazon EMR Serverless. This role can insert, update, and delete data in the table.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "IcebergDataAccessInS3",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:ListAllMyBuckets",
                    "s3:Get*",
                    "s3:Put*",
                    "s3:Delete*"
                ],
                "Resource": [
                    "arn:aws:s3:::your-iceberg-data-bucket-name",
                    "arn:aws:s3:::your-iceberg-data-bucket-name/*"
                ]
            },
            {
                "Sid": "GlueCatalogApiPermissions",
                "Effect": "Allow",
                "Action": [
                    "glue:*"
                ],
                "Resource": [
                    "arn:aws:glue:your-Region:account-id:catalog",
                    "arn:aws:glue:your-Region:account-id:database/iceberg-database-name",
                    "arn:aws:glue:your-Region:account-id:database/default",
                    "arn:aws:glue:your-Region:account-id:table/*/*"
                ]
            },
            {
                "Sid": "KmsKeyPermissions",
                "Effect": "Allow",
                "Action": [
                    "kms:Encrypt",
                    "kms:Decrypt",
                    "kms:ReEncrypt*",
                    "kms:GenerateDataKey",
                    "kms:DescribeKey",
                    "kms:ListKeys",
                    "kms:ListAliases"
                ],
                "Resource": [
                    "arn:aws:kms:your-Region:account-id:key/your-key-id"
                ]
            }
        ]
    }

    Add the following trust policy to the role:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "emr-serverless.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  • An IAM role called Data-Analyst, to represent the data analyst access. Use the following policy to create the role. Also attach the AWS managed policy arn:aws:iam::aws:policy/AmazonAthenaFullAccess to the role, to allow querying the Iceberg table using Amazon Athena. Refer to Data engineer permissions for additional details about this role.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LFBasicUser",
                "Effect": "Allow",
                "Action": [
                    "glue:GetCatalog",
                    "glue:GetCatalogs",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetTableVersion",
                    "glue:GetTableVersions",
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                    "lakeformation:GetDataAccess"
                ],
                "Resource": "*"
            },
            {
                "Sid": "AthenaResultsBucket",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:Put*",
                    "s3:Get*",
                    "s3:Delete*"
                ],
                "Resource": [
                    "arn:aws:s3:::your-bucket-name-prefix",
                    "arn:aws:s3:::your-bucket-name-prefix/*"
                ]
            }
        ]
    }

    Add the following trust policy to the role:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<your_account_id>:root"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

Create the Iceberg table

Complete the following steps to create the Iceberg table:

  1. Sign in to the Lake Formation console as the admin role.
  2. In the navigation pane under Data Catalog, choose Databases.
  3. From the Create dropdown menu, create a database named iceberg_db. You can leave the Amazon S3 location property empty for the database.
  4. On the Athena console, run the following provided queries. The queries perform the following operations:
    1. Create a table called customer_csv, pointing to the customer dataset in the public S3 bucket.
    2. Create an Iceberg table called customer_iceberg, pointing to your S3 bucket location that will host the Iceberg table data and metadata.
    3. Insert data from the CSV table to the Iceberg table.
      CREATE EXTERNAL TABLE `iceberg_db`.`customer_csv`(
        `c_customer_sk` int,
        `c_customer_id` string,
        `c_current_cdemo_sk` int,
        `c_current_hdemo_sk` int,
        `c_current_addr_sk` int,
        `c_first_shipto_date_sk` int,
        `c_first_sales_date_sk` int,
        `c_salutation` string,
        `c_first_name` string,
        `c_last_name` string,
        `c_preferred_cust_flag` string,
        `c_birth_day` int,
        `c_birth_month` int,
        `c_birth_year` int,
        `c_birth_country` string,
        `c_login` string,
        `c_email_address` string,
        `c_last_review_date` string)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '|'
      STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION
        ' s3://redshift-downloads/TPC-DS/2.13/10GB/customer/'
      TBLPROPERTIES (
        'classification'='csv');   
      
       SELECT * FROM customer_csv LIMIT 5; //verifies table data  
      
      CREATE TABLE IF NOT EXISTS iceberg_db.customer_iceberg (
              c_customer_sk             int,
              c_customer_id             string,
              c_current_cdemo_sk        int,
              c_current_hdemo_sk        int,
              c_current_addr_sk         int,
              c_first_shipto_date_sk    int,
              c_first_sales_date_sk     int,
              c_salutation              string,
              c_first_name              string,
              c_last_name               string,
              c_preferred_cust_flag     string,
              c_birth_day               int,
              c_birth_month             int,
              c_birth_year              int,
              c_birth_country           string,
              c_login                   string,
              c_email_address           string,
              c_last_review_date        string
          )
      LOCATION 's3://your-iceberg-data-bucket-name/path/'
      TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
      
      INSERT INTO customer_iceberg
      SELECT *
      FROM customer_csv;  
      
      SELECT * FROM customer_iceberg LIMIT 5; //verifies table data

Set up the Iceberg table as a hybrid access mode resource

Complete the following steps to set up the Iceberg table’s Amazon S3 data location as hybrid access mode in Lake Formation:

  1. Register your table location with Lake Formation:
    1. Sign in to the Lake Formation console as data lake administrator.
    2. In the navigation pane, choose Data lake Locations.
    3. For Amazon S3 path, provide the S3 prefix of your Iceberg table location that holds both the data and metadata of the table.
    4. For IAM role, provide the user-defined role that has permissions to your Iceberg table’s Amazon S3 location and that you created according to the prerequisites. For more details, refer to Registering an Amazon S3 location.
    5. For Permission mode, select Hybrid access mode.
    6. Choose Register location to register your Iceberg table Amazon S3 location with Lake Formation.

  1. Add data location permission to ETL-application-role:
    1. In the navigation pane, choose Data locations.
    2. For IAM users and roles, choose ETL-application-role.
    3. For Storage location, provide the S3 prefix of your Iceberg table.
    4. Choose Grant.

Data location permission is required for write operations to the Iceberg table location only if the Iceberg table’s S3 prefix is a child location of the database’s Amazon S3 location property.

  1. Grant Super access on the Iceberg database and table to IAMAllowedPrincipals:
    1. In the navigation pane, choose Data permissions.
    2. Choose IAM users and roles and choose IAMAllowedPrincipals.
    3. For LF-Tags or catalog resources, choose Named Data Catalog resources.
    4. Under Databases, select the name of your Iceberg table’s database.
    5. Under Database permissions, select Super.
    6. Choose Grant.

    7. Repeat the preceding steps and for Tables – optional, choose the Iceberg table.
    8. Under Table permissions, select Super.
    9. Choose Grant.

  1. Add database and table permissions to the Data-Analyst role:
    1. Repeat the steps in Step 3 to grant permissions for the Data-Analyst role, once for database-level permission and once for table-level permission.
    2. Select Describe permissions for the Iceberg database.
    3. Select Select permissions for the Iceberg table.
    4. Under Hybrid access mode, select Make Lake Formation permissions effective immediately.
    5. Choose Grant.

The following screenshots show the database permissions for Data-Analyst.

The following screenshots show the table permissions for Data-Analyst.

  1. Verify Lake Formation permissions on the Iceberg table and database to both Data-Analyst and IAMAllowedPrincipals:
    1. In the navigation pane, choose Data permissions.
    2. Filter by Table= customer_iceberg.
      You should see IAMAllowedPrincipals with All permission and Data-Analyst with Select permission.
    3. Similarly, verify permissions for the database by filtering database=iceberg_db.

You should see IAMAllowedPrincipals with All permission and Data-Analyst with Describe permission.

  1. Verify Lake Formation opt-in for Data-Analyst:
    1. In the navigation pane, choose Hybrid access mode.

You should see Data-Analyst opted-in for both database and table level permissions.

Query the table as the Data-Analyst role in Athena

While you are logged in to the AWS Management Console as admin, set up the Athena query results bucket:

  1. On the console navigation bar, choose your user name.
  2. Choose Switch role to switch to the Data-Analyst role.
  3. Enter your account ID, IAM role name (Data-Analyst), and choose Switch Role.
  4. Now that you’re logged in as the Data-Analyst role, open the Athena console and set up the Athena query results bucket.
  5. Run the following query to read the Iceberg table. This verifies the Select permission granted to the Data-Analyst role in Lake Formation.
SELECT * FROM "iceberg_db"."customer_iceberg"
WHERE c_customer_sk = 247

Upsert data as ETL-application-role using Amazon EMR

To upsert data to Lake Formation enabled Iceberg tables, we will use Amazon EMR Studio, which is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio will be our web-based IDE to run our notebooks, and we will use EMR Serverless as the compute engine. EMR Serverless is a deployment option for Amazon EMR that provides a serverless runtime environment. For the steps to run an interactive notebook, see Submit a job run or interactive workload.

  1. Sign out of the AWS console as Data-Analyst and log back or switch the user to admin.
  2. On the Amazon EMR console, choose EMR Serverless in the navigation pane.
  3. Choose Get started.
  4. For first-time users, Amazon EMR allows creation of an EMR Studio without a virtual private cloud (VPC). Create an EMR Serverless application as follows:
    1. Provide a name for the EMR Serverless application, such as DemoHybridAccess.
    2. Under Application setup, choose Use default settings for interactive workloads.
    3. Choose Create and start application.

The next step is to create an EMR Studio.

  1. On the Amazon EMR console, choose Studio under EMR Studio in the navigation pane.
  2. Choose Create Studio.
  3. Select Interactive workloads.
  4. You should see a default pre-populated section. Keep these default settings and choose Create Studio and launch Workspace.

  1. After the workspace is launched, attach the EMR Serverless application created earlier and select ETL-application-role as the runtime role under Compute.

  1. Download the notebook Iceberg-hybridaccess_final.ipynb and upload it to EMR Studio workspace.

This notebook configures the metastore properties to work with Iceberg tables. (For more details, see Using Apache Iceberg with EMR Serverless.) Then it performs insert, update, and delete operations in the Iceberg table. It also verifies if the operations are successful by reading the newly added data.

  1. Select PySpark as the kernel and execute each cell in the notebook by choosing the run icon.

Refer to Submit a job run or interactive workload for further details about how to run an interactive notebook.

The following screenshot shows that the Iceberg table insert operation completed successfully.

The following screenshot illustrates running the update statement on the Iceberg table in the notebook.

The following screenshot shows that the Iceberg table delete operation completed successfully.

Query the table again as Data-Analyst using Athena

Complete the following steps:

  1. Switch your role to Data-Analyst on the AWS console.
  2. Run the following query on the Iceberg table and read the row that was updated by the EMR cluster:
    SELECT * FROM "iceberg_db"."customer_iceberg"
    WHERE c_customer_sk = 247

The following screenshot shows the results. As we can see, ‘c_first_name’ column is updated with new value.

Clean up

To avoid incurring costs, clean up the resources you used for this post:

  1. Revoke the Lake Formation permissions and hybrid access mode opt-in granted to the Data-Analyst role and IAMAllowedPrincipals.
  2. Revoke the registration of the S3 bucket to Lake Formation.
  3. Delete the Athena query results from your S3 bucket.
  4. Delete the EMR Serverless resources.
  5. Delete Data-Analyst role and ETL-application-role from IAM.

Conclusion

In this post, we demonstrated how to scale the adoption and use of Iceberg tables using Lake Formation permissions for read workloads, while maintaining full control over table schema and data updates through IAM policy-based permissions for the table owners. The methodology also applies to other open table formats and standard Data Catalog tables, but the Apache Spark configuration for each open table format will vary.

Hybrid access mode in Lake Formation is an option you could use to adopt Lake Formation permissions gradually and scale those use cases that support Lake Formation permissions while using IAM based permissions for the use cases that don’t. We encourage you to try out this setup in your environment. Please share your feedback and any additional topics you would like to see in the comments section.


About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She collaborates with the service team to enhance product features, works with AWS customers and partners to architect lake house solutions, and establishes best practices.

Parul Saxena is a Senior Big Data Specialist Solutions Architect in AWS. She helps customers and partners build highly optimized, scalable, and secure solutions. She specializes in Amazon EMR, Amazon Athena, and AWS Lake Formation, providing architectural guidance for complex big data workloads and assisting organizations in modernizing their architectures and migrating analytics workloads to AWS.

AWS Lake Formation 2023 year in review

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/aws-lake-formation-2023-year-in-review/

AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2022, we talked about the enhancements we had done to these services. We continue to listen to customer stories and work backwards to incorporate their thoughts in our products. In this post, we are happy to summarize the results of our hard work in 2023 to improve and simplify data governance for customers.

We announced our new features and capabilities during AWS re:Invent 2023, as is our custom every year. The following are re:Invent 2023 talks showcasing Lake Formation and Data Catalog capabilities:

We group the new capabilities into four categories:

  • Discover and secure
  • Connect with data sharing
  • Scale and optimize
  • Audit and monitor

Let’s dive deeper and discuss the new capabilities introduced in 2023.

Discover and secure

Using Lake Formation and the Data Catalog as the foundational building blocks, we launched Amazon DataZone in October 2023. DataZone is a data management service that makes it faster and more straightforward for you to catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources. The publishing and subscription workflows of DataZone enhance collaboration between various roles in your organization and speed up the time to derive business insights from your data. You can enhance the technical metadata of the Data Catalog using AI-powered assistants into business metadata of DataZone, making it more easily discoverable. DataZone automatically manages the permissions of your shared data in the DataZone projects. To learn more about DataZone, refer to the User Guide. Bienvenue dans DataZone!

AWS Glue crawlers classify data to determine the format, schema, and associated properties of the raw data, group data into tables or partitions, and write metadata to the Data Catalog. In 2023, we released several updates to AWS Glue crawlers. We added the ability to bring your custom versions of JDBC drivers in crawlers to extract data schemas from your data sources and populate the Data Catalog. To optimize partition retrieval and improve query performance, we added the feature for crawlers to automatically add partition indexes for newly discovered tables. We also integrated crawlers with Lake Formation, supporting centralized permissions for in-account and cross-account crawling of S3 data lakes. These are some much sought-after improvements that simplify your metadata discovery using crawlers. Crawlers, salut!

We have also seen a tremendous rise in the usage of open table formats (OTFs) like Linux Foundation Delta Lake, Apache Iceberg, and Apache Hudi. To support these popular OTFs, we added support to natively crawl these three table formats into the Data Catalog. Furthermore, we worked with other AWS analytics services, such as Amazon EMR, to enable Lake Formation fine-grained permissions on all the three open table formats. We encourage you to explore which features of Lake Formation are supported for OTF tables. Bien intégré!

As the data sources and types increase over time, you are bound to have nested data types in your data lake sooner or later. To bring data governance to these datasets without flattening them, Lake Formation added support for fine-grained access controls on nested data types and columns. We also added support for Lake Formation fine-grained access controls while running Apache Hive jobs on Amazon EMR on EC2 and on Amazon EMR Studio. With Amazon EMR Serverless, fine-grained access control with Lake Formation is now available in preview. Connecté les points!

At AWS, we work very closely with our customers to understand their experience. We came to understand that onboarding to Lake Formation from AWS Identity and Access Management (IAM) based permissions for Amazon S3 and the AWS Glue Data Catalog could be streamlined. We realized that your use cases need more flexibility in data governance. With the hybrid access mode in Lake Formation, we introduced selective addition of Lake Formation permissions for some users and databases, without interrupting other users and workloads. You can define a catalog table in hybrid mode and grant access to new users like data analysts and data scientists using Lake Formation while your production extract, transform, and load (ETL) pipelines continue to use their existing IAM-based permissions. Double victoire!

Let’s talk about identity management. You can use IAM principals, Amazon Quicksight users and groups, and external accounts and IAM principals in external accounts to grant access to Data Catalog resources in Lake Formation. What about your corporate identities? Do you need to create and maintain multiple IAM roles and map them to various corporate identities? You could see the IAM role that accessed the table, but how could you find out which user accessed it? To answer these questions, Lake Formation integrated with AWS IAM Identity Center and added the feature for trusted identity propagation. With this, you can grant fine-grained access permissions to the identities from your organization’s existing identity provider. Other AWS analytics services also support the user identity to be propagated. Your auditors can now see that the user [email protected], for example, had accessed the table managed by Lake Formation permissions using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Intégration facile!

Now you don’t have to worry about moving the data or copying the Data Catalog to another AWS Region to use the AWS services for data governance. We have expanded and made Lake Formation available in all Regions in 2023. Et voila!

Connect with data sharing

Lake Formation provides a straightforward way to share Data Catalog objects like databases and tables with internal and external users. This mechanism empowers organizations with quick and secure access to data and speeds up their business decision-making. Let’s review the new features and enhancements made in 2023 under this theme.

The AWS Glue Data Catalog is the central and foundational component of data governance for both Lake Formation and DataZone. In 2023, we extended the Data Catalog through federation to integrate with external Apache Hive metastores and Redshift datashares. We also made available the connector code, which you can customize to connect the Data Catalog with additional Apache Hive-compatible metastores. These integrations pave the way to get more metadata into the Data Catalog, and allow fine-grained access controls and sharing of these resources across AWS accounts effortlessly with Lake Formation permissions. We also added support to access the Data Catalog table of one Region from other Regions using cross-Region resource links. This enhancement simplifies many use cases to avoid metadata duplication.

With the AWS CloudTrail Lake federation feature, you can discover, analyze, join, and share CloudTrail Lake data with other data sources in Data Catalog. For CloudTrail Lake, fine-grained access controls and querying and visualizing capabilities are available through Athena.

We further extended the Data Catalog capabilities to support uniform views across your data lake. You can create views using different SQL dialects and query from Athena, Redshift Spectrum, and Amazon EMR. This allows you to maintain permissions at the view level and not share the individual tables. The Data Catalog views feature is available in preview, announced at re:Invent 2023.

Scale and optimize

As SQL queries get more complex with the data changes over time or has multiple joins, a cost-based optimizer (CBO) can drive optimizations in the query plan and lead to faster performance, based on statistics of the data in the tables. In 2023, we added support for column-level statistics for tables in the Data Catalog. Customers are already seeing query performance improvements in Athena and Redshift Spectrum, with table column statistics turned on. Suivez les chiffres!

Tag-based access control removes the need to update your policies every time a new resource is added to the data lake. Instead, data lake administrators create Lake Formation Tags (LF-Tags) to tag Data Catalog objects and grant access based on these LF-Tags to users and groups. In 2023, we added support for LF-Tag delegation, where data lake administrators can give permissions to data stewards and other users to manage LF-Tags without the need for administrator privileges. LF-Tag democratization!

Apache Iceberg format uses metadata to keep track of the data files that make up the table. Changes to tables, like inserts or updates, result in new data files being created. As the number of data files for a table grows, the queries using that table can become less efficient. To improve query performance on the Iceberg table, you need to reduce the number of data files by compacting the smaller change capture files into bigger files. Users typically create and run scripts to perform optimization of these Iceberg table files in their own servers or through AWS Glue ETL. To alleviate this complex maintenance of Iceberg tables, customers approached us for a better solution. We introduced the feature for automatic compaction of Apache Iceberg tables in the Data Catalog. After you turn on automatic compaction, the Data Catalog automatically manages the metadata of the table and gives you an always-optimized Amazon S3 layout for your Iceberg tables. To learn more, check out Optimizing Iceberg tables. Automatique!

Audit and monitor

Knowing who has access to what data is a critical component of data governance. Auditors need to validate that the right metadata and data permissions are set in Lake Formation and the Data Catalog. Data lake administrators have full access to permissions and metadata, and can grant access to the data itself. To provide auditors with an option to search and review metadata permissions without granting them access to make changes to permissions, we introduced the read-only administrator role in Lake Formation. This role allows you to audit the catalog metadata and Lake Formation permissions and LF-Tags while restricting it from making any changes to them.

Conclusion

We had an amazing 2023, developing product enhancements to help you simplify and enhance your data governance using Lake Formation and Data Catalog. We invite you to try these new features. The following is a list of our launch posts for reference:

We will continue to innovate on behalf of our customers in 2024. Please share your thoughts, use cases, and feedback for our product improvements in the comments section or through your AWS account teams. We wish you a happy and prosperous 2024. Bonne année!


About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Leon Stigter is a Senior Technical Product Manager with AWS Lake Formation. Leon’s focus is on helping developers build data lakes faster, with seamless connectivity to analytical tools, to transform data into game-changing insights. Leon is interested in data and serverless technologies, and enjoys exploring different cities on his mission to taste cheesecake everywhere he goes.

Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/introducing-hybrid-access-mode-for-aws-glue-data-catalog-to-secure-access-using-aws-lake-formation-and-iam-and-amazon-s3-policies/

AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage access control for your data lake data in Amazon Simple Storage Service (Amazon S3) and its metadata in AWS Glue Data Catalog in one place with familiar database-style features. You can use fine-grained data access control to verify that the right users have access to the right data down to the cell level of tables. Lake Formation also makes it simpler to share data internally across your organization and externally. Further, Lake Formation integrates with AWS analytics services such as Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL for Apache Spark. These services allow querying Lake Formation managed tables, thus helping you extract business insights from the data quickly and securely.

Before the introduction of Lake Formation and its database-style permissions for data lakes, you had to manage access to your data in the data lake and its metadata separately through AWS Identity and Access Management (IAM) policies and S3 bucket policies. With an IAM and Amazon S3 access control mechanism, which is more complex and less granular compared to Lake Formation, you need more time to migrate to Lake Formation because a given database or table in the data lake could have its access controlled by either IAM and S3 policies or Lake Formation policies, but not both. Also, various use cases operate on the data lakes. Migrating all use cases from one permissions model to another in a single step without disruption was challenging for operations teams.

To ease the transition of data lake permissions from an IAM and S3 model to Lake Formation, we’re introducing a hybrid access mode for AWS Glue Data Catalog. Please refer to the What’s New and documentation. This feature lets you secure and access the cataloged data using both Lake Formation permissions and IAM and S3 permissions. Hybrid access mode allows data administrators to onboard Lake Formation permissions selectively and incrementally, focusing on one data lake use case at a time. For example, say you have an existing extract, transform and load (ETL) data pipeline that uses the IAM and S3 policies to manage data access. Now you want to allow your data analysts to explore or query the same data using Amazon Athena. You can grant access to the data analysts using Lake Formation permissions, to include fine-grained controls as needed, without changing access for your ETL data pipelines.

Hybrid access mode allows both permission models to exist for the same database and tables, providing greater flexibility in how you manage user access. While this feature opens two doors for a Data Catalog resource, an IAM user or role can access the resource using only one of the two permissions. After Lake Formation permission is enabled for an IAM principal, authorization is completely managed by Lake Formation and existing IAM and S3 policies are ignored. AWS CloudTrail logs provide the complete details of the Data Catalog resource access in Lake Formation logs and S3 access logs.

In this blog post, we walk you through the instructions to onboard Lake Formation permissions in hybrid access mode for selected users while the database is already accessible to other users through IAM and S3 permissions. We will review the instructions to set-up hybrid access mode within an AWS account and between two accounts.

Scenario 1 – Hybrid access mode within an AWS account

In this scenario, we walk you through the steps to start adding users with Lake Formation permissions for a database in Data Catalog that’s accessed using IAM and S3 policy permissions. For our illustration, we use two personas:  Data-Engineer, who has coarse grained permissions using an IAM policy and an S3 bucket policy to run an AWS Glue ETL job and Data-Analyst, whom we will onboard with fine grained Lake Formation permissions to query the database using Amazon Athena.

Scenario 1 is depicted in the diagram shown below, where the Data-Engineer role accesses the database hybridsalesdb using IAM and S3 permissions while Data-Analyst role will access the database using Lake Formation permissions.

Prerequisites

To set up Lake Formation and IAM and S3 permissions for a Data Catalog database with Hybrid access mode, you must have the following prerequisites:

  • An AWS account that isn’t used for production applications.
  • Lake Formation already set up in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. For example, we’re using a data lake administrator role called LF-Admin. To learn more about setting up permissions for a data lake administrator role, see Create a data lake administrator.
  • A sample database in the Data Catalog with a few tables. For example, our sample database is called hybridsalesdb and has a set of eight tables, as shown in the following screenshot. You can use any of your datasets to follow along.

Personas and their IAM policy setup

There are two personas that are IAM roles in the account: Data-Engineer and Data-Analyst. Their IAM policies and access are described as follows.

The following IAM policy on the Data-Engineer role allows access to the database and table metadata in the Data Catalog.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue: Get*"
            ],
            "Resource": [
                "arn:aws:glue:<Region>:<account-id>:catalog",
                "arn:aws:glue:<Region>:<account-id>:database/hybridsalesdb",
                "arn:aws:glue:<Region>:<account-id>:table/hybridsalesdb/*"
            ]
        }
    ]
}

The following IAM policy on the Data-Engineer role grants data access to the underlying Amazon S3 location of the database and tables.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataLakeBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:Put*",
                "s3:Get*",
                "s3:Delete*"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/<prefix>/"
            ]
        }
    ]
}

The Data-Engineer also has access to the AWS Glue console using the AWS managed policy arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess and regressive iam:Passrole to run an AWS Glue ETL script as below.

{
    "Version": "2012-10-17",
    "Statement": [
       {
           "Sid": "PassRolePermissions",
           "Effect": "Allow",
           "Action": [
               " iam:PassRole" ],
           "Resource": [  
		   "arn:aws:iam::<account-id>:role/Data-Engineer"
            ]
        }
    ]
}

The following policy is also added to the trust policy of the Data-Engineer role to allow AWS Glue to assume the role to run the ETL script on behalf of the role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

See AWS Glue studio set up for additional permissions required to run an AWS Glue ETL script.

The Data-Analyst role has the data lake basic user permissions as described in Assign permissions to Lake Formation users.

{
"Version": "2012-10-17",
"Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "glue:GetTable",
            "glue:GetTables",
            "glue:GetTableVersions",
            "glue:SearchTables",
            "glue:GetDatabase",
            "glue:GetDatabases",
            "glue:GetPartitions",
            "lakeformation:GetDataAccess",
            "lakeformation:GetResourceLFTags",
            "lakeformation:ListLFTags",
            "lakeformation:GetLFTag",
            "lakeformation:SearchTablesByLFTags",
            "lakeformation:SearchDatabasesByLFTags"
        ],
        "Resource": "*"
    }
    ]
}

Additionally, the Data-Analyst has permissions to write Athena query results to an S3 bucket that isn’t managed by Lake Formation and Athena console full access using the AWS managed policy arn:aws:iam::aws:policy/AmazonAthenaFullAccess.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<athena-results-bucket>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:Put*",
                "s3:Get*",
                "s3:Delete*"
            ],
            "Resource": [
                "arn:aws:s3:::<athena-results-bucket>/*"
            ]
        }
    ]
}

Set up Lake Formation permissions for Data-Analyst

Complete the following steps to configure your data location in Amazon S3 with Lake Formation in hybrid access mode and grant access to the Data-Analyst role.

  1. Sign in to the AWS Management Console as a Lake Formation administrator role.
  2. Go to Lake Formation.
  3. Select Data lake locations from the left navigation bar under Administration.
  4. Select Register location and provide the Amazon S3 location of your database and tables. Provide an IAM role that has access to the data in the S3 location. For more details see Requirements for roles used to register locations.
  5. Select the Hybrid access mode under Permission mode and choose Register location.
  6. Select Data lake locations under Administration from the left navigation bar. Review that the registered location shows as Hybrid access mode for Permission mode.
  7. Select Databases from Catalog on the left navigation bar. Choose hybridsalesdb. You will select the database that has the data in the S3 location that you registered in the preceding step. From the Actions drop down menu, select Grant.
  8. Select Data-Analyst for IAM users and roles. Under LF-Tags or catalog resources, select Named Data Catalog resources and select hybridsalesdb for Databases.
  9. Under Database permissions, select Describe. Under Hybrid access mode, select the checkbox Make Lake Formation permissions effective immediately. Choose Grant.
  10. Again, select Databases from Catalog on the left navigation bar. Choose hybridsalesdb. Select Grant from the Actions drop down menu.
  11. On the Grant window, select Data-Analyst for IAM users and roles. Under LF-Tags or catalog resources, choose Named Data Catalog resources and select hybridsalesdb for Databases.
  12. Under Tables, select the three tables named hybridcustomer, hybridproduct, and hybridsales_order from the drop down.
  13. Under Table permissions, select Select and Describe permissions for the tables.
  14. Select the checkbox under Hybrid access mode to make the Lake Formation permissions effective immediately.
  15. Choose Grant.
  16. Review the granted permissions by selecting the Data lake permissions under Permissions on the left navigation bar. Filter Data permissions by Principal = Data-Analyst.
  17. On the left navigation bar, select Hybrid access mode. Verify that the opted in Data-Analyst shows up for the hybridsalesdb database and the three tables.
  18. Sign out from the console as the Lake Formation administrator role.

Validating Lake Formation permissions for Data-Analyst

  1. Sign in to the console as Data-Analyst.
  2. Go to the Athena console. If you’re using Athena for the first time, set up the query results location to your S3 bucket as described in Specifying a query result location.
  3. Run preview queries on the table from the Athena query editor.

Validating IAM and S3 permissions for Data-Engineer

  1. Sign out as Data-Analyst and sign back in to the console as Data-Engineer.
  2. Open the AWS Glue console and select ETL jobs from the left navigation bar.
  3. Under Create job, select Spark script editor. Choose Create.
  4. Download and open the sample script provided here.
  5. Copy and paste the script into your studio script editor as a new job.
  6. Edit the catalog_id, database, and table_name to suit your sample.
  7. Save and Run your AWS Glue ETL script by providing the IAM role of Data-Engineer to run the job.
  8. After the ETL script succeeds, you can select the output logs link from the Runs tab of the ETL script.
  9. Review the table’s schema, top 20 rows, and the total number of rows and columns from the AWS CloudWatch logs.

Thus, you can add Lake Formation permissions to a new role to access a Data Catalog database without interfering with another role that is accessing the same database through IAM and S3 permissions.

Scenario 2 – Hybrid access mode set up between two AWS accounts

This is a cross-account sharing scenario where a data producer shares a database and its tables to a consumer account. The producer provides full database access for an AWS Glue ETL workload on the consumer account. At the same time, the producer shares a few tables of the same database to the consumer account using Lake Formation. We walk you through how you can use hybrid access mode to support both access methods.

Prerequisites

  • Cross-account sharing of a database or table location that’s registered in hybrid access mode requires the producer or the grantor account to be in version 4 of cross-account sharing in the catalog setting to grant permissions on the hybrid access mode resource. When moving from version 3 to version 4 of cross-account sharing, existing Lake Formation permissions aren’t affected for database and table locations that are already registered with Lake Formation (Lake Formation mode). For new data set location registration in hybrid access mode and new Lake Formation permissions on this catalog resource, you will need version 4 of cross-account sharing.
  • The consumer or recipient account can use other versions of cross-account sharing. If your accounts are using version 1 or version 2 of cross-account sharing and if you want to upgrade, follow Updating cross-account data sharing version settings to first upgrade the catalog setting of cross-account sharing to version 3, before upgrading to version 4.

The producer account set up is similar to that of scenario 1 and we discuss the extra steps for scenario 2 in the following section.

Set up in producer account A

The consumer Data-Engineer role is granted Amazon S3 data access using the producer’s S3 bucket policy and Data Catalog access using the producer’s Data Catalog resource policy.

The S3 bucket policy in the producer account follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
        "Sid": "data engineer role permissions",
        "Effect": "Allow",
        "Principal": {
            "AWS": "arn:aws:iam::<consumer-account-id>:role/Data-Engineer"
        },
        "Action": [
            "s3:GetLifecycleConfiguration",
            "s3:ListBucket",
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
        ],
        "Resource": [
            "arn:aws:s3:::<producer-account-databucket>",
            "arn:aws:s3:::<producer-account-databucket>/*"
        ]
        }
    ]
}

The Data Catalog resource policy in the producer account is shown below. You also need the glue:ShareResource IAM permission for AWS Resource Access Manager (AWS RAM) to enable cross-account sharing.

{
"Version" : "2012-10-17",
"Statement" : [
    {
    "Effect" : "Allow",
    "Principal" : {
        "AWS" : "arn:aws:iam::<consumer-account-id>:role/Data-Engineer"
    },
    "Action" : "glue:Get*",
    "Resource" : [
        "arn:aws:glue:<Region>:<producer-account-id>:catalog", 
        "arn:aws:glue:<Region>:<producer-account-id>:database/hybridsalesdb", 
        "arn:aws:glue:<Region>:<producer-account-id>:table/hybridsalesdb/*"
    ]
    },
    {
        "Effect" : "Allow",
        "Principal" : {
        "Service" : "ram.amazonaws.com"
        },
        "Action" : "glue:ShareResource",
        "Resource" : [
            "arn:aws:glue:<Region>:<producer-account-id>:table/*/*", 
            "arn:aws:glue:<Region>:<producer-account-id>:database/*", 
            "arn:aws:glue:<Region>:<producer-account-id>:catalog"
        ]
        }
    ]
}

Setting the cross-account version and registering the S3 bucket

  1. Sign in to the Lake Formation console as an IAM administrator role or a role with IAM permissions to the PutDataLakeSettings() API. Choose the AWS Region where you have your sample data set in an S3 bucket and its corresponding database and tables in the Data Catalog.
  2. Select Data catalog settings from the left navigation bar under Administration. Select Version 4 from the dropdown menu for Cross account version settings. Choose Save.
    Note: If there are any other accounts in your environment that share catalog resources to your producer account through Lake Formation, upgrading the sharing version might impact them. See <title of documentation page> for more information.
  3. Sign out as IAM administrator and sign back in to the Lake Formation console as a Lake Formation administrator role.
  4. Select Data lake locations from the left navigation bar under Administration.
  5. Select Register location and provide the S3 location of your database and tables.
  6. Provide an IAM role that has access to the data in the S3 location. For more details about this role requirement, see Requirements for roles used to register locations.
  7. Choose the Hybrid access mode under Permission mode, and then choose Register location.
  8. Select Data lake locations under Administration from the left navigation bar. Confirm that the registered location shows as Hybrid access mode for Permission mode.

Granting cross-account permissions

The steps to share the database hybridsalesdb to the consumer account are similar to the steps to set up scenario 1.

  1. In the Lake Formation console, select Databases from Catalog on the left navigation bar. Choose hybridsalesdb. Select your database that has the data in the S3 location that you registered previously. From the Actions drop down menu, select Grant.
  2. Select External accounts under Principals and provide the consumer account ID. Select Named catalog resources under LF-Tags or catalog resources. Choose hybridsalesdb for Databases.
  3. Select Describe for Database permissions and for Grantable permissions.
  4. Under Hybrid access mode, select the checkbox for Make Lake Formation permissions effective immediately. Choose Grant.

Note: Selecting the checkbox opts-in the consumer account Lake Formation administrator roles to use Lake Formation permissions without interrupting access to the consumer account’s IAM and S3 access for the same database.

  1. Repeat step 2 up to database selection to grant permission to the consumer account ID for table level permission. Select any three tables from the drop-down menu for table level permission under Tables.
  2. Select Select under Table permissions and Grantable permissions. Select the checkbox for Make Lake Formation permissions effective immediately under Hybrid access mode. Choose Grant.
  3. Select the Data lake permissions  on the left navigation bar. Verify the granted permissions to the consumer account.
  4. Select the Hybrid access mode on the left navigation bar. Verify the opted-in resources and principal.

You have now enabled cross-account sharing using Lake Formation permissions without revoking access to the IAMAllowedPrincipal virtual group.

Set up in consumer account B

In scenario 2, the Data-Analyst and Data-Engineer roles are created in the consumer account similar to scenario 1, but these roles access the database and tables shared from the producer account.

In addition to arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess and arn:aws:iam::aws:policy/CloudWatchFullAccess, the  Data-Engineer role also has permissions to create and run an Apache Spark job in AWS Glue Studio.

Data-Engineer has the following IAM policy that grants access to the producer account’s S3 bucket, which is registered with Lake Formation in hybrid access mode.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataLakeBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetLifecycleConfiguration",
                "s3:Put*",
                "s3:Get*",
                "s3:Delete*"
            ],
            "Resource": [
                "arn:aws:s3:::<producer-account-databucket>/*",
                "arn:aws:s3:::<producer-account-databucket>"
            ]
        }
    ]
}

Data-Engineer has the following IAM policy that grants access to the consumer account’s entire Data Catalog and producer account’s database hybridsalesdb and its tables.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:*"
            ],
            "Resource": [
                "arn:aws:glue:<Region>:<consumer-account-id>:catalog",
                "arn:aws:glue:<Region>:<consumer-account-id>:database/*",
                "arn:aws:glue:<Region>:<consumer-account-id>:table/*/*",

            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:Get*"
            ],
            "Resource": [
                "arn:aws:glue:<Region>:<producer-account-id>:catalog",
                "arn:aws:glue:<Region>:<producer-account-id>:database/hybridsalesdb",
                "arn:aws:glue:<Region>:<producer-account-id>:table/hybridsalesdb/*"
            ]
        }
    ]
}

The Data-Analyst has the same IAM policies similar to scenario 1, granting basic data lake user permissions. For additional details, see Assign permissions to Lake Formation users.

Accepting AWS RAM invites

  1. Sign in to the Lake Formation console as a Lake Formation administrator role.
  2. Open the AWS RAM console. Select Resource shares from Shared with me on the left navigation bar. You should see two invites from the producer account, one for database level share and one for table level share.
  3. Select each invite, review the producer account ID, and choose Accept resource share.

Granting Lake Formation permissions to Data-Analyst

  1. Open the Lake Formation console. As a Lake Formation administrator, you should see the shared database and tables from the consumer account.
  2. Select Databases from the Data catalog on the left navigation bar. Select the radio button on the database hybridsalesdb and select Create resource link from the Actions drop down menu.
  3. Enter rl_hybridsalesdb as the name for the resource link and leave the rest of the selections as they are. Choose Create.
  4. Select the radio button for rl_hybridsalesdb. Select Grant from the Actions drop down menu.
  5. Grant Describe permissions on the resource link to Data-Analyst.
  6. Again, select the radio button on rl_hybridsalesdb from the Databases under Catalog in the left navigation bar. Select Grant on target from the Actions drop down menu.
  7. Select Data-Analyst for IAM users and roles, keep the already selected database hybridsalesdb.
  8. Select Describe under Database permissions. Select the checkbox for Make Lake Formation permissions effective immediately under Hybrid access mode. Choose Grant.
  9. Select the radio button on rl_hybridsalesdb from Databases under Catalog in the left navigation bar. Select Grant on target from the Actions drop down menu.
  10. Select Data-Analyst for IAM users and roles. Select All tables of the database hybridsalesdb. Select Select under Table permissions.
  11. Select the checkbox for Make Lake Formation permissions effective immediately under Hybrid access mode.
  12. View and verify the permissions granted to Data-Analyst from the Data lake permissions tab on the left navigation bar.
  13. Sign out as Lake Formation administrator role.

Validate Lake Formation permissions as Data-Analyst

  1. Sign back in to the console as Data-Analyst.
  2. Open the Athena console. If you’re using Athena for the first time, set up the query results location to your S3 bucket as described in Specifying a query result location.
    • In the Query Editor page, under Data, select AWSDataDatalog for Data source.  For Tables, select the three dots next to any of the table names. Select Preview Table to run the query.
  3. Sign out as Data-Analyst.

Validate IAM and S3 permissions for Data-Engineer

  1. Sign back in to the console as Data-Engineer.
  2. Using the same steps as scenario 1, verify IAM and S3 access by running the AWS Glue ETL script in AWS Glue Studio.

You’ve added Lake Formation permissions to a new role Data-Analyst, without interrupting existing IAM and S3 access to Data-Engineer for a cross-account sharing use-case.

Clean up

If you’ve used sample datasets from your S3 for this blog post, we recommend removing relevant Lake Formation permissions on your database for the Data-Analyst role and cross-account grants. You can also remove the hybrid access mode opt-in and remove the S3 bucket registration from Lake Formation. After removing all Lake Formation permissions from both the producer and consumer accounts, you can delete the Data-Analyst and Data-Engineer IAM roles.

Considerations

Currently, only a Lake Formation administrator role can opt in other users to use Lake Formation permissions for a resource, since opting in user access using either Lake Formation or IAM and S3 permissions is an administrative task requiring full knowledge of your organizational data access setup. Further, you can grant permissions and opt in at the same time using only the named-resource method and not LF-Tags. If you’re using LF-Tags to grant permissions, we recommend you use the Hybrid access mode option on the left navigation bar to opt in (or the equivalent CreateLakeFormationOptin() API using the AWS SDK or AWS CLI) as a subsequent step after granting permissions.

Conclusion

In this blog post, we went through the steps to set up hybrid access mode for Data Catalog. You learned how to onboard users selectively to the Lake Formation permissions model. The users who had access through IAM and S3 permissions continued to have their access without interruptions. You can use Lake Formation to add fine-grained access to Data Catalog tables to enable your business analysts to query using Amazon Athena and Amazon Redshift Spectrum, while your data scientists can explore the same data using Amazon Sagemaker. Data engineers can continue to use their IAM and S3 permissions on the same data to run workloads using Amazon EMR and AWS Glue. Hybrid access mode for the Data Catalog enables a variety of analytical use-cases for your data without data duplication.

To get started, see the documentation for hybrid access mode. We encourage you to check out the feature and share your feedback in the comments section. We look forward to hearing from you.


About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/configure-cross-region-table-access-with-the-aws-glue-catalog-and-aws-lake-formation/

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. Companies also have employees and do business across multiple geographic regions and even around the world. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions.

The AWS Glue Data Catalog and AWS Lake Formation recently announced support for cross-Region table access. This feature lets users query AWS Glue databases and tables in one Region from another Region using resource links, without copying the metadata in the Data Catalog or the data in Amazon Simple Storage Service (Amazon S3). A resource link is a Data Catalog object that is a link to a database or table.

The AWS Glue Data Catalog is a centralized repository of technical metadata that holds the information about your datasets in AWS, and can be queried using AWS analytics services such as Amazon Athena, Amazon EMR, and AWS Glue for Apache Spark. The Data Catalog is localized to every Region in an AWS account, requiring users to replicate the metadata and the source data in S3 buckets for cross-Region queries. With the newly launched feature for cross-Region table access, you can create a resource link in any Region pointing to a database or table of the source Region. With the resource link in the local Region, you can query the source Region’s tables from Athena, Amazon EMR, and AWS Glue ETL in the local Region.

You can use the cross-Region table access feature of the Data Catalog in combination with the permissions management and cross-account sharing capability of Lake Formation. Lake Formation is a fully managed service that makes it easy to build, secure, and manage data lakes. By using cross-Region access support for Data Catalog, together with governance provided by Lake Formation, organizations can discover and access data across Regions without spending time making copies. Some businesses might have restrictions to run their compute in certain Regions. Organizations that need to share their Data Catalog with businesses that have such restrictions can now create and share cross-Region resource links.

In this post, we walk you through configuring cross-Region database and table access in two scenarios. In the first scenario, we go through an example where a customer wants to access an AWS Glue database in Region A from Region B in the same account. In scenario two, we demonstrate cross-account and cross-Region access where a customer wants to share a database in Region A across accounts and access it from Region B of the recipient account.

Scenario 1: Same account use case

In this scenario, we walk you through the steps required to share a Data Catalog database from one Region to another Region within the same AWS account. For our illustrations, we have a sample dataset in an S3 bucket in the us-east-2 Region and have used an AWS Glue crawler to crawl and catalog the dataset into a database in the Data Catalog of the us-east-2 Region. We share this dataset to the us-west-2 Region. You can use any of your datasets to follow along. The following diagram illustrates the architecture for cross-Region sharing within the same AWS account.

Prerequisites

To set up cross-Region sharing of a Data Catalog database for scenario 1, we recommend the following prerequisites:

  • An AWS account that is not used for production use cases.
  • Lake Formation set up already in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. For example, we are using a data lake administrator role called LF-Admin. The LF-Admin role also has the AWS Identity and Access Management (IAM) permission iam:PassRole on the AWS Glue crawler role. To learn more about setting up permissions for a data lake administrator, see Create a data lake administrator.
  • A sample database in the Data Catalog with a few tables. For example, our sample database is called salesdb_useast2 and has a set of eight tables, as shown in the following screenshot.

Set up permissions for us-east-2

Complete the following steps to configure permissions in the us-east-2 Region:

  1. Log in to the Lake Formation console and choose the Region where your database resides. In our example, it is us-east-2 Region.
  2. Grant SELECT and DESCRIBE permissions to the LF-Admin role on all tables of the database salesdb_useast2.
  3. You can confirm if permissions are working by querying the database and tables as the data lake administrator role from Athena.

Set up permissions for us-west-2

Complete the following steps to configure permissions in the us-west-2 Region:

  1. Choose the us-west-2 Region on the Lake Formation console.
  2. Add LF-Admin as a data lake administrator and grant Create database permission to LF-Admin.
  3. In the navigation pane, under Data catalog, select Databases.
  4. Choose Create database and select Resource link.
  5. Enter rl_salesdb_from_useast2 as the name for the resource link.
  6. For Shared database’s region, choose US East (Ohio).
  7. For Shared database, choose salesdb_useast2.
  8. Choose Create.

This creates a database resource link in us-west-2 pointing to the database in us-east-2.

You will notice the Shared resource owner region column populate as us-east-2 for the resource link details on the Databases page.

Because the LF-Admin role created the resource link rl_salesdb_from_useast2, the role has implicit permissions on the resource link. LF-Admin already has permissions to query the table in the us-east-2 Region. There is no need to add a Grant on target permission for LF-Admin. If you are granting permission to another user or role, you need to grant Describe permissions on the resource link rl_salesdb_from_useast2.

  1. Query the database using the resource link in Athena as LF-Admin.

In the preceding steps, we saw how to create a resource link in us-west-2 for a Data Catalog database in us-east-2. You can also create a resource link to the source database in any additional Region where the Data Catalog is available. You can run extract, transform, and load (ETL) scripts in Amazon EMR and AWS Glue by providing the additional Region parameter when referring to the database and table. See the API documentation for GetTable() and GetDatabase() for additional details.

Also, Data Catalog permissions for the database, tables, and resource links and the underlying Amazon S3 data permissions can be managed by IAM policies and S3 bucket policies instead of Lake Formation permissions. For more information, see Identity and access management for AWS Glue.

Scenario 2: Cross-account use case

In this scenario, we walk you through the steps required to share a Data Catalog database from one Region to another Region between two accounts: a producer account and a consumer account. To show an advanced use case, we host the source dataset in us-east-2 of account A and crawl it using an AWS Glue crawler in the Data Catalog in us-east-1. The data lake administrator in account A then shares the database and tables to account B using Lake Formation permissions. The data lake administrator in account B accepts the share in us-east-1 and creates resource links to query the tables from eu-west-1. The following diagram illustrates the architecture for cross-Region sharing between producer account A and consumer account B.

Prerequisites

To set up cross-Region sharing of a Data Catalog database for scenario 2, we recommend the following prerequisites:

  • Two AWS accounts that are not used for production use cases
  • Lake Formation administrator roles in both accounts
  • Lake Formation set up in both accounts with cross-account sharing version 3. For more details, refer documentation.
  • A sample database in the Data Catalog with a few tables

For our example, we continue to use the same dataset and the data lake administrator role LF-Admin for scenario 2.

Set up account A for cross-Region sharing

To set up account A, complete the following steps:

  1. Sign in to the AWS Management Console as the data lake administrator role.
  2. Register the S3 bucket in Lake Formation in us-east-1 with an IAM role that has access to the S3 bucket. See registering your S3 location for instructions.
  3. Set up and run an AWS Glue crawler to catalog the data in the us-east-2 S3 bucket to the Data Catalog database useast2data_salesdb in us-east-1. Refer to AWS Glue crawlers support cross-account crawling to support data mesh architecture for instructions.

The database, as shown in the following screenshot, has a set of eight tables.

  1. Grant SELECT and DESCRIBE along with grantable permissions on all tables of the database to account B.

  2. Grant DESCRIBE with grantable permissions on the database.
  3. Verify the granted permissions on the Data permissions page.
  4. Log out of account A.

Set up account B for cross-Region sharing

To set up account B, complete the following steps:

  1. Sign in as the data lake administrator on the Lake Formation console in us-east-1.

In our example, we have created the data lake administrator role LF-Admin, similar to previous administrator roles in account A and scenario 1.

  1. On the AWS Resource Access Manager (AWS RAM) console, review and accept the AWS RAM invites corresponding to the shared database and tables from account A.

The LF-Admin role can see the shared database useast2data_salesdb from the producer account. LF-Admin has access to the database and tables and so doesn’t need additional permissions on the shared database.

  1. You can grant DESCRIBE on the database and SELECT on All_Tables permissions to any additional IAM principals from the us-east-1 Region on this shared database.
  2. Open the Lake Formation console in eu-west-1 (or any Region where you have Lake Formation and Athena already set up).
  3. Choose Create database and create a resource link named rl_useast1db_crossaccount, pointing to the us-east-1 database useast2data_salesdb.

You can choose any Region on the Shared database’s region drop-down menu and choose the databases from those Regions.

Because we’re using the data lake administrator role LF-Admin, we can see all databases from all Regions in the consumer account’s Data Catalog. A data lake user with restricted permissions will be able to see only those databases for which they have permissions to.

  1. Because LF-Admin created the resource link, this role has permissions to use the resource link rl_useast1db_crossaccount. For additional IAM principals, grant DESCRIBE permissions on the database resource link rl_useast1db_crossaccount.
  2. You can now query the database and tables from Athena.

Considerations

Cross-Region queries involve Amazon S3 data transfer by the analytics services, such as Athena, Amazon EMR, and AWS Glue ETL. As a result, cross-Region queries can be slower and will incur higher transfer costs compared to queries in the same Region. Some analytics services such as AWS Glue jobs and Amazon EMR may require internet access when accessing cross-Region data from Amazon S3, depending on your VPC set up. Refer to Considerations and limitations for more considerations.

Conclusion

In this post, you saw examples of how to set up cross-Region resource links for a database in the same account and across two accounts. You also saw how to use cross-Region resource links to query in Athena. You can share selected tables from a database instead of sharing an entire database. With cross-Region sharing, you can create a resource link for the table using the Create table option.

There are two key things to remember when using the cross-Region table access feature:

  • Grant permissions on the source database or table from its source Region.
  • Grant permissions on the resource link from the Region it was created in.

That is, the original shared database or table is always available in the source Region, and resource links are created and shared in their local Region.

To get started, see Accessing tables across Regions. Share your comments on the post or contact your AWS account team for more details.


About the author

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Query your Apache Hive metastore with AWS Lake Formation permissions

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/query-your-apache-hive-metastore-with-aws-lake-formation-permissions/

Apache Hive is a SQL-based data warehouse system for processing highly distributed datasets on the Apache Hadoop platform. There are two key components to Apache Hive: the Hive SQL query engine and the Hive metastore (HMS). The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive Metastore to retrieve metadata to run queries. The Hive metastore can be hosted on an Apache Hadoop cluster or can be backed by a relational database that is external to a Hadoop cluster. Although the Hive metastore stores the metadata of tables, the actual data of the table could be residing on Amazon Simple Storage Service (Amazon S3), the Hadoop Distributed File System (HDFS) of the Hadoop cluster, or any other Hive-supported data stores.

Because Apache Hive was built on top of Apache Hadoop, many organizations have been using the software from the time they have been using Hadoop for big data processing. Also, Hive metastore provides flexible integration with many other open-source big data software like Apache HBase, Apache Spark, Presto, and Apache Impala. Therefore, organizations have come to host huge volumes of metadata of their structured datasets in the Hive metastore. A metastore is a critical part of a data lake, and having this information available, wherever it resides, is important. However, many AWS analytics services don’t integrate natively with the Hive metastore, and therefore, organizations have had to migrate their data to the AWS Glue Data Catalog to use these services.

AWS Lake Formation has launched support for managing user access to Apache Hive metastores through a federated AWS Glue connection. Previously, you could use Lake Formation to manage user permissions on AWS Glue Data Catalog resources only. With the Hive metastore connection from AWS Glue, you can connect to a database in a Hive metastore external to the Data Catalog, map it to a federated database in the Data Catalog, apply Lake Formation permissions on the Hive database and tables, share them with other AWS accounts, and query them using services such as Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL (extract, transform, and load). For additional details on how the Hive metastore integration with Lake Formation works, refer to Managing permissions on datasets that use external metastores.

Use cases for Hive metastore integration with the Data Catalog include the following:

  • An external Apache Hive metastore used for legacy big data workloads like on-premises Hadoop clusters with data in Amazon S3
  • Transient Amazon EMR workloads with underlying data in Amazon S3 and the Hive metastore on Amazon Relational Database Service (Amazon RDS) clusters.

In this post, we demonstrate how to apply Lake Formation permissions on a Hive metastore database and tables and query them using Athena. We illustrate a cross-account sharing use case, where a Lake Formation steward in producer account A shares a federated Hive database and tables using LF-Tags to consumer account B.

Solution overview

Producer account A hosts an Apache Hive metastore in an EMR cluster, with underlying data in Amazon S3. We launch the AWS Glue Hive metastore connector from AWS Serverless Application Repository in account A and create the Hive metastore connection in account A’s Data Catalog. After we create the HMS connection, we create a database in account A’s Data Catalog (called the federated database) and map it to a database in the Hive metastore using the connection. The tables from the Hive database are then accessible to the Lake Formation admin in account A, just like any other tables in the Data Catalog. The admin continues to set up Lake Formation tag-based access control (LF-TBAC) on the federated Hive database and share it to account B.

The data lake users in account B will access the Hive database and tables of account A, just like querying any other shared Data Catalog resource using Lake Formation permissions.

The following diagram illustrates this architecture.

The solution consists of steps in both accounts. In account A, perform the following steps:

  1. Create an S3 bucket to host the sample data.
  2. Launch an EMR 6.10 cluster with Hive. Download the sample data to the S3 bucket. Create a database and external tables, pointing to the downloaded sample data, in its Hive metastore.
  3. Deploy the application GlueDataCatalogFederation-HiveMetastore from AWS Serverless Application Repository and configure it to use the Amazon EMR Hive metastore. This will create an AWS Glue connection to the Hive metastore that shows up on the Lake Formation console.
  4. Using the Hive metastore connection, create a federated database in the AWS Glue Data Catalog.
  5. Create LF-Tags and associate them to the federated database.
  6. Grant permissions on the LF-Tags to account B. Grant database and table permissions to account B using LF-Tag expressions.

In account B, perform the following steps:

  1. As a data lake admin, review and accept the AWS Resource Access Manager (AWS RAM) invites for the shares from account A.
  2. The data lake admin then sees the shared database and tables. The admin creates a resource link to the database and grants fine-grained permissions to a data analyst in this account.
  3. Both the data lake admin and the data analyst query the Hive tables that are available to them using Athena.

Account A has the following personas:

  • hmsblog-producersteward – Manages the data lake in the producer account A

Account B has the following personas:

  • hmsblog-consumersteward – Manages the data lake in the consumer account B
  • hmsblog-analyst – A data analyst who needs access to selected Hive tables

Prerequisites

To follow the tutorial in this post, you need the following:

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin. Complete the following steps:

  1. Sign into the AWS Management Console and choose the us-west-2 Region.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  3. Choose Manage Administrators in the Data lake administrators section.
  4. Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
  5. Choose Launch Stack to deploy the CloudFormation template:
  6. Choose Next.
  7. Provide a name for the stack and choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation takes about 10 minutes. The stack establishes the producer account A setup as follows:

  • Creates an S3 data lake bucket
  • Registers the data lake bucket to Lake Formation with the Enable catalog federation flag
  • Launches an EMR 6.10 cluster with Hive and runs two steps in Amazon EMR:
    • Downloads the sample data from public S3 bucket to the newly created bucket
    • Creates a Hive database and four external tables for the data in Amazon S3, using a HQL script
  • Creates an IAM user (hmsblog-producersteward) and sets this user as Lake Formation administrator
  • Creates LF-Tags (LFHiveBlogCampaignRole = Admin, Analyst)

Review CloudFormation stack output in account A

To review the output of your CloudFormation stack, complete the following steps:

  1. Log in to the console as the IAM admin user you used earlier to run the CloudFormation template.
  2. Open the CloudFormation console in another browser tab.
  3. Review and note down the stack Outputs tab details.
  4. Choose the link under Value for ProducerStewardCredentials.

This will open the AWS Secrets Manager console.

  1. Choose Retrieve value and note down the credentials of hmsblog-producersteward.

Set up a federated AWS Glue connection in account A

To set up a federated AWS Glue connection, complete the following steps:

  1. Open the AWS Serverless Application Repository console in another browser tab.
  2. In the navigation pane, choose Available applications.
  3. Select Show apps that create custom IAM roles or resource policies.
  4. In the search bar, enter Glue.

This will list various applications.

  1. Choose the application named GlueDataCatalogFederation-HiveMetastore.

This will open the AWS Lambda console configuration page for a Lambda function that runs the connector application code.

To configure the Lambda function, you need details of the EMR cluster launched by the CloudFormation stack.

  1. On another tab of your browser, open the Amazon EMR console.
  2. Navigate to the cluster launched for this post and note down the following details from the cluster details page:
    1. Primary node public DNS
    2. Subnet ID
    3. Security group ID of the primary node

  3. Back on the Lambda configuration page, under Review, configure, and deploy, in the Application settings section, provide the following details. Leave the rest as the default values.
    1. For GlueConnectionName, enter hive-metastore-connection.
    2. For HiveMetastoreURIs enter thrift://<Primary-node-public-DNS-of your-EMR>:9083. For example, thrift://ec2-54-70-203-146.us-west-2.compute.amazonaws.com:9083, where 9083 is the Hive metastore port in EMR cluster.
    3. For VPCSecurityGroupIds, enter the security group ID of the EMR primary node.
    4. For VPCSubnetIds, enter the subnet ID of the EMR cluster.
  4. Choose Deploy.

Wait for the Create Completed status of the Lambda application. You can review the details of the Lambda application on the Lambda console.

  1. Open Lake Formation console and in the navigation pane, choose Data sharing.

You should see hive-metastore-connection under Connections.

  1. Choose it and review the details.
  2. In the navigation pane, under Administrative roles and tasks, choose LF-Tags.

You should see the created LF-tag LFHiveBlogCampaignRole with two values: Analyst and Admin.

  1. Choose LF-Tag permissions and choose Grant.
  2. Choose IAM users and roles and enter hmsblog-producersteward.
  3. Under LF-Tags, choose Add LF-Tag.
  4. Enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
  5. Under Permissions, select Describe and Associate for LF-Tag permissions and Grantable permissions.
  6. Choose Grant.

This gives LF-Tags permissions for the producer steward.

  1. Log out as the IAM administrator user.

Grant Lake Formation permissions as producer steward

Complete the following steps:

  1. Sign in to the console as hmsblog-producersteward, using the credentials from the CloudFormation stack Output tab that you noted down earlier.
  2. On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
  3. Under Database creators, choose Grant.
  4. Add hmsblog-producersteward as a database creator.
  5. In the navigation pane, choose Data sharing.
  6. Under Connections, choose the hive-metastore-connection hyperlink.
  7. On the Connection details page, choose Create database.
  8. For Database name, enter federated_emrhivedb.

This is the federated database in the local AWS Glue Data Catalog that will point to a Hive metastore database. This is a one-to-one mapping of a database in the Data Catalog to a database in the external Hive metastore.

  1. For Database identifier, enter the name of the database in the EMR Hive metastore that was created by the Hive SQL script. For this post, we use emrhms_salesdb.
  2. Once created, select federated_emrhivedb and choose View tables.

This will fetch the database and table metadata from the Hive metastore on the EMR cluster and display the tables created by the Hive script.

Now you associate the LF-Tags created by the CloudFormation script on this federated database and share it to the consumer account B using LF-Tag expressions.

  1. In the navigation pane, choose Databases.
  2. Select federated_emrhivedb and on the Actions menu, choose Edit LF-Tags.
  3. Choose Assign new LF-Tag.
  4. Enter LFHiveBlogCampaignRole for Assigned keys and Admin for Values, then choose Save.
  5. In the navigation pane, choose Data lake permissions.
  6. Choose Grant.
  7. Select External accounts and enter the consumer account B number.
  8. Under LF-Tags or catalog resources, choose Resource matched by LF-Tags.
  9. Choose Add LF-Tag.
  10. Enter LFHiveBlogCampaignRole for Key and Admin for Values.
  11. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  12. In the Table permissions section, select Select and Describe for Table permissions and Grantable permissions.
  13. Choose Grant.
  14. In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.
  15. Choose Grant.
  16. Select External accounts and enter the account ID of consumer account B.
  17. Under LF-Tags, enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
  18. Under Permissions, select Describe and Associate under LF-Tag permissions and Grantable permissions.
  19. Choose Grant and verify that the granted LF-Tag permissions display correctly.
  20. In the navigation pane, choose Data lake permissions.

You can review and verify the permissions granted to account B.

  1. In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.

You can review and verify the permissions granted to account B.

  1. Log out of account A.

Lake Formation and AWS CloudFormation setup in account B

To keep the setup simple, we use an IAM admin registered as the data lake admin.

  1. Sign into the AWS Management Console of account B and select the us-west-2 Region.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  3. Choose Manage Administrators in the Data lake administrators section.
  4. Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
  5. Choose Launch Stack to deploy the CloudFormation template:
  6. Choose Next.
  7. Provide a name for the stack and choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation should take about 5 minutes. The stack establishes the producer account B setup as follows:

  • Creates an IAM user hmsblog-consumersteward and sets this user as Lake Formation administrator
  • Creates another IAM user hmsblog-analyst
  • Creates an S3 data lake bucket to store Athena query results, with ListBucket and write object permissions to both hmsblog-consumersteward and hmsblog-analyst

Note down the stack output details.

Accept resource shares in account B

Sign in to the console as hmsblog-consumersteward and complete the following steps:

  1. On the AWS CloudFormation console, navigate to the stack Outputs tab.
  2. Choose the link for ConsumerStewardCredentials to be redirected to the Secrets Manager console.
  3. On the Secrets Manager console, choose Retrieve secret value and copy the password for the consumer steward user.
  4. Use the ConsoleIAMLoginURL value from the CloudFormation template Output to log in to account B with the consumer steward user name hmsblog-consumersteward and the password you copied from Secrets Manager.
  5. Open the AWS RAM console in another browser tab.
  6. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from producer account A: one for a database-level share and one for a table-level share.

  1. Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Pending to Active.

  1. Open the Lake Formation console in another browser tab.
  2. In the navigation pane, choose Databases.

You should see the shared database federated_emrhivedb from producer account A.

  1. Choose the database and choose View tables to review the list of tables shared under that database.

You should see the four tables of the Hive database that is hosted on the EMR cluster in the producer account.

Grant permissions in account B

To grant permissions in account B, complete the following steps as hmsblog-consumersteward:

  1. On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
  2. Under Database creators, choose Grant.
  3. For IAM users and roles, enter hmsblog-consumersteward.
  4. For Catalog permissions, select Create database.
  5. Choose Grant.

This allows hmsblog-consumersteward to create a database resource link.

  1. In the navigation pane, choose Databases.
  2. Select federated_emrhivedb and on the Actions menu, choose Create resource link.
  3. Enter rl_federatedhivedb for Resource link name and choose Create.
  4. Choose Databases in the navigation pane.
  5. Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant.
  6. Choose hmsblog-analyst for IAM users and roles.
  7. Under Resource link permissions, select Describe, then choose Grant.
  8. Select Databases in the navigation pane.
  9. Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant on target.
  10. Choose hmsblog-analyst for IAM users and roles.
  11. Choose hms_productcategory and hms_supplier for Tables.
  12. For Table permissions, select Select and Describe, then choose Grant.
  13. In the navigation pane, choose Data lake permissions and review the permissions granted to hms-analyst.

Query the Apache Hive database of the producer from the consumer Athena

Complete the following steps:

  1. On the Athena console, navigate to the query editor.
  2. Choose Edit settings to configure the Athena query results bucked.
  3. Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
  4. Choose Save.

hmsblog-consumersteward has access to all four tables under federated_emrhivedb from the producer account.

  1. In the Athena query editor, choose the database rl_federatedhivedb and run a query on any of the tables.

You were able to query an external Apache Hive metastore database of the producer account through the AWS Glue Data Catalog and Lake Formation permissions using Athena from the recipient consumer account.

  1. Sign out of the console as hmsblog-consumersteward and sign back in as hmsblog-analyst.
  2. Use the same method as explained earlier to get the login credentials from the CloudFormation stack Outputs tab.

hmsblog-analyst has Describe permissions on the resource link and access to two of the four Hive tables. You can verify that you see them on the Databases and Tables pages on the Lake Formation console.

On the Athena console, you now configure the Athena query results bucket, similar to how you configured it as hmsblog-consumersteward.

  1. In the query editor, choose Edit settings.
  2. Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
  3. Choose Save.
  4. In the Athena query editor, choose the database rl_federatedhivedb and run a query on the two tables.
  5. Sign out of the console as hmsblog-analyst.

You were able to restrict sharing the external Apache Hive metastore tables using Lake Formation permissions from one account to another and query them using Athena. You can also query the Hive tables using Redshift Spectrum, Amazon EMR, and AWS Glue ETL from the consumer account.

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

Clean up resources in account A

There are two CloudFormation stacks associated with producer account A. You need to delete the dependencies and the two stacks in the correct order.

  1. Log in as the admin user to producer account B.
  2. On the Lake Formation console, choose Data lake permissions in the navigation pane.
  3. Choose Grant.
  4. Grant Drop permissions to your role or user on federated_emrhivedb.
  5. In the navigation pane, choose Databases.
  6. Select federated_emrhivedb and on the Actions menu, choose Delete to delete the federated database that is associated with the Hive metastore connection.

This makes the AWS Glue connection’s CloudFormation stack ready to be deleted.

  1. In the navigation pane, choose Administrative roles and tasks.
  2. Under Database creators, select Revoke and remove hmsblog-producersteward permissions.
  3. On the CloudFormation console, delete the stack named serverlessrepo-GlueDataCatalogFederation-HiveMetastore first.

This is the one created by your AWS SAM application for the Hive metastore connection. Wait for it to complete deletion.

  1. Delete the CloudFormation stack that you created for the producer account set up.

This deletes the S3 buckets, EMR cluster, custom IAM roles and policies, and the LF-Tags, database, tables, and permissions.

Clean up resources in account B

Complete the following steps in account B:

  1. Revoke permission to hmsblog-consumersteward as database creator, similar to the steps in the previous section.
  2. Delete the CloudFormation stack that you created for the consumer account setup.

This deletes the IAM users, S3 bucket, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we showed you how to launch the AWS Glue Hive metastore federation application from AWS Serverless Application Repository, configure it with a Hive metastore running on an EMR cluster, create a federated database in the AWS Glue Data Catalog, and map it to a Hive metastore database on the EMR cluster. We illustrated how to share and access the Hive database tables for a cross-account scenario and the benefits of using Lake Formation to restrict permissions.

All Lake Formation features such as sharing to IAM principals within same account, sharing to external accounts, sharing to external account IAM principals, restricting column access, and setting data filters work on federated Hive database and tables. You can use any of the AWS analytics services that are integrated with Lake Formation, such as Athena, Redshift Spectrum, AWS Glue ETL, and Amazon EMR to query the federated Hive database and tables.

We encourage you to check out the features of the AWS Glue Hive metastore federation connector and explore Lake Formation permissions on your Hive database and tables. Please comment on this post or talk to your AWS Account Team to share feedback on this feature.

For more details, see Managing permissions on datasets that use external metastores.


About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Enable cross-account sharing with direct IAM principals using AWS Lake Formation Tags

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/enable-cross-account-sharing-with-direct-iam-principals-using-aws-lake-formation-tags/

With AWS Lake Formation, you can build data lakes with multiple AWS accounts in a variety of ways. For example, you could build a data mesh, implementing a centralized data governance model and decoupling data producers from the central governance. Such data lakes enable the data as an asset paradigm and unleash new possibilities with data discovery and exploration across organization-wide consumers. While enabling the power of data in decision-making across your organization, it’s also crucial to secure the data. With Lake Formation, sharing datasets across accounts only requires a few simple steps, and you can control what you share.

Lake Formation has launched Version 3 capabilities for sharing AWS Glue Data Catalog resources across accounts. When moving to Lake Formation cross-account sharing V3, you get several benefits. When moving from V1, you get more optimized usage of AWS Resource Access Manager (AWS RAM) to scale sharing of resources. When moving from V2, you get a few enhancements. First, you don’t have to maintain AWS Glue resource policies to share using LF-tags because Version 3 uses AWS RAM. Second, you can share with AWS Organizations using LF-tags. Third, you can share to individual AWS Identity and Access Management (IAM) users and roles in other accounts, thereby providing data owners control over which individuals can access their data.

Lake Formation tag-based access control (LF-TBAC) is an authorization strategy that defines permissions based on attributes called LF-tags. LF-tags are different from IAM resource tags and are associated only with Lake Formation databases, tables, and columns. LF-TBAC allows you to define the grant and revoke permissions policy by grouping Data Catalog resources, and therefore helps in scaling permissions across a large number of databases and tables. LF-tags are inherited from a database to all its tables and all the columns of each table.

Version 3 offers the following benefits:

  • True central governance with cross-account sharing to specific IAM principals in the target account
  • Ease of use in not having to maintain an AWS Glue resource policy for LF-TBAC
  • Efficient reuse of AWS RAM shares
  • Ease of use in scaling to hundreds of accounts with LF-TBAC

In this post, we illustrate the new features of cross-account sharing Version 3 in a producer-consumer scenario using TPC datasets. We walk through the setup of using LF-TBAC to share data catalog resources from the data producer account to direct IAM users in the consumer account. We also go through the steps in the receiving account to accept the shares and query the data.

Solution overview

To demonstrate the Lake Formation cross-account Version 3 features, we use the TPC datasets available at s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/. The solution consists of steps in both accounts.

In account A, complete the following steps:

  1. As a data producer, register the dataset with Lake Formation and create AWS Glue Data Catalog tables.
  2. Create LF-tags and associate them with the database and tables.
  3. Grant LF-tag based permissions on resources directly to personas in consumer account B.

The following steps take place in account B:

  1. The consumer account data lake admin reviews and accepts the AWS RAM invitations.
  2. The data lake admin gives CREATE DATABASE access to the IAM user lf_business_analysts.
  3. The data lake admin creates a database for the marketing team and grants CREATE TABLE access to lf_campaign_manager.
  4. The IAM users create resource links on the shared database and tables and query them in Amazon Athena.

The producer account A has the following personas:

  • Data lake admin – Manages the data lake in the producer account
  • lf-producersteward – Manages the data and user access

The consumer account B has the following personas:

  • Data lake admin – Manages the data lake in the consumer account
  • lf-business-analysts – The business analysts in the sales team needs access to non-PII data
  • lf-campaign-manager – The manager in the marketing team needs access to data related to products and promotions

Prerequisites

You need the following prerequisites:

  • Two AWS accounts. For this demonstration of how AWS RAM invites are created and accepted, you should use two accounts that are not part of the same organization.
  • An admin IAM user in both accounts to launch the AWS CloudFormation stacks.
  • Lake Formation mode enabled in both the producer and consumer account with cross-account Version 3. For instructions, refer to Change the default permission model.

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin.

  1. Sign into the AWS Management Console in the us-east-1 Region.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  3. Select Choose Administrators under Datalake administrators.
  4. In the pop-up window Manage data lake administrators, under IAM users and roles, choose IAM admin user and choose Save.
  5. Choose Launch Stack to deploy the CloudFormation template:
    BDB-2063-launch-cloudformation-stack
  6. Choose Next.
  7. Provide a name for the stack and choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation should take about 2–3 minutes. The stack establishes the producer setup as follows:

  • Creates an Amazon Simple Storage Service (Amazon S3) data lake bucket
  • Registers the data lake bucket with Lake Formation
  • Creates an AWS Glue database and tables
  • Creates an IAM user (lf-producersteward) who will act as producer steward
  • Creates LF-tags and assigns them to the created catalog resources as specified in the following table
Database Table LF-Tag Key LF-Tag Value Resource Tagged
lftpcdb . Sensitivity Public DATABASE
lftpcdb items HasCampaign true TABLE
lftpcdb promotions HasCampaign true TABLE
lftpcdb customers table columns = "c_last_name","c_first_name","c_email_address" Sensitivity Confidential TABLECOLUMNS

Verify permissions in account A

After the CloudFormation stack launches, complete the following steps in account A:

  1. On the AWS CloudFormation console, navigate to the Outputs tab of the stack.

  1. Choose the LFProducerStewardCredentials value to navigate to the AWS Secrets Manager console.
  2. In the Secret value section, choose Retrieve secret value.
  3. Note down the secret value for the password for IAM user lf-producersteward.

You need this to log in to the console later as the user lf-producersteward.

  1. On the LakeFormation console, choose Databases on the navigation pane.
  2. Open the database lftpcdb.
  3. Verify the LF-tags on the database are created.

  1. Choose View tables and choose the items table to verify the LF-tags.

  1. Repeat the steps for the promotions and customers tables to verify the LF-tags assigned.

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Select the database lftpcdb and on the Actions menu, choose View Permissions.
  3. Verify that there are no default permissions granted on the database lftpcdb for IAMAllowedPrincipals.
  4. If you find any, select the permission and choose Revoke to revoke the permission.
  5. On the AWS Management Console, choose the AWS CloudShell icon on the top menu.

This opens AWS CloudShell in another tab of the browser. Allow a few minutes for the CloudShell environment to set up.

  1. Run the following AWS Command Line Interface (AWS CLI) command after replacing {BUCKET_NAME} with DataLakeBucket from the stack output.
aws s3 cp s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/ s3://${BUCKET_NAME}/tpcparquet/  --recursive

If CloudShell isn’t available in your chosen Region, run the following AWS CLI command to copy the required dataset from your preferred AWS CLI environment as the IAM admin user.

  1. Verify that your S3 bucket has the dataset copied in it.
  2. Log out as the IAM admin user.

Grant permissions in account A

Next, we continue granting Lake Formation permissions to the dataset as a data steward within the producer account. The data steward grants the following LF-tag-based permissions to the consumer personas.

Consumer Persona LF-tag Policy
lf-business-analysts Sensitivity=Public
lf-campaign-manager HasCampaign=true
  1. Log in to account A as user lf-producersteward, using the password you noted from Secrets Manager earlier.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  3. Choose Grant.
  4. Under Principals, select External accounts.
  5. Provide the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-business-analysts) and press Enter.

  1. Under LF_Tags or catalog resources, select Resources matched by LF-Tags.
  2. Choose Add LF-Tag to add a new key-value pair.
  3. For the key, choose Sensitivity and for the value, choose Public.
  4. Under Database permissions, select Describe, and under Table permissions, select Select and Describe.

  1. Choose Grant to apply the permissions.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  3. Choose Grant.
  4. Under Principals, select External accounts.
  5. Provide the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-campaign-manager) and press Enter.
  6. Under LF_Tags or catalog resources, select Resources matched by LF-Tags.
  7. Choose Add LF-Tag to add a new key-value pair.
  8. For the key, choose HasCampaign and for the value, choose true.

  1. Under Database permissions, select Describe, and under Table permissions, select Select and Describe.
  2. Choose Grant to apply the permissions.
  3.  Verify on the Data lake permissions tab that the permissions you have granted show up correctly.

AWS CloudFormation setup in account B

Complete the following steps in the consumer account:

  1. Log in as an IAM admin user in account B and launch the CloudFormation stack:
    BDB-2063-launch-cloudformation-stack
  2. Choose Next.
  3. Provide a name for the stack, then choose Next.
  4. On the next page, choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

Stack creation should take about 2–3 minutes. The stack sets up the following resources in account B:

  • IAM users datalakeadmin1, lf-business-analysts, and lf-campaign-manager, with relevant IAM and Lake Formation permissions
  • A database called db_for_shared_tables with Create_Table permissions to the lf-campaign-manager user
  • An S3 bucket named lfblog-athenaresults-<your-accountB-id>-us-east-1 with ListBucket and write permissions to lf-business-analysts and lf-campaign-manager

Note down the stack output details.

Accept resource shares in account B

After you launch the CloudFormation stack, complete the following steps in account B:

  1. On the CloudFormation stack Outputs tab, choose the link for DataLakeAdminCredentials.

This takes you to the Secrets Manager console.

  1. On the Secrets Manager console, choose Retrieve secret value and copy the password for DataLakeAdmin user.
  2. Use the ConsoleIAMLoginURL value from the CloudFormation template output to log in to account B with the data lake admin user name datalakeadmin1 and the password you copied from Secrets Manager.
  3. Open the AWS RAM console in another browser tab.
  4. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from the producer account A: one for database-level share and one for table-level share.

  1. Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Active from Pending.

Grant permissions in account B

To grant permissions in account B, complete the following steps:

  1. On the Lake Formation console, under Permissions on the navigation pane, choose Administrative roles and tasks.

  1. Under Database creators, choose Grant.

  1. Under IAM users and roles, choose lf-business-analysts.
  2. For Catalog permissions, select Create database.
  3. Choose Grant.
  4. Log out of the console as the data lake admin user.

Query the shared datasets as consumer users

To validate the lf-business-analysts user’s data access, perform the following steps:

  1. Log in to the console as lf-business-analysts, using the credentials noted from the CloudFormation stack output.
  2. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.

  1. Select the database lftpcdb and on the Actions menu, choose Create resource link.

  1. Under Resource link name, enter rl_lftpcdb.
  2. Choose Create.
  3. After the resource link is created, select the resource link and choose View tables.

You can now see the four tables in the shared database.

  1. Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
  2. Verify data access using the following query (for more information, refer to Running SQL queries using Amazon Athena):
Select * from rl_lftpcdb.customers limit 10;

The following screenshot shows the query output.

Notice that account A shared the database lftpcdb to account B using the LF-tag expression Sensitivity=Public. Columns c_first_name, c_last_name, and c_email_address in table customers were overwritten with Sensitivity=Confidential. Therefore, these three columns are not visible to user lf-business-analysts.

You can preview the other tables from the database similarly to see the available columns and data.

  1. Log out of the console as lf-business-analysts.

Now we can validate the lf-campaign-manager user’s data access.

  1. Log in to the console as lf-campaign-manager using the credentials noted from the CloudFormation stack output.
  2. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  3. Verify that you can see the database db_for_shared_tables shared by the data lake admin.

  1. Under Data catalog in the navigation pane, choose Tables.

You should be able to see the two tables shared from account A using the LF-tag expression HasCampaign=true. The two tables show the Owner account ID as account A.

Because lf-campaign-manager received table level shares, this user will create table-level resource links for querying in Athena.

  1. Select the promotions table, and on the Actions menu, choose Create resource link.

  1. For Resource link name, enter rl_promotions.

  1. Under Database, choose db_for_shared_tables for the database to contain the resource link.
  2. Choose Create.
  3. Repeat the table resource link creation for the other table items.

Notice that the resource links show account B as owner, whereas the actual tables show account A as the owner.

  1. Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
  2. 11. Query the tables using the resource links.

As shown in the following screenshot, all columns of both tables are accessible to lf-campaign-manager.

In summary, you have seen how LF-tags are used to share a database and select tables from one account to another account’s IAM users.

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

First, clean up resources in account A:

  1. Empty the S3 bucket created for this post by deleting the downloaded objects from your S3 bucket.
  2. Delete the CloudFormation stack.

This deletes the S3 bucket, custom IAM roles, policies, and the LF database, tables, and permissions.

  1. You may choose to undo the Lake Formation settings also and add IAM access back from the Lake Formation console Settings page.

Now complete the following steps in account B:

  1. Empty the S3 bucket lfblog-athenaresults-<your-accountB-id>-us-east-1 used as the Athena query results location.
  2. Revoke permission to lf-business-analysts as database creator.
  3. Delete the CloudFormation stack.

This deletes the IAM users, S3 bucket, Lake Formation database db_for_shared_tables, resource links, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we illustrated the benefits of using Lake Formation cross-account sharing Version 3 using LF-tags to direct IAM principals and how to receive the shared tables in the consumer account. We used a two-account scenario in which a data producer account shares a database and specific tables to individual IAM users in another account using LF-tags. In the receiving account, we showed the role played by a data lake admin vs. the receiving IAM users. We also illustrated how to overwrite column tags to mask and share PII data.

With Version 3 of cross-account sharing features, Lake Formation makes possible more modern data mesh models, where a producer can directly share to an IAM principal in another account, instead of the entire account. Data mesh implementation becomes easier for data administrators and data platform owners because they can easily scale to hundreds of consumer accounts using the LF-tags based sharing to organizational units or IDs.

We encourage you to upgrade your Lake Formation cross-account sharing to Version 3 and benefit from the enhancements. For more details, see Updating cross-account data sharing version settings.


About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.