All posts by Aarthi Srinivasan

AWS Lake Formation 2023 year in review

2024-01-18 Aarthi Srinivasan

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/aws-lake-formation-2023-year-in-review/

AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2022, we talked about the enhancements we had done to these services. We continue to listen to customer stories and work backwards to incorporate their thoughts in our products. In this post, we are happy to summarize the results of our hard work in 2023 to improve and simplify data governance for customers.

We announced our new features and capabilities during AWS re:Invent 2023, as is our custom every year. The following are re:Invent 2023 talks showcasing Lake Formation and Data Catalog capabilities:

What’s new in AWS Lake Formation – This session recaps new capabilities and how you can get the most out of Lake Formation. The session also highlights Duke Energy’s journey with Lake Formation and the AWS Glue Data Catalog.
Easily and securely prepare, share, and query data – This session shows how you can use Lake Formation and the AWS Glue Data Catalog to share data without copying, transform and prepare data without coding, and query data.
Curate your data at scale – This session shows how solutions like AWS Glue, AWS Glue Data Quality, and Lake Formation can help you manage your best sources and find sensitive information.

We group the new capabilities into four categories:

Discover and secure
Connect with data sharing
Scale and optimize
Audit and monitor

Let’s dive deeper and discuss the new capabilities introduced in 2023.

Discover and secure

Using Lake Formation and the Data Catalog as the foundational building blocks, we launched Amazon DataZone in October 2023. DataZone is a data management service that makes it faster and more straightforward for you to catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources. The publishing and subscription workflows of DataZone enhance collaboration between various roles in your organization and speed up the time to derive business insights from your data. You can enhance the technical metadata of the Data Catalog using AI-powered assistants into business metadata of DataZone, making it more easily discoverable. DataZone automatically manages the permissions of your shared data in the DataZone projects. To learn more about DataZone, refer to the User Guide. Bienvenue dans DataZone!

AWS Glue crawlers classify data to determine the format, schema, and associated properties of the raw data, group data into tables or partitions, and write metadata to the Data Catalog. In 2023, we released several updates to AWS Glue crawlers. We added the ability to bring your custom versions of JDBC drivers in crawlers to extract data schemas from your data sources and populate the Data Catalog. To optimize partition retrieval and improve query performance, we added the feature for crawlers to automatically add partition indexes for newly discovered tables. We also integrated crawlers with Lake Formation, supporting centralized permissions for in-account and cross-account crawling of S3 data lakes. These are some much sought-after improvements that simplify your metadata discovery using crawlers. Crawlers, salut!

We have also seen a tremendous rise in the usage of open table formats (OTFs) like Linux Foundation Delta Lake, Apache Iceberg, and Apache Hudi. To support these popular OTFs, we added support to natively crawl these three table formats into the Data Catalog. Furthermore, we worked with other AWS analytics services, such as Amazon EMR, to enable Lake Formation fine-grained permissions on all the three open table formats. We encourage you to explore which features of Lake Formation are supported for OTF tables. Bien intégré!

As the data sources and types increase over time, you are bound to have nested data types in your data lake sooner or later. To bring data governance to these datasets without flattening them, Lake Formation added support for fine-grained access controls on nested data types and columns. We also added support for Lake Formation fine-grained access controls while running Apache Hive jobs on Amazon EMR on EC2 and on Amazon EMR Studio. With Amazon EMR Serverless, fine-grained access control with Lake Formation is now available in preview. Connecté les points!

At AWS, we work very closely with our customers to understand their experience. We came to understand that onboarding to Lake Formation from AWS Identity and Access Management (IAM) based permissions for Amazon S3 and the AWS Glue Data Catalog could be streamlined. We realized that your use cases need more flexibility in data governance. With the hybrid access mode in Lake Formation, we introduced selective addition of Lake Formation permissions for some users and databases, without interrupting other users and workloads. You can define a catalog table in hybrid mode and grant access to new users like data analysts and data scientists using Lake Formation while your production extract, transform, and load (ETL) pipelines continue to use their existing IAM-based permissions. Double victoire!

Let’s talk about identity management. You can use IAM principals, Amazon Quicksight users and groups, and external accounts and IAM principals in external accounts to grant access to Data Catalog resources in Lake Formation. What about your corporate identities? Do you need to create and maintain multiple IAM roles and map them to various corporate identities? You could see the IAM role that accessed the table, but how could you find out which user accessed it? To answer these questions, Lake Formation integrated with AWS IAM Identity Center and added the feature for trusted identity propagation. With this, you can grant fine-grained access permissions to the identities from your organization’s existing identity provider. Other AWS analytics services also support the user identity to be propagated. Your auditors can now see that the user [email protected], for example, had accessed the table managed by Lake Formation permissions using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Intégration facile!

Now you don’t have to worry about moving the data or copying the Data Catalog to another AWS Region to use the AWS services for data governance. We have expanded and made Lake Formation available in all Regions in 2023. Et voila!

Connect with data sharing

Lake Formation provides a straightforward way to share Data Catalog objects like databases and tables with internal and external users. This mechanism empowers organizations with quick and secure access to data and speeds up their business decision-making. Let’s review the new features and enhancements made in 2023 under this theme.

The AWS Glue Data Catalog is the central and foundational component of data governance for both Lake Formation and DataZone. In 2023, we extended the Data Catalog through federation to integrate with external Apache Hive metastores and Redshift datashares. We also made available the connector code, which you can customize to connect the Data Catalog with additional Apache Hive-compatible metastores. These integrations pave the way to get more metadata into the Data Catalog, and allow fine-grained access controls and sharing of these resources across AWS accounts effortlessly with Lake Formation permissions. We also added support to access the Data Catalog table of one Region from other Regions using cross-Region resource links. This enhancement simplifies many use cases to avoid metadata duplication.

With the AWS CloudTrail Lake federation feature, you can discover, analyze, join, and share CloudTrail Lake data with other data sources in Data Catalog. For CloudTrail Lake, fine-grained access controls and querying and visualizing capabilities are available through Athena.

We further extended the Data Catalog capabilities to support uniform views across your data lake. You can create views using different SQL dialects and query from Athena, Redshift Spectrum, and Amazon EMR. This allows you to maintain permissions at the view level and not share the individual tables. The Data Catalog views feature is available in preview, announced at re:Invent 2023.

Scale and optimize

As SQL queries get more complex with the data changes over time or has multiple joins, a cost-based optimizer (CBO) can drive optimizations in the query plan and lead to faster performance, based on statistics of the data in the tables. In 2023, we added support for column-level statistics for tables in the Data Catalog. Customers are already seeing query performance improvements in Athena and Redshift Spectrum, with table column statistics turned on. Suivez les chiffres!

Tag-based access control removes the need to update your policies every time a new resource is added to the data lake. Instead, data lake administrators create Lake Formation Tags (LF-Tags) to tag Data Catalog objects and grant access based on these LF-Tags to users and groups. In 2023, we added support for LF-Tag delegation, where data lake administrators can give permissions to data stewards and other users to manage LF-Tags without the need for administrator privileges. LF-Tag democratization!

Apache Iceberg format uses metadata to keep track of the data files that make up the table. Changes to tables, like inserts or updates, result in new data files being created. As the number of data files for a table grows, the queries using that table can become less efficient. To improve query performance on the Iceberg table, you need to reduce the number of data files by compacting the smaller change capture files into bigger files. Users typically create and run scripts to perform optimization of these Iceberg table files in their own servers or through AWS Glue ETL. To alleviate this complex maintenance of Iceberg tables, customers approached us for a better solution. We introduced the feature for automatic compaction of Apache Iceberg tables in the Data Catalog. After you turn on automatic compaction, the Data Catalog automatically manages the metadata of the table and gives you an always-optimized Amazon S3 layout for your Iceberg tables. To learn more, check out Optimizing Iceberg tables. Automatique!

Audit and monitor

Knowing who has access to what data is a critical component of data governance. Auditors need to validate that the right metadata and data permissions are set in Lake Formation and the Data Catalog. Data lake administrators have full access to permissions and metadata, and can grant access to the data itself. To provide auditors with an option to search and review metadata permissions without granting them access to make changes to permissions, we introduced the read-only administrator role in Lake Formation. This role allows you to audit the catalog metadata and Lake Formation permissions and LF-Tags while restricting it from making any changes to them.

Conclusion

We had an amazing 2023, developing product enhancements to help you simplify and enhance your data governance using Lake Formation and Data Catalog. We invite you to try these new features. The following is a list of our launch posts for reference:

We will continue to innovate on behalf of our customers in 2024. Please share your thoughts, use cases, and feedback for our product improvements in the comments section or through your AWS account teams. We wish you a happy and prosperous 2024. Bonne année!

About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Leon Stigter is a Senior Technical Product Manager with AWS Lake Formation. Leon’s focus is on helping developers build data lakes faster, with seamless connectivity to analytical tools, to transform data into game-changing insights. Leon is interested in data and serverless technologies, and enjoys exploring different cities on his mission to taste cheesecake everywhere he goes.

Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies

2023-09-27 Aarthi Srinivasan

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/introducing-hybrid-access-mode-for-aws-glue-data-catalog-to-secure-access-using-aws-lake-formation-and-iam-and-amazon-s3-policies/

AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage access control for your data lake data in Amazon Simple Storage Service (Amazon S3) and its metadata in AWS Glue Data Catalog in one place with familiar database-style features. You can use fine-grained data access control to verify that the right users have access to the right data down to the cell level of tables. Lake Formation also makes it simpler to share data internally across your organization and externally. Further, Lake Formation integrates with AWS analytics services such as Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL for Apache Spark. These services allow querying Lake Formation managed tables, thus helping you extract business insights from the data quickly and securely.

Before the introduction of Lake Formation and its database-style permissions for data lakes, you had to manage access to your data in the data lake and its metadata separately through AWS Identity and Access Management (IAM) policies and S3 bucket policies. With an IAM and Amazon S3 access control mechanism, which is more complex and less granular compared to Lake Formation, you need more time to migrate to Lake Formation because a given database or table in the data lake could have its access controlled by either IAM and S3 policies or Lake Formation policies, but not both. Also, various use cases operate on the data lakes. Migrating all use cases from one permissions model to another in a single step without disruption was challenging for operations teams.

To ease the transition of data lake permissions from an IAM and S3 model to Lake Formation, we’re introducing a hybrid access mode for AWS Glue Data Catalog. Please refer to the What’s New and documentation. This feature lets you secure and access the cataloged data using both Lake Formation permissions and IAM and S3 permissions. Hybrid access mode allows data administrators to onboard Lake Formation permissions selectively and incrementally, focusing on one data lake use case at a time. For example, say you have an existing extract, transform and load (ETL) data pipeline that uses the IAM and S3 policies to manage data access. Now you want to allow your data analysts to explore or query the same data using Amazon Athena. You can grant access to the data analysts using Lake Formation permissions, to include fine-grained controls as needed, without changing access for your ETL data pipelines.

Hybrid access mode allows both permission models to exist for the same database and tables, providing greater flexibility in how you manage user access. While this feature opens two doors for a Data Catalog resource, an IAM user or role can access the resource using only one of the two permissions. After Lake Formation permission is enabled for an IAM principal, authorization is completely managed by Lake Formation and existing IAM and S3 policies are ignored. AWS CloudTrail logs provide the complete details of the Data Catalog resource access in Lake Formation logs and S3 access logs.

In this blog post, we walk you through the instructions to onboard Lake Formation permissions in hybrid access mode for selected users while the database is already accessible to other users through IAM and S3 permissions. We will review the instructions to set-up hybrid access mode within an AWS account and between two accounts.

Scenario 1 – Hybrid access mode within an AWS account

In this scenario, we walk you through the steps to start adding users with Lake Formation permissions for a database in Data Catalog that’s accessed using IAM and S3 policy permissions. For our illustration, we use two personas: Data-Engineer, who has coarse grained permissions using an IAM policy and an S3 bucket policy to run an AWS Glue ETL job and Data-Analyst, whom we will onboard with fine grained Lake Formation permissions to query the database using Amazon Athena.

Scenario 1 is depicted in the diagram shown below, where the Data-Engineer role accesses the database hybridsalesdb using IAM and S3 permissions while Data-Analyst role will access the database using Lake Formation permissions.

Prerequisites

To set up Lake Formation and IAM and S3 permissions for a Data Catalog database with Hybrid access mode, you must have the following prerequisites:

An AWS account that isn’t used for production applications.
Lake Formation already set up in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. For example, we’re using a data lake administrator role called LF-Admin. To learn more about setting up permissions for a data lake administrator role, see Create a data lake administrator.
A sample database in the Data Catalog with a few tables. For example, our sample database is called hybridsalesdb and has a set of eight tables, as shown in the following screenshot. You can use any of your datasets to follow along.

Personas and their IAM policy setup

There are two personas that are IAM roles in the account: Data-Engineer and Data-Analyst. Their IAM policies and access are described as follows.

The following IAM policy on the Data-Engineer role allows access to the database and table metadata in the Data Catalog.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue: Get*"
            ],
            "Resource": [
                "arn:aws:glue:<Region>:<account-id>:catalog",
                "arn:aws:glue:<Region>:<account-id>:database/hybridsalesdb",
                "arn:aws:glue:<Region>:<account-id>:table/hybridsalesdb/*"
            ]
        }
    ]
}

The following IAM policy on the Data-Engineer role grants data access to the underlying Amazon S3 location of the database and tables.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataLakeBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:Put*",
                "s3:Get*",
                "s3:Delete*"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/<prefix>/"
            ]
        }
    ]
}

The Data-Engineer also has access to the AWS Glue console using the AWS managed policy arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess and regressive iam:Passrole to run an AWS Glue ETL script as below.

{
    "Version": "2012-10-17",
    "Statement": [
       {
           "Sid": "PassRolePermissions",
           "Effect": "Allow",
           "Action": [
               " iam:PassRole" ],
           "Resource": [  
		   "arn:aws:iam::<account-id>:role/Data-Engineer"
            ]
        }
    ]
}

The following policy is also added to the trust policy of the Data-Engineer role to allow AWS Glue to assume the role to run the ETL script on behalf of the role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

See AWS Glue studio set up for additional permissions required to run an AWS Glue ETL script.

The Data-Analyst role has the data lake basic user permissions as described in Assign permissions to Lake Formation users.

{
"Version": "2012-10-17",
"Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "glue:GetTable",
            "glue:GetTables",
            "glue:GetTableVersions",
            "glue:SearchTables",
            "glue:GetDatabase",
            "glue:GetDatabases",
            "glue:GetPartitions",
            "lakeformation:GetDataAccess",
            "lakeformation:GetResourceLFTags",
            "lakeformation:ListLFTags",
            "lakeformation:GetLFTag",
            "lakeformation:SearchTablesByLFTags",
            "lakeformation:SearchDatabasesByLFTags"
        ],
        "Resource": "*"
    }
    ]
}

Additionally, the Data-Analyst has permissions to write Athena query results to an S3 bucket that isn’t managed by Lake Formation and Athena console full access using the AWS managed policy arn:aws:iam::aws:policy/AmazonAthenaFullAccess.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<athena-results-bucket>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:Put*",
                "s3:Get*",
                "s3:Delete*"
            ],
            "Resource": [
                "arn:aws:s3:::<athena-results-bucket>/*"
            ]
        }
    ]
}

Set up Lake Formation permissions for Data-Analyst

Complete the following steps to configure your data location in Amazon S3 with Lake Formation in hybrid access mode and grant access to the Data-Analyst role.

Sign in to the AWS Management Console as a Lake Formation administrator role.
Go to Lake Formation.
Select Data lake locations from the left navigation bar under Administration.
Select Register location and provide the Amazon S3 location of your database and tables. Provide an IAM role that has access to the data in the S3 location. For more details see Requirements for roles used to register locations.
Select the Hybrid access mode under Permission mode and choose Register location.
Select Data lake locations under Administration from the left navigation bar. Review that the registered location shows as Hybrid access mode for Permission mode.
Select Databases from Catalog on the left navigation bar. Choose hybridsalesdb. You will select the database that has the data in the S3 location that you registered in the preceding step. From the Actions drop down menu, select Grant.
Select Data-Analyst for IAM users and roles. Under LF-Tags or catalog resources, select Named Data Catalog resources and select hybridsalesdb for Databases.
Under Database permissions, select Describe. Under Hybrid access mode, select the checkbox Make Lake Formation permissions effective immediately. Choose Grant.
Again, select Databases from Catalog on the left navigation bar. Choose hybridsalesdb. Select Grant from the Actions drop down menu.
On the Grant window, select Data-Analyst for IAM users and roles. Under LF-Tags or catalog resources, choose Named Data Catalog resources and select hybridsalesdb for Databases.
Under Tables, select the three tables named hybridcustomer, hybridproduct, and hybridsales_order from the drop down.
Under Table permissions, select Select and Describe permissions for the tables.
Select the checkbox under Hybrid access mode to make the Lake Formation permissions effective immediately.
Choose Grant.
Review the granted permissions by selecting the Data lake permissions under Permissions on the left navigation bar. Filter Data permissions by Principal = Data-Analyst.
On the left navigation bar, select Hybrid access mode. Verify that the opted in Data-Analyst shows up for the hybridsalesdb database and the three tables.
Sign out from the console as the Lake Formation administrator role.

Validating Lake Formation permissions for Data-Analyst

Sign in to the console as Data-Analyst.
Go to the Athena console. If you’re using Athena for the first time, set up the query results location to your S3 bucket as described in Specifying a query result location.
Run preview queries on the table from the Athena query editor.

Validating IAM and S3 permissions for Data-Engineer

Sign out as Data-Analyst and sign back in to the console as Data-Engineer.
Open the AWS Glue console and select ETL jobs from the left navigation bar.
Under Create job, select Spark script editor. Choose Create.
Download and open the sample script provided here.
Copy and paste the script into your studio script editor as a new job.
Edit the catalog_id, database, and table_name to suit your sample.
Save and Run your AWS Glue ETL script by providing the IAM role of Data-Engineer to run the job.
After the ETL script succeeds, you can select the output logs link from the Runs tab of the ETL script.
Review the table’s schema, top 20 rows, and the total number of rows and columns from the AWS CloudWatch logs.

Thus, you can add Lake Formation permissions to a new role to access a Data Catalog database without interfering with another role that is accessing the same database through IAM and S3 permissions.

Scenario 2 – Hybrid access mode set up between two AWS accounts

This is a cross-account sharing scenario where a data producer shares a database and its tables to a consumer account. The producer provides full database access for an AWS Glue ETL workload on the consumer account. At the same time, the producer shares a few tables of the same database to the consumer account using Lake Formation. We walk you through how you can use hybrid access mode to support both access methods.

Prerequisites

Cross-account sharing of a database or table location that’s registered in hybrid access mode requires the producer or the grantor account to be in version 4 of cross-account sharing in the catalog setting to grant permissions on the hybrid access mode resource. When moving from version 3 to version 4 of cross-account sharing, existing Lake Formation permissions aren’t affected for database and table locations that are already registered with Lake Formation (Lake Formation mode). For new data set location registration in hybrid access mode and new Lake Formation permissions on this catalog resource, you will need version 4 of cross-account sharing.
The consumer or recipient account can use other versions of cross-account sharing. If your accounts are using version 1 or version 2 of cross-account sharing and if you want to upgrade, follow Updating cross-account data sharing version settings to first upgrade the catalog setting of cross-account sharing to version 3, before upgrading to version 4.

The producer account set up is similar to that of scenario 1 and we discuss the extra steps for scenario 2 in the following section.

Set up in producer account A

The consumer Data-Engineer role is granted Amazon S3 data access using the producer’s S3 bucket policy and Data Catalog access using the producer’s Data Catalog resource policy.

The S3 bucket policy in the producer account follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
        "Sid": "data engineer role permissions",
        "Effect": "Allow",
        "Principal": {
            "AWS": "arn:aws:iam::<consumer-account-id>:role/Data-Engineer"
        },
        "Action": [
            "s3:GetLifecycleConfiguration",
            "s3:ListBucket",
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
        ],
        "Resource": [
            "arn:aws:s3:::<producer-account-databucket>",
            "arn:aws:s3:::<producer-account-databucket>/*"
        ]
        }
    ]
}

The Data Catalog resource policy in the producer account is shown below. You also need the glue:ShareResource IAM permission for AWS Resource Access Manager (AWS RAM) to enable cross-account sharing.

{
"Version" : "2012-10-17",
"Statement" : [
    {
    "Effect" : "Allow",
    "Principal" : {
        "AWS" : "arn:aws:iam::<consumer-account-id>:role/Data-Engineer"
    },
    "Action" : "glue:Get*",
    "Resource" : [
        "arn:aws:glue:<Region>:<producer-account-id>:catalog", 
        "arn:aws:glue:<Region>:<producer-account-id>:database/hybridsalesdb", 
        "arn:aws:glue:<Region>:<producer-account-id>:table/hybridsalesdb/*"
    ]
    },
    {
        "Effect" : "Allow",
        "Principal" : {
        "Service" : "ram.amazonaws.com"
        },
        "Action" : "glue:ShareResource",
        "Resource" : [
            "arn:aws:glue:<Region>:<producer-account-id>:table/*/*", 
            "arn:aws:glue:<Region>:<producer-account-id>:database/*", 
            "arn:aws:glue:<Region>:<producer-account-id>:catalog"
        ]
        }
    ]
}

Setting the cross-account version and registering the S3 bucket

Sign in to the Lake Formation console as an IAM administrator role or a role with IAM permissions to the PutDataLakeSettings() API. Choose the AWS Region where you have your sample data set in an S3 bucket and its corresponding database and tables in the Data Catalog.
Select Data catalog settings from the left navigation bar under Administration. Select Version 4 from the dropdown menu for Cross account version settings. Choose Save.
Note: If there are any other accounts in your environment that share catalog resources to your producer account through Lake Formation, upgrading the sharing version might impact them. See <title of documentation page> for more information.
Sign out as IAM administrator and sign back in to the Lake Formation console as a Lake Formation administrator role.
Select Data lake locations from the left navigation bar under Administration.
Select Register location and provide the S3 location of your database and tables.
Provide an IAM role that has access to the data in the S3 location. For more details about this role requirement, see Requirements for roles used to register locations.
Choose the Hybrid access mode under Permission mode, and then choose Register location.
Select Data lake locations under Administration from the left navigation bar. Confirm that the registered location shows as Hybrid access mode for Permission mode.

Granting cross-account permissions

The steps to share the database hybridsalesdb to the consumer account are similar to the steps to set up scenario 1.

In the Lake Formation console, select Databases from Catalog on the left navigation bar. Choose hybridsalesdb. Select your database that has the data in the S3 location that you registered previously. From the Actions drop down menu, select Grant.
Select External accounts under Principals and provide the consumer account ID. Select Named catalog resources under LF-Tags or catalog resources. Choose hybridsalesdb for Databases.
Select Describe for Database permissions and for Grantable permissions.
Under Hybrid access mode, select the checkbox for Make Lake Formation permissions effective immediately. Choose Grant.

Note: Selecting the checkbox opts-in the consumer account Lake Formation administrator roles to use Lake Formation permissions without interrupting access to the consumer account’s IAM and S3 access for the same database.

Repeat step 2 up to database selection to grant permission to the consumer account ID for table level permission. Select any three tables from the drop-down menu for table level permission under Tables.
Select Select under Table permissions and Grantable permissions. Select the checkbox for Make Lake Formation permissions effective immediately under Hybrid access mode. Choose Grant.
Select the Data lake permissions on the left navigation bar. Verify the granted permissions to the consumer account.
Select the Hybrid access mode on the left navigation bar. Verify the opted-in resources and principal.

You have now enabled cross-account sharing using Lake Formation permissions without revoking access to the IAMAllowedPrincipal virtual group.

Set up in consumer account B

In scenario 2, the Data-Analyst and Data-Engineer roles are created in the consumer account similar to scenario 1, but these roles access the database and tables shared from the producer account.

In addition to arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess and arn:aws:iam::aws:policy/CloudWatchFullAccess, the Data-Engineer role also has permissions to create and run an Apache Spark job in AWS Glue Studio.

Data-Engineer has the following IAM policy that grants access to the producer account’s S3 bucket, which is registered with Lake Formation in hybrid access mode.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataLakeBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetLifecycleConfiguration",
                "s3:Put*",
                "s3:Get*",
                "s3:Delete*"
            ],
            "Resource": [
                "arn:aws:s3:::<producer-account-databucket>/*",
                "arn:aws:s3:::<producer-account-databucket>"
            ]
        }
    ]
}

Data-Engineer has the following IAM policy that grants access to the consumer account’s entire Data Catalog and producer account’s database hybridsalesdb and its tables.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:*"
            ],
            "Resource": [
                "arn:aws:glue:<Region>:<consumer-account-id>:catalog",
                "arn:aws:glue:<Region>:<consumer-account-id>:database/*",
                "arn:aws:glue:<Region>:<consumer-account-id>:table/*/*",

            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:Get*"
            ],
            "Resource": [
                "arn:aws:glue:<Region>:<producer-account-id>:catalog",
                "arn:aws:glue:<Region>:<producer-account-id>:database/hybridsalesdb",
                "arn:aws:glue:<Region>:<producer-account-id>:table/hybridsalesdb/*"
            ]
        }
    ]
}

The Data-Analyst has the same IAM policies similar to scenario 1, granting basic data lake user permissions. For additional details, see Assign permissions to Lake Formation users.

Accepting AWS RAM invites

Sign in to the Lake Formation console as a Lake Formation administrator role.
Open the AWS RAM console. Select Resource shares from Shared with me on the left navigation bar. You should see two invites from the producer account, one for database level share and one for table level share.
Select each invite, review the producer account ID, and choose Accept resource share.

Granting Lake Formation permissions to Data-Analyst

Open the Lake Formation console. As a Lake Formation administrator, you should see the shared database and tables from the consumer account.
Select Databases from the Data catalog on the left navigation bar. Select the radio button on the database hybridsalesdb and select Create resource link from the Actions drop down menu.
Enter rl_hybridsalesdb as the name for the resource link and leave the rest of the selections as they are. Choose Create.
Select the radio button for rl_hybridsalesdb. Select Grant from the Actions drop down menu.
Grant Describe permissions on the resource link to Data-Analyst.
Again, select the radio button on rl_hybridsalesdb from the Databases under Catalog in the left navigation bar. Select Grant on target from the Actions drop down menu.
Select Data-Analyst for IAM users and roles, keep the already selected database hybridsalesdb.
Select Describe under Database permissions. Select the checkbox for Make Lake Formation permissions effective immediately under Hybrid access mode. Choose Grant.
Select the radio button on rl_hybridsalesdb from Databases under Catalog in the left navigation bar. Select Grant on target from the Actions drop down menu.
Select Data-Analyst for IAM users and roles. Select All tables of the database hybridsalesdb. Select Select under Table permissions.
Select the checkbox for Make Lake Formation permissions effective immediately under Hybrid access mode.
View and verify the permissions granted to Data-Analyst from the Data lake permissions tab on the left navigation bar.
Sign out as Lake Formation administrator role.

Validate Lake Formation permissions as Data-Analyst

Sign back in to the console as Data-Analyst.
Open the Athena console. If you’re using Athena for the first time, set up the query results location to your S3 bucket as described in Specifying a query result location.
- In the Query Editor page, under Data, select AWSDataDatalog for Data source. For Tables, select the three dots next to any of the table names. Select Preview Table to run the query.
Sign out as Data-Analyst.

Validate IAM and S3 permissions for Data-Engineer

Sign back in to the console as Data-Engineer.
Using the same steps as scenario 1, verify IAM and S3 access by running the AWS Glue ETL script in AWS Glue Studio.

You’ve added Lake Formation permissions to a new role Data-Analyst, without interrupting existing IAM and S3 access to Data-Engineer for a cross-account sharing use-case.

Clean up

If you’ve used sample datasets from your S3 for this blog post, we recommend removing relevant Lake Formation permissions on your database for the Data-Analyst role and cross-account grants. You can also remove the hybrid access mode opt-in and remove the S3 bucket registration from Lake Formation. After removing all Lake Formation permissions from both the producer and consumer accounts, you can delete the Data-Analyst and Data-Engineer IAM roles.

Considerations

Currently, only a Lake Formation administrator role can opt in other users to use Lake Formation permissions for a resource, since opting in user access using either Lake Formation or IAM and S3 permissions is an administrative task requiring full knowledge of your organizational data access setup. Further, you can grant permissions and opt in at the same time using only the named-resource method and not LF-Tags. If you’re using LF-Tags to grant permissions, we recommend you use the Hybrid access mode option on the left navigation bar to opt in (or the equivalent CreateLakeFormationOptin() API using the AWS SDK or AWS CLI) as a subsequent step after granting permissions.

Conclusion

In this blog post, we went through the steps to set up hybrid access mode for Data Catalog. You learned how to onboard users selectively to the Lake Formation permissions model. The users who had access through IAM and S3 permissions continued to have their access without interruptions. You can use Lake Formation to add fine-grained access to Data Catalog tables to enable your business analysts to query using Amazon Athena and Amazon Redshift Spectrum, while your data scientists can explore the same data using Amazon Sagemaker. Data engineers can continue to use their IAM and S3 permissions on the same data to run workloads using Amazon EMR and AWS Glue. Hybrid access mode for the Data Catalog enables a variety of analytical use-cases for your data without data duplication.

To get started, see the documentation for hybrid access mode. We encourage you to check out the feature and share your feedback in the comments section. We look forward to hearing from you.

About the authors

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

2023-08-03 Aarthi Srinivasan

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/configure-cross-region-table-access-with-the-aws-glue-catalog-and-aws-lake-formation/

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. Companies also have employees and do business across multiple geographic regions and even around the world. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions.

The AWS Glue Data Catalog and AWS Lake Formation recently announced support for cross-Region table access. This feature lets users query AWS Glue databases and tables in one Region from another Region using resource links, without copying the metadata in the Data Catalog or the data in Amazon Simple Storage Service (Amazon S3). A resource link is a Data Catalog object that is a link to a database or table.

The AWS Glue Data Catalog is a centralized repository of technical metadata that holds the information about your datasets in AWS, and can be queried using AWS analytics services such as Amazon Athena, Amazon EMR, and AWS Glue for Apache Spark. The Data Catalog is localized to every Region in an AWS account, requiring users to replicate the metadata and the source data in S3 buckets for cross-Region queries. With the newly launched feature for cross-Region table access, you can create a resource link in any Region pointing to a database or table of the source Region. With the resource link in the local Region, you can query the source Region’s tables from Athena, Amazon EMR, and AWS Glue ETL in the local Region.

You can use the cross-Region table access feature of the Data Catalog in combination with the permissions management and cross-account sharing capability of Lake Formation. Lake Formation is a fully managed service that makes it easy to build, secure, and manage data lakes. By using cross-Region access support for Data Catalog, together with governance provided by Lake Formation, organizations can discover and access data across Regions without spending time making copies. Some businesses might have restrictions to run their compute in certain Regions. Organizations that need to share their Data Catalog with businesses that have such restrictions can now create and share cross-Region resource links.

In this post, we walk you through configuring cross-Region database and table access in two scenarios. In the first scenario, we go through an example where a customer wants to access an AWS Glue database in Region A from Region B in the same account. In scenario two, we demonstrate cross-account and cross-Region access where a customer wants to share a database in Region A across accounts and access it from Region B of the recipient account.

Scenario 1: Same account use case

In this scenario, we walk you through the steps required to share a Data Catalog database from one Region to another Region within the same AWS account. For our illustrations, we have a sample dataset in an S3 bucket in the us-east-2 Region and have used an AWS Glue crawler to crawl and catalog the dataset into a database in the Data Catalog of the us-east-2 Region. We share this dataset to the us-west-2 Region. You can use any of your datasets to follow along. The following diagram illustrates the architecture for cross-Region sharing within the same AWS account.

Prerequisites

To set up cross-Region sharing of a Data Catalog database for scenario 1, we recommend the following prerequisites:

An AWS account that is not used for production use cases.
Lake Formation set up already in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. For example, we are using a data lake administrator role called LF-Admin. The LF-Admin role also has the AWS Identity and Access Management (IAM) permission iam:PassRole on the AWS Glue crawler role. To learn more about setting up permissions for a data lake administrator, see Create a data lake administrator.
A sample database in the Data Catalog with a few tables. For example, our sample database is called salesdb_useast2 and has a set of eight tables, as shown in the following screenshot.

Set up permissions for us-east-2

Complete the following steps to configure permissions in the us-east-2 Region:

Log in to the Lake Formation console and choose the Region where your database resides. In our example, it is us-east-2 Region.
Grant SELECT and DESCRIBE permissions to the LF-Admin role on all tables of the database salesdb_useast2.
You can confirm if permissions are working by querying the database and tables as the data lake administrator role from Athena.

Set up permissions for us-west-2

Complete the following steps to configure permissions in the us-west-2 Region:

Choose the us-west-2 Region on the Lake Formation console.
Add LF-Admin as a data lake administrator and grant Create database permission to LF-Admin.
In the navigation pane, under Data catalog, select Databases.
Choose Create database and select Resource link.
Enter rl_salesdb_from_useast2 as the name for the resource link.
For Shared database’s region, choose US East (Ohio).
For Shared database, choose salesdb_useast2.
Choose Create.

This creates a database resource link in us-west-2 pointing to the database in us-east-2.

You will notice the Shared resource owner region column populate as us-east-2 for the resource link details on the Databases page.

Because the LF-Admin role created the resource link rl_salesdb_from_useast2, the role has implicit permissions on the resource link. LF-Admin already has permissions to query the table in the us-east-2 Region. There is no need to add a Grant on target permission for LF-Admin. If you are granting permission to another user or role, you need to grant Describe permissions on the resource link rl_salesdb_from_useast2.

Query the database using the resource link in Athena as LF-Admin.

In the preceding steps, we saw how to create a resource link in us-west-2 for a Data Catalog database in us-east-2. You can also create a resource link to the source database in any additional Region where the Data Catalog is available. You can run extract, transform, and load (ETL) scripts in Amazon EMR and AWS Glue by providing the additional Region parameter when referring to the database and table. See the API documentation for GetTable() and GetDatabase() for additional details.

Also, Data Catalog permissions for the database, tables, and resource links and the underlying Amazon S3 data permissions can be managed by IAM policies and S3 bucket policies instead of Lake Formation permissions. For more information, see Identity and access management for AWS Glue.

Scenario 2: Cross-account use case

In this scenario, we walk you through the steps required to share a Data Catalog database from one Region to another Region between two accounts: a producer account and a consumer account. To show an advanced use case, we host the source dataset in us-east-2 of account A and crawl it using an AWS Glue crawler in the Data Catalog in us-east-1. The data lake administrator in account A then shares the database and tables to account B using Lake Formation permissions. The data lake administrator in account B accepts the share in us-east-1 and creates resource links to query the tables from eu-west-1. The following diagram illustrates the architecture for cross-Region sharing between producer account A and consumer account B.

Prerequisites

To set up cross-Region sharing of a Data Catalog database for scenario 2, we recommend the following prerequisites:

Two AWS accounts that are not used for production use cases
Lake Formation administrator roles in both accounts
Lake Formation set up in both accounts with cross-account sharing version 3. For more details, refer documentation.
A sample database in the Data Catalog with a few tables

For our example, we continue to use the same dataset and the data lake administrator role LF-Admin for scenario 2.

Set up account A for cross-Region sharing

To set up account A, complete the following steps:

Sign in to the AWS Management Console as the data lake administrator role.
Register the S3 bucket in Lake Formation in us-east-1 with an IAM role that has access to the S3 bucket. See registering your S3 location for instructions.
Set up and run an AWS Glue crawler to catalog the data in the us-east-2 S3 bucket to the Data Catalog database useast2data_salesdb in us-east-1. Refer to AWS Glue crawlers support cross-account crawling to support data mesh architecture for instructions.

The database, as shown in the following screenshot, has a set of eight tables.

Grant SELECT and DESCRIBE along with grantable permissions on all tables of the database to account B.
Grant DESCRIBE with grantable permissions on the database.
Verify the granted permissions on the Data permissions page.
Log out of account A.

Set up account B for cross-Region sharing

To set up account B, complete the following steps:

In our example, we have created the data lake administrator role LF-Admin, similar to previous administrator roles in account A and scenario 1.

On the AWS Resource Access Manager (AWS RAM) console, review and accept the AWS RAM invites corresponding to the shared database and tables from account A.

The LF-Admin role can see the shared database useast2data_salesdb from the producer account. LF-Admin has access to the database and tables and so doesn’t need additional permissions on the shared database.

You can grant DESCRIBE on the database and SELECT on All_Tables permissions to any additional IAM principals from the us-east-1 Region on this shared database.
Open the Lake Formation console in eu-west-1 (or any Region where you have Lake Formation and Athena already set up).
Choose Create database and create a resource link named rl_useast1db_crossaccount, pointing to the us-east-1 database useast2data_salesdb.

You can choose any Region on the Shared database’s region drop-down menu and choose the databases from those Regions.

Because we’re using the data lake administrator role LF-Admin, we can see all databases from all Regions in the consumer account’s Data Catalog. A data lake user with restricted permissions will be able to see only those databases for which they have permissions to.

Because LF-Admin created the resource link, this role has permissions to use the resource link rl_useast1db_crossaccount. For additional IAM principals, grant DESCRIBE permissions on the database resource link rl_useast1db_crossaccount.
You can now query the database and tables from Athena.

Considerations

Cross-Region queries involve Amazon S3 data transfer by the analytics services, such as Athena, Amazon EMR, and AWS Glue ETL. As a result, cross-Region queries can be slower and will incur higher transfer costs compared to queries in the same Region. Some analytics services such as AWS Glue jobs and Amazon EMR may require internet access when accessing cross-Region data from Amazon S3, depending on your VPC set up. Refer to Considerations and limitations for more considerations.

Conclusion

In this post, you saw examples of how to set up cross-Region resource links for a database in the same account and across two accounts. You also saw how to use cross-Region resource links to query in Athena. You can share selected tables from a database instead of sharing an entire database. With cross-Region sharing, you can create a resource link for the table using the Create table option.

There are two key things to remember when using the cross-Region table access feature:

Grant permissions on the source database or table from its source Region.
Grant permissions on the resource link from the Region it was created in.

That is, the original shared database or table is always available in the source Region, and resource links are created and shared in their local Region.

To get started, see Accessing tables across Regions. Share your comments on the post or contact your AWS account team for more details.

About the author

Query your Apache Hive metastore with AWS Lake Formation permissions

2023-07-20 Aarthi Srinivasan

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/query-your-apache-hive-metastore-with-aws-lake-formation-permissions/

Apache Hive is a SQL-based data warehouse system for processing highly distributed datasets on the Apache Hadoop platform. There are two key components to Apache Hive: the Hive SQL query engine and the Hive metastore (HMS). The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive Metastore to retrieve metadata to run queries. The Hive metastore can be hosted on an Apache Hadoop cluster or can be backed by a relational database that is external to a Hadoop cluster. Although the Hive metastore stores the metadata of tables, the actual data of the table could be residing on Amazon Simple Storage Service (Amazon S3), the Hadoop Distributed File System (HDFS) of the Hadoop cluster, or any other Hive-supported data stores.

Because Apache Hive was built on top of Apache Hadoop, many organizations have been using the software from the time they have been using Hadoop for big data processing. Also, Hive metastore provides flexible integration with many other open-source big data software like Apache HBase, Apache Spark, Presto, and Apache Impala. Therefore, organizations have come to host huge volumes of metadata of their structured datasets in the Hive metastore. A metastore is a critical part of a data lake, and having this information available, wherever it resides, is important. However, many AWS analytics services don’t integrate natively with the Hive metastore, and therefore, organizations have had to migrate their data to the AWS Glue Data Catalog to use these services.

AWS Lake Formation has launched support for managing user access to Apache Hive metastores through a federated AWS Glue connection. Previously, you could use Lake Formation to manage user permissions on AWS Glue Data Catalog resources only. With the Hive metastore connection from AWS Glue, you can connect to a database in a Hive metastore external to the Data Catalog, map it to a federated database in the Data Catalog, apply Lake Formation permissions on the Hive database and tables, share them with other AWS accounts, and query them using services such as Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL (extract, transform, and load). For additional details on how the Hive metastore integration with Lake Formation works, refer to Managing permissions on datasets that use external metastores.

Use cases for Hive metastore integration with the Data Catalog include the following:

An external Apache Hive metastore used for legacy big data workloads like on-premises Hadoop clusters with data in Amazon S3
Transient Amazon EMR workloads with underlying data in Amazon S3 and the Hive metastore on Amazon Relational Database Service (Amazon RDS) clusters.

In this post, we demonstrate how to apply Lake Formation permissions on a Hive metastore database and tables and query them using Athena. We illustrate a cross-account sharing use case, where a Lake Formation steward in producer account A shares a federated Hive database and tables using LF-Tags to consumer account B.

Solution overview

Producer account A hosts an Apache Hive metastore in an EMR cluster, with underlying data in Amazon S3. We launch the AWS Glue Hive metastore connector from AWS Serverless Application Repository in account A and create the Hive metastore connection in account A’s Data Catalog. After we create the HMS connection, we create a database in account A’s Data Catalog (called the federated database) and map it to a database in the Hive metastore using the connection. The tables from the Hive database are then accessible to the Lake Formation admin in account A, just like any other tables in the Data Catalog. The admin continues to set up Lake Formation tag-based access control (LF-TBAC) on the federated Hive database and share it to account B.

The data lake users in account B will access the Hive database and tables of account A, just like querying any other shared Data Catalog resource using Lake Formation permissions.

The following diagram illustrates this architecture.

The solution consists of steps in both accounts. In account A, perform the following steps:

Create an S3 bucket to host the sample data.
Launch an EMR 6.10 cluster with Hive. Download the sample data to the S3 bucket. Create a database and external tables, pointing to the downloaded sample data, in its Hive metastore.
Deploy the application GlueDataCatalogFederation-HiveMetastore from AWS Serverless Application Repository and configure it to use the Amazon EMR Hive metastore. This will create an AWS Glue connection to the Hive metastore that shows up on the Lake Formation console.
Using the Hive metastore connection, create a federated database in the AWS Glue Data Catalog.
Create LF-Tags and associate them to the federated database.
Grant permissions on the LF-Tags to account B. Grant database and table permissions to account B using LF-Tag expressions.

In account B, perform the following steps:

As a data lake admin, review and accept the AWS Resource Access Manager (AWS RAM) invites for the shares from account A.
The data lake admin then sees the shared database and tables. The admin creates a resource link to the database and grants fine-grained permissions to a data analyst in this account.
Both the data lake admin and the data analyst query the Hive tables that are available to them using Athena.

Account A has the following personas:

hmsblog-producersteward – Manages the data lake in the producer account A

Account B has the following personas:

hmsblog-consumersteward – Manages the data lake in the consumer account B
hmsblog-analyst – A data analyst who needs access to selected Hive tables

Prerequisites

To follow the tutorial in this post, you need the following:

Two AWS accounts. We recommend you use test accounts and not production accounts.
An admin AWS Identity and Access Management (IAM) user in both accounts to launch the AWS CloudFormation stacks.
Lake Formation mode enabled in both accounts with cross-account settings to version 3. For instructions, refer to Change the default permission model.

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin. Complete the following steps:

Sign into the AWS Management Console and choose the us-west-2 Region.
On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
Choose Manage Administrators in the Data lake administrators section.
Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
Choose Launch Stack to deploy the CloudFormation template:
Choose Next.
Provide a name for the stack and choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

Stack creation takes about 10 minutes. The stack establishes the producer account A setup as follows:

Creates an S3 data lake bucket
Registers the data lake bucket to Lake Formation with the Enable catalog federation flag
Launches an EMR 6.10 cluster with Hive and runs two steps in Amazon EMR:
- Downloads the sample data from public S3 bucket to the newly created bucket
- Creates a Hive database and four external tables for the data in Amazon S3, using a HQL script
Creates an IAM user (hmsblog-producersteward) and sets this user as Lake Formation administrator
Creates LF-Tags (LFHiveBlogCampaignRole = Admin, Analyst)

Review CloudFormation stack output in account A

To review the output of your CloudFormation stack, complete the following steps:

Log in to the console as the IAM admin user you used earlier to run the CloudFormation template.
Open the CloudFormation console in another browser tab.
Review and note down the stack Outputs tab details.
Choose the link under Value for ProducerStewardCredentials.

This will open the AWS Secrets Manager console.

Choose Retrieve value and note down the credentials of hmsblog-producersteward.

Set up a federated AWS Glue connection in account A

To set up a federated AWS Glue connection, complete the following steps:

Open the AWS Serverless Application Repository console in another browser tab.
In the navigation pane, choose Available applications.
Select Show apps that create custom IAM roles or resource policies.
In the search bar, enter Glue.

This will list various applications.

Choose the application named GlueDataCatalogFederation-HiveMetastore.

This will open the AWS Lambda console configuration page for a Lambda function that runs the connector application code.

To configure the Lambda function, you need details of the EMR cluster launched by the CloudFormation stack.

On another tab of your browser, open the Amazon EMR console.
Navigate to the cluster launched for this post and note down the following details from the cluster details page:
1. Primary node public DNS
2. Subnet ID
3. Security group ID of the primary node
Back on the Lambda configuration page, under Review, configure, and deploy, in the Application settings section, provide the following details. Leave the rest as the default values.
1. For GlueConnectionName, enter hive-metastore-connection.
2. For HiveMetastoreURIs enter thrift://<Primary-node-public-DNS-of your-EMR>:9083. For example, thrift://ec2-54-70-203-146.us-west-2.compute.amazonaws.com:9083, where 9083 is the Hive metastore port in EMR cluster.
3. For VPCSecurityGroupIds, enter the security group ID of the EMR primary node.
4. For VPCSubnetIds, enter the subnet ID of the EMR cluster.
Choose Deploy.

Wait for the Create Completed status of the Lambda application. You can review the details of the Lambda application on the Lambda console.

Open Lake Formation console and in the navigation pane, choose Data sharing.

You should see hive-metastore-connection under Connections.

Choose it and review the details.
In the navigation pane, under Administrative roles and tasks, choose LF-Tags.

You should see the created LF-tag LFHiveBlogCampaignRole with two values: Analyst and Admin.

Choose LF-Tag permissions and choose Grant.
Choose IAM users and roles and enter hmsblog-producersteward.
Under LF-Tags, choose Add LF-Tag.
Enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
Under Permissions, select Describe and Associate for LF-Tag permissions and Grantable permissions.
Choose Grant.

This gives LF-Tags permissions for the producer steward.

Log out as the IAM administrator user.

Grant Lake Formation permissions as producer steward

Complete the following steps:

Sign in to the console as hmsblog-producersteward, using the credentials from the CloudFormation stack Output tab that you noted down earlier.
On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
Under Database creators, choose Grant.
Add hmsblog-producersteward as a database creator.
In the navigation pane, choose Data sharing.
Under Connections, choose the hive-metastore-connection hyperlink.
On the Connection details page, choose Create database.
For Database name, enter federated_emrhivedb.

This is the federated database in the local AWS Glue Data Catalog that will point to a Hive metastore database. This is a one-to-one mapping of a database in the Data Catalog to a database in the external Hive metastore.

For Database identifier, enter the name of the database in the EMR Hive metastore that was created by the Hive SQL script. For this post, we use emrhms_salesdb.
Once created, select federated_emrhivedb and choose View tables.

This will fetch the database and table metadata from the Hive metastore on the EMR cluster and display the tables created by the Hive script.

Now you associate the LF-Tags created by the CloudFormation script on this federated database and share it to the consumer account B using LF-Tag expressions.

In the navigation pane, choose Databases.
Select federated_emrhivedb and on the Actions menu, choose Edit LF-Tags.
Choose Assign new LF-Tag.
Enter LFHiveBlogCampaignRole for Assigned keys and Admin for Values, then choose Save.
In the navigation pane, choose Data lake permissions.
Choose Grant.
Select External accounts and enter the consumer account B number.
Under LF-Tags or catalog resources, choose Resource matched by LF-Tags.
Choose Add LF-Tag.
Enter LFHiveBlogCampaignRole for Key and Admin for Values.
In the Database permissions section, select Describe for Database permissions and Grantable permissions.
In the Table permissions section, select Select and Describe for Table permissions and Grantable permissions.
Choose Grant.
In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.
Choose Grant.
Select External accounts and enter the account ID of consumer account B.
Under LF-Tags, enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
Under Permissions, select Describe and Associate under LF-Tag permissions and Grantable permissions.
Choose Grant and verify that the granted LF-Tag permissions display correctly.
In the navigation pane, choose Data lake permissions.

You can review and verify the permissions granted to account B.

In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.

You can review and verify the permissions granted to account B.

Log out of account A.

Lake Formation and AWS CloudFormation setup in account B

To keep the setup simple, we use an IAM admin registered as the data lake admin.

Sign into the AWS Management Console of account B and select the us-west-2 Region.
On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
Choose Manage Administrators in the Data lake administrators section.
Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
Choose Launch Stack to deploy the CloudFormation template:
Choose Next.
Provide a name for the stack and choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

Stack creation should take about 5 minutes. The stack establishes the producer account B setup as follows:

Creates an IAM user hmsblog-consumersteward and sets this user as Lake Formation administrator
Creates another IAM user hmsblog-analyst
Creates an S3 data lake bucket to store Athena query results, with ListBucket and write object permissions to both hmsblog-consumersteward and hmsblog-analyst

Note down the stack output details.

Accept resource shares in account B

On the AWS CloudFormation console, navigate to the stack Outputs tab.
Choose the link for ConsumerStewardCredentials to be redirected to the Secrets Manager console.
On the Secrets Manager console, choose Retrieve secret value and copy the password for the consumer steward user.
Use the ConsoleIAMLoginURL value from the CloudFormation template Output to log in to account B with the consumer steward user name hmsblog-consumersteward and the password you copied from Secrets Manager.
Open the AWS RAM console in another browser tab.
In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from producer account A: one for a database-level share and one for a table-level share.

Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Pending to Active.

Open the Lake Formation console in another browser tab.
In the navigation pane, choose Databases.

You should see the shared database federated_emrhivedb from producer account A.

Choose the database and choose View tables to review the list of tables shared under that database.

You should see the four tables of the Hive database that is hosted on the EMR cluster in the producer account.

Grant permissions in account B

To grant permissions in account B, complete the following steps as hmsblog-consumersteward:

On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
Under Database creators, choose Grant.
For IAM users and roles, enter hmsblog-consumersteward.
For Catalog permissions, select Create database.
Choose Grant.

This allows hmsblog-consumersteward to create a database resource link.

In the navigation pane, choose Databases.
Select federated_emrhivedb and on the Actions menu, choose Create resource link.
Enter rl_federatedhivedb for Resource link name and choose Create.
Choose Databases in the navigation pane.
Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant.
Choose hmsblog-analyst for IAM users and roles.
Under Resource link permissions, select Describe, then choose Grant.
Select Databases in the navigation pane.
Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant on target.
Choose hmsblog-analyst for IAM users and roles.
Choose hms_productcategory and hms_supplier for Tables.
For Table permissions, select Select and Describe, then choose Grant.
In the navigation pane, choose Data lake permissions and review the permissions granted to hms-analyst.

Query the Apache Hive database of the producer from the consumer Athena

Complete the following steps:

On the Athena console, navigate to the query editor.
Choose Edit settings to configure the Athena query results bucked.
Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
Choose Save.

hmsblog-consumersteward has access to all four tables under federated_emrhivedb from the producer account.

In the Athena query editor, choose the database rl_federatedhivedb and run a query on any of the tables.

You were able to query an external Apache Hive metastore database of the producer account through the AWS Glue Data Catalog and Lake Formation permissions using Athena from the recipient consumer account.

Sign out of the console as hmsblog-consumersteward and sign back in as hmsblog-analyst.
Use the same method as explained earlier to get the login credentials from the CloudFormation stack Outputs tab.

hmsblog-analyst has Describe permissions on the resource link and access to two of the four Hive tables. You can verify that you see them on the Databases and Tables pages on the Lake Formation console.

On the Athena console, you now configure the Athena query results bucket, similar to how you configured it as hmsblog-consumersteward.

In the query editor, choose Edit settings.
Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
Choose Save.
In the Athena query editor, choose the database rl_federatedhivedb and run a query on the two tables.
Sign out of the console as hmsblog-analyst.

You were able to restrict sharing the external Apache Hive metastore tables using Lake Formation permissions from one account to another and query them using Athena. You can also query the Hive tables using Redshift Spectrum, Amazon EMR, and AWS Glue ETL from the consumer account.

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

Clean up resources in account A

There are two CloudFormation stacks associated with producer account A. You need to delete the dependencies and the two stacks in the correct order.

Log in as the admin user to producer account B.
On the Lake Formation console, choose Data lake permissions in the navigation pane.
Choose Grant.
Grant Drop permissions to your role or user on federated_emrhivedb.
In the navigation pane, choose Databases.
Select federated_emrhivedb and on the Actions menu, choose Delete to delete the federated database that is associated with the Hive metastore connection.

This makes the AWS Glue connection’s CloudFormation stack ready to be deleted.

In the navigation pane, choose Administrative roles and tasks.
Under Database creators, select Revoke and remove hmsblog-producersteward permissions.
On the CloudFormation console, delete the stack named serverlessrepo-GlueDataCatalogFederation-HiveMetastore first.

This is the one created by your AWS SAM application for the Hive metastore connection. Wait for it to complete deletion.

Delete the CloudFormation stack that you created for the producer account set up.

This deletes the S3 buckets, EMR cluster, custom IAM roles and policies, and the LF-Tags, database, tables, and permissions.

Clean up resources in account B

Complete the following steps in account B:

Revoke permission to hmsblog-consumersteward as database creator, similar to the steps in the previous section.
Delete the CloudFormation stack that you created for the consumer account setup.

This deletes the IAM users, S3 bucket, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we showed you how to launch the AWS Glue Hive metastore federation application from AWS Serverless Application Repository, configure it with a Hive metastore running on an EMR cluster, create a federated database in the AWS Glue Data Catalog, and map it to a Hive metastore database on the EMR cluster. We illustrated how to share and access the Hive database tables for a cross-account scenario and the benefits of using Lake Formation to restrict permissions.

All Lake Formation features such as sharing to IAM principals within same account, sharing to external accounts, sharing to external account IAM principals, restricting column access, and setting data filters work on federated Hive database and tables. You can use any of the AWS analytics services that are integrated with Lake Formation, such as Athena, Redshift Spectrum, AWS Glue ETL, and Amazon EMR to query the federated Hive database and tables.

We encourage you to check out the features of the AWS Glue Hive metastore federation connector and explore Lake Formation permissions on your Hive database and tables. Please comment on this post or talk to your AWS Account Team to share feedback on this feature.

For more details, see Managing permissions on datasets that use external metastores.

About the authors

Enable cross-account sharing with direct IAM principals using AWS Lake Formation Tags

2023-01-18 Aarthi Srinivasan

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/enable-cross-account-sharing-with-direct-iam-principals-using-aws-lake-formation-tags/

With AWS Lake Formation, you can build data lakes with multiple AWS accounts in a variety of ways. For example, you could build a data mesh, implementing a centralized data governance model and decoupling data producers from the central governance. Such data lakes enable the data as an asset paradigm and unleash new possibilities with data discovery and exploration across organization-wide consumers. While enabling the power of data in decision-making across your organization, it’s also crucial to secure the data. With Lake Formation, sharing datasets across accounts only requires a few simple steps, and you can control what you share.

Lake Formation has launched Version 3 capabilities for sharing AWS Glue Data Catalog resources across accounts. When moving to Lake Formation cross-account sharing V3, you get several benefits. When moving from V1, you get more optimized usage of AWS Resource Access Manager (AWS RAM) to scale sharing of resources. When moving from V2, you get a few enhancements. First, you don’t have to maintain AWS Glue resource policies to share using LF-tags because Version 3 uses AWS RAM. Second, you can share with AWS Organizations using LF-tags. Third, you can share to individual AWS Identity and Access Management (IAM) users and roles in other accounts, thereby providing data owners control over which individuals can access their data.

Lake Formation tag-based access control (LF-TBAC) is an authorization strategy that defines permissions based on attributes called LF-tags. LF-tags are different from IAM resource tags and are associated only with Lake Formation databases, tables, and columns. LF-TBAC allows you to define the grant and revoke permissions policy by grouping Data Catalog resources, and therefore helps in scaling permissions across a large number of databases and tables. LF-tags are inherited from a database to all its tables and all the columns of each table.

Version 3 offers the following benefits:

True central governance with cross-account sharing to specific IAM principals in the target account
Ease of use in not having to maintain an AWS Glue resource policy for LF-TBAC
Efficient reuse of AWS RAM shares
Ease of use in scaling to hundreds of accounts with LF-TBAC

In this post, we illustrate the new features of cross-account sharing Version 3 in a producer-consumer scenario using TPC datasets. We walk through the setup of using LF-TBAC to share data catalog resources from the data producer account to direct IAM users in the consumer account. We also go through the steps in the receiving account to accept the shares and query the data.

Solution overview

To demonstrate the Lake Formation cross-account Version 3 features, we use the TPC datasets available at s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/. The solution consists of steps in both accounts.

In account A, complete the following steps:

As a data producer, register the dataset with Lake Formation and create AWS Glue Data Catalog tables.
Create LF-tags and associate them with the database and tables.
Grant LF-tag based permissions on resources directly to personas in consumer account B.

The following steps take place in account B:

The consumer account data lake admin reviews and accepts the AWS RAM invitations.
The data lake admin gives CREATE DATABASE access to the IAM user lf_business_analysts.
The data lake admin creates a database for the marketing team and grants CREATE TABLE access to lf_campaign_manager.
The IAM users create resource links on the shared database and tables and query them in Amazon Athena.

The producer account A has the following personas:

Data lake admin – Manages the data lake in the producer account
lf-producersteward – Manages the data and user access

The consumer account B has the following personas:

Data lake admin – Manages the data lake in the consumer account
lf-business-analysts – The business analysts in the sales team needs access to non-PII data
lf-campaign-manager – The manager in the marketing team needs access to data related to products and promotions

Prerequisites

You need the following prerequisites:

Two AWS accounts. For this demonstration of how AWS RAM invites are created and accepted, you should use two accounts that are not part of the same organization.
An admin IAM user in both accounts to launch the AWS CloudFormation stacks.
Lake Formation mode enabled in both the producer and consumer account with cross-account Version 3. For instructions, refer to Change the default permission model.

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin.

Sign into the AWS Management Console in the us-east-1 Region.
On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
Select Choose Administrators under Datalake administrators.
In the pop-up window Manage data lake administrators, under IAM users and roles, choose IAM admin user and choose Save.
Choose Launch Stack to deploy the CloudFormation template:
Choose Next.
Provide a name for the stack and choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

Stack creation should take about 2–3 minutes. The stack establishes the producer setup as follows:

Creates an Amazon Simple Storage Service (Amazon S3) data lake bucket
Registers the data lake bucket with Lake Formation
Creates an AWS Glue database and tables
Creates an IAM user (lf-producersteward) who will act as producer steward
Creates LF-tags and assigns them to the created catalog resources as specified in the following table

Database	Table	LF-Tag Key	LF-Tag Value	Resource Tagged
`lftpcdb`	.	`Sensitivity`	`Public`	`DATABASE`
`lftpcdb`	`items`	`HasCampaign`	`true`	`TABLE`
`lftpcdb`	`promotions`	`HasCampaign`	`true`	`TABLE`
`lftpcdb`	`customers table columns = "c_last_name","c_first_name","c_email_address"`	`Sensitivity`	`Confidential`	`TABLECOLUMNS`

Verify permissions in account A

After the CloudFormation stack launches, complete the following steps in account A:

On the AWS CloudFormation console, navigate to the Outputs tab of the stack.

Choose the LFProducerStewardCredentials value to navigate to the AWS Secrets Manager console.
In the Secret value section, choose Retrieve secret value.
Note down the secret value for the password for IAM user lf-producersteward.

You need this to log in to the console later as the user lf-producersteward.

On the LakeFormation console, choose Databases on the navigation pane.
Open the database lftpcdb.
Verify the LF-tags on the database are created.

Choose View tables and choose the items table to verify the LF-tags.

Repeat the steps for the promotions and customers tables to verify the LF-tags assigned.

On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
Select the database lftpcdb and on the Actions menu, choose View Permissions.
Verify that there are no default permissions granted on the database lftpcdb for IAMAllowedPrincipals.
If you find any, select the permission and choose Revoke to revoke the permission.
On the AWS Management Console, choose the AWS CloudShell icon on the top menu.

This opens AWS CloudShell in another tab of the browser. Allow a few minutes for the CloudShell environment to set up.

Run the following AWS Command Line Interface (AWS CLI) command after replacing {BUCKET_NAME} with DataLakeBucket from the stack output.

aws s3 cp s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/ s3://${BUCKET_NAME}/tpcparquet/  --recursive

If CloudShell isn’t available in your chosen Region, run the following AWS CLI command to copy the required dataset from your preferred AWS CLI environment as the IAM admin user.

Verify that your S3 bucket has the dataset copied in it.
Log out as the IAM admin user.

Grant permissions in account A

Next, we continue granting Lake Formation permissions to the dataset as a data steward within the producer account. The data steward grants the following LF-tag-based permissions to the consumer personas.

Consumer Persona	LF-tag Policy
`lf-business-analysts`	`Sensitivity=Public`
`lf-campaign-manager`	`HasCampaign=true`

Log in to account A as user lf-producersteward, using the password you noted from Secrets Manager earlier.
On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
Choose Grant.
Under Principals, select External accounts.
Provide the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-business-analysts) and press Enter.

Under LF_Tags or catalog resources, select Resources matched by LF-Tags.
Choose Add LF-Tag to add a new key-value pair.
For the key, choose Sensitivity and for the value, choose Public.
Under Database permissions, select Describe, and under Table permissions, select Select and Describe.

Choose Grant to apply the permissions.
On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
Choose Grant.
Under Principals, select External accounts.
Provide the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-campaign-manager) and press Enter.
Under LF_Tags or catalog resources, select Resources matched by LF-Tags.
Choose Add LF-Tag to add a new key-value pair.
For the key, choose HasCampaign and for the value, choose true.

Under Database permissions, select Describe, and under Table permissions, select Select and Describe.
Choose Grant to apply the permissions.
Verify on the Data lake permissions tab that the permissions you have granted show up correctly.

AWS CloudFormation setup in account B

Complete the following steps in the consumer account:

Log in as an IAM admin user in account B and launch the CloudFormation stack:
Choose Next.
Provide a name for the stack, then choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

Stack creation should take about 2–3 minutes. The stack sets up the following resources in account B:

IAM users datalakeadmin1, lf-business-analysts, and lf-campaign-manager, with relevant IAM and Lake Formation permissions
A database called db_for_shared_tables with Create_Table permissions to the lf-campaign-manager user
An S3 bucket named lfblog-athenaresults-<your-accountB-id>-us-east-1 with ListBucket and write permissions to lf-business-analysts and lf-campaign-manager

Note down the stack output details.

Accept resource shares in account B

After you launch the CloudFormation stack, complete the following steps in account B:

On the CloudFormation stack Outputs tab, choose the link for DataLakeAdminCredentials.

This takes you to the Secrets Manager console.

On the Secrets Manager console, choose Retrieve secret value and copy the password for DataLakeAdmin user.
Use the ConsoleIAMLoginURL value from the CloudFormation template output to log in to account B with the data lake admin user name datalakeadmin1 and the password you copied from Secrets Manager.
Open the AWS RAM console in another browser tab.
In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from the producer account A: one for database-level share and one for table-level share.

Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Active from Pending.

Grant permissions in account B

To grant permissions in account B, complete the following steps:

On the Lake Formation console, under Permissions on the navigation pane, choose Administrative roles and tasks.

Under Database creators, choose Grant.

Under IAM users and roles, choose lf-business-analysts.
For Catalog permissions, select Create database.
Choose Grant.
Log out of the console as the data lake admin user.

Query the shared datasets as consumer users

To validate the lf-business-analysts user’s data access, perform the following steps:

Log in to the console as lf-business-analysts, using the credentials noted from the CloudFormation stack output.
On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.

Select the database lftpcdb and on the Actions menu, choose Create resource link.

Under Resource link name, enter rl_lftpcdb.
Choose Create.
After the resource link is created, select the resource link and choose View tables.

You can now see the four tables in the shared database.

Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
Verify data access using the following query (for more information, refer to Running SQL queries using Amazon Athena):

Select * from rl_lftpcdb.customers limit 10;

The following screenshot shows the query output.

Notice that account A shared the database lftpcdb to account B using the LF-tag expression Sensitivity=Public. Columns c_first_name, c_last_name, and c_email_address in table customers were overwritten with Sensitivity=Confidential. Therefore, these three columns are not visible to user lf-business-analysts.

You can preview the other tables from the database similarly to see the available columns and data.

Log out of the console as lf-business-analysts.

Now we can validate the lf-campaign-manager user’s data access.

Log in to the console as lf-campaign-manager using the credentials noted from the CloudFormation stack output.
On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
Verify that you can see the database db_for_shared_tables shared by the data lake admin.

Under Data catalog in the navigation pane, choose Tables.

You should be able to see the two tables shared from account A using the LF-tag expression HasCampaign=true. The two tables show the Owner account ID as account A.

Because lf-campaign-manager received table level shares, this user will create table-level resource links for querying in Athena.

Select the promotions table, and on the Actions menu, choose Create resource link.

For Resource link name, enter rl_promotions.

Under Database, choose db_for_shared_tables for the database to contain the resource link.
Choose Create.
Repeat the table resource link creation for the other table items.

Notice that the resource links show account B as owner, whereas the actual tables show account A as the owner.

Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
11. Query the tables using the resource links.

As shown in the following screenshot, all columns of both tables are accessible to lf-campaign-manager.

In summary, you have seen how LF-tags are used to share a database and select tables from one account to another account’s IAM users.

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

First, clean up resources in account A:

Empty the S3 bucket created for this post by deleting the downloaded objects from your S3 bucket.
Delete the CloudFormation stack.

This deletes the S3 bucket, custom IAM roles, policies, and the LF database, tables, and permissions.

You may choose to undo the Lake Formation settings also and add IAM access back from the Lake Formation console Settings page.

Now complete the following steps in account B:

Empty the S3 bucket lfblog-athenaresults-<your-accountB-id>-us-east-1 used as the Athena query results location.
Revoke permission to lf-business-analysts as database creator.
Delete the CloudFormation stack.

This deletes the IAM users, S3 bucket, Lake Formation database db_for_shared_tables, resource links, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we illustrated the benefits of using Lake Formation cross-account sharing Version 3 using LF-tags to direct IAM principals and how to receive the shared tables in the consumer account. We used a two-account scenario in which a data producer account shares a database and specific tables to individual IAM users in another account using LF-tags. In the receiving account, we showed the role played by a data lake admin vs. the receiving IAM users. We also illustrated how to overwrite column tags to mask and share PII data.

With Version 3 of cross-account sharing features, Lake Formation makes possible more modern data mesh models, where a producer can directly share to an IAM principal in another account, instead of the entire account. Data mesh implementation becomes easier for data administrators and data platform owners because they can easily scale to hundreds of consumer accounts using the LF-tags based sharing to organizational units or IDs.

We encourage you to upgrade your Lake Formation cross-account sharing to Version 3 and benefit from the enhancements. For more details, see Updating cross-account data sharing version settings.

About the authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

Discover and secure

Connect with data sharing

Scale and optimize

Audit and monitor

Conclusion

About the authors

Scenario 1 – Hybrid access mode within an AWS account

Prerequisites

Personas and their IAM policy setup

Set up Lake Formation permissions for Data-Analyst

Validating Lake Formation permissions for Data-Analyst

Validating IAM and S3 permissions for Data-Engineer

Scenario 2 – Hybrid access mode set up between two AWS accounts

Prerequisites

Set up in producer account A

Setting the cross-account version and registering the S3 bucket

Granting cross-account permissions

Set up in consumer account B

Accepting AWS RAM invites

Granting Lake Formation permissions to Data-Analyst

Validate Lake Formation permissions as Data-Analyst

Validate IAM and S3 permissions for Data-Engineer

Clean up

Considerations

Conclusion

About the authors

Scenario 1: Same account use case

Prerequisites

Set up permissions for us-east-2

Set up permissions for us-west-2

Scenario 2: Cross-account use case

Prerequisites

Set up account A for cross-Region sharing

Set up account B for cross-Region sharing

Considerations

Conclusion

About the author

Solution overview

Prerequisites

Lake Formation and AWS CloudFormation setup in account A

Review CloudFormation stack output in account A

Set up a federated AWS Glue connection in account A

Grant Lake Formation permissions as producer steward

Lake Formation and AWS CloudFormation setup in account B

Accept resource shares in account B

Grant permissions in account B

Query the Apache Hive database of the producer from the consumer Athena

Clean up

Clean up resources in account A

Clean up resources in account B

Conclusion

About the authors

Solution overview

Prerequisites

Lake Formation and AWS CloudFormation setup in account A

Verify permissions in account A

Grant permissions in account A

AWS CloudFormation setup in account B

Accept resource shares in account B

Grant permissions in account B

Query the shared datasets as consumer users

Clean up

Conclusion

About the authors

The collective thoughts of the interwebz