Use IAM runtime roles with Amazon EMR Studio Workspaces and AWS Lake Formation for cross-account fine-grained access control

Post Syndicated from Ashley Zhou original https://aws.amazon.com/blogs/big-data/use-iam-runtime-roles-with-amazon-emr-studio-workspaces-and-aws-lake-formation-for-cross-account-fine-grained-access-control/

Amazon EMR Studio is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio provides fully managed Jupyter notebooks and tools such as Spark UI and YARN Timeline Server via EMR Studio Workspaces. You can attach an EMR Studio Workspace to an EMR cluster and use the cluster’s compute power to run data science jobs. Data is often stored in data lakes managed by AWS Lake Formation, enabling you to apply fine-grained access control through a simple grant or revoke mechanism.

We’re happy to introduce runtime roles for EMR Studio Workspaces. You can now define a runtime role and assign it to an EMR cluster when attaching an EMR Studio Workspace. The jobs on the EMR cluster will use this runtime role to access AWS resources. After configuring a runtime role, you can also use Lake Formation and apply fine-grained data access control for the jobs submitted by the EMR Studio Workspace.

Previously, when attaching EMR Studio Workspaces to EMR clusters, all Workspaces had to use the same AWS Identity and Access Management (IAM) role—namely, the cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instance profile. Therefore, all Workspaces attached to the same EMR cluster had the same data access. To control access to data sources, each EMR Studio Workspace had to use a different EMR cluster, and multiple EMR instance profiles were needed.

Starting with the release of Amazon EMR 6.11, you can now choose a runtime role when attaching an EMR Studio Workspace to an EMR cluster. This runtime role scopes down access at the Workspace level. Your Apache Livy and Apache Spark jobs that run from the EMR Studio Workspaces will have permission to access only the data and resources permitted by policies attached to the runtime role. Also, when data is accessed from data lakes managed with Lake Formation, you can enforce fine-grained data access control using Lake Formation permissions. This helps you reduce operational overhead, because Workspaces with different data access needs can share one EMR cluster instead of each requiring its own cluster and instance profile.

In this post, we demonstrate how to configure runtime roles for EMR Studio Workspaces and attach a Workspace to an EMR cluster with runtime roles. Because large enterprises typically use multiple AWS accounts, and many of those accounts might need access to a data lake managed by a single AWS account, our example uses two AWS accounts. We explain how to control access to EMR Studio runtime roles, manage data access across accounts in a data lake via Lake Formation, and enforce table-level and column-level permissions for the EMR runtime roles.

Solution overview

To demonstrate fine-grained access control, we create a sample AWS Glue database named company and manage the database permissions in Lake Formation. The database consists of two separate tables:

  • employees – This table stores information about the company’s employees, including employee ID, name, department, and salary
  • products – This table stores information about the products sold by the company, including product ID, name, category, and price

To demonstrate data access control, we consider the following data users:

  • Alice, a data scientist in the sales team – She should have read-only access to all columns in the products table and to selected columns (uid, name, and department) in the employees table
  • Bob, a data scientist in the human resources team – He should have read-only access to all columns in the employees table and should not have access to the products table

To demonstrate cross-account data sharing, we consider two accounts:

  • Data producer account – We refer to this account as 123456789012 in this post. This account manages the raw data in Amazon Simple Storage Service (Amazon S3) and writes data to the data lake. The company database and tables should be in this account.
  • Data consumer account – We refer to this account as 111122223333 in this post. This account is accessed directly by the users for data analysis and doesn’t have write access to the data. This account should be accessible by Alice and Bob.

The architecture is implemented as follows:

  • The data producer account manages a data lake. Raw data is stored in S3 buckets and catalogued in the AWS Glue Data Catalog.
  • Lake Formation in the data producer account governs the data access via the Data Catalog, and provides cross-account data sharing with the data consumer account.
  • Lake Formation in the data consumer account governs access to the shared data lake at the table level and with fine-grained Lake Formation permissions. For more information, refer to Methods for fine-grained access control.
  • EMR Studio Workspaces in the data consumer account use runtime roles when running jobs on an EMR cluster.
  • The EMR cluster connects to the AWS Glue Data Catalog in the data consumer account and queries the data from the data lake through cross-account data sharing.

The following diagram illustrates this architecture.

In the following sections, we go through the steps to share data across accounts via Lake Formation, run an EMR Studio Workspace with runtime roles, and demonstrate fine-grained access control.

Prerequisites

You should have the following prerequisites:

Create the infrastructure in the data producer account

Complete the following steps to create the infrastructure resources:

  1. Log in to the data producer AWS account (123456789012).
  2. Choose Launch Stack to deploy a CloudFormation template to create the necessary resources.
  3. For DataLakeBucketSuffix, enter the suffix for the S3 bucket used by the data lake. The full S3 bucket name to be created will be {AwsAccountId}-{AwsRegion}-{DataLakeBucketSuffix}.
  4. After the CloudFormation stack is created, navigate to the Outputs tab of the stack and capture the value of DataLakeS3Bucket to use in the next step.

Create data files and upload them to Amazon S3 in the data producer account

Configure your AWS CLI to use the IAM identity with permission to upload to DataLakeS3BucketName in the data producer AWS account (123456789012), or you can sign in to CloudShell using the AWS Management Console. Complete the following steps:

  1. On your local machine, move to a directory of your choice with the cd command, for example, cd ~.
  2. Run the script with chmod 744 create_sample_data.sh && ./create_sample_data.sh <DataLakeS3BucketName>.

The script will create a subdirectory tmp in your current working directory, create the test data in CSV files, and upload the files to the DataLakeS3BucketName S3 bucket.

Set up Lake Formation in the data producer account

In this section, we walk through the steps to set up Lake Formation in the data producer account.

Set up Lake Formation cross-account data sharing version settings

Lake Formation supports multiple data sharing versions. For this post, we use version 3. To learn more about the differences between data sharing versions, refer to Updating cross-account data sharing version settings. To change the data sharing version, see To enable the new version.

Register the Amazon S3 location as the data lake location

When you register an Amazon S3 location with Lake Formation, you specify an IAM role with read/write permissions on that location. After registering, when EMR clusters request access to this Amazon S3 location, Lake Formation will supply temporary credentials of the provided role to access the data. We already created the role LakeFormationCompanyDatabaseDataAccessRole for this purpose in the previous step. To register the Amazon S3 location as the data lake location, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account (123456789012).
  2. In the navigation pane, choose Data lake locations under Administration.
  3. Choose Register location.
  4. For Amazon S3 path, enter s3://<DataLakeS3BucketName>/company-database.
  5. For IAM role, enter LakeFormationCompanyDatabaseDataAccessRole.
  6. For Permission mode, select Lake Formation.
  7. Choose Register location.
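
If you prefer to script this registration, the console steps above map to the Lake Formation RegisterResource API. The following is a minimal sketch rather than the method used in this post: the helper names are hypothetical, and it assumes the role was created without an IAM path and that you have boto3 installed with data lake administrator credentials for the data producer account.

```python
def build_register_location_request(bucket_name: str, account_id: str) -> dict:
    """Build keyword arguments for the Lake Formation RegisterResource call."""
    return {
        # The API expects an S3 ARN rather than an s3:// URI.
        "ResourceArn": f"arn:aws:s3:::{bucket_name}/company-database",
        # Assumption: the role has no IAM path prefix.
        "RoleArn": (
            f"arn:aws:iam::{account_id}:role/"
            "LakeFormationCompanyDatabaseDataAccessRole"
        ),
        # Use the role above instead of the Lake Formation service-linked role.
        "UseServiceLinkedRole": False,
    }


def register_location(bucket_name: str, account_id: str) -> None:
    """Send the request; needs boto3 and producer-account credentials."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("lakeformation")
    client.register_resource(
        **build_register_location_request(bucket_name, account_id)
    )
```

Keeping the request builder separate from the API call lets you review the exact ARNs before anything is sent.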


Revoke permissions granted to IAMAllowedPrincipals

The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies. To enforce the Lake Formation model, we need to revoke permission from IAMAllowedPrincipals using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. In the navigation pane, choose Data lake permissions under Permissions.
  3. Filter permissions by Database = company and Principal = IAMAllowedPrincipals.
  4. Select all the permissions given to the principal IAMAllowedPrincipals and choose Revoke.
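
The same revocation can be sketched with the Lake Formation RevokePermissions API. This example is illustrative only: the helper name is hypothetical, and it assumes the permissions held by IAMAllowedPrincipals are the default ALL (shown as Super in the console) on the database and on each table. Check the permissions listed in your console first, because the API rejects a revoke for permissions that were never granted.

```python
# Identifier the Lake Formation API uses for the IAMAllowedPrincipals group.
IAM_ALLOWED_PRINCIPALS = "IAM_ALLOWED_PRINCIPALS"


def build_revoke_requests(database: str, tables: list) -> list:
    """Build one RevokePermissions request per database and table."""
    principal = {"DataLakePrincipalIdentifier": IAM_ALLOWED_PRINCIPALS}
    requests = [
        {
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["ALL"],  # assumption: default Super grant
        }
    ]
    for table in tables:
        requests.append(
            {
                "Principal": principal,
                "Resource": {"Table": {"DatabaseName": database, "Name": table}},
                "Permissions": ["ALL"],  # assumption: default Super grant
            }
        )
    return requests


def revoke_all(database: str, tables: list) -> None:
    """Send the requests; needs boto3 and producer-account credentials."""
    import boto3

    client = boto3.client("lakeformation")
    for request in build_revoke_requests(database, tables):
        client.revoke_permissions(**request)
```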


Set up application integration settings

To enforce permissions for the EMR cluster, you need to register a session tag value with Lake Formation. Lake Formation uses this session tag to authorize callers and provide access to the data lake. We register Amazon EMR as the session tag value. This value will be referenced in the security configuration when creating the EMR cluster.

Set up the session tag using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. Choose Application integration settings under Administration in the navigation pane.
  3. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  4. For Session tag values, enter Amazon EMR.
  5. For AWS account IDs, enter the data consumer AWS account ID (111122223333).
  6. Choose Save.
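
These console settings correspond to fields of the Lake Formation data lake settings. Because PutDataLakeSettings replaces the whole settings object, a script should read the current settings and merge into them rather than send only the new fields. The sketch below shows that merge; the function names are hypothetical, and it assumes boto3 and data lake administrator credentials.

```python
import copy


def with_external_filtering(settings: dict, session_tag: str, account_id: str) -> dict:
    """Return a copy of the settings with external data filtering enabled."""
    updated = copy.deepcopy(settings)  # leave the caller's dict untouched
    updated["AllowExternalDataFiltering"] = True

    # Account IDs whose engines may filter data in registered locations.
    allow_list = updated.setdefault("ExternalDataFilteringAllowList", [])
    entry = {"DataLakePrincipalIdentifier": account_id}
    if entry not in allow_list:
        allow_list.append(entry)

    # Session tag values that Lake Formation accepts from callers.
    tags = updated.setdefault("AuthorizedSessionTagValueList", [])
    if session_tag not in tags:
        tags.append(session_tag)
    return updated


def apply_settings(session_tag: str, account_id: str) -> None:
    """Read, merge, and write the settings; needs boto3 and credentials."""
    import boto3

    client = boto3.client("lakeformation")
    current = client.get_data_lake_settings()["DataLakeSettings"]
    client.put_data_lake_settings(
        DataLakeSettings=with_external_filtering(current, session_tag, account_id)
    )
```

For the steps above, you would call apply_settings("Amazon EMR", "111122223333") in the data producer account.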


Share the database and tables to the data consumer account

We now grant permissions to the data consumer AWS account, including grantable permissions. This allows the Lake Formation data lake administrator in the data consumer account to control access to the data within the account.

Grant database permissions to the data consumer account

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. In the navigation pane, choose Databases.
  3. Select the database company, and on the Actions menu, under Permissions, choose Grant.
  4. In the Principals section, select External accounts and enter the data consumer AWS account (111122223333).
  5. In the LF-Tags or catalog resources section, choose company for Databases.
  6. In the Database permissions section, select Describe for both Database permissions and Grantable permissions.

This allows the data lake administrator in the data consumer account to describe the database and grant describe permissions to other principals in the data consumer account.

  7. Choose Grant.


Grant table permissions to the data consumer account

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. In the navigation pane, choose Tables.
  3. Select the products table, which belongs to the company database, and on the Actions menu, under Permissions, choose Grant.
  4. In the Principals section, select External accounts and enter the data consumer AWS account (111122223333).
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose products and employees.
  6. In the Table permissions section, choose Select and Describe for both Table permissions and Grantable permissions.

This allows the data lake administrator in the data consumer account to select and describe the tables, and grant select and describe table permissions to other principals in the data consumer account.

  7. In the Data permissions section, select All data access.
  8. Choose Grant.
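
The database and table grants to the data consumer account can also be expressed as Lake Formation GrantPermissions requests. The following sketch builds one request for the database and one per table, each with matching grantable permissions. The helper names are hypothetical, and sending the requests assumes boto3 with data lake administrator credentials in the data producer account.

```python
def build_cross_account_grants(consumer_account_id: str, database: str, tables: list) -> list:
    """Build GrantPermissions requests that share a database and its tables."""
    # For a cross-account grant, the principal is the 12-digit account ID.
    principal = {"DataLakePrincipalIdentifier": consumer_account_id}
    grants = [
        {
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["DESCRIBE"],
            # Lets the consumer's administrator grant DESCRIBE onward.
            "PermissionsWithGrantOption": ["DESCRIBE"],
        }
    ]
    for table in tables:
        grants.append(
            {
                "Principal": principal,
                "Resource": {"Table": {"DatabaseName": database, "Name": table}},
                "Permissions": ["SELECT", "DESCRIBE"],
                "PermissionsWithGrantOption": ["SELECT", "DESCRIBE"],
            }
        )
    return grants


def share_with_consumer(consumer_account_id: str, database: str, tables: list) -> None:
    """Send the requests; needs boto3 and producer-account credentials."""
    import boto3

    client = boto3.client("lakeformation")
    for grant in build_cross_account_grants(consumer_account_id, database, tables):
        client.grant_permissions(**grant)
```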

Now we have finished setting up the data producer account.

Set up the infrastructure in the data consumer account

Complete the following steps to create the infrastructure resources:

  1. Log in to the data consumer account (111122223333).
  2. Choose Launch stack to deploy a CloudFormation template to create the necessary resources.
  3. For Release Label, enter the Amazon EMR release label to use, which must be emr-6.11 or later.
  4. For InstanceType, choose the instance type for the EMR cluster, such as r4.4xlarge.
  5. For EMRS3BucketNameSuffix, enter the S3 bucket suffix to store EMR cluster logs and EMR notebook files. The full S3 bucket name to be created will be {AWSAccountId}-{AWSRegion}-{EMRS3BucketNameSuffix}.
  6. For S3PathToInTransitCertificate, enter the S3 path for the .zip file that contains the .pem files used for in-transit encryption.

For instructions on creating the .zip file that contains the .pem files and uploading them to your S3 bucket, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption.

  7. After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
  8. Capture the value of EMRStudioLink to use to sign in to EMR Studio.

Accept the resource share in the data consumer account

To access shared resources, you must accept the invitation first.

  1. Open the AWS RAM console of the data consumer account with the IAM identity that has AWS RAM access.
  2. In the navigation pane, choose Resource shares under Shared with me.

You should see two pending resource shares from the data producer account.

  3. Accept both resource shares.

You should see the company database, employees table, and products table in the Data Catalog.
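
Accepting the invitations can also be scripted with the AWS RAM API. This is a sketch under stated assumptions: the helper names are hypothetical, it needs boto3 with AWS RAM permissions in the data consumer account, and it reads only the first page of invitations.

```python
def pending_invitation_arns(invitations: list, producer_account_id: str) -> list:
    """Pick the ARNs of pending invitations sent by the producer account."""
    return [
        invitation["resourceShareInvitationArn"]
        for invitation in invitations
        if invitation.get("status") == "PENDING"
        and invitation.get("senderAccountId") == producer_account_id
    ]


def accept_pending(producer_account_id: str) -> None:
    """Accept the matching invitations; needs boto3 and RAM permissions."""
    import boto3

    ram = boto3.client("ram")
    # Assumption: all invitations fit in one response page.
    response = ram.get_resource_share_invitations()
    invitations = response["resourceShareInvitations"]
    for arn in pending_invitation_arns(invitations, producer_account_id):
        ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
```

Filtering on the sender account avoids accepting unrelated shares that happen to be pending in the same account.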

Set up Lake Formation in the data consumer account

In this section, we walk through the steps to set up Lake Formation in the data consumer account.

Set up application integration settings

Similar to the setup in the data producer account, you need to register Amazon EMR as a session tag. This value is referenced in the security configuration when creating the EMR cluster in the CloudFormation stack.

To do that, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account (111122223333).
  2. Choose Application integration settings under Administration in the navigation pane.
  3. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  4. For Session tag values, enter Amazon EMR.
  5. For AWS account IDs, enter the data consumer AWS account ID (111122223333).
  6. Choose Save.


Grant describe permissions to runtime roles on the default database

If you don’t have a default database in Lake Formation, or your default database already grants permissions to IAMAllowedPrincipals, you can skip this step.

Amazon EMR checks the default database by default. If you already have a default database in Lake Formation, grant the describe permission to the runtime roles on the default database by completing the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator user in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the default database, verify that the owner account ID is the data consumer account (111122223333), and on the Actions menu, choose Grant.
  4. In the Principals section, select IAM users and roles.
  5. For IAM users and roles, choose sales-runtime-role and human-resource-runtime-role.
  6. For LF-Tags or catalog resources, select Named data catalog resources and choose default for Databases.
  7. In the Database permissions section, for Database permissions, choose Describe.
  8. Choose Grant.


Create a resource link for the shared database

To access the database and table resources that were shared by the data producer AWS account, you need to create a resource link in the data consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database or table. After you create a resource link to a database or table, you can use the resource link name wherever you would use the database or table name. In this step, you create the resource link. In the steps that follow, you grant permissions on it to the runtime role principals so that the runtime roles can access the data in the shared database and its underlying tables through the resource link.

To create a resource link, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the company database, verify that the owner account ID is the data producer account (123456789012), and on the Actions menu, choose Create Resource links.
  4. For Resource link name, enter the name of the resource link (for example, company-shared).
  5. For Shared database’s region, choose the Region of the company database.
  6. For Shared database, choose the company database.
  7. For Shared database’s owner ID, enter the account ID of the data producer account (123456789012).
  8. Choose Create.
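
In API terms, a database resource link is an AWS Glue database whose definition points at a target database in another catalog. The following sketch builds the input for the Glue CreateDatabase call. The helper names are hypothetical, and it assumes the shared database is in the same Region as the client and that boto3 and data consumer credentials are available.

```python
def build_resource_link_input(link_name: str, producer_account_id: str, shared_database: str) -> dict:
    """Build the DatabaseInput for a resource link to a shared database."""
    return {
        "Name": link_name,
        "TargetDatabase": {
            # The catalog ID of a shared database is the owner's account ID.
            "CatalogId": producer_account_id,
            "DatabaseName": shared_database,
        },
    }


def create_resource_link(link_name: str, producer_account_id: str, shared_database: str) -> None:
    """Create the link; needs boto3 and consumer-account credentials."""
    import boto3

    glue = boto3.client("glue")
    glue.create_database(
        DatabaseInput=build_resource_link_input(
            link_name, producer_account_id, shared_database
        )
    )
```

With the values from this post, the call would be create_resource_link("company-shared", "123456789012", "company").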


Grant permissions on the resource link to the runtime role principals

Grant permissions on the resource link to sales-runtime-role and human-resource-runtime-role using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant.
  4. In the Principals section, select IAM users and roles, and choose sales-runtime-role and human-resource-runtime-role.
  5. In the LF-Tags or catalog resources section, for Databases, choose company-shared.
  6. In the Resource link permissions section, select Describe.

This allows the runtime roles to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principals.

  7. Choose Grant.
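
A describe-only grant on a database, such as the resource link here or the default database earlier, can be sketched as GrantPermissions requests with no grantable permissions. The helper names are hypothetical, and the role ARNs assume the runtime roles were created without an IAM path.

```python
def role_arn(account_id: str, role_name: str) -> str:
    """Build an IAM role ARN (assumption: the role has no path prefix)."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def build_describe_grants(account_id: str, database: str, role_names: list) -> list:
    """Build one DESCRIBE grant per runtime role on a local database or link."""
    return [
        {
            "Principal": {"DataLakePrincipalIdentifier": role_arn(account_id, name)},
            # The resource link lives in the consumer's own catalog.
            "Resource": {"Database": {"CatalogId": account_id, "Name": database}},
            # No PermissionsWithGrantOption: roles cannot grant onward.
            "Permissions": ["DESCRIBE"],
        }
        for name in role_names
    ]


def grant_describe(account_id: str, database: str, role_names: list) -> None:
    """Send the requests; needs boto3 and consumer-account credentials."""
    import boto3

    client = boto3.client("lakeformation")
    for grant in build_describe_grants(account_id, database, role_names):
        client.grant_permissions(**grant)
```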


Grant permission on the tables to the runtime role principals

You need to grant permissions on the tables to sales-runtime-role and human-resource-runtime-role to allow data access:

  • human-resource-runtime-role should have describe and select permissions on all columns in the employees table, and no permissions on the products table.
  • sales-runtime-role should have select permissions on the columns uid, name, and department in the employees table, and describe and select permissions on all columns in the products table.

Grant permission on the employees table to human-resource-runtime-role

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
  4. In the Principals section, select IAM users and roles, then choose human-resource-runtime-role.
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose employees.
  6. In the Table permissions section, for Table permissions, select Describe and Select.
  7. In the Data permissions section, select All data access.
  8. Choose Grant.


Grant permission on the employees table to sales-runtime-role

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
  4. In the Principals section, select IAM users and roles, then choose sales-runtime-role.
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose employees.
  6. In the Table permissions section, for Table permissions, select Select.
  7. In the Data permissions section, select Column-based access.
  8. Select Include columns and choose the uid, name, and department columns.
  9. Choose Grant.


Grant permission on the products table to sales-runtime-role

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
  4. In the Principals section, select IAM users and roles, then choose sales-runtime-role.
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose products.
  6. In the Table permissions section, for Table permissions, select Select and Describe.
  7. In the Data permissions section, select All data access.
  8. Choose Grant.
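
The three table grants above can be summarized as GrantPermissions requests against the target tables in the data producer's catalog. In this sketch, a full-table grant uses a Table resource and the column-restricted grant uses a TableWithColumns resource. The helper names are hypothetical, the account IDs are the placeholders used in this post, and the role ARNs assume no IAM path.

```python
PRODUCER_ACCOUNT_ID = "123456789012"  # owns the company database
CONSUMER_ACCOUNT_ID = "111122223333"  # owns the runtime roles


def build_table_grant(role_name: str, table: str, permissions: list, columns=None) -> dict:
    """Build a grant on a shared table, optionally limited to some columns."""
    target = {
        "CatalogId": PRODUCER_ACCOUNT_ID,  # grant on the target, not the link
        "DatabaseName": "company",
        "Name": table,
    }
    if columns is None:
        resource = {"Table": target}  # all data access
    else:
        resource = {"TableWithColumns": {**target, "ColumnNames": list(columns)}}
    return {
        "Principal": {
            "DataLakePrincipalIdentifier": (
                f"arn:aws:iam::{CONSUMER_ACCOUNT_ID}:role/{role_name}"
            )
        },
        "Resource": resource,
        "Permissions": list(permissions),
    }


def build_runtime_role_grants() -> list:
    """The three grants described in this section."""
    return [
        build_table_grant("human-resource-runtime-role", "employees", ["DESCRIBE", "SELECT"]),
        build_table_grant("sales-runtime-role", "employees", ["SELECT"],
                          columns=["uid", "name", "department"]),
        build_table_grant("sales-runtime-role", "products", ["SELECT", "DESCRIBE"]),
    ]
```

Each returned dictionary can be passed to the boto3 Lake Formation client as client.grant_permissions(**grant) by a data lake administrator in the data consumer account.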


Log in to EMR Studio and use the EMR Studio Workspace

Switch your role to alice-role or bob-role on the console using different web browsers to test access. Open the EMRStudioLink URL from the CloudFormation stack output to sign in to EMR Studio with each role, then complete the following steps:

  1. Choose Workspaces in the navigation pane and choose Create Workspace.
  2. Enter a name and a description for the Workspace.
  3. Choose Create Workspace.

A new tab containing JupyterLab will open automatically when the Workspace is ready. Enable pop-ups in your browser if necessary.

  4. Choose the Compute icon in the navigation pane to attach the EMR Studio Workspace to a compute engine.
  5. Select EMR cluster on EC2 for Compute type.
  6. Choose the EMR cluster ID you created with AWS CloudFormation.
  7. For Runtime role, choose sales-runtime-role if signed in as alice-role. Choose human-resource-runtime-role if signed in as bob-role.
  8. Choose Attach.


Run code in the EMR Studio Workspace and verify data access

Run the following code in the EMR Studio Workspace with a PySpark kernel after signing in with alice-role or bob-role:

%%sql -o result -n -1
select * from `company-shared`.products limit 5;

%%sql -o result -n -1
select * from `company-shared`.employees limit 5;

You should see different results when using different roles.

According to our data access configuration in Lake Formation, Alice will have full data access for the products table. She can view all the columns except for salary in the employees table.

Alice (sales) query result

For Bob, according to our data access configuration in Lake Formation, he will have full data access to the employees table, but he has no access to the products table.

Bob (human resource) query result
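
To compare both roles without reading the query output by eye, you could probe each table and record which columns come back. The following is a hypothetical helper, not part of the original walkthrough. It takes any callable that runs a query and returns column names, so it makes no assumption about which exception Spark raises when Lake Formation denies access.

```python
def probe_access(run_query, database: str, tables: list) -> dict:
    """Query each table and record its visible columns or the failure."""
    results = {}
    for table in tables:
        # Backticks are needed because the resource link name has a hyphen.
        query = f"select * from `{database}`.{table} limit 5"
        try:
            results[table] = list(run_query(query))
        except Exception as error:  # broad on purpose: error type is unknown
            results[table] = f"DENIED: {error}"
    return results
```

In a PySpark kernel, a call might look like probe_access(lambda q: spark.sql(q).columns, "company-shared", ["products", "employees"]). Based on the grants in this post, the sales role should see no salary column in employees, and the human resources role should get a failure for products.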

Clean up

When you’re finished experimenting with this solution, clean up your resources:

  1. Stop and delete the EMR Studio Workspaces created in the data consumer AWS account.
  2. Delete all the content in the S3 bucket EMRS3Bucket in the data consumer AWS account.
  3. Delete the CloudFormation stack in the data consumer AWS account.
  4. Delete all the content in the S3 bucket DataLakeS3Bucket in the data producer AWS account.
  5. Delete the CloudFormation stack in the data producer AWS account.

Conclusion

This post showed how you can use runtime roles to connect to an EMR Studio Workspace with Amazon EMR to apply cross-account fine-grained data access control with Lake Formation. We also demonstrated how multiple EMR Studio users can connect to the same EMR cluster, each using a runtime role scoped with permissions matching their individual level of access to data.

To learn more about using EMR Studio Workspaces with Lake Formation, refer to Run an EMR Studio Workspace with a runtime role. We encourage you to try out this new functionality, and connect with us if you have any questions or feedback!


About the Authors

Ashley Zhou is a Software Development Engineer at AWS. She is interested in data analytics and distributed systems.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.