All posts by Sudipta Mitra

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Post Syndicated from Sudipta Mitra original https://aws.amazon.com/blogs/big-data/design-a-data-mesh-pattern-for-amazon-emr-based-data-lakes-using-aws-lake-formation-with-hive-metastore-federation/

In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery.

One of the key challenges in modern big data management is facilitating efficient data sharing and access control across multiple EMR clusters. Organizations often run multiple Hive data warehouses across EMR clusters, where the metadata is generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects those EMR clusters. With the AWS Glue Data Catalog federation to external Hive metastore feature, you can now apply data governance to the metadata residing across those EMR clusters and analyze it using AWS analytics services such as Amazon Athena, Amazon Redshift Spectrum, AWS Glue ETL (extract, transform, and load) jobs, EMR notebooks, EMR Serverless with Lake Formation for fine-grained access control, and Amazon SageMaker Studio. For detailed information on managing your Apache Hive metastore using Lake Formation permissions, refer to Query your Apache Hive metastore with AWS Lake Formation permissions.

In this post, we present a methodology for deploying a data mesh consisting of multiple Hive data warehouses across EMR clusters. This approach enables organizations to take advantage of the scalability and flexibility of EMR clusters while maintaining control and integrity of their data assets across the data mesh.

Use cases for Hive metastore federation for Amazon EMR

Hive metastore federation for Amazon EMR is applicable to the following use cases:

  • Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3) and HBase. These data lakes require governed access without moving data to consumer accounts. The data resides on Amazon S3, which reduces storage costs significantly.
  • Centralized catalog for published data – Multiple producers release data currently governed by their respective entities. For consumer access, a centralized catalog is necessary where producers can publish their data assets.
  • Consumer personas – Consumers include data analysts who run queries on the data lake, data scientists who prepare data for machine learning (ML) models and conduct exploratory analysis, as well as downstream systems that run batch jobs on the data within the data lake.
  • Cross-producer data access – Consumers may need to access data from multiple producers within the same catalog environment.
  • Data access entitlements – Data access entitlements involve implementing restrictions at the database, table, and column levels to provide appropriate data access control.

Solution overview

The following diagram shows how data from producers with their own Hive metastores (left) can be made available to consumers (right) using Lake Formation permissions enforced in a central governance account.

Producer and consumer are logical concepts used to indicate the production and consumption of data through a catalog. An entity can act both as a producer of data assets and as a consumer of data assets. The onboarding of producers is facilitated by sharing metadata, whereas the onboarding of consumers is based on granting permission to access this metadata.

The solution consists of multiple steps in the producer, catalog, and consumer accounts:

  1. Deploy the AWS CloudFormation templates and set up the producer, central governance and catalog, and consumer accounts.
  2. Test access to the producer cataloged Amazon S3 data using EMR Serverless in the consumer account.
  3. Test access using Athena queries in the consumer account.
  4. Test access using SageMaker Studio in the consumer account.

Producer

Producers create data within their AWS accounts using an Amazon EMR-based data lake and Amazon S3. Multiple producers then publish this data into a central catalog (data lake) account. Each producer account, along with the central catalog account, has either VPC peering or AWS Transit Gateway enabled to facilitate AWS Glue Data Catalog federation with the Hive metastore.

For each producer, an AWS Glue Hive metastore connector AWS Lambda function is deployed in the catalog account. This enables the Data Catalog to access Hive metastore information at runtime from the producer. The data lake locations (the S3 bucket location of the producers) are registered in the catalog account.
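If you prefer to script the registration of a producer's data lake location, the following is a minimal boto3 sketch run in the catalog account; the bucket name, role ARN, and Region are placeholders for the values created by the CloudFormation templates.

import boto3

# Register the producer's S3 location with Lake Formation in the catalog account
# so Lake Formation can vend credentials for it. The ARNs below are placeholders.
lake_formation = boto3.client("lakeformation", region_name="us-east-1")

lake_formation.register_resource(
    ResourceArn="arn:aws:s3:::producer-data-lake-bucket",
    UseServiceLinkedRole=False,
    RoleArn="arn:aws:iam::111122223333:role/LFRegisterLocationServiceRole",
)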

Central catalog

A catalog offers governed and secure data access to consumers. Federated databases are established within the catalog account’s Data Catalog using the Hive connection, managed by the catalog Lake Formation admin (LF-Admin). These federated databases in the catalog account are then shared by the data lake LF-Admin with the consumer LF-Admin of the external consumer account.

Data access entitlements are managed by applying access controls as needed at various levels, such as the database or table.

Consumer

The consumer LF-Admin grants the necessary permissions or restricted permissions to roles such as data analysts, data scientists, and downstream batch processing engine AWS Identity and Access Management (IAM) roles within its account.

Data access entitlements are managed by applying access control based on requirements at various levels, such as databases and tables.

Prerequisites

You need three AWS accounts with admin access to implement this solution. It is recommended to use test accounts. The producer account will host the EMR cluster and S3 buckets. The catalog account will host Lake Formation and AWS Glue. The consumer account will host EMR Serverless, Athena, and SageMaker notebooks.

Set up the producer account

Before you launch the CloudFormation stack, gather the following information from the catalog account:

  • Catalog AWS account ID (12-digit account ID)
  • Catalog VPC ID (for example, vpc-xxxxxxxx)
  • VPC CIDR (catalog account VPC CIDR; it should not overlap with 10.0.0.0/16)

The producer and catalog VPC CIDRs can’t overlap because of the VPC peering and Transit Gateway requirements. The VPC should be one in the catalog account where the AWS Glue metastore connector Lambda function will eventually be deployed.

The CloudFormation stack for the producer creates the following resources:

  • S3 bucket to host data for the Hive metastore of the EMR cluster.
  • VPC with the CIDR 10.0.0.0/16. Make sure there is no existing VPC with this CIDR in use.
  • VPC peering connection between the producer and catalog account.
  • Amazon Elastic Compute Cloud (Amazon EC2) security groups for the EMR cluster.
  • IAM roles required for the solution.
  • EMR 6.10 cluster launched with Hive.
  • Sample data downloaded to the S3 bucket.
  • A database and external tables, pointing to the downloaded sample data, in its Hive metastore.

Complete the following steps:

  1. Launch the template PRODUCER.yml. It’s recommended to use an IAM role that has administrator privileges.
  2. Gather the values for the following on the CloudFormation stack’s Outputs tab:
    1. VpcPeeringConnectionId (for example, pcx-xxxxxxxxx)
    2. DestinationCidrBlock (10.0.0.0/16)
    3. S3ProducerDataLakeBucketName

Set up the catalog account

The CloudFormation stack for the catalog account creates the Lambda function for federation. Before you launch the template, on the Lake Formation console, add the IAM role and user deploying the stack as the data lake admin.

Then complete the following steps:

  1. Launch the template CATALOG.yml.
  2. For the RouteTableId parameter, use the catalog account VPC RouteTableId. This is the VPC where the AWS Glue Hive metastore connector Lambda function will be deployed.
  3. On the stack’s Outputs tab, copy the value for LFRegisterLocationServiceRole (arn:aws:iam::account-id:role/role-name).
  4. Confirm that the Data Catalog settings have the IAM access control options unchecked and that the current cross-account version is set to 4.

  5. Log in to the producer account and add the following bucket policy to the producer S3 bucket that was created during the producer account setup. Add the ARN of LFRegisterLocationServiceRole to the Principal section and provide the S3 bucket name under the Resource section.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::account-id: role/role-name"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::s3-bucket-name/*",
                "arn:aws:s3:::s3-bucket-name"
            ]
        }
    ]
}
  6. In the producer account, on the Amazon EMR console, navigate to the primary node EC2 instance to get the value for Private IP DNS name (IPv4 only) (for example, ip-xx-x-x-xx.us-west-1.compute.internal).
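You can also retrieve this value with the AWS SDK. The following boto3 sketch assumes the cluster uses instance groups and that you substitute your own cluster ID and Region:

import boto3

emr = boto3.client("emr", region_name="us-west-1")

# Look up the primary (master) node of the EMR cluster; the cluster ID is a placeholder.
response = emr.list_instances(ClusterId="j-XXXXXXXXXXXXX", InstanceGroupTypes=["MASTER"])
print(response["Instances"][0]["PrivateDnsName"])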

  7. Switch to the catalog account and deploy the AWS Glue Data Catalog federation Lambda function (GlueDataCatalogFederation-HiveMetastore).

The default Region is set to us-east-1. Change it to your desired Region before deploying the function.

Use the VPC whose CIDR you provided as the CloudFormation input. You can use the VPC’s default security group ID. If you use a different security group, make sure its outbound rules allow traffic to 0.0.0.0/0.

Next, you create a federated database in Lake Formation.

  1. On the Lake Formation console, choose Data sharing in the navigation pane.
  2. Choose Create database.

  3. Provide the following information:
    1. For Connection name, choose your connection.
    2. For Database name, enter a name for your database.
    3. For Database identifier, enter emrhms_salesdb (this is the database created on the EMR Hive metastore).
  4. Choose Create database.
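You can also create the federated database programmatically. The following boto3 sketch assumes a Hive metastore connection named emr_hms_connection (created by the federation Lambda setup) and uses an illustrative federated database name:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a federated database in the catalog account's Data Catalog that points to
# the emrhms_salesdb database in the producer's Hive metastore. The connection name
# and federated database name are placeholders.
glue.create_database(
    DatabaseInput={
        "Name": "federated_emrhms_salesdb",
        "FederatedDatabase": {
            "Identifier": "emrhms_salesdb",
            "ConnectionName": "emr_hms_connection",
        },
    }
)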

  5. On the Databases page, select the federated database and on the Actions menu, choose Grant to grant database permissions to the consumer account.
  6. Under Principals, select External accounts and enter the consumer AWS account ID.
  7. Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database.
  8. Under Database permissions, provide the following information:
    1. For Database permissions, select Create table and Describe.
    2. For Grantable permissions, select Create table and Describe.
  9. Choose Grant.
  10. On the Tables page, select your table and on the Actions menu, choose Grant to grant select and describe permissions.
  11. Under Principals, select External accounts and enter the consumer AWS account ID.
  12. Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database and table.
  13. Under Table permissions, provide the following information:
    1. For Table permissions, select Select and Describe.
    2. For Grantable permissions, select Select and Describe.
  14. Under Data permissions, select All data access.
  15. Choose Grant.
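If you prefer to issue these grants with the AWS SDK instead of the console, the following is a minimal boto3 sketch; the federated database name and consumer account ID are placeholders:

import boto3

lake_formation = boto3.client("lakeformation", region_name="us-east-1")
consumer_account_id = "111122223333"  # placeholder consumer account ID

# Grant CREATE_TABLE and DESCRIBE on the federated database to the consumer account,
# with the grant option so the consumer LF-Admin can re-grant within its account.
lake_formation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": consumer_account_id},
    Resource={"Database": {"Name": "federated_emrhms_salesdb"}},
    Permissions=["CREATE_TABLE", "DESCRIBE"],
    PermissionsWithGrantOption=["CREATE_TABLE", "DESCRIBE"],
)

# Grant SELECT and DESCRIBE on all tables in the federated database.
lake_formation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": consumer_account_id},
    Resource={
        "Table": {
            "DatabaseName": "federated_emrhms_salesdb",
            "TableWildcard": {},
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)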

Set up the consumer account

Consumers include data analysts who run queries on the data lake, data scientists who prepare data for ML models and conduct exploratory analysis, as well as downstream systems that run batch jobs on the data within the data lake.

The consumer account setup in this section shows how you can query the shared Hive metastore data using Athena for the data analyst persona, EMR Serverless to run batch scripts, and SageMaker Studio for the data scientist to further use data in the downstream model building process.

For EMR Serverless and SageMaker Studio, if you’re using the default IAM service role, add the required Data Catalog and Lake Formation IAM permissions to the role and use Lake Formation to grant table permission access to the role’s ARN.

Data analyst use case

In this section, we demonstrate how a data analyst can query the Hive metastore data using Athena. Before you get started, on the Lake Formation console, add the IAM role or user deploying the CloudFormation stack as the data lake admin.

Then complete the following steps:

  1. Run the CloudFormation template CONSUMER.yml.
  2. If the catalog and consumer accounts are not part of the organization in AWS Organizations, navigate to the AWS Resource Access Manager (AWS RAM) console and manually accept the resources shared from the catalog account.
  3. On the Lake Formation console, on the Databases page, select your database and on the Actions menu, choose Create resource link.

  4. Under Database resource link details, provide the following information:
    1. For Resource link name, enter a name.
    2. For Shared database’s region, choose a Region.
    3. For Shared database, choose your database.
    4. For Shared database’s owner ID, enter the catalog account ID.
  5. Choose Create.
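The resource link can also be created with the AWS SDK. The following boto3 sketch, run in the consumer account, uses placeholder values for the resource link name, the shared database, and the catalog account ID:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a resource link in the consumer account that points to the database
# shared from the catalog account. Names and IDs are placeholders.
glue.create_database(
    DatabaseInput={
        "Name": "rl_federated_emrhms_salesdb",
        "TargetDatabase": {
            "CatalogId": "444455556666",  # catalog account ID
            "DatabaseName": "federated_emrhms_salesdb",
        },
    }
)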

Now you can use Athena to query the table on the consumer side, as shown in the following screenshot.
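For reference, the following boto3 sketch runs a similar query with Athena; the resource link database, table name, and query results location are placeholders:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Run a simple query against the table exposed through the resource link database.
query = athena.start_query_execution(
    QueryString="SELECT * FROM rl_federated_emrhms_salesdb.sales LIMIT 10",
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/output/"},
)
print("Query execution ID:", query["QueryExecutionId"])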

Batch job use case

Complete the following steps to set up EMR Serverless to run a sample Spark job to query the existing table:

  1. On the Amazon EMR console, choose EMR Serverless in the navigation pane.
  2. Choose Get started.

  3. Choose Create and launch EMR Studio.

  4. Under Application settings, provide the following information:
    1. For Name, enter a name.
    2. For Type, choose Spark.
    3. For Release version, choose the current version.
    4. For Architecture, select x86_64.
  5. Under Application setup options, select Use custom settings.

  6. Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
  7. Choose Create and start application.

  8. On the application details page, on the Job runs tab, choose Submit job run.

  9. Under Job details, provide the following information:
    1. For Name, enter a name.
    2. For Runtime role, choose Create new role.
    3. Note the IAM role that gets created.
    4. For Script location, enter the S3 bucket location created by the CloudFormation template (the script is emr-serverless-query-script.py).
  10. Choose Submit job run.

  11. Add the following AWS Glue access policy to the IAM role created in the previous step (provide your Region and the account ID of your catalog account):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase",
                "glue:GetDataBases",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreatePartition",
                "glue:BatchCreatePartition",
                "glue:GetUserDefinedFunctions"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:1234567890:catalog",
                "arn:aws:glue:us-east-1:1234567890:database/*",
                "arn:aws:glue:us-east-1:1234567890:table/*/*"
            ]
        }
    ]
}
  12. Add the following Lake Formation access policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "LakeFormation:GetDataAccess"
            "Resource": "*"
        }
    ]
}
  13. On the Databases page, select the database and on the Actions menu, choose Grant to grant Lake Formation access to the EMR Serverless runtime role.
  14. Under Principals, select IAM users and roles and choose your role.
  15. Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database.
  16. Under Resource link permissions, for Resource link permissions, select Describe.
  17. Choose Grant.

  18. On the Databases page, select the database and on the Actions menu, choose Grant on target.

  19. Provide the following information:
    1. Under Principals, select IAM users and roles and choose your role.
    2. Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database and table.
    3. Under Table permissions, for Table permissions, select Select.
    4. Under Data permissions, select All data access.
  20. Choose Grant.

  21. Submit the job again by cloning it.
  22. When the job is complete, choose View logs.

The output should look like the following screenshot.
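The script emr-serverless-query-script.py is provided by the CloudFormation template; a batch script for this use case could look like the following minimal PySpark sketch, which uses placeholder database and table names:

from pyspark.sql import SparkSession

# EMR Serverless is configured to use the AWS Glue Data Catalog as the metastore
# and Lake Formation for fine-grained access control, so the shared table is
# queried through the resource link database. Names are placeholders.
spark = (
    SparkSession.builder.appName("hms-federation-batch-query")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.sql("SELECT * FROM rl_federated_emrhms_salesdb.sales LIMIT 10")
df.show()

spark.stop()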

Data scientist use case

For this use case, a data scientist queries the data through SageMaker Studio. Complete the following steps:

  1. Set up SageMaker Studio.
  2. Confirm that the domain user role has been granted permission by Lake Formation to SELECT data from the table.
  3. Follow steps similar to the batch run use case to grant access.

The following screenshot shows an example notebook.

Clean up

We recommend deleting the CloudFormation stack after use, because the deployed resources will incur costs. There are no prerequisites to delete the producer, catalog, and consumer CloudFormation stacks. To delete the Hive metastore connector stack on the catalog account (serverlessrepo-GlueDataCatalogFederation-HiveMetastore), first delete the federated database you created.

Conclusion

In this post, we explained how to create a federated Hive metastore for deploying a data mesh architecture with multiple Hive data warehouses across EMR clusters.

By using Data Catalog metadata federation, organizations can construct a sophisticated data architecture. This approach not only seamlessly extends your Hive data warehouse but also consolidates access control and fosters integration with various AWS analytics services. Through effective data governance and meticulous orchestration of the data mesh architecture, organizations can provide data integrity, regulatory compliance, and enhanced data sharing across EMR clusters.

We encourage you to check out the features of the AWS Glue Hive metastore federation connector and explore how to implement a data mesh architecture across multiple EMR clusters. To learn more and get started, refer to the AWS Lake Formation and AWS Glue documentation.


About the Authors

Sudipta Mitra is a Senior Data Architect at AWS who is passionate about helping customers build modern data analytics applications by making innovative use of the latest AWS services and their constantly evolving features. A pragmatic architect who works backward from customer needs, he helps customers get comfortable with the proposed solution and achieve tangible business outcomes. His main areas of work are data mesh, data lakes, knowledge graphs, data security, and data governance.

Deepak Sharma is a Senior Data Architect with the AWS Professional Services team, specializing in big data and analytics solutions. With extensive experience in designing and implementing scalable data architectures, he collaborates closely with enterprise customers to build robust data lakes and advanced analytical applications on the AWS platform.

Nanda Chinnappa is a Cloud Infrastructure Architect with AWS Professional Services at Amazon Web Services. Nanda specializes in infrastructure automation, cloud migration, disaster recovery, and databases, including Amazon RDS and Amazon Aurora. He helps AWS customers adopt the AWS Cloud and realize their business outcomes by executing cloud computing initiatives.

Secure data movement across Amazon S3 and Amazon Redshift using role chaining and ASSUMEROLE

Post Syndicated from Sudipta Mitra original https://aws.amazon.com/blogs/big-data/secure-data-movement-across-amazon-s3-and-amazon-redshift-using-role-chaining-and-assumerole/

Modern data architectures use a ring of purpose-built data services around a central data lake. Data needs to move between these services and data stores easily and securely. The following are some examples of such services:

  • Amazon Simple Storage Service (Amazon S3), which stores structured, unstructured, and semi-structured data
  • Amazon Redshift, a fully managed, petabyte-scale data warehouse product to analyze large-scale structured and semi-structured data across data warehouses and operational databases
  • Amazon SageMaker, which consumes data for machine learning (ML) capabilities

In multi-tenant architectures, groups or users within a group may require exclusive permissions to the group’s S3 bucket as well as to the group’s Amazon Redshift schema and tables. These teams also need to be able to control loading and unloading of data between the team-owned S3 buckets and Amazon Redshift schemas. Additionally, individual users within the team may require fine-grained control over objects in S3 buckets and specific schemas in Amazon Redshift. The implementation of this permissions control use case should remain scalable as more teams and users are onboarded and permission-separation requirements evolve.

Amazon Redshift and Amazon S3 provide a unified, natively integrated storage layer for data lakes. You can move data between Amazon Redshift and Amazon S3 using the Amazon Redshift COPY and UNLOAD commands.

This post presents an approach that you can apply at scale to achieve fine-grained access controls to resources in S3 buckets and Amazon Redshift schemas for tenants, including groups of users belonging to the same business unit down to the individual user level. This solution provides tenant isolation and data security. In this approach, we use the bridge model to store data and control access for each tenant at the individual schema level in the same Amazon Redshift database. We utilize ASSUMEROLE and role chaining to provide fine-grained access control when data is being copied and unloaded between Amazon Redshift and Amazon S3, so the data flows within each tenant’s namespace. Role chaining also streamlines the new tenant onboarding process.

Solution overview

In this post, we explore how to achieve resource isolation, data security, scaling to multiple tenants, and fine-grained access control at the individual user level for teams that access, store, and move data across storage using Amazon S3 and Amazon Redshift.

We use the bridge model to store data and control access for each tenant at the individual schema level in the same Amazon Redshift database. In the bridge model, a separate database schema is created for each tenant, and data for each tenant is stored in its own schema. The tenant has access only to its own schema.

We use the COPY and UNLOAD commands to load and unload data into the Amazon Redshift cluster using an S3 bucket. These commands require Amazon Redshift to access Amazon S3 on your behalf, so security credentials must be provided to your cluster.

We create an AWS Identity and Access Management (IAM) role—we call it the Amazon Redshift onboarding role—and associate it with the Amazon Redshift cluster. For each tenant, we create a tenant-specific IAM role—we call it the tenant role—to define the fine-grained access to its own Amazon S3 resources. The Amazon Redshift onboarding role doesn’t have any permissions granted except allowing sts:AssumeRole to the tenant roles. The trust relationship to the Amazon Redshift onboarding role is defined in each of the tenant roles. We use the Amazon Redshift ASSUMEROLE privilege to control IAM role access privileges for database users and groups on COPY and UNLOAD commands.

Each tenant database user or group is granted ASSUMEROLE on the Amazon Redshift onboarding role and its own tenant role, which restricts the tenant to access its own Amazon S3 resources when using COPY and UNLOAD commands. We use role chaining when ASSUMEROLE is granted. This means that the tenant role isn’t required to be attached to the Amazon Redshift cluster, and the only IAM role associated is the Amazon Redshift onboarding role. Role chaining streamlines the new tenant onboarding process. With role chaining, we don’t need to modify the cluster; we can make modifications on the tenant IAM role definition when onboarding a new tenant.
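To illustrate what role chaining looks like at run time, the following boto3 sketch uses the Amazon Redshift Data API to run a COPY command as a tenant user. The chained iam_role value lists the onboarding role first and the tenant role second; the cluster identifier, database, bucket, and account ID are placeholders, and the same COPY syntax appears in the verification steps later in this post.

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Chained roles: the Amazon Redshift onboarding role assumes the tenant role,
# so only the onboarding role needs to be attached to the cluster.
iam_role_chain = (
    "arn:aws:iam::111122223333:role/redshift-s3-onboarding-role,"
    "arn:aws:iam::111122223333:role/team1-tenant-redshift-s3-access-role"
)

copy_sql = f"""
    copy team1.team1_venue
    from 's3://team1-data-111122223333-us-east-1/'
    iam_role '{iam_role_chain}'
    delimiter '|';
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder cluster identifier
    Database="dev",                           # placeholder database name
    DbUser="team1_usr",
    Sql=copy_sql,
)
print("Statement ID:", response["Id"])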

For our use case, we have two tenants: team 1 and team 2. A tenant here represents a group of users—a team from the same business unit. We want separate S3 buckets and Amazon Redshift schemas for each team. These teams should be able to access their own data only and also be able to support fine-grained access control over copying and unloading data from Amazon S3 to Amazon Redshift (and vice versa) within the team boundary. We can apply access control at the individual user level using the same approach.

The following architecture diagram shows the AWS resources and process flow of our solution.

In this tutorial, you create two S3 buckets, two Amazon Redshift tenant schemas, two Amazon Redshift tenant groups, one Amazon Redshift onboarding role, and two tenant roles. Then you grant ASSUMEROLE on the onboarding and tenant role to each tenant, using role chaining. To verify that each tenant can only access its own S3 resources, you create two Amazon Redshift users assigned to their own tenant group and run COPY and UNLOAD commands.

Prerequisites

To follow along with this solution, you need an AWS account with an existing Amazon Redshift cluster and permissions to deploy AWS CloudFormation stacks that create IAM roles, AWS KMS keys, and S3 buckets.

Download the source code to your local environment

To implement this solution in your local development environment, you can download the source code from the GitHub repo or clone the source code using the following command:

git clone https://github.com/aws-samples/amazon-redshift-assume-role-sample.git

The following files are included in the source code:

  • redshift-onboarding-role.cf.yaml – A CloudFormation template to deploy the Amazon Redshift onboarding role redshift-onboarding-role
  • redshift-tenant-resources.cf.yaml – A CloudFormation template to deploy an S3 bucket, KMS key, and IAM role for each tenant you want to onboard

Provision an IAM role for Amazon Redshift and attach this role to the Amazon Redshift cluster

Deploy the template redshift-onboarding-role.cf.yaml using the AWS CloudFormation console or the AWS Command Line Interface (AWS CLI). For more information about stack creation, see Create the stack. This template doesn’t have any required parameters. The stack provisions an IAM role named redshift-s3-onboarding-role for Amazon Redshift. The following policy allows the role to call sts:AssumeRole on the tenant-specific IAM roles:

{
  "Version": "2012-10-17",
  "Statement": [
  {
   "Action": [
     "sts:AssumeRole"
    ],
   "Resource": [
     "arn:aws:iam::xxxxxxxxxxxx:role/*-tenant-redshift-s3-access-role"
    ],
   "Effect": "Allow"
   }
  ]
}

Navigate to the Amazon Redshift console and select the cluster you want to update. On the Actions menu, choose Manage IAM roles. Choose the role redshift-s3-onboarding-role to associate with the cluster. For more information, see Associate the IAM role with your cluster.
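Alternatively, you can associate the role with the AWS SDK. The following boto3 sketch uses a placeholder cluster identifier and account ID:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Associate the onboarding role with the cluster; tenant roles are reached
# through role chaining and do not need to be attached.
redshift.modify_cluster_iam_roles(
    ClusterIdentifier="my-redshift-cluster",
    AddIamRoles=["arn:aws:iam::111122223333:role/redshift-s3-onboarding-role"],
)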

Provision the IAM role and resources for tenants

Deploy the template redshift-tenant-resources.cf.yaml using the AWS CloudFormation console or the AWS CLI. For this post, you deploy the stack twice, supplying two unique tenant names for TenantName. For example, you can use team1 and team2 as the TenantName parameter values.
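If you script the deployment instead of using the console, the following boto3 sketch creates one tenant stack per TenantName value; the stack names are illustrative:

import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

with open("redshift-tenant-resources.cf.yaml") as template_file:
    template_body = template_file.read()

# Deploy the tenant stack once per tenant, passing a unique TenantName each time.
for tenant_name in ["team1", "team2"]:
    cloudformation.create_stack(
        StackName=f"{tenant_name}-redshift-tenant-resources",
        TemplateBody=template_body,
        Parameters=[{"ParameterKey": "TenantName", "ParameterValue": tenant_name}],
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )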

For each tenant, the stack provisions the following resources:

  • A KMS key
  • An S3 bucket named team1-data-<account id>-<region> with default encryption enabled with SSE-KMS using the created key
  • An IAM role named team1-tenant-redshift-s3-access-role

The policy attached to the role team1-tenant-redshift-s3-access-role can only access the team’s own S3 bucket. The role redshift-s3-onboarding-role is trusted to assume all tenant roles *-tenant-redshift-s3-access-role to enable role chaining. The tenant role *-tenant-redshift-s3-access-role has a trust relationship to redshift-s3-onboarding-role. See the following policy code:

        {
            "Action": [
                "s3:List*",
                "s3:Get*",
                "s3:Put*"
            ],
            "Resource": [
                "arn:aws:s3:::team1-data-<account id>-<region>/*",
                "arn:aws:s3:::team1-data-<account id>-<region>"
            ],
            "Effect": "Allow"
        }

Create a tenant schema and tenant user with appropriate privileges

For this post, you create the following Amazon Redshift database objects using the query editor on the Amazon Redshift console or a SQL client tool like SQL Workbench/J. Replace <password> with your password and <account id> with your AWS account ID before running the following SQL statements:

create schema team1;
create schema team2;

create group team1_grp;
create group team2_grp;

create user team1_usr with password '<password>' in group team1_grp;
create user team2_usr with password '<password>' in group team2_grp;

grant usage on schema team1 to group team1_grp;
grant usage on schema team2 to group team2_grp;

GRANT ALL ON SCHEMA team1 TO group team1_grp;
GRANT ALL ON SCHEMA team2 TO group team2_grp;

revoke assumerole on all from public for all;

grant assumerole
on 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team1-tenant-redshift-s3-access-role'
to group team1_grp for copy;

grant assumerole
on 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team1-tenant-redshift-s3-access-role'
to group team1_grp for unload;

grant assumerole
on 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team2-tenant-redshift-s3-access-role'
to group team2_grp for copy;

grant assumerole
on 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team2-tenant-redshift-s3-access-role'
to group team2_grp for unload;

commit;

Verify that each tenant can only access its own resources

To verify your access control settings, you can create a test table in each tenant schema and upload a file to the tenant’s S3 bucket using the following commands. You can use the Amazon Redshift query editor or a SQL client tool.

  1. Sign in as team1_usr and enter the following commands:
    CREATE TABLE TEAM1.TEAM1_VENUE(
    VENUEID SMALLINT,
    VENUENAME VARCHAR(100),
    VENUECITY VARCHAR(30),
    VENUESTATE CHAR(2),
    VENUESEATS INTEGER
    ) DISTSTYLE EVEN;
    
    commit;

  2. Sign in as team2_usr and enter the following commands:
    CREATE TABLE TEAM2.TEAM2_VENUE(
    VENUEID SMALLINT,
    VENUENAME VARCHAR(100),
    VENUECITY VARCHAR(30),
    VENUESTATE CHAR(2),
    VENUESEATS INTEGER
    ) DISTSTYLE EVEN;
    
    commit;

  3. Create a file named test-venue.txt with the following contents:
    7|BMO Field|Toronto|ON|0
    16|TD Garden|Boston|MA|0
    23|The Palace of Auburn Hills|Auburn Hills|MI|0
    28|American Airlines Arena|Miami|FL|0
    37|Staples Center|Los Angeles|CA|0
    42|FedExForum|Memphis|TN|0
    52|PNC Arena|Raleigh|NC|0
    59|Scotiabank Saddledome|Calgary|AB|0
    66|SAP Center|San Jose|CA|0
    73|Heinz Field|Pittsburgh|PA|65050

  4. Upload this file to both team1 and team2 S3 buckets.
  5. Sign in as team1_usr and enter the following commands to test Amazon Redshift COPY and UNLOAD:
    copy team1.team1_venue
    from 's3://team1-data-<account id>-<region>/'
    iam_role 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team1-tenant-redshift-s3-access-role'
    delimiter '|' ;
    
    unload ('select * from team1.team1_venue')
    to 's3://team1-data-<account id>-<region>/unload/' 
    iam_role 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team1-tenant-redshift-s3-access-role';

The file test-venue.txt uploaded to the team1 bucket is copied to the table team1_venue in the team1 schema, and the data in table team1_venue is unloaded to the team1 bucket successfully.

  6. Replace team1 with team2 in the preceding commands and then run them again, this time signed in as team2_usr.

If you’re signed in as team1_usr and try to access the team2 S3 bucket or team2 schema or table and team2 IAM role when running COPY and UNLOAD, you get an access denied error. You get the same error if trying to access team1 resources while logged in as team2_usr.

Clean up

To clean up the resources you created, delete the CloudFormation stacks created in your AWS account.

Conclusion

In this post, we presented a solution to achieve role-based secure data movement between Amazon S3 and Amazon Redshift. This approach combines role chaining with the ASSUMEROLE feature in Amazon Redshift to allow fine-grained access control over the COPY and UNLOAD commands, down to the individual user level within a particular team. This in turn provides finer control over resource isolation and data security in a multi-tenant solution. Many use cases can benefit from this solution as more enterprises build data platforms to provide the foundations for highly scalable, customizable, and secure data consumption models.


About the Authors

Sudipta Mitra is a Senior Data Architect at AWS who is passionate about helping customers build modern data analytics applications by making innovative use of the latest AWS services and their constantly evolving features. A pragmatic architect who works backward from customer needs, he helps customers get comfortable with the proposed solution and achieve tangible business outcomes. His main areas of work are data mesh, data lakes, knowledge graphs, data security, and data governance.

Michelle Deng is a Sr. Data Architect at Amazon Web Services. She works with AWS customers to provide guidance and technical assistance on database migrations and big data projects.

Jared Cook is a Sr. Cloud Infrastructure Architect at Amazon Web Services. He is committed to driving business outcomes in the cloud, and uses Infrastructure as Code and DevOps best practices to build resilient architectures on AWS. In his leisure time, Jared enjoys the outdoors and music, and plays the drums.

Lisa Matacotta is a Senior Customer Solutions Manager at Amazon Web Services. She works with AWS customers to help them achieve their business and strategic goals, understand their biggest challenges, and provide guidance based on AWS best practices to overcome them.