All posts by Rajesh Francis

Centralize governance for your data lake using AWS Lake Formation while enabling a modern data architecture with Amazon Redshift Spectrum

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/centralize-governance-for-your-data-lake-using-aws-lake-formation-while-enabling-a-modern-data-architecture-with-amazon-redshift-spectrum/

Many customers are modernizing their data architecture using Amazon Redshift to enable access to all their data from a central data location. They are looking for a simpler, scalable, and centralized way to define and enforce access policies on their data lakes on Amazon Simple Storage Service (Amazon S3). They want access policies to allow their data lake consumers to use the analytics service of their choice, to best suit the operations they want to perform on the data. Although the existing method of using Amazon S3 bucket policies to manage access control is an option, managing bucket-level policies may not scale as the number of combinations of access levels and users increases.

AWS Lake Formation allows you to simplify and centralize access management. It allows organizations to manage access control for Amazon S3-based data lakes using familiar concepts of databases, tables, and columns (with more advanced options like row- and cell-level security). Lake Formation uses the AWS Glue Data Catalog to provide access control for an Amazon S3 data lake to the most commonly used AWS analytics services, like Amazon Redshift (via Amazon Redshift Spectrum), Amazon Athena, AWS Glue ETL, and Amazon EMR (for Spark-based notebooks). These services honor the Lake Formation permissions model out of the box, which makes it easy for customers to simplify, standardize, and scale data security management for data lakes.

With Amazon Redshift, you can build a modern data architecture that seamlessly extends your data warehouse to your data lake and reads all data – data in your data warehouse and data in your data lake – without creating multiple copies of data. The Amazon Redshift Spectrum feature enables direct querying of your S3 data lake, and many customers are using it to modernize their data platform. You can use Amazon Redshift managed storage for frequently accessed data, move less frequently accessed data to your Amazon S3 data lake, and securely access it using Redshift Spectrum.

In this post, we discuss how you can use AWS Lake Formation to centralize data governance and data access management while using Amazon Redshift Spectrum to query your data lake. Lake Formation allows you to grant and revoke permissions on databases, tables, and column catalog objects created on top of Amazon S3 data lake. This is easier for customers, as it is similar to managing permissions on relational databases.

In the first post of this two-part series, we focus on resources within the same AWS account. In the second post, we extend the solution across AWS accounts using the Lake Formation data sharing feature.

Solution overview

The following diagram illustrates our solution architecture.

The solution workflow consists of the following steps:

  1. Data stored in an Amazon S3 data lake is crawled using an AWS Glue crawler.
  2. The crawler infers the metadata of data on Amazon S3 and stores it in the form of a database and tables in the AWS Glue Data Catalog.
  3. You register the Amazon S3 bucket as the data lake location with Lake Formation. It’s natively integrated with the Data Catalog.
  4. You use Lake Formation to grant permissions at the database, table, and column level to defined AWS Identity and Access Management (IAM) roles.
  5. You create external schemas within Amazon Redshift to manage access for marketing and finance teams.
  6. You provide access to the marketing and finance groups to their respective external schemas and associate the appropriate IAM roles to be assumed. The admin role and admin group are limited to administration work.
  7. Marketing and finance users can now assume their respective IAM roles and query their external schemas inside Amazon Redshift using the SQL query editor.

Lake Formation default security settings

To maintain backward compatibility with AWS Glue, Lake Formation has the following initial security settings:

  • The super permission is granted to the group IAMAllowedPrincipals on all existing Data Catalog resources.
  • Settings to use only IAM access control are enabled for new Data Catalog resources.

To change security settings, see Changing the Default Security Settings for Your Data Lake.

Note: Leave the default settings as is until you’re ready to move completely to the Lake Formation permission model. You can update settings at a database level if you want permissions set by Lake Formation to take effect. For more details about upgrades, refer to Upgrading AWS Glue Data Permissions to the AWS Lake Formation Model.

We don’t recommend reverting back from the Lake Formation permission model to an IAM-only permission model. You may also want to first deploy the solution in a new test account.

Prerequisites

To set up this solution, you need basic familiarity with the AWS Management Console, an AWS account, and access to the AWS services used in this solution, such as Amazon S3, AWS Lake Formation, AWS Glue, Amazon Redshift, AWS CloudFormation, and AWS Identity and Access Management (IAM).

Create the data lake administrator

Data lake administrators are initially the only IAM users or roles that can grant Lake Formation permissions on data locations and Data Catalog resources to any principal.

To set up an IAM user as a data lake administrator, add the provided inline policy to the IAM user or IAM role you use to provision the resources for this blog solution. For more details, refer to Create a Data Lake Administrator.

  1. On the IAM console, choose Users, and choose the IAM user who you want to designate as the data lake administrator.
  2. Choose Add an inline policy on the Permissions tab and add the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataLakeSettings",
                    "lakeformation:PutDataLakeSettings"
                ],
                "Resource": "*"
            }
        ]
    }

  3. Provide a policy name.
  4. Review and save your settings.

Note: If you’re using an existing administrator user/role, you may have this already provisioned.

  5. Sign in to the AWS Management Console as the designated data lake administrator IAM user or role for this solution.

Note: The CloudFormation template doesn’t work if you skip the following step.

  6. If this is your first time on the Lake Formation console, select Add myself and choose Get started.

You can also add yourself as a data lake administrator by going to Administrative roles and tasks under Permissions, choosing Choose administrators, and adding yourself as an administrator if you missed this on the initial welcome screen.

Provision resources with CloudFormation

In this step, we create the solution resources using a CloudFormation template. The template performs the following actions:

  • Creates an S3 bucket to copy sample data files and SQL scripts
  • Registers the S3 data lake location with Lake Formation
  • Creates IAM roles and policies as needed for the environment
  • Assigns principals (IAM roles) to handle data lake settings
  • Creates Lambda and Step Functions resources to load necessary data
  • Runs AWS Glue crawler jobs to create Data Catalog tables
  • Configures Lake Formation permissions
  • Creates an Amazon Redshift cluster
  • Runs a SQL script to create the database group, database user, and external schemas for the admin, marketing, and finance groups

To create your resources, complete the following steps:

  1. Launch the provided template in AWS Region us-east-1.
  2. Choose Next.
  3. For Stack name, you can keep the default stack name or change it.
  4. For DbPassword, provide a secure password instead of using the default provided.
  5. For InboundTraffic, change the IP address range to your local machine’s IP address in CIDR format instead of using the default.
  6. Choose Next.
  7. Choose Next again until you get to the review page.
  8. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  9. Choose Create stack.

The stack takes approximately 10 minutes to deploy successfully. When it’s complete, you can view the outputs on the AWS CloudFormation console.

Update Lake Formation default settings

You also need to update the default settings at the Lake Formation database level. This makes sure that the Lake Formation permissions the CloudFormation template sets up during provisioning can take effect over the default settings.

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Choose the database you created with the CloudFormation template.
  3. Choose Edit.
  4. Deselect Use only IAM access control for new tables in the database.
  5. Choose Save.

This action is important because it removes the IAM access control model from this database and allows only Lake Formation to grant and revoke access to it. This step makes sure other steps in this solution are successful.

  1. Choose Databases in the navigation pane.
  2. Select the same database.
  3. On the Actions menu, choose View permissions.

You can review the permissions enabled for this database.

  4. Select the IAMAllowedPrincipals group and choose Revoke to remove default permission settings for this individual database.

The IAMAllowedPrincipals row no longer appears in the list on the Permissions page.

Similarly, we need to remove the IAMAllowedPrincipals group at the table level. The CloudFormation template created six tables for this database. Let’s see how to use data lake permissions to remove access at the table level.

  1. On the Lake Formation console, choose Data lake permissions in the navigation pane.
  2. Filter by Principal:IAMAllowedPrincipals and Database:<<database name>>.

You can review all the tables we need to update permissions for.

  3. Select each table one by one and choose Revoke.

With these steps, we’ve made sure that the default settings at the Lake Formation account level are still in place, and only manually updated for the database and tables we’re going to work with in this post. When you’re ready to move completely to a Lake Formation permission model, you can update the settings at the account level instead of individually updating them. For more details, see Change the default permission model.

Validate the provisioned resources

The CloudFormation template provisions many resources automatically to create your environment. In this section, we check some of the key resources to understand them better.

Lake Formation resources

On the Lake Formation console, check that a new data lake location is registered with an IAM role on the Data lake locations page.

This is the IAM role any integrated service like Amazon Redshift assumes to access data on the registered Amazon S3 location. This integration happens out of the box when the right roles and policies are applied. For more details, see Requirements for Roles Used to Register Locations.

Check the Administrative roles and tasks page to confirm that the logged-in user is added as the data lake administrator and IAMAllowedPrincipals is added as the database creator.

Then check the tables that the AWS Glue crawlers created in the Data Catalog database. These tables are logical entities, because the data is in an Amazon S3 location. After you create these objects, you can access them via different services.

Lastly, check permissions set by the template using the Lake Formation permission model on the tables to be accessed by finance and marketing users from Amazon Redshift.

The following screenshot shows that the finance role has access to all columns for the store and item tables, but only the listed columns for the store_sales table.

Similarly, you can review access for the marketing role, which has access to all columns in the customer_activity and store_sales tables.

Amazon S3 resources

The CloudFormation template creates two S3 buckets:

  • data-lake – Contains the data used for this post
  • script – Contains the SQL scripts we use to create Amazon Redshift database objects

Open the script bucket to see the scripts. You can download and open them to view the SQL code used.

The setup_lakeformation_demo.sql script gives you the SQL code to create the external database schema and assign different roles for data governance purposes. The external schema is for AWS Glue Data Catalog-based objects that point to data in the data lake. We then grant access to different database groups and users to manage security for finance and marketing users.
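The following is a minimal, illustrative sketch of the kind of statements such a script contains. The Data Catalog database name, AWS account ID, and the marketing_grp group are hypothetical placeholders; the retail_datalake_marketing schema and the RedshiftMarketingRole role referenced here appear later in this post.

-- Illustrative sketch only (hypothetical names): create an external schema
-- over the Data Catalog database and grant usage to a database group
create external schema if not exists retail_datalake_marketing
from data catalog
database '<data catalog database name>'
iam_role 'arn:aws:iam::<AWS account ID>:role/RedshiftMarketingRole';

grant usage on schema retail_datalake_marketing to group marketing_grp;

Because the external schema points at the Data Catalog, Lake Formation (not the Amazon Redshift GRANT) decides which tables and columns the associated IAM role can actually read; the schema-level grant only controls which database group can use the schema.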

The scripts run in the following order:

  1. sp_create_db_group.sql
  2. sp_create_db_user.sql
  3. setup_lakeformation_demo.sql

Amazon Redshift resources

On the Amazon Redshift console, choose Clusters in the navigation pane and choose the cluster you created with the CloudFormation template. Then choose the Properties tab.

The Cluster permissions section lists three attached roles. The template used the admin role to provision Amazon Redshift database-level objects. The finance role is attached to the finance schema in Amazon Redshift, and the marketing role is attached to the marketing schema.

Each of these roles is granted permissions so that its users can query the Data Catalog tables with Redshift Spectrum from the Amazon Redshift query editor. For more details, see Using Redshift Spectrum with AWS Lake Formation and Query the Data in the Data Lake Using Amazon Redshift Spectrum.

Query the data

We use Amazon Redshift query editor v2 to query the external schema and Data Catalog tables (external tables). The external schema is already created as part of the CloudFormation template. When the external schema is created using the Data Catalog, the tables in the database are automatically created and are available through Amazon Redshift as external tables.

  1. On the Amazon Redshift console, choose Query editor v2.
  2. Choose Configure account.
  3. Choose the database cluster.
  4. For Database, enter dev.
  5. For User name, enter awsuser.
  6. For Authentication, select Temporary credentials.
  7. Choose Create connection.

When you’re connected and logged in as administrator user, you can see both local and external schemas and tables, as shown in the following screenshot.

Validate role-based Lake Formation permissions in Amazon Redshift

Next, we validate how the Lake Formation security settings work for the marketing and finance users.

  1. In the query editor, choose (right-click) the database connection.
  2. Choose Edit connection.
  3. For User name, enter marketing_ro.
  4. Choose Edit connection.
  5. After connecting as marketing_ro, choose the dev database under the cluster and navigate to the customer_activity table.
  6. Choose the refresh icon.
  7. Repeat these steps to edit the connection and update the user to finance_ro.
  8. Try again to refresh the dev database.

As expected, this user only has access to the allowed schema and tables.

With this solution, you can segregate different users at the schema level and use Lake Formation to make sure they can only see the tables and columns their role allows.
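For example, after connecting as marketing_ro, a simple query against a table granted to the marketing role should return data. This sketch reuses the retail_datalake_marketing external schema and customer_activity table referenced elsewhere in this post:

-- Connected as marketing_ro: a table granted to the marketing role is queryable
select count(*) from retail_datalake_marketing.customer_activity;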

Column-level security with Lake Formation permissions

Lake Formation also allows you to set which columns a principal can or can’t see within a table. For example, when you select store_sales as the marketing_ro user, you see many columns, like customer_purchase_estimate. However, as the finance_ro user, you don’t see these columns.
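As a quick, illustrative check (using the schema and column names mentioned above), the marketing_ro user can select this column directly, whereas the finance_ro user doesn’t even see it in the table definition:

-- As marketing_ro: the column is visible and selectable
select customer_purchase_estimate
from retail_datalake_marketing.store_sales
limit 10;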

Manual access control via the Lake Formation console

In this post, we’ve been working with a CloudFormation template-based environment, which is an automated way to create the environment and simplify operations.

In this section, we show how you can set up all the configurations through the console, and we use another table as an example to walk you through the steps.

As demonstrated in previous steps, the marketing user in this environment has all column access to the tables customer_activity and store_sales in the external schema retail_datalake_marketing. We change some of that manually to see how it works using the console.

  1. On the Lake Formation console, choose Data lake permissions.
  2. Filter by the principal RedshiftMarketingRole.
  3. Select the principal for the store_sales table and choose Revoke.
  4. Confirm by choosing Revoke again.

A success message appears, and the permission row is no longer listed.

  1. Choose Grant to configure a new permission level for the marketing user on the store_sales table at the column level.
  2. Select IAM users and roles and choose your role.
  3. In the LF-Tags or catalog resources section, select Named data catalog resources.
  4. For Databases, choose your database.
  5. For Tables, choose the store_sales table.
  6. For Table permissions, select Select.
  7. In the Data permissions section, select Simple column-based access.
  8. Select Exclude columns.
  9. Choose the columns as shown in the following screenshot.
  10. Choose Grant.

We now query the table from Amazon Redshift again to confirm that the effective changes match the controls placed by Lake Formation. In the following query, we select a column that isn’t authorized:

/* Selecting columns not authorized will result in error. */
select s_country, ss_net_profit from retail_datalake_marketing.store_sales;

As expected, we get an error.

Clean up

Clean up the resources created by the CloudFormation template to avoid unnecessary cost to your AWS account. You can delete the CloudFormation stack by selecting the stack on the AWS CloudFormation console and choosing Delete. This action deletes all the resources it provisioned. If you manually updated a template-provisioned resource, you may see some issues during cleanup and need to remove those resources manually.

Summary

In this post, we showed how you can integrate Lake Formation with Amazon Redshift to seamlessly control access to your Amazon S3 data lake. We also demonstrated how to query your data lake using Redshift Spectrum and external tables. This is a powerful mechanism that helps you build a modern data architecture to easily query data in your data lake and data warehouse together. We also saw how you can use CloudFormation templates to automate the resource creation with infrastructure as code. You can use this to simplify your operations, especially when you want to replicate the resource setup from development to production environments during your project cycles.

Finally, we covered how data lake administrators can manually control access to Data Catalog objects and grant or revoke access at the database, table, and column level. We encourage you to try the steps we outlined in this post and use the CloudFormation template to set up security in Lake Formation to control data lake access from Redshift Spectrum.

In the second post of this series, we focus on how you can take this concept and apply it across accounts using the Lake Formation data sharing feature in a hub-and-spoke topology.


About the Authors

Vaibhav Agrawal is an Analytics Specialist Solutions Architect at AWS. Throughout his career, he has focused on helping customers design and build well-architected analytics and decision support platforms.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with over 13 years of data warehousing experience. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Rajesh Francis is a Senior Analytics Customer Experience Specialist at AWS. He specializes in Amazon Redshift and focuses on helping to drive AWS market and technical strategy for data warehousing and analytics services. Rajesh works closely with large strategic customers to help them adopt our new services and features, develop long-term partnerships, and feed customer requirements back to our product development teams to guide our product roadmap.

Security considerations for Amazon Redshift cross-account data sharing

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/security-considerations-for-amazon-redshift-cross-account-data-sharing/

Data driven organizations recognize the intrinsic value of data and realize that monetizing data is not just about selling data to subscribers. They understand the indirect economic impact of data and the value that good data brings to the organization. They must democratize data and make it available for business decision makers to realize its benefits. Today, this would mean replicating data across multiple disparate databases, which requires moving the data across various platforms.

Amazon Redshift data sharing lets you securely and easily share live data across Amazon Redshift clusters or AWS accounts for read purposes. Data sharing can improve the agility of your organization by giving you instant, granular, and high-performance access to data across Amazon Redshift clusters without manually copying or moving it. Data sharing provides you with live access to data so that your users can see the most up-to-date and consistent information as it’s updated in Amazon Redshift clusters.

Cross-account data sharing lets you share data across multiple accounts. The accounts can be within the same organization or across different organizations. We have built in additional authorization steps for security control, because sharing data across accounts could also mean sharing data across different organizations. Please review the AWS documentation on cross-account data sharing and a blog from our colleague for detailed steps. We also have a YouTube video on setting up cross-account data sharing for a business use case, which you can refer to as well.

Cross-account data sharing scenario

For this post, we use the following use case to demonstrate how you can set up cross-account data sharing with the option to control data sharing to specific consumer accounts from the producer account. The producer organization has one AWS account and one Redshift cluster. The consumer organization has two AWS accounts and three Redshift clusters in each of the accounts. The producer organization wants to share data from the producer cluster to one of the consumer accounts, “ConsumerAWSAccount1”; sharing to the second consumer account, “ConsumerAWSAccount2”, should be disallowed. The consumer organization, in turn, wants to restrict access to the data share to a specific Redshift cluster, “ConsumerCluster1”.

Walkthrough

You can setup this behavior using the following steps:

Setup on the producer account:

  • Create a data share in the Producer cluster and add schema and tables.
  • Setup IAM policy to control which consumer accounts can be authorized for data share.
  • Grant data share usage to a consumer AWS account.

Setup on the consumer account:

  • Setup IAM policy to control which of the consumer Redshift clusters can be associated with the producer data share.
  • Associate consumer cluster to the data share created on the producer cluster.
  • Create database referencing the associated data share.

Prerequisites

To set up cross-account data sharing, you should have the following prerequisites:

  • Three AWS accounts: one for the producer, <ProducerAWSAccount1>, and two consumer accounts, <ConsumerAWSAccount1> and <ConsumerAWSAccount2>.
  • AWS permissions to provision Amazon Redshift and create an IAM role and policy.

We assume you have provisioned the required Redshift clusters: one for the producer in the producer AWS account, two Redshift clusters in <ConsumerAWSAccount1>, and optionally one Redshift cluster in <ConsumerAWSAccount2>.

  • Two users in the producer account and two users in consumer account 1:
    • ProducerClusterAdmin
    • ProducerCloudAdmin
    • Consumer1ClusterAdmin
    • Consumer1CloudAdmin

Security controls from producer and consumer

Approved list of consumer accounts from the producer account

When you share data across accounts, the producer admin can grant usage of the data share to a specific account. For additional security to allow the separation of duties between the database admin and the cloud security administrator, organizations might want to have an approved list of AWS accounts that can be granted access. You can achieve this by creating an IAM policy listing all of the approved accounts, and then adding this policy to the role attached to the producer cluster.

Creating the IAM Policy for the approved list of consumer accounts

  1. On the AWS IAM Console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following policy:
    This is the producer side policy. Note: you should replace the following text with the specific details for your cluster and account.
    • “Resource”: “*” – Replace “*” with the ARN of the specific data share.
    • <AWSAccountID> – Add one or more consumer account numbers based on the requirement.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow",
            "Effect": "Allow",
            "Action": [
                "redshift:AuthorizeDataShare",
                "redshift:DeauthorizeDataShare"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "redshift:ConsumerIdentifier": [
                        "<AWSAccountID>"
                    ]
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeDataSharesForConsumer",
                "redshift:DescribeDataSharesForProducer",
                "redshift:DescribeClusters",
                "redshift:DescribeDataShares"
            ],
            "Resource": "*"
        }
    ]
}
  1. From the Amazon Redshift console in the producer AWS Account, choose Query Editor V2 and connect to the producer cluster using temporary credentials.
  2. After connecting to the producer cluster, create the data share and add the schema and tables to the data share. Then, grant usage to the consumer accounts <ConsumerAWSAccount1> and <ConsumerAWSAccount2>:
CREATE DATASHARE ds;

ALTER DATASHARE ds ADD SCHEMA PUBLIC;
ALTER DATASHARE ds ADD TABLE table1;
ALTER DATASHARE ds ADD ALL TABLES IN SCHEMA sf_schema;

GRANT USAGE ON DATASHARE ds TO ACCOUNT '<ConsumerAWSAccount1>';
GRANT USAGE ON DATASHARE ds TO ACCOUNT '<ConsumerAWSAccount2>';

Note: the GRANT will be successful even though the account is not listed in the IAM policy. But the Authorize step will validate against the list of approved accounts in the IAM policy, and it will fail if the account is not in the approved list.

  3. Now the producer admin can authorize the data share by using the AWS CLI or the console. When you authorize the data share to <ConsumerAWSAccount1>, the authorization is successful.
aws redshift authorize-data-share --data-share-arn <DATASHARE ARN> --consumer-identifier <ConsumerAWSAccount1>

  4. When you authorize the data share to <ConsumerAWSAccount2>, the authorization fails, because the IAM policy we set up in the earlier step does not allow sharing to <ConsumerAWSAccount2>.
aws redshift authorize-data-share --data-share-arn <DATASHARE ARN> --consumer-identifier <ConsumerAWSAccount2>

We have demonstrated how you can restrict access to the data share created on the producer cluster to specific consumer accounts by using a conditional construct with an approved account list in the IAM policy.

Approved list of Redshift clusters on consumer account

When you grant access to a data share to a consumer account, the consumer admin can determine which Redshift clusters can read the data share by associating it with the appropriate cluster. If the organization wants to control which of the Redshift clusters the admin can associate with the data share, then you can specify the approved list of Redshift clusters by using the cluster ARN in an IAM policy.

  1. On the AWS IAM Console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following policy:
    This is the consumer side policy. Note: you should replace the following text with the specific details for your cluster and account.
    • “Resource”: “*” – Replace “*” with the ARN of the specific data share.
    • Replace “<ProducerDataShareARN>” with the ARN of the data share created in the producer Redshift cluster in the producer AWS account.
    • Replace “<ConsumerRedshiftCluster1ARN>” with the ARN of the first Redshift cluster in AWS Consumer account 1.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "redshift:AssociateDataShareConsumer",
                "redshift:DisassociateDataShareConsumer"
            ],
            "Resource": "<ProducerDataShareARN>",
            "Condition": {
                "StringEquals": {
                    "redshift:ConsumerArn": [ "<ConsumerRedshiftCluster1ARN>" ]
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeDataSharesForConsumer",
                "redshift:DescribeDataSharesForProducer",
                "redshift:DescribeClusters",
                "redshift:DescribeDataShares"
            ],
            "Resource": "*"
        }
    ]
}
  1. Now the consumer admin can associate the data share by using the AWS CLI or the console. When you associate Redshift cluster 1, <ConsumerRedshiftCluster1ARN>, the association is successful.
aws redshift associate-data-share-consumer --no-associate-entire-account --data-share-arn <ProducerDataShareARN> --consumer-arn <ConsumerRedshiftCluster1ARN>

  2. When you associate Redshift cluster 2, <ConsumerRedshiftCluster2ARN>, the association fails, because the IAM policy we set up allows association only with <ConsumerRedshiftCluster1ARN>.
aws redshift associate-data-share-consumer --no-associate-entire-account --data-share-arn <ProducerDataShareARN> --consumer-arn <ConsumerRedshiftCluster2ARN>

  3. After associating the consumer Redshift cluster 1 to the producer data share, from the Amazon Redshift console in the consumer AWS account, choose Query Editor V2 and connect to the consumer cluster using temporary credentials.
  4. After connecting to the consumer cluster, you can create a database referencing the data share on the producer cluster, and then start querying the data:
CREATE DATABASE ds_db FROM DATASHARE ds OF ACCOUNT <PRODUCER ACCOUNT> NAMESPACE <PRODUCER CLUSTER NAMESPACE>;
 
-- Optional:
CREATE EXTERNAL SCHEMA Schema_from_datashare FROM REDSHIFT DATABASE 'ds_db' SCHEMA 'public';

GRANT USAGE ON DATABASE ds_db TO user/group;

GRANT USAGE ON SCHEMA Schema_from_datashare TO GROUP Analyst_group;

SELECT  * FROM ds_db.public.producer_t1;

You can use the query editor or the new Amazon Redshift Query Editor V2 to run the statements above to read the shared data from the producer by creating an external database reference from the consumer cluster.

Conclusion

We have demonstrated how you can restrict access to the data share created on the producer cluster to specific consumer accounts by listing approved accounts in the IAM policy.

On the consumer side, we have also demonstrated how you can restrict access to a particular Redshift cluster on the consumer account for the data share created on the producer cluster by listing approved Redshift cluster(s) in the IAM policy. Enterprises and businesses can use this approach to control the boundaries of Redshift data sharing at account and cluster granularity.

We encourage you to try cross-account data sharing with these additional security controls to securely share data across Amazon Redshift clusters both within your organizations and with your customers or partners.


About the Authors

Rajesh Francis is a Senior Analytics Customer Experience Specialist at AWS. He specializes in Amazon Redshift and focuses on helping to drive AWS market and technical strategy for data warehousing and analytics. Rajesh works closely with large strategic customers to help them adopt our new services and features, develop long-term partnerships, and feed customer requirements back to our product development teams to guide the direction of our product offerings.

Kiran Sharma is a Senior Big Data Consultant for AWS Professional Services. She works with our customers to architect and implement big data solutions on a variety of projects on AWS.

Eric Hotinger is a Software Engineer at AWS. He enjoys solving seemingly impossible problems in the areas of analytics, streaming, containers, and serverless.

Implementing multi-tenant patterns in Amazon Redshift using data sharing

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/implementing-multi-tenant-patterns-in-amazon-redshift-using-data-sharing/

Software service providers offer subscription-based analytics capabilities in the cloud with Analytics as a Service (AaaS), and increasingly customers are turning to AaaS for business insights. A multi-tenant storage strategy allows the service providers to build a cost-effective architecture to meet increasing demand.

Multi-tenancy means a single instance of software and its supporting infrastructure is shared to serve multiple customers. For example, a software service provider could generate data that is housed in a single data warehouse cluster, but accessed securely by multiple customers. This storage strategy offers an opportunity to centralize management of data, simplify ETL processes, and optimize costs. However, service providers have to constantly balance between cost and providing a better user experience for their customers.

With the new data sharing feature, you can use Amazon Redshift to scale and meet both objectives of managing costs by simplifying storage and ETL pipelines while still providing consistent performance to customers.  You can ingest data into a cluster designated as a producer cluster, and share this live data with one or more consumer clusters. Clusters accessing this shared data are isolated from each other, therefore performance of a producer cluster isn’t impacted by workloads on consumer clusters. This enables consuming clusters to get consistent performance based on individual compute capacity.

In this post, we focus on various AaaS patterns, and discuss how you can use data sharing in a multi-tenant architecture to scale for virtually unlimited users. We discuss detailed steps to use data sharing with different storage strategies.

Multi-tenant storage patterns

Multi-tenant storage patterns help simplify the architecture and long-term maintenance of the analytics platform. In a multi-tenant strategy, data is stored centrally in a single cluster for all tenants, enabling simplification of the ETL ingestion pipeline and data management. In the previously published whitepaper SaaS Storage Strategies, various models of storage and benefits are covered for a single cluster scenario.

The three strategies you can choose from are:

  • Pool model – Data is stored in a single database schema for all tenants, and a new column (tenant_id) is used to scope and control access to individual tenant data. Access to the multi-tenant data is controlled using views built on the tables.
  • Bridge model – Storage and access to data for each tenant is controlled at individual schema level in the same database.
  • Silo model – Storage and access control to data for each tenant is maintained in separate databases

The following diagram illustrates the architecture of these multi-tenant storage strategies.


In the following sections, we will discuss how these multi-tenant strategies can be implemented using Amazon Redshift data sharing feature with a multi-cluster architecture.

Scaling your multi-tenant architecture using data sharing

AaaS providers implementing multi-tenant architectures were previously limited to resources of a single cluster to meet the compute and concurrency requirements of users across all the tenants. As the number of tenants increased, you could either turn on concurrency scaling or create additional clusters. However, the addition of new clusters means additional ingestion pipelines and increased operational overhead.

With data sharing in Amazon Redshift, you can easily and securely share data across clusters. Data ingested into the producer cluster is shared with one or more consumer clusters, which allows total separation of ETL and BI workloads. Several consumer clusters can read data from the managed storage of a producer cluster. This enables instant, granular, and high-performance access without data copies and movement. Workloads accessing shared data are isolated from each other and the producer. You can distribute workloads across multiple clusters while simplifying and consolidating the ETL ingestion pipeline into one main producer cluster, providing optimal price for performance.

Consumer clusters can in turn be producers for the data sets they own. Customers can optimize costs even further by collocating multiple tenants on the same consumer cluster. For instance, you can group low-volume tier 3 tenants into a single consumer cluster to provide a lower-cost offering, while high-volume tier 1 tenants get their own isolated compute clusters. Consumer clusters can be created in the same account as the producer or in a different AWS account. With this, you can have separate billing for the consumer clusters, where you can charge back to the business group that uses the consumer cluster, or even allow your customers to use their own Redshift cluster in their account, so they pay for usage of the consumer cluster. The following diagram shows the difference in ETL and consumer access patterns in a multi-tenant architecture using data sharing versus a single cluster approach without data sharing.


Multi-tenant architecture with data sharing compared to single cluster approach

Creating a multi-tenant architecture for an AaaS solution

For this post, we use a simple data model with a fact and a dimension table to demonstrate how to leverage data sharing to design a scalable multi-tenant AaaS solution. We cover detailed steps involved for each storage strategy using this data model. The tables are as follows:

  • Customer – dimension table containing customer details
  • Sales – fact table containing sales transactions

We use two Amazon Redshift ra3.4xlarge clusters, with two nodes each, and designate one cluster as the producer and the other as the consumer.

The high-level steps involved in enabling data sharing across clusters are as follows:

  1. Create a data share in the producer cluster and assign database objects to the data share.
  2. From the producer cluster, grant usage on the data share to consumer clusters, identified by namespace or AWS account.
  3. From the consumer cluster, create an external database using the data share from the producer.
  4. Query the tables in the data share through the external shared database in the consumer cluster. Grant access to other users to access this shared database and objects.

Creating producer and consumer Amazon Redshift clusters

Let us start by creating two Amazon Redshift ra3.4xlarge clusters with two nodes each, one for the producer and the other for the consumer.

  1. On the Amazon Redshift console, create two clusters of the RA3 instance type, and name them ds-producer and ds-consumer-c1, respectively.
  2. Next, log in to Amazon Redshift using the query editor. You can also use a SQL client tool like DBeaver, SQL Workbench, or Aginity Workbench. For configuration information, see Connecting to an Amazon Redshift cluster using SQL client tools.

Get the cluster namespace of the producer and consumer clusters from the console. We will use the namespaces to create the tenant table and to create and access the data shares. You can also get the cluster namespaces by logging into each of the clusters and executing the SELECT CURRENT_NAMESPACE statement in the query editor.
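For reference, this is the statement to run on each cluster; the GUID it returns is what you substitute for the namespace placeholders in the scripts that follow:

-- Run on the producer and on each consumer cluster to get its namespace GUID
select current_namespace;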

Be sure to replace the corresponding namespaces in the code sections wherever producercluster_namespace, consumercluster1_namespace, and consumercluster2_namespace are referenced.

The following screenshot shows the namespace on the Amazon Redshift console.

Now that we have the clusters created, we will go through the detailed steps for the three models: first the pool model, followed by the bridge model, and finally the silo model.

Pool model

The pool model represents an all-in, multi-tenant model where all tenants share the same storage constructs and provides the most benefit in simplifying the AaaS solution.

With this model, data storage is centralized in one cluster database, and data is stored for all tenants in the same set of data models. To scope and control access to tenant data, we introduce a column (tenant_id) that serves as a unique identifier for each tenant.

Security management to prevent cross-tenant access is one of the main aspects to address with the pool model. We can implement row-level security and provide secure access to the data by creating database views and setting application-level policies by creating groups with specific access and assigning users to the groups. The following diagram illustrates the pool model architecture.


To create a multi-tenant solution using the pool model, you create data shares for the pool model in the producer cluster, and share data with the consumer cluster. We provide more detail on these steps in the following sections.

Creating data shares for the pool model in the producer cluster

To create data shares for the pool model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and run the following script.

Note that we have a tenant table to store unique identifiers for each tenant or consumer (tenant).

We add a column (tenant_id) to the sales and customer tables to uniquely identify tenant data. This tenant_id references the tenant_id in the tenant table to uniquely identify the tenant and consumer records. See the following code:

/**********************************************/
/*       datasharing datasetup pool model     */
/**********************************************/

-- Create Schema
create schema sales;

CREATE TABLE IF NOT EXISTS sales.tenant ( 
t_tenantid int8 not null,
t_name varchar(50) not null,
t_namespace varchar(50),
t_account varchar(16)
)
DISTSTYLE AUTO
SORTKEY AUTO;

-- Create tables for multi-tenant sales schema
drop table if exists sales.customer;
CREATE TABLE IF NOT EXISTS sales.customer(
  c_tenantid int8 not null,
  c_custid int8 not null,
  c_name varchar(25) not null,
  c_region varchar(40) not null,
  Primary Key(c_tenantid, c_custid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;

CREATE TABLE IF NOT EXISTS sales.sales (
  s_tenantid int8 not null,
  s_orderid int8 not null,
  s_custid int8 not null,
  s_totalprice numeric(12,2) not null,
  s_orderdate date not null,
  Primary Key(s_tenantid, s_orderid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;
  2. Set up the tenant table with the details for each consumer cluster, and ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift, but for illustration purposes, we use INSERT statements:
    -- Ingest data 
    insert into sales.tenant values
    (0, 'primary', '<producercluster_namespace>',''),
    (1, 'tenant1', '<consumercluster1_namespace>',''),
    (2, 'tenant2', '<consumercluster2_namespace>','');
    
    insert into sales.customer values
    (1, 1, 'Customer 1', 'NorthEast'),
    (1, 2, 'Customer 2', 'SouthEast'),
    (2, 1, 'Customer 3', 'NorthWest'),
    (2, 2, 'Customer 4', 'SouthEast');
    
    truncate table sales.sales;
    insert into sales.sales values
    (1, 1, 1, 2434.33, '2020-11-21'),
    (1, 2, 2, 54.90, '2020-5-5'),
    (1, 3, 2, 9678.99, '2020-3-8'),
    (2, 1, 2, 452.24, '2020-1-23'),
    (2, 2, 1, 76523.10, '2020-11-3'),
    (2, 3, 1, 6745.20, '2020-10-01');
    
    select count(*) from sales.tenant;
    select count(*) from sales.customer;
    select count(*) from sales.sales;

Securing data on the producer cluster by restricting access

In the pool model, no external user has direct access to underlying tables. All access is restricted using views.

  1. Create a view for each of the fact and dimension tables to include a condition to filter records from the consumer tenant’s namespace. In our example, we create v_customersales to combine sales fact and customer dimension tables with a restrictive filter for tenant.namespace = current_namespace. See the following code:
    /**********************************************/
    /* We will create late binding views          */
    /* but materialized views could also be used  */
    /**********************************************/
    
    create or replace view sales.v_customer as
    select * 
    from sales.customer c, sales.tenant t
    where c.c_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_sales as
    select * 
    from sales.sales s, sales.tenant t
    where s.s_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_customersales as 
    select c_tenantid, c_name, c_region, 
    	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
    	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
    	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
    	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
    	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
    	t.t_namespace
    from sales.tenant t, sales.customer c, sales.sales s
    where t.t_tenantid = c.c_tenantid 
    and c.c_tenantid = s.s_tenantid 
    and c.c_custid = s.s_custid 
    and t.t_namespace = current_namespace 
    WITH NO SCHEMA BINDING;
    
    select * from sales.v_customersales;
          
    

Now that we have database objects created in the producer cluster, we can share the data with the consumer clusters.

Sharing data with the consumer cluster

To share data with the consumer cluster, complete the following steps:

  1. Create a data share for the sales data:
    /***************************************************/
    /* Create Datashare and add objects to the share    */
    /****************************************************/
    CREATE DATASHARE salesshare;
    

  2. Enter the following code to alter the data share, add the sales schema, and add the views in the sales schema to be shared with the consumer cluster:
    /************************************************************/
    /* Add objects at desired granularities: schemas, tables,   */
    /* views include materialized, and SQL UDFs                 */
    /************************************************************/
    ALTER DATASHARE salesshare ADD SCHEMA sales;  -- New addition to create SCHEMA first
    
    /*For pool model, we share only the views and not tables */
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customer;
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customersales;
    

For the pool model, we share only the views with the consumer cluster and not the tables. The ALTER statement ADD TABLE is used to add both views and tables.

  3. Grant usage on the sales data share to the namespace of the BI consumer cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster. See the following code:
    /********************************************************************/
    /* Grant access to consumer clusters                                */
    /* login to Consumer BI Cluster and get the Namespace from          */
    /* the Redshift console or using SELECT CURRENT_NAMESPACE           */
    /********************************************************************/
    SELECT CURRENT_NAMESPACE;
    
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE salesshare TO NAMESPACE '<consumercluster1_namespace>';
    --Account numbers are 12 digit long
    GRANT USAGE ON DATASHARE salesshare TO ACCOUNT 'Consumer_AWSAccount';
    

  4. View data shares that are shared from the producer cluster:
    SELECT * FROM SVV_DATASHARES;

The following screenshot shows the output.


You can also see the data shares and their detailed objects and consumers using the following commands:

SHOW DATASHARES;
DESC DATASHARE salesshare;
select * from SVV_DATASHARE_OBJECTS;
select * from SVV_DATASHARE_CONSUMERS;

Viewing and querying data shares for the pool model from the consumer cluster

To view and query data shares from the consumer cluster, complete the following steps:

  1. Log in to the consumer cluster as an admin user and view the data share objects:
    /**********************************************************/
    /* Login to Consumer cluster as awsuser:                  */
    /* View datashares and create local database for querying */
    /**********************************************************/
    select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the results.


  2. Create a new database from the data share of the producer cluster:
    /**********************************************************/
    /* Create a local database and schema reference           */
    /**********************************************************/
    CREATE DATABASE sales_db FROM DATASHARE salesshare
    OF NAMESPACE '<producercluster_namespace>';

  3. Optionally, you can create an external schema in the consumer cluster pointing to the schema in the database of the producer cluster.

Creating a local external schema in the consumer cluster allows schema-level access controls within the consumer cluster, and uses a two-part notation when referencing shared data objects (localschema.table vs. external_db.producerschema.table). See the following code:

/*********************************************************/
/* Create External Schema - Optional                     */
/* reason for schema: give specific access to schema     */
/* using shared alias get access to a secondary database */
/*********************************************************/
CREATE EXTERNAL SCHEMA sales_schema 
FROM REDSHIFT DATABASE 'sales_db' SCHEMA 'sales';
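With the optional external schema in place, you can also use the shorter two-part notation; a minimal sketch:

/* Query the shared objects through the local external schema (two-part notation) */
select * from sales_schema.v_customersales;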
  4. Now you can query the shared data from the producer cluster by using the three-part notation local_database.schema.table:
    select * from sales_db.sales.customer;
    select * from sales_db.sales.v_customersales;

  5. From the tenant1 consumer cluster, you can view the databases that are accessible to tenant1:
    select * from SVV_REDSHIFT_DATABASES;
    

The following screenshot shows the results.


Creating local consumer users and controlling access

You can control access to users in your consumer cluster by creating users and groups, and assigning access to the data share objects.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create tenant1_group, grant usage on the local database sales_db and schema sales_schema to the group, and assign the user tenant1_user to the tenant1_group:
    /********************************************************/
    /* Consumers can create own users and assign privileges */
    /* Create tenant1_group and assign privileges to read   */
    /* sales_db and the sales_schema                        */
    /* Create tenant1_user in tenant1_group                 */
    /********************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE sales_db TO GROUP tenant1_group;
    GRANT USAGE ON SCHEMA sales_schema TO GROUP tenant1_group;
    

  2. Now, log in as tenant1_user to consumer cluster 1 and select data from the views v_customer and v_customersales:
    /*******************************************************/
    /* select from view returns only sales records related */
    /* to Consumer A namespace                             */
    /*******************************************************/
    select * from sales_db.sales.v_customer;
    select * from sales_db.sales.v_customersales;
    

You should see only the data relevant to tenant 1 and not the data that is associated with tenant 2.


Creating materialized views to optimize performance

Consumer clusters can have their own database objects, which are local to the consumer. You can also create materialized views on the data share objects and control when to refresh the dataset for your consumers. This provides another level of isolation from the producer cluster, and ensures the consumer clusters query their local dataset.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create a materialized view for customersales. This will create a local view that can be periodically refreshed from the consumer cluster.

 

/*******************************************************/
/* Create materialized view in consumer cluster        */
/*******************************************************/
-- Create a local schema on the consumer cluster to hold the materialized view
create schema if not exists tenant1_sales;

create MATERIALIZED view tenant1_sales.mv_customersales as
select c_tenantid, c_name, c_region, 
	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
	t.t_namespace
from sales_db.sales.tenant t, sales_db.sales.customer c, sales_db.sales.sales s -- three-part notation for objects in the shared database
where t.t_tenantid = c.c_tenantid 
and c.c_tenantid = s.s_tenantid 
and c.c_custid = s.s_custid 
and t.t_namespace = current_namespace;

select top 100 * from tenant1_sales.mv_customersales;

REFRESH MATERIALIZED VIEW tenant1_sales.mv_customersales;

 

With the preceding steps, we have demonstrated how you can control access to the tenant data in the same datastore using views. We also reviewed how data shares help efficiently share data between producer and consumer clusters with transaction consistency. We also saw how a local materialized view can be created to further isolate your BI workloads for your customers and provide a consistent, performant user experience. In the next section we will discuss the Bridge model.

Bridge model

In the bridge model, data for each tenant is stored in its own schema in a database and contains a similar set of tables. Data shares are created for each schema and shared with the corresponding consumer. This is an appealing balance between the silo and pool models, providing both data isolation and ETL consolidation. With Amazon Redshift, you can create up to 9,900 schemas. For more information, see Quotas and limits in Amazon Redshift.

With data sharing, separate consumer clusters can be provisioned to use the same managed storage from the producer cluster. Consumer clusters have all the capabilities of a producer cluster, and can in turn be producer clusters for data objects they own. Consumers can’t share data that is already shared with them. Without data sharing, queries from all customers are directed to a single cluster. The following diagram illustrates the bridge model.


To create a multi-tenant architecture using bridge model, complete the steps in the following sections.

Creating database schemas and tables for the bridge model in the producer cluster

As we did in the pool model, the first step is to create the database schema and tables. We log in to the producer cluster as an admin user and create separate schemas for each tenant. For our post, we create two schemas, tenant1 and tenant2, to store data for two tenants.

  1. Log in to the producer cluster as the admin user.
  2. Use the following script to create two schemas, tenant1 and tenant2, and create tables for the customer dimension and sales facts under each of the two schemas:
    /****************************************/
    /* Bridge -  Data Model */
    /****************************************/
    -- Create schemas tenant1 and tenant2
    create schema tenant1;
    create schema tenant2;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
      
    
    -- Create tables for tenant2
    CREATE TABLE IF NOT EXISTS tenant2.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant2.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    

  3. Ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift, but for illustration purposes we use INSERT statements (a sample COPY command follows this code block):
    -- ingest data for tenant1
    -- ingest customer data
    insert into tenant1.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    -- ingest sales data
    insert into tenant1.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    
    select count(*) from tenant1.customer;
    select count(*) from tenant1.sales;
    
    
    -- ingest data for tenant2
    -- ingest customer data
    insert into tenant2.customer values
    (1, 'Customer 3', 'NorthWest'),
    (2, 'Customer 4', 'SouthEast');
    
    -- ingest sales data
    truncate table tenant2.sales;
    insert into tenant2.sales values
    (1, 2, 452.24, '2020-1-23'),
    (2, 1, 76523.10, '2020-11-3'),
    (3, 1, 6745.20, '2020-10-01');
    
    
    select count(*) from tenant2.customer;
    select count(*) from tenant2.sales;
    

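For completeness, the following is a minimal sketch of a COPY-based load for the same customer table. The S3 path and IAM role ARN are placeholders, not values from this walkthrough; replace them with your own bucket and role.

-- Sketch: bulk load tenant1 customer data from Amazon S3 (placeholders in angle brackets)
COPY tenant1.customer
FROM 's3://<your-bucket>/tenant1/customer/'
IAM_ROLE 'arn:aws:iam::<AWSAccount>:role/<RedshiftCopyRole>'
FORMAT AS CSV
IGNOREHEADER 1;
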
Creating data shares and granting usage to the consumer cluster

Next, we create two data shares, tenant1share and tenant2share, to share the database objects under the two schemas with the respective consumer clusters.

  1. Create the two data shares:
    /******************************************************************/
    /*   Create Datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1share;
    CREATE DATASHARE tenant2share;

  2. Alter the data shares and add the schemas for the respective tenants to be shared with the consumer clusters:
    ALTER DATASHARE tenant1share ADD SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD SCHEMA tenant2;

  3. Alter the data shares and add all tables in the schemas to be shared with the consumer clusters (an optional INCLUDENEW sketch follows this code):
    ALTER DATASHARE tenant1share ADD ALL TABLES IN SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD ALL TABLES IN SCHEMA tenant2;
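
If you expect to add more tables to a tenant schema later, you can optionally have new objects added to the data share automatically. The following is a sketch only; verify the ALTER DATASHARE options available on your cluster version before relying on it.

-- Optional: automatically include objects created later in each schema
ALTER DATASHARE tenant1share SET INCLUDENEW = TRUE FOR SCHEMA tenant1;
ALTER DATASHARE tenant2share SET INCLUDENEW = TRUE FOR SCHEMA tenant2;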

Getting the namespace of the first consumer cluster

  1. Log in to the consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant1 schema for Tenant1 BI Cluster */
    /* login to tenant1 BI Cluster and get the Namespace 
     * or get the Namespace from the Redshift console */
    SELECT CURRENT_NAMESPACE;

 

  2. Grant usage on the data share for the first tenant to the namespace of the BI cluster you got from the previous step:
    -- Grant usage on the datashare to the first consumer cluster
    -- Namespace refers to the namespace GUID of the consumer cluster 
    GRANT USAGE ON DATASHARE tenant1share TO NAMESPACE '<consumercluster1_namespace>';
    -- Optionally, grant to an AWS account instead (account IDs are 12 digits long)
    --GRANT USAGE ON DATASHARE tenant1share TO ACCOUNT '<Consumer_AWSAccount>';

Getting the namespace of the second consumer cluster

  1. Log in to the second consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant2 schema for Tenant2 BI Cluster */
    /* login to tenant2 BI Cluster and get the Namespace     */
    SELECT CURRENT_NAMESPACE;
    

  2. Grant usage on the data share for the second tenant to the namespace of the second consumer cluster you got from the previous step:
    -- Grant usage on the datashare to the second consumer cluster
    GRANT USAGE ON DATASHARE tenant2share TO NAMESPACE '<consumercluster2_namespace>';
    -- Optionally, grant to an AWS account instead (account IDs are 12 digits long)
    --GRANT USAGE ON DATASHARE tenant2share TO ACCOUNT '<Consumer_AWSAccount>';

  3. To view the data shares from the producer cluster, enter the following code:
    /******************************************************************/
    /*   View Datashares created, Datashare objects and consumers     */
    /******************************************************************/
    select * from SVV_DATASHARES;
    select * from SVV_DATASHARE_OBJECTS;
    select * from SVV_DATASHARE_CONSUMERS;

The following screenshot shows the commands in the query editor.

The following screenshot shows the query results.

Accessing data using the consumer cluster from the data share

To access data using the consumer cluster, complete the following steps:

  1. Log in to the first consumer cluster, ds-consumer-c1, as an admin user.
  2. View the data shares and data share objects from the system views:
    /*********************************************************/
    /* Consumer cluster as admin user:                        */
    /* list the shares available and review their contents   */
    /*********************************************************/
    -- You can view the data shares associated with the cluster
    -- using either of the following commands
    SHOW DATASHARES;
    select * from SVV_DATASHARES;

The following screenshot shows the query results.

--View objects shared in inbound share for consumer
select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the query results.

--View namespace or clusters granted usage to a datashare
select * from svv_datashare_consumers;
  3. Create a local database in the first consumer cluster from the data share. (Later, you can also create an external schema to provide controlled access to a specific shared schema.)
    /*******************************************************/
    /* Create a local database and schema reference        */
    /* to the share objects                                */
    /*******************************************************/
    CREATE DATABASE tenant1_db FROM DATASHARE tenant1share
    OF NAMESPACE '<producercluster_namespace>';

  4. Query the shared tables using three-part notation (database.schema.table):
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;

  5. Optionally, you can create an external schema.

There are two reasons to create an external schema: to enable two-part notation access to the tables from the consumer cluster, or to provide restricted access to specific schemas for selected users when multiple schemas are shared from the producer cluster. See the following code for our external schema:

/* Create External Schema */
CREATE EXTERNAL SCHEMA tenant1_schema FROM REDSHIFT DATABASE 'tenant1_db' SCHEMA 'tenant1';
  6. If you created the external schema, you can use the following two-part notation to query the tables:
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;

  7. You can view the shared databases by querying the SVV_REDSHIFT_DATABASES view:
    select * from SVV_REDSHIFT_DATABASES;

The following screenshot shows the query results.

Creating consumer users for managing access

Still logged in as an admin user to the consumer cluster, you can create other users who have access to the database objects.

  1. Create users and groups, and assign users and object privileges to the groups with the following code:
    /*******************************************************/
    /* Consumer can create own users and assign privileges */
    /* Create tenant1_user and assign privileges to        */
    /* read datashare from tenant1 schema                  */
    /*******************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE tenant1_db TO tenant1_user;
    GRANT USAGE ON SCHEMA tenant1_schema TO GROUP tenant1_group;
    

Now tenant1_user can log in and query the shared tables from the tenant1 schema.

  2. Log in to the consumer cluster as tenant1_user and query the tables:
    /************************************************************/
    /* Log in to the consumer cluster as tenant1_user and       */
    /* query the shared tables                                  */
    /************************************************************/
    
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;
    
    
    /************************************************************/
    /* If you have created an external schema,                  */
    /* you can use the two-part notation to query tables.       */
    /************************************************************/
    
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;
    

Revoking access to a data share (optional)

  1. At any point, if you want to revoke access to the data share, use the REVOKE USAGE command:
    /*************************************************************/
    /* To revoke access at any time use the REVOKE USAGE command */
    /*************************************************************/
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    REVOKE USAGE ON DATASHARE Salesshare FROM NAMESPACE '<consumercluster1_namespace>';
    --Account numbers are 12 digits long
    REVOKE USAGE ON DATASHARE Salesshare FROM ACCOUNT '<Consumer_AWSAccount>';
    

Silo model

The third option is to store the data for each tenant in a separate database within the cluster. If you need your data isolated from other tenants, you can use the silo model, where each database may have distinct data models, monitoring, management, and security footprints.

Amazon Redshift supports cross-database queries, which allow you to simplify data organization. You can store common or granular datasets used across all tenants in a centralized database, and use the cross-database query capability to join relevant data for each tenant.
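
For example, a query run while connected to a tenant database can join tenant data to a shared lookup table in a central database using three-part notation. The following is a sketch only; the central database and lookup table (common_db.public.region_lookup) are illustrative names, while the tenant tables are the ones created later in this section.

-- Cross-database query sketch (illustrative object names)
select c.c_custid, c.c_name, r.region_group
from tenant1_siloschema.customer c
join common_db.public.region_lookup r
  on r.region_name = c.c_region;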

The steps to create a data share in the silo model are similar to the bridge model; however, unlike the bridge model (where a data share is created for each schema), the silo model creates a data share for each database. The following diagram illustrates the architecture of the silo model.

Creating data shares for the silo model in the producer cluster

To create data shares for the silo model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and create separate databases for each tenant:
    /*****************************************************/
    /** Silo Model – Create databases for the 2 tenants **/
    /*****************************************************/
    create database tenant1_silodb;
    
    create database tenant2_silodb;
    

  2. Log in again to the producer cluster, connecting to the database that you want to share (tenant1_silodb), and create the schema and tables:
    /***********************************************************/
    /* login to tenant1_silodb and create schema tenant1_siloschema and tables */
    /***********************************************************/
    create schema tenant1_siloschema;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
    insert into tenant1_siloschema.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    truncate table tenant1_siloschema.sales;
    insert into tenant1_siloschema.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    

  3. Create a data share with a name for the first tenant (for example, tenant1_silodbshare):
    /******************************************************************/
    /*   Create datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1_silodbshare;
    

  4. Run ALTER DATASHARE commands to add the schema, and all tables in the schema, to be shared with the consumer cluster:
    ALTER DATASHARE tenant1_silodbshare ADD SCHEMA tenant1_siloschema;
    ALTER DATASHARE tenant1_silodbshare ADD ALL TABLES IN SCHEMA tenant1_siloschema;

  5. Grant usage on the data share for the first tenant to the namespace of the BI cluster. You can get the namespace of the BI cluster from the console or by using the SELECT CURRENT_NAMESPACE statement in the BI cluster:
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE tenant1_silodbshare TO NAMESPACE '<consumercluster1_namespace>';
    --Account numbers are 12 digits long (optional)
    --GRANT USAGE ON DATASHARE tenant1_silodbshare TO ACCOUNT '<AWS-Account>';
    

Viewing and querying data shares for the silo model from the consumer cluster

To view and query your data shares, complete the following steps:

  1. Log in to the consumer cluster as an admin user.
  2. Create a new database from the data share of the producer cluster:
    CREATE DATABASE tenant1_silodb FROM DATASHARE tenant1_silodbshare
    OF NAMESPACE '<producercluster_namespace>';

Now you can start querying the shared data from the producer cluster by using three-part notation (database.schema.table). If you created an external schema, you can also use two-part notation to query the tables.

  3. Query the data with the following code:
    select * from tenant1_silodb.tenant1_siloschema.customer;
    select * from tenant1_silodb.tenant1_siloschema.sales;
    

  4. Optionally, you can create an external schema pointing to the schema in the database of the producer cluster. This allows you to query the shared tables using two-part notation. See the following code:
    CREATE EXTERNAL SCHEMA tenant1_siloschema FROM REDSHIFT DATABASE 'tenant1_silodb' SCHEMA 'tenant1_siloschema';
    
    --With this Schema, you can access using two-part notation to select from data share tables
    select * from tenant1_siloschema.customer;
    select * from tenant1_siloschema.sales;
    

  5. You can repeat the same steps for tenant2 to share the tenant2 database with the second consumer cluster. You can also control access for users in your consumer cluster by creating users and groups and assigning access to the data share objects, as shown in the sketch that follows.
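
The following is a minimal sketch of that consumer-side setup, mirroring the bridge model example. The group and user names (tenant1_silo_group, tenant1_silo_user) are illustrative.

/* Sketch: consumer-side access control for the silo model */
create group tenant1_silo_group;
create user tenant1_silo_user password 'Redshift#123!' in group tenant1_silo_group;

-- Allow the user to use the database created from the data share
GRANT USAGE ON DATABASE tenant1_silodb TO tenant1_silo_user;

-- If you created the external schema, grant usage on it to the group
GRANT USAGE ON SCHEMA tenant1_siloschema TO GROUP tenant1_silo_group;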

System views to view data shares

We have introduced new system tables and views to easily identify the data shares and related objects. You can use three groups of system views to view the data share objects (a sample query follows this list):

  • Views starting with SVV_DATASHARES – provide details of data shares and the objects in a data share:
    • SVV_DATASHARES – View a list of data shares created on the cluster and data shares shared with the cluster
    • SVV_DATASHARE_OBJECTS – View a list of objects in all data shares created on the cluster or shared with the cluster
    • SVV_DATASHARE_CONSUMERS – View a list of consumers for data shares created on the cluster
  • Views starting with SVV_REDSHIFT – contain details on both local and remote Redshift databases:
    • SVV_REDSHIFT_DATABASES – List of all databases that a user has access to
    • SVV_REDSHIFT_SCHEMAS – List of all schemas that a user has access to
    • SVV_REDSHIFT_TABLES – List of all tables that a user has access to
    • SVV_REDSHIFT_COLUMNS – List of all columns that a user has access to
    • SVV_REDSHIFT_FUNCTIONS – List of all functions that a user has access to
  • Views starting with SVV_ALL – contain details on local and remote databases, external schemas (including Spectrum and federated query), and external schema references to shared data. If you create external schemas in the consumer cluster, use the SVV_ALL views to look at the objects:
    • SVV_ALL_SCHEMAS – Union of the list of all schemas from the SVV_REDSHIFT_SCHEMAS view and the consolidated list of all external schemas that a user has access to
    • SVV_ALL_TABLES – List of all tables that a user has access to
    • SVV_ALL_COLUMNS – List of all columns that a user has access to
    • SVV_ALL_FUNCTIONS – List of all functions that a user has access to
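
As a quick check, you can combine these views to see what has been shared and with whom. The following queries are a sketch; the share_type filter value reflects how outbound shares are labeled at the time of writing, so verify it against your cluster's view output.

-- Outbound data shares created on the producer cluster
select share_name, share_type, is_publicaccessible
from svv_datashares
where share_type = 'OUTBOUND';

-- Objects included in each data share
select share_name, object_type, object_name
from svv_datashare_objects;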

Considerations for choosing a storage strategy

You can adopt a storage strategy or choose a hybrid approach based on business, technical, and operational requirements. Before deciding on a strategy, consider the quotas and limits for various objects in Amazon Redshift, and the number of databases per cluster or number of schemas per database to check if it meets your requirements. The following table summarizes these considerations.

  • Separation of tenant data – Pool: views; Bridge: schemas; Silo: databases
  • ETL pipeline complexity – Pool: low; Bridge: low; Silo: medium
  • Limits – Pool: 100,000 tables (RA3 4xlarge and 16xlarge clusters); Bridge: 9,900 schemas per database; Silo: 60 databases per cluster
  • Chargeback to consumer accounts – Pool: yes; Bridge: yes; Silo: yes
  • Scalability – Pool: high; Bridge: high; Silo: high

Conclusion

In this post, we discussed how you can use the new data sharing feature of Amazon Redshift to implement an AaaS solution with a multi-tenant architecture while meeting SLAs for consumers using separate Amazon Redshift clusters. We demonstrated three types of models providing various levels of isolation for the tenant data. We compared and contrasted the models and provided guidance on when to choose an implementation model. We encourage you to try the data sharing feature to build your AaaS or software as a service (SaaS) solutions.


About the Authors

Rajesh Francis is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable Analytic solutions.

 

 

 

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.

 

Jeetesh Srivastva is a Sr. Analytics specialist solutions architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions leveraging Redshift and other AWS Analytic services. He has worked to deliver on premises and cloud based analytic solutions for customers in banking & finance and hospitality industry verticals.

Federating single sign-on access to your Amazon Redshift cluster with PingIdentity

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/federating-single-sign-on-access-to-your-amazon-redshift-cluster-with-pingidentity/

Single sign-on (SSO) enables users to have a seamless user experience while accessing various applications in the organization. If you’re responsible for setting up security and database access privileges for users and tasked with enabling SSO for Amazon Redshift, you can set up SSO authentication using ADFS, PingIdentity, Okta, Azure AD, or other SAML browser-based identity providers.

With federation, you can centralize management and governance of authentication and permissions by managing users and groups within the enterprise identity provider (IdP) and use them to authenticate to Amazon Redshift. For more information about the federation workflow using IAM and an identity provider, see Federate Database User Authentication Easily with IAM and Amazon Redshift.

This post shows you how to set up PingOne as your IdP. I provide step-by-step guidance to set up a trial account at pingidentity.com, build users and groups within your organization’s directory, and enable federated SSO into Amazon Redshift to maintain group-level access controls for your data warehouse.

Solution overview

The steps in this post are structured into the following sections:

  • IdP (PingOne) groups configuration – Create groups and assign users to logical groups in PingOne.
  • IdP (PingOne) application configuration – Create PingOne applications and configure the AWS Identity and Access Management (IAM) roles and groups allowed to be passed to Amazon Redshift.
  • IAM SAML federation configuration – Set up a role that allows PingOne to access Amazon Redshift by establishing a trust relationship between the PingOne IdP and AWS.
  • Amazon Redshift groups and privileges setup – Set up groups within the Amazon Redshift database to match the PingOne groups. You also authorize these groups to access certain schemas and tables.
  • Amazon Redshift server and client setup and SSO test – Finally, configure SQL client tools to use your enterprise credentials and sign in to Amazon Redshift.

The process flow for federated authentication is shown in the following diagram and steps:

  1. The user logs in using a JDBC/ODBC SQL client.
  2. The IdP authenticates using the corporate user name and password, and returns a SAML assertion.
  3. The client uses AWS SDK to call AWS Security Token Service (AWS STS) to assume a role with SAML.
  4. AWS STS returns temporary AWS credentials.
  5. The client uses the temporary AWS credentials to get temporary cluster credentials.
  6. The client connects to Amazon Redshift using the temporary credentials.

Setting up PingOne provider groups and users

Before you get started, sign up for a free trial of PingOne for Enterprise. You then create the users and groups, and assign the users to the groups they belong to and are authorized to access.

You create groups and users in the PingOne user directory. You can set up the groups according to the read/write access privileges or by business functions in your organization to control access to the database objects.

In this post, we set up groups based on ReadOnly and ReadWrite privileges across all functions.

  1. After you have a PingOne account, log in to the PingOne admin dashboard.
  2. Choose Setup from the menu bar.
  3. On the Identity Repository tab, choose Connect to an Identity Repository.
  4. For Select an Identity Repository, you will see options for PingOne Directory, Active Directory, PingFederate and others. Choose PingOne Directory and go to Next.

After you connect to the PingOne repository, you should see the status CONFIGURED.

You can now create your groups and assign users.

  1. Choose Users from the menu bar.
  2. On the User Directory tab, choose Groups.
  3. Choose Add Group.
  4. For Name, enter readonly.
  5. For Directly Applied Role, select No Access.
  6. Choose Save.
  7. Repeat these steps for your readwrite group.
  8. To create the users, choose Users from the menu bar.
  9. On the User Directory tab, choose Users.
  10. Choose Add Users.

For this post, we create two users, Bob and Rachel.

  1. Under Group Memberships, for Memberships, select the group to add your user to.

For this post, we add Bob to readonly and Rachel to readwrite.

  1. Choose Add.
  2. Choose Save.
  3. Repeat these steps to create both users.

Configuring your IdP (PingOne) application

The next step is to set up the applications in the IdP for Amazon Redshift. Because we decided to control access through two groups, we create two applications.

  1. On the PingOne dashboard, choose Applications from the menu bar.
  2. On the My Applications tab, choose SAML.
  3. Choose Add Application.
  4. Choose New SAML Application.
  5. For Application Name, enter AmazonRedshiftReadOnly.
  6. Choose Continue to Next Step.
  7. On the Application Configuration page, for Assertion Consumer Service (ACS), enter http://localhost:7890/redshift/.
  8. For Entity ID, enter urn:amazon:webservices.
  9. For Signing, select Sign Assertion.
  10. For Signing Algorithm, choose RSA_SHA256.
  11. Choose Continue to Next Step.
  12. On the SSO Attribute Mapping page, add the following application attributes:
  • https://aws.amazon.com/SAML/Attributes/Role – Identity Bridge (As Literal: True): arn:aws:iam::<AWSAccount>:role/pingreadonlyrole,arn:aws:iam::<AWSAccount>:saml-provider/pingreadonlyprov. Here, pingreadonlyrole is the name of the IAM role you create in the next step, and pingreadonlyprov is the identity provider name in IAM where the metadata is imported. You use this name in the next step to create your identity provider and import the metadata downloaded from this PingOne application configuration.
  • https://aws.amazon.com/SAML/Attributes/RoleSessionName – Identity Bridge: Email
  • https://redshift.amazon.com/SAML/Attributes/AutoCreate – Identity Bridge (As Literal: True): true
  • https://redshift.amazon.com/SAML/Attributes/DbUser – Identity Bridge: Email
  • https://redshift.amazon.com/SAML/Attributes/DbGroups – Identity Bridge: memberOf. Choose Advanced and for Function, choose ExtractByRegularExpression. For Expression, enter (readonly|readwrite).+

This regular expression removes the @directory value from the PingIdentity group name so that it matches the Amazon Redshift database group names, and sends only the relevant groups to the application.

Refer to the PingIdentity documentation for more details on parsing the memberof attribute in PingOne.

  1. Choose Continue to Next Step.
  2. On the Group Access page, add the groups that this application can access.

This adds the users who are members of that group so they can SSO to the application.

  1. On the Review Setup page, for SAML Metadata, choose Download.
  2. Save the file as ping-saml-readonly.xml.

You use this file later to import the metadata to create the PingOne IdP.

  1. Record the URL for Initiate Single Sign-On (SSO).

You use this URL to set up the SQL client for federated SSO.

  1. Choose Finish.
  2. Repeat these steps to create the second application, AmazonRedshiftReadWrite, with the following changes:
    1. On the SSO Attribute Mapping page, use the IAM role name pingreadwriterole and IdP name pingreadwriteprov.
    2. Save the SAML metadata file as ping-saml-readwrite.xml.

You should now see the two application names on the My Applications tab.

Configuring IAM SAML federation

To set up your IAM SAML configuration, you create the IAM IdP and the roles and policies for the groups.

Setting up the IAM SAML IdP

You set up the IAM IdP and the roles used in the AmazonRedshiftReadOnly and AmazonRedshiftReadWrite applications to establish a trust relationship between the IdP and AWS. You need to create two IAM IdPs, one for each application. Complete the following steps:

  1. On the IAM console, under Access management, choose Identity providers.
  2. Choose Create Provider.
  3. For Provider Type, choose SAML.
  4. For Provider name, enter pingreadonlyprov.
  5. For Metadata Document, choose the metadata XML file you downloaded from the AmazonRedshiftReadOnly application.
  6. Repeat these steps to create the provider pingreadwriteprov.
    1. Choose the metadata XML file you downloaded from the AmazonRedshiftReadWrite application.

You now have two IdP providers: pingreadonlyprov and pingreadwriteprov.

Creating the IAM role and policy for the groups

You control access privileges to database objects for specific user groups by using IAM roles. In this section, you create separate IAM roles with policies to map to each of the groups defined in PingOne. These roles allow the user to access Amazon Redshift through the IdP.

You use the same role names that you used to set up applications in PingOne: pingreadonlyrole and pingreadwriterole.

Before you create the roles, create the policies with the appropriate redshift:JoinGroup privileges.

  1. On the IAM console, under Access Management, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following code to create the two policies.
    1. Replace <cluster> with your cluster name and <dbname> with your database name.

The only difference between the two policies is in the redshift:JoinGroup statement:

  • “JoinGroup”: pingreadonlypolicy allows users to join the readonly group
  • “JoinGroup”: pingreadwritepolicy allows users to join the readwrite group

The group membership lasts only for the duration of the user session, and there is no CreateGroup permission because you need to manually create groups and grant DB privileges in Amazon Redshift.

The following code is the pingreadonlypolicy policy:

pingreadonlypolicy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetClusterCredsStatement",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:*:*:cluster:<cluster>",
                "arn:aws:redshift:*:*:dbname:<cluster>/<dbname>",
                "arn:aws:redshift:*:*:dbuser:<cluster>/${redshift:DbUser}"
            ],
            "Condition": {
                "StringLike": {
                    "aws:userid": "*:${redshift:DbUser}"
                }
            }
        },
        {
            "Sid": "CreateClusterUserStatement",
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:*:*:dbname:<cluster>/<dbname>",
                "arn:aws:redshift:*:*:dbuser:<cluster>/${redshift:DbUser}"
            ]
        },
        {
            "Sid": "RedshiftJoinGroupStatement",
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:*:*:dbgroup:<cluster>/readonly"
            ]
        }
    ]
}

The following code is the pingreadwritepolicy policy:

pingreadwritepolicy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetClusterCredsStatement",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:*:*:cluster:<cluster>",
                "arn:aws:redshift:*:*:dbname:<cluster>/<dbname>",
                "arn:aws:redshift:*:*:dbuser:<cluster>/${redshift:DbUser}"
            ],
            "Condition": {
                "StringLike": {
                    "aws:userid": "*:${redshift:DbUser}"
                }
            }
        },
        {
            "Sid": "CreateClusterUserStatement",
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:*:*:dbname:<cluster>/<dbname>",
                "arn:aws:redshift:*:*:dbuser:<cluster>/${redshift:DbUser}"
            ]
        },
        {
            "Sid": "RedshiftJoinGroupStatement",
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:*:*:dbgroup:<cluster>/readwrite"
            ]
        }
    ]
}
  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. For Select type of trusted entity, choose SAML 2.0 federation.
  4. For SAML provider, choose the provider you created.
  5. Select Allow programmatic access only.
  6. For Attribute, choose SAML:aud.
  7. For Value, enter http://localhost:7890/redshift/.
  8. Select pingreadonlypolicy for the first role and pingreadwritepolicy for the second role.
  9. Enter a name and description for each role.

The following screenshot shows your new roles: pingreadonlyrole and pingreadwriterole.

Setting up your groups and privileges in Amazon Redshift

In this section, you create the database groups in Amazon Redshift. These group names should match the group names you used when you set up your PingOne groups. Then you assign privileges to the groups to access the database objects including schemas and tables. User assignment to groups is done only one time in PingOne; you don’t assign users to groups in Amazon Redshift.

  1. Log in to your Amazon Redshift cluster with an admin account using the admin database credentials.
  2. Use the following script to create groups that match the IdP group names and grant the appropriate permissions on tables and schemas (a sample verification query follows the script):
    CREATE GROUP readonly;
    CREATE GROUP readwrite;
    
    ALTER DEFAULT PRIVILEGES IN SCHEMA finance
    GRANT SELECT on TABLES to GROUP readonly;
    GRANT USAGE on SCHEMA finance to GROUP readonly;
    GRANT SELECT on ALL TABLES in SCHEMA finance to GROUP readonly;
    
    ALTER DEFAULT PRIVILEGES IN SCHEMA finance
    GRANT ALL on TABLES to GROUP readwrite;
    GRANT USAGE on SCHEMA finance to GROUP readwrite;
    GRANT ALL on ALL TABLES in SCHEMA finance to GROUP readwrite;
    

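A quick sanity check after running the script (this assumes the finance schema already exists on your cluster):

-- Confirm the groups were created
select groname from pg_group where groname in ('readonly', 'readwrite');

-- List the tables in the finance schema that the grants apply to
select schemaname, tablename from pg_tables where schemaname = 'finance';
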
Setting up your Amazon Redshift server and client and testing SSO

In these final steps, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

Configuring the JDBC SQL Client using SQL Workbench/J

If you haven’t installed the JDBC driver, you can download the Amazon Redshift JDBC driver from the console. You then set up a new connection to your cluster using your PingOne IdP credentials.

  1. Create two new connection profiles, Redshift-ReadOnly and Redshift-ReadWrite.
  2. For URL, enter jdbc:redshift:iam://<cluster endpoint>.

IAM authentication requires using the JDBC driver with the AWS SDK included or making sure the AWS SDK is within your Java classpath.

You don’t need to enter a user name or password in the JDBC settings. PingIdentity prompts you to log in on the web browser.

  1. Choose Extended Properties to define the SSO parameters for login_url and plugin_name.
  2. In the Edit extended properties section, enter the following properties and values:
  • login_url – https://sso.connect.PingOne.com/sso/sp/initsso?saasid=
  • plugin_name – com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
  • listen_port – 7890
  • idp_response_timeout – 60

The login_url is the Initiate Single Sign-On (SSO) URL from the PingOne applications you set up earlier. Use the SSO URL from the AmazonRedshiftReadOnly application for the readonly connection and the SSO URL from the AmazonRedshiftReadWrite application for the readwrite connection.

The configuration in your extended properties screen should look like the screenshot below:

  1. Choose OK.

Testing SSO authentication and access privileges

When you log in from the SQL client, you’re redirected to the browser to sign in with your PingOne user name and password.

Log in as user bob with the IdP password.

This user has SELECT access to all tables in the finance schema, but no INSERT or UPDATE access. You can run the following statements to test your access.

The following query shows the results from the finance.revenue table:

/* Finance ReadOnly Query */
select * from finance.revenue limit 10;

customer		salesamt
--------------------------------------
ABC Company	        12000
Tech Logistics		175400
XYZ Industry		24355
The tax experts        186577

When you run an INSERT statement, you get the message that you’re not authorized to insert data:

/* Finance ReadWrite Insert */
insert into finance.revenue
values (10001, 'ABC Company', 12000);

You should see the results below:

INSERT INTO finance.revenue not successful
An error occurred when executing the SQL command:
insert into finance.revenue
values(10001, 'ABC Company', 12000)

[Amazon](500310) Invalid operation: permission denied for relation revenue;
1 statement failed.
Execution time: 0.05s

You can repeat these steps for the user rachel, who has access to read and write (INSERT) data into the finance schema.

Configuring the ODBC client

To configure your ODBC client, complete the following steps.

  1. Open the ODBC Data source administrator from your desktop.
  2. On the System DSN tab, choose Add.
  3. For Server, enter your Amazon Redshift ODBC endpoint.
  4. For Port, enter 5439.
  5. For Database, enter your database name.
  6. For Auth Type, choose Identity Provider: Browser SAML to use browser-based authentication.
  7. For Cluster ID, enter your cluster ID.
  8. For Preferred Role, enter your IAM role ARN.
  9. For Login URL, enter your PingOne login URL from the application configuration (https://sso.connect.PingOne.com/sso/sp/initsso?saasid=<saasid>&idpid=<idpid>).
  10. For Listen port, enter 7890 (default).
  11. For Timeout, enter 60.

Summary

In this blog post, I walked you through a step-by-step guide to configure and use PingOne as your IdP and enable federated SSO to an Amazon Redshift cluster. You can follow these steps to set up federated SSO for your organization, manage access privileges based on read/write needs or business function, and pass group membership defined in your PingOne IdP to your Amazon Redshift cluster.


About the Authors

Rajesh Francis is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable Analytic solutions.