All posts by Harshida Patel

Simplify access management with Amazon Redshift and AWS Lake Formation for users in an External Identity Provider

Post Syndicated from Harshida Patel original https://aws.amazon.com/blogs/big-data/simplify-access-management-with-amazon-redshift-and-aws-lake-formation-for-users-in-an-external-identity-provider/

Many organizations use identity providers (IdPs) to authenticate users and manage their attributes and group memberships for secure, efficient, and centralized identity management. You might be modernizing your data architecture using Amazon Redshift to enable access to your data lake and data in your data warehouse, and are looking for a centralized and scalable way to define and manage data access based on IdP identities. AWS Lake Formation makes it straightforward to centrally govern, secure, and globally share data for analytics and machine learning (ML). Currently, you may have to map user identities and groups to AWS Identity and Access Management (IAM) roles, with data access permissions defined at the IAM role level within Lake Formation. This setup is not efficient because creating and maintaining the mapping of IdP groups to IAM roles as new groups are added is time consuming, and it makes it difficult to determine which user accessed what data from which service at a given time.

Amazon Redshift, Amazon QuickSight, and Lake Formation now integrate with the new trusted identity propagation capability in AWS IAM Identity Center to authenticate users seamlessly across services. In this post, we discuss two use cases to configure trusted identity propagation with Amazon Redshift and Lake Formation.

Solution overview

Trusted identity propagation provides a new authentication option for organizations that want to centralize data permissions management and authorize requests based on their IdP identity across service boundaries. With IAM Identity Center, you can configure an existing IdP to manage users and groups and use Lake Formation to define fine-grained access control permissions on catalog resources for these IdP identities. Amazon Redshift supports identity propagation when querying data with Amazon Redshift Spectrum and with Amazon Redshift data sharing, and you can use AWS CloudTrail to audit data access by IdP identities to help your organization meet its regulatory and compliance requirements.

With this new capability, users can connect to Amazon Redshift from QuickSight with a single sign-on experience and create direct query datasets. This is enabled by using IAM Identity Center as a shared identity source. With trusted identity propagation, when QuickSight assets like dashboards are shared with other users, the database permissions of each QuickSight user are applied by propagating their end-user identity from QuickSight to Amazon Redshift and enforcing their individual data permissions. Depending on the use case, the author can apply additional row-level and column-level security in QuickSight.

The following diagram illustrates an example of the solution architecture.

In this post, we walk through how to configure trusted identity propagation with Amazon Redshift and Lake Formation. We cover the following use cases:

  • Redshift Spectrum with Lake Formation
  • Redshift data sharing with Lake Formation

Prerequisites

This walkthrough assumes you have set up a Lake Formation administrator role or a similar role to follow along with the instructions in this post. To learn more about setting up permissions for a data lake administrator, see Create a data lake administrator.

Additionally, you must create the following resources as detailed in Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On:

  • An Okta account integrated with IAM Identity Center to sync users and groups
  • A Redshift managed application with IAM Identity Center
  • A Redshift source cluster with IAM Identity Center integration enabled
  • A Redshift target cluster with IAM Identity Center integration enabled (you can skip the section to set up Amazon Redshift role-based access)
  • Users and groups from IAM Identity Center assigned to the Redshift application
  • A permission set assigned to AWS accounts to enable Redshift Query Editor v2 access
  • Add the following permissions to the IAM role used by the Redshift managed application for integration with IAM Identity Center:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataAccess",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:SearchTables",
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartitions",
                    "lakeformation:GetResourceLFTags",
                    "lakeformation:ListLFTags",
                    "lakeformation:GetLFTag",
                    "lakeformation:SearchTablesByLFTags",
                    "lakeformation:SearchDatabasesByLFTags"
               ],
                "Resource": "*"
            }
        ]
    }
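
If you manage this role with the AWS CLI, the following is a minimal sketch of attaching the statement above as an inline policy. The role name, policy name, and local file name are placeholders; substitute the IAM role that your Redshift managed application uses and save the JSON document above as the referenced file.

aws iam put-role-policy \
--role-name <redshift-idc-application-role> \
--policy-name LakeFormationGlueAccess \
--policy-document file://lf-glue-access.json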

Use case 1: Redshift Spectrum with Lake Formation

To stage the sample data for this use case, complete the following steps:

  1. Log in to the AWS Management Console as an IAM administrator.
  2. Go to CloudShell or your AWS CLI and run the following AWS CLI command, providing your bucket name to copy the data:
aws s3 sync s3://redshift-demos/data/NY-Pub/ s3://<bucketname>/data/NY-Pub/

In this post, we use an AWS Glue crawler to create the external table ny_pub, stored in Apache Parquet format in the Amazon S3 location s3://<bucketname>/data/NY-Pub/. In the next step, we use AWS CloudFormation to create a stack named CrawlS3Source-NYTaxiData in us-east-1 that provisions the solution resources.

  1. Download the .yml file or launch the CloudFormation stack.
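
If you prefer the AWS CLI over the console, the following sketch launches the stack; the local template file name is a placeholder for the downloaded .yml file. The CAPABILITY_NAMED_IAM flag is needed because the stack creates the named IAM role AWSGlueServiceRole-RedshiftAutoMount.

aws cloudformation create-stack \
--stack-name CrawlS3Source-NYTaxiData \
--region us-east-1 \
--template-body file://CrawlS3Source-NYTaxiData.yml \
--capabilities CAPABILITY_NAMED_IAM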

The stack creates the following resources:

  • The crawler NYTaxiCrawler along with the new IAM role AWSGlueServiceRole-RedshiftAutoMount
  • The AWS Glue database automountdb

When the stack is complete, continue with the following steps to finish setting up your resources:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Crawlers.
  2. Open NYTaxiCrawler and choose Edit.
  3. Under Choose data sources and classifiers, choose Edit.
  4. For Data source, choose S3.
  5. For S3 path, enter s3://<bucketname>/data/NY-Pub/.
  6. Choose Update S3 data source.
  7. Choose Next and choose Update.
  8. Choose Run crawler.
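
Alternatively, you can update and run the crawler from the AWS CLI. The following sketch uses the crawler and bucket names from this walkthrough:

aws glue update-crawler \
--name NYTaxiCrawler \
--targets '{"S3Targets": [{"Path": "s3://<bucketname>/data/NY-Pub/"}]}'

aws glue start-crawler --name NYTaxiCrawler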

After the crawler is complete, you can see a new table called ny_pub in the Data Catalog under the automountdb database.

After you create the resources, complete the steps in the next sections to set up Lake Formation permissions on the AWS Glue table ny_pub for the sales IdP group and access it via Redshift Spectrum.

Enable Lake Formation propagation for the Redshift managed application

Complete the following steps to enable Lake Formation propagation for the Redshift managed application created in Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On:

  1. Log in to the console as admin.
  2. On the Amazon Redshift console, choose IAM Identity Center connection in the navigation pane.
  3. Select the managed application that starts with redshift-iad and choose Edit.
  4. Select Enable AWS Lake Formation access grants under Trusted identity propagation and save your changes.

Set up Lake Formation as an IAM Identity Center application

Complete the following steps to set up Lake Formation as an IAM Identity Center application:

  1. On the Lake Formation console, under Administration in the navigation pane, choose IAM Identity Center integration.
  2. Review the options and choose Submit to enable Lake Formation integration.

The integration status will update to Success.
Alternatively, you can run the following command:

aws lakeformation create-lake-formation-identity-center-configuration \
--cli-input-json '{"CatalogId": "<catalog_id>","InstanceArn": "<identitycenter_arn>"}'
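
You can confirm the integration from the CLI with the corresponding describe call:

aws lakeformation describe-lake-formation-identity-center-configuration --catalog-id <catalog_id>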

Register the data with Lake Formation

In this section, we register the data with Lake Formation. Complete the following steps:

  1. On the Lake Formation console, under Administration in the navigation pane, choose Data lake locations.
  2. Choose Register location.
  3. For Amazon S3 path, enter the bucket where the table data resides (s3://<bucketname>/data/NY-Pub/).
  4. For IAM role, choose a Lake Formation user-defined role. For more information, refer to Requirements for roles used to register locations.
  5. For Permission mode, select Lake Formation.
  6. Choose Register location.
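
Alternatively, you can register the location from the AWS CLI. The role name below is a placeholder for the user-defined registration role chosen in step 4:

aws lakeformation register-resource \
--resource-arn arn:aws:s3:::<bucketname>/data/NY-Pub \
--role-arn arn:aws:iam::<accountid>:role/<LakeFormationRegistrationRole>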

Next, verify that the IAMAllowedPrincipals group doesn’t have permission on the database.

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Select automountdb and on the Actions menu, choose View permissions.
  3. If IAMAllowedPrincipals is listed, select the principal and choose Revoke.
  4. Repeat these steps to verify permissions for the table ny_pub.
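
The equivalent revoke from the AWS CLI looks like the following sketch; adjust the permissions list to match what the console shows for the IAMAllowedPrincipals entry:

aws lakeformation revoke-permissions \
--principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
--resource '{"Database": {"Name": "automountdb"}}' \
--permissions "ALL"

aws lakeformation revoke-permissions \
--principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
--resource '{"Table": {"DatabaseName": "automountdb", "Name": "ny_pub"}}' \
--permissions "ALL"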

Grant the IAM Identity Center group permissions on the AWS Glue database and table

Complete the following steps to grant database permissions to the IAM Identity Center group:

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Select the database automountdb and on the Actions menu, choose Grant.
  3. Choose Grant database.
  4. Under Principals, select IAM Identity Center and choose Add.
  5. In the pop-up window, if this is the first time assigning users and groups, choose Get started.
  6. Enter the IAM Identity Center group in the search bar and choose the group.
  7. Choose Assign.
  8. Under LF-Tags or catalog resources, automountdb is already selected for Databases.
  9. Select Describe for Database permissions.
  10. Choose Grant to apply the permissions.

Alternatively, you can run the following command:

aws lakeformation grant-permissions --cli-input-json '
{
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:identitystore:::group/<identitycenter_group_name>"
    },
    "Resource": {
        "Database": {
            "Name": "automountdb"
        }
    },
    "Permissions": [
        "DESCRIBE"
    ]
}'

Next, you grant table permissions to the IAM Identity Center group.

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database automountdb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM Identity Center and choose Add.
  4. Enter the IAM Identity Center group in the search bar and choose the group.
  5. Choose Assign.
  6. Under LF-Tags or catalog resources, automountdb is already selected for Databases.
  7. For Tables, choose ny_pub.
  8. Select Describe and Select for Table permissions.
  9. Choose Grant to apply the permissions.

Alternatively, you can run the following command:

aws lakeformation grant-permissions --cli-input-json '
{
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:identitystore:::group/<identitycenter_group_name>"
    },
    "Resource": {
        "Table": {
            "DatabaseName": "automountdb",
            "Name": "ny_pub "
        }
    },
    "Permissions": [
        "SELECT",
        "DESCRIBE"

    ]
}'

Set up Redshift Spectrum table access for the IAM Identity Center group

Complete the following steps to set up Redshift Spectrum table access:

  1. Sign in to the Amazon Redshift console using the admin role.
  2. Navigate to Query Editor v2.
  3. Choose the options menu (three dots) next to the cluster and choose Create connection.
  4. Connect as the admin user and run the following commands to make the ny_pub data in the S3 data lake available to the sales group:
    create external schema if not exists nyc_external_schema from DATA CATALOG database 'automountdb' catalog_id '<accountid>';
    grant usage on schema nyc_external_schema to role "awsidc:awssso-sales";
    grant select on all tables in schema nyc_external_schema to role "awsidc:awssso-sales";

Validate Redshift Spectrum access as an IAM Identity Center user

Complete the following steps to validate access:

  1. On the Amazon Redshift console, navigate to Query Editor v2.
  2. Choose the options menu (three dots) next to the cluster and choose Create connection.
  3. For Connect, select the IAM Identity Center option, and provide your Okta user name and password in the browser pop-up.
  4. Once connected as a federated user, run the following SQL command to query the ny_pub data lake table:
select * from nyc_external_schema.ny_pub;

Use case 2: Redshift data sharing with Lake Formation

This use case assumes you have IAM Identity Center integration with Amazon Redshift set up, with Lake Formation propagation enabled as per the instructions provided in the previous section.

Create a data share with objects and share it with the Data Catalog

Complete the following steps to create a data share:

  1. Sign in to the Amazon Redshift console using the admin role.
  2. Navigate to Query Editor v2.
  3. Choose the options menu (three dots) next to the Redshift source cluster and choose Create connection.
  4. Connect as the admin user using the Temporary credentials using a database user name option and run the following SQL commands to create a data share:
    CREATE DATASHARE salesds; 
    ALTER DATASHARE salesds ADD SCHEMA sales_schema; 
    ALTER DATASHARE salesds ADD TABLE store_sales; 
    GRANT USAGE ON DATASHARE salesds TO ACCOUNT '<accountid>' VIA DATA CATALOG;

  5. Authorize the data share by choosing Data shares in the navigation pane and selecting the data share salesds.
  6. Select the data share and choose Authorize.

Now you can register the data share in Lake Formation as an AWS Glue database.

  1. Sign in to the Lake Formation console as the data lake administrator IAM user or role.
  2. Under Data catalog in the navigation pane, choose Data sharing and view the Redshift data share invitations on the Configuration tab.
  3. Select the datashare salesds and choose Review Invitation.
  4. Once you review the details, choose Accept.
  5. Provide a name for the AWS Glue database (for example, salesds) and choose Skip to Review and create.

After the AWS Glue database is created on the Redshift data share, you can view it under Shared databases.

Grant the IAM Identity Center user group permission on the AWS Glue database and table

Complete the following steps to grant database permissions to the IAM Identity Center group:

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Select the database salesds and on the Actions menu, choose Grant.
  3. Choose Grant database.
  4. Under Principals, select IAM Identity Center and choose Add.
  5. In the pop-up window, enter the IAM Identity Center group awssso in the search bar and choose the awssso-sales group.
  6. Choose Assign.
  7. Under LF-Tags or catalog resources, salesds is already selected for Databases.
  8. Select Describe for Database permissions.
  9. Choose Grant to apply the permissions.

Next, grant table permissions to the IAM Identity Center group.

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database salesds and on the Actions menu, choose Grant.
  3. Under Principals, select IAM Identity Center and choose Add.
  4. In the pop-up window, enter the IAM Identity Center group awssso in the search bar and choose the awssso-sales group.
  5. Choose Assign.
  6. Under LF-Tags or catalog resources, salesds is already selected for Databases.
  7. For Tables, choose sales_schema.store_sales.
  8. Select Describe and Select for Table permissions.
  9. Choose Grant to apply the permissions.
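
As with use case 1, you can alternatively run both grants from the AWS CLI. The table name below assumes the shared table appears as sales_schema.store_sales in the salesds database; adjust it to match what Lake Formation shows under Shared databases:

aws lakeformation grant-permissions --cli-input-json '
{
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:identitystore:::group/<identitycenter_group_name>"
    },
    "Resource": {
        "Database": {
            "Name": "salesds"
        }
    },
    "Permissions": [
        "DESCRIBE"
    ]
}'

aws lakeformation grant-permissions --cli-input-json '
{
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:identitystore:::group/<identitycenter_group_name>"
    },
    "Resource": {
        "Table": {
            "DatabaseName": "salesds",
            "Name": "sales_schema.store_sales"
        }
    },
    "Permissions": [
        "SELECT",
        "DESCRIBE"
    ]
}'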

Mount the external schema in the target Redshift cluster and enable access for the IAM Identity Center user

Complete the following steps:

  1. Sign in to the Amazon Redshift console using the admin role.
  2. Navigate to Query Editor v2.
  3. Connect as an admin user and run the following SQL commands to mount the AWS Glue database salesds as an external schema and enable access to the sales group:
create external schema if not exists sales_datashare_schema from DATA CATALOG database 'salesds' catalog_id '<accountid>';
create role "awsidc:awssso-sales"; -- if the role was not already created
grant usage on schema sales_datashare_schema to role "awsidc:awssso-sales";
grant select on all tables in schema sales_datashare_schema to role "awsidc:awssso-sales";

Access Redshift data shares as an IAM Identity Center user

Complete the following steps to access the data shares:

  1. On the Amazon Redshift console, navigate to Query Editor v2.
  2. Choose the options menu (three dots) next to the cluster and choose Create connection.
  3. Connect with IAM Identity Center and provide the IAM Identity Center user name and password in the browser login.
  4. Run the following SQL command to query the shared table:
SELECT * FROM "dev"."sales_datashare_schema"."sales_schema.store_sales";

With trusted identity propagation, you can now audit user access to datasets governed by Lake Formation, along with the service used to access the data, providing complete traceability. For the federated user Ethan, whose IAM Identity Center user ID is 459e10f6-a3d0-47ae-bc8d-a66f8b054014, you can see the following CloudTrail event log:

"eventSource": "lakeformation.amazonaws.com",
    "eventName": "GetDataAccess",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "redshift.amazonaws.com",
    "userAgent": "redshift.amazonaws.com",
    "requestParameters": {
        "tableArn": "arn:aws:glue:us-east-1:xxxx:table/automountdb/ny_pub",
        "durationSeconds": 3600,
        "auditContext": {
            "additionalAuditContext": "{\"invokedBy\":\"arn:aws:redshift:us-east-1:xxxx:dbuser:redshift-consumer/awsidc:[email protected]\", \"transactionId\":\"961953\", \"queryId\":\"613842\", \"isConcurrencyScalingQuery\":\"false\"}"
        },
        "cellLevelSecurityEnforced": true
    },
    "responseElements": null,
    "additionalEventData": {
        "requesterService": "REDSHIFT",
        "LakeFormationTrustedCallerInvocation": "true",
        "lakeFormationPrincipal": "arn:aws:identitystore:::user/459e10f6-a3d0-47ae-bc8d-a66f8b054014",
        "lakeFormationRoleSessionName": "AWSLF-00-RE-726034267621-K7FUMxovuq"
    }
}
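
To retrieve events like this from CloudTrail with the AWS CLI, you can filter on the GetDataAccess event name. This is a sketch; you may want to narrow the time range or query the trail with Athena for larger audits:

aws cloudtrail lookup-events \
--region us-east-1 \
--lookup-attributes AttributeKey=EventName,AttributeValue=GetDataAccess \
--max-results 5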

Clean up

Complete the following steps to clean up your resources:

  1. Delete the data from the S3 bucket.
  2. Delete the Lake Formation application and the Redshift provisioned cluster that you created for testing.
  3. Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack, and delete the stack you created.
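
The following is a CLI sketch for steps 1 and 3, using the names from this walkthrough:

aws s3 rm s3://<bucketname>/data/NY-Pub/ --recursive

aws cloudformation delete-stack --stack-name CrawlS3Source-NYTaxiData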

Conclusion

In this post, we covered how to simplify access management for analytics by propagating user identity across Amazon Redshift and Lake Formation using IAM Identity Center. We learned how to get started with trusted identity propagation by connecting to Amazon Redshift and Lake Formation. We also learned how to configure Redshift Spectrum and data sharing to support trusted identity propagation.

Learn more about IAM Identity Center with Amazon Redshift and AWS Lake Formation. Leave your questions and feedback in the comments section.


About the Authors

Harshida Patel is an Analytics Specialist Principal Solutions Architect with AWS.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Achieve fine-grained data security with row-level access control in Amazon Redshift

Post Syndicated from Harshida Patel original https://aws.amazon.com/blogs/big-data/achieve-fine-grained-data-security-with-row-level-access-control-in-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the challenges with security is that enterprises want to provide fine-grained access control at the row level for sensitive data. You can do this by creating views or using different databases and schemas for different users. However, this approach isn’t scalable and becomes complex to maintain over time. Customers have asked us to simplify the process of securing their data by providing the ability to control granular access.

Row-level security (RLS) in Amazon Redshift is built on the foundation of role-based access control (RBAC). RLS allows you to control which users or roles can access specific records of data within tables, based on security policies that are defined at the database object level. This new RLS capability in Amazon Redshift enables you to dynamically filter existing rows of data in a table. This is in addition to column-level access control, where you can grant users permissions to a subset of columns. Now you can combine column-level access control with RLS policies to further restrict access to particular rows of visible columns.

In this post, we explore the row-level security features of Amazon Redshift and how you can use roles to simplify managing the privileges required by your end-users.

Customer feedback

TrustLogix is a Norwest Venture Partners backed cloud security startup in the Data Security Governance space. TrustLogix delivers powerful monitoring, observability, audit, and fine-grained data entitlement capabilities that empower Amazon Redshift clients to implement data-centric security for their digital transformation initiatives.

“We’re excited about this new and deeper level of integration with Amazon Redshift. Our joint customers in security-forward and highly regulated sectors including financial services, healthcare, and pharmaceutical need to have incredibly fine-grained control over which users are allowed to access what data, and under which specific contexts. The new role-level security capabilities will allow our customers to precisely dictate data access controls based on their business entitlements while abstracting them away from the technical complexities. The new Amazon Redshift RLS capability will enable our joint customers to model policies at the business level, deploy and enforce them via a security-as-code model, ensuring secure and consistent access to their sensitive data.”

-Ganesh Kirti, founder and CEO of TrustLogix.

Overview of row-level security in Amazon Redshift

Row-level security allows you to restrict some records to certain users or roles, depending on the content of those records. With RLS, you can define policies to enforce fine-grained row-level access control. When creating RLS policies, you specify expressions that control whether Amazon Redshift returns any existing rows in a table in a query. With RLS policies limiting access, you don’t have to add or externalize additional conditions in your queries. You can attach multiple policies to a table, and a single policy can be attached to multiple tables, making the relationship many-to-many. Once attached, an RLS policy is applied on a relation for a set of users or roles when they run SELECT, UPDATE, and DELETE operations. All attached RLS policies must together evaluate to true for a record to be returned by a query. The RBAC built-in role, security admin (secadmin), is responsible for managing the policies.

The following diagram illustrates the workflow.

With RLS, you can do the following:

  • Restrict row access based on roles – The security admin creates an RLS policy that defines whether a role can access specific records of data within a table.
  • Combine multiple policies per user or role – Multiple policies can be defined per user or role, and all policies are applied with AND syntax.
  • Enhance granular access control – RLS is built on role-based access control and can work alongside column-level access control.
  • No access if no policy applied – All data access is blocked when there is no applicable policy on an RLS-protected table.
  • Enable row-level and column-level security on the table – In the following example, the user house is part of the role staff. When house queries the table, only one record pertaining to house is returned; the rest of the records are filtered as per the RLS policy. The sensitive column is also restricted, so users from the role staff can’t see this column. User cuddy is part of the role manager. When cuddy queries the employees table, all records and columns are returned.

Row-level security relevant use cases

With row-level security, many use cases for fine-grained access controls become possible. The following are just some of the many application use cases:

  • A global company with data analysts across different countries or regions can enforce restriction of data access to analysts based on geo location due to data compliance requirements.
  • A sales department can create a policy that allows them to restrict the access to sales performance information specific to a particular salesperson or region.
  • A payroll department can create an RLS policy to restrict who can look at an individual’s payroll; managers need payroll information for their direct reports, but don’t need to know the details of payroll information for other departments.
  • A hospital can create an RLS policy that allows doctors and nurses to view data rows for their patients only.
  • A bank can create a policy to restrict access to financial data rows based on an employee’s business division or role in the company.
  • A multi-tenant application can create a policy to enforce a logical separation of each tenant’s data rows from every other tenant’s rows.

In the following example use cases, we illustrate enforcing an RLS policy on a fictitious healthcare setup. We demonstrate RLS on the medicine_data table and patients table, based on a policy established for managers, doctors, and departments. We also cover using a custom session variable context to set an RLS policy for the multi-tenant table customer.

To download the script and set up the tables, choose rls_createtable.sql.

Example 1: Read and write access

To grant read and write access, complete the following steps:

  1. Define four RLS policies using the secadmin role:
    1. all_can_see – No restrictions to be imposed
    2. hide_confidential – Restricts records for non-confidential rows
    3. only_doctors_can_see – Restricts records such that only doctors can see data
    4. see_only_own_department – Restricts records to only see data for own department
      CREATE RLS POLICY all_can_see
      USING ( true );
      
      CREATE RLS POLICY hide_confidential
      WITH ( confidential BOOLEAN )
      USING ( confidential = false )
      ;
      
      -- Note: The employees table is used as a lookup in this policy
      
      CREATE RLS POLICY only_doctors_can_see
      USING (
          true = (
                  SELECT employee_is_doctor
                  FROM employees
                  WHERE employee_username = current_user
                  )
          )
      ;
      
      GRANT SELECT ON employees
      TO RLS POLICY only_doctors_can_see;
      
      CREATE RLS POLICY see_only_own_department
      WITH ( patient_dept_id INTEGER )
      USING (
          patient_dept_id IN (
                              SELECT department_id
                              FROM employees_departments
                              WHERE employee_username = current_user
                              )
          )
      ;
      
      GRANT SELECT ON employees_departments 
      TO RLS POLICY see_only_own_department;

  2. Create three roles for STAFF, MANAGER, and EXTERNAL:
    CREATE ROLE staff;
    CREATE ROLE manager;
    CREATE ROLE external;

  3. Now we define column-level access control for the roles on the tables that implement the RLS policy:
    1. The MANAGER role can access all columns in the patients and medicine_data tables, including the confidential column that the RLS policies are defined on (the following grant also includes the EXTERNAL role, which is used later in this example):
      --- manager can see full table patients and medicine data
      GRANT SELECT ON employees, employees_departments, patients, medicine_data TO ROLE manager, ROLE external;

    2. The STAFF role can access all columns except the confidential column:
      --- staff can see limited columns from medicine data
      GRANT SELECT (medicine_name, medicine_price) ON medicine_data 
      TO ROLE staff;
      
      --- staff can see limited columns from patients
      GRANT SELECT (patient_dept_id, patient_name, patient_birthday, patient_medicine, diagnosis) ON patients TO ROLE staff;

  4. Attach RLS policies to the roles we created:
    --- manager can see all medicine data
    ATTACH RLS POLICY all_can_see
    ON medicine_data
    TO ROLE manager;
    
    --- manager can see all patient data
    ATTACH RLS POLICY all_can_see
    ON patients
    TO ROLE manager;
    
    --- staff cannot see confidential medicine data
    ATTACH RLS POLICY hide_confidential
    ON medicine_data
    TO ROLE staff;
    
    --- staff cannot see confidential patient data
    ATTACH RLS POLICY hide_confidential
    ON patients
    TO ROLE staff;
    
    --- only doctors can see patient data
    ATTACH RLS POLICY only_doctors_can_see 
    ON patients
    TO PUBLIC;
    
    --- regular staff (doctors) can see data for patients in their department only
    ATTACH RLS POLICY see_only_own_department 
    ON patients
    TO ROLE staff;

  5. Enable RLS security on objects:
    ALTER TABLE medicine_data ROW LEVEL SECURITY on;
    ALTER TABLE patients ROW LEVEL SECURITY on;

  6. Create the users and grant them roles:
    CREATE USER house PASSWORD DISABLE;
    CREATE USER cuddy PASSWORD DISABLE;
    CREATE USER external PASSWORD DISABLE;
    
    GRANT ROLE staff TO house;
    GRANT ROLE manager TO cuddy;
    GRANT ROLE external TO external;

We can see RLS in action with a SELECT query:

--- As Cuddy, who is a doctor and a manager
SET SESSION AUTHORIZATION 'cuddy';

SELECT * FROM medicine_data;
--- policies applied: all_can_see

SELECT * FROM patients;
--- policies applied: all_can_see, only_doctors_can_see

As a superuser and secadmin, you can query svv_rls_applied_policy to audit and monitor the policies applied. We discuss system views for auditing and monitoring in more detail later in this post.

--- As House, who is a doctor but not a manager - he is staff in department id 1

SET SESSION AUTHORIZATION 'house';

SELECT * FROM medicine_data;
--- column level access control applied 

SELECT current_user, medicine_name, medicine_price FROM medicine_data;
--- CLS + RLS policy = hide_confidential

SELECT * FROM patients;
--- column level access control applied

SELECT current_user, patient_dept_id, patient_name, patient_birthday, patient_medicine, diagnosis FROM patients;
--- CLS + RLS policies = hide_confidential, only_doctors_can_see, see_only_own_department

--- As External, who has no permission granted
SET SESSION AUTHORIZATION 'external';

SELECT * FROM medicine_data;
--- RLS policy applied: none - so no access

SELECT * FROM patients;
--- policies applied: none - so no access

With the UPDATE command, only the user house should be able to update patients records, as per the RLS for department 1:

SET SESSION AUTHORIZATION 'house';
UPDATE patients
SET diagnosis = 'house updated diagnosis';

select current_user, * from patients; --house should only be able to query department 1 non-confidential records

To test DELETE, as the user house, let’s delete records from the patients table. Only the two non-confidential records for patient_dept_id 1 should be deleted as per the RLS policy:

SET SESSION AUTHORIZATION 'house';
delete  from patients;

Because both the records that house has access to are deleted from patients, selecting from the table will return no records.

When we switch to the user cuddy, who is a manager and a doctor, we have access to confidential records and can see three records:

SET SESSION AUTHORIZATION 'cuddy';
SELECT current_user, * from patients;

As a security admin, you can detach a policy from a table, user, or role. In this example, we detach the policy hide_confidential from the table patients from role staff:

DETACH RLS POLICY hide_confidential ON patients FROM ROLE staff;

When the user house queries the patients table, they should now have access to confidential records:

SET SESSION AUTHORIZATION 'house';

SELECT * from patients;

Using the security admin role, you can drop the policy hide_confidential:

DROP RLS POLICY IF EXISTS hide_confidential;

Because the hide_confidential RLS policy is still attached to the medicine_data table, you get a dependency error.

To remove this policy from all the tables, users, and roles, you can use the cascade option:

DROP RLS POLICY IF EXISTS hide_confidential cascade;

When user house queries the medicine_data table, no records are returned, because the medicine_data table has RLS on and no RLS policy is attached to the role staff for this table.

SET SESSION AUTHORIZATION 'house';
SELECT * from MEDICINE_DATA;

Let’s turn off row-level security on the table medicine_data using the security admin role:

ALTER TABLE MEDICINE_DATA ROW LEVEL SECURITY OFF;
SET SESSION AUTHORIZATION 'house';

SELECT * FROM MEDICINE_DATA;

Example 2: Session context variables

Some applications require connection pooling, and you may use application-based user authentication instead of separate database users for each user. The session context variables feature in Amazon Redshift enables you to pass the application user ID to the database for applying row-level security.

Amazon Redshift now allows you to set a customized session context variable using set_config. Using the session context variable allows you to provide such granular access using RLS.

In this example, we illustrate the use case where you have a common table, customer, that holds data from several customers. The table has a column, c_customer_id, to distinguish the data for the respective customers.

  1. Create the external user and grant the external role:
    CREATE USER external_user PASSWORD 'Testemp1';
    grant role EXTERNAL to external_user;

  2. Grant SELECT on the customer table to role external:
    grant usage on schema report to role EXTERNAL;
    GRANT select ON TABLE report.customer TO ROLE EXTERNAL;

  3. Turn on row-level security for the report.customer table:
    ALTER TABLE report.customer row level security on;

  4. Create a row-level security policy using the session context variable app.customer_id to enforce the policy to filter records for c_customer_id:
    CREATE RLS POLICY see_only_own_customer_rows
    WITH ( c_customer_id char(16) )
    USING ( c_customer_id = current_setting('app.customer_id', FALSE));
    ATTACH RLS POLICY see_only_own_customer_rows ON report.customer TO ROLE EXTERNAL;

Now we can observe RLS in action. When you query the customer table with the session context set to customer ID AAAAAAAAJNGEGCBA, the row-level policy is enforced and returns only the customer rows that match the session variable value:

SET SESSION AUTHORIZATION 'external_user';

select set_config('app.customer_id', 'AAAAAAAAJNGEGCBA', FALSE);
select * from report.customer limit 10;

Auditing and monitoring RLS policies

Amazon Redshift has added several new system views to monitor row-level policies. The following list shows each system view, the users and roles that can access it, and its function.

  • SVV_RLS_POLICY (sys:secadmin) – View a list of all row-level security policies created.
  • SVV_RLS_RELATION (sys:secadmin) – View a list of all RLS-protected relations.
  • SVV_RLS_ATTACHED_POLICY (sys:secadmin) – View a list of all relations and users that have one or more row-level security policies attached on the currently connected database.
  • SVV_RLS_APPLIED_POLICY (superuser, sys:operator, or any user with the system permission ACCESS SYSTEM TABLE) – Trace the application of RLS policies on queries that reference RLS-protected relations.

Conclusion

In this post, we demonstrated how you can simplify the management of row-level security for fine-grained access control of your sensitive data building on the foundation of role-based access control. For more information about RLS best practices, refer to Amazon Redshift security overview. Try out RLS for your future Amazon Redshift implementations, and feel free to leave a comment about your use cases and experience.

Amazon Redshift Spectrum supports row-level, column-level, and cell-level access control for data stored in Amazon Simple Storage Service (Amazon S3) and managed by AWS Lake Formation. In a future post, we will show how you can implement row-level security for Redshift Spectrum tables using Lake Formation.


About the authors

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics, with AWS.

Milind Oke is a Senior Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over two decades and specializes in Amazon Redshift.

Abhilash Nagilla is a Specialist Solutions Architect, Analytics, with AWS.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has rich experience of how the customer-facing Amazon Redshift features are built from planning to launching, and always treats customers’ requirements as first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Kiran Chinta is a Software Development Manager at Amazon Redshift. He leads a strong team in query processing, SQL language, data security, and performance. Kiran is passionate about delivering products that seamlessly integrate with customers’ business applications with the right ease of use and performance. In his spare time, he enjoys reading and playing tennis.

Debu Panda is a Senior Manager, Product Management, with AWS. He is an industry leader in analytics, application platforms, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases, and has presented at multiple conferences such as AWS re:Invent, Oracle Open World, and Java One. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Optimize your analytical workloads using the automatic query rewrite feature of Amazon Redshift materialized views

Post Syndicated from Harshida Patel original https://aws.amazon.com/blogs/big-data/optimize-your-analytical-workloads-using-the-automatic-query-rewrite-feature-of-amazon-redshift-materialized-views/

Amazon Redshift materialized views enable you to significantly improve performance of complex queries that are frequently run as part of your extract, load, and transform (ELT), business intelligence (BI), or dashboarding applications. Materialized views precompute and store the result sets of the SQL query in the view definition. Materialized views speed up data access, because the query doesn’t need to rerun the computation each time the query runs, which also reduces the resource consumption.

Amazon Redshift can automatically rewrite SQL queries that don’t explicitly reference an existing materialized view to use one if doing so will improve performance. This feature is valuable and, in some cases, the only option for performance optimization. Consider packaged ISV apps or even just reports: users often don’t have access to the SQL to optimize. In some cases, even if they do have access, the code or script is so old that nobody is familiar with it and you don’t know what regressions even a small change might introduce.

In this post, we describe how the automatic query rewrite feature works and some scenarios where you could take advantage of this feature. For information about the materialized view feature itself, refer to Speed up your ELT and BI queries with Amazon Redshift materialized views and Creating materialized views in Amazon Redshift.

All examples in this post are run on an 8-node ra3.4xlarge cluster with the 3 TB TPC-DS cloud benchmark dataset.

Let’s look at three different scenarios where the automatic query rewrite feature could help: optimizing joins between two large tables, optimizing joins for tables that have multiple join paths, and optimizing table scans.

Optimize joins between two large tables

There are many situations where you have two large tables that are joined frequently. In this case, creating a materialized view that joins these two tables could help improve the performance of those queries. Materialized views precompute the join and store the results so subsequent runs only need to retrieve the saved results; no need to run the expensive JOINs each time. With automatic query rewrite, none of the end-user queries have to be modified to refer to the materialized view. When creating the explain plan for the query, Amazon Redshift replaces the join between the two tables with the materialized view.

By default, the automatic query rewrite uses a materialized view only if it’s up to date and reflects all changes from its base tables. This means that the query isn’t rewritten to use the materialized view if the base tables have more recent updates that aren’t yet reflected in the materialized view.

For example, consider the following SQL query. The query joins two tables: store_sales (8,639,936,081 rows) and customer (30,000,000 rows):

SELECT 
cust.c_customer_id 
FROM store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
GROUP BY cust.c_customer_id;

The query runs in 545,520 milliseconds; the following is the explain plan for the query:

XN HashAggregate  (cost=9602679386653.98..9602679386653.98 rows=29705556 width=20)
  ->  XN Hash Join DS_BCAST_INNER  (cost=375000.00..9602659714194.54 rows=7868983773 width=20)
        Hash Cond: (""outer"".ss_customer_sk = ""inner"".c_customer_sk)
        ->  XN Seq Scan on store_sales sales  (cost=0.00..86399365.12 rows=8245454518 width=4)
              Filter: (ss_customer_sk IS NOT NULL)
        ->  XN Hash  (cost=300000.00..300000.00 rows=30000000 width=24)
              ->  XN Seq Scan on customer cust  (cost=0.00..300000.00 rows=30000000 width=24)

Let’s create a materialized view that pre-computes the join between the store_sales and customer tables using the following SQL statement:

CREATE MATERIALIZED VIEW cust_store_sales
AS 
SELECT         
  cust.c_customer_id
, cust.c_first_name
, cust.c_last_name
, sales.ss_item_sk
, sales.ss_quantity
, cust.c_current_addr_sk
FROM  store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk;

Let’s now rerun the original query:

SELECT 
cust.c_customer_id 
FROM store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
GROUP BY cust.c_customer_id;

The query runs much faster (46,493 milliseconds). This is because of the automatic query rewrite feature, which has rewritten the preceding query to use the newly created materialized view instead of joining both tables. The explain plan for this query shows this change:

XN HashAggregate  (cost=103138905.60..103138905.60 rows=29705556 width=20)
  ->  XN Seq Scan on mv_tbl__cust_store_sales__0 derived_table1  (cost=0.00..82511124.48 rows=8251112448 width=20)

The original query run also consumed 1,263 CPU seconds and read 45,013 blocks of data, whereas the query that ran after the creation of the materialized view only consumed 898 CPU seconds and read 29,256 blocks. That is a reduction of 29% in CPU consumption and 35% in blocks read.

The optimizer can also rewrite the following query to use the previously created materialized view, which includes the additional join to the customer_address table:

SELECT
cust.c_customer_id
,addr.ca_state
FROM store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
INNER JOIN customer_address addr
ON cust.c_current_addr_sk = addr.ca_address_sk
GROUP BY cust.c_customer_id, addr.ca_state;
     
XN HashAggregate  (cost=30242919089.37..30242919089.37 rows=1544688912 width=26)
  ->  XN Hash Join DS_BCAST_INNER  (cost=542661.20..30201663527.13 rows=8251112448 width=26)
        Hash Cond: ("outer".c_current_addr_sk = "inner".ca_address_sk)
        ->  XN Seq Scan on mv_tbl__cust_store_sales_1__0 derived_table1  (cost=0.00..82511124.48 rows=8251112448 width=24)
        ->  XN Hash  (cost=150000.00..150000.00 rows=15000000 width=10)
              ->  XN Seq Scan on customer_address addr  (cost=0.00..150000.00 rows=15000000 width=10)

Optimize joins for tables that have multiple join paths

For large tables on Amazon Redshift, the ideal distribution style would be ‘KEY’, with the distribution key being the column that is used most frequently in the JOIN clause. There are situations where some large tables have multiple join paths. 50% of the queries may use a particular column to join to the table, and the other 50% of the queries may use a different column to join to the table. Both types of queries are important and have stringent performance requirements. In this case, you could pick one column as the distribution key for the table and then create a materialized view with the second column as the distribution key. This is possible because materialized views can have their own distribution and sort keys.

Here’s an example to illustrate how this works.

The web_sales table (2,159,968,881 rows) has the distribution key ws_order_number. This helps optimize a majority of the queries (70% of the joins to this table use ws_order_number as the join column). The remaining 30% use the column ws_bill_customer_sk to join to the table, as shown in the following SQL statement. This query took 12,790 milliseconds to run.

SELECT 
  c_customer_id
, c_email_address 
FROM web_sales ws
INNER JOIN customer cs
ON ws.ws_bill_customer_sk=cs.c_customer_sk;

We can create the materialized view to help improve the performance of the remaining 30% of the queries. Note the DISTKEY keyword in the following code. We have defined a new distribution key for the materialized view (ws_bill_customer_sk):

CREATE MATERIALIZED VIEW web_sales_cust_dist
DISTKEY (ws_bill_customer_sk)
AS
SELECT * FROM web_sales;

Rerunning the following query returns rows much faster than before (7,715 milliseconds vs. 12,790 milliseconds):

SELECT 
  c_customer_id
, c_email_address 
FROM web_sales ws
INNER JOIN customer cs
ON ws.ws_bill_customer_sk=cs.c_customer_sk;

Again, the explain plan of the query has changed; it now references the materialized view even though the SQL statement doesn’t explicitly reference the materialized view:

XN Hash Join DS_DIST_NONE  (cost=375000.00..696964927.69 rows=2159968768 width=74)
  Hash Cond: (""outer"".ws_bill_customer_sk = ""inner"".c_customer_sk)
  ->  XN Seq Scan on mv_tbl__web_sales_cust_dist__0 derived_table1  (cost=0.00..21599687.68 rows=2159968768 width=4)
  ->  XN Hash  (cost=300000.00..300000.00 rows=30000000 width=78)
        ->  XN Seq Scan on customer cs  (cost=0.00..300000.00 rows=30000000 width=78)

Optimize table scans

Table scans on Amazon Redshift are made efficient through the use of sort keys. Sort keys determine the order in which the columns are stored in the data blocks. Picking a column that appears frequently in your filtering conditions as a sort key can improve query performance significantly.

Compound sort keys with multiple columns can be defined on your table in case multiple columns are good candidates for sort keys. But in some situations where two or more high cardinality columns are sort key candidates, the compound sort key may not provide adequate performance. In these cases, a materialized view could be created with a different sort key to maintain that data in an alternate sorted order to help cater to a subset of the queries.

In the following example query, the web_sales table uses the column ws_sold_date_sk for the sort key, because this is the column that is used commonly for filtering rows. A smaller set of queries use ws_sales_price for filtering rows. Given that both ws_sold_date_sk and ws_sales_price are high cardinality columns with lots of unique values, a compound sort key with both columns may not be performant for all query patterns.

SELECT *
FROM web_sales 
WHERE ws_sales_price BETWEEN 50 AND 100;

Let’s create the following materialized view and see how it can help improve the performance of the preceding query:

CREATE MATERIALIZED VIEW web_sales_sort_on_price
SORTKEY (ws_sales_price)
AS
SELECT * FROM web_sales;

Running the following query returns rows much faster (5 milliseconds vs. 3,548 milliseconds) because the automatic query rewrite is using the materialized view:

SELECT *
FROM web_sales 
WHERE ws_sales_price BETWEEN 50 AND 100;

The following is the new explain plan:

XN Seq Scan on mv_tbl__web_sales_cust_dist__0 derived_table1  (cost=0.00..32399531.52 rows=10799844 width=260)
  Filter: ((ws_sales_price <= 100.00) AND (ws_sales_price >= 50.00))

Conclusion

Materialized views on Amazon Redshift can be a powerful optimization tool if used appropriately. With automatic query rewrite, you can optimize queries without any impact to end-users or their queries. This allows you to create materialized views after the application has gone live. Some customers plan this as part of their performance-optimization strategy when building new apps. The real value is that you can optimize queries and workloads without needing to modify the source code or scripts, and you can benefit even with a partial match.


About the Authors

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.

Jeetesh Srivastva is a Sr. Manager, Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions using Amazon Redshift and other AWS Analytic services. He has worked to deliver on-premises and cloud-based analytic solutions for customers in banking and finance and hospitality industry verticals.

Sain Das is an Analytics Specialist Solutions Architect at AWS and helps customers build scalable cloud solutions that help turn data into actionable insights.

Somdeb Bhattacharjee is an Enterprise Solutions Architect at AWS.