All posts by Sumeet Joshi

Accelerate resize and encryption of Amazon Redshift clusters with Faster Classic Resize

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/accelerate-resize-and-encryption-of-amazon-redshift-clusters-with-faster-classic-resize/

Amazon Redshift has improved the performance of the classic resize feature and increased the flexibility of the cluster snapshot restore operation. You use the classic resize operation to resize a cluster when you need to change the instance type or transition to a configuration that elastic resize can't support. Previously, a classic resize could take the cluster offline for many hours, but now the cluster is typically available to process queries within minutes. Clusters can also be resized when restoring from a snapshot, although some restrictions apply in that case.

You can now also restore an encrypted cluster from an unencrypted snapshot or change the encryption key. Amazon Redshift uses AWS Key Management Service (AWS KMS) as an encryption option to provide an additional layer of data protection by securing your data from unauthorized access to the underlying storage. Now you can encrypt an unencrypted cluster with a KMS key faster by simply specifying a KMS key ID when modifying the cluster. You can also restore an AWS KMS-encrypted cluster from an unencrypted snapshot. You can access the feature via the AWS Management Console, SDK, or AWS Command Line Interface (AWS CLI). Note that these features apply only to clusters or target clusters that use the RA3 node type.

In this post, we show you how the updated classic resize option works and how it significantly reduces the time it takes to resize or encrypt your cluster. We also walk through the steps to resize your Amazon Redshift cluster using Faster Classic Resize.

Existing resize options

We’ve worked closely with our customers to learn how their needs evolve as their data scales or as their security and compliance requirements change. To address and meet your ever-growing demands, you often have to resize your Amazon Redshift cluster and choose an optimal instance type that delivers the best price/performance. As of this writing, there are three ways you can resize your clusters: elastic resize, classic resize, and the snapshot, restore, and resize method.

Among the three options, elastic resize is the fastest available resize mechanism because it remaps slices instead of copying the full dataset. Classic resize is used primarily when the target configuration is outside the slice ranges allowed by elastic resize, or when the encryption status of the cluster needs to change. Let's briefly discuss these scenarios before describing how the new migration process helps.

Current limitations

The current resize options have a few limitations of note.

  • Configuration changes – Elastic resize supports the following RA3 configuration changes by design. When you need a target cluster configuration outside the ranges listed in the following table, you have to choose classic resize.
Node Type | Growth Limit | Reduction Limit
ra3.16xlarge | 4x (from 4 to 16 nodes, for example) | To one-quarter of the number (from 16 to 4 nodes, for example)
ra3.4xlarge | 4x | To one-quarter of the number
ra3.xlplus | 2x (from 4 to 8 nodes, for example) | To one-quarter of the number

Also, elastic resize can’t be performed if the current cluster is a single-node cluster or isn’t running on an EC2-VPC platform. These scenarios also drive customers to choose classic resize.

  • Encryption changes – You may need to encrypt your Amazon Redshift cluster based on security, compliance, and data consumption requirements. Previously, to modify encryption on an Amazon Redshift cluster, Amazon Redshift used classic resize technology, which internally performs a deep copy of the entire dataset and rewrites it with the desired encryption state. To avoid changes during the deep copy, the source cluster is placed in read-only mode for the entire operation, which can take a few hours to days depending on the dataset size, and you might be locked out altogether if the data warehouse is down for a resize. As a result, administrators or application owners can't meet the Service Level Agreements (SLAs) that they have set with their business stakeholders.

Switching to the Faster Classic Resize approach helps speed up the migration when turning on encryption. Encryption has been a requirement to enable cross-account and cross-Region data sharing on previously unencrypted clusters, and for integrations with AWS Data Exchange for Amazon Redshift. Additionally, Amazon Redshift Serverless is encrypted by default, so to create a data share from a provisioned cluster to Redshift Serverless, the provisioned cluster must be encrypted as well. This is one more compelling requirement for Faster Classic Resize.

Faster Classic Resize

Faster Classic Resize uses a mechanism similar to elastic resize but provides the functionality of classic resize, offering the best of both approaches. Unlike classic resize, which extracts tuples from the source cluster and inserts them into the target cluster, Faster Classic Resize doesn't extract tuples. Instead, it starts from a snapshot, and the data blocks are copied over to the target cluster.

The new Faster Classic Resize operation involves two stages:

  • Stage 1 (Critical path) – The first stage migrates the data from the source cluster to the target cluster, during which the source cluster is in read-only mode. This stage is typically very short, after which the cluster is made available for reads and writes.
  • Stage 2 (Off critical path) – The second stage redistributes the data according to the previous data distribution style. This process runs in the background, off the critical path of the migration from the source to the target cluster. The duration of this stage depends on the volume of data to redistribute, the cluster workload, and so on.

Let’s see how Faster Classic Resize works with configuration changes, encryption changes, and restoring an unencrypted snapshot into an encrypted cluster.

Prerequisites

Complete the following prerequisite steps:

  1. Take a snapshot from the current cluster or use an existing snapshot.
  2. Provide the AWS Identity and Access Management (IAM) role credentials that are required to run the AWS CLI. For more information, refer to Using identity-based policies (IAM policies) for Amazon Redshift.
  3. For encryption changes, create a KMS key if none exists. For instructions, refer to Creating keys.

Configuration changes

As of this writing, you can initiate Faster Classic Resize through the AWS CLI when the required cluster configuration change is beyond what elastic resize allows; the feature will be added to the console in a future release. You can use it to change your cluster configuration from DC2, DS2, or RA3 node types to any RA3 node type. However, changing from RA3 to DC2 or DS2 is not supported yet.

We did a benchmark on Faster Classic Resize with different cluster combinations and volumes. The following table summarizes the results comparing critical paths in classic resize and Faster Classic Resize.

Volume | Source Cluster | Target Cluster | Classic Resize Duration (min) | Faster Classic Resize Stage 1 Duration (min) | % Faster
10 TB | ra3.4xlarge – 6 nodes | ra3.16xlarge – 8 nodes | 78 | 11 | 86%
10 TB | ra3.16xlarge – 8 nodes | ra3.4xlarge – 2 nodes | 738 | 11 | 99%
10 TB | dc2.8xlarge – 6 nodes | ra3.4xlarge – 2 nodes | 706 | 8 | 99%
3 TB | ra3.4xlarge – 2 nodes | ra3.16xlarge – 4 nodes | 53 | 11 | 79%
3 TB | ra3.16xlarge – 4 nodes | ra3.4xlarge – 2 nodes | 244 | 7 | 97%
3 TB | dc2.8xlarge – 6 nodes | ra3.4xlarge – 2 nodes | 251 | 7 | 97%

The Faster Classic Resize option consistently completed in significantly less time and made the cluster available for read and write operations in a short time. Classic resize took a longer time in all cases and kept the cluster in read-only mode, making it unavailable for writes. Also, the classic resize duration is comparatively longer when the target cluster configuration is smaller than the original cluster configuration.

Perform Faster Classic Resize

As of this writing, the feature isn’t available on the console. You can use either of the following two methods to resize your cluster using Faster Classic Resize via the AWS CLI.

  • Modify cluster method – Resize an existing cluster without changing the endpoint

The following are the steps involved:

    1. Take a snapshot on the current cluster prior to performing the resize operation.
    2. Determine the target cluster configuration and run the following command from the AWS CLI:
      aws redshift modify-cluster --region <CLUSTER REGION> \
      --endpoint-url https://redshift.<CLUSTER REGION>.amazonaws.com/ \
      --cluster-identifier <CLUSTER NAME> \
      --cluster-type multi-node \
      --node-type <TARGET INSTANCE TYPE> \
      --number-of-nodes <TARGET NUMBER OF NODES>

      For example:

      aws redshift modify-cluster --region us-east-1 \
      --endpoint-url https://redshift.us-east-1.amazonaws.com/ \
      --cluster-identifier my-cluster-identifier \
      --cluster-type multi-node \
      --node-type ra3.16xlarge \
      --number-of-nodes 12

  • Snapshot restore method – Restore an existing snapshot to the new cluster with the new cluster endpoint

The following are the steps involved:

    1. Identify the snapshot for restore and a unique name for the new cluster.
    2. Determine the target cluster configuration and run the following command from the AWS CLI:
      aws redshift restore-from-cluster-snapshot --region <CLUSTER REGION> \
      --endpoint-url https://redshift.<CLUSTER REGION>.amazonaws.com/ \
      --snapshot-identifier <SNAPSHOT ID> \
      --cluster-identifier <CLUSTER NAME> \
      --node-type <TARGET INSTANCE TYPE> \
      --number-of-nodes <TARGET NUMBER OF NODES>

      For example:

      aws redshift restore-from-cluster-snapshot --region us-east-1 \
      --endpoint-url https://redshift.us-east-1.amazonaws.com/ \
      --snapshot-identifier rs:sales-cluster-2022-05-26-16-19-36 \
      --cluster-identifier my-new-cluster-identifier \
      --node-type ra3.16xlarge \
      --number-of-nodes 12
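
To check the status of the resize from the AWS CLI, you can call describe-resize for the cluster being modified (or the newly restored cluster). The following is a minimal sketch that assumes the example identifier from the modify cluster method; substitute your own cluster identifier and Region:

aws redshift describe-resize --region us-east-1 \
--cluster-identifier my-cluster-identifier

The command returns the resize status along with details such as the target node type and node count.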

Monitor the resize process

Monitor the progress through the cluster management console. You can also check the events generated by the resize process. The resize completion status is logged in events along with the duration it took for the resize. The following screenshot shows an example.

Note that you may observe longer query times during the second stage of Faster Classic Resize. During the first stage, data for tables with the KEY distribution style is transferred with EVEN distribution; in stage 2, a background process redistributes that data back to its original KEY distribution style (the distribution style it had before the resize). You can monitor the progress of this background process by querying the stv_xrestore_alter_queue_state table. Tables with ALL, AUTO, or EVEN distribution styles don't require redistribution after the resize, so they aren't logged in stv_xrestore_alter_queue_state; the counts you observe there are only for tables that had the KEY distribution style before the resize operation.

See the following example query:

select db_id, status, count(*) from stv_xrestore_alter_queue_state group by 1,2 order by 3 desc

The following table shows that data redistribution is finished for 60 tables, pending for 323 tables, and in progress for 1 table.

We ran tests to assess time to complete the redistribution. For 10 TB of data, it took approximately 5 hours and 30 minutes on an idle cluster. For 3 TB, it took approximately 2 hours and 30 minutes on an idle cluster. The following is a summary of tests performed on larger volumes:

  • A snapshot with 100 TB where 70% of blocks need redistribution would take 10–40 hours
  • A snapshot with 450 TB where 70% of blocks need redistribution would take 2–8 days
  • A snapshot with 1,600 TB where 70% of blocks need redistribution would take 7–27 days

The actual time to complete redistribution is largely dependent on data volume, cluster idle cycles, target cluster size, data skewness, and more. Therefore, we recommend performing Faster Classic Resize when there is enough of an idle window (such as weekends) for the cluster to perform redistribution.

Encryption changes

You can change the encryption of your Amazon Redshift cluster from the console (using the modify cluster method) or through the AWS CLI (using the snapshot restore method). Amazon Redshift performs the encryption change using Faster Classic Resize. The operation takes only a few minutes to complete, and your cluster remains available for both reads and writes. With Faster Classic Resize, you can change an unencrypted cluster to an encrypted cluster or change the encryption key using the snapshot restore method.

For this post, we show how you can change the encryption using the Amazon Redshift console. To test the timings, we created multiple Amazon Redshift clusters using TPC-DS data. The Faster Classic Resize option consistently completed in significantly less time and made clusters available for read and write operations faster. Classic resize took a longer time in all cases and kept the cluster in read-only mode. The following table contains the summary of the results.

Data Volume | Cluster | Encryption (Classic Resize) Duration (min) | Encryption (Faster Classic Resize) Duration (min) | % Faster
10 TB | ra3.4xlarge – 2 nodes | 580 | 11 | 98%
10 TB | ra3.xlplus – 2 nodes | 680 | 16 | 98%
3 TB | ra3.4xlarge – 2 nodes | 527 | 9 | 98%
3 TB | ra3.xlplus – 2 nodes | 570 | 10 | 98%

Now, let’s perform the encryption change from an unencrypted cluster to an encrypted cluster using the console. Complete the following steps:

  1. On the Amazon Redshift console, navigate to your cluster.
  2. On the Properties tab, on the Edit drop-down menu, choose Edit encryption.
  3. For Encryption, select Use AWS Key Management Service (AWS KMS).
  4. For AWS KMS, select Default Redshift key.
  5. Choose Save changes.
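
The console steps above use the modify cluster method. If you prefer to script the same change, a modify-cluster call along the following lines can be used; this is a minimal sketch in which the cluster identifier and KMS key ID are placeholders for your own values:

aws redshift modify-cluster --region us-east-1 \
--cluster-identifier my-cluster-identifier \
--encrypted \
--kms-key-id <KMS KEY ID>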

You can monitor the progress of your encryption change on the Events tab. As shown in the following screenshot, the entire process to change the encryption took approximately 11 minutes.

Restore an unencrypted snapshot to an encrypted cluster

As of this writing, the Faster Classic Resize option to restore an unencrypted snapshot into an encrypted cluster or to change the encryption key is available only through the AWS CLI. When triggered, the restored cluster operates in read/write mode immediately. The encryption of restored blocks that were unencrypted happens in the background, and newly ingested data blocks are written encrypted.

Restore the snapshot into a new cluster using the following command (replace the indicated parameter values; --encrypted and --kms-key-id are required):

aws redshift restore-from-cluster-snapshot \
--cluster-identifier <CLUSTER NAME> \
--snapshot-identifier <SNAPSHOT ID> \
--region <AWS REGION> \
--encrypted \
--kms-key-id <KMS KEY ID> \
--cluster-subnet-group-name <SUBNET GROUP>
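
After the restore completes, you can confirm the encryption state of the new cluster with describe-clusters. This is a quick check, assuming the cluster name used in the previous command:

aws redshift describe-clusters --region <AWS REGION> \
--cluster-identifier <CLUSTER NAME> \
--query "Clusters[0].{Status:ClusterStatus,Encrypted:Encrypted,KmsKeyId:KmsKeyId}"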

When to use which resize option

The following flow chart provides guidance on which resize option is recommended when changing your cluster encryption status or resizing to a new cluster configuration.

Summary

In this post, we talked about the improved performance of Amazon Redshift’s classic resize feature and how Faster Classic Resize significantly improves your ability to scale your Amazon Redshift clusters using the classic resize method. We also talked about when to use different resize operations based on your requirements. We demonstrated how it works from the console (for encryption changes) and using the AWS CLI. We also showed the results of our benchmark test and how it significantly improves the migration time for configuration changes and encryption changes for your Amazon Redshift cluster.

To learn more about resizing your clusters, refer to Resizing clusters in Amazon Redshift. If you have any feedback or questions, please leave them in the comments.


About the authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 17 years of experience in the data warehousing and analytical space.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Krishna Chaitanya Gudipati is a Senior Software Development Engineer at Amazon Redshift. He has been working on distributed systems for over 14 years and is passionate about building scalable and performant systems. In his spare time, he enjoys reading and exploring new places.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has a rich experience of how the customer-facing Amazon Redshift features are built from planning to launching, and always treats customers’ requirements as first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Federate single sign-on access to Amazon Redshift query editor v2 with Okta

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/federate-single-sign-on-access-to-amazon-redshift-query-editor-v2-with-okta/

Amazon Redshift query editor v2 is a web-based SQL client application that you can use to author and run queries on your Amazon Redshift data warehouse. You can visualize query results with charts and collaborate by sharing queries with members of your team. You can use query editor v2 to create databases, schemas, tables, and load data from Amazon Simple Storage Service (Amazon S3) using the COPY command or by using a wizard. You can browse multiple databases and run queries on your Amazon Redshift data warehouse or data lake, or run federated queries to operational databases such as Amazon Aurora. Because it’s a managed SQL editor in your browser and it’s integrated with your single sign-on (SSO) provider, Amazon Redshift query editor v2 reduces the number of steps to the first query so you gain insights faster.

Amazon Redshift query editor v2 integration with your identity provider (IdP) automatically redirects the user’s browser to the query editor v2 console instead of Amazon Redshift console. This enables your users to easily access Amazon Redshift clusters through query editor v2 using federated credentials without managing database users and passwords.

In this post, we focus on Okta as the IdP and illustrate how to set up your Okta application and AWS Identity and Access Management (IAM) permissions. We also demonstrate how you can limit the access for your users to use only query editor v2 without giving them access to perform any admin functions on the AWS Management Console.

Solution overview

The high-level steps in this post are as follows:

  1. Set up Okta, which contains your users organized into logical groups, and add the AWS Account Federation application.
  2. Set up two IAM roles: one that establishes a trust relationship between your IdP and AWS, and a second role that Okta uses to access Amazon Redshift.
  3. Complete Okta advanced configuration:
    1. Finalize the Okta configuration by inputting the roles that you just created.
    2. Define a default relay state to direct users to Amazon Redshift query editor v2 after successful SAML authentication.
    3. Configure the SAML PrincipalTagAttribute. This element allows you to pass attributes as session tags in the SAML assertion. For more information about session tags, see Passing session tags in AWS STS.
  4. Set up Amazon Redshift database groups:
    1. Create groups within the Amazon Redshift database to match the Okta groups.
    2. Authorize these groups to access certain schemas and tables.
    3. Access Amazon Redshift query editor v2 using your enterprise credentials and query your Amazon Redshift database.
  5. Sign in to your Okta account and access the application assigned to you. The application directs you to Amazon Redshift query editor v2 using federated login.
  6. Access and query your Amazon Redshift database.

Prerequisites

This post assumes that you have the following prerequisites:

Set up Okta

First, we set up the Okta application and create users and groups. Complete the following steps:

  1. Log in to your Okta admin console using the URL https://<prefix>-admin.okta.com/admin/dashboard, where <prefix> is specific to your account and was created at account setup.
  2. On the admin console, choose Admin.
  3. Under Directory in the navigation pane, choose People.
  4. To add users, choose Add person.
    The following screenshot shows the users that we created.
  5. To add groups into Okta, choose Groups in the navigation pane, then choose Add group.
    The following screenshot shows two groups that we created. We added Jane to analyst_users and Mike to bi_users.
  6. Under Applications in the navigation pane, choose Applications and choose Browse App Catalog.
  7. Search for AWS Account Federation and choose Add.
  8. After you add the application, choose AWS Account Federation.
  9. Leave the values in General Settings at their default and choose Next.
  10. Under Sign-On Options, select SAML 2.0.
  11. Choose the Identity Provider metadata link to download the metadata file in .xml format.

Configure IAM roles

Next, you set up an IAM role that establishes a trust relationship between the IdP and AWS. You also create an IAM role that Okta uses to access Amazon Redshift query editor v2.

  1. On the IAM console, under Access management in the navigation pane, choose Identity providers.
  2. Choose Add provider.
  3. For Provider type, select SAML.
  4. For Provider name, enter a name.
  5. Choose Choose file and upload the metadata file you downloaded.
  6. Choose Add provider.

    Now you create the IAM SAML 2.0 federation role.
  7. On the IAM console, choose Roles in the navigation pane.
  8. Choose Create role.
  9. For Trusted entity type, select SAML 2.0 federation.
  10. For SAML 2.0-based provider, choose the IdP you created in the previous step.
  11. Select Allow programmatic and AWS Management Console access.
  12. Choose Next and then choose Create Policy.

Amazon Redshift query editor v2 provides multiple managed policies to access the query editor. For a list of managed policies, refer to Configuring your AWS account. The managed policy enables you to limit the access for your users to use only query editor v2 without giving them access to perform any admin functions on the console. For this post, we use the AmazonRedshiftQueryEditorV2ReadSharing managed policy and create a custom policy.

In the following code, provide your Region, account, and cluster parameters to grant access to Amazon Redshift to get cluster credentials, create users, and allow users to join groups:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RedshiftClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials",
                "redshift:CreateClusterUser",
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<cluster>,                 
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${aws:PrincipalTag/RedshiftDbUser}",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/analyst_users",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/bi_users",
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/${redshift:DBName}"
            ]
        }
    ]
}

The group membership lasts only for the duration of the user session. Additionally, there is no CreateGroup permission because groups need to be manually created and granted DB privileges.

  1. Attach the policy you created to the role.
    The following screenshot shows the summary page for the role.
  2. Modify the trust relationships for your role and add the sts:TagSession permission.
    When using session tags, trust policies for all roles connected to the IdP passing tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails. A sample trust policy is shown after these steps.
  3. Choose Update policy.
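
For reference, the following is a minimal sketch of what the resulting trust policy can look like; the account ID and SAML provider name are placeholders for the identity provider you created earlier:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<account>:saml-provider/<provider-name>"
            },
            "Action": [
                "sts:AssumeRoleWithSAML",
                "sts:TagSession"
            ],
            "Condition": {
                "StringEquals": {
                    "SAML:aud": "https://signin.aws.amazon.com/saml"
                }
            }
        }
    ]
}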

Set up Okta advanced configuration

In this section, you finalize the Okta configuration by adding the IAM roles that you just created. You set up SAML PrincipalTag attributes such as RedshiftDbUser and RedshiftDbGroups, which are passed in the SAML assertion to federate the access to Amazon Redshift query editor v2. You also define a default relay state, which is the URL that users are directed to after a successful authentication through SAML.

  1. In your Okta account, open the AWS Account Federation app.
  2. On the Sign On tab, set Default Relay State to the query editor URL, using the format https://<region>.console.aws.amazon.com/sqlworkbench/home. For this post, we use https://eu-west-1.console.aws.amazon.com/sqlworkbench/home.
  3. Choose Attributes and set up the following attribute mappings:
    1. Set the DB user using PrincipalTag:RedshiftDbUser. This uses the user name in the directory. This is a required tag and defines the database user that is used by query editor v2.
    2. Set the DB groups using PrincipalTag:RedshiftDbGroups. This uses the Okta groups to fill the principal tags. Its value must be a colon-separated list.
    3. Set the transitive keys using TransitiveTagKeys. This prevents users from changing the session tags in case of role chaining.

These tags are forwarded to the redshift:GetClusterCredentials API to get credentials for your cluster. The following table summarizes their attribute configuration.

Name | Name Format | Value
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser | Unspecified | user.username
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbGroups | Unspecified | String.join(":", isMemberOfGroupName("analyst_users") ? 'analyst_users' : '', isMemberOfGroupName("bi_users") ? 'bi_users' : '')
https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys | Unspecified | Arrays.flatten("RedshiftDbUser", "RedshiftDbGroups")

  1. Under Advanced Sign-on Settings, select Use Group Mapping.
  2. Enter the IdP and IAM role ARNs, which are globally unique, and make sure that Okta is directed to your AWS account.
  3. Authorize users to use the AWS Account Federation application by selecting their respective groups or individual user accounts. In this example, we authorized users by group.

Set up Amazon Redshift database groups

Next, you set up groups in the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables.

  1. Log in to your Amazon Redshift cluster with an admin account.
  2. Create groups that match the IdP group names, and grant the appropriate permissions to tables and schemas:
CREATE GROUP analyst_users;
CREATE GROUP bi_users;

ALTER DEFAULT PRIVILEGES IN SCHEMA sales_analysis
GRANT SELECT on TABLES to GROUP analyst_users;
GRANT USAGE on SCHEMA sales_analysis to GROUP analyst_users;
GRANT SELECT on ALL TABLES in SCHEMA sales_analysis to GROUP analyst_users;

ALTER DEFAULT PRIVILEGES IN SCHEMA sales_bi
GRANT SELECT on TABLES to GROUP bi_users;
GRANT USAGE on SCHEMA sales_bi to GROUP bi_users;
GRANT SELECT on ALL TABLES in SCHEMA sales_bi to GROUP bi_users;

In Okta, you created the user Jane and assigned Jane to the analyst_users group.

In the Amazon Redshift database, you created two database groups: analyst_users and bi_users.

When user Jane logs in via federated authentication to Amazon Redshift using query editor v2, the user is created if it doesn't already exist and the analyst_users database group is assumed. The user Jane can query tables only in the sales_analysis schema.

Because user Jane isn’t part of the bi_users group, when they try to access the sales_bi schema, they get a permission denied error.

The following diagram illustrates this configuration.

Access Amazon Redshift query editor v2

Now you're ready to connect to your Amazon Redshift cluster with Amazon Redshift query editor v2 using federated login. Log in to your Okta account with your user credentials, and under My Apps, choose Amazon Redshift Query Editor V2.

You’re redirected to the Amazon Redshift query editor v2 URL that you specified as the default relay state.

Connect to an Amazon Redshift database and run queries

Now let’s set up the connection to your Amazon Redshift cluster.

  1. In the query editor, choose your cluster (right-click) and choose Create connection.
  2. For Database, enter a name.
  3. For Authentication, select Federated user.
    The user name is prepopulated with your federated login information.
  4. Choose Create connection.

When you’re connected to your Amazon Redshift database, you can verify the connection details, as shown in the following screenshot. Notice the session-level group association as per the group assignment in your Okta application configuration. In this case, user Jane is assigned to the analyst_users group.

This user has access to SELECT all tables in the sales_analysis schema and no access to the sales_bi schema. You can run the following statements to test your access.
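
For example, the following queries exercise both schemas. The store_sales_us table is the one referenced in the following screenshot; the sales_bi table name is only a placeholder for one of your own tables:

-- Allowed: the analyst_users group has SELECT on the sales_analysis schema
SELECT * FROM sales_analysis.store_sales_us LIMIT 10;

-- Denied: Jane is not in the bi_users group, so this returns a permission denied error
SELECT * FROM sales_bi.sales_summary LIMIT 10;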

The following screenshot shows the results from a query to the sales_analysis.store_sales_us table.

When user Jane tries to access the tables in the sales_bi schema, they get a permission denied error.

Summary

In this post, we demonstrated how to federate SSO access to Amazon Redshift query editor v2 using Okta as your IdP. We showed how to set up Okta, set different PrincipalTag attributes for query editor v2, and pass group memberships defined in your Okta IdP to your Amazon Redshift cluster. We showed how to log in to Amazon Redshift query editor v2 using federated login and validate the configuration by running a few queries. This solution allows you to control access to Amazon Redshift database objects, and your users can easily access Amazon Redshift clusters through query editor v2 using federated credentials without managing database users and passwords.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in the data warehousing and analytical space.

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data and analytics for over 14 years. His LinkedIn profile can be found here.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Yanis Telaoumaten is a Software Development Engineer at AWS. His passions are building reliable software and creating tools to allow other engineers to work more efficiently. In the past years, he worked on identity, security and reliability of Redshift services

Federate access to Amazon Redshift query editor V2 with Active Directory Federation Services (AD FS): Part 3

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/federate-access-to-amazon-redshift-query-editor-v2-with-active-directory-federation-services-ad-fs-part-3/

In the first post of this series, Federate access to your Amazon Redshift cluster with Active Directory Federation Services (AD FS): Part 1, you set up Microsoft Active Directory Federation Services (AD FS) and Security Assertion Markup Language (SAML) based authentication and tested the SAML federation using a web browser.

In Part 2, you learned to set up an Amazon Redshift cluster and use federated authentication with AD FS to connect from a JDBC SQL client tool.

In this post, we walk through the steps to configure Amazon Redshift query editor v2 to work with AD FS federation SSO.

Organizations want to enable their end-users such as data analysts, data scientists, and database developers to use the query editor v2 to accelerate self-service analytics. Amazon Redshift query editor v2 lets users explore, analyze, and collaborate on data. You can use the query editor to create databases, schemas, tables, and load data from Amazon Simple Storage Service (Amazon S3) using the COPY command or by using a wizard. You can browse multiple databases and run queries on your Amazon Redshift data warehouse or data lake, or run federated queries to operational databases such as Amazon Aurora.

In this post, we show how you can use your corporate Active Directory (AD) and the SAML 2.0 AD FS identity provider (IdP) to enable your users to easily access Amazon Redshift clusters through query editor v2 using corporate user names without managing database users and passwords. We also demonstrate how you can limit the access for your users to use only the query editor without giving them access to perform any admin functions on the AWS Management Console.

Solution overview

After you follow the steps explained in Part 1, you set up a deep link for federated users via the SAML 2.0 RelayState parameter in AD FS. You use the user you set up in AD in Part 1 (Bob) to authenticate using AD FS and control access to database objects based on the groups the user is assigned to. You also verify that user Bob's Amazon Redshift database group membership is driven by their AD group membership.

By the end of this post, you will have created a unique deep link that authenticates the user Bob using AD FS and redirects them directly to the query editor v2 console, where they’re authenticated using the federation SSO option.

The sign-in process is as follows:

  1. The user chooses a deep link that redirects to the IdP for authentication with the information about the destination (query editor v2, in our case) URL embedded in the RelayState parameter. The user enters their credentials on the login page.
  2. Your IdP (AD FS in this case) verifies the user's identity in your organization.
  3. Your IdP generates a SAML authentication response that includes assertions that identify the user and attributes about the user. The IdP sends this response to the user’s browser.
  4. The user’s browser is redirected to the AWS Single Sign-On endpoint and posts the SAML assertion and the RelayState parameter.
  5. The endpoint calls the AssumeRoleWithSAML API action to request temporary credentials from the AWS Identity and Access Management (IAM) role specified in the SAML assertion and creates a query editor v2 console sign-in URL that uses those credentials. The IAM role trusts the SAML federation entity and also has a policy that has access to query editor V2. If the SAML authentication response includes attributes that map to multiple IAM roles, the user is first prompted to choose the role to use for access to the query editor v2 console. The sign-in URL is the one specified by the RelayState parameter.
  6. AWS sends the sign-in URL back to the user’s browser as a redirect.
  7. The user’s browser is redirected to the Amazon Redshift query editor v2 console defined by the RelayState parameter.

The following diagram illustrates this flow.

In this post, we walk you through the following steps:

  1. Set up the Sales group in AD and set up the PrincipalTag claim rules in AD FS.
  2. Update the IAM roles.
  3. Construct the SSO URL to authenticate and redirect users to the Amazon Redshift query editor v2 console.
  4. Set up Amazon Redshift database groups and permissions on the Amazon Redshift cluster.
  5. Set up Amazon Redshift query editor v2 to use federated authentication with AD FS to connect directly from the query editor interface.
  6. Query Amazon Redshift objects to validate your authorization.

Prerequisites

For this walkthrough, complete the following prerequisite steps:

  1. Create an Amazon Redshift cluster. For instructions, refer to Create a sample Amazon Redshift cluster or complete the steps in Part 2 of this series.
  2. Complete the steps in Part 1 to set up SAML federation with AD FS:
    1. Set up an AD domain controller using an AWS CloudFormation template on a Windows 2016 Amazon Elastic Compute Cloud (Amazon EC2) instance.
    2. Configure federation in AD FS.
    3. Configure AWS as the relying party with AD FS using an IAM SAML provider and SAML roles with an attached policy to allow access to the Amazon Redshift cluster.
    4. Configure claim rules.
    5. Test the SAML authentication using a web browser.
  3. Verify that your IdP supports RelayState and is enabled. If you’re using AD FS 2.0, you need to download and install either Update Rollup 3 or Update Rollup 2 from Microsoft to enable the RelayState parameter.

Configure AD and AD FS

After you configure your AD FS and AD services by following the instructions in Part 1, you can set up the following AD group and claim rules.

In this post, you use the user Bob to log in to Amazon Redshift and check if Bob can access the Sales and Marketing schemas on the Amazon Redshift cluster. To create the sales group and assign the user [email protected] to it, log in to your AD FS server (Amazon EC2 machine) that you created in Part 1 and use the Windows command tool to run the following command:

dsadd group "cn=RSDB-sales, cn=Users, dc=adfsredshift, dc=com" -members "cn=Bob, cn=Users, dc=adfsredshift, dc=com"
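
Optionally, you can confirm that the group was created and that Bob is a member by listing the group members with dsget, a quick check run on the same AD FS server:

dsget group "cn=RSDB-sales, cn=Users, dc=adfsredshift, dc=com" -members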

Now you’re ready to create your custom claim rules: PrincipalTag:RedshiftDbUser and PrincipalTag:RedshiftDbGroup.

PrincipalTag:RedshiftDbUser

The custom claim rule PrincipalTag:RedshiftDbUser is mapped to the user principal name in AD FS. When a user authenticates through federated SSO, this claim rule is mapped to the user name. If the user doesn't exist in the Amazon Redshift database, the user is created automatically. Automatic creation is enabled through the CreateClusterUser permission in the IAM policy attached to the IAM role (you set this up in Part 1 as a prerequisite).

Complete the following steps to create your custom claim rule:

    1. On the AD FS management console, choose Relying Party Trusts.
    2. Choose Edit Claim Issuance Policy.
    3. Choose Choose Rule Type.
    4. For Claim rule template, choose Send Claims Using a Custom Rule.
    5. Choose Next.
    6. For Claims rule name, enter RedshiftDbUser.
    7. Add the following custom rule:
      c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname", Issuer == "AD AUTHORITY"]
       => issue(store = "Active Directory", types = ("https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser"), query = ";userPrincipalName;{0}", param = c.Value);

    8. Choose Finish.
    9. Capture the claim rules sent in a SAML assertion response through your browser. For instructions, refer to How to view a SAML response in your browser for troubleshooting.

In my example, I use the following SAML attribute for the RedshiftDbUser PrincipalTag:

<Attribute Name="https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser"> <AttributeValue>[email protected]</AttributeValue> 
</Attribute>

PrincipalTag:RedshiftDbGroups

The custom claim rule PrincipalTag:RedshiftDbGroups is built from the AD groups that the user is a member of and is mapped to Amazon Redshift database groups. The AD group names and Amazon Redshift database group names should match. The JoinGroup permission set in the IAM policy allows the user to assume a database group, and the membership is session based. If the user belongs to multiple AD groups, the SAML assertion response should send those groups as colon-separated values, not as multiple-value claims. The following steps demonstrate how to send AD groups as colon-separated values.

In this example, the user Bob is assigned to the marketing and sales groups. The following code shows how to send multiple groups through the SAML response when the user is in multiple groups, and also how to handle the situation when a user doesn’t exist in any particular group.

  1. Follow the same steps as in the previous section to create the rule Marketing, using the following code for the custom rule:
    c:[Type == "http://temp/variable", Value =~ "RSDB-Marketing"] => add(Type = "http://temp/marketing", Value = c.Value);

  2. Create the rule MarketingNotExists using the following code:
    NOT EXISTS([Type == "http://temp/variable", Value =~ "RSDB-marketing"]) => add(Type = "http://temp/marketing", Value = ""); 

  3. Create the rule sales using the following code:
    c:[Type == "http://temp/variable", Value =~ "RSDB-sales"] => add(Type = "http://temp/sales", Value = c.Value);

  4. Create the rule SalesNotExists using the following code:
    NOT EXISTS([Type == "http://temp/variable", Value =~ "RSDB-sales"])
     => add(Type = "http://temp/sales", Value = ""); 

  5. Create the rule RedshiftDbGroups using the following code:
    c:[Type == "http://temp/marketing"]
     && c2:[Type == "http://temp/sales"]
     => issue(Type = "https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbGroups", Value = c.Value + ":" + c2.Value);

The following screenshot shows the list of rules that I created in my AD FS. Note the number of rules and the order in which they’re positioned. We created rules 6–11 as part of this post.

If you see a similar SAML response for RedshiftDbGroups, your setup is good:

<Attribute Name="https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbGroups"> <AttributeValue>marketing:sales</AttributeValue>
</Attribute>

If a user doesn't exist in one of the groups, an empty value is passed to the claim rule. For example, if user Bob is removed from the marketing group, the SAML response for PrincipalTag:RedshiftDbGroups would be :sales.

Update IAM roles

In Part 1 of this series, you created two IAM roles: ADFZ-Dev and ADFZ-Production. These two roles aren’t yet set up with grants on the query editor. In this section, you update these roles with query editor permissions.

Amazon Redshift query editor v2 provides multiple managed policies to access the query editor. For a list of all the managed policies, refer to Configuring your AWS account. For this post, we attach the AmazonRedshiftQueryEditorV2ReadSharing managed policy to the roles.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose the role ADFZ-Dev.
  3. Choose Add permissions and then Attach policies.
  4. Under Other permission policies, search for the AmazonRedshiftQueryEditorV2ReadSharing managed policy.
  5. Select the policy and choose Attach policies.
  6. Modify the trust relationships for your role and add the sts:TagSession permission. With the role open, choose Trust relationships, then choose Edit trust policy. When using session tags, trust policies for all roles connected to the IdP passing tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails.
  7. Choose Update policy.
  8. Repeat these steps to attach the AmazonRedshiftQueryEditorV2ReadSharing managed policy to the ADFZ-Production role.

Limit user access to only the query editor

If you would like to limit users to only access the query editor, update the redshift-marketing policy that you created in the Part 1 blog post as follows.

Note that after the update, users lose admin privileges such as creating clusters.

Replace the region, account, and cluster parameters. This custom policy grants access to Amazon Redshift to get cluster credentials, create users, and allow users to join groups.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RedshiftClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials",
                "redshift:CreateClusterUser",
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<cluster>,
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${aws:PrincipalTag/RedshiftDbUser}",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/marketing",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/sales",
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/${redshift:DBName}"
            ]
        }
    ]
}

There are a few important things to note:

  1. The group membership lasts only for the duration of the user session.
  2. There is no CreateGroup permission because groups need to be manually created and granted DB privileges.

Generate the SSO destination URL as the Amazon Redshift query editor v2 console

In this step, you construct the sign-in URL for the AD FS IdP to redirect users to the Amazon Redshift query editor v2 console. For instructions, refer to How to Use SAML to Automatically Direct Federated Users to a Specific AWS Management Console Page.

To provide a full SSO experience for the end-users, the SAML response can include an optional parameter called RelayState. This parameter contains the destination URL.

Microsoft provides a tool to help generate these SSO URLs for AD FS called the AD FS 2.0 RelayState Generator.

To build this URL, you need three pieces of information:

  • IdP URL string – The string is in the format https://ADFSSERVER/adfs/ls/idpinitiatedsignon.aspx. For this post, we use https://EC2AMAZ-F9TJOIC.adfsredshift.com/adfs/ls/IdpInitiatedSignOn.aspx.
  • Relying party identifier – For AWS, this is urn:amazon:webservices.
  • Relay state or target app – This is the AWS Management Console URL you want your authenticated users redirected to. In this case, it's https://eu-west-1.console.aws.amazon.com/sqlworkbench/home?. For this post, we use the eu-west-1 Region, but you can adjust this as needed.

I followed the instructions in How to Use SAML to Automatically Direct Federated Users to a Specific AWS Management Console Page and used the AD FS 2.0 RelayState Generator to generate the URL shown in the following screenshot.

The following is an example of the final URL that you use to get authenticated and also get redirected to Amazon Redshift query editor v2 (this URL won’t work in your setup because it has been created specifically for an AD FS server in my account): https://EC2AMAZ-F9TJOIC.adfsredshift.com/adfs/ls/IdpInitiatedSignOn.aspx?RelayState=RPID%3Durn%253Aamazon%253Awebservices%26RelayState%3Dhttps%253A%252F%252Feu-west-1.console.aws.amazon.com%252Fsqlworkbench%252Fhome%253Fregion%253Deu-west-1%2523%252Fclient

You can now save this URL and use it from anywhere you can reach your AD FS server. After you enter the URL in a browser, you first authenticate to AD FS, then you’re redirected to the Amazon Redshift query editor v2 console.

Set up DB groups on Amazon Redshift cluster

In this step, you set up your Amazon Redshift database group. This step is necessary because when the user is authenticated, they have to be part of an Amazon Redshift DB group with proper permissions set on a schema or table (or view).

In Active Directory, the user Bob is part of two groups: Sales and Marketing. In your Amazon Redshift database, you have three database groups: Sales, Marketing, and Finance.

When user Bob logs in via federated authentication, the user assumes the Sales and Marketing database groups, so this user can query tables in both the Sales and Marketing schemas. Because the user Bob isn’t part of the Finance group, when they try to access the Finance schema, they receive a permission denied error.

The following diagram illustrates this configuration.

Complete the following steps to set up your DB groups:

  1. Connect as awsuser (a superuser).
  2. Create three database groups:
    CREATE GROUP sales;
    CREATE GROUP marketing;
    CREATE GROUP finance;

  3. Create three schemas:
    CREATE SCHEMA sales;
    CREATE SCHEMA marketing;
    CREATE SCHEMA finance;

  4. Create a table in each schema:
    CREATE TABLE IF NOT EXISTS marketing.employee
    (
    	n_empkey INTEGER   
    	,n_name CHAR(25)   
    	,n_regionkey INTEGER   
    	,n_comment VARCHAR(152)   
    )
    DISTSTYLE AUTO
     SORTKEY (n_empkey);
    
    CREATE TABLE IF NOT EXISTS sales.employee_sales
    (
    	n_empkey INTEGER   
    	,n_name CHAR(25)   
    	,n_regionkey INTEGER   
    	,n_comment VARCHAR(152)   
    )
    DISTSTYLE AUTO
     SORTKEY (n_empkey);
    
    
    CREATE TABLE IF NOT EXISTS finance.accounts
    (
    	account_id INTEGER   
    	,account_name CHAR(25)   
    	 
    )
    DISTSTYLE AUTO
     SORTKEY (account_id);

  5. Insert sample data into the three tables:
    INSERT INTO marketing.employee
    VALUES(1, 'Bob', 0, 'Marketing');
    
    INSERT INTO sales.employee_sales
    VALUES(1, 'John', 0, 'Sales');
    
    INSERT INTO finance.accounts
    VALUES(1, 'online company');

  6. Validate the data is available in the tables:
    Select * from marketing.employee;
    Select * from sales.employee_sales;
    Select * from finance.accounts;

You can now set up appropriate privileges for the sales, finance, and marketing groups. Groups are collections of users who are all granted privileges associated with the group. You can use groups to assign privileges by job function. For example, you can create different groups for sales, administration, and support, and give the users in each group the appropriate access to the data they require for their work. You can grant or revoke privileges at the group level, and those changes apply to all members of the group, except for superusers.

  1. Enter the following SQL queries to grant access to all tables in the sales schema to the sales group, access to all tables in the marketing schema to the marketing group, and access to all tables in the finance schema to the finance group:
ALTER DEFAULT PRIVILEGES IN SCHEMA sales
GRANT SELECT on TABLES to GROUP sales;
GRANT USAGE on SCHEMA sales to GROUP sales;
GRANT SELECT on ALL TABLES in SCHEMA sales to GROUP sales;

ALTER DEFAULT PRIVILEGES IN SCHEMA marketing
GRANT SELECT on TABLES to GROUP marketing;
GRANT USAGE on SCHEMA marketing to GROUP marketing;
GRANT SELECT on ALL TABLES in SCHEMA marketing to GROUP marketing;

ALTER DEFAULT PRIVILEGES IN SCHEMA finance
GRANT SELECT on TABLES to GROUP finance;
GRANT USAGE on SCHEMA finance to GROUP finance;
GRANT SELECT on ALL TABLES in SCHEMA finance to GROUP finance;

Access Amazon Redshift query editor v2 through federated authentication

Now that you have completed your SAML integration, deep link setup, and DB groups and access rights configuration, you can set up Amazon Redshift query editor v2 to use federated authentication with AD FS and connect directly from the query editor interface.

  1. Navigate to the deep link URL you created earlier.
    You’re redirected to the AD FS login page.
  2. Sign in as [email protected].
    For this post, I accessed this URL from an Amazon EC2 machine, but you can access it from any location where you can reach the AD FS IdP.
    After AD FS successfully authenticates you, it redirects to the AWS SSO endpoint and posts the SAML assertion and RelayState parameter. Because you configured two IAM roles on the AWS side, you’re prompted to select a role.
  3. Select a role (for this example, ADFZ-Dev) and choose Sign In.

    AWS sends the sign-in URL that is based on the RelayState value back to your browser as a redirect. Your browser is redirected to the query editor v2 console automatically.
  4. Right-click your Amazon Redshift cluster (for this post, redshift-cluster-1) and choose Edit connection.

    The value for User name is automatically populated, and Federated Access is automatically selected.
  5. Choose Edit connection to save the connection and log in to the database.

    After you’re successfully logged in, you can browse the database objects in the left pane.
  6. Test the connection by running the following query:
    select * from stv_sessions;

The following screenshot shows the output.

The output shows that the user [email protected] was authenticated using AD FS. The user also joined the marketing and sales groups, as enforced by the AD FS PrincipalTag:RedshiftDbGroups claim rule and the policy associated with the ADFZ-Dev role, which the user assumes during this session.

Run queries to validate authorization through federated groups

In this final step, you validate how the groups and membership configured in AD are seamlessly integrated with Amazon Redshift database groups.

Run the following query against the marketing and sales schema:

select * from marketing.employee;
select * from sales.employee_sales;

The following screenshots shows the output:

The preceding images show that AD user Bob is part of the AD groups RSDB-marketing and RSDB-sales, which are mapped to the DB groups marketing and sales. These DB groups have SELECT access to the marketing and sales schemas and all tables in those schemas. Therefore, the user can successfully query the tables.

To run a query against the finance schema, enter the following code:

select * from finance.accounts;

The following screenshot shows the output.

The output shows that Bob is only part of the AD groups RSDB-marketing and RSDB-sales. Due to the way the claim rules are set up, Bob doesn't have access to the database group finance, and therefore the query returns a permission denied error.

Clean up

To avoid incurring future charges, delete the resources by deleting the CloudFormation stack. This cleans up all the resources from your AWS account that you set up in Part 1.

Conclusion

In this post, we demonstrated how to set up an AD FS server, configure different PrincipalTag attributes used for Amazon Redshift query editor v2, and generate an SSO URL with the query editor as the destination location. You then connected to the Amazon Redshift DB cluster using a database user with administrator privileges to set up DB groups and permissions, and used a federated user authentication with AD FS to run several queries. This solution enables you to control access to your Amazon Redshift database objects by using AD groups and memberships seamlessly.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in the data warehousing and analytical space.

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data and analytics for over 14 years. His LinkedIn profile can be found here.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Yanis Telaoumaten is a Software Development Engineer at AWS. His passions are building reliable software and creating tools to allow other engineers to work more efficiently. In the past years, he worked on identity, security and reliability of Redshift services

Use the Amazon Redshift SQLAlchemy dialect to interact with Amazon Redshift

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/use-the-amazon-redshift-sqlalchemy-dialect-to-interact-with-amazon-redshift/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. You can interact with an Amazon Redshift database in several different ways. One method is using an object-relational mapping (ORM) framework. ORM is widely used by developers as an abstraction layer upon the database, which allows you to write code in your preferred programming language instead of writing SQL. SQLAlchemy is a popular Python ORM framework that enables the interaction between Python code and databases.

A SQLAlchemy dialect is the system used to communicate with various types of DBAPI implementations and databases. Previously, the SQLAlchemy dialect for Amazon Redshift used psycopg2 for communication with the database. Because psycopg2 is a Postgres connector, it doesn’t support Amazon Redshift specific functionality such as AWS Identity and Access Management (IAM) authentication for secure connections and Amazon Redshift specific data types such as SUPER and GEOMETRY. The new Amazon Redshift SQLAlchemy dialect uses the Amazon Redshift Python driver (redshift_connector) and lets you securely connect to your Amazon Redshift database. It natively supports IAM authentication and single sign-on (SSO). It also supports Amazon Redshift specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ.

In this post, we discuss how you can interact with your Amazon Redshift database using the new Amazon Redshift SQLAlchemy dialect. We demonstrate how you can securely connect using Okta and perform various DDL and DML operations. Because the new Amazon Redshift SQLAlchemy dialect uses redshift_connector, users of this package can take full advantage of the connection options provided by redshift_connector, such as authenticating via IAM and identity provider (IdP) plugins. Additionally, we also demonstrate the support for IPython SqlMagic, which simplifies running interactive SQL queries directly from a Jupyter notebook.

Prerequisites

The following are the prerequisites for this post:

Get started with the Amazon Redshift SQLAlchemy dialect

It’s easy to get started with the Amazon Redshift SQLAlchemy dialect for Python. You can install the sqlalchemy-redshift library using pip. To demonstrate this, we start with a Jupyter notebook. Complete the following steps:

  1. Create a notebook instance (for this post, we call it redshift-sqlalchemy).
  2. On the Amazon SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  3. Find the instance you created and choose Open Jupyter.
  4. Open your notebook instance and create a new conda_python3 Jupyter notebook.
  5. Run the following commands to install sqlalchemy-redshift and redshift_connector:
pip install sqlalchemy-redshift
pip install redshift_connector
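
Optionally, you can confirm the installed versions before moving on. The following is a small sketch (your version numbers will differ); note that the URL.create() constructor used later in this post requires SQLAlchemy 1.4 or above:

import sqlalchemy
import sqlalchemy_redshift
import redshift_connector

# URL.create() used later in this post requires SQLAlchemy 1.4+
print(sqlalchemy.__version__)
print(sqlalchemy_redshift.__version__)
print(redshift_connector.__version__)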


redshift_connector provides many different connection options that help customize how you access your Amazon Redshift cluster. For more information, see Connection Parameters.

Connect to your Amazon Redshift cluster

In this step, we show you how to connect to your Amazon Redshift cluster using two different methods: Okta SSO federation, and direct connection using your database user and password.

Connect with Okta SSO federation

As a prerequisite, set up your Amazon Redshift application in your Okta configuration. For more information, see Federate Amazon Redshift access with Okta as an identity provider.

To establish a connection to the Amazon Redshift cluster, we utilize the create_engine function. The SQLAlchemy create_engine() function produces an engine object based on a URL. The sqlalchemy-redshift package provides a custom interface for creating an RFC-1738 compliant URL that you can use to establish a connection to an Amazon Redshift cluster.

We build the SQLAlchemy URL as shown in the following code. URL.create() is available for SQLAlchemy version 1.4 and above. When authenticating using IAM, the host and port don’t need to be specified by the user. To connect with Amazon Redshift securely using SSO federation, we use the Okta user name and password in the URL.

import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy import orm as sa_orm

# SUPER is imported here because it's used when creating the table later in this post
from sqlalchemy_redshift.dialect import SUPER, TIMESTAMPTZ, TIMETZ


# build the sqlalchemy URL. When authenticating using IAM, the host
# and port do not need to be specified by the user.
url = URL.create(
    drivername='redshift+redshift_connector',  # indicates the redshift_connector driver and dialect will be used
    database='dev',                            # Amazon Redshift database
    username='[email protected]',              # Okta username
    password='<PWD>'                           # Okta password
)

# a dictionary is used to store additional connection parameters
# that are specific to redshift_connector or cannot be URL encoded
conn_params = {
    "iam": True,                               # must be enabled when authenticating via IAM
    "credentials_provider": "OktaCredentialsProvider",
    "idp_host": "<prefix>.okta.com",
    "app_id": "<appid>",
    "app_name": "amazon_aws_redshift",
    "region": "<region>",
    "cluster_identifier": "<clusterid>",
    "ssl_insecure": False,                     # ensures certificate verification occurs for idp_host
}

engine = sa.create_engine(url, connect_args=conn_params)
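
As a quick sanity check (a minimal sketch that isn't part of the original walkthrough), you can open a connection from the engine and run a lightweight query to confirm that the Okta federation works:

# verify the federated connection with a lightweight query
with engine.connect() as conn:
    print(conn.execute(sa.text("select current_user, version()")).fetchone())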

Connect with an Amazon Redshift database user and password

You can connect to your Amazon Redshift cluster using your database user and password. We construct a URL and use the URL.create() constructor, as shown in the following code:

import sqlalchemy as sa
from sqlalchemy.engine.url import URL

# build the sqlalchemy URL
url = URL.create(
    drivername='redshift+redshift_connector',  # indicates the redshift_connector driver and dialect will be used
    host='<clusterid>.xxxxxx.<aws-region>.redshift.amazonaws.com',  # Amazon Redshift host
    port=5439,               # Amazon Redshift port
    database='dev',          # Amazon Redshift database
    username='awsuser',      # Amazon Redshift username
    password='<pwd>'         # Amazon Redshift password
)

engine = sa.create_engine(url)

Next, we create a session using the engine we established earlier:

Session = sa_orm.sessionmaker()
Session.configure(bind=engine)
session = Session()

# Define Session-based Metadata
metadata = sa.MetaData(bind=session.bind)

Create a database table using Amazon Redshift data types and insert data

With the new Amazon Redshift SQLAlchemy dialect, you can create tables with Amazon Redshift-specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ.

In this step, you create a table with TIMESTAMPTZ, TIMETZ, and SUPER data types.

Optionally, you can define your table’s distribution style, sort key, and compression encoding. See the following code:

import datetime
import uuid
import random

table_name = 'product_clickstream_tz'

RedshiftDBTable = sa.Table(
    table_name,
    metadata,
    sa.Column('session_id', sa.VARCHAR(80)),
    sa.Column('click_region', sa.VARCHAR(100), redshift_encode='lzo'),
    sa.Column('product_id', sa.BIGINT),
    sa.Column('click_datetime', TIMESTAMPTZ),
    sa.Column('stream_time', TIMETZ),
    sa.Column('order_detail', SUPER),
    redshift_diststyle='KEY',
    redshift_distkey='session_id',
    redshift_sortkey='click_datetime'
)

# Drop the table if it already exists
if sa.inspect(engine).has_table(table_name):
    RedshiftDBTable.drop(bind=engine)

# Create the table (execute the "CREATE TABLE" SQL statement for "product_clickstream_tz")
RedshiftDBTable.create(bind=engine)
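
If you want to inspect the exact DDL that the dialect generates from the redshift_diststyle, redshift_distkey, redshift_sortkey, and redshift_encode arguments, one option (a small illustrative sketch) is to compile the CreateTable construct against the engine:

from sqlalchemy.schema import CreateTable

# print the CREATE TABLE statement, including the DISTSTYLE, DISTKEY,
# SORTKEY, and ENCODE clauses rendered by the Redshift dialect
print(CreateTable(RedshiftDBTable).compile(engine))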

In this step, you populate the table by preparing and running the insert statement:

# create sample data set
# generate a UUID for this row
session_id = str(uuid.uuid1())

# create Region information
click_region = "US / New York"

# create Product information
product_id = random.randint(1,100000)

# create a datetime object with timezone
click_datetime = datetime.datetime(year=2021, month=10, day=20, hour=10, minute=12, second=40, tzinfo=datetime.timezone.utc)

# create a time object with timezone
stream_time = datetime.time(hour=10, minute=14, second=56, tzinfo=datetime.timezone.utc)

# create SUPER information
order_detail = '[{"o_orderstatus":"F","o_clerk":"Clerk#0000001991","o_lineitems":[{"l_returnflag":"R","l_tax":0.03,"l_quantity":4,"l_linestatus":"F"}]}]'

# create the insert SQL statement
insert_data_row = RedshiftDBTable.insert().values(
    session_id=session_id,
    click_region=click_region,
    product_id=product_id,
    click_datetime=click_datetime,
    stream_time=stream_time,
    order_detail=order_detail
)

# execute the insert SQL statement
session.execute(insert_data_row)
session.commit()

Query and fetch results from the table

The SELECT statements generated by SQLAlchemy ORM are constructed by a query object. You can use several different methods, such as all(), first(), count(), order_by(), and join(). The following screenshot shows how you can retrieve all rows from the queried table.
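
For reference, the following is a minimal sketch of such a query, assuming the session and RedshiftDBTable objects created in the earlier steps:

# build a query object against the table created earlier
query = session.query(RedshiftDBTable)

# retrieve all rows
for row in query.all():
    print(row.session_id, row.click_region, row.product_id,
          row.click_datetime, row.stream_time, row.order_detail)

# other query-object methods
print(query.count())  # number of rows in the table
first_row = query.order_by(RedshiftDBTable.c.product_id).first()  # first row ordered by product_id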

Use IPython SqlMagic with the Amazon Redshift SQLAlchemy dialect

The Amazon Redshift SQLAlchemy dialect now supports SqlMagic. To establish a connection, you can build the SQLAlchemy URL with the redshift_connector driver. More information about SqlMagic is available on GitHub.

In the next section, we demonstrate how you can use SqlMagic. Make sure that you have the ipython-sql package installed; if not, install it by running the following command:

pip install ipython-sql

Connect to Amazon Redshift and query the data

In this step, you build the SQLAlchemy URL to connect to Amazon Redshift and run a sample SQL query. For this demo, we prepopulated the cluster with TPCH data from GitHub. See the following code:

import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import Session
%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = URL.create(
    drivername='redshift+redshift_connector',
    host='cluster.xxxxxxxx.region.redshift.amazonaws.com',
    port=5439,
    database='dev',
    username='awsuser',
    password='xxxxxx'
)
%sql $connect_to_db
%sql select current_user, version();

You can view the data in tabular format by using the pandas.DataFrame() method.

If you installed matplotlib, you can use the result set’s .plot(), .pie(), and .bar() methods for quick plotting.
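
The following is a small sketch that ties these together; the lineitem table name assumes the TPCH sample data mentioned earlier is loaded in your cluster, and the result set's DataFrame() method returns a pandas DataFrame:

# run a query and capture the result set returned by the %sql magic
result = %sql select l_returnflag, count(*) as order_count from lineitem group by l_returnflag

# convert the result set to a pandas DataFrame for tabular viewing
df = result.DataFrame()
print(df.head())

# quick bar chart of the result set (requires matplotlib)
result.bar()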

Clean up

Make sure that SQLAlchemy resources are closed and cleaned up when you're done with them. SQLAlchemy uses a connection pool to provide access to an Amazon Redshift cluster, and by default these connections stay open after they're established. If they aren't properly cleaned up, this can lead to connectivity issues with your cluster. Use the following code to clean up your resources:

session.close()

# If the connection was accessed directly, ensure it is invalidated
conn = engine.connect()
conn.invalidate()

# Clean up the engine
engine.dispose()

Summary

In this post, we discussed the new Amazon Redshift SQLAlchemy dialect. We demonstrated how it lets you securely connect to your Amazon Redshift database using SSO as well as a direct connection using the SQLAlchemy URL. We also demonstrated how SQLAlchemy supports the TIMESTAMPTZ, TIMETZ, and SUPER data types without explicit casting, and showcased how redshift_connector and the dialect support SqlMagic with Jupyter notebooks, which enables you to run interactive queries against Amazon Redshift.


About the Authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in the data warehousing and analytical space.

Brooke White is a Software Development Engineer at AWS. She enables customers to get the most out of their data through her work on Amazon Redshift drivers. Prior to AWS, she built ETL pipelines and analytics APIs at a San Francisco Bay Area startup.