All posts by Debu Panda

Integrate Tableau and Microsoft Entra ID with Amazon Redshift using AWS IAM Identity Center

2024-09-04 Debu Panda

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/integrate-tableau-and-microsoft-entra-id-with-amazon-redshift-using-aws-iam-identity-center/

This post is co-written with Sid Wray, Jade Koskela, and Ravi Bhattiprolu from SalesForce.

Amazon Redshift and Tableau empower data analysis. Amazon Redshift is a cloud data warehouse that processes complex queries at scale and with speed. Its advanced query optimization serves results to Tableau. Tableau’s extensive capabilities and enterprise connectivity help analysts efficiently prepare, explore, and share data insights company-wide.

Customers can integrate Amazon Redshift with Tableau using single sign-on (SSO) capabilities enabled by AWS IAM Identity Center integration with trusted identity propagation. You can use this to seamlessly implement authentication with third-party identity providers (IdP) and authorization with Redshift. It positions Amazon Redshift as an AWS managed application, allowing you to take full advantage of the trusted identity propagation feature.

Amazon Web Services (AWS) collaborated with Tableau to enable SSO support for accessing Amazon Redshift from Tableau. Both Tableau Desktop 2023.3.9 and Tableau Server 2023.3.9 releases support trusted identity propagation with IAM Identity Center. This SSO integration is available for Tableau Desktop, Tableau Server, and Tableau Prep.

This blog post provides a step-by-step guide to integrating IAM Identity Center with Microsoft Entra ID as the IdP and configuring Amazon Redshift as an AWS managed application. Additionally, you’ll learn how to set up the Amazon Redshift driver in Tableau, enabling SSO directly within Tableau Desktop.

Solution overview

The following diagram illustrates the architecture of the Tableau SSO integration with Amazon Redshift, IAM Identity Center, and Microsoft Entra ID.

Figure 1: Solution overview for Tableau integration with Amazon Redshift using IAM Identity Center and Microsoft Entra ID

The solution depicted in Figure 1 includes the following steps:

The user configures Tableau to access Amazon Redshift using IAM Identity Center.
On a user sign-in attempt, Tableau initiates a browser-based OAuth flow and redirects the user to the Microsoft Entra ID sign-in page to enter the sign-in credentials.
After successful authentication, Microsoft Entra ID issues authentication tokens (ID and access token) to Tableau.
The Amazon Redshift driver then makes a call to the Amazon Redshift-enabled Identity Center application and forwards the access token.
Amazon Redshift passes the token to IAM Identity Center for validation.
IAM Identity Center first validates the token using the OpenID Connect (OIDC) discovery connection to the trusted token issuer (TTI) and returns an IAM Identity Center generated access token for the same user. In Figure 1, the TTI is the Microsoft Entra ID server.
Amazon Redshift then uses the access token to obtain the user and group membership information from Identity Center.
The Tableau user will be able to connect with Amazon Redshift and access data based on the user and group membership returned from IAM Identity Center.

Prerequisites

Before you begin implementing the solution, you must have the following in place:

Set up IAM Identity Center and Amazon Redshift integration by following the steps in Integrate Identity Provider (IdP) with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On.
Download and install the latest ODBC 2.X Driver.
Install Tableau Desktop 2023.3.9 or above version.
Install Tableau Server 2023.3.9 or above version. For Tableau Server installation, see Install and Configure Tableau Server.
Use an Entra ID account that has an active subscription. You need an admin role to set up the application on Entra ID. If you don’t have an Entra ID account, you can create an account for free.

Walkthrough

In this walkthrough, you will use the following steps to build the solution:

Set up the Microsoft Entra ID OIDC application
Collect Microsoft Entra ID information
Set up a trusted token issuer in IAM Identity Center
Set up client connections and trusted token issuers
Set up the Tableau OAuth config files for Microsoft Entra ID
Install the Tableau OAuth config file for Tableau Desktop
Set up the Tableau OAuth config file for Tableau Server or Tableau Cloud
Federate to Amazon Redshift from Tableau Desktop
Federate to Amazon Redshift from Tableau Server

Set up the Microsoft Entra ID OIDC application

To create your Microsoft Entra application and service principal, follow these steps:

Sign in to the Microsoft Entra admin center as Cloud Application Administrator (at the least).
Browse to App registrations under Manage, and choose New registration.
Enter a name for the application. For example, Tableau-OIDC-App.
Select a supported account type, which determines who can use the application. For this example, select the first option in the list.
Under Redirect URI, select Web for the type of application you want to create. Enter the URI where the access token is sent to. In this example, you’re using localhost, so enter http://localhost:55556/Callback and http://localhost/auth/add_oauth_token.
Choose Register.
In the navigation pane, choose Certificates & secrets.
Choose New client secret.
Enter a Description and select an expiration for the secret or specify a custom lifetime. For this example, keep the Microsoft recommended default expiration value of 6 months. Choose Add.
Copy the secret value.
Note: It will only be presented one time; after that you cannot read it.
In the navigation pane, under Manage, choose Expose an API.
If you’re setting up for the first time, you can see Set to the right of Application ID URI.
Choose Set, and then choose Save.
After the application ID URI is set up, choose Add a scope.
For Scope name, enter a name. For example, redshift_login.
For Admin consent display name, enter a display name. For example, redshift_login.
For Admin consent description, enter a description of the scope.
Choose Add scope.

For more information about setting up the Microsoft Entra app, see Register a Microsoft Entra app and create a service principal.

Collect Microsoft Entra ID information

To configure your IdP with IAM Identity Center and Amazon Redshift, collect the following parameters from Microsoft Entra ID. If you don’t have these parameters, contact your Microsoft Entra ID admin.

Tenant ID,Client ID and Audience value: To get these values:
1. Sign in to the Azure portal with your Microsoft account.
2. Under Manage, choose App registrations.
3. Choose the application that you created in previous sections.
4. On the left panel, choose Overview, a new page will appear containing the Essentials section. You can find the Tenant ID,Client ID and Audience value (Application ID URI) as shown in the following figure:
  
  Figure 2: Overview section of OIDC application

Scope: To find your scope value:
1. In the navigation pane of the OIDC application, under Manage, choose Expose an API.
2. You will find the value under Scopes as shown in the following figure:
  
  Figure 3: Application scope

Set up a trusted token issuer in IAM Identity Center

At this point, you have finished configurations in the Entra ID console; now you’re ready to add Entra ID as a TTI. You will start by adding a TTI so you can exchange tokens. In this step, you will create a TTI in the centralized management account. To create a TTI, follow these steps:

Open the AWS Management Console and navigate to IAM Identity Center, and then to the Settings
Select the Authentication tab and under Trusted token issuers, choose Create trusted token issuer.
On the Set up an external IdP to issue trusted tokens page, under Trusted token issuer details, do the following:
1. For Issuer URL, enter the OIDC discovery URL of the external IdP that will issue tokens for trusted identity propagation. The URL would be: https://sts.windows.net/<tenantid>/. To find your Microsoft Entra tenant ID, see Collect Microsoft Entra ID information.
2. For Trusted token issuer name, enter a name to identify this TTI in IAM Identity Center and in the application console.
3. Under Map attributes, do the following:
  1. For Identity provider attribute, select an attribute from the list to map to an attribute in the Identity Center identity store. You can choose Email, Object Identifier, Subject, and Other. This example uses Other where we’re specifying the upn (user principal name) as the Identity provider attribute to map with Email from the IAM identity Center attribute.
  2. For IAM Identity Center attribute, select the corresponding attribute for the attribute mapping.
4. Under Tags (optional), choose Add new tag, specify a value for Key, and optionally for Value. For information about tags, see Tagging AWS IAM Identity Center resources.

Figure 4 that follows shows the set up for TTI.

Figure 4: Create a trusted token issuer

Choose Create trusted token issuer.

Set up client connections and trusted token issuers

A third-party application (such as Tableau) that isn’t managed by AWS exchanges the external token (JSON Web Token (JWT) for an IAM Identity Center token before calling AWS services.

The JWT must contain a subject (sub) claim, an audience (aud) claim, an issuer (iss), a user attribute claim, and a JWT ID (JTI) claim. The audience is a value that represents the AWS service that the application will use, and the audience claim value must match the value that’s configured in the Redshift application that exchanges the token.

In this section, you will specify the audience claim in the Redshift application, which you will get from Microsoft Entra ID. You will configure the Redshift application in the member account where the Redshift cluster or serverless instance is.

Select IAM Identity Center connection from Amazon Redshift console menu.

Figure 5: Redshift IAM Identity Center connection

Select the Amazon Redshift application that you created as part of the prerequisites.
Select the Client connections tab and choose Edit.
Choose Yes under Configure client connections that use third-party IdPs.
Select the checkbox for Trusted token issuer that you created in the previous section.
Enter the aud claim value under Configure selected trusted token issuers. For example, api://1230a234-b456-7890-99c9-a12345bcc123. To get the audience value, see Collect Microsoft Entra ID information.
Choose Save.

Figure 6: Adding an audience claim for the TTI

Your IAM Identity Center, Amazon Redshift, and Microsoft Entra ID configuration is complete. Next, you need to configure Tableau.

Set up the Tableau OAuth config files for Microsoft Entra ID

To integrate Tableau with Amazon Redshift using IAM Identity Center, you need to use a custom XML. In this step, you use the following XML and replace the values starting with the $ sign and highlighted in bold. The rest of the values can be kept as they are, or you can modify them based on your use case. For detailed information on each of the elements in the XML file, see the Tableau documentation on GitHub.

Note: The XML file will be used for all the Tableau products including Tableau Desktop, Server, and Cloud. You can use the following XML or you can refer to Tableau’s github.

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
<dbclass>redshift</dbclass>
<!-- For configs embedded in the connector package, don't prefix with "custom_". For external configs, always prefix with "custom_". -->
<oauthConfigId>custom_redshift_azure</oauthConfigId>
<clientIdDesktop>$copy_client_id_from_azure_oidc_app</clientIdDesktop>
<clientSecretDesktop>$copy_client_secret_from_azure_oidc_app</clientSecretDesktop>
<redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55559/Callback</redirectUrisDesktop>
<!-- For multitenant apps use the common endpoint, for single tenant apps use the directory specific endpoint. -->
<authUri>https://login.microsoftonline.com/$azure_tenant_id/oauth2/v2.0/authorize</authUri>
<tokenUri>https://login.microsoftonline.com/$azure_tenant_id/oauth2/v2.0/token</tokenUri>
<scopes>openid</scopes>
<scopes>offline_access</scopes>
<scopes>email</scopes>
<!-- An example with a custom API, which was required at the time of writing for integration with AWS IAM IDC. -->
<scopes>$scope_from_azure_oidc_app</scopes>
<capabilities>
<entry>
<key>OAUTH_CAP_REQUIRES_PROMPT_SELECT_ACCOUNT</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_REQUIRE_PKCE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_STATE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
<value>false</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
<value>true</value>
</entry>
<!-- Depending on the Azure application, dynamic ports may not be allowed. Enable this if not allowed. -->
<entry>
<key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
<value>true</value>
</entry>
</capabilities>
<accessTokenResponseMaps>
<entry>
<key>ACCESSTOKEN</key>
<value>access_token</value>
</entry>
<entry>
<key>REFRESHTOKEN</key>
<value>refresh_token</value>
</entry>
<entry>
<key>access-token-issue-time</key>
<value>issued_at</value>
</entry>
<entry>
<key>id-token</key>
<value>id_token</value>
</entry>
<entry>
<key>username</key>
<value>email</value>
</entry>
<entry>
<key>access-token-expires-in</key>
<value>expires_in</value>
</entry>
</accessTokenResponseMaps>
</pluginOAuthConfig>

The following is an example XML file:

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
<dbclass>redshift</dbclass>
<!-- For configs embedded in the connector package, don't prefix with "custom_". For external configs, always prefix with "custom_". -->
<oauthConfigId>custom_redshift_azure</oauthConfigId>
<clientIdDesktop>1230a234-b456-7890-99c9-a12345bcc123</clientIdDesktop>
<clientSecretDesktop>RdQbc~1234559xFX~c65737wOwjsdfdsg123bg2</clientSecretDesktop>
<redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55559/Callback</redirectUrisDesktop>
<!-- For multitenant apps use the common endpoint, for single tenant apps use the directory specific endpoint. -->
<authUri>https://login.microsoftonline.com/e12a1ab3-1234-12ab-12b3-1a5012221d12/oauth2/v2.0/authorize</authUri>
<tokenUri>https://login.microsoftonline.com/e12a1ab3-1234-12ab-12b3-1a5012221d12/oauth2/v2.0/token</tokenUri>
<scopes>openid</scopes>
<scopes>offline_access</scopes>
<scopes>email</scopes>
<!-- An example with a custom API, which was required at the time of writing for integration with AWS IAM IDC. -->
<scopes>api://1230a234-b456-7890-99c9-a12345bcc123/redshift_login</scopes>
<capabilities>
<entry>
<key>OAUTH_CAP_REQUIRES_PROMPT_SELECT_ACCOUNT</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_REQUIRE_PKCE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_STATE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
<value>false</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
<value>true</value>
</entry>
<!-- Depending on the Azure application, dynamic ports may not be allowed. Enable this if not allowed. -->
<entry>
<key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
<value>true</value>
</entry>
</capabilities>
<accessTokenResponseMaps>
<entry>
<key>ACCESSTOKEN</key>
<value>access_token</value>
</entry>
<entry>
<key>REFRESHTOKEN</key>
<value>refresh_token</value>
</entry>
<entry>
<key>access-token-issue-time</key>
<value>issued_at</value>
</entry>
<entry>
<key>id-token</key>
<value>id_token</value>
</entry>
<entry>
<key>username</key>
<value>email</value>
</entry>
<entry>
<key>access-token-expires-in</key>
<value>expires_in</value>
</entry>
</accessTokenResponseMaps>
</pluginOAuthConfig>

Install the Tableau OAuth config file for Tableau Desktop

After the configuration XML file is created, it must be copied to a location to be used by Amazon Redshift Connector from Tableau Desktop. Save the file from the previous step as .xml and save it under Documents\My Tableau Repository\OAuthConfigs.

Note: Currently, this integration isn’t supported in macOS because the Redshift ODBC 2.X driver isn’t supported yet for MAC. It will be supported soon.

Set up the Tableau OAuth config file for Tableau Server or Tableau Cloud

To integrate with Amazon Redshift using IAM Identity Center authentication, you must install the Tableau OAuth config file in Tableau Server or Tableau Cloud.

Sign in to the Tableau Server or Tableau Cloud using admin credentials.
Navigate to Settings.
Go to OAuth Clients Registry and select Add OAuth Client
Choose following settings:
1. Connection Type: Amazon Redshift
2. OAuth Provider: Custom_IdP
3. Client Id: Enter your IdP client ID value
4. Client Secret: Enter your client secret value
5. Redirect URL: Enter http://localhost/auth/add_oauth_token. This example uses localhost for testing in a local environment. You should use the full hostname with https.
6. Choose OAuth Config File. Select the XML file that you configured in the previous section.
7. Select Add OAuth Client and choose Save.

Figure 7: Create an OAuth connection in Tableau Server or Cloud

Federate to Amazon Redshift from Tableau Desktop

Now you’re ready to connect to Amazon Redshift from Tableau as an Entra ID federated user. In this step, you create a Tableau Desktop report and publish it to Tableau Server.

Open Tableau Desktop.
Select Amazon Redshift Connector and enter the following values:
1. Server: Enter the name of the server that hosts the database and the name of the database you want to connect to.
2. Port: Enter 5439.
3. Database: Enter your database name. This example uses dev.
4. Authentication: Select OAuth.
5. Federation Type: Select Identity Center.
6. Identity Center Namespace: You can leave this value blank.
7. OAuth Provider: This value should automatically be pulled from your configured XML. It will be the value from the element oauthConfigId.
8. Select Require SSL.
9. Choose Sign in.
  
  Figure 8: Tableau Desktop OAuth connection

Enter your IdP credentials in the browser pop-up window.

Figure 9: Microsoft Entra sign in page
When authentication is successful, you will see the message shown in Figure 10 that follows.

Figure 10: Successful authentication using Tableau

Congratulations! You’re signed in using the IAM Identity Center integration with Amazon Redshift. Now you’re ready to explore and analyze your data using Tableau Desktop.

Figure 11: Successful connection using Tableau Desktop

After signing in, you can create your own Tableau Report on the desktop version and publish it to your Tableau Server. For this example, we created and published a report named SalesReport.

Federate to Amazon Redshift from Tableau Server

After you have published the report from Tableau Desktop to Tableau Server, sign in as a non-admin user and view the published report (SalesReport in this example) using IAM Identity Center authentication.

Sign in to the Tableau Server site as a non-admin user.
Navigate to Explore and go to the folder where your published report is stored.
Select the report and choose Sign In.

Figure 12: User audit in sys_query_history
To authenticate, enter your non-admin Microsoft Entra ID (Azure) credentials in the browser pop-up.

Figure 13: Tableau Server sign In
After your authentication is successful, you can access the report.

Figure 14: Tableau report

Verify user identity from Amazon Redshift

As an optional step, you can audit the federated IAM Identity Center user from Amazon Redshift.

Figure 15 is a screenshot from the Amazon Redshift system table (sys_query_history) showing that user Ethan from Microsoft Entra ID is accessing the sales report.

select distinct user_id, pg.usename as username, trim(query_text) as query_text
from sys_query_history sys
join pg_user_info pg
on sys.user_id=pg.usesysid
where query_id=<query_id> and usesysid=<federateduser_id> and query_type='SELECT'
order by start_time desc
;

Figure 15: User audit in sys_query_history

Clean up

Complete the following steps to clean up your resources:

Delete the IdP applications that you created to integrate with IAM Identity Center.
Delete the IAM Identity Center configuration.
Delete the Amazon Redshift application and the Amazon Redshift provisioned cluster or serverless instance that you created for testing.
Delete the AWS Identity and Access Management (IAM) role and IAM policy that you created as part of the prerequisites for IAM Identity Center and Amazon Redshift integration.
Delete the permission set from IAM Identity Center that you created for Amazon Redshift Query Editor V2 in the management account.

Conclusion

This post explored a streamlined approach to access management for data analytics by using Tableau’s support for OIDC for SSO. The solution facilitates federated user authentication, where user identities from an external IdP are trusted and propagated to Amazon Redshift. You learned how to configure Tableau Desktop and Tableau Server to seamlessly integrate with Amazon Redshift using IAM Identity Center for SSO. By harnessing this integration between a third-party IdP and IAM Identity Center, users can securely access Amazon Redshift data sources within Tableau without managing separate database credentials.

The following are key resources to learn more about Amazon Redshift integration with IAM Identity Center:

About the Authors

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Sid Wray is a Senior Product Manager at Salesforce based in the Pacific Northwest with nearly 20 years of experience in Digital Advertising, Data Analytics, Connectivity Integration and Identity and Access Management. He currently focuses on supporting ISV partners for Salesforce Data Cloud.

Adiascar Cisneros is a Tableau Senior Product Manager based in Atlanta, GA. He focuses on the integration of the Tableau Platform with AWS services to amplify the value users get from our products and accelerate their journey to valuable, actionable insights. His background includes analytics, infrastructure, network security, and migrations.

Jade Koskela is a Principal Software Engineer at Salesforce. He has over a decade of experience building Tableau with a focus on areas including data connectivity, authentication, and identity federation.

Harshida Patel is a Principal Solutions Architect, Analytics with AWS.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Ravi Bhattiprolu is a Senior Partner Solutions Architect at AWS. He collaborates with strategic independent software vendor (ISV) partners like Salesforce and Tableau to design and deliver innovative, well-architected cloud products, integrations, and solutions to help joint AWS customers achieve their business goals.

Integrate Tableau and Okta with Amazon Redshift using AWS IAM Identity Center

2024-06-03 Debu Panda

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/integrate-tableau-and-okta-with-amazon-redshift-using-aws-iam-identity-center/

This blog post is co-written with Sid Wray and Jake Koskela from Salesforce, and Adiascar Cisneros from Tableau.

Amazon Redshift is a fast, scalable cloud data warehouse built to serve workloads at any scale. With Amazon Redshift as your data warehouse, you can run complex queries using sophisticated query optimization to quickly deliver results to Tableau, which offers a comprehensive set of capabilities and connectivity options for analysts to efficiently prepare, discover, and share insights across the enterprise. For customers who want to integrate Amazon Redshift with Tableau using single sign-on capabilities, we introduced AWS IAM Identity Center integration to seamlessly implement authentication and authorization.

IAM Identity Center provides capabilities to manage single sign-on access to AWS accounts and applications from a single location. Redshift now integrates with IAM Identity Center, and supports trusted identity propagation, making it possible to integrate with third-party identity providers (IdP) such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin. This integration positions Amazon Redshift as an IAM Identity Center-managed application, enabling you to use database role-based access control on your data warehouse for enhanced security. Role-based access control allows you to apply fine grained access control using row level, column level, and dynamic data masking in your data warehouse.

AWS and Tableau have collaborated to enable single sign-on support for accessing Amazon Redshift from Tableau. Tableau now supports single sign-on capabilities with Amazon Redshift connector to simplify the authentication and authorization. The Tableau Desktop 2024.1 and Tableau Server 2023.3.4 releases support trusted identity propagation with IAM Identity Center. This allows users to seamlessly access Amazon Redshift data within Tableau using their external IdP credentials without needing to specify AWS Identity and Access Management (IAM) roles in Tableau. This single sign-on integration is available for Tableau Desktop, Tableau Server, and Tableau Prep.

In this post, we outline a comprehensive guide for setting up single sign-on to Amazon Redshift using integration with IAM Identity Center and Okta as the IdP. By following this guide, you’ll learn how to enable seamless single sign-on authentication to Amazon Redshift data sources directly from within Tableau Desktop, streamlining your analytics workflows and enhancing security.

Solution overview

The following diagram illustrates the architecture of the Tableau SSO integration with Amazon RedShift, IAM Identity Center, and Okta.

Figure 1: Solution overview for Tableau integration with Amazon Redshift using IAM Identity Center and Okta

The solution depicted in Figure 1 includes the following steps:

The user configures Tableau to access Redshift using IAM Identity Center authentication
On a user sign-in attempt, Tableau initiates a browser-based OAuth flow and redirects the user to the Okta login page to enter the login credentials.
On successful authentication, Okta issues an authentication token (id and access token) to Tableau
Redshift driver then makes a call to Redshift-enabled IAM Identity Center application and forwards the access token.
Redshift passes the token to Identity Center and requests an access token.
Identity Center verifies/validates the token using the OIDC discovery connection to the trusted token issuer and returns an Identity Center generated access token for the same user. In Figure 1, Trusted Token Issuer (TTI) is the Okta server that Identity Center trusts to provide tokens that third-party applications like Tableau uses to call AWS services.
Redshift then uses the token to obtain the user and group membership information from IAM Identity Center.
Tableau user will be able to connect with Amazon Redshift and access data based on the user and group membership returned from IAM Identity Center.

Prerequisites

Before you begin implementing the solution, make sure that you have the following in place:

Setup IAM Identity Center and Amazon Redshift integration by following the steps in Integrate Identity Provider (IdP) with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On
Download and install the latest ODBC 2.X Driver.
Have installed Tableau Desktop 2024.1 or later.
Tableau Server 2023.3.4 and above version. For Tableau Server installation, please refer to Install and Configure Tableau Server.
An Okta account that has an active subscription. You need an admin role to set up the application on Okta. If you’re new to Okta, you can sign up for a free trial or for a developer account.

Walkthrough

In this walkthrough, you build the solution with following steps:

Set up the Okta OIDC application
Set up the Okta authorization server
Set up the Okta claims
Setup the Okta access policies and rules
Setup trusted token issuer in AWS IAM Identity Center
Setup client connections and trusted token issuers
Setup the Tableau OAuth config files for Okta
Install the Tableau OAuth config file for Tableau Desktop
Setup the Tableau OAuth config file for Tableau Server or Tableau Cloud
Federate to Amazon Redshift from Tableau Desktop
Federate to Amazon Redshift from Tableau Server

Set up the Okta OIDC application

To create an OIDC web app in Okta, you can follow the instructions in this video, or use the following steps to create the wep app in Okta admin console:

Note: The Tableau Desktop redirect URLs should always use localhost. The examples below also use localhost for the Tableau Server hostname for ease of testing in a test environment. For this setup, you should also access the server at localhost in the browser. If you decide to use localhost for early testing, you will also need to configure the gateway to accept localhost using this tsm command:

 tsm configuration set -k gateway.public.host -v localhost

In a production environment, or Tableau Cloud, you should use the full hostname that your users will access Tableau on the web, along with https. If you already have an environment with https configured, you may skip the localhost configuration and use the full hostname from the start.

Sign in to your Okta organization as a user with administrative privileges.
On the admin console, under Applications in the navigation pane, choose Applications.
Choose Create App Integration.
Select OIDC – OpenID Connect as the Sign-in method and Web Application as the Application type.
Choose Next.
In General Settings:
1. App integration name: Enter a name for your app integration. For example, Tableau_Redshift_App.
2. Grant type: Select Authorization Code and Refresh Token.
3. Sign-in redirect URIs: The sign-in redirect URI is where Okta sends the authentication response and ID token for the sign-in request. The URIs must be absolute URIs. Choose Add URl and along with the default URl, add the following URIs.
  - http://localhost:55556/Callback
  - http://localhost:55557/Callback
  - http://localhost:55558/Callback
  - http://localhost/auth/add_oauth_token
4. Sign-out redirect URIs: keep the default value as http://localhost:8080.
5. Skip the Trusted Origins section and for Assignments, select Skip group assignment for now.
6. Choose Save.

Figure 2: OIDC application

In the General Settings section, choose Edit and select Require PKCE as additional verification under Proof Key for Code Exchange (PKCE). This option indicates if a PKCE code challenge is required to verify client requests.
Choose Save.

Figure 3: OIDC App Overview

Select the Assignments tab and then choose Assign to Groups. In this example, we’re assigning awssso-finance and awssso-sales.
Choose Done.

Figure 4: OIDC application group assignments

For more information on creating an OIDC app, see Create OIDC app integrations.

Set up the Okta authorization server

Okta allows you to create multiple custom authorization servers that you can use to protect your own resource servers. Within each authorization server you can define your own OAuth 2.0 scopes, claims, and access policies. If you have an Okta Developer Edition account, you already have a custom authorization server created for you called default.

For this blog post, we use the default custom authorization server. If your application has requirements such as requiring more scopes, customizing rules for when to grant scopes, or you need more authorization servers with different scopes and claims, then you can follow this guide.

Figure 5: Authorization server

Set up the Okta claims

Tokens contain claims that are statements about the subject (for example: name, role, or email address). For this example, we use the default custom claim sub. Follow this guide to create claims.

Figure 6: Create claims

Setup the Okta access policies and rules

Access policies are containers for rules. Each access policy applies to a particular OpenID Connect application. The rules that the policy contains define different access and refresh token lifetimes depending on the nature of the token request. In this example, you create a simple policy for all clients as shown in Figure 7 that follows. Follow this guide to create access policies and rules.

Figure 7: Create access policies

Rules for access policies define token lifetimes for a given combination of grant type, user, and scope. They’re evaluated in priority order and after a matching rule is found, no other rules are evaluated. If no matching rule is found, then the authorization request fails. This example uses the role depicted in Figure 8 that follows. Follow this guide to create rules for your use case.

Figure 8: Access policy rules

Setup trusted token issuer in AWS IAM Identity Center

At this point, you switch to setting up the AWS configuration, starting by adding a trusted token issuer (TTI), which makes it possible to exchange tokens. This involves connecting IAM Identity Center to the Open ID Connect (OIDC) discovery URL of the external OAuth authorization server and defining an attribute-based mapping between the user from the external OAuth authorization server and a corresponding user in Identity Center. In this step, you create a TTI in the centralized management account. To create a TTI:

Open the AWS Management Console and navigate to IAM Identity Center, and then to the Settings page.
Select the Authentication tab and under Trusted token issuers, choose Create trusted token issuer.
On the Set up an external IdP to issue trusted tokens page, under Trusted token issuer details, do the following:
- For Issuer URL, enter the OIDC discovery URL of the external IdP that will issue tokens for trusted identity propagation. The administrator of the external IdP can provide this URL (for example, https://prod-1234567.okta.com/oauth2/default).

To get the issuer URL from Okta, sign in as an admin to Okta and navigate to Security and then to API and choose default under the Authorization Servers tab and copy the Issuer URL

Figure 9: Authorization server issuer

For Trusted token issuer name, enter a name to identify this trusted token issuer in IAM Identity Center and in the application console.
Under Map attributes, do the following:
- For Identity provider attribute, select an attribute from the list to map to an attribute in the IAM Identity Center identity store.
- For IAM Identity Center attribute, select the corresponding attribute for the attribute mapping.
Under Tags (optional), choose Add new tag, enter a value for Key and optionally for Value. Choose Create trusted token issuer. For information about tags, see Tagging AWS IAM Identity Center resources.

This example uses Subject (sub) as the Identity provider attribute to map with Email from the IAM identity Center attribute. Figure 10 that follows shows the set up for TTI.

Figure 10: Create Trusted Token Issuer

Setup client connections and trusted token issuers

In this step, the Amazon Redshift applications that exchange externally generated tokens must be configured to use the TTI you created in the previous step. Also, the audience claim (or aud claim) from Okta must be specified. In this example, you are configuring the Amazon Redshift application in the member account where the Amazon Redshift cluster or serverless instance exists.

Select IAM Identity Center connection from Amazon Redshift console menu.

Figure 11: Amazon Redshift IAM Identity Center connection

Select the Amazon Redshift application that you created as part of the prerequisites.
Select the Client connections tab and choose Edit.
Choose Yes under Configure client connections that use third-party IdPs.
Select the checkbox for Trusted token issuer which you have created in the previous section.
Enter the aud claim value under section Configure selected trusted token issuers. For example, okta_tableau_audience.

To get the audience value from Okta, sign in as an admin to Okta and navigate to Security and then to API and choose default under the Authorization Servers tab and copy the Audience value.

Figure 12: Authorization server audience

Note: The audience claim value must exactly match with IdP audience value otherwise your OIDC connection with third part application like Tableau will fail.

Choose Save.

Figure 13: Adding Audience Claim for Trusted Token Issuer

Setup the Tableau OAuth config files for Okta

At this point, your IAM Identity Center, Amazon Redshift, and Okta configuration are complete. Next, you need to configure Tableau.

Note: The XML file will be used for all the Tableau products including Tableau Desktop, Server, and Cloud.

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
<dbclass>redshift</dbclass>
<oauthConfigId>custom_redshift_okta</oauthConfigId>
<clientIdDesktop>$copy_client_id_from_okta_oidc_app</clientIdDesktop>
<clientSecretDesktop>$copy_client_secret_from_okta_oidc_app</clientSecretDesktop>
<redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
<authUri>https://$copy_okta_host_value.okta.com/oauth2/default/v1/authorize</authUri>
<tokenUri>https://$copy_okta_host_value.okta.com/oauth2/default/v1/token</tokenUri>
<scopes>openid</scopes>
<scopes>email</scopes>
<scopes>profile</scopes>
<scopes>offline_access</scopes>
<capabilities>
<entry>
<key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_REQUIRE_PKCE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_STATE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
<value>true</value>
</entry>
</capabilities>
<accessTokenResponseMaps>
<entry>
<key>ACCESSTOKEN</key>
<value>access_token</value>
</entry>
<entry>
<key>REFRESHTOKEN</key>
<value>refresh_token</value>
</entry>
<entry>
<key>id-token</key>
<value>id_token</value>
</entry>
<entry>
<key>access-token-issue-time</key>
<value>issued_at</value>
</entry>
<entry>
<key>access-token-expires-in</key>
<value>expires_in</value>
</entry>
<entry>
<key>username</key>
<value>preferred_username</value>
</entry>
</accessTokenResponseMaps>
</pluginOAuthConfig>

The following is an example XML file:

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
<dbclass>redshift</dbclass>
<oauthConfigId>custom_redshift_okta</oauthConfigId>
<clientIdDesktop>ab12345z-a5nvb-123b-123b-1c434ghi1234</clientIdDesktop>
<clientSecretDesktop>3243jkbkjb~~ewf.112121.3432423432.asd834k</clientSecretDesktop>
<redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
<authUri>https://prod-1234567.okta.com/oauth2/default/v1/authorize</authUri>
<tokenUri>https://prod-1234567.okta.com/oauth2/default/v1/token</tokenUri>
<scopes>openid</scopes>
<scopes>email</scopes>
<scopes>profile</scopes>
<scopes>offline_access</scopes>
<capabilities>
<entry>
<key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_REQUIRE_PKCE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_STATE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
<value>true</value>
</entry>
</capabilities>
<accessTokenResponseMaps>
<entry>
<key>ACCESSTOKEN</key>
<value>access_token</value>
</entry>
<entry>
<key>REFRESHTOKEN</key>
<value>refresh_token</value>
</entry>
<entry>
<key>id-token</key>
<value>id_token</value>
</entry>
<entry>
<key>access-token-issue-time</key>
<value>issued_at</value>
</entry>
<entry>
<key>access-token-expires-in</key>
<value>expires_in</value>
</entry>
<entry>
<key>username</key>
<value>preferred_username</value>
</entry>
</accessTokenResponseMaps>
</pluginOAuthConfig>

Install the Tableau OAuth config file for Tableau Desktop

Note: Currently this integration isn’t supported in macOS because the Redshift ODBC 2.X driver isn’t supported yet for MAC. It will be supported soon.

Setup the Tableau OAuth config file for Tableau Server or Tableau Cloud

To integrate with Amazon Redshift using IAM Identity Center authentication, you must install the Tableau OAuth config file in Tableau Server or Tableau Cloud

Sign in to the Tableau Server or Tableau Cloud using admin credentials.
Navigate to Settings.
Go to OAuth Clients Registry and select Add OAuth Client
Choose following settings:
- Connection Type: Amazon Redshift
- OAuth Provider: Custom_IdP
- Client ID: Enter your IdP client ID value
- Client Secret: Enter your client secret value
- Redirect URL: Enter http://localhost/auth/add_oauth_token. This example uses localhost for testing in a local environment. You should use the full hostname with https.
- Choose OAuth Config File. Select the XML file that you configured in the previous section.
- Select Add OAuth Client and choose Save.

Figure 14: Create an OAuth connection in Tableau Server or Tableau Cloud

Federate to Amazon Redshift from Tableau Desktop

Now you’re ready to connect to Amazon Redshift from Tableau through federated sign-in using IAM Identity Center authentication. In this step, you create a Tableau Desktop report and publish it to Tableau Server.

Open Tableau Desktop.
Select Amazon Redshift Connector and enter the following values:
1. Server: Enter the name of the server that hosts the database and the name of the database you want to connect to.
2. Port: Enter 5439.
3. Database: Enter your database name. This example uses dev.
4. Authentication: Select OAuth.
5. Federation Type: Select Identity Center.
6. Identity Center Namespace: You can leave this value blank.
7. OAuth Provider: This value should automatically be pulled from your configured XML. It will be the value from the element oauthConfigId.
8. Select Require SSL.
9. Choose Sign in.

Figure 15: Tableau Desktop OAuth connection

Enter your IdP credentials in the browser pop-up window.

Figure 16: Okta Login Page

When authentication is successful, you will see the message shown in Figure 17 that follows.

Figure 17: Successful authentication using Tableau

Congratulations! You’re signed in using IAM Identity Center integration with Amazon Redshift and are ready to explore and analyze your data using Tableau Desktop.

Figure 18: Successfully connected using Tableau Desktop

Figure 19 is a screenshot from the Amazon Redshift system table (sys_query_history) showing that user Ethan from Okta is accessing the sales report.

Figure 19: User audit in sys_query_history

After signing in, you can create your own Tableau Report on the desktop version and publish it to your Tableau Server. For this example, we created and published a report named SalesReport.

Federate to Amazon Redshift from Tableau Server

Sign in to the Tableau Server site as a non-admin user.
Navigate to Explore and go to the folder where your published report is stored.
Select the report and choose Sign In.

Figure 20: Tableau Server Sign In

To authenticate, enter your non-admin Okta credentials in the browser pop-up.

Figure 21: Okta Login Page

After your authentication is successful, you can access the report.

Figure 22: Tableau report

Clean up

Complete the following steps to clean up your resources:

Delete the IdP applications that you have created to integrate with IAM Identity Center.
Delete the IAM Identity Center configuration.
Delete the Amazon Redshift application and the Amazon Redshift provisioned cluster or serverless instance that you created for testing.
Delete the IAM role and IAM policy that you created for IAM Identity Center and Amazon Redshift integration.
Delete the permission set from IAM Identity Center that you created for Amazon Redshift Query Editor V2 in the management account.

Conclusion

This post covered streamlining access management for data analytics by using Tableau’s capability to support single sign-on based on the OAuth 2.0 OpenID Connect (OIDC) protocol. The solution enables federated user authentication, where user identities from an external IdP are trusted and propagated to Amazon Redshift. You walked through the steps to configure Tableau Desktop and Tableau Server to integrate seamlessly with Amazon Redshift using IAM Identity Center for single sign-on. By harnessing this integration of a third party IdP with IAM Identity Center, users can securely access Amazon Redshift data sources within Tableau without managing separate database credentials.

Listed below are key resources to learn more about Amazon Redshift integration with IAM Identity Center

About the Authors

Harshida Patel is a Principal Solutions Architect, Analytics with AWS.

Ravi Bhattiprolu is a Senior Partner Solutions Architect at Amazon Web Services (AWS). He collaborates with strategic independent software vendor (ISV) partners like Salesforce and Tableau to design and deliver innovative, well-architected cloud products, integrations, and solutions to help joint AWS customers achieve their business goals.

Federate IAM-based single sign-on to Amazon Redshift role-based access control with Okta

2023-12-12 Debu Panda

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/federate-iam-based-single-sign-on-to-amazon-redshift-role-based-access-control-with-okta/

Amazon Redshift accelerates your time to insights with fast, easy, and secure cloud data warehousing at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries.

You can use your preferred SQL clients to analyze your data in an Amazon Redshift data warehouse. Connect seamlessly by leveraging an identity provider (IdP) or single sign-on (SSO) credentials to connect to the Redshift data warehouse to reuse existing user credentials and avoid additional user setup and configuration. Using role-based access control (RBAC), you can simplify managing user privileges, database permissions and management of security permissions in Amazon Redshift. You can also use redshift database roles to define a set of elevated permissions, such as for a system monitor or database administrator.

Using AWS Identity and Access Management (IAM) with RBAC, organizations can simplify user management because you no longer need to create users and map them to database roles manually. You can define the mapped database roles as a principal tag for the IdP groups or IAM role, so Redshift database roles and users who are members of those IdP groups are granted to the database roles automatically.

Earlier in 2023, we launched support for Okta integration with Amazon Redshift Serverless using database roles. In this post, we focus on Okta as the IdP and provide step-by-step guidance to integrate a Redshift provisioned cluster with Okta using the Redshift Query Editor v2 and with SQL clients like SQL Workbench/J. You can use this mechanism with other IdP providers such as Azure Active Directory or Ping with any applications or tools using Amazon’s JDBC, ODBC, or Python driver.

Recently we also announced Amazon Redshift integration with AWS IAM Identity Center, supporting trusted identity propagation, allowing you to use third-party Identity Providers (IdP) such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin. This integration simplifies the authentication and authorization process for Amazon Redshift users using Query Editor V2 or Amazon Quicksight, making it easier for them to securely access your data warehouse. AWS IAM Identity Center offers automatic user and group provisioning from Okta to itself by utilizing the System for Cross-domain Identity Management (SCIM) 2.0 protocol. This integration allows for seamless synchronization of information between two services, ensuring accurate and up-to-date information in AWS IAM Identity Center. Refer to Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On blog post to learn more about setting up single sign-on (SSO) to Amazon Redshift using integration with IdC and Okta as the Identity Provider.

If you are interested in using IAM-based single sign-on with Amazon Redshift database roles then you can continue reading this blog.

Solution overview

The following diagram illustrates the authentication flow of Okta with a Redshift provisioned cluster using federated IAM roles and automatic database role mapping.

The workflow contains the following steps:

Either the user chooses an IdP app in their browser, or the SQL client initiates a user authentication request to the IdP (Okta).
Upon a successful authentication, Okta submits a request to the AWS federation endpoint with a SAML assertion containing the principal tags.
The AWS federation endpoint validates the SAML assertion and invokes the AWS Security Token Service (AWS STS) API AssumeRoleWithSAML. The SAML assertion contains the IdP user and group information that is stored in the RedshiftDbUser and RedshiftDbRoles principal tags, respectively. Temporary IAM credentials are returned to the SQL client or, if using the Query Editor v2, the user’s browser is redirected to the Query Editor v2 console using the temporary IAM credentials.
The temporary IAM credentials are used by the SQL client or Query Editor v2 to call the Redshift API GetClusterCredentialsWithIAM. This API uses the principal tags to determine the user and database roles that the user belongs to. An associated database user is created if the user is signing in for the first time and is granted the matching database roles automatically. A temporary password is returned to the SQL client.
Using the database user and temporary password, the SQL client or Query Editor v2 connects to Amazon Redshift. Upon login, the user is authorized based on the Redshift database roles that were assigned in Step 4.

Prerequisites

You need the following prerequisites to set up this solution:

An AWS account. If you don’t have one, you can sign up for one.
A Redshift provisioned cluster. For setup instructions, see Getting started with Amazon Redshift.
The latest Redshift JDBC SDK driver-dependent libraries (download the libraries and unzip the JDBC JAR zipped folder, driver version 2.1.0.18 and later).
An Okta account that has an active subscription. You need an admin role to set up the application on Okta. If you’re new to Okta, you can sign up for a free trial or sign up for a developer account.
A SQL client such as SQL Workbench/J.

Connect with a Redshift provisioned cluster as a federated user using Query Editor v2

To connect using Query Editor v2, complete the following steps:

Follow all the steps described in the sections Set up your Okta application and Set up AWS configuration in the following post.

For the Amazon Redshift access IAM policy, replace the policy with the following JSON to use the GetClusterCredentialsWithIAM API:

{
	"Version": "2012-10-17",
	"Statement": [
					{
						"Sid": "VisualEditor0",
						"Effect": "Allow",
						"Action": "redshift:GetClusterCredentialsWithIAM",
						"Resource": "arn:aws:redshift:us-west-2:123456789012:dbname:redshift-cluster-1/dev"
					}
				]
}

Now you’re ready to connect to your Redshift provisioned cluster using Query Editor v2 and federated login.

Use the SSO URL from Okta and log in to your Okta account with your user credentials. For this demo, we log in with user Ethan.
In Query Editor v2, choose your Redshift provisioned cluster (right-click) and choose Create connection.
For Authentication, select Temporary credentials using your IAM identity.
For Database, enter the database name you want to connect to.
Choose Create connection.
Run the following command to validate that you are logged in as a federated user and also to get the list of roles associated with that user for the current session:

SELECT current_user,* FROM pg_get_session_roles() eff_ro(name name, roleid integer);

Because Ethan is part of the sales group and has been granted permissions to access tables in the sales_schema, he should be able to access those tables without any issues. However, if he tries to access tables in the finance_schema, he would receive a permission denied error because Ethan is not part of the finance group in Okta.

Connect with a Redshift provisioned cluster as a federated user via a third-party client

To connect as a federated user via a third-party client, complete the following steps:

Follow steps 1 and 2 which are described in above section (Connect with a Redshift provisioned cluster as a federated user using Query Editor v2).
Use the Redshift JDBC driver v2.1.0.18 and above because it supports authentication with IAM group federation. For the URL, enter jdbc:redshift:iam://<cluster endpoint>:<port>:<databasename>?groupfederation=true. For example, jdbc:redshift:iam://redshift-cluster-1.abdef0abc0ab.us-west-2.redshift.amazonaws.com:5439/dev?groupfederation=true

In the preceding URL, groupfederation is a mandatory parameter that allows you to authenticate with the IAM credentials for the Redshift provisioned cluster. Without the groupfederation parameter, it will not use Redshift database roles.

For Username and Password, enter your Okta credentials.

To set up extended properties, follow Steps 4–9 in the section Configure the SQL client (SQL Workbench/J) in the following post.

User Ethan will be able to access the sales_schema tables. If Ethan tries to access the tables in the finance_schema, he will get a permission denied error.

Troubleshooting

If your connection didn’t work, consider the following:

Enable logging in the driver. For instructions, see Configure logging.
Make sure to use the latest Amazon Redshift JDBC driver version.
If you’re getting errors while setting up the application on Okta, make sure you have admin access.
If you can authenticate via the SQL client but get a permission issue or can’t see objects, grant the relevant permission to the role.

Clean up

When you’re done testing the solution, clean up the resources to avoid incurring future charges:

Delete the Redshift provisioned cluster.
Delete the IAM roles, IAM IdPs, and IAM policies.

Conclusion

In this post, we provided step-by-step instructions to integrate a Redshift provisioned cluster with Okta using the Redshift Query Editor v2 and SQL Workbench/J with the help of federated IAM roles and automatic database role mapping. You can use a similar setup with other SQL clients (such as DBeaver or DataGrip). We also showed how Okta group membership is mapped automatically with Redshift provisioned cluster roles to use role-based authentication seamlessly.

If you have any feedback or questions, please leave them in the comments.

About the Authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Use the Amazon Redshift Data API to interact with Amazon Redshift Serverless

2023-04-28 Debu Panda

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/use-the-amazon-redshift-data-api-to-interact-with-amazon-redshift-serverless/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics. Amazon Redshift Serverless makes it convenient for you to run and scale analytics without having to provision and manage data warehouses. With Redshift Serverless, data analysts, developers, and data scientists can now use Amazon Redshift to get insights from data in seconds by loading data into and querying records from the data warehouse.

As a data engineer or application developer, for some use cases, you want to interact with the Redshift Serverless data warehouse to load or query data with a simple API endpoint without having to manage persistent connections. With the Amazon Redshift Data API, you can interact with Redshift Serverless without having to configure JDBC or ODBC. This makes it easier and more secure to work with Redshift Serverless and opens up new use cases.

This post explains how to use the Data API with Redshift Serverless from the AWS Command Line Interface (AWS CLI) and Python. If you want to use the Data API with Amazon Redshift clusters, refer to Using the Amazon Redshift Data API to interact with Amazon Redshift clusters.

Introducing the Data API

The Data API enables you to seamlessly access data from Redshift Serverless with all types of traditional, cloud-native, and containerized serverless web service-based applications and event-driven applications.

The following diagram illustrates this architecture.

The Data API simplifies data access, ingest, and egress from programming languages and platforms supported by the AWS SDK such as Python, Go, Java, Node.js, PHP, Ruby, and C++.

The Data API simplifies access to Amazon Redshift by eliminating the need for configuring drivers and managing database connections. Instead, you can run SQL commands to Redshift Serverless by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours. The Data API federates AWS Identity and Access Management (IAM) credentials so you can use identity providers like Okta or Azure Active Directory or database credentials stored in Secrets Manager without passing database credentials in API calls.

For customers using AWS Lambda, the Data API provides a secure way to access your database without the additional overhead for Lambda functions to be launched in an Amazon VPC. Integration with the AWS SDK provides a programmatic interface to run SQL statements and retrieve results asynchronously.

Relevant use cases

The Data API is not a replacement for JDBC and ODBC drivers, and is suitable for use cases where you don’t need a persistent connection to a serverless data warehouse. It’s applicable in the following use cases:

Accessing Amazon Redshift from custom applications with any programming language supported by the AWS SDK. This enables you to integrate web service-based applications to access data from Amazon Redshift using an API to run SQL statements. For example, you can run SQL from JavaScript.
Building a serverless data processing workflow.
Designing asynchronous web dashboards because the Data API lets you run long-running queries without having to wait for them to complete.
Running your query one time and retrieving the results multiple times without having to run the query again within 24 hours.
Building your ETL pipelines with AWS Step Functions, Lambda, and stored procedures.
Having simplified access to Amazon Redshift from Amazon SageMaker and Jupyter notebooks.
Building event-driven applications with Amazon EventBridge and Lambda.
Scheduling SQL scripts to simplify data load, unload, and refresh of materialized views.

The Data API GitHub repository provides examples for different use cases for both Redshift Serverless and provisioned clusters.

Create a Redshift Serverless workgroup

If you haven’t already created a Redshift Serverless data warehouse, or want to create a new one, refer to the Getting Started Guide. This guide walks you through the steps of creating a namespace and workgroup with their names as default. Also, ensure that you have created an IAM role and make sure that the IAM role you attach to your Redshift Serverless namespace has AmazonS3ReadOnlyAccess permission. You can use the AWS Management Console to create an IAM role and assign Amazon Simple Storage Service (Amazon S3) privileges (refer to Loading in data from Amazon S3). In this post, we create a table and load data using the COPY command.

Prerequisites for using the Data API

You must be authorized to access the Data API. Amazon Redshift provides the RedshiftDataFullAccess managed policy, which offers full access to Data API. This policy also allows access to Redshift Serverless workgroups, Secrets Manager, and API operations needed to authenticate and access a Redshift Serverless workgroup by using IAM credentials.

You can also create your own IAM policy that allows access to specific resources by starting with RedshiftDataFullAccess as a template.

The Data API allows you to access your database either using your IAM credentials or secrets stored in Secrets Manager. In this post, we use IAM credentials.

When you federate your IAM credentials to connect with Amazon Redshift, it automatically creates a database user for the IAM user that is being used. It uses the GetCredentials API to get temporary database credentials. If you want to provide specific database privileges to your users with this API, you can use an IAM role with the tag name RedshiftDBRoles with a list of roles separated by colons. For example, if you want to assign database roles such as sales and analyst, you can have a value sales:analyst assigned to RedshiftDBRoles.

Use the Data API from the AWS CLI

You can use the Data API from the AWS CLI to interact with the Redshift Serverless workgroup and namespace. For instructions on configuring the AWS CLI, see Setting up the AWS CLI. The Amazon Redshift Serverless CLI (aws redshift-serverless) is a part of AWS CLI that lets you manage Amazon Redshift workgroups and namespaces, such as creating, deleting, setting usage limits, tagging resource, and more. The Data API provides a command line interface to the AWS CLI (aws redshift-data) that allows you to interact with the databases in Redshift Serverless.

You can invoke help using the following command:

aws redshift-data help

The following table shows you the different commands available with the Data API CLI.

Command	Description
`list-databases`	Lists the databases in a workgroup.
`list-schemas`	Lists the schemas in a database. You can filter this by a matching schema pattern.
`list-tables`	Lists the tables in a database. You can filter the tables list by a schema name pattern, a matching table name pattern, or a combination of both.
`describe-table`	Describes the detailed information about a table including column metadata.
`execute-statement`	Runs a SQL statement, which can be SELECT, DML, DDL, COPY, or UNLOAD.
`batch-execute-statement`	Runs multiple SQL statements in a batch as a part of single transaction. The statements can be SELECT, DML, DDL, COPY, or UNLOAD.
`cancel-statement`	Cancels a running query. To be canceled, a query must not be in the FINISHED or FAILED state.
`describe-statement`	Describes the details of a specific SQL statement run. The information includes when the query started, when it finished, the number of rows processed, and the SQL statement.
`list-statements`	Lists the SQL statements in the last 24 hours. By default, only finished statements are shown.
`get-statement-result`	Fetches the temporarily cached result of the query. The result set contains the complete result set and the column metadata. You can paginate through a set of records to retrieve the entire result as needed.

If you want to get help on a specific command, run the following command:

aws redshift-data list-tables help

Now we look at how you can use these commands.

List databases

Most organizations use a single database in their Amazon Redshift workgroup. You can use the following command to list the databases in your Serverless endpoint. This operation requires you to connect to a database and therefore requires database credentials.

aws redshift-data list-databases --database dev --workgroup-name default

List schemas

Similar to listing databases, you can list your schemas by using the list-schemas command:

aws redshift-data list-schemas --database dev --workgroup-name default

If you have several schemas that match demo (demo, demo2, demo3, and so on), you can optionally provide a pattern to filter your results matching to that pattern:

aws redshift-data list-schemas --database dev --workgroup-name default --schema-pattern "demo%"

List tables

The Data API provides a simple command, list-tables, to list tables in your database. You might have thousands of tables in a schema; the Data API lets you paginate your result set or filter the table list by providing filter conditions.

You can search across your schema with table-pattern; for example, you can filter the table list by a table name prefix across all your schemas in the database or filter your tables list in a specific schema pattern by using schema-pattern.

The following is a code example that uses both:

aws redshift-data list-tables --database dev --workgroup-name default --schema-pattern "demo%" --table-pattern “orders%”

Run SQL commands

You can run SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift with the Data API. You can optionally specify the –with-event option if you want to send an event to EventBridge after the query run, then the Data API will send the event with queryId and final run status.

Create a schema

Let’s use the Data API to see how you can create a schema. The following command lets you create a schema in your database. You don’t have to run this SQL if you have pre-created the schema. You have to specify –-sql to specify your SQL commands.

aws redshift-data execute-statement --database dev --workgroup-name default \
--sql "CREATE SCHEMA demo;"

The following shows an example output of execute-statement:

{
    "CreatedAt": "2023-04-07T17:14:43.038000+00:00",
    "Database": "dev",
    "DbUser": "IAMR:Admin",
    "Id": "8e4e5af3-9af9-4567-8e70-7849515b3a79",
    "WorkgroupName": "default"
}

We discuss later in this post how you can check the status of a SQL that you ran with execute-statement.

Create a table

You can use the following command to create a table with the CLI:

aws redshift-data execute-statement --database dev --workgroup-name default  \
   --sql "CREATE TABLE demo.green_201601( \
  vendorid                VARCHAR(4), \
  pickup_datetime         TIMESTAMP, \
  dropoff_datetime        TIMESTAMP, \
  store_and_fwd_flag      VARCHAR(1), \
  ratecode                INT, \
  pickup_longitude        FLOAT4, \
  pickup_latitude         FLOAT4, \
  dropoff_longitude       FLOAT4, \
  dropoff_latitude        FLOAT4, \
  passenger_count         INT, \
  trip_distance           FLOAT4, \
  fare_amount             FLOAT4, \
  extra                   FLOAT4, \
  mta_tax                 FLOAT4, \
  tip_amount              FLOAT4, \
  tolls_amount            FLOAT4, \
  ehail_fee               FLOAT4, \
  improvement_surcharge   FLOAT4, \
  total_amount            FLOAT4, \
  payment_type            VARCHAR(4),\
  trip_type               VARCHAR(4));"

Load sample data

The COPY command lets you load bulk data into your table in Amazon Redshift. You can use the following command to load data into the table we created earlier:

aws redshift-data execute-statement --database dev --workgroup-name default --sql "COPY demo.green_201601 \
FROM 's3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01' \
IAM_ROLE default \
DATEFORMAT 'auto' \
IGNOREHEADER 1 \
DELIMITER ',' \
IGNOREBLANKLINES \
REGION 'us-west-2';"

Retrieve data

The following query uses the table we created earlier:

aws redshift-data execute-statement --database dev --workgroup-name default --sql "SELECT ratecode,  \
COUNT(*) FROM demo.green_201601 WHERE \
trip_distance > 5 GROUP BY 1 ORDER BY 1;"

The following shows an example output:

{
    "CreatedAt": "2023-04-07T17:25:16.030000+00:00",
    "Database": "dev",
    "DbUser": "IAMR:Admin",
    "Id": "cae88c08-0bb4-4279-8845-d5a8fefafade",
    "WorkgroupName": "default"
}

You can fetch results using the statement ID that you receive as an output of execute-statement.

Check the status of a statement

You can check the status of your statement by using describe-statement. The output for describe-statement provides additional details such as PID, query duration, number of rows in and size of the result set, and the query ID given by Amazon Redshift. You have to specify the statement ID that you get when you run the execute-statement command. See the following command:

aws redshift-data describe-statement --id cae88c08-0bb4-4279-8845-d5a8fefafade \

The following is an example output:

{
     "CreatedAt": "2023-04-07T17:27:15.937000+00:00",
     "Duration": 2602410468,
     "HasResultSet": true,
     "Id": "cae88c08-0bb4-4279-8845-d5a8fefafade",
     "QueryString": " SELECT ratecode, COUNT(*) FROM 
     demo.green_201601 WHERE
     trip_distance > 5 GROUP BY 1 ORDER BY 1;",
     "RedshiftPid": 1073815670,
     "WorkgroupName": "default",
     "UpdatedAt": "2023-04-07T17:27:18.539000+00:00"
}

The status of a statement can be STARTED, FINISHED, ABORTED, or FAILED.

Run SQL statements with parameters

You can run SQL statements with parameters. The following example uses two named parameters in the SQL that is specified using a name-value pair:

aws redshift-data execute-statement --database dev --workgroup-name default --sql "select sellerid,sum(pricepaid) totalsales from sales where eventid >= :eventid and sellerid > :selrid group by sellerid"  --parameters "[{\"name\": \"selrid\", \"value\": \"100\"},{\"name\": \"eventid\", \"value\": \"100\"}]"

The describe-statement returns QueryParameters along with QueryString.

You can map the name-value pair in the parameters list to one or more parameters in the SQL text, and the name-value parameter can be in random order. You can’t specify a NULL value or zero-length value as a parameter.

Cancel a running statement

If your query is still running, you can use cancel-statement to cancel a SQL query. See the following command:

aws redshift-data cancel-statement --id 39a0de2f-e85e-45ff-a0d7-cd074c348120

Fetch results from your query

You can fetch the query results by using get-statement-result. The query result is stored for 24 hours. See the following command:

aws redshift-data get-statement-result --id 7b61da88-1b11-4ade-956a-21085a29118d

The output of the result contains metadata such as the number of records fetched, column metadata, and a token for pagination.

Run multiple SQL statements

You can run multiple SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift in a single transaction with the Data API. The batch-execute-statement enables you to create tables and run multiple COPY commands or create temporary tables as part of your reporting system and run queries on that temporary table. See the following code:

aws redshift-data batch-execute-statement --database dev --workgroup-name default \
--sqls "create temporary table mysales \
(firstname, lastname, total_quantity ) as \
SELECT firstname, lastname, total_quantity \
FROM   (SELECT buyerid, sum(qtysold) total_quantity \
        FROM  sales  \
        GROUP BY buyerid \
        ORDER BY total_quantity desc limit 10) Q, users \
WHERE Q.buyerid = userid \ 
ORDER BY Q.total_quantity desc;" "select * from mysales limit 100;"

The describe-statement for a multi-statement query shows the status of all sub-statements:

{

{
"CreatedAt": "2023-04-10T14:01:11.257000-07:00",
"Duration": 30564173,
"HasResultSet": true,
"Id": "23d99d7f-fd13-4686-92c8-e2c279715c21",
"RedshiftPid": 1073922185,
"RedshiftQueryId": 0,
"ResultRows": -1,
"ResultSize": -1,
"Status": "FINISHED",
"SubStatements": [
{
"CreatedAt": "2023-04-10T14:01:11.357000-07:00",
"Duration": 12779028,
"HasResultSet": false,
"Id": "23d99d7f-fd13-4686-92c8-e2c279715c21:1",
"QueryString": "create temporary table mysales (firstname, lastname,
total_quantity ) as \nSELECT firstname, lastname, total_quantity \nFROM (SELECT
buyerid, sum(qtysold) total_quantity\nFROM sales\nGROUP BY
buyerid\nORDER BY total_quantity desc limit 10) Q, users\nWHERE Q.buyerid =
userid\nORDER BY Q.total_quantity desc;",
"RedshiftQueryId": 0,
"ResultRows": 0,
"ResultSize": 0,
"Status": "FINISHED",
"UpdatedAt": "2023-04-10T14:01:11.807000-07:00"
},
{
"CreatedAt": "2023-04-10T14:01:11.357000-07:00",
"Duration": 17785145,
"HasResultSet": true,
"Id": "23d99d7f-fd13-4686-92c8-e2c279715c21:2",
"QueryString": ""select *\nfrom mysales limit 100;",
"RedshiftQueryId": 0,
"ResultRows": 40,
"ResultSize": 1276,
"Status": "FINISHED",
"UpdatedAt": "2023-04-10T14:01:11.911000-07:00"
}
],
"UpdatedAt": "2023-04-10T14:01:11.970000-07:00",
"WorkgroupName": "default"
}

In the preceding example, we had two SQL statements and therefore the output includes the ID for the SQL statements as 23d99d7f-fd13-4686-92c8-e2c279715c21:1 and 23d99d7f-fd13-4686-92c8-e2c279715c21:2. Each sub-statement of a batch SQL statement has a status, and the status of the batch statement is updated with the status of the last sub-statement. For example, if the last statement has status FAILED, then the status of the batch statement shows as FAILED.

You can fetch query results for each statement separately. In our example, the first statement is a SQL statement to create a temporary table, so there are no results to retrieve for the first statement. You can retrieve the result set for the second statement by providing the statement ID for the sub-statement:

aws redshift-data get-statement-result --id 23d99d7f-fd13-4686-92c8-e2c279715c21:2

Use the Data API with Secrets Manager

The Data API allows you to use database credentials stored in Secrets Manager. You can create a secret type as Other type of secret and then specify username and password. Note you can’t choose an Amazon Redshift cluster because Redshift Serverless is different than a cluster.

Let’s assume that you created a secret key for your credentials as defaultWG. You can use the secret-arn parameter to pass your secret key as follows:

aws redshift-data list-tables --database dev --workgroup-name default --secret-arn defaultWG --region us-west-1

Export the data

Amazon Redshift allows you to export from database tables to a set of files in an S3 bucket by using the UNLOAD command with a SELECT statement. You can unload data in either text or Parquet format. The following command shows you an example of how to use the data lake export with the Data API:

aws redshift-data execute-statement --database dev --workgroup-name default --sql "unload ('select * from demo.green_201601') to '<your-S3-bucket>' iam_role '<your-iam-role>'; "

You can use batch-execute-statement if you want to use multiple statements with UNLOAD or combine UNLOAD with other SQL statements.

Use the Data API from the AWS SDK

You can use the Data API in any of the programming languages supported by the AWS SDK. For this post, we use the AWS SDK for Python (Boto3) as an example to illustrate the capabilities of the Data API.

We first import the Boto3 package and establish a session:

import botocore.session as bc
import boto3

def get_client(service, endpoint=None, region="us-west-2"):
    session = bc.get_session()
    s = boto3.Session(botocore_session=session, region_name=region)
    if endpoint:
        return s.client(service, endpoint_url=endpoint)
    return s.client(service)

Get a client object

You can create a client object from the boto3.Session object and using RedshiftData:

rsd = get_client('redshift-data')

If you don’t want to create a session, your client is as simple as the following code:

import boto3
client = boto3.client('redshift-data')

Run a statement

The following example code uses the Secrets Manager key to run a statement. For this post, we use the table we created earlier. You can use DDL, DML, COPY, and UNLOAD in the SQL parameter:

resp = rsd.execute_statement(
    WorkgroupName ="default",
Database = "dev",
Sql = "SELECT ratecode, COUNT(*) totalrides FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;" 
)

As we discussed earlier, running a query is asynchronous; running a statement returns an ExecuteStatementOutput, which includes the statement ID.

If you want to publish an event to EventBridge when the statement is complete, you can use the additional parameter WithEvent set to true:

resp = rsd.execute_statement(
    Database="dev",
    WorkgroupName="default",
    Sql="SELECT ratecode, COUNT(*) totalrides FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;",
WithEvent=True
)

Describe a statement

You can use describe_statement to find the status of the query and number of records retrieved:

id=resp['Id']
desc = rsd.describe_statement(Id=id)
if desc["Status"] == "FINISHED":
    print(desc["ResultRows"])

Fetch results from your query

You can use get_statement_result to retrieve results for your query if your query is complete:

if desc and desc["ResultRows"]  > 0:
    result = rsd.get_statement_result(Id=qid)

The get_statement_result command returns a JSON object that includes metadata for the result and the actual result set. You might need to process the data to format the result if you want to display it in a user-friendly format.

Fetch and format results

For this post, we demonstrate how to format the results with the Pandas framework. The post_process function processes the metadata and results to populate a DataFrame. The query function retrieves the result from a database in an Amazon Redshift cluster. See the following code:

import pandas as pd

def post_process(meta, records):
    columns = [k["name"] for k in meta]
    rows = []
    for r in records:
        tmp = []
        for c in r:
            tmp.append(c[list(c.keys())[0]])
        rows.append(tmp)
    return pd.DataFrame(rows, columns=columns)

def query(sql, workgroup="default ", database="dev"):
    resp = rsd.execute_statement(
        Database=database,
        WorkgroupName=workgroup,
        Sql=sql
    )
    qid = resp["Id"]
    print(qid)
    desc = None
    while True:
        desc = rsd.describe_statement(Id=qid)
        if desc["Status"] == "FINISHED" or desc["Status"] == "FAILED":
            break
    	print(desc["ResultRows"])
    if desc and desc["ResultRows"]  > 0:
        result = rsd.get_statement_result(Id=qid)
        rows, meta = result["Records"], result["ColumnMetadata"]
        return post_process(meta, rows)

pf=query("select * from demo.customer_activity limit 100;")
print(pf)

In this post, we demonstrated using the Data API with Python with Redshift Serverless. However, you can use the Data API with other programming languages supported by the AWS SDK. You can read how Roche democratized access to Amazon Redshift data using the Data API with Google Sheets. You can also address this type of use case with Redshift Serverless.

Best practices

We recommend the following best practices when using the Data API:

Federate your IAM credentials to the database to connect with Amazon Redshift. Redshift Serverless allows users to get temporary database credentials with GetCredentials. Redshift Serverless scopes the access to the specific IAM user and the database user is automatically created.
Use a custom policy to provide fine-grained access to the Data API in the production environment if you don’t want your users to use temporary credentials. You have to use Secrets Manager to manage your credentials in such use cases.
Don’t retrieve a large amount of data from your client and use the UNLOAD command to export the query results to Amazon S3. You’re limited to retrieving only 100 MB of data with the Data API.
Don’t forget to retrieve your results within 24 hours; results are stored only for 24 hours.

Conclusion

In this post, we introduced how to use the Data API with Redshift Serverless. We also demonstrated how to use the Data API from the Amazon Redshift CLI and Python using the AWS SDK. Additionally, we discussed best practices for using the Data API.

To learn more, refer to Using the Amazon Redshift Data API or visit the Data API GitHub repository for code examples.

About the authors

Debu Panda is a Senior Manager, Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Fei Peng is a Software Dev Engineer working in the Amazon Redshift team.

Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML

2020-12-08 Debu Panda

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/create-train-and-deploy-machine-learning-models-in-amazon-redshift-using-sql-with-amazon-redshift-ml/

Amazon Redshift is the most popular, fully managed, and petabyte-scale data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to leverage this data to train machine learning (ML) models, which can then be used to generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies.

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. Amazon Redshift ML allows you to use your data in Amazon Redshift with Amazon SageMaker, a fully managed ML service, without requiring you to become experts in ML. This post shows you how you use familiar SQL statements to create and train ML models from data in Amazon Redshift and use these models to make in-database predictions on new data for use cases such as churn prediction and fraud risk scoring.

ML use cases relevant to data warehousing

You may use different ML approaches according to what’s relevant for your business, such as supervised, unsupervised, and reinforcement learning. With this release, Amazon Redshift ML supports supervised learning, which is most commonly used in enterprises for advanced analytics. As evident in the following diagram, supervised learning is preferred when you have a training dataset and an understanding of how specific input data predicts various business outcomes. The inputs used for the ML model are often referred to as features, and the outcomes or results are called targets or labels. Your training dataset is a table or a query whose attributes or columns comprise features, and targets are extracted from your data warehouse. The following diagram illustrates this architecture.

You can use supervised training for advanced analytics use cases ranging from forecasting and personalization to customer churn prediction. Let’s consider a customer churn prediction use case. The columns that describe customer information and usage are features, and the customer status (active vs. inactive) is the target or label.

The following table shows different types of use cases and algorithms used.

Use Case	Algorithm / Problem Type
Customer churn prediction	Classification
Predict if a sales lead will close	Classification
Fraud detection	Classification
Price and revenue prediction	Linear regression
Customer lifetime value prediction	Linear regression
Detect if a customer is going to default a loan	Logistic regression

Current ways to use ML in your data warehouse

You may rely on ML experts to build and train models on your behalf or invest a lot of time learning new tools and technology to do so yourself. For example, you might need to identify the appropriate ML algorithms in SageMaker or use Amazon SageMaker Autopilot for your use case, and then export the data from your data warehouse and prepare the training data to work with these model types.

Data analysts and database developers are familiar with SQL. Unfortunately, you often have to learn a new programming language (such as Python or R) to build, train, and deploy ML models in SageMaker. When the model is deployed and you want to use it with new data for making predictions (also known as inference), you need to repeatedly move the data back and forth between Amazon Redshift and SageMaker through a series of manual and complicated steps:

Export training data to Amazon Simple Storage Service (Amazon S3).
Train the model in SageMaker.
Export prediction input data to Amazon S3.
Use prediction in SageMaker.
Import predicted columns back into the database.

The following diagram illustrates this workflow.

This iterative process is time-consuming and prone to errors, and automating the data movement can take weeks or months of custom coding that then needs to be maintained. Amazon Redshift ML enables you to use ML with your data in Amazon Redshift without this complexity.

Introducing Amazon Redshift ML

To create an ML model, as a data analyst, you can use a simple SQL query to specify the data in Amazon Redshift you want to use as the data inputs to train your model and the output you want to predict. For example, to create a model that predicts customer churn, you can query columns in one or more tables in Amazon Redshift that include the customer profile information and historical account activity as the inputs, and the column showing whether the customer is active or inactive as the output you want to predict.

When you run the SQL command to create the model, Amazon Redshift ML securely exports the specified data from Amazon Redshift to Amazon S3 and calls SageMaker Autopilot to automatically prepare the data, select the appropriate pre-built algorithm, and apply the algorithm for model training. Amazon Redshift ML handles all the interactions between Amazon Redshift, Amazon S3, and SageMaker, abstracting the steps involved in training and compilation. After the model is trained, Amazon Redshift ML makes it available as a SQL function in your Amazon Redshift data warehouse by compiling it via Amazon SageMaker Neo. The following diagram illustrates this solution.

Benefits of Amazon Redshift ML

Amazon Redshift ML provides the following benefits:

Allows you to create and train ML models with simple SQL commands without having to learn external tools
Provides you with flexibility to use automatic algorithm selection
Automatically preprocesses data and creates, trains, and deploys models
Enables advanced users to specify problem type
Enables ML experts such as data scientists to select algorithms such as XGBoost and specify hyperparameters and preprocessors
Enables you to generate predictions using SQL without having to ship data outside your data warehouse
Allows you to pay only for training; prediction is included with the costs of your cluster (typically, ML predictions drive cost in production)

In this post, we look at a simple example that you can use to get started with Amazon Redshift ML.

To train data for a model that predicts customer churn, SageMaker Autopilot preprocesses the training data, finds the algorithm that provides the best accuracy, and applies it to the training data to build a performant model.

We provide step-by-step guidance to create a cluster, create sample schema, load data, create your first ML model in Amazon Redshift, and invoke the prediction function from your queries.

Prerequisites for enabling Amazon Redshift ML

As an Amazon Redshift administrator, the following steps are required to create your Amazon Redshift cluster for using Amazon Redshift ML:

On the Amazon S3 console, create an S3 bucket that Amazon Redshift ML uses for uploading the training data that SageMaker uses to train the model. For this post, we name the bucket redshiftml-<your_account_id>. Ensure that you create your S3 bucket in the same AWS region where you will create your Amazon Redshift cluster.
Create an AWS Identity and Access Management (IAM role) named RedshiftML with the policy that we provided below. While it is easy to get started with AmazonS3FullAccess and AmazonSageMakerFullAccess, we recommend using a minimal policy that we provided below (If you already have an existing IAM role, then just add these to that role):

To use or modify this policy, replace <your-account-id> with your AWS account number. Note that the policy assumes that you have created the IAM role with the name RedshiftML and the S3 bucket with the name redshiftml-<your_account_id>. The S3 bucket redshift-downloads is from where we will load the sample data used in this blog.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetAuthorizationToken",
                "ecr:GetDownloadUrlForLayer",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:PutLogEvents",
                "sagemaker:*Job*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole",
                "s3:AbortMultipartUpload",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:iam::<your-account-id>:role/Redshift-ML",
                "arn:aws:s3:::redshiftml-<your-account-id>/*",
                "arn:aws:s3:::redshift-downloads/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::redshiftml-<your-account-id>",
                "arn:aws:s3:::redshift-downloads"
            
            ]
        }
    ]
}

For instructions, see Creating IAM roles.

Choose Edit trust relationship

Enter the following trust relationship definition to trust SageMaker:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "redshift.amazonaws.com",
          "sagemaker.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

On the Amazon Redshift console, create a new Amazon Redshift cluster.
Attach the IAM policy that you created earlier (RedshiftML).
Create the cluster with the preview track (SQL_PREVIEW).
You can select the preview by turning off default configuration and choosing the maintenance option.

When your cluster creation is complete and the cluster is up and running, you can create accounts for data analysts on an Amazon Redshift cluster. For this post, we create a user named demouser.

Use the Amazon Redshift query editor or your preferred SQL client to connect to Amazon Redshift as an administrator and run the following command:

create user demouser with password '<yourpassword>';

Grant CREATE MODEL privileges to your users. The following code grants privileges to the demouser user for creating a model:
```
GRANT CREATE MODEL TO demouser;
```

Loading sample data

We use a customer churn model in this post. As an admin or database developer, you have to first create the schema and load data into Amazon Redshift. This dataset is attributed to the University of California Irvine Repository of Machine Learning Datasets. We have modified this data for use with Amazon Redshift ML.

Create a schema named demo_ml that stores the example table and the ML model that we create:
```
CREATE SCHEMA DEMO_ML;
```

In the next steps, we create the sample table and load data into the table that we use to train the ML model.

Create the table in the demo_ml schema:

CREATE TABLE demo_ml.customer_activity (
state varchar(2), 
account_length int, 
area_code int,
phone varchar(8), 
intl_plan varchar(3), 
vMail_plan varchar(3),
vMail_message int, 
day_mins float, 
day_calls int, 
day_charge float,
total_charge float,
eve_mins float, 
eve_calls int, 
eve_charge float, 
night_mins float,
night_calls int, 
night_charge float, 
intl_mins float, 
intl_calls int,
intl_charge float, 
cust_serv_calls int, 
churn varchar(6),
record_date date);

Load the sample data by using the following command. Replace your IAM role and account ID appropriate for your environment.

COPY DEMO_ML.customer_activity 
FROM 's3://redshift-downloads/redshift-ml/customer_activity/' 
IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML' delimiter ',' IGNOREHEADER 1  
region 'us-east-1';

The demouser user should also have the usual SELECT access to the tables with the data used for training:
```
GRANT SELECT on demo_ml.customer_activity TO demouser;
```

You need to also grant CREATE and USAGE on the schema to allow users to create models and query using the ML inference functions on the demo_ml schema:
```
GRANT CREATE, USAGE ON SCHEMA demo_ml TO demouser;
```

Now the analyst (demouser) can train a model.

Creating and training your first ML model

Use your favorite SQL client to connect to your Amazon Redshift cluster as the demouser user that your admin created. You have to run the following command to create your model named customer_churn_model:

CREATE MODEL demo_ml.customer_churn_model
FROM (SELECT state,
             area_code,
             total_charge/account_length AS average_daily_spend, 
             cust_serv_calls/account_length AS average_daily_cases,
             churn 
      FROM demo_ml.customer_activity
         WHERE record_date < '2020-01-01' 

     )
TARGET churn
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::<accountID>:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'redshiftml-<your-account-id>'
)
;

The SELECT query in the FROM clause specifies the training data. The TARGET clause specifies which column is the label that the CREATE MODEL builds a model to predict. The other columns in the training query are the features (input) used for the prediction. In this example, the training data provides the features state, area code, average daily spend, and average daily cases for the customers that have been active accounts for earlier than January 1, 2020. The target column churn indicates whether the customer still has an active membership or has suspended their membership. For more information about CREATE MODEL syntax, see Amazon Redshift developers guide. After the model is created, you can run queries to make predictions.

Checking the status of your ML model

You can check the status of your models by running the SHOW MODEL command from your SQL prompt.

Enter the SHOW MODEL ALL command to see all the models that you have access to:

SHOW MODEL ALL

`SchemaName`	`ModelName`
`demo_ml`	`customer_churn_model`

Enter the SHOW MODEL command with your model name to see the status for a specific model:

SHOW MODEL demo_ml.customer_churn_model

The following output provides the status of your model:

Key						Value
Model Name				customer_churn_model
Schema Name				demo_ml
Owner					awsuser
Creation Time			"Tue, 24.11.2020 07:02:51"
Model State				READY
validation:			
f1,						0.681240
Estimated Cost			0.990443
TRAINING DATA:,
Query	"SELECT STATE, AREA_CODE, TOTAL_CHARGE/ACCOUNT_LENGTH AS AVERAGE_DAILY_SPEND, CUST_SERV_CALLS/ACCOUNT_LENGTH AS AVERAGE_DAILY_CASES, CHURN"
FROM DEMO_ML.CUSTOMER_ACTIVITY
WHERE ACCOUNT_LENGTH > 120
Target Column,			CHURN

PARAMETERS:,
Model Type					auto
Problem Type				BinaryClassification
Objective					F1
Function Name				predict_customer_churn
Function Parameters,		"state area_code average_daily_spend  
average_daily_cases "
Function Parameter Types 	"varchar int4 float8 int4 "
IAM Role					arn:aws:iam::99999999999:role/RedshiftML
s3 Bucket					redshiftml

Testing

Evaluating your model performance

You can see the F1 value for the example model customer_churn_model in the output of the SHOW MODEL command. The F1 amount signifies the statistical measure of the precision and recall of all the classes in the model. The value ranges between 0–1; the higher the score, the better the accuracy of the model.

You can use the following example SQL query as an illustration to see which predictions are incorrect based on the ground truth:

WITH infer_data AS (
  SELECT area_code ||phone  accountid, churn,
    demo_ml.predict_customer_churn( 
          state,
          area_code, 
          total_charge/account_length , 
          cust_serv_calls/account_length ) AS predicted
  FROM demo_ml.customer_activity
WHERE record_date <  '2020-01-01'

)
SELECT *  FROM infer_data where churn!=predicted;

Invoking your ML model for inference

You can use your SQL function to apply the ML model to your data in queries, reports, and dashboards. For example, you can run the predict_customer_churn SQL function on new customer data in Amazon Redshift regularly to predict customers at risk of churning and feed this information to sales and marketing teams so they can take preemptive actions, such as sending these customers an offer designed to retain them.

For example, you can run the following query to predict which customers in area_code 408 might churn:

SELECT area_code ||phone  accountid, 
       demo_ml.predict_customer_churn( 
          state,
          area_code, 
          total_charge/account_length , 
          cust_serv_calls/account_length )
          AS "predictedActive"
FROM demo_ml.customer_activity
WHERE area_code='408' and record_date > '2020-01-01';

The following output shows the account ID and whether the account is predicted to remain active.

`accountId`	`predictedActive`
`408393-7984`	`False.`
`408357-3817`	`True.`
`408418-6412`	`True.`
`408395-2854`	`True.`
`408372-9976`	`False.`
`408353-3061`	`True.`

Providing privileges to invoke the prediction function

As the model owner, you can grant EXECUTE on the prediction function to business analysts to use the model. The following code grants the EXECUTE privilege to marketing_analyst_grp. The marketing_analyst_grp should have the USAGE granted on the demo_ml schema:

GRANT EXECUTE demo_ml.predict_customer_churn TO marketing_analyst_grp

Cost control

Amazon Redshift ML leverages your existing cluster resources for prediction so you can avoid additional Amazon Redshift charges. There is no additional Amazon Redshift charge for creating or using a model, and prediction happens locally in your Amazon Redshift cluster, so you don’t have to pay extra unless you need to resize your cluster.

The CREATE MODEL request uses SageMaker for model training and Amazon S3 for storage, and incurs additional expense. The cost depends on the number of cells in your training data, where the number of cells is the product of the number of records (in the training query or table) times the number of columns. For example, if the SELECT query of the CREATE MODEL produces 10,000 records for training and each record has five columns, then the number of cells in the training data is 50,000. You can control the training cost by setting the MAX_CELLS. If you don’t, the default value of MAX_CELLS is 1 million.

If the training data produced by the SELECT query of the CREATE MODEL exceeds the MAX_CELLS limit you provided (or the default one million, in case you didn’t provide one) the CREATE MODEL randomly chooses approximately MAX_CELLS divided by number of columns records from the training dataset and trains using these randomly chosen tuples. The random choice ensures that the reduced training dataset doesn’t have any bias. Therefore, by setting the MAX_CELLS, you can keep your cost within your limits. See the following code:

CREATE MODEL demo_ml.customer_churn_model
FROM (SELECT state,
             area_code,
             total_charge/account_length AS average_daily_spend, 
             cust_serv_calls/account_length AS average_daily_cases,
             churn 
      FROM demo_ml.customer_activity
      WHERE account_length > 120 
     )
TARGET churn
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::<acountID>:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'redshiftml_<your_account_id>',
   MAX_CELLS 10000
)
;

For more information about costs associated with various cell numbers and free trial details, see Amazon Redshift pricing.

An alternate method of cost control is the MAX_RUNTIME parameter, also specified as a CREATE MODEL setting. If the training job in SageMaker exceeds the specified MAX_RUNTIME seconds, the CREATE MODEL ends the job.

The prediction functions run within your Amazon Redshift cluster, and you don’t incur additional expense there.

Customer feedback

“At Rackspace Technology we help companies elevate their AI/ML operations. We’re excited about the new Amazon Redshift ML feature because it will make it easier for our mutual Redshift customers to use ML on their Redshift with a familiar SQL interface. The seamless integration with Amazon SageMaker will empower data analysts to use data in new ways, and provide even more insight back to the wider organization.” – Nihar Gupta, General Manager for Data Solutions, Rackspace Technology

“We have always been looking for a unified platform that will enable both data processing and machine learning model training/scoring. Amazon Redshift has been our preferred data warehouse for processing large volumes of customer transactional data and we are increasingly leveraging Amazon SageMaker for model training and scoring. Until now, we had to move the data back and forth between the two for the ML steps in pipelines, which is quite time consuming and error prone. With the ML feature embedded, Amazon Redshift becomes that unified platform we have been looking for which will significantly simplify our ML pipelines.” – Srinivas Chilukuri, Principal – AI Center of Excellence, ZS Associates

Conclusion

In this post, we briefly discussed ML use cases relevant for data warehousing. We introduced Amazon Redshift ML and outlined how it enables SQL users to create, train, deploy, and use ML with simple SQL commands without learning external tools. We also provided an example of how to get started with Amazon Redshift ML.

Amazon Redshift ML also enables ML experts such as data scientists to quickly create ML models to simplify their pipeline and eliminate the need to export data from Amazon Redshift. We will discuss how data scientists can use Amazon Redshift ML in a future post.

About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 20 years of experience in the IT world.

Yannis Papakonstantinou is a senior principal scientist at AWS and professor (on leave) of University of California at San Diego whose research on querying nested and semi-structured data, data integration, and the use and maintenance of materialized views has received over 16,500 citations.

Murali Balakrishnan Narayanaswamy is a senior machine learning scientist at AWS and received a PhD from Carnegie Mellon University on the intersection of AI, optimization, learning and inference to combat uncertainty in real-world applications.

Sriram Krishnamurthy is a software development manager for the Amazon Redshift query processing team and has been working on semi-structured data processing and SQL compilation and execution for over 15 years.

Sudipta Sengupta is a senior principal technologist at AWS who leads new initiatives in AI/ML, databases, and analytics and holds a Ph.D. in electrical engineering and computer science from Massachusetts Institute of Technology.

Stefano Stefani is a VP and distinguished engineer at AWS and has served as chief technologist for Amazon DynamoDB, Amazon Redshift, Amazon Aurora, Amazon SageMaker, and other services.

Using the Amazon Redshift Data API to interact with Amazon Redshift clusters

2020-09-16 Debu Panda

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/using-the-amazon-redshift-data-api-to-interact-with-amazon-redshift-clusters/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL, and your existing ETL, Business Intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics. As a data engineer or application developer, for some use cases, you want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. Now, with the general availability of the Amazon Redshift Data API, you can interact with Amazon Redshift without having to configure JDBC or ODBC. This makes it easier and more secure to work with Amazon Redshift and opens up new use cases.

This introduction explains how to use the Amazon Redshift Data API from the AWS Command Line Interface (CLI) and Python. We also explain how to use AWS Secrets Manager to store and retrieve credentials for the Data API.

Introducing the Data API

The Amazon Redshift Data API enables you to painlessly access data from Amazon Redshift with all types of traditional, cloud-native, and containerized, serverless web services-based applications and event-driven applications. The following diagram illustrates this architecture.

The Amazon Redshift Data API simplifies data access, ingest, and egress from programming languages and platforms supported by the AWS SDK such as Python, Go, Java, Node.js, PHP, Ruby, and C++.

The Data API simplifies access to Amazon Redshift by eliminating the need for configuring drivers and managing database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours. The Data API federates AWS Identity and Access Management (IAM) credentials so you can use identity providers like Okta or Azure Active Directory or database credentials stored in Secrets Manager without passing database credentials in API calls.

Relevant use cases

The Amazon Redshift Data API is not a replacement for JDBC and ODBC drivers, and is suitable for use cases where you don’t need a persistent connection to a cluster. It’s applicable in the following use cases:

Integrating web services-based applications to access data from Amazon Redshift using an API to run SQL statements. For example, you can run SQL from JavaScript.
Running a long-running query without having to wait for it to complete.
Running your query one time and retrieving the results multiple times without having to run the query again.
Building your ETL pipelines with AWS Step Functions, Lambda, and stored procedures.
Having simplified access to Amazon Redshift from Amazon SageMaker and Jupyter notebooks.
Building event-driven applications with Amazon EventBridge and Lambda.
Scheduling SQL scripts to simplify data load, unload, and refreshing of materialized views.

Creating an Amazon Redshift cluster

If you haven’t already created a Redshift cluster, or want to create a new one and aren’t sure how, follow these steps. In this post, we create a table and load data using the COPY command. Make sure that the IAM role you attach to your cluster has AmazonS3ReadOnlyAccess permission.

Prerequisites for using the Data API

You must be authorized to access the Amazon Redshift Data API. Amazon Redshift provides the RedshiftDataFullAccess managed policy, which offers full access to Amazon Redshift Data APIs. This policy also allows access to Amazon Redshift clusters, Secrets Manager, and IAM API operations needed to authenticate and access an Amazon Redshift cluster by using temporary credentials. If you want to use temporary credentials with the managed policy RedshiftDataFullAccess, you have to create one with the user name in the database as redshift_data_api_user.

You can also create your own IAM policy that allows access to specific resources by starting with RedshiftDataFullAccess as a template. For details, refer to the Amazon Redshift Cluster management guide.

The Data API allows you to access your database either using your IAM credentials or to use secrets stored in Secrets Manager. In this post, we use Secrets Manager.

For instructions on using database credentials for the Data API, see How to rotate Amazon Redshift credentials in AWS Secrets Manager.

Using the Data API from the AWS CLI

You can use Data API from the AWS CLI to interact with the Amazon Redshift cluster. For instructions on configuring AWS CLI, see Setting up the Amazon Redshift CLI. The Amazon Redshift command line interface (aws redshift) is a part of AWS CLI that lets you manage Amazon Redshift clusters, such as creating, deleting, and resizing. The Data API now provides a command line interface to the AWS CLI (redshift-data) that allows you to interact with the databases in an Amazon Redshift cluster.

Before we get started, ensure that you have the updated AWS SDK configured.

You can invoke help using the following command:

aws redshift-data help

The following table shows you different commands available with the Amazon Redshift Data API CLI.

Command	Description
`list-databases`	Lists the databases in a cluster.
`list-schemas`	Lists the schemas in a database. You can filter this by a matching schema pattern.
`list-tables`	Lists the tables in a database. You can filter the tables list by a schema name pattern, a matching table name pattern, or a combination of both.
`describe-table`	Describes the detailed information about a table including column metadata.
`execute-statement`	Runs a SQL statement, which can be DML, DDL, COPY, or UNLOAD.
`cancel-statement`	Cancels a running query. To be canceled, a query must be in running state.
`describe-statement`	Describes the details about a specific SQL statement run. The information includes when the query started, when it finished, the number of rows processed, and the SQL statement.
`list-statements`	Lists the SQL statements. By default, only finished statements are shown.
`get-statement-result`	Fetches the temporarily cached result of the query. The result set contains the complete result set and the column metadata. You can paginate through a set of records to retrieve the entire result as needed.

If you want to get help on a specific command, you have to run the following command:

aws redshift-data list-tables help

Now we look at how you can use these commands. First, get the secret key ARN by navigating to your key in Secrets Manager.

Listing databases

Most organizations use a single database in their Amazon Redshift cluster. You can use the following command to list the databases you have in your cluster. This operation requires you to connect to a database and therefore requires database credentials:

aws redshift-data list-databases  --cluster-identifier <your-cluster-id>--secret-arn  <replace-with-your-secret ARN> --database <your-db-name> --region us-west-2

Listing schema

Similar to listing databases, you can list your schemas by using the list-schemas command:

aws redshift-data list-schemas  --cluster-identifier <your-cluster-id> --secret-arn <replace-with-your-secret ARN> --database <your-db-name> --region us-west-2

You have several schemas that match demo (demo, demo2, demo3, and so on). You can optionally provide a pattern to filter your results matching to that pattern:

aws redshift-data list-schemas --database <your-db-name> --cluster-identifier <your-cluster-id> --secret-arn <replace-with-your-secret ARN> --region <your-region> --schema-pattern "demo%"

Listing tables

You can search across your schema with table-pattern; for example, you can filter the table list by all tables across all your schemas in the database. See the following code:

aws redshift-data list-tables --database dev --cluster-identifier <your-cluster-id>--secret-arn <your-secret-arn> --database <your-db-name> --region <your-region> --table-pattern "ven%"

You can filter your tables list in a specific schema pattern:

aws redshift-data list-tables --database <your-db-name> --cluster-identifier <your-cluster-id> --secret-arn <your-secret-arn> --region <your-region> --table-pattern "ven%" --schema-pattern demo

Running SQL commands

You can run your SELECT, DML, DDL, COPY, or UNLOAD command for Amazon Redshift with the Data API. You can optionally specify a name for your statement. You can optionally specify if you want to send an event to EventBridge after the query runs. The query is asynchronous, and you get a query ID after running a query.

Creating a schema

Let’s now use the Data API to see how you can create a schema. The following command will let you create a schema in your database. You do not have to run this SQL if you have pre-created the schema.

aws redshift-data execute-statement \
     --database <your-db-name>  \
     --cluster-identifier <your-cluster-id> \
     --secret-arn <your-secret-arn> \
     --sql "CREATE SCHEMA demo;" \
     --region <your-region>

The following shows an example output. We will discuss later how you can check the status of a SQL that you executed with execute-statement

{
    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-09-11T16:06:28.876000-07:00",
    "Database": "dev",
    "Id": "a6e0072b-4641-4e67-9105-aceb7f57266a",
    "SecretArn": "<Your-ARN->"
}

Creating a table

You can use the following command to create a table with the CLI.

aws redshift-data execute-statement \
    --database <your-db-name> \
    --cluster-identifier <your-cluster-id> \
    --secret-arn <your-secret-arn>  \
    --sql "CREATE TABLE demo.green_201601(
  vendorid                VARCHAR(4), \
  pickup_datetime         TIMESTAMP, \
  dropoff_datetime        TIMESTAMP, \
  store_and_fwd_flag      VARCHAR(1), \
  ratecode                INT, \
  pickup_longitude        FLOAT4, \
  pickup_latitude         FLOAT4, \
  dropoff_longitude       FLOAT4, \
  dropoff_latitude        FLOAT4, \
  passenger_count         INT, \
  trip_distance           FLOAT4, \
  fare_amount             FLOAT4, \
  extra                   FLOAT4, \
  mta_tax                 FLOAT4, \
  tip_amount              FLOAT4, \
  tolls_amount            FLOAT4, \
  ehail_fee               FLOAT4, \
  improvement_surcharge   FLOAT4, \
  total_amount            FLOAT4, \
  payment_type            VARCHAR(4),\
  trip_type               VARCHAR(4));" \
    --region <your-region>

Loading sample data

The COPY command lets you load bulk data into your table in Amazon Redshift. You can use the following command to load data into the table we created earlier.

aws redshift-data execute-statement \
    --database <your-db-name> \
    --cluster-identifier <your-cluster-id> \
    --secret-arn <your-secret-arn>  \
    --region <your-region>  
    --sql "COPY demo.green_201601 \
FROM 's3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01' \
IAM_ROLE 'arn:aws:iam::<Your_ACCOUNT>:role/<YourRole>' \
DATEFORMAT 'auto' \
IGNOREHEADER 1 \
DELIMITER ',' \
IGNOREBLANKLINES \
REGION 'us-west-2';"

Retrieving Data

The following query uses the table we created earlier:

aws redshift-data execute-statement \
    --database <your-db-name> \
    --cluster-identifier <your-cluster-id> \
    --secret-arn <your-secret-arn>  \
    --region <your-region> \
    --sql "SELECT ratecode, COUNT(*) FROM demo.green_201601 WHERE \
trip_distance > 5 GROUP BY 1 ORDER BY 1;"

If you’re fetching a large amount of data, using UNLOAD is recommended. You can unload data into Amazon Simple Storage Service (Amazon S3) either using CSV or Parquet format. UNLOAD uses the MPP capabilities of your Amazon Redshift cluster and is faster than retrieving a large amount of data to the client side.

The following shows an example output:

{
    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-09-01T09:39:45.693000-07:00",
    "Database": "dev",
    "Id": "fc4111f0-0e01-456a-83cf-d5922a8b100a",
    "SecretArn": "<your-secret-arn>"
}

You can fetch results using the query ID that you receive as an output of execute-statement.

Checking the status of a statement

 aws redshift-data describe-statement \
    --id 76f59b84-34a1-481b-a37d-a7b7e1ea57dc \
    --region <your-region>

The following is an example output:

{
    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-08-31T15:03:08.852000-07:00",
    "Duration": 82642162,
    "Id": "76f59b84-34a1-481b-a37d-a7b7e1ea57dc",
    "QueryString": " SELECT ratecode, COUNT(*) FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;",
    "RedshiftPid": 27815,
    "RedshiftQueryId": 709972,
    "ResultRows": 6,
    "ResultSize": 186,
    "SecretArn": "<your-secret-arn>",
    "Status": "FINISHED",
    "UpdatedAt": "2020-08-31T15:03:09.505000-07:00"
}

Cancelling a running statement

If your query is still running, you can use cancel-statement to cancel a SQL query. See the following command:

aws redshift-data cancel-statement --id 39a0de2f-e85e-45ff-a0d7-cd074c348120        --region  <your-region>

Fetching results from your query

You can fetch the query results by using get-statement-result. The query result is stored for 24 hours. See the following command:

aws redshift-data get-statement-result     --id 7b61da88-1b11-4ade-956a-21085a29118d     --region <your-region>

The output of the result contains metadata such as the number of records fetched, column metadata, and a token for pagination.

Exporting Data

Amazon Redshift allows you to export from database tables to a set of files in an Amazon S3 bucket using the UNLOAD command with a SELECT statement. You can unload data in either text or Apache Parquet format. The following command shows you an example of how you can use the data lake export with the Data API:

aws redshift-data execute-statement --database <db-name> --cluster-identifier <cluster-name> \
--secret-arn <your-secret-arn> \ 
 --region <your-region> \
--sql "unload ('select * from demo.green_201601') to '<your-S3-bucket>' iam_role '<your-iam-role>'; "

Using the Data API from the AWS SDK

You can use the Amazon Redshift Data API in any of the programming languages supported by AWS SDK. For this post, we use the AWS SDK for Python (boto3) as an example to illustrate the capabilities of the Data API.

We first import the boto3 package and establish a session:

def get_client(service, endpoint=None, region="us-west-2"):
    import botocore.session as bc
    session = bc.get_session()

    s = boto3.Session(botocore_session=session, region_name=region)
    if endpoint:
        return s.client(service, endpoint_url=endpoint)
    return s.client(service)

Getting a client object

You can create a client object from the boto3.Session object and using RedshiftData:

rsd = get_client('redshift-data')

If you don’t want to create a session, your client is as simple as the following code:

import boto3

client = boto3.client('redshift-data')

Running a statement

The following example code uses the Secrets Manager key to run a statement. For this post, we use the table we created earlier. You can use DDL, DML, COPY, and UNLOAD as a parameter:

resp = rsd.execute_statement(
    SecretArn="<replace-with-your-secret-arn>",
    ClusterIdentifier="<replace-with-your-cluster-name> ",
    Database="<replace-with-your-db-name>",
    Sql="SELECT ratecode, COUNT(*) totalrides FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;"

As we discussed earlier, running a query is asynchronous; running a statement returns an ExecuteStatementOutput, which includes the statement ID.

If you want to publish an event to EventBridge when the statement is complete, you can use the additional parameter WithEvent set to true:

resp = rsd.execute_statement(
    Database=database,
    ClusterIdentifier=cluster,
    SecretArn=secretarn,
    Sql=sql,
    WithEvent=True
    )

Using IAM credentials

Amazon Redshift provides users to get temporary database credentials using GetClusterCredentials. We recommend you scope the access to a specific cluster and DB user if you are granting your users to use temporary credentials. Here is an example code that uses getting temporary IAM credentials. As you can see in the code, we are using the redshift_data_api_user. The managed policy RedshiftDataFullAccess scopes to use temporary credentials only to redshift_data_api_user.

def query(sql, cluster="redshift-cluster-1",database="dev",dbuser="redshift_data_api_user"):
     resp = rsd.execute_statement(
        Database=database,
        ClusterIdentifier=cluster,
        DbUser=dbuser,
        Sql=sql,
        WithEvent=True
     )

Describing a statement

You can use describe_statement to find the status of the query and number of records retrieved. See the following code:

id=resp['Id']
desc = rsd.describe_statement(Id=id)
if desc["Status"] == "FINISHED":            
    print(desc["ResultRows"])

Fetching results from your query

You can use get_statement_result to retrieve results for your query if your query is complete. See the following code:

if desc and desc["ResultRows"]  > 0:
    result = rsd.get_statement_result(Id=qid)

The get_statement_result command returns a JSON object that includes metadata for the result, the actual result set. You might need to process to format the result if you want to display in a user-friendly format.

Fetching and formatting results

For this post, we demonstrate how to format the results with the Pandas framework. The post_process function processes the metadata and results to populate a data frame. The query function retrieves the result from a database in an Amazon Redshift cluster. See the following code:

import pandas as pd

def post_process(meta, records):
    columns = [k["name"] for k in meta]
    rows = []
    for r in records:
        tmp = []
        for c in r:
            tmp.append(c[list(c.keys())[0]])
        rows.append(tmp)
    return pd.DataFrame(rows, columns=columns)

def query(sql, cluster="redshift-cluster-1", user="awsuser", database="dev"):
    resp = rsd.execute_statement(
        Database=database,
        ClusterIdentifier=cluster,
        DbUser=user,
        Sql=sql
    )
    qid = resp["Id"]
    print(qid)
    desc = None
    while True:
        desc = rsd.describe_statement(Id=qid)
        if desc["Status"] == "FINISHED":
            break
            print(desc["ResultRows"])
    if desc and desc["ResultRows"]  > 0:
        result = rsd.get_statement_result(Id=qid)
        rows, meta = result["Records"], result["ColumnMetadata"]
        return post_process(meta, rows)

pf=query("select venueid,venuename from venue  limit 100;")
print(pf)

In this post, we demonstrated the use of the Data API with Python. However, you can use the Data API with other programming languages supported by the AWS SDK.

Best practices

We recommend the following best practices when using the Data API:

Federate your IAM credentials to the database to connect with Amazon Redshift. Amazon Redshift allows users to get temporary database credentials with GetClusterCredentials. We recommend scoping the access to a specific cluster and DB user if you’re granting your users temporary credentials. For more information, see Example policy for using GetClusterCredentials.
Use a custom policy to provide fine-grained access to the Data API in the production environment if you don’t want your users to use temporary credentials. You have to use Secrets Manager to manage your credentials in such use cases.
Ensure that the record size that you retrieve is smaller than 64 KB.
Don’t retrieve a large amount of data to your client and use the UNLOAD command to export the query results to Amazon S3. You’re limited to retrieving only 100 MB of data with the Data API.
Don’t forget to retrieve your results within 24 hours; results are stored only for 24 hours.

Customer Feedback

Datacoral is a fast growing startup that offers an AWS-native data integration solution for analytics. Datacoral integrates data from databases, APIs, events and files into Amazon Redshift while providing guarantees on data freshness and data accuracy to ensure meaningful analytics. Using the Redshift API, they are able to create a completely event-driven and serverless platform that makes data integration and loading easier for our mutual customers. Founder and CEO Raghu Murthy said, “As an Amazon Redshift Ready Advanced Technology Partner, we have worked with the Redshift team to integrate their Redshift API into our product. The Redshift API provides the asynchronous component needed in our platform to submit and respond to data pipeline queries running on Amazon Redshift. It is the last piece of the puzzle for us to offer our customers a fully event-driven and serverless platform that is robust, cost-effective, and scales automatically. We are thrilled to be part of the launch”

Zynga Inc. is an American game developer running social video game services founded in April 2007. Zynga uses Amazon Redshift as its central data warehouse for game events, user, and revenue data. The data in the Amazon Redshift data warehouse is used for analytics, BI reporting, and AI/ML across all games and departments. Zynga wants to replace any programmatic access clients connected to Amazon Redshift with the new Amazon Redshift Data API. Currently, Zynga’s services connect using a wide variety of clients and drivers, and they plan to consolidate all of them. This will remove the need for Amazon Redshift credentials and regular password rotations. Johan Eklund, Senior Software Engineer, Analytics Engineering team in Zynga, who participated in the beta testing said, “Data API would be an excellent option for our services that will use Amazon Redshift programmatically. The main improvement would be authentication with IAM roles without having to involve the JDBC/ODBC drivers since they are all AWS hosted. Our most common service client environments are PHP, Python, Go plus a few more”.

Conclusion

In this post, we introduced you to the newly launched Amazon Redshift Data API. We also demonstrated how to use the Data API from the Amazon Redshift CLI and Python using the AWS SDK. We also provided best practices for using the Data API. To learn more, read the Amazon Redshift cluster management guide.

About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Martin Grund is a Principal Engineer working in the Amazon Redshift team on all topics related to data lake (e.g. Redshift Spectrum), AWS platform integration and security.

Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective database to empower customers with data-driven decision making.

Daisy Yanrui Zhang is a software Dev Engineer working in the Amazon Redshift team on database monitoring, serverless database and database user experience.