All posts by Srikanth Sopirala

Federate Amazon Redshift access with SecureAuth single sign-on

Post Syndicated from Srikanth Sopirala original https://aws.amazon.com/blogs/big-data/federate-amazon-redshift-access-with-secureauth-single-sign-on/

Amazon Redshift is the leading cloud data warehouse that delivers up to 3x better price performance compared to other cloud data warehouses by using massively parallel query execution, columnar storage on high-performance disks, and results caching. You can confidently run mission-critical workloads, even in highly regulated industries, because Amazon Redshift comes with out-of-the-box security and compliance.

You can use your corporate identity providers (IdPs), for example Azure AD, Active Directory Federation Services, Okta, or Ping Federate, with Amazon Redshift to provide single sign-on (SSO) to your users so they can use their IdP accounts to log in and access Amazon Redshift. With federation, you can centralize management and governance of authentication and permissions. For more information about the federation workflow using AWS Identity and Access Management (IAM) and an IdP, see Federate Database User Authentication Easily with IAM and Amazon Redshift.

This post shows you how to use the Amazon Redshift browser-based plugin with SecureAuth to enable federated SSO into Amazon Redshift.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Configure your SecureAuth IdP v20.06 or later, including users, groups, and SAML application.
  2. Configure your AWS service provider, including IdP, IAM policy, and IAM role.
  3. Set up your SQL Workbench/J client and test SSO.

The process flow for federated authentication includes the following steps:

  1. The user logs in from SQL Workbench/J.
  2. The IdP authenticates using the corporate user name and password, and returns a SAML assertion.
  3. The client uses the AWS SDK to call AWS Security Token Service (AWS STS) to assume a role with SAML.
  4. AWS STS returns temporary AWS credentials.
  5. The client uses the temporary AWS credentials to connect to Amazon Redshift.

The following diagram illustrates this process flow.
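The Amazon Redshift JDBC driver's browser-based plugin performs steps 2–5 for you, but the same exchange can be reproduced with the AWS SDK. The following is a minimal boto3 sketch of steps 3–5, assuming you already captured a base64-encoded SAML assertion from SecureAuth; the role, provider, cluster, user, and group names are placeholders:

import boto3

# Placeholder: the base64-encoded SAML response returned by the IdP after authentication.
saml_assertion_b64 = "<base64-encoded SAML assertion from SecureAuth>"

# Step 3: exchange the SAML assertion for temporary AWS credentials.
sts = boto3.client("sts")
assumed = sts.assume_role_with_saml(
    RoleArn="arn:aws:iam::111122223333:role/redshift-sales-role",
    PrincipalArn="arn:aws:iam::111122223333:saml-provider/SecureAuthIdP",
    SAMLAssertion=saml_assertion_b64,
)
creds = assumed["Credentials"]  # step 4: temporary AWS credentials

# Step 5: use the temporary credentials to request database credentials for the cluster.
redshift = boto3.client(
    "redshift",
    region_name="us-east-1",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
db_creds = redshift.get_cluster_credentials(
    DbUser="jdoe",                       # mapped from the DbUser SAML attribute
    DbName="dev",
    ClusterIdentifier="sample-redshift-cluster-1",
    DbGroups=["sales"],
    AutoCreate=True,
)
# db_creds["DbUser"] and db_creds["DbPassword"] are the temporary database credentials.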

Prerequisites

Before starting this walkthrough, you must have the following:

  • An Amazon Redshift cluster
  • A SecureAuth account
  • A database user with superuser permissions
  • Amazon Redshift DB groups pre-created and necessary privileges assigned to them

Set up a user data store at your IdP (SecureAuth)

For instructions on setting up your user data store integration with SecureAuth, see Add a User Data Store. The following screenshot shows a sample setup.

Configure your IdP (SecureAuth)

The next step is to create a new realm in your IdP.

  1. On the SecureAuth portal, on the Create Realms menu, choose Create New From Template.
  2. Select the application (for this post, Amazon Web Services) from the list to establish the target resource for the new realm.
  3. For Page Title/Header, enter the title to appear on the web admin and end-user login pages (for this post, we enter Amazon Redshift).
  4. For Data source, choose the type of Active Directory integration to use with your realm (the user data source we created earlier).
  5. For Start Location, choose At Identity Provider to specify the provider the end-user uses to initiate the login process.
  6. For Global Aux Fields, set Global Aux ID 1 to true.
  7. For SAML Consumer URL, enter http://localhost:7890/redshift/.
  8. Configure the following SAML attributes (SAML attribute name – value):
    • Attribute 1 – https://aws.amazon.com/SAML/Attributes/Role – Full Group DN List
    • Attribute 2 – https://aws.amazon.com/SAML/Attributes/RoleSessionName – Authenticated User ID
    • Attribute 3 – https://redshift.amazon.com/SAML/Attributes/AutoCreate – Global Aux ID 1
    • Attribute 4 – https://redshift.amazon.com/SAML/Attributes/DbUser – Authenticated User ID

The value of Attribute 1 must be dynamically populated with the AWS roles associated with the user who accesses the Amazon Redshift cluster. This can be a multi-valued SAML attribute to accommodate situations where a user belongs to multiple AD groups or AWS roles with access to the Amazon Redshift cluster. After it's dynamically framed using the following steps, each value of Attribute 1 looks like arn:aws:iam::AWS-ACCOUNT-NUMBER:role/AWS-ROLE-NAME,arn:aws:iam::AWS-ACCOUNT-NUMBER:saml-provider/SAML-PROVIDER-NAME.

The dynamic population of Attribute 1 (https://aws.amazon.com/SAML/Attributes/Role) can be done using SecureAuth’s transformation engine.

  1. Enable the transformation engine and use a transformation script.
  2. Use the following sample XSLT transform script in SecureAuth's transformation engine to take the AD group names that a user belongs to and generate the IAM role information by rewriting each AD group name with the ARN details.

We are assuming that the AD group name and IAM role name will be the same. This is key for the transformation to work.

You have to substitute the following in the script:

  • AWS-ACCOUNT-NUMBER – A 12-digit AWS account number where you configure the Amazon Redshift access IAM roles.
  • SAML-PROVIDER-NAME – The name of SAML provider that you create on the AWS side within IAM using the SAML provider metadata file from SecureAuth. Because you haven’t created the SAML provider yet in IAM, make a note of the name that you’re providing here and reuse that name when you are doing the AWS side of the configuration.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="urn:my-scripts">
<msxsl:script language="C#" implements-prefix="user">
<msxsl:using namespace="System.Globalization"/>
<![CDATA[
public static string Transform(string v)
        {
              int startPosition = v.IndexOf("=") + 1;
              string rolearn = v.Substring(startPosition,v.IndexOf(",", startPosition) - startPosition);
              return string.Format("arn:aws:iam::AWS-ACCOUNT-NUMBER:role/{0},arn:aws:iam::AWS-ACCOUNT-NUMBER:saml-provider/SAML-PROVIDER-NAME",rolearn);
        }
    ]]>
</msxsl:script>
  <xsl:template match="/">
    <user>
      <UserID>
        <xsl:value-of select="user/UserID" />
      </UserID>
      <Email1>
        <xsl:value-of select="user/Email1" />
      </Email1>
      <Email2>
        <xsl:value-of select="user/Email2" />
      </Email2>
      <Email3>
        <xsl:value-of select="user/Email3" />
      </Email3>
      <Email4>
        <xsl:value-of select="user/Email4" />
      </Email4>
      <AuxID1>
        <xsl:value-of select="user/AuxID1" />
      </AuxID1>
      <AuxID2>
        <xsl:value-of select="user/AuxID2" />
      </AuxID2>
      <AuxID3>
        <xsl:value-of select="user/AuxID3" />
      </AuxID3>
      <AuxID4>
        <xsl:value-of select="user/AuxID4" />
      </AuxID4>
      <AuxID5>
        <xsl:value-of select="user/AuxID5" />
      </AuxID5>
      <AuxID6>
        <xsl:value-of select="user/AuxID6" />
      </AuxID6>
      <AuxID7>
        <xsl:value-of select="user/AuxID7" />
      </AuxID7>
      <AuxID8>
        <xsl:value-of select="user/AuxID8" />
      </AuxID8>
      <AuxID9>
        <xsl:value-of select="user/AuxID9" />
      </AuxID9>
      <AuxID10>
        <xsl:value-of select="user/AuxID10" />
      </AuxID10>
      <FirstName>
        <xsl:value-of select="user/FirstName" />
      </FirstName>
      <LastName>
        <xsl:value-of select="user/LastName" />
      </LastName>
      <Phone1>
        <xsl:value-of select="user/Phone1" />
      </Phone1>
      <Phone2>
        <xsl:value-of select="user/Phone2" />
      </Phone2>
      <Phone3>
        <xsl:value-of select="user/Phone3" />
      </Phone3>
      <Phone4>
        <xsl:value-of select="user/Phone4" />
      </Phone4>
      <GroupList>
        <Groups>
          <xsl:for-each select="user/GroupList/Groups/Value">
            <Value>
              <xsl:value-of select="current()" />
            </Value>
          </xsl:for-each>
        </Groups>
        <FullGroups>
         <xsl:for-each select="user/GroupList/FullGroups/Value">
           <Value>
             <xsl:value-of select="user:Transform(current())"/>
           </Value>
         </xsl:for-each>
        </FullGroups>
     </GroupList>
    </user>
  </xsl:template>
</xsl:stylesheet>
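The Transform function embedded in the script above simply extracts the group name from each AD group DN and wraps it in the role and SAML provider ARN pair. The following Python snippet is an equivalent sketch of that string manipulation; the DN, account number, and provider name are hypothetical examples:

def transform(dn, account="111122223333", provider="SAML-PROVIDER-NAME"):
    # Take the value between the first "=" and the first "," (the group's CN).
    start = dn.index("=") + 1
    role_name = dn[start:dn.index(",", start)]
    return ("arn:aws:iam::{0}:role/{1},arn:aws:iam::{0}:saml-provider/{2}"
            .format(account, role_name, provider))

print(transform("CN=redshift-sales-role,OU=Groups,DC=example,DC=com"))
# arn:aws:iam::111122223333:role/redshift-sales-role,arn:aws:iam::111122223333:saml-provider/SAML-PROVIDER-NAME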
  1. If you want to filter down the number of AD groups to the ones that are associated with Amazon Redshift access, you can apply a group filter expression for Attribute 1. For example, in the following screenshot, we’re assuming that all AD groups associated with Amazon Redshift access contain the word “Redshift”. This is represented as a regular expression.
  2. Set the value of Global Aux ID 1 to true.

When the setup is complete, your SecureAuth side settings should look similar to the following screenshots.

  1. When the configuration is complete, download the SAML provider metadata XML file from SecureAuth.

Configure your service provider (AWS)

The next step is to set up the service provider.

Create an IdP in IAM

To create an IdP in IAM, complete the following steps:

  1. On the IAM console, choose Identity providers in the navigation pane.
  2. Choose Create Provider.
  3. In the Configure Provider section, choose SAML as the provider type, enter the same provider name that you specified during the SecureAuth configuration, and upload the metadata document that you downloaded earlier from your IdP.
  4. On the next page, choose Create to complete the IdP creation within IAM.

You should see a page similar to the following after the IdP creation is complete.
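If you prefer to script this step, you can also create the provider with a short boto3 call. This is a sketch that assumes the SecureAuth metadata file was saved locally as secureauth-metadata.xml (a placeholder name); the provider name must match the SAML-PROVIDER-NAME used in the SecureAuth transform script:

import boto3

iam = boto3.client("iam")

# Read the SAML metadata XML downloaded from SecureAuth (placeholder file name).
with open("secureauth-metadata.xml") as f:
    metadata = f.read()

response = iam.create_saml_provider(
    SAMLMetadataDocument=metadata,
    Name="SAML-PROVIDER-NAME",  # must match the name used in the transform script
)
print(response["SAMLProviderArn"])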

Create a SAML 2.0 federation IAM role and corresponding policy

For each Amazon Redshift DB group of users, we need to create an IAM role and the corresponding IAM policy. Repeat the steps in this section for each DB group.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Select type of trusted entity, choose SAML 2.0 federation.
  4. For SAML provider, choose the IdP you created earlier.
  5. Select Allow programmatic access only.
  6. For Attribute, choose SAML:aud.
  7. For Value, enter http://localhost:7890/redshift/.
  8. Choose Next: Permissions.
  9. Under Attach permissions policies, choose Create policy.
  10. Choose the JSON tab on the Create policy page.
  11. Enter the following sample IAM policy, with the following information:
    1. REGION – The Region where your Amazon Redshift cluster exists. For example, us-east-1.
    2. AWS-ACCOUNT-NUMBER – The 12-digit account number in which your Amazon Redshift cluster exists.
    3. REDSHIFT-CLUSTER-IDENTIFIER – Your Amazon Redshift cluster identifier that you gathered earlier.
    4. DATABASE – Your Amazon Redshift cluster database name.
    5. UNIQUE-ROLE-IDENTIFIER – For now, leave it as is; you have to come back to the policy and change it after the role is created.
    6. REDSHIFT-DB-GROUP – The database group name within Amazon Redshift.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:dbname:REDSHIFT-CLUSTER-IDENTIFIER/DATABASE",
                "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:dbuser:REDSHIFT-CLUSTER-IDENTIFIER/${redshift:DbUser}",
                "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:cluster:REDSHIFT-CLUSTER-IDENTIFIER"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:userid": "UNIQUE-ROLE-IDENTIFIER:${redshift:DbUser}"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": "redshift:CreateClusterUser",
            "Resource": "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:dbuser:REDSHIFT-CLUSTER-IDENTIFIER/${redshift:DbUser}"
        },
        {
            "Effect": "Allow",
            "Action": "redshift:JoinGroup",
            "Resource": "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:dbgroup:REDSHIFT-CLUSTER-IDENTIFIER/REDSHIFT-DB-GROUP"
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters",
                "iam:ListRoles"
            ],
            "Resource": "*"
        }
    ]
}
  1. On the Review policy page, enter a name and description.
  2. Choose Create policy.
  3. When the policy creation is complete, go back to the Create role page, choose the refresh icon, and search for the newly created policy.
  4. Choose the policy and choose Next: Tags.
  5. Add any tags that you want and then choose Next: Review.
  6. In the Review section, provide a role name and choose Create role.

Make sure that your IAM role names match the AD groups that you're creating to support Amazon Redshift access. The transformation script that we discussed in the SecureAuth section assumes that the AD group name and the Amazon Redshift access IAM role name are the same.

We use the AWS Command Line Interface (AWS CLI) to fetch the unique role identifier for the role you just created.

  1. Configure the AWS CLI with the necessary access key ID and secret access key before running the following command:
aws iam get-role --role-name <value>

Replace <value> with the role-name that you just created. So, in this example, the command is:

aws iam get-role --role-name redshift-sales-role
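If you prefer the SDK over the CLI, a minimal boto3 equivalent (using the same example role name) looks like the following:

import boto3

iam = boto3.client("iam")
role = iam.get_role(RoleName="redshift-sales-role")
print(role["Role"]["RoleId"])  # the unique role identifier, which starts with AROA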
  1. From the output JSON, note the value of RoleId.
  2. On the IAM console, open the policy you created earlier.
  3. Choose Edit policy.
  4. Choose the JSON tab.
  5. Replace UNIQUE-ROLE-IDENTIFIER with the RoleId fetched earlier.
  6. Choose Review policy and Save changes.

You’ve now created an IAM role and policy corresponding to the DB group for which you’re trying to enable IAM-based access.

Log in to Amazon Redshift using IdP-based credentials

To log in to Amazon Redshift using your IdP-based credentials, complete the following steps:

  1. Download the latest Amazon Redshift JDBC driver with the AWS SDK for Java.
  2. Launch the SQL Workbench/J app.
  3. Under Manage Drivers, add the Amazon Redshift JDBC driver.
  4. Create a new connection profile for the Amazon Redshift connection.
  5. Choose com.amazon.redshift.jdbc.Driver as the driver.
  6. For URL, enter jdbc:redshift:iam://REDSHIFT-CLUSTER-ENDPOINT:PORT#/DATABASE.

For example, jdbc:redshift:iam://sample-redshift-cluster-1.cxqXXXXXXXXX.us-east-1.redshift.amazonaws.com:5439/dev.

  1. Leave the Username and Password fields empty.
  2. Select Autocommit.
  3. Choose Extended Properties and provide the following values:
    1. plugin_name – com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
    2. login_url – The login URL for the realm that you set up on your IdP. For example, https://XYZ.identity.secureauth.com/SecureAuth2/.
    3. dbgroups – The DB group that you use for login (this is a required parameter). This DB group must exist in Amazon Redshift. For example, finance.
    4. preferred_role – The preferred IAM role that you want to use. If only one IAM role is populated in the SAML token provided by IdP, this isn’t required.
  4. Choose Test to check if the configuration is working.

You should be redirected to your IdP in a browser window for authentication.

You’ll see the multi-factor authentication screen if it has been set up within SecureAuth.

You should see the successful login in the browser as well as in the SQL Workbench/J app.
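SQL Workbench/J isn't the only client that can use the browser-based plugin. As a rough sketch, you can attempt the same federated login from Python with the redshift_connector driver; the cluster identifier, Region, login URL, and DB group below are placeholders, and the connection options mirror the extended properties described earlier:

import redshift_connector

# Federated login through the browser SAML plugin (values are placeholders).
conn = redshift_connector.connect(
    iam=True,
    database="dev",
    cluster_identifier="sample-redshift-cluster-1",
    region="us-east-1",
    credentials_provider="BrowserSamlCredentialsProvider",
    login_url="https://XYZ.identity.secureauth.com/SecureAuth2/",
    db_groups=["finance"],
)

cursor = conn.cursor()
cursor.execute("select current_user")
print(cursor.fetchone())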

  1. Connect to Amazon Redshift and run a sample query such as select current_user, which should show the currently logged-in user.

Summary

Amazon Redshift supports stringent compliance and security requirements with no extra cost, which makes it ideal for highly regulated industries. With federation, you can centralize management and governance of authentication and permissions by managing users and groups within the enterprise IdP and use them to authenticate to Amazon Redshift. SSO enables users to have a seamless user experience while accessing various applications in the organization.

In this post, we walked you through a step-by-step guide to configure and use SecureAuth as your IdP and enable federated SSO to an Amazon Redshift cluster. You can follow these steps to set up federated SSO for your organization, manage access privileges based on read/write needs or business function, and pass group membership defined in your SecureAuth IdP to your Amazon Redshift cluster.

If you have any questions or suggestions, please leave a comment.


About the Authors

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road biking.

Sandeep Veldi is a Sr. Solutions Architect at AWS. He helps AWS customers with prescriptive architectural guidance based on their use cases and navigating their cloud journey. In his spare time, he loves to spend time with his family.

BP Yau is a Sr Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

Jamey Munroe is Head of Database Sales, Global Verticals and Strategic Accounts. He joined AWS as one of its first Database and Analytics Sales Specialists, and he’s passionate about helping customers drive bottom-line business value throughout the full lifecycle of data. In his spare time, he enjoys solving home improvement DIY project challenges, fishing, and competitive cycling.

Simplify your data analysis with Amazon Redshift Query Editor v2

Post Syndicated from Srikanth Sopirala original https://aws.amazon.com/blogs/big-data/simplify-your-data-analysis-with-amazon-redshift-query-editor-v2/

Amazon Redshift is a fast, fully managed cloud data warehouse that provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API. Tens of thousands of customers use Amazon Redshift as their analytics platform. Data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift Query Editor v2 is a web-based SQL client application that you can use to author and run queries on your Amazon Redshift data warehouse. You can visualize query results with charts and collaborate by sharing queries with members of your team.

Query Editor v2 provides several capabilities, such as the ability to browse and explore multiple databases, external tables, views, stored procedures, and user-defined functions. It provides wizards to create schemas, tables, and user-defined functions. It simplifies the management and collaboration of saved queries. You can also gain faster insights by visualizing the results with a single click.

Query Editor v2 enhances and builds upon the functionality of the prior version of the query editor with, to name a few improvements, an increased maximum query size, the ability to author and run multi-statement queries, and support for session variables and query parameters.

You can provide Query Editor v2 to end-users such as data analysts, database developers, and data scientists without providing the privileges required to access the Amazon Redshift console.

In this post, we walk through how to create an AWS Identity and Access Management (IAM) role to provide access to Query Editor v2 for end-users, easily connect to your clusters, run SQL queries, load data in your clusters, create charts, and share queries directly from the console.

Configure Query Editor v2 for your AWS account

As an admin, you must first configure Query Editor v2 before providing access to your end-users.

You can access Query Editor v2 from the Amazon Redshift console.

When you choose Query Editor v2 from the Editor options, a new tab in your browser opens with the Query Editor v2 interface.

By default, an AWS-owned key is used to encrypt resources. Optionally, you can create a symmetric customer managed key to encrypt Query Editor v2 resources such as saved queries and query results using the AWS Key Management Service (AWS KMS) console or AWS KMS API operations.

Provide access to Query Editor v2 for your end-users

Enterprises want to democratize access to data in the data warehouse securely by providing a web-based query editor to their end-users. You can either use IAM users or integrate the AWS console with your single sign-on (SSO) provider to provide access to end-users. In a future post, we will document all necessary steps to integrate your SSO provider with the query editor.

To enable your users to access Query Editor v2 using IAM, as an administrator, you can attach one of the AWS managed policies described in the following list to the IAM user or role to grant permission. These managed policies also give access to other required services. You can also create your own customer managed policy if you want to customize permissions for your end-users.

  • AmazonRedshiftQueryEditorV2FullAccess – Grants full access to Query Editor v2 operations and resources. This is primarily intended for administrators.
  • AmazonRedshiftQueryEditorV2NoSharing – Grants the ability to work with Query Editor v2 without sharing resources. Users can't share their queries with their team members.
  • AmazonRedshiftQueryEditorV2ReadSharing – Grants the ability to work with Query Editor v2 with limited sharing of resources. The granted principal can read the saved queries shared with its team but can't update them.
  • AmazonRedshiftQueryEditorV2ReadWriteSharing – Grants the ability to work with Query Editor v2 with sharing of resources. The granted principal can read and update the resources shared with its team.

For example, if you have a group of users as part of marketing_group, and you want them to collaborate by sharing their queries, you can create an IAM role for them and assign the AmazonRedshiftQueryEditorV2ReadSharing policy. You can also tag the role with the key sqlworkbench-team and the value marketing_group.

You can use the IAM console to attach IAM policies to an IAM user or an IAM role. After you attach a policy to a role, you can attach the role to an IAM user.

To attach the IAM policies to an IAM role, complete the following steps:

  1. On the IAM console, choose Roles.
  2. Choose the role that needs access to Query Editor v2 (for this example, assume the role is named marketing_role).
  3. Choose Attach policies.
  4. For Policy names, choose the policies that we described previously based on your requirement.
  5. Choose Attach policy.

Now you can add the marketing_group tag for an IAM role.

  1. In the navigation pane, choose Roles and select the name of the role that you want to edit.
  2. Choose the Tags tab and choose Add tags.
  3. Add the tag key sqlworkbench-team and the value marketing_group.
  4. Choose Save changes.
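The same attach-and-tag setup can also be scripted. The following is a minimal boto3 sketch for the marketing example above, assuming the role marketing_role already exists:

import boto3

iam = boto3.client("iam")

# Grant Query Editor v2 access with read-only sharing to the team's role.
iam.attach_role_policy(
    RoleName="marketing_role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonRedshiftQueryEditorV2ReadSharing",
)

# Tag the role so Query Editor v2 treats its principals as one team.
iam.tag_role(
    RoleName="marketing_role",
    Tags=[{"Key": "sqlworkbench-team", "Value": "marketing_group"}],
)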

Now the end-users with marketing_role can access Query Editor v2 with limited sharing of resources.

Work with Query Editor v2

You can use Query Editor v2 to author and run queries, visualize results, and share your work with your team. With Query Editor v2, you can create databases, schemas, tables, and user-defined functions (UDFs) with visual wizards. In a tree-view panel, for each of your clusters, you can view its schemas. For each schema, you can view its tables, views, functions (UDFs), and stored procedures.

Open Query Editor v2

After you log in to the console and navigate to Query Editor v2, you see a page like the following screenshot.

Query Editor v2 now provides a more IDE-like experience to its users and offers both dark and light themes. You can switch between themes by choosing the moon icon at the lower left of the page.

The left navigation pane shows the list of clusters that you have access to. If you don’t have an Amazon Redshift cluster, use the Getting Started with Amazon Redshift cluster with sample data option. In this post, we use the sample data (Tickets database) as examples.

Connect to an Amazon Redshift database

You can connect to a cluster by choosing a cluster and entering your credentials.

You can connect using either a database user name and password or temporary credentials. Query Editor v2 creates a secret on your behalf stored in AWS Secrets Manager. This secret contains credentials to connect to your database. With temporary credentials, Query Editor v2 generates a temporary password to connect to the database.

Browse a database

You can browse one or more databases in the cluster that you’re connected to. Within a database, you can manage schemas, tables, views, functions, and stored procedures in the tree-view panel. If you have integrated your cluster with the AWS Glue Data Catalog, you see the Data Catalog schema and external tables. Similarly, you can browse the external tables if you create external schemas using Amazon Redshift data sharing, Amazon Redshift Spectrum, or federated queries.

You can perform an operation on an object by opening its context menu (right-click) and choosing from the menu options.

Author and run queries

Query Editor v2 allows you to run your queries by selecting a specific database. If you have multiple databases, make sure that you choose the correct database.

You can enter a query in the editor or select a saved query from the Queries list and choose Run. The query editor provides several shortcuts, which you can access by choosing the content assist option.

By default, Limit 100 is set to limit the results to 100 rows. You can turn off this option to return a more extensive result set. If you turn off this option, you can include the LIMIT option in your SQL statement to avoid very large result sets.

Use multiple SQL statements in a query

The query editor supports multiple queries, session variables, and temporary tables. If you have multiple SQL statements and you run the query, the results are displayed on various tabs.

Run long queries

You don’t have to wait for long queries to complete to view results. The queries run even if the browser window is closed. You can view the results the next time you log in to Query Editor v2.

Run parameterized queries

You can use parameters with your query instead of hardcoding certain values, as in the following code:

SELECT sum(qtysold) 
FROM   sales, date 
WHERE  sales.dateid = date.dateid 
AND    sellerId >= ${sellerid};

When you run a query with a parameter, you’re prompted with a form.

Run the explain plan

You can optimize your queries by turning on the Explain option to display a query plan in the results area. You can choose Save to save the query to the Queries folder.

Export results

You can export the query results on the current page to a file in JSON or CSV format. To save the file in the format you want, open the context menu (right-click) in the results area, then choose Export current page and either JSON or CSV. You can also select rows and export the results for specific rows.

Visual analysis of your results

You can perform a visual analysis of your results for a query by turning on Chart to display a graphic visualization of the results. Choose Traces to display the results as a chart. For Type, choose the style of chart as Bar, Line, and so on. For Orientation, you can choose Vertical or Horizontal. For X, select the table column that you want to use for the horizontal axis. For Y, choose the table column that you want to use for the vertical axis.

Choose Refresh to update the chart display. Choose Fullscreen to expand the chart display.

To create a chart, complete the following steps:

  1. Run a query and get results.
  2. Turn on Chart.
  3. Choose a chart style from the available options.

  1. Choose Trace and start to visualize your data.
  2. Choose Style to customize the appearance, including colors, axes, legend, and annotations.
  3. Choose Annotations to add text, shapes, and images.

For certain chart types, you can add transforms to filter, split, aggregate, and sort the underlying data for the chart.

You can also save, export, and browse the charts you created.

Collaborate and share with your team members

You can share queries with others on your team. As we discussed earlier, an administrator sets up a team based on the IAM policy associated with an IAM user or IAM role. For example, if you’re a member of marketing_group, you can share your queries with your team members.

Save, organize and browse queries

Before you can share your query with your team, save your query. You can also view and delete saved queries.

To save your query, choose Save, enter a title, and choose Save again.

To browse for saved queries, choose Queries from the navigation pane. You can view queries that are My queries, Shared by me, or Shared to my team. These queries can appear as individual queries or within folders you created.

Organize your queries with folders

You can organize your queries by creating folders and dragging and dropping a saved query to a folder.

Share a query

You can share your queries with your team.

  1. Choose Queries in the navigation pane.
  2. Open the context menu (right-click) of the query that you want to share.
  3. Choose Share with my team.

Manage query versions

You can also view the history of saved queries and manage query versions. Every time you save an SQL query, Query Editor v2 saves it as a new version. You can store up to 20 versions of your query and browse earlier query versions, save a copy of a query, or restore a query.

  1. Choose Queries in the navigation pane.
  2. Open the context menu (right-click) for the query that you want to work with.
  3. Choose Version history to open a list of versions of the query.
  4. On the Version history page, choose one of the following options:
    • Revert to selected – Revert to the selected version and continue your work with this version.
    • Save selected as – Save the selected version as a new query in the editor.

Conclusion

In this post, we introduced you to Amazon Redshift Query Editor v2, which has a rich set of features for managing and running your SQL statements securely, such as the ability to browse and explore multiple databases, external tables, views, stored procedures, and user-defined functions. It provides wizards to create schemas, tables, and user-defined functions. Query Editor v2 simplifies management and collaboration of saved queries and improves the ability to analyze and visualize results with a single click.

If you have any questions or suggestions, please leave a comment.

Happy querying!


About the Author

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Eren Baydemir, a Technical Product Manager at AWS, has 15 years of experience in building customer facing products and is currently creating data analytics solutions in the Amazon Redshift team. He was the CEO and co-founder of DataRow which was acquired by Amazon in 2020.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Building AWS Glue Spark ETL jobs by bringing your own JDBC drivers for Amazon RDS

Post Syndicated from Srikanth Sopirala original https://aws.amazon.com/blogs/big-data/building-aws-glue-spark-etl-jobs-by-bringing-your-own-jdbc-drivers-for-amazon-rds/

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. AWS Glue has native connectors to connect to supported data sources either on AWS or elsewhere using JDBC drivers. Additionally, AWS Glue now enables you to bring your own JDBC drivers (BYOD) to your Glue Spark ETL jobs. This feature enables you to connect to data sources with custom drivers that aren’t natively supported in AWS Glue, such as MySQL 8 and Oracle 18. You can also use multiple JDBC driver versions in the same AWS Glue job, enabling you to migrate data between source and target databases with different versions. For more information, see Connection Types and Options for ETL in AWS Glue.

This post shows how to build AWS Glue Spark ETL jobs and set up connections with custom drivers for Oracle 18 and MySQL 8 databases.

Solution overview

We discuss three different use cases in this post, using AWS Glue, Amazon RDS for MySQL, and Amazon RDS for Oracle.

In the following architecture, we connect to Oracle 18 using an external ojdbc7.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.

In the second scenario, we connect to MySQL 8 using an external mysql-connector-java-8.0.19.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to MySQL 8.

In the third scenario, we set up a connection where we connect to Oracle 18 and MySQL 8 using external drivers from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.

Prerequisites

Before getting started, you must complete the following prerequisites:

  1. Create an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console. Your IAM permissions must also include access to create IAM roles and policies created by the AWS CloudFormation template provided in this post.
  2. Create an IAM policy for AWS Glue.
  3. Before setting up the AWS Glue job, you need to download drivers for Oracle and MySQL, which we discuss in the next section.

Downloading drivers for Oracle and MySQL

To download the required drivers for Oracle and MySQL, complete the following steps:

  1. Download the MySQL JDBC connector.
  2. Select the operating system as platform independent and download the .tar.gz or .zip file (for example, mysql-connector-java-8.0.19.tar.gz or mysql-connector-java-8.0.19.zip) and extract it.
  3. Pick the MySQL connector .jar file (such as mysql-connector-java-8.0.19.jar) and upload it to your Amazon Simple Storage Service (Amazon S3) bucket.
  4. Make a note of that path because you use it later in the AWS Glue job to point to the JDBC driver.
  5. Similarly, download the Oracle JDBC connector (ojdbc7.jar).

This post is tested with the mysql-connector-java-8.0.19.jar and ojdbc7.jar drivers, but based on your database types, you can download and use the appropriate versions of the JDBC drivers supported by your databases.

  1. Upload the Oracle JDBC 7 driver (ojdbc7.jar) to your S3 bucket.
  2. Make a note of that path, because you use it in the AWS Glue job to establish the JDBC connection with the database.
  3. Make sure to upload the three scripts (OracleBYOD.py, MySQLBYOD.py, and CrossDB_BYOD.py) in an S3 bucket.
  4. Save the following code as OracleBYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_oracle18_options_source_emp = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "employee",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    connection_oracle18_options_source_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
        
    connection_oracle18_options_target_emp_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "emp_dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
        
    # Read DynamicFrame from Oracle 
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    # Write data to Oracle 
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="oracle", connection_options=connection_oracle18_options_target_emp_dept)

  1. Save the following code as MySQLBYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_mysql8_options_source_emp = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "employee",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
        
    connection_mysql8_options_source_dept = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "dept",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
    
    connection_mysql8_options_target_emp_dept = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "emp_dept",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
        
    
    # Read from JDBC databases with custom driver
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    
    #print "Applied mapping to the Glue DynamicFrame"
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="mysql", connection_options=connection_mysql8_options_target_emp_dept)
    

  1. Save the following code as CrossDB_BYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_mysql8_options_source_emp = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "employee",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
    
    connection_oracle18_options_source_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    connection_oracle18_options_target_emp_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "emp_dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    # Read the employee DynamicFrame from MySQL and the dept DynamicFrame from Oracle
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    # Write data to Oracle
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="oracle", connection_options=connection_oracle18_options_target_emp_dept)
    
    

Provisioning resources with AWS CloudFormation

The generic workflow of setting up a connection with your own custom JDBC drivers involves various steps. It’s a manual configuration that is error prone and adds overhead when repeating the steps between environments and accounts. With AWS CloudFormation, you can provision your application resources in a safe, repeatable manner, allowing you to build and rebuild your infrastructure and applications without having to perform manual actions or write custom scripts. The declarative code in the file captures the intended state of the resources to create, and allows you to automate the creation of AWS resources.

We provide this CloudFormation template for you to use. Review and customize it to suit your needs. Some of the resources deployed by this stack incur costs as long as they remain in use, like Amazon RDS for Oracle and Amazon RDS for MySQL.

This CloudFormation template creates the following resources:

  • A VPC
  • Two subnets
  • A route table
  • An internet gateway
  • A MySQL 8 database
  • An Oracle 18 database

To provision your resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Next.
  4. Choose Launch Stack:

This step automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console as required.

  1. For Stack name, enter a name.
  2. Change the other parameters as needed or keep the following default values:
    1. Oracle user name – oraadmin
    2. Oracle password – oraadmin123
    3. MySQL username – MySQLadmin
    4. MySQL password – MYSQLadmin123

  1. Choose Next.
  2. Choose Next.
  3. Review the details and choose Create.

This stack creation can take up to 20 minutes.

After the stack creation is complete, go to the Outputs tab on the AWS CloudFormation console and note the following values (you use these in later steps):

  • MySQLJDBCConnectionString
  • OracleJDBCConnectionString

Configuring an AWS Glue ETL job using your own drivers

Before creating an AWS Glue ETL, run the SQL script (database_scripts.sql) on both the databases (Oracle and MySQL) to create tables and insert data. For more information about connecting to the RDS DB instance, see How can I troubleshoot connectivity to an Amazon RDS DB instance that uses a public or private subnet of a VPC?

To set up AWS Glue connections, complete the following steps:

  1. On the AWS Glue console, under Databases, choose Connections.
  2. Choose Add Connection.
  3. For Connection Name, enter a name for your connection.
  4. For Connection Type, choose JDBC.
  5. For JDBC URL, enter a URL, such as jdbc:oracle:thin://@<hostname>:1521/ORCL for Oracle or jdbc:mysql://<hostname>:3306/mysql for MySQL.
  6. Enter the user name and password for the database.
  7. Select the VPC in which you created the RDS instance (Oracle and MySQL).
  8. Choose the subnet within your VPC. Refer to the CloudFormation stack Outputs tab for the subnet name.
  9. Choose the security group of the database. Refer to the CloudFormation stack Outputs tab for security group name.
  10. Choose Next.
  11. Check the connection details and choose Finish.

Make sure to add a connection for both databases (Oracle and MySQL).

Creating endpoints and a security group

Before testing the connection, make sure you create an AWS Glue endpoint and S3 endpoint in the VPC in which databases are created. Complete the following steps for both Oracle and MySQL instances:

  1. To create your AWS Glue endpoint, on the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. For Service Names, choose AWS Glue.
  4. Choose com.amazonaws.<region>.glue (for example, com.amazonaws.us-west-2.glue).
  5. Choose the VPC of the RDS for Oracle or RDS for MySQL instance.
  6. Choose the security group of the RDS instances.
  7. Choose Create endpoint.

To create your S3 endpoint, you use Amazon Virtual Private Cloud (Amazon VPC).

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. For Service Names, choose Amazon S3.
  4. Choose com.amazonaws.<region>.s3 (for example, com.amazonaws.us-west-2.s3).
  5. Choose the VPC of the RDS for Oracle or RDS for MySQL instance.
  6. Choose the route table ID.
  7. Choose Create endpoint.

The RDS for Oracle or RDS for MySQL security group must include itself as a source in its inbound rules.

  1. On the Security Groups page, choose Edit Inbound Rules.
  2. Choose Add rule.
  3. For Type, choose All traffic.
  4. For Source, choose the same security group (for example, glue-byod-stack1…).
  5. Choose Save Rules.

If both the databases are in the same VPC and subnet, you don't need to create a connection for the MySQL and Oracle databases separately. The reason for setting up an AWS Glue connection to the databases is to establish a private connection between the RDS instances in the VPC and AWS Glue via the S3 endpoint, AWS Glue endpoint, and Amazon RDS security group. You don't need to test the JDBC connection because that connection is established by the AWS Glue job when you run it. If you test the connection with MySQL 8, it fails because the AWS Glue connection doesn't support the MySQL 8.0 driver at the time of writing this post; therefore, you need to bring your own driver.

Setting up AWS Glue ETL jobs

You’re now ready to set up your ETL job in AWS Glue. Complete the following steps for both connections:

  1. Edit the following parameters in the scripts (OracleBYOD.py, MySQLBYOD.py, and CrossDB_BYOD.py) and upload them in Amazon S3:
    1. url
    2. user
    3. password
    4. customJdbcDriverS3Path for sources and target tables

You can find the database endpoints (url) on the CloudFormation stack Outputs tab; the other parameters are mentioned earlier in this post. If you use another driver, make sure to change customJdbcDriverClassName to the corresponding class in the driver.

Alternatively, you can pass these values as AWS Glue job parameters and retrieve them within the script using getResolvedOptions, as shown in the following sketch.
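This is a minimal sketch of that approach; the parameter names db_url, db_user, db_password, and driver_s3_path are illustrative and not part of the provided scripts. It replaces the hardcoded connection values at the top of the script:

import sys
from awsglue.utils import getResolvedOptions

# Values passed to the job as --db_url, --db_user, --db_password, and --driver_s3_path.
args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "db_url", "db_user", "db_password", "driver_s3_path"]
)

connection_mysql8_options_source_emp = {
    "url": args["db_url"],
    "dbtable": "employee",
    "user": args["db_user"],
    "password": args["db_password"],
    "customJdbcDriverS3Path": args["driver_s3_path"],
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
}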

  1. On the AWS Glue console, under ETL, choose Jobs.
  2. Choose Add Job.
  3. For Job Name, enter a name.
  4. For IAM role, choose the IAM role you created as a prerequisite.
  5. For Type, choose Spark.
  6. For Glue Version, choose Python (latest version).
  7. For This job runs, choose An existing script that you provide.
  8. Choose the Amazon S3 path where the script (OracleBYOD.py, MySQLBYOD.py, or CrossDB_BYOD.py) is stored.
  9. Under Advanced properties, enable Job bookmark.

Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.

  1. Keep the remaining settings as their defaults and choose Next.
  2. For Connections, choose the Amazon RDS for Oracle connection for OracleBYOD.py, Amazon RDS for MySQL connection for MySQLBYOD.py, or Amazon RDS for Oracle and Amazon RDS for MySQL connection for CrossDB_BYOD.py.
  3. Choose Save job and edit scripts.
  4. Choose Run Job.
  5. When the job is complete, validate the data loaded in the target table.
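If you'd rather create and run the job programmatically than through the console, the following is a minimal boto3 sketch; the job name, role, script path, and connection name are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="mysql-byod-job",
    Role="GlueBYODRole",  # the IAM role created as a prerequisite
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<Bucket>/<Folder>/MySQLBYOD.py",
        "PythonVersion": "3",
    },
    Connections={"Connections": ["mysql8-connection"]},
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    GlueVersion="2.0",
)

glue.start_job_run(JobName="mysql-byod-job")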

Cleaning up

After you finish, don’t forget to delete the CloudFormation stack, because some of the AWS resources deployed by the stack in this post incur a cost as long as you continue to use them.

You can delete the CloudFormation stack to delete all AWS resources created by the stack.

  1. On the AWS CloudFormation console, on the Stacks page, select the stack to delete. The stack must be currently running.
  2. In the stack details pane, choose Delete.
  3. Choose Delete stack when prompted.

Summary

In this post, we showed you how to build AWS Glue Spark ETL jobs and set up connections with custom drivers for Oracle 18 and MySQL 8 databases using AWS CloudFormation. You can use this solution to use your custom drivers for databases not supported natively by AWS Glue.

If you have any questions or suggestions, please leave a comment.


About the Authors

Srikanth Sopirala is a Sr. Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family and road biking.

Naresh Gautam is a Sr. Analytics Specialist Solutions Architect at AWS. His role is helping customers architect highly available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys meditation and cooking.