Running queries securely from the same VPC where an Amazon Redshift cluster is running

Post Syndicated from Seetha Sarma original https://aws.amazon.com/blogs/big-data/running-queries-securely-from-the-same-vpc-where-an-amazon-redshift-cluster-is-running/

Customers who don’t need to set up a VPN or a private connection to AWS often use public endpoints to access AWS. Although this is acceptable for testing out the services, most production workloads need a secure connection to their VPC on AWS. If you’re running your production data warehouse on Amazon Redshift, you can run your queries in Amazon Redshift query editor or use Amazon WorkSpaces from Amazon Virtual Private Cloud (Amazon VPC) to connect to Amazon Redshift securely and analyze and graph a data summary in your favorite business intelligence (BI) or data visualization desktop tool.

With Amazon Redshift, you can query petabytes of structured and semi-structured data across your data warehouse, operational database, and your data lake using standard SQL. Amazon WorkSpaces is a managed, secure Desktop-as-a-Service (DaaS) solution deployed within an Amazon VPC. In this post, we show how you can run SQL queries on Amazon Redshift securely without VPN using Amazon Redshift query editor and Amazon WorkSpaces. First, we discuss how to run queries that return large datasets from the Amazon Redshift query editor using the UNLOAD command. Next, we discuss how to set up Amazon WorkSpaces and use it to securely run queries on Amazon Redshift. We cover the detailed steps for setting up Amazon WorkSpaces and show different scenarios to test Amazon Redshift queries with Amazon WorkSpaces.

The following diagram illustrates these architectures.

Using the Amazon Redshift query editor with UNLOAD command

Amazon Redshift has a query editor on its console that is typically used to run short queries and explore the data in the Amazon Redshift database. You may have a production scenario with queries that return large result sets. For instance, you may want to unload CSV data for use by a downstream process. In this case, you can run your query in the query editor and use the UNLOAD command to send the output directly to Amazon Simple Storage Service (Amazon S3) and get notified when the data is uploaded.

  1. Create separate S3 buckets for each user. You can use the default configuration for the bucket.
  2. On the Amazon Redshift console, choose Editor.
  3. In the Connect to database section, enter the database connection details.
  4. Choose Connect to database.
  5. Use the following format for the queries run from the query editor, using the IAM role for Amazon Redshift, so the results get uploaded to Amazon S3:
    UNLOAD
    ('select id, name
    from <your_schema>.<your_table>')
    TO 's3://<username>/<yy-mm-dd-hh24-mi-ss>/'
    FORMAT AS CSV
    iam_role 'arn:aws:iam::<myaccount>:role/MyAmazon_RedshiftUnloadRole';

This query creates multiple files in the designated user’s S3 bucket under the date/time prefix. The user can preview or download the individual files on the Amazon S3 console.
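
If the user prefers the command line over the console, the unloaded files can also be listed with the AWS CLI (assuming credentials with read access to the bucket):
    aws s3 ls s3://<username>/<yy-mm-dd-hh24-mi-ss>/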

A large unload may take some time. You can configure Amazon Simple Notification Service (Amazon SNS) to send a notification when the results are uploaded to Amazon S3.

  1. On the Amazon SNS console, choose Topics.
  2. Choose Create topic.
  3. Create an SNS topic with a meaningful description, such as Your query results are uploaded to S3.

In the next steps, you edit the access policy of the SNS topic to give permission for Amazon S3 to publish to it.

  4. Change the Principal from "AWS": "*" to "Service": "s3.amazonaws.com".
  5. Scroll down to “Action” and delete everything except “SNS:Publish”. Make sure to delete the extra comma.
  6. Scroll down to “Condition” and modify the text "StringEquals": { "AWS:SourceOwner": <Your account id>} to "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:<user-bucket-name>" }.
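
After these edits, the topic’s access policy should look similar to the following sketch (the Region, account ID, topic name, and bucket name are placeholders):
    {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "__default_statement_ID",
                "Effect": "Allow",
                "Principal": {
                    "Service": "s3.amazonaws.com"
                },
                "Action": "SNS:Publish",
                "Resource": "arn:aws:sns:<region>:<account-id>:<topic-name>",
                "Condition": {
                    "ArnLike": {
                        "aws:SourceArn": "arn:aws:s3:*:*:<user-bucket-name>"
                    }
                }
            }
        ]
    }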

  7. In the navigation pane, choose Subscriptions.
  8. Choose Create subscription.

  9. Subscribe to the SNS notification with the user’s email address as the endpoint.

  10. Make sure the user confirms the subscription from their email.
  11. On the Amazon S3 console, choose the Properties tab of the user’s S3 bucket.
  12. Under Event Notifications, choose Create event notification.

  13. Select All object create events.

  14. For Destination, select SNS topic.
  15. For Specify SNS topic, select Choose from your SNS topics.
  16. For SNS topic, choose the topic you created.

  17. Save your settings.
  18. To test the notification, on the Amazon Redshift console, open the query editor.
  19. Edit the UNLOAD query and change the date/time prefix in the S3 path to the current date and time.
  20. Run the query and check that the user gets the email notification.

Using Amazon WorkSpaces to run Amazon Redshift queries

In this section, we cover setting up Amazon WorkSpaces, including Amazon VPC prerequisites, creating an Amazon VPC endpoint for Amazon S3, launching Amazon WorkSpaces in the same VPC where an Amazon Redshift cluster is running, setting up an Amazon WorkSpaces client, installing PSQL or a SQL client, and connecting to the client.

When setup is complete, we show different scenarios to test with Amazon WorkSpaces, such as running a SQL command from the Amazon WorkSpaces client, using the SCREEN program to run SQL in the background, and using PSQL with Amazon S3 to get a notification through Amazon SNS.

Prerequisites

By default, AWS Identity and Access Management (IAM) users and roles can’t perform tasks using the AWS Management Console, and they don’t have permission to create or modify Amazon VPC resources. Make sure you have administrator privileges, or have an administrator create IAM policies that grant sufficient permissions to edit the route table, edit the VPC security group, and enable DNS hostnames for the VPC.
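
A minimal sketch of such a policy might look like the following (the action list is illustrative, not exhaustive, and you should scope the resources down for production use):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeVpcs",
                    "ec2:ModifyVpcAttribute",
                    "ec2:DescribeRouteTables",
                    "ec2:CreateRoute",
                    "ec2:AssociateRouteTable",
                    "ec2:DescribeSecurityGroups",
                    "ec2:AuthorizeSecurityGroupIngress"
                ],
                "Resource": "*"
            }
        ]
    }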

When you have the correct permissions, complete the following prerequisite steps:

  1. On the Amazon Redshift console, check the cluster subnet group configuration to make sure the Amazon Redshift cluster is created in an Amazon VPC with at least two subnets that are in separate Availability Zones.
  2. On the Amazon VPC console, edit the route table and make sure to associate these two subnets.
  3. Make sure the Amazon VPC security group has a self-referencing inbound rule for the security group for all traffic (not all TCP). The self-referencing rule restricts the source to the same security group in the VPC, and it’s not open to all networks. Consider limiting the rule to just the protocol and port needed for Amazon Redshift to talk to Amazon WorkSpaces, as in the CLI sketch following this list.
  4. Edit the Amazon VPC settings to enable DNS hostnames.
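
As a sketch of that narrower rule, the following AWS CLI command adds a self-referencing inbound rule for just the default Amazon Redshift port (the security group ID is a placeholder):
    aws ec2 authorize-security-group-ingress \
        --group-id <security-group-id> \
        --protocol tcp \
        --port 5439 \
        --source-group <security-group-id>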

Creating an Amazon VPC endpoint for Amazon S3 for software downloads

In this step, you create your Amazon VPC endpoint for Amazon S3. This gives you access to Amazon S3 so you can download PSQL from the Amazon Linux repository. Alternatively, you could set up a NAT gateway and download PSQL or other SQL clients from the internet.

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. Search for Service Name: S3.
  4. Select the S3 service gateway.
  5. Select the Amazon VPC where the Amazon Redshift cluster is running.
  6. Select the route table.
  7. Enter the following custom policy for the endpoint to access the Amazon Linux AMI repository:
    {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "Amazon Linux AMI Repository Access",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": [
                    "arn:aws:s3:::*.amazonaws.com",
                    "arn:aws:s3:::*.amazonaws.com/*"
                ]
            }
        ]
    }

  8. Create the endpoint.
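
If you prefer the AWS CLI, you can create an equivalent gateway endpoint as follows (the VPC ID, Region, route table ID, and policy file name are placeholders):
    aws ec2 create-vpc-endpoint \
        --vpc-id <vpc-id> \
        --vpc-endpoint-type Gateway \
        --service-name com.amazonaws.<region>.s3 \
        --route-table-ids <route-table-id> \
        --policy-document file://s3-endpoint-policy.json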

Launching Amazon WorkSpaces in the VPC where the Amazon Redshift cluster runs

You’re now ready to launch Amazon WorkSpaces.

  1. On the Amazon WorkSpaces console, choose Launch WorkSpaces.
  2. For Directory types, select Simple AD.

A directory service helps you store information and manage access to resources. For this post, we choose Simple AD.

  3. For Directory size, select Small.
  4. Enter your directory details.
  5. For VPC, choose the VPC where the Amazon Redshift cluster is running.
  6. For Subnets, choose the two subnets you created.

It may take a few minutes to provision the directory. The status shows Active when it’s ready.

  7. When the directory is provisioned, choose the directory and subnets you created.
  8. Choose Next Step.

  9. Create and identify your users.

  10. Use the default settings for the compute bundle.
  11. For Running Mode, select AlwaysOn.

Alternatively, select AutoStop and increase the stop time so that long-running queries aren’t interrupted.

  12. Review and launch WorkSpaces.

The WorkSpace may take up to 20 minutes to become available.

Setting up the Amazon WorkSpaces client

In this section, you configure your Amazon WorkSpaces client.

  1. Use the link from your email to download the Amazon WorkSpaces client.
  2. Register it using the registration code from the email.
  3. Log in with the username from the email and your newly created password.
  4. In the Amazon WorkSpaces client, open the terminal.
  5. Run the following command to capture the IP address:
hostname -I | awk '{print $2}'

The output is the IP address of your WorkSpace. Note it, because you use it later to configure the security group.

  6. On the Amazon Redshift console, choose Clusters.
  7. Choose your cluster.
  8. Save the endpoint information to use later.
  9. Choose the Properties tab.

  10. In the Network and security section, note the VPC security group.
  11. On the Amazon VPC console, under Security, choose Security Groups.
  12. Select the security group that the Amazon Redshift cluster uses.

  13. Add an inbound rule with the type Redshift and the source set to the IP address you captured, with a /32 suffix (for example, 172.16.0.25/32).

  14. On the Amazon WorkSpaces client, use the Amazon Redshift hostname from the endpoint you saved earlier and verify the VPC setup with the following code:
     nslookup <Amazon Redshift hostname>

If you see an IP address within your subnet range, the private endpoint setup for Amazon Redshift was successful.
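
The output should look similar to the following sketch (the addresses are illustrative and depend on your subnet CIDR ranges):
    Server:         10.0.0.2
    Address:        10.0.0.2#53

    Non-authoritative answer:
    Name:    <Amazon Redshift hostname>
    Address: 10.0.1.42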

Testing a SQL command from PSQL or a SQL client in the Amazon WorkSpaces client

To test a SQL command, complete the following steps:

  1. From the terminal in the Amazon WorkSpaces client, run the following command to install PostgreSQL:
    sudo yum install postgresql-server

Alternatively, set up a NAT gateway and download a SQL client such as SQL Workbench to the Amazon WorkSpaces client:

sudo wget https://www.sql-workbench.eu/Workbench-Build125-with-optional-libs.zip

Then unzip the content of the downloaded file and save it to a directory:

unzip Workbench-Build125-with-optional-libs.zip -d ~/Workbench
  2. Use the Amazon Redshift hostname, port, and database name of the Amazon Redshift cluster endpoint you copied earlier and try connecting to the database:
    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W

  3. Enter your password when prompted.
  4. Run a SQL command and check the results.
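
For example, you can reuse the query from the UNLOAD example earlier in this post (the schema and table names are placeholders):
    select id, name from <your_schema>.<your_table> limit 10;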

Testing the SCREEN program to run SQL in the background

You can use the SCREEN program to run a SQL command in the background, detach, and resume the session later to see the results.

  1. From the terminal in the Amazon WorkSpaces client, install the SCREEN program:
    sudo yum install screen

  2. Run the program:
    screen

  3. Connect to PSQL:
    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W

  4. Enter the password when prompted.
  5. Run the SQL command.
  6. Press Ctrl+A, then D to detach from the screen.

The SQL command is now running in the background. You can check by running ps -ef | grep psql.

  7. To go back to that screen, run the following command:
    screen -r

  8. To quit SCREEN, press Ctrl+A, then \.

Testing PSQL with Amazon S3 and Amazon SNS

Similar to the UNLOAD command we used from the Amazon Redshift query editor in the beginning of this post, you can run PSQL from the Amazon WorkSpaces client, send the output to an S3 bucket, and get an Amazon SNS notification for an object create event.

  1. From the terminal in the Amazon WorkSpaces client, run aws configure to set up AWS credentials with write access to the S3 bucket.
  2. Run the following command to write the query output to Amazon S3 and trigger the email notification. Shell redirection can’t target an S3 URI directly, so the output is piped to aws s3 cp, which reads from standard input when the source is - (the object name result.txt is a placeholder):
    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W -c 'select column1, column2 from myschema.mytable' | aws s3 cp - s3://<username>/<yy-mm-dd-hh24-mi-ss>/result.txt

Conclusion

This post explained how to securely query Amazon Redshift databases running in an Amazon VPC using the UNLOAD command with the Amazon Redshift query editor and using Amazon WorkSpaces running in the same VPC. Try out this solution to securely access Amazon Redshift databases without a VPN and run long-running queries. If you have any questions or comments, please share your thoughts in the comments section.


About the Authors

Seetha Sarma is a Senior Database Specialist Solutions Architect with Amazon Web Services. Seetha provides guidance to customers on using AWS services for distributed data processing. In her spare time she likes to go on long walks and enjoy nature.

Moiz Mian is a Solutions Architect for AWS Strategic Accounts. He focuses on enabling customers to build innovative, scalable, and secure solutions for the cloud. In his free time, he enjoys building out Smart Home systems and driving at race tracks.