All posts by BP Yau

How to get best price performance from your Amazon Redshift Data Sharing deployment

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/how-to-get-best-price-performance-from-your-amazon-redshift-data-sharing-deployment/

Amazon Redshift is a fast, scalable, secure, and fully-managed data warehouse that enables you to analyze all of your data using standard SQL easily and cost-effectively. Amazon Redshift Data Sharing allows customers to securely share live, transactionally consistent data in one Amazon Redshift cluster with another Amazon Redshift cluster across accounts and regions without needing to copy or move data from one cluster to another.

Amazon Redshift Data Sharing was initially launched in March 2021, and support for cross-account data sharing was added in August 2021. Cross-Region support became generally available in February 2022. This provides full flexibility and agility to share data across Amazon Redshift clusters in the same AWS account, different accounts, or different Regions.

Amazon Redshift Data Sharing is used to fundamentally redefine Amazon Redshift deployment architectures into a hub-spoke, data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, easily onboard new use cases, and most importantly do all of this without the complexity of data movement and data copies. Some of the most common questions asked during data sharing deployment are, “How big should my consumer clusters and producer clusters be?”, and “How do I get the best price performance for workload isolation?”. As workload characteristics like data size, ingestion rate, query pattern, and maintenance activities can impact data sharing performance, a continuous strategy to size both consumer and producer clusters to maximize the performance and minimize cost should be implemented. In this post, we provide a step-by-step approach to help you determine your producer and consumer clusters sizes for the best price performance based on your specific workload.

Generic consumer sizing guidance

The following steps show the generic strategy to size your producer and consumer clusters. You can use it as a starting point and modify it to suit your specific use case.

Size your producer cluster

You should always make sure that you properly size your producer cluster to get the performance that you need to meet your SLA. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation for the producer cluster based on the size of your data and query characteristics. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started, and you should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Size and set up the initial consumer cluster

You should always size your consumer cluster based on your compute needs. One way to get started is to follow the generic cluster sizing guide similar to the producer cluster above.

Set up Amazon Redshift data sharing

Once you have both the producer and consumer clusters set up, configure data sharing from the producer to the consumer. Refer to this post for guidance on how to set up data sharing.
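
The following is a minimal sketch of that setup; the share name, database name, and namespace IDs are placeholders that you substitute with your own values:

-- On the producer cluster: create a data share and add the objects to share
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public;

-- Grant usage to the consumer cluster namespace (use GRANT ... TO ACCOUNT for cross-account sharing)
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer cluster: create a local database that points to the share
CREATE DATABASE sales_shared_db FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-id>';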

Test consumer only workload on initial consumer cluster

Test consumer only workload on the new initial consumer cluster. This can be done by pointing consumer applications, for example ETL tools, BI applications, and SQL clients, to the new consumer cluster and rerunning the workload to evaluate the performance against your requirements.

Test consumer only workload on different consumer cluster configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either continue to use this cluster configuration or you can test on smaller configurations to see if you can further reduce the cost and still get the performance that you need.

On the other hand, if the initial size consumer cluster fails to meet your workload performance requirements, then you can further test larger configurations to get the configuration that meets your SLA.

As a rule of thumb, size up the consumer cluster by 2x the initial cluster configuration incrementally until it meets your workload requirements.

Once you plan out what configuration you want to test, use elastic resize to resize the initial cluster to the target cluster configuration. After elastic resize is completed, perform the same workload test and evaluate the performance against your SLA. Select the configuration that meets your price performance target.

Test producer only workload on different producer cluster configurations

Once you move your consumer workload to the consumer cluster with the optimum price performance, there might be an opportunity to reduce the compute resource on the producer to save on costs.

To achieve this, you can rerun the producer only workload on 1/2x of the original producer size and evaluate the workload performance. Depending on the result, resize the cluster up or down accordingly, and then select the minimum producer configuration that meets your workload performance requirements.

Re-evaluate after a full workload run over time

As Amazon Redshift continues to evolve and release performance and scalability improvements, data sharing performance will keep improving. Furthermore, numerous variables might impact the performance of data sharing queries. The following are just some examples:

  • Ingestion rate and amount of data change
  • Query pattern and characteristic
  • Workload changes
  • Concurrency
  • Maintenance activities, for example vacuum, analyze, and automatic table optimization (ATO)

This is why you must re-evaluate the producer and consumer cluster sizing using the strategy above on occasion, especially after a full workload deployment, to gain the new best price performance from your cluster’s configuration.

Automated sizing solutions

If your environment involves a more complex architecture, for example with multiple tools or applications (BI, ingestion or streaming, ETL, data science), then it might not be feasible to use the manual method from the generic guidance above. Instead, you can leverage solutions in this section to automatically replay the workload from your production cluster on the test consumer and producer clusters to evaluate the performance.

The Simple Replay utility is used as the automated solution to guide you through the process of getting the right producer and consumer cluster sizes for the best price performance.

Simple Replay is a tool for conducting a what-if analysis and evaluating how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. It also includes enhanced support for replaying data ingestion and export pipelines with COPY and UNLOAD statements. To get started and replay your workloads, download the tool from the Amazon Redshift GitHub repository.

Here we walk through the steps to extract your workload logs from the source production cluster and replay them in an isolated environment. This lets you perform a direct comparison between these Amazon Redshift clusters seamlessly and select the cluster configuration that best meets your price performance target.

The following diagram shows the solution architecture.

Architecture for testing Simple Replay

Solution walkthrough

Follow these steps to go through the solution to size your consumer and producer clusters.

Size your production cluster

You should always make sure to properly size your existing production cluster to get the performance that you need to meet your workload requirements. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation on the production cluster based on the size of your data and query characteristics. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started. You should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Identify the workload to be isolated

You might have different workloads running on your original cluster, but the first step is to identify the most critical workload to the business that we want to isolate. This is because we want to make sure that the new architecture can meet your workload requirements. This post is a good reference on a data sharing workload isolation use case that can help you decide which workload can be isolated.

Set up Simple Replay

Once you know your critical workload, enable audit logging on the production cluster where that workload runs to capture query activities and store them in Amazon Simple Storage Service (Amazon S3). Note that it may take up to three hours for the audit logs to be delivered to Amazon S3. Once the audit log is available, proceed to set up Simple Replay and then extract the critical workload from the audit log. Note that start_time and end_time can be used as parameters to filter the critical workload if it runs during certain time periods, for example 9 AM to 11 AM. Otherwise, Simple Replay extracts all of the logged activities.

Baseline workload

Create a baseline cluster with the same configuration as the producer cluster by restoring from the production snapshot. The purpose of starting with the same configuration is to baseline the performance with an isolated environment.

Once the baseline cluster is available, replay the extracted workload in the baseline cluster. The output from this replay will be the baseline used to compare against subsequent replays on different consumer configurations.

Set up initial producer and consumer test clusters

Create a producer cluster with the same production cluster configuration by restoring from the production snapshot. Create a consumer cluster with the recommended initial consumer size from the previous guidance. Then set up data sharing between the producer and the consumer.

Replay workload on initial producer and consumer

Replay the producer only workload on the initial size producer cluster. This can be achieved using the “Exclude” filter parameter to exclude consumer queries, for example the user that runs consumer queries.

Replay the consumer only workload on the initial size consumer cluster. This can be achieved using the “Include” filter parameter to include only the consumer queries, for example the user that runs consumer queries.

Evaluate the performance of these replays against the baseline and workload performance requirements.

Replay consumer workload on different configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either use this cluster configuration or you can follow these steps to test on smaller configurations to see if you can further reduce costs and still get the performance that you need.

Compare initial consumer performance results against your workload requirements:

  1. If the result exceeds your workload performance requirements, then you can reduce the size of the consumer cluster incrementally, starting with 1/2x, retry the replay and evaluate the performance, then resize up or down accordingly based on the result until it meets your workload requirements. The purpose is to get a sweet spot where you’re comfortable with the performance requirements and get the lowest price possible.
  2. If the result fails to meet your workload performance requirements, then you can increase the size of the cluster incrementally, starting with 2x the original size, retry the replay and evaluate the performance until it meets your workload performance requirements.

Replay producer workload on different configurations

Once you split your workloads out to consumer clusters, the load on the producer cluster should be reduced and you should evaluate your producer cluster’s workload performance to seek the opportunity to downsize to save on costs.

The steps are similar to the consumer replay. Elastic resize the producer cluster incrementally, starting with 1/2x the original size, replay the producer only workload and evaluate the performance, and then further resize up or down until it meets your workload performance requirements. The purpose is to get a sweet spot where you’re comfortable with the workload performance requirements and get the lowest price possible. Once you have the desired producer cluster configuration, replay the consumer workloads on the consumer cluster again to make sure that the performance wasn’t impacted by the producer cluster configuration changes. Finally, replay both producer and consumer workloads concurrently to make sure that the performance holds in a full workload scenario.

Re-evaluate after a full workload run over time

Similar to the generic guidance, you should re-evaluate the producer and consumer clusters sizing using the previous strategy on occasion, especially after full workload deployment to gain the new best price performance from your cluster’s configuration.

Clean up

Running these sizing tests in your AWS account may have some cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save on costs. We also recommend pausing your clusters when they’re not in use.

Applying Amazon Redshift and data sharing best practices

Proper sizing of both your producer and consumer clusters will give you a good start to get the best price performance from your Amazon Redshift deployment. However, sizing isn’t the only factor that can maximize your performance. In this case, understanding and following best practices are equally important.

General Amazon Redshift performance tuning best practices are applicable to data sharing deployment. Make sure that your deployment follows these best practices.

There are numerous data sharing-specific best practices that you should follow to make sure that you maximize performance. Refer to this post for more details.

Summary

There is no one-size-fits-all recommendation on producer and consumer cluster sizes. It varies by workloads and your performance SLA. The purpose of this post is to provide you with guidance on how you can evaluate your specific data sharing workload performance to determine both consumer and producer cluster sizes to get the best price performance. Consider testing your workloads on the producer and consumer using Simple Replay before adopting the new architecture in production to get the best price performance.


About the Authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance and operational excellence at scale in their cloud journey. He has a keen interest in Data Analytics as well.

Amazon Redshift data sharing best practices and considerations

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/amazon-redshift-data-sharing-best-practices-and-considerations/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Amazon Redshift data sharing allows for a secure and easy way to share live data for reading across Amazon Redshift clusters. It allows an Amazon Redshift producer cluster to share objects with one or more Amazon Redshift consumer clusters for read purposes without having to copy the data. With this approach, workloads isolated to different clusters can share and collaborate frequently on data to drive innovation and offer value-added analytic services to your internal and external stakeholders. You can share data at many levels, including databases, schemas, tables, views, columns, and user-defined SQL functions, to provide fine-grained access controls that can be tailored for different users and businesses that all need access to Amazon Redshift data. The feature itself is simple to use and integrate into existing BI tools.

In this post, we discuss Amazon Redshift data sharing, including some best practices and considerations.

How does Amazon Redshift data sharing work?

  • To achieve best-in-class performance, Amazon Redshift consumer clusters cache and incrementally update block-level metadata (let us refer to this as block metadata) of the objects that are queried from the producer cluster. This works even when the producer cluster is paused.
  • The time taken for caching block metadata depends on the rate of data change on the producer since the respective objects were last queried on the consumer. (As of today, consumer clusters update their metadata cache for an object only on demand, that is, when the object is queried.)
  • If there are frequent DDL operations on the producer, the consumer is forced to re-cache the full block metadata for an object during the next access to maintain consistency, because structure changes on the producer invalidate all the existing metadata cache on the consumers. This is what enables live sharing.
  • Once the consumer has the block metadata in sync with the latest state of an object on the producer, the query runs like any other regular query (a query referring to local objects).

Now that we have the necessary background on data sharing and how it works, let’s look at a few best practices across streams that can help improve workloads while using data sharing.

Security

In this section, we share some best practices for security when using Amazon Redshift data sharing.

Use INCLUDE NEW cautiously

INCLUDE NEW is a very useful setting when adding a schema to a data share (ALTER DATASHARE). If set to TRUE, it automatically adds all objects created in the specified schema in the future to the data share. This might not be ideal in cases where you want fine-grained control over the objects being shared. In those cases, leave the setting at its default of FALSE.
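
For illustration, the following sketch shows both options; the share and schema names are assumptions:

-- Add a schema to the data share; future objects are not added automatically (default behavior)
ALTER DATASHARE sales_share ADD SCHEMA public;

-- Opt in to automatically sharing objects created in the schema later, only if that is what you want
ALTER DATASHARE sales_share SET INCLUDENEW = TRUE FOR SCHEMA public;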

Use views to achieve fine-grained access control

To achieve fine-grained access control for data sharing, you can create late-binding views or materialized views on shared objects on the consumer, and then grant access to these views to users on the consumer cluster instead of giving them full access to the original shared objects. This comes with its own set of considerations, which we explain later in this post.
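
The following sketch illustrates the pattern on the consumer; the shared database, local schema, view, column, and user names are assumptions:

-- On the consumer: expose only selected columns of a shared table through a late-binding view
CREATE VIEW analytics.customer_slim AS
SELECT c_customer_id, c_first_name, c_last_name
FROM sales_shared_db.public.customer
WITH NO SCHEMA BINDING;

-- Grant access to the view instead of the underlying shared objects
GRANT USAGE ON SCHEMA analytics TO bi_user;
GRANT SELECT ON analytics.customer_slim TO bi_user;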

Audit data share usage and changes

Amazon Redshift provides an efficient way to audit all the activity and changes with respect to a data share using system views such as SVL_DATASHARE_CHANGE_LOG, which tracks changes to data shares, and SVL_DATASHARE_USAGE_PRODUCER and SVL_DATASHARE_USAGE_CONSUMER, which track data sharing usage on the producer and consumer, respectively.
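
For example, the following queries (a quick sketch; add filters that match your share names and time window) surface recent data share changes and usage:

-- Changes made to data shares, such as objects added or removed and permissions granted
SELECT * FROM svl_datashare_change_log LIMIT 20;

-- Data sharing query activity as recorded on the producer and on the consumer
SELECT * FROM svl_datashare_usage_producer LIMIT 20;
SELECT * FROM svl_datashare_usage_consumer LIMIT 20;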

Performance

In this section, we discuss best practices related to performance.

Materialized views in data sharing environments

Materialized views (MVs) provide a powerful route to precompute complex aggregations for use cases where high throughput is needed, and you can directly share a materialized view object via data sharing as well.

For materialized views built on tables where there are frequent write operations, it’s ideal to create the materialized view object on the producer itself and share the view. This method gives us the opportunity to centralize the management of the view on the producer cluster itself.

For slowly changing data tables, you can share the table objects directly and build the materialized view on the shared objects directly on the consumer. This method gives us the flexibility of creating a customized view of data on each consumer according to your use case.

This can help optimize the block metadata download and caching times in the data sharing query lifecycle. This also helps in materialized view refreshes because, as of this writing, Amazon Redshift doesn’t support incremental refresh for MVs built on shared objects.
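
As a minimal consumer-side sketch of the slowly changing data pattern described above (the shared database, table, and column names are assumptions):

-- On the consumer: precompute a customized aggregation over a shared, slowly changing table
CREATE MATERIALIZED VIEW item_counts_by_category AS
SELECT i_category, COUNT(*) AS item_count
FROM sales_shared_db.public.item
GROUP BY i_category;

-- Refresh on your own schedule; refreshes of MVs built on shared objects are full, not incremental
REFRESH MATERIALIZED VIEW item_counts_by_category;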

Factors to consider when using cross-Region data sharing

Data sharing is supported even if the producer and consumer are in different Regions. There are a few differences we need to consider while implementing a share across Regions:

  • Consumer data reads are charged at $5/TB for cross-Region data shares, whereas data sharing within the same Region is free. For more information, refer to Managing cost control for cross-Region data sharing.
  • Performance will also vary compared to a same-Region data share because the block metadata exchange and data transfer between cross-Region shared clusters take more time due to network throughput.

Metadata access

There are many system views, such as SVV_ALL_TABLES, SVV_ALL_COLUMNS, and SVV_ALL_SCHEMAS, that help with fetching the list of shared objects a user has access to. Some of these views return all the objects from the database that you’re currently connected to, as well as objects from all the other databases that you have access to on the cluster, including external objects.

We suggest using very restrictive filtering while querying these views because a simple select * will result in an entire catalog read, which isn’t ideal. For example, take the following query:

select * from svv_all_tables;

This query will try to collect metadata for all the shared and local objects, making it very heavy in terms of metadata scans, especially for shared objects.

The following is a better query for achieving a similar result:

SELECT table_name,
       column_name,
       data_type
FROM svv_all_tables
WHERE table_name = <tablename>
  AND schema_name = <schemaname>
  AND database_name = <databasename>
ORDER BY ordinal_position;

This is a good practice to follow for all metadata views and tables; doing so allows seamless integration into several tools. You can also use the SVV_DATASHARE* system views to exclusively see shared object-related information.
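
For example (the share name here is an assumption):

-- List the inbound and outbound data shares known to this cluster
SELECT share_name, share_type FROM svv_datashares;

-- List the objects included in a specific data share
SELECT * FROM svv_datashare_objects WHERE share_name = 'sales_share';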

Producer/consumer dependencies

In this section, we discuss the dependencies between the producer and consumer.

Impact of the consumer on the producer

Queries on the consumer cluster will have no impact in terms of performance or activity on the producer cluster. This is why we can achieve true workload isolation using data sharing.

Encrypted producers and consumers

Data sharing seamlessly integrates even if both the producer and the consumer are encrypted using different AWS Key Management Service (AWS KMS) keys. There are sophisticated, highly secure key exchange protocols to facilitate this so you don’t have to worry about encryption at rest and other compliance dependencies. The only requirement is that both the producer and consumer have a homogeneous encryption configuration.

Data visibility and consistency

A data sharing query on the consumer can’t impact the transaction semantics on the producer. All the queries involving shared objects on the consumer cluster follow read-committed transaction consistency while checking for visible data for that transaction.

Maintenance

If there is a scheduled manual VACUUM operation in use for maintenance activities on the producer cluster on shared objects, you should use VACUUM recluster whenever possible. This is especially important for large objects because it has optimizations in terms of the number of data blocks the utility interacts with, which results in less block metadata churn compared to a full vacuum. This benefits the data sharing workloads by reducing the block metadata sync times.
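
A minimal sketch of that maintenance statement, assuming a large shared table named web_sales in the public schema of the producer:

-- Re-sort only the unsorted region of the table, which keeps block metadata churn low
VACUUM RECLUSTER public.web_sales;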

Add-ons

In this section, we discuss additional add-on features for data sharing in Amazon Redshift.

Real-time data analytics using Amazon Redshift streaming data

Amazon Redshift recently announced the preview for streaming ingestion using Amazon Kinesis Data Streams. This eliminates the need for staging the data and helps achieve low-latency data access. The data generated via streaming on the Amazon Redshift cluster is exposed using a materialized view. You can share this as any other materialized view via a data share and use it to set up low-latency shared data access across clusters in minutes.
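
As a hedged sketch (the share name and materialized view name are assumptions, and the streaming materialized view is assumed to already exist on the producer), sharing it looks like sharing any other object:

-- On the producer: add the streaming materialized view to a data share
ALTER DATASHARE streaming_share ADD TABLE public.mv_clickstream;

-- Grant the consumer namespace access to the share
GRANT USAGE ON DATASHARE streaming_share TO NAMESPACE '<consumer-namespace-id>';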

Amazon Redshift concurrency scaling to improve throughput

Amazon Redshift data sharing queries can use concurrency scaling to improve the overall throughput of the cluster. You can enable concurrency scaling on the consumer cluster for queues where you expect a heavy workload so that throughput is maintained when the cluster is under heavy load.

For more information about concurrency scaling, refer to Data sharing considerations in Amazon Redshift.

Amazon Redshift Serverless

Amazon Redshift Serverless is ready for data sharing out of the box. A serverless endpoint can also act as a producer or a consumer for a provisioned cluster. The following are the supported permutations with Amazon Redshift Serverless:

  • Serverless (producer) and provisioned (consumer)
  • Serverless (producer) and serverless (consumer)
  • Serverless (consumer) and provisioned (producer)

Conclusion

Amazon Redshift data sharing gives you the ability to fan out and scale complex workloads without worrying about workload isolation. However, like any system, not having the right optimization techniques in place could pose complex challenges in the long term as the systems grow in scale. Incorporating the best practices listed in this post presents a way to mitigate potential performance bottlenecks proactively in various areas.

Try data sharing today to unlock the full potential of Amazon Redshift, and please don’t hesitate to reach out to us in case of further questions or clarifications.


About the authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sai Teja Boddapati is a Database Engineer based out of Seattle. He works on solving complex database problems to contribute to building the most user friendly data warehouse available. In his spare time, he loves travelling, playing games and watching movies & documentaries.

Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/accelerate-snowflake-to-amazon-redshift-migration-using-aws-schema-conversion-tool/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics jobs. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

Several new features of Amazon Redshift address a wide range of analytics requirements and improve the performance of the cloud data warehouse.

To take advantage of these capabilities and future innovations, you must migrate from your existing data warehouse to Amazon Redshift.

In this post, we show you how to migrate your data warehouse schema from Snowflake to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT). AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. AWS SCT can also scan your application source code for embedded SQL statements and convert them as part of a database schema conversion project. During this process, AWS SCT performs cloud-native code optimization by converting legacy data warehouse functions to their equivalent AWS service, thereby helping you modernize the applications at the same time of database migration.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Configure your AWS SCT application.
  2. Analyze your source Snowflake schema.
  3. Convert your Snowflake schema to an Amazon Redshift schema.
  4. Deploy the schema to Amazon Redshift.

The following diagram illustrates the solution architecture.

Prerequisites

Before starting this walkthrough, you must have the required prerequisites in place.

Set up an AWS SCT project and extract the schema from the source

In this walkthrough, we use the Snowflake sample database TPCDS_SF10TCL as the source of the schema conversion.

To set up the database migration project, complete the following steps:

  1. Launch the AWS SCT application.
  2. On the File menu, choose New project wizard.
  3. Enter the project name and location.
  4. For Source engine, choose Snowflake.
  5. Choose Next.
  6. Provide the database information and credentials for Snowflake and choose Test Connection.
  7. When the connection is successful, choose Next.

For more information, see Using Snowflake as a source for AWS SCT.

  1. Expand SNOWFLAKE_SAMPLE_DATA and Schemas.
  2. Select TPCDS_SF10TCL and choose Next.

AWS SCT analyzes the schema and prepares an assessment report summary, as shown in the following screenshot.

This report summarizes the objects that AWS SCT converts to Amazon Redshift.

  1. Review the report and choose Next.
  2. Provide the database information and credentials for Amazon Redshift and deselect Use AWS Glue.
  3. Choose Test Connection.
  4. When the connection is successful, choose Finish.

For more information about establishing a connection to Amazon Redshift, see Getting the JDBC URL.

Review and apply the schema from Snowflake to Amazon Redshift

To convert the schema from Snowflake objects to Amazon Redshift, complete the following steps:

  1. Expand SNOWFLAKE_SAMPLE_DATA and Schemas.
  2. Choose (right-click) TPCDS_SF10TCL and choose Convert schema.
  3. Choose Collect and continue.

AWS SCT converts the schema and shows the converted schema and objects in the right pane. The converted schema is marked with a red check mark.

AWS SCT takes some actions automatically while converting the schema to Amazon Redshift; objects with such actions are marked with a red warning sign.

  1. Choose the object and then on the top menu, choose Main view and choose Assessment Report view.
  2. Choose the Action items tab.

You can see a list of all issues and actions taken by AWS SCT.

You can evaluate and inspect the individual object DDL by selecting it from the right pane, and you can also edit it as needed. In the following example, we modify the DISTKEY to use inv_item_sk. AWS SCT analyzes the tables and recommends the distribution and sort keys based on statistics. For cases where you’re not sure, you should set them to AUTO. For more information about automatic data distribution and optimization in Amazon Redshift, refer to Automate your Amazon Redshift performance tuning with automatic table optimization.
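
For illustration, a trimmed sketch of what an edited DDL might look like after changing the distribution key (the schema name and the shortened column list are assumptions based on the TPC-DS inventory table):

CREATE TABLE tpcds_sf10tcl.inventory (
    inv_date_sk INTEGER NOT NULL,
    inv_item_sk INTEGER NOT NULL,
    inv_warehouse_sk INTEGER NOT NULL,
    inv_quantity_on_hand INTEGER
)
DISTSTYLE KEY
DISTKEY (inv_item_sk)
SORTKEY AUTO;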

  1. To deploy the objects DDL to Amazon Redshift, select the converted schema in the right pane, right-click, and choose Apply to database.
  2. Optionally, if you want to export the copy of the DDLs generated by AWS SCT and apply them manually, you can select the Amazon Redshift converted schema, right-click, and choose Save as SQL.
  3. Log in to your Amazon Redshift cluster using the Query Editor V2.

For more information about Query Editor V2, refer to Introducing Amazon Redshift Query Editor V2, a Free Web-based Query Authoring Tool for Data Analysts.

  1. To verify that the converted schema objects are deployed in Amazon Redshift, select the desired table, right-click, and choose Show table definition to see the underlying DDL.
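
If you prefer SQL over the console, a quick sketch for the same verification (the schema and table names are assumptions):

-- Confirm the converted tables exist in the target schema
SELECT table_name
FROM svv_all_tables
WHERE schema_name = 'tpcds_sf10tcl'
ORDER BY table_name;

-- Show the full DDL of a single deployed table
SHOW TABLE tpcds_sf10tcl.inventory;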

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Redshift cluster created for this demonstration.
  2. If you were using an existing cluster, delete the new tables that you created as part of this exercise.
  3. Stop any Amazon Elastic Compute Cloud (Amazon EC2) instances that were created to run the AWS SCT application.

Summary

In this post, we showed how easy it is to convert a Snowflake schema to an Amazon Redshift schema and used AWS SCT for this automatic conversion.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

BP Yau is a Sr Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Shawn Sachdev is a Sr Analytics Specialist Solutions Architect at AWS. He works with customers and provides guidance to help them innovate and build well-architected and high-performance data warehouses and implement analytics at scale on the AWS platform. Before AWS, he has worked in several Analytics and System Engineering roles. Outside of work, he loves watching sports, and is an avid foodie and a craft beer enthusiast.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Accelerating Amazon Redshift federated query to Amazon Aurora MySQL with AWS CloudFormation

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/accelerating-amazon-redshift-federated-query-to-amazon-aurora-mysql-with-aws-cloudformation/

Amazon Redshift federated query allows you to combine data from one or more Amazon Relational Database Service (Amazon RDS) for MySQL and Amazon Aurora MySQL databases with data already in Amazon Redshift. You can also combine such data with data in an Amazon Simple Storage Service (Amazon S3) data lake.

This post shows you how to set up Aurora MySQL and Amazon Redshift with a TPC-DS dataset so you can take advantage of Amazon Redshift federated query using AWS CloudFormation. You can use the environment you set up in this post to experiment with various use cases in the post Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL.

Benefits of using CloudFormation templates

The standard workflow for setting up Amazon Redshift federated query involves six steps. For more information, see Querying data with federated queries in Amazon Redshift. With a CloudFormation template, you can condense these manual procedures into a few steps listed in a text file. The declarative code in the file captures the intended state of the resources that you want to create and allows you to automate the setup of AWS resources to support Amazon Redshift federated query. You can further enhance this template to become the single source of truth for your infrastructure.

A CloudFormation template acts as an accelerator. It helps you automate the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and accounts with the least amount of effort and time.

Architecture overview

The following diagram illustrates the solution architecture.

The CloudFormation template provisions the following components in the architecture:

  • VPC
  • Subnets
  • Route tables
  • Internet gateway
  • Amazon Linux bastion host
  • Secrets
  • Aurora MySQL cluster with TPC-DS dataset preloaded
  • Amazon Redshift cluster with TPC-DS dataset preloaded
  • Amazon Redshift IAM role with required permissions

Prerequisites

Before you create your resources in AWS CloudFormation, you must complete the required prerequisites.

Setting up resources with AWS CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.

To create your resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Launch Stack:
  4. Choose Next.

This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.

  1. For Stack name, enter a stack name.
  2. For Session, leave as the default.
  3. For ec2KeyPair, choose the key pair you created earlier.
  4. Choose Next.

  1. On the next screen, choose Next.
  2. Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
  3. Choose Create.

Stack creation can take up to 45 minutes.

  1. After the stack creation is complete, on the Outputs tab of the stack, record the value of the key for the following components, which you use in a later step:
  • AuroraClusterEndpoint
  • AuroraSecretArn
  • RedshiftClusterEndpoint
  • RedshiftClusterRoleArn

As of this writing, this feature is in public preview. You can create a snapshot of your Amazon Redshift cluster created by the stack and restore the snapshot as a new cluster in the sql_preview maintenance track with the same configuration.

You’re now ready to log in to both the Aurora MySQL and Amazon Redshift clusters and run some basic commands to test them.

Logging in to the clusters using the Amazon Linux bastion host

The following steps assume that you use a computer with an SSH client to connect to the bastion host. For more information about connecting using various clients, see Connect to your Linux instance.

  1. Move the private key of the EC2 key pair (that you saved previously) to a location on your SSH client, where you are connecting to the Amazon Linux bastion host.
  2. Change the permission of the private key using the following code, so that it’s not publicly viewable:
    chmod 400 <private key file name; for example, bastion-key.pem>

  1. On the Amazon EC2 console, choose Instances.
  2. Choose the Amazon Linux bastion host that the CloudFormation stack created.
  3. Choose Connect.
  4. Copy the value for SSHCommand.
  5. On the SSH client, change the directory to the location where you saved the EC2 private key, and enter the SSHCommand value.
  6. On the console, open the AWS Secrets Manager dashboard.
  7. Choose the secret secretAuroraMasterUser-*.
  8. Choose Retrieve secret value.
  9. Record the password under Secret key/value, which you use to log in to the Aurora MySQL cluster.
  10. Choose the secret SecretRedshiftMasterUser.
  11. Choose Retrieve secret value.
  12. Record the password under Secret key/value, which you use to log in to the Amazon Redshift cluster.
  13. Log in to both Aurora MySQL using the MySQL Command-Line Client and Amazon Redshift using query editor.

The CloudFormation template has already set up MySQL Command-Line Client binaries on the Amazon Linux bastion host.

  1. On the Amazon Redshift console, choose Editor.
  2. Choose Query editor.
  3. For Connection, choose Create new connection.
  4. For Cluster, choose the Amazon Redshift cluster.
  5. For Database name, enter your database.
  6. Enter the database user and password recorded earlier.
  7. Choose Connect to database.

  1. Enter the following SQL command:
    select "table" from svv_table_info where schema='public';

You should see 25 tables as the output.

  1. Launch a command prompt session of the bastion host and enter the following code (substitute <AuroraClusterEndpoint> with the value from the AWS CloudFormation output):
    mysql --host=<AuroraClusterEndpoint> --user=awsuser --password=<database user password recorded earlier>

  1. Enter the following SQL command:
    use tpc;
    show tables;
    

You should see the following eight tables as the output:

mysql> use tpc;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+------------------------+
| Tables_in_tpc          |
+------------------------+
| customer               |
| customer_address       |
| household_demographics |
| income_band            |
| item                   |
| promotion              |
| web_page               |
| web_sales              |
+------------------------+
8 rows in set (0.01 sec)

Completing federated query setup

The final step is to create an external schema to connect to the Aurora MySQL instance. The following example code creates an external schema statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS mysqlfq 
FROM MYSQL 
DATABASE 'tpc' 
URI '<AuroraClusterEndpoint>' 
PORT 3306 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – The AuroraClusterEndpoint value from the CloudFormation stack outputs. The value is in the format <stackname>-cluster.<randomcharacter>.us-east-1.rds.amazonaws.com.
  • IAM_Role – The RedshiftClusterRoleArn value from the CloudFormation stack outputs. The value is in the format arn:aws:iam::<accountnumber>:role/<stackname>-RedshiftClusterRole-<randomcharacter>.
  • Secret_ARN – The AuroraSecretArn value from the CloudFormation stack outputs. The value is in the format arn:aws:secretsmanager:us-east-1:<accountnumber>:secret:secretAuroraMasterUser-<randomcharacter>.

Federated query test

Now that you have set up federated query, you can start testing the feature using the TPC-DS dataset that was preloaded into both Aurora MySQL and Amazon Redshift.

For example, the following query aggregates the total net sales by product category and class from the web_sales fact table and date and item dimension tables. Tables web_sales and date are stored in Amazon Redshift, and the item table is stored in Aurora MySQL:

select
    sum(ws_net_paid) as total_sum,
    i_category,
    i_class,
    0 as g_category,
    0 as g_class
from
    web_sales, date_dim d1, mysqlfq.item
where
    d1.d_month_seq between 1205 and 1205 + 11
    and d1.d_date_sk = ws_sold_date_sk
    and i_item_sk = ws_item_sk
group by
    i_category, i_class;

You can continue to experiment with the dataset and explore the three main use cases in the post Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL.

Cleaning up

When you’re finished, delete the CloudFormation stack, because some of the AWS resources in this walkthrough incur a cost if you continue to use them. Complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose the stack you launched in this walkthrough. The stack must be currently running.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack.

Summary

This post showed you how to automate the creation of an Aurora MySQL and Amazon Redshift cluster preloaded with the TPC-DS dataset, the prerequisites for the new Amazon Redshift federated query feature using AWS CloudFormation, and a single manual step to complete the setup. It also provided an example federated query using the TPC-DS dataset, which you can use to accelerate your learning and adoption of the new feature. You can continue to modify the CloudFormation templates from this post to support your business needs.

If you have any questions or suggestions, please leave a comment.


About the Authors

BP Yau is an Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

 

Srikanth Sopirala is a Sr. Specialist Solutions Architect, Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.

 

 

 

Zhouyi Yang is a Software Development Engineer for Amazon Redshift Query Processing team. He’s passionate about gaining new knowledge about large databases and has worked on SQL language features such as federated query and IAM role privilege control. In his spare time, he enjoys swimming, tennis, and reading.

 

 

Entong Shen is a Senior Software Development Engineer for Amazon Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics, and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

 

Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-federated-querying-to-amazon-aurora-mysql-and-amazon-rds-for-mysql/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built analytics workloads using it. We’re always listening to your feedback and, in April 2020, we announced general availability for federated querying to Amazon Aurora PostgreSQL and Amazon Relational Database Service (Amazon RDS) for PostgreSQL to enable you to query data across your operational databases, your data warehouse, and your data lake to gain faster and deeper insights not possible otherwise.

Today, we’re launching a new feature of Amazon Redshift federated query to Amazon Aurora MySQL and Amazon RDS for MySQL to help you expand your operational databases in the MySQL family. With this lake house architecture expansion to support more operational data stores, you can query and combine data more easily in real time and store data in open file formats in your Amazon Simple Storage Service (Amazon S3) data lake. Your data can then be more available to other analytics and machine learning (ML) tools, rather than siloed in disparate data stores.

In this post, we share information about how to get started with this new federated query feature to MySQL.

Prerequisites

To try this new feature, create a new Amazon Redshift cluster in the sql_preview maintenance track and an Aurora MySQL instance, and load sample TPC data into both data stores. To make sure the Aurora MySQL DB instance can accept connections from the Amazon Redshift cluster, place both your Amazon Redshift cluster and Aurora MySQL instance in the same Amazon Virtual Private Cloud (Amazon VPC) and subnet group. This way, you can add the security group for the Amazon Redshift cluster to the inbound rules of the security group for the Aurora MySQL DB instance.

If your Amazon Redshift cluster and Aurora MySQL instances are in different VPCs, you can set up VPC peering or other networking to allow Amazon Redshift to make connections to your Aurora MySQL instances. For more information about VPC networking, see Working with a DB instance in a VPC.

Configuring AWS Secrets Manager for remote database credentials

Amazon Redshift needs database credentials to issue a federated query to a MySQL database. AWS Secrets Manager provides a centralized service to manage secrets and can be used to store your MySQL database credentials. Because Amazon Redshift retrieves and uses these credentials, they are transient, not stored in any generated code, and discarded after the query runs.

Storing credentials in Secrets Manager takes only a few minutes. To store a new secret, complete the following steps:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. For Select secret type, select Credentials for RDS database.
  4. For User name, enter a name.
  5. For Password, enter a password.
  6. For Select the encryption key, choose DefaultEncryptionKey.
  7. For Select which RDS database this secret will access, choose your database.

  1. Optionally, copy programmatic code for accessing your secret using your preferred programming languages (which is not needed for this post).
  2. Choose Next.

After you create the secret, you can retrieve the secret ARN by choosing the secret on the Secrets Manager console. The secret ARN is needed in the subsequent step.

Setting up IAM role

You can now pull everything together by embedding the secret ARN into an AWS Identity and Access Management (IAM) policy, naming the policy, and attaching it to an IAM role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "<SecretARN>"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

Finally, attach the same IAM role to your Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your cluster.
  3. On the Actions drop-down menu, choose Manage IAM roles.

  1. Choose and add the IAM role you just created.

Setting up external schema

The final step is to create an external schema to connect to your Aurora MySQL instance. The following example code creates the external schema statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS mysqlfq 
FROM MYSQL 
DATABASE 'tpc' 
URI '<AuroraClusterEndpoint>' 
PORT 3306 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – Aurora MySQL cluster endpoint
  • IAM_Role – IAM role created from the previous step
  • Secret_ARN – Secret ARN

After you set up the external schema, you’re ready to run some queries to test different use cases.

Querying live operational data

You can now query real-time operational data in your Aurora MySQL instance from Amazon Redshift. Note that isolation level is read committed for MySQL. See the following code:

dev=# select top 10 ws_order_number from mysqlfq.web_sales;
 ws_order_number 
-----------------
        93628990
       157020207
         4338647
        41395871
        58468186
       171095867
        12514566
        74946143
         3418243
        67054239
(10 rows)

Querying mysqlfq.web_sales in Amazon Redshift routes the request to MySQL tpc database and web_sales table. If you examine the query plan, you can see the query runs at the MySQL instance as shown by the step Remote MySQL Seq Scan:

dev=# explain select top 10 ws_order_number from mysqlfq.web_sales;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 XN Limit  (cost=0.00..0.20 rows=10 width=8)
   ->  XN MySQL Query Scan web_sales  (cost=0.00..6869.28 rows=343464 width=8)
         ->  Remote MySQL Seq Scan mysqlfq.web_sales  (cost=0.00..3434.64 rows=343464 width=8)
(3 rows)

Simplifying ELT and ETL

You can also extract operational data directly from your Aurora MySQL instance and load it into Amazon Redshift. See the following code:

dev=# create table staging_customer as select c_customer_id from mysqlfq.customer where c_customer_id not in (select c_customer_id from customer);
SELECT
dev=# select count(*) from staging_customer;
 count  
--------
 350000
(1 row)

The preceding code uses CTAS to create and load incremental data from your operational MySQL instance into a staging table in Amazon Redshift. You can then perform transformation and merge operations from the staging table to the target table. For more information, see Updating and inserting new data.
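
A minimal sketch of that merge step is shown below. It follows the standard staging-table upsert pattern and assumes the staging table was created with the full column list of the target table (the earlier CTAS pulled only c_customer_id, so adapt the column list to your case):

begin transaction;

-- Remove any target rows that are being replaced by the incoming staging data
delete from customer
using staging_customer
where customer.c_customer_id = staging_customer.c_customer_id;

-- Insert the new and changed rows from the staging table
insert into customer
select * from staging_customer;

end transaction;

-- Drop the staging table when the merge is complete
drop table staging_customer;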

Combining operational data with data from your data warehouse and data lake

You can combine live operational data from your Aurora MySQL instance with data from your Amazon Redshift data warehouse and S3 data lake by creating a late binding view.

To access your S3 data lake historical data via Amazon Redshift Spectrum, create an external table:

create external schema mysqlspectrum
from data catalog
database 'spectrumdb'
iam_role '<IAMRole>'
create external database if not exists;
 
create external table mysqlspectrum.customer 
stored as parquet 
location 's3://<yourS3bucket>/customer/'
as select * from customer where c_customer_sk <= 100000;

You can then run queries on the view to gain insight on data across the three sources:

drop view vwCustomer;
create view vwCustomer as
select c_customer_sk, 'redshift' as source from public.customer where c_customer_sk > 100000
union all
select c_customer_sk, 'mysql' as source from mysqlfq.customer
union all
select c_customer_sk, 's3' as source from mysqlspectrum.customer
with no schema binding;

select * from vwCustomer where c_customer_sk in (1, 149712,29033279);

You should see the following three records as output:

dev=# select * from vwCustomer where c_customer_sk in (1, 149712,29033279);
 c_customer_sk |  source  
---------------+----------
      29033279 | mysql
             1 | s3
        149712 | redshift
(3 rows)

If you examine the query plan, you can see that the predicates are pushed down to your MySQL instance to run:

dev=# explain select * from vwCustomer where c_customer_sk in (1,149712,29033279);
                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan vwcustomer  (cost=0.00..48398.40 rows=6988 width=36)
   ->  XN Append  (cost=0.00..48328.52 rows=6988 width=4)
         ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..40000.03 rows=3 width=4)
               ->  XN Seq Scan on customer  (cost=0.00..40000.00 rows=3 width=4)
                     Filter: (((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279)) AND (c_customer_sk > 100000))
         ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..6548.63 rows=5492 width=4)
               ->  XN MySQL Query Scan customer  (cost=0.00..6493.71 rows=5492 width=4)
                     ->  Remote MySQL Seq Scan mysqlfq.customer  (cost=0.00..6438.79 rows=5492 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
         ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..1779.86 rows=1493 width=4)
               ->  XN S3 Query Scan customer  (cost=0.00..1764.93 rows=1493 width=4)
                     ->  S3 Seq Scan mysqlspectrum.customer location:"s3://<yourS3bucket>/customer" format:PARQUET  (cost=0.00..1750.00 rows=1493 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
(13 rows)
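Beyond point lookups, you can also run aggregations that span all three sources through the same view. For example, the following hypothetical query counts customers per source:

select source, count(*) as customer_count
from vwCustomer
group by source
order by source;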

Available Now

Amazon Redshift federated querying to Aurora MySQL and Amazon RDS for MySQL is now available for public preview with Amazon Redshift release version 1.0.21591 or later. Refer to the AWS Region Table for Amazon Redshift availability, and check the version of your clusters to confirm they meet the minimum release.


About the Authors

BP Yau is an Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next-generation big data analytics platform using AWS technologies.

 

Zhouyi Yang is a Software Development Engineer for the Amazon Redshift Query Processing team. He’s passionate about gaining new knowledge about large databases and has worked on SQL language features such as federated query and IAM role privilege control. In his spare time, he enjoys swimming, tennis, and reading.

 

 

Entong Shen is a Senior Software Development Engineer for Amazon Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics, and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

Introducing Amazon Redshift RA3.xlplus nodes with managed storage

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/introducing-amazon-redshift-ra3-xlplus-nodes-with-managed-storage/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built analytics workloads using it. We’re always listening to your feedback and, in December 2019, we announced our third-generation RA3 node type to provide you the ability to scale and pay for compute and storage independently. In this post, I share more about RA3, including a new smaller size node type, and information about how to get started.

RA3 nodes in Amazon Redshift

The new RA3 nodes let you determine how much compute capacity you need to support your workload and then scale the amount of storage based on your needs. The managed storage of Amazon Redshift automatically uses high-performance, SSD-based local storage as its tier-1 cache. The managed storage takes advantage of optimizations such as data block temperature, data block age, and workload patterns to optimize Amazon Redshift performance and manage data placement across tiers of storage automatically. No action or changes to existing workflows are necessary on your part.

The first member of the RA3 family was the RA3.16xlarge, and we subsequently added the RA3.4xlarge to cater to the needs of customers with a large number of workloads.

We’re now adding a new smaller member of the RA3 family, the RA3.xlplus.

This allows Amazon Redshift to deliver up to three times better price performance than other cloud data warehouses. For existing Amazon Redshift customers using DS2 instances, you get up to two times better performance and double the storage at the same cost when you upgrade to RA3. RA3 also includes AQUA (Advanced Query Accelerator) for Amazon Redshift at no additional cost. AQUA is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to ten times faster than other cloud data warehouses by automatically boosting certain types of queries. The preview is open to all customers now, and it will be generally available in January 2021.

RA3 nodes with managed storage are a great fit for analytics workloads that require best price per performance, with massive storage capacity and the ability to scale and pay for compute and storage independently. In the past, there was pressure to offload or archive old data to other storage because of fixed storage limits, which made maintaining the operational analytics dataset and querying the larger historical dataset when needed difficult. With Amazon Redshift managed storage, we’re meeting the needs of customers that want to store more data in their data warehouse.

The new RA3.xlplus node provides 4 vCPUs and 32 GiB of RAM, and addresses up to 32 TB of managed storage. A cluster with the RA3.xlplus node type can contain up to 32 of these nodes, for a total managed storage capacity of 1,024 TB (that's 1 petabyte!). With the new smaller RA3.xlplus instance type, it's even easier to get started with Amazon Redshift.

The differences between RA3 nodes are summarized in the following table.

 Node type     vCPU   Memory    Storage quota   I/O           Price (US East (N. Virginia))
 ra3.xlplus    4      32 GiB    32 TB RMS       0.65 GB/sec   $1.086 per hour
 ra3.4xlarge   12     96 GiB    64 TB RMS       2 GB/sec      $3.26 per hour
 ra3.16xlarge  48     384 GiB   64 TB RMS       8 GB/sec      $13.04 per hour

Creating a new Amazon Redshift cluster

You can create a new cluster on the Amazon Redshift console or the AWS Command Line Interface (AWS CLI). For this post, I walk you through using the Amazon Redshift console.

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose Create cluster.
  3. For Choose the size of the cluster, choose I’ll choose.
  4. For Node type, select your preferred node type (for this post, we select xlplus).

The following screenshot shows the Cluster configuration page for the US East (N. Virginia) Region. The price may vary slightly in different Regions.

If you have a DS2 or DC2 instance-based cluster, you can create an RA3 cluster to evaluate the new instance with managed storage. You use a recent snapshot of your Amazon Redshift DS2 or DC2 cluster to create a new cluster based on RA3 instances. We recommend using 2 nodes of RA3.xlplus for every 3 nodes of DS2.xl or 3 nodes of RA3.xlplus for every 8 nodes of DC2.large. For more information about upgrading sizes, see Upgrading to RA3 node types. You can always adjust the compute capacity by adding or removing nodes with elastic resize.

If you’re migrating to Amazon Redshift from on-premises data warehouses, you can size your Amazon Redshift cluster using the sizing widget on the Amazon Redshift console. On the Cluster configuration page, for Choose the size of the cluster, choose Help me choose.

Answer the subsequent questions on total storage size and your data access pattern in order to size your cluster's compute and storage resources.

The sizing widget recommends the cluster configuration in the Calculated configuration summary section.

Conclusion

RA3 instances are now available in 16 AWS Regions. For the latest RA3 node type availability, see RA3 node type availability in AWS Regions.

The price varies from Region to Region, starting at $1.086 per hour per node in US East (N. Virginia). For more information, see Amazon Redshift pricing.


About the Author

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Federated API access to Amazon Redshift using an Amazon Redshift connector for Python

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/federated-api-access-to-amazon-redshift-using-an-amazon-redshift-connector-for-python/

Amazon Redshift is the leading cloud data warehouse, delivering 10 times faster performance at one-tenth the cost of traditional data warehouses by using massively parallel query execution, columnar storage on high-performance disks, and results caching. You can confidently run mission-critical workloads, even in highly regulated industries, because Amazon Redshift comes with out-of-the-box security and compliance.

You can use your corporate identity providers (IdPs), for example Azure AD, Active Directory Federation Services, Okta, or Ping Federate, with Amazon Redshift to provide single sign-on (SSO) to your users so they can use their IdP accounts to log in and access Amazon Redshift. With federation, you can centralize management and governance of authentication and permissions. For more information about the federation workflow using AWS Identity and Access Management (IAM) and an identity provider, see Federate Database User Authentication Easily with IAM and Amazon Redshift. For this post, we use Okta as our IdP.

If you have users who run Python on a computer, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance, you generally use an IAM role attached to the EC2 instance for federated access. However, every user who logs in to the EC2 instance assumes the same IAM role to federate. The Amazon Redshift connector for Python enables IdP authentication for each individual user who logs in to the computer. This post shows you how to use the Amazon Redshift connector for Python and Okta to enable federated SSO into Amazon Redshift and query your data warehouse using a Python script.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Configure your Okta IdP, including users, groups, and SAML application.
  2. Configure your AWS service provider, including IdP, IAM policy, and IAM role.
  3. Set up your Amazon Redshift connector.
  4. Check your Amazon Redshift users.
  5. Configure the Python script and Amazon Redshift connector to use your enterprise credentials and sign in to Amazon Redshift to run queries.

The process flow for federated authentication includes the following steps:

  1. The user logs in from a Python script using an Amazon Redshift connector.
  2. The IdP authenticates using the corporate user name and password, and returns a SAML assertion.
  3. The client uses AWS SDK to call AWS Security Token Service (AWS STS) to assume a role with SAML.
  4. AWS STS returns temporary AWS credentials.
  5. The client uses the temporary AWS credentials to connect to Amazon Redshift.

The following diagram illustrates this process flow.

Prerequisites

Before starting this walkthrough, you must have the following:

  • An Amazon Redshift cluster. The US East (N. Virginia) Region is preferred because you need to load data from Amazon Simple Storage Service (Amazon S3) in us-east-1.
  • A database user with superuser privileges.
  • Python 3.5 or above.
  • Pip.

Configuring your IdP (Okta)

For instructions on setting up your IdP, see steps 1–4 in Federate Amazon Redshift access with Okta as an identity provider.

For step 2, create a user called [email protected] and add it to the sales group. This is the example user that you use in this post. The following screenshot shows your example users.

When the setup is complete, you should see settings similar to the following screenshots. The following screenshot shows your general SAML settings.

The following screenshot shows your attribute statements.

The following screenshot shows the settings for your application embed link, login page, and access error page.

For instructions on setting up advanced IdP configurations, see steps 8–9 in Federate Amazon Redshift access with Okta as an identity provider.

Configuring your service provider (AWS)

To set up the service provider, complete steps 5–7 from Federate Amazon Redshift access with Okta as an identity provider.

For step 7, the Amazon Redshift connector needs the following additional permission:

{
            "Sid": "DescribeClusters",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
}

The following code shows the example policy used:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetClusterCredsStatement",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:*:*:cluster:<cluster>",
                "arn:aws:redshift:*:*:dbname:<cluster>/<databasename>",
                "arn:aws:redshift:*:*:dbuser:<cluster>/${redshift:DbUser}"
            ]
        },
        {
            "Sid": "DescribeClusters",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
        },
        {
            "Sid": "CreateClusterUserStatement",
            "Effect": "Allow",
            "Action": "redshift:CreateClusterUser",
            "Resource": [
                "arn:aws:redshift:*:*:dbname:<cluster>/<databasenmae>",
                "arn:aws:redshift:*:*:dbuser:<cluster>/${redshift:DbUser}"
            ]
        },
        {
            "Sid": "RedshiftJoinGroupStatement",
            "Effect": "Allow",
            "Action": "redshift:JoinGroup",
            "Resource": [
                "arn:aws:redshift:*:*:dbgroup:<cluster>/sales"
            ]
        }
    ]
}

When the setup is complete, you should see an IAM role setup similar to the following screenshot.

Setting up your Amazon Redshift connector

The easiest way to get started with redshift_connector is via pip. See the following code:

pip install redshift_connector

After running the command, you see a message indicating redshift_connector was installed successfully:

Successfully installed redshift_connector

Checking Amazon Redshift users

To check your users, connect to your Amazon Redshift cluster using your preferred SQL client as a superuser and enter the following code:

select * from pg_user where usename = '[email protected]';

The query should return zero records, which shows that the corporate user [email protected] doesn’t exist yet in Amazon Redshift. See the following code:

dev=# select * from pg_user where usename = '[email protected]'; 
 usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig 
---------+----------+-------------+----------+-----------+--------+----------+-----------
(0 rows)

Testing the Python script

The next step is to run an example Python script that uses federation to connect to your Amazon Redshift cluster and run some queries. See the following code:

import redshift_connector
import numpy
 
# Connects to Redshift cluster using Okta MFA Browser Plugin
conn = redshift_connector.connect(
    iam=True,
    ssl=True,
    host='<RedshiftClusterEndpoint>',
    port=<RedshiftClusterPortNumber>,
    database='<RedshiftDatabase>',
    db_user='[email protected]',
    cluster_identifier='<RedshiftClusterIdentifier>',
    region='<Region>',
    login_url='<IdPLoginUrl>',
    credentials_provider='BrowserSamlCredentialsProvider',
    user='',
    password=''
)
 
cursor: redshift_connector.Cursor = conn.cursor()
cursor.execute("select current_user")
result: numpy.ndarray = cursor.fetch_numpy_array()
print(result)
 
cursor.execute("create Temp table book(bookname varchar,author‎ varchar)")
cursor.executemany("insert into book (bookname, author‎) values (%s, %s)", [('One Hundred Years of Solitude', 'Gabriel García Márquez'),('A Brief History of Time', 'Stephen Hawking')])
cursor.execute("select * from book")
 
result: numpy.ndarray = cursor.fetch_numpy_array()
print(result)
cursor.execute("drop table book")
conn.commit()
 
conn.close()

Use the following parameters:

  • RedshiftClusterEndpoint – The Amazon Redshift cluster endpoint, without the port or database name
  • RedshiftClusterPortNumber – The Amazon Redshift port number
  • RedshiftDatabase – The Amazon Redshift database name
  • RedshiftClusterIdentifier – The Amazon Redshift cluster identifier
  • Region – The Region where the cluster is
  • IdPLoginUrl – The IdP (Okta) sign-in URL, which you can get from the embed link in the Okta SAML application properties

After you run the Python script, the Okta sign-in page opens. Sign in with [email protected].

Switch back to your Python environment; it should show the current user, followed by the two records from the book table:

$ python3 redshift_connector_okta.py
[['[email protected]']]
[['One Hundred Years of Solitude' 'Gabriel García Márquez']
 ['A Brief History of Time' 'Stephen Hawking']]

[email protected] is returned, showing that the Python script ran under the context of this federated user.

You may now close your browser window.

To check your users, connect to your Amazon Redshift cluster using your preferred SQL client as a superuser and enter the following code:

select * from pg_user where usename = '[email protected]';

The query should return one record, which shows that the database user is automatically created based on the IdP corporate user:

dev=# select * from pg_user where usename = '[email protected]';
      usename      | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig 
-------------------+----------+-------------+----------+-----------+----------+----------+-----------
 [email protected] |      116 | f           | f        | f         | ******** |          | 
(1 row)
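Because the IAM policy allows redshift:JoinGroup on the sales database group (and the Okta SAML attribute statements map the user's group membership), the federated user also joins the sales group at sign-in, so you can manage permissions at the group level rather than per user. The following is a minimal sketch, assuming the sales group should have read access to the tables in the public schema:

grant usage on schema public to group sales;
grant select on all tables in schema public to group sales;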

Summary

Amazon Redshift supports stringent compliance and security requirements with no extra cost, which makes it ideal for highly regulated industries. With federation, you can centralize management and governance of authentication and permissions by managing users and groups within the enterprise IdP and use them to authenticate to Amazon Redshift. SSO enables users to have a seamless user experience while accessing various applications in the organization.

This post showed you how to set up an Amazon Redshift connector and use Okta as your IdP to enable federated SSO to an Amazon Redshift cluster from a Python script. If you have users such as data scientists that heavily rely on Python scripts, you can follow these steps to set up federated SSO for your organization and manage access privileges to your Amazon Redshift cluster.

If you have any questions or suggestions, please leave a comment or contribute to the project.


About the Authors


BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 


Brooke White is a Software Development Engineer at AWS. She enables customers to get the most out of their data through her work on Amazon Redshift drivers. Prior to AWS, she built ETL pipelines and analytics APIs at a San Francisco Bay Area startup.

 

 


Ilesh Garish is a Software Development Engineer at AWS. His role is to develop connectors for Amazon Redshift. Prior to AWS, he built database drivers for the Oracle RDBMS, TigerLogic XDMS, and OpenAccess SDK. He worked in the database internal technologies at San Francisco Bay Area startups.